In a Ceph cluster with a single pool of 1024 Placement Groups (PGs), the PGs are not distributed among the devices in proportion to their weights (see Predicting Ceph PG placement for details about this uneven distribution).
In the example below, the difference between the expected number of PGs on a given host or device and the actual number can be as high as 25%. For instance, device4 has a weight of 1 and is expected to get 128 PGs, but it only gets 96. And host2 (which contains device4) gets 83 fewer PGs than expected.
             expected   actual   delta      delta%   weight
dc1              1024     1024       0    0.000000      1.0
  host0           256      294      38   14.843750      2.0
    device0       128      153      25   19.531250      1.0
    device1       128      141      13   10.156250      1.0
  host1           256      301      45   17.578125      2.0
    device2       128      157      29   22.656250      1.0
    device3       128      144      16   12.500000      1.0
  host2           512      429     -83  -16.210938      4.0
    device4       128       96     -32  -25.000000      1.0
    device5       128      117     -11   -8.593750      1.0
    device6       256      216     -40  -15.625000      2.0
Using minimization methods, it is possible to find alternate weights for the hosts and devices that significantly reduce the uneven distribution, as shown below (a sketch of such an optimization loop follows the table).
             expected   actual   delta      delta%     weight
dc1              1024     1024       0    0.000000   1.000000
  host0           256      259       3    1.171875   0.680399
    device0       128      129       1    0.781250   0.950001
    device1       128      130       2    1.562500   1.050000
  host1           256      258       2    0.781250   0.646225
    device2       128      129       1    0.781250   0.867188
    device3       128      129       1    0.781250   1.134375
  host2           512      507      -5   -0.976562   7.879046
    device4       128      126      -2   -1.562500   1.111111
    device5       128      127      -1   -0.781250   0.933333
    device6       256      254      -2   -0.781250   1.955555
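The weights above were obtained by optimizing against the actual CRUSH mapping. As a rough illustration of the idea only, the following Python sketch replaces CRUSH with a simple weighted-hash placement, flattens the hierarchy to devices only, and iteratively corrects each weight by the ratio of expected to observed PG counts. The placement function and the correction loop are assumptions made for this sketch, not the Ceph implementation; only the device names and target weights come from the example above.

# A minimal sketch of the optimization idea, not the Ceph implementation:
# placement() is a stand-in for CRUSH (a deterministic weighted-hash draw),
# and the loop nudges each weight until the simulated PG counts get close
# to the expected counts.
import hashlib

def placement(weights, pgs=1024):
    # Map each PG to the device with the highest weighted hash score
    # (weighted sampling via u ** (1 / weight), so a device wins with
    # probability proportional to its weight).
    counts = dict.fromkeys(weights, 0)
    for pg in range(pgs):
        def score(dev):
            h = int(hashlib.md5(f"{pg}:{dev}".encode()).hexdigest(), 16)
            return (h / 2.0 ** 128) ** (1.0 / weights[dev])
        counts[max(weights, key=score)] += 1
    return counts

# Target weights of the devices in the example above (device6 has weight 2).
target = {f"device{i}": w for i, w in enumerate([1, 1, 1, 1, 1, 1, 2])}
total = sum(target.values())
expected = {d: 1024.0 * w / total for d, w in target.items()}

weights = dict(target)          # start the search from the target weights
for _ in range(20):             # a fixed number of correction passes
    actual = placement(weights)
    # Multiply each weight by expected/actual: over-filled devices shrink,
    # under-filled devices grow (a crude but effective minimization step).
    weights = {d: weights[d] * expected[d] / max(actual[d], 1) for d in weights}

print({d: round(w, 3) for d, w in weights.items()})
print(placement(weights))       # counts should now be close to `expected`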
For each host or device, we now have two weights. The first is the target weight, which is used to calculate the expected number of PGs. For instance, host0 should get 1024 * (target weight host0 = 2 / (target weight host0 = 2 + target weight host1 = 2 + target weight host2 = 4)) PGs, that is 1024 * ( 2 / ( 2 + 2 + 4 ) ) = 256. The second is the weight set, which is used to choose where a PG is mapped.
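As a quick check of that arithmetic, the expected counts in the tables above follow directly from each host's share of the total target weight:

# Expected PGs per host: pool PGs times the host's share of the total
# target weight (values taken from the example above).
pool_pgs = 1024
target = {"host0": 2.0, "host1": 2.0, "host2": 4.0}
total = sum(target.values())                               # 8.0
expected = {h: pool_pgs * w / total for h, w in target.items()}
print(expected)   # {'host0': 256.0, 'host1': 256.0, 'host2': 512.0}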
The problem is that the crushmap format only allows one weight per host or device. Although it would be possible to store the target weight outside of the crushmap, it would be inconvenient and difficult to understand. Instead, the crushmap syntax was extended so that additional weights (a weight set) can be associated with each item (i.e. each device or host in the table below); a sketch of the information this adds to each bucket follows the table.
             expected   actual   delta      delta%   target weight   weight set
dc1              1024     1024       0    0.000000             1.0     1.000000
  host0           256      294      38   14.843750             2.0     0.680399
    device0       128      153      25   19.531250             1.0     0.950001
    device1       128      141      13   10.156250             1.0     1.050000
  host1           256      301      45   17.578125             2.0     0.646225
    device2       128      157      29   22.656250             1.0     0.867188
    device3       128      144      16   12.500000             1.0     1.134375
  host2           512      429     -83  -16.210938             4.0     7.879046
    device4       128       96     -32  -25.000000             1.0     1.111111
    device5       128      117     -11   -8.593750             1.0     0.933333
    device6       256      216     -40  -15.625000             2.0     1.955555
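The Python fragment below is a hedged sketch of the information the extended crushmap carries, not its actual grammar: every bucket keeps its regular item weights (the target weights) plus a parallel weight set that is only consulted when mapping a PG. The bucket layout and the helper function are assumptions for illustration; the numeric values are the ones from the table above.

# A hedged sketch of what the extended crushmap stores, not its grammar:
# each bucket keeps its regular item weights (target) plus a parallel
# weight set used only when mapping a PG.
buckets = {
    "dc1":   {"items":      ["host0",   "host1",   "host2"],
              "weights":    [2.0,       2.0,       4.0],        # target
              "weight_set": [0.680399,  0.646225,  7.879046]},  # mapping
    "host0": {"items":      ["device0", "device1"],
              "weights":    [1.0,       1.0],
              "weight_set": [0.950001,  1.050000]},
    "host1": {"items":      ["device2", "device3"],
              "weights":    [1.0,       1.0],
              "weight_set": [0.867188,  1.134375]},
    "host2": {"items":      ["device4", "device5", "device6"],
              "weights":    [1.0,       1.0,       2.0],
              "weight_set": [1.111111,  0.933333,  1.955555]},
}

def weights_for(bucket, use_weight_set):
    # The expected-PG calculation reads "weights"; the PG mapping would
    # read "weight_set" when one is present.
    b = buckets[bucket]
    return dict(zip(b["items"], b["weight_set" if use_weight_set else "weights"]))

print(weights_for("dc1", use_weight_set=False))   # capacity planning
print(weights_for("dc1", use_weight_set=True))    # placement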