Improving PGs distribution with CRUSH weight sets

In a Ceph cluster with a single pool of 1024 Placement Groups (PGs), the PG distribution among devices will not be as even as expected (see Predicting Ceph PG placement for details about this uneven distribution).

In the following example, the difference between the expected number of PGs on a given host or device and the actual number of PGs can be as high as 25%. For instance, device4 has a weight of 1 and is expected to get 128 PGs, but it only gets 96. And host2 (which contains device4) has 83 fewer PGs than expected.

         ~expected~  ~actual~  ~delta~   ~delta%~  ~weight~
dc1            1024      1024        0   0.000000       1.0
 host0          256       294       38  14.843750       2.0
  device0       128       153       25  19.531250       1.0
  device1       128       141       13  10.156250       1.0
 host1          256       301       45  17.578125       2.0
  device2       128       157       29  22.656250       1.0
  device3       128       144       16  12.500000       1.0
 host2          512       429      -83 -16.210938       4.0
  device4       128        96      -32 -25.000000       1.0
  device5       128       117      -11  -8.593750       1.0
  device6       256       216      -40 -15.625000       2.0
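
The unevenness is easy to reproduce with a minimal simulation. The sketch below is illustrative, not the actual Ceph implementation: it flattens the hierarchy to a single level and uses sha256 as a stand-in for the Jenkins hash CRUSH really uses, so the exact counts differ from the table above, but the spread is similar. Each PG goes to the device with the highest straw2-style score:

import hashlib, math

def score(pg, item, weight):
    """Pseudo-random score for (pg, item), scaled by the item's weight."""
    h = hashlib.sha256(f"{pg}:{item}".encode()).digest()
    u = (int.from_bytes(h[:8], "big") + 1) / 2**64   # uniform in (0, 1]
    return math.log(u) / weight                      # straw2-style draw

def place(pg, items):
    """Map a PG to the item with the highest score, as straw2 does."""
    return max(items, key=lambda i: score(pg, i, items[i]))

devices = {"device0": 1.0, "device1": 1.0, "device2": 1.0, "device3": 1.0,
           "device4": 1.0, "device5": 1.0, "device6": 2.0}
counts = dict.fromkeys(devices, 0)
for pg in range(1024):
    counts[place(pg, devices)] += 1

total = sum(devices.values())
for d, w in devices.items():
    print(f"{d}: expected {1024 * w / total:5.1f}  actual {counts[d]}")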

Using minimization methods, it is possible to find alternate weights for hosts and devices that significantly reduce the uneven distribution, as shown below.

         ~expected~  ~actual~  ~delta~  ~delta%~  ~weight~
dc1            1024      1024        0  0.000000  1.000000
 host0          256       259        3  1.171875  0.680399
  device0       128       129        1  0.781250  0.950001
  device1       128       130        2  1.562500  1.050000
 host1          256       258        2  0.781250  0.646225
  device2       128       129        1  0.781250  0.867188
  device3       128       129        1  0.781250  1.134375
 host2          512       507       -5 -0.976562  7.879046
  device4       128       126       -2 -1.562500  1.111111
  device5       128       127       -1 -0.781250  0.933333
  device6       256       254       -2 -0.781250  1.955555
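
One possible minimization is a simple fixed-point iteration: simulate the placement, rescale each weight by expected/actual, and repeat. The sketch below reuses the straw2-style draw from the previous snippet; it is not necessarily the exact method used to produce the table above, and any numerical minimizer could be substituted.

import hashlib, math

def place(pg, weights):
    def score(item, weight):
        h = hashlib.sha256(f"{pg}:{item}".encode()).digest()
        u = (int.from_bytes(h[:8], "big") + 1) / 2**64   # uniform in (0, 1]
        return math.log(u) / weight                      # straw2-style draw
    return max(weights, key=lambda i: score(i, weights[i]))

def simulate(weights, pgs=1024):
    counts = dict.fromkeys(weights, 0)
    for pg in range(pgs):
        counts[place(pg, weights)] += 1
    return counts

target = {"device0": 1.0, "device1": 1.0, "device2": 1.0, "device3": 1.0,
          "device4": 1.0, "device5": 1.0, "device6": 2.0}
expected = {d: 1024 * w / sum(target.values()) for d, w in target.items()}

weights = dict(target)                       # start from the target weights
for step in range(50):
    counts = simulate(weights)
    for d in weights:                        # nudge over-full items down and
        weights[d] *= expected[d] / max(counts[d], 1)    # under-full items up

print(simulate(weights))                     # counts now close to expected
print({d: round(w, 6) for d, w in weights.items()})      # the weight set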

For each host or device, we now have two weights. First, there is the target weight, which is used to calculate the expected number of PGs. For instance, host0 has a target weight of 2, host1 has a target weight of 2 and host2 has a target weight of 4, so host0 should get 1024 * ( 2 / ( 2 + 2 + 4 ) ) = 256 PGs. The second weight is the weight set, which is used to choose where a PG is mapped.
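
As a quick check, the expected PG count of each host follows directly from its share of the total target weight:

targets = {"host0": 2.0, "host1": 2.0, "host2": 4.0}
total = sum(targets.values())              # 8.0
for host, weight in targets.items():
    print(host, 1024 * weight / total)     # host0 256.0, host1 256.0, host2 512.0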

The problem is that the crushmap format only allows one weight per host or device. Although it would be possible to store the target weight outside of the crushmap, it would be inconvenient and difficult to understand. Instead, the crushmap syntax was extended to allow additional weights (via weight set) to be associated with each item (i.e. a device or host in the example below).

                                                    target
         ~expected~  ~actual~  ~delta~   ~delta%~  ~weight~ ~weight set~
dc1            1024      1024        0   0.000000       1.0     1.000000
 host0          256       294       38  14.843750       2.0     0.680399
  device0       128       153       25  19.531250       1.0     0.950001
  device1       128       141       13  10.156250       1.0     1.050000
 host1          256       301       45  17.578125       2.0     0.646225
  device2       128       157       29  22.656250       1.0     0.867188
  device3       128       144       16  12.500000       1.0     1.134375
 host2          512       429      -83 -16.210938       4.0     7.879046
  device4       128        96      -32 -25.000000       1.0     1.111111
  device5       128       117      -11  -8.593750       1.0     0.933333
  device6       256       216      -40 -15.625000       2.0     1.955555
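
For illustration, here is a sketch of what such a weight set can look like in a decompiled crushmap, using the choose_args extension. The bucket_id value is hypothetical and the exact syntax may vary between Ceph releases; the two weights shown are the weight set values of device0 and device1 inside host0 from the table above:

choose_args 0 {
    {
        bucket_id -2
        weight_set [
            [ 0.950001 1.050000 ]
        ]
    }
}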


Faster Ceph CRUSH computation with smaller buckets

The CRUSH function maps Ceph placement groups (PGs) and objects to OSDs. It is used extensively in Ceph clients and daemons, as well as in the Linux kernel modules, and its CPU cost should be kept to a minimum.

It is common to define the Ceph CRUSH map so that PGs use OSDs on different hosts. The hierarchy can be as simple as 100 hosts, each containing 7 OSDs. In this scenario, when determining which OSDs belong to a given PG, the CRUSH function hashes the PG against each of the 100 hosts, compares the results, and selects the host that scores the highest; it then repeats the process with the 7 OSDs of the selected host. That is 100 + 7 calls to the hash function.

If the hierarchy is modified to have 20 racks, each containing 5 hosts with 7 OSDs per host, only 20 + 5 + 7 calls to the hash function are needed: the first 20 select the rack, the next 5 select the host within the rack, and the last 7 select the OSD within the host.

In other words, this hierarchical subdivision of the CRUSH map reduces the number of calls to the hash function from 107 to 32. It does not change how evenly the PGs are distributed among OSDs, but it saves CPU.
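
The accounting is easy to verify with a sketch: selecting one item from a bucket of n children costs n hash calls in a straw2-style draw, so the flat map costs 100 + 7 calls per replica while the rack/host/OSD map costs 20 + 5 + 7. The bucket sizes below mirror the example; the hash and the call counter are illustrative, not the actual Ceph implementation.

import hashlib

CALLS = 0

def score(pg, item):
    global CALLS
    CALLS += 1                                   # one hash call per child
    h = hashlib.sha256(f"{pg}:{item}".encode()).digest()
    return (int.from_bytes(h[:8], "big") + 1) / 2**64

def choose(pg, children):
    """Pick the child with the best score, hashing every child once."""
    return max(children, key=lambda c: score(pg, c))

def flat(pg):
    host = choose(pg, [f"host{i}" for i in range(100)])
    return choose(pg, [f"{host}/osd{i}" for i in range(7)])

def nested(pg):
    rack = choose(pg, [f"rack{i}" for i in range(20)])
    host = choose(pg, [f"{rack}/host{i}" for i in range(5)])
    return choose(pg, [f"{host}/osd{i}" for i in range(7)])

CALLS = 0
flat(0)
print("flat map:  ", CALLS, "hash calls")        # 100 + 7 = 107
CALLS = 0
nested(0)
print("nested map:", CALLS, "hash calls")        # 20 + 5 + 7 = 32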