Predicting Ceph PG placement

When creating a new Ceph pool, deciding on the number of PGs requires some thought to ensure there are a few hundred PGs per OSD. The distribution can be verified with crush analyze as follows:

$ crush analyze --rule data --type device \
                --replication-count 2 \
                --crushmap crushmap.txt \
                --pool 0 --pg-num 512 --pgp-num 512
         ~id~  ~weight~  ~over/under used %~
~name~
device0     0       1.0                 9.86
device5     5       2.0                 8.54
device2     2       1.0                 1.07
device3     3       2.0                -1.12
device1     1       2.0                -5.52
device4     4       1.0               -14.75

The argument of the --pool option is unknown because the pool has not been created yet, but pool numbers are easy to predict: if the highest existing pool number is 5, the next pool number will be 6. The output shows that the PGs will not be evenly distributed because there are not enough of them. If there were a thousand times more PGs, they would be evenly distributed:

$ crush analyze --rule data --type device \
                --replication-count 2 \
                --crushmap crushmap \
                --pool 0 --pg-num 512000 --pgp-num 512000
         ~id~  ~weight~  ~over/under used %~
~name~
device4     4       1.0                 0.30
device3     3       2.0                 0.18
device2     2       1.0                -0.03
device5     5       2.0                -0.04
device1     1       2.0                -0.13
device0     0       1.0                -0.30

Increasing the number of PGs is not a practical solution because having more than a few hundred PGs per OSD requires too much CPU and RAM. Knowing that device0 will be the first OSD to fill up, ceph osd reweight-by-utilization can be run when it becomes too full.
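
For reference, a commonly quoted rule of thumb for picking pg_num in the first place is to aim for roughly 100 PGs per OSD, divide by the replication count and round up to a power of two. Below is a minimal sketch of that heuristic (an assumption based on common practice, not something the crush tool computes):

# Rule-of-thumb pg_num estimate: ~100 PGs per OSD divided by the
# replication count, rounded up to the next power of two (a sketch,
# not part of the crush tool).
def suggest_pg_num(osds, replication_count, target_pgs_per_osd=100):
    raw = osds * target_pgs_per_osd / float(replication_count)
    pg_num = 1
    while pg_num < raw:
        pg_num *= 2
    return pg_num

print(suggest_pg_num(osds=6, replication_count=2))  # 512, as used above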

How many objects will move when changing a crushmap?

After a crushmap is changed (e.g. addition/removal of devices, modification of weights or tunables), objects may move from one device to another. The crush compare command can be used to show what would happen for a given rule and replication count. In the following example, two new OSDs are added to the crushmap, causing 22% of the objects to move from the existing OSDs to the new ones.

$ crush compare --rule firstn \
                --replication-count 1 \
                --origin before.json --destination after.json
There are 1000 objects.

Replacing the crushmap specified with --origin with the crushmap
specified with --destination will move 229 objects (22.9% of the total)
from one item to another.

The rows below show the number of objects moved from the given
item to each item named in the columns. The objects% at the
end of the rows shows the percentage of the total number
of objects that is moved away from this particular item. The
last row shows the percentage of the total number of objects
that is moved to the item named in the column.

         osd.8    osd.9    objects%
osd.0        3        4       0.70%
osd.1        1        3       0.40%
osd.2       16       16       3.20%
osd.3       19       21       4.00%
osd.4       17       18       3.50%
osd.5       18       23       4.10%
osd.6       14       23       3.70%
osd.7       14       19       3.30%
objects%   10.20%   12.70%   22.90%
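
That 22.9% is close to a back-of-the-envelope estimate: with a single replica, the new OSDs end up holding the share of the data that corresponds to their share of the total weight, and roughly that fraction of the objects has to move to them. The sketch below assumes all ten OSDs have the same weight, which is not shown in the output above.

# Rough estimate of the data movement caused by adding capacity (a sketch,
# assuming equal weights): the new devices should hold added/total of the
# data, so about that fraction of the objects is expected to move.
def expected_movement(existing_weight, added_weight):
    return added_weight / (existing_weight + added_weight)

print("%.0f%%" % (100 * expected_movement(existing_weight=8.0, added_weight=2.0)))  # 20%

The measured value is a little higher than 20% because the distribution is not perfectly even: osd.9 alone receives 12.7% of the objects while osd.8 receives 10.2%.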

The crush compare command can also show the impact of a change in one or more “tunables”, such as setting chooseleaf_stable to 1.

$ diff -u original.json destination.json
--- original.json	2017-03-14 23:41:47.334740845 +0100
+++ destination.json	2017-03-04 18:36:00.817610217 +0100
@@ -608,7 +608,7 @@
         "choose_local_tries": 0,
         "choose_total_tries": 50,
         "chooseleaf_descend_once": 1,
-        "chooseleaf_stable": 0,
+        "chooseleaf_stable": 1,
         "chooseleaf_vary_r": 1,
         "straw_calc_version": 1
     }
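
The destination.json above can be edited by hand, or produced with a few lines of Python. The sketch below assumes the converted crushmap keeps its tunables under a top-level "tunables" key; check your own original.json before relying on it.

import json

# Flip chooseleaf_stable in a converted crushmap (a sketch; it assumes the
# tunables live under a top-level "tunables" key, which should be verified
# against the actual file).
with open("original.json") as f:
    crushmap = json.load(f)

crushmap["tunables"]["chooseleaf_stable"] = 1

with open("destination.json", "w") as f:
    json.dump(crushmap, f, indent=4, sort_keys=True)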

In the following example some columns were removed for brevity and replaced with dots. It shows that 33% of the objects will move after chooseleaf_stable is changed from 0 to 1. Each device will receive and send more than 1% and less than 3% of these objects.

$ crush compare --origin original.json --destination destination.json \
                --rule replicated_ruleset --replication-count 3
There are 300000 objects.

Replacing the crushmap specified with --origin with the crushmap
specified with --destination will move 99882 objects (33.294% of the total)
from one item to another.

The rows below show the number of objects moved from the given
item to each item named in the columns. The objects% at the
end of the rows shows the percentage of the total number
of objects that is moved away from this particular item. The
last row shows the percentage of the total number of objects
that is moved to the item named in the column.

          osd.0  osd.1 osd.11 osd.13 osd.20 ... osd.8  osd.9 objects%
osd.0         0    116    180      0   3972 ...   138    211    1.89%
osd.1       121      0    129     64    116 ...   112    137    1.29%
osd.11      194    126      0     12      0 ...   168    222    1.94%
osd.13        0     75     19      0    211 ...     0   4552    2.06%
osd.20     4026    120      0    197      0 ...    90      0    1.92%
osd.21      120   2181     65    130    116 ...    85     75    1.29%
osd.24      176    150    265     63      0 ...   160    258    2.29%
osd.25      123     99    190    198     99 ...    92    182    2.19%
osd.26       54     83     62    258    254 ...    51     69    2.27%
osd.27      124    109      0     90     73 ...  1840      0    1.55%
osd.29       43     54      0     98    123 ...  1857      0    1.60%
osd.3        74     82   2112    137    153 ...    61     44    1.62%
osd.37       65    108      0      0    166 ...    67      0    1.66%
osd.38      163    119      0      0     73 ...    58      0    1.68%
osd.44       56     73   2250    148    173 ...    77     43    1.68%
osd.46       60     71    132     67      0 ...    39    125    1.31%
osd.47        0     51     70    126     70 ...     0     73    1.35%
osd.8       151    112    163      0     76 ...     0    175    1.67%
osd.9       197    130    202   4493      0 ...   188      0    2.03%
objects%  1.92%  1.29%  1.95%  2.03%  1.89% ... 1.69%  2.06%   33.29%


Predicting which Ceph OSD will fill up first

When a device is added to Ceph, it is assigned a weight that reflects its capacity. For instance, if osd.1 is a 1TB disk its weight will be 1.0, and if osd.2 is a 4TB disk its weight will be 4.0. It is expected that osd.2 will receive exactly four times more objects than osd.1, so that when osd.1 is 80% full, osd.2 is also 80% full.

But running a simulation on a crushmap with four 4TB disks and one 1TB disk shows something different:

         WEIGHT     %USED
osd.4       1.0       86%
osd.3       4.0       81%
osd.2       4.0       79%
osd.1       4.0       79%
osd.0       4.0       78%

This happens because these devices are used in a two-replica pool: the distribution of the second replica depends on the distribution of the first replica. If the pool only has one copy of each object, the distribution is as expected (there is a variation, but it is around 0.2% in this case):

         WEIGHT     %USED
osd.4       1.0       80%
osd.3       4.0       80%
osd.2       4.0       80%
osd.1       4.0       80%
osd.0       4.0       80%
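
The bias can be reproduced with a toy model (this is only a sketch, not the actual CRUSH algorithm): each object picks two distinct devices, every pick being proportional to the weights of the devices that have not been chosen yet. The small device is almost never the one excluded from the second draw, while each large device is excluded a significant fraction of the time, so the small device collects more than its weight share.

import random

# Toy model of two-replica placement (not the real CRUSH algorithm): each
# object picks two distinct devices, every pick proportional to the weights
# of the devices not chosen yet. The weights mirror the example above:
# four 4TB disks and one 1TB disk.
weights = {"osd.0": 4.0, "osd.1": 4.0, "osd.2": 4.0, "osd.3": 4.0, "osd.4": 1.0}

def pick(remaining):
    r = random.uniform(0, sum(remaining.values()))
    for name, weight in remaining.items():
        r -= weight
        if r <= 0:
            return name
    return name  # floating point edge case: fall back to the last device

def place(weights, replicas=2):
    remaining = dict(weights)
    chosen = []
    for _ in range(replicas):
        name = pick(remaining)
        chosen.append(name)
        del remaining[name]
    return chosen

objects = 100000
count = dict.fromkeys(weights, 0)
for _ in range(objects):
    for name in place(weights):
        count[name] += 1

total_weight = sum(weights.values())
for name in sorted(weights):
    # over/under use compared to a share strictly proportional to the weight
    expected = 2 * objects * weights[name] / total_weight
    print("%-6s %+5.1f%%" % (name, 100.0 * (count[name] / expected - 1)))

In this simplified model osd.4 ends up roughly 10% over its weight share while each 4TB device is slightly under, the same kind of skew as the %USED table above; the exact numbers differ because CRUSH does not place replicas this way.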

This variation is not new but there was no way to conveniently show it from the crushmap. It can now be displayed with the crush analyze command. For instance:

    $ ceph osd crush dump > crushmap-ceph.json
    $ crush ceph --convert crushmap-ceph.json > crushmap.json
    $ crush analyze --rule replicated --crushmap crushmap.json
            ~id~  ~weight~  ~over/under used~
    ~name~
    g9       -22  2.299988     10.400604
    g3        -4  1.500000     10.126750
    g12      -28  4.000000      4.573330
    g10      -24  4.980988      1.955702
    g2        -3  5.199982      1.903230
    n7        -9  5.484985      1.259041
    g1        -2  5.880997      0.502741
    g11      -25  6.225967     -0.957755
    g8       -20  6.679993     -1.730727
    g5       -15  8.799988     -7.884220

shows that g9 will be ~90% full when g1 is ~80% full (i.e. 10.40 - 0.50 ~= 10% difference) and g5 is ~74% full.
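
The arithmetic behind this reading is a simple scaling of the cluster-wide average fill level by the over/under use percentage. The helper below is only an illustration: the 79.6% average is an assumed value, chosen so that g1 comes out at ~80%.

def projected_used(average_fill, over_used_percent):
    # Scale the cluster-wide average fill level by the over/under use
    # percentage reported by crush analyze.
    return average_fill * (1 + over_used_percent / 100.0)

average = 0.796  # assumed cluster-wide average fill level
for name, over in [("g9", 10.40), ("g1", 0.50), ("g5", -7.88)]:
    print("%-3s %.0f%%" % (name, 100 * projected_used(average, over)))
# g9  88%
# g1  80%
# g5  73%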

By monitoring disk usage on g9 and adding more disk space to the cluster when the disks on g9 reach a reasonable threshold (like 85% or 90%), one can ensure that the cluster will never fill up, since it is known that g9 will always be the first node to become overfull. Another possibility is to run the ceph osd reweight-by-utilization command from time to time and try to even the distribution.