Predicting Ceph PG placement

When creating a new Ceph pool, choosing the number of PGs requires some thought to ensure there are a few hundred PGs per OSD. The distribution can be verified with crush analyze as follows:

$ crush analyze --rule data --type device \
                --replication-count 2 \
                --crushmap crushmap.txt \
                --pool 0 --pg-num 512 --pgp-num 512
         ~id~  ~weight~  ~over/under used %~
~name~
device0     0       1.0                 9.86
device5     5       2.0                 8.54
device2     2       1.0                 1.07
device3     3       2.0                -1.12
device1     1       2.0                -5.52
device4     4       1.0               -14.75
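
To interpret the ~over/under used~ column, note that each device expects a share of the 512 * 2 = 1024 PG replicas proportional to its weight, so a weight 1.0 device expects roughly 114 of them and a dozen extra PGs already amount to about 10% of over use. A minimal sketch of that back-of-the-envelope calculation, using the weights shown above:

# Expected number of PG replicas per device, proportional to its weight.
# The weights are taken from the crush analyze output above.
weights = {
    'device0': 1.0, 'device1': 2.0, 'device2': 1.0,
    'device3': 2.0, 'device4': 1.0, 'device5': 2.0,
}
pg_num = 512
replication_count = 2
placements = pg_num * replication_count   # 1024 PG replicas to distribute
total_weight = sum(weights.values())      # 9.0

for name, weight in sorted(weights.items()):
    expected = placements * weight / total_weight
    print("%-8s expects ~%.1f PG replicas" % (name, expected))
# device0 expects ~113.8 replicas: being ~10% over used means it received
# only about a dozen more PGs than its fair share.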

The argument of the --pool option is unknown because the pool has not been created yet, but pool numbers are easy to predict: if the highest existing pool number is 5, the next pool will be number 6. The output shows that the PGs will not be evenly distributed because there are not enough of them. If there were a thousand times more PGs, they would be evenly distributed:

$ crush analyze --rule data --type device \
                --replication-count 2 \
                --crushmap crushmap \
                --pool 0 --pg-num 512000 --pgp-num 512000
         ~id~  ~weight~  ~over/under used %~
~name~
device4     4       1.0                 0.30
device3     3       2.0                 0.18
device2     2       1.0                -0.03
device5     5       2.0                -0.04
device1     1       2.0                -0.13
device0     0       1.0                -0.30

Increasing the number of PGs is not a practical solution because having more than a few hundred PGs per OSD requires too much CPU and RAM. Knowing that device0 will be the first OSD to fill up, reweight-by-utilization should be used when it is too full.
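
To put a number on that constraint, a quick back-of-the-envelope check of what 512000 PGs would mean on this six device cluster:

# Average number of PG replicas per OSD with the inflated PG count:
# far beyond the recommended few hundred per OSD.
pg_num = 512000
replication_count = 2
osd_count = 6
print(pg_num * replication_count // osd_count)  # ~170666 PG replicas per OSD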

PG mapping details

The crush command uses the same C++ functions as Ceph to map a PG to an OSD. This is done in two steps:

  • the pool and PG numbers are hashed into a value
  • the value is mapped to OSDs using a crushmap (a sketch of this step in Python follows)

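The second step can be reproduced from Python. The sketch below is only an illustration: it assumes the python-crush API (Crush.parse() and Crush.map()) and a tiny made-up crushmap expressed in the library's dict format, with weights in the 0x10000 == 1.0 fixed point convention; it is not the crushmap used in the examples above and the hashed value is arbitrary.

# Illustration of step two, assuming the python-crush API: map an already
# hashed value to OSDs using a small, made-up crushmap.
from crush import Crush

crushmap = {
    "trees": [{
        "type": "root", "name": "dc1", "id": -1,
        "children": [
            {"type": "host", "name": "host0", "id": -2,
             "children": [{"id": 0, "name": "device0", "weight": 65536}]},
            {"type": "host", "name": "host1", "id": -3,
             "children": [{"id": 1, "name": "device1", "weight": 131072}]},
        ],
    }],
    "rules": {
        "data": [
            ["take", "dc1"],
            ["chooseleaf", "firstn", 0, "type", "host"],
            ["emit"],
        ],
    },
}

c = Crush()
c.parse(crushmap)
# value stands for the hash of the pool and PG numbers (step one).
print(c.map(rule="data", value=105507960, replication_count=2))
# e.g. ['device0', 'device1']
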
The --verbose flag displays the details of the mapping, together with the name of each PG:

$ crush analyze --rule data --type device \
                --verbose \
                --replication-count 2 \
                --crushmap crushmap \
                --pool 0 --pg-num 512000 --pgp-num 512000
...
2017-03-27 09:37:14,382 DEBUG 0.6b == 105507960 mapped to [u'device5', u'device0']
2017-03-27 09:37:14,382 DEBUG 0.6c == 1533389179 mapped to [u'device5', u'device3']
...

The PG 0.6b is hashed to the value 105507960 and mapped to device5 and device0. The accuracy of the mapping can be verified with the output of ceph pg dump.
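
One way to automate that cross-check, assuming ceph pg dump --format json exposes a list of PG entries with pgid and up fields (the exact JSON layout varies between Ceph releases), could look like this:

# Compare the mapping computed by crush with what the cluster reports.
# Assumes the JSON output of `ceph pg dump` contains PG entries with
# "pgid" and "up" fields; the layout differs between Ceph releases.
import json
import subprocess

dump = json.loads(subprocess.check_output(
    ["ceph", "pg", "dump", "--format", "json"]))
pg_stats = dump.get("pg_stats") or dump.get("pg_map", {}).get("pg_stats", [])
for pg in pg_stats:
    if pg["pgid"] == "0.6b":
        print(pg["up"])  # should list the OSD ids of device5 and device0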

Caveat

The crush hash assumes the hashpspool flag is set on the pool. This is the default, and the only reason to unset that flag is to support legacy clusters.
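
To double check that the flag is still set on the pools of a live cluster, one possibility (assuming ceph osd dump --format json lists each pool with a flags_names field, which may also differ between releases) is:

# Report whether each pool has the hashpspool flag set.
# Assumes `ceph osd dump --format json` lists pools with a "flags_names"
# field holding a comma separated list of flag names.
import json
import subprocess

osd_dump = json.loads(subprocess.check_output(
    ["ceph", "osd", "dump", "--format", "json"]))
for pool in osd_dump["pools"]:
    flags = pool.get("flags_names", "").split(",")
    print(pool["pool_name"], "hashpspool" in flags)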

2 Replies to “Predicting Ceph PG placement”

  1. I agree with your problem analysis for placement group counts, and have seen it several times in practice. Mark Nelson wrote a nice tool called readpgdump.py that will analyze “ceph pg dump” output to show where pools are imbalanced across OSDs.

    However, are these weights per-pool or per-OSD? If weights are assigned to the OSD, which is what I thought they were, then reweighting to optimize pool K will probably de-optimize pool J, will it not? If the weights were defined for the pool, not the OSD, then we could simultaneously optimize each pool’s weights without adversely affecting the others, right? And this could be done automatically as you describe above. Whereas now it appears that we can only optimize for one pool.

  2. > However, are these weights per-pool or per-OSD?

    These weights are per OSD in the crushmap, for a given pool. The pool uses a CRUSH rule which starts from a given root in the CRUSH hierarchy (the bucket named in the take step). From this root, it descends recursively to the OSDs. Since it is possible for two pools to reference two different CRUSH rules, it follows that an OSD can have different weights depending on the pool.

    > If the weights were defined for the pool, not the OSD, then we could simultaneously optimize each pool’s weights without adversely affecting the others, right?

    Yes. It’s a different problem which is being worked on by Xavier Villaneau at http://libcrush.org/xvillaneau/crush-docs/raw/master/converted/Ceph%20pool%20capacity%20analysis.pdf
