When a device is added to Ceph, it is assigned a weight that reflects its capacity. For instance, if osd.1 is a 1TB disk, its weight will be 1.0, and if osd.2 is a 4TB disk, its weight will be 4.0. It is expected that osd.2 will receive exactly four times more objects than osd.1, so that when osd.1 is 80% full, osd.2 is also 80% full.
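The expectation can be sketched with a toy model (not the actual CRUSH algorithm): if each object lands on an OSD with probability proportional to its weight, both disks fill at the same rate. The object count and data volume below are made-up illustration values.

```python
# Toy model: objects are distributed in proportion to weight, so %USED
# is identical across OSDs regardless of their capacity.
weights = {"osd.1": 1.0, "osd.2": 4.0}      # CRUSH weights
capacity_tb = {"osd.1": 1.0, "osd.2": 4.0}  # disk sizes in TB
total_objects = 100000
data_tb = 4.0                                # total data written (80% of 5TB)
object_size_tb = data_tb / total_objects

total_weight = sum(weights.values())
usage = {}
for osd, w in weights.items():
    objects = total_objects * w / total_weight       # weight-proportional share
    usage[osd] = objects * object_size_tb / capacity_tb[osd]
    print(osd, f"{usage[osd]:.0%}")
```

Both OSDs come out 80% full: osd.2 holds four times the objects, but it also has four times the capacity.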
But running a simulation on a crushmap with four 4TB disks and one 1TB disk shows something different:
        WEIGHT  %USED
osd.4   1.0     86%
osd.3   4.0     81%
osd.2   4.0     79%
osd.1   4.0     79%
osd.0   4.0     78%
This happens when these devices are used in a two-replica pool, because the distribution of the second replica depends on the distribution of the first replica. If the pool only has one copy of each object, the distribution is as expected (there is a variation, but it is around 0.2% in this case):
        WEIGHT  %USED
osd.4   1.0     80%
osd.3   4.0     80%
osd.2   4.0     80%
osd.1   4.0     80%
osd.0   4.0     80%
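The two-replica skew can be reproduced with a simplified stand-in for CRUSH (weighted choice without replacement, not the real bucket algorithm): the first replica is placed in proportion to weight, and the second is drawn from the remaining OSDs. The small disk is over-represented in second-replica draws because removing a 4TB disk from the candidate set raises its relative weight.

```python
# Simplified two-replica placement: NOT the real CRUSH algorithm, just
# weighted sampling without replacement, to show the direction of the skew.
import random

random.seed(1)
weights = {"osd.0": 4.0, "osd.1": 4.0, "osd.2": 4.0, "osd.3": 4.0, "osd.4": 1.0}

def pick_two(weights):
    osds = list(weights)
    first = random.choices(osds, weights=[weights[o] for o in osds])[0]
    rest = [o for o in osds if o != first]
    # Second replica: drawn from the remaining OSDs, re-weighted.
    second = random.choices(rest, weights=[weights[o] for o in rest])[0]
    return first, second

objects = 100000
counts = {o: 0 for o in weights}
for _ in range(objects):
    for osd in pick_two(weights):
        counts[osd] += 1

total_weight = sum(weights.values())
for osd in sorted(weights):
    expected = 2 * objects * weights[osd] / total_weight
    print(osd, f"{100 * counts[osd] / expected - 100:+.1f}% over/under")
```

In this toy model osd.4 ends up roughly 10% over its weight-proportional share while the 4TB disks sit slightly under, which matches the direction (though not the exact magnitude) of the simulation above.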
This variation is not new, but there was no way to conveniently show it from the crushmap. It can now be displayed with the crush analyze command. For instance:
$ ceph osd crush dump > crushmap-ceph.json
$ crush ceph --convert crushmap-ceph.json > crushmap.json
$ crush analyze --rule replicated --crushmap crushmap.json
        ~id~  ~weight~  ~over/under used~
~name~
g9       -22  2.299988          10.400604
g3        -4  1.500000          10.126750
g12      -28  4.000000           4.573330
g10      -24  4.980988           1.955702
g2        -3  5.199982           1.903230
n7        -9  5.484985           1.259041
g1        -2  5.880997           0.502741
g11      -25  6.225967          -0.957755
g8       -20  6.679993          -1.730727
g5       -15  8.799988          -7.884220
This shows that g9 will be ~90% full when g1 is ~80% full (i.e. 10.40 – 0.50 ~= 10% difference), while g5 will only be ~74% full.
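A rough sketch of the arithmetic behind those projections, assuming the over/under used column is a percentage relative to the cluster-wide mean utilization (an interpretation, not something stated by the crush tool output above), with the mean taken to be 80%:

```python
# Values copied from the crush analyze output above.
over_under = {"g9": 10.400604, "g1": 0.502741, "g5": -7.884220}
mean_used = 0.80  # assume the cluster averages 80% full

# Projected usage per host if each deviates from the mean by its
# over/under used percentage.
projected = {host: mean_used * (1 + pct / 100) for host, pct in over_under.items()}
for host, p in sorted(projected.items(), key=lambda kv: -kv[1]):
    print(host, f"{p:.0%}")
```

Under this reading, g9 lands near 88%, g1 near 80%, and g5 near 74%, in line with the figures quoted above.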
By monitoring disk usage on g9 and adding more disk space to the cluster when the disks on g9 reach a reasonable threshold (85% or 90%, for instance), one can ensure that the cluster never fills up, since g9 will always be the first node to become overfull. Another possibility is to run the ceph osd reweight-by-utilization command from time to time to even out the distribution.
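The consequence of monitoring only the most over-used host can be sketched as follows, again under the assumption that g9's over-used percentage is relative to the cluster mean: the estimate gives the cluster-wide average utilization at the moment g9 trips the alarm, i.e. the effective usable fraction of the raw capacity.

```python
# Estimate the cluster mean utilization at the moment the fullest host
# reaches the alarm threshold. worst_over is g9's over-used percentage
# from the crush analyze output above; the 85% threshold is one of the
# values suggested in the text.
threshold = 0.85
worst_over = 10.400604

mean_at_alarm = threshold / (1 + worst_over / 100)
print(f"cluster mean when g9 hits {threshold:.0%}: {mean_at_alarm:.1%}")
```

With these numbers, the alarm on g9 fires when the cluster as a whole is only ~77% full: the uneven distribution effectively shaves a few points off the usable capacity.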