The algorithm to fix uneven CRUSH distributions in Ceph was implemented as the crush optimize subcommand. Given the output of ceph report, crush analyze can show buckets that are over/under filled:
$ ceph report > ceph_report.json
$ crush analyze --crushmap ceph_report.json --pool 3
             ~id~  ~weight~  ~PGs~  ~over/under filled %~
~name~
cloud3-1363    -6    419424   1084                   7.90
cloud3-1364    -7    427290   1103                   7.77
cloud3-1361    -4    424668   1061                   4.31
cloud3-1362    -5    419424   1042                   3.72
cloud3-1359    -2    419424   1031                   2.62
cloud3-1360    -3    419424    993                  -1.16
cloud3-1396    -8    644866   1520                  -1.59
cloud3-1456   -11    665842   1532                  -3.94
cloud3-1397    -9    644866   1469                  -4.90
cloud3-1398   -10    644866   1453                  -5.93

Worst case scenario if a host fails:

        ~over filled %~
~type~
device            30.15
host              10.53
root               0.00
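The over/under filled percentage can be read as the deviation of a bucket's PG count from its weight-proportional share. As a back-of-the-envelope check (not how crush analyze computes it internally, just a sanity check on the numbers above): the pool has 4096 PGs with 3 replicas, i.e. 12288 PG replicas to place, and the weights in the table sum to 5130094.

$ awk 'BEGIN { total_weight = 5130094; total_pgs = 4096 * 3 }  # 12288 PG replicas
       { expected = total_pgs * $2 / total_weight              # weight-proportional share
         printf "%-12s %+.2f%%\n", $1, ($3 - expected) / expected * 100 }' <<EOF
cloud3-1363 419424 1084
cloud3-1360 419424 993
EOF
cloud3-1363  +7.90%
cloud3-1360  -1.16%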
The crush optimize command will create a crushmap rebalancing the PGs:
$ crush optimize --crushmap ceph_report.json \
                 --out-path optimized.crush --pool 3
2017-05-27 20:22:17,638 argv = optimize --crushmap ceph_report.json \
  --out-path optimized.crush --pool 3 --replication-count=3 \
  --pg-num=4096 --pgp-num=4096 --rule=data --out-version=j \
  --no-positions --choose-args=3
2017-05-27 20:22:17,670 default optimizing
2017-05-27 20:22:24,165 default wants to swap 447 PGs
2017-05-27 20:22:24,172 cloud3-1360 optimizing
2017-05-27 20:22:24,173 cloud3-1359 optimizing
2017-05-27 20:22:24,174 cloud3-1361 optimizing
2017-05-27 20:22:24,175 cloud3-1362 optimizing
2017-05-27 20:22:24,177 cloud3-1364 optimizing
2017-05-27 20:22:24,177 cloud3-1363 optimizing
2017-05-27 20:22:24,179 cloud3-1396 optimizing
2017-05-27 20:22:24,188 cloud3-1397 optimizing
2017-05-27 20:22:27,726 cloud3-1360 wants to swap 21 PGs
2017-05-27 20:22:27,734 cloud3-1398 optimizing
2017-05-27 20:22:29,151 cloud3-1364 wants to swap 48 PGs
2017-05-27 20:22:29,176 cloud3-1456 optimizing
2017-05-27 20:22:29,182 cloud3-1362 wants to swap 32 PGs
2017-05-27 20:22:29,603 cloud3-1361 wants to swap 47 PGs
2017-05-27 20:22:31,406 cloud3-1396 wants to swap 77 PGs
2017-05-27 20:22:33,045 cloud3-1397 wants to swap 61 PGs
2017-05-27 20:22:33,160 cloud3-1456 wants to swap 58 PGs
2017-05-27 20:22:33,622 cloud3-1398 wants to swap 47 PGs
2017-05-27 20:23:51,645 cloud3-1359 wants to swap 26 PGs
2017-05-27 20:23:52,090 cloud3-1363 wants to swap 43 PGs
Before uploading the crushmap (with ceph osd setcrushmap -i optimized.crush), crush analyze can be used again to verify that the distribution improved as expected:
$ crush analyze --crushmap optimized.crush --pool 3 --replication-count=3 \
                --pg-num=4096 --pgp-num=4096 --rule=data --choose-args=0
             ~id~  ~weight~  ~PGs~  ~over/under filled %~
~name~
cloud3-1359    -2    419424   1007                   0.24
cloud3-1363    -6    419424   1006                   0.14
cloud3-1360    -3    419424   1005                   0.04
cloud3-1361    -4    424668   1017                  -0.02
cloud3-1396    -8    644866   1544                  -0.04
cloud3-1397    -9    644866   1544                  -0.04
cloud3-1398   -10    644866   1544                  -0.04
cloud3-1364    -7    427290   1023                  -0.05
cloud3-1456   -11    665842   1594                  -0.05
cloud3-1362    -5    419424   1004                  -0.06

Worst case scenario if a host fails:

        ~over filled %~
~type~
device            11.39
host               3.02
root               0.00
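Once the distribution looks right, the optimized map can be injected and the resulting data movement followed with the usual status commands (a sketch of the workflow; the setcrushmap invocation is the one mentioned above):

$ ceph osd setcrushmap -i optimized.crush
$ ceph -s        # follow the backfilling of the misplaced PGs
$ ceph health    # back to HEALTH_OK once the data movement is complete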
Incremental rebalancing
The number of PGs moved during rebalancing can be controlled with the --step option. For instance, --step 64 will stop rebalancing after 64 PGs are moved:
$ crush optimize --crushmap ceph_report.json --out-path optimized.crush \
        --pool 3 --step 64
2017-05-27 21:23:31,498 default optimizing
2017-05-27 21:23:32,208 default wants to swap 72 PGs
2017-05-27 21:23:32,208 default will swap 72 PGs
2017-05-27 21:23:32,218 the optimized crushmap was written to optimized.crush
2017-05-27 21:23:32,218 now running simulation of the next steps
2017-05-27 21:23:32,218 this can be disabled with --no-forecast
2017-05-27 21:23:32,250 default optimizing
2017-05-27 21:23:33,087 default wants to swap 70 PGs
2017-05-27 21:23:33,087 default will swap 70 PGs
...
2017-05-27 21:39:33,421 cloud3-1363 already optimized
2017-05-27 21:39:44,724 cloud3-1359 already optimized
2017-05-27 21:39:44,757 step 13 moves 0 PGs
The forecast shows that 12 rounds are needed to complete the rebalancing. Once the cluster has finished moving the PGs of the first round, crush optimize can be run again with a fresh output of ceph report.
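Put together, the incremental rebalancing could be scripted roughly as follows (a sketch only, assuming each round's data movement is allowed to finish, i.e. the cluster returns to HEALTH_OK, before the next round starts; the 12 rounds, pool and step size come from the forecast above):

$ for round in $(seq 1 12) ; do
      ceph report > ceph_report.json
      crush optimize --crushmap ceph_report.json --out-path optimized.crush \
            --pool 3 --step 64
      ceph osd setcrushmap -i optimized.crush
      # wait until the PGs moved by this round have been backfilled
      while ! ceph health | grep -q HEALTH_OK ; do sleep 60 ; done
  done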
Caveats
If two pools share the same crush rule, they cannot be rebalanced; only pools with a rule of their own can be.
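To see which rule each pool uses, and therefore whether a pool has a rule of its own, the pool entries of the OSD map can be listed (the field is named crush_rule on recent releases and crush_ruleset on older ones):

$ ceph osd dump | grep '^pool'   # each pool line shows the crush rule it uses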
Backward compatibility
Rebalancing is implemented for Luminous and later clusters. Pre-Luminous clusters can also be rebalanced, but only when they contain a single pool. See the python-crush Ceph cookbook for more information.