The algorithm to fix uneven CRUSH distributions in Ceph was implemented as the crush optimize subcommand. Given the output of ceph report, crush analyze can show buckets that are over/under filled:
$ ceph report > ceph_report.json
$ crush analyze --crushmap ceph_report.json --pool 3
             ~id~  ~weight~  ~PGs~  ~over/under filled %~
~name~
cloud3-1363    -6    419424   1084                   7.90
cloud3-1364    -7    427290   1103                   7.77
cloud3-1361    -4    424668   1061                   4.31
cloud3-1362    -5    419424   1042                   3.72
cloud3-1359    -2    419424   1031                   2.62
cloud3-1360    -3    419424    993                  -1.16
cloud3-1396    -8    644866   1520                  -1.59
cloud3-1456   -11    665842   1532                  -3.94
cloud3-1397    -9    644866   1469                  -4.90
cloud3-1398   -10    644866   1453                  -5.93

Worst case scenario if a host fails:

        ~over filled %~
~type~
device            30.15
host              10.53
root               0.00
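The ~over/under filled %~ column is the deviation of each bucket's PG count from the share its weight warrants: with --pg-num=4096 and --replication-count=3 there are 12288 PG mappings to distribute. The following minimal Python sketch reproduces the column from the weights and PG counts in the table above; the helper itself is only illustrative and not part of the crush tool.

# Reproduce the ~over/under filled %~ column: each bucket should receive a
# share of (pg_num * replication_count) PG mappings proportional to its weight.
pgs_per_bucket = {
    "cloud3-1359": (419424, 1031), "cloud3-1360": (419424, 993),
    "cloud3-1361": (424668, 1061), "cloud3-1362": (419424, 1042),
    "cloud3-1363": (419424, 1084), "cloud3-1364": (427290, 1103),
    "cloud3-1396": (644866, 1520), "cloud3-1397": (644866, 1469),
    "cloud3-1398": (644866, 1453), "cloud3-1456": (665842, 1532),
}

total_weight = sum(weight for weight, _ in pgs_per_bucket.values())
total_pgs = 4096 * 3  # --pg-num=4096 times --replication-count=3

for name, (weight, pgs) in sorted(
        pgs_per_bucket.items(),
        key=lambda item: -item[1][1] / item[1][0]):  # most filled first
    expected = total_pgs * weight / total_weight
    print("%s %6.2f%%" % (name, 100 * (pgs - expected) / expected))

Running it prints 7.90% for cloud3-1363 down to -5.93% for cloud3-1398, matching the table.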
The crush optimize command creates a crushmap that rebalances the PGs:
$ crush optimize --crushmap ceph_report.json \
    --out-path optimized.crush --pool 3
2017-05-27 20:22:17,638 argv = optimize --crushmap ceph_report.json \
    --out-path optimized.crush --pool 3 --replication-count=3 \
    --pg-num=4096 --pgp-num=4096 --rule=data --out-version=j \
    --no-positions --choose-args=3
2017-05-27 20:22:17,670 default optimizing
2017-05-27 20:22:24,165 default wants to swap 447 PGs
2017-05-27 20:22:24,172 cloud3-1360 optimizing
2017-05-27 20:22:24,173 cloud3-1359 optimizing
2017-05-27 20:22:24,174 cloud3-1361 optimizing
2017-05-27 20:22:24,175 cloud3-1362 optimizing
2017-05-27 20:22:24,177 cloud3-1364 optimizing
2017-05-27 20:22:24,177 cloud3-1363 optimizing
2017-05-27 20:22:24,179 cloud3-1396 optimizing
2017-05-27 20:22:24,188 cloud3-1397 optimizing
2017-05-27 20:22:27,726 cloud3-1360 wants to swap 21 PGs
2017-05-27 20:22:27,734 cloud3-1398 optimizing
2017-05-27 20:22:29,151 cloud3-1364 wants to swap 48 PGs
2017-05-27 20:22:29,176 cloud3-1456 optimizing
2017-05-27 20:22:29,182 cloud3-1362 wants to swap 32 PGs
2017-05-27 20:22:29,603 cloud3-1361 wants to swap 47 PGs
2017-05-27 20:22:31,406 cloud3-1396 wants to swap 77 PGs
2017-05-27 20:22:33,045 cloud3-1397 wants to swap 61 PGs
2017-05-27 20:22:33,160 cloud3-1456 wants to swap 58 PGs
2017-05-27 20:22:33,622 cloud3-1398 wants to swap 47 PGs
2017-05-27 20:23:51,645 cloud3-1359 wants to swap 26 PGs
2017-05-27 20:23:52,090 cloud3-1363 wants to swap 43 PGs
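Roughly speaking, the optimizer works one bucket at a time: it keeps the target weights to compute each child's fair share, searches for an alternative set of weights (the --choose-args weight set visible in the argv line) whose mapping brings each child as close as possible to that share, and reports the mappings that change as "wants to swap N PGs". The toy sketch below only illustrates that idea with a simplified stand-in for CRUSH (weighted rendezvous hashing instead of the real mapping, and hypothetical host names); it is not the actual optimizer.

# Toy sketch of the rebalancing idea, NOT the crush optimizer: PGs are placed
# with a straw-like weighted hash, and an alternative weight set is nudged
# until every bucket holds close to its fair share of PGs.
import hashlib

NUM_PGS = 1024
TARGET = {"host-%d" % i: 1.0 for i in range(4)}   # hypothetical hosts/weights

def straw(pg, name, weight):
    # uniform draw in (0, 1] derived from a hash, biased by the bucket weight
    h = int(hashlib.md5(("%s-%s" % (pg, name)).encode()).hexdigest(), 16)
    return ((h + 1) / 2.0 ** 128) ** (1.0 / weight)

def place(pg, weights):
    # the bucket drawing the longest straw wins the PG
    return max(weights, key=lambda name: straw(pg, name, weights[name]))

def histogram(weights):
    counts = dict.fromkeys(weights, 0)
    for pg in range(NUM_PGS):
        counts[place(pg, weights)] += 1
    return counts

before = {pg: place(pg, TARGET) for pg in range(NUM_PGS)}
adjusted = dict(TARGET)                           # the alternative weight set
for _ in range(20):                               # crude iterative correction
    counts = histogram(adjusted)
    for name in adjusted:
        expected = NUM_PGS * TARGET[name] / sum(TARGET.values())
        adjusted[name] *= expected / max(counts[name], 1)

after = {pg: place(pg, adjusted) for pg in range(NUM_PGS)}
print("wants to swap %d PGs" % sum(
    1 for pg in range(NUM_PGS) if before[pg] != after[pg]))
print("before:", histogram(TARGET))
print("after: ", histogram(adjusted))

In the toy as in the real tool, the "wants to swap" figures give an idea of how much data will move once the new crushmap is installed, since only the remapped PGs need to be copied.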
Before uploading the crushmap (with ceph osd setcrushmap -i optimized.crush), crush analyze can be used again to verify that the distribution improved as expected:
$ crush analyze --crushmap optimized.crush --pool 3 --replication-count=3 \
    --pg-num=4096 --pgp-num=4096 --rule=data --choose-args=0
             ~id~  ~weight~  ~PGs~  ~over/under filled %~
~name~
cloud3-1359    -2    419424   1007                   0.24
cloud3-1363    -6    419424   1006                   0.14
cloud3-1360    -3    419424   1005                   0.04
cloud3-1361    -4    424668   1017                  -0.02
cloud3-1396    -8    644866   1544                  -0.04
cloud3-1397    -9    644866   1544                  -0.04
cloud3-1398   -10    644866   1544                  -0.04
cloud3-1364    -7    427290   1023                  -0.05
cloud3-1456   -11    665842   1594                  -0.05
cloud3-1362    -5    419424   1004                  -0.06

Worst case scenario if a host fails:

        ~over filled %~
~type~
device            11.39
host               3.02
root               0.00
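The spread shrinks from between +7.90% and -5.93% to between +0.24% and -0.06%, and the worst-case overfill after a host failure drops from 30.15% to 11.39% at the device level and from 10.53% to 3.02% at the host level. A small, hypothetical pre-flight check (not part of the crush tool) could parse that output and refuse to proceed while any bucket is still more than 1% away from its fair share:

# Hypothetical pre-flight check: pipe `crush analyze` output into this script
# and exit non-zero if any bucket is still more than 1% over or under filled.
import re
import sys

worst = 0.0
for line in sys.stdin:
    # data rows look like: "cloud3-1363    -6    419424   1084    7.90"
    match = re.match(r"^\s*\S+\s+-\d+\s+\d+\s+\d+\s+(-?\d+\.\d+)\s*$", line)
    if match:
        worst = max(worst, abs(float(match.group(1))))

print("worst over/under filled bucket: %.2f%%" % worst)
sys.exit(0 if worst <= 1.0 else 1)

Saved as (for instance) check_balance.py, it could gate the upload: crush analyze --crushmap optimized.crush ... | python check_balance.py && ceph osd setcrushmap -i optimized.crush.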