The algorithm to fix uneven CRUSH distributions in Ceph was implemented as the crush optimize subcommand. Given the output of ceph report, crush analyze can show which buckets are overfilled or underfilled:
$ ceph report > ceph_report.json
$ crush analyze --crushmap ceph_report.json --pool 3
             ~id~  ~weight~  ~PGs~  ~over/under filled %~
~name~
cloud3-1363    -6    419424   1084                   7.90
cloud3-1364    -7    427290   1103                   7.77
cloud3-1361    -4    424668   1061                   4.31
cloud3-1362    -5    419424   1042                   3.72
cloud3-1359    -2    419424   1031                   2.62
cloud3-1360    -3    419424    993                  -1.16
cloud3-1396    -8    644866   1520                  -1.59
cloud3-1456   -11    665842   1532                  -3.94
cloud3-1397    -9    644866   1469                  -4.90
cloud3-1398   -10    644866   1453                  -5.93
Worst case scenario if a host fails:
        ~over filled %~
~type~
device            30.15
host              10.53
root               0.00
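The ~over/under filled %~ column compares each host's actual PG count to the count it would receive if the 4096 PGs x 3 replicas were distributed strictly in proportion to the host weights. The following Python sketch reproduces the percentages from the table above; it only illustrates the arithmetic behind the column, not the tool's implementation:

# Rough sketch: recompute "over/under filled %" for each host, i.e. how far
# the actual PG count deviates from the count implied by the host weight.
# The weights and PG counts are copied from the crush analyze output above.
hosts = {
    "cloud3-1363": (419424, 1084),
    "cloud3-1364": (427290, 1103),
    "cloud3-1361": (424668, 1061),
    "cloud3-1362": (419424, 1042),
    "cloud3-1359": (419424, 1031),
    "cloud3-1360": (419424, 993),
    "cloud3-1396": (644866, 1520),
    "cloud3-1456": (665842, 1532),
    "cloud3-1397": (644866, 1469),
    "cloud3-1398": (644866, 1453),
}

total_weight = sum(weight for weight, _ in hosts.values())
total_pgs = sum(pgs for _, pgs in hosts.values())  # 4096 PGs x 3 replicas = 12288

for name, (weight, pgs) in hosts.items():
    expected = total_pgs * weight / total_weight
    print("%-12s %6.2f%%" % (name, (pgs / expected - 1) * 100))

cloud3-1363 ends up with about 8% more PGs than its weight calls for while cloud3-1398 gets about 6% fewer, and since PGs carry the data this imbalance translates directly into unevenly filled disks.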
The crush optimize command creates a crushmap that rebalances the PGs:
$ crush optimize --crushmap ceph_report.json \
--out-path optimized.crush --pool 3
2017-05-27 20:22:17,638 argv = optimize --crushmap ceph_report.json \
--out-path optimized.crush --pool 3 --replication-count=3 \
--pg-num=4096 --pgp-num=4096 --rule=data --out-version=j \
--no-positions --choose-args=3
2017-05-27 20:22:17,670 default optimizing
2017-05-27 20:22:24,165 default wants to swap 447 PGs
2017-05-27 20:22:24,172 cloud3-1360 optimizing
2017-05-27 20:22:24,173 cloud3-1359 optimizing
2017-05-27 20:22:24,174 cloud3-1361 optimizing
2017-05-27 20:22:24,175 cloud3-1362 optimizing
2017-05-27 20:22:24,177 cloud3-1364 optimizing
2017-05-27 20:22:24,177 cloud3-1363 optimizing
2017-05-27 20:22:24,179 cloud3-1396 optimizing
2017-05-27 20:22:24,188 cloud3-1397 optimizing
2017-05-27 20:22:27,726 cloud3-1360 wants to swap 21 PGs
2017-05-27 20:22:27,734 cloud3-1398 optimizing
2017-05-27 20:22:29,151 cloud3-1364 wants to swap 48 PGs
2017-05-27 20:22:29,176 cloud3-1456 optimizing
2017-05-27 20:22:29,182 cloud3-1362 wants to swap 32 PGs
2017-05-27 20:22:29,603 cloud3-1361 wants to swap 47 PGs
2017-05-27 20:22:31,406 cloud3-1396 wants to swap 77 PGs
2017-05-27 20:22:33,045 cloud3-1397 wants to swap 61 PGs
2017-05-27 20:22:33,160 cloud3-1456 wants to swap 58 PGs
2017-05-27 20:22:33,622 cloud3-1398 wants to swap 47 PGs
2017-05-27 20:23:51,645 cloud3-1359 wants to swap 26 PGs
2017-05-27 20:23:52,090 cloud3-1363 wants to swap 43 PGs
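Each bucket is optimized in turn, starting with the root (default) and then each host. The "wants to swap N PGs" lines give a rough idea of how many PG mappings will change, and therefore how much data will move once the new crushmap is installed. A small, hypothetical helper to tally them from a saved copy of the output (the optimize.log file name is only an example):

# Rough sketch: sum the "wants to swap N PGs" lines from a saved copy of the
# crush optimize output, per bucket and in total. The log format is the one
# shown above; "optimize.log" is a hypothetical file holding that output.
import re

swaps = {}
with open("optimize.log") as log:
    for line in log:
        match = re.search(r"(\S+) wants to swap (\d+) PGs", line)
        if match:
            swaps[match.group(1)] = int(match.group(2))

for bucket, count in sorted(swaps.items(), key=lambda item: -item[1]):
    print("%-12s %4d PGs" % (bucket, count))
print("%-12s %4d PGs" % ("total", sum(swaps.values())))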
Before uploading the crushmap (with ceph osd setcrushmap -i optimized.crush), crush analyze can be run again to verify that the distribution improved as expected:
$ crush analyze --crushmap optimized.crush --pool 3 --replication-count=3 \
--pg-num=4096 --pgp-num=4096 --rule=data --choose-args=0
             ~id~  ~weight~  ~PGs~  ~over/under filled %~
~name~
cloud3-1359    -2    419424   1007                   0.24
cloud3-1363    -6    419424   1006                   0.14
cloud3-1360    -3    419424   1005                   0.04
cloud3-1361    -4    424668   1017                  -0.02
cloud3-1396    -8    644866   1544                  -0.04
cloud3-1397    -9    644866   1544                  -0.04
cloud3-1398   -10    644866   1544                  -0.04
cloud3-1364    -7    427290   1023                  -0.05
cloud3-1456   -11    665842   1594                  -0.05
cloud3-1362    -5    419424   1004                  -0.06
Worst case scenario if a host fails:
        ~over filled %~
~type~
device            11.39
host               3.02
root               0.00
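The worst-case overfill when a host fails drops from 10.53% to 3.02%, and the per-host deviation shrinks from as much as 7.90% to at most 0.24%. The report/optimize/verify sequence can also be wrapped in a small script; here is a minimal Python sketch that reuses the exact options from the examples above and assumes the ceph and crush binaries are available in the PATH. Uploading the result with ceph osd setcrushmap -i optimized.crush is left as a manual step so the analysis can be reviewed first:

# Sketch: dump the cluster report, optimize the crushmap for pool 3 and
# analyze the result, mirroring the commands shown above. The crushmap is
# not uploaded automatically; "ceph osd setcrushmap -i optimized.crush"
# remains a manual, reviewed step.
import subprocess

POOL = "3"

with open("ceph_report.json", "w") as report:
    subprocess.run(["ceph", "report"], stdout=report, check=True)

subprocess.run(["crush", "optimize",
                "--crushmap", "ceph_report.json",
                "--out-path", "optimized.crush",
                "--pool", POOL], check=True)

subprocess.run(["crush", "analyze",
                "--crushmap", "optimized.crush",
                "--pool", POOL,
                "--replication-count=3",
                "--pg-num=4096", "--pgp-num=4096",
                "--rule=data", "--choose-args=0"], check=True)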