Ceph disaster recovery scenario

A datacenter containing three hosts of a non-profit Ceph and OpenStack cluster suddenly lost connectivity, and it could not be restored within 24 hours. The corresponding OSDs were manually marked out. As expected, the Ceph pool dedicated to this datacenter became unavailable. However, a pool that was supposed to keep at most one copy per datacenter turned out to have a faulty CRUSH ruleset, and as a result some placement groups in this pool were stuck.

$ ceph -s
...
health HEALTH_WARN 1 pgs degraded; 7 pgs down;
   7 pgs peering; 7 pgs recovering;
   7 pgs stuck inactive; 15 pgs stuck unclean;
   recovery 184/1141208 degraded (0.016%)
...
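
To pinpoint which placement groups are affected and why, the standard Ceph diagnostic commands can be used; this is a sketch of a plausible investigation (the pool name and rule name are hypothetical, and output depends on the cluster):

$ ceph health detail           # lists each problematic PG with its state
$ ceph pg dump_stuck inactive  # PGs stuck inactive (the 7 down/peering ones)
$ ceph pg dump_stuck unclean   # PGs stuck unclean (all 15)
$ ceph osd crush rule dump     # inspect the rulesets, including the faulty one
$ ceph pg 2.5 query            # detailed peering state of one stuck PG (PG id is an example)

Comparing the `crush rule dump` output of the suspect pool's ruleset against a known-good one (in particular the `chooseleaf` steps and the failure-domain type) is what reveals that copies were not being spread across datacenters as intended.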
