Fully automated disk life cycle in a Ceph cluster

Adding, moving and removing disks in a Ceph cluster can easily be automated and require no manual intervention. New disks are formatted when the configuration tool ( Puppet, Chef etc. ) notices they are unknown according to ceph-disk. When disks are removed for whatever reason, Ceph recovers the information they contained, as expected. If disks are hot-plugged from one machine to another, udev rules will automatically allow them to rejoin the cluster and update the crush map accordingly.

A machine dedicated to hosting Ceph OSDs is configured with a bootstrap-osd key ( in /var/lib/ceph/bootstrap-osd/ceph.keyring ) which gives it just enough permissions to create new OSDs.
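For instance ( a sketch, assuming the monitors already hold a client.bootstrap-osd entity, which recent installations create by default via ceph-create-keys ), the key can be copied from the monitors to the OSD host with:

  ceph auth get client.bootstrap-osd -o /var/lib/ceph/bootstrap-osd/ceph.keyring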

Adding disks

When a new disk is hot-plugged, it can be inserted in the cluster as follows:

A new partition table is created and a file system is created on a partition tagged with the Ceph UUID. An OSD id is obtained with ceph osd create ( the bootstrap-osd key permissions allow it ) and stored in the newly created file system in the whoami file. udev is then triggered ( udevadm trigger --subsystem-match=block --action=add ): it notices the Ceph UUID partition type and calls ceph-disk activate /dev/sdX, which mounts the file system in a standard location, /var/lib/ceph/osd/<cluster>-<osd id>, and calls the appropriate init script for the distribution.
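For a new /dev/sdb, for instance ( the device name is only an example, and ceph-disk prepare is assumed to be the command doing the partitioning and mkfs described above ), the operator facing sequence boils down to:

  ceph-disk prepare /dev/sdb                             # partition and mkfs, tagged with the Ceph UUID
  udevadm trigger --subsystem-match=block --action=add   # udev notices the Ceph UUID partition type and
                                                         # runs the equivalent of: ceph-disk activate /dev/sdb1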

Auto configuration

Discovery of new disks can be automated with ceph-disk list: it reports unknown disks, which can be added as described above while the others are left untouched. The new puppet-ceph module is going to implement this feature.
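A sketch of what such an automation could do ( the "other, unknown" string is an assumption about the ceph-disk list output format and may differ between versions ):

  ceph-disk list | grep 'other, unknown' | while read dev rest ; do
      ceph-disk prepare "$dev"      # only format the disks Ceph knows nothing about
  done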

Removing disks

If a disk stops working for any reason ( disk crash or machine burns ), Ceph will transparently recover the data it contains. It will still show in the crush map and its authentication key is still in the keyring. From time to time these leftovers can be removed. If a disk is removed from the cluster by mistake, any attempt to plug it back will fail because its credentials have been removed: it will need to be zapped manually.
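For instance, the leftovers of a dead osd.2 ( the id is only an example ) can be removed with:

  ceph osd crush remove osd.2    # remove it from the crush map
  ceph auth del osd.2            # remove its authentication key
  ceph osd rm 2                  # remove the OSD itself

and a disk unplugged by mistake can be re-used after wiping it with ceph-disk zap /dev/sdX.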

Moving disks

When a disk is removed, the OSD using it will crash because I/O suddenly fails. It is better to do it after shutting down the OSD and unmounting the file system, but it will not damage the cluster. The same disk can then be plugged into another node. The udev rule notices the new disk and activates it using the same logic as when the disk was first created. The init script will update its location in the crush map ( assuming osd_crush_update_on_start is set to true in /etc/ceph/ceph.conf ).
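A sketch of the graceful variant, assuming osd.2 stored on /dev/sdb and a sysvinit based setup ( the stop command differs with upstart or systemd ):

  service ceph stop osd.2             # stop the daemon first
  umount /var/lib/ceph/osd/ceph-2     # release the file system
  # the disk can now be pulled out and plugged into the other node,
  # where udev takes care of the rest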

Rebooting the node

At boot time the init script will run one OSD daemon for each directory found in /var/lib/ceph/osd, using the OSD id which is part of the directory name.
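In other words ( a simplified sketch of the logic, not the actual init script ):

  for dir in /var/lib/ceph/osd/ceph-* ; do
      id=${dir##*-}        # the OSD id is the suffix of the directory name
      ceph-osd -i "$id"    # one daemon per directory
  done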

OSD ids and rc scripts

Before Ceph Cuttlefish, the udev based logic was not mature and it was necessary for various scripts ( including puppet modules ) to be concerned with OSD ids. Although this is no longer necessary, some of them still are, such as the rc scripts ( SuSE … ). To leverage the udev logic and help with the transition, it is recommended to use a script that walks /var/lib/ceph/osd and re-creates the matching [osd.X] sections in /etc/ceph/ceph.conf.
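A minimal sketch of such a script ( the host = line is an assumption about what the rc scripts need, and blindly appending to ceph.conf is only for illustration ):

  for dir in /var/lib/ceph/osd/ceph-* ; do
      id=${dir##*-}
      printf '[osd.%s]\n    host = %s\n' "$id" "$(hostname -s)"
  done >> /etc/ceph/ceph.conf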