Desktop-based Ceph cluster for file sharing

On July 1st, 2013, Heinlein set up a Ceph Cuttlefish cluster (since upgraded to version 0.61.8) using the desktops of seven employees willing to host a Ceph node and share part of their disk. Some nodes are connected with 1Gb/s links, others only have 100Mb/s. The cluster supports a 4TB Ceph file system

ceph-office$ df -h .
Filesystem                 Size  Used Avail Use% Mounted on
x.x.x.x,y.y.y.y,z.z.z.z:/  4,0T  2,0T  2,1T  49% /mnt/ceph-office

which is used as a temporary space to exchange files. On a typical day at least one desktop is switched off and on again. The cluster has been self-healing since its installation, with the only exception of a placement group that got stuck and was fixed with a manual pg repair.
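
A repair of that kind typically looks like the following sketch; the placement group id is made up for illustration, this is not a transcript of the actual incident:

$ ceph health detail      # lists the placement group that is inconsistent or stuck
$ ceph pg 3.7f query      # 3.7f is a hypothetical pg id: inspect its state
$ ceph pg repair 3.7f     # instruct the primary OSD to repair the placement group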

usage

Each employee willing to use the Ceph file system can add the following line to /etc/fstab:

x.x.x.x,y.y.y.y,z.z.z.z:/ /mnt/ceph-office ceph \
    noatime,dirstat,name=office,secret=SECRET_IN_BASE64 0 0

run mkdir /mnt/ceph-office ; mount /mnt/ceph-office and start dropping off and picking up files to exchange them within the company. Some use it to store temporary git repositories.
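
The SECRET_IN_BASE64 placeholder is the cephx key of the client used for the mount; assuming the key is registered as client.office (to match name=office above), it can be displayed with:

$ sudo ceph auth get-key client.office

Alternatively the key can be kept out of /etc/fstab with the secretfile= mount option pointing to a file that contains it.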

ceph-deploy

The nodes were installed using ceph-deploy, following the documentation instructions. There are three monitors: two run on desktops and one runs in a virtual machine dedicated to Ceph. The same virtual machine hosts the active MDS, and a standby MDS sits on one of the desktops. As of today, ceph -s shows:

$ ceph -s
   health HEALTH_OK
   monmap e7: 3 mons at {mon01=192.168.100.x:6789/0,\
                                      mon02=192.168.100.y:6789/0,\
                                      mon03=192.168.100.z:6789/0}, \
   election epoch 124, quorum 0,1,2 mon01,mon02,mon03
   osdmap e2497: 7 osds: 7 up, 7 in
    pgmap v329003: 464 pgs: 464 active+clean; 124 GB data, \
                1934 GB used, \
                2102 GB / 4059 GB avail; 614B/s wr, 0op/s
   mdsmap e31488: 1/1/1 up {0=192.168.100.a=up:active}, 1 up:standby
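
The bootstrap roughly followed the usual ceph-deploy sequence from the documentation; the commands below are a sketch using the monitor and node names visible above, not the exact history:

$ ceph-deploy new mon01 mon02 mon03           # write the initial ceph.conf and monitor keyring
$ ceph-deploy install mon01 node01 node02 ... # install the ceph packages on the nodes
$ ceph-deploy mon create mon01 mon02 mon03    # create and start the three monitors
$ ceph-deploy gatherkeys mon01                # fetch the bootstrap keys once the quorum is formed
$ ceph-deploy mds create mon01                # host name of the MDS machine, illustrative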

Deploying OSDs

On most machines a disk partition was dedicated to Ceph and used to store both the journal and the data. On others, an LVM logical volume was created for Ceph. After mounting it at /mnt/lvm/ceph, ceph-deploy was used to designate it as the directory backing the OSD.
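
With a directory as the data path, the ceph-deploy invocation is roughly the following; node02 stands for whichever host uses the logical volume:

$ ceph-deploy osd prepare node02:/mnt/lvm/ceph    # register the directory as the OSD data dir
$ ceph-deploy osd activate node02:/mnt/lvm/ceph   # create the OSD and start the daemon

On such a node the OSD data directory ends up being a symlink to the logical volume mount point: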

/var/lib/ceph/osd$ ls -l
total 0
lrwxrwxrwx 1 root root 13 Jul  4 11:21 ceph-1 -> /mnt/lvm/ceph/

Although the logical volume could be used as a regular disk or partition, it would involve tricks with tools like kpartx for no real benefit. An attempt was made to use a loopback device, but for some reason it led to high I/O wait and this option was abandoned.
All nodes use XFS and SATA disks.
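
For completeness, preparing such a logical volume boils down to something like this; the volume group name and size are made up:

$ lvcreate -L 250G -n ceph vg0       # carve a logical volume out of an existing volume group
$ mkfs.xfs /dev/vg0/ceph             # all nodes use XFS
$ mkdir -p /mnt/lvm/ceph
$ mount /dev/vg0/ceph /mnt/lvm/ceph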

crush map

The machines are spread over two floors of the building and across different offices. The crush map is configured to reflect this, but the two replicas are only forced into two different offices, regardless of the floor. The output of ceph osd tree looks like this:

# id	weight	type name	up/down	reweight
-1	3.08	root default
-12	0.35		floor three
-7	0.21			office 304
-5	0.21				host node01
3	0.21					osd.3	up	1
-8	0.06999			office 305
-6	0.06999				host node02
4	0.06999					osd.4	up	1
-9	0.06999			office 307
-2	0.06999				host node03
7	0.06999					osd.7	up	1
-13	2.73		floor four
-10	0.49			office 403
-3	0.24				host node04
1	0.24					osd.1	up	1
-14	0.25				host node05
5	0.25					osd.5	up	1
-11	0.24			office 404
-4	0.24				host node06
0	0.24					osd.0	up	1
-16	2			office 405
-15	2				host node07
6	2					osd.6	up	1

The relevant lines of the crush map are:

rule data {
	ruleset 0
	type replicated
	min_size 1
	max_size 10
	step take default
	step chooseleaf firstn 0 type office
	step emit
}
rule metadata {
	ruleset 1
	type replicated
	min_size 1
	max_size 10
	step take default
	step chooseleaf firstn 0 type office
	step emit
}
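
Custom bucket types like floor and office live in the crush map text, which is usually edited via the following round trip (file names are arbitrary):

$ ceph osd getcrushmap -o crushmap.bin        # dump the binary crush map from the cluster
$ crushtool -d crushmap.bin -o crushmap.txt   # decompile it to edit types, buckets and rules
$ crushtool -c crushmap.txt -o crushmap.new   # recompile after editing
$ ceph osd setcrushmap -i crushmap.new        # inject the modified map back into the cluster

With two replicas, the chooseleaf firstn 0 type office step is what guarantees that the two copies of each object land in two different offices.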