Desktop-based Ceph cluster for file sharing

On July 1st, 2013, Heinlein set up a Ceph cuttlefish ( since upgraded to version 0.61.8 ) cluster using the desktops of seven employees willing to host a Ceph node and share part of their disk. Some nodes are connected with 1Gb/s links while others only have 100Mb/s. The cluster supports a 4TB Ceph file system:

ceph-office$ df -h .
Filesystem                 Size  Used Avail Use% Mounted on
x.x.x.x,y.y.y.y,z.z.z.z:/  4,0T  2,0T  2,1T  49% /mnt/ceph-office

which is used as a temporary space to exchange files. On a typical day at least one desktop is switched off and on again. The cluster has been self-healing since its installation, with the only exception of a placement group that got stuck and was fixed with a manual pg repair.
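
Such a stuck placement group can be identified and repaired by hand; a minimal sketch ( the pg id 2.1f is illustrative ):

ceph health detail     # lists stuck or inconsistent placement groups
ceph pg repair 2.1f    # ask the primary OSD to repair the placement group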
Continue reading “Desktop-based Ceph cluster for file sharing”

Resizeable and resilient mail storage with Ceph

A common use case for Ceph is to store mail as objects using the S3 API. Although most mail servers do not yet support such a storage backend, deploying them on Ceph block devices is a beneficial first step. The disk can be resized live, while the mail server is running, and will remain available even when a machine goes down. In the following example it gains ~100GB every 30 seconds:

$ df -h /data/bluemind
Filesystem      Size  Used Avail Use% Mounted on
/dev/rbd2       1.9T  196M  1.9T   1% /data/bluemind
$ sleep 30 ; df -h /data/bluemind
Filesystem      Size  Used Avail Use% Mounted on
/dev/rbd2       2.0T  199M  2.0T   1% /data/bluemind
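
The growth shown above could be driven by a loop along these lines ( a minimal sketch: the image name bluemind, the mapped device and the increments are assumptions; sizes are passed to rbd in MB ):

size=2000000                      # current image size in MB ( ~2TB )
while true ; do
    size=$((size + 100000))       # grow the image by ~100GB
    rbd resize --size $size bluemind
    resize2fs /dev/rbd2           # grow the mounted ext4 filesystem online
    sleep 30
done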

When the mail system is upgraded to an S3-capable mail storage backend, it will be able to use the Ceph S3 API right away: Ceph uses the same underlying servers for both purposes ( block and object storage ).
Continue reading “Resizeable and resilient mail storage with Ceph”

HOWTO run a Ceph QA suite

Ceph has extensive QA suites that are made of individual teuthology tasks ( see btrfs.yaml for instance ). The schedule_suite.sh helper script can be used as follows to run the entire rados suite:

./schedule_suite.sh rados wip-5510 testing \
   loic@dachary.org basic master plana

Where wip-5510 is the branch to be tested, as compiled by gitbuilder ( which happens automatically after each commit ), testing is the kernel to run on the test machines, which is relevant when using krbd, basic is the compilation flavor of the wip-5510 branch ( it could also be notcmalloc or gcov ) and master is the teuthology branch to use. plana specifies the type of machines to be used, which is only relevant when using the Inktank lab.
The jobs are sent to the queue_host defined in .teuthology.yaml

queue_host: teuthology.dachary.org
queue_port: 11300

and a mail is sent when they all complete ( ~300 jobs for the rados suite ).

name loic-2013-08-14_22:00:20-rados-wip-5510-testing-basic-plana
INFO:teuthology.suite:Collection basic in /home/loic/src/ceph-qa-suite/suites/rados/basic
INFO:teuthology.suite:Running teuthology-schedule with facets collection:basic clusters:fixed-2.yaml fs:btrfs.yaml msgr-failures:few.yaml tasks:rados_api_tests.yaml
Job scheduled with ID 106579
INFO:teuthology.suite:Running teuthology-schedule with facets collection:basic clusters:fixed-2.yaml fs:btrfs.yaml msgr-failures:few.yaml tasks:rados_cls_all.yaml
Job scheduled with ID 106580
....
INFO:teuthology.suite:Running teuthology-schedule with facets collection:verify 1thrash:none.yaml clusters:fixed-2.yaml fs:btrfs.yaml msgr-failures:few.yaml tasks:rados_cls_all.yaml validater:valgrind.yaml
Job scheduled with ID 106880
Job scheduled with ID 106881
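
The scheduled jobs wait in the queue running on queue_host; assuming it is a beanstalkd instance ( 11300 is beanstalkd's default port ), the queue statistics can be peeked at with:

# "stats" returns YAML counters such as current-jobs-ready,
# the number of jobs still waiting to be run
printf 'stats\r\nquit\r\n' | nc teuthology.dachary.org 11300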

Continue reading “HOWTO run a Ceph QA suite”

HOWTO valgrind Ceph with teuthology

Teuthology can run a designated daemon with valgrind and preserve the report for analysis. The notcmalloc flavor is preferred because it silences valgrind errors unrelated to Ceph itself:

- install:
   project: ceph
   branch: wip-5510
   flavor: notcmalloc

A daemon running under valgrind is much slower, and warnings will show up in the logs; they should be whitelisted as not relevant in this context:

    log-whitelist:
    - slow request
    - clocks
    - wrongly marked me down
    - objects unfound and apparently lost

The first OSD is marked to run under valgrind:

- ceph:
    valgrind:
      osd.0: --tool=memcheck

After running teuthology with

./virtualenv/bin/teuthology -v --archive /tmp/wip-5510-valgrind \
  --owner loic@dachary.org \
  ~/private/ceph/targets.yaml \
  ~/private/ceph/wip-5510-valgrind.yaml

errors may show up as follows:

DEBUG:teuthology.run_tasks:Exception was not quenched, exiting: Exception: saw valgrind issues
INFO:teuthology.run:Summary data:
{duration: 344.2433888912201, failure_reason: saw valgrind issues, flavor: notcmalloc,
  owner: loic@dachary.org, success: false}
INFO:teuthology.run:FAIL

The valgrind XML report containing the details of the error can be retrieved from /tmp/wip-5510-valgrind/remote/ubuntu@target1/log/valgrind/osd.0.log.gz.
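
Since the report is valgrind XML, the kind of each error can be listed without uncompressing the file first ( a quick sketch; the surrounding <error> elements hold the full stack traces ):

zcat /tmp/wip-5510-valgrind/remote/ubuntu@target1/log/valgrind/osd.0.log.gz | \
  grep '<kind>'     # e.g. <kind>Leak_DefinitelyLost</kind>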
Continue reading “HOWTO valgrind Ceph with teuthology”

HOWTO install Ceph teuthology on OpenStack

Teuthology is used to run Ceph integration tests. It is installed from source and will use newly created OpenStack instances as targets:

$ cat targets.yaml
targets:
  ubuntu@target1.novalocal: ssh-rsa AAAAB3NzaC1yc2...
  ubuntu@target2.novalocal: ssh-rsa AAAAB3NzaC1yc2...

They allow password-free ssh connections as the ubuntu user, with full sudo privileges, from the machine running teuthology. An Ubuntu precise 12.04.2 target must be configured with:

$ wget -q -O- 'https://ceph.com/git/?p=ceph.git;a=blob_plain;f=keys/release.asc' | \
  sudo apt-key add -
$ echo '    ubuntu hard nofile 16384' | sudo tee /etc/security/limits.d/ubuntu.conf
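
The password-free access itself can be prepared from the machine running teuthology with something along these lines ( a sketch: the key path and target name are illustrative, and OpenStack images usually grant the ubuntu user passwordless sudo already ):

test -f ~/.ssh/id_rsa.pub || ssh-keygen -t rsa -N '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub | \
  ssh ubuntu@target1.novalocal 'cat >> ~/.ssh/authorized_keys'    # repeat for each target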

Teuthology can then be tried with a configuration file that does nothing but install Ceph and run the daemons.

$ cat noop.yaml
check-locks: false
roles:
- - mon.a
  - osd.0
- - osd.1
  - client.0
tasks:
- install:
   project: ceph
   branch: stable
- ceph:

The output should look like this:

$ ./virtualenv/bin/teuthology targets.yaml noop.yaml
INFO:teuthology.run_tasks:Running task internal.save_config...
INFO:teuthology.task.internal:Saving configuration
INFO:teuthology.run_tasks:Running task internal.check_lock...
INFO:teuthology.task.internal:Lock checking disabled.
INFO:teuthology.run_tasks:Running task internal.connect...
INFO:teuthology.task.internal:Opening connections...
DEBUG:teuthology.task.internal:connecting to ubuntu@teuthology2.novalocal
DEBUG:teuthology.task.internal:connecting to ubuntu@teuthology1.novalocal
...
INFO:teuthology.run:Summary data:
{duration: 363.5891010761261, flavor: basic, owner: ubuntu@teuthology, success: true}
INFO:teuthology.run:pass

Continue reading “HOWTO install Ceph teuthology on OpenStack”

How does Ceph backfilling push objects to replicas?

When a placement group starts backfilling, it asks the OSD to queue it for recovery. It will eventually be processed and the OSD will ask it to start the recovery operations. Since it is backfilling (this is the original reason why the recovery operation was queued), it will attempt to reserve a backfill channel ( step 1, step 2 ). When the reservation is successful, it goes back to the initial backfilling state, which re-queues the PG for recovery. When processed, the same function runs again, but this time the backfill channel is reserved and the backfilling operations start. It scans the other OSDs to retrieve a list of objects and their associated versions and pushes missing objects to the replicas. Each pushed object is locked for read and ( after trying some snapshot-based heuristics ) a push is registered and a CEPH_OSD_OP_PUSH operation is sent to the peer OSD. The receiving replica handles the message by submitting the payload to a transaction, through which the OSD writes it to the file.
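
From the outside, this activity can be observed while it happens; a minimal sketch ( the state names shown by ceph pg dump vary slightly between releases ):

ceph -w                              # streams cluster events, including recovery progress
ceph pg dump | grep -c backfill      # placement groups currently backfilling or waiting to backfill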

Ceph RBD live resize with krbd

The Ubuntu precise 3.8 Linux kernel is rebuilt with Laurent Barbe’s patch:

apt-get source linux-image-3.8.0-27-generic
cd linux-lts-raring-3.8.0/
curl https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/patch/?id=d98df63ea7e87d5df4dce0cece0210e2a777ac00 | patch -p1
dpkg-buildpackage -uc -us

and installed with

dpkg -i ../linux-image-3.8.0-27-generic_3.8.0-27.40~precise3_amd64.deb

Live resize of a mounted ext4 file system can then be done as follows:

# rbd create --size 10000 test
# rbd map test
# mkfs.ext4 -q /dev/rbd1
# mount /dev/rbd1 /mnt
# df -h /mnt
Filesystem      Size  Used Avail Use% Mounted on
/dev/rbd1       9.5G   22M  9.0G   1% /mnt
# blockdev --getsize64 /dev/rbd1
10485760000
# rbd resize --size 20000 test
Resizing image: 100% complete...done.
# blockdev --getsize64 /dev/rbd1
20971520000
# resize2fs /dev/rbd1
resize2fs 1.42 (29-Nov-2011)
Filesystem at /dev/rbd1 is mounted on /mnt; on-line resizing required
old_desc_blocks = 1, new_desc_blocks = 2
The filesystem on /dev/rbd1 is now 5120000 blocks long.
# df -h /mnt
Filesystem      Size  Used Avail Use% Mounted on
/dev/rbd1        20G   27M   19G   1% /mnt