A make check bot for Ceph contributors

The automated make check bot for Ceph runs on Ceph pull requests. It is still experimental and is not yet triggered by all pull requests.

It watches the list of pull requests and runs make check each time a new patch is uploaded, reporting the result as a comment on the pull request.

A typical use case for a developer is:

  • write a patch and send a pull request
  • switch to another branch and work on another patch while the bot is running
  • if the bot reports a failure, switch back to the original branch and repush a fix: the bot will notice the repush and run again
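
In git terms, the repush step might look like the following (a sketch; the branch name is illustrative):

git checkout bugfix              # back to the branch of the pull request
# fix the problem, then amend or add a commit
git commit --amend -a
git push --force origin bugfix   # the bot notices the new head and runs again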

It also helps reviewers, who can wait until the bot succeeds before looking at the patch closely.

Teuthology docker targets hack (3/5)

The teuthology container hack is improved so that each Ceph command is run via docker exec -i, which can read from stdin as of Docker 1.4, released in December 2014.
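
For example, a command that teuthology would normally run on a target over ssh can now be fed on stdin to a shell inside the container (a minimal sketch; the container name is illustrative and docker >= 1.4 is assumed):

docker run -d --name container002 ubuntu:14.04 sleep infinity   # a stand-in target
echo /bin/true | docker exec -i container002 bash               # the command arrives on stdin
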
It can run the following job:

machine_type: container
os_type: ubuntu
os_version: "14.04"
suite_path: /home/loic/software/ceph/ceph-qa-suite
roles:
- - mon.a
  - osd.0
  - osd.1
  - client.0
overrides:
  install:
    ceph:
      branch: master
  ceph:
    wait-for-scrub: false
tasks:
- install:
- ceph:

When repeated a second time, the job completes in under one minute, because the bulk of the installation can be reused:

{duration: 50.01510691642761, flavor: basic,
  owner: loic@dachary.org, success: true}
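
Assuming the job above is saved in container.yaml and the container target is listed in targets.yaml (both file names are assumptions), it can presumably be run like any other teuthology job:

./virtualenv/bin/teuthology \
  --suite-path /home/loic/software/ceph/ceph-qa-suite \
  --owner loic@dachary.org \
  container.yaml \
  targets.yaml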


Why are by-partuuid symlinks missing or outdated?

The ceph-disk script manages Ceph devices and relies on the content of the /dev/disk/by-partuuid directory, which is updated by udev rules. For instance:

  • a new partition is created with /sbin/sgdisk --largest-new=1 --change-name=1:'ceph data' --partition-guid=1:83c14a9b-0493-4ccf-83ff-e3e07adae202 --typecode=1:89c57f98-2fe5-4dc0-89c1-f3ad0ceff2be -- /dev/loop4
  • the kernel is notified of the change with partprobe or partx and fires a udev event
  • the udev daemon receives UDEV [249708.246769] add /devices/virtual/block/loop4/loop4p1 (block) and the /lib/udev/rules.d/60-persistent-storage.rules script creates the corresponding symlink.

Say the partition table is later removed (with sudo sgdisk --zap-all --clear --mbrtogpt -- /dev/loop4 for instance) and the kernel is not notified with partprobe or partx. If the first partition is then created again and the kernel is notified as above, it will not notice any difference and will not send a udev event. As a result /dev/disk/by-partuuid will contain an outdated symlink.
The problem can be fixed by manually removing the stale symlink from /dev/disk/by-partuuid, clearing the partition table, and notifying the kernel again (a sketch follows the output below). The events sent to udev can be displayed with:

# udevadm monitor
...
KERNEL[250902.072077] change   /devices/virtual/block/loop4 (block)
UDEV  [250902.100779] change   /devices/virtual/block/loop4 (block)
KERNEL[250902.101235] remove   /devices/virtual/block/loop4/loop4p1 (block)
UDEV  [250902.101421] remove   /devices/virtual/block/loop4/loop4p1 (block)
...
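
As a sketch, the manual fix described above could be performed with the device and partition UUID from the example (destructive commands, double-check the device before running):

# remove the stale symlink left behind by the missed udev event
sudo rm /dev/disk/by-partuuid/83c14a9b-0493-4ccf-83ff-e3e07adae202
# clear the partition table and notify the kernel this time
sudo sgdisk --zap-all --clear --mbrtogpt -- /dev/loop4
sudo partprobe /dev/loop4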

The environment and scripts used for a block device can be displayed with:

# udevadm test /block/sdb/sdb1
...
udev_rules_apply_to_event: IMPORT '/sbin/blkid -o udev -p /dev/sdb1' /lib/udev/rules.d/60-ceph-partuuid-workaround.rules:28
udev_event_spawn: starting '/sbin/blkid -o udev -p /dev/sdb1'
...

How many PGs in each OSD of a Ceph cluster?

To display the number of PGs in each OSD of a Ceph cluster:

$ ceph --format xml pg dump | \
   xmlstarlet sel -t -m "//pg_stats/pg_stat/acting" -v osd -n | \
   sort -n | uniq -c
    332 0
    312 1
    299 2
    326 3
    291 4
    295 5
    316 6
    311 7
    301 8
    313 9

Where xmlstarlet loops over the acting set of each PG (-m "//pg_stats/pg_stat/acting") and displays the OSDs it contains (-v osd), one per line (-n). The first column is the number of PGs in which the OSD shown in the second column appears.
To restrict the display to the PGs belonging to a given pool:

$ ceph --format xml pg dump |  \
  xmlstarlet sel -t -m "//pg_stats/pg_stat[starts-with(pgid,'0.')]/acting" -v osd -n | \
  sort -n | uniq -c

Where 0. is the prefix of each PG that belongs to pool 0.
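
The numeric pool ID to use as the prefix can be looked up with:

$ ceph osd lspools

which lists each pool ID together with its name.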

HOWTO debug a teuthology task

To debug a modification to a ceph-qa-suite task (for instance repair_test.py), a teuthology target is locked with:

$ ./virtualenv/bin/teuthology-lock --lock-many 1 --owner loic@dachary.org
$ ./virtualenv/bin/teuthology-lock --list-targets --owner loic@dachary.org > targets.yaml

and used to run the test with:

./virtualenv/bin/teuthology \
  --suite-path $HOME/software/ceph/ceph-qa-suite \
  --owner loic@dachary.org \
  $HOME/software/ceph/ceph-qa-suite/suites/rados/basic/tasks/repair_test.yaml \
  roles.yaml

where roles.yaml sets all roles to one target:

roles:
- [mon.0, osd.0, osd.1, osd.2, osd.3, osd.4, client.0]

Each run requires installing and uninstalling all the Ceph packages, which takes minutes. The install part of repair_test.yaml can be commented out and the packages installed manually instead:

$ cat repair.yaml
...
tasks:
#- install:
- ceph:
- repair_test:
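
With the install task commented out, the job can be re-run quickly against the already-provisioned target, using the same invocation as above with the modified file (a sketch, assuming repair.yaml is in the current directory):

./virtualenv/bin/teuthology \
  --suite-path $HOME/software/ceph/ceph-qa-suite \
  --owner loic@dachary.org \
  repair.yaml \
  roles.yaml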


Teuthology docker targets hack (2/5)

The teuthology container hack is improved to snapshot the container after Ceph and its dependencies have been installed, which makes testing ceph-qa-suite tasks much quicker. A job doing nothing but installing the Firefly version of Ceph takes 14 seconds after the initial installation (which can take between 5 and 15 minutes, depending on how fast the machine is and how much bandwidth is available).
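
The snapshot step presumably amounts to committing the provisioned container as a reusable image, along these lines (a sketch; the container and image names are taken from the log below):

docker stop container001
docker commit container001 ceph-base-ubuntu-14.04-firefly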

...
2014-11-17 01:21:00,067.067 INFO:teuthology.worker:Reserved job 42
2014-11-17 01:21:00,067.067 INFO:teuthology.worker:Config is:
machine_type: container
name: foo
os_type: ubuntu
os_version: '14.04'
overrides:
  install:
    ceph: {branch: firefly}
owner: loic@dachary.org
priority: 1000
roles:
- [mon.a, osd.0, osd.1, client.0]
tasks:
- {install: null}
tube: container
verbose: false

Fetching from upstream into /home/loic/src/ceph-qa-suite_master
...
completed on container001: sudo lsb_release '-is':  Ubuntu
reusing existing image ceph-base-ubuntu-14.04-firefly
running 'docker' 'stop' 'container001'
completed ('docker', 'stop', u'container001') on container001:  container001
...
2014-11-17 01:21:31,677.677 INFO:teuthology.run:Summary data:
{duration: 14, flavor: basic, success: true}
2014-11-17 01:21:31,677.677 INFO:teuthology.run:pass


Running make check on Ceph pull requests

Each Ceph contribution is expected to successfully run make check and pass all the unit tests it contains. The developer runs make check locally before submitting changes, but the result may be influenced by the development environment. A draft bot based on github3.py is proposed to watch the list of pull requests on a GitHub repository and run a script each time a new patch is uploaded:

cephbot.py --user loic-bot --password XXXXX \
   --owner ceph --repository ceph \
   --script $HOME/makecheck/check.sh

If the script fails, the bot adds a comment with the output of the run to the pull request; otherwise it reports success in the same way.
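
The content of check.sh is not shown; for a Ceph tree of that era it could be as simple as this hypothetical sketch:

#!/bin/bash
# hypothetical check.sh: build the tree and run the unit tests
set -e
./autogen.sh
./configure
make -j$(nproc)
make check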

Teuthology docker targets hack (1/5)

teuthology runs jobs testing Ceph integration on targets that can be either virtual machines or bare metal. The container hack adds support for docker containers as a replacement:

...
Running task exec...
Executing custom commands...
Running commands on role mon.a host container002
running sudo 'TESTDIR=/home/ubuntu/cephtest' bash '-c' '/bin/true'
running docker exec container002 bash /tmp/tmp/tmptJ7hxa
Duration was 0.088931 seconds
...
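
The mechanism visible in the log amounts to writing each remote command to a temporary script and executing it inside a long-lived container; a standalone approximation (a sketch, assuming the hack bind-mounts the host /tmp into the container, with illustrative names):

docker run -d --name container002 -v /tmp:/tmp ubuntu:14.04 sleep infinity
echo '/bin/true' > /tmp/cmd.sh
docker exec container002 bash /tmp/cmd.sh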
