Predicting which Ceph OSD will fill up first

When a device is added to Ceph, it is assigned a weight that reflects its capacity. For instance, if osd.1 is a 1TB disk, its weight will be 1.0 and if osd.2 is a 4TB disk, its weight will be 4.0. It is expected that osd.2 will receive four times as many objects as osd.1, so that when osd.1 is 80% full, osd.2 is also 80% full.

But running a simulation on a crushmap with four 4TB disks and one 1TB disk shows something different:

         WEIGHT     %USED
osd.4       1.0       86%
osd.3       4.0       81%
osd.2       4.0       79%
osd.1       4.0       79%
osd.0       4.0       78%

This happens when these devices are used in a two-replica pool, because the distribution of the second replica depends on the distribution of the first replica. If the pool only has one copy of each object, the distribution is as expected (there is a variation but it is around 0.2% in this case):

         WEIGHT     %USED
osd.4       1.0       80%
osd.3       4.0       80%
osd.2       4.0       80%
osd.1       4.0       80%
osd.0       4.0       80%
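
The dependency between the two replicas can be illustrated with a simplified model, which is not the CRUSH algorithm itself: the first replica is drawn with a probability proportional to the weight, and the second replica is drawn among the remaining devices, again proportionally to their weight. The sketch below (the two_replica_share function is made up for this illustration) computes the exact share of replicas each device receives and compares it to the share its weight calls for:

```python
from fractions import Fraction


def two_replica_share(weights):
    """Exact share of replicas per device when two distinct devices are
    drawn one after the other, each draw proportional to the weights.
    This is a simplified model, not the CRUSH algorithm."""
    total = sum(weights)
    share = [Fraction(0)] * len(weights)
    for i, wi in enumerate(weights):
        p_first = Fraction(wi, total)
        share[i] += p_first                 # device i holds the first replica
        rest = total - wi                   # weight left for the second draw
        for j, wj in enumerate(weights):
            if j != i:
                share[j] += p_first * Fraction(wj, rest)
    return share


weights = [4, 4, 4, 4, 1]                   # four 4TB disks and one 1TB disk
share = two_replica_share(weights)
total = sum(weights)
for w, s in zip(weights, share):
    expected = Fraction(2 * w, total)       # two replicas, split by weight
    print(f"weight {w}: {float(s / expected - 1) * 100:+.1f}% vs. expected")
```

In this model the 1TB disk receives about 11.5% more than its share while each 4TB disk receives about 0.7% less. The figures differ from the simulation above, but the trend is the same: the smallest device fills up first.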

This variation is not new but there was no way to conveniently show it from the crushmap. It can now be displayed with the crush analyze command. For instance:

    $ ceph osd crush dump > crushmap-ceph.json
    $ crush ceph --convert crushmap-ceph.json > crushmap.json
    $ crush analyze --rule replicated --crushmap crushmap.json
            ~id~  ~weight~  ~over/under used~
    g9       -22  2.299988     10.400604
    g3        -4  1.500000     10.126750
    g12      -28  4.000000      4.573330
    g10      -24  4.980988      1.955702
    g2        -3  5.199982      1.903230
    n7        -9  5.484985      1.259041
    g1        -2  5.880997      0.502741
    g11      -25  6.225967     -0.957755
    g8       -20  6.679993     -1.730727
    g5       -15  8.799988     -7.884220

shows that g9 will be ~90% full when g1 is ~80% full (i.e. 10.40 – 0.50 ~= 10% difference) and g5 is ~74% full.

By monitoring disk usage on g9 and adding more disk space to the cluster when the disks on g9 reach a reasonable threshold (like 85% or 90%), one can ensure that the cluster will never fill up, since it is known that g9 will always be the first node to become overfull. Another possibility is to run the ceph osd reweight-by-utilization command from time to time and try to even the distribution.
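
Finding the item to monitor can be automated. The sketch below assumes the output of crush analyze was saved as text with the same column layout as shown above. The parse_analyze helper is hypothetical and is not part of the crush tool:

```python
ANALYZE_OUTPUT = """\
        ~id~  ~weight~  ~over/under used~
g9       -22  2.299988     10.400604
g3        -4  1.500000     10.126750
g12      -28  4.000000      4.573330
g10      -24  4.980988      1.955702
g2        -3  5.199982      1.903230
n7        -9  5.484985      1.259041
g1        -2  5.880997      0.502741
g11      -25  6.225967     -0.957755
g8       -20  6.679993     -1.730727
g5       -15  8.799988     -7.884220
"""


def parse_analyze(text):
    """Return (name, over/under used) pairs from the table."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 4:
            continue
        try:
            rows.append((fields[0], float(fields[3])))
        except ValueError:
            continue                        # header line, not a data row
    return rows


rows = parse_analyze(ANALYZE_OUTPUT)
# The most over used item is the one expected to fill up first.
first_to_fill = max(rows, key=lambda row: row[1])
print(first_to_fill[0])
```

With the table above, it prints g9.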

logging udev events at boot time

Adapted from Peter Rajnoha's post:

  • create a special systemd unit to monitor udev during boot:
    cat > /etc/systemd/system/systemd-udev-monitor.service <<EOF
    [Unit]
    Description=udev Monitoring
    After=systemd-udevd-control.socket systemd-udevd-kernel.socket systemd-udev-trigger.service
    [Service]
    ExecStart=/usr/bin/sh -c "/usr/sbin/udevadm monitor --udev --env > /udev_monitor.log"
    [Install]
    # assumption: the install target is missing from this excerpt
    WantedBy=sysinit.target
    EOF
  • run systemctl daemon-reload
  • run systemctl enable systemd-udev-monitor.service
  • append “systemd.log_level=debug systemd.log_target=kmsg udev.log-priority=debug log_buf_len=8M” to the kernel command line
  • reboot
  • collect the logs in /udev_monitor.log


Testing Ceph with ARMv8 OpenStack instances

The Ceph integration tests can be run on ARMv8 (aka arm64 or aarch64) OpenStack instances on CloudLab or Runabove.

Once logged in to CloudLab, an OpenStack cluster suitable for teuthology must be created. To start an experiment:

  • click Change Profile to select the OpenStackTeuthology profile
  • the description of the profile contains an example credential file that can be copied and pasted on the local machine
  • the m400 default machine type will select ARMv8 hardware
  • in the last step, choose a name for the experiment. The credential file must be modified to reflect the chosen name because it shows in the URL of the authentication service. If a new experiment by the same name is run a month later, the same file can be used.
  • the page is then updated to show the progress of the provisioning. Note that it takes about 15 minutes to complete: even when the page says the experiment is up, the OpenStack setup is still going on and needs a few more minutes.

Finally, click on Profile instructions to display the link to the Horizon dashboard and the password of the admin user. The relevant part of the instructions reads: “… configuring OpenStack inside your experiment, you’ll be able to visit the OpenStack Dashboard WWW interface (approx. 5-15 minutes). Your OpenStack admin and instance VM password is randomly-generated by Cloudlab, and it is: 0905d783e7e7.”

When the cluster is created, the smoke integration tests for rados on jewel can be run with ceph-workbench:

ceph-workbench --verbose ceph-qa-suite --ceph jewel --suite smoke --filter rados

assuming the credential file has been set in ~/.ceph-workbench/. When the command returns, it displays the URL of the web interface:

2016-04-07 11:25:57,625.625 DEBUG:paramiko.transport:EOF in transport thread
2016-04-07 11:25:57,628.628 INFO:teuthology.openstack:
pulpito web interface:
ssh access           : ssh ubuntu@ # logs in /usr/share/nginx/html

And when the test completes successfully it will show in green. Otherwise the logs of the failed tests can be downloaded and analyzed.


Semi-reliable GitHub scripting

The githubpy python library provides a thin layer on top of the GitHub V3 API, which is convenient because the official GitHub documentation can be used. The undocumented behavior of GitHub is outside of the scope of this library and needs to be addressed by the caller.

For instance creating a repository is asynchronous and checking for its existence may fail. Something similar to the following function should be used to wait until it exists:

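    # assumes "import time" and "import github" (githubpy) at module level
    def project_exists(self, name):
        retry = 10
        while retry > 0:
            try:
                for repo in self.github.g.user('repos').get():
                    if repo['name'] == name:
                        return True
                return False
            except github.ApiError:
                # transient API failure: pause, then try again
                time.sleep(5)
            retry -= 1
        raise Exception('error getting the list of repos')

    def add_project(self):
        # only the name is passed here; any other argument of the
        # original call is missing from this excerpt
        r = self.github.g.user('repos').post(name=GITHUB['repo'])
        assert r['full_name'] == GITHUB['username'] + '/' + GITHUB['repo']
        # creation is asynchronous: poll until the repository shows up
        while not self.project_exists(GITHUB['repo']):
            time.sleep(1)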

Another example is merging a pull request. It sometimes fails (503, cannot be merged error) although it succeeds in the background. To cope with that, the state of the pull request should be checked immediately after the merge failed. It can either be merged or closed (although the GitHub web interface shows it as merged). The following function can be used to cope with that behavior:

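    def merge(self, pr, message):
        retry = 10
        while retry > 0:
            try:
                current = self.github.repos().pulls(pr).get()
                if current['state'] in ('merged', 'closed'):
                    # the merge went through, possibly in the background
                    return
                logging.info('state = ' + current['state'])
                # reconstructed call (missing from this excerpt): GitHub's
                # "merge a pull request" endpoint, PUT .../pulls/:number/merge
                self.github.repos().pulls(pr).merge.put(
                    commit_message=message)
            except github.ApiError:
                logging.exception('merging ' + str(pr) + ' ' + message)
            retry -= 1
        assert retry > 0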

These two examples have been implemented as part of the ceph-workbench integration tests. The behavior described above can be reproduced by running the test in a loop for a few hours.
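
The pattern common to both functions, polling until the service reports the expected state, can be tried without a GitHub account. In the following sketch, wait_until and FakeApi are made up for the illustration: FakeApi stands in for an eventually consistent API where a repository only becomes visible on the third lookup.

```python
import time


def wait_until(predicate, attempts=10, delay=0.0):
    """Call predicate until it returns True; give up after `attempts` calls."""
    for _ in range(attempts):
        if predicate():
            return True
        time.sleep(delay)
    return False


class FakeApi:
    """Stand-in for an eventually consistent API (hypothetical)."""

    def __init__(self):
        self.calls = 0

    def repo_exists(self):
        # the repository only shows up on the third lookup
        self.calls += 1
        return self.calls >= 3


api = FakeApi()
found = wait_until(api.repo_exists, attempts=10, delay=0.01)
print(found, api.calls)
```

Bounding the number of attempts matters: a caller that loops forever hides a genuine failure behind what looks like a slow API.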

teuthology forensics with git, shell and paddles

When a teuthology integration test for Ceph fails, the results are analyzed to find the source of the problem. For instance, the issue “upgrade suite: pool_create failed with error -4 EINTR” was reported in early October 2015, with multiple integration job failures.
The first step is to look into the teuthology log, which revealed that pools could not be created.

failed: error rados_pool_create(test-rados-api-vpm049-15238-1) \
  failed with error -4"

The 4 stands for EINTR. The paddles database is used by teuthology to store test results and can be queried via HTTP. For instance:

curl --silent |
  jq '.[] | \
      select(.name | contains("upgrade:firefly-hammer-x")) | \
      select(.branch == "infernalis") | \
      select(.status | contains("finished")) \
      | .name' | \
  while read run ; do eval run=$run ; \
    curl --silent$run/jobs/ | \
      jq '.[] | "\(.name)/jobs/\(.job_id)/"' ; \
  done | \
  while read url ; do eval url=$url ; \
    curl --silent $url | \
      jq 'if((.description != null) and \
             (.description | contains("parallel")) and \
             (.success == true)) then "'$url'" else null end' ; \
  done | grep -v null

shows which successful jobs, among the upgrade:firefly-hammer-x suite runs against the infernalis branch (selected by the first jq expression), were involved in a parallel test (parallel is the name of a subdirectory of the suite). This was not sufficient to figure out the root cause of the problem because:

  • it only provides access to the last 100 runs
  • it does not allow grepping the teuthology log file for a string

With the teuthology logs in the /a directory (it’s actually a 100TB CephFS mount half full), the following shell snippet can be used to find the upgrade tests that failed with the error -4 message in the logs.

for run in *2015-{07,08,09,10}*upgrade* ; do for job in $run/* ; do \
  test -d $job || continue ; \
  config=$job/config.yaml ;   test -f $config || continue ; \
  summary=$job/summary.yaml ; test -f $summary || continue ; \
  if shyaml get-value branch < $config | grep -q hammer && \
     shyaml get-value success < $summary | grep -qi false && \
     grep -q 'error -4' $job/teuthology.log  ; then
       echo $job ;
   fi ; \
done ; done

It looks for all upgrade runs, back to July 2015. shyaml is used to query the branch from the job configuration and only keep those targeting hammer. If the job failed (according to the success value found in the summary file), the error is looked up in the teuthology.log file. The first failed job dates from early September.


It happened on a regular basis after that date but was only reported early October. The commits merged in the hammer branch around that time are displayed with:

git log --merges --since 2015-09-01 --until 2015-09-11 --format='%H' ceph/hammer | \
while read sha1 ; do \
  echo ; git log --format='** %aD "%s":' ${sha1}^1..${sha1} ; \
done | perl -p -e 'print "* \"PR $1\":$1\n" if(/Merge pull request #(\d+)/)'

The output can be copied and pasted into a Redmine issue. It turns out that a pull request merged on September 6th was responsible for the failure.

On demand Ceph packages for teuthology

When a teuthology job installs Ceph, it uses packages created by gitbuilder. These packages are built every time a branch is pushed to the official repository.

Contributors who do not have write access to the official repository can either ask a developer with access to push a branch for them or set up a gitbuilder repository, using autobuild-ceph. Asking a developer is inconvenient because it takes time and also because it creates packages for every supported operating system, even when only one of them would be enough. In addition, there often is a long wait queue because the gitbuilder of the sepia lab is very busy. Setting up a gitbuilder repository reduces the wait time but it has proven to consume too much time and too many resources for most contributors.

The buildpackages task can be used to resolve that problem and create the packages required for a particular job on demand. When added to a job that has an install task, it will:

  • always run before the install task regardless of its position in the list of tasks (see the buildpackages_prep function in the teuthology internal tasks for more information).
  • create an http server, unless it already exists
  • set gitbuilder_host in ~/.teuthology.yaml to the http server
  • find the SHA1 of the commit that the install task needs
  • check out the ceph repository at SHA1 and build the packages, in a dedicated server
  • upload the packages to the http server, using directory names that mimic the gitbuilder conventions used in the lab gitbuilder and destroy the server used to build them

When the install task looks for packages, it uses the http server populated by the buildpackages task. The teuthology cluster keeps track of which packages were built for which architecture (via makefile timestamp files). When another job needs the same packages, the buildpackages task will notice they already have been built and uploaded to the http server and do nothing.

A test suite verifies the buildpackages task works as expected and can be run with:

teuthology-openstack --verbose \
   --key-name myself --key-filename ~/Downloads/myself \
   --ceph-git-url \
   --ceph hammer --suite teuthology/buildpackages

The --ceph-git-url option is the repository from which the branch specified with --ceph is cloned. Its default value requires write access to the official Ceph repository.

Gitlab CI runner installation

The instructions to install the GitLab CI runner are adapted here to Ubuntu 14.04, so that it connects to GitLab CI and runs jobs when a commit is pushed to a branch.

A runner token must first be obtained from GitLab CI.

The gitlab-ci-multi-runner package is installed as follows:

$ curl -L | sudo bash
$ sudo apt-get install gitlab-ci-multi-runner
$ sudo gitlab-ci-multi-runner register
Please enter the gitlab-ci coordinator URL (e.g.
Please enter the gitlab-ci token for this runner:
Please enter the gitlab-ci description for this runner:
[cong]: runner1
INFO[0156] 4418775e Registering runner... succeeded
Please enter the executor: shell, parallels, docker, docker-ssh, ssh:
[shell]: docker
Please enter the Docker image (eg. ruby:2.1):
If you want to enable mysql please enter version (X.Y) or enter latest?

If you want to enable postgres please enter version (X.Y) or enter latest?

If you want to enable redis please enter version (X.Y) or enter latest?

If you want to enable mongo please enter version (X.Y) or enter latest?

INFO[0281] Runner registered successfully. Feel free to start it, but if it's running already the config should be automatically reloaded!

It is configured to run each job in a golang docker container. The project git repository is expected to have a .gitlab-ci.yml file at the root. For instance if .gitlab-ci.yml was:

  script: "type go"

the GitLab runner would succeed.


faster debugging of a teuthology workunit

The Ceph integration tests run via teuthology rely on workunits found in the Ceph repository. For instance, when a workunit is modified, the usual cycle is:

  • the cephtool/ workunit is modified
  • it is pushed to a wip- branch in the official Ceph git repository
  • the gitbuilder automatically builds packages for all supported distributions for this wip- branch
  • the rados/singleton/all/cephtool suite can be run with teuthology-suite --suite rados/singleton
  • the workunit task fetches the workunits directory from the Ceph git repository and runs it

There is no need for Ceph to be packaged each time the workunit script is modified. Instead it can be fetched from a pull request:

  • the cephtool/ workunit is modified
  • the pull request number 2043 is created or updated with the modified workunit
  • the workunit.yaml file is created with
          branch: refs/pull/2043/head
  • the rados/singleton/all/cephtool suite can be run with teuthology-suite --suite rados/singleton $(pwd)/workunit.yaml
  • the workunit task fetches the workunits directory in the branch refs/pull/2043/head from the Ceph git repository and runs it

For each pull request, GitHub implicitly creates a reference in the target git repository. This reference is mirrored to where the workunit task can extract it. The teuthology-suite command accepts yaml files as arguments and they are assumed to be relative to the root of a clone of the ceph-qa-suite repository. By providing an absolute path ($(pwd)/workunit.yaml), the file is read from the current directory instead and there is no need to commit it to the ceph-qa-suite repository.
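
Why an absolute path escapes the ceph-qa-suite root can be illustrated with Python's path joining rule, where an absolute component discards what precedes it. This only illustrates the behavior described above: it is an assumption that teuthology resolves its arguments in a comparable way, and the resolve helper and the paths are hypothetical.

```python
import posixpath


def resolve(suite_root, yaml_path):
    """Resolve a yaml argument against the root of a ceph-qa-suite clone."""
    # posixpath.join drops suite_root when yaml_path is absolute
    return posixpath.join(suite_root, yaml_path)


root = '/home/user/ceph-qa-suite'           # hypothetical clone location
relative = resolve(root, 'suites/rados/singleton')
absolute = resolve(root, '/tmp/work/workunit.yaml')
print(relative)
print(absolute)
```

The first path stays inside the clone while the second is returned unchanged, which is why $(pwd)/workunit.yaml does not need to be committed.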

write-only ssh based rsync server

A write-only rsync server can be used by anyone to upload content with no risk of deleting existing files. Assuming access to the rsync server is handled via ssh, the following line can be added to the ~/.ssh/authorized_keys file:

command="rrsync /usr/share/nginx/html" ssh-rsa AAAAB3NzaC1y...

The rrsync script is found in the rsync package documentation and installed with:

gzip -d < /usr/share/doc/rsync/scripts/rrsync.gz > /usr/bin/rrsync
chmod +x /usr/bin/rrsync

Scaling out the Ceph community lab

Ceph integration tests are vital and expensive. Contrary to unit tests that can be run on a laptop, they require multiple machines to deploy an actual Ceph cluster. As the community of Ceph developers expands, the community lab needs to expand.

The current development workflow and its challenges

When a developer contributes to Ceph, it goes like this:

  • The Developer submits a pull request
  • After the Reviewer is satisfied with the pull request, it is scheduled for integration testing (by adding the needs-qa label)
  • A Tester merges the pull request in an integration branch, together with other pull requests labelled needs-qa, and sets a label to signal that (s)he did so (for instance if Kefu Chai did it, he would set the wip-kefu-testing label)
  • The Tester waits for the packages to be built for the integration branch
  • The Tester schedules a suite of integration tests in the community lab
  • When the suite finishes, the Tester analyzes the integration test results and finds the pull request responsible for each failure (which can be challenging when there are more than a handful of pull requests in the integration branch)
  • For each failure the Tester adds a comment to the faulty pull request with a link to the integration test logs, kindly asking the developer to address the issue
  • When the integration tests are clean, the Tester merges the pull requests

As the number of contributors to Ceph increases, running the integration tests and analyzing their results becomes the bottleneck, because:

  • getting the integration tests results usually takes a few days
  • only people with access to the community lab can run integration tests
  • analyzing test results is time consuming

Increasing the number of machines in the community lab would make integration tests run faster. But acquiring hardware, hosting it and monitoring it not only takes months, it also requires significant system administration work. The community of Ceph developers is growing faster than the community lab can. And to make things even more complicated, as Ceph evolves the number of integration tests increases and requires even more resources.

When a developer frequently contributes to Ceph, (s)he is granted access to the VPN that allows her/him to schedule integration tests. For instance Abhishek Lekshmanan and Nathan Cutler who routinely run and analyze integration tests for backports now have access to the community lab and can do that on their own. But the process to get access to the VPN takes weeks and the learning curve to use it properly is significant.

Although it is mostly invisible to the community lab user, the system administration workload to keep it running is significant. Dan Mick, Zack Cerza and others fix problems on a daily basis. As the size of the community lab grows, this workload increases and requires skills that are difficult to acquire.

Simplifying the workflow with public OpenStack clouds

As of July 2015, it became possible to run integration tests on public OpenStack clouds. More importantly, it takes less than one hour for a new developer to register and schedule an integration test. This new facility can be leveraged to simplify the workflow as follows:

  • The Developer submits a pull request
  • The Developer is required to attach a successful run of integration tests demonstrating the feature or the bug fix
  • After the Reviewer is satisfied with the pull request, it is merged.

There is no need for a Tester because the Developer now has the ability to run integration tests and interpret the results.

The interpretation of the test results is simpler because there is only one pull request for a run. The Developer can compare her/his run to a recent run from the community lab to verify the unmodified code passes. (S)He also can debug a failed test in interactive mode.

Contrary to the community lab, the test cluster has a short life span and requires no system administration skills. It is created in the cloud, on demand, and can be destroyed as soon as the results have been analyzed.

The learning curve to schedule and interpret integration tests is reduced. The Developer needs to know about the teuthology-openstack command and how to interpret a test failure. But (s)he does not need the other teuthology-* commands nor does (s)he have to get access to the VPN of the community lab.