teuthology forensics with git, shell and paddles

When a teuthology integration test for Ceph fails, the results are analyzed to find the source of the problem. For instance, the "upgrade suite: pool_create failed with error -4 EINTR" issue was reported early October 2015, with multiple integration job failures.
The first step was to look into the teuthology log, which revealed that pools could not be created:

failed: error rados_pool_create(test-rados-api-vpm049-15238-1) \
  failed with error -4"

The 4 stands for EINTR. The paddles database is used by teuthology to store test results and can be queried via HTTP. For instance:

curl --silent http://paddles.front.sepia.ceph.com/runs/ |
  jq '.[] |
      select(.name | contains("upgrade:firefly-hammer-x")) |
      select(.branch == "infernalis") |
      select(.status | contains("finished"))
      | .name' | \
  while read run ; do eval run=$run ; \
    curl --silent http://paddles.front.sepia.ceph.com/runs/$run/jobs/ | \
      jq '.[] | "http://paddles.front.sepia.ceph.com/runs/\(.name)/jobs/\(.job_id)/"' ; \
  done | \
  while read url ; do eval url=$url ; \
    curl --silent $url | \
      jq 'if((.description != null) and
             (.description | contains("parallel")) and
             (.success == true)) then "'$url'" else null end' ; \
  done | grep -v null

shows which successful jobs from the upgrade:firefly-hammer-x suite runs against the infernalis branch (selected by the first jq expression) involved a parallel test (parallel is the name of a subdirectory of the suite). This was not sufficient to figure out the root cause of the problem because:

  • it only provides access to the last 100 runs
  • it does not allow grepping the teuthology log file for a string

With the teuthology logs in the /a directory (it’s actually a 100TB CephFS mount half full), the following shell snippet can be used to find the upgrade tests that failed with the error -4 message in the logs.

for run in *2015-{07,08,09,10}*upgrade* ; do for job in $run/* ; do \
  test -d $job || continue ; \
  config=$job/config.yaml ;   test -f $config || continue ; \
  summary=$job/summary.yaml ; test -f $summary || continue ; \
  if shyaml get-value branch < $config | grep -q hammer && \
     shyaml get-value success < $summary | grep -qi false && \
     grep -q 'error -4' $job/teuthology.log  ; then
       echo $job ;
   fi ; \
done ; done

It looks for all upgrade runs back to July 2015. shyaml is used to query the branch from the job configuration and keep only the jobs targeting hammer. If the job failed (according to the success value found in the summary file), the error is looked up in the teuthology.log file. The first failed job is found early September:

teuthology-2015-09-11_17:18:07-upgrade:firefly-x-hammer-distro-basic-vps/1051109

It happened on a regular basis after that date but was only reported early October. The commits merged in the hammer branch around that time are displayed with:

git log --merges --since 2015-09-01 --until 2015-09-11 --format='%H' ceph/hammer | \
while read sha1 ; do \
  echo ; git log --format='** %aD "%s":https://github.com/ceph/ceph/commit/%H' ${sha1}^1..${sha1} ; \
done | perl -p -e 'print "* \"PR $1\":https://github.com/ceph/ceph/pull/$1\n" if(/Merge pull request #(\d+)/)'

The output can be copy-pasted into a Redmine issue. It turns out that a pull request merged September 6th was responsible for the failure.

On demand Ceph packages for teuthology

When a teuthology job installs Ceph, it uses packages created by gitbuilder. These packages are built every time a branch is pushed to the official repository.

Contributors who do not have write access to the official repository can either ask a developer with access to push a branch for them or set up a gitbuilder repository using autobuild-ceph. Asking a developer is inconvenient because it takes time and also because it creates packages for every supported operating system, even when only one of them would be enough. In addition there often is a long wait queue because the gitbuilder of the sepia lab is very busy. Setting up a gitbuilder repository reduces the wait time but it has proven to be too time- and resource-consuming for most contributors.

The buildpackages task can be used to resolve that problem and create the packages required for a particular job on demand. When added to a job that has an install task, it will:

  • always run before the install task regardless of its position in the list of tasks (see the buildpackages_prep function in the teuthology internal tasks for more information).
  • create an http server, unless it already exists
  • set gitbuilder_host in ~/.teuthology.yaml to the http server
  • find the SHA1 of the commit that the install task needs
  • check out the ceph repository at that SHA1 and build the packages on a dedicated server
  • upload the packages to the http server, using directory names that mimic the conventions of the lab gitbuilder, and destroy the server used to build them

When the install task looks for packages, it uses the http server populated by the buildpackages task. The teuthology cluster keeps track of which packages were built for which architecture (via makefile timestamp files). When another job needs the same packages, the buildpackages task will notice they already have been built and uploaded to the http server and do nothing.
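
For illustration, here is a minimal sketch of what a job description combining the two tasks could look like; only the task names are shown, any buildpackages options are omitted, and the relative order of the two tasks does not actually matter since buildpackages always runs before install anyway:

tasks:
- buildpackages:
- install: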

A test suite verifies the buildpackages task works as expected and can be run with:

teuthology-openstack --verbose \
   --key-name myself --key-filename ~/Downloads/myself \
   --ceph-git-url http://workbench.dachary.org/ceph/ceph.git \
   --ceph hammer --suite teuthology/buildpackages

The --ceph-git-url option specifies the repository from which the branch given with --ceph is cloned. It defaults to http://github.com/ceph/ceph, which means the branch must first be pushed to the official Ceph repository and therefore requires write access to it.
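
For instance, a contributor could push a wip- branch to a personal clone of the Ceph repository and schedule a suite against it; the repository URL below is the one from the example above and the branch name is only a placeholder:

teuthology-openstack --verbose \
   --key-name myself --key-filename ~/Downloads/myself \
   --ceph-git-url http://workbench.dachary.org/ceph/ceph.git \
   --ceph wip-my-fix --suite rados/singleton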

faster debugging of a teuthology workunit

The Ceph integration tests run via teuthology rely on workunits found in the Ceph repository. For instance:

  • the cephtool/test.sh workunit is modified
  • it is pushed to a wip- branch in the official Ceph git repository
  • the gitbuilder will automatically build packages for all supported distributions for this wip- branch
  • the rados/singleton/all/cephtool suite can be run with teuthology-suite --suite rados/singleton
  • the workunit task fetches the workunits directory from the Ceph git repository and runs it

There is no need for Ceph to be packaged each time the workunit script is modified. Instead it can be fetched from a pull request:

  • the cephtool/test.sh workunit is modified
  • the pull request number 2043 is created or updated with the modified workunit
  • the workunit.yaml file is created with
    overrides:
      workunit:
          branch: refs/pull/2043/head
    
  • the rados/singleton/all/cephtool suite can be run with teuthology-suite --suite rados/singleton $(pwd)/workunit.yaml
  • the workunit task fetches the workunits directory from the branch refs/pull/2043/head of the Ceph git repository and runs it

For each pull request, github implicitly creates a reference in the target git repository. This reference is mirrored to git.ceph.com where the workunit task can extract it. The teuthology-suite command accepts yaml files as arguments, and they are assumed to be relative to the root of a clone of the ceph-qa-suite repository. By providing an absolute path ($(pwd)/workunit.yaml) the file is read from the current directory instead and there is no need to commit it to the ceph-qa-suite repository.
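
As an illustration, the reference github creates for the pull request used in the example above can be fetched directly from the repository, without pushing anything:

git fetch https://github.com/ceph/ceph refs/pull/2043/head
git log --oneline -1 FETCH_HEAD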

Scaling out the Ceph community lab

Ceph integration tests are vital and expensive. Contrary to unit tests that can be run on a laptop, they require multiple machines to deploy an actual Ceph cluster. As the community of Ceph developers expands, the community lab needs to expand.

The current development workflow and its challenges

When a developer contributes to Ceph, it goes like this:

  • The Developer submits a pull request
  • After the Reviewer is satisfied with the pull request, it is scheduled for integration testing (by adding the needs-qa label)
  • A Tester merges the pull request into an integration branch, together with other pull requests labeled needs-qa, and sets a label indicating (s)he did so (for instance, if Kefu Chai did it, he would set the wip-kefu-testing label)
  • The Tester waits for the packages to be built for the integration branch
  • The Tester schedules a suite of integration tests in the community lab
  • When the suite finishes, the Tester analyzes the integration test results, finds the pull request responsible for a failure (which can be challenging when there are more than a handful of pull requests in the integration branch)
  • For each failure the Tester adds a comment to the faulty pull request with a link to the integration test logs, kindly asking the developer to address the issue
  • When the integration tests are clean, the Tester merges the pull requests

As the number of contributors to Ceph increases, running the integration tests and analyzing their results becomes the bottleneck, because:

  • getting the integration tests results usually takes a few days
  • only people with access to the community lab can run integration tests
  • analyzing test results is time consuming

Increasing the number of machines in the community lab would make integration tests run faster. But acquiring hardware, hosting it and monitoring it not only takes months, it also requires significant system administration work. The community of Ceph developers is growing faster than the community lab. And to make things even more complicated, as Ceph evolves the number of integration tests increases and requires even more resources.

When a developer frequently contributes to Ceph, (s)he is granted access to the VPN that allows her/him to schedule integration tests. For instance, Abhishek Lekshmanan and Nathan Cutler, who routinely run and analyze integration tests for backports, now have access to the community lab and can do that on their own. But the process to get access to the VPN takes weeks and the learning curve to use it properly is significant.

Although it is mostly invisible to the community lab user, the system administration workload to keep it running is significant. Dan Mick, Zack Cerza and others fix problems on a daily basis. As the size of the community lab grows, this workload increases and requires skills that are difficult to acquire.

Simplifying the workflow with public OpenStack clouds

As of July 2015, it became possible to run integration tests on public OpenStack clouds. More importantly, it takes less than one hour for a new developer to register and schedule an integration test. This new facility can be leveraged to simplify the workflow as follows:

  • The Developer submits a pull request
  • The Developer is required to attach a successful run of integration tests demonstrating the feature or the bug fix
  • After the Reviewer is satisfied with the pull request, it is merged.

There is no need for a Tester because the Developer now has the ability to run integration tests and interpret the results.

The interpretation of the test results is simpler because there is only one pull request for a run. The Developer can compare her/his run to a recent run from the community lab to verify the unmodified code passes. (S)He also can debug a failed test in interactive mode.

Contrary to the community lab, the test cluster has a short life span and requires no system administration skills. It is created in the cloud, on demand, and can be destroyed as soon as the results have been analyzed.

The learning curve to schedule and interpret integration tests is reduced. The Developer needs to know about the teuthology-openstack command and how to interpret a test failure. But (s)he does not need the other teuthology-* commands nor does (s)he have to get access to the VPN of the community lab.

Sorting Ceph backport branches

When there are many backports in flight, they are more likely to overlap and conflict with each other. When a conflict can be trivially resolved because it comes from the context of a hunk, it is often enough to swap the order of the two commits to avoid the conflict entirely. For instance, let's say a commit on

void foo() { }
void bar() {}

adds an argument to the foo function:

void foo(int a) { }
void bar() {}

and the second commit adds an argument to the bar function:

void foo(int a) { }
void bar(bool b) {}

If the second commit is backported before the first, it will conflict, because the hunk context of the bar change expects the foo function to already have its new argument, which is not yet the case in the branch being backported to.

When there are dozens of backport branches, they can be sorted so that the first to merge is the one that cherry-picks the oldest ancestor in the master branch. In other words, given the example above, the cherry-pick of the first commit should be merged before the cherry-pick of the second commit because it is older in the commit history.

Sorting the branches also gracefully handles interdependent backports. For instance, let's say the first branch contains a few backported commits and a second branch contains a backported commit that can't be applied unless the first branch is merged. Since each Ceph branch proposed for backports is required to pass make check, the most commonly used strategy is to include all the commits from the first branch in the second branch. This second branch is not intended to be merged and its title is usually prefixed with DNM (Do Not Merge). When the first branch is merged, the second is rebased against the target and the redundant commits disappear from the second branch.
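
The script below refers to pull requests via ceph/pull/$pr/merge references. As a sketch, assuming the ceph remote is not yet configured to fetch the references github publishes for every pull request, it could be set up with:

git remote add ceph https://github.com/ceph/ceph.git
git config --add remote.ceph.fetch '+refs/pull/*/head:refs/remotes/ceph/pull/*/head'
git config --add remote.ceph.fetch '+refs/pull/*/merge:refs/remotes/ceph/pull/*/merge'
git fetch ceph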

Here is a short shell script, in three steps, that implements the sorting:

#
# Make a file with the hash of all commits found in master
# but discard those that already are in the hammer release.
#
git log --no-merges \
  --pretty='%H' ceph/hammer..ceph/master \
  > /tmp/master-commits
#
# Match each pull request with the commit from which it was
# cherry-picked. Just use the first commit: we expect the other to be
# immediate ancestors. If that's not the case we don't know how to
# use that information so we just ignore it.
#
# $PRS is expected to contain the numbers of the pull requests to sort.
for pr in $PRS ; do
  git log -1 --pretty=%b ceph/pull/$pr/merge^1..ceph/pull/$pr/merge^2 | \
   perl -ne 'print "$1 '$pr'\n" if(/cherry picked from commit (\w+)/)'
done > /tmp/pr-and-first-commit
#
# For each pull request, grep the cherry-picked commit and display its
# line number. Sort the result in reverse order to get the pull
# request sorted in the same way the cherry-picked commits are found
# in the master history.
#
SORTED_PRS=$(while read commit pr ; do
  grep --line-number $commit < /tmp/master-commits | \
  sed -e "s/\$/ $pr/" ; done  < /tmp/pr-and-first-commit | \
  sort -rn | \
  perl -p -e 's/.* (.*)\n/$1 /')
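
As an illustration of how the result could be used (a sketch, not the actual backport merging procedure; the integration branch name is a placeholder and the refs/pull/*/head references are assumed to be fetched as in the sketch above), the pull requests could then be merged in that order:

git checkout -b hammer-backports ceph/hammer
for pr in $SORTED_PRS ; do
  git merge --no-ff -m "Merge pull request #$pr" ceph/pull/$pr/head
done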

Ceph integration tests made simple with OpenStack

If an OpenStack tenant (account in the OpenStack parlance) is available, the Ceph integration tests can be run with the teuthology-openstack command, which will create the necessary virtual machines automatically (see the detailed instructions to get started). To do its work, it uses the teuthology OpenStack backend behind the scenes so the user does not need to know about it.
The teuthology-openstack command has the same options as teuthology-suite and can be run as follows:

$ teuthology-openstack \
  --simultaneous-jobs 70 --key-name myself \
  --subset 10/18 --suite rados \
  --suite-branch next --ceph next
...
Scheduling rados/thrash/{0-size-min-size-overrides/...
Suite rados in suites/rados scheduled 248 jobs.

web interface: http://167.114.242.148:8081/
ssh access   : ssh ubuntu@167.114.242.148 # logs in /usr/share/nginx/html

As the suite progresses, its status can be monitored by visiting the web interface, and the Horizon OpenStack dashboard shows the resource usage for the run.



HOWTO setup a postgresql server on Ubuntu 14.04

In the context of teuthology (the integration test framework for Ceph), there needs to be a PostgreSQL server available, locally only, with a single user dedicated to teuthology. It can be set up from a new Ubuntu 14.04 install with:

    sudo apt-get -qq install -y postgresql postgresql-contrib

    if ! sudo /etc/init.d/postgresql status ; then
        sudo mkdir -p /etc/postgresql
        sudo chown postgres /etc/postgresql
        sudo -u postgres pg_createcluster 9.3 paddles
        sudo /etc/init.d/postgresql start
    fi
    if ! psql --command 'select 1' \
          'postgresql://paddles:paddles@localhost/paddles' > /dev/null
    then
        sudo -u postgres psql \
            -c "CREATE USER paddles with PASSWORD 'paddles';"
        sudo -u postgres createdb -O paddles paddles
    fi
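
Once this has run, a quick way to verify that the paddles user can connect is:

    psql 'postgresql://paddles:paddles@localhost/paddles' \
        --command 'select version();'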

If anyone knows of a simpler way to do the same thing, I’d be very interested to know about it.

oneliner to deploy teuthology on OpenStack

Note: this is obsoleted by Ceph integration tests made simple with OpenStack

Teuthology can be installed as a dedicated OpenStack instance on OVH using the OpenStack backend with:

nova boot \
   --image 'Ubuntu 14.04' \
   --flavor 'vps-ssd-1' \
   --key-name loic \
   --user-data <(curl --silent \
     https://raw.githubusercontent.com/dachary/teuthology/wip-6502-openstack/openstack-user-data.txt | \
     sed -e "s|OPENRC|$(env | grep OS_ | tr '\n' ' ')|") teuthology

Assuming the IP assigned to the instance is 167.114.235.222, the following will display the progress of the integration tests that are run immediately after the instance is created:

ssh ubuntu@167.114.235.222 tail -n 2000000 -f /tmp/init.out

If all goes well, it will complete with:

...
========================= 8 passed in 1845.59 seconds =============
___________________________________ summary _________________________
  openstack-integration: commands succeeded
  congratulations :)

And the pulpito dashboard at 167.114.235.222:8081 will display the results of the integration tests.

Running your own Ceph integration tests with OpenStack

Note: this is obsoleted by Ceph integration tests made simple with OpenStack

The Ceph lab has hundreds of machines continuously running integration and upgrade tests. For instance, when a pull request modifies the Ceph core, it goes through a run of the rados suite before being merged into master. The Ceph lab has between 100 and 3000 jobs in its queue at all times and it is convenient to be able to run integration tests on an independent infrastructure to:

  • run a failed job and verify a patch fixes it
  • run a full suite prior to submitting a complex modification
  • verify the upgrade path from a given Ceph version to another
  • etc.

If an OpenStack account is not available (a tenant in the OpenStack parlance), it is possible to rent one (it takes a few minutes). For instance, OVH provides a Horizon dashboard showing how many instances are being used to run integration tests.

The OpenStack usage is billed monthly and the accumulated costs are displayed on the customer dashboard.



configuring ansible for teuthology

As of July 8th, 2015, teuthology (the Ceph integration test software) switched from using Chef to using Ansible. To keep it working, two files must be created. The /etc/ansible/hosts/group_vars/all.yml file with:

modify_fstab: false

The modify_fstab setting is necessary for OpenStack provisioned instances but it won't hurt if it is always there (the only drawback being that mount options are not persisted in /etc/fstab, but they are set as they should be). The /etc/ansible/hosts/mylab file must then be populated with:

[testnodes]
ovh224000.teuthology
ovh224001.teuthology
...

where ovh224000.teuthology etc. are the FQDNs of all machines that will be used as teuthology targets. The Ansible playbooks will expect to find all targets under the [testnodes] section. The output of a teuthology job should show that the Ansible playbook is being used, with something like:

...
teuthology.run_tasks:Running task ansible.cephlab...
...
INFO:teuthology.task.ansible.out:PLAY [all] *****
...
TASK: [ansible-managed | Create the sudo group.] ******************************
...
