HOWTO run a Ceph QA suite

Ceph has extensive QA suites made of individual teuthology tasks ( see btrfs.yaml for instance ). The schedule_suite.sh helper script can be used as follows to run the entire rados suite:

./schedule_suite.sh rados wip-5510 testing \
   loic@dachary.org basic master plana

Where wip-5510 is the branch to be tested, as compiled by gitbuilder ( which happens automatically after each commit ), testing is the kernel branch to run on the test machines ( only relevant when using krbd ), loic@dachary.org is the address notified when the run completes, basic is the compilation flavor of the wip-5510 branch ( could also be notcmalloc or gcov ), and master is the teuthology branch to use. The last argument, plana, specifies the type of machines to be used, which is only relevant when using the Inktank lab.
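
In general the arguments follow the pattern below ( a sketch inferred from the example and the explanation above; the bracketed names are descriptive placeholders, not options of the script ):

./schedule_suite.sh <suite> <ceph-branch> <kernel-branch> \
   <email> <flavor> <teuthology-branch> <machine-type>
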
The jobs are queued on the queue_host defined in ~/.teuthology.yaml:

queue_host: teuthology.dachary.org
queue_port: 11300

A suite run schedules a number of jobs ( ~300 for rados ) and a mail is sent when they all complete:

name loic-2013-08-14_22:00:20-rados-wip-5510-testing-basic-plana
INFO:teuthology.suite:Collection basic in /home/loic/src/ceph-qa-suite/suites/rados/basic
INFO:teuthology.suite:Running teuthology-schedule with facets collection:basic clusters:fixed-2.yaml fs:btrfs.yaml msgr-failures:few.yaml tasks:rados_api_tests.yaml
Job scheduled with ID 106579
INFO:teuthology.suite:Running teuthology-schedule with facets collection:basic clusters:fixed-2.yaml fs:btrfs.yaml msgr-failures:few.yaml tasks:rados_cls_all.yaml
Job scheduled with ID 106580
....
INFO:teuthology.suite:Running teuthology-schedule with facets collection:verify 1thrash:none.yaml clusters:fixed-2.yaml fs:btrfs.yaml msgr-failures:few.yaml tasks:rados_cls_all.yaml validater:valgrind.yaml
Job scheduled with ID 106880
Job scheduled with ID 106881


To watch the progress of a suite on the queue host:

$ cd loic-2013-08-14_22:00:20-rados-wip-5510-testing-basic-plana
$ watch-suite.sh
Every 2.0s: pwd ; echo `teuthology-ls --archive-dir . | grep -c pass` passes ; teuthology-ls --archive-dir . | grep -v pass                             Tue Aug 20 08:20:11 2013

/a/loic-2013-08-20_15:26:33-rados-wip-5510-testing-basic-plana
14 passes
3241 FAIL scheduled_loic@fold collection:basic clusters:fixed-2.yaml fs:btrfs.yaml msgr-failures:few.yaml tasks:rados_python.yaml 261s
3242       (pid 22204) 2013-08-20T08:20:09.606 INFO:teuthology.task.internal:waiting for more machines to be free...
3244       (pid 22369) 2013-08-20T08:20:12.148 INFO:teuthology.task.workunit.client.0.out:[]: op 6822 completed, throughput=5MB/sec
3245       (pid 22442) 2013-08-20T08:18:10.337 INFO:teuthology.task.workunit.client.0.err:[]: 2013-08-20 08:18:12.176100 7fbdf03b4700  0 -- XX:0/1023213 >
> XX:6808/15301 pipe(0x7fbde001a6b0 sd=11 :39359 s=2 pgs=51 cs=1 l=1 c=0x7fbde003e290).injecting socket failure
3246       (pid 22573) 2013-08-20T08:20:11.472 INFO:teuthology.task.internal:waiting for more machines to be free...
3247       (pid 22679) 2013-08-20T08:20:03.051 INFO:teuthology.task.internal:waiting for more machines to be free...
....
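
watch-suite.sh is a small convenience wrapper; judging from the watch header shown above, it is roughly equivalent to this sketch:

#!/bin/bash
# Refresh every two seconds ( the watch default ): print the archive
# directory, the number of jobs that passed, and the jobs that did not pass.
watch 'pwd ; echo `teuthology-ls --archive-dir . | grep -c pass` passes ; teuthology-ls --archive-dir . | grep -v pass'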

If a job fails, its configuration file can be replayed individually, without rerunning the whole test suite. For instance, on the queue host, a directory named after the suite run ( see above ) contains the leftovers of the failed jobs:

ubuntu@teuthology$ cd /var/lib/teuthworker/archive/
ubuntu@teuthology$ cd loic-2013-08-14_22:00:20-rados-wip-5510-testing-basic-plana/3241
ubuntu@teuthology$ find
./teuthology.log
./pid
./owner
./orig.config.yaml
./config.yaml
./cephtest
./cephtest/106675
./cephtest/106675/adjust-ulimits
./cephtest/106675/archive
./cephtest/106675/archive/syslog
./cephtest/106675/archive/syslog/misc.log
./cephtest/106675/archive/syslog/kern.log
./cephtest/106675/archive/coredump
./cephtest/106675/kcon_most
./cephtest/106675/chdir-coredump
./cephtest/106675/daemon-helper
./cephtest/106675/valgrind.supp
./cephtest/106675/data
./cephtest.tar.gz
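
Before rescheduling anything, the reason for the failure can usually be found near the end of ./teuthology.log. For instance ( a quick sketch; the grep patterns are only a starting point ):

$ tail -50 teuthology.log
$ grep -n -e Traceback -e FAILED teuthology.log | tail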

The ./orig.config.yaml file can be copied to wip-5510-fail.yaml and used to rerun the job that triggered the core, provided targets.yaml lists the right number of target machines:

./virtualenv/bin/teuthology -v --archive /tmp/t/wip-5510-fail \
  --owner loic@dachary.org ~/private/ceph/targets.yaml \
  ~/private/ceph/wip-5510-fail.yaml
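
The ~/private/ceph/targets.yaml file lists the machines the job will run on. A minimal sketch of its format ( the hostnames and ssh host keys below are made-up placeholders; the real file must contain as many targets as the job requires ):

targets:
  ubuntu@target000.example.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQAB...
  ubuntu@target001.example.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQAB...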

When a core dump is created ( cephtest/lo1308151045/archive/coredump/1376556544.4478.core for instance ), it can be used to display the stack trace as follows:

$ gdb /usr/bin/ceph-osd cephtest/lo1308151045/archive/coredump/1376556544.4478.core
Core was generated by `ceph-osd -f -i 0'.
Program terminated with signal 6, Aborted.
(gdb) bt
#0  0x00007f3ea7625b7b in raise (sig=) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:42
#1  0x000000000080686e in reraise_fatal (signum=6) at global/signal_handler.cc:59
#2  handle_fatal_signal (signum=6) at global/signal_handler.cc:105
#3  <signal handler called>
#4  0x00007f3ea56f3425 in __GI_raise (sig=) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#5  0x00007f3ea56f6b8b in __GI_abort () at abort.c:91
#6  0x00007f3ea604569d in __gnu_cxx::__verbose_terminate_handler() ()
   from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x00007f3ea6043846 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#8  0x00007f3ea6043873 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#9  0x00007f3ea604396e in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#10 0x00000000008cd21f in ceph::__ceph_assert_fail (assertion=0xa53e38 "_lock.is_locked()",
    file=, line=269, func=0x9ea6e0 "void PG::kick()") at common/assert.cc:77
#11 0x00000000005f2f07 in kick (this=0x27cd000) at osd/PG.h:269
#12 ReplicatedPG::object_context_destructor_callback (this=0x27cd000, obc=)
    at osd/ReplicatedPG.cc:4677
#13 0x0000000000633769 in Context::complete (this=0x2828820, r=) at ./include/Context.h:44
#14 0x0000000000660bff in ~ObjectContext (this=0x27efa00, __in_chrg=) at osd/osd_types.h:2028
#15 SharedPtrRegistry::OnRemoval::operator() (this=0x28e7d38, to_remove=0x27efa00)
    at ./common/sharedptr_registry.hpp:50
#16 0x000000000063a069 in _M_release (this=0x28e7d20) at /usr/include/c++/4.6/tr1/shared_ptr.h:147
#17 std::tr1::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=,
    __in_chrg=) at /usr/include/c++/4.6/tr1/shared_ptr.h:348
#18 0x000000000063a548 in ~__shared_ptr (this=0x2893870, __in_chrg=)
    at /usr/include/c++/4.6/tr1/shared_ptr.h:548
#19 ~shared_ptr (this=0x2893870, __in_chrg=) at /usr/include/c++/4.6/tr1/shared_ptr.h:992
#20 ~C_OSD_AppliedRecoveredObject (this=0x2893860, __in_chrg=) at osd/ReplicatedPG.h:835
#21 ReplicatedPG::C_OSD_AppliedRecoveredObject::~C_OSD_AppliedRecoveredObject (this=0x2893860,
    __in_chrg=) at osd/ReplicatedPG.h:835
#22 0x00000000006b859d in finish_contexts (cct=0x0, finished=..., result=0) at ./include/Context.h:100
#23 0x0000000000633769 in Context::complete (this=0x274a960, r=) at ./include/Context.h:44
#24 0x000000000081a710 in Finisher::finisher_thread_entry (this=0x26e2c00) at common/Finisher.cc:56
#25 0x00007f3ea761de9a in start_thread (arg=0x7f3e9cc92700) at pthread_create.c:308
#26 0x00007f3ea57b0ccd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#27 0x0000000000000000 in ?? ()
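
When it is not obvious which binary produced a core, file(1) reports, among other things, the command line that generated it ( `ceph-osd -f -i 0' in this case, matching the gdb banner above ):

$ file cephtest/lo1308151045/archive/coredump/1376556544.4478.core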

After the problem has been fixed, the commit rebased, and a new build of wip-5510 completed, wip-5510-fail.yaml must be updated because it references the SHA1 of the commit from the previous run. For instance, 2bc79c2ee4ef525acf7a416563ddc90ab3f671e4 has a matching set of packages at http://gitbuilder.ceph.com/ceph-deb-precise-x86_64-basic/sha1/2bc79c2ee4ef525acf7a416563ddc90ab3f671e4/. For a rebased commit with SHA1 XXXX, all occurrences of 2bc79c2ee4ef525acf7a416563ddc90ab3f671e4 in wip-5510-fail.yaml must be replaced with XXXX.
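
A one-liner is enough to perform the substitution ( substitute the actual SHA1 of the rebased commit for XXXX ):

sed -i 's/2bc79c2ee4ef525acf7a416563ddc90ab3f671e4/XXXX/g' \
  ~/private/ceph/wip-5510-fail.yaml
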
To properly diagnose the error, it is useful to keep the Ceph cluster running instead of letting teuthology destroy it, which is the default behavior. Adding

interactive-on-error: true

to the .yaml file will skip the destruction phase.
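
For instance, the option can simply be appended to the job configuration used above:

echo 'interactive-on-error: true' >> ~/private/ceph/wip-5510-fail.yaml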

4 Replies to “HOWTO run a Ceph QA suite”

  1. Hi,
    I am trying to run schedule_suite.sh on our custom Ceph build for leveraging Inktank suites in our testing. Can someone help me in using this shell script, where I can provide my own targets instead of the script picking them from the Ceph lab? Also kindly let me know if anyone has set up a lock server for this script to run. If yes, please share the details on how to set up the lock server.

    Thanks and Regards,
    Rajesh Raman

    1. I don’t know how to create a lock server. This run was done in the Inktank lab, using the internal lock server. You can ask Zack on irc.oftc.net#ceph-devel (nick zackc) for more information.

  2. Hi Loic,

    Thanks for this article. I am trying to set up teuthology outside the Inktank lab, in our company’s lab, and have a couple of questions:
    1) Could queue_host be any machine in my network? Or does it have to be a specific server with some required packages already installed?
    2) I don’t see the schedule_suite.sh script in the teuthology repo anymore. Has it been deprecated?
    3) Is it true that teuthology only attempts to report results to the
    results_server if a job was scheduled via teuthology-schedule? Can
    we not report results to paddles/pulpito if we are running simple
    test runs like “teuthology testdata.yaml”?

    Regards,
    Kapil.

  3. 1) queue_host : I’ve never configured a queue_host.
    2) schedule_suite.sh is now teuthology-suite
    3) Is it true that teuthology only attempts to report results to the results_server if a job was scheduled via teuthology-schedule : I don’t know.
