HOWTO extract a stack trace from teuthology (take 1)

When a teuthology test suite run against Ceph fails, the failure shows up in pulpito. For instance, there is one failure in the monthrash test suite, with details and a link to the logs. Removing the teuthology.log part of the link gives a directory listing of all the information archived for this run.
In the example above the logs show:

client.0.plana34.stderr:+ ceph_test_rados_api_io
client.0.plana34.stdout:Running main() from gtest_main.cc
client.0.plana34.stdout:[==========] Running 43 tests from 4 test cases.
client.0.plana34.stdout:[----------] Global test environment set-up.
client.0.plana34.stdout:[----------] 11 tests from LibRadosIo
client.0.plana34.stdout:[ RUN      ] LibRadosIo.SimpleWrite
client.0.plana34.stdout:[       OK ] LibRadosIo.SimpleWrite (1509 ms)
client.0.plana34.stdout:[ RUN      ] LibRadosIo.ReadTimeout
client.0.plana34.stderr:Segmentation fault (core dumped)

This shows that ceph_test_rados_api_io ran on the plana34 machine and dumped core; the remote/plana34/coredump subdirectory contains the corresponding core dump.
The teuthology logs show the repository from which the binary was downloaded (it was produced by gitbuilder):

echo deb http://gitbuilder.ceph.com/ceph-deb-precise-x86_64-basic/sha1/f5c1d3b6988bae5ffb914d2ac0b2858caeffe12c precise main | sudo tee /etc/apt/sources.list.d/ceph.list

Running this line on Ubuntu precise 12.04 64-bit, as suggested by the precise-x86_64 part of the subdirectory name, makes the corresponding binary packages available. It is also possible to download them directly from the pool/main/c/ceph subdirectory. The packages suffixed with -dbg contain the debug symbols that gdb needs to display an informative stack trace.
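For instance, once the repository is configured, the test binary and the matching debug symbols can be installed directly. A minimal sketch, assuming the ceph-test-dbg and librados2-dbg package names seen in the pool listing:

# install the test binary and the detached debug symbols
# (package names are assumptions based on the pool/main/c/ceph listing)
sudo apt-get update
sudo apt-get install ceph-test ceph-test-dbg librados2-dbg
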
Alternatively, the ceph_test_rados_api_io binary, which is part of the ceph-test package, can be extracted without installing the package with

$ dpkg --fsys-tarfile ceph-test_0.85-726-gf5c1d3b-1precise_amd64.deb | \
  tar xOf -  ./usr/bin/ceph_test_rados_api_io \
  > ceph_test_rados_api_io

and the stack trace displayed with

$ gdb /usr/bin/ceph_test_rados_api_io 1411176209.8835.core
(gdb) bt
#0  0x00007f541b95750a in pthread_rwlock_wrlock () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007f541bd41341 in RWLock::get_write(bool) () from /usr/lib/librados.so.2
#2  0x00007f541bd2bbc9 in Objecter::op_cancel(Objecter::OSDSession*, unsigned long, int) () from /usr/lib/librados.so.2
#3  0x00007f541bcf1349 in Context::complete(int) () from /usr/lib/librados.so.2
#4  0x00007f541bdad5ea in RWTimer::timer_thread() () from /usr/lib/librados.so.2
#5  0x00007f541bdb149d in RWTimerThread::entry() () from /usr/lib/librados.so.2
#6  0x00007f541b953e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#7  0x00007f541b16a3fd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#8  0x0000000000000000 in ?? ()
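
To resolve the librados frames with file names and line numbers, gdb also needs the detached debug symbols shipped in the -dbg packages. A minimal sketch, assuming the librados2-dbg file name follows the same pattern as the ceph-test package above and the symbols are extracted in the current directory:

# extract the detached debug symbols (Debian ships them under /usr/lib/debug)
dpkg --fsys-tarfile librados2-dbg_0.85-726-gf5c1d3b-1precise_amd64.deb | \
  tar xf - ./usr/lib/debug
# tell gdb where to find them before loading the core
gdb -ex 'set debug-file-directory ./usr/lib/debug' \
  /usr/bin/ceph_test_rados_api_io 1411176209.8835.core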

Running python rados tests in Ceph

When Ceph is built from sources, make check will not run the test_rados.py tests.
A minimal cluster is required and can be run from the src directory with:

CEPH_NUM_MON=1 CEPH_NUM_OSD=3 ./vstart.sh -d -n -X -l mon osd
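
Before running the tests, the cluster can be checked for HEALTH_OK with the ceph command from the same src directory. A minimal sketch, reusing the environment variables needed by the python bindings:

LD_LIBRARY_PATH=.libs PYTHONPATH=pybind ./ceph -c ceph.conf -s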

The test can then be run with

$ LD_LIBRARY_PATH=.libs PYTHONPATH=pybind nosetests -v \
   test/pybind/test_rados.py

and if only the TestIoctx.test_aio_read test is of interest, its name can be appended to the file name:

$ LD_LIBRARY_PATH=.libs PYTHONPATH=pybind nosetests -v \
   test/pybind/test_rados.py:TestIoctx.test_aio_read
test_rados.TestIoctx.test_aio_read ... ok

-------------------------------
Ran 1 test in 4.227s

OK
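
When the tests are done, the cluster started by vstart.sh can be torn down from the same src directory with its companion script:

./stop.sh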

Ceph placement group memory footprint, in debug mode

A Ceph cluster is run from sources with

CEPH_NUM_MON=1 CEPH_NUM_OSD=5 ./vstart.sh -d -n -X -l mon osd

and each ceph-osd uses approximately 50MB of resident memory

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
loic      7489  1.7  0.2 586080 43676 ?        Ssl  17:55   0:01  ceph-osd
loic      7667  1.6  0.2 586080 43672 ?        Ssl  17:55   0:01  ceph-osd

A pool is created with 10,000 placement groups

$ ceph osd pool create manypg 10000
pool 'manypg' created

the creation completes in a little over half an hour

$ ceph -w
...
2014-09-19 17:57:35.193706 mon.0 [INF] pgmap v40: 10152
   pgs: 10000 creating, 152 active+clean; 0 bytes data, 808 GB used, 102 GB / 911 GB avail
...
2014-09-19 18:35:08.668877 mon.0 [INF] pgmap v583: 10152
   pgs: 46 active, 10106 active+clean; 0 bytes data, 815 GB used, 98440 MB / 911 GB avail
2014-09-19 18:35:13.505841 mon.0 [INF] pgmap v584: 10152
   pgs: 10152 active+clean; 0 bytes data, 815 GB used, 98435 MB / 911 GB avail

Each ceph-osd now uses approximately 150MB of resident memory, which suggests that each additional placement group costs roughly 10KB of resident memory per OSD (a ~100MB increase divided by 10,000 placement groups).

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
loic      7489  0.7  1.0 725952 166144 ?       Ssl  17:55   2:02 ceph-osd
loic      7667  0.7  0.9 720808 160440 ?       Ssl  17:55   2:03 ceph-osd
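
The resident memory of the OSD processes can be sampled at any time with, for instance:

ps -C ceph-osd -o pid,vsz,rss,cmd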

Running node-rados from sources

The nodejs rados module comes with an example that requires a Ceph cluster.
If Ceph was compiled from source, a cluster can be run from the source tree with

rm -fr dev out ;  mkdir -p dev
CEPH_NUM_MON=1 CEPH_NUM_OSD=3 \
 ./vstart.sh -d -n -X -l mon osd

It can be used by changing the /etc/ceph/ceph.conf path in the example to the configuration file generated in the source tree: $CEPHSOURCE/src/ceph.conf. The expected output is

$ node exemple.js
fsid : c041968a-a895-4a5c-a0a7-6621e08a4f07
ls pools : rbd
 --- RUN Sync Write / Read ---
Read data : 01234567ABCDEF
 --- RUN ASync Write / Read ---
 --- RUN Attributes Write / Read ---
testfile3 xattr = {"attr1":"first attr","attr2":"second attr","attr3":"last attr value"}
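
If librados is not installed system wide, the example can also be pointed at the libraries built in the source tree, in the same way as the python tests above. A minimal sketch, assuming $CEPHSOURCE is the Ceph source directory:

LD_LIBRARY_PATH=$CEPHSOURCE/src/.libs node exemple.js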

HOWTO test a Ceph crush rule

The crushtool utility can be used to test Ceph crush rules before applying them to a cluster.

$ crushtool --outfn crushmap --build --num_osds 10 \
   host straw 2 rack straw 2 default straw 0
# id	weight	type name	reweight
-9	10	default default
-6	4		rack rack0
-1	2			host host0
0	1				osd.0	1
1	1				osd.1	1
-2	2			host host1
2	1				osd.2	1
3	1				osd.3	1
-7	4		rack rack1
-3	2			host host2
4	1				osd.4	1
5	1				osd.5	1
-4	2			host host3
6	1				osd.6	1
7	1				osd.7	1
-8	2		rack rack2
-5	2			host host4
8	1				osd.8	1
9	1				osd.9	1

This creates a crushmap from scratch (--build). It assumes there is a total of 10 OSDs available (--num_osds 10). It then places two OSDs in each host (host straw 2). The resulting hosts (five of them) are then placed in racks, at most two per rack (rack straw 2). All racks are placed in the default root (that is what the zero stands for: all of them) (default straw 0). The last rack only has one host because there is an odd number of hosts available.
The crush rule to be tested can be injected in the crushmap with

crushtool --outfn crushmap --build --num_osds 10 host straw 2 rack straw 2 default straw 0
crushtool -d crushmap -o crushmap.txt
cat >> crushmap.txt <<EOF
rule myrule {
	ruleset 1
	type replicated
	min_size 1
	max_size 10
	step take default
	step choose firstn 2 type rack
	step chooseleaf firstn 2 type host
	step emit
}
EOF
crushtool -c crushmap.txt -o crushmap

This crushmap should be able to provide two OSDs (for placement groups, for instance) and it can be verified with the --test option.

$ crushtool -i crushmap --test --show-statistics --rule 1 --min-x 1 --max-x 2 --num-rep 2
rule 1 (myrule), x = 1..2, numrep = 2..2
CRUSH rule 1 x 1 [0,2]
CRUSH rule 1 x 2 [7,4]
rule 1 (myrule) num_rep 2 result size == 2:	2/2

The --rule 1 option designates the rule that was injected; --rule 0 is the default rule that is created automatically. The x can be thought of as the unique name of the placement group for which OSDs are requested. --min-x 1 --max-x 2 varies the value of x from 1 to 2, therefore trying the rule only twice; --min-x 1 --max-x 2048 would create 2048 lines. Each line shows the value of x after the rule number: in rule 1 x 2, the 1 is the rule number and the 2 is the value of x. The last line shows that for all values of x (2/2, i.e. 2 values of x out of 2), when asked to provide 2 OSDs (num_rep 2) the crush rule was able to provide 2 (result size == 2).

If asked for 4 OSDs, the same crush rule may fail because it has barely enough resources to satisfy the requirements.

$ crushtool -i crushmap --test --show-statistics --rule 1 --min-x 1 --max-x 2 --num-rep 4
rule 1 (myrule), x = 1..2, numrep = 4..4
CRUSH rule 1 x 1 [0,2,9]
CRUSH rule 1 x 2 [7,4,1,3]
rule 1 (myrule) num_rep 4 result size == 3:	1/2
rule 1 (myrule) num_rep 4 result size == 4:	1/2

The statistics at the end show that one of the two mappings failed: the result size == 3 is lower than the required num_rep 4. If asked for more OSDs than it can provide, the rule will always fail.

crushtool -i crushmap --test --show-statistics --rule 1 --min-x 1 --max-x 2 --num-rep 5
rule 1 (myrule), x = 1..2, numrep = 5..5
CRUSH rule 1 x 1 [0,2,9]
CRUSH rule 1 x 2 [7,4,1,3]
rule 1 (myrule) num_rep 5 result size == 3:	1/2
rule 1 (myrule) num_rep 5 result size == 4:	1/2
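
Once the rule behaves as expected, the crushmap can be injected into a running cluster and the rule assigned to a pool. A minimal sketch, using the default rbd pool as an example:

# upload the compiled crushmap to the cluster
ceph osd setcrushmap -i crushmap
# have the rbd pool use ruleset 1 (myrule)
ceph osd pool set rbd crush_ruleset 1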

More examples of crushtool usage can be found in the crushtool directory of the Ceph sources.

HOWTO test teuthology tasks

The Ceph integration tests run by teuthology are described with YAML files in the ceph-qa-suite repository. The actual work is carried out on machines provisioned by teuthology via tasks. For instance, the workunit task runs a script found in the qa/workunits directory of the Ceph repository.
The workunit.py script, although small, is complex enough to deserve testing. Creating unit tests would require a lot of mocking and would not catch a typo in a shell command meant to run on an actual machine. Another approach is to create lightweight integration tests within the ceph-qa-suite repository itself. For instance, tests/workunit is designed to maximize coverage of the workunit.py script and to run as quickly as possible.

Tell teuthology to use a local ceph-qa-suite directory

By default teuthology clones the ceph-qa-suite repository and uses the tasks it contains. If tasks have been modified locally, teuthology can be instructed to use a local directory by inserting something like:

suite_path: /home/loic/software/ceph/ceph-qa-suite

in the teuthology job yaml file. The directory must then be added to the PYTHONPATH:

PYTHONPATH=/home/loic/software/ceph/ceph-qa-suite \
   ./virtualenv/bin/teuthology  --owner loic@dachary.org \
   /tmp/work.yaml targets.yaml

Temporarily disable Ceph scrubbing to resolve high IO load

In a Ceph cluster with low bandwidth, the root disk of an OpenStack instance became extremely slow for days.

When an OSD scrubs a placement group, it has a significant impact on performance, and this is expected for a short while. In this case, however, it slowed down to the point where the OSD was marked down because it did not reply in time:

2014-07-30 06:43:27.331776 7fcd69ccc700  1
   mon.bm0015@0(leader).osd e287968
   we have enough reports/reporters to mark osd.12 down

To get out of this situation, both scrub and deep scrub were deactivated with:

root@bm0015:~# ceph osd set noscrub
set noscrub
root@bm0015:~# ceph osd set nodeep-scrub
set nodeep-scrub
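
When the IO load returns to normal, scrubbing can be re-enabled by clearing the same flags:

ceph osd unset noscrub
ceph osd unset nodeep-scrub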

After a day, as the IO load remained stable, confirming that no other factor was causing it, scrubbing was re-activated. The context causing the excessive IO load had changed and the problem did not repeat itself during the following 24 hours, although the logs on the same machine confirmed that scrubbing had resumed:

2014-07-31 15:29:54.783491 7ffa77d68700  0 log [INF] : 7.19 deep-scrub ok
2014-07-31 15:29:57.935632 7ffa77d68700  0 log [INF] : 3.5f deep-scrub ok
2014-07-31 15:37:23.553460 7ffa77d68700  0 log [INF] : 7.1c deep-scrub ok
2014-07-31 15:37:39.344618 7ffa77d68700  0 log [INF] : 3.22 deep-scrub ok
2014-08-01 03:25:05.247201 7ffa77d68700  0 log [INF] : 3.46 deep-scrub ok

Ceph disaster recovery scenario

A datacenter containing three hosts of a non-profit Ceph and OpenStack cluster suddenly lost connectivity, and it could not be restored within 24 hours. The corresponding OSDs were marked out manually. The Ceph pool dedicated to this datacenter became unavailable, as expected. However, a pool that was supposed to keep at most one copy per datacenter turned out to have a faulty crush ruleset. As a result, some placement groups in this pool were stuck.

$ ceph -s
...
health HEALTH_WARN 1 pgs degraded; 7 pgs down;
   7 pgs peering; 7 pgs recovering;
   7 pgs stuck inactive; 15 pgs stuck unclean;
   recovery 184/1141208 degraded (0.016%)
...
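
The stuck placement groups can be listed with, for instance:

ceph pg dump_stuck inactive
ceph pg dump_stuck unclean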
