Benchmarking Ceph erasure code plugins

The erasure code implementation in Ceph relies on the jerasure library. It is packaged into a plugin that is dynamically loaded by erasure coded pools.
The ceph_erasure_code_benchmark tool is implemented to help benchmark the competing erasure code plugin implementations and to find the best parameters for a given plugin. It shows the jerasure technique cauchy_good with a packet size of 3072 to be the most efficient on an Intel(R) Xeon(R) CPU E3-1245 V2 @ 3.40GHz when compiled with gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5). The test was done assuming each object is spread over six OSDs and two extra OSDs are used for parity ( K=6 and M=2 ).

  • Encoding: 4.2GB/s
  • Decoding: no processing necessary (because the code is systematic)
  • Recovering the loss of one OSD: 10GB/s
  • Recovering the loss of two OSDs: 3.2GB/s
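
A benchmark run for this K=6, M=2 configuration could look roughly like the following sketch. The option names are assumptions and may differ between Ceph versions; check ceph_erasure_code_benchmark --help for the exact spelling.

# hypothetical invocation: verify the option names with --help
ceph_erasure_code_benchmark \
  --plugin jerasure \
  --workload encode \
  --iterations 1000 \
  --size 1048576 \
  --parameter k=6 \
  --parameter m=2 \
  --parameter technique=cauchy_good \
  --parameter packetsize=3072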

The processing is done on the primary OSDs and is therefore distributed across the Ceph cluster. Encoding and decoding are an order of magnitude faster than the typical storage hardware throughput.

Profiling CPU usage of a ceph command (callgrind)

After compiling Ceph from sources with:

./configure --with-debug CFLAGS='-g' CXXFLAGS='-g'

The crushtool test mode is used to profile the crush implementation with:

valgrind --tool=callgrind \
         --callgrind-out-file=crush.callgrind \
         src/crushtool \
         -i src/test/cli/crushtool/one-hundered-devices.crushmap \
         --test --show-bad-mappings

The resulting crush.callgrind file can then be analyzed with

kcachegrind crush.callgrind


Any Ceph command can be profiled in this way.
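
For example, here is what a similar run could look like against osdmaptool while it builds a one hundred devices map; the osdmaptool arguments are only an illustration:

valgrind --tool=callgrind \
         --callgrind-out-file=osdmaptool.callgrind \
         src/osdmaptool /tmp/osdmap --createsimple 100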

Profiling CPU usage of a ceph command (gperftools)

After compiling Ceph from sources with:

./configure --with-debug CFLAGS='-g' CXXFLAGS='-g'

The crushtool test mode is used to profile the crush implementation with:

LD_PRELOAD=/usr/lib/libprofiler.so.0 \
CPUPROFILE=crush.prof src/crushtool \
  -i src/test/cli/crushtool/one-hundered-devices.crushmap \
  --test --show-bad-mappings

as instructed in the cpu profiler documentation. The resulting crush.prof file can then be analyzed with

google-pprof --ignore=vector --focus=bucket_choose \
  --gv ./src/crushtool crush.prof

and displays a call graph focused on the bucket_choose function.

Any Ceph command can be profiled in this way.
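
If no graphical environment is available, the same profile can be summarized as plain text instead of being displayed with gv, for instance with:

google-pprof --text ./src/crushtool crush.prof | head -20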

wget on an OpenStack instance hangs? Try lowering the MTU

Why would an OpenStack instance fail to wget a given URL while working perfectly on others? For instance:

$ wget -O - 'https://ceph.com/git/?p=ceph.git;a=blob_plain;f=keys/autobuild.asc'
Connecting to ceph.com (ceph.com)|208.113.241.137|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/plain]
Saving to: `STDOUT'

    [<=>                                                                           ] 0           --.-K/s              ^

If it can be fixed by lowering the MTU from the default of 1500 to 1400 with:

$ sudo ip link set mtu 1400 dev eth0
$ sudo ip link show dev eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc pfifo_fast state UP qlen 1000
    link/ether fa:16:3e:85:ee:a5 brd ff:ff:ff:ff:ff:ff

it means the underlying OpenStack DHCP configuration should be fixed to set the MTU to 1400.
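
One possible way to do that, assuming the DHCP service is provided by the dnsmasq based Neutron DHCP agent, is to advertise the lower MTU through DHCP option 26. The file paths and service name below are assumptions that depend on the deployment:

# In /etc/neutron/dhcp_agent.ini (assumed path), point the agent at an
# extra dnsmasq configuration file:
#   [DEFAULT]
#   dnsmasq_config_file = /etc/neutron/dnsmasq-neutron.conf
#
# In /etc/neutron/dnsmasq-neutron.conf, force DHCP option 26 (interface MTU):
#   dhcp-option-force=26,1400
#
# Restart the DHCP agent so the running dnsmasq processes pick up the option:
sudo service neutron-dhcp-agent restart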

Testing a Ceph crush map

After modifying a crush map it should be tested to check that all rules can provide the specified number of replicas. If a pool is created to use the metadata rule with seven replicas, could it fail to find enough devices? The crushtool test mode can be used to simulate the situation as follows:

$ crushtool -i the-new-crush-map --test --show-bad-mappings
bad mapping rule 1 x 781 num_rep 7 result [8,10,2,11,6,9]

The output shows that for rule 1 ( metadata by default is rule 1 ), an attempt to find seven replicas ( num_rep 7 ) for the object 781 (the hash of its name) failed and only returned six ( [8,10,2,11,6,9] ). It can be resolved by increasing the number of devices, lowering the number of replicas or changing the way replicas are selected.
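
To focus on a single rule and replica count instead of iterating over all of them, the simulation can be narrowed down; the --rule and --num-rep options are assumed to be available in the crushtool version at hand and can be checked with crushtool --help:

crushtool -i the-new-crush-map --test --rule 1 --num-rep 7 --show-bad-mappings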

When all attempts to find the required number of replicas are one device short, it simply means there are not enough devices to satisfy the rule and the only solution is to add at least one. CRUSH may not find a device mapping that satisfies all constraints the first time around and it will need to try again. If it fails more than fifty times it will give up and return fewer devices than required. Lowering the required number of replicas is one way to solve this problem.

Although it is possible to increase the number of times CRUSH will try, this is dangerous on a running cluster because it may modify the mapping for existing objects.
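
For reference, this is roughly how the number of attempts could be raised and re-tested offline. The tunable related options are assumptions that may not exist in all crushtool versions (check crushtool --help), and the warning above about running clusters still applies:

# Raise the total number of placement attempts from the default of 50 to 100;
# depending on the crushtool version, --enable-unsafe-tunables may also be
# required (these option names are assumptions, verify with crushtool --help).
crushtool -i the-new-crush-map --set-choose-total-tries 100 \
  -o the-new-crush-map-more-tries
crushtool -i the-new-crush-map-more-tries --test --show-bad-mappings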