SSE optimization for erasure code in Ceph

The jerasure library is Ceph's default erasure code plugin. Its companion library, gf-complete, supports SSE optimizations that are enabled at compile time, when the compiler provides the corresponding flags (-msse4.2 etc.). The jerasure plugin (and gf-complete with it) is compiled multiple times with various levels of SSE features:

  • jerasure_sse4 uses SSE4.2, SSE4.1, SSSE3, SSE3, SSE2, SSE
  • jerasure_sse3 uses SSSE3, SSE3, SSE2, SSE
  • jerasure_generic uses no SSE instructions

When an OSD loads the jerasure plugin, the CPU features are probed and the plugin matching the available features is selected.
The gf-complete source code is cleanly divided into functions that take advantage of specific SSE features. It should be easy to use the ifunc attribute to semi-manually select each function individually, at runtime and without performance penalty (the choice is made the first time the function is called and recorded for later calls). With such fine-grained selection there would be no need to compile three plugins, because each function would be compiled with exactly the set of flags it needs.
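As an illustration, here is a minimal sketch of what ifunc-based selection could look like, assuming GCC on an ELF target. The function names are hypothetical and the bodies are plain C to keep the sketch self-contained; they are not the actual gf-complete symbols:

#include <stddef.h>

/* Two variants of the same XOR kernel: in gf-complete the per-feature
   variants contain real SIMD code, here both bodies are plain C. */
static void region_xor_generic(unsigned char *dst, const unsigned char *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] ^= src[i];
}

static void region_xor_sse4(unsigned char *dst, const unsigned char *src, size_t n)
{
    /* would be compiled with -msse4.2 and use SSE intrinsics */
    for (size_t i = 0; i < n; i++)
        dst[i] ^= src[i];
}

/* The resolver runs once, when the symbol is first resolved; its result
   is cached so later calls jump directly to the chosen variant. */
static void (*resolve_region_xor(void))(unsigned char *, const unsigned char *, size_t)
{
    __builtin_cpu_init();
    if (__builtin_cpu_supports("sse4.2"))
        return region_xor_sse4;
    return region_xor_generic;
}

/* Callers simply use region_xor: the dynamic linker routes the call. */
void region_xor(unsigned char *dst, const unsigned char *src, size_t n)
    __attribute__((ifunc("resolve_region_xor")));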

Your first exabyte in a Ceph cluster

$ rbd create --size $((1024 * 1024 * 1024 * 1024)) tiny
$ rbd info tiny
rbd image 'tiny':
	size 1024 PB in 274877906944 objects
	order 22 (4096 kB objects)
	block_name_prefix: rb.0.1009.6b8b4567
	format: 1
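
The --size argument is expressed in megabytes, so the reported numbers check out:

2^40 MB = 2^60 bytes = 1024 PB
2^60 bytes / 2^22 bytes per object (order 22) = 2^38 = 274877906944 objects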

Note: rbd rm tiny will take a long time.

The footprints of 192 Ceph developers

Gource is run on the Ceph git repository for each of the 192 developers who contributed to its development over the past six years. Their footprint is the last image of a video clip created from all the commits they authored. Ten of them follow, each with a link to the corresponding video clip:

  • Sage Weil (video clip)
  • Yehuda Sadeh (video clip)
  • Greg Farnum (video clip)
  • Samuel Just (video clip)
  • Colin P. McCabe (video clip)
  • Danny Al-Gaaf (video clip)
  • Josh Durgin (video clip)
  • John Wilkins (video clip)
  • Loic Dachary (video clip)
  • Dan Mick (video clip)


Benchmarking Ceph jerasure version 2 plugin

The Ceph erasure code plugin benchmark results for jerasure version 1 are compared with the results after an upgrade to jerasure version 2, using the same command on the same hardware:

  • Encoding: 5.2GB/s which is ~20% better than 4.2GB/s
  • Decoding: no processing necessary (because the code is systematic)
  • Recovering the loss of one OSD: 11.3GB/s which is ~13% better than 10GB/s
  • Recovering the loss of two OSDs: 4.42GB/s which is ~35% better than 3.2GB/s

The relevant lines from the full output of the benchmark are:

seconds         KB      plugin          k m work.   iter.   size    eras.
0.088136        1048576 jerasure        6 2 decode  1024    1048576 1
0.226118        1048576 jerasure        6 2 decode  1024    1048576 2
0.191825        1048576 jerasure        6 2 encode  1024    1048576 0
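
Each line reports 1048576 KB processed, i.e. one gigabyte, so the throughput figures above follow from dividing one gigabyte by the seconds column:

1GB / 0.191825s ≈ 5.2GB/s   (encode)
1GB / 0.088136s ≈ 11.3GB/s  (decode with one erasure, i.e. one lost OSD)
1GB / 0.226118s ≈ 4.42GB/s  (decode with two erasures, i.e. two lost OSDs)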

The improvements are likely to be greater for larger K+M values.

Ceph erasure code : ready for alpha testing

The addition of erasure code in Ceph started in April 2013 and was discussed during the first Ceph Developer Summit. The implementation reached an important milestone a few days ago: it is now ready for alpha testing.
For the record, here is the simplest way to store and retrieve an object in an erasure coded pool as of today:

parameters="erasure-code-k=2 erasure-code-m=1"
./ceph osd crush rule create-erasure ecruleset \
  $parameters \
  erasure-code-ruleset-failure-domain=osd
./ceph osd pool create ecpool 12 12 erasure \
  crush_ruleset=ecruleset \
  $parameters
./rados --pool ecpool put SOMETHING /etc/group
./rados --pool ecpool get SOMETHING /tmp/group
$ tail -3 /tmp/group
postfix:x:133:
postdrop:x:134:
_cvsadmin:x:135:

The chunks are stored in three objects, one per OSD (note the s0, s1 and s2 shard suffixes in the paths below), and the original object can be reconstructed if any one of them is lost.

find dev | grep SOMETHING
dev/osd4/current/3.7s0_head/SOMETHING__head_847441D7__3_ffffffffffffffff_0
dev/osd6/current/3.7s1_head/SOMETHING__head_847441D7__3_ffffffffffffffff_1
dev/osd9/current/3.7s2_head/SOMETHING__head_847441D7__3_ffffffffffffffff_2

How does a Ceph OSD handle a read message? (in Firefly and up)

When an OSD handles an operation, it is queued to a PG: it is added to the op_wq work queue (or to the waiting_for_map list if the queue_op method of PG finds that it must wait for an OSDMap) and will be dequeued asynchronously. The dequeued operation is processed by the ReplicatedPG::do_request method, which calls the do_op method because it is a CEPH_MSG_OSD_OP. An OpContext is allocated and executed.

2014-02-24 09:28:34.571489 7fc18006f700 10 osd.4 pg_epoch: 26 pg[3.7s0( v 26'1 (0'0,26'1] local-les=26 n=1 ec=25 les/c 26/26 25/25/25) [4,6,9] r=0 lpr=25 crt=0'0 lcod 0'0 mlcod 0'0 active+clean] execute_ctx 0x7fc16c08a3b0

A transaction (either an RPGTransaction for a replicated backend or an ECTransaction for an erasure coded backend) is obtained from the PGBackend. The transaction is attached to an OpContext (which was allocated by do_op). Note that although do_op shows in the following log line, it comes from the execute_ctx method.

2014-02-24 09:28:34.571563 7fc18006f700 10 osd.4 pg_epoch: 26 pg[3.7s0( v 26'1 (0'0,26'1] local-les=26 n=1 ec=25 les/c 26/26 25/25/25) [4,6,9] r=0 lpr=25 crt=0'0 lcod 0'0 mlcod 0'0 active+clean] do_op 847441d7/SOMETHING/head//3 [read 0~4194304] ov 26'1

The execute_ctx method calls prepare_transaction, which calls do_osd_ops, which prepares the CEPH_OSD_OP_READ.

2014-02-24 09:28:34.571663 7fc18006f700 10 osd.4 pg_epoch: 26 pg[3.7s0( v 26'1 (0'0,26'1] local-les=26 n=1 ec=25 les/c 26/26 25/25/25) [4,6,9] r=0 lpr=25 crt=0'0 lcod 0'0 mlcod 0'0 active+clean] async_read noted for 847441d7/SOMETHING/head//3

The execute_ctx method continues when prepare_transaction returns and creates the MOSDOpReply object. It then calls start_async_reads, which calls objects_read_async on the backend (either ReplicatedBackend::objects_read_async or ECBackend::objects_read_async). When the read completes (this code path is not explored here), the OnReadComplete::finish method is called (because the OnReadComplete object was given as an argument to objects_read_async). It calls ReplicatedPG::OpContext::finish_read each time a read completes (i.e. on each chunk, if reading from an erasure coded pool), which calls ReplicatedPG::complete_read_ctx once there are no pending reads, which sends the reply to the client.
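
In short, the read path chains the following methods (a simplified sketch, arguments omitted):

PG::queue_op                     queues the op to op_wq (or waiting_for_map)
ReplicatedPG::do_request         dequeues asynchronously
ReplicatedPG::do_op              CEPH_MSG_OSD_OP: allocates the OpContext
ReplicatedPG::execute_ctx
  prepare_transaction
    do_osd_ops                   prepares the CEPH_OSD_OP_READ
  start_async_reads
    objects_read_async           ReplicatedBackend or ECBackend
... the read completes ...
OnReadComplete::finish
  OpContext::finish_read         once per chunk for an erasure coded pool
  complete_read_ctx              when there are no pending reads left
    MOSDOpReply sent to the client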

figuring out why ccache misses

When compiling Ceph, ccache may appear to miss more often than expected, as shown by the cache miss line of ccache -s:

cache directory                     /home/loic/.ccache
cache hit (direct)                     1
cache hit (preprocessed)               0
cache miss                             1
files in cache                         3
cache size                           392 Kbytes
max cache size                      10.0 Gbytes

Compiling Ceph from clones in two different directories does not explain the miss, unless CCACHE_HASHDIR is set: when it is, the current directory is included in the hash and identical compilations run from different directories no longer match. It should be unset with:

unset CCACHE_HASHDIR


Ceph paxos propose interval

When a command is sent to the Ceph monitor, such as ceph osd pool create, it adds a pool to the pending changes of the maps. The modification is stashed for paxos propose interval seconds before it is used to build new maps and becomes effective. This guarantees that the mons are not updated more than once per second (the default value of paxos propose interval).
When running make check, lowering the paxos propose interval value to 0.01 seconds for the cephtool tests roughly halves the run time (going from ~2.5 to ~1.25 minutes real time):

--paxos-propose-interval=0.01