SSE optimization for erasure code in Ceph

The jerasure library is Ceph's default erasure code plugin. Its companion library, gf-complete, supports SSE optimizations that are enabled at compile time, when the compiler provides the corresponding flags (-msse4.2 etc.). The jerasure plugin (and gf-complete with it) is compiled multiple times with various levels of SSE features:

  • jerasure_sse4 uses SSE4.2, SSE4.1, SSSE3, SSE3, SSE2, SSE
  • jerasure_sse3 uses SSSE3, SSE3, SSE2, SSE
  • jerasure_generic uses no SSE instructions

When an OSD loads the jerasure plugin, the CPU features are probed and the plugin matching the available features is selected.
The gf-complete source code is cleanly divided into functions that take advantage of specific SSE features. It should be easy to use the ifunc attribute to semi-manually select each function individually, at runtime and without performance penalty (the choice is made the first time the function is called and recorded for later calls). With such fine-grained selection there would be no need to compile three plugins, because each function would be compiled with exactly the set of flags it needs.
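As an illustration, here is a minimal sketch of what ifunc-based selection could look like, assuming GCC on an ELF target. The function names are hypothetical and the bodies are plain C to keep the sketch self-contained; they are not the actual gf-complete symbols:

#include <stddef.h>

/* Two variants of the same XOR kernel: in gf-complete the per-feature
   variants contain real SIMD code, here both bodies are plain C. */
static void region_xor_generic(unsigned char *dst, const unsigned char *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] ^= src[i];
}

static void region_xor_sse4(unsigned char *dst, const unsigned char *src, size_t n)
{
    /* would be compiled with -msse4.2 and use SSE intrinsics */
    for (size_t i = 0; i < n; i++)
        dst[i] ^= src[i];
}

/* The resolver runs once, when the symbol is first resolved; its result
   is cached so later calls jump directly to the chosen variant. */
static void (*resolve_region_xor(void))(unsigned char *, const unsigned char *, size_t)
{
    __builtin_cpu_init();
    if (__builtin_cpu_supports("sse4.2"))
        return region_xor_sse4;
    return region_xor_generic;
}

/* Callers simply use region_xor: the dynamic linker routes the call. */
void region_xor(unsigned char *dst, const unsigned char *src, size_t n)
    __attribute__((ifunc("resolve_region_xor")));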

Your first exabyte in a Ceph cluster

$ rbd create --size $((1024 * 1024 * 1024 * 1024)) tiny
$ rbd info tiny
rbd image 'tiny':
	size 1024 PB in 274877906944 objects
	order 22 (4096 kB objects)
	block_name_prefix: rb.0.1009.6b8b4567
	format: 1
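
The --size argument is expressed in megabytes, so the reported numbers check out:

2^40 MB = 2^60 bytes = 1024 PB
2^60 bytes / 2^22 bytes per object (order 22) = 2^38 = 274877906944 objects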

Note: rbd rm tiny will take a long time.

The footprints of 192 Ceph developers

Gource is run on the Ceph git repository for each of the 192 developers who contributed to its development over the past six years. Their footprint is the last image of a video clip created from all the commits they authored. Ten of them follow, each with a link to the corresponding video clip:

  • Sage Weil (video clip)
  • Yehuda Sadeh (video clip)
  • Greg Farnum (video clip)
  • Samuel Just (video clip)
  • Colin P. McCabe (video clip)
  • Danny Al-Gaaf (video clip)
  • Josh Durgin (video clip)
  • John Wilkins (video clip)
  • Loic Dachary (video clip)
  • Dan Mick (video clip)


Benchmarking Ceph jerasure version 2 plugin

The Ceph erasure code plugin benchmark results for jerasure version 1 are compared with the results after an upgrade to jerasure version 2, using the same command on the same hardware:

  • Encoding: 5.2GB/s which is ~20% better than 4.2GB/s
  • Decoding: no processing necessary (because the code is systematic)
  • Recovering the loss of one OSD: 11.3GB/s which is ~13% better than 10GB/s
  • Recovering the loss of two OSDs: 4.42GB/s which is ~35% better than 3.2GB/s

The relevant lines from the full output of the benchmark are:

seconds         KB      plugin          k m work.   iter.   size    eras.
0.088136        1048576 jerasure        6 2 decode  1024    1048576 1
0.226118        1048576 jerasure        6 2 decode  1024    1048576 2
0.191825        1048576 jerasure        6 2 encode  1024    1048576 0
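
Each line reports 1048576 KB processed, i.e. one gigabyte, so the throughput figures above follow from dividing one gigabyte by the seconds column:

1GB / 0.191825s ≈ 5.2GB/s   (encode)
1GB / 0.088136s ≈ 11.3GB/s  (decode with one erasure, i.e. one lost OSD)
1GB / 0.226118s ≈ 4.42GB/s  (decode with two erasures, i.e. two lost OSDs)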

The improvements are likely to be greater for larger K+M values.

Ceph erasure code : ready for alpha testing

The addition of erasure code in Ceph started in April 2013 and was discussed during the first Ceph Developer Summit. The implementation reached an important milestone a few days ago: it is now ready for alpha testing.
For the record, here is the simplest way to store and retrieve an object in an erasure coded pool as of today:

parameters="erasure-code-k=2 erasure-code-m=1"
./ceph osd crush rule create-erasure ecruleset \
  $parameters \
  erasure-code-ruleset-failure-domain=osd
./ceph osd pool create ecpool 12 12 erasure \
  crush_ruleset=ecruleset \
  $parameters
./rados --pool ecpool put SOMETHING /etc/group
./rados --pool ecpool get SOMETHING /tmp/group
$ tail -3 /tmp/group
postfix:x:133:
postdrop:x:134:
_cvsadmin:x:135:

The chunks are stored in three objects, one per OSD (note the s0, s1 and s2 shard suffixes in the paths below), and the original object can be reconstructed if any one of them is lost.

find dev | grep SOMETHING
dev/osd4/current/3.7s0_head/SOMETHING__head_847441D7__3_ffffffffffffffff_0
dev/osd6/current/3.7s1_head/SOMETHING__head_847441D7__3_ffffffffffffffff_1
dev/osd9/current/3.7s2_head/SOMETHING__head_847441D7__3_ffffffffffffffff_2

How does a Ceph OSD handle a read message? (in Firefly and up)

When an OSD handles an operation, it is queued to a PG: it is added to the op_wq work queue (or to the waiting_for_map list if the queue_op method of PG finds that it must wait for an OSDMap) and will be dequeued asynchronously. The dequeued operation is processed by the ReplicatedPG::do_request method, which calls the do_op method because it is a CEPH_MSG_OSD_OP. An OpContext is allocated and executed.

2014-02-24 09:28:34.571489 7fc18006f700 10 osd.4 pg_epoch: 26 pg[3.7s0( v 26'1 (0'0,26'1] local-les=26 n=1 ec=25 les/c 26/26 25/25/25) [4,6,9] r=0 lpr=25 crt=0'0 lcod 0'0 mlcod 0'0 active+clean] execute_ctx 0x7fc16c08a3b0

A transaction (either an RPGTransaction for a replicated backend or an ECTransaction for an erasure coded backend) is obtained from the PGBackend. The transaction is attached to an OpContext (which was allocated by do_op). Note that although do_op shows in the following log line, it comes from the execute_ctx method.

2014-02-24 09:28:34.571563 7fc18006f700 10 osd.4 pg_epoch: 26 pg[3.7s0( v 26'1 (0'0,26'1] local-les=26 n=1 ec=25 les/c 26/26 25/25/25) [4,6,9] r=0 lpr=25 crt=0'0 lcod 0'0 mlcod 0'0 active+clean] do_op 847441d7/SOMETHING/head//3 [read 0~4194304] ov 26'1

The execute_ctx method calls prepare_transaction, which calls do_osd_ops, which prepares the CEPH_OSD_OP_READ.

2014-02-24 09:28:34.571663 7fc18006f700 10 osd.4 pg_epoch: 26 pg[3.7s0( v 26'1 (0'0,26'1] local-les=26 n=1 ec=25 les/c 26/26 25/25/25) [4,6,9] r=0 lpr=25 crt=0'0 lcod 0'0 mlcod 0'0 active+clean] async_read noted for 847441d7/SOMETHING/head//3

The execute_ctx method continues when prepare_transaction returns and creates the MOSDOpReply object. It then calls start_async_reads, which calls objects_read_async on the backend (either ReplicatedBackend::objects_read_async or ECBackend::objects_read_async). When the read completes (this code path is not explored here), the OnReadComplete::finish method is called (because the OnReadComplete object was given as an argument to objects_read_async). It calls ReplicatedPG::OpContext::finish_read each time a read completes (i.e. on each chunk, if reading from an erasure coded pool), which calls ReplicatedPG::complete_read_ctx once there are no pending reads, which sends the reply to the client.
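
In short, the read path chains the following methods (a simplified sketch, arguments omitted):

PG::queue_op                     queues the op to op_wq (or waiting_for_map)
ReplicatedPG::do_request         dequeues asynchronously
ReplicatedPG::do_op              CEPH_MSG_OSD_OP: allocates the OpContext
ReplicatedPG::execute_ctx
  prepare_transaction
    do_osd_ops                   prepares the CEPH_OSD_OP_READ
  start_async_reads
    objects_read_async           ReplicatedBackend or ECBackend
... the read completes ...
OnReadComplete::finish
  OpContext::finish_read         once per chunk for an erasure coded pool
  complete_read_ctx              when there are no pending reads left
    MOSDOpReply sent to the client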

figuring out why ccache misses

When compiling Ceph, ccache may appear to miss more often than expected, as shown by the cache miss line of ccache -s:

cache directory                     /home/loic/.ccache
cache hit (direct)                     1
cache hit (preprocessed)               0
cache miss                             1
files in cache                         3
cache size                           392 Kbytes
max cache size                      10.0 Gbytes

Compiling Ceph from clones in two different directories does not explain the miss, unless CCACHE_HASHDIR is set: when it is, the current directory is included in the hash and identical compilations run from different directories no longer match. It should be unset with:

unset CCACHE_HASHDIR


Ceph paxos propose interval

When a command is sent to the Ceph monitor, such as ceph osd pool create, it adds a pool to the pending changes of the maps. The modification is stashed for paxos propose interval seconds before it is used to build new maps and becomes effective. This guarantees that the mons are not updated more than once per second (the default value of paxos propose interval).
When running make check, lowering the paxos propose interval value to 0.01 seconds for the cephtool tests roughly halves the run time (going from ~2.5 to ~1.25 minutes real time):

--paxos-propose-interval=0.01