A subjective view of the birth of Erasure Code in Ceph

Erasure coding is also RAID5, which makes it possible to lose a hard disk without losing any data. From the user's point of view, the concept is simple and useful, but for the person in charge of designing the software that does the work, it is a headache. Three-disk RAID5 enclosures can be found in any shop: when one of the disks stops working, it is replaced and the files are still there. One could imagine the same thing with six disks, two of which stop working simultaneously. But no: instead of relying on a XOR operation that can be understood in five minutes, it takes Galois fields, a solid mathematical background and a lot of computation. To make things even harder, in a distributed storage system such as Ceph, disks are often temporarily disconnected because of network unavailability.
Continue reading “A subjective view of the birth of Erasure Code in Ceph”

Benchmarking Ceph jerasure version 2 plugin

The Ceph erasure code plugin benchmark results for jerasure version 1 are compared with those obtained after an upgrade to jerasure version 2, using the same command on the same hardware; an invocation sketch follows the list below.

  • Encoding: 5.2GB/s which is ~20% better than 4.2GB/s
  • Decoding: no processing necessary (because the code is systematic)
  • Recovering the loss of one OSD: 11.3GB/s which is ~13% better than 10GB/s
  • Recovering the loss of two OSDs: 4.42GB/s which is ~35% better than 3.2GB/s
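For reference, the figures above come from an invocation along these lines (a sketch: ceph_erasure_code_benchmark is the benchmark tool shipped in the Ceph source tree, and the exact flags and the reed_sol_van technique are assumptions that may differ from the run shown below):

ceph_erasure_code_benchmark \
  --plugin jerasure \
  --workload decode \
  --iterations 1024 \
  --size 1048576 \
  --erasures 2 \
  --parameter k=6 \
  --parameter m=2 \
  --parameter technique=reed_sol_van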

The relevant lines from the full output of the benchmark are:

seconds         KB      plugin          k m work.   iter.   size    eras.
0.088136        1048576 jerasure        6 2 decode  1024    1048576 1
0.226118        1048576 jerasure        6 2 decode  1024    1048576 2
0.191825        1048576 jerasure        6 2 encode  1024    1048576 0

The improvements are likely to be greater for larger K+M values.

OpenStack Upstream Training in Atlanta

The OpenStack Foundation is delivering a training program to accelerate the speed at which new OpenStack developers are successful at integrating their own roadmap into that of the OpenStack project. If you’re a new OpenStack contributor or plan on becoming one soon, you should sign up for the next OpenStack Upstream Training in Atlanta, May 10-11. Participation is also strongly advised for first-time attendees of the OpenStack Design Summit.

Continue reading “OpenStack Upstream Training in Atlanta”

Ceph erasure code : ready for alpha testing

The addition of erasure code in Ceph started in April 2013 and was discussed during the first Ceph Developer Summit. The implementation reached an important milestone a few days ago and is now ready for alpha testing.
For the record, here is the simplest way to store and retrieve an object in an erasure coded pool as of today:

parameters="erasure-code-k=2 erasure-code-m=1"
./ceph osd crush rule create-erasure ecruleset \
  $parameters \
  erasure-code-ruleset-failure-domain=osd
./ceph osd pool create ecpool 12 12 erasure \
  crush_ruleset=ecruleset \
  $parameters
./rados --pool ecpool put SOMETHING /etc/group
./rados --pool ecpool get SOMETHING /tmp/group
tail -3 /tmp/group
postfix:x:133:
postdrop:x:134:
_cvsadmin:x:135:

With erasure-code-k=2 and erasure-code-m=1, each object is split into two data chunks plus one coding chunk. The three chunks are stored in three objects, one per OSD, and the original object can be reconstructed if any one of them is lost.

find dev | grep SOMETHING
dev/osd4/current/3.7s0_head/SOMETHING__head_847441D7__3_ffffffffffffffff_0
dev/osd6/current/3.7s1_head/SOMETHING__head_847441D7__3_ffffffffffffffff_1
dev/osd9/current/3.7s2_head/SOMETHING__head_847441D7__3_ffffffffffffffff_2
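The s0, s1 and s2 suffixes of the placement group id (3.7s0, 3.7s1, 3.7s2) identify the shard to which each chunk belongs. A sketch of how reconstruction could be verified on a development cluster (assuming the cluster from the listing above; the exact way to keep an OSD down depends on how it was started):

./ceph osd set noup      # prevent the OSD from marking itself back up
./ceph osd down 6        # take one of the three OSDs down
./rados --pool ecpool get SOMETHING /tmp/group.rebuilt
diff /tmp/group /tmp/group.rebuilt   # the object was rebuilt from the two remaining chunks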

How does a Ceph OSD handle a read message? (in Firefly and up)

When an OSD handles an operation, it is queued to a PG: it is added to the op_wq work queue (or to the waiting_for_map list if the queue_op method of PG finds that it must wait for an OSDMap) and will be dequeued asynchronously. The dequeued operation is processed by the ReplicatedPG::do_request method, which calls the do_op method because it is a CEPH_MSG_OSD_OP. An OpContext is allocated and executed.

2014-02-24 09:28:34.571489 7fc18006f700 10 osd.4 pg_epoch: 26 pg[3.7s0( v 26'1 (0'0,26'1] local-les=26 n=1 ec=25 les/c 26/26 25/25/25) [4,6,9] r=0 lpr=25 crt=0'0 lcod 0'0 mlcod 0'0 active+clean] execute_ctx 0x7fc16c08a3b0

A transaction (which is either an RPGTransaction for a replicated backend or an ECTransaction for an erasure coded backend) is obtained from the PGBackend. The transaction is attached to an OpContext (which was allocated by do_op). Note that although the following log line shows do_op, it comes from the execute_ctx method.

2014-02-24 09:28:34.571563 7fc18006f700 10 osd.4 pg_epoch: 26 pg[3.7s0( v 26'1 (0'0,26'1] local-les=26 n=1 ec=25 les/c 26/26 25/25/25) [4,6,9] r=0 lpr=25 crt=0'0 lcod 0'0 mlcod 0'0 active+clean] do_op 847441d7/SOMETHING/head//3 [read 0~4194304] ov 26'1

The execute_ctx method calls prepare_transaction which calls do_osd_ops which prepares the CEPH_OSD_OP_READ.

2014-02-24 09:28:34.571663 7fc18006f700 10 osd.4 pg_epoch: 26 pg[3.7s0( v 26'1 (0'0,26'1] local-les=26 n=1 ec=25 les/c 26/26 25/25/25) [4,6,9] r=0 lpr=25 crt=0'0 lcod 0'0 mlcod 0'0 active+clean] async_read noted for 847441d7/SOMETHING/head//3

The execute_ctx method continues when prepare_transaction returns, and creates the MOSDOpReply object. It then calls start_async_reads, which calls objects_read_async on the backend (either ReplicatedBackend::objects_read_async or ECBackend::objects_read_async). When the read completes (this code path is not explored here), the OnReadComplete::finish method is called, because the OnReadComplete object was given as an argument to objects_read_async. It calls ReplicatedPG::OpContext::finish_read each time a read completes (i.e. on each chunk when reading from an erasure coded pool), which calls ReplicatedPG::complete_read_ctx once there are no pending reads left, and complete_read_ctx sends the reply to the client.
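Condensed into a call chain, the read path described above is (method names are taken from the prose above; intermediate calls are elided):

PG::queue_op                              # op_wq, or waiting_for_map
  ReplicatedPG::do_request
    ReplicatedPG::do_op                   # CEPH_MSG_OSD_OP, allocates the OpContext
      ReplicatedPG::execute_ctx
        prepare_transaction
          do_osd_ops                      # notes the CEPH_OSD_OP_READ
        start_async_reads
          objects_read_async              # ReplicatedBackend or ECBackend
# ... asynchronously, when the read completes ...
OnReadComplete::finish
  ReplicatedPG::OpContext::finish_read    # once per chunk for an erasure coded pool
    ReplicatedPG::complete_read_ctx       # when no reads are pending: reply sent to the client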

figuring out why ccache misses

When compiling Ceph, ccache may appear to miss more than expected, as shown by the cache miss line of ccache -s:

cache directory                     /home/loic/.ccache
cache hit (direct)                     1
cache hit (preprocessed)               0
cache miss                             1
files in cache                         3
cache size                           392 Kbytes
max cache size                      10.0 Gbytes

Compiling Ceph from clones in two different directories does not explain the miss, unless CCACHE_HASHDIR is set. It should be unset with:

unset CCACHE_HASHDIR
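If the miss remains unexplained, ccache can log the reason for each decision (CCACHE_LOGFILE is a standard ccache environment variable; the file name below is arbitrary):

export CCACHE_LOGFILE=/tmp/ccache.log  # ccache appends one line per lookup
make                                   # recompile as usual
grep -i miss /tmp/ccache.log           # inspect the recorded reasons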

Continue reading “figuring out why ccache misses”

Ceph paxos propose interval

When a command is sent to the Ceph monitor, such as ceph osd pool create, it adds a pool to the pending changes of the maps. The modification is stashed for paxos propose interval seconds before it is used to build new maps and becomes effective. This guarantees that the mons are not updated more than once a second (the default value of paxos propose interval).
When running make check, changing the paxos propose interval value to 0.01 seconds for the cephtool tests roughly halves the run time (going from ~2.5 to ~1.25 minutes real time).

--paxos-propose-interval=0.01
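The same value can also be set in the configuration file instead of on the command line (a sketch; the [mon] section is the natural place for a monitor option):

[mon]
        paxos propose interval = .01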

Organization mapping and Reviewed-by statistics with git

git shortlog is convenient for printing a leaderboard of contributions. For instance, to display the top ten committers of Ceph over the past year:

$ git shortlog --since='1 year' --no-merges -nes | nl | head -10
     1	  1890	Sage Weil <sage@inktank.com>
     2	   805	Danny Al-Gaaf <danny.al-gaaf@bisect.de>
     3	   491	Samuel Just <sam.just@inktank.com>
     4	   462	Yehuda Sadeh <yehuda@inktank.com>
     5	   443	John Wilkins <john.wilkins@inktank.com>
     6	   303	Greg Farnum <greg@inktank.com>
     7	   288	Dan Mick <dan.mick@inktank.com>
     8	   274	Loic Dachary <loic@dachary.org>
     9	   219	Yan, Zheng <zheng.z.yan@intel.com>
    10	   214	João Eduardo Luís <joao.luis@inktank.com>

To get the same output for reviewers over the past year, assuming Reviewed-by: lines are set consistently in the commit messages, the following can be used:

git log  --since='1 year' --pretty=%b | \
 perl -n -e 'print "$_\n" if(s/^\s*Reviewed-by:\s*(.*<.*>)\s*$/\1/)'  | \
 git check-mailmap --stdin | \
 sort | uniq -c | sort -rn | nl | head -10
     1	    652 Sage Weil <sage@inktank.com>
     2	    265 Greg Farnum <greg@inktank.com>
     3	    185 Samuel Just <sam.just@inktank.com>
     4	    106 Josh Durgin <josh.durgin@inktank.com>
     5	     95 João Eduardo Luís <joao.luis@inktank.com>
     6	     95 Dan Mick <dan.mick@inktank.com>
     7	     69 Yehuda Sadeh <yehuda@inktank.com>
     8	     46 David Zafman <david.zafman@inktank.com>
     9	     36 Loic Dachary <loic@dachary.org>
    10	     21 Gary Lowell <gary.lowell@inktank.com>

The body of the commit messages ( --pretty=%b ) is displayed for commits from the past year ( --since='1 year' ). perl reads and does not print anything ( -n ) unless it finds a Reviewed-by: string followed by what looks like First Last <mail@dot.com> ( ^\s*Reviewed-by:\s*(.*<.*>)\s*$ ). The authors found are remapped to fix typos ( git check-mailmap --stdin ).
The authors can further be remapped to the organization with which they are affiliated using the .organizationmap file, which has the same format as the .mailmap file, only mapping normalized author names to organization names, with git -c mailmap.file=.organizationmap check-mailmap --stdin.
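For instance, given the authors listed above, .organizationmap lines could look like the following (hypothetical entries: the organization name and mail address replace the author they map from):

Inktank <contact@inktank.com> Sage Weil <sage@inktank.com>
Inktank <contact@inktank.com> Samuel Just <sam.just@inktank.com>
Cloudwatt <libre.licensing@cloudwatt.com> Loic Dachary <loic@dachary.org>

The full pipeline then becomes: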

git log  --since='1 year' --pretty=%b | \
 perl -n -e 'print "$_\n" if(s/^\s*Reviewed-by:\s*(.*<.*>)\s*$/\1/)'  | \
 git check-mailmap --stdin | \
 git -c mailmap.file=.organizationmap check-mailmap --stdin | \
 sort | uniq -c | sort -rn | nl | head -10
     1	   1572 Inktank <contact@inktank.com>
     2	     39 Cloudwatt <libre.licensing@cloudwatt.com>
     3	      7 Intel <contact@intel.com>
     4	      4 University of California, Santa Cruz <contact@cs.ucsc.edu>
     5	      4 Roald van Loon Consultancy <roald@roaldvanloon.nl>
     6	      2 CERN <contact@cern.ch>
     7	      1 SUSE <contact@suse.com>
     8	      1 Mark Kirkwood <mark.kirkwood@catalyst.net.nz>
     9	      1 IWeb <contact@iweb.com>
    10	      1 Gaudenz Steinlin <gaudenz@debian.org>

Becoming a Core Contributor : the fast track

Anyone willing to become a better Free Software contributor is invited to attend the next session of Upstream University, in advance of FOSDEM. The training starts on the morning of January 30th, 2014, within walking distance of the Grand Place in Brussels.

Participating in Free Software projects is not just about technical skills: there will be informal followups in bars and restaurants afterwards 🙂 This session will be the first to focus on Core Contributors and what it takes to become one, based on lessons learnt from OpenStack and Ceph.
Continue reading “Becoming a Core Contributor : the fast track”