Erasure code is also RAID5, which lets you lose a hard disk without losing your data. From the user's point of view the concept is simple and useful, but for the person in charge of designing the software that does the work, it is a headache. Three-disk RAID5 enclosures can be found in any shop: when one of them stops working, you replace it and the files are still there. One could imagine the same thing with six disks, two of which fail simultaneously. But no: instead of relying on an XOR operation that can be grasped in five minutes, it takes Galois fields, a solid mathematical background and a lot of computation. To make matters worse, in a distributed storage system such as Ceph, disks are often temporarily disconnected because of network outages.
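To illustrate how cheap the XOR case is: with two data bytes and one parity byte, any single loss can be recovered. A minimal sketch in shell:

$ d1=202 ; d2=117            # two data bytes
$ p=$(( d1 ^ d2 ))           # RAID5-style parity
$ echo $(( p ^ d2 ))         # recover d1 from the parity and the surviving byte
202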
Continue reading “A subjective view of the birth of erasure code in Ceph”
Benchmarking Ceph jerasure version 2 plugin
The Ceph erasure code plugin benchmark results for jerasure version 1 are compared with those obtained after an upgrade to jerasure version 2, using the same command on the same hardware.
- Encoding: 5.2GB/s, which is ~20% better than 4.2GB/s
- Decoding: no processing necessary (because the code is systematic)
- Recovering the loss of one OSD: 11.3GB/s, which is ~13% better than 10GB/s
- Recovering the loss of two OSDs: 4.42GB/s, which is ~35% better than 3.2GB/s
The relevant lines from the full output of the benchmark are:
seconds   KB       plugin    k  m  work.   iter.  size     eras.
0.088136  1048576  jerasure  6  2  decode  1024   1048576  1
0.226118  1048576  jerasure  6  2  decode  1024   1048576  2
0.191825  1048576  jerasure  6  2  encode  1024   1048576  0
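Figures like the ones above come from the ceph_erasure_code_benchmark tool. An invocation along the following lines should reproduce the encode run; the exact flags are an assumption and may differ between versions:

ceph_erasure_code_benchmark \
    --plugin jerasure \
    --parameter k=6 --parameter m=2 \
    --workload encode \
    --iterations 1024 \
    --size 1048576 \
    --erasures 0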
The improvements are likely to be greater for larger K+M values.
OpenStack Upstream Training in Atlanta
The OpenStack Foundation is delivering a training program to accelerate the speed at which new OpenStack developers are successful at integrating their own roadmap into that of the OpenStack project. If you’re a new OpenStack contributor or plan on becoming one soon, you should sign up for the next OpenStack Upstream Training in Atlanta, May 10-11. Participation is also strongly advised for first-time attendees of the OpenStack Design Summit.
Ceph erasure code : ready for alpha testing
The addition of erasure code in Ceph started in April 2013 and was discussed during the first Ceph Developer Summit. The implementation reached an important milestone a few days ago and is now ready for alpha testing.
For the record, here is the simplest way to store and retrieve an object in an erasure coded pool as of today:
parameters="erasure-code-k=2 erasure-code-m=1" ./ceph osd crush rule create-erasure ecruleset \ $parameters \ erasure-code-ruleset-failure-domain=osd ./ceph osd pool create ecpool 12 12 erasure \ crush_ruleset=ecruleset \ $parameters ./rados --pool ecpool put SOMETHING /etc/group ./rados --pool ecpool get SOMETHING /tmp/group $ tail -3 /tmp/group postfix:x:133: postdrop:x:134: _cvsadmin:x:135:
The object is split into three chunks, each stored as a separate object on its own OSD, and the original can be reconstructed if any one of them is lost.
find dev | grep SOMETHING
dev/osd4/current/3.7s0_head/SOMETHING__head_847441D7__3_ffffffffffffffff_0
dev/osd6/current/3.7s1_head/SOMETHING__head_847441D7__3_ffffffffffffffff_1
dev/osd9/current/3.7s2_head/SOMETHING__head_847441D7__3_ffffffffffffffff_2
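To see the reconstruction at work, stop the OSD holding one of the chunks and read the object back. A minimal sketch, assuming a vstart.sh development cluster that writes pid files under out/:

kill $(cat out/osd.9.pid)       # stop the OSD holding chunk 2 (pid file location is an assumption)
./rados --pool ecpool get SOMETHING /tmp/group.degraded
diff /tmp/group /tmp/group.degraded && echo reconstructed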
How does a Ceph OSD handle a read message? (in Firefly and up)
When an OSD handles an operation, it is queued to a PG: it is added to the op_wq work queue (or to the waiting_for_map list if the queue_op method of PG finds that it must wait for an OSDMap) and will be dequeued asynchronously. The dequeued operation is processed by the ReplicatedPG::do_request method, which calls the do_op method because it is a CEPH_MSG_OSD_OP. An OpContext is allocated and executed.
2014-02-24 09:28:34.571489 7fc18006f700 10 osd.4 pg_epoch: 26 pg[3.7s0( v 26'1 (0'0,26'1] local-les=26 n=1 ec=25 les/c 26/26 25/25/25) [4,6,9] r=0 lpr=25 crt=0'0 lcod 0'0 mlcod 0'0 active+clean] execute_ctx 0x7fc16c08a3b0
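Log lines like the one above are debug messages from the osd subsystem at level 10 (the number right after the thread id). One way to obtain them on a running OSD is to raise the log level with injectargs (the exact level needed is inferred from the lines quoted here):

ceph tell osd.4 injectargs '--debug-osd 10'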
A transaction (either an RPGTransaction for a replicated backend or an ECTransaction for an erasure coded backend) is obtained from the PGBackend. The transaction is attached to the OpContext (which was allocated by do_op). Note that although the following log line shows do_op, it comes from the execute_ctx method.
2014-02-24 09:28:34.571563 7fc18006f700 10 osd.4 pg_epoch: 26 pg[3.7s0( v 26'1 (0'0,26'1] local-les=26 n=1 ec=25 les/c 26/26 25/25/25) [4,6,9] r=0 lpr=25 crt=0'0 lcod 0'0 mlcod 0'0 active+clean] do_op 847441d7/SOMETHING/head//3 [read 0~4194304] ov 26'1
The execute_ctx method calls prepare_transaction which calls do_osd_ops which prepares the CEPH_OSD_OP_READ.
2014-02-24 09:28:34.571663 7fc18006f700 10 osd.4 pg_epoch: 26 pg[3.7s0( v 26'1 (0'0,26'1] local-les=26 n=1 ec=25 les/c 26/26 25/25/25) [4,6,9] r=0 lpr=25 crt=0'0 lcod 0'0 mlcod 0'0 active+clean] async_read noted for 847441d7/SOMETHING/head//3
The execute_ctx method continues when prepare_transaction returns and creates the MOSDOpReply object. It then calls start_async_reads, which calls objects_read_async on the backend (either ReplicatedBackend::objects_read_async or ECBackend::objects_read_async). When the read completes (a code path not explored here), the OnReadComplete::finish method is called, because the OnReadComplete object was given as an argument to objects_read_async. It calls ReplicatedPG::OpContext::finish_read each time a read completes (i.e. on each chunk when reading from an erasure coded pool), which calls ReplicatedPG::complete_read_ctx once there are no pending reads, which sends the reply to the client.
figuring out why ccache misses
When compiling Ceph, ccache may appear to miss more than expected, as shown by the cache miss line of ccache -s:
cache directory                     /home/loic/.ccache
cache hit (direct)                  1
cache hit (preprocessed)            0
cache miss                          1
files in cache                      3
cache size                          392 Kbytes
max cache size                      10.0 Gbytes
Compiling Ceph from clones in two different directories does not explain the miss unless CCACHE_HASHDIR is set; if it is, it should be unset with:
unset CCACHE_HASHDIR
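To confirm the diagnosis, zero the statistics and rebuild from both clones; the second build should then register direct hits. A sketch, with hypothetical clone locations:

unset CCACHE_HASHDIR
ccache -z                        # zero the statistics
(cd ~/ceph-first && make)        # hypothetical clone paths
(cd ~/ceph-second && make)
ccache -s                        # cache hit (direct) should now increase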
Ceph paxos propose interval
When a command such as ceph osd pool create is sent to the Ceph monitor, it adds a pool to the pending changes of the maps. The modification is stashed for paxos propose interval seconds before it is used to build new maps and becomes effective. This guarantees that the mons are not updated more than once a second (the default value of paxos propose interval).
When running make check, lowering the paxos propose interval value to 0.01 seconds for the cephtool tests roughly halves the run time (from ~2.5 to ~1.25 minutes real time).
--paxos-propose-interval=0.01
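For instance, assuming a development cluster started with vstart.sh, whose -o flag appends options to the generated ceph.conf:

./vstart.sh -n -o 'paxos propose interval = 0.01'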
Hadoop like computing with Ceph
Computation can be co-located on the machine where a Ceph object resides, accessing the data from the local disk instead of going through the network. Noah Watkins explains it in great detail, and it can be experimented with through a Hello World example which calls the hello plugin included in the Emperor release.
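A minimal invocation could look like the following; the pool and object names are examples, and say_hello is an assumption about the method name exposed by the hello class:

rados --pool data put SOMETHING /etc/group         # store an object
rados --pool data exec SOMETHING hello say_hello   # run the hello method where the object lives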
Continue reading “Hadoop like computing with Ceph”
Organization mapping and Reviewed-by statistics with git
git shortlog conveniently prints a leaderboard counting contributions. For instance, to display the top ten committers to Ceph over the past year:
$ git shortlog --since='1 year' --no-merges -nes | nl | head -10
 1  1890  Sage Weil <sage@inktank.com>
 2   805  Danny Al-Gaaf <danny.al-gaaf@bisect.de>
 3   491  Samuel Just <sam.just@inktank.com>
 4   462  Yehuda Sadeh <yehuda@inktank.com>
 5   443  John Wilkins <john.wilkins@inktank.com>
 6   303  Greg Farnum <greg@inktank.com>
 7   288  Dan Mick <dan.mick@inktank.com>
 8   274  Loic Dachary <loic@dachary.org>
 9   219  Yan, Zheng <zheng.z.yan@intel.com>
10   214  João Eduardo Luís <joao.luis@inktank.com>
To get the same output for reviewers over the past year, assuming the Reviewed-by is set consistently in the commit messages, the following can be used:
git log --since='1 year' --pretty=%b | \
    perl -n -e 'print "$_\n" if(s/^\s*Reviewed-by:\s*(.*<.*>)\s*$/\1/)' | \
    git check-mailmap --stdin | \
    sort | uniq -c | sort -rn | nl | head -10
 1   652  Sage Weil <sage@inktank.com>
 2   265  Greg Farnum <greg@inktank.com>
 3   185  Samuel Just <sam.just@inktank.com>
 4   106  Josh Durgin <josh.durgin@inktank.com>
 5    95  João Eduardo Luís <joao.luis@inktank.com>
 6    95  Dan Mick <dan.mick@inktank.com>
 7    69  Yehuda Sadeh <yehuda@inktank.com>
 8    46  David Zafman <david.zafman@inktank.com>
 9    36  Loic Dachary <loic@dachary.org>
10    21  Gary Lowell <gary.lowell@inktank.com>
The body of the commit messages (--pretty=%b) is displayed for commits from the past year (--since='1 year'). perl reads and does not print anything (-n) unless it finds a Reviewed-by: string followed by what looks like First Last <mail@dot.com> (^\s*Reviewed-by:\s*(.*<.*>)\s*$). The author names found are then remapped to fix typos (git check-mailmap --stdin).
The authors can further be remapped to the organizations with which they are affiliated using the .organizationmap file, which has the same format as the .mailmap file, only mapping normalized author names to organization names, with git -c mailmap.file=.organizationmap check-mailmap --stdin.
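For instance, a minimal .organizationmap could contain entries like these (the affiliations shown are illustrative):

# organization on the left, author (as normalized by .mailmap) on the right
Inktank <contact@inktank.com> Sage Weil <sage@inktank.com>
Cloudwatt <libre.licensing@cloudwatt.com> Loic Dachary <loic@dachary.org>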
git log --since='1 year' --pretty=%b | \
    perl -n -e 'print "$_\n" if(s/^\s*Reviewed-by:\s*(.*<.*>)\s*$/\1/)' | \
    git check-mailmap --stdin | \
    git -c mailmap.file=.organizationmap check-mailmap --stdin | \
    sort | uniq -c | sort -rn | nl | head -10
 1  1572  Inktank <contact@inktank.com>
 2    39  Cloudwatt <libre.licensing@cloudwatt.com>
 3     7  Intel <contact@intel.com>
 4     4  University of California, Santa Cruz <contact@cs.ucsc.edu>
 5     4  Roald van Loon Consultancy <roald@roaldvanloon.nl>
 6     2  CERN <contact@cern.ch>
 7     1  SUSE <contact@suse.com>
 8     1  Mark Kirkwood <mark.kirkwood@catalyst.net.nz>
 9     1  IWeb <contact@iweb.com>
10     1  Gaudenz Steinlin <gaudenz@debian.org>
Becoming a Core Contributor: the fast track
Anyone willing to become a better Free Software contributor is invited to attend the next session of Upstream University, held in advance of FOSDEM. The training starts on the morning of January 30th, 2014, within walking distance of the Grand Place in Brussels.
- Registration is free and requires picking a contribution to work on in the bug tracker of a Free Software project (any Free Software project will do)
Participating in Free Software projects is not just about technical skills: there will be informal followups in bars and restaurants afterwards 🙂 This session will be the first to focus on Core Contributors and what it takes to become one, based on lessons learnt from OpenStack and Ceph.
Continue reading “Becoming a Core Contributor: the fast track”