Organization mapping and Reviewed-by statistics with git

git shortlog is convenient to print a leaderboard counting contributions. For instance, to display the top ten committers of Ceph over the past year:

$ git shortlog --since='1 year' --no-merges -nes | nl | head -10
     1	  1890	Sage Weil <sage@inktank.com>
     2	   805	Danny Al-Gaaf <danny.al-gaaf@bisect.de>
     3	   491	Samuel Just <sam.just@inktank.com>
     4	   462	Yehuda Sadeh <yehuda@inktank.com>
     5	   443	John Wilkins <john.wilkins@inktank.com>
     6	   303	Greg Farnum <greg@inktank.com>
     7	   288	Dan Mick <dan.mick@inktank.com>
     8	   274	Loic Dachary <loic@dachary.org>
     9	   219	Yan, Zheng <zheng.z.yan@intel.com>
    10	   214	João Eduardo Luís <joao.luis@inktank.com>

To get the same output for reviewers over the past year, assuming the Reviewed-by: lines are set consistently in the commit messages, the following can be used:

git log  --since='1 year' --pretty=%b | \
 perl -n -e 'print "$_\n" if(s/^\s*Reviewed-by:\s*(.*<.*>)\s*$/\1/)'  | \
 git check-mailmap --stdin | \
 sort | uniq -c | sort -rn | nl | head -10
     1	    652 Sage Weil <sage@inktank.com>
     2	    265 Greg Farnum <greg@inktank.com>
     3	    185 Samuel Just <sam.just@inktank.com>
     4	    106 Josh Durgin <josh.durgin@inktank.com>
     5	     95 João Eduardo Luís <joao.luis@inktank.com>
     6	     95 Dan Mick <dan.mick@inktank.com>
     7	     69 Yehuda Sadeh <yehuda@inktank.com>
     8	     46 David Zafman <david.zafman@inktank.com>
     9	     36 Loic Dachary <loic@dachary.org>
    10	     21 Gary Lowell <gary.lowell@inktank.com>

The body of the commit messages ( --pretty=%b ) is displayed for commits from the past year ( --since='1 year' ). perl reads and does not print anything ( -n ) unless it finds a Reviewed-by: string followed by what looks like First Last <mail@dot.com> ( ^\s*Reviewed-by:\s*(.*<.*>)\s*$ ). The authors found are remapped to fix typos ( git check-mailmap --stdin ).
The authors can further be remapped to the organization with which they are affiliated using the .organizationmap file, which has the same format as the .mailmap file, only mapping normalized author names to organization names, with git -c mailmap.file=.organizationmap check-mailmap --stdin:

git log  --since='1 year' --pretty=%b | \
 perl -n -e 'print "$_\n" if(s/^\s*Reviewed-by:\s*(.*<.*>)\s*$/\1/)'  | \
 git check-mailmap --stdin | \
 git -c mailmap.file=.organizationmap check-mailmap --stdin | \
 sort | uniq -c | sort -rn | nl | head -10
     1	   1572 Inktank <contact@inktank.com>
     2	     39 Cloudwatt <libre.licensing@cloudwatt.com>
     3	      7 Intel <contact@intel.com>
     4	      4 University of California, Santa Cruz <contact@cs.ucsc.edu>
     5	      4 Roald van Loon Consultancy <roald@roaldvanloon.nl>
     6	      2 CERN <contact@cern.ch>
     7	      1 SUSE <contact@suse.com>
     8	      1 Mark Kirkwood <mark.kirkwood@catalyst.net.nz>
     9	      1 IWeb <contact@iweb.com>
    10	      1 Gaudenz Steinlin <gaudenz@debian.org>
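
The .organizationmap follows the same syntax as a .mailmap: each line maps an author, as normalized by the previous check-mailmap pass, to an organization name and contact address. For illustration only, entries could look like this ( hypothetical affiliations ):

Inktank <contact@inktank.com> Sage Weil <sage@inktank.com>
Inktank <contact@inktank.com> Greg Farnum <greg@inktank.com>
Cloudwatt <libre.licensing@cloudwatt.com> Loic Dachary <loic@dachary.org>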

Becoming a Core Contributor: the fast track

Anyone willing to become a better Free Software contributor is invited to attend the next session of Upstream University in advance of FOSDEM. The training starts on the morning of January 30th, 2014, within walking distance of the Grand Place in Brussels.

Participating in Free Software projects is not just about technical skills: there will be informal follow-ups in bars and restaurants afterwards 🙂 This session will be the first to focus on Core Contributors and what it takes to become one, based on lessons learnt from OpenStack and Ceph.

Exploring Ceph cache pool implementation

Sage Weil and Greg Farnum's presentation during the 2013 Firefly Ceph Developer Summit is used as an introduction to the cache pool being implemented for the upcoming Firefly release.
The CEPH_OSD_OP_COPY_FROM and related rados operations were introduced in Emperor and are exercised by ceph_test_rados, which teuthology uses for integration tests, issuing COPY_FROM and COPY_GET at random.
After a cache pool has been defined using the osd tier commands, objects can be promoted to the cache pool ( see the corresponding test case ).
The HitSets keep track of which objects have been read or written ( using bloom filters ).
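
As a rough sketch of the commands involved ( pool names are hypothetical and the syntax may still change before the Firefly release ), a cache pool could be attached to an existing pool with:

ceph osd pool create rbd-cache 128
ceph osd tier add rbd rbd-cache
ceph osd tier cache-mode rbd-cache writeback
ceph osd tier set-overlay rbd rbd-cache
ceph osd pool set rbd-cache hit_set_type bloom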

Benchmarking Ceph erasure code plugins

The erasure code implementation in Ceph relies on the jerasure library. It is packaged into a plugin that is dynamically loaded by erasure coded pools.
The ceph_erasure_code_benchmark tool is implemented to help benchmark the competing erasure code plugin implementations and to find the best parameters for a given plugin. It shows the jerasure technique cauchy_good with a packet size of 3072 to be the most efficient on an Intel(R) Xeon(R) CPU E3-1245 V2 @ 3.40GHz when compiled with gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5). The test was done assuming each object is spread over six OSDs and two extra OSDs are used for parity ( K=6 and M=2 ):

  • Encoding: 4.2GB/s
  • Decoding: no processing necessary (because the code is systematic)
  • Recovering the loss of one OSD: 10GB/s
  • Recovering the loss of two OSDs: 3.2GB/s

The processing is done on the primary OSDs and is therefore distributed across the Ceph cluster. Encoding and decoding are an order of magnitude faster than the typical storage hardware throughput.
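
For reference, once Firefly is available the same parameters could be applied to an erasure coded pool along these lines ( a sketch, the profile and pool names are arbitrary ):

ceph osd erasure-code-profile set cauchybench \
    plugin=jerasure technique=cauchy_good packetsize=3072 k=6 m=2
ceph osd pool create ecpool 128 128 erasure cauchybench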

Profiling CPU usage of a ceph command (callgrind)

After compiling Ceph from sources with:

./configure --with-debug CFLAGS='-g' CXXFLAGS='-g'

The crushtool test mode is used to profile the crush implementation with:

valgrind --tool=callgrind \
         --callgrind-out-file=crush.callgrind \
         src/crushtool \
         -i src/test/cli/crushtool/one-hundered-devices.crushmap \
         --test --show-bad-mappings

The resulting crush.callgrind file can then be analyzed with

kcachegrind crush.callgrind
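
When no graphical environment is available, a plain text summary of the same data can be obtained with callgrind_annotate, which ships with valgrind:

callgrind_annotate crush.callgrind | head -30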


Any Ceph command can be profiled in this way.

Profiling CPU usage of a ceph command (gperftools)

After compiling Ceph from sources with:

./configure --with-debug CFLAGS='-g' CXXFLAGS='-g'

The crushtool test mode is used to profile the crush implementation with:

LD_PRELOAD=/usr/lib/libprofiler.so.0 \
CPUPROFILE=crush.prof src/crushtool \
  -i src/test/cli/crushtool/one-hundered-devices.crushmap \
  --test --show-bad-mappings

as instructed in the cpu profiler documentation. The resulting crush.prof file can then be analyzed with

google-pprof --ignore=vector --focus=bucket_choose \
  --gv ./src/crushtool crush.prof

and displays the result as a call graph focused on the bucket_choose function.
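
A plain text summary can be obtained instead of the graph with, for instance:

google-pprof --text ./src/crushtool crush.prof | head -20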

Any Ceph command can be profiled in this way.

Testing a Ceph crush map

After modifying a crush map, it should be tested to check that all rules can provide the specified number of replicas. If a pool is created to use the metadata rule with seven replicas, could it fail to find enough devices? The crushtool test mode can be used to simulate the situation as follows:

$ crushtool -i the-new-crush-map --test --show-bad-mappings
bad mapping rule 1 x 781 num_rep 7 result [8,10,2,11,6,9]

The output shows that for rule 1 ( metadata by default is rule 1 ), an attempt to find seven replicas ( num_rep 7 ) for the object 781 ( the hash of its name ) failed and only returned six devices ( [8,10,2,11,6,9] ). This can be resolved by increasing the number of devices, lowering the number of replicas or changing the way replicas are selected.

When all attempts to find the required number of replicas are one device short, it simply means there are not enough devices to satisfy the rule and the only solution is to add at least one. CRUSH may not find a device mapping that satisfies all constraints the first time around and it will need to try again. If it fails more than fifty times it will give up and return fewer devices than required. Lowering the required number of replicas is one way to solve this problem.

Although it is possible to increase the number of times CRUSH will try, this is dangerous on a running cluster because it may modify the mapping for existing objects.
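
For completeness, a map such as the-new-crush-map above is typically extracted from the running cluster, decompiled, edited by hand and recompiled along these lines ( the file names are arbitrary ):

ceph osd getcrushmap -o the-crush-map
crushtool -d the-crush-map -o the-crush-map.txt
crushtool -c the-crush-map.txt -o the-new-crush-map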

Manage a multi-datacenter crush map with the command line

A new datacenter is added to the crush map of a Ceph cluster:

# ceph osd crush add-bucket fsf datacenter
added bucket fsf type datacenter to crush map
# ceph osd crush move fsf root=default
moved item id -13 name 'fsf' to location {root=default} in crush map
# ceph osd tree
# id    weight  type name       up/down reweight
-13     0               datacenter fsf
-5      7.28            datacenter ovh
-2      1.82                    host bm0014
0       1.82                            osd.0   up      1
...

The datacenter bucket type already exists in the default crush map that is provided when the cluster is created. The fsf bucket is then moved ( with crush move ) to the root of the crush map.
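
Once more than one datacenter is populated, replicas can be spread across them with a rule using the datacenter failure domain, for instance ( a sketch, the rule name is arbitrary and the syntax may vary between releases ):

ceph osd crush rule create-simple spread-by-datacenter default datacenter

The resulting rule can then be assigned to a pool with ceph osd pool set and the crush_ruleset key.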

Mixing Ceph and LVM volumes in OpenStack

Ceph pools are defined to collocate volumes and instances in OpenStack Havana. For volumes that do not need the resilience provided by Ceph, an LVM cinder backend is defined in /etc/cinder/cinder.conf:

[lvm]
volume_group=cinder-volumes
volume_driver=cinder.volume.drivers.lvm.LVMISCSIDriver
volume_backend_name=LVM

and appended to the list of existing backends:

enabled_backends=rbd-default,rbd-ovh,rbd-hetzner,rbd-cloudwatt,lvm
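
For comparison, each of the rbd-* backends listed above has a section of the same form. A minimal sketch ( the pool name and values are hypothetical; real deployments also set options such as rbd_user and rbd_secret_uuid ) could be:

[rbd-default]
volume_driver=cinder.volume.drivers.rbd.RBDDriver
rbd_pool=default
volume_backend_name=rbd-default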

A cinder volume type is created and associated with the LVM backend:

# cinder type-create lvm
+--------------------------------------+------+
|                  ID                  | Name |
+--------------------------------------+------+
| c77552ff-e513-4851-a5e6-2c83d0acb998 | lvm  |
+--------------------------------------+------+
# cinder type-key lvm set volume_backend_name=LVM
#  cinder extra-specs-list
+--------------------------------------+-----------+--------------------------------------------+
|                  ID                  |    Name   |                extra_specs                 |
+--------------------------------------+-----------+--------------------------------------------+
...
| c77552ff-e513-4851-a5e6-2c83d0acb998 |    lvm    |      {u'volume_backend_name': u'LVM'}      |
...
+--------------------------------------+-----------+--------------------------------------------+

To reduce the network overhead, a backend availability zone is defined for each bare metal machine by adding the following to /etc/cinder/cinder.conf:

storage_availability_zone=bm0015

and restarting cinder-volume:

# restart cinder-volume
# sleep 5
# cinder-manage host list
host                            zone
...
bm0015.the.re@lvm               bm0015
...

where bm0015 is the hostname of the machine. To create an LVM backed volume located on bm0015:

cinder create --availability-zone bm0015 --volume-type lvm --display-name test 1
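
The placement of the new volume can be verified by looking at the os-vol-host-attr:host field, which should point at the bm0015 LVM backend:

cinder show test | grep os-vol-host-attr:host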

In order for the allocation of RBD volumes to keep working without specifying an availability zone, there must be at least one cinder-volume service running in the default availability zone ( presumably nova ) and configured with the expected RBD backends. This can be checked with:

# cinder-manage host list | grep nova
...
bm0017.the.re@rbd-cloudwatt     nova
bm0017.the.re@rbd-ovh           nova
bm0017.the.re@lvm               nova
bm0017.the.re@rbd-default       nova
bm0017.the.re@rbd-hetzner       nova
...

In the above, the lvm volume type is also available in the nova availability zone and is used as a catch-all when an LVM volume is preferred but collocating it on the same machine as the instance does not matter.

Creating a Ceph OSD from a designated disk partition

When a new Ceph OSD is set up with ceph-disk on a designated disk partition ( say /dev/sdc3 ), the partition will not be prepared automatically and the sgdisk command must be run manually:

# osd_uuid=$(uuidgen)
# partition_number=3
# ptype_tobe=89c57f98-2fe5-4dc0-89c1-f3ad0ceff2be
# sgdisk --change-name="${partition_number}:ceph data" \
       --partition-guid="${partition_number}:${osd_uuid}" \
       --typecode="${partition_number}:${ptype_tobe}" \
       /dev/sdc
# sgdisk --info=${partition_number} /dev/sdc
Partition GUID code: 89C57F98-2FE5-4DC0-89C1-F3AD0CEFF2BE (Unknown)
Partition unique GUID: 22FD939D-C203-43A9-966A-04570B63FABB
...
Partition name: 'ceph data'

The ptype_tobe is a partition type known to Ceph, set while the partition is being worked on. Assuming /dev/sda is an SSD from which a journal partition can be created, the OSD can be prepared with:

# ceph-disk prepare --osd-uuid "$osd_uuid" \
     --fs-type xfs --cluster ceph -- \
     /dev/sdc${partition_number} /dev/sda
WARNING:ceph-disk:OSD will not be hot-swappable if ...
Information: Moved requested sector from 34 to 2048 in
order to align on 2048-sector boundaries.
The operation has completed successfully.
meta-data=/dev/sdc3              isize=2048   agcount=4, agsize=61083136 blks
         =                       sectsz=512   attr=2, projid32bit=0
data     =                       bsize=4096   blocks=244332544, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=4096   blocks=119303, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

The journal and data partitions should be associated with each other:

# ceph-disk list
/dev/sda :
 /dev/sda1 ceph journal, for /dev/sdc3
/dev/sdb :
 /dev/sdb2 other, ext4, mounted on /
 /dev/sdb3 swap, swap
/dev/sdc :
 /dev/sdc1 other, primary
 /dev/sdc2 other, ext4, mounted on /mnt
 /dev/sdc3 ceph data, prepared, cluster ceph, journal /dev/sda1

The type of the partition can then be changed so that udev-triggered scripts notice it and provision the OSD.

# ptype=4fbd7e29-9d25-41b8-afd0-062c0ceff05d
# sgdisk --typecode="${partition_number}:${ptype}" /dev/sdc
# udevadm trigger --subsystem-match=block --action=add
# df | grep /var/lib/ceph
/dev/sdc3       932G 160M  931G   1% /var/lib/ceph/osd/ceph-9
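
Once the udev scripts have activated it, the new OSD should also show up in the output of ceph osd tree, which can be checked with something like:

ceph osd tree | grep -w osd.9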