Exploring Ceph cache pool implementation

The presentation given by Sage Weil and Greg Farnum during the 2013 Firefly Ceph Developer Summit is used as an introduction to the cache pool being implemented for the upcoming Firefly release.
The CEPH_OSD_OP_COPY_FROM family of rados operations was introduced in Emperor and is exercised by ceph_test_rados, which teuthology uses for integration tests by issuing COPY_FROM and COPY_GET at random.
After a cache pool has been defined using the osd tier commands, objects can be promoted to the cache pool ( see the corresponding test case ).
The HitSets keep track of which objects have been read or written ( using bloom filters ).
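
A minimal sketch of how a cache tier might be declared with the osd tier commands being developed for Firefly follows; the pool names are assumptions and the option names may still change before the release:

# basepool and cachepool are hypothetical pool names
ceph osd tier add basepool cachepool
ceph osd tier cache-mode cachepool writeback
ceph osd tier set-overlay basepool cachepool
# HitSets are configured per pool ( bloom filter based )
ceph osd pool set cachepool hit_set_type bloom
ceph osd pool set cachepool hit_set_count 8
ceph osd pool set cachepool hit_set_period 3600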

Benchmarking Ceph erasure code plugins

The erasure code implementation in Ceph relies on the jerasure library. It is packaged into a plugin that is dynamically loaded by erasure coded pools.
The ceph_erasure_code_benchmark tool is implemented to help benchmark the competing erasure code plugin implementations and to find the best parameters for a given plugin. It shows the jerasure technique cauchy_good with a packet size of 3072 to be the most efficient on an Intel(R) Xeon(R) CPU E3-1245 V2 @ 3.40GHz when compiled with gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5). The test was done assuming each object is spread over six OSDs and two extra OSDs are used for parity ( K=6 and M=2 ).

  • Encoding: 4.2GB/s
  • Decoding: no processing necessary (because the code is systematic)
  • Recovering the loss of one OSD: 10GB/s
  • Recovering the loss of two OSDs: 3.2GB/s

The processing is done on the primary OSDs and is therefore distributed over the Ceph cluster. Encoding and decoding are an order of magnitude faster than the typical storage hardware throughput.
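
For reference, the parameters above could be expressed as an erasure code profile when creating an erasure coded pool on a Firefly development build; this is a sketch only, the profile and pool names are assumptions and the syntax may change before the release:

# myprofile and ecpool are hypothetical names
ceph osd erasure-code-profile set myprofile \
     plugin=jerasure technique=cauchy_good packetsize=3072 k=6 m=2
ceph osd pool create ecpool 128 128 erasure myprofile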

Profiling CPU usage of a ceph command (callgrind)

After compiling Ceph from sources with:

./configure --with-debug CFLAGS='-g' CXXFLAGS='-g'

The crushtool test mode is used to profile the crush implementation with:

valgrind --tool=callgrind \
         --callgrind-out-file=crush.callgrind \
         src/crushtool \
         -i src/test/cli/crushtool/one-hundered-devices.crushmap \
         --test --show-bad-mappings

The resulting crush.callgrind file can then be analyzed with

kcachegrind crush.callgrind
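
When no graphical environment is available, a plain text summary can be produced instead, for instance with:

callgrind_annotate --inclusive=yes crush.callgrind | head -30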


Any Ceph command can be profiled in this way.

Profiling CPU usage of a ceph command (gperftools)

After compiling Ceph from sources with:

./configure --with-debug CFLAGS='-g' CXXFLAGS='-g'

The crushtool test mode is used to profile the crush implementation with:

LD_PRELOAD=/usr/lib/libprofiler.so.0 \
CPUPROFILE=crush.prof src/crushtool \
  -i src/test/cli/crushtool/one-hundered-devices.crushmap \
  --test --show-bad-mappings

as instructed in the cpu profiler documentation. The resulting crush.prof file can then be analyzed with

google-pprof --ignore=vector --focus=bucket_choose \
  --gv ./src/crushtool crush.prof

and displays a call graph showing where the CPU time is spent in and around bucket_choose.

Any Ceph command can be profiled in this way.
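
If graphviz is not available, the same profile can be summarized as text, for instance with:

google-pprof --text ./src/crushtool crush.prof | head -10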

wget on an OpenStack instance hangs ? Try lowering the MTU

Why would an OpenStack instance fail to wget some URLs while others work perfectly ? For instance:

$ wget -O - 'https://ceph.com/git/?p=ceph.git;a=blob_plain;f=keys/autobuild.asc'
Connecting to ceph.com (ceph.com)|208.113.241.137|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/plain]
Saving to: `STDOUT'

    [<=>                                                                           ] 0           --.-K/s              ^

If it can be fixed by lowering the MTU from the default of 1500 to 1400 with:

$ sudo ip link set mtu 1400 dev eth0
$ sudo ip link show dev eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc pfifo_fast state UP qlen 1000
    link/ether fa:16:3e:85:ee:a5 brd ff:ff:ff:ff:ff:ff

it means the underlying OpenStack DHCP should be fixed to set the MTU to 1400.
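
Assuming the DHCP agent is neutron backed by dnsmasq, one way to do that is to push the MTU to the instances via DHCP option 26; the file paths below are assumptions and the agent must be restarted afterwards:

# /etc/neutron/dhcp_agent.ini ( assumed path )
dnsmasq_config_file = /etc/neutron/dnsmasq-neutron.conf

# /etc/neutron/dnsmasq-neutron.conf : option 26 is the interface MTU
dhcp-option-force=26,1400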

Testing a Ceph crush map

After modifying a crush map it should be tested to check that all rules can provide the specified number of replicas. If a pool is created to use the metadata rule with seven replicas, could it fail to find enough devices ? The crushtool test mode can be used to simulate the situation as follows:

$ crushtool -i the-new-crush-map --test --show-bad-mappings
bad mapping rule 1 x 781 num_rep 7 result [8,10,2,11,6,9]

The output shows that for rule 1 ( metadata by default is rule 1 ), an attempt to find seven replicas ( num_rep 7 ) for the object 781 (the hash of its name) failed and only returned six ( [8,10,2,11,6,9] ). It can be resolved by increasing the number of devices, lowering the number of replicas or changing the way replicas are selected.
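
To focus on a single replica count instead of scanning a range of values, the test can be restricted with --num-rep, as in the following sketch reusing the same map:

crushtool -i the-new-crush-map --test --num-rep 7 --show-bad-mappings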

When all attempts to find the required number of replicas are one device short, it simply means there are not enough devices to satisfy the rule and the only solution is to add at least one. CRUSH may not find a device mapping that satisfies all constraints the first time around and will need to try again. If it fails more than fifty times it gives up and returns fewer devices than required. Lowering the required number of replicas is one way to solve this problem.

Although it is possible to increase the number of times CRUSH will try, this is dangerous on a running cluster because it may modify the mapping for existing objects.
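
For the record, the number of attempts can be raised and re-tested entirely offline, without injecting anything into a running cluster; the value below is only an example and the flag availability depends on the crushtool version:

# produce a modified map with more attempts and test it offline
crushtool -i the-new-crush-map --set-choose-total-tries 100 \
          -o the-new-crush-map-more-tries
crushtool -i the-new-crush-map-more-tries --test --show-bad-mappings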

Manage a multi-datacenter crush map with the command line

A new datacenter is added to the crush map of a Ceph cluster:

# ceph osd crush add-bucket fsf datacenter
added bucket fsf type datacenter to crush map
# ceph osd crush move fsf root=default
moved item id -13 name 'fsf' to location {root=default} in crush map
# ceph osd tree
# id    weight  type name       up/down reweight
-13     0               datacenter fsf
-5      7.28            datacenter ovh
-2      1.82                    host bm0014
0       1.82                            osd.0   up      1
...

The datacenter bucket type is already defined in the default crush map provided when the cluster is created. The fsf bucket is moved ( with crush move ) to the root of the crush map.
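
Hosts can then be moved under the new datacenter in the same way; bm0101 below is a hypothetical host name:

# bm0101 is a hypothetical host bucket
ceph osd crush move bm0101 root=default datacenter=fsf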

Transparently route a public subnet through shorewall

The 3.20.168.160/27 subnet is routed to a firewall running shorewall. Behind the firewall is an OpenStack cluster running a neutron l3 agent, known to the firewall as 192.168.25.221. A parallel zone is defined as follows:

diff -r 34984beb770d hosts
--- /dev/null   Thu Jan 01 00:00:00 1970 +0000
+++ b/hosts     Wed Nov 20 14:59:09 2013 +0100
@@ -0,0 +1,1 @@
+opens  eth0:3.20.168.160/27
diff -r 34984beb770d policy
--- a/policy    Wed Jun 05 00:19:12 2013 +0200
+++ b/policy    Wed Nov 20 14:59:09 2013 +0100
@@ -113,6 +113,7 @@
 # If you want to force clients to access the Internet via a proxy server
 # on your firewall, change the loc to net policy to REJECT info.
 loc            net             ACCEPT
+loc            opens           ACCEPT
 loc            $FW             ACCEPT
 loc            all             REJECT          info

@@ -124,6 +125,7 @@
 # This may be useful if you run a proxy server on the firewall.
 #$FW           net             REJECT          info
 $FW            net             ACCEPT
+$FW            opens           ACCEPT
 $FW            loc             ACCEPT
 $FW            all             REJECT          info

@@ -132,6 +134,7 @@
 #
 net            $FW             DROP            info
 net            loc             DROP            info
+net            opens           ACCEPT
 net            all             DROP            info

 # THE FOLLOWING POLICY MUST BE LAST
diff -r 34984beb770d zones
--- a/zones     Wed Jun 05 00:19:12 2013 +0200
+++ b/zones     Wed Nov 20 14:59:09 2013 +0100
@@ -115,5 +115,6 @@
 fw     firewall
 net    ipv4
 loc    ipv4
+opens  ipv4

and incoming packets from the net zone are accepted for the subnet when targeting the loc zone, which contains the 192.168.25.0/24 subnet:

ACCEPT          net             loc:3.20.168.163/27

A route is added:

ip r add 3.20.168.160/27 via 192.168.25.221
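
To survive a reboot, the route can also be made persistent, for instance with a post-up line in /etc/network/interfaces; this sketch assumes a Debian style configuration and that eth0 is the interface facing the 192.168.25.0/24 network:

# appended to the eth0 stanza of /etc/network/interfaces ( assumed interface )
post-up ip route add 3.20.168.160/27 via 192.168.25.221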

A ping from the firewall shows up on the destination interface:

# tcpdump -i eth0 -n host  3.20.168.163
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
15:03:29.258592 IP 192.168.25.253 > 3.20.168.163: ICMP echo request, id 48701, seq 1, length 64

even though it times out because the IP is not actually assigned to anything:

# ping -c 1 3.20.168.163
PING 3.20.168.163 (3.20.168.163) 56(84) bytes of data.
--- 3.20.168.163 ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms

The subnet must be excluded from the masquerading rules by setting /etc/shorewall/masq as follows:

eth1                    eth0!3.20.168.160/27

which says to masquerade all but the subnet that is transparently routed. The result can then be checked from a virtual machine to which an IP has been routed with:

# wget --quiet -O - http://bot.whatismyipaddress.com ; echo
3.20.168.169

Mixing Ceph and LVM volumes in OpenStack

Ceph pools are defined to collocate volumes and instances in OpenStack Havana. For volumes that do not need the resilience provided by Ceph, an LVM cinder backend is defined in /etc/cinder/cinder.conf:

[lvm]
volume_group=cinder-volumes
volume_driver=cinder.volume.drivers.lvm.LVMISCSIDriver
volume_backend_name=LVM

and appended to the list of existing backends:

enabled_backends=rbd-default,rbd-ovh,rbd-hetzner,rbd-cloudwatt,lvm
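
For comparison, each of the rbd-* backends is defined by a section along the following lines; the pool, user and uuid values are placeholders, not the actual configuration:

[rbd-default]
volume_driver=cinder.volume.drivers.rbd.RBDDriver
# placeholder values
rbd_pool=volumes
rbd_user=cinder
rbd_secret_uuid=00000000-0000-0000-0000-000000000000
volume_backend_name=rbd-default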

A cinder volume type is created and associated with the LVM backend:

# cinder type-create lvm
+--------------------------------------+------+
|                  ID                  | Name |
+--------------------------------------+------+
| c77552ff-e513-4851-a5e6-2c83d0acb998 | lvm  |
+--------------------------------------+------+
# cinder type-key lvm set volume_backend_name=LVM
#  cinder extra-specs-list
+--------------------------------------+-----------+--------------------------------------------+
|                  ID                  |    Name   |                extra_specs                 |
+--------------------------------------+-----------+--------------------------------------------+
...
| c77552ff-e513-4851-a5e6-2c83d0acb998 |    lvm    |      {u'volume_backend_name': u'LVM'}      |
...
+--------------------------------------+-----------+--------------------------------------------+

To reduce the network overhead, a backend availability zone is defined for each bare metal machine by adding to /etc/cinder/cinder.conf:

storage_availability_zone=bm0015

and restarting cinder-volume:

# restart cinder-volume
# sleep 5
# cinder-manage host list
host                            zone
...
bm0015.the.re@lvm               bm0015
...

where bm0015 is the hostname of the machine. To create an LVM-backed volume located on bm0015:

cinder create --availability-zone bm0015 --volume-type lvm --display-name test 1

In order for the allocation of RBD volumes to keep working without specifying an availability zone, there must be at least one cinder-volume service running in the default availability zone ( presumably nova ) and configured with the expected RBD backends. This can be checked with:

# cinder-manage host list | grep nova
...
bm0017.the.re@rbd-cloudwatt     nova
bm0017.the.re@rbd-ovh           nova
bm0017.the.re@lvm               nova
bm0017.the.re@rbd-default       nova
bm0017.the.re@rbd-hetzner       nova
...

In the above, the lvm volume type is also available in the nova availability zone and is used as a catch-all when an LVM volume is preferred but collocating it on the same machine as the instance does not matter.

Creating a Ceph OSD from a designated disk partition

When a new Ceph OSD is set up with ceph-disk on a designated disk partition ( say /dev/sdc3 ), the partition is not prepared automatically and the sgdisk command must be run manually:

# osd_uuid=$(uuidgen)
# partition_number=3
# ptype_tobe=89c57f98-2fe5-4dc0-89c1-f3ad0ceff2be
# sgdisk --change-name="${partition_number}:ceph data" \
       --partition-guid="${partition_number}:${osd_uuid}" \
       --typecode="${partition_number}:${ptype_tobe}" \
       /dev/sdc
# sgdisk --info=${partition_number} /dev/sdc
Partition GUID code: 89C57F98-2FE5-4DC0-89C1-F3AD0CEFF2BE (Unknown)
Partition unique GUID: 22FD939D-C203-43A9-966A-04570B63FABB
...
Partition name: 'ceph data'

The ptype_tobe GUID is a partition type known to Ceph: it marks a partition that is being worked on. Assuming /dev/sda is an SSD disk from which a journal partition can be created, the OSD can be prepared with:

# ceph-disk prepare --osd-uuid "$osd_uuid" \
     --fs-type xfs --cluster ceph -- \
     /dev/sdc${partition_number} /dev/sda
WARNING:ceph-disk:OSD will not be hot-swappable if ...
Information: Moved requested sector from 34 to 2048 in
order to align on 2048-sector boundaries.
The operation has completed successfully.
meta-data=/dev/sdc3              isize=2048   agcount=4, agsize=61083136 blks
         =                       sectsz=512   attr=2, projid32bit=0
data     =                       bsize=4096   blocks=244332544, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=4096   blocks=119303, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

The journal and data partitions should be associated with each other:

# ceph-disk list
/dev/sda :
 /dev/sda1 ceph journal, for /dev/sdc3
/dev/sdb :
 /dev/sdb2 other, ext4, mounted on /
 /dev/sdb3 swap, swap
/dev/sdc :
 /dev/sdc1 other, primary
 /dev/sdc2 other, ext4, mounted on /mnt
 /dev/sdc3 ceph data, prepared, cluster ceph, journal /dev/sda1

The type of the partition can then be changed so that the udev triggered scripts notice it and provision the OSD:

# ptype=4fbd7e29-9d25-41b8-afd0-062c0ceff05d
# sgdisk --typecode="${partition_number}:${ptype}" /dev/sdc
# udevadm trigger --subsystem-match=block --action=add
# df | grep /var/lib/ceph
/dev/sdc3       932G 160M  931G   1% /var/lib/ceph/osd/ceph-9
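
The new OSD ( osd.9, matching the mount point above ) should now show up in the cluster, which can be double checked with:

ceph osd stat
ceph osd tree | grep osd.9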