SSE optimization for erasure code in Ceph

The jerasure library is the default erasure code plugin of Ceph. Its companion library gf-complete supports SSE optimizations that are enabled at compile time, when the compiler flags allow them (-msse4.2 etc.). The jerasure plugin (and gf-complete with it) is therefore compiled multiple times with various levels of SSE features:

  • jerasure_sse4 uses SSE4.2, SSE4.1, SSSE3, SSE3, SSE2, SSE
  • jerasure_sse3 uses SSSE3, SSE3, SSE2, SSE
  • jerasure_generic uses no SSE instructions

When an OSD loads the jerasure plugin, the CPU features are probed and the variant matching the available features is selected.
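
As an illustration only, here is a minimal sketch of that kind of probing using GCC's __builtin_cpu_supports; the function name select_jerasure_variant and the selection logic are assumptions, not Ceph's actual loader code:

/* Sketch: return the most optimized plugin variant this CPU can run. */
#include <stdio.h>

const char *select_jerasure_variant(void) {
    __builtin_cpu_init();                      /* fill in GCC's CPU model */
    if (__builtin_cpu_supports("sse4.2"))
        return "jerasure_sse4";
    if (__builtin_cpu_supports("ssse3"))
        return "jerasure_sse3";
    return "jerasure_generic";
}

int main(void) {
    printf("%s\n", select_jerasure_variant()); /* e.g. jerasure_sse4 */
    return 0;
}
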
The gf-complete source code is cleanly divided into functions that take advantage of specific SSE features. It should be easy to use the ifunc attribute to semi-manually select each function individually, at runtime and without performance penalty (because the choice is made the first time the function is called and recorded for later calls). With such fine-grained selection, there would be no need to compile three plugins because each function would be compiled with exactly the set of flags it needs.
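
Here is a minimal, self-contained sketch of the ifunc mechanism; the function names (region_xor and its variants) are made up for illustration and are not gf-complete's actual symbols:

/* The resolver runs once, the first time region_xor is called; the
 * dynamic linker records its answer so later calls pay no extra cost. */
#include <stddef.h>
#include <stdio.h>

/* Plain C fallback. */
static void region_xor_generic(const char *src, char *dst, size_t bytes) {
    for (size_t i = 0; i < bytes; i++)
        dst[i] ^= src[i];
}

/* Stand-in for a variant that a real build would compile with -mssse3. */
__attribute__((target("ssse3")))
static void region_xor_ssse3(const char *src, char *dst, size_t bytes) {
    for (size_t i = 0; i < bytes; i++)
        dst[i] ^= src[i];
}

/* Resolver: returns the pointer the public symbol will be bound to. */
static void (*resolve_region_xor(void))(const char *, char *, size_t) {
    __builtin_cpu_init();
    return __builtin_cpu_supports("ssse3") ? region_xor_ssse3
                                           : region_xor_generic;
}

/* Public entry point, dispatched at runtime via the resolver. */
void region_xor(const char *src, char *dst, size_t bytes)
    __attribute__((ifunc("resolve_region_xor")));

int main(void) {
    char src[4] = {1, 2, 3, 4}, dst[4] = {0};
    region_xor(src, dst, sizeof(dst));              /* first call triggers the resolver */
    printf("%d %d %d %d\n", dst[0], dst[1], dst[2], dst[3]); /* 1 2 3 4 */
    return 0;
}

Each variant can then be built with exactly the compiler flags it needs, and the ifunc indirection costs nothing after the first call; the attribute is a GNU toolchain extension for ELF targets.
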

Testing CPU features with Qemu

The Ceph erasure code plugin must run on Intel CPUs that have no SSE4.2 support. A Qemu virtual machine is run without SSE4.2 support:

qemu-system-x86_64 -machine accel=kvm:tcg -m 2048 \
  -drive file=server.img -boot c \
  -display sdl \
  -net nic -net user,hostfwd=tcp::2222-:22 \
  -fsdev local,security_model=passthrough,id=fsdev0,path=~/ceph \
  -device virtio-9p-pci,id=fs0,fsdev=fsdev0,mount_tag=hostshare

The emulated CPU has no SSE4.2 although the native CPU does (note that /proc/cpuinfo spells the flag sse4_2; the dot in the grep pattern matches it):

$ grep sse4.2 /proc/cpuinfo | wc -l
4
$ ssh -p 2222 loic@127.0.0.1 grep sse4.2 /proc/cpuinfo | wc -l
0

The local development directory is shared with the guest as a Plan 9 (9p) filesystem over virtio and mounted inside the VM:

sudo mount -t 9p -o trans=virtio,version=9p2000.L hostshare /home/loic/ceph

and the functional test is run to assert that encoding and decoding an object succeeds:

$ cd /home/loic/ceph/src
$ ./unittest_erasure_code_jerasure
...
[----------] Global test environment tear-down
[==========] 16 tests from 8 test cases ran. (30 ms total)
[  PASSED  ] 16 tests.

Your first exabyte in a Ceph cluster

$ rbd create --size $((1024 * 1024 * 1024 * 1024)) tiny
$ rbd info tiny
rbd image 'tiny':
	size 1024 PB in 274877906944 objects
	order 22 (4096 kB objects)
	block_name_prefix: rb.0.1009.6b8b4567
	format: 1
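
The numbers are consistent: rbd create interprets --size in megabytes, so 2^40 MB is 2^60 bytes, i.e. one exabyte (reported as 1024 PB), and with order 22 the image is split into 4 MiB (2^22 bytes) objects, hence 2^60 / 2^22 = 2^38 = 274877906944 of them.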

Note: rbd rm tiny will take a long time.

The footprints of 192 Ceph developers

Gource is run on the Ceph git repository for each of the 192 developers who contributed to its development over the past six years. Their footprint is the last image of a video clip created from all the commits they authored.

  • Sage Weil (video clip)
  • Yehuda Sadeh (video clip)
  • Greg Farnum (video clip)
  • Samuel Just (video clip)
  • Colin P. McCabe (video clip)
  • Danny Al-Gaaf (video clip)
  • Josh Durgin (video clip)
  • John Wilkins (video clip)
  • Loic Dachary (video clip)
  • Dan Mick (video clip)
Continue reading “The footprints of 192 Ceph developers”

BIOS and console access via VNC

The AMT of an ASRock Q87M motherboard is configured to enable remote power control (power cycle) and display of the BIOS and the console. It is a cheap alternative to iLO or IPMI that can be used with Free Software. AMT is a feature of vPro that was available in 2011 with some Sandy Bridge chipsets. It is included in many of the more recent Haswell chipsets.

The following is a screenshot of vinagre connected to the AMT VNC server displaying the BIOS of the ASRock Q87M motherboard.


Continue reading “BIOS and console access via VNC”

A subjective view of the birth of Erasure Code in Ceph

Erasure code is also RAID5, which lets you lose a hard disk without losing your data. From the user's point of view the concept is simple and useful, but for the person in charge of designing the software that does the work, it is a headache. Three-disk RAID5 enclosures can be found in any shop: when one of them stops working, you replace it and the files are still there. One could imagine the same thing with six disks, two of which fail simultaneously. But no: instead of relying on a XOR operation, which can be grasped in five minutes, it takes Galois fields, a solid mathematical background and a lot of computation. To make matters worse, in a distributed storage system such as Ceph, disks are often temporarily disconnected because of network unavailability.
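
To make the "grasped in five minutes" point concrete, here is the whole of RAID5-style parity as a C sketch; it is only an illustration, not code from Ceph:

/* Parity is the XOR of the data blocks; any single lost block is
 * recovered by XOR-ing the parity with the surviving blocks. */
#include <assert.h>
#include <string.h>

#define BLOCK 8

int main(void) {
    unsigned char a[BLOCK] = {0x12, 0x34, 0x56, 0x78, 0x9a, 0xbc, 0xde, 0xf0};
    unsigned char b[BLOCK] = {0x0f, 0x1e, 0x2d, 0x3c, 0x4b, 0x5a, 0x69, 0x78};
    unsigned char parity[BLOCK], recovered[BLOCK];

    for (int i = 0; i < BLOCK; i++)        /* compute the parity block */
        parity[i] = a[i] ^ b[i];

    for (int i = 0; i < BLOCK; i++)        /* pretend block b was lost */
        recovered[i] = a[i] ^ parity[i];

    assert(memcmp(recovered, b, BLOCK) == 0);
    return 0;
}
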
Continue reading “A subjective view of the birth of Erasure Code in Ceph”

Benchmarking Ceph jerasure version 2 plugin

The Ceph erasure code plugin benchmark results for jerasure version 1 are compared with those obtained after an upgrade to jerasure version 2, using the same command on the same hardware.

  • Encoding: 5.2GB/s which is ~20% better than 4.2GB/s
  • Decoding: no processing necessary (because the code is systematic)
  • Recovering the loss of one OSD: 11.3GB/s which is ~13% better than 10GB/s
  • Recovering the loss of two OSDs: 4.42GB/s which is ~35% better than 3.2GB/s

The relevant lines from the full output of the benchmark are:

seconds   KB       plugin   k m workload iterations size    erasures
0.088136  1048576  jerasure 6 2 decode   1024       1048576 1
0.226118  1048576  jerasure 6 2 decode   1024       1048576 2
0.191825  1048576  jerasure 6 2 encode   1024       1048576 0
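
The GB/s figures above follow directly from these lines: each run processes 1024 iterations of 1 MB objects, i.e. 1 GB in total, so the throughput is 1 GB divided by the seconds column: 1 / 0.191825 ≈ 5.2 GB/s for encoding, 1 / 0.088136 ≈ 11.3 GB/s when recovering one erasure and 1 / 0.226118 ≈ 4.42 GB/s when recovering two.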

The improvements are likely to be greater for larger K+M values.

OpenStack Upstream Training in Atlanta

The OpenStack Foundation is delivering a training program to help new OpenStack developers integrate their own roadmap into that of the OpenStack project more quickly. If you’re a new OpenStack contributor or plan on becoming one soon, you should sign up for the next OpenStack Upstream Training in Atlanta, May 10-11. Participation is also strongly advised for first-time attendees of the OpenStack Design Summit.

Continue reading “OpenStack Upstream Training in Atlanta”