Manage a multi-datacenter crush map with the command line

A new datacenter is added to the crush map of a Ceph cluster:

# ceph osd crush add-bucket fsf datacenter
added bucket fsf type datacenter to crush map
# ceph osd crush move fsf root=default
moved item id -13 name 'fsf' to location {root=default} in crush map
# ceph osd tree
# id    weight  type name       up/down reweight
-13     0               datacenter fsf
-5      7.28            datacenter ovh
-2      1.82                    host bm0014
0       1.82                            osd.0   up      1
...

The datacenter bucket type already exists in the default crush map provided when the cluster is created. The fsf bucket is moved (with crush move) to the root of the crush map.
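Hosts can then be placed under the new datacenter bucket with the same command. For instance (bm0016 is a hypothetical host bucket, not one of the hosts shown above):

# ceph osd crush move bm0016 datacenter=fsf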

Transparently route a public subnet through shorewall

The 3.20.168.160/27 subnet is routed to a firewall running shorewall. Behind the firewall is an OpenStack cluster running a neutron l3 agent, known to the firewall as 192.168.25.221. A parallel zone is defined as follows:

diff -r 34984beb770d hosts
--- /dev/null   Thu Jan 01 00:00:00 1970 +0000
+++ b/hosts     Wed Nov 20 14:59:09 2013 +0100
@@ -0,0 +1,1 @@
+opens  eth0:3.20.168.160/27
diff -r 34984beb770d policy
--- a/policy    Wed Jun 05 00:19:12 2013 +0200
+++ b/policy    Wed Nov 20 14:59:09 2013 +0100
@@ -113,6 +113,7 @@
 # If you want to force clients to access the Internet via a proxy server
 # on your firewall, change the loc to net policy to REJECT info.
 loc            net             ACCEPT
+loc            opens           ACCEPT
 loc            $FW             ACCEPT
 loc            all             REJECT          info

@@ -124,6 +125,7 @@
 # This may be useful if you run a proxy server on the firewall.
 #$FW           net             REJECT          info
 $FW            net             ACCEPT
+$FW            opens           ACCEPT
 $FW            loc             ACCEPT
 $FW            all             REJECT          info

@@ -132,6 +134,7 @@
 #
 net            $FW             DROP            info
 net            loc             DROP            info
+net            opens           ACCEPT
 net            all             DROP            info

 # THE FOLLOWING POLICY MUST BE LAST
diff -r 34984beb770d zones
--- a/zones     Wed Jun 05 00:19:12 2013 +0200
+++ b/zones     Wed Nov 20 14:59:09 2013 +0100
@@ -115,5 +115,6 @@
 fw     firewall
 net    ipv4
 loc    ipv4
+opens  ipv4

and incoming packets from the net zone are accepted for the subnet when targeting the loc zone, which contains the 192.168.25.0/24 subnet:

ACCEPT          net             loc:3.20.168.163/27

A route is added:

ip r add 3.20.168.160/27 via 192.168.25.221

A ping from the firewall will show up on the destination interface:

# tcpdump -i eth0 -n host  3.20.168.163
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
15:03:29.258592 IP 192.168.25.253 > 3.20.168.163: ICMP echo request, id 48701, seq 1, length 64

even though it times out because the IP is not actually there:

# ping -c 1 3.20.168.163
PING 3.20.168.163 (3.20.168.163) 56(84) bytes of data.
--- 3.20.168.163 ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms

The subnet must be excluded from the masquerading rules by setting /etc/shorewall/masq as follows:

eth1                    eth0!3.20.168.160/27

which says to masquerade all but the subnet that is transparently routed. The result can then be checked from a virtual machine to which an IP has been routed with:

# wget --quiet -O - http://bot.whatismyipaddress.com ; echo
3.20.168.169

Mixing Ceph and LVM volumes in OpenStack

Ceph pools are defined to collocate volumes and instances in OpenStack Havana. For volumes that do not need the resilience provided by Ceph, an LVM cinder backend is defined in /etc/cinder/cinder.conf:

[lvm]
volume_group=cinder-volumes
volume_driver=cinder.volume.drivers.lvm.LVMISCSIDriver
volume_backend_name=LVM

and appended to the list of existing backends:

enabled_backends=rbd-default,rbd-ovh,rbd-hetzner,rbd-cloudwatt,lvm
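The rbd-* entries refer to backend sections defined elsewhere in the same file. A minimal sketch of one such section, assuming the standard Havana RBD driver and a volume_backend_name of RBD-OVH (the option values shown here are assumptions, not copied from the cluster above), could be:

[rbd-ovh]
volume_driver=cinder.volume.drivers.rbd.RBDDriver
rbd_pool=ovh
volume_backend_name=RBD-OVH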

A cinder volume type is created and associated with the LVM backend:

# cinder type-create lvm
+--------------------------------------+------+
|                  ID                  | Name |
+--------------------------------------+------+
| c77552ff-e513-4851-a5e6-2c83d0acb998 | lvm  |
+--------------------------------------+------+
# cinder type-key lvm set volume_backend_name=LVM
#  cinder extra-specs-list
+--------------------------------------+-----------+--------------------------------------------+
|                  ID                  |    Name   |                extra_specs                 |
+--------------------------------------+-----------+--------------------------------------------+
...
| c77552ff-e513-4851-a5e6-2c83d0acb998 |    lvm    |      {u'volume_backend_name': u'LVM'}      |
...
+--------------------------------------+-----------+--------------------------------------------+

To reduce the network overhead, a backend availability zone is defined for each bare metal machine by adding to /etc/cinder/cinder.conf:

storage_availability_zone=bm0015

and restarting cinder-volume:

# restart cinder-volume
# sleep 5
# cinder-manage host list
host                            zone
...
bm0015.the.re@lvm               bm0015
...

where bm0015 is the hostname of the machine. To create an LVM-backed volume located on bm0015:

cinder create --availability-zone bm0015 --volume-type lvm --display-name test 1

In order for the allocation of RBD volumes to keep working without specifying an availability zone, there must be at least one cinder-volume service running in the default availability zone (nova presumably) and configured with the expected RBD backends. This can be checked with:

# cinder-manage host list | grep nova
...
bm0017.the.re@rbd-cloudwatt     nova
bm0017.the.re@rbd-ovh           nova
bm0017.the.re@lvm               nova
bm0017.the.re@rbd-default       nova
bm0017.the.re@rbd-hetzner       nova
...

In the above, the lvm volume type is also available in the nova availability zone and is used as a catch-all when an LVM volume is preferred but collocating it on the same machine as the instance does not matter.
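For instance, a volume created without an explicit availability zone (the display name below is arbitrary) is scheduled in the default zone and may therefore be served by any of the hosts listed above:

cinder create --volume-type lvm --display-name anywhere 1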

Creating a Ceph OSD from a designated disk partition

When a new Ceph OSD is set up with ceph-disk on a designated disk partition (say /dev/sdc3), the partition will not be prepared automatically and the sgdisk command must be run manually:

# osd_uuid=$(uuidgen)
# partition_number=3
# ptype_tobe=89c57f98-2fe5-4dc0-89c1-f3ad0ceff2be
# sgdisk --change-name="${partition_number}:ceph data" \
       --partition-guid="${partition_number}:${osd_uuid}" \
       --typecode="${partition_number}:${ptype_tobe}" \
       /dev/sdc
# sgdisk --info=${partition_number} /dev/sdc
Partition GUID code: 89C57F98-2FE5-4DC0-89C1-F3AD0CEFF2BE (Unknown)
Partition unique GUID: 22FD939D-C203-43A9-966A-04570B63FABB
...
Partition name: 'ceph data'

The ptype_tobe GUID is a partition type known to Ceph, set while the partition is being worked on. Assuming /dev/sda is an SSD disk from which a journal partition can be created, the OSD can be prepared with:

# ceph-disk prepare --osd-uuid "$osd_uuid" \
     --fs-type xfs --cluster ceph -- \
     /dev/sdc${partition_number} /dev/sda
WARNING:ceph-disk:OSD will not be hot-swappable if ...
Information: Moved requested sector from 34 to 2048 in
order to align on 2048-sector boundaries.
The operation has completed successfully.
meta-data=/dev/sdc3              isize=2048   agcount=4, agsize=61083136 blks
         =                       sectsz=512   attr=2, projid32bit=0
data     =                       bsize=4096   blocks=244332544, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=4096   blocks=119303, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

The journal and data partitions should be associated with each other:

# ceph-disk list
/dev/sda :
 /dev/sda1 ceph journal, for /dev/sdc3
/dev/sdb :
 /dev/sdb2 other, ext4, mounted on /
 /dev/sdb3 swap, swap
/dev/sdc :
 /dev/sdc1 other, primary
 /dev/sdc2 other, ext4, mounted on /mnt
 /dev/sdc3 ceph data, prepared, cluster ceph, journal /dev/sda1

The type of the partition can be changed so that udev-triggered scripts notice it and provision the OSD:

# ptype=4fbd7e29-9d25-41b8-afd0-062c0ceff05d
# sgdisk --typecode="${partition_number}:${ptype}" /dev/sdc
# udevadm trigger --subsystem-match=block --action=add
# df | grep /var/lib/ceph
/dev/sdc3       932G 160M  931G   1% /var/lib/ceph/osd/ceph-9
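The new OSD (osd.9 above) should also have been added to the crush map, which can be verified with, for instance:

# ceph osd tree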

Migrating from ganeti to OpenStack via Ceph

On ganeti, shut down the instance and activate its disks:

z2-8:~# gnt-instance shutdown nerrant
Waiting for job 1089813 for nerrant...
z2-8:~# gnt-instance activate-disks nerrant
z2-8.host.gnt:disk/0:/dev/drbd10

On an OpenStack Havana installation using a Ceph cinder backend, create a volume with the same size:

# cinder create --volume-type ovh --display-name nerrant 10
+---------------------+--------------------------------------+
|       Property      |                Value                 |
+---------------------+--------------------------------------+
|     attachments     |                  []                  |
|  availability_zone  |                 nova                 |
|       bootable      |                false                 |
|      created_at     |      2013-11-12T13:00:39.614541      |
| display_description |                 None                 |
|     display_name    |              nerrant                 |
|          id         | 3ec2035e-ff76-43a9-bbb3-6c003c1c0e16 |
|       metadata      |                  {}                  |
|         size        |                  10                  |
|     snapshot_id     |                 None                 |
|     source_volid    |                 None                 |
|        status       |               creating               |
|     volume_type     |                 ovh                  |
+---------------------+--------------------------------------+
# rbd --pool ovh info volume-3ec2035e-ff76-43a9-bbb3-6c003c1c0e16
rbd image 'volume-3ec2035e-ff76-43a9-bbb3-6c003c1c0e16':
        size 10240 MB in 2560 objects
        order 22 (4096 KB objects)
        block_name_prefix: rbd_data.90f0417089fa
        format: 2
        features: layering

On a host connected to the Ceph cluster and running a Linux kernel > 3.8 (because of the format: 2 above), map it to a block device with:

# rbd map --pool ovh volume-3ec2035e-ff76-43a9-bbb3-6c003c1c0e16
# rbd showmapped
id pool image                                       snap device
1  ovh  volume-3ec2035e-ff76-43a9-bbb3-6c003c1c0e16 -    /dev/rbd1

Copy the ganeti volume with:

z2-8:~# pv < /dev/drbd10 | ssh bm0014 dd of=/dev/rbd1
2,29GB 0:09:14 [4,23MB/s] [==========================>      ] 22% ETA 0:31:09

and unmap the device when it completes.

rbd unmap /dev/rbd1

The volume is ready to boot.
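A minimal sketch of booting an instance from it with the nova CLI, assuming a flavor named m1.small (the flavor and instance names are assumptions):

nova boot --flavor m1.small \
     --block-device-mapping vda=3ec2035e-ff76-43a9-bbb3-6c003c1c0e16:::0 \
     nerrant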

Collocating Ceph volumes and instances in a multi-datacenter setup

OpenStack Havana is installed on machines rented from OVH and Hetzner. An aggregate is created for machines hosted at OVH and another for machines hosted at Hetzner. A Ceph cluster is created with a pool using disks from OVH and another pool using disks from Hetzner. A cinder backend is created for each Ceph pool. From the dashboard, an instance can be created in the OVH availability zone using a Ceph volume provided by the matching OVH pool.
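A sketch of how such an aggregate could be created with the nova CLI, assuming an aggregate named ovh doubling as the availability zone and bm0014.the.re as one of the OVH hosts (both names are illustrative):

# nova aggregate-create ovh ovh
# nova aggregate-add-host ovh bm0014.the.re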


Fragmented floating IP pools and multiple AS hack

When an OpenStack Havana cluster is deployed on hardware rented from OVH and Hetzner, IPv4 addresses are rented by the month and are either isolated (just one IP, not a proper subnet) or come as a collection of disjoint subnets of various sizes.

91.121.254.238/32
188.165.144.248/30
...

OpenStack does not provide a way to deal with this situation, so a hack involving a double NAT using a subnet of floating IPs is proposed.
An L3 agent runs on an OVH machine and pretends that 10.88.15.0/24 is a subnet of floating IPs, although they are not publicly available. Another L3 agent is set up on a Hetzner machine and uses the 10.88.16.0/24 subnet.
When an instance is created, it may choose a Hetzner private subnet, which is connected to a Hetzner router whose gateway has been set to the network providing the Hetzner floating IPs. The same is done for OVH.
A few floating IPs are rented from OVH and Hetzner. On the host running the L3 agent dedicated to the OVH AS, a 1-to-1 NAT is established between each IP in the 10.88.15.0/24 subnet and the OVH floating IPs. For instance the following /etc/init/nat.conf upstart script associates 10.88.15.3 with the 91.121.254.238 floating IP.

description "OVH nat hack"

start on neutron-l3-agent

script
  iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
  ip addr add 10.88.15.1/24 dev br-ex
  while read private public ; do
    test "$public" || continue
    iptables -t nat -A POSTROUTING -s $private/32 -j SNAT --to-source $public
    iptables -t nat -A PREROUTING -d $public/32 -j DNAT --to-destination $private
  done <<EOF
10.88.15.3      91.121.254.238
EOF
end script
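On the Neutron side, the 10.88.15.0/24 range acting as a floating IP pool could be declared as an external network along the lines of the following sketch (the network name, gateway and allocation pool boundaries are assumptions):

neutron net-create ovh-ext --router:external=True
neutron subnet-create --name ovh-ext --disable-dhcp \
    --gateway 10.88.15.1 \
    --allocation-pool start=10.88.15.3,end=10.88.15.254 \
    ovh-ext 10.88.15.0/24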
