Disaster recovery on host failure in OpenStack

The host bm0002.the.re becomes unavailable because of a partial disk failure on an Essex based OpenStack cluster using LVM based volumes and multi-host nova-network. The host had daily backups of its root file system using rsync, and each logical volume was copied and compressed. Although the disk is failing badly, the host is not down and some reads can still be done. The nova services are shut down, the host is disabled using nova-manage, and an attempt is made to recover from the partially damaged disks and logical volumes whenever that yields better results than reverting to yesterday's backup.
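
For context, the daily backup routine described above could look like the following; a minimal sketch, where the destination host, paths and loop over the volume group are assumptions rather than the exact scripts used on bm0002.the.re:

# mirror the root file system to the backup host, preserving attributes
rsync -aHx --numeric-ids --delete / root@bm0001.the.re:/backup/bm0002.the.re/
# copy and compress each logical volume to the backup host
for lv in /dev/nova-volumes/volume-*; do
  dd if=$lv bs=4M | gzip | \
   ssh root@bm0001.the.re "cat > /backup/bm0002.the.re/lvs/$(basename $lv).gz"
done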

restoring an instance from backup

The host is marked as unavailable

nova-manage service disable --host=bm0002.the.re --service=nova-compute
nova-manage service disable --host=bm0002.the.re --service=nova-network
nova-manage service disable --host=bm0002.the.re --service=nova-volume

and shows as such when listed

# nova-manage service list --host=bm0002.the.re
Binary           Host    Zone Status     State Updated_At
nova-compute     bm0002.the.re  bm0002  disabled   XXX   2013-05-11 09:18:25
nova-network     bm0002.the.re  bm0002  disabled   XXX   2013-05-11 09:18:30
nova-volume      bm0002.the.re  bm0002  disabled   XXX   2013-05-11 09:18:33

These service rows can be removed completely later by modifying the MySQL database directly, as sketched after the listing below. The april-ci instance was running on bm0002.the.re:

# nova list --name april-ci
+--------------------------------------+----------+---------+--------------------------------------+
|                  ID                  |   Name   |  Status |               Networks               |
+--------------------------------------+----------+---------+--------------------------------------+
| 4e8a8126-b27d-4c9e-abeb-4dc574c54254 | april-ci | SHUTOFF | novanetwork=10.145.9.5, 176.31.18.26 |
+--------------------------------------+----------+---------+--------------------------------------+
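
Purging those service rows later, as mentioned above, could be done along these lines; a sketch, assuming the services table follows the same soft-delete convention as the volumes table used further down:

mysql -e "update services set deleted = 1 where host = 'bm0002.the.re'" nova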

The instance is then artificially moved to a host that is enabled:

mysql -e "update instances set host = 'bm0001.the.re', availability_zone = 'bm0001' where hostname = 'april-ci'" nova

and deleted, so that the delete request is handled by a live compute node instead of the failed one

nova delete april-ci

Assuming the content of the failed host was backed up entirely (i.e. rsync /), the april-ci disk is located in the backup using the ID shown above in the output of nova list

# grep 4dc574c54254 /var/lib/nova/instances/*/*.xml
/var/lib/nova/instances/instance-000001de/libvirt.xml:    4e8a8126-b27d-4c9e-abeb-4dc574c54254

and the corresponding disk is extracted into a raw image and shrunk to a minimal file system

chroot /backup/bm0002.the.re
mount -t proc none /proc
# export the instance disk as a raw block device over NBD
qemu-nbd --port 20000 /var/lib/nova/instances/instance-000001de/disk &
nbd-client localhost 20000 /dev/nbd0
# copy the raw content into a flat image file
pv /dev/nbd0 > april-ci.april-ci.img
# repair the file system, then shrink it to its minimal size
fsck -fy $(pwd)/april-ci.april-ci.img
resize2fs -M april-ci.april-ci.img
exit
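
After leaving the chroot, the NBD device is still attached and qemu-nbd is still running; a cleanup sketch (not part of the original procedure):

nbd-client -d /dev/nbd0
pkill -f 'qemu-nbd --port 20000'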

and uploaded to glance, reusing the same kernel and ramdisk as the original image, as shown by nova image-show original-image-of-april-ci

glance add name="april-ci-2013-05-11" disk_format=ami container_format=ami \
 kernel_id=2e714ea3-45e5-4bb8-ab5d-92bfff64ad28 \
 ramdisk_id=6458acca-24ef-4568-bb2b-e52322a5a11c < /backup/bm0002.the.re/april-ci.april-ci.img
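
The kernel_id and ramdisk_id above are taken from the metadata of the original image; retrieving them could look like this:

nova image-show original-image-of-april-ci | grep -E 'kernel_id|ramdisk_id'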

It is then rebooted using the same flavor

nova boot --image 'april-ci-2013-05-11' \
  --flavor e.1-cpu.10GB-disk.1GB-ram \
  --key_name loic --availability_zone=bm0001 --poll april-ci

recovering from a partially damaged logical volume

A 30GB volume contains bad blocks toward the end (after the 26GB mark), but it was not full. An fsck is run on a copy of the volume to check how much the recovery process would lose; it turns out to be fewer than a hundred files, all in a non-critical area (see the sketch after the volume creation below). A new disk of the same size is allocated on another machine with

# euca-create-volume --zone bm0001 --size 30
VOLUME  vol-0000005b    30      bm0001  creating        2013-05-11T11:22:19.889Z
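
For reference, the read-only damage assessment mentioned above could look like the following; a sketch where the path to the copy is hypothetical, and -n makes fsck report problems without fixing anything:

fsck -n /backup/bm0002.the.re/copy-of-volume-00000143.img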

and the content of the damaged volume is copied over, until the copy fails with an I/O error.

ssh -A root@bm0001.the.re
ssh bm0002.the.re pv /dev/nova-volumes/volume-00000143 | \
 pv > /dev/nova-volumes/volume-0000005b
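
Note that pv stops at the first read error, which is acceptable here because the damaged area lies beyond the used blocks. If data located after the bad blocks had to be salvaged as well, a tool that skips read errors could be substituted; a sketch with GNU dd:

ssh bm0002.the.re "dd if=/dev/nova-volumes/volume-00000143 bs=4M conv=noerror,sync" | \
 pv > /dev/nova-volumes/volume-0000005b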

and the new volume is repaired

fsck -fy /dev/nova-volumes/volume-0000005b

The damaged volume residing on the failed host is then marked as deleted directly in the database

mysql -e "update volumes set deleted = 1 where id = 30" nova

recovering from a partially damaged instance disk

An instance disk has a few failed blocks and may be recovered if the remaining blocks are copied over. Because rsync is more resilient to I/O errors than dd or pv, it is used to recover as much data as possible with:

# ssh -A root@bm0002.the.re
# rsync --inplace --progress /var/lib/nova/instances/instance-00000089/disk root@bm0001.the.re:/backup/bm0002.the.re/var/lib/nova/instances/instance-00000089/disk
  1843396608 100%    8.41MB/s    0:03:28 (xfer#1, to-check=0/1)
rsync: read errors mapping "/mnt/var/lib/nova/instances/instance-00000089/disk": Input/output error (5)
WARNING: disk failed verification -- update retained (will try again).
disk
  1843396608 100%   37.37MB/s    0:00:47 (xfer#2, to-check=0/1)
rsync: read errors mapping "/var/lib/nova/instances/instance-00000089/disk": Input/output error (5)
ERROR: disk failed verification -- update retained.
sent 1843836447 bytes  received 858892 bytes  7000741.32 bytes/sec
total size is 1843396608  speedup is 1.00
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1070) [sender=3.0.9]

It is then turned into a file using nbd as shown above and checked for errors:

# fsck -fy $(pwd)/openstack.jenkins.img
fsck from util-linux 2.20.1
e2fsck 1.42.5 (29-Jul-2012)
/openstack.jenkins.img: recovering journal
Clearing orphaned inode 117551 (uid=0, gid=0, mode=0100644, size=0)
Clearing orphaned inode 9764 (uid=0, gid=0, mode=0100644, size=1393052)
Clearing orphaned inode 9765 (uid=0, gid=0, mode=0100644, size=302040)
Clearing orphaned inode 7050 (uid=105, gid=109, mode=0100644, size=0)
Clearing orphaned inode 8841 (uid=0, gid=0, mode=0100644, size=81800)
Clearing orphaned inode 10235 (uid=0, gid=0, mode=0100644, size=253328)
Clearing orphaned inode 10240 (uid=0, gid=0, mode=0100644, size=180624)
Clearing orphaned inode 8840 (uid=0, gid=0, mode=0100644, size=874608)
Clearing orphaned inode 6469 (uid=0, gid=0, mode=0100755, size=1245180)
Clearing orphaned inode 10739 (uid=0, gid=0, mode=0100644, size=18192)
Clearing orphaned inode 10927 (uid=0, gid=0, mode=0100644, size=19908)
Clearing orphaned inode 10754 (uid=0, gid=0, mode=0100644, size=100820)
Clearing orphaned inode 10738 (uid=0, gid=0, mode=0100644, size=11468)
Clearing orphaned inode 10926 (uid=0, gid=0, mode=0100644, size=31568)
Clearing orphaned inode 10956 (uid=0, gid=0, mode=0100644, size=18780)
Clearing orphaned inode 10958 (uid=0, gid=0, mode=0100644, size=22312)
Clearing orphaned inode 10723 (uid=0, gid=0, mode=0100644, size=13976)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Free blocks count wrong (2299561, counted=2283092).
Fix? yes
Free inodes count wrong (538192, counted=534536).
Fix? yes
/openstack.jenkins.img: ***** FILE SYSTEM WAS MODIFIED *****
/openstack.jenkins.img: 52984/587520 files (0.3% non-contiguous), 338348/2621440 blocks

If the loss is smaller than what reverting to yesterday's backup would cost, the instance is rebooted using this copy, following the same steps as above.