Why I have backups

why-i-have-backups

Sven Vermeulen Thu 30 December 2010

You often read stories about people who have data loss and did not keep any (recent) backups, and are now fully equipped with a state-of-the-art backup mechanism. So no - no such failure story here but an example why backups are important.

Yesterday I had a vicious RAID/LVM failure. Due to my expeditions in the world of SELinux, for some odd reason when I booted with SELinux enforcing on, my RAID-1 where an LVM volume group (with /usr, /var, /opt and /home) was hosted (coincidentally the only RAID-1 with 1.2 metadata version, the others are at 0.90) suddenly decided to split itself into two (!) degraded RAID-1 systems: /dev/md126 and /dev/md127. During detection, LVM found two devices (the two RAID metadevices) but with the same meta data on it (same physical volume signature), so randomly picked one as its physical volume.

Found duplicate PV Lgrl5nNfenRUg9bIwM20q1hfMrWylyyL: using /dev/md126 not /dev/md127

Result: after a few reboots (no, I didn't notice it at first - why would I, everything seemed to work well so I didn't look at the dmesg output) I started noticing that changes I made were suddenly gone (for instance, ebuild updates that I made) which almost immediately triggers for me a "remount read-only, check logs and take emergency backup"-adrenaline surge. And then I noticed that there were I/O errors in my logs, together with the previously mentioned error message. So I quickly made an emergency backup of my most critical file system locations (/home as well as /etc and some files in var) and then tried to fix the problem (without having to reinstall everything).

The first thing I did - and that might have been the trigger for real pandemonium - was to try and found out which RAID (md126 or md127) is being used. The vgdisplay and other commands showed me that only md127 was used at that time. Also, /proc/mdstats showed that md126 was in a auto read-only state, meaning it wasn't touched since my system booted. So I decided to drop md126 and add its underlying partitions to the md127 RAID device. Once added, I would expect that the degraded array would start syncing, but no: the moment the partition was added, the RAID was shown to be fully operational.

So I rebooted my system, only to find out it couldn't mount md127. File system checks, duplicate inodes, deleted blocks, the whole shebang. Even running multiple fsck -y commands didn't help. The volume group was totally corrupted and my system almost totally gone. At that time, it was around 1am and knowing I wouldn't be able to sleep before my system is back operational - and knowing that I cannot sleep long as my daughter will wake up at her usual hour - I decided to remove the array, recreate it and pull back my original backup (not the one I just took as it might already have corrupted files). As I take daily backups (they are made at 6 o'clock or during first boot, whatever comes first) I quickly had most of my /home recovered (the backup doesn't take caches, git/svn/cvs snapshots etc. into account). A quick delta check between the newly restored /home and the backup I took yielded a few files which I have changed since, so those were recovered as well. But it also showed lost changes, lost files and just corrupted files so I'm glad I have my original backups.

I don't take backups of my /usr as it is only a system-rebuild away. As /etc wasn't corrupted, I recovered my /var, threw in a Gentoo stage snapshot (but not the full tarball as that would overwrite clean files) and ran a emerge -pe world --keep-going.

When I woke up, my system was almost fully recovered with only a few failed installs - which were identified and fixed in the next hour.

Knowing that my backup system is rudimentary (an rsync command which uses hardlinks for incremental updates towards a second system plus a secure file upload to a remote system for really important files) I was quite happy to have only lost a few changes which I neglected to commit/push. So, what did I learn?

  • Keep taking backups (and perhaps start using binpkg for fast recovery),
  • Use 0.90 raid metadata version,
  • Commit often, and
  • Install a log checking tool that warns me the moment something weird might be occurring