# I’ve got matches.

WATER INSTEAD OF COLA!

# Vinum safety drill

Scroll down if you are just interested in a quick overview, or read on for the details.

The colocation provider does an amazing job of providing highly convenient storage volumes for our VMs. Since they take care of storing all our data with at least double redundancy, their storage volumes provide a reliable piece of storage backed by unreliable physical disks, much as TCP provides a reliable transport stream over unreliable IP. This allows us to disregard all the mundane stuff about hardware failures, RAID setup and so forth; instead, we can concentrate on the high-level picture, just set up our file systems (and backup plans!) and be done with it.

For my personal workstation, unfortunately, I do not have this kind of luxury. Instead, the machine just has a bunch of disks, and it’s my duty to build something at least somewhat reliable on top of them. My workstation doesn’t have a hardware RAID controller, and I don’t think that so-called ‘software RAID controllers’ are worth the silicon from which they are built. Instead, I am using my operating system’s software RAID facilities to build a mirroring setup (read up on RAID-1 for the details). For [[https://dragonflybsd.org/|DragonFly BSD]], I am currently using the vinum volume manager. Vinum is pretty much a kitchen sink of a volume manager and I am sure that it can do all sorts of stuff. I just use it to set up a bunch of hard drives to mirror each other.1)
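The mirroring principle itself can be sketched without any RAID machinery at all. The following toy simulation uses two plain files as stand-ins for the two member disks (the names merely echo the German drive names used later; nothing here touches vinum or real hardware): every write goes to both members, so after losing either one, the data is still fully readable from the survivor.

```shell
#!/bin/sh
# Toy illustration of RAID-1: two plain files stand in for
# the two mirrored member disks.
workdir=$(mktemp -d)
left="$workdir/links"
right="$workdir/rechts"

# A mirrored write goes to both members.
printf 'important data\n' | tee "$left" > "$right"

# Catastrophe: the left 'drive' dies.
rm -f "$left"

# The data survives on the remaining member.
recovered=$(cat "$right")
echo "$recovered"    # prints: important data

rm -rf "$workdir"
```

The price, of course, is that half of the raw capacity is spent on the copy; that is the trade-off RAID-1 makes for redundancy.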

But what good is a fail-safe system if you don’t know how to use it in an emergency? That’s what fire and safety drills are for: just as with the building you are probably sitting in right now, you should periodically check your setup, try out the emergency systems and in general know what you are supposed to do when something goes wrong. You have a backup system? Neat! But have you ever tried out how to do a restore? What does your RAID system do when (not if!) one of your drives actually fails? How do you get your system up and running again? It’s far better to prepare for the catastrophe in a calm, controlled manner, taking notes and making plans. When the inevitable happens, you are far more likely to refrain from panicking, since you can just consult your notes and get yourself out of the mess instead of driving yourself further into it.

An off-list remark got me thinking about my own safety drill, which I have so far neglected. I would rather not lose my data, so it’s high time to do one now, while I am in the mood.

## My test setup

### Partitions and slices

I don’t have the space to set up a second machine for the safety drill, so I’m using my company’s test VM for this. I could do all of this with ‘real’ hardware; it’s just less of a hassle to set up this way. My test dummies are a set of storage volumes, namely vbd2, vbd4 and vbd5. All of them contain a single slice which in turn contains a single vinum partition:

# gpt show vbd2
     start      size  index  contents
         0         1      -  PMBR
         1         1      -  Pri GPT header
         2        32      -  Pri GPT table
        34      2014      -
      2048  20967424      0  GPT part - DragonFly Label64
  20969472      2015      -
  20971487        32      -  Sec GPT table
  20971519         1      -  Sec GPT header
# disklabel vbd2s0
# /dev/vbd2s0:
#
# Informational fields calculated from the above
# All byte equivalent offsets must be aligned
#
# boot space:    1077248 bytes
# data space:   10482652 blocks # 10236.96 MB (10734235648 bytes)
#
# NOTE: If the partition data base looks odd it may be
#       physically aligned instead of slice-aligned
#
diskid: 383834b3-b971-11e6-98e9-010000000000
boot2 data base:      0x000000001000
partitions data base: 0x000000108000
partitions data stop: 0x00027fdff000
backup label:         0x00027fdff000
total size:           0x00027fe00000    # 10238.00 MB
alignment: 4096
display block size: 1024        # for partition display only

16 partitions:
#          size     offset    fstype   fsuuid
p:   10481596          0     vinum    #   10235.934MB
p-stor_uuid: 4283c6ba-b971-11e6-98e9-010000000000

The other two volumes are set up in the same fashion. To keep costs low, the volumes are very small. I won’t be putting anything on them; I just want to try out how vinum reacts when they fail.

My vinum partitions are thus vbd2s0p, vbd4s0p and vbd5s0p. I will use the first two devices to construct the RAID-1, then fail one of the drives and use the third device as a ‘replacement’. This is the scenario I expect to happen: one (physical) drive will fail, while the other will (hopefully) remain intact. I would replace the faulty drive and then have vinum repair itself.

### vinum layout

vinum create will create the initial RAID-1 configuration. With my example drives, the initial configuration will look like this:

vinum.conf
drive links device /dev/vbd2s0p
drive rechts device /dev/vbd4s0p

volume katastrophenübung setupstate
plex name linkes org concat
subdisk length 0 drive links
plex name rechtes org concat
subdisk length 0 drive rechts
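Assuming the configuration is saved as vinum.conf, it would be applied with something like the following (a sketch; vinum create reads the configuration file given as its argument and sets everything up in one step):

```shell
# Create the mirrored volume from the configuration file.
# Run as root; vinum(8) brings the drives, plexes and
# subdisks online according to the config.
vinum create vinum.conf
```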

Since I am creating a RAID-1 device, all content will be mirrored bit by bit to each of the constituent volumes. It is therefore not necessary to initialise the volume (which would be done by vinum init)—blocks that have not been written to yet cannot contain real data. This is what setupstate does: it tells vinum not to bother clearing the disk and to consider it ‘online’ right away:

2 drives:
D links                 State: up       Device /dev/vbd2s0p     Avail: 0/10235 MB (0%)
D rechts                State: up       Device /dev/vbd4s0p     Avail: 0/10235 MB (0%)

1 volumes:
V katastrophenübung    State: up        Plexes:       2 Size:          9 GB

2 plexes:
P linkes              C State: up       Subdisks:     1 Size:          9 GB
P rechtes             C State: up       Subdisks:     1 Size:          9 GB

2 subdisks:
S linkes.s0             State: up       PO:        0  B Size:          9 GB
S rechtes.s0            State: up       PO:        0  B Size:          9 GB
==> messages <==
Dec  3 19:54:46 sumi kernel: vinum: drive links is up
Dec  3 19:54:46 sumi kernel: vinum: drive rechts is up
Dec  3 19:54:46 sumi kernel: vinum: linkes.s0 is up
Dec  3 19:54:46 sumi kernel: vinum: rechtes.s0 is up
Dec  3 19:54:46 sumi kernel: vinum: linkes is up
Dec  3 19:54:46 sumi kernel: vinum: katastrophenübung is up

The volume now shows up as /dev/vinum/vol/katastrophenübung. I can now continue with cryptsetup to set up LUKS containers, then set up my file systems and so on—everything is ready at this point.
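For completeness, those follow-up steps might look roughly like this. This is a sketch under assumptions: the container name crypt-kata, the mount point and the plain newfs invocation are my own placeholders, not taken from the original setup, and the first command destroys any existing data on the volume.

```shell
# Put a LUKS container on the mirrored volume, then a file
# system inside it. Run as root; luksFormat is destructive!
cryptsetup luksFormat /dev/vinum/vol/katastrophenübung
cryptsetup luksOpen /dev/vinum/vol/katastrophenübung crypt-kata
newfs /dev/mapper/crypt-kata
mount /dev/mapper/crypt-kata /mnt
```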

## Catastrophe strikes

### A drive has failed!

Of course, it didn’t really. On my VM, I’ll just tell vinum that a drive has failed. If I were doing this on real hardware, I’d unplug the drive and plug in the replacement, then reboot. It doesn’t make that big a difference for this drill.

I want vinum to think that a physical drive has failed. Therefore I stop one of the drives:2)

vinum stop -f links

And it works!

Dec  3 20:02:52 sumi kernel: vinum: drive links is down
Dec  3 20:02:52 sumi kernel: vinum: linkes.s0 is crashed
Dec  3 20:02:52 sumi kernel: vinum: linkes is faulty

vinum list now shows a grim picture:

2 drives:
D links                 State: down     Device /dev/vbd2s0p     Avail: 0/10235 MB (0%)
D rechts                State: up       Device /dev/vbd4s0p     Avail: 0/10235 MB (0%)

1 volumes:
V katastrophenübung    State: up        Plexes:       2 Size:          9 GB

2 plexes:
P linkes              C State: faulty   Subdisks:     1 Size:          9 GB
P rechtes             C State: up       Subdisks:     1 Size:          9 GB

2 subdisks:
S linkes.s0             State: crashed  PO:        0  B Size:          9 GB
S rechtes.s0            State: up       PO:        0  B Size:          9 GB

### Reinforcements have arrived

With real hardware, a failing drive will most likely drag the system down with it. After replacing it and rebooting, the faulty drive would no longer be present. It takes some poking, but vinum can be convinced of this:

vinum stop
dd if=/dev/zero of=/dev/vbd2s0p bs=1m count=10
vinum read /dev/vbd[245]s0p

Just as if the machine were rebooted with a brand-new replacement drive, vinum now only sees the remaining drive of the volume—the one that didn’t fail. vbd5, the replacement drive, is pristine—vinum won’t find anything on it:

1 drives:
D rechts                State: up       Device /dev/vbd4s0p     Avail: 0/10235 MB (0%)
D links                 State: referenced       Device unknown  Avail: 0/0 MB

1 volumes:
V katastrophenübung    State: up        Plexes:       2 Size:          9 GB

2 plexes:
P linkes              C State: faulty   Subdisks:     1 Size:          9 GB
P rechtes             C State: up       Subdisks:     1 Size:          9 GB

2 subdisks:
S linkes.s0             State: crashed  PO:        0  B Size:          9 GB
S rechtes.s0            State: up       PO:        0  B Size:          9 GB

The interesting part is State: referenced. Vinum remembers that the volume used to comprise two drives, only one of which is actually present right now. I could create a new plex from the replacement drive, attach it to the volume and remove the old plex, but it’s much easier to just replace the missing drive by means of vinum create:3)

vinum.replacement.conf
drive links device /dev/vbd5s0p

This will fill in the ‘Device unknown’ entry with the newly installed replacement drive:

==> messages <==
Dec  3 20:10:19 sumi kernel: vinum: drive links is up
Dec  3 20:10:19 sumi kernel: vinum: linkes.s0 is reborn
Dec  3 20:10:19 sumi kernel: vinum: linkes is flaky
Dec  3 20:10:19 sumi kernel: vinum: linkes is faulty

vinum list will confirm this:

2 drives:
D rechts                State: up       Device /dev/vbd4s0p     Avail: 0/10235 MB (0%)
D links                 State: up       Device /dev/vbd5s0p     Avail: 0/10235 MB (0%)

1 volumes:
V katastrophenübung    State: up        Plexes:       2 Size:          9 GB

2 plexes:
P linkes              C State: faulty   Subdisks:     1 Size:          9 GB
P rechtes             C State: up       Subdisks:     1 Size:          9 GB

2 subdisks:
S linkes.s0             State: stale    PO:        0  B Size:          9 GB
S rechtes.s0            State: up       PO:        0  B Size:          9 GB

### Back to normal again

The volume remains up (and, most importantly, usable), but one of the plexes will remain in ‘faulty’ state and will not participate in the RAID. Vinum won’t start the repair on its own: it is a very time-consuming operation, and the administrator usually knows best when they are ready for it.

A simple vinum start linkes is enough to trigger the rebuilding:

# vinum start linkes
Reviving linkes.s0 in the background

This will take some sweet time. Go grab a coffee, watch some TV, take a few days off work at the local spa. I’m not kidding: resilvering a few terabytes of hard disk storage will take days and will put considerable strain on the remaining drives. In the worst case, the extra strain could push the remaining device over the edge, too. That’s what your backup is for, then…
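As a back-of-the-envelope check, the resilver time is simply size divided by throughput. The numbers below are my own assumptions (a 4 TB mirror, 50 MB/s sustained), not measurements from this setup:

```shell
#!/bin/sh
# Rough resilver-time estimate: size / throughput.
bytes=$((4 * 1000 * 1000 * 1000 * 1000))  # 4 TB mirror
rate=$((50 * 1000 * 1000))                # 50 MB/s sustained
seconds=$((bytes / rate))
hours=$((seconds / 3600))
echo "$hours hours"    # roughly 22 hours, best case
```

And that is the best case: a drive that is also serving regular I/O will sustain far less, which is how the estimate stretches into days.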

Eventually, the rebuilding is done:

==> messages <==
Dec  3 20:16:40 sumi kernel: vinum: linkes.s0 is up by force
Dec  3 20:16:40 sumi kernel: vinum: linkes is up
Dec  3 20:16:40 sumi kernel: vinum: linkes.s0 is up

And everything is bright and shiny yet again:

2 drives:
D rechts                State: up       Device /dev/vbd4s0p     Avail: 0/10235 MB (0%)
D links                 State: up       Device /dev/vbd5s0p     Avail: 0/10235 MB (0%)

1 volumes:
V katastrophenübung    State: up        Plexes:       2 Size:          9 GB

2 plexes:
P linkes              C State: up       Subdisks:     1 Size:          9 GB
P rechtes             C State: up       Subdisks:     1 Size:          9 GB

2 subdisks:
S linkes.s0             State: up       PO:        0  B Size:          9 GB
S rechtes.s0            State: up       PO:        0  B Size:          9 GB

## Wrapping it all up

Turns out that vinum keeps out of the way and does most of the work for me. When a drive fails, all I have to do is:

shutdown -p now

# Replace the device and reboot

# Re-create partitions…:
gpt create vbd5

# … and slices:
disklabel -r -w vbd5s0 auto neu-links
disklabel -e vbd5s0

# Re-create the failed vinum drive:
vinum create <( print "drive links device /dev/vbd5s0p" )

# Repair the re-created plex:
vinum start linkes

And that’s it.

# Discussion

1) Tomohiro Kusumi, one of the DragonFly developers, mentioned on the mailing list that in principle, everything needed by my setup should also be possible using dm, but I haven’t looked into it yet. When I did the initial setup for my machine, I couldn’t get lvm2 to work, so I eventually gave up and stuck with vinum. Perhaps I’ll take another look at dm and lvm2, if that is indeed the subsystem I ‘should’ be using.
2) The -f is needed to convince vinum to actually fail the device—since it is in use by a volume, vinum would try to prevent me from shooting myself in the foot.
3) Of course, you need to re-create the slices and partitions on the replacement drive by means of gpt and disklabel first; otherwise /dev/vbd5s0p wouldn’t exist at this point.
addamasartus/vinum-safety-drill.268.txt · Last modified: 2016-12-03 21:03 by Stefan Unterweger