Making a Champagne VTL on a Beer Budget


This post will be about how to make some really nice backup storage, along with some random thoughts about different ways to make it all work, without having to be rich as Croesus to do it. :-) At work (SGI), I happened to have an old version of our MAID product on hand. And I wanted to use it for backups. There are other ways to build a backup server, and SGI has some other products that might be more suitable too (more on that later). But this is the hardware I happened to have available.

More on the MAID product. Basically, this is storage that's very dense, and meant to be used strictly for archival purposes. And to prevent over-heating and to keep the electricity costs down, only some of the drives are powered on at any given time. SGI now has storage offerings that are even denser and that allow all the drives to be on all the time, but of course, that uses more electricity and costs more to operate. :-) See the MIS and NAS products for more info on them.

So, the MAID storage is, at the time of this writing, able to hold almost 3 petabytes in a single rack. Each rack contains up to 8 shelves, and each shelf contains 26 usable LUNs plus one special LUN. The drives that make up a LUN are powered on and off on demand, and only a configurable number of LUNs are allowed to be powered on at one time - in my case, 7. The 27th LUN in each shelf contains what's called an "always on region" or AOR, where a portion of each of the other 26 LUNs is cached. For instance, the default AOR for Linux systems contains a portion of the start and end of each LUN. That way, when a Linux system boots and sees 208 disk devices (26 LUNs on each of 8 shelves), it can read the partition table of every disk device from the AOR without having to power on a single LUN. When the OS tries to mount the partition on a LUN, it will need to access other parts of the LUN, and there can be a 15-20 second delay while the drives are powered on and spun up. As an aside, I made a custom AOR that makes it possible to mount and unmount a single XFS filesystem on each LUN without having to spin up the drives, but I chose to do something else.

I've found that one really effective way to make efficient use of this sort of storage is to format each LUN's single partition with a descriptive label like shelf-X-lunYY, and then set up the automounter to mount these on demand. For instance, I'd have this line in my auto.master file:
/maid /etc/auto.maid
And /etc/auto.maid would have:
#!/bin/bash
# autofs executable (program) map: autofs runs this with the lookup key
# (e.g. "shelf3-lun24") as $1 and expects the mount entry on stdout.
key="$1"
# Mount the XFS filesystem whose label matches the key.
echo "-fstype=xfs :-L$key"

So then I could access the LUN on shelf 3, LUN 24 just by accessing /maid/shelf3-lun24 (assuming I'd formatted it with that filesystem label). It also means that most of the time, most of the LUNs aren't mounted. When I reboot, there are few, if any, MAID LUNs to unmount first, and none that have to be mounted when the OS restarts. Nice 'n clean.
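To make that concrete, preparing and poking at one of these LUNs would look something like this (just a sketch; the device path below is made up and will differ on your system):

# Label the single partition on shelf 3, LUN 24 (the device name here is hypothetical).
# XFS labels max out at 12 characters, which "shelf3-lun24" fits exactly.
mkfs.xfs -L shelf3-lun24 /dev/sdab1

# The first access makes autofs mount it (spinning the LUN up if needed)...
ls /maid/shelf3-lun24

# ...and once it's been idle past the autofs timeout, it gets unmounted again.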

So how could I make use of this in a backup program? Most backup software lets us write to filesystems these days. But the tricky bit with the MAID is guaranteeing that the backup software never uses more than 7 filesystems on a shelf at any given time. What I'd wind up having to do is make "pools" of storage that use no more than 7 LUNs on any given shelf. Then I'd have to juggle my schedules and clients around to make sure they use the right pools on the right days. It could be done, but it means I'd have to configure clients, pools, and schedules carefully or I'd wind up with some pools running out of space, some with little space used, etc. And then when you toss in the ability for anyone to start a recovery at any time, it just gets more complicated.

The obvious solution is to just turn each shelf into a VTL (virtual tape library) with 7 tape drives and 26 slots. Each slot is a LUN. That'd still mean making sure the backup software would spread the load out across 8 separate tape libraries, and it could be horrifically expensive to license the backup software that way (if you're using commercial software). If massive I/O throughput isn't an issue (and it wasn't for me), you can also just set it up as one giant tape library with 208 slots and 7 tape drives. That way the VTL abstraction (7 tape drives) prevents any backup software from ever using too many LUNs at the same time, but you have tons of storage and you can treat it all the same if you like. No juggling of pools, clients and schedules. Use pools and schedules when they benefit you, not to try to trick the backup software into behaving a certain way.

So, I compiled and installed some open-source VTL software called MHVTL. This is pretty cool stuff. Basically, I configured a new virtual tape library with 256 slots (just a nice, round number bigger than 26 * 8) and then told it I wanted 208 tapes with volume names like "S00021D7". The D7 suffix tells MHVTL that this tape is a DLT7000. The first 00 is a shelf number, the 02 is a LUN on that shelf, and the trailing 1 is a partition number. I chose this naming convention just in case I ever wanted to break each LUN into multiple partitions. I'd have to make a custom AOR for that to work efficiently (one that contained the partition table plus the first part of every partition, so it'd contain all the filesystem labels too).

Then I told MHVTL that this tape library had 7 tape drives that were each DLT7000 devices. By default, it wants to find a directory under /opt/mhvtl/ with the same name as every tape label (you can specify a different base directory for each VTL you set up). Initially, I used the automounter to handle this so I wouldn't need to have them all mounted all the time: I made /opt/mhvtl/30/tape-label automount the LUN with the filesystem labeled tape-label. And it worked OK until the backup software tried to label all the tapes, accessing them one by one in rapid succession to write a header to each tape. The automounter didn't deal well with that, since it didn't unmount a filesystem when the VTL unloaded a virtual tape from a virtual drive - only when it had been idle for longer than X minutes.
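For reference, the drive and slot layout ends up living in MHVTL's library_contents file (something like /etc/mhvtl/library_contents.30 for library 30, though the exact path and format vary between MHVTL versions). A library with 7 drives and slots named with the convention above would look roughly like this:

Drive 1:
Drive 2:
Drive 3:
Drive 4:
Drive 5:
Drive 6:
Drive 7:
Slot 1: S00011D7
Slot 2: S00021D7
Slot 3: S00031D7
Slot 4: S00041D7

(and so on, up through Slot 208: S07261D7)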

As a quick workaround, I tweaked bacula (the backup software I was testing with) so that it'd execute a wrapper script around all jukebox operations. The wrapper would mount /opt/mhvtl/30/whatever before asking MHVTL to load that virtual slot into a virtual drive, and unmount it (and flush the buffers) after telling the jukebox to unload the virtual tape and put it back in its virtual slot. That worked like a charm. Bacula happily mounted 'n unmounted every virtual tape, labeling them all for the "Default" pool, and MHVTL did its job making all these XFS filesystems look like really big tapes.
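I don't have the exact script handy anymore, but the shape of it was roughly the following. Treat this as a sketch only: the mtx-changer path, the slot-to-label lookup, and the base directory are all assumptions, and Bacula hands its changer script the arguments changer-device, command, slot, archive-device, drive-index.

#!/bin/bash
# Hypothetical wrapper around Bacula's stock mtx-changer script (sketch only).
CHANGER="$1"; CMD="$2"; SLOT="$3"; DEVICE="$4"; DRIVE="$5"
MTX=/usr/lib/bacula/mtx-changer   # wherever the real changer script lives
BASE=/opt/mhvtl/30                # base directory for this VTL instance

# The "list" command prints one "slot:barcode" line per tape; the barcode is
# also the filesystem label, so use it to find the label for the requested slot.
label_for_slot() {
    "$MTX" "$CHANGER" list "$SLOT" "$DEVICE" "$DRIVE" |
        awk -F: -v s="$SLOT" '$1 == s { print $2 }'
}

case "$CMD" in
load)
    vol=$(label_for_slot)
    mkdir -p "$BASE/$vol"
    mount LABEL="$vol" "$BASE/$vol"   # spin up and mount the LUN first...
    exec "$MTX" "$CHANGER" load "$SLOT" "$DEVICE" "$DRIVE"   # ...then load the tape
    ;;
unload)
    vol=$(label_for_slot)
    "$MTX" "$CHANGER" unload "$SLOT" "$DEVICE" "$DRIVE"
    sync                              # flush buffers before the LUN powers back down
    umount "$BASE/$vol"
    ;;
*)
    exec "$MTX" "$CHANGER" "$CMD" "$SLOT" "$DEVICE" "$DRIVE"
    ;;
esac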

At the moment, one limitation is that MHVTL assumes every virtual tape is the same size. So if one shelf has 500 GB drives and another shelf has 750 GB drives (yeah, this is old hardware, remember, so I don't have 4 TB drives - grin), then the LUNs will be different sizes. One possibility would be to make multiple partitions per LUN, so each virtual tape would be exactly the same size, but shelves with bigger drives would have more partitions per LUN than shelves with smaller drives. Of course, this would require a custom AOR file. But with shelves containing the same type of drives, one could make a VTL that had 208 virtual tapes, each of which held around 2.5 TB of uncompressed storage. MHVTL can do decent compression too (zlib or lzo). Now that's some decent backups. :-)

I also tinkered with making a custom AOR so that I could mount and unmount every LUN without requiring the LUN's drives to power on. The AOR was big enough that I was able to do this, and it worked. But the downside is that if I ever had an unclean shutdown, all of these filesystems would be marked as dirty and would have to be scanned on reboot. So I'd much rather only mount them as needed and unmount them as soon as I'm done. It just makes more sense.

Not long after I finished my quick 'n dirty proof of concept, SGI rolled out a new product, the SGI InfiniteStorage Gateway, that I think will be an even better solution. This, to me, looks very exciting because it makes the whole VTL abstraction unnecessary. Basically, you have hot, fast storage in an appliance that keeps all the metadata for a giant filesystem in the hot storage, but can take blocks of data that haven't been accessed in a while and move them off to other tiers of storage. Those other tiers might be MAID storage, or even physical tape libraries.

So with this, I could just tell the backup software "Here, use this truly humongous filesystem as your storage", write to it, and have the InfiniteStorage Gateway automatically take backups that were, say, a week old and migrate them to archival MAID storage. The data is still there and if/when I had to do a recovery and the backup software tried to access one of these files whose blocks had been migrated to archival, the gateway would automagically pull it back into the hot storage within (I think) 15-20 seconds. Very cool!

Basically, I could have my low-power archival storage, but use it the same as any other big, heat-generating, power-sucking storage array. :-) And if that weren't compelling enough, you can configure the Gateway to require multiple copies of the migrated data for extra redundancy. I'll have to check, but it might even be possible to have data migrated to other geographies. In a perfect world, I'd want the Gateway to not only have NFS/CIFS support but to also run a bacula storage daemon. But I suspect that's actually possible.

Gonna hafta lay my hands on one and give it a try, just to see... (grin)