Making a Champagne VTL on a Beer Budget
This post is about how to build some really nice backup storage, plus some
random thoughts about different ways to make it all work, without having
to be rich as Croesus to do it. :-) At work
(SGI), I happened to have an old version of our
MAID product on hand,
and I wanted to use it for backups. There are other ways to build a backup
server, and SGI has some other products that might be more suitable too (more
on that later). But this is the hardware I happened to have available.
More on the MAID product. Basically, this is storage that's very dense and
meant to be used strictly for archival purposes. To prevent overheating
and to keep electricity costs down, only some of the drives are powered
on at any given time. SGI now has storage offerings that are even denser and
that allow all the drives to be on all the time, but of course, that uses
more electricity and costs more to operate. :-) See the
MIS and
NAS products for more
info on them.
So, the MAID storage is, at the time of this writing, able to contain almost
3 petabytes in a single rack. Each rack contains up to 8 shelves. Each
shelf contains 26 usable LUNs and one special LUN. The drives that make up
a LUN are powered on and off on demand. And only a configurable number of
LUNs are allowed to be powered on at one time - in my case, 7. The 27th LUN
in each shelf contains what's called an "always on region" or AOR. A portion
of each of the other 26 LUNs is cached here. For instance, the default AOR
for Linux systems contains a portion of the start and end of each LUN.
That way, when a Linux system boots, sees 26*8 disk devices (26 LUNs
on 8 shelves), and scans the partition table on each disk device, it can
read all of that from the AOR and not have to power on every LUN. When the OS
tries to mount the partition on the LUN, it will need to access other parts
of the LUN and there could be a 15-20 second delay while the drives are
powered on and spun up. As an aside, I made a custom AOR that makes it
possible to mount and unmount a single XFS filesystem on each LUN without
having to spin up the drives, but I chose to do something else.
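To see that behavior from the Linux side, compare probing a LUN's label with
actually mounting it (the device name below is made up; any of the 208 LUN
devices would do):
# /dev/sdq1 is a hypothetical MAID LUN partition.
time blkid -p /dev/sdq1               # superblock lives at the start of the
                                      # LUN, inside the AOR - comes back fast
mkdir -p /mnt/maid-test
time mount /dev/sdq1 /mnt/maid-test   # needs blocks outside the AOR, so expect
                                      # a 15-20 second pause while drives spin up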
I've found one really effective way to make efficient use of this sort of
storage is to format each LUN's single partition with a nice label like
shelfX-lunYY, and then have the automounter set up to mount these on
demand. For instance, I'd have this line in my auto.master file:
/maid /etc/auto.maid
And /etc/auto.maid would be an executable map (don't forget to chmod +x it) containing:
#!/bin/bash
key="$1"
echo "-fstype=xfs :-L$key"
So then I could access the LUN on shelf 3, LUN 24 just by accessing
/maid/shelf3-lun24 (assuming I'd formatted it with that filesystem label).
And it also means that most of the time, most of the LUNs aren't mounted.
When I reboot, there are few, if any, MAID LUNs to unmount first, and none
required to mount when the OS restarts. Nice 'n clean.
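For completeness, labeling a LUN is just a one-liner (the device name here is
hypothetical, and note that XFS labels top out at 12 characters, which
shelf3-lun24 fits exactly):
mkfs.xfs -L shelf3-lun24 /dev/sdq1   # /dev/sdq1 = the LUN's single partition
ls /maid/shelf3-lun24                # first access mounts it on demand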
So how could I make use of this in a backup program? Most backup software
allows us to write to filesystems these days. But the tricky bit with the
MAID is: how can I guarantee that the backup software will never use more than
7 filesystems on a shelf at any given time? What I'd wind up having to do
is make "pools" of storage that use no more than 7 LUNs on any given
shelf. Then I'd have to juggle my schedules and clients around to make sure
they use the right pools on the right days. It could be done, but it means
configuring clients, pools, and schedules carefully or I'd wind up
with some pools running out of space and others sitting nearly empty. And
then when you toss in the ability for anyone to start a recovery at any
time, it just gets more complicated.
The obvious solution is to just turn each shelf into a VTL (virtual tape
library) with 7 tape drives and 26 slots. Each slot is a LUN. That'd still
mean making sure the backup software would spread the load out across 8
separate tape libraries, and it could be horrifically expensive to license
the backup software that way (if you're using commercial software). If
massive I/O throughput isn't an issue (and it wasn't for me), you can also
just set it up as one giant tape library with 208 slots and 7 tape drives.
That way the VTL abstraction (7 tape drives) prevents any backup software
from ever using too many LUNs at the same time, but you have tons of storage
and you can treat it all the same if you like. No juggling of pools, clients
and schedules. Use pools and schedules when they benefit you, not to
try to trick the backup software into behaving a certain way.
So, I compiled and installed some open-source VTL software called
MHVTL. This is pretty
cool stuff. Basically, I configured a new virtual tape library with 256
slots (just a nice, round number bigger than 26 * 8) and then told it I wanted
208 tapes with volume names like "S00021D7". The trailing D7 tells MHVTL
that this tape is a DLT7000. The first 00 is a shelf number, the 02 is
a LUN on that shelf, and the 1 after that is a partition number. I chose
this naming convention just in case I ever wanted to have each LUN
broken into multiple partitions. I'd have to make a custom AOR for this
to work efficiently (one that contained the partition table plus the first
part of every partition so it'd contain all the filesystem labels too).
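Typing in 208 volume names by hand would've been tedious, so a little loop can
spit out the slot entries for MHVTL's library_contents file (the file name
library_contents.30 here is an assumption - use whatever number you gave your
virtual library):
#!/bin/bash
# Emit "Slot N: S<shelf><lun><partition>D7" entries for 8 shelves,
# 26 LUNs per shelf, 1 partition per LUN.
slot=1
for shelf in $(seq 0 7); do
    for lun in $(seq 1 26); do
        printf 'Slot %d: S%02d%02d1D7\n' "$slot" "$shelf" "$lun"
        slot=$((slot + 1))
    done
done >> /etc/mhvtl/library_contents.30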
Then I told MHVTL that this tape library had 7 tape drives that were each
DLT7000 devices. By default, it wants to find a directory with the same name
as every tape label in /opt/mhvtl/ (you can specify a different
base directory for each VTL you set up). Initially, I used the automounter
to handle this so I wouldn't need to have them all mounted all the time. I
made /opt/mhvtl/30/tape-label automount the LUN with the filesystem labeled
tape-label. And it worked OK until the backup software tried to label all the
tapes, accessing them one by one in rapid succession to write a header
to each tape. The automounter didn't deal well with that, since it doesn't
unmount a filesystem when the VTL unloads a virtual tape from a virtual
drive - only after the mount has been idle for longer than X minutes.
As a quick workaround, I tweaked bacula (the backup software I was testing
with) so that it'd execute a script wrapper around all jukebox operations.
So I made a wrapper script that would mount /opt/mhvtl/30/whatever before
asking MHVTL to load the virtual tape in that slot into the virtual drive, and
unmount it (and flush the buffers) after telling the jukebox to unload that
virtual tape and put it back in its virtual slot. That worked like a
charm. Bacula happily mounted 'n unmounted every virtual tape, labeling them
all for the "Default" pool, and MHVTL did its job making all these XFS
filesystems look like really big tapes.
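For the curious, here's a rough sketch of the idea (not my exact script - the
paths, the library number, and the slot-to-label lookup are all assumptions
you'd adjust for your own setup, and I'm assuming bacula's usual mtx-changer
calling convention of changer device, command, slot, drive device, drive index):
#!/bin/bash
# Sketch of a changer wrapper for bacula's Autochanger resource, e.g.:
#   Changer Command = "/usr/local/sbin/maid-changer %c %o %S %a %d"
CHANGER="$1"; CMD="$2"; SLOT="$3"     # drive device and index follow in $4/$5
BASE=/opt/mhvtl/30
REAL=/etc/bacula/mtx-changer          # wherever your install keeps the stock script

label_in_slot() {
    # Pull the volume tag out of mtx's "Storage Element N:Full :VolumeTag=..." line
    mtx -f "$CHANGER" status | sed -n "s/.*Storage Element $1:Full :VolumeTag=//p"
}

case "$CMD" in
    load)
        vol=$(label_in_slot "$SLOT")
        mkdir -p "$BASE/$vol"
        mount LABEL="$vol" "$BASE/$vol"   # spin up and mount the LUN first
        exec "$REAL" "$@"                 # then let MHVTL load the virtual tape
        ;;
    unload)
        vol=$(label_in_slot "$SLOT")
        "$REAL" "$@"                      # put the virtual tape back in its slot
        sync                              # flush buffers
        umount "$BASE/$vol"               # then unmount the LUN
        ;;
    *)
        exec "$REAL" "$@"                 # loaded/list/slots etc. pass straight through
        ;;
esac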
At the moment, one limitation is that MHVTL assumes every virtual tape is the
same length. So if one shelf has 500 GB drives and one shelf has 750 GB
drives (yeah, this is old hardware, remember, so I don't have 4 TB drives -
grin), then the LUNs will be different sizes. One possibility would be to
make multiple partitions per LUN, so each LUN would be exactly the same
size, but shelves with bigger drives would have more partitions per LUN
than shelves with smaller drives. Of course, this would require a custom
AOR file. But with shelves containing the same types of drives, one could
make a VTL that had 208 virtual tapes, each of which was around 2.5 TB
of uncompressed storage. MHVTL can do decent compression too (zlib or lzo).
Now that's some decent backups. :-)
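If I ever did go the multiple-partition route, the carving itself would be easy
enough. Something along these lines (hypothetical device name, labels following
the SxxyyzD7 convention from earlier) would split a LUN on a 750 GB-drive shelf
into three equal tapes, while a LUN on a 500 GB-drive shelf would get two, so
every virtual tape comes out the same size:
# /dev/sdr is a hypothetical LUN on a 750 GB-drive shelf.
parted -s /dev/sdr mklabel gpt
parted -s /dev/sdr mkpart t1 xfs 0% 33%
parted -s /dev/sdr mkpart t2 xfs 33% 67%
parted -s /dev/sdr mkpart t3 xfs 67% 100%
mkfs.xfs -L S07011D7 /dev/sdr1    # shelf 07, LUN 01, partition 1
mkfs.xfs -L S07012D7 /dev/sdr2    # partition 2
mkfs.xfs -L S07013D7 /dev/sdr3    # partition 3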
I also tinkered with making a custom AOR so that I could mount
and unmount every LUN without requiring the LUN's drives to power on.
The AOR was big enough that I was able to do this, and it worked. But the
downside is that if I ever had an unclean shutdown, all of these filesystems
would be marked as dirty and would have to be scanned on reboot. So I'd
much rather only mount them as needed and unmount them as soon as I'm done.
It just makes more sense.
Not long after I finished my quick 'n dirty proof of concept, SGI rolled
out a new product that I think will be an even better solution:
SGI InfiniteStorage
Gateway. This, to me, looks very exciting because it makes the whole
VTL abstraction unnecessary. Basically, you have hot, fast storage in an
appliance that keeps all the metadata for a giant filesystem in the hot
storage, but can take blocks of data that haven't been accessed in a while
and move them off to other tiers of storage. These other tiers might be
MAID storage, or even physical tape libraries.
So with this, I could just tell the backup software "Here, use this truly
humongous filesystem as your storage", write to it, and have the
InfiniteStorage Gateway automatically take backups that were, say, a week
old and migrate them to archival MAID storage. The data is still there
and if/when I had to do a recovery and the backup software tried to access
one of these files whose blocks had been migrated to archival, the gateway
would automagically pull it back into the hot storage within (I think)
15-20 seconds. Very cool!
Basically, I could have my low-power archival storage, but use it the same
as any other big, heat-generating, power-sucking storage array. :-) And if
that weren't compelling enough, you can configure the Gateway so that
it requires the migrated data to have multiple copies for extra redundancy.
I'll have to check, but it might even be possible to have it migrated to
other geographies. In a perfect world, I'd want the Gateway to not only
have NFS/CIFS support but to also run a bacula storage daemon. And I
suspect that's possible.
Gonna hafta lay my hands on one and give it a try, just to see... (grin)