3/15/2007 – What I’m doing that I’m writing down here
3/16/2007 – Initial thoughts on hardware and software
3/22/2007 – Block Devices, File Systems, and card catalogs
My big project of the moment is setting up a high-availability Xen cluster. The idea is to be able to run several mission-critical machines in an environment that allows for automated recovery and failover should anything happen to the hardware behind the setup.
The constraint under which I’m operating is, basically, budgetary. What I’m trying to do would be fairly easy if I could put $30,000 – $50,000 into building up a Storage Area Network (SAN), but I don’t have that luxury. Instead, I’m trying to do this using less than $10,000 worth of black-box (unbranded) server hardware.
A few definitions that will, I expect, help with understanding later additions to this page:
- High-Availability – In a technical context, a high-availability configuration is one that can withstand the hardware failures and other problems that will inevitably happen from time to time. A highly available server is one that has redundant components, such as multiple disks holding identical data and multiple power supplies. A highly available server cluster is one in which the role or roles of a single machine can be seamlessly taken over by another of the clustered machines, with little or no impact on users.
The High-Availability Linux Project has developed much of the software needed for a setup like the one that I’m working on.
- Resources – A very general phrase referring to the various elements of computer hardware. A server’s resources include things like: computation cycles (or CPU cycles, or just CPU); short-term, very fast storage during computation (RAM, or memory); long-term, less volatile, but slower storage (Hard Drive space, disk space, etc).
- Virtualization – Virtualization refers to the separation of logical computers from the physical hardware they run on, and to the process of making that separation.
Generally speaking, when we talk about “a server” or “a computer”, we mean a standalone piece of hardware, and a set of software that runs on top of it. However, we’ve gotten to the point that commodity server hardware is often “too” powerful: only very specialized software could ever use all of the resources available.
With virtualization, software installed on the machine (called a “hypervisor”) creates virtual machines — fake computers, each of which is assigned a portion of the total resources available on a particular piece of hardware, each of which can be manipulated separately. Each virtual machine has its own operating system and its own set of software and configurations. Different virtual machines can even use different operating systems. One virtual machine can be started up, upgraded, rebooted, or pretty much anything else, all without affecting the other virtual machines running on the same hardware.
There are great benefits to this approach, from a systems administration and systems engineering standpoint. If one virtual machine crashes, the others continue to run without even noticing. If one virtual machine needs a new version of some software (or if a new version of some software needs to be tested), that can happen without affecting the current version on which other systems rely. If a particular machine needs more hardware resources, it can be allocated more — more CPU, more RAM, or more hard disk.
- Xen – Xen is the hypervisor that I’m using. It’s open-source, and is well integrated into the Linux kernel. Most Linux distributions include tools that allow for the creation and management of Xen virtual machines through a relatively friendly graphical interface.
A little about the hardware and the software that I’m using for this little project.
The hardware that I’m using is a pair of machines, from PCs for Everyone. PCs4E is a local vendor, and they have treated us quite well over the years. They build machines to our specifications, offer fantastic service, and are very responsive to our needs.
The particular machines that I’m using are 1U rackmount servers. (A “U” is a unit of height: a way of measuring how much space a server requires in a computer rack. A standard computer rack is 42U.)
The hardware specs:
- 2 x 2GHz Dual-Core AMD Opteron HE processors
- 8 GB of DDR2 RAM
- 2 x 160GB SATA hard drives, at 7200 RPM
- Tyan motherboard with 2 x Broadcom 10/100/1000 Ethernet, and 1 x Intel 10/100 Ethernet onboard
As a base operating system, I’m using Fedora Core 6. I’d use a Linux distribution regardless of other factors; in this case, it’s mandated by the use of Xen, as the Xen hypervisor runs in the context of a Linux machine.
Using Fedora is actually something of a departure for me, as I’ve spent most of my time using SUSE Linux over the past 3 years or so. There are a couple of factors that make the change worthwhile.
First, Fedora is the distribution on which most of the Xen development is being done, as far as I can tell. (There’s a pretty significant Debian and Ubuntu base as well, but I want to stick with a distribution that uses RPM as opposed to .deb as its packaging system.)
Second, Fedora is sort of like an open-source branch of Red Hat. A number of folks at Red Hat develop their software for Fedora, and then the changes that they make get rolled into the various Red Hat Linux products. Since a lot of the software that I’m thinking I’ll probably use is maintained by folks who work for Red Hat, it makes sense to switch to a distribution where I’ll have quicker access to their updates and changes.
First, two terms that will be helpful to understanding what the rest of this means:
- block device – A block device generally refers to a hard drive, or a CD-ROM drive, or another type of a storage device, on a Linux computer system. It’s a particular type of a Device node. It’s worth noting that anything that appears as a block device on a Linux machine can be read from or written to, just like a hard drive. In the desktop computer world, the hard drive inside of your desktop or your laptop is a type of a block device, as is the CD-ROM drive, and the floppy drive. If you have a USB Stick, that’s a block device as well.
- file system (or filesystem) – A file system is what keeps track of what physical bits on the disk correspond to what files, folders, and so on that appear on your computer. Think of it like an old-style library, where the books on the shelves are like data on a disk: the file system is the card catalog. When you format a disk, what you are doing is writing a blank file system onto it.
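To make the two definitions above a little more concrete, here is a toy sketch in Python. Everything in it is an assumption made for illustration: the `ToyDisk` and `ToyFileSystem` names, the 16-byte block size, and the simplification that every file fits in a single block. Real file systems are far more involved, but the division of labor is the same: the block device just stores numbered blocks, and the file system is the catalog that says which block belongs to which file.

```python
BLOCK_SIZE = 16  # bytes per block; arbitrarily small for illustration


class ToyDisk:
    """A pretend block device: a flat run of fixed-size, numbered blocks."""

    def __init__(self, num_blocks):
        self.num_blocks = num_blocks
        self.data = bytearray(num_blocks * BLOCK_SIZE)

    def write_block(self, index, payload):
        if len(payload) > BLOCK_SIZE:
            raise ValueError("payload larger than one block")
        start = index * BLOCK_SIZE
        # Pad with zero bytes so every block is exactly BLOCK_SIZE long.
        self.data[start:start + BLOCK_SIZE] = payload.ljust(BLOCK_SIZE, b"\0")

    def read_block(self, index):
        start = index * BLOCK_SIZE
        return bytes(self.data[start:start + BLOCK_SIZE])


class ToyFileSystem:
    """The card catalog: remembers which block holds which file."""

    def __init__(self, disk):
        self.disk = disk
        self.catalog = {}  # file name -> (block index, content length)

    def format(self):
        # Formatting writes a blank catalog; the old bytes stay on the disk,
        # but nothing points at them any more.
        self.catalog = {}

    def write_file(self, name, content):
        if name in self.catalog:
            block = self.catalog[name][0]  # reuse the file's existing block
        else:
            used = {b for b, _ in self.catalog.values()}
            free = [i for i in range(self.disk.num_blocks) if i not in used]
            if not free:
                raise OSError("disk full")
            block = free[0]
        self.disk.write_block(block, content)
        self.catalog[name] = (block, len(content))

    def read_file(self, name):
        block, length = self.catalog[name]
        return self.disk.read_block(block)[:length]
```

Writing a file, calling `format()`, and then reading block 0 directly from the `ToyDisk` shows the library metaphor at work: the book is still on the shelf, but the catalog no longer has a card for it.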
The first step to getting this cluster thing going is to figure out the storage requirements, and how I’m going to configure these machines such that there’s a useful storage backend.
Because I want this cluster to be high-availability, I need storage that is resilient to failure. It needs to be shared between the nodes, and it needs to stay up in the event that one of the nodes goes down.
Because I don’t have a SAN, the most logical option seems to be DRBD. As the DRBD website states it, “DRBD gives you about the same semantics as a shared device, but it does not need any uncommon hardware. It runs on top of IP networks, which are to my impression less expensive than special storage networks.” As it happens, this is precisely the behavior that I’m looking for.
Through version 7 of DRBD, it was a single-primary system. That is, a given DRBD device could be mounted and used by a single machine, while the other machine passively received the changes made to the device. If you wanted to use the device on the other machine, it was necessary to disable it on the first, and then bring it up on the second.
This is useful when data redundancy is all that’s being sought, but in the world of Xen it has a significant limitation in that it doesn’t allow for migration of a virtual server. (“Migration”, in this context, refers to the process of moving a running virtual server from one piece of hardware to another.) Because there exists a point at which the DRBD device is not mounted on either server, there is a point at which the virtual machine cannot be running.
However, DRBD version 8 allows for a multi-primary setup. In this setup, the DRBD device is mounted and available to both machines. This means that a virtual machine can be migrated — while running — from one server to another. This makes things like maintenance much easier than they would otherwise be, as virtual machines can be moved to the second set of hardware when the first set needs work.
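The difference between the two modes can be sketched as a toy model in Python. This is not DRBD's API or configuration: the `ReplicatedDevice` class, the `migrate` function, and the `allow_two_primaries` parameter are hypothetical names of my own, and the model only tracks which nodes currently hold the device as primary.

```python
class ReplicatedDevice:
    """Toy model of a two-node replicated block device (not DRBD's real API)."""

    def __init__(self, nodes, allow_two_primaries):
        self.nodes = set(nodes)
        self.allow_two_primaries = allow_two_primaries
        self.primaries = set()  # nodes that currently have the device in use

    def promote(self, node):
        if node not in self.nodes:
            raise ValueError("unknown node: %s" % node)
        others = self.primaries - {node}
        if others and not self.allow_two_primaries:
            raise RuntimeError("another node is already primary")
        self.primaries.add(node)

    def demote(self, node):
        self.primaries.discard(node)


def migrate(device, source, target):
    """Hand the device from source to target.

    Returns True if some node held the device as primary after every step,
    meaning a running virtual machine would never have lost its disk.
    """
    if device.allow_two_primaries:
        # Multi-primary: bring up the target first, then release the source.
        steps = [(device.promote, target), (device.demote, source)]
    else:
        # Single-primary: the source must let go before the target can start.
        steps = [(device.demote, source), (device.promote, target)]

    always_available = True
    for action, node in steps:
        action(node)
        if not device.primaries:
            always_available = False  # the gap in which the VM cannot run
    return always_available
```

With `allow_two_primaries=False`, `migrate` returns `False` because of the moment after the demotion when no node holds the device; with `allow_two_primaries=True` it returns `True`, which is the property that live migration depends on.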
A multi-primary setup does cause some issues, because of the limitations of most file systems. File systems, generally speaking, are designed to have a single computer reading information off of them, and (more important) writing information to them.
To go back to the card catalog metaphor for a second, and to extend it a bit…
A file system is like a card catalog. It’s a card catalog that can hold precisely X cards; for the sake of simplicity, let’s say 1000. 200 of those cards have been filled in.
Now say that you’ve got this block of 4 cards:
BUKOWSKI, C | BURROUGHS, W | blank | CHEKOV, A
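A traditional file system will only allow one machine at a time to write, because otherwise you might have one machine trying to add a card for CARROLL, L while another machine is trying to add a card for CATHER, W. Both cards belong in the single blank slot between BURROUGHS, W and CHEKOV, A, and neither machine knows what the other is doing, so one card would silently overwrite the other.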
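The collision can be played out in a few lines of Python. This is a deliberately simplified sketch: the function names are made up for illustration, the two "machines" are just two steps in one program, and real file systems coordinate with locks and journals rather than a simple loop.

```python
def find_blank(cards):
    """Return the index of the first blank slot, or None if the block is full."""
    for index, card in enumerate(cards):
        if card is None:
            return index
    return None


def uncoordinated_insert(cards, first, second):
    """Both machines look for a blank slot before either one writes."""
    slot_seen_by_first = find_blank(cards)
    slot_seen_by_second = find_blank(cards)  # same slot: nothing has changed yet
    cards[slot_seen_by_first] = first
    cards[slot_seen_by_second] = second      # silently overwrites the first card
    return cards


def coordinated_insert(cards, first, second):
    """Each machine finishes its look-then-write before the next one starts."""
    rejected = []
    for name in (first, second):
        slot = find_blank(cards)
        if slot is None:
            rejected.append(name)            # has to be filed in another block
        else:
            cards[slot] = name
    return cards, rejected


block = ["BUKOWSKI, C", "BURROUGHS, W", None, "CHEKOV, A"]

lost = uncoordinated_insert(list(block), "CARROLL, L", "CATHER, W")
safe, rejected = coordinated_insert(list(block), "CARROLL, L", "CATHER, W")
```

In the uncoordinated case the card for CARROLL, L is gone and nobody was told. In the coordinated case CARROLL, L is filed, and the second writer finds out that the block is full and that CATHER, W has to go somewhere else. Making that kind of coordination work across machines is the job of a cluster file system.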
In recent years, however, a couple of different file systems have been created that get around this limitation. As a class, they’re referred to as “cluster file systems”. The two notable examples are GFS2 (Global File System version 2), and OCFS2 (Oracle Cluster File System 2).
Over the past couple of days, I’ve been playing with both of these, and discovering that they’re both almost-but-not-quite entirely ready. GFS2 has a bug that can cause a system deadlock in some cases — one of those bugs that you’re only likely to ever hit when doing benchmarking, but which happens to expose a deep failing in the code. A fix for this was checked into the Linux kernel tree just over a week ago, and should be in the next prerelease kernel version. OCFS2, meanwhile, fails to run when the backend storage device is a DRBD device. That problem, too, has been fixed, though I’m not sure if that patch has worked its way into the kernel tree as yet.
(On a side note: it’s pretty cool to be working on a project where the software is being written just a week or two ahead of the work that you’re doing.)
Over the next couple of days I’m hoping to get a customized Xen kernel built, using the latest pre-release kernel version (2.6.21-rc4, as of this moment). Building a Xen-enabled kernel is something of a challenge itself, but that’s an update for another time.
A new kernel should, if nothing else, give me a working GFS2 implementation, which will allow me to continue doing some benchmarking. From what I’ve seen so far, DRBD is not noticeably slower than a local drive when testing a single device at once. I need to do some testing, though, that’ll show what happens if multiple primary machines are writing to the same device at once, and if multiple devices are being written to from all kinds of different angles.
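For what it’s worth, the kind of single-device comparison described above can be approximated with a very small script. This is a rough sketch and not the benchmark I actually ran: the function name and its defaults are made up, it measures only sequential writes of zeros, and the directory you pass in is assumed to live on the device you want to test.

```python
import os
import tempfile
import time


def sequential_write_mb_per_s(directory, total_mb=64, chunk_kb=1024):
    """Write about total_mb of zeros to a scratch file in `directory`,
    force it to disk with fsync, and return the throughput in MB/s."""
    chunk = b"\0" * (chunk_kb * 1024)
    chunks = (total_mb * 1024) // chunk_kb
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as handle:
            for _ in range(chunks):
                handle.write(chunk)
            handle.flush()
            # Without fsync we would mostly be timing the page cache.
            os.fsync(handle.fileno())
        elapsed = time.perf_counter() - start
    finally:
        os.remove(path)  # clean up the scratch file
    return (chunks * chunk_kb / 1024) / elapsed
```

Calling it once with a directory on a local disk and once with a directory on the replicated device gives a crude side-by-side number. It says nothing about the multi-writer cases, which need several machines writing at once.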