Failures in technical systems are inevitable. Drives die, network interfaces wink out, backhoes take out cross-country backbones, data rooms flood. The effects of such failures range from minor inconveniences to crippling outages, but thoughtful planning can greatly increase the likelihood that the next failure will be the former rather than the latter. The fact is that 100% uptime for 100% of users is an unrealistic goal. Creating an information technology infrastructure that expects failures and minimizes users' exposure to them is critical to preserving continuity of service for the majority of users. This is the point where planning transcends into carrier-class thinking.

Military planners always factor in casualties when deciding on a plan of action. The failure of an individual component (a soldier, a tank, an airplane) is expected, but the overall goal will still be achieved. Many enterprises do excellent risk management in their business operations but fail to apply the same principles to their IT infrastructure. The mantra of smart investing is diversification; likewise, in the insurance industry, the goal is to spread the company's risk across a wide population, only a few of whom will actually make a claim in a given year. Yet when it comes to IT planning, all the eggs go into one large (often expensive) basket. No amount of money can ensure that a single point of failure will never fail. That money is better spent engineering around failures: designing systems that fail gracefully, or at least fail only partially, limiting the damage to some subset of users. Put another way, plan for the failure of the most critical piece of infrastructure and engineer service continuity despite that failure.

I like stories. As a systems administrator for more than 10 years, I have my fair share of them, both good and bad, and the really memorable ones have valuable lessons to teach us about how to construct systems that allow most users to see little or no interruption to their service.

In my ISP (Internet Service Provider) days, the company I worked for had one physical server hosting email. It was a relatively large, expensive UNIX server, but it got the job done and had impressive reliability compared to PC hardware of the day. As the ISP business expanded, the demand for email grew beyond what the server could handle, and when the inevitable outage occurred, it affected every single mail user in the system. The solution was, of course, to get more servers. It was not cost-effective to grow with more big UNIX servers, due to factors such as rack space and power, not to mention the capital investment.

We needed more (and smaller) servers to store the mail, both to absorb our growing user base and to reduce the impact of an individual server going down. The system we came up with decoupled mail routing from mail delivery and mailbox access. This enabled us to deploy lightweight MX (Mail Exchanger) servers that didn't need much in the way of local storage, as all incoming mail was delivered to some other host. The MX servers sat behind a load balancer, so we could scale them horizontally as required to keep up with demand. The mail storage hosts had more local storage, using RAID (Redundant Array of Inexpensive Disks) to survive disk failures, and had standby hosts to which all mailbox data was replicated in case of host failure. Gluing it all together was a set of proxy hosts backed by LDAP (Lightweight Directory Access Protocol) to locate each user's mail storage host and handle mailbox access. The directory service was also used by the MX hosts for inbound delivery, to locate the appropriate storage host.

Users connected to the proxies instead of directly to their mail storage host, so we could do quick maintenance or handle short outages without most customers ever realizing there was a problem. For example, POP (Post Office Protocol) clients checking for new mail would be given a "no new mail" response when the backing store was unavailable. This architecture was much more resilient to failures, and in the event of a failure (or even a maintenance event), the proxy layer between users and the actual servers limited the users' exposure to the problem.
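To make the proxy's role concrete, here is a minimal sketch of that decision path in Python, assuming a hypothetical directory lookup helper and a plain POP3 backend; the host names and the stand-in lookup are illustrative, not our production code.

```python
import poplib
import socket

# Hypothetical stand-in for the LDAP query the real proxies performed:
# find the user's directory entry and return its mail storage host.
def lookup_mail_host(username):
    directory = {"alice": "store01.example.net", "bob": "store02.example.net"}
    return directory.get(username)

def mailbox_status(username, password):
    """Return (message_count, mailbox_bytes) for the user's mailbox.

    If the backing store is unreachable, report an empty mailbox rather
    than an error, so the POP client simply sees "no new mail".
    """
    host = lookup_mail_host(username)
    if host is None:
        raise ValueError("unknown user")
    try:
        backend = poplib.POP3(host, timeout=5)
        backend.user(username)
        backend.pass_(password)
        count, size = backend.stat()
        backend.quit()
        return count, size
    except (OSError, poplib.error_proto, socket.timeout):
        # Storage host down or in maintenance: hide the failure.
        return 0, 0
```

The except branch is the design choice described above: when a storage host is down, the client sees an empty mailbox rather than an error, trading a few minutes of staleness for uninterrupted service.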

The next illustrative story comes from a client who operates a large email infrastructure supporting millions of users. Their mail storage sits on a SAN (Storage Area Network), implemented on three expensive, vertically integrated systems from a major vendor and interconnected by a costly Fibre Channel switching fabric (which is, as ZFS author Jeff Bonwick puts it, "a network designed by disk firmware writers. God help you."). The result is a very high ratio of spindles to control units, so when there is a problem with one unit, that problem affects one-third of their customers, which can easily amount to several million users. That's a lot of eggs in one basket. The price of the basket does not guarantee an absence of problems: the redundant control heads must run the same firmware version, so a single firmware bug can wipe out both of them. The cost of the storage platform is high enough that scaling horizontally becomes prohibitively expensive, and doing so does little to address the spindle-to-control-unit ratio.

What they need is a fundamental shift in storage planning: more and cheaper baskets, each holding fewer eggs, so that fewer eggs are lost when a basket fails. In this case the baskets are commodity servers with direct-attached storage, running free software and exporting block devices over iSCSI (Internet Small Computer Systems Interface) to the servers handling client connections. For the cost of one of the vendor-supplied storage systems, we get nine new storage nodes, each with redundant control heads and its own data storage. These nine nodes provide the same amount of usable space as the three old units, with capacity to spare. The cost savings allow more nodes to be purchased and make horizontal scaling to meet demand practical. Additional savings are realized on the interconnects, which can be standard 10Gb Ethernet. The larger number of nodes means a three-fold decrease in the number of users exposed to any single node failure, and future scaling only decreases that number further.
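The arithmetic behind "more baskets" is simple enough to sketch. The subscriber count below is purely illustrative, but the proportions match the three-unit versus nine-node comparison above.

```python
# Blast-radius arithmetic: the same user population spread across 3 big
# arrays versus 9 (or more) commodity iSCSI storage nodes.
TOTAL_USERS = 9_000_000  # assumed subscriber count, for illustration only

def users_exposed(total_users, storage_nodes):
    """Users behind any single storage node, i.e. exposed to its failure."""
    return total_users // storage_nodes

for nodes in (3, 9, 18):
    print(f"{nodes:2d} nodes: {users_exposed(TOTAL_USERS, nodes):,} users per node")
# 3 nodes: 3,000,000 users; 9 nodes: 1,000,000 users; 18 nodes: 500,000 users
```

Each doubling of the node count halves the exposure again, which is why future scaling only shrinks the affected population further.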

Turning away from email, my final story covers data warehousing for a large, web-focused marketing company. Their OLTP (Online Transaction Processing) database that backs the website runs on Oracle. They need a separate place to run intensive data-mining queries and transformations that are not appropriate for the OLTP system, for which the typical solution is an Operational Data Store (ODS), a type of data warehouse. Initially this was another Oracle instance on a single server. When the size of the dataset grew beyond the capacity of that server, a decision had to be made: a server with enough memory and CPU power to handle the load would have exceeded the terms of the existing Oracle license, and purchasing additional licenses was cost-prohibitive. The solution was two-fold: convert the ODS to the open source PostgreSQL server, and put it on two systems instead of one.

The conversion to PostgreSQL is outside the scope of this article, but the decision to use two servers provides several distinct advantages. First, they are not set up as master/slave, which keeps the setup simple; both replicate from Oracle in parallel, with no awareness of one another. This works because the data-mining queries are essentially read-only (some jobs do data transformations, but they operate on temporary tables). Second, either system is fast enough to handle the entire operational load, so if one is down, all of its jobs can be shifted to the other with no degradation of service to users. Third, upgrades to PostgreSQL can be tested with live data without disrupting service, as jobs can again be shifted away from the instance being upgraded. Under normal circumstances, both servers are used for production work, yielding the best return on investment.
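Here is a sketch of how that job shifting might look from the client side, assuming psycopg2 and two hypothetical connection strings; the real environment presumably drives this from its job scheduler rather than from ad hoc retries.

```python
import psycopg2

# Hypothetical connection strings for the two independent ODS replicas;
# both carry the same data, replicated from Oracle in parallel.
ODS_NODES = [
    "host=ods1.example.net dbname=ods user=etl",
    "host=ods2.example.net dbname=ods user=etl",
]

def run_report(query, params=None):
    """Run a read-only data-mining query on whichever ODS node answers.

    Because the replicas are unaware of each other and the workload is
    read-only, a job can be shifted to the surviving node with no
    coordination beyond trying the next connection string.
    """
    last_error = None
    for dsn in ODS_NODES:
        try:
            conn = psycopg2.connect(dsn, connect_timeout=5)
        except psycopg2.OperationalError as exc:
            last_error = exc  # node down or being upgraded; try the next one
            continue
        try:
            with conn.cursor() as cur:
                cur.execute(query, params)
                return cur.fetchall()
        finally:
            conn.close()
    raise RuntimeError("no ODS node available") from last_error
```

The same pattern covers planned work: to test a PostgreSQL upgrade, simply drop one node from the list and let every job land on the other.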

These stories illustrate the advantages of expecting failure and engineering around it to create robust internet architectures. Failure is inevitable, but dire consequences need not be.