Failures in technical systems are inevitable. Drives die, network interfaces wink out, backhoes take out cross-country backbones, data rooms flood. The effects of such failures range from minor inconveniences to crippling outages, but thoughtful planning can greatly increase the likelihood that the next failure will be the former rather than the latter. The fact is that 100% uptime for 100% of users is an unrealistic goal. Creating an information technology infrastructure that expects failures and minimizes users' exposure to them is critical to preserving continuity of service for the majority of users. This is the point where planning transcends into carrier-class thinking.

Military planners always factor in casualties when deciding on a plan of action. The failure of an individual component (a soldier, a tank, an airplane) is expected, but the overall goal will still be achieved. Many enterprises do excellent risk management in their business operations but fail to apply the same principles to their IT infrastructure. The mantra of smart investing is diversification; likewise, in the insurance industry, the goal is to spread the company's risk across a wide population, only a few of whom will actually make a claim in a given year. Yet when it comes to IT planning, all the eggs go into one large (often expensive) basket. No amount of money can ensure that a single point of failure will never fail. That money is better spent engineering around failures: designing systems that fail gracefully, or at least fail only partially, limiting the damage to some subset of users. Put another way, plan for the failure of the most critical piece of infrastructure and engineer service continuity despite that failure.

I like stories. As a systems administrator for more than 10 years, I have my fair share of them, both good and bad, and the really memorable ones have valuable lessons to teach us about how to construct systems that allow most users to see little or no interruption to their service.

In my ISP (Internet Service Provider) days, the company I worked for had one physical server hosting email. It was a relatively large, expensive UNIX server, but it got the job done and had impressive reliability compared to PC hardware of the day. As the ISP business expanded, the demand for email grew beyond what the server could handle, and when the inevitable outage occurred, it affected every single mail user in the system. The solution was, of course, to get more servers. It was not cost-effective to grow with more big UNIX servers, due to factors such as rack space and power, not to mention the capital investment.

We needed more (and smaller) servers to store the mail, both to absorb our growing user base and to reduce the impact of an individual server going down. The system we came up with decoupled mail routing from mail delivery and mailbox access. This enabled us to deploy lightweight MX (Mail Exchanger) servers that didn't need much in the way of local storage, as all incoming mail was delivered to some other host. The MX servers sat behind a load balancer, so we could scale them horizontally as required to keep up with demand. The mail storage hosts had more local storage, using RAID (Redundant Array of Inexpensive Disks) to survive disk failures, and had standby hosts to which all mailbox data was replicated in case of host failure. Gluing it all together was a set of proxy hosts backed by LDAP (Lightweight Directory Access Protocol) to locate each user's mail storage host and handle mailbox access. The directory service was also used by the MX hosts for inbound delivery, to locate the appropriate storage host.

Users connected to the proxies instead of directly to their mail storage host, so we could do quick maintenance or handle short outages without most customers ever realizing there was a problem. For example, POP (Post Office Protocol) clients checking for new mail would be given a "no new mail" response when the backing store was unavailable. This architecture was much more resilient to failures, and in the event of a failure (or even a maintenance event), the proxy layer between users and the actual servers limited the users' exposure to the problem.
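To make the proxy's role concrete, here is a minimal sketch of that decision path in Python, assuming a hypothetical directory lookup helper and a plain POP3 backend; the host names and the stand-in lookup are illustrative, not our production code.

```python
import poplib
import socket

# Hypothetical stand-in for the LDAP query the real proxies performed:
# find the user's directory entry and return its mail storage host.
def lookup_mail_host(username):
    directory = {"alice": "store01.example.net", "bob": "store02.example.net"}
    return directory.get(username)

def mailbox_status(username, password):
    """Return (message_count, mailbox_bytes) for the user's mailbox.

    If the backing store is unreachable, report an empty mailbox rather
    than an error, so the POP client simply sees "no new mail".
    """
    host = lookup_mail_host(username)
    if host is None:
        raise ValueError("unknown user")
    try:
        backend = poplib.POP3(host, timeout=5)
        backend.user(username)
        backend.pass_(password)
        count, size = backend.stat()
        backend.quit()
        return count, size
    except (OSError, poplib.error_proto, socket.timeout):
        # Storage host down or in maintenance: hide the failure.
        return 0, 0
```

The except branch is the design choice described above: when a storage host is down, the client sees an empty mailbox rather than an error, trading a few minutes of staleness for uninterrupted service.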

The next illustrative story comes from a client who operates a large email infrastructure supporting millions of users. Their mail storage sits on a SAN (Storage Area Network), implemented on three expensive, vertically integrated systems from a major vendor and interconnected by a costly Fibre Channel switching fabric (which is, as ZFS author Jeff Bonwick puts it, "a network designed by disk firmware writers. God help you."). The result is a very high ratio of spindles to control units, so when there is a problem with one unit, that problem affects one-third of their customers, which can easily amount to several million users. That's a lot of eggs in one basket. The price of the basket does not guarantee an absence of problems: the redundant control heads must run the same firmware version, so a single firmware bug can wipe out both of them. The cost of the storage platform is high enough that scaling horizontally becomes prohibitively expensive, and doing so does little to address the spindle-to-control-unit ratio.

What they need is a fundamental shift in storage planning: more and cheaper baskets, each holding fewer eggs, so that fewer eggs are lost when a basket fails. In this case the baskets are commodity servers with direct-attached storage, running free software and exporting block devices over iSCSI (Internet Small Computer Systems Interface) to the servers handling client connections. For the cost of one of the vendor-supplied storage systems, we get nine new storage nodes, each with redundant control heads and its own data storage. These nine nodes provide the same amount of usable space as the three old units, with capacity to spare. The cost savings allow more nodes to be purchased and make horizontal scaling to meet demand practical. Additional savings are realized on the interconnects, which can be standard 10Gb Ethernet. The larger number of nodes means a three-fold decrease in the number of users exposed to any single node failure, and future scaling only decreases that number further.
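The arithmetic behind "more baskets" is simple enough to sketch. The subscriber count below is purely illustrative, but the proportions match the three-unit versus nine-node comparison above.

```python
# Blast-radius arithmetic: the same user population spread across 3 big
# arrays versus 9 (or more) commodity iSCSI storage nodes.
TOTAL_USERS = 9_000_000  # assumed subscriber count, for illustration only

def users_exposed(total_users, storage_nodes):
    """Users behind any single storage node, i.e. exposed to its failure."""
    return total_users // storage_nodes

for nodes in (3, 9, 18):
    print(f"{nodes:2d} nodes: {users_exposed(TOTAL_USERS, nodes):,} users per node")
# 3 nodes: 3,000,000 users; 9 nodes: 1,000,000 users; 18 nodes: 500,000 users
```

Each doubling of the node count halves the exposure again, which is why future scaling only shrinks the affected population further.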

Turning away from email, my final story covers data warehousing for a large, web-focused marketing company. Their OLTP (Online Transaction Processing) database that backs the website runs on Oracle. They need a separate place to run intensive data-mining queries and transformations that are not appropriate for the OLTP system, for which the typical solution is an Operational Data Store (ODS), a type of data warehouse. Initially this was another Oracle instance on a single server. When the size of the dataset grew beyond the capacity of that server, a decision had to be made: a server with enough memory and CPU power to handle the load would have exceeded the terms of the existing Oracle license, and purchasing additional licenses was cost-prohibitive. The solution was two-fold: convert the ODS to the open source PostgreSQL server, and put it on two systems instead of one.

The conversion to PostgreSQL is outside the scope of this article, but the decision to use two servers provides several distinct advantages. First, they are not set up as master/slave, which keeps the setup simple; both replicate from Oracle in parallel, with no awareness of one another. This works because the data-mining queries are essentially read-only (some jobs do data transformations, but they operate on temporary tables). Second, either system is fast enough to handle the entire operational load, so if one is down, all of its jobs can be shifted to the other with no degradation of service to users. Third, upgrades to PostgreSQL can be tested with live data without disrupting service, as jobs can again be shifted away from the instance being upgraded. Under normal circumstances, both servers are used for production work, yielding the best return on investment.
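Here is a sketch of how that job shifting might look from the client side, assuming psycopg2 and two hypothetical connection strings; the real environment presumably drives this from its job scheduler rather than from ad hoc retries.

```python
import psycopg2

# Hypothetical connection strings for the two independent ODS replicas;
# both carry the same data, replicated from Oracle in parallel.
ODS_NODES = [
    "host=ods1.example.net dbname=ods user=etl",
    "host=ods2.example.net dbname=ods user=etl",
]

def run_report(query, params=None):
    """Run a read-only data-mining query on whichever ODS node answers.

    Because the replicas are unaware of each other and the workload is
    read-only, a job can be shifted to the surviving node with no
    coordination beyond trying the next connection string.
    """
    last_error = None
    for dsn in ODS_NODES:
        try:
            conn = psycopg2.connect(dsn, connect_timeout=5)
        except psycopg2.OperationalError as exc:
            last_error = exc  # node down or being upgraded; try the next one
            continue
        try:
            with conn.cursor() as cur:
                cur.execute(query, params)
                return cur.fetchall()
        finally:
            conn.close()
    raise RuntimeError("no ODS node available") from last_error
```

The same pattern covers planned work: to test a PostgreSQL upgrade, simply drop one node from the list and let every job land on the other.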

These stories illustrate the advantages of expecting failure and engineering around it to create robust internet architectures. Failure is inevitable, but dire consequences need not be.