It’s rare that I write an article simply to educate. Most of the time I am attempting to articulate or justify a position, or simply rebutting someone’s nonsensical yammering. For a refreshing change, I thought I would take some time to educate you on the fundamentals of large-scale data storage. Many people think of “storage as a service” (now being called “cloud storage”) as a magic black box. At the end of the day, it is just bits on disks. And like all things, if you use enough of it, you can more than cover the cost of managing it yourself by simply eliminating your vendor’s margin (insourcing).

There are more and more services providing outsourced storage. The concept is simple: you upload a digital asset to the vendor (via some sort of API or tool), they return an identifying key of some sort (sometimes this key is provided by you, the uploader) and they store the asset for you. To retrieve the asset, you use a similar method to the one used to upload. In the simplest terms, you can think of it as a mapped network drive to which you can save assets, and later reconnect to retrieve them.
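The interaction pattern is easy to sketch. The toy, in-memory stand-in below only illustrates the put/get flow; the class and method names are hypothetical and do not reflect any particular vendor’s API.

    # A minimal sketch of the upload/retrieve pattern described above.
    import hashlib


    class SimpleObjectStore:
        """Toy in-memory stand-in for an outsourced storage service."""

        def __init__(self):
            self._assets = {}

        def put(self, data: bytes) -> str:
            # Some services derive the identifying key from the content
            # itself; others let the uploader choose it.
            key = hashlib.sha256(data).hexdigest()
            self._assets[key] = data
            return key

        def get(self, key: str) -> bytes:
            return self._assets[key]


    store = SimpleObjectStore()
    key = store.put(b"a digital asset")
    assert store.get(key) == b"a digital asset"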

By no means is this new technology. However, the burden of managing one’s own storage, combined with growing space requirements and the fear of loss due to insufficient redundancy, has driven people to want to make this particular problem someone else’s. Making this choice — to solve the problem yourself or to outsource — is always the outcome of several factors: cost, convenience, and safety.

Redundancy: The basics

Let’s take a look at the fundamentals of data storage. We all want our data to be safe. It’s pretty obvious that storing exactly one copy of the data isn’t safe, but it’s actually more complex than you would think — storing two copies doesn’t buy you much without taking a few extra steps.

Before we dive in and explore methods for keeping data safe across systems, we need to realize that one of our fundamental assumptions is invalid. We assume that when we write data to a disk, it will have no errors when we read it back. This assumption is fundamentally wrong. There’s this little evil thing called a bit error (basically, one of the zeros or ones that was written came back inverted). How often this type of error occurs is a probability called bit error rate (BER). The BER on modern spinning disks is usually around 10^-13 or 10^-14. Basically, for every 1 to 10 terabytes you write, one of the bits — when read — won’t equal what was written. A single erroneous bit might not matter for some types of data, but for others, such an error could be disastrous. We write a lot of data these days, and bit errors are silent, so the lesson here is: write checksums with your data.
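A checksum doesn’t prevent the bit error, but it turns a silent error into a detectable one. A minimal sketch of the idea, using SHA-256 here though any sufficiently strong checksum works:

    # "Write checksums with your data": keep a digest alongside each block
    # so that a silent bit error is caught when the block is read back.
    import hashlib


    def write_block(data):
        # In a real system the data and its digest are persisted together.
        return data, hashlib.sha256(data).hexdigest()


    def read_block(data, expected_digest):
        if hashlib.sha256(data).hexdigest() != expected_digest:
            raise IOError("checksum mismatch: silent bit error detected")
        return data


    block, digest = write_block(b"important data")
    corrupted = bytes([block[0] ^ 0x01]) + block[1:]   # flip a single bit
    read_block(block, digest)          # passes
    # read_block(corrupted, digest)    # would raise IOError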

The classic method of ensuring that data is safe is to store multiple copies on different physical media. Inside a single system, this can be accomplished with RAID1 (mirroring), which makes sure all data is on two physically separate disks. With a bit of (somewhat) clever math, we can take that same data, split it into a few pieces and store each piece on a different drive. We can then calculate a block of parity data, and store that on an additional drive. Retracing the same math backwards shows that we can lose any single disk in the set, and we’ll still be able to reconstruct our data. This is the basis for RAID5. Sometimes systems need to be resistant to multiple concurrent disk failures (hence the introduction of RAID6, which uses an erasure code such as Reed-Solomon).
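The parity math is easier to show than to describe. Here is a toy sketch of the RAID5 idea using XOR parity; real implementations operate on fixed-size blocks at the device level, not on Python byte strings:

    # XOR the data blocks to produce a parity block, then reconstruct any
    # single missing block from the survivors plus the parity.
    from functools import reduce


    def xor_blocks(blocks):
        return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))


    data_blocks = [b"AAAA", b"BBBB", b"CCCC"]   # striped across three drives
    parity = xor_blocks(data_blocks)            # stored on a fourth drive

    # Drive 2 dies: rebuild its block from the remaining blocks and parity.
    rebuilt = xor_blocks([data_blocks[0], data_blocks[2], parity])
    assert rebuilt == data_blocks[1]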

None of these schemes is designed to reduce the risk of data corruption. Rather, they are designed to prevent data loss due to the hardware failure of one or more underlying disks. One issue with using RAID is that you are storing files on a set of drives: those files consist of chunks of data (blocks) that map to physical blocks of bits on the drives, and somewhere along that path we could lose our way. If a specific physical block goes bad, or somehow becomes unreadable, we can’t easily map it back to a logical object, such as a file. We only find out that there’s a problem when we try to read the object. Another problem with this general technique is that all of these disks live in a single system, and if that system fails, all of the data is unavailable (or worse, lost).

So, RAID is designed to keep our data somewhat safe within a single system, but it doesn’t address system failures. The most obvious design is to store a copy of each asset on two systems. There are pros and cons with this approach. On the positive side, once we’ve identified which system holds a copy of our asset, we only need to communicate with that single system to retrieve a copy of the asset — simplicity. The downside here is that we’ve used half of our storage as redundancy, and yet if two of our N nodes fail, we’ve necessarily made unavailable (or permanently lost) 2/(N*(N-1)) of our assets. With two nodes, this works out to 100% (of course), and with 10 nodes, it’s around 2%.
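That fraction is simply the probability that both of an asset’s copies landed on the particular pair of nodes that failed, which is easy to check (assuming assets are spread uniformly across pairs of nodes):

    # Fraction of assets lost when two specific nodes out of N fail, given
    # that each asset lives on a random pair of nodes: 1 / C(N, 2).
    from math import comb


    def fraction_lost_two_copies(n_nodes):
        return 1 / comb(n_nodes, 2)     # equivalently 2 / (N * (N - 1))


    print(fraction_lost_two_copies(2))    # 1.0   -> 100% of assets
    print(fraction_lost_two_copies(10))   # ~0.022 -> roughly 2% of assets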

Taking a different approach altogether allows us to use half of our storage for redundancy, while maintaining dramatically greater availability.

Erasure codes

Today’s peer-to-peer systems achieve high availability of assets in the face of system failures. Their technical descriptions are clear-cut, yet extremely detailed. By using erasure codes, these systems are able to split data into many pieces (similar to RAID5), but instead of calculating simple parity, they calculate unique erasure codes.

Imagine we split our data into 5 pieces, and then calculate 5 additional pieces of data, any of which could be used to reconstruct any of the original 5 pieces were they found to be unavailable — these are erasure codes. So, with the data in 5 fragments + 5 erasure fragments, we’ve consumed twice the space but can now stand to lose any five pieces before the data becomes unavailable and/or lost. The main drawbacks to such a system are that calculating and distributing erasure codes is much more complicated than simply storing two copies of the same data, and that retrieving data requires contacting at least 5 machines to serve an asset.
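To make the idea concrete, here is a toy (n, k) erasure code matching the 5 data + 5 coded layout above, in which any 5 of the 10 fragments reconstruct the data. It works symbol by symbol over the prime field GF(257) via Lagrange interpolation; real systems use Reed-Solomon codes over GF(2^8) and are vastly more efficient, so treat this purely as an illustration of the principle.

    P = 257  # a prime just larger than the range of a byte


    def _interpolate(points, x):
        """Lagrange-interpolate the points over GF(P) and evaluate at x."""
        total = 0
        for i, (xi, yi) in enumerate(points):
            num, den = 1, 1
            for j, (xj, _) in enumerate(points):
                if i != j:
                    num = num * (x - xj) % P
                    den = den * (xi - xj) % P
            total = (total + yi * num * pow(den, -1, P)) % P
        return total


    def encode(data, k=5, n=10):
        """Split data (length must be a multiple of k) into n fragments."""
        fragments = [[] for _ in range(n)]
        for start in range(0, len(data), k):
            chunk = data[start:start + k]
            # Treat the k bytes as a polynomial's values at x = 0..k-1 and
            # evaluate that degree-(k-1) polynomial at x = 0..n-1.
            points = list(enumerate(chunk))
            for x in range(n):
                fragments[x].append(_interpolate(points, x))
        return fragments  # fragment x would live on node x


    def decode(available, k=5):
        """Rebuild the data from any k (fragment_index, symbols) pairs."""
        out = []
        for c in range(len(available[0][1])):
            points = [(x, symbols[c]) for x, symbols in available[:k]]
            out.extend(_interpolate(points, x) for x in range(k))
        return bytes(out)


    original = b"hello world of storage!!!"       # 25 bytes = 5 chunks of 5
    fragments = encode(original, k=5, n=10)
    # Lose any five fragments (say nodes 0, 2, 4, 6 and 8), recover anyway.
    survivors = [(x, fragments[x]) for x in (1, 3, 5, 7, 9)]
    assert decode(survivors, k=5) == original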

This erasure code approach assumes a slightly larger network of servers. With two copies spread across 100 machines, we see 99.8% availability of assets after 5 machine failures. With a 10-fragment (5 data + 5 coded) scenario, if 5 nodes fail, we maintain 100% availability. In the pathological case where 50 of our 100 nodes fail, the two-copy method would result in an availability of approximately 75.3%, whereas the erasure code method would achieve approximately 98.7% asset availability.
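Figures like these can be estimated with straightforward combinatorics: an asset becomes unavailable only when enough of the nodes holding its pieces happen to be among the failed ones. A small sketch of that calculation (the function name and parameters are mine, chosen for illustration):

    # If f of m machines fail, an asset is unavailable when more than
    # max_lost of the nodes holding its pieces are among the failed ones
    # (a hypergeometric tail probability).
    from math import comb


    def p_unavailable(m, f, pieces, max_lost):
        """P(more than max_lost of an asset's pieces sit on failed nodes)."""
        total = comb(m, pieces)
        return sum(comb(f, k) * comb(m - f, pieces - k)
                   for k in range(max_lost + 1, min(f, pieces) + 1)) / total


    m = 100
    # Two full copies: 2 pieces, tolerate the loss of 1.
    print(1 - p_unavailable(m, 5, pieces=2, max_lost=1))    # ~0.998
    # 5 data + 5 coded fragments: 10 pieces, tolerate the loss of 5.
    print(1 - p_unavailable(m, 5, pieces=10, max_lost=5))   # 1.0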

Back to reality

In peer-to-peer systems, where clients enter and leave the network rapidly, the use of erasure codes for high redundancy is quite necessary. However, in a datacenter environment, with redundancy on each system and maintenance windows that we control, the situation is entirely different. Controlling the servers, their configuration and their region of deployment gives us a landscape on which we can build a sufficiently redundant system with all sorts of advantages.

Reduced system complexity and simple distributed processing are significant advantages that result from having whole data objects, like images or documents, present on a single node. With this model, we can offload some computational processing to the nodes that hold the data, and those nodes can act without consuming the additional resources, such as CPU time and network bandwidth, required to reconstitute whole objects from their distributed pieces.
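As a sketch of that advantage (the names here are hypothetical): when a node holds the complete asset, work such as checksumming or summarizing can run where the data lives and ship back only a small result; with erasure-coded fragments, the same task would first require fetching enough fragments from other nodes and reconstituting the object.

    # Hypothetical sketch: a task shipped to the node that already holds
    # the whole object runs against a local read, no network reads needed.
    import hashlib


    def summarize_locally(local_assets, key):
        """Runs on the node storing the complete asset."""
        asset = local_assets[key]                  # local disk/memory read
        return hashlib.sha256(asset).hexdigest()   # small result shipped back


    node_assets = {"photo-123": b"...image bytes..."}
    print(summarize_locally(node_assets, "photo-123"))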

At the end of the day, a hybrid/adaptive approach between the two would yield the best outcome. I see that being the next thing in distributed storage. Most of us who are faced with storing large amounts of data have already thrown traditional filesystems and POSIX compliance to the wind and are looking for fresh, more appropriate solutions to our specific problems.

For now, until the two approaches merge, redundantly storing whole assets makes the most sense. It is simple and easy to build, deploy and administer. It is also trivially easy to understand and troubleshoot.