The cloud is great. Stop the hype.
By: Theo Schlossnagle 23 Mar '10
Cloud computing isn't new, though I'm sure you've heard more about it in the last few months than you did previously. The cloud is an amazing thing, but one that is poorly understood. I believe this lack of understanding stems from technology confusion which is trumpeted by corporations that have identified "the cloud" as a medium for expansion and profit. Don't get me wrong, the cloud is useful — but I hear some of the dumbest reasons why.
Before I launch my rant, I'll qualify that SaaS existed before "the cloud," yet in many definitions (like the link above) it is considered a cloud service. I consider the cloud to be only the infrastructure because the software and the platform have been provided by third parties successfully since before the term "cloud" arrived. It isn't fair to legitimize your concept by repackaging two successfully proven technologies under your brand.
The cloud... what is it? A cloud is an infrastructure in which I can provision computing systems. What makes this different from a rack of servers? Very little, actually. The most important difference is that provisioning of these systems is made convenient. When a system is needed, the requester can programmatically start a new one and need not be concerned with network infrastructure, machine specifications, power, cooling, etc. The cloud is built by someone who cares about all of those things, but then it is packaged in an easily consumable fashion. How does this happen? Well, this is where people get confused.
This simple provisioning is empowered by some sort of virtualization technology like Xen (likely one of the commercial implementations), VMware, Solaris Containers (Zones), Virtuozzo/OpenVZ, etc. Why is this confusing? Beats me, but I see people listing the advantages of virtualization as advantages of the cloud. As with most technologies, you inherit the advantages of your foundation. Virtualization brings a lot to the table, but you don't need "the cloud" to get it. Period.
The concept of private and public clouds is also poorly defined. Some people hate the two terms, while others define them in useless ways. I'll define them in a very practical way in which the differences have deep business meaning.
The public cloud is Amazon's EC2 and other similar "cloud providers" where the owner of the underlying physical infrastructure and the owner of the services running on the provisioned systems are not the same. In this environment, your services run on someone else's equipment. What does this mean?
If they don't pay their bills, the equipment can be seized. Other companies may be running virtual environments on the same hardware, same disks, same network. This means bugs in virtualization and data isolation could result in information disclosure — the really bad kind. At this point in time, I can't envision a way to make public cloud infrastructure PCI-DSS compliant — and even if you could, I believe it increases the possibility of compromise.
No virtualization is perfect (yet) in resource provisioning. This means that defining a reliable performance expectation for a node in the cloud can be very challenging.
It's not all negative though. Because public clouds are popular, they tend to have ample resources, which means more room for growth, and a provisioning request is "less likely" to result in a message that says "better luck next time, I'm flat out of horses."
Private clouds are not shared. A private cloud is deployed by an organization that wants the benefits of a cloud, but wants the processes and premise controls over the infrastructure that powers it. The key differences between a private cloud and public one are control and size.
In a private cloud, you have fine-grained control over geographic location. This can be important for meeting data availability and/or redundancy guarantees made to clients. It can also be useful for ensuring that at least part of your infrastructure is in a country whose laws more closely align with your business needs.
There are clearly enormous advantages in the private cloud, in that data security exists and the design and operation of the private cloud can be congruent with business requirements providing more aligned availability and consistent performance. The downside is that it is likely to have more limited resources — provisioning 1000 new instances is far more likely to result in a failure due to insufficient resources.
So how big is big? In my experience, when you hit a run rate of 40 instances, build yourself a private cloud. That's the point at which it becomes undeniably cheaper.
One distinct and measurable difference between how private and public clouds can be run is seen in the choice of virtualization technology. Public clouds, by their nature, must isolate resources between customers as extensively as possible to achieve acceptable quality of service. There is no trust or cooperation between virtualized customers.
No virtualization technology does this perfectly, but some do a better job than others. Xen-based and VMware-like solutions are some of the more capable in this arena. Because both implementations run completely separate operating system environments from a hypervisor, they tend to segregate the guests more thoroughly by sharing fewer resources.
This is good for guests, but bad for resource utilization. If I need as much as 16GB of RAM for my instance and I'd like to run 8 of them, that means I need 128GB of RAM in my host machine — that's an expensive box. On the other hand, if I need very little RAM (say 256MB on average of which 128MB is kernel and OS related processes) the hypervised virtualization becomes quite bulky.
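To put rough numbers on that, here is a minimal Python sketch of the arithmetic. The RAM figures come from the paragraph above; the function names and the 40-guest count are only illustrative, and the model assumes every hypervised guest carries its own full copy of the kernel and OS processes.

```python
# Back-of-the-envelope memory math for hypervised guests.
# Assumption: each guest holds its own kernel and OS processes in RAM,
# so that overhead is paid once per guest. All names are illustrative.

def host_ram_gb(guests, ram_per_guest_gb):
    """RAM the host needs if every guest gets a dedicated allocation."""
    return guests * ram_per_guest_gb


def os_overhead_fraction(guest_mb, os_mb):
    """Share of a guest's memory spent on kernel and OS processes."""
    return os_mb / guest_mb


def duplicated_os_mb(guests, os_mb):
    """Memory spent on per-guest copies of the kernel and OS processes."""
    return guests * os_mb


if __name__ == "__main__":
    print(host_ram_gb(8, 16))              # eight 16GB guests on one host
    print(os_overhead_fraction(256, 128))  # small guest: fraction that is OS
    print(duplicated_os_mb(40, 128))       # hypothetical 40 small guests
```

With the numbers from the text, the big guests need a 128GB host, and half of every small guest is kernel and OS. That second ratio is what makes hypervised virtualization feel bulky for small workloads.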
On the other side of the virtualization field are technologies like OpenVZ and Solaris Containers (a.k.a. Zones). These technologies share a kernel (and usually a filesystem buffer cache) across guests. CPU resources can be sliced up, but memory (as it is shared) is a challenge to dedicate cleanly to individual guests. While this is clearly a bad (or at least challenging) thing for public cloud providers, it is often completely acceptable for private cloud needs.
The advantage of this "lightweight" virtualization is that you can pack more guests onto a single host. We regularly run 40 Solaris Zones on a single commodity server without issue. It is particularly useful for applications that are low-powered, but in need of multiple instances to meet their availability commitments.
Burning the Straw Man
Now that we know what clouds are, what's the problem? The hype. The hype is the problem. With hype come straw man arguments that delay or hold back the healthy evolution and incorporation of this technological paradigm.
I need the cloud. In the cloud, if I need to deploy 50 machines, I can just do it. Without the cloud, I have to buy servers and wait weeks for install and spend hours installing them.
Deploying 50 new instances in a cloud is easier than deploying 50 new physical machines. But just because you can, doesn't mean you should. If it takes hours to install new machines, then you are doing your job wrong. If it takes weeks to get your machines, then you are using the wrong vendor. And most importantly, if you suddenly realize that you need 50 new machines, then you simply didn't do your job well. The cloud is not an excuse to avoid a business model. A business model includes a budget and a solid, implementable plan for growth based on thorough capacity planning. With that, you should see it coming.
There are two reasons I hear when people justify the need to deploy a large number of new machines, and both arguments fall apart when you take a closer look.
Holy cow! Look at that traffic! I need fifty new instances. Now!
I know a bit about sudden traffic spikes. If you need 50 machines suddenly to handle a traffic spike, then, in all likelihood, you have built something wrong and no amount of provisioning will help. I've had the privilege of working with some of the largest sites on the planet. I've seen traffic spikes of 10000% happen inside 30 seconds, but then again I've also seen more than a gigabit of production traffic served to the masses off two $3k USD boxes. If you are in that situation, you need a plan — and it likely shouldn't include "Oh shit! Bring 50 more instances online!"
If you are providing a service that is unavoidably computationally intensive, you actually have a solid argument. This is rare and I'll touch on that later.
I have a lot of developers and they each need their own instance quickly and easily.
This is actually an awesome argument for the cloud. However, since these are development instances, they don't consume resources in the same way that production instances do. We give out instances like candy at OmniTI and typically can sustain about 40 instances on a single $3k USD box using lightweight virtualization. CapEx and OpEx on that are basically non-existent compared to an EC2 bill for the same. As you can see, this is an argument for virtualization, not the cloud.
I want to use the cloud because that way I don't have to worry about networking and hardware management.
Network management has to happen. Hardware management has to happen. You pay for it one way or you pay for it another. I've heard people say that it takes countless hours per month to run 40 systems including servers, switching equipment, routing, firewalls, etc. We manage around 1000 servers at OmniTI and from our immaculately maintained time tracking system I can tell you that less than 35 hours per month are spent on hardware provisioning, systems installation and concerns of space/power/cooling. That comes out to about 2 minutes per machine per month. Furthermore, I don't have any reason to believe that a cloud provider can do a significantly better job.
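The per-machine figure is simple division, and it is worth running against your own time tracking. A minimal sketch in Python, using the round numbers above (35 hours and 1000 servers, both upper-bound approximations from the text; the function name is only illustrative):

```python
# Minutes of hardware and infrastructure work per machine per month.
# Inputs are the rounded figures from the text, not exact measurements.

def minutes_per_machine(hours_per_month, machines):
    """Convert a monthly hour total into minutes spent per machine."""
    return hours_per_month * 60 / machines


if __name__ == "__main__":
    print(minutes_per_machine(35, 1000))
```

Swap in your own hours and server count to see how your shop compares.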
So, if so little time is spent on hardware and infrastructure management, why does OmniTI have a busy ops team? Because we're doing all the other stuff. Configuring software, performance tuning, and monitoring systems; monitoring systems to an egregious and offensive level. I'm not speaking of CPU temperature and disk failures (everyone monitors those). I'm talking about realized I/O ops per spindle, network packets per interface, HTTP response times, SSH keys, ICMP response latency, DNS, database health, application-level correctness and, most importantly, business level metrics. If you find this intimidating, look at Circonus as an enablement platform. If you like the cloud and/or SaaS, you'll love this service.
The operations team is the one place with access to data and traffic that is "real-time enough" to detect business issues before they manifest in significant monetary loss. Traffic anomalies, chargeback rates, visitor retention... all these translate into money. This is what ops does; they make things work; they make the business work. And they spend a lot more time trending, investigating and analyzing than they do replacing hard drives and network cards.
I can provision quickly in the cloud.
Yes. Yes you can. This is due to virtualization, not the cloud. Download a virtualization technology and provision quickly outside the cloud. I suppose that if my OS natively supports virtualization (like all modern OSs do), and my operations team leverages that to deploy new instances quickly and easily, then we've created a cloud whether we like it or not. Damn terminology. While it is now called a "private cloud," I tend to just call it infrastructure operations.
Operating in the cloud makes your environment more resilient because you have to accommodate unexpected failures.
What? This has to be the most back-assward statement I've heard on cloud computing. Eagerly adopting an environment with a higher failure rate because it forces you to be a better engineer? Well, that's not an engineer I'd hire. Good engineers have always known that things can fail and have always had to design to accommodate that truth — incessant reinforcement by some public cloud providers is unwelcome and unneeded in this case. Assuming a well engineered system (which should be an expected outcome of any engineering group) the goal should always be to minimize the likelihood of failure within budget.
What the Cloud Lacks
In addition to dismantling poorly constructed arguments for the cloud, I thought I'd detail some of the things I find completely missing in the cloud.
Generalization is the root of all evil when it comes to performance. Just because you know how to use MySQL or PostgreSQL doesn't mean it is the right tool for every data storage need. People have learned this lesson fairly well. In cloud infrastructures, there is a goal to make systems alike to improve price points for capital expenditure, reduce operation expenditure (slightly) by learning one type of system well, and make the provisioning system simplistic. This leads to the abomination that is "small," "large," and "huge" instance sizes at some cloud providers.
As an engineer, when I have to build a system for a purpose I specify as much as possible. AMD vs. Intel vs. Sparc? How many gigs of RAM? What speed should the RAM be? How much storage do I need? How many I/O operations per second are required? Should I use SSDs? How many networks must the system be on? Should we be using link aggregation or not? VLANs? No VLANs? These are all important things. If you need these things sometimes and everything has to be the same, then you get these things all the time, paying for it when you don't need it.
It is a reality that when systems are specified, compromises are made due to vendor relationships and part availability. However, the requirements that drive these specifications still exist and are at the root of the decisions: for instance I need 16GB of non disk-buffer memory for working sets and 10,000 I/O operations per second. That simply doesn't translate to three cookie-cutter sizes.
Data is a big issue. There are a lot of companies out there working on solving the data security issues that exist in public clouds — let's assume for a second that this is no longer an issue. A follow-on issue is that the cloud is "out there" and the only way to get data into and out of it is via the drinking-straw that is its uplink. Drinking-straw you ask? Yes. The internet is, even today, not as fast as a tractor-trailer full of tapes. If I have 10 TB of data (which is extremely reasonable for any business intelligence system these days), how do I back it up? I need a copy of that data off-site and secure. We have some creative solutions around this using ZFS, but still — I am contractually obligated to have my tapes (or some other off-site and off-line storage medium). Private clouds do not have this issue.
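To see how thin the drinking-straw is, here is a rough Python sketch of how long 10 TB takes to move over an uplink. The link speeds are hypothetical examples, not figures from any provider, and the math assumes the link is fully saturated with zero protocol overhead, which real transfers never achieve.

```python
# Best-case time to push a data set through a network uplink.
# Assumptions: decimal units (1 TB = 10**12 bytes), a fully saturated
# link, and no protocol overhead. Link speeds below are hypothetical.

def transfer_hours(data_tb, link_mbps):
    """Hours needed to move data_tb terabytes at link_mbps megabits/s."""
    bits = data_tb * 10**12 * 8
    seconds = bits / (link_mbps * 10**6)
    return seconds / 3600


if __name__ == "__main__":
    for mbps in (100, 1000):
        hours = transfer_hours(10, mbps)
        print(f"10 TB at {mbps} Mbit/s: {hours:.1f} hours ({hours / 24:.1f} days)")
```

Even in this best case, a saturated 100 Mbit/s link needs more than nine days for a single full copy, and a full gigabit needs most of a day.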
Scaling Out or Scaling Up
So many people talk about scaling out. Scaling out. Scaling out. Scaling out. Scaling out is an excellent approach to tackling requirements that cannot be easily accomplished on today's hardware. Not everything needs to be scaled out. I hear people say "I'm going to have millions of records, I need to make sure my design can operate on many machines." Millions? You're going to go through the effort of tackling distributed systems problems for a million rows? You have priority issues. A single machine (with failover) is enough to do most jobs. People lose sight of this too often. Making things redundant (hot failover) is a lot easier than making them actively distributed. So, if you can get away with scaling something vertically, do it.
There are many cases where the growth of a specific system component simply outpaces the availability of reasonably priced hardware to scale it vertically. In these cases, you should make your problems smaller. (You'd be surprised what can be accomplished over beers with an expert in the field). If that fails, then you roll up your sleeves and design your system to scale horizontally. Very few systems require horizontal scalability from soup to nuts.
Where It Works
I said before that if you need to spin up 50 instances you clearly didn't do a good job planning. I'll recant that and better qualify where that is acceptable. That is acceptable when that is your well thought-out plan. When would you need to spin up 50 new instances? Let's say you need to transcode a ton of video, let's say you need to sequence some DNA, let's say you need to use a lot of computational resources for a brief period of time and that is essential to your business model. This is where the cloud shines like a super-star.
For computationally intensive tasks that are irregular, the idea of batching work into a cloud of compute nodes is an excellent one. Here, the advantages are clear. Given that each job can really gobble up CPU resources, you can't leverage the consolidation that virtualization offers. At this point, the disadvantages are purely the outcome of an equation of economics. How much does a CPU-second cost and how much does it cost me to move the input for my job into the cloud and extract the output from the cloud: instance costs and bandwidth costs.
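That equation is short enough to write down. A minimal sketch in Python follows; every price and job size in it is a made-up placeholder rather than any provider's actual rate, and the function and parameter names are only illustrative.

```python
# Cost of one batch job in a public cloud: compute plus data movement.
# All prices and sizes used below are hypothetical placeholders.

def batch_job_cost(instances, hours, price_per_instance_hour,
                   gb_in, price_per_gb_in,
                   gb_out, price_per_gb_out):
    """Total cost: instance time plus bandwidth in and out."""
    compute = instances * hours * price_per_instance_hour
    bandwidth = gb_in * price_per_gb_in + gb_out * price_per_gb_out
    return compute + bandwidth


if __name__ == "__main__":
    # Hypothetical job: 50 instances for 4 hours, 500 GB in, 50 GB out.
    cost = batch_job_cost(
        instances=50, hours=4, price_per_instance_hour=0.10,
        gb_in=500, price_per_gb_in=0.10,
        gb_out=50, price_per_gb_out=0.15,
    )
    print(f"estimated job cost: ${cost:.2f}")
```

Plug in real rates and compare the result against what it would cost to keep the same capacity idle in your own racks between jobs. In this made-up example, moving the data costs more than the compute does.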
The Honest Truth
While it may appear that I hate the cloud, it simply isn't so. I hate the half-baked arguments for it. I hate the hype. It is a perfectly legitimate tool in the already large arsenal of engineering tools. Use the cloud where it makes sense, but please stop bludgeoning me with the hype.