Many technologists I know are math nerds. I know I am. Just in case you don't recall, though, here's an asymptote.
What does this graph have to do with DevOps? Well, I like to think about the pursuit of perfection as a curve approaching an asymptote. It's impossible for the two to meet, just like it is impossible for you to reach perfection, aka, "DevOps Utopia." While we live in a finite world and, like our little curve, can never reach perfection, there is a positive side to all this. Let's look at quadrant one of our graph.
The blue point: that's your organization right now. The green point: that's a good goal to reach. Surely, you do some "devops-y" things now, but what can we do to improve? Luckily, we can make significant progress when we move along that x-axis in the positive direction.
So, What Is This DevOps Thing Anyway?
In order to improve upon something, one must understand it. So, "What is DevOps?" is a great first question. I found some definitions by some really smart people. These ideas are intertwined throughout this article, so here they are in all their independent glory.
A Cultural and Professional Movement.
- Adam Jacob, Opscode
Anything that makes interactions between development and operations better.
- Thomas Limoncelli, Google
I've noticed that a couple of interesting things are missing from these definitions. There is no mention of it being a new department or job title. Being a movement, many people rightfully identify with it; however, I don't believe it's "you." I view it as the next evolution along the path of "WebOps." I see it as "doing things in a DevOps fashion." WebOps 2.0, if you will.
Well then, how do we learn how to do DevOps? "By their deeds you will know them." DevOps is something you do, personally and professionally. It is this pursuit that I consider here.
The Cultural Movement of DevOps
There is clearly a culture behind DevOps. Some even call it the "Cult of DevOps." Whatever you call it, the cultural portion involves the interactions between you and your peers. Just like some forces in a galaxy far, far away, there are Light and Dark sides; the Inclusives and the Exclusives.
The Inclusives. These are the people that you find you want to be around. You all build cool stuff together and help each other out. These are the awesome people who have a passion for what they do; they love to show you and tell you all about it, given a chance. Think about the people you can't wait to tell when you have a cool, new project. Chances are, these people are Inclusives.
The Exclusives. These people are on the complete opposite side of the spectrum. No one likes to be around them. If you dare share your passion with them, don't expect much. They attack you if your choice of tool isn't the same as theirs. They fight you if you dare have an opinion of your own, and they are the grumpy "Us vs. Them" bastards. They might be chronic assholes.
It's worth mentioning that my first draft of "The Asymptote of DevOps Utopia" pointed to the "No Asshole Rule" as what ultimately distinguished the Inclusives from the Exclusives. I've come to the realization (after much discussion) that a "No Asshole Rule" is simply not enough. It's like Google's "Don't Be Evil" mantra. Everyone has their own definition of what makes an asshole. Therefore, I propose the "Be Fucking Nice" rule instead. Common courtesy, respect and leniency for others, especially those different from you, are things we should all strive toward. The DevOps meetups I've attended always feel like extended family reunions. Extend that same respect you show your meetup members to your coworkers and I'm betting it will go a long way, indeed.
The Professional Movement of DevOps
We've all heard about the departmental silos that exist in many organizations. You've got your sysadmins on one team, the developers on another, security and network in another and designers in yet another. Requests have to filter up the chain, through the director that runs the silos, to a VP or other director, who then forwards the requests down to the appropriate team, until it reaches someone who actually does the work. This is woefully inefficient and it's where we get the "throw it over the wall" cliche. What can we do? Enter, the "Happy Fun Pile."
No, the "Happy Fun Pile" isn't a giant, adult-sized, ball pit (though I hear some companies have those). The Happy Fun Pile is where you get everyone working together. It's a really simple concept, though many companies seem to have trouble embracing it. One misconception is that, without silos, you wouldn't need directors. Directors are still needed, but their job is no longer to funnel communications and requests between teams. Their job is to make sure the team has whatever is necessary to get things done and work together well.
So, how does the ambitious director facilitate the Happy Fun Pile? Here are three things to start with (though I'm sure you can find many more):
- Optimize for serendipitous interactions
- Embrace asynchronous communication
- Fix your broken meetings (hint: most of them are broken)
Optimizing for serendipitous interactions is like fluid physics for ideas. Set up physical workspaces in such a way that people can easily interact: if two people are discussing their ideas, others can join in. Of course, the culture must support this, but it can help people refine their ideas. Exactly how you do this with your space is up to you. Maybe clusters of desks throughout the room. Maybe one big table with everyone gathered around it. However you do it, mix everyone together and let ideas collide like atoms in a fluid.
Asynchronous communications allow collaboration to occur when people "feel like it." Think back to when you did your best work, when you happened upon your "eureka" moments; they probably didn't happen at the same time as everyone else's moments. Perhaps you were pondering your ideas over a cup of coffee in the early morning. Maybe it was in a late night hacking session. Chances are, you've had more inspired moments away from your desk or outside of meetings than during them. How can your organization leverage these bursts of inspired thinking? Asynchronous communication. Email and chat programs that feature a browsable, searchable history can act as a collaborative "Commonplace Book." This can allow asynchronous work schedules to succeed as well, though that is a topic that warrants its own, dedicated article (or series of articles).
Fix your broken meetings. Chances are, most of your meetings are broken. Personally, I think meetings suck. Most of the technical people I know work on what's called the "maker's schedule." The type of work they do typically involves throwing the entirety of their cognitive abilities at a hard problem over a significant period of time. This clashes with the "manager's schedule," upon which most managers operate. The "manager's schedule" is best visualized with a day planner: convenient slots of time every half hour or so, where you can write in what you will be doing. Due to the nature of their work, a manager's day is often sharded into many little chunks. Penciling in an afternoon meeting is no problem.
Unfortunately, many businesses attempt to force their makers to operate on a manager's schedule too. Little meetings scheduled in the middle of the day are a bane to the maker's productivity. The positive side of this is that it's easy to change. Here are a few tests you can run before scheduling a meeting, to see if its impact is truly justified:
Calculate and add the following:
- Is the meeting's value greater than its human-hour cost? (Y) meeting_value / (attendees × hourly_rate × meeting_length); (N) -1
- Is the meeting at the beginning or end of the day? (Y) +1; (N) -1
- Will there be real food provided? (Y) +1; (N) -1
- Is attendance completely optional? (Y) +1; (N) -1
- Is this the most efficient way to convey the information? (Y) +1; (N) -1
- Perhaps, weekly recaps emailed from each team member instead.
- Leverage asynchronous chat logs.
- Is this primarily a social event? (Y) +1; (N) -1
Add together your values for each portion. If the value is positive, schedule the meeting. If it's negative, don't schedule it. Just be honest with your answers.
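The checklist above boils down to a small scoring function. Here's a toy sketch of it; the parameter names and the dollar figures in the example are made up, so plug in your own:

```python
def meeting_score(value_dollars, attendees, hourly_rate, hours,
                  at_day_edge, real_food, optional, most_efficient, social):
    """Score a proposed meeting per the checklist above.
    Positive score: schedule it. Negative: don't."""
    # Is the meeting's value greater than its human-hour cost?
    cost = attendees * hourly_rate * hours
    score = value_dollars / cost if value_dollars > cost else -1
    # Each remaining yes/no question adds +1 or -1.
    for yes in (at_day_edge, real_food, optional, most_efficient, social):
        score += 1 if yes else -1
    return score

# A mid-day, mandatory, foodless status meeting for 8 people at $75/hr:
print(meeting_score(500, 8, 75, 1.0, False, False, False, False, False))  # -6: don't schedule it
```

Run it honestly on your next recurring meeting and see which side of zero it lands on.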
Operations ALL The Things!
Theo Schlossnagle mentions "*Ops" in his "Web Operations as a Career" talk. In it, he explains how you must integrate the operational mindset into every part of your business. Operations is very broad and covers many things, so I'll be focusing on "technical operations," like we see in the typical sysadmin career.
Technical operations is responsible for two primary things: system availability and efficiency. System availability is the easier half of the equation and involves the person's troubleshooting skills. "Is it down? Get it back up!" Efficiency though, is much more difficult. When I say efficiency, I'm not talking about the efficiency of your servers. Rather, I'm talking about the efficiency of everyone else in your organization. How can you achieve efficiency? Three main things: set standards, enable everyone, and be the fire marshal.
Set standards and ensure they are followed. Of course, it is your responsibility to make sure that the standards are highly efficient. An example of an efficient standard might be, "All new server instances must have LDAP credentials and SSH keys set up for all sysadmins and the dev teams that need access to that machine, within 5 minutes of creation." Design a process to accomplish this task (such as through automation, in this example), audit and test your standard process in different cases, and verify it works. Then ensure the standard is upheld.
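A standard like that is auditable. Here's a toy sketch of the audit step, assuming you can get a creation timestamp and a credentials-ready timestamp for each instance (the instance names and times are made up):

```python
from datetime import datetime, timedelta

# The 5-minute standard from the example above.
STANDARD = timedelta(minutes=5)

def audit_provisioning(instances):
    """Return the names of instances that violated the standard.
    `instances` maps name -> (created_at, credentials_ready_at)."""
    return [name for name, (created, ready) in instances.items()
            if ready - created > STANDARD]

instances = {
    "web-01": (datetime(2013, 6, 1, 9, 0), datetime(2013, 6, 1, 9, 3)),   # 3 min: fine
    "web-02": (datetime(2013, 6, 1, 9, 0), datetime(2013, 6, 1, 9, 12)),  # 12 min: violation
}
print(audit_provisioning(instances))  # ['web-02']
```

Wire a check like this into your automation and the standard enforces itself.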
Enabling everyone helps maintain productivity in an obvious way. If people are blocked from doing what they need to be doing, then they can't get it done. It's not rocket science. Let's say a developer comes to you and says "I need a bunch of hardware in the datacenter by the end of the week. I'll need it set up to go to production by the following Monday." There are (at least) two ways to handle this. First, you could say, "Sorry mate! I simply can't get that done. I can order the hardware and have it express shipped to the datacenter, but there's simply no way for me to get that done in time." Here, the developer came to you with a problem because he needed it solved and you're the guy who comes to mind. Now, you are turning him down. Perhaps, instead, the conversation could go like this, "Well, I can get the hardware here on express shipping. Tomorrow, while you and your dev team read up on the standards I wrote, I'll go rent a bus. Then you, your team and I will go to the datacenter and get this done!" Chances are, your developer will have a change of heart and realize that it's not that important. On the other hand, if he says OK, well...you get to drive a bus! Also, this lets you push responsibility to the edges.
Being the Fire Marshal is cool. You get to put stickers on stuff and tell people, "please don't do that, lest you conflagrate." You get to be like that, but with (hopefully) no burning stuff. Many people consider operations as a sort of "firefighting". Really, though, good operations isn't about individual heroics that save the day, and if you rely on that to keep your systems up, then I'm sorry that your ops team hates their job. Instead, you want to have your ops team be fire marshals. You see, fire marshals set standards. They are also responsible for drilling for disaster. They test procedures to ensure that they work. You can do this too. Break production infrastructure and test your disaster recovery systems. Validate your systems against a "tenth floor test." By doing these things, you prepare people for the worst. Remember, being the fire marshal is cool, and makes your life and organization that much better.
Site Reliability Engineering
"Site Reliability Engineering" (SRE) is something you hear more and more these days. It often feels like the elusive "DevOps," in that, when I talk to people about it, no one seems to share the same definition. Personally, I view it as the sophisticated name for WebOps. Lets break it down.
With site reliability engineering, we have two goals: high velocity and extreme reliability. High velocity means rapid growth in a positive direction, and extreme reliability means record-breaking uptime. These embody the core of web operations too. Shall we go further? Site. Reliability. Engineering.
Your Site is your...SaaS, web app, website, service. Whatever it is, this is the objective toward which you direct your efforts. Reliability means that your site is consistently available, operable and fast. Not just fast as in speed, but velocity: we must focus on speed in a positive direction, rather than speed for speed's sake. Engineering seeks to build things that make life better.
Three key tenets of reliability: Reliability Budgets, Operable Code, and Monitoring. Set standards for these and your site will show it.
Reliability budgets are based on your SLA, typically your quarterly SLA. Collect metrics that measure the availability of your site and report on its uptime. This should be reported automatically and visible to anyone in your organization. When it comes time for a new deployment, use a "canary system" to test, deploy and potentially roll back. Automated push and roll back cannot be stressed enough here. Use it to upgrade a single machine and test. If it's still good, keep gradually rolling it out until you hit a threshold and upgrade everything. (The threshold is up to you.) If (when) things go bad, roll back. It's called a canary system for a reason. It exists to tell you when things are going bad. After a failed deploy, the reliability budget is recalculated and you can try again. The only time you can't keep pushing is if you have already reached the limit set in your SLA.
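The whole loop is numbers-driven, so it sketches out cleanly in code. This is a toy model, not anyone's real tooling: the 99.9% SLA, the host names, and the `deploy`/`healthy` callables are all hypothetical stand-ins for your push and monitoring systems:

```python
# Quarterly reliability budget for a hypothetical 99.9% SLA.
QUARTER_MINUTES = 90 * 24 * 60
BUDGET_MINUTES = QUARTER_MINUTES * (1 - 0.999)   # ~129.6 minutes you may "spend"

def canary_rollout(hosts, deploy, healthy, threshold=0.25):
    """Upgrade one host at a time; once `threshold` of the fleet is
    healthy on the new version, push everywhere. Roll everything back
    on the first unhealthy host."""
    upgraded = []
    for host in hosts:
        deploy(host)
        if not healthy(host):
            for h in reversed(upgraded + [host]):  # automated rollback
                deploy(h, rollback=True)
            return []
        upgraded.append(host)
        if len(upgraded) >= threshold * len(hosts):
            remaining = [h for h in hosts if h not in upgraded]
            for h in remaining:                    # past the threshold: upgrade the rest
                deploy(h)
            return upgraded + remaining
    return upgraded

# Simulate a clean rollout across eight hosts.
deploy = lambda h, rollback=False: None
result = canary_rollout([f"web-{i}" for i in range(8)], deploy, lambda h: True)
print(len(result))                 # 8 -- the whole fleet upgraded
print(round(BUDGET_MINUTES, 1))    # 129.6
```

A failed deploy returns an empty list, you subtract the downtime from `BUDGET_MINUTES`, and you try again, until the budget runs out.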
The key with reliability budgets is that they're entirely numbers driven. This helps remove personal bias from the process of deployment. Hopefully, such a system will push the team toward operable code. What is operable code though?
Operable code fails gracefully. It reports useful error codes. It has solid and useful documentation. Honestly, if you don't document your code, who will? There are many different ways to implement documentation. Whatever you decide, just get with it and stick with it.
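Here's a small sketch of what "fails gracefully with useful errors" can look like, using a hypothetical config loader (the keys and defaults are made up):

```python
import json
import logging

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("app")

def load_config(text, default_port=8080):
    """Parse a JSON config, failing gracefully with a useful message.
    A bad config degrades to documented defaults instead of crashing."""
    try:
        cfg = json.loads(text)
    except json.JSONDecodeError as err:
        # Useful error: say what failed, where, and what happens next.
        log.warning("config parse failed at line %d: %s; using defaults",
                    err.lineno, err.msg)
        cfg = {}
    return {"port": cfg.get("port", default_port)}

print(load_config('{"port": 9090}'))   # {'port': 9090}
print(load_config('{oops'))            # logs a warning, returns {'port': 8080}
```

The ops person debugging this at 3 a.m. gets the line number and the failure mode, not a stack trace.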
Monitoring has obvious benefits. You obviously want to know if your site is up and operational. Reliability budgets rely on monitoring. However, your monitoring needs to be more than, "Yup, site is there." You should monitor everything you can. The only thing I will caution here is that, while monitoring everything is good, be selective about what you alert and trend on. If you are watching a particular metric for trending, make sure you can tie it to direct business impact. An example that comes to mind for me is "average load time." While this is important, do not be distracted by it. Let's say you push new code out to your website. You've done your tuning and got everything set. You test it and find that your average load time went from 600ms to 625ms. If you have a really busy site with millions of users, you probably don't flinch at a 25ms difference. However, what if the thing that caused your spike wasn't a slight increase in the overall average, but a significant increase in outlying cases? What if, for 10% of your users, it's actually taking more than a second to load now? Because you get such high traffic on your site, you don't even realize it. This is where focusing on the wrong metric can blind you to real problems. Alerts are much simpler. "Can this metric be tied directly to significant financial impact?" If yes, go ahead and page for it at 3 a.m. If not, no one cares that much. No one. If they tell you they do...they lie. (You can have them paged instead!)
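To see how an average can hide that 10% tail, compare the mean with a high percentile on two made-up request samples (every number here is hypothetical):

```python
def percentile(samples, p):
    """Nearest-rank percentile: the value p% of samples fall at or below."""
    s = sorted(samples)
    return s[min(len(s) - 1, int(p / 100 * len(s)))]

mean = lambda xs: sum(xs) / len(xs)

before = [600] * 100                # every request ~600ms
after = [525] * 90 + [1525] * 10    # 10% of users now wait >1.5s

print(mean(before), mean(after))                      # 600.0 625.0 -- looks harmless
print(percentile(before, 95), percentile(after, 95))  # 600 1525 -- the real story
```

The mean moved 25ms; the 95th percentile jumped by almost a full second. Trend on the percentile.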
Engineering (aka, Building Cool Stuff)
One of the responsibilities of your operations team is to build and/or implement tools to make other people more efficient and make their lives better. Things like a canary system or automation tools fall under this category. One important point is that there should be a "self service portal" for developers so that they can access SRE knowledge bases. It should also let developers request new server instances, implement monitoring and dashboards, troubleshoot problems and prepare for launch readiness reviews. All of this should be possible without the assistance of a sysadmin. "Push Button/Receive Server" should be the design objective of this portal, literally. After pushing the button, developers should get an email with login information, a dashboard set up and monitoring automatically in place.
This sort of system makes life much easier when testing code. Instead of the hacked together blob that is the average developer's laptop, you get a known state system that is identical to what will be faced in production. You'll hear no more, "Well, it worked on my system/laptop/workstation."
The documentation available on the portal should include articles, videos and how-to sessions. SRE "open office hours" should be posted so that developers can ask questions of the SREs. Ultimately, the goal is to build up the skills of the developers, so that they can be self-supporting. This way, those issues that actually reach the SREs are due to a true, deeper problem.
Overall, the portal acts as a workforce multiplier. A small SRE team can support many, many developers. You achieve this via automation. SREs touching each and every server is a system that does not scale. Don't consider such a situation acceptable. The portal, being such a powerful workforce multiplier, enables our next point.
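At its core, a "Push Button/Receive Server" endpoint boils down to something like this sketch. Every name here (the hostnames, URLs and fields) is hypothetical; in a real portal each field would be backed by your provisioning and monitoring automation:

```python
import uuid

def provision_server(owner, role):
    """One request: a new instance, wired up with everything the
    developer needs, with no sysadmin in the loop."""
    instance = f"{role}-{uuid.uuid4().hex[:8]}"
    return {
        "instance": instance,
        "login": f"ssh {owner}@{instance}.internal",    # emailed to the owner
        "dashboard": f"https://dash.internal/{instance}",
        "monitoring": True,                              # checks enabled at creation
    }

server = provision_server("alice", "web")
print(server["instance"].startswith("web-"))  # True
```

The point of the sketch is the shape of the response: login, dashboard and monitoring all arrive together, automatically, at creation time.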
Dedicated SRE Team Support
Sometimes, projects require dedicated SRE team support. Empowering the developers to run all their own stuff only goes so far. So, let's lay down some requirements. For something to be eligible, it must be of high importance to the company, have a low operational burden, and pass a hand-off readiness review. Perhaps there are regulations like Sarbanes-Oxley. Of course, above all other requirements is SRE availability.
The hand-off readiness review is paramount here. This checks to see if the project is operable. Volume of alerts is checked. A high volume of alerts would indicate that there is something broken in the underlying system and that needs to be worked out first. Next, monitoring, system architecture and release management. These all revolve around reliability and scalability. Outstanding bugs and a general review of "production hygiene" complete our set. These aim to ensure that there are no underlying issues that may have crippling effects. The review is done with the development team and a couple of SREs. The most successful teams, in order to pass the review with flying colors, work with the SREs during office hours and request consultations as needed, before the review. Such consultations should be accommodated to the best of the SRE team's ability.
Once an SRE team takes on dedicated support, the development team is kept up to date and regular communication still occurs. If the system starts to deteriorate, (say, crappy code is wrecking things) then the SRE team can--and should--hand back the operations of the site to the developers. This will allow the developers to fix the code issues and clear things up. Then another, quicker review is done and the SRE team resumes operational support.
Benefits for All!
What are the benefits of making the movement toward "DevOps Utopia?"
- Developers are committed to fixing issues
- SREs are neither expected nor required to support substandard services
- SREs can say "yes" to change, yet have a way to encourage stability
- Future designs will reflect the knowledge and experience gained from running their own infrastructure
- Access to SRE knowledge, monitoring and tools allows developers to do their jobs more efficiently
- Developers know what to expect when it comes to deployment and working with SREs
- The adversarial relationship that can exist between developers and sysadmins is eliminated
- It makes life better for everyone
Now, I know there are many other things you can do to move toward "DevOps Utopia," but I hope this gives you a starting point. If it seems overwhelming, pick one point and start on it today. Within a few weeks, it will be second nature and you can work on the next point. All that matters is the positive movement forward. Remember the curve? Even the small movements make significant differences.
Once you feel you have reached the green point on the graph, congratulations! Now, step back and observe. What else can you do? Challenge yourself. Put your newly reformed organization on the blue dot again. How do you get to the green dot? While "DevOps Utopia" is ultimately unattainable, keep going and you will find you can get pretty damn close.
Further study and commentary over on my blog, liberumvir.com.