The field of web operations is one with which I am intimately familiar. For the last twelve years, I have immersed myself in this field and have had the distinct privilege of helping define it. Even now, writing a job description for a web operations specialist is nearly impossible, and when I speak with colleagues about what web operations truly is, we all seem to articulate things differently. I wrote an article a little over a year ago after attending the first O’Reilly Velocity Summit. I now sit in a hotel room preparing my workshop for delivery at the second annual Velocity conference and realize very little has changed. While I still believe the definition of web operations is in flux, I truly appreciate a forum in which it can be explored further. I strongly encourage anyone in the Bay Area to swing by and partake.
While attending the summit that helps plan this conference, I had two epiphanies:
- a realization of the lack of a career path for people who do what we do (no standard titles, no standard roles and responsibilities and certainly a lack of sex appeal);
- a clear lack of terminology for the technology requirements that are so common in these environments.
Terminology is easy, in my opinion: you just argue until someone wins. Of course, arguing is a hobby of mine, so I am biased. On the other hand, defining a career path that the industry accepts is hard.
The Career: Web Operations
The term Web Operations was used a lot during this event. While it is not awful, I really do not like this term. The hard part is that the captains, superstars, or heroes in these roles are multidisciplinary experts. They have a deep understanding of networks, routing, switching, firewalls, load-balancing, high availability, disaster recovery, TCP and UDP services, NOC management, hardware specifications, several different flavors of UNIX, several web server technologies, caching technologies, several databases, storage infrastructure, cryptography, algorithms, trending and capacity planning. The issue: how can we expect to find good candidates who are fluent in such a nimiety of technologies? In the traditional enterprise, you have architects who are broad and shallow and teams of experts who are focused and deep. However, the expectation is that your “web operations” engineer be both broad and deep: fix your gigabit switch, optimize your MySQL database and guide the overall architecture design to meet scalability requirements.
I struggle with this. Not everyone can be a superstar. More importantly, no one can really start as a superstar. If we use an apprentice model (which is common in industries without institutional support) we limit the total number of able workers in this field. So, how do we (re)define the requirements for a junior web operations person?
To make this a legitimate discipline, we need a plan for hiring people and progressing them through a career path. During conversation, one of my colleagues said they just hire people that they think are agile: “If I tell them to know IOS well enough to configure a router and troubleshoot a problem, I expect them to show up tomorrow with a basic understanding of IOS and ready to start typing in commands at a console.” I agree this sort of “no boundaries” attitude is required for the job, but where do you start?
Another person suggested that the position’s lack of sex appeal stems from a popular attitude: many people apply for development positions, “don’t quite make the cut,” and are instead offered system administration positions. I personally don't subscribe to this philosophy and we certainly do not operate like that at OmniTI, but I have seen it in other companies; I hope it is not prevalent.
Basically, this is one of the few positions in the organization that has no boundaries of responsibility. If something breaks, it is your problem. Why isn’t this the case throughout the organization? Why is it that even the most junior of developers doesn't wake up to fix their code when it breaks and causes service degradation in the middle of the night? This level of responsibility is rarely expected of developers, while it is quite a common expectation of the operations crew.
Circling back, I really do not like the term “web ops.” I realize it is not far off, but it isn’t sexy. Google has a few different roles with this level of responsibility. One I like is called: “Site Reliability Engineer.” However, I would like a set of job titles and a progression through them that makes this an appealing career path for young, ambitious geeks.
In order to define these roles, we should think about what they are responsible for. In our organization I see this as a few things:
Junior
On the junior level, they are responsible for learning. They are responsible for deploying new services and documenting such deployments. They are responsible for instrumenting deployments to make sure that faults are detected and trending is possible.
Mid-level
On the mid-level, they are responsible for all of the above, and more: effective and complete troubleshooting of failures, making sense of trending information, understanding the workloads that exist, tuning systems to better accommodate current workloads, and proactively tuning to handle known future workloads. One of the key differences between mid-level and junior is the ability to correctly prioritize remediation of issues during incident response: staying calm and collected, and executing with clarity of thought during an emergency.
What does “complete troubleshooting” mean? I mean troubleshooting without boundaries. I want no shyness in cracking open developer code and telling them what they did wrong and why. Finger pointing at people simply doesn’t work; you have to point your finger at implementation problems, not people. Doing that requires the skill to track a performance problem or reliability issue down to a specific line of code or approach.
Senior
On the senior side, technology research and selection is a must. Additionally, they are responsible for incorporating new technologies into the architecture to improve availability and reduce costs, for constantly analyzing systems to improve efficiency, and for capacity planning: understanding growth well enough to ensure that provisioning and deployment outpace need. Donald Knuth famously wrote that premature optimization is the root of all evil; I've long said that the ability to accurately determine what is premature separates senior from junior.
One of the core responsibilities that all engineering disciplines share is assessing the appropriateness of the technologies at hand. For example, a “Web Architect” must ensure that technology selection as well as development and deployment strategy match the business need. This is “hard.”
Above and Beyond
Web operations is a special role. This role is in no way fitting for failed developers; it is for developers and engineers who have outpaced their career path: someone who has a deep understanding of how things work, “a complete systemic view of general site architecture.” However, they want more responsibility. They want to make sure that all of it works all of the time: the app, the stack, the hardware, the network. Whatever technology the business needs, it must work, it must perform and it must be able to meet demand. Lastly, in their heart of hearts, they must believe that all problems are equal in their need for resolution and that problem prioritization is dictated by business impact and not by flights of fancy (how cool or interesting the problem is).
It is an impossible job requirement: “Knows everything about all technologies deployed in Internet architectures.” While no one fills this requirement, what I want is someone whose career goal is to find out how close they can get.