Instrumentation and Observability
By: Theo Schlossnagle 8 Sep '10
There has been considerable momentum established behind a movement called devops. This momentum is good. There does not appear to be anyone coming out and saying "this whole devops movement is bad and ignorant." So, with no notable adversaries in sight, it stands to reason that the movement is a "good thing."
The devops movement is often thought of as an effort to bring the operations world into the development (software engineering) world. Statements like "A into B" are vague. Let's be clear on the concept: introduce the wisdom and experience from software engineering into the operations realm. The software engineering world, over its brief history, has established many excellent paradigms including testing, version control, release management, quality control, quality assurance and code review (just to name a few). These concepts, while they exist in good operations groups, are admittedly far less formalized and could stand some rigor.
So, while one might think we'd discuss the merits of software engineering principles in operations in this seed, we're happy to disappoint. There are plenty of people talking about this already; they are making excellent points and getting their points across.
Operations is not, and has never been, a janitorial service. Operations crews are responsible for the impossible: everything must be up and functioning all the time. This is an expectation that one can never exceed. One can argue that we establish SLAs (service level agreements) to bring these expectations within reason, but SLAs are legal terms that articulate allowable downtime, not desired downtime. Users want services available all the time. As a result, operations is faced with an impossible task and, amazingly, makes good on unpromised availability more often than not. Let's talk about the not.
Operations is, by definition, the group that operates things. These "things" encompass the entire technology stack: networking and systems hardware, operating systems, COTS (commercial, off-the-shelf) and open source application software and in-house tools. Consider the following statement, which seems obvious, but is commonly overlooked: "It is easy to operate software and hardware that is operable." Many common components in the information technology stack are simply inoperable by our definition. This is where we get into the meat of things: how does one define operable?
Defining a component as operable is quite simple. Inevitably, things go wrong. When things go wrong, they must be understood to be repaired. Troubleshooting is a zetetic process. To progress, one must ask questions. These questions must be answered. This should be plain and obvious to anyone who has ever experienced an unexpected outcome to a situation (technical or not). So, why is this complicated? To be effective, one must not change the situation in the course of asking the question. This caveat is where things get complicated and fortunes are made.
To observe a situation without changing it is the ultimate achievement. While Heisenberg believed this to be impossible (and we agree), one can achieve a reasonably small disturbance during observation. An excellent example is the classic philosophical question, "If a tree falls and no one is there to hear it, does it make a sound?" Let's think about that question for a moment to better understand impact and side-effect. Is the sound of the tree falling more or less likely to affect the overall situation than the actual destruction and subsequent felling of the tree? The problem with many observation systems is that, in order to observe the sound of the tree, they must hew a tree during every instance of observation. We suggest a different approach.
Many systems have critical metrics, which are diverse and specific to the business in question. For the purposes of this discussion, consider a system where advertisements are shown. We, of course, track every advertisement displayed in the system and that information is available for query. Herein lies the problem. Most systems put that information in a data store that is designed to answer marketing-oriented questions: who clicked on what, what was shown where, etc. Answering the question, "How many were shown?" is possible but is not particularly efficient. In order to answer the question, one must hew the tree and wait to hear the sound of its fall.
Instead of asking analytic questions, applications should expose this information as a consequence of normal behavior. Just as the sound of the tree falling is a natural consequence of the act of hewing, the ad serving system is responsible for tallying the total impressions and exposing that information to those that care. No significant work need be performed by the application to answer this question, just a pre-calculated response to a simple question. This enables a new way of application observation where witnessing metrics and their changes requires no substantial work by the application. This paves the way to new types of application monitors (for example high-frequency monitors) that need not worry about altering the situation by observing it.
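As a rough illustration of the idea above, here is a minimal Python sketch of an ad server that keeps a running tally as a side effect of serving. The names `ImpressionCounter`, `impressions` and `serve_ad` are hypothetical and not taken from any real system; the point is only that answering "How many were shown?" becomes a cheap read of a number that already exists.

```python
import threading


class ImpressionCounter:
    """Running tally kept up to date as a side effect of normal work."""

    def __init__(self):
        self._lock = threading.Lock()
        self._total = 0

    def record(self, count=1):
        # Called on the serving path; cost is one lock and one addition.
        with self._lock:
            self._total += count

    def total(self):
        # Observation is a cheap read; no query against the ad data store.
        with self._lock:
            return self._total


# Hypothetical process-wide counter shared by all request handlers.
impressions = ImpressionCounter()


def serve_ad(ad_id):
    # Selecting and rendering the real advertisement is elided here.
    impressions.record()
    return f"<ad id={ad_id}>"
```

A monitor can call `impressions.total()` as often as it likes without asking the marketing data store to count anything.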
Not all questions can be asked before a problem occurs. This is where observation ends and instrumentation begins. Instrumenting code allows new questions to be asked and subsequently answered in a running environment. A system admin or developer may look at a malfunctioning system and think, "How do I recreate this situation in a test environment?" We ask that question because debugging in production is taboo. If a developer instruments code well, profound knowledge of the problem may be derived without the risk of altering its state. DTrace is the king of these systems and its adoption across various operating environments is growing. Nevertheless, no one should argue that they should throw in the towel just because they don't have DTrace available to them. While powerful instrumentation might elude those without DTrace, we've found that we can get most of the way there with careful logging (a poor man's instrumentation) and continuously exploring critical metrics to expose for observation.
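One way to approach the "careful logging" mentioned above is a small decorator that records when a function starts, when it ends, and how long it took. This is only a sketch under assumptions: the logger name `app.instrumentation` and the decorator name `timed` are made up for illustration, and it is no substitute for a dynamic tracing tool such as DTrace.

```python
import functools
import logging
import time

# Hypothetical logger name; use whatever your application already has.
logger = logging.getLogger("app.instrumentation")


def timed(func):
    """Log the start, end and elapsed time of each call to func."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.monotonic()
        logger.info("start %s", func.__name__)
        try:
            return func(*args, **kwargs)
        finally:
            # Runs whether func returned normally or raised.
            elapsed = time.monotonic() - start
            logger.info("end %s elapsed=%.6fs", func.__name__, elapsed)

    return wrapper
```

Applying `@timed` to the handful of functions on a critical path leaves a trail in the logs that can be read after the fact, without attaching a debugger to production.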
Many architectural components today provide an HTTP interface, primarily via a REST API. Use it! Extend the HTTP server to expose critical component metrics via HTTP. Use JSON, or use the Resmon XML DTD. In Java, expose metrics via an MBean accessible via JMX. This can be a bit frustrating because Java-centric tools must be used to observe it, so instead, just expose those metrics via a servlet. There is even some free code for that: Resmon Java Servlet (see Resmon.java and ResmonResult.java). Exposed metrics can be tracked, trended and alerted on easily using tools like Reconnoiter or Circonus.
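For components that do not already have an HTTP server to extend, a tiny one can be added using only the Python standard library. The sketch below serves a JSON document of metrics; the `/metrics` path, the metric names and the helper names (`render_metrics`, `start_metrics_server`) are all assumptions chosen for illustration, and the JSON layout is not the Resmon format.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Hypothetical metrics, updated by the application as it does its work.
METRICS = {"impressions_total": 0, "requests_in_flight": 0}
METRICS_LOCK = threading.Lock()


def render_metrics():
    """Serialize a snapshot of the current metrics as JSON bytes."""
    with METRICS_LOCK:
        return json.dumps(METRICS, sort_keys=True).encode("utf-8")


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # Silence per-request logging so polling stays quiet.


def start_metrics_server(host="127.0.0.1", port=0):
    """Serve metrics from a daemon thread; port 0 picks a free port."""
    server = ThreadingHTTPServer((host, port), MetricsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Calling `start_metrics_server(port=8080)` at application startup would let any HTTP-capable monitor poll `http://127.0.0.1:8080/metrics`; the response is a pre-computed snapshot, so frequent polling costs the application very little.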
Making applications operable means that never again should operations personnel be stuck on the question, "The application appears hung, I wonder what it is doing?" All production code should be prepared to answer questions such as these at any time. "What are you doing?" and "How long is it taking?" are perfectly reasonable questions to ask of any piece of production code and you should demand a prompt and accurate answer. The resulting metric data is consumable by both dev and ops teams, and even by those teams' managers. After all, trending metrics is not just about detecting problems. It is also fundamental to quantifying success. This is what it means to be operable. Software engineers everywhere, please make your software operable!
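To make "What are you doing?" and "How long is it taking?" answerable at any moment, code can register its current activity before starting it. The following is a minimal sketch, assuming one activity per thread at a time (nested activities are not handled); the names `activity` and `what_are_you_doing` are hypothetical.

```python
import threading
import time
from contextlib import contextmanager

# Thread name -> (description, monotonic start time). Hypothetical registry.
_current = {}
_current_lock = threading.Lock()


@contextmanager
def activity(description):
    """Record what the calling thread is doing for the duration of the block."""
    name = threading.current_thread().name
    with _current_lock:
        _current[name] = (description, time.monotonic())
    try:
        yield
    finally:
        with _current_lock:
            _current.pop(name, None)


def what_are_you_doing():
    """Return each busy thread's activity and how long it has been running."""
    now = time.monotonic()
    with _current_lock:
        return {
            name: {"activity": desc, "elapsed_seconds": now - started}
            for name, (desc, started) in _current.items()
        }
```

Wrapping slow steps as `with activity("rebuilding index"):` means a seemingly hung process can report which step it is in and for how long, and the same dictionary can be exposed alongside the other metrics.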