When I began tinkering with web services as a hobby, it was common to fiddle with an application for days. I would curse and grind and sputter with Apache and cobbled-together programs. This would frequently unearth new challenges: setting up a mail service, creating a database to store user accounts, or pulling content from a third party. Inevitably these minor distractions would monopolize my attention, and the original application would be left to gather dust, without any documentation or monitoring in place.

This is a common problem in many professional software development shops. Project managers help to keep the development teams focused, but their goals are still feature-driven, with an eye on the next release cycle. The IT and Operations teams are painfully understaffed, left to maintain their systems and services without any training on the care and feeding of their new pets. For the hobbyist or Open Source project, this is an annoyance. If you’re running a business, operational neglect can have a dire impact on your bottom line.

Systems Administrators are unconsciously trained to look at everything through a Boolean filter. Hosts are up or down. Services are on or off. Their understanding of the application stack is often superficial, limited to the same perspective as that of a typical user. Does the website load? Can I ping the servers? This is a completely logical approach, and yet it fails to consider those corner cases where activity looks normal from the outside but an internal component suffers an unexpected condition. Stealthy failures like these can go unnoticed for months and result in significant lost revenue or wasted overhead.

Monitoring systems have improved over the years with "advanced" features like automatic discovery of hosts and services. Scanning a network, they can identify hosts and differentiate web servers from workstations. Resources are grouped logically. It’s a very turnkey way to add monitoring to your infrastructure. Unfortunately, for many companies, this is where the story ends (and the pain begins). Attention is subsequently focused elsewhere. Priorities are reestablished. One of the most important resources, the one that makes sure everything else is operating smoothly, is forgotten and orphaned. It’s easy to forget about something that doesn’t make your job easier or offer intrinsic value to your bottom line.

A poor economy and high unemployment levels remind us how important it is to optimize our existing architecture. The current trend towards Cloud Computing and Virtualization makes this even more challenging. These technologies are useful for creating highly elastic platforms on a budget, but they complicate engineering by outsourcing data storage and processing to an external black box. In turn, we’re forced to add resiliency in the form of additional processing nodes and redundant storage. This added complexity introduces countless opportunities for disaster. It’s a vicious cycle.

As the Web has become the obvious target for fresh product development, additional layers of abstraction have been introduced into the application stack. New technologies and components offer exciting ways to communicate with the end user and from one business to another. The higher we go, the more these layers are decoupled from traditional monitoring practices. The resulting programs are overly intricate and opaque. We need new ways to increase visibility and derive useful data from modern business systems.

Gaining visibility into business operations is probably the easiest improvement any company can make. Quality analytics require a solid understanding of your IT operations and business processes, and that understanding comes from transparency into your systems. Once these have been established, we should be equipped with the tools to streamline and simplify any infrastructure.

  1. Key Performance Indicators

    First and foremost, identify the external business metrics that directly affect your revenue. Establish thresholds and put fault-detection monitors into place, just like you would for any server or application; a minimal sketch of such a check follows this list. Alerts on business operations (e.g. new user registrations, orders per hour) are more important than alerts on the systems that support them. Remember that revenue is an asset and hardware is a cost, not the other way around.

  2. Review IT Monitors

    Evaluate your existing IT monitoring systems. Ensure that metrics are being gathered for every single host and service; the coverage-audit sketch after this list shows one simple way to find the gaps. The breadth and depth of data collected now will directly influence the quality of the information that can be extracted later. It’s paramount to have the metrics to support your decisions, but you won’t know which ones matter until you can compare them side by side.

  3. Stockpile Data

    Collect as many metrics as possible, for as long as possible. There are no good excuses for not storing metrics indefinitely. Storage is inexpensive, and a variety of technologies allow us to scale capacity with ease; the back-of-the-envelope math after this list shows just how modest the cost is. In three years we should be able to look back on data with the same granularity as the metrics collected just yesterday.

  4. Highlight Deficiencies

    Graph your metrics. Study their trends and formulate a plan to address immediate capacity limitations; the trend-projection sketch after this list illustrates the idea. When deploying new resources, look for hints in the trends that reveal hidden relationships in your network. But remember that this goes beyond planning for the future; this data has inherent value in supporting your ongoing decisions.

  5. Build Relationships

    Correlate graphs in ways that represent your business systems. Pinpoint metrics that relate to a common goal (sales per visit, length of visit, average page size, network latency, and webserver load); see the correlation sketch after this list. You might be shocked at the patterns revealed. If your trending application doesn’t let you correlate disparate data easily, find a new one.

  6. Empower Stakeholders

    Share the accumulated knowledge with the individuals and teams in your organization who can take action toward positive change. If possible, give them access to all of the information, not just the data that directly affects them. In large architectures, there is rarely a single person with a holistic view of the entire stack. Trust your partners and there’s a good chance they’ll unearth something you missed.
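
To make the first step concrete, here is a minimal sketch of a business-metric fault check in Python. Everything specific in it is an assumption for illustration: the SQLite database, the orders table with a created_at epoch column, and the threshold value. Substitute whatever your own stack provides; the point is that the check targets orders per hour, not CPU load, even though the mechanics are identical to any host check.

    import sqlite3
    import time

    ORDERS_PER_HOUR_FLOOR = 50  # assumed threshold; tune to your real baseline

    def orders_last_hour(conn):
        """Count orders created within the past hour."""
        cutoff = time.time() - 3600
        row = conn.execute(
            "SELECT COUNT(*) FROM orders WHERE created_at >= ?", (cutoff,)
        ).fetchone()
        return row[0]

    def check_orders(conn):
        """Flag a fault when the business metric falls below its floor."""
        count = orders_last_hour(conn)
        if count < ORDERS_PER_HOUR_FLOOR:
            # Hand off to your existing notification path (pager, email, etc.)
            print(f"ALERT: only {count} orders in the last hour "
                  f"(floor: {ORDERS_PER_HOUR_FLOOR})")
        return count

    if __name__ == "__main__":
        check_orders(sqlite3.connect("shop.db"))  # hypothetical database file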
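
For the second step, a coverage audit can be as simple as a set difference between what you own and what you watch. The file names and one-hostname-per-line format below are assumptions; in practice the two lists might come from an inventory export and your monitoring system's host list.

    def read_hosts(path):
        """Read a plain-text file containing one hostname per line."""
        with open(path) as f:
            return {line.strip() for line in f if line.strip()}

    inventory = read_hosts("inventory.txt")   # every host you own
    monitored = read_hosts("monitored.txt")   # every host with active checks

    for host in sorted(inventory - monitored):
        print(f"UNMONITORED: {host}")
    for host in sorted(monitored - inventory):
        print(f"STALE CHECK: {host}")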
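
The claim in the third step, that indefinite retention is affordable, is easy to verify with back-of-the-envelope math. The figures are assumptions: 16 bytes per datapoint (an 8-byte timestamp plus an 8-byte value, as in fixed-size time-series stores) and a fleet emitting ten thousand metrics at one-minute resolution.

    BYTES_PER_POINT = 16             # assumed: 8-byte timestamp + 8-byte value
    METRICS = 10_000                 # assumed fleet-wide metric count
    POINTS_PER_YEAR = 365 * 24 * 60  # one-minute resolution
    YEARS = 3

    total = METRICS * POINTS_PER_YEAR * YEARS * BYTES_PER_POINT
    print(f"{total / 1024**3:.0f} GiB")  # ~235 GiB, before any compression

Even tripling those numbers lands comfortably within a single commodity disk.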
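
The capacity planning in step four can start with nothing fancier than a least-squares line through recent samples. The disk-usage numbers here are fabricated for illustration, and the projection is naive (it assumes linear growth), but even this is often enough to flag a problem months in advance.

    days = [0, 7, 14, 21, 28]            # sample age in days
    used_gb = [410, 423, 437, 450, 464]  # weekly disk-usage readings
    capacity_gb = 600

    # Ordinary least-squares fit: used_gb ~ slope * days + intercept
    n = len(days)
    mean_x = sum(days) / n
    mean_y = sum(used_gb) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(days, used_gb))
             / sum((x - mean_x) ** 2 for x in days))
    intercept = mean_y - slope * mean_x

    # Project forward from the most recent sample to the capacity limit
    days_until_full = (capacity_gb - intercept) / slope - days[-1]
    print(f"growing ~{slope:.1f} GB/day; full in roughly "
          f"{days_until_full:.0f} days")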
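
Finally, the correlation in step five reduces to a Pearson coefficient between any two series. The hourly samples below (average page size versus page latency) are invented to show the shape of the calculation; real data rarely lines up this cleanly, and a coefficient near +1 or -1 is a hint of a relationship, not proof of one.

    from math import sqrt

    page_kb = [310, 295, 340, 360, 330, 375, 390]     # avg page size
    latency_ms = [120, 115, 138, 145, 131, 150, 158]  # avg page latency

    def pearson(xs, ys):
        """Pearson correlation coefficient of two equal-length series."""
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = sqrt(sum((x - mx) ** 2 for x in xs))
        sy = sqrt(sum((y - my) ** 2 for y in ys))
        return cov / (sx * sy)

    print(f"r = {pearson(page_kb, latency_ms):.2f}")  # close to +1 here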

Fault-detection and trending solutions should return more on investment than high uptime or speedy notifications. They should prepare an organization to increase capacity before limits are reached, realign resources to meet unexpected traffic spikes, help Development and Design teams to better understand your customers, and decrease the maintenance and staffing needed for normal IT operations. Feedback should be real-time and tuned to the needs of your organization. In a nutshell, they should pay for themselves and then some.