Describing a set of data over time, with sufficient detail in a concise enough format to be useful, is no easy task. Line graphs of counts and derivatives work well when each unit of time corresponds to a single data point. When each unit of time carries a whole set of values, however, compromises have to be made to fit the same format: the information is consolidated into averages, medians, minimums and maximums. Unfortunately, even with all of those calculated values, important patterns are lost. Histograms can save the day.

A histogram is a way of describing frequency within a data set. Values are grouped into ranges, as fine-grained or as wide as one desires, and then the group totals are graphed in some way. For a single set of data, the typical graph is a bar graph, with height proportional to count. If frequency counts are instead represented by color, as though the bars were viewed from above rather than from the side, it becomes trivial to show multiple sets over time by stacking these colored representations together. This visualization, the histogram heatmap, is essential for monitoring.
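As a minimal sketch of the grouping described above, the following Python counts values into fixed-width buckets for each time interval. The sample latencies, the 100 ms bucket width, and the function names are all assumptions made for illustration; a real monitoring system would choose its own bucketing.

```python
from collections import Counter


def bucket_counts(values, bucket_width):
    """Count how many values fall into each fixed-width bucket.

    Keys are the lower bound of each bucket.
    """
    counts = Counter()
    for v in values:
        lower = (v // bucket_width) * bucket_width
        counts[lower] += 1
    return dict(counts)


def heatmap_rows(samples_per_interval, bucket_width):
    """Build one histogram per time interval; stacked, they form the heatmap."""
    return [bucket_counts(values, bucket_width) for values in samples_per_interval]


# Hypothetical latency samples (ms) for three consecutive time intervals.
samples = [
    [12, 18, 25, 31, 410],
    [15, 22, 27, 29, 33],
    [11, 19, 405, 420, 28],
]
rows = heatmap_rows(samples, bucket_width=100)
for interval, row in enumerate(rows):
    print(interval, sorted(row.items()))
```

Each row holds the counts that a heatmap would render as color intensity for one column of time.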

chart1_enhanced_v1.png

Network problems are notoriously difficult to identify and describe. Above is a contrived example of latency data and a corresponding graph. With a simple graph of averages, one would have no idea about the actual latency distribution: though the sets vary wildly, their averages are more or less the same, so one would assume that results are consistent.

table_enhanced_v1.png

With additional maximum and minimum graphs, the picture is slightly more accurate, but one still won't see the magnitude of the problem. Are the extremes outliers or something more frequent? Not knowing risks either over- or under-reacting: solving nonexistent problems or ignoring fundamental issues. Disproportionate action is going to cost unnecessary time and money somewhere along the line.
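To make the point concrete, here is a small sketch with two invented sample sets (not the data in the table above) that share the same average, minimum and maximum, yet differ sharply in how often the slow responses occur. The 400 ms threshold is likewise an arbitrary choice for the illustration.

```python
from statistics import mean


def summary(values):
    """Return the (average, minimum, maximum) a summary graph would plot."""
    return (mean(values), min(values), max(values))


def share_at_or_above(values, threshold):
    """Fraction of values at or above the threshold."""
    return sum(1 for v in values if v >= threshold) / len(values)


# Hypothetical latency samples (ms), 20 values each.
rare_spike = [20] + [100] * 8 + [116] * 10 + [420]   # one slow response
frequent_spikes = [20] * 15 + [420] * 5              # a quarter are slow

print(summary(rare_spike), summary(frequent_spikes))
print(share_at_or_above(rare_spike, 400), share_at_or_above(frequent_spikes, 400))
```

The summaries are identical, so only the per-bucket frequencies distinguish a lone outlier from a recurring problem.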

histogram_enhanced_v1.png

Applying this to real response times using Circonus' histogram feature, the benefit becomes apparent. The graph above contains response times (ms) for calls to a specific network service over time, with darker green representing more frequent results. The blue line is the average response time. There are some points above 500 ms, but for visual clarity, the scale has been limited. Generally, the average hovers around 30 ms, near the more prominent lower band. The average completely masks the existence of the second band in the 400 ms area. Such a low average might lead one to conclude that the maximums are outliers in the overall picture. With the histogram, however, it becomes clear that a small, but not negligible, portion of requests was fairly consistently coming in an order of magnitude above the average, indicating that something was amiss with that service. After investigation and a code change, that higher band goes away completely, confirming that the fix worked.

The process for creating histograms differs depending upon the monitoring and graphing solution you use. For the above, we used Circonus, which makes this a breeze. With Circonus, the most straightforward way is to use an HTTP trap check and send it JSON messages in the following format:


{
  "key" : {
    "_type" : "n",
    "_value" : [1,2,3]
  }
}

Each key will correspond to a metric in the HTTP trap check. The type is "n" for numeric. The value is an array of values, so it's both possible and encouraged to aggregate application-side rather than pushing each individual value separately. After the data is flowing, enable the metric and then enable histogram collection (the icon of three little blocks). All that remains is to add it to a graph.
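The aggregation step might look something like the following sketch, which collects values application-side and wraps them in the JSON shape shown above. The metric name, the helper names, the placeholder URL, and the choice of an HTTP PUT are assumptions for illustration; use the submission URL and method that your own check specifies.

```python
import json
import urllib.request


def build_payload(metrics):
    """Map each metric name to the {"_type": "n", "_value": [...]} shape."""
    return {
        name: {"_type": "n", "_value": list(values)}
        for name, values in metrics.items()
    }


def send(url, payload):
    """Submit the payload as JSON.

    Assumption: the check accepts a PUT with a JSON body at the given URL.
    """
    body = json.dumps(payload).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


# Hypothetical metric name and values gathered over one reporting interval.
payload = build_payload({"response_time_ms": [28, 31, 412]})
print(json.dumps(payload))

# Hypothetical placeholder URL; replace with the one shown for your check.
# send("https://example.invalid/your-trap-url", payload)
```

Batching a whole interval's values into one array keeps the number of requests low while preserving every individual measurement for the histogram.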

Whenever you have changing sets of data over time, histograms are the way to go. They condense information into an easily consumable graph without sacrificing detail, shedding light on any number of problems.