Time series overlay of cluster fault injection testing

Fault injection at scale is one of the more challenging performance testing scenarios that we’re asked to tackle. Not only are we deploying high-scale workload simulations on complex infrastructures, but we are also asked to attribute changes in workload patterns to specific, controlled faults. This requires a deep understanding of the workload under test, and strict controls on that workload to ensure any observed variations in data patterns can be accounted for (and hopefully attributed to the fault being injected).

Small multiples – big data sets

These simulations produce incredibly rich graphs: powerful, beautiful, and intriguing visualizations of massive amounts of data. In this case, we’re looking at a system servicing 400+ transactions per second over a period of 2 hours, during which multiple database nodes were failed in different combinations. The instrumentation focused on the application tier logs, application-to-database connectivity, and CPU utilization at the application and database tiers. The application logs are the most interesting: they track three distinct workloads over time, with latency plotted on a log-scale y-axis to enhance visibility into distinct sub-populations of transactions. The use of small multiples (scatter plots of 5-minute intervals, one per square, arranged side by side) allows for incredible information density. Nearly 3 million distinct transactions are represented in these plots (click on the graphic unless you have superhuman vision; it’s fairly large).
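To make the small-multiples layout concrete, here is a minimal sketch of how transactions could be grouped into 5-minute panels with latency converted to a log scale. It is not the tooling used for the graphic above; the function names, the `(elapsed_seconds, latency_ms)` record shape, and the sample values are all assumptions made for illustration.

```python
import math

PANEL_SECONDS = 5 * 60                        # one small multiple per 5-minute interval
RUN_SECONDS = 2 * 60 * 60                     # the 2-hour experiment
PANEL_COUNT = RUN_SECONDS // PANEL_SECONDS    # 24 panels

# Sanity check on scale: 400 tps sustained for 2 hours is 2,880,000 transactions.
MIN_EXPECTED_TRANSACTIONS = 400 * RUN_SECONDS


def panel_index(elapsed_seconds):
    """Return which 5-minute panel a transaction falls into."""
    return int(elapsed_seconds // PANEL_SECONDS)


def bin_transactions(transactions):
    """Group (elapsed_seconds, latency_ms) pairs into per-panel point lists.

    Each point is (seconds into the panel, log10 of latency), ready to be
    drawn as one scatter plot per panel. Records outside the run window or
    with non-positive latency are skipped.
    """
    panels = [[] for _ in range(PANEL_COUNT)]
    for elapsed, latency_ms in transactions:
        if 0 <= elapsed < RUN_SECONDS and latency_ms > 0:
            point = (elapsed % PANEL_SECONDS, math.log10(latency_ms))
            panels[panel_index(elapsed)].append(point)
    return panels


# Hypothetical sample records, not data from the experiment.
sample = [(0.0, 10.0), (299.9, 100.0), (300.0, 1000.0)]
panels = bin_transactions(sample)
```

Each of the resulting point lists can then be handed to whatever plotting library is at hand, one scatter plot per panel, with all panels sharing the same y-axis range so they can be compared at a glance.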

The time series overlays were a bit tricky to line up, as the application server clock was 2 minutes and 5 seconds ahead of the database clock (yes, an oversight; a time server should have been in use), but it all worked out in the end.
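The correction itself is simple once the offset is known. As a rough sketch, assuming timestamps have already been parsed into `datetime` objects, application-tier events can be shifted onto the database clock like this (the function name and the sample timestamp are hypothetical):

```python
from datetime import datetime, timedelta

# The application server clock ran 2 minutes 5 seconds ahead of the database.
APP_CLOCK_LEAD = timedelta(minutes=2, seconds=5)


def to_db_clock(app_timestamp):
    """Shift an application-tier timestamp back onto the database clock."""
    return app_timestamp - APP_CLOCK_LEAD


# Hypothetical event time, for illustration only.
app_event = datetime(2020, 1, 1, 12, 2, 5)
aligned = to_db_clock(app_event)
```

Here an event the application server stamped at 12:02:05 lines up with 12:00:00 on the database timeline.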

Negative space

You can see the interruptions in service as distinct white gaps; note the power of negative space in this instance. Your eye is drawn horizontally through the time series data, and you immediately focus your attention on where the data is not present. This cannot be accomplished by trawling through raw text-based log files, nor can you glean this level of visibility into transactional processing from typical performance views, specifically the old standbys of “average tps” and “average latency”. Visualizations of the entire data set provide far more insight into system behavior than the isolated metrics commonly exposed through standard performance harnesses.
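A toy example shows why a single average hides what the white gaps reveal. The sketch below uses made-up data (a 10-second run at 4 transactions per second with a 3-second outage); the helper names are assumptions, and this is not the analysis performed in the experiment.

```python
from collections import Counter


def per_second_counts(timestamps, duration_seconds):
    """Count transactions in each whole second of the run."""
    counts = Counter(int(t) for t in timestamps)
    return [counts.get(second, 0) for second in range(duration_seconds)]


def find_gaps(counts):
    """Return (start, end) second ranges, end exclusive, with zero transactions."""
    gaps = []
    start = None
    for second, n in enumerate(counts):
        if n == 0 and start is None:
            start = second                  # a gap opens
        elif n != 0 and start is not None:
            gaps.append((start, second))    # the gap closes
            start = None
    if start is not None:                   # gap runs to the end of the window
        gaps.append((start, len(counts)))
    return gaps


# Made-up data: 4 transactions per second, nothing during seconds 3 to 5.
timestamps = [s + i * 0.25 for s in range(10) if not 3 <= s < 6 for i in range(4)]

counts = per_second_counts(timestamps, 10)
average_tps = len(timestamps) / 10
gaps = find_gaps(counts)
```

The average comes out at 2.8 tps, which reads like a mildly slow system, while the per-second view shows full throughput interrupted by a complete outage. That outage is exactly the kind of feature the negative space makes obvious in the plots.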