Data visualizations

Often the most effective way to describe, explore, and summarize a large set of numbers – even a very large set – is to look at pictures of those numbers. – Edward R. Tufte

I’ve spent most of my career integrating, developing, and tuning a variety of software and hardware products that support core infrastructure applications. For the past five years I have focused on aggregating data across complex infrastructure stacks and developing techniques for enhancing the visual representation of these large-scale data sets.

My philosophy centers on viewing the entire data set for rapid assessment of a system’s behavior. My focus has been the performance of high-scale systems, mostly low-level profiling of transactional latency patterns for OLTP-style applications. Though my ultimate goal is to quickly resolve my customers’ performance issues, I also strive to make the representation of their data sets comprehensive, visual, clear, and interesting to view.

I’ve compiled a sample of the data visualization projects I’ve taken on over the years, applying a variety of techniques championed by Edward Tufte to these data sets. Hopefully you’ll find looking at these data visualizations as much fun as I had building them.

In God we trust; all others must bring data. – W. Edwards Deming

Key concepts

Small multiples

Small multiples are a series of small plots repeated in succession to show how conditions change over time. A series of small plots can be very powerful for representing time series views of large data sets in a very small space. I probably use this concept the most when visualizing transactional data sets consisting of millions of data points. Typically we can squeeze tens of millions of transactions into a single page, presenting a full day of transactions in a very limited space.
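A minimal sketch of the idea in matplotlib: one tiny panel per hour of a day of (synthetic) transaction latencies, sharing axes so the panels invite comparison. The data, layout, and styling choices here are illustrative assumptions, not a reproduction of the plots described on this page.

```python
# Small multiples sketch: a 4x6 grid of per-hour latency panels.
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
import random

random.seed(0)
# Synthetic transaction latencies (ms), one series per hour of the day.
hours = {h: [random.gauss(20 + 5 * (h % 6), 3) for _ in range(200)] for h in range(24)}

fig, axes = plt.subplots(4, 6, figsize=(12, 6), sharex=True, sharey=True)
for h, ax in zip(sorted(hours), axes.flat):
    ax.plot(hours[h], ",")          # pixel markers keep the panels dense
    ax.set_title(f"{h:02d}:00", fontsize=7)
    ax.tick_params(labelsize=6)
fig.suptitle("Transaction latency by hour (small multiples)")
fig.savefig("small_multiples.png", dpi=150)
```

Sharing the x and y scales across every panel is what makes the grid readable at a glance; a panel that looks different really is different.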

Data density

One challenge I enjoy is making more efficient use of the space used to represent a data set. This means looking at data density: the number of data points represented per square inch of a plot. Maximizing the information presented to the reader allows for a rich visualization that efficiently summarizes large amounts of data.

Sparklines

Sparklines are a concept created by Tufte: small, word-sized graphics embedded in text or used in small multiples, placing a data set directly in your field of view as you read a document or parse data. The classic use case is to embed the sparkline directly in a paragraph describing the data, letting the reader view the data without moving their eye to a different part of the page. It’s far more descriptive and efficient.
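One way to get a sparkline inline with prose, without any plotting library, is to map a series onto Unicode block characters. A small sketch, with an illustrative scaling scheme:

```python
# Text sparkline: scale each value into one of eight block characters.
BARS = "▁▂▃▄▅▆▇█"

def sparkline(values):
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1  # avoid dividing by zero on a flat series
    return "".join(BARS[int((v - lo) / span * (len(BARS) - 1))] for v in values)

# Hypothetical throughput samples with a mid-run dip.
tps = [120, 135, 150, 148, 90, 40, 95, 140, 155, 160]
print(f"Throughput held steady {sparkline(tps)} except for a mid-run dip.")
```

The graphic sits in the sentence itself, which is the whole point: the reader sees the shape of the data at the moment it is described.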

Negative space

Sometimes we can say a lot by saying nothing. With data visualizations, we can show a lack of activity by simply showing white space. For instance, with OLTP-style workloads this can be quite useful for illustrating server failures. A time series plot of transactional activity may show white gaps where data is absent. This is a great indicator of a problem on the system, and large swaths of logs can be parsed quickly by scanning time series plots for small white gaps.
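The white gap a reader spots visually can also be found programmatically: scan consecutive transaction timestamps for quiet intervals. A sketch, where the gap threshold and the data are illustrative assumptions:

```python
# Find "white gaps": intervals with no transactions for longer than max_quiet.
def find_gaps(timestamps, max_quiet=5.0):
    """Return (start, end) pairs where no transaction arrived for > max_quiet seconds."""
    gaps = []
    for prev, cur in zip(timestamps, timestamps[1:]):
        if cur - prev > max_quiet:
            gaps.append((prev, cur))
    return gaps

# Steady one-per-second traffic with a ~30 s outage around t=100.
ts = [float(t) for t in range(100)] + [float(t) for t in range(130, 200)]
print(find_gaps(ts))  # the hole from 99.0 to 130.0 is the "white gap"
```

Plotted as a time series, that same hole would appear as empty space in an otherwise solid band of points.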

Sample data visualizations

You can click on an image to view it at full scale.

Plotting project test sequencing with performance data

I like blending different sources of data to create a complete view of a project. In this example, a large-scale proof of concept (POC) was conducted to illustrate the failover capabilities of a clustered database.

This particular data visualization illustrates the project timeline along with the tests conducted and their results. You can see the progress made during the project from setup through tuning and, finally, characterization. Additional efficiency metrics are captured below the plot.

[Image: hybridTimeline]

Time series representation of fault injection testing

The previous visualization contained a series of data points (experiments) connected by lines (studies) that together illustrate a complete test plan. The data visualization to the right shows the complete data set of a single experiment.

In this case, all transactions issued to the database host were plotted by workload (colored band) over time. Breaks in the transaction populations indicate fault injection tests (intentional failures injected into the test to monitor how the system reacts). Additional server resource consumption data for the cluster is presented in the bottom half of the data visualization.

[Image: timeSeriesFaultInjection]

Visualizing second-order metrics

These time series data visualizations are a personal favorite. Each bubble series visualizes a distinct OLTP workload’s second-order metric. This metric is useful as a portable mechanism for evaluating the capability of a customer’s infrastructure to service the application workload.

The top plot shows how an underbuffered database prioritized the heaviest workload (largest diameter bubble) as it should, but the key latency-sensitive workload (tiny black dots at the top) was deprioritized. This visualization clearly illustrated to the customer the need for more memory.

[Images: slope_workloadConflict, slope_workloadConflictTuned]

Visualizing fault injection impact across a database cluster

This is another cluster test illustrating a failover event. The goal was to show four different workload types and their behavior as the failover occurred. Additionally, CPU consumption data from the cluster is presented below the workload visualization. A callout highlights the specific failure event.

[Image: failoverPlot-NodeAffinityTest]

Representing software release to release sequence testing

This was the first project-oriented data visualization I created. My intent was to illustrate the performance scrum team’s velocity over time while also conveying the structure of the experiments (data points) and studies (lines connecting collections of experiments) and the results achieved.

Time is represented on the x-axis. Clusters of data points illustrate higher velocity. The up/down bars in the lower right illustrate the gain/loss ratios of the new release relative to the baseline. This is another personal favorite because of how thoroughly it represents months’ worth of work.

[Image: R2R_visualization]

Dendrobates tinctorius ‘azureus’ native weather data

This project aggregated temperature and humidity data from the Sipaliwini Savanna. This area is the home range of a dart frog that I keep, Dendrobates tinctorius ‘azureus’.

My objective with this project was to better understand the native environmental conditions this species experiences so I can attempt to recreate them in the vivarium. Additionally, understanding seasonal changes in temperature and humidity helps “cycle” the frogs, taking them in and out of breeding season.

[Images: sipaliwini_02_monthlySummary, sipaliwini_01_availableData]

Dendrobates tinctorius ‘azureus’ vivarium data

This is the follow-up to the native weather data project. This particular data set looked at temperature (left) and humidity (right) over the course of several months.

[Images: tempPlot-hourly, rhPlot-hourly]

Visualizing complete data sets


[Image: troubleshootingTput-scatterTimeSeries]

Using small multiples to visualize database buffer pool performance

This was a fun scripting project that pulled buffer pool resizing events from DB2’s diagnostic logs to present a time series view of buffer pool size (left). Additionally, data was extracted from DB2 snapshots (right) to provide a complete view of how big the buffer pools were and how frequently they were read. This is a much more powerful view than trying to parse thousands of lines in a text-based log.
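The scraping step can be sketched with a regular expression over the log text. To keep the sketch self-contained, the log lines below are a simplified, hypothetical stand-in for real db2diag.log entries — DB2’s actual message format differs, and the pattern would need to match it.

```python
# Sketch: extract buffer pool resize events from a diagnostic log with a regex.
import re

# Hypothetical, simplified log lines (NOT DB2's real db2diag.log format).
LOG = """\
2017-03-01-10.15.02 ALTER BUFFERPOOL BP_DATA size changed from 50000 to 75000
2017-03-01-11.40.17 ALTER BUFFERPOOL BP_INDEX size changed from 20000 to 30000
2017-03-01-12.05.44 ALTER BUFFERPOOL BP_DATA size changed from 75000 to 60000
"""

PATTERN = re.compile(
    r"^(?P<ts>\S+) ALTER BUFFERPOOL (?P<pool>\S+) "
    r"size changed from (?P<old>\d+) to (?P<new>\d+)",
    re.MULTILINE,
)

# One (timestamp, pool, old_size, new_size) tuple per resize event.
events = [(m["ts"], m["pool"], int(m["old"]), int(m["new"]))
          for m in PATTERN.finditer(LOG)]
for ts, pool, old, new in events:
    print(f"{ts}  {pool}: {old} -> {new} pages")
```

Once the events are tuples, feeding them to a plotting step as a time series of pool sizes is straightforward, which is exactly the win over scrolling through the raw log.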

[Image: DB2bufferPlot]

High-scale server performance data overlay using small multiples

One concept that continually surfaces in my data visualizations is the use of small multiples and sparklines to overlay data from multiple sources in an application’s supporting infrastructure. In this case, metrics are pulled from the client, application, application OS, database, database OS, and storage. This view is also constructed as a time series, and a “steady state” interval (in red) can be dynamically shifted to understand the impact on the infrastructure as workloads shift.
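The shiftable “steady state” interval can be sketched as a window slid across several time-aligned metric series, summarizing each tier inside the window. The series names and numbers below are illustrative, not data from the overlay described here.

```python
# Sketch: summarize aligned infrastructure metrics over a movable interval.
from statistics import mean

# One sample per minute from each tier, aligned on the same clock.
metrics = {
    "client_tps":   [100, 102, 98, 60, 55, 101, 99, 100],
    "db_cpu_pct":   [40, 42, 41, 80, 85, 43, 40, 41],
    "storage_iops": [500, 510, 505, 900, 950, 495, 500, 505],
}

def window_summary(metrics, start, end):
    """Mean of each series over the half-open sample interval [start, end)."""
    return {name: mean(series[start:end]) for name, series in metrics.items()}

steady = window_summary(metrics, 0, 3)   # interval before the disturbance
shifted = window_summary(metrics, 3, 5)  # interval moved onto the event
print(steady)
print(shifted)
```

Sliding the interval and recomputing the summary is what lets a reader ask “what was every tier doing during *this* stretch?” without re-plotting anything.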

[Image: sampleExcelOverlay]