Time series view of application performance metrics

This post builds on a previous post about the evolution of a second-order metric. It uses the metric developed there as the basis for more advanced visualizations that show how multiple workloads interact when they run on the same infrastructure for this application. You may recall the original 2D frequency plot, which illustrated the strong linear relationship we observe between the number of rows retrieved from a specific table and the duration of that retrieval. This key relationship drives the overall search performance for this particular application.

These selection queries can be separated by the workload that issued them. In this example, four distinct workloads are tracked, and the data is plotted in a different way:

  • x-axis is time, hourly increments
  • y-axis is “slope” value using a log scale
  • “acceptable” performance slope value = 1.0 is illustrated using the horizontal line
  • the diameter of a circle represents the relative size of the selections executed for that workload per hour
  • each workload is represented by a colored circle time series
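
The plot layout described above can be sketched in Python. This is a minimal sketch rather than the tooling used to produce the original plots: it assumes the hourly slope for each workload has already been computed as in the previous post, and the names `HourlyPoint`, `bubble_diameter`, `group_by_workload`, and `plot_slopes` are hypothetical. The sample values are made up for illustration. Plotting relies on the third-party matplotlib library, which is imported only when `plot_slopes` is called.

```python
from collections import defaultdict
from typing import NamedTuple


class HourlyPoint(NamedTuple):
    hour: int           # hours since the start of the observation window
    workload: str       # workload label (hypothetical names below)
    slope: float        # slope metric from the previous post; 1.0 = acceptable
    rows_selected: int  # total rows selected by this workload in this hour


def bubble_diameter(rows_selected, largest, max_diameter=30.0):
    """Scale a circle's diameter linearly with its relative selection size."""
    return max_diameter * rows_selected / largest


def group_by_workload(points):
    """Return {workload: [points sorted by hour]} so each workload is one series."""
    series = defaultdict(list)
    for point in points:
        series[point.workload].append(point)
    for workload_points in series.values():
        workload_points.sort(key=lambda p: p.hour)
    return dict(series)


def plot_slopes(points, acceptable=1.0, max_diameter=30.0):
    """Draw one colored circle series per workload on a log-scale y-axis."""
    import matplotlib.pyplot as plt  # third-party; only needed for drawing

    largest = max(p.rows_selected for p in points)
    fig, ax = plt.subplots()
    for workload, series in group_by_workload(points).items():
        hours = [p.hour for p in series]
        slopes = [p.slope for p in series]
        # scatter() takes marker size in points squared, so square the diameter.
        sizes = [bubble_diameter(p.rows_selected, largest, max_diameter) ** 2
                 for p in series]
        ax.scatter(hours, slopes, s=sizes, alpha=0.5, label=workload)
    ax.set_yscale("log")
    ax.axhline(acceptable, color="black", linewidth=1)  # acceptable slope line
    ax.set_xlabel("hour")
    ax.set_ylabel("slope (log scale)")
    ax.legend()
    return fig


# Made-up sample data, for illustration only.
SAMPLE = [
    HourlyPoint(1, "workload_a", 0.4, 200),
    HourlyPoint(0, "workload_a", 0.5, 150),
    HourlyPoint(0, "end_user_search", 3.0, 50),
    HourlyPoint(1, "end_user_search", 2.5, 40),
]
```

Calling `plot_slopes(SAMPLE)` returns a matplotlib figure. Scaling the diameter rather than the area keeps the code aligned with the description above, at the cost of visually exaggerating large selections.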

In this first plot, one workload dominates the others, sustaining a much lower (better) slope value over time. The other three workloads struggle to sustain an acceptable level of performance, with end-user-generated searches performing worst. You can also see that as selection size increases, the database buffer and storage pools adapt to the new read patterns and the cache efficiency of that workload improves. This particular deployment, however, struggles to provide adequate performance for three of the four workloads under review. Additional changes to the environment were recommended, including expanding available memory on the DB nodes and using workload isolation strategies that pair each workload with a dedicated application and database node pair. In essence, each workload would be bound to its own DB node, giving it an opportunity to warm a dedicated pool on that node to its selection profile.
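
One way to picture the isolation recommendation is as a static routing table that binds each workload to its own application and DB node pair. The sketch below is illustrative only: apart from end user search, the workload names are placeholders, the node names are invented, and `NodePair`, `check_isolation`, and `route` are hypothetical helpers rather than part of the deployment described here.

```python
from typing import NamedTuple


class NodePair(NamedTuple):
    app_node: str
    db_node: str


# Hypothetical workload and node names, for illustration only.
WORKLOAD_NODES = {
    "end_user_search": NodePair("app-1", "db-1"),
    "workload_b": NodePair("app-2", "db-2"),
    "workload_c": NodePair("app-3", "db-3"),
    "workload_d": NodePair("app-4", "db-4"),
}


def check_isolation(mapping):
    """Raise ValueError if two workloads would share a DB node."""
    owner = {}
    for workload, pair in mapping.items():
        if pair.db_node in owner:
            raise ValueError(
                f"{workload} and {owner[pair.db_node]} share {pair.db_node}"
            )
        owner[pair.db_node] = workload


def route(workload, mapping=WORKLOAD_NODES):
    """Return the node pair a workload is bound to."""
    try:
        return mapping[workload]
    except KeyError:
        raise ValueError(f"no node pair configured for {workload!r}") from None


# Fail fast at startup if the table violates the one-workload-per-DB-node rule.
check_isolation(WORKLOAD_NODES)
```

Validating the table up front matters because the benefit comes from each DB node warming its pool to a single selection profile. A shared node would quietly reintroduce the cache contention the isolation is meant to remove.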

Here’s another example of this visualization strategy.

In this case, the performance of three of the four key workloads under observation is quite good. The fourth workload, end user search, again has such a small relative selection size that default caching strategies at the database tier de-prioritize this important workload. As in the previous example, a workload isolation strategy was recommended for the end user workload.

Another interesting observation concerns the impact of non-search workloads (GET and PUT) on the selection rates for search workloads. In this example, an increase in selection latencies correlates with an increase in the update (GET and PUT harness) traffic coming into the system. The update rate is tracked by the time series bubble plots beneath the main selection bubble plot. This illustrates why we need to be concerned with all workloads in the system, including those not directly responsible for the latencies of concern.
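
The relationship between update traffic and selection latency was read off the plots, but it can also be quantified. As a rough sketch, assuming hourly update counts and the hourly slope of a search workload are available as two equal-length series, a Pearson correlation coefficient gives a single number for how closely they move together. The `pearson` helper and the sample values below are illustrative and are not measurements from this system.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Synthetic hourly values, made up for illustration only.
update_requests_per_hour = [100, 120, 400, 900, 850, 300]
search_slope_per_hour = [0.8, 0.9, 1.4, 2.6, 2.3, 1.1]

r = pearson(update_requests_per_hour, search_slope_per_hour)
print(f"correlation between update rate and search slope: {r:.2f}")
```

A value near 1.0 would support what the plots suggest, though correlation alone does not establish that the update traffic causes the slower selections.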