data prep and plotting with VBA

Over the years I have picked up on VB scripting within Excel as a means for sharing data analysis and plotting capabilities with non-technical field personnel. This weird intersection of high scale performance data analytics and Microsoft Excel morphed out of a need to share analytics methods with the field on a platform that everyone had. In addition, later versions of Excel (2007 then 2010) added expanded color palates in addition to the existing plotting controls which allowed for hyper control of plot layouts. All of these features led me to develop some Excel skills over the years that have now been difficult to re-invent in other environments. I figured I’d take the time to share a few useful methods here for data set manipulation and plotting automation. Enjoy! Disclaimer – in no way do I claim to be a VBA expert or adhere to VB best practices. My goal with these scripts was to end up with the data I wanted, in the proper format, plotted, while providing easy access to field personnel to use these methods. I could have syntax errors, or do things that look “weird”, but they achieved my end goal.

July 2014 – please note I’m actively building out this page and child links. Check back for updates on links that may not be active yet.

 

 

Why go through all this trouble?

Well this is going to be a long dry page, so I figured I’d start it with a few images of plots that we are looking to generate from a wide array of really poorly formatted data sources from a time series performance profiling perspective. Though this plot is not related to server performance tuning, it shows ho you can extract data from a variety of sources (in this case, weather station data) to create long running time series plots that show trends you would not normally see my reviewing individual data points:

saolourenco837360_tempRH

Similarly, data can be pulled from millions of transactions and plotted to provide a detailed view of transaction profiles that provide an indication of issues that may occur on a smaller scale. In this case, viewing all available data points extracted from several gigabytes of text based files provides insight into the timing of outages in this particular application:

troubleshootingTput-scatterTimeSeries

The previous example shows how one data source can be valuable in identifying application issues. Adding other data sources can provide insight across components of a supporting infrastructure, helping pinpoint the source of outages or slow downs. Here’s an overlay during a clustered database test:

timeSeriesFaultInjection

This should give you an idea of what I’m after, but getting to these plots requires a lot of data manipulation. This page walks you through a few scripts that I use to pull data from a variety of server logs to create complex data visualizations.

 

Manipulating performance data sources

This section provides a jumping off point to posts that contain example VBA scripts that allow you to manipulate raw performance data sources from within Excel. Manipulation of these data sources is the first step in performance profiling; we need to get the data in the right time series format so we can overlay data visually to build a representation of what is going at multiple layers in the infrastructure stack. The second half of this page walks you through some of the plotting methods I use, but first up is getting the data we need to conduct an analysis.

 

Using VBA dictionaries and arrays for generating time series summaries

A performance engineer is often faced with large custom log files from applications that may be underperforming. Making sense of these logs can be difficult if you are parsing through them manually. Analysis scripts can help here if we can orient the data around a proper time series view. I prefer using a one minute interval as a standard approach for analyzing data sources for OLTP style applications that support moderate throughput rates. This allows me to aggregate data on a reasonable scale with a good sample rate (a few thousand to ten’s of thousands of transactions per minute) while using standard methods for manipulating data (e.g. hour(), minute() functions). This also provides for 1440 samples per day, which generates fairly precise plots when we are considering long running time series analytics that may span multiple days or even weeks. I’ve put together a post that uses Microsoft Excel’s VBA dictionary and array capabilities to cache data within the script on these 1 minute increments that stores the specific performance statistics within a dictionary as data for a transaction is compiled. This temporary data cached in the dictionary is then aggregated into an array representing the cumulative time series view of the data that will eventually be written out to a CSV file. This approach can be used for a variety of parsing applications, including compiling unique ID’s encountered per minute, latency and throughput counts per minute for distinct transaction types, or counts of specific events encountered per interval.

Click here for the full post on extracting data with Excel VBA from text based logs using dictionaries and arrays as a caching mechanism for output to a clean CSV file.

 

Using VBA to conduct Oracle Statspack time series analysis

Oracle Statspack is a text based reporting infrastructure that has a fixed format structure that summarizes many performance characteristics of a specific instance on a regular cadence. Oracle calls an interval a “snap”. Most Oracle DBAs use this reporting infrastructure to generate a report over a time period of interest, often many hours. In working with Oracle DBAs and application owners, I encourage them to generate more frequent Statspack reports and analyze system changes over time on key metrics critical to the application using the database. This allows you to develop an Oracle Statspack time series view that presents you with a data rich profile of the applications database usage patterns over time as workloads shift. This can be a critical tool for use in debugging specific outages. This page delves into a deeper discussion on the Oracle Statspack format and how I used Excel VBA to extract that data from a sequence of Statspack files (in this case 24×1 hour reports per Oracle RAC node across a 4 node cluster for a total of 96 reports per day). This script targeted metrics key to a specific application, but the methods can be used to extract other data from different tables in the Oracle Statspack reports if your application is sensitive to other database conditions.

Click here for the full post on extracting data with VBA from Oracle Statspack.

 

Using VBA to conduct Oracle AWR time series analysis

The Oracle AWR (Automatic Workload Repository) report is similar to Oracle Statspack in that it is a performance report generated on “snap” intervals within the Oracle data collection repository, but this report is HTML based and easier to navigate when conducting diagnostic data on a single report. There are hyperlinks within the document that take you from a specific workload profile for a particular query to the full text of the query itself. This use of links and nicely formatted tables makes the Oracle AWR report a nice improvement over Oracle Statspack output. That said, like most other Oracle components, there’s a licensing charge associated with access to AWR reports, while Statspack is included in the base license. I have many customers that either opt to not license the AWR infrastructure, or that choose to stick with the older text based Statspack format. I’ve outlined a method for extracting data from Statspack above, while this section focuses on a different approach for compiling similar statistics within Excel VBA from a sequence of HTML pages.

Click here for the full post on extracting data with VBA from Oracle AWR reports

 

IOSTAT VBA based time series extract

IOSTAT is a useful debugging utility on Linux and Unix systems that provides detailed IO performance data that is often critical in identifying IO bottlenecks. I routinely use this utility as a data source for conducting high scale OLTP workload analytics / stack modeling. I use the script detailed in the following link to extract useful data from IOSTAT collecting at a 15 second increment. This script parses the text file specified in a worksheet as a target file, then extracts the data from the multi-row table per occurrence and consolidates it into a single row time series table.

Click here for the full post on extracting data with Excel VBA from IOSTAT output

 

NMON VBA based time series extract

NMON is an AIX and Linux utility that traps a variety of system level metrics that are often quite useful in time series performance analysis or enterprise performance modeling. I routinely use output from this utility as a data source in my time series stack analysis either from lab data sources or from customer sites. The following script is an Excel VBA based macro I use for pulling targeted data from NMON output into Excel for plotting.

Click here for the full post on extracting data with Excel VBA from NMON output

 

SVMON based time series extract

SVMON is a useful tool that I draw from periodically when customers are having memory performance issues and we need to extract a time series view of a single PID over time to validate memory behavior. I wrote this simple script to pull a target PID from concatenated SVMON output using Excel VBA. The final product is a single table containing one row of memory usage data for each concatenated entry within the source file.

Click here for the full post on extracting data with Excel VBA from SVMON output

 

DB2 Snapshot time series extract

Getting repetitive yet? This is yet another script I use to pull data of interest from DB2. “Snapshots” are used at a set interval for trapping DB2 performance and diagnostic data on a specific cadence. This script extracts the data I typically use for time series analysis from a directory containing multiple snapshots and compiles them into a single time series table that can be used for plotting and data source overlay modeling.

Click here for the full post on extracting data with Excel VBA from DB2 snapshots

 

DB2diag log parsing – now we’re talking obscure scripts

So I’m probably the only person in the world that cares about parsing DB2diag logs for buffer pool resize events so I can use that data to include in time series models of enterprise transactional systems models. I use this script to pull buffer pool resizing events from the text files that I place in a single table which can then be used to track buffer pool change events over time, and hopefully correlate to any database performance issues.

Click here for the full post on extracting data with Excel VBA from DB2diag logs

 

Automating plot generation

 

Creating and arranging small multiples with VBA

 

Using Excel for dynamic data analytics