After a performance test has been executed it is important to communicate the results in a manner that is understood by the stakeholders. Performance test reports contain statistical response time information that can be misinterpreted by stakeholders due to pre-conceived ideas of what the terms mean, leading to false conclusions and inappropriate decisions. In rare cases, there may be a complete lack of knowledge of statistical terms.

This may mean that there are different reports for different stakeholders giving different levels of statistical information and visualisations that are appropriate to that group, but in all cases, it is appropriate to give an explainer of each term in the appendices of a report.

Most performance test tools have report outputs that can be customised to give different visualisations, helping to make the results understandable. A number of tools can also compare two or more tests in a single report. There may also be access to the raw test results, which can be exported to Excel or to other data analysis or Business Intelligence tools. These can produce dashboards showing trends across different versions of the software or across infrastructure changes, which aids understanding.

In this article, we will explore an example of a typical output of a transaction in a report to explain the various statistical terms. In a subsequent article in this series, we will give examples of different visualisations and reports that will help explain the performance of an application or system.

Typical user response times of an application show a normal distribution (bell curve) as shown.

Normal distributions are weighted towards the centre and as can be seen, both Txn A and Txn B are weighted toward a response time of around 6.5, even though they are very different looking graphs.

The following is an example of what a summary output would look like in a report for Txn A and Txn B, with an explanation of what each of the terms means. Note that all response time statistics are based on passed transactions only.

It is important to address the Average, Standard Deviation and Percentile in detail.

The average is the most used statistic in performance testing and, on its own, can misrepresent the performance of an application. The average is the sum of the values in the dataset divided by the number of values in the dataset. The average value should not be confused with the median value of a dataset; the median value is the middle value in the sorted dataset. For example, if a dataset of 101 values is sorted from the lowest to the highest, the median value is the 51st value in the set. For a dataset with an even number of values, the median value is the average of the two middle values.
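To illustrate the difference between the average and the median, here is a minimal sketch using Python's standard `statistics` module. The response times are hypothetical values chosen for illustration and are not taken from Txn A or Txn B.

```python
from statistics import mean, median

# Hypothetical response times in seconds; one slow request skews the average.
response_times = [1.0, 2.0, 2.0, 3.0, 12.0]

average = mean(response_times)   # sum of the values / number of values
middle = median(response_times)  # middle value of the sorted dataset

print(f"average={average:.1f}s median={middle:.1f}s")
# prints: average=4.0s median=2.0s

# With an even number of values, the median is the average of the two middle values.
even_median = median([1.0, 2.0, 3.0, 4.0])
print(even_median)  # prints: 2.5
```

In this sketch a single slow response pulls the average up to twice the median, which is why the average should not be read in isolation.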

In our example, Txn A and Txn B have an average response time of around 6.5 even though the normal distribution graph shows a very different performance and so we should not rely on this figure alone. The average values should be considered in conjunction with the min and max values, the standard deviation, and the size of the dataset (the number of passes).

The standard deviation of a dataset gives a value showing how closely the data points are clustered around the average of that dataset. In other words, how much does the data vary from the average? It is, therefore, better to have as low a standard deviation as possible. As a rule of thumb, a dataset that has a standard deviation greater than *half* the average should be treated with caution as it may not be displaying a normal distribution.

The smaller the standard deviation, the taller and narrower the normal distribution curve, as can be seen in our example. Txn A has a standard deviation of 1.83 and Txn B has a value of 2.6, meaning that Txn A values cluster closer to its average than Txn B values do. However, both values are less than half of their average values, so the results can be relied upon.
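The rule of thumb above can be sketched in a few lines of Python. This is an illustration only: the function name `spread_check` and the sample data are hypothetical, and we assume the population standard deviation (`statistics.pstdev`); a given test tool may report the sample standard deviation instead.

```python
from statistics import mean, pstdev

def spread_check(samples):
    """Return (average, standard deviation, caution flag) for response times.

    The flag is True when the standard deviation is greater than half the
    average, the rule-of-thumb threshold for treating results with caution.
    """
    avg = mean(samples)
    std = pstdev(samples)  # population standard deviation (an assumption)
    return avg, std, std > avg / 2

# Hypothetical datasets: one tightly clustered, one widely spread.
tight = [6.0, 6.5, 7.0, 6.5, 6.5]
wide = [1.0, 2.0, 2.0, 3.0, 12.0]

for name, data in (("tight", tight), ("wide", wide)):
    avg, std, caution = spread_check(data)
    print(f"{name}: average={avg:.2f} std={std:.2f} caution={caution}")
```

The tightly clustered dataset passes the check, while the widely spread one is flagged for caution.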

Without a doubt, the percentile is the term people have the most difficulty understanding, yet it is a straightforward concept once mastered. The best way to think of it is to arrange the dataset from the lowest value to the highest value; the 95th percentile is then the value 95% of the way along the data points. In other words, 95% of the data points will be at this value or below. Other common percentiles used in reporting include the 90th and 99th, depending on the requirements for the system or application.
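The description above corresponds to what is often called the nearest-rank method, sketched below. Note that this is one of several percentile definitions; performance test tools may interpolate between data points and so report slightly different values. The function name `percentile` and the sample data are hypothetical.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest value in the dataset such that
    at least pct percent of the data points are at this value or below."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be greater than 0 and at most 100")
    ordered = sorted(samples)                     # lowest value to highest
    rank = math.ceil(pct * len(ordered) / 100)    # 1-based position in the sorted data
    return ordered[rank - 1]

# Hypothetical dataset: 100 response times of 1, 2, ..., 100 (arbitrary units).
data = list(range(1, 101))
print(percentile(data, 90))  # prints: 90
print(percentile(data, 95))  # prints: 95
print(percentile(data, 99))  # prints: 99
```

With 100 sorted values, the 95th percentile is simply the 95th value: 95 of the 100 data points are at this value or below.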

Any data points that fall above the 99th percentile are normally considered outliers and can be dismissed, but this should be done with caution. It is advisable to repeat tests to see if the same outliers persist; if they do, they should be investigated as they could indicate a genuine issue. This presumes that the dataset is statistically significant (is your dataset large enough?) and that it follows a normal distribution.
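As a rough sketch of how such outliers might be listed for review rather than silently discarded, the following Python function returns the 99th percentile threshold together with the values above it. It assumes a nearest-rank percentile; the function name `values_above_p99` and the data are hypothetical.

```python
import math

def values_above_p99(samples):
    """Return (threshold, outliers): the nearest-rank 99th percentile and
    the sorted list of values that lie above it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(99 * len(ordered) / 100)  # 1-based nearest rank
    threshold = ordered[rank - 1]
    return threshold, [value for value in ordered if value > threshold]

# Hypothetical run: 198 responses at 5.0 seconds and two much slower ones.
run = [5.0] * 198 + [30.0, 45.0]
threshold, outliers = values_above_p99(run)
print(threshold, outliers)  # prints: 5.0 [30.0, 45.0]
```

Running the same check on each repeated test shows whether similar outliers keep appearing. Note that with fewer than 100 data points the nearest-rank 99th percentile is the maximum value, so nothing can fall above it, which is one concrete reason the size of the dataset matters.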

In conclusion, it is very important to know what the statistical terms in a performance test report mean to fully appreciate the behaviour of your application or system. We would advise that terms used in a report are described in the report appendices and that the graphs and data shown in the reports are explained and not just left to the stakeholder to figure out. Also, reports should be tailored to different stakeholder groups; a technical person may appreciate lots of data and statistics whereas other groups may not need that level of information and a few good visualisations would suffice.

To see how SQA Consulting may assist your company in performance testing your applications, please contact us.