Statistical Inferences from Performance Data of Application
Performance tests are crucial for any web or customer application. They reveal how a system behaves and responds in various situations. A system may run well with 1,000 concurrent users, but how would it behave when 100,000 users are logged on? Good performance means achieving high speed, scalability, and stability. Different tools report different measurements for key performance indicators (KPIs) such as response time, CPU time, and database (DB) time, for single-user or multi-user tests.
However, merely collecting these performance numbers is not enough, because they only convey whether the current value indicates good or bad performance. Additional statistical inferences need to be drawn from these indicative numbers. This paper discusses how to draw statistical inferences, build relationships between different key performance indicators of a web application, chart the probabilities of future outcomes, conduct hypothesis tests, and perform regression analysis.
In today’s business world, whether a website is a personal project, a business venture, or otherwise, it is critical that the web application is tested for responsiveness and stability, i.e., how well it can handle a specified or expected workload. Software performance testing measures the efficiency of any vital application, which helps in understanding whether the application behaves well or badly.
Non-performant (i.e., badly performing) applications generally do not deliver their intended benefit to an organization. They add to the net cost in time and money, damage the organization's reputation among application users, and are therefore considered unreliable or a loss to the organization. If a software application does not deliver its intended service in a performant and highly reliable manner, this has a detrimental effect on everyone involved with the product, from designers, architects, coders, and testers to end users.
Despite its importance, performance testing continues to be neglected in contrast to functional testing, which is well understood and has a high maturity level in most business organizations. It is hard to understand why companies continue to overlook the significance of performance testing, frequently creating and deploying applications with negligible or no understanding of their performance, only to be beleaguered by performance and scalability issues after release. Over the past several years, however, this mindset has changed, and organizations have started to check the performance behavior of their applications.
Problem statement: Good or Bad performance
So how is the good or bad performance of a website or application judged? Ultimately it is all about perceived response. Some crucial applications, for example banking applications, are expected to deliver output within 1 second or less, while for others, for example shopping sites or Facebook, it is still acceptable to take up to a minute to respond to a user’s request. In the former case, a delay of a few seconds might irritate a user, whereas in the latter case the user does not care much about a few extra seconds or minutes. Put precisely, a well-performing application is one that lets the end user carry out a given task without undue perceived delay or irritation. The important fact is that performance really is in the eye of the beholder.
From an end user's perspective this sounds simple. However, organizations that are accountable for the performance of their applications need to translate it into quantifiable output. Different organizations might invent different ways of predicting the response, but standard key performance indicators (KPIs) such as end-to-end response time, CPU time, database time, and memory should be considered here. These KPIs are measurable and assessable, and comparing them against standard threshold values tells whether a system or application under test is behaving well or badly. A gauge of these performance indicators tells how well (or not) the application is providing service to the end user, while efficiency-oriented indicators such as throughput and capacity measure how well (or not) an application makes use of the hosting infrastructure.
Yet these indicative numbers fail to convey the essence of the data to the end user. The inferences needed for important decisions, such as whether the software can be released to customers, whether it is a competitive product in the market, or how well the application would perform in the near future with increasing users and data, are still lacking and cannot be interpreted easily. Hence statistical inferences need to be drawn from the key performance indicators in the test data to help the organization make sound business decisions.
In this paper, I present several techniques: creating a box plot for the five-number summary, defining a frequency distribution for performance test data, identifying a probability distribution for future data sets, building hypothesis tests, and conducting regression analysis. Together these help to arrive at statistical inferences from performance KPI numbers.
Creation of box plot for five-number summary of performance test data
A box plot is a graphical summary of test data based on a five-number summary: minimum, first quartile, median, third quartile, and maximum. Quartiles divide the data into four quarters. The first quartile (Q1) is the middle number between the smallest number and the median of the data set. The second quartile (Q2) is the median of the data. The third quartile (Q3) is the middle value between the median and the highest value of the data set. The first quarter represents the lowest 25% of the data, the second quarter the next 25% up to the median, the third quarter the 25% just above the median, and the last quarter the highest 25% up to the maximum value in the test data.
A box plot is effective for identifying outliers in the test data. It also allows different categories of data to be compared for easier, more effective decision-making. A key step in developing a box plot is computing the interquartile range, IQR = Q3 – Q1.
Here, I have taken an organization's test data for one product area and recorded the values of key performance indicators such as end-to-end response time, CPU time, and database time.
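The five-number summary and the interquartile range can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used for the figures in this paper: the sample values and the helper name `five_number_summary` are hypothetical, and the "inclusive" quartile method is an assumption chosen because, to my understanding, it follows the same interpolation as Excel's QUARTILE.INC.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    # Assumption: the "inclusive" method, which interpolates over the
    # observed data range; other quartile conventions give slightly
    # different cut points.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)

# Hypothetical response times in seconds, for illustration only.
response_times = [0.5, 0.9, 1.2, 1.5, 2.0, 2.7, 3.1, 4.0, 12.0]

minimum, q1, median, q3, maximum = five_number_summary(response_times)
iqr = q3 - q1  # interquartile range, IQR = Q3 - Q1

print("min:", minimum, "Q1:", q1, "median:", median, "Q3:", q3, "max:", maximum)
print("IQR:", round(iqr, 4))
```

Different quartile conventions can shift Q1 and Q3 slightly on small samples, so the method should be kept the same when comparing results across tools.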
Computing the five-number summary for the performance data results in:
Then calculating the Interquartile range, lower limit and upper limit:
Next is to draw the box plot diagram using the above values:
- The complete set of test data can be represented by the five numbers of the summary
- The median end-to-end response time is 1.491 seconds, and 50% of the values lie between 0.9005 seconds (first quartile) and 2.721 seconds (third quartile), which indicates that half of the time the application's performance falls within these thresholds
- The lower and upper limits are -1.83 and 5.45 seconds respectively, which means that for an end user the response time can be as close to 0 seconds as an ideal environment allows (ignoring negative values for time) or as bad as 5.45 seconds
- The maximum value of 48.65 seconds is an outlier, since it lies beyond the upper limit. Similar values such as 14, 16, and 25 seconds are also outliers for the given application performance and should be removed from the test data so that they do not distort the analysis
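The reported limits of -1.83 and 5.45 are consistent with the common convention of placing the limits 1.5 × IQR below Q1 and above Q3. The multiplier 1.5 is my assumption, since the text does not state it, and the function name `outlier_limits` is hypothetical. A short sketch that applies this rule to the reported quartiles:

```python
def outlier_limits(q1, q3, k=1.5):
    """Return (lower, upper) limits placed k * IQR outside the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles reported for end-to-end response time (seconds).
lower, upper = outlier_limits(0.9005, 2.721)
print(round(lower, 2), round(upper, 2))

# Values the text names as outliers; each lies above the upper limit.
reported = [14, 16, 25, 48.65]
outliers = [v for v in reported if v < lower or v > upper]
print(outliers)
```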
Defining frequency distribution for performance test data
A frequency distribution is a tabular summary of test data showing the number (frequency) of observations in each of several non-overlapping categories or classes. It is intended to show how many instances there are of each value of a variable.
Using a frequency distribution, an organization can determine the typical performance KPI value (response time or CPU time) under which most applications fall.
We first compute the frequency distribution table and then chart it, as described in the steps below:
- Find the range of the data: The range is the difference between the largest and the smallest values 
- Determine the number of classes, i.e., the groups into which the data are to be placed. H.A. Sturges gave a formula for the approximate number of classes, which usually varies from 5 to 20
K = 1 + 3.322 log N
where K = number of classes and log N = base-10 logarithm of the total number of observations.
For example, if the total number of observations is 50, the number of classes would be K = 1 + 3.322 × log 50 ≈ 6.64, which is rounded to 7.
- Calculate the approximate class interval size: The class interval size, denoted by h, is obtained by dividing the range of the data by the number of classes
h = Range / Number of classes
- Decide the initial class value 
- Compute the remaining class limits: Once the lower boundary of the lowest class has been decided, add the class interval size to it to obtain the upper boundary of that class. The remaining lower and upper class limits are determined by adding the class interval size repeatedly until the largest value in the data falls within a class
- Distribute the data into respective classes: All the observations are marked into respective classes 
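The steps above can be sketched in Python as follows. This is an illustrative sketch rather than the Excel procedure used here: the function names and sample data are hypothetical, the class count is rounded up (an assumption, since the formula does not prescribe a rounding rule), and the smallest observation is used as the initial class value.

```python
import math

def sturges_class_count(n):
    """Number of classes from K = 1 + 3.322 * log10(N), rounded up (assumption)."""
    return math.ceil(1 + 3.322 * math.log10(n))

def frequency_distribution(values):
    """Return a list of (lower_limit, upper_limit, count) for each class."""
    k = sturges_class_count(len(values))
    low, high = min(values), max(values)
    if high == low:
        # All observations are identical, so a single class holds everything.
        return [(low, high, len(values))]
    width = (high - low) / k  # class interval size h = range / number of classes
    counts = [0] * k
    for v in values:
        # The largest value is placed in the last class instead of opening a new one.
        index = min(int((v - low) / width), k - 1)
        counts[index] += 1
    return [(low + i * width, low + (i + 1) * width, counts[i]) for i in range(k)]

# Hypothetical response times in seconds, for illustration only.
sample = [0.4, 0.9, 1.1, 1.5, 1.6, 2.2, 2.7, 3.0, 4.8, 11.0]
table = frequency_distribution(sample)
for lower_limit, upper_limit, count in table:
    print(f"{lower_limit:6.2f} to {upper_limit:6.2f}: {count}")
```

With skewed data such as response times, most observations land in the first one or two classes and the long tail appears as sparsely populated classes.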
Here, I have used Excel functions and the steps above to calculate the frequency distribution table and chart shown below:
|Row Labels||Count of End to End Response Time [s]|
- The majority of end-to-end response time values of the application lie below 5 seconds, i.e., almost 953 of the values recorded for the product area fall under the 5-second response time threshold
- Only 1 or 2 values at most are peaks such as 11 seconds or 48 seconds. These few values are the outliers in the performance of the application
- So, considering the two points above, the application typically responds within the 5-second threshold for any user interaction step
Identification of probability distribution for future data-set
A probability distribution represents the likelihood of an event occurring in the near future.
In contrast to a frequency distribution, which measures the spread of the current data set, the probability distribution table and graph convey information about future events.
In a business scenario, a probability distribution lets us compute the chance that an application, for instance, performs within a defined threshold limit. It can also be used to generate scenario analyses. A scenario analysis uses probability distributions to construct several distinct, mutually exclusive possibilities for the outcome of a future event. For example, there can be three identified scenarios for a business: worst-case, probable-case, and best-case. The worst-case scenario would contain a value from the lower end of the probability distribution, the probable-case scenario a value from the middle of the distribution, and the best-case scenario a value from the upper end of the distribution.
Risk evaluation is another beneficial outcome of a probability distribution, helping to understand and mitigate risk for any business.
Assuming the data is normally distributed, I calculated the probability chart for the performance test data set as below, using the Excel function NORM.DIST:
|Response time threshold (x)||Probability function f(x)||Probability percent|
- As per the current readings of the product area, the application will have a response time of 3 seconds or less in almost 58% of probable future cases. This differs from the frequency distribution chart, where 94% of the data have a response time of 3 seconds or less
- Here we can ask questions such as: what is the probability that the end-to-end response time is within 4 seconds or 5 seconds? From the chart above, the probability of meeting a 5-second criterion is 79%. This helps in drawing conclusions for a business case or deciding on the future release of an application
- The probability that the response time falls between the 5-second and 10-second thresholds is 0.987931 – 0.785915 = 0.2020, i.e., 20%. Similar extrapolations can be drawn further.
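The same kind of calculation can be sketched with Python's standard library, where `NormalDist.cdf` plays the role of NORM.DIST with the cumulative flag set to TRUE. The sample values and helper names below are hypothetical, and the sketch rests on the same assumption of normally distributed data, so it will not reproduce the percentages quoted above.

```python
from statistics import NormalDist

# Hypothetical end-to-end response times in seconds, for illustration only.
response_times = [0.6, 0.9, 1.1, 1.4, 1.5, 1.9, 2.4, 2.7, 3.3, 4.8, 6.2]

# Fit a normal distribution from the sample mean and standard deviation.
model = NormalDist.from_samples(response_times)

def prob_at_most(threshold):
    """Cumulative probability P(response time <= threshold)."""
    return model.cdf(threshold)

def prob_between(low, high):
    """P(low < response time <= high), the difference of two cumulative values."""
    return model.cdf(high) - model.cdf(low)

for threshold in (3, 5, 10):
    print(f"P(response time <= {threshold} s) = {prob_at_most(threshold):.3f}")
print(f"P(5 s < response time <= 10 s) = {prob_between(5, 10):.3f}")
```

Response times are usually right-skewed, so the normality assumption is worth checking (for example with a probability plot) before relying on such figures.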
Building hypothesis test
In hypothesis testing we begin by making a tentative assumption about a population parameter. This tentative assumption is called the null hypothesis and is denoted by H0. We then define another hypothesis, called the alternative hypothesis, which is the opposite of the statement in the null hypothesis. The alternative hypothesis is denoted by Ha.
The hypothesis testing procedure uses data from a sample to test the two competing statements indicated by H0 and Ha. The p-value is then calculated from the sample mean, the hypothesized mean, and the standard error, and is compared against alpha, the significance level, using the rule below:
|p-value< alpha||Reject Null hypothesis|
|p-value>alpha||Fail to reject Null hypothesis|
From the current data, I took a sample from the population and assumed that the database CPU time would fall within a threshold of 0.11 seconds for the application. Writing this statement as the null hypothesis and the opposite statement as the alternative hypothesis:
H0: Database CPU time ≤ 0.11 seconds
Ha: Database CPU time > 0.11 seconds
|Null Hypothesis: H0||0.11|
|Alternative hypothesis: H1||Greater than 0.11 seconds|
|Test statistics (t)||-0.8411710|
|Fail to reject H0 at||0.11 seconds||Not significant|
So, as per the rule, I fail to reject the null hypothesis and conclude that the data do not contradict the claim that database time falls within the 0.11-second threshold. Similarly, I made a different assumption: that the database time would fall within a threshold of 0.08 seconds. This time the null hypothesis is rejected based on the computations below, because the p-value came out smaller than alpha. The result was significant enough to reject the claim.
|Null Hypothesis: H0||0.08|
|Alternative hypothesis: H1||> 0.08 seconds|
|Test statistics (t)||2.1051953|
|Reject H0 at||0.08 seconds||Significant|
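The upper-tail test described above can be sketched as follows. Several things here are assumptions rather than details taken from this paper: the sample data are hypothetical, alpha is set to 0.05 because the text does not state the significance level used, and the p-value comes from a normal approximation to the t distribution, which is reasonable only for larger samples. An exact t-based p-value needs a statistics package.

```python
import math
from statistics import NormalDist, mean, stdev

def upper_tail_test(sample, hypothesized_mean, alpha=0.05):
    """Test H0: mean <= hypothesized_mean against Ha: mean > hypothesized_mean."""
    standard_error = stdev(sample) / math.sqrt(len(sample))
    t_stat = (mean(sample) - hypothesized_mean) / standard_error
    # Assumption: normal approximation to the t distribution (large sample).
    p_value = 1 - NormalDist().cdf(t_stat)
    decision = "Reject H0" if p_value < alpha else "Fail to reject H0"
    return t_stat, p_value, decision

# Hypothetical database times in seconds (40 observations, mean about 0.10).
db_times = [0.06, 0.08, 0.10, 0.12, 0.14] * 8

for threshold in (0.11, 0.08):
    t_stat, p_value, decision = upper_tail_test(db_times, threshold)
    print(f"threshold {threshold}: t = {t_stat:.3f}, p = {p_value:.4f}, {decision}")
```

With this sample, the test against 0.11 seconds yields a negative test statistic and a large p-value, while the test against 0.08 seconds yields a large positive statistic and a small p-value, mirroring the pattern of the two cases reported here.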
Conducting regression analysis
Organizational decisions are often based on the relationship between two or more variables. For example, after establishing the relationship between advertising expenditure and sales, a marketing manager might attempt to predict sales. Sometimes this relationship is built on past experience, and sometimes solely on the intuition of management.
However, a more rigorous method called regression analysis can be employed to develop an equation showing how two or more variables are related. In regression terminology, the variable being predicted is called the dependent variable, and the variable or variables used to predict its value are called independent variables. Depending on the number of independent variables, the analysis is termed simple linear regression (one independent variable) or multiple regression (two or more independent variables).
Here, I applied linear regression to the variables end-to-end response time and database time and estimated the relationship between them using Excel data analysis functions:
|Adjusted R Square||0.756536|
|Coefficients||Standard Error||t Stat||P-value||Lower 95%||Upper 95%||Lower 95.0%||Upper 95.0%|
|X Variable 1||1.042476||0.077438||13.46212||2.38E-19||0.88741||1.197543||0.88741||1.197543|
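From the analysis table above, the relationship between the two variables is expressed as:
Y = 1.0425 x + 1.4733
i.e., in the form Y = mx + c, where Y is the end-to-end response time, x is the database time, m is the slope, and c is the intercept. For example, a database time of 2 seconds gives a predicted response time of 1.0425 × 2 + 1.4733 = 3.5583 seconds.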
The significance of these coefficients can be judged by their p-values in the regression output. The p-values of 9.17E-09 and 2.38E-19 for the intercept and the x-variable coefficient respectively are extremely low, so both coefficients can be interpreted as statistically significant.
Multiple R is the correlation coefficient. Its value was 87%, which indicates that the dependent variable, response time, is strongly related to the independent variable, database time.
R Square is the coefficient of determination, which states that 76% of the variance in response time can be attributed to database time. It is a statistical measure of how close the data are to the fitted regression line.
The ANOVA table highlights some key figures:
Degrees of freedom, which are 1 for Regression (k, the number of independent variables used), 57 for Residual (n-k-1), and 58 for Total (n-1),
Sum of squares for the regression, mean sum of squares, and the significance value of the F statistic.
The significance value of the F statistic, which is very low in this case, means the regression model fits the data well.
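The quantities discussed above (slope, intercept, Multiple R, R Square, degrees of freedom, and the F statistic) can be computed directly from their definitions. The sketch below uses hypothetical data and a hypothetical function name, and it does not reproduce the Excel output of this paper. It also stops at the F statistic, since the significance value requires the F distribution, which is not in the Python standard library.

```python
import math

def simple_linear_regression(x, y):
    """Least-squares fit of y = slope * x + intercept, with summary statistics."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    ss_residual = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_regression = ss_total - ss_residual
    r_squared = ss_regression / ss_total
    df_regression, df_residual = 1, n - 2  # k and n - k - 1 with k = 1
    f_stat = (ss_regression / df_regression) / (ss_residual / df_residual)
    return {
        "slope": slope,
        "intercept": intercept,
        "multiple_r": math.sqrt(r_squared),
        "r_squared": r_squared,
        "df_regression": df_regression,
        "df_residual": df_residual,
        "df_total": n - 1,
        "f_stat": f_stat,
    }

# Hypothetical database times (x) and end-to-end response times (y), in seconds.
db_time = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
response_time = [2.1, 2.4, 3.2, 3.4, 4.2, 4.5]

fit = simple_linear_regression(db_time, response_time)
for name, value in fit.items():
    print(f"{name}: {value:.4f}" if isinstance(value, float) else f"{name}: {value}")
```

A large F statistic corresponds to a small significance value, which is the figure Excel reports as Significance F.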
The residual plot and probability plot for response time and database time are plotted below:
Anderson, Sweeney, Williams, Camm, Cochran. Statistics for Business and Economics.