The Transaction In The Lake (Part 2)

JimSpath · ‎12-02-2008

See: Part 1 for the preliminaries

We're going to take a short detour in this transaction investigation.

Originally, we started looking for a reason for the increase in average response time for a set of transactions, then focused on one of them. As part of that focus, we found that one user id accounted for a large portion of the workload, and with a detailed SQL trace, found some suspicious results. While we're waiting for a DBA to verify our results (and also waiting to post a review of a crusty but useful SAP note), I wanted to delve into statistics a bit.

It's not that I really think statistics are great, but I got an email from a blog reader who isn't sure he or she is allowed to post due to corporate policies. So I'll paraphrase their questions, comments and concerns.

Question 1

How many of the discrete data records that make up an average are below 100 milliseconds?

To determine the answer empirically, I pulled data from ST03 "Business Transaction Analysis" for a very frequently executed transaction. The three charts below represent 30 minutes of wall time each, with around 3,000 transactions executed in each period. The answer to question 1 for these data is: "15 or 16 percent."
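As a rough sketch, this is how that share could be computed once the per-step response times have been exported from ST03. The list of times is made-up illustration data, not the values behind the charts, and the function name share_below is hypothetical.

```python
def share_below(times_ms, threshold_ms=100):
    """Return the fraction of response times strictly below threshold_ms."""
    if not times_ms:
        raise ValueError("no response times supplied")
    return sum(1 for t in times_ms if t < threshold_ms) / len(times_ms)


# Hypothetical response times in milliseconds (illustration only).
sample_ms = [40, 85, 120, 480, 510, 505, 530, 950, 1020, 2900]
print(f"{share_below(sample_ms):.0%} of steps finished in under 100 ms")
```

Running the same function over each 30-minute export would show whether the share stays stable from one period to the next.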

What does this mean? To me, these might be negligible steps with little business processing occurring, such as screen transitions, or possibly data lookups from within application server memory. They are represented by the leftmost of the 3 spikes in each graph. They don't follow the 80/20 rule in this case, nor do they account for a large percentage of the values behind the average.

Question 2

Is "average" response time meaningful for non-normal data?

There are several implications here. The first is that the base dialog steps don't have a normal distribution. Right. They don't. I would have expected to see a curve that skews to the right (very long, based on user not supplying full keys, optimizer picking wrong index, too much data, etc.), but I am not sure I expected to see 3 peaks.

The second implication is that small changes in the average are not indicative of small user experience changes. I'd agree with that. On the other hand, large changes in the average would tend to indicate something worth looking into.

The third implication is that improving the average is not worth the effort, because users might not benefit noticeably (I'm probably stretching here). I think as long as the tuning or performance improvements are documented and able to be reversed, changes that reduce system load are worthwhile.

Question 3

Isn't the time of day a transaction is executed important?

See below (chart 4).

Yes, it's important, particularly if you have discrete online and batch windows, and are trying to keep online users happy. I haven't pulled much time-of-day data from ST03 (or ST03N) but this is probably worth looking into. A noticeable drift during a 1 or 2 hour period may help isolate issues more quickly.

On the other hand, adding more metrics that need to be reviewed might add to the background hum of data.
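If the time-of-day data were pulled, a small summary per hour might be enough to spot drift without adding much to that hum. This is only a sketch: the (hour, response time) pairs are invented, and summarize_by_hour is a hypothetical helper, not something ST03 provides.

```python
from collections import defaultdict
from statistics import mean, median


def summarize_by_hour(records):
    """Group (hour, response_ms) pairs into {hour: (count, mean, median)}."""
    by_hour = defaultdict(list)
    for hour, response_ms in records:
        by_hour[hour].append(response_ms)
    return {
        hour: (len(vals), mean(vals), median(vals))
        for hour, vals in sorted(by_hour.items())
    }


# Hypothetical (hour of day, response time in ms) pairs, illustration only.
records = [(9, 480), (9, 520), (9, 2100), (14, 450), (14, 510), (22, 900), (22, 1100)]
for hour, (count, avg, med) in summarize_by_hour(records).items():
    print(f"{hour:02d}:00  n={count}  mean={avg:.0f} ms  median={med:.0f} ms")
```

Reporting the median next to the mean helps here, since a single long-running step in an hour moves the mean far more than the median.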

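[Charts 1-3: response time distribution for each of the three 30-minute samples]

[Chart 4: response time by time of day]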

By now, I hope the question on your mind is, "why are there 3 peaks to the transaction distribution charts?" We've looked at the left-most values above (question 1), so what are the others? The distribution is skewed because we can't have transactions taking less than 0 seconds (until the time machine starts working), while we could have very long transactions (peaks at 10, 20, 30 seconds or more). Even so, I would expect to see a single peak that represents the typical case.

In Full House (see below), Stephen Jay Gould describes a right-skewed distribution as one where the median is to the right of the mode, and the mean is to the right of both the others. For this transaction, we have:

mean	median	mode
672	476	507
739	508	507
768	508	507

OK, two out of three; the median is to the left of the mode in only one sample (the first). The mean is to the right of both in all three.
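The ordering Gould describes (mode, then median, then mean) can be checked directly against the three rows of the table. The values below are copied from the table; the function name is_right_skewed is just a label for this sketch.

```python
# (mean, median, mode) for the three 30-minute samples, copied from the table.
samples = [(672, 476, 507), (739, 508, 507), (768, 508, 507)]


def is_right_skewed(mean, median, mode):
    """True when mode < median < mean, the ordering Gould describes."""
    return mode < median < mean


results = [is_right_skewed(*s) for s in samples]
for number, (sample, ok) in enumerate(zip(samples, results), start=1):
    print(f"sample {number}: mean={sample[0]} median={sample[1]} mode={sample[2]} -> {ok}")
```

The first sample fails the test because its median (476) sits below its mode (507); the other two pass, though only by a margin of 1.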

The data distribution is skewed (obvious from the charts), but why 2 significant peaks rather than 1? My guess is this transaction, during the time I sampled, has a variety of business operations represented, some of which I'll call "easy" and some I'll call "hard". The easy steps flow from top to bottom, left to right, with all data easily found; the hard steps involve data lookups (first time that fills the caches), or other calculations the easy transactions don't need. We could dig into this further to analyze the contributing factors, but the intent is to study what we have, and what we should do with it.
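For anyone who wants to locate peaks like these without eyeballing a chart, one simple approach is to bucket the response times and look for buckets that are taller than both neighbors. This is a sketch under assumptions: the 100 ms bucket width is arbitrary, the data are invented to show three peaks, and histogram_peaks is a hypothetical helper.

```python
from collections import Counter


def histogram_peaks(times_ms, bin_ms=100):
    """Return (bucket start in ms, count) for buckets taller than both neighbors."""
    counts = Counter(t // bin_ms for t in times_ms)
    peaks = []
    for bucket, n in sorted(counts.items()):
        left = counts.get(bucket - 1, 0)
        right = counts.get(bucket + 1, 0)
        if n > left and n > right:
            peaks.append((bucket * bin_ms, n))
    return peaks


# Invented response times in ms, shaped to have three humps (illustration only).
times_ms = (
    [50] * 5 + [150] * 1 + [250] * 2 + [350] * 3 + [450] * 4 + [550] * 9
    + [650] * 3 + [750] * 2 + [850] * 1 + [950] * 2 + [1050] * 6 + [1150] * 2
)
print(histogram_peaks(times_ms))
```

With real data the bucket width matters: too narrow and noise shows up as extra peaks, too wide and neighboring peaks merge.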

What's remarkable is how much each distinct time period resembles the others. This tells me that we can draw conclusions, and expect that an improvement seen over this short a time period (30 minutes) is significant.

The mean varies more than 10% among the 3 samples, but the median and mode vary less. I'd interpret this to say we can look at average values, but small fluctuations (10, 15 or 20%) are not big indicators over the short term.
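The variation can be worked out from the table values. The sketch below measures the gap between the largest and smallest value relative to the smallest, which is an assumption about how to express "varies"; other definitions give somewhat different percentages.

```python
# Values copied from the mean / median / mode table, one entry per sample.
samples = {
    "mean": [672, 739, 768],
    "median": [476, 508, 508],
    "mode": [507, 507, 507],
}


def relative_spread(values):
    """Gap between the largest and smallest value, relative to the smallest."""
    return (max(values) - min(values)) / min(values)


spread = {name: relative_spread(vals) for name, vals in samples.items()}
for name, value in spread.items():
    print(f"{name:>6}: {value:.1%}")
```

On this definition the mean moves by about 14%, the median by about 7%, and the mode not at all.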

The long tail

To the far right in each graph are what might be termed "outliers", which are transaction times far out of the norm. If there are a lot of them, they will have a large influence on the average, not to mention that they represent some unhappy users (real or virtual). Looking into those, finding the contributing factors, and educating users about wild cards are all ways to reduce these tails.

I've cut the charts off at 3 seconds, but you could guess there's more data to the right that's not shown. In each sample, there were 70-100 values beyond the range shown. That's 2 to 3 spikes per minute (on average, so to speak).
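The arithmetic behind the tail is short enough to sketch. It assumes exactly 3,000 steps per 30-minute sample (the figure is only approximate) and uses the 3-second chart cutoff as the smallest possible tail value; tail_summary is a hypothetical helper.

```python
def tail_summary(beyond_cutoff, window_minutes=30, total_steps=3000, cutoff_ms=3000):
    """Return (values per minute, share of all steps, minimum ms added to the mean)."""
    per_minute = beyond_cutoff / window_minutes
    share = beyond_cutoff / total_steps
    # Every tail value is at least cutoff_ms, so this is only a lower bound.
    min_mean_contribution_ms = beyond_cutoff * cutoff_ms / total_steps
    return per_minute, share, min_mean_contribution_ms


for count in (70, 100):
    per_minute, share, floor_ms = tail_summary(count)
    print(f"{count} values: {per_minute:.1f}/min, {share:.1%} of steps, "
          f"at least {floor_ms:.0f} ms of the mean")
```

Under those assumptions, a tail of only 2 to 3 percent of the steps contributes at least 70 to 100 ms to the mean, which is one reason the mean sits so far to the right of the median.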

References

Gould, Stephen Jay. Full House: The Spread of Excellence from Plato to Darwin. See also, e.g., the Online NewsHour conversation.

Continued

Part 3

Part 4