After our Business Warehouse upgrade to 7.0 in July 2009, I found that ST03 was paralyzingly slow to respond when drilling down to where process chain reports were displayed before the upgrade. As the symptoms did not appear immediately after the upgrade, but a few days later, I presumed it was data volume related. There was a lot I didn’t know at the time, though.
I related previous chapters of this saga in blog posts, starting 14-Aug-2009:
- Operator? Support? Hello? No, I don’t want to talk to Mr. Veedle. – 14-Aug
- Is This The Party To Whom I Am Speaking? – 20-Aug
- Workload Analysis (ST03) Switchboard Jam (part 3) – 01-Sep
After the typical drawn-out struggle with support to identify the code and/or data responsible for the observed symptoms, and an equally unsurprising wait to apply an unpublished SAP note, we began the path to production. Sometimes fixes for issues like this are simple enough to apply to production in short order. Others are complex enough, or the timing is such, that we need to wait for the next cycle.
We had a production copyback near the end of August, part of our monthly production transport cycle. Major changes are regression tested for a week against the latest data to ensure the most representative testing. As relayed in the previous chapter, my QA tests showed the proposed code fix repaired the problem.
This past weekend, transports moved into production. As it was Labor Day in the U.S., I did not begin verifying the correction until Tuesday. To my surprise, the first drill down into ST03 took about five minutes before dumping.
Previous faults had manifested as a long delay, far longer than the five minutes before this failure, not a dump. While the code looked to have died in about the same place, I wasn’t sure why: the QA system had a huge volume of data, and should have had exactly the same code.
My next step was to verify that the code fix had gone into production (it had), and then to decide whether to update the SAP ticket, or look around a little more. We chose the latter.
The BW architect reported:
|… read thru the dump and was suspicious of one of the messages….so I reviewed all the data in the cubes and it seems like one of the loads failed over the weekend while I was on vacation. So the virtual cube had to access about 4 days of data…|
(“virtual cube” sounds like science fiction to me).
Next update was:
|… corrected the failed load so the virtual cube only had to access today’s data and ST03 now runs in about a minute.|
Later in the day I ran my own verifications. Given the disconnects and wrong numbers on this case so far, it would not have made sense to simply assume everything worked.
The symptoms were much improved — I was able to see both the full month of August and the partial month of July (the first part now living in the “640” space).
Figure 2 above shows what I had been looking for – a quick summary of a month’s worth of process chain runs. This view gives me the identity of the longest running applications, which on a month-by-month basis will show trends such as new contributors, and runtime degradation.
Drilling into one of the longest-running chains reveals more about the metrics. You may notice that I sorted this display by start time, rather than by total (or component) time. This gives me a clearer picture of the sequence of events. As the individual parts don’t add up to anywhere near the whole, there’s a gap in the chronology. Nearly 4 days, in fact, as you can see once you get past the commas in the timestamp column.
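The same check can be done programmatically. Below is a minimal sketch of the idea: sort step start times chronologically and flag any gap between consecutive steps that exceeds a threshold. The chain and step names, timestamps, and the six-hour threshold are all hypothetical, for illustration only; real data would come from the workload statistics cubes, not a hard-coded list.

```python
from datetime import datetime, timedelta

# Hypothetical step start times for one process chain run,
# already sorted by start time as in the ST03 drill-down.
step_starts = [
    ("LOAD_STEP",     datetime(2009, 9, 4, 22, 0)),
    ("ROLLUP_STEP",   datetime(2009, 9, 4, 22, 45)),
    ("COMPRESS_STEP", datetime(2009, 9, 8, 21, 30)),  # resumes ~4 days later
]

# A large gap between consecutive steps suggests a failed step left
# the chain dormant, rather than a genuinely long-running step.
threshold = timedelta(hours=6)
for (name_a, t_a), (name_b, t_b) in zip(step_starts, step_starts[1:]):
    gap = t_b - t_a
    if gap > threshold:
        print(f"gap of {gap} between {name_a} and {name_b}")
```

Sorting by start time rather than by runtime is what makes the dormant period visible at all; ranked by total time, the failed step would simply look like one long contributor.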
It appears that a step failed, and then the process chain lay dormant. In this case, we’d be looking to our alert and escalation procedures more than to the performance tuning that a long-running chain might have implied.
Are we done? For this blog series, I expect so; I can follow up via the commentary. For this issue, no, we’re not done. Here’s where we are in stabilizing the workload statistics objects in production:
|… loads that fail and the data grows too large for the virtual cubes to work. Currently, I tried to reprocess the failed load and it seems that it short dumps too (compute overflow). I found a note to apply, but that won’t be till the next monthly move. I’ll have to hand-hold this load every day now unless …|
I’ll respond on the ticket about the issue with the 14-day deletion not working and put it back into SAP’s hands.
Part of the “positive call closure” survey will be links to this blog series.