We got the dispatch call on Tuesday September 18, 2007. “Bad response time” on sales order entry. Unfortunately for us, the problem had started a week earlier. Things were bad and getting worse, heading into the last weeks of the fiscal quarter, typically one of the biggest.
Users had noticed sales order entry (VA01) and related transactions were running slower than normal early in the prior week, but for reasons unknown the escalation went to the network teams to research WAN throughput, and nothing came to the attention of the Basis, database or system administration teams for 6 business days.
Overall response time looked about the same as prior weeks, so we started researching database activity and quickly noted a major degradation of that part of the VA01 transaction (using statistics in ST03). Nothing appeared abnormal about general database health, but there were several suspicious transactions in the SQL buffer cache (ST04).
Triage was pointing to a particular application change as the root cause, rather than network congestion or system changes.
I saw a few condition tables (e.g., A527) that were doing full table scans, meaning the SAP application servers were asking for all of the rows. Not highly unusual, but the tables didn’t look familiar and the volume of data being transferred seemed high. As 3 of the statements were in the top 5 for buffer gets, I wrote the names down as noteworthy.
At that point, it made sense to take a tour through the application server table buffers. As I’ve been through there (ST10) plenty of times, it was quickly obvious that tables were fully buffered that shouldn’t be. One was over 100MB, dwarfing the generic buffer.
To cement the case, I ran SQL traces (ST05) on a user who appeared to working in VA01. If I had more time to follow up with the customer service team, I would have gotten the duty manager on the phone, found out who was working, and traced them. But I was in a hurry, so I just viewed the system activity (SM66) to find a rep to trace. The SQL trace showed lots of table accesses to the database instead of to the table buffers. The tables with many full table scans weren’t the top time for this user, but the pattern was becoming clear.
Then I went back through the ST03 (SAP application metrics) maze, pulling out data to cast trends by day, by week, and by user. The user I had traced was showing average times for VA01 around 1.5 seconds, which sounds OK, but as they were seeing 0.8 seconds before this episode, they were getting hourglasses for twice as long as normal.
We bounced email and tickets around for a day or so before getting to the conference call where actions were decided. As we had found several table definitions changes in the SAP data dictionary from partial to fully buffered as of the last monthly move, we recommended an emergency change for Thursday night. We negotiated an acceptable time with the application managers, and then went through getting the proper approvals for the changes.
After running some errands after work Thursday, I logged back onto the R/3 system and watched as users in Australia/New Zealand took orders. I polled values from ST03 for the “last minute’s load [SIC]” and found the results below:
|Trans/Rep.||Short text||# Steps||T Response Time||Ø Time||T CPU~||Ø CPU~||T DB Time||Ø DB Time|
|VA01||Create Sales Order||1,096||1,629||1,486.5||126||115.1||861||785.8|
Response time (ms)
Time in WPs (ms)
Wait time (ms)
CPU time (ms)
DB req. time (ms)
Overnight EMEA came on line and users uniformly reported improvements in each market. As the Basis manager commented, it’s nice (though too rare) to get an email when we get to work in the morning on the US East Coast that “everything is fine.”
Here’s a chart from data I collected from weekly values:
Well that’s what we in technology found. What happened was clear, but why and how were still a mystery. As we sifted through the change and transport logs, we could see a transport had gone to production on September 9th. The effect was to revert the condition tables, which are generated from configuration values or rules, to being fully buffered, which was how they were delivered from SAP. However, we had changed them to partially buffered (“generic”) anywhere from 6 months to 2 years earlier as we noticed the volumes increase.
What we have discovered is that new text values were added to the tables (whether screens or data I’m not sure yet) to support our post-Unicode conversion go-lives for China, Hong Kong and Taiwan. The reports are that we’ve safely made similar changes in the past as we’ve gone live in other markets, but this was different somehow.
We have a ticket open with SAP, and if anyone with a clue catches this blog feel free to see me in Las Vegas this week!