
  We got the dispatch call on Tuesday, September 18, 2007. “Bad response time” on sales order entry. Unfortunately for us, the problem had started a week earlier. Things were bad and getting worse, heading into the last weeks of the fiscal quarter, typically one of the biggest.

 

  Users had noticed early in the prior week that sales order entry (VA01) and related transactions were running slower than normal, but for reasons unknown the escalation went to the network teams to research WAN throughput, and nothing came to the attention of the Basis, database, or system administration teams for six business days.

 

  Overall response time looked about the same as prior weeks, so we started researching database activity and quickly noted a major degradation in the database portion of the VA01 transaction (using the statistics in ST03). Nothing appeared abnormal about general database health, but there were several suspicious statements in the SQL buffer cache (ST04).

 

  Triage was pointing to a particular application change as the root cause, rather than network congestion or system changes.

 

  I saw a few condition tables (e.g., A527) that were doing full table scans, meaning the SAP application servers were asking for all of the rows. Not highly unusual, but the tables didn’t look familiar and the volume of data being transferred seemed high. As 3 of the statements were in the top 5 for buffer gets, I wrote the names down as noteworthy.

 

  At that point, it made sense to take a tour through the application server table buffers. Having been through there (ST10) plenty of times, I could quickly see that tables were fully buffered that shouldn’t have been. One was over 100MB, dwarfing the generic buffer.
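  A back-of-the-envelope check is enough to see why a table that size is a bad candidate for full buffering. Here is a minimal sketch in Python; the buffer size and the 50% headroom rule of thumb are my own illustrative assumptions, not values from this system:

    # Rough sanity check: can a fully buffered table realistically live in the
    # SAP table buffer?  All sizes here are illustrative assumptions.

    def fits_in_table_buffer(table_size_mb: float,
                             buffer_size_mb: float,
                             headroom: float = 0.5) -> bool:
        """A single table needing more than `headroom` (here 50%) of the buffer
        will either never load or will keep displacing everything else."""
        return table_size_mb <= buffer_size_mb * headroom

    # Hypothetical numbers: a ~100 MB condition table vs. a modest generic buffer.
    print(fits_in_table_buffer(table_size_mb=100, buffer_size_mb=120))   # False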

 

  To cement the case, I ran SQL traces (ST05) on a user who appeared to be working in VA01. If I had more time to follow up with the customer service team, I would have gotten the duty manager on the phone, found out who was working, and traced them. But I was in a hurry, so I just viewed the system activity (SM66) to find a rep to trace. The SQL trace showed lots of table accesses going to the database instead of to the table buffers. The tables with many full table scans weren’t the biggest time consumers for this user, but the pattern was becoming clear.

 

  Then I went back through the ST03 (SAP application metrics) maze, pulling out data to cast trends by day, by week, and by user. The user I had traced was showing average times for VA01 around 1.5 seconds, which sounds OK, but as they were seeing 0.8 seconds before this episode, they were getting hourglasses for twice as long as normal.
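  The arithmetic behind that statement is trivial, but here it is as a quick Python sketch using the rounded averages above:

    # Approximate ST03 averages for a VA01 dialog step, from the trends above.
    before_s = 0.8    # typical average response time before the episode
    during_s = 1.5    # average response time while the bad buffering was live
    print(f"Slowdown: {during_s / before_s:.1f}x")   # prints "Slowdown: 1.9x"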

 

  We bounced email and tickets around for a day or so before getting to the conference call where actions were decided. As we had found several table definition changes in the SAP data dictionary from partially to fully buffered as of the last monthly move, we recommended an emergency change for Thursday night. We negotiated an acceptable time with the application managers, and then went through getting the proper approvals for the changes.

 

  After running some errands after work Thursday, I logged back onto the R/3 system and watched as users in Australia/New Zealand took orders. I polled values from ST03 for the “last minute’s load [SIC]” and found the results below:

 

19-Sep-07 (before Thursday night’s change):

  Trans/Rep.  Short text          # Steps  T Response Time (s)  Ø Time (ms)  T CPU~ (s)  Ø CPU~ (ms)  T DB Time (s)  Ø DB Time (ms)
  VA01        Create Sales Order  1,096    1,629                1,486.5      126         115.1        861            785.8

21-Sep-07 (after Thursday night’s change):

  Response time (ms)  Time in WPs (ms)  Wait time (ms)  CPU time (ms)  DB req. time (ms)
  840                 744               -               104            192
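  As a rough before/after comparison (my own quick calculation, treating the 21-Sep figures as per-step averages comparable to the Ø columns in the 19-Sep snapshot):

    # Average VA01 values from the two ST03 snapshots above (milliseconds).
    resp_before, resp_after = 1486.5, 840.0
    db_before, db_after = 785.8, 192.0
    print(f"Average response time down {1 - resp_after / resp_before:.0%}")  # about 43%
    print(f"Average DB request time down {1 - db_after / db_before:.0%}")    # about 76%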

 

 

  Overnight, EMEA came online and users uniformly reported improvements in each market. As the Basis manager commented, it’s nice (though too rare) to get to work in the morning on the US East Coast to an email saying “everything is fine.”

 

  Here’s a chart built from the weekly values I collected:

 

[chart of weekly values]

 

   Well, that’s what we in technology found. What happened was clear, but why and how were still a mystery. As we sifted through the change and transport logs, we could see that a transport had gone to production on September 9th. The effect was to revert the condition tables, which are generated from configuration values or rules, to being fully buffered, which was how they were delivered from SAP. However, we had changed them to partially buffered (“generic”) anywhere from 6 months to 2 years earlier as we noticed their volumes increasing.
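  In hindsight, a simple guard in the transport process would have caught this: compare each table’s buffering mode before and after the import and flag any that flip. Here is a minimal sketch of that idea in Python; the data is hard-coded for illustration, and only A527 is one of the tables actually involved in this incident:

    # Diff two exports of "table -> buffering mode" taken before and after a
    # transport import.  Values here are illustrative.
    before = {"A527": "generic", "A528": "generic", "A900": "full"}
    after  = {"A527": "full",    "A528": "full",    "A900": "full"}

    for table, old_mode in sorted(before.items()):
        new_mode = after.get(table, "missing")
        if new_mode != old_mode:
            print(f"{table}: buffering changed {old_mode} -> {new_mode}")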

 

  What we have discovered is that new text values were added to the tables (whether screens or data I’m not sure yet) to support our post-Unicode conversion go-lives for China, Hong Kong and Taiwan.  The reports are that we’ve safely made similar changes in the past as we’ve gone live in other markets, but this was different somehow.

 

   We have a ticket open with SAP, and if anyone with a clue catches this blog, feel free to see me in Las Vegas this week!

 

Jim


9 Comments


    1. Jim Spath Post author
      I appreciate the positive feedback!  As noted in the blog, that doesn’t always happen (I’m as guilty as the next person).

      I would like to expand on this thread if you can tell me which part you are most interested in.

      Jim

  1. Rui Pedro Dantas
    Hello, Jim.
    Very good post, but I have one doubt. With those settings (huge tables as fully buffered), I would expect an increase in memory consumption, but not necessarily a big impact on response time. Why exactly would you say the response time was affected?

    Regards,
    Rui Dantas

    1. Jim Spath Post author
      Rui:

        What we think was happening, based on user experience and the system tracing we did, was that the application servers were spending a lot of overhead time asking for data that wasn’t needed.  Each full table scan might take around a second (don’t have exact numbers), which was noticeable to users.

      Jim

      1. Rui Pedro Dantas
        Mmm, ok, thanks. I would expect the *first* access to each table to take longer (all the records would be read from the db to SAP’s table buffer), but not from then on (the buffer would be used).
        There might have been problems, though, if the table buffer size or max number of entries was not enough, causing a lot of swap.
        1. Jim Spath Post author
          Rui:

          Right.  The tables in question were too big to fit into the SAP table buffer, so were invalidated right away.  Each subsequent retrieval did the same thing.

          As Oracle puts full table scan blocks at the other end of the least recently used list, each data retrieval went to disk instead of database block memory.

          Based on the improvement of user response time, we’re sure that table buffer settings were the culprit.

          Jim

          1. Rui Pedro Dantas
            Ok, thank you again for your answer.

            My questions were just to understand whether we should expect this huge impact on response time
            1) only if the buffer is not big enough (which is your case), or
            2) even with a bigger buffer, because of the cost of reading the huge resulting buffer itself. In our case, zcsa/table_buffer_area is currently 500,000,000, but we definitely also have to change the buffer settings for some tables to be able to reduce this value.

            By the way, in case you are interested, note 886773 (Q. 4) suggests a modification to avoid buffering type changes after regeneration of condition tables.

            Rui

            1. Jim Spath Post author
              Ah, thanks for the clue to prevent this in the future.  I’ve passed the note number to Basis/DBA. 

              Send me a note offline, as I’d like to know how you sized your table buffers to 500 MB.  I’ll post a summary back into the blog thread.

              Jim

