Skip to Content

The Wind Up

After our BW 7 upgrade, we looked pretty good once a backlog of process chains cleared out, and the tourist trade moved on from “hey look at the shiny new warehouse”.  I was able to view my stock performance reports, and other than new wrinkles with the Java stack, everything looked rosy.  Except that within a few days, I could no longer view process chains in ST03/Workload Analysis.  I dug deeper.

Development and Quality had been okay, as far as drilling into process chains as I have done for a few years.  There was one glitch with the DBA cockpit that caused an ABAP dump when looking at specific menu tabs in storage history, but that was cleared up with help from Twitter and service.sap.com.  The problem with ST03 was that I was getting an hourglass, a timeout, a hang.  I impatiently killed the transaction and went to my colleagues.

Several reported “no problem” with ST03 initially, which led me to look at desktop settings, GUI, etc.  Then, normal problem isolation techniques found that the problem happened for my user id on other systems.  It was me, or something I did that others didn’t.  Finally a BW architect hit the same snag, trying to look at process chains.

The Play By Play

Once the problem was isolated to transaction steps and reproducible, I began looking for the long running code.  Our initial SQL trace attempts were inconclusive, as the trace showed nothing running on my account, yet I continued to have long runtimes.  I looked at SM66 and found processes that matched the approximate start time of my report, but they were on a different application server, and not under my ID.  My data request had been transformed into an RFC call. Outsourced, if you will.

Time to open a ticket.  And wait.  I knew this was going to be a tricky one as it had taken us a little while to find it.  I’ll fast forward over the first several non-responses to get to the gems.

  • Your database parameters are incorrect.

 

Well, fine. I bet you read that in a book.  Called “Early Watch Report”.  Support has access to reports run by analysts who have been working for SAP less time than we’ve been running SAP.  Our DBAs came back with point-by-point explanations of why we have the settings that we have.  As we’ve been trained by Oracle Center of Excellence world-class DBAs, we’re confident that isn’t the issue here.

In fact, our lead DBA looked at the statement originally and said even though there are indexes on the tables in the view, there is too much data.  

  • Your buffer parameters are incorrect

 

Second try.  Another “cook book” answer to a problem.  An irrelevant solution to a specific problem.  Table buffer sizes have nothing to do with this code.  There are millions of rows in the RSDDSTAT tables now, so buffering them would be a no-win.

The Double Header

By now, I’m getting to be impatient as support is waltzing around different parts of the dance floor, getting nowhere near a fix to the problem.  In the meantime, I’m working an entirely different technology issue with a different vendor, a problem where the time in Mexico City is different than in the U.S.  Users there say the time is the same as our Central time zone, but their daylight savings times rules are different.  I believe the users, of course, and find evidence online in one of the timezone sites I visit (the other is ambivalent unfortunately).  However, the code in the OS and in the scheduling application doesn’t work.

The timezone call took more than 40 minutes, where I was forced to listen to technicians doing web searches, and read fixes out loud that we had viewed and discarded immediately as inapplicable weeks before.  When that call ended with a “we’ll have to get back to you” I wasn’t in the best of moods.  Then, the latest SAP ticket update with “unable to reproduce” led me to volunteer to report my findings in the case log. Here we are with Mr. Vidal (pronounced “Veedle” – by Lily Tomlin), or “Operator” (shtick from the 1960s by Bob Newhart).  Hello?  Yes, we have a problem on the line.

The Box Scores

Here’s the code I saw in trying to reproduce the issue for support to look at.

SQL Plan [15:37]

 

Parse Timestamp: 20090813 15:37:30

System: BW
SQL_ID  8fzky9kv03m4c, child number 0
-------------------------------------
SELECT * FROM "RSDDSTAT_OLAP" WHERE "STARTTIME" > :A0 ORDER BY "SESSIONUID" , "STEPCNT" ,
"HANDLETP" , "HANDLEID"

Plan hash value: 3919501385

-----------------------------------------------------------------------------------------------------------
| Id  | Operation                      | Name             | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT               |                  |       |       |       | 54732 (100)|          |
|   1 |  SORT ORDER BY                 |                  |   677K|   139M|   311M| 54732   (2)| 00:10:35 |
|*  2 |   HASH JOIN                    |                  |   677K|   139M|   112M| 28079   (3)| 00:05:26 |
|*  3 |    HASH JOIN                   |                  |   677K|   104M|       | 10103   (3)| 00:01:58 |
|   4 |     TABLE ACCESS BY INDEX ROWID| RSDDSTATINFO     | 14760 |  1109K|       |   515   (1)| 00:00:06 |
|*  5 |      INDEX RANGE SCAN          | RSDDSTATINFO~010 |  2657 |       |       |    30   (0)| 00:00:01 |
|   6 |     TABLE ACCESS FULL          | RSDDSTATHEADER   |  5120K|   415M|       |  9542   (2)| 00:01:51 |
|   7 |    TABLE ACCESS FULL           | RSDDSTATEVDATA   |  7633K|   393M|       |  9425   (3)| 00:01:50 |
-----------------------------------------------------------------------------------------------------------

 

Sharp DBA types will notice that RSDDSTAT_OLAP is a 3 table join view, and that 2 of the tables are huge, and that Oracle says “do a full table scan” on both of them.  Ugly code.  But, there it is.

A few minutes after I grabbed this SQL statement (and pasted it into the ticket) I viewed SM51, to show the running dialog process I was waiting for.

 

Picture 1 [15:53]

 

image

 

 

Picture 2 [16:00] – more than 20 minutes after I grabbed the SQL plan.

 

  image

 

TO BE CONTINUED!

To report this post you need to login first.

14 Comments

You must be Logged on to comment or reply to a post.

  1. Stefan Koehler
    Hello Jim,
    my experience with the SAP support is very different in the last time.

    I am very happy with the BC-DB-* component, if some specific people are handling my call. While we’re at it, greets to Lars Breddemann, Michael Braunstein, Martin Frauendorfer and Matthias Lienert .. you guys do an awesome (great!) job. I always hope that these guys are handling my very rare requests.

    If we have no luck and we need to open calls on other components (or we get some other guys) it is getting worse. Sometimes i am really afraid and tired of opening SAP calls … i prepare for long ping-pong calls with no focus on the real problem and in the last time without any solution.

    Just a statement to your problem:
    It would be interesting to see the real “runtime” statistics of the SQL query (not only the estimated ones by the optimizer) and the wait events 🙂

    Wish you all the best, that you will get some help soon.

    Regards
    Stefan

    (0) 
  2. Riccardo Escher
    Hello Jim,

    the last month I had the impression that support (for SV-SMG-…) has become more mature and I am quite happy.

    But in the past I had similar fights with the windmills too.
    Sometimes it is better to close a message (with appropriate “positive” close voting, when you give a 2 they phone you and ask) and to reopen it, may be the second time you have more luck with your support man/women.

    To your problem, just a little hint: please have also a look at ST04 -> Performance -> SQL Statement Analysis -> Shared Cursor Cache and search there for your statement: if you find it, please press the “Explain” button.

    Surprisingly sometimes the paths differs from the one shown in ST05. SAP keeps the cursors open as long as possible, so sometimes the calculated path in the SGA cache is out of date.

    Good luck!

    Bye
    Riccardo Escher

    (0) 
  3. Bala Prabahar
    Hello Jim,

    I don’t know the server’s configuration you are running this on. Full table scan on tables with 5.1M(415MB of data) and 7.6M(393MB of data) records should probably not take 20 minutes unless there is some issue with the structures of the tables.
    The width of those tables are very small.

    Additional thought: The retention time for storing statistics data is 14 days (Default). Are you retaining stats data for a longer period?

    Thanks,
    Bala Prabahar

    (0) 
    1. Bala Prabahar
      Hello Jim,

      I would also check the configuration of temporary tablespace. The SQL plan shows that the sql uses approximately 400MB of temp space. Is there any I/O bottleneck?

      I guess this is what I would do:

         1) Do you see any difference in SQL plan between when you run in SQLPLUS vs in st03n? ST04’s sql plan would be helpful.
         2) How long does this query take when you run in sqlplus? You may get the value of one dynamic parameter from Oracle dictionary or st05 output whould show the value of dynamic parameter A0.
         3) Run Validate structure utility on 3 tables of the view. Do they show any issues.
         4) SQL plan shows HASH join. Is this appropriate? You have already mentioned that DBA’s have reviewed the parameters. One more question: How current are you with Oracle patches? Hash join from my experience is a double-edged sword. Sometimes it is very helpful; sometimes hash join may hurt.

      Thanks,
      Bala Prabahar

      (0) 
      1. Jim Spath Post author
        Bela, et al:

          Thanks for your comments.  This isn’t a database problem.  More details will be in part 2.

          “How current are you with Oracle patches?” ROTFL.

        Jim

        (0) 
        1. Bala Prabahar
          Hello Jim,

          I thought I will ask that question on patches. I follow your blogs so second thought: I probably should not have asked….

          Regards,
          Bala Prabahar

          (0) 
          1. Jim Spath Post author
            Bala:

              I hope you took my patch comment as the joke it was intended to be.  It was a good question, but as your second thought told you, I don’t believe our Oracle patch level to be a contributing factor for this performance topic.

              I appreciate the feedback so far.  It may a few more days until I post part 2.

            Jim

            (0) 
  4. Fidel Vales
    hi,
    interesting post.
    I’ll not try to guess what is the problem. I do not think thre is enough information 😉 (I would need the information mentioned by Stefan ) and you already mentioned that it isn’t a database problem.
    I assume that you had bad luck and some not experienced or newbie supported, not everybody is as experienced as the ones mentioned by Stefan.
    Thanks
    Fidel
    (0) 
    1. Stefan Koehler
      Hello Fidel,
      but that is the exactly the point that i tried to mention. In my opionion a member of the support staff should be an expert in his area, otherwise you don’t need to pay for support.

      My last two support calls were very disappointing until i said to the supporter that he should forward the call to a guy who has knowledege. In one call i built up a solution on my own, because the call doesn’t seem to get a success. I rated these calls, but never get a phone call.

      I also get the impression (from my other basis colleagues), that the calls taking much more time as in the past.

      In my opinion support quality is a very important topic, that SAP should pay more attention.

      Just my 2 cents :-))

      Regards
      Stefan

      (0) 
  5. Bala Prabahar
    Hello Jim,

    I thought I will share a few more points from your screenshot:

    1) The SQL statement you see in SM50 is not the one currently running. That was last sql statement. As you said, this is not a database issue.
    2) Secondly almost every second you waited was using CPU. You waited for 486 between two screenshots; CPU usage was 475 seconds.This is quite unusual(swapping??).Does dev_w0 file show anything unusual? If not, truss output on work process while you run st03n would throw more light. truss output would be very useful(additional thoughts: increase trace level on work processes and perform se30).
    3)You discussed DB settings;wondering about O/S configuration, patches, SAP kernel level etc?

    Thanks,
    Bala Prabahar

    (0) 
    1. Jim Spath Post author
      Bala:

        I love these Easter Egg hunts.  You make good points.  However, you are probably on the wrong trail to the root cause.

        You might notice I tagged this blog under “SAP Solution Manager” though I don’t think that tool has any positive benefit for bug chases like this one.  In a sense, it allows the support person to give us the wrong answer much faster.

        I said _SM51_ not _SM50_, by the way.

      Jim

      p.s. “truss” is operating system specific.  Avoid silly notions like blaming the operating system.

      (0) 
      1. Bala Prabahar
        Jim,

        I don’t have enough information to blame anything. The recommendation to run truss(I should have said ‘truss’ like command applicable to your o/s if available) was just a suggestion to gather more details. If you thought that I was blaming o/s, my apologies. I will never ever blame anything unless I have all facts. I use truss in Solaris extensively to gether more information.
        SM51 vs SM50 – SMHID

        Regards,
        Bala Prabahar 

        (0) 

Leave a Reply