We recently upgraded the Oracle databases of all our SAP systems from version 11.2.0.2 to 11.2.0.3, with the latest Oracle SBP available at that time (11.2.0.3.9).

We upgraded more than 40 SAP systems, from development through to production, in an exercise lasting almost 4 months. ERP, BI/BW, PI, MDM and portal systems were involved.

Everything went smoothly (at least on the face of it) for all the systems, but one of the teams involved in the smoke testing of our BW/BI systems didn't do their bit properly, as they were supposed to.

Data loads were not tested after the database upgrade on the development and integration systems; no one found it important enough to run a test load.

What followed one fine day after the production database upgrade of the BW systems was nothing less than a nightmare for me and my Basis team. None of the production BW systems, and we have three of them, could complete any scheduled process chain or data load on time. With all the scheduled loads piling up, all three systems came to a standstill overnight, with no resources free to do anything else. No data, no reporting, a complete outage.

Even a data load which used to run for 5 minutes started taking more than 3 hours to complete. What had suddenly gone wrong with our systems? No parameter changes, nothing. So what exactly did we change? We upgraded our database to version 11.2.0.3. That was the only change.


But a database upgrade is supposed to improve performance and make the database more stable; if it doesn't improve things, it at least shouldn't degrade database performance to such a state. All the other applications (ERP, SRM, PI, MDM etc.) were running fine, so why was only the BW application dying? It couldn't be the database upgrade. I didn't believe it was.

Did you check the loads on the development or integration systems? we asked the BW team. Not to our surprise, they said no. OK, no problem, let's test them now. They will run fine on the dev and integration systems, we told ourselves. There was still no way we could believe the database upgrade would cause this; the BW team must have done something to the application.

We tested in the dev and integration BW systems, and none of the loads we tested could finish. So something was wrong with the only change we had made: the database upgrade to version 11.2.0.3.

We couldn't find any SAP Notes describing similar problems after upgrading to 11.2.0.3. We raised a very high priority message with SAP and, in parallel, started our own investigation.

The SAP message wasn't going anywhere; we kept getting the lamest answers possible in reply to our question, and it went on and on.

We took the matter into our own hands and started investigating along the following lines:

1. Why had the volume of DB reads increased many times over, up to 400 MB/s?

2. Why were there direct path reads with full table scans, causing increased I/O and physical reads? That never used to happen.

3. The physical storage had reached saturation point and couldn't handle any more I/O; there was too much load from the application.

4. The average wait time for the log file sync event had increased 10 times.

All the extraction jobs were using direct path reads with full table scans, causing increased I/O and physical reads; they were bypassing the buffer cache and doing full table scans with enormous numbers of physical reads. This was probably not happening before; something after the Oracle patching had caused it and stalled everything. A full table scan meant more blocks being read, and with many parallel jobs running, I/O went through the roof.
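
For anyone who wants to reproduce this kind of check, the standard v$ views are enough. The following is only a rough sketch of the queries we ran as a DBA user in SQL*Plus:

-- Sessions currently waiting on direct path reads, and the SQL they are running
SELECT s.sid, s.program, s.event, q.sql_id, q.disk_reads,
       substr(q.sql_text, 1, 80) AS sql_text
  FROM v$session s
  JOIN v$sql q ON q.sql_id = s.sql_id
 WHERE s.event = 'direct path read';

-- System-wide view of the wait events we were worried about
SELECT event, total_waits, round(time_waited_micro / 1000000) AS seconds_waited
  FROM v$system_event
 WHERE event IN ('direct path read', 'db file scattered read', 'log file sync');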

Well, we found the answer here: http://www.pythian.com/blog/upgraded-to-11gr2-congrats-you-are-in-direct-reads-trouble/

and in the Oracle Metalink document ID 793845.1.

Please go through them.

Resolution of our Problem

By the time we found this useful link, we had already started identifying the expensive SQL statements the loads were triggering (all the loads having the problem were BW self-loads only) and had started creating new indexes based on which ODS tables or cubes the SQL statements referred to and the sequence of columns in those statements.

We ended up creating 12-14 new indexes. The SQL statements for the loads, instead of doing full table scans with direct path reads, started using index range scans, and the load times dropped drastically, even faster than before the upgrade.
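
For illustration, this is roughly how we went about it. The table, column and index names below are made up; the real ones came from the ODS active tables and the column order in the WHERE clauses of the expensive statements:

-- Find the statements doing the most physical reads
SELECT *
  FROM (SELECT sql_id, disk_reads, executions,
               substr(sql_text, 1, 80) AS sql_text
          FROM v$sql
         ORDER BY disk_reads DESC)
 WHERE ROWNUM <= 20;

-- Example of the kind of index we added (hypothetical ODS table and columns)
CREATE INDEX "SAPSR3"."/BIC/AZSALES00~Z01"
  ON "SAPSR3"."/BIC/AZSALES00" ("DOC_NUMBER", "FISCPER");

-- Check that the plan switched from a full table scan to an index range scan
-- (replace &sql_id with the SQL_ID found above; run in SQL*Plus)
SELECT * FROM TABLE(dbms_xplan.display_cursor('&sql_id'));

In an SAP landscape you would normally also define such indexes in the ABAP dictionary, so that the database and the dictionary stay consistent.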

That's the one benefit we got from this problem: we tuned the loads to run even faster 🙂

We also tested the suggestion from the Metalink document and the blog post: event 10949.

After event 10949 was set, the DB behaviour was back to what it was before the 11.2.0.3.9 upgrade. Direct path reads were no longer happening after setting the event, and we could see that the wait event "db file scattered read" had also returned. (It's not used in 11g by default.)
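
For reference, the event can be set like this (a sketch only; test it in a sandbox system first):

-- Switch off serial direct path reads for the running instance
ALTER SYSTEM SET EVENTS '10949 trace name context forever, level 1';

-- Or make the setting persistent across restarts via the spfile
ALTER SYSTEM SET event = '10949 trace name context forever, level 1' SCOPE = spfile;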

But we kept our indexes in place and didn't move the event setting to our production environment. This only affected the BW systems because, I guess, BW is an OLAP application, where reads dominate and form a major part of the database workload.

I hope this information helps others too.

Thanks

Amit

5 Comments

  1. Fidel Vales

    Unfortunately I do not have enough information and time to rebut most of your entry, but probably this "issue" is simply that an already BAD statement went wrong.

    It also denotes a certain lack of knowledge of how Oracle works.

    For starters, you mention:

    After event 10949 was set, the DB behaviour was back to what it was before the 11.2.0.3.9 upgrade. Direct path reads were no longer happening after setting the event, and we could see that the wait event "db file scattered read" had also returned. (It's not used in 11g by default.)

    "db file scattered read" is used in 11g; Oracle decides whether to use it or a direct path read depending on the size of the table and the size of the DB buffer cache.

    In any case, Oracle is using a "multiblock read" to read a BIG table; the difference is whether it puts the data into the buffer cache (pushing other data out of the buffer) or not.
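
    You can roughly check on which side of the threshold a given table falls by comparing its size with the buffer cache; a quick sketch (the table name is only an example, and the exact threshold is internal and version dependent):

    -- How big is the table compared with the buffer cache (rough check only)
    SELECT t.table_name, t.blocks AS table_blocks, c.buffer_cache_mb
      FROM dba_tables t,
           (SELECT round(bytes / 1024 / 1024) AS buffer_cache_mb
              FROM v$sgainfo
             WHERE name = 'Buffer Cache Size') c
     WHERE t.table_name = 'MY_BIG_ODS_TABLE';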

    My question would be whether you need a full table access to do such a load, or whether it can be optimized (you are talking about improving the situation by creating indexes, which clearly points to my impression of a previously BAD access path).

    In summary, with the little information you provide, Oracle did absolutely nothing wrong, as what you describe is something everybody who works in SAP and "Oracle performance" knows.

    Read SAP Note 619188 – FAQ: Oracle wait events, point 15:

    As of Oracle 11g, full table scans for larger tables are read via direct path.

    In most cases this makes the general performance of the instance better, but (as with everything) there can be cases where it has a bad impact on something.

    1. Amit Bangotra Post author

      Hi Fidel,

      Yes, I do have knowledge of Oracle wait events as such and of which ones are important from a performance perspective.

      There was no way to identify bad SQL statements when the same statements were running with acceptable performance on 11.2.0.2 and suddenly degraded on 11.2.0.3. The reason we couldn't catch this earlier was that testing wasn't done properly on the integration system after the upgrade.

      Yes, probably they were badly designed or had bad access paths already, but they were not doing full scans or direct path reads before, otherwise we would have known about it before the upgrade. That behaviour only appeared with 11.2.0.3 and not with 11.2.0.2 in our case, and that's why we spent time tuning them instead of straight away setting the event; it gave us a chance to tune those SQL statements and loads.

      Probably in the previous version, 11.2.0.2, it was still going through the buffer cache. I couldn't check and verify that, and the docs say this behaviour is the default in 11g.

      What event 10949 does is disable the new functionality related to the direct reads enhancement introduced in 11.2.0.2, so the database behaves as in older versions, and hence we could see more db file scattered read events.

      Thanks

      Amit

      1. Fidel Vales

        Yes, probably they were badly designed or had bad access paths already, but they were not doing full scans or direct path reads before, otherwise we would have known about it before the upgrade

        Then, if they were not doing an FTS before, the issue is not the title of your blog; the issue is that the CBO decided to change the plan for whatever reason, and the new plan is worse than the previous one because it includes an FTS on a big table.

      2. Fidel Vales

        BTW

        disable the new functionality related to the direct reads enhancement introduced in 11.2.0.2

        • direct path reads were not introduced in 11.2.0.2 but in 11.2.0.1
        • you were already on 11.2.0.2 without the problem

        Therefore direct path read was not the issue, but some feature that might have been introduced in 11.2.0.3 that made the CBO change its opinion (which could even be a new set of statistics, not a change in the CBO at all).

        1. Amit Bangotra Post author

          Yes, I understand what you are saying, but we had no such problems with direct path reads on 11.2.0.2. I would love to test it again on an 11.2.0.2 system, but sadly I don't have any system to test on now. The links below talk about this on 11.2.0.2, but we didn't have any such issues.

          Adaptive path read operation tips

          Oracle 11g release 2 upgrade performance problems

          Also, the fact that enabling event 10949 brings performance back to its earlier state, without any direct path reads, suggests that the problem is aggravated in 11.2.0.3, as in our case, and not in 11.2.0.2. The runtimes for the loads after we set this event were exactly what we had before the upgrade; we tested that in our integration environment.

          Another thing is that this problem occurred only on the SAP BW/BI application, and on all of those systems simultaneously, not on any other system. That is probably because it is the most read-intensive application and uses a lot of parallelism, which can aggravate the problem; we had DTPs which ran more than 30 parallel jobs at the same time.

          Thanks
