We recently upgraded the Oracle Databases of all our SAP systems from version 220.127.116.11 to 18.104.22.168 with latest oracle SBP available at that time (22.214.171.124.9).
We almost did upgrade of 40+ SAP systems, starting from Dev to production , an exercise lasting almost 4 months. ERP, BI/BW, PI, MDM and portal systems were involved.
Everything went smoothly ( At least on face value) for all the system , but one of the testing teams involved for the smoke testing for our BW/BI systems didn’t do their bit properly as they were supposed to do so.
Data loads were not tested after the database upgrade on development and integration systems. No one found it important enough to run a test load.
What followed one fine day after the production database upgrade of the BW system wasn’t less than a nightmare for me and my basis team. All the production BW systems , and we have three of them, couldn’t complete any scheduled process chain/data load on time. And with all the scheduled loads piling up , all the three systems came to a standstill overnight with no resources free to do anything else . No data, no reporting, complete outage.
Even a data load which used to run for 5 minutes started taking more than 3 hours to complete. What had suddenly gone wrong with our system , no parameter changes , nothing, so what exactly did we change “We upgraded our database to version 126.96.36.199″. That’s the only change.
But a database upgrade is supposed to increase performance and make DB stable, if not increase then at-least shouldn’t bring down the database performance to such a degraded state. All other applications ERP, SRM, PI, MDM etc, were running fine, why only BW application is dying. It can’t be database upgrade. I didn’t believe so,
Did you check the loads on development or integration system? We asked the BW team. Not to our surprise they said , NO. Ok, no problems lets test it now. It will run fine on dev and integration systems, we said to ourselves. There was no way we could still believe that database upgrade will cause this. BW team would have done something to the application.
We tested in Dev and Integration BW systems and none of the loads which were tested could finish. So, something wrong with the only change we made was Database upgrade to version 188.8.131.52.
We couldn’t find any SAP notes talking about any similar problems after upgrading to 184.108.40.206. We raised a very high message with SAP, and side by side starting our investigations.
SAP message wasn’t going anywhere with the lamest of the answers we could ever get to a question possible as a reply to our question and it went on an on.
We took the matter in our hands and started investigation , On the following lines.
1. Why the percentage of DB reads has increased many folds, upto 400MB/s
2. Why are there direct path reads with full table scans causing enhanced I/Os and physical reads. It never used to happen.
3. Physical storage has reached saturation point. Can’t handle any more I/O. There’s too much load from the application
4. Log file sync event , average value has increased 10 times.
All the extraction jobs were going to direct path read with full table scans causing enhanced I/Os and physical reads, which meant that these were overriding the buffer cache with full table scan & enormous physical reads. This was probably not happening before, something caused this and stalled every thing after oracle patching. Full table scan meant more blocks being read and with many parallel jobs running, went through the roof.
Well we found the answer here http://www.pythian.com/blog/upgraded-to-11gr2-congrats-you-are-in-direct-reads-trouble/
and in the oracle metalink document ID 793845.1.
Please go through it.
Resolution of our Problem
By the time we found this useful link, we had already started identifying the expensive SQL statements which the loads were triggering ( All the loads having problem were self BW loads only) and had started creating new indexes depending upon what ODS tables or cubes the SQL were referring to and sequence of columns in those statements.
We ended up creating 12-14 new indexes , the the SQL statement for the loads , instead of full table scans with direct path read, starting using index range scans and the load time reduced drastically, and even faster that it used to run before.
That’s the one benefit we had from this problem. We tuned the loads to run even faster 🙂
We even tested the suggestion from the Meta link document and the blog and tested the event 10949.
After the event 10949 was set, the DB behavior was back to what it was before the 220.127.116.11.9 upgrade. Direct path reads were not happening after setting the event and we could see that the wait event ” db file scattered read” has also returned. (Its not used in 11g by default).
But we kept our indexes in place and didn’t move the event to our production environment. This only happened for BW system because I guess, its an OLAP application, where there are more reads and which form a major part of the database performance.
I hope this information help others too.