Introduction

Have you ever used ASH based samples or AWR reports to identify and fix a database performance issue in a specific business process, like a batch process? Have you ever requested a SAP Technical Performance Optimization (TPO) because of specific performance problems? Have you ever looked at “silver-bullet” ratio values like the buffer cache quality / hit ratio and assumed that these values need to be improved to fix your performance problem?

If you answered yes to any of these questions, you may have implemented all of your self-developed performance suggestions or the changes recommended by a SAP TPO in the past, only to notice that the response time for the end user did not improve at all.

As a consequence, you and your end users are frustrated and maybe not satisfied with the services provided on the market. Is this end result caused by bad service quality, or is it rather a matter of the approach used and poorly defined tuning targets? From time to time I am hired for Oracle / SAP performance analysis after all of these previous steps have already been taken, and I try to explain to my clients why the previously used approaches failed to improve the performance.

This blog post is about thinking clearly about Oracle / SAP performance (what is performance?) and why it is so important to keep that in mind when choosing an analysis and tuning approach.

Footnote: Performance analysis with ASH / AWR based data can be useful as well, but it all depends on the tuning targets. Pretty good examples for using AWR reports are comparing I/O response times after an I/O subsystem replacement or reducing the CPU load on the database server, but due to their design, AWR reports are pretty bad at identifying the root cause of a performance issue in a specific business process.
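A minimal sketch of that I/O comparison use case (assuming a Diagnostics Pack license, since the DBA_HIST_* views are part of it): the system-wide wait figures per AWR snapshot can be pulled from DBA_HIST_SYSTEM_EVENT and compared before and after the replacement.

SQL> select snap_id, event_name, total_waits, time_waited_micro
       from DBA_HIST_SYSTEM_EVENT
      where event_name = 'db file sequential read'
      order by snap_id;

The counters are cumulative since instance startup, so the average single block read time for one snapshot interval is the delta of TIME_WAITED_MICRO divided by the delta of TOTAL_WAITS between two consecutive snapshots.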

Thinking clearly about performance

Before we start with examples, analysis or tuning approaches, let’s think about performance and what it is all about for the end user.


So the question is: what do your end users really care about? Your end users do not get frustrated because of a bad buffer cache hit ratio or something like that. They get frustrated because their click, button, link, query, report or batch job does not finish quickly. That is all performance is about.

“The click, button, link, query, report or batch job has to finish quickly.”

All of us are end users, even if we just use Google. So we all know that performance can be a good or a bad experience, but how do we define “system performance”?

“… Describe your system’s performance …

Always either fast for everyone, or slow for everyone?

No.

Fast on some days, slow on others.

Fast for some people, slow for others.

Fast for some clicks, slow for others.

Performance is an attribute of individual experiences.”

So let’s keep in mind that each performance experience is an individual experience for each end user executing a specific task at a given time. In other words, performance is an attribute of the end user experience for a specific combination of user, executed task and point in time.

At this point you can ask yourself whether ASH or AWR based data provides the information needed for such individual user experiences. Does the buffer cache hit ratio or the top 5 wait events section (for example) provide you any information about that? Does the ASH sampled data tell you about each wait event and how long the end user’s task spent in it? The answer is pretty simple – no.

The approaches / methods

Let’s take a step back and clarify the source and scope of ASH / AWR based data before we move on to the approaches and methods.

ASH (Active Session History) base data

ASH data is based on 1-second samples (in memory) or on samples of those samples (1 out of 10, persisted to the workload repository):

    • V$ACTIVE_SESSION_HISTORY: “It contains snapshots of active database sessions taken once a second. A database session is considered active if it was on the CPU or was waiting for an event that didn’t belong to the Idle wait class.”
    • DBA_HIST_ACTIVE_SESS_HISTORY: “DBA_HIST_ACTIVE_SESS_HISTORY displays the history of the contents of the in-memory active session history of recent system activity. This view contains snapshots of V$ACTIVE_SESSION_HISTORY.”

AWR (Automatic Workload Repository) reports

AWR reports are based on the concept of database time (DB time), which is defined as:

    • Total time in database calls by foreground sessions
    • Includes CPU time, IO time and non-idle wait time
    • DB Time <> Response Time (that means the DB time is not necessarily the time that the end user / client experiences as database response time)

The most important key points (for understanding performance) are mentioned above, but what do they mean for interpreting the provided data?

First, we need to remember at all times that DB time can be totally different from the end user’s database response time (keyword “SQL*Net message from client”: it is completely unaccounted for, as it is defined as an idle event, which is true from a database perspective but not from an end user perspective).
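You can verify that classification yourself, as the wait class of every event is exposed in the standard view V$EVENT_NAME:

SQL> select name, wait_class
       from V$EVENT_NAME
      where name = 'SQL*Net message from client';

The query returns the wait class “Idle”, which is exactly why all of that time is missing from the DB time figures.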

You cannot tell how long something took just by counting how many times it happened (e.g. counting the wait events in view V$ACTIVE_SESSION_HISTORY for a specific session).
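To make this concrete, here is a minimal sketch of such counting (the session ID is hypothetical). It yields sample counts per event, not the time the task actually spent in each call:

SQL> select nvl(event, 'ON CPU') as activity, count(*) as samples
       from V$ACTIVE_SESSION_HISTORY
      where session_id = 1234   -- hypothetical SID
      group by nvl(event, 'ON CPU')
      order by samples desc;

A session that waited a hundred times for 1 ms each may show up with no samples at all, while a single 2-second wait shows up in roughly two samples.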

Performance analysis and diagnosis can be very misleading and frustrating if they are focused on system-wide statistics (e.g. the AWR report sections). How do you know that your performance issue in a specific business process is caused by “latch free” wait events (for example) – just because “latch free” shows up in the top 5 wait events section and should not be there? You cannot get that information from ASH or AWR with certainty.

ASH / AWR based data also does not provide information about asymmetrical behavior (e.g. you execute 100 database calls, but how much time is spent on the first call versus every subsequent call, or what does the detailed response time histogram look like for a specific wait event in the business process?). You cannot extrapolate details from an aggregate.
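Even where Oracle does maintain wait event histograms, they are aggregated instance-wide and not per business process. A quick sketch:

SQL> select event, wait_time_milli, wait_count
       from V$EVENT_HISTOGRAM
      where event = 'db file sequential read'
      order by wait_time_milli;

This tells you how the instance as a whole waited, but not how the calls of your specific batch job were distributed.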

So much for ASH / AWR based data and why you need to be careful when using it for performance analysis. Let’s get into the approaches and methods.

[Screenshots: definitions of “Method C” and “Method R”]

** The screenshots are from the book “Optimizing Oracle Performance” – for more details check the reference section **

Do you recognize your previously used approach for performance analysis in one of these two methods? In my experience, most performance analyses are based on “Method C” (even SAP TPO, in parts). Is it the best and most efficient way to improve the performance for the end user? In my opinion and from my experience, it is not. Of course you can earn more money with “Method C” on the client site, because you need much more time to reduce wait event after wait event until you maybe hit the root cause of the specific performance problem. However, you may have “tuned” all wait events, and the end user may still not notice anything. My consulting philosophy (based on Method R) is to provide my clients the most effective and sustainable solution with a comprehensible explanation, and Method R achieves this best.

“Method C” has also been described, in a different manner, as “Compulsive Tuning Disorder”:

“The signs of this illness were the excessive checking of many performance-related statistics, most of them ratio-based, and the inability to focus on what was really important. They simply thought that by applying some “simple” rules, it was possible to tune their databases. History teaches us that results are not always as good as expected. Why was this the case? Well, all the rules used to check whether a given ratio (or value) was acceptable were defined independently of the user experience. In other words, false negatives or positives were the rule and not the exception. Even worse, an enormous amount of time was spent on these tasks.”

So what does “Method R” look like in a SAP environment with an Oracle database performance bottleneck (the approach can be adapted to any kind of performance issue, however)?

  1. Define the business processes that need to be improved (e.g. a batch job or chain of batch jobs for a specific business process, reports, etc.)
  2. Define a tuning target for these business processes (e.g. the batch job needs to finish in 5 minutes, the report needs to be displayed within 1 minute, etc., maybe with a fault tolerance)
  3. Crosscheck the response time (STAD / ST03n) or trace it (SE30) in SAP to identify the component that took the most time (let’s assume that the bottleneck is the DB response time section in this case)
  4. Enable a dedicated Oracle SQL trace (not a SAP SQL trace) with wait events, binds and SQL execution plan dumps (see the sketch after this list)
  5. Profile the raw SQL trace and improve the performance based on the findings (which may well result in further SQL execution plan analysis)
  6. Measure the response time (STAD / ST03n) or trace it (SE30) in SAP again
  7. Crosscheck the response time against the defined tuning target and stop, or repeat steps 3 to 6 until it is reached
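As a minimal sketch for step 4, assuming you trace one dedicated session (the SID / serial# values are hypothetical and need to be looked up in V$SESSION first):

SQL> exec DBMS_MONITOR.SESSION_TRACE_ENABLE(session_id => 1234, serial_num => 56789,
          waits => TRUE, binds => TRUE, plan_stat => 'ALL_EXECUTIONS');
SQL> -- ... run the business process that needs to be analyzed ...
SQL> exec DBMS_MONITOR.SESSION_TRACE_DISABLE(session_id => 1234, serial_num => 56789);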

I am often asked how to profile the raw SQL trace file (step 5) and which tool I use. I usually use the (free) Trivadis Extended Tracefile Analysis Tool (TVD$XTAT). Oracle’s TKPROF tool nowadays provides information about wait events, but it still has three major problems that are addressed in TVD$XTAT:

  • As soon as the sort argument is specified, the relationship between SQL statements is lost
  • Data is provided only in aggregated form. Consequently, useful information is lost
  • No information about bind variables is provided

In addition, TVD$XTAT creates an HTML report with links, which makes jumping between data and analysis much more comfortable, and it has some other nice enhancements as well. However, I can also highly recommend the Method R Profiler, which currently starts at $1,180 USD.
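Whatever profiler you use, you first need to grab the raw trace file from the database server. Since 11g its location can be queried directly:

SQL> select name, value
       from V$DIAG_INFO
      where name in ('Diag Trace', 'Default Trace File');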

An example

The following simple example illustrates a pretty common case in which ASH / AWR based analysis will always fail and why “Method R” should be the way to go (in my opinion). Let’s assume the following scenario: we have an ABAP application that frequently calls an ABAP function module to get some detailed data about a bill. This is a pretty common scenario when custom code is written and SAP standard function modules are used to retrieve trivial dependent information. In my case I rebuilt that scenario in Java, as I have no SAP system at hand and I needed the usual client / server architecture. I am not a J2EE developer, so please don’t be too hard on me about the bad Java coding style (the SQL part, however, is knowingly written that badly).

The following demo was run on an Oracle database (11.2.0.3.6) on OEL 6.4 (2.6.39-400.109.1.el6uek.x86_64), and the Java program was executed on Mac OS (Darwin Kernel Version 11.4.2) with Java 1.6.0_51.

SQL> create table BILL (BILLNR number, TEXT varchar(20));
SQL> create table BILLPOS (BILLNR number, BILLPOSNR number, BILLDTEXT varchar(20));
SQL> create unique index BILL_I on BILL(BILLNR);
SQL> create unique index BILLPOS_I on BILLPOS(BILLNR, BILLPOSNR);
SQL> begin
       for i in 1..20000 loop
         insert into BILL values (i, 'TEST DESCRIPTION');
       end loop;
     end;
     /
SQL> begin
       for i in 1..20000 loop
         insert into BILLPOS values (i, 1, 'TEST DESCRIPTION');
         insert into BILLPOS values (i, 2, 'TEST DESCRIPTION');
       end loop;
     end;
     /
SQL> commit;
SQL> exec DBMS_STATS.GATHER_TABLE_STATS(USER, 'BILL');
SQL> exec DBMS_STATS.GATHER_TABLE_STATS(USER, 'BILLPOS');

These are the base Oracle database objects and the Java application code looks like this.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.Properties;

public class OracleJdbcExample {
    public static void main(String args[]) throws SQLException {
        // URL of the Oracle database server
        String url = "jdbc:oracle:thin:@OEL:1522:T11DB";
        // properties for creating the connection to the Oracle database
        Properties props = new Properties();
        props.setProperty("user", "SYS");
        props.setProperty("password", "<PASS>");
        props.setProperty("internal_logon", "sysdba");
        // creating the connection to the Oracle database using JDBC
        Connection conn = DriverManager.getConnection(url, props);
        String sql  = "select * from BILL";
        String sql2 = "select BILLNR, BILLPOSNR, BILLDTEXT from BILLPOS where BILLNR = ?";
        // creating PreparedStatement objects to execute the queries
        PreparedStatement preStatement  = conn.prepareStatement(sql);
        PreparedStatement preStatement2 = conn.prepareStatement(sql2);
        ResultSet result = preStatement.executeQuery();
        // deliberately bad pattern: one additional database round trip per BILL row
        while (result.next()) {
            int billnr = result.getInt("BILLNR");
            preStatement2.setInt(1, billnr);
            ResultSet result2 = preStatement2.executeQuery();
            while (result2.next()) {
                String billnr2   = result2.getString("BILLNR");
                String billposnr = result2.getString("BILLPOSNR");
                String billdtext = result2.getString("BILLDTEXT");
                System.out.println(billnr2 + " " + billposnr + " " + billdtext);
            }
            result2.close();
        }
        // release the database resources
        result.close();
        preStatement.close();
        preStatement2.close();
        conn.close();
    }
}

Let’s run this Java program now, enable an Oracle SQL trace for these JDBC connections / modules, and create an AWR report for comparison of the measured performance data.

SQL> exec DBMS_WORKLOAD_REPOSITORY.CREATE_SNAPSHOT();
SQL> exec DBMS_MONITOR.SERV_MOD_ACT_TRACE_ENABLE('SYS$USERS','JDBC Thin Client',
          waits => TRUE, binds => TRUE, plan_stat => 'ALL_EXECUTIONS');
shell> time java -cp /Users/Downloads/ojdbc6.jar:. OracleJdbcExample
real    0m29.658s
user    0m3.782s
sys    0m0.880s
SQL> exec DBMS_MONITOR.SERV_MOD_ACT_TRACE_DISABLE('SYS$USERS','JDBC Thin Client');
SQL> exec DBMS_WORKLOAD_REPOSITORY.CREATE_SNAPSHOT();

We can see that the end user waited roughly 29.7 seconds until the program finished. This is the performance target value I usually talk about at a client site, because this time is what the end users actually care about.

At first let’s take a look at the AWR report for that measuring period.

[Screenshots: AWR report header and top SQL section]

The AWR report states that the SQLs 6r6dng606gx8p and 5206f148n2vdu took only 4.57 seconds (“DB Time <> Response Time” demonstrated clearly), but the end user experiences a response time of roughly 29.7 seconds. These two SQL statements were the only logic processed by the Java code. So what would you optimize based on that information? Yes – the SQL 6r6dng606gx8p runs fully on CPU, but even if you were able to eliminate the CPU usage of this SQL statement completely, you would cut at most 4.44 seconds from the end user response time of roughly 29.7 seconds.

Now let’s take a look at the raw SQL trace file and the report generated with the SQL profiler TVD$XTAT.

[Screenshot: TVD$XTAT SQL statement overview]

[Screenshots: TVD$XTAT details for the two SQL statements]

Now you can see clearly that roughly 24.283 seconds (out of an end user response time of 29.7 seconds) are spent in the whole database layer (JDBC / OCI – OPI – DB engine) from an application point of view, but you only see roughly 4.6 seconds as database time in the AWR report. You may wonder why. Well, it comes down to how database time is measured by Oracle. Remember that DB time includes CPU time, IO time and non-idle wait time, but the main wait event in our example here is “SQL*Net message from client”, which is defined as an idle event by Oracle and is thus completely unaccounted for.

This example was deliberately designed to be that extreme, of course, but I wanted to make it very obvious that looking at ASH sampled data or an AWR report may not be sufficient to diagnose the root cause of a database performance issue in a specific business process. The solution for the performance issue above should be obvious too, of course: take the Java code from above, rewrite it as a simple two-table join, and measure the performance again.
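Just as a sketch of that rewrite: a single join statement returns the same data in one pass, so the 20,000 additional round trips (and their “SQL*Net message from client” waits) disappear.

SQL> select B.BILLNR, P.BILLPOSNR, P.BILLDTEXT
       from BILL B, BILLPOS P
      where P.BILLNR = B.BILLNR
      order by B.BILLNR, P.BILLPOSNR;

Executed once from the Java client (ideally with a larger JDBC fetch size), this should eliminate nearly all of the per-bill round trip time.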

Summary

I hope you get the main idea behind “Method R” and why it is much more productive than a trial-and-error method (“Method C”) like reducing the system-wide top 5 wait events. Method R can also be adapted if you want to reduce your average database response time in a SAP system, for example: you just start with the most database-time-consuming SAP program (STAD / ST03n = end user performance experience) and go on step by step until you have reached your tuning target. No more guessing – just hard facts and measured performance data.

If you have any further questions, please feel free to ask or get in contact directly if you need assistance with troubleshooting Oracle / SAP performance issues.

References

  • Cary Millsap, Jeff Holt: Optimizing Oracle Performance. O’Reilly Media, 2003


11 Comments


  1. Fidel Vales

    Hi,

    Nice. 🙂

    I have to say that I (we) try to get a “nice” objective.

    Unfortunately, a lot of times when I am requested to do a TPO, the objective is “tune the system”, “reduce DB time” …. and that is where a lot of the time any “tuning” fails, as the final user may not notice any improvement.

    Also, you can use (up to a certain degree) the ASH to find out what a session is doing without the need for a trace.

    1. Stefan Koehler Post author

      Hi Fidel,

      thank you very much for your feedback.

      I know that a lot of Oracle / SAP tuning requests (no matter if it is a SAP TPO or a request to a third-party company) are like “tune the system” or “take a look at the AWR report and make a suggestion to improve the performance”, but in my opinion it is our duty to guide our clients in the right direction. I prefer a happy client (with happy end users) in the end instead of following an unqualified request like “reduce DB time” without any focus on the real performance issues of the end users. This guidance is maybe a little bit easier for third-party companies or individuals like me, but in the end the result is what really matters 😉

      > Also, you can use (up to a certain degree) the ASH to find out what a session is doing without the need of a trace.

      Yes, for sure, if you already know that it is a single long-running SQL (like an OLAP query) and you do not have any asymmetrical behavior. But in most cases you just have the STAD / ST03n database request time, and when it comes to identifying the root cause of such “high” database request times in a whole business process, ASH is useless 🙂

      Just out of curiosity: Does SAP perform a (client) satisfaction survey after such TPO sessions?

      Regards

      Stefan

      1. Fidel Vales

        Stefan Koehler wrote:

        Just out of curiosity: Does SAP perform a (client) satisfaction survey after such TPO sessions?

        Regards

        Stefan


        Hi Stefan,

        Yes. There is a Service Feedback form on the Service Marketplace, where the customer downloads the service report.

        I (and probably more people) also provide the same as a Word/PDF to be filled in by the customer and sent by email/fax to SAP (not to the person who delivered the service) or given to the TQM.

        But I think that it is not filled frequently for whatever reason.

        Regards

        Fidel

  2. Aaron Liao

    Hi Stefan,

    Method R looks to me more like another service offering by SAP, Business Process Performance Optimization (BPPO), where customers and the tuner know exactly which transaction(s) / process(es) need to be tuned and which traces need to be captured and analyzed.

    Thanks & Regards,

    Aaron

    1. Stefan Koehler Post author

      Hi Aaron,

      thank you very much for your feedback.

      “Method R” was documented by Cary Millsap and Jeff Holt back in 2003 and describes a process to accomplish a specific goal based on response time – and response time (related to Oracle in this case) is all the end users care about. You are absolutely right that SAP BPPO seems to be based on that methodology in parts.

      However, a SAP BPPO may also fail at finding or identifying the root cause of an Oracle performance issue in a transaction / business process with the used tool sets and corresponding approaches (check the prerequisites for that service), but it is focused much better on what the end users really care about.

      Questioning Method R: An Interview with Cary Millsap:

      “With Method R, you know exactly whether or not a computer-executed business task has any time in it that’s been wasted. If there is, you either fix it or prove that fixing it isn’t worth the cost. If there isn’t any wasted time in there, you move on. That’s the key: either way you move on. You don’t linger in a non-productive state, because you know exactly where there’s opportunity for improvement and you know exactly where there’s not.”

      Regards

      Stefan

  3. Tony Fitzgerald

    Hi Stefan,

    Great blog.

    I think it’s also important to note that even if a TPO has no focus from an end-user perspective (a failing, I agree), the load on the database is mostly reduced and sometimes significantly. This can reduce the need to add more hardware and save the company large amounts of cash.

    Cheers,

    Tony

    1. Stefan Koehler Post author

      Hi Tony,

      thank you very much for your feedback. I totally agree with you, as I stated in the footnote of the introduction as well:

      “Pretty good examples for using AWR reports are comparing I/O response times after an I/O sub system replacement or reducing the CPU load on the database server”

      Checking wait events (e.g.) from a database perspective (and not focusing on client / end user response time) is absolutely reasonable for reducing the hardware load (e.g. CPU or I/O), achieving better hardware utilization and, in consequence, saving money on additional hardware.

      SAP TPO is absolutely reasonable, useful and valuable to clients when used with that focus.

      In my experience, however, most clients use this service for analyzing / solving performance issues instead, with all of its possible failures (“The primary goal of the CQC Technical Performance Optimization service is to identify performance problems caused by the configuration or operation of the database that affect the normal flow of business operations in your company.”).

      How can (Oracle) performance problems be identified without focusing on the end user response time for a given task / business process and considering the whole stack?

      Sometimes I need to push the focus / methodology (“Method R”) and spend some time explaining the approach, as it is not the conventional / expected way on the client site. I also like how Cary Millsap describes this – even though “Method R” is not that new anymore:

      “The method is unconventional

      The method is new and different, and it’s not what people are accustomed to seeing.


      In the meantime, some of your colleagues are going to require careful explanations about why you’re recommending a completely unconventional performance optimization method that doesn’t rely on Statspack or any of the several popular performance monitoring tools for which your company may have paid dearly. They may cite your use of an unconventional method as one of the reasons to reject your proposals.

      But my whole system is slow. I need to tune my whole system, not just one user. When are you going to come out with a method that helps me tune my whole system? We hear it everywhere we go. What if the whole system is slow? Practitioners often react nervously to a performance improvement method that restricts analysis to just one user action at a time. Especially if users perceive that the whole system is slow, there is often an overwhelming compulsion to begin an analysis with the collection of system-wide statistics. The fear is that if you restrict the scope of analysis to anything less than the entire system, you might miss something important.”

      Source: Cary Millsap / Jeff Holt – Book: Optimizing Oracle Performance

      You can get a free copy of chapter 1 here and read pages 25 to 30. The illustrations and explanations are pretty good for understanding the scope of “Method R”.

      Best Regards

      Stefan

      1. Tony Fitzgerald

        Hi Stefan,

        Thanks for your response and for the additional info. I completely agree with you in relation to  a “Method R” type of analysis to solve performance problems. I’m of the same opinion. I’ve often struggled to convince managers and colleagues that this approach is the most beneficial.

        Cheers,

        Tony

  4. Jose Manuel Salguero

    Hi Stefan,

    Many thanks for the blog.

    Just to say that I totally agree with the approach; sometimes we waste a lot of time just to reduce the general response time by 100 ms in order to fulfill some SLA, but for the end users this is no improvement at all.

    Furthermore, with this approach you are able to discover some design issues in programs which are hard to discover with the other method.

    Maybe that’s why it is sometimes difficult to get the list of “problematic” reports for end users from the application teams; it’s easier to say “the system is slow” than “we have problems with the program ZXXX”. 😉

    For this purpose it is useful to periodically send performance surveys to end users to get real feedback on the system behavior and to have a real communication channel between IT and Business.

    Cheers.

    1. Stefan Koehler Post author

      Hi Jose,

      thank you very much for your feedback. I really appreciate the participation.

      The most interesting questions to me are: why do we act like that? Why do we set up SLAs based on the average response time of a specific task type or transaction? Why do we generate an AWR or Statspack report first when somebody complains about the database performance?

      Most of the time I can answer these questions for myself with: “It is not our fault – we are/were told to do it this way.” No matter whether you attend an Oracle performance troubleshooting / monitoring seminar or the SAP workload analysis class, most of the time you are told to do it that way. In my early Oracle / SAP years I worked in the same manner, until this methodology failed several times and I discovered “Method R” by Cary Millsap.

      Another important point is that in most cases nobody thinks about performance measurement and handling from a business / end user point of view when implementing SAP itself or a new functionality. The most important factors are the functionality and the deadline (which is absolutely reasonable up to a certain point, of course). After the application has gone live, you look at the average response time (e.g. in ST03n) of your transaction or new functionality, hypothesize that a value <X> is good, and build SLAs on that, because the value is already calculated and it sounds reasonable to do it that way at first.

      > Maybe that’s why it is sometimes difficult to get the list of “problematic” reports for end users from the application teams …

      Sometimes I am engaged by clients just to “finger-point” at a department or person, to define who is responsible for a (database) performance issue. But in reality both parties (administrators and developers) are responsible for performance, because some problems can only be fixed with the help of the developers, and some problems cannot be identified by the developers, so they need the help of the administrators. It is up to both participants to collect and provide the data needed by each other to fix the issues.

      > For this purpose it is useful to periodically send performance surveys to end users to get real feedback on the system behavior

      Regular meetings (let’s say once a quarter or every half year) are pretty good for interacting with the application owners. Maybe there are some new critical business processes you need to take care of, or the response time requirements have changed, etc.

      However, for getting the “objective behavior” it is better to set up test cases that run the typical business process and measure the response time. You can also create trend analyses by running a business process and measuring its overall response time. I am not quite sure if SolMan can run and analyze such cases, but some clients just set up a wrapper for a business case, execute it (via RFC) and monitor / report that response time (in a third-party tool). This point is very individual and depends on the kind of process, of course.

      Regards

      Stefan

  5. Lars Breddemann

    Hey Stefan,

    well-written blog post! In fact, you could have created a little series out of this one! 😉

    I really like that you try to spread the idea of tuning what will help users instead of some technical numbers – this perspective is clearly not new, but it is so often overlooked or even ignored.

    Keep up the very good work!

    Lars

