Introduction

Last Saturday i got a phone call from a SCN member who found me due to my blogs and postings on SCN (thanks to SCN for the new client 🙂 ). He told me that his company faced a critical performance break-down with parallel queries, which were partially not executed in full parallelism anymore. The SAP ECC system / Oracle database was quite large (≈ 12 TB) and running on a powerful IBM Power platform with a lot of concurrent users. They implemented parallel execution (query) some time ago to fulfill the performance requirements of the business and it was running smoothly until this Saturday, when they had a SAP go-live with several new sites.

At first we talked about the usual suspects like adaptive multiuser algorithm or automatic degree of parallelism, but none of them were appropriate as PARALLEL_ADAPTIVE_MULTI_USER was set to FALSE and most of the other influencing parameters like PARALLEL_DEGREE_POLICY or PARALLEL_MIN_PERCENT were stick to their default values.

So basically said the developers (and DBAs) have full control of the degree of parallelism in such a configuration until all PX servers are used up. The client also provided me a parallel execution report, which clearly showed that only a few PX servers were currently in use, but the requested degree of parallelism was not allocated by some of these critical parallel queries.

The whole picture was kind of strange and we could summarize the situation like that:

  • Hardware is able to handle a huge amount of PX servers
  • Parameter settings are appropriate to allocate (free) PX server
  • Critical SQLs are using the parallel hint and the execution plans illustrate PX as well
  • Huge amount of PX servers are free, but the requested degree of parallelism is not allocated and used by the queries

Unfortunately we were not able to figure out the root cause immediately by phone and so i requested a PX trace to troubleshoot this issue furthermore.

I will change my usual blog structure as it is easier to follow the troubleshooting path without knowing the root cause in first place. In consequence the following demo is structured like that:

  • Show PX parameter settings
  • Execute SQL with parallel hint and show usage of PX servers
  • Troubleshooting PX execution
  • Show root cause and adapt solution
  • Execute SQL with parallel hint once again and show usage of PX servers

Demo of the PX issue

The following demo was run on an Oracle database (11.2.0.3.6) on OEL 6.4 (2.6.39-400.109.1.el6uek.x86_64) with 3 vCPUs. OEL 6.4 is running as a virtual machine on Oracle Virtual Box 4.2.12, but that does not matter right here at all.

All of the following content is executed with a normal Oracle user called “TESTUSER”.

Create the base table


SQL> create table TEST_TAB as select * from DBA_OBJECTS where rownum <= 60000;
SQL> exec DBMS_STATS.GATHER_TABLE_STATS(USER,'TEST_TAB');

























Show PX parameter settings


SQL> show parameter cpu_count
SQL> show parameter parallel_

























PX_Parameter.png

*** Footnote: A few more parameters are modified to run into the previously described issue, but they are omitted at this point for better illustration. The whole pre-steps are included in the demo sub-section called “Show root cause and adapt solution”.

Execute SQL with parallel hint and show usage of PX servers


SQL> select * from V$PX_PROCESS_SYSSTAT where STATISTIC like 'Servers%';

























PX_PROCESS_SYSSTAT_01.png


SQL> select * from V$PQ_SYSSTAT where STATISTIC like 'Servers%';

























PQ_SYSSTAT_01.png


SQL> select /*+ PARALLEL(20) */ count(*) from TEST_TAB;
























PX_Execution_Plan_01.png

The execution plan itself includes the parallel query steps and states that the (requested) degree of parallelism is 20 because of the hint.


SQL> select * from V$PQ_SESSTAT;
























PQ_SESSTAT_01.png


SQL> select * from V$PX_PROCESS_SYSSTAT where STATISTIC like 'Servers%';
























PX_PROCESS_SYSSTAT_02.png


SQL> select * from V$PQ_SYSSTAT where STATISTIC like 'Servers%';
























PQ_SYSSTAT_02.png


SQL> select NAME, VALUE from GV$SYSSTAT  where upper(NAME) like '%PARALLEL OPERATIONS%'
     or upper(NAME) like '%PARALLELIZED%' or upper(NAME) like '%PX%';
























SYSSTAT_01.png


SQL> select SQL_TEXT, PX_SERVERS_EXECUTIONS, EXECUTIONS,
(PX_SERVERS_EXECUTIONS/EXECUTIONS) AS AVG_PQ_EXEC
from V$SQL where SQL_ID = '8d9zmz93tbm66';
























SQL_01.png

What is wrong right here? As previously mentioned the execution plan states, that the degree of parallelism should be 20, but the SQL itself was only executed with a degree of parallelism of 5. The requested degree of parallelism was downgraded, even if no adaptive multiuser algorithm or automatic degree of parallelism is activated and 120 PX servers were still free, but we only requested a parallelism of 20. In consequence the SQL response time is not as good as it should be due to the “unexpected” downgrade. There seems to be something seriously wrong, but we do not have an explanation for that until yet – so let’s trace the PX execution and try to find the root cause for that behavior.

Troubleshooting PX execution


SQL> alter session set "_px_trace"=high,all;
SQL> select /*+ PARALLEL(20) */ count(*) from TEST_TAB;





















PX_Trace_01.png

The PX trace clearly states the root cause of that “unexpected” downgrade. The requested degree of parallelism was 20, but it was downgraded to 5 by the Oracle database resource manager.

Show root cause and adapt solution / additional preparation steps for reproduction of demo case


SQL> alter system set resource_manager_plan='PL_PQTEST' scope=memory sid='*';
SQL> exec DBMS_RESOURCE_MANAGER.CLEAR_PENDING_AREA();
SQL> exec DBMS_RESOURCE_MANAGER.CREATE_PENDING_AREA();
SQL> exec DBMS_RESOURCE_MANAGER.CREATE_CONSUMER_GROUP('CG_HPQTEST','High PQ load allowed');
SQL> exec DBMS_RESOURCE_MANAGER.CREATE_PLAN('PL_PQTEST','PQ Test Plan');
SQL> exec DBMS_RESOURCE_MANAGER.CREATE_PLAN_DIRECTIVE('PL_PQTEST','CG_HPQTEST','Directive - High PQ Load', PARALLEL_DEGREE_LIMIT_P1 => NULL);
SQL> exec DBMS_RESOURCE_MANAGER.CREATE_PLAN_DIRECTIVE('PL_PQTEST','OTHER_GROUPS','Directive - Standard PQ Load', PARALLEL_DEGREE_LIMIT_P1 => 5);
SQL> exec DBMS_RESOURCE_MANAGER.SET_CONSUMER_GROUP_MAPPING(DBMS_RESOURCE_MANAGER.CLIENT_ID, 'SAPUSER100', 'CG_HPQTEST');
SQL> exec DBMS_RESOURCE_MANAGER.VALIDATE_PENDING_AREA();
SQL> exec DBMS_RESOURCE_MANAGER.SUBMIT_PENDING_AREA();
SQL> exec DBMS_RESOURCE_MANAGER_PRIVS.GRANT_SWITCH_CONSUMER_GROUP('TESTUSER','CG_HPQTEST',FALSE);





















The original resource plan was more complex of course, but basically a resource plan like this (related to PX) was implemented for that particular SAP system. I stripped it down to an absolute minimum to keep it as simple as possible. The plan directive for the consumer group OTHER_GROUPS limits the degree of parallelism (of a single operation) to 5 in contrast to the unlimited degree of parallelism for the consumer group CG_HPQTEST. The consumer group mapping is implemented on session attribute level (client identifier), which basically means that in my case only the SAP user SAPUSER100 (not Oracle user!) is allowed to run a higher degree of parallelism (of a single operation) than 5. The session attribute client identifier is set by the SAP kernel. So let’s set the session attribute to SAPUSER100 and re-check it again.

Execute SQL with parallel hint once again and show usage of PX servers


SQL> select SID, PQ_STATUS, RESOURCE_CONSUMER_GROUP, CLIENT_IDENTIFIER
from V$SESSION
where SID = sys_context('USERENV','SID');




















SESSION_01.png


SQL> exec DBMS_SESSION.SET_IDENTIFIER('SAPUSER100');





















SQL> select SID, PQ_STATUS, RESOURCE_CONSUMER_GROUP, CLIENT_IDENTIFIER
from V$SESSION
  where SID = sys_context('USERENV','SID');




















SESSION_02.png

The session 174 is dynamically assigned to the consumer group CG_HPQTEST based on the session identifier attribute SAPUSER100. Now let’s re-run the parallel query and check the requested and allocated degree of parallelism. The statistics are reset as the instance was restarted to get a “clean state” once again.


SQL> select * from V$PX_PROCESS_SYSSTAT where STATISTIC like 'Servers%';




















PX_PROCESS_SYSSTAT_01.png


SQL> select * from V$PQ_SYSSTAT where STATISTIC like 'Servers%';




















PQ_SYSSTAT_01.png


SQL> select /*+ PARALLEL(20) */ count(*) from TEST_TAB;




















PX_Execution_Plan_01.png

The execution plan is exactly the same as before.


SQL> select * from V$PQ_SESSTAT;




















PQ_SESSTAT_02.png


SQL> select * from V$PX_PROCESS_SYSSTAT where STATISTIC like 'Servers%';




















PX_PROCESS_SYSSTAT_03.png


SQL> select * from V$PQ_SYSSTAT where STATISTIC like 'Servers%';




















PQ_SYSSTAT_03.png


SQL> select NAME, VALUE from GV$SYSSTAT  where upper(NAME) like '%PARALLEL OPERATIONS%'
     or upper(NAME) like '%PARALLELIZED%' or upper(NAME) like '%PX%';




















SYSSTAT_02.png


SQL> select SQL_TEXT, PX_SERVERS_EXECUTIONS, EXECUTIONS,
(PX_SERVERS_EXECUTIONS/EXECUTIONS) AS AVG_PQ_EXEC
from V$SQL where SQL_ID = '8d9zmz93tbm66';




















SQL_02.png


SQL> alter session set "_px_trace"=high,all;
SQL> select /*+ PARALLEL(20) */ count(*) from TEST_TAB;




















PX_Trace_02.png

Now we got exactly the resources, that we requested from the database. The SQL itself is executed in parallel with a degree of parallelism of 20 – nothing is downgraded.

Further basic explanations

Oracle documentation

Oracle Database Resource Manager (the Resource Manager) is an infrastructure that provides granular control of database resources allocated to users, applications, and services. The Resource Manager solves many resource allocation problems that an operating system does not manage well, including:

    • Excessive overhead
    • Inefficient scheduling
    • Inappropriate allocation of resources
    • Inability to manage database-specific resources

The Resource Manager helps overcome these problems by giving the database more control over allocation of hardware resources and enabling you to prioritize work within the database. You can classify sessions into groups based on session attributes, and then allocate resources to these groups to optimize hardware utilization. Resources are allocated to users according to a resource plan specified by the database administrator. The plan specifies how the resources are to be distributed among resource consumer groups, which are user sessions grouped by resource requirements. A resource plan directive associates a resource consumer group with a plan and specifies how resources are to be allocated to the group.

The Database Resource Manager provides the ability to prioritize work within the Oracle system. Users with higher priority jobs get resources in order to minimize response time for online work, for example, while users with lower priority jobs, such as batch jobs or reports, might encounter slower response times. This priority assignment enables more granular control over resources and provides features such as automatic consumer group switching, maximum active sessions control, query execution time estimations and undo pool quotas for consumer groups.

SAP

Oracle Database Resource Manager (in a SAP environment) is supported since SAP kernel => 7.10 and Oracle => 11.2.0.2 (=> SAP Bundle Patch April 2011). The SAP DBSL (SAP database interface) provides the necessary information (e.g. SAP username, application name) for the workload management and the Oracle Resource Manager.


Trace facility “_px_trace”

Since 9.0.1 a new mechanism has been introduced to obtain Parallel Execution (PX) trace information. This trace mechanism is keyword and parameter based. The existing event-based tracing is still available, but for some areas in _px_trace like compilation there is no event. Because each component of PX controls its own tracing, getting an event-based trace usually means setting one or more events and then deciding at what level they should be set; this can be cumbersome and there may not be any consistency of trace format across the different events. Keyword-based tracing is simpler to set than event-based tracing and is consistent across PX components.

There are up to three components that can be specified when starting the trace:

    • Area in which tracing is required
    • Verbosity
    • Do we want timing information

Example: alter session set “_px_trace”=[[Verbosity,]area],[[Verbosity,]area],..,[time];

The possible values for each of these is

    • Verbosity: high, medium, low
    • Area: scheduling (equivalent to some of event 10384 and some of 10390), execution (equivalent to some of event 10390), granule (equivalent to some of event 10390 and some of 10391), messaging (equivalent to event 10392 and event 10393), buffer (equivalent to event 10399) compilation ( no equivalent event), all (all of previous mentioned areas), none (none of previous mentioned areas)
    • Timing: time



Summary

Now you may wonder how this situation arrived so suddenly in that SAP system. The new (go-live) sites used several new SAP users, who executed that critical business processes in consequence. The Oracle resource manager mapping was implemented on SAP user (Oracle client identifier) level and all the new SAP users (from the go-live) were just allowed to run the parallel queries with a lower degree of parallelism as the mapping was not adjusted. The Oracle resource manager itself was introduced to protect the system from unexpected and unwanted load as only the performance critical business processes should be allowed to use such a high degree of parallelism. A mapping based on the module / application level may be the better (dynamic) solution in such cases as the critical business processes are pretty static, but the (SAP) users who should run them may change frequently.

Oracle itself provides great insights into parallel execution with the help of the PX trace facility and allows us to troubleshoot it on our own very deeply.

If you have any further questions – please feel free to ask or get in contact directly, if you need assistance by troubleshooting Oracle database (performance) issues.

References

To report this post you need to login first.

3 Comments

You must be Logged on to comment or reply to a post.

  1. Pedro Ruiz

    Hi Stefan,

     

    This is a great blog with a very detailed analysis regarding parallelism, I just have one comment, as per my understanding and the recommendations of SAP Note 1579946, it is not generally recommended the usage of the Oracle resource manager  due to the wait event “resmgr:cpu quantum”.

     

    Best regards,

    Pedro

    (0) 
    1. Stefan Koehler Post author

      Hi Pedro,

      thanks. This was a bug and is fixed since version 11.2.0.2 (which is already out of support for long time). In general “resmgr:cpu quantum” is something that you might want with Oracle resource manager as you want to “reserve” the CPU resources for some more important business processes, etc.

      Regards

      Stefan

      (0) 

Leave a Reply