Introduction

Every now and then you notice an Oracle performance issue with no obvious reason. In the past year, two of my clients hit the following issue without any previous warning. This blog focuses on an “advanced troubleshooting technique” for cases where all of the usual performance analysis methods have failed. It describes the performance issue, the root cause (in this particular case an Oracle bug) and the analysis itself using a drill-down approach. It also introduces several tools and scripts that I use on a regular basis.

“Drill-Down Approach”

1. Using the wait interface / additional SQL execution statistics

Sample or group the wait events for a particular SQL statement via V$SESSION or AWR. For further analysis I use additional SQL execution statistics, SQL tracing or, since Oracle 11g R2, Real Time SQL Monitoring. In many cases you can already provide a solution with this data, but what if the wait interface is not showing anything useful (e.g. CPU usage only)? With Real Time SQL Monitoring you can identify the execution plan step that causes the CPU usage, but not the “why”.
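A quick-and-dirty way of sampling the wait interface for one statement could look like this (the SQL_ID is a made-up placeholder — use the one of your problem statement):

```sql
-- Sample the current wait state of all sessions running a given SQL_ID
-- ('f2h6haupdauz6' is a hypothetical SQL_ID). Run this repeatedly to get
-- a rough wait event profile of the statement.
SELECT state, event, COUNT(*)
FROM   v$session
WHERE  sql_id = 'f2h6haupdauz6'
GROUP  BY state, event
ORDER  BY COUNT(*) DESC;
```

Keep in mind that when STATE is not 'WAITING', the session is (or last was) running on CPU — exactly the situation this blog is about.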

2. Using the session statistics / performance counters

Sample data with the simple but still awesome V$SESSTAT. This view reports what Oracle is doing – nothing about timing (unlike the wait interface), but it lets you track down the operations Oracle performs. At this point I want to introduce the first tool for sampling this particular data, called Snapper, by Tanel Poder. We will use this script later on for troubleshooting the previously mentioned Oracle bug.
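Snapper is just a SQL*Plus script, so a sampling run for a single session could look like this (SID 491 is a made-up example; the syntax follows the script's documented usage):

```sql
-- Download snapper.sql from Tanel Poder's site, then in SQL*Plus:
-- sample all available performance data (session statistics, waits, ASH)
-- of SID 491, for one interval of 5 seconds.
@snapper all 5 1 491
```

The output shows the per-interval deltas of the session statistics, which is exactly what we need to see what a CPU-bound session is actually doing.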

3. Using stack traces or system call / signal traces

If all of the previous efforts are still not enough and you still have no clue what is going on and why, one of the last options is to check internals such as call stacks. You should not end up at this point very often, but if things get really nasty, this is where you are 🙂

There are several tools for tracing and most of them are OS-platform dependent – just to mention a few:

    • DTrace for Solaris (unfortunately such a comprehensive dynamic tracing facility is only available in full function on Solaris, but we can use an Oracle database approach to get similar information on any OS platform – introducing the second tool, the “Oradebug short_stack Profiler”, also by Tanel Poder)
    • pstack / truss
    • strace
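For the OS-level tools, a typical sampling session could look roughly like this (12345 is a made-up OS process ID of an Oracle server process; these are sketches, so check the man pages on your platform):

```shell
# Dump the current user-space call stack of the process. Run it a few
# times - a function that appears in almost every sample is where the
# process spends its time.
pstack 12345

# Attach to the PID (-p) and collect a system call count/time summary
# (-c); detach with Ctrl+C to print the summary.
strace -c -p 12345

# The Solaris / AIX counterpart of strace:
truss -c -p 12345
```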

That is a lot of theory about approaches and how I usually work, so let's move on to a real-life example and see what it looks like and why this approach makes sense (to me).

The performance issue

Both clients called me because their SAP systems were nearly unusable. In both cases it was a “large” SAP system with thousands of concurrent users, and nearly all work processes were trying to perform a simple INSERT on the same table. This blog explains the approach and the issue using the table BSIS as an example, but it can occur on any other table as well. Each INSERT (of just one row) took roughly 60 to 80 seconds, and as a consequence all work processes were busy.

The SQL looks quite simple like this:

SQL> INSERT INTO "BSIS" VALUES(:A0 ,:A1 ,:A2 ,:A3 ,:A4 ,:A5 ,:A6 ,:A7 ,:A8 ,:A9 ,:A10 ,:A11 ,:A12 ,:A13 ,:A14 ,:A15 ,:A16 ,:A17 ,:A18
,:A19 ,:A20 ,:A21 ,:A22 ,:A23 ,:A24 ,:A25 ,:A26 ,:A27 ,:A28 ,:A29 ,:A30 ,:A31 ,:A32 ,:A33 ,:A34 ,:A35 ,:A36 ,:A37 ,:A38 ,:A39 ,:A40 ,
:A41 ,:A42 ,:A43 ,:A44 ,:A45 ,:A46 ,:A47 ,:A48 ,:A49 ,:A50 ,:A51 ,:A52 ,:A53 ,:A54 ,:A55 ,:A56 ,:A57 ,:A58 ,:A59 ,:A60 ,:A61 ,:A62 ,
:A63 ,:A64 ,:A65 ,:A66 ,:A67 ,:A68 ,:A69 ,:A70 ,:A71 ,:A72 ,:A73 ,:A74 ,:A75 ,:A76 ,:A77 ,:A78 ,:A79 ,:A80 ,:A81 );

The analysis

Let's follow the previously explained approach for troubleshooting this issue. Before we start with the analysis itself, here is a short summary of the corresponding database objects (we will need this later on).

[Screenshot: object summary of table BSIS and its indexes, including their object IDs]

1. Using the wait interface / additional SQL execution statistics

Unfortunately the wait interface (queried via V$SESSION) showed no wait state (such as “db file sequential read” or “enq: TX – index contention”) – the SQL statement was just running on CPU. From time to time you could see the wait event “latch: cache buffers chains”, which gave us a first hint as to why the CPU was burned up (by running through the buffer cache like crazy).

The execution plan of this SQL basically looks like this:

[Screenshot: execution plan of the INSERT statement]

In this case it makes no sense to run this SQL with additional execution statistics or to monitor it with Real Time SQL Monitoring, because the execution plan is too simple and we already know that it is running on CPU only.

2. Using the session statistics / performance counters

We need to dig into the session statistics to check what Oracle is doing while running on CPU, as the wait interface does not give us any more information. I ran the Snapper tool on a particular session that was performing the INSERT statement on BSIS.

[Screenshot: Snapper output – the session statistics are dominated by “consistent gets from cache (fastpath)”]

Our previous assumption was right – the CPU is burned up by running through the database cache only. The statistic “consistent gets from cache (fastpath)” is just a special way (by pinning buffers) of getting data from the buffer cache. The database reads a lot of data – just for a simple INSERT into a table with 2 indexes.

I wanted to be absolutely sure that these buffer gets correspond to the table and not to the indexes, so I flushed the buffer cache and enabled a short SQL trace via the PL/SQL procedure DBMS_MONITOR.SESSION_TRACE_ENABLE for one particular session.
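The flush and trace steps can be sketched like this (SID 491 and serial# 33 are made-up values for one particular session; run this as a privileged user):

```sql
-- Flush the buffer cache so that the buffer gets have to be re-read
-- from disk and show up as single block I/O waits in the trace file.
ALTER SYSTEM FLUSH BUFFER_CACHE;

-- Enable SQL trace (including wait events) for one particular session.
EXEC DBMS_MONITOR.SESSION_TRACE_ENABLE(session_id => 491, serial_num => 33, waits => TRUE, binds => FALSE);

-- ... let the session run the INSERT for a short time ...

-- Disable the trace again.
EXEC DBMS_MONITOR.SESSION_TRACE_DISABLE(session_id => 491, serial_num => 33);
```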

[Screenshot: SQL trace excerpt – thousands of single block I/O waits on obj# 437114]

This output is just an extract of the whole SQL trace, but it continues in the same way. The INSERT statement performed thousands of single block I/Os on object 437114 (which refers to the table BSIS and not to the indexes – check the object summary above) for no obvious reason.
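The object number from the trace file can be mapped to its segment like this (I check both ID columns, because depending on the trace line the number can be either the dictionary object ID or the data object ID):

```sql
-- Map the obj# from the SQL trace file to its segment name and type.
SELECT object_id, data_object_id, owner, object_name, object_type
FROM   dba_objects
WHERE  object_id = 437114
   OR  data_object_id = 437114;
```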

We now knew the root cause of the high CPU usage, but we still did not know why Oracle performed so many single block I/Os on the table for a simple INSERT statement. There is no logical reason for that in this context. So we needed to check the call stack for an indicator of this behavior.

3. Using stack traces or system call / signal traces

I did not have anything like DTrace for checking the call stack in real time, as the database was running on AIX. I used the manual approach of running oradebug several times and checking the call stack each time, but you could also use the previously mentioned sampling tool “Oradebug short_stack Profiler” for that. For those who are not familiar with oradebug: you attach to a process with oradebug and can then perform several actions on that process (in my case I attached to a process that was performing the INSERT, of course).
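A manual oradebug stack sampling session looks roughly like this (OS PID 12345 is a made-up example; run it as SYSDBA in SQL*Plus):

```sql
-- Attach to the Oracle server process by its OS process ID.
ORADEBUG SETOSPID 12345

-- Dump the current call stack of the attached process.
-- Repeat this several times - a function that shows up in nearly
-- every sample is where the process spends its CPU time.
ORADEBUG SHORT_STACK
ORADEBUG SHORT_STACK
ORADEBUG SHORT_STACK
```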

[Screenshot: oradebug short_stack output – the call stack repeatedly contains the function ktspscan_bmb]

That particular session always ran through the code path of the function “ktspscan_bmb”. “Scan” sounds like “read” and “bmb” sounds like “bitmap block” to me. The table (and indexes) were stored in an ASSM tablespace, which substantiated the suspicion. I will not dig into ASSM internals (check the references for more information), but put simply, the bitmap blocks contain the information for finding free space, which would also make sense in our INSERT example.

My suspicion was confirmed after checking My Oracle Support (MOS) and finding several known issues involving ASSM and the function ktspscan_bmb. To make a long story short – my clients hit Oracle Bug #13641076. Luckily the patch was already available in an SBP 😎 and the issue could be solved.

Summary

The wait interface and the session statistics are the “winners” in most cases, but if there is some really strange behavior with no obvious reason, you need to dig deeper. Oradebug is very useful for troubleshooting and for understanding how Oracle works internally (if the OS does not provide that possibility).

If you have any further questions, please feel free to ask or get in contact directly if you need assistance with troubleshooting Oracle database issues.


References
