[Oracle] Researching internal latch implementation (ksl_get_shared_latch, kslfre, kslgetsl_w) and crashing PMON
Introduction
Last week the DOAG 2014 conference took place in Nuremberg and it was a blast with a lot of useful presentations and especially great conversations and meet ups with Oracle friends. I had a nice talk about the Oracle latch implementation with a participant, who told me that his instance crashes every time, if he (manually) sets a shared latch in exclusive mode and tries to release it afterwards. It sounded really interesting as i have done this so many times without ever noticing such an issue. He also told me that this issue is reproducible at least on Oracle 11g R2 and 12c R1. I had no immediate answer or clue about the described issue and needed to research it furthermore. In consequence this blog post is the promised follow up to our talk, Markus 😉
The first place to go for a topic like latches is a guy called Andrey S. Nikolaev. He is well known for researching internals about latching – please check the reference section, if you have never heard of him yet. However there are several ways to acquire a latch and here is a relevant summary for our issue (Source “Presentation – Latch internals by Andrey S. Nikolaev”).
Latch acquisition in “no-wait” mode
- C function for exclusive latch get: kslgetl(laddress,wait,why,where)
- C function for shared latch get (< Oracle 11g): kslgetsl(laddress,wait,why,where,mode)
- C function for shared latch get (>= Oracle 11g): kslgetsl_w(laddress,wait,why,where,mode)
Where and why Oracle gets the latch
- Laddress: Address of latch in SGA
- Wait: Flag for no-wait (0) or wait (1) latch acquisition
- Where: Integer code for location from where the latch is acquired. Oracle lists possible latch “where” values in x$ksllw.indx
- Why: Context of why the latch is acquiring at this “where”. It contains DBA for CBC latches (Tanel Poder), SGA chunk address for shared pool latch, session address, etc.. “Why” meaning for some “where” can be found in x$ksllw.ksllwlbl
- Mode: Requesting state for shared latches (8 = Shared mode, 16 = Exclusive mode)
So far so good and that is exactly the same way i used it, whenever needed. The easiest way (for me) to research the described crashing issue is using the latching mechanism for the Oracle result cache. Shared latches for the result cache are acquired in exclusive or shared mode dependent on, if the cache result is already populated or not.
The main researching
The following demo was run on Solaris x86_64 11.2 (as VM on Virtual Box 4.2.12) and Oracle 12.1.0.2 (no CDB/PDB). The main reason for using Solaris is DTrace in this case.
Create test data
SQL> create table t as select * from dba_objects;
Oracle database settings, memory address and execution plan
SQL> show parameter result_cache_
SQL> select NAME, ADDR from v$latch where name = ‘Result Cache: RC Latch’;
SQL> select /*+ result_cache */ distinct(OWNER) from t;
DTracing count of C function calls for shadow process (PID 1528)
The following output represents the relevant C (latch) functions while accessing the populated result cache for the previous query. I stripped down the output to its minimum, but it is pretty interesting that you can not spot any of the previously described C functions like kslgetsl / kslgetsl_w for getting a shared latch.
shell> dtrace -n ‘pid$target:::entry { @num[probefunc] = count(); }’ -p 1528
…
ksl_get_shared_latch 9
kslfre 10
…
DTracing C functions ksl_get_shared_latch and kslfre with the first argument for shadow process (PID 1528)
The following output represents the relevant C (latch) functions while accessing the populated result cache for the previous query. I stripped down the output to its minimum again, but if you look closely you can see that the C function ksl_get_shared_latch is used to acquire a shared latch on memory address 0x600572B0, which fairly fits to memory address of the latch “Result Cache: RC Latch”.
shell> dtrace -n ‘pid$target::ksl_get_shared_latch:entry { printf(“%s(0x%X)\n”,probefunc,arg0); }’ -n ‘pid$target::kslfre:entry { printf(“%s(0x%X)\n”,probefunc,arg0); }’ -p 1528
dtrace: description ‘pid$target::ksl_get_shared_latch:entry ‘ matched 1 probe
dtrace: description ‘pid$target::kslfre:entry ‘ matched 1 probe
CPU ID FUNCTION:NAME
…
2 78440 ksl_get_shared_latch:entry ksl_get_shared_latch(0x600572B0)
2 78441 kslfre:entry kslfre(0x600572B0)
…
DTracing C function ksl_get_shared_latch with all arguments for shadow process (PID 1528)
The following output represents the relevant C (latch) function while accessing the populated result cache for the previous query. I stripped down the output to its minimum again. The argument description is still valid and you can see that the shared latch was acquired in shared mode (5th parameter = 8).
shell> dtrace -n ‘pid$target::ksl_get_shared_latch:entry { printf(“%s(0x%X,%d,%d,%d,%d)\n”,probefunc,arg0,arg1,arg2,arg3,arg4); }’ -p 1528
dtrace: description ‘pid$target::ksl_get_shared_latch:entry ‘ matched 1 probe
CPU ID FUNCTION:NAME
…
2 78440 ksl_get_shared_latch:entry ksl_get_shared_latch(0x600572B0,1,1,5358,8)
…
Verifying C functions of Oracle binary to crosscheck their existence (e.g. of kslgetsl_w)
Everything that is mentioned in 2012 by Andrey S. Nikolaev still exists so far.
shell> nm $ORACLE_HOME/bin/oracle
/oracle/rdbms/12102/bin/oracle:
[Index] Value Size Type Bind Other Shndx Name
…
[277345] | 32219600| 3111|FUNC |GLOB |0 |16 |kslfre
[274307] | 52558896| 85|FUNC |GLOB |0 |16 |kslgetsl_w
[319317] | 32287984| 1993|FUNC |GLOB |0 |16 |ksl_get_shared_latch
…
As a next step let’s verify the functionality once again on Oracle 12c R1 (with C functions kslgetsl_w / kslfre) and then try the “official” implemented Oracle approach with C function ksl_get_shared_latch.
Acquire and testing the latch (manually) with oradebug
SQL> select NAME, ADDR, GETS from v$latch where name = ‘Result Cache: RC Latch’;
SQL> oradebug setmypid
SQL> oradebug call kslgetsl_w 0x600572B0 1 1 1 8
Function returned 1
SQL> oradebug call kslfre 0x600572B0
Function returned 0
SQL> select NAME, ADDR, GETS from v$latch where name = ‘Result Cache: RC Latch’;
Everything works like a charm with the “old” approach by using kslgetsl_w to acquire a shared latch in shared mode. However he told me that the crashing issue happens only by acquiring a shared latch in exclusive mode, so let’s try this with the old approach as well.
SQL> oradebug call kslgetsl_w 0x600572B0 1 1 1 16
Function returned 1
SQL> select /*+ result_cache */ distinct(OWNER) from t;
*** Query hangs ***
SQL> oradebug call kslfre 0x600572B0
Function returned 0
*** Query returns with result ***
It works like expected and the Oracle instance (PMON) did not crash. Strange indeed as Markus told me that this happens every time when a shared latch is acquired by oradebug in exclusive mode. He was absolutely sure about this and so let’s try the “official” implemented Oracle approach with C function ksl_get_shared_latch.
SQL> select NAME, ADDR, GETS from v$latch where name = ‘Result Cache: RC Latch’;
SQL> oradebug call ksl_get_shared_latch 0x600572B0 1 1 1 16
Function returned 1
SQL> select NAME, ADDR, GETS from v$latch where name = ‘Result Cache: RC Latch’;
SQL> select /*+ result_cache */ distinct(OWNER) from t;
*** Query hangs ***
It still works up to this point and Oracle behaves as it should be. So let’s release the shared latch (in exclusive mode) with kslfre.
SQL> oradebug call kslfre 0x600572B0
At this point my SQL*Plus session was frozen and the Oracle instance crashed a few seconds later. The following output is from the corresponding alert log file.
…
Starting background process PMON
PMON started with pid=2, OS id=1277
…
ORA-00600: internal error code, arguments: [533], [0x0600572B0], [352928640], [16], [], [], [], [], [], [], [], []
ORA-00600: internal error code, arguments: [kslrwfl_1], [16], [], [], [], [], [], [], [], [], [], []
ORA-00600: internal error code, arguments: [533], [0x0600572B0], [352928640], [16], [], [], [], [], [], [], [], []
ORA-00600: internal error code, arguments: [kslrwfl_lv_3], [], [], [], [], [], [], [], [], [], [], []
ORA-00600: internal error code, arguments: [kslrwfl_lv_3], [], [], [], [], [], [], [], [], [], [], []
USER (ospid: 1277): terminating the instance due to error 472
…
Jackpot! It crashes and there obviously seems to be something different when using the “official” Oracle C function flow approach with oradebug.
DTracing C function ksl_get_shared_latch with all arguments for shadow process (PID 2665)
I assumed that the arguments may changed with function ksl_get_shared_latch (for exclusive mode). Let’s verify this with an empty result cache and DTracing the population, when a shared latch is acquired in exclusive mode.
SQL> select NAME, ADDR from v$latch where name = ‘Result Cache: RC Latch’;
*** Empty result cache ***
SQL> select /*+ result_cache */ distinct(OWNER) from t;
shell> dtrace -n ‘pid$target::ksl_get_shared_latch:entry { printf(“%s(0x%X,%d,%d,%d,%d)”,probefunc,arg0,arg1,arg2,arg3,arg4); }’ -n ‘pid$target::kslfre:entry { printf(“%s(0x%X)\n”,probefunc,arg0); }’ -p 2665
CPU ID FUNCTION:NAME
…
2 78466 ksl_get_shared_latch:entry ksl_get_shared_latch(0x600572B0,1,1,5347,16)
2 78467 kslfre:entry kslfre(0x600572B0)
…
Nothing special here. It still looks like the same as by acquiring a shared latch in shared mode (except the difference for the 5th argument of course). It is reproducible at any time when using ksl_get_shared_latch in exclusive mode.
I found Oracle bug #12538779 (“ORA-00600: INTERNAL ERROR CODE, ARGUMENTS: [KSLRWFL_1], PMON DIES”) and reference bug #4520815 (“IN-FLUX DETERMINATION FOR SHARED LATCHES HELD IN EXCLUSIVE MODE DOES NOT WORK”) after searching for a while. Unfortunately there is not much information about it and reference bug #4520815 should already be fixed by patchset 10.2.0.2. However remember that we directly invoke a C function out of the context with oradebug and the bug description sounds like some kind of our issue.
I also found an additional information in presentation by Andrey S. Nikolaev that may corresponds to this issue / error.
“11g – ksl_get_shared_latch (…) / Use 32-bit ports! 64bit Oracle corrupts the arguments of oradebug call in 10.2.0.4 and above!”
Further tests
I also researched the behavior for Oracle 11.2.0.3.6 and 12.1.0.1 on Linux, but the Oracle implementation and result was always the same. The Oracle instance (PMON) crashed as soon as i tried to release the shared latch (in exclusive mode) with C function kslfre, if it was previously acquired by C function ksl_get_shared_latch. Shared latches in shared mode always work like a charm, no matter which function you invoke by oradebug.
I also did some researching for Oracle 12.1.0.2 on Linux with help of GDB. Here are the command references, if you are interested into doing this as well.
shell> gdb –pid=2439
(gdb) break ksl_get_shared_latch
(gdb) info break
Num Type Disp Enb Address What
1 breakpoint keep y 0x000000000b3e3890 <ksl_get_shared_latch>
(gdb) c
Breakpoint 1, 0x000000000b3e3890 in ksl_get_shared_latch ()
(gdb) backtrace
#0 0x000000000b3e3890 in ksl_get_shared_latch ()
#1 0x000000000b408103 in ksupucg ()
#2 0x000000000b5bc205 in opiodr ()
#3 0x000000000b7be6d3 in ttcpip ()
#4 0x0000000001e452f4 in opitsk ()
#5 0x0000000001e49c28 in opiino ()
#6 0x000000000b5bc3c2 in opiodr ()
#7 0x0000000001e4118a in opidrv ()
#8 0x00000000025381f8 in sou2o ()
#9 0x0000000000b393a7 in opimai_real ()
#10 0x0000000002542898 in ssthrdmain ()
#11 0x0000000000b392de in main ()
(gdb) info frame 0
Stack frame at 0x7fff3e9b45d0:
rip = 0xb3e3890 in ksl_get_shared_latch; saved rip 0xb408103
called by frame at 0x7fff3e9b4630
Arglist at 0x7fff3e9b45c0, args:
Locals at 0x7fff3e9b45c0, Previous frame’s sp is 0x7fff3e9b45d0
Saved registers:
rip at 0x7fff3e9b45c8
Summary
Use the “old” C functions like kslgetsl_w whenever possible, if you need to manually acquire a shared latch by invoking functions via oradebug. The bottom line that i was remembered while researching this issue: “It can crash your whole instance, even if you think that you are using a save functionality“. Take care of using something like this in production, if you have not tested it very carefully.
If you have any further questions – please feel free to ask or get in contact directly, if you need assistance by troubleshooting Oracle database (performance) issues.
References
- Homepage Andrey S. Nikolaev
- Oracle Bug #4520815 – IN-FLUX DETERMINATION FOR SHARED LATCHES HELD IN EXCLUSIVE MODE DOES NOT WORK
- Oracle Bug #12538779 – ORA-00600: INTERNAL ERROR CODE, ARGUMENTS: [KSLRWFL_1], PMON DIES
- Presentation – Latch internals by Andrey S. Nikolaev
- Result Cache Latch in 11GR2: Shared Mode Gets by Alex Fatkulin
I had a short chat with Andrey Nikolaev via mail these days and he got a great explanation for the mentioned behavior - so here is probably the reason for the instance crashing behavior. I am just quoting his explanation from mail and then provide my DTrace proof.
Mail snippet from Andrey Nikolaev
----------
The procstack on AIX indicates that the ksl_get_shared_latch() actually has 6 arguments. The DTrace on Solaris demonstrates that the first five arguments are laddr,wait,why,where,mode, and the last one is the new_value for the latch.
I guess that Oracle internally has some C macro that checks the current latch value, if mode=8 then adds 1 to it, or, if the mode=16 then calculates the new latch value: 20000000 00000000 | <PID>. After that Oracle calls the ksl_get_shared_latch() with six arguments to set this value.
I suppose that the kslgetsl_w() function with 5 arguments, which I used for demos, is the wrapper function for this macro.
----------
DTrace proof
----------
SYS@S12DB:10> select NAME, ADDR, GETS from v$latch where name = 'Result Cache: RC Latch';
NAME ADDR GETS
---------------------------------------------------------------- ---------------- ----------
Result Cache: RC Latch 00000000600572B0 0
TEST@S12DB:180> select /*+ result_cache */ distinct(OWNER) from t;
-- DTrace of populating the result cache
root@SOL:~# dtrace -n 'pid$target::ksl_get_shared_latch:entry { printf("%s(0x%X,%d,%d,%d,%d,%d)\n",probefunc,arg0,arg1,arg2,arg3,arg4,arg5); }' -p 1231
dtrace: description 'pid$target::ksl_get_shared_latch:entry ' matched 2 probes
CPU ID FUNCTION:NAME
0 78476 ksl_get_shared_latch:entry ksl_get_shared_latch(0x600572B0,1,1,5347,16,2305843009213694019)
0 78477 ksl_get_shared_latch:entry ksl_get_shared_latch(0x600572B0,1,1,5347,16,2305843009213694019)
0 78476 ksl_get_shared_latch:entry ksl_get_shared_latch(0x600572B0,1,1,5348,16,2305843009213694019)
0 78477 ksl_get_shared_latch:entry ksl_get_shared_latch(0x600572B0,1,1,5348,16,2305843009213694019)
0 78476 ksl_get_shared_latch:entry ksl_get_shared_latch(0x600572B0,1,1,5358,8,1)
0 78477 ksl_get_shared_latch:entry ksl_get_shared_latch(0x600572B0,1,1,5358,8,1)
0 78476 ksl_get_shared_latch:entry ksl_get_shared_latch(0x600572B0,1,1,5347,16,2305843009213694019)
0 78477 ksl_get_shared_latch:entry ksl_get_shared_latch(0x600572B0,1,1,5347,16,2305843009213694019)
0 78476 ksl_get_shared_latch:entry ksl_get_shared_latch(0x600572B0,1,1,5358,16,2305843009213694019)
0 78477 ksl_get_shared_latch:entry ksl_get_shared_latch(0x600572B0,1,1,5358,16,2305843009213694019)
0 78476 ksl_get_shared_latch:entry ksl_get_shared_latch(0x600572B0,1,1,5374,16,2305843009213694019)
0 78477 ksl_get_shared_latch:entry ksl_get_shared_latch(0x600572B0,1,1,5374,16,2305843009213694019)
-- DTrace of already populated result cache
root@SOL:~# dtrace -n 'pid$target::ksl_get_shared_latch:entry { printf("%s(0x%X,%d,%d,%d,%d,%d)\n",probefunc,arg0,arg1,arg2,arg3,arg4,arg5); }' -p 1231
dtrace: description 'pid$target::ksl_get_shared_latch:entry ' matched 1 probe
CPU ID FUNCTION:NAME
2 78476 ksl_get_shared_latch:entry ksl_get_shared_latch(0x600572B0,1,1,5358,8,1)
SQL> oradebug setmypid
Statement processed.
SQL> oradebug call ksl_get_shared_latch 0x600572B0 1 1 1 16 2305843009213694019
Function returned 1
-- No crash of instance by releasing latch
SQL> oradebug call kslfre 0x600572B0
Function returned 0
----------