System debugging and analysis techniques for ASE o...

Former Member · ‎06-27-2014

Stack tracing

A. When to use this tool

It can often be very useful to look at the function call stacks of ASE while it is running. I will cover a couple of specific use cases and give some examples of what to look at. In general, the two symptoms where this technique will be most useful are when ASE appears to be hung, and when ASE CPU is at or very near to 100% busy. When ASE encounters some internal error condition, such as a timeslice or “infected with 11” it will write a stack trace to the ASE errorlog. However, if the server is hung at some point or very busy there may not be an errorlog stacktrace to look at. In those cases, using an OS tool to generate a stack trace can be quite useful.

B. What to look for

A stack trace consists of a list of functions and address, with the most recent calls being at the top. Although an exact analysis of a particular stack trace may require the product internal knowledge that SAP Product Support can provide; we can often get a good idea of what is going on by looking at the function names. From those names we can determine of whether we are looking at some error condition or a resource contention/shortage situation.

When a pstack or procstack output has multiple threads, it can be difficult to know which thread you should be looking at. Here is a quick guide: If you are running an ASE version prior to 15.7, or in 15.7 and higher using process mode; there will be one dataserver process id running on the system per engine. The stack output for that process will show several threads, but the one you really care about will be thread #1. Note that depending on which tool you use that thread may be at the top or bottom of the output. If you are running a server in threaded mode there will only be one dataserver process, and the pstcak or procstack output will show many threads, including one for each engine. We’ll look at an example below and show how you may be able to identify which threads belong to engines.

C. Specific tools

The name of stack trace tool will vary by platform. On AIX it is called “procstack”, and on Linux, HPUX, and Solaris it is called “pstack”. The syntax for all is the same, though; it is just the command name followed by a dataserver process id (pid). Note that if you are not running in threaded mode each engine will be a separate Unix process with a unique pid, so you would need to run the stack command for each one.

D. Examples

Here is an example of a stack trace taken when a server appeared to be hung (i.e. unresponsive to any logins or commands). This particular example is using pstack on a Linux system from ASE 15.0.3:

pstack 28249

Thread 7 (Thread 0x7f114bdfb700 (LWP 28250)):

#0 0x0000003736633b57 in sigwaitinfo () from /lib64/libc.so.6

#1 0x0000000000d35c78 in kisigwait ()

#2 0x0000003736a079d1 in start_thread () from /lib64/libpthread.so.0

#3 0x00000037366e8b6d in clone () from /lib64/libc.so.6

Thread 6 (Thread 0x7f114b3e7700 (LWP 28251)):

#0 0x0000003736a0e75d in read () from /lib64/libpthread.so.0

#1 0x0000000000f1e46b in comn__sig_catcher_thread ()

#2 0x0000003736a079d1 in start_thread () from /lib64/libpthread.so.0

#3 0x00000037366e8b6d in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x7f114a9e6700 (LWP 28252)):

#0 0x0000003736a0b5bc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0

#1 0x0000000000f1d7ec in comn_wait_cond ()

#2 0x0000000000ee75cc in rman__thread ()

#3 0x0000003736a079d1 in start_thread () from /lib64/libpthread.so.0

#4 0x00000037366e8b6d in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7f1149fe5700 (LWP 28253)):

#0 0x0000003736a0b5bc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0

#1 0x0000000000f1d7ec in comn_wait_cond ()

#2 0x0000000000ee75cc in rman__thread ()

#3 0x0000003736a079d1 in start_thread () from /lib64/libpthread.so.0

#4 0x00000037366e8b6d in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7f11495e4700 (LWP 28254)):

#0 0x0000003736a0b5bc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0

#1 0x0000000000f1d7ec in comn_wait_cond ()

#2 0x0000000000ee75cc in rman__thread ()

#3 0x0000003736a079d1 in start_thread () from /lib64/libpthread.so.0

#4 0x00000037366e8b6d in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7f1148be3700 (LWP 28255)):

#0 0x00000037366df343 in poll () from /lib64/libc.so.6

#1 0x0000000000eebfce in sybnet__do_poll ()

#2 0x0000000000eeb6ce in sybnet__async_poll ()

#3 0x0000003736a079d1 in start_thread () from /lib64/libpthread.so.0

#4 0x00000037366e8b6d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7f11522ad720 (LWP 28249)):

#0 0x0000003736a0e6fd in write () from /lib64/libpthread.so.0

#1 0x0000000000d33570 in b_cwrite_console ()

#2 0x0000000000d334c7 in b_cerrlog_uio ()

#3 0x0000000000d32a32 in ucierrlog_only ()

#4 0x0000000000c7d401 in ex_printmsg ()

#5 0x0000000000c7cd03 in ex__doprint ()

#6 0x0000000000c96244 in _ex_cleanup ()

#7 0x0000000000c7c603 in ex_print ()

#8 0x0000000000c7fbe6 in ex_callprint ()

As mentioned above, we want to look at Thread #1. We can see here that it was waiting on a write() system call to return. Reading down a little bit we can deduce that it was writing to either the errorlog or the console. If we look up b_cerrlog_uio in either Sybase “Solved Cases” or SAP’s Support Portal we’ll find that this stack occurs when ASE is trying to write to the terminal that booted ASE, but that terminal is blocked and not receiving output.

2. Here is a partial example of stacktrace taken from a server running in threaded mode that is showing 100% busy. This is procstack output from a server running on AIX:

...

---------- tid# 24052445 (pthread ID: 10795) ----------

0x00000001017b2dc4 aix_get_lock

0x00000001017777ac ptn_unkeep

0x00000001017776a8 ptn_nextcopy

0x0000000101777550 srch_get_idxptn

0x00000001017d0594 dol_insert.fdpr.chunk.10

0x00000001017ced98 ins_insert.fdpr.chunk.9

0x0000000101805438 insert

0x000000010194b100 crt_cols.fdpr.chunk.23

---------- tid# 31392003 (pthread ID: 10538) ----------

0x0900000000a9cea4 _event_sleep

0x0900000000a9daf8 _event_wait

0x0900000000aab724 _cond_wait_local

0x0900000000aabe7c _cond_wait

0x0900000000aac73c pthread_cond_timedwait

0x000000010008e530 symbol_fixer.2028.fdpr.chunk.5

0x000000010064ed10 ktMuxThreadSleep

0x000000010064e714 kt__MuxMain

0x000000010064eff0 symbol_fixer.140.fdpr.chunk.4

0x0000000101a3f02c thread_create_handler

0x0900000000a8ad50 _pthread_body

---------- tid# 30737349 (pthread ID: 10281) ----------

0x000000010001d820 aix_get_lock.fdpr.chunk.1

0x000000010179e248 dol_bt_getnext.fdpr.chunk.10

0x00000001017c49c4 getnext

0x00000001018c85d8 attrib_nextrow

0x00000001018c6cd8 cm_entity_alone_p

0x0000000101191e94 q_set_cache_strategy

0x0000000101192e6c q_score_nci_scan

0x000000010118d864 q_score_useful_ixscan

0x000000010118bb2c q_select_best_index

0x00000001005b09e0 q_compute_joincls_cost

0x00000001005b0784 cost_joins

0x00000001014f99c8 decision.fdpr.chunk.2

0x0000000101573600 decision_all

0x00000001014fc004 optimize_cmd.fdpr.chunk.1

0x000000010157329c s_optimizecmds

0x00000001014f95c4 decision_subq

0x00000001014f92d0 decision_all.fdpr.chunk.2

0x000000010157352c optimize_cmd

0x000000010157329c s_optimizecmds

0x00000001014fde00 s_ap_compat_gen

0x00000001017ae5f4 LeProcessExecStats(ExeCtxt&)

0x000000010030b5a8 ApPlan::ApGreedyApply()

0x00000001008747dc OptGlobal::_GblSearchEngine.fdpr.chunk.2(int*)

0x00000001016ccb64 OptGlobal::GblOptimize(int*)

0x0000000101823e30 Optimize

0x000000010182191c s_compile_stmt.fdpr.chunk.20

0x00000001018254dc s_compile.fdpr.chunk.14

0x000000010151ecfc s_recompile

0x000000010151e9d8 sequencer.fdpr.chunk.12

Looking at just the first and third threads we see that the top function is ibm_get_lock().

(Note that on AIX function names may have the suffix “fdpr.chunk.xx”. That is added by the feedback optimizer, and should be removed when looking at the function name. ) If we go again to Solved Cases and look up ibm_get_lock we can find that it is the name of the spinlock routine on AIX. So, we may have a spinlock contention issue on this server that we could diagnose further by querying monSpinlockActivity and running sp_sysmon.

The second stack (pthread ID: 10538) is an example of an idle engine thread. Note the name “kt_MuxMain” – that is the clue that it is an engine (multiplexed) thread. If you see kt__RTCMain (RTC standing for "Run to Completion" on a stack that indicates that it is not an engine thread (and probably the stack will not be of interest).

One factor to keep into account, when looking at stack traces, is that each stack is just the picture of what was going on at the instant the stack command was run. That means that you may not be able to conclude much from just a single stack trace, since you could have caught the dataserver process in an unusual temporary state. I would recommend getting at least 4 or 5 pstack/procstack outputs before attempting to do any analysis.

E. Conclusion

Looking at stack output from a running ASE server can help identify what internal ASE routines are involved when you are having a problem. While you might need to contact SAP support or search in the online support pages to figure out what some of the routines do; often the names can give a good idea of what is going on. That can allow you to take actions to resolve the problem at hand.

System debugging and analysis techniques for ASE on Unix/Linux - Part 3

Get Your SAP HANA Idea Incubator Badge Today!

SCN Mission - SAP HANA Quiz Challenge is now retired

Share your #HANAStory and Win