Skip to Content
Technical Articles

HANA – analyzing runtime dumps

The description runtime dump means not only when you trigger it manually or when a system crashes. There are multiple flavors which can be viewed in note 2400007.

In this blog I will explain with 2 examples (OOM dump and indexserver crash) which tools you can use to analyze them and what you have to watch out for.

  1. High level: creation of a runtime dump
    1. manual trigger
    2. Other configurable triggers
  2. Tools for an analysis (OOM example)
    1. HANA Dump Viewer
    2. HANA Dump Analyzer
    3. Old school way
  3. Indexserver Crash example

High level: creation of a runtime dump

– for details check How to generate a runtime dump

There are different kinds of runtime dumps:

Classic *.rtedump.<timestamp>.trc
Memory *.rtedump.<timestamp>.oom.trc
Index handle waits *.indexhandle_blocked.trc
Slave indexserver start / stop: * longrunning_start_stop.trc
Long running transaction commit / rollback / cancellation *.longrunning_transaction.trc
Page corruption *.rtedump.<timestamp>.page.trc
Savepoint *.savepoint_blocked.trc
Log shipping timeout *.secondaryReplayTimeout.trc
Table replication turned off *.table_replication_turn_off.trc

They are all created inside the trace directory of the systemdb (alias cdtrace) or tenant (cdtrace; cd DB_<tenant_SID>).

I think the most famous are the first two once. Every time a service is crashing or running out-of-memory (oom) such a runtime dump is written for further analysis. But if the system is in a tricky situation and you are finding the root cause or a solution not point in time, you will save the status of the threads and the system views for further analysis or another expert. Therefor you can trigger a dump manually:

  1. via command line: Python script
  2. via command line: hdbcons
  3. via SQL command
  4. hana cockpit: Collect and Download Diagnosis Information

Manual trigger

  1. Python script
#As sidadm:
cdpy
python fullSystemInfoDump.py
  1. hdbcons
#For Indexserver
hdbcons [-p <pid>] “runtimedump dump”

#for Nameserver:
hdbcons [-p <pid>] -e 'hdbnameserver' “runtimedump dump”
  1. SQL

With SPS4 and upwards:

ALTER SYSTEM CREATE RUNTIMEDUMP [AT '<host>:<port>'] [SECTIONS <sections> | PROFILE <profile>] [INTO FILE <file>]

 

Other configurable triggers

If you have trouble with long running savepoints (check my blog for details):

indexserver.ini -> [persistence] -> runtimedump_for_blocked_savepoint_timeout = <x> (unit seconds) 

 

If you want to display the values of the bind variables use:

indexserver.ini -> [sql] -> display_sql_statement_parameters_in_dump = true

 

Official sources:
2400007 – FAQ: SAP HANA Runtime Dumps
2498739 – How-To: Analyzing Runtime Dumps with SAP HANA Dump Analyzer
2491748 – How-To: Analyzing Runtime Dumps with SAP HANADumpViewer
Blog: Analyzing SAP HANA Runtime Dumps with SAP HANA dump analyzer
Blog: Changes for SAP HANA dump analyzer


Tools for an analysis (OOM example)

There are currently 2 official tools you can use:

  • SAP HANA Dump Viewer
  • SAP HANA Dump Analyzer

At first, we will throw some light on the difference of SAP HANA Dump Analyzer and SAP HANA Dump Viewer.

The HANA Dump Viewer is a python tool which is designed to be on the HANA server itself. Python is part of every HANA installation – nothing to install here besides the script.

The Dump Analyzer is a java tool which should be placed on your client. It is java so it is working on windows, MAC or Linux.

The HANA Dump Analyzer is the newer tool and has all features of the Viewer as GUI.

If you like the quick and dirty way and you may not have rights to run java then you should use the HANA Dump Viewer. In all other cases just use the Dump Analyzer because there are features like flame graphs, memory leak detection, auto analyzer etc. which will make your life easier if you are not familiar with runtime dumps.

 

SAP HANA Dump Viewer (python)

#change to the trace directory (systemDB / nameserver) as sidadm:
cdtrace
ls -lrt *rte*
sidadm@hostname:/usr/sap/SID/HDB00/vhostname/trace> ls -lrt *rte*
-rw-r--r-- 1 sidadm sapsys 1940488 Dec 20 02:40 nameserver_vhostname.30001.rtedump.20191220-024019.028599.oom.trc
-rw-r--r-- 1 sidadm sapsys 1931891 Dec 21 14:19 nameserver_vhostname.30001.rtedump.20191221-141934.028599.oom.trc
-rw-r--r-- 1 sidadm sapsys 1907141 Dec 22 14:26 nameserver_vhostname.30001.rtedump.20191222-142630.028599.oom.trc
#change the directory to the tenant
cd DB_*
ls -lrt *rte*
sidadm@hostname:/usr/sap/SID/HDB00/vhostname/trace/DB_TenaneSID> ls -lrt *rte*
-rw-r--r-- 1 sidadm sapsys 2431267 Dec 19 02:40 indexserver_vhostname.30040.rtedump.20191219-024019.065384.oom.trc
-rw-r--r-- 1 sidadm sapsys 2020932 Dec 20 02:40 indexserver_vhostname.30040.rtedump.20191220-024019.065384.oom.trc
-rw-r--r-- 1 sidadm sapsys 2134609 Dec 21 14:19 indexserver_vhostname.30040.rtedump.20191221-141934.065384.oom.trc
-rw-r--r-- 1 sidadm sapsys 2099624 Dec 22 14:26 indexserver_vhostname.30040.rtedump.20191222-142604.065384.oom.trc
-rw-r--r-- 1 sidadm sapsys 1883623 Dec 25 21:57 indexserver_vhostname.30040.rtedump.20191225-215711.065384.oom.trc

Usage:

  1. Download the script from github and read the pdf
  2. Create a file in a /tmp directory or for resident usage into the python path of the HANA DB (cdpy – check the correct rights for this)
  3. Run the script: python <scriptname.py> <options>

Example as sidadm

cdtrace; cd DB_*
python /usr/sap/SID/HDB00/exe/python_support/hanadumpviewer.py -df indexserver_hostname.30040.rtedump.20191225-215711.065384.oom.trc
File /tmp/hanadumpviewer_output/indexserver_hostname.30040.rtedump.20191225-215711.065384.oom.trc.dot was created

Such a *.dot file can be visualized with webgraphviz. Just paste the content of the *.dot file on the website.

 

In every dump there are also the extraction of the system views. You can copy them via editor and copy them to excel or you can do it the easy way with the HANADump viewer option “-mv”. This is valuable when the retention time of the systems views are over and the data were deleted.

 

In this mode the graph separates the normal and the exception threads:
Normal threads are shown in cyan colored boxes and have a thread id + thread type.

python /usr/sap/SID/HDB00/exe/python_support/hanadumpviewer.py -df indexserver_hostname.30040.rtedump.20191225-215711.065384.oom.trc -pt true
File /tmp/hanadumpviewer_output/indexserver_hostname.30040.rtedump.20191225-215711.065384.oom.trc.dot was created

Exception threads are shown in orange colored boxes, their id is simply their order number in the .trc file and they have a reason for failure:

Normally there should be an exception thread and the identification is easy. But in our case there is no orange colored box:

So, one way is to go the old school (expert with experience) manual way:

Look into the dump via an editor 😉

Table of contents of an oom dump:
[BUILD]  Build information
[STACK_SHORT]  Short call stacks and pending exceptions of all threads
[PROCESS_INFO]  Process Info
[THREADS]  Running threads
[OS_MEMORY]  Operating system information about memory
[MEMORY_OOM]  Information about current out of memory situation

 

hdbcons “runtimedump l” for an overview of all sections

Have a look into the following areas (search for the terms) to get a quick overview:
[OS_MEMORY]
[MEMORY_OOM]

For details about the meminfo and the linux memory management check out my latest blog.

[OS_MEMORY]  Operating system information about memory: (2019-12-25 21:57:11 927 Local)
Maximum size of the process's virtual memory: soft limit=18446744073709551615, hard limit=18446744073709551615

/proc/meminfo:
MemTotal:       209519808 kB
MemFree:        44673280 kB
MemAvailable:   48709312 kB
Buffers:           19584 kB
Cached:         21846592 kB
SwapCached:        36992 kB
Active:         141622400 kB
Inactive:       17604096 kB
Active(anon):   137681920 kB
Inactive(anon): 16901312 kB
Active(file):    3940480 kB
Inactive(file):   702784 kB
Unevictable:      185216 kB
Mlocked:          185216 kB
SwapTotal:      12578688 kB
SwapFree:       11258048 kB
Dirty:              3904 kB
Writeback:             0 kB
AnonPages:      137531008 kB
Mapped:         17483392 kB
Shmem:          17191808 kB
Slab:            1387456 kB
SReclaimable:     698752 kB
SUnreclaim:       688704 kB
KernelStack:       26432 kB
PageTables:      1580224 kB

 

[MEMORY_OOM]  Information about current out of memory situation: (2019-12-25 21:57:11 929 Local)
OUT OF MEMORY occurred.
Host: hostname
Executable: hdbindexserver
PID: 65384
Failed to allocate 3gb (3221225472b).
 
Allocation failure type: PROCESS_ALLOCATION_LIMIT
 
Current callstack:
1: 0x00007fff75ef1fcc in MemoryManager::PoolAllocator::notifyOOM(unsigned long, bool) const+0xdc at PoolAllocator.cpp:2911 (libhdbbasis.so)
2: 0x00007fff75fefd20 in MemoryManager::PoolAllocator::oomHandling(unsigned long, unsigned long, unsigned long, bool)+0xf0 at PoolAllocator.cpp:1280 (libhdbbasis.so)
3: 0x00007fff75ef36e8 in MemoryManager::PoolAllocator::allocateNoThrowImpl(unsigned long, void const*)+0x248 at PoolAllocator.cpp:2507 (libhdbbasis.so)
4: 0x00007fff776f743c in ltt::allocator::allocate(unsigned long)+0x5c at memory.cpp:146 (libhdbdataaccess.so)
5: 0x00007fff7728faa8 in DataAccess::BackupChannel::BackupChannel(ltt::allocator&, ResourceManager::ResourceContainer&, DAEncryption::EncryptionManager const*, DataAccess::PersistenceManager&, ltt::smartptr_handle<Stream::ChannelCallback> const&, unsigned long, PageAccess::PageIO*, PageAccess::PhysPageNo const&, DAEncryption::InitializationVector const&, DataAccess::PersistentSpaceRestartData const&, PageAccess::Converter*, PageAccess::Converter*, PageAccess::Converter*, PageAccess::Converter*, unsigned long const&, unsigned long const&, bool, bool, unsigned long&)+0x15b8 at releasable_handle.hpp:449 (libhdbdataaccess.so)
6: 0x00007fff7736a918 in DataAccess::PersistentSpaceImpl::openBackupChannel(ltt::allocator&, PageAccess::PhysPageNo const&, DAEncryption::InitializationVector const&, DataAccess::PersistentSpaceRestartData const&, ltt::smartptr_handle<Stream::ChannelCallback>, unsigned long, unsigned long const&, unsigned long const&, bool, bool, unsigned long&)+0x108 at PersistentSpaceImpl.cpp:2653 (libhdbdataaccess.so)
7: 0x00007fff772d39f4 in DataAccess::PersistenceManagerImpl::BackupHandlerImpl::triggerBackup(ltt::smartptr_handle<Stream::ChannelCallback> const&, unsigned long, unsigned long&)+0x184 at releasable_handle.hpp:463 (libhdbdataaccess.so)
8: 0x00007fff772d3e90 in DataAccess::DataBackupHandler::triggerBackup(ltt::smartptr_handle<Stream::ChannelCallback> const&, unsigned long, unsigned long&)+0x20 at BackupInfoRestartData.hpp:43 (libhdbdataaccess.so)
9: 0x00007fff8adfcd58 in Backup::BackupSrc_CommonSnapshot::prepareBackupOperation()+0x1f8 at TraceEntryExit.hpp:103 (libhdbbackup.so)
10: 0x00007fff8ad79a34 in Backup::BackupExe_SimpleExecutor::globalSavepoint(ltt::list<ltt::smartptr_handle<Backup::BackupSrc_Source> >&, unsigned long&, unsigned long&, ltt::guid&, unsigned long&)+0x154 at BackupExe_SimpleExecutor.cpp:261 (libhdbbackup.so)
11: 0x00007fff8ad2c9f4 in Backup::BackupExe_RequestHandler::globalSavePoint(Backup::BackupExe_Request&, Backup::BackupExe_Response&)+0x254 at BackupExe_RequestHandler.cpp:534 (libhdbbackup.so)
12: 0x00007fff8ad2d58c in Backup::BackupExe_RequestHandler::handleRequest(Backup::BackupExe_Request&, Backup::BackupExe_Response&, Backup::BackupCbc_Federation*, Backup::RecoverCbc_Federation*)+0x1ec at BackupExe_RequestHandler.cpp:420 (libhdbbackup.so)
13: 0x00007fff8adb3b40 in Backup::Backup_ExecutorHandler::handleRequest(TrexNet::Request&, Backup::BackupCbc_Federation*)+0xe0 at Backup_RecoverHandler.cpp:191 (libhdbbackup.so)
  • This means the first 4 Callstacks coming from the allocation, means from the oom, itself
  • The following once are showing us that a backup was executed
  • Global savepoint was executed and a snapshot (yes, a HANA backup based on a kind of snapshot) was created
  • This crashed the server because no more memory was available at this point in time
  • Next step: we have to check all other allocators and the complete sizing of the server

For details we can check the system for IPMM short info and search for the allocators, or we are using the dump analyzer for visualization.

IPMM short info:

GLOBAL_ALLOCATION_LIMIT (GAL) = 189.39gb (203360829440b), SHARED_MEMORY = 15.79gb (16958947328b), CODE_SIZE = 1.99gb (2140995584b), OOM_RESERVATION_ALLOCATOR_SIZE = 96.18mb (100859904b)

PID=57454 (hdbcompileserver), PAL=119.47gb (128290521088b), AB=1.27gb (1367474176b), UA=0, U=192.34mb (201686714b), FSL=0b, FMC=0b, Blocked=979.75mb (1027342336b)
PID=57456 (hdbpreprocessor), PAL=119.47gb (128290521088b), AB=1.25gb (1346764800b), UA=0, U=171.11mb (179424071b), FSL=0b, FMC=0b
PID=65384 (hdbindexserver), PAL=119.47gb (128290521088b), AB=119.47gb (128290521088b), UA=0, U=99.17gb (106492272674b), FSL=0b, FMC=0b
PID=57820 (hdbwebdispatcher), PAL=119.47gb (128290521088b), AB=1.47gb (1581187072b), UA=0, U=392.86mb (411946659b), FSL=0b, FMC=0b
PID=28599 (hdbnameserver), PAL=119.47gb (128290521088b), AB=11.15gb (11974541312b), UA=0, U=3.51gb (3774016214b), FSL=160mb (167772160b), FMC=4gb (4306436096b)

Total allocated memory= 152.41gb (163660431360b)
Total used memory     = 121.21gb (130159289244b)
Sum AB                = 144560488448
Sum Used              = 111059346332
Heap memory fragmentation: 20% (this value may be high if defragmentation does not help solving the current memory request)
Top "M_HEAP_MEMORY" allocators (component, name, size). Ordered descending by inclusive_size_in_use.
 1: System:                                     /                                                                    99.17gb (106492272674b)
 2: System:                                     Pool                                                                 94.36gb (101319674758b)
 3: Monitoring & Statistical Data:              Pool/Statistics                                                      25.22gb (27083137056b)
 4: System:                                     Pool/RowEngine                                                       24.74gb (26567735344b)
 5: System:                                     Pool/PersistenceManager                                              20.33gb (21839976360b)
 6: System:                                     Pool/RowEngine/QueryExecution                                        11.04gb (11858086600b)
 7: System:                                     Pool/RowEngine/QueryExecution/SearchAlloc                            10.97gb (11788559016b)
 8: System:                                     Pool/ResourceContainer                                               8.75gb (9406436776b)
 9: System:                                     Pool/PersistenceManager/ContainerFileIDMapping                       8.02gb (8614085728b)
10: System:                                     Pool/PersistenceManager/VirtualFile entry TID map                    6.76gb (7268271488b)
11: Row Store Tables:                           Pool/RowStoreTables                                                  6.17gb (6628813168b)
12: Row Store Tables:                           Pool/RowStoreTables/CpbTree                                          6.07gb (6525295232b)
13: System:                                     Pool/RowEngine/Session                                               4.92gb (5291610176b)
14: System:                                     AllocateOnlyAllocator-unlimited                                      4.36gb (4689452952b)
15: Caches:                                     Pool/RowEngine/SQLPlan                                               4.13gb (4445782520b)
16: System:                                     AllocateOnlyAllocator-unlimited/FLA-UL<24592,1>/MemoryMapLevel3Nodes 3.99gb (4288820208b)
17: System:                                     AllocateOnlyAllocator-unlimited/FLA-UL<24592,1>                      3.99gb (4288820208b)
18: System:                                     Pool/PersistenceManager/PersistentSpace                              3.63gb (3905547760b)
19: System:                                     Pool/ResourceContainer/ResourceHeader                                3.47gb (3734215968b)
20: Statement Execution & Intermediate Results: Pool/Contexts                                                        3.22gb (3463665264b)

Result of the analyses => have a look at the dump analyzer


HANA Dump analyzer (java)

You can find a current version of the SAP HANA dump analyzer here.

Usage:

Source: SCN_Doc_Automatically_Analyze_HANA_Issues_with_SAP_HANA_dump_analyzer.pdf

  • You can see that the hdbindexserver PAL (=Process Allocation Limit) is at 119,47GB
  • The AB (Allocated Bytes) is also at 119,47GB
  • So there is a bottleneck at the PAL or one of its heap allocators

Term clarification:

SId Process secondary identifier
AB Allocated bytes by allocators for heap and stacks (inclusive allocator cache)
UA Unaccounted memory (by unlimited allocators for exceptions etc.), maybe overcommit
U Used bytes for heap and stacks (without allocator cache)
FSL Free segment list size
FMC Free precharge memory cache size
SMAS/SMDS System memory traffic information (accumulation of allocations/deallocations)
PAL Process allocation limit
TPAL Temp process allocation limit
Blocked Memory (TPAL!) blocked/reserved for this process
W Weight of process

Heap Allocators (details) with the OOM Flame Graph (Expert Mode):

Memory Allocator FlameGraph

The following Memory FlameGraph shows the memory consumption distribution from HANA allocators. The wider box consumes more memory. Move the cursor onto the box to display the memory consumed by the allocator. Check for SAP Note 1999997 for allocator-specific recommendations

If you don’t know FlameGraph check out the usage at the website of Brendan Gregg.

You can search for terms in the right upper corner and you can zoom in with a left click and reset zoom with a right click.

  • You may have noticed that the pool allocator with nearly 100% has only 96GB
  • Our PAL is at 119GB
  • There is heap fragmentation which is explained at the IPMM report on the last line:
    • “Heap memory fragmentation: 20% (this value may be high if defragmentation does not help solving the current memory request)”

  • Our Backup request reached the limit with only 1,6GB more memory allocation
  • This the reason for the oom as result of to less memory left but not the root cause
  • Also all other allocators are not as big as we could identify a bottleneck
  • Solution: extend the PAL because we have some memory left till we hit the GAL

 

Here we can see the mixed Flame Graph (expert mode: concurrency) for a detailed thread analyses which is not needed in this case.


Index Server Crash example

Auto Crash Analysis with HANA Dump Analyzer

  • Check the first stack call (0) for known issues in SMP
  • If your issue is not under the search results use call stack 1 or 2 to expand the search

Mixed Flame Graph

  • Use the search in the upper right corner and search for the term which we have seen in our stacktrace (see above): AttributeEngine::BTreeAttribute<TrexTypes::RawAttributeValue>::jeUpdateValidMapping

  • 7 threads are using this call stack
  • There is a very inaccurate note for this issue
  • “Due to a programming error in SAP HANA, indexserver may crash caused by a rare race condition between reading and writing the dictionary of a column store table.”

The manual way

Have a look at the following areas to get a quick overview:
CRASH_SHORTINFO
CRASH_EXTINFO
CRASH_CONTEXT
CRASH_STACK

[CRASH_STACK]  Stacktrace of crash: (2019-11-12 22:26:57 779 Local)
----> Symbolic stack backtrace <----
  0: AttributeEngine::BTreeAttribute<TrexTypes::RawAttributeValue>::jeUpdateValidMapping(ltt_adp::vector<AttributeEngine::AttributeValueContainer*, ltt::integral_constant<bool, true> >&, int) + 0x528
         Symbol: _ZN15AttributeEngine14BTreeAttributeIN9TrexTypes17RawAttributeValueEE20jeUpdateValidMappingERN7ltt_adp6vectorIPNS_23AttributeValueContainerEN3ltt17integral_constantIbLb1EEEEEi
         SFrame: IP: 0x00007fff9df357e8 (0x00007fff9df352c0+0x528) FP: 0x00007ffeec2316b0 SP: 0x00007ffeec2214f0 RP: 0x00007fff9c532bf4
         Params: 0x1ffb2df200000, 0x2f30a, 0x0, 0x7fff9995386c, 0x2b0000002b, 0x0, 0x1ec2a5313e160021, 0x1ec2a5313e1600
         Regs: gpr00=0x7fff9df35734, gpr02/toc=0x7fffa031bc00, gpr03/arg1/ret=0x1ffb2df200000, gpr04/arg2=0x2f30a, gpr05/arg3=0x0, gpr06/arg4=0x7fff9995386c, gpr07/arg5=0x2b0000002b, gpr08/arg6=0x0, gpr09/arg7=0x1ec2a5313e160021, gpr10/arg9=0x1ec2a5313e1600
         Source: AttributeStatistics.hpp:32
         Module: /usr/sap/SID/HDB00/exe/libhdbcs.so
     -----------------------------------------
  1: AttributeEngine::AttributeApi::jeOpenAttributeUpdateValidMapping(AttributeEngine::AttributeStoreHandle*, AttributeEngine::AttributeValueContainerHandle*, AttributeEngine::LazyMode, int) + 0x444
         Symbol: _ZN15AttributeEngine12AttributeApi33jeOpenAttributeUpdateValidMappingEPNS_20AttributeStoreHandleEPNS_29AttributeValueContainerHandleENS_8LazyModeEi
         SFrame: IP: 0x00007fff9c532bf4 (0x00007fff9c5327b0+0x444) FP: 0x00007ffeec2319c0 SP: 0x00007ffeec2316b0 RP: 0x00007fff9c533144
         Source: AttributeApi.h:151
         Module: /usr/sap/SID/HDB00/exe/libhdbcs.so
     -----------------------------------------
  2: AttributeEngine::AttributeApi::jeReadIndex(TRexUtils::IndexVectorAligned const&, TRexUtils::BitVector const*, ltt::vector<int> const*, bool, long, TRexUtils::IndexVectorAligned&, TRexUtils::IndexVectorAligned&, AttributeEngine::QueryStats&) + 0xb4
         Symbol: _ZN15AttributeEngine12AttributeApi11jeReadIndexERKN9TRexUtils18IndexVectorAlignedEPKNS1_9BitVectorEPKN3ltt6vectorIiEEblRS2_SD_RNS_10QueryStatsE
         SFrame: IP: 0x00007fff9c533144 (0x00007fff9c533090+0xb4) FP: 0x00007ffeec231af0 SP: 0x00007ffeec2319c0 RP: 0x00007fff8b82aec8
         Source: AttributeApi.h:187
         Module: /usr/sap/SID/HDB00/exe/libhdbcs.so


Summary

At the end every runtime dump looks different and have to analyzed on its own way, but the starting point is every time the same with every tool. The dump analyzer is a great tool for visualization and the first indicator for a root cause analyses.

What you should have learned:

  • the difference between HANA Dump Viewer and Analyzer
  • how to create a runtime dump manually and via trigger
  • how to create and analyze a mixed flame graph

Thanks to Nina Li for this great tool. Check out her blogs for details about the features and further development.

2 Comments
You must be Logged on to comment or reply to a post.
  • Great collection of tools and techniques!
    Well done, Jens Gleichmann. Good start into the new decade 😉

    As so many times before, I just wish that there would be reference material available outside of SAP that would allow making sense of the data in the trace files beyond reading function/class/method/task names and guessing what those mean.

    Anyway, that does not take away from your very well researched blog posts that I would recommend to any HANA operator.

    Keep up the good work (and maybe consider sharing those posts on LinkedIN and Twitter yourself).

    • Thanks again for reading and commenting my geeky stuff 🙂

      Fully agree in context of the documentation. There is a gap of information which you currently only can close with experience. Here SAP should definitely improve the detail degree to avoid too many OSS messages.

      Regarding sharing: My social network is faster than I in this, but I will consider it 😉

      Regards,

      Jens