Analyzing SAP HANA Runtime Dumps with SAP HANA dump analyzer
Over the past months, I’ve been working on the SAP HANA dump analyzer: an intelligent, easy-to-use Java executable that automatically analyzes HANA issues. You can get the latest SAP HANA dump analyzer here. You can also check the changes for HANA dump analyzer.
There will be a series of blogs explaining the essential features of the SAP HANA dump analyzer and introducing intelligent solutions built on top of it (Autonomous Self-Analysis System).
This post is going to give you an overview of the SAP HANA dump analyzer.
If you’ve ever supported or managed an SAP HANA system, you have most likely worked with SAP HANA runtime dumps already. An SAP HANA runtime dump is a text file that provides various information about the current state of the SAP HANA database. HANA runtime dumps are frequently used to troubleshoot technical issues such as system hangs or high memory consumption. However, they are plain text files containing a massive amount of information, including hundreds of call stacks, thread information and HANA statistics. This makes it difficult to analyze the dumps and reach a conclusion within a reasonable time frame.
The very first challenge in analyzing a HANA runtime dump is to recognize patterns across thousands of thread call stacks.
One HANA thread call stack from the HANA runtime dump looks like the following:
1817914258[thr=73928]: JobWrk11358 at
 1: 0x00007fcf3887dfd9 in syscall+0x15 (libc.so.6)
 2: 0x00007fcf3a5e2019 in Synchronization::BinarySemaphore::timedWait(unsigned long, Execution::Context&)+0x255 at LinuxFutexOps.hpp:53 (libhdbbasis.so)
 3: 0x00007fcf4008b935 in Executor::X2OldLock::calculate(Executor::X2Statistics&)+0x4a1 at X2OldLock.cpp:609 (libhdbexecutor.so)
 4: 0x00007fcf4002bd39 in Executor::PlanExecutor::calculateX2(TRexCommonObjects::TRexApiError&, Executor::X2Statistics&)+0x295 at PlanExecutor.cpp:862 (libhdbexecutor.so)
 5: 0x00007fcf4002ca67 in Executor::PlanExecutor::calculate(TRexCommonObjects::TRexApiError&, Executor::X2Statistics&)+0x1b3 at PlanExecutor.cpp:687 (libhdbexecutor.so)
 6: 0x00007fcf5cf92038 in JoinEvaluator::JoinAPI::execute(TRexCommonObjects::TRexApiError&)+0x784 at PlanExecutor.h:61 (libhdbcsapi.so)
 7: 0x00007fcf5c5a35ca in TRexAPI::JoinSearchImpl::executeSearch(Execution::Context&, TRexAPI::PreparedQuery const&, TRexAPI::QueryRuntime&, ltt::smartptr_handle<TRexCommonObjects::InternalTableBase>&)+0xe16 at JoinSearchImpl.cpp:40 (libhdbcsapi.so)
 8: 0x00007fcf5c5e3236 in TRexAPI::SearchAPI::extractResults(Execution::Context&, TRexAPI::Search::RowProjectors*, TRexAPI::Search::RawResultContext*)+0x152 at SearchAPI.cpp:317 (libhdbcsapi.so)
 9: 0x00007fcf5c5e3f7e in TRexAPI::SearchAPI::fetchAll(Execution::Context&, bool)+0x3a at SearchAPI.cpp:893 (libhdbcsapi.so)
10: 0x00007fcf4366be57 in ptime::TrexOltpSearch::search(Execution::Context&, bool, bool)+0x93 at trex_oltp_query.cc:194 (libhdbcswrapper.so)
11: 0x00007fcf45c28e89 in ptime::Trex_oltp_search::do_open(ptime::OperatorEnv&, ptime::QEParams, int) const+0x425 at qe_trex_search.cc:4215 (libhdbrskernel.so)
12: 0x00007fcf45b94e6c in ptime::Table::open(ptime::Env&, ptime::QEParams, int) const+0x148 at qe_table.cc:230 (libhdbrskernel.so)
13: 0x00007fcf45c6a2ef in ptime::Itab_materializer::do_open(ptime::OperatorEnv&, ptime::QEParams, int) const+0x27b at qe_itab_materializer.cc:94 (libhdbrskernel.so)
14: 0x00007fcf45b94e6c in ptime::Table::open(ptime::Env&, ptime::QEParams, int) const+0x148 at qe_table.cc:230 (libhdbrskernel.so)
15: 0x00007fcf45c12fb3 in ptime::Trex_oltp_search::evaluateChildren(ptime::OperatorEnv&, TRexAPI::QueryRuntimeData&, ptime::QEParams) const+0x80 at qe_trex_search.cc:3849 (libhdbrskernel.so)
16: 0x00007fcf45c28c87 in ptime::Trex_oltp_search::do_open(ptime::OperatorEnv&, ptime::QEParams, int) const+0x223 at qe_trex_search.cc:4169 (libhdbrskernel.so)
17: 0x00007fcf45b94e6c in ptime::Table::open(ptime::Env&, ptime::QEParams, int) const+0x148 at qe_table.cc:230 (libhdbrskernel.so)
18: 0x00007fcf435f54e5 in ptime::TrexPlanOp::executePtimeOp(ltt_adp::vector<Executor::PlanData*, ltt::integral_constant<bool, true> > const&, ltt_adp::vector<Executor::PlanData*, ltt::integral_constant<bool, true> > const&, TRexCommonObjects::TRexApiError&, Executor::ExecutionInfo const&)+0x111 at trex_plan.cc:385 (libhdbcswrapper.so)
19: 0x00007fcf435f573c in ptime::TrexPlanOp::executePop(ltt_adp::vector<Executor::PlanData*, ltt::integral_constant<bool, true> > const&, ltt_adp::vector<Executor::PlanData*, ltt::integral_constant<bool, true> > const&, TRexCommonObjects::TRexApiError&, Executor::ExecutionInfo const&)+0x38 at trex_plan.cc:266 (libhdbcswrapper.so)
20: 0x00007fcf4008d46d in Executor::X2OldLock::runPopTask(Executor::X2::PopTaskInfo&, int&, ltt::allocator&, ltt::allocator&)+0x14a9 at X2OldLock.cpp:2473 (libhdbexecutor.so)
21: 0x00007fcf4007d2ec in Executor::X2OldLock::runPopJob(Executor::X2Job*)+0x78 at X2OldLock.cpp:2090 (libhdbexecutor.so)
22: 0x00007fcf4007e7c3 in Executor::X2OldLockJob::run(Execution::JobObject&)+0x1f0 at X2OldLock.cpp:4495 (libhdbexecutor.so)
23: 0x00007fcf3a32915b in Execution::JobObjectImpl::run(Execution::JobWorker*)+0x1217 at JobExecutorImpl.cpp:1098 (libhdbbasis.so)
24: 0x00007fcf3a3347d4 in Execution::JobWorker::runJob(ltt::smartptr_handle<Execution::JobObjectForHandle>&)+0x3b0 at JobExecutorThreads.cpp:217 (libhdbbasis.so)
25: 0x00007fcf3a337037 in Execution::JobWorker::run(void*&)+0x1f3 at JobExecutorThreads.cpp:436 (libhdbbasis.so)
26: 0x00007fcf3a38f637 in Execution::Thread::staticMainImp(void**)+0x743 at Thread.cpp:463 (libhdbbasis.so)
27: 0x00007fcf3a390cc8 in Execution::Thread::staticMain(void*)+0x34 at ThreadMain.cpp:26 (libhdbbasis.so)
On a busy HANA system, the thread call stacks can look like the following (an example with around 2000 working HANA thread call stacks):
Understanding these HANA thread call stacks within a limited time feels insurmountable because there is simply too much data to study!
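Before any visualization, the raw text has to be reduced to something countable. The following is a minimal sketch of pulling the function names out of one thread call stack. It is not the dump analyzer's actual parser: the regular expression is an assumption derived only from the sample stack shown earlier in this post, and `parse_frames` is a hypothetical helper name.

```python
import re

# Assumed frame layout, taken from the sample stack in this post:
#   <n>: 0x<address> in <function>+0x<offset> [at <file>:<line>] (<library>)
FRAME_RE = re.compile(
    r"(\d+): 0x[0-9a-f]+ in (.+?)\+0x[0-9a-f]+(?: at \S+)? \(([^)]+)\)"
)

def parse_frames(stack_text):
    """Return (frame number, function name, library) tuples, innermost frame first."""
    frames = []
    for match in FRAME_RE.finditer(stack_text):
        number, function, library = match.groups()
        # Drop the argument list to keep the frame names short.
        name = function.split("(", 1)[0]
        frames.append((int(number), name, library))
    return frames

# First two frames of the sample stack.
sample = (
    "1817914258[thr=73928]: JobWrk11358 at "
    "1: 0x00007fcf3887dfd9 in syscall+0x15 (libc.so.6) "
    "2: 0x00007fcf3a5e2019 in Synchronization::BinarySemaphore::timedWait"
    "(unsigned long, Execution::Context&)+0x255 at LinuxFutexOps.hpp:53 (libhdbbasis.so)"
)
for frame in parse_frames(sample):
    print(frame)
```

Cutting the name at the first parenthesis loses overload information, which is usually an acceptable trade for readability when the goal is only to spot patterns.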
FlameGraph for HANA
Inspired by Brendan Gregg’s FlameGraphs, the HANA thread call stacks are visualized as FlameGraphs. Each column is one HANA thread. Different threads with similar call stacks are grouped together, e.g. there are many threads on the right side of the following FlameGraph that are blocked by the savepoint (i.e. on call stack frame DataAccess::SavepointLock::lockShared). The FlameGraph visualization gives an intuitive result: you naturally look at the bigger sections of the thread FlameGraph, with or without deep HANA knowledge.
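The grouping step can be sketched as follows. Brendan Gregg's flamegraph.pl takes a "folded" text input: one line per unique stack, frames separated by semicolons with the root first, followed by a space and a count. Whether the dump analyzer uses this format internally is not stated here, so treat this as an illustration of the idea only. `fold_stacks` is a hypothetical helper and the three stacks are made-up, heavily shortened examples.

```python
from collections import Counter

def fold_stacks(stacks):
    """Collapse identical call stacks into 'root;...;leaf count' lines.

    Each stack is a list of function names, innermost (leaf) frame first,
    which is the order in which the runtime dump numbers its frames.
    """
    counts = Counter(";".join(reversed(stack)) for stack in stacks)
    return [f"{path} {n}" for path, n in sorted(counts.items())]

# Hypothetical, heavily shortened stacks for three threads.
threads = [
    ["syscall", "DataAccess::SavepointLock::lockShared", "Execution::Thread::staticMain"],
    ["syscall", "DataAccess::SavepointLock::lockShared", "Execution::Thread::staticMain"],
    ["Executor::PlanExecutor::calculate", "Execution::Thread::staticMain"],
]
for line in fold_stacks(threads):
    print(line)
```

The two threads waiting in lockShared collapse into a single line with a count of 2, which is what makes that section wider than the others in the rendered graph.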
There are different variations of the FlameGraph for analyzing a dump from different angles, e.g.
- Reversed FlameGraph can be used for better visualization if many threads are waiting on the same lock, i.e. the leaf call stack frames are the same, though they may come from different parent call stack frames.
- Memory FlameGraph can be used to visualize HANA allocator memory consumption when M_HEAP_MEMORY is available from the dump.
- Concurrent FlameGraph visualizes OLAP query execution (i.e. threads hierarchy).
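To see why the reversed variant helps with lock waits, here is a small sketch that computes how many threads share each box of a flame graph, once with the root at the base and once with the leaf at the base. `box_widths` and the shortened function names are hypothetical and only illustrate the principle.

```python
from collections import Counter

def box_widths(stacks, leaf_at_base=False):
    """Count how many threads share each box of a flame graph.

    A box is identified by the path from the base of the graph up to it.
    Stacks are lists of function names, innermost (leaf) frame first.
    """
    widths = Counter()
    for stack in stacks:
        ordered = list(stack) if leaf_at_base else list(reversed(stack))
        for depth in range(1, len(ordered) + 1):
            widths[tuple(ordered[:depth])] += 1
    return widths

# Hypothetical stacks: the same lock at the leaf, but different callers.
threads = [
    ["lockShared", "insertRow", "staticMain"],
    ["lockShared", "mergeDelta", "staticMain"],
]
normal = box_widths(threads)
reversed_graph = box_widths(threads, leaf_at_base=True)

# Root at the base: lockShared appears as two separate boxes of width 1.
print(normal[("staticMain", "insertRow", "lockShared")])
print(normal[("staticMain", "mergeDelta", "lockShared")])
# Leaf at the base: both threads merge into one lockShared box of width 2.
print(reversed_graph[("lockShared",)])
```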
With the FlameGraph visualization, it’s much easier to spot patterns in the thread call stacks.
The Auto Analyzer
However, is this already everything I want from a HANA dump analyzer?
I want a HANA dump analyzer that answers directly:
- Is there any issue in the runtime dump?
- If yes, what is the issue, and how was that concluded?
- What are the possible workarounds or solutions, and how to move forward?
With these questions in mind, I looked for an answer: the Auto Analyzer feature of the HANA dump analyzer was created to automatically and systematically analyze issues in a runtime dump and create an analysis report. An example of the analysis report looks like the following:
SAP HANA dump analyzer
The following part describes the latest version of the SAP HANA dump analyzer in detail, including the Auto Analyzer and Expert Mode.
SAP HANA dump analyzer is a Java program that can be executed directly when Java is properly installed. The GUI looks like this:
It supports drag and drop of runtime dumps. A dump is analyzed automatically after you double-click the selected runtime dump or click the “Auto Analyzer” button, and the analysis report is created automatically. The report can be saved as a single HTML page using the browser’s save function.
The SAP HANA dump analyzer can also be executed from the command line to analyze a provided SAP HANA runtime dump and return the analysis report. The help page of the SAP HANA dump analyzer command line is available via:
java -jar HANADumpAnalyzer.jar -help
It’s possible to integrate SAP HANA dump analyzer with a monitoring infrastructure and other tools. In the best case, this can yield an Autonomous Self-Analysis System that detects and analyzes HANA issues on its own in certain scenarios. Please understand that an Autonomous Self-Analysis System with SAP HANA dump analyzer may help you automate certain monitoring tasks and analyze the scenario, but you need to know what you want to do. The setup should be tested thoroughly, as it is used at your own risk.
The SAP HANA dump analyzer can run in Windows, Linux and macOS environments. It doesn’t connect to the SAP HANA database, so neither credentials nor a running SAP HANA database are required when using it.
Currently the following analyzers have been implemented. The analyzers check for their issues automatically. If an analyzer finds an issue, it creates a tab page in the analysis report and the issue is also shown on the summary page. The summary page also provides further information about the runtime dump, e.g. the runtime dump name, the time when the runtime dump was generated and the runtime dump duration.
|Analyzer||Issue to be analyzed by the analyzer||Details|
|Crash Analyzer||HANA crash issue||The Crash Analyzer checks whether the dump shows a crash. If there is a HANA crash, it creates a crash analysis report showing where HANA crashed, e.g. the crash call stack and the exception or violation condition. If the cause of the exception is not directly clear, you usually need to open an SAP incident to have the crash checked.|
|OOM Analyzer||HANA OOM issue||The OOM Analyzer checks whether the dump shows an OOM issue. If so, it creates an OOM analysis report including e.g. the global allocation limit and Inter Process Memory Management (IPMM) information, and the distribution of memory consumption across connections and heap allocators. The memory consumption distribution analysis shows how memory is consumed by the different connections. If one expensive query is the biggest memory consumer, the OOM Analyzer states that conclusion directly. [Figure: example of the memory consumption distribution analysis from the OOM analysis report]|
|HANA Workload Analyzer||HANA job worker exhaustion issue, i.e. all available job workers are busy, no new job workers can be started anymore, and jobs are queuing up||The Workload Analyzer checks whether there is a job worker exhaustion issue. If so, the analysis report shows how the job workers are configured and what they are busy with. To show what the job workers are doing, it provides e.g. an OLAP workload concurrency FlameGraph, pie charts of the number of threads per application and statement, and a FlameGraph of the job worker call stacks. [Figure: example of the workload analyzer]|
|High CPU Analyzer||High CPU issue, i.e. more than (60% * PROCESSOR_NUM) threads are running (i.e. not waiting on synchronization); there are many running threads, however no CPU_INFO is captured in the runtime dump||The High CPU Analyzer checks whether the runtime dump shows a (potential) CPU resource exhaustion issue. If so, it provides an analysis including CPU load statistics, a concurrency FlameGraph that visualizes how the OLAP load uses thread resources, and a thread stack FlameGraph that visualizes the thread call stacks. [Figure: example of the thread call stack FlameGraph on the High CPU analysis tab]|
|Savepoint Analyzer||Savepoint blocked issue and many threads are blocked on the savepoint||The Savepoint Analyzer checks whether the runtime dump shows a blocked savepoint. If the savepoint is blocked and in turn blocks many other threads, it provides a savepoint blocked analysis including the call stack of the savepoint blocker, further information about the blocker (e.g. the running SQL), and the threads blocked by the savepoint. [Figure: example of the savepoint blocked analysis]|
|Waitgraph Analyzer||Waitgraph is detected, many threads are blocked||The Waitgraph Analyzer checks whether there is a blocking situation that is visible in the waitgraph. If so, it provides an analysis including the waitgraph and a thread call stack FlameGraph. [Figure: example of the waitgraph in the analysis report]|
|Blocked Transactions Analyzer||Many transactions are blocked||The Blocked Transactions Analyzer checks whether there are many blocked transactions. If so, it provides an analysis including a blocked transaction graph and the thread call stacks visualized as a thread stack FlameGraph. [Figure: example of the blocked transaction visualization in the analysis report]|
|IndexHandle State Analyzer||Many threads are waiting on acquiring an index handle||The IndexHandle State Analyzer checks whether many threads are waiting to acquire an index handle. If so, it visualizes the blocking situation and provides a thread stack FlameGraph. [Figure: example of the indexHandle internal state analysis from the analysis report]|
A more detailed documentation can be found here.
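As an illustration of the kind of rule an analyzer applies, the first High CPU criterion from the table above (more than 60% * PROCESSOR_NUM running threads) can be written as a simple check. This is only a sketch: how the tool decides that a thread is "running" and where it reads PROCESSOR_NUM from are not described in this post, so both values are plain inputs here, and the function and parameter names are hypothetical. The second criterion (many running threads but no CPU_INFO) is not covered.

```python
def exceeds_running_thread_threshold(running_threads, processor_num, threshold_percent=60):
    """Return True if more than threshold_percent of processor_num threads are running.

    Integer arithmetic avoids floating point rounding at the boundary.
    """
    return running_threads * 100 > threshold_percent * processor_num

# Hypothetical numbers: a 64-processor host, so the threshold is 38.4 threads.
print(exceeds_running_thread_threshold(50, 64))
print(exceeds_running_thread_threshold(30, 64))
```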
If the Auto Analyzer doesn’t find a known scenario or you want to perform your own analysis, you can switch to the “Expert Mode” tab and use the analysis options provided there, e.g.:
|Call stack representation via flame graph||Flame Graph -> Stack -> Create Flame Graph||A flame graph represents call stacks so that more frequent call stacks are displayed larger than less frequent ones. Further options, like showing differences between call stacks, are available.|
|Memory allocation representation via flame graph||Flame Graph -> Memory -> Create Flame Graph||A similar flame graph can be created for memory allocation.|
|Call stack generation via DOT||Dot Graph -> Create Dot Graph||A different way to display call graphs is the DOT format. In this case, boxes are colored in different shades of red depending on the number of threads in the module. Final modules (where the thread actually works) are marked with a blue frame.|
|Extraction of [INDEXMANAGER_WAITGRAPH] locking scenarios||Wait Graph -> Create Wait Graph||The [INDEXMANAGER_WAITGRAPH] section of a runtime dump may already contain a wait graph in DOT format that is extracted and displayed.|
|Extraction of monitoring view data||Statistics||The section [STATISTICS] of runtime dumps contains raw data of specific monitoring views. This can be extracted and opened with Excel. The M_SERVICE_THREADS_STATISTICS is available if the dump contains the section|
A more detailed documentation can be found here.
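To make the DOT option from the Expert Mode table more concrete, here is a minimal sketch that turns a few call stacks into DOT text, with redder boxes for functions that more threads pass through and a blue frame for the functions where a thread is currently working. The coloring formula, `stacks_to_dot` and the shortened function names are assumptions for illustration and do not describe how the dump analyzer generates its graphs.

```python
from collections import Counter

def stacks_to_dot(stacks):
    """Render call stacks (innermost frame first) as a DOT digraph string."""
    stacks = [s for s in stacks if s]   # ignore empty stacks
    node_threads = Counter()            # number of threads passing through each function
    edges = Counter()                   # (caller, callee) -> number of threads
    leaves = set()                      # functions where a thread is currently working
    for stack in stacks:
        for name in set(stack):
            node_threads[name] += 1
        leaves.add(stack[0])
        for callee, caller in zip(stack, stack[1:]):
            edges[(caller, callee)] += 1

    busiest = max(node_threads.values(), default=1)
    lines = ["digraph callgraph {", "  node [shape=box, style=filled];"]
    for name, count in sorted(node_threads.items()):
        # More threads -> lower green/blue components -> a stronger red.
        fade = 255 - round(200 * count / busiest)
        attrs = f'fillcolor="#ff{fade:02x}{fade:02x}"'
        if name in leaves:
            attrs += ", color=blue, penwidth=2"
        lines.append(f'  "{name}" [{attrs}];')
    for (caller, callee), count in sorted(edges.items()):
        lines.append(f'  "{caller}" -> "{callee}" [label="{count}"];')
    lines.append("}")
    return "\n".join(lines)

# Hypothetical stacks: two threads waiting in the same function via different callers.
threads = [
    ["lockShared", "insertRow", "staticMain"],
    ["lockShared", "mergeDelta", "staticMain"],
]
dot = stacks_to_dot(threads)
print(dot)
```

The resulting text can be rendered with any Graphviz-compatible viewer.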
Please feel free to post any feedback on SAP HANA dump analyzer on this blog or send an email to email@example.com. If the SAP HANA dump analyzer is not working as expected or needs to be fixed, please attach the HANA runtime dump when you write to me. Thanks!