Purpose

The purpose of this document is to instruct SAP customers on how to analyse hanging situations in their HANA system.

Overview

So what constitutes a hanging situation in HANA? When we talk about a hanging situation we generally refer to a whole system wide hang, as opposed to highlighting one specific activity, ie: queries or operations. This means we are going to look at the systems conditions (performance) which leads the DB to run slowly, or not at all in some cases;


ex:

  • Database has stopped responding but has not crashed.
  • Database hangs during startup and does not start up.
  • Application is hanging but new connections to the database are still possible (Possibly not a HANA DB issue which must be analysed from an application perspective before looked at from a HANA DB perspective)

Like all other software systems, SAP HANA relies on hardware to run its processes. Even when looking at the most basic single sever, single host, we can see many areas which over lap and interact with one another, in other words, when you see a hang do not just assume the cause is all related to HANA.

Wiki.png

So a small example of a hang / freeze situation I have witnessed is when a user goes to open the “Administrative Tab” in HANA Studio and the system hangs for a long period of time. Below is a small example of troubleshooting this issue.

Troubleshooting

So here are some issues that you may face with hanging and what Traces are needed to troubleshoot.


1: Hanging situation with High CPU Utilization.

2: Hanging situations with CPU Utilization because all threads are waiting.

  • Traces needed for further analysis: Runtime Dumps.

3: Hanging situations where a logon via SSH is not possible. (Either wrong OS configuration or an OS/Hardware issue.)


Wrong OS configuration


The system must not swap. What you have to remember is that HANA limits its memory usage by the Global Allocation Limit. Other non HANA processes or other instances can interfere with the HANA system. So it is important that HANA is really assigned the memory up to the GAL and nothing else using this.

A large file cache can lead to problem. When checking the cache size (top command) please remember that HANA shared memory is also booked as “shared”. The remaining cache size is (in general) not available for HANA. If this is high, then we have to find out why?. The most probable reason for this is outside HANA. (See Linux paging improvements)

Transparent Huge Pages must be deactivated. THP will make HANA run quicker for a while but when it comes to splitting THP’s , the system gets so slow that working with it is not possible.


The first thing you have to think of when you face this situation is to execute Runtime Dumps immediately.

Runtime Dumps can useful for the following situations:

  • Standstill situations
  • High CPU
  • Hanging Threads
  • Query cancellation not working

By checking a RTE dump you can look for certain key indicators such as Large Query plans. Large Query plans can indicate problematic SQL Queries. The thread of this SQL query can then be checked via the parent and also its child threads. These threads then link you back to short stacks which then can checked to see what exactly this stack is doing. See Troubleshooting High CPU for further info.

As the HANA Studio will more than likely be hanging during the hang / slow performance, you can use SSH to run the Runtime Dumps with 2 minute intervals through the means of this hdbsql script:

DATE=`date +%Y%m%d.%H%M%S` ; for PID in `ps x|grep -E “hdb(index)server”|grep -v grep|awk ‘{print $1}’` ; do CMDLINE=`ps x|grep -E “^ *${PID}”|grep -v grep |awk ‘{for(i=5;i<=NF;i++) printf $(i)}’` ; echo $PID – $CMDLINE ; hdbcons -p ${PID} “runtimedump dump -c” >${CMDLINE}-${PID}-${DATE}.dump ; done ; sleep 120 ; DATE=`date +%Y%m%d.%H%M%S` ; for PID in `ps x|grep -E “hdb((index+)server”|grep -v grep|awk ‘{print $1}’` ; do CMDLINE=`ps x|grep -E “^ *${PID}”|grep -v grep |awk ‘{for(i=5;i<=NF;i++) printf $(i)}’` ; echo $PID – $CMDLINE ; hdbcons -p ${PID} “runtimedump dump -c” >${CMDLINE}-${PID}-${DATE}.dump ; done ; sleep 120 ; DATE=`date +%Y%m%d.%H%M%S` ; for PID in `ps x|grep -E “hdb((index+)server”|grep -v grep|awk ‘{print $1}’` ; do CMDLINE=`ps x|grep -E “^ *${PID}”|grep -v grep |awk ‘{for(i=5;i<=NF;i++) printf $(i)}’` ; echo $PID – $CMDLINE ; hdbcons -p ${PID} “runtimedump dump -c” >${CMDLINE}-${PID}-${DATE}.dump ; done

After running the script you can then open the generated RTE dumps. The dumps will show you what exact queries were running at the time of the hang / freeze.

These queries can then either be searched on the SAP search or you can check to see if these are your own custom queries which need to be looked at it terms of optimization. (Also if you have to open an incident with SAP this information will be what the engineer will be looking for)

In relation to the HANA Studio Hang, the solution for this can be found by searching the SQL Query generated which will return the Note High CPU when opening admin console

The vast majority of hanging situations are related to bottleneck issues with CPU, Storage, Network etc.

Usually the DBA will know the time and date of the hang that is causing the issues, but if this is not known you can always use the Performance load graph. As of SP9 you can now use the HANA Cockpit load graph. (I know this was a function already available in previous revisions but this did not work very well and crashed a lot). This preforms better and does not crash like its predecessor in Studio if the nameserverhistory file was large.

Going to the SAP HANA Cockpit, you can then see the SAP HANA Database Administration section with nice looking Fiori designed tiles:

Wiki 1.PNG

Here you can check at what time and date did the system experience the issues.


Please also be aware of the HANA Offline Cockpit functionality that became available recently. By logging in with the SIDADM user you can use the “Troubleshoot Unresponsive System” also:

Wiki4.PNG

If the load graph cannot be accessed by either Studio or Cockpit you can also use the TOP command at OS level which will show you the running processes:

wiki 3.PNG

So now you have the Time Stamp of the issue. Now you go to HANA Diagnosis Tab in Studio , or its corresponding tile in Cockpit.

Here is where you locate the time stamp in the relevant files so you can see what was happening before, during and after the hang.

The first files to look into are the indexserver + nameserver. Check the corresponding time stamps (Before and during the time) in these files to see if any obvious errors are apparent. Some examples of errors you may see before the system hang can be:

  • SQL error 131: transaction rolled back by lock wait timeout
  • SQL error 133: transaction rolled back by detected deadlock

If you see these please see Note on Lock Analysis. Many Useful SQL scripts exist also at Note 1969700

The so called MVCC Ani Ager periodically checks for problematic statements. It reports idle cursors or long running write transactions after 60 minutes.

It closes idle cursors after 12 hours.

  • mvcc_anti_ager.cc(01291) : There are too many un-collected versions.

        (‘number of versions > 1000000’ or ‘maximum number of versions per record > 100000’)

  • The cursor possibly block the garbage collection of HANA database.

         mvcc_anti_ager.cc(01291) : There are too many un-collected versions on table “<schema>”.”<table>”

         (‘number of versions for one of the partitions > 1000000’ or ‘maximum number of versions per record > 100000’)

  • The transaction blocks the garbage collection of HANA database.

        mvcc_anti_ager.cc(01199) : long running uncommitted write transaction detected.

        mvcc_anti_ager.cc(01082) : The Connection is disconnected forcefully because it is blocking garbage collection for too long period.

        Statement.cc(03190) : session control command is performed by …, user=SYSTEM, query=ALTER SYSTEM DISCONNECT SESSION ‘<conn_id>’

        mvcc_anti_ager.cc(00834) : long running cursor detected.

  • The open cursor possibly blocks the garbage collection of HANA database.

         Please close a cursor in application or kill the connection by “ALTER SYSTEM DISCONNECT SESSION ‘<conn_id>’ “

The above refers to a long running transaction that has yet to be committed and could be causing your system to hang. If you were to see any of these errors please see FAQ on Garbage Collection.


Blocked transactions can lead to a hanging situation from an application perspective. Blocked transactions are usually not a database issue and need to be analyzed from an application point of view.

To find out the blocked transaction see the System Information Tab. You then query “Blocked Transactions“:

Wiki1.PNG

Long running transaction can block the garbage collection for executing. A very high number of MVCC versions (>5 Mil) can lead to a slow system or even a hanging like situation. If you would like to query the MVCC versions amount you can find this in the monitoring view M_MVCC_OVERVIEW:

WIki 2.PNG

If you then like to drill down further into the MVCC versions, you can also see how many MVCC versions per table by querying M_RS_TABLE_VERSION_STATISTICS:

wiki 3.PNG

So by using the above information you should be able to find the blocker transaction. You should then try to disconnect this connection:

ALTER SYSTEM DISCONNECT SESSION ‘12345’. If the cancellation was successful then the reason is normally related to an application or user issue and should be investigated more from that perspective.

If the connection cannot be cancelled then this is either 1 or 2 things:

1: A long running query that cannot be cancelled. So this is the time to run the runtime dumps as mentioned earlier.

2: Or it could be caused by an issue that requires further attention from SAP Development support via an incident.

Hanging situations in relation to SAVEPOINTS:

Savepoints speed up the startup time of a database because not all the redo logs have to be replayed but only the log from the last savepoint. The Savepoint coordinator periodically performs savepoints, which the default is 5 minutes. The savepoints are also triggered by several other operations such as data backups, a database shutdown or after a restart is completed.

If a system crashes during the savepoint operation, the system can still be restored from the last savepoint due to the shadow page concept. The shadow page concept is more about how to allocate and reuse pages in the data file and doesn’t affect the recover-ability that much. But it e.g. frees you (to a very large extent) from the need for data file reorganisation. Here changed pages will not be overwritten directly, but instead marked as available and the changed content is placed at some other available location in the data file. Since SAP HANA keeps track on which pages contain the current data, there is no need to overwrite or clear unused pages, so after some time the whole data file will contain some data.

Data backup operations write a global savepoint, which is a consistent set of savepoints from all servers in the SAP HANA system. It is possible to restore a SAP HANA system from such a data backup, without replaying the redo log.

The Savepoint is split into three individual stages:

Phase 1 (PAGEFLUSH): All modified pages are determined that are not yet written to disk. The savepoint coordinator triggers writing of all these pages and waits until the I/O operations are complete.

Phase 2 (CRITICAL): The is the critical part of a savepoint operation where no concurrent write operations are allowed. This is achieved using the consistent change lock. To minimize the impact on concurrent operations, phase 2 must be kept as short as possible. The savepoint coordinator determines and stores the savepoint log position and the list of open transactions. Also pages that were changed during phase 1 are written to disk asynchronously.

Phase 3 (POSTCRITICAL): Changes are allowed in this phase again. The savepoint coordinator waits until all asynchronous I/O operations related to the savepoint are finished and marks the savepoint as completed.

During the critical phase the savepoint holds an exclusive Consistent Change Lock. Other write operations into the data volume are blocked during that time.

You can identify these by selecting the following:

M_SAVEPOINTS monitoring view in runtimedump or via hdbcons command (STATE ENTERCRITICAL or CRITICAL)

Capture.PNG

Possible Reasons: Bad IO throughput (check SAP Note 1999930) Blocked Consistent Change Lock by waiting writer (check SAP Note 2214279)

If you have checked the HANA logs and you see nothing obvious or problematic, then here is when you check the var/log/messages.

If you see some irregularities in these files then open a ticket with your hardware provider.

The main point to take from this document is to ALWAYS try and capture the hang with runtime dumps. This will then give you the DBA or SAP a very good chance of identifying the root cause of the hang. Without runtime dumps & kernel profiler trace you cannot be given a definite RCA, even if you open a support incident with SAP.

Related KBA’s:

2280288 – TREXviaDBSL and TREXviaDBSLWithParameter Statements can remain Open with Status ‘Suspended’

2256719 – Potential Standstill Situations Due to Wrong Calculation of Parked Workers

1999020 – How-To: SAP HANA Troubleshooting when Database is no longer accessible

To report this post you need to login first.

5 Comments

You must be Logged on to comment or reply to a post.

Leave a Reply