HANA Savepoint Analysis
I would like to share some knowledge about savepoints in SAP HANA. The main reference is SAP Note “2100009 – FAQ: SAP HANA Savepoints”.
1. What are savepoints?
- Savepoints are required to synchronize changes in memory with the persistency on disk level. All modified pages of row and column store are written to disk during a savepoint.
- Each SAP HANA host and service has its own savepoints.
- The data belonging to a savepoint represents a consistent state of the data on disk and remains untouched until the next savepoint operation has been completed.
2. When is a savepoint triggered?
During normal operation, savepoints are automatically triggered when a predefined time since the last savepoint has passed. The length of the interval between two consecutive savepoints is controlled with the following parameter:
global.ini -> [persistence] -> savepoint_interval_s
Its default value is 300, so savepoints are taken at intervals of 300 seconds (5 minutes).
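If you export the savepoint start times (e.g. from M_SAVEPOINT_STATISTICS or M_SAVEPOINTS), you can quickly check whether savepoints really ran at the configured interval. The following is a minimal sketch; the function name and the 60-second tolerance are my own assumptions, not part of any SAP tool:

```python
from datetime import datetime

def check_savepoint_intervals(start_times, expected_s=300, tolerance_s=60):
    """Given savepoint START_TIME strings (ISO format, e.g. exported from
    M_SAVEPOINTS), return the gaps that deviate from the configured
    savepoint_interval_s by more than tolerance_s (hypothetical helper)."""
    times = sorted(datetime.fromisoformat(t) for t in start_times)
    anomalies = []
    for prev, cur in zip(times, times[1:]):
        gap = (cur - prev).total_seconds()
        if abs(gap - expected_s) > tolerance_s:
            # a gap much larger than savepoint_interval_s hints at a
            # blocked or long-running savepoint in between
            anomalies.append((prev.isoformat(sep=" "), gap))
    return anomalies
```

A 15-minute gap between two savepoints with the default 300-second interval would be flagged, which is a good starting point for looking at the blocking phases described below.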
System command (manual)
The following command can be used to execute a savepoint manually:
ALTER SYSTEM SAVEPOINT
- A soft shutdown invokes a savepoint before the services are stopped.
- A hard shutdown doesn’t trigger a savepoint, which can increase the subsequent restart time.
- A global savepoint is performed before a data backup is started.
- A savepoint is written after the backup of a specific service is finished.
- A savepoint is performed after a consistent database state is reached during startup.
- Snapshots are savepoints that are preserved for longer use, so they are not overwritten by the next savepoint.
3. Helpful Views
| View | Description |
|---|---|
| M_SAVEPOINT_STATISTICS | Global savepoint information per host and service |
| M_SAVEPOINTS | Detailed information for individual savepoints |

As of SAP HANA SPS 10, savepoint details are logged for THREAD_TYPE = ‘PeriodicSavepoint’ (see SAP Note 2114710).
4. Helpful SQL Scripts
1969700 – SQL statement collection for SAP HANA
| SQL script | Description |
|---|---|
| SQL: “HANA_IO_Savepoints” | Detailed information for individual savepoints |
| SQL: “HANA_IO_Snapshots” | Snapshot information |
5. Blocking Phase
Most of the savepoint is performed online without holding a lock, but the finalization of the savepoint requires a lock. This step is called the blocking phase of the savepoint and consists of two major sub-phases:
| Sub-phase | Thread detail | Description |
|---|---|---|
| WaitForLock | enterCriticalPhase(waitForLock) | Before the critical phase is entered, the savepoint needs to acquire a ConsistentChangeLock. If this lock is held by other threads / transactions, the duration of this phase increases. At the same time, all other modifications on the underlying tables (INSERT, UPDATE, DELETE) are blocked by the savepoint via the ConsistentChangeLock. |
| Critical | processCriticalPhase | Once the ConsistentChangeLock is acquired, the actual critical phase is entered and the remaining I/O writes are performed in order to guarantee a consistent set of data on disk level. During this time other transactions aren’t allowed to perform changes on the underlying tables and are blocked by the ConsistentChangeLock. |
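Conceptually, the ConsistentChangeLock behaves like a reader-writer lock: DML holds it in shared mode, while the savepoint's critical phase needs it exclusively. The following toy model is purely illustrative (the class and method names are my own, not HANA internals):

```python
import threading

class ConsistentChangeLockModel:
    """Toy reader-writer lock illustrating the savepoint blocking phase:
    DML statements hold the lock in shared mode, the savepoint's critical
    phase needs it exclusively. All names here are illustrative only."""

    def __init__(self):
        self._cond = threading.Condition()
        self._shared = 0          # number of active DML holders
        self._exclusive = False   # savepoint critical phase in progress

    def acquire_shared(self):     # DML: INSERT / UPDATE / DELETE
        with self._cond:
            while self._exclusive:      # blocked during the critical phase
                self._cond.wait()
            self._shared += 1

    def release_shared(self):
        with self._cond:
            self._shared -= 1
            self._cond.notify_all()

    def acquire_exclusive(self):  # savepoint: waitForLock phase
        with self._cond:
            while self._shared or self._exclusive:
                self._cond.wait()       # long waits here = lock contention
            self._exclusive = True

    def release_exclusive(self):  # end of the critical phase
        with self._cond:
            self._exclusive = False
            self._cond.notify_all()
```

The model shows both failure modes from the table: `acquire_exclusive` stalls while shared holders exist (long waitForLock), and `acquire_shared` stalls while the exclusive holder works (DML blocked during the critical phase).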
6. Typical savepoint issue analysis

| Issue | Thread detail | Description |
|---|---|---|
| Long waitForLock phase | enterCriticalPhase(waitForLock) | Long durations of the blocking phase outside of the critical phase are typically caused by SAP HANA internal lock contention; the known scenarios are described in SAP Note 2100009. |
| Long critical phase | processCriticalPhase | Delays during the critical phase are often caused by problems in the disk I/O area. |

Starting with Rev. 102 you can configure the following parameter to trigger a runtime dump (SAP Note 2400007) when waiting to enter the critical phase takes longer than <seconds> seconds:

indexserver.ini -> [persistence] -> runtimedump_for_blocked_savepoint_timeout = ‘<seconds>’

This is not a default parameter; you have to add it manually.
7. Analyze the runtime dump
The runtime dump is triggered by the parameter runtimedump_for_blocked_savepoint_timeout. You can check it from the following aspects.
- The savepoint thread: its call stack contains “DataAccess::SavepointLock::lockExclusive”.
- Other threads (e.g. SQL threads) waiting for the lock: their call stacks contain “DataAccess::SavepointSPI::lockSavepoint”.
Runtime dump: section [SAVEPOINT_SHAREDLOCK_OWNERS]
In most cases the savepoint hangs because its exclusive lock request is blocked by other threads holding the lock in shared mode. This section lists the owners of shared ConsistentChangeLock locks, so when a savepoint is blocked in the waitForLock phase (SAP Note 2100009), the blocking activities can be found here.
Example: In the following section, thread 298995 holds the shared lock, so the savepoint’s exclusive lock request is blocked and the savepoint hangs.
[SAVEPOINT_SHAREDLOCK_OWNERS] Owners of shared SavepointLocks: (2017-10-10 11:18:13 112 Local)
96034[thr=298995]: JobWrk0145, TID: 4856, UTID: 1588661641, CID: -1, LCID: 0, parent: 299143, SQLUserName: “”, AppUserName: “”, AppName: “”, ConnCtx: —, StmtCtx: —, type: “JobWorker”, method: “”, detail: “”, command: “” at 0x00007efe63342e88 in ltt::string_base<char, ltt::char_traits<char> >::trim_(unsigned long)+0xb8 at string.hpp:683 (libhdbcs.so)
Once you have the thread id of the shared-lock owner, search for it in the dump to find its parent thread id. In this example, the parent thread is the following:
107423[thr=299143]: MergedogMerger, TID: 4856, UTID: 1588661641, CID: -1, LCID: 0, parent: 299445, SQLUserName: “”, AppUserName: “”, AppName: “”, ConnCtx: —, StmtCtx: —, type: “MergedogMerger“, method: “”, detail: “3 of 3 table(s): SAPERP:/1LT/VF00094506“, command: “” at 0x00007efe4e645f59 in syscall+0x19 (libc.so.6)
We can conclude that the merge of table /1LT/VF00094506 is holding the shared lock. The next step is to check whether there is an issue with the merge of this table.
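Tracing a shared-lock owner to its parent thread by hand gets tedious in large dumps. A small sketch like the following can automate it; the regular expression is an assumption based on the thread-line format shown above, and the helper names are my own:

```python
import re

# Matches runtime dump thread lines such as:
#   96034[thr=298995]: JobWrk0145, TID: 4856, ..., parent: 299143, ...
THREAD_RE = re.compile(
    r'\[thr=(?P<thr>\d+)\]:\s*(?P<name>[^,]+),.*?parent:\s*(?P<parent>\d+)'
)

def parse_threads(dump_lines):
    """Build a {thread_id: (name, parent_id)} map from runtime dump lines."""
    threads = {}
    for line in dump_lines:
        m = THREAD_RE.search(line)
        if m:
            threads[int(m.group("thr"))] = (
                m.group("name").strip(),
                int(m.group("parent")),
            )
    return threads

def parent_chain(threads, thr):
    """Follow parent links upwards from a shared-lock owner thread."""
    chain = []
    while thr in threads:
        name, parent = threads[thr]
        chain.append((thr, name))
        thr = parent
    return chain
```

Applied to the two lines above, the chain from thread 298995 leads to its parent 299143 (MergedogMerger), which points at the delta merge as the shared-lock holder.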
Runtime dump: section [STATISTICS], view M_SAVEPOINTS_
Import the data of this view into Excel and sort by the columns CRITICAL_PHASE_WAIT_TIME and CRITICAL_PHASE_DURATION.
If CRITICAL_PHASE_WAIT_TIME is long (over 10 seconds in this example), the savepoint waited a long time for the exclusive lock, which confirms a lock contention issue.
If CRITICAL_PHASE_DURATION is long, this points to an issue in the disk I/O area.
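The same triage can be done programmatically once the M_SAVEPOINTS rows are exported. This is only a sketch: the 10-second thresholds mirror the example above but are assumptions, and the function assumes the two columns are in seconds:

```python
def classify_savepoints(rows, wait_threshold_s=10.0, duration_threshold_s=10.0):
    """rows: dicts with CRITICAL_PHASE_WAIT_TIME and CRITICAL_PHASE_DURATION
    (assumed to be in seconds, as exported from M_SAVEPOINTS).
    Returns suspicious rows sorted by wait time, each tagged with the
    likely root-cause area. Thresholds are illustrative assumptions."""
    findings = []
    for row in rows:
        causes = []
        if row["CRITICAL_PHASE_WAIT_TIME"] > wait_threshold_s:
            causes.append("lock contention (long waitForLock)")
        if row["CRITICAL_PHASE_DURATION"] > duration_threshold_s:
            causes.append("disk I/O (long critical phase)")
        if causes:
            findings.append({**row, "suspected": causes})
    return sorted(
        findings,
        key=lambda r: r["CRITICAL_PHASE_WAIT_TIME"],
        reverse=True,
    )
```

A high wait time maps to the shared-lock analysis in the previous section, while a high duration maps to disk I/O checks.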
I hope this helps you understand savepoints and find the root cause of savepoint hanging issues.