Health checks of HANA system
As HANA administrators we often think about how to keep a HANA system healthy, what are the things we need to check to ensure that all the things that are necessary for good performance of the system are in place. One way to do that is using the SQL scripts provided by SAP in OSS note 1969700 – SQL Statement Collection for SAP HANA
This OSS note provides a collection of very useful SQL scripts with which you can check a wide range of areas that are critical to a HANA system. In this blog, I will be describing how you can use these scripts to maintain good health of your HANA systems. I would suggest to perform these checks every 6 months.
Note: Some of these SQL scripts are version dependent, so you need to use the one which suits your version.
Description: SAP introduced the concept of hybrid LOBs from SPS07 onwards. Upto SPS06 memory LOBs were used for storing large objects. The disadvantage of memory LOBs was that when any query accessed the columns containing memory LOBs, the entire column along with the LOBs had to be loaded into memory leading to higher memory usage. With hybrid LOBs, the LOB data is only loaded into memory when it’s size doesn’t exceed a certain configurable threshold (1000 bytes by default), otherwise it resides on disk.
Executing the SQL script “HANA_Configuration_HybridLOBActivation_CommandGenerator” gives you the table name and the column name which has memory LOBs and also the commands to convert them into hybrid LOBs.
Action: Open an SQL console and execute the “Alter table …” commands from the output of the SQL script. For more information, refer to OSS notes 1994962 – How-To: Activation of Hybrid LOBs in SAP HANA and 2220627 – FAQ: SAP HANA LOBs
Description: This script checks for tables and hybrid LOB containers that are located on different hosts. If they are located on different hosts, it has a significant performance impact as it will increase the network data transfer time, leading to increased time in query execution accessing the LOBs.
From the output of the script you can see the table name, the host in which the table is located and the host in which the LOB containers are located.
Action: You can use this SQL statement to relocate them to the same host.- alter table “<schema_name>”.”<table_name>” move to ‘<target_host>:<target port>’ PHYSICAL
Note that the option “physical” is important here. Without the physical option the table data will be moved to the target host excluding the LOB data. Using physical option moves the LOB data as well.
Description: This script gives you the fragmentation information of the data and log files. Usually it’s the data files that gets fragmented over the time so by default the script considers data files only for reporting. In the modification section you can update the PORT as ‘%03” so that only indexserver data files are reported. This is to avoid reporting of data files of other services like scriptserver and xsengine as those are very small in size and defragmenting those will not yield any positive results and is not desired.
Check the column “Fragmentation_Pct”. If the fragmentation percentage is considerable, say more than 40 or 50% then it’s worth doing defragmentation of that data file.
Action: You can execute defragmentation of the data files using the command – Alter system reclaim datavolume ‘<host>:<port>’ 120 defragment
You must take some precautionary measures if you have system replication enabled. You need to ensure that no replication snapshots exist in the system when you are running defragmentation otherwise the defragmentation will fail with the error message “general error: Shrink canceled, probably because of snapshot pages SQLSTATE: HY000”. Refer to OSS note 2332284 – Data volume reclaim failed because of snapshot pages and change the parameters as suggested then perform the defragmentation.
Description: You can use this script to check the avg. commit time of the data files for all nodes of your environment.
Action: From the example above, you can see that the avg. commit time for node 13 is too high. With almost the same number of commits, the avg. commit time of node 11 is much less. The main reason is that the I/O time for the indexserver data file of node 13 is higher compared to other nodes hence increasing the commit time. In such scenarios you need to contact your OS team and ask them to check for I/O issues in the persistent layer. Refer to OSS note 1999930 – FAQ: SAP HANA I/O Analysis for more info on I/O analysis.
Also if you have system replication enabled, low thoughput in the system replication network between the primary servers and the secondary servers can increase the commit time on primary significantly, especially if you are using “logreplay” operation mode. You can check the throughput by executing the command hdbcons -e hdbindexserver “replication info” at the OS level of the primary server. Look for the value of “shippedLogBufferThroughput”. If the throughput is too low you should contact your OS team and ask them to check the servers on the secondary site. Sometimes hardware issues on the secondary site can decrease the throughput.
The benchmark for avg. commit time is < 10 ms.
Description: You can run this script by updating the start and end time as per your requirement in the modification section. Also at the bottom of the script I prefer to use the column MAX_BLK_S in the order by clause. This will give you the savepoint statistics of all the nodes within the chosen time period sorted by the max blocking period. You will get a result like below.
The column MAX_BLK_S gives you the blocking period of the savepoints and the column MAX_CRIT_S shows the critical phase duration. From the example above, you can see the blocking phase duration of some of the savepoints is high but the critical phase duration is low.
Savepoint phases can be categorized as below in the order of their occurrence –
Phase 1: Pageflush
Phase 2: Critical phase
Phase 2.1: Blocking phase
Phase 2.2: Critical phase
Phase 3: Post critical
Phase 2 is the most important phase which has severe impact on system performance if that phase takes longer time to complete. During the blocking phase, the system waits for acquiring lock. Once it acquires the lock it then enters the critical phase to write the data into persistence. If either the blocking or the critical phase is too high, it will impact the system performance because no DML operations are permitted in the system during this phase. This manifests in the form of long running sessions and long running threads. All threads performing DML operations like insert, update, delete will get blocked until the lock is released by the savepoint blocking phase. This will lead to severe system slowness.
Action: The blocking phase duration can be high because of bugs in the HANA revisions or some other reasons as well. Check OSS note 2100009 – FAQ: SAP HANA Savepoints, section 6 “How can typical savepoint issues be analyzed and resolved?” and 1999998 – FAQ: SAP HANA Lock Analysis, section 6 “How can internal lock waits be analyzed?”
I have also observed that if you have low throughout in the system replication network between primary and secondary systems, the savepoint blocking phase duration can increase significantly when a full initial load is going on between primary and secondary and also at other times as well. Refer to the section “HANA I/O Commits” above to see how to measure the system replication throughput. So this is another area that you need to check if you are observing high savepoint blocking phase duration very frequently.
If the critical phase duration is high, it is most likely due to I/O issues at the persistence layer. You can check 1999930 – FAQ: SAP HANA I/O Analysis for likely causes and solution. Also you can run the SQL script HANA_Configuration_MiniChecks and refer to the value of savepoint write throughput. It should be > 100 mb/s at least. If this this lesser than 100 mb/s then you should contact your OS/hardware team.
Note that these things change a lot with new HANA revisions so you should watch the latest version of the above-mentioned OSS notes to get updated information. For e.g. consistent change lock is not acquired anymore during the blocking phase from SPS11 onwards.
Description: Ececuting this script shows you the log buffer wait count ratio and the log buffer race count ratio. Both should not be more than 1. In the example below you can see the log switch wait count ratio is little high for node 17.
Action: There can be several ways to investigate this. You can increase the log buffer size and/or number of log buffers. Refer to OSS note 2215131 – Alert Log Switch Wait Count Ratio. Note the caution mentioned in the note on the side effect of increasing the log buffer size too much.
There can also be I/O issues at the hardware level which may slow down the write speed from log buffers to persistence hence increasing the wait time for log buffer switch. You can download the hardware configuration check tool from OSS note 1943937 – Hardware Configuration Check Tool – Central Note and compare the throughput of your system with the benchmark provided in the pdf document attached to the note.
Another reason can be sub-optimal distribution of data if you are using a scaled-out landscape. If the load on some servers are too high compared to other nodes, it may lead to a high log buffer wait count ratio on those nodes. You can perform a table redistribution operation to ensure that tables are optimally distributed across all nodes so that only few nodes are not overloaded.
Description: You can use this script to check the partitioning details of all tables. You can update the parameter “MIN_NUM_PARTITIONS” to 100 in the modification section of the script so that the output of the script shows only those tables that has more than 100 partitions. Having too many partitions can cause adverse performance impact as any query which is not using the partitioning column in their where clause will have to scan all the partitions. Also there is a limit on the maximum number of partitions allowed for one table, 1000 till SPS09 and 16000 from SPS10 onwards. So you need to keep a watch on the tables having the highest number of partitions.
Action: 1. First you need to find out the hosts where all the partitions reside. From HANA studio under the system, right click on the “catalog” and click on find table. Enter the table name and go to the “Runtime information” tab. At the bottom section “Details for table” under the sub-section “Parts” you can see the hosts where the partitions reside. See screen shot below.
From here you can see the table has 6 partition groups each partition groups having several partitions each.
2. Execute the SQL query – alter table <table_name> move partition <partition group no.> to ‘<host>:<port>’, where partition group no is the number that you see on the left half of the screen shot above, to move all the partitions belonging to that partition group to one host. Execute this SQL query for all partition groups of the table.
Note that you can only move partition groups, you cannot move single partitions.
3. Merge the partitions – alter table <table_name> merge partitions
4. Run the ABAP report RSDU_TABLE_CONSISTENCY for the table by selecting the check “CL_SCEN_PARTITION_SPEC” first by selecting the “Store” option and then the “Repair” option.
Now you will see the number of partitions in the table has decreased significantly.
Description: Having a bigger row store has multiple disadvantages like – consumption of more memory, increase in database startup time (as the row store is entirely loaded into memory during DB startup). So it’s always beneficial to keep the row store slim and trim.
Row store consists of memory segments 64 MB each. Each of the memory segments further consists of fixed size pages. When a large number of records are deleted from row store tables, it creates empty pages across multiple memory segments. This causes the row store to have a bigger size even though the used space is much less.
Executing this SQL script gives you an output where you can see the amount of space that is fragmented and the fragmentation %.
If the fragmentation % is more than 30, then it’s better to perform a row store reorganization. The reorganization moves all the used pages from the sparse segments into other segments and hence frees up segments which are then released causing reduction in its size.
Action: There are 2 ways to perform row store reorganization – online and offline. SAP recommends to perform offline reorganization to achieve maximum compaction ratio. Refer to OSS note 1813245 – SAP HANA DB: Row store reorganization on how to perform a row store reorganization. Note that online reorganization has been discontinued from HANA 2.0 SPS02.
After running a row store reorganization, it may again get fragmented after few months depending on how much data is getting inserted and deleted. So you should check this at least every 6 months to see of there is significant fragmentation in the row store.
Description: Auto compression should be enabled for all non-BW tables. After every delta merge, the mergedog triggers the compression of the table/table partitions automatically. (However, note that manual execution of delta merge does not trigger compression automatically, you need to do that manually). Now this happens only for tables that have auto compression enabled. So, you can imagine, if you have lots of tables with auto compression disabled, it can not only lead to greater consumption of memory but also disk space and increase backup size as well.
This script gives you the table names that have auto compression disabled and the command to enable it. By default the modification section of this script excludes the BW tables as auto compression is not expected to be enabled for BW tables.
Action: Execute the generated “alter table …” commands to enable auto compression of the tables. For more information, refer to the OSS note 2112604 – FAQ: SAP HANA Compression
Description: All non-BW tables should have auto merge enabled. This will ensure that the delta store does not grow too much as it has performance implications. The delta store contains uncompressed data to speed up insert/update queries. The main store is read optimized. So data from delta store needs to be merged regularly with main store to improve the read performance of the queries and also to reduce memory consumption as tables/table partitions are compressed automatically after a merge operation.
However, note that BW tables use smart merge option which is controlled by the SAP application. There are certain parameters in the mergedog section under indexserver.ini which controls this. Auto merge should not be enabled for BW tables otherwise they may interfere with the smart merge decisions and cause adverse impact.
You can use this SQL script to find out the non-BW tables (by default the modification section of the script excludes BW tables) that have auto merge disabled.
Action: Execute the “Alter table …” commands from the output of the SQL script to enable auto merge for the non-BW tables. For more information, refer to the OSS note 2057046 – FAQ: SAP HANA Delta Merges
Description: You can use this script to find out the tables that have never been considered for compression optimization. If there are several big tables listed here, it can cause a significant increase in memory consumption of the database.
Action: Execute the update command generated by the script to perform a force compression on the table. You can do this only for big tables and ignore small ones. For e.g. in the above screen shot except the first table, all the other tables are very small in size and need not be considered. For more information, refer to the OSS note 2112604 – FAQ: SAP HANA Compression
Description: This script gives you the list of column store table columns without advanced compression. If there are table columns with more than a million records without compression, it can significantly increase the memory consumption.
Action: Execute the update command from the output of the script to perform compression of those table columns. For more information, refer to the OSS note 2112604 – FAQ: SAP HANA Compression
Description: Row store tables usually consists of a single container, but sometimes more containers can get added, e.g. when adding columns. If multiple containers per row store table is enabled, then adjusting certain table structures like adding columns becomes faster. However it can also introduce problems like performance overhead and terminations.
Using this SQL query you can find out the which row store tables has more than 1 containers. The script also outputs the SQL command to convert the tables into single container tables.
Action: Execute the “Alter table …” command from the output of the SQL script to convert the multi container tables into single container.
From HANA 2.0 SPS01 adjusting row store tables (like adding columns) will no longer generate multiple containers. Refer to OSS note 2222277 – FAQ: SAP HANA Column Store and Row Store section 22. What are row store containers and which problems can be related to them? for more information.
There are other areas as well that needs to be checked periodically to ensure optimal performance of a HANA system. I will try to write a second part of such health checks that are very easy to follow to maintain your HANA systems’ health.