Technical Articles
HANA Workload Management deep dive part II
last updated: 2023-01-09 18:00 CEST
Ok, you already read part I of the deep dive series of HANA workload management. You are waiting for more insides after you identified some HANA resource bottlenecks? Then you are on the right track. In this part I will describe how you can get an overview of your systems workload and how to handle it with different methods.
It is important to know the behavior and the load if the system. Every system is used in another way. Means you have to analyze, interpret and understand the workload. You have to know if it was just a one-time occurrence or a frequent behavior.
Most common scenarios:
- The system is overloaded and cannot fulfill the request – means the DB can no longer respond
- Performance issues for certain queries
- System is not using the defined thresholds
HANA workload management deep dive part II
General workload management
There are multiple ways to consume your resources efficiently. I will only describe the common methods and go not too deep into details of NUMA and CPU affinities.
If a statement is not limited by a parameter or a workload class, it can use all resources if the optimizer choose a execution plan with parallel threads. This resources will be granted directly after the connect of the session.
First of all how to identify such resource bottlenecks? Again please use and study the scripts of Martin Frauendorfer in note 1969700. It is worth to spend the time to know how to use them.
Identify the time frames when the CPU goes over 70% for a long-time span is a good indicator. Please keep in mind that there might be some short critical time frames which are not included in the thread sampling of the HANA. It is not exactly documented by SAP how the HANA is getting the resource details from the OS process and how grading works. But by default if the tracking (service_thread_sampling_monitor_enabled) is activated in newer revisions (Rev. 55), but as the name says it is just sampling and no exact measurement. This means short peaks below 2-3 seconds can get lost, but the admission control can be activated due to another measurement method. Here an expert monitoring is needed. But back to 90% of the cases:
There is CPU time divided into CPU(host and indexserver) and SYS (System).
CPU on host level means the CPU time outside of HANA, e.g. third-party tools like AV, FS Backup etc.
CPU on indexserver means the CPU time of the indexserver without the interrupts and OS kernel time.
SYS is including the interrupts (I/O, network, etc.).
Other tools like nmon, top or the SQL collection will only break it down to CPU and SYS. Means indexserver CPU as CPU and all third party and interrupts as SYS.
HANA_LoadHistory_Services_*
SNAPSHOT_TIME HOST PORT PING_MS CPU SYS USED_GB SWAP_MB CONNS TRANS BTRANS STMT_PS ACT_THR WAIT_THR wait_thr_in_% ACT_SQL WAIT_SQL wait_sql_in_% PEND_SESS VERSIONS COM_RANGE HANDLES MERGES UNLOADS 2022/08/03 10:02:00 hanahost00 any 26 83 1 2738,65 0 753 250 0 2098 397 55,66 14 48,16 42,16 88 0.00 108053 34376 2003 12 0 2022/08/03 04:47:00 hanahost00 any 42 70 1 2710,83 0 607 191 0 76 527,83 216,16 41 7,66 7,16 93 0.00 1503 574 1847 0 0 2022/08/03 04:46:00 hanahost00 any 45 81 1 2740,91 0 606 192 0 120 514,83 209,83 41 12,83 12,16 95 0.00 3044 922 1839 6 0 2022/08/03 04:45:00 hanahost00 any 55 92 1 2757,53 0 600 183 0 86 690,83 293,83 43 13,16 12,83 97 0.00 3724 866 1838 0 0 2022/08/03 04:44:00 hanahost00 any 52 87 1 2745,53 0 598 180 0 123 564,5 215,33 38 14,33 13,33 93 0.00 7350 1125 1844 0 0 2022/08/03 04:43:00 hanahost00 any 60 94 1 2765,34 0 600 182 0 77 774,33 371,16 48 19,16 17,66 92 0.00 13602 747 1833 4 0 2022/08/03 04:42:00 hanahost00 any 62 95 1 2776,03 0 601 182 0 104 789,83 390 49 21,5 20,83 97 0.00 15275 694 1835 0 0 2022/08/03 04:41:00 hanahost00 any 50 95 0 2769,15 0 597 187 0 70 743,33 346,83 47 21,5 20,66 96 0.00 3469 730 1818 0 0 2022/08/03 04:40:00 hanahost00 any 48 95 0 2757,77 0 594 177 0 94 719,33 321,66 45 21,5 20,83 97 0.00 3256 569 1838 0 0 2022/08/03 04:39:00 hanahost00 any 62 94 1 2776,18 0 590 184 0 69 773,16 370 48 23,16 22,83 99 0.00 3685 856 1840 1 0 2022/08/03 04:38:00 hanahost00 any 61 94 0 2768,06 0 594 177 0 93 738,66 336,83 46 22,5 21,5 96 0.00 2294 730 1828 2 0 2022/08/03 04:37:00 hanahost00 any 55 94 0 2768,63 0 600 175 0 63 771,16 370,33 48 24,16 22,83 94 0.00 2052 627 1838 0 0 2022/08/03 04:36:00 hanahost00 any 53 94 0 2767,8 0 0 180 0 75 730,83 330 45 23,83 23,16 97 0.00 3664 899 1836 2 0 2022/08/03 04:35:00 hanahost00 any 51 94 0 2760,59 0 605 185 0 64 736,5 334,66 45 23 22,5 98 0.00 5446 970 1843 12 0 2022/08/03 04:34:00 hanahost00 any 76 94 0 2795,51 0 596 179 0 120 852,16 449,5 53 24,66 24 97 0.00 4342 872 1844 0 0 2022/08/03 04:33:00 hanahost00 any 57 95 1 2782,26 0 598 181 0 62 826 424,33 51 23,33 23,16 99 0.00 2043 623 1836 21 0 2022/08/03 04:32:00 hanahost00 any 68 93 0 2750,41 0 600 178 0 97 715,66 336,33 47 25,16 23,5 93 0.00 1923 599 1839 0 0 2022/08/03 04:31:00 hanahost00 any 61 94 0 2768,73 0 598 176 0 64 790,33 391,5 50 23 22,83 99 0.00 4343 899 1837 0 0 2022/08/03 04:30:00 hanahost00 any 62 94 1 2795,97 0 594 177 0 97 877,5 477,33 54 23,5 23 98 0.00 3561 452 1843 0 0 2022/08/03 04:29:00 hanahost00 any 54 94 0 2750,98 0 597 177 0 59 817,16 417,83 51 23,66 23,33 99 0.00 3345 555 1849 9 0 2022/08/03 04:28:00 hanahost00 any 47 91 1 2774,15 0 597 177 0 146 761,16 369,83 49 24,5 23,83 97 0.00 2459 555 1845 1 0 2022/08/03 04:27:00 hanahost00 any 73 94 1 2792,96 0 600 180 0 40 853 451,33 53 23,16 23 99 0.00 1738 378 1843 1 0 2022/08/03 04:26:00 hanahost00 any 47 94 0 2763,75 0 598 180 0 49 734,2 338,4 46 24,4 23,4 96 0.00 1796 413 1837 0 0 2022/08/03 04:25:00 hanahost00 any 74 94 0 2786,59 0 598 185 0 36 794,66 394,16 50 23,16 22,83 99 0.00 5119 2079 1837 4 0 2022/08/03 04:24:00 hanahost00 any 48 94 1 2794,16 0 600 184 0 86 788,16 387,83 49 23,33 22,83 98 0.00 7957 4114 1847 6 0 2022/08/03 04:23:00 hanahost00 any 65 94 0 2833,23 0 608 183 0 496 855,5 453,66 53 24,5 23,33 95 0.00 8601 4093 1845 0 0 2022/08/03 04:22:00 hanahost00 any 78 94 0 2798,62 0 615 179 0 612 901,66 500,16 55 25,16 23,83 95 0.00 7236 2065 1854 16 0 2022/08/03 04:21:00 hanahost00 any 52 95 1 2782,35 0 614 172 0 72 930,83 528,83 57 25,66 24,83 97 0.00 5773 1255 1856 0 0 2022/08/03 04:20:00 hanahost00 any 56 94 0 2768,99 0 612 182 0 107 772,83 373,16 48 24,66 23,5 95 0.00 4697 844 1860 17 0 2022/08/03 04:19:00 hanahost00 any 43 95 0 2743,13 0 609 177 0 99 746,5 347,5 47 23,83 23 97 0.00 4134 872 1865 0 0 2022/08/03 04:18:00 hanahost00 any 63 94 1 2773,71 0 608 172 0 137 733 332,66 45 24,83 24,16 97 0.00 3110 1188 1844 0 0 2022/08/03 04:17:00 hanahost00 any 47 93 0 2757,57 0 604 174 0 87 651,33 263,33 40 25,33 24,66 97 0.00 3221 946 1838 0 0 2022/08/03 04:16:00 hanahost00 any 58 94 1 2829,29 0 600 174 0 83 929,83 529,66 57 27,33 26,5 97 0.00 2099 666 1835 0 0 2022/08/03 04:15:00 hanahost00 any 51 94 1 2744,07 0 599 174 0 54 845,66 443,16 52 24,16 24 99 0.00 964 273 1842 0 0 2022/08/03 04:14:00 hanahost00 any 46 78 0 2743,65 0 602 175 0 98 640,83 309,16 48 23,66 22,83 96 0.00 1709 334 1854 0 0 2022/08/02 23:03:00 hanahost00 any 54 74 5 2669,79 0 575 168 0 1605 500,16 181,66 36 41,66 32,16 77 0.00 2825 2123 1820 0 0 2022/08/02 23:00:00 hanahost00 any 42 70 4 2664,07 0 552 168 0 472 422,16 156,33 37 35,5 30,5 86 0.00 140 45 1799 0 0 2022/08/02 22:55:00 hanahost00 any 42 70 4 2660,87 0 555 163 0 364 455,66 181,83 40 35,5 30,16 85 0.00 24 47 1797 0 0 |
It is hard to read this format, but I just want to give you a feeling of what you will get if you are using the SQL collection. You have to import it in excel and create your own graphs. You can also use the HANA cockpit but you will never get such detailed information as you get it from the SQL scripts.
Find the critical time frames where the CPU is over 70%. Here you can also see the number of SQLs and threads at this point in time. When the ratio of ACT_SQL and WAIT_SQL is over 50% you should investigate more time for this time frame. In this example we can identify the timeframe from 04:14 to 04:47 as critical. Means the system was under high load. But why?
We can see a lower number of statements per second which is the consequence of the resource bottleneck. We also see a high number of threads. But was it only one statement which was not limited by the systems parameter or have we found a situation with multiple SQLs which running at the same point in time and block all resources although limited by the default_statement_concurrency_limit or even multiple workload classes?
Situation one is easy – just configure your system correctly with default parameter and/or workload classes.
Situation two needs more time for investigation. You have to find out the root cause. To be more accurate you need the statement hash, user or application name. With the latest version of the script you also have the possibility to check if the admission control was active at the point in time. This will be described in more detail in the associated chapter (part III).
Live monitoring of the dyn. Workload Management
In part I we already talked about the dyn. workload management
If you want to know how it is currently working in the system, check your current setup via SQL script.
HANA_Jobs_Executors_2.00*:
----------------------------------------------------------------------------------------------------------------------------------- |HOST |PORT |GROUP|CONC_CONFIG|CONC_CURR|TOTAL|BUSY|PARKED|FREE|SYSWAIT|JOBWAIT|IMM_MS|SHORT_MS|NORMAL_MS|SQL_BUSY|OTH_BUSY| ----------------------------------------------------------------------------------------------------------------------------------- |saphana10000|30001| 0| 16| 16| 16| 0| 1| 12| 3| 0| 0.00| 0.02| 0.01| 0| 0| |saphana10000|30001| 1| 16| 16| 16| 0| 1| 15| 0| 0| 0.00| 0.00| 0.02| 0| 0| |saphana10000|30002| 0| 16| 16| 16| 0| 1| 15| 0| 0| 0.00| 0.00| 0.00| 0| 0| |saphana10000|30002| 1| 16| 16| 14| 0| 0| 14| 0| 0| 0.00| 0.00| 0.00| 0| 0|
To understand what this results shows you, you have to understand the different values:
CONC_CONFIG: Configured max_concurrency value CONC_CURR: Currently used max_concurrency value (can be smaller than configured one in case of dynamic reductions) TOTAL: Total number of JobWorkers BUSY: Running JobWorkers PARKED: JobWorkers parked for future use FREE: Idle JobWorkers SYSWAIT: JobWorkers waiting for synchronization JOBWAIT: JobWorkers waiting for other JobWorkers IMM_MS: Average time an immediate job waits in some queue SHORT_MS: Average time a short running job waits in some queue NORMAL_MS: Average time a normal job waits in some queue SQL_BUSY: Running SqlWorker and SqlExecutor threads OTH_BUSY: Running threads different from SqlWorker, SqlExecutor and JobWorker
A repeatedly increased number of queued jobs can indicate that the JobWorker load (i.e. parallelized activities) is higher than the available JobWorker threads.
The internally used max_concurrency limitation for JobWorker threads can differ in some cases unintentionally from the setting of the related parameter, e.g. because an UNSET of the parameter wasn’t properly considered by the running system. If the used concurrency is much smaller than the intended one, the number of JobWorker threads is limited and throughput can reduce. In this case you can as a workaround manually set max_concurrency to a reasonably high explicit value.
Admission Control
Source: © 2022 SAP SE or an SAP affiliate company. All rights reserved.
The Admission Control should be the last bastion to prevent your system from overloading. You should not use or define it as workload management.
In this blog we will only cover the admission control part. In the next blog we will focus on the resource control.
How does it work?
You set up your thresholds at which workload the system should activate this control function. This can be CPU or memory percentage value. By default 90% of CPU – means it never gets active for memory. If you keep this default, it means when the system is reaching this threshold for a certain time (here is a gap inside the documentation how the sampling is working => averaging_factor, EWMA without a known time range) it is a) the control function is active and will do the following:
Source: © 2022 SAP SE or an SAP affiliate company. All rights reserved.
- Collection of statistics values (statistics_collection_interval; default 1000ms)
- weighting of the collected statistic value (averaging_factor; default 70; 100 means no averaging)
The decision manager will admit a session or queue/dequeue it. When one of the thresholds (CPU/memory) is reached queueing is active. It is important to know if a session has passed the check and become active it will not be reduced or limited in its resources afterwards.
If queueing is reaching a defined number of requests (max_queue_size; default: 10000) or the configured values (reject_memory_threshold / reject_cpu_threshold ; default: 0) rejecting is active.
What happens if a request is queued? When will it be released?
This depends on when the request was queued and which priority it has.
If the weighted statistics value drops below the threshold which will be checked every 1000ms by default (dequeue_interval)
A set of 50 (default; dequeue_size) requests will be released and assigned to free resources. If you have only small requests which are running faster than 1000ms you will wait also if there are more free resources.
This means you can increase your dequeue size that more requests can be dequeued in 1000ms or you can lower the dequeue interval that the dequeueing will only wait e.g. 500ms and not 1000ms. |
Which variant you should prefer depends on the number of cores and your workload. If you have over 100 cores you can increase the dequeue size. If you have a workload like BW with a high runtime you should not lower the dequeue interval.
What happens if a request is rejected?
Rejecting means also that a SAP work process cannot connect to the DB and is not ready to work in the application server.
Even administration SQLs may not connect anymore. This means identifying the root cause of the issue can be difficult, if no SQL session can connect.
To check the latest events (be careful the view shows only 10000 events => controlled by indexserver.ini -> [admission_control_events] -> record_limit):
HANA_Workload_AdmissionControlEvents_2.00.010+ HANA Cockpit Admission control Monitor |
Summary Admission Control
The admission control should only be active in unknown load scenarios preventing your system from overloading. It should not be used as a tool which is frequently active and acting as a bouncer/doorkeeper.
If the admission control is frequently active, check the root cause of the situation:
- Was it caused by a known SQL / program
- Is the system still sized correctly for the SQL requirements?
- Are the parameters for memory and CPU correct?
- Are there any user parameters active?
- Is the workload class mapping correct?
For all important workloads you should create workload classes to control and guarantee the needed resources for the priority of the workloads. Best example is if you have OLAP/BW workload on your OLTP/ERP system. The main workload should be OLTP, but due to the high degree of parallelism there can be a bottleneck in resources which will affect the OLTP load as well.
Exactly this to identify and to manage is one of the core responsibilities of HANA DBA. Know your system, your peaks and react early if there is something wrong with your resources. If you can successfully prioritize and manage the current system resources for tasks which are really needed, you won’t need to buy new expensive resources in the near future. Not always more hardware is the right answer. Use the existing hardware in the right manner.
Thanks Jens Gleichmann. It's a really nice Blog !.. Kudos to you.
Please clarify what has to be modified exactly for situation 1..
"Find the critical time frames where the CPU is over 70%. Here you can also see the number of SQLs and threads at this point in time. When the ratio of ACT_SQL and WAIT_SQL is over 50% you should investigate more time for this time frame"
Situation one is easy – just configure your system correctly with default parameters - which parameters did you mention about?
Hi Daniel,
just check the default parameters of part I. You can also use the parameter script of note 1969700 to configure the system to your needs.
Regards,
Jens
Thanks Jens Gleichmann. It's a really nice Blog !.. Kudos to you.
Looking forward to your 'How to configure the right parameters/workload classes (Part III)'
Hi,
thanks - part III is now also online. Enjoy reading 😉
Regards,
Jens