Skip to Content
Technical Articles
Author's profile photo Jens Gleichmann

HANA Workload Management deep dive part II

last updated: 2023-01-09 18:00 CEST

Ok, you already read part I of the deep dive series of HANA workload management. You are waiting for more insides after you identified some HANA resource bottlenecks? Then you are on the right track. In this part I will describe how you can get an overview of your systems workload and how to handle it with different methods.

It is important to know the behavior and the load if the system. Every system is used in another way. Means you have to analyze, interpret and understand the workload. You have to know if it was just a one-time occurrence or a frequent behavior.

Most common scenarios:

  1. The system is overloaded and cannot fulfill the request – means the DB can no longer respond
  2. Performance issues for certain queries
  3. System is not using the defined thresholds

 

HANA workload management deep dive part II

    1. General workload management
    2. Admission Control

General workload management

There are multiple ways to consume your resources efficiently. I will only describe the common methods and go not too deep into details of NUMA and CPU affinities.

If a statement is not limited by a parameter or a workload class, it can use all resources if the optimizer choose a execution plan with parallel threads. This resources will be granted directly after the connect of the session.

First of all how to identify such resource bottlenecks? Again please use and study the scripts of Martin Frauendorfer in note 1969700. It is worth to spend the time to know how to use them.

Identify the time frames when the CPU goes over 70% for a long-time span is a good indicator. Please keep in mind that there might be some short critical time frames which are not included in the thread sampling of the HANA. It is not exactly documented by SAP how the HANA is getting the resource details from the OS process and how grading works. But by default if the tracking (service_thread_sampling_monitor_enabled) is activated in newer revisions (Rev. 55), but as the name says it is just sampling and no exact measurement. This means short peaks below 2-3 seconds can get lost, but the admission control can be activated due to another measurement method. Here an expert monitoring is needed. But back to 90% of the cases:

There is CPU time divided into CPU(host and indexserver) and SYS (System).

CPU on host level means the CPU time outside of HANA, e.g. third-party tools like AV, FS Backup etc.

CPU on indexserver means the CPU time of the indexserver without the interrupts and OS kernel time.

SYS is including the interrupts (I/O, network, etc.).

Other tools like nmon, top or the SQL collection will only break it down to CPU and SYS. Means indexserver CPU as CPU and all third party and interrupts as SYS.

HANA_LoadHistory_Services_*

SNAPSHOT_TIME HOST   PORT   PING_MS       CPU    SYS    USED_GB       SWAP_MB       CONNS  TRANS  BTRANS STMT_PS       ACT_THR       WAIT_THR     wait_thr_in_% ACT_SQL       WAIT_SQL      wait_sql_in_% PEND_SESS    VERSIONS      COM_RANGE    HANDLES       MERGES       UNLOADS

2022/08/03 10:02:00 hanahost00   any    26     83     1      2738,65       0      753    250    0      2098   397    55,66  14     48,16       42,16  88     0.00   108053 34376  2003   12     0

2022/08/03 04:47:00 hanahost00   any    42     70     1      2710,83       0      607    191    0      76     527,83 216,16 41     7,66       7,16   93     0.00   1503   574    1847   0      0

2022/08/03 04:46:00 hanahost00   any    45     81     1      2740,91       0      606    192    0      120    514,83 209,83 41     12,83       12,16  95     0.00   3044   922    1839   6      0

2022/08/03 04:45:00 hanahost00   any    55     92     1      2757,53       0      600    183    0      86     690,83 293,83 43     13,16       12,83  97     0.00   3724   866    1838   0      0

2022/08/03 04:44:00 hanahost00   any    52     87     1      2745,53       0      598    180    0      123    564,5  215,33 38     14,33       13,33  93     0.00   7350   1125   1844   0      0

2022/08/03 04:43:00 hanahost00   any    60     94     1      2765,34       0      600    182    0      77     774,33 371,16 48     19,16       17,66  92     0.00   13602  747    1833   4      0

2022/08/03 04:42:00 hanahost00   any    62     95     1      2776,03       0      601    182    0      104    789,83 390    49     21,5       20,83  97     0.00   15275  694    1835   0      0

2022/08/03 04:41:00 hanahost00   any    50     95     0      2769,15       0      597    187    0      70     743,33 346,83 47     21,5       20,66  96     0.00   3469   730    1818   0      0

2022/08/03 04:40:00 hanahost00   any    48     95     0      2757,77       0      594    177    0      94     719,33 321,66 45     21,5       20,83  97     0.00   3256   569    1838   0      0

2022/08/03 04:39:00 hanahost00   any    62     94     1      2776,18       0      590    184    0      69     773,16 370    48     23,16       22,83  99     0.00   3685   856    1840   1      0

2022/08/03 04:38:00 hanahost00   any    61     94     0      2768,06       0      594    177    0      93     738,66 336,83 46     22,5       21,5   96     0.00   2294   730    1828   2      0

2022/08/03 04:37:00 hanahost00   any    55     94     0      2768,63       0      600    175    0      63     771,16 370,33 48     24,16       22,83  94     0.00   2052   627    1838   0      0

2022/08/03 04:36:00 hanahost00   any    53     94     0      2767,8 0      0      180    0      75     730,83 330    45     23,83  23,16       97     0.00   3664   899    1836   2      0

2022/08/03 04:35:00 hanahost00   any    51     94     0      2760,59       0      605    185    0      64     736,5  334,66 45     23       22,5   98     0.00   5446   970    1843   12     0

2022/08/03 04:34:00 hanahost00   any    76     94     0      2795,51       0      596    179    0      120    852,16 449,5  53     24,66       24     97     0.00   4342   872    1844   0      0

2022/08/03 04:33:00 hanahost00   any    57     95     1      2782,26       0      598    181    0      62     826    424,33 51     23,33       23,16  99     0.00   2043   623    1836   21     0

2022/08/03 04:32:00 hanahost00   any    68     93     0      2750,41       0      600    178    0      97     715,66 336,33 47     25,16       23,5   93     0.00   1923   599    1839   0      0

2022/08/03 04:31:00 hanahost00   any    61     94     0      2768,73       0      598    176    0      64     790,33 391,5  50     23       22,83  99     0.00   4343   899    1837   0      0

2022/08/03 04:30:00 hanahost00   any    62     94     1      2795,97       0      594    177    0      97     877,5  477,33 54     23,5       23     98     0.00   3561   452    1843   0      0

2022/08/03 04:29:00 hanahost00   any    54     94     0      2750,98       0      597    177    0      59     817,16 417,83 51     23,66       23,33  99     0.00   3345   555    1849   9      0

2022/08/03 04:28:00 hanahost00   any    47     91     1      2774,15       0      597    177    0      146    761,16 369,83 49     24,5       23,83  97     0.00   2459   555    1845   1      0

2022/08/03 04:27:00 hanahost00   any    73     94     1      2792,96       0      600    180    0      40     853    451,33 53     23,16       23     99     0.00   1738   378    1843   1      0

2022/08/03 04:26:00 hanahost00   any    47     94     0      2763,75       0      598    180    0      49     734,2  338,4  46     24,4       23,4   96     0.00   1796   413    1837   0      0

2022/08/03 04:25:00 hanahost00   any    74     94     0      2786,59       0      598    185    0      36     794,66 394,16 50     23,16       22,83  99     0.00   5119   2079   1837   4      0

2022/08/03 04:24:00 hanahost00   any    48     94     1      2794,16       0      600    184    0      86     788,16 387,83 49     23,33       22,83  98     0.00   7957   4114   1847   6      0

2022/08/03 04:23:00 hanahost00   any    65     94     0      2833,23       0      608    183    0      496    855,5  453,66 53     24,5       23,33  95     0.00   8601   4093   1845   0      0

2022/08/03 04:22:00 hanahost00   any    78     94     0      2798,62       0      615    179    0      612    901,66 500,16 55     25,16       23,83  95     0.00   7236   2065   1854   16     0

2022/08/03 04:21:00 hanahost00   any    52     95     1      2782,35       0      614    172    0      72     930,83 528,83 57     25,66       24,83  97     0.00   5773   1255   1856   0      0

2022/08/03 04:20:00 hanahost00   any    56     94     0      2768,99       0      612    182    0      107    772,83 373,16 48     24,66       23,5   95     0.00   4697   844    1860   17     0

2022/08/03 04:19:00 hanahost00   any    43     95     0      2743,13       0      609    177    0      99     746,5  347,5  47     23,83       23     97     0.00   4134   872    1865   0      0

2022/08/03 04:18:00 hanahost00   any    63     94     1      2773,71       0      608    172    0      137    733    332,66 45     24,83       24,16  97     0.00   3110   1188   1844   0      0

2022/08/03 04:17:00 hanahost00   any    47     93     0      2757,57       0      604    174    0      87     651,33 263,33 40     25,33       24,66  97     0.00   3221   946    1838   0      0

2022/08/03 04:16:00 hanahost00   any    58     94     1      2829,29       0      600    174    0      83     929,83 529,66 57     27,33       26,5   97     0.00   2099   666    1835   0      0

2022/08/03 04:15:00 hanahost00   any    51     94     1      2744,07       0      599    174    0      54     845,66 443,16 52     24,16       24     99     0.00   964    273    1842   0      0

2022/08/03 04:14:00 hanahost00   any    46     78     0      2743,65       0      602    175    0      98     640,83 309,16 48     23,66       22,83  96     0.00   1709   334    1854   0      0

2022/08/02 23:03:00 hanahost00   any    54     74     5      2669,79       0      575    168    0      1605   500,16 181,66 36     41,66       32,16  77     0.00   2825   2123   1820   0      0

2022/08/02 23:00:00 hanahost00   any    42     70     4      2664,07       0      552    168    0      472    422,16 156,33 37     35,5       30,5   86     0.00   140    45     1799   0      0

2022/08/02 22:55:00 hanahost00   any    42     70     4      2660,87       0      555    163    0      364    455,66 181,83 40     35,5       30,16  85     0.00   24     47     1797   0      0

 

It is hard to read this format, but I just want to give you a feeling of what you will get if you are using the SQL collection. You have to import it in excel and create your own graphs. You can also use the HANA cockpit but you will never get such detailed information as you get it from the SQL scripts.

Find the critical time frames where the CPU is over 70%. Here you can also see the number of SQLs and threads at this point in time. When the ratio of ACT_SQL and WAIT_SQL is over 50% you should investigate more time for this time frame. In this example we can identify the timeframe from 04:14 to 04:47 as critical. Means the system was under high load. But why?

We can see a lower number of statements per second which is the consequence of the resource bottleneck. We also see a high number of threads. But was it only one statement which was not limited by the systems parameter or have we found a situation with multiple SQLs which running at the same point in time and block all resources although limited by the default_statement_concurrency_limit or even multiple workload classes?

Situation one is easy – just configure your system correctly with default parameter and/or workload classes.

Situation two needs more time for investigation. You have to find out the root cause. To be more accurate you need the statement hash, user or application name. With the latest version of the script you also have the possibility to check if the admission control was active at the point in time. This will be described in more detail in the associated chapter (part III).


Live monitoring of the dyn. Workload Management

In part I we already talked about the dyn. workload management

If you want to know how it is currently working in the system, check your current setup via SQL script.

HANA_Jobs_Executors_2.00*:

-----------------------------------------------------------------------------------------------------------------------------------

|HOST        |PORT |GROUP|CONC_CONFIG|CONC_CURR|TOTAL|BUSY|PARKED|FREE|SYSWAIT|JOBWAIT|IMM_MS|SHORT_MS|NORMAL_MS|SQL_BUSY|OTH_BUSY|

-----------------------------------------------------------------------------------------------------------------------------------

|saphana10000|30001|    0|         16|       16|   16|   0|     1|  12|      3|      0|  0.00|    0.02|     0.01|       0|       0|

|saphana10000|30001|    1|         16|       16|   16|   0|     1|  15|      0|      0|  0.00|    0.00|     0.02|       0|       0|

|saphana10000|30002|    0|         16|       16|   16|   0|     1|  15|      0|      0|  0.00|    0.00|     0.00|       0|       0|

|saphana10000|30002|    1|         16|       16|   14|   0|     0|  14|      0|      0|  0.00|    0.00|     0.00|       0|       0|


To understand what this results shows you, you have to understand the different values:

CONC_CONFIG: Configured max_concurrency value
CONC_CURR:   Currently used max_concurrency value (can be smaller than configured one in case of dynamic reductions)
TOTAL:       Total number of JobWorkers
BUSY:        Running JobWorkers
PARKED:      JobWorkers parked for future use
FREE:        Idle JobWorkers
SYSWAIT:     JobWorkers waiting for synchronization
JOBWAIT:     JobWorkers waiting for other JobWorkers
IMM_MS:      Average time an immediate job waits in some queue
SHORT_MS:    Average time a short running job waits in some queue
NORMAL_MS:   Average time a normal job waits in some queue
SQL_BUSY:    Running SqlWorker and SqlExecutor threads
OTH_BUSY:    Running threads different from SqlWorker, SqlExecutor and JobWorker

A repeatedly increased number of queued jobs can indicate that the JobWorker load (i.e. parallelized activities) is higher than the available JobWorker threads.

 

The internally used max_concurrency limitation for JobWorker threads can differ in some cases unintentionally from the setting of the related parameter, e.g. because an UNSET of the parameter wasn’t properly considered by the running system. If the used concurrency is much smaller than the intended one, the number of JobWorker threads is limited and throughput can reduce. In this case you can as a workaround manually set max_concurrency to a reasonably high explicit value.


Admission Control

Source: © 2022 SAP SE or an SAP affiliate company. All rights reserved.

The Admission Control should be the last bastion to prevent your system from overloading. You should not use or define it as workload management.

In this blog we will only cover the admission control part. In the next blog we will focus on the resource control.

 

How does it work?

You set up your thresholds at which workload the system should activate this control function. This can be CPU or memory percentage value. By default 90% of CPU – means it never gets active for memory. If you keep this default, it means when the system is reaching this threshold for a certain time (here is a gap inside the documentation how the sampling is working => averaging_factor, EWMA without a known time range) it is a) the control function is active and will do the following:

Source: © 2022 SAP SE or an SAP affiliate company. All rights reserved.

 

  1. Collection of statistics values (statistics_collection_interval; default 1000ms)
  2. weighting of the collected statistic value (averaging_factor; default 70; 100 means no averaging)

The decision manager will admit a session or queue/dequeue it. When one of the thresholds (CPU/memory) is reached queueing is active. It is important to know if a session has passed the check and become active it will not be reduced or limited in its resources afterwards.

If queueing is reaching a defined number of requests (max_queue_size; default: 10000) or the configured values (reject_memory_threshold / reject_cpu_threshold ; default: 0) rejecting is active.

 

What happens if a request is queued? When will it be released?

This depends on when the request was queued and which priority it has.

If the weighted statistics value drops below the threshold which will be checked every 1000ms by default (dequeue_interval)

A set of 50 (default; dequeue_size) requests will be released and assigned to free resources. If you have only small requests which are running faster than 1000ms you will wait also if there are more free resources.

 

This means you can increase your dequeue size that more requests can be dequeued in 1000ms

or

you can lower the dequeue interval that the dequeueing will only wait e.g. 500ms and not 1000ms.

 

Which variant you should prefer depends on the number of cores and your workload. If you have over 100 cores you can increase the dequeue size. If you have a workload like BW with a high runtime you should not lower the dequeue interval.

 

What happens if a request is rejected?

Rejecting means also that a SAP work process cannot connect to the DB and is not ready to work in the application server.

Even administration SQLs may not connect anymore. This means identifying the root cause of the issue can be difficult, if no SQL session can connect.

 

To check the latest events (be careful the view shows only 10000 events => controlled by indexserver.ini -> [admission_control_events] -> record_limit):

HANA_Workload_AdmissionControlEvents_2.00.010+

HANA Cockpit Admission control Monitor

 


 

Summary Admission Control

The admission control should only be active in unknown load scenarios preventing your system from overloading. It should not be used as a tool which is frequently active and acting as a bouncer/doorkeeper.

If the admission control is frequently active, check the root cause of the situation:

  • Was it caused by a known SQL / program
  • Is the system still sized correctly for the SQL requirements?
  • Are the parameters for memory and CPU correct?
  • Are there any user parmeters active?
  • Is the workload class mapping correct?

For all important workloads you should create workload classes to control and guarantee the needed resources for the priority of the workloads. Best example is if you have OLAP/BW workload on your OLTP/ERP system. The main workload should be OLTP, but due to the high degree of parallelism there can be a bottleneck in resources which will affect the OLTP load as well.

Exactly this to identify and to manage is one of the core responsibilities of HANA DBA. Know your system, your peaks and react early if there is something wrong with your resources. If you can successfully prioritize and manage the current system resources for tasks which are really needed, you won’t need to buy new expensive resources in the near future. Not always more hardware is the right answer. Use the existing hardware in the right manner.

Assigned Tags

      Be the first to leave a comment
      You must be Logged on to comment or reply to a post.