Technology Blogs by Members
jgleichmann
Active Contributor

last updated: 2023-04-11 10:15 CEST


We all know that HANA needs a lot of system resources due to its in-memory design. If the productive systems are not monitored well, the system can get overloaded by too highly parallelized applications, a memory leak or an SQL statement which was not restrictive enough, ending up in an OOM (out of memory) dump. This behavior is nothing special and can and will happen to any system. This is not only the case for user workload: system workload like the merge, the partitioning process and the consistency check can also be controlled.

There are several automatisms that are triggered by threshold events. Other actions are managed by dynamic workload management or parameters.

How can you manage it?

To handle the workload of your system correctly,


you have to understand how HANA manages its resources (part I),


how you can identify bottlenecks (part II) and


how to configure the right parameters / workload classes (part III).



This can save money if you solved such problems in the past with more hardware, and it avoids bottlenecks for your important workload in peak situations.



    1. Workload - holistic view

    2. Dyn. Workload

    3. Parameters

    4. Takeaway








HANA workload management deep dive part II



    1. General workload management

    2. Admission Control








HANA workload management deep dive part III



    1. User parameters

    2. Workload Classes









Workload - holistic view


Every system should be sized carefully regarding CPU and memory resources, and it should be monitored frequently regarding database growth and CPU peaks. To understand how workload management works, you first must know about the internal treatment. On the Intel platform, which is the most widespread one (this includes the hyperscalers), you have 2 threads per physical core with activated Hyper-Threading (HT); otherwise you only have one. SMT (simultaneous multi-threading) on IBM Power (and on the non-Intel servers of the IBM Cloud) depends on the setup. With Power9 you normally have 8 threads per core, and on Power10 you have 4 when you want to run HANA workload.

But don't get confused by these threads. If one thread of a core is 100% utilized by pure calculation load, the other thread(s) cannot push beyond this 100%. Only if there are interrupts for memory, network or disk accesses can more threads per core be useful. This means that a pure CPU calculation load with a lot of threads is handled best by more physical cores rather than more logical threads. If the physical core is overloaded, you get wait events. In the end, HT only helps if there are unused processor cycles due to interrupts. Less utilized systems or workloads that are not highly multi-threaded may not benefit from enabling SMT. It always depends on your workload!

Intel: has more physical cores (Hyper-Threading gains on average 15% performance, depending on the load)
Power: has better single-thread performance due to the higher clock frequency and higher memory bandwidth

 


Source: hardwareinside.de


OK, now we know more about cores and threads. But how does HANA handle the CPU resources?
To get an overview you can run the "infrastructure overview" script from the SQL collection (note 1969700).
You will get something like this:
HOST   CPU_CLOCK  CORES  THREADS  SOCKETS  CPU_MANUFACTURER  CPU_MODEL
=====  =========  =====  =======  =======  ================  =============================================
host1  2693       96     192      4        GenuineIntel      Intel(R) Xeon(R) Platinum 8280M CPU @ 2.70GHz

One socket means one NUMA node. Each NUMA node has its locally attached memory connected via the bus system. The NUMA scoring is also an important aspect and has a big influence on your performance. Pay attention when you are running your systems virtualized!
In our example we have 96 cores with HT on => 192 threads.
96(cores) / 4(sockets) = 24
This means we have 24 cores per socket / numa node resulting in 48 threads each node.
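The topology arithmetic above can be sketched as a small helper (the function and key names are mine for illustration, not a HANA API):

```python
# Derive per-socket topology from the values reported by the
# "infrastructure overview" script: 96 cores, 192 threads, 4 sockets.
def numa_topology(cores, threads, sockets):
    threads_per_core = threads // cores    # 2 with Hyper-Threading on
    cores_per_node = cores // sockets      # one socket == one NUMA node
    return {
        "threads_per_core": threads_per_core,
        "cores_per_node": cores_per_node,
        "threads_per_node": cores_per_node * threads_per_core,
    }

topo = numa_topology(cores=96, threads=192, sockets=4)
print(topo)
# {'threads_per_core': 2, 'cores_per_node': 24, 'threads_per_node': 48}
```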

HANA detects the visible CPUs - meaning the logical cores - at every startup. Changes to the topology must not be propagated from the OS to HANA at runtime (on IBM PPC, /proc/powerpc/topology_updates should be set to off). Otherwise the database can crash due to NUMA changes.

If you want to know more about NUMA and its possibilities / how to analyze / how to tune, please have a look at the documentation.

The default_statement_concurrency_limit is set to 48 - therefore an application workload may consume 25% of the available CPU thread resources in our scenario (192 threads).

If, however, the application submits 4 such requests via 4 different sessions simultaneously this workload may easily exhaust the CPU resources. This illustrates the importance of finding the right workload management balance for your individual system; there is no "one-configuration-fits-all” approach and you may therefore need to change the workload management related configuration parameters to fit your own specific application requirements. In the given scenario, for example, the solution may be to further decrease the concurrency degree of individual database requests by adjusting default_statement_concurrency_limit to even lower levels.
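The exhaustion scenario above boils down to simple arithmetic (variable names chosen to match the parameters):

```python
# 4 sessions, each allowed up to default_statement_concurrency_limit = 48
# threads, on a node with 192 logical CPU threads.
CPU_THREADS = 192
default_statement_concurrency_limit = 48  # 25 % of 192

sessions = 4
demand = sessions * default_statement_concurrency_limit
print(demand, demand >= CPU_THREADS)  # 192 True -> CPU fully exhausted
```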

If you want to find out which statements might be affected by adjusting this parameter, just run "HANA_Threads_ThreadSamples_StatisticalRecords" and, in the modification section, set MIN_PX_MAX to 25% of the threads (48 in our example). All statements in the result set might be affected by this default configuration change.

By default, HANA takes the number of these logical threads as the basis value for:

SQL executors
SQL executors are threads which are responsible for normal SQL request processing.

Job workers
Job workers are threads which are responsible for processing parallelized OLAP load and internal activities like savepoints or garbage collection.


 

In this case we have 192 threads to handle the workload of the system. Some of them are frequently used by internal system actions like savepoints, disk flushes, system replication snapshots, delta merges etc.
The rest can be occupied by the user workload:
OLAP workload, for example reporting in BW systems or with SAP Analytics Cloud live connections.
OLTP workload, for example transactions in an ERP system.
Mixed workload, meaning both OLAP and OLTP, for example modern ERP systems with transactions and analytical reporting.

Every database session (1:1 relation to one work process inside the application server) can have multiple threads.
If the thread state is "Job Exec Waiting", the process is waiting for a JobWorker thread to execute the actual work. The reason can be a limitation of the system (e.g. admission control / dynamic workload management), the SQL thread itself (workload class or user parameter), or a system bottleneck. In general, waiting threads should not exceed 5% over a longer time frame. Otherwise this results in queueing events like mini checks M0882 "Max. parked JobWorker ratio", M0883 "Queued jobs" or M0888 "Job queueing share (%, short-term)".
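As an illustration of the 5% guideline, a hypothetical helper (not a HANA API; the sample data is made up) that computes the waiting share from a list of sampled thread states could look like this:

```python
# Share of thread samples in state "Job Exec Waiting"; above ~5 % over a
# longer time frame the queueing mini checks (M0882/M0883/M0888) fire.
def job_exec_waiting_share(samples):
    waiting = sum(1 for state in samples if state == "Job Exec Waiting")
    return 100.0 * waiting / len(samples)

# 90 running samples, 10 waiting samples -> 10 % waiting share.
samples = ["Running"] * 90 + ["Job Exec Waiting"] * 10
share = job_exec_waiting_share(samples)
print(f"{share:.1f} % waiting -> investigate: {share > 5.0}")
# 10.0 % waiting -> investigate: True
```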

 




Dyn. Workload


If there is a high number of JobWorkers consuming CPU resources that would better be used by SqlExecutors or request threads processing OLTP load, the number will automatically be reduced down to 30-40%. You can influence this behavior by setting some parameters (check the section below or note 2222250).
If there are many active but waiting threads in the system, these settings can significantly reduce the dynamic concurrency. Even with a value of 100 for max_concurrency_dyn_min_pct, it was observed that the dynamic concurrency was reduced to around 20% of max_concurrency when hundreds of waiting threads were permanently active in the system. In general, you should eliminate the root cause of the waiting threads in these scenarios, but in some cases it can be beneficial to reduce the two parameter values in order to limit concurrency reductions.
With SAP HANA >= 2.00.059.03 and >= 2.00.063, an improved dynamic calculation reducing the risk of too low values is available.
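To make the role of the waiting threads tangible: the following is NOT SAP's actual reduction algorithm, only a simplified illustration of how waiting threads contribute to the load when they are weighted with the default other_threads_act_weight of 40%:

```python
# Simplified illustration (not the real HANA formula): running "other"
# threads (SqlExecutor, Request, ...) count fully, waiting ones are
# weighted with other_threads_act_weight percent.
other_threads_act_weight = 40  # percent, default

def weighted_other_load(running, waiting):
    return running + waiting * other_threads_act_weight / 100.0

# 20 running threads plus 300 permanently waiting threads still yield a
# large weighted load - which is why the dynamic concurrency can drop
# far below max_concurrency in such scenarios.
print(weighted_other_load(running=20, waiting=300))  # 140.0
```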

 




Parameters


###########
General CPU
###########






indexserver.ini -> [sql] -> sql_executors






The number of SQL executors per SAP HANA node can be configured with this parameter.

<service>.ini -> [sql] -> max_sql_executors






The maximum number of SQL executors per SAP HANA node can be configured with this
parameter.

<service>.ini -> [session] -> busy_executor_threshold






When SAP HANA reaches the sql_executors soft limit, it will wait for the number of microseconds defined with this parameter before starting an additional SQL executor thread.
If you have a highly parallelized system with many short transactions (below 200 ms), you should consider lowering this value.
Example: If you use the default busy_executor_threshold of 1 second, there are 10 session partitions and the system is at the sql_executors limit for five seconds, you can see up to 10 * 5 = 50 additional SQL executors.
You can use SQL: "HANA_Sessions_Partitions" (SAP Note 1969700) to check for existing session partitions.
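The worked example above is plain arithmetic - one additional executor per session partition per second at the limit:

```python
# With the default busy_executor_threshold of 1 second, each session
# partition may start one additional SQL executor per second spent at
# the sql_executors limit.
session_partitions = 10
seconds_at_limit = 5
extra_executors = session_partitions * seconds_at_limit
print(extra_executors)  # 50
```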

<service>.ini -> [execution] -> max_concurrency






Job workers are threads which are responsible for processing parallelized OLAP load and internal activities like savepoints or garbage collection. The maximum number of logical CPUs consumed by JobWorkers per SAP HANA node can be configured with this parameter.
If you are running multiple tenants, it may be wise to reduce this value below the maximum available logical cores. If the value is not set or 0, HANA will determine the number of logical CPUs on every startup.

<service>.ini -> [execution] -> default_statement_concurrency_limit






This parameter controls the maximum overall parallelism for a single database request per SAP
HANA node.

global.ini -> [execution] -> other_threads_act_weight = 40






The dynamic workload reduction is based on the number of other threads (e.g. SqlExecutor, Request, background threads) being active.
Running threads are counted fully while active, but waiting threads are only counted with the percentage defined by this parameter (default: 40).

global.ini -> [execution] -> load_factor_sys_weight_pct (in %, default: 10)
global.ini -> [execution] -> load_factor_job_weight_pct (in %, default: 20)






Be aware that max_concurrency_dyn_min_pct only controls one part of the dynamic concurrency
reduction.
If there are many active, but waiting threads in the system, these settings can significantly reduce
the dynamic concurrency. Even with a value of 100 for max_concurrency_dyn_min_pct it was
observed that the dynamic concurrency was reduced to around 20 % of max_concurrency when
hundreds of waiting threads were permanently active in the system.


 

Since Rev. 56, hdblcm will calculate some of these parameters during the installation or an update/upgrade. This is enabled by the parameter --apply_system_size_dependent_parameters=on, which is the default.






indexserver.ini [parallel] tables_preloaded_in_parallel = MAX(5, 0.1 * CPU_THREADS)
indexserver.ini [optimize_compression] row_order_optimizer_threads = MAX(4, 0.25 * CPU_THREADS)
global.ini [execution] default_statement_concurrency_limit = 0.25 * CPU_THREADS (only applied if HANA has at least 16 CPU threads)
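The size-dependent formulas above can be evaluated for our 192-thread example host (the helper name and the rounding down are my assumptions; hdblcm may round differently):

```python
# Size-dependent defaults as given in the formulas above,
# computed from the number of logical CPU threads.
def size_dependent_parameters(cpu_threads):
    params = {
        "tables_preloaded_in_parallel": max(5, int(0.1 * cpu_threads)),
        "row_order_optimizer_threads": max(4, int(0.25 * cpu_threads)),
    }
    # The statement concurrency limit is only applied on hosts
    # with at least 16 CPU threads.
    if cpu_threads >= 16:
        params["default_statement_concurrency_limit"] = int(0.25 * cpu_threads)
    return params

print(size_dependent_parameters(192))
# {'tables_preloaded_in_parallel': 19, 'row_order_optimizer_threads': 48,
#  'default_statement_concurrency_limit': 48}
```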

 

################
General Memory
################






global.ini -> [memorymanager] -> statement_memory_limit






Defines the maximum memory allocation per statement in GB. The parameter is not set by default.

global.ini -> [memorymanager] -> statement_memory_limit_threshold






Defines a percentage of the global allocation limit.
The parameter statement_memory_limit is respected only if total used memory exceeds the global allocation limit by this threshold percentage. The default value is 0% (of the global_allocation_limit), so statement_memory_limit is always respected.

global.ini -> [memorymanager] -> total_statement_memory_limit








A value in gigabytes that defines the maximum memory available to all statements running on the system. The default value is 0 (no limit).


Details: 3202692 - How to set Memory Limit for SQL Statements
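A sketch of how such limits could be set online via ALTER SYSTEM ALTER CONFIGURATION (the values 50 and 500 GB are arbitrary examples, not recommendations; follow SAP Note 3202692 for proper sizing):

```sql
-- Example only: cap a single statement at 50 GB and all statements
-- together at 500 GB (illustrative values, not a recommendation).
ALTER SYSTEM ALTER CONFIGURATION ('global.ini', 'SYSTEM')
  SET ('memorymanager', 'statement_memory_limit') = '50',
      ('memorymanager', 'total_statement_memory_limit') = '500'
  WITH RECONFIGURE;
```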

 

##################
Delta Merges
##################






indexserver.ini -> [indexing] -> parallel_merge_part_threads
indexserver.ini -> [indexing] -> parallel_merge_threads






The load_balancing_func parameter controls the overall merge load, parallel_merge_part_threads limits the number of partitions per host and table that can be merged concurrently, while token_per_table and parallel_merge_threads limit the single table (partition) merge load.
If you have long-running delta merges, please first consider partitioning the table. If you have multiple merges at the same time (BW ETL smart merge), you may increase the number of parallel merge threads.

indexserver.ini -> [mergedog] -> num_merge_threads






The maximum number of parallel auto merges is controlled by this parameter (default: 3).

indexserver.ini -> [mergedog] -> max_cpuload_for_parallel_merge






This parameter controls the maximum CPU utilization that still allows more than one auto merge to be started in parallel (default: 45, unit: %).


The default should only be adjusted if you are sure that you have enough system resources besides the normal workload.

##################
Optimize compression
##################






indexserver.ini -> [optimize_compression] ->
change_compression_threads
estimate_compression_threads
get_candidates_threads
prepare_threads
row_order_optimizer_threads (10 % - 25 % of available CPU threads, at least 4)






These parameters control the number of parallel threads in different phases of optimize
compression runs. See SAP Note 2112604 for more information related to SAP HANA compression.


 

##################
Partitioning
##################






indexserver.ini -> [partitioning] ->
bulk_load_threads
split_threads






bulk_load_threads: This parameter controls the number of threads used for bulk (= group) loading into a partitioned table.
split_threads: This parameter controls the parallelism of split and merge operations during repartitioning activities.


 




Takeaway


It can happen that one query or multiple SQL statements consume all CPU resources of the system if your parameters or workload classes are defined incorrectly. With proper parametrization or workload classes this won't happen. Be aware of which load your system must handle. Do frequent health checks to verify that your system is still able to handle the load as the data payload grows. With every resizing, your parameters and workload classes have to be reviewed.

Most monitoring tools just display the number of occupied CPUs, but do not take SMT into account, nor whether threads are merely waiting ("stalled"). Stalled means the processor is not making forward progress with instructions, which usually happens because it is waiting on memory I/O.

What is displayed:



What is really the case:


Source: CPU Utilization is Wrong

Check the real values with tools like perf, top or tiptop, which can display the IPC (instructions per cycle). If this value is below 1, it can be an indicator that you are likely memory-stalled. For more insights on this topic, read the publications of Brendan Gregg.
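As an illustration, IPC is simply the ratio of the two counters that perf stat reports (the sample counter values below are made up):

```python
# IPC (instructions per cycle): below ~1 on modern cores is a hint that
# the workload is likely memory-stalled rather than compute-bound
# (see Brendan Gregg's "CPU Utilization is Wrong" for caveats).
def ipc(instructions, cycles):
    return instructions / cycles

# Counter values as you would read them from "perf stat" output:
value = ipc(instructions=180_000_000, cycles=400_000_000)
print(f"IPC = {value:.2f} -> likely memory stalled: {value < 1.0}")
# IPC = 0.45 -> likely memory stalled: True
```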