Why are scalar UDFs so slow?
When loading data from source systems, there are requirements in SAP BW that are often repeated at the field level. This includes, above all, logic for cleaning up and processing the data, such as:
- Add and remove leading zeros
- Case distinctions
- Remove spaces at the beginning or end
- Conversion to capital letters
- Derive one field from another with fixed pattern sub-strings, e.g.
- Derivation of time characteristics
The following listing shows a few examples in the source code:
outTab = SELECT TO_VARCHAR(TO_DATE(calday), 'YYYYMM') AS calmonth, "/BIC/SRCSYS", ... "/BIC/CURTYP", LEFT("/BIC/COAREA", 2) AS "/BIC/COUNTRY", SUBSTRING("/BIC/BI_PROFCT", 9, 2) as "/BIC/PCACCID", CASE SUBSTRING("/BIC/PROFCT", 8, 3) when '643' then '1' when '655' then '1' when '641' then '2' when '651' then '2' when '643' then '3' when '655' then '3' else '' end as "/BIC/PRBY', ... FROM :intab;
Typical expressions in the field list of a transformation routine
What all these requirements have in common is that they
- can be easily implemented in SQLScript using the existing SQL functions
- repeat themselves often
- can be executed very quickly by SAP HANA as long as they are implemented directly in the transformation routines.
DRY – Don’t Repeat Yourself
The well-known DRY principle should actually come into play here. So that you don’t repeat the source code all the time. In the field list, the scalar, user-defined functions (UDF) are available for this purpose. Until recently, these were not available for the transformation routines in the BW, because they can only be created with the AMDP Framework since AS ABAP Release 753 . But scalar UDFs would be ideal for precisely these requirements. This ensures a uniform implementation. And in the source code you can move from a technical description of the expressions to a business view:
outTab = SELECT "ZCL_CALDAY=>TO_CALMONTH"(calday) AS calday "/BIC/SRCSYS", ... "/BIC/CURTYP", "ZCL_COAREA=>TO_COUNTRY"("/BIC/COAREA") AS "/BIC/COUNTRY", "ZCL_PROFCT=>TO_PCACCID"(/BIC/PROFCT") as "/BIC/PCACCID", "ZCL_PROFCT=>TO_PRBY"(/BIC/PROFCT") as "/BIC/PRBY', ... FROM :intab;
Outsourcing the expressions to a scalar UDF
The outsourced functions look much more elegant. And they are also superior from the point of view of maintenance. For example, if another profit center appears in the list, there is exactly one function that needs to be adjusted. Using the example of
ZCL_PROFCT=>TO_PRBY I show such a scalar AMDP function:
CLASS zcl_profct DEFINITION PUBLIC FINAL CREATE PUBLIC . PUBLIC SECTION. INTERFACES if_amdp_marker_hdb. METHODS to_prby IMPORTING VALUE(iv_profct) TYPE char10 RETURNING VALUE(rv_prby) TYPE char1. ENDCLASS. CLASS zcl_profct IMPLEMENTATION. METHOD to_prby BY DATABASE FUNCTION FOR HDB LANGUAGE SQLSCRIPT OPTIONS DETERMINISTIC READ-ONLY. rv_prby = case SUBSTRING(:iv_profct, 8, 3) WHEN '643' then '1' WHEN '655' then '1' WHEN '641' then '2' WHEN '651' then '2' WHEN '643' then '3' WHEN '655' then '3' ELSE '' END ; ENDMETHOD. ENDCLASS.
The example shows that the function is just a simple wrapper for a
CASE expression. By the way, outsourcing to the ABAP world also makes it easy to write UnitTests for the functions. Since the functions in the database are only generated from the ABAP on the first call, UnitTests are also suitable for generating them.
Elegant but slow
As elegant as the outsourcing of logic in UDFs is, the concept is unfortunately not to be used for large amounts of data. This is because the UDFs have a significant influence on the runtime. I would like to show this in another expression from the example above:
This simple expression converts an ABAP
DATS value into a SQLScript date, which is then output to a
YYYYMM string. This corresponds to the format of the popular InfoObject
20200928 ==> 202009
For this purpose, I create an AMDP UDF according to the above pattern:
METHOD to_calmonth BY DATABASE FUNCTION FOR HDB LANGUAGE SQLSCRIPT OPTIONS DETERMINISTIC READ-ONLY. rv_result = TO_VARCHAR(TO_DATE(:iv_calday), 'YYYYMM'); ENDMETHOD.
I made two queries in the SQL console on a small BW table with about 3 million entries: one directly with the expression and one with the function call.
do begin select budat, TO_VARCHAR(TO_DATE(budat), 'YYYYMM'), account, amount, curr from "/BIC/AZBR_E1_S2"; end;
Statement 'do begin select budat, TO_VARCHAR(TO_DATE(budat), 'YYYYMM'), account, amount, curr from ...'
successfully executed in 501 ms 237 µs (server processing time: 801 ms 664 µs)
Fetched 1000 row(s) in 102 ms 468 µs (server processing time: 0 ms 677 µs)
Result limited to 1000 row(s) due to value configured in the Preferences
do begin select budat, "ZCL_CALDAY=>TO_CALMONTH"(BUDAT), account, amount, curr from "/BIC/AZBR_E1_S2"; end;
Statement 'do begin select budat, "ZCL_CALDAY=>TO_CALMONTH"(BUDAT), account, amount, curr from ...'
successfully executed in 2:09.418 minutes (server processing time: 4:18.632 minutes)
Fetched 1000 row(s) in 103 ms 144 µs (server processing time: 0 ms 675 µs)
Result limited to 1000 row(s) due to value configured in the Preferences
The runtime is so different that you actually have to assume a bug. Because the complexity of the example is minimal. With less complexity, I don’t need UDF anymore. What I find amazing:
- In fact, there are only 365 different values of BUDAT in the table. Thus, the system has the information: With the keyword DETERMINISTIC the system has the information: The same input should create the same output. See ABAP Documentation. This should be executed a maximum of 365 times and then be read out of the buffer, right?
- I would also have expected only the dictionary of the column to be converted, so each value exactly once.
- If you had replaced the function call with the function content with a primitive text editor with simple search/replacement, you are over 100 times faster. This is what the optimizer of an SAP HANA 7 years after the introduction of scalar UDFs should be able to do, right? The first blog post on the topic of Rich Heilmann was from 2013
In order to fully document the whole thing, I also did the analysis with PlanViz.
Planviz of the Query without UDF
PlanViz of the Query with UDF
It is clearly visible that, by and large, the same implementation plan is chosen. But the runtime of the one, red-marked JECalculate node contains the entire runtime for the UDF call. This should at least be parallelized.
I also stumbled across the
HINT (INLINE) is supposed to trigger exactly the desired behavior: Optimization should be done globally. But unfortunately there is nothing in the documentation about functions. Everything refers to procedures. And that corresponds to my observation: the hint brings absolutely no change. It remains slow.
do begin select budat, "ZCL_CALDAY=>TO_CALMONTH"(BUDAT), account, amount, curr from "/BIC/AZBR_E1_S2" WITH HINT(INLINE); end;
Why is that? Do I have a misconception in my example? Can I still optimize this? Or are the UDFs just so slow. While researching forums, I have seen that I am not alone with my problems. There are also several ways to slow down the system with unfavorable UDFs. Some examples from answers.sap.com:
- Scalar Function Memory Allocation – Poor Performance
- Scalar UDF with several returning parameters
- HANA – Scalar Function not executing in parallel
In any case, there is still great potential. At least for simple expressions without querying other DB tables.
I know that there are also complex requirements that cannot be reduced to a single expression. But then at least parallel processing would be desirable.