Why are scalar UDFs so slow?
When loading data from source systems, there are requirements in SAP BW that are often repeated at the field level. This includes, above all, logic for cleaning up and processing the data, such as:
- Add and remove leading zeros
- Case distinctions
- Remove spaces at the beginning or end
- Conversion to capital letters
- Derive one field from another with fixed pattern sub-strings, e.g.
- Derivation of time characteristics
The following listing shows a few examples in the source code:
outTab = SELECT TO_VARCHAR(TO_DATE(calday), 'YYYYMM') AS calmonth, "/BIC/SRCSYS", ... "/BIC/CURTYP", LEFT("/BIC/COAREA", 2) AS "/BIC/COUNTRY", SUBSTRING("/BIC/BI_PROFCT", 9, 2) as "/BIC/PCACCID", CASE SUBSTRING("/BIC/PROFCT", 8, 3) when '643' then '1' when '655' then '1' when '641' then '2' when '651' then '2' when '643' then '3' when '655' then '3' else '' end as "/BIC/PRBY', ... FROM :intab;
Typical expressions in the field list of a transformation routine
What all these requirements have in common is that they
- can be easily implemented in SQLScript using the existing SQL functions
- repeat themselves often
- can be executed very quickly by SAP HANA as long as they are implemented directly in the transformation routines.
DRY – Don’t Repeat Yourself
The well-known DRY principle should actually come into play here. So that you don’t repeat the source code all the time. In the field list, the scalar, user-defined functions (UDF) are available for this purpose. Until recently, these were not available for the transformation routines in the BW, because they can only be created with the AMDP Framework since AS ABAP Release 753 . But scalar UDFs would be ideal for precisely these requirements. This ensures a uniform implementation. And in the source code you can move from a technical description of the expressions to a business view:
outTab = SELECT "ZCL_CALDAY=>TO_CALMONTH"(calday) AS calday "/BIC/SRCSYS", ... "/BIC/CURTYP", "ZCL_COAREA=>TO_COUNTRY"("/BIC/COAREA") AS "/BIC/COUNTRY", "ZCL_PROFCT=>TO_PCACCID"(/BIC/PROFCT") as "/BIC/PCACCID", "ZCL_PROFCT=>TO_PRBY"(/BIC/PROFCT") as "/BIC/PRBY', ... FROM :intab;
Outsourcing the expressions to a scalar UDF
The outsourced functions look much more elegant. And they are also superior from the point of view of maintenance. For example, if another profit center appears in the list, there is exactly one function that needs to be adjusted. Using the example of
ZCL_PROFCT=>TO_PRBY I show such a scalar AMDP function:
CLASS zcl_profct DEFINITION PUBLIC FINAL CREATE PUBLIC . PUBLIC SECTION. INTERFACES if_amdp_marker_hdb. METHODS to_prby IMPORTING VALUE(iv_profct) TYPE char10 RETURNING VALUE(rv_prby) TYPE char1. ENDCLASS. CLASS zcl_profct IMPLEMENTATION. METHOD to_prby BY DATABASE FUNCTION FOR HDB LANGUAGE SQLSCRIPT OPTIONS DETERMINISTIC READ-ONLY. rv_prby = case SUBSTRING(:iv_profct, 8, 3) WHEN '643' then '1' WHEN '655' then '1' WHEN '641' then '2' WHEN '651' then '2' WHEN '643' then '3' WHEN '655' then '3' ELSE '' END ; ENDMETHOD. ENDCLASS.
The example shows that the function is just a simple wrapper for a
CASE expression. By the way, outsourcing to the ABAP world also makes it easy to write UnitTests for the functions. Since the functions in the database are only generated from the ABAP on the first call, UnitTests are also suitable for generating them.
Elegant but slow
As elegant as the outsourcing of logic in UDFs is, the concept is unfortunately not to be used for large amounts of data. This is because the UDFs have a significant influence on the runtime. I would like to show this in another expression from the example above:
This simple expression converts an ABAP
DATS value into a SQLScript date, which is then output to a
YYYYMM string. This corresponds to the format of the popular InfoObject
20200928 ==> 202009
For this purpose, I create an AMDP UDF according to the above pattern:
METHOD to_calmonth BY DATABASE FUNCTION FOR HDB LANGUAGE SQLSCRIPT OPTIONS DETERMINISTIC READ-ONLY. rv_result = TO_VARCHAR(TO_DATE(:iv_calday), 'YYYYMM'); ENDMETHOD.
I made two queries in the SQL console on a small BW table with about 3 million entries: one directly with the expression and one with the function call.
do begin select budat, TO_VARCHAR(TO_DATE(budat), 'YYYYMM'), account, amount, curr from "/BIC/AZBR_E1_S2"; end;
Statement 'do begin select budat, TO_VARCHAR(TO_DATE(budat), 'YYYYMM'), account, amount, curr from ...'
successfully executed in 501 ms 237 µs (server processing time: 801 ms 664 µs)
Fetched 1000 row(s) in 102 ms 468 µs (server processing time: 0 ms 677 µs)
Result limited to 1000 row(s) due to value configured in the Preferences
do begin select budat, "ZCL_CALDAY=>TO_CALMONTH"(BUDAT), account, amount, curr from "/BIC/AZBR_E1_S2"; end;
Statement 'do begin select budat, "ZCL_CALDAY=>TO_CALMONTH"(BUDAT), account, amount, curr from ...'
successfully executed in 2:09.418 minutes (server processing time: 4:18.632 minutes)
Fetched 1000 row(s) in 103 ms 144 µs (server processing time: 0 ms 675 µs)
Result limited to 1000 row(s) due to value configured in the Preferences
The runtime is so different that you actually have to assume a bug. Because the complexity of the example is minimal. With less complexity, I don’t need UDF anymore. What I find amazing:
- In fact, there are only 365 different values of BUDAT in the table. Thus, the system has the information: With the keyword DETERMINISTIC the system has the information: The same input should create the same output. See ABAP Documentation. This should be executed a maximum of 365 times and then be read out of the buffer, right?
- I would also have expected only the dictionary of the column to be converted, so each value exactly once.
- If you had replaced the function call with the function content with a primitive text editor with simple search/replacement, you are over 100 times faster. This is what the optimizer of an SAP HANA 7 years after the introduction of scalar UDFs should be able to do, right? The first blog post on the topic of Rich Heilmann was from 2013
In order to fully document the whole thing, I also did the analysis with PlanViz.
Planviz of the Query without UDF
PlanViz of the Query with UDF
It is clearly visible that, by and large, the same implementation plan is chosen. But the runtime of the one, red-marked JECalculate node contains the entire runtime for the UDF call. This should at least be parallelized.
I also stumbled across the
HINT (INLINE) is supposed to trigger exactly the desired behavior: Optimization should be done globally. But unfortunately there is nothing in the documentation about functions. Everything refers to procedures. And that corresponds to my observation: the hint brings absolutely no change. It remains slow.
do begin select budat, "ZCL_CALDAY=>TO_CALMONTH"(BUDAT), account, amount, curr from "/BIC/AZBR_E1_S2" WITH HINT(INLINE); end;
Why is that? Do I have a misconception in my example? Can I still optimize this? Or are the UDFs just so slow. While researching forums, I have seen that I am not alone with my problems. There are also several ways to slow down the system with unfavorable UDFs. Some examples from answers.sap.com:
- Scalar Function Memory Allocation – Poor Performance
- Scalar UDF with several returning parameters
- HANA – Scalar Function not executing in parallel
In any case, there is still great potential. At least for simple expressions without querying other DB tables.
I know that there are also complex requirements that cannot be reduced to a single expression. But then at least parallel processing would be desirable.
Very nice post!
I got some ideas about the slowness of the presented UDF and I am really eager to see what the SAP HANA experts and developers have to say about this.
Alright, as nobody seemed to feel the urge to look into this and post anything here, I decided to write up my findings.
You can find them here: Why are your scalar UDFs slow? (mine aren’t)
As the title of my post indicates: scalar UDFs don't have to be slow.
Hope that helps and maybe there will be some official documentation/explanation about the performance of UDFs and the scenarios where they should/shouldn't be used.
thank you for your article. What version of SAP HANA are you using?
I have experimented a bit. Some things from your article I can reproduce, especially that DETERMINISTIC is not good for the performance.
Other things behave differently on my system. In particular, the performance of UDFs is still worse than the direct integration of logic. You can find details in the german version of the article on my website. An english translation and an update on the SCN will follow later.
I used the HANA Express Edition 2.00.45.
The thing is, SAP does not provide documentation about what one can expect from the optimizer in which scenario. It's not even clear, what specific limitations or features the different optimisation steps provide and what conditions enable/disable which plan transformations.
The current openSAP course does nothing to improve on that - so all one can do is to look closely at the plans generated on the specific version you're using.
From what I can see in your blog post, the plan still involves the evali() call - which, as I understand it, basically means "handover to SQL Script processing" with all that comes with it (including memory usage).
With my HXE system the plan switched to actually have the SQL expressions executed as if these were part of the SELECT statement itself.
Anyhow, without a clear idea of what to expect from the optimizer(s) all performance-oriented SQL development is guess, try, and error.
UDF is defined on HANA using SQLscript. You are using ABAP function.
This is an AMDP, an ABAP Managed Database Procedure.
Look at this detail in the method implementation:
This means: The ABAP System will create a database function (UDF) with the source code that I enter in the method. And this UDF can then be called by an other database procedure.