Window function vs. self-join in SAP HANA
Intro
As we know, SAP HANA introduced window function since SPS05 Rev. 45. You can find this info from SAP HANA SQL (Changed) – What’s New in the SAP HANA Platform (Release Notes) – SAP Library, section “New SQL Statements and Functions” as follows.
“Windows Functions (new)RANK, DENSE_RANK, ROW_NUMBER, LEAD, LAG: Divide result sets of a query into groups of rows called window partitions.“
So, in this blog I want to share something about window function in SAP HANA with you. Since Window Functions – SAP HANA SQL and System Views Reference – SAP Library has already showed the syntax, the description and examples of window function in detail, I won’t explain more about the basics of window function here. However, I’ll focus on the comparison of window function and self-join in SAP HANA to show you why we need window function in SAP HANA.
Motivation
Recently I reviewed SAP HANA SQL and System Views Reference – SAP Library and when I read the section of window function, an question came to my mind. Can we implement the window function feature without using window function itself? Or in other words, if there is no window function in SAP HANA, how can we implement ROW_NUMBER, RANK, DENSE_RANK, etc.. these functions? Is it possible? ๐ Now let’s have a discussion/brain storming together.
Problems
Consider example 1 from Window Functions – SAP HANA SQL and System Views Reference – SAP Library. I just pasted the window dump as below.
Without window function in SAP HANA, can we implement ROW_NUMBER, RANK and DENSE_RANK and get the same results? It took a while for me to think about this question. Although I did not give it a shot, theoretically it should be possible if we use Cursors or ARRAY in SQLScript or we can use some methods described in SAP HANA: Dynamic Ranking using Script based (SQL Script) Vs Graphical Calculation view Vs Script based (CE Functions). However, all of these methods are a little bit complicated, since we need the knowledge of SQLScript and modeling which are not known by most new to SAP HANA. So an idea came to my mind, it’s better if we can directly use SQL to implement this. So is it possible to just use one SQL statement to implement ROW_NUMBER, RANK and DENSE_RANK respectively?
In order to answer this question, first of all we need to clarify one thing. Why we need window function and what’s the advantage of window function? Maybe you can point lots of reasons. Certainly you can. Here I just want to point out one reason – avoid self-join, because we can use self-join to implement most window functions. Believe it or not. Now let’s give it a shot.
Scenario
For simplicity, I created a more simple scenario/table as follows and we just focus on COUNT, AVG, ROW_NUMBER, RANK and DENSE_RANK these five window functions.
CREATE COLUMN TABLE S (
CLASS VARCHAR(10),
VAL INTEGER
);
INSERT INTO S VALUES ('A', 1);
INSERT INTO S VALUES ('A', 1);
INSERT INTO S VALUES ('A', 1);
INSERT INTO S VALUES ('A', 2);
INSERT INTO S VALUES ('A', 2);
INSERT INTO S VALUES ('A', 3);
INSERT INTO S VALUES ('B', 4);
As you can see, we ignore the OFFSET column in example 1 above, since COUNT, AVG, ROW_NUMBER, RANK and DENSE_RANK do not need this column. Identical to example 1, we want to partition by CLASS and order by VAL to do some ranking stuff. In addition, I’ll also show COUNT and AVG.
Window function vs. self-join
1. COUNT
Window function
SELECT CLASS, COUNT(1) OVER (PARTITION BY CLASS) CNT FROM S ORDER BY CLASS;
Self-join
SELECT A.CLASS, B.CNT FROM S A INNER JOIN
(SELECT CLASS, COUNT(1) CNT FROM S GROUP BY CLASS) B
ON A.CLASS = B.CLASS ORDER BY A.CLASS;
The logic is simple, first you create B which calculates COUNT of each class, then you make a join.
2. AVG
Window function
SELECT CLASS, AVG(VAL) OVER (PARTITION BY CLASS) AVG FROM S ORDER BY CLASS;
Self-join
SELECT A.CLASS, B.AVG FROM S A INNER JOIN
(SELECT CLASS, AVG(VAL) AVG FROM S GROUP BY CLASS) B
ON A.CLASS = B.CLASS ORDER BY A.CLASS;
The logic is similar with the example of COUNT.
3. ROW_NUMBER
Window function
SELECT CLASS, VAL, ROW_NUMBER() OVER (PARTITION BY CLASS ORDER BY VAL) ROW_NUM FROM S ORDER BY CLASS, ROW_NUM;
Self-join
SELECT A.CLASS, A.VAL, COUNT(1) ROW_NUM FROM S A INNER JOIN S B ON
A.CLASS = B.CLASS AND (A.VAL > B.VAL OR (A.VAL = B.VAL AND A."$rowid$" >= B."$rowid$"))
GROUP BY A.CLASS, A.VAL, A."$rowid$" ORDER BY CLASS, ROW_NUM;
First of all, you can see “$rowid$” is used in this example. You can find what it is from Do we have a ROWID or equivalent in HANA ? | SAP HANA. Actually “$rowid$” is undocumented and only available with column table. That’s why I created column table instead of row table.
Regarding the logic, the ROW_NUM of each row is determined by “# of VAL which is less than mine” plus “# of VAL which is equal to mine and its row id is less than or equal to mine”. Hope you can understand the logic. You may ask why I have to use the undocumented “$rowid$”. Because S has no primary key, we need to have a “logical” primary key instead.
4. RANK
Window function
SELECT CLASS, VAL, RANK() OVER (PARTITION BY CLASS ORDER BY VAL) RANK FROM S ORDER BY CLASS, RANK;
Self-join
SELECT A.CLASS, A.VAL, COUNT(1) RANK FROM S A INNER JOIN S B ON
A.CLASS = B.CLASS AND (A.VAL > B.VAL OR (A.VAL = B.VAL AND A."$rowid$" = B."$rowid$"))
GROUP BY A.CLASS, A.VAL, A."$rowid$" ORDER BY CLASS, RANK;
The logic is similar with ROW_NUMBER. Hope you can understand it.
5. DENSE_RANK
Window function
SELECT CLASS, VAL, DENSE_RANK() OVER (PARTITION BY CLASS ORDER BY VAL) DENSE_RANK
FROM S ORDER BY CLASS, DENSE_RANK;
Self-join
SELECT A.CLASS, A.VAL, COUNT(1) DENSE_RANK FROM S A
INNER JOIN (SELECT CLASS, VAL FROM S GROUP BY CLASS, VAL) B ON
A.CLASS = B.CLASS AND A.VAL >= B.VAL
GROUP BY A.CLASS, A.VAL, A."$rowid$" ORDER BY CLASS, DENSE_RANK;
As you can see, the logic is more simple than ROW_NUMBER and RANK because of the density.
Comparison of performance
Since we can implement most window functions with self-join, you may be curious about the comparison of performance between window function and self-join in SAP HANA. Therefore, I’ve also made some performance tests for you.
Test environment:
- SAP HANA SPS08 Rev. 80
- CPU: 8 cores
- Memory: 64GB
- Disk: 200GB
Data set:
- Identical table definition to the above scenario
- 100,000 records in total
- 1,000 classes
- ~100 values per class
Function | Window function (ms) | Self-join (ms) | Ratio |
---|---|---|---|
COUNT | ~80 | ~80 | ~1X |
AVG | ~90 | ~90 | ~1X |
ROW_NUMBER | ~100 | ~2500 | ~25X |
RANK | ~100 | ~2500 | ~25X |
DENSE_RANK | ~100 | ~1500 | ~15X |
We can do some analysis from the above table.
– COUNT, AVG: You can find the performance of window function and self-join seems the same. It’s true because the join condition only contains “=” which means only hash-join happened.
– ROW_NUMBER, RANK: Window function has much better performance (25X) than self-join, since the self-join contains non equi join.
– DENSE_RANK: B is kept smaller than S itself and the logic is more simple, so the ratio (15X) is smaller than ROW_NUMBER and RANK.
Conclusion
In this blog, we found a simple way (self-join in SQL) to implement the window function in SAP HANA. We implemented COUNT, AVG, ROW_NUMBER, RANK and DENSE_RANK with the self-join approach. And we compared the performance between window function and self-join in the end. For better performance, we should always use native window function in SAP HANA instead of self-join. On the other hand, if self-join exists in your SQL statement, please think about if it is possible to implement the logic with window function.
If you have any better idea how to implement the window function in SAP HANA, e.g., implement my scenario without using “$rowid$”, please feel free to let me know. Look forward to your feedback on this topic.
Hope you enjoyed reading my blog. ๐
Hi Wenjun
this blog is really well written and obviously a lot of effort has been put into it.
Besides that it is a bit of a pointless exercise - proving that there is an alternative to the existing feature which is way less efficient - but surely show-cases that you have your way with SAP HANA SQL.
A nice extension to the topic would have been to investigate and demonstrate *why* the self-join is slower and how the data processing differs for the alternatives (explain plan or PlanViz would deliver nice visuals for this).
As for your question, I really recommend Joe Celko's books on SQL programming to see standard SQL examples for the mentioned and many other functions/requirements.
This includes solutions without internal, unsupported columns like $rowid$.
Thanks for the post and cheers,
Lars
Hi Lars,
Thank you for reading my blog. Yes, you're right. The examples are best-fit for window functions and the self-join implementation is more like a SQL exercise. Your suggestion of the extension to the topic is good. Will try this. Besides thank you for the recommended books. Will also have a look.
Best regards,
Wenjun
Love Wenjun's hard work and Lars' honesty.
However would it be possible to test this against the sorts of data volumes where the row-based window functions of HANA might suffer?
Let's up the data volumes from 100k to 100m and see what happens?
Hi John,
Thank you for reading my blog. Will give it a shot with more data volumes. ๐
Best regards,
Wenjun
John, really... first faking stats with MS Paint and then leaving the delta store that large... tststs (hehehehe)
๐
It takes a while to merge all that data you know! Getting smaller...
Hi,
Thanks so much for this article! I tried to follow some instructions from few other article and got in over my head. This worked so quickly and your instructions were very easy to follow. Really appreciate this.
We use BW datasources to extract data from ECC system.
But BW system will be stopped and using <a href="https://asha24.com/blog/sap-hana-certification-guide">SAP HANA</a>instead.
Instead of extracting data from ECC with SLT and remodeling in HANA with complex business logic again , is there other way in HANA to extract data from ECC like BW datasources without BW system ?
By the way do you have any YouTube videos, would love to watch it. I would like to connect you on LinkedIn, great to have experts like you in my connection (In case, if you donโt have any issues).
Please keep providing such valuable information.
Muchas Gracias,
Kevin
Hi Buddy,
Just the information I was looking for. Thanks for the detailed instructions. I havenโt used it yet but I guess the time has come.
I am working on my HANA Express Edition (version 2, sps 2) while waiting for access to the HANA server(s) at work.
We plan to programmatically provision users to HANA, and assign privilges/roles when needed using a System type user through JDBC calls.
We would rather not use the "SYSTEM" id to do this provisioning. I created a user (IDM_ACE_DEVELOPER) that I am using to run SQL. I have gradually added the needed security roles/privilges to this user when I encounter the dreaded "insufficient privilege" error. However, I am stuck trying to grant a system privilege. I am trying to execute (JDBC call) this SQL:
"GRANT CATALOG READ TO TESTUSER" where TESTUSER is a newly provisioned user. I am using this link as a reference: https://help.sap.com/viewer/4fe29514fd584807ac9f2a04f6754767/2.0.00/en-US/20f674e1751910148a8b990d33efbdc5.html
The GRANT CATALOG READ works when signed in as the SYSTEM user, but my IDM_ACE_DEVELOPER id lacks the sufficient privilege. I have turned on the tracing, and then looked at the index server log file. I can find where it fails, but it doesn't seem to provide any clues to what I need to add to my IDM_ACE_DEVELOPER id to execute this SQL. Using HANA Studio, I have tried to compare the roles/privileges my SYSTEM user (which works) with my IDM_ACE_DEVELOPER user, but I have tried assigning everything that seemed to make sense, but it still fails (yes I remember to click save each time).
Anyways great write up, your efforts are much appreciated.
Ciao,
Irene Hynes