Technical Articles
Non-deterministic behavior of Window function
Hi all,
<Introduction>
If you have reached to this blog post, it may be because you have questioned why query sometimes prints different query results time to time when Window function is used in the query execution plan. Today, I am here to help you out to understand why is that so. Before we dive info further, let’s briefly check on the definition of Window function and its characteristics.
<Window function>
** Definition
** Syntax
<window_function> ::=
<window_function_name_or_type>
OVER ( [ <window_partition_by_clause> ] [ <window_order_by_clause> ] [<window_frame_clause>])
** More details regarding its uses
If you need more details about window function types and how to use them, please refer the link below.
After reading the definition of Window function, now you know that window function has a non-deterministic function and it deviates from a regular aggregation function, but maybe the definition is not enough to give you an intuitive impression on how exactly Window function behaves underneath. Let me explain some details with an example.
<Example>
create column table myTable (CLASS CHAR(10), AGE INT, GENDER CHAR(10), MEMBER_ID INT);
INSERT INTO myTable VALUES(‘A’, 10, ‘Female’, 12114);
INSERT INTO myTable VALUES(‘A’, 34, ‘Female’, 35583);
INSERT INTO myTable VALUES(‘A’, 51, ‘Male’,55864);
INSERT INTO myTable VALUES(‘A’, 51, ‘Male’,54312);
INSERT INTO myTable VALUES(‘A’, 51, ‘Male’,58512);
INSERT INTO myTable VALUES(‘A’, 14, ‘Female’, 48766);
INSERT INTO myTable VALUES(‘B’, 12, ‘Male’, 32587);
INSERT INTO myTable VALUES(‘B’, 12,’Female’, 96861);
INSERT INTO myTable VALUES(‘B’, 27, ‘Male’,11541);
As you can see from the link above, there are several types of Window functions. Among them, I choose to use ROW_NUMBER() as an example.
Once we have decided which one to use, then let’s create a select statement using Window function syntax. We partitioned the table into firstly with CLASS and secondly with GENDER. In query, it can be expressed as
SELECT CLASS, GENDER, AGE, MEMBER_ID,
ROW_NUMBER() OVER (PARTITION BY CLASS, GENDER ORDER BY AGE) ROW_NUM
FROM myTable;
Using ROW_NUMBER()
“ROW_NUMBER OVER (PARITION BY CLASS, GENDER)”
FIGURE 1. FIRST PARTITIONING BY CLASS (Notice the red boxes)
FIGURE 2. SECOND PARTITIONING BY GENDER (Notice the blue boxes)
And then, order them by age
“ROW_NUMBER OVER (PARTITION BY CLASS, GENDER ORDER BY AGE)”
FIGURE 3. ORDER BY AGE (Marked with black arrows)
If you look at the second blue rectangle from the top, these three members (55864, 54312, and 58512) have the same class and gender values. In this case, even if you sort them by their age, they could not be distinguished from one another because they have been partitioned only by CLASS and AGE, but not by any unique value.
If you face this situation with either small amount of data or no materialization (materialization happens switching among different engines), the query result could manage to stay the same no matter how often you execute the query. Yet, if your situation doesn’t belong to those cases mentioned above, please be informed that you may experience non-deterministic query result.
<Workaround>
- Order each partition with unique value
- If your business need allows, try to partition the table where it doesn’t have common query result
<Conclusion>
Before you design your business scenario using Window functions, please be aware of its characteristics and limitations
Sorry, but I'm not too sure about what you wanted to say here.
The SQL standard does not say that window functions are non-deterministic. And in fact, they are deterministic.
What's undefined and therefore potentially non-deterministic is the order of result records in the absence of an ORDER BY clause.
If that is what you wanted to point out, then I may have missed it, but I definitively agree as it has been correct back then (https://blogs.sap.com/2007/02/14/starting-off-with-a-classic/) and it is correct today.
Hello Lars,
I think the term that I have used in the blog could be confusing (I may need to correct them).
In this context, even if you use ORDER BY clause for age, it is not clear enough to distinguish
among these three members (55864, 54312, and 58512) because they all have same age.
Because of this, when you arrange them, the order can be interchangeable as I mentioned above.
But definitely, I get your point and I will make changes in the near future.
Yes, that's the effect of partial ordering as it applies to all places where ORDER BY can be used.
I think the important bit to take away by developers is that ORDER BY needs to include all columns for which the result order is of relevance to the meaning of the result.
But maybe you wanted to stress that SQL determinism is about the fulfillment of the declared computation and not about the resulting byte-sequence of the result materialisation.
So, if I declare that the result order of sorting ties is undefined, then the result will be deterministic in that only the relative order of ties can change.
In any case, thanks for the swift response!
Seeing that you're very familiar with SQL and how HANA implements it, maybe you want to chime in on this interesting post: https://blogs.sap.com/2020/11/02/why-are-scalar-udfs-so-slow/
"I think the important bit to take away by developers is that ORDER BY needs to include all columns for which the result order is of relevance to the meaning of the result."
-> This could be a good implementation for the workaround section
I'll take a look at the post that you suggest 🙂
Very helpful. Thanks 😉