Example Optimize Join Columns

jan_zwickel · ‎08-10-2018

This example is referenced in my other post about Optimize Join Columns. Please have a look at the other post to get a better understanding of the context of this example

In this example we will demonstrate how the usage of option Optimize Join Columns influences query processing. The example makes use of two tables.

Data used in example

The following statements will create the respective tables and fill them with data:

CREATE COLUMN TABLE "itemsMD"(



        "item"      NVARCHAR(30) ,



        "description" NVARCHAR(30)



);





CREATE COLUMN TABLE "salesItems"(



        "item"      NVARCHAR(30) ,



        "employee" NVARCHAR(30) ,



        "amount" DECIMAL(10,2)



);





insert into "itemsMD" values ('001','diapers');



insert into "itemsMD" values ('002','diapers');



insert into "itemsMD" values ('003','milk');



insert into "itemsMD" values ('004','wine');





insert into "salesItems" values ('001','Donald',10);



insert into "salesItems" values ('002','Donald',50);



insert into "salesItems" values ('003','Alice',40);



insert into "salesItems" values ('004','Dagobert',21);

Create example Calculation View that prevents join pruning and thus no join column pruning occurs

Create a Calculation View with name exampleOptimizeJoinColumns

Add an aggregation node which uses table "salesItems" as data source

Map all columns to the output and select aggregation mode "MAX" for column "amount"

The Calculation View should now look like in the following screenshot:

4. Add a Join Node, connect it with both aggregation nodes

5. Join table "itemsMD" on column "item" with left-outer join. Use the aggregation node as left join partner

6. Keep cardinality setting as n..m and flag "Optimize Join Columns"

The View should look like in the screenshot below:

7. Map to the output all columns except the "item" column from table "itemsMD"

8. In the top-most aggregation map all columns to the output

9. In node Semantics, change aggregation type of column "amount" to "SUM" (see screenshot below)

10. try to build the view. You will receive an error message that tells you that you need a cardinality setting of N:1 or 1:1 when using the "Optimize Join Columns" setting. This tells you that not all pre-requisits for the optimize join columns flag have been met.

Change the view so that join pruning and optimize join columns can occur

The cardinality setting of "n..m" prevents join pruning to occur. Therefore, the "Optimize Join Columns" flag is rendered irrelevant. To successfully build the model you can unflag "Optimize Join Columns" or set the cardinality to "n..1". We want to demonstrate the working of the "Optimize Join Columns" option, therefore, change the join cardinality to "n..1" (compare screenshot below). As best practice check that this cardinality indeed holds in the data and there is never more than one matching entry for the join column “item” in the right table. In a real scenario you would like to guarantee this by some underlying loading logic.

Build the view.

Run the following query that only requests fields from table "salesItem", so that join pruning and in consequence optimize join columns will be effective:

SELECT

  "employee",

  SUM("amount") AS "amount"

FROM 

  "OPTIMIZEJOINCOLUMN_HDI_DB_1"."exampleOptimizeJoinColumn"

GROUP BY 

  "employee"

Change the view and unflag Optimize Join Columns

Unflag "Optimize Join Columns" (see screenshot below)

Build the view and run the same SQL query from above. The results will look different:

As you can see the “amount” for “Donald” differs depending on whether the optimize join columns flag is set or not. The reason is that in the first query the join field “item” is pruned away because the pre-requisites from above are met:

the join column “item” is not requested by the query

only columns from one join partner are requested – “employee” and “amount” are the only requested columns and both come from the left table

the join is a left outer join

the cardinality to the join partner of which no columns are requested was set to “1”

This pruning of the join column “item” changes the aggregation granularity for the maximum calculation that was defined in the aggregation node. To have a clearer understanding why the values changed, let’s start in "Debug this view" mode with the queries from above.

Use "Debug this view" when "Optimize Join Columns" is not selected

Start "Debug this view" and enter the query from above (see screenshot below):

When executing the debug query (green arrow on the right) you should see something like the screenshot below:

You can see that the join node is greyed out. This indicates that join pruning occurred. If you look at the Debug Query of the lower aggregation node you will see that column "item" is requested even though column "item" does not appear in the end user query. Execute the debug query there to see the intermediate values at this node:

Use "Debug this view" when "Optimize Join Columns" is used

Check "Optimize Join Columns", build the view and run the same debug session. You should now see something like the screenshot below:

Again you see that join pruning occurred. Importantly, this time column "item" is not requested in the debug query of the lower aggregation node. This shows that the selecting "Optimize Join Columns" excluded join column "item" at an early stage of processing.

Run the debug query at the lower aggregation node to see the following intermediate values:

Based on the intermediate values of the lower aggregation node the maximum is calculated. When "Optimize Join Columns" is not selected, the column "item" is kept in the aggregation granularity and the maximum calculation will happen for both records of Donald (item 001 and 002) leading to max(10)+max(50)=60 (compare previous screenshot of intermediate values). The summing stems from the fact that aggregation mode “SUM” is selected in the Semantics node. In contrast, when "Optimize Join Columns" is selected only one record is calculated for Donald: max(10,50)=50.

Therefore, you receive different results for "Donald", depending on whether the join column is omitted or not. If you had not used “MAX” as aggregation function but, for example, “SUM” in the aggregation node you would not see an impact of join column pruning. The reason is that “SUM” is not sensitive to the aggregation granularity at which it is calculated.

Notice that you see fewer records when the Optimize Join Columns flag is set compared to when the flag is not set (3 vs. 4 records in this example). This has a small impact in this toy example but if you have many join fields that could be pruned away and in addition the join fields have high cardinalities pruning might have a substantial influence on the granularity of aggregation and thus the number of records that have to be processed subsequently. This means by reasonable usage of feature "Optimize Join Columns" performance benefits can be expected.

Finally, to demonstrate that with join column pruning the resulting measure values might change depending on the columns requested let’s request one column (in this example “description”) from the right table so that join pruning and in consequence join column pruning cannot occur. Run the following query with "Optimize Join Columns" selected.

SELECT
"description",
"employee",
SUM("amount") AS "amount"
FROM
"OPTIMIZEJOINCOLUMN_HDI_DB_1"."exampleOptimizeJoinColumn"
GROUP BY
"description",
"employee"

As you can see with the additional column from the “right” table the results are the same whether or not the optimize join columns flag is set or not. Put differently, without "Optimize Join Columns" selected the results remain the same whether or not “description” from the right table is requested while results change when "Optimize Join Columns" is set depending on whether "description" is selected or not.

We have used the debug view functionality to check and understand the flag "Optimize Join Columns" above. Another way to quickly check whether the join field is omitted is to use “EXPLAIN PLAN” for the query. Below are the EXPLAIN PLANs for the queries with and without "Optimize Join Columns" selected:

Explain Plan when "Optimize Join Columns" is not selected:

Explain Plan when "Optimize Join Columns" is selected:

The EXPLAIN PLAN for the query on the Calculation View with the "Optimize Join Columns" flag set does not display the column “item” in the aggregation row. In contrast, without the "Optimize Join Columns" option selected the EXPLAIN PLAN on the Calculation View shows the field “item” in the grouping.

By the way, join pruning can be observed in the EXPLAIN PLAN by the absence of the to be pruned table “itemsMD”. To see that compare the EXPLAIN PLANs above with an EXPLAIN PLAN of a query including “description” from the right table. In case "description" is included, two "COLUMN TABLE" entries will be shown.

Click here to navigate back to the context in which this example is discussed.