Join Between the Tables – Points To Note
I come across some Joins scenarios, where I have learnt few Best Practices, which I would like to share.
1. Rank Node (Ex: Get Distinct Records By Latest Date) before Joins:
Two Tables are there and Table names are CUSTOMER and EMAIL_SUBSCRIPTION.
CUSTOMER Table contains the columns – Customer_ID and Email_Address of them.
EMAIL_OPT_IN Table contains the columns – EMAIL_Address, Subscription_Status and Modified_On.
TABLE: CUSTOMER | TABLE: EMAIL_SUBSCRIPTION |
---|---|
![]() |
Subscription Status 1 Means “YES” and 0 Means “No” |
Requirement:
To Get CURRENT_SUBSCRIPTION_STATUS For Each CUSTOMER.
Solution:
Wrong Approach |
Correct Approach |
---|---|
In Dimension Calculation view, Two tables are directly included in Join node. Select the required fields and activate the calculation view. Data Preview of Calculation View is, This is giving Wrong results. Customers – C1 and C2 both has subscription status – Yes and No. We could not find the Current subscription status of the Customer. So before joining these two tables, Current subscription status of Email has to be find with Latest Modified_On date and This output has to be connected to Customers Table in Join node. |
In Dimension Calculation view, Include Rank node in Calculation view. Add EMAIL_SUBSCRIPTION Table into it. In Rank node, Enter the following properties. Sort Direction Descending (Top N) on MODIFIED_ON Column and Threshold as 1. So It will filter out the records based on Latest Modified On and give distinct records with latest timestamp. Next, In Join node, Add Customer Table and RANK_1 Output. Select the required fields and activate the calculation view. Let’s View the Final output of Calculation view. It is giving the Correct results. Email Subscription status Of Customer C1 is NO Email Subscription status Of Customer C2 is YES Before Joins, Ranking data is giving Distinct Records By Latest Time stamp. So This approach is giving correct results. |
2. Aggregate Table before Joins:
Two Fact Tables are there. Table names are Planning_Sales and Actual_Sales.
Both Tables contain 3 columns – Customer_Id and Product_Id and Sales.
TABLE: PLANNING_SALES | TABLE: ACTUAL_SALES |
---|---|
![]() |
![]() |
Requirement:
PLANNING and ACTUAL Sales by Product wise.
Solution:
Wrong Approach |
Correct Approach |
---|---|
In Calculation view, Two Tables are directly included in Join node. Select the required fields and activate the calculation view. Data Preview of Calculation View is, This is giving Wrong results. Lets check the Output at Join node level. The Cause of the Problem is, Both the products P1 and P2 have two entries in Planning_Sales and Actual_Sales Table. While Joins between these two tables, Each record in PLANNING table looks Each record in ACTUAL table. So the Rows_Count is doubled and After Aggregation is giving wrong results. To avoid this Problem, First Sales has to be aggregated by Product wise for both PLANNING table and ACTUAL table, Then It has to be joined. |
In Calculation View, Planning_Sales Table is Included in Aggregation Node_1 and Sales Aggregated by Product wise. Actual_Sales Table is Included in Aggregation Node_2 and Sales Aggregated by Product wise. Then In Join Node, Two Aggregated Outputs – Aggregation_1 and Aggregation_2 are Joined. In Join node, Join Between Two Aggregation Nodes, Select the required fields and activate the calculation view. Let’s View the Final output of Calculation view. This is giving Correct results. Joins before aggregation is giving distinct records with aggregated values. So This approach is giving correct results. |
Regards,
Muthuram
Thanks for sharing the wonderful information.
I tried with first one (Rank Node) and it working as expected.
While trying to create a 2nd one (aggregate table before joins), I am unable to remove the connection between Sematics with Aggregation (Marked with Yellow). Can you please help, how to link low level aggregation with Join. Please find the below snap.
Thanks.
the line (edge) between the semantics layer and the node directly below it (I've been calling it terminal node) can't be removed.
you could eliminate Aggregation_2, and link Join_1 to the Aggregation node under the Semantics
Hi Jalina,
The Connection, you highlighted is default one. So we could not delete this.
Please do the following steps.
1. Remove "Planning_Sales" Table from "Aggregation"
2. Remove Connection From "Join_1" To "Aggregation_2"
3. Add "Planning_Sales" Table to "Aggregation_2"
4. Connect "Aggregation_2" Output to "Join_1"
5. Connect "Join_1" output to "Aggregation".
Regards,
Muthuram
Hi Muthuram,
Thanks for your post. I tried the same way as you mentioned in the post but i'm getting incorrect results.Please check the screenshot below let me know if anything missing from my side
Thanks
Output
Hi Mohammed,
Reason for the Issue is,
In Aggregation_1 and Aggregation_3 Nodes,
You have added ASALES and PSALES as Attributes (Add as Output).
These two are Measures. So It should be added as "Add as Aggregated Column".
So Remove this two columns from Aggregation_1 and Aggregation_3 respectively and
add as "Add to aggregated Column" and do the same steps.
You will get correct results.
Regards,
Muthuram
Thanks Muthuram.
It was working as expected. 🙂 🙂
THank you Muthuram and Astrid. It is working fine as expected.