I come across some Joins scenarios, where I have learnt few Best Practices, which I would like to share.

1. Rank Node (Ex: Get Distinct Records By Latest Date) before Joins:


Two Tables are there and Table names are CUSTOMER and EMAIL_SUBSCRIPTION.

CUSTOMER Table contains the columns – Customer_ID and Email_Address of them.

EMAIL_OPT_IN Table contains the columns – EMAIL_Address, Subscription_Status and Modified_On.


TABLE:  CUSTOMER TABLE:  EMAIL_SUBSCRIPTION
temp.PNG

Subscription Status 1 Means “YES” and 0 Means “No”

temp.PNG


Requirement:

To Get CURRENT_SUBSCRIPTION_STATUS For Each CUSTOMER.

temp.PNG

Solution:



Wrong Approach



Correct Approach


In Dimension Calculation view,

Two tables are directly included in Join node.


temp.PNG


Select the required fields and activate the calculation view.

Data Preview of Calculation View is,


temp.PNG


This is giving Wrong results.

Customers – C1 and C2 both has subscription status –

Yes and No.

We could not find the Current subscription status of the Customer.


So before joining these two tables,

Current subscription status of Email has to be find with Latest Modified_On date and This output has to be connected to Customers Table in Join node.

In Dimension Calculation view,

Include Rank node in Calculation view.

Add EMAIL_SUBSCRIPTION Table into it.

In Rank node, Enter the following properties.

Sort Direction Descending (Top N) on

MODIFIED_ON Column and

Threshold as 1.

So It will filter out the records based on Latest Modified On and

give distinct records with latest timestamp.

temp.PNG

Next, In Join node,

Add Customer Table and RANK_1 Output.

temp.PNG

Select the required fields and activate the calculation view.

Let’s View the Final output of Calculation view.

temp.PNG

It is giving the Correct results.


Email Subscription status Of Customer C1 is NO

Email Subscription status Of Customer C2 is YES


Before Joins, Ranking data is giving Distinct Records By Latest Time stamp. So This approach is giving correct results.

2. Aggregate Table before Joins:


Two Fact Tables are there. Table names are Planning_Sales and Actual_Sales.

Both Tables contain 3 columns – Customer_Id and Product_Id and Sales.

TABLE: PLANNING_SALES TABLE: ACTUAL_SALES
temp.PNG temp.PNG

Requirement:

PLANNING and ACTUAL Sales by Product wise.

temp.PNG

Solution:



Wrong Approach



Correct Approach


In Calculation view,

Two Tables are directly included in Join node.

temp.PNG


Select the required fields and activate the calculation view.

Data Preview of Calculation View is,


temp.PNG


This is giving Wrong results.

Lets check the Output at Join node level.


temp.PNG


The Cause of the Problem is,


Both the products P1 and P2 have two entries in

Planning_Sales and Actual_Sales Table.

While Joins between these two tables,

Each record in PLANNING table looks Each record in ACTUAL table.


So the Rows_Count is doubled and After Aggregation is giving wrong results.


To avoid this Problem,

First Sales has to be aggregated by Product wise for

both PLANNING table and ACTUAL table,

Then It has to be joined.


In Calculation View,

Planning_Sales Table is Included in

Aggregation Node_1 and

Sales Aggregated by Product wise.

Actual_Sales Table is Included in

Aggregation Node_2 and

Sales Aggregated by Product wise.


Then In Join Node, Two Aggregated Outputs –

Aggregation_1 and Aggregation_2 are Joined.


temp.PNG


In Join node, Join Between Two Aggregation Nodes,


temp.PNG


Select the required fields and activate the calculation view.

Let’s View the Final output of Calculation view.


temp.PNG


This is giving Correct results.


Joins before aggregation is giving distinct records with aggregated values.

So This approach is giving correct results.


Regards,

Muthuram

To report this post you need to login first.

7 Comments

You must be Logged on to comment or reply to a post.

  1. Jalina Vincent

    Thanks for sharing the wonderful information.

    I tried with first one (Rank Node) and it working as expected.

    While trying to create a 2nd one (aggregate table before joins), I am unable to remove the connection between Sematics with Aggregation (Marked with Yellow). Can you please help, how to link low level aggregation with Join. Please find the below snap.Calculation_View.JPG

    Thanks.

    (0) 
    1. Astrid Gambill

      the line (edge) between the semantics layer and the node directly below it (I’ve been calling it terminal node) can’t be removed.

      you could eliminate Aggregation_2, and link Join_1 to the Aggregation node under the Semantics

      (0) 
    2. Muthuram Shanmugavel Post author

      Hi Jalina,

      The Connection, you highlighted is default one. So we could not delete this.

      Please do the following steps.

      1. Remove “Planning_Sales” Table from “Aggregation”

      2. Remove Connection From “Join_1” To “Aggregation_2”

      3. Add  “Planning_Sales” Table to “Aggregation_2”

      4. Connect “Aggregation_2” Output to “Join_1”

      5. Connect “Join_1” output to “Aggregation”.

      Regards,

      Muthuram

      (0) 
      1. Mohammed Rasul Salaudeen

        Hi Muthuram,

        Thanks for your post. I tried the same way as you mentioned in the post but i’m getting incorrect results.Please check the screenshot below let me know if anything missing from my side

        Thanks

        Part1.JPG

        Part2.JPG

        Output

        Output.JPG

        (0) 
        1. Muthuram Shanmugavel Post author

          Hi Mohammed,

          Reason for the Issue is,

          In Aggregation_1 and Aggregation_3 Nodes,

          You have added ASALES and PSALES as Attributes (Add as Output).

          These two are Measures. So It should be added as “Add as Aggregated Column”.

          So Remove this two columns from Aggregation_1 and Aggregation_3 respectively and

          add as “Add to aggregated Column” and do the same steps.

          You will get correct results.

          Regards,

          Muthuram

          (0) 

Leave a Reply