Merge Multiple SCD Tables into A Single Dimension
Tracking historical changes within a dimension is a common task in data warehousing. Achieving this in a HANA sidecar database (SCD) requires an explicit design focus. In some cases, the data source system maintains dates and time stamps to denote a record or value’s creation, change, or validity threshold. In SAP systems creation, change, or validity dates are often but not always maintained for a record, dimension, attribute, or value. One such example being SAP Human Capital Management (HCM) mini-master info-types, seen as below:
The above info-types are versioned using a Begin (BEGDA) Date and End (ENDDA) Date thereby creating a validity range. In this post, I am demonstrating a simplified scenario using the following three tables, each with multiple valid from/to timestamps to show how to merge and aggregate the data into an ordered historical result using SAP HANA calculation views.
Execute the below SQL script then preview the resulting tables and data.
CREATE COLUMN TABLE "STAGING"."PA0000" ( "PERNR" NVARCHAR(8), "BEGDA" NVARCHAR(8), "ENDDA" NVARCHAR(8), "STAT" NVARCHAR(1) ); CREATE COLUMN TABLE "STAGING"."PA0001" ( "PERNR" NVARCHAR(8), "BEGDA" NVARCHAR(8), "ENDDA" NVARCHAR(8), "STAT1" NVARCHAR(1) ); CREATE COLUMN TABLE "STAGING"."PA0007" ( "PERNR" NVARCHAR(8), "BEGDA" NVARCHAR(8), "ENDDA" NVARCHAR(8), "STAT7" NVARCHAR(1) ); INSERT INTO "STAGING"."PA0000" VALUES('03151228','20170801','20180116','A'); INSERT INTO "STAGING"."PA0000" VALUES('03151228','20180116','20180502','B'); INSERT INTO "STAGING"."PA0000" VALUES('03151228','20180502','99991231','C'); INSERT INTO "STAGING"."PA0001" VALUES('03151228','20180501','20180829','D'); INSERT INTO "STAGING"."PA0001" VALUES('03151228','20180829','99991231','E'); INSERT INTO "STAGING"."PA0007" VALUES('03151228','20180301','20180511','F'); INSERT INTO "STAGING"."PA0007" VALUES('03151228','20180511','20181016','G'); INSERT INTO "STAGING"."PA0007" VALUES('03151228','20181016','99991231','H');
Each table contains rows for one Personnel ID (PERNR) with changing attribute stat (STAT,STAT1,STAT7) values. The changes are denoted by Valid-From (BEGDA) and Valid-To (ENDDA) date ranges. In this example, we will apply an inclusion of BEGDA and exclusion of ENDA. Each table combinations of PERNR, BEGDA, and ENDDA to create a unique set of keys for the row.
Here is the final result after merging three tables:
The information of the three tables is now merged into a single dimension view. Each row represents a version change of attributes with different BEG_DATE and END_DATE. As an example, the first change for employee 03151228 was starting date 20180116 with the STAT value changed from A to B.
So, how do we get there? There are two steps we define to accomplish this.
Task 1 – Collect all dates then build new valid from/to date ranges using the SAP HANA Rank Node.
1) Collect dates in three PA tables. The BEGDA and ENDDA columns are renamed to DATES
Projection Node: PR_PA0000_BEG
Projection Node: PR_PA0000_END
2) Union all dates using Union Node
The result after union (all) is as follows. There are some duplicate dates such as 20180116, 99991231 etc.
3) Remove duplicate dates by applying an Aggregation Node. The dates will now be distinct.
4) To partition PERNR, using an ascending sort we generate a RANK ID for each time slice (assume each PERNR will have maximum 100,000 time slices)
Data preview for RANK_DATES
5) Derive a column LEAD_RANK_ID using below formula
Next we will split the result into two datasets. The first contains PERNR, DATES, and LEAD_RANK_ID. The second contains PERNR, DATES, and RANK_ID
1ST dataset 2nd dataset
6) Apply the following left outer join condition between the first and second dataset. From the 1st dataset, expose DATES and rename it as BEG_DATE. From the 2nd dataset, expose DATES and rename it as END_DATE.
Date preview on JN_BEG_END
7) Add a filter removing the last line where END_DATE is null.
8) Final result for task 1
Task 2 – We now have eight rows (see above) with correct date ranges but without the original three table attributes. We can add the corresponding attributes by joining the initial three base tables with our new data ranges. An SAP HANA temporal join is not applicable here as we are working with a date range instead of temporal column such as a posting date. Instead, we can write a SAP HANA SQL Script to apply a date range look up.
Here is a graphical modeling solution with the final calculation view as below
1) Select PA tables etc., PR_PA0000
2) Create a left outer join between the new date ranges and PA tables on PERNR only
Left outer join on PERNR
See the left outer join results below. The first row is a valid record as BEG_DATE >= PA00_BEGDA and END_DA <= PA00_ENDDA selecting STAT value A as the correct version. Note that some rows, such as fourth row, become invalid as BEG_DATE and END_DATE range moves out the row’s PA00_BEGDA and PA00_ENDDA date range.
3) Create a calculated attribute to flag invalid rows with a null value
Data preview with calculated attribute
4) Apply an aggregation node to compress result, eliminating invalid rows with null CC_STAT values.
4) Repeat the same steps for PA0001 and PA0007 tables.
Note: In the case where a dimension-based value requires historical perspective but carries no associated create, change or validity dates and/or time, there is the option of deriving the necessary dates during the data warehouse load processes. Historical attribute views between fact and dimension tables are now possible using SQL or temporal joins.