In my recent post, Thinking in HANA – Part 2: Revisiting SCDs, I made reference to a current challenge that we’re faced with regarding slowly challenging dimensions. Our solution goes a bit beyond the solution I referenced – populating dimension and fact tables with surrogate keys – so I thought I would share an interesting solution to what I presume is a relatively widespread problem.
- A client has a reporting requirement that involves pulling dimensions from eight different tables for the same business entity.
- Each of these tables contains slowly changing dimensions (also called time-dependent master data). For example, one table might contain an employee’s contact information that’s valid for a particular period of time, and another table might contain an employee’s job level that’s valid for a particular period of time.
- The validity periods of each tables are distinct and do not correspond in a 1:1 fashion with validity periods of the other tables (although there is overlap).
- In addition to natural keys, such as personnel number (PERNR), each table has multiple (up to 7) other key fields that lead to “duplicate” records (meaning multiple records that are “valid” for the same employee from the same table).
- Each table may not have data for all employees but just a subset.
- The data needs to provided for employees in three different roles within the fact data (i.e. a sales rep, a sales manager, and a service manager).
- Populate Define business rules to “cleanse the data” and implement them in an ETL job for each table. In this case, cleansing refers to any logic that will ensure uniqueness per business entity (i.e. employee) and point in time (i.e. as captured in a validity period for multiple points in time).
- Conform the source tables into a single target table that captures dimensions in validity periods that correspond to all source validity periods.
- Populate surrogate keys on dimension tables and fact table so that slowly-changing dimensions can be modeled with a Left Outer Join in an Analytic View.
Solution steps 1 and 3 are above have already been discussed in my previous post. Solution step 2 is the focus of this article. Note that if tables were not consolidated/conformed into a single target table, 8 different Attribute Views would have to be created for each SCD table, and these 8 views would have to modeled 3 times against each employee in the fact data set, leading to 24 modeled views. That requires significantly more effort and maintenance (and room for error) than modeling the same Attribute View three times. (Note that Attribute Views in these cases should be created as “Derived” Attribute Views based on the first view so that all changes to the view itself get propagated – without needing to maintain 3 distinct views. The same Attribute View cannot be modeled itself more than once).
In order to demonstrate the solution, examine the following data set (based on my previous posts). It shows two different time-dependent dimension tables.
This data is illustrated graphically below:
The desired target table and graphical representation are here below:
The solution will be provided in the following SQL with comments that highlight logic. Please copy, inspect, and run the SQL to understand the solution.
Before posting the solution, it’s worth noting some of the implications. The logic “explodes” time-dependent data to day-level granularity. In our case, this currently involves exploding out one of the tables of 200,000 distinct records across 600 days which results in a formidable intermediate result set of 120 million records. In combination with additional logic and tables we’ve run into some out-of-memory errors, so we’ll likely be implementing something along the lines of an iterative solution that runs on bite-size chunks of data, one at a time. For an example of such an approach, check out this article by Justin Molenaur.
DROP TABLE DIM_MATERIAL; CREATE COLUMN TABLE DIM_MATERIAL ( PROD_ID CHAR(1), DATE_FROM DATE, DATE_TO DATE, MATERIAL NVARCHAR(20), PRIMARY KEY (PROD_ID, DATE_FROM) ); DROP TABLE DIM_SIZE; CREATE COLUMN TABLE DIM_SIZE ( PROD_ID CHAR(1), DATE_FROM DATE, DATE_TO DATE, SIZE DECIMAL(12,2), PRIMARY KEY (PROD_ID, DATE_FROM) ); INSERT INTO DIM_MATERIAL VALUES ('A', '2014-01-01', '2014-01-08', 'Steel'); INSERT INTO DIM_MATERIAL VALUES ('A', '2014-01-14', '2014-01-24', 'Aluminum'); INSERT INTO DIM_MATERIAL VALUES ('A', '2014-01-25', '2014-01-31', 'Carbon Fiber'); INSERT INTO DIM_SIZE VALUES ('A', '2014-01-11', '2014-01-15', 10); INSERT INTO DIM_SIZE VALUES ('A', '2014-01-16', '2014-01-19', 12); DROP PROCEDURE CONSOLIDATE_SCD; CREATE PROCEDURE CONSOLIDATE_SCD AS BEGIN -- expand DIM_MATERIAL to day-level granularity dim_MATERIAL = SELECT PROD_ID, DATE_SQL, MATERIAL FROM _SYS_BI.M_TIME_DIMENSION TDIM INNER JOIN DIM_MATERIAL DIMC ON TDIM.DATE_SQL BETWEEN DIMC.DATE_FROM AND DIMC.DATE_TO; -- expand DIM_SIZE to day-level granularity dim_SIZE = SELECT PROD_ID, DATE_SQL, SIZE FROM _SYS_BI.M_TIME_DIMENSION TDIM INNER JOIN DIM_SIZE DIMW ON TDIM.DATE_SQL BETWEEN DIMW.DATE_FROM AND DIMW.DATE_TO; -- 1) join on natural key fields + DATE_SQL -- 2) retrieve non-null values of key fields and DATE_SQL all_dims = SELECT COALESCE(MATERIAL.PROD_ID, SIZE.PROD_ID) AS PROD_ID, COALESCE(MATERIAL.DATE_SQL, SIZE.DATE_SQL) AS DATE_SQL, MATERIAL, SIZE FROM :dim_MATERIAL MATERIAL FULL OUTER JOIN :dim_SIZE SIZE ON MATERIAL.PROD_ID = SIZE.PROD_ID AND MATERIAL.DATE_SQL = SIZE.DATE_SQL; -- re-calculate validity dates result = SELECT MIN(DATE_SQL) AS VALID_FROM, MAX(DATE_SQL) AS VALID_TO, MATERIAL, SIZE FROM :all_dims GROUP BY MATERIAL, SIZE; SELECT * FROM :result ORDER BY VALID_FROM; END; CALL CONSOLIDATE_SCD;