Thinking in HANA – Part 2a: Conforming SCDs
In my recent post, Thinking in HANA – Part 2: Revisiting SCDs, I made reference to a current challenge that we’re faced with regarding slowly challenging dimensions. Our solution goes a bit beyond the solution I referenced – populating dimension and fact tables with surrogate keys – so I thought I would share an interesting solution to what I presume is a relatively widespread problem.
- A client has a reporting requirement that involves pulling dimensions from eight different tables for the same business entity.
- Each of these tables contains slowly changing dimensions (also called time-dependent master data). For example, one table might contain an employee’s contact information that’s valid for a particular period of time, and another table might contain an employee’s job level that’s valid for a particular period of time.
- The validity periods of each tables are distinct and do not correspond in a 1:1 fashion with validity periods of the other tables (although there is overlap).
- In addition to natural keys, such as personnel number (PERNR), each table has multiple (up to 7) other key fields that lead to “duplicate” records (meaning multiple records that are “valid” for the same employee from the same table).
- Each table may not have data for all employees but just a subset.
- The data needs to provided for employees in three different roles within the fact data (i.e. a sales rep, a sales manager, and a service manager).
- Populate Define business rules to “cleanse the data” and implement them in an ETL job for each table. In this case, cleansing refers to any logic that will ensure uniqueness per business entity (i.e. employee) and point in time (i.e. as captured in a validity period for multiple points in time).
- Conform the source tables into a single target table that captures dimensions in validity periods that correspond to all source validity periods.
- Populate surrogate keys on dimension tables and fact table so that slowly-changing dimensions can be modeled with a Left Outer Join in an Analytic View.
Solution steps 1 and 3 are above have already been discussed in my previous post. Solution step 2 is the focus of this article. Note that if tables were not consolidated/conformed into a single target table, 8 different Attribute Views would have to be created for each SCD table, and these 8 views would have to modeled 3 times against each employee in the fact data set, leading to 24 modeled views. That requires significantly more effort and maintenance (and room for error) than modeling the same Attribute View three times. (Note that Attribute Views in these cases should be created as “Derived” Attribute Views based on the first view so that all changes to the view itself get propagated – without needing to maintain 3 distinct views. The same Attribute View cannot be modeled itself more than once).
In order to demonstrate the solution, examine the following data set (based on my previous posts). It shows two different time-dependent dimension tables.
This data is illustrated graphically below:
The desired target table and graphical representation are here below:
The solution will be provided in the following SQL with comments that highlight logic. Please copy, inspect, and run the SQL to understand the solution.
Before posting the solution, it’s worth noting some of the implications. The logic “explodes” time-dependent data to day-level granularity. In our case, this currently involves exploding out one of the tables of 200,000 distinct records across 600 days which results in a formidable intermediate result set of 120 million records. In combination with additional logic and tables we’ve run into some out-of-memory errors, so we’ll likely be implementing something along the lines of an iterative solution that runs on bite-size chunks of data, one at a time. For an example of such an approach, check out this article by Justin Molenaur.
DROP TABLE DIM_MATERIAL; CREATE COLUMN TABLE DIM_MATERIAL ( PROD_ID CHAR(1), DATE_FROM DATE, DATE_TO DATE, MATERIAL NVARCHAR(20), PRIMARY KEY (PROD_ID, DATE_FROM) ); DROP TABLE DIM_SIZE; CREATE COLUMN TABLE DIM_SIZE ( PROD_ID CHAR(1), DATE_FROM DATE, DATE_TO DATE, SIZE DECIMAL(12,2), PRIMARY KEY (PROD_ID, DATE_FROM) ); INSERT INTO DIM_MATERIAL VALUES ('A', '2014-01-01', '2014-01-08', 'Steel'); INSERT INTO DIM_MATERIAL VALUES ('A', '2014-01-14', '2014-01-24', 'Aluminum'); INSERT INTO DIM_MATERIAL VALUES ('A', '2014-01-25', '2014-01-31', 'Carbon Fiber'); INSERT INTO DIM_SIZE VALUES ('A', '2014-01-11', '2014-01-15', 10); INSERT INTO DIM_SIZE VALUES ('A', '2014-01-16', '2014-01-19', 12); DROP PROCEDURE CONSOLIDATE_SCD; CREATE PROCEDURE CONSOLIDATE_SCD AS BEGIN -- expand DIM_MATERIAL to day-level granularity dim_MATERIAL = SELECT PROD_ID, DATE_SQL, MATERIAL FROM _SYS_BI.M_TIME_DIMENSION TDIM INNER JOIN DIM_MATERIAL DIMC ON TDIM.DATE_SQL BETWEEN DIMC.DATE_FROM AND DIMC.DATE_TO; -- expand DIM_SIZE to day-level granularity dim_SIZE = SELECT PROD_ID, DATE_SQL, SIZE FROM _SYS_BI.M_TIME_DIMENSION TDIM INNER JOIN DIM_SIZE DIMW ON TDIM.DATE_SQL BETWEEN DIMW.DATE_FROM AND DIMW.DATE_TO; -- 1) join on natural key fields + DATE_SQL -- 2) retrieve non-null values of key fields and DATE_SQL all_dims = SELECT COALESCE(MATERIAL.PROD_ID, SIZE.PROD_ID) AS PROD_ID, COALESCE(MATERIAL.DATE_SQL, SIZE.DATE_SQL) AS DATE_SQL, MATERIAL, SIZE FROM :dim_MATERIAL MATERIAL FULL OUTER JOIN :dim_SIZE SIZE ON MATERIAL.PROD_ID = SIZE.PROD_ID AND MATERIAL.DATE_SQL = SIZE.DATE_SQL; -- re-calculate validity dates result = SELECT MIN(DATE_SQL) AS VALID_FROM, MAX(DATE_SQL) AS VALID_TO, MATERIAL, SIZE FROM :all_dims GROUP BY MATERIAL, SIZE; SELECT * FROM :result ORDER BY VALID_FROM; END; CALL CONSOLIDATE_SCD;
Nice blog about handling slowly changing dimensions. You are always thinking about multiple aspects of the problem when designing a solution. Your solution seems to be most appropriate for the problem you are trying to solve. However, sometimes, I feel that creating 8 attribute views (not counting derived views as they are automatically rebuilt) is a better option from maintenance perspective. When my brain is half dead, I find it easy to maintain a laborious no brain involved model than a concise but more complex work. Furthermore, my tendency is to avoid batch wherever possible.
I have been using the temporal joins recently. I do end up creating a custom table updated by a stored procedure when the source system does not have records for all master records (ex: employee). In those cases, I create a record with valid from date as the create date of the master record and valid till date as 12/31/9999. Of course, the attributes will be blanks.
One interesting tid bit: HANA temporal join cannot be used in a model when date field is in SAP system date format. I could not use SLT to create an addition field of SQL date type in the fact tables or dimension tables as that functionality is not available. I ended up using SLT to create addition field(s) of type integer and stored the date(s) as integer(s). I could then create a temporal join on fields.
Keep up the good work.
My tendency to avoid batch is not necessarily related to real-time analysis. I do tell everyone that words real time and data warehouse will not be together even with HANA.
I feel the demand from end users to make the data available quickly. This is not necessarily real-time. As we spoke a month ago, it is mostly during financial close at quarter end / year end, sales target tracking during quarter end etc., Even a frequency of once every hour seems to not make them happy. One thing that HANA can help is that we can create models that use a hybrid approach. I am starting with a real-time approach on a model. But I am cherry picking those joins and calculations that are affecting the performance of a model and then moving them to ETL. Of course, some models are always based on ETL approach (ex: inventory snapshots) .
I totally agree with Ramana, in all my projects we are slowing moving away from saying Real time on all the data but limit ourselves to fact tables as near real time and using the ETL approach around master data. Main reason performance hit which the end users are complaining. We might see more of this hybrid approach moving forward.
I think we're all on the same page. I just distinguish a bit more between what the end user likes/wants, and what I as the developer like/want. 🙂 Not always the same world! But yes, agreed on the general strategy.
Thanks for your blog. We have faced a similar challenge and used the following Sql. I dont know if its more efficient than yours, but you could try it to see which is better
COALESCE(MATERIAL.PROD_ID, SIZE.PROD_ID) AS PROD_ID,
Greatest( DIMW.DATE_FROM, DIMC.DATE_FROM),
Least( DIMW.DATE_TO, DIMC.DATE_TO)
FROM DIM_MATERIAL DIMC
FULL OUTER JOIN DIM_SIZE DIMW
on DIMC.PROD_ID = DIMW.PROD_ID
Where Greatest( DIMW.DATE_FROM, DIMC.DATE_FROM) <
Least( DIMW.DATE_TO, DIMC.DATE_TO)
Thanks for sharing, Stefan.
I'm always hesitant to put calculations in WHERE clauses, especially with semi-cartesian joins, due to performance implications. Nonetheless, it certainly helps further expand the solution space, and if it's worked well for you - great!