Thinking in HANA - Part 2: Revisiting SCDs

Former Member · ‎03-24-2014

Hi again folks,

It's been a few weeks since my first post in my series "Thinking on HANA", so I think it's about time for me to get the next article written up, posted, discussed, debated, critiqued, printed, turned into paper airplanes...

For those of you who have been following my posts over the past year or so, it's hopefully clear by now that one of the themes of my work is, "there's more than one way to skin a cat. I stated this a bit more formally in my last post,

As a massively-parallel, column-oriented, in-memory platform, HANA affords many opportunities to come up with new and creative solutions to seemingly well-understood problems.

In addition to exploring technical aspects of alternative solutions, I'd also like to start framing these alternatives in the bigger picture of "organizational problem-solving" (okay - "consulting") where, in addition to technical trade-offs, organizational-imposed constraints restrict the solution domain even further. The stereotypical example in IT (and other fields) is the "triple constraint" in Project Management of time, cost and quality - the three resource constraints of all organizational projects. This "triple constraint" is, of course, only a heuristic to help guide project management decision-making. Constraints such as risks, financial costs, environmental costs, social costs, scope, performance, etc can be mapped to any number of "focal points" that make the most sense (heaven forbid we start talking about hierarchies of constraints!)

When developing solutions on HANA, and more specifically, during the phase in which I research, consider and prototype as many data modeling solutions as are feasible, I typically characterize the solution set with the following model:

Functionality

Does the solution meet the ad-hoc requirements? Visualization requirements?
Does it support the data structure requirements? (i.e. hierarchy for MDX clients)
Does it provide the required result set?

Performance

Do queries execute sufficiently fast?
Are there any concerns regarding memory consumption?

Maintainability

Is the solution well-understood by other practitioners?
Can it be easily maintained?

Velocity

Does the solution meet the real-time and/or batch requirements from source system to HANA?

Resources

Can the solution be implemented in a reasonable amount of time and cost given available resources?

As should be immediately obvious, these constraints can easily be mapped to the classic project management "triple constraint", and with a bit of "refactoring" could certainly fit into four or six or N number of discrete concerns.

Keeping in mind these constraints, let's evaluate 5 different solutions to modeling "slowly changing dimensions” (SCDs) in HANA.

Slowly Changing Dimensions

I characterize slowly-changing dimensions as those attributes of “organizational entities” that 1) change over time and 2) have an impact on business processes that can and likely should be measured to assist in organizational decision-making. (For all the messy details free to reference Wikipedia's treatise on all 6+ different types). Generally these attributes are lost in transactional systems (i.e. they are treated like CRUD – created, read, updated, deleted – but not maintained or tracked over time).

Let’s take a very simple example that illustrates the concept.

Imagine a product that goes through a typical technology lifecycle – it’s built with one metal and over time upgraded with lighter and stronger metals (okay, carbon fiber isn't a metal, fair enough). Below shows an image of material changes over time – within the single month of January (I know, not very realistic, use your imagination ).

This data can be captured with the following table structure:

PROD_ID is the primary key of the product, DATE_FROM and DATE_TO capture the “validity period” of the material attribute, and MATERIAL captures the physical material the product is made of during that period.

Next, imagine this product is sold throughout all of January. Again, for simplicity’s sake – we’ll use an *incredibly* simple model. In our alternate universe here, a single sale for $100 was registered in every day of January for the product. No additional information was captured. (Imaginations, people!). Here is the table structure. It should be self-explanatory. TX_DATE is the transaction date.

Although the example already speaks for itself, let’s assume we want to analyze the aggregate sales for the product for each material that it was made out of. We would accomplish this by joining the two tables on PROD_ID as well as joining the tables where the transaction date (TX_DATE) of the sales table falls between the validity dates (DATE_FROM and DATE_TO) of the product material table. What’s also important is that we analyze sales for those days where we mistakenly didn’t track the product’s material – we still want to see the sales for those products.

Following are five alternative solutions that I'll describe for modeling slowly-changing dimensions in HANA:

Theta Join #1
Theta Join #2
SQLScript
Temporal Join
Surrogate Key

I’ll demonstrate each approach, follow each with a discussion on the pros and cons with respect to the constraints provided earlier, and I’ll end this article with a bird’s eye view on the implications for these kinds of requirements and solutions.

Theta Join #1

I use the term “thetajoin” to describe any join condition that is somewhere between the “natural” join conditions (i.e. those which respect the cardinality of the relationship between the tables) and a “cross join”, otherwise called a Cartesian product, which maps every record in one table to reach record in the other table.

For those who are interested in the precise definitions of terms like “theta join”, a simple google search should serve you well. Hopefully I don’t upset folks too much by being a bit liberal/imprecise with my use of these terms.

In the screenshot below, you’ll see that the sales table and the material table are only joined on PROD_ID.

The conditional component of the join is handled in the following calculated column, where the logic only populates the column with the sales amount if it finds that the transaction date falls between the beginning and end dates of the validity period.

Querying the data model for validity dates, material and amounts gives the following result.

Let’s discuss the data model and the results in the context of the constraints discussed initially.

Functionality

This approach solves the main requirement – aggregating sales per material type – but fails to pull through any sales figures that don’t correspond to material values (i.e. where material is null). In order to achieve this final requirement, the model would actually have to output calculated attribute values per validity period – basically the same calculation but for the dimensions rather than the single measure. The base measure should then come through alone, and when grouped by calculated dimensions, it would effectively display the final result with an additional record with nulls in the other columns as expected.

Maintainability

The model above is relatively straightforward to understand. However, it’s deceiving. The left outer join implies to the developer that all fact records will be captured as is the case with most data models – but the condition of the calculated measure renders this characteristic void. Moreover, the “theta join” is not particularly intuitive. Also, if the solution was implemented with calculated attributes instead of the calculated measure – as described above, which would give 100% correct functional results – this kind of approach would be entirely unsustainable. The amount of work (in production situation, tens or hundreds of calculated attributes), the confusion of the approach, the room for error, and the performance implications would make this a very bad decision.

Velocity

This approach can be modeled against real-time (i.e. SLT sourced) or batched (i.e. ETL sourced) data if source data has validity periods. If source data does not have validity periods, an ETL job will have to capture the validity periods. If SLT is used, an approach like this one should be used to capture validity periods.

Resources

This approach should be pretty quick to develop without taxing the development team too heavily. SLT-sourced data, however, will require additional resources to implement the history-tracking approach referenced above if this is pursued.

Performance

This model will first execute in the OLAP engine, and the aggregated results will be processed by the Calculation Engine for the calculated measure to arrive at the final result. Since both engines are required to process the result, performance will suffer. Moreover, extra memory will be consumed to accommodate the theta join (which results in more intermediate records before final result set is built).

Theta Join #2

The following approach is similar to the first one. In the analytic view, the only join condition modeled is the equijoin on PROD_ID between the two tables.

However, the conditional logic is implemented as a filter in a projection node with a Calculation View, and the base measure is pulled through.

Following are the results:

Functionality

This approach suffers from the same shortcoming as the first – any fact records that don’t have corresponding dimensions are effectively dropped from the result set.

Maintainability

The approach is about as maintainable as the first one.

Velocity

As is the case previously, this approach will support realtimeor batched data – with the requirement that the tables have validity periods. Extra logic may need to implemented in the ETL layer or SLT (or other replication layer) to capture validity periods if source tables are not already populated with them

Resources

This model has about the same development resource requirements as the first model.

Performance

This model will perhaps perform slightly better than the first. In my experience, computations are more expensive than filters. The theta join poses the same risk of additional memory consumption.

SQLScript

Below is SQLScript syntax of a scripted Calculation View that captures the traditional SQL syntax used to join slowly changing dimensions to fact tables.

Following are the results of the model:

Functionality

This approach gives 100% correct functional results.

Maintainability

Modeling slowly changing dimensions via SQLScript is not very maintainable. All SCD tables must be modeled in SQLScript, an output table with correct field names, data types, potentially capitalization and ordering must be maintained.

Velocity

As is the case previously, this approach will support realtime or batched data – with the requirement that the tables have validity periods. Extra logic may need to implemented in the ETL layer or SLT (or other replication layer) to capture validity periods if source tables are not already populated with them.

Resources

Developing this model requires strong SQL skills in developer resources. The scripted nature of the approach will also take more time to develop and debug than a graphical approach.

Performance

This model will likely have bad performance. Depending on the granularity of the dimension in the fact table, the OLAP Engine* will execute limited aggregation before handing off a large intermediate resultset to the SQL Engine to process the SCD component (which will never be pruned and thus always executed). Moreover, only equijoins are natively support in the column store – which can be examined with a VizPlan analysis against this model. As noted, the join cannot be handled by a column scan (i.e. one of HANA’s fundamental strengths) and thus will be slow to execute:

*the example above was built against raw tables, but in real-life scenarios it would likely be built against analytic views.

Temporal Join

HANA’s native feature for modeling slowly changing dimensions is called a “temporal join” and is modeled within an Analytic View is shown below:

As you can see, the join condition is specified both by the graphical equijoin as well as the temporal conditions listed in the Properties tab of the join.

Following are the results:

Functionality

This modeling approach suffers from the same fate as the first two models of this article – fact records that lack corresponding dimensions are dropped. This is due to the fact that the temporal join can only be modeled against Referential join types – which effectively function as inner joins when the respective dimensions are included in the query. This is a significant drawback to the temporal join.

Maintainability

HANA’s Temporal Join was slated to be the native solution to modeling slowly changing dimensions. As such, it’s well-documented, taught, and understood. It’s easy to model and is available directly in Analytic Views, making it very maintainable.

Velocity

As is the case previously, this approach will support realtime or batched data – with the requirement that the tables have validity periods. Extra logic may need to implemented in the ETL layer or SLT (or other replication layer) to capture validity periods if source tables are not already populated with them.

Resources

No additional resources from a development perspective are required for this approach, aside from those that may be required for implementing validity period capture in ETL or replication layer of the landscape.

Performance

In my experience, temporal joins execute with similar performance as other join types. (I have not done extensive testing with this approach, however, as I have not yet worked on a project where either the data was “referentially complete”, or where the customer was willing to accept an incomplete result set.) I’d be curious to learn more about how the temporal join actually works. The VizPlan shows a Column Search for the join as with typical models, but no additional “helper” columns (like compound join concatenated columns) were found in M_CS_ALL_COLUMNS. As such, I’m not exactly sure how the temporal aspect of the join is handled.

Surrogate Key

The final modeling approach provided is well-represented in the world of data warehousing. Tables with slowly changing dimensions are populated with a generated (“surrogate”) key that is unique per combination of natural key fields and validity period. Then, the fact table is similarly populated for records whose natural key fields correspond with the dimension and whose transaction date falls between the respective dimension record’s validity dates. Any fact records that don’t have corresponding dimensions are typically populated with a value of -1 for the surrogate key.

The screenshot below shows this approach:

Here are the results:

Functionality

This modeling approach gives 100% correct results.

Maintainability

A simple equijoin in an Analytic View is likely the most well understood modeling approach in HANA. It involves nothing more than a graphical connection and a few property specifications. No coding or detailed maintenance is required. As such, this approach is very maintainable.

Velocity

In a traditional landscape, this approach will only support batched data where surrogate key population is handled by a scheduled ETL job. In theory, careful use of triggers could handle surrogate key population against replicated data, but I’ve never tried this in practice. The maintainability and performance implications may be prohibitive.

Resources

An ETL developer will need to build the logic for surrogate key population.

Performance

This approach will give the best performance as the equijoin condition is natively supported by HANA’s OLAP Engine. Moreover, if no dimension fields are included in the client query – or even if the join field is included – HANA will prune the join off from query execution – resulting in better performance.

Discussion

In the introduction to this post, I highlighted a shared theme between this post and the first one in my “Thinking in HANA” series, pointing out that HANA may offer multiple compelling solutions to the same problem.

I can also slightly modify another point made in “Thinking in HANA: Part 1” and find wide applicability to the issues raised when modeling slowly changing dimensions in HANA:

Complex data flows captured in hand-coded SQL and/or SQLScript can be difficult to maintain.

SQL queries against raw tables, whether hand-coded or software-generated (i.e. from a semantic layer like BOBJ Universes), often fail to leverage native high-performance engines in HANA (i.e. OLAP/Calc Engine).

...

Some SQL operators can't be modeled in HANA a 1:1 fashion. For example, the SQL BETWEEN condition is limited in how it can be natively modeled in HANA.

Some readers of this article are probably left with at least the following two questions, if not more:

Why don’t you show any performance metrics for the different approaches?
Which approach would you recommend as the best?

Here are my answers:

In short, I don’t currently have datasets on any current projects that I can leverage to do performance testing. We’re in the process of building ETL jobs to populate surrogate keys on a single conformed dimension table for HR data, but it will be another week or two until this is complete and I didn’t want to delay this post any longer.
Even though I just now mentioned that we’re using the ETL approach to populate surrogate keys in order to model SCDs for our project, I can’t recommend this approach across the board. The best answer I can give for what you should do is the same answer given by all the computer science sages across the ages – “it depends”.

Are you building a one-off data model for a very small use case against tables that already have validity periods but lack referential integrity? I don’t see too much harm in going with the SQL approach.

Are you building a large-scale solution that needs to have the best performance possible and support a consistent enterprise-wide data architecture with limited real-time data velocity requirements? Then go with the ETL approach.

Does your client have a dashboarding need where only metrics plotted against actual dimensions are required? Perhaps one of the “theta-join” approaches is right for you.

Are you simply missing “blank” records in your dimension tables which are hindering referential integrity required for temporal join functional correctness? First implement this clever solution by justin.molenaur2 and you’ll be right as rain with the temporal join.

However, the point of this article is to equip you with a few “yardsticks” that you can use to help discern what the best approach is based on your specific technical, organizational (and dare I say SLA) constraints.

Final Thoughts

The more time one spends working on HANA, the more one begins to realize how many options are available. Just take a simple piece of conditional logic, for example, like that found in the join condition of SCDs. No matter where the logic is specified in a functional requirements document, it’s quite possible that that logic could be built into any of:

a calculated measure (the first approach)
a calculated attribute (mentioned in the first approach)
a CalcView’s projection node filter (second approach)
a SQLScript “node” (third approach)
a HANA temporal join (fourth approach)
persisted columns (fifth approach)
semantic-layer generated SQL (not discussed here)

And those are just for the “class” of conditional logic found in joins. Depending on requirements, conditional logic (again, almost the simplest requirement one can boil any logic down to) could also be built into:

generated columns in HANA (persisted)
SLT (persisted)
Restricted measures (on-the-fly)
Set operations (i.e. replacing “or” conditions in joins or calculations with UNION ALL set operators between two “halves” of the original dataset – a particularly easy and effective performance tuning solution )
Triggers (persisted)
Many more places I’m sure I haven’t thought of!

As the old adage goes, “When the only thing you have is a hammer, everything begins to look like a nail.” Often times new consultants coming fresh out of HA300 will see the HANA world in terms of simple Attribute/Analytic/Calculation Views, when in reality, HANA offers a wealth of technology and innovation even on some of the simplest, most well-understood problems.

Take the time to understand the technical landscape, the true business requirements and constraints, and the multitude ways that business problems can be solved on HANA – and you’ll be well positioned to deliver high-performing world-class HANA solutions for your clients.