Hi again folks,

It’s been a few weeks since my first post in my series “Thinking on HANA”, so I think it’s about time for me to get the next article written up, posted, discussed, debated, critiqued, printed, turned into paper airplanes…

For those of you who have been following my posts over the past year or so, it’s hopefully clear by now that one of the themes of my work is, “there’s more than one way to skin a cat. I stated this a bit more formally in my last post,

As a massively-parallel, column-oriented, in-memory platform, HANA affords many opportunities to come up with new and creative solutions to seemingly well-understood problems.

In addition to exploring technical aspects of alternative solutions, I’d also like to start framing these alternatives in the bigger picture of “organizational problem-solving” (okay – “consulting”) where, in addition to technical trade-offs, organizational-imposed constraints restrict the solution domain even further. The stereotypical example in IT (and other fields) is the “triple constraint” in Project Management of time, cost and quality – the three resource constraints of all organizational projects. This “triple constraint” is, of course, only a heuristic to help guide project management decision-making. Constraints such as risks, financial costs, environmental costs, social costs, scope, performance, etc can be mapped to any number of “focal points” that make the most sense (heaven forbid we start talking about hierarchies of constraints!)

When developing solutions on HANA, and more specifically, during the phase in which I research, consider and prototype as many data modeling solutions as are feasible, I typically characterize the solution set with the following model:

hana pentagon.png

Functionality

  • Does the solution meet the ad-hoc requirements? Visualization requirements?
  • Does it support the data structure requirements? (i.e. hierarchy for MDX clients)
  • Does it provide the required result set?


Performance

  • Do queries execute sufficiently fast?
  • Are there any concerns regarding memory consumption?

Maintainability

  • Is the solution well-understood by other practitioners?
  • Can it be easily maintained?

Velocity

  • Does the solution meet the real-time and/or batch requirements from source system to HANA?

Resources

  • Can the solution be implemented in a reasonable amount of time and cost given available resources?

As should be immediately obvious, these constraints can easily be mapped to the classic project management “triple constraint”, and with a bit of “refactoring” could certainly fit into four or six or N number of discrete concerns.

Keeping in mind these constraints, let’s evaluate 5 different solutions to modeling “slowly changing dimensions” (SCDs) in HANA.

Slowly Changing Dimensions

I characterize slowly-changing dimensions as those attributes of “organizational entities” that 1) change over time and 2) have an impact on business processes that can and likely should be measured to assist in organizational decision-making. (For all the messy details free to reference Wikipedia’s treatise on all 6+ different types).  Generally these attributes are lost in transactional systems (i.e. they are treated like CRUD – created, read, updated, deleted – but not maintained or tracked over time).

Let’s take a very simple example that illustrates the concept.

Imagine a product that goes through a typical technology lifecycle – it’s built with one metal and over time upgraded with lighter and stronger metals (okay, carbon fiber isn’t a metal, fair enough). Below shows an image of material changes over time – within the single month of January (I know, not very realistic, use your imagination ).

/wp-content/uploads/2014/03/metals_416824.png

This data can be captured with the following table structure:

data str.png

PROD_ID is the primary key of the product, DATE_FROM and DATE_TO capture the “validity period” of the material attribute, and MATERIAL captures the physical material the product is made of during that period.

Next, imagine this product is sold throughout all of January. Again, for simplicity’s sake – we’ll use an *incredibly* simple model. In our alternate universe here, a single sale for $100 was registered in every day of January for the product. No additional information was captured. (Imaginations, people!). Here is the table structure. It should be self-explanatory. TX_DATE is the transaction date.

/wp-content/uploads/2014/03/s2_416832.png

Although the example already speaks for itself, let’s assume we want to analyze the aggregate sales for the product for each material that it was made out of. We would accomplish this by joining the two tables on PROD_ID as well as joining the tables where the transaction date (TX_DATE) of the sales table falls between the validity dates (DATE_FROM and DATE_TO) of the product material table. What’s also important is that we analyze sales for those days where we mistakenly didn’t track the product’s material – we still want to see the sales for those products.

Following are five alternative solutions that I’ll describe for modeling slowly-changing dimensions in HANA:

  1. Theta Join #1
  2. Theta Join #2
  3. SQLScript
  4. Temporal Join
  5. Surrogate Key

I’ll demonstrate each approach, follow each with a discussion on the pros and cons with respect to the constraints provided earlier, and I’ll end this article with a bird’s eye view on the implications for these kinds of requirements and solutions.

Theta Join #1

I use the term “thetajoin” to describe any join condition that is somewhere between the “natural” join conditions (i.e. those which respect the cardinality of the relationship between the tables) and a “cross join”, otherwise called a Cartesian product, which maps every record in one table to reach record in the other table.

For those who are interested in the precise definitions of terms like “theta join”, a simple google search should serve you well. Hopefully I don’t upset folks too much by being a bit liberal/imprecise with my use of these terms.

In the screenshot below, you’ll see that the sales table and the material table are only joined on PROD_ID.

/wp-content/uploads/2014/03/img1_427414.png

The conditional component of the join is handled in the following calculated column, where the logic only populates the column with the sales amount if it finds that the transaction date falls between the beginning and end dates of the validity period.

/wp-content/uploads/2014/03/calc_416834.png

Querying the data model for validity dates, material and amounts gives the following result.

/wp-content/uploads/2014/03/res1_416835.png

Let’s discuss the data model and the results in the context of the constraints discussed initially.

Functionality


This approach solves the main requirement – aggregating sales per material type – but fails to pull through any sales figures that don’t correspond to material values (i.e. where material is null). In order to achieve this final requirement, the model would actually have to output calculated attribute values per validity period – basically the same calculation but for the dimensions rather than the single measure.  The base measure should then come through alone, and when grouped by calculated dimensions, it would effectively display the final result with an additional record with nulls in the other columns as expected.


Maintainability


The model above is relatively straightforward to understand. However, it’s deceiving. The left outer join implies to the developer that all fact records will be captured as is the case with most data models – but the condition of the calculated measure renders this characteristic void. Moreover, the “theta join” is not particularly intuitive. Also, if the solution was implemented with calculated attributes instead of the calculated measure – as described above, which would give 100% correct functional results – this kind of approach would be entirely unsustainable. The amount of work (in production situation, tens or hundreds of calculated attributes), the confusion of the approach, the room for error, and the performance implications would make this a very bad decision.


Velocity


This approach can be modeled against real-time (i.e. SLT sourced) or batched (i.e. ETL sourced) data if source data has validity periods. If source data does not have validity periods, an ETL job will have to capture the validity periods. If SLT is used, an approach like this one should be used to capture validity periods.


Resources


This approach should be pretty quick to develop without taxing the development team too heavily. SLT-sourced data, however, will require additional resources to implement the history-tracking approach referenced above if this is pursued.


Performance


This model will first execute in the OLAP engine, and the aggregated results will be processed by the Calculation Engine for the calculated measure to arrive at the final result. Since both engines are required to process the result, performance will suffer. Moreover, extra memory will be consumed to accommodate the theta join (which results in more intermediate records before final result set is built).

Theta Join #2

The following approach is similar to the first one. In the analytic view, the only join condition modeled is the equijoin on PROD_ID between the two tables.

However, the conditional logic is implemented as a filter in a projection node with a Calculation View, and the base measure is pulled through.

/wp-content/uploads/2014/03/proj_416836.png

Following are the results:

/wp-content/uploads/2014/03/res3_416837.png

Functionality


This approach suffers from the same shortcoming as the first – any fact records that don’t have corresponding dimensions are effectively dropped from the result set.


Maintainability


The approach is about as maintainable as the first one.


Velocity


As is the case previously, this approach will support realtimeor batched data – with the requirement that the tables have validity periods. Extra logic may need to implemented in the ETL layer or SLT (or other replication layer) to capture validity periods if source tables are not already populated with them


Resources


This model has about the same development resource requirements as the first model.


Performance


This model will perhaps perform slightly better than the first. In my experience, computations are more expensive than filters. The theta join poses the same risk of additional memory consumption.

SQLScript

Below is SQLScript syntax of a scripted Calculation View that captures the traditional SQL syntax used to join slowly changing dimensions to fact tables.

/wp-content/uploads/2014/03/scr_416838.png

Following are the results of the model:

/wp-content/uploads/2014/03/res10_416839.png

Functionality


This approach gives 100% correct functional results.


Maintainability


Modeling slowly changing dimensions via SQLScript is not very maintainable. All SCD tables must be modeled in SQLScript, an output table with correct field names, data types, potentially capitalization and ordering must be maintained.


Velocity


As is the case previously, this approach will support realtime or batched data – with the requirement that the tables have validity periods. Extra logic may need to implemented in the ETL layer or SLT (or other replication layer) to capture validity periods if source tables are not already populated with them.


Resources


Developing this model requires strong SQL skills in developer resources. The scripted nature of the approach will also take more time to develop and debug than a graphical approach.


Performance


This model will likely have bad performance. Depending on the granularity of the dimension in the fact table, the OLAP Engine* will execute limited aggregation before handing off a large intermediate resultset to the SQL Engine to process the SCD component (which will never be pruned and thus always executed). Moreover, only equijoins are natively support in the column store – which can be examined with a VizPlan analysis against this model. As noted, the join cannot be handled by a column scan (i.e. one of HANA’s fundamental strengths) and thus will be slow to execute:

/wp-content/uploads/2014/03/exc_416840.png

*the example above was built against raw tables, but in real-life scenarios it would likely be built against analytic views.

Temporal Join

HANA’s native feature for modeling slowly changing dimensions is called a “temporal join” and is modeled within an Analytic View is shown below:

/wp-content/uploads/2014/03/img2_427421.png

As you can see, the join condition is specified both by the graphical equijoin as well as the temporal conditions listed in the Properties tab of the join.

Following are the results:

/wp-content/uploads/2014/03/tres_416843.png

Functionality


This modeling approach suffers from the same fate as the first two models of this article – fact records that lack corresponding dimensions are dropped. This is due to the fact that the temporal join can only be modeled against Referential join types – which effectively function as inner joins when the respective dimensions are included in the query. This is a significant drawback to the temporal join.


Maintainability


HANA’s Temporal Join was slated to be the native solution to modeling slowly changing dimensions. As such, it’s well-documented, taught, and understood. It’s easy to model and is available directly in Analytic Views, making it very maintainable.


Velocity


As is the case previously, this approach will support realtime or batched data – with the requirement that the tables have validity periods. Extra logic may need to implemented in the ETL layer or SLT (or other replication layer) to capture validity periods if source tables are not already populated with them.


Resources


No additional resources from a development perspective are required for this approach, aside from those that may be required for implementing validity period capture in ETL or replication layer of the landscape.


Performance


In my experience, temporal joins execute with similar performance as other join types. (I have not done extensive testing with this approach, however, as I have not yet worked on a project where either the data was “referentially complete”, or where the customer was willing to accept an incomplete result set.) I’d be curious to learn more about how the temporal join actually works. The VizPlan shows a Column Search for the join as with typical models, but no additional “helper” columns (like compound join concatenated columns) were found in M_CS_ALL_COLUMNS. As such, I’m not exactly sure how the temporal aspect of the join is handled.

Surrogate Key

The final modeling approach provided is well-represented in the world of data warehousing. Tables with slowly changing dimensions are populated with a generated (“surrogate”) key that is unique per combination of natural key fields and validity period. Then, the fact table is similarly populated for records whose natural key fields correspond with the dimension and whose transaction date falls between the respective dimension record’s validity dates. Any fact records that don’t have corresponding dimensions are typically populated with a value of -1 for the surrogate key.

The screenshot below shows this approach:

/wp-content/uploads/2014/03/img3_427422.png

Here are the results:

Functionality


This modeling approach gives 100% correct results.


Maintainability


A simple equijoin in an Analytic View is likely the most well understood modeling approach in HANA. It involves nothing more than a graphical connection and a few property specifications. No coding or detailed maintenance is required. As such, this approach is very maintainable.


Velocity


In a traditional landscape, this approach will only support batched data where surrogate key population is handled by a scheduled ETL job. In theory, careful use of triggers could handle surrogate key population against replicated data, but I’ve never tried this in practice. The maintainability and performance implications may be prohibitive.


Resources


An ETL developer will need to build the logic for surrogate key population.


Performance


This approach will give the best performance as the equijoin condition is natively supported by HANA’s OLAP Engine. Moreover, if no dimension fields are included in the client query – or even if the join field is included – HANA will prune the join off from query execution – resulting in better performance.

Discussion

In the introduction to this post, I highlighted a shared theme between this post and the first one in my “Thinking in HANA” series, pointing out that HANA may offer multiple compelling solutions to the same problem.

I can also slightly modify another point made in “Thinking in HANA: Part 1” and find wide applicability to the issues raised when modeling slowly changing dimensions in HANA:

  1. Complex data flows captured in hand-coded SQL and/or SQLScript can be difficult to maintain.
  2. SQL queries against raw tables, whether hand-coded or software-generated (i.e. from a semantic layer like BOBJ Universes), often fail to leverage native high-performance engines in HANA (i.e. OLAP/Calc Engine).
  3. Some SQL operators can’t be modeled in HANA a 1:1 fashion. For example, the SQL BETWEEN condition is limited in how it can be natively modeled in HANA.

Some readers of this article are probably left with at least the following two questions, if not more:

  1. Why don’t you show any performance metrics for the different approaches?
  2. Which approach would you recommend as the best?

Here are my answers:

  1. In short, I don’t currently have datasets on any current projects that I can leverage to do performance testing. We’re in the process of building ETL jobs to populate surrogate keys on a single conformed dimension table for HR data, but it will be another week or two until this is complete and I didn’t want to delay this post any longer.
  2. Even though I just now mentioned that we’re using the ETL approach to populate surrogate keys in order to model SCDs for our project, I can’t recommend this approach across the board. The best answer I can give for what you should do is the same answer given by all the computer science sages across the ages – “it depends”.

Are you building a one-off data model for a very small use case against tables that already have validity periods but lack referential integrity? I don’t see too much harm in going with the SQL approach.

Are you building a large-scale solution that needs to have the best performance possible and support a consistent enterprise-wide data architecture with limited real-time data velocity requirements? Then go with the ETL approach.

Does your client have a dashboarding need where only metrics plotted against actual dimensions are required? Perhaps one of the “theta-join” approaches is right for you.

Are you simply missing “blank” records in your dimension tables which are hindering referential integrity required for temporal join functional correctness? First implement this clever solution by Justin Molenaur and you’ll be right as rain with the temporal join.


However, the point of this article is to equip you with a few “yardsticks” that you can use to help discern what the best approach is based on your specific technical, organizational (and dare I say SLA) constraints.

Final Thoughts

The more time one spends working on HANA, the more one begins to realize how many options are available. Just take a simple piece of conditional logic, for example, like that found in the join condition of SCDs. No matter where the logic is specified in a functional requirements document, it’s quite possible that that logic could be built into any of:

  1. a calculated measure (the first approach)
  2. a calculated attribute (mentioned in the first approach)
  3. a CalcView’s projection node filter (second approach)
  4. a SQLScript “node” (third approach)
  5. a HANA temporal join (fourth approach)
  6. persisted columns (fifth approach)
  7. semantic-layer generated SQL (not discussed here)

And those are just for the “class” of conditional logic found in joins. Depending on requirements, conditional logic (again, almost the simplest requirement one can boil any logic down to) could also be built into:

  1. generated columns in HANA (persisted)
  2. SLT (persisted)
  3. Restricted measures (on-the-fly)
  4. Set operations (i.e. replacing “or” conditions in joins or calculations with UNION ALL set operators between two “halves” of the original dataset – a particularly easy and effective performance tuning solution )
  5. Triggers (persisted)
  6. Many more places I’m sure I haven’t thought of!

As the old adage goes, “When the only thing you have is a hammer, everything begins to look like a nail.” Often times new consultants coming fresh out of HA300 will see the HANA world in terms of simple Attribute/Analytic/Calculation Views, when in reality, HANA offers a wealth of technology and innovation even on some of the simplest, most well-understood problems.


Take the time to understand the technical landscape, the true business requirements and constraints, and the multitude ways that business problems can be solved on HANA – and you’ll be well positioned to deliver high-performing world-class HANA solutions for your clients.

To report this post you need to login first.

13 Comments

You must be Logged on to comment or reply to a post.

  1. Jon-Paul Boyd

    Another master class Jody, this collection is a very useful addition to my modelling toolkit.  A great reflection on the power, flexibility and range of modelling options.

    Indeed I was deceived too with the dropping of facts in Theta Join #1.  Out of all the approaches, I dislike the SQLScript for it’s maintainability concerns, and prefer the temporal join version as I see it as the cleanest, self-documenting approach.

    Regarding Theta #1 and a way to achieve the base measure, I had a few attempts. Waking up to understanding which you quite clearly indicated as

    model would actually have to output calculated attribute values per validity period


    i.e. each measure like DATE_FROM, DATE_TO as new calculated column with e.g.:


    if(“TX_DATE” >= “DATE_FROM” AND “TX_DATE” <= “DATE_TO”, “DATE_TO”, daydate(‘0’))

    Not nice if you had a lot of attributes, but again with all these options you present, they are cards that can be played.  Closest I got was the new line but for life of me could not get the correct sales amount (500) for days where material unknown, ended up with amount * cartesian product material known:

    scd1.JPG

    (0) 
    1. Jody Hesch Post author

      Thanks for the feedback, Jon-Paul!

      I haven’t had the bandwidth to try that other approach that I mentioned. Perhaps it’s risky for me to post things that I haven’t yet tried! I do sometimes rely on my intuition too much.

      If I get a chance to try and learn anything useful though, I’ll certainly share it!

      (0) 
  2. Justin Molenaur

    I like your pentagonal approach to solutioning, makes a whole lot of sense.

    Otherwise, I’ll need to grab a beer and read your excellent content when I get a chance, for now the first few paragraphs will have to suffice 🙂

    Regards,

    Justin

    (0) 
  3. Purnaram Kodavatiganti

    Hey Jody,

    Nice piece. One thing I would say is to avoid

    1. generated columns in HANA (persisted) in projects where ever possible. To buggy to use. when used in our project ended up getting wrong values and fails  in multi level cascading of values.

    Regards

    PK

    (0) 
    1. Jody Hesch Post author

      Hi PK,

      Good to hear from you. That’s too bad that you’ve faced issues with generated columns. Have you opened an OSS message? I’ve been using them for a few years (not in the context of SCDs though) and have never had any problems with them.

      Cheers,

      Jody

      (0) 
    2. Justin Molenaur

      I would agree with Jody, row calculations that can be persisted shouldn’t face any issues. Can you give any specifics around what the issue was? Maybe you were trying to force calculations at row level that should been after aggregation?

      Regards,

      Justin

      (0) 
      1. Purnaram Kodavatiganti

        Hi Justin,

        Here is a simple scenario which would demonstrate the problem

        Here are the Steps to Reproduce Cascading Failure.

        Step 1:

        Create Two Table and two Generated Columns using the Syntax provided Below.

        Drop Table KODAVATI.t_emp ;

        Drop Table KODAVATI.t_dept;

        create column table KODAVATI.t_dept (dept_id tinyint primary key, dname varchar(5),Field1LONGDATE ,Field2LONGDATE );

        create column table KODAVATI.t_emp (emp_id tinyint primary key, ename varchar(5), dept_id tinyint,Field1LONGDATE , Field2LONGDATE );

        ALTER TABLE KODAVATI.t_emp ADD (

          CONVERSION_INTERVAL INT GENERATED ALWAYS AS (

          CASE

          WHEN Field1 IS NOT NULL AND Field2 IS NOT NULL AND Field1>= Field2THEN TO_INTEGER(SECONDS_BETWEEN(Field2 ,Field1)/(60*60*24))

          WHEN Field1 IS NOT NULL AND Field2 IS NOT NULL AND Field1< Field2THEN 0

          WHEN ename IS NULL THEN -99

          ELSE -98

          END

          )

          )

        ;

          ALTER TABLE KODAVATI.t_dept ADD (

          CONVERSION_INTERVAL INT GENERATED ALWAYS AS (

          CASE

          WHEN Field1IS NOT NULL AND Field2IS NOT NULL AND Field1>= Field2THEN TO_INTEGER(SECONDS_BETWEEN(Field2 ,Field1)/(60*60*24))

          WHEN Field1IS NOT NULL AND Field2IS NOT NULL AND Field1< Field2THEN 0

          WHEN dname IS NULL THEN -99

          ELSE -98

          END

          )

          )

        ;

        ALTER TABLE KODAVATI.t_emp ADD (

          CONVERSION_INTERVAL1 INT GENERATED ALWAYS AS (

          CASE

          WHEN Field1IS NOT NULL AND Field2IS NOT NULL AND Field1>= Field2THEN TO_INTEGER(SECONDS_BETWEEN(Field2 ,Field1)/(60*60*24))

          WHEN Field1IS NOT NULL AND Field2IS NOT NULL AND Field1< Field2THEN 0

          WHEN ename IS NULL THEN -99

          ELSE -98

          END

          )

          )

        ;

          ALTER TABLE KODAVATI.t_dept ADD (

          CONVERSION_INTERVAL1 INT GENERATED ALWAYS AS (

          CASE

          WHEN Field1IS NOT NULL AND Field2IS NOT NULL AND Field1>= Field2THEN TO_INTEGER(SECONDS_BETWEEN(Field2 ,Field1 )/(60*60*24))

          WHEN Field1IS NOT NULL AND Field2IS NOT NULL AND Field1< Field2THEN 0

          WHEN dname IS NULL THEN -99

          ELSE -98

          END

          )

          )

        ;

        Step 2: Insert Data using the Query Below into the new Created Tables.

        insert into KODAVATI.t_dept values (3, ‘ccc’,null,null);

        insert into KODAVATI.t_emp (emp_id, ename, dept_id,Field1 ,Field2 ) values (3, ‘ccc’, 3,null,NULL);

        Step 3: Run the following query to create the system generated Concat_Attribute.

        select *  from KODAVATI.T_EMP a, KODAVATI.T_DEPT b where  a.CONVERSION_INTERVAL= b.CONVERSION_INTERVAL and a.CONVERSION_INTERVAL1 =b.CONVERSION_INTERVAL1;

        Step 4: Verify the creation of the Concat_Attribute using the following Query.

        select * from m_cs_all_columns where table_name in ( ‘T_EMP’) and column_name like ‘%$%’; — check the columns

        Step 5: Verify the Table data using the following query. Check the values in the Field – $CONVERSION_INTERVAL$CONVERSION_INTERVAL1$

        SELECT “$CONVERSION_INTERVAL$CONVERSION_INTERVAL1$”,CONVERSION_INTERVAL,CONVERSION_INTERVAL1,Field2 ,Field1

        FROM “KODAVATI”.”T_EMP”;

        Out Put

        Step 6: Run the following command to update Field1only.

        update KODAVATI.t_emp

        set Field1=’2015-03-24 2:06:37′  where emp_id =3;

        update KODAVATI.t_DEPT

        set Field1=’2015-03-24 2:06:37′ where DEPT_id =3;

        Step 7: Verify the output again using following query. Check the values in the Field – $CONVERSION_INTERVAL$CONVERSION_INTERVAL1$

        SELECT “$CONVERSION_INTERVAL$CONVERSION_INTERVAL1$”,CONVERSION_INTERVAL,CONVERSION_INTERVAL1,Field2 ,Field1

        FROM “KODAVATI”.”T_EMP” ;

        Out Put:

        Step 8: Run the following command to update Field1 , Field2

        update KODAVATI.t_emp

        set Field1=’2015-03-24 2:06:37′  where emp_id =3;

        update KODAVATI.t_DEPT

        set Field1=’2015-03-24 2:06:37′ where DEPT_id =3;

        update KODAVATI.t_emp

        set Field2=’2015-01-24 2:06:37′  where emp_id =3;

        update KODAVATI.t_DEPT

        set Field2=’2015-01-24 2:06:37′ where DEPT_id =3;

        — Join on different set of columns

        Step 9: Verify the output again using following query. Check the values in the Field – $CONVERSION_INTERVAL$CONVERSION_INTERVAL1$

        SELECT “$CONVERSION_INTERVAL$CONVERSION_INTERVAL1$”,CONVERSION_INTERVAL,CONVERSION_INTERVAL1,Field2 ,Field1

        FROM “KODAVATI”.”T_EMP” ;

        Out Put:

        Notice That the Field value “$CONVERSION_INTERVAL$CONVERSION_INTERVAL1$” never changed from -98;-98  in spite of updates to the fields. This is the field which is used internally by HANA when Views are called which results in wrong values.

        Hope this Helps.

        Regards

        PK

        (0) 
  4. Roland Huber

    Sorry if I’m missing something, but I don’t get the point of the surrogate keys. On the first glance they look similar to the IDs, but aren’t. To my understanding they need to be kept in another table, where you have an ID and these surrogate keys.

    What is the difference to first two approach with introduction these keys?

    I personally would speak more a a wrong table design if the data cannot be joined completely. Its not a HANA issue, but more bad design issue. So, bottom line:

    I don’t get the differences.

    (0) 
      1. Roland Huber

        Hi Jody,

        thx for the response.

        As you mention ‘they are not HANA specific‘ and I think, I didn’t make it clear. I see the comparison Theta Join#1, #2, Temporal Join but all of them could be done with surrogated keys. To my understanding your solution is outside the HANA specific context, but your title of this blog is ‘Thinking in HANA’. And your analysis is based to some extend as well on the HANA specifics (Analytic View, Temporal Join, etc.).

        The solution seems to require a change in the DB-design – well I would have guessed so 😉

        Don’t know if I could make myself more clear 😏

        roland

        (0) 
        1. Jody Hesch Post author

          Hi Roland,

          Yes, I’m failing to understand your concern.

          Slowly-changing dimensions – are a data warehousing requirement driven by business reporting needs, which has nothing to do with underlying technology.

          However, additional requirements – performance, ease-of-maintenance, real-time strategy – do indeed directly correspond with underlying technology. Consultants on this HANA forum are generally looking for HANA-specific solutions that they can implement, which I’ve elaborated on here with all of the various tradeoffs in mind.

          Changing a DB design to accommodate surrogate keys (or any other changes) generally is indeed what ETL processes are used for: designing a data warehouse DB structure that supports reporting requirements.

          However, many cases – i.e. the large majority of HANA BI scenarios with source SAP data replicated via SLT – give very limited flexibility in changing target database structures.

          I hope that’s all clear!

          Cheers,

          Jody

          (0) 
  5. Roland Huber

    Hi Jody,

    I am not really sure what your point of the blog is. If I would summarize than I would say, that introduced Surr_key are helpful in SCD.

    What I do not get, is if the Surr_key are already in the database structure (see pic1) or not (see pic2). Looking on the DataFoundation pic it is unclear if the key is in the structure or not.

    For me your blog-entry reads like: Don’t make a Join on ProdID, but make one on Surr_key. Which is something I would treat independent from HANA.

    What I really do not understand are your conclusions for Theta Join#1,#2, Temporal Join. Cause I could do them as well with the Surr_Key instead of the ProdID.

    I hope as well that things get clearer 😎

    thx

    roland

    (0) 
    1. Jody Hesch Post author

      Hi Roland,

      I am not really sure what your point of the blog is

      I’m assuming you read the entire article in detail? If so, then I’m very surprised that you’re not sure about this.

      Here is a very high-level outline:

      1) I described the problem: modeling slowly-changing dimensions.

      2) I described 5 trade-offs to consider when evaluating solution approaches.

      3) I illustrated 5 solutions in detail, and evaluated each against the 5 trade-offs.

      4) I concluded with a few simple examples of how different solutions might meet various requirements, noting that there’s no single best solution.

      5) I even said, in these exact words, “The best answer I can give for what you should do is the same answer given by all the computer science sages across the ages – “it depends”… However, the point of this article is to equip you with a few “yardsticks” that you can use to help discern what the best approach is based on your specific technical, organizational (and dare I say SLA) constraints.”


      In the article, even I gave very specific examples of what kind of constraints might present themselves in real-life projects, such as whether the data is sourced real-time (SLT) versus batched (ETL), what the performance requirements are, what memory constraints may exist, etc.

      If that doesn’t help clarify, then there’s nothing else I can do to help, and I’d recommend you focus your attention on those articles that do make sense to you.


      Thanks,

      Jody

      (0) 

Leave a Reply