
Data Warehousing Theory… Is a rethink required?

For as long as I’ve been working in Business Intelligence and Data Warehousing, there have really been only two schools of thought on how to approach a Data Warehouse: Kimball or Inmon. At a high level, the key difference in approach is that Kimball proposes we build from the ground up, while Inmon advocates a top-down approach. That is clearly a generalised statement of the differences, and not one designed to inspire debate. I’ve seen both approaches in action and both have their pluses and minuses. The topic of debate here is not their differences, but whether the theory behind both approaches is still valid given current advances in data warehousing.

The core of both designs is to provide an efficient method of data storage and retrieval. At the time they were conceived, memory and storage were both expensive, which led to the use of Data Marts and aggregated data as a way of minimising the amount of data stored within a data warehouse.

This is the key problem with Data Marts: they are designed for aggregated data. The vast majority of users I have spoken to during requirements gathering exercises respond with an all too familiar statement when asked what their requirements are: “we want to report on everything” (breadth). When asked how much of everything they would like to hold, the next response is “everything of everything” (depth). From a user perspective, that’s fair enough. Why shouldn’t they be allowed to report on everything to make better business decisions?

To accommodate the requirement of breadth, star schemas evolve into snowflake schemas and multiple Data Marts are created. To store the level of depth required for reporting, Data Marts either hold line item data, resulting in massive fact tables and performance problems, or the data is kept in the ODS with clever ways devised to drill from a data mart into the ODS.
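To make the grain trade-off concrete, here is a minimal sketch of a star schema held at line-item grain and the data-mart style aggregate built from it. All table and column names are hypothetical, chosen purely for illustration:

```python
import sqlite3

# A toy star schema: one fact table at line-item grain plus a date
# dimension. Illustrative only -- not any particular vendor's model.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_date  (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
    CREATE TABLE fact_sales (date_key INTEGER, product TEXT, qty INTEGER, amount REAL);
""")
con.executemany("INSERT INTO dim_date VALUES (?, ?, ?)",
                [(20110101, 2011, 1), (20110201, 2011, 2)])
con.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
                [(20110101, "widget", 2, 20.0),
                 (20110101, "gadget", 1, 15.0),
                 (20110201, "widget", 5, 50.0)])

# Data-mart style: aggregate the line items away to keep the mart small...
monthly = con.execute("""
    SELECT d.year, d.month, SUM(f.amount)
    FROM fact_sales f JOIN dim_date d USING (date_key)
    GROUP BY d.year, d.month
    ORDER BY d.year, d.month
""").fetchall()
print(monthly)  # [(2011, 1, 35.0), (2011, 2, 50.0)]
# ...but once aggregated, the individual line items ("depth") are gone:
# drilling back down to the original transactions means going to the ODS.
```

Storing the fact table at line-item grain keeps the depth but is exactly what produces the massive fact tables described above.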

Although not perfect, the theory has held up well considering the barriers faced. But the barriers that shaped the design of data warehousing are rapidly falling. The cost of storage and memory has been dropping for a long time. More relevant still is the rise and maturity, in recent times, of a number of new models and approaches to business intelligence and warehousing.

These new models and approaches do not fit cleanly into the theory of data warehousing as we know it. My question is: should the theories of Kimball and Inmon be updated for the modern advances in data warehousing, or do we need a fresh approach altogether?

  • It depends!

    Depends upon why a data warehouse is required – multiple data sources, quantity of data, complexity of data etc etc.
    We’ve all seen data warehouses where someone has just dumped every table they can find into a DB, then dropped a BO Universe on top – yes, all the data is in one place, but the users can’t find anything, and even if they did, the query would probably return nonsense.

    I generally follow Kimball, but do admit you can end up with very wide fact tables. They’re probably easier for your typical user than seeing hundreds of tables, though. Even Kimball admits that you shouldn’t follow their approach without question.

    I think in-memory and columnar databases reduce the need for summary tables, but sometimes users need summaries for things like annual comparisons, where the query is complex to design.

    I do like the idea of Sybase IQ (not sure what SAP call it now), on top of a DW though. As ever, it looks like the new technologies allow us to extend existing approaches, not to throw them out and start again.
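    The summary-table point above can be sketched as follows: a year-over-year comparison computed directly from the detail rows, the kind of query a fast columnar engine makes cheap enough to skip the summary table. Table names and figures here are made up for illustration:

```python
import sqlite3

# Hypothetical detail table; in a classic DW you'd maintain a yearly
# summary table rather than aggregate the detail rows on every query.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (year INTEGER, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?)",
                [(2010, 100.0), (2010, 50.0), (2011, 120.0), (2011, 80.0)])

# On-the-fly annual comparison straight over the detail rows:
yoy = con.execute("""
    SELECT cur.year, cur.total - prev.total AS delta
    FROM (SELECT year, SUM(amount) AS total FROM sales GROUP BY year) cur
    JOIN (SELECT year, SUM(amount) AS total FROM sales GROUP BY year) prev
      ON prev.year = cur.year - 1
""").fetchall()
print(yoy)  # [(2011, 50.0)]
```

    The query itself is still complex to design, which is the commenter’s point: the summary may survive for usability reasons even when performance no longer demands it.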

  • Gary,
    These are questions I face very often, too. Customers are sort of irritated: why do I need a DWH if I use HANA, in-memory? The answer is not yet clear to me.
    Fact is: HANA/in-memory still has 1:1 connections, so BW will be the preferred tool for data harmonisation and consolidation from different SAP – and hopefully non-SAP – systems. HANA will be the storage medium to hold the data, and maybe to design the corresponding data flows in the near future.
    But I haven’t seen it in action so far and can’t imagine how to model a data flow in HANA.
    At the BI Congress in Germany last week, SAP introduced a CO-PA solution for in-memory use. Very promising. They are looking at further solutions for FI, SD, MM, … No doubt this is the train of the future. But who will adopt it? What are the costs? What sort of consultants are required? Lots of questions and not many real answers…


    • I haven’t seen HANA in action, but from my experience working with other replication tools and combining them with in-memory, I found that moving operational reports from BW to in-memory is a given. But there are instances where BW is hard to replace, in terms of recalling rapidly changing data, or the data that cannot be recalled directly from ECC, etc.
  • …and I think this blog rather adds to the confusion than helps to resolve it.
    The main issue I see causing the confusion is the mixing of terms and making assumptions based on marketing names or surface pseudo-facts, rather than going deep into understanding the matters and products being discussed.
    When we discuss technology/hardware advancements, for example, we need to ask first: are they truly new enablers, or a response to ever-growing demands like recently-unthinkable data volumes?
    Mixing into one bag DW methodologies, data modelling, religious matters (Kimball vs. Inmon), analytics-focused (not necessarily DW-focused!) databases, DW platforms, plus a grain of buzz, will not help us have a meaningful conversation.
    Gary, I would look at this SCN blog post of yours as a general introduction and will wait for more detailed and focused posts to deep dive into the discussion.
    Alternatively, I am still waiting for the blog from Ethan 🙂
    Looking forward to those. Thanks.
    • I read the blog as asking the question: ‘How does new technology impact existing methodologies in the DW Market?’
      Seems like a good basis for a discussion to me.
    • I thought the blog brought up a good question, which I’m sure is inspired by Timo Elliott and Hasso Plattner’s pronouncements.

      If you are looking to attack the seeds of all the confusion, SAP is probably your best target 🙂

      Regarding the blog, I warned you it might be a long time 😉 I’ll admit – I don’t find star-schema to columnar mappings to be the most gratifying topic to write about!

  • it’s been a while since i had to take linear algebra, and this is not a true mathematical formula, but i think it captures what the current thinking is nicely (at least for me).
    transactions are added to rows and extend the business data as it grows within each enterprise. for normal intervals, driven mostly by time (months, quarters, years) this data needs to be aggregated and evaluated for different reasons but mostly for legal compliance.
    the rows need to be summarized and transposed into columns to pack them as close as possible to the CPU to approach constant time (realtime). software doesn’t exist without hardware and vice versa, so thinking only in either terms is leaving the other variable uncontrolled.
    in business terms, OLTP is the ledger and OLAP are the balances of transactions accrued over time.
    i know i oversimplify here, but this is my takeaway from my reading the book and it’s quite intuitive.
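    The ledger/balances idea can be put in code: the same transactions held row-wise for OLTP appends, then transposed into columns for OLAP aggregation. A toy sketch, not a real engine:

```python
# OLTP: the ledger -- one row appended per business event.
transactions = [
    {"account": "A", "period": "2011-Q1", "amount": 100},
    {"account": "A", "period": "2011-Q2", "amount": -40},
    {"account": "B", "period": "2011-Q1", "amount": 70},
]

# Transpose rows into columns -- the layout a columnar store keeps so
# that scans touch only the columns a query actually needs.
columns = {key: [t[key] for t in transactions] for key in ("account", "amount")}

# OLAP: the balances are just an aggregation over the amount column.
balances = {}
for acct, amt in zip(columns["account"], columns["amount"]):
    balances[acct] = balances.get(acct, 0) + amt
print(balances)  # {'A': 60, 'B': 70}
```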
  • Apart from all the above, we need to consider the much-enhanced capability of parallel processing too when we design the next DW.

    If SAP customers are not getting the right answers, SAP should probably consider additional ways to communicate the message on HANA etc. And try to resist the temptation of another name change for a bit 🙂

    HANA replacing BW is a fair question for customers to ask. A lot of them have big BW investments. A blunt “no” is not what is going to stop this question – SAP should take the time to answer the “why not” in plain English, and at multiple levels of the ecosystem. Without such messaging, a lot of good innovation will just never get utilised by the market.
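    The parallel-processing point can be sketched as partitioned aggregation: split the fact data across workers, then combine the partial results. A minimal, hypothetical example (threads keep the sketch runnable anywhere; a real engine would fan the same pattern out across cores or nodes):

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(partition):
    # Each worker aggregates only its own partition of the "fact table".
    return sum(partition)

# Numbers stand in for fact rows; partition them round-robin 4 ways.
data = list(range(1_000_000))
partitions = [data[i::4] for i in range(4)]

with ThreadPoolExecutor(max_workers=4) as pool:
    total = sum(pool.map(partial_sum, partitions))  # combine partials

print(total == sum(data))  # True
```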

  • Hi Gary,

    Great read and good thinking. It’s strange, as HANA/in-memory seems set to solve all our ‘problems’ around data warehousing insecurities, yet real time has been achievable for many years. Sybase, as we know, has been driving the financial sector, with a key use within Global Markets, where real time really is real time!
    I discussed this with a friend who works in trading at a financial institution. He said Sybase was an interesting acquisition by SAP, but also commented that a data warehouse will always be needed and the principles of Inmon & Kimball will still be adhered to, as order is, and always will be, required, irrespective of in-memory. I agree.

    Thanks Geoff.

  • When looking beyond storage and query performance, traditional DW approaches have the advantage of structuring data in a way that is easy for the user to approach and easy for the architect and developer to reuse and extend.

    These days I’m finding that some of the biggest costs in BI projects lie in making data easy to use and aligned with business policies and processes (hardware costs less and development is outsourced). This requires good data modeling – ensured by active business users and visionary architects. Expensive resources, yes, but they drive and enable adoption using the DW modeling techniques outlined early on by Inmon, Kimball and co. In my experience, lack of adoption is the biggest ROI caveat in most IT systems – definitely also in large EDW installations.

    Unlike many non-SAP DW architects, I believe SAP BW provides a great framework for using Inmon and Kimball (more Kimball than Inmon) to structure data for, e.g., Purchasing or Production reporting, which literally requires hundreds of fields to be available for reporting and ad hoc analysis.

  • Commenting mainly to be notified of comments, but the thought in the title needs to be kept in mind.  One example that I saw at the recent Mastering SAP Technology conference in Sydney was using a HANA in-memory appliance for calculating real-time routing for taxis in Tokyo.

    Another thought: while the demos generally show us remarkable calculation speeds over extremely large data sets, the real change will occur when people start running the massive analytical queries that they dare not run at the moment. As soon as that starts to happen, we will start hitting the old constraints (and finding new ones!). This is because everyone with access to the appliance will be running them 🙂

  • Thanks for the feedback and comments. I’m glad I’m not the only one who has had trouble seeing the big picture of how these new products and technologies fit into the traditional ways we think of a Data Warehouse. Since my initial post I have done some more digging and have stumbled upon a brilliant blog by none other than Timo Elliott (no relation). I’m not sure why I haven’t come across it before; however, Timo has provided some brilliant insight and direction on the questions I have had. Here’s the link:

    It’s great reading and exactly what I was looking for from one of our industry leaders. Anyone at the BI4 launch in London yesterday may have seen some of this material as some of it was included in Timo’s slide deck.