Let’s get acquainted
Let me first briefly introduce myself: Freek Keijzer, background in physics (laser spectroscopy, radioactivity, dangerous stuff), 20 years experience with SAP starting as a key user, 15 years experience as SAP Business Intelligence (BI) consultant. I also brew beer. This may not seem relevant, but I noticed at occasions like birthday parties that this impresses people way more than all of my SAP achievements put together.
But enough about me, let’s talk about you. You recently migrated your SAP Business Warehouse (BW) system to a HANA database. You gladly accepted the factor of 3 improvement in loading and query performance resulting from moving to this database alone. But after getting used to this improvement you think: “Shouldn’t there be more to this investment?”. The good news is: There is!
SAP is very proud of its HANA technology, and its marketing machine is running at full steam. Customers and implementation consultants – including me on occasion – do not always share SAP’s enthusiasm for new SAP products or technology. And believe me, if they were exaggerating I would tell you. That’s the kind of person I am. But they are not exaggerating. The HANA database technology with in-memory operation and column-based store is indeed revolutionary. First of all: the speed of some database actions. For someone who is used to pressing the button “Activate DSO data” just before lunch hoping that activation is finished after lunch, it comes as a surprise if this action after migrating to HANA takes like … no time at all. Instantaneous! No reason for lunch breaks any more. Well … other than eating. Some consequences of HANA database technology on data warehousing with a BW on HANA system are described in the next paragraphs.
Transactions and reporting back together in one box
In the old days, writing to and reading from row-store database tables simultaneously was a bad idea. “Runaway queries” were the cause of transaction system failures on numerous occasions. Dedicated systems arose to separate reporting output from transaction input. These systems are called “data warehouses” as they usually store a lot of data. Data is duplicated from the transaction system to the data warehouse, a process called “data loading”. Within the data warehouse, data is stored in multiple layers, typically 3 or 4, to prepare data for reporting. This implies more data duplication and more data loading.
With column-store database technology, writing to and reading from tables no longer interfere with one another. This means that transactions and reporting can be brought back together again, in one (HANA) box. This is what SAP means with “Embedded BW”. Take a look at the picture below. This is what the future of SAP reporting could look like. ERP tables – in the future S/4HANA – are directly accessed via virtual data models built with either Native HANA or BW objects. I used dotted lines in the interior of the box to emphasize that it is only one box. You can build your own virtual models or use the standard ones delivered with the software. Standard data models with BW objects are called “Business Content” (usually not virtual). With Native HANA objects or HANA views the standard models are called “HANA Live”. In practice, mixtures of standard and customized objects are found in most data-integration platforms. “On top” you can use dashboard, cockpits and apps delivered with the software, or you can build your own with fancy frontend tools like Lumira, Design Studio, BO Cloud and the other tools that are part of the Business Objects product portfolio (or not; it is hard to keep track; I will come back to frontend tooling in part 3).
Now that’s a good term: “data-integration platform”. Let’s use that instead of “data warehouse”, since we aim at storing as little data as possible. Storing data physically is nowadays called “persistency” and it should be an exception (… for historic data, “snapshots” and data from systems without direct access coupling; I will address this in part 3). Strive for max. one persistent layer for data from tables that are not already in the box, and no persistent layers at all for tables inside the box. Virtual layers yes, and with purposes quite comparable to the ones in a traditional data warehouse: raw data, data prepared for consumption by other data flows, data prepared for reporting.
No more cubes! No more infoobjects? No more data??
For data warehousing veterans it may come as a shock, but multidimensional models – commonly known as cubes – are no longer required. Cleverly designed cubes with properly chosen facts, dimensions and master data were once an absolute necessity for reporting with acceptable performance. Not anymore. Column-based database technology doesn’t need it, and doesn’t use it. Cubes are immediately flattened by the BW on HANA technology. This is tremendously good news for owners of a data warehouse filled with badly designed cubes. There is clearly no point in developing any new cubes after migration to the HANA database. In practice, this comes down removing a complete layer in the data warehouse. Usually, the top three layers are a table layer consisting of “Data Store Objects” (DSO), a reporting layer consisting of cubes and a virtual layer consisting of “multiproviders”. Most if not all data-transformations already took place in the data flow below the DSO. The virtual layer still has a purpose in decoupling the data model from the report to maximize flexibility towards future data model changes. But the cube layer has lost its purpose. Hence, no more cubes!
Another darling of BW veterans that may become obsolete in the future: infoobjects. You may have become fond of these little rascals with theirs attributes and their texts and their nine character definition limit, but wait until you need to build a couple of hundreds of them for a big model on non-SAP data. I can’t say I will miss them much. SAP is probably thinking the same thing, as in the newest BW objects – “Open ODS Views” and “Composite Providers” – infoobject are not a necessity anymore. You can just work with the field names from the source system with no restrictions with
field definitions being copied automatically. Some master data can be added later. Not as rich as infoobject functionality, but good enough for simple data models and PoCs. You can build “data flows” in BW using these new objects only, and build a report on top using the field names of the source system. No more infoobjects?
Data flows between asterisks, as these is no physical data flowing inside the virtual data flows. But why is it so important to get rid of the data as much as possible? Not because of data duplication, as data storage is cheap. Why then?
“The Horror of Data Loading”
Compared with the popularity of SAP ERP – worldwide market leader since the Stone Age – the BW data warehousing solution is surprisingly unpopular. Reasons mentioned are lack of agility and bad performance, but in my view the main reason for BW’s unpopularity is lack of data integrity caused by “The Horror of Data Loading”. Please take a look at another picture shown below. It depicts a typical data flow in BW with five layers, four of which contain data: Acquisition (raw data), Propagation (prepared for consumption), BTL or Business Transformation Layer (business rules applied), and a cube for reporting.
Between the Propagation DSO and the BTL DSO, data is read from master data tables and from other DSOs. The load between the DSOs is in delta mode, so data missed may never be restored. Off course such dependencies can be built into the logic of a process chain. If a data load goes wrong for some reason, all dependent loads are stopped automatically. This works fine as long as the data warehouse is relatively simple and data loading is stable enough for the overall process chain to finish most of the time.
But many successful data model implementations later, the data warehouse may have become so complex with dependencies so abundantly present, that the overall process chain rarely finishes on time with support consultants working over-hours to fix things manually to the best of their ability. Most of the time, their focus is on “getting the data into the cubes” and not on following the proper order according to the dependencies on the way to the cubes. This results in “data holes” in the cubes that are never fixed. To improve the likelihood of the process chain actually being finished by morning, some of the dependencies are “loosened”, thus creating more data holes. In complex data warehouse situations, one often sees initiatives to improve data loading stability. One such initiative was called “the swim lane”. Critical data flows were defined with dependencies firmly built into the
process chain logic. Dependencies for non-critical data flows were loosened. Main result of this initiative was data quality outside of the swim lane deteriorating.
Report users confronted with a “data hole” will notify the support desk. At least they will the first time. A support consultant will fix the issue by executing some sort of repair load. The report user may even report a second or third data hole he discovered. But at a certain point he or she will think: “Hey, it’s not my job to check data integrity on a daily basis! I may as well make downloads from the source system and do data-integration myself in Excel. At least I can be sure that the data I start with is correct.” And often that is exactly what happens. An apparently beautiful data warehouse, all glimmery and shiny, huge investment, and no one uses it because no one trusts the data.
What to do? Simple: stop loading data! Go back to the very old days when a report on ERP data was always a program running in the ERP system collecting data from tables on the fly during query run-time. With abominable performance and pulling down the transaction system as a side-effect, but at least the data was always correct. The latest database technologies make it possible to go back to these very old days, but now with good performance and without side effects. And virtualization is key to this success.
This blog is the first in a series of three on the same topic. In the next part I will describe a stepwise approach to move towards the better SAP BI world as sketched above. The third part is intended for “miscellaneous topics”. I invite you to steer me towards any direction in the upcoming blogs by commenting on this one.