When Data Is Separated From Its Related Applicatio...

tfxz · ‎05-29-2012

Moving data out of an application into a separate reporting (or related) environment - be it a data mart or a data warehouse or something else - is a frequent exercise. The underlying motivation is typically some sort of technical advantage like

getting the data into a system with better performance,
removing reporting load from the original system,
leveraging some specialized functionality like analytic, mining, predictive libraries, but also planning features,
bringing that data together with data from various other sources (data warehousing),
adding new capacity or capabilities w/o affecting the existing and established application.

While it is an approach that is currently heavily propagated in the SAP HANA context, I like to take a more HANA agnostic attitude in this blog. I admit that it has also been triggered by people questioning approaches like data warehhousing, managing data along layers and others who wondered why it is so hard to report on raw data. Finally, there are customers, colleagues and users who seek guidance but get completely confused. I'm not arguing here for one or the other but like to take the topic to less of a black & white but a more colourful discussion.

There is an underlying assumption that raw data can be easily consumed, e.g. via reporting tools, even though it's separated from the application and the inherent semantics built into that application. It is argued that the semantics - that is lost in that separation - can be recovered by creating a semantic data model that provides the necessary information that allow to interprete the raw data. In the case of a data warehouse with data coming from many sources it is usually the first time that semantics is created that is overarching several sources (= applications). But let me turn to a few tangible examples rather than discussing this in a purely abstract way.

Example 1: Semantics inside the data

Data is created in the context of some process. Let's look at one like buying an airline ticket or booking a hotel room. Initially, a ticket order is in a state NEW can then move to QUEUED (nothing available yet but on a waiting list), RESERVED (certain rights to cancel; maybe partially paid), BOOKED (final; fully paid). From BOOKED it can move to CANCELED, from RESERVED to BOOKED, from QUEUED to BOOKED and/or RESERVED etc etc. This is just an example of one such process; variants might apply. However, what it is important here is that there is a well defined state transition semantics underlying the process. Such state transitions can be modeled via finite state machines, petri nets or other suitable formal models which in turn allow to derive or prove certain properties and, in particular, the correctness of the process.

Why is this now important to a report? Take the example of the previous paragraph: if you want to calculate your current revenue you can only consider BOOKED tickets and those involving partial payments (e.g. QUEUED and RESERVED). Maybe the state also indicates how big the share of payment is: e.g. 100% for BOOKED, 10% for QUEUED, 25% for RESERVED. Another example / query is that you can look at your pipeline. All in all, you need to keep the underlying finite state machine (FSM) of those state transitions in mind when you write such queries. On the other hand, the application supporting the ticket process also has that FSM somehow implemented in the sense that ticket orders can change their respective state along the semantics defined in the FSM.

I'm not arguing that it is impossible to implement the semantics of such an FSM in a data model and in the queries on such a model. However, it can be tricky and tedious, especially as it might mean that an end user who uses a query generation tool will have to aware of those semantics and needs to accommodate parts of them in his query definition.

Example 2: How well is the data managed by the app?

Another example refers to the amount of data quality effort that is put into the data management by the application. Or put the other way: the less data quality the more safe guarding effort needs to put into the data model.

One instance is referential integrity: if the application guarantees referential integrity, i.e. values in foreign columns are always present in the key column of the related master data table(s), then inner joins can be used. If this referential integrity is not guaranteed then this needs to be handled somehow, e.g. by using outer joins with the associated penalty on performance.

A second instance is normalisation of values. Here is an example of a list of values that can but might not mean the same thing:

"smith"
"Smith"
"SMITH"
"Smith "
" smith"

Ideally, the application that inserts new or updates existing data already takes care of this and does not allow such a situation to arise. In SAP NW systems, the ALPHA exit takes care of this as it would normalise all the listed values to "SMITH". This has the advantage that a filter in SQL - like WHERE NAME = 'SMITH' - would yield the expected result. In the absence of such a normalisation one would need some regular expression syntax like WHERE NAME LIKE '%SMITH%' OR NAME LIKE '%smith%' (which would still miss mixed upper-lower case values like "Smith" and include values like "SMITHE"). The access log of an Apache web server, for example, tracks URLs as they have been typed by the end user, meaning for instance that upper and lower case is up to what the end user did. This means that a data model and queries on that data need to assume every idiosyncrasy possible.

Example 3: Noisy sensor data

Now human errors and habits aside. What happens to the ever growing amount of data originating from sensors like RFID, GPS, even the Apache access log mentioned above? Do they deliver more reliable data which, in turn, makes reporting (or better: the underlying data modeling) easier? Again, sensors are sensitive to all sorts of environmental factors like climate, landscape etc which leads to noisy sensor data. See figure 1 for an example of GPS data generated by my mobile phone when running through the forrest. You can imagine that I was certainly not running across that lake.

Figure 1

The application consuming and managing such data will take care of that noise either by mechanisms that tolerate or that correct potential noise. When that data is separated from the application those mechanisms need to be recreated somehow in the data model or the generated queries.

Conclusion

In summary, it is fair to say that re-creating correct semantics on raw data can be hard work. The semantics is already implemented inside the application as the latter is responsible for the data consistency. With sufficient application or process knowledge that re-creation can be straightforward for a single data source. It is significantly harder when data comes from multiple sources that are managed by semantically inter-related applications which, however, manage the data and its consistency completely independently. The task is then to refine stepwise(thus layers), almost as with raw material in a factory. The merit of HANA is that one can focus on semantically motivated layering rather than interleaving semantical refinement with performance optimising considerations. That translates into fewer data layers and layers being less overloaded with service level tasks. Whether layers involve materialisation or virtualisation is in the end a matter of sheer convenience.