Hana Smart Data Integration - Batch Dataflows

werner_daehn · ‎12-16-2014

This post is part of an entire series

What is the most performing table layout for querying data in Hana? Certainly not the OLTP data model, where many many joins are required to get all data. The OLTP data model is best suited for entering and changing data as it impacts the least amount column/rows. But in that case, is it really such a wise idea to copy the data 1:1 from the source OLTP system into Hana?

Sure you get better performance with Hana than with your existing database, but even Hana is slowed down by every join. Not much but 20 times a little is still a lot. Hence it might be a good idea to read the source data, transform it an load the data into a Hana optimized data model instead of strictly keeping it as is.

Actually, analyzing the cases when the OLTP model is good enough and when transformation makes sense was an entire exercise I went through in another blog post: Comparing the Data Warehouse approach with CalcViews - Overview and Comparing the Data Warehouse approach with CalcViews - Examples.

The summary of my findings are, Hana is simply amazing, how much it can cope with within sub seconds. But the creativity and the needs of business to query data is often even more amazing. Hence my conclusion is, operational reports can be done back where they belong - the operational system if run in Hana. But for true Business Intelligence - finding new business insights in the data - this is where I would suggest building star schema data models still.

How do you transform the data during the load

In the past the answer was quite simple, you use an ETL tool like Data Services. And the answer is and will be valid still. But for just getting data into Hana such an external ETL tool might have a too large of a footprint. They have their own look and feel, their own repository, their own administration and management, even simple things like data types are different. These tool are meant to read any source and put the data into any target.

Therefore we utilized existing Hana technology and combined it to build an ETL feature natively into Hana.

It comprises of the following components

CalcEngine as ETL engine
Hana Smart Data Access to connect to a remote source
Hana security and the entire Hana management actually
Hana monitoring
Hana repository for designtime storage and propagation to production
A new Hana editor, three in fact as of today: AFM Editor, WebIDE based Replication Editor, Smart Data Preparation - another WebIDE based tool aimed towards Business Users

Since this blog is located under Hana Development, let's focus on the tool targeting us, the IT professionals.

The AFM Editor

The AFM editor is a Hana Studio Modeling editor like any other, so I open that from a Hana repository tree.

With this object being created, you can start drawing dataflow. Pull in source objects, target objects and transforms. I am using the term objects here for a reason, because actually it can be many things. Hana tables, virtual tables, views, calcviews, table types, temporary tables. For the ETL use case it is (virtual) tables mostly.

As usual in this blog series I want to do something special, in this case my sources are the 0MATERIAL_ATTR and 0MATERIAL_TEXT extractors of the SAP ERP system. So I drag and drop these virtual tables into the editor and use them as source together with some transforms and have to add a target table.

Note: This adapter is one I wrote, it is not a shipped one.

The interesting part is the types of transforms available in the calcengine now and hence in the editor.

On the right hand side of the editor is a palette and depending on the features being installed in Hana, it shows different categories. If the smart Data Integration and Smart Data quality feature is installed, there are two categories.

Palette "General"

This contains the basic transforms.

Data Source: Reader
Data Sink: Loader
Data Sink (Template Table): A target table that is created matching its input columns
Aggregation: Allows to group-by columns and specify aggregation functions for the others
Filter: Allows to specify filters and complex mappings like substring(col1,4,8)
Join: All kinds of joins (Note: this is not the existing calcengine join. With this join node you can join n tables at once, specify different join types for each, specify complex join conditions.)
Sort: To order data
Union: A Union-all operation
Procedure: In case you want to invoke a procedure with all transformed data being passed into it as table type parameter

Palette "Data Provisioning"

This contains all Data Integration specific transforms.

Date Generation: A source that generates a range of date values, one per row
Row Generation: A source that generates a range of integer values, one per row
Case Node: To route the data through different paths controlled by conditions, e.g. All region='US' data should go into FilterUS, all region='EMEA' data into the transform FilterEMEA and all other rows into FilterOther.
Pivot: To take n input rows and merge them into n columns. All other columns stay as is.
UnPivot: The reverse operation, to take n columns, e.g. Revenue_January, Revenue_February,... and create n rows, e.g. the 12 values are returned in the column Revenue but using 12 rows.
Lookup: To lookup in another table for the lookup value and take one. A join might find multiples and hence the output would be all combinations - often not desired, hence the lookup.
Cleanse: To cleanse all data based on reference information, most often used together with the postal address directory of every single country of the world.
Geocode: To translate addresses into long/lat information and vice versa. Or Point Of Interest relationships and the such.
Table Comparison: To identify for all rows the difference of the incoming data compared to the target table and what type of difference there is.
Map Operation: To influence the way read data is handled, e.g. all data read from table1 should be flagged as delete and hence will be deleted in the target table. Also to specify different mappings for differently flagged rows, e.g. all insert rows should have the value of now() in the INSERT_DATE column but not be changed by updates. And update rows should write the value of now() into the LAST_UPDATE column.
History Preserving: Deals with all the complexity when loading a target table that should retain the history. So instead of updaing the e.g. customer record, a new customer record version should be added, the valid-from and valid-to columns of the old and new version should be updated,... things like that. Makes creating a Slow Changing Dimension Type 2 a piece of cake. Note: Requires a Table Comparison Transform somewhere upstream, else it does not know what changed and what the current values are.

Summary

As you concur hopefully, there is little reason to use an external ETL tool for loading data into Hana, even if transformations are required. Although I have talked about loading the data into Hana just once, this solution does allow to create delta logic as well. But more important, it can deal with realtime data as well!