Hana Smart Data Integration – Realtime Sources with Transformations
This post is part of an entire series
- Hana Smart Data Integration – Adapters
- Hana Smart Data Integration – Batch Dataflows
- Hana Smart Data Integration – Realtime Table Replication
- Hana Smart Data Integration – Realtime Sources with Transformations
- Hana Smart Data Integration – Realtime Sources with History Preserving
- Hana Smart Data Integration – Architecture
- Hana Smart Data Integration – Fun with Transformation Services
The classic way to work with external data in Hana is to replicate the source tables to Hana and then transform the data in the user queries, e.g. by adding a calculation view on top. But does it make sense to apply the same transformations to the same rows every time a user queries the data? It would make more sense to apply the transformations only once to each row and make the calculation view very simple. In other words, the target of the realtime subscription receiving the changes should be a transformation task, not a table like in the Twitter example before.
And actually, that is quite simple utilizing the Hana Smart Data Integration feature. To be more precise, it is two checkboxes….
The batch dataflow
In the previous example of a batch dataflow we read from two database tables, joined them together and the result was loaded into a target table. For a change, the source should be this time a RSS Feed from CNN.com.
So I have created an RSS Feed adapter, did create a remote source pointing it to the URL rss.cnn.com/rss/cnn_latest.rss and by creating a virtual table in this remote source, we can get all CNN news from this feed, these are the 50 most recent ones.
The primary key of a news article is its URL, hence this column is marked as PK of the source table, and the target table should have the same structure. To avoid a primary key violation when running the dataflow a second time, a table comparison transform compares the current data with the data being loaded already and inserts the new rows, updates changed rows and discards all that was loaded already.
The realtime dataflow
Executing that dataflow frequently would be one option but actually, the RSS adapter was built to support realtime subscriptions and is has optimizations built in, for example it asks in the http header already what the last change date of the page was. Therefore it is better to let the adapter push changes to Hana.
To accomplish that, all we have to do is checking the realtime boxes in above dataflow.
There are two of them, one is on container level (above screenshot) and the second is a property of the source table.
On table level the realtime box has to be set in case there are multiple source tables and only some of them should be read in realtime, e.g. you join this V_RSSFEED table with a flat file virtual table, which is static, not realtime.
And on container level the realtime flag is needed to generate a realtime task, even if just table types are used as source, no virtual table at all.
That’s it. Suddenly you execute above dataflow once and from then on all changes will be pushed by the adapter into this transformation into the final table.
Granted, above transformation is not necessarily the most complex one but nothing prevents us from building more complex transformations, say we perform a text data processing on the news headline to categorize the text into areas, companies named, name of people and load those into Hana.
Then the calculation view is a simple select on these tables instead of it doing the transformations on all data every time a user queries something.
Under the hoods
Obviously above realtime checkbox does change a lot from the execution point of view. Most important, two tasks are being generated now by the activation plugin of the hdbtaskflow. One is the initial load, the batch dataflow. The other is a realtime task.
The interesting part is the realtime flow. It is more or less the same as the initial load task, except that the source is no virtual table, but a tabletype.
The activation plugin has also create a remote subscription, with target TASK not TABLE as before.
When executing the initial (batch) load, the realtime subscription is activated. We can see that in the stored procedure that is used to start the dataflow.
Compare above with any other ETL tool. For those it takes you, say, 1 hour to create the initial load dataflow and multiple days for the delta logic. The goal here is to reduce that time to a mouse click and support realtime delta.