Hana Smart Data Integration - Overview

werner_daehn · ‎11-27-2014

This post is part of an entire series

Prior to Hana SP9 SAP suggested to use different tools to get data into Hana: Data Services (DS), System Landscape Transformation (SLT), Smart Data Access (SDA), Sybase Replication Server (SRS), Hana Cloud Integration - DS (HCI-DS),... to name the most important ones. You used Data Services for batch transformations of virtually any sources, SLT for realtime replication of a few supported databases with little to no transformations, HCI-DS when it comes to copying database tables into the cloud etc.
With the Hana Smart Data Integration feature you get all in one package plus any combination, when it comes to loading a single Hana instance.

The user however has very simple requirements when it comes to data movement these days:

Support batch and realtime for all sources
Allow transformations on batch and realtime data
There should be no difference between loading local on-premise data and loading over the Internet into a cloud target other than the protocol being used
Provide one connectivity that supports all
Provide one UI that supports all

The individual tools like Data Services do make sense still for all those cases the requirement matches the tool's sweet spot. For example a customer not running Hana or where Hana is just yet another database, such a user will prefer a best of breed standalone product like Data Services always. Customers requiring to merge two SAP ERP company codes will use SLT for that, it is built for this use case. All of these tools will continue to be enhanced as standalone products. In fact this is the larger and hence more important market! But to get data into Hana and to use the Hana options, that is when it becomes hard to argue why multiple external tools should be used, each with its own connectivity and capability.

In addition to that the Hana SDI feature tries to bring the entire user experience and effectiveness to the next level, or lays the groundwork for that at least.

Designing Transformations

Let's start with a very simple dataflow, I want to read news from CNN, check if the text "SAP" is part of the news description and put the result into a target table. Using Hana Studio, we create a new Flowgraph Model repo object and I dragged in the source, a first simple transformation and the target table. Then everything is configured and can be executed. So far nothing special, you would do the same thing with any other ETL tool.

But now I want to deal with the changes. With any ETL tool in the market today, I would need to build another dataflow handling changes for the source table. Possibly even multiple in case deletes have to be processed differently. And how do I identify the changed data actually?

With Smart Data Integration all I do in above dataflow is to check the realtime flag, everything else happens automatically.

How are changes detected? They are sent in realtime by the adapter.

What logic needs to be applied on the change data in order to get it merged into the target table? The same way as the initial load did, considering the change type (insert/update/delete) and its impact on the target.

The latter is very complex of course, but we when looking at what kind of dataflows the users have designed for that, we were able to come up with algorithms for each transformation.

The complexity of what happens under the cover is quite huge, but that is the point. Why should I do that for each table when it can be automated for most cases? Even if it works for 70% of the cases only, that is already a huge time saver.

Ain't that smart?

The one thing we have not been able to implement in SP9 is joins, but that was just a matter of development time. The algorithms exists already and will be implemented next.

Adapters

How does Hana get the news information from CNN? Via a Java adapter. That is the second major enhancement we built for SP9. Every Java developer can now extend Hana by writing new Adapters with a few lines of code. The foundation of this feature is Hana Smart Data Access. With this you can create virtual tables, which are views on top of remote source tables and read data from there.

For safety reasons these adapters do not run inside Hana but are hosted on one or many external computers running the Hana Data Provisioning Agent and the Adapters. This agent is a very small download from Service Market Place and can be located on any Windows/Linux computer. Since the agent talks to Hana via either TCP or https, the agent can even be installed inside the company network and loads into a Hana cloud instance!

Using that agent and its hosted adapters Hana can browse all available source tables, well in case of a RSS feed there is just a single table per RSS provider, and a virtual table being created based on that table structure.

Now that is a table just like any other, I can select from it using SQL, calculation views or whatever and will see the data as provided by the adapter. The user cannot see any difference to a native Hana table other than reading remote data will be slower than reading data from Hana.

That covers the batch case and the initial load.

For realtime Hana got extended to support a new SQL command "create remote subscription <name> using (<select from virtual table>) target <desired target>". As soon as such remote subscription got activated, the Adapter is asked to listen for changes in the source and send them as change rows to Hana for processing. The way RSS changes are received is by querying the URL frequently and push all found rows into to Hana. Other sources are might support streaming of data directly but that is up to the adapter developer. As seen from Hana the adapter provides change information in realtime, how the adapter does produce that we do not care.

This concludes a first overview about Hana Smart Data Integration. In subsequent posts I will talk about the use cases this opens up, details of each component and the internals.

There is also a video from the Hana Academy on youtube: