Werner Daehn

Hana Smart Data Integration – Realtime Sources with Transformations

This post is part of an entire series

Hana Smart Data Integration – Overview

The classic way to work with external data in Hana is to replicate the source tables into Hana and then transform the data in the user queries, e.g. by adding a calculation view on top. But does it make sense to apply the same transformations to the same rows every time a user queries the data? It would make more sense to apply the transformations only once per row and keep the calculation view very simple. In other words, the target of the realtime subscription receiving the changes should be a transformation task, not a table as in the earlier Twitter example.

And with the Hana Smart Data Integration feature that is actually quite simple. To be more precise, it comes down to two checkboxes…

The batch dataflow

In the previous example of a batch dataflow we read from two database tables, joined them and loaded the result into a target table. For a change, the source this time is an RSS feed from CNN.com.

So I created an RSS feed adapter, created a remote source pointing to the URL rss.cnn.com/rss/cnn_latest.rss and, by creating a virtual table on this remote source, we can read all CNN news from this feed, i.e. the 50 most recent items.
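
For orientation, the SQL behind those two steps looks roughly like the sketch below; the agent name, schema and the adapter's configuration string are assumptions, since the exact configuration format depends on how the RSS adapter was written.

-- sketch only: agent, schema and configuration values are made up
CREATE REMOTE SOURCE "CNN_RSS" ADAPTER "RSSAdapter" AT LOCATION AGENT "MyAgent"
  CONFIGURATION '<configuration URL="http://rss.cnn.com/rss/cnn_latest.rss"/>';

-- expose the remote feed as a virtual table inside Hana
CREATE VIRTUAL TABLE "MYSCHEMA"."V_RSSFEED" AT "CNN_RSS"."<NULL>"."<NULL>"."RSSFeed";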

The primary key of a news article is its URL, hence this column is marked as the PK of the source table, and the target table should have the same structure. To avoid a primary key violation when running the dataflow a second time, a table comparison transform compares the incoming data with what has already been loaded, inserts the new rows, updates changed rows and discards everything that was loaded already.
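
For a target without further transformations, the net effect of that transform is comparable to an upsert keyed on the URL column; the table and column names below are illustrative only, and unlike the transform an upsert would rewrite unchanged rows rather than discard them.

-- rows with a new URL are inserted, existing URLs are updated (object names are made up)
UPSERT "MYSCHEMA"."T_RSSFEED" ("URL", "TITLE", "DESCRIPTION", "PUBDATE")
  SELECT "URL", "TITLE", "DESCRIPTION", "PUBDATE" FROM "MYSCHEMA"."V_RSSFEED";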

[Screenshot: batch dataflow from the RSS virtual table through the table comparison transform into the target table – /wp-content/uploads/2015/04/rssfeed1_677008.png]

The realtime dataflow

Executing that dataflow frequently would be one option, but the RSS adapter was actually built to support realtime subscriptions and has optimizations built in; for example, it first checks in the HTTP header what the last change date of the page is. Therefore it is better to let the adapter push changes to Hana.

To accomplish that, all we have to do is check the realtime boxes in the dataflow above.

[Screenshot: realtime checkbox at container level – /wp-content/uploads/2015/04/rssfeed2_677048.png]

There are two of them: one at container level (the screenshot above) and a second one as a property of the source table.

At table level the realtime box has to be set because there can be multiple source tables of which only some should be read in realtime, e.g. when you join this V_RSSFEED table with a flat-file virtual table, which is static, not realtime.

And at container level the realtime flag is needed to generate a realtime task in the first place, even if only table types are used as sources and no virtual table at all.

That’s it. You execute the dataflow above once, and from then on all changes are pushed by the adapter through this transformation into the final table.

Granted, the transformation above is not the most complex one, but nothing prevents us from building more complex transformations, say performing text data processing on the news headline to categorize the text into topic areas, companies mentioned and names of people, and loading those into Hana.

Then the calculation view is a simple select on these tables instead of performing the transformations on all the data every time a user queries something.
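
Assuming the realtime flow maintains an articles table plus an entities table extracted from the headlines (both names are purely illustrative), the consuming view boils down to a join like this:

-- simple consumption query over already-transformed data (hypothetical tables)
SELECT a."URL", a."TITLE", a."PUBDATE", e."ENTITY_TYPE", e."ENTITY_VALUE"
  FROM "MYSCHEMA"."NEWS_ARTICLES" a
  LEFT OUTER JOIN "MYSCHEMA"."NEWS_ENTITIES" e ON e."URL" = a."URL";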

Under the hood

Obviously the realtime checkbox above changes a lot from an execution point of view. Most importantly, two tasks are now generated by the activation plugin of the hdbtaskflow. One is the initial load, the batch dataflow. The other is a realtime task.

The interesting part is the realtime flow. It is more or less the same as the initial load task, except that the source is not a virtual table but a table type.

[Screenshot: generated realtime task with a table type as source – /wp-content/uploads/2015/04/rssfeed3_677063.png]

The activation plugin has also created a remote subscription, this time with target TASK instead of TABLE as before.
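
Expressed as SQL, the generated object corresponds to something along these lines; the subscription and task names are made up, as the real names are produced by the activation plugin.

-- every change captured on the virtual table is fed into the realtime task
CREATE REMOTE SUBSCRIPTION "MYSCHEMA"."SUB_RSSFEED"
  ON "MYSCHEMA"."V_RSSFEED"
  TARGET TASK "MYSCHEMA"."RSSFEED_REALTIME_TASK";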

[Screenshot: remote subscription with target TASK – /wp-content/uploads/2015/04/rssfeed4_677064.png]

When the initial (batch) load is executed, the realtime subscription is activated. We can see that in the stored procedure used to start the dataflow.
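
The typical pattern inside such a procedure is the queue-then-distribute handshake around the initial load, roughly as sketched below (subscription and task names are again made up):

ALTER REMOTE SUBSCRIPTION "MYSCHEMA"."SUB_RSSFEED" QUEUE;       -- start capturing changes and buffer them
START TASK "MYSCHEMA"."RSSFEED_INITIAL_LOAD_TASK";              -- run the batch/initial load
ALTER REMOTE SUBSCRIPTION "MYSCHEMA"."SUB_RSSFEED" DISTRIBUTE;  -- apply buffered and all future changes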

[Screenshot: stored procedure activating the realtime subscription – /wp-content/uploads/2015/04/rssfeed5_677065.png]

Compare the above with any other ETL tool. There it might take you, say, one hour to create the initial load dataflow and multiple days for the delta logic. The goal here is to reduce that effort to a mouse click and still support realtime delta.

      14 Comments

      Former Member

      Dear Werner Daehn,

      Thanks for the blog. I was testing the same example and was able to get the table comparison transform working in realtime. However, I was unable to handle deletions.

      Issue: in exactly the same scenario as above, when I select the checkbox 'detect deleted rows from comparison table' inside the table comparison transform, the realtime setup of the flowgraph gives the activation error below.

      Error message:

      column store error: task framework:  [140007] Unexpected error while creating task.;detectDeletedRows cannot be set to true because the input to operation TABLECOMP contains one or more reserved attributes.

      I removed the realtime behavior and tested with the virtual table with the checkbox 'detect deleted rows from comparison table' enabled, and it works fine.


      So my observation is that the table comparison somehow doesn't handle deletions in realtime.


      Can you please throw some light on my question?


      We are using the CDC log reader with an Oracle database.


      I also watched the video SAP HANA Academy - Smart Data Integration/Quality: The History Preserving Transform [SPS09] on YouTube

      but couldn't find a clue.


      Thanks

      Ranjith


      Werner Daehn
      Blog Post Author

      Detect-delete is not compatible with realtime.

      What does detect-delete do? It reads the entire comparison table minus the source input. The output of this is the set of rows that exist in the target but no longer in the source.

      Or in other words, if the target has 100 rows and the source only 99, then one row was deleted in the source.

      In the realtime case you get just one or two rows as input, only the rows that actually changed. Hence detect-delete would mark all the other rows as deleted.
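
      In pseudo-SQL, with :input_rows standing for whatever rows arrive at the transform and the column names made up, the detect-delete step is essentially this anti-join:

      -- every target row without a matching input row would be flagged for deletion
      SELECT t."URL" FROM "MYSCHEMA"."T_RSSFEED" t
       WHERE NOT EXISTS (SELECT 1 FROM :input_rows i WHERE i."URL" = t."URL");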

      So the solution is simple: do not check the box; deletes from the source are passed through the TC as deletes.

      Can you check my statements?

      Former Member

      Hi Werner Daehn,

      Thanks a lot for your super quick response! Yes, in theory what you said is correct. By an unfortunate coincidence, when I tried testing it in realtime without the checkbox, our agent had some problems. I will test it as soon as possible.

      But meanwhile I tried it without realtime on another sandbox (SPS9, revision 97) without ticking the checkbox, and it didn't work.

      What I did was create a flowgraph like

      source (virtual table) -> table comparison (comparison table = the target table) -> target table

      When I deleted a row in the Oracle database linked to the virtual table and executed the flowgraph, I could still see the deleted record in the target. But if I tick 'detect deleted rows', the deletion happens successfully, and when I ran the query below I saw 1 processed record in that case and 0 in the case without the checkbox enabled.

      SELECT TASK_NAME,PROCESSED_RECORDS FROM "SYS"."M_TASKS" where END_TIME = (SELECT MAX(END_TIME) from "SYS"."M_TASKS");

      I also tried it and found the same behavior when both the source and target are normal Hana column tables (without using virtual tables or SDA).

      Is this expected behavior?

      Thanks

      Ranjith

      Former Member

      Hello, Ranjith,

      I recommend you try with SPS10, as the engineering team tells me that the Table Compare operator was mostly rewritten in SPS10.

      The expected behavior is as Werner indicates: unchecking the 'detect deleted rows' flag simply tells Table Compare that it is not responsible for trying to identify deletes; they are just passed through to the next operator (Data Sink in your case). With a default writer type in the Data Sink, the records deleted at the source should be deleted at the target.

      Fernando Velez

      Former Member

      Hi Fernando,

      Hope you are doing well! It's been some time since we talked. Thanks for your input.

      As I said, I am waiting to test that in realtime without checking the 'detect' option.

      If I understand you both correctly, in non-realtime as well the 'detect deleted rows' option can be left unchecked and deletions should still work in SPS10?

      Thanks

      Ranjith

      Former Member

      Ranjith,


      No, in batch processing the 'detect deleted rows' flag needs to be checked in order for the TC operator to identify which records are to be deleted. But this assumes a full batch load, i.e. that the input data presented to the operator contains all the data at the source. This, of course, poses practical problems when tables are large, and should only be done once in a while to 'sync' the source and the target. If and when possible, partial (delta) loads should be implemented; but in that case, the 'detect deleted rows' flag must not be set.


      Thanks,


      Fernando

      Ugandhar Nandam

      Werner,

      I'm trying to execute a realtime-enabled flowgraph procedure, and below is the error displayed for Data Provisioning using the MSSQLLogReaderAdapter.

      [Image: Error_SDI_Realtime_MSSQL.png]

      I have already executed the script mssql_server_init.sql but could not figure out how to initialize the specific database used for the creation of the remote source.

      Please help me on this.

      Regards,

      Ugandhar

      Former Member

      Hi Werner,

      Great blog. We are planning to use SDI in a new project (SPS 10). I would like to know if, using SDI, we can create permanent tables in the HANA DB instead of virtual tables. We will be provisioning the data from PeopleSoft (Oracle DB). The client expects response times under 2 seconds from the HANA information views.

      Thanks,

      Kalyan

      Werner Daehn
      Blog Post Author

      There seems to be a semantic misunderstanding. A virtual table represents the remote table. But using the SDI solution you can create a subscription on such a virtual table, specify a Hana table as target, and then all changes in the remote system will be put into that Hana table instantaneously. Put your views or queries on top of the Hana table and you are good.

      So:

      virtual table = remote table

      Hana table = a table inside Hana

      subscription = the realtime process pushing change information from virtual/remote table into the Hana table in realtime.

      Yes, all of this can be done; you just had the semantics wrong, as a virtual table is just a pointer to the Oracle table in your case.
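
      Sketched in SQL for your PeopleSoft/Oracle case, with every object name made up:

      CREATE VIRTUAL TABLE "MYSCHEMA"."V_PS_JOB" AT "ORACLE_PSFT"."<NULL>"."SYSADM"."PS_JOB";  -- pointer to the remote table
      CREATE COLUMN TABLE "MYSCHEMA"."T_PS_JOB" LIKE "MYSCHEMA"."V_PS_JOB" WITH NO DATA;       -- real table inside Hana
      CREATE REMOTE SUBSCRIPTION "MYSCHEMA"."SUB_PS_JOB"
        ON "MYSCHEMA"."V_PS_JOB" TARGET TABLE "MYSCHEMA"."T_PS_JOB";

      The subscription still needs the queue / initial load / distribute sequence shown in the blog before changes start flowing into the Hana table.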

      Deal?

      Former Member

      Yes Werner, I have understood now 🙂

      Thanks for the explanation.

      Thanks,

      Kalyan

      Sachin Mishra

      Dear Werner Daehn,

      I am trying to enable the realtime feature on one of my flowgraphs, but I am getting the error below:

      [8102180] User _SYS_REPO is missing privileges: CREATE REMOTE SUBSCRIPTION for REMOTE SOURCE.

      Do you have any suggestion on this ? Thanks in advance.

      Shodhan Pujari

      Hi Sachin,

      Did you find a solution to this error? Can you let me know, since I'm also getting the same error.

      User _SYS_REPO is missing privileges: CREATE REMOTE SUBSCRIPTION ON REMOTE SOURCE "Files".

      Regards,

      Shodhan

      Andrew Ming Zhou

      Hi, can you discuss the performance considerations when using transform logic in the SDI dataflow? If there are joins in the dataflow, do they get performed in the source system (thereby taxing the OLTP system) or in the Data Provisioning server? Thanks.

      Werner Daehn
      Blog Post Author

      The logic is the following:

      • Flowgraph gets turned into a CalcScenario.
      • CalcEngine optimizes that into SQLs and operations done inside the CalcEngine.
      • SQLs are given to the Hana SDA optimizer, which decides what to send to the remote system and what to do locally.

      As you can see there are two (or three) points where a join in the remote system cannot be performed:

      • If complex transformations (calcengine-based transformations) come before the join. Example: pivot and then join; a pivot is not a SQL operation but a calcengine transform.
      • If the SQL cannot be pushed to the remote system for logical reasons, e.g. a three way join where the middle table is a Hana table and the two others are remote tables.
      • If the source adapter does not support joins, e.g. joining two flat files using the file adapter.

      In other words, you do not have explicit control of what join is pushed down or not, but you can influence it.
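
      One way to check what actually gets pushed down is to take a representative query over the virtual tables and run it through EXPLAIN PLAN; remote portions then show up as remote scan operators. The join below is a made-up example and the second virtual table is hypothetical:

      EXPLAIN PLAN SET STATEMENT_NAME = 'sdi_join_check' FOR
        SELECT r."URL", c."CATEGORY"
          FROM "MYSCHEMA"."V_RSSFEED" r
          JOIN "MYSCHEMA"."V_CATEGORIES" c ON c."URL" = r."URL";

      SELECT OPERATOR_NAME, OPERATOR_DETAILS
        FROM EXPLAIN_PLAN_TABLE
       WHERE STATEMENT_NAME = 'sdi_join_check';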