Hana Smart Data Integration – Batch Dataflows
This post is part of an entire series
Hana Smart Data Integration – Overview
- Hana Smart Data Integration – Adapters
- Hana Smart Data Integration – Batch Dataflows
- Hana Smart Data Integration – Realtime Table Replication
- Hana Smart Data Integration – Realtime Sources with Transformations
- Hana Smart Data Integration – Realtime Sources with History Preserving
- Hana Smart Data Integration – Architecture
- Hana Smart Data Integration – Fun with Transformation Services
What is the most performing table layout for querying data in Hana? Certainly not the OLTP data model, where many many joins are required to get all data. The OLTP data model is best suited for entering and changing data as it impacts the least amount column/rows. But in that case, is it really such a wise idea to copy the data 1:1 from the source OLTP system into Hana?
Sure you get better performance with Hana than with your existing database, but even Hana is slowed down by every join. Not much but 20 times a little is still a lot. Hence it might be a good idea to read the source data, transform it an load the data into a Hana optimized data model instead of strictly keeping it as is.
Actually, analyzing the cases when the OLTP model is good enough and when transformation makes sense was an entire exercise I went through in another blog post: Comparing the Data Warehouse approach with CalcViews – Overview and Comparing the Data Warehouse approach with CalcViews – Examples.
The summary of my findings are, Hana is simply amazing, how much it can cope with within sub seconds. But the creativity and the needs of business to query data is often even more amazing. Hence my conclusion is, operational reports can be done back where they belong – the operational system if run in Hana. But for true Business Intelligence – finding new business insights in the data – this is where I would suggest building star schema data models still.
How do you transform the data during the load
In the past the answer was quite simple, you use an ETL tool like Data Services. And the answer is and will be valid still. But for just getting data into Hana such an external ETL tool might have a too large of a footprint. They have their own look and feel, their own repository, their own administration and management, even simple things like data types are different. These tool are meant to read any source and put the data into any target.
Therefore we utilized existing Hana technology and combined it to build an ETL feature natively into Hana.
It comprises of the following components
- CalcEngine as ETL engine
- Hana Smart Data Access to connect to a remote source
- Hana security and the entire Hana management actually
- Hana monitoring
- Hana repository for designtime storage and propagation to production
- A new Hana editor, three in fact as of today: AFM Editor, WebIDE based Replication Editor, Smart Data Preparation – another WebIDE based tool aimed towards Business Users
Since this blog is located under Hana Development, let’s focus on the tool targeting us, the IT professionals.
The AFM Editor
The AFM editor is a Hana Studio Modeling editor like any other, so I open that from a Hana repository tree.
With this object being created, you can start drawing dataflow. Pull in source objects, target objects and transforms. I am using the term objects here for a reason, because actually it can be many things. Hana tables, virtual tables, views, calcviews, table types, temporary tables. For the ETL use case it is (virtual) tables mostly.
As usual in this blog series I want to do something special, in this case my sources are the 0MATERIAL_ATTR and 0MATERIAL_TEXT extractors of the SAP ERP system. So I drag and drop these virtual tables into the editor and use them as source together with some transforms and have to add a target table.
Note: This adapter is one I wrote, it is not a shipped one.
The interesting part is the types of transforms available in the calcengine now and hence in the editor.
On the right hand side of the editor is a palette and depending on the features being installed in Hana, it shows different categories. If the smart Data Integration and Smart Data quality feature is installed, there are two categories.
Palette “General”
This contains the basic transforms.
- Data Source: Reader
- Data Sink: Loader
- Data Sink (Template Table): A target table that is created matching its input columns
- Aggregation: Allows to group-by columns and specify aggregation functions for the others
- Filter: Allows to specify filters and complex mappings like substring(col1,4,8)
- Join: All kinds of joins (Note: this is not the existing calcengine join. With this join node you can join n tables at once, specify different join types for each, specify complex join conditions.)
- Sort: To order data
- Union: A Union-all operation
- Procedure: In case you want to invoke a procedure with all transformed data being passed into it as table type parameter
Palette “Data Provisioning”
This contains all Data Integration specific transforms.
- Date Generation: A source that generates a range of date values, one per row
- Row Generation: A source that generates a range of integer values, one per row
- Case Node: To route the data through different paths controlled by conditions, e.g. All region=’US’ data should go into FilterUS, all region=’EMEA’ data into the transform FilterEMEA and all other rows into FilterOther.
- Pivot: To take n input rows and merge them into n columns. All other columns stay as is.
- UnPivot: The reverse operation, to take n columns, e.g. Revenue_January, Revenue_February,… and create n rows, e.g. the 12 values are returned in the column Revenue but using 12 rows.
- Lookup: To lookup in another table for the lookup value and take one. A join might find multiples and hence the output would be all combinations – often not desired, hence the lookup.
- Cleanse: To cleanse all data based on reference information, most often used together with the postal address directory of every single country of the world.
- Geocode: To translate addresses into long/lat information and vice versa. Or Point Of Interest relationships and the such.
- Table Comparison: To identify for all rows the difference of the incoming data compared to the target table and what type of difference there is.
- Map Operation: To influence the way read data is handled, e.g. all data read from table1 should be flagged as delete and hence will be deleted in the target table. Also to specify different mappings for differently flagged rows, e.g. all insert rows should have the value of now() in the INSERT_DATE column but not be changed by updates. And update rows should write the value of now() into the LAST_UPDATE column.
- History Preserving: Deals with all the complexity when loading a target table that should retain the history. So instead of updaing the e.g. customer record, a new customer record version should be added, the valid-from and valid-to columns of the old and new version should be updated,… things like that. Makes creating a Slow Changing Dimension Type 2 a piece of cake. Note: Requires a Table Comparison Transform somewhere upstream, else it does not know what changed and what the current values are.
Summary
As you concur hopefully, there is little reason to use an external ETL tool for loading data into Hana, even if transformations are required. Although I have talked about loading the data into Hana just once, this solution does allow to create delta logic as well. But more important, it can deal with realtime data as well!
Hello Werner Daehn.
Great read.. Indeed, Very helpful... Awesome article on "HANA Smart data access".... This is a brilliant way to go.
Regards-Ashok.
Hi Werner, interesting series on this new (at least to me!) capability, makes a lot of sense.
What about the situations where customers might already be running SLT as their replication tool from ERP systems? Can we use SDI as an "ETL" tool within HANA in order to de-normalize and transform to a star schema, while letting SLT continue to do the replication work?
If yes, can the realtime option be used for these tables, and if so what is the basis for CDC on the HANA tables?
Thanks!
Justin
That's the idea, yes. What is missing at the moment is the Adapter connecting to the SLT system. You can write your own or switch to the Replication Server based one we have today. I am currently in the makings of building an SLT adapter.
Thanks for the reply, maybe I wasn't so clear in my questions. I think you answered the question assuming someone wanted to user SDI against SLT.
My question is more along the lines of continuing to use SLT to replicate the base tables, but then using SDI to perform the ETL WITHIN HANA, ie transformations, lookups from the base tables to a consolidation layer. I saw tons of material on source system data acquisition, but what about the INTERNAL movement and transformation of data within the walls of HANA?
In addition to that question, if you use an "internal" data flowgraph, can realtime execution be used on multiple tables? I am thinking of the use case of combining two sets of transaction data, for example AP and AR, into a single physical fact table and use SDI to do this in real time?
Regards,
Justin
Okay, let me try again....
Basically you have two options.
1. Use SLT to copy the data from SAP to Hana and occasionally invoke a dataflow reading the data and loading the final table.
2. Use SDI as target of SLT to stream the changes from SAP through all transformations right into the final target table.
In the first version the missing piece is that SLT cannot trigger transformations in SLT. All it can do is insert/update/delete statements. And the question would be, why stage the data if there is no benefit?
In the second version you need to connect SDI and SLT somehow - and that is done by the Adapter. Therefore I want to build an Adapter that translates all the Hana requests like "start capturing changes for KNA1 table" to the appropriate SLT FunctionModule calls, to start collecting changes. And SLT then does not insert the change into Hana but sends the changes to the Adapter which streams it further to SDI with all its transformations.
Better?
Definitely helps me to visualize the use case better. Still one outstanding question.
Let's say we want the base tables replicated from the source 1:1, SLT is already doing that just fine.
Now I want to move data within HANA either in batch or real time to one or more target tables. Can SDI do that? If so, with realtime, how does it function if there are multiple input tables to a given dataflow?
Regards,
Justin
Batch and triggered externally is fine.
Realtime is the problem as there is no way to register changes in a Hana table. This hurts us already to some degree, say some application inserts a new record into the customer table and we want to cleanse it asynchronously. Or in your example, another tool (SLT) is inserting data into Hana tables. Right now there is no good way to do that - Work in Progress.
The answer to "how does a realtime dataflow function if there are multiple sources? e.g. three tables are joined" is:
Our goal is that this will be done automatically. Was planned for SP9 but did not make it. Hence you have to simulate what we will be doing, e.g. build three realtime dataflows. One reading the changes of table1, joining the other tables and load the result as upsert into the target. Another reading the changes of table2, joining with the other....etc.
Or let SLT load the tables with a change timestamp plus override deletes to upserts plus a to_be_deleted column as marker. Then the join is a simlpe timestamp based delta.
Or.....
What is the future for SLT having SDI in place? Or will the SLT still be relevant in order to push the data to the SDI adaptor?
Smart Data Integration is a method to control, transform and load data into Hana. SLT is a method to identify changes in SAP. So those two harmonize very well. And the interface would be an "SLT Adapter". On Hana you execute the statement "create remote subscription on KNA1 target task Customer_Data_Transformation_and_Loading_dataflow;". The SLT adapter would get that information that it should start tracking changes on the KNA1, SLT sending changes to the Adapter and the Adapter posting the changes to Hana tables/transformations.
But again, the SLT adapter is missing at the moment. Once its there, the answer will be "Don't care how you collect changes in SAP, either by SLT technology or by Replication server technology or by extractor technology or by .... You pick one adapter, your decision.".
Then it is just a discussion on what are the pros and cons of each technology to get changes from SAP.
Deal?
Are there plans to create an Agent for SAP HANA that will monitor the logs on HANA and replicate records as changes are identified? I can see this as being very useful for making real-time transformations to tables that exists in HANA (sidecar or S4HANA) .
For example, I would like to create a new table KNA1_CUSTOM that is derived from KNA1 in real-time. I would only make simple modifications ltrim(KUNNR,'0'), ucase(NAME1) and ucase(NAME2) for example...
I agree, it is quite an odd siutation at the moment. We can get data from sooo many sources in realtime..... except Hana.
We tried the log reading route but it was the wrong time with all the L2 Delta changes, especially in a multi-node environment.
There might be an adapter that is using triggers the SLT way as an interim solution, though.
Such adapter is quite important, less for what you have in mind, but to sync a Hana cloud instance with an one premise instance.
In your case I would simply add a trigger that loads those two extra fields into the KNA1 table itself or another table. But I assume you considered that anyhow and there are good reasons to not to, e.g. the extra columns should be in another instance.
Using triggers is an option but I am trying to deliver a solution that is easy to maintain and graphical in nature. MetaData is also important for lineage and transformation management. Given that real-time triggers are not an option at the moment, I will likely stick with a near-real-time solution in SAP Data Services. I can easily bulk update to a new table with auto correct load in < 30 seconds to facilitate most of the common transformations.
Perfect.
Hi Werner, Great artical, very interesting. Can you tell me if its possible to use SDI to pull in documents located in a folder into a Hana table for search and/or Text Analysis/Mining processing? Do you know of an example of 'how to' please.
Regards
Keep in mind, you can write any adapter yourself. But in your case, the fileformat adapter can do just that. It contains one remote table per file format - you don't need that - plus a few standard tables to see all files in a directory or to show the contents. The remote table FILECONTENTS is a table with the columns filename, filepath, and a CLOB column with the entire file data in it.
So in your case a query like
insert into target select filename, filepath, content from v_FILECONTENTS where filepath = '/user/documents';
would do the trick.
Please note that in SP9 the file content returns CLOB data, meaning the content it being expected as text. In SP10 this will be a BLOB column instead so you can read images, pdf files as is as well.
Hi Werner,
I've installed Data Provisioning Agent, Imported Data Provisioning Delivery Unit but could not find the Data Provisioning in the Palette.
I've attached the screen shot below for your reference.
I'm not able to identify what I have missed.
Please help me with this.
You are right, the elements shown in the Palette depend on if the EIM Option has been installed on the Hana server.
Deploy the Delivery Unit from SAP HANA Application Lifecycle Management - SAP HANA Enterprise Information Management Adm…
Since you said you have done that, my only other thought is permissions.In particular the role: "sap.hana.xs.ide.roles::EditorDeveloper"
Thank you for the reply.
I granted the above mentioned role to the SYSTEM user but it didn't help.
Is there any other workaround for this?
Also, can you please tell me what a technical user is while creating remote connection to MS SQL Server.
I have added the jdbc files and in Data Provisioning Configuration it is fine but in HANA Studio it throws an error.
If possible, please suggest me any document for creating remote sources in data provisioning.
I have forwarded your request to others, maybe they have an idea....
This error is because the DP Server, which is a process in HANA, can't communicate with the DP Agent. From the HANA Server, can you ping the server where the DP Agent is installed using the host name that you provided when creating the DP Agent? The host name can be found by executing select * from agents
Hi Richard,
You are absolutely right. After adding the host name of the DP Agent in the host file of the HANA Server, it is working fine.
Thank you.
Can you try right-clicking where it says Palette and select refresh?
Also, what version of HANA Studio are you using?
Hi Richard,
Even when I refresh, I cannot find the Data Provisioning in the Palette.
About the version of HANA Studio, I have 2.0.7 version.
Please let me know what is the issue behind.
When you created the .hdbflowgraph did you chose Flowgraph for activation as Stored Procedure or Flowgraph for activation as Task Plan. If you selected the first then this is why you're not seeing the Data Provisioning palette.
Yes, I selected the first option before and when I selected the second option Data Provisioning is visible in the palette.
Thank you.
But why is not visible for Flowgraph for activation as Stored Procedure, is it mandatory to select Flowgraph for activation as Task Plan to use the Data Provisioning transforms?
Thanks Richard, that would have never crossed my mind. Now that you mentioned it, it is obvious of course. If the runtime object is a stored procedure, none of the CalcEngine transforms can be used. And these transforms are all Data Provisioning transforms and a few more.
Stored Procedure as runtime is present as backup for Predictive-only dataflows. We should disable that soon.
Thanks Werner and Richard.
One of the first things I am struggling with is a way to project columns or perform functions that are typically completed in a data services query transform. Is there an easy way to graphically map or project a single column to two columns?
Example:
KNA1.KUNNR -> Mapped to KUNNR and KUNNR_DISPLAY. Let's assume that the second column contains a transformation like ltrim(KUNNR,'0'). I was able to do this using a JOIN transform (join KNA1 to KNA1_2) and FILTER transform. However, this is really tedious for such a simple task.
The DataServices Query transform got turned into multiple artifacts in the hdbtaskflow, in your case the Filter transform and the Mapping tab of each transform.
Can you please try the following in Hana Studio:
1. Add a filter transform and connect it to the source. By default it will have all 1:1 mappings.
2. Now add a new output column in the filter (Output Tab) named "KUNNR_DISPLAY".
3. In the mapping tab drag the source column KUNNR into the KUNNR_DISPLAY column. Alternatively you can edit the mapping formula to KUNNR, just like the KUNNR output column has.
Thanks. The trick was changing the DDL in the FILTER OUTPUT window. I then mapped and changed the expression in the mapping. Thanks again for the quick tips.
Does anyone know if there is a complete reference guide to outline all of the supported functions and complete syntax that can be utilized in the Expression Editor. This is the Expression Editor that is found in the FILTER node's output mapping properties. I have found that it supports more items than those listed in the GUI. For example it supports CASE WHEN THEN ELSE END but that is not listed anywhere. Can we assume that all SAP HANA SQL functions are supported?
Yes you can assume all HANA SQL functions are supported.
We are trying to do data transformation using SDI flowgraphs and we are looking for some inputs to handle some use cases in the right way.
1. How to handle conditional transformations? If a column in Source table has value 'a' then the value in target table should be 1 and if the value is 'b' the value in target should be 2, etc. Basically we are not sure on how to handle where we have to have transformations based in if-else or case statements.
2. How to handle record level error logging? For example, if I have 10 records and 2 of them are failing due to some constraint violation in the target table, I would like the 8 records to go to target tables and the 2 records to be logged in an error table with the errors instead of stopping the entire task with no record moving to target table
4. What is the best way to debug issues in flowgraphs? I couldn't find any material in this regard.
Appreciate your response.
ad 1) I would have thought a case expression does the trick? Am I wrong?
Expressions - SAP HANA SQL and System Views Reference - SAP Library
ad 2) That would be the Lookup transform connected to a Case transform. Or the output of the Lookup connected to two Filter transforms, the one filters all rows that found a matching primary key in the lookup table, the other takes the null rows only.
ad 4) Currently the best way would be to add template tables for a short period of time to store the output of certain transforms additionally in a table.
Currently we are enabling the usage of calcviews on top of any node of the flowgraph, that is a beautiful way of supporting datapreview throught all transformations during development.
As Werner stated, you could utilize the case transformation to split records apart and then apply the specific changes you'd like.
The other option would be to use a CASE statement within a projection.
CASE WHEN "INPUT"."myColumn" = 'a' THEN 1 ELSE CASE WHEN "INPUT"."myColumn" = 'B' THEN 2 ELSE 3 END END
This is simply SQL syntax and the CASE can be nested to support multiple different conditions. It does get a bit verbose, but it sounds like it should work for your use case.
Thanks. For (1) case statement works fine for the conditional transformation.
But for (2), the suggested solution will work for all known exceptions like missing primary keys, missing nullable properties or length, but instead of adding these validations for every transformation, is there a way to make it generic and move the errors to an unconstrained table error records? That way, I don't have to make my adapter very complex, but still can handle most error conditions correctly.
Correct. The Validation transform, which is a one stop place to enter all such rules, is missing. We had to wait until the Hana Rule Framework gets more concrete. But I have used your post to raise the priority yet by another notch.
Hi Werner,
I've couple of more questions regarding the flow-graphs.
1. After creating the flow-graph, if I need to add or delete a column from the source table, I've drop and node and add it again and it will recreate all the mappings. Problem is all the expressions that I've added already are lost. I tried refresh and recreate options and it doesn't seem to help.Is there a way to apply changes without recreating the adapters
2. What is the use of flow-graph templates? Can I create a base template with all my base mappings and later use that as a base to create a flow-graph? In that case what will happen to the main flow-graph if I change the flow-graph template.
Thanks
Chandru.
For (1) I've been trying different options, but couldn't make it work. Can you please share if there is any workaround? Is this something that works in SPS10. Thanks for your help
Hi Werner, This is a fantastic blog post. I have a small question on the capability of SDI, i I will be glad if you could respond me. Can SDI import Data Services dataflow or workflow without any modification?
Thank you!
Sujit
We do have that in mind, but before SDI needs to catch up. For example in SDI there is no Validation transform yet, datastore configurations, the entire nested data model is missing (because Hana does not support an XML datatype). And the situation around workflows is also not clear yet, should BPM workflows be used or SQL Scripts or something else?
But for cases where Data Services is used only to load Hana, there should be a migration path. We have prepared for that, e.g. when you look at the transforms of DS and SDI, they are very similar if not identical. So migration will be possible, relatively easy in fact. But for now it is too early.
Thank you Werner!
Hi Werner - just one more question - I would like to get your idea on when we might except such capability available and what version would it going to be.
Regards
Sujit
Frankly I am not sure. A week ago I would have said "a longer time" but then even our VP started to talk about it. Maybe you are right and it is time we look into it.
I've now completed two projects where I used FlowGraphs. In our solution we used them to avoid the use of the calculation engine and Join Engine in SAP HANA. We updated the FlowGraph tasks every 5 minutes in parallel. (Near Real-Time for most users) The source was SAP ECC tables in both cases. One environment was replicated using SLT and the other was SoH. We were able to create (only) Analytic Views and Attribute Views to facilitate everything from COPA, SD, FI, MM etc... In total we only needed to create about 25 FlowGraphs to accompany the different Attribute Views and Analytic Views. In the end we had a mix of 25 custom tables and base ECC tables.
When compared to SAP HANA Live Calculation views, the new solution was able to outperform them by a huge margin. We also reduce computation and allocator memory on SAP HANA by 500% by staying in the OLAP engine. This SCN posting was a huge help so thanks for the information.
Overall I was quite pleased with the addition of SDI. It gives Business Intelligence developers significantly more options to optimize Analytic Solutions. I will say that the FlowGraph UI is very buggy and it does not accommodate changes post activation without a lot of re-linking transforms. It also has the tendency to crash when the source table has 100+ columns. My hope is that overtime these issues will be worked out. I would also like to see a SELECT DISTINCT option in the JOIN and FILTER transform. I really don't like using an AGGREGATION transform to get a distinct set of columns.
As SDI is a feature part of SPS09, I assume it does not need any additional licence and no cost is involved. Am I right?
It is a Hana option, so it comes at an extra charge. The enterprise license of Hana includes SDI already, however.
Thanks Werner for the reply.
By the way, where can I check and confirm that my HANA includes SDI already or not.
Don't know. The Sales team?
I m trying to use abapadapter and catch SAP Extractors
I have this:
d...Did I miss something?
Checked the source code. So the open() call is fine, it shows the list of contexts: ABAP_CDS, BW, HANA, SAPI.
Next RFC call is on RODPS_REPL_ODP_GET_LIST with the parameters
I_SUBSCRIBER_TYPE= HANA_SDI
I_CONTEXT = SAPI
I_LANGU = ...
I_SEARCH_PATTERN = <empty>
Can you check what the RFC does in case you call it via SAPGUI manually? That should give you a pointer.
Werner,
Is there a concept like a workflow for creating an ETL stream where Flowgraph A must complete before Flowgraph B executes? I.e how is dependency handled. I want to run an initialize variable script before I execute a delta data load flow graph for instance.
Dear Werner Daehn ,
Could you please shed some light on your statement "Although I have talked about loading the data into Hana just once, this solution does allow to create delta logic as well".
We need to enable delta loads logic on existing flow-graphs. Is there any standard functionality of achieving this in flowgraphs ?
Hi Werner,
In bulk load case, partitioning can be used in flowgraph to enable parallelism to shorten the loading time.
If there is join between two tables in the flow, how SDI flowgraph can guarantee that corresponding data portions from joint two tables can meet at the same time? Or there is only one driving table can be enabled for partitioning?
thank you!
Bill
Hi Werner
Thank you for all the post related to config, I was able to successfully do the configuration.
I am not able to do delta load with 0FI_GL_4 extractor using SDI flowgraph. I have posted the question below, appreciate if you can please guide me.
https://answers.sap.com/questions/754079/smart-data-integrationsdi-sap-abap-adapter-delta-e.html
Poonam
Hi guys
I have been tried using the SDI but i am not getting palette area blank . when i tried to refresh it is showing the error.
Can anyone help me to solve this issue.
Thanks