Skip to Content
Technical Articles

Using new Structured Data Operators for Transformations in SAP Data Intelligence

SAP Data Intelligence 3.0 which is the next evolution of SAP Data Hub provides new data transformation capabilities in the Modeler which I will describe how to use in this blog post.

There is a new set of pipeline operators that can be used to perform data transformations with a similar experience that you have with SAP HANA Flowgraphs. The transformations include for example projection, filter, column operation and join on different structured files such as CSV-, Parquet- and ORC-files as well as database tables and views. The new operators are all listed under the Structured Data Operators section in the Modeler:

The structured data operators use the Flowagent subengine for execution. From look and feel they are very similar to the existing operators in the Connectivity (via Flowagent) section:

The main difference between both operator categories is that only the new Structured Data Operators can be used together with the new Data Transform operator that is listed in the same category.

We plan to consolidate both operator categories in upcoming releases and with that we will also extend the source/target options within the Data Transform to everything that is available by FlowAgent.

For those of you who are wondering what the FlowAgent subengine is: FlowAgent is an engine that runs a light-weight containerized version of SAP Data Services with each pipeline containing FlowAgent operators. It leverages SAP Data Services adapters to provide the same capabilities on structured data through the FlowAgent operators in data pipelines.

So much for the background, let’s now start using these new structured data operators.

Example Scenario:

In the following tutorial, we will combine information from unstructured data stored in Amazon S3 with enterprise data stored in SAP HANA using the new Data Transform operator. The scenario is depicted at a high-level in the following diagram:

 

Product Review Data in Amazon S3:

In Amazon S3 we have a CSV file

You can download the file from the attachments and upload it to any supported storage of your choice. Please refer to the documentation of the Structured File Consumer operator to find out which storage types are supported.

Product Master Data in SAP HANA:

Next to this, we have connected an SAP HANA instance that has stored product master data in a table called PRODUCT. This table includes product details such as the product name, the product category and the product price as it comes from an ERP system. Hence, we treat this data as enterprise data.

You can download the file and import it into your SAP HANA Cloud instance (if available) to try out the same tutorial by yourself. If you would like to use a different database type, please refer to the documentation of the Table Consumer operator to find out which databases are supported.

Read Product Reviews from Amazon S3

Now let’s start using this data in a data pipeline. We first create a new Graph in the Modeler application and add a Structured File Consumer to read the product reviews from Amazon S3:

Open the Configuration of the Structured File Consumer operator and point it to the connected storage location where the file

Then choose the Connection ID (S3TEST in this case):

Open the Source File Browser, select the file

At this point, the operator extracts the metadata from the CSV file without running the pipeline. This metadata is required later once we add and connect the Data Transform operator.

Read Product Master Data from SAP HANA

Now, we well read the Product table from SAP HANA. For this we add a Table Consumer operator to the same graph:

Open the Configuration of the Table Consumer operator and choose the Database type (HANA in this case):

Then choose the Connection ID which is HANA_CLOUD in my example:

Open the Table Browser, select the table

In my example, I have connected an SAP HANA Cloud instance that stores the required product master data. If you have access to SAP HANA Cloud, you can import the file HANA_export.tar.gz via the SAP HANA Database Explorer as described below and then use it for this exercise.

Import Product Master Data into SAP HANA Cloud

Open the SAP HANA Data Explorer, right click on the database icon and then click on Import Catalog Objects:

Browse for the HANA_export.tar.gz on your local computer and then click on Import:

This will import the PRODUCT table into the database schema DEMO.

Join Product Reviews with Product Master Data using Data Transform

Now that we have both datasets available, let’s continue by adding a Data Transform operator to the same graph (please make sure to choose the Data Transform from the Structured Data Operators category):

Drag and drop the output ports of the Structured File Consumer and Table Consumer on the Data Transform operator:

Double-click on the Data Transform operator to open the graphical transformation editor. As you have connected two source operators, you should now see two unconnected input nodes input1 and input2 in the editor:

Add a Join data operation to the editor and connect it with the input1 and input2 nodes:

Double-click on the Join node to define the join conditions. Drag-and-drop the column PRODUCT_ID from the one table to the column PRODUCTID of the other table and set the Join type to Left Outer. Please also make sure that Join_Input1 is the left table:

Remark: To reach maximum performance, it is necessary to choose different settings for the Cache and Rank properties depending on the source type and the cardinality. Please refer to the documentation of the Data Transform operator to find the right settings for your scenario. In our example, the data volumes are rather small, so we can neglect these settings for simplicity.

Next, click on the tab Columns and add the following columns to the output:

Click on the “<” icon in the upper left corner to get back to the transform editor.

Add a Projection node to the graph and connect it with the output of the Join node:

Double click on the Projection Node and add all columns to the output by clicking on the selected icon below:

Click on the tab Filter and then use the SQL helper to define a filter on the column COUNTRY to only include reviews with country code DE in the output:

Remark: Although optimizations are performed by the underlying engine during runtime, it may be better to add the projection with the filter before the join to really make sure that the filter is pushed down. However, this example is just to demonstrate the functionality and not to reach maximum performance.

Click on the “<” icon to get back to the data transform graph.

Right-click on the output port of the Projection node and then click on Create Data Target:

This should create an output port in the transformation editor that you can later use within the first level graph:

With that we have finished defining the transformation logic. Let’s now switch back to the first level of the graph by clicking on the “<” icon and then continue with the last step.

Write the enriched data as Parquet to Amazon S3

In the last step, we will write the results back to Amazon S3 using a different file format, which is Parquet. For this, we add a Structured File Producer operator to the same graph an connect it with the output port of the Data Transform operator:

Open the Configuration of the Structured File Producer and choose a Storage Type (S3 in this case):

Choose an existing Connection ID (S3TEST in this example):

Provide a meaningful target file name, e.g. enriched_product_reviews.parquet and choose PARQUET as Format:

Add a Graph Terminator operator to the graph and connect it with the Structured File Producer. The Graph Terminator will stop the execution of the graph automatically once the Parquet file is created:

That’s it, let’s now Save the graph and click on Run.

Wait until the graph switches into Running and Completed state:

Inspect the results:

After the graph execution has completed, you can use the Metadata Explorer application to inspect the written Parquet file:

In there you should see, that the product reviews have been enriched with product master data such as the product name, description and price and there should only be product reviews with country code equal to ‘DE’.

That’s actually it, as simple as that :-).

7 Comments
You must be Logged on to comment or reply to a post.
  • Hi Jens,

    That’s a great explanation and helpful indeed!!

    structuredfileproducer1 operator overwrites the target file if already exists by default which is understood. It also creates new files if we use placeholders as postfix.. example *_<counter>.csv which is also working.

    Question: How does it append the data in the same target file ? as the “Mode” is missing in the structuredfileproducer1 operator. My target is SDL but not GCS, S3 or WASB.. So wondering if Mode option is missing and we cannot specify any option, will this overwrite the file or append the file ??

    BR, Mahesh

    • Hi Mahesh,

      the mode-option is only exposed for those storage types that support a physical append such as ADL or HDFS. Object stores such as S3 and SDL do not support the append operation on storage level. To support it on File Producer level would mean that first the file would have to be downloaded, appended locally and then uploaded again which would be very expensive and slow for larger files. We are working of providing an option that new data is written into part-files which would result in an append-style on folder-level, but files are kept as immutable.

      Best regards
      Jens

  • Hi,

     

    many thanks for the blog, it helped a lot with the new operators. I tried with a different Data Set and need a new column to be created.

    I see functions like case() but no help can be found on how to use this. Tried some variations but no success.

    Could you kindly give some hint where to find the documentation with examples ?

    Regards

    Marcus