SAP BTP Data & Analytics Showcase – Load data into SAP HANA Cloud using SAP Data Intelligence
This is the 1st of 5 blog posts which are part of the “SAP BTP Data & Analytics Showcase” series of blog posts. We recommend you to look into our overall blog post to gain a better understanding of the end-to-end story and context which involve multiple SAP HANA Database & Analytics Solutions in the cloud.
In this blog post, we will demonstrate how to create pipelines via the graphical modelling feature of SAP Data Intelligence and easily load data into SAP HANA Cloud. Two pipelines are designed and implemented in our case:
- Pipeline 1: Load stations data in CSV files from DI internal data lake into SAP HANA Cloud
Pipeline to load station CSV files
- Pipeline 2: Consume REST APIs provided by open website “Tankerkönig“, retrieve real-time prices for stations which have already been stored in the HANA table (done in pipeline 1), and land prices data in SAP HANA Cloud
Pipeline to consume REST APIs and load real-time prices
Setup connection to SAP HANA Cloud (HDI container)
Before you are going to create the pipelines, the connection to SAP HANA Cloud shall be configured firstly. In our case, we have created a HDI container in the SAP Business Technology Platform. You could follow the steps below to connect the HDI container and later load data into it:
You could create a new connection with the type “HANA_DB” and give related information about your HANA database (in our case HDI container), e.g., Host, Port, User, Password and so on.
Scenario 1: Load stations CSV files into SAP HANA Cloud (HDI container)
The first scenario what we are going to show is how to create a database table and load the stations data in CSV files into this newly created table which is located in the HDI container .
Step 1: Upload “stations.csv” into DI internal data lake
We consider the gasoline stations data in Germany as a kind of master data, which does not change frequently over time. So, we downloaded the “station.csv” file from the open Github repository and uploaded it into DI internal data lake via “Metadata Explorer”. You could also choose other connection types, e.g., Amazon S3 bucket, to upload your CSV files.
Within “Metadata Explorer”, let’s go to “Browse Connections”, select the connection type “DI_DATA_LAKE” and upload CSV files.
After uploading the data, you can explore and understand datasets better using the powerful features like data profiling, preparing and publishing in SAP Data Intelligence.
Step 2: Configure pipeline to create table and load stations data into HDI container
Now, we would like to show you how to create graphs containing a set of operators and facilitate us to create an initial table and load data into HANA Cloud. You could refer this document in SAP Help Portal for more details.
Firstly, we choose the “List Files” and “Read File” operators to connect and read the content of “station.csv”. Opening the configuration of “List Files”, you could configure the connection to DI internal data lake and select the file path pointing to the “station.csv”.
Secondly, we use the “SAP HANA Client” operator to create a new table called “STATIONS_MASTER” and meanwhile write the content of stations.csv into HANA Cloud. For this purpose, the input port “data” of SAP HANA Client operator is connected with the output port “file” of “Read File” operator. A operator to convert file to string is required when you are trying to connect the above described ports. Please make sure that you choose a proper type of the converter operator.
Finally, let’s have a look at the configuration of “SAP HANA Client” operator steps by steps.
- Select the configured connection to the HDI container
- Give the database schema and table name in the format of “YourSchemaName”.”STATIONS_MASTER”
- Define table columns and data types for the station table
- Choose “Create” for table initialization tab, as the station table doesn’t exist at this moment
Now we have completed the configuration for this simple pipeline. Let’s check the new-created table and preview stations data via the “Metadata Explorer“.
Scenario 2: Consume REST APIs and load real-time prices data into SAP HANA Cloud (HDI container)
The second scenario we would like to demonstrate is how to retrieve real-time prices for stations stored in the “STATIONS_MASTER” table and create a table called “PRICES_RESTAPI” in HDI container to save prices data.
Step 1: Read stations data from HANA table
We recommend you to use the operator “HANA Table Consumer” for this step. The table “STATIONS_MASTER” contains a key field called “uuid” which is the parameter required by the endpoint of REST API.
Step 2: Consume REST APIs and pre-process json payloads using Python operator
In our case, we have decided to make use of the python library “Requests” directly to consume REST APIs instead of utilizing the operator “OpenAPI Client”. The reason is that we still need python script to pre-processing the json payloads which are returned by the “API-Method 2” listed in the Tankerkönig website before we save the data into HDI container.
First, we need to adapt the output from “HANA Table Consumer” into a CSV-based format which is accepted by “Python Operator”. To achieve this purpose, the “Flowagent CSV Producer” operator is put in use.
Second, let’s look into the configuration of “Python3 Operator” which is the most important part of this pipeline. Instead of creating our own custom operator, we have chosen the “Python3 Operator” as it reduces our efforts to create own docker file in SAP Data Intelligence and it is easy to use.
- Add import port to your Python3 Operator named “station“, data type is “string“
- Add export port to your Python3 Operator named “price“, data type is “message“
- In the python script, the import and export ports need to be referenced
In addition, we have prepared the following python script (incomplete coding) which could help you understand the structure and the way to call REST APIs easily using Python. The import port “station” and export port “price” are reflected in the program.
import requests import pandas as pd import io from datetime import datetime from decimal import Decimal ''' Retreive latest prices from public rest api of tankerkönig ''' def on_input(data): dataframe = pd.read_csv(io.StringIO(data)) station_uuids = dataframe[['uuid']] priceArray =  for index, row in station_uuids.iterrows(): .... #About open REST API methods and API key to access the open Tankerkönig-Spritpreis-APIs #Please refer to the website https://creativecommons.tankerkoenig.de url = 'https://creativecommons.tankerkoenig.de/json/prices.php?ids=' + stationIds + '&apikey=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx' r = requests.get(url, timeout=50) .... if r.status_code == 200: priceArray.append(val) df = pd.DataFrame(priceArray) df = df[["date", "station_uuid", "diesel", "e5", "e10", "dieselchange", "e5change", "e10change"]] api.send("price", api.Message(body=df.to_csv(index=False, header=True))) api.set_port_callback('station', on_input)
Step 3: Create prices table and write data into HDI container
In this step, we would reuse the “SAP HANA Client” operator to create table structure and write real-time prices data into the newly-created table. We will not repeat the configuration steps of SAP HANA Client operator here. However, we would like to mention that using the python operator we have the opportunity to adapt the data types and store the proper data formats directly in HDI container, which is able to simplify the works of business users (No converters are required in the data modelling part).
The following video will give you a more detailed impression of how these implementation steps look like in SAP Data Intelligence:
Congratulations! You have finished this session. We hope you are able to create your own pipelines in SAP Data Intelligence to connect non-SAP data sources and start your data story with the first step – Data integration.
We highly appreciate for your feedbacks and comments! Enjoy!