Skip to Content
Technical Articles

SAP Data Hub/Data Intelligence: Experiences and Learnings in Embedding Python-based ML Algorithms into a Pipeline

Hello Data Management & Data Hub Community,

in this blog entry I would like to describe some of the experiences made throughout the implementation of a Data Hub/Data Intelligence (DH/DI) use case. Amongst others, I’d like to walk you through some of the implementation steps. The customer remains anonymous, also the use case described is slightly altered.

The following major aspects will be covered in the subsequent paragraphs:

  1. Use Case
  2. System Landscape
  3. Technical Implementation
  4. SAC Story
  5. Lessons Learned
  6. Troubleshooting Data Conversion

What you can expect are some first-hand experiences made in applying SAP Data Hub’s pipeline engine and modeler. Moreover you might be able to anticipate some of the barriers that we had to overcome and speed-up your implementation project.

Having read this blog entry some of you might ask why we haven’t built the scenario in Data Intelligence, as it would fit quite nicely into the feature scope of it. the reason is simple: SAP Data Intelligence was not yet available at that time and we’ve already planned and set the scope of the use case, including the technical environment.

1. Use Case

Status Quo

The following describes the current state of implementation and requirements of the customer. Below bullet points mainly focus on a prediction algorithm, which has been built in python.

  • RAP = Resource Allocation Prediction: The aim is to optimize marketing spends and efficiently allocate between multiple dimensions: brands, countries and others…
  • To yield optimal results, a history of 7+ years of sales + marketing spend data is used
  • Various external and internal factors need to be additionally valueted (market price, number of competitors etc.)
  • The python prediction algorithm is two-folded: First step is to predict sales via regression modeling. Second step deals with optimal spend allocations.
  • Current implementation/runtime environment: Python based predictions in dedicated Python server installation
  • Historical data are obtained in .CSV format and are provided via local file systém reference

Target State

  • Integrate Python algorithm into SAP Data Hub pipeline
  • Feed pipeline with live data, stored in HANA DB table
  • Python algorithm inside a custom operator with train & run in one go
  • Output predictions into HANA DB table
  • Run prediction in a controlled manner: Trigger via public API

2. System Landscape

We used a SAP Cloud Appliance Library image as our temporary environment. Amongst others, it comprises the following software/tools:

  • Data Hub/Intelligence 2.6
  • WebIDE for HANA 2.0 (incl. DWF)
  • XS Client API
  • Analytics Cloud
  • Enterprise Architect Designer
  • HANA Cockpit
  • ADP
  • Hadoop
  • S3 simulator
  • Kubernetes Dashboard
  • Workbench Editor (XSC)

3. Technical Implementation

The following illustration depicts implementation steps and the respective tool used to realize it.

 

High-Level View on Pipeline

1. Data Modeling & Data Load

HANA Input & Output Table (INPUT_DATA / OUTPUT_DATA) – Some “cosmetics” had to be applied:

In order to allow for better matching and consistency of column/attribute names, all column names were set to uppercase. Moreover, SQL data types where adjusted according to the HANA Client Operator’s supported types.

 

2. Refactoring Python Code “RAP” Algorithm

The need to adjust the existing python code was clear from the beginning. Some minor adjustments had to be made in order to make the piece of python code “fit” for the pipeline.

The following describes some refactoring activities and moreover some additional code which was  introduced in order to cater for:

  • Incoming data conversion/preparation
  • Output data conversion
  • Type conversions
  • Proper sequencing of columns/attributes
  • Within given python code, all attribute names were by default lowercase and of non-technical (or DB related…: lowercase, spaces etc.) nature. With some refactoring and manual edit of the code, all attribute names were matched against the defined HANA Source table column names (all uppercase and without spaces).
  • Initial code relied on local data and comprised local OS path references. These references needed to be resolved. Formerly excel based input was replaced by an incoming JSON string-based dataset, provided by preceding operators in the pipeline (described in detail in the following sections).
Data_FE=pd.read_json(io.StringIO(data))
  • Column Sequence of incoming data had to be rearranged. This is due to the output provided by the HANA Client operator that fetches input data. Column order was mixed up, therefore this additional aspect became relevant.
columnSeq = ['ZONE_C','BRAND','DATE_C','NET_SALES','NET_SALES_LAGS_MEAN_1','NET_SALES_LAGS_MEAN_2','NET_SALES_LAGS_MEAN_3','NET_SALES_SQRT','NET_SALES_LAGS_MEAN_1_SQRT','NET_SALES_LAGS_MEAN_2_SQRT','NET_SALES_LAGS_MEAN_3_SQRT','TOTAL_EXPENSES','TOTAL_EXPENSES_LAGS_MEAN_1','TOTAL_EXPENSES_LAGS_MEAN_2','TOTAL_EXPENSES_LAGS_MEAN_3','TOTAL_EXPENSES_SQRT','TOTAL_EXPENSES_LAGS_MEAN_1_SQRT','TOTAL_EXPENSES_LAGS_MEAN_2_SQRT','TOTAL_EXPENSES_LAGS_MEAN_3_SQRT','NUM_WORKING_DAYS','TIME_F_LAUNCH','BRAND_GROWTH','BRAND_PRICE','MONTH_1','MONTH_2','MONTH_3','MONTH_4','MONTH_5','MONTH_6','MONTH_7','MONTH_8','MONTH_9','MONTH_10','MONTH_11','MONTH_12','QUARTER_1','QUARTER_2','QUARTER_3','QUARTER_4','VOLUME','VOLUME_LAGS_MEAN_1','VOLUME_LAGS_MEAN_2','VOLUME_LAGS_MEAN_3','TYPE']
Data_FE = Data_FE.reindex(columns=columnSeq)
  • Some type conversions of incoming data were required. Those conversions were identified during debugging and making use of pandas Dtypes function as well as the pipeline built-in api.logger.info() function.
Data_FE['VOLUME']=Data_FE['VOLUME'].astype('float64')
Data_FE['VOLUME_LAGS_MEAN_1']=Data_FE['VOLUME_LAGS_MEAN_1'].astype('float64')
Data_FE['VOLUME_LAGS_MEAN_2']=Data_FE['VOLUME_LAGS_MEAN_2'].astype('float64')
Data_FE['VOLUME_LAGS_MEAN_3']=Data_FE['VOLUME_LAGS_MEAN_3'].astype('float64')
Data_FE["DATE_C"]=pd.to_datetime(Data_FE["DATE_C"])
  • Forecast output data structure comprises lowercase attribute names that had to be converted to uppercase
rename(columns={'Date': 'DATE_C', 'iteration': 'ITERATION', 'Y_hat': 'Y_HAT', 'null': 'ROW_ID'}, inplace=True)
  • JSON conversion of forecast output DataFrame. Amongst others, the JSON orient had to be changed from index (default) to records. The HANA Client processing the output data of the algorithm is not capable of parsing an index or table based JSON orient, generated by pandas TO_JSON function.

In addition, the date format has to be set to ‘ISO’, as the default one which seconds/milliseconds based can’t be parsed as a date value by the HANA Client operator.

outputJSON = Forecast_board_master.to_json(orient='records', date_format='iso')
jsonOut = json.loads(outputJSON)
api.send("ForecastBoard", json.dumps(jsonOut))

 

3. Docker Image

The custom algorithm relies on various python libraries which are not part of any standard docker images available in Data Hub/Data Intelligence. Therefore, an own docker file definition was introduced.

Important to know is the tag “nvsml”, which is attached to the docker file. This makes the image uniquely identifiable and usable within a pipeline (within a group).

 

4. Pipeline

The following outlines some of the implementation details of each operator used in the DH pipeline. To start of execution of the pipeline there was no schedule or API required in our scope. However this can be easily achieved using native Data Hub options.

The subsequent screenshot illustrates main building blocks within the pipeline. All involved operators except in “4. Python3 Operator” are standard, without any coding involved.

  1. Constant Generator: SQL Constant Generation
  2. HANA Client: Fetch Input Data from HANA
  3. Multiplexer: Multiply Output Data
    • Terminal: Debug/Verify HANA Table Input Data
    • FormatConverter: Format HANA Input Data to JSON
    • ToString Converter: Convert JSON Data to String
  4. Python3 Operator: RAP Algorithm & Data Conversions (Custom Operator)
  5. HANA Client: Insert Forecast Data into Output Table in HANA

Operator 1: Constant Generator (Create SQL Statement) for HANA Client

This operator creates an output string, describing the input SQL statement for the next operator in the flow. It’s a plain SELECT on the input table:

SELECT ZONE_C, BRAND, DATE_C, NET_SALES, NET_SALES_LAGS_MEAN_1, NET_SALES_LAGS_MEAN_2, NET_SALES_LAGS_MEAN_3, NET_SALES_SQRT, NET_SALES_LAGS_MEAN_1_SQRT, NET_SALES_LAGS_MEAN_2_SQRT, NET_SALES_LAGS_MEAN_3_SQRT, TOTAL_EXPENSES, TOTAL_EXPENSES_LAGS_MEAN_1, TOTAL_EXPENSES_LAGS_MEAN_2, TOTAL_EXPENSES_LAGS_MEAN_3, TOTAL_EXPENSES_SQRT, TOTAL_EXPENSES_LAGS_MEAN_1_SQRT, TOTAL_EXPENSES_LAGS_MEAN_2_SQRT, TOTAL_EXPENSES_LAGS_MEAN_3_SQRT, NUM_WORKING_DAYS, TIME_F_LAUNCH, BRAND_GROWTH, BRAND_PRICE, MONTH_1, MONTH_2, MONTH_3, MONTH_4, MONTH_5, MONTH_6, MONTH_7, MONTH_8, MONTH_9, MONTH_10, MONTH_11, MONTH_12, QUARTER_1, QUARTER_2, QUARTER_3, QUARTER_4, VOLUME, VOLUME_LAGS_MEAN_1, VOLUME_LAGS_MEAN_2, VOLUME_LAGS_MEAN_3, TYPE FROM "TA_01"."INPUT_TABLE"  ORDER BY ZONE_C, BRAND, DATE_C ASC;

 

Operator 2: HANA Client – Input Data

This HANA Client operator receives a SQL string constant, generated from the preceding operator, at the sql port (type message). This indicates the HANA Client operator to fire a query towards the HANA DB as specified in the Connection properties (as shown below in the screenshot).

With respect to INSERT behavior, some settings have to be made:

Amongst others, the mode is to be set to INSERT. Even more important is setting the decimal output format to “Floating-point”, otherwise decimal based values are formatted as strings. This will create trouble in downstream operators, relying on the HANA Client’s output data. It is very likely that you get a type-mismatch in one of your downstream operators, when you e.g. parse it into a JSON/DataFrame format and work & calculate with those numbers in your algorithms.

Summary:

In combination with the constant operator, the HANA Client will take the generated constant/SQL statement as an input string and run a query against the defined HANA Database. It fetches the result set and create an output on its port “result” of type message. This allows you to then process this output data in downstream operators of the pipeline.

Data Conversion into JSON Format

Operator 1:

Multiplexing the output of the HANA Client allows for debugging and verifying the output in a terminal. This is purely for implementation purposes and needn’t be in place for a productive pipeline.

Operator 2:

Terminal output, connected to the multiplexer including a string conversion node. The toString converter as the inbound to the terminal takes over the conversion from the HANA Client’s based output (message format) to a string based representation.

Operator 3:

This operator is a FormatConverter. It converts from the message based output of the HANA Client into a JSON based data representation.

 

JSON String Conversion and RAP Algorithm

Firstly, the incoming JSON data is converted into a string value. Data arrives at the input port “inputJSON” of type string.

Secondly, the RAP Algorithm kicks in. This custom Python3 operator has got 1 input and 3 output ports. Input port of type string and output ports:

  1. Terminal/debug port (string)
  2. MAPE port (string)
  3. Forecast por (string)
  4. ResultChampion (string)

RAP Algorithm relies on additional python libraries and therefore a group must be introduced. The group holds the tag “nvsml” to explicitly use the custom defined docker image at runtime, making all required libraries in the environment available.

Load Forecast Results into HANA

A HANA Client operator takes the output of the RAP Algorithm as a JSON string-based input at the port data of type message. The incoming JSON string representation must adhere to what the HANA client expects in terms of format. The JSON representation of the data must be in a record oriented way (pandas: TO_JSON(orient=’records’).

This operator is set to process JSON based input formats. Some details are important to be set:

  • Decimal output format to “Floating-point” à If default is chosen, all decimal will be treated as strings that lead to errors when inserting table data (type mismatch)
  • Target table must be chosen: “OUTPUT_TABLE”
  • Table initialization to truncate: Allows for loading over and over again, each new run will empty target table

To finish with, write file operator “MAPE Monthly” & “Results Champion Master” are to be mentioned. The following illustrates the WriteFile properties of the “MAPE Monthly” results.

For the sake of comprehensiveness of the RAP algorithm’s output, these results (amongst others, the algorithm’s underlying coefficients) are also persisted. Other than writing those into a HANA table, they are simply dumped into HDFS as a csv file. To get csv based output, pandas function TO_CSV is used (in the Python RAP Custom operator).

Part in the python code:

 

The result files reside in the following HDFS part:

  • /tmp/<filename>.csv

The pipeline is terminated after each node by a graph terminator operator.

 

Data Hub Pipeline – Summary

  • 9/10 operators are standard: Configuration only
  • 1 custom operator for python coding
  • Both data input & output from and to HANA via HANA Client operator
  • Format and type conversions are required
  • Pipeline execution time ~4 minutes (vs. approx. ~10+ minutes local execution)

How to start of this pipeline – some options

  1. Manual Start from Modeler
  2. Scheduler in DH/DI
  3. API: OData/REST (has to be implemented, pipeline to be extended with e.g. API Operator)

4. Presentation Layer: SAP Analytics Cloud

In the following, some details on the work done in the SAC part are described. As this is not the focus of the scenario, this section is kept very short.

In order to make the output table consumable for SAC, a calculation view in HANA needs to be in place.

Expose Result Data as CalculationView

In order to make result data usable for the SAC part, we lay over a Calculation View.

In the web development workbench editor of the target HANA DB, in the respective user package TA_01, a new calculation view was created:

  • CV_OUTPUT_TABLE

This CV describes the underlying table data 1:1 without any transformations or complex logic. It represents a “pass-through” layer, making the data consumable for SAC.

SAC Story

Firstly, new model has been created, using a HANA live connection. You can now start creating your SAC story on top of the algorithm’s output table. As we rely on a live connection, the consumption layer will always have up-to-date information.

Remark:

As an alternative to querying a calculation view in HANA, the pipeline in Data Hub could also file-persist the output into SAC via respective “SAC Push API” operator. This option is suitable for use-cases that require on-demand updates on datasets and no live connection.

5. Lessons Learned

  1. Refactor RAP Python: Start & succeed in local environment first, then move into pipeline…
  2. Check & familiarize with supported data types (especially pipeline operators…)
  3. Ensure input data is in right format and complete in terms of record count
  4. Keep up communication and integrate your Data Scientist who developed the algorithm (ideally you do it yourself 😉 )
  5. Pipeline Built: Start small, in iterations, then grow
  6. Know how to debug pipelines: API Logger, Traces & Vflow log
  7. Most time spent in:
  • „Finding“ proper data formats
  • Conversion of data
  • Testing & Troubleshooting

6. Troubleshooting: Data Conversion

As you might have experienced yourself, data conversion and finding appropriate formats are sometimes time-consuming. This is where most of the time was spent in this use case.

The section mainly deals with conversion “approaches” needed from RAP Algorithm towards the HANA Client –> dumping forecast results into target HANA OUTPUT_TABLE table:

 

1st Challenge

Check the data types of the forecast result DataFrame “Forecast_board_master” in python:

We got the following error: Graph failure: cannot convert JSON number 8879185.620000 to decimal: not an integer:

[2019-11-27 08:07:00.454] [{“DATE_C”: “2012-04-01”, “ZONE_C”: “Canada”, “BRAND”: “BRAND_NAME1”, “Y”: 8879185.62, “Y_HAT”: 9230666.544167815, “ITERATION”: 4}, {“DATE_C”: “2012-05-01”, “ZONE_C”: “Canada”, “BRAND”: “BRAND_NAME1”, “Y”: 9672384.59, “Y_HAT”: 9017699.625200097, “ITERATION”:

Change decimal target columns to NVARCHAR and re-run.

First, get Dtypes of JSON Object:

[2019-11-27 08:43:21.715] 
DATE_C,object
ZONE_C,object
BRAND,object
Y,float32
Y_HAT,float32
ITERATION,int64

SQL Statement used to reshape column types on the HANA side for better troubleshooting:

  • Drop keys
  • Convert all columns to NVARCHAR for the sake of simplicity and to avoid type mismatches
  • Iteratively adapt column data types and re-run pipeline in order to narrow down type related errors/mismatches
DROP TABLE "TA_01"."OUTPUT_TABLE";
CREATE COLUMN TABLE "TA_01"."OUTPUT_TABLE"(
	"ROW_ID" NVARCHAR(100),
	"DATE_C" NVARCHAR(100),
	"ZONE_C" NVARCHAR(100),
	"BRAND" NVARCHAR(100),
	"Y" NVARCHAR(100),
	"Y_HAT" NVARCHAR(100),
	"ITERATION" NVARCHAR(100),
	PRIMARY KEY (
		"ROW_ID"
	) ) UNLOAD PRIORITY 5 AUTO MERGE;

Start typifying (“critical” columns DATE_C & Y, Y_HAT)

DROP TABLE "TA_01"."OUTPUT_TABLE";
CREATE COLUMN TABLE "TA_01"."OUTPUT_TABLE"(
	"ROW_ID" INTEGER CS_INT,
	"DATE_C" DATE CS_DAYDATE,
	"ZONE_C" VARCHAR(20),
	"BRAND" VARCHAR(20),
	"Y" DECIMAL(32, 12) CS_FIXED,
	"Y_HAT" DECIMAL(32, 12) CS_FIXED,
	"ITERATION" INTEGER CS_INT,
	PRIMARY KEY (
		"ROW_ID") ) UNLOAD PRIORITY 5 AUTO MERGE;

As seen above, HANA Table was created with decimal 32,12 type on Y & Y_HAT.

Pipeline log after execution using above format:

Graph failure: cannot convert JSON number 8242452.500000 to decimal: not an integer|

Approach:

  • Change HANA Target Table Column Data Type to FLOAT/DOUBLE
  • Check the supported data types of the HANA Client operator:

 

Final & Working Solution:

  1. Change HANA Data Types for Y & Y_HAT column to DOUBLE
DROP TABLE "TA_01"."OUTPUT_TABLE";
CREATE COLUMN TABLE "TA_01"."OUTPUT_TABLE"(
	"DATE_C" DATE CS_DAYDATE,
	"ZONE_C" VARCHAR(20),
	"BRAND" VARCHAR(20),
	"Y" DOUBLE CS_DOUBLE,
	"Y_HAT" DOUBLE CS_DOUBLE,
	"ITERATION" INTEGER CS_INT
) UNLOAD PRIORITY 5 AUTO MERGE;

 

2nd Challenge:

Conversion of Forecast Pandas DataFrame to HANA Client specific JSON format:

Challenge was in converting the output/forecast pandas data frame to JSON. JSON standard format is of type index. The next operator in the flow (HANA Client) expects a “records” oriented JSON format.

(The following sections are taken from: https://datatofish.com/export-pandas-dataframe-json/ as a reference:(

It describes different JSON types/notations of tabular based data – this is applied in the pandas.to_json() function in the RAP algorithm code to provide the forecast output:

orient=’records’[{"Product":"Desktop Computer","Price":700},{"Product":"Tablet","Price":250},{"Product":"iPhone","Price":800},{"Product":"Laptop","Price":1200}]

orient=’index’{"0":{"Product":"Desktop Computer","Price":700},"1":{"Product":"Tablet","Price":250},"2":{"Product":"iPhone","Price":800},"3":{"Product":"Laptop","Price":1200}}

orient=’values’[["Desktop Computer",700],["Tablet",250],["iPhone",800],["Laptop",1200]]

orient=’table’{"schema": {"fields":[{"name":"index","type":"integer"},{"name":"Product","type":"string"},{"name":"Price","type":"integer"}],"primaryKey":["index"],"pandas_version":"0.20.0"}, "data": [{"index":0,"Product":"Desktop Computer","Price":700},{"index":1,"Product":"Tablet","Price":250},{"index":2,"Product":"iPhone","Price":800},{"index":3,"Product":"Laptop","Price":1200}]}

orient=’columns’ (default){"Product":{"0":"Desktop Computer","1":"Tablet","2":"iPhone","3":"Laptop"},"Price":{"0":700,"1":250,"2":800,"3":1200}}

 

3rd Challenge:

This section is about the source data in INPUT_DATA table. As the use case we have built is a pilot, the underlying amount of input data is very limited and comprised only a round 700 records.

Throughout implementing the pipeline we noticed the input table only comprised around 269 records. This was obviously due to some inconsistent initial data load that for some reason nobody really noticed and has taken care of. It was perceived not to be a problem first, but while testing and running the pipeline, it turned out to be decisive that the underlying data set is consistent in terms of record count.

The reason for this is the python prediction/ML algorithm which relies on aggregated and consolidated records, on a monthly basis. There are also some hard-coded & fixed parts in the python algorithm that indicate the underlying dataset’s completeness (full year scope of consolidated base values on monthly basis).

Its a trivial error, but you have to find & understand it… especially when you’re not the developer of the python algorithm and you don’t really know details about input data etc.

 

This blog entry could (hopefully) give you some valuable insights into the pilot we have built. You might be prepared slightly better for you upcoming initiatives and you can already anticipate some details you might have not considered before.

Best regards & have fun with Data Hub / Data Intelligence implementation projects – Happy to hear your feedback on this

Be the first to leave a comment
You must be Logged on to comment or reply to a post.