How to Use the HANA ML Library Within a Python Operator in SAP Data Intelligence
Andreas Forster describes in his blog how to use the HANA ML library in Jupyter notebooks – both standalone and embedded in SAP Data Intelligence – and how to use the built-in SAP Data Intelligence connections to access data in a HANA database.
I will extend this a little and show how to use these concepts in an even more integrated way: inside a Python operator of SAP Data Intelligence.
Requirements
We consider the following two scenarios:
- Scenario 1 deals with a customer who uses a HANA database and for whom the algorithms provided by APL and PAL suffice. In this case all data remain completely in the HANA database, while training and prediction are orchestrated by SAP Data Intelligence. Models and result tables are stored in the SAP HANA database as well.
- Scenario 2 deals with a customer who also uses a HANA database but who wants to use open-source libraries such as Sklearn to build their models. In this case the data are accessed with the HANA ML library just as in scenario 1, but they are materialized in the SAP Data Intelligence runtime environment. An Sklearn algorithm is applied to them and the resulting model is persisted in the SAP Data Intelligence inherent SAP Data Lake (SDL). For prediction the model is reloaded from SDL and the results are written back to the SAP HANA database by means of SAP Data Intelligence, e.g. via the SAP HANA Client operator.
For both scenarios we will show how to create a template operator which makes it easy to re-use for these kinds of requirements. We implemented our example on a CAL image for Data Hub / Data Intelligence 2.2.3, which corresponds to release SAP Data Intelligence 1908. The CAL image has the advantage that it also includes a HANA database in the landscape.
The Data
We will show a classification example with the well-known Titanic data. First, create a schema TITANIC in your HANA database. Then we create a table for the training data, a table for the test data and a result table for the output of the Sklearn logistic regression in HANA. To do this we use the XS classic environment and create a file titanic.hdbdd in the Web-based Development Workbench.
The contents of the file titanic.hdbdd are as follows:
namespace ml.titanic.data;

@Schema: 'TITANIC'
context titanic {

    @Catalog.tableType: #COLUMN
    entity tab_trainDataRaw {
        key PassengerId: Integer;
        Survived: String(1);
        Pclass: String(1);
        Name: String(256);
        Gender: String(6);
        Age: Integer;
        SibSp: Integer;
        Parch: Integer;
        Ticket: String(64);
        Fare: Decimal(13,2);
        Cabin: String(16);
        Embarked: String(1);
    };

    @Catalog.tableType: #COLUMN
    entity tab_testDataRaw {
        key PassengerId: Integer;
        Pclass: String(1);
        Name: String(256);
        Gender: String(6);
        Age: Integer;
        SibSp: Integer;
        Parch: Integer;
        Ticket: String(64);
        Fare: Decimal(13,2);
        Cabin: String(16);
        Embarked: String(1);
    };

    @Catalog.tableType: #COLUMN
    entity tab_predLogRegSklearn {
        key "PASSENGERID": Integer;
        "PROB2CHURN": Double;
    };
};
Saving the file activates the tables defined in it. Have a look at the tables in the catalog view; note that you might have to adapt the authorizations of your user in order to see the schema TITANIC as well as these tables.
The Titanic dataset can be downloaded as train.csv and test.csv, e.g. from the Kaggle site. After downloading these two files, import them into the HANA tables tab_trainDataRaw and tab_testDataRaw.
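If you prefer to script the upload instead of using the import wizard, the following is a minimal sketch. It assumes a local Python environment with pandas and a hana_ml version that offers dataframe.create_dataframe_from_pandas (newer releases do; hana_ml 1.0.7 may not, in which case use the workbench import). Host, port and credentials are placeholders.

# Optional scripted upload of the Kaggle CSV files (sketch only, not part of the original setup).
import pandas as pd
from hana_ml import dataframe

# Placeholder connection details for your own HANA instance
conn = dataframe.ConnectionContext('myhana.example.com', 30015, 'MYUSER', 'MYPASSWORD')

# Kaggle names the column 'Sex'; our HANA tables call it 'Gender'
train_df = pd.read_csv('train.csv').rename(columns={'Sex': 'Gender'})
test_df = pd.read_csv('test.csv').rename(columns={'Sex': 'Gender'})

# Write the pandas data frames into the catalog tables created by titanic.hdbdd;
# depending on the hana_ml version you may need additional parameters (e.g. force or append).
dataframe.create_dataframe_from_pandas(conn, train_df,
    'ml.titanic.data::titanic.tab_trainDataRaw', schema='TITANIC')
dataframe.create_dataframe_from_pandas(conn, test_df,
    'ml.titanic.data::titanic.tab_testDataRaw', schema='TITANIC')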
Build the Dockerfile
Let us first build a docker image which will later provide the necessary runtime. Go to the Modeler and choose the Repository tab. Right-click the folder dockerfiles, choose “Create Docker File” and name it e.g. hana_ml.
As the next step we need the HANA ML library file in the docker folder. Locate the file “hana_ml-1.0.7.tar.gz” on your local machine and rename it locally to “hana_ml-1.0.7”. Then import it into the docker folder.
Finally rename the imported file “hana_ml-1.0.7” back to “hana_ml-1.0.7.tar.gz”. The reason for this renaming trick is that a file with the extension tar.gz would be uncompressed automatically during the import. Double-click the file Dockerfile in the folder hana_ml and configure it as follows. For the script part use:
FROM $com.sap.python36
COPY hana_ml-1.0.7.tar.gz hana_ml.tar.gz
RUN pip install hana_ml.tar.gz
RUN pip install requests
ENV PYTHONPATH="$PYTHONPATH:/"
Then some tags have to be assigned, where the tag hana_ml is your custom tag. The other tags python36, opensuse and tornado should already be available. If a Python operator is to use the HANA ML library, this can now be controlled by grouping this operator and using the respective tags in the configuration of that group.
Having done this configuration, we save and then build the docker file.
Build the Custom Operator
Let us first build the custom operator for the first scenario. Go to the Modeler, open the Operators tab and press + (Create Operator). Use e.g. HANA_ML_BLOG as technical name and provide a description as display name.
Let us provide one input port to trigger the operator:
In the Tags part we configure hana_ml to tell the operator to use the runtime of the docker image we created in the previous step.
The Configuration part will allow us to use the Connection Manager and connect to a HANA instance attached to SAP Data Intelligence. Click the pencil to enter edit mode and switch to JSON:
Cut and paste the following JSON contents into the editor and save.
{
    "$schema": "http://json-schema.org/draft-06/schema#",
    "$id": "http://sap.com/vflow/HANA_ML_BLOG.configSchema.json",
    "type": "object",
    "properties": {
        "hanaConnection": {
            "title": "HANA Connection",
            "description": "HANA Connection",
            "type": "object",
            "properties": {
                "configurationType": {
                    "title": "Configuration Type",
                    "type": "string",
                    "enum": [
                        " ",
                        "Configuration Manager",
                        "Manual"
                    ]
                },
                "connectionID": {
                    "title": "Connection ID",
                    "type": "string",
                    "format": "com.sap.dh.connection.id",
                    "sap_vflow_valuehelp": {
                        "url": "/app/datahub-app-connection/connections?connectionTypes=HANA_DB",
                        "valuepath": "id",
                        "displayStyle": "autocomplete"
                    },
                    "sap_vflow_constraints": {
                        "ui_visibility": [
                            {
                                "name": "configurationType",
                                "value": "Configuration Manager"
                            }
                        ]
                    }
                },
                "connectionProperties": {
                    "title": "Connection Properties",
                    "$ref": "http://sap.com/vflow/com.sap.dh.connections.hana_db.schema.json",
                    "sap_vflow_constraints": {
                        "ui_visibility": [
                            {
                                "name": "configurationType",
                                "value": "Manual"
                            }
                        ]
                    }
                }
            },
            "required": []
        },
        "codelanguage": {
            "type": "string"
        },
        "script": {
            "type": "string"
        }
    },
    "required": []
}
In the final tab, Script, we provide a script template which can be re-used.
import hana_ml
from hana_ml import dataframe
import numpy as np

def on_input(data):
    conn = hana_ml.dataframe.ConnectionContext(
        api.config.hanaConnection['connectionProperties']['host'],
        api.config.hanaConnection['connectionProperties']['port'],
        api.config.hanaConnection['connectionProperties']['user'],
        api.config.hanaConnection['connectionProperties']['password'],
        encrypt='true',
        sslValidateCertificate='false')

    # insert your specific code / script here ...

api.set_port_callback("trigger", on_input)
What we have now is an operator which can be used out-of-the-box with PAL or APL.
Scenario 1: Use APL Within a Python Operator
Create a pipeline like this
where we use our newly created operator in the middle. This operator needs a connection to a HANA DB, which can easily be configured. Choose “Open Configuration” in the context menu of the operator, which we called “APL Train – Classify” here.
Then click the pencil beside HANA Connection and choose the desired connection.
As the next step we have to create an output port for the operator to tell the Graph Terminator that the computation has finished successfully.
The script of the operator “APL Train – Classify” is adapted as follows between the comments “insert from here … to there”. The code lines at the beginning and the end were already provided when defining the operator and can be re-used.
import hana_ml
from hana_ml import dataframe
import numpy as np
import json
from hana_ml.algorithms.apl.classification import AutoClassifier

def on_input(data):
    conn = hana_ml.dataframe.ConnectionContext(
        api.config.hanaConnection['connectionProperties']['host'],
        api.config.hanaConnection['connectionProperties']['port'],
        api.config.hanaConnection['connectionProperties']['user'],
        api.config.hanaConnection['connectionProperties']['password'],
        encrypt='true',
        sslValidateCertificate='false')

    # ... insert from here ...
    df_train = conn.table('ml.titanic.data::titanic.tab_trainDataRaw', schema='TITANIC')
    cl_model = AutoClassifier(conn_context=conn, variable_auto_selection=True)
    cl_model.fit(df_train, key='PassengerId', label='Survived')

    # Save the model to HANA
    cl_model.save_model(schema_name='TITANIC', table_name='TitanicAPLModelAutoClass_dhpl', if_exists='replace')

    # Provide some quality metrics
    result = {"training": "done"}
    api.send("result", api.Message(result))
    # ... to there ...

api.set_port_callback("trigger", on_input)
Note that at no point are data materialized in the Data Intelligence runtime. The algorithm is executed completely on HANA, and the resulting model is saved to a table in the HANA database.
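If you want to convince yourself that the model really lives in HANA, you can inspect the model table with a hana_ml connection, e.g. from a Jupyter notebook. A minimal sketch, assuming conn is a ConnectionContext as created above and the table name from the save_model call:

# Sanity check (sketch): the APL model is persisted as rows in a HANA table,
# so counting its rows confirms the save worked.
model_tab = conn.table('TitanicAPLModelAutoClass_dhpl', schema='TITANIC')
print('Model table rows:', model_tab.count())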
For the prediction pipeline you can use the same custom operator again.
The script of the operator “APL Predict – Classify” is adapted as follows between the comments “insert from here … to there”. The code lines at the beginning and the end were already provided when defining the operator and can be re-used. For prediction we reload the model from the HANA table TitanicAPLModelAutoClass_dhpl.
import hana_ml
from hana_ml import dataframe
import numpy as np
import json
from hana_ml.algorithms.apl.classification import AutoClassifier

def on_input(data):
    conn = hana_ml.dataframe.ConnectionContext(
        api.config.hanaConnection['connectionProperties']['host'],
        api.config.hanaConnection['connectionProperties']['port'],
        api.config.hanaConnection['connectionProperties']['user'],
        api.config.hanaConnection['connectionProperties']['password'],
        encrypt='true',
        sslValidateCertificate='false')

    # ... insert from here ...
    df_inference = conn.table('ml.titanic.data::titanic.tab_testDataRaw', schema='TITANIC')
    cl_model = AutoClassifier(conn_context=conn)
    cl_model.load_model(schema_name='TITANIC', table_name='TitanicAPLModelAutoClass_dhpl')
    result = cl_model.predict(df_inference)
    cl_model.save_artifact(artifact_df=result, schema_name='TITANIC', table_name='TitanicAPLAutoClassPredict_dhpl', if_exists='replace')

    # tell the system we're done
    result = {"inference": "done"}
    api.send("result", api.Message(result))
    # ... to there ...

api.set_port_callback("trigger", on_input)
The result of the prediction is written back to the table TitanicAPLAutoClassPredict_dhpl.
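To take a quick look at the predictions without leaving Python, here is a small sketch using a hana_ml connection (conn as in the template) and the table name from save_artifact above:

# Sketch: fetch a few prediction rows from the table written by save_artifact
# to verify that the APL prediction pipeline produced output.
pred_tab = conn.table('TitanicAPLAutoClassPredict_dhpl', schema='TITANIC')
print(pred_tab.head(5).collect())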
Scenario 2: Extend the Custom Operator
Let us slightly extend scenario 1 and assume that not PAL or APL but an Sklearn algorithm shall be utilized. We use the same custom operator as in scenario 1, but in this case the trained model is not saved to HANA but to the local SAP Data Lake (SDL). For this reason we have to create an output port of type blob to export the trained model.
First of all go to the Data Intelligence Modeler –> Graphs and click the + (New Graph). Look for the custom operator we created at the beginning and drag it into the empty canvas.
Then let us add a Constant Generator, a Model Producer and a Graph Terminator in the following way:
For the custom operator “Sklearn Train – Classify” we have to configure the connection to HANA, which works out-of-the-box thanks to the configuration we provided when creating our custom operator.
The script for the operator “Sklearn Train – Classify”, briefly explained:
- The first part, reading the connection properties, is part of the template operator.
- In the part “Get the data and prepare them” we connect to HANA via the configured Data Intelligence connection and materialize the data in the Data Intelligence runtime. This is the difference to scenario 1, where we used APL and all data remained on HANA. The remaining lines in this part of the script deal with pure data preparation for later use with the Sklearn LogisticRegressionCV algorithm.
- Starting with “Train the model and generate the blob” we apply the Sklearn algorithm LogisticRegressionCV and train the model mod_LogReg. This model is brought into the required blob format model_blob with pickle and finally passed to the output port modelBlob.
import hana_ml
from hana_ml import dataframe
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
import pickle

def on_input(trigger):
    conn = hana_ml.dataframe.ConnectionContext(
        api.config.hanaConnection['connectionProperties']['host'],
        api.config.hanaConnection['connectionProperties']['port'],
        api.config.hanaConnection['connectionProperties']['user'],
        api.config.hanaConnection['connectionProperties']['password'],
        encrypt='true',
        sslValidateCertificate='false')

    # ... insert from here ...
    # Get the data and prepare them
    df_train = conn.table('ml.titanic.data::titanic.tab_trainDataRaw', schema='TITANIC')
    df_trainFeat = df_train.drop(['PassengerId', 'Name', 'Ticket', 'Cabin', 'Survived'])
    df_trainTgt = df_train.select(['Survived'])
    df_trainFeat = df_trainFeat.collect()
    df_trainTgt = df_trainTgt.collect()
    df_trainFeat.Gender = df_trainFeat.Gender.map({'male': 0, 'female': 1})
    df_trainFeat.Embarked = df_trainFeat.Embarked.map({'S': 0, 'C': 1, 'Q': 2})
    df_trainFeat.fillna(0, inplace=True)
    X_trainFeat = np.array(df_trainFeat)
    X_trainTgt = np.array(df_trainTgt)

    # Train the model and generate the blob
    mod_LogReg = LogisticRegressionCV(cv=10, penalty='l2', scoring='roc_auc', solver='lbfgs',
                                      max_iter=3000, multi_class='auto').fit(X_trainFeat, X_trainTgt.ravel())
    model_blob = pickle.dumps(mod_LogReg)
    api.send("modelBlob", model_blob)
    # ... to there ...

api.set_port_callback("trigger", on_input)
One thing is missing, namely saving the model to the SAP Data Intelligence repository (SAP Data Lake, SDL). There are actually two operators which can be used for this:
- The Artifact Producer saves the model blob to the SAP Data Lake (SDL). As a prerequisite for using this operator, the pipeline around it has to be deployed as a machine learning scenario.
- The Model Producer saves the model to the file system of SAP Data Intelligence.
As we developed a standalone pipeline, we will use the Model Producer in our example. In the configuration of the Model Producer we have to tell the system the name of the model, the place in the file system where the model blob shall be stored (parameter blobName), and a blobVersion.
Save the pipeline, execute it and wait until its status turns to completed.
To check whether the model has been saved to the specified place, navigate to SAP Data Intelligence –> System Management –> Files. Go to the folder hierarchy and expand along files –> vflow –> blobs and further according to the blob name com.sap.ml.models.sklearn.blog.LogReg.
It remains to create the prediction pipeline for the second scenario. For this purpose we create a second custom operator which extends the previously created one. Navigate to SAP Data Intelligence –> Modeler and there to the tab Operators. Click the + (Create Operator) and create a new custom operator exactly as we did before, with some differences:
- Name the operator HANA_ML_BLOG_02 and give it a display name, e.g. “HANA ML Operator w/ Model Consumption”.
- Besides the port trigger we create a second input port modelBlob of type blob and an output port jsonResult of type string.
- For the script we provide
import hana_ml
from hana_ml import dataframe
import numpy as np
import pandas as pd
import json
import pickle

def on_input(trigger, modelBlob):
    conn = hana_ml.dataframe.ConnectionContext(
        api.config.hanaConnection['connectionProperties']['host'],
        api.config.hanaConnection['connectionProperties']['port'],
        api.config.hanaConnection['connectionProperties']['user'],
        api.config.hanaConnection['connectionProperties']['password'],
        encrypt='true',
        sslValidateCertificate='false')

    # api.send("jsonResult", json.dumps(result.tolist()))

api.set_port_callback(["trigger", "modelBlob"], on_input)
The remaining tabs Tags and Configuration are configured like those of the previously defined operator.
Let us now build the prediction pipeline. Navigate to SAP Data Intelligence –> Modeler, choose the tab Graphs and click + (Create Graph). Look for our second custom operator and drag it onto the canvas. Rename it “Sklearn Predict – Classify” and configure the connection to access the HANA database; this is a benefit of the configuration we defined in the tab Configuration of our custom operator.
Then let us add a Constant Generator, a Model Consumer, a toMessage Converter, a HANA Client and a Graph Terminator in the following way:
Let us configure the distinct operators:
- Model Consumer: The parameter blobName has to point to the model blob which has been created during the training (com.sap.ml.models.sklearn.blog.LogReg)
- Custom operator “Sklearn Predict – Classify”: Copy and paste the code below into the script canvas of the custom operator. Let us briefly explain what happens here. The first part, which is provided by the template, again takes care of the connection. The part below “Prepare Data” connects to HANA and retrieves the inference data from there; moreover, some data types are adapted into the format required by the algorithm later on. After “Get the model and do the prediction” the model blob, which the Model Consumer has read from the local file system of SAP Data Intelligence, is deserialized via pickle. From the algorithm we get back a table of probabilities, which is combined with the passenger IDs into the result object. The result object is finally dumped into JSON format and sent as a string to the output port jsonResult.
import hana_ml
from hana_ml import dataframe
from sklearn.linear_model import LogisticRegressionCV
import numpy as np
import pandas as pd
import json
import pickle

def on_input(trigger, modelBlob):
    conn = hana_ml.dataframe.ConnectionContext(
        api.config.hanaConnection['connectionProperties']['host'],
        api.config.hanaConnection['connectionProperties']['port'],
        api.config.hanaConnection['connectionProperties']['user'],
        api.config.hanaConnection['connectionProperties']['password'],
        encrypt='true',
        sslValidateCertificate='false')

    # .... insert from here ...
    # Prepare Data
    df_inf = conn.table('ml.titanic.data::titanic.tab_testDataRaw', schema='TITANIC')
    df_inf = df_inf.drop(['Name', 'Ticket', 'Cabin'])
    df_inf = df_inf.collect()
    df_inf.Gender = df_inf.Gender.map({'male': 0, 'female': 1})
    df_inf.Embarked = df_inf.Embarked.map({'S': 0, 'C': 1, 'Q': 2})
    df_inf.fillna(0, inplace=True)
    X_inf = np.array(df_inf)
    X_infFeat = X_inf[:, 1:8]

    # Get the model and do the prediction
    mod_LogReg = pickle.loads(modelBlob)
    probs = mod_LogReg.predict_proba(X_infFeat)
    result = np.c_[X_inf[:, 0], probs[:, 1]]
    api.send("jsonResult", json.dumps(result.tolist()))
    # ... to there ...

api.set_port_callback(["trigger", "modelBlob"], on_input)
- toMessage Converter: The JSON string containing the probabilities is converted into a message with the incoming JSON string as the body (a sketch of the expected payload follows after this list).
- SAP HANA Client: Choose the connection to the HANA database and then choose the table “TITANIC”.“ml.titanic.data::titanic.tab_predLogRegSklearn” to persist the result, namely the table with the probabilities. The input format shall be JSON and the insert mode INSERT. Define the initial statement as truncate table “TITANIC”.“ml.titanic.data::titanic.tab_predLogRegSklearn” in order to empty the table before loading new data.
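For clarity, here is a minimal sketch of the payload travelling between these operators: the custom operator emits a JSON array of [PassengerId, probability] pairs, the toMessage Converter wraps this string into a message body, and the SAP HANA Client maps each pair onto the two columns of tab_predLogRegSklearn. The sample values below are invented for illustration.

import json

# Hypothetical example of the jsonResult payload: each inner list corresponds to one row
# [PASSENGERID, PROB2CHURN] of the target table; the values are made up.
payload = json.dumps([[892, 0.11], [893, 0.47], [894, 0.09]])
print(payload)  # [[892, 0.11], [893, 0.47], [894, 0.09]]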
Save the pipeline and run it.
After the pipeline has completed successfully, one can check the result in HANA.
Summary
We implemented two examples where the HANA ML library is used within a Python 3 operator of SAP Data Intelligence. We created custom operators which can make use of the SAP Data Intelligence connection properties to a HANA database. The first example shows a nice case where all data remain in the HANA database and SAP Data Intelligence purely orchestrates the ML process; models as well as results are stored in HANA. The second example could be a use case for a customer whose data reside in HANA but for whom PAL or APL do not suffice, so external Python ML libraries such as Sklearn are utilized. In this case we still use the HANA ML library, but the data are imported into the SAP Data Intelligence runtime where they are materialized. This materialization is the price to pay if one does not use PAL or APL, or if one wants to merge the data with data from another source.