Data being one of the most important assets for any enterprise, its exploration and analysis become crucial.

SAP Data Intelligence is a very powerful tool that lets you run this kind of complex processing on your data.

What is SAP Data Intelligence and how does it relate to Data Hub? - link

In this blog you will connect a HANA database as a service with SAP Data Intelligence, explore the data via the Metadata Explorer, and apply a Random Forest Classifier to it.

For this you will need a HANA database as a service running on SAP Cloud Platform (Cloud Foundry) and a running instance of SAP Data Intelligence.

If you are new to this platform, I highly recommend reading the blog by Andreas Forster.

 

So let's get started.

 

Open the SAP Cloud Platform cockpit, navigate to the global account, then to the subaccount, and finally to the space where your HANA instance is running, and open the HANA dashboard.

 


 

Click on Edit and then allow all IP addresses. This makes sure that your SAP Data Intelligence instance can access the HANA instance.

 


 

It's time to log in to your SAP Data Intelligence tenant, navigate to Connection Management and create a connection of type HANA_DB.


 

User, Password - the username and password for logging in to the HANA database

Host, Port - the direct SQL connectivity host and port, which can be found on the HANA DB dashboard from the step above
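
Optionally, before creating the connection in Data Intelligence, you can verify the direct SQL connectivity with a small hdbcli snippet from any Python environment. This is only a sanity-check sketch; the host, port, user and password below are placeholders for your own instance.

from hdbcli import dbapi

# placeholder values - use the direct SQL host/port from the HANA dashboard and your DB user
conn = dbapi.connect(
    address='<direct-sql-host>',
    port=12345,  # replace with the direct SQL port from the dashboard
    user='<db-user>',
    password='<db-password>',
    encrypt='true',
    sslValidateCertificate='false'
)
cursor = conn.cursor()
cursor.execute("SELECT CURRENT_USER FROM DUMMY")  # simple round trip to confirm connectivity
print(cursor.fetchone())
conn.close()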


 

Now we are going to create a Jupyter notebook.

For the analysis, my database table (loaded from a file) looks like this:

User ID | Gender | Age | Salary | Purchased
1       | Male   | 19  | 19000  | 0
2       | Male   | 25  | 24000  | 1
3       | Male   | 36  | 25000  | 0
4       | Female | 37  | 87000  | 1
5       | Female | 29  | 89000  | 0
6       | Female | 27  | 90000  | 1
 

For the analysis I will use only the Age and Salary columns to predict the Purchased column.

Now open a Jupyter notebook from the ML Scenario Manager and install these libraries one by one:

 
pip install scikit-learn
pip install hdbcli
pip install matplotlib
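
To confirm that the packages resolve inside the notebook kernel, a quick check like this can help (a minimal sketch using pkg_resources, which ships with pip/setuptools):

import pkg_resources

# print the installed version of each library used below
for pkg in ("scikit-learn", "hdbcli", "matplotlib"):
    print(pkg, pkg_resources.get_distribution(pkg).version)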

 

Code for the Jupyter notebook (note: if any library is missing, install it using the step above).

2 things to configure:

  1. The HANA connection ID - the id_ argument passed to get_datahub_connection

  2. The table name (Schema.TableName) - the path variable


# Read the connection details maintained in Connection Management
import notebook_hana_connector.notebook_hana_connector
di_connection = notebook_hana_connector.notebook_hana_connector.get_datahub_connection(id_="hana")  # enter the id of your connection

# Open a connection to the HANA database
from hdbcli import dbapi
conn = dbapi.connect(
    address=di_connection["contentData"]['host'],
    port=di_connection["contentData"]['port'],
    user=di_connection["contentData"]['user'],
    password=di_connection["contentData"]["password"],
    encrypt='true',
    sslValidateCertificate='false'
)

# Read the table and split the rows into features (Age, Salary) and label (Purchased)
path = "ML_TEST.PURCHASE"  # enter your table name (Schema.TableName)
sql = 'SELECT * FROM ' + path
cursor = conn.cursor()
cursor.execute(sql)

X = []
y = []
for row in cursor:
    # I am using a dataset with the columns User ID, Gender, Age, Salary, Purchased
    X.append([row[2], row[3]])  # Age, Salary
    y.append(row[4])            # Purchased

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)

# Fitting Random Forest Classification to the Training set
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0)
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Visualising the training set on the decision surface of the classifier
arrx = np.array(X_train)
y_set = np.array(y_train)

from matplotlib.colors import ListedColormap
X1, X2 = np.meshgrid(np.arange(start = arrx[:, 0].min() - 1, stop = arrx[:, 0].max() + 1, step = 0.1),
                     np.arange(start = arrx[:, 1].min() - 1, stop = arrx[:, 1].max() + 1, step = 1000))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(arrx[y_set == j, 0], arrx[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
plt.xlabel('Age')
plt.ylabel('Salary')
plt.legend()
plt.show()

 

You should be able to view the results in a graph.
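
If you also want a single headline number next to the confusion matrix, accuracy can be computed in a follow-on cell (this assumes y_test and y_pred from the notebook code above):

from sklearn.metrics import accuracy_score

# share of correct predictions on the test set
print("Accuracy:", accuracy_score(y_test, y_pred))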


 

(Please refer to the blog on how to create a pipeline and deploy it as well.)

Now let us create a pipeline from the ML Scenario Manager for creating the model.

 

First, let us create a pipeline from the Python Producer template.

(Some changes to the components are needed to get the data from HANA.)


 

  1. Constant Generator - to feed in the SQL query; please see the configuration below. In this case the query is
    SELECT * FROM ML_TEST.PURCHASE


  2. HANA Client (to connect with HANA) - things to note: Connection and Table name; if you scroll down, set ColumnHeader to None

  3. JS Operator - to extract only the body of the message, i.e. the rows:
    $.setPortCallback("input", onInput);

    function isByteArray(data) {
        switch (Object.prototype.toString.call(data)) {
            case "[object Int8Array]":
            case "[object Uint8Array]":
                return true;
            case "[object Array]":
            case "[object GoArray]":
                return data.length > 0 && typeof data[0] === 'number';
        }
        return false;
    }

    function onInput(ctx, s) {
        var msg = {};

        var inbody = s.Body;
        var inattributes = s.Attributes;

        // convert the body into a string if it is bytes
        if (isByteArray(inbody)) {
            inbody = String.fromCharCode.apply(null, inbody);
        }

        msg.Attributes = {};
        msg.Body = inbody;

        // forward only the body (the rows) to the output port
        $.output(msg.Body);
    }


  4. To String Converter - use the inInterface port to send the data from the JS Operator to the Python file


 

Python file for training the model and saving it:
# Example Python script to perform training on input data & generate Metrics & Model Blob
def on_input(data):

    import pandas as pd
    import io
    from io import BytesIO
    import os
    import numpy as np
    import json

    # the incoming message body is a JSON array of rows
    dataset = json.loads(data)

    # split the rows into features (Age, Salary) and label (Purchased)
    X = []
    y = []
    for j in dataset:
        x_temp = []
        x_temp.append(j["AGE"])
        x_temp.append(j["SALARY"])
        y.append(j["PURCHASED"])
        X.append(x_temp)

    # Splitting the dataset into the Training set and Test set
    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)

    # Fitting Random Forest Classification to the Training set
    from sklearn.ensemble import RandomForestClassifier
    classifier = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0)
    classifier.fit(X_train, y_train)

    # Predicting the Test set results
    y_pred = classifier.predict(X_test)

    # Making the Confusion Matrix
    from sklearn.metrics import confusion_matrix
    cm = confusion_matrix(y_test, y_pred)

    # to send metrics to the Submit Metrics operator, create a Python dictionary of key-value pairs
    metrics_dict = {"confusion matrix": str(cm)}

    # send the metrics to the output port - Submit Metrics operator will use this to persist the metrics
    api.send("metrics", api.Message(metrics_dict))

    # create & send the model blob to the output port - Artifact Producer operator will use this to persist the model and create an artifact ID
    import pickle
    model_blob = pickle.dumps(classifier)
    api.send("modelBlob", model_blob)

api.set_port_callback("input", on_input)
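
For reference, here is a hypothetical example of the payload the JS Operator forwards to the Python operator: a JSON array of row objects keyed by column name, which is the shape the script above assumes when it reads j["AGE"], j["SALARY"] and j["PURCHASED"] (the column names are taken from the table shown earlier; use a wiretap to confirm the exact format for your table):

import json

# two sample rows in the format the training script expects (assumed column names)
sample_body = '[{"USER_ID": 1, "GENDER": "Male", "AGE": 19, "SALARY": 19000, "PURCHASED": 0}, {"USER_ID": 4, "GENDER": "Female", "AGE": 37, "SALARY": 87000, "PURCHASED": 1}]'

rows = json.loads(sample_body)
print(rows[0]["AGE"], rows[0]["SALARY"], rows[0]["PURCHASED"])  # 19 19000 0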


 

Wiretaps have been used to check the output; you may skip those blocks.

For running the pipeline, you may need a Dockerfile - blog

Content of the Dockerfile:
FROM python:3.6.4-slim-stretch

RUN pip install tornado==5.0.2
RUN python3.6 -m pip install numpy==1.16.4
RUN python3.6 -m pip install pandas==0.24.0
RUN python3.6 -m pip install scikit-learn

RUN groupadd -g 1972 vflow && useradd -g 1972 -u 1972 -m vflow
USER 1972:1972
WORKDIR /home/vflow
ENV HOME=/home/vflow

 

Now create tags for the Dockerfile (here a custom tag blogFile is created) and tag your Python file with the same tag. Then build the Dockerfile.



 

 

Now we can run the pipeline and store the artifact (please provide a name).


 

Now we have to create another pipeline to expose an API so that the model can be consumed. For this case, use the Python Consumer template.


As done in the step above, tag the Python file and update the script:
import json
import io
import numpy as np
import pickle

# Global vars to keep track of model status
model = None
model_ready = False

# Validate input data is JSON
def is_json(data):
    try:
        json_object = json.loads(data)
    except ValueError as e:
        return False
    return True

# When Model Blob reaches the input port
def on_model(model_blob):
    global model
    global model_ready

    model = pickle.loads(model_blob)
    model_ready = True

# Client POST request received
def on_input(msg):
    error_message = ""
    success = False
    try:
        attr = msg.attributes
        request_id = attr['message.request.id']

        api.logger.info("POST request received from Client - checking if model is ready")
        if model_ready:
            api.logger.info("Model Ready")
            api.logger.info("Received data from client - validating json input")

            user_data = msg.body.decode('utf-8')
            # Received message from client, verify json data is valid
            if is_json(user_data):
                api.logger.info("Received valid json data from client - ready to use")

                # obtain your results
                feed = json.loads(user_data)
                data_to_predict = np.array(feed['data'])
                api.logger.info(str(data_to_predict))

                # apply the model to the provided data
                prediction = model.predict(data_to_predict)
                prediction = (prediction > 0)

                success = True
            else:
                api.logger.info("Invalid JSON received from client - cannot apply model.")
                error_message = "Invalid JSON provided in request: " + user_data
                success = False
        else:
            api.logger.info("Model has not yet reached the input port - try again.")
            error_message = "Model has not yet reached the input port - try again."
            success = False
    except Exception as e:
        api.logger.error(e)
        error_message = "An error occurred: " + str(e)

    if success:
        # apply carried out successfully, send a response to the user
        result = json.dumps({'Results': str(prediction)})
    else:
        result = json.dumps({'Error': error_message})

    request_id = msg.attributes['message.request.id']
    response = api.Message(attributes={'message.request.id': request_id}, body=result)
    api.send('output', response)


api.set_port_callback("model", on_model)
api.set_port_callback("input", on_input)

 

Now you can deploy the pipeline. Once it is done, you will get a URL which you can use to test your model; make sure to append /v1/uploadjson/ to the URL.

Deployment of the pipeline can take a while.

By POSTing data to this URL you can test the model.

Headers of the call; Authorization is Basic with your username:

 
[{"key":"X-Requested-With","value":"XMLHttpRequest","description":""},{"key":"Authorization","value":"Add your authentication here":""},{"key":"Content-Type","value":"application/json","description":""}]

 

Body of the request, containing Age and Salary:
{
  "data": [[47, 25000]]
}
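
For example, a hypothetical call to the deployed endpoint from Python (the URL, tenant and credentials are placeholders; SAP Data Intelligence typically expects basic authentication in the form tenant\username):

import requests

url = "https://<your-deployment-url>/v1/uploadjson/"  # deployment URL with /v1/uploadjson/ appended
headers = {
    "X-Requested-With": "XMLHttpRequest",
    "Content-Type": "application/json",
}
payload = {"data": [[47, 25000]]}  # Age, Salary

# basic authentication - replace the placeholders with your tenant, user and password
response = requests.post(url, json=payload, headers=headers,
                         auth=("<tenant>\\<username>", "<password>"))
print(response.status_code, response.text)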



 

!!!!! Congratulations !!!!!

You have successfully created and deployed a model using HANA DB as a data source.

 

Some blogs related to SAP Data Intelligence:

https://blogs.sap.com/2020/03/20/sap-data-intelligence-development-news-for-3.0/

https://blogs.sap.com/2020/03/20/sap-data-intelligence-next-evolution-of-sap-data-hub/

https://blogs.sap.com/2019/07/17/sap-data-hub-and-sap-data-intelligence-streamlining-data-driven-int...

 