Marc DANIAU

Train a model from a Jupyter notebook using the Python API of SAP Predictive Analytics

The Python API of SAP Predictive Analytics allows you to train and apply models programmatically. The user’s code can be executed either in batch mode, from a .py script, or interactively, from a Jupyter notebook.

In this article, you will see how to configure, train and save a model with the API.

The example presented below was done on a Windows machine with:

  • SAP Predictive Analytics 3.3 Desktop, which includes the Python API.
  • The WinPython distribution, which bundles data science libraries and the Jupyter Notebook App.

We will work with census data that comes with SAP Predictive Analytics.

 

The Training Dataset

To start, we read the CSV file and load its content into a pandas data frame.

import pandas as pd

data_file = "Census01.csv"
data_folder = r"C:\Program Files\SAP Predictive Analytics\Desktop\Automated\Samples\Census"
df = pd.read_csv(data_folder + "\\" + data_file, header=0)
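
Note that concatenating the folder and file with "\\" is Windows-specific; os.path.join is a more portable way to build the path. A minimal sketch:

import os

# Portable equivalent of the concatenation above
df = pd.read_csv(os.path.join(data_folder, data_file), header=0)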

What is the dataset size?

text = "Size of %s" % data_file
print('\x1b[1m' + text + '\x1b[0m')
num_rows, num_cols = df.shape
print("{} Rows".format(num_rows))
print("{} Columns".format(num_cols))

We display the first ten rows.

df.head(10)

The last column, class, contains 1 if the individual’s annual income is over 50K, 0 otherwise.
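
As a quick sanity check, we can confirm that the column holds only those two values:

df['class'].unique()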

We check the proportion of positive cases.

s1 = df['class'].value_counts()
s2 = df['class'].value_counts(normalize=True) * 100
dfc = pd.concat([s1.rename("Observations"), s2.rename("In %")], axis=1)
dfc['In %'] = dfc['In %'].round(2)
dfc.index.name = 'Class'
dfc

The percentage of class 1 cases is large enough to train a classifier on.

We can break down that percentage by a categorical variable such as relationship.

pd.crosstab(df['relationship'],df['class'],margins=True, normalize=True).round(4)*100
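
Another variant, normalizing by row instead, shows for each relationship value the share of class 0 versus class 1:

pd.crosstab(df['relationship'], df['class'], normalize='index').round(4)*100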

Class is the outcome we want to predict. To make predictions, we must first learn from our training dataset whose outcome is known. This is where the Automated Analytics library (aalib) comes into play.

What is our version of Python by the way?

print('\x1b[1m'+ 'Python Version' + '\x1b[0m')
import platform
platform.python_version()

We have the required version: 3.5. We can proceed with using aalib.
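
Since the aalib bindings shipped with a given release are built for one specific Python version, a fail-fast guard can avoid confusing import errors later. A minimal sketch, assuming 3.5 is the required version as stated above:

import platform

# Stop early if the interpreter does not match the aalib build (3.5 here)
major, minor, _ = platform.python_version_tuple()
assert (major, minor) == ('3', '5'), "aalib requires Python 3.5"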

 

Initialization

We provide the paths to the Python API, the C++ API and the SAP Predictive Analytics desktop directories.

import sys
sys.path.append(r"C:\Program Files\SAP Predictive Analytics\Desktop\Automated\EXE\Clients\Python35")
import os
# Prepend the C++ client directory to PATH rather than overwriting it entirely
os.environ['PATH'] = r"C:\Program Files\SAP Predictive Analytics\Desktop\Automated\EXE\Clients\CPP" + os.pathsep + os.environ['PATH']

AA_DIRECTORY = r"C:\Program Files\SAP Predictive Analytics\Desktop\Automated"

We import the Automated Analytics library and we specify the context and the configuration store.

import aalib

class DefaultContext(aalib.IKxenContext):
    def __init__(self):
        super().__init__()

    def userMessage(self, iSource, iMessage, iLevel):
        # Relay engine messages to the notebook output
        print(iMessage)
        return True

    def userConfirm(self, iSource, iPrompt):
        pass

    def userAskOne(self, iSource, iPrompt, iHidden):
        pass

    def stopCallBack(self, iSource):
        pass

frontend = aalib.KxFrontEnd([])
factory = frontend.getFactory()
context = DefaultContext()

factory.setConfiguration("DefaultMessages", "true")
config_store = factory.createStore("Kxen.FileStore")
config_store.setContext(context, 'en', 10, False)
config_store.openStore(AA_DIRECTORY + r"\EXE\Clients\CPP", "", "")
config_store.loadAdditionnalConfig("KxShell.cfg")

 

Creating the Model

We create a “regression” model that performs a classification if the specified target is nominal (e.g. class), or a regression if it is continuous (e.g. age).

model = factory.createModel("Kxen.SimpleModel")
model.setContext(context, 'en', 8, False)
model.pushTransformInProtocol("Default", "Kxen.RobustRegression")

With aalib one can work against a database table or a flat file, but not directly against a pandas data frame. In our case, we declare a training store against the census CSV file.

store = model.openNewStore("Kxen.FileStore", data_folder, "", "")
model.newDataSet("Training", data_file, store)
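
If your data only existed as a pandas data frame, you could first materialize it as a flat file and point the store at that instead. A minimal sketch, assuming a writable export folder (the folder and file names below are illustrative):

import os

export_folder = r"C:\Temp"                  # assumed writable location
export_file = "census_from_dataframe.csv"   # illustrative file name
df.to_csv(os.path.join(export_folder, export_file), index=False)

# The store and dataset would then be declared against the exported file:
# store = model.openNewStore("Kxen.FileStore", export_folder, "", "")
# model.newDataSet("Training", export_file, store)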

The API can guess the data description, or read it from a description file if one exists.

# model.guessSpaceDescription("Training")
metadata_file = "Desc_Census01.csv"
model.readSpaceDescription("Training", metadata_file, store)

We set the name of the target column. It can be hard-coded or derived from a rule, such as taking the last column name.

target_col = list(df)[-1]
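
Because this rule silently picks whatever column happens to be last, a quick check can catch a reordered file. A minimal sketch (the expected name class is specific to this dataset):

# Guard against a reordered or malformed file
assert target_col == 'class', "Unexpected target column: %s" % target_col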

We set the roles of the variables.

model.getParameter("")
variables = model.getParameter("Protocols/Default/Variables")
variables.setAllValues("Role", "input")
variables.setSubValue(target_col + "/Role", "target")
variables.setSubValue("KxIndex/Role", "skip")
model.validateParameter()

We choose the partitioning scheme. By default, three partitions are prepared: Estimation, Validation and Test; here we keep only two, Estimation and Validation.

model.getParameter("")
model.changeParameter("Parameters/CutTrainingPolicy", "random with no test")
model.validateParameter()

We can enable or disable the auto-selection of candidate predictors with a true/false parameter.

model.getParameter("")
model.changeParameter("Protocols/Default/Transforms/Kxen.RobustRegression/Parameters/VariableSelection", "true")
model.validateParameter()

We can set the Polynomial Order, the default value being 1.

model.getParameter("")
model.changeParameter("Protocols/Default/Transforms/Kxen.RobustRegression/Parameters/Order", "1")
model.validateParameter() 

Finally, we train the model.

model.sendMode(aalib.Kxen_learn, store) 

 

Saving the Model

Our model was successfully trained. Let’s save it. The method for that is described below.

help(model.saveModel)

We name our model and persist it for later use.

model_folder = r"O:\MODULES_PA/PYTHON_API/MY_MODELS"
model_file = "models_space"
model.setName("My Classification Model")
model_comment = "Generated with Python API from Jupyter Notebook"
model_store = model.openNewStore("Kxen.FileStore", model_folder, "", "")
model.saveModel(model_store, model_file, model_comment)
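
If the destination folder might not exist yet, you can create it beforehand. A minimal sketch, assuming the file store expects an existing directory:

import os

# Create the destination folder if it is missing
os.makedirs(model_folder, exist_ok=True)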

This model has the same format as if you had saved it from the desktop application, so a desktop user can load it if needed.

In a subsequent blog, we will debrief our census model inside a Jupyter notebook.

Comments
      Chris Gruber

      This is an excellent series combining Python and the data science process. Great job!

      Parinya Hiranpanthaporn

      Great article Marc, would it be possible to deploy the model using Predictive Factory?

      Thanks

      Marc DANIAU (Blog Post Author)

      Once saved, the model can be imported into Predictive Factory with the Import capability, provided that it meets the conditions described here:

      https://help.sap.com/viewer/41d1a6d4e7574e32b815f1cc87c00f42/3.3/en-US/6d11e4fc037b40609e86803179c85226.html

      But creating the model directly in Predictive Factory would be much simpler.

      Antoine CHABERT

      Hello Parinya, feel free to use the (internal) SAP JAM "Advanced & Augmented Analytics" to ask such questions in the future. We are monitoring forum questions there as well.

      ranveer singh

      This is a very interesting article; it helped me a lot to understand how I can use SAP data in a machine learning model with Python. But how can we filter null values and replace them with the mode in an SAP dataset using Python?