Skip to Content
Technical Articles

SAP Data Intelligence : Advanced Attrition Data Science Scenario

In this video I’ll be showing the full process to create a logistic regression model developed in Python inside SAP Data Intelligence using a training set of employee attrition. This model will then be exposed thru an REST-API using SAP Data Intelligence. To create this demo I’ve used this great blog from Andreas Forster, a great part of this blog is directly copied from his work, with his permission. If at any point you find yourself lost in my demo, please refer back to Andreas’ blog and follow his steps through. He goes much more in detail with what to do and explains every component in a much more detailed manner.

Please note that the data used here is fake and contains no real information.

Dataset and business question

Our dataset is a small CSV file, which contains 2800 fake employee information with 64 columns of various information and also with a column indicating, job title, several competencies, salary, etc…

ID Country DOB AgeJoined Age YearsOfService
1 France 03/04/1992 27 27 0  
2 France 13/09/1984 33 35 2  
3 France 21/02/1981 38 38 0  
4 France 15/09/1967 27 52 25  
5 France 11/02/1994 21 25 4  

The business question we are trying to answer here is the risk of attrition of an employee. In our dataset, we have a column LEFT saying if the employee has left the company of not. We will use this column for training/testing purpose and then try to run it for our existing or new employees to identify those who might leave us.

In this demo I’m loading this dataset in min.io to simulate an Amazon S3 bucket. I have also loaded this dataset in my own SAP HANA database to develop my python code in the SAP Data Intelligence provided Jupyter Lab Notebook.

Data connection

First we need to define all the connections that we will be using:

We need to configure our S3 bucket connection and have it point to our bucket. You can verify the connection by clicking on Test Connection.

ML Scenario

After we have loaded our file, we can start working on the actual Data Science part of our task. Go back to the main page of SAP Data Intelligence and click into the “ML Scenario Manager”.

I will not go into the details of the machine learning python script here but basically this is where I created the notebook to perform my analysis. We will also create two pipelines here, one to perform the training of our model and the second one to deploy our REST-API using this same model.

Logistic Regression model training

First we will create a Training Pipeline by clicking the “+” sign in the Notebooks section.

For this I will use the Python Producer template

We then need to configure the read file component of our pipeline and have it point to our previously defined S3 bucket.

Then we simply need to enter the path to our csv file in the bucket.

We then need to enter our python code in the Python3 component.

Here is the code that was used in my component:

import sklearn  
from sklearn.model_selection import train_test_split  
from sklearn import preprocessing  
from sklearn.linear_model import LogisticRegression  
import pandas as pd  
from sklearn import metrics  
import io  
import numpy as np  
      
# Example Python script to perform training on input data & generate Metrics & Model Blob  
def on_input(data):  
  
    # Obtain data  
    df_data = pd.read_csv(io.StringIO(data), sep=";")  
    # Creating labelEncoder  
    le = preprocessing.LabelEncoder()  
    # Categorical features to numerical features  
    df_data['Gender']=le.fit_transform(df_data['Gender'])  
    df_data['JobType']=le.fit_transform(df_data['JobType'])  
    df_data['BirthCountry']=le.fit_transform(df_data['BirthCountry'])  
    # Balancing the dataset  
    Left = df_data[df_data["LEFT"]==1].sample(200)  
    Stayed = df_data[df_data["LEFT"]==0].sample(200)  
    balanced_df = pd.concat([Left, Stayed],ignore_index=False, sort=False)  
    # Normalization  
    column_names_to_normalize = ['Age_Years', 'Salary','VerbalCommunication','Teamwork',  
           'CommercialAwareness','AnalysingInvestigating','InitiativeSelfMotivation','Drive','WrittenCommunication',  
           'Flexibility','TimeManagement','PlanningOrganising','DaysSickYTD','ContractedHoursperWeek','PerformanceGrade2015',  
           'DaysWithoutRaise','Gender','JobType']  
    X = balanced_df[['Age_Years', 'Salary','VerbalCommunication','Teamwork',  
           'CommercialAwareness','AnalysingInvestigating','InitiativeSelfMotivation','Drive','WrittenCommunication',  
           'Flexibility','TimeManagement','PlanningOrganising','DaysSickYTD','ContractedHoursperWeek','PerformanceGrade2015',  
           'DaysWithoutRaise','Gender','JobType']]  
    # We will be using a min max scaler for normalization  
    min_max_scaler = preprocessing.MinMaxScaler()  
    df = X  
    x = df[column_names_to_normalize].values  
    x_scaled = min_max_scaler.fit_transform(x)  
    df_temp = pd.DataFrame(x_scaled, columns=column_names_to_normalize, index = df.index)  
    df[column_names_to_normalize] = df_temp  
    X = df_temp  
    # Converting string labels into numbers.  
    y=le.fit_transform(balanced_df['LEFT'])  
    #  Slicing dataset between train and test  
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)  
    # Training the Model  
    clf= LogisticRegression(verbose = 3)  
    clf_trained = clf.fit(X_train, y_train)   
    y_pred = clf_trained.predict(X_test)  
      
    MeanAbsoluteError=metrics.mean_absolute_error(y_test, y_pred)  
    MeanSquaredError=metrics.mean_squared_error(y_test, y_pred)  
    RootMeanSquaredError=np.sqrt(metrics.mean_squared_error(y_test, y_pred))  
  
    # to send metrics to the Submit Metrics operator, create a Python dictionary of key-value pairs  
    metrics_dict = {"Mean Absolute Error": str(MeanAbsoluteError),"Mean Squared Error": str(MeanSquaredError), "Root Mean Squared Error": str(RootMeanSquaredError), "n": str(len(df_data))}  
      
    # send the metrics to the output port - Submit Metrics operator will use this to persist the metrics   
    api.send("metrics", api.Message(metrics_dict))  
  
    # create & send the model blob to the output port - Artifact Producer operator will use this to persist the model and create an artifact ID  
    import pickle  
    model_blob = pickle.dumps(clf_trained)  
    api.send("modelBlob", model_blob)  
      
api.set_port_callback("input", on_input)  

Exit the code, save the pipeline and we now can train the model. Go back to the ML Scenario page and select your newly created pipeline.

One very important step here is to define the docker image for our Python3 component. This will allow us to use specific libraries for this pipeline without impacting others. Go to repository on the left side of SAP Data Intelligence and “Create a Docker File”.

FROM $com.sap.opensuse.python36
RUN python3.6 -m pip install numpy==1.16.4
RUN python3.6 -m pip install pandas==0.24.0
RUN python3.6 -m pip install sklearn

After you have done that you can enter specific tags to use with our Python component. Click on the configuration panel and enter the following information.

“opensuse”, “python36” and “tornado”. For “tornado” also enter the version “5.0.2”. We also need to enter a specific tag for this dockerfile. I’ve used “pe_attrition”.

Now save the Docker file and click the “Build” icon to start building the Docker image.

Wait until the build is successful. You can know use this docker file for your Python component.

To do so, we go back to the graphical pipeline and right click the Python3 component and select “Group”.

In the tags, we will add the previously created tag, “pe_attrition”.

You can now save the graph.

The pipeline is now complete and we can run it. Go back to the ML Scenario, select your pipeline and execute it.

Click thru all the steps and give your model a name. Wait till the model is executed.

You can now look at the results of our model thru the metrics that we used in our python code.

If the results of our metrics are good enough we can proceed with deploying the model as a REST API.

Go back to your ML scenario and copy the model technical identifier

Pushing the model as a REST-API

Let’s create a second pipeline. For this we will be using a Python Consumer template.

We only need to change the Submit Artifact Name component “Content” value and change it to ${modelTechnicalIdentifier}. As explained in Andreas’ blog, this change will enable us to pass the model’s technical identifier to the pipeline.

I then modified the code in the Python3 component to the following :

import json
import pickle
from sklearn import preprocessing
import pandas as pd

# Global vars to keep track of model status
model = None
model_ready = False

# Validate input data is JSON
def is_json(data):
  try:
    json_object = json.loads(data)
  except ValueError as e:
    return False
  return True

# When Model Blob reaches the input port
def on_model(model_blob):
    global model
    global model_ready

    model = pickle.loads(model_blob)
    model_ready = True
    api.logger.info("Model Received & Ready")

# Client POST request received
def on_input(msg):
    error_message = ""
    success = False
    prediction = None 
    df_data = None
    
    try:
        api.logger.info("POST request received from Client - checking if model is ready")
        if model_ready:
            api.logger.info("Model Ready")
            api.logger.info("Received data from client - validating json input")
            
            user_data = msg.body.decode('utf-8')
            # Received message from client, verify json data is valid
            if is_json(user_data):
                api.logger.info("Received valid json data from client - ready to use")
                
                # apply your model
                # Obtain data
                data = json.loads(user_data)
                df_data = pd.DataFrame(data, index=[0])
                # Creating labelEncoder
                le = preprocessing.LabelEncoder()
                # Categorical features to numerical features
                df_data['Gender']=le.fit_transform(df_data['Gender'])
                df_data['JobType']=le.fit_transform(df_data['JobType'])
                df_data['BirthCountry']=le.fit_transform(df_data['BirthCountry'])
                # Normalization
                column_names_to_normalize = ['Age_Years', 'Salary','VerbalCommunication','Teamwork',
                       'CommercialAwareness','AnalysingInvestigating','InitiativeSelfMotivation','Drive','WrittenCommunication',
                       'Flexibility','TimeManagement','PlanningOrganising','DaysSickYTD','ContractedHoursperWeek','PerformanceGrade2015',
                       'DaysWithoutRaise','Gender','JobType']
                X = df_data[['Age_Years', 'Salary','VerbalCommunication','Teamwork',
                       'CommercialAwareness','AnalysingInvestigating','InitiativeSelfMotivation','Drive','WrittenCommunication',
                       'Flexibility','TimeManagement','PlanningOrganising','DaysSickYTD','ContractedHoursperWeek','PerformanceGrade2015',
                       'DaysWithoutRaise','Gender','JobType']]
                # We will be using a min max scaler for normalization
                min_max_scaler = preprocessing.MinMaxScaler()
                df = X
                x = df[column_names_to_normalize].values
                x_scaled = min_max_scaler.fit_transform(x)
                df_temp = pd.DataFrame(x_scaled, columns=column_names_to_normalize, index = df.index)
                df[column_names_to_normalize] = df_temp
                X = df_temp
                prediction = model.predict_proba(X)
                # obtain your results
                success = True
            else:
                api.logger.info("Invalid JSON received from client - cannot apply model.")
                error_message = "Invalid JSON provided in request: " + user_data
                success = False
        else:
            api.logger.info("Model has not yet reached the input port - try again.")
            error_message = "Model has not yet reached the input port - try again."
            success = False
    except Exception as e:
        api.logger.error(e)
        error_message = "An error occurred: " + str(e) + " Data sent : " + str(df_data)
    
    if success:
        # apply carried out successfully, send a response to the user
        msg.body = json.dumps({'Employee attrition ': str(prediction[0])})
    else:
        msg.body = json.dumps({'Error': error_message})
    
    api.send('output', msg)
    
api.set_port_callback("model", on_model)
api.set_port_callback("input", on_input)

In my code I needed to prepare the dataset in the same manner as in my model. That is why we have again the normalization and feature selection along with the labeling of categories, in order to extract these features in the same way. Actually while writing my blog I’ve come to realize that this part is incorrect as my labelling might not have the same number of labels. This will have to be fixed…

After this we apply the model to our dataset using the predict_proba in order to provide the user with a percentage result of chances of attrition for this data.

We can now close the editor window and as in the first pipeline, we need to assign the Python component to the DockerFile containing all the required libraries. Right-click the Python component and select “Group”. Then add the tag that you used for your dockerfile, in my case “pe_attrition”.

Save the change and go back to the ML Scenario.

We now can deploy this pipeline, for this we will use the technical identifier of our previously training model.

Select your newly created pipeline and click the “Deploy” icon.

Go through all the step and when prompted for your modelTechnicalIdentifier enter the value of your model. Click “Save”.

Deploy it, your status will be pending for a few minutes

Once it is running, copy the URL of our deployed REST-API, don’t try it as it is, it is missing some part.

Contrary to Andreas who used Postman for his demo, my prefered tool is SoapUI. In a previous life, I found out some limits to Postman and have been using SoapUI ever since. It’s more complex to use but has more features.

Open SoapUI and create a new REST project.

Add /v1/uploadjson/ to your deployed URL. Change the request type from “GET” to “POST”.

Go to the “Auth” tab below, select “Basic Authorization” and enter your user name and password for SAP Data Intelligence. The user name starts with your tenant’s name, followed by a backslash and your actual user name. For SoapUI we also need to select Authenticate pre-emptively.

In the “Headers” tab and enter the key “X-Requested-With” with value “XMLHttpRequest”.

Then enter the input data in the REST-API:

{
    "ID": 2801,
    "Country": "France",
    "City_New": "Caen",
    "LAT": 49.182,
    "LONG": -0.366,
    "PreviousEmployer": "ANYCOMP",
    "DateofBirth": "16/04/1982",
    "DateJoinedCompany": "01/10/2019",
    "Age_Years": 27,
    "LengthofServiceonDeparture_Years": "",
    "DateLeftCompany": "",
    "LEFT": 0,
    "LengthofServiceonDeparture_Days": "",
    "Status": "Current Employee",
    "TerminationReason": "",
    "DateofLastreview": "04/10/2019",
    "DaysSinceLastReview": 84,
    "MonthsSinceLastReview": 7,
    "PersonalEmail": "whoever@if.org",
    "FullName": "Michel Who",
    "Gender": "M",
    "FirstLanguage": "French",
    "SecondLanguage": "English",
    "ThirdLanguage": "",
    "Salary": "34567",
    "JobTitle": "Architect",
    "VerbalCommunication": 4,
    "Teamwork": 4,
    "CommercialAwareness": 3,
    "AnalysingInvestigating": 4,
    "InitiativeSelfMotivation": 1,
    "Drive": 2,
    "WrittenCommunication": 2,
    "Flexibility": 4,
    "TimeManagement": 2,
    "PlanningOrganising": 3,
    "Phone": "000-000000",
    "EmployeeID": 12801,
    "Title": "M.",
    "UserID": "michelwho",
    "CompanyEmail": "mwho@anothercomp.org",
    "DaysSickYTD": 0,
    "FTEEquivalent": 1,
    "ContractedHoursperWeek": 37,
    "AverageWeeklyHoursWorkedYTD": 36.4,
    "ContractedHoursWorked": "98.00%",
    "AverageVariancefromContractedHoursYTD": "98%",
    "BirthCountry": "France",
    "InPensionScheme": "Y",
    "LastPayRise": 0,
    "LastPayRiseDate": "01/10/2019",
    "DaysWithoutRaise": 13,
    "PerformanceGrade2015": 0,
    "PerformanceGrade2014": 0,
    "JobType": "Skilled Labor",
    "QuarterofLastReview": 5,
    "YEARREVIEWPROS": "Managers that listen, plenty of O.T.",
    "YEARREVIEWCONS": "Company moves fast, so does the employees."
  }

Press the play button and you will get a result from our deployed model in SAP Data Intelligence.

This specific employee has a 77.06% change of leaving us. Something needs to be done or we might loose him!

You can play around with the model by changing the values in our JSON request.

Summary

In this post, we used SAP Data Intelligence to create an attrition Data Science scenario. We used the embedded Jupyter Lab notebook to create our analysis and our Machine learning model. We then used this model in a provided template pipeline that we adapted to our business question. We then executed the model and validated that the model was accurate. With this pipeline, we will be able to recreate the model in case we identify a divergence between our predicted values and the real attrition in our company. We then created a second pipeline using another template to consume the previously created model and expose it as a REST API and then used a tool, such as SoapUI to run the model on employees.

I hope this tutorial was helpful, don’t hesitate to comment if you need help or have questions.

4 Comments
You must be Logged on to comment or reply to a post.