
Build and Serve Sentiment Model using SAP Data Intelligence – Part 1

In this blog post, let’s create a simple deep learning (Keras) based sentiment model using SAP Data Intelligence. In the course of building the various stages of a machine learning pipeline, which include data extraction, data pre-processing, model training, and model deployment and serving, you will become familiar with the SAP Data Intelligence product and its capabilities.

In this tutorial, we will use a subset of the well-known IMDB movie review dataset for model training and serving. Do not worry about model accuracy, since the training data is a subset of the larger dataset; the primary objective of this blog post is to understand and explore the core capabilities of SAP Data Intelligence itself.

Prerequisites

  1. Access to an SAP Data Intelligence instance
  2. An Amazon S3 account for holding the raw training data

In the course of this tutorial, we will create four major pipelines:

Data Extraction & Pre-Processing Pipeline – Part 1

This pipeline extracts raw data from an external data repository and applies some data cleansing methods for training. In this case, we use S3 as our external data store.

Model Training Pipeline – Part 1

This pipeline trains a simple sentiment model using the Keras API and exports the model in the SavedModel format.

Model Deployment & Serving Pipeline – Part 2

This pipeline deploys the exported sentiment model and starts serving real-time prediction requests.

Inference Pipeline – Part 2

This pipeline creates HTTP prediction requests to get the sentiment of a given text input.

1. Data Extraction & Pre-Processing Pipeline

Assuming the data is already available in S3 in the following format, let’s register an S3 connection and create a pipeline that extracts the data from S3, applies some pre-processing logic, and finally produces a CSV file, which will then be stored in the cloud Data Lake for further processing.

The following step-by-step approach will help you complete this section.

  • S3 Data Repository

    • Bucket Name: sapdi
    • Negative Reviews: path -> sentiment/negative – all the negative reviews are kept as .txt files
    • Positive Reviews: path -> sentiment/positive – all the positive reviews are kept as .txt files
    • The dataset can be downloaded from sentiment.zip
    • Unzip the file and set up the S3 data store as shown in the screenshot above
  • Create S3 Connection in SAP Data Intelligence

    • Once logged into SAP Data Intelligence, click the “Connection Management” tile on the launchpad
    • You will be redirected to the home screen of Connection Management
    • Create a new connection by clicking the “Create” button located at the right-hand corner
    • In the “Create Connection” window, provide the values as shown below; these may differ based on your S3 account
      • Id: S3_SENTIMENT
      • Connection Type: S3
      • Endpoint: <ENDPOINT> e.g: s3.eu-central-1.amazonaws.com
      • Access Key: <AWS ACCESS KEY>
      • Secret Key: <AWS SECRET KEY>
    • Click “Test Connection” and confirm you see the “OK” message
    • Click the “Create” button to create the connection
    • Now you should be able to see the newly created S3 connection on the home screen of Connection Management, as shown below
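Before wiring the connection into a pipeline, you can sanity-check the bucket layout with a short boto3 script. This is a hedged sketch: the bucket name `sapdi` and the `sentiment/positive` and `sentiment/negative` prefixes follow the layout described above, while the helper names are my own, not part of SAP Data Intelligence.

```python
# Optional sanity check of the S3 layout described above (pip install boto3).
# Bucket and prefix names follow this blog's layout; helper names are illustrative.

def review_prefixes(labels=("positive", "negative"), root="sentiment"):
    """Build the expected key prefixes for each sentiment label."""
    return [f"{root}/{label}/" for label in labels]

def count_reviews(bucket="sapdi", endpoint_url=None):
    """Count .txt review files under each prefix (needs valid AWS credentials)."""
    import boto3  # imported here so review_prefixes works without boto3 installed
    s3 = boto3.client("s3", endpoint_url=endpoint_url)
    counts = {}
    for prefix in review_prefixes():
        pages = s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix)
        counts[prefix] = sum(
            1
            for page in pages
            for obj in page.get("Contents", [])
            if obj["Key"].endswith(".txt")
        )
    return counts
```

Calling `count_reviews()` against your account should report roughly equal file counts for both prefixes if the upload matches the layout above.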
  • Create Data Extraction & Pre-Processing Pipeline

    • From the SAP Data Intelligence launchpad, click the “ML Scenario Manager” tile
    • You will be redirected to the home screen of ML Scenario Manager
    • Create a new scenario by clicking the “+” button located at the right-hand corner of the screen
    • In the pop-up window, provide a name for the scenario, e.g. “sentiment”
    • Click the “Create” button, which will take you to the scenario home page
    • Once the scenario is launched, click the “Pipelines” tab
    • In the pipeline section, click the “+” button to create a new pipeline
    • In the “Create Pipeline” pop-up, provide “Name” as “sentiment-data-extract-preprocess” and “Description” as “Extract data from S3 and pre-process”, then click the “Create” button

    • You will now be redirected to the modeler window to design the pipeline
    • Before modeling the pipeline, we need to make the required Python library available. Since we will use pandas as a dependency for pre-processing the data, we need to build a custom Docker image with a matching tag so the pipeline can run successfully
    • From the modeler window, in the navigation pane, choose the “Repository” tab
    • Right-click the “dockerfiles” section and choose “Create Folder”
    • Name your folder “preprocess” and click “Create”
    • Right-click the newly created folder (preprocess) and choose “Create Docker File”
    • Copy the script below and paste it into the Dockerfile editor
      FROM §/com.sap.datahub.linuxx86_64/vflow-python36:2.7.9
      
      RUN python3.6 -m pip --no-cache-dir install 'pandas==0.25.1'
    • Click the Docker configuration icon at the right-hand corner of the window and add tags as shown below
    • Finally, your folder structure, editor content, and configuration should look as shown below
    • In the editor toolbar, click “Save” to save your changes
    • In the editor toolbar, click “Build” to build a Docker image from the Dockerfile. Wait until the “Build Status” turns green, as shown below
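The exact tag names depend on your setup, but a plausible shape (shown here in the repository JSON form, as an illustrative example only) pairs a tag for the base image’s Python runtime with one for the pandas version installed above:

```json
{
  "python36": "",
  "pandas": "0.25.1"
}
```

The same tags must then appear on the operator group that uses pandas, so that the Modeler schedules it onto this custom image.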

    • Now switch back to the “Graph” tab
    • Enable “JSON” mode by clicking the JSON button located at the right-hand corner of the page
    • Delete the existing content, copy the content of dataextract-graph.json, paste it into the editor, and click the save icon in the graph toolbar
    • Switch back to diagram mode and make sure your pipeline looks like the one below
    • Take some time to explore the configuration of each operator and understand the parameter values. Here is some brief information about each operator:
      • Read File: connects to the S3 data source and recursively reads all the files in the given path. Take a look at the bucket, path, and pattern set in the operator configuration
      • Python 3 Operator: receives the file content from the Read File operator, performs some data cleansing, and creates a consolidated CSV file as output
      • Write File: receives the CSV file and writes it to the Data Lake
      • Wiretap: lets you view the runtime logs
      • Graph Terminator: stops the graph once all the tasks are performed
    • Save the graph, go back to the sentiment scenario in the “Scenario Manager”, and click the “Create Version” button to create a new version
    • Once the version has been created, select the “sentiment-data-extract-preprocess” pipeline and click the “Execute” icon to run the pipeline
    • From the modeler window, once the pipeline is running, open the “Wiretap Operator UI” in the graph to view the runtime execution logs of the pipeline
    • Once the graph has executed and completed successfully, you will see a file called “movies.csv” created at the path “/worm/sentiment/movies.csv” in the Data Lake. This can be viewed via the “Metadata Explorer” application, which can be launched from the launchpad
    • Now the data is ready for the next stage
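The cleansing logic inside the Python 3 operator ships with dataextract-graph.json and is not reproduced in this post. As a hedged sketch, one plausible shape for it is shown below: strip markup and extra whitespace from each review, label it, and emit a balanced CSV with pandas. The function and column names are my own assumptions, not taken from the pipeline.

```python
# Illustrative pre-processing in the spirit of the Python 3 operator:
# clean raw review texts, label them, balance classes, and emit a CSV.
import re

import pandas as pd

def clean_text(text):
    """Lower-case a review, strip HTML tags, and collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)      # drop HTML tags such as <br />
    text = re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace
    return text.lower()

def build_dataset(positive_texts, negative_texts):
    """Build a balanced DataFrame with 'review' and 'sentiment' columns."""
    n = min(len(positive_texts), len(negative_texts))  # equal proportion
    rows = [{"review": clean_text(t), "sentiment": 1} for t in positive_texts[:n]]
    rows += [{"review": clean_text(t), "sentiment": 0} for t in negative_texts[:n]]
    return pd.DataFrame(rows)

if __name__ == "__main__":
    df = build_dataset(["A <br /> GREAT movie!", "Loved  it"], ["Terrible."])
    df.to_csv("movies.csv", index=False)  # what the Write File operator persists
```

In the actual operator the texts arrive on an input port rather than as function arguments, and the CSV bytes are sent to the Write File operator instead of being written locally.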

In this section, we extracted the data from S3 and created a CSV file that contains positive and negative sentiments in equal proportion. Let’s continue with model training in the next section.

2. Model Training Pipeline

As we are already familiar with the “Scenario Manager”, let’s go directly to creating the training pipeline.

The following step-by-step approach will help you complete this section.

  • Create Dataset Artifact

    • In the Scenario Manager, go to the “Datasets” tab
    • Click the “+” icon to register a new training dataset artifact pointing to where the “movies.csv” file is present
    • In the “Register Dataset” pop-up screen, provide “Name” as “sentiment” and “URL” as “dh-dl://DI_DATA_LAKE/worm/sentiment/”
    • Take note of the “Technical Identifier”, which will be used later at training time
  • Create Training Pipeline

    • Start creating a new pipeline by clicking the “+” icon
    • In the “Create Pipeline” pop-up, provide the name of the pipeline as “sentiment-training”, leave the rest at the defaults, and click the “Create” button
    • You will now be redirected to the modeler window for pipeline design
    • In the modeler window, enable JSON mode by clicking the JSON button located at the right-hand corner of the page
    • Delete the existing content, copy the content of training-graph.json, paste it into the editor, and click the save icon in the graph toolbar
    • Switch back to diagram mode and make sure your pipeline looks like the one below
    • Take some time to explore the configuration of each operator and understand the parameter values shown below. Here is some brief information about the operator:
      • Training: accepts the training script and executes the training against the given dataset
    • In the training operator configuration, modify the Artifact field to reference the dataset “Technical Identifier” created in the previous step: provide “Name” as “SENTIMENTDATA” and “ID” as <Technical Identifier of the Dataset>, as shown below
    • Save the graph, go back to the “Sentiment Scenario” in the “Scenario Manager”, and click the “Create Version” button to create a new version
    • Once the version has been created, select the “sentiment-training” pipeline and click the “Execute” icon to run the pipeline
    • From the modeler window, once the pipeline is running, open the “Wiretap Operator UI” in the graph to view the runtime execution logs of the pipeline
    • Once the graph has executed and completed successfully, you will see a model artifact created and registered in the “Models” tab of the “Scenario Manager”
    • Now the model artifact is ready for deployment and serving
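The training script itself lives in training-graph.json and is not shown in this post. As a rough, hedged sketch, a Keras sentiment classifier for this task could look like the following; all names, layer sizes, and hyperparameters here are illustrative assumptions, not values taken from the pipeline.

```python
# Illustrative Keras sentiment model in the spirit of the training operator.
# Vocabulary size, sequence length, and layer sizes are assumptions.
import numpy as np
import tensorflow as tf

VOCAB_SIZE = 10000  # keep only the most frequent words
MAXLEN = 200        # pad/truncate reviews to this many tokens

def build_model(vocab_size=VOCAB_SIZE):
    """Small embedding + pooling classifier producing a sentiment probability."""
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, 16),
        tf.keras.layers.GlobalAveragePooling1D(),
        tf.keras.layers.Dense(16, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),  # P(positive review)
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

if __name__ == "__main__":
    model = build_model()
    # Tokenized, padded reviews would come from the movies.csv produced earlier;
    # random data stands in here just to show the training and export calls.
    x = np.random.randint(0, VOCAB_SIZE, size=(32, MAXLEN))
    y = np.random.randint(0, 2, size=(32,))
    model.fit(x, y, epochs=1, verbose=0)
    try:
        model.export("sentiment_model")  # Keras 3: SavedModel export
    except AttributeError:
        model.save("sentiment_model")    # older tf.keras: SavedModel directory
```

The SavedModel directory produced at the end is what the pipeline registers as the model artifact; Part 2 will pick it up for deployment and serving.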

In this section, we trained the sentiment model and exported it as an artifact for the next steps. Let’s continue with model deployment and serving in the next section (to be continued in Part 2).

Note: In this Part 1 of the blog post, I covered the end-to-end scenario of data preparation and training. Part 2 is underway and will be published soon. Stay tuned!
