Author's profile photo Former Member

How to deploy a Natural Language Processing model with TensorFlow on SAP Data Intelligence

Recent advances in deep learning have dramatically improved the methods by which data scientists can create value from large amounts of unstructured text. Nowhere is that more evident than in Natural Language Processing (NLP) and, in particular, sentiment analysis. In this blog post, I will walk you through how to use TensorFlow’s Python API to construct a (relatively) simple deep learning model for sentiment analysis of text and implement it within SAP Data Intelligence. I would like to thank Andreas Forster, Frank Schuler, and Karim Mohraz, whose blog posts gave me insight and knowledge that I used to build this blog post. This post follows a recent Text Analysis blog post by Thorsten Hapke, https://blogs.sap.com/2020/10/14/text-analysis-with-sap-data-intelligence/, in which he uses a pre-built deep learning model along with spaCy to extract sentiment and morphological (Named Entity Recognition) information from news articles for visualizations. This blog post, however, will show you how to build the deep learning model for sentiment analysis from the ground up, in case you have your own data on which you want to train a classifier.

I will assume that you have:

  1. A working instance of SAP Data Intelligence
  2. Basic Python programming skills.
  3. Customer Review Data (This can be literally any dataset from any datasource such as Kaggle where you have both a column of text reviews and another column of ratings.)
  • Please note that Deep Learning works best when there is a lot of data as in thousands of ratings. If you try to build even a basic deep learning model with a small amount of data, you may see sub-optimal results.

Now then, let us begin by opening up a Jupyter Notebook in a newly created Scenario in the Machine Learning Scenario Manager interface. We start by importing the necessary libraries:

import numpy as np
import pandas as pd
import tensorflow as tf
import h5py
import re
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from tensorflow.keras import Input, Model
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.layers import Flatten, LSTM, Dense, Embedding, Bidirectional
from ipywidgets import IntProgress
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer
from spacy.lang.en import English
from spacy.lang.en.stop_words import STOP_WORDS
from yellowbrick.text import FreqDistVisualizer
nlp = English()
%matplotlib inline

 

Then, we shall be using a small dataset in the form of a CSV file that contains 11200 product reviews. However, if you wish, you could use any CSV file so long as it can be imported into Pandas with two columns, text and sentiment/review.

 

df = pd.read_csv('data/08252020.csv')

 

Now, it is always good to perform some exploratory data analysis on your data in order to understand what you are working with. However, for the sake of brevity we shall not perform too much, as I wish to keep this blog post as simple as possible. First, it is good to know what columns you are dealing with, as well as the rating scale on which the sentiment will be assessed:

df.columns

 

Index(['Text', 'Ratings', 'Cetegories', 'Brand', 'Manufacturer', 'Date', 'City', 'Province'], dtype='object')

 

sns.countplot(x=df.Ratings)

 

 

As one can see, we have an unfortunate distribution of ratings in which reviews that received a 4 or 5 make up a clear majority. We shall look at steps to mitigate this imbalance soon. Now, for our binary classifier, we need to segment these ratings into positive and negative values, so we shall convert reviews rated 3 or higher to the positive class and reviews rated below 3 to the negative class. The labels for positive and negative reviews will be 1 and 0, respectively.

 

def to_sentiment(rating):
    rating = int(rating)
    if rating <= 2:
        return 0
    else:
        return 1
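
As a quick sanity check, the sketch below repeats the function above and applies it to a few sample ratings. The sample values are made up for illustration and are not taken from the dataset. Note that `int()` truncates, so a hypothetical fractional rating such as 2.9 lands in the negative class, and a missing rating (NaN) raises an error and would need to be dropped or filled beforehand.

```python
import math

def to_sentiment(rating):
    # Same rule as in the article: ratings of 2 or lower are negative (0), the rest positive (1)
    rating = int(rating)
    if rating <= 2:
        return 0
    else:
        return 1

# Hypothetical sample ratings, not taken from the dataset
samples = [1, 2, 2.9, 3, 4.0, 5]
labels = [to_sentiment(r) for r in samples]
print(list(zip(samples, labels)))

# A missing value cannot be converted by int()
try:
    to_sentiment(math.nan)
    nan_ok = True
except ValueError:
    nan_ok = False
print("NaN handled:", nan_ok)
```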

 

Next, we shall apply it towards our data and see the distribution afterwards.

 

df['Sentiment'] = df['Ratings'].apply(to_sentiment)
class_names = ['negative', 'positive']

ax = sns.countplot(x=df.Sentiment)
plt.xlabel('Review Sentiment')
ax.set_xticklabels(class_names)

 

 

Unfortunately, the segmentation has not solved the problem. We are left with three choices: first, we can downsample, throwing out positively labelled reviews until both classes balance, at the cost of losing insightful data; second, we can upsample by making copies of the minority class; or third, we can add class weights that penalize the model heavily for misclassifying the minority class. In this blog post, we shall take the third option so as to retain as much information as possible. Next, we should preprocess the data. Text typically contains a multitude of non-alphanumeric characters, such as apostrophes or exclamation marks, which can present issues to a machine learning model. We shall apply a function that removes such characters, as well as extra spaces, so that only lowercase letters and single spaces remain.

 

def preprocess(text):
    '''Lowercase the text and keep only the letters a-z separated by single spaces.'''
    # Convert to lower case; the text stays a single string
    text = str(text).lower()

    # Clean the text
    text = re.sub(r"<br />", " ", text)   # drop HTML line breaks
    text = re.sub(r"[^a-z]", " ", text)   # replace anything that is not a-z
    text = re.sub(r"\s+", " ", text)      # collapse runs of whitespace

    # Return the cleaned string
    return text.strip()

def remove_stopwords(text):
    string = nlp(text)
    tokens = []
    clean_text = []
    for word in string:
        tokens.append(word.text)
    for token in tokens:
        idx = nlp.vocab[token]
        if idx.is_stop is False:
            clean_text.append(token)
    return ' '.join(clean_text)

df['Text'] = df['Text'].apply(preprocess)
df['Text'] = df['Text'].apply(remove_stopwords)
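
To make the effect of these two steps concrete, here is a minimal sketch that uses only the standard library. The names `clean_text` and `drop_stop_words`, the tiny stop-word set and the sample review are all assumptions made for illustration; the article itself relies on spaCy's much larger English stop-word list.

```python
import re

# Tiny illustrative stop-word set; the article uses spaCy's much larger English list
STOP_WORDS = {"the", "a", "and", "is", "it", "was", "i"}

def clean_text(text):
    """Lowercase, drop HTML line breaks, keep only a-z, collapse whitespace."""
    text = str(text).lower()
    text = re.sub(r"<br />", " ", text)
    text = re.sub(r"[^a-z]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def drop_stop_words(text):
    """Remove every token that appears in STOP_WORDS."""
    return " ".join(tok for tok in text.split() if tok not in STOP_WORDS)

# Hypothetical review text
review = "I LOVE it!<br />The screen is great, and the SSD was fast."
cleaned = clean_text(review)
print(cleaned)
print(drop_stop_words(cleaned))
```

Keep in mind that stripping everything except a-z also removes digits, so model numbers or sizes such as "512" disappear from the text.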

 

Now that we have preprocessed the text, it would be a good idea to look at the distribution of words within our corpus. That is, we should see which words are the most common in our data so that we can better understand what we are working with. We shall therefore make a simple graph that shows us these keywords.

 

def word_distribution(text):
    
    vectorizer = CountVectorizer()
    docs = vectorizer.fit_transform(text)
    features = vectorizer.get_feature_names_out()  # on scikit-learn < 1.0 use get_feature_names()

    visualizer = FreqDistVisualizer(features=features, orient='v')
    visualizer.fit(docs)
    visualizer.show()

word_distribution(df['Text'])

 

As can be inferred from the graph, the dataset revolves around laptops and the features associated with them, such as SSD and screen. Now we shall split the data into train and test sets to be fed into the model.

 

X_train, X_test, y_train, y_test = train_test_split(df['Text'], df['Sentiment'], test_size=0.3)

 

Then, we shall tokenize the text using the TensorFlow library’s built-in tokenizer. A tokenizer converts a sentence into a list of separate words. For example, “I want to go to eat Italian food” would be converted to ['I', 'want', 'to', 'go', 'to', 'eat', 'Italian', 'food'] (the Keras tokenizer additionally lowercases the words by default). This serves two purposes: first, it is the necessary format for TensorFlow to accept the text as input for models; second, the tokenizer creates an index in the form of a Python dictionary in the background, whereby each word has its own unique number.

 

tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_train)
word_index = tokenizer.word_index # The word index needs to be used for later
vocab_size = len(word_index) + 1 # We need the vocab size as an argument later	
max_len = 250
embedding_dim = 200
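
To illustrate what the word index looks like, the following is a simplified pure-Python sketch of the idea, not Keras' actual implementation. The function names and sample sentences are hypothetical. Like the Keras tokenizer, it numbers words from 1 with the most frequent word first, which is why `vocab_size` is the length of the index plus one: index 0 is left free for padding.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to an integer, most frequent word first, starting at 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # most_common() sorts by count; ties keep first-seen order
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index):
    """Replace every known word by its index; unknown words are skipped."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]

# Hypothetical training sentences
train_texts = ["great screen great battery", "battery died fast"]
word_index = build_word_index(train_texts)
vocab_size = len(word_index) + 1  # index 0 is reserved for padding

print(word_index)
print(texts_to_sequences(["great battery but noisy fan"], word_index))
```

Words that were never seen during fitting are silently dropped here, which matches a tokenizer created without an `oov_token`. When an `oov_token` is supplied, unknown words are mapped to that token's index instead.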

 

If you recall, we have an unfortunately imbalanced dataset in which the minority class (negative reviews in this example) is dwarfed by the majority class (positive reviews). If I were to feed the data into a machine learning model as is, the model would learn that simply guessing that every review is positive achieves a high accuracy. For that reason, one strategy, which we shall implement here, is to give the model class weights, which alter the loss function by assigning more weight to the minority class. This increases the loss incurred when the minority class is mislabeled and therefore pushes the model to learn to separate both classes. Fortunately for us, TensorFlow takes this as a single parameter, leaving us only to work out how much weight to give to each label.

 

weight_for_0 = (1 / df['Sentiment'].value_counts()[0])*(len(df))/2.0 
weight_for_1 = (1 / df['Sentiment'].value_counts()[1])*(len(df))/2.0

class_weights = {0: weight_for_0, 1: weight_for_1}
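
The formula above can be checked with a small worked example. The class counts below are hypothetical (only the total of 11200 reviews comes from this post), and `balanced_class_weights` is a helper name introduced for illustration. The point of the sketch is that, after weighting, both classes contribute the same total weight to the loss.

```python
def balanced_class_weights(counts):
    """weight_c = (1 / n_c) * (N / 2) for a two-class problem, as in the article."""
    total = sum(counts.values())
    return {label: (1 / n) * total / 2.0 for label, n in counts.items()}

# Hypothetical class counts; only the total of 11200 reviews comes from the article
counts = {0: 1200, 1: 10000}
class_weights = balanced_class_weights(counts)
print(class_weights)

# Each class now contributes the same total weight to the loss
contributions = {label: counts[label] * w for label, w in class_weights.items()}
print(contributions)
```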

 

Now, remember that computers do not understand how to read human language in text form as we do. Because of that, we will convert each review in our text body to numeric form whereby each word is replaced with its unique value in the word index which the tokenizer made for us.

 

X_train = tokenizer.texts_to_sequences(X_train)
X_test = tokenizer.texts_to_sequences(X_test)

 

Then, we ‘pad’ the sequences: we cut off each sequence at a threshold that we decide (which is why it is useful to look at the distribution of review lengths) and fill up with zeros any sequence shorter than the threshold, so that all sequences have exactly the same length.

X_train_pad = pad_sequences(X_train, padding='post', maxlen=max_len)
X_test_pad = pad_sequences(X_test, padding='post', maxlen=max_len)
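
As a rough illustration of what this call does, here is a pure-Python sketch with a hypothetical helper `pad_post`. It mimics `padding='post'` together with the Keras default `truncating='pre'`, which means that a sequence longer than the threshold keeps its last tokens and loses its beginning; passing `truncating='post'` to `pad_sequences` would cut the end instead.

```python
def pad_post(sequences, maxlen, value=0):
    """Mimic pad_sequences(..., padding='post') with Keras' default truncating='pre'."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                           # too long: keep the last maxlen items
        padded.append(seq + [value] * (maxlen - len(seq)))  # too short: append zeros
    return padded

# Hypothetical tokenized reviews
sequences = [[4, 9, 2], [7, 1, 5, 3, 8, 6]]
print(pad_post(sequences, maxlen=4))
```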

 

Now, we build the deep learning model for our specific task. Please be aware that with deep learning, there is no one-size-fits-all approach. Every single project will require a different model with different parameters fit for its specific task. What I am writing here is a generic example that I found works well with the data I have. This may change very easily for any different data.

 

inputs = tf.keras.Input(shape=(max_len,))
embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim,
                                      input_length=max_len,
                                      trainable=False)(inputs)
# return_sequences=False keeps only the last output of each direction,
# so the model emits a single prediction per review
logits = Bidirectional(LSTM(128, dropout=.2, return_sequences=False, recurrent_dropout=.2,
                            kernel_regularizer=tf.keras.regularizers.L2(0.01)))(embedding)
output = Dense(1, activation='sigmoid')(logits)

model = Model(inputs=inputs, outputs=output)
optimizer = Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.995)
model.compile(optimizer=optimizer,
              loss='binary_crossentropy', metrics=['acc'])
model.summary()

 

 

The last line of code should print out a nice overview of your Deep Learning model so that you can see the change of dimensions within it as well as the number of parameters each layer has. Now we fit the model on the training data and give it the validation data so as to evaluate it. We put it in a variable called history so as to visualize the change in accuracy after.

 

history = model.fit(X_train_pad, y_train, batch_size=34, epochs=5, verbose=1, 
                    validation_data=(X_test_pad, y_test), class_weight=class_weights)

 

Running the code will cause the model to train for a predetermined number of epochs on the training data and visualize the predefined metrics per epoch as:

 

 

This is great, as we see that the model has achieved a final accuracy of about 87% on unseen data. You could of course change the hyperparameters, run it for more epochs or get more data to see if you can increase the accuracy. However, to keep this blog post as straightforward as possible, I shall leave it up to you to change it to your liking. It is always a good idea to check how your metrics progressed throughout training, and we will do so using the following code:

 

plt.plot(history.history['acc'])
plt.plot(history.history['val_acc'])
plt.title('Model Accuracy over time')
plt.ylabel('Accuracy')
plt.xlabel('Epochs')
plt.legend(['train', 'test'], loc='upper left')
plt.show()
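
Besides plotting, `history.history` can be inspected directly, since it is a plain dictionary that maps each metric name to a list with one value per epoch. The sketch below uses invented numbers shaped like such a dictionary (they are not the results of this post) and a hypothetical helper `best_epoch` to find the epoch with the highest validation accuracy.

```python
# Hypothetical values shaped like history.history after five epochs;
# these numbers are invented for illustration and are not the article's results
history_dict = {
    "acc":     [0.71, 0.79, 0.84, 0.88, 0.91],
    "val_acc": [0.74, 0.81, 0.85, 0.87, 0.86],
}

def best_epoch(history_dict, metric="val_acc"):
    """Return the 1-based epoch with the highest value of the metric."""
    values = history_dict[metric]
    return max(range(len(values)), key=values.__getitem__) + 1

epoch = best_epoch(history_dict)
gap = history_dict["acc"][-1] - history_dict["val_acc"][-1]
print(f"best validation accuracy at epoch {epoch}")
print(f"final train/validation gap: {gap:.2f}")
```

A training accuracy that keeps rising while the validation accuracy stalls or drops, as in these made-up numbers, is a common sign of overfitting.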

 

 

While a confusion matrix may look confusing, no pun intended, it is quite simple to read once you see that the bottom right square stands for the number of positive examples which were correctly labelled as positive, whilst the top left square shows how many of the negative examples were classified correctly. As you can see, our model was able to classify both labels extremely well, which means that our weights successfully forced it to distinguish between the classes.
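
The code that produced the confusion matrix is not shown in this post, so here is a minimal pure-Python sketch of how such a matrix is assembled from sigmoid outputs. The labels, probabilities and helper names are hypothetical. The layout (rows are true labels, columns are predicted labels) is the same one that `sklearn.metrics.confusion_matrix`, imported at the top of the notebook, uses.

```python
def confusion_matrix_2x2(y_true, y_pred):
    """Rows are the true labels, columns the predicted labels (0 = negative, 1 = positive)."""
    matrix = [[0, 0], [0, 0]]
    for true, pred in zip(y_true, y_pred):
        matrix[true][pred] += 1
    return matrix

def to_labels(probabilities, threshold=0.5):
    """Turn sigmoid outputs into hard 0/1 labels."""
    return [1 if p >= threshold else 0 for p in probabilities]

# Hypothetical labels and model outputs, for illustration only
y_true = [0, 0, 1, 1, 1, 1]
y_prob = [0.2, 0.7, 0.9, 0.6, 0.4, 0.8]

matrix = confusion_matrix_2x2(y_true, to_labels(y_prob))
(tn, fp), (fn, tp) = matrix
print(matrix)
print("negatives recalled:", tn / (tn + fp), "positives recalled:", tp / (tp + fn))
```

With an imbalanced dataset, the per-class recall printed at the end is usually more telling than the overall accuracy.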

Great! We have now built a deep learning model and achieved a good accuracy score of about 87%. Now, we must implement it in pipeline form to have it ready for future inference. In the Machine Learning Scenario Manager interface, scroll down to the pipeline section, click on the create button and choose Python Producer Pipeline as the template for a new pipeline; you may name it whatever you wish. You will then be brought to the modeler page, where a pre-built Python producer graph will be loaded, which will look like:

 

 

First, we must specify in the Read File operator which file we wish to train our model on. Within the configuration, you must therefore give the path to the file according to where you stored it (e.g. an S3 bucket, a HANA database and so forth), as such:

 

In addition to inputting our Python script, the usage of an NLP model requires an additional infrastructure modification. Remember that we had previously created a word index using the TensorFlow tokenizer. This word index is necessary for the model to not only train on data but also to predict future data for inference when deployed. As such, we have to add an additional artifact producer operator to the graph. To do so, first find the artifact producer operator in the operator section on the left. Once you double click on it, it should show up on your screen. Additionally, within the configurations for the Artifact Producer, you must specify a name to be used later for inference. For this blog post, the name shall be wordindex and you should set the Artifact Kind to “Dataset” and the Filename suffix to .CSV as such:

 

 

Now, in order to integrate it, we must first right click on the Python operator on the left-hand side and add an output port named wordindex with blob set as its datatype.

 

Then we can connect the two operators by dragging a connection from the wordindex output port we just made to the “inArtifact” input port for the newly created Artifact Producer. This should look like:

 

 

Now, we will add our Python script which will execute the model we created in our Jupyter Notebook instance but within the pipeline. To do so, we must place the equivalent of a Python script version of our notebook, i.e. without visualizations and with api commands to receive and send inputs and outputs from within SAP Data Intelligence. I have already done so for our code and you need to replace the pre-built Python script that is already in the Python operator with this:

import numpy as np
import pandas as pd
import tensorflow as tf
import h5py
import re
import logging
import json
import io
import spacy
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from tensorflow.keras import Input, Model
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.layers import Flatten, LSTM, Dense, Embedding, Bidirectional
from ipywidgets import IntProgress
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from spacy.lang.en import English
from spacy.lang.en.stop_words import STOP_WORDS
nlp = English()

logging.basicConfig(level=logging.DEBUG, format='%(asctime)s - %(levelname)s - %(message)s')


def remove_stopwords(text):
    '''Remove all stopwords such as a, the, and, etc. '''
    string = nlp(text)
    tokens = []
    clean_text = []
    for word in string:
        tokens.append(word.text)
    for token in tokens:
        idx = nlp.vocab[token]
        if idx.is_stop is False:
            clean_text.append(token)
    return ' '.join(clean_text)
    

def to_sentiment(rating):
    rating = int(rating)
    if rating <= 2:
        return 0
    else:
        return 1
        
    
def preprocess(text):
    '''Lowercase the text and keep only the letters a-z separated by single spaces'''

    # Convert to lower case; the text stays a single string
    text = str(text).lower()

    # Clean the text
    text = re.sub(r"<br />", " ", text)   # drop HTML line breaks
    text = re.sub(r"[^a-z]", " ", text)   # replace anything that is not a-z
    text = re.sub(r"\s+", " ", text)      # collapse runs of whitespace

    # Return the cleaned string
    return text.strip()

    
def on_input(data):

    
    api.logger.info("Successfully Started.")
    df = pd.read_csv(io.StringIO(str(data)), engine='python', encoding='utf-8')
    api.logger.info("CSV file imported and read from S3 bucket")
    df = pd.concat([df['reviews.text'], df['reviews.rating']], axis=1, keys=['Text', 'Ratings'])
    api.logger.info('Pandas file successfully completed.')
    # df.dropna(axis=1, inplace=True)
    
    df['Sentiment'] = df['Ratings'].apply(to_sentiment)
    df['Text'] = df['Text'].apply(preprocess)
    df['Text'] = df['Text'].apply(remove_stopwords)

        
    X_train, X_test, y_train, y_test = train_test_split(df['Text'],
                                                        df['Sentiment'],
                                                        test_size=0.3)
    
    X_train = np.array(X_train.values.tolist())
    X_test = np.array(X_test.values.tolist())
    y_train = np.array(y_train.values.tolist())
    y_test = np.array(y_test.values.tolist())
    
    
    tokenizer = Tokenizer(oov_token='<OOV>')
    tokenizer.fit_on_texts(X_train)
    api.logger.info('Tokenizer fit onto words!')
    word_index = tokenizer.word_index
    api.logger.info('Index created.')
    vocab_size = len(word_index) + 1
    
    embedding_dim = 200
    max_length = 250
    pad_type = 'post'
    
    
    weight_for_0 = (1 / df['Sentiment'].value_counts()[0])*(len(df))/2.0 
    weight_for_1 = (1 / df['Sentiment'].value_counts()[1])*(len(df))/2.0

    class_weights = {0: weight_for_0, 1: weight_for_1}
    
    X_train = tokenizer.texts_to_sequences(X_train)
    X_test = tokenizer.texts_to_sequences(X_test)
    api.logger.info('texts successfully tokenized')
    
    X_train_pad = pad_sequences(X_train, padding='post', maxlen=max_length)
    X_test_pad = pad_sequences(X_test, padding='post', maxlen=max_length)
    
    
    inputs = tf.keras.Input(shape=(max_length,))
    embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim,
                                          input_length=max_length, mask_zero=True, # The mask is needed to tell the model that the zeros are padding
                                          trainable=False)(inputs)
    logits = Bidirectional(LSTM(128, dropout=.2, return_sequences=False, recurrent_dropout=.2,
                                kernel_regularizer=tf.keras.regularizers.L2(0.01)))(embedding)
    output = Dense(1, activation='sigmoid')(logits)
    
    model = Model(inputs=inputs, outputs=output)
    optimizer = Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.995)
    model.compile(optimizer=optimizer,
                  loss='binary_crossentropy', metrics=['acc'])
    api.logger.info('Model Successfully built')
    
    model.fit(X_train_pad, y_train, batch_size=34, epochs=5, verbose=1, 
                    validation_data=(X_test_pad, y_test), class_weight=class_weights)
    api.logger.info('Model Successfully Trained')

    loss, acc = model.evaluate(X_test_pad, y_test, steps=34)
    
    # to send metrics to the Submit Metrics operator, create a Python dictionary of key-value pairs
    metrics_dict = {"loss": str(loss), "accuracy": str(acc)}
    
    # send the metrics to the output port - Submit Metrics operator will use this to persist the metrics 
    api.send("metrics", api.Message(metrics_dict))
    
    
    word_index = {y:x for x, y in word_index.items()}
    dict_df = pd.DataFrame.from_dict(word_index, orient='index', columns=['word'])
    api.logger.info(dict_df)
    word_index_file = dict_df.to_csv(index=False)
    api.send('wordindex', word_index_file)

    # mode 'w' is passed explicitly because h5py 3.x opens files read-only by default
    with h5py.File('trained_model', 'w', driver='core', backing_store=False) as model_h5:
        model.save(model_h5)
        model_h5.flush() 
        blob = model_h5.id.get_file_image()
        api.send("modelBlob", blob)
        
api.set_port_callback("input", on_input)

In SAP Data Intelligence, you must group your Python Operators with Docker Images so that the required libraries are imported. Right click on the Python Operator and click on group. This should place it within a group for which you will then define the tags that allow SAP Data Intelligence to find the Docker Image you wish to use. For an overview on how to create dockerfiles in case you don’t have too much experience with them, please see this tutorial, https://blogs.sap.com/2019/12/13/some-notes-on-docker-file-creation-on-sap-data-intelligence/, by Thorsten Hapke.

As of the 2010 release of SAP Data Intelligence, each Artifact Producer operator must also be paired with a Write File operator, which writes the file on behalf of the Artifact Producer. Make sure to connect the output port “OutFileSend” to the input port of the Write File operator, and then connect the output port of the Write File operator to the input port “InFile” of the Artifact Producer operator.

 

 

Now, we must connect the newly created Artifact Producer to the second Python operator which is located on the right-hand side. To do this, we must create a new input port for the second Python operator called wordindex with the datatype of string as such:

 

 

Afterwards, we merely drag a connection from the outArtifact port in the Artifact Producer operator to the wordindex input port that we just created. You will be given a small screen which says that you must choose a toString operator to connect them both. Click on the toString operator and you should then see that the two operators are now connected by it as such:

 

 

Now, we should change the script within the second Python operator to take this extra connection into account. To do so, replace the script in the second Python operator with this:

 

# When all input port signals arrive, the Artifact Producers & Submit Metrics have completed - safe to terminate the graph.

def on_inputs_ready(metrics_resp, artifact_id, wordindex):
    # all three input ports have data - previous operators have completed. Send a message as output to stop the graph
    api.send("output", api.Message("Operators complete."))

api.set_port_callback(["metricsResponse", "artifactId", "wordindex"], on_inputs_ready)

 

This merely checks that there is output from both artifact producers as well as from the metrics operator before it terminates the graph. You should now have a complete graph which looks like:

 

 

Now that we have finished building our graph, we can save it and go back to the Machine Learning Scenario Manager interface. Go back down to the pipeline section, click on the pipeline that we have just created and, when it is highlighted, click on execute. The time this execution takes will depend on how long the model needs to train, which in turn depends on many factors such as the amount of data you have as well as the computational complexity of your model. If you have followed the steps exactly as stated, you should see that the execution completed successfully.

 

 

Great, we have now completed the most difficult part of this procedure. We must now build a different pipeline for inference so that we can use the model for new unseen data that we want to classify. Go back to the interface in the machine learning scenario manager and create another pipeline but choose the Python Consumer as the template this time. I’ll name mine Sentiment Inference for simplicity. You will be faced with a screen showing:

 

 

Now, just like the training pipeline, we are going to have to modify it slightly to take into account the word index artifact that we had created. TensorFlow will need the word index in order to classify new incoming data. To do this we will have to duplicate the first part of the graph. First, we need to create a new Constant Generator Operator. You can find it in the operator section and then slightly change the content to:

 

 

Then, wire up the new branch as follows:

  1. Create a toMessage Operator and drag a connection from the new Constant Generator to its inString input port.
  2. Create a new Artifact Consumer operator and drag a connection from the output port of the toMessage Operator to the input port of the Artifact Consumer.
  3. Create a toFile operator and connect the outArtifact output port of the Artifact Consumer to the “in” input port of the toFile Operator.
  4. Create a Read File Operator and connect the output of the toFile Operator to the input port of this Read File operator.
  5. Create a toBlob Operator and connect the File output port of the Read File Operator to the input port of the toBlob Operator.
  6. Create a new input port in the Python Operator with the blob datatype and call it wordindex.
  7. Drag a connection from the toBlob Operator’s output port to the Python Operator’s new wordindex input port.

Hopefully, I haven’t lost you yet with all these conversions. Your graph should now look like:

 

 

Now, replace the script within the Python Operator with:

import io
import h5py
import json
import tensorflow as tf
import pandas as pd
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras import Model
    

# Global vars to keep track of model status
model = None
model_ready = False
word_index = None

# Validate input data is JSON
def is_json(data):
  try:
    json_object = json.loads(data)
  except ValueError as e:
    return False
  return True

# When Model Blob reaches the input port
def on_model(model_blob):
    global model
    global model_ready

    model = model_blob
    model_ready = True
    api.logger.info("Model Received & Ready")


def on_word_index(word_index_blob):
    global word_index
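    word_index_ = pd.read_csv(io.StringIO(str(word_index_blob,'utf-8')))
    word_index = word_index_.to_dict()['word']
    # The training pipeline wrote the CSV without its index column, so row 0 holds
    # the word with tokenizer index 1; add 1 to restore the original indices.
    word_index = {y: x + 1 for x, y in word_index.items()}

    api.logger.info(word_index)
    if word_index is not None: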
        api.logger.info("Word index successfully loaded.")
        api.logger.info(word_index)
    else:
        api.logger.info('Word Index not successfully loaded.')
        api.logger.info(word_index_blob)
    
    
# Client POST request received
def on_input(msg):
    global model
    global word_index
    error_message = ""
    success = False
    prediction = None

        
    def idx_fetch(text, word_index):
        list_ = []
        api.logger.info(word_index)
        for word in text.split():
            api.logger.info(word)
            try:
                list_.append(word_index[word.lower()])
            except:
                list_.append(word_index['<OOV>'])
        return list_
    
    def get_prediction(model, text, word_index):
        word_list = idx_fetch(text, word_index)
        x = ([word_list])
        # pad at the end, matching padding='post' used during training
        sequence = tf.keras.preprocessing.sequence.pad_sequences(x, maxlen=250, padding='post')
        sentiment = model.predict(sequence)[0]
        if sentiment >= 0.5:
            return (sentiment.item(), 1)
        else:
            return (sentiment.item(), 0)
            
    
    try:

        if model_ready:
            api.logger.info("Model Ready")

            user_data = msg.body.decode('utf-8')
            # Received message from client, verify json data is valid
            if is_json(user_data):

                # apply your model
                with io.BytesIO(model) as blob:
                    f = h5py.File(blob, 'r')
                    nnmodel = tf.keras.models.load_model(f)
                    api.logger.info('Model successfully loaded.')
                    
                # obtain your results
                text = json.loads(user_data)["text"]
                api.logger.info('Data successfully partitioned.')
                prediction = get_prediction(nnmodel, text, word_index)
                api.logger.info('Prediction was successful')
                success = True
            else:
                api.logger.info("Invalid JSON received from client - cannot apply model.")
                error_message = "Invalid JSON provided in request: " + user_data
                success = False
        else:
            error_message = "Model has not yet reached the input port - try again."
    except Exception as e:
        error_message = 'An Error occurred: ' + str(e)
    
    if success:
        # apply carried out successfully, send a response to the user
        msg.body = json.dumps({'Sentiment Prediction': str(prediction)})
    else:
        msg.body = json.dumps({'Error': error_message})
    
    new_attributes = {'message.request.id': msg.attributes['message.request.id']}
    msg.attributes =  new_attributes
    api.send('output', msg)
    
api.set_port_callback("model", on_model)
api.set_port_callback("wordindex", on_word_index)
api.set_port_callback("input", on_input)
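
One detail of this hand-over deserves a closer look: the training script sends the word index as a CSV that contains only the word column, so on the inference side the numeric indices have to be rebuilt from the row order. The sketch below illustrates that round trip using only the standard library `csv` module (the scripts in this post use pandas); all function names and the tiny vocabulary are hypothetical. Because the tokenizer numbers words from 1 and 0 is the padding value, the row position needs 1 added to recover the index used during training.

```python
import csv
import io

def word_index_to_csv(word_index):
    """Write one word per row, ordered by index, under a 'word' header (no index column)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["word"])
    for word, _ in sorted(word_index.items(), key=lambda item: item[1]):
        writer.writerow([word])
    return buffer.getvalue()

def word_index_from_csv(csv_text):
    """Rebuild the mapping; row 0 holds index 1 because index 0 is the padding value."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return {row["word"]: position + 1 for position, row in enumerate(rows)}

def encode(text, word_index, oov_token="<OOV>"):
    """Look up each lowercased word, falling back to the out-of-vocabulary index."""
    return [word_index.get(word.lower(), word_index[oov_token]) for word in text.split()]

# Hypothetical vocabulary shaped like tokenizer.word_index
original = {"<OOV>": 1, "great": 2, "laptop": 3, "slow": 4}
restored = word_index_from_csv(word_index_to_csv(original))
print(restored == original)
print(encode("Great laptop honestly", restored))
```

It is also worth noting that the lookup only lowercases and splits on whitespace, so a word with attached punctuation such as "fast!" falls back to the out-of-vocabulary index unless the same cleaning as in training is applied first.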

 

Afterwards, group the Python Operator with the Dockerfile and tags necessary for the relevant packages. Once you have completed these steps, save your graph and go back to the Machine Learning Scenario Manager interface. Go down to the pipelines section, click on your new pipeline and click the deploy button. You will be brought to a screen asking for a configuration description; this is optional, so just click on next. You will be brought to another screen asking if you wish to reuse previous configurations. Press the next button here as well, and you will be brought to the final step before deployment, where you are asked to specify the artifacts to be used in the inference pipeline. Here, you should click on the drop-down menu and choose the model and word index which were created in the training pipeline.

 

 

Once you have done so, click on save. After a bit of processing, your model will be deployed, and you should see a screen of it being run as such:

 

Now, since we will be using POSTMAN to issue the POST request that has the model classify the data we send it via JSON, we will need the deployment URL, which is available at the top of the same screen. Copy it and head over to the POSTMAN interface. At this point, I will refer you to Andreas Forster’s blog post, https://blogs.sap.com/2019/08/14/sap-data-intelligence-create-your-first-ml-scenario/, on deploying a machine learning model with SAP Data Intelligence, as he covers exactly what steps you need to take to get POSTMAN working. Once you have put in the necessary parameters, you merely need to write whatever text you want to have classified and click on send to receive both the classification (positive or negative, in binary form) and the degree of polarity (a float between 0 and 1 indicating how positive or negative the text is) from the deployed model. Now, as the dataset I used consisted of reviews of laptops and computers, I have written:

{
    "text": "I thought the computer was amazing and the processor was so fast!"
}
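
If you would rather call the service from code than from POSTMAN, the request body and the reply can be handled as sketched below. This is a minimal sketch under stated assumptions: the helper names are hypothetical, the HTTP call itself (URL, authentication, headers) is left out, and the example reply is constructed by hand to match the shape the inference script produces, where the (score, label) tuple is sent as a string under the 'Sentiment Prediction' key.

```python
import ast
import json

def build_request_body(text):
    """Serialize the review into the JSON shape the inference script reads ('text' key)."""
    return json.dumps({"text": text})

def parse_response(body):
    """Extract (score, label) from the reply; the script sends the tuple as a string."""
    payload = json.loads(body)
    if "Error" in payload:
        raise RuntimeError(payload["Error"])
    score, label = ast.literal_eval(payload["Sentiment Prediction"])
    return float(score), int(label)

request_body = build_request_body("I thought the computer was amazing and the processor was so fast!")
print(request_body)

# Hypothetical reply, shaped like the output of the inference script above
example_reply = json.dumps({"Sentiment Prediction": str((0.999, 1))})
score, label = parse_response(example_reply)
print(score, "positive" if label == 1 else "negative")
```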

 

 

Unsurprisingly, I receive a degree of sentiment of .999, which is extremely positive as it is nearly 1, as well as the classification of 1, meaning the text is positive. Please remember that deep learning models in general perform best on data that resembles the data on which they were trained; i.e. with the data I used, a review of a laptop may get classified well, but a Tweet talking about a new song may yield poor results. Congratulations, we have now built a basic Natural Language Processing model with TensorFlow for sentiment analysis on SAP Data Intelligence. I have tried to keep things as simple as possible so that you learn the basic infrastructure needed to develop NLP models on SAP Data Intelligence. What I have described here can be modified extensively to fit most tasks involving text and deep learning, and it is up to you what you wish to do with it. Nothing is impossible.

 

Furthermore, I have written another blogpost detailing what values one could derive from sentiment analysis models such as this one which showcases the practical application of such NLP models here, https://blogs.sap.com/2020/12/22/__trashed-31/.
