Deep Learning using SAP Leonardo ML Foundation: Te...

former_member517463 · ‎12-20-2018

This is the second article in a series on Deep Learning and how to use SAP Leonardo ML Foundation for it. The series covers the complete process of a Deep Learning project, from data preparation to prediction.

If you missed the first article, which was on Image Classification using SAP Leonardo ML foundation, find it here.

As the title says, this article covers another very popular application of Deep Learning: text classification. With so much textual data being produced, sorting it by hand does not scale, so automatic text classification is necessary. Major industry examples include article tagging, news article classification, and sentiment analysis.

Problem statement: Make a deep learning model to classify text into various categories.

Dataset: Any text classification dataset will do. In this article, we will use the well-known 20 Newsgroups dataset. It contains news articles and the category each one belongs to. There are 20 different categories, with 11,314 samples for training and 7,532 samples for testing.

Technological stack:

Platform: Train Your Own Model functionality of SAP Leonardo ML foundation.

Deep Learning library: Keras.

Programming language: Python 2

Let’s start the work! Major steps are:

Making model and training.

Making predictions using the trained model.

Making model and training.

The sklearn library can fetch the dataset for us (it is downloaded and cached on the first call), so we need not download it manually. Let's jump into coding.

Create a new file named training.py and add all of the following code to it:

import pandas as pd
from sklearn.datasets import fetch_20newsgroups

# Fetch the predefined training and test splits.
newsgroups_train = fetch_20newsgroups(subset='train')
newsgroups_test = fetch_20newsgroups(subset='test')

# One row per article: raw text plus integer category id.
train = pd.DataFrame()
train['article'] = newsgroups_train.data
train['category'] = newsgroups_train.target

test = pd.DataFrame()
test['article'] = newsgroups_test.data
test['category'] = newsgroups_test.target

The code above creates two data frames: train and test. In each, the first column contains the text of the article and the second column is the category (0 to 19) to which the article belongs.
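To make the numeric categories easier to read, it can help to map them back to newsgroup names and count how many articles fall into each. The sketch below uses a tiny hand-made list in place of the real data; the function name count_per_category and the toy values are assumptions made for illustration. With the real dataset you would pass newsgroups_train.target and newsgroups_train.target_names instead.

```python
from collections import Counter

def count_per_category(targets, target_names):
    """Return {category name: number of articles} for integer targets."""
    counts = Counter(targets)
    return {target_names[i]: counts[i] for i in sorted(counts)}

# Toy stand-in for the real targets and names.
# The ids used here are illustrative and do not match the real dataset's ids.
toy_names = ["alt.atheism", "comp.graphics", "sci.space"]
toy_targets = [0, 2, 2, 1, 2, 0]

print(count_per_category(toy_targets, toy_names))
```

A strongly unbalanced count would be a reason to look at per-category accuracy rather than overall accuracy alone.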

With the data ready, let's code the model.

First, the necessary imports.

import numpy as np

from keras.models import Model
from keras.layers import Dense, Embedding, Input, LSTM, Bidirectional, GlobalMaxPool1D, Dropout
from keras.callbacks import EarlyStopping, ModelCheckpoint
from keras.preprocessing import text, sequence

Some constants.

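max_features = 10000  # vocabulary size: only the most frequent words are kept
maxlen = 100          # every article is padded or truncated to 100 tokens
embed_size = 512      # dimension of each word embedding vector
batch_size = 64       # samples per gradient update
epochs = 100          # upper bound; early stopping may end training sooner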

Text tokenization.

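train_sentences = train['article'].values
test_sentences = test['article'].values
y = train['category'].values  # integer class ids, 0 to 19

# Build the vocabulary from the training articles only.
tokenizer = text.Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(list(train_sentences))

# Convert each article to a list of word indices.
train_tokenized = tokenizer.texts_to_sequences(train_sentences)
test_tokenized = tokenizer.texts_to_sequences(test_sentences)

# Pad or truncate to a fixed length.
# Note: from here on, train and test hold the padded arrays, not the data frames.
train = sequence.pad_sequences(train_tokenized, maxlen=maxlen)
test = sequence.pad_sequences(test_tokenized, maxlen=maxlen)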
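To see what the tokenization and padding steps do, here is a simplified pure-Python sketch of the same idea. It is not the Keras implementation: the toy_ functions are hypothetical stand-ins that only split on whitespace, whereas the Keras Tokenizer also filters punctuation. The sketch assumes the Keras defaults, namely that word indices start at 1 in order of frequency, that only words with an index below num_words are kept, and that pad_sequences pads and truncates at the front. It is written for Python 3.7 or later, where ties in word frequency keep first-seen order.

```python
from collections import Counter

def toy_fit_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for t in texts for word in t.lower().split())
    return {word: i + 1 for i, (word, _) in enumerate(counts.most_common())}

def toy_texts_to_sequences(texts, word_index, num_words):
    """Replace words by indices, dropping unknown and too-rare words."""
    return [
        [word_index[w] for w in t.lower().split()
         if w in word_index and word_index[w] < num_words]
        for t in texts
    ]

def toy_pad_sequences(seqs, maxlen):
    """Keep the last maxlen tokens and left-pad shorter sequences with 0."""
    padded = []
    for s in seqs:
        s = s[-maxlen:]
        padded.append([0] * (maxlen - len(s)) + s)
    return padded

toy_texts = ["the cat sat", "the dog sat on the mat"]
word_index = toy_fit_word_index(toy_texts)
seqs = toy_texts_to_sequences(toy_texts, word_index, num_words=5)
print(toy_pad_sequences(seqs, maxlen=4))
```

Index 0 is reserved for padding, which is why real word indices start at 1. Because truncation happens at the front, an article longer than maxlen keeps only its final 100 tokens in this article's setup.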

Define the model.

def get_model():
    inp = Input(shape=(maxlen, ))
    x = Embedding(max_features, embed_size)(inp)
    x = Bidirectional(LSTM(128, return_sequences=True))(x)
    x = GlobalMaxPool1D()(x)
    x = Dropout(0.2)(x)
    x = Dense(64, activation='relu')(x)
    x = Dropout(0.3)(x)
    x = Dense(20, activation='softmax')(x)  # one output per category

    model = Model(inputs=inp, outputs=x)
    # y holds integer class ids (0 to 19), not one-hot vectors,
    # so the sparse variant of the cross-entropy loss is required.
    model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

    return model
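As a rough sanity check on the size of this model, the number of trainable parameters can be worked out by hand. The sketch below assumes the standard per-layer formulas (an LSTM has four gates, each with input weights, recurrent weights, and a bias; the bidirectional wrapper doubles that); the helper names are illustrative. The total can be compared with what model.summary() reports.

```python
def embedding_params(vocab_size, embed_dim):
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias.
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    return input_dim * units + units

max_features, embed_size, lstm_units = 10000, 512, 128

layer_params = {
    "embedding": embedding_params(max_features, embed_size),
    # Forward and backward LSTMs each have their own weights.
    "bidirectional_lstm": 2 * lstm_params(embed_size, lstm_units),
    # Max pooling over time leaves 2 * 128 features per article.
    "dense_64": dense_params(2 * lstm_units, 64),
    "dense_20": dense_params(64, 20),
}
total_params = sum(layer_params.values())
print(layer_params, total_params)
```

Under these assumptions the embedding layer accounts for most of the parameters, so max_features and embed_size are the first values to reduce if memory or training time becomes a problem.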

Define some callbacks.

file_path = 'model.hdf5'
checkpoint = ModelCheckpoint(file_path, monitor='val_loss', verbose=1, save_best_only=True, mode='min')
early = EarlyStopping(monitor='val_loss', mode='min', patience=10)

callbacks = [checkpoint, early]
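The interplay of the two callbacks can be illustrated without Keras. The sketch below is a simplified simulation, not the library code: it assumes that a checkpoint is written whenever the validation loss reaches a new minimum, and that training stops once the loss has failed to improve for patience consecutive epochs. The function name and the loss values are made up for illustration.

```python
def simulate_callbacks(val_losses, patience):
    """Return (epoch at which training ends, epochs at which weights were saved)."""
    best = float("inf")
    wait = 0
    saved_epochs = []
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            # New best validation loss: save weights and reset the counter.
            best = loss
            wait = 0
            saved_epochs.append(epoch)
        else:
            wait += 1
            if wait >= patience:
                return epoch, saved_epochs
    # Ran through all epochs without triggering early stopping.
    return len(val_losses) - 1, saved_epochs

toy_losses = [1.0, 0.8, 0.9, 0.85, 0.7, 0.75, 0.8, 0.9]
print(simulate_callbacks(toy_losses, patience=3))
```

With patience=10 as in this article, training gets ten epochs to recover from a plateau, and the file model.hdf5 always holds the weights from the best epoch seen so far.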

All set to train the model. This will take some time.

model = get_model()

model.fit(train, y, batch_size=batch_size, epochs=epochs, validation_split=0.15, callbacks=callbacks)
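The labels y passed to fit are integer class ids rather than one-hot vectors. In Keras, categorical_crossentropy expects one-hot targets, while sparse_categorical_crossentropy accepts integer ids directly; for the same sample, both yield the same value. The pure-Python sketch below illustrates this for a single sample with three classes. The function bodies are simplified re-implementations, and the numbers are made up for illustration.

```python
import math

def one_hot(label, num_classes):
    return [1.0 if i == label else 0.0 for i in range(num_classes)]

def categorical_crossentropy(one_hot_target, probs):
    # Sum over all classes; only the true class contributes.
    return -sum(t * math.log(p) for t, p in zip(one_hot_target, probs))

def sparse_categorical_crossentropy(label, probs):
    # Look up the probability of the true class directly.
    return -math.log(probs[label])

probs = [0.7, 0.2, 0.1]  # predicted probabilities for one sample
label = 1                # true class id

dense_loss = categorical_crossentropy(one_hot(label, 3), probs)
sparse_loss = sparse_categorical_crossentropy(label, probs)
print(dense_loss, sparse_loss)
```

Using integer ids with the sparse loss avoids building an 11,314 x 20 one-hot matrix and keeps the label format identical to what the dataset provides.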

Predictions

At this point, the model is trained and the best weights are saved. Now it's time to see the trained model in action, i.e. to make predictions.

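model.load_weights(file_path)  # restore the best checkpoint saved during training

# Each row holds 20 class probabilities for one test article.
y_test = model.predict(test)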
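model.predict returns one row of 20 probabilities per article, so one more step is needed to obtain a category and to measure accuracy against the true test labels (available as newsgroups_test.target, since the test data frame was overwritten by the padded array). The sketch below shows that step on toy data in plain Python; the function names and values are illustrative assumptions. With NumPy arrays, np.argmax(y_test, axis=1) yields the same labels.

```python
def to_labels(prob_rows):
    """Pick the index of the largest probability in each row."""
    return [max(range(len(row)), key=lambda i: row[i]) for row in prob_rows]

def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return float(correct) / len(actual)

# Toy stand-in for y_test (three classes instead of 20).
toy_probs = [
    [0.1, 0.7, 0.2],
    [0.6, 0.3, 0.1],
    [0.2, 0.2, 0.6],
    [0.5, 0.4, 0.1],
]
toy_true = [1, 0, 1, 0]

toy_pred = to_labels(toy_probs)
print(toy_pred, accuracy(toy_pred, toy_true))
```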

Upload the job to SAP Leonardo ML foundation.

All the code is written; now we need to upload it to SAP Leonardo ML foundation. Put training.py in a folder named code. We also need to create a YAML file named newsgroup.yaml, which specifies the resources for running the job.

job:
  name: "newsgroups"
  execution:
    image: "tensorflow/tensorflow:1.5.0-gpu"
    command: "pip install keras --upgrade && python training.py"
    completionTime: "10"
    resources:
      cpus: 1
      memory: 10000
      gpus: 1

Upload the job. Open a command prompt in the directory that contains both the code folder and the YAML file, and run the following command:

cf sapml job submit -f newsgroup.yaml code

You can also see logs of the job using appropriate commands.

In this article, we covered:

Basic text classification using Keras and Python.

Running programs as jobs on SAP Leonardo ML foundation using Train Your Own Model (TYOM).