Deep Learning using SAP Leonardo ML Foundation: Text Classification
This is the second article in a series on Deep Learning and how to use SAP Leonardo ML Foundation for it. The series covers the complete process of a Deep Learning project, from data preparation to prediction.
If you missed the first article, which was on Image Classification using SAP Leonardo ML foundation, find it here.
As the title says, this article covers another very popular application of Deep Learning: Text Classification. With so much textual data being produced, classifying it by hand does not scale, so automatic text classification is necessary. Major industry examples include article tagging, news article classification, and sentiment analysis.
Problem statement: Make a deep learning model to classify text into various categories.
Dataset: Any text classification dataset. In this article, we will use the well-known 20 Newsgroups dataset. It contains news articles and the category each one belongs to. There are 20 different categories, with 11314 samples for training and 7532 samples for testing.
- Platform: Train Your Own Model functionality of SAP Leonardo ML foundation.
- Deep Learning library: Keras.
- Programming language: Python 2
Let’s start the work! Major steps are:
- Making model and training.
- Making predictions using the trained model.
Making model and training.
The dataset is available through the sklearn library, so we need not download it manually. Let’s jump into coding.
Create a new training.py file and add all the code below to it:
import pandas as pd
from sklearn.datasets import fetch_20newsgroups

newsgroups_train = fetch_20newsgroups(subset='train')
newsgroups_test = fetch_20newsgroups(subset='test')

train = pd.DataFrame()
train['article'] = newsgroups_train.data
train['category'] = newsgroups_train.target

test = pd.DataFrame()
test['article'] = newsgroups_test.data
test['category'] = newsgroups_test.target
The code above creates two data frames, train and test, each with two columns.
The first column contains the text of articles and the second column is the category (0 to 19) to which the article belongs.
Now that the data is ready, let’s code the model.
Do the necessary imports.
import numpy as np
from keras.models import Model
from keras.layers import Dense, Embedding, Input, LSTM, Bidirectional, GlobalMaxPool1D, Dropout
from keras.callbacks import EarlyStopping, ModelCheckpoint
from keras.preprocessing import text, sequence
max_features = 10000
maxlen = 100
embed_size = 512
batch_size = 64
epochs = 100
train_sentences = train['article'].values
test_sentences = test['article'].values
y = train['category'].values

tokenizer = text.Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(list(train_sentences))
train_tokenized = tokenizer.texts_to_sequences(train_sentences)
test_tokenized = tokenizer.texts_to_sequences(test_sentences)

train = sequence.pad_sequences(train_tokenized, maxlen=maxlen)
test = sequence.pad_sequences(test_tokenized, maxlen=maxlen)
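To make the preprocessing step more concrete, here is a minimal pure-Python sketch of what tokenizing and padding do to the text. It is not Keras' implementation: the helpers build_vocab, to_sequence and pad are hypothetical names, words are split on whitespace only (the Keras Tokenizer also strips punctuation), and the sketch assumes the Keras defaults of dropping out-of-vocabulary words and padding and truncating at the front of each sequence.

```python
from collections import Counter

def build_vocab(texts, num_words):
    """Map words to integer ids, most frequent word first, starting at 1."""
    counts = Counter(word for t in texts for word in t.lower().split())
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    # Keep only ids below num_words; id 0 is reserved for padding.
    return {w: i for i, (w, _) in enumerate(ranked, start=1) if i < num_words}

def to_sequence(sentence, vocab):
    """Replace each known word by its id; unknown words are dropped."""
    return [vocab[w] for w in sentence.lower().split() if w in vocab]

def pad(seq, maxlen):
    """Keep the last maxlen ids and left-pad shorter sequences with zeros."""
    seq = seq[-maxlen:]
    return [0] * (maxlen - len(seq)) + seq

toy_texts = ["the cat sat", "the dog sat down on the mat", "dog"]
toy_vocab = build_vocab(toy_texts, num_words=5)
toy_padded = [pad(to_sequence(t, toy_vocab), maxlen=3) for t in toy_texts]
print(toy_vocab)
print(toy_padded)
```

In the article's code, max_features plays the role of num_words and maxlen is 100, so every article ends up as a row of exactly 100 integers regardless of its original length.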
Define the model.
def get_model():
    inp = Input(shape=(maxlen, ))
    x = Embedding(max_features, embed_size)(inp)
    x = Bidirectional(LSTM(128, return_sequences=True))(x)
    x = GlobalMaxPool1D()(x)
    x = Dropout(0.2)(x)
    x = Dense(64, activation='relu')(x)
    x = Dropout(0.3)(x)
    x = Dense(20, activation='softmax')(x)
    model = Model(inputs=inp, outputs=x)
    # y holds integer class ids (0 to 19), so use the sparse variant of the loss;
    # plain 'categorical_crossentropy' expects one-hot encoded labels.
    model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
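The labels in y are integer class ids from 0 to 19, while the softmax layer outputs 20 probabilities per article. Keras offers two cross-entropy losses for this pairing: 'sparse_categorical_crossentropy' takes integer ids directly, and 'categorical_crossentropy' expects one-hot rows (which keras.utils.to_categorical can produce). The sketch below, using hypothetical helper functions and a toy 3-class example rather than Keras itself, illustrates that both compute the same value once the label format matches the loss.

```python
import math

def one_hot(label, num_classes):
    """Turn an integer class id into a one-hot row."""
    row = [0.0] * num_classes
    row[label] = 1.0
    return row

def categorical_crossentropy(target_row, probs):
    """Cross-entropy for a one-hot target row."""
    return -sum(t * math.log(p) for t, p in zip(target_row, probs))

def sparse_categorical_crossentropy(label, probs):
    """Cross-entropy for an integer class id."""
    return -math.log(probs[label])

toy_probs = [0.1, 0.7, 0.2]   # softmax output for one sample
toy_label = 1                 # integer class id, as stored in y

loss_sparse = sparse_categorical_crossentropy(toy_label, toy_probs)
loss_onehot = categorical_crossentropy(one_hot(toy_label, 3), toy_probs)
print(loss_sparse, loss_onehot)
```

Mixing the formats, for example integer ids with the one-hot loss, is a common source of shape errors at training time, so it is worth checking which loss the model is compiled with.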
Define some callbacks.
file_path = 'model.hdf5'
checkpoint = ModelCheckpoint(file_path, monitor='val_loss', verbose=1, save_best_only=True, mode='min')
early = EarlyStopping(monitor='val_loss', mode='min', patience=10)
callbacks = [checkpoint, early]
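The two callbacks work together: the checkpoint keeps only the weights of the epoch with the lowest validation loss, and early stopping ends training once that loss has not improved for patience epochs. The following is a rough simulation of that logic on a made-up list of validation losses; simulate_training is a hypothetical helper written for illustration, not part of Keras.

```python
def simulate_training(val_losses, patience):
    """Return (epoch with the best loss, epoch at which training stops), 1-based."""
    best_loss = float("inf")
    best_epoch = 0
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # Improvement: this is where the checkpoint would overwrite the saved weights.
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                # No improvement for `patience` epochs in a row: stop early.
                return best_epoch, epoch
    return best_epoch, len(val_losses)

toy_val_losses = [2.1, 1.6, 1.3, 1.35, 1.4, 1.32, 1.5]
best_epoch, stopped_epoch = simulate_training(toy_val_losses, patience=3)
print(best_epoch, stopped_epoch)
```

With patience=10 and epochs=100 as in the article, training rarely runs for all 100 epochs, and the saved file holds the best weights rather than the final ones.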
All set to train the model. This will take some time.
model = get_model()
model.fit(train, y, batch_size=batch_size, epochs=epochs, validation_split=0.15, callbacks=callbacks)
At this point, the model is trained and the best weights are saved. Now it’s time to see the trained model in action, i.e. to make predictions.
model.load_weights(file_path)
y_test = model.predict(test)
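Because the last layer is a softmax over 20 categories, y_test holds one row of 20 probabilities per test article rather than a category id. A minimal sketch of turning such rows into predicted categories and scoring them is shown below on toy data; predicted_classes and accuracy are hypothetical helpers, and on the real NumPy array np.argmax(y_test, axis=1) should do the same job. Note that the true test labels live in newsgroups_test.target, since the test variable was reassigned to the padded sequences.

```python
def predicted_classes(prob_rows):
    """Pick the index of the highest probability in each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in prob_rows]

def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    matches = sum(1 for p, a in zip(predicted, actual) if p == a)
    # float() keeps the division exact under Python 2 as well.
    return float(matches) / len(actual)

# Toy stand-in for model.predict output: 4 samples, 3 classes.
toy_probs = [
    [0.1, 0.8, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.2, 0.6],
    [0.5, 0.4, 0.1],
]
toy_labels = [1, 0, 2, 1]

toy_pred = predicted_classes(toy_probs)
toy_acc = accuracy(toy_pred, toy_labels)
print(toy_pred, toy_acc)
```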
Upload the job to SAP Leonardo ML foundation.
With all the code written, we now need to upload it to SAP Leonardo ML foundation. Put training.py in a folder named code. We also need to create a YAML file named newsgroup.yaml, which specifies the resources for running the job.
job:
  name: "newsgroups"
  execution:
    image: "tensorflow/tensorflow:1.5.0-gpu"
    command: "pip install keras --upgrade && python training.py"
    completionTime: "10"
    resources:
      cpus: 1
      memory: 10000
      gpus: 1
Upload the job. Open a command prompt in the directory that contains the code folder and the YAML file, and run the following command:
cf sapml job submit -f newsgroup.yaml code
You can also see logs of the job using appropriate commands.
In this article, we covered:
- Doing basic text classification using keras and python.
- Running programs as jobs on SAP Leonardo ML foundation using Train Your Own Model (TYOM).
Hi, I don't see "job" as an available command of the sapml plugin. I am using version 1.0.0 and the commands I see are the following:
config Display or modify configuration
fs Interact with training file system
help Help about any command
model Manage models in model repository
modelserver Manage model servers
retraining Retraining Service
version Print the client version information
I'm facing the same problem with version 1.1.4. Did you find any solution in the meantime?
Yes, I was able to deploy my model but I followed a slightly different approach than the one described here. Instead of saving (and then loading) model weights I saved my model in TF's "saved model format":
Then this model can be served using TF's Serving API. This way there's no need to use the (apparently nonexistent) sapml "job" command. Follow the SAP HANA Academy videos on this subject and let me know if I can help.
Nice article. Could you kindly share your LinkedIn profile?