Deep Learning using SAP Leonardo ML Foundation: Text Classification
This is the second article in a series on Deep Learning and how to use SAP Leonardo ML Foundation for it. The series covers the complete process of a Deep Learning project, from data preparation to prediction.
If you missed the first article, on Image Classification using SAP Leonardo ML Foundation, find it here.
As the title says, this article covers another very popular application of Deep Learning: text classification. With so much textual data around, classifying it automatically is a necessity. Major industry examples include article tagging, news article classification, and sentiment analysis.
- Problem statement: Build a deep learning model to classify text into various categories.
- Dataset: Any text classification dataset. In this article, we will use the well-known 20 Newsgroups dataset. It contains articles and the category each one belongs to. There are 20 categories, with 11314 samples for training and 7532 samples for testing.
- Platform: Train Your Own Model functionality of SAP Leonardo ML foundation.
- Deep Learning library: Keras.
- Programming language: Python 2
Let’s get to work! The major steps are:
- Building and training the model.
- Making predictions using the trained model.
Building and training the model.
The sklearn library provides a loader that fetches the dataset, so we need not download it by hand. Let’s jump into coding.
Create a new file named training.py and add all of the code below to it:
import pandas as pd
from sklearn.datasets import fetch_20newsgroups

newsgroups_train = fetch_20newsgroups(subset='train')
newsgroups_test = fetch_20newsgroups(subset='test')

train = pd.DataFrame()
train['article'] = newsgroups_train.data
train['category'] = newsgroups_train.target

test = pd.DataFrame()
test['article'] = newsgroups_test.data
test['category'] = newsgroups_test.target
The code above creates two data frames, train and test. In each, the first column contains the text of the article and the second column is the category (0 to 19) to which the article belongs.
With the data ready, let’s start coding the model.
Start with the necessary imports.
import numpy as np
from keras.models import Model
from keras.layers import Dense, Embedding, Input, LSTM, Bidirectional, GlobalMaxPool1D, Dropout
from keras.callbacks import EarlyStopping, ModelCheckpoint
from keras.preprocessing import text, sequence
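max_features = 10000  # vocabulary size: only the most frequent words are kept
maxlen = 100          # number of word ids kept per article
embed_size = 512      # dimension of the word embeddings
batch_size = 64
epochs = 100          # upper bound; early stopping may end training sooner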
train_sentences = train['article'].values
test_sentences = test['article'].values
y = train['category'].values

# Build the vocabulary from the training articles only
tokenizer = text.Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(list(train_sentences))

# Turn each article into a list of word ids
train_tokenized = tokenizer.texts_to_sequences(train_sentences)
test_tokenized = tokenizer.texts_to_sequences(test_sentences)

# Pad or truncate every article to maxlen ids
# (from here on, train and test hold arrays, not data frames)
train = sequence.pad_sequences(train_tokenized, maxlen=maxlen)
test = sequence.pad_sequences(test_tokenized, maxlen=maxlen)
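To make the preprocessing step concrete, here is a minimal pure-Python sketch of what tokenizing and padding do to a few toy sentences. It is a simplified illustration, not Keras's implementation: the helper names build_word_index, texts_to_ids and pad_ids are hypothetical, punctuation handling is ignored, and it assumes the Keras defaults of ranking words by frequency, dropping out-of-vocabulary words, and padding and truncating at the front.

```python
from collections import Counter


def build_word_index(texts):
    """Map each word to a rank: 1 = most frequent. Id 0 is reserved for padding."""
    counts = Counter(word for t in texts for word in t.lower().split())
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def texts_to_ids(texts, word_index, num_words):
    """Replace words by their rank, dropping words ranked num_words or higher."""
    return [[word_index[w] for w in t.lower().split()
             if w in word_index and word_index[w] < num_words]
            for t in texts]


def pad_ids(sequences, maxlen):
    """Left-pad with zeros; keep only the last maxlen ids of longer sequences."""
    return [[0] * (maxlen - len(s[-maxlen:])) + s[-maxlen:] for s in sequences]


# Toy data (hypothetical), standing in for the newsgroup articles
texts = ["the cat sat", "the cat ran", "the dog"]
word_index = build_word_index(texts)
ids = texts_to_ids(texts, word_index, num_words=3)  # keeps only "the" and "cat"
padded = pad_ids(ids, maxlen=4)
```

With max_features = 10000 and maxlen = 100 as in the article, the same idea means that rare words are dropped and each article is reduced to a row of exactly 100 ids.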
Define the model.
def get_model():
    inp = Input(shape=(maxlen, ))
    x = Embedding(max_features, embed_size)(inp)
    x = Bidirectional(LSTM(128, return_sequences=True))(x)
    x = GlobalMaxPool1D()(x)
    x = Dropout(0.2)(x)
    x = Dense(64, activation='relu')(x)
    x = Dropout(0.3)(x)
    x = Dense(20, activation='softmax')(x)
    model = Model(inputs=inp, outputs=x)
    # y holds integer class ids (0 to 19), not one-hot vectors,
    # so the sparse variant of the cross-entropy loss is used
    model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
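The targets in y are integer class ids rather than one-hot vectors, and that determines which cross-entropy loss fits. In Keras, categorical_crossentropy expects one-hot targets (which keras.utils.to_categorical can produce), while sparse_categorical_crossentropy takes the integer ids directly. The sketch below shows, for a single sample, that both compute the same value. The function names and the 3-class probabilities are made up for illustration, and the numerical clipping that Keras applies is left out.

```python
import math


def sparse_cross_entropy(probs, label):
    """Loss for one sample when the target is an integer class id."""
    return -math.log(probs[label])


def categorical_cross_entropy(probs, one_hot):
    """Loss for one sample when the target is a one-hot vector."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))


def to_one_hot(label, num_classes):
    """Turn an integer class id into a one-hot list."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]


probs = [0.7, 0.2, 0.1]  # hypothetical softmax output for a 3-class problem
label = 1                # integer class id, like the values in y

loss_sparse = sparse_cross_entropy(probs, label)
loss_onehot = categorical_cross_entropy(probs, to_one_hot(label, 3))
```

Only the probability assigned to the true class contributes to the loss, so the choice between the two losses is purely a matter of how the labels are encoded.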
Define some callbacks.
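file_path = 'model.hdf5'
# Save the weights whenever the validation loss improves
checkpoint = ModelCheckpoint(file_path, monitor='val_loss', verbose=1, save_best_only=True, mode='min')
# Stop training once the validation loss has not improved for 10 epochs
early = EarlyStopping(monitor='val_loss', mode='min', patience=10)
callbacks = [checkpoint, early]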
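The interplay of the two callbacks can be sketched in plain Python. This is a simplified model of the patience logic, not the Keras source: the function name early_stopping_epoch and the validation losses are invented, and a minimum improvement of zero is assumed.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 0-based.

    stop_epoch is None if training runs through all epochs.
    """
    best_loss = float('inf')
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # Improvement: this is where the checkpoint would be written
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return None, best_epoch


# Hypothetical validation losses, one per epoch
losses = [2.0, 1.5, 1.2, 1.3, 1.25, 1.4, 1.1, 1.2]
stopped = early_stopping_epoch(losses, patience=3)
finished = early_stopping_epoch(losses, patience=4)
```

With a patience of 3 the run stops at epoch 5 and the saved weights come from epoch 2; with a patience of 4 it survives long enough to reach the better loss at epoch 6. This is why the weights are reloaded from the checkpoint file before predicting, rather than taken from the last epoch.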
All set to train the model. This will take some time.
model = get_model()
model.fit(train, y, batch_size=batch_size, epochs=epochs, validation_split=0.15, callbacks=callbacks)
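The validation_split argument holds out the last fraction of the samples passed to fit, taken before any shuffling, and computes val_loss on them. A rough sketch of that slicing, assuming the split index is obtained by truncating to an integer (the helper name split_for_validation is hypothetical):

```python
def split_for_validation(samples, validation_split):
    """Hold out the last fraction of the samples for validation."""
    split_at = int(len(samples) * (1.0 - validation_split))
    return samples[:split_at], samples[split_at:]


# 11314 is the number of training samples in the dataset
train_part, val_part = split_for_validation(list(range(11314)), 0.15)
```

Under that assumption, 9616 articles are used for fitting and the remaining 1698 for validation.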
At this point the model is trained and the best weights are saved. Now it’s time to see the trained model in action, i.e. to make predictions.
model.load_weights(file_path)  # restore the best checkpoint
y_test = model.predict(test)   # one row of 20 class probabilities per test article
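The output of model.predict is a row of 20 probabilities per article, not a category. A minimal sketch of turning such rows into class ids and scoring them follows. The helper names and the toy 3-class values are illustrative only; with the real arrays, np.argmax(y_test, axis=1) compared against newsgroups_test.target would do the same job.

```python
def predicted_classes(probabilities):
    """Pick the index of the largest probability in each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in probabilities]


def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / float(len(actual))


# Toy values (hypothetical), standing in for y_test and the true categories
probs = [[0.1, 0.7, 0.2],
         [0.6, 0.3, 0.1],
         [0.2, 0.2, 0.6]]
true_labels = [1, 0, 1]

pred = predicted_classes(probs)
acc = accuracy(pred, true_labels)
```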
Upload the job to SAP Leonardo ML foundation.
All the code is written; now we need to upload it to SAP Leonardo ML Foundation. Put training.py in a folder named code. We also need to create a YAML file named newsgroup.yaml, which specifies the resources for running the job.
job:
  name: "newsgroups"
  execution:
    image: "tensorflow/tensorflow:1.5.0-gpu"
    command: "pip install keras --upgrade && python training.py"
    completionTime: "10"
    resources:
      cpus: 1
      memory: 10000
      gpus: 1
Upload the job. Open a command prompt in the directory that contains the code folder and the YAML file, and run the following command:
cf sapml job submit -f newsgroup.yaml code
You can also see logs of the job using appropriate commands.
Through this article, we covered:
- Doing basic text classification using Keras and Python.
- Running programs as jobs on SAP Leonardo ML Foundation using Train Your Own Model (TYOM).