How to handle Multi-class Classification in Automated Analytics?
Hi,
As part of my day-to-day activities, I check the Idea Place for Predictive and try to provide feedback on some of the items.
One of them was about the need to handle Multi-class Classification in Automated Analytics.
Context
First let’s define what a Multi-class Classification model is:
“In machine learning, multi-class or multinomial classification is the problem of classifying instances into one of the more than two classes (classifying instances into one of the two classes is called binary classification). While some classification algorithms naturally permit the use of more than two classes, others are by nature binary algorithms; these can, however, be turned into multinomial classifiers by a variety of strategies.”
There are two common ways to address a Multi-class Classification problem with binary classifiers:
- One-vs.-rest (OvR)
The one-vs.-rest (or one-vs.-all) strategy involves training a single classifier per class, with the samples of that class as positive samples and all other samples as negatives. This strategy requires the base classifiers to produce a real-valued confidence score for its decision, rather than just a class label; discrete class labels alone can lead to ambiguities, where multiple classes are predicted for a single sample.
- One-vs.-one (OvO)
In the one-vs.-one strategy, one trains K (K − 1) / 2 binary classifiers for a K-way multiclass problem; each receives the samples of a pair of classes from the original training set, and must learn to distinguish these two classes. At prediction time, a voting scheme is applied: all K (K − 1) / 2 classifiers are applied to an unseen sample and the class that got the highest number of “+1” predictions gets predicted by the combined classifier.
Source: https://en.wikipedia.org/wiki/Multiclass_classification
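To make the two strategies concrete, here is a minimal Python sketch of how the binary outputs could be combined at prediction time. The function names, scores and pairwise winners are made up for illustration; nothing here is specific to SAP Predictive Analytics.

```python
from collections import Counter

def predict_ovr(scores):
    """OvR: pick the class whose binary model gives the highest confidence score.

    scores maps each class to the real-valued score of its "class vs. rest" model.
    """
    return max(scores, key=scores.get)

def predict_ovo(pairwise_winners):
    """OvO: each pairwise model votes for one class; the most voted class wins.

    pairwise_winners maps a (class_a, class_b) pair to the class its model predicted.
    Ties are broken by the order in which classes first received a vote.
    """
    votes = Counter(pairwise_winners.values())
    return votes.most_common(1)[0][0]

# Illustrative values only (3 classes, so 3 OvR models and 3 OvO models)
ovr_scores = {"A": 0.2, "B": 0.7, "C": 0.4}
ovo_winners = {("A", "B"): "B", ("A", "C"): "C", ("B", "C"): "B"}

print(predict_ovr(ovr_scores))
print(predict_ovo(ovo_winners))
```

Note that OvR relies on comparable confidence scores across models, while OvO only needs the predicted label of each pairwise model.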
Out of the box, the Automated Analytics mode of SAP Predictive Analytics only builds binary classification models. The Expert Analytics mode may be able to handle multi-class problems using one of its out-of-the-box algorithms, and certainly can via an open source R script. This blog post focuses only on the Automated Analytics mode, and we won’t discuss the pros and cons of OvR or OvO.
Approach
There are multiple ways to handle an “n-way” multi-class model problem:
- The “multi-target model” approach
- Prepare a data set with n target variables for OvR or n * (n − 1) / 2 target variables for OvO
- Build one model with all the targets
The final model will probably be the worst option, because a single model has to fit all the targets and so cannot be optimal for each one (encoding, binning, variable reduction etc.).
- The “build one, then replicate” approach
- Prepare a data set with n target variables for OvR or n * (n − 1) / 2 target variables for OvO
- Build one model with one of the targets
- Then replicate using a KxShell script to run the n-1 other models for OvR or (n * (n − 1) / 2) – 1 other models for OvO
The final models will be optimal for each target (encoding, binning, variable reduction etc.), but there will be many models to build.
We will focus on the “build one, then replicate” approach, as it provides more “optimal” models. Since SAP Predictive Analytics provides all the tools to “productize” models en masse, the number of models won’t be an issue.
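To get a feel for how many models each strategy implies, here is a small Python sketch of the arithmetic. The 26 classes “A” to “Z” are used purely as an illustration, and the helper names are hypothetical.

```python
from itertools import combinations
from string import ascii_uppercase

def ovr_model_count(n):
    """OvR needs one binary model (one target) per class."""
    return n

def ovo_model_count(n):
    """OvO needs one binary model per unordered pair of classes: n * (n - 1) / 2."""
    return n * (n - 1) // 2

classes = list(ascii_uppercase)            # "A" .. "Z", illustrative only
ovo_pairs = list(combinations(classes, 2)) # ("A", "B"), ("A", "C"), ...

print(ovr_model_count(len(classes)), ovo_model_count(len(classes)), len(ovo_pairs))
```

The OvO count grows quadratically with the number of classes, which is why scripting the replication matters much more for OvO than for OvR.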
Now, the tricky part is how to prepare the data set.
Prepare the Data Set
I’m a lazy guy, so I don’t want to build a fixed data set with n target variables for OvR or n * (n − 1) / 2 target variables for OvO: if a new class appears, I will have to modify my data set to add it, and that’s the last thing I want to do!
This is why I love Data Manager so much!
I will assume that everyone knows the different elements in play to build your Analytical Data Set in Data Manager.
Anyway, if you don’t, here is a short summary of the objects that need to be created:
- Entity: the subject of the analysis, your customer id or event id
- Time Stamp Population: the list of entities to be used for training or scoring your model at a reference date (snapshot). It also includes your target if the population is to be used for training purposes
- Analytical Record: the list of attributes to be associated with the entity at that reference date (time stamp)
So how to handle “Multi-class Classification in Automated Analytics” with Data Manager?
You will only need one Time Stamp Population! And you will be able to handle both OvR and OvO!
Let’s take an example. Our Multi-class Classification will have 26 classes, from “A” to “Z” (they could just as well be “1” to “26”).
Time Stamp Population for OvR:
- I will assume that you already have your “class” variable/attribute, with a value between “A” and “Z”, available in your Time Stamp Population (via a merge, a condition etc.)
- You will need a prompt that will define the “one” class you want to use versus the rest. Let’s say it’s a String and the default value will be “A”
- Once defined, we will use the prompt in a condition/expression to generate the target
- And save it as “KxTarget” (this naming convention ensures the target variable is surfaced)
- Now you have your target variable defined
- Click “Next”, and switch to the “Target” tab where you can assign your target
- If you click on “View Data”, you will get a prompt asking you for the “One” class you want to use
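The logic of the OvR target expression can be sketched in Python as follows. This is not Data Manager expression syntax, just an illustration of what the condition computes; the sample rows and the function name are made up, while “KxTarget” and the default value “A” come from the steps above.

```python
def ovr_target(class_value, the_one="A"):
    """Return 1 when the row belongs to the prompted class, 0 for all the rest."""
    return 1 if class_value == the_one else 0

# Made-up rows standing in for the Time Stamp Population
rows = [
    {"id": 1, "class": "A"},
    {"id": 2, "class": "B"},
    {"id": 3, "class": "Z"},
]

the_one = "A"  # value that would be entered in the prompt
labelled = [dict(row, KxTarget=ovr_target(row["class"], the_one)) for row in rows]

for row in labelled:
    print(row)
```

Because the class is a prompt value rather than a hard-coded column, the same population definition serves every class, including classes that appear later.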
Time Stamp Population for OvO:
- I will assume that you already have your “class” variable/attribute, with a value between “A” and “Z”, available in your Time Stamp Population (via a merge, a condition etc.)
- You will need a prompt that defines the “one” class of the pair (“TheOne”). Let’s say it’s a String and the default value will be “A”
- You will need a second prompt that defines the “other one” class of the pair (“TheOtherOne”). Let’s say it’s a String and the default value will be “B”
- Once defined, we will use the prompts in a condition/expression to generate the target as in OvR, so that KxTarget = 1 means the class is equal to “TheOne” and KxTarget = 0 means it is equal to “TheOtherOne”
- Then you will need to define a filter that excludes every row whose class is neither “TheOne” nor “TheOtherOne”
- Now you have your target variable defined. Click “Next”, and switch to the “Target” tab where you can assign your target as for OvR
- If you click on “View Data”, you will get a prompt asking you for the “One” class and the “Other One” class you want to use
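The OvO variant combines a filter and a target expression. Here is a minimal Python sketch of that logic, again as an illustration rather than Data Manager syntax; the sample rows and the function name are hypothetical.

```python
def ovo_rows(rows, the_one="A", the_other_one="B"):
    """Keep only the rows of the two prompted classes and add the binary target.

    KxTarget is 1 for "TheOne" and 0 for "TheOtherOne"; every other class is filtered out.
    """
    kept = []
    for row in rows:
        if row["class"] not in (the_one, the_other_one):
            continue  # the filter: exclude everything but the two classes
        kept.append(dict(row, KxTarget=1 if row["class"] == the_one else 0))
    return kept

# Made-up rows standing in for the Time Stamp Population
rows = [
    {"id": 1, "class": "A"},
    {"id": 2, "class": "B"},
    {"id": 3, "class": "C"},
    {"id": 4, "class": "A"},
]

for row in ovo_rows(rows, the_one="A", the_other_one="B"):
    print(row)
```

Each pairwise model is therefore trained on a smaller data set than an OvR model, since only two classes survive the filter.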
So we are done with the data set generation. Let’s build the models!
Build the models
So when you use Data Manager while building your classification, you will get the prompt popup that will ask you to enter the values to be used to extract the data set. Here is an example with OvR and an additional prompt:
You can click “Next”, “OK”, “Analyze”, “Next”, “Next” to reach the last step before creating the model itself for class “A”.
Using KxShell Scripts
Click on “Export KxShell Script…” and save the “Learn” script on your Desktop for example.
If you open it in a text editor, you will find that the prompt values are stored in KxShell “macros” (similar to variables in a programming language).
So, if I want to train that model using the generated script, I will have to execute the following command in a command prompt:
"C:\Program Files\SAP Predictive Analytics\Desktop\Automated\EXE\Clients\CPP\KxShell.exe" "learn.kxs"
Now if I want to run it for class “B”, I will run:
"C:\Program Files\SAP Predictive Analytics\Desktop\Automated\EXE\Clients\CPP\KxShell.exe" "learn.kxs" -DTRAINING_STORE_PROMPT_1=B
You can override any of the macros defined in the script from the command line in the same way.
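Replicating the OvR model for every class then comes down to running the same command with a different macro value. Here is a minimal Python sketch that builds one command per class. It assumes the install path, the “learn.kxs” script name and the TRAINING_STORE_PROMPT_1 macro shown above; the function names are hypothetical, and by default the sketch only prints the commands instead of running them.

```python
import subprocess
from string import ascii_uppercase

# Path and script name taken from the commands above; adjust to your installation
KXSHELL = r"C:\Program Files\SAP Predictive Analytics\Desktop\Automated\EXE\Clients\CPP\KxShell.exe"

def ovr_command(the_one, script="learn.kxs"):
    """Build the KxShell command that trains the "the_one vs. rest" model."""
    return [KXSHELL, script, f"-DTRAINING_STORE_PROMPT_1={the_one}"]

def run_all(classes, dry_run=True):
    """Print (dry run) or execute one training command per class."""
    commands = [ovr_command(c) for c in classes]
    for cmd in commands:
        if dry_run:
            print(subprocess.list2cmdline(cmd))
        else:
            subprocess.run(cmd, check=True)  # stop at the first failing model
    return commands

commands = run_all(ascii_uppercase)  # "A" .. "Z", dry run
```

Before running this for real, check the other macros in the exported script: if the model name or output location is also stored in a macro, it probably needs a per-class value as well, so that each run does not overwrite the previous model.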
For the OvO approach, the same logic applies, except that you will need to build n * (n − 1) / 2 models, which may require a little script to do the iteration properly.
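The iteration over class pairs could look like the following Python sketch, which generates the macro overrides for each pairwise model. TRAINING_STORE_PROMPT_2 is an assumption for the name of the second prompt macro (only TRAINING_STORE_PROMPT_1 appears above), so check the exported script for the actual names; the function name is hypothetical too.

```python
from itertools import combinations
from string import ascii_uppercase

def ovo_macro_overrides(classes,
                        first_macro="TRAINING_STORE_PROMPT_1",
                        second_macro="TRAINING_STORE_PROMPT_2"):  # assumed macro name
    """Return one pair of -D overrides per unordered pair of classes."""
    return [
        [f"-D{first_macro}={one}", f"-D{second_macro}={other}"]
        for one, other in combinations(classes, 2)
    ]

overrides = ovo_macro_overrides(ascii_uppercase)
print(len(overrides))  # number of pairwise models to build
print(overrides[0])
```

Each entry is meant to be appended to the KxShell command line used for the “Learn” script, one run per pair.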
Hope this was helpful, and of course feel free to comment.
PS: I tried to keep the flow simple, so I may have taken some shortcuts or been brief in the explanations to keep the entry short.