If you get interested in Machine Learning, then sooner rather than later you will hear about or discover a website called Kaggle. Founded in 2010, Kaggle “allows users to find and publish data sets, explore and build models in a web-based data-science environment, work with other data scientists and machine learning engineers, and enter competitions to solve data science challenges.” (from Wikipedia)
The very first challenge participants enter is the famous Titanic ML competition. The competition is simple: use machine learning to create a model that predicts which passengers survived the Titanic shipwreck. In Kaggle’s own words:
In this competition, you’ll gain access to two similar datasets that include passenger information like name, age, gender, socio-economic class, etc. One dataset is titled `train.csv` and the other is titled `test.csv`.
`train.csv` will contain the details of a subset of the passengers on board (891 to be exact) and importantly, will reveal whether they survived or not, also known as the “ground truth”.
The `test.csv` dataset contains similar information but does not disclose the “ground truth” for each passenger. It’s your job to predict these outcomes.
Using the patterns you find in the `train.csv` data, predict whether the other 418 passengers on board (found in `test.csv`) survived.
In our case, we are not going to submit the final answers to the competition. Our goal here is to discover, learn, and experiment.
This Kaggle challenge is a classic classification problem. What’s more, it is a binary classification, where the output takes one of only two possible values.
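As a minimal illustration of what a binary classifier is, here is a toy, hand-written rule in Python. The sample records are made up, and the rule is just the well-known “women survived” Titanic baseline, not a trained model:

```python
# Toy binary classifier for Titanic-style records. The records below are
# made up for illustration; this is a fixed rule, not a trained model.
def predict_survival(passenger: dict) -> int:
    """Return one of exactly two values: 1 (survived) or 0 (did not)."""
    return 1 if passenger["Sex"] == "female" else 0

passengers = [
    {"PassengerId": 1, "Sex": "male"},
    {"PassengerId": 2, "Sex": "female"},
]
predictions = [predict_survival(p) for p in passengers]
print(predictions)  # → [0, 1]
```

Whatever model we end up with, its output has this shape: every passenger falls into one of exactly two categories.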
I would like to use this very popular exercise to get you introduced to using Machine Learning with different SAP products.
The Machine Learning process usually follows a cycle of well-defined steps, from data acquisition and preparation to building an ML model and using it to get predictions on new data.
We are starting with SAP Analytics Cloud.
For that, I am using the trial edition of SAP Analytics Cloud, so that you can replicate the steps too.
Please note that the current version of the trial edition is 2021.13. Some of the functionality or UI navigations presented here might not yet be available in productive instances with quarterly update cycles.
One of the Augmented Analytics features of the product is Smart Predict. This feature is no-code Machine Learning: it automatically learns from your historical data and finds the best relationships or patterns of behavior to easily generate predictions for future events, values, and trends.
Let’s see how Smart Predict helps us to address Kaggle’s Titanic challenge.
1. Identify the ML Scenario
So, our task is to use the patterns we find in the train data to predict whether the other 418 passengers on board (found in the test data) survived. The solution should be provided in the form of a file with two columns:
- The ID of a passenger,
- The predicted value: Yes or No, e.g. encoded as 1 and 0.
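Such a file can be produced with a few lines of Python; the passenger IDs and predicted values below are invented placeholders:

```python
import csv
import io

# Invented example predictions: (passenger ID, predicted survival 0/1).
predictions = [(892, 0), (893, 1), (894, 0)]

# Build the two-column submission file in memory.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["PassengerId", "Survived"])  # header row
writer.writerows(predictions)

print(buffer.getvalue())
```

In a real submission, the buffer would be written to a file instead of printed.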
2a. Data Acquisition
Go to the Datasets application and create a new dataset by importing a CSV file.
To keep all related artifacts in one place, I created a new folder.
2b. Data Discovery
Once the file is loaded, the dataset is available for us to work with.
The dataset has 891 rows, or “observations”. All records together are called the “population”.
There are 12 columns in the dataset, or “variables”. In our example we will be predicting the `Survived` variable, which is our “target”. The “influencers” are variables that describe your data and which serve to explain the target.
The Output view of dimensions and measures, while important when building stories and visualizations, is not relevant at this stage when our focus is on training predictive models.
Let’s switch to the Columns view.
Now we can check the details of the columns to understand the data better and to check their quality. For example:
- There are 342 records (or 38.38%) where the value of `Survived` is 1.
- There are passenger classes 1, 2, and 3 in `Pclass`, with no missing values in the records.
- 77.1% of records are missing a cabin number.
- A single ticket is not always for a single person, but might be for a group/family.
- The histogram for the `Age` variable makes it look as if there were a lot of kids younger than 4 years old on board. In fact, there is a big number of empty values in this column in the dataset.
To get a better picture of the age distribution, let’s replace the empty values with the null value to differentiate them from the value 0. To achieve that, create a transformation replacing all empty values with null.
Now we get a much better view of the distribution of the ages of passengers, plus information about the number of records with the null value in the `Age` column.
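The same empty-to-null transformation can be sketched in plain Python; the rows below are made-up examples in the shape of the `Age` column:

```python
# Made-up rows illustrating the transformation: empty strings in the Age
# column become None (null), so they are no longer confused with age 0.
rows = [
    {"PassengerId": 1, "Age": "22"},
    {"PassengerId": 2, "Age": ""},      # missing age
    {"PassengerId": 3, "Age": "0.42"},  # a real infant, not a missing value
]
for row in rows:
    row["Age"] = float(row["Age"]) if row["Age"] != "" else None

print([row["Age"] for row in rows])  # → [22.0, None, 0.42]
```

With nulls in place, a histogram can count missing values separately instead of piling them into the lowest bucket.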
Save the dataset.
2c. Data Processing
The transformation done in the previous step is an example of data processing to prepare the dataset to be used in machine learning.
We are not going to do many more data transformations in this cycle, but there is one mandatory activity to prepare variables for ML training: checking and assigning proper statistical types to them.
Here are the suggested types, in the alphabetical order of columns; among them:
- Parents or Children (`Parch`),
- Siblings or Spouses (`SibSp`).

Please note as well that the data type of the `Survived` column is Boolean.
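To illustrate what a statistical type tells the training algorithm, here is a hypothetical mapping in Python. The assignments below are examples of the continuous/nominal/ordinal distinction, not the exact types to configure:

```python
# Hypothetical examples of statistical types (illustration only):
statistical_types = {
    "Age": "continuous",   # numeric: ordering and distances matter
    "Sex": "nominal",      # categories with no inherent order
    "Pclass": "ordinal",   # ordered categories: 1st < 2nd < 3rd class
}
print(statistical_types["Pclass"])  # → ordinal
```

The point of the exercise is that, for example, a variable stored as a number (like a class code) is not necessarily continuous.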
Save the dataset.
3. Model Creation
A Predictive Scenario is a workspace where you create and compare predictive models to find the one that brings the best predictions.
Let’s go to the Predictive Scenarios application and create a new Classification.
Save it as `Titanic` in the folder with the same name. Next:
- Select the `train` dataset as the training source.
- Edit the column details: verify the statistical types, and check `PassengerId` as a key variable.
- Select `Survived` as the target.
Training is a process that takes these values and uses SAP machine learning algorithms to explore relationships in your data source to come up with the best combinations for the predictive model.
We are not going to change any other settings for now, so just click Train.
In a few minutes, you should see that the first model, “Model 1”, has been trained (the status is shown in the Status Panel). All that with a single click!
Let’s look closer at what we got:
- Our source dataset `train` has been split into two partitions: Training and Validation. The first one was used to build multiple models, and the second one was used to select the best model, i.e. the model with the best indicators.
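The partitioning idea can be sketched like this; the 75/25 split ratio and the seed are illustrative assumptions, not Smart Predict’s actual settings:

```python
import random

# Stand-ins for the 891 labelled records of the train dataset.
rows = list(range(891))

random.seed(42)       # illustrative seed, for reproducibility only
random.shuffle(rows)  # random partitioning, as described in the text

cut = int(len(rows) * 0.75)  # assumed 75% training / 25% validation
training, validation = rows[:cut], rows[cut:]

print(len(training), len(validation))  # → 668 223
```

The key property is that the two partitions do not overlap, so the validation records are “unseen” when the candidate models are compared.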
- Predictive Power is the main measure of predictive model accuracy. The closer its value is to 100%, the more confident you can be when you apply the predictive model to obtain predictions.
- Prediction Confidence is your predictive model’s ability to achieve the same degree of accuracy when you apply it to a new dataset that has the same characteristics as the training dataset. This value should be as close as possible to 100%.
- During the training, Smart Predict calculates an optimized set of influencers to include in your predictive model.
We will look closer at the ways these indicators are calculated, but for now, the indicators look good enough. Please note that they might differ slightly between trainings on the same dataset, because the random partitioning into training and validation parts only tries to keep the data distributions similar.
Let’s check the six variables that Smart Predict computed as influencers. Understanding the influencers and their contributions gives you an explanation of the automatically generated model, and therefore an understanding of how the model makes predictions.
The influencers are sorted by decreasing importance. Gender and cabin class are two top influencers.
For each influencer, we can analyze the influence of different categories (single values, ranges of values, or groups of values) on the target. The higher the absolute value of the influence, the stronger the influence of the category is. The influence of a category can be positive or negative.
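One simple way to build an intuition for positive and negative influence (this is not Smart Predict’s actual computation) is to compare each category’s survival rate with the overall survival rate; the `(Pclass, Survived)` pairs below are made up:

```python
# Made-up (Pclass, Survived) pairs for illustration only.
records = [
    (1, 1), (1, 1), (1, 0),
    (2, 1), (2, 0),
    (3, 0), (3, 0), (3, 0), (3, 1),
]
overall = sum(s for _, s in records) / len(records)

# A category's "influence" here: its survival rate minus the overall rate.
# Positive values push toward survival, negative values away from it.
influence = {}
for pclass in sorted({p for p, _ in records}):
    group = [s for p, s in records if p == pclass]
    influence[pclass] = sum(group) / len(group) - overall

print({p: round(v, 2) for p, v in influence.items()})
```

On this toy data, the 1st class comes out positive and the 3rd class negative, mirroring the pattern described below.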
Taking the cabin class variable `Pclass` as an example in the screenshot above:
- Traveling in the 3rd class has a strong negative influence,
- Traveling in the 1st and 2nd classes has a positive influence, but it is traveling in the 1st class that has a much stronger influence.
We will spend more time going into the details of ML models later. For now, having quite good indicators and an understanding of the influencers’ contributions, let’s move on.
4. Generating Predictions
In the previous step, the model has been trained and automatically deployed on SAP Analytics Cloud infrastructure.
We need a dataset with the population (records with observations) to which we want to apply the model to get the predicted category. The results will be saved to a generated dataset.
So, first, we need to prepare a SAC dataset with the data from a Kaggle-provided test file.
Go to the Datasets application, and import the `test.csv` file into the dataset `test` in the `Titanic` folder.
The test dataset has 418 records (or observations) and only 11 columns, as the column `Survived` is missing. That’s the column we want to predict.
To be consistent with the `train` dataset, let’s replace missing values in the `Age` variable with null.
Save the dataset and go back to the Predictive Scenarios application. Open the scenario `Titanic`, if it is closed.
Click on the icon Apply Predictive Model.
- Select the `test` dataset as the data source.
- Leave the replicated columns empty. Only the key column, `PassengerId`, which we marked as a key before the training process, will be replicated from the input test dataset to the generated dataset.
- Choose only `Predicted Category` from Statistics and Predictions. It will be added to the generated dataset as a column with the calculated prediction.
- Output as `test-predictions` in the same `Titanic` folder. This will be our generated dataset.
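The shape of this apply step can be sketched with a toy stand-in model; the real predictions come from the Smart Predict model, and the test rows below are invented:

```python
# Toy stand-in for the deployed model (illustration only, not the SAC model).
def model(row: dict) -> int:
    return 1 if row["Sex"] == "female" else 0

# Invented records in the shape of test.csv (no Survived column).
test_rows = [
    {"PassengerId": 892, "Sex": "male", "Age": 34.5},
    {"PassengerId": 893, "Sex": "female", "Age": 47.0},
]

# The generated dataset keeps only the key column plus the prediction.
generated = [
    {"PassengerId": row["PassengerId"], "Predicted Category": model(row)}
    for row in test_rows
]
print(generated)
```

Conceptually, that is all the apply step does: run every test record through the trained model and store the key together with the predicted category.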
Expand the status panel, and you’ll see the status changing from “Trained” to “Applying Pending” to “Applying” and, finally, to “Applied”.
Now go to the Files app, where you should find the generated dataset `test-predictions`.
You should see two columns: `PassengerId` and `Predicted Category`. Out of this group of passengers, 149 are in the category with the value 1, i.e. are predicted to survive.
These two columns are the way Kaggle expects participants to submit their predictions.
So, is that it? Are we done?
Well, we would be, if submission to Kaggle’s challenge were our goal. But the real goal is to discover, learn, and experiment. So, the answer is: not yet!
While we used this exercise to create our first predictive scenario and the first predictive classification model in SAP Analytics Cloud, in the next parts we will look closer at classification predictions in the Smart Predict, and will see if we can get a better Machine Learning model.
-Vitaliy, aka @Sygyzmundovych