SAP Tech Bytes: Feature Engineering using Data Wrangling
In the previous posts of the series, we built a predictive model using Smart Predict in SAP Analytics Cloud with just a few clicks, and then looked at some key performance indicators of the model that Smart Predict built for us.
Improving Model Performance
Smart Predict greatly simplifies the process of training a Machine Learning model thanks to the automation of many steps and decisions. But we still have some techniques available to us, such as feature engineering or the imputation of missing values, that we can use to improve model performance.
These and other techniques were presented by Stuart Clarke in unit 3 “Improving Model Performance” of week 5 of the openSAP course Getting Started with Data Science.
What is Feature Engineering
The post Feature Engineering Tips for Data Scientists gives a simple and intuitive explanation:
Feature engineering is simply the thoughtful creation of new input fields from existing input data. Thoughtful is the key word here. The newly created inputs must have some relevance to the model output and generally come from knowledge of the domain.
There are some variables in Kaggle’s Titanic datasets, like Name or Cabin number, which seemed irrelevant at first. But can we get some more meaningful variables from them?
Data wrangling
Let’s try to apply our (intuitive and acquired) domain knowledge to improve the model performance. We can create new variables in the training and test datasets using the data-wrangling capabilities of SAP Analytics Cloud.
Check Introduction to Smart Data Wrangling by Josef Hampp and Next gen Data Wrangling and Agile Analytics by David Stocker to learn more.
Titles
Honorifics, including common titles, were customary at that time. We have them as part of the Name column in the provided datasets. Let’s extract them as a new category.
We already used the Custom Expression Editor and the Wrangling Expression Language (WEL) to define our own transformations in the previous post.
While in the train dataset, open the Custom Expression Editor and define a new column Title based on the following expression.
[Title] = element(split(element(split([Name],'.',1),1),",",2),2)
It splits each name based on the observation that names follow a pattern: the last name comes first, followed by a comma, then a title (i.e., an honorific), a dot, and the rest of the passenger’s name(s).
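To sanity-check the logic outside of SAP Analytics Cloud, here is a minimal Python sketch of the same two-step split; the helper function is just for illustration, with sample names from the Kaggle dataset:

```python
# A minimal sketch of the same two-step split as the WEL expression:
# take the text before the first dot, then the text after the comma.
def extract_title(name: str) -> str:
    before_dot = name.split(".")[0]           # e.g. "Braund, Mr"
    return before_dot.split(",")[1].strip()   # e.g. "Mr"

for name in ["Braund, Mr. Owen Harris",
             "Heikkinen, Miss. Laina"]:
    print(extract_title(name))  # Mr, Miss
```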
Frankly, I had never heard of Jonkheer before! We live and learn.
Save this modified dataset as train_engineered in the Titanic folder.
Age groups
The Age variable was defined as having a continuous statistical data type when we trained the previous model. Can we apply our human knowledge to split passengers into groups by their known age?
There are different approaches possible, like five-year age groups, or a life-cycle grouping. Based on a bit of research I came up with the grouping below. But this is one of those features where several approaches can be modeled and tested to get the best model.
[AgeCategory] = if(
[Age] >= 65,
'Elderly',
if(
[Age] >= 45,
'Senior',
if(
[Age] >= 18,
'Adult',
if(
[Age] >= 2,
'Child',
if(
[Age] > 0,
'Baby',
null
)
)
)
)
)
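For comparison, the same grouping could be sketched in pandas; this is just an illustration with made-up sample ages, not part of the SAP Analytics Cloud workflow:

```python
import pandas as pd

# The same age bins as the WEL expression above; missing ages (NaN)
# fall outside all bins, which matches the null branch.
df = pd.DataFrame({"Age": [0.42, 10, 30, 50, 70, None]})
bins = [0, 2, 18, 45, 65, float("inf")]
labels = ["Baby", "Child", "Adult", "Senior", "Elderly"]
df["AgeCategory"] = pd.cut(df["Age"], bins=bins, labels=labels, right=False)
print(df)
```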
Save the dataset.
Traveling alone
While the variables Parch (the number of parents or children traveling with a passenger) and SibSp (the number of siblings or spouses traveling with a passenger) do not seem to have much influence on the target, we can use them to create a new feature Alone: a boolean variable, where True represents a passenger traveling alone.
[Alone]=in([Parch]+[SibSp],0)
The function in(<search_expr>, <expr_1>, ..., <expr_20>) indicates whether any of the specified <expr_N> values (here: a single value of 0) is equal to the <search_expr> value (here: [Parch]+[SibSp], the sum of the variables Parch and SibSp).
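In plain Python, the same check could look like this tiny illustrative sketch (not SAP Analytics Cloud code):

```python
# A passenger travels alone when the sum of Parch and SibSp is 0,
# the same check as in([Parch]+[SibSp], 0).
def is_alone(parch: int, sibsp: int) -> bool:
    return (parch + sibsp) == 0

print(is_alone(0, 0))  # True
print(is_alone(1, 2))  # False
```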
You can use the in-place help of the Custom Expression Editor to find or check available functions.
Save the dataset.
Side and Deck of the ship
Even the seemingly useless Cabin variable can be used to derive some new features once we know (after a little research and finding appropriate references) that its first letter represents the deck, and the parity of the last digit represents the side of the ship.
[Side] = in(toInteger(substr([Cabin], length([Cabin]), 1)), 1, 3, 5, 7, 9)
The value 1 represents the starboard (or “right”) side of a vessel, and the value 0 represents the port (or “left”) side.
[Deck] = if([Cabin]=='', 'Unknown', substr([Cabin], 1, 1))
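For readers who want to verify the logic outside of SAP Analytics Cloud, here is a minimal Python sketch of both derivations; the odd/even cabin-number convention follows the references mentioned above:

```python
# Derive Side and Deck from a cabin string such as "C85";
# an empty string means the cabin is unknown.
def side_and_deck(cabin: str):
    if not cabin:
        return None, "Unknown"
    deck = cabin[0]  # the first letter is the deck
    last = cabin[-1]
    # Odd cabin numbers were on the starboard side, even ones on the port side.
    side = (int(last) % 2 == 1) if last.isdigit() else None
    return side, deck

print(side_and_deck("C85"))  # (True, 'C'), starboard
print(side_and_deck("B28"))  # (False, 'B'), port
print(side_and_deck(""))     # (None, 'Unknown')
```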
Unfortunately, with almost 77% of the values missing, we should not expect these variables to be influential, but it was a nice try.
I would not expect substituting missing values with the mean or with the most common category to make sense here, so I leave these values as calculated.
Save the dataset.
So, we created several new columns in the dataset while practicing writing expressions with the formulas provided by SAP Analytics Cloud’s data-wrangling capabilities.
Train new model(s)
We added five new variables, or features, to the original dataset. It is time to train a new predictive model and see what improvements, if any, we get!
- Go to the Predictive Scenarios
- Open Titanic, which we created in the first post
- Create a new model. I added a description “A train dataset with new engineered features”
- Select train_engineered dataset as the training data source
- Click on Edit Column Details and check PassengerId as a key column
- Select Survived as the Target
- Exclude the following variables from Influencers: Age, SibSp, Parch, Name, Ticket, Cabin
- Click Train
A new Model 2 should be created, trained, and automatically deployed.
Debriefing
Once Model 2 is trained, we can compare its performance indicators with those of the previous model. Indeed, we achieved an improvement of about 2 percentage points in the Predictive Power and about 3 percentage points in the Prediction Confidence.
Let’s check Influencer Contributions.
The first thing I noticed is that while the original Age variable was the #3 contributor in Model 1, in this new Model 2 the feature AgeCategory contributes the least to explaining the target.
Another iteration
The sequence of activities in the Machine Learning process is not strict, and moving back and forth between different phases is normal.
In real life, we might go through a few or sometimes even dozens of iterations of training models to find the one with the best performance. That’s the reality of Data Science.
So, based on the observation that Age influenced the model more than our AgeCategory variable, we might want to either review and modify how we group age values, or test the hypothesis that the plain Age variable is better categorized by the automated algorithms of Smart Predict.
Once again, let’s create a new model. It will be similar to the previous Model 2, the only difference being that this time we exclude AgeCategory, and not Age, from the influencers. Train the model.
This time the third model’s prediction confidence slightly improved over Model 2, but the predictive power dropped to the level of Model 1. In simple words, based on the validation subset of the training dataset, it makes slightly more incorrect predictions than Model 2, but makes them more consistently than Model 1 🤓
Now, it is your turn!
If you followed this series and created these models based on the Titanic dataset, now is the time to practice and see if you can improve the model any further: perhaps by coming up with some more engineered features, removing or imputing missing values, or practicing feature selection.
Please share your results in the comments below!
Optional: Validate models against the Ground Truth
While Kaggle’s Titanic challenge test dataset is missing the Survived column, for learning purposes it would be really nice to compare the predictions we get when applying the trained models against the real values, aka the Ground Truth.
This data is available in the dataset titanic3.csv, which is among the datasets provided by the Vanderbilt University Medical Center (VUMC). I used that dataset to extend the original test.csv with a Survived column representing the ground truth, producing the new test_survived.csv.
While the code used to combine these two files goes beyond the scope of this series, you can find the Jupyter notebook with the code in the SAP Tech Bytes repository.
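The actual code is in that notebook, but a hedged sketch of one possible approach in pandas could look as follows; the join on passenger names and the exact column names are my assumptions, not necessarily what the notebook does:

```python
import pandas as pd

# One possible way to add the ground truth to Kaggle's test set.
# The two files share no common ID, so passenger names are assumed
# to be usable as a join key (the real notebook may differ).
test = pd.read_csv("test.csv")       # Kaggle test set, no Survived column
truth = pd.read_csv("titanic3.csv")  # VUMC dataset with the outcomes

truth = truth[["name", "survived"]].rename(
    columns={"name": "Name", "survived": "Survived"})
test.merge(truth, on="Name", how="left").to_csv(
    "test_survived.csv", index=False)
```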
Create the test_survived dataset
In SAP Analytics Cloud, import the test_survived.csv file to create and save a new dataset test_survived. This dataset can be used to check our Model 1 predictive model.
To check the second and third models created today, we need to add all the new features (columns) using the same formulas as in the train_engineered dataset above.
To apply the same transformation log (1) from train_engineered, let’s open that dataset and use the Reimport Data button (2) to load the test_survived.csv file.
⚠️ Save the dataset as test_survived_engineered and do not overwrite the opened train_engineered dataset!
Making predictions
In the Titanic predictive scenario, open Model 2 and click the Apply Predictive Model button.
- Set source to test_survived_engineered created in the previous activity.
- Add Survived to replicated columns.
- Select Predicted Category to be added to the output dataset.
- Set output dataset to test_survived_engineered-predictions2.
Repeat the same with Model 3, but call the output test_survived_engineered-predictions3.
In the end, two new datasets should be created.
We use the same expression as in the previous post to add a Match column…
[Match] = if(
[Survived] == 1,
if ([Predicted Category]==1, "True Positive", "False Negative"),
if ([Predicted Category]==0, "True Negative", "False Positive")
)
…to build our own “confusion matrix” for both generated datasets.
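If you prefer to double-check the counts outside of SAP Analytics Cloud, the same matching logic can be sketched in pandas; the CSV file name here is a hypothetical export of the prediction dataset:

```python
import pandas as pd

# Apply the same Match logic to an exported prediction dataset
# (hypothetical file name; adjust to your actual export).
df = pd.read_csv("test_survived_engineered-predictions2.csv")

def match(row):
    if row["Survived"] == 1:
        return "True Positive" if row["Predicted Category"] == 1 else "False Negative"
    return "True Negative" if row["Predicted Category"] == 0 else "False Positive"

df["Match"] = df.apply(match, axis=1)
print(df["Match"].value_counts())  # the counts behind the "confusion matrix"
```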
And here is the “confusion matrix” for the predictions made using Model 3…
…which happened to produce 3 more correct negative predictions.
The life cycle of ML models
As we have just experienced, the Machine Learning process requires multiple iterations to train, evaluate, compare, and choose the model to be used for predictions (aka “to be deployed for production use”).
In a business scenario where new data is recorded constantly, the difference between new observations and their predicted values would be used to monitor the quality of the production model and, potentially, to re-train the model as its quality degrades.
But we will leave the life cycle of ML models for another time.
Keep data-sciencing,
-Vitaliy, aka @Sygyzmundovych
Great post, Vitaliy
As a side note on your observation: a "jonkheer" is an honorific in Dutch, a contraction of jong (young) and heer (sir). In German: jung + Herr = Junker.
As nobility has gone a bit out of fashion in the last couple of centuries, you won't find the title used much anymore; it now mainly lives on as a surname (Jonckheere in Flemish).
Thanks, Denys van Kempen. Since my very first customer project, I have truly enjoyed every time I learn something new about the world while working with technology!