SAP Tech Bytes: Feature Engineering using Data Wrangling
In the previous posts of the series, we have built our predictive model using Smart Predict in SAP Analytics Cloud with just a few clicks. And then we looked at some key Performance Indicators of the model built for us by Smart Predict.
Improving Model Performance
Smart Predict greatly simplifies the process of training a Machine Learning model thanks to the automation of many steps and decisions. But we still have some techniques available to us, such as feature engineering or imputation of missing values, that we can use to improve model performance.
What is Feature Engineering
The post Feature Engineering Tips for Data Scientists gives a simple and intuitive explanation:
Feature engineering is simply the thoughtful creation of new input fields from existing input data. Thoughtful is the key word here. The newly created inputs must have some relevance to the model output and generally come from knowledge of the domain.
There are some variables in Kaggle's Titanic datasets, like the Cabin number, which seemed irrelevant at first. But can we derive some more meaningful variables from them?
Let’s try to apply our (intuitive and acquired) domain knowledge to improve the model performance. We can create new variables in the training and test datasets using the data-wrangling capabilities of SAP Analytics Cloud.
Honorifics, including common titles, were customary at that time. We have them as a part of the Name column in the provided datasets. Let's extract them as a new category.
We used the Custom Expression Editor and the Wrangling Expression Language (WEL) to define our own transformations in the previous post already.
While in the train dataset, open the Custom Expression Editor and define a new column Title based on the following expression:
[Title] = element(split(element(split([Name],'.',1),1),",",2),2)
It splits each name based on the observation that names follow a pattern: the last name at the beginning, followed by a comma, then a title (i.e. an honorific), then a dot, and finally the rest of the passenger's name(s).
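To make the string logic easier to follow, here is a hypothetical Python equivalent of the WEL [Title] expression above. It is only an illustration of the splitting pattern; in SAP Analytics Cloud itself you use the WEL expression, not Python.

```python
def extract_title(name: str) -> str:
    """Extract the honorific from a 'Last, Title. Rest of name' pattern.

    Mirrors the WEL expression: split on '.', take the first part,
    then split that on ',' and take the second part.
    """
    before_dot = name.split(".", 1)[0]      # e.g. "Braund, Mr"
    return before_dot.split(",", 1)[1].strip()  # e.g. "Mr"

print(extract_title("Braund, Mr. Owen Harris"))   # Mr
print(extract_title("Heikkinen, Miss. Laina"))    # Miss
```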
Frankly, I had never heard of Jonkheer before! We live and learn.
Save this modified dataset as train_engineered.
The Age variable was defined as having a continuous statistical data type when we trained the previous model. Can we apply our human knowledge to split passengers into groups by their known age?
There are different possible approaches, like five-year age groups or a life-cycle grouping. Based on a bit of research, I came up with the grouping below. But this is one of those features where several approaches can be modeled and tested to get the best model.
[AgeCategory] = if( [Age] >= 65, 'Elderly', if( [Age] >= 45, 'Senior', if( [Age] >= 18, 'Adult', if( [Age] >= 2, 'Child', if( [Age] > 0, 'Baby', null ) ) ) ) )
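The nested if() calls can be hard to read, so here is a hypothetical Python rendering of the same grouping, shown only to make the bucket boundaries explicit:

```python
def age_category(age):
    """Map an age in years to the same buckets as the WEL [AgeCategory] expression."""
    if age is None:
        return None        # missing ages stay empty
    if age >= 65:
        return "Elderly"
    if age >= 45:
        return "Senior"
    if age >= 18:
        return "Adult"
    if age >= 2:
        return "Child"
    if age > 0:
        return "Baby"
    return None            # non-positive ages stay empty, as in the WEL expression

print(age_category(30))    # Adult
print(age_category(0.5))   # Baby
```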
Save the dataset.
Parch (the number of parents or children traveling with a passenger) and SibSp (the number of siblings or spouses traveling with a passenger) do not seem to provide much influence on the target by themselves, but we can use them to create a new feature Alone: a boolean variable, where True represents a passenger traveling alone.
in(<search_expr>, <expr_1>, ..., <expr_20>) indicates whether any of the specified <expr_N> values (here a single value, 0) is equal to the <search_expr> value (here [Parch]+[SibSp], the sum of the values of the Parch and SibSp variables).
You can use the in-place help of the Custom Expression Editor to find or check available functions.
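The logic behind the Alone feature boils down to a simple check. A hypothetical Python sketch of the same idea:

```python
def is_alone(parch: int, sibsp: int) -> bool:
    """True when a passenger has no parents/children (Parch) and
    no siblings/spouse (SibSp) aboard, i.e. Parch + SibSp equals 0."""
    return (parch + sibsp) == 0

print(is_alone(0, 0))  # True
print(is_alone(1, 2))  # False
```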
Save the dataset.
Side and Deck of the ship
Even the seemingly useless Cabin variable can be used to derive new features once we know (after a little research and finding appropriate references) that its first letter represents the deck, and the parity of its last digit represents the side of the ship.
[Side] = in(toInteger(substr([Cabin], length([Cabin]), 1)), 1, 3, 5, 7, 9)
1 represents the starboard (or "right") side of a vessel, and the value 0 represents the port (or "left") side.
[Deck] = if([Cabin]=='', 'Unknown', substr([Cabin], 1, 1))
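As a hypothetical Python illustration of the two WEL expressions above (an odd last digit means starboard, and the first letter is the deck):

```python
def cabin_side(cabin: str):
    """True (starboard) when the cabin's last digit is odd,
    False (port) when it is even, None when there is no trailing digit."""
    if not cabin or not cabin[-1].isdigit():
        return None
    return int(cabin[-1]) % 2 == 1

def cabin_deck(cabin: str) -> str:
    """The first letter of the cabin, or 'Unknown' when the cabin is missing."""
    return cabin[0] if cabin else "Unknown"

print(cabin_side("C85"), cabin_deck("C85"))  # True C
print(cabin_side(""), cabin_deck(""))        # None Unknown
```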
Unfortunately, with almost 77% of values missing, we should not expect these variables to be very influential, but it was worth a try.
I would not expect substituting the missing values with the mean or with the most common category to make sense here, so I leave these values as calculated.
Save the dataset.
So, we created several new columns in the dataset while practicing writing expressions using formulas provided by SAP Analytics Cloud’s data wrangling capabilities.
Train new models
We added five new variables, or features, to the original dataset. It is time to train a new predictive model and see what improvements, if any, we get!
- Go to the Predictive Scenarios
- Open Titanic, which we created in the first post
- Create a new model. I added the description "A train dataset with new engineered features"
- Select the train_engineered dataset as the training data source
- Click on Edit Column Details and check PassengerId as a key column and Survived as the Target
- Exclude the following variables from Influencers:
- Click Train
Model 2 should be created, trained, and automatically deployed.
Once Model 2 is trained, we can compare its performance indicators with the previous model's. Indeed, we achieved about a 2-percentage-point improvement in the Predictive Power and about a 3-percentage-point improvement in the Prediction Confidence.
Let’s check Influencer Contributions.
The first thing I noticed is that while the original Age variable was the #3 contributor in Model 1, in this new Model 2 the feature AgeCategory contributes the least to explaining the target.
The sequence of the activities in the Machine Learning process is not strict and moving back and forth between different phases is normal.
In real life, we might go through a few or sometimes even dozens of iterations of training models to find the one with the best performance. That’s the reality of Data Science.
So, based on the observation that Age influenced the model more than our AgeCategory variable, we might want either to review and modify how we group age values, or to test the hypothesis that the plain Age variable is better categorized by Smart Predict's automated algorithms.
Once again, let's create a new model. It will be similar to the previous Model 2, with the only difference that we now exclude AgeCategory, and not Age, from the influencers. Train the model.
This time the third model's Prediction Confidence slightly improved over Model 2, but the Predictive Power dropped to the level of Model 1. In simple words, based on the validation subset of the training dataset, it makes slightly more incorrect predictions than Model 2, but makes them more consistently than Model 1 🤓
Now, it is your turn!
If you followed this series of creating models based on the Titanic dataset, then you should practice now: see if you can improve the model any further, maybe by coming up with some more engineered features, removing or imputing missing values, or practicing feature selection.
Please share your results in the comments below!
Optional: Validate models against the Ground Truth
While Kaggle's Titanic challenge test dataset is missing the Survived column, for learning purposes it would be really nice to compare the predictions we get from applying the trained models with the real values, aka the Ground Truth.
This data is available in the titanic3.csv dataset, which is among the datasets provided by the Vanderbilt University Medical Center (VUMC). I used that dataset to extend the original test.csv with the Survived column representing the ground truth, and to get the new test_survived.csv file.
While the code used to combine these two files goes beyond the scope of this series, you can find the Jupyter notebook with the code in the SAP Tech Bytes repository.
In SAP Analytics Cloud, import the test_survived.csv file to create and save a new dataset test_survived. This dataset can be used to check our Model 1 predictive model.
To check the second and third models created today, we would need to add all the new features (columns) using the same formulas as in the train_engineered dataset above.
To apply the same transformation log (1) from train_engineered, let's open that dataset and use the Reimport Data button (2) to load the test_survived.csv file.
⚠️ Save the dataset as test_survived_engineered and do not overwrite the opened train_engineered dataset.
In the Titanic predictive scenario, open Model 2 and click the Apply Predictive Model button.
- Set the source to test_survived_engineered, created in the previous activity.
- Add Survived to the replicated columns.
- Select Predicted Category to be added to the output dataset.
- Set output dataset to
Repeat the same steps with Model 3, but call the output
In the end, two new datasets should be created.
We use the same expression as in the previous post to add a Match column:
[Match] = if( [Survived] == 1, if ([Predicted Category]==1, "True Positive", "False Negative"), if ([Predicted Category]==0, "True Negative", "False Positive") )
…to build our own “confusion matrix” for both generated datasets.
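Tallying those four Match labels is all a confusion matrix is. A hypothetical Python sketch of the same logic, assuming rows of (Survived, Predicted Category) pairs:

```python
from collections import Counter

def match_label(survived: int, predicted: int) -> str:
    """Same logic as the WEL [Match] expression above."""
    if survived == 1:
        return "True Positive" if predicted == 1 else "False Negative"
    return "True Negative" if predicted == 0 else "False Positive"

def confusion_counts(rows):
    """rows: iterable of (Survived, Predicted Category) pairs."""
    return Counter(match_label(s, p) for s, p in rows)

# Toy example with five passengers:
print(confusion_counts([(1, 1), (1, 0), (0, 0), (0, 0), (0, 1)]))
```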
And here are the "confusion matrices" for the predictions made using both models…
…which happened to produce 3 more correct negative predictions.
The life cycle of ML models
As we have just experienced, the Machine Learning process requires multiple iterations to train, evaluate, compare, and choose the model to be used for predictions (aka "to be deployed for production use").
In a business scenario where new data is recorded constantly as time flows, the difference between the new observations and their predicted values would be used to monitor the quality of the production model, and potentially to re-train the model as its quality degrades.
But we will leave the life cycle of ML models for another time.
-Vitaliy, aka @Sygyzmundovych