Skip to Content
Technical Articles
Author's profile photo Witalij Rudnicki

SAP Tech Bytes: Feature Engineering using Data Wrangling

In the previous posts of the series, we have built our predictive model using Smart Predict in SAP Analytics Cloud with just a few clicks. And then we looked at some key Performance Indicators of the model built for us by Smart Predict.

Improving Model Performance

Smart Predict greatly simplifies the process of training a Machine Learning model thanks to the automation of many steps and decisions. But we still have some technics available to us — such as feature engineering or imputation of missing values — that we can use to improve model performance.

These and other techniques were presented by Stuart Clarke in unit 3 “Improving Model Performance” of week 5 of the openSAP course Getting Started with Data Science.

What is Feature Engineering

The post Feature Engineering Tips for Data Scientists gives a simple and intuitive explanation:

Feature engineering is simply the thoughtful creation of new input fields from existing input data. Thoughtful is the key word here. The newly created inputs must have some relevance to the model output and generally come from knowledge of the domain.

There are some variables in the Kaggle’s Titanic datasets, like Name or Cabin number, which seemed irrelevant at first. But can we get some more meaningful variables from them?

Data wrangling

Let’s try to apply our (intuitive and acquired) domain knowledge to improve the model performance. We can create new variables in the training and test datasets using the data-wrangling capabilities of SAP Analytics Cloud.

Check Introduction to Smart Data Wrangling by Josef Hampp and Next gen Data Wrangling and Agile Analytics by David Stocker to learn more.

Titles

Honorifics, incl. common titles, were usual at that time. We have them as a part of the Name column in provided datasets. Let’s extract them as a new category.

We used the Custom Expression Editor and the Wrangling Expression Language (WEL) to define our own transformations in the previous post already.

While in the train dataset, open the Custom Expression Editor and define a new column Title based on the following expression.

[Title] = element(split(element(split([Name],'.',1),1),",",2),2)

It splits parts of each name based on the observation that it follows the pattern: the last name at the beginning followed by a comma followed by a title (ie. honorific) and then followed by a dot and the rest of the name(s) of a passenger.

Frankly, I have never heard about Johnkheer before! We live and learn.

Save this modified dataset as train_engineered in Titanic folder.

Age groups

An Age variable was defined as having a continuous statistical data type when we trained a previous model. Can we apply our human knowledge to split passengers into groups by their known age?

There are different approaches possible, like five-year age groups, or a life-cycle grouping. Based on a bit of research I came with the below grouping. But this is one of these features, where several approaches can be modeled and tested to get the best model.

[AgeCategory] = if(
	[Age] >= 65, 
	'Elderly', 
	if(
		[Age] >= 45, 
		'Senior', 
		if(
			[Age] >= 18, 
			'Adult', 
			if(
				[Age] >= 2, 
				'Child', 
				if(
					[Age] > 0, 
					'Baby', 
					null
				)
			)
		)
	)
)

Save the dataset.

Traveling alone

While variables Parch (a number of parents or children traveling with a passenger) and Sibsp (a number of siblings or a spouse traveling with a passenger) seem not to provide too much influence on a target we can use them to create a new feature Alone. It is a boolean variable, where True represents a passenger traveling alone.

[Alone]=in([Parch]+[SibSp],0)

A function in(<search_expr>, <expr_1>, ..., <expr_20>) indicates whether any of the specified <expr_N> values (here: a single value of 0) is equal to the <search_expr> value (here: [Parch]+[SibSp], or a sum of values of variables Parch and SibSp).

You can use the in-place help of the Custom Expression Editor to find or check available functions.

Save the dataset.

Side and Deck of the ship

Even seemingly unuseful Cabin variable can be used to get some new features if we know (after a little research and finding appropriate references) that its first letter represents the deck, and a parity of the last digit represents a side of the ship.

[Side] = in(toInteger(substr([Cabin], length([Cabin]), 1)), 1, 3, 5, 7, 9)

The value 1 represents a starboard (or the “right”) side of a vessel, and the value 0 represents a port (or the “left”) side.

[Deck] = if([Cabin]=='', 'Unknown', substr([Cabin], 1, 1))

Unfortunately with almost 77% of missing values, we should not expect these variables to be influential, but it was a nice try.

I would not expect a missing variable substitution by the mean value or by the most common category would make sense here, so I leave these values as calculated.

Save the dataset.

So, we created several new columns in the dataset while practicing writing expressions using formulas provided by SAP Analytics Cloud’s data wrangling capabilities.

Train a new model(s)

We added five new variables, or features, to the original dataset. It is time to train a new predictive model and to see what/if improvements we get!

  1. Go to the Predictive Scenarios
  2. Open Titanic, which we created in the first post
  3. Create a new model. I added a description “A train dataset with new engineered features”
  4. Select train_engineered dataset as a training data source
  5. Click on Edit Column Details and check PassendgerId as a key column
  6. Select Survived as a Target
  7. Exclude the following variables from Influencers: Age, SibSp, Parch, Name, Ticket, Cabin
  8. Click Train

A new Model 2 should be created, trained, and automatically deployed.

Debriefing

Once the Model 2 is trained we can compare its performance indicators with the previous model. Indeed we achieved about 2 percentage points improvement in the Predictive Power and about 3 percentage points improvement in the Prediction Confidence.

Let’s check Influencer Contributions.

The first thing I noticed is that while the original Age variable was #3 contributor in the Model 1, in this new Model 2 the feature AgeCategory is contributing the least to explain the target.

Another iteration

The sequence of the activities in the Machine Learning process is not strict and moving back and forth between different phases is normal.

In real life, we might go through a few or sometimes even dozens of iterations of training models to find the one with the best performance. That’s the reality of Data Science.

So, now based on that observation that Age influenced more than our AgeCategory variable, we might want to either review and modify how we group age values or we might test the hypothesis that plain Age variable is better categorized by automated algorithms of Smart Predict.

Once again let’s create a new model. It will be similar to the previous Model 2, with the only difference that now excludes the AgeCategory and not Age from influencers. Train the model.

This time the third model’s prediction confidence slightly improved over the Model 2, but the predictive power dropped to the level of the Model 1. In simple words, based on the validation subset of the training dataset it makes slightly more incorrect predictions than Model 2, but does make them more consistently than Model 1 🤓

Now, it is your turn!

If you followed this series of creating these models based on the Titanic dataset, then you should practice now, if you can improve the model any further, by maybe coming with some more engineered features, removing or imputing missing values, or practicing features selection.

Please share your results in the comments below!


Optional: Validate models against the Ground Truth

While Kaggle’s Titanic challenge dataset for testing is missing Survived column, for learning purposes it would be really nice to compare predictions we are getting when applying trained models to the real values, aka the Ground Truth.

This data is available in the dataset titanic3.csv, which is available among datasets provided by the Vanderbilt University Medical Center (VUMC). I used that dataset to extend the original test.csv with the column Survived representing the ground truth, and to get the new test_survived.csv.

While the code used to combine these two files goes beyond the scope of this series, you can find the Jupyter notebook with the code in the SAP Tech Bytes repository.

Create the test_survived dataset

In SAP Analytics Cloud import the test_survived.csv file to create and save a new dataset test_survived. This dataset can be used to check our Model 1 predicitve model.

To check the second and the third models created today we would need to add all new features (columns) using the same formulas as in train_engineered dataset above.

To apply the same transformation log (1) from the train_engineered let’s open that dataset and use the Reimport Data button (2) to the test_survived.csv file.

⚠️ Save the dataset as test_survived_engineered and do not overwrite the opened train_engineered dataset!

Making predictions

In Titanic predictive scenario open Model 2 and click Apply Predictive Model button.

  1. Set source to test_survived_engineered created in the previous activity.
  2. Add Survived to replicated columns.
  3. Select Predicted Category to be added to the output dataset.
  4. Set output dataset to test_survived_engineered-predictions2.

Repeat the same with the Model 3, but call the output test_survived_engineered-predictions3.

In the end, two new datasets should be created.

We use the same expression as in the previous post to add a Match column…

[Match] = if(
	[Survived] == 1, 
	if ([Predicted Category]==1, "True Positive", "False Negative"), 
	if ([Predicted Category]==0, "True Negative", "False Positive")
)

…to build our own “confusion matrix” for both generated datasets.

And here are the “confusion matrix” for the predictions made using Model 3

…which happened to produce 3 more correct negative predictions.

The life cycle of ML models

As we just have experienced the Machine Learning process requires multiple iterations to train, evaluate, compare and choose the model to be used for predictions (aka. “to be deployed for the production use”).

In the business scenario where new data is recorded constantly as time flows, the difference between all the new observations and their predicted values would be used to monitor the quality of the production model and potentially used to re-train the model as its quality degrades.

But we will leave the life cycle of ML models for another time.


Keep data-sciencing,
-Vitaliy, aka @Sygyzmundovych

Assigned tags

      2 Comments
      You must be Logged on to comment or reply to a post.
      Author's profile photo Denys van Kempen
      Denys van Kempen

      Great post, Vitaliy

      ==

      As a side note on your observation: a "jonkheer" is a honorific in Dutch, contracting jong (young) and heer (sir). Auf Deutsch: jung + Herr = Junker.  For those interested, for the fine print, see

      As nobility has gone out of fashion a bit in the last couple of centuries, you won't find it used much anymore and now mainly lives on as surname (Jonckheere in Flemish) as in

       

       

      Author's profile photo Witalij Rudnicki
      Witalij Rudnicki
      Blog Post Author

      Thanks Denys van Kempen. Since my very first customer project I have truly enjoyed every time I am learning something new about the world while working with the technology!