Feature Engineering Tips for Data Scientists
Most data scientists and statisticians agree that predictive modeling is both art and science, yet little to no air time is given to describing the art. This post describes one piece of the art of modeling called feature engineering, which expands the number of variables you have to build a model. I offer six ways to implement feature engineering and provide examples of each. Using methods like these is important because additional relevant variables tend to increase model accuracy, which makes feature engineering an essential part of the modeling process. The full white paper may be downloaded at Feature Engineering Tips for Data Scientists.
What Is Feature Engineering?
A predictive model is a formula or method that transforms a list of input fields or variables (x1, x2, …, xn) into some output of interest (y). Feature engineering is simply the thoughtful creation of new input fields (z1, z2, …, zm) from existing input data (x). Thoughtful is the key word here. The newly created inputs must have some relevance to the model output and generally come from knowledge of the domain (such as marketing, sales, climatology, and the like). The more a data scientist interacts with the domain expert, the better the feature engineering process.
Take, for example, the case of modeling the likelihood of rain given a set of daily inputs: temperature, humidity, wind speed, and percentage of cloud cover. We could create a new binary input variable called “overcast” where the value equals “no” or 0 whenever the percentage of cloud cover is less than 25% and equals “yes” or 1 otherwise. Of course, domain knowledge is required to define the appropriate cutoff percentage and is critical to the end result.
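The overcast flag can be sketched in a few lines of Python. This is only an illustration: the function name, the field names, and the sample values are hypothetical, and the 25% cutoff is the one used in the example above, which a domain expert would need to confirm.

```python
def overcast_flag(cloud_cover_pct, cutoff_pct=25.0):
    """Return 0 ("no") below the cutoff and 1 ("yes") otherwise."""
    return 0 if cloud_cover_pct < cutoff_pct else 1


# Hypothetical daily observations; field names and values are made up.
daily_inputs = [
    {"temperature_f": 71, "humidity_pct": 40, "wind_mph": 5, "cloud_cover_pct": 10},
    {"temperature_f": 64, "humidity_pct": 85, "wind_mph": 12, "cloud_cover_pct": 25},
    {"temperature_f": 58, "humidity_pct": 92, "wind_mph": 18, "cloud_cover_pct": 90},
]

# Add the engineered input alongside the original fields.
for day in daily_inputs:
    day["overcast"] = overcast_flag(day["cloud_cover_pct"])

overcast_values = [day["overcast"] for day in daily_inputs]
```

Keeping the cutoff as a parameter makes it easy to try the alternative thresholds a domain expert might suggest.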
In general, the more thoughtful inputs you have, the better the accuracy of your model. This holds whether you are building logistic, generalized linear, or machine learning models.
Six Tips for Better Feature Engineering
Tip 1: Think about inputs you can create by rolling up existing data fields to a higher/broader level or category. As an example, a person’s title can be categorized into strategic or tactical. Those with titles of “VP” and above can be coded as strategic. Those with titles “Director” and below become tactical. Strategic contacts are those that make high-level budgeting and strategic decisions for a company. Tactical are those in the trenches doing day-to-day work. Other roll-up examples include:
- Collating several industries into a higher-level industry: Collate oil and gas companies with utility companies, for instance, and call it the energy industry, or fold high tech and telecommunications industries into a single area called “technology.”
- Defining “large” companies as those that make $1 billion or more and “small” companies as those that make less than $1 billion.
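The roll-ups above might look something like the following sketch. The lookup tables are assumptions for illustration: the list of "VP and above" titles is not exhaustive, unrecognized titles default to tactical, unmapped industries fall into a catch-all "other" group, and company size is assumed to be measured by annual revenue in US dollars.

```python
# Assumed, non-exhaustive list of titles at "VP" level and above.
STRATEGIC_TITLES = {"ceo", "cfo", "coo", "president", "svp", "vp"}

# Lower-level industries mapped to the broader groups named in the text.
INDUSTRY_ROLLUP = {
    "oil and gas": "energy",
    "utilities": "energy",
    "high tech": "technology",
    "telecommunications": "technology",
}


def title_level(title):
    """Roll a job title up to 'strategic' or 'tactical'."""
    return "strategic" if title.strip().lower() in STRATEGIC_TITLES else "tactical"


def broad_industry(industry):
    """Roll an industry up to its broader group ('other' if unmapped)."""
    return INDUSTRY_ROLLUP.get(industry.strip().lower(), "other")


def company_size(annual_revenue_usd):
    """Label a company 'large' at $1 billion or more, else 'small'."""
    return "large" if annual_revenue_usd >= 1_000_000_000 else "small"


# Hypothetical record, for illustration only.
contact = {"title": "VP", "industry": "Oil and Gas", "annual_revenue_usd": 2_500_000_000}
rolled_up = {
    "title_level": title_level(contact["title"]),
    "broad_industry": broad_industry(contact["industry"]),
    "company_size": company_size(contact["annual_revenue_usd"]),
}
```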
Tip 2: Think about ways to drill down into more detail in a single field. As an example, a contact within a company may respond to marketing campaigns, and you may have information about his or her number of responses. Drilling down, we can ask how many of these responses occurred in the past two weeks, one to three months, or more than six months in the past. This creates three additional binary (yes=1/no=0) data fields for a model. Other drill-down examples include:
- Cadence: Number of days between consecutive marketing responses by a contact: 1–7, 8–14, 15–21, 22+
- Multiple responses on same day flag (multiple responses = 1, otherwise =0)
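As a rough sketch of drilling down, the function below turns a contact's list of response dates into recency flags, a same-day flag, and the gaps between consecutive responses. The function and key names are hypothetical, and the day counts are assumptions: two weeks is taken as 14 days and a month as 30 days.

```python
from collections import Counter
from datetime import date


def drill_down_features(response_dates, as_of):
    """Derive recency flags and cadence gaps from one contact's responses."""
    ages = [(as_of - d).days for d in response_dates]  # days since each response
    ordered = sorted(response_dates)
    # Days between consecutive responses (0 means two responses on one day).
    gaps = [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]
    return {
        "responded_past_2_weeks": int(any(0 <= a <= 14 for a in ages)),
        "responded_1_to_3_months_ago": int(any(30 <= a <= 90 for a in ages)),
        "responded_over_6_months_ago": int(any(a > 180 for a in ages)),
        "multiple_same_day": int(any(n > 1 for n in Counter(response_dates).values())),
        "gaps_in_days": gaps,
    }


# Hypothetical response history for a single contact.
responses = [date(2024, 6, 25), date(2024, 6, 25), date(2024, 4, 15), date(2023, 11, 1)]
features = drill_down_features(responses, as_of=date(2024, 6, 30))
```

The raw gaps can then be grouped into cadence ranges using the binning approach in Tip 3.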
Tip 3: Split data into separate categories, also called bins. For example, annual revenue for companies in your database may range from $50 million (M) to over $1 billion (B). Split the revenue into sequential bins: $50–$200M, $201–$500M, $501M–$1B, and $1B+. Whenever a company falls within a revenue bin, it receives a one for that bin; otherwise the value is zero. There are now four new data fields created from the annual revenue field. Other examples are:
- Number of marketing responses by contact: 1–5, 6–10, 11+
- Number of employees in company: 1–100, 101–500, 501–1,000, 1,001–5,000, 5,001+
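A minimal sketch of the revenue binning, with revenue expressed in millions of US dollars. The boundary handling is an assumption, since the text does not say where values between the listed ranges belong: upper edges are treated as inclusive, exactly $1 billion goes to the top bin, and anything below $50M is placed in the lowest bin. The function and field names are hypothetical.

```python
REVENUE_BIN_LABELS = ["50M-200M", "201M-500M", "501M-1B", "1B+"]


def revenue_bin(revenue_musd):
    """Return the bin label for annual revenue given in millions of USD."""
    if revenue_musd >= 1000:
        return "1B+"
    if revenue_musd > 500:
        return "501M-1B"
    if revenue_musd > 200:
        return "201M-500M"
    return "50M-200M"  # assumption: values under 50M also land here


def revenue_bin_flags(revenue_musd):
    """One 0/1 field per bin; exactly one of them is 1."""
    label = revenue_bin(revenue_musd)
    return {f"revenue_{name}": int(name == label) for name in REVENUE_BIN_LABELS}


# Hypothetical company with $350M in annual revenue.
flags = revenue_bin_flags(350)
```

Each company ends up with four new 0/1 fields, matching the four bins described in the tip.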
Tip 4: Think about ways to combine existing data fields into new ones. As an example, you may want to create a flag (0/1) that identifies whether someone is a VP or higher and has more than 10 years of experience. Other examples of combining fields include:
- Title of director or below and in a company with fewer than 500 employees
- Public company and located in the midwestern United States
You can even multiply, divide, add, or subtract one data field by another to create a new input.
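The combinations above could be sketched as follows. All field names and the list of senior titles are assumptions made for illustration, and revenue per employee is just one hypothetical example of dividing one field by another.

```python
# Assumed, non-exhaustive list of titles at "VP" level and above.
VP_OR_HIGHER = {"ceo", "cfo", "coo", "president", "svp", "vp"}


def combined_features(contact):
    """Build flags and a ratio from pairs of existing fields."""
    is_vp_plus = contact["title"].strip().lower() in VP_OR_HIGHER
    employees = contact["company_employees"]
    return {
        # VP or higher AND more than 10 years of experience.
        "vp_plus_and_10y_experience": int(is_vp_plus and contact["years_experience"] > 10),
        # Director or below AND fewer than 500 employees.
        "director_or_below_small_company": int(not is_vp_plus and employees < 500),
        # Public company AND located in the Midwest.
        "public_midwest": int(contact["is_public"] and contact["region"] == "midwest"),
        # Arithmetic combination: one field divided by another.
        "revenue_per_employee": (
            contact["company_revenue_usd"] / employees if employees > 0 else None
        ),
    }


# Hypothetical contact record.
contact = {
    "title": "VP",
    "years_experience": 12,
    "company_employees": 350,
    "is_public": True,
    "region": "midwest",
    "company_revenue_usd": 70_000_000,
}
combined = combined_features(contact)
```

Guarding the division against a zero employee count avoids a runtime error on incomplete records; how to treat such records in a model is a separate decision.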
Tip 5: Don’t reinvent the wheel – use variables that others have already fashioned.
Tip 6: Think about the problem at hand and be creative. Don’t worry about creating too many variables at first; just let the brainstorming flow. Feature selection methods are available to deal with a large input list; see the excellent description in Matthew Shardlow’s “An Analysis of Feature Selection Techniques.” Be cautious, however, of creating too many features if you have a small amount of data to fit. In that case you may overfit the data, which can lead to spurious results.
Hundreds, even thousands, of new variables can be created using the simple techniques described here. The key is to develop thoughtful additional variables that seem relevant to the target or dependent variable. So go ahead, be creative, have fun, and enjoy the process.