Machine Learning Thursdays: Predictive Analysis and Spurious Correlations Part 2
In part 1, we saw that spurious correlations are likely to be observed between time-variables and a time-dependent target. In the absence of underlying causal relations, such variables have no predictive power and should be excluded from predictive models. Now, we’ll see how to build more robust predictors through feature engineering, using a tool such as SAP Predictive Analytics Data Manager.
For a more business-related example, let’s take the Bank Marketing dataset available at: https://archive.ics.uci.edu/ml/datasets/bank+marketing. This dataset was collected by a Portuguese bank during multiple direct marketing campaigns. Each observation describes a phone call and its context (call duration, customer profile, previous contacts, macroeconomic indicators). The binary classification target is positive if the client agrees to subscribe to a term deposit.
The original version of this dataset was first studied by [Moro et al., 2014]¹. They found a rather counterintuitive correlation: the propensity to subscribe to a term deposit was negatively correlated with the Euribor-3m rate. Euribor stands for Euro Interbank Offered Rate; it is based on the averaged interest rates at which Eurozone banks lend funds to other banks, and it drives most money market rates, including those of term deposits.
Why would a client be more likely to put money in the bank when the interest rate is low? S. Moro et al. proposed that this correlation stems from a common cause (a hidden C causing both visible A and B): the 2008 worldwide financial crisis. Indeed, the crisis caused the central banks to intervene in the money market, with the effect of lowering interest rates. The authors proposed that the crisis may also have caused clients to feel insecure about their professional prospects and thus to put more savings into low-risk deposits.
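To make the common-cause pattern concrete, here is a minimal simulation sketch. Everything in it is an assumption made for illustration: the hidden `crisis` variable, the coefficients, and the noise levels are invented and are not estimated from the bank dataset. It only shows that two variables driven by the same hidden cause can be strongly correlated, and that the correlation largely disappears once the common cause is accounted for.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def residuals(ys, xs):
    """Residuals of a least-squares straight-line fit of ys on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return [y - (mean_y + slope * (x - mean_x)) for x, y in zip(xs, ys)]


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hidden common cause C: a made-up "crisis intensity" between 0 and 1.
crisis = [rng.random() for _ in range(5000)]
# Visible A: an interest rate that falls as the crisis intensifies.
rate = [5.0 - 4.0 * c + rng.gauss(0, 0.3) for c in crisis]
# Visible B: a savings propensity index that rises with the crisis.
propensity = [0.2 + 0.4 * c + rng.gauss(0, 0.05) for c in crisis]

# A and B never influence each other, yet they are strongly correlated.
r_raw = pearson(rate, propensity)
# After removing the linear effect of C from both, little correlation remains.
r_given_crisis = pearson(residuals(rate, crisis), residuals(propensity, crisis))

print(f"raw correlation: {r_raw:.2f}")
print(f"correlation after removing the common cause: {r_given_crisis:.2f}")
```

In the real dataset the crisis intensity is not observed, which is exactly why the Euribor can end up acting as its stand-in.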
On this dataset, SAP Predictive Analytics Automated also finds that the Euribor-3m is a top influencer (second only to the call duration). A closer look at the expert debriefing nevertheless shows large deviations for most variables across large time slices, especially for the Euribor-3m, which is stable and close to 5% at the beginning of the dataset and ends close to 1%. The dataset is ordered by date, but the actual dates are not provided.
However, it’s easy to infer the dates from the provided Euribor variable and from tables of historical Euribor rates. The following plot shows how the monthly subscription rate (for months with at least 30 phone calls) evolved alongside the Euribor-3m rate.
The subscription rate was indeed much lower when the Euribor was high, but:
- The Euribor is obviously a “time-variable” with a slow evolution
- The subscription rate also mostly follows a smooth trend
- There are some exceptions to an otherwise strong correlation:
  - A peak in October 2008, with both a record Euribor and a high subscription rate, followed immediately by a sharp drop in the subscription rate
  - A drop in the subscription rate in Q2 2009, while the Euribor was close to a 12-month low
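The monthly aggregation behind such a plot can be sketched as follows. This is only an illustration: it assumes each call has already been assigned a month label (the dataset itself does not provide dates), the function name `monthly_subscription_rate` is hypothetical, and the toy records are invented.

```python
from collections import defaultdict


def monthly_subscription_rate(calls, min_calls=30):
    """Share of successful calls per month, skipping months with too few calls.

    `calls` is an iterable of (month_label, subscribed) pairs.
    """
    totals = defaultdict(int)
    positives = defaultdict(int)
    for month, subscribed in calls:
        totals[month] += 1
        positives[month] += int(subscribed)
    return {
        month: positives[month] / totals[month]
        for month in sorted(totals)
        if totals[month] >= min_calls
    }


# Invented toy data: a busy month, a quieter month, and a month below the threshold.
calls = (
    [("2008-05", i % 20 == 0) for i in range(100)]
    + [("2008-10", i % 4 == 0) for i in range(40)]
    + [("2008-12", True) for _ in range(5)]
)

rates = monthly_subscription_rate(calls)
for month, rate in rates.items():
    print(month, f"{rate:.2%}")
```

The `min_calls` threshold mirrors the "at least 30 phone calls" filter: a rate computed on a handful of calls is too noisy to compare with the others.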
Could it be that the correlation between the target (the probability that a client subscribes to a term deposit) and the Euribor-3m is spurious? Or that it has weaker predictive power than expected?
The Difference SAP Predictive Analytics Can Make
When in doubt, it is usually a good idea to search for additional variables, using for instance SAP Predictive Analytics Data Manager to join other data sources or to derive additional “engineered features.” Here, we would like to know whether a phone call was outgoing (the bank calls the client to offer an attractive rate on a term deposit) or incoming (the client calls the bank for whatever reason and then receives a proposal). We can’t get this information, but we can instead compute the average number of daily calls (a count aggregate in Data Manager) to assess whether a given campaign relied on mass outgoing calls.
Here is the result:
And a candidate causal interpretation:
- The first 2008 campaign was likely implemented by mass outgoing calls, with a low success rate (tedious task for bank employees and/or clients unreceptive to unwanted calls)
- Subsequent campaigns were mostly based on incoming calls and/or a few outgoing calls to selectively chosen clients, with a much higher success rate
- Whenever the number of calls was increased (Nov. 2008 or Q2 2009), the success rate dropped.
SAP Predictive Analytics Automated measures an improved predictive power for a model that uses this additional “number of daily calls” engineered feature. Moreover, the number of daily calls is actionable: the bank can adapt its campaign to the employees’ workload and can be more selective about the targets of outgoing calls if any.
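For readers without Data Manager at hand, a count aggregate of this kind can be approximated in a few lines of plain Python. This is a sketch and not Data Manager's API: the `date` field assumes the call dates have already been reconstructed, and the function name, the `daily_calls` column, and the sample records are all hypothetical.

```python
from collections import Counter


def add_daily_call_count(calls):
    """Return copies of the call records with a 'daily_calls' count added."""
    calls_per_day = Counter(call["date"] for call in calls)
    return [{**call, "daily_calls": calls_per_day[call["date"]]} for call in calls]


# Invented records: three calls on a busy day, one call on a quiet day.
calls = [
    {"date": "2008-05-05", "subscribed": False},
    {"date": "2008-05-05", "subscribed": False},
    {"date": "2008-05-05", "subscribed": True},
    {"date": "2009-03-10", "subscribed": True},
]

enriched = add_daily_call_count(calls)
for row in enriched:
    print(row)
```

Each call then carries the workload of the day it was made on, which a classifier can use in place of, or alongside, the Euribor variable.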
Back to the original correlation between the classification target and the Euribor variable: there might indeed be a causal relationship between the subscription rate and the evolution of the financial crisis, but there seems to be little reason to use the Euribor as a proxy for the intensity of the crisis. The Euribor has historically varied even in the absence of any major crisis, and we have no training data for rates in the 2%-3% range nor for the negative rates that occurred from 2016 onwards.
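The coverage problem can be checked mechanically before a model is applied to new data. The sketch below flags values that lie far from anything seen in training. The training rates listed are invented for illustration (they only mimic a series that starts near 5% and ends near 1%), and both `is_covered` and its `tolerance` parameter are hypothetical names.

```python
def distance_to_training(value, training_values):
    """Distance from `value` to the closest value seen in training."""
    return min(abs(value - seen) for seen in training_values)


def is_covered(value, training_values, tolerance=0.25):
    """True if `value` lies within `tolerance` of some training value."""
    return distance_to_training(value, training_values) <= tolerance


# Invented Euribor-3m training values, in percent (not taken from the dataset).
training_rates = [5.0, 4.9, 4.2, 3.3, 1.5, 1.3, 1.0, 0.7]

for new_rate in (1.2, 2.5, -0.3):
    status = "covered" if is_covered(new_rate, training_rates) else "NOT covered"
    print(f"{new_rate:+.1f}%: {status}")
```

A prediction for an uncovered rate is an extrapolation, so the learned relationship between the Euribor and the target gives little guarantee there.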
More generally, when an observed correlation is likely to be coincidental, as is often the case with slowly changing time variables, it can be worth trying to improve the robustness of the predictive model by replacing the suspicious variables with more easily interpretable features.
To learn more about this subject, see:
- All the Thursday series posts for more on machine learning, predictive analytics, and artificial intelligence.
- The predictive Forrester Wave report and the predictive analytics TDWI paper, Machine Learning for Business: Eight Best Practices for Getting Started