Machine Learning Thursdays: Predictive Analysis and Spurious Correlations
There is a saying among statisticians that “correlation doesn’t imply causation.” This is a warning against jumping to the conclusion that a correlation between two variables A and B proves that one variable causes the other. Indeed, a correlation may be observed when:
- A causes B, directly or transitively
- B causes A, directly or transitively
- A and B cause each other in a cycle (for example, the numbers of prey and predators)
- A and B have a common cause C (like increases in ice-cream consumption and forest fires, both caused by summer heat)
- A and B are not related
In the last case, the correlation is just a coincidence and may disappear in future observations: the observed correlation has no predictive power.
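To make the coincidence case concrete, here is a minimal sketch using only synthetic data. It assumes a target and 2,000 candidate variables that are all independent Gaussian noise, with only 12 "past" observations available. The names `N_TRAIN`, `N_FUTURE`, `N_CANDIDATES` and the chosen sizes are illustrative assumptions, not taken from any real dataset.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(7)
N_TRAIN, N_FUTURE, N_CANDIDATES = 12, 200, 2000  # illustrative sizes

# Target and candidates are independent noise: there is no causal link at all.
target = [rng.gauss(0.0, 1.0) for _ in range(N_TRAIN + N_FUTURE)]
candidates = [
    [rng.gauss(0.0, 1.0) for _ in range(N_TRAIN + N_FUTURE)]
    for _ in range(N_CANDIDATES)
]

# Select the candidate that looks best on the few "past" observations.
best = max(
    candidates,
    key=lambda c: abs(pearson(c[:N_TRAIN], target[:N_TRAIN])),
)

past_r = pearson(best[:N_TRAIN], target[:N_TRAIN])
future_r = pearson(best[N_TRAIN:], target[N_TRAIN:])

print(f"correlation on past observations:   {past_r:+.2f}")
print(f"correlation on future observations: {future_r:+.2f}")
```

With so many candidates and so few observations, at least one unrelated variable is very likely to look strongly correlated with the target on the past data, yet that correlation should be close to zero on the later observations.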
The purpose of predictive analysis is to analyze correlations between past or present input variables X1, X2, …, Xn on one hand, and a target variable Y on the other hand. The core assumption is that past correlations will persist in the future, so that a model trained on the past can be used in the future to predict the likely value of the target from the input variables. This assumption is quite safe if the observed correlations are due to causal relationships involving known or even hidden variables, provided the underlying process is stable enough that the same causes continue to produce the same effects.
Spurious Correlations and Predictive Models
But what if some of the correlations observed in the training data are just coincidences? In the absence of underlying causality, such correlations may well vanish in the future. In today’s post, we’ll discuss the perils of spurious correlations when building a predictive model, and give clues about detecting and avoiding them. In part 2, we’ll show how we can extract more robust predictors through feature engineering, using a tool such as SAP Predictive Analytics Data Manager.
There is a well-known situation where variables are likely to be correlated by mere coincidence. In 1926, G. Udny Yule wrote:
“It is fairly familiar knowledge that we sometimes obtain between quantities varying with the time (time-variables) quite high correlations to which we cannot attach any physical significance whatever, although under the ordinary test the correlation would be held to be certainly ‘significant.’”¹
To illustrate this problem, Yule gave the example of the nonsense correlation between the proportion of Church of England marriages on one hand, and the mortality rate on the other hand, in England and Wales over 45 years.
In his seminal 1926 paper, Yule demonstrated empirically that random walks with constant drifts are likely to be correlated, although they have no causal relationship.
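The effect is easy to reproduce with a small simulation. The sketch below is not Yule's original experiment: it assumes Gaussian steps, a drift of 0.2 per step, series of 45 points, and an arbitrary threshold of |r| > 0.5 for a "strong" correlation. It compares pairs of independent random walks with pairs of independent white-noise series.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def random_walk(rng, n, drift):
    """Cumulative sum of Gaussian steps with a constant drift."""
    level, path = 0.0, []
    for _ in range(n):
        level += drift + rng.gauss(0.0, 1.0)
        path.append(level)
    return path


def white_noise(rng, n):
    """Independent Gaussian draws: a stationary series with no trend."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def share_strongly_correlated(make_series, trials=2000, threshold=0.5, seed=0):
    """Fraction of independent pairs whose |r| exceeds the threshold."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        a = make_series(rng)
        b = make_series(rng)  # generated independently of a
        if abs(pearson(a, b)) > threshold:
            hits += 1
    return hits / trials


N = 45  # assumed series length
walk_share = share_strongly_correlated(lambda rng: random_walk(rng, N, drift=0.2))
noise_share = share_strongly_correlated(lambda rng: white_noise(rng, N))

print(f"independent random walks: {walk_share:.0%} of pairs have |r| > 0.5")
print(f"independent white noise:  {noise_share:.0%} of pairs have |r| > 0.5")
```

Both kinds of pairs are independent by construction, so any difference between the two shares comes from the non-stationary nature of the random walks rather than from a relationship between the series.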
Correlations Between Unrelated Variables
More generally, correlation is also commonly observed between two non-stationary processes, such as two unrelated time series following smooth trends. Tyler Vigen has collected amusing examples².
In the absence of causal relationships, past correlation is of little help in predicting future values of one variable from the other. In one of these examples, which pairs margarine consumption with the divorce rate, consumption of margarine may continue to decrease from 2010 onwards while the divorce rate returns to high levels.
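One common way to check whether a correlation between two trending series reflects anything beyond the shared trend is to correlate their period-to-period changes instead of their levels. The sketch below uses made-up synthetic series (two independent downward trends plus noise, 400 points each). The slopes, noise levels, and variable names are assumptions for illustration only, not real margarine or divorce figures.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def differences(series):
    """First differences: the change from each observation to the next."""
    return [b - a for a, b in zip(series, series[1:])]


rng = random.Random(42)
N = 400  # assumed number of observations

# Two synthetic series that both trend downward but are otherwise independent.
series_a = [50.0 - 0.05 * t + rng.gauss(0.0, 1.0) for t in range(N)]
series_b = [30.0 - 0.02 * t + rng.gauss(0.0, 0.4) for t in range(N)]

levels_r = pearson(series_a, series_b)
changes_r = pearson(differences(series_a), differences(series_b))

print(f"correlation of levels:  {levels_r:+.2f}")
print(f"correlation of changes: {changes_r:+.2f}")
```

A high correlation of levels combined with a near-zero correlation of changes suggests that the two series merely share a trend. It does not prove the absence of a relationship, but it is a reason to treat the variable as suspicious.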
Moving Beyond Spurious Correlations
So, given the dangers of spurious correlations, how can we ensure our data analysis is robust enough to provide predictive power? In part 2, we’ll dive into a business-related example and show how to avoid suspicious variables and instead use engineered features computed with SAP Predictive Analytics Data Manager.