In Analytics and Business Intelligence, we have used simple visualizations for years: bar charts, line charts, area charts and pie charts are common staples of reports and dashboards everywhere. These work really well with “actuals”, i.e. data we have collected through normal business operations in data marts and data warehouses. However, our world is changing rapidly. As statistics and predictive analytics become increasingly pervasive, we face an issue we haven’t had to deal with before: how to appropriately visualize uncertainty, especially for users who are not used to it.
This is nothing new to statisticians, of course. They have been dealing with this for years. However, much of their analysis is shared with other statisticians or with people familiar with techniques for visualizing probabilities, densities and confidence intervals, whereas now this audience is widening quickly to regular business users. Predictive analytics – with or without Big Data analysis – deals with real world problems that aren’t always as neat and easily represented as past sales data or revenue numbers by quarter. We want our analytics to be more actionable, and to tell us what we should do, or be prepared for, rather than only showing a picture of the past. But if we don’t do this right, we could well end up giving false impressions – and if business users base real-world actions on them, that may have dramatic consequences.
Visualizations need to communicate quickly, without needing a whole lot of explanation. But we also have to watch out that we don’t simplify to such an extent that we hide the complexity. Weather reports can function as an interesting analogy. Standard 5-day forecasts don’t advertise their margins of error, but use standard weather classifications (in more or less detail; see the UK Met Office standards) which are appropriately vague (a “Light rain” prediction in the UK might not be much of a stretch). On the other hand, you can see a more complex visualization in this diagram of Hurricane Sandy landfall predictions, where each line predicts a different path. You may have seen similar graphs and animations on cable news or weather channels.
(source: EUROPEAN CENTER FOR MEDIUM-RANGE WEATHER FORECASTS, as quoted in How Math Helped Forecast Hurricane Sandy – Scientific American)
There are techniques to visualize this type of uncertainty. Typically, we use confidence intervals for this. A confidence interval gives us a measure of reliability, or confidence, in the calculated result: if the same calculation or model were repeated on many different samples, the stated proportion of the resulting intervals would contain the true population parameter. Typically, these probabilities are 95% and 80%, but others can be chosen as needed. (Note that a 95% confidence interval does not mean that 95% of the sample data lie within the confidence interval.)
But we have to make sure we interpret these well. Below you see an example: 15-year GDP projections based on past history between 1960 and 2012.
For the chart on the left for Australia, you can see that the confidence intervals are really narrow. This is not too surprising, as the time series of actuals before it is pretty steady, too. Both the 80% and 95% confidence intervals sit tightly around the forecast itself (thick blue line). On the other hand, in the projection for Greece, we see very wide bands, and while the forecast itself suggests Greece will recover from the 2008 crisis, there is a decent chance that it will not, and even dropping down further is a distinct possibility. The greater width of the confidence intervals shows the greater uncertainty for this particular forecast (as the far greater variability of the actuals already suggests). It will, therefore, be a lot easier to base business decisions on the results for Australia, whereas for Greece we may want to hedge our bets and first see how things play out, or at least be aware that there is a lot of uncertainty about where things will go.
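To make that definition concrete, here is a minimal simulation sketch. It is not part of the original analysis: the population (normal, mean 10, standard deviation 3), the sample size and the function names are all assumptions chosen for illustration, and the interval uses a simple normal approximation. It repeats the same calculation on many samples and counts how often the interval contains the true mean, and separately shows how little of the sample data actually falls inside the interval.

```python
import random
import statistics
from statistics import NormalDist


def mean_confidence_interval(sample, level=0.95):
    """Normal-approximation confidence interval for the mean of `sample`."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for 95%
    mean = statistics.fmean(sample)
    std_error = statistics.stdev(sample) / len(sample) ** 0.5
    return mean - z * std_error, mean + z * std_error


def coverage(true_mean=10.0, true_sd=3.0, n=50, repeats=2000, level=0.95, seed=1):
    """Share of repeated samples whose interval contains the true mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(repeats):
        sample = [rng.gauss(true_mean, true_sd) for _ in range(n)]
        low, high = mean_confidence_interval(sample, level)
        hits += low <= true_mean <= high
    return hits / repeats


def share_of_data_inside(sample, level=0.95):
    """Share of the individual observations that fall inside the interval."""
    low, high = mean_confidence_interval(sample, level)
    return sum(low <= x <= high for x in sample) / len(sample)
```

Calling `coverage()` should return a value close to the chosen level, while `share_of_data_inside(sample)` for a single sample is typically far below it, which is exactly the misreading the note above warns about.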
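The charts do not say which forecasting model produced these bands, so the following is only a sketch of the general effect, using a simple random-walk-with-drift forecast as a stand-in. The two series are made-up numbers (not GDP data), and `drift_forecast` is a hypothetical helper. It also ignores the uncertainty in the estimated drift itself, so real bands would be somewhat wider.

```python
import statistics
from statistics import NormalDist


def drift_forecast(series, horizon, level=0.95):
    """Random-walk-with-drift forecast with approximate interval bands.

    Returns a list of (forecast, low, high) tuples for steps 1..horizon.
    """
    changes = [b - a for a, b in zip(series, series[1:])]
    drift = statistics.fmean(changes)      # average step-to-step change
    sigma = statistics.stdev(changes)      # variability of those changes
    z = NormalDist().inv_cdf(0.5 + level / 2)
    result = []
    for h in range(1, horizon + 1):
        point = series[-1] + drift * h
        half_width = z * sigma * h ** 0.5  # uncertainty grows with the horizon
        result.append((point, point - half_width, point + half_width))
    return result


# Made-up illustrative series: one steady, one erratic.
steady = [100, 103, 106, 108, 111, 114, 117, 119, 122, 125]
volatile = [100, 112, 98, 120, 95, 130, 104, 88, 126, 110]

steady_95 = drift_forecast(steady, horizon=15, level=0.95)
volatile_95 = drift_forecast(volatile, horizon=15, level=0.95)
volatile_80 = drift_forecast(volatile, horizon=15, level=0.80)
```

Under these assumptions, the band for the erratic series is much wider than the one for the steady series at every step, and the 80% band always sits inside the 95% band.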
Forecasts on a time series are probably the easiest use case. Things get more complex with scatterplots where we analyze the relationship between two variables. These could be measurements from sensors, for instance, where given a certain condition x, there is a measured value y. This analysis could be used in predictive maintenance or pro-active monitoring of a manufacturing process, where past behavior of components under certain conditions is used to predict the behavior of the component under similar conditions. For example, sensors operating in extreme temperature ranges may fail more frequently than those operating around a stable 16 degrees Centigrade. Or their output may be more erratic under certain circumstances, leading to wider variation of results under specific conditions.
Let’s run through some examples of how we would deal with that, including some modern techniques that improve on the standard solution of the confidence interval. Below you see a randomly generated dataset of 300 values plotted out.
There clearly is some sort of correlation, given the shape of the points, but without further analysis we can’t do much with this. We can add a loess regression on it, which should give us a sense of the general shape, and could provide us with an answer for y given a new value of x:
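For readers who want to see what a loess fit does under the hood, here is a minimal pure-Python sketch of a local linear regression with tricube weights. It is not the code behind the chart: the data generator, the `span` of 0.5 and the function names are assumptions, and production loess implementations add refinements (such as robustness iterations) that are left out here.

```python
import math
import random


def tricube(u):
    """Tricube kernel: weight 1 at distance 0, falling to 0 at distance 1."""
    return (1 - abs(u) ** 3) ** 3 if abs(u) < 1 else 0.0


def loess_point(x0, xs, ys, span=0.5):
    """Local linear fit at x0 using the nearest `span` fraction of the points."""
    k = max(3, math.ceil(span * len(xs)))
    distances = sorted(abs(x - x0) for x in xs)
    bandwidth = max(distances[k - 1], 1e-12)  # distance to the k-th nearest point
    w = [tricube((x - x0) / bandwidth) for x in xs]
    # Weighted sums for a straight line in (x - x0).
    s0 = sum(w)
    s1 = sum(wi * (x - x0) for wi, x in zip(w, xs))
    s2 = sum(wi * (x - x0) ** 2 for wi, x in zip(w, xs))
    t0 = sum(wi * y for wi, y in zip(w, ys))
    t1 = sum(wi * (x - x0) * y for wi, x, y in zip(w, xs, ys))
    det = s0 * s2 - s1 * s1
    if abs(det) < 1e-12:
        return t0 / s0                        # fall back to a weighted mean
    return (s2 * t0 - s1 * t1) / det          # intercept of the line = fit at x0


# Hypothetical stand-in for the 300 randomly generated points.
rng = random.Random(42)
xs = [rng.gauss(1.0, 1.2) for _ in range(300)]
ys = [x ** 2 + rng.gauss(0.0, 3.0) for x in xs]

grid = [-2 + 6 * i / 199 for i in range(200)]  # 200 x positions from -2 to 4
curve = [loess_point(g, xs, ys) for g in grid]
```

Answering "what is y for a new value of x" is then a single call, for example `loess_point(1.0, xs, ys)`.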
This is nice, but doesn’t give us any sense of what the confidence interval is. But we can add one, based on the Standard Error:
That’s better. It gives us a reasonable picture of the dataset, and shows that the uncertainty increases dramatically at either edge – through fewer data points, but also greater variability in the answers – whereas in the middle, more data points are “clustered” closer together around the loess regression line. If we ask for a prediction for x near -2 or 3.5, we know to expect the real result to be far more uncertain than a prediction where x = 1. Note also that even if we get a prediction for x around 1, the real y could still easily be 10 off from the prediction.
Confidence intervals are the standard way of visualizing uncertainty. But recently there has been some criticism of them as well. The main concern is that confidence intervals tend to stress the edges, rather than what happens within the interval. Solomon Hsiang and others have been doing interesting work on focusing on the density of results, rather than just standard error, margins of error and confidence intervals. The idea is to “bootstrap” the data set multiple times on random selections of the original data set and use the results of that in the visualization. In this case, we’re running 1,000 loess regressions, each on 200 randomly selected data points out of the original 300. We can plot each of these regressions in the graph (as a “spaghetti plot”):
This is nice, and the ‘density’ of the lines together (or fanning out at the edges) already gives a decent indication of uncertainty. However, this does have some drawbacks, not least that this can be rather heavy to visualize. Beyond the 300 data points, there are 200,000 data points to draw the spaghetti plots. That makes it hard to use in interactive (HTML5, D3.js) visualizations. (Note also how there are “breaks” around x=-2 and x=3.9, depending on which data points were included in the 200, and which were not.)
One way around that (requiring far fewer data points in the visualization) is to use the information from the spaghetti lines in a different way. In the graph below, we plot the density of the data points as bands of 1x (dark orange), 2x (light orange) and 3x (yellow) standard deviation. This requires ~1,700 data points (7 * 200 + 300 original data points).
Finally, we can do even better than that, and show a more continuous distribution of the data, thereby giving an even closer indication of how uncertain a particular prediction might be based on its x value. Here, in what is called “visually weighted regression”, density is plotted using a yellow-to-red gradient, with 0 being yellow and 1 being red. There is also an alpha/transparency factor used to stress this further, so the colors “fade” the less dense the values are. You can see clearly that between roughly -1.4 and 2 there is a clear red band, showing our predictions in that area to be quite reliable. On the other hand, at the edges we can really see the red “fade out”, showing that predictions there would be far less so. This technique requires ~7,000 data points, after filtering out data points with a minimal alpha value.
These latter techniques are certainly more involved, and require substantially more computing power. They also demand a lot more of the analysis and visualization tools because of the number of data points that need to be plotted. However, they do communicate substantially more information than simple confidence intervals can, let alone the first two plots of this data set.
Clearly, this is a much larger topic than this blog post can cover. Below you’ll find some further interesting reading on this subject, including more details on this approach of visually weighted regression. I hope, though, that this gave enough of an insight into the problem to at least bear it in mind when you’re dealing with predictive or other statistical analysis and need to communicate your results visually. As predictive analytics becomes more and more pervasive in business operations, and is presented to business users with no background in statistics, the more we can do to visualize uncertainty appropriately, the more we will prevent accidents caused by a user misreading a chart and taking a prediction to be much more definite and certain than the math justifies.
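One common way to get such a band is sketched below; the chart may well have been produced differently. Because a local linear fit at x0 is a weighted sum of the observed y values, its standard error is the noise level times the square root of the sum of squared weights. This sketch assumes a constant noise level across x (which the data above may not satisfy), estimates it from the residuals, and uses a normal quantile. The data generator and function names are hypothetical.

```python
import math
import random
from statistics import NormalDist


def local_linear_weights(x0, xs, span=0.5):
    """Weights l_i such that the local linear fit at x0 equals sum(l_i * y_i)."""
    k = max(3, math.ceil(span * len(xs)))
    bandwidth = max(sorted(abs(x - x0) for x in xs)[k - 1], 1e-12)
    # Tricube kernel weights, zero beyond the bandwidth.
    w = [max(0.0, 1 - (abs(x - x0) / bandwidth) ** 3) ** 3 for x in xs]
    s0 = sum(w)
    s1 = sum(wi * (x - x0) for wi, x in zip(w, xs))
    s2 = sum(wi * (x - x0) ** 2 for wi, x in zip(w, xs))
    det = s0 * s2 - s1 * s1  # assumed non-zero (distinct x values)
    return [wi * (s2 - s1 * (x - x0)) / det for wi, x in zip(w, xs)]


def loess_band(grid, xs, ys, span=0.5, level=0.95):
    """Return (fit, low, high) lists over `grid`, assuming constant noise."""
    # Residual variance from the fit at the observed points; the trace of the
    # smoother matrix stands in for the number of fitted parameters.
    residual_ss, trace = 0.0, 0.0
    for i, (x, y) in enumerate(zip(xs, ys)):
        l = local_linear_weights(x, xs, span)
        residual_ss += (y - sum(li * yi for li, yi in zip(l, ys))) ** 2
        trace += l[i]
    sigma = math.sqrt(residual_ss / (len(xs) - trace))
    z = NormalDist().inv_cdf(0.5 + level / 2)
    fit, low, high = [], [], []
    for g in grid:
        l = local_linear_weights(g, xs, span)
        centre = sum(li * yi for li, yi in zip(l, ys))
        half = z * sigma * math.sqrt(sum(li * li for li in l))  # z * standard error
        fit.append(centre)
        low.append(centre - half)
        high.append(centre + half)
    return fit, low, high


# Hypothetical stand-in data: x values are sparse towards the edges.
rng = random.Random(7)
xs = [rng.gauss(1.0, 1.2) for _ in range(300)]
ys = [x ** 2 + rng.gauss(0.0, 3.0) for x in xs]
fit, low, high = loess_band([-2.0, 1.0, 3.5], xs, ys)
```

With data like this, the band at x = -2 and x = 3.5 comes out wider than at x = 1, simply because fewer points carry the fit there.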
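The resampling step can be sketched roughly as follows. This is an illustration under assumptions, not the code behind the chart: the data generator is made up, subsets are drawn without replacement (the text does not say which way it was done), and every curve is evaluated on one shared x grid so the curves are easy to compare afterwards. The demo uses far fewer runs than the 1,000 mentioned above, only to keep a pure-Python example quick.

```python
import math
import random


def loess_curve(grid, xs, ys, span=0.5):
    """Local linear (loess-style) fit evaluated at every x in `grid`."""
    k = max(3, math.ceil(span * len(xs)))
    curve = []
    for x0 in grid:
        bandwidth = max(sorted(abs(x - x0) for x in xs)[k - 1], 1e-12)
        w = [max(0.0, 1 - (abs(x - x0) / bandwidth) ** 3) ** 3 for x in xs]
        s0 = sum(w)
        s1 = sum(wi * (x - x0) for wi, x in zip(w, xs))
        s2 = sum(wi * (x - x0) ** 2 for wi, x in zip(w, xs))
        t0 = sum(wi * y for wi, y in zip(w, ys))
        t1 = sum(wi * (x - x0) * y for wi, x, y in zip(w, xs, ys))
        det = s0 * s2 - s1 * s1
        curve.append((s2 * t0 - s1 * t1) / det if abs(det) > 1e-12 else t0 / s0)
    return curve


def bootstrap_curves(xs, ys, grid, runs=1000, subset=200, seed=0):
    """Fit one curve per random subset of the data (drawn without replacement)."""
    rng = random.Random(seed)
    indices = range(len(xs))
    curves = []
    for _ in range(runs):
        chosen = rng.sample(indices, subset)
        curves.append(loess_curve(grid,
                                  [xs[i] for i in chosen],
                                  [ys[i] for i in chosen]))
    return curves


# Hypothetical stand-in for the 300 original points.
rng = random.Random(42)
xs = [rng.gauss(1.0, 1.2) for _ in range(300)]
ys = [x ** 2 + rng.gauss(0.0, 3.0) for x in xs]

grid = [-2.0 + 0.25 * i for i in range(25)]       # x from -2.0 to 4.0
curves = bootstrap_curves(xs, ys, grid, runs=50)  # 50 runs to keep the demo fast
```

Each entry of `curves` is one strand of spaghetti; plotting them all as thin, semi-transparent lines gives the fanning-out effect described above.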
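One reading of the "7 * 200" figure is a mean line plus three pairs of boundary lines (at 1, 2 and 3 standard deviations), each drawn with 200 points. Under that assumption, collapsing the spaghetti lines into bands could look like the sketch below; `sd_bands` is a hypothetical helper, and it expects every curve to be a list of y values on the same x grid.

```python
import statistics


def sd_bands(curves, multiples=(1, 2, 3)):
    """Collapse bootstrap curves into a mean line plus +/- k SD lines.

    `curves` is a list of equally long lists of y values on a shared x grid.
    Returns a dict mapping a label to one line (a list of y values).
    """
    columns = list(zip(*curves))  # all y values at each grid position
    mean = [statistics.fmean(c) for c in columns]
    sd = [statistics.stdev(c) for c in columns]
    lines = {"mean": mean}
    for k in multiples:
        lines[f"+{k}sd"] = [m + k * s for m, s in zip(mean, sd)]
        lines[f"-{k}sd"] = [m - k * s for m, s in zip(mean, sd)]
    return lines


# Three tiny hand-made curves on a three-point grid, for illustration only.
curves = [[1.0, 2.0, 3.0],
          [1.2, 2.0, 2.0],
          [0.8, 2.0, 4.0]]
lines = sd_bands(curves)
```

Where the curves agree (the middle position here), all seven lines coincide; where they disagree, the bands open up. Only the seven lines need to be sent to the chart, rather than every bootstrapped curve.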
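A rough sketch of how the coloured cells for such a plot could be derived from the bootstrapped curves follows. The binning scheme, the scaling of density against the busiest cell, and the `density_cells` name are all assumptions made for illustration; published implementations of visually weighted regression differ in their details.

```python
import math


def density_cells(curves, grid, y_min, y_max, bins=40, min_alpha=0.05):
    """Turn bootstrap curves into coloured cells.

    For every grid position, count how many curves pass through each y bin.
    Counts are scaled by the busiest cell in the whole plot, so density runs
    from 0 (yellow, transparent) to 1 (red, opaque).
    Returns a list of (x, y, (r, g, b), alpha) tuples.
    """
    height = (y_max - y_min) / bins
    table = []
    for col in range(len(grid)):
        counts = [0] * bins
        for curve in curves:
            b = math.floor((curve[col] - y_min) / height)
            if 0 <= b < bins:  # ignore values outside the plotted range
                counts[b] += 1
        table.append(counts)
    peak = max(1, max(max(counts) for counts in table))
    cells = []
    for x, counts in zip(grid, table):
        for b, count in enumerate(counts):
            density = count / peak
            if density >= min_alpha:         # drop nearly invisible cells
                y = y_min + (b + 0.5) * height
                colour = (255, round(255 * (1 - density)), 0)  # yellow to red
                cells.append((x, y, colour, density))
    return cells


# Tiny hand-made example: the curves agree at x=0 and scatter at x=1.
grid = [0.0, 1.0]
curves = [[5.0, 1.0], [5.1, 3.0], [5.2, 6.0], [5.3, 9.0]]
cells = density_cells(curves, grid, y_min=0.0, y_max=10.0, bins=10)
```

In this toy example the single cell at x=0 is fully red and opaque, while the four cells at x=1 are pale and mostly transparent. Raising `min_alpha` is what trims the number of cells that have to be drawn.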