Candidate Influencers in SAP Analytics Cloud Smart Predict
This blog is an extension of “Time Series Forecasting in SAP Analytics Cloud Smart Predict in Detail”. I recommend you look at it before continuing. Today I’ll show you how to increase the accuracy of the predictions by using additional variables.
I’ll use the same use case as in the original blog. It is about the travel costs and expenses of a company. Financial controllers deem these costs too high, and they hurt the financial performance of the company. There are two objectives. The first is to analyze the costs to understand where they can be reduced. The second is to better predict future costs and avoid budget overruns.
Addressing a time series problem means predicting the evolution of the measured variable. When possible, Smart Predict uses additional variables. In this way, Smart Predict can analyze the signal more precisely. The result is a higher-quality predictive model and, in turn, better forecasts. To illustrate this in the next sections:
- I’ll build two predictive models.
- I’ll use just the signal to build the first model.
- For the second model, I’ll add candidate influencers to the data model.
- Then I’ll compare the results to demonstrate the improvements.
Set and Train
I first create a Time Series Predictive Scenario. I add a first predictive model with the Travel and Expense costs. The signal variable is the overall travel cost. The date variable is the posting date of these costs. The dataset is segmented on the variable Line of Business (LOB). I then choose to exclude all the other variables. You can see the settings below in Fig. 1.
Fig 1: Settings to build a predictive model without candidate influencers
The model is then saved and trained.
Then I duplicate the first model. I change its settings to add candidate influencers. To do that, I stop excluding the variables. The settings are shown in Fig 2.
Fig 2: Settings to build a predictive model with candidate influencers
The model is also saved and trained.
Often, the dataset doesn’t contain only the signal. Other variables are captured over the same period because people expect them to matter. Collecting the past, present and future values of these variables can bring additional information that helps create better models.
In this use case, the time dimension is the posting date of a cost. The variable of interest is the cost. But there are other variables. I have separated them below into two groups:
- Some of them bring new facts like software licenses, cloud bookings, cloud subscriptions & support or headcount.
- The others bring time information, such as:
  - number of working days,
  - holiday month,
  - monthly and quarterly closing day,
  - first and last day of the month,
  - rank of the day in the week, or
  - year end.
It is your knowledge of the application domain that guides you in choosing the variables that can influence the generation of the models.
Smart Predict uses these candidate influencers to improve the detection of the trend and cycles in the signal, aiming for a better description of these components. Keep in mind that the future values of the candidate influencers must be known (at least over the expected horizon). The more precise the trend and cycle descriptions, the better the predictions and the narrower the confidence interval.
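Calendar-based influencers are convenient because their future values are always known. As a minimal sketch of how such columns could be prepared before loading the dataset, here is a pandas example. The function name, the column names and the dates are hypothetical, and the working-day count only considers Monday to Friday (public holidays are ignored).

```python
import numpy as np
import pandas as pd


def add_calendar_influencers(dates: pd.DatetimeIndex) -> pd.DataFrame:
    """Derive time-related candidate influencers from a range of dates."""
    df = pd.DataFrame({"posting_date": dates})
    d = df["posting_date"].dt

    df["day_of_week"] = d.dayofweek  # 0 = Monday ... 6 = Sunday
    df["is_month_start"] = d.is_month_start.astype(int)
    df["is_month_end"] = d.is_month_end.astype(int)
    df["is_quarter_end"] = d.is_quarter_end.astype(int)
    df["is_year_end"] = d.is_year_end.astype(int)

    # Monday-to-Friday days in the month of each date (public holidays ignored).
    month_start = d.to_period("M").dt.start_time
    next_month_start = month_start + pd.offsets.MonthBegin(1)
    df["working_days_in_month"] = np.busday_count(
        month_start.values.astype("datetime64[D]"),
        next_month_start.values.astype("datetime64[D]"),
    )
    return df


# The range should cover the history AND the forecast horizon, so that the
# future values of every influencer are available (hypothetical dates).
history_and_horizon = pd.date_range("2016-12-29", "2017-01-02", freq="D")
calendar = add_calendar_influencers(history_and_horizon)
print(calendar)
```

The resulting table can be joined to the cost data on the posting date, leaving the cost empty for the future rows.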
Compare Results of Predictive Models
I compare the conclusions provided in each of the model debriefs. The objective is to check whether, and where, the quality improves. The result of this comparison guides the choice of the model to keep. Let’s review the three kinds of indicators.
Refer to the original blog for the definition of the Horizon-Wide MAPE. Notice that the smaller the Horizon-Wide MAPE, the better the forecasted values should be.
When you look at the debrief of both models, the HW-MAPE is as shown in Fig. 3.
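The original blog remains the reference for the exact definition. As a rough sketch only, assuming the Horizon-Wide MAPE is the average of the MAPE values computed at each step of the forecast horizon, the calculation could look like this. The function names and the numbers are made up for illustration.

```python
import numpy as np


def mape(actual, forecast) -> float:
    """Mean Absolute Percentage Error, returned as a fraction (0.1 = 10%)."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return float(np.mean(np.abs((actual - forecast) / actual)))


def horizon_wide_mape(actuals_by_step, forecasts_by_step) -> float:
    """Average the per-step MAPE over all steps of the horizon (assumed definition)."""
    per_step = [mape(a, f) for a, f in zip(actuals_by_step, forecasts_by_step)]
    return float(np.mean(per_step))


# Illustrative values: two horizon steps, two validation points each.
actuals = [[100.0, 200.0], [100.0, 200.0]]
forecasts = [[110.0, 180.0], [120.0, 160.0]]

hw_mape = horizon_wide_mape(actuals, forecasts)
print(f"HW-MAPE: {hw_mape:.2%}")
```

Because each term is an absolute percentage error, a lower value means forecasts that sit closer to the actual values.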
Fig 3: Comparison of the Horizon-Wide MAPE
The difference between these two predictive models is a 22% improvement in the HW-MAPE (that is, a lower error) when additional variables are considered during the analysis of the signal.
The travel costs differ from one LOB to another. Therefore, the historical dataset is segmented on the LOB variable. It is better to have a forecast model specific to each LOB so that those with high activity don’t influence those whose activity is lower. In the debrief of each model, we can see (Fig. 4) the individual Horizon-Wide MAPE for each segment.
Fig 4: Comparison of Horizon-Wide MAPE for each segment
The accuracy of the models increases significantly for all but three LOBs. Adding knowledge in the form of additional variables can help produce more accurate models, but the effect is not guaranteed. Its strength varies from one segment to another. A variable can influence some segments more than others, and Smart Predict detects which variables are influencers for each segment.
Trends and Cycles
The detection of trends and cycles can be improved when candidate influencers are considered. Let’s see this influence on two segments.
Fig 5: Improvement of the detection of the trend
When there is only the signal, the detected trend is a decreasing straight line. But when candidate influencers are added, the detected trend sticks more to the signal. It decreases until Jan 2016 and changes direction after.
Segment “Sales & Marketing”
Fig 6: Improvement of the detection of the cycle
On this segment, candidate influencers do not impact the detection of the trend. A trend is detected in both models. But while no cycle is detected when the signal is analyzed alone, one is discovered when candidate influencers are added.
This explains the higher quality of the models for these two segments. The better the components of a signal are detected and characterized, the better the accuracy of the forecast model.
The confidence interval comes from the difference between forecasted values and actual values. Let’s clarify how it is calculated.
The model is not trained on all the historical data but on the first 75%. The remaining 25% is reserved for validation. This subset is used to measure the difference between the actual values and the forecasted values over the horizon given by the user in the model settings. The difference represents the error made by the forecast model. Then Smart Predict computes the standard deviation of this series of errors. There are two assumptions:
- The error follows a normal distribution.
- The confidence interval should incorporate at least 95% of all values.
Under a normal distribution, an interval of about 2 standard deviations on each side of the mean covers roughly 95% of all values. Smart Predict chooses to take 3 * standard deviation, which covers about 99.7% of all values.
Now that you have an idea of how the confidence interval is computed, I will compare what happens when candidate influencers are used.
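The calculation described above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not Smart Predict’s actual implementation: the function names are hypothetical, the errors are pooled into a single series, the population standard deviation is used, and the band is centred on the forecast.

```python
import numpy as np


def split_train_validation(series, train_fraction=0.75):
    """Keep the first 75% for training and the last 25% for validation."""
    cut = int(len(series) * train_fraction)
    return series[:cut], series[cut:]


def confidence_band(actual_validation, forecast_validation, future_forecast, k=3.0):
    """Return (lower, upper) bounds as forecast -/+ k * std of validation errors."""
    errors = np.asarray(actual_validation, dtype=float) - np.asarray(
        forecast_validation, dtype=float
    )
    sigma = float(np.std(errors))  # assumption: population standard deviation
    future = np.asarray(future_forecast, dtype=float)
    return future - k * sigma, future + k * sigma


# Illustrative values only.
history = [100, 104, 98, 101, 103, 99, 102, 98]
train, validation = split_train_validation(history)

lower, upper = confidence_band(
    actual_validation=[102.0, 98.0, 102.0, 98.0],
    forecast_validation=[100.0, 100.0, 100.0, 100.0],
    future_forecast=[100.0, 110.0],
)
print(lower, upper)
```

With k = 3 the band is 1.5 times wider than a 2-sigma band, which is the price paid for covering about 99.7% of the values instead of about 95%.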
Fig 7: Reduction of the confidence interval with candidate influencers
In the debrief of a forecast model, in addition to the graphical representation of Fig 7, there is a table that shows the numerical predicted values with the minimum and maximum errors, as shown in Fig 8.
Fig 8: Table of the forecasted values for segment “Sales & Marketing” for the model built with candidate influencers
Comparing this table with the one obtained from the model built with only the signal, the confidence interval for segment Sales & Marketing is reduced by 13%. For segment Operations, the reduction is measured at 27%.
In conclusion, using candidate influencers can reduce the error made on the predictions.
Overall Key Takeaways
This blog covered the benefits you get when you use candidate influencers in the training of your models. They improve the relevance of models measured by:
- The Horizon-Wide MAPE.
- A better detection of the trend and cycles in the signal.
- A smaller confidence interval.
You will often gain accuracy when you use such variables. Using your knowledge of the application domain is essential. You can create new candidate influencers in your dataset. For example, if you have specific time-dependent events that occur every five months, you can create a dedicated variable for them. This way, specific cycles can be detected. So don’t hesitate to be creative.
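As an illustration of the five-month example, here is a minimal pandas sketch that builds such a flag on a monthly calendar. The function name, the column name and the anchor date are hypothetical; the only idea shown is marking every n-th month counted from a known occurrence of the event.

```python
import pandas as pd


def every_n_months_flag(months: pd.DatetimeIndex, anchor: str, n: int = 5) -> pd.Series:
    """Return 1 for months that are a multiple of n months away from the anchor."""
    first = pd.Timestamp(anchor)
    months_since_anchor = (months.year - first.year) * 12 + (months.month - first.month)
    flag = (months_since_anchor % n == 0).astype(int)
    return pd.Series(flag, index=months, name=f"event_every_{n}_months")


# Hypothetical calendar: one row per month, event first seen in January 2016.
months = pd.date_range("2016-01-01", periods=12, freq="MS")
flags = every_n_months_flag(months, anchor="2016-01-01", n=5)
print(flags)
```

Extending the month range past the end of the history gives the future values of the flag that the forecast horizon needs.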
At the end of this blog I hope to have increased your understanding of this specific aspect of Smart Predict. Feel free to experiment with it and I would be delighted if you share your experience with me.
Resources to learn more about Smart Predict.
- Time Series Forecasting in SAP Analytics Cloud Smart Predict in Detail
- Gaussian Distribution
- Calculating Interval Forecasts
- SAP Analytics Cloud – Learning
Finally, if you enjoyed this post, I’d be very grateful if you’d help spread it by sharing, commenting and liking. Thank you!
The blog proves, that feature selection works automatically in Smart Predict. That sounds great!
How about feature generation? Can Smart Predict generate new candidates, or does it use only existing ones?
Thank you in advance!
By default, Smart Predict includes all the variables of the historical dataset (except the time variable, the measured variable and, if one is used, the variable that segments the dataset). This means that all variables are candidate influencers. Smart Predict automatically detects which variables are actual influencers and how they influence the analysis of the signal.
Can you clarify what you mean by "generate new candidates"? Do you mean variables that are not in the historical dataset? A combination of existing variables? Do you have a particular use case to illustrate your question?
Thank you for the reply. It is a combination of existing variables.
no need to reply. I clarified it with Andreas today. It is available.
No, it is important to reply.
Smart Predict doesn't automatically combine candidate influencers. However, you can create such combinations in the training dataset. This is a decision you take based on your knowledge of the application domain.
Hi Thierry BRUNET ,
Thank you very much for this detailed post.
Since then this functionality has changed, I think. Currently in the Influencer section there's "Include" and not "Exclude". Does it mean that:
Hi Mikhail Vainarevich thanks for your questions.
Thank you for getting back so quickly with your clarifications.