Regression in SAP Analytics Cloud in Detail
This blog replaces my previous blog about Regression in SAC Smart Predict, because we have improved the algorithm as of wave 2021.03 for fast-track customers and the Q2 2021 QRC release for QRC customers.
Regression is used to estimate the value of a measure. For example:
- What is the delay for each customer to pay their invoices?
- How many products will a customer buy next month/next quarter?
- How much would a customer spend on my e-commerce on average?
I’ll start with an explanation of the questions addressed by regression. Then, with the help of a use case, I’ll explain the technique used by Smart Predict to build a regression model, then guide you on how to evaluate this model before taking the decision to use it.
Which questions? Which data?
SAP Smart Predict regression is useful for answering questions of this form:
“How will the <measure> be in <new context>?”
Here are some examples:
- How will the revenue of products of my stores be impacted by the new marketing campaign? The variables considered are the surface of the stores, their location, the number of families in the sectors of the stores, etc.
- What will be my level of production for the final product? The variables considered are the number of working hours, the level of defects and the regularity of provisioning of elementary components.
I want to make clear the difference between a time-series forecast and a regression because, for both, the target variable is a measure, and this can be confusing.
In a time-series forecast, we try to predict the value of a measure whose values are time dependent. Most of the time, the algorithms analyze the curve formed by the values of the measure taken at different times. For example, SAC Smart Predict breaks down this signal into elementary components: trend, cycles and fluctuation. Then it combines these components to predict the future values of the measure.
In a regression, the notion of time is not necessary. Each row of the historical dataset represents an entity described by a set of variables. In a regression, we try to predict the value of the target of a new entity based on the values of the other variables.
To estimate the value of the target measure, SAC Smart Predict generates a formula. When the user encounters a new case, SAC Smart Predict applies this formula to calculate a value for the target measure. The main difference from classification is that regression addresses numeric values. Before explaining regression in SAC Smart Predict, it is necessary to check that the data is usable for regression.
SAC Smart Predict generates a predictive model from historical data. This data represents a view about your application domain, and of course the predictive objective you have. The application domain is described with contextual information and in the case of regression, with a numeric measure. It is necessary for you to select:
- A data source, to indicate where the data comes from.
- The variables that describe the use case. You choose variables that have an interest for your objective (what you want to predict). If you’re in doubt on whether or not to keep a variable, my advice is to keep it. SAC Smart Predict detects important variables and sorts them out for you. You could be surprised and discover that a variable contributes a lot to explain the target variable.
Each of these variables must be unique in the dataset. One of these variables represents the event you want to predict. For regression, this variable is a measure; it is a numeric variable.
Now that we’ve covered the difference between a time-series forecast and a regression, the next question is: what is the difference between classification and regression? In a classification, the target is categorical, while in a regression the target is numeric.
For example, in figure 1, the classification determines who is a yellow cross and who is a purple cross. The typical questions answered with classification are:
- Does this customer want to churn: Yes or No?
- Is the threshold limit, which is fixed at 500, reached: > 500 or <= 500?
A regression finds the best formula so that when there is a new case, the formula provides an estimation of the value of the target variable which is numeric. Classification is a specific case of regression but where:
- The target variable is binary and not continuous
- The underlying principle is how to split a population between one value and another.
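To make the distinction concrete, here is a minimal Python sketch. The amounts are made-up illustration values and `to_binary_target` is a hypothetical helper, not something provided by SAC Smart Predict. It only shows how the threshold question above turns a numeric measure into a two-category target.

```python
# Hypothetical numeric measure (illustration values only).
amounts = [120.0, 500.0, 730.5, 499.9, 1200.0]

THRESHOLD = 500  # threshold limit taken from the example question above


def to_binary_target(value, threshold=THRESHOLD):
    """Return the category a classification would have to predict."""
    return f"> {threshold}" if value > threshold else f"<= {threshold}"


# Regression target: the numeric value itself.
regression_target = amounts

# Classification target: one of two categories derived from the measure.
classification_target = [to_binary_target(v) for v in amounts]

for value, label in zip(regression_target, classification_target):
    print(f"{value:>8.1f}  ->  {label}")
```

A regression would try to estimate the left-hand column, a classification only the right-hand one.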
Use case: Sales Opportunity Predictions
To support explanations given during the next sections, I have chosen a use case that’s easy to understand. A sales manager wants to have a dashboard that presents sales revenue estimation for the next 6 months. The objectives are to have an idea of:
- Expected revenue.
- The number of customers and their expected revenue.
- The estimation of revenue per product.
- How long the sales cycles are on average per customer.
Depending on the results of these estimations, our sales manager will refine the sales process and will put the focus on the most promising opportunities. At the end, here is an example of the dashboard he would like to have.
This dashboard is based on the predictions provided by SAC Smart Predict. These predictions are obtained from the description of won opportunities in ongoing quarters. The training dataset has 4546 rows and 18 variables to describe the opportunities. There are several kinds of variables:
- The customer involved in the opportunity like customer name, country, sector of activity, customer type, etc.
- The content of the opportunity like product name, length of sale cycle, number of meetings, services included, etc.
- Finally, a last variable which is the revenue of the opportunity successfully closed and that is the target used by SAC Smart Predict.
Regression modeling process
Theoretical explanations about regression of Smart Predict
The regression of SAC Smart Predict relies on the gradient boosting technique, and uses the same process as classification to generate a predictive model. I therefore refer you to the section “Theoretical explanations about Smart Predict classification” in the dedicated blog “Classification in SAP Analytics Cloud in Detail” for detailed explanations.
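For readers who want an intuition of gradient boosting without leaving this page, here is a deliberately tiny Python sketch. It is not the Smart Predict implementation: the weak learner (a one-split tree, or "stump"), the squared-error loss, the number of rounds, the learning rate, and the two-feature toy data are all assumptions chosen for illustration. The idea it shows is generic: start from the mean of the target, then repeatedly fit a small model to the remaining errors and add a fraction of it to the prediction.

```python
def fit_stump(X, residuals):
    """Fit a one-split regression tree (a "stump") to the residuals.

    Tries every midpoint between consecutive distinct values of every
    feature and keeps the split with the lowest sum of squared errors.
    """
    mean_all = sum(residuals) / len(residuals)
    # Fallback: a constant stump, used only if no feature can be split.
    best = (float("inf"), 0, float("inf"), mean_all, mean_all)
    for j in range(len(X[0])):
        values = sorted({row[j] for row in X})
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            left = [r for row, r in zip(X, residuals) if row[j] <= threshold]
            right = [r for row, r in zip(X, residuals) if row[j] > threshold]
            left_mean = sum(left) / len(left)
            right_mean = sum(right) / len(right)
            sse = (sum((r - left_mean) ** 2 for r in left)
                   + sum((r - right_mean) ** 2 for r in right))
            if sse < best[0]:
                best = (sse, j, threshold, left_mean, right_mean)
    return best[1:]  # (feature index, threshold, left value, right value)


def predict_stump(stump, row):
    j, threshold, left_value, right_value = stump
    return left_value if row[j] <= threshold else right_value


def fit_gradient_boosting(X, y, n_rounds=50, learning_rate=0.1):
    """Boosting for squared error: each stump is fitted to current residuals."""
    base = sum(y) / len(y)          # initial prediction: mean of the target
    predictions = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [t - p for t, p in zip(y, predictions)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * predict_stump(stump, row)
                       for p, row in zip(predictions, X)]
    return base, stumps, learning_rate


def predict(model, row):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(predict_stump(s, row) for s in stumps)


# Toy data (hypothetical): [number of licenses, number of meetings] -> revenue
X = [[1, 2], [2, 3], [5, 4], [8, 6], [10, 5], [20, 9]]
y = [10.0, 18.0, 45.0, 70.0, 95.0, 180.0]

model = fit_gradient_boosting(X, y)
print(round(predict(model, [1, 2]), 1), round(predict(model, [20, 9]), 1))
```

Each round can only reduce the squared error on the training rows, which is why the number of rounds and the learning rate have to be balanced against robustness on data the model has not seen.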
Understand the outputs of a Smart Predict regression
Build a regression model
When you are in SAC, you start with the creation of a predictive scenario. You choose a regression and give a name to the predictive scenario. Once you specify the training data source, Smart Predict randomly assigns, during the training process, 75% of the cases to the estimation dataset (the dataset from which the predictive model is built) and the remaining 25% to the validation dataset (the dataset used to calculate the performance of the model).
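The random 75% / 25% partition can be pictured with the following Python sketch. It is only an illustration of the principle: the function name, the fixed seed and the way the cut point is rounded are assumptions, and Smart Predict does this split for you internally.

```python
import random


def split_estimation_validation(rows, estimation_share=0.75, seed=42):
    """Randomly partition rows into an estimation and a validation set."""
    shuffled = list(rows)                  # copy, so the input is untouched
    random.Random(seed).shuffle(shuffled)  # seeded only to be reproducible
    cut = int(len(shuffled) * estimation_share)
    return shuffled[:cut], shuffled[cut:]


# Row identifiers standing in for the 4546 rows of the training dataset.
rows = list(range(4546))
estimation, validation = split_estimation_validation(rows)

print(len(estimation), len(validation))
```

Every row ends up in exactly one of the two sets, so the validation set measures performance on cases the model was not built from.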
The target is the numeric variable “Opportunity_Value”.
All variables can influence the target. SAC Smart Predict calculates this influence. We will see that in more detail in a later section. But what if there is a variable I want to exclude for whatever reason? To do this, I go to the “Influencers” section of the settings dialog. Here I choose to exclude “Customer_ID” because it is an ID, which does not add any relevant information.
Generally, there are two kinds of variables to exclude: IDs, and variables which are correlated to the target because they carry the same information as the target without explaining it. The predictive model shows them as high influencers because the target is directly deduced from them. A best practice, when a variable is shown in the debrief with a very high influence, is to check that this variable is not correlated to the target in this way. If it is, exclude it and rebuild the predictive model.
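One simple way to carry out this check outside SAC is to compute the correlation between each suspicious numeric variable and the target. The sketch below is an assumption-laden illustration: the column names `Invoiced_Amount` and `Nb_Meetings`, the data, and the 0.95 cut-off are all hypothetical, and a high correlation is only a hint that a variable deserves a manual review, not proof that it must be excluded.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two numeric columns."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Hypothetical data: the target and two candidate influencers.
opportunity_value = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
candidates = {
    # Directly deduced from the target (value plus 20%): carries no insight.
    "Invoiced_Amount": [v * 1.2 for v in opportunity_value],
    "Nb_Meetings": [3, 1, 4, 2, 5, 2],
}

SUSPICION_THRESHOLD = 0.95  # arbitrary cut-off, for illustration only

suspects = [name for name, column in candidates.items()
            if abs(pearson(column, opportunity_value)) > SUSPICION_THRESHOLD]
print(suspects)
```

Here only the variable that is computed from the target is flagged; it would be a candidate for exclusion before rebuilding the model.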
Once settings are complete, I save, and I train the model. To see the progress and the main indicators, I display the status pane at the bottom of the screen.
If you compare the information displayed in the status of a regression with that of a classification, the Prediction Confidence appears in both. This indicator measures the robustness of the predictive model, or its ability to reproduce the same detection on new data. It is computed the same way in both cases, and I refer you to the blog about classification for details.
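The quality of the regression model is measured by another indicator: the Root Mean Square Error (RMSE). It is a statistical indicator which measures the square root of the average of the squared differences between the values predicted by the predictive model and the actual values of the target, over all cases of the validation dataset. The formula of the RMSE is:
$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2}$$
where $N$ is the number of cases in the validation dataset, $\hat{y}_i$ is the predicted value and $y_i$ is the actual value of the target for case $i$.
The smaller the RMSE, the better the quality of the predictive model.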
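As a quick illustration, the RMSE described above can be computed in a few lines of Python. The three predicted and actual values are made up for the example, and the function name `rmse` is mine.

```python
import math


def rmse(predicted, actual):
    """Root Mean Square Error between predicted and actual target values."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("both lists must be non-empty and of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up values standing in for a (very small) validation dataset.
predicted = [3.0, 5.0, 7.0]
actual = [1.0, 5.0, 9.0]

error = rmse(predicted, actual)
print(error)
```

Because the differences are squared before averaging, a few large errors weigh more heavily on the RMSE than many small ones. The result is expressed in the same unit as the target.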
Compared to the old method, the RMSE improves (that is, decreases) by 25.14%. This means that the quality of the predictions will be better, as well as the probability associated with the predictions.
Debrief information of a regression
The debrief of a regression contains two views.
- An overview which is displayed by default at the end of the training
- The influencer contributions.
The overview recalls the global performance indicators explained just above. It also shows statistics about the target to give an idea of its distribution. This combined information allows us to estimate how good the predictive model is.
For our use case in figure 6, the RMSE is 21,891. The mean of the target is around 64,000 for both the estimation and validation datasets, and the standard deviation is around 46,000. Since the RMSE is less than half the standard deviation of the target, we can say that the quality of the predictive model is pretty good. Its robustness is above 95%, and thus is also very good.
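One rough way to put these numbers in perspective is to relate the RMSE to the spread of the target. A model that always predicted the mean of the target would have an RMSE roughly equal to the standard deviation, so the ratio of the two indicates how much better the model does than that naive baseline. The sketch below uses the rounded figures quoted above; the derived "share of variance explained" is my own back-of-the-envelope indicator, not a value reported by Smart Predict.

```python
# Figures quoted from the overview (mean and standard deviation are rounded).
rmse = 21_891
target_mean = 64_000
target_std = 46_000

# RMSE relative to the spread and to the level of the target.
rmse_vs_std = rmse / target_std
rmse_vs_mean = rmse / target_mean

# Rough share of the target variance explained by the model (assumption:
# the naive "always predict the mean" baseline has RMSE close to target_std).
approx_explained_share = 1 - rmse_vs_std ** 2

print(f"RMSE / std  = {rmse_vs_std:.2f}")
print(f"RMSE / mean = {rmse_vs_mean:.2f}")
print(f"approx. explained share = {approx_explained_share:.2f}")
```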
The graph of figure 6 shows the predicted values versus the actual values of the target variable. There are 3 types of curves.
- In green it is the perfect model, which shows no error and predicts exactly the correct opportunity value.
- The blue curve is the predictive model determined by SAC Smart Predict.
- The dotted-blue curves are the error min and error max on the validation dataset. The width is 2 * standard deviation of the target values.
The population of the validation dataset is split into 20 intervals, each of which represents roughly 5% of the population. A dot on a curve represents the segment mean on the X-axis and the target mean on the Y-axis.
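The construction of these dots can be sketched as follows. This is an assumption about the mechanics rather than a description of the Smart Predict code: I assume the cases are sorted by predicted value before being cut into 20 equal-sized segments, and the data is synthetic (a model with a constant bias of +1 on the actual values).

```python
def segment_means(predicted, actual, n_segments=20):
    """Return one (mean predicted, mean actual) point per segment."""
    pairs = sorted(zip(predicted, actual))  # assumption: sort by prediction
    n = len(pairs)
    points = []
    for k in range(n_segments):
        # Integer arithmetic keeps the segments as equal in size as possible.
        segment = pairs[k * n // n_segments:(k + 1) * n // n_segments]
        if not segment:
            continue  # fewer cases than segments
        mean_predicted = sum(p for p, _ in segment) / len(segment)
        mean_actual = sum(a for _, a in segment) / len(segment)
        points.append((mean_predicted, mean_actual))
    return points


# Synthetic validation data: 100 cases, actual value = prediction + 1.
predicted = [float(i) for i in range(100)]
actual = [p + 1.0 for p in predicted]

points = segment_means(predicted, actual)
print(points[0], points[-1])
```

With a perfect model, every point would have identical coordinates, which is what the green curve represents.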
How do we interpret these curves?
If the green and blue curves don’t match at all, this means that the quality and the robustness of the predictive model are quite poor. One solution to increase the quality is to add more variables, while adding new cases will increase the robustness.
If these two curves match closely, the predictive model is good and can be trusted to predict the value of the unknown target.
The last case is when the two curves match closely except on a few segments. This means that the predictive model is good but can be improved. This could be because the cases under these segments don’t correctly cover the application domain. To remedy this situation, it is necessary to add new variables in the description of cases and to add new cases in the impacted segments.
The second view is dedicated to the influencers (see figure 7).
The first block recalls the contribution of the variables which have been identified as influencers. They are displayed sorted by decreasing importance. The most contributive ones are the ones that best explain the target. The sum of the contributions equals 100%.
The second block details how the values of a variable positively or negatively influence the target. In figure 7 below, SAC Smart Predict groups the values of the number of licenses into categories with the same influence on the target. A graph like this shows a lot of information, so I have preferred to explain it in the red bubbles of the image rather than in a long block of text. From a business point of view, we can use this graph to conclude that the more licenses are booked in an opportunity, the greater the revenue.
The third block shows how the categories of an influencer variable are distributed as a function of the mean value of the target and the frequency of either the training dataset or the validation dataset.
Here for example, look at the biggest category of licenses. The mean revenue of the opportunities is 149,903 and this represents 13.56% of the opportunities of the validation dataset. The deals in the next category of licenses have a mean revenue of 77,812 but they represent almost 30% of the validation dataset.
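The two figures read from this block, the mean of the target and the share of the dataset per category, are simple group statistics. Here is a hedged Python sketch of how they could be reproduced on your own data; the category labels and revenue values are invented and the function name is hypothetical.

```python
from collections import defaultdict


def category_profile(categories, targets):
    """Mean target and frequency (in % of rows) for each category."""
    groups = defaultdict(list)
    for category, target in zip(categories, targets):
        groups[category].append(target)
    total = len(targets)
    return {
        category: {
            "mean_target": sum(values) / len(values),
            "frequency_pct": 100 * len(values) / total,
        }
        for category, values in groups.items()
    }


# Invented example: license categories and the revenue of each opportunity.
license_category = ["1-5", "1-5", "1-5", "6-20", "6-20", ">20"]
revenue = [10.0, 20.0, 30.0, 40.0, 60.0, 150.0]

profile = category_profile(license_category, revenue)
for category, stats in profile.items():
    print(category, stats)
```

Reading both numbers together matters: a category with a high mean revenue but a small frequency contributes less to the total than its mean alone suggests.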
Knowing which are the influencers and how they influence the target allows the user to discover new insights and gives guidance to make decisions to improve business.
Using a regression model
Getting insights is good to understand the data better, but the primary objective is to get predictions when the value of the target variable is unknown. For the use case we’ve been using as an example, the objective is “What will my revenue be for my ongoing opportunities?”
Once a trustworthy predictive model is built, I select it, click on the “Apply” icon and fill in the dialog as shown in figure 8.
Let’s break this down a bit:
- The data source. It is a dataset with the same information as the training dataset but where the values of the target variable (in this situation the Opportunity_Value) are unknown. It is what you want to determine.
- The output is also a dataset. I give it a name and a location, and it is in this dataset that predictions and all the information mentioned in the dialog of figure 8 will be written.
- The replicated columns are the variables of the input dataset I want to retrieve in the output dataset.
- To this output, I add other statistical columns. The minimum to be useful is to add the predicted values of the target variable. For the use case, I also choose to add a kind of marker that indicates if an opportunity of the applied dataset is an outlier.
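Conceptually, the apply step described in this list does something like the following Python sketch. Everything here is a stand-in: `toy_predict` is a made-up formula playing the role of the trained model, `Nb_Licenses` and `Nb_Meetings` are hypothetical column names, and the outlier indicator is left out because its computation is not described in this blog.

```python
def apply_model(predict, rows, feature_columns, replicated_columns,
                target="Opportunity_Value"):
    """Build the output dataset: replicated columns plus the prediction."""
    output = []
    for row in rows:
        out = {column: row[column] for column in replicated_columns}
        features = [row[column] for column in feature_columns]
        out[f"Predicted_{target}"] = predict(features)
        output.append(out)
    return output


def toy_predict(features):
    """Stand-in for the trained model (arbitrary formula, illustration only)."""
    licenses, meetings = features
    return 5000.0 * licenses + 1000.0 * meetings


# Apply dataset: same description as the training data, target unknown.
new_opportunities = [
    {"Customer_ID": "C001", "Nb_Licenses": 10, "Nb_Meetings": 3},
    {"Customer_ID": "C002", "Nb_Licenses": 2, "Nb_Meetings": 1},
]

scored = apply_model(
    toy_predict,
    new_opportunities,
    feature_columns=["Nb_Licenses", "Nb_Meetings"],
    replicated_columns=["Customer_ID"],
)
print(scored)
```

The replicated columns are what let you join the predictions back to your business data when building the dashboard.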
Once this process is completed, a message is displayed in the status pane.
Let’s now open this new dataset.
In SAC, browse to the location where you stored the output dataset and open it. It contains all the columns set from the “Apply” dialog. At the end, there are the predicted values and the outlier indicator.
This output dataset now contains the description of new opportunities not yet successfully closed with an estimation of their revenue and an outlier indicator. We are now ready to create the BI dashboard shown in figure 2.
For those who have already read my blogs about Smart Predict Time Series Forecasting and Classification, my hope is that all these explanations have helped you, not only to understand how these predictive features are working, but also to give you a greater trust in Smart Predict.
If this series of blogs has contributed to your decision to start or continue with SAC Smart Predict, I’d be very grateful if you could leave a comment to that effect, and don’t forget to like it as well. Thank you.
Resources to learn more about Smart Predict.
Could it be that the definition of the RMSE is missing an SQRT function?
Hi Walter, well seen. This is corrected.
Hi Thierry,
The blog is great and exhaustive. I was looking for the mentioned dataset in the system but could not find it.
Could you please help us with the dataset?
Thanks and Regards,
Nimisha Gandhi .
Happy to see that you are interested in this blog. You can get the dataset here: https://github.com/antoinechabert/predictive
Thank you so much for uploading the files.
Thanks for the link! Appreciate the data sets as well.
I was already searching for them today :-)
Hello Thierry BRUNET ,
thank you very much for this post! It is great to see the development of the Smart Predict features in SAC. Nevertheless, is it planned that the Regression scenario could connect to either an analytic or a planning model in SAC instead of just a dataset? If the client wants to create a new regression model every week, he has to manually refresh the dataset each week, while in the model he could schedule the import, which would then refresh the data automatically. It would also be better to store both the data and the predictions in one model instead of having the data in one dataset and the predictions in another.
I would also like to mention that I do not agree with your statement "A best practice when a variable is shown in the debrief with a very high influence, is to check that this variable is not correlated to the target. If it is, exclude it and rebuild the predictive model." When, for example, I want to create a regression model to explain what influences the weight of a person, I would use the height of the person as an influencer because they are highly correlated. Maybe what you meant is that we want to avoid a situation where two influencers are highly correlated with each other, because of the multicollinearity issue. For example, if we also explained the weight with the foot size, height and foot size would be highly correlated and would give the same information when explaining the weight.
Thank you very much for your response!
We did indeed make the choice to save the output of a regression into a dataset because it satisfies most of the needs of SAC BI users. Saving this information inside a planning model is strange because there is a time dimension in such models that does not exist in a regression. That said, I agree with you regarding a BI model, particularly when it is a question of automating a process.
Regarding your remark about the “best practice”, you made the correct assumption. I will also add that I made this remark because one objective of a learning tool is to discover insights hidden in the data. Thus when you have highly correlated variables, of course SAC Smart Predict will discover it, but most of the time you will learn nothing new. In the example of the blog, the size of a deal is highly correlated with the customer segment (Fortune 500, Enterprise or SMB). It is true, but is it surprising and will it help you? Now when you discover that the number of meetings has an influence, it is this kind of insight that is interesting to discover and that can help you.
I hope this has helped you.
thank you for the reply! Yes, you are right that there is an issue with the time dimension in the planning model. Still, it would be great to have it at least in the analytic model so that the process could be more automated instead of creating the estimates ad hoc.
I have never used SAP Analytics Cloud for regression analysis and I am looking for solutions from SAP which would help me build supervised-trained-models. There are 3 solutions which I am narrowing down :
Could you please tell me the right solution for Machine Learning where I can import a dataset (from an S/4HANA system, Excel or CSV) and create trained regression/classification models based on various Machine Learning algorithms such as 'Boosted decision tree/random forest', 'Binary logistic regression', etc.?
I have been working with Minitab, Microsoft Azure to solve problems in Regression Analysis. I just want to have something similar in the SAP landscape.
As far as your blog on "SAP Analytics Cloud - Predictive Analytics - Regression" is concerned I have the following questions :-
Very well explained, it helped me a lot.