You’ve got to love collective nouns. A ballet of swans, a barrel of monkeys, a charm of magpies. What is a group of machine learning models? That would be (perhaps arguably) an ensemble of models. A group of algorithms that go about their business with consensus. So what’s SAP Analytics Cloud got to do with ensembles? Well, this month (Mar 2021) we just replaced the machinery under the hood that powers its predictive modelling and we’re now all about ensembles. An ensemble of trees actually. Confused much?
Allow me to explain.
Smart Predict has machine learning models that business users can leverage and adapt to their own scenarios. Little knowledge of statistics and machine learning is required. With a good understanding of the data and problem statement, models can be configured to make predictions like employee attrition, customer churn, a patient’s medical bill amount, weight of input to a blast furnace, etc. See references for more information on how to build these models.
The model that powered these predictions came to SAP Analytics Cloud from SAP’s acquisition of KXEN, an American software company that primarily marketed predictive analytics software. KXEN’s models strengthened SAP Analytics Cloud’s augmented analytics capability and helped earn that spot in the “Visionary” quadrant of Gartner’s magic quadrant for BI solutions.
The old model & its intuition
KXEN’s classification & regression models were based on an algorithm called Ridge Regression. This falls under a family of machine learning models called “Regularization models” that stem from ordinary least squares regression (OLS). What I describe below for both old and new models is the broad intuition, glossing over details of SAP’s exact implementation under the hood, which are proprietary.
In OLS, you fit a linear equation (say y = mx + c) to your data points so that it best describes them. In my toy training example of 10 data points below, I have a “y” that I want predicted for every “x” I provide as input. The blue points are the true positions of these data points, while the green values would be my predictions if I model my data with the red line. I can see at my 8th data point that the prediction is 9 units away from the truth. This is my error at the 8th data point. One way of calculating total error is to add up all my errors. But this would mean the positive and negative errors cancel each other out. To avoid this, we square the errors and add them up. Your best model would be the line that gives you the least squared error. This is the intuition behind ordinary least squares regression. You can perhaps now see how OLS gets its name. The slope m (aka coefficient of the x variable) and the bias c are the 2 parameters that the OLS model learns from the data it was trained on.
Error or Loss Function to minimise: $\sum_i \left(y_{\text{truth},i} - y_{\text{predicted},i}\right)^2$
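To make the intuition concrete, here is a minimal sketch in plain Python. The 10 data points are made up for illustration (they are not the ones in my figure), and the function names `fit_ols` and `squared_error` are my own. It uses the textbook closed-form solution for a single predictor, and then checks that nudging the slope makes the squared error worse.

```python
# Hypothetical toy data: 10 points, not the ones plotted in the article.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [3, 5, 4, 8, 9, 9, 13, 12, 16, 17]


def fit_ols(xs, ys):
    """Return the slope m and bias c that minimise the sum of squared errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    c = mean_y - m * mean_x
    return m, c


def squared_error(m, c, xs, ys):
    """Sum of squared differences between truth and prediction."""
    return sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys))


m, c = fit_ols(xs, ys)
best_error = squared_error(m, c, xs, ys)
nudged_error = squared_error(m + 0.5, c, xs, ys)  # a slightly different line

print(f"m = {m:.3f}, c = {c:.3f}")
print(f"squared error of fitted line: {best_error:.3f}")
print(f"squared error of nudged line: {nudged_error:.3f}")
```

Any other line you try in place of the fitted one should give a larger squared error on these points.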
However, we find that OLS often overfits the data, which means it tries too hard to fit every data point into its line. As a result, when it predicts on data points it hasn’t seen during model fitting, it falters. It doesn’t “generalise” enough. The concept of generalisation finds application across machine learning, all the way down to the most cutting edge models in deep learning today. Regularisation was a baby step taken to teach models to “generalise”.
Ridge Regression adds a penalty term to OLS that pushes the model towards the smallest coefficients it can get away with. To our earlier Error Function we now add the square of the slope (m² or coefficient²). Notice the λ next to this penalty term, which serves an interesting purpose. If λ=0, the model reduces to OLS. As λ increases, the coefficients shrink towards 0 (since you are minimising this function). As λ tends to infinity, the coefficient tends to 0. If m=0 in y=mx+c it will effectively make your predictor disappear from the model. You would have a model with no predictors. You are looking then for a middle ground. Ridge regression uses a technique called cross validation to find an optimal λ, which is that middle ground.
Error or Loss Function to minimise: $\sum_i \left(y_{\text{truth},i} - y_{\text{predicted},i}\right)^2 + \lambda \cdot \text{coefficient}^2$
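Here is a small sketch of the shrinking effect of λ, again in plain Python on made-up data. It is not SAP’s implementation (which is proprietary); `fit_ridge` and `loo_cv_error` are hypothetical names, the bias c is left unpenalised, and I use leave-one-out cross validation over a small grid of λ values simply because it is the shortest flavour of cross validation to write down.

```python
# Hypothetical toy data: 10 points, not the ones plotted in the article.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [3, 5, 4, 8, 9, 9, 13, 12, 16, 17]


def fit_ridge(xs, ys, lam):
    """Minimise sum((y - (m*x + c))**2) + lam * m**2 for a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / (sxx + lam)  # lam = 0 gives back the OLS slope
    c = mean_y - m * mean_x
    return m, c


def loo_cv_error(xs, ys, lam):
    """Leave-one-out cross validation: mean squared error on held-out points."""
    total = 0.0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        m, c = fit_ridge(train_x, train_y, lam)
        total += (ys[i] - (m * xs[i] + c)) ** 2
    return total / len(xs)


# The slope shrinks towards 0 as lambda grows.
lambdas = [0, 10, 100, 1000, 1e9]
slopes = [fit_ridge(xs, ys, lam)[0] for lam in lambdas]
for lam, slope in zip(lambdas, slopes):
    print(f"lambda = {lam:>12}: m = {slope:.6f}")

# Pick the lambda with the lowest held-out error from a small grid.
grid = [0, 0.1, 1, 10, 100]
best_lam = min(grid, key=lambda lam: loo_cv_error(xs, ys, lam))
print("lambda chosen by cross validation:", best_lam)
```

On a clean, nearly linear toy set like this one, cross validation may well prefer a very small λ. The penalty earns its keep when predictors are many, noisy or correlated.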
Once I know my new coefficient, m, and bias, c, for any new data point that comes my way, I will plug in the value of x in my equation (mx+c) and the result y will be my prediction. Of course, practically you will have more than one predictor. You can extend the formulation by replacing x by x1, x2, x3 and so on. Similarly replace coefficient by coefficient1, coefficient2, coefficient3 and so on.
Why the old model is simply too old
Ridge regression has been found to work very well when you have far too many predictors and too few data points (aka p>n problems). Unlike OLS, it works very well even when it is given predictors that are correlated with each other. However, your predictions will not always have a linear relationship with your predictors. Ridge regression is an idea that has been around in statistics since the 70s, as you can see from the research paper referenced below that first proposed it. It has grown to be a powerful model with modern advances in data crunching. Nevertheless, more powerful models have come since then and taken centre stage in machine learning as we know it today.
The new model & its basic unit
In 1999 Friedman proposed a minor modification to an earlier algorithm that provided substantial improvement in model performance. The idea has since been worked on and revised by Friedman himself and many others, and led to the birth of Gradient Boosting Models. This model was found to be wildly successful in the competitions hosted by the ML community on Kaggle. The idea has gone on to find applications in search engines and high energy physics, including analysis of the data behind the discovery of the Higgs Boson. How the butterfly flaps its wings indeed.
At the heart of gradient boosting lies a very fundamental unit called the tree. It’s called a tree, because it has branches and leaves. Clearly someone got very clever and took the creative liberty of calling what you see in my image below a tree. In my toy example below, I again have 10 data points. I will rephrase my problem slightly, to make it easier to understand. Let’s say for any 2 predictors along the x and y axis, I need to predict if the data point is of type “Blue” or type “Green”. How then can I carve this 2-dimensional space, so I can figure out the best rule that describes this data? I build a flow chart that says, if the value of x is less than 2 then call it type “Blue” but if >=2 call it type “Green”. At my leaves, you notice all green and all blue. Practically, the leaves will have one class in majority. If a new data point is presented to me, I will flow it down my tree and see whether it lands in the majority “Blue” leaf or “Green” leaf and label it accordingly. It is straightforward to extend the intuition to predictions which are numbers. The numerical prediction for each leaf will be the mean of the observations that fall there.
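The flow chart above is small enough to write out as code. This is a sketch with made-up points (not the ones in my image); `build_stump` and `predict` are hypothetical names, and the threshold of 2 is simply taken from the rule described above rather than learned.

```python
from collections import Counter

# Hypothetical toy points as (x, y, label); 10 points, 5 of each type.
points = [
    (0.5, 1.0, "Blue"), (0.8, 2.5, "Blue"), (1.0, 3.0, "Blue"),
    (1.2, 4.0, "Blue"), (1.5, 2.0, "Blue"),
    (2.2, 4.0, "Green"), (2.5, 1.0, "Green"), (3.0, 3.5, "Green"),
    (3.5, 2.0, "Green"), (4.0, 1.5, "Green"),
]


def build_stump(points, threshold):
    """One-split tree on x: each leaf predicts its majority label."""
    left = [label for x, _, label in points if x < threshold]
    right = [label for x, _, label in points if x >= threshold]
    return {
        "threshold": threshold,
        "left": Counter(left).most_common(1)[0][0],
        "right": Counter(right).most_common(1)[0][0],
    }


def predict(stump, x):
    """Flow a new data point down the tree and return the leaf's label."""
    return stump["left"] if x < stump["threshold"] else stump["right"]


stump = build_stump(points, threshold=2.0)
print(predict(stump, 1.0))
print(predict(stump, 3.7))
```

For a numerical target, the only change would be that each leaf stores the mean of its observations instead of the majority label.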
More complex trees with more complex rules can be created. However, like OLS, one tree on a whole data set will overfit and try too hard to fit all data points into a rule. For each tree there are a couple of decisions to be made:
- What is the variable threshold at which the tree branches get created? We use something called information gain for this.
- When do we stop splitting into branches? The more we split the more we overfit, so we do something called pruning the tree. We simply cut off some branches that don’t bring a very big change in prediction. This helps generalise, but not quite enough.
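To give a feel for how a threshold gets picked, here is a sketch of information gain based on entropy, which is one common way to define it for classification. The data and candidate thresholds are made up, `entropy` and `information_gain` are my own helper names, and I am not claiming this is the exact criterion used under the hood in Smart Predict.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())


def information_gain(xs, labels, threshold):
    """Drop in entropy obtained by splitting at x < threshold."""
    left = [lab for x, lab in zip(xs, labels) if x < threshold]
    right = [lab for x, lab in zip(xs, labels) if x >= threshold]
    if not left or not right:
        return 0.0  # a split that separates nothing gains nothing
    n = len(labels)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted


# Hypothetical toy data: 5 "Blue" points followed by 5 "Green" points.
xs = [0.5, 0.8, 1.0, 1.2, 1.5, 2.2, 2.5, 3.0, 3.5, 4.0]
labels = ["Blue"] * 5 + ["Green"] * 5

candidates = [1.1, 2.0, 3.2]
for t in candidates:
    print(f"threshold {t}: gain = {information_gain(xs, labels, t):.3f}")

best_threshold = max(candidates, key=lambda t: information_gain(xs, labels, t))
print("best threshold:", best_threshold)
```

The threshold that separates the two colours cleanly gets the highest gain, which is why the tree branches there.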
This brings us to the question of how to “generalise” better and from this a whole class of machine learning models called tree based models burst forth. Instead of a single tree, we build a collection of trees – an ensemble of trees, that we consult for consensus. Think of this as a council of elders that have their own experiences with the information and are suggesting an outcome. The biblical proverb comes to mind – “Where there is no counsel, the people fall; But in the multitude of counsellors there is safety.” Proverbs 11:14
How the new model works
In the gradient boosting method, our multitude of counsellors are intentionally weakened with a parameter similar to the λ of Ridge Regression. The core intuition is that you iteratively build weak learners and then aggregate their findings to build one very strong learner.
- You take the dataset, and fit the first tree, say T1(x), which is a function of your predictor x. You use it to make your prediction for the ground truth, y. Let’s call our prediction ŷ. You allow your tree to have d splits. You weaken your tree by multiplying its predictions with λ, so your predictions end up being more distant from the truth. This means you will be left with a residual, say R1, similar to the error in OLS. Like in our previous example with OLS, if your prediction is 38 when the truth is 29, your prediction is 9 units off, and your residual (truth minus prediction) at that data point is -9.
y = λ * T1(x) + R1
- You now want to explain your residual R1 at each data point in terms of your predictors. So you build another tree on the residual R1. You get a prediction for R1 weakened by λ. You can now rewrite y with the newfound value for R1.
R1 = λ * T2(x) + R2
y = λ * T1(x) + λ * T2(x) + R2
- Notice now you are entering a loop. You now want to explain your residual R2, so you build another weakened tree T3 to explain it.
R2 = λ * T3(x) + R3
y = λ * T1(x) + λ * T2(x) + λ * T3(x) + R3
- You keep building trees until you reach a point where your residual is small enough to disappear from your equation at the nth tree. What remains is your final prediction ŷ.
R_{n-1} = λ * Tn(x)
ŷ = λ * T1(x) + λ * T2(x) + λ * T3(x) + ….. + λ * Tn(x)
Or you have hit the limit of trees you planned to build (B), but by then your RB is minuscule enough to be ignored.
R_{B-1} = λ * TB(x) + RB
y = λ * T1(x) + λ * T2(x) + λ * T3(x) + ….. + λ * TB(x) + RB
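The loop described above can be sketched in a few lines of plain Python. This is a simplified illustration for a numerical target with squared error, not SAP’s implementation: the data is made up, the names `fit_stump`, `boost` and `predict` are hypothetical, every tree has a single split (d = 1), and the values λ = 0.1 and B = 200 are arbitrary choices.

```python
# Hypothetical toy data: 10 points, not the ones plotted in the article.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [3, 5, 4, 8, 9, 9, 13, 12, 16, 17]

LAM = 0.1       # shrinkage: how much each tree is weakened
N_TREES = 200   # B: the limit on the number of trees


def fit_stump(xs, targets):
    """Fit a one-split regression tree: each leaf predicts the mean of its targets."""
    best = None
    distinct = sorted(set(xs))
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2
        left = [t for x, t in zip(xs, targets) if x < threshold]
        right = [t for x, t in zip(xs, targets) if x >= threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((t - left_mean) ** 2 for t in left)
                 + sum((t - right_mean) ** 2 for t in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x < threshold else right_mean


def boost(xs, ys, lam, n_trees):
    """Repeatedly fit a weakened tree to whatever is still unexplained."""
    trees = []
    residuals = list(ys)  # before the first tree, nothing is explained yet
    for _ in range(n_trees):
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        # New residual = old residual minus the weakened tree's prediction.
        residuals = [r - lam * tree(x) for x, r in zip(xs, residuals)]
    return trees, residuals


def predict(trees, x, lam):
    """Final prediction: the sum of all weakened trees."""
    return sum(lam * tree(x) for tree in trees)


trees, residuals = boost(xs, ys, LAM, N_TREES)
for x, y, r in zip(xs, ys, residuals):
    print(f"x = {x:2d}  truth = {y:2d}  prediction = {predict(trees, x, LAM):6.2f}  residual = {r:6.2f}")
```

At every step the truth equals the sum of the weakened trees built so far plus the current residual, which is exactly the bookkeeping in the equations above.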
Why the model is so powerful
- Unlike a single tree, which will undoubtedly overfit the data, gradient boosting learns sequentially from an ensemble of trees. Overfitting may occur if your number of trees (B) is too large, but this occurs slowly if at all.
- Weakening learners so you can combine the learnings into a powerful learner:
  - The shrinkage parameter or learning rate (λ) slows the learning process down, allowing different shaped trees to attack the residuals.
  - You restrict how many splits can be made in a tree (d). These d splits will involve at most d variables, weakening the learning process.
  - On a related note, in a hat tip to Random Forest (another tree based ensemble model), you can choose to use only a small proportion of the predictor variables for any given tree.
  - Instead of giving your entire dataset to a particular tree, a random sub-sample of your entire dataset is given to each tree. This is a powerful move, as it now allows for weak predictors to also find a voice and influence the prediction. Oftentimes machine learning models are dominated by strong predictors (the loud voices).
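The last two knobs are easy to picture in code. Below is a sketch of how each tree could be handed a random half of the rows and a random pair of predictors. The predictor names, the fractions and the function `draw_subsample` are all made up for illustration, and the tree fitting itself is left out.

```python
import random

# Hypothetical predictor names for an employee attrition style dataset.
predictors = ["age", "tenure", "salary", "commute_distance"]
N_ROWS = 10


def draw_subsample(n_rows, predictors, row_fraction, n_predictors, rng):
    """Pick the rows and predictors that one tree is allowed to see."""
    n_sampled = max(1, int(row_fraction * n_rows))
    rows = sorted(rng.sample(range(n_rows), n_sampled))
    cols = sorted(rng.sample(predictors, n_predictors))
    return rows, cols


rng = random.Random(0)  # fixed seed so the run is repeatable
subsamples = []
for tree_number in range(1, 4):
    rows, cols = draw_subsample(N_ROWS, predictors, row_fraction=0.5,
                                n_predictors=2, rng=rng)
    subsamples.append((rows, cols))
    print(f"tree {tree_number}: rows {rows}, predictors {cols}")
```

Because a dominant predictor is sometimes left out of the draw, trees built this way are forced to listen to the quieter predictors as well.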
All the methods and variables described above are parameters for the Gradient Boosting Algorithm. Knobs that can be turned to different values so you may “tune” the output signal. This leads to an army of weak learners that end up predicting the final outcome with astonishing accuracy. This has been proven time and again in competitions involving machine learning models, where Gradient Boosting models often outperform others. Today, for structured data sets, Gradient Boosting is a class of algorithms that are considered state of the art, and the best of the best.
And this is why SAP Analytics Cloud gave its innards a revamp. The old has been cast away, and a new guard has taken its place. This guard is an ensemble of trees, the biblical multitude of counsellors, whose counsel we believe businesses will benefit from.
So… have you counselled with our trees yet? 😉
- For those with an academic bent of mind:
  - Hoerl, A. E.; Kennard, R. W. (1970). “Ridge Regression: Biased Estimation for Nonorthogonal Problems”. Technometrics, Vol. 12, No. 1 (Feb. 1970), pp. 55-67.
  - Friedman, J. H. (February 1999). “Greedy Function Approximation: A Gradient Boosting Machine”.
  - Friedman, J. H. (March 1999). “Stochastic Gradient Boosting”.
  - Mason, L.; Baxter, J.; Bartlett, P. L.; Frean, M. (1999). “Boosting Algorithms as Gradient Descent”. In S. A. Solla, T. K. Leen and K. Müller (eds.), Advances in Neural Information Processing Systems 12. MIT Press. pp. 512–518.
  - Mason, L.; Baxter, J.; Bartlett, P. L.; Frean, M. (May 1999). “Boosting Algorithms as Gradient Descent in Function Space”.
- How to build models with SAP Analytics Cloud:
  - Anybody can do data science with SAP Analytics Cloud by yours truly
  - Predicting the next move of the virus covid-19 also by yours truly
  - Regression in detail by Thierry Brunet, whose content I admire very much
  - Classification in detail also by Thierry Brunet
  - Time series forecasting in detail also by Thierry Brunet