# Regression problems: the model with the highest R2 score is usually not the most optimal one!

So I have been playing a lot lately with SAP HANA’s PAL library for different requirements, and I can say that it’s a very good package, **but only if you know what to use and what not to use.**

To introduce myself: I am an ABAP / CRM techie with additional experience in practical machine learning, which I have been applying to problems such as anomaly detection, prediction and classification on platforms like Octave, Python and, recently, SAP.

**Situation:**

The problem here is a very common one. Suppose you have to predict the sales of a company with respect to the number of leads it has. This is a regression problem, and if you know a little about PAL, then polynomial regression, logarithmic regression and so on will come to mind. You might even jump straight to a SAP PAL function (e.g. POLYREG), train a few models and pick the one with the R2 score closest to 1. But wait, and read on first!

**Problem statement:**

For lack of grounding in data science and machine learning concepts, practitioners like us tend to fall into the R2 trap, because R2 looks like a simple statistic for checking whether a model is good or not.

**R2 score:**

The R2 score is a value that shows how well the model fits your training data. However, there is a difference between fitting and optimal fitting. When it comes to the predictive performance of a model, the training R2 score is not a valid guide, because it measures how well the model fits your **training data** and says nothing about how well it predicts data it has not seen.
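For reference, this is the usual textbook definition of R2 (the coefficient of determination). The symbols are notation chosen here for illustration and are not taken from the PAL documentation: $y_i$ are the observed values, $\hat{y}_i$ the values the model produces for the same points, and $\bar{y}$ the mean of the observed values.

$$
R^2 = 1 - \frac{\sum_{i} (y_i - \hat{y}_i)^2}{\sum_{i} (y_i - \bar{y})^2}
$$

Every term is computed on the points the model was fitted to, which is why the value can approach 1 without saying anything about new points.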

Usually a very high training R2 score means a **high possibility** of “high variance”. In my experience there have been cases where a model with an R2 score of, for example, 0.983 fits far **more optimally** than models with R2 scores of 0.99 or 0.992. I have seen many people talk about achieving a high R2 score, as close to R2 = 1 as possible. However, many of us are not aware of the bias-variance problem that a very high or a very low R2 score brings.

**Bias Vs Variance trade-off:**

So what do the terms high variance and high bias mean?

High bias: the model fits so poorly that it does not follow the input data, and the curve looks more like an ill-fitting line: **bad fitting** (underfitting). The curve is too blunt, and predictions made with it will be consistently off, even on the training data.

High variance: the model fits the input data so closely that the curve looks unreal: **good fitting but not optimal fitting** (overfitting). The curve is too sharp, and predictions made with it will swing unrealistically on new data.

Just optimal: the model fits the input data optimally and the curve also looks realistic. The curve is smooth, and predictions made with it will be close to the real values.
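The three cases can be illustrated with a small sketch. It is not PAL code: it uses NumPy's `polyfit` on synthetic data (a noisy sine curve, assumed purely for illustration), and the degrees 1, 3 and 8 are arbitrary choices. It prints the training R2 next to the error on points that were held out from training. For least-squares polynomials the training R2 never goes down as the degree grows, so it cannot tell you when the curve has become too sharp; on noisy data the held-out error often can.

```python
import numpy as np


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Synthetic data (an assumption for illustration): a sine curve plus noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 30)
y = np.sin(2.0 * np.pi * x) + rng.normal(0.0, 0.3, x.size)

# Hold out every third point; train on the rest.
train = np.arange(x.size) % 3 != 0
hold = ~train

results = {}
for degree in (1, 3, 8):
    coeffs = np.polyfit(x[train], y[train], degree)
    train_r2 = r2_score(y[train], np.polyval(coeffs, x[train]))
    hold_mse = float(np.mean((np.polyval(coeffs, x[hold]) - y[hold]) ** 2))
    results[degree] = (train_r2, hold_mse)
    print(f"degree {degree}: training R2 = {train_r2:.3f}, held-out MSE = {hold_mse:.3f}")
```

Comparing the two columns for your own data is a quick way to see whether a higher R2 is buying better predictions or only a sharper curve.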

**So how do I choose the most “optimal” model?**

Now the question is how to decide on the best model. To select it, we will follow the **k-fold model selection** procedure (k-fold cross-validation). Follow the simple steps below:

1. Divide your training set into n bins of data. n can be a value from 2 to 10 and defines how many times we run training and prediction on the same set; I usually prefer 4.

2. For each M from 1 to n, mark the **M**th bin as the cross-validation bin and the other n-1 bins as the training bins.

3. Train your regression model only on the training bins.

4. Predict the output for the cross-validation bin.

5. **Follow this step, as this is how you choose the best model:**

- Calculate the set error as the average of the squared differences: mean of [ (output of step 4 - Y value from the cross-validation bin) ** 2 ].

6. Repeat steps 2 to 5 for each of the n bins.

7. Repeat steps 1 to 6 for each candidate model, average the errors calculated in step 5 over that model's n runs, and pick the model with the lowest average error.
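The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the PAL or ABAP implementation: the candidate models are assumed to be polynomials of different degrees fitted with NumPy's `polyfit`, the function names `k_fold_mse` and `pick_best_degree` are hypothetical, the leads/sales data is synthetic, and the rows are shuffled before binning (an extra assumption, so that no bin covers only one end of the x range).

```python
import numpy as np


def k_fold_mse(x, y, degree, n_bins=4, seed=0):
    """Average cross-validation error of a polynomial of the given degree."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(x))        # shuffle rows (assumption, see lead-in)
    bins = np.array_split(indices, n_bins)   # step 1: n bins
    errors = []
    for m in range(n_bins):                  # steps 2 and 6: each bin takes a turn
        cv_idx = bins[m]
        train_idx = np.concatenate([b for i, b in enumerate(bins) if i != m])
        coeffs = np.polyfit(x[train_idx], y[train_idx], degree)   # step 3
        predicted = np.polyval(coeffs, x[cv_idx])                 # step 4
        errors.append(np.mean((predicted - y[cv_idx]) ** 2))      # step 5
    return float(np.mean(errors))


def pick_best_degree(x, y, degrees, n_bins=4):
    """Step 7: score every candidate and return the one with the lowest error."""
    scores = {d: k_fold_mse(x, y, d, n_bins) for d in degrees}
    return min(scores, key=scores.get), scores


# Synthetic example: sales as a noisy quadratic function of leads.
rng = np.random.default_rng(42)
leads = np.linspace(0.0, 10.0, 40)
sales = 3.0 + 2.0 * leads + 0.5 * leads**2 + rng.normal(0.0, 2.0, leads.size)

best, scores = pick_best_degree(leads, sales, degrees=[1, 2, 3, 6])
print("best degree:", best)
for degree, error in scores.items():
    print(f"degree {degree}: average CV error = {error:.3f}")
```

Using the same seed for every candidate means all models are scored on identical bins, so the comparison in step 7 is like for like.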

This is standard best practice in machine learning, and it is what I always follow to pick the best model, **rather than relying on R2 scores.**

Hi Hasan, interesting post. Any thoughts on Automated/APL regression approach? Did you try it? Thanks & regards, Antoine

Hi Antoine, I know about the product, but we don't have access to it right now, so I have not been able to try it.

However, I am currently working on a scenario where data from machine sensors can arrive in any pattern at run time and has to be used for prediction. The algorithm will have to pick the best model among 8 power regression models automatically, in the ABAP code itself.

Since all of this happens at run time, I had to convert the logic described above into ABAP code so that the best model can be selected at run time.

Thanks,

Hasan

Hi Hasan, you can test drive a fully functional copy of SAP BusinessObjects Predictive Analytics for 30 days. See https://go.sap.com/cmp/ft/crm-xm15-dwn-an001/index.html. Your scenario is very interesting.

Thanks

Antoine

Hey Antoine,

I tried this tool for multiple use cases and have shared my feedback here:

https://blogs.sap.com/2017/09/21/sap-pa-3.2-feedback-awesome-tool-but-needs-focus-on-basics./

Thanks,

Hasan

Thanks for the feedback Hasan. We will look into the feedback and get back with our answer. Kind regards Antoine