In the previous post of the series, “SAP Tech Bytes: Your first Predictive Scenario in SAP Analytics Cloud”, we built our predictive model with just a few clicks using Smart Predict in SAP Analytics Cloud. Or maybe you have built a classification model yourself based on some other data.

Before we move on and try to improve our first model, let’s spend a few minutes understanding how, and how well, ML models make predictions. While Machine Learning can accelerate decision-making by automating it, it is important to be able to **trust** the predictions coming from trained models.

## Global Performance Indicators

We’ve seen some of the indicators, namely Predictive Power and Predictive Confidence, in the previous post already.

As a reminder:

- **Predictive Power** is the main measure of predictive model accuracy. The closer its value is to 100%, the more confident you can be when you apply the predictive model to obtain predictions.
- **Prediction Confidence** is your predictive model’s ability to achieve the same degree of accuracy when you apply it to a new dataset that has the same characteristics as the training dataset. This value should be as close as possible to 100%.

To better understand these indicators we first need to understand the **% Detected Target** chart.

And to understand this chart we first need to understand…

## Prediction Probability

Let’s re-apply the model to the same `test` dataset as before, but this time including **Prediction Probability** in addition to the **Prediction Category** in the output.

Call the output `test-predictions-probability`.

Open the `test-predictions-probability` dataset and order the data by the probability column.

The Prediction Probability is the probability that the Predicted Category is the target value. In our case the target value is `1`, meaning that a passenger survived.

The probability is calculated from the contributions of the influencers and their category influences in the trained model.

During the training phase, the model finds a threshold value that separates the probability values assigned to the target value from the rest. In our example, it found a value somewhere between roughly `0.419` and `0.425` that determines whether a record is classified as `0` or `1`.
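As a minimal sketch of that rule in Python: the exact threshold is not exposed, so `THRESHOLD = 0.42` below is an assumed value picked from the observed range, and `predicted_category` is a hypothetical helper, not part of SAP Analytics Cloud.

```python
# Assumed threshold: the post only narrows it down to roughly 0.419..0.425.
THRESHOLD = 0.42


def predicted_category(probability: float, threshold: float = THRESHOLD) -> int:
    """Return 1 (target) when the probability reaches the threshold, else 0."""
    return 1 if probability >= threshold else 0


# Made-up probabilities, for illustration only.
for p in (0.91, 0.425, 0.419, 0.08):
    print(p, "->", predicted_category(p))
```

Any value inside the observed range would classify the records of our example the same way, which is why the threshold can only be located approximately from the output.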

We can also see the distribution of the probability values for our example.

Now that we understand the prediction probability, we should easily understand…

## Detected Target Curve

The Detected Target chart was nicely explained by Stuart Clarke in unit 1 “Model Performance Metrics” of week 5 of the openSAP course Getting Started with Data Science.

To get this chart we order all observations (aka “the population”) from the dataset by their predicted probability, from the highest to the lowest, along the X-axis. The Y-axis represents the percentage of targets (in our case the observations where `Survived` is equal to `1`) identified in the population so far, compared to the total number of targets.
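The construction of the curve can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `detected_target_curve` is a hypothetical helper, the sample values are made up, and the data is assumed to contain at least one target.

```python
def detected_target_curve(probabilities, actuals):
    """Return (population_share, detected_share) points of the curve."""
    # Order observations from the highest to the lowest predicted probability.
    ranked = sorted(zip(probabilities, actuals), key=lambda pair: pair[0], reverse=True)
    total_targets = sum(actual for _, actual in ranked)  # assumed to be > 0
    points = [(0.0, 0.0)]
    found = 0
    for position, (_, actual) in enumerate(ranked, start=1):
        found += actual  # actual is 1 for a target, 0 otherwise
        points.append((position / len(ranked), found / total_targets))
    return points


# Made-up sample: 1 = survived, 0 = did not survive.
probs = [0.9, 0.8, 0.6, 0.4, 0.2]
survived = [1, 1, 0, 1, 0]
curve = detected_target_curve(probs, survived)
for x, y in curve:
    print(f"{x:.0%} of population -> {y:.0%} of targets detected")
```

Each point answers the question "if we take the top X% of observations by probability, what share of all real targets have we found?".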

The Validation partition of the `train` dataset is used to evaluate the quality of the model. It lets us compare the Predicted Category (produced by the model trained on the Training partition of the dataset) with the ground truth, i.e. with the real value of the `Survived` variable for each observation.

We do not know exactly how the observations were split between these two partitions, but we know that 36.59% of the observations in the Validation partition had the `Survived` variable equal to `1`.

Therefore **the perfect model** would need only the top 37% of observations, ordered by predicted probability, to detect 100% of those who survived. On the other hand, a **random** selection of 37% of observations without any model should statistically give us 37% of the targets detected.

So the closer the results of a model applied to the Validation partition are to the results of the perfect model, the better!
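To make the two reference lines concrete, here is a small Python sketch. The target rate comes from the post; the function names `perfect_model` and `random_model` are hypothetical and only illustrate the reasoning above.

```python
# Share of observations with Survived == 1 in the Validation partition (from the post).
TARGET_RATE = 0.3659


def perfect_model(population_share: float, target_rate: float = TARGET_RATE) -> float:
    """Detected share when every target is ranked ahead of every non-target."""
    return min(population_share / target_rate, 1.0)


def random_model(population_share: float) -> float:
    """Detected share expected when observations are picked at random."""
    return population_share


for share in (0.10, 0.3659, 0.50, 1.00):
    print(
        f"{share:.2%} of population: "
        f"perfect={perfect_model(share):.2%}, random={random_model(share):.2%}"
    )
```

A real model lands somewhere between these two lines: above the random diagonal and below the perfect curve.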

### Let’s do a small exercise

While we do not know the random split between partitions used during the training process, we can still apply the model to the `train` dataset to get the idea.

So,

- Apply the model to the same `train` dataset that was used to train it,
- Include `Survived` in the replicated columns,
- Include both `Category` and `Probability` in the output,
- Save the result as `train-predictions`.

Once the model is applied, open the `train-predictions` dataset and sort it in descending order by the Prediction Probability column.

Scroll down until you find the first mismatch between the Survived and Predicted Category columns. In this case, we got a **False Positive**: an observation that the model classified as a target (the person should have survived) while in reality it was not. This is the point on the chart where we would see the Validation line start to diverge from the Perfect Model line.

Scrolling up from the bottom of the dataset, we will also find the last **False Negative**: an observation where the model wrongly classified a person who survived as not having survived.

Let’s check how many false predictions there are. Open the Custom Expression Editor and add a new `Match` column defined as below.

```
[Match] = if(
    [Survived] == 1,
    if ([Predicted Category]==1, "True Positive", "False Negative"),
    if ([Predicted Category]==0, "True Negative", "False Positive")
)
```

Execute it and check the distribution of matches.
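If you have exported the predictions and prefer to check them outside SAP Analytics Cloud, the same logic can be sketched in Python. The `match_label` helper is hypothetical and the sample rows are made up; only the four labels come from the expression above.

```python
from collections import Counter


def match_label(survived: int, predicted: int) -> str:
    """Mirror the Match expression: compare ground truth with the prediction."""
    if survived == 1:
        return "True Positive" if predicted == 1 else "False Negative"
    return "True Negative" if predicted == 0 else "False Positive"


# Made-up (Survived, Predicted Category) pairs, for illustration only.
rows = [(1, 1), (0, 0), (0, 1), (1, 0), (1, 1), (0, 0)]
distribution = Counter(match_label(s, p) for s, p in rows)
for label, count in distribution.items():
    print(label, count)
```

The resulting counts are the four cells of a confusion matrix.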

### Now, going back to the Detected Target chart…

…we can see that it took 87% of the observations from the Validation partition to capture 100% of the targets, i.e. the observations where `Survived` was equal to `1`.
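That percentage is simply the position of the lowest-ranked real target in the list sorted by probability. A minimal Python sketch, using a hypothetical helper and made-up sample values (not the Titanic data), and assuming the data contains at least one target:

```python
def share_needed_for_all_targets(probabilities, actuals):
    """Smallest share of the ranked population that contains every real target."""
    ranked = sorted(zip(probabilities, actuals), key=lambda pair: pair[0], reverse=True)
    # 1-based position of the last real target in the ranking.
    last_target_position = max(
        position
        for position, (_, actual) in enumerate(ranked, start=1)
        if actual == 1
    )
    return last_target_position / len(ranked)


# Made-up sample: the last survivor is ranked 4th out of 5 observations.
probs = [0.9, 0.8, 0.6, 0.4, 0.2]
survived = [1, 1, 0, 1, 0]
needed = share_needed_for_all_targets(probs, survived)
print(f"{needed:.0%} of the population is needed to capture all targets")
```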

I hope this exercise helped you read this chart, which in turn should help us better understand…

## Predictive Power and Predictive Confidence

Again, these Performance Indicators were nicely explained by Stuart Clarke in unit 1 “Model Performance Metrics” of week 5 of the openSAP course Getting Started with Data Science.

In our example, go to the chart’s settings and add Training to the Y-Axis.

Visually, in very simple terms:

- **Predictive Power** shows how close the Validation curve is to the Perfect Model curve, i.e. how often the predictions are correct, and
- **Predictive Confidence** shows how close the Validation curve is to the Training curve, i.e. whether we should expect similar Predictive Power when applying the model to a dataset with characteristics similar to those of the dataset used for training.
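One way to turn "how close" into a number is to compare areas between the curves. The Python sketch below does that with made-up curves sampled at the same X positions. Both ratios are illustrative assumptions: this post does not give the exact formulas Smart Predict uses, so do not expect these functions to reproduce the values shown in SAP Analytics Cloud.

```python
def area_under(curve):
    """Trapezoidal area under a list of (x, y) points sorted by x."""
    return sum((x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(curve, curve[1:]))


def area_between(curve_a, curve_b):
    """Approximate area between two curves sampled at the same x positions."""
    gap = [(x, abs(ya - yb)) for (x, ya), (_, yb) in zip(curve_a, curve_b)]
    return area_under(gap)


RANDOM_AREA = 0.5  # area under the diagonal "random" line


def closeness_to_perfect(validation, perfect):
    """Assumed 'power'-like ratio: 1.0 matches the perfect model, 0.0 is random."""
    return (area_under(validation) - RANDOM_AREA) / (area_under(perfect) - RANDOM_AREA)


def closeness_to_training(validation, training, perfect):
    """Assumed 'confidence'-like ratio: 1.0 means both curves coincide."""
    return 1 - area_between(validation, training) / (area_under(perfect) - RANDOM_AREA)


# Made-up curves with a 50% target rate, for illustration only.
perfect = [(0.0, 0.0), (0.25, 0.5), (0.5, 1.0), (0.75, 1.0), (1.0, 1.0)]
validation = [(0.0, 0.0), (0.25, 0.4), (0.5, 0.7), (0.75, 0.9), (1.0, 1.0)]
training = [(0.0, 0.0), (0.25, 0.45), (0.5, 0.75), (0.75, 0.95), (1.0, 1.0)]

power = closeness_to_perfect(validation, perfect)
confidence = closeness_to_training(validation, training, perfect)
print(f"power-like ratio: {power:.2f}, confidence-like ratio: {confidence:.2f}")
```

The point of the sketch is the intuition only: the first ratio grows as the Validation curve approaches the Perfect Model curve, and the second grows as the Validation and Training curves approach each other.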

## That’s it for now, but if you are interested…

…in a more in-depth review of indicators, then please check Classification in SAP Analytics Cloud in Detail by Thierry BRUNET.

Equipped with this knowledge we will try to improve our classification predictions in the next post of the series.

Stay tuned!

-Vitaliy, aka @Sygyzmundovych