
Anybody can do data science with SAP Analytics Cloud – Part 2

This blog post is the second in an ongoing series of blogs on “Anybody can do data science – with SAP Analytics Cloud”.

My statistics professor once told me that if you torture data long enough, it will confess to anything. In this blog post, I will attempt to bring “science” into this interrogation room and explain these confessions from the perspective of data science. Our tool of choice for the interrogation will be SAP Analytics Cloud’s Smart Predict. My hope is to rouse your curiosity and inspire you to start using Smart Predict to answer your own questions.

Part 2: Will I enjoy this wine?

In the previous part of this blog series, we laid the groundwork for solving our problem. We understood the data, loaded it into SAC and trained our very first model. In this part, we interpret the results and see how we can use the model to get predictions.

 

Step 3. Understanding the results

  • After you click on Train and SAC completes the training process, it will show you 2 tabs of information.
    • Overview tells you about the quality of the results
      • In our case, it is 99% confident about its results, which is awesome!
      • It also says that the error is 0.8. This means that the true value is typically within about ±0.8 of our prediction. A slight dive into statistics at this point: ideally, this error should be less than the standard deviation of the target variable. In other words, our prediction ± error should do better than the very naïve model of mean ± standard deviation.
      • The Target statistics describe the mean and standard deviation of the target for the portion of the dataset used to train the model and for the portion reserved to validate it.

    • The influencer contributions explain the results
      • Density of the wine and sugar understandably have the highest correlation with wine quality, followed by the other variables.
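
To make the "error versus standard deviation" check above concrete, here is a minimal Python sketch. It assumes the error SAC reports behaves like a root mean squared error (RMSE), which the post does not state explicitly. The quality scores and predictions are made up for illustration, and the function names `rmse` and `beats_naive_baseline` are my own, not part of SAC.

```python
import math
import statistics


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def beats_naive_baseline(actual, predicted):
    """True when the model's RMSE is lower than that of always predicting the mean.

    The RMSE of the "always predict the mean" model equals the population
    standard deviation of the target, which is why the two are compared.
    """
    mean = statistics.fmean(actual)
    baseline_error = rmse(actual, [mean] * len(actual))
    return rmse(actual, predicted) < baseline_error


# Made-up quality scores and model predictions (illustration only).
actual = [5, 6, 7, 5, 6, 8]
predicted = [5.4, 5.8, 6.5, 5.6, 6.1, 7.0]

print("model error:      ", round(rmse(actual, predicted), 3))
print("std dev of target:", round(statistics.pstdev(actual), 3))
print("better than naive:", beats_naive_baseline(actual, predicted))
```

If the last line prints False, the model is doing no better than simply guessing the average quality for every wine.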

 

Step 4. Applying the model

  • Now that our model is ready, we can apply it to the dataset we had carved out earlier.
  • Click on the apply model option (icon at far right).

  • SAC will now ask which dataset it needs to make predictions for and where it should save its predictions. Under input dataset, select the dataset of 5 white wines we had carved out earlier to check the predictions. Under output dataset, indicate where you would like to save the predictions.

     

  • Under output columns, SAC will ask which variables to use for prediction. Select the variables that you want to feed into the model. These should be the very same attributes you trained the model with in Step 2.

  • You can also indicate what additional information you would like to see alongside the predictions, such as the date when the model was applied, whether the wine is considered an outlier in the dataset and, of course, the predicted value itself. Click OK to run the model on the new dataset.

  • When SAC is done creating predictions, you should be able to see a new file in your original folder containing the predictions. Click on this file to review how the predictions look.

  • In the predictions file, see the column at far right called Predicted Value. You can compare this against the true quality value. The model predictions are quite close to the ground truth. As highlighted earlier, they are within ±0.8 of the ground truth.

  • You can repeat this process for the red wines. Click on the + sign in the ribbon at the top to add a new model.

  • I found the model for red wines had an error of 0.69 with a confidence of 95%. Alcohol seems to be the dominant predictor.

  • I can see both my models in the predictive models section at the bottom, along with the status of each model (trained / applied). I can click on either of them (the selected model gets highlighted in blue) and view the corresponding results. For the selected model, I can apply it to new datasets to see predictions.

  • The predictions for red wine also seem quite close to the ground truth.
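
If you would rather not eyeball the comparison between the Predicted Value column and the true quality, a few lines of Python can do it for you. This is only a sketch: it assumes the predictions have been exported to a CSV file with a `quality` column next to `Predicted Value`, and the five rows below are made-up numbers, not my actual results. The helper name `absolute_errors` is hypothetical.

```python
import csv
import io

# Made-up stand-in for an exported predictions file (illustration only).
SAMPLE_CSV = """quality,Predicted Value
6,5.7
5,5.4
7,6.3
6,6.2
5,5.6
"""


def absolute_errors(csv_text, truth_col="quality", pred_col="Predicted Value"):
    """Return |truth - prediction| for every row of the CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [abs(float(row[truth_col]) - float(row[pred_col])) for row in reader]


errors = absolute_errors(SAMPLE_CSV)
tolerance = 0.8  # the error reported for the white wine model

print(f"largest absolute error: {max(errors):.2f}")
print("all within tolerance:", all(e <= tolerance for e in errors))
```

With only five wines this is a sanity check rather than a proper evaluation, but it is a quick way to spot a prediction that is far off.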

 

Insights and takeaways

  1. If you got this far into this episode, congratulations! You have built your very own predictive model with SAP Analytics Cloud!
  2. You can see how easy it is to build basic models. All you need is a clear understanding of the problem you wish to solve and a good dataset. You do not need ANY coding expertise.
  3. For the sake of brevity, we haven’t customised the model much here. This is very much possible, and I will explore customisations in future blogs. Stay tuned!
  4. Why did we use regression and not classification?
    1. Regression is used when the expected output is a number, like revenue, number of customers, etc. The result will be a prediction ± an error. As such, this error often defines the quality of the result.
    2. Classification is used when the expected output is a category or class (like will the customer buy or not, will the deal close or not). The results will be a prediction and how many predictions are typically wrong / right in each class. As such, accuracy often defines quality of the result.
    3. For both models, other metrics of measuring quality exist, but that is for another blog post.
    4. This problem can also be modelled as a classification problem, but we would have 11 classes, which arguably complicates the problem unnecessarily. I do not wish to see how many wrongs and rights I get in each of the 11 classes. I only wish to see my prediction and, typically, the ballpark within which the truth lies. Regression is apt for such a problem.
  5. Why did we separate the models for white and red wines?
    1. If you believe the two wine types are very different in nature, and the attributes that influence quality will be radically different, you can keep the models separate.
    2. If you are not sure about their similarity, you can combine the data and add another input variable that indicates whether the wine is white or red. Ideally, if the 2 wines are truly different, the wine type should show up as a top influencer in your results. In this case, the model will provide a prediction that is specific to the wine type.
  6. Lastly, a word of caution – your model is as good as your data. The more data the better. The better the quality of data, the better the quality of results. The more representative your data is of the real world, the more reliable your model is.
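
For the combined-model option in point 5, the data preparation is simply a matter of tagging each row with its wine type before stacking the two datasets. Here is a minimal Python sketch of that step. The column name `wine_type`, the helper `tag_rows` and the sample rows are all assumptions of mine for illustration; the real datasets have many more attributes.

```python
def tag_rows(rows, wine_type):
    """Return copies of the rows with an added 'wine_type' column."""
    return [{**row, "wine_type": wine_type} for row in rows]


# Made-up sample rows (illustration only).
white = [
    {"alcohol": 10.5, "density": 0.994, "quality": 6},
    {"alcohol": 9.2, "density": 0.998, "quality": 5},
]
red = [
    {"alcohol": 11.0, "density": 0.996, "quality": 6},
    {"alcohol": 9.8, "density": 0.997, "quality": 5},
]

# One dataset, with the wine type available as an extra input variable.
combined = tag_rows(white, "white") + tag_rows(red, "red")

for row in combined:
    print(row)
```

The combined dataset can then be loaded and used for training in the same way as the separate ones, with `wine_type` included among the input variables.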

 

Follow on to Part 3 of this blog post to visualise the data and discover new features that could potentially improve the model. 

 

Learn more about SAC

 

Learn more about this problem

  • Paper reference: P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties.
    In Decision Support Systems, Elsevier, 47(4):547-553, 2009.
  • Data: The UC Irvine Machine Learning Repository here.