Custom R Components – Correlation Matrix
Before starting an analysis it is important to understand the structure of the data that is to be analysed. A correlation matrix is a helpful tool to see the relationships between numerical variables. This article explains how to add such a correlation matrix to SAP Predictive Analysis by implementing a new Custom R Component.
Disclaimer
Please note that this component is not an official release by SAP and that it is provided as-is without any guarantee or support. Please test the component to ensure it works for your purposes.
Prerequisites
– R library corrgram must be installed.
How to Implement
The component can be downloaded as a .spar file from GitHub. Then deploy it as described here. You just need to import it through the option “Import/Model Component”, which you will find by clicking on the plus sign at the bottom of the list of available algorithms.
Usage
Load the dataset LondonOlympicsDecathlon.csv into SAP Predictive Analysis. The file contains the results of all Decathlon athletes that completed the competition at the London Olympics in 2012. We want to understand the relationships between the results of the different disciplines. For instance: is a good 100 meter runner also a good 400 meter runner?
Now add a Filter Component to your analysis and remove the fields that are not needed (ID, Name, Country, Overall). So now the dataset is reduced to the ten disciplines we want to investigate.
Add the new “Correlation Matrix” component to your analysis and select the method for calculating the correlation. Here I am choosing “Pearson”. Other options are “Kendall” and “Spearman”.
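To give a feel for how the methods differ, here is a minimal sketch in plain Python. It is not the component's R code, and the numbers are made up for illustration. It assumes the textbook definitions: Pearson measures how well two variables follow a straight line, while Spearman applies the same formula to the ranks of the values. Kendall is not shown. The helper names (pearson, ranks, spearman) are my own.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """Rank values from 1 to n (assumes no ties, to keep the sketch short)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson applied to the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Made-up values: y always grows with x, but not along a straight line.
x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]

print("Pearson: ", round(pearson(x, y), 3))   # below 1, the relation is curved
print("Spearman:", round(spearman(x, y), 3))  # the ordering agrees perfectly
```

If your data contains outliers or clearly non-linear relationships, a rank-based method such as Spearman may be the more robust choice.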
Run the analysis and switch to the “Charts” view to see the correlation matrix.
The matrix shows the correlations between all numerical variables of the dataset. Each correlation is displayed twice. The coloured bottom left segment shows the correlation graphically. The top right shows the same correlation as a number. Correlations range from -1 to 1. Values close to 1 indicate a strong positive correlation, values close to -1 indicate a strong negative correlation, and values around 0 indicate little or no correlation. The colour coding ranges from deep red for -1 through grey for 0 to deep blue for 1. Just find the intersection of the two variables you are interested in to read off their correlation.
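The structure of such a matrix can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the component's R implementation, and the three columns are made-up numbers rather than the decathlon data. The function name correlation_matrix is hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Correlate every column with every other column (and with itself)."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Made-up columns, chosen so the extremes of the scale are visible.
data = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],  # rises exactly in step with a
    "c": [8.0, 6.0, 4.0, 2.0],  # falls exactly as a rises
}

matrix = correlation_matrix(data)
for name, row in matrix.items():
    print(name, [round(row[other], 2) for other in matrix])
```

Because the correlation of a with b is the same as that of b with a, the matrix is symmetric. That is why the chart can use one triangle for colours and the other for numbers without losing information.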
We wanted to look at the 100 meters and 400 meters. You see in the upper part that the value for this combination is 0.77. This means there is a pretty strong positive correlation between the results of those two disciplines. Runners that take relatively few seconds to complete the 100 meters also take relatively few seconds to complete the 400 meters. So good 100 meter runners are also typically good 400 meter runners. And vice versa.
Now let’s look at something a bit more complex. It is easy to fall into a trap here, just as I have done myself. Thank you to Henrique Pinto for spotting this. How about the correlation between the results of the 100 meters race and the Long Jump?
The correlation between the results of the 100 meters and the Long Jump is clearly negative at -0.56. So you might think a good 100 meter runner is bad at the Long Jump. However, this is not the case. The catch is the definition of “good”. In the 100 meters you need to be fast to be good, so your time needs to be small. In the Long Jump it is the other way round: to be good you need to jump far, so your result needs to be large. The negative correlation of -0.56 means that athletes with small results in one discipline usually have large results in the other. With that in mind, a good 100 meter runner (small result) tends to be a good long jumper (large result).
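One way to avoid this trap is to flip the sign of every “smaller is better” variable before correlating, so that a larger number means a better result everywhere. The plain Python sketch below illustrates the effect. It is not part of the component, and the five athletes' results are invented for the example, not taken from the decathlon file.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Invented results for five hypothetical athletes.
sprint_seconds = [10.4, 10.6, 10.8, 11.0, 11.2]  # smaller is better
jump_metres = [7.9, 7.6, 7.7, 7.2, 7.1]          # larger is better

# Raw values: fast sprinters (small times) jump far (large distances),
# so the correlation comes out negative.
raw = pearson(sprint_seconds, jump_metres)

# Negate the times so that "larger is better" holds for both variables.
aligned = pearson([-t for t in sprint_seconds], jump_metres)

print("raw:    ", round(raw, 2))
print("aligned:", round(aligned, 2))
```

Negating one variable changes only the sign of the correlation, not its strength. After the flip, a positive value can be read directly as “good at one, good at the other”.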
Hey Andreas,
I was looking into this sample and I was bugged by one of the conclusions you reached (a good racer is a bad jumper), which seemed to contradict the real life results I've seen before. After looking deeper into the data, I came to a different conclusion and I wanted to share my considerations with you, to see if you agree with them.
You stated that the 100m and 400m competitions had a high correlation because the correlation factor between them was high (0.77). This seems accurate. Also, if you look at the .csv data, you'll notice that the measures for the '100m' and '400m' columns seem to represent the time the racers took to complete the competition. In other words, a racer who had a low time in the 100m race would most likely have a low time in the 400m race as well (i.e. would be well placed in both), while a racer with a high time in 100m would also have a high time in the 400m race (i.e. would be badly placed in both).
Now, you also compared the 100m race with the long jump competition. Since the correlation factor was low (-0.56), you said that a good 100m racer seemed to be a bad long jumper, and vice-versa. However, looking at the data, while the races (100m and 400m) measures seem to represent the racing time, i.e. the lower the better, the jumps measures seem to represent the distance (or height) the athlete was able to achieve, i.e. the higher the better. That being said, the negative correlation factor would mean that a 100m racer with a low race time would most likely have a high jump distance, i.e. it does mean that a good 100m racer is also a good long jumper, which would be the logical conclusion.
Do you agree with this interpretation?
Best regards,
Henrique.
Hello Henrique,
Thank you for the feedback, you are absolutely right!
The decathlon is special in that success is measured very differently across the disciplines. Sometimes it is good if an athlete's result is a small number, e.g. the number of seconds for a running event. Sometimes it is good when the result is a large number, e.g. the distance in a long jump.
When interpreting the results I should have taken this into account. I will update the post shortly, probably with a different combination of disciplines that is easier to understand.
Thanks again!
Greetings
Andreas
Great!
Thanks for the feedback. 😀
Best regards,
Henrique.
BTW, if anyone is trying to replicate this, I had to install other packages as pre-req for the 'corrgram' package:
- colorspace
- gclus
- TSP
- seriation
- corrgram
Thanks Henrique. I was struggling to find the dependencies.
For everyone else: if you select the option to install dependencies automatically, it will save you some research time!
Regards,
Surya
BTW, RStudio does that for you.
Just name the package you want to install and it will resolve and install all dependencies (and dependencies of dependencies) automatically.
Hi Henrique, I'm trying to install the packages but I'm having no luck. Where did you get them?
Thanks in advance,
Magy