Custom R Components – Correlation Matrix
Before starting an analysis it is important to understand the structure of the data that is to be analysed. A correlation matrix is a helpful tool to see the relationships between numerical variables. This article explains how to add such a correlation matrix to SAP Predictive Analysis by implementing a new Custom R Component.
Please note that this component is not an official release by SAP and that it is provided as-is without any guarantee or support. Please test the component to ensure it works for your purposes.
Prerequisite: the R library corrgram must be installed.
How to Implement
The component can be downloaded as a .spar file from GitHub. Then deploy it as described here. You just need to import it through the option “Import/Model Component”, which you will find by clicking the plus sign at the bottom of the list of available algorithms.
Load the dataset LondonOlympicsDecathlon.csv into SAP Predictive Analysis. The file contains the results of all decathlon athletes who completed the competition at the London Olympics in 2012. We want to understand the relationships between the results of the different disciplines. For instance: is a good 100 meter runner also a good 400 meter runner?
Now add a Filter Component to your analysis and remove the fields that are not needed (ID, Name, Country, Overall). The dataset is now reduced to the ten disciplines we want to investigate.
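If you want to see what this filter step amounts to outside SAP Predictive Analysis, here is a minimal Python sketch. It is not part of the component. The two sample rows, the athlete names and the discipline headers (100m, LongJump) are made up for illustration; only the four dropped field names come from the dataset description above.

```python
import csv
import io

# Hypothetical two-row sample. The discipline headers and all values are
# invented for illustration; they are not taken from the real CSV file.
SAMPLE = """ID,Name,Country,Overall,100m,LongJump
1,Athlete A,AAA,8500,10.4,8.0
2,Athlete B,BBB,8300,10.6,7.6
"""

# Fields that are removed before the correlations are calculated.
DROP = {"ID", "Name", "Country", "Overall"}


def numeric_columns(csv_text, drop=DROP):
    """Read CSV text and return {column name: list of floats} without the dropped fields."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = {name: [] for name in reader.fieldnames if name not in drop}
    for row in reader:
        for name in columns:
            columns[name].append(float(row[name]))
    return columns


columns = numeric_columns(SAMPLE)
print(list(columns))
```

What remains is one list of numbers per discipline, which is the shape a correlation calculation needs.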
Add the new “Correlation Matrix” component to your analysis and select the method for calculating the correlation. Here I am choosing “Pearson”. Other options are “Kendall” and “Spearman”.
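To get a feeling for how the three methods differ, here is a small Python sketch that calculates each of them by hand. It is an illustration only and not the code of the component. The five result pairs are made up, and the Kendall function is the simple tau-a variant, which assumes there are no tied values; tie-corrected variants such as tau-b give different numbers when ties are present.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def kendall(x, y):
    """Kendall's tau-a: (concordant - discordant pairs) / all pairs. Assumes no ties."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        direction = (x[i] - x[j]) * (y[i] - y[j])
        if direction > 0:
            concordant += 1
        elif direction < 0:
            discordant += 1
    pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / pairs


# Made-up times in seconds, not actual Olympic results.
run_100m = [10.4, 10.6, 10.8, 11.0, 11.3]
run_400m = [46.5, 48.0, 47.9, 49.5, 49.1]

for name, method in [("Pearson", pearson), ("Spearman", spearman), ("Kendall", kendall)]:
    print(name, round(method(run_100m, run_400m), 2))
```

Pearson works on the values themselves, while Spearman and Kendall only look at the ordering of the athletes, so they are less sensitive to a single extreme result.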
Run the analysis and switch to the “Charts” view to see the correlation matrix.
The matrix shows the correlations between all numerical variables of the dataset. Each correlation is displayed twice. The coloured bottom left segment shows the correlation graphically. The top right shows the same correlation as a value. Correlations range between -1 and 1. Values close to 1 indicate a strong positive correlation, values close to -1 indicate a strong negative correlation, and values around 0 indicate little or no linear relationship. The colour coding ranges from deep red for -1 through grey for 0 to deep blue for 1. Just find the intersection of the two variables you are interested in and you know their correlation.
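The reason each correlation appears twice is that the matrix is symmetric: the correlation of A with B is the same as that of B with A, and every variable correlates perfectly with itself. A minimal Python sketch of this, using made-up results for three disciplines (the column names and numbers are hypothetical, and this is not the code of the component):

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def correlation_matrix(columns):
    """Return a nested dict: matrix[a][b] is the correlation of column a with column b."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Made-up values for illustration, not actual Olympic results.
results = {
    "100m": [10.4, 10.6, 10.8, 11.0, 11.3],      # seconds
    "400m": [46.5, 48.0, 47.9, 49.5, 49.1],      # seconds
    "LongJump": [8.0, 7.6, 7.7, 7.2, 7.1],       # meters
}

matrix = correlation_matrix(results)
names = list(results)

# Print the matrix as a plain text table.
print(" " * 10 + "".join(f"{name:>10}" for name in names))
for a in names:
    print(f"{a:>10}" + "".join(f"{matrix[a][b]:>10.2f}" for b in names))
```

Looking up a pair is then just matrix["100m"]["400m"], which is the same number as matrix["400m"]["100m"].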
We wanted to look at the 100 meters and 400 meters. You can see in the upper part that the value for this combination is 0.77. This means there is a fairly strong positive correlation between the results of those two disciplines. Runners who take relatively few seconds to complete the 100 meters also tend to take relatively few seconds to complete the 400 meters. So good 100 meter runners are typically also good 400 meter runners, and vice versa.
Now let’s look at something a bit more complex. It is easy to fall into a trap here, just as I have done myself. Thank you to Henrique Pinto for spotting this. How about the correlation between the results of the 100 meters race and the Long Jump?
The correlation between the results of the 100 meters and the Long Jump is clearly negative at -0.56. So you might think a good 100 meter runner is bad at the Long Jump. However, this is not the case. The catch is the definition of “good”. At the 100 meters you need to be fast to be good, so your time needs to be small. At the Long Jump it is the other way round: to be good you need to jump far, so your result needs to be large. The negative correlation of -0.56 means that athletes with small results in one discipline usually have large results in the other. With that in mind, a good 100 meter runner (small result) tends to be a good long jumper (large result).
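One way to avoid this trap is to put both disciplines on a “larger is better” scale before reading the sign. The following Python sketch, with made-up numbers rather than the actual results, shows that negating the running times flips the sign of the Pearson correlation and leaves its strength unchanged. It is an illustration only and not the code of the component.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Made-up values for illustration, not actual Olympic results.
run_100m = [10.4, 10.6, 10.8, 11.0, 11.3]   # seconds, smaller is better
long_jump = [8.0, 7.6, 7.7, 7.2, 7.1]       # meters, larger is better

raw = pearson(run_100m, long_jump)

# Negate the times so that larger means better for both disciplines.
run_100m_larger_is_better = [-t for t in run_100m]
aligned = pearson(run_100m_larger_is_better, long_jump)

print("raw results:          ", round(raw, 2))
print("both larger-is-better:", round(aligned, 2))
```

The raw correlation is negative and the aligned one is positive with the same magnitude, which matches the reading above: fast runners tend to jump far.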