Hello,
I would like to share with you my contribution to SAP’s 2013 Ultimate Data Geek Challenge.
The Tour de France is the world’s toughest multi-day cycle race, spanning three full weeks. During this time 21 stages are ridden, totaling about 3500 kilometers of cycling.
One of the challenges for spectators is trying to predict, during the course of the race, which rider is going to win the next stage. I thought it might be fun to try and use SAP Predictive Analysis to predict which rider has the best chance of winning a certain stage.
I am aware that this sport might not be as popular around the world as it is in some European countries, but nevertheless I’m sure you will be able to take away some insights here.
There are a couple of topics regarding the Tour de France which might need some additional introduction. First is the stage classification: it is important to keep in mind that all stages have been classified by the organization as one of the following:
Code | Stage type |
---|---|
F | Flat |
H | Hilly |
M | Mountain |
TTT | Team Time Trial |
ITT | Individual Time Trial |
P | Prologue |
This stage classification is performed by the routing committee and is known upfront for each stage.
At the start of the first stage the number of riders amounts to 198. This will decrease during the course of the race because of riders abandoning, for instance due to injuries or illness.
I have used the following strategy for the prediction of stage wins:
I have prepared the following files for this example:
Wikipedia has loads of information on historical sport events so it wasn’t difficult to get a list of historical stage wins. I have selected 7 years of stage wins but this list can of course be easily extended.
An example of the data is shown below:
Column F contains the ride type from column E recoded as a numeric value. This is necessary because the classification algorithms we are going to use later on expect numeric input. The recode was done as follows:
Also note that in the case of a Team Time Trial (TTT) stage there is no individual winner, as these stage wins are awarded to a whole team. I have removed the TTT stages from the input because the empty values would be problematic when inducing the association rules later on.
I have stored the data from this step in file stages_sdn.csv.
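For readers who prefer code over spreadsheets, the recoding and the TTT filter could be sketched in Python roughly as follows. This is a minimal sketch, not the exact preparation I did: the column names (`RideType`, `Winner`), the sample rows and the numeric codes are all assumptions made for illustration.

```python
import csv
import io

# Hypothetical numeric codes; the recode used for stages_sdn.csv may differ.
RIDE_TYPE_CODES = {"F": 1, "H": 2, "M": 3, "ITT": 4, "P": 5}

# Made-up sample rows standing in for the raw stage list.
RAW = """Year,Stage,Winner,RideType
2012,1,Rider A,F
2012,2,,TTT
2012,3,Rider B,M
2012,4,Rider C,ITT
"""

def prepare_stages(text):
    """Drop TTT stages (no individual winner) and add a numeric ride type."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        if row["RideType"] == "TTT":
            continue  # team wins have no individual winner to associate
        row["RideTypeCode"] = RIDE_TYPE_CODES[row["RideType"]]
        rows.append(row)
    return rows

stages = prepare_stages(RAW)
for s in stages:
    print(s["Stage"], s["Winner"], s["RideType"], s["RideTypeCode"])
```

Filtering before recoding means the code table does not need an entry for TTT at all.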
For this step I have created athletes_sdn.csv which contains the athletes and their properties. An example of this file is shown below:
The file contains all stage winners from the last 7 years and categorizes these athletes along the following set of attributes: Sprinter, Time Trial specialist, Hill specialist, Single Day specialist, Hill Sprint specialist, Multi Day specialist, Classics specialist, Climber, Attacker, Assist.
A given athlete may have one or more of these attributes. As with the stage wins, I simply took all these values from Wikipedia.
We will now use SAP Predictive Analysis to come up with a rider classification. Simply load athletes_sdn.csv from step 2 and open the prediction view. In this example I have used the k-means clustering algorithm using k=5 clusters which after some experimenting appeared to give the best results. I want the results of the clustering algorithm to be written to a CSV file called athletes_classes.csv.
The modeling view now simply looks as follows:
Settings for the K-means clustering step:
Running the model triggers the clustering algorithm and gives the following results:
You can see cluster 2 is fairly large (35 athletes) while the others are more evenly distributed. Setting the number of clusters to a higher or lower number doesn’t improve this so I will go forward with these results.
Looking at the cluster contents in detail reveals the following:
As you can see all attackers ended up in cluster 1, time trial specialists ended up in cluster 3, sprinters in cluster 4, etc.
The athletes_classes.csv file now contains the data from the loaded file with an additional classification column.
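To give an idea of what the k-means step does with the athlete attributes, here is a small self-contained Python sketch. It is not what SAP Predictive Analysis runs internally: the riders and their 0/1 attributes are made up, only three attributes and k=3 are used instead of the full attribute list and k=5, and the deterministic farthest-point initialisation is my own simplification to keep the example reproducible.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def init_centroids(points, k):
    """Deterministic farthest-point start (a simplification for this sketch)."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(farthest))
    return centroids

def kmeans(points, k, max_iter=100):
    """Plain Lloyd's algorithm; returns one cluster index per point."""
    centroids = init_centroids(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Made-up riders with attributes (Sprinter, Time Trial specialist, Climber).
riders = ["Rider A", "Rider B", "Rider C", "Rider D", "Rider E", "Rider F"]
attributes = [(1, 0, 0), (1, 0, 0), (0, 1, 0), (0, 1, 1), (0, 0, 1), (0, 0, 1)]

labels = kmeans(attributes, k=3)
for name, label in zip(riders, labels):
    print(name, "-> cluster", label)
```

The cluster indices themselves carry no meaning; what matters is which riders end up together, which is why the cluster contents need to be inspected afterwards.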
Although the stages have already been classified by the organization, I would like to narrow the number of classes down a bit, as in the previous step, and also use the Distance attribute.
I have loaded stages_sdn.csv and on the Predict tab set up the following process:
The Normalization step normalizes the Distance attribute to values between 0 and 1, which is preferred to keep the K-Means algorithm happy:
The K-Means clustering step was set up as follows:
The results of the clustering are to be written to a CSV file called stages_classes.csv.
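The reason for normalizing Distance is that k-means works with Euclidean distances: a column measured in kilometers would otherwise dominate a column holding small numeric codes. A min-max rescaling like the one used in the Normalization step can be sketched in Python as below; the distances are made-up values and the function name is hypothetical.

```python
def min_max(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # all values equal: nothing to spread out
    return [(v - lo) / (hi - lo) for v in values]

# Made-up stage distances in kilometers (a prologue, a time trial, two road stages).
distances_km = [6.5, 25.0, 150.0, 200.0]
scaled = min_max(distances_km)
for km, s in zip(distances_km, scaled):
    print(km, "->", round(s, 3))
```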
Running the process now gives the following results:
To check the relationship between Ride Type and cluster I have used the Lumira data grid visualisation type. This shows that the most-used ride types have each gotten their own cluster, whereas the Individual Time Trial (ITT) and Prologue (P) ride types ended up together in one cluster. Not surprisingly, these two ride types are usually much shorter than the others.
In this step we are going to use the two created CSV files to calculate the association between the rider classes and stage classes. I will load both athletes_classes.csv and stages_classes.csv into SAP Predictive Analysis.
Before merging the two files we need to distinguish the column that contains the athlete class from the stage class, so I will rename the ClusterNumber column from athletes_classes.csv to AthleteCluster and the ClusterNumber from stages_classes.csv to StageCluster.
Now the Merge function may be used to merge the athlete data into the stage data. Make sure to map the Athlete column from athletes_classes.csv to the Winner column of stages_classes.csv:
This will give us a data set containing stage and rider data side by side, including both cluster columns:
Before inputting these columns to the apriori algorithm I will append “A” to the athlete class and “S” to the stage class. I have done this using the formula builder on the Data tab:
This will give the following columns:
I have renamed the resulting columns to StageClusText and AthleteClusText respectively. This was done to make the algorithm’s results easier to interpret.
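The rename, merge and labeling steps above can also be expressed in a few lines of Python. This is only a sketch with made-up rows: the column names follow the ones used in this post, but dropping winners that have no athlete record is my assumption about how the merge behaves. The letter is put in front of the cluster number, matching labels such as S4 and A2.

```python
# Made-up rows standing in for athletes_classes.csv and stages_classes.csv.
athletes = [
    {"Athlete": "Rider A", "ClusterNumber": 4},
    {"Athlete": "Rider B", "ClusterNumber": 2},
]
stages = [
    {"Stage": 1, "Winner": "Rider A", "ClusterNumber": 5},
    {"Stage": 3, "Winner": "Rider B", "ClusterNumber": 4},
    {"Stage": 5, "Winner": "Rider A", "ClusterNumber": 5},
]

def merge_and_label(stages, athletes):
    """Join athletes onto stages via Winner and add the S/A text labels."""
    athlete_cluster = {a["Athlete"]: a["ClusterNumber"] for a in athletes}
    merged = []
    for s in stages:
        if s["Winner"] not in athlete_cluster:
            continue  # assumption: winners without athlete data are dropped
        a = athlete_cluster[s["Winner"]]
        merged.append({
            "Stage": s["Stage"],
            "Winner": s["Winner"],
            "StageCluster": s["ClusterNumber"],
            "AthleteCluster": a,
            # Text labels keep stage cluster 4 and athlete cluster 4 apart.
            "StageClusText": f"S{s['ClusterNumber']}",
            "AthleteClusText": f"A{a}",
        })
    return merged

merged = merge_and_label(stages, athletes)
for row in merged:
    print(row["Stage"], row["StageClusText"], row["AthleteClusText"])
```

Without the letter prefixes, an association algorithm would see the same item "4" on both sides and could not tell a stage cluster from an athlete cluster.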
These columns will now serve as an input for the apriori algorithm. The prediction view shows the following model:
The most complex and time-consuming aspect of using the apriori algorithm is setting it up with correct parameters. It might take a while before you have optimized the parameters for the data set at hand. For this example I have used the following settings:
Running the algorithm will give a list of association rules between the rider and stage clusters, as shown below:
So what does this tell us? We have a larger probability of Stage Type S4 being won by an athlete from cluster A2. We’ve also found stage type S5 being won by athlete type A4. To see what this means we need to check the cluster definitions from steps 3 and 4. These rules now translate as follows:
Probably no big surprise to anyone having spent any time watching a cycle race, but at least we inferred these rules from the data!
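To make the two key measures behind such rules concrete, the Python sketch below computes support and confidence for simple "stage cluster implies athlete cluster" rules. It is not the full apriori algorithm, only the one-to-one case that matters here. The observations are made up and the thresholds are hypothetical, not the settings I used in SAP Predictive Analysis.

```python
from collections import Counter

def stage_to_athlete_rules(pairs, min_support=0.2, min_confidence=0.5):
    """Return (stage, athlete, support, confidence) for rules passing both thresholds."""
    n = len(pairs)
    pair_counts = Counter(pairs)
    stage_counts = Counter(stage for stage, _ in pairs)
    rules = []
    for (stage, athlete), count in pair_counts.items():
        support = count / n                        # share of all stages with this pair
        confidence = count / stage_counts[stage]   # share of this stage cluster won by this athlete cluster
        if support >= min_support and confidence >= min_confidence:
            rules.append((stage, athlete, support, confidence))
    return sorted(rules, key=lambda r: r[3], reverse=True)

# Made-up (StageClusText, AthleteClusText) observations, one per stage win.
pairs = [
    ("S5", "A4"), ("S5", "A4"), ("S5", "A4"), ("S5", "A1"),
    ("S4", "A2"), ("S4", "A2"), ("S4", "A3"),
    ("S1", "A3"),
]

rules = stage_to_athlete_rules(pairs)
for stage, athlete, support, confidence in rules:
    print(f"{stage} -> {athlete}  support={support:.3f}  confidence={confidence:.3f}")
```

The pair S1/A3 has a confidence of 1.0 but is based on a single stage, so the support threshold filters it out. This is the kind of trade-off that makes tuning the parameters time-consuming.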
At this point we have a set of association rules indicating the association between a certain rider type and stage type. To see how this could be useful for predicting future stage wins, we simply have to perform the following steps:
For example, imagine we are in the middle of the 2013 Tour de France and would like to predict who is going to win stage 21 (Versailles – Paris). As this was classified as a flat stage we know this ended up in stage cluster S5. Our collection of association rules states this stage cluster is most likely to be won by a rider from cluster A4.
As this stage has been ridden by now, we know it was won by sprinter Marcel Kittel, who was indeed a rider from cluster A4!
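The lookup used in this example can be sketched in Python as follows. Apart from Marcel Kittel belonging to cluster A4, everything here is assumed for illustration: the other rider names, the confidence values and the function name are hypothetical.

```python
def predict_candidates(stage_cluster, rules, riders):
    """Pick the most confident rule for a stage cluster and list matching riders.

    rules:  list of (stage_cluster, athlete_cluster, confidence)
    riders: dict mapping rider name to athlete cluster
    """
    matching = [r for r in rules if r[0] == stage_cluster]
    if not matching:
        return None, []  # no rule covers this stage cluster
    best = max(matching, key=lambda r: r[2])
    candidates = sorted(name for name, cluster in riders.items() if cluster == best[1])
    return best[1], candidates

# Hypothetical rules; the confidence values are made up.
rules = [("S5", "A4", 0.75), ("S4", "A2", 0.67)]

# Riders still in the race with their athlete cluster (placeholders except Kittel).
riders = {"Marcel Kittel": "A4", "Rider B": "A2", "Rider C": "A4"}

cluster, candidates = predict_candidates("S5", rules, riders)
print("Most likely athlete cluster:", cluster)
print("Candidates:", candidates)
```

Note that the rules only narrow the field down to a rider cluster; they do not rank the riders within that cluster.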
While this example gives you a nice impression of the built-in capabilities of SAP Predictive Analysis and the strategy of setting up a predictive model, it is still fairly straightforward. Of course, in the real world there are many more variables involved in these kinds of predictions.
Nevertheless, the main takeaway I would like you to get from this is the approach of first clustering the stages and athletes and then performing an association analysis between them. This approach can also be applied to other domains like retail shopping basket analysis.
It shouldn’t be hard to make this example just a little more complex and maybe a bit more ready for real-world use. Some suggestions to beef up the complexity of this exercise may be (in random order):
I will leave these as an exercise for the interested reader. Some of the above suggestions may already be performed using the data set I have shared while others require small extensions of the data.
Dirk Kemper is a Business Intelligence consultant at rond consulting and specializes in Predictive Analysis, Enterprise Performance Management and financial reporting.