Predicting Tour de France stage wins using SAP Predictive Analysis
I would like to share with you my contribution to SAP’s 2013 Ultimate Data Geek Challenge.
The Tour de France is the world’s toughest multi-day cycle race, spanning three full weeks. During this time 21 stages are ridden, totaling about 3500 kilometers of cycling.
One of the challenges for spectators is trying to predict, during the course of the race, which rider is going to win the next stage. I thought it might be fun to try and use SAP Predictive Analysis to predict which rider has the biggest chance of winning a certain stage.
I am aware that this sport might not be as popular around the world as it is in some European countries, but nevertheless I’m sure you will be able to take away some insights here.
There are a couple of topics regarding the Tour de France which might need some additional introduction. First is the stage classification: it is important to keep in mind that all stages have been classified by the organization as one of the following:
Team Time Trial
Individual Time Trial
This stage classification is performed by the routing committee and is known upfront for each stage.
At the start of the first stage the number of riders amounts to 198. This will decrease during the course of the race because of riders abandoning, for instance due to injuries or illness.
I have used the following strategy for the prediction of stage wins:
- Get a list of historical stage wins for the last 7 years of Tour de France stages
- Get the rider properties for these winners (e.g. is it a climber or sprinter?)
- Classify the riders according to these collected properties
- Take the list of historical stages from step 1, combine this with the unridden stages of the current Tour de France and classify all stages according to their properties
- Perform an apriori analysis to come up with rules describing the association between rider and stage classes
- Use the association rules to predict future stage wins
I have prepared the following files for this example:
- stages_sdn.csv: Contains the list of stages and their respective winners
- athletes_sdn.csv: Contains the list of riders who won one or more stages and their respective properties
Step 1 – Get historical stage wins
Wikipedia has loads of information on historical sport events so it wasn’t difficult to get a list of historical stage wins. I have selected 7 years of stage wins but this list can of course be easily extended.
An example of the data is shown below:
Column F contains the ride type from column E recoded as a numeric value. This is necessary to keep the clustering algorithms we are going to use later on happy. The recode was done as follows:
Also note that in case of a Team Time Trial (TTT) ride there will be no individual winner, as these stage wins are awarded to a whole team. I have removed the TTT stages from the input because the empty values would be problematic when inducing the association rules later on.
I have stored the data from this step in file stages_sdn.csv.
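For readers who prefer code over screenshots, the recode and the TTT filter from this step can be sketched in plain Python. This is only an illustration under assumptions: the column names, the sample rows and the numeric codes in `RIDE_TYPE_CODES` are made up for the example and are not the values used in stages_sdn.csv.

```python
import csv
import io

# Hypothetical numeric codes; the mapping actually used in the article
# is not reproduced here.
RIDE_TYPE_CODES = {"Flat": 1, "Mountain": 2, "ITT": 3, "P": 4}

# Made-up sample rows standing in for the data collected from Wikipedia.
RAW = """Year,Stage,Distance,RideType,Winner
2099,1,190,Flat,Rider A
2099,2,25,TTT,
2099,3,33,ITT,Rider B
"""


def prepare_stages(text):
    """Drop Team Time Trial rows and add a numeric ride type column."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        if row["RideType"] == "TTT":
            continue  # no individual winner, so nothing to associate later
        row["RideTypeCode"] = RIDE_TYPE_CODES[row["RideType"]]
        rows.append(row)
    return rows


stages = prepare_stages(RAW)
for stage in stages:
    print(stage["Stage"], stage["RideType"], stage["RideTypeCode"], stage["Winner"])
```

Filtering before recoding means the TTT ride type never needs a numeric code of its own.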
Step 2 – Get rider properties of stage winners
For this step I have created athletes_sdn.csv which contains the athletes and their properties. An example of this file is shown below:
The file contains all stage winners from the last 7 years and categorizes these athletes along the following set of attributes: Sprinter, Time Trial specialist, Hill specialist, Single Day specialist, Hill Sprint specialist, Multi Day specialist, Classics specialist, Climber, Attacker, Assist.
A given athlete may have one or more of these attributes. As with the stage wins, I simply took all these values from Wikipedia.
Step 3 – Classifying the riders
We will now use SAP Predictive Analysis to come up with a rider classification. Simply load athletes_sdn.csv from step 2 and open the prediction view. In this example I have used the k-means clustering algorithm using k=5 clusters which after some experimenting appeared to give the best results. I want the results of the clustering algorithm to be written to a CSV file called athletes_classes.csv.
The modeling view now simply looks as follows:
Settings for the K-means clustering step:
Running the model will trigger the clustering algorithm and give the following results:
You can see cluster 2 is fairly large (35 athletes) while the others are more evenly distributed. Setting the number of clusters to a higher or lower number doesn’t improve this so I will go forward with these results.
Looking at the cluster contents in detail reveals the following:
As you can see all attackers ended up in cluster 1, time trial specialists ended up in cluster 3, sprinters in cluster 4, etc.
The athletes_classes.csv file now contains the data from the loaded file with an additional classification column.
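To give an idea of what the clustering step does with the rider attributes, here is a minimal k-means sketch in plain Python. It is not the implementation used by SAP Predictive Analysis. The riders are made up, only four of the ten attributes are used, k is set to 3 instead of 5 to fit the tiny sample, and the centroids are initialised deterministically from the first k distinct riders (an assumption made for reproducibility; real implementations usually pick random or k-means++ starting points).

```python
# Binary rider attributes (a subset of the ten used in the article).
ATTRIBUTES = ["Sprinter", "Time Trial specialist", "Climber", "Attacker"]

# Made-up riders; 1 means the rider has the attribute.
riders = {
    "Rider A": [1, 0, 0, 0],
    "Rider B": [1, 0, 0, 0],
    "Rider C": [0, 1, 0, 0],
    "Rider D": [0, 1, 0, 0],
    "Rider E": [0, 0, 1, 1],
    "Rider F": [0, 0, 1, 1],
}


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=100):
    """Plain Lloyd's algorithm; returns one 0-based cluster label per point."""
    # Deterministic initialisation: the first k distinct points.
    centroids = []
    for p in points:
        if list(p) not in centroids:
            centroids.append(list(p))
        if len(centroids) == k:
            break
    if len(centroids) < k:
        raise ValueError("need at least k distinct points")

    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


names = list(riders)
labels = kmeans([riders[n] for n in names], k=3)
for name, label in zip(names, labels):
    print(name, "-> cluster", label)
```

Riders with identical attribute vectors always end up in the same cluster, which is the behaviour seen above where all sprinters or all time trial specialists share a cluster. Note that the labels here start at 0, whereas the cluster numbers in the article start at 1.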
Step 4 – Classifying the stages
Although the stages have already been classified by the organization, I would like to narrow the number of classes down a bit, as in the previous step, and also use the Distance attribute.
I have loaded stages_sdn.csv and on the Predict tab set up the following process:
The Normalization step normalizes the Distance attribute to values between 0 and 1, which is preferred to keep the K-Means algorithm happy:
The K-Means clustering step was set up as follows:
The results of the clustering are to be written to a CSV file called stages_classes.csv.
Running the process now gives the following results:
To check the relationship between Ride Type and cluster I have used the Lumira data grid visualisation type. This shows that the most-used ride types have each gotten their own cluster, whereas the Individual Time Trial (ITT) and Prologue (P) ride types ended up together in a single cluster. Not surprisingly, these two ride types are usually way shorter than the others.
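The normalization used in this step can be sketched as a plain min-max rescaling. I am assuming this is what the Normalization component computes; the distances below are made up for illustration.

```python
def min_max_normalize(values):
    """Rescale values linearly so the smallest becomes 0 and the largest 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Made-up stage distances in kilometers (a prologue, a time trial, two road stages).
distances_km = [8.0, 33.0, 150.0, 225.0]
normalized = min_max_normalize(distances_km)
print(normalized)
```

Without this rescaling, a distance of a few hundred kilometers would dominate the small numeric ride type code in the distance calculation of k-means, so the clusters would be driven almost entirely by stage length.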
Step 5 – Calculate the association between the classes
In this step we are going to use the two created CSV files to calculate the association between the rider classes and stage classes. I will load both athletes_classes.csv and stages_classes.csv into SAP Predictive Analysis.
Before merging the two files we need to distinguish the column that contains the athlete class from the stage class, so I will rename the ClusterNumber column from athletes_classes.csv to AthleteCluster and the ClusterNumber from stages_classes.csv to StageCluster.
Now the Merge function may be used to merge the athlete data into the stage data. Make sure to map the Athlete column from athletes_classes.csv to the Winner column of stages_classes.csv:
This will give us a data set containing stage and rider data side by side, including both cluster columns:
Before feeding these columns to the apriori algorithm I will prefix the athlete class with “A” and the stage class with “S”. I have done this using the formula builder on the Data tab:
This will give the following columns:
I have renamed the resulting columns to StageClusText and AthleteClusText respectively. This was done to make the algorithm’s results easier to interpret.
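The merge and the labelling can be sketched outside the tool as well. The snippet below is a rough Python equivalent under assumptions: the records are made up, and the merge is treated as an inner join on Winner = Athlete, which may differ from how the Merge function handles unmatched rows.

```python
# Made-up contents standing in for athletes_classes.csv and stages_classes.csv.
athletes = [
    {"Athlete": "Rider A", "ClusterNumber": 4},
    {"Athlete": "Rider B", "ClusterNumber": 2},
]
stages = [
    {"Stage": 1, "Winner": "Rider A", "ClusterNumber": 5},
    {"Stage": 2, "Winner": "Rider B", "ClusterNumber": 4},
    {"Stage": 3, "Winner": "Rider A", "ClusterNumber": 5},
]


def merge_and_label(stages, athletes):
    """Attach the winner's cluster to each stage and build the text labels."""
    athlete_cluster = {a["Athlete"]: a["ClusterNumber"] for a in athletes}
    merged = []
    for s in stages:
        winner = s["Winner"]
        if winner not in athlete_cluster:
            continue  # assumed inner join: skip stages without a known winner
        merged.append({
            "Stage": s["Stage"],
            "Winner": winner,
            # The prefixes keep the two cluster numberings apart.
            "StageClusText": f"S{s['ClusterNumber']}",
            "AthleteClusText": f"A{athlete_cluster[winner]}",
        })
    return merged


merged = merge_and_label(stages, athletes)
for row in merged:
    print(row["Stage"], row["StageClusText"], row["AthleteClusText"])
```

The prefixes matter because both clusterings number their clusters from the same range, so without them a "4" could mean either a stage cluster or an athlete cluster.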
These columns will now serve as an input for the apriori algorithm. The prediction view shows the following model:
The most complex and time-consuming aspect of using the apriori algorithm is setting it up with correct parameters. It might take a while before you have optimized the parameters for the data set at hand. For this example I have used the following settings:
Running the algorithm will give a list of association rules between the rider and stage clusters as shown below:
So what does this tell us? We have a larger probability of Stage Type S4 being won by an athlete from cluster A2. We’ve also found stage type S5 being won by athlete type A4. To see what this means we need to check the cluster definitions from steps 3 and 4. These rules now translate as follows:
- S5 [Flat stage] => A4 [Sprinter]
- S4 [Mountain stage] => A2 [Multi day / Climber]
Probably no big surprise to anyone having spent any time watching a cycle race, but at least we inferred these rules from the data!
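To show where such rules come from, here is a small sketch of the two measures an apriori run is typically tuned with: support and confidence. Because every record here holds just one stage cluster and one athlete cluster, rule mining reduces to counting pairs. This is not SAP's implementation. The records and the thresholds are made up, with counts chosen only so that the output echoes the two rules above.

```python
from collections import Counter


def pair_rules(transactions, min_support=0.15, min_confidence=0.6):
    """Return rules (stage, athlete, support, confidence) above both thresholds."""
    n = len(transactions)
    pair_counts = Counter(transactions)
    stage_counts = Counter(stage for stage, _ in transactions)
    rules = []
    for (stage, athlete), count in pair_counts.items():
        support = count / n                       # share of all stages
        confidence = count / stage_counts[stage]  # share within this stage cluster
        if support >= min_support and confidence >= min_confidence:
            rules.append((stage, athlete, support, confidence))
    # Strongest rules first.
    return sorted(rules, key=lambda r: (-r[3], r[0]))


# Made-up (StageClusText, AthleteClusText) pairs, one per historical stage.
transactions = [
    ("S5", "A4"), ("S5", "A4"), ("S5", "A4"), ("S5", "A1"),
    ("S4", "A2"), ("S4", "A2"), ("S4", "A3"),
    ("S1", "A3"), ("S1", "A1"), ("S1", "A2"),
]

rules = pair_rules(transactions)
for stage, athlete, support, confidence in rules:
    print(f"{stage} => {athlete} (support {support:.2f}, confidence {confidence:.2f})")
```

Raising the thresholds yields fewer but stronger rules, while lowering them lets in pairs that occurred only once or twice, which is why finding good settings takes some experimenting.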
Step 6 – Use association rules to predict future stage wins
At this point we have a set of association rules indicating the association between a certain rider type and stage type. To see how this could be useful for predicting future stage wins, we simply have to perform the following steps:
- Classify all remaining riders taking part in the Tour de France for which we would like to predict the stage win
- Use the association rules to look up the association between the stage cluster and athlete cluster
For example, imagine we are in the middle of the 2013 Tour de France and would like to predict who is going to win stage 21 (Versailles – Paris). As this was classified as a flat stage we know this ended up in stage cluster S5. Our collection of association rules states this stage cluster is most likely to be won by a rider from cluster A4.
As this stage has been ridden by now, we know it was won by sprinter Marcel Kittel, who was indeed a rider from cluster A4!
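The lookup described in this step can be sketched in a few lines of Python. The two rules are the ones found in step 5; the rider names and their cluster assignments are placeholders invented for the example.

```python
# Association rules from step 5: stage cluster -> most likely athlete cluster.
rules = {"S5": "A4", "S4": "A2"}

# Placeholder riders still in the race, with hypothetical cluster labels.
remaining_riders = {
    "Rider A": "A4",
    "Rider B": "A2",
    "Rider C": "A4",
    "Rider D": "A1",
}


def candidates_for_stage(stage_cluster, rules, riders):
    """List the riders whose cluster matches the rule for this stage cluster."""
    target = rules.get(stage_cluster)
    if target is None:
        return []  # no rule was found for this stage cluster
    return sorted(name for name, cluster in riders.items() if cluster == target)


print(candidates_for_stage("S5", rules, remaining_riders))
```

Note that the rules narrow the field down to a group of riders rather than a single name: for a flat stage every remaining rider in cluster A4 is a candidate.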
Getting a bit more serious
While this example gives you a nice impression of the built-in capabilities of SAP Predictive Analysis and the strategy of setting up a predictive model, it is still fairly straightforward. Of course, in the real world there are many more variables involved in these kinds of predictions.
Nevertheless, the main takeaway I would like you to get from this is the approach of first clustering the stages and athletes and then performing an association analysis between them. This approach can also be applied to other domains like retail shopping basket analysis.
It shouldn’t be hard to make this example just a little more complex and maybe a bit more ready for real-world use. Some suggestions to beef up the complexity of this exercise (in no particular order):
- Extending the list of historical stage wins to improve the associations
- Enhancing the rider classification by taking into account more properties like weight, age and height, and more advanced properties like BMI or VAM
- Enhancing the stage classification by taking into account the stage length and, for instance, the number of mountain jersey points which may be collected during the stage
I will leave these as an exercise for the interested reader. Some of the above suggestions may already be performed using the data set I have shared while others require small extensions of the data.
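As a starting point for the derived rider properties mentioned above, the two formulas can be written down as follows. BMI is body weight divided by the square of body height, and VAM (velocità ascensionale media) is the number of vertical meters climbed per hour. The input values in the example are arbitrary and do not belong to any real rider.

```python
def bmi(weight_kg, height_m):
    """Body mass index: weight in kilograms divided by height in meters squared."""
    return weight_kg / height_m ** 2


def vam(ascent_m, minutes):
    """Average ascent speed in vertical meters per hour."""
    return ascent_m * 60 / minutes


# Arbitrary example values.
print(round(bmi(72.0, 1.80), 1))
print(vam(1000, 40))
```

Both values are continuous, so like the Distance attribute they would need to be normalized before being fed to k-means alongside the binary rider attributes.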
Dirk Kemper is a Business Intelligence consultant at rond consulting and specializes in Predictive Analysis, Enterprise Performance Management and financial reporting.