This post is the third in an ongoing series that will run for the duration of the Rugby World Cup 2019!
You know the famous rugby saying: “No scrum, no win”.
Each dataset you create is like that small army of 8 players pushing for predictive victory.
Data is THE foundation of well-performing predictive models.
(build a rock-solid data foundation)
The data you gather should derive from the initial business question and from your business expertise. The data might (hopefully) already be there in your corporate systems. Sometimes you’ll find it in raw form, and you’ll need to refine it further so that it can reveal its full potential.
Building a solid data foundation requires multiple skills:
- Creativity. You should think broadly and include all variables that have the potential to improve your model. Use your “French Flair” and engage with passion.
- Method. You should not only collect the right data, but also collect it the right way. Do not underestimate the errors in the data, and if you think checking the data once is good enough, then check it twice! Do it like Jonny Wilkinson, cool-headed and focused.
- Determination. Obstacles will keep coming your way throughout the data preparation phase. Keep the finish line in mind.
(use fresh data)
My business question, as you know by now, was about predicting game results. I needed a dataset of past games to support the creation of my predictive model.
I had to answer these four seemingly basic questions:
- What would a row be in my dataset?
- What would be my target variable?
- What would be the list of rows in my dataset? (a question that is tightly coupled with time…)
- What would be the list of variables in my dataset?
First, I had to define the meaning of a row in my dataset.
For a rugby game it’s not complex: two teams meet on a certain day and one of the teams wins (or, on rare occasions, there is a draw).
Based on this, the information that uniquely identifies each row of my dataset would be: Team 1, Team 2, Match Date. So far, so good.
The second answer I needed was the definition of the target variable. A rugby game can be won, lost or drawn. To keep it simple, I excluded games that ended in a draw.
My target variable therefore was: Game is won or lost.
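Here is a minimal Python sketch of these first two definitions, the row key and the target. The `Game` class, its field names and the `target` helper are hypothetical names chosen for illustration, not the actual structure of my tables, and the draw in the sample data is made up.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Game:
    team_1: str
    team_2: str
    match_date: date
    points_1: int
    points_2: int

    @property
    def key(self):
        # Team 1, Team 2 and Match Date together identify a row.
        return (self.team_1, self.team_2, self.match_date)


def target(game: Game) -> Optional[str]:
    """Result from Team 1's point of view; None marks a draw."""
    if game.points_1 == game.points_2:
        return None
    return "won" if game.points_1 > game.points_2 else "lost"


games = [
    Game("USA", "Canada", date(2019, 9, 7), 20, 15),
    Game("Team A", "Team B", date(2018, 3, 10), 12, 12),  # made-up draw
]

# Draws are excluded, so only won/lost rows remain.
labelled = [(g.key, target(g)) for g in games if target(g) is not None]
```

Only the first game survives the filter, labelled "won" for Team 1.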
The third piece of information I needed was the number of past games I would study. There is a tough compromise to make between having enough data (to build a good predictive model, ideally I need between 500 and 1,000 rows of the less frequent class, won or lost) and not going too far back in time, as a team in 2019 clearly cannot be compared exactly with the same team in 1999 or 2009.
I built a first table of team-related stats as of the end of 2016, then retrieved all the games played in 2017, 2018 and 2019 (up to the start of the 2019 World Cup). Combining these two tables, TEAMS and GAMES, would give me the training dataset I was looking for.
List of rows: all games in 2017, 2018 and 2019 (up to the start of the World Cup), with team stats as of the end of 2016
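The time window and the "enough rows in the smaller class" check can be sketched as follows. This is an illustration under assumptions: the function names are hypothetical, the sample rows are made up, and the cutoff is set to 20 September 2019, the opening day of the tournament.

```python
from collections import Counter
from datetime import date

WINDOW_START = date(2017, 1, 1)
WORLD_CUP_START = date(2019, 9, 20)  # assumed cutoff: tournament opening day


def in_training_window(match_date):
    """Keep games from 2017 up to (not including) the World Cup start."""
    return WINDOW_START <= match_date < WORLD_CUP_START


def minority_class_size(labels):
    """Number of rows in the less frequent of the two classes."""
    counts = Counter(labels)
    return min(counts["won"], counts["lost"])


# Made-up sample rows: (match date, result for Team 1).
sample = [
    (date(2016, 11, 12), "won"),   # too old, dropped
    (date(2017, 2, 3), "won"),
    (date(2018, 6, 16), "lost"),
    (date(2019, 9, 7), "won"),
    (date(2019, 9, 21), "lost"),   # during the World Cup, dropped
]

kept = [(d, r) for d, r in sample if in_training_window(d)]
smallest = minority_class_size(r for _, r in kept)
```

On a real dataset, `smallest` is the number to compare against the 500 to 1,000 rows mentioned above.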
(carefully assemble your training dataset)
The TEAMS Table
After multiple iterations, this data table contained the performance of each national rugby team over the 2015/2016 period. The goal is to measure the relative strength of each team:
- Matches Played, Lost, Won, Drawn.
- % Matches Won
- Points For, Against, Difference
- Tries, Conversions, Penalties, Drops
- 2015 World Cup Status: did they win? Did they reach the final, the semi-finals, the quarter-finals?
- The World Cup 2015 and World Cup All-time Stats for the team (if applicable)
- Team performance in major tournaments (the Six Nations Championship, the Rugby Championship) and in the Player of the Year Award.
- Various statistics about each national rugby union: number of clubs, registered players, referees… This might seem surprising, but I use them to “gauge” each country’s interest in the great game of rugby.
- World Rugby Rankings, expressed both as points and respective team position in the rankings.
- The Human Development Index of the team’s country, again expressed as rank and absolute indicator.
At the end of the day, each team is defined by a set of 48 characteristics (variables). The dataset contains 115 teams: not every country that plays rugby, but the major ones.
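Some of these variables can be derived from others. As a small sketch, assuming that "% Matches Won" means matches won divided by matches played (draws included in matches played) and that "Difference" is points for minus points against, the derivation could look like this. The function name, the dictionary keys and the numbers are hypothetical.

```python
def derived_stats(played, won, points_for, points_against):
    """Derive two TEAMS variables from the raw counts (assumed definitions)."""
    pct_won = 100.0 * won / played if played else 0.0
    return {
        "pct_matches_won": pct_won,
        "points_difference": points_for - points_against,
    }


# Made-up numbers for one team.
stats = derived_stats(played=20, won=13, points_for=520, points_against=410)
```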
The GAMES Table
This table is a long list of the 1268 international games played between the 3rd of February 2017 (January is too cold for rugby) and the 7th of September 2019 (the USA beat Canada 20 to 15, for the record), shortly before the World Cup started. For each game, we have a Team 1, a Team 2, a game result (won/lost), the points difference, tries, etc.
It’s worth noting that a game like USA-Canada appears twice in the dataset. The first time, Team 1 is USA, Team 2 is Canada and Team 1 wins. The second time, Team 1 is Canada, Team 2 is USA and Team 1 loses. I created the dataset this way so that the number of records would ultimately be sufficient for my predictive modeling needs.
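The mirroring described above can be sketched in a few lines of Python. The tuple layout and the `with_mirrors` name are hypothetical, chosen only to illustrate the idea.

```python
FLIP = {"won": "lost", "lost": "won"}


def with_mirrors(games):
    """Return each game twice: as played, and with the two teams swapped."""
    rows = []
    for team_1, team_2, match_date, result in games:
        rows.append((team_1, team_2, match_date, result))
        # Same game seen from the other side: the result flips.
        rows.append((team_2, team_1, match_date, FLIP[result]))
    return rows


rows = with_mirrors([("USA", "Canada", "2019-09-07", "won")])
```

One thing to keep in mind with this design: the two copies of a game are not independent observations, so if you split the data for validation it is probably safer to keep both copies on the same side of the split.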
The GAMES AND TEAMS SQL view
Once I had the table GAMES and the table TEAMS fully correct, the only thing that was left was to join them so that for each game I would have the characteristics of Team 1, the characteristics of Team 2 and the game result.
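To make the join concrete, here is a minimal sketch using Python's built-in `sqlite3` module with an in-memory database. The table names GAMES and TEAMS come from this post; the view name, the column names, the choice of SQLite and the team statistics are assumptions for illustration, and only two of the 48 team variables are shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE TEAMS (
    team TEXT PRIMARY KEY,
    ranking_points REAL,
    pct_matches_won REAL
);
CREATE TABLE GAMES (
    team_1 TEXT, team_2 TEXT, match_date TEXT, result TEXT
);
-- TEAMS is joined twice: once for Team 1, once for Team 2.
CREATE VIEW GAMES_AND_TEAMS AS
SELECT g.team_1, g.team_2, g.match_date,
       t1.ranking_points  AS t1_ranking_points,
       t1.pct_matches_won AS t1_pct_matches_won,
       t2.ranking_points  AS t2_ranking_points,
       t2.pct_matches_won AS t2_pct_matches_won,
       g.result
FROM GAMES g
JOIN TEAMS t1 ON t1.team = g.team_1
JOIN TEAMS t2 ON t2.team = g.team_2;
""")

# Made-up team statistics, for illustration only.
conn.executemany(
    "INSERT INTO TEAMS VALUES (?, ?, ?)",
    [("USA", 70.0, 55.0), ("Canada", 62.0, 40.0)],
)
conn.executemany(
    "INSERT INTO GAMES VALUES (?, ?, ?, ?)",
    [("USA", "Canada", "2019-09-07", "won"),
     ("Canada", "USA", "2019-09-07", "lost")],
)

rows = conn.execute(
    "SELECT * FROM GAMES_AND_TEAMS ORDER BY team_1"
).fetchall()
```

Note that an inner join silently drops any game whose team is missing from TEAMS, which is one more reason to check both tables carefully before joining.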
This formed the basis of my training dataset. Later on, I realized that what matters is not so much the strength of each team taken separately, but rather the differences between the two teams.
To give you a concrete example, a team is not going to win because its world ranking lies somewhere between 70 and 80 points. It is likely to win because the gap to its opponent is large. So a lot of the work in defining the dataset was actually about comparing the characteristics of the two teams.
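One simple way to express such comparisons is to subtract Team 2's value from Team 1's value for each variable. The sketch below assumes hypothetical column names prefixed with `t1_` and `t2_` and uses made-up numbers; plain differences are only one possible comparison, and ratios or rank gaps would work along the same lines.

```python
def difference_features(row, variables):
    """Team 1 minus Team 2 for each named variable."""
    return {
        f"diff_{name}": row[f"t1_{name}"] - row[f"t2_{name}"]
        for name in variables
    }


# Made-up joined row with two variables per team.
row = {
    "t1_ranking_points": 75.0, "t2_ranking_points": 60.0,
    "t1_pct_matches_won": 65.0, "t2_pct_matches_won": 40.0,
}

features = difference_features(row, ["ranking_points", "pct_matches_won"])
```

With this convention, a positive difference means Team 1 is ahead on that variable, and the mirrored row for the same game gets the opposite sign.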
I’ll share the final training dataset via GitHub.
I was ready to train my predictive model. This will be the topic of the next episode! (cliffhanger)
(when your dataset is finally ready!)