To start, I know very little about soccer. I was recently given a few soccer data sets and asked if I could make sense of them and possibly make some predictions. Over the years, I’ve learned not to make sports predictions. First, through fantasy football and March Madness pools, I have learned that too many external and unpredictable factors significantly affect the results of games. But more importantly, a limited data set that lacks meaningful data is not conducive to making predictions. Maybe I can provide some insights and let you make the predictions…
What’s the History of Soccer’s Biggest Event?
For us non-soccer fans, this year marks the 21st event. Over the past 20 events, we can see the following.
The number of teams allowed went from 16 to 24 (in 1982) and then from 24 to 32 (in 1998).
Which Teams Win More?
As in most sports, successful teams win more than others and continue to win more. Brazil, Germany, and Italy have the most championships, appearances, and wins. On the chart on the right, in gray, we can see that teams like the Netherlands and Sweden have won many matches but have never won a championship. Maybe they’re due…
Which Statistics (in my limited data set) Separate Past Winners?
Goal differential and shooting efficiency seem to be the biggest indicators. I’m sure that there are many more advanced statistics, but my data set was limited.
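For readers who want to try this on their own data, here is a minimal sketch of how the two statistics could be computed. It assumes goal differential means goals scored minus goals conceded, and shooting efficiency means goals per shot taken; the data set may define them differently. The `TeamStats` class and the numbers are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class TeamStats:
    """Hypothetical per-team totals; field names are assumptions."""
    name: str
    goals_for: int
    goals_against: int
    shots: int

    @property
    def goal_differential(self) -> int:
        # Assumed definition: goals scored minus goals conceded.
        return self.goals_for - self.goals_against

    @property
    def shooting_efficiency(self) -> float:
        # Assumed definition: goals scored per shot taken.
        # Teams with no recorded shots get 0.0 to avoid dividing by zero.
        return self.goals_for / self.shots if self.shots else 0.0


# Made-up numbers, not real tournament data.
teams = [
    TeamStats("Team A", goals_for=12, goals_against=4, shots=80),
    TeamStats("Team B", goals_for=6, goals_against=7, shots=75),
]

# Sort by goal differential first, then by shooting efficiency.
ranked = sorted(
    teams,
    key=lambda t: (t.goal_differential, t.shooting_efficiency),
    reverse=True,
)

for t in ranked:
    print(f"{t.name}: GD={t.goal_differential:+d}, eff={t.shooting_efficiency:.3f}")
```

Sorting on both values at once is only one way to combine them; plotting one against the other, as in the charts here, shows more.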
Do Host Countries Have an Advantage?
Over the past 20 events, the host country has won 30% of the time. However, comparing the graphics above and below shows that these host teams tend to be among the stronger teams in general, and that little extra “home court advantage” pushes them over the edge.
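To put that 30% in context, here is a rough sketch comparing it with what a host would win if every team in the field had an equal chance. The equal-chance baseline is a deliberate simplification (teams are obviously not equally strong), and the `win_rate` helper is hypothetical. The 6 host wins are simply 30% of 20 events.

```python
def win_rate(wins: int, events: int) -> float:
    """Share of events won; raises if there are no events to divide by."""
    if events <= 0:
        raise ValueError("events must be positive")
    return wins / events


# 30% of 20 events means 6 host-country championships.
host_rate = win_rate(wins=6, events=20)

# Naive baseline: one champion drawn at random from the field.
baseline_16 = 1 / 16  # 16-team era
baseline_32 = 1 / 32  # 32-team era

print(f"Host win rate:            {host_rate:.1%}")
print(f"Equal-chance, 16 teams:   {baseline_16:.1%}")
print(f"Equal-chance, 32 teams:   {baseline_32:.1%}")
```

Even against the most generous baseline, hosts have won far more often than chance alone would suggest, which is consistent with hosts tending to be strong teams to begin with.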
Does the Team’s Seed or Ranking Matter?
Teams are seeded from 1 to 32 and separated into different groups. There’s also a very advanced ranking system that uses a weighted average over a long period of time. I was only able to find detailed rankings from 1998, which limits this analysis further, but we can still glean some good insights. Over the past 5 events, not surprisingly, the higher seeds tend to advance further.
Over these 5 events, there are definitely outliers, in that weaker teams advance further than expected and stronger teams are eliminated earlier than expected.
If we drill down further, we can see which of the weaker teams advanced further than expected and which of the stronger teams were eliminated early.
Does the Size of the Country or Its GDP Matter?
This chart compares the number of wins to the country’s population and GDP. The colors indicate whether the team has won a World Cup. From the chart, we can see that aside from Germany and Brazil, the relative size and GDP of the country do not dictate the team’s success in the event.
What Have We Learned So Far?
* Good teams tend to stay good (forever) and win significantly more than the rest.
* The higher the seed or rank, the better the team and the better its chances of winning. The likelihood of a repeat winner is high.
* Goal differential and shooting efficiency separate these teams from the rest.
* The Host country only seems to matter if the Host country is a top team.
* Population and GDP don’t matter.
Let’s see if we can apply what we learned to this year’s teams.
Who Makes Up This Year’s Crop of Teams?
UEFA has the most teams, but CONMEBOL has the strongest teams.
If we break it down further, we can see each team colored by its confederation and sized by its rank.
If Rankings, Goal Differential, and Shooting Efficiency All Matter, Who Are the Best Teams?
If we look at just the top 10 teams in this year’s field and compare their average goal differential and shooting efficiency, we can see three teams in the top right (Spain, Belgium, and Brazil).
What Are the Relative Strengths of Each Group?
This is pretty interesting. If you compare the ranks (or point scores) of the teams across the different groups, you see some very uneven groupings.
Group A doesn’t have any very strong teams. Uruguay is likely to win the group, even though it would struggle to advance against stronger teams. Winning the group would also give Uruguay a more favorable match-up in the round of 16. Group C, on the other hand, has three strong teams (Denmark, France, and Peru) all competing against each other.
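One way to put a number on how uneven the groups are is to compare the average rank in each group and count how many highly ranked teams each one contains. The sketch below does that with made-up groups and ranks; the function names, the cutoff of 15, and the data are all assumptions for illustration, not the actual draw or rankings.

```python
from statistics import mean


def average_rank(groups: dict[str, dict[str, int]]) -> dict[str, float]:
    """Mean world rank per group; a lower number means a stronger group."""
    return {group: mean(teams.values()) for group, teams in groups.items()}


def strong_team_count(groups: dict[str, dict[str, int]], cutoff: int) -> dict[str, int]:
    """Number of teams in each group ranked at or better than the cutoff."""
    return {
        group: sum(1 for rank in teams.values() if rank <= cutoff)
        for group, teams in groups.items()
    }


# Made-up groups and ranks, not the real draw.
groups = {
    "Group X": {"Team 1": 14, "Team 2": 45, "Team 3": 66, "Team 4": 67},
    "Group Y": {"Team 5": 7, "Team 6": 11, "Team 7": 12, "Team 8": 36},
}

avg = average_rank(groups)
strong = strong_team_count(groups, cutoff=15)

for group in groups:
    print(f"{group}: average rank {avg[group]:.1f}, strong teams {strong[group]}")
```

In this toy example, Group X has one strong team and an easy path, while Group Y packs three strong teams together, which is the same pattern described above for Groups A and C.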
What Does This Mean?
The power of analytics is that it gives us the whole story behind the data and it can help us to validate our thought process or just make watching sports more fun. I’m not making any predictions, but the following seems interesting.
* Uruguay can easily win their Group.
* Germany (ranked #1 in the world) and Brazil (ranked #2 in the world) are going to meet in the quarter-finals, which doesn’t seem fair.
* Despite being good teams, one of Denmark, France, or Peru will not make it to the round of 16.
* If their regular season matters, Spain, Belgium, and Brazil seem like teams that will go far.
What Do You Think?
Have a better or a more complete data set? Have more questions that you want answers to? Have some new insights that you want to share? Or just want to give analytics a try?