25.945 football matches, 299 teams, 11 leagues. All including betting odds of 10+ bookers offices. Seems like the perfect database to use predictive analysis on. And the perfect opportunity to use HANA express edition to do all my difficult calculations. While we are at it: let’s also try out the possibilities of connecting HANA to R. In this blog I want to show you how I used R as an ETL tool (yes, as an ETL tool), how I used calculation views within HANA to create my star model and how I used Lumira to do some quick data analysis. All for one reason: breaking the bank and trying to beat the bookies!
This whole thing started when I stumbled upon this database on Kaggle, while looking for some cool data to try out my colleagues newly obtained Intel Nuc with HANA express edition installed on it (see also his blog). Football has always been one my favourite sports and I did put in some occational bets in the past. I’m also a huge fan of the Football Manager series by Sports Interactive and always tried to understand how their matchmaking system worked. Now I had the data and the calculation power, with HANA express edition, to get a better insight in the world of bookmakers and predicting the outcome of football games.
I started out with downloading the database which comes in the form of a SQLlite database. Never before did I work with such a database but I was sure that I could easily convert it to CSV files. I was wrong. The database is build up by a Python script that scrapes a website with football matches of different leagues/countries within Europe. The creator of the database did an impressive job by creating a Python script that is also capable of reading the match summary with the most important moments of the games (goalscorers, cards, substitutes, etc.). While this is interesting data for now I wanted to stay with quantitive data so I had to get rid of the text columns in the database. And the simple SQLlite to CSV converter gave me no possibilities to do some basic ETL. So I looked at R.
R is basically a programming language and environment for statistical computing and graphics. R can be enriched with all kinds of libraries, which there are plenty off due to it’s open-source nature. So I was almost certain there was a library out there that could help me with my problems with the database. And there it was: RSQLite.
I loaded the library and this gave me the option to load the SQLlite database into R as a connection. With this connection it was possible to get a basic understanding of the database and it’s content. It also gave me the possibility to fire off some SQL statements at the tables within the database, after which I could write the results into a dataframe (R’s equivalent of a table). So by using some basic R commands I was now able to clean up my dataframe, remove the columns I didn’t need, and do some early data-exploration. After that I had two options: exporting my dataframe as a CSV file or try to write it directly into HANA.
Not wanting to go for the easy route I tried the second option. Again I needed a library that would help me import my dataframe into HANA. This time an older SAP blog post helped me out. The library I needed in this case was RSQL (they are not really creative with their names..). There is a simple statement you can use to get a connection to your HXE system.
Below I have made a screenshot of some of my R script that I used to make a connection to the (local) SQLlite database, to add it to a dataframe, make a connection to HXE and import my data into a table. For the bigger tables (matches, playerstats) I chose the export of CSV files to give me somewhat more control over the import into HXE.
I chose to split the Match table into multiple tables. I figured that when I wanted to make a different model without the betting odds or team formations in the future this would give me some more flexibility. For this model I did want them al together though so I created a calculation view to join the tables back together (see the screenshot below). I also added some calculated columns (this is where the power of HANA really helped me out a lot!) with the actual winner of the game (home, away or draw) and the predicted outcome by the bookmakers. Finally I added the columns that compared the actual outcome of the game and the prediction of the bookmakers, resulting in a right (R), wrong (W) or no bet (N) when no data is present.
Then I used another blog, written by my dear colleague and owner of the Nuc ;-), to connect my model to Lumira so I could do some basic data analysis and visualizations. The end result is shown below.
Some interesting conclusions can be drawn from this analysis:
- The home team scores about twice as many goals as the away team;
- This results in about 45% of the matches being won by the home team;
- Bookmakers only predict about 25% of their matches correctly.
Only 25% predicted correctly, while just guessing the home team would win would give them a 45% chance of being correct. And since bookmakers still seem to make money, this means the bookmakers have other factors that come into play when they calculate their odds. Luckily there is more data present in the database that can possibly help me explain what factors play a role in the outcome of a match. And hopefully this can all result in some predictive model that can help me break the bank ;-). More of that in my next blog!