Data Geek Challenge 2: Can your birthday help you play football in Japan or England
DG2: Birth month of English & Japanese footballers
Background
This blog was prompted by Christopher Kim’s blog on birthdays and MLB players:
Can Your Birthday Help You Play Major League Baseball?
I thought I would look into the birthdays of English and Japanese footballers and how they relate to the school year in both countries. At the moment my family are living in England and my children follow the English school year, which runs from September to mid July. My children also attend a Japanese school every Saturday and follow the Japanese academic year, which starts in April. So I decided to get the data for both countries and test the theory that a disproportionate number of English and Japanese born footballers are born in the first few months of the school year.
Data Collection
The first issue was where to get the data for the respective countries. Previously I had followed a tweet announcing that every ‘on the ball’ event in the English Premier League would be made available (that would be big data), with a reduced set of data offered for free. I did apply but did not hear anything back, and that offer has now gone. So I had to look around for a dataset, and my choice was Wikipedia, because its data can be queried with SPARQL through online forms such as the one from DBpedia. More information about DBpedia can be found at http://dbpedia.org/About
SPARQL queries
I have used SPARQL previously to query Wikipedia data that is available in DBpedia while I was living in Japan, although I have not used it for a few years, so now seemed a good opportunity to try it again.
English Football Players Query
My chosen wikipedia category would be http://en.wikipedia.org/wiki/Category:English_footballers
The assumption is that all the footballers listed in this category are English, whether or not they attended an English school. Some English players born outside of England have been left in the list, as they may have attended an English school. My main objective was to select English players and their birthdays.
I split the SPARQL query into two, as the player's date of birth was available in two different properties.
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX dbpprop: <http://dbpedia.org/property/>
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
SELECT ?player ?birthDate ?countryofbirth
WHERE {
  ?player dcterms:subject <http://dbpedia.org/resource/Category:English_footballers>.
  ?player dbpprop:birthDate ?birthDate.
  OPTIONAL { ?player dbpprop:countryofbirth ?countryofbirth }
  FILTER (?birthDate >= "1900-01-01"^^xsd:date )
}
The second query differs only on line 9, where the other property for the birth date is queried.
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX dbpprop: <http://dbpedia.org/property/>
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
SELECT ?player ?birthDate ?countryofbirth
WHERE {
  ?player dcterms:subject <http://dbpedia.org/resource/Category:English_footballers>.
  ?player dbpprop:dateofbirth ?birthDate.
  OPTIONAL { ?player dbpprop:countryofbirth ?countryofbirth }
  FILTER (?birthDate >= "1900-01-01"^^xsd:date )
}
If the queries above are copied and pasted into the DBpedia SPARQL endpoint form and executed, you should get the raw data I used.
DBpedia’s form allows you to download the results in CSV format. I ran both queries, downloaded the CSV files from DBpedia and saved them locally.
Japanese Football Players Query
The same queries were used, but with the category changed to http://en.wikipedia.org/wiki/Category:Japanese_footballers as per the query line shown below.
The assumption is that the footballers listed are Japanese (again, they may or may not have attended a Japanese school).
?player dcterms:subject <http://dbpedia.org/resource/Category:Japanese_footballers>.
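Since the four queries (two categories, two birth date properties) differ only in two places, they can be generated from one template. The following is a minimal Python sketch for illustration only; I pasted the queries into the DBpedia form by hand. The names build_query and build_csv_url are hypothetical, and the endpoint address https://dbpedia.org/sparql with a format=text/csv request parameter is an assumption about how the public endpoint can be called. The sketch only builds the query text and the request URL, it does not contact DBpedia.

```python
from urllib.parse import urlencode

# Assumed address of the public DBpedia SPARQL endpoint.
ENDPOINT = "https://dbpedia.org/sparql"

# Same shape as the queries above; only the category and the
# birth date property are left as placeholders.
QUERY_TEMPLATE = """PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX dbpprop: <http://dbpedia.org/property/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ?player ?birthDate ?countryofbirth
WHERE {{
  ?player dcterms:subject <http://dbpedia.org/resource/Category:{category}>.
  ?player dbpprop:{birth_property} ?birthDate.
  OPTIONAL {{ ?player dbpprop:countryofbirth ?countryofbirth }}
  FILTER (?birthDate >= "1900-01-01"^^xsd:date )
}}"""


def build_query(category, birth_property):
    """Fill the template for one category and one birth date property."""
    return QUERY_TEMPLATE.format(category=category, birth_property=birth_property)


def build_csv_url(query):
    """Build a GET URL asking the endpoint for CSV output (assumed parameter names)."""
    return ENDPOINT + "?" + urlencode({"query": query, "format": "text/csv"})


# Two categories times two properties gives the four queries used in this blog.
queries = [
    build_query(category, birth_property)
    for category in ("English_footballers", "Japanese_footballers")
    for birth_property in ("birthDate", "dateofbirth")
]

for query in queries:
    print(build_csv_url(query)[:80], "...")
```

Each URL could then be opened in a browser or fetched with a tool of your choice to save the CSV file locally.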
Using the data in Lumira
The first objective was to load both files into Lumira. As my files had the same format and headings, I used the Add option.
Then I used the Union feature to merge the CSV files.
With both files in Lumira, I filtered out the “unknown” values and other dates not in a YYYY-MM-DD format.
I then converted the birthDate column to a Date in Lumira, using the format yyyy-MM-dd.
In the new column I noticed I had left in some blank data, so I filtered this out with the Exclude empty values option selected on the filter.
As I was only interested in the month of birth I added a new column using Lumira’s data manipulation feature.
Then I added a measure to the newly created month column as a count (all).
I renamed the new measure to Month and next step would be to visualise the data.
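For readers without Lumira, the same preparation steps (union of the two files, dropping values that are not yyyy-MM-dd dates, extracting the month and counting) can be sketched in a few lines of Python. This is only an illustration under assumptions, not what I ran: the two small CSV texts below are made-up stand-ins for the downloaded files, and the function names are hypothetical.

```python
import csv
import io
from collections import Counter
from datetime import datetime

# Made-up rows standing in for the two CSV files downloaded from DBpedia.
FILE_A = """player,birthDate,countryofbirth
http://dbpedia.org/resource/Player_A,1980-09-14,England
http://dbpedia.org/resource/Player_B,unknown,
http://dbpedia.org/resource/Player_C,1975-07-02,England
"""
FILE_B = """player,birthDate,countryofbirth
http://dbpedia.org/resource/Player_D,1991-09-30,
http://dbpedia.org/resource/Player_E,,England
http://dbpedia.org/resource/Player_F,1968,England
"""


def read_rows(text):
    """Parse CSV text with a header row into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def birth_month(value):
    """Return the month number, or None if the value is not a year-month-day date."""
    try:
        return datetime.strptime(value.strip(), "%Y-%m-%d").month
    except ValueError:
        return None


# Union: both files share the same headings, so the rows can simply be appended.
rows = read_rows(FILE_A) + read_rows(FILE_B)

# Extract the month and drop blanks, "unknown" and other non-date values.
months = [birth_month(row["birthDate"]) for row in rows]
month_counts = Counter(month for month in months if month is not None)

for month in sorted(month_counts):
    print(month, month_counts[month])
```

The resulting month_counts plays the role of the count measure on the month column.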
English Footballers Chart
For the theory to hold, the first few months of the English school year (September/9, October/10 & November/11) should have the highest number of players born in them.
So the theory holds 🙂 for the sample data I collected.
It is interesting that September (start of the school year) has double the number of players of July (end of the school year).
As the data contains footballers born from the 1900s to the 1990s, I thought I would add a filter to show players born after 1975.
The visualisation still highlights the difference between the start and end of the school year.
Japanese Footballers Chart
Less filtering was required on the Japanese footballer data, but I followed exactly the same Lumira steps as for the English players above.
For the theory to hold, the first few months of the Japanese school year (April/4, May/5 & June/6) should have the highest number of players born in them.
So the theory holds again 🙂, with double the number of players born in April, the start of the school year, compared to the end of the school year.
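One way to put a number on the charts is to work out what share of players were born in the first three months of the school year; if birthdays were spread evenly, that share would be roughly a quarter. The Python sketch below is only an illustration: the function names are hypothetical and the month counts are made-up values, not the figures from my charts.

```python
def school_year_order(start_month):
    """Return the months 1-12 reordered to begin at the school year's first month."""
    return [(start_month - 1 + i) % 12 + 1 for i in range(12)]


def first_quarter_share(month_counts, start_month):
    """Share of all players born in the first three months of the school year."""
    first_three = school_year_order(start_month)[:3]
    total = sum(month_counts.values())
    return sum(month_counts.get(month, 0) for month in first_three) / total


# Made-up month counts for illustration only (month number -> players).
example_counts = {1: 8, 2: 7, 3: 6, 4: 14, 5: 12, 6: 11,
                  7: 9, 8: 9, 9: 8, 10: 8, 11: 8, 12: 8}

# Start month 4 for the Japanese school year; use 9 for the English one.
share = first_quarter_share(example_counts, start_month=4)
print(f"Share born in first three school months: {share:.1%}")
```

A share well above a quarter would point the same way as the charts, although with a sample like this it is an indication rather than a proof.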
Data Quality
I have used the raw data from the DBpedia queries above, only tidying up blank cells and removing rows that did not have the correct format for the date of birth. The data may not be complete for any particular football season or period, so it is just a sample of the football players who happen to be available to the DBpedia SPARQL query.
One last thing to do
From my previous DG2 blog, the one and only day and time for me to publish a blog is Wednesday 13:00.
Update 2/7/2015
Women’s World Cup Semi Final – England Vs Japan
The semi final between England and Japan generated some debate in my house.
So as a way to divert attention I mentioned birthdays and footballers again. It’s not a wide-ranging data set, but the birthdays of the women footballers do not fit the theory.
Very good - but I cannot see the months to be born in - could you add text to include them?
Very good to use your analysis on when to post too.
Thanks,
Tammy
Hi Tammy
Thanks for the comment.
Not sure how to change the chart to use the name of the month. Lumira does not offer that option when adding the month, only the number. The time hierarchy still keeps the years as well as the month, so it didn't work for me. I also had problems adding pictures to this blog, so I had to keep on adding them as they kept disappearing. They display ok for me now.
Thanks
Robert.
Pretty good analysis..!! 🙂
Wow, I'm honored that my post helped you develop this analysis! Great work, and the "birthday bias" theory is proven once again!
Best of luck on the Data Geek challenge!
Chris
Hi Chris,
Thanks for the comment and I thank you for posting your original blog. I had not heard of the original analysis so happy you posted your blog as I found it a very interesting theory.
Cheers
Robert
Hi Robert.
Good Day!
I can see your effort and hard work in this blog!
Keep up the good work! I appreciate!
Have a Wonderful Day! 😆
Regards,
Hari Suseelan
Just found your article. Nice application of Malcolm Gladwell's example of hockey players from his book Outliers.
http://www.straight.com/article-175183/malcolm-gladwells-outliers-opens-tale-about-vancouver-giants