The Data Geek Challenge – Is it safe to eat near where I live?
I’m no Data Geek. I’m a developer, a techie geek through and through. I maybe spend 1% of my time analysing datasets and reports, often under duress and for a very specific reason (just why doesn’t my expenses account balance with what Atos have paid me?!) As a result, DataGenius IV largely passed me by when I first saw it posted to SCN. Of course, it looked interesting and a bit more “fun” than lots of the usual stuff posted but still, not necessarily my bag.
Then this week (week of 29th July 2013) SAP announced the personal edition of Lumira would be free and my interest was piqued again. I’ve long wanted to do some sort of mashup making use of the vast amount of data the British Government makes available via its website http://data.gov.uk/ and so with a bit of inspiration, I started randomly clicking…
As hinted above, I’m not someone who spends any great time building data sets and evaluating them, so part of the challenge for me was getting some useful dataset building blocks to begin with. As Steve Rumsby and I covered in our recent respective blogs – Is self-service BI really a good thing? & Self-service BI, or seeing cod on the moon… – there is often a massive gap between obtaining data and producing something useful for reporting consumption.
So, for my first challenge I wanted to see what extreme reports I could build and what crackers assumptions this could lead me to, by interpreting sound data in a completely incorrect way! To do this, I needed to focus down on some (sort of) related topics first, so decided upon Liverpool (the city where I live) as there are a few interesting datasets available…
First I grabbed “UK food hygiene rating data (North West) – Food Standards Agency – Datasets – DGU”1 in XML format for Liverpool, and used Excel to convert it to .csv… Aside from dropping some superfluous columns, I did no further data cleansing or alterations as I wanted to see what I could/couldn’t do with Lumira. This gave me just over 3k records of food hygiene information for businesses around Liverpool as a starting point. Especially interesting at this stage was that the dataset contained longitude & latitude information, so I already had a vague idea of where my output was heading…
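For anyone who would rather skip the Excel step, the XML-to-CSV conversion can be sketched in a few lines of Python. This is only a rough sketch: the sample XML is made up, and the element names (EstablishmentDetail, BusinessName, PostCode, RatingValue, Geocode and so on) are my assumption about how the extract is laid out, so check them against the real file before relying on it.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Made-up sample; element names are assumptions, not taken from the real extract.
SAMPLE_XML = """<FHRSEstablishment>
  <EstablishmentCollection>
    <EstablishmentDetail>
      <BusinessName>Example Chippy</BusinessName>
      <PostCode>L1 1AA</PostCode>
      <RatingValue>5</RatingValue>
      <RatingDate>2013-01-15</RatingDate>
      <Geocode>
        <Longitude>-2.98</Longitude>
        <Latitude>53.40</Latitude>
      </Geocode>
    </EstablishmentDetail>
    <EstablishmentDetail>
      <BusinessName>Example Cafe</BusinessName>
      <PostCode>L2 2BB</PostCode>
      <RatingValue>3</RatingValue>
      <RatingDate>2012-11-02</RatingDate>
    </EstablishmentDetail>
  </EstablishmentCollection>
</FHRSEstablishment>"""

# Columns to keep; everything else is dropped as superfluous.
COLUMNS = ["BusinessName", "PostCode", "RatingValue", "RatingDate",
           "Longitude", "Latitude"]


def xml_to_rows(xml_text, record_tag="EstablishmentDetail"):
    """Flatten each record element into a dict of its leaf values."""
    root = ET.fromstring(xml_text)
    rows = []
    for record in root.iter(record_tag):
        # Leaf elements only, so nested ones like Longitude are picked up
        # while container elements like Geocode are skipped.
        leaves = {el.tag: (el.text or "").strip()
                  for el in record.iter() if len(el) == 0}
        # Missing values become empty strings rather than errors.
        rows.append({col: leaves.get(col, "") for col in COLUMNS})
    return rows


def rows_to_csv(rows):
    """Write the flattened rows out as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = xml_to_rows(SAMPLE_XML)
csv_text = rows_to_csv(rows)
print(csv_text)
```

Records with no geocode simply end up with blank longitude and latitude cells, which is probably what you want for a first pass with no cleansing.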
I was instantly disappointed however to find that when I checked the latitude and longitude attributes, Lumira didn’t recognise these as Geographical dimensions, as I expected it to. So I resorted to RTFM and hit F1.2 I soon got bored of that and resorted back to random clicking in the Lumira interface! Things now started to come together as I figured out how to quickly use the features of Lumira. I performed the following to generate a more useful visualisation:
- Used the “Show” button to review the enrichments Lumira was suggesting, removed the ones I didn’t want (for instance, I didn’t want longitude & latitude as measures) and hit the “Enrich” button. This gave me a few measures and a time hierarchy based on the “RatingDate” column.
- I wanted to try and get a feel for just how “hygienic” the establishments of Liverpool are, especially in my own immediate neighbourhood, so I generated a very simple view using Postcode as a filter to show the ratings in the immediate vicinity of where I live:
Not very exciting is it?! At least I could see my favourite local chippy had a decent 5 rating 🙂
At this point, the reality of just how hard it is to meddle with data and get meaningful reports out of it, especially reports that support very important decisions (like where should we get take-away from tonight?), really hit home. My inexperience with BI tools was getting the better of me…
I decided to expand my datasets so that I could perform some comparison analysis, and grabbed the same XML based extract for food hygiene for Manchester. I used Excel to merge the two datasets (because I couldn’t figure out how to do this in Lumira – fail) and re-opened my data as one larger dataset. I now had ~6k rows to work with and hopefully something I could use to perform some more meaningful investigations.
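The merge itself is simple enough to sketch outside of Excel too. The snippet below is a minimal illustration with made-up rows, not my actual extracts. The “City” column is something I’m adding here purely for illustration, so that the two sources can still be told apart after merging.

```python
import csv
import io

# Made-up rows standing in for the two converted extracts.
# The column order differs on purpose, to show it does not matter.
LIVERPOOL_CSV = "BusinessName,PostCode,RatingValue\nExample Chippy,L1 1AA,5\n"
MANCHESTER_CSV = "BusinessName,RatingValue,PostCode\nExample Diner,4,M1 1AA\n"


def read_tagged(csv_text, city):
    """Read CSV text into dicts and tag each row with its source city."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    for row in rows:
        row["City"] = city  # hypothetical extra column, not in the source data
    return rows


def merge(*tagged):
    """Concatenate row lists and write them out under one shared header."""
    merged = [row for rows in tagged for row in rows]
    # Union of all column names, in first-seen order.
    fieldnames = []
    for row in merged:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(merged)
    return merged, buffer.getvalue()


merged_rows, merged_csv = merge(
    read_tagged(LIVERPOOL_CSV, "Liverpool"),
    read_tagged(MANCHESTER_CSV, "Manchester"),
)
print(merged_csv)
```

Matching on column names rather than column positions means the two files don’t need identical layouts, which is handy if one of them has had columns dropped or reordered.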
Liverpool vs. Manchester
With the improved dataset, I set about investigating the state of food hygiene in Liverpool and comparing it to the same in Manchester. To do this I created a Geographic Bubble Chart using the longitude/latitude data linked to the business name. With a small amount of filtering I then ended up with this overall comparison picture:
What does this tell me (other than the mapping integration in Lumira is woeful)? Well, at first glance it appears Manchester has an overall higher food hygiene rating than Liverpool (more yellow = lower ratings; Liverpool is the “blob” on the left, not that it is clear from the map!) but I’m really not convinced that is an accurate or fair interpretation of the above image. Again, it reinforces 2 things for me:
- I’m no BI whizz!
- It’s easy to make data look how you want, or interpret it incorrectly.
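One way to be a bit less reliant on eyeballing coloured blobs is to put actual numbers next to each city. The sketch below uses made-up rows and assumes the merged data has a “City” column (my own addition, not part of the original extract) and that some ratings may be non-numeric text, which I simply skip. It is an illustration of the idea, not the analysis I actually ran.

```python
from collections import Counter, defaultdict

# Made-up rows; the real dataset has thousands of records per city.
ROWS = [
    {"City": "Liverpool", "RatingValue": "5"},
    {"City": "Liverpool", "RatingValue": "3"},
    {"City": "Liverpool", "RatingValue": "Exempt"},  # assumed non-numeric value
    {"City": "Manchester", "RatingValue": "4"},
    {"City": "Manchester", "RatingValue": "2"},
]


def summarise(rows):
    """Return per-city count, mean rating and rating distribution."""
    counts = defaultdict(Counter)
    for row in rows:
        value = row["RatingValue"].strip()
        if value.isdigit():  # ignore anything that is not a plain number
            counts[row["City"]][int(value)] += 1

    summary = {}
    for city, counter in counts.items():
        total = sum(counter.values())
        mean = sum(rating * n for rating, n in counter.items()) / total
        summary[city] = {
            "rated": total,
            "mean": mean,
            "distribution": dict(sorted(counter.items())),
        }
    return summary


summary = summarise(ROWS)
for city, stats in summary.items():
    print(city, stats)
```

Even then, a mean hides a lot (how many places were rated, how old the ratings are, what mix of businesses each city has), so the full distribution is worth a look before drawing any conclusions.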
I’m parking this Lumira/Data Geek adventure for now, and will re-visit shortly when I can find some more interesting data, and have the time to make something more useful with it.
Some thoughts for the future…
- It would be nice if I could point my Lumira at a data source and have it automatically check for updates (the data.gov.uk site, for instance, regularly updates the data available). Maybe you can and I just haven’t figured out how?
- In the modern world we work in, I think the platform would benefit from coping with a few more file formats. Based on my meddling, I’d say XML and JSON would be a good start.
- The mapping interface when you create a geographic based visualisation is shockingly bad. For me, the zoom and navigate don’t work at all so I’m not really sure why they are there.
2 In fact, I was disappointed to find that when I used the “Enrich All” function, other than identifying measures such as “Rating Value”, “Hygiene”, etc. and “RatingDate” as a Time measure, Lumira wasn’t quite as clever as I expected.3
3 It turns out it is clever, I was just being stupid 😉