Lumira Dataset: Bus Tracker from the Chicago Transit Authority
This blog post is the second part of my entry for the 2014 Data Geek Challenge.
Let’s play Timo ELLIOTT’s Buzzwords Bingo Game: in this blog post, watch out for the following keywords and, once you reach 5, shout “Bingo!” from your cubicle. See your prize in the conclusion. Enjoy!
- Big Data,
- Geospatial Analytics,
- Text-based Analytics.
Chicago, City of Big Data
My company, Alta Via Consulting, is headquartered in Chicago. Every time I wind up downtown, I take the opportunity to visit the Chicago Architecture Foundation. Their current exhibition is called “Big Data Chicago” and is definitely worth a look: http://bigdata.architecture.org/
This event is part of a larger initiative by the City of Chicago to embrace the open data movement. One example is the Chicago Transit Authority (CTA), which shares the real-time position of its buses and trains, publishes all service bulletins and even lets developers access this data through an API. Sounds like an opportunity for geospatial analytics and text analytics with Lumira, right?
The business questions
While looking at the available data, one question stuck with me: I can see a picture of transit in real time, but how do events develop over time? How can I find patterns if all I can see is “right now”? For instance, focusing on delays: is there a correlation with weather? With construction? With time of day? Should some route schedules be changed if they constantly run late?
To answer these questions, we need to collect the real-time feed into the cloud on a regular basis and, later, analyze it with Lumira. Here’s the process:
The CTA Bus Tracker API documentation details all the entities that are shared. I collected some of them into a MySQL database through a PHP script. A WebCron task is called every hour to read the vehicle positions. Then, I created PHP scripts to export the data as CSV files.
- CTA Bus Tracker API Documentation
- Costing Geek: Vehicle positions as CSV
- Costing Geek: Bulletins as CSV file
- Costing Geek: Patterns formatted for Google Maps
- WebCron tasks (not always reliable, but free)
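To make the collection step more concrete, here is a minimal sketch of turning one vehicle-position snapshot into CSV rows. It is written in Python rather than the PHP I actually used, it parses a hard-coded sample instead of calling the API, and the element names (`vid`, `tmstmp`, `lat`, `lon`, `rt`, `dly`) and sample values are assumptions for illustration only. Check the CTA Bus Tracker API documentation for the real response layout.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample shaped like a vehicle-position response.
# Element names and values are assumptions, not copied from the CTA feed.
SAMPLE_XML = """
<bustime-response>
  <vehicle>
    <vid>1234</vid>
    <tmstmp>20141115 14:05</tmstmp>
    <lat>41.8819</lat>
    <lon>-87.6278</lon>
    <rt>20</rt>
    <dly>true</dly>
  </vehicle>
  <vehicle>
    <vid>5678</vid>
    <tmstmp>20141115 14:05</tmstmp>
    <lat>41.9000</lat>
    <lon>-87.6500</lon>
    <rt>22</rt>
  </vehicle>
</bustime-response>
"""

# Columns written to the CSV export (assumed names, see above).
FIELDS = ["vid", "tmstmp", "lat", "lon", "rt", "dly"]


def parse_vehicles(xml_text):
    """Return one dict per <vehicle>; missing elements become empty strings."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for vehicle in root.findall("vehicle"):
        rows.append({name: (vehicle.findtext(name) or "").strip() for name in FIELDS})
    return rows


def vehicles_to_csv(rows):
    """Serialize the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(vehicles_to_csv(parse_vehicles(SAMPLE_XML)))
```

In a real collector, the sample string would be replaced by the body of an hourly API call, and the rows would be appended to a database table or file so that the snapshots accumulate over time.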
While building the architecture for this project, I had to work around several technical limitations:
- The CTA Bus Tracker API requires a developer key, which took me 2-3 weeks to obtain,
- By default, one API key can make a maximum of 10,000 requests per day. There are 127 routes, which means a maximum of 78 snapshots of the whole network per day, about 3 per hour. You have to be strategic in how you set this up,
- My hosting company doesn’t support remote calls to the MySQL database for security reasons, and I wasn’t able to implement an OData wrapper, hence the workaround with the CSV files. I wish I had found a way to expose this data in OData format so we could link to it directly from Lumira,
- My hosting company also limits the number of SQL commands I can run to 7,500 per day. It took me several days to collect all the patterns (27,360 stops or waypoints in total),
- The free WebCron website only supports 150 calls per day. Again, some strategy is required here.
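The arithmetic behind these quotas can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes one API request per route per snapshot and one SQL command per stored stop or waypoint, which is a simplification of what the scripts really do.

```python
import math

# Quotas and sizes quoted in the list above.
DAILY_API_REQUESTS = 10_000
ROUTES = 127
DAILY_SQL_COMMANDS = 7_500
PATTERN_POINTS = 27_360
DAILY_CRON_CALLS = 150

# Assumption: one API request per route for each full-network snapshot.
snapshots_per_day = DAILY_API_REQUESTS // ROUTES
snapshots_per_hour = snapshots_per_day / 24

# Assumption: one SQL command per stored stop or waypoint.
min_days_for_patterns = math.ceil(PATTERN_POINTS / DAILY_SQL_COMMANDS)

# Scheduler calls available per hour if spread evenly over the day.
cron_calls_per_hour = DAILY_CRON_CALLS / 24

if __name__ == "__main__":
    print(f"Full snapshots per day: {snapshots_per_day}")
    print(f"Full snapshots per hour: {snapshots_per_hour:.2f}")
    print(f"Minimum days to load all patterns: {min_days_for_patterns}")
    print(f"Scheduler calls per hour: {cron_calls_per_hour:.2f}")
```

Under these assumptions the API quota, not the scheduler, is the tighter constraint on full-network snapshots, and loading the patterns takes at least four days of SQL quota.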
Overall, this data collection process took way more time than I expected, and I’m lucky the Data Geek Challenge deadline was postponed 🙂 . Feel free to use the data from the Costing Geek links, but before you publish anything, please read the terms and conditions in the CTA Bus Tracker API Documentation. Also, please share your results in the comments so I know sharing was worth it ;o)
Before I forget: if (and only if) you found all the keywords, here’s your prize: http://goo.gl/KmkP6u