I want to start a short series of posts to share with you my experience with data from an Open Data publication in order to compose a non-trivial data stock.
When trying to get an idea of how to handle non-trivial data (or even Big Data) without employing a full blown enterprise infrastructure, there’s a couple of challenges:
- Infrastructure: Today’s hardware is very powerful – multiple CPU cores, Gigabytes of RAM, Terabytes of disk space are common even for entry level consumer PCs these days. While a substantial part of these resources is consumed by OS and other background activity (such as anti- virus or other security software), there still is more computing power at the fingertips of a power-user than data center administrators could dream of having not so long ago.
Somewhat more tricky is the question of software infrastructure. As an SAP employee, I’ve got access to a fortune of software. For those who don’t enjoy this access, I’ll include some hints about what is available free of charge for the public.
- Data: Thanks to the internet, most of us (probably all who read this post) have access to unlimited data. Usually, that data is already processed and shown to us through a user friendly presentation layer. While this makes a quick glimpse very easy, it usually doesn’t give us access to the raw data. One approach to tackle this challenge is to utilize the data provided as part of the Open Data initiative. For the activities described in this series of posts, I’ve used data about delays in US domestic flights from a government data website. The download files provided there have the advantage of being plenty (they reach back as far as 1987) and having some kind of meaning for myself and the audience (we are all familiar with airline flights). As I found out, the various attributes of the flight data characteristics also look very promising to showcase the benefits of dictionary compression.
The data used in my approach is available from RITA (Research and Innovative Technology Administration) / BTS (Bureau of Transportation Statistics). The “Prezipped File” contains all columns and all rows for one month. Getting all the months files includes a certain amount of mouse clicking, but I hope the result will be worthwhile. In total, I picked up more than 150M rows – admittedly nothing to make a Data Warehouse administrator sweat. It’s slightly more impressive to know that the ZIP files larger than 5 GB contain a total of more than 63 GB of CSV data.
Since the data format has some peculiarities, I decided to load the files into a table with relaxed typing first so I could evaluate possible pitfalls on SQL level. I’m more familiar with the specifics of SQL (and especially the SQL dialects of the SAP Sybase database technologies) than with the more sophisticated features of stream utilities (like awk), so this is a straightforward approach for me. Others may have different skill sets and preferences for technically evaluating the data.
Some column values in the CSVs are quoted, so I decided to use SAP SQL Anywhere® (SQL Anywhere) as a pre- processor and staging database. Apart from my general preference for and vast experience with SQL Anywhere, the import / export capabilities were the relevant motivation for this step. I think I’ll be able to create a load script to get the raw data into SAP® IQ (IQ) immediately, but the LOAD syntax is more restrictive than that in SQL Anywhere. World record level LOAD performance has the top priority there, and data is usually originated from a source under control of the administrators.
I generously over-sized columns containing character strings (values are stored in actual size anyway) and used the universally patient CHAR data type for time values. Also, I used a simple autoincrement numeric value as a surrogate primary key, initially for no other reason than good practice, which soon turned out to be a good idea.
From there, I would analyze and cleanse the data and export them in a format convenient for IQ load. There I would create a table with more restrictive data types after having evaluated value lengths etc. in the SQL Anywhere stage.
I ran all steps on personally owned hardware. I am a self- confessing nerd, so I may have one gadget or another that is not part of the typical cheap buyer’s equipment. But I’m quite sure I couldn’t impress a real Power Gamer with my setup.
The desk environment is a PC with Core i5 3550 4 core (non-HT) CPU, 32 GB RAM, SSD for OS and 4 spindles for data, 2 as BIOS- level RAID-0 and 2 as BIOS-level RAID-1. It runs Win7-64 Ultimate and SQL Anywhere version 12. The IQ version 16 installation is on a Boot- VHD on the same machine. All databases are operated on the RAID-0 for best performance.
The mobile environment consists of a Core i5 2430M CPU (2 cores / 4 threads) notebook, 8 GB RAM and a single spindle for the SQL Anywhere installation and a Core i7 Q740 CPU (4 cores / 8 threads), 8 GB RAM and a 2.5” eSATA hard disk supplementing the built-in spindle. Both notebooks run on Win7-64 (one Professional, one Home Premium).
The SQL Anywhere installations are the Developer Edition – it’s available for no charge after registration. I’m using regular (Enterprise Edition) IQ installations (15.4 / 16.0) but at least a subset (and I expect it to be a subset large enough to impress the IQ novice) can be processed using the IQ Express Edition – also available for no charge after registration, but limited to 5 GB of persistent data.
The Lead-in to start…
When looking at the full load, I left the notebooks some time to think. The workstation gets along quite easily with the full data set.
What comes next?
Next to follow in future blog posts is a set of scripts or script templates to create databases, tables, load data, cleanse data, unload data for IQ and load them into IQ finally. And, of course, evaluations, experiences, findings, etc.
Or you may post your ideas and suggestions – I’m dead sure someone out there has ideas beyond my wildest dreams.
Learn more about how to leverage leading-edge innovation with SAP Database Services
Follow our news on Twitter @SAPServices.