I want to start a short series of posts sharing my experience with an Open Data publication, used to build a non-trivial data set.
When trying to get an idea of how to handle non-trivial data (or even Big Data) without employing a full-blown enterprise infrastructure, there are a couple of challenges to work through.
The data used in my approach is available from RITA (Research and Innovative Technology Administration) / BTS (Bureau of Transportation Statistics). The "Prezipped File" contains all columns and all rows for one month. Collecting all the monthly files involves a certain amount of mouse clicking, but I hope the result will be worthwhile. In total, I picked up more than 150M rows - admittedly nothing to make a Data Warehouse administrator sweat. It's slightly more impressive to know that the ZIP files, a bit more than 5 GB altogether, expand to more than 63 GB of CSV data.
Since the data format has some peculiarities, I decided to load the files into a table with relaxed typing first so I could evaluate possible pitfalls at the SQL level. I'm more familiar with the specifics of SQL (and especially the SQL dialects of the SAP Sybase database technologies) than with the more sophisticated features of stream utilities (like awk), so this is a straightforward approach for me. Others may have different skill sets and preferences for technically evaluating the data.
Some column values in the CSVs are quoted, so I decided to use SAP SQL Anywhere® (SQL Anywhere) as a pre-processor and staging database. Apart from my general preference for and vast experience with SQL Anywhere, the import / export capabilities were the relevant motivation for this step. I think I'd be able to create a load script to get the raw data into SAP® IQ (IQ) immediately, but the LOAD syntax there is more restrictive than in SQL Anywhere: world-record-level LOAD performance has top priority, and the data usually originates from a source under the administrators' control.
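As a sketch of such a staging load (the file path and table name are placeholders, not from the actual data set), SQL Anywhere's LOAD TABLE deals with the quoted values directly:

```sql
-- SQL Anywhere staging load (sketch; file path and table name are placeholders).
-- QUOTES ON strips the double quotes around quoted column values,
-- SKIP 1 jumps over the CSV header line.
LOAD TABLE stage_ontime
FROM 'c:\\data\\ontime_2013_01.csv'
DELIMITED BY ','
QUOTES ON
ESCAPES OFF
SKIP 1;
```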
I generously over-sized the columns containing character strings (values are stored in their actual size anyway) and used the universally patient CHAR data type for time values. Also, I used a simple autoincrement numeric value as a surrogate primary key, initially for no other reason than good practice, which soon turned out to be a good idea.
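A minimal sketch of such a relaxed-typing staging table - the column names and sizes are illustrative only, not the actual BTS column list - using SQL Anywhere's DEFAULT AUTOINCREMENT for the surrogate key:

```sql
-- Staging table with relaxed typing (sketch; columns are illustrative only).
CREATE TABLE stage_ontime (
    row_id      BIGINT DEFAULT AUTOINCREMENT PRIMARY KEY,  -- surrogate key
    FlightDate  CHAR(20),       -- dates kept as text until validated
    Carrier     VARCHAR(100),   -- generously over-sized; stored in actual size
    Origin      VARCHAR(100),
    Dest        VARCHAR(100),
    DepTime     CHAR(10),       -- "patient" CHAR for time values
    ArrTime     CHAR(10)
);
```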
From there, I would analyze and cleanse the data and export it in a format convenient for the IQ load. In IQ, I would create a table with more restrictive data types after having evaluated value lengths etc. in the SQL Anywhere stage.
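The hand-over to IQ could look roughly like this - again a sketch with placeholder names, showing the common text-format options on both sides:

```sql
-- SQL Anywhere side: write the cleansed data in a plain delimited format.
UNLOAD
    SELECT FlightDate, Carrier, Origin, Dest, DepTime, ArrTime
    FROM stage_ontime
TO 'c:\\export\\ontime_clean.txt'
DELIMITED BY ','
QUOTES OFF
ESCAPES OFF;

-- IQ side: load into the tightly typed target table; the per-column
-- delimiters end with the row terminator on the last column.
LOAD TABLE ontime (
    FlightDate ',', Carrier ',', Origin ',', Dest ',',
    DepTime ',', ArrTime '\x0a'
)
FROM 'c:\\export\\ontime_clean.txt'
QUOTES OFF
ESCAPES OFF
NOTIFY 1000000;
```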
I ran all steps on personally owned hardware. I am a self-confessed nerd, so I may have one gadget or another that is not part of the typical bargain hunter's equipment. But I'm quite sure I couldn't impress a real Power Gamer with my setup.
The desk environment is a PC with a Core i5 3550 4-core (non-HT) CPU, 32 GB RAM, an SSD for the OS and 4 spindles for data, 2 as BIOS-level RAID-0 and 2 as BIOS-level RAID-1. It runs Win7-64 Ultimate and SQL Anywhere version 12. The IQ version 16 installation is on a boot VHD on the same machine. All databases run on the RAID-0 for best performance.
The mobile environment consists of two notebooks: one with a Core i5 2430M CPU (2 cores / 4 threads), 8 GB RAM and a single spindle for the SQL Anywhere installation, the other with a Core i7 Q740 CPU (4 cores / 8 threads), 8 GB RAM and a 2.5'' eSATA hard disk supplementing the built-in spindle. Both notebooks run Win7-64 (one Professional, one Home Premium).
The SQL Anywhere installations are the Developer Edition - it's available for no charge after registration. I'm using regular (Enterprise Edition) IQ installations (15.4 / 16.0) but at least a subset (and I expect it to be a subset large enough to impress the IQ novice) can be processed using the IQ Express Edition - also available for no charge after registration, but limited to 5 GB of persistent data.
For the full load, I had to give the notebooks some time to think; the workstation gets along with the full data set quite easily.
Next to follow in future blog posts is a set of scripts or script templates to create the databases and tables, load the data, cleanse it, unload it for IQ, and finally load it into IQ. And, of course, evaluations, experiences, findings, etc.
Or you may post your ideas and suggestions - I'm dead sure someone out there has ideas beyond my wildest dreams.
Learn more about how to leverage leading-edge innovation with SAP Database Services
Stay in the conversation by following SAP Services on #SCN.
Follow our news on Twitter @SAPServices.