Big Data Geek – Is it getting warmer in Virginia – NOAA Hourly Climate Data – Part 2
I discussed loading the data from NOAA’s Hourly Climate Data FTP archive into SAP HANA and SAP Lumira in Big Data Geek – Finding and Loading NOAA Hourly Climate Data – Part 1. A few days have passed since then, and the rest of the data has finished downloading.
Here are the facts!
– 500,000 uncompressed sensor files, totalling 500GB
– 335GB of CSV files, once processed
– 2.5bn sensor readings since 1901
– 82GB of HANA data
– 31,000 sensor locations in 288 countries
Wow. Well, Tammy Powlas asked me about Global Warming, so I used SAP Lumira to find out whether temperatures have been increasing in Virginia, where she lives, since 1901. You will see in this video just how quickly SAP HANA answers complex questions. Here are a few facts about the data model:
– We aggregate all information on the fly. There are no caches, indexes or pre-built aggregates, and there is no cheating. The video you see is all live data [edit: yes, all 2.5bn sensor readings are loaded!].
– I haven’t done any data cleansing. You can see this early on because we have to do a bit of cleansing in Lumira. This is real-world, dirty data.
– HANA has a very clever time hierarchy, which means we can easily turn timestamps into aggregated dates like Year, Month, Hour.
– SAP Lumira has clever geographic enrichments, which means we can load Country and Region hierarchies from SAP HANA really easily and quickly.
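To make the time hierarchy idea concrete, here is a rough sketch in plain Python of what "turning timestamps into Year, Month, Hour" and aggregating on the fly amounts to. It is not how HANA does it internally. The function name `average_by`, the `(timestamp, temperature)` tuple layout and the sample readings are all made up for illustration, and skipping `None` values is just one simple way of handling dirty data.

```python
from collections import defaultdict
from datetime import datetime


def average_by(readings, level):
    """Average temperature per time bucket; level is 'year', 'month' or 'hour'."""
    # Each level maps a timestamp to the bucket key it rolls up into.
    keys = {
        "year": lambda ts: (ts.year,),
        "month": lambda ts: (ts.year, ts.month),
        "hour": lambda ts: (ts.year, ts.month, ts.day, ts.hour),
    }
    key_of = keys[level]
    sums = defaultdict(lambda: [0.0, 0])  # bucket -> [running total, count]
    for ts, temp in readings:
        if temp is None:  # dirty data: skip missing readings
            continue
        bucket = sums[key_of(ts)]
        bucket[0] += temp
        bucket[1] += 1
    return {k: total / n for k, (total, n) in sums.items()}


# Hypothetical sample readings (timestamp, temperature in Celsius).
readings = [
    (datetime(1901, 1, 1, 0), 2.0),
    (datetime(1901, 1, 1, 1), 4.0),
    (datetime(1901, 7, 1, 12), 30.0),
    (datetime(2013, 1, 1, 0), None),
    (datetime(2013, 1, 1, 1), 6.0),
]

print(average_by(readings, "year"))
print(average_by(readings, "month"))
```

The same readings can be rolled up to any level without storing a pre-computed result, which is the point of aggregating on the fly.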
I was going to do this as a set of screenshots, but David Hull told me that it would be much more powerful as a video, because you can see just how blazingly fast SAP HANA is with Lumira. I hope you enjoy it!
Let me know in the comments what you would like to see in Part 3.
Update: between the various tables, I have pretty good latitude and longitude data for the NOAA weather stations. However, NOAA did a really bad job of enriching this data, and it only has Country (FIPS) and US State. There are 31k stations in total, and I’d love to enrich these with global Country/Region/City information. Does anyone know of an efficient and free way of doing this? Please comment below! Thanks!
Update: in a conversation with Oliver Rogers, we discussed using HANA XS to enrich latitude and longitude data with Country/Region/City from the Google Reverse Geocoding API. This has a limit of 15k requests a day, so we would have to throttle XS whilst it updates the most popular geocodings directly. This could be neat and reusable code for any HANA scenario!
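Here is a minimal sketch of how that throttling could work, written in plain Python rather than HANA XS so that it runs anywhere. Everything in it is an assumption for illustration: `enrich_batch`, the `cache` dictionary and the `lookup` callable are hypothetical names, and `lookup` stands in for whatever reverse geocoding call you end up using (the sketch does not call the Google API). Coordinates are rounded so nearby stations share one lookup, and the most common coordinates go first.

```python
from collections import Counter

DAILY_LIMIT = 15_000  # requests per day, the limit quoted in the post


def enrich_batch(stations, cache, lookup, budget=DAILY_LIMIT, precision=2):
    """Geocode up to `budget` uncached coordinates, most common first.

    stations: iterable of (station_id, lat, lon) tuples
    cache:    dict of (lat, lon) -> place, reused between daily runs
    lookup:   hypothetical callable (lat, lon) -> place
    Returns the number of lookups used in this run.
    """
    def key(lat, lon):
        return (round(lat, precision), round(lon, precision))

    # Count how many stations share each rounded coordinate not yet cached.
    pending = Counter(
        key(lat, lon) for _, lat, lon in stations if key(lat, lon) not in cache
    )
    used = 0
    for coord, _count in pending.most_common(budget):
        cache[coord] = lookup(*coord)
        used += 1
    return used


# Usage with a fake lookup, so nothing touches the network.
stations = [("A", 37.54, -77.43), ("B", 37.541, -77.432), ("C", 51.5, -0.12)]
cache = {}
fake_lookup = lambda lat, lon: f"place@{lat},{lon}"
first_run = enrich_batch(stations, cache, fake_lookup, budget=1)
second_run = enrich_batch(stations, cache, fake_lookup, budget=1)
third_run = enrich_batch(stations, cache, fake_lookup, budget=1)
```

Run once a day, each batch picks up where the last one stopped because the cache persists, and it stops using requests as soon as every coordinate is covered.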