
Big Data Geek – Is it getting warmer in Virginia – NOAA Hourly Climate Data – Part 2

In Big Data Geek – Finding and Loading NOAA Hourly Climate Data – Part 1, I discussed loading the data from NOAA’s Hourly Climate Data FTP archive into SAP HANA and SAP Lumira. A few days have passed since then, and the rest of the data has now been downloaded.

Here are the facts!

– 500,000 uncompressed sensor files, totalling 500GB

– 335GB of CSV files, once processed

– 2.5bn sensor readings since 1901

– 82GB of HANA data

– 31,000 sensor locations in 288 countries

Wow. Well, Tammy Powlas asked me about global warming, so I used SAP Lumira to find out whether temperatures have been increasing since 1901 in Virginia, where she lives. You will see in this video just how fast SAP HANA is at answering complex questions. Here are a few facts about the data model:

– We aggregate all information on the fly. There are no caches, indexes, or aggregates, and there is no cheating. The video you see is all live data [edit: yes, all 2.5bn sensor readings are loaded!].

– I haven’t done any data cleansing. You can see this early on because we have to do a bit of cleansing in Lumira. This is real-world, dirty data.

– HANA has a very clever time hierarchy, which means we can easily turn timestamps into aggregated dates like Year, Month and Hour (see the sketch just after this list).

– SAP Lumira has clever geographic enrichments which means we can load Country and Region hierarchies from SAP HANA really easily and quickly.
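
For a flavour of what this means in practice, here is a minimal sketch of the kind of on-the-fly aggregation behind the video. The table name comes from the load script in the comments below; the column names (READING_TS, AIR_TEMP) are placeholders of mine rather than the actual model:

-- Sketch only: average temperature per year and month, computed live over the raw readings
SELECT YEAR(reading_ts)  AS reading_year,
       MONTH(reading_ts) AS reading_month,
       AVG(air_temp)     AS avg_temp
FROM "NOAA"."WEATHER"
GROUP BY YEAR(reading_ts), MONTH(reading_ts)
ORDER BY reading_year, reading_month;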

I was going to do this as a set of screenshots, but David Hull told me that it was much more powerful as a video, because you can see just how blazingly fast SAP HANA is with Lumira. I hope you enjoy it!

Let me know in the comments what you would like to see in Part 3.

Update: between the various tables, I have pretty good latitude and longitude data for the NOAA weather stations. However, NOAA did a really bad job of enriching this data and it has Country (FIPS) and US States only. There are 31k total stations, and I’d love to enrich these with global Country/Region/City information. Does anyone know of an efficient and free way of doing this? Please comment below! Thanks!

Update: in a conversation with Oliver Rogers, we discussed using HANA XS to enrich latitude and longitude data with Country/Region/City from the Google Reverse Geocoding API. The API has a limit of 15k requests a day, so we would have to throttle the XS job whilst it works through the most popular geocodings first. This could be neat and reusable code for any HANA scenario!
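
As a rough sketch of how the caching side of that could work (all of the names here are hypothetical, not an existing implementation): a lookup table keyed on rounded latitude/longitude that the XS job fills from the API, plus a query that spends the daily quota on the busiest stations first.

-- Hypothetical cache table the throttled XS job would populate from the Google API
CREATE COLUMN TABLE "NOAA"."GEO_CACHE" (
  LAT     DECIMAL(8,4),
  LON     DECIMAL(8,4),
  COUNTRY NVARCHAR(60),
  REGION  NVARCHAR(60),
  CITY    NVARCHAR(60),
  PRIMARY KEY (LAT, LON)
);

-- Next batch of un-geocoded coordinates, most-used stations first, to stay inside the 15k/day limit
-- ("NOAA"."STATIONS" and READING_COUNT are assumed names for a station dimension)
SELECT TOP 1000 s.LAT, s.LON
FROM "NOAA"."STATIONS" s
LEFT JOIN "NOAA"."GEO_CACHE" g ON g.LAT = s.LAT AND g.LON = s.LON
WHERE g.LAT IS NULL
ORDER BY s.READING_COUNT DESC;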

      30 Comments
      Tammy Powlas

      Well done, John!

      I guess there is no truth to global warming.

      Yes, 1996 was a rough year weather-wise - see http://en.wikipedia.org/wiki/North_American_blizzard_of_1996

      Great video and very informative!

      Thank you

      John Appleby
      Blog Post Author

      Interestingly some people think that NOAA is manipulating the data so that the trend of global warming isn't shown. I'm interested in doing more analysis.

      Unfortunately the NOAA master data for weather stations is very poor - a lot of locations are missing for beacons, and the geographic data is poorly enriched. There's a decent number of latitudes and longitudes, though, so I'm wondering how to turn those into Country/Region/City so Lumira could understand them better.

      Google Maps has an API for this but I have tens of thousands of latitudes/longitudes to look up.

      Anyhow got any ideas?

      Tammy Powlas

      Henry Banks may know; he provided some latitude/longitude data to me before.  It may be next week before he reads this, as I believe he is at TechEd AMS this week.  Not sure how he got that data.

      Rama Shankar

      Great blog - Thanks John! 🙂

      John Appleby
      Blog Post Author

      Glad you enjoyed, Rama

      Former Member

      Very nice, John. Thanks. I believe you loaded the complete dataset (2.5bn rows), correct?

      If so I'm curious how you partitioned the data, as well as if you tried various partitions (both for load and for reads), and if you have any related measurements.

      Also curious about the specs for the server that the demo is run on. If you can share. 🙂

      Again, very nice demo!

      John Appleby
      Blog Post Author

      Absolutely! I don't do things by halves; each of those aggregations is against 2.5bn rows.

      On partitions, I partitioned by RANGE on TIMESTAMP, and I made each of the ranges about 100m rows. So it's 1901-1955 in one partition, then ranges of 10, 5, 3, 2 and finally 1 year (from 1999). I didn't try different strategies because I've built enough HANA models to guess what performs well, and this strategy is a good balance between load and reporting performance.
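
      For illustration, this is roughly what that definition could look like in SQL. The ranges follow what I describe above, but the column names are placeholders, so treat it as a sketch rather than the exact DDL I ran:

      -- Sketch: RANGE partitioning on the reading timestamp, aiming at roughly 100m rows per partition
      ALTER TABLE "NOAA"."WEATHER" PARTITION BY RANGE (reading_ts) (
        PARTITION '1901-01-01' <= VALUES < '1956-01-01',  -- 1901-1955 in a single partition
        PARTITION '1956-01-01' <= VALUES < '1966-01-01',  -- then roughly 10-year slices
        -- ... progressively narrower 5-, 3- and 2-year ranges as sensor density grows ...
        PARTITION '1999-01-01' <= VALUES < '2000-01-01',  -- 1-year slices from 1999 onwards
        PARTITION OTHERS
      );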

      The 2.5bn rows loaded in about 1h, and I didn't control the DELTA MERGE, which would have sped things up a bit. You can see the load script below; it is crude but effective 🙂

      It's an HP Medium appliance, which is a DL-580 with 512GB RAM, 40 cores / 80 threads, 25x 2.5" 15k RPM SAS drives and 1TB of FusionIO. I have a bigger appliance but it's currently out of service.

      If I had used the bigger appliance I would have also partitioned by a HASH, probably against the Weather Station ID.

      Glad you enjoyed it! I'm now fascinated by geographical enrichments and trying to figure out how to turn Latitude/Longitude into a reliable geo hierarchy.

      John

      # Decompress every station file and run it through the CSV pre-processor, producing one CSV per directory
      for a in `find . -type d`; do for b in `find $a/*.gz`; do zcat $b | ./csv.sh >> $a.csv; done; done

      # Generate a HANA import control file per CSV (the quotes around the schema and table are escaped so they survive the echo)
      for a in `ls *.csv`; do echo "import data into table \"NOAA\".\"WEATHER\" from '$a' RECORD DELIMITED BY '\n' FIELD DELIMITED BY ',' OPTIONALLY ENCLOSED BY '\"' error log $a.err" > $a.ctl; done

      # Wrap each control file in an IMPORT statement using 80 threads and a 20,000-row batch size
      for a in `ls $PWD/*.ctl`; do echo "import from '$a' WITH THREADS 80 BATCH 20000;" > $a.sql; done

      # Run each import through hdbsql and time it
      for a in `ls *.sql`; do time hdbsql -i 05 -n localhost -u SYSTEM -p PASSWORD -I $a; echo $a; done

      Former Member

      Great, thanks for the details John. I don't have much partitioning experience, so I'm always trying to peek over the shoulders of people who are doing things I haven't done much of. 🙂

      I look forward to hearing what you find out about converting latitude/longitude to cities/states/regions/countries etc.

      John Appleby
      Blog Post Author

      NP! RANGE partitions are great for loading TIMESTAMP-based data, because HANA will load one partition and then merge it whilst it loads the next. So long as you keep your partitions around 100m rows it performs really nicely.

      If you have multiple nodes then you need a partition strategy which works well with the way you want to distribute data. I usually use a HASH in this case so I can distribute data evenly. Make sure that any column you want to do a COUNT DISTINCT or grouping on is part of the HASH.

      In my case a HASH would be good, because the weather station is always a grouping within country. So it will distribute stations efficiently and you will get awesome performance even when you filter by country or state.
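
      As a sketch of what that could look like as multi-level partitioning (placeholder column names again): HASH on the station ID at the first level to spread data evenly across hosts, with RANGE on the timestamp underneath to keep the nice load/merge behaviour:

      -- Sketch: first-level HASH for even distribution, second-level RANGE for time-based loading
      ALTER TABLE "NOAA"."WEATHER" PARTITION BY
        HASH (station_id) PARTITIONS 4,
        RANGE (reading_ts) (
          PARTITION '1901-01-01' <= VALUES < '1956-01-01',
          PARTITION OTHERS
        );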

      Former Member

      Makes sense, thanks John. I got a bit of insight on a project where the largest table was 80 billion rows, partitioned across 24 nodes. It was a beast. (Some of my prior colleagues here on SCN can confirm 🙂 )

      COUNT(DISTINCT()) brought down the system... and I'm not sure there's any partitioning strategy that could resolve that. (Support and dev teams were assigned to it, but I don't know the particulars.)

      I forgot one question - how tricky was it to get HANA Studio installed on your Mac(Book)?

      John Appleby
      Blog Post Author

      Good news is coming soon... maybe even this month 🙂

      COUNT DISTINCT will be pushed to a bwPop, so if you partition with a strategy that aligns to the distinct column, it is super fast. Right now the distinct isn't optimized so it aggregates it all in the SQL engine.
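
      To make the alignment point concrete (placeholder names, and just a sketch): if the table is HASH-partitioned on station_id, each distinct station lives on exactly one partition, so a query like this can be answered by counting locally per partition and summing the results:

      -- Cheap when the table is HASH-partitioned on station_id; much more expensive when it isn't aligned
      SELECT COUNT(DISTINCT station_id) FROM "NOAA"."WEATHER";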

      Mac version: https://hanadeveditionsapicl.hana.ondemand.com/hanadevedition/

      John

      Tobias Hofmann

      I was going to do this as a set of screenshots, but David Hull told me that it was much more powerful as a video

      You should not have listened to David. It is very common for companies to block YouTube, so I cannot see the speed of HANA + Lumira; instead I get the impression that security is now after me because I accessed a forbidden site ...

      John Appleby
      Blog Post Author

      🙂

      What do you suggest? Vimeo is an option?

      Hopefully you have not been escorted out of the building...

      John

      Tobias Hofmann

      Vimeo is also blocked. I hoped that in 2013 SCN would have its own video hosting platform, but ... still waiting.

      Maybe an animated GIF?

      Former Member

      That's really great John,

      It's just too bad that I wasn't able to get the update right away (found it on Twitter), because P2 was posted in the HANA space and not Lumira 🙁 🙁

      I'm just shocked at how creatively amazing this is.... No suggestions (yet) for P3

      -Bijan

      John Appleby
      Blog Post Author

      Oops, yeah I put it in the wrong space. P3 will probably be in the HANA space because I'm going to explain how I created the next evolution of the model.

      James Ibbotson is to thank for P3; he came up with the concept (see his comment below).

      Former Member

      Hi John

      It's really impressive, I think. What's the temperature trend for the entire USA using this method?

      Apparently the methodology NOAA uses is to take TMAX and TMIN, average them, and then compare the result to a 30-year baseline. So, for example, if the average temperature for June 1990 was 17 degrees, in June 2012 it was 17.01 degrees, and the 30-year average for June was 16 degrees, then the anomaly for June 2012 would be 1.01 degrees C.

      Which is what you are seeing in graphs such as this:

      http://temperaturetrends.org/state.php?state=VA

      It's not actually temperature data...
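
      For anyone who wants to reproduce that calculation, here is a hedged sketch in SQL, assuming a hypothetical daily table "NOAA"."DAILY" with TMAX/TMIN columns (not part of John's model above):

      -- Monthly mean of (TMAX + TMIN) / 2, compared against a 30-year baseline for the same calendar month
      SELECT m.yr, m.mth, m.mean_temp - b.base_temp AS anomaly
      FROM ( SELECT YEAR(reading_date) AS yr, MONTH(reading_date) AS mth,
                    AVG((tmax + tmin) / 2) AS mean_temp
             FROM "NOAA"."DAILY"
             GROUP BY YEAR(reading_date), MONTH(reading_date) ) m
      JOIN ( SELECT MONTH(reading_date) AS mth, AVG((tmax + tmin) / 2) AS base_temp
             FROM "NOAA"."DAILY"
             WHERE YEAR(reading_date) BETWEEN 1981 AND 2010  -- an example 30-year base period
             GROUP BY MONTH(reading_date) ) b
        ON m.mth = b.mth
      ORDER BY m.yr, m.mth;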

      John Appleby
      Blog Post Author

      Well, from what I can see, NOAA do a bunch of things to "renormalize" the data. I'm not a climatologist so I can't speak to the correctness of this, but they fiercely defend their tactics. It's documented here.

      By the way, thanks for your great idea to compare NOAA Hourly and GHCN data. I've now loaded 2.5bn rows of GHCN, created a shared dimension model which matches NOAA and GHCN stations, and built an efficient HANA model which allows delta comparison. I'm still troubleshooting some data nuances because the model is very sensitive to errors.

      Will blog on this once I have some time.

      John Appleby
      Blog Post Author

      So the answer to your question across the US is really hard, because the data is hard to compare. There are a lot more sensors than there were in the 1930s and 1970s, and their distribution is different.

      I guess I could create a model which aggregates up by state first, but even then states are different in nature, with different urban/suburban models.

      More interesting is comparing the difference between NCDC and GHCN data. This gets really interesting... more on this later this week.

      Vivek Singh Bhoj

      Great Blog John

      Thanks for sharing 🙂

      Regards,

      Vivek

      Former Member

      Nice Blog.

      Thanks for sharing.

      Regards

      Purnaram.K

      Susan Keohan

      Wow, John, this is great!  Great production values on the video too; I usually don't have time/energy for watching videos about anything but cats, but this one was really good.

      I am quite surprised that there is no real temperature change, but perhaps it's not a big enough sample.

      Thanks!
      Sue

      Former Member

      Thanks John.  Very cool.

      Have you seen any time animation feature yet in Lumira?

      Here's a link to a Tableau demo (with a small data set).

      Check out the 19 minute mark.

      http://www.youtube.com/watch?v=6BWPoccQatI&feature=youtube_gdata_player

      Have a word with the developers; I'd love this in Lumira too.  😉

      Cheers

      Aron

      John Appleby
      Blog Post Author

      Unfortunately Lumira doesn't have much for time-series analytics. It would be very cool, especially with choropleth charts.

      This is on my Lumira wishlist along with the ability to create Infographic-style storyboards.

      John

      Former Member

      Hey John,

      Did you already submit your ideas into the Idea Place?!?! I'd love to promote them!

      Bijan

      John Appleby
      Blog Post Author

      I'm part of the BI Design Council so I submit my ideas through this route.

      John

      Former Member

      Great post, John. I really like how you use the data to answer a very specific question.

      I believe the trend is easier to visualize if you right-click on the measure and choose "running average"; it makes the trend much easier to see through so many spikes.

      Julien

      John Appleby
      Blog Post Author

      Yes you're right - I realized this after recording the video! Thanks for the tip.

      John

      Jorge Weiss

      Hi John,

      again, very good work.

      Thanks a lot.

      /Weiss

      Kevin Small

      Hi John,

      Interesting work, and thanks to you and Oliver for the idea about a generic XS service for reverse geocoding.  This overlapped with something else I was doing, so I have separated out the service part and released it here:

      Reverse Geocode your HANA Data with this XS JavaScript Utility

      It may be of interest to those reading your article above.

      Kevin.