
Big Data Geek – Is it getting warmer in Virginia – NOAA Hourly Climate Data – Part 2

So I discussed loading the data from NOAA’s Hourly Climate Data FTP archive into SAP HANA and SAP Lumira in Big Data Geek – Finding and Loading NOAA Hourly Climate Data – Part 1. Since then, a few days have passed and the rest of the data has finished downloading.

Here are the facts!

– 500,000 uncompressed sensor files and 500GB

– 335GB of CSV files, once processed

– 2.5bn sensor readings since 1901

– 82GB of Hana Data

– 31,000 sensor locations in 288 countries

Wow. Well, Tammy Powlas asked me about global warming, so I used SAP Lumira to find out whether temperatures have been increasing since 1901 in Virginia, where she lives. You will see in this video just how fast SAP HANA can answer complex questions. Here are a few facts about the data model:

– We aggregate all information on the fly. There are no caches, indexes or aggregates, and there is no cheating. The video you see is all live data [edit: yes, all 2.5bn sensor readings are loaded!].

– I haven’t done any data cleansing. You can see this early on because we have to do a bit of cleansing in Lumira. This is real-world, dirty data.

– HANA has a very clever time hierarchy which means we can easily turn timestamps into aggregated dates like Year, Month, Hour.

– SAP Lumira has clever geographic enrichments which means we can load Country and Region hierarchies from SAP HANA really easily and quickly.
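For instance, that time hierarchy means a year-level roll-up of the raw timestamps is a single on-the-fly aggregation. Here is a sketch in plain SQL, written to a file the way the load scripts later in this post do; the TEMPERATURE column name is an assumption, not confirmed by the post:

```shell
# Write the query to a file, runnable via hdbsql like the other scripts here.
# TEMPERATURE is an assumed column name on the WEATHER table.
cat > year_avg.sql <<'SQL'
-- Aggregate the hourly readings to one row per year, on the fly:
SELECT YEAR("TIMESTAMP") AS "YEAR", AVG("TEMPERATURE") AS "AVG_TEMP"
FROM "NOAA"."WEATHER"
GROUP BY YEAR("TIMESTAMP")
ORDER BY "YEAR";
SQL
```

The same YEAR()/MONTH()/HOUR() functions cover the other levels of the hierarchy.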

I was going to do this as a set of screenshots, but David Hull told me that it was much more powerful as a video, because you can see just how blazingly fast SAP HANA is with Lumira. I hope you enjoy it!

Let me know in the comments what you would like to see in Part 3.

Update: between the various tables, I have pretty good latitude and longitude data for the NOAA weather stations. However, NOAA did a really bad job of enriching this data and it has Country (FIPS) and US States only. There are 31k total stations, and I’d love to enrich these with global Country/Region/City information. Does anyone know of an efficient and free way of doing this? Please comment below! Thanks!

Update: in a conversation with Oliver Rogers, we discussed using HANA XS to enrich latitude and longitude data with Country/Region/City from the Google Reverse Geocoding API. This has a limit of 15k requests a day so we would have to throttle XS whilst it updates the most popular geocodings directly. This could be neat and reusable code for any HANA scenario!
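A minimal sketch of that throttling, assuming the standard Google reverse-geocoding endpoint and the 15k/day quota mentioned above; the station IDs and coordinates below are made up, and this dry run only prints the request URLs, with the real curl call left as a comment:

```shell
# Assumed quota from the post: 15,000 requests a day.
LIMIT_PER_DAY=15000
DELAY=$(( 86400 / LIMIT_PER_DAY ))   # whole seconds between requests (5s)

: > urls.txt   # start with an empty URL list
# Dry run over two hypothetical station coordinates: print the request
# URLs instead of calling the API.
while IFS=, read -r station lat lng; do
  echo "https://maps.googleapis.com/maps/api/geocode/json?latlng=$lat,$lng" >> urls.txt
  # The real, throttled call would be:
  # curl -s "https://maps.googleapis.com/maps/api/geocode/json?latlng=$lat,$lng" > "$station.json"
  # sleep "$DELAY"
done <<'EOF'
720001,38.935,-77.447
720002,37.505,-77.320
EOF
```

At one request every 5 seconds the full 31k stations would take a little over two days, which is why caching the most popular geocodings first makes sense.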

    • Interestingly some people think that NOAA is manipulating the data so that the trend of global warming isn't shown. I'm interested in doing more analysis.

      Unfortunately the NOAA master data for weather stations is very poor - a lot of locations are missing for beacons, and the geographic data is poorly enriched. There's a decent number of latitude and longitudes so I'm wondering how to turn those into Country/Region/City so Lumira could understand them better.

      Google Maps has an API for this but I have tens of thousands of latitudes/longitudes to look up.

      Anyhow got any ideas?

  • Very nice, John. Thanks. I believe you loaded the complete dataset (2.5bn rows), correct?

    If so I'm curious how you partitioned the data, as well as if you tried various partitions (both for load and for reads), and if you have any related measurements.

    Also curious about the specs for the server that the demo is run on. If you can share. 🙂

    Again, very nice demo!

    • Absolutely! I don't do things by halves, each of those aggregations is against 2.5bn rows.

      On partitions, I partitioned by RANGE on TIMESTAMP, and I made each of the ranges about 100m rows. So it's 1901-1955, then ranges of 10, 5, 3 and 2 years, and 1-year ranges from 1999 onwards. I didn't try different strategies because I've built enough HANA models to guess what performs well, and this strategy is a good balance between load and reporting performance.
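A sketch of what that range partitioning might look like as HANA DDL, generated as a file in the same spirit as the load script below; the boundary years are illustrative, not the author's actual ranges, and the intermediate ranges are omitted:

```shell
# Generate the partitioning DDL as a file. Each range should hold
# roughly 100m rows. Boundary years are illustrative only.
cat > partition.sql <<'SQL'
ALTER TABLE "NOAA"."WEATHER" PARTITION BY RANGE ("TIMESTAMP") (
  PARTITION '1901-01-01' <= VALUES < '1956-01-01', -- sparse early decades
  PARTITION '1956-01-01' <= VALUES < '1966-01-01', -- then 10-year ranges
  -- (5-, 3- and 2-year ranges omitted for brevity)
  PARTITION '1999-01-01' <= VALUES < '2000-01-01', -- 1-year ranges from 1999
  PARTITION OTHERS                                 -- catch-all for anything else
);
SQL
# Then run it like any other script here:
# hdbsql -i 05 -n localhost -u SYSTEM -p PASSWORD -I partition.sql
```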

      The 2.5bn rows loaded in about 1h and I didn't control the DELTA MERGE, which would have sped things up a bit. You can see the load script below, it is crude but effective 🙂

      It's a HP Medium appliance, which is a DL-580 with 512GB, 40 cores, 80 threads, 25x2.5" 15k RPM SAS and 1TB FusionIO. I have a bigger appliance but it's currently out of service.

      If I had used the bigger appliance I would have also partitioned by a HASH, probably against the Weather Station ID.

      Glad you enjoyed it! I'm now fascinated by geographical enrichments and trying to figure out how to turn Latitude/Longitude into a reliable geo hierarchy.


      # Decompress each station file and pipe it through the parsing script,
      # appending to one CSV per directory:
      for a in `find . -type d`; do for b in `find $a/*.gz`; do zcat $b | ./ >> $a.csv; done; done

      # Generate a HANA import control file for each CSV (inner quotes escaped
      # so they survive into the control file):
      for a in `ls *.csv`; do echo "import data into table \"NOAA\".\"WEATHER\" from '$a' RECORD DELIMITED BY '\n' FIELD DELIMITED BY ',' OPTIONALLY ENCLOSED BY '\"' error log $a.err" > $a.ctl; done

      # Wrap each control file in an IMPORT statement:
      for a in `ls $PWD/*.ctl`; do echo "import from '$a' WITH THREADS 80 BATCH 20000;" > $a.sql; done

      # Run each import through hdbsql, timing every batch:
      for a in `ls *.sql`; do time hdbsql -i 05 -n localhost -u SYSTEM -p PASSWORD -I $a; echo $a; done

      • Great, thanks for the details John. I don't have much partitioning experience so I'm always trying to peek over people's shoulders who are doing things that I haven't done much of. 🙂

        I look forward to hearing what you find out about converting latitude/longitude to cities/states/regions/countries etc.

        • NP! RANGE partitions are great for loading TIMESTAMP based data because it will load one partition, and then merge that one whilst it loads the next. So long as you keep your partitions around 100m rows it performs really nicely.

          If you have multiple nodes then you need a partition strategy which works well with the way you want to distribute data. I usually use a HASH in this case so I can distribute data evenly. Make sure that any column you want to do a COUNT DISTINCT or a grouping on is part of the HASH.

          In my case HASH would be good, because the weather station is always a grouping within country. So, it will efficiently distribute stations and you will get awesome performance even when you filter by country or state.
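Sketched the same way as the load scripts above, that two-level layout (HASH at the first level to spread rows across nodes, RANGE on TIMESTAMP below it) might look like the following; the STATION_ID column name and the partition count are assumptions:

```shell
# Hypothetical multi-node partitioning: HASH distributes stations evenly
# across nodes; RANGE keeps the load/merge benefits described above.
cat > partition_multinode.sql <<'SQL'
ALTER TABLE "NOAA"."WEATHER"
  PARTITION BY HASH ("STATION_ID") PARTITIONS 4, -- e.g. one HASH partition per node
  RANGE ("TIMESTAMP") (
    PARTITION '1901-01-01' <= VALUES < '1956-01-01',
    PARTITION OTHERS
  );
SQL
```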

          • Makes sense, thanks John. I got a bit of insight on a project where the largest table was 80 billion, partitioned across 24 nodes. It was a beast. (Some of my prior colleagues here on SCN can confirm 🙂 )

            COUNT(DISTINCT()) brought down the system.. and I'm not sure there's any partitioning strategy that could resolve that. (Support and dev teams were assigned to it but I don't know the particulars.)

            I forgot one question - how tricky was it to get HANA Studio installed on your Mac(Book)?

  • I was going to do this as a set of screenshots, but David Hull told me that it was much more powerful as a video

    You should not have listened to David. It is very common for companies to block YouTube, so I cannot see the speed of HANA + Lumira; instead I get the impression that security is now after me because I accessed a forbidden site ...

  • That's really great John,

    It's just too bad that I wasn't able to get the update right away (found it on twitter) because P2 was posted on HANA, and not Lumira 🙁 🙁

    I'm just shocked at how creatively amazing this is.... No suggestions (yet) for P3


    • Oops, yeah I put it in the wrong space. P3 will probably be in the HANA space because I'm going to explain how I created the next evolution of the model.

      James Ibbotson is to thank for P3; he came up with the concept (see his comment below).

  • Hi John

    It's really impressive, I think. What's the temperature trend for the entire USA using this method?

    Apparently the methodology NOAA uses is to take TMAX and TMIN and average them, then compare that to a 30-year baseline. So, for example, if the average temperature for June 1990 was 17 degrees and in June 2012 it was 17.01 degrees, and the 30-year average for June was 16 degrees, the anomaly would be 1.01 degrees C.

    Which is what you are seeing in graphs such as this.

    It's not actually temperature data...

    • Well from what I can see, NOAA do a bunch of things to the data to "renormalize it". I'm not a climatologist so I can't speak to the correctness of this, but they fiercely defend their tactics. It's documented here.

      By the way, thanks for your great idea to compare NOAA Hourly and GHCN data. I've now loaded 2.5bn rows of GHCN, created a shared dimension model which matches NOAA and GHCN stations, and built an efficient HANA model which allows delta comparison. I'm still troubleshooting some data nuances because the model is very sensitive to errors.

      Will blog on this once I have some time.

    • So the answer to your question for the entire US is really hard, because the data is hard to compare. There are a lot more sensors since the 1930s and again since the 1970s, and their distribution is different.

      I guess I could create a model which aggregates up by state first but even then, states are different in nature with different urban/suburban models.

      More interesting is comparing the difference between NCDC and GHCN data. This gets really interesting... more on this later this week.

  • Wow, John, this is great!  Great production values on the video too; I usually don't have time/energy for watching videos about anything but cats, but this one was really good.

    I am quite surprised that there is no real temperature change, but perhaps it's not a big enough sample.


  • Great post, John. I really like how you use the data to answer a very specific question.

    I believe that visualizing the trend is easier if you right-click on the measure and choose "running average". It makes it easier to see the trend through so many spikes.