Big Data Geek – Finding and Loading NOAA Hourly Climate Data – Part 1
So I loved the idea of the Data Geek challenge, but I'm more of a Big Data Geek. And it turns out that the National Oceanic and Atmospheric Administration (NOAA) has publicly available hourly climate data going back to 1901.
It's available at ftp://ftp.ncdc.noaa.gov/pub/data/noaa and you can download all of it freely. To be kind to the IT folks on my office network, I've been downloading it slowly over the last few weeks; I'm up to 2003, with 39GB of compressed data so far. It is GZIP compressed and expands at roughly 10:1. In addition, the data volumes grow massively from 1973 onwards and keep growing to the present day. I expect 600GB-1TB of raw data before we're done.
So we’re going to need SAP HANA and Lumira to find some interesting information, and I thought I’d take you on the journey of how to do this. This is Part 1 of however many it takes me to find some meaning in the data.
Update: Part 2 is now available here! In part 2 we look at Tammy Powlas’ question “is it getting warmer in Virginia?”
Downloading the data
I’ve gone about this using the following command:
wget --mirror ftp://ftp.ncdc.noaa.gov/pub/data/noaa
There are faster ways to get the data, including aria2, pcurl or httrack, but I wanted to be kind to my IT team, and wget doesn't use too much bandwidth or mess up our office network. It will take me a few weeks to get all the data, and then I can keep it up to date at any time!
Loading Data
The data comes in a pretty incomprehensible format and the lines look a bit like this:
0081999999999992002010100004+17900-075900FM-13+9999ELML V02099999999999999999N9999999N1+02791+02791101361REMSYN072BBXX ELML7 0100/ 99179 70759 4//// ///// 10279 20279 40136 8//// 222//;
You can download a PDF document of how this is all formatted here:
ftp://ftp.ncdc.noaa.gov/pub/data/noaa/ish-format-document.pdf
It turns out that it is a complex fixed-format file, so I'm just going to use the mandatory fields, which are the first 34 fields. This gives us data like location, timestamp, temperature, wind and visibility. Pretty comprehensive. I'm a big fan of UNIX scripts for reformatting this stuff, so I wrote an awk script (gawk, strictly speaking, since FIELDWIDTHS is a GNU awk extension) to turn the fixed-format files into CSV:
# Requires GNU awk: FIELDWIDTHS is a gawk extension.
# The widths below are the 34 mandatory fields of the ISH fixed-format record.
awk 'BEGIN { FIELDWIDTHS = "4 6 5 4 2 2 2 2 1 6 7 5 5 5 4 3 1 1 4 1 5 1 1 1 6 1 1 1 5 1 5 1 5 1" } {
VAR_CHARS=$1
STATION_USAF=$2
STATION_NCDC=$3
YEAR=$4
MONTH=$5
DAY=$6
HOUR=$7
MIN=$8
SOURCE_FLAG=$9
LATITUDE=$10
LONGITUDE=$11
OBSERVATION_TYPE=$12
ELEVATION=$13
CALL_LETTER=$14
QUALITY_CONTROL=$15
WIND_DIR=$16
WIND_DIR_QUAL=$17
WIND_TYPE=$18
WIND_SPEED=$19
WIND_SPEED_QUAL=$20
SKY_CEILING=$21
SKY_CEILING_QUAL=$22
SKY_CEILING_METH=$23
SKY_CEILING_CAVOK=$24
VISIBILITY=$25
VISIBILITY_QUAL=$26
VISIBILITY_VAR=$27
VISIBILITY_VAR_QUAL=$28
AIR_TEMP=$29
AIR_TEMP_QUAL=$30
DEW_TEMP=$31
DEW_TEMP_QUAL=$32
PRESSURE=$33
PRESSURE_QUAL=$34
# Emit one CSV row, folding the date and time parts into a single timestamp
printf ("%s,%s,%s,%s-%s-%s %s:%s:00,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s\n", VAR_CHARS, STATION_USAF, STATION_NCDC, YEAR, MONTH, DAY, HOUR, MIN, SOURCE_FLAG, LATITUDE, LONGITUDE, OBSERVATION_TYPE, ELEVATION, CALL_LETTER, QUALITY_CONTROL, WIND_DIR, WIND_DIR_QUAL, WIND_TYPE, WIND_SPEED, WIND_SPEED_QUAL, SKY_CEILING, SKY_CEILING_QUAL, SKY_CEILING_METH, SKY_CEILING_CAVOK, VISIBILITY, VISIBILITY_QUAL, VISIBILITY_VAR, VISIBILITY_VAR_QUAL, AIR_TEMP, AIR_TEMP_QUAL, DEW_TEMP, DEW_TEMP_QUAL, PRESSURE, PRESSURE_QUAL)
}'
Now I can put all 12653 files that constitute 2002’s data into one big CSV file!
# test.sh wraps the awk script above
for a in `find 2002/*2002`; do ./test.sh < $a >> 2002.csv; done
Now I have a pretty human-readable format, and we can make some sense of this!
0173,010010,99999,2002-01-01 00:00:00,4,+70930,-008660,FM-12,+0009,ENJA ,V020,320,1,N,0100,1,22000,1,9,N,070000,1,N,1,-0039,1,-0087,1,10032
Whilst that script ran (it takes a few minutes per year), I went ahead and created the HANA DDL for this table, as well as for the reference data that exists within the PDF, for things like Quality, Source Flags, Observation Types, Wind Types, Sky Ceiling Methods and Visibility. This will make things like FM-12 human-readable as "SYNOP Report of surface observation from a fixed land station". I've attached the DDL as a SQL script so you can run it on your own system.
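If you can't grab the attachment, here's a minimal sketch of what the fact table looks like. The column names match the awk script above, but the table name and the types are my illustrative assumptions, not the exact attached script:

-- Hypothetical sketch: table name and column types are assumptions
CREATE COLUMN TABLE NOAA_ISH (
    VAR_CHARS VARCHAR(4),
    STATION_USAF VARCHAR(6),
    STATION_NCDC VARCHAR(5),
    OBSERVATION_TS TIMESTAMP,
    SOURCE_FLAG VARCHAR(1),
    LATITUDE INTEGER,           -- degrees, scaled by 1000 in the raw file
    LONGITUDE INTEGER,
    OBSERVATION_TYPE VARCHAR(5),
    ELEVATION INTEGER,          -- metres
    CALL_LETTER VARCHAR(5),
    QUALITY_CONTROL VARCHAR(4),
    WIND_DIR INTEGER,
    WIND_DIR_QUAL VARCHAR(1),
    WIND_TYPE VARCHAR(1),
    WIND_SPEED INTEGER,         -- metres per second, scaled by 10
    WIND_SPEED_QUAL VARCHAR(1),
    SKY_CEILING INTEGER,
    SKY_CEILING_QUAL VARCHAR(1),
    SKY_CEILING_METH VARCHAR(1),
    SKY_CEILING_CAVOK VARCHAR(1),
    VISIBILITY INTEGER,
    VISIBILITY_QUAL VARCHAR(1),
    VISIBILITY_VAR VARCHAR(1),
    VISIBILITY_VAR_QUAL VARCHAR(1),
    AIR_TEMP INTEGER,           -- degrees Celsius, scaled by 10
    AIR_TEMP_QUAL VARCHAR(1),
    DEW_TEMP INTEGER,
    DEW_TEMP_QUAL VARCHAR(1),
    PRESSURE INTEGER,           -- hectopascals, scaled by 10
    PRESSURE_QUAL VARCHAR(1)
);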
A few minutes later, we have 63,653,095 readings for 2002 to pull into SAP HANA. We dispatch that in 2 minutes flat; got to love HANA's bulk loader performance. Now all that remains is to build an Analytic View to combine our fact table and reference data into a readable model for Lumira. The ISH reference data also contains location information for the sensors, though it is very rough: by Country or US State.
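For the curious, the bulk load itself is a single statement. A sketch, assuming the CSV sits in a path the HANA server can read (the path and the THREADS/BATCH options here are illustrative, not my exact settings):

-- Bulk load the year's CSV into the fact table
IMPORT FROM CSV FILE '/data/noaa/2002.csv' INTO NOAA_ISH
WITH RECORD DELIMITED BY '\n'
     FIELD DELIMITED BY ','
     THREADS 8
     BATCH 200000;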
Visualization
And now we can have some fun with Lumira! First, we can connect up to our Analytic View in HANA and see our attributes and measures. I’ve created a bunch of calculated measures so we can do averages within Lumira and push the calculations back into HANA. This makes it super fast.
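One thing worth knowing: the mandatory fields are scaled integers with sentinel values for missing data (air temperature, for instance, is in tenths of a degree Celsius, with +9999 meaning missing). So the calculated measures mostly divide by ten and exclude the sentinels. In plain SQL, an average temperature measure comes out roughly like this (table and column names as in the sketch above):

SELECT STATION_USAF,
       AVG(AIR_TEMP / 10.0) AS AVG_TEMP_C   -- tenths of a degree to degrees
  FROM NOAA_ISH
 WHERE AIR_TEMP <> 9999                     -- exclude missing readings
 GROUP BY STATION_USAF;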
But first, let's start with Geographic Enrichment. We can add Country and State:
Very quickly, we can do a Choropleth Chart showing Average Wind Speed by Country in 2002:
Let’s switch to Air Temperature:
And drill into the United States:
And now we’re going to filter by the month of August:
Next Steps
I need to go ahead and load the rest of the data from 1901, which will take a while. Then, we can go ahead and do some time-based analysis.
What are the questions that you’d like to be able to answer?
Great - is the average temperature really getting warmer over the years? Could you determine that from all this data?
Yes, definitely! I'm in the process of processing 1902-2004 right now whilst I download 2004-2013 data. We can look at the detail of climate change, though I'm sure it will be both subtle and contentious. Here's the one-liner I'm using to turn every year's gzipped files into one CSV per directory:
for a in `find . -type d`; do for b in `find $a/*.gz`; do zcat $b | ./csv.sh >> $a.csv; done; done
Hi John, thanks so much for taking the time to write this blog. So my question (since you are pulling from NOAA) is 'Why were there no major hurricanes in the Atlantic this year?' Perhaps this is a tough one to answer, but NOAA has a map (somewhere) that shows the paths of all Atlantic hurricanes since, I guess, 1901, and when you put them all together, it is visually compelling.
Meanwhile, happy loading!
Sue
Nice question! And I think you mean major hurricanes... yet! The season is not yet over 🙂
Once I have the data loaded - should be this week - I'll start to look at how hurricanes are measured. I'm guessing we can do this by looking at wind speeds over X mph.
Unfortunately Lumira won't let us filter on a measure so I will have to filter it down at the HANA level. I'm sure that won't be too hard.
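A sketch of what that HANA-level filter could look like, as a view for the Lumira dataset to sit on. The view name, threshold and unit conversion are my illustrative assumptions (raw wind speed is metres per second scaled by 10):

CREATE VIEW HURRICANE_WINDS AS
SELECT *
  FROM NOAA_ISH
 WHERE WIND_SPEED <> 9999                      -- missing-value sentinel
   AND WIND_SPEED / 10.0 * 2.23694 >= 74;      -- m/s to mph; 74 mph is hurricane force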
John, besides this being really great and pertinent information (weather affects us all), it also reinforces the power of the Big Data capabilities that both HANA and, graphically, Lumira provide to end users. Turning this into Predictive Analytics may help with weather forecasting...?
Thanks William. I think it will get more interesting as we get a lot more data and I advance the model.
It's possible that we can help with weather forecasting but organizations like NOAA do have very advanced and specific algorithms running on very powerful supercomputers!
Still, let's see what happens.
Wow John,
I'm looking forward to seeing P.II
Are you running SP13?
The downloading seems to be the biggest challenge here.
I'm pretty excited!!!!!!!!!!!!
-Bijan
Hey Irshaad,
The Lumira build I use is very similar to the SP13 release you can download on SAP Service Marketplace, but I work with the Lumira team on HANA integration so you may see some differences in my build.
In my experience, accessing, downloading, formatting and getting data quality right is by far the hardest challenge in the HANA/Lumira world. The HANA/Lumira tech works great by comparison!
John
Great Blog - thanks!
Rama
Hi John,
I am impressed. Nice work.
Keep it up.
Thanks.
Regards,
Hari Suseelan
I followed the whole adventure on Twitter and it's pretty exciting. There's only one thing that I would add: we need units of measure, especially for temperature (showing degrees in Celsius in the US is kind of counter-intuitive).
Here's my question that I hope your dataset can answer: if I were to invest in a wind turbine, which requires constant winds across the year, in which country (maybe even region or city) should I put it?
Thanks,
Julien
That's the European in me! Easy to fix in HANA though with a unit-of-measure conversion.
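For what it's worth, a minimal sketch of that conversion as a HANA-side calculated measure (names and scaling as in the earlier sketches):

SELECT STATION_USAF,
       AVG(AIR_TEMP / 10.0) AS AVG_TEMP_C,
       AVG(AIR_TEMP / 10.0 * 9 / 5 + 32) AS AVG_TEMP_F   -- Celsius to Fahrenheit
  FROM NOAA_ISH
 WHERE AIR_TEMP <> 9999
 GROUP BY STATION_USAF;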
Very interesting question... I presume you want high wind with low variance. I should be able to pinpoint the exact location. Unfortunately this is tough with Lumira, because there's no VAR or STDDEV functionality with the HANA connector. I hope this will be fixed in HANA SP8.
What's more, I need to be careful, because those aggregations aren't run within the HANA OLAP engine, so when I run the queries they may be slow.
So I asked the question: geographic stations, ordered by average wind speed, but also showing the number of readings and the standard deviation.
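In SQL terms the question looks roughly like this, using HANA's STDDEV aggregate directly rather than going through Lumira (names and unit conversion are assumptions as before):

SELECT STATION_USAF,
       COUNT(*) AS READINGS,
       AVG(WIND_SPEED / 10.0 * 2.23694) AS AVG_MPH,      -- m/s to mph
       STDDEV(WIND_SPEED / 10.0 * 2.23694) AS STDDEV_MPH
  FROM NOAA_ISH
 WHERE WIND_SPEED <> 9999
 GROUP BY STATION_USAF
 ORDER BY AVG_MPH DESC;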
If variance weren't an issue, the best place would be the summit of Mt Washington, New Hampshire, with an average of nearly 16mph. But it looks pretty gusty up there: the standard deviation is through the roof at 8mph.
Alaska comes up several times: St George Island, Cold Bay, St Paul Island. But I'm going to take the liberty of saying that this doesn't help you, because then you would have to transport your electricity through Canada to get it somewhere useful.
In the continental USA, the best place is probably Pine Springs, TX with an average wind speed of 8.3mph and a low standard deviation of 3mph. The latitude/longitude is 31.83,-104.8.
Good luck with your acquisition!
John
John,
Thanks for following up. I'm French living in Miami, FL so I always have to struggle with Celsius / Fahrenheit and MPH / KMH. Well, at least they drive on the right side of the road ...
It looks like your analysis about Alaska is in line with this map:
File:United States Wind Resources and Transmission Lines map.jpg - Wikipedia, the free encyclopedia
Especially, I may not have thought about the standard deviation and sensor counts.
However, Texas is not on the official map, and you don't mention anything about California, the Great Lakes or the Jersey Shore. I wonder if the data you have also collects information from the near-shore stations.
I did some research and found this:
Wind farm - Wikipedia, the free encyclopedia
Also, according to Beaufort scale (Beaufort scale - Wikipedia, the free encyclopedia), a warning is raised when winds blow past 40 km/h (25 mph).
I understand the limitation of the deviation; maybe there's a workaround. What about counting the number of days when the wind speed is between 16 km/h (10 mph) and 40 km/h (25 mph) and mapping the answer?
Good luck,
Julien
Ah, this is what happens when you look at the wrong analysis 😉
I didn't know this, but Texas is the largest wind producer in the USA! Wind power in the United States - Wikipedia, the free encyclopedia
I'll go and re-think how I did the measuring. To do what you need, I'd need to measure the average and standard deviation, plus the proportion of hourly wind records between 10 and 25mph. That's an expensive calculation! Maybe there is a different way to calculate it.
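If it helps, a sketch of how that proportion could be computed in a single pass with conditional aggregation, rather than as a separate expensive query (same illustrative names and scaling as before):

SELECT STATION_USAF,
       AVG(WIND_SPEED / 10.0 * 2.23694) AS AVG_MPH,
       SUM(CASE WHEN WIND_SPEED / 10.0 * 2.23694 BETWEEN 10 AND 25
                THEN 1.0 ELSE 0.0 END) / COUNT(*) AS SHARE_10_25_MPH
  FROM NOAA_ISH
 WHERE WIND_SPEED <> 9999
 GROUP BY STATION_USAF
 ORDER BY SHARE_10_25_MPH DESC;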
John
Hi John,
what a very nice piece of work.
Very creative! I am impressed too!
Congratulations! 🙂
/Weiss