Big Data on a Small NUC: SAP HANA Express Edition in 42 cubic inches
At this year’s TechEd in Las Vegas, SAP introduced the highly anticipated SAP HANA Express Edition.
For the very first time, developers can now run SAP HANA on their own laptops using as little (*) as 16 GB RAM. It is optimised to run on laptops, desktops and mini-PC’s, and is free to use up until a memory consumption of 32 GB. You can install it as a native binary on Linux, or as a pre-configured, ready-to-use VM image.
(*) 16 GB RAM is still not standard on every laptop, but RAM is getting cheaper, 32 GB laptops are becoming more widespread, and really, 16 GB to run SAP HANA on is truly an achievement!
I have successfully run the VM on a 2012 MacBook Pro with 16 GB. And much to my surprise, if you tweak both the laptop’s as well as the VM’s resources well, it runs pretty smoothly!
Would I only do SAP HANA development, this 16GB configuration would be pretty adequate.
Enter the Intel NUC “Skull Canyon”.
The device is literally as cool as it’s name implies. Mine has 32 GB RAM, a 250 GB SSD, and the latest Intel i7 processor. It has peripheral connectivity to whatever thingamajiggies you can imagine, and has a nice hack0rz-approved ‘skull’ embossed on its lid. It measures just 8.3″ x 4.6″ x 1.1″ (211mm x 116mm x 28mm).
I decided to install Windows 10 on it, so I could run Eclipse with the HANA Tools on it, and benefit from the ease-of-use of simply loading the VM instead of going through the hassle of installing HANA Express Edition by hand. I assigned 16 GB to the VM and assigned 4 cores, fired up the VM and finalised the recommended steps for security settings and installing the developer license.
To see whether it was all successful, I opened the HANA Cockpit, and was happy to see I had a working SAP HANA environment in the palms of my hand.
(Honestly, I get tears in my eyes of pure joy when you think about it: the sheer power of SAP HANA in a device that fits in your hand)
After this short, blissful moment, I decided to see how it would handle lots of data and whether it could hold its own against the enterprisey HANA installations.
First problem was, where to get loads of data.
After some web scouring, I found the Bureau of Transportation Statistics website, part of the United States Department of Transportation. It has lots of data available in CSV format for aviation, marine, railway, highway, basically anything related to transport.
In the Aviation section, they held data from airline “On-Time Performance” of every domestic US flight since 1987. At least, it would give me the big dataset I wanted.
I downloaded the performance data for the month of August (In October, the latest available data was for August) and it provided a zip with ~450K rows of data. Honestly, I wouldn’t have expected that many air movements in one month, but for my test this wasn’t enough data. I downloaded all 8 months of 2016 worth of data, and ended up with a dataset of almost 3.8 million records.
Using Eclipse with SAP HANA Tools, I created a schema, and using the HANA import tools I imported the CSV airline performance data:
The website also provided lookup data such as airlines and airports in CSV format, so I imported those as well. Again, I was surprised to find over 1.600 airlines in them, but since the data goes back to 1987, it also held airlines long gone — remember Pan Am? It has 4 entries in the lookup table…
Matching the airlines with the performance data revealed this year only 310 distinct airlines performed domestic flights in the US, which seemed much more reasonable.
Intrigued by the amount of airlines and number of daily flights, I wanted to stress my dataset a bit further. The half-a-millisecond response for retrieving the COUNT was nice, but not much real-use value.
Next, I decided to show the number of air movements per day:
Holy cannelloni, on average between 11k-15k air movements per day!
Although the query is still a walk in the park for any database, it took just 17 milliseconds to deduct that data from the 3.8 million records. Still impressive, but not enough.
I wanted to show the maximum delay per day, which airline was affected, and show the departure and arrival airports. This query was a bit more complex, using an INNER JOIN SELECT with a MAX aggregation, and a couple of LEFT JOINS for the airline and airport lookup tables:
select P.FL_DATE, P.ARR_DELAY, C.DESCRIPTION as AIRLINE, O.DESCRIPTION as ORIGIN, D.DESCRIPTION as DEST from BTS.ONTIMEPERF P inner join ( select FL_DATE, max(ARR_DELAY) as ARR_DELAY from BTS.ONTIMEPERF group by FL_DATE ) PMAX on P.FL_DATE = PMAX.FL_DATE and P.ARR_DELAY = PMAX.ARR_DELAY left join BTS.CARRIERS C on P.UNIQUE_CARRIER = C.CODE left join BTS.AIRPORTS O on P.ORIGIN_AIRPORT_ID = O.CODE left join BTS.AIRPORTS D on P.DEST_AIRPORT_ID = D.CODE order by P.FL_DATE
The result is the following:
Three things are obvious here:
1) On a daily basis, the delay for a flight can be more than 18 hours (ARR_DELAY is the delay in minutes for that flight)
2) If you want to travel on schedule, I’d recommend to avoid American Airlines
3) The query took just 5 milliseconds more than the previous query to complete. You could also say, it took just one twentieth of the duration of the blink of an eye to finish. Please note, I am no DBA as you could tell from the above SQL, I haven’t used any indices, and I haven’t optimised the HANA instance significantly yet.
I think it is safe to state the Intel NUC has true server capabilities and, running SAP HANA Express Edition in a VM with only 16 GB RAM assigned to it, has no visible degradation in performance.
If you want to try it out for yourself, go to http://go.sap.com/developer/topics/sap-hana-express.html and start developing!
Oh, and if you have installed SAP HANA Express Edition on a NUC (or laptop, desktop, etc), do not hesitate to share your experiences in the comments