Big Data on a Small NUC: SAP HANA Express Edition in 42 cubic inches
At this year’s TechEd in Las Vegas, SAP introduced the highly anticipated SAP HANA Express Edition.
For the very first time, developers can now run SAP HANA on their own laptops using as little (*) as 16 GB RAM. It is optimised to run on laptops, desktops and mini-PCs, and is free to use up to a memory consumption of 32 GB. You can install it as a native binary on Linux, or as a pre-configured, ready-to-use VM image.
(*) 16 GB RAM is still not standard on every laptop, but RAM is getting cheaper, 32 GB laptops are becoming more widespread, and really, 16 GB to run SAP HANA on is truly an achievement!
I have successfully run the VM on a 2012 MacBook Pro with 16 GB. And much to my surprise, if you tune both the laptop’s and the VM’s resources well, it runs pretty smoothly!
If I only did SAP HANA development, this 16 GB configuration would be perfectly adequate.
Enter the Intel NUC “Skull Canyon”.
The device is literally as cool as its name implies. Mine has 32 GB RAM, a 250 GB SSD, and the latest Intel i7 processor. It has peripheral connectivity to whatever thingamajiggies you can imagine, and has a nice hack0rz-approved ‘skull’ embossed on its lid. It measures just 8.3″ x 4.6″ x 1.1″ (211mm x 116mm x 28mm).
I decided to install Windows 10 on it, so I could run Eclipse with the HANA Tools on it, and benefit from the ease-of-use of simply loading the VM instead of going through the hassle of installing HANA Express Edition by hand. I assigned 16 GB to the VM and assigned 4 cores, fired up the VM and finalised the recommended steps for security settings and installing the developer license.
To see whether it was all successful, I opened the HANA Cockpit, and was happy to see I had a working SAP HANA environment in the palm of my hand.
(Honestly, I get tears of pure joy in my eyes when I think about it: the sheer power of SAP HANA in a device that fits in your hand)
After this short, blissful moment, I decided to see how it would handle lots of data and whether it could hold its own against the enterprisey HANA installations.
The first problem was where to get loads of data.
After some web scouring, I found the Bureau of Transportation Statistics website, part of the United States Department of Transportation. It has lots of data available in CSV format for aviation, marine, railway, highway, basically anything related to transport.
In the Aviation section, it holds airline “On-Time Performance” data for every domestic US flight since 1987. At the very least, that would give me the big dataset I wanted.
I downloaded the performance data for the month of August (in October, the latest available data was for August) and it provided a zip with ~450K rows of data. Honestly, I wouldn’t have expected that many air movements in one month, but for my test this wasn’t enough data. So I downloaded the data for all 8 available months of 2016, and ended up with a dataset of almost 3.8 million records.
Using Eclipse with SAP HANA Tools, I created a schema, and using the HANA import tools I imported the CSV airline performance data.
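For readers who want to rehearse the import step without a HANA instance, here is a minimal sketch of loading such a CSV into a table. It is not the HANA import tool used above: it uses Python's built-in sqlite3 module as a local stand-in, the three sample rows are made up, and the column list is an assumption based on the columns used in the query further down.

```python
import csv
import io
import sqlite3

# Hypothetical sample standing in for one monthly CSV export (invented rows).
SAMPLE_CSV = """FL_DATE,UNIQUE_CARRIER,ORIGIN_AIRPORT_ID,DEST_AIRPORT_ID,ARR_DELAY
2016-08-01,AA,12478,12892,15.0
2016-08-01,DL,10397,13487,-4.0
2016-08-02,AA,12892,12478,
"""


def load_csv(conn, text):
    """Create the ONTIMEPERF table if needed and insert all CSV rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ONTIMEPERF ("
        "FL_DATE TEXT, UNIQUE_CARRIER TEXT, "
        "ORIGIN_AIRPORT_ID INTEGER, DEST_AIRPORT_ID INTEGER, ARR_DELAY REAL)"
    )
    rows = [
        (
            r["FL_DATE"],
            r["UNIQUE_CARRIER"],
            int(r["ORIGIN_AIRPORT_ID"]),
            int(r["DEST_AIRPORT_ID"]),
            # An empty delay field is stored as NULL.
            float(r["ARR_DELAY"]) if r["ARR_DELAY"] else None,
        )
        for r in csv.DictReader(io.StringIO(text))
    ]
    conn.executemany("INSERT INTO ONTIMEPERF VALUES (?, ?, ?, ?, ?)", rows)
    return len(rows)


conn = sqlite3.connect(":memory:")  # in-memory stand-in, not HANA
loaded = load_csv(conn, SAMPLE_CSV)
total = conn.execute("SELECT COUNT(*) FROM ONTIMEPERF").fetchone()[0]
print(loaded, total)
```

Calling load_csv once per monthly file would accumulate all months in the same table, which is roughly what importing the eight files one after another achieves.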
The website also provided lookup data such as airlines and airports in CSV format, so I imported those as well. Again, I was surprised to find over 1,600 airlines in them, but since the data goes back to 1987, it also held airlines long gone (remember Pan Am? It has 4 entries in the lookup table…).
Matching the airlines with the performance data revealed that only 310 distinct airlines performed domestic flights in the US this year, which seemed much more reasonable.
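The matching described here can be sketched as a COUNT(DISTINCT ...) over a join between the performance table and the carrier lookup table. The snippet below is an illustration only: it runs against an in-memory sqlite3 database rather than HANA, and the carrier codes and descriptions are invented sample rows, not the real lookup data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory stand-in, not HANA
conn.executescript("""
CREATE TABLE CARRIERS (CODE TEXT, DESCRIPTION TEXT);
CREATE TABLE ONTIMEPERF (FL_DATE TEXT, UNIQUE_CARRIER TEXT);
-- Invented sample rows: three carriers in the lookup, two of them flying.
INSERT INTO CARRIERS VALUES
    ('AA', 'Sample Carrier A'),
    ('DL', 'Sample Carrier D'),
    ('PA', 'Sample Carrier P (defunct)');
INSERT INTO ONTIMEPERF VALUES
    ('2016-08-01', 'AA'),
    ('2016-08-01', 'DL'),
    ('2016-08-02', 'AA');
""")

# Every carrier the lookup table knows about.
lookup_count = conn.execute("SELECT COUNT(*) FROM CARRIERS").fetchone()[0]

# Only the carriers that actually appear in the performance data.
active_count = conn.execute("""
    SELECT COUNT(DISTINCT P.UNIQUE_CARRIER)
    FROM ONTIMEPERF P
    INNER JOIN CARRIERS C ON P.UNIQUE_CARRIER = C.CODE
""").fetchone()[0]

print(lookup_count, active_count)
```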
Intrigued by the number of airlines and daily flights, I wanted to stress my dataset a bit further. The half-a-millisecond response for retrieving the COUNT was nice, but of little practical value.
Next, I decided to show the number of air movements per day:
Holy cannelloni, between 11k and 15k air movements per day on average!
Although the query is still a walk in the park for any database, it took just 17 milliseconds to derive that data from the 3.8 million records. Still impressive, but not enough.
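The per-day count is most likely a plain GROUP BY on the flight date. The exact statement is not shown here, so the following is a guess at its shape, run against an in-memory sqlite3 table with a handful of invented rows instead of the real 3.8 million.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory stand-in, not HANA
conn.execute("CREATE TABLE ONTIMEPERF (FL_DATE TEXT, UNIQUE_CARRIER TEXT)")
conn.executemany(
    "INSERT INTO ONTIMEPERF VALUES (?, ?)",
    [
        # Invented sample rows: three flights on day one, two on day two.
        ("2016-08-01", "AA"),
        ("2016-08-01", "DL"),
        ("2016-08-01", "UA"),
        ("2016-08-02", "AA"),
        ("2016-08-02", "DL"),
    ],
)

# One output row per flight date with the number of flights on that date.
flights_per_day = conn.execute("""
    SELECT FL_DATE, COUNT(*) AS FLIGHTS
    FROM ONTIMEPERF
    GROUP BY FL_DATE
    ORDER BY FL_DATE
""").fetchall()

print(flights_per_day)
```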
I wanted to show the maximum delay per day, which airline was affected, and the departure and arrival airports. This query was a bit more complex, using an INNER JOIN on a subquery with a MAX aggregation, and a couple of LEFT JOINs for the airline and airport lookup tables:
select P.FL_DATE,
       P.ARR_DELAY,
       C.DESCRIPTION as AIRLINE,
       O.DESCRIPTION as ORIGIN,
       D.DESCRIPTION as DEST
from BTS.ONTIMEPERF P
inner join (
    select FL_DATE, max(ARR_DELAY) as ARR_DELAY
    from BTS.ONTIMEPERF
    group by FL_DATE
) PMAX
    on P.FL_DATE = PMAX.FL_DATE
   and P.ARR_DELAY = PMAX.ARR_DELAY
left join BTS.CARRIERS C on P.UNIQUE_CARRIER = C.CODE
left join BTS.AIRPORTS O on P.ORIGIN_AIRPORT_ID = O.CODE
left join BTS.AIRPORTS D on P.DEST_AIRPORT_ID = D.CODE
order by P.FL_DATE
The result is the following:
Three things are obvious here:
1) On a daily basis, the delay for a flight can be more than 18 hours (ARR_DELAY is the delay in minutes for that flight)
2) If you want to travel on schedule, I’d recommend avoiding American Airlines
3) The query took just 5 milliseconds more than the previous query to complete. You could also say it took just one twentieth of the blink of an eye to finish. Please note that I am no DBA, as you can tell from the SQL above: I haven’t used any indices, and I haven’t optimised the HANA instance significantly yet.
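To see how the "join back on the daily maximum" pattern in the query above behaves, here is a small sketch that runs the same statement (minus the BTS schema prefix) on an in-memory sqlite3 database. All rows, carrier names and airport names are invented for illustration, and sqlite3 is only a stand-in for HANA.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory stand-in, not HANA
conn.executescript("""
CREATE TABLE ONTIMEPERF (
    FL_DATE TEXT, UNIQUE_CARRIER TEXT,
    ORIGIN_AIRPORT_ID INTEGER, DEST_AIRPORT_ID INTEGER, ARR_DELAY REAL);
CREATE TABLE CARRIERS (CODE TEXT, DESCRIPTION TEXT);
CREATE TABLE AIRPORTS (CODE INTEGER, DESCRIPTION TEXT);
-- Invented sample rows; ARR_DELAY is in minutes, NULL means no delay recorded.
INSERT INTO ONTIMEPERF VALUES
    ('2016-08-01', 'AA', 1, 2, 1100.0),
    ('2016-08-01', 'DL', 2, 1, 30.0),
    ('2016-08-02', 'DL', 1, 2, 95.0),
    ('2016-08-02', 'AA', 2, 1, NULL);
INSERT INTO CARRIERS VALUES ('AA', 'Carrier A'), ('DL', 'Carrier D');
INSERT INTO AIRPORTS VALUES (1, 'Airport One'), (2, 'Airport Two');
""")

# The subquery finds the largest delay per day; joining it back to the base
# table recovers the full row (carrier, origin, destination) for that flight.
worst_per_day = conn.execute("""
    SELECT P.FL_DATE, P.ARR_DELAY,
           C.DESCRIPTION AS AIRLINE,
           O.DESCRIPTION AS ORIGIN,
           D.DESCRIPTION AS DEST
    FROM ONTIMEPERF P
    INNER JOIN (
        SELECT FL_DATE, MAX(ARR_DELAY) AS ARR_DELAY
        FROM ONTIMEPERF
        GROUP BY FL_DATE
    ) PMAX
        ON P.FL_DATE = PMAX.FL_DATE
       AND P.ARR_DELAY = PMAX.ARR_DELAY
    LEFT JOIN CARRIERS C ON P.UNIQUE_CARRIER = C.CODE
    LEFT JOIN AIRPORTS O ON P.ORIGIN_AIRPORT_ID = O.CODE
    LEFT JOIN AIRPORTS D ON P.DEST_AIRPORT_ID = D.CODE
    ORDER BY P.FL_DATE
""").fetchall()

for row in worst_per_day:
    print(row)
```

One property of this pattern worth knowing: if two flights on the same day share the maximum delay, both rows are returned, so the result can contain more than one row per date.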
I think it is safe to state the Intel NUC has true server capabilities and, running SAP HANA Express Edition in a VM with only 16 GB RAM assigned to it, has no visible degradation in performance.
If you want to try it out for yourself, go to http://go.sap.com/developer/topics/sap-hana-express.html and start developing!
Oh, and if you have installed SAP HANA Express Edition on a NUC (or laptop, desktop, etc.), do not hesitate to share your experiences in the comments.
This is great. I'm currently running HXE on a 24 GB VMware Workstation instance at home on my i7 gaming rig with Win10 as the host OS. Unfortunately my home PC maxes out at 32 GB of RAM, so I'm toying with a separate project of running the VMware ESXi hypervisor to eliminate the footprint of the host OS.
I'm wondering whether ESXi would work on an Intel NUC so that I can get my gaming tower back 🙂
I'm running VMware Workstation 12.5 Pro as well 🙂
I did toy with the idea of running ESXi from the NUC's bare metal (see for instance http://www.virten.net/2016/06/vmware-homeserver-esxi-on-6th-gen-skull-canyon-intel-nuc/) but I deliberately installed Win10 (and thus was "forced" to use a type-2 hypervisor) since I wanted to install development tools on it as well.
If you plan to install ESXi on a Skull Canyon NUC, do read the comments in the link I provided. Performance should be stellar, but there are some issues...
Looks like two people had issues with data loss on M.2 Samsung SSDs. It's also quite an investment: $600 for just the CPU/GPU and enclosure, and I'd still have to purchase SSD(s) and RAM separately, probably bringing the cost closer to $1400 USD (https://www.amazon.com/dp/B01GPA7CFE?psc=1)
I suppose it just depends on what I do/don't want to do with a NUC with/without HXE in the mix or vice-versa. I certainly do not need another large tower running at home but I have a need to run more VMs in some sort of 'lab' environment 🙂
Mine has an Intel SSDSCKKW240H6 250GB drive. I've heard of similar issues with Samsung M.2 drives as well.
It certainly is a significant amount of money, but form factor / performance is worth something as well I think 🙂
I got all excited when I saw a NUC on Amazon for $570 then realized it was base. The portability is awesome. Right now I have a linux tower with 32GB under my desk that will have to do until I can save some pennies.
Circling back on the drive topic, I've been having intermittent issues with the hypervisor failing to write to the storage medium (these Samsung M.2 drives).
It seems to happen possibly during intense i/o (like swap thrashing) on any guest VM, not specific to HANA VM. I'm curious how your experience has been?
I haven't noticed any issues yet, but admittedly I don't put it under severe load that often. Is it strictly related to VMware hypervisors or other brands as well?
In my case, I'm running vSphere (ESXi) on the Skull Canyon NUC, so that's really my only experience. It really only happens when I try to get away with running too many VMs at once for prolonged periods.
Also, I am having really bad luck with the same symptom when trying to run HXE 2.0 in this environment. I keep corrupting the HXE 2.0 OS's file system when I have to hard reboot it. I end up stuck at a GRUB boot menu error and everything is screwed. I'll try HXE 2.0 on a VMware Workstation setup to see if I have better luck, though.
Looks like you're using the same data set I used in [https://blogs.sap.com/?p=101918] and for my hands-on session DMM273 @ TechEd && d-code 2014. I also loaded these data into my HXE VM and can confirm that it's quite a good match of data and technology for a sandbox.
Cool! Wasn't aware of your blog... Will dig up the 2014 TechEd USB stick to go through your session though!
Initially, I was thinking of using IMDb movie datasets, but they have a very peculiar data format. The Million Song Dataset looked promising too (with 1M rows, it's not 'big', but has interesting data nonetheless) but the size (280GB) is larger than my SSD (250GB) so a no-go as well. I tried the smaller (10,000 rows) subset, but couldn't get the HDF5 imported into something meaningful to HANA...
The airline performance dataset at least was reasonably small, and simple CSV format 🙂
Yesterday I found this great dataset and am thinking about a use case for it in HXE.
Models of Geo-Information Science and Remote Sensing: HILDA
Sweet! They also have it in ArcGIS map data format, so you could potentially plot it onto their maps. Interesting!