S4 HANA - large scale hardware options

Former Member · 02-02-2015

In 2014 with SAP HANA SPS08, we gained the kind of mission-critical stability of the HANA platform that allowed us to take some of the world's largest SAP systems and move them onto HANA.

There is now a Certified SAP HANA® Hardware Directory online, which lists the certified Suite on HANA and S4 appliance platforms. My customers are now looking for ever-larger HANA systems. The thing about S4 ERP systems is that they are not well suited to distributed or clustered hardware, even though that scale-out approach is what allows HANA to scale to very large systems for analytical platforms.

With S4, certified appliances are available from Cisco, Hitachi, Huawei, HP, Fujitsu, and SGI, and they scale to 6TB of memory. Apart from HP and SGI (more on this later), all of them are based around 8-socket Intel systems.

As a benchmark, the rough budget price for a 6TB HANA system is $600k. If you're getting below that, you've got a great deal.

Stretching 6TB to 12TB

The ratio between CPUs and RAM used in a HANA appliance is somewhat arbitrary - and relates to the certification process and benchmarking more than anything. Certainly, it's possible to put 12TB of DRAM in an 8-socket HANA appliance.

Cost will depend on your hardware vendor and their discounting, but let's use Lenovo as a good comparison. 6TB of RAM from Lenovo retails at $279,168 (that's 192x 32GB DIMMs, 46W0676). Street price is around $220k for 6TB.

If we build the same system with 12TB RAM, we have to use 64GB DIMMs (46W0741), which list at a nosebleed $5279 each. Yes, that's a painful $1,013,568 for the upgrade. On the street, you can probably get this for $750k.

Either way, a 12TB system built out of a stretched 8-socket, 6TB system will cost you more than $1.1m on the street. What's more, you will still have the same 8-CPU system that you had before - it won't provide any additional throughput. Even worse, the 64GB parts are clocked slightly slower than the 32GB parts, so you will lose a little performance for the upgrade.
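To make the arithmetic explicit, here is a minimal sketch that reworks the Lenovo list prices quoted above. The variable and function names are mine, and the only inputs are the figures already given in this post; street prices and the rest of the system are deliberately left out.

```python
# List-price figures quoted in the post (Lenovo part numbers in comments).
DIMM_COUNT = 192            # DIMMs in the 8-socket system
LIST_6TB_TOTAL = 279_168    # 192 x 32GB DIMMs (46W0676), list price in USD
LIST_64GB_EACH = 5_279      # one 64GB DIMM (46W0741), list price in USD


def capacity_tb(dimms: int, dimm_gb: int) -> float:
    """Total capacity in TB for a given number of equally sized DIMMs."""
    return dimms * dimm_gb / 1024


# Derive the per-DIMM list price of the 32GB part from the quoted total.
list_32gb_each = LIST_6TB_TOTAL / DIMM_COUNT

# Same 192 DIMM positions, populated with 64GB parts instead.
list_12tb_total = DIMM_COUNT * LIST_64GB_EACH

# Price per GB shows how much of a premium the denser part carries.
per_gb_32 = list_32gb_each / 32
per_gb_64 = LIST_64GB_EACH / 64

print(f"6TB  = {capacity_tb(DIMM_COUNT, 32):g}TB at ${LIST_6TB_TOTAL:,} list")
print(f"12TB = {capacity_tb(DIMM_COUNT, 64):g}TB at ${list_12tb_total:,} list")
print(f"Price per GB: ${per_gb_32:.2f} (32GB) vs ${per_gb_64:.2f} (64GB)")
```

On these list prices, the 64GB parts work out at roughly 1.8x the price per GB of the 32GB parts, which is where the pain of stretching an 8-socket system comes from.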

Going beyond 8 sockets

The Intel Ivy Bridge architecture, on which all current SAP HANA systems are built, was designed for 4 CPUs. Each CPU has 3 connections to other CPUs, on an interface called QPI. So in a 4 CPU configuration, all of them are connected to each other.

Move to 8 CPUs, and each CPU is directly connected to only 3 of the other 7. The remaining 4 are remote, reached via another CPU. Remember that each connection is around 12.8Gbyte/sec - some 10x faster than 10Gb Ethernet. Still, it means that accessing remote memory is slower than accessing local memory.

Move past 8 CPUs, and the QPI connections are spread even thinner.
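The effect of having only 3 QPI links per CPU can be sketched with a small graph search. The 4-socket wiring below is the fully connected case described above. The 8-socket wiring is purely my assumption for illustration (a ring plus one link to the opposite socket); real vendor boards may be wired differently.

```python
from collections import deque


def hop_counts(links, start):
    """Breadth-first search: QPI links crossed from `start` to every socket."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for peer in links[node]:
            if peer not in dist:
                dist[peer] = dist[node] + 1
                queue.append(peer)
    return dist


def summarize(links):
    """Return (directly linked sockets, sockets via another CPU, worst-case hops)."""
    dist = hop_counts(links, 0)
    direct = sum(1 for d in dist.values() if d == 1)
    via_other = sum(1 for d in dist.values() if d >= 2)
    return direct, via_other, max(dist.values())


# 4 sockets, 3 links each: every CPU is wired to every other CPU.
four_socket = {s: [p for p in range(4) if p != s] for s in range(4)}

# 8 sockets, 3 links each. ASSUMED wiring: both ring neighbours plus the
# socket directly opposite. This is a hypothetical layout, not a vendor spec.
eight_socket = {s: [(s - 1) % 8, (s + 1) % 8, (s + 4) % 8] for s in range(8)}

print("4S:", summarize(four_socket))
print("8S:", summarize(eight_socket))

# Bandwidth comparison from the text: 12.8 GByte/s per QPI link versus
# 10Gb Ethernet, which is 10 Gbit/s = 1.25 GByte/s.
qpi_vs_ethernet = 12.8 / (10 / 8)
print(f"QPI link vs 10Gb Ethernet: {qpi_vs_ethernet:.2f}x")
```

With this assumed wiring, each socket in the 8-socket case reaches 3 peers directly and the other 4 through an intermediate CPU, which matches the counts described above.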

This is really important because in the Intel architecture, RAM is attached to a specific CPU socket. This means that RAM local to a CPU is much faster than remote memory. This effect is called NUMA, or Non-Uniform Memory Access. Here's a rough idea of memory latency in Intel Ivy Bridge:

Memory Type	Latency (ns)
L1 Cache Reference	1
L2 Cache Reference	5
Local Main Memory Reference	50
Remote Main Memory (1 hop)	200
Remote Main Memory (2 hops)	500

As you can see, the proportion of local memory accesses makes a huge difference to the overall system performance. Even more critically, there is limited bandwidth between CPUs on the QPI network. If all accesses are remote, performance tanks.

I've personally seen this happen in early versions of SAP HANA with large numbers of CPUs - you get a degradation of performance as load increases. Thankfully, SPS09 of HANA contains significant optimizations for memory locality, which are enhanced in HANA SPS10.

The plan is to keep as much locality as possible between CPU operations and RAM. Going by the table above, a local access is up to 10x faster than one that is two hops away, so improving locality can increase throughput by up to 10x.
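As a rough illustration of why locality matters so much, here is a sketch that turns the latency table into an average main-memory latency for a given mix of local and remote accesses. The access mixes are made-up scenarios of mine, not measurements, and the model ignores caches and QPI bandwidth limits entirely.

```python
# Rough main-memory latencies from the table above, in nanoseconds.
LOCAL_NS = 50
ONE_HOP_NS = 200
TWO_HOPS_NS = 500


def average_latency_ns(local, one_hop, two_hops):
    """Weighted average latency for a mix of access types (shares sum to 1)."""
    if abs(local + one_hop + two_hops - 1.0) > 1e-9:
        raise ValueError("access shares must add up to 1")
    return local * LOCAL_NS + one_hop * ONE_HOP_NS + two_hops * TWO_HOPS_NS


# Hypothetical scenarios (assumptions for illustration only):
# - ideal: every access hits RAM attached to the local socket
# - tuned: 90% local, the rest one hop away
# - spread: data spread evenly over 8 sockets, of which 1 is local,
#   3 are one hop away and 4 are two hops away
scenarios = {
    "ideal": (1.0, 0.0, 0.0),
    "tuned": (0.9, 0.1, 0.0),
    "spread": (1 / 8, 3 / 8, 4 / 8),
}

for name, mix in scenarios.items():
    latency = average_latency_ns(*mix)
    print(f"{name:>6}: {latency:.2f} ns ({latency / LOCAL_NS:.1f}x local)")
```

Under these assumptions, data spread evenly across 8 sockets costs several times the latency of fully local access, which is the gap that NUMA-aware code tries to close.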

Still, what are our options for going past 8S/120 cores of processing power? Right now there are only two options. Note that with future Intel architectures (Haswell, Broadwell), this will of course change.

HP ConvergedSystem 900 (3-24TB)

The HP ConvergedSystem 900, or CS900 for short, is based on the HP SuperDome2 architecture, and creates a single system from a cluster of up to 8x 2S blades, for 16 CPUs and 12TB DRAM.

It uses a proprietary bus-based QPI framework which uses 1 QPI connection between the 2 CPUs, and frees the other 2 QPI connections to the bus. It is designed to be configured as two 8S servers as 2 LPARs but can be configured as a single 16S server.

As a rough order of magnitude, the memory cost is around $500k for 12TB using 32GB DIMMs. With 64GB DIMMs it is possible to configure 24TB, but that shoots up to over $1.5m, plus the remaining cost of the system.

In any case, HP has a bus-based system capable of 16S and 12TB of DRAM, or 24TB if money is no object.

SGI UV300H (3-48TB)

The SGI UV300H is based on SGI's NUMAlink technology, which derives from the SGI Origin 2000 and Onyx2 systems of the 1990s, where it was marketed as CrayLink. Like HP, it uses building blocks, but unlike HP it does not focus on a blade-based configuration.

Instead, SGI use 4S building blocks with 3TB of DRAM. 8 of these are daisy-chained using NUMAlink for up to 32S and 24TB DRAM in a typical configuration. SGI's configuration is different to HP, because in the SGI configuration, all the QPI connectors are exposed to the front of each building block. How they are distributed amongst the building blocks depends on the configuration.

SGI have some secret sauce which increases the RAS (reliability, availability, and serviceability) of the Intel platform, and decreases the chance of DRAM catastrophic failure.

In short, SGI scales to 32S and 24TB. For this configuration you can expect to pay around $1m for the memory alone. The SGI UV300H can be extended to 32S and 48TB of DRAM, if money is no object, but that's $3m of DRAM. Gulp.

Certification

It's worth noting that none of the configurations larger than 8S/6TB that I've talked about are certified (yet). The 16S/12TB configurations will be supported first - you can see on the certification site that the HP and SGI 8S/6TB configurations are already supported, and these are just extensions of that architecture. Our testing shows that HANA SPS08 was scalable to 8S, but 16S did not provide increased throughput. SPS09 of HANA provides not-quite-linear scalability to 16S.

Judging by the DKOM slides that were shown, NUMA coding is a key focus for HANA SPS10, which we expect to be released in May 2015. With that, we are expecting good scalability to 32S. We hope that certification for 32S will follow shortly after HANA SPS10.

The nice thing about both the HP and the SGI configuration is they can be extended in building blocks. These look something like this:

HP	SGI
2 Sockets, 1.5TB	4 Sockets, 3TB
4 Sockets, 3TB	8 Sockets, 6TB
6 Sockets, 4.5TB	12 Sockets, 9TB
8 Sockets, 6TB	16 Sockets, 12TB
10 Sockets, 7.5TB	20 Sockets, 15TB
12 Sockets, 9TB	24 Sockets, 18TB
14 Sockets, 10.5TB	28 Sockets, 21TB
16 Sockets, 12TB	32 Sockets, 24TB

In both cases, you could start with the smallest configuration, and grow to the largest, fairly seamlessly. With HP, you have to buy a 3PAR SAN up front, whilst SGI adds trays of NetApp direct attached storage - this makes the SGI pricing more linear. Either way, it's an extremely elegant way to build future-focussed platforms.
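The growth path in the table is simply a multiple of each vendor's building block, which a few lines of code can reproduce. The function name is mine; the block sizes (2 sockets and 1.5TB for HP, 4 sockets and 3TB for SGI, up to 8 blocks each) come from the table above.

```python
def growth_path(sockets_per_block, tb_per_block, max_blocks=8):
    """List (sockets, TB of DRAM) for 1..max_blocks building blocks."""
    return [
        (n * sockets_per_block, n * tb_per_block)
        for n in range(1, max_blocks + 1)
    ]


hp = growth_path(sockets_per_block=2, tb_per_block=1.5)
sgi = growth_path(sockets_per_block=4, tb_per_block=3.0)

# Print the two paths side by side, in the same layout as the table.
print("HP\tSGI")
for (hp_sockets, hp_tb), (sgi_sockets, sgi_tb) in zip(hp, sgi):
    print(f"{hp_sockets} Sockets, {hp_tb:g}TB\t{sgi_sockets} Sockets, {sgi_tb:g}TB")
```

Both vendors keep the same ratio of 0.75TB per socket at every step, so the only difference is the size of the increment.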

Also in both cases, you could upgrade from the current Ivy Bridge CPUs up to Haswell and Broadwell CPUs, and in the SGI case you can replace the memory risers for DDR4 memory (I haven't had confirmation of this for HP).

Final Words

I've been involved with all of these appliances, and they work extremely well. SGI just released SPECint 2006 results for their UV300H system, and it scales linearly across 4, 8, 16, and 32 sockets. In fact, they have the highest result for any system ever built, apart from a few results with 64 and 128 sockets from SGI and Fujitsu.

From what I see of the HP CS900, much the same applies, though HP have not released SPECint results yet.

But do note that 24TB, and even 12TB, of SAP HANA goes a long way. We recently moved a 50TB DB onto HANA, and it used 10TB of space. Unless you have a colossal database, a 12TB appliance is likely to be plenty.
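For a back-of-the-envelope feel, the one migration mentioned above implies a 5x reduction. The sketch below applies that ratio to the two appliance sizes. Treat it as an assumption: compression varies a lot by dataset, and this ignores any headroom for working memory or growth, so it is not a sizing method.

```python
# Figures from the migration described above.
SOURCE_DB_TB = 50
HANA_FOOTPRINT_TB = 10

# Reduction ratio observed for that one system.
reduction_ratio = SOURCE_DB_TB / HANA_FOOTPRINT_TB


def source_db_that_fits_tb(appliance_tb, ratio=reduction_ratio):
    """Source DB size whose HANA footprint would fill the appliance,
    ASSUMING the same reduction ratio and no memory headroom."""
    return appliance_tb * ratio


for appliance_tb in (12, 24):
    fits = source_db_that_fits_tb(appliance_tb)
    print(f"{appliance_tb}TB appliance ~ {fits:g}TB source DB at {reduction_ratio:g}x")
```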

And by the time you figure out the cost of building a HANA-based architecture with local SAS disk, compared to an equivalent Oracle, IBM, or Microsoft-based RDBMS with SAN disk, you will find the TCO numbers of HANA are crushing.