In 2014 with SAP HANA SPS08, we gained the kind of mission-critical stability of the HANA platform that allowed us to take some of the world’s largest SAP systems and move them onto HANA.

There is now a CERTIFIED SAP HANA® HARDWARE DIRECTORY online which lists the certified Suite on HANA and S4 appliance platforms. My customers are now looking for ever-larger HANA systems. The thing about S4 ERP systems is that they are not well suited to distributed or clustered hardware. For analytical platforms, by contrast, distributed hardware is exactly what allows HANA to scale to very large systems.

With S4, certified appliances are available from Cisco, Hitachi, Huawei, HP, Fujitsu, and SGI, and they scale to 6TB of RAM. Apart from HP and SGI (more on this later), all of them are based around 8-socket Intel systems.

As a benchmark, the rough budget price for a 6TB HANA system is $600k. If you’re getting below that, you’ve got a great deal.

Stretching 6TB to 12TB

The ratio between CPUs and RAM used in a HANA appliance is somewhat arbitrary – and relates to the certification process and benchmarking more than anything. Certainly, it’s possible to put 12TB of DRAM in an 8-socket HANA appliance.

Cost will depend on your hardware vendor and their discounting, but let’s use Lenovo as a good comparison. 6TB of RAM from Lenovo retails at $279,168 (that’s 192x 32GB DIMMs, 46W0676). Street price is around $220k for 6TB.

If we build the same system with 12TB RAM, we have to use 64GB DIMMs (46W0741), which list at a nosebleed $5279 each. Yes, that’s a painful $1,013,568 for the upgrade. On the street, you can probably get this for $750k.

Either way, a 12TB system built out of a stretched 8 socket, 6TB system, will cost you > $1.1m on the street. What’s more, you will still have the same 8 CPU system that you had before – it won’t provide any additional throughput. Even worse, the 64GB parts are clocked slightly slower than the 32GB parts, so you will lose a little performance for the upgrade.
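
The arithmetic behind those figures can be checked with a short sketch. The DIMM count and list prices are the ones quoted above. The breakdown of the street price is an assumption: it supposes the $600k budget figure includes roughly $220k of 32GB memory, which is then swapped for $750k of 64GB memory.

```python
# Rough memory-cost arithmetic for an 8-socket appliance, using the figures quoted above.
DIMM_SLOTS = 192                      # 192 DIMMs in the 6TB configuration

list_32gb_total = 279_168             # quoted list price for 192x 32GB DIMMs (USD)
list_64gb_each = 5_279                # quoted list price per 64GB DIMM (USD)

capacity_32gb_tb = DIMM_SLOTS * 32 / 1024
capacity_64gb_tb = DIMM_SLOTS * 64 / 1024
list_32gb_each = list_32gb_total / DIMM_SLOTS
list_64gb_total = DIMM_SLOTS * list_64gb_each

# Assumption: the $600k budget price includes about $220k of 32GB memory at
# street prices, and the 64GB memory costs about $750k on the street.
street_system_6tb = 600_000
street_ram_6tb = 220_000
street_ram_12tb = 750_000
street_system_12tb = street_system_6tb - street_ram_6tb + street_ram_12tb

print(f"32GB DIMMs: {capacity_32gb_tb:.0f}TB, ${list_32gb_each:,.0f} each at list")
print(f"64GB DIMMs: {capacity_64gb_tb:.0f}TB, ${list_64gb_total:,} at list")
print(f"Estimated 12TB street price: ${street_system_12tb:,}")
```

On these assumptions the 64GB DIMM costs more than 3.5x the 32GB part for twice the capacity, which is where the stretched configuration gets its price tag.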

Going beyond 8 sockets

The Intel Ivy Bridge architecture, on which all current SAP HANA systems are built, was designed for 4 CPUs. Each CPU has 3 connections to other CPUs, on an interface called QPI. So in a 4 CPU configuration, all of them are connected to each other.

Move to 8 CPUs, and each CPU is directly connected to only 3 of the other 7. The other 4 are remote – reached via another CPU. Remember that each connection is around 12.8GB/sec – some 10x faster than 10Gb Ethernet. Still, it means that accessing remote memory is slower than local memory.
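
The difference between the 4-socket and 8-socket cases can be sketched with a small hop-count calculation. The 8-socket wiring below is a hypothetical layout chosen for illustration (each socket linked to its two ring neighbours and to the socket opposite), not a description of how any particular vendor routes QPI. It simply shows that with 3 links per socket, 3 sockets are one hop away and the remaining 4 need two hops.

```python
from collections import deque

def hop_counts(links, start):
    """Breadth-first search: number of QPI hops from `start` to every socket."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        cpu = queue.popleft()
        for neighbour in links[cpu]:
            if neighbour not in hops:
                hops[neighbour] = hops[cpu] + 1
                queue.append(neighbour)
    return hops

# 4 sockets, 3 QPI links each: every socket links to every other socket.
four_socket = {cpu: [other for other in range(4) if other != cpu] for cpu in range(4)}

# 8 sockets, 3 QPI links each. Hypothetical wiring (an assumption for illustration):
# each socket links to its two ring neighbours and to the socket opposite it.
eight_socket = {cpu: [(cpu + 1) % 8, (cpu - 1) % 8, (cpu + 4) % 8] for cpu in range(8)}

for name, links in (("4S", four_socket), ("8S", eight_socket)):
    hops = hop_counts(links, start=0)
    direct = sum(1 for h in hops.values() if h == 1)
    remote = sum(1 for h in hops.values() if h == 2)
    print(f"{name}: {direct} sockets one hop away, {remote} sockets two hops away")
```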

Moving past 8 CPUs, and the QPI connections are spread even thinner.

This is really important because in the Intel architecture, RAM is attached to a specific CPU socket. This means that local RAM to a CPU is much faster than remote memory. This effect is called NUMA, or Non-Uniform Memory Access. Here’s a rough idea of memory latency in Intel Ivy Bridge:

Memory Type                    Latency (ns)
L1 Cache Reference             1
L2 Cache Reference             5
Local Main Memory Reference    50
Remote Main Memory (1 hop)     200
Remote Main Memory (2 hops)    500

As you can see, the proportion of local memory accesses makes a huge difference to the overall system performance. Even more critically, there is limited bandwidth between CPUs on the QPI network. If all accesses are remote, performance tanks.

I’ve personally seen this happen in early versions of SAP HANA with large numbers of CPUs – you get a degradation of performance as load increases. Thankfully, SPS09 of HANA contains significant optimizations for memory locality, which are enhanced in HANA SPS10.

The plan is to keep as much locality between CPU operations and RAM as possible – going by the table above, this can increase throughput by up to 10x.
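
To get a feel for how much locality matters, here is a minimal sketch that turns the latency table above into a weighted average. The access mixes are made-up examples, not measured HANA workloads, and the model only looks at latency: it ignores the QPI bandwidth limits mentioned above, which make the remote-heavy cases worse in practice.

```python
# Latencies in nanoseconds, taken from the table above.
LOCAL_NS = 50
ONE_HOP_NS = 200
TWO_HOP_NS = 500

def average_latency_ns(local, one_hop, two_hop):
    """Weighted average main-memory latency for a given mix of access types."""
    total = local + one_hop + two_hop
    if abs(total - 1.0) > 1e-9:
        raise ValueError("access fractions must sum to 1")
    return local * LOCAL_NS + one_hop * ONE_HOP_NS + two_hop * TWO_HOP_NS

# Hypothetical access mixes (assumptions for illustration, not measured workloads).
mixes = {
    "all local": (1.0, 0.0, 0.0),
    "mostly local": (0.8, 0.1, 0.1),
    "evenly spread over 8 sockets": (1 / 8, 3 / 8, 4 / 8),
    "all two hops away": (0.0, 0.0, 1.0),
}

for name, mix in mixes.items():
    latency = average_latency_ns(*mix)
    print(f"{name}: {latency:.1f}ns ({latency / LOCAL_NS:.1f}x local)")
```

The "evenly spread" mix corresponds to data scattered with no regard for placement across an 8-socket box, where only 1 in 8 accesses happens to be local.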

Still, what are our options for going past 8S/120 cores of processing power? Right now there are only two options. Note that with future Intel architectures (Haswell, Broadwell), this will of course change.

HP ConvergedSystem 900 (3-24TB)

The HP ConvergedSystem 900, or CS900 for short, is based on the HP SuperDome2 architecture, and creates a single system based on a cluster of up to 8x 2S blades for 16CPUs and 12TB DRAM.

It uses a proprietary bus-based QPI framework which uses 1 QPI connection between the 2 CPUs, and frees the other 2 QPI connections to the bus. It is designed to be configured as two 8S servers as 2 LPARs but can be configured as a single 16S server.

As a rough order of magnitude, the memory cost is around $500k for 12TB using 32GB DIMMs. With 64GB DIMMs it is possible to configure 24TB, but that shoots up to over $1.5m, plus the remaining cost of the system.

In any case, HP has a bus-based system capable of 16S and 12TB of DRAM, or 24TB if money is no object.

SGI UV300H (3-48TB)

The SGI UV300H is based on SGI’s NUMAlink technology, which derives from the SGI Origin 2000 and Onyx2 systems of the 1990s, where it was originally branded CrayLink. Like HP, it uses building blocks, but unlike HP it does not focus on a blade-based configuration.

Instead, SGI use 4S building blocks with 3TB of DRAM. 8 of these are daisy-chained using NUMAlink for up to 32S and 24TB DRAM in a typical configuration. SGI’s configuration is different to HP, because in the SGI configuration, all the QPI connectors are exposed to the front of each building block. How they are distributed amongst the building blocks depends on the configuration.

SGI have some secret sauce which increases the RAS (reliability, availability, and serviceability) of the Intel platform, and decreases the chance of DRAM catastrophic failure.

In short, SGI scales to 32S and 24TB. For this configuration you can expect to pay around $1m for the memory alone. The SGI UV300H can be extended to 32S and 48TB of DRAM, if money is no object, but that’s $3m of DRAM. Gulp.

Certification

It’s worth noting that none of the configurations I’ve talked about are certified (yet). The 16S/12TB configurations will be supported first – you can see on the certification site that the HP and SGI 8S/6TB configurations are already supported, and these are just extensions of that architecture. Our testing shows that HANA SPS08 was scalable to 8S, but 16S did not provide increased throughput. SPS09 of HANA provides not-quite linear scalability to 16S.

Judging by the DKOM slides that were shown, NUMA coding is a key focus for HANA SPS10, which we expect to be released in May 2015. With that, we are expecting good scalability to 32S. We hope that certification for 32S will follow shortly after HANA SPS10.

The nice thing about both the HP and the SGI configuration is they can be extended in building blocks. These look something like this:

HP                     SGI
2 Sockets, 1.5TB       4 Sockets, 3TB
4 Sockets, 3TB         8 Sockets, 6TB
6 Sockets, 4.5TB       12 Sockets, 9TB
8 Sockets, 6TB         16 Sockets, 12TB
10 Sockets, 7.5TB      20 Sockets, 15TB
12 Sockets, 9TB        24 Sockets, 18TB
14 Sockets, 10.5TB     28 Sockets, 21TB
16 Sockets, 12TB       32 Sockets, 24TB

In both cases, you could start with the smallest configuration, and grow to the largest, fairly seamlessly. With HP, you have to buy a 3PAR SAN up front, whilst SGI adds trays of NetApp direct attached storage – this makes the SGI pricing more linear. Either way, it’s an extremely elegant way to build future-focussed platforms.

Also in both cases, you could upgrade from the current Ivy Bridge CPUs up to Haswell and Broadwell CPUs, and in the SGI case you can replace the memory risers for DDR4 memory (I haven’t had confirmation of this for HP).

Final Words

I’ve been involved with all of these appliances, and they work extremely well. SGI just released SPECint 2006 results for their UV300H system, and it scales linearly across 4, 8, 16, and 32 sockets. In fact, they have the highest result for any system ever built, apart from a few results with 64 and 128 sockets from SGI and Fujitsu.

From what I see of the HP CS900, much the same applies, though HP have not released SPECint results yet.

But do note that 24TB, and even 12TB, of SAP HANA goes a long way. We recently moved a 50TB DB onto HANA, and it used 10TB of space. Unless you have a colossal database, even a 12TB appliance will be plenty.

And by the time you figure out the cost of building a HANA-based architecture with local SAS disk, compared to an equivalent Oracle, IBM, or Microsoft-based RDBMS with SAN disk, you will find the TCO numbers of HANA are crushing.

23 Comments

  1. Henrique Pinto

    I thought you were gonna comment about SGI’s all-to-all topology that allows each CPU to reach each other’s RAM with just one hop, even in their 32S system. But maybe that’s too much of vendor talk. 😉

    It could be the subject for Brian Freed’s first SCN blog.

    It’s time he creates a SCN user. 😛

    1. John Appleby Post author

      Well I’m not allowed to release benchmark results, but it does seem the topology does indeed reduce latency with HANA.

      However, SGI’s biggest win is their memlog technology. When you have hundreds/thousands of DIMMs, they regularly fail, and memlog allows you to wait until you have a bunch of failures before failing over to HA, replacing DIMMs, fail back. Very elegant.

      Anyone who has ever run 1-2-6TB HANA systems knows that DIMM failure, even with E7 RAS, is a serious problem.

      Let’s see if Brian takes your challenge!

    2. Mike Woodacre

      I’m not Brian, but happy to talk about the SGI NUMAlink interconnect on UV300H. This is NL7 technology, running at 56Gb/s each direction, with an all-to-all topology with 32 E7 processors in a single rack, 4 E7’s per chassis, 8 chassis per rack (we can go higher, and plan to, but that introduces an additional network hop). The picture below shows the topology. NL7 also implements adaptive routing, so as well as offering lowest latency through 1 hop, we can aggregate bandwidth for bulk data movement over multiple links.

      Cheers,

      Mike

      /wp-content/uploads/2015/02/uv300_top_640346.png

      1. John Appleby Post author

        Hey Mike, realize this shows my lack of knowledge!

        There are three scenarios, if I get it right:

        – CPU-CPU in-chassis

        – CPU-Numalink-CPU in-chassis

        – CPU-Numalink-Cable-Numalink-CPU

        How does SGI provide a single hop in all circumstances, and what’s the variance in latency between these scenarios?

        Thanks!

  2. Henrique Pinto

    BTW, another potentially relevant piece of info is the official announcement of the HANA on Power Ramp-up:

    SAP Service Marketplace – SAP Ramp-Up -> Upcoming Ramp-Ups -> SAP HANA on Power 1.0

    The only allowed use case, as of now, is BW on HANA. However, the ramp-up page does mention that P7+ will support 4TB on a single node for BW, and P8 will support 8TB.

    If the CPU/RAM ratio observed in x86 is kept for Power and SAP allows the OLTP (SOH/S4H) systems on Power to have 3x the RAM of OLAP (BW/DW) systems, we could potentially be seeing 12TB (P7+) and 24TB (P8) scale up machines from IBM as well, in the near future.

    1. John Appleby Post author

      That’s super-interesting. The cynic in me thinks it’s a bone thrown to IBM because of the S/4HANA launch 🙂

      Watch out for IBM’s memory numbers. Currently the Power8 maximum is 8S/2TB with 32GB DIMMs. In time, 16S/4TB will be supported.

      It’s also possible to move to 64GB DIMMs and even 128GB DIMMs, but these 1) are very expensive and 2) dilute the DIMM:CPU ratio.

      For ERP this is possible, but we have seen the ERP ratio be something like 50GB/core. So for a 16S/128c box, that’s 6TB, maybe 8TB at a stretch.

      I can’t see how Power8 gets anywhere near 12/24TB DRAM though, unless they produce larger boxes or go scale-out.

      It’s worth noting that the SGI box has 768 DIMMs at 32S/12TB, compared to the IBM E870 which has 128 DIMMs at 16S/16TB. This causes a serious bandwidth difference!

      1. Henrique Pinto

        I have no idea about the HW architecture, I was just inferring that from the Ramp-up page information. It’s odd they say that an 8TB scale up is possible for BW on HANA on P8 if it’s not technically achievable.

        IBM’s brochure does say the E880 (P8) server can reach up to 8TB and that the P795 (P7+) can reach 16TB. As soon as they update the Px95 server to P8, we might see 16+TB servers.

        Source: http://public.dhe.ibm.com/common/ssi/ecm/po/en/pob03029usen/POB03029USEN.PDF

        1. John Appleby Post author

          Yes it can, it can support

          8S/2TB (32GB DIMMs), 8S/4TB (64GB DIMMs), 8S/8TB (128GB DIMMs)

          16S/4TB (32GB DIMMs), 16S/8TB (64GB DIMMs), 16S/16TB (128GB DIMMs)


          Do note that the configurations marked in bold are actually available! Also note that for BW, I don’t think you’d want to go past 16S/4TB, as you’d run out of CPU horsepower.


          Hope this clarifies.

          1. Henrique Pinto

            But if the SAP application benchmark is right, it does say that the Power8 cores have at least twice the computing capacity of an Intel Xeon IvyBridge (E7 v2) core. Just calculate the SAPS/core for any given benchmark and you’ll get ~2200-2500 SAPS/core for E7v2 & ~5500 SAPS/core for P8.

            If that holds true, then you could double the amount of memory per core of currently existing x86-based HANA systems. Since P8 are 10-core CPUs, you could have 80-core HANA OLAP P8-based servers with the same amount of RAM as 160-core Intel E7v2 servers. Today you can hold 2TB on 120 E7v2 cores and you’ll be able to support 4TB on 240 cores (16x E7v2) & 8TB on 480 cores (32x E7v2).


            P795 can reach up to 256 P7+ cores, which if updated to P8, would mean the computing capacity would be enough to support 24+TB in terms of CPU/RAM ratio. Of course the bottleneck would be how many DIMMs you could add to it and how large could they be. And that’s not even considering the upgrade from 8-core to 10-core CPUs, which would bring it up to potential 320 P8 cores and a theoretical processing capability to support 64TB RAM. But I do agree this is still speculation.

            1. John Appleby Post author

              HANA is memory bandwidth constrained, that’s why SGI has so many DIMMs. Also, smaller 32GB DIMMs are much much cheaper.

              If they updated the p795, you’d have 256 cores, which would support in the area of 24TB – if there were enough DIMM sockets.

              It’s not just about how much you can cram in a system – it has to work well, and HANA is a memory bandwidth hog 🙂

              1. Henrique Pinto

                I don’t think anyone making the case to adopt HANA on Power is focusing on costs as their main KPI (though it’s of course important, I’d say reliability would probably play a larger role).

                Nevertheless, I do believe HoP will be a viable & scalable option in the near future. Let’s see how it rolls tho. IBM is not going through their best time these days…

  3. martin daeufel

    Is it still the current wisdom that Business Suite systems should be scale-up only and not scale-out? Is that ever going to change? The reason I am asking is that in order to do HA, which is required for business-critical systems like ECC, you would have to purchase another large system of the same size as a standby.

    1. John Appleby Post author

      Yes it is. Currently the focus is on data reduction, rather than scale-out.

      Will that change? I don’t think so, because with HANA, systems get smaller, whilst hardware keeps gaining in capacity. We will in the end have hybrid scale up-and-out for large systems which combine scale-up for transactional workloads and scale-out for pure-analytics workloads and warm data.

      But what the SAP team found was clear: with today’s CPUs, RAM and interconnects, there is a cost penalty of using scale-out for transactional workloads. Combine that with the pace of increase of DRAM sizes and multi-core architectures, and the data reduction that HANA allows, and scale-out just isn’t necessary.

      HA – it depends on the RTO you require. If you just require a 15-minute RTO then you can share HA and QA hardware, and double the disk. If you require a low RTO, then you do indeed need two systems.

  4. Mahendra Paruchuri

    Hello John,

    One of our customers would like to implement Business Suite on HANA (S/4HANA) with Simple Finance 2.0. We understand that for failover we can use the replication concept for the HANA production box. We would like to know if S/4HANA has clustering support rather than replication?

    1. John Appleby Post author

      So you can do a few things – you can do TDI and share the storage between two nodes and use storage remapping, or you can use system replication.

      I’d argue that both of those are cluster scenarios but the clustering happens at a different level.

      Either will work and each has TCO and RTO benefits.

  5. Jonathan Haun

    John,

    In response to:

    “The ratio between CPUs and RAM used in a HANA appliance is somewhat arbitrary – and relates to the certification process and benchmarking more than anything. Certainly, it’s possible to put 12TB of DRAM in an 8-socket HANA appliance”

    Has SAP released a document that outlines the ratio requirements by CPU generation? Looking for something simple like the following. I’m not sure that my examples are all correct so fill in the gaps if you can.

    Analytics:

    Intel Nehalem EX E7 – 128GB/Socket

    Intel Westmere EX E7 – 256GB/Socket

    Intel Ivy Bridge EX E7 – 256GB/Socket

    Intel Haswell EX E7 – ?GB/Socket


    SoH:

    Intel Nehalem EX E7 – 256GB/Socket

    Intel Ivy Bridge EX E7 – 256GB/Socket

    Intel Haswell EX E7 – ?GB/Socket

