HANA Scale-Out and Technical Deployment – The Case of the Standby Node
SAP HANA offers near-endless scalability by allowing one SID to be deployed on multiple hosts, called HANA nodes. In HANA terms, a deployment with multiple HANA nodes is an SAP HANA scale-out topology. The deployment can be physical or virtualized. The number of nodes can theoretically exceed one hundred, even though I personally have not come across this in production.
Depending on the application, different best practices exist in terms of the number of HANA nodes, and their size. Let me describe the three most typical use cases:
- SAP HANA as a Datamart
- SAP BW/4HANA based on NetWeaver
- SAP S/4HANA based on NetWeaver
For HANA as a Datamart, you can deploy any number of nodes of any size, as these systems are usually OLAP-only applications. For BW/4HANA, many vendors have certified up to 16 nodes, but we observe a tendency toward fewer, larger nodes, e.g. 3-6+ TB.
For S/4HANA, on the other hand, we recommend as few and as large nodes as possible: typically 2, sometimes 3 nodes with 12 to 24 TB, and more nodes only in exceptional cases. The reason for this recommendation lies in the OLTP nature of the application.
On a technical level, scale-out works the same way for each of these applications: you have one coordinator node and any number of worker nodes, each with its own data volume on storage, which communicate over an inter-node network.
High availability in scale-out deployments
From a technical architecture perspective, the question is often how to ensure high availability for an SAP HANA scale-out deployment. I assume everyone knows about HANA System Replication and Storage Replication (High Availability for SAP HANA | SAP Help Portal). The majority of customer implementations are based on HANA System Replication because of, for example, its block-level corruption check or additional features like multi-target replication; storage replication makes sense only in rare cases. An additional solution for SAP HANA scale-out is local high availability (HA), achieved by configuring an empty standby node. This solution is called Host Auto-Failover.
If a node fails, the standby node is attached to the data volume of the failed node and loads its data, taking over its role. In the early days especially, there were numerous issues with memory DIMMs and system boards, so there were guidelines like: for each batch of 8 nodes, add 1 standby node. Or for every 5 nodes. However, at the time, the nodes were also quite small, with 256 GB, 512 GB, and only exceptionally 1 TB. As a basic rule of thumb, it takes about 10 minutes to load 1 TB, depending on the storage throughput.
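To get a feel for these numbers, here is a minimal Python sketch of the two rules of thumb above. The function names are made up for illustration, the 10 minutes per TB is only the rough figure quoted in the text, and rounding a partial batch up to a full standby node is my own assumption about how the historic guideline was applied.

```python
import math


def estimated_reload_minutes(data_tb: float, minutes_per_tb: float = 10.0) -> float:
    """Rough time for a standby node to load a failed node's data.

    minutes_per_tb is the rule-of-thumb value from the text; the real
    figure depends on the storage throughput.
    """
    if data_tb < 0 or minutes_per_tb <= 0:
        raise ValueError("data size must be >= 0 and rate must be > 0")
    return data_tb * minutes_per_tb


def standby_nodes_for(worker_nodes: int, batch_size: int = 8) -> int:
    """Historic guideline: one standby node per batch of worker nodes.

    Assumption: a partially filled batch still gets its own standby node.
    """
    if worker_nodes < 0 or batch_size <= 0:
        raise ValueError("node count must be >= 0 and batch size must be > 0")
    return math.ceil(worker_nodes / batch_size)


# Compare the small nodes of the early days with a large node of today.
for size_tb in (0.25, 0.5, 1, 6):
    print(f"{size_tb} TB -> about {estimated_reload_minutes(size_tb):.1f} min")
```

The point of the comparison: with node sizes of 1 TB or less, a reload stays within minutes, while a multi-terabyte node turns the same failover into a noticeably longer outage.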
At the time, HANA System Replication had only just been introduced, and “self-healing” VM features (like auto restart) were not available. Most systems ran on bare-metal boxes with exceptionally long reboot cycles. Therefore, the standby node was a good option. At the time.
Today, the go-to deployment for high availability is HANA System Replication. The hyperscalers and SAP RISE, for example, use it as their only solution.
If you deploy a system on premise, I recommend reconsidering the need for a standby node. It protects against the physical failure of a single node, for example. If 2 nodes fail at the same time, you are not covered. A failure of the data center (DC) is not covered either.
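The coverage limit of a single standby node can be illustrated with a toy model in Python. This is purely a sketch of the idea, not HANA configuration or any SAP API; the class, the host names, and the volume names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ScaleOutSystem:
    """Toy model: active nodes serve a data volume, standby nodes are empty."""
    active: Dict[str, str]                              # host -> data volume
    standby: List[str] = field(default_factory=list)    # idle standby hosts
    orphaned: List[str] = field(default_factory=list)   # volumes nobody serves

    def fail(self, host: str) -> bool:
        """Simulate a host failure; return True if a standby took over."""
        volume = self.active.pop(host)
        if self.standby:
            # The standby is attached to the failed node's data volume.
            replacement = self.standby.pop(0)
            self.active[replacement] = volume
            return True
        # No standby left: the volume stays unserved.
        self.orphaned.append(volume)
        return False


system = ScaleOutSystem(
    active={"hana01": "vol1", "hana02": "vol2", "hana03": "vol3"},
    standby=["hana04"],
)
first = system.fail("hana01")    # the single standby takes over vol1
second = system.fail("hana02")   # no standby left, vol2 is not served
print(first, second, system.active, system.orphaned)
```

A data center failure is not modeled at all here, because the standby node sits in the same data center as the nodes it protects.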
Below please find a simple decision tree to decide whether a standby node is reasonable.
- If you deploy “small” nodes, e.g. with up to 2 TB, and you are not using HANA System Replication (HSR), a standby node is a reasonable option. If you do use HSR, a standby node failover is definitely slower than a takeover, but it might still make sense for dealing with hardware failures.
- If you deploy large HANA nodes, e.g. 4+ TB, and you have relaxed SLAs, e.g. availability below 99.7%, a standby node could make sense, but I would recommend evaluating HSR nonetheless.
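The decision tree can also be written down as a small Python function. Treat it as a sketch under assumptions: I read the 99.7 as a yearly availability percentage (roughly 26 hours of allowed downtime per year), the function names and return strings are my own, and the fallback for cases the two rules do not mention (for example nodes between 2 and 4 TB) simply follows the general preference for HSR.

```python
from typing import Optional

HOURS_PER_YEAR = 365 * 24


def allowed_downtime_hours(availability_percent: float) -> float:
    """Yearly downtime budget implied by an availability SLA."""
    return (100.0 - availability_percent) / 100.0 * HOURS_PER_YEAR


def standby_node_advice(node_size_tb: float, uses_hsr: bool,
                        availability_percent: Optional[float] = None) -> str:
    """Encode the two rules above; thresholds are the ones from the text."""
    small = node_size_tb <= 2
    large = node_size_tb >= 4
    relaxed_sla = (availability_percent is not None
                   and availability_percent < 99.7)

    if small and not uses_hsr:
        return "reasonable"
    if small and uses_hsr:
        return "optional: slower than a takeover, may help with hardware failures"
    if large and relaxed_sla:
        return "could make sense, but evaluate HSR"
    # Assumption: everything else falls back to the general recommendation.
    return "prefer HSR"


print(standby_node_advice(1, uses_hsr=False))
print(standby_node_advice(6, uses_hsr=False, availability_percent=99.5))
print(standby_node_advice(12, uses_hsr=True, availability_percent=99.9))
print(f"{allowed_downtime_hours(99.7):.1f} hours per year at 99.7%")
```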
For S/4HANA scale-out deployments, we generally recommend HSR because it is much faster and has broader protection.
Please share your experience with standby nodes: How often did you need them? What were the typical root causes? And did you have them in addition to HSR, or instead?