SAP HANA Host Auto-Failover is a fully automated fault recovery solution in which one or more standby hosts are added to an existing SAP HANA system and configured to run in standby mode. While in standby mode, a standby host holds no data and accepts no application requests, and, unlike with System Replication, no data is preloaded on it. You can think of it as a cluster-like HA solution within a single data center.

When it comes to high availability in SAP HANA, we should always aim for an RPO of zero (no data loss), especially in business-critical production environments. Host Auto-Failover is one of the two high availability options in SAP HANA that can provide zero data loss.

Figure 1: SAP HANA Host Auto-Failover in a minimal setup for HA

In Host Auto-Failover, SAP HANA regularly checks whether all cluster members are still active, and when an active host fails, a standby host automatically takes its place. The nameserver acts as an internal cluster manager and handles the entire failover process, so no 3rd party cluster software is needed; everything is handled within SAP HANA. The standby host needs access to all database volumes in this scenario, so there is a single data pool, and this can only be achieved with shared networked storage.

The failover process happens at the host level, so the failure of a single service or process won't trigger a failover. When the primary host fails, the standby host takes over its lock on the data pool and continues working from there, so no data is lost. Because the failover is fully automated and managed internally, care is needed to preserve data consistency. Data can be corrupted if the failed (previously active) host is restarted manually for recovery and attempts to write to the data pool while the failover is in progress, so avoid manual intervention during an auto-failover. A controlled failback can be performed by stopping or restarting the standby host that is currently in use.

Figure 2: SAP HANA Host Auto-Failover in a scale-out scenario

To ensure data consistency, SAP introduced two capabilities:

Heartbeat is a regular TCP communication that checks whether the primary host is still active as master before another host attempts to take over the master role or perform a failover. It uses the SAP HANA internal communication protocol and can run from nameserver to nameserver between hosts, or from nameserver to hdbdaemon.
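
The heartbeat idea can be illustrated with a small model. This is only a minimal sketch and not SAP HANA's actual implementation: the class name `HeartbeatMonitor`, the 10-second interval and the threshold of three missed beats are hypothetical values chosen for illustration. Timestamps are passed in explicitly so the logic is easy to follow.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Toy model of heartbeat-based failure detection (not SAP HANA code)."""

    interval_s: float = 10.0   # assumed time between heartbeats
    max_missed: int = 3        # assumed number of missed beats tolerated
    last_seen: dict = field(default_factory=dict)

    def record(self, host: str, now: float) -> None:
        """Store the time at which a heartbeat from `host` arrived."""
        self.last_seen[host] = now

    def is_alive(self, host: str, now: float) -> bool:
        """A host counts as alive while its silence stays within the allowed window."""
        if host not in self.last_seen:
            return False
        return (now - self.last_seen[host]) <= self.interval_s * self.max_missed

    def failed_hosts(self, now: float) -> list:
        """Return the hosts whose heartbeats have been missing for too long."""
        return sorted(h for h in self.last_seen if not self.is_alive(h, now))


monitor = HeartbeatMonitor()
monitor.record("host1", now=0.0)
monitor.record("host2", now=0.0)
monitor.record("host2", now=25.0)   # host2 keeps answering, host1 goes silent
print(monitor.failed_hosts(now=40.0))
```

In this model a failover would only be considered for hosts returned by `failed_hosts`, which mirrors the point above: the check comes first, the takeover second.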

I/O Fencing is the process of isolating a failed node and protecting the shared data pool, ensuring that the failed primary host no longer has access to the data or log volumes. This can be achieved via the SAP HANA storage connector APIs.
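
To make the fencing idea concrete, here is a conceptual sketch. It does not use or imitate the SAP HANA storage connector API; `SharedVolume`, `FencedError` and `take_over` are hypothetical names for a toy model in which a shared volume accepts writes only from its current owner.

```python
class FencedError(Exception):
    """Raised when a host that does not own the volume tries to write."""


class SharedVolume:
    """Toy shared volume that accepts writes only from its current owner."""

    def __init__(self, owner: str) -> None:
        self.owner = owner
        self.log = []

    def write(self, host: str, record: str) -> None:
        """Append a record, but only if `host` currently owns the volume."""
        if host != self.owner:
            raise FencedError(f"{host} is fenced off; owner is {self.owner}")
        self.log.append((host, record))

    def take_over(self, new_owner: str) -> None:
        """Fencing step: the old owner loses access before the new host writes."""
        self.owner = new_owner


volume = SharedVolume(owner="host1")
volume.write("host1", "commit 1")

volume.take_over("standby")            # host1 has failed, standby takes the lock
volume.write("standby", "commit 2")

try:
    volume.write("host1", "stale write")   # restarted old host tries to write
except FencedError as err:
    print("rejected:", err)
```

The rejected write at the end is the situation described earlier: a manually restarted host trying to write in parallel with the failover. With fencing in place, that write never reaches the data pool.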

| Host | Indexserver (configured role) | Indexserver (actual role) | Nameserver (configured role) | Nameserver (actual role) |
| --- | --- | --- | --- | --- |
| Initial host | Worker | Master | Master 1 | Master |
| 1st added | Worker | Slave | Master 2 | Slave |
| 2nd added | Worker | Slave | Slave | Slave |
| 3rd added | Standby | Standby | Master 3 | Slave |

Table 1: An example configuration for a Multiple-Host System in a scale-out scenario
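
The configured and actual roles in Table 1 can also be read programmatically. The sketch below simply encodes the table rows and derives two facts from them; the `HostRoles` class and both helper functions are hypothetical names for illustration and are not part of any SAP tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HostRoles:
    host: str
    indexserver_config: str
    indexserver_actual: str
    nameserver_config: str
    nameserver_actual: str


# Rows copied from Table 1; the host labels are the table's own.
TABLE_1 = [
    HostRoles("Initial host", "Worker", "Master", "Master 1", "Master"),
    HostRoles("1st added", "Worker", "Slave", "Master 2", "Slave"),
    HostRoles("2nd added", "Worker", "Slave", "Slave", "Slave"),
    HostRoles("3rd added", "Standby", "Standby", "Master 3", "Slave"),
]


def master_candidates(rows):
    """Hosts whose configured nameserver role is 'Master N', ordered by N."""
    candidates = [r for r in rows if r.nameserver_config.startswith("Master")]
    candidates.sort(key=lambda r: int(r.nameserver_config.split()[1]))
    return [r.host for r in candidates]


def standbys_in_use(rows):
    """Configured standby hosts that currently hold a non-standby role."""
    return [
        r.host
        for r in rows
        if r.indexserver_config == "Standby" and r.indexserver_actual != "Standby"
    ]


print(master_candidates(TABLE_1))
print(standbys_in_use(TABLE_1))
```

For the configuration in Table 1, the second list is empty because the standby host is idle. After a failover, the standby's actual indexserver role would differ from its configured role, and it would show up in that list.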

Host Auto-Failover is a great HA option in scale-out scenarios and is easy to set up: you configure one or more hosts as standby, as you can see above. To add hosts to an existing SAP HANA system, you can use the SAP HANA database lifecycle manager (HDBLCM) or its web interface. You can also monitor the status of all active and standby hosts in the SAP HANA cockpit and in the SAP HANA studio (Landscape -> Hosts tab).

Key benefits

  • RPO of zero (no data loss)
  • Automated process managed internally by the SAP HANA nameserver; no additional 3rd party cluster management software required
  • Low RTO: failover execution time is similar to an SAP HANA startup
  • Failure of the primary host is detected in less than a minute
  • Networked storage *may* lower your HANA hardware costs

Trade-offs

  • Data is not preloaded, so recovery time is slightly higher than with System Replication (but no longer than an SAP HANA startup)
  • Detecting failures caused by network issues can take around 5-7 minutes

Do you have any questions about SAP HANA Host Auto-Failover? Leave a comment below; I would love to help you and learn from you as much as I can!




 


4 Comments


  1. Wajid Mir

    Dear Alper,

    Thank you for an excellent article; it is very detailed and valuable! Could you please comment on the few questions below?

    1. What is the minimum version of HANA (on-premise) that comes with the standard cluster manager, without the need for a 3rd party cluster manager?
    2. Also, when Server1 fails and the standby server takes over, and the standby server then fails as well, will Server1 ‘automatically’ take over?
    3. The blog mentions that “The failover process happens on the host level, so failure of a single service or process won’t trigger failover”. How is HA supported when a single HANA DB service fails on the host? Is it then the job of the watchdog service to restart the failed HANA service?

    Thanks,

    Wajid.

    1. Alper Somuncu Post author

      Hi Wajid,

      Sorry for my late response; I am usually more active on LinkedIn.

      1- System Replication needs a 3rd party cluster manager if you need automatic failover, so your only option is Host Auto-Failover. I am not sure when it was first introduced, but to get the most features I would recommend being on at least HANA 1.0 SPS11.

      2- If the primary fails, the secondary becomes the new primary. If the secondary fails (or if you just restart it), the primary takes over, provided it has recovered from its failed state.

      3- That’s right. It is supported through the watchdog service, which is basically a service recovery function. When one of the processes stops due to an error, the watchdog service automatically picks that up (usually in less than a minute) and tries to recover or restart it.

      Kind regards,

      Alper
