How to achieve Zero or near-Zero Downtime for DB-failover using SAP HANA System Replication
For quite some time, I have been working with the team on SAP HANA System Replication. The work is mainly focused on an HA/DR POC for CRM on HANA or SoH.
CRM on HANA is a scale-up solution. For the HA part, we prefer SAP HANA System Replication within the same datacenter, whereas for DR we leverage storage replication across datacenters.
There are two aspects:
– SAP HANA System Replication setup/failover Testing
– CRM HA: Extend SAP HANA System Replication as an HA solution for CRM
Because the CRM system is business-critical, its HA failover should be automatic, with zero data loss.
There are some technical points to note in this regard:
SAP HANA System Replication is primarily a Disaster Tolerance (DT) / Disaster Recovery (DR) Solution and NOT a full-fledged HA solution.
• HANA System Replication is NOT Host Auto-Failover
• HANA System Replication synchronizes data between two data centers (Site A and Site B)
• HANA System Replication works only for Scale Up
In this blog, I will discuss how SAP HANA System Replication can be turned into an automated failover solution. I will not cover how to set up the systems for SAP HANA System Replication.
My recommendation is as follows; in my view, it is the best solution available in the industry today:
A combination of the SUSE Linux Enterprise High Availability Extension cluster (SLES HAE) with SAP HANA System Replication. As of today, however, SLES HAE takes care of the HANA database only; it is not fully SAP application-aware.
Without SLES HAE:
Yes, HANA System Replication can still be used as an HA solution, provided that the connections from database clients that were configured to reach the primary system are automatically “diverted” to the secondary system after a failover (via IP redirection, DNS redirection, etc.), along with the SAP HANA Service Auto-Restart watchdog function. But again, we then have to take care of the Host Auto-Failover functionality ourselves.
Remember, in this way, SAP HANA System Replication can be used as the main HA failover mechanism for zero or near-zero downtime during maintenance or failures.
The procedure below assumes the following:
– SAP HANA System Replication is already configured as per the SAP standard guide.
– DB takeover from the primary to the secondary node works without errors.
– The people/team involved have the required skill set and the proper access and authorization to perform the activity.
Preparation at the ABAP Application Server:
– Set rdisp/max_wprun_time to a value greater than its default of 300 seconds. The value should be longer than the time the DB takeover from the primary to the secondary node takes.
– Set the parameter rdisp/wp_auto_restart = 0
– Set the parameter dbs/hdb/quiesce_check_enable to “1” (default value is 0).
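As a rough illustration, the three settings above could be sanity-checked before the takeover window. This is only a minimal Python sketch and not an SAP tool: parse_profile and check_preparation are hypothetical helper names, the profile is assumed to consist of "name = value" lines with "#" comments, and the takeover duration is a number you would have to measure yourself during failover tests.

```python
def parse_profile(text):
    """Parse 'name = value' lines of a profile, skipping blanks and '#' comments."""
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        params[name.strip()] = value.strip()
    return params


def check_preparation(params, takeover_seconds):
    """Return a list of problems; an empty list means all three settings look right."""
    problems = []

    # The work process runtime limit must exceed the measured takeover duration.
    raw = params.get("rdisp/max_wprun_time", "")
    if not raw.isdigit() or int(raw) <= takeover_seconds:
        problems.append("rdisp/max_wprun_time must be set above the takeover duration")

    # Both remaining parameters must be set explicitly to the values from the list above.
    if params.get("rdisp/wp_auto_restart") != "0":
        problems.append("rdisp/wp_auto_restart must be 0")
    if params.get("dbs/hdb/quiesce_check_enable") != "1":
        problems.append("dbs/hdb/quiesce_check_enable must be 1")

    return problems


if __name__ == "__main__":
    sample = """# hypothetical profile excerpt
rdisp/max_wprun_time = 1800
rdisp/wp_auto_restart = 0
dbs/hdb/quiesce_check_enable = 1
"""
    for problem in check_preparation(parse_profile(sample), takeover_seconds=600):
        print(problem)
```

The check deliberately treats a missing parameter as a problem, so that nothing depends on remembering default values.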
Just before the takeover, we have to create a file named “hdb_quiesce.dat” using the touch command in the DIR_GLOBAL directory (i.e., /usr/sap/<SAP_SID>/SYS/global).
This suspends the connection between the application server and the database server (the primary node, in this case); one can verify this with the R3trans command.
Newly started ABAP processes do not open a connection to the database until the file is removed. The SAP application checks whether the file named “hdb_quiesce.dat” still exists in the DIR_GLOBAL directory at the interval set by the dynamic profile parameter dbs/hdb/quiesce_sleeptime (default value is 5 seconds). So, once the secondary DB node is fully active, one can check connectivity with the R3trans command; if it is successful, we have to remove the “hdb_quiesce.dat” file. Now the application can connect to the HANA database again, but it is actually connecting to the secondary node. Once the activity is over, the parameter values can be reset.
During the DB takeover itself, we have to make the secondary DB node the default DB node for the SAP application. The required IP address change and the restart of network services should be performed via scripts to avoid confusion and errors.
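The sequence described above (create the marker file, perform the takeover, wait until the database answers again, remove the marker file) can be sketched roughly as follows. This is a minimal Python sketch under stated assumptions, not the scripts we used: takeover_with_quiesce, do_takeover and db_is_reachable are hypothetical names. The two callables are supplied by the caller and would wrap your own takeover/IP-switch scripts and your R3trans connectivity check.

```python
import time
from pathlib import Path

QUIESCE_FILE = "hdb_quiesce.dat"  # file name from the procedure above


def suspend_db_connections(global_dir):
    """Create the marker file in DIR_GLOBAL (same effect as 'touch')."""
    marker = Path(global_dir) / QUIESCE_FILE
    marker.touch()
    return marker


def resume_db_connections(global_dir):
    """Remove the marker file so work processes may reconnect."""
    (Path(global_dir) / QUIESCE_FILE).unlink(missing_ok=True)


def takeover_with_quiesce(global_dir, do_takeover, db_is_reachable,
                          timeout_s=600.0, poll_s=5.0):
    """Suspend connections, run the takeover, resume once the DB answers.

    do_takeover and db_is_reachable are caller-supplied callables
    (hypothetical wrappers around site-specific scripts and checks).
    Returns True if the marker was removed, False on timeout; on timeout
    the marker is left in place so that an operator can decide what to do.
    """
    suspend_db_connections(global_dir)
    do_takeover()
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if db_is_reachable():
            resume_db_connections(global_dir)
            return True
        time.sleep(poll_s)
    return False
```

Leaving the marker file in place on timeout is a design choice of this sketch: the application stays suspended instead of reconnecting to a database that is not confirmed to be up.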
A little complicated, and not fully clear yet? For that reason, I have created a flowchart.
Flowchart for Host Auto-Failover while using SAP HANA System Replication
Hope it is clear now.
We have tested the whole scenario a few times, and it worked fine in every case.
The following restrictions need to be considered:
– Long-running database transactions like background jobs, etc. are not interrupted during this activity.
– Only the application-to-database connection is closed or suspended. External connections, e.g. the connection between this HANA system and the SAP Solution Manager system, are not interrupted.
– This activity applies only to the ABAP application server. Database connections from the Java stack are not interrupted.
BTW, since the connection from Solution Manager stays alive during the activity, one can leverage the auto-reaction method along with scripts to perform the whole scenario. We have tested that in our environment as well, and it worked smoothly.
For more details, consult SAP Note 1913302 – HANA: Suspend DB connections for short maintenance tasks.