How to achieve Zero or near-Zero Downtime for DB-failover using SAP HANA System Replication
For quite some time, I have been working with the team on SAP HANA System Replication, mainly for a CRM on HANA or SoH HA/DR POC.
CRM on HANA is a scale-up solution. For the HA part, we prefer SAP HANA System Replication within the same data center, whereas for DR we leverage storage replication across data centers.
There are two aspects:
– SAP HANA System Replication setup/failover Testing
– CRM HA : Extend SAP HANA System Replication as an HA solution for CRM
Because the CRM system is business-critical, its HA failover should be an automatic failover with zero data loss.
There are some technical points in this regard –
SAP HANA System Replication is primarily a Disaster Tolerance (DT) / Disaster Recovery (DR) Solution and NOT a full-fledged HA solution.
• HANA System Replication is NOT Host Auto-Failover
• HANA System Replication synchronizes data between two data centers (Site A and Site B)
• HANA System Replication works only for Scale Up
In this blog, I discuss the possibility of turning SAP HANA System Replication into an automated failover. I will not cover how to set up the systems for SAP HANA System Replication.
My recommendation for the above is as follows, which I consider the best solution in the industry as of today:
A combination of the SUSE Linux Enterprise High Availability Extension cluster (SLES HAE) with SAP HANA System Replication. As of today, however, SLES HAE takes care of the HANA database only; it is not fully SAP application-aware.
Without SLES HAE:
Yes, HANA System Replication can be used as an HA solution, provided that the connections from database clients that were configured to reach the primary system are automatically “diverted” to the secondary system after a failover (via IP redirection, DNS redirection, etc.), together with the SAP HANA Service Auto-Restart watchdog function. But again, we have to take care of the Host Auto-Failover functionality.
Remember, used this way, SAP HANA System Replication can serve as the main HA failover mechanism for zero or near-zero downtime during maintenance or failures.
This assumes the following:
– SAP HANA System Replication is already configured as per the SAP standard guide.
– DB takeover from the primary to the secondary node works correctly.
– The people/team have the required skill set and the proper access and authorization to perform the activity.
Preparation at ABAP Application Server :
– Set rdisp/max_wprun_time to a value greater than its default of 300 seconds. It should be longer than the time the DB takeover from the primary to the secondary node takes.
– Set the parameter rdisp/wp_auto_restart = 0
– Set the parameter dbs/hdb/quiesce_check_enable to “1” (default value is 0).
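The three settings above can be checked before a planned takeover. The following is a minimal sketch, not an SAP tool: the function names `parse_profile` and `check_failover_profile`, the `takeover_seconds` argument, and the assumption that the profile is plain `name = value` text are all illustrative. An unset rdisp/max_wprun_time is reported as a problem because the sketch does not assume any default value.

```python
def parse_profile(text):
    """Parse 'name = value' lines; blank lines and '#' comment lines are skipped."""
    params = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        params[name.strip()] = value.strip()
    return params


def check_failover_profile(params, takeover_seconds):
    """Return a list of problems; an empty list means all three checks pass.

    takeover_seconds is a hypothetical input: the measured duration of the
    DB takeover in your own landscape.
    """
    problems = []
    wprun = params.get("rdisp/max_wprun_time")
    if wprun is None or int(wprun) <= takeover_seconds:
        problems.append("rdisp/max_wprun_time must exceed the takeover duration")
    if params.get("rdisp/wp_auto_restart") != "0":
        problems.append("rdisp/wp_auto_restart should be 0")
    if params.get("dbs/hdb/quiesce_check_enable") != "1":
        problems.append("dbs/hdb/quiesce_check_enable should be 1")
    return problems
```

Calling `check_failover_profile(parse_profile(profile_text), takeover_seconds=600)` returns the list of settings that still need attention.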
Just before the takeover, we have to create a file named “hdb_quiesce.dat” using the touch command in the DIR_GLOBAL directory (i.e., /usr/sap/<SAP_SID>/SYS/global).
This suspends the connection between the application server and the database server (the primary node, in this case); one can verify this with the R3trans command.
Newly started ABAP processes do not open a connection to the database until the file is removed. At the interval set by the dynamic profile parameter dbs/hdb/quiesce_sleeptime (default value 5 seconds), the SAP application checks whether the file “hdb_quiesce.dat” still exists in the DIR_GLOBAL directory. So, once the secondary DB node is fully active, one can check the connection with the R3trans command; if it is successful, we remove the “hdb_quiesce.dat” file. The application can now connect to the HANA database again, but actually to the secondary node. The parameter values can also be reset once the activity is over.
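The sequence of creating the file, performing the takeover, checking the connection, and removing the file can be sketched as below. This is an illustration under assumptions, not the procedure from the SAP Note: `run_quiesced`, `takeover`, and `db_reachable` are hypothetical names, where `takeover` stands for whatever triggers the DB takeover and `db_reachable` for a connection test such as a wrapper around R3trans.

```python
import time
from pathlib import Path

QUIESCE_FILE = "hdb_quiesce.dat"


def run_quiesced(global_dir, takeover, db_reachable,
                 poll_seconds=5.0, timeout_seconds=600.0):
    """Keep the quiesce file in place while the takeover runs.

    global_dir   -- the DIR_GLOBAL directory of the SAP system
    takeover     -- hypothetical callable that performs the DB takeover
    db_reachable -- hypothetical callable returning True once the
                    connection test against the new primary succeeds
    """
    marker = Path(global_dir) / QUIESCE_FILE
    marker.touch()  # same effect as the touch command
    takeover()
    deadline = time.monotonic() + timeout_seconds
    while not db_reachable():
        if time.monotonic() >= deadline:
            # The file is deliberately left in place so that work
            # processes stay suspended until someone intervenes.
            raise TimeoutError("database not reachable; quiesce file left in place")
        time.sleep(poll_seconds)
    marker.unlink()  # work processes may reconnect from now on
```

Leaving the file in place on timeout is a design choice of this sketch: it prefers a suspended application over one that reconnects to a database in an unknown state.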
During the DB takeover described above, however, we have to make the necessary changes so that the secondary DB node becomes the default DB node for the SAP application. The required IP address change and the restart of network services should be performed via scripts to avoid confusion and errors.
A little complicated and hard to follow? For that reason, I have created a flowchart.
Flowchart for Host Auto-Failover while using SAP HANA System Replication
Hope it is clear now.
We have tested the whole scenario a few times, and it worked fine in all cases.
There are some restrictions as follows, which need to be considered :
– Long-running database transactions like background jobs, etc. are not interrupted during this activity.
– Here, the application-to-database connection is closed or suspended. External connections, e.g. the connection between this HANA system and the SAP Solution Manager system, are not interrupted.
– This activity applies only to the ABAP application server. Database connections from the Java stack are not interrupted.
BTW, as the connection from Solution Manager stays alive during the activity, one can leverage the auto-reaction method along with scripts to perform the whole scenario. We have tested that in our environment as well, and it worked smoothly.
For more details, consult SAP Note 1913302 – HANA: Suspend DB connections for short maintenance tasks.
Sorry, I have to correct one statement in this blog I stumbled upon:
> HANA System Replication works only for Scale-up
This statement is wrong!
Of course SAP HANA System Replication also works for Scale-Out - no problem at all. The different name and index servers on the different hosts can easily find their counterparts on the secondary site to synchronize their activities and keep the shadow instance updated.
And, by the way, SAP HANA System Replication can easily be used for both use cases - HA and DR. How far apart the coupled systems are makes no difference from an operational point of view. For long distances you simply switch to ASYNC, while for distances of up to 50 to 100 km - or inside one DC - you can use SYNC.
With the help of cluster managers this solution can be fully automated. SAP is not creating the hundredth cluster manager on the market, one that could probably only be used for SAP HANA System Replication. And we will definitely not force customers to use only one specific cluster manager inside their data centers!
Instead, we are partnering with the market leaders in the cluster business, and customers can stay with the cluster manager they chose in the past.
As a lot of SAP customers have already decided to use LVM (Landscape Virtualization Manager) for post-copy automation (PCA) of the cloning process (PRD -> QA), LVM now also offers cluster features for SAP HANA; please see SAP Note http://service.sap.com/sap/support/notes/2050537
Thanks, Ralf, for correcting my mistake. I fully agree with the rest of your comments.
If I am not wrong, as per normal industry trends, system replication is mostly used for scale-up HANA systems.
At the same time, I am interested to know what extra precautions we should take for scale-out system nodes while using HANA System Replication - for example, table redistribution, pinning tables to a particular node, etc.
Could you please share your expert comments?
Hi Ralf Czekalla
Would you mind sharing which cluster software on the market currently supports multi-node (scale-out) HANA SR?
Hi Nicholas Chang,
One software product that currently manages scale-out HANA with HSR is PMS (including auto-failover).