Best practice: Loss of a cluster node
In this blog, I want to show you how to recreate a lost cluster node.
If you lose a Windows host (a physical host or a virtual machine), you should restore it from a snapshot/checkpoint or by using a traditional backup application. The backup restores not only Windows but also all necessary “System State” information and the cluster configuration.
If there is no backup available, then you have to install a fresh Windows from scratch and join it to the failover cluster. This blog will describe the steps necessary to bring back a lost cluster node.
In this example, we use this cluster:
Two SAP ASCS instances are installed, “PAC” and “PUC”.
The cluster consists of two cluster nodes and uses a shared disk as quorum witness.
Let’s assume cluster node 2 (wsiv1003-2) has been lost.
Operations continue on the first cluster node, and the lost node appears in the cluster as “down”.
We assume that cluster node 2 is destroyed and cannot be restored.
The next steps should not affect operations on the remaining cluster node. However, unexpected results during installation, cluster join, or the cluster validation test may still affect operations.
Recommendation: Prepare the new cluster node during a maintenance downtime! It’s safer!
- Clean up the failover cluster configuration. Evict the lost cluster node from the cluster. You now have a one-node cluster.
- Prepare the new cluster node / install a new Windows. Install a fresh Windows host, here it’s “wsiv1003-2”. Make sure it’s the same or higher OS version! Example: You lost a Windows Server 2012 R2 node. You can install a Windows Server 2012 R2 or a Windows Server 2016. See Microsoft documentation.
- Configure networking (cluster public + heartbeat interfaces, plus additional if needed).
- Join the Windows host to the Active Directory (Windows domain).
- Run Windows Update to install the latest Windows patches.
- Add the Windows Feature “Failover Clustering”
- Add additional software according to your rules, for example, antivirus scanner, monitoring agents, etc.
- Join the host to the cluster
The cluster now consists of two cluster nodes again. However, you will not yet be able to fail over a database or SAP cluster group to this “new” cluster node!
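The cluster-level part of the steps above (evict, add the feature, join) can also be written down as a small runbook. The following is only a sketch: it prints PowerShell command lines from the FailoverClusters and ServerManager modules and does not run anything. The cluster name "wsiv1003" and the surviving node name "wsiv1003-1" are assumptions for illustration; only "wsiv1003-2" comes from the example in this blog.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    run_on: str   # host on which the command is meant to be run
    command: str  # PowerShell command line, kept as plain text only


def rebuild_plan(cluster: str, surviving_node: str, lost_node: str) -> list[Step]:
    """Return the cluster-level commands for replacing a lost node, in order."""
    return [
        # Evict the dead node so the cluster no longer expects it.
        Step(surviving_node, f"Remove-ClusterNode -Name {lost_node} -Cluster {cluster} -Force"),
        # On the freshly installed host: add the Failover Clustering feature.
        Step(lost_node, "Install-WindowsFeature -Name Failover-Clustering -IncludeManagementTools"),
        # Join the rebuilt host to the existing cluster.
        Step(surviving_node, f"Add-ClusterNode -Name {lost_node} -Cluster {cluster}"),
        # List the nodes to confirm that both are reported as up.
        Step(surviving_node, f"Get-ClusterNode -Cluster {cluster}"),
    ]


if __name__ == "__main__":
    # "wsiv1003" and "wsiv1003-1" are hypothetical names, not taken from the blog.
    for step in rebuild_plan("wsiv1003", "wsiv1003-1", "wsiv1003-2"):
        print(f"[{step.run_on}] {step.command}")
```

Networking, the domain join, Windows Update and additional software are left out on purpose, because they depend on your own environment and rules.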
The next steps require a downtime!
- Install database software … if there is a clustered database running on this cluster, consult the database vendor’s installation guide on how to add a cluster node to an existing clustered database.
- Install the ntclust.sar package (saprc.dll). Download the latest NTCLUST.SAR package and extract it, for example to C:\ntcluster. Install the latest C runtime package for Visual Studio 2013, x64.
See here: https://blogs.sap.com/2017/06/13/c-runtimes-needed-to-run-sap-executables/. Start the insaprct.exe tool to upgrade all cluster nodes: insaprct.exe -install. SAPRC.DLL has now been installed on cluster node 2 and upgraded on cluster node 1.
- Install SAP related apps. Start SWPM (sapinst.exe) on cluster node 2. Choose the “Additional Cluster Node” option. Do not use the “First Cluster Node” option! SWPM will recognize the existing installation and configuration on cluster node 1 and offer cluster groups for the “PAC” and “PUC” systems. In this example, I choose “PAC”. SWPM completes the installation on cluster node 2, and the “PAC-Clustergroup” can now be moved to this node. Start SWPM for each SAP system you have clustered and complete the “Additional Cluster Node” option.
- Test, test, and one more time, test! Do intensive testing! Move the cluster group several times. For example, kill the process “msg_server.exe” – an automatic failover must occur and must succeed! After successful testing, make a backup!
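To make the testing repeatable, the move and kill steps can be collected in a short test plan. This is a minimal sketch that only builds PowerShell command lines (Move-ClusterGroup and Stop-Process); it does not execute them. The group name you pass in is up to you, and the sketch assumes that the first node in the list is the current owner of the group.

```python
from itertools import cycle, islice


def failover_test_plan(group: str, nodes: list[str], moves: int = 4) -> list[str]:
    """Return command lines that move a cluster group back and forth, then kill msg_server."""
    # Rotate the node list so that the first move leaves the assumed current owner (nodes[0]).
    targets = islice(cycle(nodes[1:] + nodes[:1]), moves)
    plan = [f'Move-ClusterGroup -Name "{group}" -Node {node}' for node in targets]
    # Simulate a crash of the message server on the owning node; the cluster
    # is then expected to fail the group over automatically.
    plan.append("Stop-Process -Name msg_server -Force")
    return plan


if __name__ == "__main__":
    # The group name below is a placeholder; use the name shown in your cluster.
    for line in failover_test_plan("PAC-Clustergroup", ["wsiv1003-1", "wsiv1003-2"]):
        print(line)
```

The Stop-Process line has to be run on the node that currently owns the group, and you still have to check by hand that the failover really succeeded.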
Congratulations! You have reinstalled a lost cluster node!
- If you have lost an older ABAP PAS whose folder name is “DVEBMGSxx”, you don’t have to reinstall an AAS with the same folder name! It’s absolutely fine if you have only ABAP application server instances with folder name “Dxx”, for example D01, D10.
It is no longer mandatory to have one ABAP instance with the old “DVEBMGSxx” name.
- If you have lost a JAVA application server instance, the procedure is similar.
- Additional cluster groups may contain additional third-party applications. Consult the documentation for these applications and find out what must be done on a new cluster node.
- SAP cluster groups may contain additional third-party applications. Consult the documentation for these applications and find out what must be done on a new cluster node.
Is there any best practice around which kind of quorum witness (shared disk vs. file share) to use for Windows MSCS failover clusters?
If you have an odd number of cluster nodes, no additional quorum witness (which can be a disk, a file share or a cloud witness) is needed.
If you have an even number (for example the common 2 node cluster), then there must be a witness for the quorum information.
Which witness to use depends on your setup. If you have a storage device and therefore have shared disks anyway, you can configure an additional, small disk as the quorum witness.
If you don’t use shared disks in your cluster, you can use a remote file share, for example on a NAS storage, as the quorum witness.
And if you install clusters in Microsoft’s Azure cloud, then a cloud witness will be used to store the quorum information.
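The odd/even rule follows from simple majority arithmetic, which the sketch below illustrates together with the witness choice described above. It assumes a static model with one vote per node plus one vote for a witness, and it ignores the dynamic quorum behavior of newer Windows Server versions. All function names are made up for this example.

```python
def quorum_summary(nodes: int, witness: bool) -> dict:
    """Static majority arithmetic: one vote per node, plus one for a witness."""
    votes = nodes + (1 if witness else 0)
    majority = votes // 2 + 1  # strictly more than half of all votes
    return {
        "votes": votes,
        "majority": majority,
        "tolerated_vote_losses": votes - majority,
    }


def witness_needed(nodes: int) -> bool:
    """Rule from the reply above: an even number of nodes needs a witness."""
    return nodes % 2 == 0


def witness_type(in_azure: bool, has_shared_disks: bool) -> str:
    """Pick a witness type following the three cases described in the reply."""
    if in_azure:
        return "cloud witness"
    if has_shared_disks:
        return "disk witness"
    return "file share witness"


if __name__ == "__main__":
    print(quorum_summary(2, witness=False))  # two votes: losing one loses the majority
    print(quorum_summary(2, witness=True))   # three votes: one loss is tolerated
    print(witness_type(in_azure=False, has_shared_disks=True))
```

In the two-node case without a witness, both votes are needed for a majority, so the loss of a single node would already stop the cluster. With a witness there are three votes and one of them may be lost.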
A cluster with shared disks requires a storage box connected to all cluster nodes, usually Fibre Channel based storage. If the cluster nodes are geographically dispersed, then you need storage boxes that replicate the disks.
=> expensive, but proven, stable solution with good performance.
If you don’t use shared disks in the cluster, then you put SAPGLOBALHOST on a remote file share (outside the cluster), for example on a NAS device. This is usually the cheaper solution.