How to test a Windows Failover cluster?

karl-heinz_hochmuth

This blog explains how to test a Windows Failover cluster environment. We will take a close look at the central services of an SAP system and check whether the cluster handles a failover correctly. Failover tests are mandatory before a high availability (HA) solution goes live in production. In addition, repeat the failover tests at least every 6 months to verify that the cluster still works as expected.

Prerequisites

You have a planned downtime for the SAP system which you want to test. Do not carry out other maintenance activities during the failover tests.
The SAP system is fully up and running: SAP MMC shows all instances as “green”.

1. Stop/Start test via SAP MMC (must not trigger a failover)

Open SAP MMC and the Failover Cluster Manager side by side so that you can watch in real time what happens.

In SAP MMC select the ASCS instance and stop it.

After the instance is stopped, you should see the following:

Failover Cluster Manager shows the SAP instance resource as offline. SAP MMC shows the instance in “gray” state (= not started), but the related SAP start service (sapstartsrv) is still running.
The application server instances show yellow, because they have lost the connection to the Message server.

Start the instance in SAP MMC again.

Expected result:

Failover Cluster Manager: SAP instance resource is online again (green)
SAP MMC: All resources become green again (the application server instances reconnect to the Message server, which can take ~10 seconds)

Goal:

Verify that the communication between SAP MMC <- sapstartsrv -> SAPRC.DLL works and that there are no communication or authorization problems.
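
The state shown in Failover Cluster Manager can also be read from PowerShell, which is convenient for documenting the test. The following is a minimal sketch, assuming the FailoverClusters PowerShell module is available on the node and that the SAP cluster group is called 'SAP <SID>' (a placeholder; use the group name shown in Failover Cluster Manager):

```powershell
# Assumption: 'SAP <SID>' is a placeholder for the name of your SAP cluster group.
$group = 'SAP <SID>'

# Owner node and overall state of the SAP cluster group
Get-ClusterGroup -Name $group | Format-Table Name, OwnerNode, State

# State of every resource in the group. While the ASCS instance is stopped,
# only the SAP instance resource is expected to be Offline.
Get-ClusterGroup -Name $group | Get-ClusterResource |
    Format-Table Name, ResourceType, State
```

Run it once while the instance is stopped and once after the restart, and keep the output with your test notes.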

2. Failover Test (planned failover)

Open SAP MMC and the Failover Cluster Manager side by side so that you can watch in real time what happens.
Check the last line in this ERS trace file on ALL cluster nodes (where ERS instances are running):

\\<hostname>\saploc\<SID>\ERSxx\work\dev_enrepha

You should see at this point in time:

On the ERS on the current cluster node (where ASCS runs): “Inactive”
On the ERS on the cluster node, which is a “possible owner” of the SAP cluster group: “Active”
On all other optional cluster nodes: “Inactive” (three or more cluster node scenarios)
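
Instead of opening the trace file on every node by hand, the last line can be collected in one go. This is only a sketch: it assumes the saploc share is reachable from the node where you run it, and <SID> and ERSxx are the same placeholders as in the path above and must be replaced with your values.

```powershell
# Assumption: replace <SID> and ERSxx with your system ID and ERS instance directory.
$nodes = (Get-ClusterNode).Name

foreach ($node in $nodes) {
    $trace = "\\$node\saploc\<SID>\ERSxx\work\dev_enrepha"
    if (Test-Path -LiteralPath $trace) {
        # Print only the last line of the ERS trace file of this node
        $last = Get-Content -LiteralPath $trace -Tail 1
        "{0}: {1}" -f $node, $last
    } else {
        "{0}: no ERS trace file found" -f $node
    }
}
```

Run the same loop again after the failover to compare the states.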

Use the Failover Cluster Manager tool to move the SAP cluster group to another node.

After the (hopefully) successful failover of the group, write down how long it took:
How long was the offline time (to stop the cluster group and its resources on the current cluster node)?
How long was the online time (to start all resources of the cluster group on the other cluster node)?

These times are the “default” in your cluster setup. An unplanned failover should show the same, or at least similar, times!
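
If you prefer to trigger the planned move from PowerShell, the total duration can be measured at the same time. This is a sketch under assumptions: 'SAP <SID>' and '<target-node>' are placeholders for your group and node names. It measures only the total move time; the split into offline and online time still has to be read from the cluster log or observed in Failover Cluster Manager.

```powershell
# Assumptions: both values are placeholders for your environment.
$group  = 'SAP <SID>'
$target = '<target-node>'

# Move-ClusterGroup waits for the move to complete when -Wait is not specified
$elapsed = Measure-Command {
    Move-ClusterGroup -Name $group -Node $target
}

"Planned move took {0:N1} seconds" -f $elapsed.TotalSeconds
Get-ClusterGroup -Name $group | Format-Table Name, OwnerNode, State
```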

After the failover has completed, check the last line of the ERS trace files again.

You should see after failover:

On the ERS on the current cluster node: “Inactive”
On the ERS on the cluster node where you started: “Active”
On all other optional cluster nodes: “Inactive” (three or more cluster node scenarios)

You can also check the dev_enqsrv trace file. The Enqueue server should have detected the shadow replication table of the formerly active ERS and should use this table to retrieve the locks of the SAP system.

3. Failover Test (unplanned failover)

Open SAP MMC and the Failover Cluster Manager side by side so that you can watch in real time what happens.

Open Windows Task Manager, navigate to the “Details” tab and select the SAP Message server process (msg_server.exe) or the SAP Enqueue server process (enserver.exe) from the list of processes.

Right-click the process and select “End Task”. This kills (= terminates) the process.

The Failover Cluster Manager should detect this within a few seconds (~ 3 seconds max.). The cluster should take all resources of the SAP cluster group offline and move the group to another cluster node. If you have configured “Possible Owners” to switch to a specific cluster node, the cluster will move the group to that node (three or more cluster node scenarios).
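
The same test can be driven from an elevated PowerShell session, which also gives you a timestamped record of the failover. This is a sketch, not a definitive procedure: 'SAP <SID>' is a placeholder for your group name, and the two-minute limit is an arbitrary choice. Note that Stop-Process -Name terminates every process with that name, so do not use it unchanged on a host that runs several SAP systems.

```powershell
# Assumption: 'SAP <SID>' is a placeholder for the name of your SAP cluster group.
$group = 'SAP <SID>'

# Terminate the Message server process (use 'enserver' to test the Enqueue server instead)
Stop-Process -Name 'msg_server' -Force

# Poll the group once per second for up to two minutes and log owner and state
$watch = [System.Diagnostics.Stopwatch]::StartNew()
while ($watch.Elapsed.TotalSeconds -lt 120) {
    $g = Get-ClusterGroup -Name $group
    "{0,6:N1}s  owner={1}  state={2}" -f $watch.Elapsed.TotalSeconds, $g.OwnerNode, $g.State
    # Stop once the group is online on a node other than this one
    if ($g.State -eq 'Online' -and $g.OwnerNode.Name -ne $env:COMPUTERNAME) { break }
    Start-Sleep -Seconds 1
}
```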

Check how long it takes to bring the following resources online:

IP + network name resources
FileServer resource (including shared disk with sapmnt share)
SAP Service resource
SAP Instance resource

Detailed information can be found in the cluster.log, which you can generate in a PowerShell session (with admin rights):

Get-ClusterLog -Destination C:\temp -UseLocalTime -TimeSpan 30

(this will generate the cluster logs for all cluster nodes with the events of the last 30 minutes)
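
The generated logs are large, so it helps to filter them for the SAP cluster group. A minimal sketch, assuming the logs were written to C:\temp as above and that the group name 'SAP <SID>' (a placeholder) appears in the relevant log lines:

```powershell
# Assumption: 'SAP <SID>' is a placeholder; search for your group or resource name instead.
Get-ChildItem -Path 'C:\temp' -Filter '*cluster.log' |
    Select-String -Pattern 'SAP <SID>' -SimpleMatch |
    Select-Object -Last 50 -Property Filename, LineNumber, Line
```

The timestamps of the matching lines show when each resource went offline and came online again.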

Goals:

The failover must work as designed
The failover should not take longer than ~1.5 minutes.
The failover includes an offline time, to bring all resources offline on the current node, and an online time, to start all resources on the other cluster node
Write down the times!
These times should not change in future failover tests!
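
To make the comparison with earlier tests repeatable, the recorded times can be checked with a small helper. This is an illustrative sketch only: the function name Test-FailoverTime, the 20 percent tolerance and the sample values are assumptions, while the 90 second limit corresponds to the ~1.5 minutes named above.

```powershell
# Hypothetical helper: compares a measured failover with the limit and a baseline.
function Test-FailoverTime {
    param(
        [double]$OfflineSeconds,         # time to take the group offline
        [double]$OnlineSeconds,          # time to bring the group online on the other node
        [double]$BaselineSeconds,        # total time recorded in an earlier test
        [double]$MaxSeconds = 90,        # ~1.5 minutes
        [double]$TolerancePercent = 20   # assumed acceptable deviation from the baseline
    )
    $total = $OfflineSeconds + $OnlineSeconds
    [pscustomobject]@{
        TotalSeconds = $total
        WithinLimit  = $total -le $MaxSeconds
        NearBaseline = [math]::Abs($total - $BaselineSeconds) -le ($BaselineSeconds * $TolerancePercent / 100)
    }
}

# Example with made-up numbers
Test-FailoverTime -OfflineSeconds 20 -OnlineSeconds 45 -BaselineSeconds 60
```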

4. Failover Test (test should not lead to a failover)

Open SAP MMC and the Failover Cluster Manager side by side so that you can watch in real time what happens.

If you have a dedicated heartbeat network connection between the cluster nodes, this test shows whether the cluster remains stable when the heartbeat network packets can only travel through the public network interface.

On the current cluster node where the SAP cluster group is running, disable the heartbeat network interface in Windows.
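
The interface can also be disabled from an elevated PowerShell session. In this sketch, 'Heartbeat' is a placeholder for the adapter name of your dedicated heartbeat interface, and the 30 second wait is an arbitrary choice. Re-enabling the adapter at the end is a cleanup step added here.

```powershell
# Assumption: 'Heartbeat' is a placeholder; list adapter names with Get-NetAdapter.
$adapter = 'Heartbeat'

# Cluster view of the networks before the test
Get-ClusterNetwork | Format-Table Name, Role, State

Disable-NetAdapter -Name $adapter -Confirm:$false
Start-Sleep -Seconds 30

# The heartbeat network should now report a problem, while all groups stay on their nodes
Get-ClusterNetwork | Format-Table Name, Role, State
Get-ClusterGroup   | Format-Table Name, OwnerNode, State

# Cleanup: re-enable the interface after the test
Enable-NetAdapter -Name $adapter
```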

Expected result:

Nothing should happen. Check the cluster logs: you should see related network / heartbeat errors on all cluster nodes. SAP operations must not be affected!

5. Failover Test (test will lead to a failover)

Same as test 4, but this time disable all network cards on the current cluster node!

Warning!
If you are connected to the Windows host via RDP, you will lose the connection and cannot reconnect. Make sure you use a hardware console (Remote Management Board or, if Windows is running in a VM, a console of the hypervisor). If you cannot connect to Windows via a console, you cannot carry out this test. If you are using physical hardware, you can alternatively unplug the cables from the network interface(s) of the physical server.

Expected result:

The other cluster node(s) should detect the loss of communication to this host. Depending on the Quorum model you use, another cluster node should take over the SAP cluster group and continue operations.

Goal:

Does the cluster work as designed?
The cluster node which has no more network connections will STOP the SAP cluster group.
Another cluster node takes over the role and starts the cluster group. ERS guarantees that no locks are lost.
Or do you see a “split brain”? Is the SAP cluster group online on both nodes?
(this shouldn’t be the case in shared disk configurations, with one possible exception: the shared disk is replicated across datacenters using storage mirror technology)
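
To check for a split brain, compare what each side reports. The sketch below is meant to be run locally on the console of the isolated node and again on a surviving node. 'SAP <SID>' is a placeholder for your group name. If the cluster service on the isolated node has stopped, the cmdlets fail and the catch block reports that instead.

```powershell
# Assumption: 'SAP <SID>' is a placeholder for the name of your SAP cluster group.
$group = 'SAP <SID>'

try {
    # Local view of the group and of all cluster nodes
    Get-ClusterGroup -Name $group -ErrorAction Stop | Format-Table Name, OwnerNode, State
    Get-ClusterNode -ErrorAction Stop | Format-Table Name, State
} catch {
    "{0}: cluster not reachable ({1})" -f $env:COMPUTERNAME, $_.Exception.Message
}
```

If both sides report the group as Online with different owner nodes, you are looking at a split brain.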

6. Failover Test (test will lead to a failover)

Download the tool “notmyfault.exe” from https://learn.microsoft.com/en-us/sysinternals/downloads/notmyfault .

This tool will initiate a crash (= Blue Screen or Stop) of the Windows OS.

RDP to the cluster node where the SAP cluster group is running.
RDP to another cluster node, to which the SAP cluster group should be moved.

Start the tool with the parameter “crash” on the cluster node where the SAP cluster group is running. After you have started the tool, the RDP session will hang and lose its connection, because the OS is stopping (=> crashing).

Watch closely what happens on the other cluster node! The cluster node should take over the SAP cluster group after some seconds and NOT minutes!
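
To get exact times on the surviving node, a polling loop can be started there shortly before the crash is triggered. This is a sketch with placeholders: 'SAP <SID>' is your group name, '<crashing-node>' is the node on which the tool will be started, and the five-minute limit is an arbitrary choice.

```powershell
# Assumptions: both values are placeholders for your environment.
$group     = 'SAP <SID>'
$crashNode = '<crashing-node>'

$watch = [System.Diagnostics.Stopwatch]::StartNew()
while ($watch.Elapsed.TotalSeconds -lt 300) {
    $n = Get-ClusterNode  -Name $crashNode
    $g = Get-ClusterGroup -Name $group
    "{0:HH:mm:ss}  node {1}={2}  group owner={3} state={4}" -f (Get-Date), $crashNode, $n.State, $g.OwnerNode, $g.State
    # Stop once the group is online on a node other than the crashed one
    if ($g.State -eq 'Online' -and $g.OwnerNode.Name -ne $crashNode) { break }
    Start-Sleep -Seconds 1
}
```

The difference between the first line that shows the node as down and the line that shows the group online is the failover time for this test.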

Goal:

This test verifies that possible file handle locks on the shared disk are cleaned up, even after a Windows OS crash.
This failover test must lead to a fast failover.