How to solve SAP on Windows Failover cluster stand...

karl-heinz_hochmuth · ‎02-20-2020

This blog discusses issues that can occur during SAP operations in Failover Cluster environments. Many admins are not aware that there are situations in which the whole HA solution stops working, or how to start SAP systems when that happens.

The problem in short:
An SAP instance cannot be started on any node in a cluster anymore. The instance is "failed".

This blog shows possible solutions.

What you should do if an SAP instance entered the cluster state "failed":

First: Stay calm and save all evidence of the case!

This means in detail:

Generate the cluster log. Open PowerShell with administrative rights and run this command:

get-clusterlog -destination:c:\temp -uselocaltime

This command writes the cluster logs of all cluster nodes to the directory c:\temp, with timestamps in local time.

Save the work folder! Copy the \work folder of the affected instance to a safe location so that it can be investigated later.

Check the Windows Event Log. Start eventvwr, look for events related to the problem, and save the Application and System logs for later analysis.
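If you prefer to script the "save the work folder" step, here is a minimal sketch in Python. It is only an illustration: the function name save_work_folder, the label and the folder layout are my own assumptions, not part of any SAP or Microsoft tool. It copies the work folder into a new, timestamped evidence folder so that an earlier snapshot is never overwritten.

```python
import shutil
from datetime import datetime
from pathlib import Path


def save_work_folder(work_dir, evidence_root, label):
    """Copy an instance work folder into a new timestamped evidence folder.

    All names here are illustrative assumptions, not SAP conventions.
    """
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    target = Path(evidence_root) / f"{label}-{stamp}" / "work"
    # copytree creates the target and any missing parent folders.
    # It raises FileExistsError if the target already exists,
    # so an existing snapshot is never silently overwritten.
    shutil.copytree(work_dir, target)
    return target
```

You would call it with the real path of the instance work folder, a safe destination such as c:\temp, and a label of your choice. Keep the destination outside the work folder itself.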

Next step: Enable Maintenance Mode to start SAP instances "outside" the Failover Cluster.

Once you have saved all logs needed for later analysis, there are two options. If you have time for a root cause analysis, skip ahead to the third step of this blog, "Analyze the problem".

If you need to bring your SAP instance back very quickly, continue here.

The solution: Enable the maintenance mode!

Example:

An SAP ASCS instance failed in a two-node cluster.

Enable the maintenance mode for both resources, the SAP service and the SAP instance resource.

You find this setting in the Properties tab:

Once you have enabled the maintenance mode, Windows Failover Cluster Manager displays a notification:

Warning!
This means that the Windows cluster will ignore any health check information from the SAP start service (sapstartsrv) or from applications! The cluster now stays "calm".

Start the instance using SAP MMC or using sapcontrol.exe. In this example I use SAP MMC:

What we can see here is the root cause for this problem: The SAP gateway (gwrd.exe) cannot be started ... for whatever reason! The SAP MMC displays this with status "red".

The SAP resource DLL ("saprc.dll") retrieves the state "red" and passes it on to the cluster. For the cluster, this means that not all necessary applications could be started, so it initiates a failover.

Maintenance mode is a good way to find the root cause quickly, without interference from the cluster. But be aware: the cluster will not detect any failure in this condition. It simply ignores everything.

The ASCS instance is available again, although in this case without the gateway.

Next steps would be:

Stop the ASCS instance again using SAP MMC.

Disable the maintenance mode in Failover Cluster again (=> set the value to 0). Make sure to do this in BOTH cluster resources, service and instance.

Solve the problem. In this example, I made a typo in the ASCS instance profile, so the gwrd.exe executable could not be found.

If the problem is solved, start the instance normally using Failover Cluster.

If you’re not sure that you found the root cause of the problem, then leave the cluster in maintenance mode for now. Use SAP MMC to start and stop the SAP instance, until the problem is fully solved.

Third step: Analyze the problem in detail and find the real root cause.

What are possible root causes that lead to a complete standstill?

Like the example above: A program specified in an instance profile cannot be started. It stays "red" in SAP MMC. See solution above.

A program specified in an instance profile can be started, but stays "yellow", even after 60 seconds. Here is an example of such a situation:
This ASCS instance contains an additional Web Dispatcher (sapwebdisp.exe) and a gateway (gwrd.exe).
The Web Dispatcher is "yellow". This can be a problem or expected behavior, for example if you have enabled the Web Dispatcher's internal maintenance mode. See next screen:

This yellow state of a single process can cause a failover of the whole instance!

Recommended setting for executables that are not "that HA relevant": Go to the Properties tab of the SAP instance resource and add the executable name to the "HAnotRelevantApps" field:

This means that the cluster will now ignore any yellow or red status of that executable.
Multiple executables must be separated by a space.
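To make the effect of "HAnotRelevantApps" more tangible, here is a small Python model of the filtering. It is a simplified sketch of the behavior described above, not the actual logic inside saprc.dll. The function names are invented for this illustration, and the case-insensitive comparison is my assumption.

```python
def relevant_states(process_states, ha_not_relevant_apps):
    """Drop every process named in the space-separated HAnotRelevantApps value."""
    # Assumption: names are compared case-insensitively, like Windows file names.
    ignored = {name.lower() for name in ha_not_relevant_apps.split()}
    return {
        name: state
        for name, state in process_states.items()
        if name.lower() not in ignored
    }


def needs_attention(process_states, ha_not_relevant_apps=""):
    """Return the HA-relevant processes whose state is not green."""
    remaining = relevant_states(process_states, ha_not_relevant_apps)
    return sorted(name for name, state in remaining.items() if state != "green")


# The situation from the example: the Web Dispatcher is yellow, the gateway is green.
states = {"gwrd.exe": "green", "sapwebdisp.exe": "yellow"}

print(needs_attention(states))                    # ['sapwebdisp.exe']
print(needs_attention(states, "sapwebdisp.exe"))  # []
```

With an empty field, the yellow Web Dispatcher counts against the whole instance. Once its name is in the list, only the remaining executables decide about the health of the instance.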

A program specified in an instance profile can be started, but stays "yellow" for around 50 seconds ... then reports "green". Example:
Here it is the gateway that stays yellow for a longer time. Possible causes include slow name resolution, poor system performance, or leftover rdisp/trace level parameters that make all SAP executables write huge trace files.

The cluster waits 60 seconds for all applications started by the SAP start service (sapstartsrv) to come up and report "green". After that interval, a failover is initiated.

Solution:

Increase the "AcceptableYellowTime" parameter to a value that fits your landscape and configuration. You find it in the Properties tab of the SAP instance cluster resource:

In this example, the cluster now waits 120 seconds before it initiates a failover.
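The timing rule can be written down as a tiny decision function. Again, this is only a sketch of the behavior as described in this blog, with names I made up for the illustration. The real check happens inside the SAP resource DLL and may differ in detail.

```python
def should_fail_over(state, seconds_in_state, acceptable_yellow_time=60):
    """Simplified model: red fails over, yellow only after the grace period."""
    if state == "red":
        return True
    if state == "yellow":
        return seconds_in_state > acceptable_yellow_time
    return False


# Gateway turns green after about 50 seconds: within the default limit.
print(should_fail_over("yellow", 50))        # False
# Still yellow after 90 seconds with the default of 60: failover.
print(should_fail_over("yellow", 90))        # True
# Same 90 seconds with AcceptableYellowTime raised to 120: no failover.
print(should_fail_over("yellow", 90, 120))   # False
```

The point of the model: raising the value only buys time for slow starters. A process that never leaves "yellow" will still trigger a failover, just later.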

Coming up:

How to interpret the cluster events in the Windows application log?

How to read and analyze a cluster log?