Storage-based fencing in mixed High Availability environments
Before we go into detail, one might ask: “Why combine virtual and physical nodes in one cluster at all?” Fair enough. The answer “because it’s possible” may be a little thin, but in our test scenario this was exactly what we wanted to prove. Potential use cases for such a setup are:
- physical 2-node cluster plus a virtual fallback node
- intermediate step during a P2V migration to protect against unplanned downtime and reduce planned downtime
Two nodes plus fallback
So let’s frame the idea from a bird’s-eye perspective: we’re running a high-load system on two physical nodes. During maintenance windows in which we take one node offline, we want to protect the running resources against unplanned downtime with a fallback node that can easily be provisioned and that neither consumes nor occupies resources when it’s not needed. Hence, a virtual machine is a perfect candidate.
A few considerations apply to the configuration of the virtual machine. First, contrary to the SAP best practices, neither RAM nor CPU should be reserved; otherwise, the virtual machine would tie up resources of the virtualization host for nothing. Giving the virtual machine higher shares is absolutely sufficient (and supported) to ensure good performance when it takes over cluster resources. Second, it’s very likely that the size of the virtual machine doesn’t match the size of a physical node. Therefore, the requirements of the cluster resources in question should be considered thoroughly in order to size the virtual machine accordingly.
The cluster resource placement decision follows a score-based rule set of positive and negative affinity scores. To make this scenario work, the virtual fallback node is given low resource location scores, so the cluster resources are preferably placed on the physical nodes. Placement on the fallback node only happens when the negative colocation score between the cluster resources lowers the resulting score of the physical node below the location score of the virtual node.
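To make the interplay of location and colocation scores more tangible, here is a minimal sketch of the idea in Python. It is a toy model and not Pacemaker’s actual allocation algorithm; the node names, resource names and score values are assumptions chosen purely for illustration.

```python
# Toy model of score-based placement; NOT Pacemaker's real allocator.
# All names and numbers below are hypothetical.
LOCATION = {"phys1": 100, "phys2": 100, "virt1": 10}  # location score per node
ANTI_COLOCATION = -200  # negative colocation score between the resources


def place(resources, online_nodes):
    """Greedily place each resource on the online node with the highest score."""
    placement = {}
    for res in resources:
        scores = {}
        for node in online_nodes:
            # Count resources already placed on this node.
            occupied = sum(1 for n in placement.values() if n == node)
            # Each of them lowers the node's score for the next resource.
            scores[node] = LOCATION[node] + ANTI_COLOCATION * occupied
        placement[res] = max(scores, key=scores.get)
    return placement


# Both physical nodes online: the fallback node stays empty.
normal = place(["rsc_a", "rsc_b"], ["phys1", "phys2", "virt1"])
# phys2 is down for maintenance: the second resource moves to the fallback node.
maintenance = place(["rsc_a", "rsc_b"], ["phys1", "virt1"])
print(normal)
print(maintenance)
```

With both physical nodes online, each resource lands on its own physical node. Once one physical node is gone, the negative colocation score pushes the remaining physical node (100 - 200 = -100) below the virtual node (10), and only then is the fallback node used.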
The typical fencing agent we use is SBD (a.k.a. storage-based death, STONITH block device or split-brain detector). In its configuration with three devices, it provides one of the most resilient fencing methods. With a single storage system, the SBD devices should be distributed among several physical enclosures so that the outage of a single LUN does not trigger the fencing mechanism. In this first example, the data volumes belonging to the cluster resources are held by filer head A; therefore, quorum is built on this controller.
In a setup with some kind of replication mechanism, we want our cluster to properly decide on which nodes the resources should run after the primary, “active” storage side goes down. To avoid scenarios where the whole cluster shuts itself down because none of the nodes can reach the majority of the SBD devices, we need a mechanism that ensures a correct quorum decision. For this purpose we (mis)use a resilient part of the infrastructure, which provides the third SBD device, usually via iSCSI, as a tie-breaker.
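The role of the third device can be illustrated with a small sketch of the majority rule. It only models the “can a node still reach most of its SBD devices” question; the real sbd daemon involves more than that (timeouts, watchdog, cluster integration). The device and backend names are hypothetical.

```python
# Simplified majority rule for SBD devices; names are hypothetical.
WITH_TIE_BREAKER = {
    "sbd_storage_a": "storage_a",   # device on the primary storage side
    "sbd_storage_b": "storage_b",   # device on the replicated storage side
    "sbd_iscsi": "iscsi_target",    # third device served via iSCSI
}
WITHOUT_TIE_BREAKER = {
    "sbd_storage_a": "storage_a",
    "sbd_storage_b": "storage_b",
}


def node_survives(devices, failed_backends):
    """A node keeps running only if it reaches a strict majority of devices."""
    reachable = [
        dev for dev, backend in devices.items()
        if backend not in failed_backends
    ]
    return 2 * len(reachable) > len(devices)


# The primary storage side goes down.
print(node_survives(WITH_TIE_BREAKER, {"storage_a"}))     # 2 of 3 reachable
print(node_survives(WITHOUT_TIE_BREAKER, {"storage_a"}))  # 1 of 2 reachable
```

With the tie-breaker, two of three devices remain reachable after a storage-side outage, so the nodes keep a majority. With only one device per storage side, the same outage leaves every node without a majority.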
To configure storage-based fencing on a virtual node, the steps are:
- Map the SBD LUNs to physical RDMs on the ESXi host (this lets the virtual node address the LUNs by the same identifiers under /dev/disk/by-id as the physical nodes)
- Add all three RDMs to the virtual node
- Configure SBD on the virtual node the same way as on the physical nodes
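Because the RDMs expose the same /dev/disk/by-id identifiers on every node, the SBD device list can be identical across the cluster. As a rough sketch, the following Python helper renders the SBD_DEVICE line as it is commonly written in /etc/sysconfig/sbd (device paths separated by semicolons). The device identifiers are placeholders, and the helper itself is hypothetical and not part of any SBD tooling.

```python
# Hypothetical helper that renders the SBD_DEVICE setting for /etc/sysconfig/sbd.
# The by-id names below are placeholders, not real device identifiers.
SBD_DEVICES = [
    "/dev/disk/by-id/scsi-EXAMPLE-sbd1",
    "/dev/disk/by-id/scsi-EXAMPLE-sbd2",
    "/dev/disk/by-id/scsi-EXAMPLE-sbd3",
]


def render_sbd_device_line(devices):
    """Return the SBD_DEVICE line for a three-device setup."""
    if len(devices) != 3:
        raise ValueError("this sketch expects exactly three SBD devices")
    for dev in devices:
        # Stable by-id paths keep the line identical on all nodes.
        if not dev.startswith("/dev/disk/by-id/"):
            raise ValueError(f"not a by-id path: {dev}")
    return 'SBD_DEVICE="' + ";".join(devices) + '"'


line = render_sbd_device_line(SBD_DEVICES)
print(line)
```

The same line can then be used unchanged on the physical nodes and on the virtual node, which is the point of mapping the LUNs as physical RDMs.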
Should the scenario be extended by using multiple virtual nodes in parallel, or when setting up a cluster that is completely virtual, the RDMs pointing to the SBD LUNs have to be added to each virtual node. However, VMFS usually locks a VMDK (no matter whether the VMDK descriptor points to a container file or to an RDM) in order to prevent inconsistencies caused by several virtual machines writing to the same disk simultaneously. We therefore have to unlock the VMDK manually by setting the multi-writer parameter.
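As an illustration of what setting the multi-writer parameter can look like, the sketch below generates per-disk sharing entries in the form commonly used in a virtual machine’s .vmx configuration. The controller and unit numbers are assumptions, and the exact procedure (configuration file entry or the sharing option in the disk settings) depends on the vSphere version, so please check it against the VMware documentation for your release.

```python
# Hypothetical helper that generates multi-writer entries for shared RDMs.
# Controller and unit numbers are assumptions for illustration.
def multi_writer_entries(controller, first_unit, disk_count):
    """Return one sharing entry per RDM disk on the given SCSI controller."""
    return [
        f'scsi{controller}:{unit}.sharing = "multi-writer"'
        for unit in range(first_unit, first_unit + disk_count)
    ]


# Three SBD RDMs attached as the first three disks on SCSI controller 1.
entries = multi_writer_entries(controller=1, first_unit=0, disk_count=3)
print("\n".join(entries))
```

The entries have to match the SCSI positions at which the RDMs are attached, and the setting is needed on every virtual node that shares the SBD devices.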
There are surely some spots that could have been elaborated in more detail, but as I mentioned in my previous blog, SUSE maintains a human-readable cluster manual as well as a considerable library of SAP best practices. Nevertheless, your comments are welcome!