Be Prepared for Using Pacemaker Cluster for SAP HANA – Main Part
I am probably stating the obvious when I say that every infrastructure deployment option needs to be properly tested before it can be used to host productive workloads. This is even more important for High Availability clusters, as a poorly implemented cluster can cause more downtime than a decision not to use clustering at all. The worst situation, which must not happen under any circumstances, is that the cluster compromises data consistency or causes data loss.
Typical High Availability cluster testing starts with situations that could happen under normal operation, such as:
- Cluster stability
- Graceful failover
- Crash of primary application or server
- Crash of secondary application or server
- Patching and maintenance, etc.
These scenarios must behave as expected; otherwise you had better not use the cluster at all.
Next you should focus on testing more advanced scenarios, which typically include multi-level failures. These happen very rarely, and the goal here is not to ensure that the cluster can protect against them, but to ensure that the cluster does not misbehave and does not damage the database.
The highest level of testing is to consider what a human System Administrator will do when he is called in the middle of the night to fix a failed cluster. Can he accidentally cause data loss or inconsistency in the SAP HANA database? What protections prevent him from inadvertently damaging the SAP HANA database?
This blog is about such scenarios, where the System Administrator must be extremely careful, as otherwise he can accidentally cause data loss and inconsistency in the SAP HANA database.
Before we jump to the scenarios themselves, we need to set the stage and explain some basics.
Big thanks to Fabian Herschel and Peter Schinagl from SUSE for proof-reading the blog.
For better readability, the whole blog is divided into the following parts:
- Be Prepared for Using Pacemaker Cluster for SAP HANA – Main Part (this blog)
- Be Prepared for Using Pacemaker Cluster for SAP HANA – Part 1: Basics (this blog)
- Be Prepared for Using Pacemaker Cluster for SAP HANA – Part 2: Failure of Both Nodes
Be Prepared for Using Pacemaker Cluster for SAP HANA – Part 1: Basics
How Pacemaker Cluster works with SAP HANA System Replication
SUSE, in collaboration with SAP, developed the SAPHanaSR solution and released it as part of SLES for SAP Applications. This solution is based on a Pacemaker Cluster that automates failovers between two SAP HANA databases that mirror each other. The solution was later adopted by Red Hat and is now jointly developed by both companies. Therefore, this whole blog is equally applicable to both Operating Systems.
Pacemaker Cluster with SAP HANA System Replication, as visualized below, is based on two identical servers (VMs), each hosting one SAP HANA database. Both servers are bundled together by the SUSE Pacemaker Cluster.
Figure 1 – Pacemaker Cluster for SAP HANA Architecture
The SAP HANA database on the primary server replicates data to the SAP HANA database running on the secondary server. The replication method is Synchronous SAP HANA System Replication, which ensures that no data is lost during failover. Both databases run at the same time; however, only the primary database can support customer workloads. The secondary database is either completely passive or can be active in read-only mode (since SAP HANA 2.0).
Failure of the primary SAP HANA database is automatically detected by the Pacemaker Cluster. The cluster will automatically shut down the primary database (if it is still partially running) and activate the secondary database. It will also relocate the virtual IP so that all applications using the database can reconnect to the new primary SAP HANA database. Since all the data is already pre-loaded in the memory of the new primary database, this failover is very fast.
Importance of fencing
Under normal operation the fencing mechanism is not actively used. The cluster communicates over the network (corosync), and both sides of the cluster constantly update each other on the health status of the SAP HANA database on the given node.
The problem starts when one of the nodes stops responding. Let’s assume the secondary server is suddenly unable to connect to the primary server. In that case the cluster on the secondary server has no way of knowing what happened. Generally, two options are possible:
- Primary server is not responding because it crashed or is frozen
- Primary server is healthy, but it is not reachable due to a network issue
The problem is that in the first case the cluster should consider executing a failover to restore the service, while in the second case the failover must not happen, as otherwise SAP HANA would be active on both servers.
This situation is called split-brain (or dual primary) and is extremely dangerous. It is even more dangerous in this scenario because we are working with two independent SAP HANA databases that can easily be active on both sides.
The business impact would be fatal: imagine that you are writing some transactions to the database running on the primary server and later other transactions to the database running on the secondary server, which is not aware of the changes written to the first database.
Now consider the other systems in the landscape: CRM holding records that do not exist in ERP, and so on. This would result in logical inconsistency cascading across all systems in the customer landscape. I believe it is now obvious that fixing such a situation would be very difficult and would have a huge impact on the business.
It is good to be paranoid when it comes to split-brain situations.
Pacemaker Cluster addresses this with a fencing technique called STONITH (Shoot The Other Node In The Head). This mechanism does exactly what the name suggests. If the nodes suddenly cannot communicate, one of the nodes will kill the other node to ensure that both nodes are not active at the same time. The surviving node will then serve the customer workloads.
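The ordering that matters here (fence first, serve the workload only once the other node is known to be down) can be pictured with a small Python sketch. This is a toy model and not Pacemaker code; the names `react_to_lost_peer`, `fence_peer` and `Action` are hypothetical and exist only for this illustration.

```python
from enum import Enum
from typing import Callable


class Action(Enum):
    KEEP_CURRENT_ROLE = "keep current role"
    SERVE_WORKLOAD = "serve customer workload"


def react_to_lost_peer(peer_reachable: bool, fence_peer: Callable[[], bool]) -> Action:
    """Decide what a node may do when the other node stops answering.

    fence_peer is a hypothetical callback that kills the other node and
    returns True only when the fencing device confirms success.
    """
    if peer_reachable:
        # The other node still answers: nothing to recover.
        return Action.KEEP_CURRENT_ROLE
    # A crashed peer and a broken network look the same from here,
    # so the peer must be killed before this node serves the workload.
    if fence_peer():
        return Action.SERVE_WORKLOAD
    # Fencing was not confirmed: activating now could create a dual primary.
    return Action.KEEP_CURRENT_ROLE


if __name__ == "__main__":
    print(react_to_lost_peer(False, lambda: True))
    print(react_to_lost_peer(False, lambda: False))
```

The point of the sketch is the last branch: when fencing cannot be confirmed, doing nothing is safer than risking two active databases.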
Fencing can be implemented in multiple ways. The following two techniques are the most common:
- Node shutdown via IPMI (for most Intel devices), HMC (for Power devices), vCenter or VMware plugins (for VMware VMs)
  - If an issue occurs, the surviving node powers down the other node to ensure that it is not active. The main drawback is that the implementation depends on the hardware or VMware configuration in use. In some cases this approach might be considered insecure because a password is stored unencrypted in the cluster configuration.
- SBD (Storage-Based Death) disk fencing is based on shared disk(s) provided from external source(s). Multiple SBD disks should not share the same single point of failure, and if they are provided over a network, it should not be the cluster communication network (corosync).
  - If an issue occurs, the surviving node writes a “poison pill” to the disk, instructing the other node (if active) to commit suicide. The advantage of this approach is that it is generic and can be applied equally across different scenarios, including bare-metal and virtual solutions.
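The “poison pill” idea can be pictured as a mailbox per node on the shared disk. The following Python sketch is a toy in-memory model, not the real SBD on-disk format or tooling; the classes, method names and message strings ("clear", "reset") are assumptions made for illustration, and the hostnames are only examples.

```python
class SharedDisk:
    """Toy stand-in for an SBD disk: one message slot per node."""

    def __init__(self, nodes):
        self.slots = {node: "clear" for node in nodes}

    def write_pill(self, target):
        # The surviving node leaves a poison pill in the other node's slot.
        self.slots[target] = "reset"

    def read_slot(self, node):
        return self.slots[node]


class Node:
    def __init__(self, name, disk):
        self.name = name
        self.disk = disk
        self.alive = True

    def check_mailbox(self):
        """Called periodically by every node that is still running."""
        if self.alive and self.disk.read_slot(self.name) != "clear":
            # Poison pill found: stop immediately instead of serving data.
            self.alive = False
        return self.alive


disk = SharedDisk(["hana43", "hana44"])
node_a = Node("hana43", disk)
node_b = Node("hana44", disk)

# The nodes can no longer talk over the network; hana44 fences hana43 via the disk.
disk.write_pill("hana43")
node_a.check_mailbox()
node_b.check_mailbox()
print(node_a.alive, node_b.alive)  # False True
```

Because the message travels through the disk and not through the cluster network, it still reaches the other node when the network between the nodes is the component that failed.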
How does cluster know it is safe to failover
At this point we need to dive deep into how Pacemaker Cluster works internally with SAP HANA System Replication.
There are two SAP HANA cluster packages that automate SAP HANA System Replication failover:
- SAPHanaSR, which automates failover for the following two SAP HANA single-node scenarios:
  - SAP HANA SR performance optimized infrastructure, where the secondary node is dedicated to fulfilling the High Availability function
  - SAP HANA SR cost optimized infrastructure, where the secondary node is hosting an additional non-productive SAP HANA database
- SAPHanaSR-ScaleOut, which automates failover for the SAP HANA scale-out scenario (at the time of writing of this blog available only on SUSE Linux Enterprise Server for SAP Applications)
For the sake of simplicity we will focus on the single-node package (SAPHanaSR) only. This package is designed to monitor and locally record multiple attributes for each node:
- Cluster Resource State (hana_<sid>_clone_state)
  - Valid values: PROMOTED, DEMOTED, WAITING, UNDEFINED
  - Describes the actual status of the local SAP HANA cluster resource.
- Remote Node Hostname (hana_<sid>_remoteHost)
  - Hostname of the remote server (“the other node”).
- SAP HANA Roles (hana_<sid>_roles)
  - String describing the health status of the local SAP HANA database. It includes:
    - Return code from landscapeHostConfiguration.py
    - HANA role (primary/secondary)
    - Nameserver role
    - Index server roles
- SAP HANA Site Name (hana_<sid>_site)
  - Alias of the local SAP HANA database (as registered when replication was configured).
- SAP HANA System Replication mode (hana_<sid>_srmode)
  - Valid values: sync, syncmem
  - Configured SAP HANA replication mode. Replication mode async should not be used in a High Availability scenario, as it is associated with potential data loss during failover.
- SAP HANA System Replication status (hana_<sid>_sync_state)
  - Valid values: PRIM, SOK, SFAIL
  - Failover to the secondary cluster node can happen only if the replication status on the secondary node is SOK. Any other status means that replication was not operational when the primary crashed and the data on the secondary database is not in sync with the primary database, so failover will not happen.
- Local Node Hostname (hana_<sid>_vhost)
  - Hostname used during SAP HANA installation. This can be either the local hostname or any other “virtual” hostname.
- Last Primary Timestamp, LPT value (lpa_<sid>_lpt)
  - The value is either the timestamp at which the SAP HANA database was last seen as primary, or a low “static” value indicating that the database is not primary.
  - This attribute prevents a dual primary situation. If the cluster node attributes on both nodes show the last state of the SAP HANA database as primary (this can happen after a multi-level failure; see the next part for details), the higher LPT value is used to determine which SAP HANA database was the “last primary”. This database is started while the other database is kept down.
- Resource Weight (master-rsc_SAPHana_<SID>_HDB<SN>)
  - Internal technical cluster attribute used to control the failover process. The node with the highest weight will be promoted to become primary. The weight is calculated from the other attributes.
These attributes are updated at regular intervals and stored locally as part of the cluster node status.
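Two of the rules described above, "fail over only when the secondary reports SOK" and "the higher LPT value marks the last primary", can be written down as a short Python sketch. It is a deliberately simplified illustration under stated assumptions: the real resource agent evaluates more attributes than these two, and the function names and the handling of a tie are hypothetical.

```python
from typing import Dict, Optional


def secondary_may_take_over(sync_state: str) -> bool:
    """Failover is only safe when replication was in sync (SOK)."""
    return sync_state == "SOK"


def last_primary(lpt_by_node: Dict[str, int]) -> Optional[str]:
    """Return the node with the highest LPT value.

    Returning None for an empty input or a tie is an assumption of this
    sketch, not documented cluster behavior.
    """
    ranked = sorted(lpt_by_node.items(), key=lambda item: item[1], reverse=True)
    if not ranked:
        return None
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


print(secondary_may_take_over("SOK"))    # True
print(secondary_may_take_over("SFAIL"))  # False
print(last_primary({"hana43": 1439227830, "hana44": 30}))  # hana43
```

A real timestamp is always far larger than the low static value, which is why a simple numeric comparison is enough to tell a former primary from a node that was never primary.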
Example of internal cluster states during normal operation (hana43 being primary):
    Node Attributes:
    * Node hana43-hb:
        + hana_hac_clone_state         : PROMOTED
        + hana_hac_remoteHost          : hana44
        + hana_hac_roles               : 4:P:master1:master:worker:master
        + hana_hac_site                : TOR-HAC-00-NODE1
        + hana_hac_srmode              : sync
        + hana_hac_sync_state          : PRIM
        + hana_hac_vhost               : hana43
        + lpa_hac_lpt                  : 1439227830
        + master-rsc_SAPHana_HAC_HDB00 : 150
    * Node hana44-hb:
        + hana_hac_clone_state         : DEMOTED
        + hana_hac_remoteHost          : hana43
        + hana_hac_roles               : 4:S:master1:master:worker:master
        + hana_hac_site                : TOR-HAC-00-NODE2
        + hana_hac_srmode              : sync
        + hana_hac_sync_state          : SOK
        + hana_hac_vhost               : hana44
        + lpa_hac_lpt                  : 30
        + master-rsc_SAPHana_HAC_HDB00 : 100
All these attributes are used to determine the actual cluster health and are considered before a failover decision is taken.
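When troubleshooting, it can help to read such a node attribute listing programmatically. The following Python sketch parses a shortened copy of the example output into a dictionary per node and picks the node with the highest resource weight. The parsing rules are inferred from the example listing only, and the helper name `parse_node_attributes` is hypothetical.

```python
import re
from typing import Dict

# Shortened copy of the example output, for illustration only.
SAMPLE = """\
Node Attributes:
* Node hana43-hb:
    + hana_hac_roles               : 4:P:master1:master:worker:master
    + hana_hac_sync_state          : PRIM
    + lpa_hac_lpt                  : 1439227830
    + master-rsc_SAPHana_HAC_HDB00 : 150
* Node hana44-hb:
    + hana_hac_roles               : 4:S:master1:master:worker:master
    + hana_hac_sync_state          : SOK
    + lpa_hac_lpt                  : 30
    + master-rsc_SAPHana_HAC_HDB00 : 100
"""

# Matches either a node header ("* Node <name>:") or an attribute ("+ <key> : <value>").
TOKEN = re.compile(r"\* Node (\S+):|\+ (\S+)\s*:\s*(\S+)")


def parse_node_attributes(text: str) -> Dict[str, Dict[str, str]]:
    """Collect attributes per node from a node attribute listing."""
    nodes: Dict[str, Dict[str, str]] = {}
    current = None
    for match in TOKEN.finditer(text):
        node, key, value = match.groups()
        if node is not None:
            current = nodes.setdefault(node, {})
        elif current is not None:
            current[key] = value
    return nodes


attrs = parse_node_attributes(SAMPLE)
score_key = "master-rsc_SAPHana_HAC_HDB00"
best = max(attrs, key=lambda name: int(attrs[name][score_key]))
print(best)  # hana43-hb

# The roles string starts with the return code and the primary/secondary flag.
return_code, hana_role = attrs["hana44-hb"]["hana_hac_roles"].split(":")[:2]
print(return_code, hana_role)  # 4 S
```

The regular expression works on the whole text and does not rely on line breaks, so it also copes with output that was pasted as one long line.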