Be Prepared for Using Pacemaker Cluster for SAP HANA – Part 2: Failure of Both Nodes
Big thanks to Fabian Herschel and Peter Schinagl from SUSE for proof-reading the blog.
The first part of this blog is located here:
Be Prepared for Using Pacemaker Cluster for SAP HANA – Part 1
What happens when both cluster nodes fail
Let’s start with the Pacemaker Cluster running during normal operation. The colors have the following meaning: a component is available (green), standby (yellow), or unavailable (red).
Figure 2 – Pacemaker Cluster during normal operation
SAP HANA on server hana43 is the primary – this is visible from the status PROMOTED, the roles string containing “P”, the sync state set to PRIM, and the LPT set to a recent timestamp.
SAP HANA on server hana44 is the secondary – this is visible from the status DEMOTED, the roles string containing “S”, the sync state set to SOK (meaning replication is healthy), and a low “static” LPT value.
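On a live SLES cluster these attribute values are typically inspected with `SAPHanaSR-showAttr` or `crm_mon -A`. As a minimal sketch of what you are looking at, the snippet below parses a hand-written sample in that spirit – the attribute names and values are illustrative, not captured from a real cluster:

```shell
# Illustrative sample only - on a real cluster you would query the
# attributes with SAPHanaSR-showAttr (SUSE) or crm_mon -A.
attrs='hana43 clone_state=PROMOTED roles=4:P:master1 sync_state=PRIM lpt=1693900000
hana44 clone_state=DEMOTED roles=4:S:master1 sync_state=SOK lpt=30'

# extract the sync state of a given node from the sample above
sync_state() {
  echo "$attrs" | awk -v n="$1" '$1 == n {
    for (i = 2; i <= NF; i++)
      if ($i ~ /^sync_state=/) { sub("sync_state=", "", $i); print $i }
  }'
}

sync_state hana44   # prints SOK - replication is healthy
```

Note the asymmetry that matters later in this post: the primary carries a real LPT timestamp while the secondary carries only a small static value.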
Now let’s assume that the primary server crashes and becomes unavailable.
Figure 3 – Pacemaker Cluster after primary server failed
The secondary server notices that the primary server is down, and since the replication status is SOK it can initiate a failover. A fencing operation is executed to ensure that server hana43 is offline, and then the failover is performed.
Figure 4 – Pacemaker Cluster after secondary server failover
When SAP HANA on server hana44 is promoted to new primary, all cluster node attributes are updated – the status is set to PROMOTED, the roles string is changed to contain “P” because the database failover was executed, the sync state is set to PRIM, and the LPT is set to a timestamp.
Note that the cluster node attributes of the failed primary server remain unchanged – these attributes are stored locally on that server, and the server is unavailable.
Now let’s assume that the secondary server crashes while the primary is still offline.
Figure 5 – Pacemaker Cluster after secondary server failed
Now both servers are offline, but the SAP HANA databases on both servers are configured to run as primary – this is also confirmed by the cluster node attributes, which are set accordingly on each node.
How to (not) destroy your company data
Now imagine the following scenario. You are the on-duty System Administrator and you get called in the middle of the night because SAP HANA is down and you need to fix it. Still half-asleep, you log in to the servers while your boss explains over the phone how much money the company is losing every minute SAP HANA is down and how important it is to get it up and running as fast as possible.
When you are finally in, you see that both SAP HANA cluster servers are offline. To make it more obvious – this is what you see.
Figure 6 – Pacemaker Cluster state as System Administrator can see it
Since you are under pressure to fix SAP HANA ASAP, you decide to fix the primary first – after all, the secondary can be fixed later, once you are back in business.
DANGER!!! Now you should stop and get fully awake. How do you know which server is the “last primary”? Unless you got that information from some external monitoring system, you have no way of knowing!!!
Let’s look at what would happen if you started the wrong server. Let’s assume the System Administrator checks the documentation, sees that hana43 is supposed to be primary, and decides to start it first.
Figure 7 – Pacemaker Cluster after hana43 server was restarted
Once the Operating System has rebooted, the System Administrator starts the Pacemaker Cluster. Since the secondary server is still offline, the Pacemaker Cluster is unable to retrieve the LPT value from the cluster node attributes of the secondary server. Without the LPT values from both server nodes, the Pacemaker Cluster will not start the SAP HANA database, and the cluster node attribute status is set to WAITING.
This protection in the Pacemaker Cluster is called “restart inhibit”. It ensures that SAP HANA is started only when the Pacemaker Cluster can clearly determine which server is the “last primary”.
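The decision rule behind restart inhibit can be sketched as a small function – this is my own simplified model of the logic, not the actual SAPHanaSR resource agent code:

```shell
# Simplified model of the restart-inhibit rule (not the real agent code):
# without the peer's LPT a node cannot prove it was the last primary.
lpa_decision() {   # $1 = local LPT, $2 = peer LPT ("" if peer unreachable)
  if [ -z "$2" ]; then
    echo WAITING                 # restart inhibit: refuse to start HANA
  elif [ "$1" -gt "$2" ]; then
    echo START_AS_PRIMARY        # this node wrote data last
  else
    echo REGISTER_AS_SECONDARY   # the peer was the last primary
  fi
}

lpa_decision 1693900000 ""       # peer down -> WAITING
```

With the peer offline there is simply nothing to compare against, which is exactly the situation hana43 is in here.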
DANGER!!! At this point the System Administrator should stop and start thinking.
Let’s assume that our System Administrator is still half-asleep, is surprised that SAP HANA is still down, and starts it manually.
Figure 8 – Pacemaker Cluster after SAP HANA on server hana43 was manually started
Once SAP HANA is started manually, the Pacemaker Cluster detects it and adjusts the cluster node attribute status to PROMOTED.
From the moment SAP HANA was started, all database updates are stored in the SAP HANA database running on server hana43. However, the database on hana43 does not have all the data that was persisted after the failover to the database on server hana44 – see Figure 4.
By manually starting the SAP HANA database, our System Administrator caused a logical inconsistency that will take incredible effort to fix.
Let’s look at the correct approach for dealing with situations where both servers are offline.
If you are unable to clearly determine which server was the “last primary”, then you need to start both cluster nodes. Without access to both servers, the Pacemaker Cluster is unable to correctly determine which server was running the “last primary” SAP HANA database.
Figure 9 – Pacemaker Cluster after both servers are restarted
Once the attributes from both cluster nodes are available, the Pacemaker Cluster checks which node has the higher LPT value to decide which database was the “last primary”. Unfortunately, this information is not written to the SBD drive or anywhere else outside the local node, so both nodes must be available to correctly determine the “last primary” SAP HANA database.
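Once both nodes are back, the comparison itself is trivial – the node with the higher LPT wrote data last. A sketch, with hard-coded illustrative values (the `crm_attribute` call in the comment shows how the stored value could be queried by hand; the attribute name `lpa_<sid>_lpt` is an assumption to be replaced with your SID):

```shell
# On a live cluster the stored value could be read per node, e.g.:
#   crm_attribute --node hana43 --name lpa_<sid>_lpt --query --lifetime forever
# Here we just hard-code two illustrative values.
lpt_hana43=30            # stale, secondary-style static value
lpt_hana44=1693900000    # recent timestamp - this node wrote data last

if [ "$lpt_hana44" -gt "$lpt_hana43" ]; then
  echo "last primary: hana44"
else
  echo "last primary: hana43"
fi
```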
An alternative approach can be used only if you are 100% sure which SAP HANA was the “last primary” – for example, because you got this information from some external monitoring system. In that case you can do exactly what was described in the previous section, except that you perform the steps on the correct server.
First, restart the “last primary” server (hana44), start Pacemaker, and then manually start SAP HANA, which will become the primary database. Make sure to execute these steps on the correct server.
Later, you can restart the other server (hana43), start Pacemaker, register the local SAP HANA as the new secondary, and clean up the resource so that SAP HANA is started as the secondary database.
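On SLES the recovery sequence could look like the following. Treat this strictly as a sketch: the SID HA1, instance number 00, site name, and resource name are assumptions that must be replaced with your own values, and every command requires a live cluster and database host.

```shell
## On hana44 ("last primary"), after boot - only if you are 100% sure:
# systemctl start pacemaker            # start the cluster stack
# su - ha1adm -c "HDB start"           # manually start HANA; it becomes primary

## Later, on hana43:
# systemctl start pacemaker
# su - ha1adm -c "hdbnsutil -sr_register --remoteHost=hana44 \
#     --remoteInstance=00 --replicationMode=sync \
#     --operationMode=logreplay --name=SITE_43"
# crm resource cleanup rsc_SAPHana_HA1_HDB00   # cluster starts HANA as secondary
```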
Please note that there is no protection that will prevent you from manually starting the wrong SAP HANA database and causing data loss or logical inconsistency. It is the responsibility of the System Administrator to either start both nodes at the same time or correctly determine which database was the “last primary”.
since hana43 was down and HSR is broken, there should be no log replay activity on hana43, and the last changed time on its log volume is far behind. Thus, we can check and compare the timestamp of the most recent update of the log volume between the hana43 and hana44 systems. The system with the more recent change time is likely to be the last “primary” system?
just my thought.
You mean changed time in the log volume on the file system?
and this is exactly the purpose of LPT... if you start the Pacemaker cluster on both cluster nodes, then the LPT values are compared and HANA on the correct side is started, moving the virtual IP there...
the issue is if you start only one server (the other server still being down) - then you do not know the state of HANA on the second server - in such a case the cluster cannot compare LPT and therefore will NOT start SAP HANA... the worst thing you could do is start HANA manually...
the conclusion is that, unfortunately, in such a situation where both nodes crashed and you do not know which was the last primary, you need both servers fixed to decide which was the last primary...
Excellent blog. I enjoyed the simplicity with which the complex scenario is explained. I have a quick question about the parameter "lpa_hac_lpt" - where do we find this parameter? In pacemaker.log, corosync.log, or some other parameter file?
Appreciate your response. Thanks.
Thanks for the nice blog. Will srHook help here, and how? Appreciate your answer on this.
not really sure what you mean by srHook... maybe you can elaborate...
the issue in a nutshell is that in a two-node cluster scenario - in a situation where both nodes failed together - if you have access to only one node (the other being inaccessible/crashed) - you have no way of knowing which one was the last primary - both nodes might claim they were the last primary - but only comparing LPT (last primary timestamp) will tell you which one is the real last primary...
the conclusion is that it is not safe to start the DB until you have both nodes (and there the Pacemaker can decide it for you by comparing the LPT values)...
note this is not a very common scenario, however I've seen it already (unfortunately) and the consequences are severe...
the only potential solution to this dilemma would be to have a 3rd independent component monitoring the cluster to observe which node was the last primary, which could help decide in place of the missing node... however, here we are already designing a different cluster solution...
Thank you Tomas,
Please have a look at this, looks like another addition to making decision for primary node.
I am familiar with what hooks are and how they work... what is not clear to me is how you suggest we use them to prevent the situation above...
...someone would need to develop some kind of logic here to trace the last primary and persist the information on both cluster nodes + a 3rd VM in a separate location...
the logic might be based on the following:
failover() --> for each failover you post the timestamp and name of the last primary (to all three locations)
startup() --> during start you check against either the 3rd VM or the other node whether the name of the last primary is you... if not, then you refuse to start...
so yes - theoretically possible - but SAP would need to own the subject and work with SLES and RHEL to discuss the details...
(all scenarios would need to be thought through properly - like what if that 3rd node is unavailable, etc.)
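To make the idea concrete, here is a toy model of the proposed logic in shell - local temp files stand in for the two nodes and the 3rd VM, and all names are made up; a real implementation would live in an SAP HANA HA/DR provider hook:

```shell
# Toy model of the proposed witness logic: record the last primary in
# three places on failover, and check the witness copy before starting.
WITNESS=$(mktemp)        # stands in for the 3rd VM
NODE1_COPY=$(mktemp)     # stands in for the copy on node 1
NODE2_COPY=$(mktemp)     # stands in for the copy on node 2

record_failover() {      # $1 = name of the new primary
  for loc in "$NODE1_COPY" "$NODE2_COPY" "$WITNESS"; do
    echo "$1 $(date +%s)" > "$loc"
  done
}

may_start_primary() {    # $1 = my node name; consult the witness copy
  read -r last _ < "$WITNESS"
  [ "$last" = "$1" ]
}

record_failover hana44
may_start_primary hana44 && echo "hana44 may start"
may_start_primary hana43 || echo "hana43 must not start"
```

As noted above, the unsolved part is precisely what happens when the witness itself is unavailable - which is why this quickly turns into designing a different cluster solution.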
Thank you Tomas, it helped !!!