Your SAP on Azure – Part 24 – Simplified high availability in the cloud!
SAP is one of the most critical business systems in a company landscape. It holds the most valuable data, and many organizations fully rely on the availability of their SAP ERP or SAP S/4HANA solutions. Even a small outage can lead to serious disruptions in business operations. Therefore many organizations decide to run the system in a highly available mode, which means that all components of the SAP system – database, Central Services instance and application servers – are redundant, so even a hardware failure doesn’t stop the system.
I already covered the topic of high availability for SAP NetWeaver-based systems. It requires failover clusters that are not easy to configure and maintain. As each component of the SAP system is a Single Point of Failure, you need to ensure that each of them is protected and accessible even in case of hardware failure. A sample architecture includes three failover clusters to protect each system layer:
- Central Services instance
- Shared storage
- Database workload
Failover clusters require additional resources to work properly. Firstly, you need load balancers to route the user traffic to the correct node. Depending on the configuration you may also need a cluster witness, which often means several additional small Linux VMs. Quite often such a complex architecture undermines your efforts and, in the end, decreases the overall availability of your SAP system.
In today’s blog, I’ll try to answer if the above sample architecture could be simplified when running SAP in Azure. I’ll focus on two areas – Central Services instance and shared storage.
Before I present my solution, let’s have a closer look at the Central Services instance architecture. It consists of two processes:
- The message server is responsible for load balancing of user connections and for communication between SAP instances.
- The Standalone Enqueue Server holds the user lock table.
Temporary unavailability of the message server prevents new users from accessing the SAP system, but it won’t drop existing connections. However, whenever the Enqueue Server is restarted, the content of the lock table is lost and you can no longer ensure system consistency.
To solve this challenge, a highly available installation of the Central Services instance includes an additional component that runs on the secondary node of the cluster. The Enqueue Replication Server keeps a replica of the lock table in the shared memory of the passive node. In case of a failover, all lock entries can be retrieved, and the consistency of the system is ensured even after an Enqueue Server restart. But as the lock table is stored in shared memory and cannot be retrieved over the network, the Central Services instance has to be started on the same node where the Enqueue Replication Server is running. Until recently this was a big limitation and basically enforced the use of clustering.
When you deploy a virtual machine to Azure, the hypervisor continuously monitors its status. If there is an issue with the power state, the virtual machine is automatically restarted. If the failure happens on the physical host, your workload is redeployed to another server, and data stored on persistent disks is preserved. This service healing is enabled by default for all virtual machines in all regions and ensures that your workload resumes operation as soon as possible, even in case of an underlying hardware failure.
Starting with SAP NetWeaver 7.52 you can use an updated version of the enqueue framework. The requirement to start the Central Services instance on the same host as the Enqueue Replication Server is no longer valid, as the lock table can now be retrieved over the network. When you combine this with the Azure service healing feature, you can design a simplified architecture for a highly available SAP system without a complex cluster configuration. In case of failure, your virtual machine will be restarted, and the Standalone Enqueue Server 2 will automatically connect to the Enqueue Replication Server 2 and fetch the content of the lock table.
The Standalone Enqueue Server 2 and Enqueue Replication Server 2 are only available for systems running on SAP NetWeaver 7.52 and newer. Depending on your system release, it may be activated by default or you may need to activate it manually.
| NetWeaver release | Standalone Enqueue Server 2 |
| --- | --- |
| Lower than 7.50 | Not supported |
| 7.51 | Supported, but without replication |
| ABAP Platform 1809 | Supported |
| SAP S/4HANA Foundation 1909 | Supported |
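A quick way to see which enqueue version an instance actually runs is the sapcontrol process list: the Standalone Enqueue Server 2 is reported as the `enq_server` process, while the classic version appears as `enserver` (a sketch; the instance number 11 is an assumption, adjust it to your system):

```shell
# Check which enqueue version the Central Services instance runs.
# "enq_server" = Standalone Enqueue Server 2, "enserver" = classic version.
sapcontrol -nr 11 -function GetProcessList | grep -iE 'enq_server|enserver'
```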
As the Central Services instance is restarted on the same virtual machine, preserving all data stored on persistent disks, you don’t have to worry about the shared filesystem either. You can create an NFS server and share the /sapmnt directory from the same host – and drop another failover cluster, which is especially appreciated if you previously had to configure DRBD replication.
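As a sketch, exporting /sapmnt from the Central Services host could look like this (the subnet, service name and export options are assumptions, not taken from my actual setup – restrict the export to your SAP subnet):

```shell
# On the Central Services host: export /sapmnt to the application servers.
echo "/sapmnt 10.0.0.0/24(rw,no_root_squash,sync)" >> /etc/exports
systemctl enable --now nfs-server   # service name may differ per distribution
exportfs -ra                        # re-read /etc/exports
```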
By using the new features that come with the new enqueue framework, we are able to eliminate two out of three clusters in a highly available deployment. Configuration and maintenance are much easier if you don’t have to run Pacemaker. In addition, such a deployment is also much easier to replicate to the secondary region using Azure Site Recovery, and no additional post-failover activities are required.
As usual, all good things come with some limitations. In this case, the simplified high availability configuration allows you to continue operations when a virtual machine becomes temporarily unavailable for reasons the Azure platform can resolve automatically. It won’t help you in case of misconfiguration – for example, if you provide a wrong IP configuration or manually stop the services. From the Azure perspective, the virtual machine would still be running without a problem. There is also no guarantee of how quickly the server will be rebooted – in the example below it took less than 2 minutes to restart the server, but you should always refer to the SLA for the maximum values that apply to your configuration.
Please remember that this is my private view on clustering and this blog doesn’t follow the official Microsoft recommendation which you can find here: https://docs.microsoft.com/en-us/azure/virtual-machines/workloads/sap/get-started
In my test scenario, I deployed a standard (non-HA) installation of SAP S/4HANA FOUNDATION 1909, which by default uses the Standalone Enqueue Server 2, on three virtual machines in a single availability zone. Instead of manually deploying the Azure resources, I decided to use quickstart templates available on GitHub.
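For reference, deploying such a template with the az CLI could look roughly like this (the template URI and parameters are placeholders; refer to the quickstart repository for the actual values):

```shell
# Hypothetical deployment of a GitHub quickstart template
az group create --name <ResourceGroup> --location <Region>
az deployment group create \
  --resource-group <ResourceGroup> \
  --template-uri <quickstart-template-url> \
  --parameters <parameters-file>
```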
I have manually created an additional virtual machine for the Enqueue Replication Server in the availability zone 2.
Once the deployment of the virtual machines was completed, I configured an NFS server on the Central Services instance host and mounted the share on all virtual machines that are part of the SAP NetWeaver system.
Create the directory and mount it on all servers:
mkdir /sapmnt
vi /etc/fstab

sha-ascs-0:/sapmnt /sapmnt nfs rw,hard,rsize=65536,wsize=65536,vers=3,tcp 0 0
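After adding the fstab entry on each server, the share can be mounted and verified (standard Linux commands, nothing SAP-specific):

```shell
mount -a          # mount everything from /etc/fstab, including /sapmnt
df -h /sapmnt     # verify the NFS share is mounted from sha-ascs-0
```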
I installed SAP S/4HANA FOUNDATION 1909 and the Enqueue Replication Server across the four deployed virtual machines.
| VM Name | Instance number | Instance type |
| --- | --- | --- |
| sha-ers-0 | 21 | Enqueue Replication Server |
There were no surprises during the installation of the system. The only thing I would like to highlight is that the Software Provisioning Manager is not fully prepared for the simple HA scenario. Before starting the installation of the Enqueue Replication Server, execute the Prepare Additional Cluster Node task – even though we’re not building a cluster. It creates the directory structure on the host, and without it the ERS installation fails.
Once the installation of all four components is completed, you can verify the release of the Standalone Enqueue Server on the Central Services instance using the following command:
sapcontrol -nr 11 -function GetProcessList
Similarly, you can display the process list on the Enqueue Replication Server:
If you look at the system, you will notice that after installing the ERS instance there are a couple of new parameters:
Default profile (DEFAULT.PFL):
enq/replicatorhost = sha-ers-0
enq/replicatorinst = 21

ERS profile (ERS_SHA_ASCS11_sha-ascs-0):

enq/server/replication/enable = true
New entries enable replication with the remote Enqueue Replication Server and point to the host where the service is running.
We need to ensure that the ASCS and ERS services are automatically started during system boot. We can achieve that by adding the Autostart parameter to the instance profile files:
vi SHA_ERS21_sha-ers-0

Autostart = 1

vi SHA_ASCS11_sha-ascs-0

Autostart = 1
Now, in case of an unexpected server reboot, both services start automatically.
Having all services up and running, we can begin testing. I logged into the application server and opened transaction SU01 to set a lock on table USR04. With the new Enqueue Server 2 you can display lock entries using the new t-code SMENQ, which provides additional monitoring capabilities:
Let’s start with something easy. Firstly, I would like to ensure that the lock entries are preserved when the Central Services instance is manually stopped.
sapcontrol -nr 11 -function Stop
sapcontrol -nr 11 -function GetProcessList
And refresh the lock table:
The lock entry is still there!
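To finish this test, the instance can simply be started again and the process list checked (same assumed instance number 11 as above):

```shell
# Bring the Central Services instance back online after the manual stop
sapcontrol -nr 11 -function Start
sapcontrol -nr 11 -function GetProcessList
```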
But that’s only half of the success. We should also check how the Central Services instance behaves after an unexpected failure. I wanted to simulate a real virtual machine failure, which was a bit of a challenge on Azure. When you restart the virtual machine using the Azure Portal or the operating system reboot command, the procedure is graceful, which means it first tries to stop all services running on the host. Such an approach wouldn’t give meaningful results, as there is a chance the Central Services instance sends some signals to the Enqueue Replicator in the background, which wouldn’t be the case for a sudden virtual machine failure. After some research, I decided to use the az command-line interface, as its restart command implements a force option that simulates a power-off scenario.
I would also like to measure how long it takes to restart the Central Services instance. After I ensured the time was in sync, I rebooted the remote server:
date
az vm restart --force --name <ASCS-VM> --resource-group <ResourceGroup>
It took 1 minute and 41 seconds to restart the virtual machine and bring the central services back online. I’m impressed by this result. The entire process was probably quicker than a regular cluster failover.
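If you want to measure the outage without a stopwatch, a simple polling loop against sapcontrol does the job (a rough sketch; instance number 11 assumed):

```shell
# Record the start time, then poll until the instance answers again
date
while ! sapcontrol -nr 11 -function GetProcessList >/dev/null 2>&1; do
  sleep 5
done
date   # the difference between the two timestamps approximates the outage
```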
Quick look at the lock table. The USR04 entry is still there:
The objective of the last test is to verify that the Enqueue Replicator will reload the lock table if something goes wrong on the virtual machine where it is hosted. The new SMENQ transaction allows you to display the lock table kept by the Replicator – previously we’d have to use a command-line interface. I followed the same restart approach as for the Central Services instance. When I tried to refresh the lock entries while the Enqueue Replicator was unavailable, I received an error message:
But after two minutes the service was back online. You can see the lock entry is still there.
The Developer Trace provides more details. The Enqueue Replicator fetched the lock entries and continued operations:
Migrating your SAP system to the cloud gives you new opportunities to streamline system operations and maintenance. Physical hardware failures matter much less, as the cloud platform can quickly restart virtual machines on a new host. Microsoft Azure is the only cloud platform that provides a 99.9% SLA for a single virtual machine when it uses premium storage. Together with the improved enqueue framework, you can design a highly available SAP system without building unnecessary clusters.
Hi Bartosz Jarkowski,
Considering that you mentioned this in correspondence to the SAP ABAP system, what is the fate of the SAP Java system, as you would find SAP Java at NW 7.5 as of now? This portion sounds not true.
Very good blog post, Bartosz Jarkowski
You can add some optional pointers like min. OS recommendations (RHEL or SUSE etc.) – I understand it works with both – and also the GitHub link you refer to, if possible.
Hi Bartosz Jarkowski thanks for sharing, what a great idea!
Let's start a campaign to get this model listed as an officially recommended Microsoft recommendation! I think this would gain significant customer support. I'm a big fan of removing complexity in the design, especially in the area of clusters. I've been working with clusters for close to two decades and have seen good and bad impact. When it's bad, clusters reduce service up-time, exactly what you are trying to avoid!
I have installed SAP HANA with system replication on 2 servers, using SUSE Pacemaker cluster. This is working fine. Now I need to install S/4HANA on a cluster. I have only two VMs. One for ASCS, one for ERS. PAS and AAS should be located on the database servers, or can they be located on both the ASCS and ERS servers?
The restriction being that I only have two machines for application layer.
Thank you Bartosz,
I am not a big fan of clusters and they often cause more pain than they are worth. This seems to be a simple solution which should be convenient to configure and use. As long as one is not looking for a multi-AZ auto-failover solution, this should work just fine.
My only question is regarding shared storage (/sapmnt). If I understood you correctly, you are suggesting configuring a local NFS service on the (A)SCS node itself and sharing the storage to the application servers. In case of a failure, the Azure healing service will reboot the (A)SCS host, which will mean the /sapmnt share will be temporarily unavailable till the (A)SCS host comes back up. Would that not cause the SAP service to go down on the app servers as well, causing a system-wide outage? I think app servers are resilient and I will test this out myself, but asking in case you have checked this.
in terms of shared storage I would consider one of the PaaS services that Azure offers, like AFS or even ANF. I agree that disconnecting storage is not ideal 🙂
I went through the details of your test results and it seems that the lock entry did not go away which I assume is an indication that the application server did not go down. Which is great. I will try and configure this HA scenario myself and see how it goes. It may be a great cost saving way for many of my customers.