How to crash your iflows and watch them failover beautifully
This is part 1 of a series discussing distributed scenarios for SAP Cloud Integration (CPI).
Part 1: How to crash your iflows and watch them failover beautifully (DR for CPI with Azure FrontDoor)
Part 2: Second round of crashing iFlows in CPI and failing over with Azure – even simpler (DR for CPI with Azure Traffic Manager)
Part 3: Black Friday will take your CPI instance offline unless… (HA for CPI)
Are you happy with redundant deployments in availability zones in a single cloud region? Are your iFlows down regularly due to CPI maintenance windows? Or are you striving for cross-region failovers? SAP deploys its CloudFoundry services zone-redundantly in its multi-cloud environment (Azure, AWS, GCP and Alibaba). You can find the reference here.
There is nothing wrong with that. The approach is simple and cost efficient. But from a resilient cloud architecture perspective it is often recommended to go one step further.
The scope of my post is SAP Cloud Platform Integration (CPI), but the approach is not limited to that.
Securing cloud workloads comes with many options
An even more reliable setup uses redundancy for high availability in the same region and a passive disaster recovery setup, that contains only the basic cloud landing zone, in another cloud region considerably far away. This way you keep the cost at bay but at the sacrifice of recovery time, because you need to provision services during the disaster first. An active setup profits from already running services but doubles the cost obviously.
You might even do this cross-cloud providers if you want.
A note on the side: Some of the very sophisticated customers that I have seen even perform failovers on Azure deliberately once a year. They run the first half of the year in West Europe and the other half in North Europe. This way you ensure a well-understood CloudOps practice.
See below a screenshot from the SAP on Azure reference architecture for S/4HANA for the mentioned setup.
Fig.1 Screenshot from SAP on Azure reference architecture docs
Great, thanks to the above your precious backend can perform a failover to another region. But what happens to your connected SAP Cloud Platform services in that case? And specifically, your iFlows deployed on CPI?
Single instance CPI with S4 failover
It is likely you are in a situation depicted down below, where you configure a SAP Cloud Connector in each region and point them towards the same SCP Subaccount. For requests from SAP to SCP you are fine, but for the other way round you need to apply the Location ID concept, so that your iFlow knows which Cloud Connector to use.
Note on the side: Requests from SAP to SCP should go through a specified Internet breakout, because your SAP workload runs in a private VNet, that is only reachable via a gateway component for web traffic, VPN or ExpressRoute. For the sake of the CPI failover focus I simplified the described setup.
Fig.2 Overview of simple DR setup S4 and CPI
Congratulations, you just created a dependency in your iFlows on the Cloud Connector Location ID. Given that disaster recovery decisions are often manual human decisions, you should be fine with a simple parameter. To avoid redeploying a lot of iFlows I recommend using a global variable instead of a String exposed as an externalized parameter.
Be aware that the global variable expires after 1 year, so you need to have a process to re-set it. Down below I detailed a possible approach to configure the Location ID dynamically from a global variable.
UPDATE: Santhosh Kumar Vellingiri rightfully pointed out in the comments that the Partner Directory Kit would be an even better solution than the variable to avoid database requests. You can find additional guidance on how to set this up here.
|I created a simple iFlow that takes the location ID string that I want to activate on my CPI tenant from a header. I did that for simplicity reasons. You could also go for the payload, the query string or even read from another service.|
|Once executed I see the global variable on the Monitoring UI|
|With that I can access the global variable in any iFlow in my tenant. I store it in a header again (could also be a property if you prefer that) to use it for the subsequent http call via the Cloud Connector|
|There you go. The Location Id is now being set dynamically from a global variable|
|The associated Destination for Cloud Connector looks like this. So, my authentication and Location ID come from the iFlow and the Destination just passes them on.|
|The configuration of the Cloud Connector on the VM in my primary Azure region contains the Location ID that we need to match|
Table 1 Setup for global var on CPI tenant
In case a manual DR decision is not good enough you could start thinking about automating the process. I propose a service that checks the health status of your S4 backend. I have used the basic SAP Ping service for that purpose before. However, the devil is in the detail: At what point is the SAP backend considered not available anymore? 10 request failures within 5 minutes? 10 minutes? Did it fail over to the DR side fully yet? Or is just something wrong with the Cloud Connector?
Fig.3 Screenshot of iFlow to automate change of dynamic Location ID
Fig.4 Overview ping-based failover with LocationID change
For the above proposal you need a means to determine the currently active SAP backend. This iFlow example assumes that the active SAP system URL is stored globally like the Cloud Connector Location ID. It probes the S4 system on the ping service regularly. In case it fails it automatically changes the Location ID and active SAP URL to the secondary site on the CPI tenant. Voila, you just created a mess 😀
Obviously, you would want more checks like: how often did my probe fail in a given time frame? Maybe I should try a second time after 1 minute or so to allow network latency issues to resolve? The Circuit-Breaker pattern is something I can recommend if you want to dig deeper in this direction.
Hey, and what about reversing this process once my primary instance becomes available again? Maybe manual failover is not so bad after all.
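To make the "how often did my probe fail" question concrete, here is a minimal Python sketch of a circuit-breaker-style counter. It is not the iFlow from fig.3; the class name `FailoverBreaker`, the site labels and the thresholds (10 failures within 5 minutes, borrowed from the question further up) are assumptions for illustration. It only trips to the secondary site once enough probes have failed inside a sliding time window, and it keeps the failback as a deliberate manual step.

```python
import time
from collections import deque
from typing import Callable, Deque


class FailoverBreaker:
    """Trips to the secondary site after too many failed probes in a time window."""

    def __init__(self, max_failures: int = 10, window_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures        # hypothetical threshold
        self.window_seconds = window_seconds    # hypothetical window length
        self.clock = clock                      # injectable so it can be tested
        self.failures: Deque[float] = deque()   # timestamps of recent failed probes
        self.active_site = "primary"

    def record_probe(self, success: bool) -> str:
        """Record one probe result and return the site that should be active."""
        now = self.clock()
        if not success:
            self.failures.append(now)
        # Forget failures that have dropped out of the sliding window.
        while self.failures and now - self.failures[0] > self.window_seconds:
            self.failures.popleft()
        if self.active_site == "primary" and len(self.failures) >= self.max_failures:
            self.active_site = "secondary"      # trip: switch Location ID and SAP URL
        return self.active_site

    def reset(self) -> None:
        """Manual failback once the primary is confirmed healthy again."""
        self.failures.clear()
        self.active_site = "primary"
```

The returned site label would be what you write into the global variable (or the Partner Directory parameter) that holds the Location ID. A single successful probe does not switch back on its own, which avoids flapping between the two sites.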
That shall be it for an introduction to single instance SCP CPI with failover only on the SAP backend side. But I promised you cross region failover including CPI, right? But wait, my SAP backend configuration can only contain one target!
Reverse Proxy to the rescue
To enable dynamic failover of your integration scenarios from both S4 instances (primary and DR) to CPI (primary and DR) we need an abstraction layer. Otherwise you would need to change the communication configuration on the backend when disaster happens. What we are looking for is a globally highly available component that acts as a reverse proxy, checks the health of my CPI endpoints and routes requests from whichever S4 instance is currently active. I am using the managed services Azure Front Door (AFD) together with Azure API Management (APIM, consumption based) to achieve that. There are multiple other options that could achieve the same thing. For instance, on the Azure side you could look at a simple and cost-efficient DNS-based solution with Azure Traffic Manager (TM). Find a comparison of both solutions here and my second post on this topic specifically with Azure TM here.
I chose Front Door for its rich feature set (e.g. Web Application Firewall and global availability) and quick failover reactions without the wait times of DNS propagation. With DNS caching it can take some time until your requests get re-routed to your DR CPI instance, because the cache acts on multiple layers with individual time-to-live settings. Even if the global DNS entry is adjusted by Azure Traffic Manager instantly it can still take minutes until your client sees the change.
API Management is primarily used to implement the health probes from AFD to the protected endpoints on SCP CPI. AFD accepts only http 200 as a successful probe response. With APIM I get more freedom to decide which responses count as successful.
Fig.5 Overview failover architecture S4 and CPI
Awesome, now we can rely on AFD to re-route our iFlow calls depending on the availability of my CPI tenant. Since the communication details of the target iFlow are hidden behind the AFD address, it doesn’t matter anymore if my request originates from the primary S4 or my DR instance. In case S4 does a failover from West EU to North EU but CPI stays available in West EU, AFD accepts the requests from the DR site and forwards to CPI in West EU. It would also cover a failover that concerns only CPI, where S4 was not impacted at all and stayed in West EU.
Great, all combinations of system moves are covered routing-wise. What about the Cloud Connector for S4 inbound requests? For the highest degree of flexibility, you would need to register both Cloud Connector instances with each productive SCP subaccount (West EU + East US).
If you configure only 1-1 connections between the primary Cloud Connector and the primary CPI instance, you need to fail over CPI (even though it might still be operational) in case your S4 moves to the DR side. This scenario can be avoided with many-to-many registrations of the SCP subaccounts on the Cloud Connectors.
How do I keep the CPI tenants in synch?
Fig.6 CPI iflow synch
You could set up the SCP Content Agent Service with two transport routes: one pointing towards the primary productive CPI instance and one towards the DR instance. Find a blog post by SAP on the setup process here and an extended variation by the community here.
This works for CPI packages and its content. But what about credentials, private keys, variables, and OAuth credentials? There is no standard API provided to synch them. One option would be to create them externally first (e.g., in Azure KeyVault) and deploy via API to both tenants. Or secondly you could think of providing a custom iFlow to extract and expose to the other CPI tenants.
The SCP content agent looks nice, but creates double the cost, because I need a CPI tenant running that I am hopefully never going to use. The same is true for the S4 on the DR side. So, what if I am willing to take higher risk, suffer longer recovery times and do only passive setups?
What do I mean by passive: Above scenario in fig.3 is considered an active-active scenario where the workloads are waiting in hot-standby ready to take over once needed. An active-passive scenario involves only the SCP and Azure landing zone setup from a configuration, networking, provisioning and governance perspective. The actual workloads, like the Azure VM or the CPI instance, are either stopped or not even created yet to save cost.
How to spin up resources for an active-passive scenario quickly?
Manual config works but can we do better than that? Here are some thoughts on it:
For the SAP backend part of it I can recommend looking at recent Infrastructure-as-Code (IaC) with DevOps blogs. Here is a link to some of the Azure-related video resources to get you started on the Infrastructure-as-a-Service part.
So far there is no IaC approach to spin up services like CPI in SCP in a programmatic way, that could be leveraged from Azure DevOps Pipelines, GitHub Actions or the likes. So, the actual service needs to be provisioned manually for sure. Once done, you could either synch from a new DR route on the SCP Transport Management Service or perform a manual import.
For manual import I recommend versioning artifacts in an external Git repository to be sure you can recover the files, have auditability on the changes and avoid the risk of losing access to the currently active version. Your implementation on the DEV CPI tenant is probably in a different state than what is in production right now. And keep in mind that an outage that causes a failover for CPI also crashes your CPI Web UI.
Fig.7 Manual Git-based synch of CPI artifacts
For DevOps-based approaches on iFlow-level have a look here:
- For CloudFoundry apps (MTAR projects) I published a blog on blue/green deployments with Azure DevOps
- For iFlows find guidance on an implementation here. The SAP API Business Hub describes the interface to create, modify or delete iFlows programmatically.
Bottom line: manual provisioning and configuration is straightforward, but poses a greater risk of failures and leads to a longer time to recovery during a disaster.
To perform all those steps, it is a best practice to have an Admin or Emergency user with elevated rights on the Subaccount. Developers should not touch CPI in production to avoid risk of artefact inconsistency.
Finally, the Azure Front Door setup
Create a frontend (the abstraction layer with a new URL masking CPI), configure your two Azure APIM instances (your CPI proxy tenants) as backend pools and finish by creating the routing rule to tie both things together.
Fig.8 Screenshot from AFD setup
Front Door uses priorities and weightings to forward traffic. In our case I want to always target CPI in Europe and only switch to East US in case my probe fails. The setting below considers my backend healthy when the last two probes within 60 seconds (one probe every 30 seconds) were successful. Be aware that AFD has a large number of Points of Presence (PoPs) around the globe, which results in a considerably high number of probes. See the note below from the Azure docs or the reference article here.
We use http method HEAD to avoid triggering messages on CPI and be most cost-efficient. Remember we map iFlow authentication errors (http 401) to 200 success for our AFD probe.
For the probe path I deployed an iFlow that returns only its region. It listens on “<cpi>/health”.
Fig.9 Screenshot from AFD health probe setting
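The combination of probe samples and priorities can be sketched in a few lines of Python. This is only a simplified model of the behaviour described above, not Front Door's actual implementation: weightings are left out, and the names `Backend` and `pick_backend`, the backend labels and the handling of incomplete sample sets are assumptions made for this example.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List, Optional


@dataclass
class Backend:
    name: str
    priority: int                 # lower number = preferred backend
    sample_size: int = 2          # number of recent probes that are evaluated
    successes_required: int = 2   # successful probes needed within the sample
    samples: Deque[bool] = field(default_factory=deque)

    def record(self, status_code: int) -> None:
        """Store one probe result; only http 200 counts as success."""
        self.samples.append(status_code == 200)
        while len(self.samples) > self.sample_size:
            self.samples.popleft()

    def healthy(self) -> bool:
        # Assumption of this sketch: without a full sample set we stay optimistic.
        if len(self.samples) < self.sample_size:
            return True
        return sum(self.samples) >= self.successes_required


def pick_backend(backends: List[Backend]) -> Optional[Backend]:
    """Return the healthy backend with the best priority, or None if none is healthy."""
    healthy = [b for b in backends if b.healthy()]
    return min(healthy, key=lambda b: b.priority) if healthy else None
```

With two samples at a 30 second interval, a single failed probe is already enough to mark the European backend unhealthy in this model, so traffic moves to East US until two consecutive probes succeed again. What the real service does when no backend is healthy is not modelled here.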
My routing rule (AFD -> CPI) simply targets all paths, because the pattern matching is set to “/*”.
Fig.10 Screenshot of AFD Routing Rule config
The metrics on AFD allow you to monitor the health of your backends. You can create alert rules in case certain thresholds are exceeded or pin the graph to your dashboard. This way your CloudOps team has real-time insights into your CPI tenant status and possible failovers. We expect to see values in the high 90% range. Otherwise there would be a failover, right 😉
Fig.11 Screenshot from AFD metrics screen with Alert and dashboard pinning option
On my two consumption-based Azure APIM instances (one in west EU and one in east US) I forward all requests to their respective remote SCP CPI instance.
Fig.12 Screenshot from APIM setup
The only interesting configuration can be found on the custom probe operation. In there I map authentication error (http 401) responses from CPI for the “health” iFlow to http 200 success. This way we get proper health probes on AFD.
<choose>
    <when condition="@(context.Response.StatusCode == 401)">
        <return-response response-variable-name="existing response variable">
            <set-status code="200" reason="Probe" />
        </return-response>
    </when>
</choose>
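To spell out what this mapping means for the probe result, here is a tiny Python sketch that mirrors the policy. The function names are made up for illustration; only the 401 to 200 mapping comes from the policy itself.

```python
def status_seen_by_front_door(cpi_status: int) -> int:
    """Mirror the policy: an authentication error proves the CPI runtime is answering."""
    return 200 if cpi_status == 401 else cpi_status


def probe_is_successful(cpi_status: int) -> bool:
    """Front Door only treats http 200 as a successful probe."""
    return status_seen_by_front_door(cpi_status) == 200
```

Every other status, for example a 404 from a wrong path or a 503 from an unavailable runtime, is passed through unchanged and therefore still counts as a failed probe.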
For your convenience I uploaded the OpenAPI definition on my GitHub repos. Find the link at the end.
DR drills for the faint hearted
Let’s test this already! We have two CPI instances: our primary in Azure West EU and our secondary in East US. My DR demo iFlows store the value of the global variable for the location ID in a header. So, we can reuse that information in Postman to verify a correct setup. Initially my Postman request to https://cpi-dr-demo.azurefd.net/http/drdemo/primary returns the products from OData service EPM_REF_APPS_PROD_MAN_SRV with value “primary” for header “Mylocationid”.
Fig.13 Screenshot of Postman, response headers
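If you prefer a script over Postman for the drill, the same check can be polled in a loop. The following Python sketch uses only the standard library; the function names, the polling interval and the optional `authorization` value are assumptions for illustration, while the URL and the “Mylocationid” header are the ones used in this post.

```python
import time
import urllib.request
from typing import Callable, List, Optional, Tuple

URL = "https://cpi-dr-demo.azurefd.net/http/drdemo/primary"  # Front Door address from the post

Observation = Tuple[float, Optional[str]]


def read_location_id(url: str = URL, authorization: Optional[str] = None,
                     timeout: float = 10.0) -> Optional[str]:
    """Return the Mylocationid response header, or None if the call fails."""
    headers = {"Authorization": authorization} if authorization else {}
    request = urllib.request.Request(url, headers=headers)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.headers.get("Mylocationid")
    except OSError:  # covers URLError, HTTPError and socket timeouts
        return None


def watch(samples: int, interval: float = 5.0,
          fetch: Callable[[], Optional[str]] = read_location_id,
          sleep: Callable[[float], None] = time.sleep) -> List[Observation]:
    """Poll the endpoint a fixed number of times and keep (timestamp, location ID)."""
    observations: List[Observation] = []
    for i in range(samples):
        observations.append((time.time(), fetch()))
        if i < samples - 1:
            sleep(interval)
    return observations


def find_switches(observations: List[Observation]):
    """Return (timestamp, old value, new value) for every change of the location ID."""
    switches = []
    for (_, previous), (timestamp, current) in zip(observations, observations[1:]):
        if current != previous:
            switches.append((timestamp, previous, current))
    return switches
```

Calling `find_switches(watch(60))` during the drill would list the moment the header flips from “primary” to “secondary”, which gives you a rough measurement of the failover time as seen by a client.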
As a next step I misconfigure the probe of the European APIM instance, which checks the health status of CPI, to simulate an outage. It is kind of hard to create an error other than http 401, as we cannot rely on iFlow message failures, nor can I make CPI unavailable if I am not willing to delete the instance.
I changed the probe from “https://<cpi-runtime>/http/health” to “https://<cpi-runtime>/”, which fortunately gives me an http 404.
As of now the probes will start to fail. Once we have multiple errors reported within the 60 second probe window…
Fig.14 Overview of simulated CPI outage in Azure west-eu
We start seeing the “Mylocationid” header change to secondary, meaning my requests are now routed via CPI in Azure East US. Lucky us. The same is reflected on the AFD metrics:
Fig.15 AFD metrics of CPI health after simulated outage
We see a major drop for the red line (CPI instance in Europe), which caused our failover. Nice! Such a “drop” would be a straightforward trigger for an alert. You can create it directly in the portal from the buttons above the metrics chart (see fig.15). Typically you would send an e-mail, push a Microsoft Teams message or log a service ticket in ServiceNow, for instance. You get that out of the box.
Fig.16 Screenshot from Alert wizard
Even if you don’t want to use this kind of automatic failover it is still worth considering a reverse proxy in between your S4 and CPI. This way you keep the flexibility to add failover later without the need to touch the S4 backend configuration. The described setup is often called a facade pattern, in case you want to dig deeper.
Thoughts on production readiness
- Authentication endpoints CPI: This example implementation of failover with two CPI tenants in different Azure regions became a lot easier this month due to an authentication mechanism change for CPI. We can now log in with S-User IDs again. Before, you needed to use a service key that contained individual credentials for each CPI tenant. That would have created the need for us to consolidate authentication first, before we could put Front Door in between. Otherwise Front Door would have routed you correctly, but your backend would need to know upfront which credentials to provide. Not helpful at all. My approach back then was to add Azure AD into the mix to get a global login with one credential that is accepted in all my SCP subaccounts. The necessary trust setup can be found in my colleague’s blog. Luckily that is no longer necessary. You just need to make sure that the S-User is a member of all relevant SCP subaccounts and has the new MessagingSend role.
- Use HTTP 401 vs. Basic-Auth header on probe: You could argue that my mapping to http 200 for Front Door is suboptimal because we use an error to communicate success. There are scenarios where this might be misleading. Again, this is mostly about service availability and recovery rather than proper user configuration. I would expect those errors to be handled differently. If you want to actually call the iFlow with authentication you can add the basic-auth header in the pre-processing step of APIM in Azure. Be aware that you open up an iFlow on the internet in doing so. I would recommend creating a new SCP role on the subaccount at least.
- To failover or not to failover: I would recommend fine-tuning the parameters and logic on Front Door, APIM and the iFlows provided, based on your needs, to avoid unintentional CPI failovers. Especially if you do a passive setup.
- Secure AFD and APIM: It is best-practice to limit allowed traffic only to anticipated services. Find a reference to restrict APIM inbound only to AFD here. To act upon malicious traffic on Front Door level you can activate its internal Web Application Firewall.
- Secure CPI endpoints: So far there is only a header-based IP allow-list mechanism in CPI to restrict the calling IPs to your services. In our case that is Front Door. Have a look at this blog to learn about the setup process on CPI. Find the description to identify your Front Door instance here and on our docs.
- Automate the failover process and configuration synch: It starts with the S-User for the CPI authentication. How do you keep it in synch across subaccounts? User provisioning from Azure AD could be an option for a streamlined approach. There is also the SAP API Business Hub for SCP. Be that as it may, there are a lot of options to add automation to the described failover case. But they require detailed knowledge and create complexity. However, a well-understood hybrid approach with manual decisions and some automation looks promising.
- Stateful vs stateless iFlows: My provided guidance for the failover assumed stateless iFlows that do not store any messages (e.g. JMS) and run isolated. Hence, they don’t have the risk of running in parallel creating messages twice and conflicting with each other on the target systems. Consider timer-triggered iFlows or polling triggers like SFTP in an active-active CPI scenario: You need means to deal with the duplicates or have clear “failover” switches that avoid active-active altogether. For JMS you could consider moving the message queue outside of CPI to overcome the isolation. A geo-redundant flavour of a message queue, like the Azure Service Bus, could mitigate that. You would work with your messages outside of CPI via the AMQP adapter. For polling triggers, consider replacing the iFlow trigger with http to avoid an unintentional restart after a CPI outage.
- Monitoring and logging: With active-active setups you need to be aware that your CPI Admin or CloudOps team needs to check two tenants as they create independent logs. To overcome that you would need an external monitor that consumes the logs via the official OData API. I provided a guide on how to achieve that with Azure Monitor here. You could scale that to multiple CPI tenants.
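Regarding the duplicates mentioned under “Stateful vs stateless iFlows”: one common way to deal with them is an idempotent receiver on the target side that remembers which message IDs it has already processed. The Python sketch below only illustrates the idea under simplifying assumptions: the class name is hypothetical, the store is an in-memory set, and in a real active-active setup the list of seen IDs would have to live in a store that both regions can reach.

```python
from typing import Callable, Set


class IdempotentReceiver:
    """Processes every message ID at most once, even if it is delivered twice."""

    def __init__(self, handler: Callable[[str], None]) -> None:
        self._handler = handler
        self._seen: Set[str] = set()  # assumption: in-memory only, not shared across regions

    def receive(self, message_id: str, payload: str) -> bool:
        """Return True if the payload was processed, False if it was a duplicate."""
        if message_id in self._seen:
            return False
        self._handler(payload)
        # Mark as seen only after the handler succeeded, so failures can be retried.
        self._seen.add(message_id)
        return True
```

If both CPI tenants pick up the same file or fire the same timer, the second delivery with an identical message ID is simply dropped instead of creating a second document on the target system.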
Alternative Proxy components
In my example we saw Front Door with API Management. You could also look at the following depending on your requirements
- Azure Traffic Manager for a DNS-based approach. Check the DNS caching time-to-live settings and keep multi-layer caching in mind! Here is my post on the setup
- Azure Function proxies for a programmatic approach.
- Deploying an enterprise-grade Reverse Proxy like Apache in a VM or Container and manage all SLA-relevant setup yourself.
- Consider other managed proxies on the market
That was quite the ride! I explained how SAP deploys CloudFoundry zone-redundant in its multi-cloud environment and how that impacts your SAP backend disaster recovery plans.
At first we had a look at single SCP CPI (no DR) deployments with S4 backend failovers from its primary region west EU to its disaster recovery site in north EU. The Cloud Connector setup and Location ID concept play an important role to make that work.
To enable failover for CPI too, we needed to add two more components to abstract the communication configuration on the individual S4 backends. This way you can keep the config on the backend no matter what happens with downstream services. The newly added components act essentially as a reverse proxy, that needs to be highly available and globally available to meet the requirements for a resilient failover. For that purpose I introduced Azure Front Door and Azure API Management (consumption based). Keep in mind that some CPI configs and runtime artifacts such as Credentials need to be synched too (see last two points on “Thoughts on production readiness”).
Of course, there are other services out there that could do similar things. The beauty of this setup is minimal configuration effort and complete usage of managed services.
In the end we simulated a failover that was induced by a misconfigured probe endpoint on API Management, because it is rather hard to create availability errors of CPI without deleting the instance. The health probes from Front Door picked up the disruption and automatically started re-routing the traffic. Not too bad, right?
Next to this failover scenario I also published a guide on how to monitor CPI messages with Azure Monitor here.
Find the used iFlows on my GitHub repos.
As always, feel free to leave comments or ask lots of follow-up questions.
Excellent write-up 🙂 I have been through a similar journey to set-up CPI failover and guess I can write my experience in brief here.
We also had Azure Front Door based solution for months. An active-active scenario with two CPI Production tenants added as an independent backend pool. Then with the Path pattern(s)-based routing rule, we were able to split traffic between two tenants or consolidate to one tenant as and when required. This kind of acted as a load balancer on normal days and as a reverse proxy to redirect traffic for failover. However, there was a limitation to have interface consumers connect over client / mutual authentication towards AFD.
Then we switched to Azure Traffic Manager and are still live with it by using priority-based routing. This solution enabled us to use client / mutual authentication in addition.
OAuth authentication is still a limitation with both AFD / Azure Traffic Manager (since the access token issued is tenant specific).
For SCPI to Backend failover location ID, I prefer to use the Partner Directory Kit-String parameter since this is read from cache, while the global variable is a kind of DB lookup, and doing so for every message run can be an overhead.
That is an interesting comment and I am glad the concept already sees some adoption in the field!
There are many variations on how to implement, so your notes and why you chose what are extremely insightful. I especially like the Partner Directory Kit piece. Thanks for sharing.
Community: Find some setup guidance (NEO environment but can be transformed to CF) for the Partner Directory Kit here: https://blogs.sap.com/2017/07/25/cloud-integration-partner-directory-step-by-step-example/
Your comment also made it clear to me that it is worth investing some more effort in the OAuth part for the proxies.
Thanks a lot for your blog, it gives good insight into the concept. We have CPI, CPMS and Portal, and plan to implement a DR setup. Your blog will definitely help on the CPI side. Meanwhile we are developing our own concept for CPMS, as this has its own challenges.
On the portal, we would need a setup with redirection I believe, as it's a user-centric scenario.
For example if the user enters portal.mycompany.com, then this should redirect to https://flpnwc-primaryaccount.dispatcher.hana.ondemand.com/ or https://flpnwc-disasteraccount.dispatcher.hana.ondemand.com/ . Any thoughts on how to go about doing this ?
Could you please specify what CPMS actually is? Are you referring to Cloud Platform Mobile Services? Also, an architecture diagram is always helpful 🙂
The redirection example you describe in general is a typical use case for DNS based solutions like Traffic Manager or for actual global routing with Azure Front Door.
Feel free to reach out to have a detailed call about this.
Yes, CPMS is Mobile Services (Neo).
For redirection, how would it work ? APIM will do a heartbeat check and then TM/AZF will do redirection ?
The concept stays the same as I describe it for CPI. Health probes or heartbeats need to be checked from Traffic Manager (TM) or Azure Front Door (AFD), since they make the automated decision to re-direct based on certain error codes or response delays. Or you can manually switch on TM/AFD.
Your challenge will again be that there is no dedicated health endpoint for PaaS services on BTP. So, you will need to identify an http endpoint that you can call regularly with an http HEAD request and that gives you enough info about the "healthiness" of your Mobile Services instance. How about a feature request to SAP to expose a health endpoint for such scenarios?
How about a bigger ask - BTP native DR for all the services 🙂 ? They already do that for HANA.
DR was a given in the on-prem world (using VMs, replication, log shipping, etc.)... it should have been a de facto standard in the cloud, shouldn't it?
Very difficult to convince the business that we are on the "cutting" edge but need to do all this for setting up DR (and even that isn't foolproof).
SAP does HA and DR for you for the PaaS or SaaS service on BTP. That happens transparently for you.
My post enables you to get more control, higher resiliency or scalability (described in part 3 of the series) yourself independent of the given capabilities that you get from SAP out-of-the-box. Especially if you are thinking cross region in hyperscaler environments. BTP by design is a single region effort.
You are right, BTP is a single-region effort, and having experienced a lot of outages (entire region or individual tenant) where all business-critical transactions were stuck for hours, it's very important to have cross-region HA. I believe this should be an option by design so that we do not need to custom-build it.
Also, the fact that there is no "passive" option so that we are charged only when HA is activated.
From my perspective, it is expected that cloud platforms / PaaS should give us this option as workloads are moving to cloud now.