Second round of crashing iFlows in CPI and failing over with Azure – even simpler
I am building on my last post, which covered the redundancy and failover strategy for SAP CPI in the Cloud Foundry (CF) environment using Azure Front Door and API Management.
Community feedback on my first post showed that it would be valuable to cover the alternative approach with Azure Traffic Manager (TM) and CPI custom domains too. It allows for a pure DNS-based solution, which results in lower cost. Furthermore, the flexible TM setup for the expected HTTP status codes lets you drop API Management for the health probes, compared to the Front Door option.
However, DNS time-to-live and cache refreshes can be a challenge for those of you who are quick on the F5 button 😉
The broader concept and caveats are covered in the first post. Here I will only show the Traffic Manager and SCP-specific changes to the approach.
As mentioned above, I am covering the multi-cloud environment (CF) today. If you are interested in CPI on NEO with Azure TM, have a look at Santhosh Kumar Vellingiri’s great post “SAP CPI – High Availability using Azure Traffic Manager“.
An architecture overview to get us started
Fig.1 architecture overview with TM and custom domain for CPI runtime
At the heart of this alternative failover solution are custom domain names. They enable Azure TM to respond with different endpoints based on their given priority. In other words, you define which endpoint is used when the primary one stops responding. Furthermore, TM “fails back” once the primary endpoint becomes reachable again. The described setup is the same as in the first post, except that we use TM and custom domains to perform the failover.
Let’s have a look at the moving parts
For all of this to work we need to register our own custom domain, tell CF about it and put TM in between so that it can influence domain resolution. The relevant official SAP docs for CF on this process can be found here.
The initial steps walk you through installing the CF CLI including the custom domain plugin, assigning quota for custom domains in the SCP cockpit and provisioning the custom domain service itself. In our case we created the service from the cockpit (SAP calls it Feature Set A). However, the configuration itself still needs to be done from the CLI.
Below I depicted the process in a flow chart similar to SAP’s drawing for the NEO environment. I liked the visualization but couldn’t find the equivalent for CF.
Fig.2 Custom domain process overview
Assigning quotas to your two subaccounts is straightforward. Since I am running SCP and TM on Azure, I also bought the domain through Azure App Service Domains for convenience. Of course, you can do that through any provider. In addition, we decided to register a wildcard domain so we can later add multiple subdomains with the same certificate if needed.
Now, we continue in CF CLI as per SAP’s given instructions:
cf login -a https://api.cf.eu20.hana.ondemand.com
cf services
cf create-domain FTA-dr-primary sapcpi-disaster.com
cf domains
cf custom-domain-create-key euCPIkey "CN=*.sapcpi-disaster.com, O=Contoso, L=Cologne, C=DE" "*.sapcpi-disaster.com"
cf custom-domain-get-csr euCPIkey euCPIkey_unsigned.pem
# sign CSR with your authority of choice and upload from local file
cf custom-domain-upload-certificate-chain euCPIkey 'C:\Users\mapankra\Downloads\euCPIkey-star_sapcpi-disaster_com.pem'
cf custom-domain-activate euCPIkey "*.sapcpi-disaster.com"
cf custom-domain-list
cf custom-domain-map-route dr-primary.<node>-rt.cfapps.eu20.hana.ondemand.com myapp.sapcpi-disaster.com
cf custom-domain-routes
So far so good. Please note that activation can take up to 24 hours.
I repeated the steps above for my second CPI instance in US21 to finish the preparation.
Note that we reuse the same wildcard domain for both instances and then map the custom domain route to “myapp.sapcpi-disaster.com”. This way both endpoints on CF accept requests for “myapp.sapcpi-disaster.com”. This is essential for the failover to work!
cf login -a https://api.cf.us21.hana.ondemand.com
cf services
cf create-domain FTA-dr-secondary sapcpi-disaster.com
cf domains
cf custom-domain-create-key usCPIkey "CN=*.sapcpi-disaster.com, O=Contoso, L=Cologne, C=DE" "*.sapcpi-disaster.com"
cf custom-domain-get-csr usCPIkey usCPIkey_unsigned.pem
# sign CSR with your authority of choice and upload from local file
cf custom-domain-upload-certificate-chain usCPIkey 'C:\Users\mapankra\Downloads\usCPIkey-star_sapcpi-disaster_com.pem'
cf custom-domain-activate usCPIkey "*.sapcpi-disaster.com"
cf custom-domain-list
cf custom-domain-map-route cpi-dr-secondary.<node>-rt.cfapps.us21.hana.ondemand.com myapp.sapcpi-disaster.com
The necessary DNS settings can be made natively in the Azure portal
Pay attention to the TXT and CNAME records. I needed to provide the TXT record once during the CSR process to verify that I own the domain I specified there. This is a typical step with all certificate providers. In our example we used DigiCert.
The CNAME makes sure that the DNS resolution for “myapp.sapcpi-disaster.com” goes through my Azure TM first. On TM I maintain a profile that targets both our CPI runtimes but with different priority.
Fig.3 Custom Domain overview in Azure including DNS settings
Traffic Manager supports various routing methods. Down below is an excerpt of the list from the Azure docs.
- Priority: Select Priority when you want to use a primary service endpoint for all traffic, and provide backups in case the primary or the backup endpoints are unavailable.
- Weighted: Select Weighted when you want to distribute traffic across a set of endpoints, either evenly or according to weights, which you define.
- Performance: Select Performance when you have endpoints in different geographic locations and you want end users to use the “closest” endpoint in terms of the lowest network latency.
- Geographic: Select Geographic so that users are directed to specific endpoints (Azure, External, or Nested) based on which geographic location their DNS query originates from. This empowers Traffic Manager customers to enable scenarios where knowing a user’s geographic region and routing them based on that is important. Examples include complying with data sovereignty mandates, localization of content & user experience and measuring traffic from different regions.
For our failover use-case priority-based routing makes the most sense. However, performance or geographic might also be interesting when you consider external 3rd party SaaS applications like Salesforce or other integration partners sending data to CPI.
Fig.4 Traffic Manager profile
The lowest number reflects the highest priority. This means all traffic currently goes to my CPI instance in EU20, because its Monitor status is “online”.
To determine CPI health we use a dedicated iFlow that only returns its Cloud Foundry region, so we get a nice visual verification of which endpoint was hit. Therefore, we define the target hosts for my CPI runtime instances as follows:
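To make the priority rule concrete, here is a minimal sketch of the selection logic: among the endpoints whose monitor status is online, the one with the lowest priority number wins. The input format ("priority status name") and the function name `pick_endpoint` are made up for illustration and are not part of any Azure tooling.

```shell
# Sketch of priority routing, not Azure's implementation.
# Input: one endpoint per line as "<priority> <monitor-status> <name>" (hypothetical format).
pick_endpoint() {
  awk '$2 == "Online" && (best == "" || $1 < min) { min = $1; best = $3 }
       END { if (best != "") print best }'
}

# Normal operation: both endpoints are online, so priority 1 is answered.
printf '%s\n' '1 Online eu20' '2 Online us21' | pick_endpoint

# Primary degraded: the endpoint with the next priority takes over.
printf '%s\n' '1 Degraded eu20' '2 Online us21' | pick_endpoint
```

The sketch prints nothing when no endpoint is online; how TM itself answers in that situation is out of scope here.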
Fig.5 TM endpoint settings for primary CPI in EU20 and secondary in US21
The corresponding health probe config on TM to my published iFlow path at “http/health” looks like this.
Fig.6 Screenshot of health probe config in Azure TM
As in the first post about Azure Front Door, we expect HTTP 401 as the status code, because we don’t want to send credentials with the probe and we want to avoid creating a lot of probing messages on CPI. In my setup I saw on average 9-10 probing messages per minute to the runtime node in EU20 with “authenticated” execution. This would add up to ~14k messages per day for the health check alone. To avoid that hit on the message block quota we consider the “401-approach“ a good compromise.
If you are not happy with the implications of overlooking error scenarios behind the HTTP 401 Unauthorized status code, you could fill the Custom Header Settings with “Authorization:Basic <auth token>“ and reset the Expected Status Code Ranges to 200 instead.
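The daily volume follows directly from the observed probe rate. A quick back-of-the-envelope calculation, using the 9-10 probes per minute mentioned above (the function name is just for illustration):

```shell
# Messages per day for a given number of probes per minute.
probes_per_day() {
  echo $(( $1 * 60 * 24 ))
}

probes_per_day 9    # lower end of the observed rate
probes_per_day 10   # upper end of the observed rate
```

With 10 probes per minute this comes out at 14400 messages per day, which is where the ~14k figure comes from.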
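For reference, the token in an HTTP Basic authorization header is the base64 encoding of "user:password". The sketch below builds a header string in the "name:value" form; the helper name `basic_auth_header` is hypothetical, and the sample credentials are the well-known example pair from the HTTP Basic authentication RFC, not real CPI credentials.

```shell
# Build "Authorization:Basic <token>" from a user name and password.
# tr removes line wraps that some base64 tools insert for long inputs.
basic_auth_header() {
  printf 'Authorization:Basic %s\n' "$(printf '%s' "$1:$2" | base64 | tr -d '\n')"
}

# Sample credentials for illustration only.
basic_auth_header 'Aladdin' 'open sesame'
```

Keep in mind that the resulting value is only encoded, not encrypted, so treat the probe configuration as a secret.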
The DNS time-to-live setting determines how long it takes until a detected service degradation takes effect on the sending side. Apart from that, DNS caching happens at various layers. So, I highly recommend performing at least one DR drill to learn what timings to expect in your environment beyond the TTL on TM. Of course, best practice would be to perform regular DR drills to verify a working and complete setup in case of a real disaster. Some customers even deliberately fail over to their DR region to keep their CloudOps sharp.
OK, so what do the iFlows that serve as “health check” endpoints look like?
Both iFlows have a simple content modifier that returns the region as text in the body. The iFlow is provided as an artifact in my GitHub repo. There you can also find the CPI project for my first post. It contains some more examples, including a call to the SAP Cloud Connector.
Fig.7 Screenshot from health check iFlow design
Fig.8 Deployed health check iFlows in EU20 and US21
That concludes the setup process. And here is the result from Postman when we call an iFlow using the custom domain:
Fig.9 Postman result for iFlow call with custom domain
Great success! As intended the request is being served from my CPI instance in EU20. Now, let’s crash some CPI tenants to test the failover to US21!
DR drills for the faint-hearted
OK, maybe we are not mighty enough to crash a CPI tenant easily for a DR drill 😉 but what about making the health-check iFlow fail? A simple uncaught syntax error in a Groovy script would throw an HTTP 500. Cool, that will do!
But wait, we are not executing the health check iFlows. We are simply “probing” the iFlow from TM without credentials and working with the HTTP 401 status code. So, for the purpose of this drill let’s change the expected status code to 200 and provide the basic auth token as a custom header. This way the health probe will actually execute the iFlow.
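The difference between the two probe modes boils down to which status code range counts as healthy. A small sketch of that check, with a made-up helper `is_healthy` and ranges written as "low-high":

```shell
# is_healthy <status-code> <expected-range>; the range is written "low-high".
# Illustration only, this is not how TM is configured or implemented.
is_healthy() {
  code=$1
  low=${2%-*}    # part before the dash
  high=${2#*-}   # part after the dash
  [ "$code" -ge "$low" ] && [ "$code" -le "$high" ]
}

# Normal operation: unauthenticated probe, 401 is the expected answer.
is_healthy 401 401-401 && echo "401 with range 401-401: healthy"

# Drill: authenticated probe expects 200, the broken iFlow answers 500.
is_healthy 500 200-200 || echo "500 with range 200-200: degraded"
```

Note that in normal operation a 500 from a broken iFlow never shows up, because the unauthenticated probe is rejected with 401 before the iFlow runs. That is exactly why the drill needs the authenticated variant.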
Fig.10 Config on Azure TM for the DR drill
Next, we deploy the health check iFlow in EU20 with that Groovy script syntax error. And once that is live…
Fig.11 Error in Postman after iFlow update
Boom! Oh boy, now the faulty health probes on TM are piling up until they reach the configured “unhealthy” threshold of 2 consecutive failures. Note our probing interval is 30 seconds but Traffic Manager is probing from multiple Points of Presence (I had ~10 probes per minute). Meaning the endpoint will likely be marked as “degraded” much quicker than 60 seconds. Once that happens and DNS time-to-live has expired…
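As a rough, pessimistic estimate of when a client sees the failover, you can add the detection time of a single probing location (interval times failure threshold) to the DNS time-to-live. The 30 seconds and the threshold of 2 are the values from this setup; the TTL of 60 seconds below is an assumed example value, so plug in your own.

```shell
# Rough upper bound in seconds: <probe-interval> * <failure-threshold> + <dns-ttl>.
# Treats all probes as coming from one location, so real detection is usually faster.
failover_upper_bound() {
  echo $(( $1 * $2 + $3 ))
}

# 30 s interval, 2 consecutive failures, assumed TTL of 60 s.
failover_upper_bound 30 2 60
```

Local DNS caches on the sender can add to this, which is one more reason to measure the real timing in a drill.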
Fig.12 Successful failover test to US21 from Postman
I see my request being served from the CPI runtime instance in US21 😊 awesome, isn’t it? During my DR drill I needed to wait roughly 1 minute and restart Postman to “fight” local DNS caching on my machine. It might behave differently on your S4/ECC machine depending on your environment, operating system and local network settings. Therefore, test it under near-live conditions.
Thoughts on production readiness
- Security: CPI is a service that is publicly reachable. Adding a DNS-based layer on top of that to “re-route” requests neither improves nor decreases security. However, TM brings a set of controls to increase your security baseline from the Azure perspective. A straightforward example would be built-in access logging to notice malicious access attempts to your CPI instances.
- Availability of TM: The SLA states that a DNS query to TM will be answered successfully by at least one name server 99.99% of the time.
There is already an extensive list on the production readiness of this setup in the first post. Therefore, I added only what is specific to the TM approach.
Have a look at the pricing structure of TM here. For an average setup I suspect less than 3€ per month.
That was quite the ride! I explained how to do a failover setup for CPI with two instances in separate SCP subaccounts including the SCP custom domain service deployment. We discussed the health probing configuration and implications of relying on an “unauthenticated” call to the health-check iFlow.
In the end we saw a simulated failover that was induced by a syntax error in a Groovy script of the health check iFlow in my EU20 environment. The health probes from Traffic Manager picked up the disruption in less than a minute and automatically started re-routing the traffic to my CPI instance in US21. Not too bad, right?
Find the GitHub repos here.
#Kudos to my colleague Robert Biro for his helping hand on the custom domain setup and his tireless support on our joint SAP with Azure journey.
As always, feel free to leave comments or ask lots of follow-up questions.