Black Friday will take your CPI instance offline unless…
This is part 3 of a series discussing distributed scenarios for SAP Cloud Integration (CPI).
Part 1: How to crash your iflows and watch them failover beautifully (DR for CPI with Azure FrontDoor)
Part 2: Second round of crashing iFlows in CPI and failing over with Azure – even simpler (DR for CPI with Azure Traffic Manager)
Part 3: Black Friday will take your CPI instance offline unless… (HA for CPI)
Processing millions of messages per day with iFlows can already be a challenge. Getting hit by that order of magnitude within a short time frame (3..2..1..aaand 8.00am Black Friday discounts are LIVE!!!) on a default CPI setup will likely be devastating, even though there is no rubble or actual smoke involved. Well, maybe just some coming out of the ears of those poor admins fighting the masses to get the integration layer functional again 😉
CPI is part of Integration Suite and nowadays called “SAP Cloud Integration”. Nevertheless, the term CPI is widespread and a much more distinctive acronym than CI. Therefore, I keep using it here.
CPI can scale by adding worker nodes to your runtime. This is usually done via a support ticket (autoscaling within one deployment is on the roadmap). However, the number of nodes that can be added to your tenant is capped. What if that is not enough?
Today we will investigate orchestrating multiple CPI tenants on Azure to scale beyond the worker node limit and to put you in control of adding instances on the fly as you see fit. We will shed light on the challenges of asynchronous messaging and exactly-once processing, synching artifacts across tenants, and sharing config to avoid redundancy.
I am building on my last posts, which covered redundancy and failover strategies for SAP CPI in the CloudFoundry (CF) environment using Azure FrontDoor and Traffic Manager. Also, I’d like to reference Santhosh’s post with his take on running multiple CPI instances.
Fig.1 Collection of consumers threatening your integration layer with a stampede of purchases
Let’s get you out of that mess.
The moving parts to make it happen
As mentioned in the intro, there are several angles from which to look at running multiple CPI instances in parallel for scalability and high availability. In my other posts we assumed that only one CPI instance is active and serves as primary. The secondary only becomes active if the primary is no longer reachable. The traffic gets re-routed and always flows through one instance only. That approach is often referred to as an active-passive setup.
That is fundamentally different from the active-active setup we are discussing in this post today. We need to solve the following challenges to succeed.
Present messages to CPI tenant pool in a way to process exactly once
With multiple flows we either need to distribute messages reliably (and potentially restart on failure) or equip the receiver to deal with multiple identical messages coming in. Many SAP targets cannot deal with redundant messages. Therefore, we should tackle this at the CPI tenant level.
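To make the "receiver deals with identical messages" option concrete, here is a minimal sketch of an idempotent receiver. It is not taken from CPI or from any SAP API: the class name `IdempotentReceiver`, the `handler` callback, and the `max_ids` limit are all hypothetical names chosen for illustration. It also assumes that every message carries a stable ID.

```python
from collections import OrderedDict


class IdempotentReceiver:
    """Remembers recently seen message IDs and skips repeats (illustrative only)."""

    def __init__(self, handler, max_ids=10_000):
        self._handler = handler      # hypothetical callback doing the real work
        self._seen = OrderedDict()   # message IDs in arrival order
        self._max_ids = max_ids      # bound the memory used for remembered IDs

    def on_message(self, message_id, payload):
        """Return True if the payload was processed, False if it was a duplicate."""
        if message_id in self._seen:
            return False
        self._handler(payload)
        self._seen[message_id] = True
        if len(self._seen) > self._max_ids:
            self._seen.popitem(last=False)  # evict the oldest remembered ID
        return True
```

Keep in mind that this remembers IDs in the memory of one process only. With several CPI tenants in play, the list of seen IDs would have to live in a store that all of them share, which is one reason to let a message bus hand out each message once instead.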
We distinguish between three types of triggers for our iFlows.
- Asynchronous messages with receive acknowledgement only
- Synchronous messages with a blocked thread until fully processed by the target receiver
- Timer-based flows.
Fig.2 CPI asynch processing with Azure Service Bus
To achieve resilient asynchronous messaging, we need a message bus to supply the messages (in order if required!) and keep them until acknowledged. The bus ensures that only one CPI instance processes the message in question. There are various options on the market like RabbitMQ or the BTP-native solution SAP Event Mesh to implement such an approach. In our case all components are deployed on Azure, and we are looking for a managed solution. Therefore, it makes sense to use Azure Service Bus. It has built-in request throttling, so we can react to unexpectedly steep request increases and protect downstream services, for instance until we can bring another CPI tenant online.
CPI offers an AMQP sender adapter. With it we can meet the requirements described above regarding speed, throughput, scalability, and reliable messaging when pulling from the service bus.
Find more info on how to configure your service bus with HA features and how to brace against outages in the Azure docs.
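The "keep them until acknowledged" behaviour can be pictured with a toy in-memory queue. This is a sketch of the general lock-then-acknowledge pattern only, not of the Azure Service Bus SDK or the CPI AMQP adapter. `PeekLockQueue`, `Message`, and all method names are made up for this example, and time is passed in explicitly to keep it deterministic.

```python
import uuid
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Message:
    body: str
    message_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    locked_until: float = 0.0  # point in time until which a consumer holds the lock
    delivery_count: int = 0    # how often the message has been handed out


class PeekLockQueue:
    """Toy queue: a message stays stored until a consumer completes it."""

    def __init__(self, lock_seconds: float = 30.0):
        self.lock_seconds = lock_seconds
        self._messages: List[Message] = []

    def send(self, body: str) -> str:
        msg = Message(body)
        self._messages.append(msg)
        return msg.message_id

    def receive(self, now: float) -> Optional[Message]:
        """Hand out the oldest message that is not locked at time `now`."""
        for msg in self._messages:
            if msg.locked_until <= now:
                msg.locked_until = now + self.lock_seconds
                msg.delivery_count += 1
                return msg
        return None

    def complete(self, message_id: str) -> None:
        """Acknowledge: remove the message for good."""
        self._messages = [m for m in self._messages if m.message_id != message_id]

    def abandon(self, message_id: str) -> None:
        """Release the lock so another consumer can pick the message up."""
        for msg in self._messages:
            if msg.message_id == message_id:
                msg.locked_until = 0.0


queue = PeekLockQueue(lock_seconds=30)
queue.send("order-4711")
first = queue.receive(now=0)       # tenant A locks the message
second = queue.receive(now=1)      # tenant B gets nothing while the lock is held
queue.abandon(first.message_id)    # tenant A fails and releases the lock
retry = queue.receive(now=2)       # tenant B now receives the same message
queue.complete(retry.message_id)   # acknowledged, nothing left to deliver
```

Note that a lock can expire while a slow consumer is still working, so the same message may be delivered again. The lock gives you "one consumer at a time", and it is the acknowledgement plus idempotent handling downstream that gets you to exactly-once effects.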
Fig.3 CPI synch processing with Traffic Manager
With synchronous messages life becomes a looooot easier. We can directly hit the endpoint. For effective routing and high availability, we need to distribute messages cleverly though. In our case we applied Azure Traffic Manager, which offers priority, weighted (round-robin when all weights are equal), performance, and geographic routing methods, among others. If all our CPI instances run in the same BTP region it makes most sense to apply performance-based routing to always choose the quickest route and cater for bottlenecks on the individual instance.
To apply request throttling as we did for the asynch calls, we need to introduce an API Management component. You can choose from SAP-native options in SAP’s Integration Suite or Azure API Management, for instance. This step is optional but increases the robustness of your setup by giving downstream services time to recover or giving you time to onboard an additional CPI tenant.
Because Traffic Manager acts on the DNS level, we need a custom domain setup on CF. Alternatively, you can apply Azure FrontDoor to avoid that requirement. To learn more about the various load balancing components, have a look at this Azure docs entry. I personally also found this blog useful.
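To get a feeling for how the routing methods differ, the selection logic can be sketched in a few lines. This models the decision only and is not how Traffic Manager is implemented: the real service answers DNS queries and probes endpoint health itself. The `Endpoint` fields and the `pick_*` function names are assumptions for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class Endpoint:
    name: str
    priority: int      # lower number wins in priority routing
    weight: int        # share of traffic in weighted routing
    latency_ms: float  # assumed latency measurement for performance routing
    healthy: bool = True


def _healthy(endpoints):
    candidates = [e for e in endpoints if e.healthy]
    if not candidates:
        raise RuntimeError("no healthy endpoint available")
    return candidates


def pick_priority(endpoints):
    """Always use the healthy endpoint with the best (lowest) priority."""
    return min(_healthy(endpoints), key=lambda e: e.priority)


def pick_weighted(endpoints, rng=random):
    """Spread traffic by weight; equal weights behave like round-robin on average."""
    candidates = _healthy(endpoints)
    return rng.choices(candidates, weights=[e.weight for e in candidates], k=1)[0]


def pick_performance(endpoints):
    """Use the healthy endpoint with the lowest latency."""
    return min(_healthy(endpoints), key=lambda e: e.latency_ms)
```

Priority routing is what the active-passive posts relied on. For active-active you would look at the weighted or performance variants, so that all healthy tenants actually receive traffic.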
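Request throttling is commonly explained with a token bucket, so here is a minimal sketch of that idea. It is a generic illustration, not the algorithm that SAP API Management or Azure API Management documents for its policies. The class name `TokenBucket` and its parameters are hypothetical, and time is passed in explicitly to keep the example deterministic.

```python
class TokenBucket:
    """Allows short bursts up to `capacity` and a steady `rate_per_second` after that."""

    def __init__(self, rate_per_second, capacity):
        self.rate = rate_per_second
        self.capacity = capacity
        self.tokens = float(capacity)  # start with a full bucket
        self.last = 0.0                # time of the previous call

    def allow(self, now):
        """Return True if a request arriving at time `now` (seconds) may pass."""
        elapsed = max(0.0, now - self.last)
        # Refill according to the elapsed time, but never above capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


bucket = TokenBucket(rate_per_second=1, capacity=2)
burst = [bucket.allow(0.0) for _ in range(3)]  # two pass, the third is rejected
later = bucket.allow(1.0)                      # one token refilled after a second
```

A rejected call would typically be answered with HTTP 429 so that the caller retries later, which buys the downstream services the breathing room described above.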
About time we look at the last trigger type. Timely, isn’t it? All right, all right, I’ll stop in time 😉
Fig.4 CPI timer-based processing with LogicApps
For this trigger type we can apply the same mechanism as for synchronous messages to distribute tasks. In addition to that we need a process that can be scheduled. In the Azure ecosystem LogicApps are a good fit for the task. On the BTP side you can have a look at the BTP Automation Pilot for an SAP-native approach to kick off the trigger via Azure Traffic Manager.
Ok, so now we established how to call the multi-instance integration flows. How about their shared artifacts like credentials for downstream services? And finally: how do you keep the iFlows in synch across the multiple CPI tenants?
Synch artifacts across CPI tenant pool
In many places iFlows and associated artifacts are moved manually by export/import from the Dev to Prod instance. That bears tremendous risk when you run multiple CPI tenants in production. We need to apply an automated approach to overcome this. SAP offers built-in transporting with the BTP Transport Management Service (TMS). There is also CTS+ and MTAR download. Check the note 265197 for more details.
The high availability setup in its simplest form would have one CPI dev tenant and at least two productive instances, which receive updates only via transports to ensure a consistent state.
Having more than one dev instance would create a “split brain” problem, because you cannot merge via transports and might mess up version numbers etc. Furthermore, dev tenants usually don’t require the same magnitude of processing capability that made us apply this high availability concept to our prod instances in the first place.
Multiple dev instances in different BTP Azure regions for redundancy and disaster recovery purposes are a different story though. Check out my earlier posts for that.
Fig.5 Synch artifacts cross CPI tenants with TMS
Considering high availability for the transport component, you might favour a true DevOps approach and global services like GitHub Actions over a single instance of TMS on BTP. GitHub Actions, Azure DevOps Pipelines, or Jenkins, however, need to integrate with the CPI public APIs and cannot be triggered from the CPI Admin UI. Have a look at one of my older posts on DevOps practices with Azure Pipelines for CPI for reference.
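Whichever transport route you pick, a pipeline step can verify that every productive tenant ended up on the same artifact versions. The sketch below assumes you have already collected the deployed versions per tenant, for example via the CPI public APIs. How you collect them is out of scope here, and the function name `find_drift` as well as the dictionary shapes are assumptions for illustration.

```python
def find_drift(reference, tenants):
    """Compare deployed artifact versions against a reference.

    reference: {artifact_id: version} as released from the dev tenant
    tenants:   {tenant_name: {artifact_id: version}} as found on each prod tenant
    Returns    {tenant_name: {artifact_id: (expected, actual)}} for mismatches only.
    """
    drift = {}
    for tenant, deployed in tenants.items():
        issues = {}
        for artifact_id, expected in reference.items():
            actual = deployed.get(artifact_id)  # None means "not deployed at all"
            if actual != expected:
                issues[artifact_id] = (expected, actual)
        if issues:
            drift[tenant] = issues
    return drift


reference = {"OrderFlow": "1.0.4", "PriceFlow": "2.1.0"}
tenants = {
    "prod-1": {"OrderFlow": "1.0.4", "PriceFlow": "2.1.0"},
    "prod-2": {"OrderFlow": "1.0.3"},
}
report = find_drift(reference, tenants)
```

An empty report means the tenant pool is consistent. Anything else is a reason to fail the pipeline before traffic is spread across tenants that behave differently.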
For the credential “sharing” across tenants we apply the same notion of abstracting the component as before to distribute the content.
Fig.6 Event-driven CPI credential sharing via KeyVault
For a proper setup you want to maintain and update the credentials for CPI in a central, resilient, and highly available place. Azure KeyVault is a great fit for that task, as it is replicated across Azure regions by design and integrates natively with Azure EventGrid. That allows you to push out credential changes automatically. Finally, you integrate with the CPI Security Content API. A straightforward option is to use LogicApps to perform the HTTP calls. Find the template in the GitHub repos.
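The fan-out itself is simple, and the part worth getting right is what happens when one tenant is unreachable. The sketch below is not the LogicApp from the repo. It deliberately leaves the actual HTTP call to a `send` callable that you supply, because the request shape depends on the CPI Security Content API and your authentication setup. `push_secret` and `send` are hypothetical names.

```python
def push_secret(tenants, name, value, send):
    """Call send(tenant, name, value) for every tenant and report per-tenant results.

    `send` is a caller-supplied function (an assumption of this sketch) that
    performs the real update against one tenant and raises on failure.
    """
    results = {}
    for tenant in tenants:
        try:
            send(tenant, name, value)
            results[tenant] = "ok"
        except Exception as exc:
            # Keep going: one unreachable tenant must not block the others.
            results[tenant] = f"failed: {exc}"
    return results


def failed_tenants(results):
    """Tenants that still need a retry after a push."""
    return [tenant for tenant, status in results.items() if status != "ok"]
```

Tenants listed by `failed_tenants` keep serving the old secret until the retry succeeds, so it pays off to alert on a non-empty list rather than to log it silently.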
Fig.7 Screenshot of LogicApp pushing the new shared secret to registered CPI tenants (config scrapped for a “smaller” picture)
So, now every time you create a new version of your sharable secret (let’s say: “sap with azure rocks”) it gets pushed out to all configured CPI tenants. I provided an iFlow including a groovy script to read the secret “CPI-AUTH-SHARED”.
Et voilà, we get the confirmation that the secret has reached its destination.
Fig.8 Postman call verifying “pushed” secret on CPI tenants
For a BTP-native option you can have a look at the Credential Store Service. Again, the BTP service would be a single point of failure that requires explicit redundancy, and it can only serve pull-based scenarios.
Anyhow, formalizing and automating the propagation of CPI artifacts is key to achieve reliable and consistent updates.
Streamline monitoring in one central place
From time to time CPI messages fail. How do we monitor and troubleshoot this effectively across our multitude of CPI prod tenants? To this day, CPI monitoring only works on a tenant-by-tenant basis. So, we need to extract the logs and consolidate them in one central place ourselves. There are some popular solutions out there like Dynatrace or Splunk. In our case I created an Azure Workbook to apply Azure Monitor to the problem. All my components in this example run on Azure after all 😉
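The consolidation step boils down to tagging each log entry with its tenant and merging everything into one timeline. The sketch below shows that idea on plain dictionaries. It is not the query behind the Azure Workbook, and the field names `timestamp`, `status`, and `iflow` are placeholders rather than the names the CPI log API returns.

```python
def failed_messages(logs_by_tenant):
    """Merge per-tenant log entries into one list of failures, oldest first.

    logs_by_tenant: {tenant_name: [ {"timestamp": ..., "status": ..., "iflow": ...}, ... ]}
    Timestamps are assumed to be ISO 8601 strings in one common format and time
    zone, so that plain string sorting matches chronological order.
    """
    rows = []
    for tenant, entries in logs_by_tenant.items():
        for entry in entries:
            if entry["status"] == "FAILED":
                rows.append({"tenant": tenant, **entry})
    return sorted(rows, key=lambda row: row["timestamp"])


logs = {
    "prod-1": [
        {"timestamp": "2021-11-26T08:00:05Z", "status": "FAILED", "iflow": "OrderFlow"},
        {"timestamp": "2021-11-26T08:00:06Z", "status": "COMPLETED", "iflow": "OrderFlow"},
    ],
    "prod-2": [
        {"timestamp": "2021-11-26T08:00:01Z", "status": "FAILED", "iflow": "PriceFlow"},
    ],
}
timeline = failed_messages(logs)
```

With the tenant name on every row you can tell at a glance whether a failure is specific to one tenant or hits the whole pool, which usually points to the downstream system instead.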
Fig.9 Govern CPI logs in a central place
Find a detailed description of the workbook in my earlier post.
Fig.10 Azure Workbook with consolidated monitoring capability for multiple CPI tenants
Thoughts on production readiness
There is already an extensive list on the production readiness of this setup in the first post. In addition, the managed Azure services mentioned here have geo-redundancy built in. They are used to power services like Azure Portal, Dynamics365, Bing Search, XBox and the like.
Customers are already running multiple CPI instances for scalability reasons today. The provided guidance simply gives a structure to the various aspects of distributed processing with Integration Suite (specifically SAP Cloud Integration) discussed within the community.
That was quite the ride! This post concludes the trilogy on running multiple CPI instances. In the first two posts we covered disaster recovery (DR) across BTP and Azure regions and applied different load balancing mechanisms (FrontDoor and Traffic Manager). Today’s write-up explained the difference between DR and high availability concepts for CPI and shed light on the synchronization, redundancy and race-condition problems that come with high availability.
The key to solving all these challenges is decoupling the components, because CPI itself has no built-in concept for coordinating multiple tenants. We saw SAP BTP native options to approach this as well as Microsoft managed ones on Azure.
Find the mentioned artifacts on my GitHub repos.
You are now fully equipped to survive the stampede of discount-hungry consumers hitting your integration infrastructure 😊 Do you agree?
#Kudos to @Max Streifeneder for all the interesting conversations around this.
As always, feel free to leave comments or ask lots of follow-up questions.
Martin Pankraz nice blog 🙂
There is also a nice Discovery Center Mission available for the synchronous flow
see Route traffic between SAP Cloud Integration tenants
Yes, indeed. When you check the associated SAP GitHub repos of that mission, you will see that it is based on my first two posts in this series. It is great to see that community content gets adopted in official channels.
Have you been using the approach? Any more insights to share?