Detection and Handling of Stuck BPMN Request-Confirmation Pattern
In this blog post we take a closer look at a category of problems that occur in the communication between SAP Process Orchestration (including BPM) and back-end services. Based on our experience, we recommend patterns and best practices, and we outline a custom application that can detect and handle process instances affected by the typical symptoms.
What is the Request-Confirmation Pattern?
To retrieve data from a back-end service in a BPM process, one might want to use an automated activity modeled with a synchronous Web service interface (WSDL). However, synchronous Web service calls have the following drawbacks:
- They are expensive because they occupy system resources (threads, transactions, memory, execution progress in BPM) while waiting for the back end to process the request and respond.
- Usually, it is not possible to utilize reliable exactly-once communication protocols such as WS-RM.
Therefore, we recommend implementing communication that requires a response from a back-end service with the help of the so-called request-confirmation pattern. The automated activity ‘Request’ is called asynchronously and returns immediately with an empty response. The process then waits at the intermediate message event (IME) ‘Confirmation’ for a separate response message containing the requested data.
Usually, the response message contains a correlation key (e.g. business key) which is used by the BPM system to determine which process instance is addressed by this particular response message.
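The correlation idea can be pictured with a minimal Java sketch. It is not how the BPM runtime is implemented; the class `CorrelationRegistry` and its methods are hypothetical names introduced only for illustration. The sketch assumes one waiting instance per business key and shows how an incoming response is matched to that instance.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical illustration of key-based correlation; not an SAP BPM API.
final class CorrelationRegistry {
    // business key -> ID of the process instance waiting at the 'Confirmation' IME
    private final Map<String, String> waiting = new ConcurrentHashMap<>();

    // Called when a process instance reaches the IME and starts waiting.
    void awaitConfirmation(String businessKey, String processInstanceId) {
        waiting.put(businessKey, processInstanceId);
    }

    // Called when a response message arrives. Returns the instance to resume,
    // or an empty Optional if no instance is waiting for this key.
    // The entry is removed, so a duplicate response matches nothing.
    Optional<String> correlate(String businessKey) {
        return Optional.ofNullable(waiting.remove(businessKey));
    }
}
```

An instance whose key never shows up in a response simply stays in the registry, which is the "logically stuck" situation discussed in the rest of this post.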
How Can Process Instances Get Stuck In Back-End Communication?
In its simplest variant, the pattern does not handle response messages that never reach the waiting process. This can happen for various reasons:
- Outbound communication issues when performing the automated activity
- Errors during back-end service processing
- Inbound communication issues when receiving the reply message
Issues in the outbound communication can be handled via the technical error handling on the modeling side of the automated activity, which is described in a separate blog post. This article addresses the other cases, in which a process is stuck while awaiting the response at the IME.
In the ideal case, the invoked back-end service can cope with certain error situations and replies to the BPM process with an error message rather than a regular business response. A BPM process developer could then extend the model with a decision gateway that analyzes the content of the received confirmation message.
The back-end service cannot always be adapted so that it reliably sends either a regular confirmation or an error reply with an error code. In addition, depending on the guarantees the messaging infrastructure provides, a message could get stuck or lost between the communication partners. Whenever the back-end response (if any) does not arrive at the BPM process, the process instance is stuck.
A stuck BPM process instance is not easy to spot. In the ‘Manage Processes’ view of SAP NetWeaver Administrator, for example, the process remains in status ‘In Progress’ because there is no real technical issue with the process instance. The instance is stuck logically, not technically, so you have to check the details of each process to find such cases. In systems with a low number of process instances you could check the back end for the individual case. In Process Orchestration landscapes that execute a huge number of processes, each with multiple back-end interactions, this becomes far more labor-intensive.
Automatic Self-Detection of Stuck Process Instances with BPMN Means
To detect such logically stuck process instances automatically, the BPM flow can be extended with a timeout branch. You can implement this with an ‘event based choice’ gateway and a timer event in addition to the ‘Confirmation’ IME.
The semantics of the event based choice gateway is as follows: whatever comes first (the confirmation message of the back end or the elapsed timeout) will continue the flow. In regular cases, the confirmation message arrives before the timeout and continues the flow in a regular manner. In case the confirmation does not arrive in time, the timeout timer event is triggered and continues the exception branch.
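The "whatever comes first" semantics can be illustrated in plain Java. The following is only an analogy using `CompletableFuture` (Java 9 or later), not BPMN or BPM runtime code; the class `EventBasedChoice` and the enum `Branch` are hypothetical names.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

// Analogy for an event based choice: confirmation message vs. timer event.
final class EventBasedChoice {
    enum Branch { CONFIRMED, TIMED_OUT }

    // Completes with CONFIRMED if the confirmation arrives first,
    // otherwise with TIMED_OUT once the timeout has elapsed.
    static CompletableFuture<Branch> awaitFirst(CompletableFuture<String> confirmation,
                                                long timeout, TimeUnit unit) {
        return confirmation
                .thenApply(message -> Branch.CONFIRMED)
                .completeOnTimeout(Branch.TIMED_OUT, timeout, unit);
    }
}
```

A caller would continue the regular flow on `CONFIRMED` and enter the exception branch on `TIMED_OUT`. As in the process model, a confirmation that arrives after the timeout no longer changes the chosen branch.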
The timeout value of the timer event determines how long the process waits before the instance is considered to have exceeded the expected processing time in the back end. Define it according to the expected response time of the back-end service; it may need to be adjusted over time. If the timeout is too low, it might produce false positives; if it is too high, it takes longer until an affected process instance is detected. At the outgoing edge of the timeout timer event you need to model an exception handling procedure.
There is no generic solution for all conceivable failure situations. Consider your business scenario when defining the error handling. Are your processes more ‘human centric’, with a small or medium number of instances per day? In such cases a complex mitigation might not be required. You could simply send an e-mail notification to the process administrator so that he or she can validate the affected process instance and trigger the necessary correction steps.
SAP BPM also supports system centric scenarios. Such processes usually involve limited human interaction and are characterized by automated activities, IMEs and mapping activities. System centric processes are often executed as high volume scenarios, in which single case notifications and individual error handling might not be feasible. The error handling and correction steps need to support a large number of affected process instances.
Mass-Enabled Handling of Stuck Process Instances
In our scenario, the assumed reason for a high number of affected process instances is a back-end operation that fails because of incorrect data originating from the BPM process. This might be caused by an update of master data or general keys that is not reflected in the BPM process. Such cases require selecting process instances based on their context data in order to change that data for multiple processes in the same way. The procedure could look as follows:
Here is one example implementation for a data correction process.
In order to separate the exception handling branch from the remaining process, we put everything related to the ‘data correction’ into an embedded sub process.
The following block diagram shows the interaction of the data correction process getting triggered in case the back-end response does not arrive in time:
Only for processes that reach the timeout, the mapping activity ‘Store Data’ writes the context data (field names and values) from the process context into a custom data store. This could be a simple table with the following columns:
| Primary Key | Process Instance ID | Data Name | Data Value |
As the global process context of the main process is also available in the embedded sub process, the ‘Store Data’ mapping activity can access it directly. The mapping activity calls a simple custom-developed mapping EJB to write the selected context values into the custom data store. Note that the mapping EJB must be implemented idempotently, because for error recovery the BPM runtime might execute the activity several times for the same instance.
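What idempotent means for the write can be shown with a minimal sketch. It uses an in-memory map as a stand-in for the custom table and is not the mapping EJB itself; the class `ContextDataStore` and its methods are hypothetical. The sketch assumes that the combination of process instance ID and data name identifies a row, so a repeated write overwrites the row instead of adding a second one.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory stand-in for the custom data store.
final class ContextDataStore {
    // process instance ID -> (data name -> data value)
    private final Map<String, Map<String, String>> rows = new ConcurrentHashMap<>();

    // Idempotent write: storing the same field of the same instance again
    // replaces the existing value and never creates a duplicate row.
    void store(String processInstanceId, String dataName, String dataValue) {
        rows.computeIfAbsent(processInstanceId, id -> new ConcurrentHashMap<>())
            .put(dataName, dataValue);
    }

    // Number of stored (instance, data name) pairs.
    int rowCount() {
        return rows.values().stream().mapToInt(Map::size).sum();
    }

    // Returns the stored value, or null if nothing was stored for this pair.
    String valueOf(String processInstanceId, String dataName) {
        Map<String, String> data = rows.get(processInstanceId);
        return data == null ? null : data.get(dataName);
    }
}
```

With a real database table, a comparable effect could be achieved with a uniqueness constraint on these two columns combined with an update-or-insert write; that design is an assumption of this sketch, not something prescribed by the scenario.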
The next step in BPM is the human activity ‘Access Data’. The process creates the human task and waits for input, i.e. task completion. The system administrator could be notified by e-mail and can then use an SAP UI5 application to query the data persisted in the custom store for affected instances. Such an application can be built with regular Java means to query for the affected process instances based on the stored data values. The following screenshot shows such an application with example data: the product ‘Schwarzbrot’ was specified in German, but the back end only knows the English version. The proposal is to change the affected data to the English word ‘bread’ for all detected instances with product ‘Schwarzbrot’.
Based on these investigations, the administrator can decide to change the data so that the back-end service can process it correctly. Before changing the data of all affected process instances, the administrator can verify the fix on a single instance by entering the corrected data in the task UI and completing that task manually.
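The query for affected instances boils down to a filter on data name and data value. The following sketch shows this selection on an in-memory list; the class `AffectedInstanceQuery`, its `Row` type and the field name `product` are hypothetical, and a real application would run an equivalent query against the custom table.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical selection of affected instances by stored context value.
final class AffectedInstanceQuery {
    // One row of the custom data store (surrogate primary key omitted).
    static final class Row {
        final String processInstanceId;
        final String dataName;
        final String dataValue;

        Row(String processInstanceId, String dataName, String dataValue) {
            this.processInstanceId = processInstanceId;
            this.dataName = dataName;
            this.dataValue = dataValue;
        }
    }

    // Returns the sorted IDs of all instances that stored the given value
    // under the given data name.
    static Set<String> findInstances(List<Row> rows, String dataName, String dataValue) {
        Set<String> ids = new TreeSet<>();
        for (Row row : rows) {
            if (row.dataName.equals(dataName) && row.dataValue.equals(dataValue)) {
                ids.add(row.processInstanceId);
            }
        }
        return ids;
    }
}
```

Calling `findInstances(rows, "product", "Schwarzbrot")` would return exactly the instances that need the corrected value ‘bread’.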
If a large number of process instances with the same problem pattern need to be corrected, the custom SAP UI5 application can loop over all relevant tasks via the BPM OData Task Services and complete them with the new data value(s). In both cases, the output mapping of the human activity updates the BPM process context with the corrected value(s).
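The mass completion loop might be structured as in the following sketch. The `TaskClient` interface is a hypothetical abstraction and deliberately does not reproduce the real BPM OData Task Services; an actual implementation would issue the corresponding service calls behind it. The sketch only shows one possible design choice: a failure for a single task does not abort the loop, and the failed task IDs are collected for a later retry or manual inspection.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical mass correction loop over the tasks of affected instances.
final class MassCorrection {
    // Assumed abstraction over the task service; NOT the real OData API.
    interface TaskClient {
        void complete(String taskId, Map<String, String> correctedData) throws Exception;
    }

    // Completes every task with the corrected data and returns the IDs of
    // the tasks that could not be completed.
    static List<String> completeAll(TaskClient client, List<String> taskIds,
                                    Map<String, String> correctedData) {
        List<String> failed = new ArrayList<>();
        for (String taskId : taskIds) {
            try {
                client.complete(taskId, correctedData);
            } catch (Exception e) {
                failed.add(taskId); // keep going; report the failure afterwards
            }
        }
        return failed;
    }
}
```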
How Should the Treated Process Continue?
After the data correction step is executed, there are various options to continue in the BPM process:
- You might want to send the request with the corrected data to the back end again.
- You might want to start a compensation process from the beginning and terminate the existing process.
- In case you didn’t perform any data modification in BPM, but the issue got solved on the back-end side, you might want to wait for the response message again.
Modeled in BPMN, this could look like the picture below:
An exclusive choice gateway is inserted after the ‘data correction’ sub flow to determine which path to take. Not every option is required in every situation.
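The conditions on such a gateway could be expressed as in the sketch below. The two boolean inputs are assumptions made for illustration (for example, values set by the output mapping of the correction task); the class `ContinuationChoice` and the enum `Path` are hypothetical names, and the order of the checks is only one possible choice.

```java
// Hypothetical gateway conditions for continuing after the data correction.
final class ContinuationChoice {
    enum Path { RESEND_REQUEST, COMPENSATE_AND_TERMINATE, WAIT_FOR_RESPONSE_AGAIN }

    // Exactly one path is taken, as with an exclusive choice gateway.
    static Path choose(boolean dataCorrectedInBpm, boolean restartRequested) {
        if (restartRequested) {
            return Path.COMPENSATE_AND_TERMINATE; // start over, end this instance
        }
        if (dataCorrectedInBpm) {
            return Path.RESEND_REQUEST;           // send the corrected data again
        }
        return Path.WAIT_FOR_RESPONSE_AGAIN;      // issue was fixed in the back end
    }
}
```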
This blog post provided some ideas on how to detect and handle processes that are stuck due to a communication or processing error in a request-confirmation pattern. The ideas rely on BPMN means plus a few lines of Java code (such as a mapping EJB or a simple UI that visualizes the content of a custom database table). The example can be adapted and extended to meet scenario-specific requirements.