What causes the following error to appear in the log files, and how can you handle it?
com.sap.glx.core.cluster.ClusterEntryUnavailableException: Storage group XXXXXXXXXXXXX could not be evicted (2499 ms timeout elapsed)
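When scanning a large trace file, it can help to pull the storage group ID and the timeout out of each such message. The following is a minimal sketch, not an SAP tool: the class and field names are hypothetical, and the pattern assumes the message text looks exactly like the line above, with the group ID as a single token without whitespace.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: extracts group ID and timeout from a ClusterEntryUnavailableException line. */
final class EvictionErrorParser {

    // Assumption: the message format matches the log line quoted in this blog.
    private static final Pattern MESSAGE = Pattern.compile(
        "ClusterEntryUnavailableException: Storage group (\\S+) "
        + "could not be evicted \\((\\d+) ms timeout elapsed\\)");

    /** Parsed result: which group could not be evicted, and after how many milliseconds. */
    static final class EvictionError {
        final String storageGroup;
        final long timeoutMillis;

        EvictionError(String storageGroup, long timeoutMillis) {
            this.storageGroup = storageGroup;
            this.timeoutMillis = timeoutMillis;
        }
    }

    /** Returns the parsed error, or an empty Optional if the line is a different message. */
    static Optional<EvictionError> parse(String logLine) {
        Matcher m = MESSAGE.matcher(logLine);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(new EvictionError(m.group(1), Long.parseLong(m.group(2))));
    }
}
```

Collecting the distinct group IDs this way shows quickly whether one stuck instance or many different ones are behind the errors.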
WHAT IS A GALAXY OBJECT?
When a new process instance is created, a bunch of objects (called Galaxy Objects) that hold the information for this process instance are created inside the Galaxy kernel at the same time. The whole bunch is called a Group.
When this group of objects is no longer needed by the kernel (typically because the process has reached a point where it waits for user input or some other asynchronous event before it can continue), the group is stored in the database in order to free memory. This action is called “eviction”.
Eviction applies to the so-called GalaxyObjects, which represent the current execution state of ‘In Progress’ BPM process instances. In the kernel, process execution only takes place actively in main memory when a “next process step” needs to be calculated and executed. This is triggered, for example, when an end user has completed a task, a correlation message has arrived, a timer event has expired, or a predecessor process step such as an automated activity has completed. Once those calculations are completed (even though the process is still shown as ‘In Progress’), the technical state is ‘inactive’, and the GalaxyObjects are stored in the database and evicted from main memory. That is why a customer can have 1,000,000 ‘In Progress’ instances without continuously consuming memory, as long as those instances do not perform any further transitions in BPM process execution.
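The load, execute and evict cycle described above can be illustrated with a toy model. This is only a sketch under simplifying assumptions: the class `GroupStoreSketch` and its methods are hypothetical, a group's state is reduced to a string, and the database is represented by an in-memory map. It is not how the Galaxy kernel is implemented.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of one kernel node: a group is in memory only while it is being worked on. */
final class GroupStoreSketch {
    // Groups currently loaded on this node: group ID -> execution state.
    private final Map<String, String> memory = new HashMap<>();
    // Stand-in for the database that holds evicted groups.
    private final Map<String, String> database = new HashMap<>();

    /** Calculate the next step of a process instance; the group is held in memory for this. */
    void executeStep(String groupId, String newState) {
        memory.put(groupId, newState);
    }

    /** Persist the group's state and release it from memory once the step is done. */
    void evict(String groupId) {
        String state = memory.remove(groupId);
        if (state != null) {
            database.put(groupId, state);
        }
    }

    int loadedGroups() {
        return memory.size();
    }

    String persistedState(String groupId) {
        return database.get(groupId);
    }
}
```

If every instance is evicted right after its step, `loadedGroups()` stays at zero no matter how many instances are waiting, which mirrors the point about many ‘In Progress’ instances not occupying memory.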
WHEN DOES THE ClusterEntryUnavailableException OCCUR?
On the other hand, if an end user wants to perform a monitoring or administrative activity (e.g. in NWA) on a process or task, or requests particular portions of the process or task, the cluster node that receives the user/web request needs to get hold of the GalaxyObjects related to the requested process instance. If it does not hold them already, it asks the cluster node that currently owns them to persist the current state to the database and to evict the affected GalaxyObjects. If this fails after a timeout period, the ClusterEntryUnavailableException is thrown, which the customer can observe in some cases.
When a process is unable to continue, the group of objects for this instance stays inside the kernel “for ever”. “For ever” is quoted because the group stays there only until someone investigates the problem, fixes it and forces the process instance to continue its execution (that is, to retry the step that failed).
WHY DOES THE ClusterEntryUnavailableException OCCUR?
In the following situations the BPM Process Instance cannot be evicted (in time):
- There is a long-lasting or even endless ongoing execution of a BPM process instance (as mentioned above)
- The requested BPM process instance is in ‘failed state’ due to an unexpected Java exception
This is actually the root cause of the error mentioned above, although the error is reported from a different place.
When a process instance that is running inside the kernel (on one node of the cluster) fails, it stays on that cluster node.
We usually consider both cases irregular, because:
- Endless process execution (exceeding the timeout) can be modeled by the customer in BPM (e.g. an endless loop with gateways, or a mapping function with a deadlock, …), but should be rare in reality
- ‘Failed state’ is a different status from the business error ‘Error&Suspended’. We only expect ‘failed state’ to be caused by unexpected implementation issues in Java, such as a NullPointerException. In such a situation the customer should check the default trace and, if no custom code is involved (e.g. a custom mapping function), contact SAP
UNDERSTANDING GALAXY NODE ARCHITECTURE
The cluster nodes of the Galaxy kernel communicate with each other about the groups they are currently using (that is, the groups loaded in their memory). If a group is needed by another cluster node, it is unloaded (evicted) from the node where it is currently loaded and then loaded on the requesting node. Of course, there is a reason why the group is already loaded: probably some work needs to be done on it, and when this work is complete the cluster node evicts the group to free its memory. Usually 2.5 seconds are enough to complete any work that can be done on a group, except when the process has failed (then the group stays “for ever”). In that case the 2.5-second time frame for group eviction is not enough, and the exception mentioned above is thrown on the node that requested the eviction.
See the schema:
There is one dispatcher in front of several cluster nodes. Each node has a Web Container that keeps the HTTP sessions of the users, and an instance of the Galaxy kernel. The kernel instance contains all the information required (all the structures that represent process definitions, called trigger networks) to know how to process any group (process instance) that it can load.
Green shows how the cluster nodes communicate with each other. Red shows the process instance that has failed and therefore stays loaded in the memory of its cluster node.
Blue shows the current customer HTTP session. It is on a different cluster node, so kernel node 2 asks kernel node 1 to evict the group and, once that is done, loads the group into its own memory. But because the instance has failed and cannot be evicted, the request waits 2.5 seconds and then logs the error.
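The waiting behaviour of the requesting node can be sketched as a bounded wait on a signal from the owning node. This is an illustration only, assuming the owner's "group evicted" signal can be modeled as a `CompletableFuture`. The class `EvictionRequestSketch` and the exception `GroupUnavailableException` are hypothetical stand-ins, not the SAP classes.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class EvictionRequestSketch {

    /** Hypothetical stand-in for the real exception; unchecked to keep the sketch short. */
    static final class GroupUnavailableException extends RuntimeException {
        GroupUnavailableException(String message) {
            super(message);
        }
    }

    /**
     * Waits until the owning node reports that it has persisted and evicted the group.
     * For a failed process instance the future never completes, so the wait times out.
     */
    static void awaitEviction(String groupId, CompletableFuture<Void> ownerEvicted, long timeoutMillis) {
        try {
            ownerEvicted.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            throw new GroupUnavailableException("Storage group " + groupId
                + " could not be evicted (" + timeoutMillis + " ms timeout elapsed)");
        } catch (ExecutionException e) {
            throw new GroupUnavailableException("Eviction of storage group " + groupId
                + " failed: " + e.getCause());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            throw new GroupUnavailableException("Interrupted while waiting for storage group " + groupId);
        }
    }
}
```

Note that in this model the error surfaces on the requesting node (kernel node 2), while the cause lies with the group that stays loaded on the owning node (kernel node 1).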
HOW TO AVOID IT?
In the past, BPM developers spent a lot of effort on avoiding such ClusterEntryUnavailableExceptions, and as of 720 and 730 SP 8, 731 SP04 and higher, the architecture was redesigned. The basic principle behind the redesign is to use queuing techniques to deliver a SOAP message to the cluster node where the serving process resides, instead of doing it the other way around (which was the behavior before the redesign). To be precise, it is not the actual message that gets queued (the SDO payload is stored in the database anyway) but a kernel script that is then executed by the remote kernel. Executing this script produces the kernel events needed to process the original message as if it had been dispatched to the right node in the first place.
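The queuing idea can be sketched as follows: the node that receives a message does not pull the group over, but places a small script in the queue of the node that owns the group, and the owner runs it locally. This is a simplified model, not the actual kernel design: the class `OwnerQueueSketch`, the `Node` type and the use of `Runnable` as the "kernel script" are all assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

final class OwnerQueueSketch {

    /** Hypothetical model of one cluster node with a queue of pending scripts. */
    static final class Node {
        final String name;
        final Queue<Runnable> scripts = new ArrayDeque<>();
        final List<String> processedEvents = new ArrayList<>();

        Node(String name) {
            this.name = name;
        }

        /** Run all queued scripts locally, on the node that owns the groups. */
        void drain() {
            Runnable script;
            while ((script = scripts.poll()) != null) {
                script.run();
            }
        }
    }

    // Which node currently has each group loaded.
    private final Map<String, Node> ownerByGroup = new HashMap<>();

    void assign(String groupId, Node owner) {
        ownerByGroup.put(groupId, owner);
    }

    /** Called on whichever node received the message: queue a script at the owner instead of evicting. */
    void deliver(String groupId, String messageId) {
        Node owner = ownerByGroup.get(groupId);
        if (owner == null) {
            throw new IllegalArgumentException("No owner known for group " + groupId);
        }
        // Only a reference to the message is queued; the payload is assumed to be stored elsewhere.
        owner.scripts.add(() -> owner.processedEvents.add(groupId + ":" + messageId));
    }
}
```

Because the group never has to leave its owning node in this model, no eviction request is sent and no eviction timeout can occur for the delivery.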
This is achieved by:
- avoiding synchronous operation on process and task instances with WS message and user involvement by using AsyncActions
  - 1633278 – NetWeaver BPM: Asynchronous Processing of Task Actions
  - 1711053 – NetWeaver BPM: Monitoring of Asynchronous BPM Actions
  - 1706510 – NetWeaver BPM: Asynchronous Processing of BPM Actions
- keeping process instances more cluster-node locally by using AsyncActions
  - 2065370 – BPM process instance seem stuck
  - 1918531 – Increasing default of execution attempts for BPM actions
- avoiding failed state situation by substituting it in Automatic Activities with Technical Error Handling
  - 1853511 – Modelling technical errors in BPM
OK FINE, BUT I HAVE THE ISSUE NOW. HOW CAN I HANDLE IT?
If you have tried all the techniques mentioned in the section above and the issue remains, it is quite possibly caused by a non-BPM issue that you need to analyse more deeply. In such a case, follow these steps:
1. Go to Manage Processes and show all “Failed process instances”.
2. Open the error log tab to find out what caused each of the instances to fail.
3. Fix the technical issue (or configuration issue). Usually the process fails because of missing connectivity to external systems; in this case, ensure that the connectivity is available.
4. Try to “Resume” the process instance (depending on the business need, you may need to “Suspend” or “Cancel” the instances instead).
5. If step 4 does not succeed (the same error is logged), you may need to apply KBA 1894823. This means that the administrator's HTTP session is on a cluster node different from the one where the group is currently loaded.
Hope you find this blog useful.