SAP BI Platform - Monitoring Watch stuck on Danger state even after metrics all return to OK state

boman_hwang
Participant

Hello everyone,

We are running a clustered SAP BI Platform environment on version 4.2 SP 07 Patch 1. We have set up monitoring and watches with automated email alerts on every server, and this has worked fine so far.

A couple of days ago we had an issue on one of the cluster nodes, and the watch for the Central Management Service (CMS) went to the "Danger" alert level because one of the metrics hit the danger threshold.

It turned out that an errant process had generated a huge number of audit events for the CMS, more than it could handle at once, so a lot of audit events got stuck in the queue. The metric correctly identified this and fired the alert, so that part worked as expected.

However, once we stopped and killed the errant process, the audit events were processed and the queue emptied. At that point the metric and the watch should both have returned to the OK state, or at least that was the behaviour we expected.

But now we have a strange issue: the metric for audit events in the queue is back down to 0 and shows the OK state again, and every other metric for the watch was already in the OK state. Even though every single metric of the CMS watch is in the OK state, the watch itself refuses to return to OK and is still stuck in the Danger state. As a result, all connected watches stay in their Danger state as well.

We hoped that a server reboot might fix the issue, but the watch is still stuck even after the regular reboot over the weekend. I also copied the complete watch, and the copy does switch to the correct OK state, so it seems to be an issue with the original watch only, which is the one that was part of the standard installation of the BI platform server.

I guess I could delete the watch and recreate it as a custom watch, but then I would have to edit all the connected watches as well, and I am also concerned that other watches could become corrupted in the same way in the future.

Is there any way I can debug or reset an individual watch or fix the issue? Any ideas would be greatly appreciated.

Joe_Peters
Active Contributor

We have had this problem since Monitoring was first introduced in BI4.1 (we're now on BI4.2 SP06). I opened two separate incidents with SAP but they were never able to determine the cause or a solution. I even asked if it was possible to completely re-set the Monitoring metadata in hopes of clearing the problem but was told that was not possible.

I discovered that the only way to reset the status of a watchlist is to disable and then re-enable it. This works for me consistently well.

I created a Java program object that runs every 10 minutes -- if there is a watchlist in Warning or Danger status, it disables and re-enables it. This works well -- if the watchlist still meets the warning/danger criteria, then it will remain in that status after the program completes, otherwise it will reset to green.
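
A minimal sketch of that disable/re-enable loop might look like the following. It deliberately does not use the SAP BI Platform SDK: `WatchAdmin`, `InMemoryAdmin`, and `resetStuckWatches` are hypothetical names made up for illustration, and the in-memory class only simulates a watch that re-evaluates its metrics when it is re-enabled. A real program object would have to implement `WatchAdmin` against whatever the platform actually exposes, and the 10-minute schedule would come from the platform's own scheduling.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class WatchResetSketch {

    /** Watch states as they are named in this thread. */
    enum State { OK, WARNING, DANGER }

    /** Hypothetical abstraction for illustration only; NOT the SAP BI SDK. */
    interface WatchAdmin {
        Map<String, State> listWatchStates();
        void setEnabled(String watchName, boolean enabled);
    }

    /** Disables and re-enables every watch that is not OK; returns their names. */
    static List<String> resetStuckWatches(WatchAdmin admin) {
        List<String> toggled = new ArrayList<>();
        for (Map.Entry<String, State> entry : admin.listWatchStates().entrySet()) {
            if (entry.getValue() != State.OK) {
                admin.setEnabled(entry.getKey(), false);
                admin.setEnabled(entry.getKey(), true);
                toggled.add(entry.getKey());
            }
        }
        return toggled;
    }

    /** In-memory stand-in: re-enabling a watch re-evaluates it from its metrics. */
    static class InMemoryAdmin implements WatchAdmin {
        private final Map<String, State> shown = new LinkedHashMap<>();
        private final Map<String, State> fromMetrics = new LinkedHashMap<>();

        void add(String name, State shownState, State metricState) {
            shown.put(name, shownState);
            fromMetrics.put(name, metricState);
        }

        @Override
        public Map<String, State> listWatchStates() {
            // Return a copy so callers can iterate while toggling watches.
            return new LinkedHashMap<>(shown);
        }

        @Override
        public void setEnabled(String watchName, boolean enabled) {
            if (enabled) {
                shown.put(watchName, fromMetrics.get(watchName));
            }
        }

        State stateOf(String name) {
            return shown.get(name);
        }
    }

    public static void main(String[] args) {
        InMemoryAdmin admin = new InMemoryAdmin();
        admin.add("CMS Watch", State.DANGER, State.OK);        // stuck: metrics are fine
        admin.add("Disk Watch", State.WARNING, State.WARNING); // genuinely in warning
        System.out.println("Toggled: " + resetStuckWatches(admin));
        System.out.println("CMS Watch is now " + admin.stateOf("CMS Watch"));
        System.out.println("Disk Watch is now " + admin.stateOf("Disk Watch"));
    }
}
```

In this simulation the stuck watch drops back to OK, while a watch whose metrics still breach a threshold keeps its status after the toggle, which matches the behaviour described above.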

I've found the monitoring component to be very buggy. I have all watchlists set with throttling so that an alert will only be sent when the status changes, yet I regularly get a flood of alerts when a threshold is crossed. And I can't even get to the throttling tab in the new Fiori version.

boman_hwang
Participant

Thank you, disabling and then re-enabling the watch did the trick!

Not sure yet whether we want to do something similar to your Java program, but that is a good idea to keep in mind.

Agreed on the component being buggy. I had some issues with the throttling tab as well, but it has loaded fairly consistently since I moved Monitoring to its own APS and increased the MaxHeap size considerably. It seems the Fiori version is rather memory-hungry.

denis_konovalov
Active Contributor

Hello Joe,

I think you should use the new "Schedule a Manager" option on the support portal and describe this problem there, so that it gets better attention and perhaps gets raised to the developers.

Joe_Peters
Active Contributor

Denis, I put many, many hours into this in the past with my incidents. It just wasn't worth my time to pursue it any further. I created a workaround that resolved my immediate need.