Toby Johnston

Understanding the “Health State” in the BI Platform Monitoring application and repairing failed Health State Watches

One of the key concepts to understand in the BI Platform Monitoring application is the Health State metric.  A number of different aspects of the monitoring application rely on the Health State metric, and without a clear understanding of how it is supposed to work, monitoring and troubleshooting the application effectively becomes a frustrating task.  In this article, I will describe the concept of Health State in detail and explain how to correct the Health State in your BI Platform Monitoring application.

Health State Metric

In the Central Management Console, when creating a new watch, there are two types of Health State metrics that can be used as threshold criteria:


  • Server Health State – The Server Health State indicates the health of a particular server.  This metric can be used to understand whether the server is up and running, whether the server is overloaded, and whether the server is still able to take additional requests.  The Health State of the server can indicate to the BI administrator if they need to take action to troubleshoot a problem on that particular server

/wp-content/uploads/2014/11/bb_590156.png



  • Topology Health State – The Topology Health State indicates the cumulative health of all servers of a particular type (Categories health) and also of all servers in a particular server group.  The Service Categories include Crystal Reports, Analysis Services, Dashboard Services, Promotion Management Services, Core Services, Explorer Services, Connectivity Services, and SIA nodes

/wp-content/uploads/2014/11/aa_590157.png

How the value for Health State is determined

In the case of the Server Health State metric, the value is determined by the result of that particular server’s watch.  Any time you create a new server manually or use the System Configuration Wizard to create your Adaptive Processing Server configuration, the system automatically creates a new watch for each server, named NODENAME.SERVERNAME Watch.  This is a “system” created watch and cannot be manually deleted.  You may have noticed in the Central Management Console that the system-created server watches are also displayed for ease of access under CMC -> Home -> Servers -> Servers List.

/wp-content/uploads/2014/11/serverslist_590163.png

Health State Evaluation


Depending on the value returned by the server’s watch formula, the server health will display one of the following five states.


STATE      DEFINITION
GREEN      Server health is good and no action is necessary
AMBER      Server is slightly overloaded, nearing peak values as defined by the caution rule
RED        Server resources are overused, the server is unable to take new requests, or the server is stopped or disabled
DISABLED   The watch is marked as disabled in the BI Monitoring application.  Select the watch and click the enable button to re-enable the evaluation of this watch
FAILED     There is an error in the watch formula, or the BI Monitoring service is disabled or not running


Topology and Categories Health States

To give the BI administrator a quick path to troubleshooting issues in the BI Platform landscape, the server health states are aggregated into service category health states.  This makes it much simpler to tell whether a particular product type is available to the end users of the system.  For example, if your BI system mainly processes Crystal Reports view-on-demand requests, then achieving maximum up-time requires that all the Crystal Reports Processing Servers in the BI landscape are available to process these jobs.  The Crystal Reports category health state depends on the aggregated health state of all the Crystal Reports server watches.  You can see this by editing the Crystal Reports category watch formula, which contains the health state of every Crystal Reports server.

/wp-content/uploads/2014/11/watt_590174.png

In the case of the Crystal Reports category, all of the servers required to process Crystal Reports are grouped together in the topology map so that you can tell at a glance which server watch may be causing the overall category state to change.

/wp-content/uploads/2014/11/crr_590164.png

Fixing the Overall Health Watch and the Health State Hierarchy

On the BI Platform Monitoring Dashboard, there is an Overall Health state indicator (also known as the Consolidated Health Watch).  You may have noticed that it often shows a state of Failed instead of a valid state (Green, Amber, or Red).  To fix this, you need to understand how this Health State is determined and then correct the underlying watch formulas that it depends on.  The monitoring application contains a large hierarchy of Health State watches, and if any of these dependent watches is broken or invalid, the Overall Health will show a state of Failed.  To help BI administrators correct their BI Platform Monitoring application and Overall Health, I have created a diagram showing each level in the Overall Health hierarchy, which you can use to track down the broken watches and correct their formulas.


In this example, you can see that the Overall Health state is Failed. 


/wp-content/uploads/2014/11/her_590183.png


If any of the dependent Health Watches below the Consolidated Health Watch are failed, then the watch in the next level up will also be failed.  Therefore, you must start at the bottom of the hierarchy and correct this watch.  In this example, the server APS 2 has a failed watch, therefore the SIA Node 2 watch is failed, the Enterprise Nodes watch is failed, and so on.

/wp-content/uploads/2014/11/her2_590184.png
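The bottom-up troubleshooting approach can be sketched in a few lines of Python.  This is only an illustration of the propagation rule described above (a parent watch is Failed if any dependent watch is Failed), not how the Monitoring Service is actually implemented.  The Watch class, the root_causes helper, and the "SIA Node 1" entry are hypothetical names introduced for this sketch, and the aggregation of Green, Amber, and Red states, which is governed by each watch's own formula, is not modeled.

```python
from dataclasses import dataclass, field
from typing import List

FAILED = "FAILED"


@dataclass
class Watch:
    """Hypothetical model of a Health State watch and its dependent watches."""
    name: str
    own_state: str = "GREEN"  # result of this watch's own formula
    children: List["Watch"] = field(default_factory=list)

    def state(self) -> str:
        # A watch is Failed if its own formula is broken or if any
        # dependent watch below it is Failed.
        if self.own_state == FAILED:
            return FAILED
        if any(child.state() == FAILED for child in self.children):
            return FAILED
        return self.own_state


def root_causes(watch: Watch) -> List[str]:
    """Return the names of the lowest-level Failed watches under `watch`."""
    failed_children = [c for c in watch.children if c.state() == FAILED]
    if not failed_children:
        return [watch.name] if watch.own_state == FAILED else []
    causes: List[str] = []
    for child in failed_children:
        causes.extend(root_causes(child))
    return causes


# Hierarchy from the example: APS 2 -> SIA Node 2 -> Enterprise Nodes -> Overall.
aps2 = Watch("APS 2", own_state=FAILED)
node1 = Watch("SIA Node 1")  # hypothetical healthy sibling node
node2 = Watch("SIA Node 2", children=[aps2])
enterprise_nodes = Watch("Enterprise Nodes", children=[node1, node2])
overall = Watch("Consolidated Health Watch", children=[enterprise_nodes])

before_state = overall.state()
before_causes = root_causes(overall)
print(before_state, before_causes)  # FAILED ['APS 2']

# Correcting the lowest-level watch clears the Failed state all the way up.
aps2.own_state = "GREEN"
after_state = overall.state()
print(after_state)  # GREEN
```

The point of the sketch is that the Failed state at the top carries no information about where the problem is.  Only walking down to the lowest Failed watch identifies the formula that needs to be corrected.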

After correcting the APS 2 Health State watch formula, all of the parent watches are now also showing a correct value and the Overall Health is Green (OK).  Note that after you correct the child watch formula, you should wait a few minutes: there is a metric refresh interval of 60 seconds (by default) at which the Monitoring Service updates the status of all watches in the system.  In other words, the change in Overall Health will not happen immediately after correcting the dependent watches, so be patient.

/wp-content/uploads/2014/11/her33_590185.png

Repairing the Server Watch formulas

When creating a new server or using the System Configuration Wizard, you will find that the automatic routine that generates the system watch is not perfect.  Depending on which service you are creating, the automatically generated watch may contain the wrong server name reference or, in some cases (such as the Connection Server), the wrong metric altogether.  When you edit the watch’s danger rule or caution rule, the erroneous parts of the formula that need to be corrected are shown in red.

A server Health State watch should contain at the very least a check to make sure the server is running.  Depending on the granularity that you desire you can create a two state watch, or a three state watch.

/wp-content/uploads/2014/11/na_590207.png

If you want to see a yellow caution state when a server is stopping or starting, use a three state watch.  If you are only interested in seeing a green state for running and red for any other state, you can use a two state watch.  Using the server metric Server Running State, you can easily create a new server watch based on whether that server is available or not.

Server Running State Values

State                    Value
Stopped                  0
Starting                 1
Initializing             2
Running                  3
Stopping                 4
Failed                   5
Running With Errors      6
Running With Warnings    7

See below an example of both two state and three state watches that check for server availability.  In this example, my SIA node name is NODE and the server name is SERVERNAME.

Two state watch formula:

Danger Rule NODE.SERVERNAME$'Server Running State'!=3

Three state watch formula:

Caution Rule NODE.SERVERNAME$'Server Running State'==1 || NODE.SERVERNAME$'Server Running State'==2 || NODE.SERVERNAME$'Server Running State'==4 || NODE.SERVERNAME$'Server Running State'==6 || NODE.SERVERNAME$'Server Running State'==7
Danger Rule NODE.SERVERNAME$'Server Running State'==0 || NODE.SERVERNAME$'Server Running State'==5
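To check which Health State each Server Running State value produces under the two formulas above, here is a small Python sketch.  It only mirrors the logic of the rules and is not part of the BI Platform.  The function names (two_state, three_state, two_state_danger_rule) are hypothetical and introduced for this illustration.

```python
# Server Running State values, as listed in the table above.
RUNNING_STATES = {
    0: "Stopped",
    1: "Starting",
    2: "Initializing",
    3: "Running",
    4: "Stopping",
    5: "Failed",
    6: "Running With Errors",
    7: "Running With Warnings",
}


def two_state(value: int) -> str:
    """Two state watch: Danger Rule is 'Server Running State' != 3."""
    return "RED" if value != 3 else "GREEN"


def three_state(value: int) -> str:
    """Three state watch: Danger Rule for 0 or 5, Caution Rule for 1, 2, 4, 6, 7."""
    if value in (0, 5):
        return "RED"
    if value in (1, 2, 4, 6, 7):
        return "AMBER"
    return "GREEN"


def two_state_danger_rule(node: str, server: str) -> str:
    """Build the two state Danger Rule text for a given node and server name."""
    return f"{node}.{server}$'Server Running State'!=3"


# Print the resulting Health State for every running state value.
for value, label in RUNNING_STATES.items():
    print(f"{value} {label:<22} two state: {two_state(value):<6} three state: {three_state(value)}")

print(two_state_danger_rule("NODE", "SERVERNAME"))
```

Only the value 3 (Running) results in Green under either watch.  The three state watch differs in that it reports the transitional and degraded states (1, 2, 4, 6, 7) as Amber instead of Red.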

Factoring in performance to the server health state

In some cases, such as the Central Management Server, the load on the server is used to determine the server health state.  Depending on which type of server you are editing the watch for, a variety of metrics can be used to determine load.  You may also want to include thresholds for these metrics in your server watch formula, so that the server health state also depends on how well the service is performing and whether it is able to take on more jobs.


Refer to the BI Platform Administrator Guide for more information on server metrics to determine which metrics are suitable for your BI landscape.

      17 Comments
      Onkar Velhal

      Great!, Thank you for sharing.

      Brian Thomas

      Great topic for a blog....some good info!

      Pavankumar Kandukuri

      Thanks for this Toby 🙂 . Good info

      Stephen Folan

      Thanks Toby, really clarified this topic for me. Great.

      James Rapp

      Nice one Toby ... is it possible to mix in a check for servers in a disabled state? For example, I want to evaluate my stopped servers, but I know I have some DR services set to disabled. Any way to exclude them from the watch?

      Toby Johnston
      Blog Post Author

      Hey Jim,

      Yeah you could do this.  If you check the hierarchy diagram, you would need to remove those DR services from the SIA node watch and also the service category watch.  Then they would no longer be factored into the overall health watch (aka consolidated health watch)

      Cheers

      Toby

      Torsten Wirth

      Thanks for this blog - It's nice to see how monitoring should work. We have been on BO for 3 years now and monitoring never worked for us because we followed the recommendation to split up the APS. The result was that the internal Derby "DB" stopped working, so we had to switch to an external DB (which is Oracle in our case).

      At first we found out that Oracle was not supported in our initial BO releases. Then, once support was added, there were bugs which resulted in extreme redo log growth, and now with 4.1 SP5 monitoring is partially working. But only some monitors work. E.g. the server status is always green (it doesn't matter if the server is switched off or is in another state). On the other hand, server activation (status 0 or 1) is working like it should.

      So if BO monitoring is working it's a good thing. But up to now I have not seen a working version from the monitoring. And if I ask our admin to test something connected to BO monitoring it's like "don't you have something better to do - it's only waste of time".

      Sorry to say that but that are our experiences so far with BO Monitoring. It's nice if it is working for others up to now not for us.

      Toby Johnston
      Blog Post Author

      What version are you on?  Have you raised this with SAP support?

      Torsten Wirth

      Version 4.1.5.

      Yes, we have several tickets in the last 3 years concerning that topic.

      There are several problems we have at the moment - I made a large ticket with all the issues we have with monitoring, and the result was that support told me I should make a ticket for every single issue. That will waste a lot of time because, from my experience, the external Indian support people all start at the beginning (like the DB connection - which is working - I have checked the tables, redeploying Tomcat, deleting the cache dir and so on - we have done all that without a result). Which means spending several days with support people showing your system and in the end with no result.

      There was one or two days where the Mbean error on the starting page vanished. But after that it reappeared again.

      It's strange for example that 1694041 does not recommend adding federation to the monitoring APS, but this is necessary. It worked for 2 days after we did that (there was also a long time between adding the federation service to the APS and seeing the green light on the monitoring page).

      But we also have several other problems like showing green status on top of control modules but grey status in the detail list combined with MON00044 error sometimes (we did a redeploy of tomcat without result).

      Or we are getting the MON00048 error if we would like to change a control module. Furthermore, in a control module you should see the rules graphically on top and as text in the lower area. With some modules there is sometimes a warning message "possible naming conflict". If you press refresh there is no problem.

      Sometimes we also had MON00016 which avoids a control module from saving.

      And there are basic other problems: If you would like to monitor job server via email scheduling you will get a mail if the server is ok again (because the job server is sending the mails and if it's not working there is no mail alerting).

      Or you will get a message if a server is in status green or red. But if a server is in status undefined (which will happen if you monitor e.g. a combined status of open jobs on job server > 100 or server executing status = 0 or 5) then you will get no message. In my opinion it's not enough to see if there are 100 jobs which are not done. It should also be possible to see if the server is up or not. But that seems not to be possible in one watch if you would like to combine it with alerting.

      Toby Johnston
      Blog Post Author

      Apologies Torsten, I never got an SCN email notification on this reply and haven't been back to this post in over a year.  Did you ever get the help you needed to address all these issues?

      Torsten Wirth

      Hello Toby,

      Thanks for reply.

      Up to now we are on the same version of BO. Due to the monitoring problems and the kind lifecycle management is handled (no recording of changes like in NetWeaver and the more complicated transport system than in NetWeaver) mainly for SSO in combination with Analysis Office.

      We will maybe change to the latest release in the next months + change from Oracle to HANA DB for auditing. I will give it a try again afterwards if I find the time.

      Regards,

      Torsten

      John Clark

      Torsten, I don't know if it will help any but I ran across this issue with MON00048.

      When I was going into some of our watchlists, I would get this error.  It appears that it may be tied to having some bad data in the watchlist.  I see this displayed along with it: "Possible mismatched names".  I am finding some old server names and in at least one case a reference to the old "AdaptiveProcessingServer" which has largely been replaced.

      It is rather time consuming to go through all of the watchlists to find the problems but it does appear to be helping.

      Enrique Fundora

      I regret that monitoring is also problematic for us. We have not been able to get the failed staff changed regardless of what we do.

      Toby Johnston
      Blog Post Author

      What version are you on?  Have you raised this with SAP support?

      Sachin Joshi

      Toby,

      Can you please provide any pointers in order to define and find which metrics are used to evaluate Lumira Services and components involved in the workflow from HANA as data source ?

      Thanks in advance.

      Grigory Razin

      Hello.

      After upgrading from version 4.2 of sp 4 to 4.2 sp 6, after a short work the redness appears on the value of CentralManagementServer. Before that, everything was always green. How to fix it

      Mynyna Chau

      Any questions that require helpful answers by the community should be posted as separate question here: https://answers.sap.com/questions/ask.html

      Thanks and best,

      Chau