Centralized health monitoring of SAP Cloud Integration using SAP Cloud ALM
This blog post continues the blog post Central monitoring of integration scenarios using SAP Cloud ALM, which focuses on the Integration and Exception Monitoring application. In this post, I give an overview of how the Health Monitoring application of SAP Cloud ALM can help you in an operations role to keep track of the technical health of cloud services, especially the Cloud Integration capability of SAP Integration Suite.
What is Health Monitoring briefly?
Health Monitoring is one of the applications of SAP Cloud ALM for Operations. It focuses on the technical aspects of cloud services to identify service disruptions and degradations, rather than on performance aspects. It covers aspects such as service availability or utilization of resources. The checks are service-specific and depend on the metrics offered by the monitored cloud services or on-premise systems. The goal is to provide technical health indicators on the correct function of connected cloud services and on-premise systems from an application perspective. System administrators can track the availability of connected services, check whether resources are exhausted, and analyze usage trends. Additionally, they can collaborate on reported alerts.
To get an overview of the supported solutions so far, see Health Monitoring in the SAP Cloud ALM Expert Portal.
A big advantage is that the overview page of Health Monitoring shows all connected cloud services and on-premise systems at a glance, grouped by the service types they belong to. It is not necessary to jump into every single local monitoring tool to understand the functional health of a particular service. This central view helps operations teams in their daily job to see where crucial changes and adjustments are necessary to ensure business continuity. In the end, however, the changes themselves must be executed in the local services.
Beyond the overview, the monitoring page lists the connected service instances of a selected service type, and you can navigate one step further to see all metrics of an instance. A history graph shows the data collected for a selected metric over a timeframe for trend analysis. An alert inbox empowers operations teams to react in critical situations and helps to ensure business continuity.
The health of a service is defined by a rating percentage, which determines its rating color. In the screenshot below one can see three Cloud Integration services grouped together, one in Critical and the other two in Warning state. The service type SAP Integration Suite (Cloud Integration) is in Critical state because it takes the worst status of its individual service instances.
With this blog you will understand
- how the rating of a service’s health is calculated
- the available metrics for Cloud Integration and the rating
- how to customize thresholds to change the rating
- how to get alerts activated and work collaboratively on reported alerts
- how to receive personal email notifications on reached thresholds
Calculation of a service health rating
All technical metrics from the monitored services and on-premise systems are collected on a regular basis and can be used to calculate the overall health of a monitored service. The metrics are defined by the service types themselves and therefore may differ. In a later part of this blog post, I will explain the metrics offered by Cloud Integration.
The service health is calculated bottom up: each metric, such as a specific certificate validity or the activation status of a JMS (Java Message Service) queue, is rated individually. Thresholds define when a metric receives a certain rating, such as OK or Critical. The metric rating is mapped to a metric health score to automate the calculation of the overall service health score. See the picture below to understand the assignment. The health score of the overall service is the mean score of all individual metrics. A service with a health percentage below 100% is in Warning state, and a service below 80% is in Critical state. As soon as one metric is rated as Fatal, the entire service has a health score of 0% and is in Critical state.
Some metrics have only informative character, such as the total number of messages in a JMS queue. They are not included in the health calculation of a service.
Example: The screenshot below shows the monitoring page, which lists all monitored Cloud Integration instances. The first instance is in Critical state, colored red. 41 metrics are available for this instance. 3 of them are purely informative (the total number of messages in a JMS queue) and are not counted. Of the 38 relevant metrics, 10 are Critical (0%), 1 is in Warning (50%), and 27 are Ok (100%), so the score is (10 × 0% + 1 × 50% + 27 × 100%) / 38 ≈ 72%. As this is below 80%, the service is rated as Critical.
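The calculation can be sketched in a few lines of Python. This is only an illustration of the rules described above, not the implementation used by SAP Cloud ALM: the function name `service_health` and the rating strings (including "INFO" for purely informative metrics) are assumptions, and the scores per rating are taken from the example.

```python
# Score per metric rating, as used in the example (assumed mapping).
SCORES = {"OK": 100, "WARNING": 50, "CRITICAL": 0}


def service_health(ratings):
    """Return (score_percent, service_rating) for a list of metric ratings.

    Informative metrics are passed as "INFO" and are ignored.
    """
    relevant = [r for r in ratings if r != "INFO"]
    if not relevant:
        return None, "UNKNOWN"  # nothing to rate (assumption)
    if "FATAL" in relevant:
        return 0.0, "CRITICAL"  # one Fatal metric sets the service to 0%
    score = sum(SCORES[r] for r in relevant) / len(relevant)
    if score < 80:
        return score, "CRITICAL"
    if score < 100:
        return score, "WARNING"
    return score, "OK"


# The instance from the example: 41 metrics, 3 of them informative.
ratings = ["CRITICAL"] * 10 + ["WARNING"] + ["OK"] * 27 + ["INFO"] * 3
score, rating = service_health(ratings)
print(f"{score:.0f}% -> {rating}")  # 72% -> CRITICAL
```

Note that a small number of Critical metrics can already pull the mean below 80%, which is why a service with mostly green metrics may still be rated as Critical.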
Available metrics for Cloud Integration
We have talked about metrics in general long enough. In the following, one can see the ones currently available for Cloud Integration, together with the default thresholds that define the metric ratings (see also the column value in the picture All metrics view). For the Cloud Integration capability of SAP Integration Suite, as of today the status of JMS queue resources, used for asynchronous messaging scenarios, and the validity of certificates and key pairs are monitored. The thresholds are the same as the ones you know from the local Cloud Integration monitoring, only the threshold names differ.
- Certificate validity shows the remaining days until the certificate or key pair expires. There is no differentiation between keystore items provided by SAP and those owned by customers. By default, the metric is rated as Ok as long as the validity is more than 30 days and then switches to Warning. If the certificate expires in less than 7 days, the rating is Critical. See Security artifact renewal in case of an upcoming expiration.
- Number of JMS queues shows the total number of used queues in relation to the limit (default: 30). It is rated as Critical if only one or no queue remains. The maximum number of message queues can be configured on the tenant. Read the blog for further information.
- JMS queue capacity shows the total queue capacity in use in relation to the total storage available for JMS resources. If the MBs already in use reach 80% of the total capacity available, the status changes to Warning, and if they reach 95%, the status changes to Critical.
- JMS queue consumers, producers, and transactions metrics are rated as Ok if any consumer, producer, or transaction is available.
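As a rough sketch, the default thresholds for certificate validity and JMS queue capacity could be expressed as follows. The function names and rating strings are hypothetical, and how a value that lies exactly on a threshold is rated is an assumption, since the text above only gives the approximate boundaries.

```python
def rate_certificate_validity(days_remaining: int) -> str:
    """Default rating for the remaining validity of a certificate or key pair."""
    if days_remaining < 7:
        return "CRITICAL"
    if days_remaining <= 30:  # treatment of exactly 30 days is an assumption
        return "WARNING"
    return "OK"


def rate_queue_capacity(used_mb: float, total_mb: float) -> str:
    """Default rating for the JMS storage in use, relative to the total."""
    usage_percent = used_mb * 100 / total_mb
    if usage_percent >= 95:
        return "CRITICAL"
    if usage_percent >= 80:
        return "WARNING"
    return "OK"


print(rate_certificate_validity(45))  # OK
print(rate_certificate_validity(20))  # WARNING
print(rate_certificate_validity(3))   # CRITICAL
print(rate_queue_capacity(85, 100))   # WARNING
```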
The following three metric types are available for each JMS queue but additionally also as a separate one summarizing all JMS resources for the tenant:
- JMS queue active shows whether a queue has been started or stopped.
- JMS queue status: a JMS queue consists of three sub-queues for processing a JMS message (main processing, storage for a retry in case of an error, and storage for chunks in case of large messages), and the status depends on the usage of these sub-queues. If the usage of at least one sub-queue reaches 95% of the maximum capacity, the JMS queue changes to the state Critical. Between 80% and 95% it is in Warning, and below 80% it is in state Ok. In case a queue is stopped, Critical is reported along with the text ‘Queue stopped’.
The summarizing status metric across all JMS queues goes to Critical if at least one of the JMS queues has status Critical, the same holds for
- JMS queue message shows the total number of messages in a queue. This metric is for information purposes only and is not counted in the overall service health score.
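The per-queue status and the summarizing status can be pictured with the following sketch. The class `JmsQueue`, the function names, and the rating strings are hypothetical. That the summary takes the worst rating of all queues is confirmed above only for Critical, so propagating Warning in the same way is an assumption.

```python
from dataclasses import dataclass

SEVERITY = {"OK": 0, "WARNING": 1, "CRITICAL": 2}


@dataclass
class JmsQueue:
    name: str
    active: bool
    # Usage in percent of the main, retry, and chunk sub-queues.
    sub_queue_usage: tuple


def rate_queue(queue):
    """Return (rating, text) for a single JMS queue."""
    if not queue.active:
        return "CRITICAL", "Queue stopped"
    worst_usage = max(queue.sub_queue_usage)
    if worst_usage >= 95:
        return "CRITICAL", ""
    if worst_usage >= 80:
        return "WARNING", ""
    return "OK", ""


def rate_all_queues(queues):
    """Summarizing status: the worst rating of all queues (assumption for Warning)."""
    ratings = [rate_queue(q)[0] for q in queues]
    return max(ratings, key=SEVERITY.get, default="OK")


queues = [
    JmsQueue("orders", True, (10.0, 85.0, 0.0)),
    JmsQueue("invoices", False, (0.0, 0.0, 0.0)),
]
print(rate_queue(queues[0]))    # ('WARNING', '')
print(rate_queue(queues[1]))    # ('CRITICAL', 'Queue stopped')
print(rate_all_queues(queues))  # CRITICAL
```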
Monitoring Cloud Integration
The Metrics Overview, one part of the monitoring page and the centerpiece of Health Monitoring, ensures a quick health overview of a specific service. It provides a pre-configured grouping of available metrics that varies from one monitored service type to another. In the case of Cloud Integration, one can see in the screenshot below the groups for certificate validity and JMS resources. By means of these colored tiles it is easy to track all monitored metrics of a particular service briefly. Any rating in red on the page attracts attention and should be examined more closely.
To distinguish multiple metrics of the same type, Health Monitoring has introduced labels. Each JMS queue has a label ‘queue’ for its name. Certificates have the labels ‘alias’ and ‘type’, the latter to differentiate between a certificate and a key pair.
In the lower section of the Metrics Overview table, one can also see tiles for each JMS queue, titled with their label. Each queue has its own metrics, such as its activation, its status, and its number of messages, and is therefore visualized as a separate tile.
In the parallel tab All Metrics (see picture below), the details of all metrics of a particular service are listed, such as all certificate validities including the remaining days until expiry. The timestamp shows when the data was last collected. By default, data is pulled from Cloud Integration tenants every 5 minutes using the OData APIs.
As the list of metrics might be large the list can be sorted, filtered by a specific metric type or rating, or one can group the metrics by their labels. In the screenshot below one can see all metrics of type ‘Certificate validity’ grouped by the label ‘type’ (certificate, key pair).
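To make the idea of labels more tangible, here is a small sketch of filtering a metric list by metric type and grouping it by a label. The data structure, the function `group_by_label`, and the sample aliases and values are made up for illustration and do not reflect the actual data model of Health Monitoring.

```python
from collections import defaultdict

# Hypothetical sample entries; aliases and values are invented.
metrics = [
    {"name": "Certificate validity", "labels": {"alias": "cert_a", "type": "certificate"}, "value": 120},
    {"name": "Certificate validity", "labels": {"alias": "key_b", "type": "key pair"}, "value": 5},
    {"name": "Certificate validity", "labels": {"alias": "cert_c", "type": "certificate"}, "value": 25},
    {"name": "JMS queue active", "labels": {"queue": "orders"}, "value": 1},
]


def group_by_label(entries, metric_name, label):
    """Keep only entries of one metric type and group them by a label value."""
    groups = defaultdict(list)
    for entry in entries:
        if entry["name"] == metric_name:
            groups[entry["labels"].get(label, "(none)")].append(entry)
    return dict(groups)


groups = group_by_label(metrics, "Certificate validity", "type")
for label_value, items in groups.items():
    print(label_value, [item["labels"]["alias"] for item in items])
# certificate ['cert_a', 'cert_c']
# key pair ['key_b']
```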
Health Monitoring does not offer any tools to update certificates or queues directly. This is only possible in the keystore monitor or the manage stores monitor of Cloud Integration itself. You may argue that the local Cloud Integration monitoring is much better suited to actually work on these monitored resources. This is correct, and one can get similar information in the local monitoring tool. As mentioned above, a big advantage of SAP Cloud ALM is that one gets an overview of the health rating of all monitored Cloud Integration services at a glance, or even of all connected services and systems of the IT landscape. Operations teams benefit from this overview, can concentrate their efforts on the most important tasks, and do not have to inspect all local monitoring tools. They can also navigate from the Health Monitoring application to the Cloud Integration monitoring.
On the other hand, it is not possible to track a specific metric, for example the certificate validity, across all services. But an expired certificate influences the health rating of a service and thereby attracts your attention anyway. See Security artifact renewal in case of an upcoming expiration.
Get an overview of all the content of other services or systems monitored by the Health Monitoring app in the SAP Cloud ALM Expert Portal.
Historical view on a particular metric
The metric history graph helps to see how the metrics have evolved, to identify trends in the usage of resources, or to analyze when exactly a specific situation happened. It is possible to see the collected data of selected labels for the last 30 days. Below you see the expected behavior of the certificate validity of all key pairs: straight lines represent the number of remaining days until expiry. Directly within this popup you may filter on all labels available for the metric (alias and type for Certificate Validity).
Customize thresholds to change metric and service ratings
As technical metrics are used to calculate the overall health indicator of a service, it is interesting to understand how customers can overwrite the default thresholds of a metric. This can be useful if a rating level should be reached earlier or later than by default. The custom configuration of a metric applies on the service instance level and influences the metric rating, the rating of the overall service instance, and indirectly also the triggered alerts.
In the Metrics tab of the Configuration for Cloud Services popup, you get an overview of the active custom thresholds of a particular service or on-premise system, and one can navigate further to change them for a specific metric. In the screenshot below one can see that I have changed the numeric threshold of the metric Certificate Validity.
Example: by default, for all Cloud Integration tenants, the certificate validity reaches the less serious Warning level if less than 30 days remain until a certificate expires, and the Critical level if less than 7 days remain. As one can see below, the thresholds are adjusted: Warning is already reached if less than 50 days remain until a certificate expires, and Critical if less than 9 days remain. After saving this threshold configuration, the threshold is activated automatically.
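The effect of such an override can be sketched as follows. The names `DEFAULTS`, `custom_thresholds`, and `rate_validity` as well as the tenant names are hypothetical; the sketch only illustrates that a custom threshold is stored per service instance and replaces the default for that instance alone. How a value exactly on a threshold is rated is an assumption.

```python
DEFAULTS = {"warning_days": 30, "critical_days": 7}

# Custom thresholds are kept per service instance and metric.
custom_thresholds = {
    ("tenant-a", "Certificate validity"): {"warning_days": 50, "critical_days": 9},
}


def thresholds_for(service, metric):
    """Use the custom thresholds of the instance if present, else the defaults."""
    return custom_thresholds.get((service, metric), DEFAULTS)


def rate_validity(service, days_remaining):
    t = thresholds_for(service, "Certificate validity")
    if days_remaining < t["critical_days"]:
        return "CRITICAL"
    if days_remaining <= t["warning_days"]:  # boundary handling is an assumption
        return "WARNING"
    return "OK"


# The same certificate lifetime is rated differently on the two tenants.
print(rate_validity("tenant-a", 40), rate_validity("tenant-b", 40))  # WARNING OK
print(rate_validity("tenant-a", 8), rate_validity("tenant-b", 8))    # CRITICAL WARNING
```

Because the metric rating feeds into the service health score, tightening thresholds in this way also lowers the score of the affected service instance earlier.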
Besides changing the numeric values of the thresholds, the metric rating can also be changed by mapping the rating levels differently. Only as an example, one can see below that the Warning status could be mapped to remain Ok although the monitored service is sending a Warning.
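Conceptually, such a mapping is a simple lookup applied to the rating that the thresholds produced. A minimal sketch, with a hypothetical function name and rating strings:

```python
def apply_rating_mapping(reported, custom_mapping=None):
    """Translate the rating derived from the thresholds into the rating shown."""
    return (custom_mapping or {}).get(reported, reported)


custom = {"WARNING": "OK"}  # Warning is shown as Ok, everything else is unchanged
print(apply_rating_mapping("WARNING", custom))   # OK
print(apply_rating_mapping("CRITICAL", custom))  # CRITICAL
print(apply_rating_mapping("WARNING"))           # WARNING
```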
Embedded alerting on metrics reaching a threshold
Like other SAP Cloud ALM monitoring apps, Health Monitoring comes with embedded alerting to empower operations teams to react to critical situations and help ensure business continuity. A reported alert can be treated as a ticket. Somebody can work on it or hand it over to someone else to work on it collaboratively, add useful comments, or start an operation flow.
Alerts are based on the rating of an associated metric. The alert inbox shows all active alerts since the last data collection. It is not possible to see past alerts. As data is actively pulled from Cloud Integration every 5 minutes, the alert inbox is updated at this frequency. An alert is active if one of the metric’s entries (labels) is in a Warning or Critical state.
All alerts related to the same metric and service instance are aggregated to avoid overwhelming the alert inbox.
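The aggregation can be pictured as grouping all metric entries by service instance and metric, and raising one alert per group that contains at least one entry in Warning or Critical state. The following sketch uses a hypothetical tuple layout and made-up tenant names and aliases; it is not the actual alerting logic of SAP Cloud ALM.

```python
from collections import defaultdict


def build_alerts(entries):
    """entries: iterable of (service, metric, label, rating) tuples.

    Returns one alert per (service, metric) that has at least one entry
    in Warning or Critical state. The alert carries all entries of the metric.
    """
    grouped = defaultdict(list)
    for service, metric, label, rating in entries:
        grouped[(service, metric)].append((label, rating))
    return {
        key: items
        for key, items in grouped.items()
        if any(rating in ("WARNING", "CRITICAL") for _, rating in items)
    }


entries = [
    ("tenant-a", "Certificate validity", "alias=cert_a", "OK"),
    ("tenant-a", "Certificate validity", "alias=key_b", "CRITICAL"),
    ("tenant-a", "JMS queue capacity", "", "OK"),
    ("tenant-b", "Certificate validity", "alias=cert_x", "WARNING"),
]
for (service, metric), items in build_alerts(entries).items():
    print(service, metric, len(items))
# tenant-a Certificate validity 2
# tenant-b Certificate validity 1
```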
In the alert details you see a selected metric with all the entries, e.g., the metric Certificate Validity with all the certificates and key pairs (labels alias and type). Additionally, you may start some actions on all entries such as to add a comment, to assign a processor or to start an operation flow. All these actions may ease the issue resolution process.
Activation of alerts
Usually, the data collection for a particular service instance starts in the Health Monitoring app automatically after the setup. But alerts are inactive by default. First, check whether data collection is still turned on for the service of interest. If not, start the monitoring using the toggle button.
Second, activate alerting for the metrics you are interested in. For Cloud Integration, four metrics are configured as alerts. One can get alerts on the expiry of certificates, on reaching the total JMS queue capacity, or when any of the JMS queues has been stopped and is no longer active. The activation is done in the Events tab of the Configuration for Cloud Services popup of a particular service and requires the Health Monitoring Administrator role. You must repeat this step for every service instance separately to get alerts triggered to the embedded alert inbox.
Receive alert notifications or trigger operations automatically
Alerts in SAP Cloud ALM are created by the Intelligent Event Processing application, another tile in the SAP Cloud ALM for Operations section. For the alerts that can be activated directly within Health Monitoring and are displayed in the alert inbox, rules are set up automatically in Intelligent Event Processing in the background and cannot be changed. These rules trigger the alerts.
But it is possible to create your own event rules. See below the differentiation between rules originated by Health Monitoring and the ones created by users within the Intelligent Event Processing app.
Creating your own event processing rules is necessary to configure how to react to an event: either receive automatic notifications by email or start an operation flow.
For the creation of the rules the Health Monitoring Administrator role is required. During the creation one must specify an alert name as an internal identifier and choose the relevant monitoring use case (Health Monitoring Event). Furthermore, all services and metrics can be listed to which the rule is to be applied. For the technical metric names (event sub type), it is possible to use wildcards to be more generic.
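How a rule with wildcards selects events can be illustrated with shell-style pattern matching. Everything in this sketch is hypothetical: the rule structure, the tenant names, and the technical metric names such as `jms.queue.capacity` are invented, and the wildcard syntax actually supported by Intelligent Event Processing may differ from the `*` used here.

```python
from fnmatch import fnmatchcase

# Hypothetical rule: applies to two services and to all JMS-related metrics.
rule = {
    "name": "ci-jms-resources",
    "services": {"tenant-a", "tenant-b"},
    "sub_type_patterns": ["jms.*"],
}


def rule_applies(rule, service, sub_type):
    """True if the event's service is listed and its sub type matches a pattern."""
    return service in rule["services"] and any(
        fnmatchcase(sub_type, pattern) for pattern in rule["sub_type_patterns"]
    )


print(rule_applies(rule, "tenant-a", "jms.queue.capacity"))    # True
print(rule_applies(rule, "tenant-a", "certificate.validity"))  # False
print(rule_applies(rule, "tenant-c", "jms.queue.capacity"))    # False
```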
It is mandatory to select an action type, either to send an email to multiple assigned email addresses or to start an operation. Email addresses need to be added to the Notification Management app beforehand and additionally verified by the owners.
These recipients receive email notifications every 5 minutes on the stated metrics if at least one of the labels is in a Critical state. In the current example, notifications are configured for all alerts, that is, for expired certificates and exhausted JMS resources. If some of the certificates and key pairs are in a Critical state, an alert is reported and an email notification is sent out for the metric, which also covers the metric labels in OK state.
Health Monitoring, one of the applications of SAP Cloud ALM for Operations, helps operations teams to keep track of the technical health of the services in large and heterogeneous landscapes. All checks are service-specific and depend on the metrics offered by the monitored services. As of today, Cloud Integration offers metrics on certificate validity and the exhaustion of JMS resources. The overview page as a central view helps operations teams in their daily job to see for which services crucial changes and adjustments are necessary. Additionally, reported alerts can be treated as starting points in the issue resolution process and worked on collaboratively with colleagues. Furthermore, it is possible to create event processing rules for individual alert notifications or automatic starts of operation flows. All these Health Monitoring tools support operations teams in reacting quickly and ensuring business continuity.
Find the complete list of supported products for Health Monitoring and how to set up the application in the SAP Cloud ALM for Operations Expert Portal to get started. This page is continuously updated.
If you want to get a deeper insight into integration and exception monitoring of Cloud Integration tenants, please read my previous blog post Central monitoring of integration scenarios using SAP Cloud ALM.
If you want to understand the comparison with other ALM tools please check out the blog post Monitoring tools for Cloud Integration capability of SAP Integration Suite.
What could be the error when my on-prem system is listed in Health Monitoring with the comment "no data received"? All other applications are fine.
I assume that there are issues with the configuration and therefore the data cannot be sent.
Please open an incident on the component SV-CLM-OP-HM, then we can analyze and resolve this issue.