In the previous blogs you have seen what alerts are, how to set up alerting in Focused RUN for SAP Solution Manager, and how to process alerts.
This blog covers the Alert Correlation feature. After reading it, you should be able to answer:
- What is Alert Correlation?
- How does it impact the Alert Inbox?
- How do I process an Alert Cluster?
- What are correlation rules, and how do I create them?
- When should I or my organization use Alert Correlation?
Modern data center organization models generate a high number of events about the health status of the managed systems and services landscape.
The goal of alert correlation is to group all events related to a single issue into a high-level alert cluster, which can be handled at once. Specifically, it aims to:
- Reduce the number of manual activities in alert processing
- Cluster symptoms of the same issue that show up in multiple components of the landscape into a single alert
- Focus on the most important issue without distraction from parallel workstreams
- Avoid double work
- Enable usage of highly specialized monitoring capabilities without increasing the workload in the first level
Correlation happens on different levels, and here is a diagram that explains the evolution of Alert Correlation.
First, in the CCMS days, the correlation was on the time axis. Single metrics were collected over time and presented as one value.
Second, as you know, there are different categories of alerts, such as Availability, Performance, Configuration, Exception and Self-Monitoring. Building on SAP's expertise in technical monitoring solutions, our monitoring templates are already streamlined to combine multiple events (problems) into a single alert. This is the static form of alerting.
Let’s take the case of the “File System Full” alert. It comprises the following metrics:
- File System Free
- File System Used
Or the “High CPU Utilization” alert, which comprises:
- CPU I/O wait
- CPU Idle
- CPU load average
- CPU System Utilization
- CPU User Utilization
- CPU Utilization
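To make the idea of a static alert concrete, here is a minimal Python sketch of several metric readings rolling up into one alert. It is not how Focused RUN implements its templates. The rating scale, the `MetricReading` class, and the worst-case rule in `alert_rating` are all assumptions chosen for illustration.

```python
from dataclasses import dataclass

# Hypothetical rating scale (assumption): a higher number means a worse state.
GREEN, YELLOW, RED = 0, 1, 2


@dataclass
class MetricReading:
    name: str
    rating: int


def alert_rating(readings):
    """Assumed worst-case rule: the alert takes the worst rating of its metrics."""
    return max((r.rating for r in readings), default=GREEN)


# Three of the metrics behind the "High CPU Utilization" alert, with made-up ratings.
cpu_alert = [
    MetricReading("CPU I/O wait", GREEN),
    MetricReading("CPU Idle", YELLOW),
    MetricReading("CPU Utilization", RED),
]

print(alert_rating(cpu_alert))  # 2 (RED)
```

However many metrics turn yellow or red, the operator sees one alert instead of one per metric.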
As of FRUN 2 FP02, the current approach is based on machine learning and reduces alert volume even further.
As we proceed with the blog, here are some terms to keep in mind:
Family: A group of related technical components, created using graph theory.
Correlation Rule: A rule that defines how alerts are grouped, for example: Correlate alerts on family within time window of 10 minutes where Data Center = “DC1”
Alert Cluster: A set of alerts that are grouped together and processed as a single unit.
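The three terms can be tied together in a small Python sketch. It is an illustration under stated assumptions, not the Focused RUN algorithm. Families are approximated as connected components of a component graph, and the example rule is read as "same family, at most 10 minutes after the first alert of the cluster, data center DC1". The function names, the alert dictionaries, and the sample landscape are all hypothetical.

```python
from datetime import datetime, timedelta


def build_families(edges):
    """Map each component to a family id using connected components (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b
    return {node: find(node) for node in list(parent)}


def cluster_alerts(alerts, family_of, window=timedelta(minutes=10), data_center="DC1"):
    """Group alerts of one family; the window starts at the first alert of a cluster."""
    clusters = []
    open_cluster = {}  # family id -> (time of first alert, list of alerts)
    for alert in sorted(alerts, key=lambda a: a["time"]):
        if alert["data_center"] != data_center:
            continue  # not matched by the rule, so it is not clustered here
        family = family_of.get(alert["component"], alert["component"])
        current = open_cluster.get(family)
        if current is None or alert["time"] - current[0] > window:
            current = (alert["time"], [])
            open_cluster[family] = current
            clusters.append(current[1])
        current[1].append(alert)
    return clusters


# Hypothetical landscape: host1, db1 and sys1 are related; host2 and sys2 are related.
family_of = build_families([("host1", "db1"), ("db1", "sys1"), ("host2", "sys2")])

t0 = datetime(2024, 1, 1, 12, 0)
alerts = [
    {"component": "host1", "time": t0, "data_center": "DC1"},
    {"component": "host1", "time": t0 + timedelta(minutes=1), "data_center": "DC2"},
    {"component": "sys1", "time": t0 + timedelta(minutes=4), "data_center": "DC1"},
    {"component": "host2", "time": t0 + timedelta(minutes=5), "data_center": "DC1"},
    {"component": "db1", "time": t0 + timedelta(minutes=30), "data_center": "DC1"},
]

clusters = cluster_alerts(alerts, family_of)
print([len(c) for c in clusters])  # [2, 1, 1]
```

The host1 and sys1 alerts land in one cluster because they share a family and arrive four minutes apart. The db1 alert belongs to the same family but arrives outside the window, so it opens a new cluster.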
In the alert inbox you can see all open alerts based on the scope you have selected. An alert cluster is visible to you as long as at least one of its contributing alerts is part of your scope selection.
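The visibility rule for clusters can be written as a one-line check. This is a minimal sketch in Python; the `system` field, the scope represented as a set of system IDs, and the function name are assumptions for illustration only.

```python
def cluster_visible(contributing_alerts, scope):
    """A cluster is shown if at least one contributing alert is in the selected scope."""
    return any(alert["system"] in scope for alert in contributing_alerts)


# Hypothetical cluster with alerts from two systems.
cluster = [{"id": 1, "system": "SYS_A"}, {"id": 2, "system": "SYS_B"}]

print(cluster_visible(cluster, {"SYS_B"}))  # True
print(cluster_visible(cluster, {"SYS_C"}))  # False
```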
The diagram below shows the newly introduced category “Alert Cluster”. You can double-click the graph to navigate to all alert clusters.
You get to see the clusters based on the scope selected using the visual filters.
You can click any of these clusters to see the contributing alerts. You also get more information, such as which alert came first and which alerts could be a side effect of another alert.
Once the evaluation is done, the processing should also happen here, using “Actions”. There is no need to provide categorization or classification information for each contributing alert separately, or to confirm each alert individually.
You can click each of the contributing alerts to view more details about it, such as its metrics and their information.
Correlation rules allow you to customize your alert clusters. You can find them under the “Configuration” button.
Here is an example of a customized rule that clusters alerts but keeps the alerts of the technical system BHC as individual alerts, since it is a top-priority system:
Correlate alerts on hosts within time window of 10 minutes where technical system <> BHC
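The effect of the `where` clause can be sketched in Python as follows. This is only an illustration of the filtering step, not the rule engine itself. The `CorrelationRule` class, its field names, and the alert dictionaries are hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Callable


@dataclass
class CorrelationRule:
    group_by: str                   # attribute to correlate on, e.g. "host"
    window: timedelta               # time window for grouping
    where: Callable[[dict], bool]   # filter that selects alerts for clustering


def split_by_rule(rule, alerts):
    """Return (eligible, individual): alerts the rule may cluster, and the rest."""
    eligible = [a for a in alerts if rule.where(a)]
    individual = [a for a in alerts if not rule.where(a)]
    return eligible, individual


# The rule from the text: hosts, 10 minutes, technical system <> BHC.
rule = CorrelationRule(
    group_by="host",
    window=timedelta(minutes=10),
    where=lambda alert: alert["technical_system"] != "BHC",
)

# Made-up alerts for illustration.
alerts = [
    {"id": 1, "host": "hostA", "technical_system": "ABC"},
    {"id": 2, "host": "hostA", "technical_system": "BHC"},
    {"id": 3, "host": "hostB", "technical_system": "XYZ"},
]

eligible, individual = split_by_rule(rule, alerts)
print([a["id"] for a in eligible])    # [1, 3]
print([a["id"] for a in individual])  # [2]
```

Alerts from BHC never enter a cluster, so each one still appears on its own in the inbox.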
Where is Alert Correlation useful?
If you have a large landscape to monitor and want to reduce alert processing time so that it can be spent on more meaningful tasks, this feature is for you.
Summarizing the benefits once more:
- Reduce mean-time-to-repair
- Lower the capacity dedicated to troubleshooting
- Decrease operational noise and alerts
Content prepared in association with Frank Wenzke