
SAP provides a powerful tool for monitoring an entire SAP landscape, including both SAP and non-SAP products. System Monitoring can monitor a vast array of applications, hosts and databases. This expansive tool set brings its own challenges: no monitoring tool can meet the exact requirements of every landscape without some tweaking and customization. The refinement process is therefore critical to both the implementation and the day-to-day management of any monitoring tool. I created the flow chart and document below to help streamline this process, providing the flow and decision making required to refine technical monitoring alerts.

 

 

This document outlines the steps required to refine System Monitoring alerts, and the questions to ask when Technical Monitoring is generating many alerts for managed systems and those alerts need to be analyzed to ensure the configuration meets a system's unique requirements. It assumes the reader understands the basic concept behind technical monitoring templates and is able to modify the metrics and alerts within specific monitoring templates as part of System Monitoring within Solution Manager. All monitoring templates are delivered with SAP-defined thresholds and must be modified to meet the requirements of a specific SAP landscape. This document is meant to be complemented by a flow diagram; please refer to the flow diagram as you work through the steps.

Process for Refining System Monitoring Alerts

 

Entire Process Outlined in Detail:

 

Step 1 – Is the alert a False Alarm? Questions to ask to determine if it is a false alarm:

 

1. Which Metrics are triggering the Alert?

a. Open the Alert in the alert inbox and analyze the metrics triggered

b. All alerts have multiple metrics collecting data. Which Metric or metrics are being triggered for that alert?

 

2. What is the metric collecting, and how does the metric collect the data?

a. Look at the metric in detail (template Maintenance) to determine what exactly the metric is monitoring.

All of this information is important for determining if the alert is false or not.

b. How is the data collected?

c. Does the Metric have variants?

i. If so, look in the System Monitoring Dashboard

d. What are the parameters used for the data collection?

e. What is the collection Interval?

f. What is the Threshold type and Value?

g. What is the Monitored Value setting? Average, Minimum, Maximum?

 

3. Is the data being collected correct?

a. Look in the Managed system to determine if the data collected is correct.

b. Use the details of the metric data collection and the historical data in Solution Manager to validate that the data in Solution Manager is correct when compared to the data in the managed system.

 

4. What does the historical data for that metric tell you?

a. Analyze the history of the metrics over time, looking for the following:

b. Is the same metric or metrics being triggered every time?

c. How many times has the alert been triggered in the past day?

d. Look at the historical data. Does the threshold get exceeded at certain times of the day? Is this a period when the system is known to have a heavy load?

e. Some metrics have variants.

i. If so, which Variants are being triggered?

ii. Is it the same variant being triggered every time?
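The questions in item 4 can be answered by tallying the historical values outside Solution Manager. The sketch below is only an illustration: it assumes the metric history has been exported as (timestamp, variant, value) tuples, which is a hypothetical format, and the function name, sample values and threshold are made up for the example.

```python
from collections import Counter
from datetime import datetime

def breaches_by_hour(samples, threshold):
    """Count threshold breaches per hour of day and per variant.

    samples: iterable of (timestamp, variant, value) tuples.
    This tuple layout is an assumed export format, not an SAP one.
    """
    by_hour = Counter()
    by_variant = Counter()
    for ts, variant, value in samples:
        if value > threshold:
            by_hour[ts.hour] += 1
            by_variant[variant] += 1
    return by_hour, by_variant

# Illustrative data: file system usage in percent for two variants.
samples = [
    (datetime(2024, 1, 1, 2, 0), "/usr/sap", 91.0),
    (datetime(2024, 1, 1, 2, 5), "/usr/sap", 93.0),
    (datetime(2024, 1, 1, 9, 0), "/tmp", 40.0),
    (datetime(2024, 1, 1, 14, 0), "/usr/sap", 95.0),
]
by_hour, by_variant = breaches_by_hour(samples, threshold=90.0)
```

Here `by_hour` shows whether breaches cluster at certain times of the day, and `by_variant` shows whether the same variant is responsible every time.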

 

 

Step 2 – Process to follow if the alert is a false alarm:

 

1. Analyze the thresholds, using the information you collected while looking at the historical data.

a. Do the thresholds appear to be too low or too high?

b. Analyze the historical data to look for averages

c. Analyze the historical data to determine the best thresholds for the metric.

 

2. Does the data being collected appear to be correct?

a. Knowing how the data is collected, look at the backend configuration. Is there an issue within Solution Manager or in the managed system configuration?

i. Managed system Configuration

ii. Data Collectors

iii. MAI Extractors

iv. Diagnostic Agent status

 

 

Step 3 – Process to follow if the thresholds are correct:

 

1. Analyze the collection method.

a. Is the data being collected correct? Compare the collected data to the system.

 

2. At times the collection interval can be too short, so data is collected too frequently. If this appears to be the case, increase the collection interval slowly.

 

3. Analyze the historical data. Does the data peak at certain times of the day?

a. If so, at what times of the day does this occur?

b. Create a maintenance mode for that time of day for that specific system.

c. Assign a higher threshold to that maintenance mode.

d. This allows a higher threshold during that time, eliminating the false alarm.

 

4. Analyze the historical data. Does the data spike in short, sporadic intervals?

a. Change the threshold type to Counter Threshold to eliminate alerting on sudden spikes.
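To see why a counter-style threshold suppresses alerts on short spikes (item 4), the following sketch compares it with a plain single-value check. It assumes that a counter threshold means "alert only when at least N of the last M measurements exceeded the limit"; the exact semantics in Solution Manager may differ, and the function and parameter names are hypothetical.

```python
from collections import deque

def counter_alert(values, limit, min_breaches, window):
    """Return one boolean per measurement: True when at least
    `min_breaches` of the last `window` values exceeded `limit`.
    Assumed semantics, not taken from SAP documentation."""
    recent = deque(maxlen=window)  # sliding window of breach flags
    results = []
    for v in values:
        recent.append(v > limit)
        results.append(sum(recent) >= min_breaches)
    return results

# Illustrative series: one isolated spike, then a sustained rise.
values = [50, 95, 55, 60, 92, 94, 96, 58]
single = [v > 90 for v in values]  # plain threshold check
counted = counter_alert(values, limit=90, min_breaches=3, window=5)
```

With these sample values the isolated spike at the second measurement trips the plain check but not the counter check, while the sustained rise later in the series trips both.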

 

 

Step 4 – Process to follow if the Thresholds are incorrect:

 

1. Slowly increase or decrease the thresholds, depending on the metric.

a. Use the information you found while analyzing historical data to determine the best threshold.

b. Adjust in small increments; adjusting thresholds too far can lead to an ineffective alert.

c. If the thresholds need to be adjusted to extremes and this is unique to a single system, create a copy of the template for assignment to the single managed object. This ensures the metric is still effective for the other managed objects.
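One way to combine "use the historical data" with "adjust in small increments" is to pick a target from the history and move the threshold only part of the way toward it on each change. The sketch below is a suggestion rather than an SAP feature: the 95th percentile target, the 5% step limit and the function name are all assumptions chosen for illustration.

```python
import statistics

def suggest_threshold(history, current, max_step=0.05, quantile=95):
    """Move `current` toward the chosen percentile of `history`,
    by at most `max_step` (a fraction of `current`) per adjustment.
    Percentile and step size are illustrative assumptions."""
    cuts = statistics.quantiles(history, n=100, method="inclusive")
    target = cuts[quantile - 1]          # e.g. the 95th percentile
    limit = abs(current) * max_step      # largest allowed change
    delta = max(-limit, min(limit, target - current))
    return current + delta

# Illustrative history: values from 60 to 99.
history = list(range(60, 100))
raised = suggest_threshold(history, current=80)
lowered = suggest_threshold(history, current=120)
```

Repeating the adjustment after each observation period moves the threshold gradually, so a single change never swings it to an extreme.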

 

 

Step 5 – Process if Data is collected properly and thresholds are correct:

 

1. If everything is correct in Solution Manager, then you have three options:

a. Resolve the issue in the managed system.

b. If the alert has more than one metric, change the event type to Average Case or Best Case. This ensures that more than one metric must exceed its threshold to trigger an alert.

c. Deactivate the metric or alert for that specific managed object. Create a copy of the template for assignment to the single managed object. This ensures the metric is still effective for the other systems.

 

2. If the thresholds are correct, but only at certain times of the day:

a. Create a work mode for that managed object so that you can change the threshold at different times of the day.
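The effect of a time-dependent threshold (item 2) can be illustrated with a small lookup. This is a conceptual sketch only: the schedule, the threshold values and the function names are hypothetical and do not reflect how work modes are configured in Solution Manager.

```python
from datetime import time

# Hypothetical schedule of (start, end, threshold) entries.
WINDOWS = [
    (time(1, 0), time(3, 0), 95.0),  # assumed nightly batch window
]
DEFAULT_THRESHOLD = 85.0  # assumed threshold outside any window

def threshold_at(t, windows=WINDOWS, default=DEFAULT_THRESHOLD):
    """Return the threshold that applies at time of day `t`."""
    for start, end, value in windows:
        if start <= t < end:
            return value
    return default

def is_breach(t, value):
    """True when `value` exceeds the threshold in effect at `t`."""
    return value > threshold_at(t)
```

With this schedule, a value of 90 at 02:30 is not a breach because the higher window threshold applies, while the same value at 10:00 is.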

 

 

Step 6 – Process to follow if the data is collected properly, the thresholds are correct and the issue does not need to be resolved

 

a. If the alert has more than one metric, change the event type to Average Case or Best Case. This ensures that more than one metric must exceed its threshold to trigger an alert.

b. Deactivate the metric or alert for that specific managed object. Create a copy of the template for assignment to the single managed object. This ensures the metric is still effective for the other systems.
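The difference between the event types can be sketched with a simplified model in which each metric is either over its threshold or not. The rule definitions here are assumptions for illustration: "worst" triggers on any breached metric, "best" only when all are breached, and "average" when more than half are breached. Solution Manager's actual rating calculation may differ.

```python
def alert_triggered(breached, rule):
    """Decide whether an alert fires from per-metric breach flags.

    breached: list of booleans, one per metric in the alert.
    rule: "worst", "best" or "average" (simplified, assumed semantics).
    """
    if rule == "worst":
        return any(breached)
    if rule == "best":
        return all(breached)
    if rule == "average":
        return sum(breached) > len(breached) / 2
    raise ValueError(f"unknown rule: {rule}")

# One of three metrics over its threshold.
one_of_three = [True, False, False]
# Two of three metrics over their thresholds.
two_of_three = [True, True, False]
```

Under this model, a single breached metric fires the alert only with the worst-case rule, which is why switching to the average or best case reduces alerts driven by one noisy metric.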

 

 

Flow Chart for the Entire Process:

 

Refinement process.png


5 Comments


  1. Eric Poellinger

    Hi Jereme –

    Ignorance question – curious how MAI replaces or allows for monitoring of the various business-oriented items our functional team is seeing in what looks to be an older alerting framework via tcode ALRTCATDEF?

    Is this moved to another solution by SAP?

    Thanks for your input on this!

    1. Jereme Swoboda Post author

      Hi Eric,

      Business process monitoring runs on MAI these days and provides the functional alerts you are looking for. Later releases of SolMan 7.1 allowed for the conversion from CCMS to MAI. With SolMan 7.2, BPMon runs only on MAI.

       

      Hope that helps!

  2. Aviv Cohen

    Hello Jereme

    First, Thanks for the great article.

    I have a question regarding collection intervals and raising alerts:

    I have a daily metric. How can I control the time at which this metric is executed? My problem occurs in the morning, but I receive the SMS at night.

    Thanks

    Aviv Cohen
