
Machine Learning Services for Hybrid Operations: motivation and concept (Part 1)

Dear reader, welcome to the series of articles “Machine Learning Services for Hybrid Operations”. In this series we focus on the key challenges of well-known threshold-based alerting systems (e.g. the majority of ALM products in the SAP portfolio, including SAP Solution Manager) and propose a Machine Learning based approach to extend existing monitoring and alerting processes. Stay tuned!


Motivation

According to ITIL® Service Operation 2011 Edition: “Effective service operation is dependent on knowing the status of the infrastructure and detecting any deviation from normal or expected operation. This is provided by good monitoring and control systems”.

In mature environments, automatic monitoring and alerting systems proactively analyze the status of IT solutions 24×7 without manual effort and trigger alerts before a problem becomes critical for the business. IT Operators must act on automatically detected failures and pre-failure warnings before a Business User creates an incident.

The portfolio of automatic monitoring and alerting systems includes tools that safeguard productive landscapes at various levels: hardware stack monitoring, exception management, interface consistency, application runtime analysis, and visualization of business-related KPIs.

Monitoring the resource utilization of IT systems is one of the important pillars of proactive automatic monitoring, as it can prevent the disruption of critical business processes.

For resource utilization monitoring, most automatic alerting systems use threshold breaches as the main trigger for creating alerts.

Key challenges when working with threshold-based alerting systems:

  1. Threshold-based alerting systems cannot adapt to recurring variations in metric behavior, which results in a high false alert rate.

Manually configured monitoring work modes can be a workaround on a scale of several hours to days, but for short recurrence periods, large numbers of metrics, and large numbers of monitored objects, the fine-tuning effort is extremely high.

As a result, monitoring experts and/or IT Operators receive huge numbers of alerts and spend extra time and effort manually separating valid alerts from false ones.

  2. At the same time, threshold-based alerting tools by their nature do not cover all unanticipated situations. For example, how can a time-based anomaly (an unexpected oscillation of the metric value) be detected if no threshold was breached?

As a result, monitoring experts and/or IT Operators spend extra time and effort on supplementary discovery of performance issues, while the missed anomalies that are not related to a threshold violation (extra anomalies) can indicate potentially dangerous symptoms.
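
Both challenges can be illustrated with a minimal sketch. The data, the threshold value, and the function name `threshold_alerts` are hypothetical and chosen only for illustration; they are not taken from any SAP product.

```python
def threshold_alerts(values, threshold):
    """Return the indices of samples that breach a static upper threshold."""
    return [i for i, v in enumerate(values) if v > threshold]


THRESHOLD = 90  # hypothetical static threshold, in percent

# Hypothetical hourly CPU utilization (%) for one day.
# A nightly batch job pushes the metric to 95 at 02:00 and 03:00 every day,
# which is expected, recurring behavior.
normal_day = [20] * 24
normal_day[2] = normal_day[3] = 95

# The same nightly peak, plus an unexpected oscillation between 20 and 80
# from 10:00 to 15:00 that never crosses the threshold.
oscillating_day = list(normal_day)
oscillating_day[10:16] = [20, 80, 20, 80, 20, 80]

print(threshold_alerts(normal_day, THRESHOLD))       # [2, 3]
print(threshold_alerts(oscillating_day, THRESHOLD))  # [2, 3]
```

In this sketch the static threshold raises the same two alerts on both days: on the normal day they are false alerts caused by a recurring pattern, and on the second day the daytime oscillation goes unnoticed because no sample exceeds the threshold.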

[Figure: typical threshold-based alerting system (simplified)]

Concept

Proposed concept for addressing the alerting challenges described above:

  • Add a “black box” events post-processor powered by Machine Learning algorithms as an extra element between the threshold-based alerting system and the IT Operator
  • Machine Learning algorithms inside the “black box” events post-processor perform anomaly detection: they decide whether a changed metric value is a normal recurring variation OR an abnormal and unexpected situation
  • Redundant alerts related to normal recurring variations are filtered out (for example, confirmed/closed automatically)
  • Machine Learning algorithms inside the “black box” events post-processor detect deviations that are not related to a threshold violation and are therefore skipped by threshold-based alerting tools

[Figure: threshold-based alerting system with the “black box” events post-processor (simplified)]

Concept declarations:

  • The metric anomality rating is a dynamically calculated quantitative indicator that shows whether a changed metric value is a normal recurring variation OR an abnormal and unexpected situation (0 <= metric anomality rating value <= 100)
  • The metric anomality rating is calculated as the difference between the actually detected metric value and the metric value predicted by a previously trained Machine Learning algorithm
  • Anomaly detection is an evolution of the static threshold approach: instead of a static value, it uses a dynamic value that adapts on a daily and weekly basis. Think of anomaly detection as a static threshold that is recalculated every 5 minutes based on the local context. For example, a metric varies between 20 and 100 during the day, while at night it varies between 10 and 30. To find an unusual situation in the metric behavior, we need to take into account that a metric value of 70 has a different meaning depending on whether it is day or night, so the threshold is different as well
  • Because of the large number of possible abnormal behavior patterns and metric types, unsupervised learning is to be used
  • Use the worst-case rule: when in doubt, it is better to report an anomaly than to report none
  • Alerts for which all metric values are normal recurring variations can be automatically closed (confirmed) OR marked in a special way
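
As a minimal sketch of these declarations, the rating can be derived from the difference between the actual value and a predicted value. Everything here is an assumption made for illustration: a per-time-slot mean and standard deviation stand in for the trained Machine Learning model, and the names `train_baseline` and `anomality_rating` as well as the parameters `max_sigmas` and `min_spread` are hypothetical. The actual algorithms are the subject of later chapters.

```python
from statistics import mean, pstdev


def train_baseline(history):
    """Learn a per-slot baseline from (slot, value) pairs.

    A slot is any recurring time bucket, e.g. "day"/"night" or a 5-minute
    slot of the week. Returns {slot: (predicted_value, spread)}.
    This simple statistic is only a stand-in for a trained ML model.
    """
    by_slot = {}
    for slot, value in history:
        by_slot.setdefault(slot, []).append(value)
    return {slot: (mean(vals), pstdev(vals)) for slot, vals in by_slot.items()}


def anomality_rating(baseline, slot, actual, max_sigmas=4.0, min_spread=1.0):
    """Map |actual - predicted| to a rating between 0 and 100.

    Assumption: a deviation of `max_sigmas` spreads or more is rated 100.
    `min_spread` avoids division by zero for perfectly constant metrics.
    """
    predicted, spread = baseline[slot]
    deviation = abs(actual - predicted) / max(spread, min_spread)
    return min(100.0, 100.0 * deviation / max_sigmas)


# Hypothetical history matching the example above:
# 20..100 during the day, 10..30 at night.
history = [("day", v) for v in (20, 60, 100, 60)]
history += [("night", v) for v in (10, 20, 30, 20)]
baseline = train_baseline(history)

print(anomality_rating(baseline, "day", 70))    # low rating: normal for daytime
print(anomality_rating(baseline, "night", 70))  # capped at 100: abnormal at night
```

The same metric value of 70 receives a low rating during the day and the maximum rating at night, which is the behavior of a threshold that is recalculated from the local context.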

Concept algorithm (simplified):

  1. Send all metrics and alerts from the threshold-based alerting system to the “black box” events post-processor every 5 minutes
  2. In the “black box” events post-processor, pick up all alerts together with the metrics that triggered each alert
     • Calculate the anomality rating for each metric in the group
     • Use the decision map to select the further action for the alert depending on the anomality rating (close automatically or send to an operator for processing)
  3. In the “black box” events post-processor, pick up all metrics for which no alerts were triggered
     • Calculate the anomality rating for each metric in the group
     • Use the decision map to select the further action depending on the anomality rating (publish an event that is not related to a threshold violation and send it to an operator)
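
The steps above can be sketched as one processing cycle. This is an illustration under stated assumptions, not the actual implementation: the cutoff values `CLOSE_BELOW` and `EVENT_FROM`, the action names, and the function `process_cycle` are hypothetical, and the anomality ratings are assumed to be already calculated for every metric received in the cycle.

```python
CLOSE_BELOW = 30  # hypothetical cutoff: alerts rated below it are closed
EVENT_FROM = 70   # hypothetical cutoff: unalerted metrics rated at or above it raise an event


def process_cycle(alerts, ratings):
    """Run one post-processing cycle (e.g. every 5 minutes).

    alerts  -- list of {"id": str, "metrics": [metric names]}
    ratings -- {metric name: anomality rating, 0..100}
    Returns a list of (subject, action) tuples.
    """
    actions = []
    alerted_metrics = set()

    # Step 2: alerts together with the metrics that triggered them.
    for alert in alerts:
        alerted_metrics.update(alert["metrics"])
        # Worst-case rule: the alert is as anomalous as its worst metric;
        # an alert without known metrics is treated as anomalous.
        worst = max((ratings[m] for m in alert["metrics"]), default=100)
        if worst < CLOSE_BELOW:
            actions.append((alert["id"], "close_automatically"))
        else:
            actions.append((alert["id"], "send_to_operator"))

    # Step 3: metrics for which no alert was triggered.
    for metric, rating in ratings.items():
        if metric not in alerted_metrics and rating >= EVENT_FROM:
            actions.append((metric, "publish_event"))

    return actions


ratings = {"cpu": 10, "memory": 15, "disk_io": 85, "network": 5}
alerts = [{"id": "A1", "metrics": ["cpu", "memory"]}]
print(process_cycle(alerts, ratings))
# [('A1', 'close_automatically'), ('disk_io', 'publish_event')]
```

In this example the alert A1 is closed automatically because all of its metrics show a normal recurring variation, while the disk_io metric raises an event although no threshold was breached.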

[Figure: decision map (simplified)]

The table below lists goals and candidate KPIs for measuring the success of implementing the “black box” events post-processor for a threshold-based alerting system:

Goal: Reduce the manual effort of monitoring experts and/or IT Operators
Candidate KPIs:
  • Percentage of automatically suppressed alerts for performance-related metrics (or metrics with more than two states)
  • Percentage of approved alerts for performance-related metrics (or metrics with more than two states)
  • Reduction of the daily effort of a monitoring expert

Goal: Improve trust in the monitoring process
Candidate KPIs:
  • Number of abnormal situations caused by a threshold violation and detected by the Machine Learning solution, compared with the approved threshold-based alerts for the same situations
  • Number of abnormal situations caused by an unrecognized metric behavior pattern and detected exclusively by the Machine Learning extension
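
The two percentage KPIs for the first goal could be computed along the following lines. This is a sketch only: the outcome labels "suppressed" and "approved" and the function name `alert_kpis` are hypothetical, and how an alert is counted as approved depends on the operations process in use.

```python
def alert_kpis(outcomes):
    """Compute percentage KPIs from a list of per-alert outcome labels.

    outcomes -- list of strings; "suppressed" marks an alert closed
    automatically by the post-processor, "approved" marks an alert
    forwarded to and accepted by an operator (hypothetical labels).
    """
    total = len(outcomes)
    if total == 0:
        return {"suppressed_pct": 0.0, "approved_pct": 0.0}
    suppressed = sum(1 for o in outcomes if o == "suppressed")
    approved = sum(1 for o in outcomes if o == "approved")
    return {
        "suppressed_pct": 100.0 * suppressed / total,
        "approved_pct": 100.0 * approved / total,
    }


print(alert_kpis(["suppressed", "suppressed", "suppressed", "approved"]))
# {'suppressed_pct': 75.0, 'approved_pct': 25.0}
```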

Summary for part 1:

Threshold-based alerting systems are commonly used for monitoring the resource utilization of IT systems.

Key challenges when working with threshold-based alerting systems:

  • Lack of adaptation to recurring variations in metric behavior, which results in a high false alert rate
  • Potentially dangerous metric behavior patterns cannot be detected if no threshold is breached

Machine Learning capabilities can address the challenges of threshold-based alerting systems:

  • Filter out redundant alerts related to recurring variations in metric behavior
  • Detect deviations that are not related to a threshold violation

Benefits of using Machine Learning capabilities together with threshold-based alerting systems:

  • Reduced time and effort for human operators, as they receive only reliable alerts
  • Improved trust in the monitoring process, as previously uncovered deviations can be detected as well

In the next chapters we will demonstrate how the Machine Learning “black box” events post-processor concept can be implemented for SAP Solution Manager, which is a very popular monitoring and alerting platform from SAP. Several chapters will be dedicated entirely to the Machine Learning “black box” algorithms and principles.

About the author

Andrew Kusnetsov (https://people.sap.com/andrew.kusnetsov) is a senior SAP Solution Manager Engineer at SAP Labs CIS CoE St. Petersburg. He has been working in the ALM/IT Operations/DevOps team since 2013 and delivers Hybrid Operations related projects in the EMEA region.

A debt of gratitude to Artem Sharganov (https://people.sap.com/artem.sharganov) for his key contribution to the Smart Monitoring initiative and for his help with this article.
