Focused Run – Advanced Event & Alert Management (A...

former_member591012 · ‎01-01-2019

Update: Alert Correlation in FRUN by resmi.ks

A newly imagined Alert Management within FRUN: open to integration on both the Inbound and the Outbound side, with insightful and responsive user interfaces. Part 3 covers visual insight into the alert rating trend, metrics, guided procedures, and various actions.

FRUN AEM Part 1 | Part 2 | Part 3 | Part 4 | Part 5

In Part 1 of this series, we talked about the purpose of AEM within FRUN. We briefly touched upon the basics of the Scope Selector and the Unified Shell, which are common to all FRUN applications. More relevant to this series, we then looked at the Overview Page of AEM and discussed its objective and its powerful Visual Filters.

In Part 2 of this series, we moved on to explore the Open Alert List and the Alert Search page. Both pages gave us two or three different ways to arrive at a list of alerts, and helped us carry out some of the most common actions on multiple alerts in one go.

Let us now go through this Part 3.

Alert Detail

An alert carries many details. Some of these are common to all alerts; others depend on how much information the alert sender gathered and decided to send together with the alert.

Alert detail header

The header section includes basic information such as the Alert Name itself, the Managed Object where it occurred, when it was first created, and when it was last updated (by the monitoring infrastructure that sent the alert). It also shows the other important attributes such as Priority, Status (Open, In Process, Confirmed), Severity (0 being least severe and 9 most severe, decided when the alert is defined), Processor (if any, at this point in time), Worst Rating, Current Rating, Customer Network and Customer Name, etc.
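To make the header attributes a little more tangible, here is a minimal sketch of how they could be modelled. This is not the FRUN data model: the class and field names are hypothetical and simply mirror the attributes listed above, and the two sanity checks are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Optional


class Status(Enum):
    OPEN = "Open"
    IN_PROCESS = "In Process"
    CONFIRMED = "Confirmed"


@dataclass
class AlertHeader:
    # Hypothetical field names, chosen only to mirror the header attributes.
    name: str
    managed_object: str
    created_at: datetime
    last_updated_at: datetime
    priority: str
    status: Status
    severity: int                      # 0 = least severe, 9 = most severe
    worst_rating: str
    current_rating: str
    processor: Optional[str] = None    # there may be no processor yet

    def __post_init__(self) -> None:
        # Assumed sanity checks, not documented FRUN behavior.
        if not 0 <= self.severity <= 9:
            raise ValueError("severity must be between 0 and 9")
        if self.last_updated_at < self.created_at:
            raise ValueError("last update cannot precede creation")
```

Keeping Status as an enumeration reflects that only three values are named in the header, whereas Priority and the ratings are kept as plain strings here because their full value lists are not spelled out at this point.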

Below the header, the same alert detail page has a few sections, such as the (Alert) Rating, (contributing) Metrics, (invoked) Guided Procedures, if any, Documentation and so on. Let us look at these one by one.

Alert rating trend and history

One of the important attributes of an alert is its rating: Critical (Red), Warning (Yellow) or Okay (Green). When none of these is known, the rating is rendered in Grey.

Some of the use cases that send alerts to AEM are capable of continuous monitoring. That is, the monitoring infrastructure not only creates an alert when necessary, but also keeps updating the underlying data pertaining to the alert, including its rating.

Successive occurrences of an alert may or may not bring a change in its rating. When the rating changes, it usually signifies an important enough change in the situation that the alert is about. An alert that was created with rating Warning (Yellow) may remain in that state for some time and may change to Critical (Red) later.

The Rating Bar on the Alert detail page captures these rating changes over time. Each distinctly colored section of the bar depicts a “time-window” during which the alert kept the same rating. The next section on the bar indicates another time window, starting when the rating changed.

Thus, the rating bar helps in having a sense of the trend of the rating changes that the alert has experienced.
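The idea behind the sections of the rating bar can be sketched in a few lines. The snippet below is only an illustration, not how FRUN computes the bar: the function name rating_windows and the (timestamp, rating) input shape are assumptions. It collapses a time-ordered series of ratings into windows in which the rating stayed the same.

```python
from datetime import datetime
from itertools import groupby
from typing import List, Tuple

Sample = Tuple[datetime, str]            # (timestamp, rating), assumed shape
Window = Tuple[datetime, datetime, str]  # (start, end, rating)


def rating_windows(samples: List[Sample], end: datetime) -> List[Window]:
    """Collapse time-ordered rating samples into same-rating time windows.

    `end` closes the last window, e.g. the "Last updated" time of the alert.
    """
    # Keep only the first sample of each run of identical ratings.
    starts = [next(run) for _, run in groupby(samples, key=lambda s: s[1])]
    windows = []
    for i, (start, rating) in enumerate(starts):
        # A window ends where the next one begins, or at `end` for the last.
        stop = starts[i + 1][0] if i + 1 < len(starts) else end
        windows.append((start, stop, rating))
    return windows
```

Each returned tuple corresponds to one colored section of the bar, so zooming in or out only changes how wide the same windows are drawn.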

Indeed, the rating of some alerts tends to “flicker” a lot more than that of others. One alert may change its rating as many as 3 or 4 times in an hour, whereas another may change only after many hours or even days, or not at all.
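One rough way to put a number on this “flicker” is to count rating changes per hour. The following is a sketch under assumptions: changes_per_hour is a hypothetical helper, not an AEM function, and it expects a time-ordered list of (timestamp, rating) pairs.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def changes_per_hour(samples: List[Tuple[datetime, str]]) -> float:
    """Average number of rating changes per hour over the sampled period."""
    if len(samples) < 2:
        return 0.0
    # A change is any pair of neighbouring samples with different ratings.
    changes = sum(1 for (_, a), (_, b) in zip(samples, samples[1:]) if a != b)
    hours = (samples[-1][0] - samples[0][0]) / timedelta(hours=1)
    return changes / hours if hours > 0 else 0.0
```

An alert that toggles between Yellow and Red every 15 minutes would score 4.0 here, while one that stays at the same rating scores 0.0.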

In the following example screenshots from the same alert, notice the zoom level slider and the resulting level of detail on the rating bar.

What more could one ask for than this visual insight into the rating trend of an alert?

Each section of the rating bar continues to carry the “visual filter” philosophy of AEM user interfaces.

This means that if we click on any of the sections, we see details related to the time window represented by that section of the rating bar.

Contributing Metrics

Each alert occurs when one or more Metrics “go wrong”, that is, a metric is measured with a value that breached its threshold, or was directly rated by the data collector as Critical (Red) or Warning (Yellow). The Alert detail page contains a section showing all such Metrics that contributed to the Alert.

Metric ratings and values aggregate

The user is looking at a time-window of the alert, specified by a “Start” and an “End” Date & Time. Usually, the start is when the alert was created, and the end is when it was “Last updated”. If the user has clicked on a section of the rating bar, she / he would be looking at a different time-window, namely the Start and End Date & Time of that section of the rating bar, during which the alert was rated Red or Yellow.

Within this time window, there could be multiple measurements for each of the contributing metrics. The page therefore shows aggregated information on the metric ratings and values that prevailed during this time-window. These are, typically, the First, Worst and Last ratings, and the Minimum, Maximum and Last measured values, if any, of each metric.
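The aggregation described above can be sketched as follows for a single metric. This is an illustration only: the Measurement class, the aggregate function, the inclusive window boundaries and the severity order used to decide the “worst” rating (Grey below Green below Yellow below Red) are all assumptions, not documented AEM behavior.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Optional

# Assumed ordering used to decide which rating counts as "worst".
SEVERITY_ORDER = {"GREY": 0, "GREEN": 1, "YELLOW": 2, "RED": 3}


@dataclass
class Measurement:
    timestamp: datetime
    rating: str
    value: Optional[float] = None  # some metrics carry a rating but no value


def aggregate(measurements: List[Measurement],
              start: datetime, end: datetime) -> Optional[Dict[str, object]]:
    """First/Worst/Last rating and Min/Max/Last value within [start, end]."""
    window = sorted(
        (m for m in measurements if start <= m.timestamp <= end),
        key=lambda m: m.timestamp,
    )
    if not window:
        return None
    values = [m.value for m in window if m.value is not None]
    worst = max(window, key=lambda m: SEVERITY_ORDER[m.rating])
    return {
        "first_rating": window[0].rating,
        "worst_rating": worst.rating,
        "last_rating": window[-1].rating,
        "min_value": min(values) if values else None,
        "max_value": max(values) if values else None,
        "last_value": values[-1] if values else None,
    }
```

Passing the Start and End of a clicked rating bar section instead of the alert's creation and last-update times yields the aggregate for just that section.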

A few example screenshots should help here.

Metric detail may look different!

As mentioned earlier in this series, AEM is home to alerts from many use cases of FRUN. Not all the infrastructures handle metrics in exactly the same way. For some of the alert senders, this aggregated format of the metric detail may not bring out the important aspects well. These senders may decide to present the metric detail differently from what we have seen so far. A few such examples follow:

Notice in the above examples the more familiar tabular format, whose columns show metric attributes that differ from the “First” / “Last” / … “Max” / “Min” kind of aggregate.

Some of these may even present an aggregated view on top of the Metric detail table, such as the one seen in the example below:

This view thus gives a good snapshot of which metrics are not performing as expected or need attention, and which are doing okay even though other closely related metrics have led to this alert.

Metric attributes, static and measured

A simple click on the metric name brings up much useful information about that metric, as seen in some examples below:

Metric monitoring, trend of values

Clicking on the metric name thus shows “what” the metric is about, including its thresholds and other attributes specified at design-time. One may also be interested in the general trend of the measured values of the metric over the last few hours or days, or even the last month. This helps in judging whether a metric shows “steady and stable” behavior or a “sudden, anomalous” pattern.

Clicking on the icon right next to the metric name brings up a chart that depicts the metric values measured in the last hours, days, weeks or few months.

A few screenshots, again, just as examples:

Discovered the “Forecast” feature as well? Such advanced features would need a detailed write-up of their own; let us keep them for our own exploration!

Guided Procedures for Alert Resolution

Resolving an alert often takes a few steps, though a few alerts get auto-resolved some time after they occurred. For many alerts, there are Guided Procedures (GP) that are useful for alert resolution. If one or more such GPs were executed as part of alert resolution, manually or automatically, they are listed in the “Guided Procedure” section of the Alert detail page.

Following are some example screenshots.

Many of those Guided Procedures are very rich in content.

Refer to a detailed write up in the FRUN expert portal here

We may explore a few of these, such as those shown in the above examples. Some of the automatically executed GPs create an execution report which, if available, may be seen under the column “Result report”. There we would see all the steps the GP has carried out, their status, some other “related” alerts, and many detailed steps specific to the alert under investigation.

Alert documentation

The Alert documentation section contains a brief description of what the alert is about, and some other useful information, often including links to other tools, such as those seen in the example screenshots below.

Alert Actions

After exploring ways of viewing the alerts, we would wonder what may be done next. In Part 2 of this series, we observed a few of the actions that could be taken on single or multiple alerts from the Alert list itself, such as Confirm, Assign Processors, Classify and Categorize, Postpone, etc. All of these are also available when a single alert is being viewed in detail.

And there are a few more actions possible when a single alert is chosen.

The “Actions” button at the top-right corner contains one menu entry for each of the actions mentioned above, and more, such as “Add Comment”, “Send Notification”, “Change Configuration”, “Display MO (Managed Object) Details”, “Trigger Alert Reaction”, “Search Guided Procedure”.

“Confirm” removes the alert from the Open Alerts list, though the alert can still be found via the “Alert Search” functionality described before. Various Alert reports, to be discussed soon, could also include this alert, depending on what the end user is looking for.

Adding a short comment always helps the alert processors, e.g. to keep track of routine or even exceptional steps taken while processing the alert.

Processors may be newly assigned, changed, or simply removed from the alert in question.

We have earlier discussed the effect of Postponing an alert.

Sending a Notification, in the form of an e-mail and / or SMS to one or more recipients, could be an essential part of alert processing. While using this action, previously defined “Notification templates” may also be used. Recipients and Notification templates may be predefined within what is known as Notification Variants.

“Change configuration” is sometimes possible for alerts that come from use cases based on MAI, described in the Prelude. This allows switching off the Alert, which means the Alert would no longer be reported even if the contributing metrics continue to be collected from the respective MO.

Likewise, “Display MO details” is sometimes possible for alerts coming from MO (Managed Objects) that have more details defined in LMDB, typically used in alerts from MAI-based use cases.

Triggering Alert Reaction: As briefly mentioned in the Prelude, if one or more Alert Reactions, that is, implementation(s) of the Outbound Integration interface of AEM are in place, these would show up as possible “Reactions” that may be triggered from the chosen alert.

New Guided Procedures are shipped by SAP from time to time, and the list keeps growing. In case the user is wondering which guided procedure(s) may help in resolving the Alert in question, “Search Guided Procedure” may be used directly from the Alert Actions.

Classification and / or Categorization of alerts, based on their business impact and on the nature of the root problem respectively, are possible from the respective action menus.

Alert Logs

We have discussed viewing alert logs from the Alert list. There is another, slightly more flexible way. It can be invoked from a small icon to the right of “Alert Action”.

By default, this opens a tray showing all actions that took place for this alert. It is possible to filter out a few types of actions from the tray and focus on the rest.
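The filtering in the log tray amounts to hiding selected action types. A minimal sketch, assuming a hypothetical LogEntry record and action type names borrowed from the action menu (the real log structure is not described in this post):

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List, Set


@dataclass
class LogEntry:
    # Hypothetical shape of one line in the alert log tray.
    timestamp: datetime
    action: str   # e.g. "Add Comment", "Send Notification", "Confirm"
    user: str


def visible_entries(log: Iterable[LogEntry],
                    hidden_actions: Set[str]) -> List[LogEntry]:
    """Return log entries whose action type has not been filtered out."""
    return [entry for entry in log if entry.action not in hidden_actions]
```

With an empty set of hidden actions, the function returns the complete log, which matches the default view of the tray.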

We have now explored most of the details that are found for a single alert.

In Part 4 of this series, we go on to explore Alert Reporting, another unique strength of AEM.
