
Monitoring SAP and HANA Instances with Prometheus and Grafana

Monitoring specific problems with SAP standard tools is not always fun and often impossible.

Prometheus is an open-source monitoring solution, and Grafana is a tool for creating dashboards that visualize the collected data. The Cloud Native Computing Foundation accepted Prometheus as its second incubated project, after Kubernetes, and it is in use at many well-known companies.

In combination with a few Prometheus exporters, it is possible to monitor a wide range of SAP problems, and to alert on them, with a uniform concept.

Installation

All involved programs can be run as binaries, as Docker containers or as part of a Kubernetes cluster. The installation is beyond the scope of this blog post, but there are many interesting articles and even some books covering this topic.

Standard monitoring

For server monitoring, the Prometheus node_exporter (Linux) and wmi_exporter (Windows) are available. The blackbox exporter, on the other hand, probes endpoints from the outside over HTTP and HTTPS, which makes it possible, for example, to get alerted when an SSL certificate is about to expire.
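As a rough illustration of the certificate case, an alerting rule along the following lines could be used. It is only a sketch: it assumes that the blackbox exporter is already scraped by Prometheus with an HTTPS probe, and the rule name, the 14-day threshold and the severity are made-up values. The metric probe_ssl_earliest_cert_expiry is the expiry timestamp exposed by the blackbox exporter.

```yaml
groups:
  - name: blackbox
    rules:
      # Hypothetical rule name and threshold, adjust to your own needs.
      - alert: ssl_certificate_expires_soon
        # Seconds until the earliest certificate in the chain expires.
        expr: probe_ssl_earliest_cert_expiry - time() < 14 * 24 * 3600
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "SSL certificate of {{ $labels.instance }} expires in less than 14 days."
```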

Besides these examples, many other exporters are available that can be integrated into the monitoring landscape. For alerting, Prometheus provides the Alertmanager, which offers a lot of configuration options.
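To give an impression of these options, a minimal Alertmanager configuration might look roughly like the following sketch. All host names, addresses and receiver names are placeholders, and the matchers syntax assumes a reasonably recent Alertmanager version (older versions use match instead).

```yaml
global:
  # Placeholder mail settings.
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'

route:
  # Default receiver for everything that is not matched below.
  receiver: sap-basis
  group_by: ['alertname', 'instance']
  routes:
    # Critical alerts are additionally routed to a (hypothetical) webhook.
    - matchers:
        - severity="critical"
      receiver: sap-oncall

receivers:
  - name: sap-basis
    email_configs:
      - to: 'sap-basis@example.com'
  - name: sap-oncall
    webhook_configs:
      - url: 'http://alert-gateway.example.com/hook'
```

The severity label used for routing is the same label that is set in the alert rules.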

SAP specific monitoring

For SAP-specific monitoring, the hana_sql_exporter and the sapnwrfc_exporter come into play. Their installation and usage are described in the READMEs of the corresponding GitHub repositories.

As the name suggests, the hana_sql_exporter retrieves its data with SQL SELECT statements. By definition, the first column of the result must contain the value of the metric. The following columns are used as labels and must be string values. In this way, every table can be used to create the metrics needed for the problem at hand.

The sapnwrfc_exporter, on the other hand, complements it for problems that cannot be solved with the hana_sql_exporter alone. Examples are the current number of lock table entries or the current number of dialog processes.

Both exporters can be run as a binary, as a Docker container or as a pod in a Kubernetes cluster. They read the relevant system and metric information from a TOML configuration file, as described in the GitHub repositories. It is possible to run as many exporter instances as needed. For example, they can be split by metric category or by system usage.

In the Prometheus configuration file, each exporter instance can be added as a target in a separate job section:

- job_name: hana-short
  scrape_interval: 60s
  static_configs:
    - targets: ['172.45.111.105:9658']
      labels: {'instance': 'hana_exporter_tst'}
    - targets: ['hana-exporter-dev.sap.svc.cluster.local:9658']
      labels: {'instance': 'hana_exporter_dev'}
      ...
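
If the exporter instances are split by metric category, a second job with a longer scrape interval could collect the expensive or slowly changing metrics. The following is only a sketch: the job name hana-long, the port 9659, the instance label and the intervals are assumptions and do not come from the exporter documentation.

```yaml
scrape_configs:
  # Hypothetical second job for expensive or slowly changing metrics.
  - job_name: hana-long
    scrape_interval: 5m
    # Must not be larger than scrape_interval.
    scrape_timeout: 1m
    static_configs:
      - targets: ['172.45.111.105:9659']
        labels: {'instance': 'hana_exporter_tst_long'}
```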

 

HANA backups

The first example shows how SAP HANA backups can be monitored. In this case, the hana_sql_exporter config entry for the metric looks something like this:

...
[[Metrics]]
  Name = "hdb_backup_status"
  Help = "Status of last hana backup."
  MetricType = "gauge"
  TagFilter = []
  SchemaFilter = ["sys"]
  SQL = "select (case when state_name = 'successful' then 0 when state_name = 'running' then 1 else -1 end),entry_type_name as type from <SCHEMA>.m_backup_catalog where entry_id in (select max(entry_id) from m_backup_catalog group by entry_type_name)"
...

 

A few minutes after starting the exporter, the metric results can be analyzed with the Prometheus expression browser. The labels instance and job are standard Prometheus labels, usage and tenant are standard hana_sql_exporter labels, and type is an additional label defined in the SQL part of this metric.

 

In a Grafana dashboard, all backups can be displayed in one view. As shown in this example, a hanging backup is immediately obvious. This one was detected around 07:20, then cancelled and started again, and it finished around 07:35. It is also possible to alert on such a situation with the Prometheus Alertmanager.
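
Such an alert could be sketched with rules like the following, based on the value mapping of the hdb_backup_status metric (0 = successful, 1 = running, -1 = anything else). The rule names, durations and severities are assumptions. A full data backup may legitimately run longer than 30 minutes, so the threshold has to be adapted per system.

```yaml
groups:
  - name: hana-backup
    rules:
      # Value 1: the last backup of this type is still running.
      - alert: hdb_backup_hanging
        expr: hdb_backup_status == 1
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.type }} on tenant {{ $labels.tenant }} has been running for more than 30 minutes."
      # Value -1: the last backup of this type is neither successful nor running.
      - alert: hdb_backup_failed
        expr: hdb_backup_status == -1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Last {{ $labels.type }} on tenant {{ $labels.tenant }} was not successful."
```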

Here are some other examples for hana_sql_exporter metrics:

  • Oldest backup days
[[Metrics]]
  Name = "hdb_oldest_backup_days"
  Help = "Oldest Backup found in backup_catalog."
  MetricType = "gauge"
  TagFilter = []
  SchemaFilter = ["sys"]
  SQL = "SELECT DAYS_BETWEEN(MIN(SYS_START_TIME), CURRENT_TIMESTAMP) OLDEST_BACKUP_DAYS FROM <SCHEMA>.M_BACKUP_CATALOG"
  • Cancelled SAP Jobs
[[Metrics]]
  Name = "hdb_cancelled_jobs_total"
  Help = "Sap jobs with status cancelled/aborted (today)"
  MetricType = "counter"
  TagFilter = ["abap"]
  SchemaFilter = ["sapabap1", "sapabap", "sapewm"]
  SQL = "select count(*) from <SCHEMA>.tbtco where enddate=current_utcdate and status='A'"
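
As a sketch of how the cancelled-jobs metric could be used for alerting, the following rule fires when new cancelled jobs appeared within the last hour. The rule name, the one-hour window and the severity are assumptions. Because the SQL only counts jobs of the current day, the value drops back to 0 at the day change, which increase() handles like a counter reset.

```yaml
groups:
  - name: sap-jobs
    rules:
      # Hypothetical rule name and window.
      - alert: sap_jobs_cancelled
        # Growth of the counter over the last hour; the daily drop to 0 is treated as a reset.
        expr: increase(hdb_cancelled_jobs_total[1h]) > 0
        labels:
          severity: warning
        annotations:
          summary: "New cancelled SAP jobs on tenant {{ $labels.tenant }} within the last hour."
```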

 

DBVM problem

A more complicated problem that can also be monitored is the following one. Changing some extensive material variants leads in some cases to hanging update processes (SM50) on the tables DBVM, MA61V, DBVL and MDUP. In rare cases this even leads to a rapidly growing number of hanging update entries from many other users, which can result in a complete system standstill.

The occurrence of specific tables in the process overview can be counted with the sapnwrfc_exporter. The configuration file entry for the metric looks like this:

...
[[TableMetrics]]
  Name = "sap_processes"
  Help = "sap process info"
  MetricType = "gauge"
  TagFilter = []
  FunctionModule = "TH_WPINFO"
  Table = "WPLIST"
  AllServers = true
  [TableMetrics.Params]
    SRVNAME = ""
  [TableMetrics.RowCount]
    WP_TABLE = ["dbvm", "dbvl", "ma61v", "mdup"]
    WP_TYP = ["dia", "btc", "upd", "upd2"]
  [TableMetrics.RowFilter]
    WP_STATUS = ["on hold", "running"]
...

 

In addition, the number of entries in the update table can be counted with the hana_sql_exporter:

...
[[Metrics]]
  Name = "hdb_update_table"
  Help = "SAP update table entries"
  MetricType = "gauge"
  TagFilter = ["abap"]
  SchemaFilter = ["sapabap1", "sapabap", "sapewm"]
  SQL = "SELECT count(*) FROM <SCHEMA>.VBHDR WHERE VBDATE = current_date"
...

 

The result in Grafana when a problem occurs looks like this:

The following alert rule covers this situation. The alert is then sent through one of the receivers configured in the Alertmanager.

alert: sap_dbvm_high
expr: sum(sap_processes{system="p01", count=~"wp_table_dbvm|wp_table_dbvl|wp_table_ma61v|wp_table_mdup"}) > 0 and sum(hdb_update_table{tenant="p01"} > 5)
for: 2m
labels:
  severity: critical
annotations: 
  description: DBVM problem for more than 2 minutes.
  summary: DBVM, DBVL, MA61V or MDUP table entries in SM66 > 0 and SM13 entries > 5.

 

Thanks for reading this. I hope it was helpful.
