Grafana API – How to leverage it to monitor SAP Data Intelligence Cloud
Background:
Since the release of SAP Data Intelligence Cloud 2110, an enhancement allows you to monitor the performance of your application with the help of the Grafana API. This post gives a more illustrative, hands-on overview than the one in the SAP Note.
Prerequisites:
This post focuses on the Grafana API, so deploying your on-premises Grafana application and installing the vctl tool are out of scope. You will need the following information:
- <cluster address>: Address of the SAP Data Intelligence cluster running your tenant (same as used in the “vctl login” command).
- <username>: Tenant admin username (same as used in the “vctl login” command).
- <user password>: Tenant admin password (same as used in the “vctl login” command).
- <tenant name>: Tenant name (same as used in the “vctl login” command).
In my case, the cluster address is https://vsystem.ingress.dh-7rh7z7ok4.dh-canary.shoot.live.k8s-hana.ondemand.com, and the tenant name is default. You will also need admin privileges to fetch the tenant ID.
The “vctl tenant get” command returns the ID of your tenant, which you will need later when calling the Grafana API. For more help on the SAP vctl tool, run “vctl --help” or refer to Commands – SAP Help Portal.
Steps:
- Log in to Grafana as an administrator. (For example, if you installed Grafana on your own laptop, the default URL of the Grafana UI is http://127.0.0.1:3000.)
- Select Configuration > Data Sources in the Grafana menu.
- Select Add data source.
- Select Time series databases > Prometheus
- Configure the SAP Data Intelligence Monitoring Query API as a Prometheus data source:
- Name: “SAP Data Intelligence”
- Default: “enabled”
- URL: “https://<cluster address>/app/diagnostics-gateway/monitoring/query”
- Access: “Server (default)”
- Auth > Basic Auth: “enabled”
- Auth > With Credentials: “enabled”
- Basic Auth Details > User: “<tenant name>\<username>”
- Basic Auth Details > Password: “<user password>”
- Custom HTTP Headers > Header: “x-requested-with”
- Custom HTTP Headers > Value: “fetch”
- HTTP Method: “POST”
- Click Save & Test.
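The same settings can also be written down as a data source definition for Grafana's HTTP API (`POST /api/datasources`). The following is a minimal sketch, assuming the field names of Grafana's data source JSON model (`basicAuthUser`, `jsonData`, `secureJsonData`, `httpHeaderName1`); the function name `build_datasource_payload` and the sample values are hypothetical, and the exact fields may differ between Grafana versions.

```python
import json


def build_datasource_payload(cluster_address, tenant, username, password):
    """Build a Grafana data source definition mirroring the UI settings above.

    cluster_address is the host name without the https:// scheme.
    Assumption: field names follow Grafana's data source JSON model.
    """
    return {
        "name": "SAP Data Intelligence",
        "type": "prometheus",
        "isDefault": True,
        "url": f"https://{cluster_address}/app/diagnostics-gateway/monitoring/query",
        "access": "proxy",  # shown as "Server (default)" in the UI
        "basicAuth": True,
        "withCredentials": True,
        "basicAuthUser": f"{tenant}\\{username}",
        "jsonData": {
            "httpMethod": "POST",
            "httpHeaderName1": "x-requested-with",
        },
        "secureJsonData": {
            "basicAuthPassword": password,
            "httpHeaderValue1": "fetch",
        },
    }


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    payload = build_datasource_payload(
        "vsystem.example.com", "default", "admin", "secret"
    )
    print(json.dumps(payload, indent=2))
```

Keeping the definition in code makes it easier to recreate the data source on another Grafana instance, but the UI steps remain the reference.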
Good job! We now have a working data source for fetching the diagnostic metrics of our SAP Data Intelligence Cloud, so we can move on to customizing our dashboard based on the business requirements.
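If you want to check the endpoint outside Grafana, the request that the data source sends can be approximated in a few lines. This is only a sketch: it assumes that the gateway serves the standard Prometheus HTTP API path `/api/v1/query` below the data source URL, and the function name `build_query_request` is hypothetical. The request is prepared but not sent.

```python
import base64
import urllib.parse
import urllib.request


def build_query_request(cluster_address, tenant, username, password, promql):
    """Prepare (but do not send) a POST request for an instant PromQL query.

    Assumption: the gateway serves the Prometheus HTTP API path
    /api/v1/query below the data source URL configured in Grafana.
    """
    url = (
        f"https://{cluster_address}"
        "/app/diagnostics-gateway/monitoring/query/api/v1/query"
    )
    # Basic auth user has the form <tenant name>\<username>.
    credentials = f"{tenant}\\{username}:{password}".encode("utf-8")
    headers = {
        "Authorization": "Basic " + base64.b64encode(credentials).decode("ascii"),
        "x-requested-with": "fetch",
        "Content-Type": "application/x-www-form-urlencoded",
    }
    body = urllib.parse.urlencode({"query": promql}).encode("ascii")
    return urllib.request.Request(url, data=body, headers=headers, method="POST")
```

Passing the returned object to `urllib.request.urlopen` would send the query; the response format is not covered here.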
What’s more:
For your convenience, here are some PromQL queries, copied as-is, that can help you get there quickly. In the samples below:
- ${SAP_DI_TENANT_UID} stands for the tenant ID that you fetched with the “vctl tenant get” command.
- Ensure that all quotation marks and commas inside a query are plain ASCII characters ( " and , ), not typographic or full-width ones.
- Remove SAP_DI_QUERY= before each query.
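The three notes above can be applied mechanically. Here is a small sketch of a helper that does so, in case typographic quotes crept in through copy and paste; the name `prepare_query` and the tenant ID in the usage example are hypothetical.

```python
def prepare_query(sample, tenant_uid):
    """Turn a copied sample line into a PromQL expression for Grafana."""
    query = sample.strip()
    # Replace typographic quotes and full-width commas with ASCII characters.
    for fancy, plain in (("\u201c", '"'), ("\u201d", '"'), ("\uff0c", ",")):
        query = query.replace(fancy, plain)
    # Drop the SAP_DI_QUERY= prefix and the quotes wrapping the whole value.
    if query.startswith("SAP_DI_QUERY="):
        query = query[len("SAP_DI_QUERY="):]
        if len(query) >= 2 and query[0] == '"' and query[-1] == '"':
            query = query[1:-1]
    # Insert the tenant ID obtained from "vctl tenant get".
    return query.replace("${SAP_DI_TENANT_UID}", tenant_uid)


if __name__ == "__main__":
    sample = (
        'SAP_DI_QUERY="sap_pod_status_ready{access_category="pod-performance",'
        'vsystem_datahub_sap_com_tenant_uid="${SAP_DI_TENANT_UID}"}"'
    )
    print(prepare_query(sample, "abc-123"))  # "abc-123" is a made-up tenant ID
```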
—————————————————————————————–
Basic Pod Performance Metrics Usage
Pod memory usage in bytes:
SAP_DI_QUERY="sap_pod_memory_working_set_bytes{access_category="pod-performance",vsystem_datahub_sap_com_tenant_uid="${SAP_DI_TENANT_UID}"}"
Pod CPU cores usage:
SAP_DI_QUERY="rate(sap_pod_cpu_user_seconds_total{access_category="pod-performance",vsystem_datahub_sap_com_tenant_uid="${SAP_DI_TENANT_UID}"}[5m])"
The `rate` function is applied because the base metric `sap_pod_cpu_user_seconds_total` records the total CPU usage over the lifetime of a pod and is not very informative on its own. For the expression above, a value of `0.1` corresponds to an average usage of `1/10th` of the CPU time of a single core over the past five minutes, while a value of `2` corresponds to an average usage of two full CPU cores.
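To make the arithmetic concrete, here is a simplified sketch of what `rate` computes from counter samples. It is not Prometheus's actual implementation: the real function also handles counter resets and extrapolates to the window boundaries. The sample values are made up.

```python
def simple_rate(samples):
    """Per-second increase of a counter between the first and last sample.

    samples: list of (timestamp_seconds, counter_value) pairs in time order.
    Simplification: no counter-reset handling and no extrapolation to the
    window boundaries, both of which PromQL's rate() performs.
    """
    (t_first, v_first), (t_last, v_last) = samples[0], samples[-1]
    return (v_last - v_first) / (t_last - t_first)


# A pod that consumed 24 CPU seconds within 240 seconds of wall-clock time.
cpu_seconds = [(0, 100.0), (60, 106.0), (120, 112.0), (180, 118.0), (240, 124.0)]
print(simple_rate(cpu_seconds))  # 0.1, i.e. a tenth of one core
```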
Pod network usage as bytes per second:
SAP_DI_QUERY="rate(sap_pod_network_bytes_total{access_category="pod-performance",vsystem_datahub_sap_com_tenant_uid="${SAP_DI_TENANT_UID}"}[5m])"
The `rate` function is applied because the base metric `sap_pod_network_bytes_total` records the total network usage over the lifetime of a pod and is not very informative on its own. The network usage as bytes per second is computed over the past five minutes.
Pod readiness status:
SAP_DI_QUERY="sap_pod_status_ready{access_category="pod-performance",vsystem_datahub_sap_com_tenant_uid="${SAP_DI_TENANT_UID}"}"
This is a zero-one metric. The value `1` represents a ready pod, the value `0` a non-ready pod.
Smoothed pod readiness status:
SAP_DI_QUERY="avg_over_time(sap_pod_status_ready{access_category="pod-performance",vsystem_datahub_sap_com_tenant_uid="${SAP_DI_TENANT_UID}"}[5m])"
The result is a sliding window average of the pod readiness status over the past five minutes (based on four or five samples due to a sample resolution of one minute). The values lie between zero (pod not ready for the past five minutes) and one (pod ready for the past five minutes). This metric is suitable to define an alert threshold for example at `0.7`, allowing for the pod to be not ready for a minute during restarts without raising an alert.
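The effect of the `0.7` threshold can be illustrated with a short sketch. The readiness samples are hypothetical, one per minute, and the function names are made up for this example.

```python
def avg_over_window(samples):
    """Arithmetic mean of the samples in the window, like avg_over_time."""
    return sum(samples) / len(samples)


def should_alert(ready_samples, threshold=0.7):
    """Alert when the smoothed readiness drops below the threshold."""
    return avg_over_window(ready_samples) < threshold


one_minute_restart = [1, 1, 0, 1, 1]  # average 0.8, stays above 0.7
two_minutes_down = [1, 0, 0, 1, 1]    # average 0.6, falls below 0.7
print(should_alert(one_minute_restart), should_alert(two_minutes_down))
```

With only four samples in the window, a single non-ready minute gives an average of 0.75, which is still above the threshold.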
> Note: Do not set the time interval for `rate` or `avg_over_time` functions in your PromQL expressions below `5m`. The time series resolution of the queried metrics is one minute at best (see `Metric resolution and retention`). This means that for an interval of five minutes, the `rate` or `avg_over_time` functions are already based on only four sample points. Reducing the interval further may result in too few samples to calculate these functions.
—————————————————————————————–
Tenant Pod Performance
Total memory usage in bytes of all pods of the tenant:
SAP_DI_QUERY="sum(sap_pod_memory_working_set_bytes{access_category="pod-performance",vsystem_datahub_sap_com_tenant_uid="${SAP_DI_TENANT_UID}"})"
Total CPU cores usage of all pods of the tenant:
SAP_DI_QUERY="sum(rate(sap_pod_cpu_user_seconds_total{access_category="pod-performance",vsystem_datahub_sap_com_tenant_uid="${SAP_DI_TENANT_UID}"}[5m]))"
Total network usage as bytes per second of all pods of the tenant:
SAP_DI_QUERY="sum(rate(sap_pod_network_bytes_total{access_category="pod-performance",vsystem_datahub_sap_com_tenant_uid="${SAP_DI_TENANT_UID}"}[5m]))"
Total pod count:
SAP_DI_QUERY="count(sap_pod_status_ready{access_category="pod-performance",vsystem_datahub_sap_com_tenant_uid="${SAP_DI_TENANT_UID}"})"
Total count of ready pods:
SAP_DI_QUERY="sum(sap_pod_status_ready{access_category="pod-performance",vsystem_datahub_sap_com_tenant_uid="${SAP_DI_TENANT_UID}"})"
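The last two queries differ only in the aggregation: on a zero-one metric, `count` returns the number of series and `sum` returns the number of ones. A tiny sketch with hypothetical pod names and values:

```python
# Hypothetical readiness samples (1 = ready, 0 = not ready), one per pod.
pod_ready = {"pod-a": 1, "pod-b": 0, "pod-c": 1, "pod-d": 1}

total_pods = len(pod_ready)           # what count(...) returns
ready_pods = sum(pod_ready.values())  # what sum(...) returns
not_ready = total_pods - ready_pods   # count(...) - sum(...)

print(total_pods, ready_pods, not_ready)  # 4 3 1
```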
—————————————————————————————–
User Pod Performance
Total memory usage in bytes for each user:
SAP_DI_QUERY="sum(sap_pod_memory_working_set_bytes{access_category="pod-performance",vsystem_datahub_sap_com_tenant_uid="${SAP_DI_TENANT_UID}",vsystem_datahub_sap_com_user!=""}) by (vsystem_datahub_sap_com_user)"
Total CPU cores usage for each user:
SAP_DI_QUERY="sum(rate(sap_pod_cpu_user_seconds_total{access_category="pod-performance",vsystem_datahub_sap_com_tenant_uid="${SAP_DI_TENANT_UID}",vsystem_datahub_sap_com_user!=""}[5m])) by (vsystem_datahub_sap_com_user)"
Total network usage as bytes per second for each user:
SAP_DI_QUERY="sum(rate(sap_pod_network_bytes_total{access_category="pod-performance",vsystem_datahub_sap_com_tenant_uid="${SAP_DI_TENANT_UID}",vsystem_datahub_sap_com_user!=""}[5m])) by (vsystem_datahub_sap_com_user)"
Total pod count for each user:
SAP_DI_QUERY="count(sap_pod_status_ready{access_category="pod-performance",vsystem_datahub_sap_com_tenant_uid="${SAP_DI_TENANT_UID}",vsystem_datahub_sap_com_user!=""}) by (vsystem_datahub_sap_com_user)"
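The `sum(...) by (label)` pattern combined with a `label!=""` matcher can be pictured as a filtered group-by. The sketch below assumes one `(labels, value)` pair per pod; the function name `sum_by` and the user names are hypothetical.

```python
from collections import defaultdict


def sum_by(samples, label):
    """Sum sample values per value of `label`, skipping empty label values.

    samples: list of (labels_dict, value) pairs, one per time series.
    """
    totals = defaultdict(float)
    for labels, value in samples:
        key = labels.get(label, "")
        if key != "":  # mirrors the label!="" matcher in the queries above
            totals[key] += value
    return dict(totals)


memory_bytes = [
    ({"vsystem_datahub_sap_com_user": "alice"}, 100.0),
    ({"vsystem_datahub_sap_com_user": "alice"}, 50.0),
    ({"vsystem_datahub_sap_com_user": "bob"}, 25.0),
    ({}, 10.0),  # a pod without a user label is excluded
]
print(sum_by(memory_bytes, "vsystem_datahub_sap_com_user"))
```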
—————————————————————————————–
Pipeline Graph Performance
Total memory usage in bytes of all pods for each (multi-pod) graph:
SAP_DI_QUERY="sum(sap_pod_memory_working_set_bytes{access_category="pod-performance",vsystem_datahub_sap_com_tenant_uid="${SAP_DI_TENANT_UID}",graph!=""}) by (graph)"
Total CPU cores usage of all pods for each (multi-pod) graph:
SAP_DI_QUERY="sum(rate(sap_pod_cpu_user_seconds_total{access_category="pod-performance",vsystem_datahub_sap_com_tenant_uid="${SAP_DI_TENANT_UID}",graph!=""}[5m])) by (graph)"
Total network usage as bytes per second for each (multi-pod) graph:
SAP_DI_QUERY="sum(rate(sap_pod_network_bytes_total{access_category="pod-performance",vsystem_datahub_sap_com_tenant_uid="${SAP_DI_TENANT_UID}",graph!=""}[5m])) by (graph)"
Readiness status for each (multi-pod) graph:
SAP_DI_QUERY="max(sap_pod_status_ready{access_category="pod-performance",vsystem_datahub_sap_com_tenant_uid="${SAP_DI_TENANT_UID}",graph!=""}) by (graph)"
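Because the readiness metric is zero-one, taking `max` per graph yields `1` as soon as at least one pod of the graph is ready. A brief sketch of that behavior, with made-up graph names and a hypothetical function name:

```python
def graph_readiness(samples):
    """Return the max readiness per graph: 1 if at least one pod is ready."""
    result = {}
    for labels, ready in samples:
        graph = labels.get("graph", "")
        if graph != "":  # mirrors graph!="" in the query above
            result[graph] = max(result.get(graph, 0), ready)
    return result


pods = [
    ({"graph": "ingest"}, 1),
    ({"graph": "ingest"}, 0),
    ({"graph": "train"}, 0),
    ({}, 1),  # not part of any graph
]
print(graph_readiness(pods))
```

If you need a graph to count as ready only when all of its pods are ready, `min` instead of `max` would express that; this is an observation about the aggregation, not something stated in the SAP Note.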
—————————————————————————————–
Here is a screenshot of one query:
Please note that the Grafana UI can differ depending on the version of Grafana installed.
The suggested version (used for the screenshots in this article) is v7.5.14. With other versions, you may need to tweak the query a bit to run it properly in the Explore panel.
Cheers! Now enjoy your journey in Grafana!
If anything is unclear, feel free to comment or reach out to me directly.
Hi Chank,
thanks for the blog.
We've tried to query the URL via Postman, but we always receive "Forbidden [403]: Prometheus access is not authorized"
The user credentials are correct, and the user has admin privileges.
Do you know this error message?
Greetings, Oliver
Hi Oliver,
You may want to check whether the requested endpoint ends with one of:
More details in this document: https://help.sap.com/viewer/ca509b7635484070a655738be408da63/Cloud/en-US/9bc60ac178964f23b3fee513595a73a4.html
There is a simple test URL with a curl command there; you can also try it out with Postman.
Regards,
Tom
Hi Tom,
Are there any other metrics that can be monitored besides the ones in the blog above, or is there any way to list all the PromQL queries that are available for querying the SAP DI Prometheus? It would be helpful if you could share any other useful metrics along with their PromQL queries.
Thanks,
Chandra
Hi Experts,
The blog is very helpful, and I appreciate the content and steps.
Are you able to fetch the network usage for pipelines/users/tenant? We were able to fetch CPU and memory info and visualize it in Grafana, but we failed to fetch data for the network query.
Are there any other queries that could be added to this blog to give us more metrics for each of the pods when multiplicity is in use in pipelines?
Thanks, Sharath
Hi,
I followed the blog, and I am getting the error message below when I try to create the DI connection.
Error reading Prometheus: An error occurred within the plugin
Thanks,
Chenna.
Error reading Prometheus: Post "https://host/app/diagnostics-gateway/monitoring/query/api/v1/query/api/v1/query": dial tcp: lookup host on Ip address: no such host
Please help Tom Hu
Oliver Zieger
Is this error observed when configuring Grafana?
The URL looks incorrect. Please replace "host" with the actual host name of your Data Intelligence cluster.
Tom Hu