Anomaly detection of SAP HANA components using Machine Learning on ELK based openITOA platform
In the previous parts of this blog series we covered the basics of the ELK stack and demonstrated how to visualize and alert on critical errors in SAP HANA logs in real time. In this part we explore how the ELK machine learning features can be used to monitor HANA logs and notify the operations team about anomalies that typical monitoring solutions and human operators fail to notice.
Machine Learning?
Machine learning is currently one of the most overloaded terms in the software industry, as fundamentally it is used to describe a broad range of algorithms and methods for data driven prediction, decision making, and modelling.
Anomaly detection is the problem of finding patterns in data that do not conform to a model of “normal” behavior. Typical approaches for detecting such changes use either simple human-computed thresholds, or the mean and standard deviation of the data to determine when a value deviates significantly from the mean.
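To make the mean and standard deviation approach concrete, here is a minimal Python sketch. The function name, the threshold and the hourly error counts are all hypothetical and chosen only for illustration; this is the classic baseline, not what X-Pack does internally.

```python
from statistics import mean, stdev


def zscore_anomalies(values, threshold=3.0):
    """Return (index, value) pairs that lie more than `threshold`
    sample standard deviations away from the mean."""
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return []  # all values identical, nothing can stand out
    return [(i, v) for i, v in enumerate(values)
            if abs(v - mu) / sigma > threshold]


# Hypothetical hourly error counts with one obvious spike.
counts = [12, 15, 11, 14, 13, 12, 16, 14, 13, 95, 12, 15]
print(zscore_anomalies(counts, threshold=2.5))
```

A fixed threshold like this has to be tuned by hand for every metric and cannot adapt to daily or weekly patterns, which is the gap that learned behavioral models are meant to close.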
Now, with machine learning you can go deeper and ask questions like “Have any of my services changed behavior?” or “Are there any unusual processes running on my hosts?” These questions require behavioral models of hosts or services that can be automatically built from data using machine learning techniques.
Elastic released its machine learning features in May 2017. The machine learning features of X-Pack focus on providing “Time Series Anomaly Detection” capabilities using unsupervised machine learning.
Use case:
It is always a challenge to spot anomalies in a large set of unstructured logs. A SAP HANA environment has several HANA components whose logs are usually huge, so finding anomalies in each component is quite challenging. As an extension of our Open ITOA platform, we want to use the ML capabilities of ELK to find anomalies in a huge set of log files from a live SAP HANA environment.
We collected 3 months of logs for the following components from the HANA server.
You can use machine learning to observe the static parts of the message, cluster similar messages together, and classify them into message categories. The machine learning model learns what volume and pattern is normal for each category over time. You can then detect anomalies and surface rare events or unusual types of messages by using the count or rare functions.
The rare functions detect values that occur rarely in time or rarely for a population. They detect anomalies according to the number of distinct rare values.
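The idea of "rare in time" can be illustrated with a small Python sketch. It is only a simplified stand-in for the rare function, not Elastic's implementation: it treats a message category as rare when it shows up in at most a given fraction of the time buckets. The function name, the fraction and the category names are hypothetical.

```python
from collections import Counter


def rare_values(buckets, max_fraction=0.2):
    """Return the values that appear in at most `max_fraction`
    of the buckets, sorted alphabetically."""
    seen_in = Counter()
    for bucket in buckets:
        seen_in.update(set(bucket))  # count each value once per bucket
    limit = max_fraction * len(buckets)
    return sorted(v for v, n in seen_in.items() if n <= limit)


# Ten hypothetical buckets of message categories; one contains a rare event.
buckets = [["connect", "commit"] for _ in range(9)]
buckets.append(["connect", "commit", "service_shutdown"])
print(rare_values(buckets))
```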
How does ELK ML work?
There are a few concepts that are core to machine learning in X-Pack. Understanding these concepts from the outset helps tremendously.
Machine learning jobs contain the configuration information and metadata necessary to perform an analytics task.
Jobs can analyze either a one-off batch of data or continuously in real time. Datafeeds retrieve data from Elasticsearch for analysis. Alternatively you can POST data from any source directly to an API.
As part of the configuration information that is associated with a job, detectors define the type of analysis that needs to be done. They also specify which fields to analyze. You can have more than one detector in a job, which is more efficient than running multiple jobs against the same data.
The X-Pack machine learning features use the concept of a bucket to divide the time series into batches for processing. The bucket span is part of the configuration information for a job. It defines the time interval that is used to summarize and model the data. This is typically between 5 minutes and 1 hour, depending on your data characteristics. When you set the bucket span, take into account the granularity at which you want to analyze, the frequency of the input data, the typical duration of the anomalies, and the frequency at which alerting is required.
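As a rough illustration of what a bucket span does, the following Python sketch groups event timestamps into fixed-width buckets and counts the events in each one. The function name and the sample events are hypothetical, and the real feature models the summarized values rather than just counting them.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def bucket_counts(timestamps, bucket_span):
    """Count events per fixed-width bucket, keyed by bucket start time (UTC)."""
    span = int(bucket_span.total_seconds())
    counts = Counter()
    for ts in timestamps:
        epoch = int(ts.timestamp())
        start = epoch - epoch % span  # align to the start of the bucket
        counts[datetime.fromtimestamp(start, tz=timezone.utc)] += 1
    return dict(sorted(counts.items()))


# Hypothetical log event times, all in UTC.
events = [
    datetime(2017, 5, 18, 10, 1, tzinfo=timezone.utc),
    datetime(2017, 5, 18, 10, 7, tzinfo=timezone.utc),
    datetime(2017, 5, 18, 10, 14, tzinfo=timezone.utc),
    datetime(2017, 5, 18, 10, 16, tzinfo=timezone.utc),
    datetime(2017, 5, 18, 10, 44, tzinfo=timezone.utc),
]
result = bucket_counts(events, timedelta(minutes=15))
for start, count in result.items():
    print(start.isoformat(), count)
```

A shorter span reacts faster but produces noisier counts, while a longer span smooths the data and can hide short-lived anomalies.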
Machine learning nodes
A machine learning node is a node that has xpack.ml.enabled and node.ml set to true, which is the default behavior. If you set node.ml to false, the node can service API requests but it cannot run jobs. If you want to use X-Pack machine learning features, there must be at least one machine learning node in your cluster.
The X-Pack machine learning features include analysis functions that provide a wide variety of flexible ways to analyze data for anomalies. Most functions detect anomalies in both low and high values. In statistical terminology, they apply a two-sided test. Some functions offer low and high variations (for example, count, low_count, and high_count). These variations apply one-sided tests, detecting anomalies only when the values are low or high, depending on which alternative is used.
If your data is sparse, there may be gaps in the data, which means you might have empty buckets. You might want to treat these as anomalies or you might want these gaps to be ignored. Your decision depends on your use case and what is important to you. It also depends on which functions you use.
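The difference between the two-sided and one-sided variations can be sketched in Python as follows. This uses a plain z-score as a stand-in for the real model, so it only illustrates the direction of the test; the function name, the `side` parameter and the sample counts are hypothetical. Here `side="both"` loosely corresponds to count, `"low"` to low_count and `"high"` to high_count.

```python
from statistics import mean, stdev


def unusual_buckets(counts, side="both", threshold=2.0):
    """Return the indices of buckets whose count is unusual.

    side="both" is a two-sided test; "low" and "high" are one-sided.
    """
    mu, sigma = mean(counts), stdev(counts)
    if sigma == 0:
        return []
    flagged = []
    for i, c in enumerate(counts):
        z = (c - mu) / sigma
        if side == "high":
            hit = z > threshold
        elif side == "low":
            hit = z < -threshold
        else:
            hit = abs(z) > threshold
        if hit:
            flagged.append(i)
    return flagged


# Hypothetical per-bucket message counts with one spike and one drop.
counts = [50, 52, 49, 51, 50, 48, 51, 50, 90, 49, 10, 50]
print(unusual_buckets(counts, "both"))
print(unusual_buckets(counts, "high"))
print(unusual_buckets(counts, "low"))
```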
Step 1: Download and install the X-Pack ELK stack packages
Install Elasticsearch, Kibana and Logstash as described in the previous blog.
Step 2: Configuration of Logstash
Start the ELK stack (Elasticsearch, Kibana, Logstash).
The output is defined as a local Elasticsearch server. By default the logs are written to the default index, but we created a dedicated index named “hanalogIndex” (a dedicated index is always preferable). Each log line is stored as a value in Elasticsearch, which makes it accessible for further processing. Because ML in ELK uses the timestamp of the original log, it is important to map the event timestamp to the actual HANA log timestamp.
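In Logstash this mapping is normally handled with the date filter. The underlying idea can be shown in Python: extract the timestamp from the log line itself instead of using the time at which the line was ingested. The sample line below is hypothetical, and the assumed layout (a `YYYY-MM-DD HH:MM:SS.ffffff` timestamp somewhere in the line) should be checked against your own trace files.

```python
import re
from datetime import datetime

# Assumed timestamp layout; adjust to match the real trace files.
TS_PATTERN = re.compile(r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{6})")


def event_time(line):
    """Return the timestamp embedded in a log line, or None if absent."""
    match = TS_PATTERN.search(line)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f")


# Hypothetical trace line, for illustration only.
sample = ("[12345]{-1}[-1/-1] 2017-05-18 10:15:32.123456 i "
          "Service_Shutdown example.cpp(00123) : stopping service")
print(event_time(sample))
```

If the ingest time were used instead, three months of history loaded in one go would collapse into a single burst and the model would have no meaningful time series to learn from.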
Create a dedicated index
We transferred the log files below from one of our HANA environments.
Run Logstash with the config file below.
Save the config file in the Logstash bin directory and execute the command below to start Logstash.
You can see the filter at work, as shown below.
Step 3: Configuration of Kibana
Open the Kibana interface and select the index pattern for the index that you want to visualize.
The Discover tab shows the latest documents. If a time field is configured for the selected index pattern, the distribution of documents over time is displayed in a histogram at the top of the page.
Step 4: Configuration of ML in Kibana
Create an ML job:
Go to the Machine Learning tab in Kibana, click ‘Create new job’ and select the advanced job option.
Select the input index and all the related types for which ML anomaly detection should be performed.
Give the job a unique job ID and a description, and tick the check box to use a dedicated index for the ML job.
On the Analysis Configuration tab of the job, use the settings below for the bucket span, categorization, detectors and influencers.
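For orientation, the same kind of settings can be written as the JSON body of an anomaly detection job. The Python sketch below only builds and prints such a body. The key names follow the X-Pack ML job resource, but every value (bucket span, field names, description) is an assumption made for illustration and not the configuration used in this post.

```python
import json


def build_job_config(description, bucket_span, text_field,
                     partition_field, time_field):
    """Build a job body that counts message categories per component."""
    return {
        "description": description,
        "analysis_config": {
            "bucket_span": bucket_span,
            # Field whose text is clustered into message categories.
            "categorization_field_name": text_field,
            "detectors": [
                {
                    "function": "count",
                    "by_field_name": "mlcategory",
                    "partition_field_name": partition_field,
                }
            ],
            "influencers": [partition_field],
        },
        "data_description": {"time_field": time_field},
    }


# All values below are assumed, not taken from the original job.
job = build_job_config(
    description="SAP HANA log anomalies (illustrative)",
    bucket_span="15m",
    text_field="message",
    partition_field="type",
    time_field="@timestamp",
)
print(json.dumps(job, indent=2))
```

Partitioning by the component field gives each component its own baseline, so a message category that is routine for one component can still be flagged as unusual for another.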
Please go through the link below for more details about ML jobs:
Verify that the correct index is selected in the Datafeed tab.
Now save the job configuration.
You will get a confirmation that the job was saved and will be asked to start the datafeed.
Start the datafeed:
Click on Start datafeed.
You can select a time range for the datafeed. In our use case we chose to process the data from the beginning up to the current date and then clicked Start.
You can see that the job has started and that the number of processed records is increasing. Wait until the job status changes to closed.
In this phase ML analyses the data with the detectors and influencers we configured and writes the scores for each bucket.
Once the job has finished, click on the Anomaly Explorer in the Actions column.
View the results in the Anomaly Explorer:
The Anomaly Explorer screen shows the overall anomaly score and the individual anomaly score for each component (type).
You can get deeper insight by clicking on a particular section of the timeline chart.
For instance, we noticed a common anomaly in all 3 components on a particular date, so we selected that section to view more details.
The highest score is for “Service_shutdown” in each component on May 18th, which is when the system actually went down. This shows that ML is able to detect the anomaly from the history of the logs.
ELK machine learning makes machine learning technology accessible to IT operations analysts and engineers who have related log data living in Elasticsearch. The basic element of X-Pack machine learning is the anomaly detection job. The “SAP HANA Operations Anomaly Detection” use case shows how to configure jobs to detect anomalous behavior. Without any programming, you can become the leader of your own army of algorithmic assistants that help detect anomalies and improve overall IT operations.
Reference links: https://www.elastic.co/guide/en/x-pack/current/xpack-ml.html