My Adventures in Machine Learning 4
Ranking the systems by workload
Today I’ll start with a trivial question: How busy are the BW systems compared to each other? It will be a preparation for my machine learning steps in the next section. I am already collection many metrics, so I have enough data to evaluate. For this question I focus on 26 metrics because I think they are most helpful:
IP_INFOPAKIDS, IP_RECORDS, DTP_REQUESTS, DTP_DATAPAKS, DTP_RECORDS, DTP_GB, Q_USERS, Q_QUERIES, Q_NAVSTEPS, PC_CHAINS, PC_RUNS, BTC_JOBS, DB_BLOCK_CHANGES, LOG_GB, SELECTS, INSERTS, UPDATES, DELETES, CALLS, DV_NUMROWS, DV_DSO_GB, DV_IC_GB, DV_WARM_GB, BIA_USED, AL_SUCCESS, USERA
They should correlate very well with the desired workload, which means the higher these numbers the higher the (useful) workload on the system. I omitted ambiguous metrics like e.g. IP_SECONDS, because a high total runtime for InfoPackages could mean either:
- a lot of data was being loaded from the source systems
- few data was loaded from the source systems, but in a very inefficient way
I assume, no unnecessary batchjobs are being scheduled and users only execute useful queries etc. The actual values of the metrics are normalized by their overall average. For each system the normalized averages for all 26 metrics are summed up. A score of 26 would mean a system is showing a very average overall workload. This also means that all 26 metrics have the same weight, which sounds like a reasonable thing to do to keep things simple.
Finally I get this overview:
Table 1: Workload Scores
System POC is pretty much average, the highest workload is on PO7. The BWonHANA systems PH1 and PH2 are somwhere in the middle.
Picture 1: Barcharts of the workload scores
From my personal experience, I would classify the sysmtes PO5, PO8, PO4, POD and PO9 as low worload systems. Maybe the systems POC, PH2, POB and PO2 could be classified as medium workload systems, since they also seem to have a similar overall workload. Now let’s start with classification algorithms if they would confirm my opinion.
Classifying systems by workload
My overall goal is to detect unusual BW workload. So a classfication algorithm should help me with anomaly detection. I’ll start with standard classifcation algorithms that should simply take all available performance data without knowing which SID the workload belongs to, and then cluster the data intelligently. Ideally the 15 SIDs will each get their own cluster.
The first algorithm I used for clustering is KMEANS. This alogrithm needs the number of desired clusters as an input. I used the Silhouette score to identify the ideal number of clusters. The best fit was achieved with 9 clusters and a silhouette score of 0.45:
Table 2: KMEANS optimal clustering
The numbers in the cells represent the number of workload samples (days) that were classified in this cluster group for the given SID. The low workload systems PO4, PO5, PO8, PO9 and POD are lumped up together in one group. However, there is no medium workload group. Systems PO2, POA and POB seem to have a similar workload and are therefore put into a single group. When I check what happens with a suboptimal clustering of 2 or 3 or 4 clusters, then I see that system PO7 is the first to be granted a cluster on its own, followed by PO1. These systems seem to be for KMEANS the easiest to identify via their workload.
A very valuable output of the KMEANS algorithm are the centroids for these 9 clusters. This provides me the datapoints of a typical workload for this cluster. These centroids are similar, but more helpful than a simple average for the collected data for a given SID. I need to test if the distance of one data point (= workload of a specific calendar day) to its centroid can be used for anomaly detection.
The second algorithm doesn’t need to be told how many clusters it should generate, but it needs to get as an input the allowed distance and the number of neighbors which constitute a cluster. Again I chose the silhouette matrix to identify the ideal clustering. This time the silhouette score was only 0.41, so slightly worse. And probably due to the small amount of data, many small clusters were detected. Probably with more data the clusters would merge to fewer and larger clusters:
Table 3: DBSCAN optimal clustering
DBSCAN creates a cluster containing the outliers or anaomalies. And it creates several clusters for a system if the workload is very diverse. E.g. PO1, PO3 and POA get serveral clusters. Maybe with more data collected they would merge into larger clusters. Remember, KMEANS confidently put all PO1 samples in one group. Again the low workload systems PO4, PO5, PO8, PO9 and POD as summed up in a single cluster. And PO2, POB and POC seem similar for DBSCAN, but not as similar as for KMEANS.
As a third alogrithm I checked the Gaussian Mixtures. They assume an ellipsoid data distribution similar to KMEANS, so let’s see if the output is comparable. I couldn’t compute a silhouette score here, so I had to identify the ideal clustering myself. I chose this one:
Table 4: Gaussian Mixtures ideal clustering
This time the “low workload” systems PO4, PO5, PO8, PO9 and POD are distributed over two clusters. The other similar systems PO2, POB and POC get their own cluster, however. To detect anomalies, I could examine the rare cases (marked in red in table 4), or similar to KMEANS the Gaussian Mixtures could calculate a probability that a sample belongs to a certain cluster. Then I just need to define a threshold probability under which the samples are defined as outliers.
In summary, I was not happy with the classification algorithms so far. They were easy to implement and running fast, but the results feel somewhat dumb. Of course, obvious differences are spotted, but I wouldn’t need these alogrithms for spotting the obvious. Small random differences can often have some large impact on their clustering, and I assume that this would still be true if I had 5 times as much data available. I am sure people with real experience in Machine Learning are reading this. If anyone could share some tips, that would be highly appreciated.
So my next step is to train a neuronal net for anomaly detection as a quite different approach. Hopefully a neuronal network can detect and learn subtle patterns in the workload data. After that, I will examine in detail the anomalies detected by the various approaches to see which one provides the best results.