My Adventures in Machine Learning 5
Recap on Clustering Algorithms
In my last blog post I wrote about three standard clustering algorithms:
k-means, DBSCAN and Gaussian mixtures. For me personally it was interesting
to see how they build clusters from the workload data of my 15 BW systems.
However, they were not useful for my specific task of anomaly detection.
DBSCAN explicitly lists outliers, but if there are too many similar outliers,
they simply get their own cluster. So anomaly detection with DBSCAN
goes in the right direction, but I cannot rely on most anomalies actually
being detected: the output is somewhat random, and I get way too few true positives.
In the meantime I have gathered much more training data, but the drawbacks from my last blog post remain:
- BW systems with low workload are still lumped together; the clustering algorithms can hardly distinguish them
- the algorithms are very sensitive to any change in the data (i.e. I cannot add new metrics or change/improve the data collection for existing metrics)
- (except for DBSCAN) it is not clear how to interpret the results to identify anomalies/outliers
My first neural network
As a logical next step, I experiment with artificial neural networks. Being an absolute beginner, I use Keras. Creating my first artificial neural network was astonishingly easy.
My data is already prepared, so I can feed it directly into the neural network. I only had to normalize the values to the range [0, 1] by dividing the data by the maximum value of each metric. I chose these 39 metrics because I think they are the most relevant for evaluating the BW workload:
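The normalization step can be sketched like this (a minimal example with dummy numbers; `raw` stands in for my real samples-by-39-metrics array):

```python
import numpy as np

# Dummy stand-in for the real workload data: rows are daily samples,
# columns are metrics (the real array has 39 columns).
raw = np.array([[10.0, 200.0],
                [ 5.0, 400.0],
                [ 2.0, 100.0]])

col_max = raw.max(axis=0)  # the maximum value of each metric
X = raw / col_max          # every value now lies in the range [0, 1]
```

After this, each metric's maximum maps to exactly 1 and all other values scale proportionally.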
feature_names = ['IP_INFOPAKIDS','IP_SECONDS','IP_RECORDS','IP_AVG','DTP_REQUESTS','DTP_DATAPAKS','DTP_SECONDS','DTP_RECORDS','DTP_GB','DTP_AVG','Q_NAVSTEPS','Q_USERS','Q_QUERIES','Q_RT_SUM','SELECTS','AVG_SELECT','INSERTS','AVG_INSERT','UPDATES','AVG_UPDATE','DELETES','AVG_DELETE','CALLS','AVG_CALL','LOG_GB','LOGICAL_READS','PHYSICAL_READS','DB_BLOCK_CHANGES','TABLE_SCANS','PC_CHAINS','PC_RUNS','PC_TOTAL','PC_AVG','PC_FAIL','BTC_JOBS','BTC_SUM','BTC_AVG','BTC_FAIL','WP_BTC_PCT']
And this is my very simple neural network:
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(20,                  # The first and only hidden layer is fixed to have 20 neurons
                activation='relu',   # The hidden layer uses the ReLU activation function
                input_shape=(39,)))  # Since I have 39 input features for measurement,
                                     # I fix the number of neurons in the input layer to 39
model.add(Dense(15,                     # The output layer has 15 neurons, one per SID
                activation='sigmoid'))  # The output layer uses the sigmoid activation function, which squashes
                                        # all inputs to the range [0, 1], making it possible to interpret
                                        # the result as a probability
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
dense_1 (Dense)              (None, 20)                800
_________________________________________________________________
dense_2 (Dense)              (None, 15)                315
=================================================================
Total params: 1,115
Trainable params: 1,115
Non-trainable params: 0
_________________________________________________________________

model.compile(loss='binary_crossentropy',
              optimizer='Adam',
              metrics=['accuracy'])

history = model.fit(X,                     # Array of input features
                    Y,                     # Array of output labels
                    epochs=100,            # Number of epochs as defined previously
                    validation_split=0.2,  # I reserve 20% of the dataset for validation
                    verbose=1,             # Output details during training
                    shuffle=True)          # Randomly shuffle the training data before training
Training takes just 15 seconds because this is a very small neural network with currently only 2600 workload samples (175 days per SID). The training graphs for accuracy and loss look almost identical for the training and validation data, so there seems to be no over- or underfitting:
Graph 1: training and validation loss
Graph 2: training and validation accuracy
Let’s look at the Confusion Matrix:
Table 1: Confusion Matrix of my first neural network
Only 36 of 2628 predictions were wrong, so the neural network classifies with a much higher accuracy than the clustering algorithms. Even for the “low workload” systems (PO4, PO5, PO8, PO9, POD) the classification works very well, with only 22 wrong classifications. However, for high workload systems such as PO1, PO3, PO6, PO7, POA and PH1 the classification works too well: there are only 6 wrong classifications. So if I use the wrong classifications as indicators for an abnormal workload, I again get way too few true positives. I would like the tool to err on the side of producing more false positives in order not to miss the true positives. For system PO3 I know for sure that the number of days with an unusual workload is much higher than the two detected instances.
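For anyone who wants to reproduce this kind of count, here is a minimal sketch of deriving the misclassifications from the sigmoid outputs. The arrays `Y` and `y_prob` are tiny dummies; in reality `y_prob` would come from `model.predict(X)` with 15 columns:

```python
import numpy as np

# Dummy stand-ins: Y holds the one-hot SID labels, y_prob the sigmoid
# outputs of the model (in reality: y_prob = model.predict(X)).
Y = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]])
y_prob = np.array([[0.9, 0.1, 0.2],
                   [0.2, 0.8, 0.1],
                   [0.1, 0.3, 0.7],
                   [0.3, 0.6, 0.2]])

y_true = Y.argmax(axis=1)       # index of the correct SID per sample
y_pred = y_prob.argmax(axis=1)  # SID with the highest confidence

cm = np.zeros((3, 3), dtype=int)  # confusion matrix: true x predicted
for t, p in zip(y_true, y_pred):
    cm[t, p] += 1

wrong = cm.sum() - np.trace(cm)  # off-diagonal entries = wrong predictions
```

The diagonal of the matrix holds the correct predictions; everything off the diagonal is a misclassification (here: the last sample).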
Considering this, I need to lower the confidence of the neural network. If the neural network has a high confidence in the predicted SID, then the workload should be pretty normal. If the confidence is low, then the workload is unusual.
My second neural network
The neural network should be more general, learning broader patterns rather than memorizing individual values. Even though no overfitting is happening, the individual numbers should become less important. I can achieve this with dropout. With a dropout rate of 20% after the hidden layer, a random 20% of that layer's outputs are zeroed out in each training step. The new neural network looks almost the same as before:
from keras.models import Sequential
from keras.layers import Dense, Dropout

model = Sequential()
model.add(Dense(20,                  # The first and only hidden layer is fixed to have 20 neurons
                activation='relu',   # The hidden layer uses the ReLU activation function
                input_shape=(39,)))  # Since I have 39 input features for measurement,
                                     # I fix the number of neurons in the input layer to 39
model.add(Dropout(rate=0.2))         # randomly drop 20% of the hidden layer's outputs during training
model.add(Dense(15,                     # The output layer has 15 neurons, one per SID
                activation='sigmoid'))  # The output layer uses the sigmoid activation function, which squashes
                                        # all inputs to the range [0, 1], making it possible to interpret
                                        # the result as a probability
model.summary()
The training is again very fast. This time there is a difference between the training and validation data:
Graph 3: training and validation loss with dropout
Graph 4: training and validation accuracy with dropout
The values for the validation data are slightly better than for the training data, because there is no dropout during validation: the model then sees all 39 metrics, so it is easier to identify the correct SID. However, since the model effectively trains with less data, there is more confusion, as can be seen in the second Confusion Matrix:
Table 2: Confusion Matrix of my second neural network (with dropout)
This time, 76 of 2628 predictions were wrong. Again there are many more wrong predictions for the low workload systems (PO4, PO5, PO8, PO9, POD) than for the high workload systems (PO1, PO3, PO6, PO7, POA, PH1): 55 versus 7. It is still much easier to correctly identify a high workload system.
One question remains: Is the trained model now really less sure about its predictions?
The Sigmoid Values
To verify the model's predictions, I now focus on system PO3, where I can personally best evaluate whether the workload was normal or extraordinary. First of all, I take the value of the PO3 output neuron for all PO3 data, which can be interpreted as the model's confidence that the observed workload belongs to PO3. And indeed, the average confidence sank from 97% to 91%:
Table 3: comparing the confidence of neural net 1 and neural net 2
As you can easily see, the first neural network was often 100% sure that the workload belonged to system PO3. Due to this high confidence, the values are not helpful for my task.
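Reading this per-day confidence out of the model is straightforward. A sketch with dummy predictions, where `PO3_IDX` is the assumed position of PO3 among the 15 output neurons and `y_prob` stands in for `model.predict(X_po3)`:

```python
import numpy as np

# Dummy sigmoid outputs for three PO3 days over four SIDs (the real
# model has 15 output neurons); PO3_IDX is an assumed column position.
PO3_IDX = 2
y_prob = np.array([[0.05, 0.10, 0.99, 0.02],
                   [0.10, 0.20, 0.95, 0.05],
                   [0.30, 0.40, 0.45, 0.10]])

po3_confidence = y_prob[:, PO3_IDX]      # the PO3 neuron's value per day
avg_confidence = po3_confidence.mean()   # average confidence over all days
```

Note that with a sigmoid output layer the 15 neuron values need not sum to 1; each one can be read on its own as a per-SID confidence.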
Much more relevant is the second neural network. The most interesting part: whenever its confidence sank below 50%, I could personally verify that the workload was indeed extraordinary (unusually high or low). And for days with a confidence between 50% and 85% I get a good indicator of unusual workload.
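The flagging rule I ended up with can be sketched as follows; the band boundaries of 50% and 85% come from my observations above, and `conf` stands in for the PO3 neuron's values over several days:

```python
import numpy as np

# Dummy per-day confidences from the PO3 output neuron.
conf = np.array([0.98, 0.72, 0.40, 0.91, 0.55])

# Below 50%: almost certainly extraordinary workload;
# 50% to 85%: a good indicator of unusual workload; above 85%: normal.
labels = np.where(conf < 0.50, "extraordinary",
         np.where(conf < 0.85, "unusual", "normal"))
```

Running this on the dummy values flags day 3 as extraordinary and days 2 and 5 as unusual.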
Voilà: by building a simple neural network I can train a model that tells me how ordinary or extraordinary the BW workload on a given day (think: yesterday) was. And all of this works without manual labeling: the SID serves as a free training label, so nowhere do I have to provide arbitrary thresholds or introduce my bias into the evaluation.