Technical Articles
Outlier Detection with One-class Classification using Python Machine Learning Client for SAP HANA
We have introduced several methods for outlier detection in previous blog posts, including outlier detection using statistical tests and clustering. Typically, these methods can only detect outliers in the input dataset, and the detection result cannot be generalized to new data points because they do not produce a model. Classification methods can be adopted to overcome this difficulty. However, the generalization power of classification methods does not come for free, since they require the input data to be labeled. In the case of outlier detection, this means that every point in the input data must be labeled either as an inlier or as an outlier.
Usually, classification for outlier detection requires the dataset to contain both inliers and outliers. However, there are cases where the input dataset contains no outliers at all, yet a model for outlier detection is still required. In such cases, one-class classification can be adopted. As its name suggests, one-class classification requires the input (i.e. training) data to be labeled with a single class, yet the trained model is still able to produce labels of the opposing class, like other binary classification models.
The one-class support vector machine (one-class SVM) is perhaps the most frequently used method for one-class classification. This method is provided in the SAP HANA Predictive Analysis Library (PAL) and wrapped by the Python machine learning client for SAP HANA (hana_ml), and in this blog post it is adopted to solve the outlier detection problem.
After reading this blog post, you will learn:
- How to build a one-class SVM model given a dataset
- How to apply the trained model to the prediction data, extract the information of detected outliers, and evaluate the performance of the trained model
Introduction
Suppose a new system has been set up and runs smoothly in its early phase. In the meantime, we want to build a monitoring system that detects malfunctions as the new system keeps running. What can we do? Waiting until the system makes some mistake sounds like a terrible idea, so we should be able to use the normal data at hand to initialize the monitoring system. There are plenty of similar cases in real life where one-class classification becomes applicable.
Different from traditional classification methods, one-class classification tries to explore the inherent structure of a training dataset with a single label, and builds a model that contains or characterizes the training dataset, so that when a new point arrives the model can tell whether that point is similar to the points in the training dataset or different from them. One-class classification methods can be derived via density estimation, boundary estimation, or reconstruction-based estimation w.r.t. the input data. Among them, one-class SVM is a boundary-estimation based method.
Basically, for outlier detection using one-class SVM, in the training phase a profile is drawn that encircles (almost) all points in the input data (all being inliers); in the prediction phase, if a sample point falls into the region enclosed by this profile it is treated as an inlier, otherwise it is treated as an outlier.
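The boundary idea can be illustrated with a much-simplified, hypothetical stand-in (this is not the actual one-class SVM optimization): enclose the training inliers in a sphere around their centroid, with the radius set to the (1 - nu) quantile of the training distances, so that roughly a fraction nu of the training points fall outside the "profile".

```python
import numpy as np

# Toy stand-in for a boundary-based one-class model (not the real SVM):
# learn a spherical "profile" around inlier-only training data.
rng = np.random.default_rng(0)
train = rng.normal(loc=0.0, scale=1.0, size=(1000, 2))   # inliers only
nu = 0.01                                                # tolerated training-error fraction

center = train.mean(axis=0)
radius = np.quantile(np.linalg.norm(train - center, axis=1), 1 - nu)

def is_inlier(points):
    """True for points inside the learned spherical profile."""
    return np.linalg.norm(points - center, axis=1) <= radius

print(is_inlier(np.array([[0.0, 0.0], [10.0, 10.0]])))   # first point inside, second outside
```

A real one-class SVM replaces the sphere with a far more flexible boundary shaped by the chosen kernel, but the role of nu is analogous.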
In the remainder of this blog post, we show a detailed case study of network intrusion detection using one-class SVM, where attacks are taken as outliers and normal connections as inliers.
The Case Study : One-class SVM for Network Intrusion Detection
Dataset Description
In this case study, a reduced version of the renowned KDD Cup 1999 dataset is used. The original dataset is for computer network intrusion detection; it contains 41 feature columns and a label column with 23 classes (representing whether a connection is normal or an attack with a detailed attack type). In the reduced dataset, the computer network service type is restricted to http, and only the three most basic features are kept, together with a categorical label column of 2 classes, corresponding to whether a record is an attack (uniformly labeled 1, irrespective of attack type) or a normal connection (labeled 0). Details on how the data is processed can be found in Outlier Detection DataSets (ODDS), and interested readers may refer to [1] for more information.
Since the reduced dataset contains normal connections as well as attacks, only normal connections are used when building the one-class SVM model, while attacks are utilized in the prediction phase for evaluating the performance of the model.
For further analysis, we have downloaded the dataset and stored it in SAP HANA in a table named ‘HTTP_DATA_TBL’. To fetch the data, we first set up a connection to the database using hana_ml.
import hana_ml
from hana_ml.dataframe import ConnectionContext
cc = ConnectionContext('xxx.xxx.xxx.xx', 30x15, 'XXX', 'Xxxxxxxx')  # server info hidden away
We create a hana_ml.DataFrame for this dataset in the database and fetch its brief description, illustrated as follows:
http_data = cc.table('HTTP_DATA_TBL')
http_data.describe().collect()
 | column | count | unique | nulls | mean | std | min | max | median | 25_percent_cont | 25_percent_disc | 50_percent_cont | 50_percent_disc | 75_percent_cont | 75_percent_disc |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ID | 567498 | 567498 | 0 | 283748.500000 | 163822.705869 | 0.000000 | 567497.000000 | 283749.000000 | 141874.250000 | 141874.000000 | 283748.500000 | 283748.000000 | 425622.750000 | 425623.000000 |
1 | X0 | 567498 | 457 | 0 | -2.268538 | 0.465346 | -2.302585 | 8.098369 | -2.302585 | -2.302585 | -2.302585 | -2.302585 | -2.302585 | -2.302585 | -2.302585 |
2 | X1 | 567498 | 523 | 0 | 5.557679 | 0.435007 | -2.302585 | 10.906691 | 5.517854 | 5.380358 | 5.380358 | 5.517854 | 5.517854 | 5.723912 | 5.723912 |
3 | X2 | 567498 | 20003 | 0 | 7.489226 | 1.316983 | -2.302585 | 16.277711 | 7.415235 | 6.490875 | 6.490875 | 7.415235 | 7.415235 | 8.372884 | 8.372884 |
4 | Y | 567498 | 2 | 0 | 0.003896 | 0.062297 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
We observe that, besides the prescribed three feature columns (X0, X1 and X2) and the label column (Y), an additional ID column of integer type has been added to the dataset. The label column contains two values: 0 and 1, where normal connections are labeled 0 and attacks are labeled 1.
Let us further divide the data points with respect to their labels to get a better comprehension of the dataset.
normal_connections = cc.sql('SELECT * FROM ({}) WHERE Y = 0'.format(http_data.select_statement))
network_attacks = cc.sql('SELECT * FROM ({}) WHERE Y = 1'.format(http_data.select_statement))
Brief descriptions of the two datasets are as follows:
normal_connections.describe().collect()
 | column | count | unique | nulls | mean | std | min | max | median | 25_percent_cont | 25_percent_disc | 50_percent_cont | 50_percent_disc | 75_percent_cont | 75_percent_disc |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ID | 565287 | 565287 | 0 | 283565.330867 | 164084.702187 | 0.000000 | 567497.000000 | 282644.000000 | 141321.500000 | 141321.000000 | 282644.000000 | 282644.000000 | 425971.500000 | 425972.000000 |
1 | X0 | 565287 | 457 | 0 | -2.268710 | 0.464899 | -2.302585 | 8.098369 | -2.302585 | -2.302585 | -2.302585 | -2.302585 | -2.302585 | -2.302585 | -2.302585 |
2 | X1 | 565287 | 479 | 0 | 5.536925 | 0.279530 | -2.302585 | 10.906691 | 5.517854 | 5.380358 | 5.380358 | 5.517854 | 5.517854 | 5.723912 | 5.723912 |
3 | X2 | 565287 | 20003 | 0 | 7.483348 | 1.315889 | -2.302585 | 16.277711 | 7.409197 | 6.490875 | 6.490875 | 7.409197 | 7.409197 | 8.352578 | 8.352578 |
4 | Y | 565287 | 1 | 0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
network_attacks.describe().collect()
 | column | count | unique | nulls | mean | std | min | max | median | 25_percent_cont | 25_percent_disc | 50_percent_cont | 50_percent_disc | 75_percent_cont | 75_percent_disc |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ID | 2211 | 2211 | 0 | 330579.404342 | 51697.624949 | 201669.000000 | 514439.000000 | 316265.000000 | 312003.500000 | 312003.000000 | 316265.000000 | 316265.000000 | 316817.500000 | 316818.000000 |
1 | X0 | 2211 | 14 | 0 | -2.224571 | 0.566787 | -2.302585 | 2.646175 | -2.302585 | -2.302585 | -2.302585 | -2.302585 | -2.302585 | -2.302585 | -2.302585 |
2 | X1 | 2211 | 46 | 0 | 10.863823 | 0.572168 | -2.302585 | 10.906691 | 10.906691 | 10.906691 | 10.906691 | 10.906691 | 10.906691 | 10.906691 | 10.906691 |
3 | X2 | 2211 | 11 | 0 | 8.991967 | 0.451550 | -2.302585 | 9.025828 | 9.025708 | 9.025708 | 9.025708 | 9.025708 | 9.025708 | 9.025708 | 9.025708 |
4 | Y | 2211 | 1 | 0 | 1.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
From the counts of the two datasets, one can see that there are 2211 data records corresponding to attacks and 565287 corresponding to normal connections, indicating a highly skewed class distribution.
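The skewness can be quantified directly from these counts; the resulting fraction of attacks also explains the mean of the label column Y (0.003896) in the describe() output above:

```python
# Class counts taken from the describe() outputs above
n_attacks = 2211
n_normal = 565287

outlier_fraction = n_attacks / (n_attacks + n_normal)
print(round(outlier_fraction, 6))   # 0.003896, i.e. roughly 0.39% of records are attacks
```

This proportion is also a natural starting point when choosing the nu parameter of one-class SVM later on.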
Data Partition and Model Training
One-class classification builds a model on pure inlier data. To make sure that the training data is not polluted by any outlier, we randomly sample 90% of the normal-connection data and use it as the training set for one-class SVM. The rest (inclusive of the attack data and the remaining 10% of normal connections) is used in the prediction phase for evaluating the performance of the trained model.
from hana_ml.algorithms.pal.partition import train_test_val_split

train_normal, test_normal, _ = train_test_val_split(
    data=normal_connections,
    id_column='ID',
    random_seed=2,
    training_percentage=0.9,
    testing_percentage=0.1,
    validation_percentage=0)
from hana_ml.algorithms.pal.svm import OneClassSVM

# nu is an upper bound on the fraction of training errors;
# use a value close to the real (or empirically guessed) proportion of outliers
osvm = OneClassSVM(kernel='rbf', nu=0.003)
osvm.fit(data=train_normal,
         key='ID',
         features=['X0', 'X1', 'X2'])
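Since nu upper-bounds the fraction of training errors, the chosen value implies a cap on the number of training points allowed to fall outside the learned profile (training size derived from the 90% split of the 565287 normal connections):

```python
# Implied cap on training errors for the chosen nu
train_size = int(0.9 * 565287)               # 508758 training records
nu = 0.003
max_training_errors = int(nu * train_size)   # at most 1526 training points may be excluded
```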
Prediction and Model Evaluation
Now we take the remaining data of normal connections and the data of attacks as the test dataset, and apply the trained one-class SVM model to them, respectively.
normal_res = osvm.predict(data=test_normal,
                          key='ID',
                          features=['X0', 'X1', 'X2'])
attack_res = osvm.predict(data=network_attacks,
                          key='ID',
                          features=['X0', 'X1', 'X2'])
Let us check the prediction result for the data of attacks.
attack_res.collect()
 | ID | SCORE | PROBABILITY |
---|---|---|---|
0 | 311452 | -1 | None |
1 | 311453 | -1 | None |
2 | 311454 | -1 | None |
3 | 311455 | -1 | None |
4 | 311456 | -1 | None |
… | … | … | … |
2206 | 312268 | -1 | None |
2207 | 312269 | -1 | None |
2208 | 312270 | -1 | None |
2209 | 312271 | -1 | None |
2210 | 312272 | -1 | None |
2211 rows × 3 columns
It should be mentioned that in the prediction result of one-class SVM, detected outliers (i.e. attacks) are assigned the value -1 in the SCORE column, while detected inliers (i.e. normal connections) are assigned the value 1. One can see that in the prediction result table above, all observed data points (represented by their IDs) are assigned the value -1, meaning they are all detected as outliers, so the result looks very promising. In fact, we will show in the following that all attacks are labeled correctly by the trained one-class SVM model.
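For comparison against the ground-truth label Y (0 = normal, 1 = attack), the SCORE values can be mapped back to that 0/1 convention; a minimal sketch on plain Python lists (column names as in the result tables above):

```python
# Map one-class SVM scores (1 = inlier, -1 = outlier) to the dataset's
# 0/1 label convention (0 = normal connection, 1 = attack).
def score_to_label(scores):
    return [1 if s == -1 else 0 for s in scores]

print(score_to_label([-1, -1, 1]))   # [1, 1, 0]
```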
Now let us check the prediction result for the test data of normal connections.
normal_res.collect()
 | ID | SCORE | PROBABILITY |
---|---|---|---|
0 | 5271 | 1 | None |
1 | 5488 | 1 | None |
2 | 6117 | 1 | None |
3 | 6452 | 1 | None |
4 | 9474 | 1 | None |
… | … | … | … |
56524 | 494850 | 1 | None |
56525 | 505028 | 1 | None |
56526 | 514290 | 1 | None |
56527 | 514986 | 1 | None |
56528 | 528143 | 1 | None |
56529 rows × 3 columns
All observed points are labeled as inliers, so the prediction result does not look bad either.
Now let us calculate some basic statistics of the prediction result for model evaluation.
attack_correct = cc.sql('SELECT * FROM ({}) WHERE SCORE = -1'.format(attack_res.select_statement)).count()
attack_wrong = attack_res.count() - attack_correct
normal_correct = cc.sql('SELECT * FROM ({}) WHERE SCORE = 1'.format(normal_res.select_statement)).count()
normal_wrong = normal_res.count() - normal_correct
We use the precision, recall, and F1 score of each class for evaluating the performance of the one-class SVM classifier for outlier detection.
For the outlier class (i.e. attacks) in the test dataset, we have:

Precision = attack_correct / (attack_correct + normal_wrong) = 0.8766851704996035
Recall = attack_correct / attack_res.count() = 1.0
F1 = 2 × Precision × Recall / (Precision + Recall) = 0.9342911472638918

For the inlier class (i.e. normal connections) in the test dataset, we have:

Precision = normal_correct / (normal_correct + attack_wrong) = 1.0
Recall = normal_correct / normal_res.count() = 0.9944983990518141
F1 = 2 × Precision × Recall / (Precision + Recall) = 0.9972416117502018
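These figures can be reproduced from the four confusion counts alone. The counts below are inferred from the results reported in this post (all 2211 attacks detected; 311 of the 56529 normal test records flagged incorrectly, which follows from the recall of the inlier class):

```python
# Confusion counts inferred from the prediction step above
attack_correct = 2211    # attacks predicted as outliers (true positives for the attack class)
attack_wrong   = 0       # attacks predicted as inliers
normal_correct = 56218   # normal records predicted as inliers
normal_wrong   = 311     # normal records predicted as outliers (false positives for the attack class)

def prf(tp, fp, fn):
    """Precision, recall and F1 from a confusion triple."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Outlier (attack) class: false positives are the misflagged normals
p_out, r_out, f1_out = prf(attack_correct, normal_wrong, attack_wrong)
# Inlier (normal) class: false positives are the missed attacks
p_in, r_in, f1_in = prf(normal_correct, attack_wrong, normal_wrong)
```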
Discussion and Summary
In this blog post, we have shown readers how to use a one-class classification method, namely one-class SVM, for outlier detection. The major difference between multi-class classification and one-class classification lies in the training data: the former requires the training data to have multiple (i.e. more than one) labels, while the latter requires merely a single label. The whole detection procedure is not much different from other traditional supervised classification methods: first take out a collection of inlier points, then build a one-class SVM model on it, and finally apply the model to new points to determine whether they are classified as inliers or outliers.
The major drawback of one-class classification for outlier detection is its limited scope of applicability:
- Firstly, it mainly applies to cases where the observed data points are all normal (i.e. inliers), or where there are too few outliers to build an effective classification model for outlier detection. So in general, traditional multi-class classification methods, rather than one-class classification, are still the first choice for model training on skewed datasets with outliers as the minority class. We shall cover how to handle the label-imbalance problem when building effective multi-class classification models (for outlier detection) in a separate blog post.
- Secondly, the effectiveness of one-class classification for outlier detection can strongly rely on some intrinsic nature of the inliers and outliers, for example when the inliers aggregate into a big cluster from which the outliers are disjoint. Otherwise, one-class classification often fails to work well, or may need a very intricate design procedure to make it work well.
References
[1] Shebuti Rayana (2016). ODDS Library [http://odds.cs.stonybrook.edu]. Stony Brook, NY: Stony Brook University, Department of Computer Science.