Technical Articles
Outlier Detection with One-class Classification using Python Machine Learning Client for SAP HANA
We have introduced several methods for outlier detection in previous blog posts, including outlier detection using statistical tests and clustering. Typically, these methods can only detect outliers in the input dataset, and the detection result cannot be generalized to new data points because they do not produce a model. Classification methods can be adopted to overcome this difficulty. However, the generalization power of classification methods does not come for free, since they require the input data to be labeled. In the case of outlier detection, this means that every point in the input data must be labeled either as an inlier or as an outlier.
Usually, classification for outlier detection requires the dataset to contain both inliers and outliers. However, there are cases where the input dataset contains no outliers at all, yet a model for outlier detection is still required. In such cases, one-class classification can be adopted. As its name suggests, one-class classification requires the input (i.e. training) data to be labeled with a single class, yet the trained model is still able to produce labels of the opposing class, like other binary classification models.
The one-class support vector machine (one-class SVM) is perhaps the most frequently used method for one-class classification. This method is provided in the SAP HANA Predictive Analysis Library (PAL) and wrapped by the Python machine learning client for SAP HANA (hana_ml), and in this blog post it is adopted to solve the outlier detection problem.
After reading this blog post, you will learn:
- How to build a one-class SVM model given a dataset
- How to apply the trained model to the prediction data, extract the information of detected outliers, and evaluate the performance of the trained model
Introduction
Suppose a new system has been set up and runs smoothly in its early phase. In the meantime, we want to build a monitoring system that detects malfunctions as the new system keeps running. What can we do? Waiting until the system makes some mistake sounds like a terrible idea, so we should be able to use the normal data at hand to initialize the monitoring system. There are plenty of similar cases in real life where one-class classification becomes applicable.
Different from traditional classification methods, one-class classification tries to explore the inherent structure of a training dataset with a single label, and builds a model that contains or characterizes the training dataset, so that when a new point arrives the model can tell whether that point is similar to the points in the training dataset or different from them. One-class classification methods can be derived via density estimation, boundary estimation, or reconstruction-based estimation w.r.t. the input data. Among them, one-class SVM is a boundary-estimation based method.
Basically, for outlier detection using one-class SVM, in the training phase a profile is drawn that encircles (almost) all points in the input data (all being inliers); in the prediction phase, if a sample point falls into the region enclosed by this profile it is treated as an inlier, otherwise it is treated as an outlier.
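The boundary idea can be illustrated with a much-simplified, hypothetical stand-in (this is not the actual one-class SVM optimization): enclose the training inliers in a sphere around their centroid, with the radius set to the (1 - nu) quantile of the training distances, so that roughly a fraction nu of the training points fall outside the "profile".

```python
import numpy as np

# Toy stand-in for a boundary-based one-class model (not the real SVM):
# learn a spherical "profile" around inlier-only training data.
rng = np.random.default_rng(0)
train = rng.normal(loc=0.0, scale=1.0, size=(1000, 2))   # inliers only
nu = 0.01                                                # tolerated training-error fraction

center = train.mean(axis=0)
radius = np.quantile(np.linalg.norm(train - center, axis=1), 1 - nu)

def is_inlier(points):
    """True for points inside the learned spherical profile."""
    return np.linalg.norm(points - center, axis=1) <= radius

print(is_inlier(np.array([[0.0, 0.0], [10.0, 10.0]])))   # first point inside, second outside
```

A real one-class SVM replaces the sphere with a far more flexible boundary shaped by the chosen kernel, but the role of nu is analogous.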
In the remainder of this blog post, we show a detailed case study of network intrusion detection using one-class SVM, where attacks are taken as outliers and normal connections as inliers.
The Case Study : One-class SVM for Network Intrusion Detection
Dataset Description
In this case study, a reduced version of the renowned KDD Cup 1999 dataset is used. The original dataset is for computer network intrusion detection; it contains 41 feature columns and a label column with 23 classes (representing whether a connection is normal or an attack with a detailed attack type). In the reduced dataset, the computer network service type is restricted to http, and only the three most basic features are kept, together with a categorical label column of 2 classes, corresponding to whether a record is an attack (uniformly labeled 1, irrespective of attack type) or a normal connection (labeled 0). Details on how the data is processed can be found in Outlier Detection DataSets (ODDS), and interested readers may refer to [1] for more information.
Since the reduced dataset contains normal connections as well as attacks, only normal connections are used when building the one-class SVM model, while attacks are utilized in the prediction phase for evaluating the performance of the model.
For further analysis, we have downloaded the dataset and stored it in SAP HANA in a table named ‘HTTP_DATA_TBL’. To fetch the data, we first set up a connection to the database using hana_ml.
import hana_ml
from hana_ml.dataframe import ConnectionContext
cc = ConnectionContext('xxx.xxx.xxx.xx', 30x15, 'XXX', 'Xxxxxxxx')  # server info hidden away
We create a hana_ml.DataFrame for this dataset in the database and fetch its brief description, illustrated as follows:
http_data = cc.table('HTTP_DATA_TBL')
http_data.describe().collect()
 | column | count | unique | nulls | mean | std | min | max | median | 25_percent_cont | 25_percent_disc | 50_percent_cont | 50_percent_disc | 75_percent_cont | 75_percent_disc |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ID | 567498 | 567498 | 0 | 283748.500000 | 163822.705869 | 0.000000 | 567497.000000 | 283749.000000 | 141874.250000 | 141874.000000 | 283748.500000 | 283748.000000 | 425622.750000 | 425623.000000 |
1 | X0 | 567498 | 457 | 0 | -2.268538 | 0.465346 | -2.302585 | 8.098369 | -2.302585 | -2.302585 | -2.302585 | -2.302585 | -2.302585 | -2.302585 | -2.302585 |
2 | X1 | 567498 | 523 | 0 | 5.557679 | 0.435007 | -2.302585 | 10.906691 | 5.517854 | 5.380358 | 5.380358 | 5.517854 | 5.517854 | 5.723912 | 5.723912 |
3 | X2 | 567498 | 20003 | 0 | 7.489226 | 1.316983 | -2.302585 | 16.277711 | 7.415235 | 6.490875 | 6.490875 | 7.415235 | 7.415235 | 8.372884 | 8.372884 |
4 | Y | 567498 | 2 | 0 | 0.003896 | 0.062297 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
We observe that, besides the prescribed three feature columns (X0, X1 and X2) and the label column (Y), an additional ID column of integer type has been added to the dataset. The label column contains two values: 0 and 1, where normal connections are labeled 0 and attacks are labeled 1.
Let us further divide the data points with respect to their labels to get a better comprehension of the dataset.
normal_connections = cc.sql('SELECT * FROM ({}) WHERE Y = 0'.format(http_data.select_statement))
network_attacks = cc.sql('SELECT * FROM ({}) WHERE Y = 1'.format(http_data.select_statement))
Brief descriptions of the two datasets are as follows:
normal_connections.describe().collect()
 | column | count | unique | nulls | mean | std | min | max | median | 25_percent_cont | 25_percent_disc | 50_percent_cont | 50_percent_disc | 75_percent_cont | 75_percent_disc |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ID | 565287 | 565287 | 0 | 283565.330867 | 164084.702187 | 0.000000 | 567497.000000 | 282644.000000 | 141321.500000 | 141321.000000 | 282644.000000 | 282644.000000 | 425971.500000 | 425972.000000 |
1 | X0 | 565287 | 457 | 0 | -2.268710 | 0.464899 | -2.302585 | 8.098369 | -2.302585 | -2.302585 | -2.302585 | -2.302585 | -2.302585 | -2.302585 | -2.302585 |
2 | X1 | 565287 | 479 | 0 | 5.536925 | 0.279530 | -2.302585 | 10.906691 | 5.517854 | 5.380358 | 5.380358 | 5.517854 | 5.517854 | 5.723912 | 5.723912 |
3 | X2 | 565287 | 20003 | 0 | 7.483348 | 1.315889 | -2.302585 | 16.277711 | 7.409197 | 6.490875 | 6.490875 | 7.409197 | 7.409197 | 8.352578 | 8.352578 |
4 | Y | 565287 | 1 | 0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
network_attacks.describe().collect()
 | column | count | unique | nulls | mean | std | min | max | median | 25_percent_cont | 25_percent_disc | 50_percent_cont | 50_percent_disc | 75_percent_cont | 75_percent_disc |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ID | 2211 | 2211 | 0 | 330579.404342 | 51697.624949 | 201669.000000 | 514439.000000 | 316265.000000 | 312003.500000 | 312003.000000 | 316265.000000 | 316265.000000 | 316817.500000 | 316818.000000 |
1 | X0 | 2211 | 14 | 0 | -2.224571 | 0.566787 | -2.302585 | 2.646175 | -2.302585 | -2.302585 | -2.302585 | -2.302585 | -2.302585 | -2.302585 | -2.302585 |
2 | X1 | 2211 | 46 | 0 | 10.863823 | 0.572168 | -2.302585 | 10.906691 | 10.906691 | 10.906691 | 10.906691 | 10.906691 | 10.906691 | 10.906691 | 10.906691 |
3 | X2 | 2211 | 11 | 0 | 8.991967 | 0.451550 | -2.302585 | 9.025828 | 9.025708 | 9.025708 | 9.025708 | 9.025708 | 9.025708 | 9.025708 | 9.025708 |
4 | Y | 2211 | 1 | 0 | 1.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
From the counts of the two datasets, one can see that there are 2211 data records corresponding to attacks and 565287 corresponding to normal connections, indicating a highly skewed class distribution.
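The skewness can be quantified directly from these counts; the resulting fraction of attacks also explains the mean of the label column Y (0.003896) in the describe() output above:

```python
# Class counts taken from the describe() outputs above
n_attacks = 2211
n_normal = 565287

outlier_fraction = n_attacks / (n_attacks + n_normal)
print(round(outlier_fraction, 6))   # 0.003896, i.e. roughly 0.39% of records are attacks
```

This proportion is also a natural starting point when choosing the nu parameter of one-class SVM later on.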
Data Partition and Model Training
One-class classification builds a model on pure inlier data. To make sure that the training data is not polluted by any outlier, we randomly sample 90% of the normal-connection data and use it as the training set for one-class SVM. The rest (inclusive of the attack data and the remaining 10% of normal connections) is used in the prediction phase for evaluating the performance of the trained model.
from hana_ml.algorithms.pal.partition import train_test_val_split

train_normal, test_normal, _ = train_test_val_split(
    data=normal_connections,
    id_column='ID',
    random_seed=2,
    training_percentage=0.9,
    testing_percentage=0.1,
    validation_percentage=0)
from hana_ml.algorithms.pal.svm import OneClassSVM

# nu is an upper bound on the fraction of training errors;
# use a value close to the real (or empirically guessed) proportion of outliers
osvm = OneClassSVM(kernel='rbf', nu=0.003)
osvm.fit(data=train_normal,
         key='ID',
         features=['X0', 'X1', 'X2'])
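Since nu upper-bounds the fraction of training errors, the chosen value implies a cap on the number of training points allowed to fall outside the learned profile (training size derived from the 90% split of the 565287 normal connections):

```python
# Implied cap on training errors for the chosen nu
train_size = int(0.9 * 565287)               # 508758 training records
nu = 0.003
max_training_errors = int(nu * train_size)   # at most 1526 training points may be excluded
```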
Prediction and Model Evaluation
Now we take the remaining data of normal connections and the data of attacks as the test dataset, and apply the trained one-class SVM model to them, respectively.
normal_res = osvm.predict(data=test_normal,
                          key='ID',
                          features=['X0', 'X1', 'X2'])
attack_res = osvm.predict(data=network_attacks,
                          key='ID',
                          features=['X0', 'X1', 'X2'])
Let us check the prediction result for the data of attacks.
attack_res.collect()
 | ID | SCORE | PROBABILITY |
---|---|---|---|
0 | 311452 | -1 | None |
1 | 311453 | -1 | None |
2 | 311454 | -1 | None |
3 | 311455 | -1 | None |
4 | 311456 | -1 | None |
… | … | … | … |
2206 | 312268 | -1 | None |
2207 | 312269 | -1 | None |
2208 | 312270 | -1 | None |
2209 | 312271 | -1 | None |
2210 | 312272 | -1 | None |
2211 rows × 3 columns
It should be mentioned that in the prediction result of one-class SVM, detected outliers (i.e. attacks) are assigned the value -1 in the SCORE column, while detected inliers (i.e. normal connections) are assigned the value 1. One can see that in the prediction result table above, all observed data points (represented by their IDs) are assigned the value -1, meaning they are all detected as outliers, so the result looks very promising. In fact, we will show in the following that all attacks are labeled correctly by the trained one-class SVM model.
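For comparison against the ground-truth label Y (0 = normal, 1 = attack), the SCORE values can be mapped back to that 0/1 convention; a minimal sketch on plain Python lists (column names as in the result tables above):

```python
# Map one-class SVM scores (1 = inlier, -1 = outlier) to the dataset's
# 0/1 label convention (0 = normal connection, 1 = attack).
def score_to_label(scores):
    return [1 if s == -1 else 0 for s in scores]

print(score_to_label([-1, -1, 1]))   # [1, 1, 0]
```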
Now let us check the prediction result for the test data of normal connections.
normal_res.collect()
 | ID | SCORE | PROBABILITY |
---|---|---|---|
0 | 5271 | 1 | None |
1 | 5488 | 1 | None |
2 | 6117 | 1 | None |
3 | 6452 | 1 | None |
4 | 9474 | 1 | None |
… | … | … | … |
56524 | 494850 | 1 | None |
56525 | 505028 | 1 | None |
56526 | 514290 | 1 | None |
56527 | 514986 | 1 | None |
56528 | 528143 | 1 | None |
56529 rows × 3 columns
All observed points are labeled as inliers, so the prediction result does not look bad either.
Now let us calculate some basic statistics of the prediction result for model evaluation.
attack_correct = cc.sql('SELECT * FROM ({}) WHERE SCORE = -1'.format(attack_res.select_statement)).count()
attack_wrong = attack_res.count() - attack_correct
normal_correct = cc.sql('SELECT * FROM ({}) WHERE SCORE = 1'.format(normal_res.select_statement)).count()
normal_wrong = normal_res.count() - normal_correct
We use the precision, recall, and F1 score of each class for evaluating the performance of the one-class SVM classifier for outlier detection.
For the outlier class (i.e. attacks) in the test dataset, we have:

Precision = attack_correct / (attack_correct + normal_wrong) = 0.8766851704996035
Recall = attack_correct / attack_res.count() = 1.0
F1 = 2 × Precision × Recall / (Precision + Recall) = 0.9342911472638918

For the inlier class (i.e. normal connections) in the test dataset, we have:

Precision = normal_correct / (normal_correct + attack_wrong) = 1.0
Recall = normal_correct / normal_res.count() = 0.9944983990518141
F1 = 2 × Precision × Recall / (Precision + Recall) = 0.9972416117502018
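These figures can be reproduced from the four confusion counts alone. The counts below are inferred from the results reported in this post (all 2211 attacks detected; 311 of the 56529 normal test records flagged incorrectly, which follows from the recall of the inlier class):

```python
# Confusion counts inferred from the prediction step above
attack_correct = 2211    # attacks predicted as outliers (true positives for the attack class)
attack_wrong   = 0       # attacks predicted as inliers
normal_correct = 56218   # normal records predicted as inliers
normal_wrong   = 311     # normal records predicted as outliers (false positives for the attack class)

def prf(tp, fp, fn):
    """Precision, recall and F1 from a confusion triple."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Outlier (attack) class: false positives are the misflagged normals
p_out, r_out, f1_out = prf(attack_correct, normal_wrong, attack_wrong)
# Inlier (normal) class: false positives are the missed attacks
p_in, r_in, f1_in = prf(normal_correct, attack_wrong, normal_wrong)
```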
Discussion and Summary
In this blog post, we have shown readers how to use a one-class classification method, namely one-class SVM, for outlier detection. The major difference between multi-class classification and one-class classification lies in the training data: the former requires the training data to have multiple (i.e. more than one) labels, while the latter requires merely a single label. The whole detection procedure is not much different from other traditional supervised classification methods: first take out a collection of inlier points, then build a one-class SVM model on it, and finally apply the model to new points to determine whether they are classified as inliers or outliers.
The major drawback of one-class classification for outlier detection is its limited scope of applicability:
- Firstly, it mainly applies to cases where the observed data points are all normal (i.e. inliers), or where there are too few outliers to build an effective classification model for outlier detection. So in general, traditional multi-class classification methods, rather than one-class classification, are still the first choice for model training on skewed datasets with outliers as the minority class. We shall cover how to handle the label-imbalance problem when building effective multi-class classification models (for outlier detection) in a separate blog post.
- Secondly, the effectiveness of one-class classification for outlier detection can strongly rely on some intrinsic nature of the inliers and outliers, for example when the inliers aggregate into a big cluster from which the outliers are disjoint. Otherwise, one-class classification often fails to work well, or may need a very intricate design procedure to make it work well.
References
[1] Shebuti Rayana (2016). ODDS Library [http://odds.cs.stonybrook.edu]. Stony Brook, NY: Stony Brook University, Department of Computer Science.