Technology Blogs by SAP
likun_hou
Advisor

We have introduced several methods for outlier detection in separate blog posts, including outlier detection using statistical tests and clustering. Typically, these methods can only detect outliers in the input dataset, and the detection result cannot be generalized to new data points because no model is produced. Classification methods can be adopted to overcome this difficulty. However, the generalization power of classification methods does not come for free, since they require the input data to be labeled. For outlier detection, this means that every point in the input data must be labeled as either an inlier or an outlier.

Usually, classification for outlier detection requires the dataset to contain both inliers and outliers. However, there are cases where the input dataset contains no outliers, yet a model for outlier detection is still required. In such cases, one-class classification can be adopted. As its name suggests, one-class classification requires the input (i.e. training) data to carry a single class label, yet the trained model can still produce the label of the opposing class, like other binary classification models.

One-class support vector machine (i.e. one-class SVM) is perhaps the most frequently used method for one-class classification. This method is provided in the SAP HANA Predictive Analysis Library (PAL) and wrapped by the Python machine learning client for SAP HANA (hana_ml). In this blog post it is adopted to solve the outlier detection problem.

Basically, after reading this blog post, you will learn:

    • How to build a one-class SVM model given a dataset
    • How to apply the trained model to the prediction data, extract the information of detected outliers and evaluate the performance of the trained model

 

Introduction


Suppose a new system has been set up, and it runs very smoothly in the early phase. In the meantime, we want to build a monitoring system to detect malfunctions as the new system keeps running. What can we do? Waiting until the system makes a mistake sounds like a terrible idea, so we should be able to use the normal data at hand to initialize the monitoring system. There are plenty of similar cases in real life where one-class classification becomes applicable.

Unlike traditional classification methods, one-class classification tries to explore the inherent structure of a training dataset with a single label, and builds a model that contains or characterizes the training dataset, so that when a new point arrives it can tell whether the point is similar to the points in the training dataset or different from them. One-class classification methods can be derived via density estimation, boundary estimation, or reconstruction-based estimation w.r.t. the input data. Among them, one-class SVM is a boundary-estimation based method.

For outlier detection using one-class SVM, in the training phase a profile (i.e. a boundary) is drawn to enclose (almost) all points in the input data (all being inliers); in the prediction phase, a sample point that falls into the region enclosed by this profile is treated as an inlier, otherwise it is treated as an outlier.
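
To make the boundary idea concrete, the sketch below evaluates a one-class SVM style decision function with an RBF kernel in plain Python. The support vectors, the coefficients in `ALPHAS`, and the values of `GAMMA` and `RHO` are made-up illustrative numbers, not the output of PAL or hana_ml; only the sign convention (1 for inlier, -1 for outlier) follows the SCORE column described later in this post.

```python
import math

def rbf_kernel(a, b, gamma):
    """RBF kernel exp(-gamma * ||a - b||^2) for two points given as tuples."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * sq_dist)

def decision_value(x, support_vectors, alphas, rho, gamma):
    """Weighted kernel similarity to the support vectors, minus the offset rho."""
    return sum(alpha * rbf_kernel(sv, x, gamma)
               for sv, alpha in zip(support_vectors, alphas)) - rho

def predict_label(x, support_vectors, alphas, rho, gamma):
    """Return 1 (inlier) when the decision value is non-negative, else -1 (outlier)."""
    return 1 if decision_value(x, support_vectors, alphas, rho, gamma) >= 0 else -1

# Hypothetical model parameters, chosen by hand for illustration only.
SUPPORT_VECTORS = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
ALPHAS = [0.4, 0.3, 0.3]
RHO = 0.5
GAMMA = 1.0

# A point close to the support vectors versus a point far away from them.
near_label = predict_label((0.0, 0.0), SUPPORT_VECTORS, ALPHAS, RHO, GAMMA)
far_label = predict_label((5.0, 5.0), SUPPORT_VECTORS, ALPHAS, RHO, GAMMA)
print(near_label, far_label)  # prints: 1 -1
```

The point next to the support vectors gets a positive decision value and is accepted as an inlier, while the distant point has almost no kernel similarity to any support vector, so its value drops to roughly `-RHO` and it is rejected as an outlier.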

In the rest of this blog post, we present a detailed case study of network intrusion detection using one-class SVM, where attacks are taken as outliers and normal connections as inliers.

The Case Study: One-class SVM for Network Intrusion Detection

 

Dataset Description


In this case study, a reduced version of the renowned KDD Cup 1999 dataset is used. The original dataset is for computer network intrusion detection and contains 41 feature columns and a label column with 23 classes (indicating whether a connection is normal or an attack, with the detailed attack type). In the reduced dataset, the network service type is restricted to http, and only the three most basic features are kept, together with a categorical label column of 2 classes that indicates whether a record is an attack (uniformly labeled by 1 irrespective of attack type) or a normal connection (labeled by 0). Details on how the data is processed can be found in Outlier Detection DataSets (ODDS), and interested readers may refer to [1] for more information.

Since the reduced dataset contains normal connections as well as attacks, only normal connections are used when building the one-class SVM model, while attacks are used in the prediction phase to evaluate the performance of the model.

For further analysis, we have downloaded the dataset and stored it in SAP HANA in a table named 'HTTP_DATA_TBL'. To fetch the data, we first set up a connection to the database using hana_ml.

import hana_ml
from hana_ml.dataframe import ConnectionContext

cc = ConnectionContext('xxx.xxx.xxx.xx', 30x15, 'XXX', 'Xxxxxxxx')  # server info hidden away

We create a hana_ml.DataFrame for this dataset in the database and fetch its brief description, illustrated as follows:
http_data = cc.table('HTTP_DATA_TBL')

http_data.describe().collect()

| column | count | unique | nulls | mean | std | min | max | median | 25_percent_cont | 25_percent_disc | 50_percent_cont | 50_percent_disc | 75_percent_cont | 75_percent_disc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ID | 567498 | 567498 | 0 | 283748.500000 | 163822.705869 | 0.000000 | 567497.000000 | 283749.000000 | 141874.250000 | 141874.000000 | 283748.500000 | 283748.000000 | 425622.750000 | 425623.000000 |
| X0 | 567498 | 457 | 0 | -2.268538 | 0.465346 | -2.302585 | 8.098369 | -2.302585 | -2.302585 | -2.302585 | -2.302585 | -2.302585 | -2.302585 | -2.302585 |
| X1 | 567498 | 523 | 0 | 5.557679 | 0.435007 | -2.302585 | 10.906691 | 5.517854 | 5.380358 | 5.380358 | 5.517854 | 5.517854 | 5.723912 | 5.723912 |
| X2 | 567498 | 20003 | 0 | 7.489226 | 1.316983 | -2.302585 | 16.277711 | 7.415235 | 6.490875 | 6.490875 | 7.415235 | 7.415235 | 8.372884 | 8.372884 |
| Y | 567498 | 2 | 0 | 0.003896 | 0.062297 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |

We observe that, besides the prescribed three feature columns(X0, X1 and X2) and the label column(Y), an additional ID column of integer type has also been added to the dataset. The label column contains two values: 0 and 1, where  the normal connections are labeled by 0 and attacks labeled by 1.

Let us further divide the data points by label to gain a better understanding of the dataset.

normal_connections = cc.sql('SELECT * FROM ({}) WHERE Y = 0'.format(http_data.select_statement))

network_attacks = cc.sql('SELECT * FROM ({}) WHERE Y = 1'.format(http_data.select_statement))

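Brief descriptions of the two datasets are shown below, starting with the normal connections:

normal_connections.describe().collect()

| column | count | unique | nulls | mean | std | min | max | median | 25_percent_cont | 25_percent_disc | 50_percent_cont | 50_percent_disc | 75_percent_cont | 75_percent_disc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ID | 565287 | 565287 | 0 | 283565.330867 | 164084.702187 | 0.000000 | 567497.000000 | 282644.000000 | 141321.500000 | 141321.000000 | 282644.000000 | 282644.000000 | 425971.500000 | 425972.000000 |
| X0 | 565287 | 457 | 0 | -2.268710 | 0.464899 | -2.302585 | 8.098369 | -2.302585 | -2.302585 | -2.302585 | -2.302585 | -2.302585 | -2.302585 | -2.302585 |
| X1 | 565287 | 479 | 0 | 5.536925 | 0.279530 | -2.302585 | 10.906691 | 5.517854 | 5.380358 | 5.380358 | 5.517854 | 5.517854 | 5.723912 | 5.723912 |
| X2 | 565287 | 20003 | 0 | 7.483348 | 1.315889 | -2.302585 | 16.277711 | 7.409197 | 6.490875 | 6.490875 | 7.409197 | 7.409197 | 8.352578 | 8.352578 |
| Y | 565287 | 1 | 0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |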

network_attacks.describe().collect()

| column | count | unique | nulls | mean | std | min | max | median | 25_percent_cont | 25_percent_disc | 50_percent_cont | 50_percent_disc | 75_percent_cont | 75_percent_disc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ID | 2211 | 2211 | 0 | 330579.404342 | 51697.624949 | 201669.000000 | 514439.000000 | 316265.000000 | 312003.500000 | 312003.000000 | 316265.000000 | 316265.000000 | 316817.500000 | 316818.000000 |
| X0 | 2211 | 14 | 0 | -2.224571 | 0.566787 | -2.302585 | 2.646175 | -2.302585 | -2.302585 | -2.302585 | -2.302585 | -2.302585 | -2.302585 | -2.302585 |
| X1 | 2211 | 46 | 0 | 10.863823 | 0.572168 | -2.302585 | 10.906691 | 10.906691 | 10.906691 | 10.906691 | 10.906691 | 10.906691 | 10.906691 | 10.906691 |
| X2 | 2211 | 11 | 0 | 8.991967 | 0.451550 | -2.302585 | 9.025828 | 9.025708 | 9.025708 | 9.025708 | 9.025708 | 9.025708 | 9.025708 | 9.025708 |
| Y | 2211 | 1 | 0 | 1.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |

From the row counts of the two datasets, one can see that there are 2211 records corresponding to attacks and 565287 records corresponding to normal connections, indicating a highly skewed class distribution.
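
The degree of skew can be quantified with a quick calculation. The sketch below uses only the two row counts reported above; the variable names are illustrative.

```python
# Row counts taken from the describe() output above.
n_attacks = 2211
n_normal = 565287

n_total = n_attacks + n_normal
attack_fraction = n_attacks / n_total

print(n_total)                    # prints: 567498
print(round(attack_fraction, 6))  # prints: 0.003896
```

The total agrees with the row count of the full table, and the fraction agrees with the mean of the Y column in the first description, so attacks make up roughly 0.39% of all records.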

 

Data Partition and Model Training


One-class classification builds a model on pure inlier data. To make sure that the training data is not polluted by any outlier, we randomly sample 90% of the normal connections and use the sampled data as the training set for one-class SVM. The rest (the attack data together with the remaining 10% of normal connections) is used in the prediction phase to evaluate the performance of the trained model.
from hana_ml.algorithms.pal.partition import train_test_val_split

train_normal, test_normal, _ = train_test_val_split(data=normal_connections,
                                                    id_column='ID',
                                                    random_seed=2,
                                                    training_percentage=0.9,
                                                    testing_percentage=0.1,
                                                    validation_percentage=0)

from hana_ml.algorithms.pal.svm import OneClassSVM

# nu is an upper bound on the fraction of training errors;
# use a value close to the real (or empirically guessed) proportion of outliers
osvm = OneClassSVM(kernel='rbf',
                   nu=0.003)

osvm.fit(data=train_normal,
         key='ID',
         features=['X0', 'X1', 'X2'])
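
For intuition about what the partition step does, the following plain-Python sketch mimics a seeded random 90/10 split on an ID column. It is not how PAL implements `train_test_val_split`; the helper `split_ids` and its rounding rule are assumptions made for illustration, and the IDs used here are simply 0 to 565286 rather than the real IDs of the normal connections.

```python
import random

def split_ids(ids, test_fraction, seed):
    """Shuffle the IDs with a fixed seed and cut off a test portion."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    # Everything after the first n_test IDs is training data.
    return shuffled[n_test:], shuffled[:n_test]

# 565287 is the number of normal connections reported earlier.
train_ids, test_ids = split_ids(range(565287), test_fraction=0.1, seed=2)

print(len(train_ids), len(test_ids))  # prints: 508758 56529

# No ID may end up in both parts.
assert set(train_ids).isdisjoint(test_ids)
```

Under this rounding assumption, the test part holds 56529 IDs, the same size as the test set of normal connections shown in the prediction result of this post. Fixing the seed makes the split reproducible across runs.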
Prediction and Model Evaluation
 
Now we take the remaining data of normal connections and the data of attacks as the test dataset, and apply the trained one-class SVM model to them, respectively.
normal_res = osvm.predict(data=test_normal,
                          key='ID',
                          features=['X0', 'X1', 'X2'])

attack_res = osvm.predict(data=network_attacks,
                          key='ID',
                          features=['X0', 'X1', 'X2'])

Let us check the prediction result for the data of attacks.
attack_res.collect()

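| ID | SCORE | PROBABILITY |
|---|---|---|
| 311452 | -1 | None |
| 311453 | -1 | None |
| 311454 | -1 | None |
| 311455 | -1 | None |
| 311456 | -1 | None |
| ... | ... | ... |
| 312268 | -1 | None |
| 312269 | -1 | None |
| 312270 | -1 | None |
| 312271 | -1 | None |
| 312272 | -1 | None |

2211 rows × 3 columns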

Note that in the prediction result of one-class SVM, detected outliers (i.e. attacks) are assigned the value -1 in the SCORE column, while detected inliers (i.e. normal connections) are assigned the value 1. In the prediction result table above, all displayed data points (represented by their IDs) are assigned the value -1, indicating that they are all detected as outliers, so the result looks very promising. In fact, we will show below that all attacks are labeled correctly by the trained one-class SVM model.
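
Because the SCORE column uses 1/-1 while the label column Y uses 0/1, it can help to translate one convention into the other before comparing them. A minimal sketch, where the helper name `score_to_label` and the small sample lists are hypothetical:

```python
def score_to_label(score):
    """Map a SCORE value (1 = inlier, -1 = outlier) to Y (0 = normal, 1 = attack)."""
    if score == -1:
        return 1
    if score == 1:
        return 0
    raise ValueError("unexpected SCORE value: {!r}".format(score))

# Made-up scores and ground-truth labels, for illustration only.
scores = [-1, -1, 1, 1, -1]
truth = [1, 1, 0, 0, 0]

predicted = [score_to_label(s) for s in scores]
n_agree = sum(p == t for p, t in zip(predicted, truth))
print(predicted, n_agree)  # prints: [1, 1, 0, 0, 1] 4
```

In this toy input the last record is a normal connection that was flagged as an attack, which is the kind of error that lowers the Precision of the outlier class.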

Now let us check the prediction result for the test data of normal connections.
normal_res.collect()
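
| ID | SCORE | PROBABILITY |
|---|---|---|
| 5271 | 1 | None |
| 5488 | 1 | None |
| 6117 | 1 | None |
| 6452 | 1 | None |
| 9474 | 1 | None |
| ... | ... | ... |
| 494850 | 1 | None |
| 505028 | 1 | None |
| 514290 | 1 | None |
| 514986 | 1 | None |
| 528143 | 1 | None |

56529 rows × 3 columns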

All displayed points are labeled as inliers, so this prediction result looks good as well.

Now let us calculate some basic statistics of the prediction result for model evaluation.


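# Attacks detected as outliers (SCORE = -1) are correct; the rest are missed attacks
attack_correct = cc.sql('SELECT * FROM ({}) WHERE SCORE=-1'.format(attack_res.select_statement)).count()
attack_wrong = attack_res.count() - attack_correct

# Normal connections detected as inliers (SCORE = 1) are correct; the rest are false alarms
normal_correct = cc.sql('SELECT * FROM ({}) WHERE SCORE=1'.format(normal_res.select_statement)).count()
normal_wrong = normal_res.count() - normal_correct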


We use the Precision, Recall and F1 score of the outlier class to evaluate the performance of the one-class SVM classifier for outlier detection.

For the outlier class (i.e. attacks) in the test dataset, we have:

Precision = attack_correct / (attack_correct + normal_wrong) = 0.8766851704996035

Recall = attack_correct / network_attacks.count() = 1.0

F1 = 2 × Precision × Recall / (Precision + Recall) = 0.9342911472638918
 
So among the detected attacks in the test data, roughly 87.7% receive the correct label, as indicated by the Precision score, while the perfect Recall score of 1.0 means that all real attacks in the test data are labeled correctly. Combining these two numbers results in a relatively high F1 score of 0.934. Together, the three numbers indicate a reasonably successful detection of outliers (i.e. attacks).
 
In comparison, for the inlier class (i.e. normal connections) in the test dataset, we have:

Precision = normal_correct / (normal_correct + attack_wrong) = 1.0

Recall = normal_correct / test_normal.count() = 0.9944983990518141

F1 = 2 × Precision × Recall / (Precision + Recall) = 0.9972416117502018
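
The scores above can be reproduced from four counts. In the sketch below, `attack_correct` and `attack_wrong` follow from the perfect Recall on 2211 attacks, while `normal_correct = 56218` and `normal_wrong = 311` are not printed in this post; they are inferred here from the reported inlier Recall on 56529 test records, so treat them as an assumption. The helper `precision_recall_f1` is likewise illustrative.

```python
def precision_recall_f1(true_pos, false_pos, false_neg):
    """Compute Precision, Recall and F1 for one class from its confusion counts."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts for the test data; the two normal_* values are inferred, not reported.
attack_correct, attack_wrong = 2211, 0
normal_correct, normal_wrong = 56218, 311

# Outlier class: a false positive is a normal connection flagged as an attack.
outlier_scores = precision_recall_f1(attack_correct, normal_wrong, attack_wrong)
# Inlier class: a false positive is an attack accepted as a normal connection.
inlier_scores = precision_recall_f1(normal_correct, attack_wrong, normal_wrong)

print(outlier_scores)
print(inlier_scores)
```

Note how the same 311 misclassified normal connections appear as false positives for the outlier class and as false negatives for the inlier class, which is why the outlier Precision and the inlier Recall are the two scores below 1.0.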

Discussion and Summary


In this blog post, we have shown readers how to use a one-class classification method, one-class SVM, for outlier detection. The major difference between multi-class classification and one-class classification lies in the training data: the former requires the training data to have multiple (i.e. more than one) labels, while the latter requires merely a single label. The overall detection procedure is not much different from that of traditional supervised classification methods: first collect a set of inlier points, then build a one-class SVM model on it, and finally apply the model to new points to determine whether they are classified as inliers or outliers.

The major drawback of one-class classification for outlier detection is its limited scope of applicability, as explained below:
    • Firstly, it mainly applies to cases where the observed data points are all normal (i.e. inliers), or where there are too few outliers to build an effective classification model for outlier detection. So in general, traditional multi-class classification methods, rather than one-class classification, remain the first choice for model training on skewed datasets with outliers being the minority class. We shall cover how to handle the label imbalance problem when building effective multi-class classification models (for outlier detection) in a separate blog post.
    • Secondly, the effectiveness of one-class classification for outlier detection can rely strongly on the intrinsic nature of inliers and outliers, for example whether inliers are aggregated into one big cluster from which outliers are disjoint. When this is not the case, one-class classification often fails to work well, or may need a very intricate design procedure to make it work well.

 

References

[1] Shebuti Rayana (2016). ODDS Library [http://odds.cs.stonybrook.edu]. Stony Brook, NY: Stony Brook University, Department of Computer Science.