# Outlier Detection with One-class Classification using Python Machine Learning Client for SAP HANA

We have introduced several methods for outlier detection in a few separate blog posts, including outlier detection using statistical tests and clustering. Typically, these methods can only detect outliers in the input dataset, and the detection result cannot be generalized to new data points, because they do not produce a model. Classification methods can be adopted to overcome this difficulty. However, the power of generalization for classification methods does not come for free, since they require the input data to be labeled. In the case of outlier detection, this means that every point in the input data must be labeled either as an inlier or an outlier.

Usually, classification for outlier detection requires the dataset to contain both inliers and outliers. However, there are cases where there are no outliers in the input dataset, yet a model for outlier detection is still required. In such cases, one-class classification can be adopted. As its name suggests, one-class classification requires the input (i.e. training) data to be labeled by a single class, yet the trained model is still able to produce the label of the opposing class, like other binary classification models.

One-class support vector machine (i.e. one-class SVM) is perhaps the most frequently used method for one-class classification. This method is provided in the SAP HANA Predictive Analysis Library (PAL) and wrapped up by the Python machine learning client for SAP HANA (hana_ml), and in this blog post it shall be adopted to solve the outlier detection problem.

Basically, after reading this blog post, you will learn:

• How to build a one-class SVM model given a dataset
• How to apply the trained model to the prediction data, extract the information of detected outliers and evaluate the performance of the trained model

## Introduction

Suppose a new system has been set up, and it runs very smoothly in the early phase. In the meantime, we want to build a monitoring system to detect whether malfunctions occur as the new system keeps running. Then, what can we do? Waiting until the system makes some mistake sounds like a terrible idea, so we should be able to use the normal data at hand to initialize the monitoring system. There are plenty of similar cases in real life where one-class classification becomes applicable.

Different from traditional classification methods, one-class classification tries to explore the inherent structure of a training dataset with a single label, and builds a model for containing or characterizing the training dataset, so that when a new point comes it can tell whether the point is similar to the points in the training dataset or different from them. One-class classification methods can be derived via density estimation, boundary estimation, or reconstruction-mode estimation w.r.t. the input data. Among them, one-class SVM is a boundary-estimation based method.

Basically, for outlier detection using one-class SVM, in the training phase a profile is drawn to encircle (almost) all points in the input data (all being inliers); in the prediction phase, if a sample point falls into the region enclosed by the drawn profile it is treated as an inlier, otherwise it is treated as an outlier.
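
This encircle-then-classify behavior can be sketched with a tiny toy example. The snippet below uses scikit-learn's OneClassSVM purely as an illustration (scikit-learn and the toy data are assumptions of this sketch, not part of the HANA workflow in this post); the hana_ml estimator used later follows the same fit/predict pattern:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
# Training data: inliers only, clustered around the origin
inliers = rng.normal(loc=0.0, scale=0.5, size=(200, 2))

# nu bounds the fraction of training points allowed to fall outside the learned profile
model = OneClassSVM(kernel='rbf', nu=0.05).fit(inliers)

# predict() returns +1 for points inside the profile and -1 for points outside it
print(model.predict(np.array([[0.0, 0.0], [5.0, 5.0]])))
```

Points inside the learned profile are labeled +1 (inliers) and points outside it -1 (outliers), which is the same score convention used by the PAL one-class SVM later in this post.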

In the rest of this blog post, we show a detailed case study of network intrusion detection using one-class SVM, where attacks are treated as outliers and normal connections as inliers.

## The Case Study : One-class SVM for Network Intrusion Detection

### Dataset Description

In this case study, a reduced version of the renowned KDD Cup 1999 dataset is used. The original dataset is for computer network intrusion detection; it contains 41 feature columns and a label column with 23 classes (representing whether a connection is normal or an attack, with detailed attack type). In the reduced dataset, the computer network service type is restricted to http, and only the three most basic features are kept, together with a categorical label column of 2 classes, corresponding to whether a record is an attack (uniformly labeled by 1 irrespective of attack type) or a normal connection (labeled by 0). Details on how the data was processed can be found in Outlier Detection DataSets (ODDS), and interested readers may refer to [1] for more information.

Since the reduced dataset contains normal connections as well as attacks, only normal connections are used when building up the one-class SVM model, while attacks are utilized in the prediction phase for evaluating the performance of the model.

For further analysis, we have downloaded the dataset and stored it in SAP HANA in a table named ‘HTTP_DATA_TBL’. To fetch the data, we first set up a connection to the database using hana_ml.

```python
import hana_ml
from hana_ml.dataframe import ConnectionContext

cc = ConnectionContext('xxx.xxx.xxx.xx', 30x15, 'XXX', 'Xxxxxxxx')  # server info hidden away
```

We create a hana_ml.DataFrame for this dataset in the database and fetch its brief description, illustrated as follows:

```python
http_data = cc.table('HTTP_DATA_TBL')
http_data.describe().collect()
```

| | column | count | unique | nulls | mean | std | min | max | median | 25_percent_cont | 25_percent_disc | 50_percent_cont | 50_percent_disc | 75_percent_cont | 75_percent_disc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ID | 567498 | 567498 | 0 | 283748.500000 | 163822.705869 | 0.000000 | 567497.000000 | 283749.000000 | 141874.250000 | 141874.000000 | 283748.500000 | 283748.000000 | 425622.750000 | 425623.000000 |
| 1 | X0 | 567498 | 457 | 0 | -2.268538 | 0.465346 | -2.302585 | 8.098369 | -2.302585 | -2.302585 | -2.302585 | -2.302585 | -2.302585 | -2.302585 |
| 2 | X1 | 567498 | 523 | 0 | 5.557679 | 0.435007 | -2.302585 | 10.906691 | 5.517854 | 5.380358 | 5.380358 | 5.517854 | 5.517854 | 5.723912 | 5.723912 |
| 3 | X2 | 567498 | 20003 | 0 | 7.489226 | 1.316983 | -2.302585 | 16.277711 | 7.415235 | 6.490875 | 6.490875 | 7.415235 | 7.415235 | 8.372884 | 8.372884 |
| 4 | Y | 567498 | 2 | 0 | 0.003896 | 0.062297 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |

We observe that, besides the prescribed three feature columns (X0, X1 and X2) and the label column (Y), an additional ID column of integer type has been added to the dataset. The label column contains two values: 0 and 1, where normal connections are labeled by 0 and attacks by 1.
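
As a quick arithmetic sanity check on the description above (a sketch; the counts 2211 and 567498 are taken from the tables in this post): since Y takes only the values 0 and 1, its mean equals the proportion of attacks in the dataset.

```python
n_total = 567498   # total number of records
n_attacks = 2211   # number of attack records (Y = 1)

# The mean of a 0/1 column equals the fraction of ones
attack_fraction = n_attacks / n_total
print(round(attack_fraction, 6))  # 0.003896, matching the mean of Y above
```

This tiny proportion of outliers also motivates the choice of nu = 0.003 in the model-training section later.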

Let us further divide the data points with respect to their labels, to gain a better understanding of the dataset.

```python
normal_connections = cc.sql('SELECT * FROM ({}) WHERE Y = 0'.format(http_data.select_statement))
network_attacks = cc.sql('SELECT * FROM ({}) WHERE Y = 1'.format(http_data.select_statement))
```

Brief descriptions of the two datasets can be illustrated as follows:

```python
normal_connections.describe().collect()
```

| | column | count | unique | nulls | mean | std | min | max | median | 25_percent_cont | 25_percent_disc | 50_percent_cont | 50_percent_disc | 75_percent_cont | 75_percent_disc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ID | 565287 | 565287 | 0 | 283565.330867 | 164084.702187 | 0.000000 | 567497.000000 | 282644.000000 | 141321.500000 | 141321.000000 | 282644.000000 | 282644.000000 | 425971.500000 | 425972.000000 |
| 1 | X0 | 565287 | 457 | 0 | -2.268710 | 0.464899 | -2.302585 | 8.098369 | -2.302585 | -2.302585 | -2.302585 | -2.302585 | -2.302585 | -2.302585 |
| 2 | X1 | 565287 | 479 | 0 | 5.536925 | 0.279530 | -2.302585 | 10.906691 | 5.517854 | 5.380358 | 5.380358 | 5.517854 | 5.517854 | 5.723912 | 5.723912 |
| 3 | X2 | 565287 | 20003 | 0 | 7.483348 | 1.315889 | -2.302585 | 16.277711 | 7.409197 | 6.490875 | 6.490875 | 7.409197 | 7.409197 | 8.352578 | 8.352578 |
| 4 | Y | 565287 | 1 | 0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |

```python
network_attacks.describe().collect()
```

| | column | count | unique | nulls | mean | std | min | max | median | 25_percent_cont | 25_percent_disc | 50_percent_cont | 50_percent_disc | 75_percent_cont | 75_percent_disc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ID | 2211 | 2211 | 0 | 330579.404342 | 51697.624949 | 201669.000000 | 514439.000000 | 316265.000000 | 312003.500000 | 312003.000000 | 316265.000000 | 316265.000000 | 316817.500000 | 316818.000000 |
| 1 | X0 | 2211 | 14 | 0 | -2.224571 | 0.566787 | -2.302585 | 2.646175 | -2.302585 | -2.302585 | -2.302585 | -2.302585 | -2.302585 | -2.302585 |
| 2 | X1 | 2211 | 46 | 0 | 10.863823 | 0.572168 | -2.302585 | 10.906691 | 10.906691 | 10.906691 | 10.906691 | 10.906691 | 10.906691 | 10.906691 | 10.906691 |
| 3 | X2 | 2211 | 11 | 0 | 8.991967 | 0.451550 | -2.302585 | 9.025828 | 9.025708 | 9.025708 | 9.025708 | 9.025708 | 9.025708 | 9.025708 | 9.025708 |
| 4 | Y | 2211 | 1 | 0 | 1.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |

### Data Partition and Model Training

One-class classification builds its model on pure inlier data. To make sure that the training data is not polluted by any outlier, we randomly sample 90% of the normal-connection data and use the sampled data as the training set for one-class SVM. The rest (the attack data plus the remaining 10% of normal connections) is used in the prediction phase to evaluate the performance of the trained model.

```python
from hana_ml.algorithms.pal.partition import train_test_val_split

train_normal, test_normal, _ = train_test_val_split(data=normal_connections,
                                                    id_column='ID',
                                                    random_seed=2,
                                                    training_percentage=0.9,
                                                    testing_percentage=0.1,
                                                    validation_percentage=0)

from hana_ml.algorithms.pal.svm import OneClassSVM

# nu is an upper bound on the fraction of training errors;
# use a value close to the real (or empirically guessed) proportion of outliers
osvm = OneClassSVM(kernel='rbf', nu=0.003)
osvm.fit(data=train_normal,
         key='ID',
         features=['X0', 'X1', 'X2'])
```

### Prediction and Model Evaluation

Now we take the remaining data of normal connections and the data of attacks as the test dataset, and apply the trained one-class SVM model to them, respectively.

```python
normal_res = osvm.predict(data=test_normal,
                          key='ID',
                          features=['X0', 'X1', 'X2'])
attack_res = osvm.predict(data=network_attacks,
                          key='ID',
                          features=['X0', 'X1', 'X2'])
```

Let us check the prediction result for the data of attacks.

```python
attack_res.collect()
```

| | ID | SCORE | PROBABILITY |
|---|---|---|---|
| 0 | 311452 | -1 | None |
| 1 | 311453 | -1 | None |
| 2 | 311454 | -1 | None |
| 3 | 311455 | -1 | None |
| 4 | 311456 | -1 | None |
| ... | ... | ... | ... |
| 2206 | 312268 | -1 | None |
| 2207 | 312269 | -1 | None |
| 2208 | 312270 | -1 | None |
| 2209 | 312271 | -1 | None |
| 2210 | 312272 | -1 | None |

2211 rows × 3 columns

It should be mentioned that in the prediction result of one-class SVM, detected outliers (i.e. attacks) are assigned the value -1 in the SCORE column, while detected inliers (i.e. normal connections) are assigned the value 1. One can see that in the prediction result table above, all displayed data points (represented by their IDs) are assigned the value -1, indicating that they are all detected as outliers, so the result looks very promising. In fact, we will show below that all attacks are labeled correctly by the trained one-class SVM model.

Now let us check the prediction result for the test data of normal connections.

```python
normal_res.collect()
```

| | ID | SCORE | PROBABILITY |
|---|---|---|---|
| 0 | 5271 | 1 | None |
| 1 | 5488 | 1 | None |
| 2 | 6117 | 1 | None |
| 3 | 6452 | 1 | None |
| 4 | 9474 | 1 | None |
| ... | ... | ... | ... |
| 56524 | 494850 | 1 | None |
| 56525 | 505028 | 1 | None |
| 56526 | 514290 | 1 | None |
| 56527 | 514986 | 1 | None |
| 56528 | 528143 | 1 | None |

56529 rows × 3 columns

All displayed points are labeled as inliers, so this prediction result also looks good.

Now let us calculate some basic statistics of the prediction result for model evaluation.

```python
attack_correct = cc.sql('SELECT * FROM ({}) WHERE SCORE = -1'.format(attack_res.select_statement)).count()
attack_wrong = attack_res.count() - attack_correct
normal_correct = cc.sql('SELECT * FROM ({}) WHERE SCORE = 1'.format(normal_res.select_statement)).count()
normal_wrong = normal_res.count() - normal_correct
```

We use the Precision, Recall and F1 score of the outlier class to evaluate the performance of the one-class SVM classifier for outlier detection.

For the outlier class (i.e. attacks) in the test dataset, we have:

Precision = attack_correct / (attack_correct + normal_wrong) = 0.8766851704996035

Recall = attack_correct / attack_res.count() = 1.0

F1 = 2 × Precision × Recall / (Precision + Recall) = 0.9342911472638918
So among the points detected as attacks in the test data, roughly 87.7% are true attacks, as suggested by the Precision score; meanwhile, the perfect Recall score of 1.0 indicates that all real attacks in the test data are labeled correctly. Combining these two numbers results in a relatively high F1 score of 0.934. The three numbers together illustrate a reasonably successful detection of outliers (i.e. attacks).

In comparison, for the inlier class (i.e. normal connections) in the test dataset, we have:

Precision = normal_correct / (normal_correct + attack_wrong) = 1.0

Recall = normal_correct / normal_res.count() = 0.9944983990518141

F1 = 2 × Precision × Recall / (Precision + Recall) = 0.9972416117502018
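
The evaluation above can be reproduced end-to-end in plain Python. The confusion counts below are inferred from the scores reported in this post (all 2211 attacks detected; 311 of the 56529 normal test connections flagged as outliers), so treat this as an illustrative sketch rather than output of the HANA workflow:

```python
# Confusion counts, inferred from the scores reported above
attack_correct = 2211   # attacks predicted as outliers (SCORE = -1)
attack_wrong = 0        # attacks predicted as inliers
normal_correct = 56218  # normal connections predicted as inliers (SCORE = 1)
normal_wrong = 311      # normal connections predicted as outliers

# Outlier (attack) class metrics
precision = attack_correct / (attack_correct + normal_wrong)
recall = attack_correct / (attack_correct + attack_wrong)
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)      # ≈ 0.8767, 1.0, 0.9343

# Inlier (normal-connection) class metrics
precision_n = normal_correct / (normal_correct + attack_wrong)
recall_n = normal_correct / (normal_correct + normal_wrong)
f1_n = 2 * precision_n * recall_n / (precision_n + recall_n)
print(precision_n, recall_n, f1_n)  # ≈ 1.0, 0.9945, 0.9972
```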


## Discussion and Summary

In this blog post, we have shown readers how to use a one-class classification method, namely one-class SVM, for outlier detection. The major difference between multi-class classification and one-class classification lies in the training data: the former requires the training data to carry multiple (i.e. more than one) labels, while the latter needs merely a single label. The whole detection procedure is not much different from other traditional supervised classification methods: first take out a collection of inlier points, then build a one-class SVM model on it, and finally apply the model to new points to determine whether they are classified as inliers or outliers.

The major drawback of one-class classification for outlier detection is its limited scope of applicability:

• Firstly, it mainly applies to cases where the observed data points are all normal (i.e. inliers), or where there are too few outliers to build an effective classification model for outlier detection. In general, traditional multi-class classification methods, rather than one-class classification, are still the first choice for model training on skewed datasets with outliers as the minority class. We shall cover how to handle the label imbalance problem when building effective multi-class classification models (for outlier detection) in a separate blog post.
• Secondly, the effectiveness of one-class classification for outlier detection can rely strongly on some intrinsic nature of the inliers and outliers, for example when inliers aggregate into a big cluster from which the outliers are disjoint. Otherwise, one-class classification often fails to work well, or may need a very intricate design procedure to make it work.

## References

[1] Shebuti Rayana (2016). ODDS Library [http://odds.cs.stonybrook.edu]. Stony Brook, NY: Stony Brook University, Department of Computer Science.