# Outlier Detection with One-class Classification using Python Machine Learning Client for SAP HANA

We have introduced several methods for outlier detection in a few separate blog posts, including outlier detection using statistical tests and clustering. Typically, these methods can only detect outliers in the input dataset, and the detection result cannot be generalized to new data points, because they do not produce a model. Classification methods can be adopted to overcome this difficulty. However, the power of generalization for classification methods does not come for free, since they require the input data to be labeled. In the case of outlier detection, this means that every point in the input data must be labeled either as an inlier or an outlier.

Usually, classification for outlier detection requires the dataset to contain both inliers and outliers. However, there are cases where there are no outliers in the input dataset, yet a model for outlier detection is still required. In such cases, one-class classification can be adopted. As its name suggests, one-class classification requires the input (i.e. training) data to be labeled by a single class, yet the trained model is still able to produce the label of the opposing class, like other binary classification models.

One-class support vector machine (i.e. one-class SVM) is perhaps the most frequently used method for one-class classification. This method is provided in the SAP HANA Predictive Analysis Library (PAL) and wrapped up by the Python machine learning client for SAP HANA (hana_ml), and in this blog post it shall be adopted to solve the outlier detection problem.

Basically, after reading this blog post, you will learn:

• How to build a one-class SVM model given a dataset
• How to apply the trained model to the prediction data, extract the information of detected outliers and evaluate the performance of the trained model

## Introduction

Suppose a new system has been set up, and it runs very smoothly in the early phase. In the meantime, we want to build a monitoring system to detect whether malfunctions occur as the new system keeps running. Then, what can we do? Waiting until the system makes some mistake sounds like a terrible idea, so we should be able to use the normal data at hand to initialize the monitoring system. There are plenty of similar cases in real life where one-class classification becomes applicable.

Different from traditional classification methods, one-class classification tries to explore the inherent structure of a training dataset with a single label, and builds a model for containing or characterizing the training dataset, so that when a new point comes it can tell whether the point is similar to the points in the training dataset or different from them. One-class classification methods can be derived via density estimation, boundary estimation, or reconstruction-mode estimation w.r.t. the input data. Among them, one-class SVM is a boundary-estimation based method.

Basically, for outlier detection using one-class SVM, in the training phase a profile is drawn to encircle (almost) all points in the input data (all being inliers); in the prediction phase, if a sample point falls into the region enclosed by the drawn profile it is treated as an inlier, otherwise it is treated as an outlier.
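
This encircle-then-classify behavior can be sketched with a tiny toy example. The snippet below uses scikit-learn's OneClassSVM purely as an illustration (scikit-learn and the toy data are assumptions of this sketch, not part of the HANA workflow in this post); the hana_ml estimator used later follows the same fit/predict pattern:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
# Training data: inliers only, clustered around the origin
inliers = rng.normal(loc=0.0, scale=0.5, size=(200, 2))

# nu bounds the fraction of training points allowed to fall outside the learned profile
model = OneClassSVM(kernel='rbf', nu=0.05).fit(inliers)

# predict() returns +1 for points inside the profile and -1 for points outside it
print(model.predict(np.array([[0.0, 0.0], [5.0, 5.0]])))
```

Points inside the learned profile are labeled +1 (inliers) and points outside it -1 (outliers), which is the same score convention used by the PAL one-class SVM later in this post.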

In the rest of this blog post, we show a detailed case study of network intrusion detection using one-class SVM, where attacks are treated as outliers and normal connections as inliers.

## The Case Study : One-class SVM for Network Intrusion Detection

### Dataset Description

In this case study, a reduced version of the renowned KDD Cup 1999 dataset is used. The original dataset is for computer network intrusion detection; it contains 41 feature columns and a label column with 23 classes (representing whether a connection is normal or an attack, with detailed attack type). In the reduced dataset, the computer network service type is restricted to http, and only the three most basic features are kept, together with a categorical label column of 2 classes, corresponding to whether a record is an attack (uniformly labeled by 1 irrespective of attack type) or a normal connection (labeled by 0). Details on how the data was processed can be found in Outlier Detection DataSets (ODDS), and interested readers may refer to [1] for more information.

Since the reduced dataset contains normal connections as well as attacks, only normal connections are used when building up the one-class SVM model, while attacks are utilized in the prediction phase for evaluating the performance of the model.

For further analysis, we have downloaded the dataset and stored it in SAP HANA in a table named ‘HTTP_DATA_TBL’. To fetch the data, we first set up a connection to the database using hana_ml.

```python
import hana_ml
from hana_ml.dataframe import ConnectionContext

cc = ConnectionContext('xxx.xxx.xxx.xx', 30x15, 'XXX', 'Xxxxxxxx')  # server info hidden away
```

We create a hana_ml.DataFrame for this dataset in the database and fetch its brief description, illustrated as follows:

```python
http_data = cc.table('HTTP_DATA_TBL')
http_data.describe().collect()
```

| | column | count | unique | nulls | mean | std | min | max | median | 25_percent_cont | 25_percent_disc | 50_percent_cont | 50_percent_disc | 75_percent_cont | 75_percent_disc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ID | 567498 | 567498 | 0 | 283748.500000 | 163822.705869 | 0.000000 | 567497.000000 | 283749.000000 | 141874.250000 | 141874.000000 | 283748.500000 | 283748.000000 | 425622.750000 | 425623.000000 |
| 1 | X0 | 567498 | 457 | 0 | -2.268538 | 0.465346 | -2.302585 | 8.098369 | -2.302585 | -2.302585 | -2.302585 | -2.302585 | -2.302585 | -2.302585 |
| 2 | X1 | 567498 | 523 | 0 | 5.557679 | 0.435007 | -2.302585 | 10.906691 | 5.517854 | 5.380358 | 5.380358 | 5.517854 | 5.517854 | 5.723912 | 5.723912 |
| 3 | X2 | 567498 | 20003 | 0 | 7.489226 | 1.316983 | -2.302585 | 16.277711 | 7.415235 | 6.490875 | 6.490875 | 7.415235 | 7.415235 | 8.372884 | 8.372884 |
| 4 | Y | 567498 | 2 | 0 | 0.003896 | 0.062297 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |

We observe that, besides the prescribed three feature columns (X0, X1 and X2) and the label column (Y), an additional ID column of integer type has been added to the dataset. The label column contains two values: 0 and 1, where normal connections are labeled by 0 and attacks by 1.
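
As a quick arithmetic sanity check on the description above (a sketch; the counts 2211 and 567498 are taken from the tables in this post): since Y takes only the values 0 and 1, its mean equals the proportion of attacks in the dataset.

```python
n_total = 567498   # total number of records
n_attacks = 2211   # number of attack records (Y = 1)

# The mean of a 0/1 column equals the fraction of ones
attack_fraction = n_attacks / n_total
print(round(attack_fraction, 6))  # 0.003896, matching the mean of Y above
```

This tiny proportion of outliers also motivates the choice of nu = 0.003 in the model-training section later.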

Let us further divide the data points with respect to their labels, to gain a better understanding of the dataset.

```python
normal_connections = cc.sql('SELECT * FROM ({}) WHERE Y = 0'.format(http_data.select_statement))
network_attacks = cc.sql('SELECT * FROM ({}) WHERE Y = 1'.format(http_data.select_statement))
```

Brief descriptions of the two datasets can be illustrated as follows:

```python
normal_connections.describe().collect()
```

| | column | count | unique | nulls | mean | std | min | max | median | 25_percent_cont | 25_percent_disc | 50_percent_cont | 50_percent_disc | 75_percent_cont | 75_percent_disc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ID | 565287 | 565287 | 0 | 283565.330867 | 164084.702187 | 0.000000 | 567497.000000 | 282644.000000 | 141321.500000 | 141321.000000 | 282644.000000 | 282644.000000 | 425971.500000 | 425972.000000 |
| 1 | X0 | 565287 | 457 | 0 | -2.268710 | 0.464899 | -2.302585 | 8.098369 | -2.302585 | -2.302585 | -2.302585 | -2.302585 | -2.302585 | -2.302585 |
| 2 | X1 | 565287 | 479 | 0 | 5.536925 | 0.279530 | -2.302585 | 10.906691 | 5.517854 | 5.380358 | 5.380358 | 5.517854 | 5.517854 | 5.723912 | 5.723912 |
| 3 | X2 | 565287 | 20003 | 0 | 7.483348 | 1.315889 | -2.302585 | 16.277711 | 7.409197 | 6.490875 | 6.490875 | 7.409197 | 7.409197 | 8.352578 | 8.352578 |
| 4 | Y | 565287 | 1 | 0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |

```python
network_attacks.describe().collect()
```

| | column | count | unique | nulls | mean | std | min | max | median | 25_percent_cont | 25_percent_disc | 50_percent_cont | 50_percent_disc | 75_percent_cont | 75_percent_disc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ID | 2211 | 2211 | 0 | 330579.404342 | 51697.624949 | 201669.000000 | 514439.000000 | 316265.000000 | 312003.500000 | 312003.000000 | 316265.000000 | 316265.000000 | 316817.500000 | 316818.000000 |
| 1 | X0 | 2211 | 14 | 0 | -2.224571 | 0.566787 | -2.302585 | 2.646175 | -2.302585 | -2.302585 | -2.302585 | -2.302585 | -2.302585 | -2.302585 |
| 2 | X1 | 2211 | 46 | 0 | 10.863823 | 0.572168 | -2.302585 | 10.906691 | 10.906691 | 10.906691 | 10.906691 | 10.906691 | 10.906691 | 10.906691 | 10.906691 |
| 3 | X2 | 2211 | 11 | 0 | 8.991967 | 0.451550 | -2.302585 | 9.025828 | 9.025708 | 9.025708 | 9.025708 | 9.025708 | 9.025708 | 9.025708 | 9.025708 |
| 4 | Y | 2211 | 1 | 0 | 1.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |

### Data Partition and Model Training

One-class classification builds its model on pure inlier data. To make sure that the training data is not polluted by any outlier, we randomly sample 90% of the normal-connection data and use the sampled data as the training set for one-class SVM. The rest (the attack data plus the remaining 10% of normal connections) is used in the prediction phase to evaluate the performance of the trained model.

```python
from hana_ml.algorithms.pal.partition import train_test_val_split

train_normal, test_normal, _ = train_test_val_split(data=normal_connections,
                                                    id_column='ID',
                                                    random_seed=2,
                                                    training_percentage=0.9,
                                                    testing_percentage=0.1,
                                                    validation_percentage=0)

from hana_ml.algorithms.pal.svm import OneClassSVM

# nu is an upper bound on the fraction of training errors;
# use a value close to the real (or empirically guessed) proportion of outliers
osvm = OneClassSVM(kernel='rbf', nu=0.003)
osvm.fit(data=train_normal,
         key='ID',
         features=['X0', 'X1', 'X2'])
```

### Prediction and Model Evaluation

Now we take the remaining data of normal connections and the data of attacks as the test dataset, and apply the trained one-class SVM model to them, respectively.

```python
normal_res = osvm.predict(data=test_normal,
                          key='ID',
                          features=['X0', 'X1', 'X2'])
attack_res = osvm.predict(data=network_attacks,
                          key='ID',
                          features=['X0', 'X1', 'X2'])
```

Let us check the prediction result for the data of attacks.

```python
attack_res.collect()
```

| | ID | SCORE | PROBABILITY |
|---|---|---|---|
| 0 | 311452 | -1 | None |
| 1 | 311453 | -1 | None |
| 2 | 311454 | -1 | None |
| 3 | 311455 | -1 | None |
| 4 | 311456 | -1 | None |
| ... | ... | ... | ... |
| 2206 | 312268 | -1 | None |
| 2207 | 312269 | -1 | None |
| 2208 | 312270 | -1 | None |
| 2209 | 312271 | -1 | None |
| 2210 | 312272 | -1 | None |

2211 rows × 3 columns

It should be mentioned that in the prediction result of one-class SVM, detected outliers (i.e. attacks) are assigned the value -1 in the SCORE column, while detected inliers (i.e. normal connections) are assigned the value 1. One can see that in the prediction result table above, all displayed data points (represented by their IDs) are assigned the value -1, indicating that they are all detected as outliers, so the result looks very promising. In fact, we will show below that all attacks are labeled correctly by the trained one-class SVM model.

Now let us check the prediction result for the test data of normal connections.

```python
normal_res.collect()
```

| | ID | SCORE | PROBABILITY |
|---|---|---|---|
| 0 | 5271 | 1 | None |
| 1 | 5488 | 1 | None |
| 2 | 6117 | 1 | None |
| 3 | 6452 | 1 | None |
| 4 | 9474 | 1 | None |
| ... | ... | ... | ... |
| 56524 | 494850 | 1 | None |
| 56525 | 505028 | 1 | None |
| 56526 | 514290 | 1 | None |
| 56527 | 514986 | 1 | None |
| 56528 | 528143 | 1 | None |

56529 rows × 3 columns

All displayed points are labeled as inliers, so this prediction result also looks good.

Now let us calculate some basic statistics of the prediction result for model evaluation.

```python
attack_correct = cc.sql('SELECT * FROM ({}) WHERE SCORE = -1'.format(attack_res.select_statement)).count()
attack_wrong = attack_res.count() - attack_correct
normal_correct = cc.sql('SELECT * FROM ({}) WHERE SCORE = 1'.format(normal_res.select_statement)).count()
normal_wrong = normal_res.count() - normal_correct
```

We use the Precision, Recall and F1 score of the outlier class to evaluate the performance of the one-class SVM classifier for outlier detection.

For the outlier class (i.e. attacks) in the test dataset, we have:

Precision = attack_correct / (attack_correct + normal_wrong) = 0.8766851704996035

Recall = attack_correct / attack_res.count() = 1.0

F1 = 2 × Precision × Recall / (Precision + Recall) = 0.9342911472638918
So among the points detected as attacks in the test data, roughly 87.7% are true attacks, as suggested by the Precision score; meanwhile, the perfect Recall score of 1.0 indicates that all real attacks in the test data are labeled correctly. Combining these two numbers results in a relatively high F1 score of 0.934. The three numbers together illustrate a reasonably successful detection of outliers (i.e. attacks).

In comparison, for the inlier class (i.e. normal connections) in the test dataset, we have:

Precision = normal_correct / (normal_correct + attack_wrong) = 1.0

Recall = normal_correct / normal_res.count() = 0.9944983990518141

F1 = 2 × Precision × Recall / (Precision + Recall) = 0.9972416117502018
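
The evaluation above can be reproduced end-to-end in plain Python. The confusion counts below are inferred from the scores reported in this post (all 2211 attacks detected; 311 of the 56529 normal test connections flagged as outliers), so treat this as an illustrative sketch rather than output of the HANA workflow:

```python
# Confusion counts, inferred from the scores reported above
attack_correct = 2211   # attacks predicted as outliers (SCORE = -1)
attack_wrong = 0        # attacks predicted as inliers
normal_correct = 56218  # normal connections predicted as inliers (SCORE = 1)
normal_wrong = 311      # normal connections predicted as outliers

# Outlier (attack) class metrics
precision = attack_correct / (attack_correct + normal_wrong)
recall = attack_correct / (attack_correct + attack_wrong)
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)      # ≈ 0.8767, 1.0, 0.9343

# Inlier (normal-connection) class metrics
precision_n = normal_correct / (normal_correct + attack_wrong)
recall_n = normal_correct / (normal_correct + normal_wrong)
f1_n = 2 * precision_n * recall_n / (precision_n + recall_n)
print(precision_n, recall_n, f1_n)  # ≈ 1.0, 0.9945, 0.9972
```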


## Discussion and Summary

In this blog post, we have shown readers how to use a one-class classification method, namely one-class SVM, for outlier detection. The major difference between multi-class classification and one-class classification lies in the training data: the former requires the training data to carry multiple (i.e. more than one) labels, while the latter needs merely a single label. The whole detection procedure is not much different from other traditional supervised classification methods: first take out a collection of inlier points, then build a one-class SVM model on it, and finally apply the model to new points to determine whether they are classified as inliers or outliers.

The major drawback of one-class classification for outlier detection is its limited scope of applicability:

• Firstly, it mainly applies to cases where the observed data points are all normal (i.e. inliers), or where there are too few outliers to build an effective classification model for outlier detection. In general, traditional multi-class classification methods, rather than one-class classification, are still the first choice for model training on skewed datasets with outliers as the minority class. We shall cover how to handle the label imbalance problem when building effective multi-class classification models (for outlier detection) in a separate blog post.
• Secondly, the effectiveness of one-class classification for outlier detection can rely strongly on some intrinsic nature of the inliers and outliers, for example when inliers aggregate into a big cluster from which the outliers are disjoint. Otherwise, one-class classification often fails to work well, or may need a very intricate design procedure to make it work.

## References

[1] Shebuti Rayana (2016). ODDS Library [http://odds.cs.stonybrook.edu]. Stony Brook, NY: Stony Brook University, Department of Computer Science.