Author: Likun Hou

Learning from Labeled Anomalies for Efficient Anomaly Detection using Python Machine Learning Client for SAP HANA

In a few separate blog posts, we have discussed anomaly detection in datasets with multiple features using techniques such as one-class classification, clustering (DBSCAN), and statistical tests. However, these techniques become less applicable when the dataset is high-dimensional (i.e. contains many features), or when the boundary between normal points and anomalous ones is complicated. In such cases, a better approach is to manually label the anomalies in the dataset, and then train a supervised machine learning model that classifies points as normal or anomalous.

In this blog post, we will analyze a specific dataset with labeled anomalies, and use the decision tree algorithm in the SAP HANA Predictive Analysis Library (PAL) through the Python Machine Learning Client for SAP HANA (hana_ml) to construct a classification model for anomaly detection. Along the way, we apply several resampling techniques to improve different aspects of the trained model's performance.

Introduction

Separating anomalies from normal points using labeled datasets seems as simple as a regular classification problem. However, the highly skewed distribution between normal and anomalous data points poses a big challenge for building an effective classification model, because anomalies are usually rare while normal points are overwhelmingly common. For example, consider a disease with a prevalence rate of 0.1%. A naive model that predicts every person to be a non-patient has a seemingly good accuracy of 99.9%. However, if we happily adopted this naive model for detecting the disease, it would be a disaster for all real patients, since none of them would be detected. This imbalance between normal and anomalous labels is one typical characteristic of anomaly detection problems, and it is especially troublesome when normal points and anomalies are entangled in the feature space of the dataset.
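The disease example can be checked with a few lines of plain Python. This is only an illustrative sketch: the population size of 100,000 is an assumption made here so that the 0.1% prevalence yields a whole number of patients.

```python
# Hypothetical population: 100,000 people, 0.1% of whom have the disease.
n_people = 100_000
n_patients = n_people // 1000          # 0.1% prevalence -> 100 patients
y_true = [1] * n_patients + [0] * (n_people - n_patients)

# Naive model: predict "non-patient" (0) for everyone.
y_pred = [0] * n_people

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / n_people

# Recall on the patient class: share of real patients that are detected.
detected = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall_patients = detected / n_patients

print(f"accuracy = {accuracy:.3f}")               # accuracy = 0.999
print(f"patient recall = {recall_patients:.3f}")  # patient recall = 0.000
```

Accuracy alone hides the fact that not a single patient is found, which is why the per-class recall and precision are the metrics we focus on in the rest of this post.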

Meanwhile, anomalous cases are usually much more valuable than normal ones. For example, if we fail to detect a fraudulent transaction between bank accounts, we could suffer a great loss of money; however, if we suspect a transaction to be fraudulent and it turns out not to be, we only pay for some manual verification procedures that are relatively cheap. The higher importance of anomalous cases compared to normal ones is another typical characteristic of anomaly detection.

In this blog post, we will do a case study on anomaly detection using a labeled dataset. The following contents will be included in our discussion:

  1. Introduction & background knowledge on the dataset for our case study, with brief problem analysis
  2. Anomaly detection from classification models with the help of various resampling techniques

Case Study: Thyroid Hyperfunctionality Detection by Classification

Dataset Description and Problem Analysis

The problem of interest in this blog post is thyroid disease recognition. The original full dataset is available in the UCI machine learning repository[1]. It contains 21 attributes: 15 of them are categorical and 6 are numerical. The dataset is divided into 3 classes: normal, subnormal and hyperfunction. Hyperfunction is the minority class in this dataset, but it is also the class that we are most interested in, because thyroid hyperfunction may accelerate the body’s metabolism, bringing along symptoms like unintentional weight loss, rapid or irregular heartbeat, nervousness, anxiety and irritability.

Our designated task in this blog post is to distinguish hyperfunctional cases from non-hyperfunctional (i.e. normal and subnormal) ones, using only the 6 numerical attributes. A reduced version of the dataset for this purpose is available in the ODDS library[2], where all attribute values are scaled into the range [0, 1] using a min-max scaler. The label column in this dataset takes the values 0 and 1, where 1 stands for hyperfunction and 0 for non-hyperfunction.
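As a reminder of what min-max scaling does, here is a minimal sketch in plain Python. The function name min_max_scale and the sample values are made up for illustration; the ODDS data is already scaled, so nothing like this needs to be run on it.

```python
def min_max_scale(values):
    """Rescale numbers linearly so that the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to 0 by convention.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

raw = [2.0, 4.0, 10.0]   # made-up attribute values
scaled = min_max_scale(raw)
print(scaled)            # [0.0, 0.25, 1.0]
```

This is why every attribute V0 to V5 in the dataset has a minimum of 0 and a maximum of 1 over the full data.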

Let us examine the corresponding dataset for further analysis.  We assume that the data has already been stored in a table with name ‘PAL_THYROID_DATA_TBL’ in a database of SAP HANA platform. Then the dataset can be accessed by establishing a connection to the database using hana_ml, illustrated as follows:

import hana_ml
from hana_ml.dataframe import ConnectionContext
cc = ConnectionContext('xx.xxx.xxx.xx', 30x15, 'XXX', 'Xxxxxxx')
thyroid_df = cc.table('PAL_THYROID_DATA_TBL')

Then thyroid_df is a hana_ml.DataFrame that contains the information of the dataset. A brief description of the dataset can be obtained as follows:

thyroid_df.describe().collect()
column count unique nulls mean std min max median 25_percent_cont 25_percent_disc 50_percent_cont 50_percent_disc 75_percent_cont 75_percent_disc
0 ID 3772 3772 0 1885.500000 1089.026935 0.0 3771.0 1886.000000 942.750000 942.000000 1885.500000 1885.000000 2828.250000 2828.000000
1 V0 3772 93 0 0.543121 0.203790 0.0 1.0 0.569892 0.376344 0.376344 0.569892 0.569892 0.709677 0.709677
2 V1 3772 280 0 0.008983 0.043978 0.0 1.0 0.003019 0.001132 0.001132 0.003019 0.003019 0.004528 0.004528
3 V2 3772 72 0 0.186826 0.070405 0.0 1.0 0.190702 0.156546 0.156546 0.190702 0.190702 0.213472 0.213472
4 V3 3772 243 0 0.248332 0.080579 0.0 1.0 0.241822 0.203271 0.203271 0.241822 0.240654 0.282710 0.282710
5 V4 3772 141 0 0.376941 0.087382 0.0 1.0 0.375587 0.328638 0.328638 0.375587 0.375587 0.413146 0.413146
6 V5 3772 324 0 0.177301 0.054907 0.0 1.0 0.173770 0.149180 0.149180 0.173770 0.173770 0.196721 0.196721
7 TYPE 3772 2 0 0.024655 0.155093 0.0 1.0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000

As revealed by the mean value of the TYPE column, hyperfunctional cases cover only about 2.5% of all cases in the dataset, so the dataset is highly skewed w.r.t. thyroid functionality types.

 

Now we inspect the hyperfunctional cases and non-hyperfunctional cases separately.

cc.sql('select * from ({}) where TYPE=1'.format(thyroid_df.select_statement)).describe().collect()
column count unique nulls mean std min max median 25_percent_cont 25_percent_disc 50_percent_cont 50_percent_disc 75_percent_cont 75_percent_disc
0 ID 93 93 0 1873.645161 1065.693167 19.000000 3679.000000 1853.000000 1042.000000 1042.000000 1853.000000 1853.000000 2702.000000 2702.000000
1 V0 93 45 0 0.525494 0.201366 0.000000 0.892473 0.537634 0.376344 0.376344 0.537634 0.537634 0.666667 0.666667
2 V1 93 71 0 0.176248 0.212708 0.011698 1.000000 0.096226 0.049057 0.049057 0.096226 0.096226 0.203774 0.203774
3 V2 93 26 0 0.086889 0.060211 0.014231 0.241935 0.080645 0.033207 0.033207 0.080645 0.080645 0.128083 0.128083
4 V3 93 55 0 0.074221 0.050246 0.000000 0.203271 0.072430 0.028037 0.028037 0.072430 0.072430 0.112150 0.112150
5 V4 93 51 0 0.397900 0.085781 0.197183 0.615023 0.399061 0.333333 0.333333 0.399061 0.399061 0.455399 0.455399
6 V5 93 58 0 0.050884 0.033034 0.000000 0.101639 0.050820 0.018033 0.018033 0.050820 0.050820 0.080328 0.080328
7 TYPE 93 1 0 1.000000 0.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
cc.sql('select * from ({}) where TYPE=0'.format(thyroid_df.select_statement)).describe().collect()
column count unique nulls mean std min max median 25_percent_cont 25_percent_disc 50_percent_cont 50_percent_disc 75_percent_cont 75_percent_disc
0 ID 3679 3679 0 1885.799674 1089.750477 0.00000 3771.000000 1888.000000 940.500000 940.000000 1888.000000 1888.000000 2831.500000 2832.000000
1 V0 3679 93 0 0.543566 0.203859 0.00000 1.000000 0.569892 0.376344 0.376344 0.569892 0.569892 0.709677 0.709677
2 V1 3679 235 0 0.004755 0.011221 0.00000 0.273585 0.002830 0.001075 0.001075 0.002830 0.002830 0.004340 0.004340
3 V2 3679 71 0 0.189353 0.068794 0.00000 1.000000 0.190702 0.156546 0.156546 0.190702 0.190702 0.213472 0.213472
4 V3 3679 212 0 0.252734 0.076211 0.03972 1.000000 0.245327 0.205607 0.205607 0.245327 0.245327 0.282710 0.282710
5 V4 3679 141 0 0.376411 0.087369 0.00000 1.000000 0.375587 0.328638 0.328638 0.375587 0.375587 0.399061 0.399061
6 V5 3679 284 0 0.180496 0.051472 0.02459 1.000000 0.175410 0.150820 0.150820 0.175410 0.175410 0.198361 0.198361
7 TYPE 3679 1 0 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000

A closer examination of the data tells us that the distributions of hyperfunctional and non-hyperfunctional cases are quite different (in quantiles) for attributes like V1, V2, V3, and V5. So these numerical attributes have the potential to distinguish thyroid hyperfunctionality from non-hyperfunctionality.

Visualization of the dataset is another way to assess how difficult the classification problem is. Dimensionality reduction is indispensable here since the feature space of the dataset is 6-dimensional. PCA is applied to transform the dataset from 6D to 2D (without involving the TYPE column), so that it can be drawn as a scatter plot.

from hana_ml.algorithms.pal.decomposition import PCA
res = PCA().fit_transform(thyroid_df, key='ID',
                          features=['V0', 'V1',
                                    'V2', 'V3',
                                    'V4', 'V5'])
thyroid_2d = res[['ID', 'COMPONENT_1', 'COMPONENT_2']]
hyper_ids = 'SELECT ID FROM ({}) WHERE TYPE=1'.format(thyroid_df.select_statement)
non_hyper_ids = 'SELECT ID FROM ({}) WHERE TYPE=0'.format(thyroid_df.select_statement)
thyroid_2d_hyper = cc.sql('SELECT * FROM ({}) WHERE ID IN ({})'.format(thyroid_2d.select_statement, hyper_ids))
thyroid_2d_non_hyper = cc.sql('SELECT * FROM ({}) WHERE ID IN ({})'.format(thyroid_2d.select_statement, non_hyper_ids))
import matplotlib.pyplot as plt
# Collect each subset to the Python client once, then plot it.
non_hyper_pd = thyroid_2d_non_hyper.collect()
hyper_pd = thyroid_2d_hyper.collect()
h1 = plt.scatter(non_hyper_pd['COMPONENT_1'], non_hyper_pd['COMPONENT_2'], c='red')
h2 = plt.scatter(hyper_pd['COMPONENT_1'], hyper_pd['COMPONENT_2'], s=5, c='blue')
plt.legend((h1, h2), ('Normal/Subnormal', 'Hyperfunction'))
plt.show()

One can see from the above figure that hyperfunctional cases are distributed differently from the non-hyperfunctional ones(in the reduced attribute space), yet the two classes are not that well separated.

Dataset Partition

Before building any classification model for anomaly detection, we first divide the whole dataset into a training part and a testing part using the train_test_val_split() function in hana_ml, where the training percentage is set to 0.7 and the testing percentage to 0.3 (no validation data).

from hana_ml.algorithms.pal.partition import train_test_val_split
train_, test_, _ = train_test_val_split(data = thyroid_df,
                                        id_column='ID',
                                        random_seed=2,
                                        partition_method='stratified',
                                        stratified_column='TYPE',
                                        training_percentage=0.7,
                                        testing_percentage=0.3,
                                        validation_percentage=0)
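To make the idea of a stratified partition concrete, the following is a small plain-Python sketch that splits each class separately so that both parts keep the original class ratio. It is not how PAL implements the partition; the function stratified_split and its rounding rule are assumptions made for illustration, and only the class counts (93 and 3679) come from the describe() output above.

```python
import random

def stratified_split(labels, test_fraction, seed=0):
    """Return (train_ids, test_ids), drawing test_fraction of each class for testing."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train_ids, test_ids = [], []
    for ids in by_class.values():
        rng.shuffle(ids)                       # random order within the class
        n_test = round(test_fraction * len(ids))
        test_ids.extend(ids[:n_test])
        train_ids.extend(ids[n_test:])
    return sorted(train_ids), sorted(test_ids)

# 93 hyperfunctional (1) and 3679 non-hyperfunctional (0) cases.
labels = [1] * 93 + [0] * 3679
train_ids, test_ids = stratified_split(labels, test_fraction=0.3, seed=2)
n_test_pos = sum(labels[i] for i in test_ids)
print(len(train_ids), len(test_ids), n_test_pos)   # 2640 1132 28
```

With this rounding rule the test part holds 28 hyperfunctional and 1104 non-hyperfunctional cases, which agrees with the SUPPORT values reported by score() in the following sections.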

 

Now we are ready to build a classification model. A decision-tree classifier is used in the following for illustration; other classification algorithms are also applicable with a similar workflow.

Direct Training

To begin with, we build a decision-tree classifier on the training dataset directly, without any modification. The decision-tree classifier is called through the UnifiedClassification class in hana_ml; by doing so we can obtain many evaluation metrics of the trained model on the test dataset by calling the corresponding score() function. The entire procedure is illustrated as follows:

from hana_ml.algorithms.pal.unified_classification import UnifiedClassification
dtc = UnifiedClassification(func='DecisionTree', algorithm='cart')
dtc.fit(data=train_,
        partition_method='no',
        key='ID',
        label='TYPE',
        categorical_variable='TYPE')
res_direct = dtc.score(data=test_,
                       key='ID',
                       features=['V0','V1', 'V2',
                                 'V3', 'V4', 'V5'])

The values of the evaluation metrics of the trained model on the test dataset are available in the 2nd element of the result returned by the score() function, and can be collected to the Python client as follows:

res_direct[1].collect()
STAT_NAME STAT_VALUE CLASS_NAME
0 AUC 0.9938162544169611 None
1 RECALL 0.9972826086956522 0
2 PRECISION 0.9963800904977376 0
3 F1_SCORE 0.9968311453146219 0
4 SUPPORT 1104 0
5 RECALL 0.8571428571428571 1
6 PRECISION 0.8888888888888888 1
7 F1_SCORE 0.8727272727272727 1
8 SUPPORT 28 1
9 ACCURACY 0.9938162544169611 None
10 KAPPA 0.8695594916705077 None
11 MCC 0.8697105036187616 None

A few key statistics are worth mentioning:

  1. The overall accuracy is ~99.38%, higher than that of the naive ‘always non-hyperfunctional’ classifier (1104/1132, i.e. ~97.53%, on this test set).
  2. ~85.7% of the hyperfunctional cases in the test data are assigned the correct label by the decision-tree classifier.
  3. ~88.9% of the cases predicted as hyperfunctional by the decision-tree classifier are real hyperfunctional cases.
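The relation between these statistics and the underlying confusion counts can be sketched as follows. The counts below are not printed by PAL; they are inferred here from the reported SUPPORT, RECALL and PRECISION values, so treat them as a consistency check rather than as an official output.

```python
# Confusion counts for the hyperfunction class (label 1), inferred from the
# reported metrics: 28 positives with recall 24/28, precision 24/27.
tp, fn = 24, 4      # hyperfunctional test cases: detected / missed
fp, tn = 3, 1101    # non-hyperfunctional test cases: false alarms / correct

total = tp + fn + fp + tn
recall = tp / (tp + fn)
precision = tp / (tp + fp)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / total
naive_accuracy = (fp + tn) / total   # classifier that always predicts label 0

print(round(recall, 4), round(precision, 4), round(f1, 4))   # 0.8571 0.8889 0.8727
print(round(accuracy, 4), round(naive_accuracy, 4))          # 0.9938 0.9753
```

In absolute terms, the directly trained model misses 4 of the 28 hyperfunctional cases in the test data and raises 3 false alarms.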

The performance of the trained model is already reasonably good. However, since classes are imbalanced in the training dataset, resampling techniques that balance the class counts have the potential to further improve the performance of the classification model. We will verify this claim in the subsequent subsections. Our first try is to oversample the minority class to achieve class balance in the training data.

Model Training with Minority-class Oversampling

We first multiply the number of hyperfunctional cases in the training dataset using the synthetic minority over-sampling technique (SMOTE), so that the number of hyperfunctional cases becomes comparable to that of non-hyperfunctional cases.

train_pos = cc.sql('SELECT * FROM ({}) WHERE TYPE=1'.format(train_.select_statement))
train_neg = cc.sql('SELECT * FROM ({}) WHERE TYPE=0'.format(train_.select_statement))
# ratio of the non-hyperfunctional count to the hyperfunctional count, rounded down
multi = int(train_neg.count()/train_pos.count())
from hana_ml.algorithms.pal.preprocessing import SMOTE
smote = SMOTE(smote_amount=multi*100)  # smote_amount is specified as a percentage
train_smote = smote.fit_transform(data = train_[['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'TYPE']],
                                  label = 'TYPE',
                                  minority_class=1)
dtc.fit(data=train_smote,
        partition_method='no',
        label='TYPE',
        categorical_variable='TYPE')
score_res = dtc.score(data=test_,
                      key='ID',
                      features=['V0', 'V1', 'V2',
                                'V3', 'V4', 'V5'])
score_res[1].collect()
STAT_NAME STAT_VALUE CLASS_NAME
0 AUC 0.9990487145550575 None
1 RECALL 0.9981884057971014 0
2 PRECISION 0.99909338168631 0
3 F1_SCORE 0.9986406887177163 0
4 SUPPORT 1104 0
5 RECALL 0.9642857142857143 1
6 PRECISION 0.9310344827586207 1
7 F1_SCORE 0.9473684210526316 1
8 SUPPORT 28 1
9 ACCURACY 0.9973498233215548 None
10 KAPPA 0.9460095389507154 None
11 MCC 0.9461627755815293 None

So, by oversampling the hyperfunctional cases in the training data to roughly the same size as the non-hyperfunctional cases, and retraining the decision tree model, we see some improvement, as revealed by the following key statistics:

  1. The overall accuracy increases from ~99.38% to ~99.73% (versus training without resampling).
  2. ~96.4% of the hyperfunctional cases in the test data are assigned the correct label by the decision-tree classifier (versus ~85.7% without resampling).
  3. ~93.1% of the cases predicted as hyperfunctional by the decision-tree classifier are real hyperfunctional cases (versus ~88.9% without resampling).

The improvement in model performance on the test dataset is non-negligible, which shows the effectiveness of oversampling the minority class.
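For readers who have not met SMOTE before, its core idea is to create new minority points by interpolating between an existing minority point and one of its nearest minority neighbours. The following plain-Python sketch illustrates that idea only; it is a simplification written for this post, not the PAL implementation, and the function smote_sketch, its parameters and the toy points are all hypothetical.

```python
import random

def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points, each lying on the segment between a
    minority point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        # Sort the remaining minority points by squared Euclidean distance to base.
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()   # position along the segment base -> neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic

minority = [(0.10, 0.20), (0.15, 0.25), (0.20, 0.10), (0.30, 0.30)]   # toy data
new_points = smote_sketch(minority, n_new=8)
print(len(new_points))   # 8
```

Because every synthetic point is a convex combination of two real minority points, the new points stay inside the region already occupied by the minority class instead of being exact copies.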

 

Another approach to oversampling the minority class is direct duplication, which is equivalent to bootstrapping in some sense.

train_mult = train_neg
# Duplicate the positive cases multi times (multi is the class-count ratio computed
# in the SMOTE step) so that their group is roughly the same size as the negative group.
for i in range(multi):
    train_mult = train_mult.union(train_pos)
dtc = UnifiedClassification(func='DecisionTree', algorithm='cart')
dtc.fit(data=train_mult[['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'TYPE']],
        partition_method='no',
        label='TYPE',
        categorical_variable='TYPE')
score_res = dtc.score(data=test_,
                      key='ID',
                      features=['V0', 'V1', 'V2', 'V3', 'V4', 'V5'])
score_res[1].collect()
STAT_NAME STAT_VALUE CLASS_NAME
0 AUC 0.9990487145550575 None
1 RECALL 0.9981884057971014 0
2 PRECISION 0.99909338168631 0
3 F1_SCORE 0.9986406887177163 0
4 SUPPORT 1104 0
5 RECALL 0.9642857142857143 1
6 PRECISION 0.9310344827586207 1
7 F1_SCORE 0.9473684210526316 1
8 SUPPORT 28 1
9 ACCURACY 0.9973498233215548 None
10 KAPPA 0.9460095389507154 None
11 MCC 0.9461627755815293 None

The score statistics match those obtained when SMOTE is applied.

 

Model Training with Majority-class Undersampling

Oversampling the minority class of the training data can usually increase its precision and recall without much affecting the evaluation metrics of the majority class. However, it has some drawbacks: one is the additional memory and computational cost of the oversampled training data; another is that it usually falls short of achieving a very high recall rate for the minority class. When the cost of misclassifying a minority-class case becomes unacceptably high, we must figure out a smart way to increase its recall rate. Naively labeling all data as minority class would give perfect recall, but it is obviously not a smart way and we should always avoid doing that.

So here comes majority-class undersampling, in which the collection of data with the majority-class label is subsampled. As a consequence, the area covered by the majority-class data points becomes smaller and less dense, which gives machine learning methods room to model the minority-class points with less interference from majority-class points. This usually results in a direct increase of the coverage (i.e. recall rate) of minority-class points, yet some majority-class points will also be misclassified as minority class (the precision rate drops). However, as long as the price paid for the latter is smaller than the value saved by the former, we are happy to make the change.

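Undersampling the majority class can be realized in many different ways. Our first try here is the TomekLinks algorithm, which is provided in the SAP HANA Predictive Analysis Library (PAL) and wrapped in the Python machine learning client for SAP HANA (hana_ml). A Tomek link is a pair of points with different class labels that are each other's nearest neighbors; removing the majority-class member of such pairs is the usual way to use them for undersampling.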

from hana_ml.algorithms.pal.preprocessing import TomekLinks
tmk = TomekLinks()
train_tmk = tmk.fit_transform(train_, label='TYPE')
dtc.fit(data=train_tmk,
        key = 'ID',
        partition_method='no',
        label='TYPE',
        categorical_variable='TYPE')
score_res = dtc.score(data=test_,
                      key='ID',
                      features=['V0', 'V1', 'V2',
                                'V3', 'V4', 'V5'])
score_res[1].collect()
STAT_NAME STAT_VALUE CLASS_NAME
0 AUC 0.9964664310954063 None
1 RECALL 0.9972826086956522 0
2 PRECISION 0.9990925589836661 0
3 F1_SCORE 0.9981867633726201 0
4 SUPPORT 1104 0
5 RECALL 0.9642857142857143 1
6 PRECISION 0.9 1
7 F1_SCORE 0.9310344827586207 1
8 SUPPORT 28 1
9 ACCURACY 0.9964664310954063 None
10 KAPPA 0.9292234587970489 None
11 MCC 0.9298058529321855 None

Compared to the direct classification result without resampling the training data, we have:

  1. The overall accuracy increases from ~99.38% to ~99.65%.
  2. ~96.4% of the hyperfunctional cases are now correctly predicted (versus ~85.7% without resampling).
  3. 90% of the predicted hyperfunctional cases are now correct (versus ~88.9% without resampling).

For hyperfunctional cases, the recall rate is much improved (the same as the one achieved by minority-class oversampling), yet the increase in precision is much smaller, and the resulting value is lower than the one achieved by minority-class oversampling. The result is consistent with our previous analysis of majority-class undersampling.
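The notion of a Tomek link used above can be illustrated with a short plain-Python sketch: find pairs of points from different classes that are each other's nearest neighbours, then drop the majority-class member of each pair. This is a brute-force toy version under assumptions made here (Euclidean distance, hypothetical helper names nearest and tomek_links, made-up points); it says nothing about how PAL implements TomekLinks internally.

```python
def nearest(idx, points):
    """Index of the point closest to points[idx] (squared Euclidean distance)."""
    return min((j for j in range(len(points)) if j != idx),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(points[idx], points[j])))

def tomek_links(points, labels):
    """Pairs (i, j) with i < j that are mutual nearest neighbours with different labels."""
    nn = [nearest(i, points) for i in range(len(points))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and labels[i] != labels[j]]

points = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.1, 1.0), (0.55, 0.5), (0.6, 0.5)]
labels = [0, 0, 0, 0, 0, 1]   # toy data: a single minority point at index 5
links = tomek_links(points, labels)

# Undersampling step: drop the majority-class (label 0) member of each link.
drop = {i if labels[i] == 0 else j for i, j in links}
kept = [k for k in range(len(points)) if k not in drop]
print(links, kept)   # [(4, 5)] [0, 1, 2, 3, 5]
```

Only majority points that sit right next to a minority point are removed, which explains why this kind of undersampling cleans the class boundary but changes the class counts very little.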

Next we try random subsampling of the majority class, in which we subsample the majority-class data heavily so that the two classes eventually have similar sizes in the training data.

from hana_ml.algorithms.pal.preprocessing import Sampling
dsp_rate = 0.03  # subsampling rate for the non-hyperfunctional cases
sp = Sampling(method='simple_random_without_replacement',
              sampling_size=int(dsp_rate*train_neg.count()),
              random_state=2)
train_neg_sub = sp.fit_transform(train_neg)
train_balanced = train_neg_sub.union(train_pos)
dtc.fit(data=train_balanced,
        key='ID',
        partition_method='no',
        label='TYPE',
        categorical_variable='TYPE')
score_res = dtc.score(data=test_,
                      key='ID',
                      features=['V0', 'V1', 'V2',
                                'V3', 'V4', 'V5'])
score_res[1].collect()
STAT_NAME STAT_VALUE CLASS_NAME
0 AUC 0.9787985865724381 None
1 RECALL 0.9782608695652174 0
2 PRECISION 1 0
3 F1_SCORE 0.989010989010989 0
4 SUPPORT 1104 0
5 RECALL 1 1
6 PRECISION 0.5384615384615384 1
7 F1_SCORE 0.7000000000000001 1
8 SUPPORT 28 1
9 ACCURACY 0.9787985865724381 None
10 KAPPA 0.6900328587075575 None
11 MCC 0.72577947948589 None

Now the trained model achieves a perfect recall score for the hyperfunctional cases in the test dataset, yet the corresponding precision drops to ~53.8%, indicating that many non-hyperfunctional cases in the test dataset are falsely predicted as hyperfunctional (a recall of ~97.83% on 1104 non-hyperfunctional cases corresponds to 24 false alarms).
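To see how aggressive a subsampling rate of 0.03 is, the arithmetic can be replayed in plain Python. The training-set class counts used below are an assumption: they are obtained by subtracting the test-set SUPPORT values from the class totals of the full dataset, since the exact partition sizes are not printed in this post.

```python
import random

# Assumed training-set class counts: full-data totals minus test-set SUPPORT values.
n_train_neg = 3679 - 1104   # 2575 non-hyperfunctional cases
n_train_pos = 93 - 28       # 65 hyperfunctional cases

dsp_rate = 0.03
sampling_size = int(dsp_rate * n_train_neg)

# Simple random sampling without replacement over stand-in row IDs.
rng = random.Random(2)
neg_sub = rng.sample(range(n_train_neg), sampling_size)

pos_share = n_train_pos / (sampling_size + n_train_pos)
print(sampling_size, n_train_pos)   # 77 65
print(round(pos_share, 3))          # 0.458
```

Under these assumed counts, only 77 of the 2575 non-hyperfunctional training cases survive, so the classes become nearly balanced at the price of discarding about 97% of the majority-class information.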

Model Training with Hybrid Method

A high undersampling rate for the majority class can potentially lead to a high recall rate for the minority class, yet it also risks underfitting the majority class, since many of its points are thrown away in the training phase. We have already seen the many falsely predicted hyperfunctional cases caused by undersampling the non-hyperfunctional cases in the training data. In comparison, combining minority-class oversampling with majority-class undersampling, i.e. a hybrid method, could be a smarter and more robust way of treating class-imbalanced problems.

SMOTETomek is an algorithm that combines both sampling strategies; it is also provided in the SAP HANA Predictive Analysis Library (PAL) and wrapped in the Python machine learning client for SAP HANA (hana_ml). In the following we use SMOTETomek to resample the training data.

from hana_ml.algorithms.pal.preprocessing import SMOTETomek
stk = SMOTETomek(smote_amount=multi*100)
train_stk = stk.fit_transform(data=train_,
                              label='TYPE',
                              minority_class=1)
from hana_ml.algorithms.pal.unified_classification import UnifiedClassification
dtc = UnifiedClassification(func='DecisionTree',
                            algorithm='cart')
dtc.fit(data=train_stk[['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'TYPE']],
        partition_method='no',
        label='TYPE',
        categorical_variable='TYPE')
score_res = dtc.score(data=test_,
                      key='ID',
                      features=['V0', 'V1', 'V2',
                                'V3', 'V4', 'V5'])
score_res[1].collect()
STAT_NAME STAT_VALUE CLASS_NAME
0 AUC 0.9990487145550575 None
1 RECALL 0.9981884057971014 0
2 PRECISION 0.99909338168631 0
3 F1_SCORE 0.9986406887177163 0
4 SUPPORT 1104 0
5 RECALL 0.9642857142857143 1
6 PRECISION 0.9310344827586207 1
7 F1_SCORE 0.9473684210526316 1
8 SUPPORT 28 1
9 ACCURACY 0.9973498233215548 None
10 KAPPA 0.9460095389507154 None
11 MCC 0.9461627755815293 None

Overall, the evaluation metrics on the test dataset are similar to the case when SMOTE is applied to the training data. This is reasonable because usually only a few points are associated with TomekLinks in a dataset, so the undersampling effect is insignificant.

 

Next we undersample (randomly, without replacement) the non-hyperfunctional cases in the training data by half, and at the same time oversample the hyperfunctional cases using SMOTE so that the two classes eventually have similar sizes.

from hana_ml.algorithms.pal.preprocessing import Sampling
dsp_rate=0.5
sp = Sampling(method='simple_random_without_replacement',
              sampling_size=int(dsp_rate*train_neg.count()),
              random_state=2)
multi = int(train_neg.count()/train_pos.count())  # class-count ratio, used for smote_amount below
train_neg_sub = sp.fit_transform(train_neg)
from hana_ml.algorithms.pal.preprocessing import SMOTE
smote = SMOTE(smote_amount=int(multi*100*dsp_rate))
train_pos_sup = smote.fit_transform(train_pos,
                                    label='TYPE',
                                    minority_class=1)
train_balanced = train_neg_sub.union(train_pos_sup)
dtc.fit(data=train_balanced,
        key='ID',
        partition_method='no',
        label='TYPE',
        categorical_variable='TYPE')
score_res = dtc.score(data=test_,
                      key='ID',
                      features=['V0', 'V1', 'V2',
                                'V3', 'V4', 'V5'])
score_res[1].collect()
STAT_NAME STAT_VALUE CLASS_NAME
0 AUC 0.9973498233215548 None
1 RECALL 0.9972826086956522 0
2 PRECISION 1 0
3 F1_SCORE 0.998639455782313 0
4 SUPPORT 1104 0
5 RECALL 1 1
6 PRECISION 0.9032258064516129 1
7 F1_SCORE 0.9491525423728813 1
8 SUPPORT 28 1
9 ACCURACY 0.9973498233215548 None
10 KAPPA 0.9477956096661133 None
11 MCC 0.9490897684093421 None

The trained model also achieves perfect recall for hyperfunctional cases in the test data and, compared to the result of purely undersampling the non-hyperfunctional cases, the precision for hyperfunctional cases improves a lot (from ~53.85% to ~90.32%). Compared with the previous results, the result of this hybrid resampling method is the most satisfying one.
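One way to make "most satisfying" more tangible is to weigh the two kinds of errors by their costs, in the spirit of the fraud example in the introduction. In the sketch below, the error counts are inferred from the RECALL, PRECISION and SUPPORT rows reported above (direct duplication and SMOTETomek give the same counts as SMOTE), while the two cost values are purely hypothetical numbers chosen for illustration.

```python
# (false negatives, false positives) on the test set for each training strategy,
# inferred from the reported metrics rather than printed by PAL.
errors = {
    "direct": (4, 3),
    "SMOTE": (1, 2),
    "TomekLinks": (1, 3),
    "random undersampling": (0, 24),
    "hybrid (random + SMOTE)": (0, 3),
}

COST_FN = 50.0   # hypothetical cost of a missed hyperfunctional case
COST_FP = 1.0    # hypothetical cost of a false alarm

total_cost = {name: fn * COST_FN + fp * COST_FP for name, (fn, fp) in errors.items()}
best = min(total_cost, key=total_cost.get)

for name, cost in sorted(total_cost.items(), key=lambda kv: kv[1]):
    print(f"{name:<26}{cost:>7.1f}")
print("lowest cost:", best)   # lowest cost: hybrid (random + SMOTE)
```

Note that the hybrid method beats random undersampling for any positive costs (same misses, fewer false alarms), while it beats SMOTE only when a missed case costs more than a false alarm, which is the typical situation in anomaly detection.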

Summary and Discussion

In this blog post, we have done a case study on anomaly detection by classification, where a training dataset with normal and anomalous labels is provided. The main difficulty of anomaly detection by classification is that the distribution of normal and anomalous labels is usually highly imbalanced, which poses a challenge for building an effective classification model. We have shown that, by appropriately resampling the training data, we can potentially improve the classification model’s performance in the prediction phase, especially the precision and recall w.r.t. the anomalous data points. In our case study, a hybrid resampling method combining SMOTE and random subsampling (without replacement) gives the most satisfying result.

However, it should be emphasized that, even when the training data is highly imbalanced in classes, resampling does not always lead to a better machine learning model (e.g. see [3]) than training without resampling. One must be careful to validate the gain from resampling the training data, especially when the collected features can already distinguish the classes well.

References

[1] Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science
[2] Shebuti Rayana (2016). ODDS Library [http://odds.cs.stonybrook.edu]. Stony Brook, NY: Stony Brook University, Department of Computer Science
[3] Yang Feng, Min Zhou and Xin Tong (2020), Imbalanced classification: an objective-oriented review, Technical Report, New York University, School of Global Public Health
