Outlier Detection using Statistical Tests in Python Machine Learning Client for SAP HANA

In datasets with multiple features, one typical type of outliers are those corresponding to extreme values in numerical features. In this blog post, we will show how to use statistical tests in the Python machine learning client for SAP HANA(hana_ml) to detect such outliers. Note that the statistical test algorithms are originally implemented in the SAP HANA Predictive Analysis Library(PAL), while hana_ml  provides us a friendly enough Python API for such algorithms.

In this blog post, you will learn:

• Outlier detection using variance test
• Outlier detection using IQR test

Introduction

Outliers are points in datasets that are significantly different from others observations. Their appearance could be the result of many reasons, like measurement variability, experimental error, unexpected event, etc. Outliers do not always linked with errors or bad things, sometimes they are equivalent to ‘outstanding’ and worth more of our investigation. However, since their existence often poses some difficulty for statistical analysis of the dataset, the detection of outliers is often desired for dataset preprocessing. The detected outliers could then be removed from the dataset, or analyzed by more careful studies, based on what role the outliers play in different datasets.

The detection of outliers typically depends on the modeling inliers that are considered indifferent from most data points in the dataset. A typical case is: for a collection of numerical values, values that centered around the sample mean/median are considered to be inliers, while values deviates greatly from the sample mean/median are usually considered to be outliers. In this tutorial we consider the detection of such type of outliers using statistical tests.

There are many other statistical test algorithms for detecting outliers, like variance test, inter-quartile-range(IQR) test, extreme studentized deviate(ESD) test, etc. Among them, the following two algorithms are the main focus of this blog post:

• Variance Test – Perhaps the most simple yet popular approach for outlier detection. Given a collection of numerical values, sample mean 𝜇̃ and deviation 𝜎̃ is firstly computed, with a pre-specified positive multiplier 𝛼, a range of normality [𝜇̃ 𝛼𝜎̃ ,𝜇̃ +𝛼𝜎̃ ] is defined. If a point is within this range, then it is labeled as a inlier, otherwise it is labeled as an outlier.
• IQR Test – Similar to variance test, it also specify a range to contain inlier points, while outliers are points outside the specified range. . Two quartiles 𝑄1 (25th percentile) and 𝑄3 (75th percentile) and the quartile range 𝑅=𝑄3−𝑄1 is used to define the interval for inliers: [𝑄1−𝛼𝑅,𝑄3+𝛼𝑅] , where 𝛼 is a pre-specified positive multiplier that is usually set to 1.5. The estimation of quartiles is much more robust to the presence of extreme outliers compared to mean/variance, so the detection also becomes more robust.

Solutions

The two statistical test algorithms mentioned in the previous section are only for 1D numerical values. For datasets with multiple numerical features, we can inspect each interested feature separately for outlier detection, and then aggregate the detection results as a whole. In this way we would potentially detect a fair amount of outliers from the dataset.

The two test algorithms naturally leads to the two use case that will be illustrated in this section. We will elaborate variance test for outliers detection in the first use case, with illustration of a possible weakness. IQR test for outlier detection, which is not suffered from such weakness, will be elaborated in the 2nd use case.

Before we go to detailed use cases, we firstly need to establish a sound connection to SAP HANA.

import hana_ml
from hana_ml.dataframe import ConnectionContext
cc = ConnectionContext(address='xx.xx.xx.xx', port=30x15, user='XXX', password='Xxxxxxx')#detailed account info is hidded away

The established ConnectionContext object cc is a connection to SAP HANA, with which we can send out queries to the database and fetch the corresponding result.

Use Cases

The dataset applied in both use cases is a two-variate dataset Generated from a 2D Gaussian distribution. Here we assumed it is stored in a HANA table with name of  “PAL_GAUSSIAN_2D_DATA_TBL”. Then, we can get the handle of it in python client using the table() function in the established ConnectionContext object.

df = cc.table('PAL_GAUSSIAN_2D_DATA_TBL')

The handle df itself is a hana_ml.DataFrame object, which contains the information of data in database. We can apply the collect() method of hana_ml.DataFrame to fetch the data from database to the Python client.

df.collect()

The dataset has 3 columns: one ID column and two feature columns with name X and Y, respectively. Then, we can get a shallow impression of the dataset using the scatter plot functionality in Python.

data = df.collect()
import matplotlib.pyplot as plt
plt.scatter(data['X'], data['Y'])
plt.show()

Outlier Detection using Variance Test

Variance test is categorized as a preprocessing algorithm in hana_ml, we import it from hana_ml and apply it to the two feature columns X and Y, respectively.

from hana_ml.algorithms.pal.preprocessing import variance_test
resX = variance_test(df[['ID', 'X']], sigma_num=3)
resY = variance_test(df[['ID', 'Y']], sigma_num=3)

Variance test returns a tuple of two hana_ml DataFrames, where the first one is the outlier detection result, and the second one is related statistics of the data involved in outlier detection. For each detection result, the ID column is there along with a new column of the name ‘IS_OUT_OF_RANGE’. The IS_OUT_OF_RANGE column contains 0s and 1s, where 0 is for  inlier and 1 for outlier. We can check the detected outliers in X values via a SQL query statement as follows:

x_outlier = cc.sql('SELECT ID FROM ({}) WHERE IS_OUT_OF_RANGE = 1'.format(resX[0].select_statement))
x_outlier.collect()

The detection of outliers in the Y column can be done in a similar way. Consequently, the two detection results could be unified to form the overall detection result of outliers(using the union() function for hana_ml DataFrames). However, there might be some data points that are recognized as outliers in both X and Y values, which creates duplicated values in the unified detection result. We need to apply the drop_duplicates() function for the removal of duplicated values in this case. The entire procedure is illustrated as follows:

y_outlier = cc.sql('SELECT ID FROM ({}) WHERE IS_OUT_OF_RANGE = 1'.format(resY[0].select_statement))
outlier_id = x_outlier.union(y_outlier)
outlier_id = outlier_id.drop_duplicates()
df_outlier = cc.sql('SELECT * FROM ({}) WHERE ID IN (SELECT ID FROM ({}))'.format(df.select_statement,
outlier_id.select_statement))

Finally, we draw the scatter plot of the detected outliers as follows:

df_inlier = cc.sql('SELECT * FROM ({}) WHERE ID NOT IN (SELECT ID FROM ({}))'.format(df.select_statement,
outlier_id.select_statement))
data_outlier = df_outlier.collect()
data_inlier = df_inlier.collect()
plt.scatter(data_inlier['X'], data_inlier['Y'], label='Inlier')
plt.scatter(data_outlier['X'], data_outlier['Y'], color='red', label='Outlier')
plt.legend()
plt.show()

However, it is known that the effectivity of variance test is easily affected by the existence of extreme outliers. For example, we add an additional point to the 2D Gaussian dataset used above as follows:

new_point = cc.sql('SELECT ID+1 AS ID, X+4000 AS X, Y FROM ({}) WHERE ID = 3999'.format(df.select_statement))
new_df = df.union(new_point)
new_df.collect()

The augmented dataset contains a new data point with ID 4000, which has very large X value(~4000, other data points have X values centered around 0, commonly no greater than 5 in absolute sense). Then, if we apply variance test with the X values in the augmented dataset, we will obtain the following result:

resX = variance_test(new_df[['ID', 'X']], sigma_num=3)
outlier_X = cc.sql('SELECT ID FROM ({}) WHERE IS_OUT_OF_RANGE = 1'.format(resX[0].select_statement))
outlier_X.collect()

So the variance test on X results in the detection of only the newly added extreme outlier. Intrinsically, this happens because the newly added extreme outlier makes the originally detected outliers look much more ‘normal’; while numerically, variance test depends on the calculation of sample mean and variance, both are very sensitive to existence of extreme values in the dataset.

In the following section we introduce an outlier detection approach called inter-quartile-range(IQR) that is much more robust to the existence of extreme outliers.

Outlier Detection using Inter-Quartile-Range(IQR) Test

IQR is categorized as an statistics algorithm in hana_ml, we can import it and then apply it to any data values of interest. However, for the ease of comparison between variance test and IQR test, we first manually tune a multiplier for IQR, so that IQR test will detect similar number of outliers in X column as variance test for the origin dataset. We find that a multiplier around 1.8 satisfies our constraint, illustrated as follows:
from hana_ml.algorithms.pal.stats import iqr
iqr_resX1 = iqr(df, key = 'ID', col='X', multiplier=1.8)
iqr_outlier_X1 = cc.sql('SELECT ID FROM ({}) WHERE IS_OUT_OF_RANGE = 1'.format(iqr_resX1[0].select_statement))
iqr_outlier_X1.collect()

Next, we apply IQR test with multiplier 1.8  to the augmented dataset with an added extreme X-valued point in the previous section.

iqr_resX = iqr(new_df, key = 'ID', col='X', multiplier=1.8) 

Following is the illustration of the detection result.

iqr_outlier_X = cc.sql('SELECT ID FROM ({}) WHERE IS_OUT_OF_RANGE = 1'.format(iqr_resX[0].select_statement))
iqr_outlier_X.collect()

So under IQR test, the introduction of a new extreme outlier only results in the added detection of this point itself, and all other originally detected outliers remain to be detected.

For completeness, let us continue the outlier detection on Y, and then view the overall detection results on the original dataset. The procedure is similar to that of variance test.

iqr_resY = iqr(new_df, key = 'ID', col='Y', multiplier=1.8)
iqr_outlier_Y = cc.sql('SELECT ID FROM ({}) WHERE IS_OUT_OF_RANGE = 1'.format(iqr_resY[0].select_statement))
iqr_outlier_id = iqr_outlier_X.union(iqr_outlier_Y).drop_duplicates() # merge the result
df_outlier = cc.sql('SELECT * FROM ({}) WHERE ID IN (SELECT ID FROM ({}))'.format(df.select_statement,
iqr_outlier_id.select_statement))
df_inlier = cc.sql('SELECT * FROM ({}) WHERE ID NOT IN (SELECT ID FROM ({}))'.format(df.select_statement,
iqr_outlier_id.select_statement))
data_outlier = df_outlier.collect()
data_inlier = df_inlier.collect()
plt.scatter(data_inlier['X'], data_inlier['Y'], label='Inlier')
plt.scatter(data_outlier['X'], data_outlier['Y'], color='red', label='Outlier')
plt.legend()
plt.show()

The detection result is quite similar to that of variance test on the original dataset, which is not at all affected by the introduction of the extreme outlier in X-value.

Discussion and Summary

In the blog post, we have introduced two statistical tests for detecting outliers in datasets, namely variance test and IQR test. Both methods specify a range for the discrimination between inliers and outliers. Compared with variance test, IQR test is a more robust outlier detection method with the presence of extremely deviated(from mean/median) values in the targeted numerical feature.

The drawbacks of the two tests for outlier detection are also obvious. Since both methods only work on 1D numerical data, so they are mainly applicable to outliers with at least one outstanding numerical features value. However, there are outliers that do not contain any outstanding numerical feature value, but standing out from the population when all their feature values are combined. For example, people with age 5 is not a minority group in population, and people with height between 170 cm and 171 cm is also not a minority group in population, yet a person with age 5 and height 170 cm is highly likely to be an outlier in population. Besides, it is often beneficial to consider other characteristics, e.g. local density, connectivity to other data points when identifying whether a data point is an outlier or not, yet this is not reflected in both test method. In a few separate tutorials[1][2][3], we show readers how to detect such outliers by involving all features simultaneously and exploring more characteristics among points.