###### Technical Articles

# Automatic Outlier Detection for Time Series in SAP HANA

In a time series, an outlier is a data point that deviates from the general behavior of the remaining data points. The Predictive Analysis Library (PAL) of SAP HANA provides automatic outlier detection for time series. You can find more details in Outlier Detection for Time Series in PAL.

In PAL, the outlier detection procedure is divided into two steps. In step 1, we extract the residual from the original series. In step 2, we detect outliers in the residual. **For step 1, PAL provides an automatic method.**

In this blog post, you will learn:

- an introduction to outliers in time series and the automatic detection method in PAL
- some use cases of automatic outlier detection

To make it easy to test the performance of automatic outlier detection, we use the Z1 score method (the default method) in step 2 and set the parameter THRESHOLD = 3 (the default value). This value follows the 3-sigma rule.
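As a sketch of what step 2 does with these default settings, the snippet below flags points whose absolute z-score in the residual exceeds 3. This is only an illustration on a toy residual; the actual PAL Z1 score computation may differ in detail.

```python
import numpy as np

def z_score_outliers(residual, threshold=3.0):
    """Flag points whose absolute z-score exceeds the threshold (3-sigma rule)."""
    score = np.abs(residual - residual.mean()) / residual.std()
    return score > threshold

# toy residual: small noise plus one spike at index 15
residual = np.tile([0.1, -0.1], 15)
residual[15] = 5.0
flags = z_score_outliers(residual)
print(np.flatnonzero(flags))  # → [15]
```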

To make the results easy to read and display, we call the PAL procedure from a Jupyter notebook. Calling the PAL procedure and plotting the results require some helper functions, which are listed in the Appendix.

# Introduction

In time series, outliers can have many causes, such as data entry errors, experimental errors, sampling errors, and natural anomalies. Outliers can have a huge impact on the results of data analysis, such as the seasonality test. Outlier detection is therefore an important preprocessing step for time series analysis.
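As a small illustration (not PAL code) of how much a single outlier can distort the summary statistics that downstream tests rely on:

```python
import numpy as np

clean = np.sin(2 * np.pi * np.arange(100) / 25)  # a well-behaved seasonal series
noisy = clean.copy()
noisy[50] = 50.0  # a single data-entry error

# the one outlier inflates the standard deviation roughly sevenfold
print(round(clean.std(), 2), round(noisy.std(), 2))  # → 0.71 5.02
```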

In this algorithm, the outlier detection procedure is divided into two steps. In step 1, we extract the residual from the original series. In step 2, we detect outliers in the residual. This blog post focuses on the automatic method of step 1.

For step 1, PAL provides an automatic method, which combines seasonal decomposition, linear regression, the median filter, and the super smoother. The process of the automatic method is shown in the picture below.
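As a rough sketch of one of these building blocks, the toy median-filter residual below (a simplified stand-in, not PAL's internal implementation) shows how a spike is separated from the smooth part of the series:

```python
import numpy as np

def median_filter_residual(series, window=3):
    """Residual after subtracting a running median (edges use a truncated window)."""
    half = window // 2
    smoothed = np.array([np.median(series[max(i - half, 0):i + half + 1])
                         for i in range(len(series))])
    return series - smoothed

series = np.array([1.0, 1.1, 0.9, 8.0, 1.0, 1.2, 0.8, 1.1])
res = median_filter_residual(series)
print(res)  # the spike at index 3 stands out clearly in the residual
```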

The algorithm outputs a result table and a statistics table. The result table contains the residual, the outlier score, and the outlier label. The statistics table contains information about the time series and the outlier detection method. For the automatic method, the final smoothing method chosen in step 1 is shown in the statistics table.

# Test Cases

To call the PAL procedure with Python, we first import some Python packages.

```
import numpy as np
from datetime import datetime
from pandas import read_table
import matplotlib.pyplot as plt
import pandas as pd
```

## case 1: smooth data without seasonality

### data

The data comes from spikey_v.dat. It is smooth but has no seasonality. The code below loads and plots the data.

```
str_path = 'your data path'
cols = ['j', 'u', 'v', 'temp', 'sal', 'y', 'mn', 'd', 'h', 'mi']
df = read_table(str_path + 'spikey_v.dat', delim_whitespace=True, names=cols)
df.index = [datetime(*x) for x in zip(df['y'], df['mn'], df['d'], df['h'], df['mi'])]
df = df.drop(['y', 'mn', 'd', 'h', 'mi'], axis=1)
data = np.full(len(df), np.nan)
for i in range(len(df)):
    data[i] = df['u'][i]
plt.plot(data)
plt.grid()
```

### automatic outlier detection

We call the procedure `_SYS_AFL.PAL_OUTLIER_DETECTION_FOR_TIME_SERIES` to detect outliers automatically. The Python code and results are as follows.

```
Outlier_parameters = {
    "AUTO": 1,
}
df = pd.concat([pd.DataFrame({"ID": list(range(len(data)))}), pd.DataFrame(data)], axis=1)
dfOutlierResults, dfStats, dfMetri = OutlierDetectionForTimeSeries(df, Outlier_parameters, cc)
dfOutlierResults
```

`dfStats`

The results are plotted below.

`outlier_result_plot(dfOutlierResults)`

`outlier_plot(dfOutlierResults)`

## case 2: non-smooth data without seasonality

### data

The data comes from the R package “fpp2”: the daily gold price from January 1st, 1985 to March 31st, 1989. The data is neither smooth nor seasonal, and it contains some missing values, which are imputed by linear interpolation. The code below loads and plots the data.

```
str_path = 'your data path'
df = pd.read_csv(str_path + 'daily_csv_no_missing_value.csv')
num = len(df)
data = np.full(num, np.nan)
for i in range(num):
    data[i] = df['Price'][i]
plt.plot(data)
plt.grid()
```
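The linear interpolation used for the imputation can be reproduced with pandas. The series below is a toy example, not the actual gold price data:

```python
import numpy as np
import pandas as pd

# a short price series with two consecutive missing values
price = pd.Series([330.0, np.nan, np.nan, 336.0, 340.5])
filled = price.interpolate(method='linear')
print(filled.tolist())
```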

### automatic outlier detection

We call the procedure `_SYS_AFL.PAL_OUTLIER_DETECTION_FOR_TIME_SERIES` to detect outliers automatically. The Python code and results are as follows.

```
Outlier_parameters = {
    "AUTO": 1,
}
df = pd.concat([pd.DataFrame({"ID": list(range(len(data)))}), pd.DataFrame(data)], axis=1)
dfOutlierResults, dfStats, dfMetri = OutlierDetectionForTimeSeries(df, Outlier_parameters, cc)
dfOutlierResults
```

`dfStats`

The results are plotted below.

`outlier_result_plot(dfOutlierResults)`

`outlier_plot(dfOutlierResults)`

From the statistics table, we can see that the final smoothing method is the super smoother, as the time series is not very smooth. From the results above, we find that PAL successfully detects the outlier at t = 769, but it also flags some other points as outliers. The plots of the residual and the outlier score show that the outlier at t = 769 is by far the most pronounced. To detect only the outlier at t = 769, we can adjust the threshold or choose another method in step 2.
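For example, step 2 can be made stricter by raising the THRESHOLD parameter mentioned earlier. The value below is purely illustrative; in practice you would tune it until only the outlier at t = 769 remains flagged:

```python
# reuse the same helper call as above, with a stricter step-2 threshold
Outlier_parameters = {
    "AUTO": 1,
    "THRESHOLD": 10.0,  # illustrative value, not a recommendation
}
# dfOutlierResults, dfStats, dfMetri = OutlierDetectionForTimeSeries(df, Outlier_parameters, cc)
```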

## case 3: smooth data with seasonality

### data

The data is synthetic and seasonal with period = 40. The code injects outliers at six fixed positions, four of which are obvious in the resulting series. The time series is generated as below.

```
import math
random_seed = 3
np.random.seed(random_seed)
cols = 200  # length of the time series
cycle = 40
outlier_idx = [30, 45, 73, 126, 159, 173]
timestamp = np.arange(cols)  # np.full(cols, np.nan, dtype=int) would fail: NaN cannot be cast to int
seasonal = np.full(cols, np.nan, dtype=float)
for i in range(cols):
    seasonal[i] = math.sin(2 * math.pi / cycle * timestamp[i])
const = np.full(cols, 2, dtype=float)
noise = np.full(cols, 0, dtype=float)
for i in range(cols):
    noise[i] = 0.2 * (np.random.rand() - 0.5)
trend = 0.01 * timestamp
outlier = np.full(cols, 0, dtype=float)
for i in outlier_idx:
    outlier[i] = 4 * (np.random.rand() - 0.5)
data = seasonal + const + noise + trend + outlier
plt.plot(data)
plt.grid()
```

### automatic outlier detection

We call the procedure `_SYS_AFL.PAL_OUTLIER_DETECTION_FOR_TIME_SERIES` to detect outliers automatically. The Python code and results are as follows.

```
Outlier_parameters = {
    "AUTO": 1,
}
df = pd.concat([pd.DataFrame({"ID": list(range(len(data)))}), pd.DataFrame(data)], axis=1)
dfOutlierResults, dfStats, dfMetri = OutlierDetectionForTimeSeries(df, Outlier_parameters, cc)
dfOutlierResults
```

`dfStats`

The results are plotted below.

`outlier_result_plot(dfOutlierResults)`

`outlier_plot(dfOutlierResults)`

From the statistics table, we can see that the final smoothing method is the median filter plus seasonal decomposition, followed by the super smoother, as the time series is quite smooth and seasonal. From the results above, we can see that PAL detects the four obvious outliers.

## case 4: non-smooth data with seasonality

### data

The data is monthly ice cream interest data with period 12; you can find it in ice_cream_interest.csv. There are two obvious outliers in the time series. The code below loads and plots the data.

```
str_path = 'your data path'
df = pd.read_csv(str_path + 'ice_cream_interest.csv')
data = np.full(len(df), np.nan, dtype=float)
for i in range(len(df)):
    data[i] = df['interest'][i]
plt.plot(data)
plt.grid()
```

### automatic outlier detection

We call the procedure `_SYS_AFL.PAL_OUTLIER_DETECTION_FOR_TIME_SERIES` to detect outliers automatically. The Python code and results are as follows.

```
Outlier_parameters = {
    "AUTO": 1,
}
df = pd.concat([pd.DataFrame({"ID": list(range(len(data)))}), pd.DataFrame(data)], axis=1)
dfOutlierResults, dfStats, dfMetri = OutlierDetectionForTimeSeries(df, Outlier_parameters, cc)
dfOutlierResults
```

`dfStats`

The results are plotted below.

`outlier_result_plot(dfOutlierResults)`

`outlier_plot(dfOutlierResults)`

From the statistics table, we can see that the final smoothing method is seasonal decomposition followed by the super smoother, as the time series is seasonal but not very smooth. From the results above, we can see that PAL detects the two obvious outliers.

# Conclusions

In this blog post, we describe what an outlier in a time series is and introduce the automatic outlier detection method for time series in PAL. We also provide examples showing how to call the automatic outlier detection procedure and how to display the detection results. The results above show that the automatic method can detect outliers in different kinds of time series. We hope you enjoy reading this blog!

The method will also be included in hana-ml. If you want to learn more about the automatic outlier detection method for time series in SAP HANA Predictive Analysis Library (PAL) and hana-ml, please refer to the following links:

Outlier Detection for Time Series in PAL

Outlier Detection for Time Series in hana-ml (automatic method will be included after 2023 Q3)

### Other Useful Links:

Outlier Detection using Statistical Tests in Python Machine Learning Client for SAP HANA

Outlier Detection by Clustering using Python Machine Learning Client for SAP HANA

Outlier Detection with One-class Classification using Python Machine Learning Client for SAP HANA

# Appendix

## SAP HANA Connection

```
import hana_ml
from hana_ml import dataframe
# the helper functions below refer to this connection as cc
cc = dataframe.ConnectionContext('host', 'port', 'username', 'password')
```

## Functions for Table

```
def createEmptyTable(table_name, proto, cc):
    with cc.connection.cursor() as cur:
        try:
            joint = []
            for key in proto:
                joint.append(" ".join(['"{:s}"'.format(key), proto[key]]))
            cur.execute('CREATE COLUMN TABLE %s (%s);' %
                        (table_name, ",".join(joint)))
        except Exception:
            print(
                f"\"CREATE TABLE {table_name}\" was unsuccessful. Maybe the table already exists.")

def dropTable(table_name, cc):
    with cc.connection.cursor() as cur:
        try:
            cur.execute(f"DROP TABLE {table_name}")
        except Exception:
            print(f"\"DROP TABLE {table_name}\" was unsuccessful. Maybe the table does not exist yet.")

def createTableFromDataFrame(df, table_name, cc):
    dropTable(table_name, cc)
    dt_ml = dataframe.create_dataframe_from_pandas(cc, df, table_name=table_name, table_structure={"MODEL_CONTENT": "NCLOB"})
    # dt_ml = dataframe.create_dataframe_from_pandas(cc, df, table_name=table_name, table_structure={"COL1": "CLOB"})
    return dt_ml
```
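Since createEmptyTable only assembles a CREATE COLUMN TABLE statement from the column prototype, the generated SQL can be previewed without a database connection:

```python
# same string assembly as in createEmptyTable, on the result-table prototype
proto = {"TIMESTAMP": "INTEGER", "RAW_DATA": "DOUBLE"}
joint = [" ".join(['"{:s}"'.format(key), proto[key]]) for key in proto]
sql = 'CREATE COLUMN TABLE %s (%s);' % ("ZPD_PAL_RESULT_TBL", ",".join(joint))
print(sql)  # → CREATE COLUMN TABLE ZPD_PAL_RESULT_TBL ("TIMESTAMP" INTEGER,"RAW_DATA" DOUBLE);
```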

## Function of Calling the PAL Procedure of Outlier Detection for Time Series

```
def OutlierDetectionForTimeSeries(df, parameters, cc,
                                  data_table='ZPD_PAL_DATA_TBL',
                                  parameter_table='ZPD_PAL_PARAMETERS_TBL',
                                  result_table='ZPD_PAL_RESULT_TBL',
                                  stats_table='ZPD_PAL_STATS_TBL',
                                  metri_table='ZPD_PAL_METRI_TBL'):
    # Input table
    createTableFromDataFrame(df, data_table, cc)
    # Result table
    dropTable(result_table, cc)
    createEmptyTable(result_table, {
        "TIMESTAMP": "INTEGER", "RAW_DATA": "DOUBLE", "RESIDUAL": "DOUBLE",
        "OUTLIER_SCORE": "DOUBLE", "IS_OUTLIER": "INTEGER"}, cc)
    # Metrics table
    dropTable(metri_table, cc)
    createEmptyTable(metri_table, {
        "STAT_NAME": "NVARCHAR(1000)", "VALUE": "DOUBLE"}, cc)
    # Stats table
    dropTable(stats_table, cc)
    createEmptyTable(stats_table, {
        "STAT_NAME": "NVARCHAR(1000)", "STAT_VALUE": "NVARCHAR(1000)"}, cc)
    # Parameter table
    dropTable(parameter_table, cc)
    createEmptyTable(parameter_table, {"PARAM_NAME": "nvarchar(256)", "INT_VALUE": "integer",
                                       "DOUBLE_VALUE": "double", "STRING_VALUE": "nvarchar(1000)"}, cc)
    if parameters:
        with cc.connection.cursor() as cur:
            for parName, parValue in parameters.items():
                if isinstance(parValue, str):
                    parValue = f"'{parValue}'"
                    parametersSQL = f"{parValue if isinstance(parValue, int) else 'NULL'}, {parValue if isinstance(parValue, float) else 'NULL'}, {parValue if isinstance(parValue, str) else 'NULL'}"
                    cur.execute(
                        f"INSERT INTO {parameter_table} VALUES ('{parName}', {parametersSQL});")
                elif isinstance(parValue, list):
                    for x in parValue:
                        if isinstance(x, str):
                            x = f"'{x}'"
                        parametersSQL = f"{x if isinstance(x, int) else 'NULL'}, {x if isinstance(x, float) else 'NULL'}, {x if isinstance(x, str) else 'NULL'}"
                        cur.execute(
                            f"INSERT INTO {parameter_table} VALUES ('{parName}', {parametersSQL});")
                else:
                    parametersSQL = f"{parValue if isinstance(parValue, int) else 'NULL'}, {parValue if isinstance(parValue, float) else 'NULL'}, {parValue if isinstance(parValue, str) else 'NULL'}"
                    cur.execute(
                        f"INSERT INTO {parameter_table} VALUES ('{parName}', {parametersSQL});")
    else:
        print("No parameters given; using default values.")
    sql_str = f"\
        do begin \
        lt_data = select * from {data_table}; \
        lt_control = select * from {parameter_table}; \
        CALL _SYS_AFL.PAL_OUTLIER_DETECTION_FOR_TIME_SERIES(:lt_data, :lt_control, lt_res, lt_stats, lt_metri); \
        INSERT INTO {result_table} SELECT * FROM :lt_res; \
        INSERT INTO {stats_table} SELECT * FROM :lt_stats; \
        INSERT INTO {metri_table} SELECT * FROM :lt_metri; \
        end;"
    with cc.connection.cursor() as cur:
        cur.execute(sql_str)
    return cc.table(result_table).collect(), cc.table(stats_table).collect(), cc.table(metri_table).collect()
```

## Functions of Plotting Results

```
def outlier_result_plot(dResults):
    dResults.sort_values(by=list(dResults)[0], inplace=True, ascending=True)
    raw_data = np.array(dResults['RAW_DATA'])
    residual = np.array(dResults['RESIDUAL'])
    outlier_score = np.array(dResults['OUTLIER_SCORE'])
    is_outlier = np.array(dResults['IS_OUTLIER'])
    plt.figure(figsize=(24, 4.5))
    plt.subplot(1, 4, 1)
    plt.plot(raw_data)
    plt.grid()
    plt.title('RAW_DATA')
    plt.subplot(1, 4, 2)
    plt.plot(residual)
    plt.grid()
    plt.title('RESIDUAL')
    plt.subplot(1, 4, 3)
    plt.plot(outlier_score)
    plt.grid()
    plt.title('OUTLIER_SCORE')
    plt.subplot(1, 4, 4)
    plt.plot(is_outlier)
    plt.grid()
    plt.title('IS_OUTLIER')
```

```
def outlier_plot(dResults):
    dResults.sort_values(by=list(dResults)[0], inplace=True, ascending=True)
    raw_data = np.array(dResults['RAW_DATA'])
    is_outlier = np.array(dResults['IS_OUTLIER'])
    outlier_idx = np.array([], dtype=int)
    for i in range(len(is_outlier)):
        if is_outlier[i] == 1:
            outlier_idx = np.append(outlier_idx, i)
    plt.plot(raw_data)
    plt.scatter(outlier_idx, raw_data[outlier_idx], color='red')
    plt.grid()
    plt.title("series and outlier")
```
