Zhi Yang

Automatic Outlier Detection for Time Series in SAP HANA

In time series, an outlier is a data point that deviates from the general behavior of the remaining data points. The Predictive Analysis Library (PAL) of SAP HANA provides automatic outlier detection for time series. You can find more details in Outlier Detection for Time Series in PAL.

In PAL, the outlier detection procedure is divided into two steps: in step 1, we extract the residual from the original series; in step 2, we detect the outliers in the residual. For step 1, an automatic method is available.

In this blog post, you will learn:

  • an introduction to outliers in time series and the automatic detection method in PAL
  • some use cases of automatic outlier detection

To make it easy to assess the performance of automatic outlier detection, we use the Z1 score method (the default method) in step 2 and set the parameter THRESHOLD = 3 (the default value). This threshold follows the 3-sigma rule.
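To illustrate what step 2 does with these defaults, here is a minimal sketch of a z-score style check with THRESHOLD = 3. Note that `z1_score_outliers` is a name made up for this sketch; it is not PAL's implementation.

```python
import numpy as np

def z1_score_outliers(residual, threshold=3.0):
    """Score each residual by its distance from the mean in standard
    deviations and label points beyond the threshold (3-sigma rule)."""
    residual = np.asarray(residual, dtype=float)
    std = residual.std()
    if std == 0:
        std = 1.0  # avoid division by zero on a constant residual
    score = np.abs(residual - residual.mean()) / std
    return score, (score > threshold).astype(int)

# A flat residual with one spike: only the spike scores above 3.
residual = np.zeros(25)
residual[10] = 10.0
score, label = z1_score_outliers(residual)
print(np.where(label == 1)[0])  # → [10]
```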

To make the results easy to read and display, we call the PAL procedure from a Jupyter notebook. The helper functions for calling the procedure and plotting the results are listed in the Appendix.

Introduction

In time series, outliers can have many causes, such as data entry errors, experimental errors, sampling errors, and natural anomalies. Outliers can strongly distort the results of data analysis, for example a seasonality test. Outlier detection is therefore an important preprocessing step for time series analysis.

In this algorithm, the outlier detection procedure is divided into two steps: step 1 extracts the residual from the original series, and step 2 detects the outliers in the residual. This blog post focuses on the automatic method of step 1.

In step 1, we have an automatic method. The automatic method combines seasonal decomposition, linear regression, median filter and super smoother. The process of the automatic method is shown in the picture below.
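As a rough, self-contained sketch of step 1, here is what residual extraction with just one of these smoothers, a median filter, could look like. `median_filter_residual` is a hypothetical helper written for this sketch, not a PAL API.

```python
import numpy as np

def median_filter_residual(series, window=5):
    """Residual = series minus a running median; edges are handled
    by repeating the boundary values."""
    series = np.asarray(series, dtype=float)
    half = window // 2
    padded = np.pad(series, half, mode='edge')
    smoothed = np.array([np.median(padded[i:i + window])
                         for i in range(len(series))])
    return series - smoothed

# A smooth series with one spike: the running median ignores the
# spike, so the spike survives in the residual.
series = np.array([1.0, 1.1, 0.9, 5.0, 1.0, 1.05, 0.95, 1.0])
residual = median_filter_residual(series, window=3)
print(residual)
```

The residual is then handed to step 2, where a scoring method such as the Z1 score decides which points are outliers.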


Processes of Automatic Method

The algorithm outputs a result table and a statistics table. The result table contains the residual, the outlier score and the outlier label. The statistics table contains information about the time series and the outlier detection method; for the automatic method, it also shows the final smoothing method chosen in step 1.

Test Cases

To call the PAL procedure with Python, we first import the required packages.

import numpy as np
from datetime import datetime
from pandas import read_table
import matplotlib.pyplot as plt
import pandas as pd

case 1: smooth data without seasonality

data

The data is from spikey_v.dat. It is smooth but has no seasonality. The data is plotted with the code below.

str_path = 'your data path'
cols = ['j', 'u', 'v', 'temp', 'sal', 'y', 'mn', 'd', 'h', 'mi']

df = read_table(str_path+'spikey_v.dat' , delim_whitespace=True, names=cols)

df.index = [datetime(*x) for x in zip(df['y'], df['mn'], df['d'], df['h'], df['mi'])]
df = df.drop(['y', 'mn', 'd', 'h', 'mi'], axis=1)
data = df['u'].to_numpy(dtype=float)
plt.plot(data)
plt.grid()

automatic outlier detection

We call the procedure _SYS_AFL.PAL_OUTLIER_DETECTION_FOR_TIME_SERIES to detect outliers automatically. The python code and results are as follows.

Outlier_parameters = {
    "AUTO": 1,
}
df = pd.concat([pd.DataFrame({"ID":list(range(len(data)))}),pd.DataFrame(data)],axis=1)
dfOutlierResults, dfStats, dfMetri = OutlierDetectionForTimeSeries(df,Outlier_parameters,cc)
dfOutlierResults

dfStats

The results are plotted as below.

outlier_result_plot(dfOutlierResults)

outlier_plot(dfOutlierResults)

From the statistics table, we can see that the final smoothing method is the median filter, as the time series is quite smooth. From the results above, we find that PAL misses one outlier. This is because there are two large outliers, which inflate the standard deviation. In the plots of the residual and the outlier score, four outliers are clearly visible. We can lower the threshold or choose another method in step 2 to find all of them.
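To see why large outliers can mask smaller ones, here is a small sketch comparing a mean/std z-score with a median/MAD score. The MAD-based score is one robust alternative; `mad_outliers` and the data below are made up for this sketch and do not correspond to PAL parameters.

```python
import numpy as np

def mad_outliers(residual, threshold=3.0):
    """Score with the median and MAD instead of mean and standard
    deviation, so large outliers do not inflate the scale estimate."""
    residual = np.asarray(residual, dtype=float)
    med = np.median(residual)
    mad = np.median(np.abs(residual - med))
    scale = 1.4826 * mad if mad > 0 else 1.0  # 1.4826 makes MAD consistent with std
    score = np.abs(residual - med) / scale
    return score, (score > threshold).astype(int)

# Two big spikes inflate the standard deviation; a smaller outlier then
# slips under the z-score threshold, but not under the MAD-based one.
residual = np.tile([0.1, -0.1], 15)
residual[5], residual[20] = 12.0, -15.0
residual[12] = 4.0
z = np.abs(residual - residual.mean()) / residual.std()
print(sorted(np.where(z > 3)[0]))       # the outlier at index 12 is missed
_, label = mad_outliers(residual)
print(sorted(np.where(label == 1)[0]))  # all three outliers are found
```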

case 2: non-smooth data without seasonality

data

The data is from the R package “fpp2”: the daily gold price from January 1st, 1985 to March 31st, 1989. The series is neither smooth nor seasonal, and it contains some missing values, which we impute by linear interpolation. The data is plotted with the code below.
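The linear interpolation step can be reproduced with pandas, for example as follows (the series below is made-up data, not the gold price series):

```python
import numpy as np
import pandas as pd

# A hypothetical price column with a two-point gap; pandas fills the
# gap linearly between the surrounding observed values.
s = pd.Series([400.0, np.nan, np.nan, 410.0, 405.0], name='Price')
filled = s.interpolate(method='linear')
print(filled.tolist())
```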

str_path = 'your data path'
df = pd.read_csv(str_path+'daily_csv_no_missing_value.csv')
data = df['Price'].to_numpy(dtype=float)
plt.plot(data)
plt.grid()

automatic outlier detection

We call the procedure _SYS_AFL.PAL_OUTLIER_DETECTION_FOR_TIME_SERIES to detect outliers automatically. The python code and results are as follows.

Outlier_parameters = {
    "AUTO": 1,
}
df = pd.concat([pd.DataFrame({"ID":list(range(len(data)))}),pd.DataFrame(data)],axis=1)
dfOutlierResults, dfStats, dfMetri = OutlierDetectionForTimeSeries(df,Outlier_parameters,cc)
dfOutlierResults

dfStats

The results are plotted as below.

outlier_result_plot(dfOutlierResults)

outlier_plot(dfOutlierResults)

From the statistics table, we can see that the final smoothing method is the super smoother, as the time series is not so smooth. From the results above, we find that PAL detects the outlier at t = 769 successfully, but also flags some other points as outliers. In the plots of the residual and the outlier score, the outlier at t = 769 stands out clearly. We can adjust the threshold or choose another method in step 2 to detect only the outlier at t = 769.

case 3: smooth data with seasonality

data

The data is synthetic and seasonal with period = 40. Outliers are injected at six positions, four of which are clearly visible. The time series is generated with the code below.

random_seed = 3
np.random.seed(random_seed)
cols = 200  # length of time series
cycle = 40  # seasonal period
outlier_idx = [30, 45, 73, 126, 159, 173]

timestamp = np.arange(cols)
seasonal = np.sin(2 * np.pi / cycle * timestamp)
const = np.full(cols, 2, dtype=float)
noise = 0.2 * (np.random.rand(cols) - 0.5)
trend = 0.01 * timestamp
outlier = np.zeros(cols)
outlier[outlier_idx] = 4 * (np.random.rand(len(outlier_idx)) - 0.5)
data = seasonal + const + noise + trend + outlier
plt.plot(data)
plt.grid()

automatic outlier detection

We call the procedure _SYS_AFL.PAL_OUTLIER_DETECTION_FOR_TIME_SERIES to detect outliers automatically. The python code and results are as follows.

Outlier_parameters = {
    "AUTO": 1,
}
df = pd.concat([pd.DataFrame({"ID":list(range(len(data)))}),pd.DataFrame(data)],axis=1)
dfOutlierResults, dfStats, dfMetri = OutlierDetectionForTimeSeries(df,Outlier_parameters,cc)
dfOutlierResults

dfStats

The results are plotted as below.

outlier_result_plot(dfOutlierResults)

outlier_plot(dfOutlierResults)

From the statistics table, we can see that the final smoothing method is the median filter plus seasonal decomposition, followed by the super smoother, as the time series is quite smooth and seasonal. From the results above, we can see that PAL detects the four obvious outliers.

case 4: non-smooth data with seasonality

data

The data is monthly ice cream interest data with period 12. You can find the data in ice_cream_interest.csv. There are two obvious outliers in the time series. The series is plotted below.

str_path = 'your data path'
df = pd.read_csv(str_path+'ice_cream_interest.csv')
data = df['interest'].to_numpy(dtype=float)
plt.plot(data)
plt.grid()

automatic outlier detection

We call the procedure _SYS_AFL.PAL_OUTLIER_DETECTION_FOR_TIME_SERIES to detect outliers automatically. The python code and results are as follows.

Outlier_parameters = {
    "AUTO": 1,
}
df = pd.concat([pd.DataFrame({"ID":list(range(len(data)))}),pd.DataFrame(data)],axis=1)
dfOutlierResults, dfStats, dfMetri = OutlierDetectionForTimeSeries(df,Outlier_parameters,cc)
dfOutlierResults

dfStats

The results are plotted as below.

outlier_result_plot(dfOutlierResults)

outlier_plot(dfOutlierResults)

From the statistics table, we can see that the final smoothing method is seasonal decomposition followed by the super smoother, as the time series is seasonal but not so smooth. From the results above, we can see that PAL detects the two obvious outliers.
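As a rough illustration of why removing the seasonal component first helps, here is a minimal per-phase deseasonalization sketch for a period-12 series. `deseasonalize` is a made-up helper, far simpler than the seasonal decomposition PAL actually uses.

```python
import numpy as np

def deseasonalize(series, period):
    """Subtract a per-phase median seasonal profile, so the seasonal
    swings do not drown out the outliers in the residual."""
    series = np.asarray(series, dtype=float)
    phases = np.arange(len(series)) % period
    profile = np.array([np.median(series[phases == p]) for p in range(period)])
    return series - profile[phases]

# Four cycles of a period-12 sine wave with one injected outlier: after
# removing the seasonal profile, only the outlier remains in the residual.
t = np.arange(48)
series = 10 * np.sin(2 * np.pi * t / 12)
series[25] += 8.0  # injected outlier
residual = deseasonalize(series, period=12)
print(np.argmax(np.abs(residual)))  # → 25
```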

Conclusions

In this blog post, we described what an outlier in a time series is and introduced the automatic outlier detection method for time series in PAL. We also provided several examples showing how to call the automatic outlier detection procedure and how to read the detection results. As the results show, the automatic method can detect outliers across quite different time series. We hope you enjoyed reading this blog post!

The method will also be included in hana-ml. If you want to learn more about the automatic outlier detection method for time series in SAP HANA Predictive Analysis Library (PAL) and hana-ml, please refer to the following links:

Outlier Detection for Time Series in PAL

Outlier Detection for Time Series in hana-ml (automatic method will be included after 2023 Q3)

 

Other Useful Links:

Outlier Detection using Statistical Tests in Python Machine Learning Client for SAP HANA

Outlier Detection by Clustering using Python Machine Learning Client for SAP HANA

Anomaly Detection in Time-Series using Seasonal Decomposition in Python Machine Learning Client for SAP HANA

Outlier Detection with One-class Classification using Python Machine Learning Client for SAP HANA

Appendix

SAP HANA Connection

import hana_ml
from hana_ml import dataframe
cc = dataframe.ConnectionContext('host', 'port', 'username', 'password')

Functions for Table

def createEmptyTable(table_name, proto, cc):
    with cc.connection.cursor() as cur:
        try:
            joint = []
            for key in proto:
                joint.append(" ".join(['"{:s}"'.format(key), proto[key]]))
            cur.execute('CREATE COLUMN TABLE %s (%s);' %
                        (table_name, ",".join(joint)))
        except Exception:
            print(
                f"\"CREATE TABLE {table_name}\" was unsuccessful. Maybe the table already exists.")

def dropTable(table_name, cc):
    with cc.connection.cursor() as cur:
        try:
            cur.execute(f"DROP TABLE {table_name}")
        except Exception:
            print(f"\"DROP TABLE {table_name}\" was unsuccessful. Maybe the table does not exist yet.")


def createTableFromDataFrame(df, table_name, cc):
    dropTable(table_name, cc)
    dt_ml = dataframe.create_dataframe_from_pandas(cc, df, table_name=table_name, table_structure={"MODEL_CONTENT":"NCLOB"})
    # dt_ml = dataframe.create_dataframe_from_pandas(cc, df, table_name=table_name, table_structure={"COL1":"CLOB"})
    return dt_ml

Function of Calling the PAL Procedure of Outlier Detection for Time Series

def OutlierDetectionForTimeSeries(df, parameters, cc,
            data_table='ZPD_PAL_DATA_TBL',
            parameter_table='ZPD_PAL_PARAMETERS_TBL',
            result_table='ZPD_PAL_RESULT_TBL',
            stats_table='ZPD_PAL_STATS_TBL',
            metri_table='ZPD_PAL_METRI_TBL'):

    # Input table
    createTableFromDataFrame(df, data_table, cc)

    # Result table
    dropTable(result_table, cc)
    createEmptyTable(result_table, {
                     "TIMESTAMP": "INTEGER","RAW_DATA":"DOUBLE","RESIDUAL":"DOUBLE","OUTLIER_SCORE":"DOUBLE","IS_OUTLIER":"INTEGER"}, cc)

    # Metri table
    dropTable(metri_table, cc)
    createEmptyTable(metri_table, {
                     "STAT_NAME": "NVARCHAR(1000)","VALUE":"DOUBLE"}, cc)

    # Stats table
    dropTable(stats_table, cc)
    createEmptyTable(stats_table, {
                     "STAT_NAME": "NVARCHAR(1000)", "STAT_VALUE": "NVARCHAR(1000)"}, cc)

    # Parameter table
    dropTable(parameter_table, cc)
    createEmptyTable(parameter_table, {"PARAM_NAME": "nvarchar(256)", "INT_VALUE": "integer",
                     "DOUBLE_VALUE": "double", "STRING_VALUE": "nvarchar(1000)"}, cc)

    if parameters:
        with cc.connection.cursor() as cur:
            for parName, parValue in parameters.items():

                if isinstance(parValue, str):
                    parValue = f"'{parValue}'"
                    parametersSQL = f"{parValue if isinstance(parValue,int) else 'NULL'}, {parValue if isinstance(parValue,float) else 'NULL'}, { parValue if isinstance(parValue,str) else 'NULL'}"
                    cur.execute(
                    f"INSERT INTO {parameter_table} VALUES ('{parName}', {parametersSQL});")

                elif isinstance(parValue,list):
                    for x in parValue:
                        if isinstance(x, str):
                            x = f"'{x}'"
                        parametersSQL = f"{x if isinstance(x,int) else 'NULL'}, {x if isinstance(x,float) else 'NULL'}, { x if isinstance(x,str) else 'NULL'}"
                        cur.execute(
                        f"INSERT INTO {parameter_table} VALUES ('{parName}', {parametersSQL});")
                else:
                    parametersSQL = f"{parValue if isinstance(parValue,int) else 'NULL'}, {parValue if isinstance(parValue,float) else 'NULL'}, { parValue if isinstance(parValue,str) else 'NULL'}"
                    cur.execute(
                    f"INSERT INTO {parameter_table} VALUES ('{parName}', {parametersSQL});")
                    
    else:
        print("No parameters given; using default values.")

    sql_str = f"\
        do begin \
            lt_data = select * from {data_table}; \
            lt_control = select * from {parameter_table};\
            CALL _SYS_AFL.PAL_OUTLIER_DETECTION_FOR_TIME_SERIES(:lt_data, :lt_control, lt_res, lt_stats, lt_metri); \
            INSERT INTO {result_table} SELECT * FROM :lt_res; \
            INSERT INTO {stats_table} SELECT * FROM :lt_stats;\
            INSERT INTO {metri_table} SELECT * FROM :lt_metri; \
        end;"

    with cc.connection.cursor() as cur:
        cur.execute(sql_str)

    return cc.table(result_table).collect(), cc.table(stats_table).collect(), cc.table(metri_table).collect()

Functions of Plotting Results

def outlier_result_plot(dResults):
    dResults.sort_values(by = list(dResults)[0], inplace = True, ascending = True)
    raw_data = np.array(dResults['RAW_DATA'])
    residual = np.array(dResults['RESIDUAL'])
    outlier_score = np.array(dResults['OUTLIER_SCORE'])
    is_outlier = np.array(dResults['IS_OUTLIER'])
    plt.figure(figsize = (24,4.5))
    plt.subplot(1,4,1)
    plt.plot(raw_data)
    plt.grid()
    plt.title('RAW_DATA')
    plt.subplot(1,4,2)
    plt.plot(residual)
    plt.grid()
    plt.title('RESIDUAL')
    plt.subplot(1,4,3)
    plt.plot(outlier_score)
    plt.grid()
    plt.title('OUTLIER_SCORE')
    plt.subplot(1,4,4)
    plt.plot(is_outlier)
    plt.grid()
    plt.title('IS_OUTLIER')

def outlier_plot(dResults):
    dResults.sort_values(by = list(dResults)[0], inplace = True, ascending = True)
    raw_data = np.array(dResults['RAW_DATA'])
    is_outlier = np.array(dResults['IS_OUTLIER'])
    outlier_idx = np.where(is_outlier == 1)[0]
    plt.plot(raw_data)
    plt.scatter(outlier_idx,raw_data[outlier_idx],color = 'red')
    plt.grid()
    plt.title("series and outlier")
