A time-series is a collection of data points/values ordered by time, often with evenly spaced timestamps. For example, an air-quality monitoring system continuously measures the air quality around it and sends out air-quality-index (AQI) values, so we obtain a time-series of AQIs. In this case, a malfunction of the monitoring system or an unexpected local incident (like a fire accident) can temporarily bring sudden changes to the AQIs, causing the obtained values to appear anomalous. Sometimes detecting anomalous points in a time-series can be as simple as applying statistical tests, yet frequently the task is much harder, since there is no guarantee that anomalous points are directly associated with extreme values. However, statistical tests for anomaly/outlier detection can become applicable to time-series data if appropriate modeling is applied first.
In this blog post, we will focus on the detection of anomalies/outliers in time-series that can be largely explained by a smooth trend together with a single seasonal pattern. Such time-series can usually be modeled well enough by seasonal-trend decomposition, or seasonal decomposition for short. The seasonal decomposition method is provided in SAP HANA Predictive Analysis Library (PAL) and wrapped up in the Python Machine Learning Client for SAP HANA (hana_ml). Basically, in this blog post you will learn:

- how to model a time-series with seasonal decomposition in hana_ml;
- how to detect anomalous points by applying statistical tests to the random component of the decomposition.
The detection of anomalies from a given time-series is usually not an easy task. The natural association with time brings many features to time-series that regular 1D datasets lack, like time-dependency (via lagging), trend, seasonality, holiday effects, etc. Because of this, traditional statistical tests or clustering-based methods for anomaly/outlier detection usually fail on time-series data, since the time information is ignored by their very design. In other words, the applicability of statistical tests at least requires the data values to be time-independent, yet this is often not guaranteed. However, with appropriate modeling, a roughly time-independent series can often be extracted/transformed from the original time-series of interest, in which case statistical tests become applicable. One such technique, also the main interest of this blog post, is seasonal decomposition.
Basically, seasonal decomposition decomposes a given time-series into three components: trend, seasonal and random, where:

- trend is the long-term, slowly varying movement of the series;
- seasonal is the periodic pattern that repeats with a fixed period;
- random is the remainder (residual) left after trend and seasonality are removed.
The relation between the original time-series data and its decomposed components in seasonal decomposition can either be additive or multiplicative. To be formal, we let X be the given time-series, and S/T/R be its seasonal/trend/random component respectively, then

additive: X = T + S + R
multiplicative: X = T × S × R
It should be emphasized that, for multiplicative decomposition, values in the random component are usually centered around 1, where the value of 1 means that the original time-series can be perfectly explained by the multiplication of its trend and seasonal components. In contrast, for additive decomposition, values in the random component are usually centered around 0, where the value of 0 indicates that the given time-series can be perfectly explained by the addition of its trend and seasonal components.
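As a quick illustrative sketch (plain NumPy, independent of hana_ml; all series here are synthetic and purely for illustration), we can build an additive and a multiplicative series and confirm that the residual centers near 0 in the additive case and near 1 in the multiplicative case:

```python
import numpy as np

rng = np.random.default_rng(42)
n, period = 120, 12
t = np.arange(n)

trend = 100 + 0.5 * t                            # linear trend
seasonal = 10 * np.sin(2 * np.pi * t / period)   # yearly pattern
noise = rng.normal(0, 1, n)                      # noise centered at 0

x_add = trend + seasonal + noise                 # additive: X = T + S + R
x_mul = trend * (1 + seasonal / 100) * (1 + noise / 100)  # multiplicative

# Additive residual: subtract trend and seasonality -> centers near 0
resid_add = x_add - trend - seasonal
print(resid_add.mean())

# Multiplicative residual: divide by trend and seasonality -> centers near 1
resid_mul = x_mul / (trend * (1 + seasonal / 100))
print(resid_mul.mean())
```

This mirrors the interpretation above: the random component is whatever is left once trend and seasonality have been accounted for, either by subtraction or by division.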
In this blog post, we focus on anomaly detection for time-series that can largely be modeled by seasonal decomposition. In such cases, anomalous points show up as values with irregularly large deviations in the random component, corresponding to large variations that trend and seasonality cannot explain.
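Before turning to hana_ml, here is a minimal local sketch of the idea, assuming an additive decomposition has already been done: flag any point in the random component whose deviation from the mean exceeds 3 standard deviations (a plain pandas/NumPy analogue of the 3-sigma variance test used later; the column names are chosen to match the PAL output, the data is made up):

```python
import numpy as np
import pandas as pd

# Hypothetical residuals from an additive decomposition (centered near 0),
# with two anomalies injected by hand at positions 30 and 75.
rng = np.random.default_rng(0)
residual = rng.normal(0, 1, 100)
residual[30] += 8
residual[75] -= 9

df = pd.DataFrame({"ID": np.arange(100), "RANDOM": residual})

# 3-sigma rule: a point is out of range if its deviation from the
# mean of the random component exceeds 3 standard deviations.
mu, sigma = df["RANDOM"].mean(), df["RANDOM"].std()
df["IS_OUT_OF_RANGE"] = (df["RANDOM"] - mu).abs() > 3 * sigma

anomalies = df[df["IS_OUT_OF_RANGE"]]
print(anomalies["ID"].tolist())
```

The server-side workflow below follows exactly this pattern, except that the decomposition and the test both run inside SAP HANA.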
import hana_ml
from hana_ml.dataframe import ConnectionContext

cc = ConnectionContext('xx.xxx.xxx.xx', 30x15, 'XXX', 'Xxxxxxxx')  # account info hidden away
ap_tbl_name = 'AIR_PASSENGERS_TBL'
ap_df = cc.table(ap_tbl_name)
ap_data = ap_df.collect()
ap_data
import pandas as pd
import matplotlib.pyplot as plt

plt.plot(pd.to_datetime(ap_data['Month']), ap_data['#Passengers'])
plt.show()
ap_df_wid = ap_df.cast('Month', 'DATE').add_id(id_col='ID', ref_col='Month')
ap_df_wid.collect()
from hana_ml.algorithms.pal.tsa.seasonal_decompose import seasonal_decompose

stat, decomposed = seasonal_decompose(data=ap_df_wid,
                                      key='ID',
                                      endog='#Passengers',
                                      extrapolation=True)
stat.collect()
decomposed.collect()
from hana_ml.algorithms.pal.preprocessing import variance_test

test_res, _ = variance_test(data=decomposed, key='ID', data_col='RANDOM', sigma_num=3)
anomalies_vt_id = cc.sql('SELECT ID FROM ({}) WHERE IS_OUT_OF_RANGE = 1'.format(test_res.select_statement))
anomalies_vt = cc.sql('SELECT * FROM ({}) WHERE ID IN ({})'.format(ap_df_wid.select_statement, anomalies_vt_id.select_statement))
anomalies_vt.collect()
from hana_ml.algorithms.pal.stats import iqr

test_res, _ = iqr(data=decomposed, key='ID', col='RANDOM')
# collect the IDs flagged as out of range by the IQR test
anomalies_iqr = cc.sql('SELECT ID FROM ({}) WHERE IS_OUT_OF_RANGE = 1'.format(test_res.select_statement))
cc.sql('SELECT * FROM ({}) WHERE ID IN (SELECT ID FROM ({}))'.format(decomposed[['ID', 'RANDOM']].select_statement, anomalies_iqr.select_statement)).collect()
import matplotlib.pyplot as plt
import numpy as np

dc = decomposed.collect()
plt.plot(ap_data['Month'], dc['RANDOM'], 'k.-')
oidx = np.array(anomalies_iqr.collect()['ID']) - 1  # IDs are 1-based, Python indices 0-based
plt.plot(ap_data['Month'][oidx], dc.iloc[oidx, 3], 'ro')
plt.show()
import matplotlib.pyplot as plt
import numpy as np

ap_dc = ap_df.collect()
oidx = np.array(anomalies_iqr.collect()['ID']) - 1
plt.plot(ap_dc['Month'], ap_dc['#Passengers'], 'k.-')
plt.plot(ap_dc['Month'][oidx], ap_dc['#Passengers'][oidx], 'ro')
plt.show()
top2_anomalies = cc.sql('SELECT TOP 2 ID, RANDOM FROM ({}) ORDER BY ABS(RANDOM-1) DESC'.format(decomposed.select_statement))
top2_anomalies.collect()
import matplotlib.pyplot as plt
import numpy as np

oidx = np.array(top2_anomalies.collect()['ID']) - 1
plt.plot(ap_dc['Month'], ap_dc['#Passengers'], 'k.-')
plt.plot(ap_dc['Month'][oidx], ap_dc['#Passengers'][oidx], 'ro')
plt.show()
ap_df_log = cc.sql('SELECT "ID", "Month", LN("#Passengers") FROM ({})'.format(ap_df_wid.select_statement))
stats_log, decomposed_log = seasonal_decompose(data=ap_df_log,
                                               key='ID',
                                               endog='LN(#Passengers)',
                                               extrapolation=True)
stats_log.collect()
decomposed_log.head(12).collect()
import pandas as pd
import matplotlib.pyplot as plt

ap_dc = ap_df.collect()
plt.plot(ap_dc['Month'], ap_dc['#Passengers'], 'k.-')
tmsp = pd.to_datetime('1960-03-01')
plt.plot(tmsp, ap_dc[ap_dc['Month'] == tmsp]['#Passengers'], 'ro')
plt.show()
By inspecting data in previous years, we find that the number of passengers usually increases significantly when moving from February to March, followed by a slight drop from March to April. In 1960, however, the increment from February to March is only slight, while from March to April the increment becomes significant. This strongly violates the regular seasonal pattern, and should be the reason why the point is labeled as anomalous. This can also be justified by the following plot, where the numbers of passengers in every March within the time range of the given time-series are marked by red squares.
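A simple way to check this reasoning locally is to compute month-over-month percentage changes per year with pandas. The numbers below are transcribed from the classic AirPassengers dataset (verify them against your own table); the point is only to show how much smaller the February-to-March jump is in 1960 than in earlier years:

```python
import pandas as pd

# Feb/Mar/Apr passenger counts for three years of the AirPassengers data
data = pd.DataFrame({
    "year":       [1958, 1958, 1958, 1959, 1959, 1959, 1960, 1960, 1960],
    "month":      ["Feb", "Mar", "Apr"] * 3,
    "passengers": [318, 362, 348, 342, 406, 396, 391, 419, 461],
})

# Percentage change from the previous month, computed within each year
data["pct_change"] = data.groupby("year")["passengers"].pct_change() * 100
print(data.round(1))
```

The February-to-March jump in 1960 (about 7%) is clearly below those of 1958 and 1959 (roughly 14% and 19%), which is exactly the kind of seasonal-pattern violation that surfaces in the random component.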
test_res_log_iqr, _ = iqr(data=decomposed_log, key='ID', col='RANDOM')
anomalies_log_iqr_id = cc.sql('SELECT ID FROM ({}) WHERE IS_OUT_OF_RANGE = 1'.format(test_res_log_iqr.select_statement))
anomalies_log_iqr = cc.sql('SELECT "Month", "LN(#Passengers)" FROM ({}) WHERE ID IN ({})'.format(ap_df_log.select_statement, anomalies_log_iqr_id.select_statement))
anomalies_log_iqr.collect()
top1_anomaly_log_iqr_id = cc.sql('SELECT TOP 1 ID FROM ({}) ORDER BY ABS(RANDOM) DESC'.format(decomposed_log.select_statement))
top1_anomaly_log_iqr = cc.sql('SELECT "Month", "LN(#Passengers)" FROM ({}) WHERE ID IN ({})'.format(ap_df_log.select_statement, top1_anomaly_log_iqr_id.select_statement))
top1_anomaly_log_iqr.collect()