Time-Series Modeling and Analysis using SAP HANA P...

likun_hou · ‎01-07-2021

In this blog post, we show how to perform time-series modeling and analysis using the SAP HANA Predictive Analysis Library (PAL). Instead of the original SQL interface, we call PAL procedures through the Python machine learning client for SAP HANA (hana_ml). Many of today's users, data analysts in particular, are more comfortable with Python than with SQL. Thanks to hana_ml, we can now call various PAL procedures with ease by writing a few lines of Python code.

Besides Python, we also assume that readers have some basic knowledge of time-series concepts such as trend and seasonality. If not, please refer to the Appendix section of this blog post for a quick overview.

The rest of this blog post is organized as follows. First, we briefly introduce the time-series data of interest and do some basic analysis using a run-sequence plot and seasonal decomposition. Then we analyze the data further with exponential smoothing and ARIMA models. Finally, we summarize the main content of this blog post.

Introduction to the Dataset, with Basic Analysis

The data used for demonstration in this blog post is the "AirPassengers" dataset, which records the number of passengers (in thousands) carried monthly by an airline company between Jan. 1949 and Dec. 1960. This is a classical dataset that was introduced by Box and Jenkins in the renowned textbook Time Series Analysis, Forecasting and Control (3rd edition). We assume that the dataset is stored in a local .csv file (AirPassengers.csv), so it can be loaded into the Python client using the pandas library.

import pandas as pd

air_passengers = pd.read_csv(PATH_TO_LOCAL_CSV,  # replace with the actual path to the local .csv file

                             parse_dates=[0])  # column 0 holds the month; without this, its values are parsed as strings

air_passengers

	Month	#Passengers
0	1949-01-01	112
1	1949-02-01	118
2	1949-03-01	132
3	1949-04-01	129
4	1949-05-01	121
...	...	...
139	1960-08-01	606
140	1960-09-01	508
141	1960-10-01	461
142	1960-11-01	390
143	1960-12-01	432

144 rows × 2 columns

air_passengers.dtypes

Month          datetime64[ns]

#Passengers             int64

dtype: object

So the timestamp column for this time-series is Month, where each month is represented by its first day.

Now we take the first step in analyzing the time-series data: drawing its run-sequence plot.

import matplotlib.pyplot as plt

plt.plot(air_passengers['Month'], air_passengers['#Passengers'])

plt.vlines(x=pd.to_datetime(['{}-01-01 00:00:00'.format(str(1949+i)) for i in range(12)]),

           ymin=100, ymax=620, colors='r', linestyles='dashed')

plt.show()

Every dashed red line in the above figure indicates the start of a new year. The figure shows that the number of airline passengers generally keeps growing year by year, forming an upward trend. In addition, the monthly numbers within each year rise and fall in a similar fashion: low in the winter season and high in the summer season. Furthermore, the magnitude of these local oscillations grows with the upward trend, which is clear evidence of multiplicative seasonality.

Time-series Analysis with Seasonal Decomposition

The observations drawn from the run-sequence plot can be verified by seasonal decomposition. Seasonal decomposition is implemented in PAL by a procedure named SEASONALITY_TEST, which is wrapped by the seasonal_decompose() function in hana_ml. To apply this function to the time-series data, we first need to store the dataset in SAP HANA, which can be done with the following lines of code:

import hana_ml

from hana_ml.dataframe import create_dataframe_from_pandas, ConnectionContext

cc = ConnectionContext('xxxxxxx.xxx.xxx.xxx', 30x15, 'XXXXXX', 'XXXxxx')#account info hidden away

ap_name = 'AIR_PASSENGERS_TBL'

ap_df = create_dataframe_from_pandas(cc, pandas_df = air_passengers,

                                     table_name = ap_name, force = True)

Now 'AIR_PASSENGERS_TBL' is a concrete table that holds the time-series data in SAP HANA, and ap_df is the corresponding hana_ml DataFrame associated with this table in the Python client. The data stored in SAP HANA can also be collected to the Python client, as follows:

ap_df.collect()

	Month	#Passengers
0	1949-01-01	112
1	1949-02-01	118
2	1949-03-01	132
3	1949-04-01	129
4	1949-05-01	121
...	...	...
139	1960-08-01	606
140	1960-09-01	508
141	1960-10-01	461
142	1960-11-01	390
143	1960-12-01	432

144 rows × 2 columns

Column data types of the table can be fetched as follows:

ap_df.dtypes()

[('Month', 'TIMESTAMP', 27, 27, 27, 0), ('#Passengers', 'INT', 10, 10, 10, 0)]

The seasonal decomposition procedure in PAL requires the input time-series data to have a timestamp column of integer type, which the given hana_ml DataFrame lacks. We therefore need to add an integer column that represents the order of the time-series data. This can be done by calling the add_id() function of the hana_ml DataFrame, as follows:

ap_df_wid = ap_df.add_id(id_col="ID", ref_col='Month') # generate the integer ID column based on the order of Month

ap_df_wid.collect()

	ID	Month	#Passengers
0	1	1949-01-01	112
1	2	1949-02-01	118
2	3	1949-03-01	132
3	4	1949-04-01	129
4	5	1949-05-01	121
...	...	...	...
139	140	1960-08-01	606
140	141	1960-09-01	508
141	142	1960-10-01	461
142	143	1960-11-01	390
143	144	1960-12-01	432

144 rows × 3 columns

Now seasonal decomposition can be applied to the data with the added integer timestamp (i.e. ID) column.

from hana_ml.algorithms.pal.tsa.seasonal_decompose import seasonal_decompose

stats, decomposed = seasonal_decompose(data = ap_df_wid, key = 'ID',

                                       endog = '#Passengers')

stats.collect()

	STAT_NAME	STAT_VALUE
0	type	multiplicative
1	period	12
2	acf	0.88758

So a strong multiplicative yearly seasonality is detected, which is consistent with the observations drawn from the run-sequence plot of the time-series.

The corresponding decomposition result can be visualized as follows:

decompose_data = decomposed.collect()

import matplotlib.pyplot as plt

figure, axe = plt.subplots(3)

axe[0].plot(decompose_data.TREND, 'r-.')

axe[0].set_title("Trend Component")

axe[1].plot(decompose_data.SEASONAL, 'b-.')

axe[1].set_title("Seasonal Component")

axe[2].plot(decompose_data.RANDOM, 'k-.')

axe[2].set_title("Residual")

plt.show()

So the multiplicative residuals are all close to 1, implying that the product of the trend and seasonal components gives a time-series nearly the same as the original one. In other words, seasonal decomposition explains the given time-series well.
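To make the relation "original = trend × seasonal × residual" concrete, here is a minimal sketch of classical multiplicative decomposition in plain Python. It is a textbook procedure applied to a synthetic series, not PAL's SEASONALITY_TEST implementation (whose internals may differ), and the function names are hypothetical.

```python
import math

def centered_moving_average(y, period):
    """2 x period centered moving average (even period); None where undefined."""
    half = period // 2
    trend = [None] * len(y)
    for t in range(half, len(y) - half):
        window = y[t - half : t + half + 1]              # period + 1 points
        # The two end points get half weight, so the window spans one full period.
        total = sum(window) - 0.5 * (window[0] + window[-1])
        trend[t] = total / period
    return trend

def decompose_multiplicative(y, period):
    trend = centered_moving_average(y, period)
    # Detrend by division, then average the ratios sharing a seasonal position.
    buckets = [[] for _ in range(period)]
    for t, (value, tr) in enumerate(zip(y, trend)):
        if tr is not None:
            buckets[t % period].append(value / tr)
    raw = [sum(b) / len(b) for b in buckets]
    scale = sum(raw) / period                            # seasonal indices average to 1
    seasonal = [raw[t % period] / scale for t in range(len(y))]
    residual = [None if tr is None else value / (tr * s)
                for value, tr, s in zip(y, trend, seasonal)]
    return trend, seasonal, residual

# Synthetic monthly series (NOT the AirPassengers data): linear trend times a yearly pattern.
series = [(100 + 2 * t) * (1 + 0.2 * math.sin(2 * math.pi * t / 12)) for t in range(72)]
trend, seasonal, residual = decompose_multiplicative(series, period=12)
defined = [r for r in residual if r is not None]
print(min(defined), max(defined))   # both stay close to 1 for this noise-free series
```

The residual is undefined for the first and last six months because the centered moving average needs half a period of data on each side.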

Data Partition

Lastly in this section, we split the time-series into two parts: the first part, from 1949 to 1959, is used for model training; the second part, the data in 1960, is used for testing. This split prepares for the time-series modeling and forecasting in the subsequent sections.

ap_df_train = cc.sql("select * from ({}) where \"Month\" < '1960-01-01'".format(ap_df_wid.select_statement))

ap_df_test = cc.sql("select * from ({}) where \"Month\" >= '1960-01-01'".format(ap_df_wid.select_statement))

We may check the training part of this dataset as follows:

ap_df_train.collect()

	ID	Month	#Passengers
0	1	1949-01-01	112
1	2	1949-02-01	118
2	3	1949-03-01	132
3	4	1949-04-01	129
4	5	1949-05-01	121
...	...	...	...
127	128	1959-08-01	559
128	129	1959-09-01	463
129	130	1959-10-01	407
130	131	1959-11-01	362
131	132	1959-12-01	405

132 rows × 3 columns
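The row counts of the split can be sanity-checked without a database connection. The sketch below rebuilds only the month and ID columns in plain Python (the variable names are hypothetical) and applies the same cutoff as the SQL statements above.

```python
from datetime import date

# Month starts from Jan 1949 to Dec 1960, mirroring the "Month" column.
months = [date(year, month, 1) for year in range(1949, 1961) for month in range(1, 13)]
ids = list(range(1, len(months) + 1))    # plays the role of the ID column from add_id()

cutoff = date(1960, 1, 1)
train_ids = [i for i, m in zip(ids, months) if m < cutoff]
test_ids = [i for i, m in zip(ids, months) if m >= cutoff]

print(len(train_ids), train_ids[-1])   # 132 132
print(len(test_ids), test_ids[0])      # 12 133
```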

Time-series Modeling and Analysis with Exponential Smoothing

Based on our previous analysis, the AirPassengers dataset is a time-series that contains yearly seasonality and a non-flat trend. One model that can handle such a time-series is triple exponential smoothing. In what follows, auto exponential smoothing is adopted to help us determine the optimal parameters of the exponential smoothing model.

from hana_ml.algorithms.pal.tsa.exponential_smoothing import AutoExponentialSmoothing

auto_eps = AutoExponentialSmoothing(model_selection = True,

                                    forecast_model_name = "TESM",

                                    seasonal='multiplicative',

                                    optimizer_time_budget=10,

                                    max_iter=500,

                                    forecast_num=12)

Then, optimal parameters for the triple exponential smoothing model can be determined by feeding it with the training data, illustrated as follows:

auto_eps.fit_predict(data = ap_df_train, key = "ID", endog = '#Passengers')

The optimal parameters of the trained triple exponential smoothing model are contained in the stats_ attribute of auto_eps. This attribute is a hana_ml DataFrame, so it can be collected to the Python client as follows:

auto_eps.stats_.collect()

	STAT_NAME	STAT_VALUE
0	FORECAST_MODEL_NAME	TESM
1	MSE	112.15447756481977
2	NUMBER_OF_ITERATIONS	587
3	SA_NUMBER_OF_ITERATIONS	500
4	NM_NUMBER_OF_ITERATIONS	87
5	NM_EXECUTION_TIME	0.000515
6	SA_STOP_COND	MAX_ITERATION
7	NM_STOP_COND	ERROR_DIFFERENCE
8	ALPHA	0.30708710037019393
9	BETA	0.03405392634387122
10	GAMMA	0.9694460594869154
11	CYCLE	12
12	SEASONAL	Multiplicative
13	DAMPED	false

So auto exponential smoothing also finds that the period of seasonality for the given time-series is 12 (see the CYCLE value in the above stats table), which is consistent with the result of seasonal decomposition.
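To give a feel for what ALPHA, BETA and GAMMA in the stats table control, here is a minimal textbook sketch of multiplicative triple exponential smoothing (Holt-Winters) in plain Python. The initialization and update formulas are the common textbook ones and are an assumption here; PAL's TESM may differ in details. The function name, the seasonal pattern and the series are hypothetical and synthetic.

```python
def holt_winters_multiplicative(y, period, alpha, beta, gamma, horizon):
    """Textbook multiplicative Holt-Winters; needs at least two full seasons of data."""
    first, second = y[:period], y[period:2 * period]
    level = sum(first) / period                          # initial level: mean of season 1
    trend = (sum(second) - sum(first)) / period ** 2     # average per-step change
    seasonal = [value / level for value in first]        # one factor per time index
    for t in range(period, len(y)):
        s_prev = seasonal[t - period]
        prev_level = level
        # alpha smooths the level, beta the trend, gamma the seasonal factors.
        level = alpha * y[t] / s_prev + (1 - alpha) * (prev_level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
        seasonal.append(gamma * y[t] / level + (1 - gamma) * s_prev)
    n = len(y)
    # Forecast h steps ahead with the latest factor for the matching seasonal position.
    return [(level + h * trend) * seasonal[n - period + (h - 1) % period]
            for h in range(1, horizon + 1)]

# Synthetic data (NOT the AirPassengers series): linear level times a yearly pattern.
pattern = [0.90, 0.85, 1.00, 0.98, 1.00, 1.10, 1.25, 1.25, 1.10, 0.95, 0.82, 0.90]
series = [(100 + 2 * t) * pattern[t % 12] for t in range(60)]
forecast = holt_winters_multiplicative(series, period=12, alpha=0.307,
                                       beta=0.034, gamma=0.969, horizon=12)
print([round(v, 1) for v in forecast])
```

The smoothing values passed in are the table's ALPHA, BETA and GAMMA rounded to three decimals, used purely for illustration. On a series with no trend and a fixed seasonal pattern, this recursion leaves the level and the seasonal factors unchanged and therefore reproduces the pattern in its forecasts.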

The fitted & forecasted values are contained in the forecast_ attribute of auto_eps, and since we have assigned forecast_num the value of 12 when initializing the auto exponential smoothing class, the last 12 records in the forecast_ attribute should correspond to the forecasted values based on the trained triple exponential smoothing model. We can examine those values by collecting them to the Python client, illustrated as follows:

auto_eps.forecast_.select(["TIMESTAMP", "VALUE"]).collect().tail(12)

	TIMESTAMP	VALUE
120	133	415.618831
121	134	392.725246
122	135	461.003599
123	136	446.982653
124	137	470.396336
125	138	537.062048
126	139	622.435966
127	140	632.745177
128	141	519.081970
129	142	454.389067
130	143	399.317911
131	144	440.222018

We can join the predicted values with the ground truth for better comparison.

merged = ap_df_test.join(auto_eps.forecast_.select(['TIMESTAMP', 'VALUE']),

                         'TIMESTAMP = ID')

merged.collect()

	ID	Month	#Passengers	TIMESTAMP	VALUE
0	133	1960-01-01	417	133	415.618831
1	134	1960-02-01	391	134	392.725246
2	135	1960-03-01	419	135	461.003599
3	136	1960-04-01	461	136	446.982653
4	137	1960-05-01	472	137	470.396336
5	138	1960-06-01	535	138	537.062048
6	139	1960-07-01	622	139	622.435966
7	140	1960-08-01	606	140	632.745177
8	141	1960-09-01	508	141	519.081970
9	142	1960-10-01	461	142	454.389067
10	143	1960-11-01	390	143	399.317911
11	144	1960-12-01	432	144	440.222018

We see that the predicted values are very close to the ground truth. We can back up this observation quantitatively using accuracy measures. Here we apply three accuracy measures to the prediction result: root-mean-square error (RMSE), mean absolute percentage error (MAPE), and mean absolute deviation (MAD).

	STAT_NAME	STAT_VALUE
0	MAD	10.433921
1	MAPE	0.022462
2	RMSE	15.834904

We see that the two accuracy measures MAD and RMSE are at least one order of magnitude smaller than the original data, so the forecast accuracy is not bad. The small percentage error (i.e. MAPE) is consistent with this conclusion, since it is scale-invariant.
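The three measures can be cross-checked by hand from the merged table above. The sketch below copies the #Passengers and VALUE columns into plain Python lists and uses the usual definitions of the measures (MAD as the mean absolute error); that these match PAL's definitions is an assumption, which the agreement with the table supports.

```python
import math

# Copied from the merged table: ground truth and exponential smoothing forecasts for 1960.
actual = [417, 391, 419, 461, 472, 535, 622, 606, 508, 461, 390, 432]
forecast = [415.618831, 392.725246, 461.003599, 446.982653, 470.396336, 537.062048,
            622.435966, 632.745177, 519.081970, 454.389067, 399.317911, 440.222018]

errors = [a - f for a, f in zip(actual, forecast)]
mad = sum(abs(e) for e in errors) / len(errors)                        # mean absolute deviation
mape = sum(abs(e) / a for e, a in zip(errors, actual)) / len(errors)   # as a fraction, not percent
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))             # root-mean-square error

print(round(mad, 6), round(mape, 6), round(rmse, 6))
```

The printed values should agree with the MAD, MAPE and RMSE rows of the table, up to rounding of the displayed forecasts.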

Time-series Modeling and Analysis with ARIMA

Another model that can handle time-series with trend and seasonality is ARIMA, where differencing and seasonal differencing are first applied to make the time-series stationary, followed by auto-regressive and moving-average (i.e. ARMA) modeling of the differenced (stationarized) time-series. The entire ARIMA modeling process can be carried out automatically by AutoARIMA.
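The two differencing steps mentioned above can be sketched in a few lines of plain Python. The series is synthetic and additive (a linear trend plus a fixed yearly pattern), and the helper name is hypothetical; the point is only to show what each difference removes.

```python
def difference(y, lag=1):
    """Return y[t] - y[t - lag] for every t where both terms exist."""
    return [y[t] - y[t - lag] for t in range(lag, len(y))]

# Synthetic series (NOT the AirPassengers data): linear trend plus a fixed yearly pattern.
pattern = [0, 3, 8, 5, 4, 10, 18, 17, 9, 2, -6, -1]
series = [2 * t + pattern[t % 12] for t in range(48)]

seasonal_diff = difference(series, lag=12)   # removes the repeating yearly pattern
both = difference(seasonal_diff, lag=1)      # removes what is left of the trend

print(set(seasonal_diff))   # {24}: trend slope 2 times lag 12
print(set(both))            # {0}
```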

from hana_ml.algorithms.pal.tsa.auto_arima import AutoARIMA

auto_arima = AutoARIMA()

auto_arima.fit(data = ap_df_train, key = 'ID',

               endog = '#Passengers')

Then, the optimal parameters of the ARIMA model learned from the training data can be accessed from the model_ attribute of auto_arima, illustrated as follows:

auto_arima.model_.collect()

	KEY	VALUE
0	p	1
1	AR	-0.571616
2	d	1
3	q	1
4	MA	0.394036
5	s	12
6	P	1
7	SAR	0.960813
8	D	0
9	Q	0
10	SMA
11	sigma^2	106.632
12	log-likelihood	-507.15
13	AIC	1022.3
14	AICc	1022.61
15	BIC	1033.8
16	dy(n-p:n-1)_aux
17	dy_aux	0
18	dy_0	6;14;-3;-8;14;13;0;-12;-17;-15;14;-3;11;15;-6;...
19	x(n-d:n-1)_aux
20	y(n-d:n-1)_aux	0
21	y(n-d:n-1)_0	405
22	epsilon(n-q:n-1)_aux	0
23	epsilon(n-q:n-1)_0	18.5076
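Read in backshift notation (with B y_t = y_{t-1}), the orders in the table above (p = 1, d = 1, q = 1, P = 1, D = 0, Q = 0, s = 12) correspond to the model form below. The sign convention for the AR and MA polynomials is the common textbook one and is an assumption here; PAL's own convention should be checked in its documentation before plugging in the AR, MA and SAR values from the table.

```latex
(1 - \phi_1 B)\,(1 - \Phi_1 B^{12})\,(1 - B)\, y_t = (1 + \theta_1 B)\, \varepsilon_t
```

Here \phi_1 corresponds to the AR row, \Phi_1 to the SAR row, \theta_1 to the MA row, and the variance of \varepsilon_t to the sigma^2 row. The single factor (1 - B) is the only differencing in the model, since D = 0.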

This results in an ARIMA(1,1,1)(1,0,0)12 model. However, this model is unlikely to be effective enough for modeling the given time-series, since it contains only a single (non-seasonal) difference. We have already observed that the magnitude of the seasonal oscillations of this time-series grows proportionally with its upward trend, so after a single difference is applied, the differenced values should be larger in magnitude in the later phase than in the early phase, which violates the stationarity assumption of ARMA modeling. This can be checked with the forecast accuracy measures of the trained ARIMA model, illustrated as follows:

forecast_res = auto_arima.predict(forecast_method='formula_forecast',

                                  forecast_length=12)

fc_res_reduced = cc.sql('SELECT TIMESTAMP, FORECAST from ({})'.format(forecast_res.select_statement))

arima_merged = ap_df_test.join(fc_res_reduced,

                               'TIMESTAMP + {} = ID'.format(ap_df_test.collect()['ID'][0]))

arima_merged.collect()

	ID	Month	#Passengers	TIMESTAMP	FORECAST
0	133	1960-01-01	417	0	424.640706
1	134	1960-02-01	391	1	408.751100
2	135	1960-03-01	419	2	469.439996
3	136	1960-04-01	461	3	460.290951
4	137	1960-05-01	472	4	483.088042
5	138	1960-06-01	535	5	533.200322
6	139	1960-07-01	622	6	606.136366
7	140	1960-08-01	606	7	616.754322
8	141	1960-09-01	508	8	524.488257
9	142	1960-10-01	461	9	470.698744
10	143	1960-11-01	390	10	427.453005
11	144	1960-12-01	432	11	468.773196

from hana_ml.algorithms.pal.tsa.accuracy_measure import accuracy_measure

acc_measures = accuracy_measure(data = arima_merged.select(['#Passengers', 'FORECAST']),

                                evaluation_metric = ['rmse', 'mape', 'mad'])

acc_measures.collect()

	STAT_NAME	STAT_VALUE
0	MAD	18.038311
1	MAPE	0.040867
2	RMSE	23.332016

So all three chosen accuracy measures are worse than those produced by exponential smoothing modeling.

From our previous analysis, we strongly suspect that the multiplicative seasonality of the time-series data is the main reason that ARIMA modeling is difficult. This difficulty could possibly be resolved by taking the logarithm of the time-series, which transforms multiplicative seasonality into additive seasonality.

ap_df_train_log = cc.sql('SELECT "ID", "Month", LN("#Passengers") FROM ({})'.format(ap_df_train.select_statement)) # LN is logarithm transformation

ap_df_train_log.collect()

	ID	Month	LN(#Passengers)
0	1	1949-01-01	4.718499
1	2	1949-02-01	4.770685
2	3	1949-03-01	4.882802
3	4	1949-04-01	4.859812
4	5	1949-05-01	4.795791
...	...	...	...
127	128	1959-08-01	6.326149
128	129	1959-09-01	6.137727
129	130	1959-10-01	6.008813
130	131	1959-11-01	5.891644
131	132	1959-12-01	6.003887

132 rows × 3 columns

The result of the seasonality test indicates that the log-transformed time-series has additive yearly seasonality, illustrated as follows:

from hana_ml.algorithms.pal.tsa.seasonal_decompose import seasonal_decompose

stats, _ = seasonal_decompose(data = ap_df_train_log,

                              key = 'ID',

                              endog = 'LN(#Passengers)')

stats.collect()

	STAT_NAME	STAT_VALUE
0	type	additive
1	period	12
2	acf	0.877358
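Why the logarithm has this effect can be seen on a small synthetic example: since log(trend × seasonal) = log(trend) + log(seasonal), a seasonal swing that scales with the level becomes a swing of constant size. The sketch below uses a made-up series with geometric growth and a hypothetical yearly pattern, not the AirPassengers data.

```python
import math

# Synthetic multiplicative series: 1% growth per month times a fixed yearly pattern.
pattern = [0.90, 0.85, 1.00, 0.98, 1.00, 1.10, 1.25, 1.25, 1.10, 0.95, 0.82, 0.90]
series = [100 * 1.01 ** t * pattern[t % 12] for t in range(48)]
logged = [math.log(v) for v in series]

def yearly_swing(values, period=12):
    """Max minus min within each consecutive block of `period` values."""
    return [max(values[i:i + period]) - min(values[i:i + period])
            for i in range(0, len(values), period)]

print(yearly_swing(series))   # grows from one year to the next
print(yearly_swing(logged))   # the same in every year (up to floating-point error)
```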

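We can re-fit the ARIMA model using the transformed training data, that is, by calling auto_arima.fit() again with ap_df_train_log as data and 'LN(#Passengers)' as endog. The parameters of the re-fitted model are: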

	KEY	VALUE
0	p	1
1	AR	-0.306761
2	d	1
3	q	0
4	MA
5	s	12
6	P	2
7	SAR	0.531414;0.43327
8	D	0
9	Q	0
10	SMA
11	sigma^2	0.00140498
12	log-likelihood	229.076
13	AIC	-450.152
14	AICc	-449.835
15	BIC	-438.651
16	dy(n-p:n-1)_aux
17	dy_aux	0
18	dy_0	0.0521857;0.112117;-0.0229895;-0.0640218;0.109...
19	x(n-d:n-1)_aux
20	y(n-d:n-1)_aux	0
21	y(n-d:n-1)_0	6.00388
22	epsilon(n-q:n-1)_aux

The re-fitting process results in an ARIMA(1,1,0)(2,0,0)12 model. Now we make a 12-step forecast using the trained model, and compare the (exponentiated) forecasted values with the ground truth.

forecast_res2 = auto_arima.predict(forecast_method='formula_forecast',

                                   forecast_length=12)

fc_res2_reduced = cc.sql('select TIMESTAMP, EXP(FORECAST) AS FORECAST from ({})'.format(forecast_res2.select_statement))

arima_merged2 = ap_df_test.join(fc_res2_reduced,

                              'TIMESTAMP + {} = ID'.format(ap_df_test.collect()['ID'][0]))

arima_merged2.collect()

	ID	Month	#Passengers	TIMESTAMP	FORECAST
0	133	1960-01-01	417	0	418.275306
1	134	1960-02-01	391	1	396.366024
2	135	1960-03-01	419	2	458.930102
3	136	1960-04-01	461	3	445.316686
4	137	1960-05-01	472	4	467.906297
5	138	1960-06-01	535	5	538.461284
6	139	1960-07-01	622	6	614.319460
7	140	1960-08-01	606	7	628.451952
8	141	1960-09-01	508	8	516.175951
9	142	1960-10-01	461	9	457.957321
10	143	1960-11-01	390	10	403.803691
11	144	1960-12-01	432	11	444.414850

So the forecast values (after exponential transformation) are close to the ground truth, which indicates that the ARIMA model trained on the transformed data explains the time-series very well. This can also be confirmed by forecast accuracy measures, illustrated as follows:

acc_measures2 = accuracy_measure(data = arima_merged2.select(['#Passengers', 'FORECAST']), 

                                 evaluation_metric = ['rmse', 'mape', 'mad'])

acc_measures2.collect()

	STAT_NAME	STAT_VALUE
0	MAD	11.448283
1	MAPE	0.024789
2	RMSE	15.501060

The result is very close to that of (auto) exponential smoothing, so logarithm transformation plus ARIMA modeling results in a good explanation of the time-series data.

Summary

In this blog post, we have shown readers how to apply procedures in the SAP HANA Predictive Analysis Library (PAL) to do time-series modeling and analysis through the Python machine learning client for SAP HANA (hana_ml). Some useful take-away messages are:

Drawing the run-sequence plot for the time-series to get a general comprehension of it

Applying seasonal decomposition to the time-series to get its type of seasonality and seasonal period

Choosing appropriate methods for modeling the time-series based on the characteristics drawn from its run-sequence plot as well as the seasonal decomposition result

Before training the model, some model parameters could be specified if they are easily obtainable, and appropriate transformation(like logarithm transformation) could be applied to the training data if necessary

Evaluating the quality of the trained model using forecast accuracy measures

Appendix

Level, Trend, Seasonality, and Noise

Level: the average value of time-series

Trend: the general shape of the time-series, where any local/short-term fluctuations are ignored(smoothed/averaged out)

Seasonality: fluctuation patterns that recur regularly with a fixed period in the time-series

Noise: irregular and unexplained variations in time-series

Type of Seasonality

Seasonality can be divided into different categories based on the length of its period, like daily, weekly, monthly, quarterly and yearly seasonality.
A time-series can have multiple seasonalities with different periods. Take the hourly electricity consumption of a city located in a temperate zone, for example: it may show daily, weekly and yearly seasonality at the same time.

Seasonality can also be categorized by its relation to the trend, as additive or multiplicative.
Taking quarterly electricity consumption as an example again, assume that total consumption is higher in summer than in spring; this is a typical case of yearly seasonality (repeated once every year). If the electricity consumption is always 1 million units higher in summer than in spring, then the seasonality is additive; if it is always 10% higher in summer than in spring, then the seasonality is multiplicative.

Stationarity

Another important concept in time-series analysis is stationarity. Generally speaking, a time-series is called stationary if its statistical properties, such as mean, variance and autocorrelation, do not change over time. In this sense, a time-series with a non-flat trend or seasonality is clearly non-stationary, since trend and seasonality make its values depend on time.
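A rough, informal way to see this is to compare summary statistics over different stretches of a series. The sketch below does so for a synthetic series with a linear trend plus bounded noise, before and after first-order differencing. It is only an illustration with hypothetical names, not a formal stationarity test.

```python
import random
from statistics import mean, pvariance

def half_stats(values):
    """(mean, variance) of the first half and of the second half of a series."""
    mid = len(values) // 2
    first, second = values[:mid], values[mid:]
    return (mean(first), pvariance(first)), (mean(second), pvariance(second))

rng = random.Random(0)                                      # fixed seed for repeatability
series = [3 * t + rng.uniform(-1, 1) for t in range(240)]   # synthetic: trend plus noise
diffed = [series[t] - series[t - 1] for t in range(1, len(series))]

print(half_stats(series))   # the two means are far apart: the trend shifts the level
print(half_stats(diffed))   # the two means are both close to the slope, 3
```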