In this blog post, we will show users how to perform time-series modeling and analysis using SAP HANA Predictive Analysis Library(PAL). Different from the original SQL interface, here we call PAL procedures through the Python machine learning client for SAP HANA(hana_ml). Python is often much more welcomed for today's users that are most familiar with Python, especially data analysts. Thanks to hana_ml, now by writting lines of Python code, we can call various PAL procedures with easy.
Besides Python, we also assume that readers of this blog post have some basic knowledge on time-series like trend and seasonality. If not, please refer to the Appendix section of this blog post for a quick comprehension.
The rest of this blog post is organized as follows: firstly we have a brief introduction to the time-series data of interest, with some basic analysis using run-sequence plot and seasonal decomposition, then the time-series data are further analyzed by exponential smoothing and ARIMA models, finally the main content of this blog post is summarized.
The data used for demonstration in this blog post is the "AirPassengers" dataset, which reflects the number of passengers(in thousands) transmitted monthly by an airline company between Jan. 1949 and Dec. 1960. This is a classical dataset that was introduced by Box and Jenkins in the reknowned textbook Time Series Analysis, Forecasting and Control(3rd edition). We assume that the dataset is stored in a local .csv file(AirPassengers.csv), and it can be loaded to the Python client using the pandas libary.
import pandas as pd air_passengers = pd.read_csv(PATH_TO_LOCAL_CSV,#replacing the actual path the the local .csv file parse_dates=[0])#the 0th column contains month info, if not specified, values in this columns shall be parse as strings
air_passengers
Month | #Passengers | |
0 | 1949-01-01 | 112 |
1 | 1949-02-01 | 118 |
2 | 1949-03-01 | 132 |
3 | 1949-04-01 | 129 |
4 | 1949-05-01 | 121 |
... | ... | ... |
139 | 1960-08-01 | 606 |
140 | 1960-09-01 | 508 |
141 | 1960-10-01 | 461 |
142 | 1960-11-01 | 390 |
143 | 1960-12-01 | 432 |
air_passengers.dtypes
Month datetime64[ns]
#Passengers int64
dtype: object
import matplotlib.pyplot as plt plt.plot(air_passengers['Month'], air_passengers['#Passengers']) plt.vlines(x=pd.to_datetime(['{}-01-01 00:00:00'.format(str(1949+i)) for i in range(12)]), ymin=100, ymax=620, colors='r', linestyles='dashed') plt.show()
Every dashed red line in the above figure indicates the start of a new year. We can see from this figure that the number of airline passengers keeps growing year-by-year in general, forming an upward trend. Besides, in each year the monthly numbers have ups-and-downs in a similar fashion: low in winter season and high in summer season. Furthermore, the magnitude of local oscillations grows with the upward trend, which is a clear evidence for multiplicative seasonality.
import hana_ml from hana_ml.dataframe import create_dataframe_from_pandas, ConnectionContext cc = ConnectionContext('xxxxxxx.xxx.xxx.xxx', 30x15, 'XXXXXX', 'XXXxxx')#account info hidden away ap_name = 'AIR_PASSENGERS_TBL' ap_df = create_dataframe_from_pandas(cc, pandas_df = air_passengers, table_name = ap_name, force = True)
Now 'AIR_PASSENGERS_TBL' is a concrete table that holds the time-series data in SAP HANA, and ap_df is the corresponding hana_ml DataFrame that are associated with this table in the Python client. The data stored in SAP HANA can also be collected to the Python client, illustrated as follows:
ap_df.collect()
Month | #Passengers | |
0 | 1949-01-01 | 112 |
1 | 1949-02-01 | 118 |
2 | 1949-03-01 | 132 |
3 | 1949-04-01 | 129 |
4 | 1949-05-01 | 121 |
... | ... | ... |
139 | 1960-08-01 | 606 |
140 | 1960-09-01 | 508 |
141 | 1960-10-01 | 461 |
142 | 1960-11-01 | 390 |
143 | 1960-12-01 | 432 |
144 rows × 2 columns
Column data types of the table can be fetched as follows:
ap_df.dtypes()
[('Month', 'TIMESTAMP', 27, 27, 27, 0), ('#Passengers', 'INT', 10, 10, 10, 0)]
Since the seasonal decomposition procedure in PAL requires the input time-series data to have a timestamp column of integer type, which is inconsistent with the given hana_ml DataFrame, so we need to add an additional integer columns for representing the order of the time-series data. This could be be realized calling the add_id() function of hana_ml DataFrame, illustrated as follows:
ap_df_wid = ap_df.add_id(id_col="ID", ref_col='Month') # generate the integer ID column based on the order of Month ap_df_wid.collect()
ID | Month | #Passengers | |
0 | 1 | 1949-01-01 | 112 |
1 | 2 | 1949-02-01 | 118 |
2 | 3 | 1949-03-01 | 132 |
3 | 4 | 1949-04-01 | 129 |
4 | 5 | 1949-05-01 | 121 |
... | ... | ... | ... |
139 | 140 | 1960-08-01 | 606 |
140 | 141 | 1960-09-01 | 508 |
141 | 142 | 1960-10-01 | 461 |
142 | 143 | 1960-11-01 | 390 |
143 | 144 | 1960-12-01 | 432 |
from hana_ml.algorithms.pal.tsa.seasonal_decompose import seasonal_decompose stats, decomposed = seasonal_decompose(data = ap_df_wid, key = 'ID', endog = '#Passengers') stats.collect()
STAT_NAME | STAT_VALUE | |
0 | type | multiplicative |
1 | period | 12 |
2 | acf | 0.88758 |
decompose_data = decomposed.collect() import matplotlib.pyplot as plt figure, axe = plt.subplots(3) axe[0].plot(decompose_data.TREND, 'r-.') axe[0].set_title("Trend Component") axe[1].plot(decompose_data.SEASONAL, 'b-.') axe[1].set_title("Seasonal Component") axe[2].plot(decompose_data.RANDOM, 'k-.') axe[2].set_title("Residual") plt.show()
So the multiplicative residuals are all close to 1, implying that the multiplication of trend and seasonal components gives a time-series nearly the same as the original one, so seasonal decomposition can well explain the given time-series data.
Lastly in this section, we split the time-series into two parts, where the 1st part is from 1949 to 1959, used for model training; and the 2nd part is the data in 1960, used for testing. This step is for time-series modeling and forecasts in the subsequent sections.
ap_df_train = cc.sql("select * from ({}) where \"Month\" < '1960-01-01'".format(ap_df_wid.select_statement)) ap_df_test = cc.sql("select * from ({}) where \"Month\" >= '1960-01-01'".format(ap_df_wid.select_statement))
We may check the training part of this dataset as follows:
ap_df_train.collect()
ID | Month | #Passengers | |
0 | 1 | 1949-01-01 | 112 |
1 | 2 | 1949-02-01 | 118 |
2 | 3 | 1949-03-01 | 132 |
3 | 4 | 1949-04-01 | 129 |
4 | 5 | 1949-05-01 | 121 |
... | ... | ... | ... |
127 | 128 | 1959-08-01 | 559 |
128 | 129 | 1959-09-01 | 463 |
129 | 130 | 1959-10-01 | 407 |
130 | 131 | 1959-11-01 | 362 |
131 | 132 | 1959-12-01 | 405 |
from hana_ml.algorithms.pal.tsa.exponential_smoothing import AutoExponentialSmoothing auto_eps = AutoExponentialSmoothing(model_selection = True, forecast_model_name = "TESM", seasonal='multiplicative', optimizer_time_budget=10, max_iter=500, forecast_num=12)
auto_eps.fit_predict(data = ap_df_train, key = "ID", endog = '#Passengers')
auto_eps.stats_.collect()
STAT_NAME | STAT_VALUE | |
0 | FORECAST_MODEL_NAME | TESM |
1 | MSE | 112.15447756481977 |
2 | NUMBER_OF_ITERATIONS | 587 |
3 | SA_NUMBER_OF_ITERATIONS | 500 |
4 | NM_NUMBER_OF_ITERATIONS | 87 |
5 | NM_EXECUTION_TIME | 0.000515 |
6 | SA_STOP_COND | MAX_ITERATION |
7 | NM_STOP_COND | ERROR_DIFFERENCE |
8 | ALPHA | 0.30708710037019393 |
9 | BETA | 0.03405392634387122 |
10 | GAMMA | 0.9694460594869154 |
11 | CYCLE | 12 |
12 | SEASONAL | Multiplicative |
13 | DAMPED | false |
So auto exponential smoothing also finds out that the period of seasonality for the given time-series is 12(see the CYCLE value in the above stats table), which is consistent the result of seasonal decomposition.
The fitted & forecasted values are contained in the forecast_ attribute of auto_eps, and since we have assigned forecast_num the value of 12 when initializing the auto exponential smoothing class, the last 12 records in the forecast_ attribute should correspond to the forecasted values based on the trained triple exponential smoothing model. We can examine those values by collecting them to the Python client, illustrated as follows:
auto_eps.forecast_.select(["TIMESTAMP", "VALUE"]).collect().tail(12)
TIMESTAMP | VALUE | |
120 | 133 | 415.618831 |
121 | 134 | 392.725246 |
122 | 135 | 461.003599 |
123 | 136 | 446.982653 |
124 | 137 | 470.396336 |
125 | 138 | 537.062048 |
126 | 139 | 622.435966 |
127 | 140 | 632.745177 |
128 | 141 | 519.081970 |
129 | 142 | 454.389067 |
130 | 143 | 399.317911 |
131 | 144 | 440.222018 |
merged = ap_df_test.join(auto_eps.forecast_.select(['TIMESTAMP', 'VALUE']), 'TIMESTAMP = ID') merged.collect()
ID | Month | #Passengers | TIMESTAMP | VALUE | |
0 | 133 | 1960-01-01 | 417 | 133 | 415.618831 |
1 | 134 | 1960-02-01 | 391 | 134 | 392.725246 |
2 | 135 | 1960-03-01 | 419 | 135 | 461.003599 |
3 | 136 | 1960-04-01 | 461 | 136 | 446.982653 |
4 | 137 | 1960-05-01 | 472 | 137 | 470.396336 |
5 | 138 | 1960-06-01 | 535 | 138 | 537.062048 |
6 | 139 | 1960-07-01 | 622 | 139 | 622.435966 |
7 | 140 | 1960-08-01 | 606 | 140 | 632.745177 |
8 | 141 | 1960-09-01 | 508 | 141 | 519.081970 |
9 | 142 | 1960-10-01 | 461 | 142 | 454.389067 |
10 | 143 | 1960-11-01 | 390 | 143 | 399.317911 |
11 | 144 | 1960-12-01 | 432 | 144 | 440.222018 |
STAT_NAME | STAT_VALUE | |
0 | MAD | 10.433921 |
1 | MAPE | 0.022462 |
2 | RMSE | 15.834904 |
from hana_ml.algorithms.pal.tsa.auto_arima import AutoARIMA auto_arima = AutoARIMA() auto_arima.fit(data = ap_df_train, key = 'ID', endog = '#Passengers')
auto_arima.model_.collect()
KEY | VALUE | |
0 | p | 1 |
1 | AR | -0.571616 |
2 | d | 1 |
3 | q | 1 |
4 | MA | 0.394036 |
5 | s | 12 |
6 | P | 1 |
7 | SAR | 0.960813 |
8 | D | 0 |
9 | Q | 0 |
10 | SMA | |
11 | sigma^2 | 106.632 |
12 | log-likelihood | -507.15 |
13 | AIC | 1022.3 |
14 | AICc | 1022.61 |
15 | BIC | 1033.8 |
16 | dy(n-p:n-1)_aux | |
17 | dy_aux | 0 |
18 | dy_0 | 6;14;-3;-8;14;13;0;-12;-17;-15;14;-3;11;15;-6;... |
19 | x(n-d:n-1)_aux | |
20 | y(n-d:n-1)_aux | 0 |
21 | y(n-d:n-1)_0 | 405 |
22 | epsilon(n-q:n-1)_aux | 0 |
23 | epsilon(n-q:n-1)_0 | 18.5076 |
forecast_res = auto_arima.predict(forecast_method='formula_forecast', forecast_length=12) fc_res_reduced = cc.sql('SELECT TIMESTAMP, FORECAST from ({})'.format(forecast_res.select_statement)) arima_merged = ap_df_test.join(fc_res_reduced, 'TIMESTAMP + {} = ID'.format(ap_df_test.collect()['ID'][0])) arima_merged.collect()
ID | Month | #Passengers | TIMESTAMP | FORECAST | |
0 | 133 | 1960-01-01 | 417 | 0 | 424.640706 |
1 | 134 | 1960-02-01 | 391 | 1 | 408.751100 |
2 | 135 | 1960-03-01 | 419 | 2 | 469.439996 |
3 | 136 | 1960-04-01 | 461 | 3 | 460.290951 |
4 | 137 | 1960-05-01 | 472 | 4 | 483.088042 |
5 | 138 | 1960-06-01 | 535 | 5 | 533.200322 |
6 | 139 | 1960-07-01 | 622 | 6 | 606.136366 |
7 | 140 | 1960-08-01 | 606 | 7 | 616.754322 |
8 | 141 | 1960-09-01 | 508 | 8 | 524.488257 |
9 | 142 | 1960-10-01 | 461 | 9 | 470.698744 |
10 | 143 | 1960-11-01 | 390 | 10 | 427.453005 |
11 | 144 | 1960-12-01 | 432 | 11 | 468.773196 |
from hana_ml.algorithms.pal.tsa.accuracy_measure import accuracy_measure acc_measures = accuracy_measure(data = arima_merged.select(['#Passengers', 'FORECAST']), evaluation_metric = ['rmse', 'mape', 'mad']) acc_measures.collect()
STAT_NAME | STAT_VALUE | |
0 | MAD | 18.038311 |
1 | MAPE | 0.040867 |
2 | RMSE | 23.332016 |
ap_df_train_log = cc.sql('SELECT "ID", "Month", LN("#Passengers") FROM ({})'.format(ap_df_train.select_statement)) # LN is logarithm transformation ap_df_train_log.collect()
ID | Month | LN(#Passengers) | |
0 | 1 | 1949-01-01 | 4.718499 |
1 | 2 | 1949-02-01 | 4.770685 |
2 | 3 | 1949-03-01 | 4.882802 |
3 | 4 | 1949-04-01 | 4.859812 |
4 | 5 | 1949-05-01 | 4.795791 |
... | ... | ... | ... |
127 | 128 | 1959-08-01 | 6.326149 |
128 | 129 | 1959-09-01 | 6.137727 |
129 | 130 | 1959-10-01 | 6.008813 |
130 | 131 | 1959-11-01 | 5.891644 |
132 | 1959-12-01 | 6.003887 |
from hana_ml.algorithms.pal.tsa.seasonal_decompose import seasonal_decompose stats, _ = seasonal_decompose(data = ap_df_train_log, key = 'ID', endog = 'LN(#Passengers)') stats.collect()
STAT_NAME | STAT_VALUE | |
0 | type | additive |
1 | period | 12 |
2 | acf | 0.877358 |
KEY | VALUE | |
0 | p | 1 |
1 | AR | -0.306761 |
2 | d | 1 |
3 | q | 0 |
4 | MA | |
5 | s | 12 |
6 | P | 2 |
7 | SAR | 0.531414;0.43327 |
8 | D | 0 |
9 | Q | 0 |
10 | SMA | |
11 | sigma^2 | 0.00140498 |
12 | log-likelihood | 229.076 |
13 | AIC | -450.152 |
14 | AICc | -449.835 |
15 | BIC | -438.651 |
16 | dy(n-p:n-1)_aux | |
17 | dy_aux | 0 |
18 | dy_0 | 0.0521857;0.112117;-0.0229895;-0.0640218;0.109... |
19 | x(n-d:n-1)_aux | |
20 | y(n-d:n-1)_aux | 0 |
21 | y(n-d:n-1)_0 | 6.00388 |
22 | epsilon(n-q:n-1)_aux |
forecast_res2 = auto_arima.predict(forecast_method='formula_forecast', forecast_length=12) fc_res2_reduced = cc.sql('select TIMESTAMP, EXP(FORECAST) AS FORECAST from ({})'.format(forecast_res2.select_statement)) arima_merged2 = ap_df_test.join(fc_res2_reduced, 'TIMESTAMP + {} = ID'.format(ap_df_test.collect()['ID'][0])) arima_merged2.collect()
ID | Month | #Passengers | TIMESTAMP | FORECAST | |
0 | 133 | 1960-01-01 | 417 | 0 | 418.275306 |
1 | 134 | 1960-02-01 | 391 | 1 | 396.366024 |
2 | 135 | 1960-03-01 | 419 | 2 | 458.930102 |
3 | 136 | 1960-04-01 | 461 | 3 | 445.316686 |
4 | 137 | 1960-05-01 | 472 | 4 | 467.906297 |
5 | 138 | 1960-06-01 | 535 | 5 | 538.461284 |
6 | 139 | 1960-07-01 | 622 | 6 | 614.319460 |
7 | 140 | 1960-08-01 | 606 | 7 | 628.451952 |
8 | 141 | 1960-09-01 | 508 | 8 | 516.175951 |
9 | 142 | 1960-10-01 | 461 | 9 | 457.957321 |
10 | 143 | 1960-11-01 | 390 | 10 | 403.803691 |
11 | 144 | 1960-12-01 | 432 | 11 | 444.414850 |
acc_measures2 = accuracy_measure(data = arima_merged2.select(['#Passengers', 'FORECAST']), evaluation_metric = ['rmse', 'mape', 'mad']) acc_measures2.collect()
STAT_NAME | STAT_VALUE | |
0 | MAD | 11.448283 |
1 | MAPE | 0.024789 |
2 | RMSE | 15.501060 |
daily, weekly, monthly, quarterly and yearly seasonality.
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.
User | Count |
---|---|
40 | |
25 | |
17 | |
13 | |
8 | |
7 | |
7 | |
7 | |
6 | |
6 |