Skip to Content
Technical Articles
Author's profile photo Likun Hou

Time-Series Modeling and Analysis using SAP HANA Predictive Analysis Library(PAL) through Python Machine Learning Client for SAP HANA

In this blog post, we will show users how to perform time-series modeling and analysis using SAP HANA Predictive Analysis Library(PAL). Different from the original SQL interface, here we call PAL procedures through the Python machine learning client for SAP HANA(hana_ml). Python is often much more welcomed for today’s users that are most familier with Python, especially data analysts. Thanks to hana_ml, now by wrtting lines of Python code, we can call various PAL procedures with easy.

Besides Python, we also assume that readers of this blog post have some basic knowledge on time-series like trend and seasonality. If not, please refer to the Appendix section of this blog post for a quick comprehension.

The rest of this blog post is organized as follows: firstly we have a brief introduction to the time-series data of interest, with some basic analysis using run-sequence plot and seasonal decomposition, then the time-series data are further analyzed by exponential smoothing and ARIMA models, finally the main content of this blog post is summarized.

 

Introduction to the Dataset, with Basic Analysis

The data used for demonstration in this blog post is the “AirPassengers” dataset, which reflects the number of passengers(in thousands) transmitted monthly by an airline company between Jan. 1949 and Dec. 1960. This is a classical dataset that was introduced by Box and Jenkins in the reknowned textbook Time Series Analysis, Forecasting and Control(3rd edition). We assume that the dataset is stored in a local .csv file(AirPassengers.csv), and it can be loaded to the Python client using the pandas libary.

import pandas as pd
air_passengers = pd.read_csv(PATH_TO_LOCAL_CSV,#replacing the actual path the the local .csv file
                             parse_dates=[0])#the 0th column contains month info, if not specified, values in this columns shall be parse as strings
air_passengers
Month #Passengers
0 1949-01-01 112
1 1949-02-01 118
2 1949-03-01 132
3 1949-04-01 129
4 1949-05-01 121
139 1960-08-01 606
140 1960-09-01 508
141 1960-10-01 461
142 1960-11-01 390
143 1960-12-01 432

144 rows × 2 columns

air_passengers.dtypes
Month          datetime64[ns]
#Passengers             int64
dtype: object

So the timestamp column for this time-series is Month, where each month is represented by the 1st day of it.

Now we take the first step for analyzing the time-series data — drawing its run-sequence plot.

import matplotlib.pyplot as plt
plt.plot(air_passengers['Month'], air_passengers['#Passengers'])
plt.vlines(x=pd.to_datetime(['{}-01-01 00:00:00'.format(str(1949+i)) for i in range(12)]),
           ymin=100, ymax=620, colors='r', linestyles='dashed')
plt.show()

Every dashed red line in the above figure indicates the start of a new year. We can see from this figure that the number of airline passengers keeps growing year-by-year in general, forming an upward trend. Besides, in each year the monthly numbers have ups-and-downs in a similar fashion: low in winter season and high in summer season. Furthermore, the magnitude of local oscillations grows with the upward trend, which is a clear evidence for multiplicative seasonality.

Time-series Analysis with Seasonal Decomposition

The justifications drawn from the run-sequence plot of the time-series data can be verified by seasonal decomposition. Seasonal decomposition is implemented in PAL by a procedure named SEASONALITY_TEST, and the procedure is wrapped by a function called seasonal_decompose() in hana_ml. In order to apply the seasonal decomposition function in hana_ml to the time-series data, we need to store the dataset in SAP HANA, which can be realized by the following lines of code:

import hana_ml
from hana_ml.dataframe import create_dataframe_from_pandas, ConnectionContext
cc = ConnectionContext('xxxxxxx.xxx.xxx.xxx', 30x15, 'XXXXXX', 'XXXxxx')#account info hidden away
ap_name = 'AIR_PASSENGERS_TBL'
ap_df = create_dataframe_from_pandas(cc, pandas_df = air_passengers,
                                     table_name = ap_name, force = True)

Now ‘AIR_PASSENGERS_TBL’ is a concrete table that holds the time-series data in SAP HANA, and ap_df is the corresponding hana_ml DataFrame that are associated with this table in the Python client.  The data stored in SAP HANA can also be collected to the Python client, illustrated as follows:

ap_df.collect()
Month #Passengers
0 1949-01-01 112
1 1949-02-01 118
2 1949-03-01 132
3 1949-04-01 129
4 1949-05-01 121
139 1960-08-01 606
140 1960-09-01 508
141 1960-10-01 461
142 1960-11-01 390
143 1960-12-01 432

144 rows × 2 columns

 

Column data types of the table can be fetched as follows:

ap_df.dtypes()
[('Month', 'TIMESTAMP', 27, 27, 27, 0), ('#Passengers', 'INT', 10, 10, 10, 0)]

Since the seasonal decomposition procedure in PAL requires the input time-series data to have a timestamp column of integer type, which is inconsistent with the given hana_ml DataFrame, so we need to add an additional integer columns for representing the order of the time-series data. This could be be realized calling the add_id() function of hana_ml DataFrame, illustrated as follows:

ap_df_wid = ap_df.add_id(id_col="ID", ref_col='Month') # generate the integer ID column based on the order of Month
ap_df_wid.collect()
ID Month #Passengers
0 1 1949-01-01 112
1 2 1949-02-01 118
2 3 1949-03-01 132
3 4 1949-04-01 129
4 5 1949-05-01 121
139 140 1960-08-01 606
140 141 1960-09-01 508
141 142 1960-10-01 461
142 143 1960-11-01 390
143 144 1960-12-01 432

144 rows × 3 columns

Now seasonal decomposition can be applied to the data with added integer timestamp(i.e.  ID) column.

from hana_ml.algorithms.pal.tsa.seasonal_decompose import seasonal_decompose
stats, decomposed = seasonal_decompose(data = ap_df_wid, key = 'ID',
                                       endog = '#Passengers')
stats.collect()
STAT_NAME STAT_VALUE
0 type multiplicative
1 period 12
2 acf 0.88758

So a strong multiplicative yearly seasonality is detected, which is consistent with our justifications drawn from the run-sequence plot of the time-series.

The corresponding decomposition result can be visualized as follows:

decompose_data = decomposed.collect()
import matplotlib.pyplot as plt
figure, axe = plt.subplots(3)
axe[0].plot(decompose_data.TREND, 'r-.')
axe[0].set_title("Trend Component")
axe[1].plot(decompose_data.SEASONAL, 'b-.')
axe[1].set_title("Seasonal Component")
axe[2].plot(decompose_data.RANDOM, 'k-.')
axe[2].set_title("Residual")
plt.show()

So the multiplicative resudals are all close to 1, implying that the multiplication of trend and seasonal components gives a time-series nearly the same as the original one, so seasonal decomposition can well explain the given time-series data.

Data Partition

Lastly in this section, we split the time-series into two parts, where the 1st part is from 1949 to 1959, used for model training; and the 2nd part is the data in 1960, used for testing. This step is for time-series modeling and forecasts in the subsequent sections.

ap_df_train = cc.sql("select * from ({}) where \"Month\" < '1960-01-01'".format(ap_df_wid.select_statement))
ap_df_test = cc.sql("select * from ({}) where \"Month\" >= '1960-01-01'".format(ap_df_wid.select_statement))

We may check the training part of this dataset as follows:

ap_df_train.collect()
ID Month #Passengers
0 1 1949-01-01 112
1 2 1949-02-01 118
2 3 1949-03-01 132
3 4 1949-04-01 129
4 5 1949-05-01 121
127 128 1959-08-01 559
128 129 1959-09-01 463
129 130 1959-10-01 407
130 131 1959-11-01 362
131 132 1959-12-01 405

132 rows × 3 columns

 

Time-series Modeling and Analysis with Exponential Smoothing

Based on our previous analysis, the AirPassengers dataset is a time-series that contains yearly seasonality and a non-flat trend. One model that can handle such a time-series is triple exponential smoothing. In the following context, auto exponential smoothing is adopted to facilitate us in determininng the optimal parameters of exponential smoothing models.

from hana_ml.algorithms.pal.tsa.exponential_smoothing import AutoExponentialSmoothing
auto_eps = AutoExponentialSmoothing(model_selection = True,
                                    forecast_model_name = "TESM",
                                    seasonal='multiplicative',
                                    optimizer_time_budget=10,
                                    max_iter=500,
                                    forecast_num=12)

Then, optimal parameters for the triple exponential smoothing model can be determined by feeding it with the training data, illustrated as follows:

auto_eps.fit_predict(data = ap_df_train, key = "ID", endog = '#Passengers')

Optimal parameters for the trained triple exponential smoothing model is contained in the stats_ attribute of auto_eps, and the content of this attribute is a hana_ml DataFrame so it can be collected to the Python client as follows:

auto_eps.stats_.collect()
STAT_NAME STAT_VALUE
0 FORECAST_MODEL_NAME TESM
1 MSE 112.15447756481977
2 NUMBER_OF_ITERATIONS 587
3 SA_NUMBER_OF_ITERATIONS 500
4 NM_NUMBER_OF_ITERATIONS 87
5 NM_EXECUTION_TIME 0.000515
6 SA_STOP_COND MAX_ITERATION
7 NM_STOP_COND ERROR_DIFFERENCE
8 ALPHA 0.30708710037019393
9 BETA 0.03405392634387122
10 GAMMA 0.9694460594869154
11 CYCLE 12
12 SEASONAL Multiplicative
13 DAMPED false

So auto exponential smoothing also finds out that the period of seasonality for the given time-series is 12(see the CYCLE value in the above stats table), which is consistent the result of seasonal decomposition.

The fitted & forecasted values are contained in the forecast_ attribute of auto_eps, and since we have assigned forecast_num the value of 12 when initializing the auto exponential smoothing class, the last 12 records in the forecast_ attribute should correspond to the forecasted values based on the trained triple exponential smoothing model. We can examine those values by collecting them to the Python client, illustrated as follows:

auto_eps.forecast_.select(["TIMESTAMP", "VALUE"]).collect().tail(12)
TIMESTAMP VALUE
120 133 415.618831
121 134 392.725246
122 135 461.003599
123 136 446.982653
124 137 470.396336
125 138 537.062048
126 139 622.435966
127 140 632.745177
128 141 519.081970
129 142 454.389067
130 143 399.317911
131 144 440.222018

We can join the predict values with the ground truth for better comparison.

merged = ap_df_test.join(auto_eps.forecast_.select(['TIMESTAMP', 'VALUE']),
                         'TIMESTAMP = ID')
merged.collect()
ID Month #Passengers TIMESTAMP VALUE
0 133 1960-01-01 417 133 415.618831
1 134 1960-02-01 391 134 392.725246
2 135 1960-03-01 419 135 461.003599
3 136 1960-04-01 461 136 446.982653
4 137 1960-05-01 472 137 470.396336
5 138 1960-06-01 535 138 537.062048
6 139 1960-07-01 622 139 622.435966
7 140 1960-08-01 606 140 632.745177
8 141 1960-09-01 508 141 519.081970
9 142 1960-10-01 461 142 454.389067
10 143 1960-11-01 390 143 399.317911
11 144 1960-12-01 432 144 440.222018

We see that the predicted values are very close to the ground truth. We can jusitify this observation quantitatively using accuracy measures. In the following context we apply three accuracy measures for the prediction result : root-mean-square-error(rmse), mean-absolute-percentage-error(mape), mean-absolute-deviation(mad).

from hana_ml.algorithms.pal.tsa.accuracy_measure import accuracy_measure
res = accuracy_measure(data = merged.select(['#Passengers', 'VALUE']),
                       evaluation_metric = ['rmse', 'mape', 'mad'])
res.collect()
STAT_NAME STAT_VALUE
0 MAD 10.433921
1 MAPE 0.022462
2 RMSE 15.834904

We see that two accuray measures MAD and RMSE are at least one order smaller compared to the mangnitude of the origin data, so the forecast accuracy is not bad. The small percentage error value(i.e. MAPE) is also consistent with this justification since it is scale-invariant.

Time-series Modeling and Analysis with ARIMA

Another model that can handle time-series with trend and seasonality is ARIMA, where differencing and seasonal differencing are firstly applied to make the entire time-series stationary, followed by auto-regressive and moving-average(i.e. ARMA) modeling of the differenced(stationarized) time-series. The entire process for ARIMA modeling can be realized automatically in AutoARIMA.

from hana_ml.algorithms.pal.tsa.auto_arima import AutoARIMA
auto_arima = AutoARIMA()
auto_arima.fit(data = ap_df_train, key = 'ID',
               endog = '#Passengers')

Then, the optimal parameters of the ARIMA model learned from the training data can be accessed from the  model_ attribute of auto_arima, illustrated as follows:

auto_arima.model_.collect()
KEY VALUE
0 p 1
1 AR -0.571616
2 d 1
3 q 1
4 MA 0.394036
5 s 12
6 P 1
7 SAR 0.960813
8 D 0
9 Q 0
10 SMA
11 sigma^2 106.632
12 log-likelihood -507.15
13 AIC 1022.3
14 AICc 1022.61
15 BIC 1033.8
16 dy(n-p:n-1)_aux
17 dy_aux 0
18 dy_0 6;14;-3;-8;14;13;0;-12;-17;-15;14;-3;11;15;-6;…
19 x(n-d:n-1)_aux
20 y(n-d:n-1)_aux 0
21 y(n-d:n-1)_0 405
22 epsilon(n-q:n-1)_aux 0
23 epsilon(n-q:n-1)_0 18.5076

This results in an ARIMA(1,1,1)(1,0,0)12 model. However, this model is unlikely to be effective enough for modeling the given time-series since it only contains a single difference. We have already observed that the magnitude of seasonal osciallations of this time-series grows proportionally with its upward trend, so after a single difference is applied, differenced values should be larger in magnitude in later phase than in early phase, which violates the stationary assumption for ARMA modeling. This can be justified by the forecast accuracy measures of the trained ARIMA model, illustrated as follows:

forecast_res = auto_arima.predict(forecast_method='formula_forecast',
                                  forecast_length=12)
fc_res_reduced = cc.sql('SELECT TIMESTAMP, FORECAST from ({})'.format(forecast_res.select_statement))
arima_merged = ap_df_test.join(fc_res_reduced,
                               'TIMESTAMP + {} = ID'.format(ap_df_test.collect()['ID'][0]))
arima_merged.collect()
ID Month #Passengers TIMESTAMP FORECAST
0 133 1960-01-01 417 0 424.640706
1 134 1960-02-01 391 1 408.751100
2 135 1960-03-01 419 2 469.439996
3 136 1960-04-01 461 3 460.290951
4 137 1960-05-01 472 4 483.088042
5 138 1960-06-01 535 5 533.200322
6 139 1960-07-01 622 6 606.136366
7 140 1960-08-01 606 7 616.754322
8 141 1960-09-01 508 8 524.488257
9 142 1960-10-01 461 9 470.698744
10 143 1960-11-01 390 10 427.453005
11 144 1960-12-01 432 11 468.773196
from hana_ml.algorithms.pal.tsa.accuracy_measure import accuracy_measure
acc_measures = accuracy_measure(data = arima_merged.select(['#Passengers', 'FORECAST']),
                                evaluation_metric = ['rmse', 'mape', 'mad'])
acc_measures.collect()
STAT_NAME STAT_VALUE
0 MAD 18.038311
1 MAPE 0.040867
2 RMSE 23.332016

So all three chosen accuracy measures are worse than those produced by exponential smoothing modeling.

From our previous analysis, we strongly suspect that multiplicative seasonality of the time-series data is the main reason that makes ARIMA modeling difficulty. This difficulty could possibly be resolved by taking logarithm of the time-series, which transforms multiplicative seasonality to addtive seasonality.

ap_df_train_log = cc.sql('SELECT "ID", "Month", LN("#Passengers") FROM ({})'.format(ap_df_train.select_statement)) # LN is logarithm transformation
ap_df_train_log.collect()
ID Month LN(#Passengers)
0 1 1949-01-01 4.718499
1 2 1949-02-01 4.770685
2 3 1949-03-01 4.882802
3 4 1949-04-01 4.859812
4 5 1949-05-01 4.795791
127 128 1959-08-01 6.326149
128 129 1959-09-01 6.137727
129 130 1959-10-01 6.008813
130 131 1959-11-01 5.891644
131 132 1959-12-01 6.003887

132 rows × 3 columns

The result of seasonality test indicates that the (logarithmic) transformed time-series is of additive yearly seasonality, illustrated as follows:

from hana_ml.algorithms.pal.tsa.seasonal_decompose import seasonal_decompose
stats, _ = seasonal_decompose(data = ap_df_train_log,
                              key = 'ID',
                              endog = 'LN(#Passengers)')
stats.collect()
STAT_NAME STAT_VALUE
0 type additive
1 period 12
2 acf 0.877358

We can re-fit the ARIMA model by using the transformed training data.

auto_arima.fit(data = ap_df_train_log, key = 'ID', endog = 'LN(#Passengers)')
auto_arima.model_.collect()
KEY VALUE
0 p 1
1 AR -0.306761
2 d 1
3 q 0
4 MA
5 s 12
6 P 2
7 SAR 0.531414;0.43327
8 D 0
9 Q 0
10 SMA
11 sigma^2 0.00140498
12 log-likelihood 229.076
13 AIC -450.152
14 AICc -449.835
15 BIC -438.651
16 dy(n-p:n-1)_aux
17 dy_aux 0
18 dy_0 0.0521857;0.112117;-0.0229895;-0.0640218;0.109…
19 x(n-d:n-1)_aux
20 y(n-d:n-1)_aux 0
21 y(n-d:n-1)_0 6.00388
22 epsilon(n-q:n-1)_aux

The re-fitting process results in an ARIMA(1,1,0)(2,0,0)12 model. Now we make a 12 step forecast using the trained model, and compared the (exponentiated) forecasted values with the ground truth.

forecast_res2 = auto_arima.predict(forecast_method='formula_forecast',
                                   forecast_length=12)
fc_res2_reduced = cc.sql('select TIMESTAMP, EXP(FORECAST) AS FORECAST from ({})'.format(forecast_res2.select_statement))
arima_merged2 = ap_df_test.join(fc_res2_reduced,
                              'TIMESTAMP + {} = ID'.format(ap_df_test.collect()['ID'][0]))
arima_merged2.collect()
ID Month #Passengers TIMESTAMP FORECAST
0 133 1960-01-01 417 0 418.275306
1 134 1960-02-01 391 1 396.366024
2 135 1960-03-01 419 2 458.930102
3 136 1960-04-01 461 3 445.316686
4 137 1960-05-01 472 4 467.906297
5 138 1960-06-01 535 5 538.461284
6 139 1960-07-01 622 6 614.319460
7 140 1960-08-01 606 7 628.451952
8 141 1960-09-01 508 8 516.175951
9 142 1960-10-01 461 9 457.957321
10 143 1960-11-01 390 10 403.803691
11 144 1960-12-01 432 11 444.414850

So the forecast values(after exponential transformation) are close to the ground truth, which indicates that trained ARIMA model using the transformed training data explains the time-series data very well. This can also be justified by forecast accuracy measures, illustrated as follows:

acc_measures2 = accuracy_measure(data = arima_merged2.select(['#Passengers', 'FORECAST']), 
                                 evaluation_metric = ['rmse', 'mape', 'mad'])
acc_measures2.collect()
STAT_NAME STAT_VALUE
0 MAD 11.448283
1 MAPE 0.024789
2 RMSE 15.501060

The result is very close to that of (auto) exponential smoothing, so logarithm transformation plus ARIMA modeling results in a good explanation of the time-series data.

Summary

In this blog post, we have shown readers how to apply procedures in SAP HANA Predictive Analysis Library(PAL) to do time-series modeling and analysis through Python machine learning client for SAP HANA(hana_ml). Some useful take-away messages are:

  • Drawing the run-sequence plot for the time-series to get a general comprehension of it
  • Applying seasonal decomposition to the time-series to get its type of seasonality and seasonal period
  • Choosing appropriate methods for modeling the time-series based on the characteristics drawn from its run-sequence plot as well as seasonal decomposition result.
  • Before training the model, some model parameters could be specified if they are easily obtainable, and appropriate transformation(like logarithm transformation) could be applied to the training data if necessary
  • Evaluating the quality of the trained model using forecast accuracy measures

 

Appendix

Level, Trend, Seasonality, and Noise

  • Level: the average value of time-series
  • Trend: the general shape of the time-series, where any local/short-term fluctuations are ignored(smoothed/avaraged out)
  • Seasonality: fluctuation patterns that appear regularly during fixed period in time-series
  • Noise: irregular and unexplained variations in time-series

Type of Seasonality

  • Seasonality can be divided into different categories based on its length of period, like
          daily, weekly, monthly, quarterly and yearly seasonality.

    A time-series can have multiple seasonalities with different periods, take the hourly electricity consumption data of a city located in temperate zone for example:

    • Electricity consumption is high in day time, low in night hours – this is daily seasonality.
    • Electricity consumption is usually high in work days, and low in weekends – this is weekly seasonality.
    • Electricity consumption is usually relatively high in summer and winter seasons(especially in hot summers), and low in spring and autumn – this is typical yearly seasonality.
  • Seasonality can also be categorized by its correlation with the trend:
    • Additive Seasonality : The magnitude of seasonal fluctuations is stable and does not change with increasing/decreasing of local levels.
    • Multiplicative Seasonality : The magnitude of seasonal fluctuations changes proportionally with the increasing/decreasing of local levels

    Again taking the quarterly electricity consumption as an example, assuming that total consumption is higher in summer season than that in spring season, then this is typical case of yearly seasonality(repeated once every year). If the electricity consumption is always 1 million units higher in summer than that in spring, then the seasonality is additive; if the aggregated electricity consumption is always 10% higher in summer than that in spring, then the seasonlity is multiplicative.

Stationarity

Another important concept for analysis of time-series is stationary. Generally speaking, a given time-series is called stationary if whose statistical properties like mean, variance and autocorrelation do not change over time. In this sense, a time-series with non-flat trend or seasonality is obviously non-stationary since trend and seasonality affect values of time-series in different times.

Assigned tags

      3 Comments
      You must be Logged on to comment or reply to a post.
      Author's profile photo Marcus Schiffer
      Marcus Schiffer

      Hi,

       

      thanks for the valuable blog!

      We were trying to rebuild this on our own dataset and it works quite well. However , we might want to include exogeneous variables in the forecast. When doing this  like e.g.:

      from hana_ml.algorithms.pal.tsa.arima import ARIMA
      ari = ARIMA( order=(0, 2, 1), method='mle', thread_ratio=1.0)
      ari.fit(data = df_train, key = 'ID', endog = 'VAR1' , exog = ['VAR2'])

      we get totally out f scope forecast values.

      What might possibly be wrong in the model parameters ?

      We tried also with seasonality etc, but no success. Forecast values always tend to show erratic (very high) values.

      Any hint is appreciated.

       

      Regards

      Marcus

      Author's profile photo Likun Hou
      Likun Hou
      Blog Post Author

      Hi,

      Getting erratic forecast values means the ARIMA process drawn from the input time-series is non-stationary. It usually happens because you specify the wrong `order` parameter for ARIMA modeling. So if your dataset is flawless(numerically), you may try the following resolutions:

      1. Using AutoARIMA instead of ARIMA to draw the optimal `order` parameter from the input time-series data

      2. If you do not want to use AutoARIMA, considering reducing the differencing order(in your ARIMA modeling 2 is used, so use an order less than 2) and rebuild the model

      Author's profile photo Marcus Schiffer
      Marcus Schiffer

      Thanks for the reply !

      In the mean time I was able to fix it: In the train dataframe all other columns other than the exogeneous ones need to be deleted. Then it works and gives good results.

       

      Thanks again

      Regrads

      Marcus