In this blog post, we show how to perform time-series modeling and analysis using the SAP HANA Predictive Analysis Library (PAL). Instead of the original SQL interface, here we call PAL procedures through the Python machine learning client for SAP HANA (hana_ml). For today's users, especially data analysts who are most familiar with Python, the Python interface is often the more welcome choice. Thanks to hana_ml, we can now call various PAL procedures with ease by writing just a few lines of Python code.

Besides Python, we also assume that readers of this blog post have some basic knowledge of time-series concepts such as trend and seasonality. If not, please refer to the Appendix section of this blog post for a quick introduction.

The rest of this blog post is organized as follows: first we give a brief introduction to the time-series data of interest, together with some basic analysis using a run-sequence plot and seasonal decomposition; then the time-series data is further analyzed with exponential smoothing and ARIMA models; finally, the main content of this blog post is summarized.

 

Introduction to the Dataset, with Basic Analysis


The data used for demonstration in this blog post is the "AirPassengers" dataset, which records the number of passengers (in thousands) carried monthly by an airline company between Jan. 1949 and Dec. 1960. This is a classical dataset introduced by Box and Jenkins in the renowned textbook Time Series Analysis: Forecasting and Control (3rd edition). We assume that the dataset is stored in a local .csv file (AirPassengers.csv), and it can be loaded into the Python client using the pandas library.

import pandas as pd

air_passengers = pd.read_csv(PATH_TO_LOCAL_CSV,   # replace with the actual path to the local .csv file
                             parse_dates=[0])     # the 0th column contains the month info; if not specified, its values are parsed as strings

air_passengers
 
  
          Month  #Passengers
0    1949-01-01          112
1    1949-02-01          118
2    1949-03-01          132
3    1949-04-01          129
4    1949-05-01          121
..          ...          ...
139  1960-08-01          606
140  1960-09-01          508
141  1960-10-01          461
142  1960-11-01          390
143  1960-12-01          432

144 rows × 2 columns


air_passengers.dtypes

Month          datetime64[ns]
#Passengers             int64
dtype: object

So the timestamp column for this time-series is Month, where each month is represented by its first day.

Now we take the first step in analyzing the time-series data: drawing its run-sequence plot.
import matplotlib.pyplot as plt

plt.plot(air_passengers['Month'], air_passengers['#Passengers'])

plt.vlines(x=pd.to_datetime(['{}-01-01 00:00:00'.format(str(1949+i)) for i in range(12)]),

           ymin=100, ymax=620, colors='r', linestyles='dashed')

plt.show()

 


Every dashed red line in the above figure indicates the start of a new year. We can see from this figure that the number of airline passengers keeps growing year-by-year in general, forming an upward trend. Besides, in each year the monthly numbers have ups-and-downs in a similar fashion: low in the winter season and high in the summer season. Furthermore, the magnitude of the local oscillations grows with the upward trend, which is clear evidence of multiplicative seasonality.


Time-series Analysis with Seasonal Decomposition


The observations drawn from the run-sequence plot of the time-series data can be verified by seasonal decomposition. Seasonal decomposition is implemented in PAL by a procedure named SEASONALITY_TEST, and this procedure is wrapped by a function called seasonal_decompose() in hana_ml. In order to apply the seasonal decomposition function in hana_ml to the time-series data, we first need to store the dataset in SAP HANA, which can be done with the following lines of code:


import hana_ml

from hana_ml.dataframe import create_dataframe_from_pandas, ConnectionContext

cc = ConnectionContext('xxxxxxx.xxx.xxx.xxx', 30x15, 'XXXXXX', 'XXXxxx')#account info hidden away

ap_name = 'AIR_PASSENGERS_TBL'

ap_df = create_dataframe_from_pandas(cc, pandas_df = air_passengers,

                                     table_name = ap_name, force = True)


Now 'AIR_PASSENGERS_TBL' is a concrete table that holds the time-series data in SAP HANA, and ap_df is the corresponding hana_ml DataFrame that is associated with this table in the Python client. The data stored in SAP HANA can also be collected to the Python client, illustrated as follows:

ap_df.collect()

 

          Month  #Passengers
0    1949-01-01          112
1    1949-02-01          118
2    1949-03-01          132
3    1949-04-01          129
4    1949-05-01          121
..          ...          ...
139  1960-08-01          606
140  1960-09-01          508
141  1960-10-01          461
142  1960-11-01          390
143  1960-12-01          432

144 rows × 2 columns


Column data types of the table can be fetched as follows:

ap_df.dtypes()

 

[('Month', 'TIMESTAMP', 27, 27, 27, 0), ('#Passengers', 'INT', 10, 10, 10, 0)]


The seasonal decomposition procedure in PAL requires the input time-series data to have a timestamp column of integer type, which is inconsistent with the given hana_ml DataFrame, so we need to add an additional integer column that represents the order of the time-series data. This can be done by calling the add_id() function of the hana_ml DataFrame, illustrated as follows:

ap_df_wid = ap_df.add_id(id_col="ID", ref_col='Month') # generate the integer ID column based on the order of Month

ap_df_wid.collect()

      ID       Month  #Passengers
0      1  1949-01-01          112
1      2  1949-02-01          118
2      3  1949-03-01          132
3      4  1949-04-01          129
4      5  1949-05-01          121
..   ...         ...          ...
139  140  1960-08-01          606
140  141  1960-09-01          508
141  142  1960-10-01          461
142  143  1960-11-01          390
143  144  1960-12-01          432

144 rows × 3 columns

Now seasonal decomposition can be applied to the data with the added integer timestamp (i.e. ID) column.
from hana_ml.algorithms.pal.tsa.seasonal_decompose import seasonal_decompose

stats, decomposed = seasonal_decompose(data = ap_df_wid, key = 'ID',

                                       endog = '#Passengers')

stats.collect()

 


   STAT_NAME  STAT_VALUE
0  type       multiplicative
1  period     12
2  acf        0.88758

So a strong multiplicative yearly seasonality is detected, which is consistent with our observations drawn from the run-sequence plot of the time-series.

The corresponding decomposition result can be visualized as follows:
decompose_data = decomposed.collect()

import matplotlib.pyplot as plt

figure, axe = plt.subplots(3)

axe[0].plot(decompose_data.TREND, 'r-.')

axe[0].set_title("Trend Component")

axe[1].plot(decompose_data.SEASONAL, 'b-.')

axe[1].set_title("Seasonal Component")

axe[2].plot(decompose_data.RANDOM, 'k-.')

axe[2].set_title("Residual")

plt.show()

 


The multiplicative residuals are all close to 1, implying that the product of the trend and seasonal components gives a time-series nearly the same as the original one; hence seasonal decomposition explains the given time-series data well.
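As a quick numerical check (a small sketch reusing the decompose_data frame collected above, not part of the original decomposition output), we can summarize the residual column and confirm that it stays close to 1:

# summary statistics of the multiplicative residuals; values hovering around 1
# indicate that trend x seasonal reproduces the original series well
print(decompose_data['RANDOM'].describe())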

 

Data Partition


Lastly in this section, we split the time-series into two parts: the first part, from 1949 to 1959, is used for model training; the second part, the data of 1960, is used for testing. This prepares for the time-series modeling and forecasting in the subsequent sections.

ap_df_train = cc.sql("select * from ({}) where \"Month\" < '1960-01-01'".format(ap_df_wid.select_statement))

ap_df_test = cc.sql("select * from ({}) where \"Month\" >= '1960-01-01'".format(ap_df_wid.select_statement))


We may check the training part of this dataset as follows:

ap_df_train.collect()

 


      ID       Month  #Passengers
0      1  1949-01-01          112
1      2  1949-02-01          118
2      3  1949-03-01          132
3      4  1949-04-01          129
4      5  1949-05-01          121
..   ...         ...          ...
127  128  1959-08-01          559
128  129  1959-09-01          463
129  130  1959-10-01          407
130  131  1959-11-01          362
131  132  1959-12-01          405

132 rows × 3 columns


Time-series Modeling and Analysis with Exponential Smoothing


Based on our previous analysis, the AirPassengers dataset is a time-series with yearly seasonality and a non-flat trend. One model that can handle such a time-series is triple exponential smoothing. In what follows, auto exponential smoothing is adopted to help determine the optimal parameters of the exponential smoothing model.


from hana_ml.algorithms.pal.tsa.exponential_smoothing import AutoExponentialSmoothing

auto_eps = AutoExponentialSmoothing(model_selection = True,

                                    forecast_model_name = "TESM",

                                    seasonal='multiplicative',

                                    optimizer_time_budget=10,

                                    max_iter=500,

                                    forecast_num=12)

Then, the optimal parameters of the triple exponential smoothing model can be determined by feeding it the training data, illustrated as follows:
auto_eps.fit_predict(data = ap_df_train, key = "ID", endog = '#Passengers')

The optimal parameters of the trained triple exponential smoothing model are contained in the stats_ attribute of auto_eps; the content of this attribute is a hana_ml DataFrame, so it can be collected to the Python client as follows:
auto_eps.stats_.collect()
 
    STAT_NAME                STAT_VALUE
0   FORECAST_MODEL_NAME      TESM
1   MSE                      112.15447756481977
2   NUMBER_OF_ITERATIONS     587
3   SA_NUMBER_OF_ITERATIONS  500
4   NM_NUMBER_OF_ITERATIONS  87
5   NM_EXECUTION_TIME        0.000515
6   SA_STOP_COND             MAX_ITERATION
7   NM_STOP_COND             ERROR_DIFFERENCE
8   ALPHA                    0.30708710037019393
9   BETA                     0.03405392634387122
10  GAMMA                    0.9694460594869154
11  CYCLE                    12
12  SEASONAL                 Multiplicative
13  DAMPED                   false

So auto exponential smoothing also finds that the period of seasonality for the given time-series is 12 (see the CYCLE value in the stats table above), which is consistent with the result of seasonal decomposition.

The fitted and forecasted values are contained in the forecast_ attribute of auto_eps. Since we assigned forecast_num the value 12 when initializing the auto exponential smoothing class, the last 12 records in the forecast_ attribute correspond to the forecasted values based on the trained triple exponential smoothing model. We can examine those values by collecting them to the Python client, illustrated as follows:

auto_eps.forecast_.select(["TIMESTAMP", "VALUE"]).collect().tail(12)

 


     TIMESTAMP       VALUE
120        133  415.618831
121        134  392.725246
122        135  461.003599
123        136  446.982653
124        137  470.396336
125        138  537.062048
126        139  622.435966
127        140  632.745177
128        141  519.081970
129        142  454.389067
130        143  399.317911
131        144  440.222018

We can join the predicted values with the ground truth for a better comparison.
merged = ap_df_test.join(auto_eps.forecast_.select(['TIMESTAMP', 'VALUE']),

                         'TIMESTAMP = ID')

merged.collect()
     ID       Month  #Passengers  TIMESTAMP       VALUE
0   133  1960-01-01          417        133  415.618831
1   134  1960-02-01          391        134  392.725246
2   135  1960-03-01          419        135  461.003599
3   136  1960-04-01          461        136  446.982653
4   137  1960-05-01          472        137  470.396336
5   138  1960-06-01          535        138  537.062048
6   139  1960-07-01          622        139  622.435966
7   140  1960-08-01          606        140  632.745177
8   141  1960-09-01          508        141  519.081970
9   142  1960-10-01          461        142  454.389067
10  143  1960-11-01          390        143  399.317911
11  144  1960-12-01          432        144  440.222018

We see that the predicted values are very close to the ground truth. We can justify this observation quantitatively using accuracy measures. In what follows, we apply three accuracy measures to the prediction result: root-mean-square error (RMSE), mean absolute percentage error (MAPE), and mean absolute deviation (MAD).
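These measures can be computed with the accuracy_measure() function of hana_ml; the following is a sketch mirroring the call used for the ARIMA model later in this post, applied to the merged DataFrame built above (the variable name esm_acc is introduced here only for illustration):

from hana_ml.algorithms.pal.tsa.accuracy_measure import accuracy_measure

# compare the ground truth (#Passengers) against the forecasted values (VALUE)
esm_acc = accuracy_measure(data = merged.select(['#Passengers', 'VALUE']),
                           evaluation_metric = ['rmse', 'mape', 'mad'])

esm_acc.collect()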

   STAT_NAME  STAT_VALUE
0  MAD        10.433921
1  MAPE       0.022462
2  RMSE       15.834904

We see that the two accuracy measures MAD and RMSE are at least an order of magnitude smaller than the magnitude of the original data, so the forecast accuracy is not bad. The small percentage error value (i.e. MAPE) is also consistent with this justification, since it is scale-invariant.
 

Time-series Modeling and Analysis with ARIMA


Another model that can handle time-series with trend and seasonality is ARIMA, where differencing and seasonal differencing are first applied to make the entire time-series stationary, followed by auto-regressive and moving-average (i.e. ARMA) modeling of the differenced (stationarized) time-series. The entire process of ARIMA modeling can be carried out automatically with AutoARIMA.
from hana_ml.algorithms.pal.tsa.auto_arima import AutoARIMA

auto_arima = AutoARIMA()

auto_arima.fit(data = ap_df_train, key = 'ID',

               endog = '#Passengers')


Then, the optimal parameters of the ARIMA model learned from the training data can be accessed from the  model_ attribute of auto_arima, illustrated as follows:

auto_arima.model_.collect()
 
    KEY                   VALUE
0   p                     1
1   AR                    -0.571616
2   d                     1
3   q                     1
4   MA                    0.394036
5   s                     12
6   P                     1
7   SAR                   0.960813
8   D                     0
9   Q                     0
10  SMA
11  sigma^2               106.632
12  log-likelihood        -507.15
13  AIC                   1022.3
14  AICc                  1022.61
15  BIC                   1033.8
16  dy(n-p:n-1)_aux
17  dy_aux                0
18  dy_0                  6;14;-3;-8;14;13;0;-12;-17;-15;14;-3;11;15;-6;...
19  x(n-d:n-1)_aux
20  y(n-d:n-1)_aux        0
21  y(n-d:n-1)_0          405
22  epsilon(n-q:n-1)_aux  0
23  epsilon(n-q:n-1)_0    18.5076

This results in an ARIMA(1,1,1)(1,0,0)12 model. However, this model is unlikely to be effective enough for modeling the given time-series, since it contains only a single difference. We have already observed that the magnitude of the seasonal oscillations of this time-series grows proportionally with its upward trend, so after a single difference is applied, the differenced values will be larger in magnitude in the later phase than in the early phase, which violates the stationarity assumption of ARMA modeling. This can be justified by the forecast accuracy measures of the trained ARIMA model, illustrated as follows:
forecast_res = auto_arima.predict(forecast_method='formula_forecast',

                                  forecast_length=12)

fc_res_reduced = cc.sql('SELECT TIMESTAMP, FORECAST from ({})'.format(forecast_res.select_statement))

arima_merged = ap_df_test.join(fc_res_reduced,

                               'TIMESTAMP + {} = ID'.format(ap_df_test.collect()['ID'][0]))

arima_merged.collect()

     ID       Month  #Passengers  TIMESTAMP    FORECAST
0   133  1960-01-01          417          0  424.640706
1   134  1960-02-01          391          1  408.751100
2   135  1960-03-01          419          2  469.439996
3   136  1960-04-01          461          3  460.290951
4   137  1960-05-01          472          4  483.088042
5   138  1960-06-01          535          5  533.200322
6   139  1960-07-01          622          6  606.136366
7   140  1960-08-01          606          7  616.754322
8   141  1960-09-01          508          8  524.488257
9   142  1960-10-01          461          9  470.698744
10  143  1960-11-01          390         10  427.453005
11  144  1960-12-01          432         11  468.773196

from hana_ml.algorithms.pal.tsa.accuracy_measure import accuracy_measure

acc_measures = accuracy_measure(data = arima_merged.select(['#Passengers', 'FORECAST']),

                                evaluation_metric = ['rmse', 'mape', 'mad'])

acc_measures.collect()

   STAT_NAME  STAT_VALUE
0  MAD        18.038311
1  MAPE       0.040867
2  RMSE       23.332016

So all three chosen accuracy measures are worse than those produced by exponential smoothing modeling.

From our previous analysis, we strongly suspect that the multiplicative seasonality of the time-series data is the main reason that makes ARIMA modeling difficult. This difficulty could possibly be resolved by taking the logarithm of the time-series, which transforms multiplicative seasonality into additive seasonality.
ap_df_train_log = cc.sql('SELECT "ID", "Month", LN("#Passengers") FROM ({})'.format(ap_df_train.select_statement)) # LN is logarithm transformation

ap_df_train_log.collect()

      ID       Month  LN(#Passengers)
0      1  1949-01-01         4.718499
1      2  1949-02-01         4.770685
2      3  1949-03-01         4.882802
3      4  1949-04-01         4.859812
4      5  1949-05-01         4.795791
..   ...         ...              ...
127  128  1959-08-01         6.326149
128  129  1959-09-01         6.137727
129  130  1959-10-01         6.008813
130  131  1959-11-01         5.891644
131  132  1959-12-01         6.003887

132 rows × 3 columns

The result of the seasonality test indicates that the (logarithmically) transformed time-series has additive yearly seasonality, illustrated as follows:
from hana_ml.algorithms.pal.tsa.seasonal_decompose import seasonal_decompose

stats, _ = seasonal_decompose(data = ap_df_train_log,

                              key = 'ID',

                              endog = 'LN(#Passengers)')

stats.collect()

   STAT_NAME  STAT_VALUE
0  type       additive
1  period     12
2  acf        0.877358

We can now re-fit the ARIMA model using the transformed training data.
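A sketch of the re-fit call, reusing the same auto_arima instance on the log-transformed training data (consistent with the fit call shown earlier):

# re-fit AutoARIMA on the log-transformed training data
auto_arima.fit(data = ap_df_train_log, key = 'ID',
               endog = 'LN(#Passengers)')

auto_arima.model_.collect()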

    KEY                   VALUE
0   p                     1
1   AR                    -0.306761
2   d                     1
3   q                     0
4   MA
5   s                     12
6   P                     2
7   SAR                   0.531414;0.43327
8   D                     0
9   Q                     0
10  SMA
11  sigma^2               0.00140498
12  log-likelihood        229.076
13  AIC                   -450.152
14  AICc                  -449.835
15  BIC                   -438.651
16  dy(n-p:n-1)_aux
17  dy_aux                0
18  dy_0                  0.0521857;0.112117;-0.0229895;-0.0640218;0.109...
19  x(n-d:n-1)_aux
20  y(n-d:n-1)_aux        0
21  y(n-d:n-1)_0          6.00388
22  epsilon(n-q:n-1)_aux

The re-fitting process results in an ARIMA(1,1,0)(2,0,0)12 model. Now we make a 12-step forecast using the trained model and compare the (exponentiated) forecasted values with the ground truth.
forecast_res2 = auto_arima.predict(forecast_method='formula_forecast',

                                   forecast_length=12)

fc_res2_reduced = cc.sql('select TIMESTAMP, EXP(FORECAST) AS FORECAST from ({})'.format(forecast_res2.select_statement))

arima_merged2 = ap_df_test.join(fc_res2_reduced,

                              'TIMESTAMP + {} = ID'.format(ap_df_test.collect()['ID'][0]))

arima_merged2.collect()


     ID       Month  #Passengers  TIMESTAMP    FORECAST
0   133  1960-01-01          417          0  418.275306
1   134  1960-02-01          391          1  396.366024
2   135  1960-03-01          419          2  458.930102
3   136  1960-04-01          461          3  445.316686
4   137  1960-05-01          472          4  467.906297
5   138  1960-06-01          535          5  538.461284
6   139  1960-07-01          622          6  614.319460
7   140  1960-08-01          606          7  628.451952
8   141  1960-09-01          508          8  516.175951
9   142  1960-10-01          461          9  457.957321
10  143  1960-11-01          390         10  403.803691
11  144  1960-12-01          432         11  444.414850

So the forecast values (after exponential back-transformation) are close to the ground truth, which indicates that the ARIMA model trained on the transformed training data explains the time-series data very well. This can also be justified by forecast accuracy measures, illustrated as follows:
acc_measures2 = accuracy_measure(data = arima_merged2.select(['#Passengers', 'FORECAST']), 

                                 evaluation_metric = ['rmse', 'mape', 'mad'])

acc_measures2.collect()

   STAT_NAME  STAT_VALUE
0  MAD        11.448283
1  MAPE       0.024789
2  RMSE       15.501060

The result is very close to that of (auto) exponential smoothing, so logarithm transformation plus ARIMA modeling results in a good explanation of the time-series data.

Summary

 

In this blog post, we have shown readers how to apply procedures in the SAP HANA Predictive Analysis Library (PAL) for time-series modeling and analysis through the Python machine learning client for SAP HANA (hana_ml). Some useful take-away messages are:
    • Drawing the run-sequence plot of the time-series to get a general comprehension of it
    • Applying seasonal decomposition to the time-series to determine its type of seasonality and seasonal period
    • Choosing appropriate methods for modeling the time-series based on the characteristics drawn from its run-sequence plot as well as the seasonal decomposition result
    • Before training the model, specifying some model parameters if they are easily obtainable, and applying an appropriate transformation (like the logarithm transformation) to the training data if necessary
    • Evaluating the quality of the trained model using forecast accuracy measures

 

Appendix

 

Level, Trend, Seasonality, and Noise

    • Level: the average value of the time-series
    • Trend: the general shape of the time-series, with local/short-term fluctuations ignored (smoothed/averaged out)
    • Seasonality: fluctuation patterns that appear regularly within fixed periods of the time-series
    • Noise: irregular and unexplained variations in the time-series

 

Type of Seasonality

    • Seasonality can be divided into different categories based on the length of its period, e.g. daily, weekly, monthly, quarterly and yearly seasonality.

      A time-series can have multiple seasonalities with different periods. Take the hourly electricity consumption data of a city located in the temperate zone as an example:

        • Electricity consumption is high in daytime hours and low at night - this is daily seasonality.

        • Electricity consumption is usually high on workdays and low on weekends - this is weekly seasonality.

        • Electricity consumption is usually relatively high in the summer and winter seasons (especially in hot summers) and low in spring and autumn - this is typical yearly seasonality.


    • Seasonality can also be categorized by its correlation with the trend:

        • Additive Seasonality: the magnitude of seasonal fluctuations is stable and does not change with increases/decreases of the local level.

        • Multiplicative Seasonality: the magnitude of seasonal fluctuations changes proportionally with increases/decreases of the local level.

      Again taking the quarterly electricity consumption as an example, and assuming that total consumption is higher in summer than in spring, this is a typical case of yearly seasonality (repeated once every year). If the electricity consumption is always 1 million units higher in summer than in spring, then the seasonality is additive; if it is always 10% higher in summer than in spring, then the seasonality is multiplicative.
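As a tiny synthetic illustration of the difference (a sketch unrelated to the AirPassengers data), the following snippet builds two series on the same upward trend, one whose seasonal swing has constant magnitude and one whose swing grows with the level:

import numpy as np

t = np.arange(48)                              # four years of monthly observations
trend = 100 + 5 * t                            # upward trend (the local level)
season = np.sin(2 * np.pi * t / 12)            # yearly seasonal pattern

additive = trend + 20 * season                 # seasonal swing stays around +/-20
multiplicative = trend * (1 + 0.2 * season)    # seasonal swing grows with the level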

 

Stationarity

 
Another important concept in the analysis of time-series is stationarity. Generally speaking, a given time-series is called stationary if its statistical properties, such as mean, variance and autocorrelation, do not change over time. In this sense, a time-series with a non-flat trend or with seasonality is obviously non-stationary, since trend and seasonality affect the values of the time-series differently at different times.
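As a small synthetic sketch (not part of the original analysis), differencing removes a linear trend and yields a series whose mean and variance no longer drift over time, which is exactly what the differencing steps in ARIMA aim for:

import numpy as np

rng = np.random.default_rng(0)
t = np.arange(100)
series = 2 * t + rng.normal(size=100)    # linear trend plus noise: non-stationary
diffed = np.diff(series)                 # first difference: roughly constant mean and variance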