Time-Series Modeling and Analysis using SAP HANA Predictive Analysis Library (PAL) through Python Machine Learning Client for SAP HANA
In this blog post, we will show users how to perform time-series modeling and analysis using SAP HANA Predictive Analysis Library (PAL). Different from the original SQL interface, here we call PAL procedures through the Python machine learning client for SAP HANA (hana_ml). Python is often the preferred language for today's users, especially data analysts, who are most familiar with it. Thanks to hana_ml, we can now call various PAL procedures with ease by writing just a few lines of Python code.
Besides Python, we also assume that readers of this blog post have some basic knowledge of time-series concepts like trend and seasonality. If not, please refer to the Appendix section of this blog post for a quick overview.
The rest of this blog post is organized as follows: first we give a brief introduction to the time-series data of interest, with some basic analysis using a run-sequence plot and seasonal decomposition; then the time-series data are further analyzed with exponential smoothing and ARIMA models; finally, the main content of this blog post is summarized.
Introduction to the Dataset, with Basic Analysis
The data used for demonstration in this blog post is the “AirPassengers” dataset, which records the number of passengers (in thousands) carried monthly by an airline company between Jan. 1949 and Dec. 1960. This is a classical dataset that was introduced by Box and Jenkins in the renowned textbook Time Series Analysis, Forecasting and Control (3rd edition). We assume that the dataset is stored in a local .csv file (AirPassengers.csv), and it can be loaded into the Python client using the pandas library.
import pandas as pd
air_passengers = pd.read_csv(PATH_TO_LOCAL_CSV,  # replace with the actual path to the local .csv file
                             parse_dates=[0])  # the 0th column contains month info; if not specified, values in this column would be parsed as strings
air_passengers
 | Month | #Passengers |
---|---|---|
0 | 1949-01-01 | 112 |
1 | 1949-02-01 | 118 |
2 | 1949-03-01 | 132 |
3 | 1949-04-01 | 129 |
4 | 1949-05-01 | 121 |
… | … | … |
139 | 1960-08-01 | 606 |
140 | 1960-09-01 | 508 |
141 | 1960-10-01 | 461 |
142 | 1960-11-01 | 390 |
143 | 1960-12-01 | 432 |
144 rows × 2 columns
air_passengers.dtypes
Month          datetime64[ns]
#Passengers             int64
dtype: object
So the timestamp column for this time-series is Month, where each month is represented by its first day.
Now we take the first step for analyzing the time-series data — drawing its run-sequence plot.
import matplotlib.pyplot as plt
plt.plot(air_passengers['Month'], air_passengers['#Passengers'])
plt.vlines(x=pd.to_datetime(['{}-01-01 00:00:00'.format(str(1949+i)) for i in range(12)]),
ymin=100, ymax=620, colors='r', linestyles='dashed')
plt.show()
Every dashed red line in the above figure indicates the start of a new year. We can see from this figure that the number of airline passengers keeps growing year-by-year in general, forming an upward trend. Besides, within each year the monthly numbers rise and fall in a similar fashion: low in the winter season and high in the summer season. Furthermore, the magnitude of the local oscillations grows with the upward trend, which is clear evidence of multiplicative seasonality.
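Before confirming this with PAL, a quick client-side check can already hint at multiplicative seasonality: if the within-year spread grows roughly in proportion to the yearly level, the seasonality is likely multiplicative. Below is a minimal sketch on the air_passengers pandas DataFrame loaded above.
yearly = air_passengers.groupby(air_passengers['Month'].dt.year)['#Passengers']
print(yearly.agg(['mean', 'std']))  # the yearly std grows roughly in step with the yearly mean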
Time-series Analysis with Seasonal Decomposition
The observations drawn from the run-sequence plot of the time-series data can be verified by seasonal decomposition. Seasonal decomposition is implemented in PAL by a procedure named SEASONALITY_TEST, and the procedure is wrapped by a function called seasonal_decompose() in hana_ml. In order to apply the seasonal decomposition function in hana_ml to the time-series data, we need to store the dataset in SAP HANA, which can be done with the following lines of code:
import hana_ml
from hana_ml.dataframe import create_dataframe_from_pandas, ConnectionContext
cc = ConnectionContext('xxxxxxx.xxx.xxx.xxx', 30x15, 'XXXXXX', 'XXXxxx')#account info hidden away
ap_name = 'AIR_PASSENGERS_TBL'
ap_df = create_dataframe_from_pandas(cc, pandas_df = air_passengers,
table_name = ap_name, force = True)
Now ‘AIR_PASSENGERS_TBL’ is a concrete table that holds the time-series data in SAP HANA, and ap_df is the corresponding hana_ml DataFrame that is associated with this table in the Python client. The data stored in SAP HANA can also be collected to the Python client, illustrated as follows:
ap_df.collect()
 | Month | #Passengers |
---|---|---|
0 | 1949-01-01 | 112 |
1 | 1949-02-01 | 118 |
2 | 1949-03-01 | 132 |
3 | 1949-04-01 | 129 |
4 | 1949-05-01 | 121 |
… | … | … |
139 | 1960-08-01 | 606 |
140 | 1960-09-01 | 508 |
141 | 1960-10-01 | 461 |
142 | 1960-11-01 | 390 |
143 | 1960-12-01 | 432 |
144 rows × 2 columns
Column data types of the table can be fetched as follows:
ap_df.dtypes()
[('Month', 'TIMESTAMP', 27, 27, 27, 0), ('#Passengers', 'INT', 10, 10, 10, 0)]
The seasonal decomposition procedure in PAL requires the input time-series data to have a timestamp column of integer type, which is inconsistent with the given hana_ml DataFrame, so we need to add an additional integer column representing the order of the time-series data. This can be realized by calling the add_id() function of the hana_ml DataFrame, illustrated as follows:
ap_df_wid = ap_df.add_id(id_col="ID", ref_col='Month') # generate the integer ID column based on the order of Month
ap_df_wid.collect()
 | ID | Month | #Passengers |
---|---|---|---|
0 | 1 | 1949-01-01 | 112 |
1 | 2 | 1949-02-01 | 118 |
2 | 3 | 1949-03-01 | 132 |
3 | 4 | 1949-04-01 | 129 |
4 | 5 | 1949-05-01 | 121 |
… | … | … | … |
139 | 140 | 1960-08-01 | 606 |
140 | 141 | 1960-09-01 | 508 |
141 | 142 | 1960-10-01 | 461 |
142 | 143 | 1960-11-01 | 390 |
143 | 144 | 1960-12-01 | 432 |
144 rows × 3 columns
Now seasonal decomposition can be applied to the data with the added integer timestamp (i.e. ID) column.
from hana_ml.algorithms.pal.tsa.seasonal_decompose import seasonal_decompose
stats, decomposed = seasonal_decompose(data = ap_df_wid, key = 'ID',
endog = '#Passengers')
stats.collect()
 | STAT_NAME | STAT_VALUE |
---|---|---|
0 | type | multiplicative |
1 | period | 12 |
2 | acf | 0.88758 |
So a strong multiplicative yearly seasonality is detected, which is consistent with our observations drawn from the run-sequence plot of the time-series.
The corresponding decomposition result can be visualized as follows:
decompose_data = decomposed.collect()
import matplotlib.pyplot as plt
figure, axe = plt.subplots(3)
axe[0].plot(decompose_data.TREND, 'r-.')
axe[0].set_title("Trend Component")
axe[1].plot(decompose_data.SEASONAL, 'b-.')
axe[1].set_title("Seasonal Component")
axe[2].plot(decompose_data.RANDOM, 'k-.')
axe[2].set_title("Residual")
plt.show()
The multiplicative residuals are all close to 1, implying that the multiplication of the trend and seasonal components gives a time-series nearly the same as the original one, so seasonal decomposition explains the given time-series data well.
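This claim can be double-checked on the client side. Below is a minimal sketch, reusing the decompose_data pandas DataFrame collected above and the original air_passengers data loaded at the beginning, and assuming the rows of decompose_data are in time order (as in the plot above).
reconstructed = decompose_data['TREND'] * decompose_data['SEASONAL'] * decompose_data['RANDOM']
print((reconstructed - air_passengers['#Passengers']).abs().max())  # expected to be very small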
Data Partition
Lastly in this section, we split the time-series into two parts: the first part, from 1949 to 1959, is used for model training; the second part, the data of 1960, is used for testing. This prepares us for time-series modeling and forecasting in the subsequent sections.
ap_df_train = cc.sql("select * from ({}) where \"Month\" < '1960-01-01'".format(ap_df_wid.select_statement))
ap_df_test = cc.sql("select * from ({}) where \"Month\" >= '1960-01-01'".format(ap_df_wid.select_statement))
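Equivalently, the split can be expressed with the filter() method of the hana_ml DataFrame instead of hand-written SQL; the following sketch produces the same result sets as the statements above.
ap_df_train = ap_df_wid.filter('"Month" < \'1960-01-01\'')   # training part: 1949-1959
ap_df_test = ap_df_wid.filter('"Month" >= \'1960-01-01\'')   # test part: 1960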
We may check the training part of this dataset as follows:
ap_df_train.collect()
 | ID | Month | #Passengers |
---|---|---|---|
0 | 1 | 1949-01-01 | 112 |
1 | 2 | 1949-02-01 | 118 |
2 | 3 | 1949-03-01 | 132 |
3 | 4 | 1949-04-01 | 129 |
4 | 5 | 1949-05-01 | 121 |
… | … | … | … |
127 | 128 | 1959-08-01 | 559 |
128 | 129 | 1959-09-01 | 463 |
129 | 130 | 1959-10-01 | 407 |
130 | 131 | 1959-11-01 | 362 |
131 | 132 | 1959-12-01 | 405 |
132 rows × 3 columns
Time-series Modeling and Analysis with Exponential Smoothing
Based on our previous analysis, the AirPassengers dataset is a time-series that contains yearly seasonality and a non-flat trend. One model that can handle such a time-series is triple exponential smoothing. In what follows, auto exponential smoothing is adopted to help us determine the optimal parameters of the exponential smoothing model.
from hana_ml.algorithms.pal.tsa.exponential_smoothing import AutoExponentialSmoothing
auto_eps = AutoExponentialSmoothing(model_selection = True,
forecast_model_name = "TESM",
seasonal='multiplicative',
optimizer_time_budget=10,
max_iter=500,
forecast_num=12)
Then, optimal parameters for the triple exponential smoothing model can be determined by feeding it with the training data, illustrated as follows:
auto_eps.fit_predict(data = ap_df_train, key = "ID", endog = '#Passengers')
Optimal parameters for the trained triple exponential smoothing model are contained in the stats_ attribute of auto_eps, and the content of this attribute is a hana_ml DataFrame, so it can be collected to the Python client as follows:
auto_eps.stats_.collect()
 | STAT_NAME | STAT_VALUE |
---|---|---|
0 | FORECAST_MODEL_NAME | TESM |
1 | MSE | 112.15447756481977 |
2 | NUMBER_OF_ITERATIONS | 587 |
3 | SA_NUMBER_OF_ITERATIONS | 500 |
4 | NM_NUMBER_OF_ITERATIONS | 87 |
5 | NM_EXECUTION_TIME | 0.000515 |
6 | SA_STOP_COND | MAX_ITERATION |
7 | NM_STOP_COND | ERROR_DIFFERENCE |
8 | ALPHA | 0.30708710037019393 |
9 | BETA | 0.03405392634387122 |
10 | GAMMA | 0.9694460594869154 |
11 | CYCLE | 12 |
12 | SEASONAL | Multiplicative |
13 | DAMPED | false |
So auto exponential smoothing also finds that the period of seasonality for the given time-series is 12 (see the CYCLE value in the above stats table), which is consistent with the result of seasonal decomposition.
The fitted and forecasted values are contained in the forecast_ attribute of auto_eps. Since we assigned forecast_num the value of 12 when initializing the auto exponential smoothing class, the last 12 records in the forecast_ attribute correspond to the forecasted values based on the trained triple exponential smoothing model. We can examine those values by collecting them to the Python client, illustrated as follows:
auto_eps.forecast_.select(["TIMESTAMP", "VALUE"]).collect().tail(12)
 | TIMESTAMP | VALUE |
---|---|---|
120 | 133 | 415.618831 |
121 | 134 | 392.725246 |
122 | 135 | 461.003599 |
123 | 136 | 446.982653 |
124 | 137 | 470.396336 |
125 | 138 | 537.062048 |
126 | 139 | 622.435966 |
127 | 140 | 632.745177 |
128 | 141 | 519.081970 |
129 | 142 | 454.389067 |
130 | 143 | 399.317911 |
131 | 144 | 440.222018 |
We can join the predicted values with the ground truth for better comparison.
merged = ap_df_test.join(auto_eps.forecast_.select(['TIMESTAMP', 'VALUE']),
'TIMESTAMP = ID')
merged.collect()
 | ID | Month | #Passengers | TIMESTAMP | VALUE |
---|---|---|---|---|---|
0 | 133 | 1960-01-01 | 417 | 133 | 415.618831 |
1 | 134 | 1960-02-01 | 391 | 134 | 392.725246 |
2 | 135 | 1960-03-01 | 419 | 135 | 461.003599 |
3 | 136 | 1960-04-01 | 461 | 136 | 446.982653 |
4 | 137 | 1960-05-01 | 472 | 137 | 470.396336 |
5 | 138 | 1960-06-01 | 535 | 138 | 537.062048 |
6 | 139 | 1960-07-01 | 622 | 139 | 622.435966 |
7 | 140 | 1960-08-01 | 606 | 140 | 632.745177 |
8 | 141 | 1960-09-01 | 508 | 141 | 519.081970 |
9 | 142 | 1960-10-01 | 461 | 142 | 454.389067 |
10 | 143 | 1960-11-01 | 390 | 143 | 399.317911 |
11 | 144 | 1960-12-01 | 432 | 144 | 440.222018 |
We see that the predicted values are very close to the ground truth. We can justify this observation quantitatively using accuracy measures. In the following, we apply three accuracy measures to the prediction result: root-mean-square error (RMSE), mean absolute percentage error (MAPE) and mean absolute deviation (MAD).
from hana_ml.algorithms.pal.tsa.accuracy_measure import accuracy_measure
res = accuracy_measure(data = merged.select(['#Passengers', 'VALUE']),
evaluation_metric = ['rmse', 'mape', 'mad'])
res.collect()
 | STAT_NAME | STAT_VALUE |
---|---|---|
0 | MAD | 10.433921 |
1 | MAPE | 0.022462 |
2 | RMSE | 15.834904 |
We see that the two accuracy measures MAD and RMSE are at least one order of magnitude smaller than the magnitude of the original data, so the forecast accuracy is not bad. The small percentage error value (i.e. MAPE) is also consistent with this justification since it is scale-invariant.
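For a visual complement to these numbers, the forecasts and the ground truth can be drawn together. The following is a small sketch reusing the merged hana_ml DataFrame built above; the column names #Passengers and VALUE are those shown in the joined table.
import matplotlib.pyplot as plt
merged_pd = merged.collect()  # bring the joined forecasts and actuals to the client
plt.plot(merged_pd['ID'], merged_pd['#Passengers'], 'k-o', label='actual (1960)')
plt.plot(merged_pd['ID'], merged_pd['VALUE'], 'r-.', label='TESM forecast')
plt.legend()
plt.show()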
Time-series Modeling and Analysis with ARIMA
Another model that can handle time-series with trend and seasonality is ARIMA, where differencing and seasonal differencing are first applied to make the entire time-series stationary, followed by auto-regressive and moving-average (i.e. ARMA) modeling of the differenced (stationarized) time-series. The entire process of ARIMA modeling can be carried out automatically by AutoARIMA.
from hana_ml.algorithms.pal.tsa.auto_arima import AutoARIMA
auto_arima = AutoARIMA()
auto_arima.fit(data = ap_df_train, key = 'ID',
endog = '#Passengers')
Then, the optimal parameters of the ARIMA model learned from the training data can be accessed from the model_ attribute of auto_arima, illustrated as follows:
auto_arima.model_.collect()
 | KEY | VALUE |
---|---|---|
0 | p | 1 |
1 | AR | -0.571616 |
2 | d | 1 |
3 | q | 1 |
4 | MA | 0.394036 |
5 | s | 12 |
6 | P | 1 |
7 | SAR | 0.960813 |
8 | D | 0 |
9 | Q | 0 |
10 | SMA | |
11 | sigma^2 | 106.632 |
12 | log-likelihood | -507.15 |
13 | AIC | 1022.3 |
14 | AICc | 1022.61 |
15 | BIC | 1033.8 |
16 | dy(n-p:n-1)_aux | |
17 | dy_aux | 0 |
18 | dy_0 | 6;14;-3;-8;14;13;0;-12;-17;-15;14;-3;11;15;-6;… |
19 | x(n-d:n-1)_aux | |
20 | y(n-d:n-1)_aux | 0 |
21 | y(n-d:n-1)_0 | 405 |
22 | epsilon(n-q:n-1)_aux | 0 |
23 | epsilon(n-q:n-1)_0 | 18.5076 |
This results in an ARIMA(1,1,1)(1,0,0)12 model (seasonal period 12). However, this model is unlikely to be effective enough for modeling the given time-series since it only applies a single non-seasonal difference. We have already observed that the magnitude of the seasonal oscillations of this time-series grows proportionally with its upward trend, so after a single difference is applied, the differenced values are larger in magnitude in the later phase than in the early phase, which violates the stationarity assumption of ARMA modeling. This can be justified by the forecast accuracy measures of the trained ARIMA model, illustrated as follows:
forecast_res = auto_arima.predict(forecast_method='formula_forecast',
forecast_length=12)
fc_res_reduced = cc.sql('SELECT TIMESTAMP, FORECAST from ({})'.format(forecast_res.select_statement))
arima_merged = ap_df_test.join(fc_res_reduced,
'TIMESTAMP + {} = ID'.format(ap_df_test.collect()['ID'][0]))
arima_merged.collect()
 | ID | Month | #Passengers | TIMESTAMP | FORECAST |
---|---|---|---|---|---|
0 | 133 | 1960-01-01 | 417 | 0 | 424.640706 |
1 | 134 | 1960-02-01 | 391 | 1 | 408.751100 |
2 | 135 | 1960-03-01 | 419 | 2 | 469.439996 |
3 | 136 | 1960-04-01 | 461 | 3 | 460.290951 |
4 | 137 | 1960-05-01 | 472 | 4 | 483.088042 |
5 | 138 | 1960-06-01 | 535 | 5 | 533.200322 |
6 | 139 | 1960-07-01 | 622 | 6 | 606.136366 |
7 | 140 | 1960-08-01 | 606 | 7 | 616.754322 |
8 | 141 | 1960-09-01 | 508 | 8 | 524.488257 |
9 | 142 | 1960-10-01 | 461 | 9 | 470.698744 |
10 | 143 | 1960-11-01 | 390 | 10 | 427.453005 |
11 | 144 | 1960-12-01 | 432 | 11 | 468.773196 |
from hana_ml.algorithms.pal.tsa.accuracy_measure import accuracy_measure
acc_measures = accuracy_measure(data = arima_merged.select(['#Passengers', 'FORECAST']),
evaluation_metric = ['rmse', 'mape', 'mad'])
acc_measures.collect()
 | STAT_NAME | STAT_VALUE |
---|---|---|
0 | MAD | 18.038311 |
1 | MAPE | 0.040867 |
2 | RMSE | 23.332016 |
So all three chosen accuracy measures are worse than those produced by exponential smoothing modeling.
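To see why a single non-seasonal difference struggles here, one can look at the differenced series directly. Below is a quick client-side sketch using the air_passengers pandas data loaded at the beginning; the oscillations of the differenced series still grow over time, so it is not stationary in variance.
import matplotlib.pyplot as plt
diff1 = air_passengers['#Passengers'].diff()  # first (non-seasonal) difference
plt.plot(air_passengers['Month'], diff1)
plt.title('First difference of #Passengers')  # the magnitude of the swings keeps increasing
plt.show()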
From our previous analysis, we strongly suspect that the multiplicative seasonality of the time-series data is the main reason that makes ARIMA modeling difficult. This difficulty could possibly be resolved by taking the logarithm of the time-series, which transforms multiplicative seasonality into additive seasonality.
ap_df_train_log = cc.sql('SELECT "ID", "Month", LN("#Passengers") FROM ({})'.format(ap_df_train.select_statement)) # LN is logarithm transformation
ap_df_train_log.collect()
 | ID | Month | LN(#Passengers) |
---|---|---|---|
0 | 1 | 1949-01-01 | 4.718499 |
1 | 2 | 1949-02-01 | 4.770685 |
2 | 3 | 1949-03-01 | 4.882802 |
3 | 4 | 1949-04-01 | 4.859812 |
4 | 5 | 1949-05-01 | 4.795791 |
… | … | … | … |
127 | 128 | 1959-08-01 | 6.326149 |
128 | 129 | 1959-09-01 | 6.137727 |
129 | 130 | 1959-10-01 | 6.008813 |
130 | 131 | 1959-11-01 | 5.891644 |
131 | 132 | 1959-12-01 | 6.003887 |
132 rows × 3 columns
The result of the seasonality test indicates that the (logarithmically) transformed time-series has additive yearly seasonality, illustrated as follows:
from hana_ml.algorithms.pal.tsa.seasonal_decompose import seasonal_decompose
stats, _ = seasonal_decompose(data = ap_df_train_log,
key = 'ID',
endog = 'LN(#Passengers)')
stats.collect()
 | STAT_NAME | STAT_VALUE |
---|---|---|
0 | type | additive |
1 | period | 12 |
2 | acf | 0.877358 |
We can re-fit the ARIMA model by using the transformed training data.
auto_arima.fit(data = ap_df_train_log, key = 'ID', endog = 'LN(#Passengers)')
auto_arima.model_.collect()
 | KEY | VALUE |
---|---|---|
0 | p | 1 |
1 | AR | -0.306761 |
2 | d | 1 |
3 | q | 0 |
4 | MA | |
5 | s | 12 |
6 | P | 2 |
7 | SAR | 0.531414;0.43327 |
8 | D | 0 |
9 | Q | 0 |
10 | SMA | |
11 | sigma^2 | 0.00140498 |
12 | log-likelihood | 229.076 |
13 | AIC | -450.152 |
14 | AICc | -449.835 |
15 | BIC | -438.651 |
16 | dy(n-p:n-1)_aux | |
17 | dy_aux | 0 |
18 | dy_0 | 0.0521857;0.112117;-0.0229895;-0.0640218;0.109… |
19 | x(n-d:n-1)_aux | |
20 | y(n-d:n-1)_aux | 0 |
21 | y(n-d:n-1)_0 | 6.00388 |
22 | epsilon(n-q:n-1)_aux |
The re-fitting process results in an ARIMA(1,1,0)(2,0,0)12 model. Now we make a 12-step forecast using the trained model, and compare the (exponentiated) forecasted values with the ground truth.
forecast_res2 = auto_arima.predict(forecast_method='formula_forecast',
forecast_length=12)
fc_res2_reduced = cc.sql('select TIMESTAMP, EXP(FORECAST) AS FORECAST from ({})'.format(forecast_res2.select_statement))
arima_merged2 = ap_df_test.join(fc_res2_reduced,
'TIMESTAMP + {} = ID'.format(ap_df_test.collect()['ID'][0]))
arima_merged2.collect()
 | ID | Month | #Passengers | TIMESTAMP | FORECAST |
---|---|---|---|---|---|
0 | 133 | 1960-01-01 | 417 | 0 | 418.275306 |
1 | 134 | 1960-02-01 | 391 | 1 | 396.366024 |
2 | 135 | 1960-03-01 | 419 | 2 | 458.930102 |
3 | 136 | 1960-04-01 | 461 | 3 | 445.316686 |
4 | 137 | 1960-05-01 | 472 | 4 | 467.906297 |
5 | 138 | 1960-06-01 | 535 | 5 | 538.461284 |
6 | 139 | 1960-07-01 | 622 | 6 | 614.319460 |
7 | 140 | 1960-08-01 | 606 | 7 | 628.451952 |
8 | 141 | 1960-09-01 | 508 | 8 | 516.175951 |
9 | 142 | 1960-10-01 | 461 | 9 | 457.957321 |
10 | 143 | 1960-11-01 | 390 | 10 | 403.803691 |
11 | 144 | 1960-12-01 | 432 | 11 | 444.414850 |
So the forecast values (after exponential transformation) are close to the ground truth, which indicates that the ARIMA model trained on the transformed training data explains the time-series data very well. This can also be justified by forecast accuracy measures, illustrated as follows:
acc_measures2 = accuracy_measure(data = arima_merged2.select(['#Passengers', 'FORECAST']),
evaluation_metric = ['rmse', 'mape', 'mad'])
acc_measures2.collect()
 | STAT_NAME | STAT_VALUE |
---|---|---|
0 | MAD | 11.448283 |
1 | MAPE | 0.024789 |
2 | RMSE | 15.501060 |
The result is very close to that of (auto) exponential smoothing, so logarithm transformation plus ARIMA modeling results in a good explanation of the time-series data.
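To wrap up the comparison, the two sets of forecasts can be plotted against the 1960 ground truth. The sketch below reuses the merged and arima_merged2 hana_ml DataFrames created earlier.
import matplotlib.pyplot as plt
tesm_pd = merged.collect()           # TESM forecasts joined with actuals
arima_pd = arima_merged2.collect()   # log-ARIMA forecasts joined with actuals
plt.plot(tesm_pd['ID'], tesm_pd['#Passengers'], 'k-o', label='actual (1960)')
plt.plot(tesm_pd['ID'], tesm_pd['VALUE'], 'r-.', label='triple exponential smoothing')
plt.plot(arima_pd['ID'], arima_pd['FORECAST'], 'b--', label='ARIMA on log-transformed data')
plt.legend()
plt.show()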
Summary
In this blog post, we have shown readers how to apply procedures in SAP HANA Predictive Analysis Library (PAL) to do time-series modeling and analysis through the Python machine learning client for SAP HANA (hana_ml). Some useful take-away messages are:
- Drawing the run-sequence plot for the time-series to get a general comprehension of it
- Applying seasonal decomposition to the time-series to get its type of seasonality and seasonal period
- Choosing appropriate methods for modeling the time-series based on the characteristics drawn from its run-sequence plot as well as seasonal decomposition result.
- Before training the model, some model parameters could be specified if they are easily obtainable, and an appropriate transformation (like logarithm transformation) could be applied to the training data if necessary
- Evaluating the quality of the trained model using forecast accuracy measures
Appendix
Level, Trend, Seasonality, and Noise
- Level: the average value of the time-series
- Trend: the general shape of the time-series, where any local/short-term fluctuations are ignored (smoothed/averaged out)
- Seasonality: fluctuation patterns that appear regularly with a fixed period in the time-series
- Noise: irregular and unexplained variations in the time-series
Type of Seasonality
- Seasonality can be divided into different categories based on the length of its period, e.g. daily, weekly, monthly, quarterly and yearly seasonality.
A time-series can have multiple seasonalities with different periods; take the hourly electricity consumption data of a city located in a temperate zone for example:
- Electricity consumption is high in day time, low in night hours – this is daily seasonality.
- Electricity consumption is usually high in work days, and low in weekends – this is weekly seasonality.
- Electricity consumption is usually relatively high in summer and winter seasons (especially in hot summers), and low in spring and autumn – this is typical yearly seasonality.
- Seasonality can also be categorized by its correlation with the trend:
- Additive Seasonality: The magnitude of seasonal fluctuations is stable and does not change with the increasing/decreasing of local levels.
- Multiplicative Seasonality: The magnitude of seasonal fluctuations changes proportionally with the increasing/decreasing of local levels.
Again taking the quarterly electricity consumption as an example, assume that total consumption is higher in the summer season than in the spring season; then this is a typical case of yearly seasonality (repeated once every year). If the electricity consumption is always 1 million units higher in summer than in spring, then the seasonality is additive; if the aggregated electricity consumption is always 10% higher in summer than in spring, then the seasonality is multiplicative.
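The difference can also be illustrated with a few made-up numbers. The sketch below builds a small synthetic series with a rising level and applies the same seasonal pattern once additively and once multiplicatively; only in the multiplicative case do the fluctuations grow with the level.
import numpy as np
level = np.arange(1, 13) * 10.0                 # rising local level over 12 quarters
pattern = np.tile([0.0, 5.0, -2.0, -3.0], 3)    # one seasonal cycle every 4 quarters
additive = level + pattern                      # fluctuation size stays constant
multiplicative = level * (1 + pattern / 10.0)   # fluctuation size grows with the level
print(additive)
print(multiplicative)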
Stationarity
Another important concept for the analysis of time-series is stationarity. Generally speaking, a given time-series is called stationary if its statistical properties, like mean, variance and autocorrelation, do not change over time. In this sense, a time-series with a non-flat trend or seasonality is obviously non-stationary, since trend and seasonality affect the values of the time-series at different times.
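As a rough, informal check of (non-)stationarity, one can compare simple statistics of the early and late parts of a series; for the AirPassengers data they differ substantially. A minimal sketch on the pandas data loaded at the beginning:
first_half = air_passengers['#Passengers'][:72]    # 1949-1954
second_half = air_passengers['#Passengers'][72:]   # 1955-1960
print(first_half.mean(), first_half.var())    # both mean and variance are
print(second_half.mean(), second_half.var())  # much larger in the second half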
Hi,
thanks for the valuable blog!
We were trying to rebuild this on our own dataset and it works quite well. However, we might want to include exogenous variables in the forecast. When doing this, e.g.:
from hana_ml.algorithms.pal.tsa.arima import ARIMA
ari = ARIMA(order=(0, 2, 1), method='mle', thread_ratio=1.0)
ari.fit(data = df_train, key = 'ID', endog = 'VAR1' , exog = ['VAR2'])
we get totally out-of-scope forecast values.
What might possibly be wrong in the model parameters?
We also tried with seasonality etc., but with no success. The forecast values always tend to show erratic (very high) values.
Any hint is appreciated.
Regards
Marcus
Hi,
Getting erratic forecast values means that the ARIMA process drawn from the input time-series is non-stationary. This usually happens because a wrong `order` parameter is specified for ARIMA modeling. So if your dataset is flawless (numerically), you may try the following resolutions:
1. Use AutoARIMA instead of ARIMA to determine the optimal `order` parameter from the input time-series data (see the sketch after this list)
2. If you do not want to use AutoARIMA, consider reducing the differencing order (in your ARIMA modeling 2 is used, so use an order less than 2) and rebuild the model
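A minimal sketch of resolution 1, assuming your training DataFrame has the columns ID, VAR1 and VAR2 from your snippet:
from hana_ml.algorithms.pal.tsa.auto_arima import AutoARIMA
auto = AutoARIMA()  # let AutoARIMA search for suitable orders instead of fixing order=(0, 2, 1)
auto.fit(data=df_train, key='ID', endog='VAR1', exog=['VAR2'])
print(auto.model_.collect())  # inspect the orders AutoARIMA selected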
Thanks for the reply !
In the meantime I was able to fix it: in the train dataframe, all columns other than the exogenous ones need to be deleted. Then it works and gives good results.
Thanks again
Regards
Marcus