Skip to Content
Technical Articles
Author's profile photo Likun Hou

Additive Model Time-series Analysis using Python Machine Learning Client for SAP HANA

In a few related blog posts[1][2], we have shown the analysis of time-series using traditional approaches like seasonal decomposition, ARIMA and exponential smoothing. However, those traditional approaches have some drawbacks for time-series analysis, typically ineffective in dealing time-series with complicated trend, multiple seasonality as well as cyclic patterns(repeated with non-fixed period, like holidays, event days, etc). The aforementioned drawbacks can be largely overcomed by a relatively new method called additive model time-series analysis(additive model analysis is often used for short in later context without confusion). This method is provided in the SAP HANA Predictive Analysis Library(PAL), and wrapped up in the Python machine learning client for SAP HANA(hana_ml).

In this blog post, we show readers how to apply additive model analysis to analyze time-series with specific case studies.

The rest of this blog post is organized as follows: firstly we give a brief introduction to additive model analysis, then we go to two specific case studies, inclusive of data description, modeling and forecasting, and finally we summarize the main content of this blog post.

 

Introduction

Additive model analysis is a newly emerged approach for time-series modeling. Unlike traditional approaches(like ARIMA and exponential smoothing) that explore time-based dependencies among observations, it treats time-series modeling as a curve-fitting problem, and uses an additive model to fit/forecast time-series data. Under this setting, the given time-series would be decomposed into four components: trend, seasonality, cyclic patterns, and a random component. The formula is as follows:

                                   𝑦(𝑡)=𝑔(𝑡)+𝑠(𝑡)+(𝑡)+ϵ(𝑡).

It can handle data with complicated trend(using piecewise linear approximation), multiple seasonality(daily, weekly, yearly and beyond, event if period being non-integer) with strong seasonal effects. Besides, the modeling of cyclic patterns also helps to explain variations with non-fixing period. As a consequence, additive model time-series analysis often has better performance when comparing with other classical methods for time-series modeling/analysis.

 

Case Study I : Additive Model Analysis for Electricity System Demand in Singapore

Dataset Description

The first dataset of our interest in this blog post is the historical electricity system demand for every half-hour period(MW) in Singapore, which is available in Singapore’s open data portal[3]. The original data is stored in a .csv file[4], and it has been preprocessed and stored in a table with name ‘SG_SYSTEM_DEMAND_DATA_TBL’ in SAP HANA. Then, the dataset can be accessed via a ConnectionContext to SAP HANA under hana_ml.

import hana_ml
from hana_ml.dataframe import ConnectionContext
cc = ConnectionContext('lsvxc0103.sjc.sap.corp', 30315, 'PAL_USER', 'Welcome1')
df = cc.table('SG_SYSTEM_DEMAND_DATA_TBL')
data = df.sort('datetime').collect()
data
datetime system_demand_actual
0 2012-02-06 00:30:00 4526.68
1 2012-02-06 01:00:00 4436.09
2 2012-02-06 01:30:00 4361.24
3 2012-02-06 02:00:00 4295.23
4 2012-02-06 02:30:00 4233.17
75931 2016-06-05 22:00:00 5862.00
75932 2016-06-05 22:30:00 5756.00
75933 2016-06-05 23:00:00 5674.00
75934 2016-06-05 23:30:00 5566.00
75935 2016-06-06 00:00:00 5446.00

75936 rows × 2 columns

 

In the following context we draw the run-sequence plot of the full time-series.

import matplotlib.pyplot as plt
plt.figure(figsize=(16, 8))
plt.plot(data['datetime'], data['system_demand_actual'])
plt.title('Figure 1. Overview of the dataset')
plt.show()
plt.close()

A coarse examination of the plot indicates that the data have approximately linear trend and yearly seasonality. Now let us zoom in to see if we can have other findings. In the following we take out data of the first 5 weeks for detailed examination.

df_5wk = cc.sql('SELECT * FROM ({}) WHERE "datetime" <= \'2012-03-12 00:00:00\''.format(df.select_statement))
data_5wk = df_5wk.sort('datetime').collect()

Now the 5-week data has been collected to the python client, and we can draw the run-sequence plot in python.

plt.figure(figsize=(16, 4))
plt.plot(data_5wk['datetime'], data_5wk['system_demand_actual'])
plt.title('Figure 2: Zoom in view of data of 1st 5 weeks')
plt.show()
plt.close()

The plot shows that the data also have strong weekly seasonality, and(weak) daily seasonality caused by the simlarity of demand pattern between weekdays(the demand pattern in weekends are also simliar, yet with smaller magnitude).

In summary, the given time-series for analysis contains multiple seasonality, inclusive of yearly seasonality(with non-integer period 365.25). All these characteristics make the analysis of this time-series difficult when using traditional approaches like ARIMA, exponential smoothing as well as seasonal decompose.

Modeling and Forecast using Additive Model Analysis

Additive models analysis is a new method that treats time-series modeling as a curve-fitting problem with respect to time. In contrast, exponential smoothing and ARIMA try model the dependencies of the current data with the past(inclusive of expected values and errors). Besides, additive model analysis use (partial) Fourier series to model periodic patterns in the input data. By doing so, the problem of non-integer periods is easily resolved, and the problem of multiple periods also become non-critical. Moreover, additive model analysis also allows users to input events(usually cyclical, called holidays in the Python API) that could potentially affect the the shape of the curve for given time-series during a short period of time and break up the regular seasonal pattern. This often leads to better modeling for time-series.

Now let us apply additive model analysis to our dataset of electricity system demand. As a standard procedure in machine learning, we firstly split the dataset into two parts for training and testing purposes, respectively.

df_train = cc.sql('SELECT * FROM ({}) WHERE "datetime" <= \'2016-02-18 00:00:00\''.format(df.select_statement))
df_train.sort('datetime').collect()
datetime system_demand_actual
0 2012-02-06 00:30:00 4526.68
1 2012-02-06 01:00:00 4436.09
2 2012-02-06 01:30:00 4361.24
3 2012-02-06 02:00:00 4295.23
4 2012-02-06 02:30:00 4233.17
70699 2016-02-17 22:00:00 5903.00
70700 2016-02-17 22:30:00 5745.00
70701 2016-02-17 23:00:00 5597.00
70702 2016-02-17 23:30:00 5441.00
70703 2016-02-18 00:00:00 5321.00

70704 rows × 2 columns

 

Now let us use the training set to fit a model for additive model analysis.

from hana_ml.algorithms.pal.tsa.additive_model_forecast import AdditiveModelForecast # import the python api for additive model analysis from hana_ml
amf = AdditiveModelForecast(seasonality_mode='additive', daily_seasonality='true', weekly_seasonality='true', yearly_seasonality='true')
amf.fit(df_train)

 

datetime system_demand_actual
0 2012-02-06 00:30:00 4526.68
1 2012-02-06 01:00:00 4436.09
2 2012-02-06 01:30:00 4361.24
3 2012-02-06 02:00:00 4295.23
4 2012-02-06 02:30:00 4233.17
70699 2016-02-17 22:00:00 5903.00
70700 2016-02-17 22:30:00 5745.00
70701 2016-02-17 23:00:00 5597.00
70702 2016-02-17 23:30:00 5441.00
70703 2016-02-18 00:00:00 5321.00

70704 rows × 2 columns

 

The model fitting procedure takes some time. After it is done, we may apply the trained additive model to the test data, in which case only time-stamps matter irrespective of the demand values.

pred_res = amf.predict(df_test)
pred_data = pred_res.sort('datetime').collect()
import matplotlib.pyplot as plt
plt.figure(figsize=(16,5))
plt.plot(data['datetime'][50000:], data['system_demand_actual'][50000:])# use partial data for better illustration of the prediction result
plt.plot(pred_data['datetime'], pred_data['YHAT'], color='red')
plt.legend(['Actual Demand', 'Predicted Demand'])
plt.title('Figure 3. Actual Demand VS Predicted Demand')
plt.show()
plt.close()

We can join the real demand values with predicted ones, and compute the prediction accuracy measures.

rename_pred = pred_res.rename_columns({'datetime': 'date_time'})
join_pred = df_test.join(rename_pred[['date_time', 'YHAT']], #YHAT is the column that stores the predicted values
                         '"datetime"="date_time"')
join_pred.collect()
datetime system_demand_actual date_time YHAT
0 2016-02-18 00:30:00 5164.0 2016-02-18 00:30:00 5218.050311
1 2016-02-18 01:00:00 5044.0 2016-02-18 01:00:00 5103.471702
2 2016-02-18 01:30:00 4943.0 2016-02-18 01:30:00 5010.291241
3 2016-02-18 02:00:00 4866.0 2016-02-18 02:00:00 4937.996240
4 2016-02-18 02:30:00 4812.0 2016-02-18 02:30:00 4882.707945
5227 2016-06-05 22:00:00 5862.0 2016-06-05 22:00:00 5777.180270
5228 2016-06-05 22:30:00 5756.0 2016-06-05 22:30:00 5688.704385
5229 2016-06-05 23:00:00 5674.0 2016-06-05 23:00:00 5577.197404
5230 2016-06-05 23:30:00 5566.0 2016-06-05 23:30:00 5452.238527
5231 2016-06-06 00:00:00 5446.0 2016-06-06 00:00:00 5325.501001

5232 rows × 4 columns

from hana_ml.algorithms.pal.tsa.accuracy_measure import accuracy_measure
res = accuracy_measure(data = join_pred[['system_demand_actual', 'YHAT']],
                       evaluation_metric = ['rmse', 'mad', 'mape'])
res.collect()
STAT_NAME STAT_VALUE
0 MAD 195.786364
1 MAPE 0.033426
2 RMSE 244.762496

MAD and RMSE are both small compared with the demand values in the dataset (~200 vs ~5000), consistent with the mean absolute percent error(MAPE) which is around 3.3%.

For the sake of comparsion, we apply AutoExponentialSmoothing on the same set of training data, and then evaluate the performance of the generated model on test data.

For saving computational time and improving model quality, we can pre-specify a few parameters when initialzing the AutoExponentialSmoothing instance: since the data contains both trend and seasonality components, triple exponential smoothing should be chosen to handle them; seasonal components do not vary with level, which suggests additive seasonality, so the seasonal parameter should be set to ‘additive’.

from hana_ml.algorithms.pal.tsa.exponential_smoothing import AutoExponentialSmoothing
aesm = AutoExponentialSmoothing(forecast_model_name='TESM',
                                forecast_num=df_test.count(),
                                seasonal='additive',
                                optimizer_random_seed=1)
df_train_wid = df_train.add_id('ID', 'datetime')
aesm.fit_predict(data=df_train_wid, key='ID',
                 endog='system_demand_actual')
aesm.forecast_.sort('TIMESTAMP').collect()
TIMESTAMP VALUE PI1_LOWER PI1_UPPER PI2_LOWER PI2_UPPER
0 1 4710.426403 NaN NaN NaN NaN
1 2 4420.592040 NaN NaN NaN NaN
2 3 4344.739633 NaN NaN NaN NaN
3 4 4294.175059 NaN NaN NaN NaN
4 5 4237.244328 NaN NaN NaN NaN
75931 75932 5611.215324 3185.398049 8037.032600 1901.247953 9321.182696
75932 75933 5505.265185 3079.215918 7931.314453 1794.943013 9215.587358
75933 75934 5402.156812 2975.875576 7828.438048 1691.479873 9112.833751
75934 75935 5295.253762 2868.740578 7721.766945 1584.222091 9006.285433
75935 75936 5185.558722 2758.813614 7612.303831 1474.172353 8896.945092

75936 rows × 6 columns

Again the predicted demand values and the actual ones can be joined for prediction accuracy measure computation.

df_wid = df.add_id('ID', 'datetime')
df_test_wid = cc.sql('SELECT * FROM ({}) WHERE ID >= {}'.format(df_wid.select_statement, df_train.count()))
aesm_join_res = df_test_wid.join(aesm.forecast_[['TIMESTAMP', 'VALUE']],##the VALUE column stores fitted & predicted demand values
                                 'ID = TIMESTAMP')
aesm_join_res.collect()

 

ID datetime system_demand_actual TIMESTAMP VALUE
0 70704 2016-02-18 00:00:00 5321.0 70704 5305.413617
1 70705 2016-02-18 00:30:00 5164.0 70705 5170.236697
2 70706 2016-02-18 01:00:00 5044.0 70706 5053.972515
3 70707 2016-02-18 01:30:00 4943.0 70707 4958.628810
4 70708 2016-02-18 02:00:00 4866.0 70708 4887.215553
5228 75932 2016-06-05 22:00:00 5862.0 75932 5611.215324
5229 75933 2016-06-05 22:30:00 5756.0 75933 5505.265185
5230 75934 2016-06-05 23:00:00 5674.0 75934 5402.156812
5231 75935 2016-06-05 23:30:00 5566.0 75935 5295.253762
5232 75936 2016-06-06 00:00:00 5446.0 75936 5185.558722

5233 rows × 5 columns

aesm_measure = accuracy_measure(data = aesm_join_res[['system_demand_actual', 'VALUE']], 
                                evaluation_metric = ['rmse', 'mad', 'mape'])
aesm_measure.collect()
STAT_NAME STAT_VALUE
0 MAD 262.475689
1 MAPE 0.044364
2 RMSE 300.193255

One can see from the values of selected accuracy measures that the performance of exponential smoothing is not as good as additive model analysis, which is consistent with our previous analysis on the dataset.

Adding Holidays for Improved Forecast

Up until now we haven’t talked about the holiday effect in additive model time-series analysis. It is seen from Figure 2 that, in weekends the system demands are usually smallers compared to weekdays, we natually suspect that in holidays the systems demand will also deviate from the regular pattern in weekdays. We verify this by checking the residue of the training data(i.e. differences between fitted and actual values) when additive model analysis is applied.

fit_res = amf.predict(df_train) #obtain the fitting result for the training data
fit_res.sort('datetime').collect()
datetime YHAT YHAT_LOWER YHAT_UPPER
0 2012-02-06 00:30:00 4542.912928 4225.863575 4856.542969
1 2012-02-06 01:00:00 4442.691457 4116.797874 4751.370439
2 2012-02-06 01:30:00 4364.044821 4051.549657 4685.978732
3 2012-02-06 02:00:00 4306.432861 3994.744755 4599.067855
4 2012-02-06 02:30:00 4265.949011 3936.221074 4602.731532
70699 2016-02-17 22:00:00 5857.695587 5552.015396 6172.163877
70700 2016-02-17 22:30:00 5755.662651 5469.078620 6074.457734
70701 2016-02-17 23:00:00 5630.292965 5292.285728 5938.692115
70702 2016-02-17 23:30:00 5491.190423 5176.592394 5808.430988
70703 2016-02-18 00:00:00 5350.053660 5030.360951 5667.936164

70704 rows × 4 columns

We join the fitted values with actual ones for computing their differences.

fit_res_rename = fit_res.rename_columns({'datetime':'date_time'})
fit_join = df_train.join(fit_res_rename, '"datetime" = "date_time"')
fit_join.collect()
datetime system_demand_actual date_time YHAT YHAT_LOWER YHAT_UPPER
0 2012-02-06 00:30:00 4526.68 2012-02-06 00:30:00 4542.912928 4225.863575 4856.542969
1 2012-02-06 01:00:00 4436.09 2012-02-06 01:00:00 4442.691457 4116.797874 4751.370439
2 2012-02-06 01:30:00 4361.24 2012-02-06 01:30:00 4364.044821 4051.549657 4685.978732
3 2012-02-06 02:00:00 4295.23 2012-02-06 02:00:00 4306.432861 3994.744755 4599.067855
4 2012-02-06 02:30:00 4233.17 2012-02-06 02:30:00 4265.949011 3936.221074 4602.731532
70699 2016-02-17 22:00:00 5903.00 2016-02-17 22:00:00 5857.695587 5552.015396 6172.163877
70700 2016-02-17 22:30:00 5745.00 2016-02-17 22:30:00 5755.662651 5469.078620 6074.457734
70701 2016-02-17 23:00:00 5597.00 2016-02-17 23:00:00 5630.292965 5292.285728 5938.692115
70702 2016-02-17 23:30:00 5441.00 2016-02-17 23:30:00 5491.190423 5176.592394 5808.430988
70703 2016-02-18 00:00:00 5321.00 2016-02-18 00:00:00 5350.053660 5030.360951 5667.936164

70704 rows × 6 columns

residue = cc.sql('SELECT "datetime", YHAT - "system_demand_actual" AS RESID FROM ({})'.format(fit_join.select_statement))
import matplotlib.pyplot as plt
plt.figure(figsize=(16,5))
plt.plot(data['datetime'][0:df_train.count()], residue.collect()['RESID'])
plt.title('Figure 4. Overview of the random component of training data')
plt.show()
plt.close()

One can see that there are many(mainly upward w.r.t. zero value) large spiky values in the residual, which indicates that in certain days the actual demand values would fall below the predicted values by a large amount. For further examination, we use the deviation of the residual value from 0 as the measurement of how abnormality an data point is, then the date information of the top 100 result in the dataset is extracted.

abnormal = cc.sql('SELECT * FROM ({}) ORDER BY RESID DESC LIMIT 100'.format(residue.select_statement))
abnormal_dates = abnormal[['datetime']].cast('datetime', 'DATE').drop_duplicates()
abnormal_dates.sort('datetime').collect()
datetime
0 2013-02-11
1 2013-02-12
2 2014-01-31
3 2014-10-22
4 2015-02-19
5 2015-02-20
6 2016-02-08
7 2016-02-09

 

The dates listed above are all public holidays in Singapore(2014-10-22 is Deepavali Day,  others are for Lunar New Year), which indicates that system demand will fall below the regular level if the day for inspection happens to be a public holiday. This justifies our previous suspection.

However, since major public holidays of a year usually shall be announced even before the year actuall starts, it is okay for us to incorporate them in the model building and model prediction phase. Additive model analysis provides us this facility, so we can collect the set of holidays(exclusive of weekends) within the dates range of the input data and store it in SAP HANA.

holidays = cc.table('SG_HOLIDAY_TBL').cast('Date', 'DATE')#The Date column in holiday table is of type 'TIMESTAMP', convert to 'DATE'
holidays.head(5).collect()
Date Name
0 2012-01-02 NY
1 2012-01-23 CNY
2 2012-01-24 CNY
3 2012-04-06 GF
4 2012-05-01 LD

where NY = New Year, CNY = Chinese New Year, GF = Good Friday, LD = Labor Day.

Now let’s re-fit the additive time-series model with the introduction of holidays.

amf.fit(data = df_train, holiday=holidays)

We re-plot the residue component.

fit_res = amf.predict(df_train)
fit_res_rename = fit_res.rename_columns({'datetime':'date_time'})
fit_join = df_train.join(fit_res_rename, '"datetime" = "date_time"')
residue = cc.sql('SELECT "datetime", YHAT - "system_demand_actual" AS RESID FROM ({})'.format(fit_join.select_statement))
import matplotlib.pyplot as plt
plt.figure(figsize=(16,5))
plt.plot(data['date'][0:df_train.count()], residue.collect()['RESID'])
plt.title('Figure 4. Overview of random component of training data with holiday info.')
plt.show()
plt.close()

We can see from the residue component of the training data that, after public holiday information is introduced, those original excessively large residual values become less frequent, indicating smaller fitting error. In general, the introduction of holidays helps the model in capturing the associated cyclic patterns.

Let us verify whether the prediction error could be reduced as well by apply the re-fitted additive model on the test data.

pred_res2 = amf.predict(df_test)
renm_pred2 = pred_res2.rename_columns({'datetime': 'date_time'})
join_pred2 = df_test.join(renm_pred2[['date_time', 'YHAT']], '"datetime"="date_time"')
from hana_ml.algorithms.pal.tsa.accuracy_measure import accuracy_measure
res = accuracy_measure(data = join_pred2[['system_demand_actual', 'YHAT']],
                       evaluation_metric = ['rmse', 'mad', 'mape'])
res.collect()
STAT_NAME STAT_VALUE
0 MAD 175.825074
1 MAPE 0.030518
2 RMSE 220.948932

One can see that all selected accuracy measures improve by a non-negligible amount on the test data thanks to the addition of holiday information:

  • MAD : from ~195.8 to ~175.8,
  • MAPE : from ~0.0335 to ~0.0305,
  • RMSE : from ~224.8 to ~220.9

 

Case Study II :Additive Model Analysis for International Air Arrivals in Singapore

Dataset Description

The data we use in this section is the monthly international visitor arrivals at Changi airport in Singapore[5], available also in Singapore’s open data portal[3].  The time-series data have an overall upward trend, while the growth rate could be different within different time-intervals. The data has been preprocessed and stored in HANA database in a table with the name of ‘SG_INTER_VISITORS_TBL’.

psg = cc.table('SG_INTER_VISITORS_TBL').cast('month', 'DATE')#'month' column with type 'TIMESTAMP', convert to 'DATE' type for easier processing
psg.collect()
month total_international_visitor_arrivals
0 1978-01-01 167016
1 1978-02-01 147954
2 1978-03-01 163199
3 1978-04-01 162400
4 1978-05-01 162667
450 2015-07-01 1519188
451 2015-08-01 1445052
452 2015-09-01 1131967
453 2015-10-01 1246706
454 2015-11-01 1193906

455 rows × 2 columns

The dataset contains two columns, with the 1st column being month info(represented by the 1st day of that month) and 2nd column being to total number of international visitor arrivals. The range of time for this time-series data is from Jan. 1978 to Nov. 2015, with a total number of 455 records.

To proceed further, we divide dataset into training part and testing part, respectively. All records from year 1978 to year 2011, i.e. the heading 408 records, are used for training an additive time-series model, while the rest of records are used for model evaluation.

train_num = 408
test_num = psg.count() - train_num
psg_train = psg.head(train_num)
psg_test = psg.sort('month', desc=True).head(test_num).sort('month')

In the following we examine the training data by drawing its run-sequence plot.

psg_train_data = psg_train.collect()
import matplotlib.pyplot as plt
plt.figure(figsize=(16*(408/455),5))
plt.plot(psg_train_data['month'],
         psg_train_data['total_international_visitor_arrivals'])
plt.show()
plt.close()

One can see that in general the number of air arrivals keep increasing with time. However, the increasing rate is non-stationary, and there are also some intermittent drops that are significant, making the trend of this time-series difficult to model.

Additive model time-series analysis also gives us the flexibility in modeling the trend in a picewise manner. If not specified manually, it will automatically identifying a collection of change-points for the input time-series, where the model of trend is allowed to change when tranversing through a change-point. However, for time-series with complicated trend, the detected change-points produced by the default mechanism are often sub-optimal, leading to over-fitting or under-fitting for the trend of the training data. In contrast, manually labeled change-points(often could be obtained from outer sources) could sometimes alleviate this weakness and do a better job for time-series modeling and prediction.

In this section, we show the benefit of the flexibility provided by additive model analysis for trend modeling, using the above monthly air arrival dataset as an example.

Modeling and Forecast with Additive Model Analaysis

One can see from the  run-sequence plot of the air arrival data that,  in April, May and June 2003, the number of international arrivals undergoes a significant drop from its regular level. This is due to the short-term outbreak of SARS cronavirus in Singapore. From the discussion of our previous section, we may include this event when dealing with this time-series. So we store the corresponding values of the two months in a table named by ‘SG_SARS_MONTH_TBL’ in SAP HANA.

event = cc.table('SG_SARS_MONTH_TBL').cast('Month', 'DATE')
event.collect()
Month Name
0 2004-04-01 SARS
1 2004-05-01 SARS
2 2004-06-01 SARS

It is seen from the training data that,  local variations increases with the growing number of international arrivals, which is typically regards a single a multiplicative seasonality(if there is any).

from hana_ml.algorithms.pal.tsa.additive_model_forecast import AdditiveModelForecast
amf = AdditiveModelForecast(seasonality_mode='multiplicative')
amf.fit(data=psg_train, holiday=event)

Then let us check how well the model fits the training data.

psg_fit = amf.predict(psg_train)# obtain fitted value for training data
psg_fit_rename_month = psg_fit.rename_columns({'month': 'Month'})
join_true_and_fit = psg_train.join(psg_fit_rename_month, '"month" = "Month"')
from hana_ml.algorithms.pal.tsa.accuracy_measure import accuracy_measure
fit_res = accuracy_measure(data = join_true_and_fit[['total_international_visitor_arrivals', 'YHAT']], 
                           evaluation_metric = ['rmse', 'mad', 'mape'])
fit_res.collect()
STAT_NAME STAT_VALUE
0 MAD 31945.872096
1 MAPE 0.067876
2 RMSE 51823.310879

Next we apply the trained model to the test data, and evaluate the prediction result using popular evaluation metrics.

psg_pred = amf.predict(psg_test)
psg_pred_rename_month = psg_pred.rename_columns({'month': 'Month'})
join_true_and_pred = psg_test.join(psg_pred_rename_month, '"month" = "Month"')
psg_pred_acc = accuracy_measure(data = join_true_and_pred[['total_international_visitor_arrivals', 'YHAT']],
                                evaluation_metric = ['rmse', 'mad', 'mape'])
psg_pred_acc.collect()
STAT_NAME STAT_VALUE
0 MAD 119813.052030
1 MAPE 0.093767
2 RMSE 139003.141614
Both MAD and RMSE are around 1e+5, one order smaller compared to the number of arrivals(~ 1e+6), consistent with MAPE(~ 0.09).

Since the time-series of interest in this section has larger variations when its level value goes higher, so it is natual that MSE and RMSE have larger values on the test data since the level is expected to be higher within the prediction period. However, mean absolute percentage error is scale-invariant w.r.t. level of time-series, so compared to that of training data, its higher value on the test data strongly suggests that the training data could be over-fitted by the additive time-series model.

For better illustrattion, the real and predicted values are plotted as follows:

pred_data = psg_pred.collect()
psg_test_data = psg_test.collect()
import matplotlib.pyplot as plt
plt.figure(figsize=(16,5))
plt.plot(psg_test_data['month'], psg_test_data['total_international_visitor_arrivals'])
plt.plot(pred_data['month'], pred_data['YHAT'], color='red')
plt.show()
plt.close()

As discussed in the beginning of this section, when building the above additive model, a set of change-points is detected implicitly using a defaulted mechanism, and the detected change-points may not be optimal. So for the input time-series, after a careful examination of training data, we suggest the following set of changepoints:

‘1982-11-01’, ‘1986-01-01’, ‘1991-05-01’, ‘1995-09-01’, ‘1997-10-01’, ‘2000-07-01’, ‘2004-02-01’, ‘2007-12-01’, ‘2009-03-01’.

Let us first mark out the specified change-points in the training data via run-sequence plot.

import pandas as pd
psg_train_data = psg_train.collect()
changepoints = list(pd.to_datetime(pd.Series(['1982-11-01', '1986-01-01', '1991-05-01', '1995-09-01', '1997-10-01', '2000-07-01', '2004-02-01', '2007-12-01', '2009-03-01'])))
event_points = list(pd.to_datetime(pd.Series(['2003-04-01', '2003-05-01', '2003-06-01'])))
cp_data = psg_train_data.loc[[x in changepoints for x in psg_train_data['month']],:]
event_data = psg_train_data.loc[[x in event_points for x in psg_train_data['month']],:]
plt.figure(figsize=(int(16*(408/455)),5))
plt.plot(psg_train_data['month'], psg_train_data['total_international_visitor_arrivals'])
plt.scatter(cp_data['month'], cp_data['total_international_visitor_arrivals'], color='red')
plt.scatter(event_data['month'], event_data['total_international_visitor_arrivals'], color='purple')
plt.legend(['Visitor Arrivals', 'Specified Change-points', 'Event Points'])
plt.show()
plt.close()

Next, we manually specify these points as change-points when building up the additive model for time-series analysis.

amf = AdditiveModelForecast(seasonality_mode='multiplicative',
                            changepoints=['1982-11-01 00:00:00',
                                          '1986-01-01 00:00:00',
                                          '1991-05-01 00:00:00',
                                          '1995-09-01 00:00:00',
                                          '2000-07-01 00:00:00',
                                          '2004-02-01 00:00:00',
                                          '2007-12-01 00:00:00',
                                          '2009-03-01 00:00:00'])
amf.fit(data=psg_train, holiday=event)

Again let us check how well the model fits the training data.

psg_fit2 = amf.predict(psg_train)
psg_fit2_rename_month = psg_fit2.rename_columns({'month': 'Month'})
join_true_and_fit2 = psg_train.join(psg_fit2_rename_month, '"month" = "Month"')
from hana_ml.algorithms.pal.tsa.accuracy_measure import accuracy_measure
fit_res2 = accuracy_measure(data = join_true_and_fit2[['total_international_visitor_arrivals', 'YHAT']], 
                            evaluation_metric = ['rmse', 'mad', 'mape'])
fit_res2.collect()
STAT_NAME STAT_VALUE
0 MAD 32760.561142
1 MAPE 0.070333
2 RMSE 52156.801417

So the metrics for model fitting on the training dataset is not as good as the case when using default mechasim for chang-point detection, but the result are comparable.

Now we evaluate the re-fitted model on the test data.

STAT_NAME STAT_VALUE
0 MAD 93115.156652
1 MAPE 0.073317
2 RMSE 111212.533575

Thanks to the manually specified change-points, all evaluation metrics have improved on the test data. In particular, mean absolute percentage error(MAPE) has been dropped by nearly 2%, making it close the MAPE value on the training data.

We can check the consistency between real and predicted arrivals by drawing their run-sequence plots.

pred2_data = psg_pred2.collect()
plt.figure(figsize=(16,5))
plt.plot(psg_test_data['month'],
         psg_test_data['total_international_visitor_arrivals'])
plt.plot(pred2_data['month'], pred2_data['YHAT'], color='red')
plt.legend(['Actual Arrivals', 'Predicted Arrivals'])
plt.show()
plt.close()

 

Summary

In this blog post, we have shown readers how additive model analysis faciliates us in handling time-series with trend, seasonality and cyclic patterns. Compared with traditional approaches that rely on stationarity and lagged dependecy, the curve-fitting nature of additive time-series models provides us more flexibilities for time-series modeling.

 

Assigned Tags

      Be the first to leave a comment
      You must be Logged on to comment or reply to a post.