Additive Model Time-series Analysis using Python Machine Learning Client for SAP HANA
In a few related blog posts[1][2], we have shown the analysis of time-series using traditional approaches like seasonal decomposition, ARIMA and exponential smoothing. However, those traditional approaches have drawbacks: they are typically ineffective for time-series with a complicated trend, multiple seasonalities, or cyclic patterns (patterns repeated with a non-fixed period, such as holidays and event days). These drawbacks can be largely overcome by a relatively new method called additive model time-series analysis (referred to simply as additive model analysis in the rest of this post where there is no ambiguity). This method is provided in the SAP HANA Predictive Analysis Library (PAL) and wrapped in the Python machine learning client for SAP HANA (hana_ml).
In this blog post, we show readers how to apply additive model analysis to time-series data through two specific case studies.
The rest of this blog post is organized as follows: first we give a brief introduction to additive model analysis, then we walk through two case studies (covering data description, modeling and forecasting), and finally we summarize the main content of this blog post.
Introduction
Additive model analysis is a relatively new approach for time-series modeling. Unlike traditional approaches (like ARIMA and exponential smoothing) that model time-based dependencies among observations, it treats time-series modeling as a curve-fitting problem and uses an additive model to fit and forecast time-series data. Under this setting, the given time-series is decomposed into four components: trend, seasonality, cyclic (e.g. holiday/event) effects, and a random component:
y(t) = g(t) + s(t) + h(t) + ϵ(t), where g(t) denotes the trend, s(t) the seasonality, h(t) the holiday/cyclic effects, and ϵ(t) the random error term.
It can handle data with a complicated trend (using piecewise linear approximation) and multiple seasonalities (daily, weekly, yearly and beyond, even if the period is non-integer), including strong seasonal effects. Besides, the modeling of cyclic patterns helps to explain variations with a non-fixed period. As a consequence, additive model time-series analysis often performs better than classical methods for time-series modeling and analysis.
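For concreteness, the trend component in Prophet-style additive models (which this method resembles; the exact formulation used internally by PAL is not documented here, so the following is an illustrative assumption) is typically a piecewise linear function

$$g(t) = \left(k + \mathbf{a}(t)^{\top}\boldsymbol{\delta}\right) t + \left(m + \mathbf{a}(t)^{\top}\boldsymbol{\gamma}\right),$$

where $k$ is the base growth rate, $m$ is the offset, $\boldsymbol{\delta}$ holds the rate adjustments at the $S$ change-points, $\mathbf{a}(t)\in\{0,1\}^{S}$ indicates which change-points lie before time $t$, and $\boldsymbol{\gamma}$ is chosen to keep $g$ continuous.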
Case Study I: Additive Model Analysis for Electricity System Demand in Singapore
Dataset Description
The first dataset of interest in this blog post is the historical electricity system demand for every half-hour period (in MW) in Singapore, which is available in Singapore’s open data portal[3]. The original data is stored in a .csv file[4]; it has been preprocessed and stored in a table named ‘SG_SYSTEM_DEMAND_DATA_TBL’ in SAP HANA. The dataset can then be accessed via a ConnectionContext to SAP HANA in hana_ml.
import hana_ml
from hana_ml.dataframe import ConnectionContext
cc = ConnectionContext('lsvxc0103.sjc.sap.corp', 30315, 'PAL_USER', 'Welcome1')
df = cc.table('SG_SYSTEM_DEMAND_DATA_TBL')
data = df.sort('datetime').collect()
data
datetime | system_demand_actual | |
---|---|---|
0 | 2012-02-06 00:30:00 | 4526.68 |
1 | 2012-02-06 01:00:00 | 4436.09 |
2 | 2012-02-06 01:30:00 | 4361.24 |
3 | 2012-02-06 02:00:00 | 4295.23 |
4 | 2012-02-06 02:30:00 | 4233.17 |
… | … | … |
75931 | 2016-06-05 22:00:00 | 5862.00 |
75932 | 2016-06-05 22:30:00 | 5756.00 |
75933 | 2016-06-05 23:00:00 | 5674.00 |
75934 | 2016-06-05 23:30:00 | 5566.00 |
75935 | 2016-06-06 00:00:00 | 5446.00 |
75936 rows × 2 columns
In the following, we draw the run-sequence plot of the full time-series.
import matplotlib.pyplot as plt
plt.figure(figsize=(16, 8))
plt.plot(data['datetime'], data['system_demand_actual'])
plt.title('Figure 1. Overview of the dataset')
plt.show()
plt.close()
A coarse examination of the plot indicates that the data have an approximately linear trend and yearly seasonality. Now let us zoom in to see whether there are other findings. In the following we extract the data of the first 5 weeks for detailed examination.
df_5wk = cc.sql('SELECT * FROM ({}) WHERE "datetime" <= \'2012-03-12 00:00:00\''.format(df.select_statement))
data_5wk = df_5wk.sort('datetime').collect()
Now that the 5-week data has been collected to the Python client, we can draw its run-sequence plot.
plt.figure(figsize=(16, 4))
plt.plot(data_5wk['datetime'], data_5wk['system_demand_actual'])
plt.title('Figure 2. Zoom-in view of the first 5 weeks of data')
plt.show()
plt.close()
The plot shows that the data also have a strong weekly seasonality and a (weaker) daily seasonality caused by the similarity of the demand pattern across weekdays (the demand patterns on weekends are also similar to one another, yet with smaller magnitude).
In summary, the given time-series contains multiple seasonalities: daily, weekly and yearly, the latter with a non-integer period of 365.25 days. All these characteristics make analyzing this time-series difficult with traditional approaches like ARIMA, exponential smoothing and seasonal decomposition.
Modeling and Forecast using Additive Model Analysis
Additive model analysis treats time-series modeling as a curve-fitting problem with respect to time; in contrast, exponential smoothing and ARIMA model the dependencies of the current observations on past ones (including expected values and errors). Besides, additive model analysis uses (partial) Fourier series to model periodic patterns in the input data. By doing so, the problem of non-integer periods is easily resolved, and handling multiple periods becomes non-critical. Moreover, additive model analysis allows users to supply events (usually cyclical, called holidays in the Python API) that could affect the shape of the curve during a short period of time and break up the regular seasonal pattern. This often leads to better time-series models.
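As a concrete example of how periodic patterns are handled, a seasonal component with period $P$ can be represented by a truncated Fourier series of order $N$ (this is the Prophet-style formulation; the exact internals of PAL's implementation may differ):

$$s(t) = \sum_{n=1}^{N}\left(a_n \cos\frac{2\pi n t}{P} + b_n \sin\frac{2\pi n t}{P}\right).$$

A yearly seasonality then simply uses $P = 365.25$ days, so non-integer periods pose no difficulty, and several such terms with different $P$ can be summed to capture multiple seasonalities.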
Now let us apply additive model analysis to our dataset of electricity system demand. As a standard procedure in machine learning, we first split the dataset into a training part and a testing part.
df_train = cc.sql('SELECT * FROM ({}) WHERE "datetime" <= \'2016-02-18 00:00:00\''.format(df.select_statement))
df_test = cc.sql('SELECT * FROM ({}) WHERE "datetime" > \'2016-02-18 00:00:00\''.format(df.select_statement))  # the remaining records form the test set used for prediction below
df_train.sort('datetime').collect()
datetime | system_demand_actual | |
---|---|---|
0 | 2012-02-06 00:30:00 | 4526.68 |
1 | 2012-02-06 01:00:00 | 4436.09 |
2 | 2012-02-06 01:30:00 | 4361.24 |
3 | 2012-02-06 02:00:00 | 4295.23 |
4 | 2012-02-06 02:30:00 | 4233.17 |
… | … | … |
70699 | 2016-02-17 22:00:00 | 5903.00 |
70700 | 2016-02-17 22:30:00 | 5745.00 |
70701 | 2016-02-17 23:00:00 | 5597.00 |
70702 | 2016-02-17 23:30:00 | 5441.00 |
70703 | 2016-02-18 00:00:00 | 5321.00 |
70704 rows × 2 columns
Now let us use the training set to fit a model for additive model analysis.
from hana_ml.algorithms.pal.tsa.additive_model_forecast import AdditiveModelForecast # import the python api for additive model analysis from hana_ml
amf = AdditiveModelForecast(seasonality_mode='additive', daily_seasonality='true', weekly_seasonality='true', yearly_seasonality='true')
amf.fit(df_train)
The model fitting procedure takes some time. After it is done, we can apply the trained additive model to the test data, in which case only the time-stamps matter and the actual demand values are ignored.
pred_res = amf.predict(df_test)
pred_data = pred_res.sort('datetime').collect()
import matplotlib.pyplot as plt
plt.figure(figsize=(16,5))
plt.plot(data['datetime'][50000:], data['system_demand_actual'][50000:])# use partial data for better illustration of the prediction result
plt.plot(pred_data['datetime'], pred_data['YHAT'], color='red')
plt.legend(['Actual Demand', 'Predicted Demand'])
plt.title('Figure 3. Actual Demand VS Predicted Demand')
plt.show()
plt.close()
We can join the real demand values with predicted ones, and compute the prediction accuracy measures.
rename_pred = pred_res.rename_columns({'datetime': 'date_time'})
join_pred = df_test.join(rename_pred[['date_time', 'YHAT']], #YHAT is the column that stores the predicted values
'"datetime"="date_time"')
join_pred.collect()
datetime | system_demand_actual | date_time | YHAT | |
---|---|---|---|---|
0 | 2016-02-18 00:30:00 | 5164.0 | 2016-02-18 00:30:00 | 5218.050311 |
1 | 2016-02-18 01:00:00 | 5044.0 | 2016-02-18 01:00:00 | 5103.471702 |
2 | 2016-02-18 01:30:00 | 4943.0 | 2016-02-18 01:30:00 | 5010.291241 |
3 | 2016-02-18 02:00:00 | 4866.0 | 2016-02-18 02:00:00 | 4937.996240 |
4 | 2016-02-18 02:30:00 | 4812.0 | 2016-02-18 02:30:00 | 4882.707945 |
… | … | … | … | … |
5227 | 2016-06-05 22:00:00 | 5862.0 | 2016-06-05 22:00:00 | 5777.180270 |
5228 | 2016-06-05 22:30:00 | 5756.0 | 2016-06-05 22:30:00 | 5688.704385 |
5229 | 2016-06-05 23:00:00 | 5674.0 | 2016-06-05 23:00:00 | 5577.197404 |
5230 | 2016-06-05 23:30:00 | 5566.0 | 2016-06-05 23:30:00 | 5452.238527 |
5231 | 2016-06-06 00:00:00 | 5446.0 | 2016-06-06 00:00:00 | 5325.501001 |
5232 rows × 4 columns
from hana_ml.algorithms.pal.tsa.accuracy_measure import accuracy_measure
res = accuracy_measure(data = join_pred[['system_demand_actual', 'YHAT']],
evaluation_metric = ['rmse', 'mad', 'mape'])
res.collect()
STAT_NAME | STAT_VALUE | |
---|---|---|
0 | MAD | 195.786364 |
1 | MAPE | 0.033426 |
2 | RMSE | 244.762496 |
MAD and RMSE are both small compared with the demand values in the dataset (~200 vs ~5000), consistent with the mean absolute percentage error (MAPE), which is around 3.3%.
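For reference, the three accuracy measures used above follow the standard definitions (with $y_t$ the actual and $\hat{y}_t$ the predicted value over the $n$ test points; PAL's accuracy_measure is expected to compute exactly these quantities):

$$\mathrm{MAD} = \frac{1}{n}\sum_{t=1}^{n}\left|y_t-\hat{y}_t\right|, \qquad \mathrm{MAPE} = \frac{1}{n}\sum_{t=1}^{n}\frac{\left|y_t-\hat{y}_t\right|}{\left|y_t\right|}, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{t=1}^{n}\left(y_t-\hat{y}_t\right)^2}.$$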
For the sake of comparison, we apply AutoExponentialSmoothing to the same training data, and then evaluate the performance of the generated model on the test data.
To save computational time and improve model quality, we can pre-specify a few parameters when initializing the AutoExponentialSmoothing instance: since the data contain both trend and seasonality components, triple exponential smoothing should be chosen to handle them; the seasonal variations do not grow with the level, which suggests additive seasonality, so the seasonal parameter is set to ‘additive’.
from hana_ml.algorithms.pal.tsa.exponential_smoothing import AutoExponentialSmoothing
aesm = AutoExponentialSmoothing(forecast_model_name='TESM',
forecast_num=df_test.count(),
seasonal='additive',
optimizer_random_seed=1)
df_train_wid = df_train.add_id('ID', 'datetime')
aesm.fit_predict(data=df_train_wid, key='ID',
endog='system_demand_actual')
aesm.forecast_.sort('TIMESTAMP').collect()
TIMESTAMP | VALUE | PI1_LOWER | PI1_UPPER | PI2_LOWER | PI2_UPPER | |
---|---|---|---|---|---|---|
0 | 1 | 4710.426403 | NaN | NaN | NaN | NaN |
1 | 2 | 4420.592040 | NaN | NaN | NaN | NaN |
2 | 3 | 4344.739633 | NaN | NaN | NaN | NaN |
3 | 4 | 4294.175059 | NaN | NaN | NaN | NaN |
4 | 5 | 4237.244328 | NaN | NaN | NaN | NaN |
… | … | … | … | … | … | … |
75931 | 75932 | 5611.215324 | 3185.398049 | 8037.032600 | 1901.247953 | 9321.182696 |
75932 | 75933 | 5505.265185 | 3079.215918 | 7931.314453 | 1794.943013 | 9215.587358 |
75933 | 75934 | 5402.156812 | 2975.875576 | 7828.438048 | 1691.479873 | 9112.833751 |
75934 | 75935 | 5295.253762 | 2868.740578 | 7721.766945 | 1584.222091 | 9006.285433 |
75935 | 75936 | 5185.558722 | 2758.813614 | 7612.303831 | 1474.172353 | 8896.945092 |
75936 rows × 6 columns
Again, the predicted demand values and the actual ones can be joined to compute the prediction accuracy measures.
df_wid = df.add_id('ID', 'datetime')
df_test_wid = cc.sql('SELECT * FROM ({}) WHERE ID >= {}'.format(df_wid.select_statement, df_train.count()))
aesm_join_res = df_test_wid.join(aesm.forecast_[['TIMESTAMP', 'VALUE']],##the VALUE column stores fitted & predicted demand values
'ID = TIMESTAMP')
aesm_join_res.collect()
ID | datetime | system_demand_actual | TIMESTAMP | VALUE | |
---|---|---|---|---|---|
0 | 70704 | 2016-02-18 00:00:00 | 5321.0 | 70704 | 5305.413617 |
1 | 70705 | 2016-02-18 00:30:00 | 5164.0 | 70705 | 5170.236697 |
2 | 70706 | 2016-02-18 01:00:00 | 5044.0 | 70706 | 5053.972515 |
3 | 70707 | 2016-02-18 01:30:00 | 4943.0 | 70707 | 4958.628810 |
4 | 70708 | 2016-02-18 02:00:00 | 4866.0 | 70708 | 4887.215553 |
… | … | … | … | … | … |
5228 | 75932 | 2016-06-05 22:00:00 | 5862.0 | 75932 | 5611.215324 |
5229 | 75933 | 2016-06-05 22:30:00 | 5756.0 | 75933 | 5505.265185 |
5230 | 75934 | 2016-06-05 23:00:00 | 5674.0 | 75934 | 5402.156812 |
5231 | 75935 | 2016-06-05 23:30:00 | 5566.0 | 75935 | 5295.253762 |
5232 | 75936 | 2016-06-06 00:00:00 | 5446.0 | 75936 | 5185.558722 |
5233 rows × 5 columns
aesm_measure = accuracy_measure(data = aesm_join_res[['system_demand_actual', 'VALUE']],
evaluation_metric = ['rmse', 'mad', 'mape'])
aesm_measure.collect()
STAT_NAME | STAT_VALUE | |
---|---|---|
0 | MAD | 262.475689 |
1 | MAPE | 0.044364 |
2 | RMSE | 300.193255 |
One can see from the selected accuracy measures that the performance of exponential smoothing is not as good as that of additive model analysis, which is consistent with our earlier analysis of the dataset.
Adding Holidays for Improved Forecast
Up until now we haven’t talked about the holiday effect in additive model time-series analysis. Figure 2 shows that on weekends the system demand is usually smaller than on weekdays, so we naturally suspect that on public holidays the system demand will also deviate from the regular weekday pattern. We verify this by checking the residuals of the training data (i.e. the differences between fitted and actual values) when additive model analysis is applied.
fit_res = amf.predict(df_train) #obtain the fitting result for the training data
fit_res.sort('datetime').collect()
datetime | YHAT | YHAT_LOWER | YHAT_UPPER | |
---|---|---|---|---|
0 | 2012-02-06 00:30:00 | 4542.912928 | 4225.863575 | 4856.542969 |
1 | 2012-02-06 01:00:00 | 4442.691457 | 4116.797874 | 4751.370439 |
2 | 2012-02-06 01:30:00 | 4364.044821 | 4051.549657 | 4685.978732 |
3 | 2012-02-06 02:00:00 | 4306.432861 | 3994.744755 | 4599.067855 |
4 | 2012-02-06 02:30:00 | 4265.949011 | 3936.221074 | 4602.731532 |
… | … | … | … | … |
70699 | 2016-02-17 22:00:00 | 5857.695587 | 5552.015396 | 6172.163877 |
70700 | 2016-02-17 22:30:00 | 5755.662651 | 5469.078620 | 6074.457734 |
70701 | 2016-02-17 23:00:00 | 5630.292965 | 5292.285728 | 5938.692115 |
70702 | 2016-02-17 23:30:00 | 5491.190423 | 5176.592394 | 5808.430988 |
70703 | 2016-02-18 00:00:00 | 5350.053660 | 5030.360951 | 5667.936164 |
70704 rows × 4 columns
We join the fitted values with the actual ones to compute their differences.
fit_res_rename = fit_res.rename_columns({'datetime':'date_time'})
fit_join = df_train.join(fit_res_rename, '"datetime" = "date_time"')
fit_join.collect()
datetime | system_demand_actual | date_time | YHAT | YHAT_LOWER | YHAT_UPPER | |
---|---|---|---|---|---|---|
0 | 2012-02-06 00:30:00 | 4526.68 | 2012-02-06 00:30:00 | 4542.912928 | 4225.863575 | 4856.542969 |
1 | 2012-02-06 01:00:00 | 4436.09 | 2012-02-06 01:00:00 | 4442.691457 | 4116.797874 | 4751.370439 |
2 | 2012-02-06 01:30:00 | 4361.24 | 2012-02-06 01:30:00 | 4364.044821 | 4051.549657 | 4685.978732 |
3 | 2012-02-06 02:00:00 | 4295.23 | 2012-02-06 02:00:00 | 4306.432861 | 3994.744755 | 4599.067855 |
4 | 2012-02-06 02:30:00 | 4233.17 | 2012-02-06 02:30:00 | 4265.949011 | 3936.221074 | 4602.731532 |
… | … | … | … | … | … | … |
70699 | 2016-02-17 22:00:00 | 5903.00 | 2016-02-17 22:00:00 | 5857.695587 | 5552.015396 | 6172.163877 |
70700 | 2016-02-17 22:30:00 | 5745.00 | 2016-02-17 22:30:00 | 5755.662651 | 5469.078620 | 6074.457734 |
70701 | 2016-02-17 23:00:00 | 5597.00 | 2016-02-17 23:00:00 | 5630.292965 | 5292.285728 | 5938.692115 |
70702 | 2016-02-17 23:30:00 | 5441.00 | 2016-02-17 23:30:00 | 5491.190423 | 5176.592394 | 5808.430988 |
70703 | 2016-02-18 00:00:00 | 5321.00 | 2016-02-18 00:00:00 | 5350.053660 | 5030.360951 | 5667.936164 |
70704 rows × 6 columns
residue = cc.sql('SELECT "datetime", YHAT - "system_demand_actual" AS RESID FROM ({})'.format(fit_join.select_statement))
import matplotlib.pyplot as plt
plt.figure(figsize=(16,5))
plt.plot(data['datetime'][0:df_train.count()], residue.collect()['RESID'])
plt.title('Figure 4. Overview of the random component of training data')
plt.show()
plt.close()
One can see that there are many large spikes in the residuals (mainly upward with respect to zero, since the residual is defined as predicted minus actual), which indicates that on certain days the actual demand falls below the predicted values by a large amount. For further examination, we use the deviation of the residual from 0 as a measure of how abnormal a data point is, and then extract the dates of the top 100 residuals in the training data.
abnormal = cc.sql('SELECT * FROM ({}) ORDER BY RESID DESC LIMIT 100'.format(residue.select_statement))
abnormal_dates = abnormal[['datetime']].cast('datetime', 'DATE').drop_duplicates()
abnormal_dates.sort('datetime').collect()
datetime | |
---|---|
0 | 2013-02-11 |
1 | 2013-02-12 |
2 | 2014-01-31 |
3 | 2014-10-22 |
4 | 2015-02-19 |
5 | 2015-02-20 |
6 | 2016-02-08 |
7 | 2016-02-09 |
The dates listed above are all public holidays in Singapore (2014-10-22 is Deepavali; the others fall around the Lunar New Year), which indicates that system demand falls below the regular level on public holidays. This confirms our previous suspicion.
Since the major public holidays of a year are usually announced before the year actually starts, it is legitimate to incorporate them into both the model building and the prediction phase. Additive model analysis provides this facility, so we collect the set of holidays (excluding weekends) within the date range of the input data and store it in SAP HANA.
holidays = cc.table('SG_HOLIDAY_TBL').cast('Date', 'DATE')#The Date column in holiday table is of type 'TIMESTAMP', convert to 'DATE'
holidays.head(5).collect()
Date | Name | |
---|---|---|
0 | 2012-01-02 | NY |
1 | 2012-01-23 | CNY |
2 | 2012-01-24 | CNY |
3 | 2012-04-06 | GF |
4 | 2012-05-01 | LD |
where NY = New Year, CNY = Chinese New Year, GF = Good Friday, LD = Labor Day.
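In case the holiday table does not yet exist in SAP HANA, one possible way to create it from the Python side is sketched below, using hana_ml’s create_dataframe_from_pandas; the rows shown are only an illustrative subset of Singapore’s public holidays, and the table/column names simply mirror the ones used above.
import pandas as pd
from hana_ml.dataframe import create_dataframe_from_pandas
# illustrative subset only, not the complete holiday list
holiday_pd = pd.DataFrame({'Date': pd.to_datetime(['2012-01-02', '2012-01-23', '2012-01-24']),
                           'Name': ['NY', 'CNY', 'CNY']})
holidays = create_dataframe_from_pandas(cc, holiday_pd, 'SG_HOLIDAY_TBL', force=True)  # force=True replaces the table if it already exists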
Now let’s re-fit the additive time-series model with the introduction of holidays.
amf.fit(data = df_train, holiday=holidays)
We then re-plot the residual component.
fit_res = amf.predict(df_train)
fit_res_rename = fit_res.rename_columns({'datetime':'date_time'})
fit_join = df_train.join(fit_res_rename, '"datetime" = "date_time"')
residue = cc.sql('SELECT "datetime", YHAT - "system_demand_actual" AS RESID FROM ({})'.format(fit_join.select_statement))
import matplotlib.pyplot as plt
plt.figure(figsize=(16,5))
plt.plot(data['datetime'][0:df_train.count()], residue.collect()['RESID'])
plt.title('Figure 5. Overview of the random component of training data with holiday info.')
plt.show()
plt.close()
We can see from the residuals of the training data that, after the public holiday information is introduced, the excessively large residual values become much less frequent, indicating a smaller fitting error. In general, the introduction of holidays helps the model capture the associated cyclic patterns.
Let us verify whether the prediction error is reduced as well by applying the re-fitted additive model to the test data.
pred_res2 = amf.predict(df_test)
renm_pred2 = pred_res2.rename_columns({'datetime': 'date_time'})
join_pred2 = df_test.join(renm_pred2[['date_time', 'YHAT']], '"datetime"="date_time"')
from hana_ml.algorithms.pal.tsa.accuracy_measure import accuracy_measure
res = accuracy_measure(data = join_pred2[['system_demand_actual', 'YHAT']],
evaluation_metric = ['rmse', 'mad', 'mape'])
res.collect()
STAT_NAME | STAT_VALUE | |
---|---|---|
0 | MAD | 175.825074 |
1 | MAPE | 0.030518 |
2 | RMSE | 220.948932 |
One can see that all selected accuracy measures improve by a non-negligible amount on the test data thanks to the addition of holiday information:
- MAD : from ~195.8 to ~175.8,
- MAPE : from ~0.0334 to ~0.0305,
- RMSE : from ~244.8 to ~220.9
Case Study II: Additive Model Analysis for International Air Arrivals in Singapore
Dataset Description
The data we use in this section are the monthly international visitor arrivals at Changi Airport in Singapore[5], also available in Singapore’s open data portal[3]. The time-series has an overall upward trend, but the growth rate differs across time intervals. The data have been preprocessed and stored in the SAP HANA database in a table named ‘SG_INTER_VISITORS_TBL’.
psg = cc.table('SG_INTER_VISITORS_TBL').cast('month', 'DATE')#'month' column with type 'TIMESTAMP', convert to 'DATE' type for easier processing
psg.collect()
month | total_international_visitor_arrivals | |
---|---|---|
0 | 1978-01-01 | 167016 |
1 | 1978-02-01 | 147954 |
2 | 1978-03-01 | 163199 |
3 | 1978-04-01 | 162400 |
4 | 1978-05-01 | 162667 |
… | … | … |
450 | 2015-07-01 | 1519188 |
451 | 2015-08-01 | 1445052 |
452 | 2015-09-01 | 1131967 |
453 | 2015-10-01 | 1246706 |
454 | 2015-11-01 | 1193906 |
455 rows × 2 columns
The dataset contains two columns: the first column holds the month (represented by the first day of that month) and the second column holds the total number of international visitor arrivals. The time-series ranges from Jan. 1978 to Nov. 2015, with 455 records in total.
To proceed further, we divide the dataset into a training part and a testing part. All records from 1978 to 2011, i.e. the first 408 records, are used for training an additive time-series model, while the remaining records are used for model evaluation.
train_num = 408
test_num = psg.count() - train_num
psg_train = psg.head(train_num)
psg_test = psg.sort('month', desc=True).head(test_num).sort('month')
In the following we examine the training data by drawing its run-sequence plot.
psg_train_data = psg_train.collect()
import matplotlib.pyplot as plt
plt.figure(figsize=(16*(408/455),5))
plt.plot(psg_train_data['month'],
psg_train_data['total_international_visitor_arrivals'])
plt.show()
plt.close()
One can see that in general the number of air arrivals keeps increasing with time. However, the growth rate is not constant, and there are also some significant intermittent drops, making the trend of this time-series difficult to model.
Additive model time-series analysis also gives us the flexibility to model the trend in a piecewise manner. If not specified manually, it automatically identifies a collection of change-points for the input time-series, and the trend model is allowed to change when traversing a change-point. However, for a time-series with a complicated trend, the change-points detected by the default mechanism are often sub-optimal, leading to over-fitting or under-fitting of the trend on the training data. In contrast, manually specified change-points (often obtained from external sources) can alleviate this weakness and lead to better time-series modeling and prediction.
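As a side note, the automatic detection itself can usually be tuned instead of being replaced. The sketch below assumes that the Prophet-style parameters num_changepoints and changepoint_prior_scale are exposed by the installed hana_ml version (please verify the exact names against its documentation):
# sketch only: tune automatic change-point detection rather than specifying change-points manually;
# the two parameter names below are assumptions to be checked against the hana_ml documentation
amf_auto = AdditiveModelForecast(num_changepoints=25,            # number of candidate change-points to place
                                 changepoint_prior_scale=0.05)   # larger values allow a more flexible trend
amf_auto.fit(data=psg_train)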
In this section, we show the benefit of the flexibility provided by additive model analysis for trend modeling, using the above monthly air arrival dataset as an example.
Modeling and Forecast with Additive Model Analysis
One can see from the run-sequence plot of the air arrival data that, in April, May and June 2003, the number of international arrivals drops significantly below its regular level. This is due to the short-term outbreak of the SARS coronavirus in Singapore. Following the discussion in the previous section, we can include this event when modeling the time-series, so we store the corresponding three months in a table named ‘SG_SARS_MONTH_TBL’ in SAP HANA.
event = cc.table('SG_SARS_MONTH_TBL').cast('Month', 'DATE')
event.collect()
Month | Name | |
---|---|---|
0 | 2003-04-01 | SARS |
1 | 2003-05-01 | SARS |
2 | 2003-06-01 | SARS |
It is seen from the training data that the local variations increase with the growing number of international arrivals, which is typically regarded as a sign of multiplicative seasonality (if there is any).
from hana_ml.algorithms.pal.tsa.additive_model_forecast import AdditiveModelForecast
amf = AdditiveModelForecast(seasonality_mode='multiplicative')
amf.fit(data=psg_train, holiday=event)
Then let us check how well the model fits the training data.
psg_fit = amf.predict(psg_train)# obtain fitted value for training data
psg_fit_rename_month = psg_fit.rename_columns({'month': 'Month'})
join_true_and_fit = psg_train.join(psg_fit_rename_month, '"month" = "Month"')
from hana_ml.algorithms.pal.tsa.accuracy_measure import accuracy_measure
fit_res = accuracy_measure(data = join_true_and_fit[['total_international_visitor_arrivals', 'YHAT']],
evaluation_metric = ['rmse', 'mad', 'mape'])
fit_res.collect()
STAT_NAME | STAT_VALUE | |
---|---|---|
0 | MAD | 31945.872096 |
1 | MAPE | 0.067876 |
2 | RMSE | 51823.310879 |
Next we apply the trained model to the test data, and evaluate the prediction result using popular evaluation metrics.
psg_pred = amf.predict(psg_test)
psg_pred_rename_month = psg_pred.rename_columns({'month': 'Month'})
join_true_and_pred = psg_test.join(psg_pred_rename_month, '"month" = "Month"')
psg_pred_acc = accuracy_measure(data = join_true_and_pred[['total_international_visitor_arrivals', 'YHAT']],
evaluation_metric = ['rmse', 'mad', 'mape'])
psg_pred_acc.collect()
STAT_NAME | STAT_VALUE | |
---|---|---|
0 | MAD | 119813.052030 |
1 | MAPE | 0.093767 |
2 | RMSE | 139003.141614 |
Since the time-series of interest in this section shows larger variations when its level is higher, it is natural that MAD and RMSE take larger values on the test data, where the level is expected to be higher than in the training period. However, the mean absolute percentage error is scale-invariant with respect to the level of the time-series, so its clearly higher value on the test data (compared to the training data) strongly suggests that the additive time-series model over-fits the training data.
For better illustration, the actual and predicted values are plotted as follows:
pred_data = psg_pred.collect()
psg_test_data = psg_test.collect()
import matplotlib.pyplot as plt
plt.figure(figsize=(16,5))
plt.plot(psg_test_data['month'], psg_test_data['total_international_visitor_arrivals'])
plt.plot(pred_data['month'], pred_data['YHAT'], color='red')
plt.show()
plt.close()
As discussed at the beginning of this section, when building the above additive model, a set of change-points is detected implicitly using the default mechanism, and these detected change-points may not be optimal. So, after a careful examination of the training data, we suggest the following set of change-points for the input time-series:
‘1982-11-01’, ‘1986-01-01’, ‘1991-05-01’, ‘1995-09-01’, ‘1997-10-01’, ‘2000-07-01’, ‘2004-02-01’, ‘2007-12-01’, ‘2009-03-01’.
Let us first mark the specified change-points in the training data on its run-sequence plot.
import pandas as pd
psg_train_data = psg_train.collect()
changepoints = list(pd.to_datetime(pd.Series(['1982-11-01', '1986-01-01', '1991-05-01', '1995-09-01', '1997-10-01', '2000-07-01', '2004-02-01', '2007-12-01', '2009-03-01'])))
event_points = list(pd.to_datetime(pd.Series(['2003-04-01', '2003-05-01', '2003-06-01'])))
cp_data = psg_train_data.loc[[x in changepoints for x in psg_train_data['month']],:]
event_data = psg_train_data.loc[[x in event_points for x in psg_train_data['month']],:]
plt.figure(figsize=(int(16*(408/455)),5))
plt.plot(psg_train_data['month'], psg_train_data['total_international_visitor_arrivals'])
plt.scatter(cp_data['month'], cp_data['total_international_visitor_arrivals'], color='red')
plt.scatter(event_data['month'], event_data['total_international_visitor_arrivals'], color='purple')
plt.legend(['Visitor Arrivals', 'Specified Change-points', 'Event Points'])
plt.show()
plt.close()
Next, we manually specify these points as change-points when building up the additive model for time-series analysis.
amf = AdditiveModelForecast(seasonality_mode='multiplicative',
                            changepoints=['1982-11-01 00:00:00',
                                          '1986-01-01 00:00:00',
                                          '1991-05-01 00:00:00',
                                          '1995-09-01 00:00:00',
                                          '1997-10-01 00:00:00',
                                          '2000-07-01 00:00:00',
                                          '2004-02-01 00:00:00',
                                          '2007-12-01 00:00:00',
                                          '2009-03-01 00:00:00'])
amf.fit(data=psg_train, holiday=event)
Again let us check how well the model fits the training data.
psg_fit2 = amf.predict(psg_train)
psg_fit2_rename_month = psg_fit2.rename_columns({'month': 'Month'})
join_true_and_fit2 = psg_train.join(psg_fit2_rename_month, '"month" = "Month"')
from hana_ml.algorithms.pal.tsa.accuracy_measure import accuracy_measure
fit_res2 = accuracy_measure(data = join_true_and_fit2[['total_international_visitor_arrivals', 'YHAT']],
evaluation_metric = ['rmse', 'mad', 'mape'])
fit_res2.collect()
STAT_NAME | STAT_VALUE | |
---|---|---|
0 | MAD | 32760.561142 |
1 | MAPE | 0.070333 |
2 | RMSE | 52156.801417 |
The fitting metrics on the training dataset are slightly worse than in the case of default change-point detection, but the results are comparable.
Now we evaluate the re-fitted model on the test data.
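The evaluation mirrors the earlier step; presumably the code looks like the following sketch, where psg_pred2 is also reused for the plot further below.
psg_pred2 = amf.predict(psg_test)  # forecast with the re-fitted model
psg_pred2_rename_month = psg_pred2.rename_columns({'month': 'Month'})
join_true_and_pred2 = psg_test.join(psg_pred2_rename_month, '"month" = "Month"')
psg_pred2_acc = accuracy_measure(data = join_true_and_pred2[['total_international_visitor_arrivals', 'YHAT']],
                                 evaluation_metric = ['rmse', 'mad', 'mape'])
psg_pred2_acc.collect()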
STAT_NAME | STAT_VALUE | |
---|---|---|
0 | MAD | 93115.156652 |
1 | MAPE | 0.073317 |
2 | RMSE | 111212.533575 |
Thanks to the manually specified change-points, all evaluation metrics improve on the test data. In particular, the mean absolute percentage error (MAPE) drops by roughly 2 percentage points, bringing it close to the MAPE value on the training data.
We can check the consistency between real and predicted arrivals by drawing their run-sequence plots.
pred2_data = psg_pred2.collect()
plt.figure(figsize=(16,5))
plt.plot(psg_test_data['month'],
psg_test_data['total_international_visitor_arrivals'])
plt.plot(pred2_data['month'], pred2_data['YHAT'], color='red')
plt.legend(['Actual Arrivals', 'Predicted Arrivals'])
plt.show()
plt.close()
Summary
In this blog post, we have shown how additive model analysis facilitates the handling of time-series with trend, seasonality and cyclic patterns. Compared with traditional approaches that rely on stationarity and lagged dependencies, the curve-fitting nature of additive time-series models provides more flexibility for time-series modeling.
References
[1] Time-Series Modeling and Analysis using SAP HANA Predictive Analysis Library(PAL) through Python Machine Learning Client for SAP HANA
[2] Anomaly Detection in Time-Series using Seasonal Decomposition in Python Machine Learning Client for SAP HANA
[3] Portal website: data.gov.sg
[4] https://storage.data.gov.sg/half-hourly-system-demand/resources/half-hourly-system-demand-data-from-2-feb-2012-onwards-2016-06-14T05-29-36Z.csv
[5] Dataset website: data.gov.sg/dataset/total-visitor-international-arrivals-to-singapore