Looking under the hood of Automated Time Series from SAP Predictive Analytics
Time Series Modeling Process
The goal of time series modeling is to find the internal structure within a sequence of data points in time order. When that structure is known, the temporal measurement can be extrapolated in the forecast horizon. To build a time series model, SAP Predictive Analytics decomposes the signal (observed series) into additive components, as follows:
Signal = Trend + Cycles + Fluctuation + Residual
The tool evaluates automatically various combinations of trends, cycles and fluctuations. Eventually it selects the combination that gives the best forecast accuracy. The overall process is represented in the following diagram:
The user can customize the process by disabling some of the components if needed. By default, all the components are enabled.
After a model has been generated, the winning components are displayed in the debrief:
This model works with monthly ozone rates data. It detected a linear trend and an auto-regressive component of 37 months. No cycles were found. The user can view the signal (the green line on the chart) and the trend component (the blue line), as follows:
He can also display the detrended signal (in green below) and the auto-regressive component (in blue):
As part of the modeling process, the tool derives attributes automatically from the date-time field.
Here is the list of these derived attributes: secondOfMinute, minuteOfHour, hourOfDay, dayOfWeek, dayOfMonth, dayOfYear, weekOfMonth, weekOfYear, halfMonthOfYear and monthOfYear.
In the following model, built against a sub-hourly dataset, a cycle was detected that uses the minuteOfHour derived attribute:
Let’s look at another example with a monthly series. The following model didn’t find a trend pattern per say but applied a lag 1 transformation to the data. It detected a cycle that uses the monthOfYear derived attribute, and an auto-regressive component of 44 points:
A time series model can detect multiple cycles. The following model captured two cycles based on the dayOfWeek and the dayOfYear derived attributes:
In the diagram given at the beginning, you may have noticed the term, “extra-predictable”. What does that mean? The term refers to variables that the user can provide to improve the accuracy of the forecast. They are in fact candidate predictors that come in addition to the date-time field and its derived attributes. In the case of a daily series the user may be able to get data from external sources in order to build columns such as: Is National Holiday (Yes/No), Is School Holiday (Yes/No), Is Sporting Event (Yes/No), Weather Temperature, and Weather Rain. Or else, the user can perform calendar-based feature engineering.
Let’s look at an example where we prepared extra predictors on a daily series. Seven columns, one per weekday, were added. Each column is a continuous type and contains the number of days in the month regarding a given weekday. The debrief of the model shows a cycle based on the total days in the month for weekday 2, as follows:
Let’s look at a second example of extra predictors, still on a daily series. Using Data Manager, we calculated two flags of nominal type: In First 4 days of Month and, In Last 4 days of Month, as well as seven ranks of ordinal type: Rank Day Of Week 1, Rank Day Of Week 2, … , [up to] Rank Day Of Week 7.
The following is a data preview of the features built on the time dimension table:
The model debrief says that the feature Rank Day Of Week 1 allowed the detection of a cycle pattern in the data, as follows:
Note that the cycle component handles extra-predictors of ordinal type and continuous, not nominal. The trend component handles extra-predictors of ordinal type, continuous and nominal.
Note also that the data for the candidate predictors must be provided both for the past and future periods of the time series. So, for example, if you want to feed the model with weather information, it must include the weather forecast data.
To report the accuracy of an automated time series model, SAP Predictive Analytics uses mainly the Mean Average Percentage Error (MAPE), as shown here:
In the model debrief, the user sees a global indicator that tells how well the model performs. This is called the Horizon Wide MAPE:
The statistical reports menu gives the details behind this performance indicator. In the example below that has a forecast horizon of 6 periods, we see 6 MAPE values calculated on the Validation dataset. The Horizon Wide MAPE is simply the average of these MAPE values.
The menu “Other Performance Indicators” shows metrics like MAE that is appropriate when the series has zero values.
The time series dataset was split into two partitions automatically. The first partition, called Estimation, contains 75% of the data. The second partition, called Validation, contains the other 25%; it has the most recent data points. This partitioning scheme corresponds to the default configuration of a time series model. Here is the default data split for the ozone rates data.
Note that a third partition can be added; in that case the split becomes: 60% for Estimation, 20% for Validation and 20% for Test.
To go beyond the MAPE indicator, a residual analysis can be done to check if the model managed to find all the patterns in the data. We will perform a residual analysis in the use case described in the next section.
Use Case: Milk Production Forecast
The SAP HANA Automated Predictive Library (APL) allows a data scientist or a SQL developer to invoke the automated time series modeling engine directly from a SQL script. This section illustrates how, within HANA Studio, a data scientist can use the Forecast function of APL.
To begin, we look at the data in a line chart, a 14-year history of monthly milk production:
We generate multiple lagged series from the original series to see if there is auto-correlation in the data:
create view TS_LAGGED as With S2 as ( select 1 as id from dummy union select 2 from dummy ), LAGS as ( select row_number() over() as LAG from S2, S2, S2, S2, S2, S2 ) Select LAG, A.MONTH_START as THE_DATE, A.MILK_QTY as Y, L.MILK_QTY as Y_LAGGED From APL_SAMPLES.MILK_PROD A, APL_SAMPLES.MILK_PROD L, LAGS Where A.MONTH_START = add_months(L.MONTH_START, -LAG) order by 1 asc;
We check for patterns visually at different lags, using a lag plot. In the screen below we selected the lag 12. There is clearly a pattern.
We compute the correlation coefficient, for each lag, to get a summary view.
select LAG, round(corr(Y, Y_LAGGED),3) as PEARSON_COEFFICIENT from TS_LAGGED where lag <= 36 group by LAG order by 1 asc;
The Pearson coefficient is a standard statistic that ranges from -1 to 1. A value of 1 means that there is a perfect positive correlation. A value of -1 means that there is a perfect negative correlation. Zero means that there is no correlation. In the case of the milk quantity we observe strong positive correlation at different lags.
Now that we are more familiar with the dataset, we move to time series modeling.
We create a SQL view where we transform the monthly quantity into an average quantity per day to remove the variation due to the fact that months don’t have the same number of days, and sort the series over time, as follows:
create view TS_SORTED as select MONTH_START as THE_DATE, to_double( MILK_QTY / DAYOFMONTH(LAST_DAY(MONTH_START))) as AVERAGE_PER_DAY from APL_SAMPLES.MILK_PROD order by 1 asc;
We create the table type that describes the structure of the forecast output:
create type FORECAST_OUT_T as table ( THE_DATE DATE, AVERAGE_PER_DAY DOUBLE, "kts_1" DOUBLE );
We create the requested tables for the function to work, and ask for a 12-month forecast horizon:
create table FUNC_HEADER like "SAP_PA_APL"."sap.pa.apl.base::BASE.T.FUNCTION_HEADER"; insert into FUNC_HEADER values ('Oid', 'MLK_PRD'); insert into FUNC_HEADER values ('LogLevel', '8'); create table FORECAST_CONFIG like "SAP_PA_APL"."sap.pa.apl.base::BASE.T.OPERATION_CONFIG_DETAILED"; insert into FORECAST_CONFIG values ('APL/Horizon', '12', null); insert into FORECAST_CONFIG values ('APL/TimePointColumnName', 'THE_DATE', null); create table VARIABLE_DESC like "SAP_PA_APL"."sap.pa.apl.base::BASE.T.VARIABLE_DESC_OID"; create table VARIABLE_ROLES like "SAP_PA_APL"."sap.pa.apl.base::BASE.T.VARIABLE_ROLES_WITH_COMPOSITES_OID"; insert into VARIABLE_ROLES values ('THE_DATE', 'input', NULL, NULL, 'MLK_PRD'); insert into VARIABLE_ROLES values ('AVERAGE_PER_DAY','target', NULL, NULL, 'MLK_PRD'); create table FORECAST_OUT like FORECAST_OUT_T; create table OPERATION_LOG like "SAP_PA_APL"."sap.pa.apl.base::BASE.T.OPERATION_LOG"; create table SUMMARY like "SAP_PA_APL"."sap.pa.apl.base::BASE.T.SUMMARY"; create table INDICATORS like "SAP_PA_APL"."sap.pa.apl.base::BASE.T.INDICATORS";
We call the APL Forecast function:
call "SAP_PA_APL"."sap.pa.apl.base::FORECAST"( FUNC_HEADER, FORECAST_CONFIG, VARIABLE_DESC, VARIABLE_ROLES, -- Inputs 'USER_APL','TS_SORTED', -- Business Data 'USER_APL','FORECAST_OUT', -- Forecasted Values OPERATION_LOG, SUMMARY, INDICATORS -- Outputs ) with overview;
We query the output tables to obtain a debrief of the trained model:
select ltrim(ltrim(key,'Model'),'Time') as Item, value from SUMMARY where KEY like 'ModelTimeSeries%' or KEY like 'ModelRecord%' UNION select KEY , cast(VALUE as VARCHAR) from INDICATORS where KEY in ('Trend','Cycles','Fluctuations');
To assess if the model did a good job at capturing all the patterns in the time series data, we will run a few tests on the residuals.
First, we compute the mean of the residuals series, with the expected value being zero:
drop view ts_residuals; create view ts_residuals as select THE_DATE, AVERAGE_PER_DAY as OBSERVED_VALUE, "kts_1" as MODEL_FORECAST, AVERAGE_PER_DAY - "kts_1" as MODEL_RESIDUALS from FORECAST_OUT where AVERAGE_PER_DAY is not null and "kts_1" is not null order by 1 asc; select avg(MODEL_RESIDUALS) as "Mean value of residuals" from ts_residuals;
Then we generate multiple lagged series from the residuals series:
drop view TS_LAGGED; create view TS_LAGGED as With S2 as ( select 1 as id from dummy union select 2 from dummy ), LAGS as ( select row_number() over() as LAG from S2, S2, S2, S2, S2, S2 ) select LAG, A.THE_DATE , A.MODEL_RESIDUALS as Y, L.MODEL_RESIDUALS as Y_LAGGED from ts_residuals A, ts_residuals L, LAGS where A.THE_DATE = add_months(L.THE_DATE, LAG) order by 1, 2 asc;
And we compute the correlation coefficient, for each lag, with the same SQL we used earlier:
It seems there is no auto-correlation left in the residuals.
Finally, we check that the residuals look like white noise, as expected in case of a good forecasting model. For that, we will use the HANA PAL function: White Noise Test. This PAL function performs a Ljung-Box test. If the series is white noise, the test returns a value of 1. If not, a value of 0 is returned.
We must put the residuals data in the format requested by the PAL function:
CREATE COLUMN TABLE PAL_WHITENOISETEST_DATA_TBL LIKE PAL_WHITENOISETEST_DATA_T; INSERT INTO PAL_WHITENOISETEST_DATA_TBL select to_integer(to_char(the_date,'YYYYMMDD')), MODEL_RESIDUALS from USER_APL.ts_residuals ;
We want to run the Ljung-Box test from lag 1 up to lag 24, with probability set to 95%:
CREATE COLUMN TABLE PAL_CONTROL_TBL( "NAME" VARCHAR (50), "INTARGS" INTEGER, "DOUBLEARGS" DOUBLE, "STRINGARGS" VARCHAR (100) ); INSERT INTO PAL_CONTROL_TBL VALUES ('LAG', 24, NULL, NULL); INSERT INTO PAL_CONTROL_TBL VALUES ('THREAD_NUMBER', 8, NULL, NULL); INSERT INTO PAL_CONTROL_TBL VALUES ('PROBABILITY', NULL, 0.95, NULL);
We call the PAL function:
CREATE COLUMN TABLE PAL_WHITENOISETEST_RESULT_TBL LIKE PAL_WHITENOISETEST_RESULT_T; CALL USER_APL.PAL_WHITENOISETEST_PROC( PAL_WHITENOISETEST_DATA_TBL, PAL_CONTROL_TBL, PAL_WHITENOISETEST_RESULT_TBL ) WITH OVERVIEW;
We display the result of the test:
select Case "VALUE" When 1 Then 'Yes' When 0 Then 'No' Else Null End as "Is White Noise ?" from PAL_WHITENOISETEST_RESULT_TBL;
To use this PAL function, don’t forget to create first the required table types and its stored procedure, as we did with the following script:
CREATE TYPE PAL_WHITENOISETEST_DATA_T AS TABLE( "IDCOL" INTEGER, "DATACOL" DOUBLE ); CREATE TYPE PAL_CONTROL_T AS TABLE( "NAME" VARCHAR (50), "INTARGS" INTEGER, "DOUBLEARGS" DOUBLE, "STRINGARGS" VARCHAR (100) ); CREATE TYPE PAL_WHITENOISETEST_RESULT_T AS TABLE( "NAME" VARCHAR (100), "VALUE" INT ); CREATE COLUMN TABLE PAL_WHITENOISETEST_SIGNATURE( "POSITION" INT, "SCHEMA_NAME" NVARCHAR(256), "TYPE_NAME" NVARCHAR(256), "PARAMETER_TYPE" VARCHAR(7) ); INSERT INTO PAL_WHITENOISETEST_SIGNATURE VALUES (1,'USER_APL', 'PAL_WHITENOISETEST_DATA_T','IN'); INSERT INTO PAL_WHITENOISETEST_SIGNATURE VALUES (2,'USER_APL', 'PAL_CONTROL_T','IN'); INSERT INTO PAL_WHITENOISETEST_SIGNATURE VALUES (3,'USER_APL', 'PAL_WHITENOISETEST_RESULT_T','OUT'); CALL SYS.AFLLANG_WRAPPER_PROCEDURE_CREATE( 'AFLPAL', 'WHITENOISETEST', 'USER_APL', 'PAL_WHITENOISETEST_PROC',PAL_WHITENOISETEST_SIGNATURE);
We are finished with our residual analysis tests. We know we have a good forecasting model. Let’s produce the forecasted values.
We transform the forecasted average quantity per day into a monthly quantity:
select F.THE_DATE, A.MILK_QTY, round(F."kts_1" * DAYOFMONTH(LAST_DAY(F.THE_DATE)),2) as FORECAST_QTY from USER_APL.FORECAST_OUT F left outer join APL_SAMPLES.MILK_PROD A on F.THE_DATE = A.MONTH_START order by 1 asc;
We show below the results of this query; the last 12 rows, corresponding to the 12-month forecast, were automatically generated by APL: