Identification of Seasonality in Time Series with Python Machine Learning Client for SAP HANA

xinchen · 12-18-2020

1. Introduction

Seasonality is a crucial characteristic of a time series. The SAP HANA Predictive Analysis Library (PAL) provides a method for seasonal decomposition. This method is also wrapped in the Python Machine Learning Client for SAP HANA (hana-ml), which offers a seasonality test and then decomposes the time series into three components: trend, seasonal, and random.

In this blog post, you will learn:

The definition of seasonality and why it is necessary to decompose a time series.
How to apply the seasonal_decompose() function of hana-ml to analyze two typical real-world time series examples.

1.1 Definition

Seasonality is a characteristic of a time series where the data experiences changes that recur at regular, predictable intervals, such as weekly or monthly. Seasonal behavior differs from cyclic behavior because seasonality always has a fixed and known period, while cyclic behavior, e.g., a business cycle, does not. Seasonality can be used to help analyze stocks and economic trends. For instance, companies can use seasonality to inform business decisions such as inventory and staffing levels.

1.2 Why we decompose the time series

In time series analysis and forecasting, we usually consider the data as a combination of trend, seasonality, and noise. We can form a forecasting model by capturing each of these components well. Typically, there are two decomposition models for time series: additive and multiplicative. The additive model is useful when the seasonal variation is relatively constant over time, whereas the multiplicative model is useful when the seasonal variation changes in proportion to the level of the series, for example increasing as the trend rises.
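To make the difference concrete, the additive model can be written as y_t = T_t + S_t + R_t and the multiplicative model as y_t = T_t × S_t × R_t, where T, S, and R are the trend, seasonal, and random components. The following sketch is only an illustration in plain Python: the trend and seasonal functions are hypothetical, the random component is omitted, and nothing here calls hana_ml. It measures the seasonal swing around the trend at the start and at the end of a synthetic monthly series.

```python
import math

PERIOD = 12  # hypothetical yearly period for monthly data


def trend(t):
    """Hypothetical linear trend T_t."""
    return 100.0 + 2.0 * t


def seasonal(t):
    """Hypothetical seasonal shape: one sine wave per period."""
    return math.sin(2 * math.pi * t / PERIOD)


def additive(t):
    # y_t = T_t + S_t, seasonal amplitude fixed at 10 units
    return trend(t) + 10.0 * seasonal(t)


def multiplicative(t):
    # y_t = T_t * S_t, seasonal factor fluctuates around 1 by +/- 10 %
    return trend(t) * (1.0 + 0.1 * seasonal(t))


def seasonal_swing(model, start):
    """Peak-to-trough deviation from the trend within one period."""
    deviations = [model(t) - trend(t) for t in range(start, start + PERIOD)]
    return max(deviations) - min(deviations)


for name, model in (("additive", additive), ("multiplicative", multiplicative)):
    first = seasonal_swing(model, 0)
    last = seasonal_swing(model, 108)
    print(f"{name}: first-year swing {first:.1f}, tenth-year swing {last:.1f}")
```

In the additive series the swing stays the same in both years, while in the multiplicative series it grows together with the trend. This is the visual cue to look for when choosing between the two models.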

Real-world problems are messy and noisy: the trend may not be monotonic, and the real model could have both additive and multiplicative components. Nevertheless, these decomposition models provide us with a structured and simple way to analyze and forecast the data. Hence, identifying the seasonality in a time series can help you build a better model. It helps in the following ways:

Data cleaning: Removing the seasonal component gives you a clearer relationship between input and output.
Interpretability: The decomposed components provide more information about the structure of the time series.

The seasonal_decompose() function of hana_ml works in two phases:

1. Seasonality Test: The seasonal_decompose() function tests whether a time series exhibits seasonality by removing the trend and identifying the seasonality through the calculation of the autocorrelation function (acf). The output includes the period (the number of observations in one seasonal cycle), the type of model (additive/multiplicative), and the acf value at that period.

2. Seasonal Decomposition: Based on the model identified in the seasonality test phase, the trend, seasonal, and random components are determined.
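The second phase can be illustrated with a textbook additive decomposition. The sketch below is not PAL's implementation, whose internals are not described in this post. It is a minimal plain-Python version that assumes the period is already known from the first phase, estimates the trend with a centered moving average, and averages the detrended values at each position of the cycle. All function names are hypothetical.

```python
import math


def centered_moving_average(y, period):
    """Trend estimate; None where the window does not fit at the edges."""
    n = len(y)
    half = period // 2
    trend = [None] * n
    for i in range(half, n - half):
        window = y[i - half:i + half + 1]
        if period % 2 == 0:
            # Even period: give the two end points half weight each
            total = sum(window) - 0.5 * (window[0] + window[-1])
        else:
            total = sum(window)
        trend[i] = total / period
    return trend


def decompose_additive(y, period):
    """Split y into trend, seasonal, and residual parts (y = T + S + R)."""
    trend = centered_moving_average(y, period)
    # Collect detrended values by their position within the cycle
    buckets = [[] for _ in range(period)]
    for i, (value, t) in enumerate(zip(y, trend)):
        if t is not None:
            buckets[i % period].append(value - t)
    means = [sum(b) / len(b) for b in buckets]
    centre = sum(means) / period  # make the seasonal part sum to zero
    seasonal = [means[i % period] - centre for i in range(len(y))]
    residual = [None if t is None else value - t - s
                for value, t, s in zip(y, trend, seasonal)]
    return trend, seasonal, residual


# Synthetic monthly data: linear trend plus a yearly sine wave
data = [50 + 0.5 * t + 5 * math.sin(2 * math.pi * t / 12) for t in range(120)]
trend, seasonal, residual = decompose_additive(data, period=12)
print(round(trend[60], 3), round(seasonal[3], 3))
```

Because the synthetic data contains no noise, the residual is close to zero wherever the trend is defined. On real data, the residual holds whatever the trend and seasonal parts do not explain.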

Overall, the seasonal_decompose() function of hana_ml provides an easy and quick method to identify seasonality and decompose the time series. In the following sections, we will demonstrate how to use this function to analyze two real-world datasets.

2. Solutions

In this section, the U.S. gasoline retail sales and New York taxi passengers cases are analyzed.
All source code uses the Python Machine Learning Client for SAP HANA (hana-ml), which wraps the SAP HANA Predictive Analysis Library (PAL).

2.1 Set Up a Connection to SAP HANA

First, we need to establish a connection to SAP HANA. After that, we can utilize various functions of hana_ml to perform data analysis. Here is an example:

>>> import hana_ml
>>> from hana_ml import dataframe
>>> conn = dataframe.ConnectionContext('host', 'port', 'username', 'password')

Please replace 'host', 'port', 'username', and 'password' with your SAP HANA instance details.

2.2 Use Cases

2.2.1 Case 1: US Gasoline Retail Sales

Dataset link: https://www.eia.gov/dnav/pet/hist/LeafHandler.ashx?n=PET&s=A103600001&f=M

This dataset includes the monthly data of U.S. Total Gasoline Retail Sales by Refiners (in Thousand Gallons per Day) from January 1983 to July 2020. The dataset has two columns: Date and Sales, and contains 451 data points.

The figure below illustrates the variation in the dataset, and we can observe a potential yearly pattern. From 2008 to 2015, there is a significant decrease in sales. Considering the timing, we speculate that the drop could be attributed to the 2008 economic crash, which had a pronounced negative impact on the oil and gas industry. Looking at the data for 2020, there is a steep decline in early 2020, which may be due to the lockdowns imposed during the COVID-19 pandemic in the US.

Fig. 1. U.S. Total Gasoline Retail Sales by Refiners

The dataset has been imported into SAP HANA under the table name "GASOLINE_TBL". Therefore, we can access the dataset using the dataframe.ConnectionContext.table() function. Next, we add a column named 'ID' to the original DataFrame, gasoline_df, as the seasonal_decompose() function requires an integer column as a key column.

>>> gasoline_df = conn.table("GASOLINE_TBL") # Access to the data table
>>> gasoline_df = gasoline_df.add_id('ID') # Add ID column
>>> print(gasoline_df.head(5).collect()) # Show the first 5 rows of gasoline_df

First, because seasonality shows up as a high autocorrelation at the lag equal to the period, we invoke the plot_acf() function to display the autocorrelation (acf). The result is shown in Fig. 2.

>>> from hana_ml.visualizers.eda import plot_acf
>>> plot_acf(data=gasoline_df, key='ID', col='Sales', method='fft', thread_ratio=0.4, enable_plotly=False)

Fig. 2. Autocorrelation plot of gasoline_df data

Because we assumed that the monthly data follows a yearly pattern, we expected a high acf value at lag 12. However, the time series is not stationary, and the significant decline from 2008 to 2015 strongly affects the acf values. As a result, the acf curve simply decreases as the lag grows and does not confirm the expected peak at lag 12. Hence, to identify the seasonality, it is necessary to eliminate the trend in the data first.
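The masking effect of a trend on the acf can be reproduced on synthetic data. The sketch below uses plain Python and a hypothetical monthly series with a strong linear trend and a yearly sine wave. It removes the trend by first-order differencing, which is only one simple detrending technique and not necessarily what PAL does internally.

```python
import math


def acf(x, lag):
    """Sample autocorrelation of x at the given lag."""
    n = len(x)
    mean = sum(x) / n
    denom = sum((v - mean) ** 2 for v in x)
    num = sum((x[i] - mean) * (x[i + lag] - mean) for i in range(n - lag))
    return num / denom


def difference(x):
    """First-order differencing: removes a linear trend."""
    return [b - a for a, b in zip(x, x[1:])]


# Hypothetical series: 20 years of monthly data, trend + yearly seasonality
series = [0.5 * t + 5 * math.sin(2 * math.pi * t / 12) for t in range(240)]

lags = range(1, 25)
raw = [acf(series, k) for k in lags]
detrended = [acf(difference(series), k) for k in lags]

best_raw = max(lags, key=lambda k: raw[k - 1])
best_detrended = max(lags, key=lambda k: detrended[k - 1])
print("lag with highest acf, raw series:      ", best_raw)
print("lag with highest acf, detrended series:", best_detrended)
```

On the raw series the acf is dominated by the trend and is highest at the shortest lag, whereas after detrending the highest value appears at the seasonal lag of 12.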

To address this, the hana_ml library offers the seasonal_decompose() function, which conducts a seasonality test while accounting for the impact of the trend. The function is invoked as shown in the code below and returns a list of two dataframes. The first dataframe provides the statistics, such as the type of decomposition and the acf value corresponding to the period. The second dataframe contains the three decomposed components: seasonality, trend, and random.

>>> from hana_ml.algorithms.pal.tsa.seasonal_decompose import seasonal_decompose
>>> stats, decompose = seasonal_decompose(data=gasoline_df, endog='Sales', key='ID')
>>> print(stats.collect())
>>> print(decompose.collect())

From the result of stats, we can see that the detected period is 12 and the type of decomposition model is additive. hana_ml also provides a plot_seasonal_decompose() function, which we use to visualize the three decomposed components in Fig. 3.

>>> from hana_ml.visualizers.eda import plot_seasonal_decompose
>>> plot_seasonal_decompose(data=gasoline_df, key='ID', col='Sales', enable_plotly=False)

Fig. 3. Seasonal decomposition results of gasoline_df

2.2.2 Case 2: New York Taxi Passengers

Dataset Link: https://github.com/numenta/NAB/blob/master/data/realKnownCause/nyc_taxi.csv

This dataset describes the number of NYC taxi passengers over seven months, from July 2014 to January 2015. It contains five known anomalies, which occur during the NYC marathon, Thanksgiving, Christmas, New Year's Day, and a snowstorm. The raw data is from the NYC Taxi and Limousine Commission. The data file aggregates the total number of taxi passengers into 30-minute buckets. It has two columns, timestamp and value (the number of passengers), and 10320 instances.

The dataset has been imported into SAP HANA under the table name "TAXI_TBL". As in Case 1, it is accessed as a DataFrame, taxi_df, with an integer 'ID' column as the key. A sample of the first 5 rows of data and a plot of the first 1000 instances are shown below.

Fig. 4. The first 1000 rows plot of New York Taxi Passengers Dataset

From Fig. 4, it seems that the number of taxi passengers follows a daily and a weekly pattern. Hence, we calculate the acf with the code below and obtain the acf plot in Fig. 5.

>>> from hana_ml.algorithms.pal.tsa.correlation_function import correlation
>>> correlation(data=taxi_df, key='ID', x='value', max_lag=1500).collect()

Fig. 5. Acf plot of taxi dataset

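We invoke seasonal_decompose() and find that the detected period is 336, the lag with the highest acf value. With 30-minute buckets, 336 observations span exactly one week (48 buckets per day × 7 days), so the dominant pattern is weekly.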
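The arithmetic behind the period of 336 and the reason a weekly lag can beat a daily lag can be checked with a short plain-Python sketch. The demand function below is hypothetical: a daily sine wave plus a drop on two days of each week. It is not derived from the taxi data and does not call hana_ml.

```python
import math

BUCKET_MINUTES = 30
buckets_per_day = 24 * 60 // BUCKET_MINUTES   # 48
buckets_per_week = 7 * buckets_per_day        # 336
days_in_dataset = 10320 // buckets_per_day    # 215


def acf(x, lag):
    """Sample autocorrelation of x at the given lag."""
    n = len(x)
    mean = sum(x) / n
    denom = sum((v - mean) ** 2 for v in x)
    num = sum((x[i] - mean) * (x[i + lag] - mean) for i in range(n - lag))
    return num / denom


def synthetic_demand(t):
    """Hypothetical demand: daily cycle plus a drop on days 5 and 6."""
    day_of_week = (t // buckets_per_day) % 7
    daily = 10.0 * math.sin(2 * math.pi * t / buckets_per_day)
    weekend_drop = -15.0 if day_of_week >= 5 else 0.0
    return daily + weekend_drop


# Eight weeks of half-hourly observations
series = [synthetic_demand(t) for t in range(8 * buckets_per_week)]
daily_acf = acf(series, buckets_per_day)
weekly_acf = acf(series, buckets_per_week)
print(buckets_per_day, buckets_per_week, days_in_dataset)
print("weekly lag wins:", weekly_acf > daily_acf)
```

A lag of one day lines up the daily cycle but compares weekdays with weekend days, whereas a lag of one week lines up both patterns. This is consistent with the weekly period reported for the taxi data.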

>>> stats, decompose = seasonal_decompose(data=taxi_df, endog='value', key='ID')
>>> print(stats.collect())
>>> print(decompose.collect())

The decomposed components of the taxi dataset are shown in Fig. 6.

Fig. 6. Decomposed components of taxi dataset

3. Summary

In this blog post, we described what seasonality is and how to analyze and decompose a time series with the seasonal_decompose() function of hana-ml. If you want to learn more about seasonal_decompose() and the SAP HANA Predictive Analysis Library (PAL), please refer to the following links:
hana-ml seasonal_decompose documentation
SAP HANA Predictive Analysis Library (PAL) Seasonality Test manual
