In the context of knowledge discovery from time series data, it is a common practice to work with the data sets not in their original form, but through a reduced representation, which retains the most important features of the original time series. To this end various transformations and representation formats have been proposed throughout the years. In this post I will be focusing on one particular representation format – the piecewise linear approximation (PLA) – and one transformation that can be used to obtain the PLA: the bottom-up segmentation.
A discussion of other approaches is beyond the scope of this post (perhaps I will pick up on that in a later blog) but, for anyone interested in further reading, the references at the end of the post should provide a good starting point.
What I am aiming at with this post is to illustrate how the combination of HANA, R and UI5 can be used to produce custom analytics solutions, (hopefully) without having to invest an excessive amount of time in their development.
The piecewise linear approximation is one of the most intuitive and easiest to understand representation formats applicable to time series data. It replaces the raw data (stored in a point-by-point format) with a set of consecutive linear segments that “follow” the evolution of the original time series.
The next figure provides an example of a 1000-point time series which has been approximated by 4 linear segments using this approach.
In order to obtain the PLA of a time series, a bottom-up segmentation can be used. This technique starts off by connecting each pair of adjacent points of the N-point time series, forming a set of N-1 segments. The algorithm then iteratively merges the pair of consecutive segments whose merge introduces the smallest representation error in the approximation, until some stopping criterion is met (for example, a target number of segments has been reached).
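To make the merging loop a bit more concrete, here is a minimal sketch in R. It is not the code from the Gist: the function names `interp_error` and `bottom_up` are my own, the error measure (sum of squared deviations from the straight line joining a segment's end points) is an assumption, and the stopping criterion is simply a target number of segments. For clarity it recomputes all merge costs in every iteration, which is fine for small series but not tuned for speed.

```r
# Sum of squared errors when y[i..j] is approximated by the straight line
# joining its two end points (linear interpolation, an assumed error measure).
interp_error <- function(y, i, j) {
  fitted <- seq(y[i], y[j], length.out = j - i + 1)
  sum((y[i:j] - fitted)^2)
}

# Bottom-up segmentation: returns the indices of the breakpoints that remain
# once the series has been reduced to n_segments linear segments.
bottom_up <- function(y, n_segments) {
  stopifnot(length(y) >= 2, n_segments >= 1, n_segments <= length(y) - 1)
  # Initially every point is a breakpoint, i.e. N - 1 segments.
  bp <- seq_along(y)
  while (length(bp) - 1 > n_segments) {
    # Merging two consecutive segments means removing the breakpoint they share.
    inner <- 2:(length(bp) - 1)
    cost <- vapply(inner,
                   function(m) interp_error(y, bp[m - 1], bp[m + 1]),
                   numeric(1))
    # Perform the cheapest merge.
    bp <- bp[-inner[which.min(cost)]]
  }
  bp
}

# A triangular series is described exactly by two segments.
y <- c(0, 1, 2, 3, 4, 3, 2, 1, 0)
print(bottom_up(y, 2))  # breakpoints at indices 1, 5 and 9
```

The returned breakpoints, together with the values of the series at those indices, are all that is needed to describe the approximation.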
That being said, in this blog post I will be illustrating how the bottom-up segmentation technique can be implemented using the R language and incorporated into a small native HANA application for real-time determination of a time series’ PLA.
For implementing this example you will need:
- a local R installation
- a HANA machine with R integration (unfortunately not available in the free HCP trial account)
- a basic understanding of HANA, R and UI5, although the most important code snippets are available here: Adaptive Piecewise Linear Approximation of Time Series [HANA, R, UI5] · GitHub
R-based time series segmentation
The code for performing the bottom-up segmentation is available in the GitHub Gist mentioned above (segment.R), so it can easily be sourced into your local R workspace and executed on the provided data set (data.csv).
To keep things simple, I will not go into the details of the R implementation here; only the following functions are relevant:
- segment( input_time_series, number_of_segments ): returns the segmentation model for a time series, according to the number of segments requested
- segPlot( segmentation_model ): transforms a segmentation model back to a time series, i.e. in a point-by-point format which can be used for plotting
> source('segment.R')
> rawData <- read.csv('data.csv')
> segmented <- segPlot(segment(rawData[,2], 5))
> plot(rawData)
> lines(segmented)
Executing the above instructions in the R console results in the following graphical representation of the initial time series (points), overlaid with its 5-segment-PLA (lines):
Next we will examine how the same segmentation logic can be deployed to the HANA platform, and embedded within a native app.
The figure below provides an overview of the building blocks required to get this scenario up and running.
The time series is stored in a .hdbtable composed of one index and one value column (alternatively, HANA series tables can be used). The contents of data.csv can then be imported into this (series) table.
R stored procedure
Next an R-language stored procedure has to be defined in the HANA repository which will handle the actual segmentation:
- The procedure accepts a series table and a “granularity” parameter, which defines the number of segments to be used in the approximation.
- The output of the stored procedure is a new series table containing the PLA of the original data.
The actual content of the procedure is composed of:
- The segment.R code, which defines the necessary functions for the segmentation
- 2 additional lines of code that trigger the segmentation and create the output data frame
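As a rough illustration of what those two lines might look like, here is a sketch of the R part only. Everything named here is an assumption: I assume the input table arrives as a data frame called `series` with columns `IDX` and `VAL`, the parameter as `granularity`, and the output table as `result`. To keep the snippet runnable on its own, `segment` and `segPlot` are replaced by simple stand-ins (equally spaced breakpoints and linear interpolation); in the real procedure they come from segment.R.

```r
# Stand-ins for the functions defined in segment.R (NOT the code from the Gist):
# this "segment" just places equally spaced breakpoints.
segment <- function(values, n_segments) {
  idx <- unique(round(seq(1, length(values), length.out = n_segments + 1)))
  list(index = idx, value = values[idx])
}
# Interpolate linearly between breakpoints to get one value per original point.
segPlot <- function(model) {
  approx(model$index, model$value, xout = seq_len(max(model$index)))$y
}

# Hypothetical inputs, as they might be visible inside the procedure body.
series <- data.frame(IDX = 1:9, VAL = c(0, 1, 2, 3, 4, 3, 2, 1, 0))
granularity <- 2

# The two lines in question: run the segmentation, then build the output table.
fitted <- segPlot(segment(series$VAL, granularity))
result <- data.frame(IDX = series$IDX, VAL = fitted)
```

The output data frame deliberately has the same shape as the input, so the result can be treated as just another series table.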
Exposing the data
The data is exposed in the form of an OData service through 2 entities:
Of these, the second entity returns the segmented version of the raw data of the series table, and is based on a parametric calculation view (“tsaprox”). The calculation view accepts the segmentation granularity as input, feeds it together with the series table data to the R stored procedure, and returns the result.
Using this approach, a segmentation of granularity = 5 could e.g. be retrieved by the following query:
(Some more details on exposing Calculation Views with parameters through OData can be found here.)
Finally, the raw and segmented data can be consumed through a UI5 application, like the one in the example below: the segmentation granularity is selectable through a slider control and triggers a recalculation through the R procedure. (A nice improvement here would be to store the R model for future evaluations, in order to eliminate the cost of the redundant steps performed on each roundtrip.)
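One possible way to approach the improvement mentioned in parentheses, sketched in R: since a bottom-up run towards k segments passes through every coarser-to-finer granularity above k, the order in which breakpoints are removed can be computed once and reused for any slider position. The functions `merge_order` and `breakpoints_for` are hypothetical names of mine, the error measure is the same assumed end-point interpolation error as before, and where the stored order would live on the HANA side is left open.

```r
# Assumed error measure: squared deviation from the line joining the end points.
interp_error <- function(y, i, j) {
  fitted <- seq(y[i], y[j], length.out = j - i + 1)
  sum((y[i:j] - fitted)^2)
}

# Run the bottom-up merge all the way down to a single segment and record
# the order in which breakpoints are removed. This is the expensive step.
merge_order <- function(y) {
  bp <- seq_along(y)
  removed <- integer(0)
  while (length(bp) > 2) {
    inner <- 2:(length(bp) - 1)
    cost <- vapply(inner,
                   function(m) interp_error(y, bp[m - 1], bp[m + 1]),
                   numeric(1))
    drop <- inner[which.min(cost)]
    removed <- c(removed, bp[drop])
    bp <- bp[-drop]
  }
  removed
}

# Cheap lookup: breakpoints for any granularity, derived from the stored order.
breakpoints_for <- function(removed, n_points, n_segments) {
  stopifnot(n_segments >= 1, n_segments <= n_points - 1)
  n_drop <- (n_points - 1) - n_segments
  setdiff(seq_len(n_points), head(removed, n_drop))
}

y <- c(0, 1, 2, 3, 4, 3, 2, 1, 0)
removed <- merge_order(y)                       # computed once
print(breakpoints_for(removed, length(y), 2))   # indices 1, 5 and 9
print(breakpoints_for(removed, length(y), 1))   # indices 1 and 9
```

With this split, a slider change would only need the cheap lookup; the UI5 side would stay as it is.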
The result should look similar to the one obtained in the R console (e.g. here is the original time series and its 5-segment approximation, side-by-side):