New Series Analytic SQL Functions with SAP HANA SPS09
A series is defined as a sequence of data points made over a specific interval, most often a time interval. Examples of series are average levels of CO2 in the atmosphere, daily closing values of the stock market and daily precipitation levels.
Time series analysis is the practice of analyzing time series data in order to extract meaningful information. Just about every organization can benefit from harvesting this information in order to better understand reality: from financial applications striving to better support decision making, to heavy industry companies looking to better predict service needs, to retail organizations looking to improve by describing typical seasonality behavior.
Among the many new features with SPS09, SAP HANA introduced a number of powerful series analytic SQL functions. This article will introduce and explain these functions, as well as give general recommendations on their applicability.
The SERIES clause
As of SAP HANA SPS09, the CREATE TABLE statement has been enhanced to support a new SERIES clause. This clause indicates that the table will contain series data. A series table is like any other table but has additional properties. These allow for more efficient storage and takes particular advantage of the columnar nature of SAP HANA.
Series analytics functions
Getting the man in the middle – MEDIAN
The median value is very well known and does not need much further explanation. In SAP HANA, null values are eliminated when calculating the median value and the middle value (the average of the two middle values in the case the number of elements is even) is returned.
Linear dependence between two variables – CORR
When using the CORR function, SAP HANA will calculate the Pearson product momentum correlation coefficient – often simply called the correlation coefficient, or r – between two variables. The output is a value between –1 and 1.
A negative correlation indicates that as one of the variables increases, the other one decreases. An example of this would be the correlation between vaccinations and illness: as more people are vaccinated for a given illness, the less this illness will occur.
Similarly, a positive correlation means that an increase in one variable will also increase the other variable. As an example, consider the relationship between smoking and lung disease.
No correlation means that an increase of one variable will not reliably affect the other variable, and that the elements are really just randomly distributed.
Rank correlation between two variables – CORR_SPEARMAN
Sometimes, two variable are obviously correlated – just not as a line in a graph.
For these situations, we are interested in measuring correlation based on how a ranking of a data point in one series is related to the ranking of a data point in another series.
This is measured by the Spearman’s Rho correlation, which is supported in SAP HANA by the CORR_SPEARMAN function. Considering the graph above, the CORR value is 0.88, indicating a strong corrrelation, but not perfect (a line would not be a perfect fit). The CORR_SPEARMAN value, however, is a perfect 1 since each consecutive value ranks the same on both the x and the y axis.
A common issue in statistical analysis is the existence of outliers. An outlier is a data point which does not fit the general trend of the data. There are multiple reasons for the occurrence of outliers, and they can have a significant impact on the calculated correlation. The Spearman correlation will, by it’s nature, be much less affected by the existence of outliers.
When analyzing correlations between variables, it is important to note that the value does not indicate the strength of the correlation, only its directional value. To know how strong the measured correlation coefficient is, you need to also take into account the sample size (the larger the sample size, the more we can trust calculated correlations). This is known as the its significance. Another way of expressing this is that the stronger the significance, the less likely the correlation is to happen by chance.
This is a measure based on the sample size and must be calculated in order to draw a meaningful conclusion.
If it is calculated that there is less than a one in twenty chance (5%) that a calculated correlation can happen by chance, the findings are said to be significant. If there is less than a one in one hundred chance (1%), the findings are designated as highly significant. Currently, the significance is not automatically calculated by the SAP HANA analytical series SQL functions.
SPS09 is the first release of SAP HANA to support SQL-level analytical series functions. Over time, we can expect more support for additional functionality, such as direct support for significance calculations and other measures such as Kendall’s Tau correlation – a different correlation measure which is more sensitive to many, but smaller, rank differences. Kendall’s Tau is a better measure when we are looking to measure overall conformance but are not too concerned about one or two major discrepancies. By contrast, Spearman’s Rho is very sensitive to singular, large variations.
Full support for series analytics can be found in the R language, which is also supported by SAP HANA. To take advantage of the full, blazing speed of in-memory analytics, however, we need to use the native SQL functions described in this article.
As the SAP HANA platform continues to evolve, I will update this blog with information about new capabilities for lightning fast series analytics!