Technical Articles
Pivoting Data with SAP HANA
For a long time now we have been waiting to get pivoting functionality with SAP HANA. This was not possible and there have been several attempts to get same functionality with custom sql codes. In this blog post I will describe how pivoting can be achieved via SAP HANA Python Client API for Machine Learning Algorithms (hana_ml package) without the moving data out of SAP HANA.
With hana_ml version 1.0.7 we can now pivot the data easily and intuitively. This functionality is only supported via hana_ml and not in SAP HANA SQL directly, as the hana_ml wrapper generates the required sql code on the fly.
Lets say I have a SAP HANA table which has sample data as below and we access it via hana_ml dataframe df_data:
The number of different SKUs is 26
Now we want to run a correlation analysis between the different SKUs to check if their sales are potentially cannibalizing or positively impacting each other.
To do this we need to pivot the data so we have a column for each of the 26 SKUs containing the SALES_VALUE.
#Create the pivoted data as a hana_ml dataframe with pivot_table
dp = df_data.pivot_table(values='SALES_VALUE',index = 'DATE_MEASUREMENT', columns='SKU',aggfunc='sum')
#Format the data to fillin zero and have the right datatype for HANA PAL correlation function
dp = dp.fillna(0,SKUS).cast(SKUS, 'DOUBLE')
This generates a hana_ml dataframe which has SALES_VALUE for each of the 26 SKUs in a column format:
Now we can easily get a correlation matrix between all SKUs as follows, here I am choosing to show only the top 10 SKUs based on their SALES_VALUE but you can run the same analysis for all SKUs very efficiently:
#Focus on only top 10 SKUs
SKUS = list(set(top_10_skus.select('SKU').collect()['SKU']))
from hana_ml.algorithms.pal import stats
corr = stats.pearsonr_matrix(conn_context=conn, data=dp, cols = SKUS)
corr.collect()
This provides the following hana_ml dataframe with correlation coefficient between every pair of SKUs.
We can also easily plot this as a matrix using hana_ml.visualizers package
Many other plotting capabilities with hana_ml.visualizers is described very nicely in Arun Godwin Patel’s blog post
from hana_ml import visualizers
from hana_ml.visualizers.eda import EDAVisualizer
import matplotlib.pyplot as plt
get_ipython().magic('matplotlib inline')
f = plt.figure(figsize=(10,10))
ax = f.add_subplot()
eda = EDAVisualizer(ax)
ax, corr_data = eda.correlation_plot(data=dp, corr_cols =SKUS, label=True, cmap='RdYlGn')
plt.show()
This produces the following correlation matrix.
Finally we now have a way of pushing down pivoting functionality to SAP HANA without moving data outside or writing customized SQL.
A very well illustrated example of the HANA Python ML library. Thanks for sharing this.
What I find irritating is that the article suggests, that HANA generally now provides an actual solution to the lack of the PIVOT-functionality that is present in other major DBMS platforms.
I'd say that most developers would expect functionality similar to the PIVOT/UNPIVOT (Oracle, MS SQL Server) or CROSSTAB (Postgresql) in HANA when SAP announces that it is finally possible to pivot data with push-down and without customized SQL.
But all this is only available if you happen to use Python and the hana_ml library for your data frames (which makes me wonder how many developers gladly change from Pandas to hana_ml library for a such a feature).
Technically it is, of course, correct, that the automatically generated SQL statement is "pushed-down" to HANA and the resultset is computed there (without moving the data outside). But such word mincing leaves a taste of disappointment after reading that the PIVOT option is now available in HANA.
I agree with your observation which is why I stated at the very start that the functionality is only available via hana_ml library, with no intention of mincing any words.
I appreciate your comments and will try to incorporate in future work so you are not disappointed.
You do not need to choose between pandas and hana_ml dataframes. They are interchangeable and all you need to do to transform a hana df into a pandas df is .collect().
The difference here is that if you are using HANA as your data source for ML algorithms and originally had to pull all the data into python to pivot the data set as part of the data preparation steps, you can now delegate all or most of the data prep to HANA itself, and just .collect() on the fully prepared data. Or you might also not even need to .collect() at all if the algos you want to use are supported by PAL or APL. It's definitely an improvement worth sharing, for ML-related use cases.
This is fascinating, but I am discouraged because you note that the way to pivot data is through hana_ml, using Python code, and that the actual SQL code is generated on the fly.
From the HANA documentation, I can see no way so access hana_ml and dataframes directly from SQLScript via tools like hdbsql or the HANA Database Explorer, or even from a HANA stored procedure. Is that correct? Do I need to install Python and write OS-level Python scripts to pivot data in the manner you describe? I did find a diagram with a line from “SQL Client” to “APL” and “PAL” but I am not sure if this is relevant.
Thanks, Mark
OK, after many hours of searching links, I think I have my answer. Arun Goodwin Patel sums it up in his post Conquer your data with the HANA DataFrame – Exploratory Data Analysis with this picture:
I have confirmed this with several other sources, and conclude that the pivot method described in this post is only available through a python interface. I think.
Hi Mark,
Yes that is correct, the pivoting described above only works through the python interface. One can get the sql generated but I understand this is not what one would want if you are in the sql world. The sql is also custom to the data frames and handles so not a generic database way of pivoting.
Kind regards,
Nidhi