Automated Descriptive Statistics Report with SAP HANA Cloud and Python
For Solution Advisors who occasionally receive customer datasets to analyze, running descriptive statistics to summarize, describe, and visualize the data is often the first step.
In this blog, I will highlight two ways the hana_ml Python library can automatically generate visualizations your SAP HANA Dataset to help you quickly understand your data. These can serve as handy tools in the Solution Advisor’s arsenal by providing quick and easy ways to generate useful visualizations and reports on your dataset. These reports will provide charts showing:
- Missing values %
- High cardinality %
- Numeric distributions
- Categorical distributions
I like these reports because they are easy to use and require no Python or ML experience. I posted a blog on generating descriptive statistics as a HANA table but these reports provide some nice visualizations of the dataset, so hopefully you will find this useful as well.
(Optional) Upload Table to HANA Cloud
The report generators work on tables in SAP HANA so if the table you want descriptive statistics on is already there you can skip this step. However, if your table is not yet in SAP HANA we can easily bring it into SAP HANA as well.
For this tutorial, we will use the famous Bank Marketing dataset to make it easy to follow. We’ll upload this dataset to SAP HANA Cloud and generate the two types of reports available on this data.
# Create connection to HANA Cloud import hana_ml.dataframe as dataframe # Instantiate connection object conn = dataframe.ConnectionContext(address = '<Your HANA tenant info.hanacloud.ondemand.com>', port = 443, user = '<USERNAME', password = "<PASSWORD>", encrypt = 'true', sslValidateCertificate = 'false') # Display HANA version to test connection print('HANA version: ' + conn.hana_version())
First we need to read in the file as a Pandas dataFrame. To create the report we’ll need an ID column so we can use the index for this.
df = pd.read_csv('../bank/bank-full.csv', sep=';').reset_index(drop=False)
We’ll use the hana_ml library to create a HANA table from our .csv file. This also creates a pointer to that table (“hana_df”).
# Upload Pandas dataframe to HANA Cloud hana_df = dataframe.create_dataframe_from_pandas(connection_context = conn, pandas_df = df, table_name = 'BANKMARKETING', force = True)
Now that we have an SAP HANA table we can work with, let’s create the visualizations!
Use Profiler to create in-notebook visualizations
If you need a few charts to cut-paste into a PPT, the Profiler is a great option. You just need to point it at an SAP HANA dataset and it will create a series of charts on the data types, missing values, cardinality, and distributions of all numeric fields.
To use the Profiler, create an instance and provide the SAP HANA dataset and column in the dataset that serves as the unique ID. Running p.description() starts the process of analyzing the dataset. Once the analysis is ready, you can plot the charts by calling the pointer (“fig”).
# Import Profiler class from hana_ml.visualizers.eda import Profiler # Create analysis. # data = SAP HANA dataFrame # key = index column (e.g. unique ID) p = Profiler() fig = p.description(data=hana_df, key="index") # Render plot fig
Use DatasetReportBuilder if you want a report (HTML/iFrame)
If you prefer a report you can email to colleagues or others, the DatasetReportBuilder may fit the bill. It lets you create either a HTML report file or an iFrame report in your browser.
The generate_notebook_iframe_report() creates the dataset report in your jupyter notebook window:
from hana_ml.visualizers.dataset_report import DatasetReportBuilder datasetReportBuilder = DatasetReportBuilder() datasetReportBuilder.build(hana_df, key="index") datasetReportBuilder.generate_notebook_iframe_report()
Alternatively, this report can be saved as an HTML file:
from hana_ml.visualizers.dataset_report import DatasetReportBuilder datasetReportBuilder = DatasetReportBuilder() datasetReportBuilder.build(hana_df, key="index") datasetReportBuilder.generate_html_report('BANKMARKETING REPORT.html')
As you can see, there are several ways to leverage the power of HANA to accelerate descriptive analysis on unfamiliar datasets. I hope this blog was helpful for you in better understanding of the hana_ml library and the visualizers within.
If you have any questions/comments please let me know, especially if you have trouble replicating what was shown. Thanks for your interest in this topic!