Happy World Statistics Day with hana_ml 2.6!
Today, October 20th, is World Statistics Day, celebrated under the theme “Connecting the world with data we can trust”. The day is celebrated only every five years, so the next chance will be in 2025!
Last week, version 2.6 of the Python Machine Learning Client for SAP HANA (aka hana_ml) was released. So what better way to celebrate World Statistics Day than to look at one of the new features in that release, one that helps you understand, and therefore trust, the data you have? In this post, we will look at the new Dataset Report feature, which generates a report by profiling the data in an SAP HANA table.
I am going to run this example using Jupyter Lab…
…and, as with all experiments, I will run it in a separate Docker container.
docker run -p 8888:8888 --detach -v ~:/home/jovyan/work --name hmlsandbox01 -e GRANT_SUDO=yes -e JUPYTER_ENABLE_LAB=yes jupyter/minimal-notebook start-notebook.sh --LabApp.token=''
docker exec hmlsandbox01 pip install hana_ml==2.6.*
You may notice I added --LabApp.token='' so that Jupyter Lab is not secured. I did that on purpose, because it runs only on my laptop for my own experiments.
The Dataset Report also requires ipywidgets, so I need to follow the steps from: https://ipywidgets.readthedocs.io/en/stable/user_install.html
docker exec hmlsandbox01 conda install -c conda-forge ipywidgets
docker exec hmlsandbox01 jupyter labextension install @jupyter-widgets/jupyterlab-manager
docker restart hmlsandbox01
These installation steps take a few minutes.
Let’s open Jupyter Lab…
…and check the installed version:
import hana_ml
hana_ml.__version__
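If you want the notebook to stop early when a different release is installed, a small guard can help. This is only a sketch: the helper name matches_release is hypothetical, and it assumes the version string starts with a dotted major.minor prefix.

```python
def matches_release(version: str, major: int, minor: int) -> bool:
    """Return True if a dotted version string starts with major.minor."""
    parts = version.split(".")
    if len(parts) < 2:
        return False
    try:
        return int(parts[0]) == major and int(parts[1]) == minor
    except ValueError:
        # Non-numeric components: treat as "not the expected release".
        return False


# Usage in the notebook (needs hana_ml installed):
# import hana_ml
# assert matches_release(hana_ml.__version__, 2, 6), hana_ml.__version__
```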
The dataset I am going to use is the one from last week’s post “Low-code data analysis application with SAP HANA push-down” by Dmitry Buslov and Andreas Forster. It is Mall_Customers.csv from Kaggle, which I loaded by following the steps described in their post.
Please note that I am using a productive SAP HANA Cloud instance, as the trial does not yet include the scriptserver service at the time of writing. Without the scriptserver, SAP HANA embedded ML libraries like PAL and APL cannot run.
cchc = hana_ml.dataframe.ConnectionContext(
    port=443,
    address='<uuid>.hana.prod-eu20.hanacloud.ondemand.com',
    user='<user_name>',
    password='<MySuperSecretPassword>',
    encrypt=True
)
hdf_mallcustomers = cchc.table('MALL_CUSTOMERS')
hdf_mallcustomers.collect()
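If you would rather not keep the password in the notebook, one option is to read the connection details from environment variables. The sketch below only builds the keyword arguments used in the call above. The helper name and the variable names HANA_ADDRESS, HANA_PORT, HANA_USER and HANA_PASSWORD are hypothetical, and you would have to pass them into the container yourself.

```python
import os


def hana_connection_kwargs(env=None):
    """Build ConnectionContext keyword arguments from environment variables.

    The variable names are hypothetical; pick whatever fits your setup.
    """
    env = os.environ if env is None else env
    required = ("HANA_ADDRESS", "HANA_USER", "HANA_PASSWORD")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {
        "address": env["HANA_ADDRESS"],
        "port": int(env.get("HANA_PORT", "443")),  # 443 as in the example above
        "user": env["HANA_USER"],
        "password": env["HANA_PASSWORD"],
        "encrypt": True,
    }


# Usage in the notebook (needs hana_ml and a reachable instance):
# import hana_ml
# cchc = hana_ml.dataframe.ConnectionContext(**hana_connection_kwargs())
```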
Run Dataset Report…
from hana_ml.visualizers.dataset_report import DatasetReportBuilder
drb_mall = DatasetReportBuilder()
drb_mall.build(data=hdf_mallcustomers, key='ID')
…and once all build steps are completed…
…check the interactive output
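For reference, here is a rough sketch of how the output step could be wrapped. It relies on the two DatasetReportBuilder methods mentioned in the comments to this post, generate_notebook_iframe_report and generate_html_report. That the latter takes a file name as its argument is my assumption, and the helper name render_report is hypothetical.

```python
def render_report(builder, html_filename=None):
    """Show a built dataset report inline, or write it to an HTML file.

    `builder` is expected to be a DatasetReportBuilder on which build()
    has already completed.
    """
    if html_filename is None:
        # Inline iframe in the notebook output cell.
        builder.generate_notebook_iframe_report()
        return "notebook"
    # Assumption: the method accepts the target file name.
    builder.generate_html_report(html_filename)
    return "html"


# Usage in the notebook:
# render_report(drb_mall)                      # interactive output
# render_report(drb_mall, "mall_customers")    # standalone HTML file
```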
The generated report now gives you a more detailed view of the data! Enjoy!
The next World Statistics Day is in five years, but I plan to dig into the new capabilities of this version of hana_ml, especially the support for multi-model analytics, as early as next week 😉
-Vitaliy (aka @Sygyzmundovych)
Good one, Witalij!
I tried this on a 35K dataset with 16 features. The report was generated successfully, but I couldn't manage to open it. Every time I try to open the Jupyter notebook or the generated HTML report file, the browser freezes. (I tried both ways, i.e. generate_html_report as well as generate_notebook_iframe_report, without any luck.)
Do you know of any limits on the size of the dataset this report can be used for?