Happy World Statistics Day with hana_ml 2.6!

Today, October 20th, is World Statistics Day, celebrated this year with the theme “Connecting the world with data we can trust”. The day is celebrated only every five years, so the next chance will be in 2025!

Last week the new version 2.6 of the Python Machine Learning Client for SAP HANA (aka hana_ml) was released. So, what better way to celebrate World Statistics Day than to look at one of the new features of that release, one that helps you understand, and therefore trust, the data you have? In this post, we will look at the new Dataset Report, which profiles the data in an SAP HANA table and generates a report from it.

I am going to run this example using Jupyter Lab…

…and, as with all experiments, I will run it in a separate Docker container.

docker run -p 8888:8888 --detach -v ~:/home/jovyan/work --name hmlsandbox01 -e GRANT_SUDO=yes -e JUPYTER_ENABLE_LAB=yes jupyter/minimal-notebook start-notebook.sh --LabApp.token=''
docker exec hmlsandbox01 pip install hana_ml==2.6.*

You may notice I added --LabApp.token='', which disables token authentication in Jupyter Lab. I did it on purpose: the container runs only on my laptop, for my own experiments.

The Dataset Report also requires ipywidgets, so I need to follow the installation steps from: https://ipywidgets.readthedocs.io/en/stable/user_install.html

docker exec hmlsandbox01 conda install -c conda-forge ipywidgets
docker exec hmlsandbox01 jupyter labextension install @jupyter-widgets/jupyterlab-manager@2.0
docker restart hmlsandbox01

These installation steps take a few minutes.

Let’s open Jupyter Lab…

…and check that the installed version is 2.6.

import hana_ml
hana_ml.__version__
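If you prefer a check that also works outside a notebook cell, here is a minimal sketch. The helper name is_release is mine, not part of hana_ml, and I am assuming the package metadata is registered under the same name used in the pip install command above.

```python
from importlib.metadata import PackageNotFoundError, version


def is_release(installed: str, wanted: str = "2.6") -> bool:
    """Return True if `installed` belongs to the `wanted` major.minor series."""
    return installed.split(".")[:2] == wanted.split(".")[:2]


try:
    # Assumption: the distribution is registered as "hana_ml" (as in `pip install hana_ml`).
    installed = version("hana_ml")
    print(installed, "OK" if is_release(installed) else "is not a 2.6 release")
except PackageNotFoundError:
    print("hana_ml is not installed in this environment")
```

Comparing only the first two version components keeps the check valid for any 2.6 patch level.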

The dataset

The dataset I am going to use is the one from last week’s post “Low-code data analysis application with SAP HANA push-down” by Dmitry Buslov and Andreas Forster. It is Mall_Customers.csv from Kaggle, which I have loaded by following the steps described in their post.

Please note that I am using a productive SAP HANA Cloud instance, as the trial does not yet include the scriptserver service at the time of writing. Without the scriptserver, the embedded ML libraries of SAP HANA, like PAL and APL, cannot run.
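If you are not sure whether your instance has the scriptserver, one way to check is to look for it in the list of services. This is only a sketch: the helper has_scriptserver is hypothetical, conn stands for a ConnectionContext like the one created in the next step, and I am assuming your user is allowed to read the SYS.M_SERVICES monitoring view.

```python
# Assumption: SYS.M_SERVICES lists the scriptserver under this service name.
SCRIPTSERVER_QUERY = (
    "SELECT SERVICE_NAME FROM SYS.M_SERVICES "
    "WHERE SERVICE_NAME = 'scriptserver'"
)


def has_scriptserver(conn) -> bool:
    """Return True if the instance behind `conn` reports a scriptserver service."""
    # conn.sql() wraps the query in a hana_ml DataFrame; count() runs in SAP HANA.
    return conn.sql(SCRIPTSERVER_QUERY).count() > 0


# Usage, once the connection exists:
# has_scriptserver(cchc)
```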

cchc=hana_ml.dataframe.ConnectionContext(port=443,
                                         address='<uuid>.hana.prod-eu20.hanacloud.ondemand.com', 
                                         user='<user_name>',
                                         password='<MySuperSecretPassword>',
                                         encrypt=True)

hdf_mallcustomers=cchc.table('MALL_CUSTOMERS')
hdf_mallcustomers.collect()
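Keep in mind that collect() brings the complete result set to the client as a pandas DataFrame. That is fine for a small table like this one, but for bigger tables it may be better to let SAP HANA do the counting and to fetch only a few rows. A minimal sketch, where the helper name preview is my own and not part of hana_ml:

```python
def preview(hdf, rows: int = 5):
    """Return the total row count and the first `rows` rows of a hana_ml DataFrame."""
    total = hdf.count()                # the COUNT(*) is executed in SAP HANA
    sample = hdf.head(rows).collect()  # only `rows` rows travel to the client
    return total, sample


# Usage:
# total, sample = preview(hdf_mallcustomers)
# print(total)
# sample
```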

Run Dataset Report…

from hana_ml.visualizers.dataset_report import DatasetReportBuilder
drb_mall=DatasetReportBuilder()
drb_mall.build(data=hdf_mallcustomers, key='ID')

…and once all build steps are completed…

…check the interactive output

drb_mall.generate_notebook_iframe_report()
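Besides the iframe in the notebook, the builder can also write the report to an HTML file with generate_html_report. Below is a small sketch that combines both steps. The wrapper save_dataset_report and the file name are my own illustration, and I am assuming that generate_html_report takes the target file name as its only argument, so please check the documentation of your hana_ml version.

```python
def save_dataset_report(builder, data, key, filename):
    """Build the dataset report for `data` and save it as an HTML file."""
    builder.build(data=data, key=key)       # same call as in the cell above
    # Assumption: generate_html_report() accepts the target file name.
    builder.generate_html_report(filename)
    return filename


# Usage:
# from hana_ml.visualizers.dataset_report import DatasetReportBuilder
# save_dataset_report(DatasetReportBuilder(), hdf_mallcustomers, 'ID', 'mall_customers')
```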

The generated report now gives you a more detailed view of the data! Enjoy!


The next World Statistics Day is five years away, but I plan to dig into the new capabilities of this version of hana_ml, especially the support for multi-model analytics, as soon as next week 😉

Stay healthy
-Vitaliy (aka @Sygyzmundovych)

 

3 Comments
  • I tried this on a 35K dataset with 16 features. The report was generated successfully; however, I couldn’t manage to open it. Every time I try to open the Jupyter notebook or the generated HTML report file, the browser freezes. (I tried both ways, i.e. generate_html_report as well as generate_notebook_iframe_report, without any luck.)

    Would you know if there are any limitations around size of the dataset that we could use this report for?

     

    Venkat

  • Thank you so much for the information. Keep suggesting such informative posts; I will recommend this post to others. Keep sharing such amazing posts.