Happy World Statistics Day with hana_ml 2.6!

Vitaliy-R · ‎10-20-2020

Today, October 20th, is World Statistics Day, celebrated under the theme “Connecting the world with data we can trust”. The day is celebrated every five years, so the next chance will be in 2025!

Last week, version 2.6 of the Python Machine Learning Client for SAP HANA (aka hana_ml) was released. So what better way to celebrate World Statistics Day than to look at one of the new features in that release, one designed to help you understand, and therefore trust, the data you have? In this post, we will look at the new Dataset Report feature, which generates a report by profiling the data in an SAP HANA table.

I am going to run this example using Jupyter Lab...

...and, as with all experiments, I will run it in a separate Docker container.

docker run -p 8888:8888 --detach -v ~:/home/jovyan/work --name hmlsandbox01 -e GRANT_SUDO=yes -e JUPYTER_ENABLE_LAB=yes jupyter/minimal-notebook start-notebook.sh --LabApp.token=''

docker exec hmlsandbox01 pip install hana_ml==2.6.*

You may notice I added --LabApp.token='' so that Jupyter Lab does not ask for a token. That leaves it unsecured, but I did this on purpose: it runs only on my laptop, for my own experiments.

The Dataset Report also requires ipywidgets, so I need to follow the installation steps from: https://ipywidgets.readthedocs.io/en/stable/user_install.html

docker exec hmlsandbox01 conda install -c conda-forge ipywidgets

docker exec hmlsandbox01 jupyter labextension install @jupyter-widgets/jupyterlab-manager@2.0

docker restart hmlsandbox01

These installation steps take a few minutes.

Let's open Jupyter Lab...

...and check the installed version is 2.6.

import hana_ml

hana_ml.__version__
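If you would rather have the notebook fail loudly on a wrong release than read the version off the screen, a small helper can do the comparison. This is only a sketch: matches_release is a hypothetical name of my own, not part of hana_ml, and it assumes the version string is dotted with the major and minor numbers first.

```python
def matches_release(version: str, major: int = 2, minor: int = 6) -> bool:
    """Return True if a dotted version string belongs to the given major.minor release."""
    parts = version.split(".")
    if len(parts) < 2:
        return False
    try:
        return int(parts[0]) == major and int(parts[1]) == minor
    except ValueError:
        # Non-numeric major or minor component: treat as "not this release"
        return False


# In the notebook (assumes hana_ml is installed in the kernel):
#   import hana_ml
#   assert matches_release(hana_ml.__version__), hana_ml.__version__
```

Comparing the numeric components rather than the string prefix avoids accepting something like 2.60 by accident.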

The dataset

The dataset I am going to use is the one from last week's post “Low-code data analysis application with SAP HANA push-down” by dmitrybuslov and andreas.forster. It is Mall_Customers.csv from Kaggle, which I have loaded by following the steps described in their post.

Please note that I am using a productive SAP HANA Cloud instance, because at the time of writing the trial does not yet include the scriptserver service. Without the scriptserver, SAP HANA's embedded ML libraries, such as PAL and APL, cannot run.

cchc=hana_ml.dataframe.ConnectionContext(port=443,
                                         address='<uuid>.hana.prod-eu20.hanacloud.ondemand.com',
                                         user='<user_name>',
                                         password='<MySuperSecretPassword>',
                                         encrypt=True)



hdf_mallcustomers=cchc.table('MALL_CUSTOMERS')

hdf_mallcustomers.collect()
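Since PAL and APL depend on the scriptserver, it can be worth checking from the notebook that the service is actually running before going further. The sketch below makes a few assumptions: that the connected user may read the SYS.M_SERVICES monitoring view, and that a running service reports ACTIVE_STATUS = 'YES'. The helper name scriptserver_is_active is mine, not part of hana_ml.

```python
# Assumption: SYS.M_SERVICES is readable by the connected user.
SERVICES_SQL = "SELECT SERVICE_NAME, ACTIVE_STATUS FROM SYS.M_SERVICES"


def scriptserver_is_active(rows):
    """Check (service_name, active_status) pairs for a running scriptserver."""
    return any(
        str(name).lower() == "scriptserver" and str(status).upper() == "YES"
        for name, status in rows
    )


# In the notebook, with the cchc connection created earlier:
#   rows = cchc.sql(SERVICES_SQL).collect().values.tolist()
#   print(scriptserver_is_active(rows))
```

Keeping the check as a plain function over rows means it can be tried without a database connection.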

Run Dataset Report...

from hana_ml.visualizers.dataset_report import DatasetReportBuilder

drb_mall=DatasetReportBuilder()

drb_mall.build(data=hdf_mallcustomers, key='ID')

...and once all build steps are completed...

...check the interactive output

drb_mall.generate_notebook_iframe_report()

The generated report now gives you a more detailed view of the data! Enjoy!
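If "profiling" sounds abstract, here is a rough idea of the kind of per-column summary such a report is built around. This is a plain-Python illustration only: it does not use hana_ml, it is not how DatasetReportBuilder computes anything, and the sample values are made up rather than taken from MALL_CUSTOMERS.

```python
from statistics import mean


def profile_column(values):
    """Summarize one column: row count, missing and distinct values,
    plus min/max/mean when every non-missing value is numeric."""
    present = [v for v in values if v is not None]
    summary = {
        "count": len(values),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
    }
    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    )
    if present and numeric:
        summary.update(min=min(present), max=max(present), mean=mean(present))
    return summary


# Made-up sample values, for illustration only
ages = [19, 21, None, 35, 21]
genders = ["Male", "Female", "Female", None, "Male"]
print(profile_column(ages))
print(profile_column(genders))
```

The difference with hana_ml is that the profiling is pushed down to SAP HANA instead of being done on data pulled into Python.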

The next World Statistics Day is five years away, but I plan to dig into the new capabilities of this version of hana_ml, especially its support for multi-model analytics, as early as next week 😉

Stay healthy
-Vitaliy (aka @Sygyzmundovych)