Environment

Fukuhara · ‎09-07-2022

I am writing this blog to show data visualization using python package hana_ml. DatasetReportBuilder is a EDA function and show report as below. I used famous titanic dataset, which is easy to understand.

Environment

Environment is as below.

Python: 3.7.13(Google Colaboratory)

HANA: Cloud Edition 2022.16

Python packages and their versions.

hana_ml: 2.13.22072200

pandas: 1.3.5

seaborn: 0.11.2(just for getting titanic dataset)

As for HANA Cloud, I activated scriptserver and created my users. Though I don't recognize other special configurations, I may miss something since our HANA Cloud was created long time before.

I didn't use HDI here to make environment simple.

Generated Report

A report can be displayed within jupyter or downloaded as html file.

Overview

"Variable Types" block has Detail tab.

Sample

Sample shows top 10 records.

Variables

Variables screen shows variable statistics.

Variable "Age" is Automatically discretized.

Character type variables are shown as doughnut chart.

Data Correlations

Data Scatter Matrix

Python Script

1. Install Python packages

Install python package hana_ml, which is not pre-installed on Google Colaboratory.

As for pandas and seaborn, I used pre-installed ones.

!pip install hana_ml

2. Import modules

Import python package modules.

from hana_ml.dataframe import ConnectionContext, create_dataframe_from_pandas

from hana_ml.visualizers.dataset_report import DatasetReportBuilder

import pandas as pd

import seaborn as sns

3. Connect to HANA Cloud

Connect to HANA Cloud and check its version.

ConnectionContext class is for connection to HANA.

HOST = '<HANA HOST NAME>'

SCHEMA = USER = '<USER NAME>'

PASS = '<PASSWORD>'

conn = ConnectionContext(address=HOST, port=443, user=USER,

                           password=PASS, schema=SCHEMA,

                           encrypt=True, sslValidateCertificate=False) 

print(conn.hana_version())

4.00.000.00.1660640318 (fa/CE2022.16)

4. Load Titanic dataset

Load Titanic dataset using seaborn. I rename a column name to "gender", since original name is prohibit in the SAP Community blog.

df = sns.load_dataset('titanic')

df.rename({'xx': 'gender'}, axis=1, inplace=True)

print(df)

df.info()

Here is dataframe overview.

     survived  pclass  gender   age  sibsp  parch     fare embarked   class  \

0           0       3    male  22.0      1      0   7.2500        S   Third   

1           1       1  female  38.0      1      0  71.2833        C   First   

2           1       3  female  26.0      0      0   7.9250        S   Third   

3           1       1  female  35.0      1      0  53.1000        S   First   

4           0       3    male  35.0      0      0   8.0500        S   Third   

..        ...     ...     ...   ...    ...    ...      ...      ...     ...   

886         0       2    male  27.0      0      0  13.0000        S  Second   

887         1       1  female  19.0      0      0  30.0000        S   First   

888         0       3  female   NaN      1      2  23.4500        S   Third   

889         1       1    male  26.0      0      0  30.0000        C   First   

890         0       3    male  32.0      0      0   7.7500        Q   Third   



       who  adult_male deck  embark_town alive  alone  

0      man        True  NaN  Southampton    no  False  

1    woman       False    C    Cherbourg   yes  False  

2    woman       False  NaN  Southampton   yes   True  

3    woman       False    C  Southampton   yes  False  

4      man        True  NaN  Southampton    no   True  

..     ...         ...  ...          ...   ...    ...  

886    man        True  NaN  Southampton    no   True  

887  woman       False    B  Southampton   yes   True  

888  woman       False  NaN  Southampton    no  False  

889    man        True    C    Cherbourg   yes   True  

890    man        True  NaN   Queenstown    no   True  



[891 rows x 15 columns]

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 891 entries, 0 to 890

Data columns (total 15 columns):

 #   Column       Non-Null Count  Dtype   

---  ------       --------------  -----   

 0   survived     891 non-null    int64   

 1   pclass       891 non-null    int64   

 2   gender       891 non-null    object  

 3   age          714 non-null    float64 

 4   sibsp        891 non-null    int64   

 5   parch        891 non-null    int64   

 6   fare         891 non-null    float64 

 7   embarked     889 non-null    object  

 8   class        891 non-null    category

 9   who          891 non-null    object  

 10  adult_male   891 non-null    bool    

 11  deck         203 non-null    category

 12  embark_town  889 non-null    object  

 13  alive        891 non-null    object  

 14  alone        891 non-null    bool    

dtypes: bool(2), category(2), float64(2), int64(4), object(5)

memory usage: 80.7+ KB

5. Define table and upload data

Define HANA Table and upload data using function "create_dataframe_from_pandas".

TRAIN_TABLE = 'PAL_TRAIN'

dfh = create_dataframe_from_pandas(conn, df, TRAIN_TABLE,

                             schema=SCHEMA, 

                             force=True, # True: truncate and insert

                             replace=True) # True: Null is replaced by 0

print(f'Table Structure: {dfh.get_table_structure()}')

Table Structure: {'survived': 'INT', 'pclass': 'INT', 'gender': 'NVARCHAR(5000)',

'age': 'DOUBLE', 'sibsp': 'INT', 'parch': 'INT', 'fare': 'DOUBLE', 

'embarked': 'NVARCHAR(5000)', 'class': 'NVARCHAR(5000)', 'who': 'NVARCHAR(5000)', 

'adult_male': 'NVARCHAR(5000)', 'deck': 'NVARCHAR(5000)', 

'embark_town': 'NVARCHAR(5000)', 'alive': 'NVARCHAR(5000)', 'alone': 'NVARCHAR(5000)'}

6. Generate report

Here is the main part. It takes about 30 seconds to generate a report. It calculates correlation, so it takes so much time with many records.

Dataset need key columns, so I added a key "ID" column using "add_id" function.

"generate_notebook_iframe_report" function show a report in Jupyter.

"generate_html_report" function save a html report file. Its parameter is fine name prefix, so "titanic_dataset_report.html" is the complete fine name in this script.

datasetReportBuilder = DatasetReportBuilder()

datasetReportBuilder.build(dfh.add_id(), key="ID")

datasetReportBuilder.generate_notebook_iframe_report()

datasetReportBuilder.generate_html_report('titanic')

7. Close connection

Last but not least, closing connection explicitly is preferable.

conn.close()

Python hana_ml: Visualize Dataset(DatasetReportBuilder)

Environment

Generated Report

Overview

Sample

Variables

Data Correlations

Data Scatter Matrix

Python Script

1. Install Python packages

2. Import modules

3. Connect to HANA Cloud

4. Load Titanic dataset

5. Define table and upload data

6. Generate report

7. Close connection

Get Your SAP HANA Idea Incubator Badge Today!

SCN Mission - SAP HANA Quiz Challenge is now retired

Share your #HANAStory and Win