Easy Descriptive Statistics with Python and SAP H...

former_member559925 · ‎05-24-2021

As Solution Advisors, we occasionally participate in POCs which require us to receive and analyze real customer data. Often, the first step to developing an analysis plan is to perform some exploratory data analysis on the data to determine the distribution of values and uniqueness of each column. Since we are dealing with real customer use-cases the datasets tend to be very wide, often with missing data. It is important to be able to get a sense of the quality of the dataset quickly since we may have to get back to the customer with our questions about the dataset before the project and analysis planning can begin.

I discovered a nice trick to generate this information within SAP HANA Cloud using the hana_ml Python library and wanted to share it as my first blog post. 😊

Problem: SAP Data Intelligence Fact Sheet cannot be exported

The fact sheets on profiled datasets in SAP Data Intelligence provide descriptive statistics on each column but we are not able to save this information as a dataset to manipulate and report on.

SAP Data Intelligence fact sheet provides descriptive statistics

However, the describe() method in the hana_ml Python library provides a simple way to generate this summary for us. This summary is generated natively in SAP HANA so it is fast as well.

Those familiar with Python may already be familiarity with describe(). For a pandas dataFrame, calling describe will produce a nice table with descriptive statistics like min, max, mean, and quartile values of each column.

The hana_ml Python library has also implemented this method and it is a handy way to generate descriptive statistics on any SAP HANA table. It is very flexible as well, allowing you to save the results natively as an SAP HANA table or bring it into your Python environment. Finally, it is a great way to understand how the Python wrapper works on top of SAP HANA Cloud.

SAP HANA DataFrames are SQL statements

There are already several great blogs on the SAP HANA DataFrame but it allows you to utilize your Python knowledge to work with your SAP HANA environment. The SAP HANA DataFrame provides a pointer to the data in HANA without storing any of the physical data. It is very flexible and allows you to perform data manipulations and transformations easily. It also let’s you move your data from your HANA environment and Python environment easily. With collect(), you can materialize your SAP HANA dataFrame as a Pandas dataFrame to offer even greater flexibility.

HANA dataFrame represents a table, column or SQL. Source: https://help.sap.com/doc/1d0ebfe5e8dd44d09606814d83308d4b/2.0.04/en-US/hana_ml.html

Use SAP HANA_ml Python library to import local files (e.g. csv, Excel)

In addition to generating descriptive statistics on your SAP HANA tables, you can also use the hana_ml library to load local files (e.g. .csv or Excel) to SAP HANA cloud automatically. To get started, let’s first import the necessary libraries and establish our connection to SAP HANA Cloud.

First, let's import the necessary libraries:

import pandas as pd

import hana_ml.dataframe as dataframe

print('HANA ML version: ' + hana_ml.__version__)

print('Pandas version:  ' + pd.__version__)

Next, set up our SAP HANA Cloud connection:

# Import libraries

import hana_ml.dataframe as dataframe



# Create connection to HANA Cloud

# Instantiate connection object

conn = dataframe.ConnectionContext(address = '<Your tenant info here for example something.hanacloud.ondemand.com>',

                                   port = 443, 

                                   user = '<Username>', 

                                   password = "<Password>", 

                                   encrypt = 'true',

                                   sslValidateCertificate = 'false')



# Display HANA version to test connection

print('HANA version: ' + conn.hana_version())



# Print APL version to confirm PAL/APL are enabled

import hana_ml.algorithms.apl.apl_base as apl_base

v = apl_base.get_apl_version(conn)

v.head(4)

Create connection to HANA tenant and test connection

The sample file I used for this demo is the 2018 file from the Airline Delay and Cancellation Data, 2019-2018. I wanted to test the performance vs. Pandas but you can use whatever file you have handy.

Using Pandas, we can import the file and run describe() to get summary statistics in Python.

import pandas as pd



df = pd.read_csv('../datasets/airline-delay/2018.csv')

Pandas describe method

To do the same in SAP HANA Cloud, we’ll first need to convert the Pandas dataFrame to an SAP HANA table. We can use the create_dataframe_from_pandas()to do this. It will automatically upload the Pandas DataFrame to SAP HANA to create a table or view, and an SAP HANA DataFrame as well.

Although we do not need to specify the datatype formats, the automatic conversion is not the most efficient. For example, pandas object types are converted to NVARCHAR(5000). However, we can use Pandas to provide the format lengths of string columns to minimize storage by calculating the maximum length of each string column.

# Subset dataframe to only object (string) types

df2 = df.select_dtypes(include='object')



# Create table of columns and data types

r = df2.dtypes.to_frame('dtypes').reset_index()



# Filter to string columns

sv = r[r['dtypes']=='object'].set_index('index')



# Iterate for each string column

for i in sv.index.to_list():

    # Get max length

    l = df2[i].str.len().max().astype(int)

    # Write max length to columns table

    sv.loc[i, 'len'] = f'NVARCHAR({l})'



# Create dictionary of columns:length

hana_fmt = sv['len'].to_dict()

hana_fmt

Find max length of character strings and create format dictionary

Now, we can use create_dataframe_from_pandas() to create the SAP HANA table. There are many optional options as well, be sure to check out the docs.

# conn - Connection to HANA Cloud

# pandas_df - Pandas dataFrame to upload

# table_name - HANA table name to create

# table_structure - HANA table format dictionary

# allow_bigint - Allows mapping to bigint or int

# force - replace HANA table if exists





dataframe.create_dataframe_from_pandas(connection_context = conn,

                                       pandas_df = df, 

                                       table_name = 'AIRLINES_2018',

                                       allow_bigint = True, 

                                       force = True)

Creating HANA Cloud table using dtype formats

The code above creates the SAP HANA dataFrame (“df_hana”) which points to the “AIRLINES_2018” SAP HANA table but you can also reference tables like below. We can call the describe() on that SAP HANA dataFrame to generate the descriptive stats we need. The code below brings the descriptive statistics into Pandas with collect()

%%time



stats = conn.table('AIRLINE_2018')

stats.describe().collect()

SAP HANA describe() collected to Pandas

However, if we wanted this information in SAP HANA we can do so without bringing into Pandas and saving directly as a SAP HANA table.

df_hana.describe().save('STATS_AIRLINE2018')

We can see this table created in the SAP HANA Database Explorer:

Descriptive statistics saved as table in SAP HANA Cloud

We can understand what is happening behind the scenes with the select_statement. This shows the underlying SQL behind the SAP HANA DataFrame.

df_hana.describe().select_statement

Underlying SQL statement we do not have to write

As you can see, the SAP hana_ml Python library is very flexible and allows you to combine the flexibility to Python with the power of SAP HANA.

You can use it to automate the time-consuming task of running descriptive statistics to accelerate the process of data discovery and exploratory data analysis.

Special thanks to andreas.forster, marc.daniau, and onno.bagijn and the other bloggers for sharing their invaluable knowledge.

Easy Descriptive Statistics with Python and SAP HANA Cloud

Get Your SAP HANA Idea Incubator Badge Today!

SCN Mission - SAP HANA Quiz Challenge is now retired

Share your #HANAStory and Win