Skip to Content
Technical Articles

Multi-model in hana_ml 2.6 for Python: More on Dataframes

In the previous post, we looked at the basics of Pandas and HANA dataframes that we will use in this series explaining the new multi-model functionality of hana_ml 2.6 in the context of my demo used in SAP TechEd’s DAT108 session.

In the previous episode…

…we loaded data from CSV files into a client’s Pandas dataframe and then persisted those Pandas dataframes by creating tables with that data in SAP HANA using create_dataframe_from_pandas(). The result of each create_dataframe_from_pandas() method execution was a HANA dataframe in Python that defines a SELECT statement to read this data.

We then applied collect() method to that HANA dataframe to get the result of that SELECT statement as data in a Pandas dataframe. I used the following Python code for that:

print(dfh_ports.collect())

and that was a really bad line of code!

It was bad for two reasons.

How to display the results

First, the Pandas module provides its own method to output the content of the dataframe: display() i/o print(). The advantages of the display method are:

  1. nicely formatted outputs in notebooks,
  2. options to customize the output.

The good news is that display is the default action for Pandas dataframes.

How to collect the results from SAP HANA

Second, applying collect() to a HANA dataframe that selects the complete SAP HANA table will read all table’s records in the database, transfer that complete dataset over the network, and will try to store all this data in your Python client’s memory. And that’s not how it should be done!

If all we need is to see a few sample records stored in the SAP HANA table, then you should apply head() before collect().

When a method like head() is applied to a HANA dataframe, then the result is another HANA dataframe with a modified SELECT statement. Let’s see this in practice.

dfh_ports
dfh_ports.select_statement
dfh_ports.head(5)
dfh_ports.head(5).select_statement

Data exploration with HANA dataframes

As we said above, the best practice is pushing the calculation down to the SAP HANA database and limiting the amount of data transferred back to the client. And this limiting should apply not only to a number of rows but as well to required columns only. Tables in SAP HANA are column-oriented by default, and selecting columns you do not need in the result is the anti-pattern.

The method dtypes() is used to find columns of a HANA dataframe. If you are familiar with Pandas, then you can notice that with a HANA dataframe it is a method, while in a Pandas dataframe dtypes is a property.

If you need only column names without data types, then you can simply use the columns property.

print(dfh_ports.columns)

Now, let’s answer some data exploration questions!

What is the airport with the longest runway?

We need to use the sort() method to answer this question.

(
    dfh_ports
    .select("CODE", "DESC", "LONGEST", "COUNTRY", "CITY")
    .sort("LONGEST", desc=True)
    .head(3).collect()
)

To make the code easier to read I wanted to split this Python line of code into multiple lines. That’s why I put it all into the brackets.

Accordingly to https://simpleflying.com/longest-runways/ “…the [Qamdo Bamda Airport] runway’s length is a necessity due to its high altitude. In fact, the airport and runway are situated 4,400 meters above sea level. As we alluded to above, high altitude affects engine performance and the ability for wings to acquire the lift needed to get the aircraft airborne.

The SELECT statement behind the result was:

What airport is the closest to a pole (either North or South)?

To answer this we need to add a calculated column ABSLAT with the absolute values of airports’ latitudes that can be either positive in the Northern hemisphere, or negative in the Southern.

(
    dfh_ports
    .select(
        "CODE", "DESC", "LONGEST", "COUNTRY", "CITY", "LAT", 
        ('ABS("LAT")', "ABSLAT")
    )
    .sort("ABSLAT", desc=True).head(3)
    .collect()
)

Accordingly to https://en.wikipedia.org/wiki/Svalbard_Airport,_Longyear Svalbard Airport “…is the northernmost airport in the world with scheduled public flights.

What are the countries with airports with the highest number of runways?

For that, we will use the agg() method with max aggregation function from SAP HANA SQL.

(
    dfh_ports
    .agg(
        agg_list=[("max", "RUNWAYS", "MAXRUNWAYS")], 
        group_by="COUNTRY"
    )
    .sort("MAXRUNWAYS", desc=True)
    .head(7).collect()
)

You can find this notebook…

…in hana-ml-samples repository: https://github.com/SAP-samples/hana-ml-samples/blob/main/Python-API/usecase-examples/multimodel-analysis-airroutes/01%20Dataframes.ipynb.

A self-check riddle

I did not include generated SQL statements for the last two examples. I leave these to you to find out as a hands-on exercise.

If you want to even better understand HANA dataframe calculations, then I would suggest you to experiment with the statements to include filter()deselect(), distinct() etc.


So far, we haven’t done anything multi-model specific in our code, but I wanted to make sure that we are on the same page before we jump to the geospatial data in the next post.

Stay healthy ❤️ stay curious,
-Vitaliy (aka @Sygyzmundovych)

Be the first to leave a comment
You must be Logged on to comment or reply to a post.