Multi-model in hana_ml 2.6 for Python (part 02): More on Dataframes
In the previous post, we looked at the basics of Pandas and HANA dataframes that we will use in this series explaining the new multi-model functionality of
hana_ml 2.6 in the context of my demo used in SAP TechEd’s DAT108 session.
In the previous episode…
…we loaded data from CSV files into a client’s Pandas dataframe and then persisted those Pandas dataframes by creating tables with that data in SAP HANA using
create_dataframe_from_pandas(). The result of each
create_dataframe_from_pandas() method execution was a HANA dataframe in Python that defines a
SELECT statement to read this data.
We then applied
collect() method to that HANA dataframe to get the result of that
SELECT statement as data in a Pandas dataframe. I used the following Python code for that:
and that was a really bad line of code!
It was bad for two reasons.
How to display the results
First, the Pandas module provides its own method to output the content of the dataframe:
print(). The advantages of the display method are:
- nicely formatted outputs in notebooks,
- options to customize the output.
The good news is that
display is the default action for Pandas dataframes.
How to collect the results from SAP HANA
collect() to a HANA dataframe that selects the complete SAP HANA table will read all table’s records in the database, transfer that complete dataset over the network, and will try to store all this data in your Python client’s memory. And that’s not how it should be done!
If all we need is to see a few sample records stored in the SAP HANA table, then you should apply
When a method like
head() is applied to a HANA dataframe, then the result is another HANA dataframe with a modified
SELECT statement. Let’s see this in practice.
Data exploration with HANA dataframes
As we said above, the best practice is pushing the calculation down to the SAP HANA database and limiting the amount of data transferred back to the client. And this limiting should apply not only to a number of rows but as well to required columns only. Tables in SAP HANA are column-oriented by default, and selecting columns you do not need in the result is the anti-pattern.
dtypes() is used to find columns of a HANA dataframe. If you are familiar with Pandas, then you can notice that with a HANA dataframe it is a method, while in a Pandas dataframe
dtypes is a property.
If you need only column names without data types, then you can simply use the
Now, let’s answer some data exploration questions!
What is the airport with the longest runway?
We need to use the
sort() method to answer this question.
( dfh_ports .select("CODE", "DESC", "LONGEST", "COUNTRY", "CITY") .sort("LONGEST", desc=True) .head(3).collect() )
To make the code easier to read I wanted to split this Python line of code into multiple lines. That’s why I put it all into the brackets.
Accordingly to https://simpleflying.com/longest-runways/ “…the [Qamdo Bamda Airport] runway’s length is a necessity due to its high altitude. In fact, the airport and runway are situated 4,400 meters above sea level. As we alluded to above, high altitude affects engine performance and the ability for wings to acquire the lift needed to get the aircraft airborne.”
SELECT statement behind the result was:
What airport is the closest to a pole (either North or South)?
To answer this we need to add a calculated column
ABSLAT with the absolute values of airports’ latitudes that can be either positive in the Northern hemisphere, or negative in the Southern.
( dfh_ports .select( "CODE", "DESC", "LONGEST", "COUNTRY", "CITY", "LAT", ('ABS("LAT")', "ABSLAT") ) .sort("ABSLAT", desc=True).head(3) .collect() )
Accordingly to https://en.wikipedia.org/wiki/Svalbard_Airport,_Longyear Svalbard Airport “…is the northernmost airport in the world with scheduled public flights.”
What are the countries with airports with the highest number of runways?
For that, we will use the
agg() method with
max aggregation function from SAP HANA SQL.
( dfh_ports .agg( agg_list=[("max", "RUNWAYS", "MAXRUNWAYS")], group_by="COUNTRY" ) .sort("MAXRUNWAYS", desc=True) .head(7).collect() )
You can find this notebook…
A self-check riddle
I did not include generated SQL statements for the last two examples. I leave these to you to find out as a hands-on exercise.
So far, we haven’t done anything multi-model specific in our code, but I wanted to make sure that we are on the same page before we jump to the geospatial data in the next post.
Stay healthy ❤️ stay curious,
-Vitaliy (aka @Sygyzmundovych)