In the
previous post, we looked at the basics of Pandas and HANA dataframes that we will use
in this series explaining the new multi-model functionality of
hana_ml
2.6 in the context of my demo used in
SAP TechEd’s DAT108 session.
In the previous episode...
...we loaded data from CSV files into a client's Pandas dataframe and then persisted those Pandas dataframes by creating tables with that data in SAP HANA using
create_dataframe_from_pandas()
. The result of each
create_dataframe_from_pandas()
method execution was a HANA dataframe in Python that defines a
SELECT
statement to read this data.
We then applied
collect()
method to that HANA dataframe to get the result of that
SELECT
statement as data in a Pandas dataframe. I used the following Python code for that:
print(dfh_ports.collect())
and that was a really bad line of code!
It was bad for two reasons.
How to display the results
First, the Pandas module provides its own method to output the content of the dataframe:
display()
i/o
print()
. The advantages of the display method are:
- nicely formatted outputs in notebooks,
- options to customize the output.
The good news is that
display
is the default action for Pandas dataframes.
How to collect the results from SAP HANA
Second, applying
collect()
to a HANA dataframe that selects the complete SAP HANA table will read all table's records in the database, transfer that complete dataset over the network, and will try to store all this data in your Python client's memory.
And that's not how it should be done!
If all we need is to see a few sample records stored in the SAP HANA table, then you should apply
head()
before
collect()
.
When a method like
head()
is applied to a HANA dataframe, then the result is another HANA dataframe with a modified
SELECT
statement. Let's see this in practice.
dfh_ports
dfh_ports.select_statement
dfh_ports.head(5)
dfh_ports.head(5).select_statement
Data exploration with HANA dataframes
As we said above, the best practice is
pushing the calculation down to the SAP HANA database and
limiting the amount of data transferred back to the client. And this limiting should apply not only to a number of rows but as well to required columns only. Tables in SAP HANA are column-oriented by default, and
selecting columns you do not need in the result is the anti-pattern.
The method
dtypes()
is used to find columns of a HANA dataframe. If you are familiar with Pandas, then you can notice that with a HANA dataframe it is a method, while in a Pandas dataframe
dtypes
is a property.
If you need only column names without data types, then you can simply use the
columns
property.
print(dfh_ports.columns)
Now, let's answer some data exploration questions!
What is the airport with the longest runway?
We need to use the
sort()
method to answer this question.
(
dfh_ports
.select("CODE", "DESC", "LONGEST", "COUNTRY", "CITY")
.sort("LONGEST", desc=True)
.head(3).collect()
)
To make the code easier to read I wanted to split this Python line of code into multiple lines. That's why I put it all into the brackets.
Accordingly to
https://simpleflying.com/longest-runways/ "...
the [Qamdo Bamda Airport] runway’s length is a necessity due to its high altitude. In fact, the airport and runway are situated 4,400 meters above sea level. As we alluded to above, high altitude affects engine performance and the ability for wings to acquire the lift needed to get the aircraft airborne."
The
SELECT
statement behind the result was:
What airport is the closest to a pole (either North or South)?
To answer this we need to add
a calculated column ABSLAT
with the absolute values of airports' latitudes that can be either positive in the Northern hemisphere, or negative in the Southern.
(
dfh_ports
.select(
"CODE", "DESC", "LONGEST", "COUNTRY", "CITY", "LAT",
('ABS("LAT")', "ABSLAT")
)
.sort("ABSLAT", desc=True).head(3)
.collect()
)
Accordingly to
https://en.wikipedia.org/wiki/Svalbard_Airport,_Longyear Svalbard Airport "...
is the northernmost airport in the world with scheduled public flights."
What are the countries with airports with the highest number of runways?
For that, we will use the
agg()
method with
max
aggregation function from SAP HANA SQL.
(
dfh_ports
.agg(
agg_list=[("max", "RUNWAYS", "MAXRUNWAYS")],
group_by="COUNTRY"
)
.sort("MAXRUNWAYS", desc=True)
.head(7).collect()
)
You can find this notebook...
...in
hana-ml-samples
repository:
https://github.com/SAP-samples/hana-ml-samples/blob/main/Python-API/usecase-examples/multimodel-anal....
A self-check riddle
I did not include generated SQL statements for the last two examples. I leave these to you to find out as a hands-on exercise.
If you want to even better understand HANA dataframe calculations, then I would suggest you to experiment with the statements to include
filter()
,
deselect()
,
distinct()
etc.
So far, we haven't done anything multi-model specific in our code, but I wanted to make sure that we are on the same page before we jump to the geospatial data in the
next post.
Stay healthy
❤️ stay curious,
-Vitaliy (aka
@Sygyzmundovych)