Skip to Content
Technical Articles

Multi-model in hana_ml 2.6 for Python: Spatial data

In the previous post, we looked at the regular data exploration using HANA dataframes. But the objective of this series is to explain the new multi-model functionality of hana_ml 2.6, which I used as well in my demo in SAP TechEd’s DAT108 session.

With hana_ml 2.6 release several HANA DataFrame methods have been extended to support the processing of (geo-)spatial columns. And what is the better time to discuss this if not during Geography Awareness Week?

If spatial data with SAP HANA is new to you, I would suggest going through introductory tutorials first:

  1. Spatial types: https://developers.sap.com/group.hana-aa-spatial-get-started.html
  2. Spatial methods: https://developers.sap.com/group.hana-aa-spatial-methods.html

Assuming we are Ok with the spatial basics…

…let me move back to JupyterLab and create a new notebook 02 Spatial.ipynb.

Because this is the new notebook, we will need to load the required Python modules and connect to the SAP HANA instance first.

import pandas as pd
from hana_ml import dataframe as dfh

Please note, that this time I am not using import hana_ml, but aliases:

  • pd for Pandas, and
  • dfh for HANA DataFrame class.
hana_cloud_endpoint="<uuid>.hana.trial-eu10.hanacloud.ondemand.com:443"
hana_cloud_host, hana_cloud_port=hana_cloud_endpoint.split(":")

cchc=dfh.ConnectionContext(port=hana_cloud_port,
                                         address=hana_cloud_host,
                                         user='HANAML',
                                         password='Super$ecr3t!',
                                         encrypt=True
                                        )
cchc.connection.isconnected()

Find the nearest airport

Let try to query the location of the nearest airport. It will require your rough GPS location’s latitude and longitude assigned to two variables mylat and mylon. In my case:

mylat, mylon = (51.1, 16.9)

You can provide yours manually there, or we can go a bit geekier and query the location from the same https://ifconfig.co service used in the earlier post  Allow connections to SAP HANA Cloud instance from selected IP addresses — using the command line.

import requests, json
s = requests.Session()

mylat, mylon = [json.loads
                (
                    s.get("https://ifconfig.co/json").text
                )
                .get(key) for key in ['latitude', 'longitude']
               ]
s.close()

Let’s print the coordinates we got.

print(mylat, mylon)

In your case, they should be different. And if you are on the VPN, then it might not even be your location at all. Adjust the location manually, if required 🌐

With these two variables mylat and mylon properly set, we can now build a HANA DataFrame to query the nearest airport.

dfh_nearport=(cchc
              .table("PORTS")
              .select("CODE", "DESC", 
                      ('New ST_Point("LON", "LAT").ST_SRID(4326)', "PORT_LOC"),
                      (
                          'New ST_POINT({lon}, {lat}).ST_SRID(4326).ST_Distance(New ST_Point("LON", "LAT").ST_SRID(4326))'
                          .format(lon=mylon, lat=mylat)
                          , "DIST_FROMME"
                      )
                     )
              .sort("DIST_FROMME").
              head(1)
             )

With this line of code we:

  1. select from the table PORTS two existing columns CODE, DESC plus two calculated columns:
  2. PORT_LOC with the spatial location of an airport
  3. and the DIST_FROMME with the distance in meters (that’s the default unit of distance) from my location to an airport using the spatial reference ID SRID 4326.
  4. Then we sort records by the distance column
  5. and select only one top record.

You can verify the generated SELECT statement.

dfh_nearport.select_statement

As we discussed already, the SQL in a HANA DataFrame is executed only when specific methods, like collect() are called.

dfp_nearport=dfh_nearport.collect()
dfp_nearport

So, the nearest airport is WRO (as expected 🙂), located about 5.8 kilometers from me. But what is that array of bytes [1, 1, 0, 0, 0, 0, 0, 0, 192, 195, 226, 48, 64... as the location?

Well, the spatial data is stored in a binary format. To be able to read it properly several methods in hana_ml 2.6 have been extended with the geo_cols and srid optional properties, like sql().

dfp_nearport=(cchc
        .sql(
            '''SELECT TOP 1
            CODE, DESC,
            New ST_Point(LON, LAT).ST_SRID(4326) as "PORT_LOC",
            New ST_POINT(16.9599, 51.0791).ST_SRID(4326).ST_Distance(New ST_Point(LON, LAT).ST_SRID(4326)) as "DIST_FROMME"
            FROM "PORTS"
            ORDER BY "DIST_FROMME" ASC
            '''.format(mylon, mylat),
            geo_cols=["PORT_LOC"],
            srid=4326
        )
        .collect())

display(dfp_nearport)

 

Now the airport’s location is represented using the Well-Known Text (or WKT) value POINT (16.88579940795898 51.10269927978516) that can be copied and pasted e.g. in http://geojson.io/ -> Meta -> Load WKT String to be displayed on the map.

Isn’t it possible to visualize spatial data now that we are using Python and Jupyter?” ask you. Absolutely! And we will get to that, but first, let us finish with the use of geo_cols with HANA DataFrame.

The next question would be what if you want to use HANA DataFrame methods to build the query, not embedded SQL, yet still, return the geospatial columns in the right format? This is where the trick, which I shared as a riddle in the first post, should help — embedding dfh’s select_statement into SQL with geo_cols, like:

dfp_nearport=(cchc
              .sql(dfh_nearport.select_statement,
                       geo_cols={"PORT_LOC": 4326})
              .collect()
             )
display(dfp_nearport)

Last note here: check two different ways I used geo_cols syntax. It can be:

  • either a dictionary combining a column name and SRID in one geo_cols={"PORT_LOC": 4326},
  • or — convenient when you have multiple columns all with the same SRID — a list of columns plus an SRID as two different parameters geo_cols=["PORT_LOC"], srid=4326. In this example, it is a list of one element, but could be more.

Load input data with a geospatial column included

So far we had to calculate airport spatial points (locations) from LAT and LON columns on the SAP HANA table PORT. It would be more convenient to create a column with this ST_GEOMETRY(4326) data type already at the data persistence.

Let’s do this now, but it means we need to repeat the step of reading and wrangling data from a CSV file using a Pandas dataframe dfp_ports, the same as we did in 01 Dataframes.ipynb once again in the current Jupyter notebook.

dfp_nodes=pd.read_csv('https://github.com/krlawrence/graph/raw/master/sample-data/air-routes-latest-nodes.csv')

dfp_ports=(
           dfp_nodes[dfp_nodes['~label'].isin(['airport'])]
           .drop(['~label','type:string','author:string','date:string'], axis=1)
           .convert_dtypes()
          )

dfp_ports.columns=(dfp_ports.columns
                   .str.replace('~','')
                   .str.replace(':.*','')
                   .str.upper()
                  )

And once again we are creating (or rather re-creating SAP HANA table PORTS thanks to force=True property) HANA DataFrame, indicating that a pair f columns LON and LAT are X and Y coordinates of a spatial point with Spatial Reference ID 4326.

dfh_ports=dfh.create_dataframe_from_pandas(cchc,
                                           dfp_ports, "PORTS",
                                           geo_cols=[("LON", "LAT")],
                                           srid=4326,
                                           force=True
                                          )

This time a table in the database was created with the new POINT_LON_LAT_GEO column, as seen by running a command:

dfh_ports.columns

Now we can use this column to simplify the previous exploration of data, like finding the nearest airport!

Load spatial data from Esri shapefiles

Another great improvement in hana_ml 2.6 is the ability to load Esri shapefiles, which is a very popular format to exchange geospatial data. I was very excited when saw this added functionality! I hope you are (or will be) too.

We will use an old, but still very popular World Borders shapefile from the Thematic Mapping.

I am going to create a subfolder Shapes and download the zipped file into that subfolder. This time I will do this right from my notebook’s cell:

!mkdir -p ./Shapes
!wget https://thematicmapping.org/downloads/TM_WORLD_BORDERS-0.3.zip -O ./Shapes/TM_WORLD_BORDERS-0.3.zip 

Let’s use the new create_dataframe_from_shapefile() method to persist the data in the SAP HANA table TM_WORLD_BORDERS. You can notice the use of geo_cols and srid here as well.

dfh_countries = dfh.create_dataframe_from_shapefile(
        connection_context=cchc, 
        shp_file='./Shapes/TM_WORLD_BORDERS-0.3.zip',
        table_name="TM_WORLD_BORDERS",
        geo_cols=["SHAPE"],
        srid=4326
)

and query the table to check your country, like Poland in my case:

dfh_countries.select("ISO2", "NAME", "SHAPE").filter("NAME='Poland'").collect()

As the last example let me show the HANA DataFrame’s query joining two tables using the spatial predicate ST_CoveredBy(). Please note the use of the alias() method to build the join of two HANA DataFrames dfh_ports and dfh_countries.

How many airports are in Poland?

(
    dfh_ports.alias("P")
    .join
    (
        dfh_countries.filter("NAME='Poland'").alias("C"),
        condition='"P"."POINT_LON_LAT_GEO".ST_CoveredBy("C"."SHAPE")=1',
        select=["P.*"]
    )
).count()

PS. And yes, I know there is a column COUNTRY in the PORTS table, but I hope you get the point of this last overengineering effort 🤓

You can find this notebook…

…in hana-ml-samples repository: https://github.com/SAP-samples/hana-ml-samples/blob/main/Python-API/usecase-examples/multimodel-analysis-airroutes/02%20Spatial.ipynb.

A self-check riddle

Can you modify queries, using the distance unit other than the default meter? Two hints:

  1. Available units of measures are stored in SAP HANA in the table "PUBLIC"."ST_UNITS_OF_MEASURE".
  2. You can include the unit of measure as an optional parameter in the SAP HANA SQL method ST_Distance().

As promised, we will focus on the visualization of the geospatial results in the next episode.

And stay healthy ❤️
-Vitaliy (aka @Sygyzmundovych)

PS. And if you read this on November 18th, then Happy GIS Day!

Be the first to leave a comment
You must be Logged on to comment or reply to a post.