Well, analyzing the spread of the SARS-CoV-2 coronavirus was not my dream use case. But based on the responses to Ferry Djaja‘s Tracking Coronavirus COVID-19 Near Real Time with SAP HANA XSA article I decided to add my two groszy too.
[Updated on 20-03-30 with the changed links to the source data; and the new map output based on the new data granularity. Thanks Douglas Maltby for your comment!]
I would like to show you how you can pull and load these files into SAP HANA using just a few lines of code thanks to SAP HANA Python Client API for Machine Learning (
Some people were confused with the visualization on the map at the end — please note that this article focuses on technical use case connecting different components, not on doing coronavirus data deep analysis.
Get Python environment, e.g. Jupyter
I will use Jupyter in the Docker container for that. Please have a look at my previous post Understanding containers (part 05): shared files between the host and containers if you are not familiar with how to start it. As well you can do all the same steps below from any other Python environment.
So, I have my container
myjupyter01 running. I am connected to the Jupyter UI as described in the previous blog.
The Jupyter image I used from the Docker Hub registry was
jupyter/minimal-notebook. It contains already some popular data processing packages, like
But additionally, I need to install
hana_ml, which — in its current version 1.0.8 — is available on PyPI repository: https://pypi.org/project/hana-ml/.
The command to run the installation is
python -m pip install hana_ml, but because I am running it from Jupyter notebook with Python3 kernel, I need to run it with
! at the beginning:
!python -m pip install hana_ml
Obviously, this installation step has to be done only once. No need to rerun it in the same container e.g. when reloading the newest files.
pandas to import files with data
Let’s import the same three files (
recovered) from https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_time_series as Ferry used in his example.
import hana_ml, pandas # Links updated on 2020-03-22 df_confd = pandas.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv') df_death = pandas.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv') df_recvd = pandas.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_recovered_global.csv') #Links from before March 22nd #df_confd = pandas.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Confirmed.csv') #df_death = pandas.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Deaths.csv') #df_recvd = pandas.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Recovered.csv')
As you can see from the preview of the Pandas dataframe, it lists only countries or provinces with confirmed cases, and every day the new column is added with the latest data from the previous day. Lines are added when the first case(s) is confirmed in the new region.
pandas to re-format the data frame
Before persisting the data in SAP HANA, let’s:
- Remove all date columns except the last one,
- Rename the last column from the actual date (like today’s
df_confd_latest=df_confd.drop(df_confd.columns[4:len(df_confd.columns)-1], axis='columns') df_confd_latest.columns = [*df_confd_latest.columns[:-1],'Confirmed']
hana_ml to persist data in SAP HANA table
Now let me connect to my instance of SAP HANA Express with the user
hanaml that already exists there…
cc=hana_ml.dataframe.ConnectionContext('12.34.567.890', 39015, 'hanaml', 'MyPasswordReusedEverywhere')
…and convert the Pandas dataframe
df_confd_latest into a HANA dataframe
hdf_confd=hana_ml.dataframe.create_dataframe_from_pandas(cc, df_confd_latest, 'df_confd', force=True)
Once the HANA dataframe is created:
- A physical column table is created in HANA and data from Pandas dataframe is inserted there,
- HANA dataframe
hdf_confdin Python does not store any data in your laptop, but only points to a table
HANAML.df_confdin SAP HANA server memory, and all Python operations on the HANA dataframe are physically exected in HANA db without moving data between the server and a client,
- To display the result of any operations, we need to apply
collect()method to convert HANA dataframe to Pandas (and as a result to bring data from HANA db server to the local client).
Use DBeaver to check data in SAP HANA…
You might remember me already using DBeaver — the free database tool supporting SAP HANA — in my previous post “GeoArt with SAP HANA and DBeaver“.
I am using it now again, and indeed I can find the table
df_confd in the schema
HANAML with all the data from the source Pandas dataframe.
…and do a spatial preview
As the table contains latitude and longitude columns I can visualize impacted countries/states right from DBeaver with the following SQL using Spatial data preview.
SELECT NEW ST_POINT("Long", "Lat"), "Country/Region", "Province/State", "Confirmed" FROM HANAML."df_confd";
I needed to change the map projection to
EPSG:4326 to get these points on the map. And DBeaver shows me the rest of the record data when I click on any point.
[Below is the old screenshot from 2020-03-11, which demonstrates as well the different granularity of e.g. US data used at that time]
DBeaver spatial preview is not a full-blown geospatial visual exploration tool. Yet it is good enough to see impacted countries/regions (depending on the granularity in the source files).
Should you be interested to learn more about
… then I would definitely recommend checking Hands-On Tutorial: Machine Learning push-down to SAP HANA with Python by Andreas Forster.
HANA ML is a part of the new “Advanced Analytics with SAP HANA” topic for CodeJam events. Unfortunately because of the coronavirus situation, we had to cancel the first one organized by Jakob Flaman in Bern this month. Another one is organized by Ewelina Pękała on May 27th in Katowice: https://www.eventbrite.com/e/sap-codejam-katowice-registration-99016299417. Hopefully, the situation gets normal by that time, and we will not need to cancel this one too.
Stay healthy ❤️
-Vitaliy (aka @Sygyzmundovych)