Skip to Content
Technical Articles
Author's profile photo Witalij Rudnicki

Quickly load Covid-19 data with hana_ml and see with DBeaver

Well, analyzing the spread of the SARS-CoV-2 coronavirus was not my dream use case. But based on the responses to Ferry Djaja‘s Tracking Coronavirus COVID-19 Near Real Time with SAP HANA XSA article I decided to add my two groszy too.

[Updated on 20-03-30 with the changed links to the source data; and the new map output based on the new data granularity. Thanks Douglas Maltby for your comment!]

In his blog post, Ferry used JavaScript in SAP HANA XSA to pull the data from CSV files updated daily by Johns Hopkins University.

I would like to show you how you can pull and load these files into SAP HANA using just a few lines of code thanks to SAP HANA Python Client API for Machine Learning (hana_ml package).

Some people were confused with the visualization on the map at the end — please note that this article focuses on technical use case connecting different components, not on doing coronavirus data deep analysis.

Get Python environment, e.g. Jupyter

I will use Jupyter in the Docker container for that. Please have a look at my previous post Understanding containers (part 05): shared files between the host and containers if you are not familiar with how to start it. As well you can do all the same steps below from any other Python environment.

So, I have my container myjupyter01 running. I am connected to the Jupyter UI as described in the previous blog.

Install hana_ml

The Jupyter image I used from the Docker Hub registry was jupyter/minimal-notebook. It contains already some popular data processing packages, like pandas.

But additionally, I need to install hana_ml, which — in its current version 1.0.8 — is available on PyPI repository: https://pypi.org/project/hana-ml/.

The command to run the installation is python -m pip install hana_ml, but because I am running it from Jupyter notebook with Python3 kernel, I need to run it with ! at the beginning:

!python -m pip install hana_ml

Obviously, this installation step has to be done only once. No need to rerun it in the same container e.g. when reloading the newest files.

Use pandas to import files with data

Let’s import the same three files (confirmed, deaths, recovered) from https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_time_series as Ferry used in his example.

import hana_ml, pandas

# Links updated on 2020-03-22
df_confd = pandas.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv')
df_death = pandas.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv')
df_recvd = pandas.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_recovered_global.csv')

#Links from before March 22nd
#df_confd = pandas.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Confirmed.csv')
#df_death = pandas.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Deaths.csv')
#df_recvd = pandas.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Recovered.csv')

As you can see from the preview of the Pandas dataframe, it lists only countries or provinces with confirmed cases, and every day the new column is added with the latest data from the previous day. Lines are added when the first case(s) is confirmed in the new region.

Use pandas to re-format the data frame

Before persisting the data in SAP HANA, let’s:

  1. Remove all date columns except the last one,
  2. Rename the last column from the actual date (like today’s 3/10/20 to Confirmed).
df_confd_latest=df_confd.drop(df_confd.columns[4:len(df_confd.columns)-1], axis='columns')

df_confd_latest.columns = [*df_confd_latest.columns[:-1],'Confirmed']

Use hana_ml to persist data in SAP HANA table

Now let me connect to my instance of SAP HANA Express with the user hanaml that already exists there…

cc=hana_ml.dataframe.ConnectionContext('12.34.567.890', 39015, 'hanaml', 'MyPasswordReusedEverywhere')

…and convert the Pandas dataframe df_confd_latest into a HANA dataframe hdf_confd.

hdf_confd=hana_ml.dataframe.create_dataframe_from_pandas(cc, df_confd_latest, 'df_confd', force=True)

Once the HANA dataframe is created:

  1. A physical column table is created in HANA and data from Pandas dataframe is inserted there,
  2. HANA dataframe hdf_confd in Python does not store any data in your laptop, but only points to a table HANAML.df_confd in SAP HANA server memory, and all Python operations on the HANA dataframe are physically exected in HANA db without moving data between the server and a client,
  3. To display the result of any operations, we need to apply collect() method to convert HANA dataframe to Pandas (and as a result to bring data from HANA db server to the local client).

Use DBeaver to check data in SAP HANA…

You might remember me already using DBeaver — the free database tool supporting SAP HANA — in my previous post “GeoArt with SAP HANA and DBeaver“.

I am using it now again, and indeed I can find the table df_confd in the schema HANAML with all the data from the source Pandas dataframe.

…and do a spatial preview

As the table contains latitude and longitude columns I can visualize impacted countries/states right from DBeaver with the following SQL using Spatial data preview.

SELECT NEW ST_POINT("Long", "Lat"), "Country/Region", "Province/State", "Confirmed" FROM HANAML."df_confd";

I needed to change the map projection to EPSG:4326 to get these points on the map. And DBeaver shows me the rest of the record data when I click on any point.

[Below is the old screenshot from 2020-03-11, which demonstrates as well the different granularity of e.g. US data used at that time]

DBeaver spatial preview is not a full-blown geospatial visual exploration tool. Yet it is good enough to see impacted countries/regions (depending on the granularity in the source files).

Should you be interested to learn more about hana_ml

… then I would definitely recommend checking Hands-On Tutorial: Machine Learning push-down to SAP HANA with Python by Andreas Forster.

HANA ML is a part of the new “Advanced Analytics with SAP HANA” topic for CodeJam events. Unfortunately because of the coronavirus situation, we had to cancel the first one organized by Jakob Flaman in Bern this month. Another one is organized by Ewelina Pękała on May 27th in Katowice: https://www.eventbrite.com/e/sap-codejam-katowice-registration-99016299417. Hopefully, the situation gets normal by that time, and we will not need to cancel this one too.


Stay healthy ❤️
-Vitaliy (aka @Sygyzmundovych)

Assigned Tags

      12 Comments
      You must be Logged on to comment or reply to a post.
      Author's profile photo Shmily Wang
      Shmily Wang

      Very interesting work!

      Author's profile photo Luca Toldo
      Luca Toldo

      Although ML is in the article, this posting is not about Machine Learning. However since you are showing use of python, it could be interesting to see if from the. data you have and the methods that are available, one could do some forecasts.

      Author's profile photo Witalij Rudnicki
      Witalij Rudnicki
      Blog Post Author

      Hi Luca Toldo

      Indeed, the focus of this post was just a quick tip on loading csv data from GitHub into HANA table. And there is too little data to run any meaningful ML algorithms...

      But there were blog posts on how to use hana_ml with Python already by Andreas ForsterShivam Shukla or https://people.sap.com/arun.godwin.patel.

      Author's profile photo Shivam Shukla
      Shivam Shukla

      Thanks for adding me here in this post , will surely check this out on priority how we can forecast something using the dataset.

       

      Thanks,

      Shivam

      Author's profile photo Luca Toldo
      Luca Toldo

      External links on Machine Learning with  COVID-19 data

      https://towardsdatascience.com/machine-learning-for-biology-how-will-covid-19-mutate-next-4df93cfaf544

      COVID-19 Open Research Database

      https://pages.semanticscholar.org/coronavirus-research

      COVID-19 Kaggle dataset (daily updated)
      https://www.kaggle.com/sudalairajkumar/novel-corona-virus-2019-dataset

      Author's profile photo Graham Hardy
      Graham Hardy

      Hi Witalij

       

      Excellent post and I too followed Ferry's post about the tracking of the virus - I have also been following a Hungarian team responsible for the RSOE EDIS Emergency and Disaster Information System  http://hisz.rsoe.hu/alertmap/index2.php

      Similar topic but they have been doing Global Disaster tracking like this for years. But now if we can use hana_ml then why not.

      Do you have this post as a PDF presentation perhaps ?

       

      Regards

       

      Graham Hardy

      Author's profile photo Witalij Rudnicki
      Witalij Rudnicki
      Blog Post Author

      Hi Graham. There is the whole collection of Covid-19 maps and visuals available at http://www.cidrap.umn.edu/covid-19/maps-visuals

      No, I do not have it in any other form, as the purpose of this post was just to share a quick tech tip 🙂

      Best regards.

      Author's profile photo Douglas Maltby
      Douglas Maltby

      FYI, the COVID-19 source files were moved and restructured on March 22. See the Readme.md here: https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_time_series

      The new source files are in the below statements to read the files.

      df_confd = pandas.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv')
      df_death = pandas.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv')
      df_recvd = pandas.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_recovered_global.csv')

      Hope that helps those who try this exercise later!

      Doug

      Author's profile photo Witalij Rudnicki
      Witalij Rudnicki
      Blog Post Author

      Thanks Doug!

      I updated the code and the screenshot (which is terrifying in the comparison to the previous from just a few weeks ago).

      Author's profile photo Douglas Maltby
      Douglas Maltby

      The COVID-19 US data is also updated daily by county with FIPS codes and long/lat in 2 separate files in the repository. I've added these 2 US files to my analysis exercises alongside the global data with a few dataframe adjustments. See below. The data ties exactly to what's being reported locally and globally.

      df_deathus = pandas.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_US.csv')
      df_confdus = pandas.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv')
      df_confdus_latest=df_confdus.drop(df_confdus.columns[10:len(df_confdus.columns)-1], axis='columns')
      df_confdus_latest.columns = [*df_confdus_latest.columns[:-1],'Confirmed-US']
      df_deathus_latest=df_deathus.drop(df_deathus.columns[10:len(df_deathus.columns)-1], axis='columns')
      df_deathus_latest.columns = [*df_deathus_latest.columns[:-1],'Deaths-US']

      While the situation and data are grim, I've enjoyed using Jupyter, Python, pandas, HANA ML, HANA HXE and DBeaver to experiment with them all on some very real and relevant data.

       

      Author's profile photo S Abinath
      S Abinath

      Clear steps defined thanks for sharing....

      Author's profile photo Dirk Kemper
      Dirk Kemper

      Great approach for quickly ingesting a Pandas dataframe into HANA with only a minimal amount of code!