Technical Articles
Quickly load Covid-19 data with hana_ml and see with DBeaver
Well, analyzing the spread of the SARS-CoV-2 coronavirus was not my dream use case. But based on the responses to Ferry Djaja‘s Tracking Coronavirus COVID-19 Near Real Time with SAP HANA XSA article I decided to add my two groszy too.
[Updated on 20-03-30 with the changed links to the source data; and the new map output based on the new data granularity. Thanks Douglas Maltby for your comment!]
In his blog post, Ferry used JavaScript in SAP HANA XSA to pull the data from CSV files updated daily by Johns Hopkins University.
I would like to show you how you can pull and load these files into SAP HANA using just a few lines of code thanks to SAP HANA Python Client API for Machine Learning (hana_ml
package).
Some people were confused with the visualization on the map at the end — please note that this article focuses on technical use case connecting different components, not on doing coronavirus data deep analysis.
Get Python environment, e.g. Jupyter
I will use Jupyter in the Docker container for that. Please have a look at my previous post Understanding containers (part 05): shared files between the host and containers if you are not familiar with how to start it. As well you can do all the same steps below from any other Python environment.
So, I have my container myjupyter01
running. I am connected to the Jupyter UI as described in the previous blog.
Install hana_ml
The Jupyter image I used from the Docker Hub registry was jupyter/minimal-notebook
. It contains already some popular data processing packages, like pandas
.
But additionally, I need to install hana_ml
, which — in its current version 1.0.8 — is available on PyPI repository: https://pypi.org/project/hana-ml/.
The command to run the installation is python -m pip install hana_ml
, but because I am running it from Jupyter notebook with Python3 kernel, I need to run it with !
at the beginning:
!python -m pip install hana_ml
Obviously, this installation step has to be done only once. No need to rerun it in the same container e.g. when reloading the newest files.
Use pandas
to import files with data
Let’s import the same three files (confirmed
, deaths
, recovered
) from https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_time_series as Ferry used in his example.
import hana_ml, pandas
# Links updated on 2020-03-22
df_confd = pandas.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv')
df_death = pandas.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv')
df_recvd = pandas.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_recovered_global.csv')
#Links from before March 22nd
#df_confd = pandas.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Confirmed.csv')
#df_death = pandas.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Deaths.csv')
#df_recvd = pandas.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Recovered.csv')
As you can see from the preview of the Pandas dataframe, it lists only countries or provinces with confirmed cases, and every day the new column is added with the latest data from the previous day. Lines are added when the first case(s) is confirmed in the new region.
Use pandas
to re-format the data frame
Before persisting the data in SAP HANA, let’s:
- Remove all date columns except the last one,
- Rename the last column from the actual date (like today’s
3/10/20
toConfirmed
).
df_confd_latest=df_confd.drop(df_confd.columns[4:len(df_confd.columns)-1], axis='columns')
df_confd_latest.columns = [*df_confd_latest.columns[:-1],'Confirmed']
Use hana_ml
to persist data in SAP HANA table
Now let me connect to my instance of SAP HANA Express with the user hanaml
that already exists there…
cc=hana_ml.dataframe.ConnectionContext('12.34.567.890', 39015, 'hanaml', 'MyPasswordReusedEverywhere')
…and convert the Pandas dataframe df_confd_latest
into a HANA dataframe hdf_confd
.
hdf_confd=hana_ml.dataframe.create_dataframe_from_pandas(cc, df_confd_latest, 'df_confd', force=True)
Once the HANA dataframe is created:
- A physical column table is created in HANA and data from Pandas dataframe is inserted there,
- HANA dataframe
hdf_confd
in Python does not store any data in your laptop, but only points to a tableHANAML.df_confd
in SAP HANA server memory, and all Python operations on the HANA dataframe are physically exected in HANA db without moving data between the server and a client, - To display the result of any operations, we need to apply
collect()
method to convert HANA dataframe to Pandas (and as a result to bring data from HANA db server to the local client).
Use DBeaver to check data in SAP HANA…
You might remember me already using DBeaver — the free database tool supporting SAP HANA — in my previous post “GeoArt with SAP HANA and DBeaver“.
I am using it now again, and indeed I can find the table df_confd
in the schema HANAML
with all the data from the source Pandas dataframe.
…and do a spatial preview
As the table contains latitude and longitude columns I can visualize impacted countries/states right from DBeaver with the following SQL using Spatial data preview.
SELECT NEW ST_POINT("Long", "Lat"), "Country/Region", "Province/State", "Confirmed" FROM HANAML."df_confd";
I needed to change the map projection to EPSG:4326
to get these points on the map. And DBeaver shows me the rest of the record data when I click on any point.
[Below is the old screenshot from 2020-03-11, which demonstrates as well the different granularity of e.g. US data used at that time]
DBeaver spatial preview is not a full-blown geospatial visual exploration tool. Yet it is good enough to see impacted countries/regions (depending on the granularity in the source files).
Should you be interested to learn more about hana_ml
…
… then I would definitely recommend checking Hands-On Tutorial: Machine Learning push-down to SAP HANA with Python by Andreas Forster.
HANA ML is a part of the new “Advanced Analytics with SAP HANA” topic for CodeJam events. Unfortunately because of the coronavirus situation, we had to cancel the first one organized by Jakob Flaman in Bern this month. Another one is organized by Ewelina Pękała on May 27th in Katowice: https://www.eventbrite.com/e/sap-codejam-katowice-registration-99016299417. Hopefully, the situation gets normal by that time, and we will not need to cancel this one too.
Stay healthy ❤️
-Vitaliy (aka @Sygyzmundovych)
Very interesting work!
Although ML is in the article, this posting is not about Machine Learning. However since you are showing use of python, it could be interesting to see if from the. data you have and the methods that are available, one could do some forecasts.
Hi Luca Toldo
Indeed, the focus of this post was just a quick tip on loading csv data from GitHub into HANA table. And there is too little data to run any meaningful ML algorithms...
But there were blog posts on how to use
hana_ml
with Python already by Andreas Forster, Shivam Shukla or https://people.sap.com/arun.godwin.patel.Thanks for adding me here in this post , will surely check this out on priority how we can forecast something using the dataset.
Thanks,
Shivam
External links on Machine Learning with COVID-19 data
https://towardsdatascience.com/machine-learning-for-biology-how-will-covid-19-mutate-next-4df93cfaf544
COVID-19 Open Research Database
https://pages.semanticscholar.org/coronavirus-research
COVID-19 Kaggle dataset (daily updated)
https://www.kaggle.com/sudalairajkumar/novel-corona-virus-2019-dataset
Hi Witalij
Excellent post and I too followed Ferry's post about the tracking of the virus - I have also been following a Hungarian team responsible for the RSOE EDIS Emergency and Disaster Information System http://hisz.rsoe.hu/alertmap/index2.php
Similar topic but they have been doing Global Disaster tracking like this for years. But now if we can use hana_ml then why not.
Do you have this post as a PDF presentation perhaps ?
Regards
Graham Hardy
Hi Graham. There is the whole collection of Covid-19 maps and visuals available at http://www.cidrap.umn.edu/covid-19/maps-visuals
No, I do not have it in any other form, as the purpose of this post was just to share a quick tech tip 🙂
Best regards.
FYI, the COVID-19 source files were moved and restructured on March 22. See the Readme.md here: https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_time_series
The new source files are in the below statements to read the files.
Hope that helps those who try this exercise later!
Doug
Thanks Doug!
I updated the code and the screenshot (which is terrifying in the comparison to the previous from just a few weeks ago).
The COVID-19 US data is also updated daily by county with FIPS codes and long/lat in 2 separate files in the repository. I've added these 2 US files to my analysis exercises alongside the global data with a few dataframe adjustments. See below. The data ties exactly to what's being reported locally and globally.
While the situation and data are grim, I've enjoyed using Jupyter, Python, pandas, HANA ML, HANA HXE and DBeaver to experiment with them all on some very real and relevant data.
Clear steps defined thanks for sharing....
Great approach for quickly ingesting a Pandas dataframe into HANA with only a minimal amount of code!