I haven’t seen too many Covid dashboards using SAP Technology, and even less with detailed UK data from the UK or England, it therefore seemed an ideal use case to test drive HANA Cloud with SAP Analytics Cloud. In this blogpost I will describe the challenges I faced and through and share some visualisations.
The datasources I have used are the most detailed publicly available data from either the UK or England. My code is on Github so you should be able to clone and build the HANA WebIDE project. You can find a link to the project at the end.
In the UK we have separate responsible entities for many functions. This means that England, Scotland, Wales and Northern Ireland can act independently and report data differently in some circumstances. For this reason some datasets I have used only contain England’s data.
All of the UK and England datasources that I have come across are file based. With CSV and Excel being the main file types. People think of those as “easy” formats to work with, but in reality they are quite clumsy. These datasources are mostly updated daily, except for the Office of National Statistics (ONS) which updated weekly.
At first I downloaded the data manually as anyone can do, however that soon became a burden and tedious. I therefore decided to automate/script the data acquisition. I used some python code to do this, with a similar framework used for most sources. At the same time I set up Visual Studio Code as my development environment for both HANA and Python. I would then have the latest files download directly into my project.
# # Python 3 Code used to download latest testing data into WebIDE project # tests='https://www.gov.uk/guidance/coronavirus-covid-19-information-for-the-public' tts_dl="../src/data/testing.csv" response = requests.get(tests) soup = BeautifulSoup(response.content) blk=soup.find_all(class_="govuk-link") for url in blk: link = url.get('href') if 'testing_time_series' in link: print(link) tts=link response = requests.get(tts) file = open(tts_dl, "wb") file.write(response.content) file.close
Data Quality is as important as ever. Does perfect data exist? Not in this case, the more data sources I integrated, the more data quality issues I encountered. Each day can bring both new and old data quality issues, with these official “reliable” sources of data. I’ve elaborated on a couple of the data quality issues encountered.
I addressed most of these issues with HANA Calculation View logic.
Data in the wrong column
This is not uncommon with CSV files, but in Excel you don’t expect that.
When this type of issue occurs first instinct is to validate your ETL or ELT logic. Once we know the data is not reliable we can build those checks into our data loads.
“Unavailable” instead of blank or null shown some of the time in a numerical field. The thousand separator appearing in some records of numerical data
Invalid/null final records in the dataset
Data Types changing, e.g. numbers (of people) gaining decimal points
Date formats changing
Sheet names changing
Spaces at the end of fields
Spaces at the end of data
Now some of this could be excused and I imagine people are working hard behind the scenes to produce these figures, but it is quite telling how many different types of issue surface on their external data. Some of the issues disappear and re-appear where as others still remain.
Development – WebIDE, VSCode, Git and Python
After watching Thomas Jung’s excellent ABAP Freak Show, covering local HANA development within Visual Studio, I decided to try it out for myself. It took a bit of understanding some of the the hana-cli commands from Thomas’s github project, but once setup I became very productive.
For me, git integration is the single biggest advantage of using the HDI and the WebIDE either with HANA Cloud or HANA 2.x. In simplistic terms git takes care of the versioning, backups, code synchronisation and even multi developer parallel work streams.
I now have multiple copies of my project, one locally in VSCode, and another in the WebIDE. I can make changes, add new files, update the data and then easily sync with git, not forgetting to commit and push!
After some experimentation I decided it would be cleanest and simplest if I also kept ALL the data itself within the WebIDE project. This makes the project completely self contained and allows easy parallel deployment to multiple landscapes.
HANA Cloud vs HANA On-Premise 2.00.x
I moved my WebIDE project between HANA 2.0.46 and HANA Cloud, the migration and making the project co-deploy in both environments was easy, only very minor changes required. Three areas required attention
Removing any references to Column Store Data Types. For example if I had a column defined as
“DOWNLOAD_DATE” DATE CS_DAYDATE
I just deleted the CS_DAYDATE in the table definition, the same was true for CS_INT, CS_DOUBLE and any others CS types you may come across in existing table definitions. This syntax without the Column Store Data Type being specified works in both environments.
There seems to be a small difference in the .hdiconfig between on-prem and HANA Cloud. This file controls the behaviour of the WebIDE and which filetypes are supported.
At the top of the file on premise HANA 2.0.46 I see
“plugin_version” : “126.96.36.199”,
Where as HANA Cloud shows
Replacing this line with the correct entry and checking you are not using any file types that aren’t supported on that platform solves this.
The final change I required was related to the npm hdi-deployer version
On premise WebIDE for SAP HANA I see a specific version
Where as HANA Cloud / WebIDE Full Stack as more generic expression
Switching to the HANA Cloud/WebIDE Full Stack syntax ( “^3.*” ) works for both environments.
Physical vs Virtual vs Browser (Back End beats Front End)
Within the SAP tool set we have multiple choices where to join (integrate) our data. All of these techniques could technically be used.
- SAC Data Blending – (browser based)
- HANA Calculation Views – (virtualised)
- Smart Data Integration Flowgraphs (persistent)
Having choices of how, when and where to integrate data is good, as implementation do vary.
I would recommend using HANA’s virtual model techniques and avoid SAP Analytics Cloud blending where ever possible. The blending in SAC is nice if you don’t have access to the HANA environment but you could be sacrificing performance and functionality. Persisting the data (via SDI flowgraph) is an additional step that I tried to avoid when possible. Especially for smaller datasets the virtualised approach works really well.
Using a virtual data model in HANA with calculation views provides neat separation of entities. I built a calculation view per source and then combined calc views as required. Single source analysis provides limited insight, by combining multiple data sources – population, regions, spatial, cases, testing, deaths and more you see a fuller picture.
When you consider complexity vs performance and functionality the virtual views in the back end wins.
The Data Doesn’t lie (but you can make it)
If you look at the above chart you would think the UK started slowly but has now ramped up it’s Corona Virus testing. However this simple chart does not tell the full story.
If you look at the same data, but now include the test type you can see that around the end of April a large amount of Orange appears which is “pillar 2” or “p2 Posted & Drive-through” tests. Post a test out to someone is not the same as receiving that test back. This “pillar 2” data was conveniently added to hit the 100,000 target that was set.
All about the Presentation
SAP Analytics Cloud looks good out of the box, even someone like myself who has no design or artistic skills, can create sensible output. Having an understanding of the data puts you in good place – which data and how to present it. No doubt these visualisations can be improved, but it should give you a feel of how easy it is, to create decent stories from the data.
With lockdown easing, there is much greater focus on the more recent data to identify hotspots. Looking at Figure 8, cases per 100,000 population makes the hotspots easier to find. Here it is clear Leicester has a serious out break. I have sorted the top 3 charts descending by number of cases per 100k people and applied different date restrictions from 30 days to last 7 days.
I also created a page where you can easily focus on an individual local authority, simply by searching or selecting on the left hand side. SAC then updates the charts to reflect this selection. Knowing the Leicester is a hot spot, we can focus on that Local Authority. The data shows this latest outbreak start towards the start of June.
I also created additional time-series charts which allowed me to compare the Office of National Statistics (ONS) data with NHS England. The top left chart shows that many of the daily figurers being reported were significantly less than the actual ONS figure.
Heatmaps from blue, cold to red hot are a great way to visualise multi-dimensional data and allow the eye to immediately be drawn to problem areas.
My WebIDE project can be found here.
Together HANA Cloud and SAP Analytics Cloud provide a great platform for analysing and visualising data. Combine that with Git integration and Visual Studio and you have a fantastic way to develop and acquire datasets.