Crash testing IBM dashDB with SAP BI
Introduction
IBM dashDB is a new database from IBM. It targets the cloud, but promises continuity in experience with the DB2 ecosystem.
Friendly colleagues at IBM provided me with an instance I could deploy on a virtual machine in our labs, so I have had the chance to give the database a test drive.
To make things more interesting, I have used an open dataset: the French government has published the database of road accidents for the past years. This dataset carries hundreds of thousands of accidents, with the people involved, vehicles, locations etc. All of this has been anonymized, and it becomes a great playground for our experiments.
Environment setup
As said above, I have used a local instance of dashDB. This comes in the form of a docker container, which I deployed on a virtual machine in one of our labs. Setup was smooth, and once done the dashDB console is easily accessible using a web browser.
At this stage Ican upload the tables from the dataset. CSV import is easy to use and works well (one caveat: the dataset is French, so we need to pick codepage 1252). Once done, the tables are available in the database:
It is worth noting that I created no indexes or keys, the database tells me it is using columnar support, so I let it handle optimizations.
Let the experiment begin
At this stage I can fire up my trusty Information Design Tool, and try to connect to the database. dashDB recommends the use of DB2 connectors, so I simply created the connection as a classic DB2 connection (in my case, using the JDBC driver) :
Everything works fine, and the tables I uploaded are available under my user, so I can proceed to create a data foundation, which materialises the links between the tables :
Business layer is the next logical step. I created 2 measures:
- number of accidents, which is a simple count of the accidents’ IDs.
- number of people killed, which involves counting a calculated field (contatenation of accident and passenger ids) and filtering for severity of injuries.
These 2 measures rely on different tables, so once I start using both I necessarily end up joining tables and performing calculations.
A few tests in Information Design Tool bring back good results: the database executes the SQL with no issues, and does perform it quickly. Most queries return within a second, even those that require joining and calculating across datasets.
A few facts (and SQL queries)
This chapter explores a few queries that I have run during my tests. For each query I captured the SQL using the View SQL button in the query panel.
Accidents trend
How many accidents per year ? And how many people killed each year ?
The first part of the question is quite easily expressed:
SELECT
BLUADMIN.CARACTERISTIQUES.AN,
COUNT(DISTINCT BLUADMIN.CARACTERISTIQUES.NUM_ACC)
FROM
BLUADMIN.CARACTERISTIQUES
GROUP BY
BLUADMIN.CARACTERISTIQUES.AN
ORDER BY
1
The result is blazingly fast: 94 ms
The second part of the question (how many people killed each year) is a bit tricker :
SELECT
BLUADMIN.CARACTERISTIQUES.AN,
COUNT(DISTINCT (BLUADMIN.USAGERS.NUM_ACC CONCAT ‘:’ CONCAT BLUADMIN.USAGERS.NUM_VEH CONCAT ‘:’ CONCAT BLUADMIN.USAGERS.PLACE))
FROM
BLUADMIN.CARACTERISTIQUES,
BLUADMIN.USAGERS,
BLUADMIN.VEHICULES
WHERE
( BLUADMIN.CARACTERISTIQUES.NUM_ACC=BLUADMIN.VEHICULES.NUM_ACC )
AND ( BLUADMIN.VEHICULES.NUM_VEH = BLUADMIN.USAGERS.NUM_VEH AND BLUADMIN.VEHICULES.NUM_ACC = BLUADMIN.USAGERS.NUM_ACC )
AND ( BLUADMIN.USAGERS.GRAV = 2 )
GROUP BY
BLUADMIN.CARACTERISTIQUES.AN
It is also fast to execute: 160 ms.
Put together, I can now check the trend of roadkill in the past years:
Well, let’s say that 2013 was a good year and that this graph explains why country leaders are again talking about fighting for safety on the roads.
Safer hours to drive
Similar queries (which I won’t copy here) give trends of accidents and deaths across hours of the day.
The Mortality line is not the number of deaths in itself, but divided by the number of accidents. To simplify, how much chance to you have to die if involved in an accident ?
Many things appear on that graph:
- there are more accidents during daytime (7am-8pm)
- there are even peaks around office begin/end hours
- daytime accidents are more benign
- night-time accidents (the 2am-6am period) are the most lethal
Safest drivers
To close this series of SQL queries, are men or women the safest drivers ?
In addition to the traditional gender dispute, this query is also interesting for the database because it requires :
- using the (see above) count measures, sometimes filtered
- joining the tables to find the driver involved in each accident
For the number of accidents, the resulting SQL sentence is :
SELECT
CASE BLUADMIN.USAGERS.SEXE
WHEN 1 THEN ‘Masculin’
WHEN 2 THEN ‘Féminin’
ELSE ‘Autres’
END,
COUNT(DISTINCT BLUADMIN.CARACTERISTIQUES.NUM_ACC)
FROM
BLUADMIN.USAGERS,
BLUADMIN.CARACTERISTIQUES,
BLUADMIN.VEHICULES
WHERE
( BLUADMIN.CARACTERISTIQUES.NUM_ACC=BLUADMIN.VEHICULES.NUM_ACC )
AND ( BLUADMIN.VEHICULES.NUM_VEH = BLUADMIN.USAGERS.NUM_VEH AND BLUADMIN.VEHICULES.NUM_ACC = BLUADMIN.USAGERS.NUM_ACC )
GROUP BY
CASE BLUADMIN.USAGERS.SEXE
WHEN 1 THEN ‘Masculin’
WHEN 2 THEN ‘Féminin’
ELSE ‘Autres’
END
As before, this does execute fast, in under a second.
The resulting visualization speaks for itself:
Sorry lads, still more accidents and worse ones than women…
Conclusion
The crash test is successful: my dashDB instance has performed well. From my point of view it did behave like any DB2 instance.
If you want to try it yourself, you are probably in for a satisfactory experience, but please bear in mind the following points:
- I have used the JDBC driver. DB2 CLI is also supported by IBM, but I have not tried it.
- My tests give a feel of how the database behaves with our products, but do not replace a full certification. As of today the dashDB database is not supported for SAP BI.
Glad to see SAP BusinessObjects and DashDB work well together. In fact, since DashDB and DB2 have same code base one should have wondered if not (,-)