Drill, baby names and the BI Platform
Drill is a relatively new datasource offered by the Apache foundation. The promise is compelling: “Query any non-relational datastore”.
Exposing a variety of non relational sources in a relational fashion can add value to our Business Intelligence stack. In this article, we explore through an example how we can connect Drill to the SAP BI Platform.
To make the use case a bit interesting, we will be using the US baby names dataset from the Kaggle datasets. This dataset is available for free (registration required).
Technical environment for this article
There are many ways to install Apache Drill and the SAP BI Platform. This article focuses on demonstrating the feasibility, so the the technical environment remains voluntarily simple.
We will rely on a standalone setup of Drill on one Windows server. Our experiments tend to show that Drill keeps the promise of identical behavior when run on a Hadoop cluster, so we can afford to keep things simple.
Our SAP BI Platform setup is composed of two computers:
- The server is a Windows 64-bit server. No clustering, no Unixes, let’s keep things simple !
- The client is a Windows computer (architecture does not matter).
Both run the SAP BI Platform 4.2 SP03.
You may attempt to configure the Apache Drill connectivity in more complex deployments, and there is a good chance it works. Please refer to the product documentation for specifics of such environments.
Installing Apache Drill
Download Drill from the Apache website
Unzip the file contents into a folder, and open a command prompt in that folder. Navigate to the bin subfolder and run the following command :
sqlline.bat -u “jdbc:drill:zk=local”
Congratulations you have a (tiny) drill server up and running !
A bit of configuration
In its default mode, Drill does not attempt to use the first line of CSV files as column titles, and exposes them as column 0, column 1 etc. The CSV file does have column titles, so you need to configure Drill to handle them.
On your Drill machine, open the URL http://localhost:8047/ , go to the Storage tab. The “Enabled Storage Plugins” section should contain “dfs”, click on its update button.
In the JSON that is shown, add the following entry to the “formats” section:
This instructs Drill to handle CSV files correctly. To apply this to your file, add an entry in the workspaces section for the folder where you have stored the dataset CSV files, based on the example below:
“location”: “/data (means c:\data, replace with your own folder)”,
Installing Drill client libraries
To enable the SAP BI stack to consume Drill, you now need to install the ODBC drivers for Drill. These drivers are available at the MapR site. Install the 32-bit driver on your SAP BI Client tools computer, and the 64-bit driver on your SAP BI Server computer.
Launch the 32-bit ODBC data source administrator on your SAP BI Client tools computer, and the 64-bit ODBC data source administrator on your SAP BI Server computer.
On both computers, go to the System DSN tab and create a DSN using the Drill driver with the same name (DrillWindowsServer in my example) :
Click on the Test… button to ensure it works :
Connecting the SAP BI Platform to Drill
On both BI Client and server computers, edit the file dataAccess\connectionServer\odbc\odbc.sbo to add the following section :
<DataBase Active=”Yes” Name=”Apache Drill”>
<Parameter Name=”CharSet Table” Platform=”Unix”>datadirect</Parameter>
<Parameter Name=”SQL Parameter File”>drill</Parameter>
<Parameter Name=”Array Fetch Available”>True</Parameter>
<Parameter Name=”Max Rows Available”>True</Parameter>
Again, on both server and client machines, copy the odbc.prm and odbcen.prm to files named drill.prm and drillen.prm, and set the DISTINCT parameter in the drill.prm fileto YES :
Restart the server to ensure the configuration change is taken into account.
Launch the Information Design Tool, and start creating a connection to Drill, as it appears in the datasources list :
The current (simple) setup does not implement security, so leave the credentials blank, and pick the DSN previously created. Test connection should be successful.
Leave all other configuration parameters to their default values, and create the connection.
SQL specifics for Drill
Let us now review the connection. Switch to the Show values tab, you can expand the list of workspaces, but tables that are inside them are not retrieved.
This is partly due to Apache Drill’s approach: most of the data and its structure is discovered at query time. So the table name you would provide in a SELECT statement contains the path to the table. Moreover, the same artefact can be queried in different ways. For instance, a CSV file can be queried as one table, or as part of a folder (where all CSV files are joined together as one virtual table).
This means you will not be able to rely on IDT’s assistance to create the universe. Instead you are going to encapsulate the Drill table you want to query in a derived table and use it to build our universe.
One caveat here: Drill relies on backquotes in the table name, so beware of typos for quotes in your queries. In the Show values tab, enter the following query (adapt the file path to the actual path for your data file):
SELECT * FROM dfs.`data/us-baby-names/StateNames.csv` LIMIT 100
Notice the LIMIT clause at the end of the SQL sentence, which avoids parsing the entire dataset when your need is only a simple test.
Once you have the correct path, you should get some data:
The SELECT * may look like a good candidate, but the Drill documentation recommends to cast the columns explicitly into their target datatypes (including character field sizes) to improve performance. In our case, in order to use a measure, you will need the column to be of a numeric datatype, and casting will be a requirement for this.
Adding casts transforms the SQL into :
SELECT CAST(Id AS INTEGER) AS Id, CAST(Name AS VARCHAR(15)) AS Name,
CAST(`Year` AS INTEGER) AS BirthYear, CAST(Gender AS VARCHAR(1)) AS Gender,
CAST(State AS VARCHAR(2)) AS State, CAST(`Count` AS INTEGER) AS Occurences
Notice the backquotes around Drill’s reserved words.
This SQL can be pasted into a derived table definition in IDT.
At this stage you can build a universe based on the derived table. Save your data foundation, and create a business layer based on it with automatic object detection.
Remove the Id object.
Change the type for Occcurences to a measure and edit its SQL expression accordingly:
To enable easy filtering, create parameters for Year and State filtering (State is shown below):
Create filters based on these parameters (again, the screenshot shows the State instance):
Publish the universe to your BI Platform Server, you are now ready to build documents based on the Drill datasource, for instance you can see how many children were named Sapphire in various US states in 2014.
In this article we have demonstrated how to create a connection to a Drill server from the SAP BI Platform. The demonstration covered both client and server tools, and concludes with a working WebI document.
Drill’s SQL support and performance are good. The text file we relied on has been used without any transformation in Drill, and response times are good, queries complete in a few seconds.
However using Drill has required us to do a few adaptations to the usual way of working with relational databases:
- The most notable adaptation is the inability to retrieve database metadata in Information Design Tool. The workaround to this is the use of derived tables. This turns out to generate more complex SQL queries, but Drill handles it well
- Along the same lines of Drill’s flexible metadata model, the SQL for derived tables needs to be accurate in describing the data it is accessing.
All in all, Drill has held its promise, at the cost of a bit of extra work for the universe designer.