
Introduction

Apache Drill is a relatively new query engine offered by the Apache foundation. Its promise is compelling: "Query any non-relational datastore".

Exposing a variety of non-relational sources in a relational fashion can add value to a Business Intelligence stack. In this article, we explore, through an example, how to connect Drill to the SAP BI Platform.

To make the use case a bit interesting, we will be using the US baby names dataset from the Kaggle datasets. This dataset is available for free (registration required).

Technical environment for this article

There are many ways to install Apache Drill and the SAP BI Platform. This article focuses on demonstrating feasibility, so the technical environment remains voluntarily simple.

We will rely on a standalone setup of Drill on one Windows server. Our experiments tend to show that Drill keeps the promise of identical behavior when run on a Hadoop cluster, so we can afford to keep things simple.

Our SAP BI Platform setup is composed of two computers:

  • The server is a Windows 64-bit server. No clustering, no Unixes, let's keep things simple!
  • The client is a Windows computer (architecture does not matter).

Both run the SAP BI Platform 4.2 SP03.

You may attempt to configure the Apache Drill connectivity in more complex deployments, and there is a good chance it works. Please refer to the product documentation for specifics of such environments.

Installing Apache Drill

Download Drill from the Apache website

Unzip the file contents into a folder, open a command prompt in that folder, navigate to the bin subfolder and run the following command:

sqlline.bat -u "jdbc:drill:zk=local"

Congratulations, you have a (tiny) Drill server up and running!
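As a quick sanity check (an optional step, not required for the setup), you can query Drill's built-in sys schema from the sqlline prompt to confirm the engine answers SQL:

SELECT * FROM sys.version;

This returns the Drill version information and proves the embedded server is accepting queries.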

A bit of configuration

In its default mode, Drill does not use the first line of CSV files as column titles; it exposes the raw values as a generic columns array (columns[0], columns[1], etc.). Our CSV file does have column titles, so you need to configure Drill to read them.

On your Drill machine, open the URL http://localhost:8047/ and go to the Storage tab. The "Enabled Storage Plugins" section should contain "dfs"; click its Update button.

In the JSON that is shown, add the following entry to the “formats” section:

"csv": {
  "type": "text",
  "extensions": ["csv"],
  "extractHeader": true,
  "delimiter": ";"
}

This instructs Drill to handle CSV files correctly. To apply this to your file, add an entry in the workspaces section for the folder where you have stored the dataset CSV files, based on the example below:

"data": {
  "location": "/data",
  "writable": false,
  "defaultInputFormat": "csv"
}

(Here "/data" means C:\data on the Windows server; replace it with your own folder.)
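To verify the workspace is taken into account (an optional check, not in the original steps), you can list its contents from the Query tab of the Drill web UI or from sqlline:

SHOW FILES IN dfs.data;

If the workspace is configured correctly, this lists the files and subfolders of the folder you declared, without reading any data.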

Installing Drill client libraries

To enable the SAP BI stack to consume Drill, you now need to install the ODBC drivers for Drill. These drivers are available at the MapR site. Install the 32-bit driver on your SAP BI Client tools computer, and the 64-bit driver on your SAP BI Server computer.

Launch the 32-bit ODBC data source administrator on your SAP BI Client tools computer, and the 64-bit ODBC data source administrator on your SAP BI Server computer.

On both computers, go to the System DSN tab and create a DSN using the Drill driver, with the same name on both machines (DrillWindowsServer in my example):

[Screenshot: creating the Drill system DSN]

Click on the Test… button to ensure it works:

[Screenshot: successful DSN connection test]

Connecting the SAP BI Platform to Drill

On both the BI client and server computers, edit the file dataAccess\connectionServer\odbc\odbc.sbo to add the following section:

<DataBase Active="Yes" Name="Apache Drill">
  <Libraries>
    <Library Platform="MSWindows">dbd_wodbc3</Library>
    <Library Platform="MSWindows">dbd_odbc3</Library>
    <Library Platform="MSWindows">dbd_wddodbc3</Library>
    <Library Platform="MSWindows">dbd_ddodbc3</Library>
  </Libraries>
  <Parameter Name="CharSet Table" Platform="Unix">datadirect</Parameter>
  <Parameter Name="Family">Apache</Parameter>
  <Parameter Name="SQL Parameter File">drill</Parameter>
  <Parameter Name="Array Fetch Available">True</Parameter>
  <Parameter Name="Max Rows Available">True</Parameter>
</DataBase>

Again, on both server and client machines, copy the odbc.prm and odbcen.prm files to files named drill.prm and drillen.prm, and set the DISTINCT parameter in the drill.prm file to YES:

<Parameter Name="DISTINCT">YES</Parameter>
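On Windows, the copy step can be done from a command prompt opened in the connectionServer\odbc folder. The path below is an assumption based on a default BI 4.x install; adapt it to your own installation directory:

cd "C:\Program Files (x86)\SAP BusinessObjects\SAP BusinessObjects Enterprise XI 4.0\dataAccess\connectionServer\odbc"
copy odbc.prm drill.prm
copy odbcen.prm drillen.prm

Then edit drill.prm to set the DISTINCT parameter as shown above.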

Restart the server, and any open client tools, to ensure the configuration change is taken into account: the configuration files are loaded at product startup.

Launch the Information Design Tool and start creating a connection to Drill, which now appears in the datasources list:

[Screenshot: Apache Drill entry in the IDT datasource list]

The current (simple) setup does not implement security, so leave the credentials blank and pick the DSN created previously. The connection test should be successful.

[Screenshot: successful connection test in IDT]

Leave all other configuration parameters to their default values, and create the connection.

SQL specifics for Drill

Let us now review the connection. Switch to the Show values tab: you can expand the list of workspaces, but the tables inside them are not retrieved.

[Screenshot: workspace list without tables in IDT]

This is partly due to Apache Drill's approach: most of the data and its structure are discovered at query time, so the table name you provide in a SELECT statement contains the path to the table. Moreover, the same artefact can be queried in different ways. For instance, a CSV file can be queried as one table, or as part of a folder (where all CSV files are combined into one virtual table).
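To illustrate (these queries assume the dfs workspace configured earlier), the same CSV file can be addressed directly by its path, or indirectly through its parent folder:

-- Query a single file as one table
SELECT * FROM dfs.`data/us-baby-names/StateNames.csv` LIMIT 10;

-- Query the folder: all CSV files in it are read as one virtual table
SELECT * FROM dfs.`data/us-baby-names` LIMIT 10;

Both are valid table names for Drill, which is precisely why a static metadata catalog cannot enumerate them in advance.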

This means you will not be able to rely on IDT's assistance to create the universe. Instead, you are going to encapsulate the Drill table you want to query in a derived table and use it to build your universe.

One caveat here: Drill relies on backquotes around table names, so beware of typos in the quote characters of your queries. In the Show values tab, enter the following query (adapt the file path to the actual path of your data file):

SELECT * FROM dfs.`data/us-baby-names/StateNames.csv` LIMIT 100

Notice the LIMIT clause at the end of the SQL statement, which avoids parsing the entire dataset when you only need a quick test.

Once you have the correct path, you should get some data:

[Screenshot: sample query results in the Show values tab]

The SELECT * may look like a good candidate, but the Drill documentation recommends casting the columns explicitly into their target datatypes (including character field sizes) to improve performance. In our case, a measure requires a numeric column, so casting is a requirement anyway.

Adding casts transforms the SQL into:

SELECT CAST(Id AS INTEGER) AS Id, CAST(Name AS VARCHAR(15)) AS Name,
  CAST(`Year` AS INTEGER) AS BirthYear, CAST(Gender AS VARCHAR(1)) AS Gender,
  CAST(State AS VARCHAR(2)) AS State, CAST(`Count` AS INTEGER) AS Occurences
FROM dfs.`data/us-baby-names/StateNames.csv`

Notice the backquotes around Drill’s reserved words.

This SQL can be pasted into a derived table definition in IDT.

[Screenshot: derived table definition in IDT]

At this stage you can build a universe based on the derived table. Save your data foundation, and create a business layer based on it with automatic object detection.

Remove the Id object.

Change the type of Occurences to a measure and edit its SQL expression accordingly:

[Screenshot: Occurences measure definition]
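A typical measure expression here (an assumption, since it depends on the name you gave the derived table; DrillBabyNames below is a hypothetical name) would aggregate the column with SUM:

SUM(DrillBabyNames.Occurences)

Any aggregation function supported by Drill could be used instead, depending on the analysis you want.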

To enable easy filtering, create parameters for Year and State filtering (State is shown below):

[Screenshot: State parameter definition]

Create filters based on these parameters (again, the screenshot shows the State instance):

[Screenshot: State filter definition]
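The filter's WHERE expression would look like the following (an assumption: DrillBabyNames is the hypothetical derived table name and StateParam the name you gave the parameter; your names may differ):

DrillBabyNames.State = @Prompt(StateParam)

The @Prompt function references the named parameter, so the end user is prompted for a state value when the filter is used in a document.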

Publish the universe to your BI Platform server: you are now ready to build documents based on the Drill datasource. For instance, you can see how many children were named Sapphire in various US states in 2014.

[Screenshot: WebI document showing Sapphire name counts per state]

Conclusion

In this article we have demonstrated how to create a connection to a Drill server from the SAP BI Platform. The demonstration covered both client and server tools, and concluded with a working WebI document.

Drill's SQL support and performance are good. The text file we relied on was used in Drill without any transformation, and queries complete in a few seconds.

However using Drill has required us to do a few adaptations to the usual way of working with relational databases:

  • The most notable adaptation is the inability to retrieve database metadata in Information Design Tool. The workaround is to use derived tables. This generates more complex SQL queries, but Drill handles them well.
  • In the same vein, because of Drill's flexible metadata model, the SQL for derived tables needs to describe accurately the data it is accessing.

All in all, Drill has held its promise, at the cost of a bit of extra work for the universe designer.


4 Comments


  1. Thomas Buxereau

    Hi there

    Do you know if it’s possible to run this with BO 4.1 sp5?
    We are not ready for a 4.2 upgrade yet unfortunately but I would like to try and get Drill working.

    Thanks

  2. Claire Fowler

    Hi David

    I have followed the steps you mentioned below in detail however when I open the Connection tab in BO IDT I do not see Apache Drill anywhere??
    We have the Drill drivers loaded and can connect and run queries through the ODBC driver window.

    I have edited the odbc.sbo file and copied the two prm files and renamed them, however I still don’t see the Apache Drill option under apache. I have installed SAP BO 4.2 sp3 to do this.

    We are in a POC phase at the moment and it’s vital that we can access Drill.

    Thanks

  3. David MOBBS Post author

    Hi Claire,

    Have you restarted the client product (e.g. Information Design Tool) after saving the edited configuration? The configuration files are loaded at product startup, so you do need to restart the product for the configuration to take effect.

    Otherwise, do make sure you edit 3 files:

    • ODBC.sbo
    • drill.prm
    • drill(your locale code).prm

    If any of these 3 is missing, then driver configuration is incorrect and the connectivity would not show up.

    regards,

    David.

