Your SAP on Azure – Part 19 – Expose Azure Databricks to SAP HANA using ODBC connection

The use cases for Hadoop and SAP HANA are different, but in some cases it is recommended, or even required, to connect both worlds. Hadoop clusters operate on very large datasets of mostly unstructured data, while SAP HANA is focused on analytical processing. An ETL process that replicates the Hadoop data is not always a good solution, as it can dramatically increase the database memory requirements. But a smart combination of both technologies benefits from lower storage cost and high-speed in-memory processing power at the same time.

There are a few ways to expose data coming from a Hadoop cluster to the SAP HANA database. Some time ago I wrote two blog posts describing how to use SAP Vora and the Spark Extension to query data from Azure HDInsight and Azure Databricks in SAP HANA. But SAP Vora is not required: SAP HANA can be connected directly to the Hadoop cluster using just ODBC drivers or the HANA Spark Controller. Today I present the first approach, but don't worry, a blog about the Spark Controller is on the way!

In a previous episode I already presented how to deploy an Azure Databricks cluster, so this time I jump directly to the configuration part. If you don't have a cluster yet, I recommend reading Part 17 first.

ODBC DRIVERS

I use unixODBC as the driver manager. It was not installed by default on my server, so I used YaST to add the missing component:

The Simba ODBC drivers that I'm going to use today are available on the Databricks website. To access the download link you'll have to provide details like your name and e-mail address. After around 10 minutes I had a new e-mail in my inbox.

The SAP HANA database runs on Linux, so I downloaded the 64-bit version and extracted the archive:

unzip SimbaSparkODBC-2.6.4.1004-LinuxRPM-64bit.zip
cd SimbaSparkODBC-2.6.4.1004-LinuxRPM-64bit/
dir

The RPM installation is also straightforward:

rpm -ivh simbaspark-2.6.4.1004-1.x86_64.rpm

By default the package is extracted to the /opt/simba/spark directory:

The driver library libsparkodbc_sb64.so is located in /opt/simba/spark/lib/64.

CONFIGURE ODBC DRIVERS

To configure the drivers we'll need input from the Databricks cluster. It exposes the ODBC protocol, and the connection details are available under Clusters -> Configuration -> ODBC.

To establish the connection we'll also need a personal access token, which can be generated in the Databricks dashboard. Click on User Settings under the profile icon in the top right corner:

Click on Generate New Token and choose a validity period. The token will be displayed on the screen – remember to copy it as it is not possible to display it again.

The ODBC configuration relies on two text files that should be kept in the home directory of the <sid>adm user. The first one can be copied from the driver location and stores the driver settings. There is no need to change anything inside.

su - <sid>adm
cd $HOME
cp /opt/simba/spark/lib/64/simba.sparkodbc.ini .simba.sparkodbc.ini

The second file keeps the connection information. Create the .odbc.ini file with the following content:

[Databricks]
Driver=/opt/simba/spark/lib/64/libsparkodbc_sb64.so
Server=<Server Hostname from the Databricks ODBC settings>
HOST=<Server Hostname from the Databricks ODBC settings>
PORT=443
SparkServerType=3
Schema=default
ThriftTransport=2
SSL=1
AuthMech=3
HTTPPath=<HTTP Path from the Databricks ODBC settings>
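Mistakes in this file are the most common cause of connection problems, so it can be worth checking it programmatically before moving on. A minimal sketch in Python (my own addition; the DSN name Databricks and the key list come from the file above):

```python
import configparser

# The DSN definition from .odbc.ini above (placeholders kept as-is)
ODBC_INI = """\
[Databricks]
Driver=/opt/simba/spark/lib/64/libsparkodbc_sb64.so
HOST=<Server Hostname from the Databricks ODBC settings>
PORT=443
SparkServerType=3
Schema=default
ThriftTransport=2
SSL=1
AuthMech=3
HTTPPath=<HTTP Path from the Databricks ODBC settings>
"""

def missing_dsn_keys(ini_text, dsn="Databricks"):
    """Return the required keys absent from the given DSN section."""
    required = ["Driver", "HOST", "PORT", "SparkServerType",
                "ThriftTransport", "SSL", "AuthMech", "HTTPPath"]
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    present = set(parser[dsn])          # configparser lower-cases option names
    return [k for k in required if k.lower() not in present]

print(missing_dsn_keys(ODBC_INI))       # an empty list means the section is complete
```

Running the same check against the real file (parser.read() instead of read_string()) quickly catches a misspelled key.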

unixODBC uses environment variables to locate its configuration files. The easiest way to set them is to create a .customer.sh file in the home directory, which is executed automatically when the database user logs in.

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/simba/spark/lib/64
export ODBCINI=$HOME/.odbc.ini
export SIMBASPARKINI=$HOME/.simba.sparkodbc.ini

Log out and log in again, then check the current unixODBC configuration. The user data sources entry should point to the .odbc.ini file we created:

odbcinst -j
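If odbcinst -j still shows the system-wide defaults, the variables were not picked up on login. A small illustrative check (my own sketch, not part of the official setup), assuming the variable names used in .customer.sh above:

```shell
# Sketch: after a fresh login, both variables should resolve to existing files.
check_odbc_env() {
    [ -n "$ODBCINI" ] && [ -f "$ODBCINI" ] &&
    [ -n "$SIMBASPARKINI" ] && [ -f "$SIMBASPARKINI" ]
}

if check_odbc_env; then
    echo "unixODBC user configuration found"
else
    echo "variables not set - re-check ~/.customer.sh"
fi
```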

TEST THE CONNECTION

I think this is a good moment to stop for a while and check whether the connection to Databricks works. unixODBC can sometimes be tricky, so I want to ensure I configured it correctly before doing anything in SAP HANA. But to test the connection we need some data in the Databricks tables.

There are two ways to create a table in Azure Databricks. You can either convert a DataFrame into a table or use the user interface to import a file through the browser.

In the previous post I presented how to import data residing in Azure Data Lake into a DataFrame. We just need one more line of code to save it as a table:

val df = spark.read.option("inferSchema", true).csv("dbfs:/mnt/DemoDatasets/airports.dat")
df.show()
df.write.saveAsTable("Airports_DF")

The alternative way is also very easy. In the Databricks dashboard click on Data in the left menu and select Add Data. On the screen below you can also see the table I just created using the script:

The dataset I’m using is available here:

https://raw.githubusercontent.com/jpatokal/openflights/master/data/airports.dat

Now you can just drag and drop the file to the browser window and adjust parameters like table name or table headers:

Now that we have some data, we can query the Databricks cluster using the ODBC drivers. When authenticating with a personal access token, the username is the literal string token and the password is the token value generated earlier:

isql <DSN> <username> <password>

If you can see the SQL command prompt, the connection was established successfully and we can run a sample query:

select airport_name, city from airports_dat limit 10
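The same smoke test can also be scripted. Below is a minimal sketch using the pyodbc package (my own assumption; any ODBC-capable client works), with the DSN, table, and column names from above. The helper only assembles the SQL string; the connection part needs a live cluster:

```python
def build_query(table, columns, limit=10):
    """Assemble the sample query from the table and column names used above."""
    return "select {} from {} limit {}".format(", ".join(columns), table, limit)

def run_smoke_test():
    """Run the query over the DSN; needs the pyodbc package and a live cluster."""
    import pyodbc  # assumption: installed separately (pip install pyodbc)

    # With a personal access token, the user is the literal string "token"
    # and the password is the token value generated earlier.
    conn = pyodbc.connect("DSN=Databricks;UID=token;PWD=<personal access token>",
                          autocommit=True)
    cursor = conn.cursor()
    cursor.execute(build_query("airports_dat", ["airport_name", "city"]))
    for row in cursor.fetchall():
        print(row)

print(build_query("airports_dat", ["airport_name", "city"]))
```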

The connection is working fine and we can continue with the HANA configuration.

CREATE A VIRTUAL TABLE IN SAP HANA

The goal of today's post is to access the Databricks cluster from the SAP HANA database. We have already verified that the connection works, so the last task is to create a new Remote Source and reference the Databricks table as a virtual table.

Create a new Remote Source and provide the ODBC DSN together with user credentials:

Now you can expand the Remote Sources to see the available tables in the Azure Databricks:

Create a virtual table:

Now you can query it as if it were part of the SAP HANA database.
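For completeness, the same objects can also be created in SQL instead of through the HANA studio dialogs. This is only a sketch: the remote source name, schema, and table names below are illustrative, and the exact CONFIGURATION clause of CREATE REMOTE SOURCE for a generic ODBC adapter can vary between HANA revisions, so check the SQL reference for your release:

```sql
-- Remote source over the unixODBC DSN; the "token" user convention as in the isql test
CREATE REMOTE SOURCE "DATABRICKS" ADAPTER "ODBC"
  CONFIGURATION 'DSN=Databricks'
  WITH CREDENTIAL TYPE 'PASSWORD' USING 'user=token;password=<personal access token>';

-- Reference the Databricks table as a virtual table, then query it like any other
CREATE VIRTUAL TABLE "MYSCHEMA"."VT_AIRPORTS"
  AT "DATABRICKS"."<NULL>"."default"."airports_dat";

SELECT TOP 10 * FROM "MYSCHEMA"."VT_AIRPORTS";
```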

6 Comments
  • Thanks for sharing this – very timely (as you know :))


    Are you able to demonstrate how this works if you establish your data source using Azure Databricks with Delta instead of a standard Parquet approach? I believe it should be quite similar but yield much better performance and efficiency.


    Also, do you have any guidance on how complex a data model / data sourcing you can support from your source data, and the influence it may have on the cost of running Azure Databricks? E.g. would this be good for running your Spark cluster 24×7, or more for caching your data into a BW for your reporting and then using something like Databricks Delta to append the updates (either batch or streaming) into the BW/4HANA stack?

    • Hello Chris,

      I ran a test and I was able to expose the Delta tables using the ODBC drivers.

      The challenge I see in your approach is that it's very close to a regular ETL process. Smart Data Access is designed to work with online data sources to provide (near) real-time data without the need to copy them.

      I'm not an expert in BW, so I'm not really sure what options you have to work with the cache, but if you want to keep the data in BW and update it with changes, then what's the advantage of this solution over a regular ETL?

      To access the data online you'd require the Databricks cluster to be running. But if you could use a set of small VMs, it doesn't necessarily have to be expensive, especially compared to the price of a HANA VM with additional memory. I'd also check the current availability of the Databricks cluster; maybe it's already running most of the time. If you're able to analyze at what time users access the data, maybe you could schedule the automatic start and stop based on time?

  • Hello,

    Thanks for this wonderful blog. Extremely detailed and easy to follow.

    Can you also share some insight on how can we connect to Azure Databricks from an on-premise HANA DB?

    Are there some options of connecting via a proxy host or connect via some gateway system like SAP Cloud Connector or Azure Data Management Gateway?

    Thanks and Regards

    Shubra


  • Great article,  Bartosz!

    If we use a standalone DP Agent on a Windows server, I assume we just need to install the Windows ODBC driver and configure the DSN there, right?

    Also, have you compared the performance between odbc and spark controller?

    Which connection type is recommended?
    Regards,

    Aaron