Your SAP on Azure – Part 19 – Expose Azure Databricks to SAP HANA using ODBC connection
The use cases for the Hadoop and SAP HANA are different, but in some cases, it is recommended or even required or to connect both worlds. The Hadoop clusters operate on very large datasets that consist of unstructured data while SAP HANA is focused on analytical processing. An ETL process that would replicate the Hadoop data is not always a good solution as it could dramatically increase the database memory requirements. But a smart usage of both technologies benefits from lower storage cost and high-speed in-memory processing power at the same time.
There are a few ways to expose data coming from a Hadoop cluster to the SAP HANA database. Some time ago I wrote two blog posts where I describe how to use SAP Vora and the Spark Extension to query the data from the Azure HDInsight and from the Azure Databricks in SAP HANA. But the SAP Vora is not required. SAP HANA can be directly connected to the Hadoop cluster using just ODBC drivers or the HANA Spark Controller. Today I present the first way, but don’t worry, a blog about the Spark Controller is on the way!
In previous episode I have already presented how to deploy the Azure Databricks cluster, so this time I jump directly to the configuration part. If you don’t have a cluster already, I’d recommend reading the Part 17 firstly.
I use the unixODBC as the Driver Manager. It was not installed by default on my server, so I use Yast to add the missing component:
The Simba ODBC drivers that I’m going to use today are available on the Databricks website. To access the download link you’ll have to provide your details like name or e-mail address. After around 10 minutes I had a new e-mail in my Inbox.
The SAP HANA database runs on Linux environment. I downloaded the 64-bit version and extracted the archive.
unzip SimbaSparkODBC-22.214.171.1244-LinuxRPM-64bit.zip cd SimbaSparkODBC-126.96.36.1994-LinuxRPM-64bit/ dir
The RPM installation is also straightforward
rpm -ivh simbaspark-188.8.131.524-1.x86_64.rpm
By default the package is extracted to the /opt/simba/spark directory:
The driver library libsparkodbc_sb64.so is located in the /opt/simba/spark/lib/64
CONFIGURE ODBC DRIVERS
To configure the drivers we’ll need an input from the Databricks cluster. It exposes the ODBC protocol and the connection details are available under Clusters -> Configuration -> ODBC
To establish the connection we’ll also require a personal token that can be generated in the Databricks dashboard. Click on the User Settings under the profile icon in the top right corner:
Click on Generate New Token and choose a validity period. The token will be displayed on the screen – remember to copy it as it is not possible to display it again.
The ODBC configuration relays on two text files that should be kept in the home directory of the <sid>adm user. The first one can be copied from the driver location and stores the driver settings. There is no need to change anything inside.
su - <sid>adm cd $HOME cp /opt/simba/spark/lib/64/simba.sparkodbc.ini .simba.sparkodbc.ini
The second file keeps the connection information. Create the .odbc.ini file with following content:
[Databricks] Driver=/opt/simba/spark/lib/64/libsparkodbc_sb64.so Server=<Server Hostname from the Databricks ODBC settings> HOST=<Server Hostname from the Databricks ODBC settings> PORT=443 SparkServerType=3 Schema=default ThriftTransport=2 SSL=1 AuthMech=3 HTTPPath=<HTTP Path from the Databricks ODBC settings>
The unixODBC uses the system variables to identify configuration files. The easiest way to modify them is to create the .customer.sh file located in the home directory, that will be automatically executed during the database user login.
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/simba/spark/lib/64 export ODBCINI=$HOME/.odbc.ini export SIMBASPARKINI=$HOME/.simba.sparkodbc.ini
Log out and log in again to check the current unixODBC configuration. User data sources should point to the .odbc.ini file that we created.
TEST THE CONNECTION
I think this is a good moment to stop for a while and check if the connection to the Databricks works. Sometimes the unixODBC can be tricky so I want to ensure I configured it correctly before I do anything in SAP HANA. But to test the connection we need some data in the Databricks tables.
There are two ways to create a table in the Azure Databricks. You can either convert a DataFrame into a table or use the user interface to import a new file using a browser.
In the previous post I presented how to import data residing in the Azure Data Lake into a DataFrame. We can just add one line of code to save it as a Table:
val df = spark.read.option("inferSchema", true).csv("dbfs:/mnt/DemoDatasets/airports.dat") df.show() df.write.saveAsTable("Airports_DF")
The alternative way is also very easy. In the Databricks dashboard click on Data in the left menu and Select Add Data. On the screen below you can also see the table I just created using the script.:
The dataset I’m using is available here:
Now you can just drag and drop the file to the browser window and adjust parameters like table name or table headers:
Now, when we have some data, we can query the Databricks cluster using the ODBC drivers.
isql <DSN> <username> <password>
If you can see the SQL command prompt it means the connection is established successfully and we can run a sample query:
select airport_name, city from airports_dat limit 10
The connection is working fine and we can continue with the HANA configuration.
CREATE A VIRTUAL TABLE IN SAP HANA
Today’s post goal is to access the Databricks cluster from SAP HANA database. We have already verified that the connection is working properly, so the last task is to create a new Remote Source and reference the Databricks table as a Virtual Tables.
Create a new Remote Source and provide the ODBC DSN together with user credentials:
Now you can expand the Remote Sources to see the available tables in the Azure Databricks:
Create a virtual table:
Now you can query it as it was a part of the SAP HANA database