Vivek RR

Unified Analytics with SAP Datasphere & Databricks Lakehouse Platform - Data Federation Scenarios

Introduction

With the announcements at SAP Data Unleashed, SAP introduced SAP Datasphere, the successor of SAP Data Warehouse Cloud – a powerful Business Technology Platform data service that addresses the data-to-value strategy of every enterprise organization by delivering seamless access to business data. SAP Datasphere has been enriched with new features, thereby delivering a unified service for data integration, cataloging, semantic modeling, data warehousing, and virtualizing workloads across SAP and non-SAP data.

I am sure many blogs will be published soon discussing the latest offerings, roadmaps, and features. This blog, however, focuses on the latest announcement related to open data partners, and I am going to start with Databricks.

With the rise of the lakehouse platform, which combines data warehouses and data lakes, there has been a trend of SAP customers exploring unified analytics platforms – unified environments that address the different aspects of data management, governance, development, and, finally, deployment of analytic workloads based on diverse data sources and formats. Previously, customers had to replicate data completely out of SAP environments to adopt a lakehouse platform. The current partnership with Databricks aims to simplify and integrate such hybrid landscapes efficiently.

There was an article posted by The Register regarding SAP Datasphere, and it resonated exactly with the SAP messaging.

 

 

Please note that what I am about to discuss further is the data federation scenario between SAP Datasphere and Databricks that works as of today. There will be additional ways of integrating with Databricks in the future.

Brief Introduction to the Lakehouse Platform

In simple terms, a lakehouse is a data management architecture that enables users to perform diverse workloads such as BI, SQL analytics, data science, and machine learning on a unified platform. Utilizing a combined data management platform such as a lakehouse has the following benefits:

  1. Enables Direct Data Access across SAP and Non-SAP sources
  2. Simplifies data governance
  3. Removes Data Redundancy
  4. Direct Connectivity to BI Tools
  5. Simplified security with a single source of truth
  6. Support for open source & commercial tooling
  7. Operating on multi-cloud environments and many more.

Databricks is one such lakehouse platform, taking a unified approach that integrates disparate workloads to execute data engineering, analytics, data science, and machine learning use cases. As mentioned on their site, the platform is simple, open, and multi-cloud. We will discuss the lakehouse platform's features and capabilities in future blogs, but as mentioned before, we are going to focus here on a data federation scenario to access data from Databricks SQL in SAP Datasphere.

Consider a scenario where data from a non-SAP source is continuously ingested into cloud object storage, say AWS S3. Note that Databricks has an Auto Loader feature to efficiently process data files from different cloud storages as they arrive and ingest them into the lakehouse seamlessly. We then utilize the Delta Live Tables framework to build data pipelines and store the transformed data in Delta format on cloud storage, where it can subsequently be accessed by Databricks SQL (DB SQL).
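For illustration, here is a minimal Delta Live Tables sketch in Python. This is an assumption on my side rather than the exact pipeline behind this blog: it assumes CSV files landing in a hypothetical S3 path, the table and column names loosely mirror the ones that appear later in the blog, and the target schema (for example "dltcheck") would be configured on the pipeline itself.

# Runs inside a Databricks Delta Live Tables pipeline, where `spark` is predefined.
import dlt
from pyspark.sql.functions import col

@dlt.table(comment="Raw files ingested from S3 via Auto Loader")
def bucketraw1():
    # Auto Loader incrementally picks up new files as they arrive in the bucket
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("header", "true")
        .load("s3://<your-landing-bucket>/covid/")  # hypothetical landing path
    )

@dlt.table(comment="Transformed data stored in Delta format and queryable from Databricks SQL")
def covidsummary():
    # Simple transformation step; a real pipeline would add expectations, casts, joins, etc.
    return (
        dlt.read_stream("bucketraw1")
        .select(
            col("county_name"),
            col("confirmed_cases").cast("int").alias("confirmed_cases"),
            col("deaths").cast("int").alias("deaths"),
        )
    )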

As shown in the integration scenario below, SAP Datasphere connects to Databricks SQL with the existing data federation capabilities, and users can blend the data with SAP sources for reporting/BI workloads based on SAP Analytics Cloud (SAC).

Assuming you have processed the incoming data and persisted it as tables in Databricks SQL, you then perform the following steps to establish connectivity to SAP Datasphere.

Data Federation with Databricks SQL

Prerequisites

  1. You have access to SAP Datasphere with authorization to create a new Data Provisioning Agent.
  2. You have access to a virtual machine or on-premise system where you will install the Data Provisioning Agent.
  3. You have access to Databricks clusters as well as a SQL warehouse.
  4. You have downloaded the JDBC driver from the Databricks website.

Databricks Cluster & SQL Access

As mentioned in the previous section, I assume you are already working on Databricks topics. If you are interested in getting access, you can sign up for the 14-day free trial, or sign up through a hyperscaler marketplace such as the AWS Marketplace.

Once you log in to your account, you will notice the unified environment with its different workspaces:

Data Science and Engineering

Machine Learning

SQL

Navigate to the “Data Science and Engineering” workspace and select the compute you have been using for data transformation. Note that when we build Delta Live Tables pipelines, they use their own compute to run; we will discuss that in later blogs.

Select the all-purpose compute to get additional information.

Navigate to the “Advanced options” to get JDBC connection details

You will find the connection details here. We need to tweak them a bit before connecting with the Generic JDBC connector. Also, we will be connecting using a personal access token and not user credentials.

 

You can generate the personal access token from the user settings; save the token for later use.
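As an optional sanity check (not part of the Datasphere setup itself), you can verify the new token against any workspace REST endpoint before embedding it in the JDBC URL. The host and token below are placeholders.

# Quick token check against the Databricks Clusters API; a 200 response means the token is valid.
import requests

host = "https://dbc-xxxxxxxxxxx.cloud.databricks.com"   # your workspace URL
token = "<Personal Access Token>"

resp = requests.get(f"{host}/api/2.0/clusters/list",
                    headers={"Authorization": f"Bearer {token}"})
resp.raise_for_status()
print([c["cluster_name"] for c in resp.json().get("clusters", [])])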

Here is the modified JDBC URL that we will use in the SAP Datasphere connection. We need to add “IgnoreTransactions=1” to ignore transaction-related operations. In your case, you simply copy the URL from the advanced options and add the IgnoreTransactions parameter as shown below.

jdbc:databricks://dbc-xxxxxxxxxxx.cloud.databricks.com:443/default;IgnoreTransactions=1;transportMode=http;ssl=1;httpPath=sql/protocolv1/o/XXXXXXXXXXXXXXXXX268/XX-XX-56fird2a;AuthMech=3;UID=token;PWD=<Personal Access Token>

We can now navigate to Databricks SQL to see if we have the necessary schemas and tables to access. The tables were created under the schema “dltcheck”.
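If you would like to confirm the schema, tables, and token from outside SAP Datasphere first, a small check with the databricks-sql-connector Python package (my own addition, not required for the setup) can list the tables through the same host and httpPath that appear in the JDBC URL.

# pip install databricks-sql-connector
from databricks import sql

with sql.connect(
    server_hostname="dbc-xxxxxxxxxxx.cloud.databricks.com",            # host from the JDBC URL
    http_path="sql/protocolv1/o/XXXXXXXXXXXXXXXXX268/XX-XX-56fird2a",  # httpPath from the JDBC URL
    access_token="<Personal Access Token>",
) as connection:
    with connection.cursor() as cursor:
        cursor.execute("SHOW TABLES IN dltcheck")
        for row in cursor.fetchall():
            print(row)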

 

Data Provisioning Agent (DP Agent) Installation Instructions

I am not going to discuss the DP agent installation in detail, as there are numerous articles posted on the same topic. Instead, I will address the specific changes that need to be made for Databricks access.

To establish secure connectivity between SAP Datasphere and Databricks SQL, the Data Provisioning Agent (DP agent) has to be installed and configured on a virtual machine. For the DP agent installation, you can refer to the following document. To connect the DP agent to the SAP HANA Cloud tenant of SAP Datasphere, please follow the steps mentioned in this document. Once all the steps are completed, the status of the DP agent will be displayed as “Connected”. In my case, the DP agent is DBRICKSNEW.

 

Configuration Adjustments on the Virtual Machine

Navigate to the folder <dpagent>/camel/lib and copy the JDBC driver package downloaded from the Databricks site into it. Also, extract the jar files in the same location.

Then navigate to the <dpagent>/camel folder and adjust the properties in configfile-jdbc.properties.

Change the delimident value to BACKTICK, so that the entry reads delimident=BACKTICK. This is required because Databricks (Spark) SQL delimits identifiers with backticks rather than the double quotes the adapter generates by default.

Save and restart the DP agent. Log in to SAP Datasphere and check the status of the DP agent; it should be displayed as “Connected”. Edit the DP agent and select the agent adapter “CamelJDBCAdapter”.

Now we should be able to connect to Databricks SQL from SAP Datasphere.

Establishing Connection from SAP Datasphere

Navigate to your space and create a new connection for the “Generic JDBC” connector

Provide any Business Name and the following details

Class : com.databricks.client.jdbc.Driver

JDBC URL :  jdbc:databricks://<your databricks account>.cloud.databricks.com:443/default;IgnoreTransactions=1;transportMode=http;ssl=1;httpPath=sql/protocolv1/o/XXXXXXXXXXXXXXXXX268/XX-XX-XXXXXX;AuthMech=3;UID=token;PWD=<Personal Access Token>

Please copy the JDBC URL as-is from the Databricks cluster advanced options and just add the parameter “IgnoreTransactions=1”. Setting this property to 1 makes the driver ignore any transaction-related operations and return success.

Provide your Databricks user account credentials, or token credentials with token as the user, and select the Data Provisioning Agent that you just activated.

With the connection details and configurations done properly, validation should be successful.

 

To make sure we have access to data, let’s use the Data Builder to build an analytical model that could be consumed in SAP Analytics Cloud. Navigate to Data Builder and create a new Graphical View

Navigate to Sources->Connections->Databricks(“Your Business Name For Generic JDBC Connection”)->”Databricks Schema”. You will find the list of tables under the schema “dltcheck”

 

Here is the Databricks SQL Access for the same schema “dltcheck”

Select the data with which you want to build the analytical model. In my case it is bucketraw1; I added some calculated columns, aggregated the necessary data, and exposed the relevant columns as the analytical model “Confirmed_cases”. The data preview of the model shows the corresponding records too.

This model could be consumed in SAP Analytics Cloud for reporting.

Troubleshooting Errors

1. If the connection does not validate, there are two options to troubleshoot: either use the generic log files from the DP agent to identify the issue, or add log parameters to the JDBC URL and collect the specific errors. If you face validation issues, add these parameters to the JDBC URL. You can also test the same URL and driver class standalone, outside the DP agent, to isolate driver, URL, or token problems – see the sketch after this list.

 

jdbc:databricks://<youraccount>.cloud.databricks.com:443/default;LogLevel=5;LogPath=<foldername>;IgnoreTransactions=1;transportMode=http;ssl=1;httpPath=sql/protocolv1/o/<SQLPath>;AuthMech=3;UID=token;PWD=<token>

LogLevel – 5, which enables logging at the DEBUG level; this logs detailed information that is useful for debugging the connector.

LogPath – this will be available under the following path: /usr/sap/<dpagentfolder>/

The log path can be found on the Linux VM, and the driver will generate the specific log files there.

2. If you see errors related to mismatched input, please adjust the Camel JDBC properties as mentioned in the DP Agent installation instructions: the “delimident” value should be set to “BACKTICK”.

2022-12-19 16:31:57,552 [ERROR] [f6053d30-7fcd-436f-bed8-bf4d1358847788523] DefaultErrorHandler | CamelLogger.log [] - Failed delivery for (MessageId: 2567788B8CF359D-0000000000000000 on ExchangeId: 2567788B8CF359D-0000000000000000). Exhausted after delivery attempt: 1 caught: java.sql.SQLException: [Databricks][DatabricksJDBCDriver](500051) ERROR processing query/statement. Error Code: 0, SQL state: org.apache.hive.service.cli.HiveSQLException: Error running query: org.apache.spark.sql.catalyst.parser.ParseException: mismatched input '"FUNCTION_ALIAS_PREFIX_0"' expecting {<EOF>, ';'}(line 1, pos 33)

                                == SQL ==

                                SELECT SUM("confirmed_cases") AS "FUNCTION_ALIAS_PREFIX_0", SUM("deaths") AS "FUNCTION_ALIAS_PREFIX_1", "covidsummarygcs$VT"."county_name" FROM "default"."covidsummarygcs" "covidsummarygcs$VT" GROUP BY "covidsummarygcs$VT"."county_name"

 

3. If you see a communication link failure, then the IgnoreTransactions parameter has not been set in your JDBC URL.
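As referenced in the first troubleshooting point, you can also test the exact JDBC URL and driver class standalone on the DP agent VM to separate driver, URL, or token problems from DP agent configuration problems. The sketch below uses the jaydebeapi Python package, which is my own suggestion and not part of the official setup; the jar path and URL are placeholders.

# Standalone JDBC test with the same driver class and URL used in the Datasphere connection.
# Requires a local JRE, since jaydebeapi starts a JVM via JPype.
import jaydebeapi

jdbc_url = ("jdbc:databricks://dbc-xxxxxxxxxxx.cloud.databricks.com:443/default;"
            "IgnoreTransactions=1;transportMode=http;ssl=1;"
            "httpPath=sql/protocolv1/o/XXXXXXXXXXXXXXXXX268/XX-XX-56fird2a;"
            "AuthMech=3;UID=token;PWD=<Personal Access Token>")

conn = jaydebeapi.connect(
    "com.databricks.client.jdbc.Driver",                             # same class as in Datasphere
    jdbc_url,
    jars="/usr/sap/<dpagentfolder>/camel/lib/DatabricksJDBC42.jar",  # assumed location of the copied jar
)
curs = conn.cursor()
curs.execute("SELECT 1")
print(curs.fetchall())   # [(1,)] confirms that driver, URL, and token are fine
curs.close()
conn.close()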

         

As mentioned before, data federation is enabled through the Generic JDBC connector as of now, but this will change with additional connections in the future. Hopefully this helped you understand the data federation capabilities with Databricks. Please do share your feedback. In case of connection issues, feel free to comment, and I will try to help you out.

 


      8 Comments
      Werner Dähn

      So in essence, there is nothing new. Not even a new SDI Adapter.

      The SDI Agent has existed for nine years, the CamelJDBCAdapter for seven.

      The DPAgent is not even provided by DataBricks but each customer must install his own in a separate container image.

      Note: Contrary to what is commonly said, one DPAgent can serve multiple Hana instances and connect to different systems.

       

      https://help.sap.com/docs/HANA_SMART_DATA_INTEGRATION/7952ef28a6914997abc01745fef1b607/598cdd48941a41128751892fe68393f4.html?locale=en-US&version=1.0_SPS12

       

      Vivek RR
      Blog Post Author

      Hi Werner

      Thanks for your feedback.
      As mentioned in my blog, I have explained how to establish a connection between SAP Datasphere (initial release) and Databricks as of now.
      But there will be additional ways of integrating in the future, as mentioned in our corporate blog. Hope this clarifies.

      Werner Dähn

      Regarding

      there will be additional ways of integrating in the future, as mentioned in our corporate blog.

      I have read that now twice but cannot find anything in the text pointing in this direction. What paragraphs did you have in mind?

      Vivek RR
      Blog Post Author

      Hi Werner

      You can check the article as highlighted below

      Martin Frick

      Thank you Vivek RR for taking the time to share your technical Datasphere knowledge with the community. Your blog post is very informative and well-written! Highly Appreciated!

      Vivek RR
      Blog Post Author

      Thanks Martin Frick  for your feedback!

      Julien Leloup

      Hey Vivek RR ,

       

      Thanks a lot for this blog! We're trying to set up something similar, and your walkthrough has helped us a lot!

      I do have one question regarding the user credentials. In your walkthrough, you mention you get a personal access token from the DBX user settings, but it's unclear to me what you do with this. Do you paste this in the JDBC url in the PWD parameter? And if so, which credentials do you then use in the end to log in?

       

      thanks in advance.

       

      Regards,

      Julien

      Vivek RR
      Blog Post Author

      Hi Julien

      Thanks for your feedback. If you look at the last parameter, that's where you add the token. You just add the token there and NOT the user ID. UID is hardcoded as token, as you see below.

      jdbc:databricks://<youraccount>.cloud.databricks.com:443/default;LogLevel=5;LogPath= <foldername>;IgnoreTransactions=1;transportMode=http;ssl=1;httpPath=sql/protocolv1/o/<SQLPath>;AuthMech=3;UID=token;PWD=<token>

      And when you log in from SAP Datasphere, just provide the Databricks email as the username. The password can be the actual Databricks password or the token. Hope this helps.