Vivek-RR
Product and Topic Expert

Introduction


With the announcements at SAP Data Unleashed, SAP introduced SAP Datasphere, the successor of SAP Data Warehouse Cloud – a powerful SAP Business Technology Platform data service that addresses the data-to-value strategy of every enterprise organization by delivering seamless access to business data. Datasphere has been enriched with new features, thereby delivering a unified service for data integration, cataloging, semantic modeling, data warehousing, and virtualizing workloads across SAP and non-SAP data.


I am sure lots of blogs will be published soon discussing the latest offerings, roadmaps, and features. However, this blog focuses on the latest announcement related to open data partners, and I am going to start by focusing on Databricks.


With the rise of the lakehouse platform, which combines both data warehouses and data lakes, there has been a trend of SAP customers exploring unified analytics platforms – unified environments that address different aspects of data management, governance, development, and, finally, deployment of analytic workloads based on diverse data sources and formats. Previously, customers had to replicate data completely out of SAP environments to adopt a lakehouse platform. The current partnership with Databricks focuses on simplifying and integrating these hybrid landscapes efficiently.

The Register posted an article regarding SAP Datasphere, and it resonates exactly with the SAP messaging.

 


 

Please note that what I am about to discuss are the data federation scenarios between SAP Datasphere and Databricks that work as of today. There will be additional ways of integrating with Databricks in the future.

Brief Introduction to the Lakehouse Platform


In simple terms, a lakehouse is a data management architecture that enables users to perform diverse workloads such as BI, SQL analytics, data science, and machine learning on a unified platform. Utilizing a combined data management platform such as a lakehouse has the following benefits:

  1. Enables direct data access across SAP and non-SAP sources

  2. Simplifies data governance

  3. Removes data redundancy

  4. Provides direct connectivity to BI tools

  5. Simplifies security with a single source of truth

  6. Supports open-source & commercial tooling

  7. Operates on multi-cloud environments, and many more.


Databricks is one such lakehouse platform; it takes a unified approach by integrating disparate workloads to execute data engineering, analytics, data science, and machine learning use cases. As mentioned on their site, the platform is simple, open, and multi-cloud. We will discuss the lakehouse platform's features and capabilities in future blogs, but as mentioned before, here we are going to focus on a data federation scenario to access data from Databricks SQL in SAP Datasphere.

Consider a scenario where data from a non-SAP source is continuously ingested into cloud object storage, say AWS S3. Note that Databricks has an Auto Loader feature to efficiently process data files from different cloud storage services as they arrive and ingest them into the lakehouse seamlessly. We then utilize the Delta Live Tables framework to build data pipelines and store the transformed data in Delta format on cloud storage, which can subsequently be accessed by Databricks SQL (DB SQL).
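As a rough illustration of that ingestion step, here is a minimal Delta Live Tables sketch in Python. The bucket path, file format, and options are assumptions and will differ for your landing zone; the table name simply mirrors the raw table used later in this blog.

import dlt

@dlt.table(
    name="bucketraw1",  # raw table later queried through Databricks SQL (name assumed)
    comment="Files picked up incrementally from cloud object storage via Auto Loader"
)
def bucketraw1():
    # Auto Loader ("cloudFiles") detects new files in the landing bucket as they arrive
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "csv")       # assumed source format
        .option("header", "true")
        .load("s3://<your-landing-bucket>/raw/")  # hypothetical S3 path
    )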

As shown in the integration scenario below, SAP Datasphere connects to Databricks SQL with the existing data federation capabilities, and users can blend the data with SAP sources for reporting/BI workloads based on SAP Analytics Cloud (SAC).


Assuming you have processed the incoming data and persisted it as tables in Databricks SQL, you will then perform the following steps to establish connectivity to SAP Datasphere.



Data Federation with Databricks SQL


Prerequisites



  1. You have access to SAP Datasphere with authorization to create a new Data Provisioning Agent.

  2. You have access to a virtual machine or an on-premise system where you can install the Data Provisioning Agent.

  3. You have access to Databricks clusters as well as a SQL warehouse.

  4. You have downloaded the JDBC driver from the Databricks website.


Databricks Cluster & SQL Access


As mentioned in the previous section, I assume you are already working on Databricks topics. If you are interested in getting access, you can sign up for the 14-day free trial, or you can sign up from a hyperscaler marketplace such as the AWS Marketplace.


Once you log in to your account, you will notice the unified environment with different workspaces:

Data Science and Engineering

Machine Learning

SQL

Navigate to the “Data Science and Engineering” workspace and select the compute that you have been using for data transformation. Note that when we build Delta Live Tables pipelines, they use their own compute to run; we will discuss that in later blogs.


Select the all-purpose compute to get additional information.


Navigate to the “Advanced options” to get JDBC connection details


You will find the connection details here. We need to tweak them a bit before connecting with the generic JDBC connector. Also, we will be connecting using a personal access token and not user credentials.


 

You can generate a personal access token from the user settings; save the token for later use.


Here is the modified JDBC URL that we will use in the SAP Datasphere connection. We need to add “IgnoreTransactions=1” to ignore transaction-related operations. In your case, you will simply copy the URL from the advanced options and add the IgnoreTransactions parameter as shown below:
jdbc:databricks://dbc-xxxxxxxxxxx.cloud.databricks.com:443/default;IgnoreTransactions=1;transportMode=http;ssl=1;httpPath=sql/protocolv1/o/XXXXXXXXXXXXXXXXX268/XX-XX-56fird2a;AuthMech=3;UID=token;PWD=<Personal Access Token>

We can now navigate to Databricks SQL to confirm that we have the necessary schemas and tables to access. The tables were created under the schema “dltcheck”.
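If you want to double-check the endpoint outside the UI, a quick sanity test of the hostname, HTTP path, and personal access token can be done with the databricks-sql-connector Python package before touching SAP Datasphere. This is a minimal sketch; the placeholders mirror the JDBC URL above.

from databricks import sql

# Verify the cluster/SQL endpoint and token work before configuring the federation
with sql.connect(
        server_hostname="dbc-xxxxxxxxxxx.cloud.databricks.com",
        http_path="sql/protocolv1/o/XXXXXXXXXXXXXXXXX268/XX-XX-56fird2a",
        access_token="<Personal Access Token>") as conn:
    with conn.cursor() as cursor:
        cursor.execute("SHOW TABLES IN dltcheck")  # schema used in this blog
        for row in cursor.fetchall():
            print(row)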


 

Data Provisioning Agent (DP Agent) Installation Instructions


I am not going to discuss the DP Agent installation in detail, as there are numerous articles posted on the same topic. I will only address the specific changes that need to be made for Databricks access.

In order to establish secure connectivity between SAP Datasphere and Databricks SQL, the Data Provisioning Agent (DP Agent) has to be installed and configured on a virtual machine. For the DP Agent installation, you can refer to the following document. To connect the DP Agent to the SAP HANA Cloud tenant of SAP Datasphere, please follow the steps mentioned in this document. Assuming all the steps are completed, the status of the DP Agent will be displayed as “Connected”. In my case, the DP Agent is DBRICKSNEW.


 

Configuration Adjustments on the Virtual Machine


Navigate to the folder <dpagent>/camel/lib and copy in the JAR file downloaded from the Databricks site. Also, extract the JAR files in the same location.


Then navigate to the <dpagent>/camel folder and adjust the properties in configfile-jdbc.properties.


Change the delimident value to BACKTICK.


Save and restart the DP Agent. Log in to SAP Datasphere to check the status of the DP Agent; it should display the status as "Connected". Edit the DP Agent and select the agent adapter “CamelJDBCAdapter”.


Now we should be able to connect to Databricks SQL from SAP Datasphere.

Establishing Connection from SAP Datasphere


Navigate to your space and create a new connection using the “Generic JDBC” connector.


Provide any business name and the following details:
Class: com.databricks.client.jdbc.Driver

JDBC URL: jdbc:databricks://<your databricks account>.cloud.databricks.com:443/default;IgnoreTransactions=1;transportMode=http;ssl=1;httpPath=sql/protocolv1/o/XXXXXXXXXXXXXXXXX268/XX-XX-XXXXXX;AuthMech=3;UID=token;PWD=<Personal Access Token>


Please copy the JDBC URL as is from the Databricks cluster's advanced options and just add the parameter “IgnoreTransactions=1”. Setting this property to 1 makes the driver ignore any transaction-related operations and return success.


Provide your Databricks user account credentials, or token credentials with the user set to token, and select the Data Provisioning Agent that you just activated.


With the connection details and configurations done properly, validation should be successful.


 

To make sure we have access to the data, let’s use the Data Builder to build an analytical model that can be consumed in SAP Analytics Cloud. Navigate to the Data Builder and create a new Graphical View.


Navigate to Sources->Connections->Databricks("Your Business Name For Generic JDBC Connection")->"Databricks Schema". You will find the list of tables under the schema "dltcheck"


 

Here is the Databricks SQL Access for the same schema "dltcheck"


Select the data with which you want to build the analytical model. In my case it is bucketraw1; I added some calculated columns, aggregated the necessary data, and exposed the relevant columns as the analytical model “Confirmed_cases”. The data preview of the model shows the corresponding records too.


This model could be consumed in SAP Analytics Cloud for reporting.

Troubleshooting Errors


1. If you do not see the connection validated, there are two options to troubleshoot: either use the generic log files from the DP Agent to identify the issue, or add the log parameters to the JDBC URL and collect the specific errors. If you face validation issues, add these parameters to the JDBC URL:

 
jdbc:databricks://<youraccount>.cloud.databricks.com:443/default;LogLevel=5;LogPath=<foldername>;IgnoreTransactions=1;transportMode=http;ssl=1;httpPath=sql/protocolv1/o/<SQLPath>;AuthMech=3;UID=token;PWD=<token>

LogLevel – 5, which enables logging at the DEBUG level; this logs detailed information that is useful for debugging the connector.

LogPath – this will be available in the following path: /usr/sap/<dpagentfolder>/

The log path can be found on the Linux VM, where the specific log files will be generated.


2. If you see errors related to mismatched input, adjust the Camel JDBC properties as mentioned in the DP Agent installation instructions: the “DELIMIDENT” value should be set to “BACKTICK”.
2022-12-19 16:31:57,552 [ERROR] [f6053d30-7fcd-436f-bed8-bf4d1358847788523] DefaultErrorHandler | CamelLogger.log [] - Failed delivery for (MessageId: 2567788B8CF359D-0000000000000000 on ExchangeId: 2567788B8CF359D-0000000000000000). Exhausted after delivery attempt: 1 caught: java.sql.SQLException: [Databricks][DatabricksJDBCDriver](500051) ERROR processing query/statement. Error Code: 0, SQL state: org.apache.hive.service.cli.HiveSQLException: Error running query: org.apache.spark.sql.catalyst.parser.ParseException: mismatched input '"FUNCTION_ALIAS_PREFIX_0"' expecting {<EOF>, ';'}(line 1, pos 33)

== SQL ==

SELECT SUM("confirmed_cases") AS "FUNCTION_ALIAS_PREFIX_0", SUM("deaths") AS "FUNCTION_ALIAS_PREFIX_1", "covidsummarygcs$VT"."county_name" FROM "default"."covidsummarygcs" "covidsummarygcs$VT" GROUP BY "covidsummarygcs$VT"."county_name"
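For reference, once DELIMIDENT is set to BACKTICK, the same statement is pushed down with backtick-quoted identifiers, which Spark SQL can parse. The line below is only an illustration, not captured output:

SELECT SUM(`confirmed_cases`) AS `FUNCTION_ALIAS_PREFIX_0`, SUM(`deaths`) AS `FUNCTION_ALIAS_PREFIX_1`, `covidsummarygcs$VT`.`county_name` FROM `default`.`covidsummarygcs` `covidsummarygcs$VT` GROUP BY `covidsummarygcs$VT`.`county_name`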

 

3. If you see a communication link failure, then the IgnoreTransactions parameter has not been set in your JDBC URL.

         


As mentioned before, data federation is enabled through the JDBC connector as of now, but things will change with additional connection options in the future. Hopefully this helped you understand the data federation capabilities with Databricks. Please do share your feedback. In case of connection issues, feel free to comment, and I will try to help you out.

 
19 Comments
werner_daehn
Active Contributor
So in essence, there is nothing new. Not even a new SDI Adapter.

The SDI Agent has existed for nine years, the CamelJDBCAdapter for seven.

The DP Agent is not even provided by Databricks; each customer must install their own in a separate container image.

Note: Contrary to what is commonly said, one DP Agent can serve multiple HANA instances and connect to different systems.

 

https://help.sap.com/docs/HANA_SMART_DATA_INTEGRATION/7952ef28a6914997abc01745fef1b607/598cdd48941a4...

 
Vivek-RR
Product and Topic Expert
Hi Werner

Thanks for your feedback.
As mentioned in my blog, I have explained how to establish a connection between SAP Datasphere (initial release) and Databricks as of now.
But there will be additional ways of integrating in the future, as mentioned in our corporate blog. Hope this clarifies.
martinfrick
Product and Topic Expert
Thank you vivek.rr for taking the time to share your technical Datasphere knowledge with the community. Your blog post is very informative and well-written! Highly Appreciated!
Vivek-RR
Product and Topic Expert
Thanks martinfrick  for your feedback!
werner_daehn
Active Contributor
Regarding
there will be additional ways of integrating in the future, as mentioned in our corporate blog.

I have read that now twice but cannot find anything in the text pointing in this direction. What paragraphs did you have in mind?
Vivek-RR
Product and Topic Expert
Hi Werner

You can check the article as highlighted below


julienleloup
Explorer
Hey vivek.rr ,

 

Thanks a lot for this blog! We're trying to set up something similar, and your walkthrough has helped us a lot!

I do have one question regarding the user credentials. In your walkthrough, you mention you get a personal access token from the DBX user settings, but it's unclear to me what you do with this. Do you paste this in the JDBC url in the PWD parameter? And if so, which credentials do you then use in the end to log in?

 

thanks in advance.

 

Regards,

Julien
Vivek-RR
Product and Topic Expert

Hi Julien

Thanks for your feedback. If you look at the last parameter, that's where you will add the token. You will just add the token there and NOT the user ID. UID will be hardcoded as "token", as you see below.

jdbc:databricks://<youraccount>.cloud.databricks.com:443/default;LogLevel=5;LogPath= <foldername>;IgnoreTransactions=1;transportMode=http;ssl=1;httpPath=sql/protocolv1/o/<SQLPath>;AuthMech=3;UID=token;PWD=<token>

And when you log in from SAP Datasphere, just provide the Databricks email as the username. The password can be the actual Databricks password or the token. Hope this helps.

 

dsun
Discoverer
Hi vivek.rr

 

Thanks a lot for such a detailed explanation of data federation between SAP Datasphere and Databricks.

I have a quick question: you highlighted one bullet point in a post below:

  • Databricks and SAP will deliver bi-directional integration between SAP Datasphere—with SAP data’s complete business context—with its Data Lakehouse platform on any cloud platform


It mentions 'bi-directional', but in your post, SAP Datasphere consumes Databricks data. Do you know whether there is any update on Databricks consuming SAP Datasphere data? Where can I find related materials or documentation? Thanks in advance.

 

Regards,

David
Vivek-RR
Product and Topic Expert

Hi @dsun, you can now replicate to cloud storage using premium outbound integration with replication flows, and the cloud storage can be read directly by Databricks using Auto Loader. Check this blog series. You can also read/write using FedML/JDBC. I am almost done with a blog on this topic with in-depth technical details and will share it shortly.

virendevi_2
Discoverer

Thanks for the blog. Can you please advise whether we can change the number of fetched records per block via the JDBC string?

jdbc:databricks://<youraccount>.cloud.databricks.com:443/default;LogLevel=5;LogPath= <foldername>;IgnoreTransactions=1;

@Vivek-RR 

Vivek-RR
Product and Topic Expert

@virendevi_2 Thanks for reaching out. Yes, you can try the parameter RowsFetchedPerBlock. As mentioned in the document, please also check from the performance perspective if you are going to increase the default value. You can see all the connector configuration options when you download the JDBC driver; the configuration PDF is under the docs folder.
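For illustration only (the value below is an assumption, not a recommendation), the URL from this blog would then look like:

jdbc:databricks://<youraccount>.cloud.databricks.com:443/default;IgnoreTransactions=1;transportMode=http;ssl=1;httpPath=<SQLPath>;AuthMech=3;UID=token;PWD=<token>;RowsFetchedPerBlock=20000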


 

kavi12
Discoverer

Hi Vivek,

I have added the Databricks cluster URL as mentioned in the document and am using "token" as the username and the generated token as the password. But I am getting the error "Remote Tables: internal error: Cannot get remote source objects: Could not get tables, Code: 403, SQL State: HY000". Your advice will help immensely.

 

DP Agent log:

[2024-01-26 18:31:59] [error] [ 5072] apxServiceControl(): dwState(4) != dwCurrentState(1); dwWin32ExitCode = 0, dwWaitHint = 2000, dwServiceSpecificExitCode = 0
[2024-01-26 18:31:59] [error] [ 5072] The service did not start due to a logon failure.
[2024-01-26 18:31:59] [error] [ 5072] apxServiceControl(): returning FALSE
[2024-01-26 18:31:59] [error] [ 5072] The service did not start due to a logon failure.
[2024-01-26 18:31:59] [error] [ 5072] Failed to start service 'SAP_HANA_SDI_Agent_Service_Daemon_Agentt'.
[2024-01-26 18:31:59] [error] [ 5072] The service did not start due to a logon failure.
[2024-01-26 18:31:59] [error] [ 5072] Apache Commons Daemon procrun failed with exit value: 5 (failed to start service).
[2024-01-26 18:31:59] [error] [ 5072] The service did not start due to a logon failure.

 

Thanks,

Kavi

Vivek-RR
Product and Topic Expert

Hi @kavi12

Could you share the URL that was used to connect from the Datasphere side? Please share the whole URL.

kavi12
Discoverer

@Vivek-RR Thanks a lot for the swift response.

Here are the URL combinations I tried.

jdbc:databricks://dbc-b17fa12e-0815.cloud.databricks.com:443/default;transportMode=http;ssl=1;httpPath=sql/protocolv1/o/3502160834700658/0224-050807-9tuoxcuc;AuthMech=3;IgnoreTransactions=1;UseNativeQuery=0

jdbc:databricks://dbc-b17fa12e-0815.cloud.databricks.com:443/default;IgnoreTransactions=1;transportMode=http;ssl=1;httpPath=sql/protocolv1/o/3502160834700658/0224-050807-9tuoxcuc;AuthMech=3;UID=token;PWD=<token>

Thanks

 

 

marcelo_bruno
Discoverer

Hi, @Vivek-RR!

 

Thanks so much for your blog!!

We're trying the configuration but are facing an issue with the connection. When Datasphere fetches the preview data, it shows a JDBC error, as in the log below:

Caused by: com.databricks.client.support.exceptions.ErrorException: [Databricks][JDBCDriver](500051) ERROR processing query/statement. Error Code: 0, SQL state: 42601, Query: SELECT TOP***, Error message from Server: org.apache.hive.service.cli.HiveSQLException: Error running query: [PARSE_SYNTAX_ERROR] org.apache.spark.sql.catalyst.parser.ParseException:
[PARSE_SYNTAX_ERROR] Syntax error at or near '1000'. SQLSTATE: 42601 (line 1, pos 11)

 

== SQL ==
SELECT TOP 1000  `centro_dbrocks$VT`.`id`, `centro_dbrocks$VT`.`armazem`, `centro_dbrocks$VT`.`centro_eva`, `centro_dbrocks$VT`.`codigo`, `centro_dbrocks$VT`.`codigo_lote`, `centro_dbrocks$VT`.`diretoria`, `centro_dbrocks$VT`.`gerencia_executiva`, `centro_dbrocks$VT`.`gerencia_geral`, `centro_dbrocks$VT`.`gestao_estoque`, `centro_dbrocks$VT`.`participacao_pb`, `centro_dbrocks$VT`.`tipo_processo`, `centro_dbrocks$VT`.`usuario` FROM `pgea`.`centro` `centro_dbrocks$VT`


We tried to use DBeaver, and the SELECT TOP also shows an error.

Is there any configuration to fix this?


Thanks in advance,
Marcelo Bruno.

yaw_mak2
Employee

Hi Marcelo,

Are you using Unity Catalog in Databricks? If so, you need to pass that along in the connection (JDBC URL): ConnCatalog=<catalog name>. For example:

...;UseNativeQuery=0;ConnCatalog=catalog_abc

Regards,

Yaw

Vivek-RR
Product and Topic Expert

Hi @marcelo_bruno, thanks for your feedback. I still see this as an error with the SQL sent to Databricks. Did you get additional logs? I would go with what @yaw_mak2 said about the parameter UseNativeQuery=0. This means "The connector transforms the queries emitted by applications and converts them into an equivalent form in HiveQL". Did you already use it in your JDBC URL?

 

marcelo_bruno
Discoverer

Thanks @Vivek-RR and @yaw_mak2. The parameter UseNativeQuery = 0 was the problem!!!