Vivek-RR
Product and Topic Expert

Introduction


With the announcements at SAP Data Unleashed, SAP introduced SAP Datasphere, the successor of SAP Data Warehouse Cloud – a powerful SAP Business Technology Platform data service that addresses the data-to-value strategy of every enterprise organization by delivering seamless access to business data. Datasphere has been enriched with new features, thereby delivering a unified service for data integration, cataloging, semantic modeling, data warehousing, and virtualizing workloads across SAP and non-SAP data.


I am sure lots of blogs will be published soon discussing the latest offerings, roadmaps, and features. However, this blog focuses on the latest announcement related to open data partners, and I am going to start by focusing on Databricks.


With the rise of the lakehouse platform, which combines both data warehouses and data lakes, there has been a trend of SAP customers exploring unified analytics platforms – unified environments that address different aspects of data management, governance, development, and, finally, deployment of analytic workloads based on diverse data sources and formats. Previously, customers had to replicate data completely out of SAP environments to adopt a lakehouse platform. The current partnership with Databricks focuses on simplifying and integrating these hybrid landscapes efficiently.

The Register posted an article regarding SAP Datasphere, and it resonates exactly with the SAP messaging.

 


 

Please note that what I am about to discuss are the data federation scenarios between SAP Datasphere and Databricks that work as of today. There will be additional ways of integrating with Databricks in the future.

Brief Introduction to the Lakehouse Platform


In simple terms, a lakehouse is a data management architecture that enables users to perform diverse workloads such as BI, SQL analytics, data science, and machine learning on a unified platform. Utilizing a combined data management platform such as a lakehouse has the following benefits:

  1. Enables direct data access across SAP and non-SAP sources

  2. Simplifies data governance

  3. Removes data redundancy

  4. Provides direct connectivity to BI tools

  5. Simplifies security with a single source of truth

  6. Supports open-source & commercial tooling

  7. Operates on multi-cloud environments, and many more.


Databricks is one such lakehouse platform; it takes a unified approach by integrating disparate workloads to execute data engineering, analytics, data science, and machine learning use cases. As mentioned on their site, the platform is simple, open, and multi-cloud. We will discuss the lakehouse platform's features and capabilities in future blogs, but as mentioned before, here we are going to focus on a data federation scenario to access data from Databricks SQL in SAP Datasphere.

Consider a scenario where data from a non-SAP source is continuously ingested into cloud object storage, say AWS S3. Note that Databricks has an Auto Loader feature to efficiently process data files from different cloud storage services as they arrive and ingest them into the lakehouse seamlessly. We then utilize the Delta Live Tables framework to build data pipelines and store the transformed data in Delta format on cloud storage, which can subsequently be accessed by Databricks SQL (DB SQL).
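As a rough illustration of that ingestion step, here is a minimal Delta Live Tables sketch in Python. The bucket path, file format, and options are assumptions and will differ for your landing zone; the table name simply mirrors the raw table used later in this blog.

import dlt

@dlt.table(
    name="bucketraw1",  # raw table later queried through Databricks SQL (name assumed)
    comment="Files picked up incrementally from cloud object storage via Auto Loader"
)
def bucketraw1():
    # Auto Loader ("cloudFiles") detects new files in the landing bucket as they arrive
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "csv")       # assumed source format
        .option("header", "true")
        .load("s3://<your-landing-bucket>/raw/")  # hypothetical S3 path
    )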

As shown in the integration scenario below, SAP Datasphere connects to Databricks SQL with the existing data federation capabilities, and users can blend the data with SAP sources for reporting/BI workloads based on SAP Analytics Cloud (SAC).


Assuming you have processed the incoming data and persisted it as tables in Databricks SQL, you will then perform the following steps to establish connectivity to SAP Datasphere.



Data Federation with Databricks SQL


Prerequisites



  1. You have access to SAP Datasphere with authorization to create a new Data Provisioning Agent.

  2. You have access to a virtual machine or an on-premise system where you can install the Data Provisioning Agent.

  3. You have access to Databricks clusters as well as a SQL warehouse.

  4. You have downloaded the JDBC driver from the Databricks website.


Databricks Cluster & SQL Access


As mentioned in the previous section, I assume you are already working on Databricks topics. If you are interested in getting access, you can sign up for the 14-day free trial, or you can sign up from a hyperscaler marketplace such as the AWS Marketplace.


Once you log in to your account, you will notice the unified environment with different workspaces:

Data Science and Engineering

Machine Learning

SQL

Navigate to the “Data Science and Engineering” workspace and select the compute that you have been using for data transformation. Note that when we build Delta Live Tables pipelines, they use their own compute to run; we will discuss that in later blogs.


Select the all-purpose compute to get additional information.


Navigate to the “Advanced options” to get JDBC connection details


You will find the connection details here. We need to tweak them a bit before connecting with the generic JDBC connector. Also, we will be connecting using a personal access token and not user credentials.


 

You can generate a personal access token from the user settings; save the token for later use.


Here is the modified JDBC URL that we will use in the SAP Datasphere connection. We need to add “IgnoreTransactions=1” to ignore transaction-related operations. In your case, you will simply copy the URL from the advanced options and add the IgnoreTransactions parameter as shown below:
jdbc:databricks://dbc-xxxxxxxxxxx.cloud.databricks.com:443/default;IgnoreTransactions=1;transportMode=http;ssl=1;httpPath=sql/protocolv1/o/XXXXXXXXXXXXXXXXX268/XX-XX-56fird2a;AuthMech=3;UID=token;PWD=<Personal Access Token>

We can now navigate to Databricks SQL to confirm that we have the necessary schemas and tables to access. The tables were created under the schema “dltcheck”.
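If you want to double-check the endpoint outside the UI, a quick sanity test of the hostname, HTTP path, and personal access token can be done with the databricks-sql-connector Python package before touching SAP Datasphere. This is a minimal sketch; the placeholders mirror the JDBC URL above.

from databricks import sql

# Verify the cluster/SQL endpoint and token work before configuring the federation
with sql.connect(
        server_hostname="dbc-xxxxxxxxxxx.cloud.databricks.com",
        http_path="sql/protocolv1/o/XXXXXXXXXXXXXXXXX268/XX-XX-56fird2a",
        access_token="<Personal Access Token>") as conn:
    with conn.cursor() as cursor:
        cursor.execute("SHOW TABLES IN dltcheck")  # schema used in this blog
        for row in cursor.fetchall():
            print(row)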


 

Data Provisioning Agent (DP Agent) Installation Instructions


I am not going to discuss the DP Agent installation in detail, as there are numerous articles posted on the same topic. I will only address the specific changes that need to be made for Databricks access.

In order to establish secure connectivity between SAP Datasphere and Databricks SQL, the Data Provisioning Agent (DP Agent) has to be installed and configured on a virtual machine. For the DP Agent installation, you can refer to the following document. To connect the DP Agent to the SAP HANA Cloud tenant of SAP Datasphere, please follow the steps mentioned in this document. Assuming all the steps are completed, the status of the DP Agent will be displayed as “Connected”. In my case, the DP Agent is DBRICKSNEW.


 

Configuration Adjustments on the Virtual Machine


Navigate to the folder <dpagent>/camel/lib and copy in the JAR file downloaded from the Databricks site. Also, extract the JAR files in the same location.


Then navigate to the <dpagent>/camel folder and adjust the properties in configfile-jdbc.properties.


Change the delimident value to BACKTICK.


Save and restart the DP Agent. Log in to SAP Datasphere to check the status of the DP Agent; it should display the status as "Connected". Edit the DP Agent and select the agent adapter “CamelJDBCAdapter”.


Now we should be able to connect to Databricks SQL from SAP Datasphere.

Establishing Connection from SAP Datasphere


Navigate to your space and create a new connection using the “Generic JDBC” connector.


Provide any business name and the following details:
Class: com.databricks.client.jdbc.Driver

JDBC URL: jdbc:databricks://<your databricks account>.cloud.databricks.com:443/default;IgnoreTransactions=1;transportMode=http;ssl=1;httpPath=sql/protocolv1/o/XXXXXXXXXXXXXXXXX268/XX-XX-XXXXXX;AuthMech=3;UID=token;PWD=<Personal Access Token>


Please copy the JDBC URL as is from the Databricks cluster's advanced options and just add the parameter “IgnoreTransactions=1”. Setting this property to 1 makes the driver ignore any transaction-related operations and return success.


Provide your Databricks user account credentials, or token credentials with the user set to token, and select the Data Provisioning Agent that you just activated.


With the connection details and configurations done properly, validation should be successful.


 

To make sure we have access to the data, let’s use the Data Builder to build an analytical model that can be consumed in SAP Analytics Cloud. Navigate to the Data Builder and create a new Graphical View.


Navigate to Sources->Connections->Databricks("Your Business Name For Generic JDBC Connection")->"Databricks Schema". You will find the list of tables under the schema "dltcheck"


 

Here is the Databricks SQL Access for the same schema "dltcheck"


Select the data with which you want to build the analytical model. In my case it is bucketraw1; I added some calculated columns, aggregated the necessary data, and exposed the relevant columns as the analytical model “Confirmed_cases”. The data preview of the model shows the corresponding records too.


This model could be consumed in SAP Analytics Cloud for reporting.

Troubleshooting Errors


1. If you do not see the connection validated, there are two options to troubleshoot: either use the generic log files from the DP Agent to identify the issue, or add the log parameters to the JDBC URL and collect the specific errors. If you face validation issues, add these parameters to the JDBC URL:

 
jdbc:databricks://<youraccount>.cloud.databricks.com:443/default;LogLevel=5;LogPath=<foldername>;IgnoreTransactions=1;transportMode=http;ssl=1;httpPath=sql/protocolv1/o/<SQLPath>;AuthMech=3;UID=token;PWD=<token>

LogLevel – 5, which enables logging at the DEBUG level; this logs detailed information that is useful for debugging the connector.

LogPath – this will be available in the following path: /usr/sap/<dpagentfolder>/

The log path can be found on the Linux VM, where the specific log files will be generated.


2. If you see errors related to mismatched input, adjust the Camel JDBC properties as mentioned in the DP Agent installation instructions: the “DELIMIDENT” value should be set to “BACKTICK”.
2022-12-19 16:31:57,552 [ERROR] [f6053d30-7fcd-436f-bed8-bf4d1358847788523] DefaultErrorHandler | CamelLogger.log [] - Failed delivery for (MessageId: 2567788B8CF359D-0000000000000000 on ExchangeId: 2567788B8CF359D-0000000000000000). Exhausted after delivery attempt: 1 caught: java.sql.SQLException: [Databricks][DatabricksJDBCDriver](500051) ERROR processing query/statement. Error Code: 0, SQL state: org.apache.hive.service.cli.HiveSQLException: Error running query: org.apache.spark.sql.catalyst.parser.ParseException: mismatched input '"FUNCTION_ALIAS_PREFIX_0"' expecting {<EOF>, ';'}(line 1, pos 33)

== SQL ==

SELECT SUM("confirmed_cases") AS "FUNCTION_ALIAS_PREFIX_0", SUM("deaths") AS "FUNCTION_ALIAS_PREFIX_1", "covidsummarygcs$VT"."county_name" FROM "default"."covidsummarygcs" "covidsummarygcs$VT" GROUP BY "covidsummarygcs$VT"."county_name"
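For reference, once DELIMIDENT is set to BACKTICK, the same statement is pushed down with backtick-quoted identifiers, which Spark SQL can parse. The line below is only an illustration, not captured output:

SELECT SUM(`confirmed_cases`) AS `FUNCTION_ALIAS_PREFIX_0`, SUM(`deaths`) AS `FUNCTION_ALIAS_PREFIX_1`, `covidsummarygcs$VT`.`county_name` FROM `default`.`covidsummarygcs` `covidsummarygcs$VT` GROUP BY `covidsummarygcs$VT`.`county_name`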

 

3. If you see a communication link failure, then the IgnoreTransactions parameter has not been set in your JDBC URL.

         


As mentioned before, data federation is enabled through the JDBC connector as of now, but things will change with additional connection options in the future. Hopefully this helped you understand the data federation capabilities with Databricks. Please do share your feedback. In case of connection issues, feel free to comment, and I will try to help you out.

 
19 Comments
werner_daehn
Active Contributor
So in essence, there is nothing new. Not even a new SDI Adapter.

The SDI Agent has existed for nine years, the CamelJDBCAdapter for seven.

The DP Agent is not even provided by Databricks; each customer must install their own in a separate container image.

Note: Contrary to what is commonly said, one DP Agent can serve multiple HANA instances and connect to different systems.

 

https://help.sap.com/docs/HANA_SMART_DATA_INTEGRATION/7952ef28a6914997abc01745fef1b607/598cdd48941a4...

 
Vivek-RR
Product and Topic Expert
Hi Werner

Thanks for your feedback.
As mentioned in my blog, I have explained how to establish a connection between SAP Datasphere (initial release) and Databricks as of now.
But there will be additional ways of integrating in the future, as mentioned in our corporate blog. Hope this clarifies.
martinfrick
Product and Topic Expert
Thank you vivek.rr for taking the time to share your technical Datasphere knowledge with the community. Your blog post is very informative and well-written! Highly Appreciated!
Vivek-RR
Product and Topic Expert
Thanks martinfrick  for your feedback!
werner_daehn
Active Contributor
Regarding
there will be additional ways of integrating in the future, as mentioned in our corporate blog.

I have read that now twice but cannot find anything in the text pointing in this direction. What paragraphs did you have in mind?
Vivek-RR
Product and Topic Expert
Hi Werner

You can check the article as highlighted below


julienleloup
Explorer
Hey vivek.rr ,

 

Thanks a lot for this blog! We're trying to set up something similar, and your walkthrough has helped us a lot!

I do have one question regarding the user credentials. In your walkthrough, you mention you get a personal access token from the DBX user settings, but it's unclear to me what you do with this. Do you paste this in the JDBC url in the PWD parameter? And if so, which credentials do you then use in the end to log in?

 

thanks in advance.

 

Regards,

Julien
Vivek-RR
Product and Topic Expert

Hi Julien

Thanks for your feedback. If you look at the last parameter, that's where you will add the token. You will just add the token there and NOT the user ID. UID will be hardcoded as "token", as you see below.

jdbc:databricks://<youraccount>.cloud.databricks.com:443/default;LogLevel=5;LogPath= <foldername>;IgnoreTransactions=1;transportMode=http;ssl=1;httpPath=sql/protocolv1/o/<SQLPath>;AuthMech=3;UID=token;PWD=<token>

And when you log in from SAP Datasphere, just provide the Databricks email as the username. The password can be the actual Databricks password or the token. Hope this helps.

 

dsun
Discoverer
Hi vivek.rr

 

Thanks a lot for such a detailed explanation of data federation between SAP Datasphere and Databricks.

I have a quick question: you highlighted one bullet point in a post below:

  • Databricks and SAP will deliver bi-directional integration between SAP Datasphere—with SAP data’s complete business context—with its Data Lakehouse platform on any cloud platform


It mentions 'bi-directional', but in your post, SAP Datasphere consumes Databricks data. Do you know whether there is any update on Databricks consuming SAP Datasphere data? Where can I find related materials or documentation? Thanks in advance.

 

Regards,

David
Vivek-RR
Product and Topic Expert

Hi @dsun, you can now replicate to cloud storage using premium outbound integration with replication flows, and the cloud storage can be read directly by Databricks using Auto Loader. Check this blog series. You can also read/write using FedML/JDBC. I am almost done with a blog on this topic with in-depth technical details and will share it shortly.

virendevi_2
Discoverer

Thanks for the blog. Can you please advise whether we can change the number of fetched records per block via the JDBC string?

jdbc:databricks://<youraccount>.cloud.databricks.com:443/default;LogLevel=5;LogPath= <foldername>;IgnoreTransactions=1;

@Vivek-RR 

Vivek-RR
Product and Topic Expert

@virendevi_2 Thanks for reaching out. Yes, you can try the parameter RowsFetchedPerBlock. As mentioned in the document, please also check from the performance perspective if you are going to increase the default value. You can see all the connector configuration options when you download the JDBC driver; the configuration PDF is under the docs folder.
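For illustration only (the value below is an assumption, not a recommendation), the URL from this blog would then look like:

jdbc:databricks://<youraccount>.cloud.databricks.com:443/default;IgnoreTransactions=1;transportMode=http;ssl=1;httpPath=<SQLPath>;AuthMech=3;UID=token;PWD=<token>;RowsFetchedPerBlock=20000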


 

kavi12
Discoverer

Hi Vivek,

I have added the Databricks cluster URL as mentioned in the document and am using "token" as the username and the generated token as the password. But I am getting the error "Remote Tables: internal error: Cannot get remote source objects: Could not get tables, Code: 403, SQL State: HY000". Your advice will help immensely.

 

DP Agent log:

[2024-01-26 18:31:59] [error] [ 5072] apxServiceControl(): dwState(4) != dwCurrentState(1); dwWin32ExitCode = 0, dwWaitHint = 2000, dwServiceSpecificExitCode = 0
[2024-01-26 18:31:59] [error] [ 5072] The service did not start due to a logon failure.
[2024-01-26 18:31:59] [error] [ 5072] apxServiceControl(): returning FALSE
[2024-01-26 18:31:59] [error] [ 5072] The service did not start due to a logon failure.
[2024-01-26 18:31:59] [error] [ 5072] Failed to start service 'SAP_HANA_SDI_Agent_Service_Daemon_Agentt'.
[2024-01-26 18:31:59] [error] [ 5072] The service did not start due to a logon failure.
[2024-01-26 18:31:59] [error] [ 5072] Apache Commons Daemon procrun failed with exit value: 5 (failed to start service).
[2024-01-26 18:31:59] [error] [ 5072] The service did not start due to a logon failure.

 

Thanks,

Kavi

Vivek-RR
Product and Topic Expert

Hi @kavi12

Could you share the URL that was used to connect from the Datasphere side? Please share the whole URL.

kavi12
Discoverer

@Vivek-RR Thanks a lot for the swift response.

Here are the URL combinations I tried.

jdbc:databricks://dbc-b17fa12e-0815.cloud.databricks.com:443/default;transportMode=http;ssl=1;httpPath=sql/protocolv1/o/3502160834700658/0224-050807-9tuoxcuc;AuthMech=3;IgnoreTransactions=1;UseNativeQuery=0

jdbc:databricks://dbc-b17fa12e-0815.cloud.databricks.com:443/default;IgnoreTransactions=1;transportMode=http;ssl=1;httpPath=sql/protocolv1/o/3502160834700658/0224-050807-9tuoxcuc;AuthMech=3;UID=token;PWD=<token>

Thanks

 

 

marcelo_bruno
Discoverer

Hi, @Vivek-RR!

 

Thanks so much for your blog!!

We're trying the configuration but are facing an issue with the connection. When Datasphere fetches the preview data, it shows a JDBC error, as in the log below:

Caused by: com.databricks.client.support.exceptions.ErrorException: [Databricks][JDBCDriver](500051) ERROR processing query/statement. Error Code: 0, SQL state: 42601, Query: SELECT TOP***, Error message from Server: org.apache.hive.service.cli.HiveSQLException: Error running query: [PARSE_SYNTAX_ERROR] org.apache.spark.sql.catalyst.parser.ParseException:
[PARSE_SYNTAX_ERROR] Syntax error at or near '1000'. SQLSTATE: 42601 (line 1, pos 11)

 

== SQL ==
SELECT TOP 1000  `centro_dbrocks$VT`.`id`, `centro_dbrocks$VT`.`armazem`, `centro_dbrocks$VT`.`centro_eva`, `centro_dbrocks$VT`.`codigo`, `centro_dbrocks$VT`.`codigo_lote`, `centro_dbrocks$VT`.`diretoria`, `centro_dbrocks$VT`.`gerencia_executiva`, `centro_dbrocks$VT`.`gerencia_geral`, `centro_dbrocks$VT`.`gestao_estoque`, `centro_dbrocks$VT`.`participacao_pb`, `centro_dbrocks$VT`.`tipo_processo`, `centro_dbrocks$VT`.`usuario` FROM `pgea`.`centro` `centro_dbrocks$VT`


We tried to use DBeaver, and the SELECT TOP also shows an error.

Is there any configuration to fix this?


Thanks in advance,
Marcelo Bruno.

yaw_mak2
Employee

Hi Marcelo,

Are you using Unity Catalog in Databricks? If so, you need to pass that along in the connection (JDBC URL): ConnCatalog=<catalog name>. For example:

...;UseNativeQuery=0;ConnCatalog=catalog_abc

Regards,

Yaw

Vivek-RR
Product and Topic Expert

Hi @marcelo_bruno, thanks for your feedback. I still see this as an error with the SQL sent to Databricks. Did you get additional logs? I would go with what @yaw_mak2 said about the parameter UseNativeQuery=0. This means "The connector transforms the queries emitted by applications and converts them into an equivalent form in HiveQL". Did you already use it in your JDBC URL?

 

marcelo_bruno
Discoverer

Thanks @Vivek-RR and @yaw_mak2. The parameter UseNativeQuery = 0 was the problem!!!