SAP Datasphere & Partnerships – Databricks
Together with SAP Datasphere, SAP launched new partnerships to complete the picture of analytical data management and the idea of a Business Data Fabric.
Fig1: SAP Datasphere & Partnerships – Source: SAP, 2023 (slightly adapted)
I already gave an impression of the new partners in my blog “SAP Datasphere – Q&A and Partnerships” and wrote a similar blog on the partnership with Collibra. As I want to focus on Databricks here, let me repeat what I wrote there:
Databricks was founded by the creators of Apache Spark, and they have built a complete ecosystem around it, delivering the Data Lakehouse based on a multi-cloud strategy similar to SAP’s. In simple terms, a Data Lakehouse can be understood as bringing together modern file formats (Apache Parquet (Databricks), Apache ORC, Apache Avro) and open table formats (Delta Lake (Databricks), Apache Iceberg, Apache Hudi) with a powerful, distributed query engine to process the data (Photon in the case of Databricks).
The advantage of a Data Lakehouse is to have all kinds of data (structured, semi-structured, unstructured) in one tier and to serve all your data roles (data engineers, data analysts, data scientists, BI modelers and so on) from this tier.
Databricks coined the term “Data Lakehouse” and is the top partner in this area, even if other providers offer Data Lakehouse technologies, too.
An interesting technical perspective on the interplay of SAP Datasphere and Databricks can be found in the blog “Unified Analytics with SAP Datasphere & Databricks Lakehouse Platform – Data Federation Scenarios” by Vivek RR.
I recently had some discussions about what Databricks means for SAP Data Warehouse experts and especially for SAP Datasphere. Will Databricks now become the new SAP Data Warehouse approach? Doesn’t Databricks do the same as SAP Datasphere? Should I learn more about Databricks and upskill?
So here are my thoughts on these questions.
Know the data architecture patterns
First we have to differentiate between the architecture patterns we see here.
SAP Datasphere was formerly SAP Data Warehouse Cloud, available since the end of 2019. So basically it is still a cloud-based Data Warehouse built on SAP HANA Cloud technology, including a Data Lake, Machine Learning and a Data and Business Layer supporting the Data Warehouse use case and different kinds of user groups.
SAP Datasphere is now evolving into a Business Data Fabric. Data Fabric is an approach to unify the access and management of all data within a company (in the best case) and make it more available. So Data Fabric can be seen as an approach to data democratization and breaking up data silos. SAP delivered new capabilities like the (Data) Catalog and the Replication Task, and announced new semantic capabilities. See here for the new capabilities.
On the other side, Databricks delivers a Data Lakehouse approach, an evolution of combined Data Lake/Data Warehouse approaches enabled within a single technical tier. I already analyzed SAP’s approach in my blog “Data Architecture with SAP – Data Lake” and came to the conclusion that SAP is not there at the moment but has its own useful approach with the SAP HANA Cloud Data Lake offering.
What Data Lakehouse means
In general there is a lot of movement in the cloud data management market, and the offerings are all currently evolving in the direction of becoming a kind of Data Lakehouse, coming from different directions and trying to find their differentiation:
Fig 2: Positioning and trend of popular cloud data platform offerings
We can go a little bit into detail to understand the different positions.
Solutions like the Databricks Data Lakehouse and also Google BigQuery were born in the cloud. From the beginning they made use of cloud capabilities like the separation of storage and compute for better scalability, with a shared storage architecture and distributed parallel processing. These solutions follow the schema-on-read approach, which is very popular in the data science field as it gives freedom in modeling the data. The data is served from object storage, so any kind of data, like unstructured texts, documents, pictures, videos or sound, can be stored easily.
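To make the schema-on-read idea concrete, here is a minimal, hypothetical sketch in plain Python (standard library only, no Spark): raw records land in the lake untyped and untouched, and a consumer applies its own schema only at read time. All names and data here are invented for illustration.

```python
import json
import io

# Raw events land in object storage as-is, with no schema enforced on write
# (schema-on-read). A StringIO stands in for a file on a data lake.
raw_events = io.StringIO(
    '{"user": "u1", "amount": "19.99", "note": "first order"}\n'
    '{"user": "u2", "amount": "5.50"}\n'          # "note" is simply missing
    '{"user": "u3", "extra": {"tags": ["a"]}}\n'  # nested, unplanned field
)

def read_with_schema(lines, schema):
    """Apply a schema only at read time: cast known fields, ignore the rest."""
    for line in lines:
        record = json.loads(line)
        yield {name: cast(record[name]) if name in record else None
               for name, cast in schema.items()}

# A consumer reads the raw data through its own schema.
billing_schema = {"user": str, "amount": float}
rows = list(read_with_schema(raw_events, billing_schema))
print(rows[0])  # {'user': 'u1', 'amount': 19.99}
```

A schema-on-write system would reject the second and third records at load time; here they are stored untouched, and every reader decides for itself what to extract and how to type it.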
They use parallel processing engines like Apache Spark and a highly optimized internal network to move and process data very fast.
For a long time now, file formats have been available that store structured data column-based and compressed (as HANA does) for better read performance. The step to the Data Lakehouse came with open table formats like Delta Lake for Databricks, which brought essential Data Warehouse capabilities like ACID transactions or row-level security to the data lake.
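To illustrate how an open table format brings ACID behavior to plain files, here is a heavily simplified, hypothetical sketch of the transaction-log idea behind Delta Lake (my simplification, not the real protocol): data files are immutable, and a versioned log decides which of them belong to the table.

```python
import json
import os
import tempfile

# Simplified transaction-log idea: a write only becomes visible once its
# log entry is committed atomically, which gives the data lake
# ACID-like snapshot reads.
table = tempfile.mkdtemp()
log_dir = os.path.join(table, "_log")
os.makedirs(log_dir)

def commit(version, added_files):
    """Commit a new table version by atomically publishing its log entry."""
    entry = os.path.join(log_dir, f"{version:020d}.json")
    tmp = entry + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"add": added_files}, f)
    os.replace(tmp, entry)  # atomic rename is the commit point

def snapshot():
    """A read sees exactly the data files of all committed versions."""
    files = []
    for name in sorted(os.listdir(log_dir)):
        if name.endswith(".json"):
            with open(os.path.join(log_dir, name)) as f:
                files += json.load(f)["add"]
    return files

commit(0, ["part-000.parquet"])
commit(1, ["part-001.parquet"])
print(snapshot())  # ['part-000.parquet', 'part-001.parquet']
```

Readers reconstruct a consistent snapshot purely from committed log entries, so a writer that crashes before its commit leaves the table unchanged. The real Delta Lake protocol works on the same principle but with considerably more machinery (checkpoints, optimistic concurrency, schema metadata).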
The idea of using Big Data (Hadoop) for Data Warehouse use cases and making SQL-like queries possible already came up with the development of Apache Hive in 2010. Today, strong parallel query engines like Photon for Databricks enable the Data Warehouse part of the Data Lakehouse.
On the other side we have solutions that evolved from on-prem Relational Database Management Systems (RDBMS), like Azure Synapse Analytics, Amazon Redshift or SAP HANA Cloud as the base of SAP Datasphere. These are today (typically) Massively Parallel Processing (MPP) Cloud Data Warehouses that started with a shared-nothing architecture like Hadoop, coupling storage and compute. For most of these solutions we have seen a decoupling over time to make use of cloud scalability and elasticity. SAP HANA is currently taking steps in this direction as well, which is a big move, as the big difference here is the in-memory based database. Through the SAP data pyramid (HANA Cloud, HANA Relational Data Lake and HANA Data Lake Files), we already see a kind of separation and the way SAP handles this evolution with its own technology.
I also recognized that the solutions coming from this direction have moved their storage to the data lake and largely enabled table formats like Apache Hudi or Apache Iceberg, bringing Data Warehouse capabilities to the Data Lake. This is not available for SAP HANA Cloud today. But with its in-memory based architecture, SAP IQ technology as a layer in between and SQL on Files for querying the Data Lake, SAP may have an alternative approach in place; however, I haven’t seen much practical usage of it in the market so far.
So much for the question of whether or not Databricks and SAP Datasphere do the same thing. They do in some parts, and they do not in others, as SAP has its own aspects, markets and customers.
While Databricks is evolving its data platform more and more and addressing new use cases, the focus is on physically managing all kinds of data, especially big data and machine learning workloads. The Data Warehouse is possible and even more, as Databricks brings us the Data Lakehouse.
On the other side, SAP Datasphere is optimized for SAP, delivering content and a strong business perspective. With the Data Fabric approach, it is not about having all data integrated for processing, but rather about steering the access to and orchestration of enterprise data independently of the system in which the data is stored.
As companies are becoming more and more data-driven, SAP is no longer the only solution for analytical data management. To accept that and to make the next move towards being the one face to the end user concerning data is a strategic move, as SAP already often plays this role in the world of transactional business systems.
Until now we had an understanding of what SAP Data Warehouse Cloud is, and we now have a vision of what SAP Datasphere shall become. SAP still has to deliver on what, as the German DSAG would maybe formulate it, is a “work in progress”.
Partnering with strong market players like Databricks makes sense, as we all know that implementing a vision like the one formulated with SAP Datasphere is hard and a long way to go. To embrace strong partners and enable integration with a broad non-SAP ecosystem, while also relying on one’s own strengths, is the right way from my perspective.
SAP should not expect that this just happens with new functionalities. The “Release of SAP Data and Analytics Advisory Methodology” is a smart move, as it becomes more important to understand the strategic use cases and also to paint a holistic picture of data and analytics for the customer, in order to find the right place for SAP within this picture. I hope we will see more partner use cases and partner reference architectures in the near future.
This is just my opinion and current perspective. I’m happy to hear how you see these new partnerships in the context of SAP Datasphere!