SAP BW HANA to Databricks via SAP DI – I
The announcement of the partnership between SAP and Databricks has been well received by both communities, given that the two technologies are very widely used and have long needed better integration.
High-level end-to-end architecture and requirements
The aim was to extract data from SAP BW into Databricks, run highly compute-intensive calculations there, and bring millions of records (more than 200 million) back to SAP BW efficiently enough to meet tight reporting SLAs. In addition, the whole process needed to run fully automated, without any manual intervention. Options such as MuleSoft were evaluated, but after detailed analysis we concluded that, for data integration in the current architecture landscape, SAP Data Intelligence (DI) would serve the purpose better.
BW to Databricks Flow
How data is picked from BW and sent to Databricks:
The source of the data is SAP BW providers. Data is extracted from the providers via HANA calculation views, which is an efficient extraction path from the SAP Data Intelligence point of view.
SAP Data Intelligence acts as the middleware, pulling the data from SAP BW and sending it to the Azure landing (Bronze) layer.
In SAP BW, a custom program can be developed to automatically trigger the Data Intelligence graphs once the data load into the source providers completes. The program uses an HTTP connection to call the DI pipeline API (/app/pipeline-modeler/service/v1/runtime/graphs) to start the respective graphs.
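The BW-side program itself would typically be written in ABAP; the Python sketch below only illustrates the shape of the HTTP call it issues. The host, tenant, credentials, graph name, and the `src`/`name` payload fields are all assumptions based on typical DI usage, not taken from the original setup.

```python
import base64
import json

# All values below are illustrative placeholders -- substitute your own.
DI_HOST = "https://my-di-instance.example.com"  # hypothetical DI host
TENANT = "default"
USER = "di_user"
PASSWORD = "secret"

def build_graph_trigger_request(graph_name: str, run_name: str) -> dict:
    """Build the pieces of the HTTP POST that starts a DI graph.

    POSTing this body to /app/pipeline-modeler/service/v1/runtime/graphs
    starts a new run of the given graph. DI commonly expects basic auth
    credentials in the form tenant\\user.
    """
    credentials = f"{TENANT}\\{USER}:{PASSWORD}".encode()
    return {
        "url": f"{DI_HOST}/app/pipeline-modeler/service/v1/runtime/graphs",
        "headers": {
            "Authorization": "Basic " + base64.b64encode(credentials).decode(),
            "Content-Type": "application/json",
        },
        # "src" identifies the graph to run; "name" labels this run instance.
        "body": json.dumps({"src": "com.acme." + graph_name,
                            "name": run_name}),
    }

# The BW program would send this request (e.g. requests.post(**req))
# once the provider load completes.
req = build_graph_trigger_request("bw_to_bronze", "bw_load_run")
```

Keeping the request construction separate from the actual send makes the trigger logic easy to unit-test without a live DI tenant.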
On the target side, DI uses a standard operator to write into Azure Blob Storage.
Once the data lands in the blob, the Databricks Jobs API (/api/2.1/jobs/run-now) is called to start the workflow that loads the data all the way to the Gold layer automatically.
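A minimal sketch of that run-now call, assuming a hypothetical workspace URL, access token, job ID, and landing path; the `job_id` and optional `notebook_params` fields come from the documented Jobs 2.1 API, and the parameters can carry the landed file's path into the workflow:

```python
import json
from typing import Optional

# Placeholders -- substitute your own workspace, token, and job ID.
WORKSPACE_URL = "https://adb-1234567890123456.7.azuredatabricks.net"
API_TOKEN = "dapi-xxxxxxxx"  # a Databricks personal access token
BRONZE_TO_GOLD_JOB_ID = 12345  # hypothetical job loading Bronze -> Gold

def build_run_now_request(job_id: int,
                          params: Optional[dict] = None) -> dict:
    """Build the request for POST /api/2.1/jobs/run-now.

    `notebook_params` passes values (e.g. the blob path of the newly
    landed extract) into the triggered workflow.
    """
    body = {"job_id": job_id}
    if params:
        body["notebook_params"] = params
    return {
        "url": f"{WORKSPACE_URL}/api/2.1/jobs/run-now",
        "headers": {"Authorization": f"Bearer {API_TOKEN}",
                    "Content-Type": "application/json"},
        "body": json.dumps(body),
    }

# Example: start the Bronze->Gold workflow for a freshly landed extract.
req = build_run_now_request(
    BRONZE_TO_GOLD_JOB_ID,
    {"landing_path": "abfss://bronze@storageacct.dfs.core.windows.net/bw_extract"},
)
```

The caller that reacts to the blob landing (a DI operator or an Azure function, depending on the setup) would then POST this request to start the job.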
Once the data is in the Gold layer, the required calculations are triggered. If the Gold layer and the calculation layer sit in different workspaces, a notebook in the source job can be set up to trigger the calculation job/workflow via the same API (/api/2.1/jobs/run-now). After the calculations complete, the results are parked in a Delta table and the status tables (used by DI) are updated.
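For the cross-workspace case, the source-job notebook can trigger the calculation job with the same run-now call and then poll /api/2.1/jobs/runs/get until the run terminates before updating the DI status tables. A hedged sketch, assuming a hypothetical calculation-workspace URL; the `life_cycle_state`/`result_state` fields are the documented shape of the runs/get response:

```python
from typing import Optional

# Hypothetical URL of the separate calculation workspace.
CALC_WORKSPACE_URL = "https://adb-2222222222222222.2.azuredatabricks.net"

def build_runs_get_request(run_id: int, token: str) -> dict:
    """Request for GET /api/2.1/jobs/runs/get to poll a triggered run."""
    return {
        "url": f"{CALC_WORKSPACE_URL}/api/2.1/jobs/runs/get?run_id={run_id}",
        "headers": {"Authorization": f"Bearer {token}"},
    }

def run_finished(runs_get_response: dict) -> Optional[bool]:
    """Interpret the `state` block of a runs/get response.

    Returns True on success, False on failure, and None while the run
    is still pending or running (i.e. keep polling).
    """
    state = runs_get_response.get("state", {})
    if state.get("life_cycle_state") != "TERMINATED":
        return None
    return state.get("result_state") == "SUCCESS"
```

The notebook would loop with a short sleep, calling runs/get until `run_finished` returns non-None, and only then write the success/failure entry into the status tables that DI reads.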
Information on the Databricks-to-BW flow can be found here.