Integration of SAP ERP Data into a Common Data Model
In the past, data was often stored in differing formats across an enterprise, leading to data integration challenges. To unify data formats, Microsoft, SAP and Adobe have agreed to pursue an Open Data Initiative. Part of this initiative is to develop a Common Data Model (CDM).
The purpose of the CDM is to store information in a unified shape, which consists of data in CSV or Parquet format along with JSON metadata files that describe it. In the Microsoft landscape the CDM is already supported and implemented in different products and services such as Dynamics 365, PowerApps and Power BI, which leverage the Common Data Service. Many standard entities have been defined which offer structural and semantic consistency and simplify data integration into CDM structures. These benefits are often missed when extracting plain SAP tables on their own. The standard CDM entities can be browsed here: https://microsoft.github.io/CDM.
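To illustrate the unified shape, a minimal CDM folder pairs the data files with a manifest document that lists the entities. The following sketch builds such a manifest; the entity and file names ("Sales", "sales.csv") are made-up examples, not standard CDM entities:

```python
import json

# Illustrative sketch of a minimal CDM manifest. It sits next to the
# CSV/Parquet partitions in the same folder, e.g.
#   /cdm/Sales/sales.csv
#   /cdm/default.manifest.cdm.json
manifest = {
    "manifestName": "default",
    "entities": [
        {
            "type": "LocalEntity",
            "entityName": "Sales",
            "entityPath": "Sales.cdm.json/Sales",
        }
    ],
}

print(json.dumps(manifest, indent=2))
```

Consumers such as the Spark-CDM-Connector or Power BI start from this manifest file to discover the entities and their data partitions.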
However, there is no one-click solution offered by SAP to integrate ERP or BW data into a CDM structure and thereby conveniently combine it with services and other CDM-based data, such as Dynamics 365. In this blog post I close this gap and outline the necessary steps to (1) extract SAP ERP data into a CDM format and (2) consume this data conveniently with machine learning services such as Databricks, data warehousing applications such as Synapse Analytics, and Power Platform applications such as Power BI for data visualisation (Figure 1.1).
For the extraction of SAP objects I choose the Operational Data Provisioning (ODP) framework. The ODP framework offers many advantages for extraction scenarios: the extractors are available by default, support delta loads, provide a harmonised extraction layer across multiple applications (ERP/BW), and extract business objects instead of plain tables. Please contact initions for further details on SAP ODP extraction scenarios.
2. ERP Data Extraction using SAP Data Intelligence
SAP Data Intelligence is the next generation data integration tool from SAP and already widely used as a machine learning and big data tool (example blog for getting started). The installation of Data Intelligence is possible on an Azure Kubernetes cluster and upon completion, the tool is accessible via a browser (Figure 2.1).
The ERP data extraction is accomplished by (1) creating a connection to an ERP or BW system, (2) extracting the data by using standard ODP extractors, and (3) writing the data into CDM format on an Azure Data Lake Gen 2.
2.1. Add connections
The first step is to enable communication with your SAP ERP system, the source, and with an Azure Data Lake Gen 2, the destination. These two connections can be created in the Connection Manager. An example of creating an ABAP connection via RFC to the ERP system is shown in Figure 2.2.
Establishing a connection to an Azure Data Lake Gen 2 is analogous (Connection Type ADL_V2).
2.2. Preparations of the ERP system
In order to use SAP Data Intelligence operators and functionalities, the ERP system needs to be updated by applying a few SAP Notes. This process depends on a few factors, such as the ERP version (S/4HANA or classic ERP with NetWeaver) and the installed support packages.
2.3. Metadata catalogue
Once a connection is established and the prerequisites are installed on the ERP side, the ODP extractors can be viewed with the Metadata Explorer. Choose Browse Connections, select your ABAP ERP connection, and then the ODP objects (BW and SAPI), Tables and Views will be accessible (Figure 2.3).
Drill down to the ODP SAPI object you would like to extract. In this example I extract material data (MARA and related tables) by accessing the two ODP objects 0MATERIAL_ATTR and 0MATERIAL_TEXT. The ODP object metadata alongside a data preview is shown in Figure 2.4 and Figure 2.5.
2.4. ODP extraction and CDM creation
The workflow for generating CDM entities from ODP objects is realised with several different operators in the Data Intelligence Modeler. This workflow (or graph, in SAP DI terminology) is shown in Figure 2.6. Different operators are used to (1) read data from a parameterised ODP object (ABAP ODP Reader), (2) convert the data from ABAP types to strings (ABAP Converter), (3) map the data into CDM format (please get in touch with initions to learn further details about the Python script), (4) write the CDM data (Write CSV) and the describing metadata (Write JSON) to the Azure Data Lake, and (5) finally terminate the workflow. The workflow is executed in a parameterised fashion for each ODP object that is loaded into the CDM. In addition, the delta capabilities of the ODP extractors enable scheduled executions that keep the CDM up to date. Part of the CDM manifest, alongside the metadata for the ODP object 0MATERIAL_ATTR, is shown in Figure 2.7 a) and b).
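As a rough, hypothetical sketch of what the metadata mapping in step (3) can look like: the ODP field metadata delivered by the ABAP ODP Reader is translated into CDM attribute descriptions. The type mapping and field names below are illustrative assumptions, not the actual initions script:

```python
import json

# Illustrative mapping from ABAP data dictionary types to CDM data formats.
# This table is an assumption for the sketch, not a complete mapping.
ABAP_TO_CDM = {"CHAR": "string", "NUMC": "string", "DEC": "decimal", "DATS": "date"}

def odp_fields_to_cdm_attributes(fields):
    """fields: list of dicts like {"name": "MATNR", "type": "CHAR"},
    as one might receive them from the ODP metadata."""
    return [
        {"name": f["name"], "dataFormat": ABAP_TO_CDM.get(f["type"], "string")}
        for f in fields
    ]

# Two example fields from 0MATERIAL_ATTR: material number and creation date.
attrs = odp_fields_to_cdm_attributes(
    [{"name": "MATNR", "type": "CHAR"}, {"name": "ERSDA", "type": "DATS"}]
)
entity = {"entityName": "MATERIALATTR", "attributes": attrs}
print(json.dumps(entity, indent=2))
```

The resulting entity description is what the Write JSON operator would then persist next to the CSV data on the Data Lake.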
3. Consuming the CDM
Multiple services support the CDM format. These include tools such as Databricks for machine learning scenarios, Synapse Analytics for data warehousing purposes, and Power Platform applications such as Power BI for data visualisation. An introduction to loading CDM files into these tools follows below. In Figure 1.1 the SAP Data Warehouse Cloud was mentioned as a consumer; this only applies to the CSV files. For further information on SAP DWH Cloud, please see this blog post: Loading the data from Microsoft Azure into SAP HANA Cloud, Data Lake.
3.1. Databricks
Databricks has evolved into a preferred tool for data scientists, since it offers a convenient environment, integrates easily with the Microsoft stack and is available on Azure. Importing data in the CDM format into Databricks offers the SAP semantics and structures conveniently to data scientists who are otherwise not fluent in the SAP domain. Reading data in the CDM format is possible by using the Spark-CDM-Connector library, which can be installed on clusters in Databricks. The GitHub repository can be found here: https://github.com/Azure/spark-cdm-connector.
To load a CDM entity, create a cluster (e.g. Databricks Runtime Version 6.6, since Spark 3.0 is not supported at the time of writing). Then install the Spark-CDM-Connector from Maven, as shown in Figure 3.1.
After connecting to your Data Lake, the data can be read into a data frame using the following Python commands:

readDf = (spark.read.format("com.microsoft.cdm")
    .option("storage", storageAccountName)
    .option("manifestPath", container + "/Path/default.manifest.cdm.json")
    .option("entity", "MATERIALATTR")
    .option("appId", appid)
    .option("appKey", appkey)
    .option("tenantId", tenantid)
    .load())

readDf.select("*").show()
3.2. Azure Synapse
The CDM format has recently become accessible to dataflows in Azure Data Factory. Since Data Factory capabilities are included in Synapse Workspaces, this functionality can be used to load CDM data into a table of a Synapse database in the form of a SQL pool. A minimal dataflow example is shown in Figure 3.2.
3.3. Power BI
The functionality of consuming a CDM is currently in preview (October 2020 version) in Power BI. To load a CDM, select Get Data and choose Azure Data Lake Storage Gen2 (Figure 3.3). Subsequently, enter the URL of the Data Lake (https://<datalakeName>.dfs.core.windows.net) and choose CDM Folder View. Upon entering your Data Lake credentials, the CDM entities can be selected for import (Figure 3.4).
Now the CDM data are imported into Power BI with the correct data types and are ready for visualisation or integration with additional data (Figure 3.5).
4. Conclusion and Outlook
The described tools are currently undergoing development and many of the CDM connectors are only available in preview mode. Therefore, there are still a few limitations that will be resolved in future releases.
For example, SAP Data Intelligence, traditionally used for machine learning scenarios, is evolving into an ETL tool with diverse data loading capabilities. A current limitation, for instance, is that CSV files have to use a comma as a delimiter.
The CDM format does allow relationships to be defined between entities. However, the relationships must be single key (mapping one column from table A to a column in table B), whereas composite keys are common in SAP systems. The CDM model is currently being extended to support this capability (https://github.com/microsoft/CDM/issues/192).
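Until that lands, a common workaround is to build a single surrogate key. The sketch below concatenates the components of a composite key into one column, which can then be used in a single-key CDM relationship; the column names (MANDT, MATNR) and separator are chosen for illustration:

```python
# Hypothetical workaround sketch: CDM relationships map a single column,
# so a composite SAP key (here client MANDT plus material number MATNR)
# is merged into one surrogate key column on both related entities.
def composite_key(row, key_fields, sep="|"):
    """Concatenate the values of key_fields into one surrogate key."""
    return sep.join(str(row[f]) for f in key_fields)

material = {"MANDT": "100", "MATNR": "MAT-1", "MTART": "FERT"}
material["MATERIAL_KEY"] = composite_key(material, ["MANDT", "MATNR"])
print(material["MATERIAL_KEY"])  # "100|MAT-1"
```

The same surrogate column would have to be added on the other side of the relationship so that the single-key mapping still identifies rows uniquely.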
Despite these limitations, the advantages of CDM can already be tapped, thereby minimising data integration challenges. In this blog I demonstrated how this can be achieved by exporting SAP data into a CDM format and reusing it across multiple tools like Power BI, Databricks and Synapse. With continuous development and refinement of these tools, there is more to come, such as generating SAP-specific CDM folder structures and joining SAP and Dynamics data models.
*Special thanks to Roman Broich for reviewing this blog post.
Amazing proof of concept. I like the idea of CDM and Open Data Initiative. Appreciate you sharing step by step non SAP flows as well.
SAP is sometimes referred to as a closed ecosystem due to its proprietary protocols, though that statement is not entirely true. Besides, a lot has changed in the past couple of years and the pace of enhancement has increased recently, to the extent that it sometimes gives an overwhelming feeling and a sense of excitement at the same time 🙂
Although openness vs controlled ecosystem will always remain a topic of debate.
If possible, could you add the list of SAP Notes implemented on the ECC side as a reference?
Are you using the attributes sent by ODP v2 to create the CDM JSON with metadata?
Yes, the attributes can be extracted with Data Intelligence. There is an ABAP reader which can accomplish this. Then the metadata need to be processed, which I achieve with a Python script.
At the moment we don't have S/4HANA and therefore, I think, no Data Intelligence. We do have SAP BW version 7.5. Is there a way to transform to Microsoft CDM with your Python script without Data Intelligence? Or do you have other solutions?
SAP Data Intelligence is a standalone product and not part of S4HANA.
Apart from Data Intelligence, there are other solutions to extract SAP data from a BW system that we routinely implement. Writing the data into CDM format is possible as well. We do have workflows and Python scripts at hand. You may contact me or initions directly to discuss further.
How is building a model in SAP different from building a model on GCP, AWS or Azure? If SAP is hosted in the cloud on GCP, AWS or Azure, what is my advantage in ETLing data to a data lake in the cloud versus doing ML within SAP?
If all your data is within SAP systems, then the downstream processing (e.g. data science, ML, reporting, etc.) can be achieved within the SAP ecosystem. However, today's enterprise data landscapes are often heterogeneous and data need to be joined across different systems. For that reason a data platform approach (e.g. a central data lake on Azure, AWS or GCP) is often preferred. Then the SAP data need to be made accessible, which I outline in the blog.