SAP Data Hub and the evolution of Data Integration in Enterprise Landscapes
Over the past decades, the evolution of enterprise IT landscapes, along with their maturity, business contexts and objectives, has always brought essential demands for data integration and critical concerns about data governance. Enterprise IT is now at a turning point, where several disruptions are arriving almost simultaneously: the widespread adoption of IoT technologies, the rise of Data Science as a common IT practice, the availability of massive data assets and computational power that now enable large-scale use of machine learning, and the unprecedented distribution and scaling possibilities offered by cloud architectures. Together, these factors are leading to the advent of the Intelligent Enterprise. As you may have seen mentioned already, “the pace of change has never been this fast, yet it will never be this slow again.”
But what impact is this change having on data integration and data governance, and how are IT organizations dealing with it? My perception, based on what we see happening among large and medium enterprises right now, is that most organizations are simply not dealing with it yet: they are struggling to adapt their old practices and tools to their new needs, in what looks like a losing battle. For example, the most common frustration among data scientists is that they spend too much time and effort accessing and preparing data instead of working on the actual models and intelligence, as outlined in a survey by Rexer Analytics.
Despite these huge data integration and preparation efforts, the most common reason why data-driven innovation projects fail to move from experimentation to production is still the lack of accessibility, availability, quality and understanding of the various disparate data sets and sources. In other words, even though data integration and data governance are concerns as old as enterprise IT itself, they have never been so far from being solved and stable, and they have never been so critical to the outcome of every business innovation.
This is happening because, although integration and governance still have to address functional needs that are the same in every context, such as data movement, workflow orchestration or metadata management, these needs must translate and adapt to very different architectures, with different applicability and design focus, depending on the target integration domain. And what IT organizations are facing at this point in their history is, in every respect, the breakthrough of a brand-new integration domain.
We can represent this evolution in a simple diagram, depicted below: the degree of Enterprise Intelligence, i.e. the ability to take more intelligent business decisions and execute more intelligent business processes, is rapidly growing, fueled by the ability to exploit more diverse and disparate kinds of data, the key enabling asset.
First, the foundation of every enterprise architecture is enabling transactional applications to talk to each other, so that business processes are connected and their operations run smoothly. That is the application integration domain, which is usually addressed with service-oriented architectures and microservices architectures. This domain includes enabling technologies like enterprise service buses (ESB), business process management (BPM) and business rules management (BRM) engines. In past decades, the enabling technologies included managed file transfer (MFT) and enterprise application integration (EAI). Architectures addressing this domain are usually message-driven (at least in recent years), deal with structured data, and focus above all on transactional reliability and consistency. All communications take place between OLTP systems.
The following picture shows a simple example of an enterprise service bus integration flow, providing integration between two transactional applications:
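As a rough illustration of the message-driven style described above, the following minimal Python sketch routes one structured message from a sending application to a receiving one through a toy in-memory bus. Everything in it (the `MiniBus` class, the `order.created` topic, the order and invoice fields) is a hypothetical stand-in chosen for this example, not part of any SAP product or real ESB API.

```python
import json
import queue
from dataclasses import dataclass


@dataclass
class Message:
    """A structured business message travelling over the bus."""
    topic: str
    payload: dict


class MiniBus:
    """Toy stand-in for an ESB: routes messages by topic and applies a mapping."""

    def __init__(self):
        self._queue = queue.Queue()
        self._routes = {}  # topic -> (mapping function, receiver callable)

    def register(self, topic, mapping, receiver):
        self._routes[topic] = (mapping, receiver)

    def publish(self, message):
        self._queue.put(message)

    def run_once(self):
        """Deliver every queued message; return how many were delivered."""
        delivered = 0
        while not self._queue.empty():
            message = self._queue.get()
            mapping, receiver = self._routes[message.topic]
            receiver(mapping(message.payload))
            delivered += 1
        return delivered


def order_to_invoice(order):
    """Map the sender's order format to the receiver's invoice format."""
    return {
        "invoice_ref": order["order_id"],
        "amount": order["qty"] * order["unit_price"],
    }


billing_inbox = []  # hypothetical receiving application

bus = MiniBus()
bus.register("order.created", order_to_invoice, billing_inbox.append)
bus.publish(Message("order.created", {"order_id": "SO-1", "qty": 3, "unit_price": 10.0}))
delivered_count = bus.run_once()
print(json.dumps(billing_inbox))
```

The point of the sketch is the shape of the flow: the sender and receiver never call each other directly, and the format mapping lives in the bus rather than in either application.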
The other essential need in every enterprise architecture is to provide business intelligence and enterprise analytics, to enable better business decisions. This led to a completely different integration domain, analytics integration, which is addressed with technologies like extract, transform, load (ETL) tools, enterprise data warehouses (EDW), data marts etc. Architectures addressing this domain are usually database-driven, deal with structured data, and focus on data quality to enable reliable insights on the business intelligence pyramid. In this domain, communications take place from OLTP systems towards OLAP systems.
The following picture shows an example of a very simple ETL flow, moving data from two tables in two source operational databases towards a primary data warehouse:
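A flow of this kind can also be sketched in a few lines of Python. The example below is only an illustration under stated assumptions: it uses in-memory SQLite databases from the standard library in place of real operational systems and a real data warehouse, and the table names, columns and sample rows (`customers`, `orders`, `sales_by_customer`) are invented for the purpose.

```python
import sqlite3


def make_source(ddl, insert_sql, rows):
    """Create an in-memory operational database with one populated table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(ddl)
    conn.executemany(insert_sql, rows)
    conn.commit()
    return conn


# Two hypothetical operational (OLTP) sources, one table each.
crm = make_source(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, country TEXT)",
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, " Alice ", "de"), (2, "Bob", "US")],
)
erp = make_source(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)",
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(10, 1, 120.0), (11, 1, 80.0), (12, 2, 50.0)],
)

# Hypothetical target warehouse (OLAP side).
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE sales_by_customer "
    "(customer_id INTEGER, name TEXT, country TEXT, total REAL)"
)

# Extract: read the two source tables.
customers = crm.execute("SELECT id, name, country FROM customers").fetchall()
orders = erp.execute("SELECT customer_id, amount FROM orders").fetchall()

# Transform: cleanse names, normalize country codes, aggregate per customer.
totals = {}
for customer_id, amount in orders:
    totals[customer_id] = totals.get(customer_id, 0.0) + amount
rows = [
    (cid, name.strip(), country.upper(), totals.get(cid, 0.0))
    for cid, name, country in customers
]

# Load: write the cleansed, aggregated rows into the warehouse table.
warehouse.executemany("INSERT INTO sales_by_customer VALUES (?, ?, ?, ?)", rows)
warehouse.commit()

result = warehouse.execute(
    "SELECT customer_id, name, country, total "
    "FROM sales_by_customer ORDER BY customer_id"
).fetchall()
print(result)
```

Note how the data quality work (trimming names, normalizing country codes) happens in the transform step, which reflects the focus of this domain.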
Now, the advent of the Intelligent Enterprise is adding a major new milestone to this evolution: enterprise landscapes are evolving to produce more intelligent applications and deliver next-generation data-driven business processes. To achieve this, the key technological innovation is the ability to leverage advanced analytics, which are now fueled by data science, predictive algorithms, machine learning, and massive sets of very disparate kinds of data. Advanced analytical insights can now drive and influence the behavior of business processes. In other words, OLTP and OLAP are merging into hybrid transactional/analytical processing (HTAP). A hybrid domain can now factor in new kinds of data, like event streams coming from sensors, location data, social media feeds, images, videos, signals and other unstructured or semi-structured data.
This evolution is paving the way for a brand-new integration domain, the data orchestration domain, which merges disparate processing paradigms on disparate kinds of data, and extends and complements the previous domains. Architectures addressing this domain must be able to mix message-driven and database-driven flows, streaming and bulk, structured and unstructured data, and to do so across highly distributed landscapes spanning several cloud and on-premise locations. The main architectural focus in this domain is to produce timely and actionable insights, and adaptability is the key requirement.
The following picture shows an example of a heterogeneous pipeline, which gets images from a Hadoop data lake and processes them in Python, does some processing of streaming sensor data from Kafka in parallel, and then combines the outcomes and applies a predictive analytics algorithm in SAP HANA, for further use in application contexts:
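To make the shape of such a pipeline concrete, here is a minimal Python sketch with two branches running in parallel and a final scoring step. It is not SAP Data Hub code and uses no SAP, Hadoop or Kafka APIs: the data lake, the sensor stream and the predictive step are all replaced by hypothetical in-memory stand-ins (`read_images_from_lake`, `read_sensor_stream`, `score`), and the threshold rule in `score` is an arbitrary placeholder for a real predictive algorithm.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def read_images_from_lake():
    """Stand-in for reading image files from a Hadoop data lake."""
    return [
        {"machine": "m1", "pixels": [10, 20, 30]},
        {"machine": "m2", "pixels": [200, 220, 240]},
    ]


def image_branch():
    """Batch branch: derive one feature per image (mean brightness)."""
    return {img["machine"]: mean(img["pixels"]) for img in read_images_from_lake()}


def read_sensor_stream():
    """Stand-in for consuming sensor events from a Kafka topic."""
    yield from [
        {"machine": "m1", "temp": 60.0},
        {"machine": "m2", "temp": 95.0},
        {"machine": "m1", "temp": 62.0},
        {"machine": "m2", "temp": 97.0},
    ]


def sensor_branch():
    """Streaming branch: average temperature per machine."""
    readings = {}
    for event in read_sensor_stream():
        readings.setdefault(event["machine"], []).append(event["temp"])
    return {machine: mean(values) for machine, values in readings.items()}


def score(brightness, temperature):
    """Stand-in for the predictive step; a fixed rule, not a trained model."""
    return "at_risk" if brightness > 128 and temperature > 90 else "ok"


def run_pipeline():
    # Run both branches in parallel, then join their results per machine.
    with ThreadPoolExecutor(max_workers=2) as pool:
        images = pool.submit(image_branch)
        sensors = pool.submit(sensor_branch)
        brightness, temperature = images.result(), sensors.result()
    return {
        machine: score(brightness[machine], temperature[machine])
        for machine in sorted(brightness.keys() & temperature.keys())
    }


predictions = run_pipeline()
print(predictions)
```

The design choice worth noticing is that the batch branch and the streaming branch are independent until the join, which is what lets a pipeline engine run them on different processing engines and in different locations.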
The following table summarizes the three integration domains and their key differences:
In each of the domains mentioned above, the main functional needs and concerns to be addressed are still the same:
- Moving and transforming the data
- Orchestrating the data movements
- Governance and monitoring
But these functional needs are implemented with very different focuses and objectives in each domain.
At SAP, when it comes to application integration, we cover data movement and transformation with SAP Process Orchestration on premise, or with the SAP Cloud Platform Integration Service in the cloud. Orchestrating the data movements is done with the Business Process Management and Business Rules Management components of SAP Process Orchestration, or with the SAP Cloud Platform Workflow Service and Rules Service. SAP Master Data Governance can be leveraged for workflows that deal with creating, updating or deleting master data. Finally, governance and monitoring are provided by the SAP Process Orchestration Enterprise Services Repository and by SAP HANA Operational Process Intelligence.
For the analytics integration domain, we use SAP Data Services data flows to move and transform data, SAP Data Services workflows to orchestrate the data movement jobs, and SAP Information Steward and SAP Master Data Governance for governance and monitoring. In SAP HANA-centric architectures, we can also leverage SAP HANA Smart Data Integration, SAP HANA Smart Data Access, and SAP HANA Smart Data Quality.
The newly emerging data orchestration domain, often referred to by terms like “Big Data Fabric”, “Holistic Data Management”, “DataOps”, “Enterprise Data Hub”, “Data-driven Integration” etc., is instead a very different ball game. At SAP, we decided to address this domain with a new product called SAP Data Hub. It was conceived from the beginning as a cloud-native architecture on Kubernetes and Docker, capable of covering distributed landscapes across multiple cloud providers and on-premise data centers. It is designed to streamline innovation projects across all sorts of data sources and data processing paradigms, with an open approach that natively supports all the most common processing engines and technologies, like Python, R, Spark, Google TensorFlow etc., as well as SAP HANA and SAP Leonardo.
SAP Data Hub addresses the need to move, transform, and process heterogeneous data with Data Hub pipelines. We orchestrate the pipelines with SAP Data Hub workflows, and provide end-to-end data governance with the SAP Data Hub metadata catalog, data discovery, and profiling capabilities. All of this comes in one single tool, across all kinds of data sources (event streams, images, videos, sensor data and signals, as well as data warehouses, data lakes and structured application data from backend systems), across any distributed landscape and hybrid deployment, and with any data processing engine.
It is crucial to note that the three domains do not replace each other and do not really overlap much. We do not foresee data pipelining as a way to replace the extraction scripts for a traditional BI stack, for example, just as we did not use ETL to replace transactional integration flows, because these tools are conceived to respond to very different business needs and architectural focuses.
In other words, SAP Data Hub does not interfere with the original use cases which are covered by ETL and ESB tools. Instead, SAP Data Hub helps enterprises address the new use cases related to enabling the Intelligent Enterprise that would not be solved with any traditional tool. For these use cases, many enterprises today are struggling to plumb together a lot of different software components, e.g. one for streaming data, one for batch transfers, one for message-driven integration, ending up once again in the “point-to-point architecture” nightmare they thought they had solved years ago. With SAP Data Hub, enterprises can now simplify, streamline and succeed in delivering the promise of intelligent applications and processes and advanced analytical insights across the heterogeneous enterprise landscape, with one single and coherent overarching product addressing the data pipelining, processing and governance concerns.
The days of data scientist frustration, and of innovation projects staying forever in their experimental stage, are now numbered.