SAP Data Hub, a Creative Solution for Big Data Problems
Enterprise Information Management is for Traditional Analytics
To understand what SAP Data Hub does, we first need to examine how the information management market has evolved over the past decades. Traditionally, most organizations have built and maintained a centralized data warehouse environment with the goal of establishing a single version of the truth. In addition, you may want to implement a metadata management solution to capture and maintain the definitions of the data entities and the relationships among them, or to trace lineage and perform impact analysis to ensure data quality. Over time, organizations end up using separate tools for traditional Enterprise Information Management (EIM), because each tool was introduced as a new use case emerged.
In almost all cases, EIM for the traditional analytics process is implemented in a purely on-premise environment with mainly structured data. You would use an E-T-L tool to get the data from various enterprise sources into a usable state before loading it into the corporate data warehouse for Business Intelligence (BI) reporting and analytical decision-making. The E-T-L process involves the following tasks:
- extract the data from one or more sources (e.g. ERP, CRM, RDBMS, legacy systems, etc.)
- feed the collected data to a batch processing engine for cleansing and transformation
- load the results into a Data Warehouse to run reports or BI analytics.
All of the above data processing happens in the middleware layer. Could you imagine adopting this process to run complex computations (e.g. machine learning, multimedia processing, etc.) on petabytes of disparate data distributed across networks? Moving that volume of data into an ETL application server for complex data processing is asking for trouble. Simply put, the traditional E-T-L approach is not a good fit for machine learning scenarios in which big data resides in distributed environments.
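The three E-T-L steps listed above can be sketched in a few lines of Python. This is only an illustration of the traditional pattern, not SAP code: the sample source records, the `extract`, `transform`, and `load` function names, and the in-memory SQLite database standing in for the data warehouse are all assumptions made for the example.

```python
import sqlite3

# Hypothetical source records standing in for ERP and CRM extracts.
ERP_ROWS = [{"customer": " Alice ", "revenue": "120.50"},
            {"customer": "bob", "revenue": "80"}]
CRM_ROWS = [{"customer": "Carol", "revenue": None}]


def extract():
    """Pull every record from every source into the middleware layer."""
    return ERP_ROWS + CRM_ROWS


def transform(rows):
    """Cleanse in one batch: drop incomplete rows, normalize names and types."""
    return [(r["customer"].strip().title(), float(r["revenue"]))
            for r in rows if r["revenue"] is not None]


def load(rows, conn):
    """Write the cleansed rows into the warehouse table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (customer TEXT, revenue REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()


warehouse = sqlite3.connect(":memory:")  # stand-in for the data warehouse
load(transform(extract()), warehouse)
total = warehouse.execute("SELECT SUM(revenue) FROM sales").fetchone()[0]
```

Note that every row, including the one that is later discarded, first has to travel to the process that runs `transform`. With a handful of records that is harmless; with petabytes spread over several networks it becomes the bottleneck described above.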
As the number and variety of data sources keep growing, along with the proliferation of open source and cloud technologies used to connect to them, there is a growing need for better ways to manage and administer all this data and these tools.
A better way of managing data, tools, and people
In contrast, SAP Data Hub is an all-in-one data orchestration solution that integrates, orchestrates, processes, and governs any type and volume of data across your entire distributed landscape. It is designed to solve the data silo problem posed by complex landscapes, where you have tremendous amounts of disparate data stored in distributed locations (e.g. data lakes, cloud stores, enterprise applications, etc.) but struggle with data integration and data processing tasks.
The diagram below illustrates common problems that enterprises face in today’s increasingly diverse and distributed landscapes, and outlines the reasons why many enterprises fail in Big Data projects.
Many organizations find it extremely challenging to productize big data scenarios across the entire data landscape. First, you need to figure out what tools and resources are available, who needs to be involved, and how quickly you can get results. Then, you also need to consider other technical aspects, such as:
- Is the solution optimized for running complex computations on tremendous amounts of disparate data (e.g. video, pictures, audio, documents, etc.)?
- Can the application be integrated easily with open source and third-party solutions? Can it help me reduce the effort spent on point-to-point integration with these technologies?
- How can I implement the solution in a scalable way? Is the process reusable and repeatable?
- Can the solution be deployed in different environments (e.g. on premise, cloud and hybrid environments)?
- How open is the solution? Is it possible to bring in custom code?
- Does the solution provide out-of-the-box machine learning and data science support?
- Will the tool help encourage communication, collaboration, integration, and automation between IT Data Engineers and all big data professionals (Data Scientists, Business Analysts, Data Stewards, etc.)?
If the above questions sound familiar to you, then SAP Data Hub is your answer. The product was created to solve the challenges inherent in diverse systems with disparate data sources spread across multiple networks. Because SAP Data Hub is built on a cloud-native architecture, it can be deployed in any environment (e.g. on-premise, cloud, hybrid) and scales elastically to handle large data processing workloads.
So why SAP Data Hub?
A flexible solution that promotes cross-team collaboration
I think Lego is a great analogy to explain how SAP Data Hub differs from other EIM products. Everyone loves Lego bricks because they are fun to play with, let you express your creativity, and give you the freedom to build nearly anything. We used the same principle to create SAP Data Hub, a product designed to address the needs of all personas. Whether you are a data engineer responsible for converting raw data into a usable format, or a data scientist writing a new algorithm to retrieve valuable insights from data, you will find that SAP Data Hub provides all you need to unlock the possibilities of your data.
You have complete freedom and control over what you want to build. Think of each data operation as a Lego brick that can be combined in multiple configurations. The coolest part is that you can disassemble what you have created at any time and reassemble it to recombine data and logic in many different ways. The same concept applies to SAP Data Hub, where you can use operators to build complex data pipelines that support many types of use cases, such as:
- Create a total picture of your customer’s world by combining data from traditional systems (e.g. CRM, ERP, etc.) and other data from various digital sources (e.g. past orders, buying patterns, and social media use)
- Enable real-time analytics by streaming large volumes of messages from internet-enabled devices and analyzing the patterns using machine learning algorithms
- Develop an intelligent data warehouse by integrating and orchestrating data across the entire landscape (e.g. data lakes, enterprise applications, legacy systems, etc.) both on premise and in the cloud
- Operationalize and automate machine learning processes to support Data Science projects
With Lego bricks, you need a base plate to connect everything together. SAP Data Hub plays a similar role for team collaboration: it provides a centralized development platform on which all data professionals can collaborate closely on their big data projects. Because SAP Data Hub uses a pipelining approach to model data processing, every step along the way is automatically documented with visuals in the system. So, there are no more data silos. Instead, all teams work in sync and get to the data they need in a more appropriate way and in much less time.
Example of building a customer churn application with SAP Data Hub
Below is a real-life example showing how to build a reusable data pipeline for a Customer Churn Prediction application using SAP Data Hub. Since the product is designed to be flexible and highly modular, you can easily exchange any operator within the pipeline to support different business scenarios. In this example, you can simply replace the Python operator with an R operator if you decide to test the results with a different machine learning algorithm.
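The idea of exchanging one operator for another can be illustrated with plain Python. The sketch below is not the SAP Data Hub operator API; the `run_pipeline` helper, the operator functions, the scoring rules, and the sample customer records are hypothetical and only mimic the way a pipeline chains interchangeable steps.

```python
from functools import reduce

# Hypothetical customer records; field names are assumptions for the example.
CUSTOMERS = [
    {"id": 1, "months_active": 30, "complaints": 0},
    {"id": 2, "months_active": 2, "complaints": 3},
    {"id": 3, "months_active": 12, "complaints": 1},
]


def prepare(records):
    """Operator 1: derive a simple feature from the raw records."""
    return [{**r, "complaint_rate": r["complaints"] / r["months_active"]}
            for r in records]


def score_rule_based(records):
    """Operator 2a: flag churn risk when the complaint rate is high."""
    return [{**r, "churn_risk": r["complaint_rate"] > 0.5} for r in records]


def score_tenure_based(records):
    """Operator 2b: an alternative scoring step with the same input and output."""
    return [{**r, "churn_risk": r["months_active"] < 18} for r in records]


def select_at_risk(records):
    """Operator 3: keep only the IDs of customers flagged as at risk."""
    return [r["id"] for r in records if r["churn_risk"]]


def run_pipeline(data, operators):
    """Feed the output of each operator into the next one."""
    return reduce(lambda acc, op: op(acc), operators, data)


at_risk_a = run_pipeline(CUSTOMERS, [prepare, score_rule_based, select_at_risk])
at_risk_b = run_pipeline(CUSTOMERS, [prepare, score_tenure_based, select_at_risk])
```

Because both scoring functions accept and return the same record shape, swapping one for the other changes a single entry in the operator list while the preparation and selection steps stay untouched.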
Data consumption with SAP Data Hub goes beyond traditional OLAP analytical scenarios. In addition to intelligent applications, the end consumer can also be a fully automated process, such as:
- Triggering a new maintenance order after a fault detection
- Adjusting the schedule of a logistics hub due to an event occurring in the inbound containers
- Sending notifications for misplaced items on the shelves based on image recognition
Data Processing happens at the source
With its unique ability to push runtime execution to the data’s native location whenever possible, SAP Data Hub helps you avoid unnecessary data movement. Imagine you have petabytes of data stored across Amazon, Azure, Google Cloud, and on-premise applications. It would be a nightmare to use the traditional E-T-L approach to move this data into a central location and process it there using machine learning. Not only is it extremely costly to move the data out of the clouds, it is also painfully slow to do data science with a traditional data processing engine. In such a scenario, you need a scalable solution like SAP Data Hub to process the data within the respective cloud environment and move only the high-value data as needed. In practice, this is the only way to process workloads at large scale across a highly distributed environment.
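To make the contrast concrete, here is a minimal sketch that compares moving all raw records to a central place with filtering where the data lives and moving only the result. The location names, the sample readings, the threshold, and the helper functions are hypothetical; the code merely counts transferred records to illustrate the principle and says nothing about how SAP Data Hub implements it.

```python
# Hypothetical sensor readings held in three separate locations.
LOCATIONS = {
    "cloud_a": [21.0, 22.5, 98.0, 20.5],
    "cloud_b": [19.0, 101.5, 20.0],
    "on_premise": [22.0, 21.5],
}
THRESHOLD = 90.0  # readings above this value are the "high-value" records


def centralize_then_filter(locations):
    """Traditional approach: copy every record, then filter centrally."""
    moved = [(name, v) for name, values in locations.items() for v in values]
    anomalies = [(name, v) for name, v in moved if v > THRESHOLD]
    return anomalies, len(moved)


def filter_at_source(locations):
    """Push-down approach: filter in each location, move only the matches."""
    anomalies = []
    for name, values in locations.items():
        local_hits = [(name, v) for v in values if v > THRESHOLD]  # runs locally
        anomalies.extend(local_hits)  # only these records cross the network
    return anomalies, len(anomalies)


central_result, central_moved = centralize_then_filter(LOCATIONS)
pushdown_result, pushdown_moved = filter_at_source(LOCATIONS)
```

Both functions return the same two anomalies, but the first one transfers all nine readings while the second transfers only the two that matter.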
Best of SAP and yet open for other technologies
It is important to note that SAP Data Hub is not just for SAP solutions. On the one hand, SAP Data Hub integrates tightly with many SAP systems (e.g. SAP S/4HANA, SAP BW/4HANA, SAP HANA, SAP Data Services, and SAP LT Replication Server). On the other hand, it also embeds a diverse range of open source components such as Apache Spark and programming engines for Python, Go, and Java. SAP Data Hub is designed as a central place where your data professionals can orchestrate and model the flow of data movement, processing, and transformation across the entire landscape, end to end. The solution also provides fully automated monitoring capabilities to visually track and present the health of the connected systems and the status of the scheduled workflows (a.k.a. the flow of the data).
Enterprise-wide Governance for all data
As for governance, SAP Data Hub provides a unified information catalog of the connected sources. It can automatically crawl the metadata structures and profile the data to provide insights into its quality. Information about the data, such as location, attributes, quality, and sensitivity, can be gathered in a centralized repository without any manual effort on your end. You can search common data definitions and business rules seamlessly, regardless of how diverse and distributed the data sources in your landscape are. With SAP Data Hub, you can truly understand the characteristics of that data and gain more visibility into your widely geo-distributed data assets.
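What profiling a data set means can be illustrated with a small sketch. The `profile` function, the sample rows, and the chosen metrics (inferred types, null ratio, distinct count) are assumptions made for this example and are not taken from the SAP Data Hub catalog.

```python
# Hypothetical rows from a connected source.
ROWS = [
    {"customer_id": 1, "email": "a@example.com", "country": "DE"},
    {"customer_id": 2, "email": None, "country": "US"},
    {"customer_id": 3, "email": "c@example.com", "country": "DE"},
    {"customer_id": 4, "email": None, "country": None},
]


def profile(rows):
    """Return per-column metadata: inferred types, null ratio, distinct count."""
    columns = sorted({key for row in rows for key in row})
    result = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        present = [v for v in values if v is not None]
        result[col] = {
            "types": sorted({type(v).__name__ for v in present}),
            "null_ratio": (len(values) - len(present)) / len(values),
            "distinct": len(set(present)),
        }
    return result


catalog_entry = profile(ROWS)
```

In this toy data, the profile shows that half of the `email` values are missing, which is the kind of quality signal a catalog can surface before anyone builds a pipeline on top of the source.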
When to use SAP Data Hub?
In summary, SAP Data Hub is suitable for any of the following:
- Complex data landscapes
  - heterogeneous environments (e.g. data lake, enterprise apps, on-premises, in the cloud, hybrid)
  - distributed landscapes where the data tiers span across multiple data centers and multiple clouds
  - the need to ingest and process disparate data types (e.g. stream, event, semi-/structured, unstructured)
- Productize scenarios in a fast and scalable way
  - prebuilt operators exist for a wide range of external machine learning libraries (e.g. TensorFlow, R, Python, SAP Leonardo MLF, etc.)
  - custom code can be easily deployed in any programming language
  - workloads can be processed on a large scale (e.g. distributed computation engine via Vora, container-centric & serverless environment via Kubernetes, processing data closer to where it resides and moving only high-value data)
- A centralized development platform
  - for all data professionals (e.g. IT, business analysts, data scientists, etc.) to share data and collaborate closely on big data projects
  - automatically documenting every step of the way with visuals via a flow-based data pipeline
  - to bring everything together into a holistic view end-to-end: one tool for data integration, data operations, landscape management, metadata catalog, governance capabilities, and orchestration.
What to do if I still have questions?
If you have questions about the product or great ideas, please send them to firstname.lastname@example.org. We would love to hear from you!