What Is SAP Data Hub? And Answers to Other Frequently Asked Questions
With the launch of SAP Data Hub on September 25, 2017, we know you have lots of questions. So, we have developed the following FAQ to provide answers to your most pressing questions. For those of you who want to dive deeper, you can find all the details on sap.com/datahub.
What is it?
SAP Data Hub is a data sharing, pipelining, and orchestration solution that helps companies accelerate and expand the flow of data across their modern, diverse data landscapes.
SAP Data Hub provides visibility and access to a broad range of data systems and assets; allows the easy and fast creation of powerful, organization-spanning data pipelines; and optimizes data pipeline execution speed with a “push-down” distributed processing approach at each step.
SAP Data Hub meets the governance and security needs of the enterprise, ensuring that appropriate policy measures are in place to meet regulatory and corporate requirements.
Why is this product necessary? What is the market need?
There is more data, and more ways to store and use it, than ever before. While this data holds business opportunity, corporate data landscapes are growing increasingly complex, and it is getting harder and costlier for organizations to understand the data they have, to work across all the different systems that need to use it, and to apply end-to-end governance in order to capture the maximum value.
Key Pain Points:
- Data is kept in silos (files, Hadoop, Data Warehouses, etc.) across the enterprise. Users can’t access and work with the data they need across the silos where it’s stored. In particular, it is complex, time consuming, and costly to connect Big Data with enterprise data and business processes to gain insight and value from it.
- End-to-end data governance required across complex landscapes: The need to manage and govern data across a landscape is well understood. Ensuring data lineage and impact analysis of changes, managing security and privacy requirements, etc. are all critical aspects of a trusted enterprise landscape. With the increased complexity of enterprise landscapes, which can now include Hadoop data lakes, EDWs, cloud storage, enterprise apps, etc., providing effective governance has become more difficult. Without end-to-end governance across all data sources, organizations cannot trust and rely on the data’s accuracy, creating risk for anyone using analytics or operational applications that depend on the data.
- Big Data technologies lack enterprise readiness: Businesses generally cannot solve the complexity of their landscape simply by storing all their data in a Hadoop data lake. Hadoop solutions, while powerful, often do not have the extent of governance and security measures that enterprises require. Data lakes often have limited governance for Big Data initiatives, little automation to schedule processing in the landscape, fragmented monitoring and tracing capabilities of individual technologies, and lack common security and access management.
- Currently available tools require high effort to productize data scenarios across the enterprise. Many integration tools today are point-to-point, require highly trained resources to execute, and involve a great deal of manual work. This makes it challenging to rapidly connect data and implement desired data outcomes.
- Specialized skill sets are often needed to implement, scale and create value out of Big Data initiatives. These specialized resources are often difficult to find and difficult to retain.
How is SAP Data Hub different from other offerings for integration, pipelining, or orchestration?
SAP Data Hub delivers a simpler, more scalable approach to data landscape management.
With enterprise-spanning data integration, processing, and governance, SAP Data Hub provides unprecedented visibility into and access across the complex network of data in the modern enterprise. By providing a broad, detailed, and easily understood view of the entire data landscape, from sources like Hadoop and Amazon S3 to SAP HANA and ERP, SAP Data Hub helps organizations deeply understand data sources, uses, interconnections, quality, and impacts. This allows enterprises to see new opportunities from data, resolve emerging data issues, and ensure that data is flowing to where it needs to go.
SAP Data Hub accelerates and expands your data projects by easily and quickly creating powerful data pipelines in a single, visual design environment
In a single design environment, data stewards can easily and quickly create powerful data pipelines that access, harmonize, transform, process, and move information from a variety of sources across the organization. Pipeline creators can easily activate powerful libraries for computation or machine learning, for example; rapidly connect data of a wide variety of types, such as social media, customer, and product information; and leverage existing processing investments, such as capabilities in SAP HANA, Apache Hadoop, SAP Vora, or Apache Spark. Pipeline models can be easily copied, modified, and re-used to accelerate pipeline deployment and leverage best practices.
SAP Data Hub accelerates business results with innovative “push-down” processing to power more agile, comprehensive data-driven applications
SAP Data Hub not only accelerates the creation and management of data pipelines that span varied data sources, it also provides fast execution of the pipeline activities themselves by distributing computational tasks to the native environments where the data reside. This federated “push-down” distributed processing ensures that the activities of the pipeline complete as rapidly as possible, delivering fast results to the business. This data processing approach allows customers to take advantage of serverless computing in the cloud, potentially reducing the overall cost of data pipelining and data management.
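The "push-down" idea can be illustrated with a small, self-contained sketch. This is plain Python, not SAP Data Hub code, and the `Source` class is purely hypothetical: instead of pulling every record out of a source system and filtering centrally, the filter predicate is evaluated where the data resides, so only matching rows travel across the network.

```python
# Conceptual illustration of "push-down" processing (not SAP Data Hub code).
# A hypothetical Source holds rows; a pipeline step needs only rows matching
# a predicate. Pushing the predicate down to the source minimizes data moved.

class Source:
    """Stand-in for a remote system (e.g. a database or a data lake)."""
    def __init__(self, rows):
        self.rows = rows

    def scan(self):
        # Naive approach: ship every row to the central pipeline engine.
        return list(self.rows)

    def scan_pushdown(self, predicate):
        # Push-down: evaluate the predicate where the data resides,
        # so only qualifying rows leave the source system.
        return [r for r in self.rows if predicate(r)]


source = Source([{"id": i, "amount": i * 10} for i in range(1000)])
wanted = lambda r: r["amount"] > 9900

centralized = [r for r in source.scan() if wanted(r)]  # moves 1000 rows
pushed_down = source.scan_pushdown(wanted)             # moves only 9 rows

assert centralized == pushed_down  # same result, far less data movement
print(len(source.scan()), len(pushed_down))  # prints: 1000 9
```

Both paths produce identical results; the difference is how much data crosses the system boundary, which is exactly the cost that push-down execution avoids.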
Other solutions often require you to centralize your data. Some companies offer a pipelining and orchestration solution, but only for the data held in their solution. They want you to move all your data into one place to create and execute advanced data pipelines.
Who benefits from SAP Data Hub?
- Organizations looking for an easier way to understand, manage, and get greater value from their complex data landscape, including data held on premise and in the cloud, in data lakes, data warehouses, and data marts
- Organizations that want to be able to quickly create data-driven applications and analytics that leverage data from across the organization
- Organizations challenged by integrating Big Data (such as IoT, Social Media, Web Log, or Streaming Data) into Enterprise landscapes for operational efficiency and/or analytic insights.
- Organizations looking for solutions to control and manage Big Data Lakes effectively (Data Transformations, Governance, Operations, Harmonization, Stream Integration, Coding, Scripting, Consolidation)
- Organizations trying to combine and integrate a SAP HANA-based landscape (Data Warehouse, BW, etc.) with Big Data Lakes
When is it generally available?
SAP Data Hub is generally available as of September 1, 2017.
What are the planned deployment options?
For the initial release, SAP Data Hub will be offered as an on-premise application, which can connect to and process data in cloud environments (e.g. Data Lakes in Amazon AWS). Its architecture is cloud-ready, and PaaS and SaaS versions will follow in future releases.
Why is it called SAP Data Hub? Does it centralize data?
SAP Data Hub gets its name from the fact that it offers centralized governance and pipelining capabilities – a unified view and data management of the complex data landscape.
Part of the power of the solution resides in its ability to leave the data where it is. The data does not have to be mass centralized with SAP Data Hub. This provides advantages in terms of ease of management and speed of data pipeline execution. Customers leverage their existing data stores and existing processing capabilities.
Is data stored in SAP Data Hub?
No. SAP Data Hub does not offer its own data storage. It is a platform to orchestrate and manage data between existing data stores, but it is not a data warehouse, data mart, or Data Lake on its own.
Is SAP Data Hub yet another ETL or Streaming tool?
No. SAP Data Hub goes beyond classical batch ETL or real-time streaming. It modernizes these functions and focuses on the integration of new technologies, operating in distributed landscapes (e.g. Hadoop clusters or public cloud storage). The main paradigm is to bring the logic to where the data resides and to leverage the cluster’s compute power. Hence it adds the processing and integration on top.
What key functionality does SAP Data Hub v1.0 include?
With its first version, SAP Data Hub will allow the enterprise to achieve:
- Data Pipelines across Data Lakes (based on Hadoop), object stores (Amazon S3), cloud/on-premises databases and data warehouses. From the very start, the solution reaches across the data landscape, leveraging “push-down” distributed data processing to:
- Perform data transformations, data quality, and data preparation processes via a graphical user interface
- Define data pipelines and streams
- Embed and productize scripts, programs, and algorithms of the Data Scientist
- Productize open libraries or ML algorithms in one framework
- Orchestration of complex processes and workflows across system boundaries
- Workflow creation of operations and processes across the landscape with monitoring and analysis capabilities
- Execution of end-to-end data processes, starting with the ingestion of data into the landscape (e.g. the data lake), including data processing, and leading up to the delivery or integration of the resulting data into enterprise processes and applications
- Remote Process scheduling: SAP Business Warehouse process chains, SAP Data Services dataflows, and SAP HANA smart data integration Flowgraphs
- Data ingestion and processing for Data Lakes, support of unstructured and structured data/files or streams
- Offers prebuilt functionality for data integration, cleansing, enrichment, masking, and anonymization
- No coding or scripting to prepare and transform data in Data Lakes
- Kafka stream integration in end-to-end data pipelines
- Enterprise data quality and data governance functions execute in a distributed fashion in the Data Lake using built-in services, which are extensible with open-source components or cloud micro-services.
- Leveraging and integrating SAP HANA Smart Data Integration, SAP Data Services, SAP BW
- Control, manage, operationalize, and productize complex data landscapes
- Handling connections between systems with delivered adapters
- Unified monitoring and scheduling of the landscape provide a central entry point where data stewards can see the status of data processes across all connected components
- Pre-defined adapter framework for connectivity
- Establish and manage zones in a landscape (e.g. Lab environment, Production, etc.) with attached policies and services levels
- Security and Access Control capabilities
- Metadata lifecycle with lineage and impact analysis
- Metadata model content authoring with repository integration (based on GitHub)
- Data Discovery to visually understand the value in Data Lake data
- Data Profiles for Big Data Sets showing quality and comprehensive structure information
- Ability to crawl, discover, and tag data elements
- Expose discovered data for further usage
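The feature list above mentions prebuilt functionality for masking and anonymization. As a minimal sketch of what those terms mean in practice (illustrative plain Python, not SAP Data Hub’s built-in services), masking hides part of a sensitive value while pseudonymization replaces it with a stable, non-reversible token:

```python
# Minimal sketch of data masking and pseudonymization, as mentioned in the
# feature list above. Illustrative only - not SAP Data Hub's implementation.
import hashlib

def mask_email(email):
    """Replace the local part of an e-mail address with asterisks."""
    local, _, domain = email.partition("@")
    return "*" * len(local) + "@" + domain

def pseudonymize(value, salt="demo-salt"):
    """One-way pseudonym: the same input always yields the same token."""
    return hashlib.sha256((salt + value).encode()).hexdigest()[:12]

record = {"name": "Ada Lovelace", "email": "ada@example.com"}
masked = {
    "name": pseudonymize(record["name"]),   # stable token, not reversible
    "email": mask_email(record["email"]),
}
print(masked["email"])  # prints: ***@example.com
```

Because the pseudonym is deterministic, masked records can still be joined and counted across systems without exposing the underlying identity, which is the typical governance requirement such functions address.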
What is the relationship between SAP Data Hub and SAP Vora?
SAP Vora capabilities are included in SAP Data Hub, however SAP Data Hub and SAP Vora are designed to address different use cases, based on customers’ specific needs.
SAP Data Hub simplifies the orchestration of complex data processes while providing governance across modern and diverse landscapes including big data stores, enterprise data stores, enterprise applications and cloud solutions.
SAP Vora is an enterprise-ready, easy-to-use in-memory distributed computing engine that helps organizations uncover actionable insights from Big Data, typically stored in Hadoop and NoSQL solutions. It is positioned both for data scientists and as part of a multi-tier data strategy with Hadoop.
What is the relationship to SAP Data Services, SAP HANA smart data integration (SDI), and SAP HANA smart data quality (SDQ)?
SAP Data Hub will leverage existing customer investments and execute SAP HANA SDI/SDQ flowgraphs that run on SAP HANA boxes, as well as leverage SAP Data Services jobs that run on existing Data Services job servers. It will not replace their existing use cases.
SAP Data Hub is designed as a central place to orchestrate, monitor, and model integration flows, where SAP Data Services jobs, SAP HANA SDI and SDQ tasks, and Big Data flows can be brought together. These SAP EIM products will continue to be developed and offered separately from SAP Data Hub.
What is the relation to SAP Agile Data Preparation (ADP)?
SAP Data Hub has some built-in profiling capabilities, but it can be complemented with SAP ADP as a self-service data preparation tool. For this use case, SAP ADP offers business users the capabilities to search and access their data sources, visually manipulate the data to make it ready for reporting, and publish it. It will interact closely with SAP Data Hub to bring this self-service to Big Data scenarios. In later releases, SAP ADP will leverage the metadata repository of SAP Data Hub.
What is the relation to SAP Analytics?
SAP Data Hub helps drive the value of analytics by optimizing the data pipeline with speed and security, enabling organizations to act on the right information in the moment. SAP is the only vendor in the market that can offer an end-to-end software portfolio across Data, Analytics, and Business Applications. SAP Analytics Cloud, a cloud-based solution for all analytics (built on SAP Cloud Platform), will take advantage of the powerful data orchestration capabilities of SAP Data Hub, allowing organizations to enhance analytical use cases through the ability to control, manage, and optimize their data environments.
How is this part of SAP Leonardo?
SAP Leonardo is a digital innovation system that enables customers to rapidly innovate and then rapidly scale that innovation to redefine their business for the digital world. SAP’s Big Data solutions, SAP Data Hub, SAP Vora, and SAP Cloud Platform Big Data Services, are relevant to the Leonardo offering because they are key to scale and innovation. As such, they are offered in the Leonardo Big Data packages.
SAP Data Hub resonates with the core themes of Leonardo, because:
- It minimizes risk and disruption. It works with your existing data landscape and doesn’t require you to centralize data.
- It maximizes your existing technology investments and allows you to make the most of them – it plays the data where it lies and utilizes the processing capabilities closest to the data, so that your data pipelines complete as quickly as possible.
- It allows you to rapidly scale innovation, since it makes data pipelining capability available to a broader range of users within your organization, and it allows you to easily build on successes.
- It allows you to stay open to the future. Due to its open architecture, not only can you make the most of your data today, whether it lives in the cloud or on premises, in SAP or non-SAP solutions, you can also quickly and easily adopt new advances, such as machine learning and the next data analytics or processing innovation.
How do I buy SAP Data Hub?
Please contact your SAP Account Executive to get started, or contact us at: https://www.sap.com/registration/contact.html
Thanks for your detailed blog. Can you please elaborate on the following?
Many clients have already invested heavily in setting up Hadoop-based big data platforms using Cloudera or Amazon EMR frameworks. How does SAP Data Hub bring a value proposition to them?
The landscape as you have described exactly suits the Data Hub paradigm. The Data Hub does not persist data, so it is not intended to replace any of the big data platforms you note; rather, it is intended to bring even more value to these systems in a landscape. The Data Hub is designed to leverage existing investments and technologies at our customers. The Data Hub will allow data processes to be built that span our customers' landscapes – from big data platforms through other systems (e.g. BW, data marts, HANA systems, applications, non-SAP databases, etc.). The insight that is taken from these big data platforms can now be shared across the enterprise, allowing more value to be derived. Governance can be applied across all of these systems. Processes can be scheduled and monitored from a central cockpit. So the value proposition is to enhance and expand on these big data platforms and bring this information into the enterprise for maximum impact.
Can you elaborate how integration with non-SAP ETL tooling (we use Informatica Powercenter) can be implemented?
And we use Azure SQL DWH besides SAP BW on HANA as a data warehouse. Can this solution also work across both platforms to avoid data duplication?
How do you compare SAP Data Hub to SAP Cloud Platform Integration?
Can you please help me understand whether there is a possibility of direct reporting on Data Hub using the processed virtual data, without persisting the data in any target system?
For example, consuming the processed data virtually through some web service or OData by reporting tools.
I think it is important to point out that SAP Data Hub was created specifically to address the challenges posed by the integration and analysis of vast amounts of diverse data in today's increasingly complex enterprise landscapes. Our goal is to provide advanced analytics services and data management capabilities that customers can easily use to provision data and automate business processes across the entire landscape. Thereby, data is cleansed, integrated, refined, shared, accessed, and governed by users at different levels to gain new insights and unlock potential cost savings and business opportunities. Keep in mind that Data Hub is not a data federation tool. If you are planning to create a virtual semantic layer across data assets for the purpose of building reports, then HANA would be a better solution for that.
Hi Marc, one question on my mind: I have a landscape with SAP ERP in the cloud and SAP BW on-premise, and both systems are operational. We have transaction data from ERP being fed into my SAP BW system through SDI. In this current scenario, how is SAP Data Hub going to help me?
Thank you. This is a good article to go on for Data Hub. However, I have the following doubts: can we write R scripts (like whatever we do in R Studio) or Python scripts (like in a Jupyter interface) in Data Hub? Do we have R/Python interface features in Data Hub?
Or how do we use R/Python scripts written locally in the enterprise? Do we just call those scripts?
What is the best approach to handle R/Python scripting across the overall data science landscape?
We have R-client/Python2/Python3 operators that allow you to write your R/Python scripts and execute them on Data Hub. As for a Jupyter interface, that is something planned for the future.
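To give a feel for the Python operators mentioned in the reply above, here is a minimal sketch of the typical shape of such a script: a callback registered on an input port that sends its result to an output port. Inside Data Hub, an `api` object is injected by the runtime; the `_StubApi` class below is a hypothetical stand-in added so the snippet runs anywhere, and the exact operator API may differ between Data Hub versions.

```python
# Illustrative shape of a Python operator script. Inside SAP Data Hub an
# `api` object is injected by the runtime; `_StubApi` is a stand-in for
# running the snippet outside Data Hub. The real API may differ by version.

class _StubApi:
    """Minimal stand-in for the runtime-injected `api` object."""
    def __init__(self):
        self._callbacks = {}
        self.sent = []  # records (port, data) pairs for inspection

    def set_port_callback(self, port, fn):
        # Register a function to be called when data arrives on `port`.
        self._callbacks[port] = fn

    def send(self, port, data):
        # In Data Hub this would push data downstream; here we record it.
        self.sent.append((port, data))

    def _feed(self, port, data):  # test helper, not part of the real API
        self._callbacks[port](data)

api = _StubApi()

# --- operator script body: uppercase whatever arrives on "input" ---
def on_input(data):
    api.send("output", data.upper())

api.set_port_callback("input", on_input)

# Simulate the runtime delivering a message to the operator:
api._feed("input", "hello data hub")
print(api.sent)  # prints: [('output', 'HELLO DATA HUB')]
```

The point of the pattern is that the script body itself stays small and portable: locally developed R/Python logic can be wrapped in such a callback and dropped into a pipeline operator.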
Here you can learn more on how to use Data Hub for your Data Science scenario: https://blogs.sap.com/2018/05/23/sap-data-hub-a-flexible-data-solution/
To learn more about how to execute your Python code, I refer you to my colleague's blog here: https://blogs.sap.com/2018/01/23/sap-data-hub-develop-a-custom-pipeline-operator-with-own-dockerfile-part-3/.
Also, you can learn more on how to deploy your Data Science scenario on Data Hub from this webcast: https://event.on24.com/eventRegistration/EventLobbyServlet?target=reg20.jsp&referrer=&eventid=1861921&sessionid=1&key=500CC9639BCEB35241E92C5B037BB869&regTag=&sourcepage=register