How BW/4HANA Can Fit into an Existing AWS Data Lake
Consider a scenario where a customer has already built a data lake solution on AWS, fed by non-SAP applications. In this post I position BW/4HANA beside the AWS data lake architecture, where standard extractors can be used to load data from different SAP sources. Finally, I look at how this data (coming from BW/4HANA as well as from the data lake) can be used intelligently to meet various business and technical needs.
In this document I am not going to emphasize the data warehouse vs. data lake conflict, but rather how these two concepts can complement each other for the benefit of both business and IT.
Common Myths about Data Lakes
- Use the data lake to dump any data, no governance required → False
  - Just as some data warehouses have become massive black holes from which vast amounts of data never escape, a data lake can become a data swamp if good governance policies are not applied
- The data lake is a replacement for the data warehouse → False
  - The data lake can incorporate multiple enterprise data warehouses (EDW), plus other data sources such as social media and IoT. These all come together in the data lake, where governance can be encapsulated, simplifying trusted discovery of data for users throughout the organization
- Data lake success is measured by delivering access → False
  - Dumping data into a central location is not a true analytic solution. The goal is to run data analyses that produce meaningful business insights: to uncover new revenue streams, retention models or product extensions
Characteristics of Data Warehouse and Data Lake
| Characteristics | Data Warehouse | Data Lake |
|---|---|---|
| Data | Relational from transactional systems, operational databases, and line of business applications | Non-relational and relational from IoT devices, web sites, mobile apps, social media, and corporate applications |
| Schema | Designed prior to the Data Warehouse implementation (schema-on-write) | Written at the time of analysis (schema-on-read) |
| Price/Performance | Fastest query results using higher cost storage | Query results getting faster using low-cost storage |
| Data Quality | Highly curated data that serves as the central version of the truth | Any data that may or may not be curated (i.e. raw data) |
| Users | Business analysts | Data scientists, Data developers, and Business analysts (using curated data) |
| Analytics | Batch reporting, BI and visualizations | Machine Learning, Predictive analytics, data discovery and profiling |
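The schema row of the table is the difference that is easiest to see in code. The following is a minimal sketch only, using an in-memory SQLite table to stand in for a warehouse and a list of JSON strings to stand in for raw files in a lake; the table name, field names and the `read_sales` helper are hypothetical and not taken from any SAP or AWS product.

```python
import json
import sqlite3

# Schema-on-write: the table structure is fixed before any data is loaded,
# so every row must fit the declared columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (order_id INTEGER NOT NULL, amount REAL NOT NULL)")
conn.execute("INSERT INTO sales VALUES (?, ?)", (1, 99.5))
warehouse_total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]

# Schema-on-read: raw records are stored as they arrive (hypothetical examples);
# a structure is only applied when somebody reads them for a given analysis.
raw_lake_records = [
    '{"order_id": 2, "amount": "10.5", "channel": "web"}',
    '{"order_id": 3, "amount": 4.5}',
    '{"device": "sensor-7", "temp": 21.3}',  # not a sales record
]


def read_sales(records):
    """Apply a sales schema while reading; skip records that do not fit it."""
    for line in records:
        doc = json.loads(line)
        if "order_id" in doc and "amount" in doc:
            yield int(doc["order_id"]), float(doc["amount"])


lake_total = sum(amount for _, amount in read_sales(raw_lake_records))
```

The same raw records could be read again later with a different schema (for example, one that picks up the sensor reading), which is what makes the lake suitable for data discovery, at the price of doing the curation work at read time.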
Here, I explore how to support different business and technical needs through the appropriate placement of requirements such as data load (preparation), data tiering (classification into hot, warm and cold data) and how the data can be consumed across the two environments. A simple architecture model [Fig 1] (specific to my use case mentioned above, though it may be extended to a broader concept) shows what the data warehouse and the data lake are and how they complement each other. It also shows how the data (individually or combined) can be consumed using different options.
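To make the idea of data tiering concrete, here is a small sketch that classifies a data set as hot, warm or cold by how recently it was accessed. This is an illustration under assumptions: the post does not prescribe a tiering rule, and the thresholds and the `classify_tier` function are hypothetical; real policies are defined per organization.

```python
from datetime import date

# Hypothetical thresholds in days since last access; not taken from the post.
HOT_MAX_DAYS = 90
WARM_MAX_DAYS = 365


def classify_tier(last_accessed: date, today: date) -> str:
    """Return 'hot', 'warm' or 'cold' based on the age of the last access."""
    age_days = (today - last_accessed).days
    if age_days <= HOT_MAX_DAYS:
        return "hot"
    if age_days <= WARM_MAX_DAYS:
        return "warm"
    return "cold"
```

In an architecture like the one described here, hot data would typically be a candidate for BW/4HANA, while colder data would be a candidate for low-cost storage in the data lake.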
Communication between BW/4HANA and Data Lake
To exchange data between BW/4HANA and the data lake, the following combinations are possible:
- Data Lake to BW/4HANA
  - Data Services connector
  - Smart Data Access (SDA)
- BW/4HANA to Data Lake
  - Open Hub Destination
  - Data Services connector
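The combinations listed above can be captured as a simple lookup, which is sometimes handy when documenting or validating an integration design. This is only a sketch: the `Direction` enum and the helper functions are hypothetical names, and the option names are copied from the list as labels, not as API calls.

```python
from enum import Enum


class Direction(Enum):
    LAKE_TO_BW = "Data Lake to BW/4HANA"
    BW_TO_LAKE = "BW/4HANA to Data Lake"


# Option labels exactly as listed in this post.
EXCHANGE_OPTIONS = {
    Direction.LAKE_TO_BW: ["Data Services connector", "Smart Data Access (SDA)"],
    Direction.BW_TO_LAKE: ["Open Hub Destination", "Data Services connector"],
}


def options_for(direction: Direction) -> list:
    """Return the exchange options listed for one direction."""
    return list(EXCHANGE_OPTIONS[direction])


def bidirectional_options() -> list:
    """Return the options that appear in both directions."""
    to_lake = set(EXCHANGE_OPTIONS[Direction.BW_TO_LAKE])
    return [o for o in EXCHANGE_OPTIONS[Direction.LAKE_TO_BW] if o in to_lake]
```

Of the options listed, only the Data Services connector appears in both directions, which may matter if a single tool for both data flows is preferred.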
There are several ways to consume the data from either the warehouse (e.g. BW/4HANA) or the lake (e.g. AWS data lake).
Option 1: SAP Data Hub
SAP Data Hub is a solution that lets one integrate, govern and orchestrate data processing and manage metadata across enterprise data sources and data lakes. It also allows building data pipelines as well as managing, sharing and distributing data. SAP Data Hub provides a broad, detailed and easily understandable view of the entire data landscape across sources such as Hadoop, Amazon S3, SAP HANA, ERP, BW/4HANA etc. This is one of the preferred options to visualize data from an SAP HANA-based landscape (Data Warehouse, BW/4HANA, etc.) in combination with a data lake.
Option 2: SAP Analytics Cloud
SAP Analytics Cloud is a cloud-based SaaS (Software as a Service) business intelligence (BI) platform from SAP that provides all analytics capabilities in one product. Whether the data is stored in spreadsheets, on-premises databases, cloud databases or a combination of the three, it can be analyzed with SAP Analytics Cloud. We can create live data connections using the 'Direct' connection type (based on Cross-Origin Resource Sharing, CORS) to connect to SAP BW/4HANA, SAP BW etc.
However, connectivity between relational cloud databases (e.g. Amazon Redshift, Spark, Azure) and SAP Analytics Cloud has not yet been introduced. It is planned for Q4/2018; keep an eye on SAP Note 2532957 for the latest update (https://launchpad.support.sap.com/#/notes/2532957).
Option 3: SAP BusinessObjects
SAP BusinessObjects BI (SAP BO, also known as BOBJ) is a reporting and analytics platform for business users. It consists of a number of reporting applications that allow users to discover data, perform analysis to derive insights and create reports that visualize those insights. It can be easily deployed on SAP HEC (HANA Enterprise Cloud), Amazon Web Services (AWS), Microsoft Azure, IBM SoftLayer, Alibaba YSF or any on-premises platform. Since 2017 the portfolio has been simplified to fewer tools for the following purposes:
- Reporting: Crystal Reports & Web Intelligence
- Office Integration: Analysis Office
- Data Discovery: Lumira 2.x
In the picture (Fig 1), it is deployed on an on-premises platform, from which it can connect to Amazon Redshift through a JDBC connection and to BW/4HANA as well.
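For the JDBC connection mentioned above, the BI tool needs a connection URL for the Redshift cluster. The sketch below builds one in the `jdbc:redshift://endpoint:port/database` form used by the Amazon Redshift JDBC driver, with 5439 as Redshift's default port. The helper name, the cluster endpoint and the database name are hypothetical placeholders; the actual values come from the cluster's configuration in the AWS console.

```python
def redshift_jdbc_url(endpoint: str, database: str, port: int = 5439) -> str:
    """Build a JDBC URL of the form jdbc:redshift://endpoint:port/database."""
    if not endpoint or not database:
        raise ValueError("endpoint and database are required")
    return f"jdbc:redshift://{endpoint}:{port}/{database}"


# Hypothetical cluster endpoint and database name, for illustration only.
url = redshift_jdbc_url(
    "examplecluster.abc123xyz789.us-west-2.redshift.amazonaws.com", "dev"
)
```

Credentials are deliberately left out of the URL here; they are normally entered separately when the connection is defined in the BI platform.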
Option 4: Power BI
Microsoft Power BI is a business analytics service that provides insights through live dashboards, lets us create rich interactive reports and gives access to data on the go from mobile devices. The main components of Power BI are:
- The Power BI service provides a mechanism for users to access and interact with their data
- Power BI Desktop can either connect directly to BW/4HANA or copy data from it, and prepare the data using its built-in capabilities. With Power BI Desktop we can also connect to an Amazon Redshift database and use the underlying data just like any other data source
- The Power BI gateway facilitates secure data transfer between multiple data sources (e.g. SAP data, files, SQL Server etc.) and Power BI
Depending on the business and technical needs of the organization, one or a combination of the above options can be chosen.