Data Architecture with SAP – Data Lake
In my experience, Data Lake usage in the SAP world is still quite limited. What I mostly see in the SAP Community is about technically connecting a Data Lake. If you follow the SAP Community tag “SAP HANA Cloud, data lake”, you will find only 40 blogs from 18 contributors with 21 followers, and the topics are mainly events or new features.
A first explanation of the need for a Data Lake can be found in the 2020 blog “What is a Data Lake and Why You Need One”. It has been viewed 65 times and has no likes, comments or discussion. Another one from 2019, “Solving Data Storage Challenges with SAP HANA Cloud”, has just 53 views, no likes and no comments.
So why does hardly anyone in the SAP world seem to need a Data Lake or have data challenges to solve? After all, outside of SAP, Data Lakes have been a hot topic in the Data & Analytics world for more than 10 years now.
SAP has certainly tackled the topic of Big Data several times, and Data Lakes have emerged from it. Did HANA perhaps indeed deliver everything needed in the SAP world of analytical data management? Has the Data Warehouse already solved all the data challenges?
Why Data Warehouses are not sufficient
What happened in the data world that made new concepts for data management necessary? In 2001, the Meta Group identified new challenges arising from e-commerce and started to communicate the 3 V’s:
- Variety – new kinds of data that are not structured to fit into relational databases, such as JSON, XML, audio, video and images
- Volume – huge amounts of data, from hundreds of terabytes up to petabytes, to be handled
- Velocity – the need to store and process new streaming data in (near-)real time
Furthermore, new digital services emerged, based on consumer IoT, IIoT in manufacturing, tracking the customer journey via social media, a company’s own website and apps, or faster reactions to changing market data. They came with the promise of creating value from data via Machine Learning/Artificial Intelligence for optimisation, automation and individualised user experience and production. The expectation is that data will save costs, increase revenues, reduce risks or even speed up research and innovation.
So, while reporting and classical BI are still important and Data Warehouses do a great job of supporting these workloads, they have not done a good job of addressing these new use cases:
- Data storage is expensive
- Storage and compute are typically coupled – you cannot scale for big data
- Semi-structured and unstructured data is only minimally supported
- Only pre-defined structures can be analysed (schema-on-write)
- DWHs are typically not good at handling dynamic schemas (even if some extensions, e.g. a JSON document store, are available)
- Very limited environment for Data Science tasks and Machine Learning workloads
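The schema-on-write constraint in the list above can be made concrete with a small sketch, using Python’s stdlib sqlite3 module as a stand-in for any relational warehouse (purely illustrative – not an SAP API, and the table and column names are invented):

```python
import sqlite3

# A relational store enforces its schema at write time (schema-on-write).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (order_id INTEGER, amount REAL)")
con.execute("INSERT INTO sales VALUES (1, 99.90)")

# A record arriving with an extra attribute (e.g. a new tracking field)
# cannot be loaded without changing the schema first.
write_error = None
try:
    con.execute(
        "INSERT INTO sales (order_id, amount, channel) VALUES (2, 10.0, 'web')")
except sqlite3.OperationalError as e:
    write_error = e
print("rejected:", write_error is not None)  # rejected: True
```

For slowly evolving business data this strictness is a feature; for fast-changing event data it becomes the bottleneck the 3 V’s describe.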
The rise of the Data Lake
The Data Lake – a central data store that allows data of any kind and any size to be ingested and processed, with the promise of supporting digital business models, data-science workloads and big data on a central, open platform.
Figure 1: Data Lake – base architecture and benefits
Coined in 2010 by James Dixon, the Data Lake has established itself as a core component of data landscapes and is now indispensable. Whereas setting up a Hadoop cluster was the first step a few years ago, today a Data Lake can be activated within a few minutes by the cloud provider of your choice.
But not all Data Lakes are the same. In the simplest case, it is a block store, file store or object store in which you can quickly and easily store your files without processing. This follows the so-called schema-on-read approach, where data stays in its original format so that no information is lost and Data Scientists are free to decide on the transformation step based on their needs. Of course, the data should also be processed at some point. As Data Lakes favour ELT (Extract-Load-Transform) over ETL (Extract-Transform-Load), transformation of the data happens after it has already been ingested into the Data Lake. For this, various access and processing options are available, such as Hadoop-compatible APIs or MapReduce as a method for parallel processing of large data volumes. Typically, technologies like Apache Spark play an important role here. To improve access to structured data in the Data Lake, file formats such as Apache Parquet, Apache ORC (Optimised Row Columnar) or Apache Avro have been developed – in the simplest case, CSV is also used. For extended data management functions, NoSQL databases are also used.
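The schema-on-read/ELT idea can be sketched in a few lines of plain Python – a toy illustration not tied to any specific Data Lake product, with all field names invented: raw events land unchanged, and a consumer projects a schema only at read time:

```python
import io
import json

# ELT: raw events land unchanged in the lake (here: an in-memory JSON-lines "file").
raw_zone = io.StringIO()
for event in [
    {"order_id": 1, "amount": 99.9},
    {"order_id": 2, "amount": 10.0, "channel": "web"},  # extra field - no problem
]:
    raw_zone.write(json.dumps(event) + "\n")

# Schema-on-read: each consumer decides on its own projection later.
raw_zone.seek(0)
bi_view = [(rec["order_id"], rec["amount"])
           for rec in map(json.loads, raw_zone)]
print(bi_view)  # [(1, 99.9), (2, 10.0)]
```

Note how the second event’s unexpected `channel` field is preserved in the raw zone for future consumers, while the BI projection simply ignores it – the inverse of the schema-on-write behaviour of a warehouse.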
To prevent the Data Lake from becoming a data swamp, Data Lakes are organised into several zones that fulfil different tasks. The zone names vary widely; 3-4 zones are typical (such as Raw|Transformed|Curated or Bronze|Silver|Gold), and, similar to Data Warehouses, additional zones can be found, for example a speed layer or a sandbox with direct access. In general, the topic of data governance should not be neglected here. This shows that building and operating a Data Lake involves a certain degree of complexity and that the effort is in no way inferior to that of a Data Warehouse.
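A minimal sketch of such a zone concept, here with the Bronze|Silver|Gold naming (toy Python; the records, cleansing rules and aggregation are invented for illustration):

```python
import json

# Toy medallion pipeline: Bronze (raw), Silver (cleansed), Gold (curated).
bronze = ['{"id": 1, "amount": "99.9"}',
          '{"id": 2, "amount": "10.0"}',
          'not-json-at-all']                      # bad record stays in Bronze

silver, quarantine = [], []
for line in bronze:
    try:
        rec = json.loads(line)
        rec["amount"] = float(rec["amount"])      # type cleansing on promotion
        silver.append(rec)
    except (ValueError, KeyError):
        quarantine.append(line)                   # governance: track rejects

# Gold: an aggregated, consumption-ready view for BI
gold = {"total_amount": round(sum(r["amount"] for r in silver), 2)}
print(gold)  # {'total_amount': 109.9}
```

The point is that each promotion step adds guarantees (valid format, correct types, business aggregates) without ever losing the raw originals – which is exactly the governance discipline that separates a lake from a swamp.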
Today, despite everything the Data Lake can deliver, it is not able to meet all analytical requirements, even though new approaches such as the Data Lakehouse are trying to close the gap (more on this later). In order to still serve the needs of classic Business Intelligence, the Data Lake is therefore typically combined with a Data Warehouse. Different usage patterns for combining these concepts have proven themselves, of which the following are frequently found:
Figure 2: Typical usage pattern for Data Lakes
The usage patterns shown are presented in more detail later in the SAP context.
Data Lake with SAP
Although SAP has been actively offering data management technologies for more than 20 years now, the Data Lake has until recently been limited to third-party providers. Historically, SAP HANA was first seen as the solution for all Big Data challenges. With the acquisition of Altiscale in 2016, SAP then offered its own Big Data service for the first time. In addition, SAP developed Vora in 2015, a tool for accessing Hadoop using the Apache Spark framework. Vora was also used accordingly in SAP Data Intelligence and formed what SAP called a Big Data Warehouse in the context of BW/4HANA.
With the redesign of the HANA Cloud service as a cloud-native offering, SAP then created a stack of SAP HANA, SAP IQ – the database acquired with Sybase in 2010 for very large analytical workloads – and connectivity to third-party cloud Data Lake providers. The HANA Cloud, Data Lake was born in 2019 as the “Relational Data Lake” and has been available since 27 March 2020. Since then, the Data Lake offering has evolved and is presented in the following overview:
In order to understand what lies behind SAP’s Data Lake offering, it is necessary to understand how the offering is layered. Basically, a distinction can be made between whether the Data Lake is “HANA DB-managed” or not (standalone Data Lake). For the overall picture, I will typically look at the HANA DB-managed variant.
When you create a HANA Cloud instance, you get the following overview (trial configuration):
Figure 4: SAP Data Lake – Data Tiers during SAP BTP configuration
The overview shows roughly the first three levels that can be configured.
- SAP HANA Database – In-memory – the SAP HANA Cloud – a column-based, multi-model and scalable in-memory database. SAP HANA includes, among other things, a JSON document store, which already provides the possibility of processing semi-structured data.
- SAP HANA Database Disk – A fully integrated extension of the disk-based data storage that always exists for SAP HANA, combined with a buffer cache. The function is called Native Storage Extension (NSE).
- Data Lake – The Data Lake level is again made up of various services:
- Data Lake Relational Engine – relational Data Lake based on SAP IQ technology, which is in principle capable of efficiently processing petabytes of data. SAP currently limits its use to 90 terabytes, which can be expanded on request.
- Data Lake Files – a hyperscaler-based (Azure, AWS S3, Google Cloud Storage) file container managed by SAP. The SAP HANA Cloud, Data Lake can also be booked as a standalone variant and thus enables “pure” Data Lake use.
A relational database management system is not the original idea of a Data Lake, even though IQ is basically able to process unstructured data. Similar to the on-premise world, the IQ-based Data Lake (Relational Engine) can thus be seen as a solution for offloading overly large and historical data, which remains available with very good query speed at acceptable costs.
Data Lake Files has been available since 25 March 2021 and represents a Data Lake in the true sense, which can potentially hold unlimited amounts of structured, semi-structured and unstructured data. SAP currently offers two storage service options: SAP-native, based on NetApp technology, or via AWS Elastic File System (EFS). Costs only arise when data is put into Data Lake Files.
There are three ways to access the files container:
- SQL on Files – to query structured data in formats such as Parquet, ORC or CSV
- REST API
- Command Line Tool (hdlfscli)
SAP also provides a driver for Apache Spark, which is a kind of standard for big data processing in Data Lakes.
Basically, Data Lake Relational Engine (IQ) allows access to structured data as well as binary and ASCII files from the following hyperscalers:
- Microsoft Azure – Azure blob store (Example)
- Amazon AWS – S3 bucket (Example)
- Google Cloud storage – Google Cloud Storage bucket
This essentially involves moving or copying data from a hyperscaler to the HANA Cloud Data Lake.
It may be worth mentioning here that SAP Data Intelligence Cloud, depending on the cloud provider on which it runs, enables an integrated Data Lake – the Semantic Data Lake – on the same basis.
Not part of the data pyramid, but nevertheless interesting in this context, is the possibility of remotely accessing analytics services at the cloud providers from SAP HANA Cloud using Smart Data Access. This leaves the Data Lake aside, but can also ensure integrated data processing without having to move the data. According to SAP Note 2600176, the following external services are currently possible:
Architecture variants in the SAP world
As shown above, the architecture can be found in different variants, depending on the history of the data landscape, the boundary conditions and the objectives or use cases to be covered.
Hybrid Data Platform
Figure 5: Possible SAP implementation of hybrid Data Lake variant
The hybrid architecture approach presented here, classically found on-premises, with an active exchange between Data Warehouse and Data Lake, is a typical case in a BW scenario. With SAP IQ as an NLS solution, historical data that does not have the same performance requirements as the current, in-memory data in the BW HANA database is offloaded via Data Tiering Optimisation. Access to the data is transparent.
It can be extended, e.g. with Hadoop, for digitisation use cases, data science requirements and the handling of unstructured data. SAP used Vora technology here for several years to process data in Hadoop, but direct access to data in Hadoop via Smart Data Access is also possible. Over time, SAP has relied more heavily on SAP Data Intelligence for such approaches, into which the Vora technology has been integrated.
Parallel Data Platform
Figure 6: Possible SAP implementation of parallel Data Lake variant
The parallel architecture approach naturally simplifies a lot because integration is not necessary. Each concept pursues its own independent use cases.
At the same time, a holistic view of all existing data is missing, which could be solved by virtualisation approaches, among others.
This is particularly recommended in order not to build up large dependencies and to have more freedom and flexibility in the choice of technologies. Typically, the choice here falls on one of the hyperscalers (AWS, GCP, Azure, Alibaba), although there are still alternatives for the on-premises world beyond Hadoop.
The use of SAP IQ as nearline storage and relational Data Lake can be seen here as a possible extension of BW, which, however, requires integration into the BW structures in any case and is limited to structured data.
Sequential Data Platform
Figure 7: Possible SAP implementation of sequential Data Lake variant
The sequential approach is particularly suitable if you are prepared to rebuild your architecture. Of course, a migration to the “new” world – with a certain amount of effort – is also possible. As shown above, SAP offers various layers in the cloud, but these can also be used in different forms.
In the scenario, it is envisaged that all data will first land in the Data Lake and that this will serve as a staging area, so to speak. There, the data is either processed further within the zone concept or transferred to the data warehouse at a given point. A use scenario is described in the blog “Data management concepts and techniques in SAP HANA Cloud“.
The following considerations can be made when setting up:
- Data Warehouse via SAP Data Warehouse Cloud (DWC) or HANA Cloud SQL Data Warehouse – The decision here can be made from a user-group perspective as well as from the history of your SAP data landscape. With DWC, SAP offers a migration path via BW Bridge, which facilitates a switch. DWC also offers power users their own modelling layer (Business Layer) as well as concepts such as data sharing and Spaces, which can also serve as a basis for domain-driven approaches (such as Data Mesh). The HANA Cloud lends itself as the top layer for technically oriented data teams that want to model an agile DWH based on Data Vault modelling, for example, make use of further PaaS services and development tools, and run the environment in a DevOps manner.
- Integrated or separate Data Lake – From the perspective of the Data Warehouse, several alternatives can be distinguished. For both DWH variants, a hyperscaler-based Data Lake can be used and integrated. The integrated approach with SAP HANA, Data Lake enables more integrated management, bringing better governance, less data redundancy and relatively simple movement of data between the layers. There is another aspect to consider when using Data Warehouse Cloud: DWC offers access to the HANA layer and can activate and allocate the integrated Data Lake for a Space. The Data Lake is thus connected through virtual tables via an optimised SDA connection. Both the HANA tier, which could then be managed independently, and the limitations of Data Lake use can make a separate connection of a HANA Cloud with a corresponding Data Lake configuration sensible. It should also be mentioned that the HANA Cloud, Data Lake can be booked as a service independently of the HANA Cloud (standalone).
Of course, the approach is also conceivable with BW/4HANA on-premises, with the optional use of SAP IQ as NLS or as an independent relational Data Lake, which also has its own options for handling and analysing unstructured data (see Blog and SAP Help).
Data Lakehouse – next step in architecture evolution
After discussing different architectures and what they could look like with SAP, let’s have a look at the next evolutionary step.
As we have seen, delivering all workloads requires two technical components: a Data Lake for Data Science/Machine Learning and huge amounts of data of all kinds, and a Data Warehouse for structured, curated data ready to be consumed by Reporting and Business Intelligence. The Data Lakehouse can be seen as a modern Data Lake architecture – basically, it is just a Data Lake as described at the beginning. As a modern data approach, it should furthermore include Data Product thinking and the Data-as-a-Product concept; an initial idea is shown in the following image:
Figure 8: Data Lakehouse – base architecture and benefits
In recent years, vendors like Snowflake and Databricks have started to optimise their Data Lake-oriented approaches and bring more SQL/database capabilities into the Data Lake. File formats like Apache Parquet, Apache ORC and Apache Avro are typically already supported by Data Lakes; they bring better performance, compression and schema evolution to Data Lake data. Data Lakehouses enhance this with at least two additional components:
- Modern Table Formats – These metadata layers enhance structured data with ACID transactions for transactional tasks, schema enforcement for more reliability, options for better performance (indexing, caching) and time travel for compliance. This lets the Data Lake behave like a relational database where it matters and makes Data Lake data ready for BI consumption. The idea is not entirely new: Apache Hive and Apache Impala already offered ways to query data in a Hadoop-based Data Lake, even with some ACID support. Databricks published the Delta Lake format in 2019, while Netflix came up with Apache Iceberg and Uber with Apache Hudi to solve similar problems.
- Powerful SQL Query Engines – Sometimes it is all about how the data is handled: putting enough compute power and bandwidth behind it and using the right caching mechanisms to get the results. Examples of such SQL query engines are Trino, Dremio and Databricks Photon.
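As a stand-in for such an engine, the following sketch uses Python’s stdlib sqlite3 to run SQL over a CSV “lake file” – obviously a toy with invented data; real engines like Trino query the files in place, distributed and at scale:

```python
import csv
import io
import sqlite3

# A CSV file as it might sit in a Data Lake (here simulated in memory).
lake_file = io.StringIO("order_id,amount\n1,99.9\n2,10.0\n3,5.5\n")

# "Engine": load the file into a SQL-capable store and query it with plain SQL.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (order_id INTEGER, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?)",
                [(int(r["order_id"]), float(r["amount"]))
                 for r in csv.DictReader(lake_file)])

total, = con.execute("SELECT SUM(amount) FROM sales").fetchone()
print(round(total, 2))  # 115.4
```

The value proposition of a dedicated query engine is exactly this bridge: BI users keep their SQL, while the data keeps living in cheap file storage.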
With these enhancements you can handle unstructured data the classical way, make semi-structured data queryable where possible, and manage structured data on comparably low-cost storage. And you can serve every workload – BI and Machine Learning – with full flexibility at the same time.
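What a modern table format adds on top of plain files can be illustrated with a toy commit log in Python (a drastic simplification – real formats like Delta Lake or Iceberg store deltas plus metadata rather than full snapshots, and the class below is invented for illustration):

```python
import copy

class ToyTable:
    """Toy sketch of a table format: a log of commits over lake files."""

    def __init__(self):
        self.commits = []          # each commit = full snapshot (simplified)

    def commit(self, rows):
        # Appending an immutable snapshot gives atomic, versioned writes.
        self.commits.append(copy.deepcopy(rows))

    def snapshot(self, version=-1):
        # Time travel = simply reading an older version of the log.
        return self.commits[version]

t = ToyTable()
t.commit([{"id": 1, "amount": 99.9}])
t.commit([{"id": 1, "amount": 99.9}, {"id": 2, "amount": 10.0}])

print(len(t.snapshot()))   # 2 - latest version
print(len(t.snapshot(0)))  # 1 - "time travel" to the first commit
```

Readers always see a consistent committed version, writers never corrupt an in-flight read, and old versions stay reproducible – the database-like guarantees the Lakehouse brings to file storage.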
You can look at SAP HANA, Data Lake (standalone) at its different levels, which from a technical perspective can basically be seen as little more than one tier – or rather as different services within the offering:
- Data Lake Relational Engine – If your data is purely structured with a rigid schema, it is possibly the best choice, with very good performance based on the Multiplex Server architecture and full ACID support.
- Data Lake Files – What we see is that you need the Relational Engine to query file data in Parquet, ORC or CSV format; the concept is called “SQL on Files“. So a SQL query engine is in place – just the metadata layer to manage the data is missing.
If the HANA Cloud database is connected to the Relational Engine (HANA DB-managed Data Lake) via a remote table (Smart Data Access), you can query Data Lake files directly with SQL on Files.
So, as a conclusion: being a Data Lakehouse is perhaps not the primary use case for HANA Cloud, Data Lake, and it will possibly not fulfil everything Data Lakehouse advocates promise with the new concept. But it already delivers a lot as an integrated architecture that can handle and query data in a very differentiated way at optimised costs. Depending on what a Data Lakehouse should deliver for you, you certainly get some of the benefits.
According to note 3036489 – Deprecation of SAP Vora in SAP Data Intelligence Cloud, SAP HANA Cloud, Data Lake is the recommended successor to Vora for Data Lake integration.
Integration with SAP HANA Cloud, Data Lake is also available for SAP Data Warehouse Cloud, currently on a per-Space basis. But it is a beginning that makes Data Lake concepts available to customers more easily than before.
As described, this is not SAP’s first attempt to gain a foothold in the area of Big Data. To round everything off, it should also be mentioned that SAP purchased a powerful NoSQL technology, OrientDB, a few years ago. OrientDB, however, only came along with the CallidusCloud acquisition and was not a strategic purchase in that sense, which is noticeable in the lack of integration and marketing activities.
HANA is at the centre of SAP’s current data management activities, preferably in the cloud, of course. The major functional innovations show that a lot of development resources are currently being invested in SAP HANA Cloud, Data Lake, as in SAP HANA Cloud itself. However, this also means that the degree of maturity is possibly still limited. This is a relatively new offering from SAP, which will certainly primarily go down the path of supporting existing SAP solutions such as HANA Cloud, Data Warehouse Cloud, Analytics Cloud and Data Intelligence Cloud, and may be restricted in supporting non-SAP workloads as an open platform.
SAP is basically taking the right path of cooperating with the hyperscalers and integrating their offerings where this is helpful. However, this could also mean that SAP cannot offer a better Data Lake than they do. What SAP can do, however, is provide excellent integration and use in a business context, and that is clearly where SAP has its strengths.
Seamless integration of SAP HANA and its analytical capabilities with a relational or even hyperscaler-based Data Lake enables flexible data management across the different layers and thus at least partly bridges the gap to current Data Lakehouse approaches. The right level can be determined for every need, and thus a good cost-performance ratio can be achieved. The management of the various combined technologies still appears to be complex, however, and you first need to find the right skills for your data team in order to make them usable for your organisation.
So is a Data Lake a must, and is the Data Lakehouse possibly the way to go? Yes and no, or as we consultants like to say – it depends. Beyond the technologies, it is still important to check your capabilities and data strategy, your business and IT goals, your rules and processes, and to define the target architecture that makes the most sense for you. There is no single product or concept that is right for everyone.
Dear reader! After reading all I have written about the Data Lake in an SAP world, I would be happy to hear how you handle such Data Lake-like workloads, or whether you already have experience with Data Lakes in an SAP context.