Simple Data Warehouse Design Pattern on SAP HANA P...

klaus_riemer · ‎01-08-2020

Motivation

In my role as technical PreSales with focus on SAP HANA Platform, modeling topics and technical features of SAP HANA I was part of some projects with a goal on new Data Warehouse design concepts that allowed more flexibility and should be simple adapt to any business demand without "yet another data mart". Customers criticized losing control on information content, need of work around, separatism of data scientists with risk of compliance and data protection issues, realtime demands of shop floor production or financial streaming data inbound that cannot be captured and many other demands in a digital economy.

The design pattern I will show to you in this blog did evolve in some recent projects with a demand to be adaptive on changing business drivers no matter if these can be foreseen or not. It should allow a heterogeneous landscape including "old school" batch oriented mainframe systems as well as any sub second ultra low latency data feeds. Finally it will serve classic reporting demands of management and line of business, a solid data foundation for data science applications and even geospatial or graph applications in a single platform. Last but not least controlled access, minimum redundancy and data governance service to avoid compliance issues all in one single platform that can grow as you need it. - This is what I want to share with you.

Designing for Flexibility

When either corporate management, line of business, data scientists or other departments approach the IT with quite different demands a flexible solution is needed to cover these.

In a classic data warehouse scenario – everything is somewhat stable. Defined reporting processes, from source extraction to presentation. In a heterogeneous environment, with varying input consumed by data scientists testing new models and verifying information bringing the challenge in building the solution platform.

Segregation of platforms increases efforts

Some might consider separate platforms for classic reporting and sandboxes for data science or similar use means increased efforts by maintaining multiple feeds, reporting layers, retention, and governance, also operations and maintenance. Finally needed: A process of establishing a complex development process to transform models and modern methods usable in the stable classic reporting layer. Instead – SAP HANA 2 Platform can provide a common environment for all analytic demands.

Multitenancy is a key capability to use common sources and serve disparate requirements.

The advantage of Multitenancy avoids multiple platforms, resources and interfaces can be reused, latency of access can be minimised. If not just data scientists need a separate “sand box” to test der models on real data but other purposes too it is easy to create more of them as well as simply ceasing environments not required anymore. Reusing data across tenants is simple and access controlled. How to setup this functionality is described in the article “cross tenant database access” created by Goran Stoiljkowski.

A layered data models of granularity – allows both classic reporting and data science applications

Serving multiple requirements in reporting, data mining, prediction, series data or other intelligence on data – needs a data model that is both flexible and extensible. Distinguish on access frequency, level of granularity, stability need to be put in data containers based on need. Data temperature functionality supports this requirement by SAP HANA Native Storage Extension or using a relational Data Lake powered by SAP IQ. Access to unstable information – from operational systems can be added by SAP Smart Data Access functionality, ad hoc – on demand. The data model itself in level 0 may be configured as staging area or copy of operational data – to allow analytics to the source. Surrogate keys, data normalization, semantic and aggregation layers e.g. as calculation views can be applied as needed for consummation in the reporting layers and as described above as source also for data science in the separate tenant as sandbox – and avoid data duplication or maintaining duplicate ETL lines. The important factor in this approach is to maintain strict data governance. Data has to be imported from its true source only once and in most granular state available. I ‘ve seen fast growing analytics platforms where data was imported redundant and perhaps already aggregated resulting in reports of dubious content. Always know your data.

Data acquisition simple and extensible

No matter if there is an existing ETL infrastructure or Green Field. The SAP HANA platform based Design fits itself in any environment. Standard interfaces allow simple connection. If interfaces have to be created SAP HANA platform provides a comprehensive set of data provisioning services. The data provisioning should be configured according to volume and transport frequency.

SAP HANA Smart Data Access is a link to original source database table or view represented by a virtual table in HANA. The content is read in an “on demand” pattern using classic SQL queries or even DML if sufficient privileges apply.

Advantage – transactional data can be acquired in the state of “now” and included in any calculation. No data need to be moved or requires permanent space in the SAP HANA platform. It also can be used to take data from an embedded SAP Business Warehouse and include this information in combined short term / long term reporting.

However, as data is taken from original source, there is sufficient performance on original database required. Network latency and amount of records transported will also impact the complete execution time.

SAP HANA Smart Data Integration / Smart Data Quality is a replication based approach, with transformation and mapping capabilities and can be used with a variety of data sources.

Consuming Data…

SAP HANA Platform is an agnostic platform for data consumers. So no matter what consumers will take data, if classic reporting tools are used, data science typical Jupyter Notebook™, SAP Analytics Cloud, web services... Just choose what you prefer.

Development Process

The advantage of a separate sandbox is clearly defined. Full access to existing layers, no data duplication. But when new calculation models or data sources are considered to be fit for use by a larger community - how will this be established? Still there is a need for a development requirement. This means the classic approach of Development System, Quality Testing and move to production cannot be abandoned in this design pattern.

Models / Queries must be refactored and optimised for general purpose use, tables for new sources integrated, feed lines created and need to be verified.

The data model, semantic and calculation layers need to be adjusted, complemented, verified and deployed.

Data consuming software must be aware of new information and services available and new stories and scenarios need to be created.

The advantage of the sandbox if the change request is under development the capabilities of the sandbox model remain available even if limited to data scientists only.

A verification of the change request can be simply by comparing to the sandbox model in both result and performance. Scale up and scale out extensions have minimum impact on this system – as the tenant concept allows simple deployment in another node if required.

make simple - just make it.

To wrap this up - as you see in the diagram above SAP HANA Platform - the key capabilities needed are entirely functional bricks that are already part of the platform, making usage of Multitenancy as important feature to separate needs and still use the same data. If this is the automotive industry or any financial services, start small and grow as your business this design pattern will cope with your demands, no matter if you run it on premise or in SAP HANA Enterprise Cloud.