Taking the Data Lineage Challenge in the New World...

eric_simon · ‎04-27-2016

Why?

Two game changers are motivating organizations to evolve their information architecture and manage their data assets with more agility. Firstly, more and more users need to analyze a larger variety of data: not only familiar ERP data but also IoT data, social media data, weblogs and consumer app data, all of which then has to be combined and curated. Organizations are now unlocking access to all these types of data without requiring users to go through IT team procedures, so this access no longer falls under the IT data governance process. Secondly, as users want data that is customized to their needs, organizations have empowered users to prepare their own data and share it with others. Since the IT team does not govern this data, users are themselves becoming an ungoverned source of data.

Because of these two game changers, it is becoming harder to have a global view of the data, to trust the data or to find the data that users need. It is also more difficult to understand where the data comes from and how to make sense of it. This is setting the ground for what is called data lineage, a challenge that a growing number of our customers are facing (e.g., Shell, Philips, ING, Caterpillar, BASF). Data lineage information describes the origins and the history of data throughout its life cycle, including its metadata and data tags. For example, it can describe where the data comes from, how it is transformed, or how and when it is used. This spans multiple levels of data representation, ranging from physical data like files to analytic views, KPIs and dashboards.

Interestingly, the data lineage challenge is also apparent in big data projects, where large organizations increasingly adopt Hadoop to store all kinds of detailed data (logs, receipts, feeds, etc.). Organizations also use Hadoop as the development infrastructure to transform and combine raw data into data that is cleaner and more exploitable from a business point of view. Data that comes out of these possibly complex data pipelines can be loaded into traditional data and BI infrastructures. However, as already pointed out, it is becoming more difficult to understand, manage and govern the large amounts of data created for big data projects. As one notable example, several industries are well aware that conforming to government regulations and internal data policies is an important part of the process. However, the lack of control over the data that constitutes the foundation level of their traditional data infrastructure complicates auditing and conformance to data management regulations.

Data lineage is nothing new, and SAP already offers several products with data lineage and impact analysis features that help developers and designers manage the creation and maintenance of data models and data transformations. There are, however, two new challenges in the emerging world of Agile EIM. Firstly, users expect a data-driven view of data lineage information that combines a map view of all datasets and their dependencies with interactive data visualization and manipulation capabilities. Users do not only want to see the metadata linkage between datasets but also want to see the data. Secondly, data lineage services go well beyond impact and lineage analysis. For instance, users want to answer questions like: “What is the data available on my BI platforms?”, “What are the data sources used for portfolio/business performance management applications?”, “How did you compute that the margin for soda drinks in Virginia was 1,070 in Q4 2015, and which data items were used?”. These services must be adapted to user personas, their respective jobs and the tools they use.

How?

The EIM organization released a new product in 2015 called SAP HANA ESS (Enterprise Semantic Services) that enables business users to easily find, understand, customize and share information assets. ESS creates a knowledge graph, natively stored in HANA, which describes the semantics of all datasets accessible to a business user within and outside the enterprise. The description of datasets is loaded into ESS through crawlers that automatically and periodically extract metadata from data source systems (connected to HANA through remote sources). In addition, data profiling techniques are used to extract meaningful searchable values and discover business types within crawled datasets. Leveraging the ESS knowledge graph, the ESS team released in 2015 a semantic search service that enables users to ask natural language, keyword-based search queries and retrieve the datasets they need. The dedicated ESS semantic search ranking algorithms build upon HANA full text indexing technology.

SAP BW and SAP Agile Data Preparation currently use the ESS search service. Moving further, the team recently proposed a rich data lineage service that addresses the new challenges of customers moving towards Agile EIM. The approach consists of optimizing the storage of data lineage information in HANA and exploiting the capabilities of the HANA Graph Engine to speed up the execution of the graph traversal queries required by the new data lineage service exposed by HANA. Data lineage information is obtained through ESS metadata crawlers for non-HANA objects, and through HANA activation plug-in extensions for HANA objects. The data lineage service is designed to support new types of data lineage queries such as “cartography” queries and “data checking” queries, which are in high demand among all the customers we encountered because they facilitate data governance and help make sense of data. Prototyping and validation of the proposed data lineage service with targeted customers is still at an early stage.
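
To make the idea of graph traversal queries over lineage information more concrete, here is a minimal sketch in plain Python. It is not the ESS or HANA Graph Engine implementation and does not use any SAP API; the class, method and dataset names are all hypothetical and chosen for illustration only. It assumes lineage is recorded as directed "source feeds target" edges between datasets, and shows the two classic traversals: lineage (walking upstream to find origins) and impact analysis (walking downstream to find consumers).

```python
from collections import defaultdict, deque


class LineageGraph:
    """Toy in-memory lineage graph (hypothetical, for illustration only)."""

    def __init__(self):
        # For each dataset, the datasets it is derived from / feeds into.
        self._upstream = defaultdict(set)
        self._downstream = defaultdict(set)

    def add_dependency(self, source, target):
        """Record that `target` is computed from `source`."""
        self._downstream[source].add(target)
        self._upstream[target].add(source)

    @staticmethod
    def _reach(start, edges):
        """Breadth-first traversal returning every node reachable from `start`."""
        seen = set()
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in edges.get(node, ()):
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        # Exclude the starting dataset itself, even if a cycle leads back to it.
        return seen - {start}

    def lineage(self, dataset):
        """All datasets that `dataset` directly or indirectly depends on."""
        return self._reach(dataset, self._upstream)

    def impact(self, dataset):
        """All datasets directly or indirectly affected by `dataset`."""
        return self._reach(dataset, self._downstream)


# Hypothetical pipeline: raw files feed a cleaned table, a view, then a dashboard.
graph = LineageGraph()
graph.add_dependency("sales_log.csv", "cleaned_sales")
graph.add_dependency("store_master", "cleaned_sales")
graph.add_dependency("cleaned_sales", "margin_view")
graph.add_dependency("margin_view", "q4_dashboard")

print(sorted(graph.lineage("margin_view")))
print(sorted(graph.impact("cleaned_sales")))
```

In this sketch, asking for the lineage of the hypothetical margin_view returns the cleaned table and both raw sources, while the impact of cleaned_sales is the view and the dashboard built on it. A question such as "which data items were used to compute this margin?" is an upstream traversal of this kind, which is why storing the edges in a graph-oriented form and traversing them efficiently matters at enterprise scale.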

Why is it critical for SAP to take this challenge?

In the last four months, several customers (see above) have approached us with a need for a modern data-driven data lineage service. Their requirements are very similar, which makes it possible for SAP to obtain an ROI on the development of a data lineage service in HANA. These customers expect a vision and roadmap from SAP, and they have requested a first milestone addressing their data lineage challenges within 12 months (approximately the timeframe of HANA SP14). Their need for a solution is generally critical.

Given our current EIM portfolio and our strategic investment over the last two years in SAP Agile Data Preparation and SAP HANA ESS, SAP has a unique opportunity to address the data lineage challenges of our customers and deliver an unmatched solution within HANA. This could positively affect other SAP products such as Cloud for Analytics.

Finally, the data lineage approach fits in perfectly with our EIM big data process, whereby flows of data transformations between different data source systems are designed and their execution can be distributed over multiple runtime systems like HANA, Spark for AWS, DCS, etc. A data lineage service would therefore enable SAP to have an impact on the big data projects carried out by our customers, and it would give SAP a distinctive advantage over competitors in the field like Cloudera, Oracle, and Informatica.