Skip to Content
Author's profile photo Niels Weigel

Data Profiling and Data Cleansing – Use Cases and Solutions at SAP

This blog is the starting point for a series of blog articles of a guest lecture at the Hasso-Plattner-Institut für Softwaresystemtechnik in Potsdam, Germany

       

The topic of data quality in general is not brand new. However the maturity level of the organization with regards to data quality or data governance is still  different. While some organizations have set up enterprise wide data governance projects including, managing and tightly integrating People, Processes, Policies & Standards, Metrics and Tools other companies are still in the starting phase of mostly departmental data cleansing activities. So recent researches like this one from Gartner still indicate that the poor data quality is a primary reason for about 40% of all business initiatives failing.

Blog-DPDC_1-PoorDataQuality.png

             

At SAP we are looking at Information Governance as a discipline that oversees the management of your enterprise’s information. By bringing together the right people, processes, policies, and metrics, Information Governance delivers trusted, consistent, and clear information throughout your organization to enable organizations to meet their business goals in various areas like Business Process Efficiency Business Analytics, Merger & Acquisition, Compliance.
Various Information Management Initiative (and especially the combination of these initiatives) are supporting the organizations to achieve their business
goals.

Blog-DPDC_1-BusinessGoals_and_Initiatives.png

Data Profiling and Data Cleansing are two essential building blocks or components of these Information Management initiatives and before we get closer into the details of the use cases for these components here are some basic definition from Wikipedia:

Data Profiling (http://en.wikipedia.org/wiki/Data_profiling)

“Data profiling is the process of examining the data available in an existing data source (e.g. a database or a file) and collecting statistics and information about that data”

   

Data Validation (http://en.wikipedia.org/wiki/Data_validation)

“In computer science, data validation is the process of ensuring that a program operates on clean, correct and useful data.”

   

Data Quality Assessment (http://en.wikipedia.org/wiki/Data_quality_assessment)

“Data quality assessment is the process of exposing technical and business data issues in order to plan data cleansing and data enrichment strategies.”

   

Data Cleansing (http://en.wikipedia.org/wiki/Data_cleansing)

“Data cleansing, data cleaning or data scrubbing is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database.”

After this high-level definition, let’s take a look into specific use cases where especially the Data Profiling capabilities are supporting the end users (either
technical IT or business users) in their day-by-day work and enabling them to get better understanding or insight into the data they are using.

  

Let’s start with a classical ETL / Data Warehouse use case, where an Data Architect or Data Warehouse designer is working on the data provisioning side of the data warehouse and is going to integrate data from multiple data sources. For the ETL job designer it is essential that he can not only get the information what tables, views and technical objects are available at the databases or data sources, but also to get an understanding what kind of content is stored within the tables. A label or headline like Color is only an indicator what might be stored within the column. Data Profiling enables users to get a quick insight e.g. by using the frequency distribution to see how many different attributes exist and what are the most frequent attributes. Based on
data profiling results the ETL job designer can then also identify if different, inconsistent representations of same information exists. Based on that finding he can already set up his mapping tables within the data flow to standardize and unify the field content.

Blog-DPDC_1-ETL_DataWarehouse.png

            

Another typical use case is Data Modeling.  Typically a data or enterprise architect has the challenge to create and define new data models including logical record representations or specific lookup or reference tables to be used within the enterprise system landscape or for business intelligence or reporting solutions. He needs to find and join data from different database tables or even different databases. To do so, he needs to understand and know where the
required information is stored within the enterprise’s system landscape, what tables to join the data from and especially needs to get the right level of transparency what kind of data is really stored within the tables or database.  Again the statistical data profiling is giving great first insight into content by providing minimum and maximum information, length statistics, frequency distribution but also allowing uniqueness, dependency or redundancy profiling.

Blog-DPDC_1-DataModeling.png

Customers who are starting their Data Cleansing or Data Quality Management activities are in many cases already aware of some high-impact data quality issues that they want to resolve. It might be they do not have the exact number of issues or problems they have, but they know that they “have to do something”. However with the usage of automated Data Profiling functionalities on the large amount of data they have stored, they are able to get the first initial insight into data content for their cleansing activities by quickly identifying outliner or inconsistent data formats or representations.

Blog-DPDC_1-DataQualityManagement.png

Applying data discovery or data profiling methods to legacy data sources before their data is to be moved into a new SAP ERP or CRM system is one of the very common activities in the Use Case of Data Migration. Be it the challenge of moving data just from one single source into the new system or even migrating and consolidating data from several systems into one single target, it is anyway important to understand how the data is stored in the legacy system(s), what content or data domains are defined there, how long is the content in different systems defined (and populated), are there different representations for same kind of information like textual or numerical color codes. Data profiling provides an understanding whether your source data is fit for use in your new system.

Blog-DPDC_1-DataMigration.png

You can find more information on classical IM Use Cases and the related SAP solutions within the EIM Use Case Wiki (http://wiki.sdn.sap.com/wiki/display/EIM/Enterprise+Information+Management+Use+Case+Wiki) here on the SAP Community Network.

       

Stay tuned for next articles one the usage of SAP Information Steward and SAP Data Services to support your Data Quality Management activities following this step-by-step approach from initial data asset discovery to ongoing improvement and monitoring of the current data quality for any data, in any application.

Blog-DPDC_1-DataQualityManagementCircle.png

Related links including detailed Product Tutorials:

http://scn.sap.com/community/information-steward

http://scn.sap.com/community/data-services

Assigned Tags

      8 Comments
      You must be Logged on to comment or reply to a post.
      Author's profile photo Ina Felsheim
      Ina Felsheim

      Great job of summarizing the core use cases of EIM, including how Information Governance applies.

      Just a note that the artifacts from all of these use cases (policies, standards, validations, review/approve and ownership assignments, etc.) can be re-used as you tackle the next use case. That is a key way to get better, faster, and smarter. So make sure you are choosing technologies that fit multiple use cases, or you'll entirely lose this re-use benefit.

      Author's profile photo Former Member
      Former Member

      Simply Brilliant 😎

      Many many thanks for this sharing.Please share more on this

      Regards,

      Manjush Jose

      Author's profile photo Former Member
      Former Member

      Good to read

      Author's profile photo Ramesh Murugan
      Ramesh Murugan

      Very much understandable document.

      Author's profile photo Henk Binnendijk
      Henk Binnendijk

      It would help to mention the SAP products involved in DQ activities....

      Author's profile photo Ina Felsheim
      Ina Felsheim

      Thanks, Henk. The Data Quality products are these:

      SAP Data Services (Data Quality is embedded)

      SAP Data Quality Management for SAP (using the same DQ capabilities in Data Services)

      SAP Information Steward for the profiling, assessment, and scorecarding of data quality.

      Author's profile photo Former Member
      Former Member

      Good documentation.

      Author's profile photo Venugopalrao Kavuru
      Venugopalrao Kavuru

      Nice Overview

      Regards,

      Venu.K