This Blog Post will give the overview on Data Profiling technique within SAP Information Steward data quality tool. I will start with overview in this post and will explain the commonly used data profiling technique in SAP Information Steward which is column profiling.
This article will guide you through step by step procedure and will give you the complete idea on usage of column profiling.
So let’s start by understanding what is Data Profiling?
What is Data Profiling?
- It is the process of examining the data available from an existing information source (SAP, Database, File) and collecting statistics or informative summaries about that data
- Use profiling to examine data so you can understand its content, structure, and data quality dependencies.
Benefits of Data Profiling :
Types of Data Profiling :
Column Profiling – Determines the values and characteristics of data elements such as :
•Value, string length, and completeness
•Number of distinct values or patterns in a column
•Distribution of distinct words in a column
Dependency Profiling – Identifies attribute-level relationships in the data by finding the values in one or more dependent columns that rely on the value in a primary column.
Address Profiling – Determines the quality of addresses by determining whether the address is:
Redundancy Profiling – Determines the degree of overlapping data values or duplications between two sets of columns.
Uniqueness Profiling – Returns the percentage and counts of rows that contain unique data for the set of selected columns.
Content Type Profiling – Returns the information about the type of data exist in the columns of table.
Uniqueness vs Redundancy
Uniqueness profiling talks about finding duplicate values and checking uniqueness within the column/s of same table Whereas, Redundancy profiling talks about finding overlapping between the pair of columns of two different tables.
Above are the different types of profiling techniques available with in SAP Information Steward.
Now begin with explaining them in detail, I will start with column profiling.
Consider the below data set as an example to explain the column profiling-
Here are some key points to remember when you are performing column profiling in SAP Information Steward :
- Column profiling as the name implies, it helps the user in understanding the Data stored within the columns in table’s/view’s.
- This profiling feature help in determining Values, String lengths, Completeness and Distribution across the columns.
- Column profiling can be performed on table/view
How to perform the column profiling?
To perform the column profiling on Table/View:
- Just select the view/table and hit on Column profiling from the profiling options in workspace section of SAP Information Steward. Window shown in screenshot will get pop up. (See the screenshot)
- Check the boxes w/ types of column profile to perform and leave other values to default
- Hit Save and Run Now button to execute the column profiling
Important values to keep in mind :
Input Sampling Rate– How you want the records chosen. For example, if you chose a Max input size of 1,000 records and you enter a rate of 1, then the first 1000 records will be profiled. If you enter a rate of 2, then every second record of the total records in the table, up to 1000 records, will be profiled, and so on.
Filter Condition-: You can add filter condition while creating task also using filter condition option.
Distribution Option : If you have requirements for checking the distribution of data or if you want to see how the data is spread across the primary columns, you can use the median and distribution and word distribution options.
Note: this option should only be used for the columns for which you want to see distribution results as this option takes lot of time to give you the results when you have huge data.
You can check the task status in Task section of SAP Information Steward, once task is complete results can be viewed in workspace section.
Reading the results generated from column profiling-
Column Profiling gives us the statistical analysis of data which can be very useful in understanding the data which exists in the system. Below are the explanation of each result.
Output/Results of Column Profiling
Value (Analysis of values populated in the columns)
Min- Minimum value for that column
Max- Maximum value for that column
Average- Average value for that column(all records)
Median – Centre value of that column
String Length (Analysis of lengths of values populated in the column)
Here also it gives us the min, max, average and median of lengths of values available in column.
Completeness (Provides analysis around what percentage of data is populated within the columns)
Null %- Percentage of Null values present in the column
Blank %- Percentage of Blank values present in the column
Zero %-Percentage of Zero(0) values present in the column
Distribution (Analysis of how data is spread across columns)
Value- Shows distribution of records across the column
Pattern- Shows different data patterns followed across the column
Word- Tells different usage of words across the columns
Here I complete the detailed explanation of first type of data profiling which is column profiling in SAP Information Steward. I will be covering all other types in my next posts, so be connected.
Please do provide your valuable feedback on this post in comments section, this will help me in improving my content and share more knowledge with this community.
Thanks and Happy learning!