Building a data quality pipeline using SAP Information Steward
Data quality is relevant for all applications especially it can define the success of data migration or data reports above configuration above several aspects. It can be handled in several ways with several tools.
This article will explain with a few examples what are the main data quality issues and it will sketch a rough architecture of an SAP-based solution that identifies data quality issues using SAP Information Steward. Usually, Information Steward is used in combination with SAP Data Services. SAP Data Services can use the information provided by SAP Information Steward to perform mitigating measures on the database itself. This would be a second part article to this content.
The main data quality issues
First of all, the folk wisdom of Data Quality divides issues into seven dimensions, and we will follow that structure (according to this Source/Inspiration):
- Completeness: it captures instances where there are missing data and on which columns it’s incomplete;
- Accuracy: it identifies how well the data you collected reflects the actual occurrence. One example would be a date for a past event recorded as 01-01-9999;
- Consistency: how consistent is a specific piece of information across your IT landscape? For instance, if the marital status of a customer is consistent in two databases;
- Uniqueness: it detects duplicate entries. For instance, when there are two records for the same customer;
- Integrity: it ensures that data isn’t missing essential relationship linkages. One example can be if data is stored in a shorter form in different systems, making it harder to establish a direct connection across systems.
- Timeliness: it captures instances where data isn’t sufficiently up to date for some task. One example is data from the previous business day is loaded at noon of the following day, and we have a report is being executed at 10am;
- Validity: it identifies data that does not necessarily obey the rules we expect. For instance, a telephone number should not contain letters;
Analyzing SAP objects in SAP Information Steward
So, in this article, we are using SAP Information Steward to profile and assess the data according to the seven dimensions.
The first tab is the workspace, where you can run rules and profile the data.
The following highlighted tab is the rule tab, where Rules are created, published, and approved.
The next highlight is the Scorecard setup tab, where scorecards are created, using rules and applying them to tables.
The final tab is the Tasks tab, where you can view and rerun past tasks, such as profiles and rules.
To import the data, click the “Add” button in the menu.
Select the connection and in the following screen, expand the folder where the tables are located. Next, select the tables to import and add to the project.
Once imported, you can locate the tables in the workspace.
You can then create views and apply joining conditions and additional filters or transformations to the data. Finally, let’s advance to the actual data quality work.
There are six different profiles available.
The column profiling is considered basic and can be seen within this view tab. All others have the results in the advanced tab.
When you run a profile or a rule, the tool will tell you to create a task in a popup window. Save and run.
The result of a column profile can be seen below, expanding the table to see the columns.
It will give some information on the columns data distribution, for instance, in the highlighted area, we can see it states the low cardinality of some fields, and the material field is unique (potential key field). Then, the advisor will suggest some primary rules to clean the data (identifying the nature of the field by the header). These rules can either be accepted or declined. The other columns will display some statistics about the data, extreme values,…
The data content profile will identify the data type of the field and will be useful when there are generic data types, such as telephone numbers or addresses. These common types have standard SAP cleansing advisors, so dealing with these columns will be less time-consuming.
Other types of profiling can be run and seen in the Advanced view of the workspace.
Namely, to see if there are any duplicates records, select the Uniqueness profile, and it will display the number of duplicate records and the degree of repetition.
To see redundant data – data duplicated across tables – run the redundancy profile. It will compare selected columns and will present a Venn diagram with the two datasets. Each of the subsets can be viewed upon clicking.
To see the dependency between two or more fields, you select the dependency profile. It will present the N:M relation occurring between those fields. For instance, if a value for one field has three values mapped to it.
The Rule tab will present the rules divided by the data quality domain. It will also show the score of the rules that were already run.
To create a rule, click “New”.
Enter and name ($<name>)some of the parameters that will later be bound to specific table columns.
Also, define the expressions associated with those parameters. For instance, $parameter IS NOT NULL. Any record that has a null on the column the rule is reading will fail the rule.
Also, one must specify the rule approver and add a Quality dimension to the rule (described briefly at the beginning of the article).
It is important to be coherent when classifying rules by the quality dimension. This will be helpful because on scorecards, the quality dimensions will be analyzed separately, and results will have better quality.
In addition, the advanced rule editor has some added features, which enable more complex rules. All of the functions available are described and have syntax examples.
Also, one can quantify the cost per failed record. This is done in Financial Impact per failure.
You can categorize the type of cost accordingly and add a dollar value to each instance where it was identified. This will help Business estimate the impact of data quality and raise red flags where needed.
After this step is done, rules have to be bound.
After that, the parameters for the rules will have to be mapped to a particular field of the table.
Also, the quality threshold is customizable on a 10 point scale, depending on the specific requirements of the dataset. Also, notice that the same rule can be bound to several tables.
The next step is the creation of scorecards.
Each of the scorecards has a Key Data domain. Name accordingly to the tables you want to classify.
The next step is to add a quality dimension. If more than one is added, you can weigh them according to their relevance for the data domain.
For each of the quality dimensions, rules can be entered in the data domain. Again, for each of the rules belonging to the same quality dimension, a specific weight can be attributed.
The final step is to select, for each of the rules entered, the tables they are bound to. And the weight of these bindings is also customizable.
Once the scorecard is prepared, the rules are run in the workspace tab, and the scorecard gets populated.
Selecting on view or table and then clicking on the Calculate Score button will present the trend of the rule score – once it’s executed more than once – and a sample of the failed data, which can also be filtered or exported. Notice that the score’s color is based on the thresholds defined earlier. If the table is expanded, you can see a detailed view of the score of each field.
Bear in mind that the entire subset of data should be exported to tables in the database assigned in the Central Management Console.
To view the scorecard, change the Workspace tab to the Scorecard tab.
There the trend for Quality Domain can be seen dependent on the rule you wish to see. It’s important to distinguish how to Show the score. ‘As of Now’ will present manual rule execution; ‘Last Run’ will show the execution of the rule as a scheduled CMC scorecard run.
As in the Rule view, we can see the failed data but in the scorecard view, we can see the business value analysis.
This feature can be seen on the screen below:
Once clicked, it will present information about the object, synonyms, and keywords.
Let me know if this topic is interesting to you in the comments if you want to see how Data Services would mitigate the information extracted by Information Steward or if you have specific feedback about this topic.