SAP BTP Data & Analytics Showcase – How to improve and monitor data quality using SAP Data Intelligence
This is the 2nd of 5 blog posts which are part of the “SAP BTP Data & Analytics Showcase” series of blog posts. We recommend you to look into our overall blog post to gain a better understanding of the end-to-end scenario and context which involve multiple SAP HANA Database & Analytics components in the cloud.
In this blog post, we are going to demonstrate how to improve and monitor data quality with the help of data quality rules in SAP Data Intelligence. Through the user interfaces in SAP Data Intelligence, business users without technical backgrounds can manage and monitor quality of datasets. We highlight the data quality related activities in the data preparation phase, as such high-quality datasets help build confidence in data modelling and generate accurate business insights.
Following the upcoming sessions in this blog post, you will experience how SAP Data Intelligence could help improve and monitor your data quality in an end-to-end process. You can easily follow this process in five steps:
- Create basic/advanced quality rules under different categories
- Manage quality rules and bind rules to datasets
- Evaluate data quality for selected datasets
- Utilise self-service data preparation for correction
- Monitor and track overall quality status
End-to-end process to improve and monitor data quality using SAP Data Intelligence
End-to-end demo video related to this blog
1. Create basic/advanced quality rules
In this session, we are going to show is how to define basic and advanced quality rules targeting multiple quality issues. Let’s go to the below user interface “Define and categorize rules” under Metadata Explorer together.
Scenario 1: Define basic quality rules
We will demonstrate how to define a basic quality rule, which helps solve the data quality issue: inaccurate post code in Germany – entries with empty post code or not in 5 digits. Let’s look into the below rule, which was created under category “Accuracy” with the corresponding conditions.
Basic rule for inaccurate post codes under category Accuracy
In our demo, we have created another two basic rules under category “Completeness”, which detect missing gasoline brand and house number entries in station master data. We will not show the similar definition here.
Basic rules for incomplete brand and house number under category Completeness
Scenario 2: Define advanced quality rules
In the second scenario, we are going to demonstrate how advanced quality rules could identify duplicated records – stations with the same gasoline brand and geographical information (latitude and longitude). The condition script provided by advanced quality rules has more operators, which enable you to develop more complex logic for rules. You could learn how to use Rule Script Language writing condition scripts from these templates.
Let’s dive into this advanced rule, which was defined under category “Uniqueness”. The newly-added operator “is_unique“ is applied in this case.
Scenario 3: Test quality rules
After creating quality rules, you could use the feature “Test Rule” to check if the new-created rules meet your expectation. Here we checked our advanced quality rules for duplicated station entries.
2. Manage quality rules in Rulebook
In this session, we will illustrate how to import quality rules and bind quality rules with datasets you would like to evaluate via Rulebook. Let’s go to Metadata Explorer and select “Create Rulebook” feature there. In our case, we have created a Rulebook called “StationMasterDataRulebook”, as shown below.
Scenario 1: Import quality rules into Rulebook
Firstly, we will reference basic rules and advanced rules that we created in the first part to the Rulebook. You could easily do it with the following three steps.
Scenario 2: Bind quality rules with datasets
Secondly, we can choose the datasets whose quality we’d like to evaluate. This procedure can be completed in three steps: click “arrows” button – select dataset with F4 help – map attributes.
After successful setups, you will see the following binding information in your Rulebook. In our case, we have chosen the imported CSV file in DI internal data lake that stores the station master data to evaluate.
3.Evaluate data quality for selected datasets
Now we would like to show how to utilise the imported rules to evaluate our selected database table and gain a first feeling about the quality status.
Scenario 1: Configure threshold for evaluation results
Firstly, let’s configure the thresholds for maximal failed records and minimal passing records. This step could be easily done in Rulebook as follows.
Scenario 2: Run all rules and view results in Rulebook
Now, we have completed all the setups in Rulebook. Let’s run the evaluation with imported rules and check the results.
Recently, there is a new feature released by SAP Data Intelligence which enables users save all the failed records in a database table in SAP HANA Cloud (only connection type “HANA_DB” works here). Using this feature, you could easily adapt failed records and manage data quality.
Scenario 3: Deep dive into evaluation results in Rulebook
After clicking the “View Results” button, you’ll enter into the following analytical page where you are able to gain better statistical insights of quality status and identified quality issues of your datasets. Let’s look into the results together.
- Overall quality status: Warning (65.39% of 15444 station master entries passes all the four rules)
- Evaluated dataset: Station master data from CSV file in DI internal data lake
- Detail quality information related to an individual rule “Incomplete gasoline brand entries”: 544 (3.52%) stations have empty gasoline brand
- Sample data violated the rule: 5 records are sampled to show the quality issue
4. Utilise self-service data preparation for correction
SAP Data Intelligence offers powerful self-service data preparation to help business users and data scientists to correct detected quality issues (from the evaluation results shown in Part 3). What needs to be specially mentioned is that the data preparation related actions could be saved in a recipe. You could access the data preparation functionality easily via Metadata Explorer. The document published in SAP Help Portal could provide you more product information and user guidelines about this great feature. In our demo video, two examples for data preparation part is presented.
Scenario 1: Implement data preparation actions and store corrected data
In our case, we developed a data preparation template including all the relevant actions to correct and standardise the CSV file “stations_data_quality” and created a new CSV file called “stations_master_improved” to store the corrected data. Later, we can import this new-prepared CSV file into HANA Cloud database using data pipelines.
Scenario 2: Create rule bindings to new-created datasets
We have fixed the detected data quality issues in the station master data. How is the overall data quality status now? Let’s check the evaluation results again. For this purpose, we need to bind our four quality rules with the new-created CSV file (stations_master_improved) via Rulebook, as shown below.
5. Monitor and track data quality
Now, let’s click the “Run All” button and see the reevaluation results together! You could see that overall data quality status is improved from 65.4% to 85.42%, changing from “Warning (Orange)” status to “Healthy (Green)” status. Furthermore, you could also build your own Rule Dashboard and keep tracking the data quality of your datasets. For further information regarding Rule Dashboard, please reference this document.
Congratulations! You have finished this session managing data quality in the phase of data preparation. We hope you are able to improve and monitor your data quality through such simple five steps. Moreover, we’d like to convey our message that high-quality data could help you generate accurate business insights and accelerate your end-to-end process. As next step, you could create data pipelines via Modeler importing this high-quality dataset into your HANA Cloud database and start your data modelling activities!
We highly appreciate for your feedback and comments! Enjoy!
Excellent - thank you so much!!
Thanks for the clear explanation Wei Han,
I have also a question about the data quality results:
1- Do you know where data quality results are stored? I need this because, client wants to show DQ results in a different system. So I need to find where the rulebook results are. Also, in the rulebook processing Options never showed up.
First off - sorry for my late reply. I reached out to the product manager of Data Intelligence. Currently, you can configure a HANA database connection and export the failed records in a HANA table. Please kindly check if your Data Intelligence tenant supports this new feature.
Thank you for an informative blog.
I am trying to create the condition script in advance rules but can not find the newly added operator 'is_unique'?
I am using SAP DI 3.1 trial version.
Could you please suggest?
First off - sorry for my delay.
I checked with my colleague from SAP Data Intelligence Cloud. Yes, you're right. The operator "is_unique" is not available in the version (DI 3.1 trial) you mentioned. I'll give updates here.