Using Data Cleansing Advisor (DCA) to Estimate Match Review Tasks
Data Cleansing Advisor has the tools available to empower Data Stewards or data domain experts to get quick insight into the number of match groups that may need to be reviewed for data migration or data consolidation projects. Data migration projects effectively map the data within a legacy system to a new target system. This is also a great time to get insight into the data you are working with before it is loaded into the new system so that you can determine the overall quality of the data. This section will focus on the tools available within Data Cleansing Advisor to give you detailed insight into the number of duplicates found and the process of reviewing the duplicate records. A previous section entitled, “Determining Duplicates and a Matching Strategy” can be reviewed in case you
want to fine tune the match strategy and rules to change the match results.
Match Statistics and Graphs
When opening a data cleansing solution that has been created by a Data Steward, the first page displays the impact that the cleansing and match rules had on the input data set. The image below displays a summary of the impact the match rules had on the input data set and simply displays the number of matches and the number of unique records. The demo sample set of data has over 25% duplicate records. The number of duplicate records isn’t generally the best indicator of whether or not the match groups are valid or not, but the context is important. This is where the available tools within Data Cleansing Advisor come into focus.
Viewing Match Result Statistics
Drilling down further into the match results will display a graph and associated statistics for the number of matching records and the number of unique records. The matching records are then displayed in three different categories: High, Medium and Low. The high confidence matches are generally matches in which we have a high level of confidence that the data used for the criteria to determine a duplicate is the same between all records within the match group. Medium and low confidence matches are match groups that should be manually reviewed. This is generally a good indicator as to the number of match groups that will need to be manually reviewed.
A filter (Matching records -> Medium) can be easily created by navigating through the graph to immediately retrieve only the results that are important to you.
Creating a filter to return the medium confidence matches will return 63 match groups.
Reviewing the Match Results
Below is an image that shows the UI of the actual match results once you select to view the data. There are a few important tools to explain within the UI that will help you with the match review process.
The output data is separated by match groups and will contain both associated matching records and near matching records (indicated below in grey, italicized font). Near matches do not officially belong to the match group but are records that are just below the threshold to qualify as a match. Fine tuning the match rules may make these records officially part of the match group. The filter(All Records (2676) -> Matching Records (606) -> Medium (63) can also be changed at any time to return a different results set (including unique records). Also available to you would be full adjustment of where the columns are displayed and the actual data columns that get displayed as well.
Show near matches
This checkbox can be toggled on or off in order to display or remove near matches from the output data respectively. Unchecking this option will return the same results but without the near matches and shown below.
Triangle (red) indicator
Match groups that should be manually reviewed will always have a red triangle indicator in the upper-left hand corner of the “Group ID” column. This indicator will be displayed if at least one of the following is true: the match group has a near match associated with it, the match group has an indirect match or the match group has a questionable match. A match group can have any combination of the above reasons and the text explaining each can be found by hovering over the indicator.
The UI can be fully customized to display (or remove) any columns of data. We’ll focus on the “Match” output columns for consistency with the content of this section, but you can add data for person entities such as first name, last name, gender and so on. A sample set of columns that can be added is displayed in the picture below.
Each column can give you additional insight into why the match group was flagged to be reviewed and the description of each can be found in the Information Steward User’s Guide. For discussion purposes, I’ve added Conflict Group, Review Group, Conflict Record and Review Record.
By customizing the UI (adding the columns and sliding the “Other” column near the “Match” columns, I now can start gaining insight as to why these records are marked for review. The “Conflict Group” column will have the value of “C” for any match group that has an indirect match associated with it. This column should be paired with the “Conflict Record” column to easily identify the record in conflict. The “Review Group” column will have the value of “R” for any match group that has a questionable match associated with it. This column should be paired with the “Review Record” column to easily identify the record in question. Being able to identify the exact records to leverage during the review process allows you to easily determine what to review and why.
The compare records button allows you to view the data of each record side-by-side to see why they are considered to be a match. This can be done by selecting any two records within the UI and they do not need to be within the same match group. Using the results from the previous image, I can see that group ID has 6 records consisting of data with the context of 2 separate individuals. Record 302 says that the review record is 134, so I’m going to select both and compare the results.
We now see the results of each record side-by-side and determine that the records are matching using the “Family and Street Address 1” match rules. Now knowing that a “Family” match theme was used, I don’t need to justify why 2 separate individuals were within a single match group. Drilling-down further into the results will show us even more proof as to why these records should be a valid match.
We see from the compare results that they have the same address (and even further down the same phone number). These records have been reviewed, remembering that we’re trying to find people that are part of a family using person names, addresses, phone numbers and email addresses to determine duplicates.
The process of reviewing match results during a data migration project can be difficult, but the insight you get from using the tools within Data Cleansing Advisor can speed up the process and ensure that you are ensuring a level of quality data that’s needed.
Data Cleansing Advisor Best Practices Blog Series
Determining Duplicates and a Matching Strategy
Publishing to Data Services Designer
Configuring Best Record Using Data Services Designer
Match Review with Data Cleansing Advisor (DCA)
Data Quality Assessment for Party Data
Using Data Cleansing Advisor (DCA) to Estimate Match Review Tasks
Creating a Data Cleansing Solution for Multiple Sources