Determining Duplicates and a Matching Strategy

Information Steward’s Data Cleansing Advisor simplifies the data cleansing process by recommending cleansing and matching rules based on SAP best practices. The recommended matching strategy can also be customized to meet your business requirements and to refine how relationships are found within the input source.  This section focuses on how to customize the recommended match strategy to get different results from the same input source.

The image below shows a test data source in Information Steward that has common party data type entities.  The source has gone through content type identification (a new feature in Information Steward 4.2) and the content type of each column has been identified.  The input file clearly contains an address entity, a person entity and other attributes of person data (email, phone, etc.).  We could identify relationships (matches) using just the address fields or just the person fields, but what would the results be if we used both address and person?

/wp-content/uploads/2013/12/dd1_348280.png

Data Cleansing Advisor can recommend several strategies for determining duplicates within an input source.  The following are supported: Individual, Corporate, Individual and Corporate, Family, ID Only, and Other.  A brief synopsis of each strategy follows:

  • Individual: searches for matching records based on personal name data
  • Corporate: searches for matching records based on organizational name data
  • Individual and Corporate: searches for matching records based on both personal and organizational name data
  • Family: searches for matching records based on last name data
  • ID Only: searches for matching records based on an identification or SSN column that you specify
  • Other: searches for matching records based on very specific criteria (address only, phone only, email only or another specified column)
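
One way to picture these themes is as a choice of which columns feed the comparison. The sketch below is a deliberate simplification in Python, not how Data Cleansing Advisor is implemented: the column names (`first_name`, `last_name`, `firm`, `id_number`), the `THEME_FIELDS` table and the `match_key` helper are all hypothetical, and real match rules compare values fuzzily rather than by exact key.

```python
# Hypothetical mapping from match theme to the columns that drive the comparison.
# Column names are illustrative and do not come from Information Steward.
THEME_FIELDS = {
    "Individual": ["first_name", "last_name"],
    "Corporate": ["firm"],
    "Individual and Corporate": ["first_name", "last_name", "firm"],
    "Family": ["last_name"],
    "ID Only": ["id_number"],
}


def match_key(record, theme, other_fields=None):
    """Build a normalized comparison key for a record under a given theme.

    For the "Other" theme the caller names the columns (for example
    ["address"] or ["phone"]) through other_fields.
    """
    if theme == "Other":
        if not other_fields:
            raise ValueError("the 'Other' theme needs explicit columns")
        fields = other_fields
    else:
        fields = THEME_FIELDS[theme]
    # Lowercase and trim so trivial formatting differences do not matter.
    return tuple(str(record.get(f, "")).strip().lower() for f in fields)


# Invented sample record, for illustration only.
record = {
    "first_name": "Ann",
    "last_name": "Example",
    "firm": "Example Corp",
    "address": "1 Sample St",
}
family_key = match_key(record, "Family")
address_key = match_key(record, "Other", other_fields=["address"])
```

Two records that produce the same key under a theme would be candidates for the same match group under that theme, which is why the choice of theme changes the results so much.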

A data cleansing solution was created to determine duplicates using an address-only match theme.  The results are as follows:

/wp-content/uploads/2013/12/dd2_348293.png

Data Cleansing Advisor determined that over 25% (677) of the records in the input source are duplicates.  You do not need technical knowledge of how to configure a Data Services Match transform; you just need to know your business requirements, and Data Cleansing Advisor will create the match rules for you.  At this point, Data Cleansing Advisor gives you a few options for reviewing and fine-tuning the results.  The first is a chart that lets you drill down into the results and create filters to view the data that is most important to you.  The image below shows a filter being created to view the matching records found with the address-only match theme.  The results are further divided by match confidence (high, medium or low).  High confidence matches are close to exact matches and may not need to be reviewed.  Medium and low confidence matches are considered suspect, and you may want to review these record groups.

/wp-content/uploads/2013/12/dd3_348294.png
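
The confidence bands can be thought of as cutoffs on a match score. The following Python sketch assumes a 0-100 score and invented cutoffs of 95 and 85; the actual scoring and band boundaries used by Data Cleansing Advisor are not given in this post, so treat every number and name here as hypothetical.

```python
def confidence_band(score, high=95, medium=85):
    """Map a 0-100 match score to a band; the cutoffs are assumed values."""
    if score >= high:
        return "high"
    if score >= medium:
        return "medium"
    return "low"


def groups_to_review(group_scores):
    """Return the IDs of match groups in the medium or low band, sorted."""
    return sorted(
        group_id
        for group_id, score in group_scores.items()
        if confidence_band(score) != "high"
    )


# Invented group IDs and scores, for illustration only.
group_scores = {101: 99, 102: 90, 103: 82}
review_queue = groups_to_review(group_scores)
```

Filtering down to the medium and low bands mirrors the idea above: high confidence groups can usually be accepted, while the rest form the review queue.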

Viewing the data immediately displays the match results for the filter that we just defined (all matching records).  Group ID 238 is a single match group at the address 444 Highland Dr that contains several different people.  These are not the results that were expected.  Data Cleansing Advisor allows you to fully customize the match theme (Other, Individual, etc.), the match rules used within a theme, the threshold for how close a match needs to be, and certain advanced options (such as allowing initials to match a person’s first name).

/wp-content/uploads/2013/12/dd4_348295.png

Fine-tuning the match results is done in the same user interface used to review the data.  Selecting “Change match theme” displays the available themes.  Because we want to differentiate people who have the same address, we need to select the “Individual” match theme.  The rules for the theme are displayed below it, showing that duplicates will also be found using phone, address or email.  These match rules can be selected or de-selected depending on how much you want to customize the solution.

/wp-content/uploads/2013/12/dd5_348296.png
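
To see why the theme matters, consider a toy version of the problem in Python. This is only a sketch under stated assumptions: it groups on exact, normalized values rather than the fuzzy comparison a real match rule performs, the `group_records` helper and field names are hypothetical, and the person names are invented (only the street address is borrowed from the example above).

```python
from collections import defaultdict


def group_records(records, key_fields):
    """Group record IDs that share the same normalized values in key_fields.

    Only groups with more than one record are returned, since a group of
    one is simply a unique record.
    """
    groups = defaultdict(list)
    for rec in records:
        key = tuple(rec[f].strip().lower() for f in key_fields)
        groups[key].append(rec["id"])
    return [ids for ids in groups.values() if len(ids) > 1]


# Invented sample records sharing one address.
records = [
    {"id": 1, "first": "Ann", "last": "Sample", "address": "444 Highland Dr"},
    {"id": 2, "first": "Ann", "last": "Sample", "address": "444 Highland Dr"},
    {"id": 3, "first": "Bob", "last": "Tester", "address": "444 Highland Dr"},
    {"id": 4, "first": "Bob", "last": "Tester", "address": "444 Highland Dr"},
]

# Address only: everyone at the address lands in one large group.
address_only = group_records(records, ["address"])

# Person plus address: the same records split into one group per person.
individual = group_records(records, ["first", "last", "address"])
```

With the address as the only key, all four records collapse into a single group; adding the person fields separates them into two groups of two.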

Once these changes are applied, a what-if analysis (preview) will be displayed showing you the exact impact of the changes made.  Modifying the match strategy is as simple as knowing how you want to define relationships and having Data Cleansing Advisor create the match rules for you.

/wp-content/uploads/2013/12/dd6_348297.png

The results show that we have ~140 fewer matches, meaning we have more unique records.  Previewing the results and filtering on ‘Kohler’ shows that there are now 3 match groups instead of the 1 large match group that we had with the address-only match theme.

/wp-content/uploads/2013/12/dd7_348298.png

The selected match theme will generally have the greatest influence on the results when you are trying to determine a strategy to use.  Now that we’ve changed the strategy to find matches using person data, we can dive deeper into other changes that also affect the result set.  The image above shows that group ID 192 is a single match group with 3 records.  Each record has the same address, but the name is slightly different.  Data Cleansing Advisor allows you to further fine-tune the match rules based on person, address or firm to get the results that you want.  The image below shows a highlighted checkbox that has been de-selected so that first names no longer match with initials.  Applying this change again immediately shows the what-if analysis and a preview of the records that were impacted.  Any combination of match option changes can be made at once, but to fully understand the impact of each change it is recommended to make one change at a time.

/wp-content/uploads/2013/12/dd8_348302.png
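
The effect of that checkbox can be illustrated with a small Python sketch. The `first_names_match` function and its `allow_initials` flag are hypothetical stand-ins for the option, not the product’s logic; the sketch only handles a single initial with an optional trailing period, and the names are invented.

```python
def first_names_match(a, b, allow_initials=True):
    """Compare two first names, optionally letting a single initial match.

    Names are trimmed, lowercased and stripped of a trailing period, so
    "P." is treated as the initial "p".
    """
    a = a.strip().rstrip(".").lower()
    b = b.strip().rstrip(".").lower()
    if not a or not b:
        return False
    if a == b:
        return True
    # With the option on, a one-letter name matches any name that starts
    # with the same letter.
    if allow_initials and (len(a) == 1 or len(b) == 1):
        return a[0] == b[0]
    return False


with_initials = first_names_match("P.", "Paul")
without_initials = first_names_match("P.", "Paul", allow_initials=False)
```

With the option enabled the initial is accepted as a match for the full name; with it disabled the same pair no longer matches on first name, which is the kind of change that can move a record out of a match group.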

The new results show that P.T Coleman is now a near-match (grey, italicized text) to group ID 177.  This means that the record is unique, but falls just under the threshold for being part of the match group.

/wp-content/uploads/2013/12/dd9_348303.png
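
A near-match can be read as a score that falls within a small margin below the match threshold. The Python sketch below assumes a 0-100 score, a threshold of 80 and a margin of 5; these values and the `classify` helper are hypothetical and only illustrate the idea.

```python
def classify(score, threshold=80, near_margin=5):
    """Classify a 0-100 score against a match group; values are assumed."""
    if score >= threshold:
        return "match"
    if score >= threshold - near_margin:
        return "near-match"
    return "unique"


# A record scoring 78 against a group is not a member, but is flagged as close.
status = classify(78)
```

Raising or lowering the threshold moves records between these three outcomes, which is why near-matches are worth a look when tuning a strategy.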

Data Cleansing Advisor can determine the duplicates within a specified input source and gives you the tools to easily customize the matching strategy to get the results that you want.

Data Cleansing Advisor Best Practices Blog Series

Determining Duplicates and a Matching Strategy
http://scn.sap.com/community/information-steward/blog/2013/12/31/determining-duplicates-and-a-matching-strategy

Publishing to Data Services Designer
http://scn.sap.com/community/information-steward/blog/2013/12/31/publishing-to-data-services-designer

Configuring Best Record Using Data Services Designer
http://scn.sap.com/community/information-steward/blog/2013/12/31/configuring-best-record-using-data-services-designer

Match Review with Data Cleansing Advisor (DCA)
http://scn.sap.com/community/information-steward/blog/2013/12/31/match-review-with-data-cleansing-advisor-dca

Data Quality Assessment for Party Data
http://scn.sap.com/community/information-steward/blog/2013/12/31/data-quality-assessment-for-party-data

Using Data Cleansing Advisor (DCA) to Estimate Match Review Tasks
http://scn.sap.com/community/information-steward/blog/2013/12/31/using-data-cleansing-advisor-dca-to-estimate-match-review-tasks

Creating a Data Cleansing Solution for Multiple Sources
http://scn.sap.com/community/information-steward/blog/2013/12/31/creating-a-data-cleansing-solution-for-multiple

6 Comments

  1. Poonam Hemrajani

    Hi Ken

    I have a scenario where my source table has some additional columns that should not participate in the match process. Is there a way to configure these columns as carry-forward columns and keep them out of the match process?

    Poonam

  2. Fernanda Nunes Ramalho

    Hi Ken, thanks for this!

    Is there a way I can ‘force’ a content type into a field?

    IS is identifying only 39% of the content as Address Line, so it won’t use the type and allow me to use it in Data Cleansing Advisor!

    Thanks
