Technology Blogs by SAP
Learn how to extend and personalize SAP applications. Follow the SAP technology blog for insights into SAP BTP, ABAP, SAP Analytics Cloud, SAP HANA, and more.
cancel
Showing results for 
Search instead for 
Did you mean: 
tim_huse
Advisor
Advisor

Introduction


The SAP Datasphere Intelligent Lookup Series is intended to provide you with useful guidance on how to utilize the Intelligent Lookup to leverage the potential of your data landscape. In order to design business processes efficiently and satisfy customers, a very high master data quality is essential. Often, this is not the case in-house. Fortunately, there are external providers who make optimized master data (e.g. organizational data and address data) available. For successful use of these, they have to be linked with internal data. SAP Datasphere provides Intelligent Lookup, a solution to match external data with internal data that semantically fits together even if there is no common key between these data.

This article is the fourth in the blog post series and describes what a fuzzy match is and how fuzzy rules can be created in an Intelligent Lookup.

The blogpost is structured as follows: First, we define what a fuzzy match is. Then it is shown how a fuzzy rule can be created within an Intelligent Lookup in SAP Datasphere. Thereafter, two examples containing fuzzy rules are shown. Finally, a conclusion is drawn.

We plan to release a new blog every Friday for the coming weeks. These blogs have been released so far:

  1. Harmonize your internal datasets with external data

  2. How filtering and sorting enhances your work with Intelligent Lookup

  3. How to integrate an Intelligent Lookup in your existing data landscape

  4. What is a fuzzy match and why should I care? (this blog)

  5. Up for a (Data) Challenge?


 

Please note: All data shown in this blogpost is sample data.

 

What is a fuzzy match?


A fuzzy match is used whenever I have semantically similar data in entities of the datasets I want to associate, but they do not have exactly identical data and I want to allow for fault tolerance. Entries will be matched even if two strings being compared contain typos or additional/missing characters. The following figure shows an example of fuzzy matches.



Figure 1: Fuzzy Match Example (Source: Own Image)


 

Fuzzy Match Rule in the Intelligent Lookup


A fuzzy match rule in Intelligent Lookup compares strings of the columns of the input entities with strings of the columns of the lookup entities with an error-tolerant search. Each match encountered is assigned a percentage score, which describes the accuracy of the match. The higher the score, the more similar the strings are. A score of 100 % means the strings are identical. A score of 0 % means the strings have nothing in common. The user can specify at what tresholds entries are classified as a match versus at what treshold they are classified as a review case. The fuzzy match rule is powered by the SAP HANA fuzzy search.

The following describes the steps to create a fuzzy match rule in an Intelligent Lookup (as a precondition, I have already created the Intelligent Lookup and defined which view or table acts as the input entity and which acts as the lookup entity. In addition, I have already clicked on the rule).


Figure 2: Add a fuzzy rule to an Intelligent Lookup (Part 1) (Source: Own Image)



Figure 3: Add a fuzzy rule to an Intelligent Lookup (Part 2) (Source: Own Image)


First, the General Section specifies how the rule is named and that it is a "fuzzy match" as a match strategy.

In the Match Tresholds Section, the threshold scores for match entries and review entries can be configured. The "Matched Records Score" defines the minimum score above which an input entry is matched to a lookup entry and the match appears in the "Matched" results. The default value is 100% and if more than one lookup entry matches to the input entity above this threshold, the entries are placed in the "Multiple" Results. The "Review Records Score" sets the minimum score above which an input entry will be matched to a lookup entry and the match will appear in the "Review" Results. The default value is 80% and is always below the "Matched Records Score". If more than one lookup entry is matched to the input entity between this threshold and the matched records score, the entries are placed in the multiple results. If an input entity does not match any lookup entity (score below the "Review Records Score"), then the entry is placed in the "Unmatched" Results.

In the Match Columns Section, one or more string columns (for input and lookup entities, respectively) are added, which are to be compared to each other. When you select more than one column on either side, the columns are concatenated together in the order in which they are displayed in the list.

At Advanced Settings you can define how multiple matches are handled (Parameter: "Multiple Match Processing"). If "Auto-Select" is activated, a match entry is automatically selected if a match candidate scores conclusively higher than any other.

 

Examples


Example 1 - Cities Matching


In example 1, a data set containing sales data from a web store (order Value, order Date, city, region, country) is linked to a second data set (city name, continent name, country name, latitude, longitude) to extend the reporting with the spatial data for the city. For this purpose, a fuzzy rule is created that compares the columns for the cities per record. The Matched Records Score is set to 95% and the Review Records Score is set to 80%.

In this example, the rule results in 51% of the input sales data (16,000 records) being matched, 7% being classified as review, and the remaining 42% being classified as multiple match. In the following figure it is shown that with the review score of 80% good review cases were identified. If the review case of the city "Zurich" (LocationID = 11) to the city "Zürich" (ID = 14306) is manually confirmed, 1035 of the total 1222 review cases are already confirmed as match. All future sales entries with the LocationID 11 are automatically assigned to the city "Zurich" from the lookup data set.


Figure 4: Example 1 Review Cases (Source: Own Image)


Another 42% of the input entries were identified as multiple matches within the review records score. The following figure shows an excerpt. The city "Frankfurt" (LocationID = 17) matches within the 80% range with 10 different cities of the second dataset. In this example, it makes sense to use the Sorting and Filter options to display the additional columns Region and Country (there is a separate blog post by Carolin Boeke). In this case, it makes sense to build additional rules that additionally matches by country and region. If "Frankfurt" (LocationID = 17) is matched to the city "Frankfurt am Main" (ID = 4851), another 943 matches of the 6875 review cases are resolved.


Figure 5: Example 1 Multiple Match Cases (Source: Own Image)


 

Example 2 - Company data (company name + address data) Matching


Example 2 outlines the use case from the blog post series (see here). An internal customer master record is associated with an external database (Legal Entity Identifier - LEI). Within the Intelligent Lookup several fuzzy rules were applied, as an illustration rule 9 is exemplified. Rule 9 matches the Name (Short Name), Postcode, City, Country and Address Line from the internal record to the corresponding fields of the LEI record with a Matched Records Score of 95% and a Review Records Score of 80%.

This rule can match 8% of the data, 3% match as a review case, 20% match as multiple matches, only 9% cannot be matched. The review cases of rule 9 are shown in the following figure. These were all correctly identified and can be manually confirmed. "LUFTHANSA" (CustomerID = 69) could be matched to "Deutsche Lufthansa". "RATIONAL" (CustomerID = 73) could be matched to "RATIONAL", although the city is defined as "Landsberg" and in the other data set "Landsberg am Lech". "TALANX" (CustomerID = 83) can also be successfully associated to "Talanx", although the internal data set has not maintained a postal code.


Figure 6: Example 2 Review Cases (Source: Own Image)


The following figure displays the multiple match cases of rule 9. To the entry "MUNICH RE" (CustomerID = 13) 18 entries of the LEI data set could be matched. The appropriate entry "Münchener Rückversicherungs-Gesellschaft" (LEI = 529900MUF4C20K50JS49) could also be detected, but with a score of 81% it is displayed quite far down. Further follow-up rules should be defined, which can match the entry more precisely, and/or sorting and filtering functionalities should be employed to identify the matching entry.


Figure 7: Example 2 Multiple Match Cases (Source: Own Image)


 

Conclusion


This blog post explained how fuzzy rules can be created within Intelligent Lookups in SAP Datasphere. The fuzzy rules have proven to be a useful addition that can increase the match rate.

  • Faster than manual maintenance

  • Less expensive than manual maintenance

  • High quality (user can focus on more difficult cases manually)


Often it makes sense to match with exact rules first and then add fuzzy rule nodes to achieve the best possible matching quality (blogs on Best Practices will follow).

Thanks for reading! I hope you find this post helpful. For any questions or feedback just leave a comment below this post. Feel free to also check out the other blogposts in the series. I would like to thank my colleagues carolivia, florian91, josef.georg.hampp and he-rima for collaborating on the blogpost series.

Best wishes,

Tim

 

Further Links




Find more information and related blog posts on the topic page for SAP Datasphere.