SAP Datasphere Intelligent Lookup Series – What is a fuzzy match and why should I care?
The SAP Datasphere Intelligent Lookup Series is intended to provide you with useful guidance on how to utilize the Intelligent Lookup to leverage the potential of your data landscape. In order to design business processes efficiently and satisfy customers, a very high master data quality is essential. Often, this is not the case in-house. Fortunately, there are external providers who make optimized master data (e.g. organizational data and address data) available. For successful use of these, they have to be linked with internal data. SAP Datasphere provides Intelligent Lookup, a solution to match external data with internal data that semantically fits together even if there is no common key between these data.
This article is the fourth in the blog post series and describes what a fuzzy match is and how fuzzy rules can be created in an Intelligent Lookup.
The blogpost is structured as follows: First, we define what a fuzzy match is. Then it is shown how a fuzzy rule can be created within an Intelligent Lookup in SAP Datasphere. Thereafter, two examples containing fuzzy rules are shown. Finally, a conclusion is drawn.
We plan to release a new blog every Friday for the coming weeks. These blogs have been released so far:
- Harmonize your internal datasets with external data
- How filtering and sorting enhances your work with Intelligent Lookup
- How to integrate an Intelligent Lookup in your existing data landscape
- What is a fuzzy match and why should I care? (this blog)
- Up for a (Data) Challenge?
Please note: All data shown in this blogpost is sample data.
What is a fuzzy match?
Fuzzy Match Rule in the Intelligent Lookup
A fuzzy match rule in Intelligent Lookup compares strings of the columns of the input entities with strings of the columns of the lookup entities with an error-tolerant search. Each match encountered is assigned a percentage score, which describes the accuracy of the match. The higher the score, the more similar the strings are. A score of 100 % means the strings are identical. A score of 0 % means the strings have nothing in common. The user can specify at what tresholds entries are classified as a match versus at what treshold they are classified as a review case. The fuzzy match rule is powered by the SAP HANA fuzzy search.
The following describes the steps to create a fuzzy match rule in an Intelligent Lookup (as a precondition, I have already created the Intelligent Lookup and defined which view or table acts as the input entity and which acts as the lookup entity. In addition, I have already clicked on the rule).
First, the General Section specifies how the rule is named and that it is a “fuzzy match” as a match strategy.
In the Match Tresholds Section, the threshold scores for match entries and review entries can be configured. The “Matched Records Score” defines the minimum score above which an input entry is matched to a lookup entry and the match appears in the “Matched” results. The default value is 100% and if more than one lookup entry matches to the input entity above this threshold, the entries are placed in the “Multiple” Results. The “Review Records Score” sets the minimum score above which an input entry will be matched to a lookup entry and the match will appear in the “Review” Results. The default value is 80% and is always below the “Matched Records Score”. If more than one lookup entry is matched to the input entity between this threshold and the matched records score, the entries are placed in the multiple results. If an input entity does not match any lookup entity (score below the “Review Records Score”), then the entry is placed in the “Unmatched” Results.
In the Match Columns Section, one or more string columns (for input and lookup entities, respectively) are added, which are to be compared to each other. When you select more than one column on either side, the columns are concatenated together in the order in which they are displayed in the list.
At Advanced Settings you can define how multiple matches are handled (Parameter: “Multiple Match Processing”). If “Auto-Select” is activated, a match entry is automatically selected if a match candidate scores conclusively higher than any other.
Example 1 – Cities Matching
In example 1, a data set containing sales data from a web store (order Value, order Date, city, region, country) is linked to a second data set (city name, continent name, country name, latitude, longitude) to extend the reporting with the spatial data for the city. For this purpose, a fuzzy rule is created that compares the columns for the cities per record. The Matched Records Score is set to 95% and the Review Records Score is set to 80%.
In this example, the rule results in 51% of the input sales data (16,000 records) being matched, 7% being classified as review, and the remaining 42% being classified as multiple match. In the following figure it is shown that with the review score of 80% good review cases were identified. If the review case of the city “Zurich” (LocationID = 11) to the city “Zürich” (ID = 14306) is manually confirmed, 1035 of the total 1222 review cases are already confirmed as match. All future sales entries with the LocationID 11 are automatically assigned to the city “Zurich” from the lookup data set.
Another 42% of the input entries were identified as multiple matches within the review records score. The following figure shows an excerpt. The city “Frankfurt” (LocationID = 17) matches within the 80% range with 10 different cities of the second dataset. In this example, it makes sense to use the Sorting and Filter options to display the additional columns Region and Country (there is a separate blog post by Carolin Boeke). In this case, it makes sense to build additional rules that additionally matches by country and region. If “Frankfurt” (LocationID = 17) is matched to the city “Frankfurt am Main” (ID = 4851), another 943 matches of the 6875 review cases are resolved.
Example 2 – Company data (company name + address data) Matching
Example 2 outlines the use case from the blog post series (see here). An internal customer master record is associated with an external database (Legal Entity Identifier – LEI). Within the Intelligent Lookup several fuzzy rules were applied, as an illustration rule 9 is exemplified. Rule 9 matches the Name (Short Name), Postcode, City, Country and Address Line from the internal record to the corresponding fields of the LEI record with a Matched Records Score of 95% and a Review Records Score of 80%.
This rule can match 8% of the data, 3% match as a review case, 20% match as multiple matches, only 9% cannot be matched. The review cases of rule 9 are shown in the following figure. These were all correctly identified and can be manually confirmed. “LUFTHANSA” (CustomerID = 69) could be matched to “Deutsche Lufthansa“. “RATIONAL” (CustomerID = 73) could be matched to “RATIONAL“, although the city is defined as “Landsberg” and in the other data set “Landsberg am Lech“. “TALANX” (CustomerID = 83) can also be successfully associated to “Talanx“, although the internal data set has not maintained a postal code.
The following figure displays the multiple match cases of rule 9. To the entry “MUNICH RE” (CustomerID = 13) 18 entries of the LEI data set could be matched. The appropriate entry “Münchener Rückversicherungs-Gesellschaft” (LEI = 529900MUF4C20K50JS49) could also be detected, but with a score of 81% it is displayed quite far down. Further follow-up rules should be defined, which can match the entry more precisely, and/or sorting and filtering functionalities should be employed to identify the matching entry.
This blog post explained how fuzzy rules can be created within Intelligent Lookups in SAP Datasphere. The fuzzy rules have proven to be a useful addition that can increase the match rate.
- Faster than manual maintenance
- Less expensive than manual maintenance
- High quality (user can focus on more difficult cases manually)
Often it makes sense to match with exact rules first and then add fuzzy rule nodes to achieve the best possible matching quality (blogs on Best Practices will follow).
Thanks for reading! I hope you find this post helpful. For any questions or feedback just leave a comment below this post. Feel free to also check out the other blogposts in the series. I would like to thank my colleagues Carolin Boeke, Florian Neukirch, Josef Hampp and Richard Herrmann for collaborating on the blogpost series.
- New openSAP course: Introduction to SAP Datasphere (24th May to 22nd June)
- One-Stop-Shop to Intelligent Lookup in SAP Datasphere
- End2End Walk Through – Intelligent Lookup
- Intelligent Lookup Documentation in SAP Help Portal
Find more information and related blog posts on the topic page for SAP Datasphere.
This article was very interesting and informative. I learned a lot about the concept of fuzzy matching and how it can help me find and match data records that are not exactly the same, but similar enough. I liked how you explained the difference between exact and fuzzy matching, and how to use the SAP DataSphere Intelligent Lookup service to perform fuzzy matching on various data sources and scenarios. I also liked how you showed some examples and screenshots of the fuzzy matching results and the metrics that measure the quality and accuracy of the matching. I think this blog post is very useful and relevant for anyone who wants to improve their data quality and integrity. Thank you for sharing this knowledge and experience. 😊
Best regards, Hanna from CodeIT