Q1. How is the similarity score calculated when I enter name fields only?
In a first step the Duplicate Check calculates the formatted name (e. g. ‘Microsoft Corp’; ‘Dr. Joe Doe’). This formatted name is compared with the formatted name of each persisted instance. To calculate the similarity of these two formatted names we use the algorithm for token search within a field. In our example a token is a name part (e. g. the family name):
1) For each token in the search field we find the best match with a token in the database field using the Levenshtein algorithm (see https://archive.sap.com/discussions/thread/3757982). To find the best match, every token in the search field is compared with every token in the database field. Then, for each token in the database field, the best-matching token in the search field is determined, and this score is recorded as the fuzzy similarity of that search field token. Note that one token in the database field can contribute to the fuzzy similarity of only one token in the search field, and that the order of the tokens does not influence the outcome.
2) If the number of tokens differs between the search field and the field in the database, then a fuzzy similarity 0 is used for each of the remaining tokens.
3) The list of fuzzy similarity values obtained above is aggregated using the OR formula for token search within fields, i. e. the square root of the mean of the squared similarity values. Example:
Search String: Formatted name = Benjamin Franklin
Database Record: Formatted name = Banjaminn Franklin
‘Benjamin’ best matches ‘Banjaminn’. The Levenshtein similarity is 62% (1 x update, 1 x delete, 1 x insert, i. e. similarity = 1 – 3/8).
‘Franklin’ best matches ‘Franklin’. The similarity is 100%, as both tokens are identical.
Finally, the two similarity values are aggregated by using the OR formula for token search within fields:
SQRT( ( 0.62² + 1² ) / 2 ) = 83%.
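The aggregation step can be reproduced with a few lines of Python. This is only a sketch of the formula as it appears in the example above, not the product's implementation; the function name `aggregate_or` is made up for illustration, and the token scores are taken from the example rather than computed.

```python
import math

def aggregate_or(token_scores):
    """Aggregate per-token fuzzy similarities (0.0 to 1.0) into one score.

    Sketch of the OR formula as used in the example: the square root of
    the mean of the squared token scores. The name is hypothetical.
    """
    if not token_scores:
        return 0.0
    return math.sqrt(sum(s * s for s in token_scores) / len(token_scores))

# Token scores from the example:
# 'Benjamin' vs 'Banjaminn' -> 0.62, 'Franklin' vs 'Franklin' -> 1.0
score = aggregate_or([0.62, 1.0])
print(f"{score:.0%}")  # prints 83%
```

Because the values are squared before averaging, a single perfectly matching token pulls the aggregated score up more than a plain average would.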
Q2. I created a new account and maintained the name only. The name contains the legal form of the company (e. g. Microsoft Corp., SAP SE). The duplicate check shows me potential duplicates with names that are quite different from the name I entered. Isn’t this a bug?
Currently the Duplicate Check does not provide the option to configure legal forms (e. g. Corp, SE, AG, GmbH) to be ignored by the similarity score calculation. That means that identical legal forms contribute a 100% token score (see Q1) to the name part score. For names consisting of 2 tokens (legal form + 1 additional token), a similarity of 55% for the 2nd token is already sufficient for an instance to become a potential duplicate. According to the algorithm explained in Q1 we get
SQRT( ( 0.55² + 1² ) / 2 ) = 81%, which is above the default duplicate check threshold of 80%. Hence, this instance is displayed as a potential duplicate.
For example, ‘Wintervoss Corp’ and ‘Wintersnow Corp’ have a similarity of 82%, as the similarity of ‘Wintervoss’ and ‘Wintersnow’ calculated with the Levenshtein algorithm is 60%.
To overcome this restriction, we strongly recommend entering not only the name for a new account but as many fields as possible (e. g. ‘qualified’ address data) to get a more realistic similarity score. Additionally, it might be an option to decrease the name weight in the Business Configuration Finetuning to lower the influence of the name.
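The ‘Wintervoss Corp’ figures can be checked with a small Python sketch. It uses the textbook Levenshtein distance and assumes the distance is normalized by the length of the longer token; both are assumptions for illustration, and the scoring inside the Duplicate Check may differ in detail. The function names are hypothetical.

```python
import math

def levenshtein(a, b):
    """Textbook Levenshtein distance (insert, delete, substitute cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def token_similarity(a, b):
    """Assumption: similarity = 1 - distance / length of the longer token."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

name_token = token_similarity("Wintervoss", "Wintersnow")  # 4 edits over 10 characters
legal_form = token_similarity("Corp", "Corp")              # identical tokens
total = math.sqrt((name_token ** 2 + legal_form ** 2) / 2)
print(f"{name_token:.0%} {total:.0%}")  # prints 60% 82%
```

The identical legal form alone lifts a 60% token match above the 80% default threshold, which is why entering more fields gives a more realistic score.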
Q3. I created a new account and maintained some name fields and the country field. I leave all other fields empty. The duplicate check shows me potential duplicates with names that are quite different from the name I entered. Isn’t this a bug?
When you enter the country only and no further address data, the duplicate check only compares the country field of the postal address of persisted accounts. That is, all accounts that are located in the same country have a 100% similarity for the address part. To calculate the total similarity score we use an OR formula similar to the one mentioned in Q1 to aggregate the similarity of the name and address parts. Even if the name has a similarity of only 55%, this yields a total score of more than 80%.
We strongly recommend entering as many fields as possible (e. g. ‘qualified’ address data) to get a more realistic similarity score.
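The effect of a country-only address can be illustrated in Python. The text only states that a "similar" OR formula is used, so the weighted form below, the parameter names, and the equal default weights are all assumptions made for this sketch; they are not taken from the product.

```python
import math

def total_score(name_score, address_score, name_weight=1.0, address_weight=1.0):
    """Hypothetical weighted variant of the OR formula from Q1.

    Assumption: each part score is squared, weighted, averaged, and the
    square root is taken. With equal weights this is the Q1 formula.
    """
    weighted = name_weight * name_score ** 2 + address_weight * address_score ** 2
    return math.sqrt(weighted / (name_weight + address_weight))

# Name matches with only 55%, address part (country only) matches with 100%
total = total_score(0.55, 1.0)
print(f"{total:.0%}")  # prints 81%
```

Under these assumptions a weak name match still exceeds the 80% default threshold, because the country match counts as a perfect address match.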
Q4. I created a new account and maintained the name ‘Wintervoss’. I leave all other fields empty. The system already contains an account ‘Wintervoss Electronics’, but the duplicate check does not return ‘Wintervoss Electronics’ as a potential duplicate. Isn’t this a bug?
No, this behavior is correct. The search field contains one token and the database field contains two, so the unmatched token gets a fuzzy similarity of 0 (see step 2 in Q1). The similarity score for ‘Wintervoss’ and ‘Wintervoss Electronics’ is therefore calculated like this:
SQRT( (1² + 0²) / 2 ) = 71%
This is below the default threshold of 80%. Therefore this instance is not displayed as a potential duplicate.
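The padding with 0 for unmatched tokens and the threshold comparison can be sketched in Python as follows. The function and constant names are hypothetical, and treating a score equal to the threshold as a hit (`>=`) is an assumption.

```python
import math

DEFAULT_THRESHOLD = 0.80  # default duplicate check threshold mentioned in the text

def name_score(matched_scores, search_token_count, db_token_count):
    """Pad unmatched tokens with 0.0, then apply the OR formula from Q1."""
    total_tokens = max(search_token_count, db_token_count)
    padded = list(matched_scores) + [0.0] * (total_tokens - len(matched_scores))
    if not padded:
        return 0.0
    return math.sqrt(sum(s * s for s in padded) / len(padded))

# 'Wintervoss' (1 token) vs 'Wintervoss Electronics' (2 tokens):
# one perfect match, one token without a partner
score = name_score([1.0], search_token_count=1, db_token_count=2)
print(f"{score:.0%}", score >= DEFAULT_THRESHOLD)  # prints 71% False
```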
Q5. How does the system separate tokens in a name field?
The system uses this set of characters for token separation:
Let’s assume a name field contains this entry:
Europe/Ireland South-East Construction/Support(Pro)&Partner
This is separated into these tokens: