# HOWTO – Business Partner Duplicate Check Algorithm

The Business Partner Duplicate Check finds potential duplicates that already exist in the system when a new business partner is created. It relies on HANA’s capabilities to run a fuzzy search on the database with the given data and to calculate an overall rank for the records that match to some degree. HANA uses the Levenshtein algorithm to calculate the similarity of terms. By combining HANA’s error-tolerant search and ranking with FSI’s flexibility to create dynamic SQL ‘Where’ clauses, the business partner duplicate check finds duplicates both reliably and efficiently.

Levenshtein algorithm:
Named after its developer Vladimir Levenshtein, the Levenshtein algorithm calculates the Levenshtein distance, i.e. the distance between two given strings or sequences, defined as the minimum number of operations needed to transform one string into the other, where an operation is an insertion, deletion, or substitution of a single character. It is often used in applications that need to determine how similar, or different, two strings are, such as spell checkers.
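
The definition above can be sketched with the textbook dynamic-programming approach. This is an illustration of the distance itself, not of HANA’s internal implementation; the function name `levenshtein` is chosen here for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    # prev[j] = distance between the already processed prefix of a
    # and the first j characters of b (starts with the empty prefix).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning i characters of a into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca by cb (or keep a match)
            ))
        prev = curr
    return prev[-1]


print(levenshtein("Bond", "Band"))       # 1
print(levenshtein("kitten", "sitting"))  # 3
```

Only two rows of the distance table are kept at a time, so memory use grows with the length of one string rather than with the product of both lengths.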

The Duplicate Check considers these “nodes” (= collections of attributes) when calculating the similarity score: name, postal address, e-mail, fax, phone, birthdate (only for persons), and additional identifiers. That is, it calculates the single similarity of each of these nodes compared to an existing business partner on the database and then aggregates the single similarities into a total similarity. In this step the nodes contribute to the total similarity with different weights, which are configured in the Business Configuration Finetuning “Duplicate Check Weighting For Business Partners”.

A short example illustrates this.
Let’s assume you are creating a new contact with this data:
Name = Bond
E-mail = j.bond@mi6.com

Let’s say this similar contact already exists on the database:
Name = Band
E-mail = j.bond@mi6.com

The similarity of the names is 1 – ¼ = 0.75 = 75%.
The similarity of the e-mail addresses is 1 = 100%.
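
The name calculation above can be written in general form. As an assumption made here for illustration (the example only covers two four-character strings), the Levenshtein distance $d(a, b)$ is normalised by the length of the longer string; the score HANA actually returns may be computed differently:

$$\mathrm{sim}(a, b) = 1 - \frac{d(a, b)}{\max(|a|, |b|)}, \qquad \mathrm{sim}(\text{Bond}, \text{Band}) = 1 - \frac{1}{4} = 0.75$$

“Bond” becomes “Band” with a single substitution (o → a), so $d = 1$, and both strings have four characters.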

Now the total similarity depends on the weights for name and e-mail configured in “Duplicate Check Weighting For Business Partners”. Only the weights for name and e-mail are relevant, as the similarity of the other nodes is 0.

Let’s assume we have these weights:
Name weight = 30
E-mail weight = 70

To combine the single similarities the Duplicate Check uses the so-called OR-formula:

$$\mathrm{SIM} = \sqrt{\frac{\sum_{i=1}^{n} w_i^2 \, R_i^2}{\sum_{i=1}^{n} w_i^2}}$$

with

R = rank or fuzzy similarity of a node

w = node weight

i = index value of the node

n = number of individual nodes

For this example we get this total similarity:
SIM = SQRT( (30² * 0.75² + 70² * 1²) / (30² + 70²) ) = SQRT( 5406.25 / 5800 ) ≈ 0.97 = 97%

If you decrease the weight for Name and increase the weight for E-mail, the total similarity increases as well, because the contribution of the 100% matching node ‘e-mail’ grows relative to the node ‘name’, which has a lower similarity.
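
The first example and the effect of shifting the weights can be reproduced with a few lines of Python. This is a minimal sketch, not the system’s implementation: the function name `total_similarity` and the (weight, rank) pair layout are assumptions made here, and only the nodes that were actually compared are passed in, as in the example.

```python
from math import sqrt


def total_similarity(nodes):
    """Combine single node similarities into a total similarity.

    nodes: list of (weight, rank) pairs, one per compared node,
           with rank given as a fraction between 0 and 1.
    """
    weighted = sum((w * r) ** 2 for w, r in nodes)  # sum of w_i^2 * R_i^2
    norm = sum(w ** 2 for w, _ in nodes)            # sum of w_i^2
    return sqrt(weighted / norm)


# Name (weight 30, similarity 0.75) and e-mail (weight 70, similarity 1.0)
print(round(total_similarity([(30, 0.75), (70, 1.0)]), 2))  # 0.97

# Shifting weight from name to e-mail raises the total similarity
print(round(total_similarity([(20, 0.75), (80, 1.0)]), 2))  # 0.99
```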

Now let’s have a look at a second example where we enter only the name for the new contact:
Name = Bond

Now the configuration in “Duplicate Check Weighting For Business Partners” is completely irrelevant as there are no parts to be combined.
That means the total similarity is just the single similarity of the name part independent of any weights:
SIM = 0.75 = 75%

Summary:
The node weights you configure in “Duplicate Check Weighting For Business Partners” only change how much each of the above-mentioned nodes (e.g. name or address) contributes to the total similarity compared to the other nodes.
Note that the node weights are relative weights, i.e. only the ratio of a node weight to the other node weights is relevant, not the absolute value. If you double all node weights, for example, the calculated fuzzy similarity stays the same.
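
The last point can be checked directly. Taking the formula in the form used in the worked example, SIM = SQRT( Σ wᵢ² Rᵢ² / Σ wᵢ² ), and multiplying every weight by a common factor $c > 0$ (a symbol introduced here for illustration), the factor cancels:

$$\sqrt{\frac{\sum_{i=1}^{n} (c\,w_i)^2 R_i^2}{\sum_{i=1}^{n} (c\,w_i)^2}} = \sqrt{\frac{c^2 \sum_{i=1}^{n} w_i^2 R_i^2}{c^2 \sum_{i=1}^{n} w_i^2}} = \sqrt{\frac{\sum_{i=1}^{n} w_i^2 R_i^2}{\sum_{i=1}^{n} w_i^2}}$$

For instance, weights of 60 and 140 give the same 97% as the weights 30 and 70 in the first example.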

Thomas, thanks a LOT for this complete answer. Now I will stop telling my customers the duplicate check is just a fancy Black Box.

I can now give them the whole formula and even suggest an evening course for Mathematics!

Hi Thomas,

thank you for the very good and detailed reply.

Our client’s IT department asked us a quick question: what is the threshold on the Levenshtein distance for including a Business Partner in the Checked Duplicates list? We guess it should be around 80%~90%; is that correct?

Thanks a lot,

Davide

Hi Davide,

In the Business Configuration Scoping you can configure this threshold. You can
choose if you want a Strong, Medium or Weak Duplicate Check for Business Partners.

Here are the respective thresholds:
‘Strong Duplicate Check for Business Partners’: 85%
‘Medium Duplicate Check for Business Partners’: 80%
‘Weak Duplicate Check for Business Partners’: 70%

The default setting is ‘Medium Duplicate Check for Business Partners’, i. e. a threshold of 80%.

Best regards,
Thomas

Thank you, Thomas. It was exactly the information we needed.

Cheers,

Davide
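
The thresholds from the reply above can be sketched as a simple lookup. The dictionary keys, the function name `is_potential_duplicate`, and the greater-or-equal comparison are assumptions for illustration; the thread does not state whether a record exactly at the threshold is included.

```python
# Thresholds as listed in the thread, expressed as fractions.
THRESHOLDS = {
    "strong": 0.85,
    "medium": 0.80,  # default setting according to the thread
    "weak": 0.70,
}


def is_potential_duplicate(total_similarity: float, level: str = "medium") -> bool:
    """Return True if the total similarity reaches the configured threshold.

    Assumption: a similarity exactly at the threshold counts as a hit.
    """
    return total_similarity >= THRESHOLDS[level]


print(is_potential_duplicate(0.97))          # True  (name + e-mail example)
print(is_potential_duplicate(0.75))          # False (name-only example, medium)
print(is_potential_duplicate(0.75, "weak"))  # True
```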

This document was generated from the following discussion: Algorithm behind duplicate check rule


Hello!

Can the duplicate check be used when creating an individual customer through the OData API?

Thanks.

Hello Pushkar,

Thanks for the great write-up.

I have a few questions:

• When we raised some incidents with SAP in the past, we were told that each of these weighting options can have “sub-fields”. For example, Name has further characteristics inside it, such as academic title, first name, last name, additional name etc. I am guessing the same is true for address. Is it possible to know all the “sub-fields” within each of the above options? How is the weight given to each option distributed among those “sub-fields”?
• Since different countries have names and addresses in different formats, is there a possibility to configure this duplicate check at country level or org level?
• What enhancements are planned for the future?

thanks

Mahesh

Hello Pushkar,

I have a question. Thomas described the name and email similarity with a simple example below:

short example shall illustrate this:
Let’s assume you are creating a new contact with these data
Name = Bond
E-mail = j.bond@mi6.com

Let’s say there is already this similar contact existing on the database:
Name = Band
E-mail = j.bond@mi6.com

The similarity of the names is 1 – ¼ = 0.75 = 75%.
The similarity of the e-mail addresses is 1 = 100%.

What if we change the example like this:

Let’s assume the first record that I want to compare with other data looks like this:

Name1 = Bla Bla Company Ltd. Co.

Address1 = Bla Bla Street No:1769 Newyork

The second record is:

Name2 = Bla Bla Bla Company Ltd. Co.

Address2 = Bla Bla Street No:1768 NewYork

When I compared the two names with a Levenshtein distance function in VBA (an Excel macro), I got the result 4.

So, what would be my "similarity of the names"?

How can I calculate this?

Is it possible to make a calculation like this:

1 - (Levenshtein distance / length of the first name) = 1 - 4/24 = 0.83?

And can I find the address similarity in the same way, so that it would be 1 - 2/30 = 0.93?

And then we can put these ratios into our similarity formula.

Let’s assume

name weight is: 70

address weight is: 20

and our similarity = SQRT((70²*0.83²+20²*0.93²)/(70²+20²)) ≈ 0.84

Is this the correct way? I ask this question especially to verify "the similarity of the names" and "the similarity of the address".

Regards,

Here the problem is the legal-form suffix at the end of customer names, such as

gmbh

ag

S.p.a

limited

ltd, etc.

For example, we have an account Berco S.p.a and a lead with the company name Berco SPA, but the system is unable to find the match based on the name weighting.