Jaro–Winkler Distance Algorithm
To tackle a real world problem of the hiring department performing a new hire instead of a rehire for seasonal contractors, I decided to implement the Jaro-Winkler algorithm in SAP. Since there is a policy to not store key information on contractors (such as SSN), there is only the ability to match on a person’s name. Unfortunately, the name is not always typed correctly which leads to the inability to find a previous hire and this leads to hiring a new contractor instead of rehiring a contractor. Using the Jaro-Winkler algorithm, we are now able to suggest possible similar contractors based on the string comparison of first and last name. Jaro-Winkler calculates the distance (a measure of similarity) between strings. The measurement scale is 0.0 to 1.0, where 0.0 is the least likely and 1.0 is a positive match. For our purposes, anything below a 0.8 is not considered useful.
“Jaro–Winkler distance (Winkler, 1990) is a measure of similarity between two strings”.
See Wiki link: http://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance for more in depth information.
The Class: ZCL_JARO_WINKLER
Method: STRINGDISTANCE
method stringdistance.
data: firstlen type i, secondlen type i, halflen type i, commonmatches type i, common1 type string, common2 type string.
data: transpositions type i, i type i, j type i.
if ( not ( firstword is initial ) and not ( secondword is initial ) ).
if ( firstword eq secondword ).
stringdistance = totalmatchscore.
else.
firstlen = strlen( firstword ).
secondlen = strlen( secondword ).
halflen = zcl_jaro_winkler=>math_min( num1 = firstlen num2 = secondlen ) / 2 + 1.
common1 = zcl_jaro_winkler=>getcommoncharacters( firstword = firstword secondword = secondword distance = halflen ).
commonmatches = strlen( common1 ).
if ( commonmatches eq 0 ).
stringdistance = totalmismatchscore.
else.
common2 = zcl_jaro_winkler=>getcommoncharacters( firstword = secondword secondword = firstword distance = halflen ).
if ( commonmatches ne strlen( common2 ) ).
stringdistance = totalmismatchscore.
else.
transpositions = 0.
i = 0.
while ( i lt commonmatches ).
if ( common1+i(1) ne common2+i(1) ).
transpositions = transpositions + 1.
endif.
i = i + 1.
endwhile.
transpositions = transpositions / 2.
stringdistance
= commonmatches / ( 3 * firstlen ) + commonmatches / ( 3 * secondlen ) + ( commonmatches –transpositions ) / ( 3 * commonmatches ).
endif.
endif.
endif.
else.
stringdistance = totalmismatchscore.
endif.
endmethod.
Method: GETCOMMONCHARACTERS
method getcommoncharacters.
data: firstlen type i, secondlen type i, i type i, j type i, ch type c, foundit type c, secondword_copy type string.
data: first_half type string, second_half type string, next_start_point type i, remaining_length type i.
if ( not ( firstword is initial ) and not ( secondword is initial ) ).
secondword_copy = secondword.
firstlen = strlen( firstword ).
secondlen = strlen( secondword ).
i = 0.
while i lt firstlen.
ch = firstword+i(1).
foundit = bool_false.
j = zcl_jaro_winkler=>math_max( num1 = 0 num2 = ( i – distance ) ).
while ( ( foundit = bool_false ) and ( j lt zcl_jaro_winkler=>math_min( num1 = ( i + distance ) num2 = secondlen ) ) ).
if ( secondword_copy+j(1) eq ch ).
foundit = bool_true.
concatenate commons ch into commons.
move secondword_copy+0(j) to first_half.
remaining_length = ( secondlen – j ) – 1.
next_start_point = j + 1.
move secondword_copy+next_start_point(remaining_length) to second_half.
concatenate first_half ‘#’ second_half into secondword_copy.
endif.
j = j + 1.
endwhile.
i = i + 1.
endwhile.
else.
clear commons.
endif.
endmethod.
Creating a test program:
This test program requires that a first name and last name be input by the user and then these names are compared against every employee for possible matches.
loop at t_pa0002 assigning <fs>.
translate <fs>–nachn to upper case.
translate <fs>–vorna to upper case.
move <fs>–nachn to l_compare.
call method zcl_jaro_winkler=>stringdistance
exporting
firstword = s_nachn
secondword = l_compare
receiving
stringdistance = <fs>–nachn_score.
move <fs>–vorna to l_compare.
call method zcl_jaro_winkler=>stringdistance
exporting
firstword = s_vorna
secondword = l_compare
receiving
stringdistance = <fs>–vorna_score.
<fs>–total_score = <fs>–nachn_score + <fs>–vorna_score.
endloop.
sort t_pa0002 by total_score descending.
format color col_normal.
write: 80 ‘Total Score ‘ color col_total.
uline.
loop at t_pa0002 assigning <fs>.
write: / <fs>–nachn(20), <fs>–nachn_score, <fs>–vorna(20), <fs>–vorna_score.
write 80 <fs>–total_score color col_total.
endloop.
Deliberately misspelling my last name and shortening my first name:
Results:
I only show the top three results. In fact, since I’m creating a combined score of two different measurements, I would likely consider anything below a 1.60 as not useful in real world. So now we have a safeguard to alleviate hiring when we should be re-hiring. Of course, there are many uses for this algorith, such as zip code verification for zip+4. Also, another use is using as a dictionary check for valid words.
This is brilliant! I can see a LOT of uses for this. Great work and thanks for sharing!
Hi, very interesting! But what kind of method is zcl_jaro_winkler=>math_min?
Thx and greets
Chris
method math_min.
if num1 <= num2.
result = num1.
else.
result = num2.
endif.
endmethod.
method math_max.
if num1 >= num2.
result = num1.
else.
result = num2.
endif.
endmethod.
Really brilliant, but I think you should exemplify best classes because not everyone here is developers and would be a good you have a solution easy to implement. Adding more, I think it would be a good you place your project within open source projects here. (This space).
Warm regards,
Raphael Pacheco.
Thanks for sharing. In SAP, there's also a function module named RS_COMPARE_WORDS_SIMILAR, but no idea which algorithm it implements...
Hi Philipp,
what is the reason for implementing name matching using Jaro-Winkle? Have you also tested other techniques?
The reason I'm asking is that I probably would have chosen the naive implementation using the Levenshtein distance. This is available in ABAP as string distance function (ABAP Keyword Documentation).
Christian
Here's a great reference for all your pattern matching questions:
http://users.cecs.anu.edu.au/~Peter.Christen/publications/tr-cs-06-02.pdf
Speed Comparison of the four Jaro and Levenshtein algorithms ( from fastest to slowest ):
with the slowest taking 2 to 3 times as long as the fastest.
Hi Gurus,
I tried copying the code and creating the class but it is not working for me. I'm sure I did not do it right. Can I have the complete working code? It is not clear what the return parameters and the types basically I guess. I was able to compile everything and when I excute the program it is always returning 0. Totalmatchscore and totalmismatchscore I defined as constants with values 0 and 1 respectively. is that right?
I would appreciate if someone can help me on this.
Thanks,
Siva
Why not using the distance lehvenstein function?
Code with missing parts:
Hi Sandra Rossi,
Thanks for the code. It works!! I appreciate the help.
One thing I don't understand is if there is no match then also it returns 0.00 which doesn't make sense to me. This is what I pass and compare with first and last names in a table
In S4CORE there is a class CL_RSH_MATCH_STRING, which is going to calc Jaro-Winkler Similarity
My realization of jaro-winkler similarity
definition
implementation