# Jaro–Winkler Distance Algorithm

To tackle a real world problem of the hiring department performing a new hire instead of a rehire for seasonal contractors, I decided to implement the Jaro-Winkler algorithm in SAP.   Since there is a policy to not store key information on contractors (such as SSN), there is only the ability to match on a person’s name.  Unfortunately, the name is not always typed correctly which leads to the inability to find a previous hire and this leads to hiring a new contractor instead of rehiring a contractor.  Using the Jaro-Winkler algorithm, we are now able to suggest possible similar contractors based on the string comparison of first and last name.  Jaro-Winkler calculates the distance (a measure of similarity) between strings.  The measurement scale is 0.0 to 1.0, where 0.0 is the least likely and 1.0 is a positive match.  For our purposes, anything below a 0.8 is not considered useful.

“Jaro–Winkler distance (Winkler, 1990) is a measure of similarity between two strings”.

See Wiki link:  http://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance  for more in depth information. The Class:  ZCL_JARO_WINKLER

Method: STRINGDISTANCE

method stringdistance.

data: firstlen type i, secondlen type i, halflen type i, commonmatches type i, common1 type string, common2 type string.

data: transpositions type i, i type i, j type i.

if ( not ( firstword is initial ) and not ( secondword is initial ) ).

if ( firstword eq secondword ).

stringdistance = totalmatchscore.

else.

firstlen = strlen( firstword ).

secondlen = strlen( secondword ).

halflen = zcl_jaro_winkler=>math_min( num1 = firstlen num2 = secondlen ) / 2 + 1.

common1 = zcl_jaro_winkler=>getcommoncharacters( firstword = firstword secondword = secondword distance = halflen ).

commonmatches = strlen( common1 ).

if ( commonmatches eq 0 ).

stringdistance = totalmismatchscore.

else.

common2 = zcl_jaro_winkler=>getcommoncharacters( firstword = secondword secondword = firstword distance = halflen ).

if ( commonmatches ne strlen( common2 ) ).

stringdistance = totalmismatchscore.

else.

transpositions = 0.

i = 0.

while ( i lt commonmatches ).

if ( common1+i(1) ne common2+i(1) ).

transpositions = transpositions + 1.

endif.

i = i + 1.

endwhile.

transpositions = transpositions / 2.

stringdistance

= commonmatches / ( 3 * firstlen ) + commonmatches / ( 3 * secondlen ) + ( commonmatches transpositions ) / ( 3 * commonmatches ).

endif.

endif.

endif.

else.

stringdistance = totalmismatchscore.

endif.

endmethod.

Method: GETCOMMONCHARACTERS

method getcommoncharacters.

data: firstlen type i, secondlen type i, i type i, j type i, ch type c, foundit type c, secondword_copy type string.

data: first_half type string, second_half type string, next_start_point type i, remaining_length type i.

if ( not ( firstword is initial ) and not ( secondword is initial ) ).

secondword_copy = secondword.

firstlen = strlen( firstword ).

secondlen = strlen( secondword ).

i = 0.

while i lt firstlen.

ch = firstword+i(1).

foundit = bool_false.

j = zcl_jaro_winkler=>math_max( num1 = 0 num2 = ( i distance ) ).

while ( ( foundit = bool_false ) and ( j lt zcl_jaro_winkler=>math_min( num1 = ( i + distance ) num2 = secondlen ) ) ).

if ( secondword_copy+j(1) eq ch ).

foundit = bool_true.

concatenate commons ch into commons.

move secondword_copy+0(j) to first_half.

remaining_length = ( secondlen j ) 1.

next_start_point = j + 1.

move secondword_copy+next_start_point(remaining_length) to second_half.

concatenate first_half ‘#’ second_half into secondword_copy.

endif.

j = j + 1.

endwhile.

i = i + 1.

endwhile.

else.

clear commons.

endif.

endmethod.

Creating a test program:

This test program requires that a first name and last name be input by the user and then these names are compared against every employee for possible matches.

loop at t_pa0002 assigning <fs>.

translate <fs>nachn to upper case.

translate <fs>vorna to upper case.

move <fs>nachn to l_compare.

call method zcl_jaro_winkler=>stringdistance

exporting

firstword      = s_nachn

secondword     = l_compare

receiving

stringdistance = <fs>nachn_score.

move <fs>vorna to l_compare.

call method zcl_jaro_winkler=>stringdistance

exporting

firstword      = s_vorna

secondword     = l_compare

receiving

stringdistance = <fs>vorna_score.

<fs>total_score = <fs>nachn_score + <fs>vorna_score.

endloop.

sort t_pa0002 by total_score descending.

format color col_normal.

write: 80 ‘Total Score      ‘ color col_total.

uline.

loop at t_pa0002 assigning <fs>.

write: / <fs>nachn(20), <fs>nachn_score, <fs>vorna(20), <fs>vorna_score.

write 80 <fs>total_score color col_total.

endloop.

Deliberately misspelling my last name and shortening my first name: Results: I only show the top three results.  In fact, since I’m creating a combined score of two different measurements, I would likely consider anything below a 1.60 as not useful in real world.  So now we have a safeguard to alleviate hiring when we should be re-hiring.  Of course, there are many uses for this algorith, such as zip code verification for zip+4.  Also, another use is using as a dictionary check for valid words.

/   You must be Logged on to comment or reply to a post.
• This is brilliant! I can see a LOT of uses for this.  Great work and thanks for sharing!

• Hi, very interesting! But what kind of method is zcl_jaro_winkler=>math_min?

Thx and greets

Chris

• Philip Johnston Post author

method math_min.

if num1 <= num2.

result = num1.

else.

result = num2.

endif.

endmethod.

• Philip Johnston Post author

method math_max.

if num1 >= num2.

result = num1.

else.

result = num2.

endif.

endmethod.

• Really brilliant, but I think you should exemplify best classes because not everyone here is developers and would be a good you have a solution easy to implement. Adding more, I think it would be a good you place your project within open source projects here. (This space).

Warm regards,

Raphael Pacheco.

• Thanks for sharing. In SAP, there's also a function module named RS_COMPARE_WORDS_SIMILAR, but no idea which algorithm it implements...

• Hi Philipp,

what is the reason for implementing name matching using Jaro-Winkle? Have you also tested other techniques?

The reason I'm asking is that I probably would have chosen the naive implementation using the Levenshtein distance. This is available in ABAP as string distance function (ABAP Keyword Documentation).

Christian

• Philip Johnston Post author

Here's a great reference for all your pattern matching questions:

http://users.cecs.anu.edu.au/~Peter.Christen/publications/tr-cs-06-02.pdf

Speed Comparison of the four Jaro and Levenshtein algorithms ( from fastest to slowest ):

• Jaro
• Jaro-Winkler
• Levenshtein
• Damerau-Levenshtein

with the slowest taking 2 to 3 times as long as the fastest.

• Hi Gurus,

I tried copying the code and creating the class but it is not working for me. I'm sure I did not do it right. Can I have the complete working code? It is not clear what the return parameters and the types basically I guess. I was able to compile everything and when I excute the program it is always returning 0. Totalmatchscore and totalmismatchscore I defined as constants with values 0 and 1 respectively. is that right?

I would appreciate if someone can help me on this.

Thanks,

Siva

• Why not using the distance lehvenstein function?

• Code with missing parts:

``````CLASS zcl_jaro_winkler DEFINITION.
PUBLIC SECTION.
TYPES ty_distance TYPE p LENGTH 6 DECIMALS 2.
CLASS-METHODS stringdistance
IMPORTING
firstword             TYPE string
secondword            TYPE string
RETURNING
VALUE(stringdistance) TYPE ty_distance.
CLASS-METHODS getcommoncharacters
IMPORTING
firstword      TYPE string
secondword     TYPE string
distance       TYPE numeric
RETURNING
VALUE(commons) TYPE string.
CLASS-METHODS math_min
IMPORTING
num1       TYPE numeric
num2       TYPE numeric
RETURNING
VALUE(min) TYPE ty_distance.
CLASS-METHODS math_max
IMPORTING
num1       TYPE numeric
num2       TYPE numeric
RETURNING
VALUE(max) TYPE ty_distance.
CLASS-DATA: totalmatchscore    TYPE ty_distance,
totalmismatchscore TYPE ty_distance.
CONSTANTS: bool_false TYPE abap_bool VALUE abap_false,
bool_true  TYPE abap_bool VALUE abap_true.
ENDCLASS.

CLASS zcl_jaro_winkler IMPLEMENTATION.

METHOD stringdistance.
DATA: firstlen      TYPE i, secondlen TYPE i, halflen TYPE i, commonmatches TYPE i, common1 TYPE string, common2 TYPE string.
DATA: transpositions TYPE i, i TYPE i, j TYPE i.
IF ( NOT ( firstword IS INITIAL ) AND NOT ( secondword IS INITIAL ) ).
IF ( firstword EQ secondword ).
stringdistance = totalmatchscore.
ELSE.
firstlen = strlen( firstword ).
secondlen = strlen( secondword ).
halflen = zcl_jaro_winkler=>math_min( num1 = firstlen num2 = secondlen ) / 2 + 1.
common1 = zcl_jaro_winkler=>getcommoncharacters( firstword = firstword secondword = secondword distance = halflen ).
commonmatches = strlen( common1 ).
IF ( commonmatches EQ 0 ).
stringdistance = totalmismatchscore.
ELSE.
common2 = zcl_jaro_winkler=>getcommoncharacters( firstword = secondword secondword = firstword distance = halflen ).
IF ( commonmatches NE strlen( common2 ) ).
stringdistance = totalmismatchscore.
ELSE.
transpositions = 0.
i = 0.
WHILE ( i LT commonmatches ).
IF ( common1+i(1) NE common2+i(1) ).
transpositions = transpositions + 1.
ENDIF.
i = i + 1.
ENDWHILE.
transpositions = transpositions / 2.
stringdistance
= commonmatches / ( 3 * firstlen ) + commonmatches / ( 3 * secondlen ) + ( commonmatches - transpositions ) / ( 3 * commonmatches ).
ENDIF.
ENDIF.
ENDIF.
ELSE.
stringdistance = totalmismatchscore.
ENDIF.
ENDMETHOD.

METHOD getcommoncharacters.
DATA: firstlen        TYPE i, secondlen TYPE i, i TYPE i, j TYPE i, ch TYPE c, foundit TYPE c, secondword_copy TYPE string.
DATA: first_half       TYPE string, second_half TYPE string, next_start_point TYPE i, remaining_length TYPE i.
IF ( NOT ( firstword IS INITIAL ) AND NOT ( secondword IS INITIAL ) ).
secondword_copy = secondword.
firstlen = strlen( firstword ).
secondlen = strlen( secondword ).
i = 0.
WHILE i LT firstlen.
ch = firstword+i(1).
foundit = bool_false.
j = zcl_jaro_winkler=>math_max( num1 = 0 num2 = ( i - distance ) ).
WHILE ( ( foundit = bool_false ) AND ( j LT zcl_jaro_winkler=>math_min( num1 = ( i + distance ) num2 = secondlen ) ) ).
IF ( secondword_copy+j(1) EQ ch ).
foundit = bool_true.
CONCATENATE commons ch INTO commons.
MOVE secondword_copy+0(j) TO first_half.
remaining_length = ( secondlen - j ) - 1.
next_start_point = j + 1.
MOVE secondword_copy+next_start_point(remaining_length) TO second_half.
CONCATENATE first_half '#' second_half INTO secondword_copy.
ENDIF.
j = j + 1.
ENDWHILE.
i = i + 1.
ENDWHILE.
ELSE.
CLEAR commons.
ENDIF.
ENDMETHOD.

METHOD math_max.
max = nmax( val1 = num1 val2 = num2 ).
ENDMETHOD.

METHOD math_min.
min = nmin( val1 = num1 val2 = num2 ).
ENDMETHOD.

ENDCLASS.

PARAMETERS s_nachn TYPE string DEFAULT 'HOHNSTON'.
PARAMETERS s_vorna TYPE string DEFAULT 'PHIL'.

START-OF-SELECTION.
DATA: l_compare TYPE string.
TYPES: BEGIN OF ty_repository_line,
nachn       TYPE c LENGTH 30,
vorna       TYPE c LENGTH 30,
nachn_score TYPE zcl_jaro_winkler=>ty_distance,
vorna_score TYPE zcl_jaro_winkler=>ty_distance,
total_score TYPE zcl_jaro_winkler=>ty_distance,
END OF ty_repository_line,
ty_repository_lines TYPE STANDARD TABLE OF ty_repository_line WITH EMPTY KEY.
DATA(t_pa0002) = VALUE ty_repository_lines(
( nachn = 'JOHNSTON' vorna = 'PHILIP' )
( nachn = 'HOUSTON'  vorna = 'DANIEL' )
( nachn = 'CHUMCHAL' vorna = 'PHILLIP' ) ).
LOOP AT t_pa0002 ASSIGNING FIELD-SYMBOL(<fs>).
TRANSLATE <fs>-nachn TO UPPER CASE.
TRANSLATE <fs>-vorna TO UPPER CASE.
MOVE <fs>-nachn TO l_compare.
CALL METHOD zcl_jaro_winkler=>stringdistance
EXPORTING
firstword      = s_nachn
secondword     = l_compare
RECEIVING
stringdistance = <fs>-nachn_score.
MOVE <fs>-vorna TO l_compare.
CALL METHOD zcl_jaro_winkler=>stringdistance
EXPORTING
firstword      = s_vorna
secondword     = l_compare
RECEIVING
stringdistance = <fs>-vorna_score.
<fs>-total_score = <fs>-nachn_score + <fs>-vorna_score.
ENDLOOP.
SORT t_pa0002 BY total_score DESCENDING.
FORMAT COLOR COL_NORMAL.
WRITE: 80 'Total Score      ' COLOR COL_TOTAL.
ULINE.
LOOP AT t_pa0002 ASSIGNING <fs>.
WRITE: / <fs>-nachn(20), <fs>-nachn_score, <fs>-vorna(20), <fs>-vorna_score.
WRITE 80 <fs>-total_score COLOR COL_TOTAL.
ENDLOOP.``````
• Hi Sandra Rossi,

Thanks for the code. It works!! I appreciate the help.

• One thing I don't understand is if there is no match then also it returns 0.00 which doesn't make sense to me. This is what I pass and compare with first and last names in a table / • In S4CORE there is a class CL_RSH_MATCH_STRING, which is going to calc Jaro-Winkler Similarity

• My realization of jaro-winkler similarity

definition

``````  methods ABAP_JARO_WINKLER_CUSTOM
importing
!SRC_STR type STRING
!TRG_STR type STRING
exporting
!EV_SIM_JARO_WINKLER type DECFLOAT16
!EV_SIM_JARO type DECFLOAT16 .``````

implementation

``````  METHOD abap_jaro_winkler_custom.
CONSTANTS: winkler_increase TYPE p LENGTH 3 DECIMALS 2 VALUE '0.1'.
TYPES: BEGIN OF ts_matrix_cell
, i TYPE syindex
, j TYPE syindex
, src_symbol TYPE char1
, trg_symbol TYPE char1
, if_equal TYPE syindex
, END OF ts_matrix_cell
, tt_matrix TYPE SORTED TABLE OF ts_matrix_cell
WITH UNIQUE KEY primary_key COMPONENTS i j.

DATA lt_matrix TYPE tt_matrix.
DATA ls_matrix_cell TYPE ts_matrix_cell.
DATA src_str_length TYPE syindex.
DATA trg_str_length TYPE syindex.
DATA match_char_lim  TYPE syindex.
DATA matching_chars  TYPE syindex.
DATA transpose  TYPE syindex.
DATA transpose_half  TYPE syindex.
DATA similar_prefix_of_str TYPE syindex.
DATA symbol_index_position_i TYPE syindex.
DATA symbol_index_position_j TYPE syindex.

FIELD-SYMBOLS <fs_matrix> TYPE ts_matrix_cell.

"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
src_str_length = strlen( src_str ).
trg_str_length = strlen( trg_str ).
match_char_lim = floor( nmax( val1 = src_str_length val2 = trg_str_length ) / 2 ) - 1 .

"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
CLEAR lt_matrix.
" create matrix
DO ( src_str_length ) TIMES.
ls_matrix_cell-i = sy-index.
DO ( trg_str_length ) TIMES.
ls_matrix_cell-j = sy-index.
INSERT ls_matrix_cell INTO TABLE lt_matrix.
ENDDO.
ENDDO.
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
" check equlaty of elements
LOOP AT lt_matrix ASSIGNING <fs_matrix>.
symbol_index_position_i = <fs_matrix>-i - 1.
symbol_index_position_j = <fs_matrix>-j - 1.
<fs_matrix>-src_symbol = src_str+symbol_index_position_i(1).
<fs_matrix>-trg_symbol = trg_str+symbol_index_position_j(1).
IF <fs_matrix>-src_symbol EQ <fs_matrix>-trg_symbol.
<fs_matrix>-if_equal = 1.
ELSE.
<fs_matrix>-if_equal = 0.
ENDIF.
ENDLOOP.
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
" matching_chars and transpose_half
" this is separate steps of the algo - so we do it in separate loops to be clean
matching_chars = 0.
transpose_half = 0.
LOOP AT lt_matrix ASSIGNING <fs_matrix> WHERE if_equal EQ 1.
IF abs( <fs_matrix>-i - <fs_matrix>-j ) GT  match_char_lim.
CONTINUE.
ENDIF.
matching_chars = matching_chars + 1.
IF <fs_matrix>-i EQ <fs_matrix>-j.
ELSE.
transpose = transpose + 1.
ENDIF.
ENDLOOP.
transpose_half = transpose / 2.
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
" calculating JaroSimilarity
ev_sim_jaro = (   ( matching_chars / src_str_length )
+ ( matching_chars / trg_str_length )
+ ( ( matching_chars - transpose_half ) / matching_chars )
) / 3.

"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
similar_prefix_of_str = 0.
DO.
symbol_index_position_i = sy-index - 1.
IF src_str+symbol_index_position_i(1) EQ trg_str+symbol_index_position_i(1).
similar_prefix_of_str = similar_prefix_of_str + 1.
ELSE.
EXIT.
ENDIF.
IF similar_prefix_of_str GE 4.
EXIT.
ENDIF.
ENDDO.
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
" calculating Jaro-Winkler Similarity
ev_sim_jaro_winkler =
ev_sim_jaro + ( similar_prefix_of_str * winkler_increase ) * ( 1 - ev_sim_jaro ).
ENDMETHOD.``````