Jaro–Winkler Distance Algorithm

Former Member · ‎12-04-2013

To tackle a real-world problem, the hiring department creating a new hire instead of a rehire for seasonal contractors, I decided to implement the Jaro-Winkler algorithm in SAP. Because policy forbids storing key information on contractors (such as SSN), the only thing we can match on is a person’s name. Unfortunately, the name is not always typed correctly, so the previous hire is not found and the contractor is hired as new instead of being rehired.

Using the Jaro-Winkler algorithm, we can now suggest possible matching contractors based on a string comparison of first and last name. Despite its name, the Jaro-Winkler distance is a measure of similarity between two strings. The scale runs from 0.0 to 1.0, where 0.0 means no similarity and 1.0 is an exact match. For our purposes, anything below 0.8 is not considered useful.

“Jaro–Winkler distance (Winkler, 1990) is a measure of similarity between two strings”.

See the Wikipedia article at http://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance for more in-depth information.
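For reference, the standard definition can be written as below. The symbols are introduced here only for illustration: m is the number of matching characters, t is half the number of matching characters that appear in a different order, ℓ is the length of the common prefix (capped at 4), and p is the prefix scaling factor (commonly 0.1).

```latex
\[
\mathrm{sim}_j =
\begin{cases}
0 & \text{if } m = 0,\\[4pt]
\dfrac{1}{3}\left(\dfrac{m}{|s_1|} + \dfrac{m}{|s_2|} + \dfrac{m - t}{m}\right) & \text{otherwise,}
\end{cases}
\qquad
\mathrm{sim}_w = \mathrm{sim}_j + \ell \, p \, (1 - \mathrm{sim}_j)
\]
```

As a worked example, MARTHA and MARHTA share m = 6 matching characters with t = 1 transposition, so sim_j = (6/6 + 6/6 + 5/6) / 3 ≈ 0.944. The common prefix MAR gives ℓ = 3, so sim_w ≈ 0.944 + 3 × 0.1 × 0.056 ≈ 0.961.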

The Class: ZCL_JARO_WINKLER

Method: STRINGDISTANCE

method stringdistance.
  " Returns a similarity score between 0.0 (no match) and 1.0 (identical).
  data: firstlen       type i,
        secondlen      type i,
        halflen        type i,
        commonmatches  type i,
        common1        type string,
        common2        type string,
        transpositions type i,
        i              type i.

  if ( not ( firstword is initial ) and not ( secondword is initial ) ).
    if ( firstword eq secondword ).
      stringdistance = totalmatchscore.
    else.
      firstlen = strlen( firstword ).
      secondlen = strlen( secondword ).
      " Search window. DIV truncates; with integer operands '/' would round.
      halflen = zcl_jaro_winkler=>math_min( num1 = firstlen num2 = secondlen ) div 2 + 1.
      " Characters of firstword that also occur in secondword within the window.
      common1 = zcl_jaro_winkler=>getcommoncharacters( firstword = firstword secondword = secondword distance = halflen ).
      commonmatches = strlen( common1 ).
      if ( commonmatches eq 0 ).
        stringdistance = totalmismatchscore.
      else.
        " Same search in the other direction.
        common2 = zcl_jaro_winkler=>getcommoncharacters( firstword = secondword secondword = firstword distance = halflen ).
        if ( commonmatches ne strlen( common2 ) ).
          stringdistance = totalmismatchscore.
        else.
          " Count positions where the two common-character strings differ.
          transpositions = 0.
          i = 0.
          while ( i lt commonmatches ).
            if ( common1+i(1) ne common2+i(1) ).
              transpositions = transpositions + 1.
            endif.
            i = i + 1.
          endwhile.
          " Two mismatched positions make one transposition (truncating).
          transpositions = transpositions div 2.
          " Non-integer result: stringdistance must not be an integer type.
          stringdistance = commonmatches / ( 3 * firstlen )
                         + commonmatches / ( 3 * secondlen )
                         + ( commonmatches - transpositions ) / ( 3 * commonmatches ).
        endif.
      endif.
    endif.
  else.
    stringdistance = totalmismatchscore.
  endif.
endmethod.
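The method as listed returns the Jaro similarity; the Winkler refinement additionally rewards a shared prefix. Below is a minimal sketch of that step, not the author's code. It assumes a hypothetical static method WINKLER_ADJUST that would have to be added to the class with the signature given in the comments, a prefix cap of 4, and a scaling factor of 0.1.

```abap
method winkler_adjust.
  " Hypothetical helper, not part of the original class. Assumed signature:
  "   importing firstword  type string
  "             secondword type string
  "             jaro       type f        (score from STRINGDISTANCE)
  "   returning value(score) type f
  constants: max_prefix type i value 4,
             scaling    type f value '0.1'.
  data: prefixlen type i,
        limit     type i,
        len1      type i,
        len2      type i.

  " The prefix cannot be longer than the shorter word or the cap.
  len1 = strlen( firstword ).
  len2 = strlen( secondword ).
  limit = zcl_jaro_winkler=>math_min( num1 = len1 num2 = len2 ).
  limit = zcl_jaro_winkler=>math_min( num1 = limit num2 = max_prefix ).

  " Count leading characters that are identical in both words.
  prefixlen = 0.
  while prefixlen lt limit.
    if firstword+prefixlen(1) ne secondword+prefixlen(1).
      exit.
    endif.
    prefixlen = prefixlen + 1.
  endwhile.

  " Move the Jaro score towards 1.0 in proportion to the prefix length.
  score = jaro + prefixlen * scaling * ( 1 - jaro ).
endmethod.
```

A caller would pass the result of STRINGDISTANCE as JARO. Identical words keep a score of 1.0, and words with no common prefix keep their Jaro score unchanged.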

Method: GETCOMMONCHARACTERS

method getcommoncharacters.
  " Returns, in firstword order, the characters of firstword that also occur
  " in secondword within DISTANCE positions of their own position.
  data: firstlen         type i,
        secondlen        type i,
        i                type i,
        j                type i,
        ch               type c,
        foundit          type c,
        secondword_copy  type string,
        first_half       type string,
        second_half      type string,
        next_start_point type i,
        remaining_length type i.

  if ( not ( firstword is initial ) and not ( secondword is initial ) ).
    " Work on a copy so that matched characters can be blanked out.
    secondword_copy = secondword.
    firstlen = strlen( firstword ).
    secondlen = strlen( secondword ).
    i = 0.
    while i lt firstlen.
      ch = firstword+i(1).
      foundit = bool_false.
      " Scan the window [i - distance, i + distance) clipped to secondword.
      j = zcl_jaro_winkler=>math_max( num1 = 0 num2 = ( i - distance ) ).
      while ( ( foundit = bool_false ) and ( j lt zcl_jaro_winkler=>math_min( num1 = ( i + distance ) num2 = secondlen ) ) ).
        if ( secondword_copy+j(1) eq ch ).
          foundit = bool_true.
          concatenate commons ch into commons.
          " Replace the matched character with '#' so it cannot match again.
          move secondword_copy+0(j) to first_half.
          remaining_length = ( secondlen - j ) - 1.
          next_start_point = j + 1.
          move secondword_copy+next_start_point(remaining_length) to second_half.
          concatenate first_half '#' second_half into secondword_copy.
        endif.
        j = j + 1.
      endwhile.
      i = i + 1.
    endwhile.
  else.
    clear commons.
  endif.
endmethod.
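The two methods refer to constants (TOTALMATCHSCORE, TOTALMISMATCHSCORE, BOOL_TRUE, BOOL_FALSE) and helper methods (MATH_MIN, MATH_MAX) that are not listed in this post. The following is only a sketch of how they might be declared. All types, constant values, and the returning parameter name RESULT are assumptions; only the method and parameter names used in the calls above come from the listings.

```abap
class zcl_jaro_winkler definition public final create public.
  public section.
    " Assumed values and types; the original declarations are not shown.
    constants: totalmatchscore    type f value '1.0',
               totalmismatchscore type f value '0.0',
               bool_true          type c length 1 value 'X',
               bool_false         type c length 1 value ' '.

    class-methods stringdistance
      importing firstword  type string
                secondword type string
      returning value(stringdistance) type f.

    class-methods getcommoncharacters
      importing firstword  type string
                secondword type string
                distance   type i
      returning value(commons) type string.

    class-methods math_min
      importing num1 type i
                num2 type i
      returning value(result) type i.

    class-methods math_max
      importing num1 type i
                num2 type i
      returning value(result) type i.
endclass.

class zcl_jaro_winkler implementation.
  " STRINGDISTANCE and GETCOMMONCHARACTERS as listed in the post go here.

  method math_min.
    " Smaller of the two integers.
    if num1 lt num2.
      result = num1.
    else.
      result = num2.
    endif.
  endmethod.

  method math_max.
    " Larger of the two integers.
    if num1 gt num2.
      result = num1.
    else.
      result = num2.
    endif.
  endmethod.
endclass.
```

On releases that provide the built-in functions nmin and nmax, those could be used in place of the two helpers.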

Creating a test program:

This test program requires that a first name and last name be input by the user and then these names are compared against every employee for possible matches.

    " Score every employee against the names entered by the user.
    loop at t_pa0002 assigning <fs>.
      translate <fs>-nachn to upper case.
      translate <fs>-vorna to upper case.

      " Last name score
      move <fs>-nachn to l_compare.
      call method zcl_jaro_winkler=>stringdistance
        exporting
          firstword      = s_nachn
          secondword     = l_compare
        receiving
          stringdistance = <fs>-nachn_score.

      " First name score
      move <fs>-vorna to l_compare.
      call method zcl_jaro_winkler=>stringdistance
        exporting
          firstword      = s_vorna
          secondword     = l_compare
        receiving
          stringdistance = <fs>-vorna_score.

      " Combined score, at most 2.0
      <fs>-total_score = <fs>-nachn_score + <fs>-vorna_score.
    endloop.

    " Best candidates first
    sort t_pa0002 by total_score descending.

    format color col_normal.
    write: 80 'Total Score      ' color col_total.
    uline.
    loop at t_pa0002 assigning <fs>.
      write: / <fs>-nachn(20), <fs>-nachn_score, <fs>-vorna(20), <fs>-vorna_score.
      write 80 <fs>-total_score color col_total.
    endloop.

Deliberately misspelling my last name and shortening my first name:

Results:

I only show the top three results. In fact, since I’m creating a combined score from two separate measurements (2.0 at most), I would likely consider anything below 1.60 not useful in the real world. So now we have a safeguard against hiring when we should be rehiring. Of course, there are many other uses for this algorithm, such as ZIP code verification for ZIP+4 or a dictionary check for valid words.
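The 1.60 cut-off could be applied in the test program just before the sort. This is a sketch only: the constant name C_MIN_TOTAL is hypothetical, and it assumes TOTAL_SCORE is a numeric field of T_PA0002 as used in the loop above.

```abap
" Hypothetical cut-off, taken from the 1.60 figure mentioned in the text.
constants c_min_total type p length 3 decimals 2 value '1.60'.

" Drop every candidate whose combined score is below the cut-off,
" then list the remaining candidates best first.
delete t_pa0002 where total_score lt c_min_total.
sort t_pa0002 by total_score descending.
```

Filtering before the output loop keeps the suggestion list short when the comparison runs against every employee.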
