To tackle a real world problem of the hiring department performing a new hire instead of a rehire for seasonal contractors, I decided to implement the Jaro-Winkler algorithm in SAP.   Since there is a policy to not store key information on contractors (such as SSN), there is only the ability to match on a person’s name.  Unfortunately, the name is not always typed correctly which leads to the inability to find a previous hire and this leads to hiring a new contractor instead of rehiring a contractor.  Using the Jaro-Winkler algorithm, we are now able to suggest possible similar contractors based on the string comparison of first and last name.  Jaro-Winkler calculates the distance (a measure of similarity) between strings.  The measurement scale is 0.0 to 1.0, where 0.0 is the least likely and 1.0 is a positive match.  For our purposes, anything below a 0.8 is not considered useful.

“Jaro–Winkler distance (Winkler, 1990) is a measure of similarity between two strings”.

See Wiki link:  http://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance  for more in depth information.

/wp-content/uploads/2013/12/p_336669.jpg

The Class:  ZCL_JARO_WINKLER

Method: STRINGDISTANCE

method stringdistance.

  data: firstlen type i, secondlen type i, halflen type i, commonmatches type i, common1 type string, common2 type string.

  data: transpositions type i, i type i, j type i.

  if ( not ( firstword is initial ) and not ( secondword is initial ) ).

    if ( firstword eq secondword ).

      stringdistance = totalmatchscore.

    else.

      firstlen = strlen( firstword ).

      secondlen = strlen( secondword ).

      halflen = zcl_jaro_winkler=>math_min( num1 = firstlen num2 = secondlen ) / 2 + 1.

      common1 = zcl_jaro_winkler=>getcommoncharacters( firstword = firstword secondword = secondword distance = halflen ).

      commonmatches = strlen( common1 ).

      if ( commonmatches eq 0 ).

        stringdistance = totalmismatchscore.

      else.

        common2 = zcl_jaro_winkler=>getcommoncharacters( firstword = secondword secondword = firstword distance = halflen ).

        if ( commonmatches ne strlen( common2 ) ).

          stringdistance = totalmismatchscore.

        else.

          transpositions = 0.

          i = 0.

          while ( i lt commonmatches ).

            if ( common1+i(1) ne common2+i(1) ).

              transpositions = transpositions + 1.

            endif.

            i = i + 1.

          endwhile.

          transpositions = transpositions / 2.

          stringdistance

               = commonmatches / ( 3 * firstlen ) + commonmatches / ( 3 * secondlen ) + ( commonmatches transpositions ) / ( 3 * commonmatches ).

        endif.

      endif.

    endif.

  else.

    stringdistance = totalmismatchscore.

  endif.

endmethod.

Method: GETCOMMONCHARACTERS

method getcommoncharacters.

  data: firstlen type i, secondlen type i, i type i, j type i, ch type c, foundit type c, secondword_copy type string.

  data: first_half type string, second_half type string, next_start_point type i, remaining_length type i.

  if ( not ( firstword is initial ) and not ( secondword is initial ) ).

    secondword_copy = secondword.

    firstlen = strlen( firstword ).

    secondlen = strlen( secondword ).

    i = 0.

    while i lt firstlen.

      ch = firstword+i(1).

      foundit = bool_false.

      j = zcl_jaro_winkler=>math_max( num1 = 0 num2 = ( i distance ) ).

      while ( ( foundit = bool_false ) and ( j lt zcl_jaro_winkler=>math_min( num1 = ( i + distance ) num2 = secondlen ) ) ).

        if ( secondword_copy+j(1) eq ch ).

          foundit = bool_true.

          concatenate commons ch into commons.

          move secondword_copy+0(j) to first_half.

          remaining_length = ( secondlen j ) 1.

          next_start_point = j + 1.

          move secondword_copy+next_start_point(remaining_length) to second_half.

          concatenate first_half ‘#’ second_half into secondword_copy.

        endif.

        j = j + 1.

      endwhile.

      i = i + 1.

    endwhile.

  else.

    clear commons.

  endif.

endmethod.

Creating a test program:

This test program requires that a first name and last name be input by the user and then these names are compared against every employee for possible matches.

    loop at t_pa0002 assigning <fs>.

      translate <fs>nachn to upper case.

      translate <fs>vorna to upper case.

      move <fs>nachn to l_compare.

      call method zcl_jaro_winkler=>stringdistance

        exporting

          firstword      = s_nachn

          secondword     = l_compare

        receiving

          stringdistance = <fs>nachn_score.

      move <fs>vorna to l_compare.

      call method zcl_jaro_winkler=>stringdistance

        exporting

          firstword      = s_vorna

          secondword     = l_compare

        receiving

          stringdistance = <fs>vorna_score.

      <fs>total_score = <fs>nachn_score + <fs>vorna_score.

    endloop.

    sort t_pa0002 by total_score descending.

    format color col_normal.

    write: 80 ‘Total Score      ‘ color col_total.

    uline.

    loop at t_pa0002 assigning <fs>.

      write: / <fs>nachn(20), <fs>nachn_score, <fs>vorna(20), <fs>vorna_score.

      write 80 <fs>total_score color col_total.

    endloop.

Deliberately misspelling my last name and shortening my first name:

/wp-content/uploads/2013/12/p_336669.jpg

Results:

/wp-content/uploads/2013/12/p_336669.jpg

I only show the top three results.  In fact, since I’m creating a combined score of two different measurements, I would likely consider anything below a 1.60 as not useful in real world.  So now we have a safeguard to alleviate hiring when we should be re-hiring.  Of course, there are many uses for this algorith, such as zip code verification for zip+4.  Also, another use is using as a dictionary check for valid words.

To report this post you need to login first.

8 Comments

You must be Logged on to comment or reply to a post.

    1. Philip Johnston Post author

      method math_min.

       

         if num1 <= num2.

           result = num1.

         else.

           result = num2.

         endif.

       

      endmethod.

      (0) 
    2. Philip Johnston Post author

      method math_max.

       

         if num1 >= num2.

           result = num1.

         else.

           result = num2.

         endif.

       

      endmethod.

      (0) 
  1. Raphael Pacheco

    Really brilliant, but I think you should exemplify best classes because not everyone here is developers and would be a good you have a solution easy to implement. Adding more, I think it would be a good you place your project within open source projects here. (This space).


    Warm regards,


    Raphael Pacheco.

    (0) 
  2. Sandra Rossi

    Thanks for sharing. In SAP, there’s also a function module named RS_COMPARE_WORDS_SIMILAR, but no idea which algorithm it implements…

    (0) 
  3. Christian Dr. Drumm

    Hi Philipp,

    what is the reason for implementing name matching using Jaro-Winkle? Have you also tested other techniques?

    The reason I’m asking is that I probably would have chosen the naive implementation using the Levenshtein distance. This is available in ABAP as string distance function (ABAP Keyword Documentation).

    Christian

    (0) 

Leave a Reply