Skip to Content
Author's profile photo Philip Johnston

Jaro–Winkler Distance Algorithm

To tackle a real world problem of the hiring department performing a new hire instead of a rehire for seasonal contractors, I decided to implement the Jaro-Winkler algorithm in SAP.   Since there is a policy to not store key information on contractors (such as SSN), there is only the ability to match on a person’s name.  Unfortunately, the name is not always typed correctly which leads to the inability to find a previous hire and this leads to hiring a new contractor instead of rehiring a contractor.  Using the Jaro-Winkler algorithm, we are now able to suggest possible similar contractors based on the string comparison of first and last name.  Jaro-Winkler calculates the distance (a measure of similarity) between strings.  The measurement scale is 0.0 to 1.0, where 0.0 is the least likely and 1.0 is a positive match.  For our purposes, anything below a 0.8 is not considered useful.

“Jaro–Winkler distance (Winkler, 1990) is a measure of similarity between two strings”.

See Wiki link:  http://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance  for more in depth information.

/wp-content/uploads/2013/12/p_336669.jpg

The Class:  ZCL_JARO_WINKLER

Method: STRINGDISTANCE

method stringdistance.

  data: firstlen type i, secondlen type i, halflen type i, commonmatches type i, common1 type string, common2 type string.

  data: transpositions type i, i type i, j type i.

  if ( not ( firstword is initial ) and not ( secondword is initial ) ).

    if ( firstword eq secondword ).

      stringdistance = totalmatchscore.

    else.

      firstlen = strlen( firstword ).

      secondlen = strlen( secondword ).

      halflen = zcl_jaro_winkler=>math_min( num1 = firstlen num2 = secondlen ) / 2 + 1.

      common1 = zcl_jaro_winkler=>getcommoncharacters( firstword = firstword secondword = secondword distance = halflen ).

      commonmatches = strlen( common1 ).

      if ( commonmatches eq 0 ).

        stringdistance = totalmismatchscore.

      else.

        common2 = zcl_jaro_winkler=>getcommoncharacters( firstword = secondword secondword = firstword distance = halflen ).

        if ( commonmatches ne strlen( common2 ) ).

          stringdistance = totalmismatchscore.

        else.

          transpositions = 0.

          i = 0.

          while ( i lt commonmatches ).

            if ( common1+i(1) ne common2+i(1) ).

              transpositions = transpositions + 1.

            endif.

            i = i + 1.

          endwhile.

          transpositions = transpositions / 2.

          stringdistance

               = commonmatches / ( 3 * firstlen ) + commonmatches / ( 3 * secondlen ) + ( commonmatches transpositions ) / ( 3 * commonmatches ).

        endif.

      endif.

    endif.

  else.

    stringdistance = totalmismatchscore.

  endif.

endmethod.

Method: GETCOMMONCHARACTERS

method getcommoncharacters.

  data: firstlen type i, secondlen type i, i type i, j type i, ch type c, foundit type c, secondword_copy type string.

  data: first_half type string, second_half type string, next_start_point type i, remaining_length type i.

  if ( not ( firstword is initial ) and not ( secondword is initial ) ).

    secondword_copy = secondword.

    firstlen = strlen( firstword ).

    secondlen = strlen( secondword ).

    i = 0.

    while i lt firstlen.

      ch = firstword+i(1).

      foundit = bool_false.

      j = zcl_jaro_winkler=>math_max( num1 = 0 num2 = ( i distance ) ).

      while ( ( foundit = bool_false ) and ( j lt zcl_jaro_winkler=>math_min( num1 = ( i + distance ) num2 = secondlen ) ) ).

        if ( secondword_copy+j(1) eq ch ).

          foundit = bool_true.

          concatenate commons ch into commons.

          move secondword_copy+0(j) to first_half.

          remaining_length = ( secondlen j ) 1.

          next_start_point = j + 1.

          move secondword_copy+next_start_point(remaining_length) to second_half.

          concatenate first_half ‘#’ second_half into secondword_copy.

        endif.

        j = j + 1.

      endwhile.

      i = i + 1.

    endwhile.

  else.

    clear commons.

  endif.

endmethod.

Creating a test program:

This test program requires that a first name and last name be input by the user and then these names are compared against every employee for possible matches.

    loop at t_pa0002 assigning <fs>.

      translate <fs>nachn to upper case.

      translate <fs>vorna to upper case.

      move <fs>nachn to l_compare.

      call method zcl_jaro_winkler=>stringdistance

        exporting

          firstword      = s_nachn

          secondword     = l_compare

        receiving

          stringdistance = <fs>nachn_score.

      move <fs>vorna to l_compare.

      call method zcl_jaro_winkler=>stringdistance

        exporting

          firstword      = s_vorna

          secondword     = l_compare

        receiving

          stringdistance = <fs>vorna_score.

      <fs>total_score = <fs>nachn_score + <fs>vorna_score.

    endloop.

    sort t_pa0002 by total_score descending.

    format color col_normal.

    write: 80 ‘Total Score      ‘ color col_total.

    uline.

    loop at t_pa0002 assigning <fs>.

      write: / <fs>nachn(20), <fs>nachn_score, <fs>vorna(20), <fs>vorna_score.

      write 80 <fs>total_score color col_total.

    endloop.

Deliberately misspelling my last name and shortening my first name:

/wp-content/uploads/2013/12/p_336669.jpg

Results:

/wp-content/uploads/2013/12/p_336669.jpg

I only show the top three results.  In fact, since I’m creating a combined score of two different measurements, I would likely consider anything below a 1.60 as not useful in real world.  So now we have a safeguard to alleviate hiring when we should be re-hiring.  Of course, there are many uses for this algorith, such as zip code verification for zip+4.  Also, another use is using as a dictionary check for valid words.

Assigned Tags

      15 Comments
      You must be Logged on to comment or reply to a post.
      Author's profile photo Christopher Solomon
      Christopher Solomon

      This is brilliant! I can see a LOT of uses for this.  Great work and thanks for sharing!

      Author's profile photo Christian Koch
      Christian Koch

      Hi, very interesting! But what kind of method is zcl_jaro_winkler=>math_min?

      Thx and greets

      Chris

      Author's profile photo Philip Johnston
      Philip Johnston
      Blog Post Author

      method math_min.

       

         if num1 <= num2.

           result = num1.

         else.

           result = num2.

         endif.

       

      endmethod.

      Author's profile photo Philip Johnston
      Philip Johnston
      Blog Post Author

      method math_max.

       

         if num1 >= num2.

           result = num1.

         else.

           result = num2.

         endif.

       

      endmethod.

      Author's profile photo Raphael Pacheco
      Raphael Pacheco

      Really brilliant, but I think you should exemplify best classes because not everyone here is developers and would be a good you have a solution easy to implement. Adding more, I think it would be a good you place your project within open source projects here. (This space).


      Warm regards,


      Raphael Pacheco.

      Author's profile photo Sandra Rossi
      Sandra Rossi

      Thanks for sharing. In SAP, there's also a function module named RS_COMPARE_WORDS_SIMILAR, but no idea which algorithm it implements...

      Author's profile photo Christian Drumm
      Christian Drumm

      Hi Philipp,

      what is the reason for implementing name matching using Jaro-Winkle? Have you also tested other techniques?

      The reason I'm asking is that I probably would have chosen the naive implementation using the Levenshtein distance. This is available in ABAP as string distance function (ABAP Keyword Documentation).

      Christian

      Author's profile photo Philip Johnston
      Philip Johnston
      Blog Post Author

      Here's a great reference for all your pattern matching questions:

       

      http://users.cecs.anu.edu.au/~Peter.Christen/publications/tr-cs-06-02.pdf

       

      Speed Comparison of the four Jaro and Levenshtein algorithms ( from fastest to slowest ):

      • Jaro
      • Jaro-Winkler
      • Levenshtein
      • Damerau-Levenshtein

      with the slowest taking 2 to 3 times as long as the fastest.

      Author's profile photo avis alab
      avis alab

      Hi Gurus,

       

      I tried copying the code and creating the class but it is not working for me. I'm sure I did not do it right. Can I have the complete working code? It is not clear what the return parameters and the types basically I guess. I was able to compile everything and when I excute the program it is always returning 0. Totalmatchscore and totalmismatchscore I defined as constants with values 0 and 1 respectively. is that right?

      I would appreciate if someone can help me on this.

       

      Thanks,

      Siva

      Author's profile photo Sandra Rossi
      Sandra Rossi

      Why not using the distance lehvenstein function?

      Author's profile photo Sandra Rossi
      Sandra Rossi

      Code with missing parts:

      CLASS zcl_jaro_winkler DEFINITION.
        PUBLIC SECTION.
          TYPES ty_distance TYPE p LENGTH 6 DECIMALS 2.
          CLASS-METHODS stringdistance
            IMPORTING
              firstword             TYPE string
              secondword            TYPE string
            RETURNING
              VALUE(stringdistance) TYPE ty_distance.
          CLASS-METHODS getcommoncharacters
            IMPORTING
              firstword      TYPE string
              secondword     TYPE string
              distance       TYPE numeric
            RETURNING
              VALUE(commons) TYPE string.
          CLASS-METHODS math_min
            IMPORTING
              num1       TYPE numeric
              num2       TYPE numeric
            RETURNING
              VALUE(min) TYPE ty_distance.
          CLASS-METHODS math_max
            IMPORTING
              num1       TYPE numeric
              num2       TYPE numeric
            RETURNING
              VALUE(max) TYPE ty_distance.
          CLASS-DATA: totalmatchscore    TYPE ty_distance,
                      totalmismatchscore TYPE ty_distance.
          CONSTANTS: bool_false TYPE abap_bool VALUE abap_false,
                     bool_true  TYPE abap_bool VALUE abap_true.
      ENDCLASS.
      
      CLASS zcl_jaro_winkler IMPLEMENTATION.
      
        METHOD stringdistance.
          DATA: firstlen      TYPE i, secondlen TYPE i, halflen TYPE i, commonmatches TYPE i, common1 TYPE string, common2 TYPE string.
          DATA: transpositions TYPE i, i TYPE i, j TYPE i.
          IF ( NOT ( firstword IS INITIAL ) AND NOT ( secondword IS INITIAL ) ).
            IF ( firstword EQ secondword ).
              stringdistance = totalmatchscore.
            ELSE.
              firstlen = strlen( firstword ).
              secondlen = strlen( secondword ).
              halflen = zcl_jaro_winkler=>math_min( num1 = firstlen num2 = secondlen ) / 2 + 1.
              common1 = zcl_jaro_winkler=>getcommoncharacters( firstword = firstword secondword = secondword distance = halflen ).
              commonmatches = strlen( common1 ).
              IF ( commonmatches EQ 0 ).
                stringdistance = totalmismatchscore.
              ELSE.
                common2 = zcl_jaro_winkler=>getcommoncharacters( firstword = secondword secondword = firstword distance = halflen ).
                IF ( commonmatches NE strlen( common2 ) ).
                  stringdistance = totalmismatchscore.
                ELSE.
                  transpositions = 0.
                  i = 0.
                  WHILE ( i LT commonmatches ).
                    IF ( common1+i(1) NE common2+i(1) ).
                      transpositions = transpositions + 1.
                    ENDIF.
                    i = i + 1.
                  ENDWHILE.
                  transpositions = transpositions / 2.
                  stringdistance
                       = commonmatches / ( 3 * firstlen ) + commonmatches / ( 3 * secondlen ) + ( commonmatches - transpositions ) / ( 3 * commonmatches ).
                ENDIF.
              ENDIF.
            ENDIF.
          ELSE.
            stringdistance = totalmismatchscore.
          ENDIF.
        ENDMETHOD.
      
        METHOD getcommoncharacters.
          DATA: firstlen        TYPE i, secondlen TYPE i, i TYPE i, j TYPE i, ch TYPE c, foundit TYPE c, secondword_copy TYPE string.
          DATA: first_half       TYPE string, second_half TYPE string, next_start_point TYPE i, remaining_length TYPE i.
          IF ( NOT ( firstword IS INITIAL ) AND NOT ( secondword IS INITIAL ) ).
            secondword_copy = secondword.
            firstlen = strlen( firstword ).
            secondlen = strlen( secondword ).
            i = 0.
            WHILE i LT firstlen.
              ch = firstword+i(1).
              foundit = bool_false.
              j = zcl_jaro_winkler=>math_max( num1 = 0 num2 = ( i - distance ) ).
              WHILE ( ( foundit = bool_false ) AND ( j LT zcl_jaro_winkler=>math_min( num1 = ( i + distance ) num2 = secondlen ) ) ).
                IF ( secondword_copy+j(1) EQ ch ).
                  foundit = bool_true.
                  CONCATENATE commons ch INTO commons.
                  MOVE secondword_copy+0(j) TO first_half.
                  remaining_length = ( secondlen - j ) - 1.
                  next_start_point = j + 1.
                  MOVE secondword_copy+next_start_point(remaining_length) TO second_half.
                  CONCATENATE first_half '#' second_half INTO secondword_copy.
                ENDIF.
                j = j + 1.
              ENDWHILE.
              i = i + 1.
            ENDWHILE.
          ELSE.
            CLEAR commons.
          ENDIF.
        ENDMETHOD.
      
        METHOD math_max.
          max = nmax( val1 = num1 val2 = num2 ).
        ENDMETHOD.
      
        METHOD math_min.
          min = nmin( val1 = num1 val2 = num2 ).
        ENDMETHOD.
      
      ENDCLASS.
      
      PARAMETERS s_nachn TYPE string DEFAULT 'HOHNSTON'.
      PARAMETERS s_vorna TYPE string DEFAULT 'PHIL'.
      
      START-OF-SELECTION.
        DATA: l_compare TYPE string.
        TYPES: BEGIN OF ty_repository_line,
                 nachn       TYPE c LENGTH 30,
                 vorna       TYPE c LENGTH 30,
                 nachn_score TYPE zcl_jaro_winkler=>ty_distance,
                 vorna_score TYPE zcl_jaro_winkler=>ty_distance,
                 total_score TYPE zcl_jaro_winkler=>ty_distance,
               END OF ty_repository_line,
               ty_repository_lines TYPE STANDARD TABLE OF ty_repository_line WITH EMPTY KEY.
        DATA(t_pa0002) = VALUE ty_repository_lines(
            ( nachn = 'JOHNSTON' vorna = 'PHILIP' )
            ( nachn = 'HOUSTON'  vorna = 'DANIEL' )
            ( nachn = 'CHUMCHAL' vorna = 'PHILLIP' ) ).
        LOOP AT t_pa0002 ASSIGNING FIELD-SYMBOL(<fs>).
          TRANSLATE <fs>-nachn TO UPPER CASE.
          TRANSLATE <fs>-vorna TO UPPER CASE.
          MOVE <fs>-nachn TO l_compare.
          CALL METHOD zcl_jaro_winkler=>stringdistance
            EXPORTING
              firstword      = s_nachn
              secondword     = l_compare
            RECEIVING
              stringdistance = <fs>-nachn_score.
          MOVE <fs>-vorna TO l_compare.
          CALL METHOD zcl_jaro_winkler=>stringdistance
            EXPORTING
              firstword      = s_vorna
              secondword     = l_compare
            RECEIVING
              stringdistance = <fs>-vorna_score.
          <fs>-total_score = <fs>-nachn_score + <fs>-vorna_score.
        ENDLOOP.
        SORT t_pa0002 BY total_score DESCENDING.
        FORMAT COLOR COL_NORMAL.
        WRITE: 80 'Total Score      ' COLOR COL_TOTAL.
        ULINE.
        LOOP AT t_pa0002 ASSIGNING <fs>.
          WRITE: / <fs>-nachn(20), <fs>-nachn_score, <fs>-vorna(20), <fs>-vorna_score.
          WRITE 80 <fs>-total_score COLOR COL_TOTAL.
        ENDLOOP.
      Author's profile photo avis alab
      avis alab

      Hi Sandra Rossi,

       

      Thanks for the code. It works!! I appreciate the help.

      Author's profile photo avis alab
      avis alab

      One thing I don't understand is if there is no match then also it returns 0.00 which doesn't make sense to me. This is what I pass and compare with first and last names in a table

      Author's profile photo Oleg Bashkatov
      Oleg Bashkatov

      In S4CORE there is a class CL_RSH_MATCH_STRING, which is going to calc Jaro-Winkler Similarity

      Author's profile photo Oleg Bashkatov
      Oleg Bashkatov

      My realization of jaro-winkler similarity

      definition

        methods ABAP_JARO_WINKLER_CUSTOM
          importing
            !SRC_STR type STRING
            !TRG_STR type STRING
          exporting
            !EV_SIM_JARO_WINKLER type DECFLOAT16
            !EV_SIM_JARO type DECFLOAT16 .

      implementation

        METHOD abap_jaro_winkler_custom.
          CONSTANTS: winkler_increase TYPE p LENGTH 3 DECIMALS 2 VALUE '0.1'.
          TYPES: BEGIN OF ts_matrix_cell
                  , i TYPE syindex
                  , j TYPE syindex
                  , src_symbol TYPE char1
                  , trg_symbol TYPE char1
                  , if_equal TYPE syindex
                , END OF ts_matrix_cell
                , tt_matrix TYPE SORTED TABLE OF ts_matrix_cell
                  WITH UNIQUE KEY primary_key COMPONENTS i j.
      
          DATA lt_matrix TYPE tt_matrix.
          DATA ls_matrix_cell TYPE ts_matrix_cell.
          DATA src_str_length TYPE syindex.
          DATA trg_str_length TYPE syindex.
          DATA match_char_lim  TYPE syindex.
          DATA matching_chars  TYPE syindex.
          DATA transpose  TYPE syindex.
          DATA transpose_half  TYPE syindex.
          DATA similar_prefix_of_str TYPE syindex.
          DATA symbol_index_position_i TYPE syindex.
          DATA symbol_index_position_j TYPE syindex.
      
          FIELD-SYMBOLS <fs_matrix> TYPE ts_matrix_cell.
      
          """""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
          src_str_length = strlen( src_str ).
          trg_str_length = strlen( trg_str ).
          match_char_lim = floor( nmax( val1 = src_str_length val2 = trg_str_length ) / 2 ) - 1 .
      
          """""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
          CLEAR lt_matrix.
          " create matrix 
          DO ( src_str_length ) TIMES.
            ls_matrix_cell-i = sy-index.
            DO ( trg_str_length ) TIMES.
              ls_matrix_cell-j = sy-index.
              INSERT ls_matrix_cell INTO TABLE lt_matrix.
            ENDDO.
          ENDDO.
          """""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
          " check equlaty of elements
          LOOP AT lt_matrix ASSIGNING <fs_matrix>.
            symbol_index_position_i = <fs_matrix>-i - 1.
            symbol_index_position_j = <fs_matrix>-j - 1.
            <fs_matrix>-src_symbol = src_str+symbol_index_position_i(1).
            <fs_matrix>-trg_symbol = trg_str+symbol_index_position_j(1).
            IF <fs_matrix>-src_symbol EQ <fs_matrix>-trg_symbol.
              <fs_matrix>-if_equal = 1.
            ELSE.
              <fs_matrix>-if_equal = 0.
            ENDIF.
          ENDLOOP.
          """""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
          """""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
          " matching_chars and transpose_half
          " this is separate steps of the algo - so we do it in separate loops to be clean
          matching_chars = 0.
          transpose_half = 0.
          LOOP AT lt_matrix ASSIGNING <fs_matrix> WHERE if_equal EQ 1.
            IF abs( <fs_matrix>-i - <fs_matrix>-j ) GT  match_char_lim.
              CONTINUE.
            ENDIF.
            matching_chars = matching_chars + 1.
            IF <fs_matrix>-i EQ <fs_matrix>-j.
            ELSE.
              transpose = transpose + 1.
            ENDIF.
          ENDLOOP.
          transpose_half = transpose / 2.
          """""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
          """""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
          " calculating JaroSimilarity
          ev_sim_jaro = (   ( matching_chars / src_str_length )
                          + ( matching_chars / trg_str_length )
                          + ( ( matching_chars - transpose_half ) / matching_chars )
                         ) / 3.
      
          """""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
          similar_prefix_of_str = 0.
          DO.
            symbol_index_position_i = sy-index - 1.
            IF src_str+symbol_index_position_i(1) EQ trg_str+symbol_index_position_i(1).
              similar_prefix_of_str = similar_prefix_of_str + 1.
            ELSE.
              EXIT.
            ENDIF.
            IF similar_prefix_of_str GE 4.
              EXIT.
            ENDIF.
          ENDDO.
          """""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
          " calculating Jaro-Winkler Similarity
          ev_sim_jaro_winkler =
          ev_sim_jaro + ( similar_prefix_of_str * winkler_increase ) * ( 1 - ev_sim_jaro ).
        ENDMETHOD.