Skip to Content

Jaro–Winkler Distance Algorithm

To tackle a real world problem of the hiring department performing a new hire instead of a rehire for seasonal contractors, I decided to implement the Jaro-Winkler algorithm in SAP.   Since there is a policy to not store key information on contractors (such as SSN), there is only the ability to match on a person’s name.  Unfortunately, the name is not always typed correctly which leads to the inability to find a previous hire and this leads to hiring a new contractor instead of rehiring a contractor.  Using the Jaro-Winkler algorithm, we are now able to suggest possible similar contractors based on the string comparison of first and last name.  Jaro-Winkler calculates the distance (a measure of similarity) between strings.  The measurement scale is 0.0 to 1.0, where 0.0 is the least likely and 1.0 is a positive match.  For our purposes, anything below a 0.8 is not considered useful.

“Jaro–Winkler distance (Winkler, 1990) is a measure of similarity between two strings”.

See Wiki link:  http://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance  for more in depth information.

/wp-content/uploads/2013/12/p_336669.jpg

The Class:  ZCL_JARO_WINKLER

Method: STRINGDISTANCE

method stringdistance.

  data: firstlen type i, secondlen type i, halflen type i, commonmatches type i, common1 type string, common2 type string.

  data: transpositions type i, i type i, j type i.

  if ( not ( firstword is initial ) and not ( secondword is initial ) ).

    if ( firstword eq secondword ).

      stringdistance = totalmatchscore.

    else.

      firstlen = strlen( firstword ).

      secondlen = strlen( secondword ).

      halflen = zcl_jaro_winkler=>math_min( num1 = firstlen num2 = secondlen ) / 2 + 1.

      common1 = zcl_jaro_winkler=>getcommoncharacters( firstword = firstword secondword = secondword distance = halflen ).

      commonmatches = strlen( common1 ).

      if ( commonmatches eq 0 ).

        stringdistance = totalmismatchscore.

      else.

        common2 = zcl_jaro_winkler=>getcommoncharacters( firstword = secondword secondword = firstword distance = halflen ).

        if ( commonmatches ne strlen( common2 ) ).

          stringdistance = totalmismatchscore.

        else.

          transpositions = 0.

          i = 0.

          while ( i lt commonmatches ).

            if ( common1+i(1) ne common2+i(1) ).

              transpositions = transpositions + 1.

            endif.

            i = i + 1.

          endwhile.

          transpositions = transpositions / 2.

          stringdistance

               = commonmatches / ( 3 * firstlen ) + commonmatches / ( 3 * secondlen ) + ( commonmatches transpositions ) / ( 3 * commonmatches ).

        endif.

      endif.

    endif.

  else.

    stringdistance = totalmismatchscore.

  endif.

endmethod.

Method: GETCOMMONCHARACTERS

method getcommoncharacters.

  data: firstlen type i, secondlen type i, i type i, j type i, ch type c, foundit type c, secondword_copy type string.

  data: first_half type string, second_half type string, next_start_point type i, remaining_length type i.

  if ( not ( firstword is initial ) and not ( secondword is initial ) ).

    secondword_copy = secondword.

    firstlen = strlen( firstword ).

    secondlen = strlen( secondword ).

    i = 0.

    while i lt firstlen.

      ch = firstword+i(1).

      foundit = bool_false.

      j = zcl_jaro_winkler=>math_max( num1 = 0 num2 = ( i distance ) ).

      while ( ( foundit = bool_false ) and ( j lt zcl_jaro_winkler=>math_min( num1 = ( i + distance ) num2 = secondlen ) ) ).

        if ( secondword_copy+j(1) eq ch ).

          foundit = bool_true.

          concatenate commons ch into commons.

          move secondword_copy+0(j) to first_half.

          remaining_length = ( secondlen j ) 1.

          next_start_point = j + 1.

          move secondword_copy+next_start_point(remaining_length) to second_half.

          concatenate first_half ‘#’ second_half into secondword_copy.

        endif.

        j = j + 1.

      endwhile.

      i = i + 1.

    endwhile.

  else.

    clear commons.

  endif.

endmethod.

Creating a test program:

This test program requires that a first name and last name be input by the user and then these names are compared against every employee for possible matches.

    loop at t_pa0002 assigning <fs>.

      translate <fs>nachn to upper case.

      translate <fs>vorna to upper case.

      move <fs>nachn to l_compare.

      call method zcl_jaro_winkler=>stringdistance

        exporting

          firstword      = s_nachn

          secondword     = l_compare

        receiving

          stringdistance = <fs>nachn_score.

      move <fs>vorna to l_compare.

      call method zcl_jaro_winkler=>stringdistance

        exporting

          firstword      = s_vorna

          secondword     = l_compare

        receiving

          stringdistance = <fs>vorna_score.

      <fs>total_score = <fs>nachn_score + <fs>vorna_score.

    endloop.

    sort t_pa0002 by total_score descending.

    format color col_normal.

    write: 80 ‘Total Score      ‘ color col_total.

    uline.

    loop at t_pa0002 assigning <fs>.

      write: / <fs>nachn(20), <fs>nachn_score, <fs>vorna(20), <fs>vorna_score.

      write 80 <fs>total_score color col_total.

    endloop.

Deliberately misspelling my last name and shortening my first name:

/wp-content/uploads/2013/12/p_336669.jpg

Results:

/wp-content/uploads/2013/12/p_336669.jpg

I only show the top three results.  In fact, since I’m creating a combined score of two different measurements, I would likely consider anything below a 1.60 as not useful in real world.  So now we have a safeguard to alleviate hiring when we should be re-hiring.  Of course, there are many uses for this algorith, such as zip code verification for zip+4.  Also, another use is using as a dictionary check for valid words.

15 Comments
You must be Logged on to comment or reply to a post.
  • Really brilliant, but I think you should exemplify best classes because not everyone here is developers and would be a good you have a solution easy to implement. Adding more, I think it would be a good you place your project within open source projects here. (This space).


    Warm regards,


    Raphael Pacheco.

  • Thanks for sharing. In SAP, there's also a function module named RS_COMPARE_WORDS_SIMILAR, but no idea which algorithm it implements...

  • Hi Philipp,

    what is the reason for implementing name matching using Jaro-Winkle? Have you also tested other techniques?

    The reason I'm asking is that I probably would have chosen the naive implementation using the Levenshtein distance. This is available in ABAP as string distance function (ABAP Keyword Documentation).

    Christian

  • Hi Gurus,

     

    I tried copying the code and creating the class but it is not working for me. I'm sure I did not do it right. Can I have the complete working code? It is not clear what the return parameters and the types basically I guess. I was able to compile everything and when I excute the program it is always returning 0. Totalmatchscore and totalmismatchscore I defined as constants with values 0 and 1 respectively. is that right?

    I would appreciate if someone can help me on this.

     

    Thanks,

    Siva

    • Code with missing parts:

      CLASS zcl_jaro_winkler DEFINITION.
        PUBLIC SECTION.
          TYPES ty_distance TYPE p LENGTH 6 DECIMALS 2.
          CLASS-METHODS stringdistance
            IMPORTING
              firstword             TYPE string
              secondword            TYPE string
            RETURNING
              VALUE(stringdistance) TYPE ty_distance.
          CLASS-METHODS getcommoncharacters
            IMPORTING
              firstword      TYPE string
              secondword     TYPE string
              distance       TYPE numeric
            RETURNING
              VALUE(commons) TYPE string.
          CLASS-METHODS math_min
            IMPORTING
              num1       TYPE numeric
              num2       TYPE numeric
            RETURNING
              VALUE(min) TYPE ty_distance.
          CLASS-METHODS math_max
            IMPORTING
              num1       TYPE numeric
              num2       TYPE numeric
            RETURNING
              VALUE(max) TYPE ty_distance.
          CLASS-DATA: totalmatchscore    TYPE ty_distance,
                      totalmismatchscore TYPE ty_distance.
          CONSTANTS: bool_false TYPE abap_bool VALUE abap_false,
                     bool_true  TYPE abap_bool VALUE abap_true.
      ENDCLASS.
      
      CLASS zcl_jaro_winkler IMPLEMENTATION.
      
        METHOD stringdistance.
          DATA: firstlen      TYPE i, secondlen TYPE i, halflen TYPE i, commonmatches TYPE i, common1 TYPE string, common2 TYPE string.
          DATA: transpositions TYPE i, i TYPE i, j TYPE i.
          IF ( NOT ( firstword IS INITIAL ) AND NOT ( secondword IS INITIAL ) ).
            IF ( firstword EQ secondword ).
              stringdistance = totalmatchscore.
            ELSE.
              firstlen = strlen( firstword ).
              secondlen = strlen( secondword ).
              halflen = zcl_jaro_winkler=>math_min( num1 = firstlen num2 = secondlen ) / 2 + 1.
              common1 = zcl_jaro_winkler=>getcommoncharacters( firstword = firstword secondword = secondword distance = halflen ).
              commonmatches = strlen( common1 ).
              IF ( commonmatches EQ 0 ).
                stringdistance = totalmismatchscore.
              ELSE.
                common2 = zcl_jaro_winkler=>getcommoncharacters( firstword = secondword secondword = firstword distance = halflen ).
                IF ( commonmatches NE strlen( common2 ) ).
                  stringdistance = totalmismatchscore.
                ELSE.
                  transpositions = 0.
                  i = 0.
                  WHILE ( i LT commonmatches ).
                    IF ( common1+i(1) NE common2+i(1) ).
                      transpositions = transpositions + 1.
                    ENDIF.
                    i = i + 1.
                  ENDWHILE.
                  transpositions = transpositions / 2.
                  stringdistance
                       = commonmatches / ( 3 * firstlen ) + commonmatches / ( 3 * secondlen ) + ( commonmatches - transpositions ) / ( 3 * commonmatches ).
                ENDIF.
              ENDIF.
            ENDIF.
          ELSE.
            stringdistance = totalmismatchscore.
          ENDIF.
        ENDMETHOD.
      
        METHOD getcommoncharacters.
          DATA: firstlen        TYPE i, secondlen TYPE i, i TYPE i, j TYPE i, ch TYPE c, foundit TYPE c, secondword_copy TYPE string.
          DATA: first_half       TYPE string, second_half TYPE string, next_start_point TYPE i, remaining_length TYPE i.
          IF ( NOT ( firstword IS INITIAL ) AND NOT ( secondword IS INITIAL ) ).
            secondword_copy = secondword.
            firstlen = strlen( firstword ).
            secondlen = strlen( secondword ).
            i = 0.
            WHILE i LT firstlen.
              ch = firstword+i(1).
              foundit = bool_false.
              j = zcl_jaro_winkler=>math_max( num1 = 0 num2 = ( i - distance ) ).
              WHILE ( ( foundit = bool_false ) AND ( j LT zcl_jaro_winkler=>math_min( num1 = ( i + distance ) num2 = secondlen ) ) ).
                IF ( secondword_copy+j(1) EQ ch ).
                  foundit = bool_true.
                  CONCATENATE commons ch INTO commons.
                  MOVE secondword_copy+0(j) TO first_half.
                  remaining_length = ( secondlen - j ) - 1.
                  next_start_point = j + 1.
                  MOVE secondword_copy+next_start_point(remaining_length) TO second_half.
                  CONCATENATE first_half '#' second_half INTO secondword_copy.
                ENDIF.
                j = j + 1.
              ENDWHILE.
              i = i + 1.
            ENDWHILE.
          ELSE.
            CLEAR commons.
          ENDIF.
        ENDMETHOD.
      
        METHOD math_max.
          max = nmax( val1 = num1 val2 = num2 ).
        ENDMETHOD.
      
        METHOD math_min.
          min = nmin( val1 = num1 val2 = num2 ).
        ENDMETHOD.
      
      ENDCLASS.
      
      PARAMETERS s_nachn TYPE string DEFAULT 'HOHNSTON'.
      PARAMETERS s_vorna TYPE string DEFAULT 'PHIL'.
      
      START-OF-SELECTION.
        DATA: l_compare TYPE string.
        TYPES: BEGIN OF ty_repository_line,
                 nachn       TYPE c LENGTH 30,
                 vorna       TYPE c LENGTH 30,
                 nachn_score TYPE zcl_jaro_winkler=>ty_distance,
                 vorna_score TYPE zcl_jaro_winkler=>ty_distance,
                 total_score TYPE zcl_jaro_winkler=>ty_distance,
               END OF ty_repository_line,
               ty_repository_lines TYPE STANDARD TABLE OF ty_repository_line WITH EMPTY KEY.
        DATA(t_pa0002) = VALUE ty_repository_lines(
            ( nachn = 'JOHNSTON' vorna = 'PHILIP' )
            ( nachn = 'HOUSTON'  vorna = 'DANIEL' )
            ( nachn = 'CHUMCHAL' vorna = 'PHILLIP' ) ).
        LOOP AT t_pa0002 ASSIGNING FIELD-SYMBOL(<fs>).
          TRANSLATE <fs>-nachn TO UPPER CASE.
          TRANSLATE <fs>-vorna TO UPPER CASE.
          MOVE <fs>-nachn TO l_compare.
          CALL METHOD zcl_jaro_winkler=>stringdistance
            EXPORTING
              firstword      = s_nachn
              secondword     = l_compare
            RECEIVING
              stringdistance = <fs>-nachn_score.
          MOVE <fs>-vorna TO l_compare.
          CALL METHOD zcl_jaro_winkler=>stringdistance
            EXPORTING
              firstword      = s_vorna
              secondword     = l_compare
            RECEIVING
              stringdistance = <fs>-vorna_score.
          <fs>-total_score = <fs>-nachn_score + <fs>-vorna_score.
        ENDLOOP.
        SORT t_pa0002 BY total_score DESCENDING.
        FORMAT COLOR COL_NORMAL.
        WRITE: 80 'Total Score      ' COLOR COL_TOTAL.
        ULINE.
        LOOP AT t_pa0002 ASSIGNING <fs>.
          WRITE: / <fs>-nachn(20), <fs>-nachn_score, <fs>-vorna(20), <fs>-vorna_score.
          WRITE 80 <fs>-total_score COLOR COL_TOTAL.
        ENDLOOP.
  • One thing I don't understand is if there is no match then also it returns 0.00 which doesn't make sense to me. This is what I pass and compare with first and last names in a table

    /
  • My realization of jaro-winkler similarity

    definition

      methods ABAP_JARO_WINKLER_CUSTOM
        importing
          !SRC_STR type STRING
          !TRG_STR type STRING
        exporting
          !EV_SIM_JARO_WINKLER type DECFLOAT16
          !EV_SIM_JARO type DECFLOAT16 .

    implementation

      METHOD abap_jaro_winkler_custom.
        CONSTANTS: winkler_increase TYPE p LENGTH 3 DECIMALS 2 VALUE '0.1'.
        TYPES: BEGIN OF ts_matrix_cell
                , i TYPE syindex
                , j TYPE syindex
                , src_symbol TYPE char1
                , trg_symbol TYPE char1
                , if_equal TYPE syindex
              , END OF ts_matrix_cell
              , tt_matrix TYPE SORTED TABLE OF ts_matrix_cell
                WITH UNIQUE KEY primary_key COMPONENTS i j.
    
        DATA lt_matrix TYPE tt_matrix.
        DATA ls_matrix_cell TYPE ts_matrix_cell.
        DATA src_str_length TYPE syindex.
        DATA trg_str_length TYPE syindex.
        DATA match_char_lim  TYPE syindex.
        DATA matching_chars  TYPE syindex.
        DATA transpose  TYPE syindex.
        DATA transpose_half  TYPE syindex.
        DATA similar_prefix_of_str TYPE syindex.
        DATA symbol_index_position_i TYPE syindex.
        DATA symbol_index_position_j TYPE syindex.
    
        FIELD-SYMBOLS <fs_matrix> TYPE ts_matrix_cell.
    
        """""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
        src_str_length = strlen( src_str ).
        trg_str_length = strlen( trg_str ).
        match_char_lim = floor( nmax( val1 = src_str_length val2 = trg_str_length ) / 2 ) - 1 .
    
        """""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
        CLEAR lt_matrix.
        " create matrix 
        DO ( src_str_length ) TIMES.
          ls_matrix_cell-i = sy-index.
          DO ( trg_str_length ) TIMES.
            ls_matrix_cell-j = sy-index.
            INSERT ls_matrix_cell INTO TABLE lt_matrix.
          ENDDO.
        ENDDO.
        """""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
        " check equlaty of elements
        LOOP AT lt_matrix ASSIGNING <fs_matrix>.
          symbol_index_position_i = <fs_matrix>-i - 1.
          symbol_index_position_j = <fs_matrix>-j - 1.
          <fs_matrix>-src_symbol = src_str+symbol_index_position_i(1).
          <fs_matrix>-trg_symbol = trg_str+symbol_index_position_j(1).
          IF <fs_matrix>-src_symbol EQ <fs_matrix>-trg_symbol.
            <fs_matrix>-if_equal = 1.
          ELSE.
            <fs_matrix>-if_equal = 0.
          ENDIF.
        ENDLOOP.
        """""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
        """""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
        " matching_chars and transpose_half
        " this is separate steps of the algo - so we do it in separate loops to be clean
        matching_chars = 0.
        transpose_half = 0.
        LOOP AT lt_matrix ASSIGNING <fs_matrix> WHERE if_equal EQ 1.
          IF abs( <fs_matrix>-i - <fs_matrix>-j ) GT  match_char_lim.
            CONTINUE.
          ENDIF.
          matching_chars = matching_chars + 1.
          IF <fs_matrix>-i EQ <fs_matrix>-j.
          ELSE.
            transpose = transpose + 1.
          ENDIF.
        ENDLOOP.
        transpose_half = transpose / 2.
        """""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
        """""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
        " calculating JaroSimilarity
        ev_sim_jaro = (   ( matching_chars / src_str_length )
                        + ( matching_chars / trg_str_length )
                        + ( ( matching_chars - transpose_half ) / matching_chars )
                       ) / 3.
    
        """""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
        similar_prefix_of_str = 0.
        DO.
          symbol_index_position_i = sy-index - 1.
          IF src_str+symbol_index_position_i(1) EQ trg_str+symbol_index_position_i(1).
            similar_prefix_of_str = similar_prefix_of_str + 1.
          ELSE.
            EXIT.
          ENDIF.
          IF similar_prefix_of_str GE 4.
            EXIT.
          ENDIF.
        ENDDO.
        """""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
        " calculating Jaro-Winkler Similarity
        ev_sim_jaro_winkler =
        ev_sim_jaro + ( similar_prefix_of_str * winkler_increase ) * ( 1 - ev_sim_jaro ).
      ENDMETHOD.