Skip to Content
Author's profile photo Jerry Wang

Seek the most efficient way to detect whether there are table row with duplicate key

The requirement is: there is an internal table with a large number of table row.

If all rows have the identical recipient_id, that id( 30273 ) must be returned.

UUID Phone_number Recipient_id
0412ASFDSFDSFXCVS 138XXXXX1 30273
0412ASFDSFDSFXCVD 138XXXXX2 30273
0412ASFDSFDSFXCVF 138XXXXX3 30273
30273

If not, it must return empty.

UUID Phone_number Recipient_id
0412ASFDSFDSFXCVS 138XXXXX1 30273
0412ASFDSFDSFXCVD 138XXXXX2 30273
0412ASFDSFDSFXCVF 138XXXXX3 30273
30272

The table line type structure in the project looks like below:

/wp-content/uploads/2014/03/clipboard1_416509.png

Three different solutions have been made.

Approach1

the idea is a temporary table lt_sms_status is used to hold all the content of the internal table to be checked, and then SORT on the temporary table and delete adjacent entries. If all the table row have the same recipient id, after the operation there must be only one entry left.

DATA: lt_sms_status LIKE it_tab.
lt_sms_status = it_tab.
SORT lt_sms_status BY recipient_id.
DELETE ADJACENT DUPLICATES FROM lt_sms_status COMPARING recipient_id.
IF lines( lt_sms_status ) = 1.
  READ TABLE it_tab ASSIGNING FIELD-SYMBOL(<line>) INDEX 1.
  ev_rec_id = <line>-recipient_id.
ENDIF.

The drawback of approach1 is it could lead to the unnecessary high memory assumption. when lt_sms_status = it_tab is executed, no new memory allocation will not occur, until the write operation on the copied content. This behavior is documented as “Delayed Copy”.

We also have concern regarding the performance of SORT and DELETE keyword when they are executed on a big internal table.

/wp-content/uploads/2014/03/clipboard2_416510.png

Approach2

Now we fetch the recipient id of the first row, and compare it with the left rows in the table. If most of the table rows have different recipient id, the execution has the chance to quit early. However if unfortunately all the table rows have exactly the same recipient id, this approach has to loop until last table row.

DATA: lv_diff_found TYPE abap_bool VALUE abap_false.
READ TABLE it_tab ASSIGNING FIELD-SYMBOL(<line>) INDEX 1.
DATA(lv_account_id) = <line>-recipient_id.
LOOP AT it_tab ASSIGNING FIELD-SYMBOL(<ls_line>).
  IF lv_account_id <> <ls_line>-recipient_id.
     lv_diff_found = abap_true.
     EXIT.
  ENDIF.
ENDLOOP.
IF lv_diff_found = abap_false.
  ev_rec_id = lv_account_id.
ENDIF.

Approach3

the idea is similar as approach2, now instead of manual comparison inside each LOOP, we leverage “LOOP AT XXX WHERE condition”.

READ TABLE it_tab ASSIGNING FIELD-SYMBOL(<line>) INDEX 1.
LOOP AT it_tab ASSIGNING FIELD-SYMBOL(<ls_line>) WHERE recipient_id <> <line>-recipient_id.
ENDLOOP.
IF sy-subrc <> 0.
   ev_rec_id = <line>-recipient_id.
ENDIF.

In order to measure the performance, we construct two kinds of test case. In the first one, we generate the internal table with N rows, each has exactly the same recipient id. And for the second, each one has different. Both are extreme kinds of scenarios. We may consider to measure the case between these two, for example for a N rows table there are 50% table rows have the same id and another 50% have difference one.

Performance test result

The time spent is measured in microsecond.

N = 1000

For the first test case, approach3 is most efficient. For the second test case, approach2 is the winner, as we expected.

/wp-content/uploads/2014/03/clipboard4_416511.png

N = 10000

/wp-content/uploads/2014/03/clipboard5_416512.png

N = 100000

/wp-content/uploads/2014/03/clipboard6_416513.png

N = 1000000

/wp-content/uploads/2014/03/clipboard7_416514.png

N = 5000000

/wp-content/uploads/2014/03/clipboard8_416515.png

Based on the performance result, we do not consider approach1 any more. For the choice between approach2 and 3, we need to investigate on the distraction of recipient id in the real world.

Maybe you can also share if you have better solutions?

Assigned Tags

      14 Comments
      You must be Logged on to comment or reply to a post.
      Author's profile photo Jacques Nomssi Nzali
      Jacques Nomssi Nzali

      Hi Jerry,

      1. After a SORT, you can compare the values from first and last row. No expensive DELETE is needed. So the result from the 1st approach would be better
      2. And if you have acceptable results, I would focus on order-of-growth comparison. The running time as a function of N is described as power law  T(N) = a N^b

      b is easily estimated using the doubling hypothesis. I have implemented this once in a performance unit test.

      regards,

      JNN

      Author's profile photo Jerry Wang
      Jerry Wang
      Blog Post Author

      Hello JNN,

      thank you very much for your comment. I will look into your useful wiki.

      Best regards,

      Jerry

      Author's profile photo Ryan Crosby
      Ryan Crosby

      A hashmap of objects that each contain an internal table would give you fast results in either scenario.  If the hashmap size is greater than 1 then you have the not found scenario or the opposite otherwise.  All appends would be order 1 operations.

      Regards,

      Ryan Crosby

      Author's profile photo Former Member
      Former Member

      Hi Ryan,

      I guess this will be slower than the linear table scan using LOOP-WHERE. You are right that the insert into the hashmap is an order 1 operation resulting in O(n) for the whole algorithm. But the linear search for an element which is different from the first one in the table is in O(n) anyway. So using a hashmap will just add a constant factor to an algorithm that already works in linear time.

      Kind regards

      Harald

      Author's profile photo Ryan Crosby
      Ryan Crosby

      Hi Harald,

      While the overall algorithm including the INSERT is going to execute in O(n) time the determination of the uniqueness in the approach I have suggested doesn't require a table scan using LOOP...WHERE.  It simply is a matter of determining the size of the hashmap as I had suggested already.  If the size is 1 then you have uniqueness and not otherwise.  See the following pseudo code of such a setup:

      DATA: map TYPE REF TO hashmap,

                itab  TYPE TABLE OF some_type,

                object TYPE REF TO some_object,

                lv_contains TYPE c,

                lv_size       TYPE i.

      CREATE OBJECT map.

      LOOP AT itab.

        lv_contains = map->contains_key( itab-recipient_id ).

        IF lv_contains IS INITIAL.

          CREATE OBJECT object.

          object->add_table_entry( itab ).

          map->put( EXPORTING

                          key = itab-recipient_id

                          value = object ).

        ELSE.

          object = map->get_key( itab-recipient_id ).

          object->add_table_entry( itab ).

        ENDIF.

      ENDLOOP.

      * Check for uniqueness only requires a size check now

      lv_size = map->get_size( ).

      IF lv_size IS INITIAL.

      * Empty

      ELSEIF lv_size = 1.

      * Unique

      ELSE.

      * Non-unique

      ENDIF.

      In the end, you do still need to add the entries but the puzzle of checking for uniqueness is simplified.

      Regards,

      Ryan Crosby

      Author's profile photo Former Member
      Former Member

      Hi Ryan,

      I'm sorry for the ambiguity of my reply. I agree with you that the size of the hashmap can be checked in constant time. However, the problem Jerry presented was the following: determine whether the recipient-ids within the internal table are all the same. So the size of the input is the size of the internal table. Adapting Jerry's solution #3, a simple linear search can be done like this:

        METHOD unique_recipient_id.

          DATA first_line LIKE LINE OF itab.

          CHECK itab IS NOT INITIAL.

          READ TABLE itab INDEX 1 INTO first_line.

          LOOP AT itab TRANSPORTING NO FIELDS

          WHERE recipient_id <> first_line-recipient_id.

            RETURN.

          ENDLOOP.

         

          result = first_line-recipient_id.

        ENDMETHOD.

       

      In the worst case this will simply scan the whole table. Your solution will always scan the whole table plus adding the overhead of creating objects within the loop and filling the hashmap. That's what I meant when I said your solution "will be slower than the linear table scan using LOOP-WHERE".

      Regards

      Harald

      Author's profile photo Ryan Crosby
      Ryan Crosby

      Hi Harald,

      I'm not presuming to know anything about the runtime of organizing the table entries ahead of time because I'm not familiar with Jerry's full requirement.  I am merely pointing out that if the table entries were organized in the fashion that I have represented then I would not have to scan the full table to know the answer.  I don't know if he's doing a SQL query into a table or if this data is merged with several sources and then compiled together.  In the latter case I could take this approach because by merely inserting entries I would know my answer at the end without doing any READs or LOOPs on the table.

      Regards,

      Ryan Crosby

      Author's profile photo Former Member
      Former Member

      Hi Ryan,

      Would it be correct to rephrase your last statement like this:

      "If the data has to be merged from several sources and compiled together anyway, use a hashmap mapping recipient-ids to all entries containing this id. Determining whether the recipient-id is unique would then be very easy."?

      Regards

      Harald

      Author's profile photo Ryan Crosby
      Ryan Crosby

      Hi Harald,

      Yes, that is the idea where if the scenario is something of that sort you could know the answer very fast and also easily.  You could do it also in SELECT...ENDSELECT but you would take other performance hits going that route I believe.

      Regards,

      Ryan Crosby

      Author's profile photo Former Member
      Former Member

      Hi Jerry,

      As far as I know, when STANDARD TABLE is used (not SORTED), methods 2 and 3 should lead to the same results: Both of them execute full table scan.

      It seems that you've missed the EXIT command in the source code of Approach3.

      Regards,

      Shai

      Author's profile photo Jerry Wang
      Jerry Wang
      Blog Post Author

      Hi Shai,

      you are right! After I added the missing EXIT in solution3, for the second test case, the solution 2 and 3 do have the same result. Thanks a lot for your correction!

      Best regards,

      Jerry

      Author's profile photo Randolf Eilenberger
      Randolf Eilenberger

      Hi Jerry,

      since you are searching for an entry

      lv_account_id <> <ls_line>-recipient_id,

      that is with condition "not equal", the search will - even for SORTED tables - end up in a linear scan of the internal table. I think the fastest way would be:

      READ TABLE it_tab INTO first_entry  INDEX 1 TRANSPORTING recipient_id. 

      READ TABLE it_tab TRANSPORTING NO FIELDS

         WITH KEY recipient_id = first_entry-recipient_id.

      IF sy-subrc <> 0.

         ev_rec_id = first_entry-recipient_id.

      ENDIF.

      COMMENT: This was nonsense, one has to do a LOOP ... WHERE (approach 3).

      Best Regards, Randolf

      Author's profile photo Jerry Wang
      Jerry Wang
      Blog Post Author

      Hi Randolf,

      your code will make the sy-subrc always equals to 0 in both test case.

      Best regards,

      Jerry

      Author's profile photo Randolf Eilenberger
      Randolf Eilenberger

      Hi Jerry,

      You are of course right, I changed my answer.

      Best Regards, Randolf