Seek the most efficient way to detect whether there are table row with duplicate key
The requirement is: there is an internal table with a large number of table row.
If all rows have the identical recipient_id, that id( 30273 ) must be returned.
If not, it must return empty.
The table line type structure in the project looks like below:
Three different solutions have been made.
the idea is a temporary table lt_sms_status is used to hold all the content of the internal table to be checked, and then SORT on the temporary table and delete adjacent entries. If all the table row have the same recipient id, after the operation there must be only one entry left.
DATA: lt_sms_status LIKE it_tab. lt_sms_status = it_tab. SORT lt_sms_status BY recipient_id. DELETE ADJACENT DUPLICATES FROM lt_sms_status COMPARING recipient_id. IF lines( lt_sms_status ) = 1. READ TABLE it_tab ASSIGNING FIELD-SYMBOL(<line>) INDEX 1. ev_rec_id = <line>-recipient_id. ENDIF.
The drawback of approach1 is it could lead to the unnecessary high memory assumption. when lt_sms_status = it_tab is executed, no new memory allocation will not occur, until the write operation on the copied content. This behavior is documented as “Delayed Copy”.
We also have concern regarding the performance of SORT and DELETE keyword when they are executed on a big internal table.
Now we fetch the recipient id of the first row, and compare it with the left rows in the table. If most of the table rows have different recipient id, the execution has the chance to quit early. However if unfortunately all the table rows have exactly the same recipient id, this approach has to loop until last table row.
DATA: lv_diff_found TYPE abap_bool VALUE abap_false. READ TABLE it_tab ASSIGNING FIELD-SYMBOL(<line>) INDEX 1. DATA(lv_account_id) = <line>-recipient_id. LOOP AT it_tab ASSIGNING FIELD-SYMBOL(<ls_line>). IF lv_account_id <> <ls_line>-recipient_id. lv_diff_found = abap_true. EXIT. ENDIF. ENDLOOP. IF lv_diff_found = abap_false. ev_rec_id = lv_account_id. ENDIF.
the idea is similar as approach2, now instead of manual comparison inside each LOOP, we leverage “LOOP AT XXX WHERE condition”.
READ TABLE it_tab ASSIGNING FIELD-SYMBOL(<line>) INDEX 1. LOOP AT it_tab ASSIGNING FIELD-SYMBOL(<ls_line>) WHERE recipient_id <> <line>-recipient_id. ENDLOOP. IF sy-subrc <> 0. ev_rec_id = <line>-recipient_id. ENDIF.
In order to measure the performance, we construct two kinds of test case. In the first one, we generate the internal table with N rows, each has exactly the same recipient id. And for the second, each one has different. Both are extreme kinds of scenarios. We may consider to measure the case between these two, for example for a N rows table there are 50% table rows have the same id and another 50% have difference one.
Performance test result
The time spent is measured in microsecond.
N = 1000
For the first test case, approach3 is most efficient. For the second test case, approach2 is the winner, as we expected.
N = 10000
N = 100000
N = 1000000
N = 5000000
Based on the performance result, we do not consider approach1 any more. For the choice between approach2 and 3, we need to investigate on the distraction of recipient id in the real world.
Maybe you can also share if you have better solutions?