Seeking the most efficient way to detect whether all table rows share an identical key
The requirement: there is an internal table with a large number of rows.
If all rows have the identical recipient_id, that id (30273 in the example below) must be returned.
UUID | Phone_number | Recipient_id |
0412ASFDSFDSFXCVS | 138XXXXX1 | 30273 |
0412ASFDSFDSFXCVD | 138XXXXX2 | 30273 |
0412ASFDSFDSFXCVF | 138XXXXX3 | 30273 |
… | … | 30273 |
If not, it must return empty.
UUID | Phone_number | Recipient_id |
0412ASFDSFDSFXCVS | 138XXXXX1 | 30273 |
0412ASFDSFDSFXCVD | 138XXXXX2 | 30273 |
0412ASFDSFDSFXCVF | 138XXXXX3 | 30273 |
… | … | 30272 |
The table line type structure in the project looks like below:
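(A simplified sketch only; the field names follow the columns above, while the concrete types and lengths are assumptions.)
TYPES: BEGIN OF ty_sms_status,
         uuid         TYPE c LENGTH 32,
         phone_number TYPE c LENGTH 20,
         recipient_id TYPE c LENGTH 10,
       END OF ty_sms_status.
TYPES ty_t_sms_status TYPE STANDARD TABLE OF ty_sms_status WITH DEFAULT KEY.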
Three different solutions were implemented.
Approach1
The idea: a temporary table lt_sms_status holds a copy of the internal table to be checked; we then SORT the temporary table and delete adjacent duplicate entries. If all table rows have the same recipient id, only a single entry is left after these operations.
DATA: lt_sms_status LIKE it_tab.

lt_sms_status = it_tab.
SORT lt_sms_status BY recipient_id.
DELETE ADJACENT DUPLICATES FROM lt_sms_status COMPARING recipient_id.

IF lines( lt_sms_status ) = 1.
  READ TABLE it_tab ASSIGNING FIELD-SYMBOL(<line>) INDEX 1.
  ev_rec_id = <line>-recipient_id.
ENDIF.
The drawback of approach1 is that it can lead to unnecessarily high memory consumption: when lt_sms_status = it_tab is executed, no new memory allocation occurs yet; the physical copy only happens with the first write operation on the copied content, here the SORT. This behavior is documented as "Delayed Copy".
We also have concerns regarding the performance of SORT and DELETE ADJACENT DUPLICATES when they are executed on a big internal table.
Approach2
Now we fetch the recipient id of the first row and compare it with the remaining rows of the table. If most of the table rows have a different recipient id, the loop has a good chance of terminating early. However, if all the table rows have exactly the same recipient id, this approach has to loop until the last table row.
DATA: lv_diff_found TYPE abap_bool VALUE abap_false.

READ TABLE it_tab ASSIGNING FIELD-SYMBOL(<line>) INDEX 1.
DATA(lv_account_id) = <line>-recipient_id.

LOOP AT it_tab ASSIGNING FIELD-SYMBOL(<ls_line>).
  IF lv_account_id <> <ls_line>-recipient_id.
    lv_diff_found = abap_true.
    EXIT.
  ENDIF.
ENDLOOP.

IF lv_diff_found = abap_false.
  ev_rec_id = lv_account_id.
ENDIF.
Approach3
The idea is similar to approach2, but instead of a manual comparison inside the LOOP, we leverage "LOOP AT ... WHERE condition".
READ TABLE it_tab ASSIGNING FIELD-SYMBOL(<line>) INDEX 1.

LOOP AT it_tab ASSIGNING FIELD-SYMBOL(<ls_line>) WHERE recipient_id <> <line>-recipient_id.
  " a deviating id was found - no need to scan any further
  " (the EXIT was missing in the first version, see Shai's comment below)
  EXIT.
ENDLOOP.

IF sy-subrc <> 0.
  ev_rec_id = <line>-recipient_id.
ENDIF.
In order to measure the performance, we construct two kinds of test cases. In the first one, we generate an internal table with N rows that all have exactly the same recipient id; in the second, every row has a different one. Both are extreme scenarios. We could also measure a case in between, for example an N-row table in which 50% of the rows share the same id and the other 50% have different ones.
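For illustration, the two extreme test tables could be built roughly like this (reusing the line type sketched above; the concrete values are arbitrary):
CONSTANTS lc_n TYPE i VALUE 1000.

DATA lt_same TYPE ty_t_sms_status.   " test case 1: identical recipient id in every row
DATA lt_diff TYPE ty_t_sms_status.   " test case 2: a different recipient id per row

DO lc_n TIMES.
  APPEND VALUE #( uuid         = |UUID{ sy-index }|
                  phone_number = |138{ sy-index }|
                  recipient_id = '30273' ) TO lt_same.
  APPEND VALUE #( uuid         = |UUID{ sy-index }|
                  phone_number = |138{ sy-index }|
                  recipient_id = sy-index ) TO lt_diff.
ENDDO.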
Performance test result
The time spent is measured in microseconds.
N = 1000
For the first test case, approach3 is the most efficient. For the second test case, approach2 is the winner, as we expected.
N = 10000
N = 100000
N = 1000000
N = 5000000
Based on the performance results, we no longer consider approach1. For the choice between approach2 and approach3, we need to investigate the distribution of recipient ids in the real world.
Maybe you can also share if you have a better solution?
Hi Jerry,
b is easily estimated using the doubling hypothesis. I have implemented this once in a performance unit test.
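(Assuming the runtime grows roughly like T(N) ≈ a * N^b, measuring the time for N and 2N gives b ≈ log2( T(2N) / T(N) ).)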
regards,
JNN
Hello JNN,
thank you very much for your comment. I will look into your useful wiki.
Best regards,
Jerry
A hashmap of objects that each contain an internal table would give you fast results in either scenario. If the hashmap size is greater than 1 then you have the not-found scenario, and the found scenario otherwise. All appends would be order 1 operations.
Regards,
Ryan Crosby
Hi Ryan,
I guess this will be slower than the linear table scan using LOOP-WHERE. You are right that the insert into the hashmap is an order 1 operation resulting in O(n) for the whole algorithm. But the linear search for an element which is different from the first one in the table is in O(n) anyway. So using a hashmap will just add a constant factor to an algorithm that already works in linear time.
Kind regards
Harald
Hi Harald,
While the overall algorithm including the INSERTs is going to execute in O(n) time, the determination of uniqueness in the approach I have suggested doesn't require a table scan using LOOP...WHERE. It is simply a matter of determining the size of the hashmap, as I had suggested already. If the size is 1 then you have uniqueness, and not otherwise. See the following pseudo code of such a setup:
DATA: map         TYPE REF TO hashmap,
      itab        TYPE TABLE OF some_type,
      object      TYPE REF TO some_object,
      lv_contains TYPE c,
      lv_size     TYPE i.

CREATE OBJECT map.

LOOP AT itab.
  lv_contains = map->contains_key( itab-recipient_id ).
  IF lv_contains IS INITIAL.
    CREATE OBJECT object.
    object->add_table_entry( itab ).
    map->put( EXPORTING key   = itab-recipient_id
                        value = object ).
  ELSE.
    object = map->get_key( itab-recipient_id ).
    object->add_table_entry( itab ).
  ENDIF.
ENDLOOP.

* Check for uniqueness only requires a size check now
lv_size = map->get_size( ).
IF lv_size IS INITIAL.
* Empty
ELSEIF lv_size = 1.
* Unique
ELSE.
* Non-unique
ENDIF.
In the end, you do still need to add the entries but the puzzle of checking for uniqueness is simplified.
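As a rough plain-ABAP sketch of the same idea (a hashed table keyed by recipient_id stands in for the hashmap class; it_tab and ev_rec_id are reused from the blog post):
DATA ls_row  LIKE LINE OF it_tab.
DATA lt_dist LIKE HASHED TABLE OF ls_row WITH UNIQUE KEY recipient_id.

LOOP AT it_tab INTO ls_row.
  " O(1) insert; a row with an already known recipient_id is simply rejected (sy-subrc = 4)
  INSERT ls_row INTO TABLE lt_dist.
ENDLOOP.

" lines( lt_dist ) is the number of distinct recipient ids
IF lines( lt_dist ) = 1.
  ev_rec_id = ls_row-recipient_id.
ENDIF.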
Regards,
Ryan Crosby
Hi Ryan,
I'm sorry for the ambiguity of my reply. I agree with you that the size of the hashmap can be checked in constant time. However, the problem Jerry presented was the following: determine whether the recipient-ids within the internal table are all the same. So the size of the input is the size of the internal table. Adapting Jerry's solution #3, a simple linear search can be done like this:
METHOD unique_recipient_id.
  DATA first_line LIKE LINE OF itab.

  CHECK itab IS NOT INITIAL.
  READ TABLE itab INDEX 1 INTO first_line.

  LOOP AT itab TRANSPORTING NO FIELDS
       WHERE recipient_id <> first_line-recipient_id.
    RETURN.
  ENDLOOP.

  result = first_line-recipient_id.
ENDMETHOD.
In the worst case this will simply scan the whole table. Your solution will always scan the whole table plus adding the overhead of creating objects within the loop and filling the hashmap. That's what I meant when I said your solution "will be slower than the linear table scan using LOOP-WHERE".
Regards
Harald
Hi Harald,
I'm not presuming to know anything about the runtime of organizing the table entries ahead of time because I'm not familiar with Jerry's full requirement. I am merely pointing out that if the table entries were organized in the fashion that I have represented then I would not have to scan the full table to know the answer. I don't know if he's doing a SQL query into a table or if this data is merged with several sources and then compiled together. In the latter case I could take this approach because by merely inserting entries I would know my answer at the end without doing any READs or LOOPs on the table.
Regards,
Ryan Crosby
Hi Ryan,
Would it be correct to rephrase your last statement like this:
"If the data has to be merged from several sources and compiled together anyway, use a hashmap mapping recipient-ids to all entries containing this id. Determining whether the recipient-id is unique would then be very easy."?
Regards
Harald
Hi Harald,
Yes, that is the idea: if the scenario is something of that sort, you could know the answer very quickly and easily. You could also do it in SELECT...ENDSELECT, but I believe you would take other performance hits going that route.
Regards,
Ryan Crosby
Hi Jerry,
As far as I know, when a STANDARD TABLE is used (not SORTED), methods 2 and 3 should lead to the same results: both of them execute a full table scan.
It seems that you've missed the EXIT command in the source code of Approach3.
Regards,
Shai
Hi Shai,
you are right! After I added the missing EXIT in solution3, for the second test case, solutions 2 and 3 do have the same result. Thanks a lot for your correction!
Best regards,
Jerry
Hi Jerry,
since you are searching for an entry
lv_account_id <> <ls_line>-recipient_id,
that is, with the condition "not equal", the search will - even for SORTED tables - end up in a linear scan of the internal table. I think the fastest way would be:
READ TABLE it_tab INTO first_entry INDEX 1 TRANSPORTING recipient_id.
READ TABLE it_tab TRANSPORTING NO FIELDS
     WITH KEY recipient_id = first_entry-recipient_id.
IF sy-subrc <> 0.
  ev_rec_id = first_entry-recipient_id.
ENDIF.
COMMENT: This was nonsense, one has to do a LOOP ... WHERE (approach 3).
Best Regards, Randolf
Hi Randolf,
your code will make sy-subrc always equal to 0 in both test cases.
Best regards,
Jerry
Hi Jerry,
You are of course right, I changed my answer.
Best Regards, Randolf