The requirement is: there is an internal table with a large number of table row.

If all rows have the identical recipient_id, that id( 30273 ) must be returned.

UUID

Phone_number

Recipient_id

0412ASFDSFDSFXCVS

138XXXXX1

30273

0412ASFDSFDSFXCVD

138XXXXX2

30273

0412ASFDSFDSFXCVF

138XXXXX3

30273

30273

If not, it must return empty.

UUID

Phone_number

Recipient_id

0412ASFDSFDSFXCVS

138XXXXX1

30273

0412ASFDSFDSFXCVD

138XXXXX2

30273

0412ASFDSFDSFXCVF

138XXXXX3

30273

30272

The table line type structure in the project looks like below:

/wp-content/uploads/2014/03/clipboard1_416509.png

Three different solutions have been made.

Approach1

the idea is a temporary table lt_sms_status is used to hold all the content of the internal table to be checked, and then SORT on the temporary table and delete adjacent entries. If all the table row have the same recipient id, after the operation there must be only one entry left.


    DATA: lt_sms_status LIKE it_tab.
    lt_sms_status = it_tab.
SORT lt_sms_status BY recipient_id.
DELETE ADJACENT DUPLICATES FROM lt_sms_status COMPARING recipient_id.
IF lines( lt_sms_status ) = 1.
READ TABLE it_tab ASSIGNING FIELD-SYMBOL(<line>) INDEX 1.
        ev_rec_id = <line>-recipient_id.
ENDIF.

The drawback of approach1 is it could lead to the unnecessary high memory assumption. when lt_sms_status = it_tab is executed, no new memory allocation will not occur, until the write operation on the copied content. This behavior is documented as “Delayed Copy”.

We also have concern regarding the performance of SORT and DELETE keyword when they are executed on a big internal table.

/wp-content/uploads/2014/03/clipboard2_416510.png

Approach2

Now we fetch the recipient id of the first row, and compare it with the left rows in the table. If most of the table rows have different recipient id, the execution has the chance to quit early. However if unfortunately all the table rows have exactly the same recipient id, this approach has to loop until last table row.

  


 DATA: lv_diff_found TYPE abap_bool VALUE abap_false.
READ TABLE it_tab ASSIGNING FIELD-SYMBOL(<line>) INDEX 1.
DATA(lv_account_id) = <line>-recipient_id.
LOOP AT it_tab ASSIGNING FIELD-SYMBOL(<ls_line>).
IF lv_account_id <> <ls_line>-recipient_id.
          lv_diff_found = abap_true.
EXIT.
ENDIF.
ENDLOOP.
IF lv_diff_found = abap_false.
       ev_rec_id = lv_account_id.
ENDIF.

Approach3

the idea is similar as approach2, now instead of manual comparison inside each LOOP, we leverage “LOOP AT XXX WHERE condition”.

  


 READ TABLE it_tab ASSIGNING FIELD-SYMBOL(<line>) INDEX 1.
    LOOP AT it_tab ASSIGNING FIELD-SYMBOL(<ls_line>) WHERE recipient_id <> <line>-recipient_id.
    ENDLOOP.
IF sy-subrc <> 0.
       ev_rec_id = <line>-recipient_id.
ENDIF.

In order to measure the perfomance, we construct two kinds of test case. In the first one, we generate the internal table with N rows, each has exactly the same recipient id. And for the second, each one has different. Both are extreme kinds of scenarios. We may consider to measure the case between these two, for example for a N rows table there are 50% table rows have the same id and another 50% have difference one.

Performance test result

The time spent is measured in microsecond.

N = 1000

For the first test case, approach3 is most efficient. For the second test case, approach2 is the winner, as we expected.

/wp-content/uploads/2014/03/clipboard4_416511.png


N = 10000

/wp-content/uploads/2014/03/clipboard5_416512.png

N = 100000

/wp-content/uploads/2014/03/clipboard6_416513.png

N = 1000000

/wp-content/uploads/2014/03/clipboard7_416514.png

N = 5000000

/wp-content/uploads/2014/03/clipboard8_416515.png

Based on the performance result, we do not consider approach1 any more. For the choice between approach2 and 3, we need to investigate on the distriction of recipient id in the real world.

Maybe you can also share if you have better solutions?

To report this post you need to login first.

14 Comments

You must be Logged on to comment or reply to a post.

  1. Jacques Nomssi

    Hi Jerry,

    1. After a SORT, you can compare the values from first and last row. No expensive DELETE is needed. So the result from the 1st approach would be better
    2. And if you have acceptable results, I would focus on order-of-growth comparison. The running time as a function of N is described as power law  T(N) = a N^b

    b is easily estimated using the doubling hypothesis. I have implemented this once in a performance unit test.

    regards,

    JNN

    (0) 
  2. Ryan Crosby

    A hashmap of objects that each contain an internal table would give you fast results in either scenario.  If the hashmap size is greater than 1 then you have the not found scenario or the opposite otherwise.  All appends would be order 1 operations.

    Regards,

    Ryan Crosby

    (0) 
    1. Harald Frankenberger

      Hi Ryan,

      I guess this will be slower than the linear table scan using LOOP-WHERE. You are right that the insert into the hashmap is an order 1 operation resulting in O(n) for the whole algorithm. But the linear search for an element which is different from the first one in the table is in O(n) anyway. So using a hashmap will just add a constant factor to an algorithm that already works in linear time.

      Kind regards

      Harald

      (0) 
      1. Ryan Crosby

        Hi Harald,

        While the overall algorithm including the INSERT is going to execute in O(n) time the determination of the uniqueness in the approach I have suggested doesn’t require a table scan using LOOP…WHERE.  It simply is a matter of determining the size of the hashmap as I had suggested already.  If the size is 1 then you have uniqueness and not otherwise.  See the following pseudo code of such a setup:

        DATA: map TYPE REF TO hashmap,

                  itab  TYPE TABLE OF some_type,

                  object TYPE REF TO some_object,

                  lv_contains TYPE c,

                  lv_size       TYPE i.

        CREATE OBJECT map.

        LOOP AT itab.

          lv_contains = map->contains_key( itab-recipient_id ).

          IF lv_contains IS INITIAL.

            CREATE OBJECT object.

            object->add_table_entry( itab ).

            map->put( EXPORTING

                            key = itab-recipient_id

                            value = object ).

          ELSE.

            object = map->get_key( itab-recipient_id ).

            object->add_table_entry( itab ).

          ENDIF.

        ENDLOOP.

        * Check for uniqueness only requires a size check now

        lv_size = map->get_size( ).

        IF lv_size IS INITIAL.

        * Empty

        ELSEIF lv_size = 1.

        * Unique

        ELSE.

        * Non-unique

        ENDIF.

        In the end, you do still need to add the entries but the puzzle of checking for uniqueness is simplified.

        Regards,

        Ryan Crosby

        (0) 
        1. Harald Frankenberger

          Hi Ryan,

          I’m sorry for the ambiguity of my reply. I agree with you that the size of the hashmap can be checked in constant time. However, the problem Jerry presented was the following: determine whether the recipient-ids within the internal table are all the same. So the size of the input is the size of the internal table. Adapting Jerry’s solution #3, a simple linear search can be done like this:

            METHOD unique_recipient_id.

              DATA first_line LIKE LINE OF itab.

              CHECK itab IS NOT INITIAL.

              READ TABLE itab INDEX 1 INTO first_line.

              LOOP AT itab TRANSPORTING NO FIELDS

              WHERE recipient_id <> first_line-recipient_id.

                RETURN.

              ENDLOOP.

             

              result = first_line-recipient_id.

            ENDMETHOD.

           

          In the worst case this will simply scan the whole table. Your solution will always scan the whole table plus adding the overhead of creating objects within the loop and filling the hashmap. That’s what I meant when I said your solution “will be slower than the linear table scan using LOOP-WHERE”.

          Regards

          Harald

          (0) 
          1. Ryan Crosby

            Hi Harald,

            I’m not presuming to know anything about the runtime of organizing the table entries ahead of time because I’m not familiar with Jerry’s full requirement.  I am merely pointing out that if the table entries were organized in the fashion that I have represented then I would not have to scan the full table to know the answer.  I don’t know if he’s doing a SQL query into a table or if this data is merged with several sources and then compiled together.  In the latter case I could take this approach because by merely inserting entries I would know my answer at the end without doing any READs or LOOPs on the table.

            Regards,

            Ryan Crosby

            (0) 
            1. Harald Frankenberger

              Hi Ryan,

              Would it be correct to rephrase your last statement like this:

              “If the data has to be merged from several sources and compiled together anyway, use a hashmap mapping recipient-ids to all entries containing this id. Determining whether the recipient-id is unique would then be very easy.”?

              Regards

              Harald

              (0) 
              1. Ryan Crosby

                Hi Harald,

                Yes, that is the idea where if the scenario is something of that sort you could know the answer very fast and also easily.  You could do it also in SELECT…ENDSELECT but you would take other performance hits going that route I believe.

                Regards,

                Ryan Crosby

                (0) 
  3. Shai Sinai

    Hi Jerry,

    As far as I know, when STANDARD TABLE is used (not SORTED), methods 2 and 3 should lead to the same results: Both of them execute full table scan.

    It seems that you’ve missed the EXIT command in the source code of Approach3.

    Regards,

    Shai

    (0) 
    1. Jerry Wang Post author

      Hi Shai,

      you are right! After I added the missing EXIT in solution3, for the second test case, the solution 2 and 3 do have the same result. Thanks a lot for your correction!

      Best regards,

      Jerry

      (0) 
  4. Randolf Eilenberger

    Hi Jerry,

    since you are searching for an entry

    lv_account_id <> <ls_line>-recipient_id,

    that is with condition “not equal”, the search will – even for SORTED tables – end up in a linear scan of the internal table. I think the fastest way would be:

    READ TABLE it_tab INTO first_entry  INDEX 1 TRANSPORTING recipient_id. 

    READ TABLE it_tab TRANSPORTING NO FIELDS

       WITH KEY recipient_id = first_entry-recipient_id.

    IF sy-subrc <> 0.

       ev_rec_id = first_entry-recipient_id.

    ENDIF.

    COMMENT: This was nonsense, one has to do a LOOP … WHERE (approach 3).

    Best Regards, Randolf

    (0) 

Leave a Reply