Performance when filling a hashed table from an internal table: “insert into hashed table inside a loop” or “sort the internal table and delete adjacent duplicates”?
Here are my thoughts on performance for the following task:
select many records from the database and put the result, keyed by some fields, into an internal hashed table.
We can either insert the records into the hashed table inside a loop, or sort the internal table and delete the duplicate records.
Which is quicker?
Here is a small program to test:
REPORT ztest_performance.

TYPES: BEGIN OF ekpo_typ,
         ebeln TYPE ebeln,
         ebelp TYPE ebelp,
         matnr TYPE matnr,
         bukrs TYPE bukrs,
         werks TYPE ewerk,
       END OF ekpo_typ.

DATA: lt_ekpo   TYPE TABLE OF ekpo_typ.
DATA: lt_ekpo_h TYPE HASHED TABLE OF ekpo_typ WITH UNIQUE KEY bukrs.
DATA: lt_ekpo_2 TYPE HASHED TABLE OF ekpo_typ WITH UNIQUE KEY bukrs.

DATA: i1        TYPE i,
      i2        TYPE i,
      i3        TYPE i,
      i4        TYPE i,
      lv_lines1 TYPE i,
      lv_lines2 TYPE i,
      diff21    TYPE i,
      diff32    TYPE i.

FIELD-SYMBOLS: <fs_ekpo> LIKE LINE OF lt_ekpo.

SELECT ebeln
       ebelp
       matnr
       bukrs
       werks
  FROM ekpo
  INTO CORRESPONDING FIELDS OF TABLE lt_ekpo
  UP TO 1000000 ROWS.

GET RUN TIME FIELD i1.

* Variant 1: INSERT into the hashed table inside a loop.
* For a duplicate key, INSERT just sets sy-subrc = 4 (no dump).
LOOP AT lt_ekpo ASSIGNING <fs_ekpo>.
  INSERT <fs_ekpo> INTO TABLE lt_ekpo_h.
ENDLOOP.

GET RUN TIME FIELD i2.
lv_lines1 = lines( lt_ekpo_h ).
CLEAR lt_ekpo_h.

GET RUN TIME FIELD i3.

* Variant 2: SORT + DELETE ADJACENT DUPLICATES, then copy into
* the hashed table (the keys are unique now, so the copy works).
SORT lt_ekpo BY bukrs.
DELETE ADJACENT DUPLICATES FROM lt_ekpo COMPARING bukrs.
lt_ekpo_2 = lt_ekpo.

GET RUN TIME FIELD i4.
lv_lines2 = lines( lt_ekpo_2 ).
CLEAR lt_ekpo_2.

diff21 = i2 - i1.
diff32 = i4 - i3.

WRITE: / 'i2-i1 =', diff21.
WRITE: / 'i4-i3 =', diff32.
WRITE: / 'lines1 =', lv_lines1.
WRITE: / 'lines2 =', lv_lines2.
In my test system the result is:
i2-i1 = 814.957
i4-i3 = 480.459
lines = 29
So, “delete duplicate records” seems to work quicker than “insert records into a hashed table inside a loop”.
Sounds like you may want to use GROUP BY in your SELECT.
I used the EKPO table only as an example of a big table. In the common case, the key fields that I need don't have indexes in the database table.
As long as you can write the SELECT yourself (and don't receive the itab from *somewhere* to work with), and do not need the detailed data anyway, a SELECT using either GROUP BY, SELECT DISTINCT, or COUNT directly will have better performance, I think, no matter what indexes you have.
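A minimal sketch of that approach, assuming the same EKPO example and that only the distinct bukrs values (or a count per bukrs) are actually needed; the target tables here are made up for illustration:

* Let the database return only the distinct keys instead of
* aggregating a million rows in ABAP.
TYPES: BEGIN OF bukrs_cnt_typ,
         bukrs TYPE bukrs,
         cnt   TYPE i,
       END OF bukrs_cnt_typ.
DATA: lt_bukrs     TYPE TABLE OF bukrs,
      lt_bukrs_cnt TYPE TABLE OF bukrs_cnt_typ.

SELECT DISTINCT bukrs
  FROM ekpo
  INTO TABLE lt_bukrs.

* Or, if a count per key is needed as well:
SELECT bukrs COUNT( * )
  FROM ekpo
  INTO TABLE lt_bukrs_cnt
  GROUP BY bukrs.

Either way, the deduplication work is pushed down to the database, and only the small result set travels to the application server.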
Comparing only SORT + DELETE ADJACENT DUPLICATES against the INSERT into a hashed table, your findings were interesting to me, but I cannot confirm them.
I tried this with other data, and the INSERT was about twice as fast as the sort plus delete.
Important though: the SORT takes much longer than the DELETE ADJACENT DUPLICATES. Maybe in your case the data was already sorted, or nearly sorted. The DELETE ADJACENT DUPLICATES alone is indeed considerably faster than the INSERT method.
I'm with Thomas on this one - just don't select the data you don't need in the first place.
If it were the "receive the itab from *somewhere* to work with" case, I'd probably go with SORT/DELETE just because it's clear, takes only two lines, and does not require a field-symbol declaration.
One thing you should take into account is Releasing Internal Table Memory.
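That point is relevant for the helper tables in the test program above: CLEAR (and the obsolete REFRESH) only delete the rows, while FREE also releases the memory. A short sketch, reusing the tables from the test report:

* CLEAR deletes all rows but keeps the already allocated table
* memory, which is reused if the table is filled again.
CLEAR lt_ekpo_h.

* FREE deletes the rows AND returns the memory to the system -
* worth doing after a million-row table is no longer needed.
FREE lt_ekpo.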
However, you then need to look at what your table is being used for. You may lose some time inserting records into a hashed table, but you would certainly gain that time back when using it correctly as a lookup table.
So personally, I think your question doesn't need asking. It's the use of the table that you are building that needs to be thought about.
Rich
In my experience, COLLECT works faster than any other option.
You can read more on that - ABAP internal table unique key Performance Comparison - ABAP help blog
Because COLLECT uses an internal hash algorithm. http://help.sap.com/abapdocu_702/en/abapcollect.htm
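A minimal sketch of the COLLECT approach, assuming we only want a row count per bukrs (the structure and table names here are made up; lt_ekpo is the table from the test program above):

TYPES: BEGIN OF bukrs_cnt_typ,
         bukrs TYPE bukrs,   " character-like, so part of the COLLECT key
         cnt   TYPE i,       " numeric, so it gets summed up
       END OF bukrs_cnt_typ.
DATA: lt_collect TYPE TABLE OF bukrs_cnt_typ,
      ls_collect TYPE bukrs_cnt_typ.
FIELD-SYMBOLS: <fs_line> LIKE LINE OF lt_ekpo.

* In a standard table, COLLECT treats all non-numeric components
* as the key and adds the numeric components to the existing line
* with that key, locating it via an internal hash algorithm.
LOOP AT lt_ekpo ASSIGNING <fs_line>.
  ls_collect-bukrs = <fs_line>-bukrs.
  ls_collect-cnt   = 1.
  COLLECT ls_collect INTO lt_collect.
ENDLOOP.

After the loop, lt_collect holds one line per bukrs with cnt equal to the number of source rows for that key.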
Regards,
Naimesh Patel
So that's one example of table use; another example I was thinking about was a lookup table, for instance.
However, when you think about it, any hashed table with a good hashing algorithm that does not generate too many collisions will generally be faster than a sorted table once the number of records starts getting huge.
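For illustration, a sketch of such a lookup, assuming the hashed table lt_ekpo_h from the test program above and a made-up company code '1000':

FIELD-SYMBOLS: <fs_hit> LIKE LINE OF lt_ekpo_h.

* Key access to a hashed table costs roughly constant time
* regardless of table size - this is where the build-up cost of
* the hashed table is paid back, read by read.
READ TABLE lt_ekpo_h ASSIGNING <fs_hit>
     WITH TABLE KEY bukrs = '1000'.
IF sy-subrc = 0.
  WRITE: / 'Plant for company code 1000:', <fs_hit>-werks.
ENDIF.

With many such reads in a loop, the constant-time access usually outweighs the extra cost of building the hashed table in the first place.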
COLLECT has some restrictions: all components that are not part of the primary table key must have a numeric data type.