Joins on Internal Tables? ‒ The ABAP SQL Table Buffer Engine is grown up!
The fastest in-memory database in ABAP is … often the ABAP table buffer!
The ABAP table buffer has an innocent-sounding name. Contrary to query buffers, which store the result of a previous query, the ABAP table buffer buffers the underlying database tables. It is then possible to execute SELECT queries on these buffered tables. This happens through a full-fledged in-memory database engine, the ABAP SQL Engine. In this case, all SQL is executed in the ABAP server and not on the database.
While SAP HANA is a column store database, the table buffer stores the buffered database tables in internal tables in the ABAP server’s shared memory. This means it is a row store database, as internal tables are organized in rows. The ABAP SQL Engine can operate on these internal tables but it can operate on ordinary internal tables, too. The latter is possible since AS ABAP 752 and kernel 753 (but just for a single internal table).
This makes it necessary to distinguish between the ABAP table buffer, which denotes the in-memory data persistence of buffered database tables, and the ABAP SQL Engine, which executes SQL statements on top of the table buffer or on top of internal tables or on top of a combination of both.
This article was written together with my colleague Stephan-Marian Piatkowski.
When talking about the ABAP SQL Engine it is necessary to know its limitations. As has been already mentioned, the ABAP SQL Engine is a row-store engine. This has advantages and disadvantages.
It is also inherently single-threaded. This is a deliberate design decision. Unlike the HANA database, the ABAP server is designed not primarily for performance but for scalability. Using a multithreaded program typically decreases the wall time but will increase the CPU time resulting in worse scalability and a worse overall performance of the system. Thus the ABAP server tries to decrease the CPU time and not the wall-time.
The most serious limitation comes from the limited support of advanced SQL features. Before ABAP Platform 2308 only the following was supported:
- Single buffered table or internal table in the FROM clause
- WHERE conditions without subselects
- COUNT(*) aggregate function (but no other aggregate functions)
- ORDER BY PRIMARY KEY
- Some common SQL functions (like CONCAT, SUBSTRING, LEFT, UPPER, since 753)
- Arithmetic expressions (+, -, *, since 753)
- SQL expressions like CASE or COALESCE (since 753)
- NULL values (since release 753)
- arbitrary ORDER BY clause (since release 757)
3. SQL on internal tables?
Normally the statements LOOP and READ are used to read from internal tables in ABAP. Since AS ABAP 752 it has been possible to use SELECT statements for internal table processing. This has the advantage that standard SQL can be used for both database and internal tables. It also enables functional and performance benefits. By the use of SQL expressions and the full power of SQL, the SELECT statement is much more powerful than both LOOP and READ.
A classic example where SELECT offers new features for ABAP developers is the following:
select * from @itab as i order by substring( column, 2, 2 ) into @data(result). ... endselect.
This loops over an internal table with a custom sorting criterion. Neither sorted looping nor a custom sorting criterion is possible with ABAP LOOP/READ or SORT statements. By using the SQL CASE expression, a huge amount of possible sorting criteria can be realized.
Note that the ABAP command SORT changes the internal table, i.e. is intrusive. The SELECT statement does a sorted loop without changing the internal table. This often avoids unnecessary copying.
Of course, there are certain differences between READ/LOOP and SELECT. The most important difference is that READ/LOOP may process deep internal tables containing references or other internal tables. This is not yet possible with SELECT. Another difference is that SQL semantics is used in SELECT statements, for example trailing blanks of short strings are truncated.
4. Just in Time Compilation for the ABAP SQL Engine
The ABAP generally does not yet support Jit (Just-in-time) compilation. Jit compilation is a heavy-weight optimization technique, where machine language coding is generated at runtime and dramatically speeds up the runtime of bytecode-interpreted languages like Java.
Since ABAP Platform 2308 and kernel 793, the ABAP SQL Engine might use Jit compilation internally to execute WHERE and ON conditions.
Every WHERE or ON condition is transformed into a stack-based bytecode during the compilation of an ABAP SQL statement. For example the WHERE condition
(A = 5 OR A = 9) AND B = 7
is transformed into the following bytecode:
To execute “A = 5”, first the value of column A is pushed on the stack. Next, the value 5 is pushed on the stack. Then the EQ instruction pops the two topmost values from the stack, compares them, and pushes true/false on the stack. And so on …
For simple WHERE conditions, the ABAP SQL Engine can generate machine language coding out of the generated bytecode. This results in dramatic speedups in certain scenarios. It also enables processor-specific optimizations for different platforms.
5. Joins and CTEs on internal tables in the ABAP SQL Engine with ABAP Platform 2308
As a new main feature, the ABAP SQL Engine can execute joins and common table expressions in ABAP Platform 2308 with multiple internal tables. This is a major new feature and transforms the ABAP SQL Engine into a more serious SQL database engine. Both inner, left outer and cross joins are supported and, of course, the NULL values are handled correctly.
So it is now possible to use more than one internal table in an ABAP SQL statement if both of the following conditions hold:
- Only internal tables and no buffered database tables are used in the statement.
- The statement can be fully executed by the ABAP SQL Engine and does not use advanced HANA SQL features like hierarchies, aggregate functions, subselects, and the like.
One could imagine that a JOIN between a buffered table and an internal table or between multiple buffered tables is executed by the ABAP SQL Engine and not on the database. This is disabled at the moment because there might still be cases where the HANA outperforms the ABAP SQL Engine. Of course, these kinds of statements work (at least if there is just a single internal table involved) by pushing everything down to the database.
In essence: statements involving more than one internal table were forbidden before and are now allowed in certain cases.
For example, the following SQL statement is completely executed by the ABAP SQL Engine without pushing the data down to the HANA database:
with +a as ( select column + column2 as column from @itab1 as t where column in @values order by column2 ), +b as ( select * from @itab2 as t where other_column like @like ) select * from +a left join +b on +a~column = +b~other_column into table @data(result).
The used JOIN algorithms and their performance are still improved. In 2308 the following JOIN algorithms are available:
- Merge join, with a complexity of O(N+M) where N is the cardinality of the left and M the cardinality of the right side
- Hash Join, with a complexity of O(N+M)
- Nested-Loop Join, with a complexity of O(N*M)
The available join algorithms and their performance might change in later releases.
Consider the following implementation of a “poor man’s join” in plain ABAP:
LOOP AT itab1 INTO DATA(wa) WHERE ... READ TABLE itab2 INTO DATA(wa2) WITH KEY key1 = wa-key1. ... ENDLOOP.
This will run with a complexity of O(N * LOG(M)) if itab2 is sorted (according to the WITH KEY condition) and with the complexity of O(N*M) if itab2 is a standard table or is a sorted/hashed table with another key.
Using the SELECT statement an equivalent statement will run with a complexity of O(N+M) in any case, even if both itab1 and itab2 are standard tables without any meaningful key (using the hash join algorithm).
Whether O(N*LOG(M)) or O(N+M) is faster depends on the sizes of N and M. In most cases the strictly linear O(N+M) is preferable. If the size of the left size N is relatively small and constant and the size of the right side M grows fast, then O(N*LOG(M)) will be faster.
At the moment the selection of the JOIN algorithm (i.e. the “optimizer”) is still quite simple and is completely determined at compile-time without any size information at runtime.
The merge join is generally the first choice if both sides are already sorted according to (parts of) the ON – condition. Note that in ABAP Platform 2308, the sort-merge join is not yet implemented, i.e. the merge join can’t be used if one side is not sorted according to (parts of) the ON condition. The hash join is taken if there is at least one comparison of the type “TABLE1~COL = TABLE2~COL” and both columns happen to have the same data type. In all other cases (for example, if there is a top-level OR condition in the ON condition) the nested loop join is used.
Note that because the ABAP SQL Engine doesn’t implement the index join algorithm in ABAP Platform 2308, the LOOP/READ might clearly outperform the SELECT if itab2 is a hashed table and the READ statement accidentally happens to use the hashed key. In this case, the LOOP/READ might run in O(N) time instead of O(N+M) when using the SELECT. Note that the SELECT statement might still be faster in this case if the WHERE condition of the outer LOOP is more complicated.
Note that predicate pushdown is used heavily as is expected from a database engine. Consider the following query:
SELECT * FROM @itab1 as i1 INNER JOIN @itab2 as i2 ON i1~col = i2~col WHERE i1~col = 'A'
This is transformed internally to:
WITH +temp1 AS ( SELECT * FROM @itab1 AS i1 WHERE i1~col = 'A' ) " potential index lookup +temp2 AS ( SELECT * FROM @itab2 AS i2 WHERE i2~col = 'A' ) " potential index lookup SELECT * FROM +temp1 as i1 INNER JOIN +temp2 as i2 ON i1~col = i2~col
This means that the predicate of the WHERE condition is “pushed down” to both sides.
6. Optimizations for IN and OR conditions
In the past, the ABAP Table Buffer and the ABAP SQL Engine have been slow when OR or IN conditions have been used in the WHERE condition. Take the following example:
SELECT * FROM buffered_table_or_itab WHERE key1 IN ('A', 'B', 'C') INTO ...
This statement is semantically equivalent to the following:
SELECT * FROM buffered_table_or_itab WHERE key1 = 'A' or key1 = 'B' or key1 = 'C' INTO ...
Due to the OR conditions, no index lookup can be used. Thus a full table scan has been made.
In ABAP Platform 2308, the OR/IN condition is internally transformed into a UNION operation instead:
SELECT * FROM buffered_table_or_itab WHERE key1 = 'A' UNION ALL SELECT * FROM buffered_table_or_itab WHERE key1 = 'B' UNION ALL SELECT * FROM buffered_table_or_itab WHERE key1 = 'C' INTO ...
This is executed like three separate SELECT statements using a full key lookup.
This works even with multiple equal comparisons, for example:
SELECT * FROM buffered_table_or_itab WHERE ( key1 = 'A' AND key2 = 5 ) OR ( key1 = 'B' AND key2 = 6 ) INTO ...
This also works for MULTIDIMENSIONAL IN predicates:
SELECT * FROM buffered_table_or_itab WHERE ( key1, key2 ) IN ( ( 'A', 1 ), ( 'B', 2 ) ) INTO ...
Note that ordinary UNIONs are not yet enabled in the ABAP SQL Engine (i.e. UNIONS which are explicitly written in the ABAP SQL statement).
It is still not possible to read from multiple different generic or single buffered key areas for a single table in a SELECT/WITH statement
This article introduced the ABAP SQL Engine and described some new features. The most prominent feature was the possibility to JOIN two or more internal tables. Other features included Common Table Expressions (CTEs), speed-up of OR/IN condition evaluation, and Jit compilation.