SAP Data Services (DS) provides connections to data sources and targets of different categories. It supports a wide range of relational database types (HANA, Sybase IQ, Sybase ASE, SQL Anywhere, DB2, Microsoft SQL Server, Teradata, Oracle…). It can also read and write data into files (text, Excel, XML), adapters (WebServices, salesforce.com) and applications (SAP, BW et al.). Typically, to enable transformations during an ETL process, non-database data are temporarily stored (staged, cached) in databases, too. When interfacing with relational databases, DS generates SQL-statements for selecting, inserting, updating and deleting data records.
When processing database data, DS can leverage the power of the database engine. That may be very important for performance reasons. The mechanism applied is called SQL-Pushdown: (part of) the transformation logic is pushed downed to the database in the form of generated SQL statements. That is because, although DS itself is a very powerful tool, databases are often able to process data much faster. On top of that, internal processing within the database layer avoids or significantly reduces costly time-consuming data transfers between database server memory and DS memory and vice versa.
In many cases, the DS engine is smart enough to take the right decisions at this level. But it is obvious that a good dataflow (DF) design will help. The overall principle should consist in minimizing processing capacity and memory usage by the DS engine. In fact, following are the most important factors influencing the performance of a DS dataflow:
- Maximize the number of operations that can be performed by the database
- Minimize the number of records processed by the DS engine
- Minimize the number of columns processed by the DS engine ( a bit less important, because often with lower impact)
During development of a DS dataflow, it is always possible to view the code as it will be executed by the DS engine at runtime. More in particular, when reading from a relational database, one can always see the SQL that will be generated from the dataflow. When a dataflow is open in the DS Designer, select Validation à Display Optimized SQL… from the menu:
Figure 1: Display Optimised SQL
It will show the SQL code that will be generated and pushed down by the DS engine:
Figure 2: Optimised SQL
Make sure that the dataflow has not been modified after it has last been saved to the repository. If the dataflow is modified, it must be saved before displaying the generated SQL. The Optimized SQL popup window will always show the code corresponding to the saved version and not to the one displayed in DS Designer.
When all sources and targets in a flow are relational database tables, the complete operation will be pushed to the database under following conditions:
- All tables exist in the same database, or in linked databases.
- The dataflow contains Query transforms only. (Bear with me! In a next blog I will describe some powerful new features. When connected to HANA, DS 4.2 is able to push down additional transforms such as Validation, Merge and Table_Comparison.)
- For every DS function used there’s an equivalent function at database level. This has to be true for any implicitly generated functions, too. For instance, when data types of source and target columns are different, DS will include a conversion function, for which possibly no equivalent function exists at database level! There are no substitution parameters in the where-clause (replace them by global variables if necessary).
- Bulk loading is not enabled.
- The source sets are distinct for every target.
This functionality is commonly called full SQL-Pushdown. Without any doubt, a full pushdown often gives best performance, because the generated code will completely bypass any operations to DS memory. As a matter of fact that constitutes the best possible application of the main principle to let the database do the hard work!
Don’t bother applying the performance improvements described here, if your applications are already performing well. If that’s the case, you can stop reading here 😆 .
Don’t fix if it’s not broken. Check the overall performance of your job. Concentrate on the few dataflows that take most of the processing time. Then try and apply the tips and tricks outlined below on those.
1. Pushdown_sql function
DS functions for which there is no database equivalent (or DS does not know it!) prevent the SQL-Pushdown. Check out the AL_FUNCINFO table in the DS repository to find out about which DS functions can be pushed down:
SELECT NAME,FUNC_DBNAME FROM AL_FUNCINFO where SOURCE = ‘<your_database_type>’
Figure 3: DS does not know equivalent database function
There is a solution though when the culprit function is used in the where-clause of a Query transform. Using the DS built-in pushdown_sql function this code can be isolated from DS processing and pushed down to the database so that the complete statement can be executed at database level again.
Figure 4: Use of sql_pushdown
2. Use global variables
There is not always a database equivalent for all DS date functions. As a result the function is not pushed down to the database.
Figure 5: Date function – no pushdown
Whenever a system timestamp or a derivation thereof (current year, previous month, today…) is needed in a mapping or a where-clause of a Query transform, use a global variable instead. Initialize the variable; give it the desired value in a script before the dataflow. Then use it in the mapping. The database will treat the value as a constant that will be pushed to the database.
Figure 6: Use of a global variable
3. Single target table
Best practice is to have one single target table only in a dataflow.
Figure 7: Single target table
For an extract dataflow that always means a single driving table, eventually in combination with one or more lookup sources. For transform, load and aggregate flows, the columns of the target table are typically sourced from multiple tables that have to be included as sources in the dataflow.
By definition, a full SQL-Pushdown cannot be achieved when there’s more than one target table sharing some of the source tables. With multiple target tables it is impossible to generate a single SQL insert statement with a sub-select clause.
Figure 8: More than one target table
Whenever the dataflow functionality requires multiple target table, adding a Data_Transfer transform (with transfer_type = Table) between the Query transform and the target tables might help in solving performance issues. The full table scan (followed by further DS processing and database insert operations) is now replaced by three inserts (with sub-select) that are completely pushed down to the database.
Figure 9: Data_Transfer transform
Figure 10: Data_Transfer Table type
4. Avoid auto-joins
When multiple data streams are flowing out of a single source table, DS is not able to generate the most optimal SQL code. To that extent, best practice is to include additional copies of the source table in the flow.
Figure 11: Auto-join
When designing the flow as shown below, DS will generate a full SQL-Pushdown.
Figure 12: Without auto-join
5. Another application of the Data_Transfer transform
When joining a source table with a Query transform (e.g. containing a distinct-clause or a group by) DS does not generate a full pushdown.
Figure 13: Sub-optimal DS dataflow
An obvious correction to that problem consists in removing the leftmost Query transform from the dataflow by including its column mappings in the Join.
When that’s not possible, the Data_Transfer transform may bring the solution. By using a Data_Transfer transform, with transfer_type = Table, between the two Query transforms, performance may be significantly improved. For the dataflow below, DS will generate 2 full pushdown SQL statements. The first will insert the Query results into a temporary table. The second will insert the Join results into the target.
Figure 14: Optimization with Data_Transfer transform
6. The Validation transform
In a non-HANA environment, when using transforms different from the Query transform, processing control will pass to the DS engine preventing it from generating a full pushdown. There exists a workaround for validation transforms, though.
Figure 15: Validation transform
Replacing the Validation by two or more Query transforms, each with one of the validation conditions in its where clause will allow DS to generate a (separate) insert with sub-select for every data stream.
Figure 16: Parallel queries