SAP Data Services – Defining delta using hash functions
In section 34, Changed Data Capture, of the SAP Data Services Designer Guide you can find a very good description of the various delta load capabilities in SAP Data Services. Isn't that worth an attentive read?
Delta loads are commonly used to reduce overall ETL processing time. When the number of new and modified source records is relatively low compared to the full data size, overall performance increases significantly. Not only will the time spent in extraction be much lower, the transformation and load steps will also run much faster, because those operations apply to a (small) subset of the data only.
I would say that, in general, you may consider the following options to implement an incremental extraction:
- Use last_modification timestamps from the source tables. Create a control table for keeping track of the timestamp(s) used in the previous incremental extraction. Have your jobs call an initialisation script to assign the timestamps to global variables. Use the variables in the where-clause of the Query transform in your extraction dataflows (a sketch follows this list). If there is no timestamp available, but the records are numbered sequentially, you can apply a similar approach using the sequence numbers.
- Use log tables. Many systems do not include timestamps in the actual tables modified; they use so-called log tables to keep track of changes. Join source and log tables and proceed as above.
- Use the built-in CDC-mechanisms of the source database.
- Use specific CDC-enabling software, like SLT.
- With SAP as a source, you can leverage delta-enabled Business Content Extractors.
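As a minimal sketch of the first option: assume a control table ETL_CONTROL with a column LAST_EXTRACT_TS in a datastore named DS_CONTROL, a global variable $G_LAST_RUN defined at job level, and a LAST_MODIFIED column in the source table (all of these names are examples, not part of the original setup). The initialisation script and the where-clause would then look roughly like this:

# initialisation script: read the timestamp recorded by the previous successful run
$G_LAST_RUN = sql('DS_CONTROL', 'SELECT MAX(LAST_EXTRACT_TS) FROM ETL_CONTROL');

# where-clause of the Query transform in the extraction dataflow
MY_SOURCE_TABLE.LAST_MODIFIED > $G_LAST_RUN

At the end of the job, a second script writes the timestamp of the current extraction back to ETL_CONTROL, so that the next run picks up from there.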
Unfortunately, when none of those options are available, you will probably have to do a full extraction of your source data, which is not such a great idea when dealing with big data. Note that, in such a scenario, you still have the possibility to reduce transformation and load times by calculating the delta yourself. You achieve this by keeping a copy of the data extracted in the previous job run and comparing the current extraction against it with a Table_Comparison transform to calculate the delta. Then you can apply the transformations to the delta only, and write only the new and modified records to the target.
This blog describes a solution around the two potential bottlenecks in that approach:
- Full extraction times are related to the size of the source data: the number of records multiplied by the record size. The wider the records (many and wide columns), the more time it will take to extract the data.
- The time spent in the Table_Comparison transform (make sure to always sort the input data and set the comparison method in the transform to Sorted input!) may negate the gain obtained from transforming and loading a reduced data set. The more columns to compare, the longer the comparison process will run.
Both problems can be overcome by using hash functionality. The example below is for SAP HANA but can be applied to any database that has a built-in hash function or is able to call an external one.
Start by creating two tables in HANA:
- CHECKSUM_MINUS1, to contain checksums from the previous job run. The table has 2 columns, one for the primary key and the other for the hash value of the record.
- CHECKSUM, to contain the same data from the current job run.
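A minimal DDL sketch for these two tables, assuming an integer primary key named PKCOL as in the example further down (hash_sha256 returns a 32-byte value):

CREATE COLUMN TABLE CHECKSUM_MINUS1 (
   PKCOL       INTEGER PRIMARY KEY,
   HASH_VALUE  VARBINARY(32)
);

CREATE COLUMN TABLE CHECKSUM (
   PKCOL       INTEGER PRIMARY KEY,
   HASH_VALUE  VARBINARY(32)
);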
Create a 1st dataflow to copy the checksums from the previous run to CHECKSUM_MINUS1:
As both tables are in the same database, the full logic is pushed down to HANA and execution completes almost instantly.
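Assuming the target table option to delete data before loading is set, the pushed-down logic is equivalent to something like:

TRUNCATE TABLE CHECKSUM_MINUS1;
INSERT INTO CHECKSUM_MINUS1 (PKCOL, HASH_VALUE)
SELECT PKCOL, HASH_VALUE FROM CHECKSUM;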
Create a 2nd dataflow to calculate the checksums. Use a SQL transform as source:
Open the SQL – Transform Editor. Enter the SQL text as:
SELECT PKCOL,
       hash_sha256( to_binary(PKCOL),
                    to_binary(ifnull(INT1,0)), to_binary(ifnull(INT2,0)), to_binary(ifnull(DATE1,'')),
                    to_binary(ifnull(INT3,0)), to_binary(ifnull(INT4,0)), to_binary(ifnull(INT5,'')),
                    to_binary(ifnull(TXT01,'')), to_binary(ifnull(TXT02,'')), to_binary(ifnull(INT6,0)),
                    to_binary(ifnull(TXT03,'')), to_binary(ifnull(TXT04,'')), to_binary(ifnull(INT7,0)),
                    to_binary(ifnull(DEC1,0)), to_binary(ifnull(TXT05,'')), to_binary(ifnull(DEC2,0)),
                    to_binary(ifnull(TXT06,'')), to_binary(ifnull(DEC3,0)), to_binary(ifnull(TXT07,'')),
                    to_binary(ifnull(INT8,0)), to_binary(ifnull(TXT08,'')), to_binary(ifnull(TXT09,'')),
                    to_binary(ifnull(INT9,0.0)), to_binary(ifnull(DEC4,0.0)), to_binary(ifnull(DEC5,0)),
                    to_binary(ifnull(TXT10,'')), to_binary(ifnull(TXT11,'')), to_binary(ifnull(DEC6,0)),
                    to_binary(ifnull(TXT12,'')), to_binary(ifnull(TXT13,'')), to_binary(ifnull(DATE2,''))
                  ) as hash_value
FROM MY_SOURCE_TABLE
Column PKCOL is the primary key column of MY_SOURCE_TABLE that has several additional numeric, date and varchar columns.
Then select the Update schema button and close the Editor window.
The SQL transform prevents full SQL pushdown to HANA, so all data passes through SAP Data Services memory. Because the data set is relatively small (2 columns only) compared to the full source table, the performance gain may still be high.
Create a 3rd dataflow to calculate the delta:
The Table_Comparison compares the input data to CHECKSUM_MINUS1 (current to previous data set).
The Map_Operation maps all input row types to Normal and registers the type of operation in an additional column.
Because there is only one comparison column (the hash value), the Table_Comparison itself also runs much quicker.
When you run the 3 dataflows in sequence, the DELTA table will contain the primary key column of all new, updated and deleted records, with the type of operation.
You can now join DELTA and MY_SOURCE_TABLE to limit the size of the input data set. Apply the necessary transformations to that subset, and use a Map_CDC_Operation to eventually apply the results to the target.
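In SQL terms, that join is roughly the following sketch (assuming DELTA stores the operation type in a column named OPERATION; an outer join from DELTA keeps the keys of deleted records, which no longer exist in the source):

SELECT d.PKCOL, d.OPERATION, s.*
FROM DELTA d
LEFT OUTER JOIN MY_SOURCE_TABLE s
  ON s.PKCOL = d.PKCOL;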
Hello Dirk,
That's a nice write-up. Very helpful!
Regards
Anirban
Hi Dirk,
Nice post with a lot of great ideas for improving delta throughput.
One thing that I don't like is the use of a SQL transform as it breaks all lineage on the object you are creating the delta for, in your example MY_SOURCE_TABLE.
Ideally it would be great to get a source table query pushed down to the DB. I believe this would be possible for Oracle by creating a stored procedure as a wrapper for the hash and importing it into DS, where it can then be used in the mapping. I don't know if HANA procedures/functions can push down in a similar way.
In your example as posted, would you need to:
To explain points 2 and 3 above, taking the example given:
2. If (INT0 = 123, INT1 = 456) changed to (INT0 = 12345, INT1 = 6), there would be no change to the HASH argument. Providing a delimiter would make the values distinct: 123^456 changing to 12345^6.
3. If INT0 changed from NULL to 0, there would be no change to the HASH argument, because ifnull maps NULL to that same default value.
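For illustration only, a delimited hash input that also distinguishes NULL from a real default could look like this (just a few of the example columns are shown; the '|' delimiter and the '~' NULL marker are arbitrary choices that must not occur in the real data):

SELECT PKCOL,
       hash_sha256( to_binary( to_varchar(PKCOL)
                    || '|' || ifnull(to_varchar(INT1), '~')
                    || '|' || ifnull(to_varchar(INT2), '~')
                    || '|' || ifnull(TXT01, '~')
                    || '|' || ifnull(to_varchar(DATE1), '~') ) ) as hash_value
FROM MY_SOURCE_TABLE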
Regards
Pat