
In a POC last year I implemented a Slowly Changing Dimension (type 2, SCD 2) load into a Sybase IQ data warehouse (see SCD for more information on slowly changing dimensions).

I needed to implement a special workaround because Data Services cannot use bulk loads in such a scenario: in SCD 2 the previous versions of changed dimension records need to be updated (the columns IsValid and ValidTo).

Bulk loading is a MUST in IQ if you are loading large amounts of data (say, everything above 100,000 records). If you do not use the bulk load option for target tables, Data Services will send a single INSERT SQL statement for each record to IQ. Columnar databases like IQ are not designed for this kind of transactional processing, and performance will decrease dramatically (I actually had the impression that the system was hanging when I implemented the first version of my load without the bulk load option). With respect to performance, bulk loads are simply the only feasible way to load large amounts of data into IQ.
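Under the hood, the bulk load option makes Data Services use IQ's LOAD TABLE command instead of per-row INSERTs. As a rough, hand-written illustration (the table, column and file names as well as the chosen options are assumptions for this sketch, not what Data Services literally generates):

```sql
-- Illustrative sketch only: Data Services generates a statement of this
-- kind itself when the bulk load option is enabled.
LOAD TABLE DIM_Customers
    (CustomerSK, CustomerName, IsValid, ValidFrom, ValidTo)
FROM '/staging/dim_customers.dat'
DELIMITED BY ';'
ESCAPES OFF
QUOTES OFF
NOTIFY 100000;
```

The data is read directly from a flat file into the column store in one operation, which is why this path is orders of magnitude faster than row-by-row INSERTs.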

Likewise, you should not update thousands or millions of records by issuing one UPDATE SQL statement per record within a loop. Instead, in columnar databases you need to update the whole data set with one batch UPDATE statement.
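To make the contrast concrete, here is a hedged sketch (table and key names are illustrative; the working table holding the affected keys is the one introduced later in this article):

```sql
-- Slow: one statement per affected record, sent millions of times in a loop
UPDATE DIM_Customers SET IsValid = 0 WHERE CustomerSK = 4711;

-- Fast: one set-based statement covering the whole change set at once
UPDATE DIM_Customers
   SET IsValid = 0
 WHERE CustomerSK IN (SELECT CustomerSK FROM TMP_Outdated_Customers);
```

The second form lets the columnar engine touch each affected column once, instead of paying the per-statement overhead millions of times.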

Until now I have not had the chance to implement a similar scenario with Hana. But because Hana is a columnar database too (and in BI environments the column store will be used anyway), my findings and the rest of this article apply to Hana in the same way as to IQ.

Typical dataflow for SCD 2 loads

In Data Services a typical dataflow implementing a SCD 2 load looks like this:


Figure 1: Typical dataflow implementing SCD 2

I am assuming here that we do not get any change-capture data from the source system, so a table comparison in Data Services will do the job to identify changes in the dimensions compared to the previous load. The History Preserving transform implements the actual logic behind SCD 2 and will output records with INSERT opcodes for new dimensions and new versions of changed dimensions. The old versions of changed dimensions will be output as records with an UPDATE opcode. The key generation transform will then generate new surrogate keys for all new dimensions with an INSERT opcode.

Because at the end of the data flow there will be records with INSERT and UPDATE opcodes you cannot use the bulk load option in the target table (bulk load would simply insert all records – even the ones which are intended to get updated – and thus violate referential integrity). On the other hand, performance is simply not acceptable if not using the bulk load option.

Workaround – Data Services 4.0 or previous versions

The workaround for Data Services 4.0 or lower is illustrated in figure 2:


Figure 2: Dataflow implementing SCD 2 with bulk load

After the History Preserving transform the records are split into different branches: the records with INSERT opcodes are filtered and routed to the key generation transform and the final target table. Here we can finally use the bulk load option.

The records with UPDATE opcodes get filtered and mapped to NORMAL opcodes:


Figure 3: Map Transform for old versions of changed dimensions

The following query transform keeps only the surrogate key of the records. This is the primary key of the table and needs to be used later to update the IsValid and ValidTo columns:

Figure 4: Filter surrogate key column of old versions of changed dimension records

The surrogate keys of the old versions of changed dimensions then get inserted into a working table: TMP_Outdated_Customers. Here again, we are using the bulk load option (with truncate before load).

At this stage we managed to load the new dimensions and new versions of changed dimensions into the DWH using the bulk load. We also loaded the surrogate (primary) keys of old versions of changed dimensions using bulk load.

Finally, we only need to update the IsValid and ValidTo columns in the target dimension table using the primary keys in the working table TMP_Outdated_Customers. This can be done with one single UPDATE SQL statement. This is implemented in a Data Services script directly after the dataflow (not illustrated in Figure 2). For instance:

# Arne Weitzel, IT-Logix AG, 2011
#
# Update customers that have been identified as no longer valid as part of the preceding History Preserving transform.
# This is to avoid single-record updates as part of the standard processing of the History Preserving transform.
# Instead, the outdated records have been temporarily loaded into TMP_Outdated_Customers.

print('Updating outdated customer records after history preserving...');

sql('iq_server_db_1', 'update DIM_Customers
                          set IsValid = 0,
                              ValidTo = dateadd(dd, -1, convert(date, {$G_LOAD_DATE_CHAR}, 104))
                         from TMP_Outdated_Customers oc
                        where DIM_Customers.CustomerSK = oc.CustomerSK');

print('Update finished');

Please note that we are using a special SQL feature in Sybase IQ called an update join: joining two or more tables and using the result set to update one of them. In HANA you can use the UPSERT statement instead.
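For HANA, a hedged sketch of the equivalent set-based statement could look like the following (names are taken from the IQ script above; please verify the exact UPSERT semantics, in particular the handling of columns not in the column list, against the HANA SQL reference):

```sql
-- Illustrative sketch only: UPSERT with a subquery matches existing rows
-- on the primary key (CustomerSK) and updates the listed columns.
UPSERT DIM_Customers (CustomerSK, IsValid, ValidTo)
SELECT oc.CustomerSK,
       0,
       ADD_DAYS(TO_DATE({$G_LOAD_DATE_CHAR}, 'DD.MM.YYYY'), -1)
  FROM TMP_Outdated_Customers oc;
```

As in the IQ variant, the whole change set is applied in one statement rather than one round trip per record.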

In IQ such a batch update statement is quite fast: in our test environment we were using IQ on a 4-CPU Windows server with 8 GB of memory. With the update statement above, 10 million dimension records got updated within 7 seconds. The good performance can be achieved here because we are only updating two columns. If we needed to update more columns in a columnar database, the performance would decrease accordingly!

Solution in Data Services 4.1

The problem has been addressed by SAP in Data Services 4.1: you can use the bulk load option for target tables even if the records contain UPDATE or DELETE opcodes. In this case Data Services creates staging tables for the update and/or delete records, bulk loads these records into the staging tables and then applies the updates/deletes in batch SQL statements within the target database. This is similar to the workaround described above, but the coding and maintenance become simple again: you can now go back to the typical solution described in figure 1 and at the same time apply bulk loading.
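Conceptually, what Data Services 4.1 does behind the scenes should look roughly like this (the staging table name is an assumption for this sketch; the actual names and SQL generated by Data Services may differ):

```sql
-- 1. Data Services bulk loads the UPDATE/DELETE rows into a staging table
--    that it creates automatically and drops again after the load.
-- 2. It then applies the changes with one set-based statement per opcode,
--    e.g. an update join against the staging table:
UPDATE DIM_Customers
   SET IsValid = s.IsValid,
       ValidTo = s.ValidTo
  FROM DIM_Customers_stg s
 WHERE DIM_Customers.CustomerSK = s.CustomerSK;
```

This is the same pattern as the manual workaround, just generated and managed by Data Services itself.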

The DS 4.1 documentation describes in detail when this mechanism applies:
For Sybase IQ (source: Technical manuals Data Services 4.1, Performance Optimization Guide):

SAP BusinessObjects Data Services supports bulk loading to Sybase IQ databases via the Sybase IQ
LOAD TABLE SQL command. For detailed information about the Sybase IQ LOAD TABLE parameters
and their behavior in the Sybase IQ database environment, see the relevant Sybase IQ product
documentation.

For improved performance when using changed-data capture or auto correct load, Data Services uses
a temporary staging table to load the target table. Data Services first loads the data to the staging
table, then it applies the operation codes (INSERT, UPDATE, and DELETE) to update the target table.
With the Bulk load option selected in the target table editor, any one of the following conditions triggers
the staging mechanism:
• The data flow contains a Map_CDC_Operation transform
• The data flow contains a Map_Operation transform that outputs UPDATE or DELETE rows
• The Auto correct load option in the target table editor is set to Yes
If none of these conditions are met, that means the input data contains only INSERT rows. Therefore,
Data Services does only a bulk INSERT operation, which does not require a staging table or the need
to execute any additional SQL.

Note that because the bulk loader for Sybase IQ also supports UPDATE and DELETE operations, the
following options (target table editor Options > Advanced > Update control) are also available for
bulk loading:
• Use input keys
• Auto correct load

For Hana (source: Technical manuals Data Services 4.1, Performance Optimization Guide):

SAP BusinessObjects Data Services supports bulk loading to the SAP HANA database.


For improved performance when using changed-data capture or auto correct load, Data Services uses
a temporary staging table to load the target table. Data Services first loads the data to the staging table,
then it applies the operation codes (INSERT, UPDATE, and DELETE) to update the target table. With
the Bulk load option selected in the target table editor, any one of the following conditions triggers the
staging mechanism:


• The data flow contains a Map_CDC_Operation transform
• The data flow contains a Map_Operation transform that outputs UPDATE or DELETE rows
• The data flow contains a Table_Comparison transform
• The Auto correct load option in the target table editor is set to Yes
If none of these conditions are met, that means the input data contains only INSERT rows. Therefore
Data Services does only a bulk insert operation, which does not require a staging table or the need to
execute any additional SQL.


By default, Data Services automatically detects the SAP HANA target table type and updates the table
accordingly for optimal performance.

Because the bulk loader for SAP HANA is scalable and supports UPDATE and DELETE operations,
the following options (target table editor Options > Advanced > Update control) are also available for
bulk loading:
• Use input keys
• Auto correct load

Performance considerations

The solution in DS 4.1 simplifies coding and maintenance. But performance may still be a bit slower than with the workaround described above for DS 4.0 or lower:

With the new solution in 4.1, Data Services bulk loads ALL columns of the UPDATE records in the dataflow to the staging table. In contrast, the workaround for DS 4.0 only uploads the primary key column(s). If your target table has a lot of columns, the bulk load performance will be much better with the DS 4.0 workaround solution.


3 Comments


  1. Venkata Ramana Paidi

Hi Arne,

    It is very informative and good. I have some doubts about this.

    I want to load the data into Sybase IQ. Here my ETL jobs are SCD type 2 jobs, i.e. maintaining the full history in the dimension tables. Can I use the bulk load option for the dimension tables?

    In the technical document they have given that if the data flow contains a Map_CDC_Operation or Map_Operation transform, BODS loads the data into a temp table and then into the target. Does Data Services automatically create the temp table and load the data into the target in this case?

    If the flow doesn't contain Map_Operation or Map_CDC_Operation and uses only Table Comparison, History Preserving and Key Generation, can we then not use bulk load for Sybase IQ?

    In the technical document they have given like this

    For improved performance when using changed-data capture or auto correct load, Data Services uses

    a temporary staging table to load the target table. Data Services first loads the data to the staging

    table, then it applies the operation codes (INSERT, UPDATE, and DELETE) to update the target table.

    With the Bulk load option selected in the target table editor, any one of the following conditions triggers

    the staging mechanism:

    • The data flow contains a Map_CDC_Operation transform.

    • The data flow contains a Map_Operation transform that outputs UPDATE or DELETE rows.

    • The Auto correct load option in the target table editor is set to Yes.

    If none of these conditions are met, that means the input data contains only INSERT rows. Therefore,

    Data Services does only a bulk INSERT operation, which does not require a staging table or the need

    to execute any additional SQL.

    I am using BODS 4.1 SP1 and Sybase IQ 15.4

    1. Arne Weitzel Post author

      Hi Venkata,

      Yes, in DS 4.1 it will create the temporary table automatically and do the insert/update operations in a batch, not by using single-record operations.

      The documentation states that one condition is that a Table_Comparison transform is part of the dataflow. We have such a setup in the typical SCD2 dataflow in figure 1.

      Try it out, it should work.

      Arne

      1. Venkata Ramana Paidi

        Hi Arne,

        Thanks for your clarification .

        In my system I have DS 4.2 and I tried bulk load for SCD1. As you mentioned, the temporary table is created automatically. But I think there is a bug in 4.2: while generating the temporary table it is adding the _s outside of the double quotes.

        Example: if the target table is EMP_TGT, then in the load statement syntax it generates "EMP_TGT"_s

        Thanks & Regards,

        Ramana.

