
In a POC last year I implemented a Slowly Changing Dimension (type 2, SCD 2) load into a Sybase IQ data warehouse (see SCD for more information on slowly changing dimensions).

I needed to implement a special workaround because Data Services cannot use bulk loads in such a scenario: in SCD 2 the previous versions of changed dimension records need to be updated (the columns IsValid and ValidTo).

Bulk loading is a MUST in IQ if you are loading large amounts of data (say, everything above 100,000 records). If you do not use the bulk load option for target tables, Data Services will send a single INSERT SQL statement for each record to IQ. Columnar databases like IQ are not designed for this kind of transactional processing, and performance will decrease dramatically (I actually had the impression that the system was hanging when I implemented the first version of my load without the bulk load option). With respect to performance, bulk loads are simply the only feasible way to load large amounts of data into IQ.
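Under the hood, the bulk load option makes Data Services use IQ's LOAD TABLE command instead of per-row INSERTs. As a rough, hand-written illustration (the table, column and file names as well as the chosen options are assumptions for this sketch, not what Data Services literally generates):

```sql
-- Illustrative sketch only: Data Services generates a statement of this
-- kind itself when the bulk load option is enabled.
LOAD TABLE DIM_Customers
    (CustomerSK, CustomerName, IsValid, ValidFrom, ValidTo)
FROM '/staging/dim_customers.dat'
DELIMITED BY ';'
ESCAPES OFF
QUOTES OFF
NOTIFY 100000;
```

The data is read directly from a flat file into the column store in one operation, which is why this path is orders of magnitude faster than row-by-row INSERTs.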

Likewise, you should not update thousands or millions of records by issuing one UPDATE SQL statement per record within a loop. Instead, in columnar databases you need to update the whole data set with one batch UPDATE statement.
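To make the contrast concrete, here is a hedged sketch (table and key names are illustrative; the working table holding the affected keys is the one introduced later in this article):

```sql
-- Slow: one statement per affected record, sent millions of times in a loop
UPDATE DIM_Customers SET IsValid = 0 WHERE CustomerSK = 4711;

-- Fast: one set-based statement covering the whole change set at once
UPDATE DIM_Customers
   SET IsValid = 0
 WHERE CustomerSK IN (SELECT CustomerSK FROM TMP_Outdated_Customers);
```

The second form lets the columnar engine touch each affected column once, instead of paying the per-statement overhead millions of times.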

Until now I have not had the chance to implement a similar scenario with Hana. But because Hana is a columnar database too (and in BI environments the column store will be used anyway), my findings and the rest of this article apply to Hana in the same way as to IQ.

Typical dataflow for SCD 2 loads

In Data Services a typical dataflow implementing a SCD 2 load looks like this:


Figure 1: Typical dataflow implementing SCD 2

I am assuming here that we do not get any change-capture data from the source system, so a table comparison in Data Services will do the job to identify changes in the dimensions compared to the previous load. The History Preserving transform implements the actual logic behind SCD 2 and will output records with INSERT opcodes for new dimensions and new versions of changed dimensions. The old versions of changed dimensions will be output as records with an UPDATE opcode. The key generation transform will then generate new surrogate keys for all new dimensions with an INSERT opcode.

Because at the end of the data flow there will be records with INSERT and UPDATE opcodes you cannot use the bulk load option in the target table (bulk load would simply insert all records – even the ones which are intended to get updated – and thus violate referential integrity). On the other hand, performance is simply not acceptable if not using the bulk load option.

Workaround – Data Services 4.0 or previous versions

The workaround for Data Services 4.0 or lower is illustrated in figure 2:


Figure 2: Dataflow implementing SCD 2 with bulk load

After the History Preserving transform the records are split into different branches: the records with INSERT opcodes are filtered and routed to the key generation transform and the final target table. Here we can finally use the bulk load option.

The records with UPDATE opcodes get filtered and mapped to NORMAL opcodes:


Figure 3: Map Transform for old versions of changed dimensions

The following query transform keeps only the surrogate key of the records. This is the primary key of the table and needs to be used later to update the IsValid and ValidTo columns:

Figure 4: Filter surrogate key column of old versions of changed dimension records

The surrogate keys of the old versions of changed dimensions then get inserted into a working table: TMP_Outdated_Customers. Here again, we are using the bulk load option (with truncate before load).

At this stage we managed to load the new dimensions and new versions of changed dimensions into the DWH using the bulk load. We also loaded the surrogate (primary) keys of old versions of changed dimensions using bulk load.

Finally, we only need to update the IsValid and ValidTo columns in the target dimension table using the primary keys in the working table TMP_Outdated_Customers. This can be done with one single UPDATE SQL statement. This is implemented in a Data Services script directly after the dataflow (not illustrated in Figure 2). For instance:

# Arne Weitzel, IT-Logix AG, 2011
#
# Update customers that have been identified as no longer valid as part of the preceding History Preserving transform.
# This is to avoid single-record updates as part of the standard processing of the History Preserving transform.
# Instead, the outdated records have been temporarily loaded into TMP_Outdated_Customers.

print('Updating outdated customer records after history preserving...');

sql('iq_server_db_1', 'update DIM_Customers
                          set IsValid = 0,
                              ValidTo = dateadd(dd, -1, convert(date, {$G_LOAD_DATE_CHAR}, 104))
                         from TMP_Outdated_Customers oc
                        where DIM_Customers.CustomerSK = oc.CustomerSK');

print('Update finished');

Please note that we are using a special SQL feature in Sybase IQ called an update join: joining two or more tables and using the result set to update one of them. In HANA you can use the UPSERT statement instead.
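For HANA, a hedged sketch of the equivalent set-based statement could look like the following (names are taken from the IQ script above; please verify the exact UPSERT semantics, in particular the handling of columns not in the column list, against the HANA SQL reference):

```sql
-- Illustrative sketch only: UPSERT with a subquery matches existing rows
-- on the primary key (CustomerSK) and updates the listed columns.
UPSERT DIM_Customers (CustomerSK, IsValid, ValidTo)
SELECT oc.CustomerSK,
       0,
       ADD_DAYS(TO_DATE({$G_LOAD_DATE_CHAR}, 'DD.MM.YYYY'), -1)
  FROM TMP_Outdated_Customers oc;
```

As in the IQ variant, the whole change set is applied in one statement rather than one round trip per record.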

In IQ such a batch update statement is quite fast: in our test environment we were using IQ on a 4-CPU Windows server with 8 GB of memory. With the update statement above, 10 million dimension records got updated within 7 seconds. The good performance can be achieved here because we are only updating two columns. If we needed to update more columns in a columnar database, the performance would decrease accordingly!

Solution in Data Services 4.1

The problem has been addressed by SAP in Data Services 4.1: you can use the bulk load option for target tables even if the records contain UPDATE or DELETE opcodes. In this case Data Services creates staging tables for the update and/or delete records, bulk loads these records into the staging tables and then applies the updates/deletes in batch SQL statements within the target database. This is similar to the workaround described above, but the coding and maintenance become simple again: you can now go back to the typical solution described in figure 1 and at the same time apply bulk loading.
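Conceptually, what Data Services 4.1 does behind the scenes should look roughly like this (the staging table name is an assumption for this sketch; the actual names and SQL generated by Data Services may differ):

```sql
-- 1. Data Services bulk loads the UPDATE/DELETE rows into a staging table
--    that it creates automatically and drops again after the load.
-- 2. It then applies the changes with one set-based statement per opcode,
--    e.g. an update join against the staging table:
UPDATE DIM_Customers
   SET IsValid = s.IsValid,
       ValidTo = s.ValidTo
  FROM DIM_Customers_stg s
 WHERE DIM_Customers.CustomerSK = s.CustomerSK;
```

This is the same pattern as the manual workaround, just generated and managed by Data Services itself.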

The DS 4.1 documentation describes in detail when this mechanism applies:
For Sybase IQ (source: Technical manuals Data Services 4.1, Performance Optimization Guide):

SAP BusinessObjects Data Services supports bulk loading to Sybase IQ databases via the Sybase IQ
LOAD TABLE SQL command. For detailed information about the Sybase IQ LOAD TABLE parameters
and their behavior in the Sybase IQ database environment, see the relevant Sybase IQ product
documentation.

For improved performance when using changed-data capture or auto correct load, Data Services uses
a temporary staging table to load the target table. Data Services first loads the data to the staging
table, then it applies the operation codes (INSERT, UPDATE, and DELETE) to update the target table.
With the Bulk load option selected in the target table editor, any one of the following conditions triggers
the staging mechanism:
• The data flow contains a Map_CDC_Operation transform
• The data flow contains a Map_Operation transform that outputs UPDATE or DELETE rows
• The Auto correct load option in the target table editor is set to Yes
If none of these conditions are met, that means the input data contains only INSERT rows. Therefore,
Data Services does only a bulk INSERT operation, which does not require a staging table or the need
to execute any additional SQL.

Note that because the bulk loader for Sybase IQ also supports UPDATE and DELETE operations, the
following options (target table editor Options > Advanced > Update control) are also available for
bulk loading:
• Use input keys
• Auto correct load

For Hana (source: Technical manuals Data Services 4.1, Performance Optimization Guide):

SAP BusinessObjects Data Services supports bulk loading to the SAP HANA database.


For improved performance when using changed-data capture or auto correct load, Data Services uses
a temporary staging table to load the target table. Data Services first loads the data to the staging table,
then it applies the operation codes (INSERT, UPDATE, and DELETE) to update the target table. With
the Bulk load option selected in the target table editor, any one of the following conditions triggers the
staging mechanism:


• The data flow contains a Map_CDC_Operation transform
• The data flow contains a Map_Operation transform that outputs UPDATE or DELETE rows
• The data flow contains a Table_Comparison transform
• The Auto correct load option in the target table editor is set to Yes
If none of these conditions are met, that means the input data contains only INSERT rows. Therefore
Data Services does only a bulk insert operation, which does not require a staging table or the need to
execute any additional SQL.


By default, Data Services automatically detects the SAP HANA target table type and updates the table
accordingly for optimal performance.

Because the bulk loader for SAP HANA is scalable and supports UPDATE and DELETE operations,
the following options (target table editor Options > Advanced > Update control) are also available for
bulk loading:
• Use input keys
• Auto correct load

Performance considerations

The solution in DS 4.1 simplifies coding and maintenance. But performance may still be a bit slower than with the workaround described above for DS 4.0 or lower:

With the new solution in 4.1, Data Services bulk loads ALL columns of the UPDATE records in the dataflow to the staging table. In contrast, the workaround for DS 4.0 only uploads the primary key column(s). If your target table has a lot of columns, the bulk load performance will be much better with the DS 4.0 workaround solution.


3 Comments


  1. Venkata Ramana Paidi

Hi Arne,

    It is very informative and good. I have some doubts about this.

    I want to load the data into Sybase IQ. Here my ETL jobs are SCD type 2 jobs, i.e. maintaining the full history in the dimension tables. Can I use the bulk load option for the dimension tables?

    In the technical document they have given that if the data flow contains a Map_CDC_Operation or Map_Operation transform, BODS loads the data into a temp table and then into the target. Does Data Services automatically create the temp table and load the data into the target in this case?

    If the flow doesn't contain Map_Operation or Map_CDC_Operation and uses only Table Comparison, History Preserving and Key Generation, can we then not use bulk load for Sybase IQ?

    In the technical document they have given like this

    For improved performance when using changed-data capture or auto correct load, Data Services uses

    a temporary staging table to load the target table. Data Services first loads the data to the staging

    table, then it applies the operation codes (INSERT, UPDATE, and DELETE) to update the target table.

    With the Bulk load option selected in the target table editor, any one of the following conditions triggers

    the staging mechanism:

    • The data flow contains a Map_CDC_Operation transform.

    • The data flow contains a Map_Operation transform that outputs UPDATE or DELETE rows.

    • The Auto correct load option in the target table editor is set to Yes.

    If none of these conditions are met, that means the input data contains only INSERT rows. Therefore,

    Data Services does only a bulk INSERT operation, which does not require a staging table or the need

    to execute any additional SQL.

    I am using BODS 4.1 SP1 and Sybase IQ 15.4

    1. Arne Weitzel Post author

      Hi Venkata,

      Yes, in DS 4.1 it will create the temporary table automatically and do the insert/update operations in a batch, not by using single-record operations.

      The documentation states that one condition is that a Table_Comparison transform is part of the dataflow. We have such a setup in the typical SCD2 dataflow in figure 1.

      Try it out, it should work.

      Arne

      1. Venkata Ramana Paidi

        Hi Arne,

        Thanks for your clarification .

        In my system I have DS 4.2 and I tried bulk load for SCD1. As you mentioned, the temporary table is created automatically. But I think there is a bug in 4.2: while generating the temporary table it is adding the _s outside of the double quotes.

        Example: if the target table is EMP_TGT, then in the load statement syntax it generates "EMP_TGT"_s

        Thanks & Regards,

        Ramana.

