We had a POC where it was required to load large flat files into a Sybase IQ data warehouse. We used Data Services 4.0 (DS) to load the files into the staging area of the data warehouse.
Loading files into a database table is an easy job for an ETL tool. In DS, all you need to do is create a file format and import the staging table from IQ. Then connect the file format with the staging table in a dataflow, for instance:
Please note that this is a typical scenario when you are loading large files into an analytic or columnar database such as SAP Sybase IQ or SAP HANA. Even if you need to apply some transformations at the beginning of your ETL process, you would rather do this in a second step, because the database can usually process transformations much faster than Data Services can. So, instead of placing a Data Services query transform right after the file format, you may want to first import the file into the database as-is. In the next dataflow you would then use the imported table (IMP_Customers in the example above) as a source table and specify the transformation logic in a query transform. Ideally the transformation logic gets pushed down to IQ and thus benefits from the high performance of the columnar database. In most of the cases I have seen so far, this approach proved quicker than executing the transformation logic within the ETL tool.
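To illustrate what a full pushdown might look like, here is a minimal sketch of the SQL that Data Services could generate for such a second dataflow. Only IMP_Customers appears in the example above; the target table STG_Customers, the column names, and the transformation expressions are hypothetical:

```sql
-- Hypothetical pushdown: the whole query transform becomes a single
-- INSERT ... SELECT that IQ executes internally, so no rows travel
-- through the Data Services engine.
INSERT INTO STG_Customers (customer_id, full_name, country_code)
SELECT customer_id,
       TRIM(first_name) || ' ' || TRIM(last_name),  -- example transformation
       UPPER(country_code)                           -- example transformation
  FROM IMP_Customers;
```

You can verify in the Data Services "Display Optimized SQL" view whether the engine really generates one such statement or falls back to row-by-row processing.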
So far so good, but there are two things to keep in mind when loading flat files directly into a Sybase IQ table. First, you need to decide on the loading mechanism that you specify in the target table options in Data Services:
In all the tests I have done so far in various setups, loading a complete file that resides locally on the IQ server was always the fastest option. Named pipes seem to slow down the bulk load. Still, I would recommend that you test these options in your own environment. If you decide not to use named pipes, there is one more important issue to keep in mind:
The screenshot above might suggest that Data Services simply sends a LOAD TABLE ... USING FILE ... command to IQ, that the whole loading process is managed entirely by IQ, and that Data Services actually has nothing to do in the dataflow. This is not true, though:
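For reference, such a LOAD TABLE statement might look roughly like the sketch below. The file path, column list, and load options are illustrative assumptions (the exact clauses available depend on your IQ version), not the statement Data Services actually generates:

```sql
-- Hypothetical IQ bulk load statement: the file must be readable
-- by the IQ server process, since IQ opens it directly.
LOAD TABLE IMP_Customers
   ( customer_id, first_name, last_name, country_code )
USING FILE '/iq/bulkload/customers.dat'   -- path local to the IQ server
DELIMITED BY ';'
ESCAPES OFF
QUOTES OFF
NOTIFY 1000000;   -- progress message every 1 million rows
```

The key point of the next paragraphs is that Data Services does more work before this statement is ever sent.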
You can read these processing steps from the trace log, for instance:
In this example we are loading 130 million records with the dataflow as shown in the screenshot above. From 10:53 until 11:07 Data Services did the first two steps: loading the file into its memory and writing it into the bulk load directory. At 11:07 Data Services sent the LOAD TABLE statement to IQ. From 11:07 until 11:11 IQ loaded the interim file from the bulk load directory into its table.
If you do not depend on the file error handling of Data Services, the extra work Data Services performs from 10:53 until 11:07 looks like pure overhead and a waste of time. Furthermore, a similar kind of error handling can be defined at the level of the bulk load statement in IQ, and this is also supported by Data Services: you can specify the error handling options of the IQ bulk load in the target table options (obviously, your ETL process would still need to check the IQ error log files if you require tighter control over loading errors):
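As a sketch of what such bulk-load-level error handling can look like in the LOAD TABLE statement itself, consider the fragment below. The table, file paths, and thresholds are hypothetical; check the LOAD TABLE reference for your IQ version for the exact option names and behavior:

```sql
-- Hypothetical example: let IQ itself tolerate and log bad rows
-- instead of aborting the whole bulk load.
LOAD TABLE IMP_Customers
   ( customer_id, first_name, last_name, country_code )
USING FILE '/iq/bulkload/customers.dat'
DELIMITED BY ';'
ESCAPES OFF
QUOTES OFF
-- tolerate up to 100 constraint violations before failing the load
IGNORE CONSTRAINT ALL 100
-- write error messages and the rejected rows to files on the IQ server;
-- the ETL process can inspect these logs after the load
MESSAGE LOG '/iq/bulkload/customers.msg.log'
ROW LOG '/iq/bulkload/customers.row.log'
ONLY LOG ALL;
```

Whatever thresholds you choose, the important design point is that the error checking then happens inside IQ during the load, so Data Services no longer needs to touch every row.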
If you don't need the error handling features of Data Services while loading a file (or if you can handle these with the IQ bulk load error handling) you may want to get rid of the extra loading/writing steps that Data Services is doing. There are two options: