Data Services: Loading flat files into Sybase IQ
We had a POC where it was required to load large flat files into a Sybase IQ data warehouse. We used Data Services 4.0 (DS) to load the files into the staging area of the data warehouse.
Loading files into a database table is an easy job for an ETL tool. In DS, all you need to do is create a file format and import the staging table from IQ. Then connect the file format with the staging table in a dataflow, for instance:
Please note that this is a typical scenario if you are loading large files into an analytic or columnar database such as SAP Sybase IQ or SAP Hana. Even if you need to do some transformations at the beginning of your ETL process you would rather do this in a second step, because the database can most often process transformations much faster than Data Services could do. So, instead of using a Data Services query transform right after the file format you may want to simply first import the file in the database. Then in the next dataflow you would use the imported table (IMP_Customers in the example above) as a source table and specify the transformation logic in a query transform. Ideally the transformation logic will get pushed down to IQ and thus benefit from the high performance of the columnar database. In most of the cases I have seen so far this approach proofed to be quicker as opposed to executing the transformation logic within the ETL tool.
So far so good, but there two things to keep in mind when loading flat files directly into a Sybase IQ table. First you need to decide on the loading mechanism that you specify in the target table options in Data Services:
Available loading mechanisms:
- Singe-record insert statements: this is the default option in Data Services. The tool will send a SQL insert statement for each record to be loaded. In fact it is a clear no-go if you are loading large amounts of data. It can work fine for a few hundred or thousand records. It will be a definite showstopper if you are loading hundred thousands or millions or records. The only feasible solution for this is to use the bulk loading option in the target table.
- With the bulk loading option the client (Data Services) simply sends the LOAD TABLE … USING FILE …command to IQ. IQ will then handle the data load process completely on its own. Here again, IQ offers various other options which are all supported by Data Services. You can specify these options in the advanced datastore configuration:
- JS and DB on same machine: the file can reside local on the Sybase IQ server or remote. If it is remote, the IQ ODBC client will handle the transfer of the file from the client machine (in this case the Data Services server) to the IQ server.
- Use named pipe: the bulk load can be managed using a named pipe as data file: in this scenario another process keeps on writing data into the file (named pipe) while at the same time IQ picks up the new incoming records from the file and loads them into the table. Again, named pipes can reside local on the IQ server or remote on the client machine.
In all my tests that I have done so far in various setups the option of loading a complete file which is local on the IQ server was always the fastest option. Named pipes seem to slow down the bulk load. Still, I would recommend that you test these various options in your own environment. If you decide not to use named pipes there is still one important issue to keep in mind:
Bulk loading from Data Services into IQ:
The screenshot above might infer that Data Services would simply send a LOAD TABLE … USING FILE … command to IQ. The whole loading process would be managed completely by IQ and Data Services actually had nothing to do in the data flow. This not true, though:
- First, Data Services loads the file into its memory. In this step it will also perform all kind of error handling that is specified at the level of the file format!
- Second, Data Services writes the just loaded data to another file into its bulk loader directory (option Bulk loader directory in the advanced datastore settings – if not specified it is the ~/BusinessObjects/DataServices/log/bulkloader directory).
- Third it sends the LOAD TABLE … USING FILE …command to IQ to initiate the bulk loading process within IQ.
You can read these processing steps from the trace log, for instance:
In this example we are loading 130 million records with the dataflow as shown in the screenshot above. From 10:53 until 11:07 Data Services did the first two steps: loading the file into its memory and writing it into the bulk load directory. At 11:07 Data Services sent the LOAD TABLE statement to IQ. From 11:07 until 11:11 IQ loaded the interim file from the bulk load directory into its table.
If you are not depending on the file error handling from Data Services the extra work from Data Services from 10:53 until 11:07 looks like useless overhead and waste of time. Furthermore, a similar kind of error handling could be defined at the level of the bulk load statement in IQ. It is also supported by Data Services: you can specify the error handling options of the IQ bulk load in the target table options (obviously, your ETL process still needed to check the IQ error log files if you need to have some tighter control in case of loading errors):
How to avoid the extra work of Data Services ?
If you don’t need the error handling features of Data Services while loading a file (or if you can handle these with the IQ bulk load error handling) you may want to get rid of the extra loading/writing steps that Data Services is doing. There are two options:
- Write your own LOAD TABLE statement in a Data Services script using the sql function. I did this by copying the LOAD table statement from the trace log from Data Services. You will still need to adapt this LOAD TABLE statement, because Data Services will probably write its own interim file in a slight different format than the original file. Probably row delimiter and field delimiter need to be adapted.
- (More unrealistic:) vote for my improvement suggestion in SAP idea place: the solution above works definitely, but this is not the way we want to code an ETL process when using an ETL tool. I therefore suggested that the IQ bulk load from a data file (which is provided by some supplier outside Data Services) bypasses the loading into the memory of Data Services. In this case features like error handling in Data Services needed to be deacti