Data Services Best Practices: Job Design Tips
Job and Dataflow Design
A good rule of thumb is one data flow per target table. Divide your Data Services application into components with minimal interdependencies. This modular design produces a simpler architecture and makes maintenance easier. It also helps performance tuning, because independent components can be run and tuned separately.
Bulk loading can be used to load a large amount of data quickly. If a data flow performs both updates and inserts, separate them: apply the updates first, then drop all indexes on the target table, bulk-load the new rows, and finally rebuild the indexes. Bulk loading to Sybase ASE databases is supported through the Sybase ASE bulk copy utility.
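The update-then-bulk-load sequence can be sketched as follows. This is an illustrative Python/sqlite3 model of the step ordering, not a Data Services script; the table, index, and column names are hypothetical, and `executemany` stands in for the database's bulk copy utility.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, val TEXT)")
cur.execute("CREATE INDEX idx_val ON target (val)")
cur.execute("INSERT INTO target VALUES (1, 'old')")

# Step 1: apply updates to existing rows first.
cur.execute("UPDATE target SET val = 'new' WHERE id = 1")

# Step 2: drop secondary indexes before the bulk insert.
cur.execute("DROP INDEX idx_val")

# Step 3: bulk-load all new rows (executemany stands in for bulk copy).
new_rows = [(i, f"row{i}") for i in range(2, 1002)]
cur.executemany("INSERT INTO target VALUES (?, ?)", new_rows)

# Step 4: rebuild the index after the load completes.
cur.execute("CREATE INDEX idx_val ON target (val)")
conn.commit()
print(cur.execute("SELECT COUNT(*) FROM target").fetchone()[0])  # 1001
```

Rebuilding the index once after the load is cheaper than maintaining it row by row during a large insert.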
Lookup vs. Join
The advantage of using the Data Services lookup functions is that they always return exactly one row. The disadvantage is that the lookup is performed inside Data Services and therefore requires round-trip communication between the Data Services server and the database server, although caching can significantly reduce this overhead. Joins can be pushed down to the database server, but they may return more than one row per input row.
Caching Options in Lookup Function
The right caching option depends on the available physical memory, the size of the lookup table, and the application's usage pattern. If the lookup table is too large to fit in physical memory, demand-load caching is recommended, because Data Services will cache only the values that are actually used. If the lookup table is small, you can use the pre-load caching option. If the input data set is large but references relatively few distinct lookup values, the “DEMAND_LOAD_CACHE” option is the more efficient choice.
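The benefit of demand-load caching can be sketched as a memoized lookup: each distinct key costs one database round trip, after which repeats are served from memory. This is a Python analogy with illustrative names, where a dict stands in for the database table.

```python
# Demand-load caching sketch: fetch a lookup value on first use,
# then serve all later uses of the same key from the cache.
db_calls = 0
lookup_table = {"US": "United States", "DE": "Germany"}  # the "database"

cache = {}
def lookup(key, default=None):
    global db_calls
    if key not in cache:            # first use: simulate a database fetch
        db_calls += 1
        cache[key] = lookup_table.get(key, default)
    return cache[key]               # later uses: no round trip needed

rows = ["US", "DE", "US", "US", "DE"]   # large input, few distinct keys
resolved = [lookup(r) for r in rows]
print(db_calls)  # 2 -- one fetch per distinct key, not per input row
```

With a large input and few distinct keys, the cache converts five potential round trips into two, which is exactly the scenario where DEMAND_LOAD_CACHE pays off.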
Windows Authentication vs. Database Authentication
Windows Authentication is only available when MS SQL Server is used as the repository or datastore connection type. When selected, MS SQL Server validates the user's login name and password using information from the Windows operating system.
Database authentication can be used for any type of RDBMS connection type including MS SQL Server, Oracle, DB2, etc. This mode authenticates the user login information against the user account on the database.
Try/Catch Error Handling
Data Services uses two special transforms, Try and Catch, to trap exceptions at runtime. When an error or exception is raised in the code block surrounded by the Try/Catch pair, control passes to the Catch, where the error-handling code usually resides.
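The pattern is the familiar try/except structure, shown here as a Python analogy. The load function and its error message are hypothetical placeholders, not Data Services APIs.

```python
# "Try" block wraps the risky work; "Catch" block holds the recovery code.
def load_target_table(rows):
    if not rows:
        raise ValueError("no rows to load")
    return len(rows)

def run_dataflow(rows):
    try:                       # the Try block: normal processing
        return load_target_table(rows)
    except ValueError as exc:  # the Catch block: error-handling code
        print(f"load failed, logged: {exc}")
        return 0

print(run_dataflow([]))   # error path: handler runs, returns 0
print(run_dataflow([1]))  # normal path: returns 1
```

Keeping the recovery logic in one Catch block, rather than scattered through the flow, mirrors how the Catch transform centralizes error handling for everything inside the Try.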
Key_Generation Transform
This transform is used to generate surrogate keys for a target table. It is very efficient because it reads the maximum value of the key column in the target table only once before it begins generating keys. However, there is an important side effect you must be aware of: the transform is not synchronized across data flows. Do not use it in more than one data flow feeding the same target table.
It is OK when only one Key_Generation transform is used for a target table.
Incorrect use of Key_Generation
When two Key_Generation transforms are used to populate the same target table, duplicate keys can be generated within the Data Services process. As a result, either duplicate keys are loaded into the target table (if there is no constraint on the key column), or the load fails with a “unique key” constraint violation.
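Why the duplication happens can be modeled in a few lines: each transform reads the current maximum key once and then increments a private counter, so two unsynchronized instances hand out the same sequence. This is an illustrative Python model, not the transform itself.

```python
# Model of Key_Generation: read the max key once, then count up privately.
def make_keygen(current_max):
    next_key = current_max
    def keygen():
        nonlocal next_key
        next_key += 1
        return next_key
    return keygen

existing_max = 100             # both data flows read the same starting max
gen_a = make_keygen(existing_max)
gen_b = make_keygen(existing_max)

keys_a = [gen_a() for _ in range(3)]   # [101, 102, 103]
keys_b = [gen_b() for _ in range(3)]   # [101, 102, 103] -- same keys!
print(sorted(set(keys_a) & set(keys_b)))  # the overlapping (duplicate) keys
```

Because neither generator sees the rows the other has produced, every key from the second data flow collides with one from the first.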
Job Scheduling
Data Services jobs can be executed and scheduled with greater flexibility than in previous versions. First, DS jobs can be scheduled in BOE. Once the scheduling object is created in BOE, it can be modified and scheduled with all options in the BOE CMC (Central Management Console, the standard BOE administration console). In the CMC, schedule events can be created to kick off a job based on the completion status (success or failure) of another job.
Second, file-based scheduling is available through the “wait_for_file” function. With this function, a job can be started based on the existence of a file in a local folder on the Data Services Job Server. The function can be called in a script as the first step of a job, so that execution of the remaining steps depends on whether the file exists in the folder.
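Conceptually, a wait-for-file gate polls a folder until the trigger file appears or a timeout expires. The sketch below is a Python analogy with hypothetical file names and timings; the real Data Services function has its own signature and options.

```python
import os
import time

# Poll for a trigger file; return True if it appears before the timeout.
def wait_for_file(path, timeout_s=5.0, poll_s=0.1):
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if os.path.exists(path):
            return True           # file arrived: let the job proceed
        time.sleep(poll_s)        # not yet: sleep and poll again
    return False                  # timed out: skip or fail the job

# Usage: the job's first step gates the rest on the file's existence.
trigger = "trigger.ready"         # hypothetical trigger-file name
open(trigger, "w").close()
print(wait_for_file(trigger, timeout_s=1.0))  # True
os.remove(trigger)
print(wait_for_file(trigger, timeout_s=0.3))  # False
```

In a real job, the boolean result would decide whether the downstream data flows run at all.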
Third, jobs can also be exported to SAP BW and executed and scheduled using an InfoPackage.
Remove Empty Rows from Excel Input Files
Excel source files may contain whitespace in some cells. Without filtering, rows containing only empty cells generate meaningless all-NULL rows in the target table. One way to deal with this is to ask the business users to clean up the files before sending them to Data Services. While that solves the problem, it adds considerable overhead to the pre-processing of Excel files.
A better way to address the issue is to add logic to the Data Services data flow that excludes all-NULL rows before loading the target table. For example, assume there are three input columns, column_1, column_2, and column_3, loaded from Excel. Adding the following condition to the Where clause eliminates the empty rows:
NOT (column_1 IS NULL AND column_2 IS NULL AND column_3 IS NULL)
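The same filter can be expressed in Python for clarity; this sketch also treats whitespace-only strings as empty, which covers the cells described above. The sample rows are illustrative.

```python
# Keep a row only if at least one cell holds real data.
def is_empty(cell):
    return cell is None or (isinstance(cell, str) and cell.strip() == "")

rows = [
    ("a", "b", "c"),
    (None, None, None),   # all-NULL row from a blank Excel line: drop
    ("  ", None, ""),     # whitespace-only cells count as empty: drop
    ("d", None, None),    # partially filled row: keep
]
kept = [r for r in rows if not all(is_empty(c) for c in r)]
print(len(kept))  # 2
```

Note that the SQL condition above drops only all-NULL rows; if whitespace-only cells arrive as non-NULL strings, the Where clause would also need a trim/empty-string check to match this behavior.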