The following example uses most of the transforms,
with the Customer table as the source.
a) Key generation
The surrogate key (similar to a surrogate ID) is generated in the Key Generation transform. In this transform, select the table name and set the increment value.
The Key Generation transform generates artificial keys for new rows in a table. It looks up the maximum existing value of the surrogate key column in the table and uses it as the starting value for generating keys for the new rows in the input dataset. The transform expects the input schema to contain a column with the same name as the generated key column of the source table.
The source table must be imported into the DS repository before it can be selected as the source table for this transform. We can also set the increment value, i.e. the interval between generated key values. The default is 1, and a variable placeholder can be used for this option. This transform is used frequently when populating the surrogate key values of slowly changing dimension tables.
For each row identified by the column EMP_ID, EMP_SURR_KEY (the surrogate key) is incremented by the increment value.
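The key assignment described above can be sketched in a few lines of Python. This is only an illustration of the logic, not how DS implements it: the function name generate_keys and the sample rows are hypothetical, and starting from 0 when the table is empty is an assumption.

```python
def generate_keys(existing_keys, new_rows, key_column="EMP_SURR_KEY", increment=1):
    """Assign surrogate keys to new rows, continuing from the maximum existing key."""
    # Assumption: an empty target table is treated as having a maximum key of 0.
    next_key = max(existing_keys, default=0) + increment
    keyed = []
    for row in new_rows:
        out = dict(row)             # copy so the input rows stay untouched
        out[key_column] = next_key  # overwrite the placeholder key column
        keyed.append(out)
        next_key += increment
    return keyed


# Keys already present in the target table (hypothetical values).
target_keys = [1, 2, 3]
# Input rows carry the key column, as the transform expects, but without a value yet.
incoming = [
    {"EMP_ID": 4444, "EMP_SURR_KEY": None},
    {"EMP_ID": 5555, "EMP_SURR_KEY": None},
]
keyed_rows = generate_keys(target_keys, incoming)
```

With the default increment of 1, the two incoming rows continue the sequence after the highest key already in the table.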
b) Table comparison
The Table Comparison transform compares two data sets and outputs the difference between them as a data set whose rows are flagged as INSERT, UPDATE, or DELETE. It can be used to ensure that rows are not duplicated in a target table, or to detect the changed records of a data warehouse dimension table. It detects and forwards either all changes or only the latest ones that have occurred since the comparison table was last updated. This transform is used frequently when implementing slowly changing dimensions and when designing dataflows for recovery.
There are three methods for accessing the comparison table: Row-by-row select, Cached comparison table, and Sorted input. The guidelines below indicate when to select each option.
- Row-by-row select is best when the target table is large compared to the number of rows the transform receives as input. In this case the transform issues a SQL query against the target table for every input row.
- Cached comparison table is best when we are comparing against the entire target table. DS uses pageable cache by default. If the table fits in the available memory, we can change the Cache type property of the dataflow to In-Memory.
- Sorted input is best when the input data is presorted on the primary key columns. DS reads the comparison table only once, sequentially, in the order of the primary key columns.
NOTE: The order of the input data set must exactly match the order of all primary key columns in the Table Comparison transform.
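To make the flagging logic concrete, here is a minimal Python sketch that loosely mirrors the cached comparison approach by holding the comparison table in a dictionary. It is not DS code: the function compare_tables, its parameters, and the sample data are hypothetical. Dropping unchanged rows and flagging comparison rows missing from the input as DELETE are assumptions of this sketch.

```python
def compare_tables(input_rows, comparison_rows, key_columns, compare_columns,
                   detect_deletes=True):
    """Return (flag, row) pairs describing how input_rows differ from comparison_rows."""
    def key_of(row):
        return tuple(row[c] for c in key_columns)

    # Cache the whole comparison table in memory, keyed by primary key.
    existing = {key_of(r): r for r in comparison_rows}
    flagged = []
    seen = set()
    for row in input_rows:
        k = key_of(row)
        seen.add(k)
        current = existing.get(k)
        if current is None:
            flagged.append(("INSERT", row))   # key not in comparison table
        elif any(row[c] != current[c] for c in compare_columns):
            flagged.append(("UPDATE", row))   # key exists, compared values differ
        # Rows that are identical produce no output in this sketch.
    if detect_deletes:
        for k, r in existing.items():
            if k not in seen:
                flagged.append(("DELETE", r))  # key missing from the input
    return flagged


comparison = [
    {"EMP_ID": 1111, "REGION": "R1"},
    {"EMP_ID": 2222, "REGION": "R2"},
    {"EMP_ID": 3333, "REGION": "R4"},
]
incoming = [
    {"EMP_ID": 1111, "REGION": "R1"},  # unchanged
    {"EMP_ID": 2222, "REGION": "R3"},  # region changed
    {"EMP_ID": 4444, "REGION": "R5"},  # new employee
]
changes = compare_tables(incoming, comparison, ["EMP_ID"], ["REGION"])
```

A row-by-row variant would replace the dictionary lookup with one query per input row, which is why it suits a large target table and a small input.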
c) History Preservation – converts ‘UPDATE’ to ‘INSERT’
History preservation adds two columns to the output: Effective from and Effective to.
As in the example above,
Employee Id – 2222 with Name – KV belongs to Region R2 in the interval 05.04.2011 – 04.03.2012 & Region R3 from 05.03.2012 till date.
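The employee 2222 case can be reproduced with a small Python sketch. It assumes the dates are in dd.mm.yyyy format, that the previous version is closed on the day before the change, and that an open-ended row ("till date") is represented by None. The function preserve_history and the column names EFFECTIVE_FROM and EFFECTIVE_TO are hypothetical stand-ins, not DS names.

```python
from datetime import date, timedelta


def preserve_history(current_row, updated_row, change_date, open_end=None):
    """Turn an UPDATE into a new version row while keeping the old version."""
    # Close the existing version on the day before the change takes effect.
    closed = dict(current_row)
    closed["EFFECTIVE_TO"] = change_date - timedelta(days=1)
    # The changed row becomes a new row (an INSERT) that is valid from the change date.
    new_version = dict(updated_row)
    new_version["EFFECTIVE_FROM"] = change_date
    new_version["EFFECTIVE_TO"] = open_end  # None stands for "till date"
    return closed, new_version


current = {
    "EMP_ID": 2222, "NAME": "KV", "REGION": "R2",
    "EFFECTIVE_FROM": date(2011, 4, 5), "EFFECTIVE_TO": None,
}
update = {"EMP_ID": 2222, "NAME": "KV", "REGION": "R3"}

old_version, new_version = preserve_history(current, update, date(2012, 3, 5))
```

Here old_version covers 05.04.2011 to 04.03.2012 for region R2, and new_version starts on 05.03.2012 for region R3 with no end date.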
d) Validation – filtering erroneous data (data cleansing)
The Validation transform filters or replaces rows of the source dataset according to validation rules in order to produce the desired output dataset. It lets us create validation rules on the input dataset and route each row to an output depending on whether it passed or failed the validation condition. The transform is typically used for NULL checks on mandatory fields, pattern matching, checking that a value exists in a reference table, validating data types, and so on.
The Validation transform can generate three output datasets: Pass, Fail, and Rule Violation. The Pass output schema is identical to the input schema. The Fail output schema has two additional columns, DI_ERRORACTION and DI_ERRORCOLUMNS. The Rule Violation output has three columns: DI_ROWID, DI_RULENAME, and DI_COLUMNNAME.
The validation rule is entered in the highlighted area, where the action on FAIL is also specified.
Example rule above – Zip code should be in format ‘99999’
The output of the PASS output schema is,
The output of FAIL output schema is,
These records do not match the zip code format ‘99999’.
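The zip code rule and the Pass/Fail split can be approximated in Python as follows. This sketch assumes that ‘99999’ means exactly five digits. The column name ZIP, the function validate_zip, and the sample rows are hypothetical, and the values written to DI_ERRORACTION and DI_ERRORCOLUMNS are illustrative placeholders rather than the values DS actually produces.

```python
import re

# Assumption: the pattern '99999' means exactly five ASCII digits.
ZIP_PATTERN = re.compile(r"[0-9]{5}")


def validate_zip(rows, column="ZIP"):
    """Split rows into pass and fail lists based on the zip code format rule."""
    passed, failed = [], []
    for row in rows:
        value = row.get(column)
        if isinstance(value, str) and ZIP_PATTERN.fullmatch(value):
            passed.append(dict(row))          # Pass schema equals the input schema
        else:
            bad = dict(row)
            bad["DI_ERRORACTION"] = "F"       # placeholder marker, not a confirmed DS value
            bad["DI_ERRORCOLUMNS"] = column   # which column broke the rule
            failed.append(bad)
    return passed, failed


customers = [
    {"CUST_ID": 1, "ZIP": "56001"},
    {"CUST_ID": 2, "ZIP": "5600"},     # too short
    {"CUST_ID": 3, "ZIP": "56O01"},    # letter O instead of zero
    {"CUST_ID": 4, "ZIP": None},       # missing value
]
pass_rows, fail_rows = validate_zip(customers)
```

Only the first customer reaches pass_rows. The other three land in fail_rows with the two extra columns attached.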