Table Comparison: Input Contains Duplicate Keys
I was trying to understand Table Comparison’s Input Contain Duplicate Keys option and after my experiments I found below results.
TABLE COMPARISON: – Input Contain Duplicate Keys
When records coming from source has multiple similar values for primary key then to process them/handle them we use Input Contain Duplicate Keys option of Table Comparison.
If column(s) specified in Input Contain Primary Key do not have unique key for every incoming row then we go for this option to handle duplicate keys.
How did table comparison process the records when this option is checked?
Initially, before job execution, both before image and after image are empty. When job is executed then data from comparison table is loaded in the before image of Table Comparison and from before image the correct record(s), as per generated OPCODES, Insert, Update or Delete, are sent to after image which is nothing but our target table. When job is executed this transform fires select statement based on columns present in Input Primary Key columns and all the records are brought to before image buffer.
select col1, col2, .. from target_table where input primary key columns in (xxxx,xxxx,.. ) -> If there are more than one columns specified in Input Primary Key columns section. But if there is only one column present in Input Primary Key columns then resulted sql is:
select col1, col2, .. from target_table where input primary key columns = ID;
Now, before Image contains the initial target data before job execution. When job is executed then every incoming source row is compared with the records present in before image of comparison table and according to the opcodes generated (I, U or D) the result is sent to after image i.e., to final target.
When Input Contain Duplicate Keys option is checked then it means transform knows that there is duplicate data coming from source. So of all the records present in after image if any column values, based on columns present in Input Primary Key columns, changes then Table comparison will generate ‘U’ opcode for all incoming rows having same ID ( ID means column present in Input Primary Key columns).
Lil bit confused right? Let’s dig deeper with an example.
Initial target: – Data present in target before job gets executed.
Now I’ve inserted two new records with EMP_ID = 1010, 1011 as highlighted.
Updated Source: –
Input Contain Duplicate Keys option is checked in Table Comparison: –
After job execution:-
Record with EMP_ID 1010 (EMPNAME RAJ) and 1011 (EMPNAME RAJ EY) is present in target before job is executed. Supposed two new records with same EMP_ID (1010, 1011) arrives from source. Also we’ve 1010 and 1011 already present in source too.
Existing records: 1010 RAJ…, 1011 RAJ EY….
New Records: 1010 RAJ WRITER,… 1011 RAJ MCKINSEY….
First record with EMP_ID 1010 (1010, RAJ WRITER,..) is compared with existing EMP_ID, 1010, in comparison table’s before image buffer (1010, RAJ,..). Table comparison finds that EMPNAME was initially RAJ and new incoming name for same column is RAJ WRITER so it sends an update ‘U’ opcode.
Next record with EMP_ID 1011 (1010, RAJ MCKINSEY,..) is compared with existing EMP_ID, 1011, in comparison table’s Before Image buffer (1011, RAJ EY,..). Table comparison finds that EMPNAME was initially RAJ EY and new incoming name for same column is RAJ MCKINSEY so it sends an update ‘U’ opcode.
Then comes 3rd row of source which has EMP_ID 1010 (1010, RAJ,..) and this row is compared with the existing record in target but it finds that no column value is changed. So ideally it should not send an Update but since Input Contain duplicate keys is checked so due to this the existing id will also be sent as ‘U’ opcode because for this EMP_ID earlier there was an update.
Similar is the case with 4th row which has EMP_ID 1010 (1010, RAJ EY,..).
Now you might be thinking why for other records it has not sent an update?
It will not sent an update as no column values for EMP_ID 1012, 1013, 1014… and so on have changed. Hence it no update for them.
Now comes the question out of all the updates which will be reflected in target?
Always the last record or the latest record will be reflected in target of all the Updates.
So Final Target: –
Now let’s suppose that I insert a new record with EMP_ID 1020: –
Target before job execution: –
So now TC will see that of all the id’s EMP_ID 1020 is not present in target. So it’ll fire an insert statement and will insert the records.
Hence after execution result is: –
Final Target: –
To understand how does records get inserted and how their processing happen, visit my blog
This option slower down the performance as additional memory is required to keep a track of duplicate records.
Hope it help!
Please let me know if I have missed anything or anything needs to added/delete.