Finally after a long time I’m back again with my observational stuffs :p.
So this time its SCD1. For those who are new please have a glimpse of what is SCD1.
if (existing target records have been changed at source)
load the changed records
elseif (new records have arrived in source but not present in target)
load new records
“NO ACTION” 😀 (Trying to be a coder)
SCD1: Updated the existing records if already present in target or insert the new records present in source.
Normally SCD1 can be implemented by either using Table Comparison or by setting Auto-Correct load to Yes but what if this same can be done using a delete script?
Consider the below requirement where I’ve two dataflows.
Stage_df contains the data which get loaded from source to the staging table.
script_del contains an SQL delete statement.
final_df contains final target table which gets filled from staging table.
Source Name: Alternate.txt.
Staging table Name: Alternate_rbd8.
Final Table Name, Target: Alternate_final_rbd8.
Before job execution data present is,
Target: Same as Alternate_rbd8.
Upon executing the job, data in staging table is:
and data in final target table is:
Delete script did nothing for this first load.
Now suppose if next day two new records are inserted then using only the staging table will not suffice and meet the requirements because:
- It’s Drop and Recreate for every load.
- It’ll always contain the latest data flooded from source.
So this entire data from staging will be dumped to final table that’ll store the changed data and newly inserted data.
But how to maintain SCD type 1?
Imagine how easy would it become if I select all the records which are present in Staging table and then compare these records with final target table. If there happens to be some matches then delete the data present in final table and load it with the staging table. Let me make it clearer!
Source data will be loaded to the Alternate_rbd8 table present in stage_df and from there the records will be moved to final table, Alternate_final_rbd8.
Script_del is the magic here!
This script is going to delete all the existing values from Alternate_final_rbd8 table and load the new records from source.
Code of script_del:
sql( 'ods_mkt','delete from alternate_final_rbd8 where row_id in (select row_id from alternate_rbd8 where alternate_final_rbd8.row_id = alternate_rbd8.row_id)'); print( 'Records Deleted'); sql( 'ods_mkt','commit'); print( 'Script completed');
Now let’s consider that some new records have been added (row_id = 18,19,20) and one existing (CODE value changed from 1000 to 9000 and OLD_OPT_CODE changed from H to R for row_id = 3) record has been modified into the source which aren’t present in the target.
Modified source data, Alternate_rbd8.txt:
Before execution data in final and staging table is,
After job execution data in staging and target table is,
Now this time I delete two records from source (Row_id = 1,2)
So the existing records in target, which have been modified at source level has been changed and this was SCD1.
Hence it worked! 🙂
And yes, its faster than Table Comparison and Auto Correct Load.
Now let me put some light on how does this sql query processed the records.
- All the common row_id’s, which were common to Alternate_rbd8 (final table) and Alternate_Final_rbd8 (staging table), were selected from Alternate_rbd8(final table).
- Post selection these records were deleted from Alternate_rbd8 table(final table).
- Finally new records/changed records were inserted in Alternate_rbd8 table(final table).
Hope this article was helpful! 🙂
Please let me know if I’ve missed anything or there is something incorrect with my understanding.