
ETL Framework – An Introduction

1. Objective

The primary objective of this ETL Framework is to address the non-functional requirements of any ETL project built with SAP Data Services. The framework provides a common, uniform and structured flow that can be used across ETL projects, irrespective of the business requirement.

2. Purpose of this document

This document explains the complete framework, covering:

  • Scope
  • Features
  • Key Assumptions
  • Pre-requisites
  • Core Design
  • BODS Job components
  • Guide for initial setup
  • Code of the framework and other required files

3. Scope

The scope of the framework is limited to the generic non-functional requirements that are common across ETL projects which use SAP Data Services as the ETL tool. The actual business data extraction or data loading process is not in the scope of this framework.

4. Features

This framework has the following features, which are explained in more detail in later sections:

  • Simple code – Easy to understand, analyze and debug
  • Reusable – The components of this framework can be reused with no hassle
  • Independent of business requirements – The framework covers only the non-functional requirements of ETL projects and can therefore be used in any ETL project
  • Saves time – Reduces the development effort and time required for the project, thereby increasing productivity
  • Platform independent – Not tied to any specific OS or database platform
  • Broad coverage – Addresses many common non-functional requirements such as job auditing, table auditing, job statistics and restartability

5. Assumptions

This framework has been designed with the below points as assumptions:

  • Source and target can be a file or a database table.
  • A job can have more than one source.
  • A job can have more than one target.
  • Framework will not handle any file movements like archiving. The developer can extend the framework to implement the same.
  • No process logs or error logs are created explicitly, as the BODS tool itself has such logs.
  • This framework does not have a readymade solution for data reconciliation; it has one sample to demonstrate the approach for implementing a data reconciliation feature.
  • There are a few instances where the developer must provide inputs or load a few tables beforehand. It is assumed that the developer will take care of these tasks with the help of this document.
  • This framework is built with SQL Server as the backend. When the backend is different, the developer should ensure that the SQL query syntax used in the framework works as expected. (Note: The SQL queries used in the framework follow the ANSI standard, hence not many changes are expected.)

6. Pre-requisites

This section explains the pre-requisites for the framework:

  • ETL Tool – The system should have SAP Data Services 4.1 (or above) installed (Assumption: the required repositories and job server are configured).
  • Database – The system should have Microsoft SQL Server 2012 (or above) installed. Note: Though the SQL scripts inside the framework follow the ANSI standard, if the database is different, minor changes will be required in the SQL scripts written inside the framework.
  • ATL Files – All the ATL files provided in the appendix section should be imported into the developer's repository (one for the job, one for the datastores and one for the substitution parameters).
  • SQL Files – All the SQL scripts provided in the appendix section should be executed to create the required tables.
  • Manually load the table JOB_VALIDITY with the list of valid and invalid jobs. For more details, please refer to section 7 (Framework Control Tables).
  • Manually load the table TARGET_DEPENDENCY with the list of target tables and their source tables. For more details, please refer to section 7 (Framework Control Tables). Sample INSERT statements for both tables are shown after this list.
  • Thorough understanding of the forthcoming sections
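
As a reference, the snippet below shows what this manual loading might look like, reusing the example names from section 7 (JOB_DIM_Customer, SALES_FACT and its dimensions). It is only a sketch: the job, table and configuration names, as well as the column data types, depend on your project and on the scripts provided in the appendix.

    -- Hypothetical example rows; replace the job, table and configuration names
    -- with the ones used in your project.
    INSERT INTO JOB_VALIDITY (JOB_NAME, CONFIG_NAME, VALID_FLAG)
    VALUES ('JOB_DIM_Customer', 'AMER', 1);      -- 1 = valid; any other value = invalid

    INSERT INTO TARGET_DEPENDENCY (SEQ_ID, TARGET_NAME, SOURCE_NAME, CONFIG_NAME)
    VALUES (1, 'SALES_FACT', 'CUSTOMER_DIM', 'AMER'),
           (2, 'SALES_FACT', 'TIME_DIM',     'AMER'),
           (3, 'SALES_FACT', 'REGION_DIM',   'AMER');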

7. Framework Control Tables

This framework has a few backend tables which support various functions like job auditing, table auditing, job statistics, etc. These tables are built on Microsoft SQL Server 2012. The section below explains each table in detail; an indicative DDL sketch for all six tables follows the descriptions.

  1. JOB_VALIDITY – This table will hold the validity code for each job. It is not populated by the framework; hence manual loading of the table is necessary. The columns of the table are:
    • JOB_NAME – This column will hold the name of the job. Ex: JOB_DIM_Customer
    • CONFIG_NAME – This column will hold the system configuration of the given job. Ex: A job can be executed for various geographical locations, hence a job can have system configurations like AMER, EMEA, APAC, etc.
    • VALID_FLAG – This column is the main field based on which the validity of the given job is determined. If this column holds the value 1, then the given job is valid for the specified system configuration; otherwise the job is not valid.
  2. JOB_AUDIT – This table will hold the information required for job auditing, like the job's load date, job status, load type, etc. This table will be loaded and maintained by the framework; hence no manual loading is required. The columns of the table are:
    • JOB_ID – This column will hold the job ID generated by the job server for each execution of the job. Ex: If one job has been executed 5 times, then this table will hold 5 records corresponding to 5 different executions with 5 unique job IDs
    • JOB_NAME – This column will hold the name of the job. Ex: JOB_DIM_Customer
    • LOAD_TYPE – This column will hold the load type information. The possible entries for this column are INITIAL and DELTA (case sensitive). Ex: When a job is running for the first time or doing a fresh load, the load type will be INITIAL; when a job is doing an incremental load, the load type will be DELTA.
    • LOAD_DATE – This column will hold the job's load date and time.
    • CONFIG_NAME – This column will hold the system configuration of the given job. Ex: A job can be executed for various geographical locations, hence a job can have system configurations like AMER, EMEA, APAC, etc.
    • JOB_STATUS – This column will hold the execution status of the given job. The possible entries are STARTED, COMPLETED and FAILED (case sensitive). Ex: When a job starts, the status will be marked as STARTED; after successful execution, the status will be updated to COMPLETED. If the execution of the job fails, then the status will be updated to FAILED.
  3. JOB_STATISTICS – This table will hold the statistics of the job, like the row counts of the data flow. This table has been designed specifically for data reconciliation. The piece of code which manages this table is kept optional in the framework; hence, if data reconciliation is required, the developer can use that code and this table will be used in the backend. No manual loading of the table is required. The columns of the table are:
    • JOB_ID – This column will hold the job ID generated by the job server for each execution of the job. Ex: If one job has been executed 5 times, then this table will hold 5 records corresponding to 5 different executions with 5 unique job IDs
    • JOB_NAME – This column will hold the name of the job. Ex: JOB_DIM_Customer
    • DATAFLOW_NAME – This column will hold the name of the dataflow for which the statistics are captured
    • SOURCE_ROW_COUNT – This column will hold the number of rows in the source table
    • PASSED_ROW_COUNT – This column will hold the number of rows which were loaded successfully to the target table after the validation rules or business rules
    • FAILED_ROW_COUNT – This column will hold the number of rows which did not pass the given validation rules or business rules

(Please note that the errors or the reasons for failure are not captured in this table; they are maintained in separate tables for clarity.)

  4. TARGET_STATUS – This table will hold the audit information of the tables, like the load status of the table, load timestamp, etc. This table will be loaded and managed by the framework; hence no manual loading is required. The columns of the table are:
    • JOB_ID – This column will hold the job ID generated by the job server for each execution of the job. Ex: If one job has been executed 5 times, then this table will hold 5 records corresponding to 5 different executions with 5 unique job IDs
    • TARGET_NAME – This column will hold the name of the table for which the audit information is being captured
    • CONFIG_NAME – This column will hold the system configuration for which the table is getting loaded. Ex: A job can be executed for various geographical locations, hence a job can have system configurations like AMER, EMEA, APAC, etc.
    • START_DATE – This column will hold the start date and time of the load for the specific table. Ex: If the job is running in delta mode, then the start date and time of that particular load will be captured and loaded into this column
    • END_DATE – This column will hold the end date and time of the load for the specific table. Ex: If the job is running in delta mode, then the end date and time of that particular load will be captured and loaded into this column
    • RUN_ID – This column will hold the run ID of the specific run. The RUN ID is a unique number determined based on the job's load timestamp. This is useful while restarting the job in case of failures: based on this ID, the failed loads can be re-executed
    • RUN_STATUS – This column will hold the load status of the given table. The possible entries are STARTED, COMPLETED and FAILED (case sensitive). Ex: When a load starts for a particular table, the status will be marked as STARTED; after successful execution, the status will be updated to COMPLETED. If the execution of the job fails, then the status will be updated to FAILED
  5. CONTROL_TABLE – This table is used to perform chronological data loading to the target table. For instance, if the source-to-stage load process is successful for 3 days but the load process from stage to target fails for the same 3 days, then this table will hold the name of the table for which the load failed and the RUN_IDs of the failed days (3 days => 3 run IDs) in chronological order. With this information, data consistency can be ensured in the target table. This table will be loaded and managed by the framework; hence no manual loading is required. The columns of the table are:
    • TARGET_NAME – This column will hold the name of the table for which the load failed.
    • CONFIG_NAME – This column will hold the system configuration for which the table is getting loaded. Ex: A job can be executed for various geographical locations, hence a job can have system configurations like AMER, EMEA, APAC, etc.
    • ID – This column acts as a row ID within this table. It will hold numbers like 1, 2, 3, ... for each table; the number restarts from 1 for every table. Based on this ID, the code will fetch the required RUN_ID.
    • RUN_ID – This column will hold the run IDs of the failed days in chronological order. Based on this ID, data will be loaded into the target table in a sequential fashion to ensure data consistency.
  6. TARGET_DEPENDENCY – This table will hold information about the dependencies between tables. Ex: One target table might be dependent on many dimension tables and also on other target tables. This information is loaded manually into the table; the framework does not load this table. The framework only reads this table to check dependency information and ensure data consistency. The columns of the table are:
    • SEQ_ID – This column will hold numbers like 1, 2, 3, ... for each table; the number restarts from 1 for each target table.
    • TARGET_NAME – This column will hold the name of the target table for which the dependency information is being captured.
    • SOURCE_NAME – This column will hold the name of the source tables on which the given target table is dependent. Ex: If the SALES_FACT table is dependent on CUSTOMER_DIM, TIME_DIM and REGION_DIM, then there will be 3 records in this table, each holding one source table name.
    • CONFIG_NAME – This column will hold the system configuration for which the table is getting loaded. Ex: A job can be executed for various geographical locations, hence a job can have system configurations like AMER, EMEA, APAC, etc.
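
For quick reference, a minimal DDL sketch of the six control tables is given below. The column names follow the descriptions above; the data types and lengths are assumptions, and the authoritative definitions are the ones in Backend_Tables_Query.sql provided in the appendix.

    -- Indicative DDL only; data types and lengths are assumptions.
    CREATE TABLE JOB_VALIDITY (
        JOB_NAME     VARCHAR(128),
        CONFIG_NAME  VARCHAR(64),
        VALID_FLAG   INT            -- 1 = valid, any other value = invalid
    );

    CREATE TABLE JOB_AUDIT (
        JOB_ID       INT,           -- job server execution ID
        JOB_NAME     VARCHAR(128),
        LOAD_TYPE    VARCHAR(10),   -- 'INITIAL' or 'DELTA' (case sensitive)
        LOAD_DATE    DATETIME,
        CONFIG_NAME  VARCHAR(64),
        JOB_STATUS   VARCHAR(10)    -- 'STARTED', 'COMPLETED' or 'FAILED'
    );

    CREATE TABLE JOB_STATISTICS (
        JOB_ID            INT,
        JOB_NAME          VARCHAR(128),
        DATAFLOW_NAME     VARCHAR(128),
        SOURCE_ROW_COUNT  INT,
        PASSED_ROW_COUNT  INT,
        FAILED_ROW_COUNT  INT
    );

    CREATE TABLE TARGET_STATUS (
        JOB_ID       INT,
        TARGET_NAME  VARCHAR(128),
        CONFIG_NAME  VARCHAR(64),
        START_DATE   DATETIME,
        END_DATE     DATETIME,
        RUN_ID       BIGINT,        -- derived from the load timestamp
        RUN_STATUS   VARCHAR(10)    -- 'STARTED', 'COMPLETED' or 'FAILED'
    );

    CREATE TABLE CONTROL_TABLE (
        TARGET_NAME  VARCHAR(128),
        CONFIG_NAME  VARCHAR(64),
        ID           INT,           -- restarts from 1 for every table
        RUN_ID       BIGINT         -- pending run IDs in chronological order
    );

    CREATE TABLE TARGET_DEPENDENCY (
        SEQ_ID       INT,           -- restarts from 1 for every target table
        TARGET_NAME  VARCHAR(128),
        SOURCE_NAME  VARCHAR(128),
        CONFIG_NAME  VARCHAR(64)
    );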

8. Framework Guide

This section explains how to implement the framework. Follow the steps below for a successful implementation:

  • Copy the below ATL files to the required machine or system (Note: The ATL files are attached in the appendix section of this document)
    • ETL_Framework_Substitution_parameter.atl – This file contains the required substitution parameters for the repository
    • ETL_Framework_Datastore.atl – This file contains the required datastores and backend tables for the framework
    • ETL_Framework_Job.atl – This file contains the framework job
  • Login to the SAP Data Services Designer

 

  • Import all the ATL files. Select TOOLS -> IMPORT FROM FILE.

 

  • Browse to the required file and click OPEN

  • Enter the passphrase as ds and click on IMPORT.

  • To create the required backend tables, go to Microsoft SQL SERVER, and execute the queries given in the file Backend_Tables_Query.sql. If this file format is not supported, use the notepad file Backend_tables_Script.txt (Note: All the required files are provided in the Appendix section)
  • In SAP Data Services Designer, edit the datastore properties with the proper login credentials
  • Replicate the template job and rename it based on the business requirement
  • Replicate the workflows WF_JOB, WF_STAGE and WF_TARGET and rename as per the requirement
  • In the workflow WF_STAGE, replace the dataflow DF_STAGE with the dataflow that will extract the source system data to the stage table
  • In the workflow WF_TARGET, inside the loop WHILE_LOAD, replace the dataflow DF_TARGET with the actual dataflow that will load the data from the stage table to the target table
  • In the workflow WF_STAGE, in PRE_STAGE, modify the stage table name, the target table name and the datastore name of the stage table (Note: Comments are provided in the framework to identify the line which requires changes. Also sample code has been provided for reference).

 

  • In the workflow WF_TARGET, in PRE_TARGET, modify the stage table name and the target table name (Note: Comments are provided in the framework to identify the line which requires changes. Also sample code has been provided for reference).

 

  • Using SQL Server (or the relevant database), populate the tables TARGET_DEPENDENCY and JOB_VALIDITY with the required information (refer to section 7, Framework Control Tables, for more information)
  • Create the required system configurations

  • Initialize the substitution parameters for each system configuration. The substitution parameters are listed below:

    • $$S_Archival_File_Path – Holds the archival file path details
    • $$S_Local_File_Path – Holds the local file path details
    • $$S_Remote_File_Path – Holds the remote file path details
    • $$S_Host_Name – Holds the host name details
    • $$S_User_Name – Holds the FTP user name details
    • $$S_Password – Holds the FTP password

  • The framework has a feature for sending automated mails about job failures, but the relevant code lines are commented out. To receive notifications, please uncomment the code and provide the recipient list. If required, modify the body of the mail.
  • Schedule the created jobs by setting the variables $G_Load_Type (value should be either INITIAL or DELTA) and $G_Default_Date (the start date for the load process when the load type is INITIAL). This step is optional; if these variables are not set, default values will be initialized by the job.
  • If the load type is delta, then the load dates will be set automatically by the job.

9. Framework Components

This section explains each component of the framework in detail and covers the complete logic used in the framework. For better understanding, it has been divided into three sub-sections – Job Level Components, Stage Level Components & Target Level Components.

9.1. Job Level Components

This framework has one job which will be replicated by the developer. It acts as the template, or platform, using which the developer builds the required BODS jobs. The name of the template job is JOB_ETL_FRAMEWORK, provided in the file ETL_Framework_Job.atl. The datastores required for this job have been provided in the file ETL_Framework_Datastore.atl and the required substitution parameters are in the file ETL_Framework_Substitution_parameter.atl.

The job has three parts embedded between the TRY and CATCH blocks – PRE_SCRIPT, WF_JOB and POST_SCRIPT. The picture below provides an overview of the various job level components:

Each object depicted in the picture has been explained in the below section.

9.1.1. TRY

TRY block has been included in the ETL framework to enable the developer to handle the different exceptions with ease.

9.1.2. PRE_SCRIPT

This script does all the required pre-processing for the job’s execution. This script initializes the below global variables:

  • $G_Job_Name – This will hold the name of the job (No hardcoding required, Built-in BODS functions are used)
  • $G_Job_ID – This will hold the ID generated by the job server during the execution of the job
  • $G_Config_Name – This will hold the system configuration for which the current instance of the job is running
  • $G_Load_Date – This will hold the system date and time of the job’s start
  • $G_Default_Date – This will hold the default start date for the load process. This can be set while scheduling the job. If not set during scheduling, then it will be defaulted to 1/1/1900
  • $G_Current_RunID – This will hold the RUN ID for the current execution of the job. This has been determined based on the timestamp.
  • $G_Load_Type – This will hold the load type of the job. Ex: If the job should do a fresh load of data, then this variable will hold the value INITIAL; if the job will do an incremental load, it will hold the value DELTA. This can be set while scheduling the job. If not set during scheduling, it will default to DELTA

This script also performs the following checks:

  • Job's validity – This script will check the validity code of the given job from the backend table JOB_VALIDITY and, based on that, the job's execution will be controlled. This ensures data correctness and integrity. Ex: If a job has been decommissioned and marked as invalid in the table JOB_VALIDITY, then the PRE_SCRIPT will ensure that this job is never executed. An invalid job will be aborted and a mail will be sent to the required mail ID. If a job's validity code is marked as 1, then the PRE_SCRIPT will allow the job to execute.
  • Job’s previous status – This script will check the status of the previous instance of the same job from the backend table JOB_AUDIT. If the status is COMPLETED or FAILED, then the script will allow the job to execute. If the status is STARTED, then the script will abort the current instance to ensure the data consistency. Also a mail will be sent to the required mail IDs with brief information.

(Note: As the system on which this framework was built did not have SMTP configured, the code which will send mail has been commented. Hence please remove the comment and use the code.)

When the above tests are successful, the script will insert a record in the table JOB_AUDIT with the status marked as STARTED.
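
In terms of the control tables, these checks amount to lookups like the following. This is a simplified sketch: the literal job name, configuration and job ID stand in for the global variables $G_Job_Name, $G_Config_Name and $G_Job_ID, which the script uses at runtime.

    -- Is the job valid for this system configuration? (1 = valid)
    SELECT VALID_FLAG
    FROM   JOB_VALIDITY
    WHERE  JOB_NAME    = 'JOB_DIM_Customer'
    AND    CONFIG_NAME = 'AMER';

    -- Status of the most recent previous instance of the same job.
    -- COMPLETED / FAILED -> allow execution; STARTED -> abort the current instance.
    SELECT JOB_STATUS
    FROM   JOB_AUDIT
    WHERE  JOB_NAME    = 'JOB_DIM_Customer'
    AND    CONFIG_NAME = 'AMER'
    AND    LOAD_DATE   = (SELECT MAX(LOAD_DATE)
                          FROM   JOB_AUDIT
                          WHERE  JOB_NAME    = 'JOB_DIM_Customer'
                          AND    CONFIG_NAME = 'AMER');

    -- When both checks pass, register the new execution as STARTED.
    INSERT INTO JOB_AUDIT (JOB_ID, JOB_NAME, LOAD_TYPE, LOAD_DATE, CONFIG_NAME, JOB_STATUS)
    VALUES (1234, 'JOB_DIM_Customer', 'DELTA', CURRENT_TIMESTAMP, 'AMER', 'STARTED');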

9.1.3. WF_JOB

This workflow has two sections – STAGE LEVEL COMPONENTS (WF_STAGE) and TARGET LEVEL COMPONENTS (WF_TARGET). This has been explained in forthcoming sections (Section 9.2 and 9.3).

9.1.4. CATCH

CATCH block will allow the developer to handle various exceptions. If any of the objects fail in the job, then the control will be passed on to the CATCH block, where the below actions will be performed:

  • If any of the tables in the job has not been loaded successfully, then the load status of that table will be updated as FAILED in the table TARGET_STATUS.
  • The execution status of the job will be updated as FAILED in the table JOB_AUDIT.

Apart from the above actions, the developer can extend the code to perform other required actions.
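
In SQL terms, the two actions are roughly equivalent to the updates below. This is a sketch: 1234 stands in for $G_Job_ID of the failed execution, and restricting the first update to rows still in STARTED state is an interpretation of the behaviour described above.

    -- Mark every table of this execution that did not finish loading as FAILED.
    UPDATE TARGET_STATUS
    SET    RUN_STATUS = 'FAILED'
    WHERE  JOB_ID     = 1234
    AND    RUN_STATUS = 'STARTED';

    -- Mark the job execution itself as FAILED.
    UPDATE JOB_AUDIT
    SET    JOB_STATUS = 'FAILED'
    WHERE  JOB_ID     = 1234;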

9.1.5. POST_SCRIPT

The execution control flows to POST_SCRIPT when there is no exception in the job, and the job's status is then updated as COMPLETED in the table JOB_AUDIT.

9.2. Stage Level Components

Stage level components have been packaged under the workflow WF_STAGE, which is part of WF_JOB. Stage level components handle the data extraction from the source system to the stage tables.

The below picture shows the contents of WF_JOB and the position of WF_STAGE:

The below picture provides an overview of the workflow WF_STAGE:

Each object depicted in the picture has been explained in the below section.

9.2.1. PRE_STAGE

This script handles the required pre-processing for the stage level components. The two actions performed by this script are –

  • Ensure Data Integrity – This script ensures that the source system is not hit again when the load process to the target table fails, without affecting data integrity. Ex: If the data extraction from source to stage is successful but the load from the stage table to the target table fails, the job will not re-extract the data from the source system. Instead, the data already extracted is retained in the stage table until the target table is completely up to date. This is implemented using the RUN_ID maintained in the table TARGET_STATUS. If the latest RUN_ID of the stage table and the target table are the same, then both the extraction from source to stage and the load from stage to target were successful; hence no data needs to be retained and the stage table will be truncated. Otherwise, the data is retained in the stage table (it will not be truncated) until the load to the target table completes successfully.
  • Determine Delta Dates – This script determines the start date and the end date for the delta load based on the previous execution's status. Ex: If the previous load status is COMPLETED, then the end date of the previous instance is taken as the start date for the current load. If the previous load status is FAILED, then the end date of the last successful load is taken as the start date. If there is no successful load at all for a given table, then the first recorded start date is taken as the start date of the current load. If this is the first load of the table, then the default date set by the user is taken as the start date. The table below summarizes all the scenarios:

  Scenario 1 – Last load successful:
    • Table Sample: Start Date 23-Jun-2014, End Date 24-Jun-2014, Status Completed
    • Table Sample: Start Date 24-Jun-2014, End Date 25-Jun-2014, Status Started

  Scenario 2 – Last load failed:
    • Table Sample: Start Date 23-Jun-2014, End Date 24-Jun-2014, Status Failed
    • Table Sample: Start Date 23-Jun-2014, End Date 25-Jun-2014, Status Started

  Scenario 3 – Last load failed, but there is a successful run in the history:
    • Table Sample: Start Date 22-Jun-2014, End Date 23-Jun-2014, Status Completed
    • Table Sample: Start Date 23-Jun-2014, End Date 24-Jun-2014, Status Failed
    • Table Sample: Start Date 23-Jun-2014, End Date 25-Jun-2014, Status Started

  Scenario 4 – No previous run:
    • Table Sample: Start Date 1-Jan-1900, End Date 24-Jun-2014, Status Started

After the above process, this script will insert a record in the table TARGET_STATUS registering the status of the given stage table as STARTED.
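
Against the control tables, these steps reduce to lookups like the following. This is a simplified sketch: STG_CUSTOMER and DIM_CUSTOMER are hypothetical stage and target table names, the literal values stand in for the job's global variables, and the actual script may combine these steps differently.

    -- 1. Truncation decision: if the latest RUN_ID of the stage table equals the
    --    latest RUN_ID of the target table, the previous stage-to-target load
    --    completed and the stage table can be truncated.
    SELECT MAX(RUN_ID) FROM TARGET_STATUS
    WHERE  TARGET_NAME = 'STG_CUSTOMER' AND CONFIG_NAME = 'AMER';

    SELECT MAX(RUN_ID) FROM TARGET_STATUS
    WHERE  TARGET_NAME = 'DIM_CUSTOMER' AND CONFIG_NAME = 'AMER';

    -- 2. Delta start date: end date of the last successful load of the stage table;
    --    if no successful load exists, $G_Default_Date (1-Jan-1900 by default) is used.
    SELECT MAX(END_DATE)
    FROM   TARGET_STATUS
    WHERE  TARGET_NAME = 'STG_CUSTOMER'
    AND    CONFIG_NAME = 'AMER'
    AND    RUN_STATUS  = 'COMPLETED';

    -- 3. Register the start of the current stage load (hypothetical values).
    INSERT INTO TARGET_STATUS (JOB_ID, TARGET_NAME, CONFIG_NAME, START_DATE, END_DATE, RUN_ID, RUN_STATUS)
    VALUES (1234, 'STG_CUSTOMER', 'AMER', '2014-06-24', '2014-06-25', 20140625093000, 'STARTED');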

9.2.2. DF_STAGE

DF_STAGE is a sample data flow provided in the framework to explain the data reconciliation approach. It should be replaced by the developer with the actual data flow that extracts the data from the source system to the stage table based on the business requirement.
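
Conceptually, the reconciliation sample writes one record per data flow into JOB_STATISTICS and compares the counts. The snippet below illustrates the idea with hypothetical values; in the framework the counts are captured inside the data flow itself.

    -- Hypothetical statistics record for one execution of DF_STAGE.
    INSERT INTO JOB_STATISTICS
        (JOB_ID, JOB_NAME, DATAFLOW_NAME, SOURCE_ROW_COUNT, PASSED_ROW_COUNT, FAILED_ROW_COUNT)
    VALUES
        (1234, 'JOB_DIM_Customer', 'DF_STAGE', 1000, 990, 10);

    -- Reconciliation check: every source row should be either passed or failed.
    SELECT JOB_ID, DATAFLOW_NAME,
           SOURCE_ROW_COUNT - (PASSED_ROW_COUNT + FAILED_ROW_COUNT) AS unaccounted_rows
    FROM   JOB_STATISTICS;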

9.2.3. POST_STAGE

This script will update the load status of the stage table as COMPLETED after the successful execution of all the required objects.

9.3. Target Level Components

Target level components have been packaged under the workflow WF_TARGET, which is part of WF_JOB. Target level components handle the data load from the stage tables to the target table.

The below picture shows the contents of WF_JOB and the position of WF_TARGET:

The below picture provides an overview of the workflow WF_TARGET:

Each object depicted in the picture has been explained in the below section.

9.3.1. PRE_TARGET

This script does the pre-processing required for the data load from the stage table to the target table. Below are the actions performed by this script:

  • TARGET DEPENDENCY – This script will check the list of source names given in the table TARGET_DEPENDENCY and ensure that every table is ready. If any of the tables is not ready, then the job's execution will be aborted. This ensures data consistency.
  • JOB RESTART – If the load process fails and the job is restarted, the job will perform a chronological load of the data to the target table. This is implemented based on the last successful RUN ID of the given target table. The failed RUN IDs are captured in CONTROL_TABLE, which is populated by the data flow DF_CONTROL_TABLE (explained below). This script cleans the entries in CONTROL_TABLE so that DF_CONTROL_TABLE can load the required RUN IDs. A SQL sketch of both checks follows this list.
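
The sketch below uses the example names from section 7 (SALES_FACT, AMER). The readiness rule shown (every source listed in TARGET_DEPENDENCY has at least one COMPLETED entry in TARGET_STATUS) is an assumption for illustration; the framework's own rule may be stricter.

    -- 1. Dependency check: list source tables that are not ready yet.
    SELECT d.SOURCE_NAME
    FROM   TARGET_DEPENDENCY d
    WHERE  d.TARGET_NAME = 'SALES_FACT'
    AND    d.CONFIG_NAME = 'AMER'
    AND    NOT EXISTS (SELECT 1
                       FROM   TARGET_STATUS s
                       WHERE  s.TARGET_NAME = d.SOURCE_NAME
                       AND    s.CONFIG_NAME = d.CONFIG_NAME
                       AND    s.RUN_STATUS  = 'COMPLETED');
    -- If this query returns any rows, the job is aborted.

    -- 2. Restart preparation: clear old entries so DF_CONTROL_TABLE can reload
    --    the pending RUN_IDs for this target.
    DELETE FROM CONTROL_TABLE
    WHERE  TARGET_NAME = 'SALES_FACT'
    AND    CONFIG_NAME = 'AMER';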

9.3.2. DF_CONTROL_TABLE

This dataflow will use TARGET_STATUS table as source and will load CONTROL_TABLE. This will ensure that all the failed or missed RUN IDs of the given target table are captured in chronological order in CONTROL_TABLE. Based on this table, the succeeding objects will perform the load process to the target table.
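
One possible SQL rendering of what this data flow produces is shown below. It treats every completed stage RUN_ID newer than the target's last completed RUN_ID as pending and numbers them chronologically via ROW_NUMBER(). STG_SALES is a hypothetical stage table name, and the exact mapping inside the data flow may differ.

    INSERT INTO CONTROL_TABLE (TARGET_NAME, CONFIG_NAME, ID, RUN_ID)
    SELECT 'SALES_FACT',
           s.CONFIG_NAME,
           ROW_NUMBER() OVER (ORDER BY s.RUN_ID) AS ID,   -- chronological numbering from 1
           s.RUN_ID
    FROM   TARGET_STATUS s
    WHERE  s.TARGET_NAME = 'STG_SALES'
    AND    s.CONFIG_NAME = 'AMER'
    AND    s.RUN_STATUS  = 'COMPLETED'
    AND    s.RUN_ID > (SELECT COALESCE(MAX(t.RUN_ID), 0)
                       FROM   TARGET_STATUS t
                       WHERE  t.TARGET_NAME = 'SALES_FACT'
                       AND    t.CONFIG_NAME = 'AMER'
                       AND    t.RUN_STATUS  = 'COMPLETED');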

9.3.3. SET_COUNTER

As the load process to the target table should happen in chronological order, a WHILE loop is required to handle the scenario, and a few variables are required to control it. This script ensures that the required variables are set with the proper values. The variables are:

  • $L_Loop_Counter – This variable is initialized to 1.
  • $L_Control_Counter – This variable will hold a value equal to the number of entries in CONTROL_TABLE, and hence acts as the upper limit for the WHILE loop. Ex: If the data load was not successful for 4 instances, then CONTROL_TABLE will have 4 different RUN IDs and this variable will hold the value 4, which ensures that the WHILE loop executes 4 times. A sample count query is shown below.
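
A minimal sketch of the counter initialization, assuming the same example target table and configuration:

    -- $L_Loop_Counter is set to 1; $L_Control_Counter is set to the number of
    -- pending RUN_IDs, i.e. the number of WHILE loop iterations.
    SELECT COUNT(*)
    FROM   CONTROL_TABLE
    WHERE  TARGET_NAME = 'SALES_FACT'
    AND    CONFIG_NAME = 'AMER';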

9.3.4. WHILE_LOAD

This WHILE_LOOP ensures data consistency by handling the load process in sequential and chronological order. As explained under the section 9.3.3, the variables $L_Loop_Counter and $L_Control_Counter will control the execution of the WHILE LOOP.

9.3.4.1. CHECK_COUNTER

This script fetches the required RUN ID for the given execution and also increments the variable $L_Loop_Counter by 1, so that the WHILE LOOP can progress. This script will insert a record in the table TARGET_STATUS registering the status of the given target table as STARTED with the required RUN ID.
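
In SQL terms, each iteration performs lookups like the following (a sketch; the literal ID value stands in for $L_Loop_Counter, 1234 for $G_Job_ID, and the RUN_ID inserted would be the one just fetched):

    -- Fetch the RUN_ID to be processed in this iteration.
    SELECT RUN_ID
    FROM   CONTROL_TABLE
    WHERE  TARGET_NAME = 'SALES_FACT'
    AND    CONFIG_NAME = 'AMER'
    AND    ID = 1;                         -- $L_Loop_Counter

    -- Register the start of this iteration's load to the target table.
    INSERT INTO TARGET_STATUS (JOB_ID, TARGET_NAME, CONFIG_NAME, START_DATE, RUN_ID, RUN_STATUS)
    VALUES (1234, 'SALES_FACT', 'AMER', CURRENT_TIMESTAMP, 20140624090000, 'STARTED');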

9.3.4.2. DF_TARGET

This is a sample data flow which depicts the approach for loading data from the stage table to the target table. It should be replaced by the actual data flow which loads the data from the stage table to the target table based on the business requirements. It is not mandatory for the target to be a data warehouse table; it could be an operational data store table or any other ETL target table.

9.3.4.3. POST_TARGET

This script will update the status of the target table as COMPLETED in the table TARGET_STATUS.

10. Appendix

This section provides all the required files.

  • ATL file for substitution parameters
  • ATL file for Datastores
  • ATL file for Framework job
  • SQL file to create the required tables
  • Same SQL file in notepad format

11. Conclusion

This framework provides a platform in which the most common non-functional requirements are addressed. In real-world projects there may be many more such scenarios, hence the framework has been designed so that the developer can extend the code at any point of time. Any part of the code can be disabled by commenting it out, and code can be added as per the user's or business requirement. This framework provides a scalable solution that minimizes the effort required by the developer and maximizes the productivity of the project.
