Skip to Content

I have used following optimization techniques while working in project.

I am sharing my problems and solution which will help you to understand the issues and optimization techniques used for same,

1) Table Comparison options:

The performance of a ‘Table_Comparison’ transform can be improved by caching the comparison table. There are three modes of comparisons:

  • Row-by-row select
  • Cached comparison table
  • Sorted input


Basically we use table comparison for target based Change data capture (CDC).

It is used to capture a changed data that is present in the source but not in target and/or changes in source but not in target.

There many methods available for source and target bases CDC.

Follow the link below (for Oracle DB) for the same: https://docs.oracle.com/cd/B28359_01/server.111/b28313/cdc.htm

tb com.jpg

1.1 Row-by-row Select

We can choose this option in following case

1.1.1Option 1 – Normal operations

Select this option to have the transform look up the target table using SQL every time it receives an input row. This option is best if the target table is large compared to the number of rows the transform will receive as input. Make sure the appropriate indexes exist on the lookup columns in the target table for optimal performance.

1.1.2Option 2 – When we need to consider a trailing blank while comparison

While comparison if in source field value is with trailing blanks then BODS will treat it as different value i.e. BODS will not apply any trim function for that Particular field.

Example:

Consider an account number ‘1234 ‘ (I.e with trailing space) in source and ‘1234’ in target. In such case, BODS considers both the account numbers as separate in case of row by row but in case of cache comparison it will consider both account numbers to be the same. The latter is explained below.

1.2Cached Comparison:

Select this option to load the comparison table into memory. In this case, queries to the comparison table access memory rather than the actual table. However, the table must fit in the available memory. This option is best when the table fits into memory and you are comparing the entire target table.

With the help of this option BODS will capture target data into internal Cache which is again very faster to access.

1.3Sorted input:

Often the most efficient solution when dealing with large data sources, because DS reads the comparison table only once. This option can only be selected when it is guaranteed that the incoming data are sorted in exactly the same order as the primary key in the comparison table. In most cases incoming data must be pre-sorted, e.g. using a Query transform with an Order-by (that may be pushed down to the underlying database), to take advantage of this functionality.

2) Auto Correct Load option :

Auto-correct load is used to avoid loading of duplicate data in target table using SAP BODS.

Basically, Auto-Correct load is used to implement SCD-1 when we are not using table comparison feature of SAP BODS. There are many options available if you are using Auto -Correct load which can help to optimize the performance.

If you are not choosing any Auto -correct load option then BODS will generate simple insert query.

Snapshot and query generated by BODS is as shown below:AC 1.jpg


Query :(BODS will generate Insert query,so it may insert duplicate data.)

INSERT /*+ APPEND */ INTO “DS_1″.”TEST_AUTO_CORRECT_LOAD” ( “ACCOUNT_NO” , “STAT” )

SELECT “TABLE1″.”ACCOUNT_NO” ACCOUNT_NO , “TABLE1″.”STAT” STAT

FROM “DS_2″.”TABLE1” “TABLE1“;

Many options are available with BODS Auto-correct load. Some of them are explained in below section:

2.1Allow merge set to Yes & Ignore Columns with null set to No


While going for auto-correct load, if you select ‘Allow merge option to yes & Ignore columns with null to No’, BODS will generate Merge Query to maintain SCD 1, ignoring null values if any.

So by choosing ‘Allow merge option to Yes’, BODS will insert the new rows coming from source to target and update the existing rows from source into target.


Snapshot for same is as below:

AC 2.jpg

Query generated by BODS job is as follows:

MERGE INTO “DS_1″.”TEST_AUTO_CORRECT_LOAD” s

USING

(SELECT  “TABLE_1″.”ACCOUNT_NO”  ACCOUNT_NO ,  “TABLE_1″.”STAT”  STAT

FROM “PSEUDO_PROD”.”TABLE_1″ “TABLE_1”

) n

ON ((s.ACCOUNT_NO = n.ACCOUNT_NO))

WHEN MATCHED THEN

UPDATE SET s.”STAT” = n.STAT

WHEN NOT MATCHED THEN

INSERT  /*+ APPEND */ (s.”ACCOUNT_NO”, s.”STAT” )

VALUES (n.ACCOUNT_NO , n.STAT)

Here, the query generated by BODS is Merge query which will insert the new rows coming from source and update the existing rows present in target.

This query will be pushed down to database hence it will be an optimized one.

2.2Allow merge set to Yes & Ignore Columns with null set to Yes

Query generated by BODS job is as follows:

MERGE INTO “DS_1″.”TEST_AUTO_CORRECT_LOAD” s

USING

(SELECT  “TABLE_1″.”ACCOUNT_NO”  ACCOUNT_NO ,  “TABLE_1″.”STAT”  STAT

FROM “PSEUDO_PROD”.”TABLE_1″ “TABLE_1”

) n

ON ((s.ACCOUNT_NO = n.ACCOUNT_NO))

WHEN MATCHED THEN

UPDATE SET s.”STAT” = NVL(n.STAT,S.”STAT”)

WHEN NOT MATCHED THEN

INSERT  /*+ APPEND */ (s.”ACCOUNT_NO”, s.”STAT” )

VALUES (n.ACCOUNT_NO , n.STAT).

Snap Shot for same as follows:

ac 3.jpg

As seen in the snapshot above, BODS adds NVL function to consider null values if any.


2.3Allow Merge set to No while Auto correct load set to Yes


If you select ‘Allow Merge to No and Auto correct load to Yes’ then BODS will generate PL/SQL code which will again be very helpful when considering performance.

Below is the code generated by BODS:


BEGIN

DECLARE

CURSOR s_cursor IS

SELECT  “TABLE_1″.”ACCOUNT_NO”  ACCOUNT_NO ,  “TABLE_1″.”STAT”  STAT

FROM “PSEUDO_PROD”.”TABLE_1″ “TABLE_1”

;

s_row   s_cursor%ROWTYPE;

CURSOR t_cursor(p_ACCOUNT_NO s_row.ACCOUNT_NO%TYPE) IS

SELECT “ACCOUNT_NO” ACCOUNT_NO, “STAT” STAT, rowid

FROM “DS_1″.”TEST_AUTO_CORRECT_LOAD”

WHERE (p_ACCOUNT_NO = “ACCOUNT_NO”);

t_row   t_cursor%ROWTYPE;

commit_count NUMBER;

BEGIN

commit_count := 0;

:processed_row_count := 0;

FOR r_reader IN

(SELECT  “TABLE_1″.”ACCOUNT_NO”  ACCOUNT_NO ,  “TABLE_1″.”STAT”  STAT

FROM “PSEUDO_PROD”.”TABLE_1″ “TABLE_1”

) LOOP

OPEN t_cursor(r_reader.ACCOUNT_NO);

FETCH t_cursor INTO t_row;

IF t_cursor%NOTFOUND THEN

INSERT INTO “DS_1”.”TEST_AUTO_CORRECT_LOAD”(“ACCOUNT_NO”, “STAT” )

VALUES (r_reader.ACCOUNT_NO , r_reader.STAT);

commit_count := commit_count + 1;

:processed_row_count := :processed_row_count + 1;

ELSE

LOOP

UPDATE “DS_1″.”TEST_AUTO_CORRECT_LOAD” SET

“STAT” = r_reader.STAT

WHERE rowid = t_row.rowid;

commit_count := commit_count + 1;

:processed_row_count := :processed_row_count + SQL%ROWCOUNT;

IF (commit_count = 1000) THEN

COMMIT; commit_count := 0;

END IF;

FETCH t_cursor INTO t_row;

EXIT WHEN t_cursor%NOTFOUND;

END LOOP;

END IF;

CLOSE t_cursor;

IF (commit_count = 0) THEN

COMMIT; commit_count := 0;

END IF;

END LOOP;

COMMIT;

END;

END;

Snapshot for same is as below.

AC 4.jpg

3) Array fetch & Rows par commit option

3.1Array Fetch size:

The array fetch feature lowers the number of database requests by “fetching” multiple rows (an array) of data with each request. The number of rows to be fetched per request is entered in the Array fetch size option on any source table editor or SQL transform editor. The default setting is 1000, which means that with each database request, the software will automatically fetch 1000 rows of data from the source database. The maximum array fetch size that can be specified is 5000 bytes.

Suggestion while using array fetch size option:

The optimal number for Array fetch size depends on the size of your table rows (the number and type of columns involved) as well as the network round-trip time involved in the database requests and responses. If your computing environment is very powerful, (which means that the computers running the Job Server, related databases, and connections are extremely fast), then try higher values for Array fetch size and test the performance of your jobs to find the best setting.

3.2Rows per commit:

Rows per commit’ specifies the transaction size in number of rows. If set to 1000, Data Integrator sends a commit to the underlying database for every 1000 rows.

Rows per commit’ for regular loading defaults to 1000 rows. Setting the Rows per commit value significantly affects job performance. Adjust the rows per commit value in the target table editor’s Options tab, noting the following rules:

  • Do not use negative number signs and other non-numeric characters.
  • If you enter nothing or 0, the text box will automatically display 1000.
  • If you enter a number larger than 5000, the text box automatically displays 5000.

It is recommended that you set rows per commit between 500 and 2000 for best performance. You might also want to calculate a value. To do this, use the following formula:

max_IO_size/row size (in bytes)

For most platforms, max_IO_size is 64K. For Solaris, max_IO_size is 1024K.

Note that even with a value greater than one set for Rows per commit, SAP Data Services will submit data one row at a time if the following conditions exist:

You are loading into a database (this scenario does not apply to Oracle databases), and have a column with a LONG datatype attribute.

You are using an overflow file where the transaction failed. However, once all the rows are loaded successfully, the commit size reverts to the number you entered. In this case, depending on how often a load error happens, performance might become worse than setting Rows per commit to 1.


Let us consider a scenario while choosing a different array fetch size and rows per commit value.

Let’s say you are setting Array fetch size value to 100 and rows per commit to 500 and the total number of rows in source are 800. Suppose the job terminates after processing 700 rows, then the job will enter only 500 rows into target because we have set rows per commit value to 500. The remaining 200 rows will not enter into the target, while the same 200 rows will be fetched by the same job. It will cause an insufficient data in target.

So, setting array fetch size & rows per commit value is very important while designing a job.


4) Use database link In datastore option :

While fetching data from different datastores connected to different database then BODS may not be able to push down the query to database instead BODS will generate separate query for each datastore.

Now to avoid such a situation we can define database link between databases at datastore level, so after applying DB link at datastore level BODS will generate Push-down SQL which is optimized one.

Refer the scenario for the same below.

4.1 Scenario :

I have considered the following tables and database for this scenario.

Table Information:

1) ACCOUNT DIMENSION

2) BORG FLAT TABLE

Datastore & Database information:

1) DATASTORE: EDW_REP POINTING TO DATABASE TCLBI.

2) DATASTORE: PSEUDO_PROD POINTING TO DATABASE PSEUDO_PROD.

The account dimension table has been taken from EDW_REP datastore which is pointing to EDW_REP schema of TCLBI database whereas the other table BORG_DTL is from Pseudo_prod database.

The job has been deigned as shown in image below:

db 1.jpg

As both the tables are from different databases, hence BODS generates multiple SQL which is shown in snapshot below.

db 2.jpg


Also, a DB link is present between both the databases.(DB link is edwrep_to_pseudoprod). This link helps us to generate Push-down SQL thereby generating an optimized SQL.

In order to achieve this, the following advance setting needs to be done in the Datastore of BODS.

Right click on DS_EDW_REP Datastore -> EDIT -> Advanced ->scroll down to Linked DataStores .


Snapshot of same is attached below:

db 3.jpg

Click on Linked Datastores you will get following window :

db 4.jpg

Choose a datastore that needs to be linked with other datastore, In this case, it is DS_1.

Press OK. Provide proper DB link name from the dropdown


db 5.jpg

Press OK again. After saving the job, check the optimized SQL generated by BODS, which is as follows.


INSERT /*+ APPEND */ INTO “DS_1″.”TEST_DB_LINK_BODS” ( “KEY_1” , “APP_PRINCIPLE_AMT” , “LOAN_AVAILED_DATE” )

SELECT  “BORG_DTL”.”KEY_1″  KEY_1 ,  “BORG_DTL”.”APP_PRINCIPLE_AMT”  APP_PRINCIPLE_AMT ,  “ACCOUNT_DIM”.”LOAN_AVAILED_DATE”  LOAN_AVAILED_DATE

FROM “DS_1″.”ACCOUNT_DIM” “ACCOUNT_DIM” INNER JOIN “PSEUDO_PROD”.”BORG_DTL”@DBLINK “BORG_DTL” ON ( “ACCOUNT_DIM”.”ACCOUNT_NO”  =  “BORG_DTL”.”KEY_1″ )

It generates Insert into statement that is an optimized Push down SQL.

Similarly more than 1 DB link may be added if required.

5) Use server group for splitting dataflows among servers

In BODS, we can define multiple dataflows parallel in single workflow. So when a job is designed by parallel dataflow, BODS will start all the dataflow at same time on particular job server mentioned while executing a job.

Because of this, single job server might experience heavy load. Also resources of that particular job server are shared among multiple parallel dataflows which may degrade the performance of BODS job.

To avoid such scenario we can distribute parallel dataflows to job server group having multiple job servers.

5.1     What is server group?

You can distribute the execution of a job or a part of a job across multiple Job Servers within a Server Group to better balance resource-intensive operations. A server group automatically measures resource availability on each Job Server in the group and distributes scheduled batch jobs to the Job Server with the lightest load at runtime.

5.2     How to create/manage server group

Goto management console ->Server Groups ->All server groups->Server Group configuration-> Click on Add



SG1.jpg

Add the job servers into predefined server group and click Apply.


SG2.jpg


5.3     Distribution levels for data flow execution

When you execute a job, you can specify the following values on the Distribution level option:

  • Job level – An entire job can execute on an available Job Server.
  • Data flow level – Each data flow within a job can execute on an available Job Server and can take advantage of additional memory (up to two gigabytes) for both in-memory and pageable cache on another computer.
  • Sub data flow level – A resource-intensive operation (such as a sort, table comparison, or table lookup) within a data flow can execute on an available Job Server. Each operation can take advantage of up to two gigabytes additional memory for both in-memory and pageable cache on another computer.

SG3.jpg

If a job is run using only single job server then all the dataflow will start on a single job server which may take more time to execute because single job server have a limited resources.

For this job we can split dataflows to multiple job servers with the help of job server group, which is shown below.


[Note : In this example we will consider Data Flow level distribution.]

I have created a sample job to test how data flows spilt occurs.

SG4.jpg


In Job server or Server Group, select server group.

In Distribution level, select on which level (i.e Job,Dataflow or sub dataflow ) splitting should occur.

After executing the job, the Management Console appears as below.


SG6.jpg



6) Error handling by using BODS internal (Metadata) Table

Will update soon…

To report this post you need to login first.

6 Comments

You must be Logged on to comment or reply to a post.

  1. KC K

    Hi Swapnil, Thanks for sharing knowledge. Would like to see error handling concepts also. Thanks once again…

    (0) 
  2. Chandra Sekhar K

    Hi Swapnil,

    Nice Documentation.
    When I tried DB link in BODS for optimization to HANA data base Data store and SQL data base Data store,but HANA Database Data store not Linking Data store option.
    Is it possible HANA Database DAta store?

    Please share needful information

    Thanks,
    Sekhar

    (0) 

Leave a Reply