Automate SAP HANA to File Splitting – Passing variables between different CPI-DS task instances
INTRODUCTION
This blog provides a walkthrough of the steps to follow in CPI-DS to automate reading millions of records from a HANA database and generating files with 100k records each (the size can be made dynamic).
Problem: As simple as it sounds, there is a challenge due to a limitation of CPI-DS: global variables exist only within a task instance and are cleared once the task is executed and finished. In our scenario, we want each task execution to pick and identify the next batch of 100k rows based on the last row number processed by the previous task. This is straightforward in SAP Data Services but a challenge in CPI-DS.
This blog post provides an approach that helps companies accelerate the reconciliation of large-volume migrations and data profiling in the following scenarios:
- Automate reading millions of records from a large database and split the content into small chunks (e.g. 100k records per file) using a dynamic row-number limit per file. This optimizes the throughput and performance of SAP Cloud load routines, and 100k records per file is the recommended size for most SAP Cloud systems.
- Automate the scheduling of tasks by retaining task variables between different CPI-DS task instances, i.e. sharing variables between task instances.
- Trace a row-number range back to a filename for efficient reconciliation of records between the source and SAP Cloud systems: with a limit of 100k records per file, file 1 contains records 1 to 100k, file 2 contains records 100,001 to 200k, and so on.
This scenario is very common when migrating large volumes of data into SAP Cloud systems from various legacy systems, as the load programs of SAP Cloud systems expect files of 100k records to achieve maximum load throughput. This approach makes it easy to trace error records in SAP Cloud load files back to the exact records in the database, which also accelerates the identification and fixing of data issues and the reconciliation of data from the source systems to the SAP Cloud systems.
CPI-DS was not designed to store information, since it is integration middleware, but it can in fact persist a small amount of variable data keyed by the task name and a memory ID section, which can be retrieved later by the next task execution.
BACKGROUND
Let's imagine the scenario below, in which a CPI-DS task reads a database table with 500,000 rows. On each execution, the task has to read the next 100,000 rows and create a file with a counter in its name: the first execution reads the first 100,000 rows and creates a file with 001 in its name, the next execution reads the following 100,000 rows and creates a second file with 002, and so on.
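With these numbers, five scheduled executions cover the whole table: run 1 writes rows 1 to 100,000 into file 001, run 2 writes rows 100,001 to 200,000 into file 002, and run 5 writes rows 400,001 to 500,000 into file 005.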
There are two functions that can be used in the PreLoad and PostLoad script sections, and these functions can store and retrieve content of type CHAR(255).
Calling the SAVE_DATA() function in the PostLoad and the GET_DATA() function in the PreLoad makes CPI-DS store a CHAR value that can be retrieved in the next execution.
Therefore, during the mapping it is also possible to use the retrieved value, work with it, and store it again to be retrieved later, and so on.
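As a minimal sketch of the round trip (the '-MYKEY' suffix and the $G_MY_VALUE variable are illustrative only, not part of the actual solution):
save_data(job_name() || '-MYKEY', 'some CHAR content');   # PostLoad of execution N
$G_MY_VALUE = get_data(job_name() || '-MYKEY');           # PreLoad of execution N+1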
CHECKING THE SAP HELP PORTAL
Curiously, the Help Portal documentation for these two functions only mentions that they save and read date information.
But in the section related to Change Data Capture (delta loads), the same SAVE_DATA function is described slightly differently.
It keeps tying these functions to date information, but at the same time it mentions that they can store "any piece of data".
This blog post explains the use of these two functions without any date correlation.
HOW TO ACHIEVE IT
Step 1:
Besides the SAVE_DATA and GET_DATA functions, the following functions and features are needed (a short illustration follows the list):
GEN_ROW_NUM function
JOB_NAME function
TO_CHAR function
Global Variables
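As a quick illustration of how these are used later ($G_EXAMPLE and the '-EXAMPLE' key are hypothetical; the real scripts appear in Steps 5 and 6):
# In a query transform mapping, GEN_ROW_NUM numbers the rows read in this run:
gen_row_num()
# In the scripts, TO_CHAR converts the integer counter to CHAR, and
# JOB_NAME makes the stored key unique per task:
$G_EXAMPLE = to_char(123, '999999999');
save_data(job_name() || '-EXAMPLE', $G_EXAMPLE);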
Step 2:
Below is the dataflow, which reads the database, selects rows based on the stored row-number counter, and generates the files using the stored file-number counter.
Below are all the global variables needed:
$G_RESET – Fill with the letter "X" to erase the stored values.
$G_FILE_COUNTER_CHAR – Used to store/read values with the SAVE_DATA and GET_DATA functions, which only work with the CHAR type.
$G_ROW_COUNTER_CHAR – Used to store/read values with the SAVE_DATA and GET_DATA functions, which only work with the CHAR type.
$G_CHUNK_SIZE – Holds the number of rows desired in each file.
$G_FILENAME – Holds the filename with the file counter concatenated into its name.
$G_ROW_COUNTER – Holds the latest row number.
$G_FILE_COUNTER – Holds the latest file counter.
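For the demo runs shown later in this post, it is assumed that $G_CHUNK_SIZE is set to 100 (rather than 100k) so the row ranges are easy to follow, and that $G_RESET is left blank for normal runs.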
Step 3:
Transform SelectFields
This transform selects the columns needed in the target file and also creates a field named Counter that holds the sequence of row numbers for the data retrieved from the source database, using the GEN_ROW_NUM function.
On the left side is the source database; on the right side are the needed columns plus the Counter field with the row numbers.
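A sketch of the mapping expression behind the Counter output column (the remaining columns map one-to-one from the source):
gen_row_num()
GEN_ROW_NUM returns 1, 2, 3, and so on for each row processed, and the numbering restarts on every execution; the whole table is therefore read and re-numbered on each run, then filtered down to the desired window in the next step.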
Step 4:
Transform FilterCounter
Here the rows are selected using the global variables, which contain exactly the row range calculated across the task executions.
On the first task execution the global variable $G_ROW_COUNTER holds the number 1, $G_CHUNK_SIZE carries the row limit for each file, and on the following executions $G_ROW_COUNTER always holds the row number immediately after the last row sent to the target (i.e. the first row of the next batch).
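Put together, the filter condition of this transform would look like the following (a sketch; Counter is the field created in Step 3):
SelectFields.Counter >= $G_ROW_COUNTER AND SelectFields.Counter < ($G_ROW_COUNTER + $G_CHUNK_SIZE)
On the first run with $G_CHUNK_SIZE = 100 this keeps rows 1 to 100, on the second run rows 101 to 200, and so on.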
Step 5:
PreLoad Script
The PreLoad script performs the following steps:
– At the beginning of the code, it checks whether it has to reset the stored memory.
– It retrieves the stored data.
– It checks whether the stored data is empty, which means this is the first execution, and in that case initializes the counters with the number 1.
– It builds the filename by concatenating the retrieved counter number.
Note that each IF($G_RESET = 'X') guards only the single statement that follows it, which is why the check is repeated on every line of the reset block.
# Reset mode: erase the stored values and log it
IF($G_RESET = 'X') save_data(job_name() || '-FILECOUNTER', ' ');
IF($G_RESET = 'X') save_data(job_name() || '-ROWCOUNTER', ' ');
IF($G_RESET = 'X') print(' ');
IF($G_RESET = 'X') print('===========================================================================');
IF($G_RESET = 'X') print(' ');
IF($G_RESET = 'X') print('PRELOAD - RESET - MEMORY ERASED');
print(' ');
print('===========================================================================');
print(' ');
print('PRELOAD - BEFORE MAPPING');
print(' ');
print('CURRENT VALUE FOR $G_CHUNK_SIZE = ' || $G_CHUNK_SIZE);
print('CURRENT VALUE FOR $G_FILENAME = ' || $G_FILENAME);
print('CURRENT VALUE FOR $G_ROW_COUNTER = ' || $G_ROW_COUNTER);
print('CURRENT VALUE FOR $G_FILE_COUNTER = ' || $G_FILE_COUNTER);
print('CURRENT VALUE FOR $G_ROW_COUNTER_CHAR = ' || $G_ROW_COUNTER_CHAR);
print('CURRENT VALUE FOR $G_FILE_COUNTER_CHAR = ' || $G_FILE_COUNTER_CHAR);
print(' ');
# Retrieve the values stored by the previous execution
$G_ROW_COUNTER_CHAR = get_data(job_name() || '-ROWCOUNTER');
$G_FILE_COUNTER_CHAR = get_data(job_name() || '-FILECOUNTER');
# First execution (or after a reset): nothing stored yet, so start both counters at 1
IF($G_ROW_COUNTER_CHAR = ' ') $G_ROW_COUNTER_CHAR = '1';
IF($G_FILE_COUNTER_CHAR = ' ') $G_FILE_COUNTER_CHAR = '1';
$G_ROW_COUNTER = $G_ROW_COUNTER_CHAR;
$G_FILE_COUNTER = $G_FILE_COUNTER_CHAR;
print('===========================================================================');
print(' ');
print('PRELOAD - VALUES RETRIEVED');
print(' ');
print('CURRENT VALUE FOR $G_ROW_COUNTER = ' || $G_ROW_COUNTER);
print('CURRENT VALUE FOR $G_FILE_COUNTER = ' || $G_FILE_COUNTER);
print(' ');
print('===========================================================================');
print(' ');
print('PRELOAD - NEW FILENAME');
# Build the target filename from the retrieved file counter
$G_FILENAME = 'LOCATION_FILE_' || $G_FILE_COUNTER_CHAR || '.CSV';
print('CURRENT VALUE FOR $G_FILENAME = ' || $G_FILENAME);
print(' ');
print('===========================================================================');
print(' ');
print('FILTERING');
print(' ');
print('FILTERING ' || ' >= ' || $G_ROW_COUNTER || ' AND < ' || ($G_CHUNK_SIZE + $G_ROW_COUNTER));
print(' ');
print('===========================================================================');
Step 6:
PostLoad Script
The PostLoad script performs the following steps:
– At the beginning of the code, it checks whether it is running in reset mode.
– It increments the filename sequence number.
– It converts the integer content to CHAR, as required by the SAVE_DATA and GET_DATA functions.
– It saves the data under a memory section key built from the task name plus the strings -FILECOUNTER and -ROWCOUNTER.
# Reset mode: just log it; the stored values were already erased in the PreLoad
IF($G_RESET = 'X') print('POSTLOAD - RESET');
print(' ');
print('===========================================================================');
print(' ');
print('POSTLOAD - AFTER MAPPING');
print(' ');
print('CURRENT VALUE FOR $G_ROW_COUNTER = ' || $G_ROW_COUNTER);
print('CURRENT VALUE FOR $G_FILE_COUNTER = ' || $G_FILE_COUNTER);
print('CURRENT VALUE FOR $G_ROW_COUNTER_CHAR = ' || $G_ROW_COUNTER_CHAR);
print('CURRENT VALUE FOR $G_FILE_COUNTER_CHAR = ' || $G_FILE_COUNTER_CHAR);
print(' ');
# Advance the file counter, convert it to CHAR, and persist it for the next execution
$G_FILE_COUNTER = $G_FILE_COUNTER + 1;
$G_FILE_COUNTER_CHAR = to_char($G_FILE_COUNTER, '999999999');
IF($G_RESET != 'X') save_data(job_name() || '-FILECOUNTER', $G_FILE_COUNTER_CHAR);
# Advance the row counter by one chunk and persist it as well
$G_ROW_COUNTER = $G_ROW_COUNTER + $G_CHUNK_SIZE;
$G_ROW_COUNTER_CHAR = to_char($G_ROW_COUNTER, '999999999');
IF($G_RESET != 'X') save_data(job_name() || '-ROWCOUNTER', $G_ROW_COUNTER_CHAR);
print('===========================================================================');
print(' ');
print('POSTLOAD - VALUES STORED');
print(' ');
print('CURRENT VALUE FOR $G_ROW_COUNTER = ' || $G_ROW_COUNTER);
print('CURRENT VALUE FOR $G_FILE_COUNTER = ' || $G_FILE_COUNTER);
print('CURRENT VALUE FOR $G_ROW_COUNTER_CHAR = ' || $G_ROW_COUNTER_CHAR);
print('CURRENT VALUE FOR $G_FILE_COUNTER_CHAR = ' || $G_FILE_COUNTER_CHAR);
print(' ');
print('===========================================================================');
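Note on resetting: to restart the load from the beginning, set $G_RESET to 'X' for a single execution. The PreLoad will erase the stored values and the PostLoad will skip saving new ones, so the next normal run starts again at row 1 and file 1.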
CHECKING THE RESULTS
Step 7:
The following screenshots show the generated filename with its counter number and the row range used to filter the desired records.
Filename LOCATION_FILE_1.CSV generated.
Filtering rows >= 1 and < 101, i.e. rows 1 to 100.
Filename LOCATION_FILE_2.CSV generated.
Filtering rows >= 101 and < 201, i.e. rows 101 to 200.
Filename LOCATION_FILE_3.CSV generated.
Filtering rows >= 201 and < 301, i.e. rows 201 to 300.
CSV files generated.
Each one with 100 rows.
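To recap the counter arithmetic across these runs (with $G_CHUNK_SIZE = 100): run 1 starts with $G_ROW_COUNTER = 1, writes LOCATION_FILE_1.CSV, and stores 101 and 2; run 2 starts with 101, writes LOCATION_FILE_2.CSV, and stores 201 and 3; run 3 starts with 201, writes LOCATION_FILE_3.CSV, and stores 301 and 4.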
CONCLUSION
It is completely feasible to use a loop counter across task executions, which lets the company schedule the tasks and have the data chunks and filenames generated correctly throughout the executions.
This blog post only aims to give an example of how to build this loop counter; it can now be tailored to other specific scenarios, especially those that use ABAP transformation programs. To improve performance, the loop counter global variable can be used by the ABAP program to perform SQL selects, thereby avoiding bringing the whole data set into the CPI-DS runtime just to be filtered.