Recently, one of my customers had an interesting data movement problem. They needed to extract 150+ tables from a traditional RDBMS, some with 500M+ rows, and move those tables to an SAP HANA instance. This would seem to be a straightforward exercise, but there were some gotchas:
- The volume of the data was large and the network bandwidth between the servers was restricted due to WAN limitations.
- The goal was to move and import all the data into their SAP HANA instance in less than one hour.
- The only open port between the two servers was SSH (port 22), and this could not be changed due to security policies.
To address the volume and bandwidth issues, it was clear that compression would be necessary. To meet the one-hour loading time, some form of parallelism would be needed. The last limitation, SSH only, suggested some form of flat file copy using secure copy (SCP) between the machines. The flat file content for this project ended up being CSV files.
The customer’s source database was an Oracle 12c instance and the target was an SAP HANA 2.0 SPS03 on-premise instance. Knowing the Oracle RDBMS has a powerful job scheduler along with the ability to execute scripts in the operating system, the architecture of this process seemed pretty straightforward: use the job scheduler to process a list of tables asynchronously in as many parallel jobs as were reasonable on the source system.
For this proof of concept, two very small virtual machines (4 cores, 16GB of RAM) were used. Even with these very limited resources, the tool moved and loaded 50M rows in 4 minutes, including creating the necessary schema objects on the SAP HANA instance.
Note: the customer’s requirements did not require ensuring transactional integrity on the source tables while the export moved the tables.
OK, the architecture pretty much described itself, but what is the best way to get the data out of the source database? There are lots of examples of CSV dumping procedures on the Internet, but specific information on handling SAP HANA data type translations, and most importantly, BLOB handling, was hard to find.
In the PL/SQL code installed on the source Oracle database, there is a specific procedure called, not surprisingly, dump_table_to_csv, which dumps the contents of a table to CSV format. This procedure was created specifically to handle translations to appropriate SAP HANA data types. To verify BLOBs were handled correctly, dumping a 4MB Microsoft PowerPoint file with this procedure produced the same output as a native SAP HANA CSV export of the same content.
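The BLOB idea can be illustrated outside the database. One way to carry binary content through a CSV file, and the easiest to compare against a reference export, is to hex-encode it into a single text field. The sketch below is purely illustrative: the file name, the column layout, and the choice of hex encoding are assumptions for the demo, not a description of dump_table_to_csv's exact output format.

```shell
# Illustrative only: hex-encode a small binary payload the way a CSV
# dump routine might emit a BLOB column as one continuous hex string.
printf 'PK\003\004' > sample.bin                   # fake binary payload
HEXVAL=$(od -An -tx1 sample.bin | tr -d ' \n')     # bytes -> hex text
echo "42,\"${HEXVAL}\"" > row.csv                  # id column + BLOB column
cat row.csv                                        # -> 42,"504b0304"
```

Comparing output like this byte-for-byte against a native export of the same content is exactly the kind of verification described above.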
Note: for this proof of concept, the tool does not handle exotic types such as spatial columns and nested tables. These could easily be added but were not required at this time.
To get started, specific privileges need to be granted on both the source Oracle database and the target SAP HANA instance. On the source side, the ability to write files to the operating system was clearly a must (UTL_FILE). The ability to run operating system level scripts was also necessary (DBMS_SCHEDULER, job_type=EXECUTABLE). See the create‑users.txt file for details on all the configuration and grants needed on the source and target databases.
With the necessary privileges nailed down, the next step was to build the asynchronous process. This turned out to be an interesting problem: how to ensure 100% utilization of the parallel resources with no “blocking” operations. This is where the scheduling capability of the source database was the right answer.
To manage the scheduler, we created two tables. The first is a “job” definition table that includes the information about the local Oracle environment and the target HANA instance, including SSH information like hostname and Linux user. The second table contains the list of tables chosen to be exported. This list is fed sequentially to the parallel jobs, which keeps all the jobs at 100% utilization exporting tables. Once a job is defined, it can be executed any number of times.
Running the tool
To define a job, there is a simple API to create a job, assign individual tables or entire schemas to a job, and a stop procedure. Refer to the CSV_EXPORTER.pls file for details on the API. The key here is how many parallel threads to use for a job – this should be a number smaller than the number of cores on the source machine to avoid overrunning the source server.
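Picking that thread count can be as simple as inspecting the source host. The cores-minus-one heuristic below is only an illustration of the "smaller than the number of cores" guidance, not a formula from the tool itself.

```shell
# Pick a parallel job count that leaves headroom on the source server.
# Cores-minus-one is an illustrative heuristic, not the tool's rule.
CORES=$(nproc)
THREADS=$(( CORES > 1 ? CORES - 1 : 1 ))
echo "Using ${THREADS} parallel export jobs on a ${CORES}-core host"
```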
The sample-job.sql script in the tests folder shows an example of creating a job, adding a schema, and running the new or re-created job.
When a job is started with the run_job procedure, a series of scheduler jobs are created to start processing the list of tables. The tool orders the tables from largest to smallest based on the current statistics for each table – this could have been done any number of ways and we may revisit this with the customer.
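The largest-first ordering is easy to emulate with a plain sort on row counts. The table names and statistics below are made up for the demo; the point is that starting the biggest tables first reduces the chance that one huge table runs alone at the tail end of the job.

```shell
# Order a table list largest-first by row count (column 2), mimicking
# how the tool prioritizes big tables. The stats values are invented.
cat > tables.txt <<'EOF'
CUSTOMERS 1200000
LOOKUP_CODES 300
READINGS 500000000
EOF
sort -k2,2nr tables.txt > ordered.txt
cat ordered.txt    # READINGS first, LOOKUP_CODES last
```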
The submit_job procedure is the entry point for starting and sustaining table selection until all tables are exported. The key to controlling the process is the ability to “lock” the table list so individual tables can be selected without getting in the way of other parallel jobs. The following flow diagram shows how a table is selected for processing – note the initial lock and subsequent commit to release the lock.
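The lock-then-select pattern can be mimicked at the shell level. In the tool itself the locking happens inside Oracle (lock, pick a table, commit to release); the flock-based sketch below, with invented table names, just shows why the approach keeps workers busy: only the tiny "pick the next table" step is serialized, while the actual export work runs fully in parallel.

```shell
# Sketch of the "lock, pick next table, release" pattern using flock.
# Only the pick is inside the critical section, so contention is tiny.
printf '%s\n' T_ORDERS T_ITEMS T_EVENTS > queue.txt
next_table() {
  (
    flock -x 9                  # serialize access to the shared list
    TABLE=$(head -n1 queue.txt) # claim the next table...
    sed -i '1d' queue.txt       # ...and remove it from the list
    echo "$TABLE"
  ) 9>>queue.lock               # lock released when fd 9 closes
}
for i in 1 2 3; do
  echo "worker picked: $(next_table)"
done
```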
Once the table data, and some metadata, are exported, a new scheduler job is launched to run the csv_exporter.sh script in the operating system. By starting a new job for this table, the export of the next available table in the job can begin immediately. The BASH script runs at the operating system level to compress the CSV file and transfer it via SCP to the target SAP HANA server. The servers are configured with PKI keys to permit passwordless login. The final step in this script opens an SSH session to the target server and launches the csv_launch.sh script – also a BASH script.
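The compress-and-ship step amounts to a short pipeline. The sketch below shows the shape of what such a script does; the hostname, Linux user, use of gzip, and the `TRANSFER=echo` stand-in (which prints the scp/ssh commands instead of running them, so the sketch works offline) are all assumptions for the demo – the real csv_exporter.sh is in the repository.

```shell
# Shape of the OS-level step: compress the export, ship it over SSH,
# then kick off the loader on the HANA host. Host/user are placeholders;
# TRANSFER=echo prints the remote commands rather than executing them.
TRANSFER=echo
GUID=AFC6A8A1C7FB66DDE0530100007FC94B
echo "col1,col2" > "${GUID}.csv"                  # stand-in export file
gzip -f "${GUID}.csv"                             # shrink WAN transfer volume
$TRANSFER scp "${GUID}.csv.gz" hanauser@hanahost:/usr/sap/HXE/HDB90/hhs/exports/
$TRANSFER ssh hanauser@hanahost "./csv_launch.sh ${GUID}"
```

Because the PKI keys permit passwordless login, both remote commands run unattended inside the scheduler job.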
On the SAP HANA platform, the csv_launch.sh and csv_loader.sh scripts work together to execute HDBSQL commands that create the target table, if it is not already present, and run the SAP HANA “IMPORT FROM CSV” command. The SAP HANA IMPORT statement is incredibly fast, and even large tables can be loaded very quickly. The following example shows the log output from loading a test table named EXTRACT_077_0933777 (having 933,777 rows). Also, notice the GUID appearing throughout the script – it uniquely identifies this table within the job and does not change between runs.
do begin
    declare table_count integer;
    select count(1) into table_count
      from tables
     where schema_name = 'TARGET_SCHEMA'
       and table_name = 'EXTRACT_077_0933777';
    if table_count = 0 then
        execute immediate '
            CREATE COLUMN TABLE "TARGET_SCHEMA"."EXTRACT_077_0933777" (
                 "METERID"           INTEGER        /* T: 2   L: 22  P: 38 P: 0 N: */
                ,"REPID"             INTEGER        /* T: 2   L: 22  P: 38 P: 0 N: */
                ,"DEVICETYPE"        INTEGER        /* T: 2   L: 22  P: 38 P: 0 N: */
                ,"LAT"               NUMBER(38,5)   /* T: 2   L: 22  P: 38 P: 5 N: */
                ,"LONGITUDE"         NUMBER(38,5)   /* T: 2   L: 22  P: 38 P: 5 N: */
                ,"RATECODE"          NVARCHAR(26)   /* T: 1   L: 26  P: 0  P: 0 N: */
                ,"METERFORM"         NVARCHAR(26)   /* T: 1   L: 26  P: 0  P: 0 N: */
                ,"METERCLASS"        NVARCHAR(26)   /* T: 1   L: 26  P: 0  P: 0 N: */
                ,"READING_TIMESTAMP" TIMESTAMP      /* T: 180 L: 11  P: 0  P: 6 N: */
                ,"READING_VALUE"     NUMBER(38,3)   /* T: 2   L: 22  P: 38 P: 3 N: */
                ,"DESCRIPTION"       NVARCHAR(128)  /* T: 1   L: 128 P: 0  P: 0 N: */
                ,"VENDOR_UOM"        NVARCHAR(26)   /* T: 1   L: 26  P: 0  P: 0 N: */
            );
        ';
    end if;
end;

truncate table "TARGET_SCHEMA"."EXTRACT_077_0933777";

import from csv file '/usr/sap/HXE/HDB90/hhs/exports/AFC6A8A1C7FB66DDE0530100007FC94B.csv'
    into "TARGET_SCHEMA"."EXTRACT_077_0933777"
    with batch 100000
    column list in first row
    field delimited by ',' optionally enclosed by '"'
    error log '/tmp/AFC6A8A1C7FB66DDE0530100007FC94B.log'
Lessons learned and next steps
Overall, this was a very interesting exercise and is a very good starting point for a production installation. We met all the requirements and achieved remarkable throughput.
Some things that might be changed include better clean-up on the SAP HANA side, e.g., truncating all the tables at the start of the job to facilitate more efficient transfer and loading. Also, for the largest tables, cutting the CSV files into multiple smaller files could be more efficient. Finally, adding the more exotic data types, e.g., ST_GEOMETRY, would make the tool more generally useful.
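The splitting idea is straightforward because the IMPORT statement expects the column list in the first row: each chunk just needs its own copy of the header to remain loadable on its own. A small sketch, with invented data and arbitrary chunk size and filenames:

```shell
# Split a large CSV into smaller pieces, repeating the header row in
# each chunk so every piece stays loadable by IMPORT FROM CSV alone.
printf 'id,val\n1,a\n2,b\n3,c\n4,d\n' > big.csv
HEADER=$(head -n1 big.csv)
tail -n +2 big.csv | split -l 2 - part_      # 2 data rows per chunk
for f in part_*; do                          # prepend header to each
  { echo "$HEADER"; cat "$f"; } > "$f.tmp" && mv "$f.tmp" "$f"
done
head -n1 part_aa                             # -> id,val
```

The chunks could then be compressed, transferred, and loaded in parallel, which would help the largest tables the most.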
The code for this project is available from the SAP Samples GitHub repository: