Import PostgreSQL Tables Containing Free-Text Data into SAP HANA
We at SAP Research-Boston, have been using SAP HANA’s data analytics capabilities for quite some time now in our research with the medical datasets (e.g. the well-known MIMIC2 datasets). Thus, in our line of work, we regularly need to import data from various sources into our SAP HANA databases.
Luckily, SAP HANA provides a handful number of features to import data in different ways from different kinds of sources. Now, importing data using .csv files is one of the effective methods for migrating data to SAP HANA from different popular DBMS. Also, HANA can be extremely fast too, when importing the .csv files at the server-side, using its control (.ctl) files.
Much has already been written all around the web on these processes of importing .csv files into SAP HANA (a very good overview can be found here: http://wiki.scn.sap.com/wiki/display/inmemory/Importing+CSV+files+into+SAP+HANA). However, one challenge that may not have been thoroughly discussed is about dealing with .csv files that contain free-text or unrestricted natural language as the vital data to be imported. In this blog, I will be presenting the issues one may encounter when dealing with the free-text data, and also the details of how to preprocess this free-text data to prepare the .csv files, so that they are ready to be imported to SAP HANA with zero problems. I will be using the PostgreSQL database as the source database for my examples. Similar methodology can be applied import such free-text data from the other popular database systems as well (e.g. Oracle, MySQL, SQL Server etc.)
To migrate data to SAP HANA from any other database systems, we first need to build the table structures in SAP HANA, representing the data-types of the fields/columns correctly following SAP HANA’s standard. The list of supported data-types in SAP HANA can be found here: http://help.sap.com/hana/html/_csql_data_types.html.
After the table structures are built on SAP HANA, the next step is to prepare the .csv files using the source database system, where each .csv file contains the data of one table.
All the common database systems are equipped with the feature of exporting the data of a table as a .csv file. These .csv files usually follow the same structure, where each record (or row) is delimited by line-break or a newline character (\n). Moreover, the text-type values are usually enclosed by double quote characters (“), and in case a double-quote character appears within a text-type value, it is usually escaped by another double-quote character (“) appearing immediately before it.
Now, PostgreSQL, like many other database systems, allows one to choose any character for this escape character. However, like most other databases, it always delimits the records by a newline character (\n), with no option to choose otherwise.
In contrast, when importing .csv files, SAP HANA allows one to choose any character that have been used for delimiting the records, which is generally chosen to be a newline character (\n) in most cases. However, when importing .csv files, SAP HANA uses a backslash character (\) as the only escape character, with no option to choose otherwise.
Therefore, when exporting a table of any database system, like PostgreSQL, as a .csv file one should be paying attention to the above restrictions, in case the .csv file is meant to be imported on SAP HANA. Thus, the command to use in PostgreSQL is as follows:
COPY schema_name.table_name TO ‘/path/table_name.csv’
WITH CSV QUOTE AS ‘”‘ ESCAPE AS ‘\’;
A .csv file exported with the above command usually gets imported to SAP HANA with no problem, when using its control file-based server-side CSV import feature. However, dealing with free-text can be a little harder. Here’s the reason why: Text-type fields often hold unrestricted natural language or free-text values, which can contain line-breaks or the newline characters (e.g. \n, or \r, or a combination of both) and also rarely backslash characters (\) that do not get escaped in the exported .csv files.
This creates problem for SAP HANA when importing these .csv files, as its CSV parser (which is used during the control file-based import) wrongly assumes the start of a new record as soon as it encounters a newline character, even if it appears within the enclosure of a text-type value.
To solve this problem, we need to preprocess the source data in order to replace these newline characters that appear within the text-type values with “something” that will not confuse SAP HANA’s CSV parser. In our case, we chose to insert the html line-break tag (</br>) instead of the newline characters. Moreover, we also need to cleanup (i.e. remove) the backslash characters (\) appearing within the text-type values.
To apply this solution, some may choose to preprocess data on the exported .csv files, which I find to be cumbersome, as it requires processing these mammoth .csv files with a powerful (regular expression-based) text-file processing engine, that needs to be able to differentiate between newline characters appearing within text-values and the newline characters used to delimit records.
The solution I present here will preprocess the data on the source database system, and then output the .csv file in a way that is ready to be imported on SAP HANA without any problems.
The following are the steps of preprocessing the data on a PostgreSQL database:
STEP 1: Create a Temporary Copy of the Table
On PostgreSQL’s console, enter the following SQL to first create a temporary schema, and then create a copy of the table to be exported in this temporary schema:
CREATE SCHEMA temporary_schema;
CREATE TABLE temporary_schema.table_name AS
(SELECT * FROM original_schema.table_name);
STEP 2: For Each Text-type Field/Column in the Table:
The text-type fields/columns are of data-types text, char, varchar, varchar2, nvarchar, nvarchar2 etc. Now, do the following for each such text-type field/column in the table:
STEP 2.1: Remove All Backslash Characters (\) from the Values of the Text-type Field:
Enter the following SQL on PostgreSQL’s console to remove all the backslash characters (\) from the values of the text-type field:
SET field_name = REPLACE(field_name, ‘\’,”);
STEP 2.2: Replace All Newline Characters from the Values of the Text-type Field:
Enter the following SQL on PostgreSQL’s console to replace all the newline characters from the values of the text-type field, with the custom string “</br>”:
SET field_name = REGEXP_REPLACE(field_name, E‘[\\n\\r]+’, ‘</br>’, ‘g’);
Thus, repeat the steps 2.1 and 2.2 for each text-type fields.
STEP 3: Export the Preprocessed Data as CSV:
Enter the following SQL on PostgreSQL’s console to export the preprocessed data of the table as a .csv file:
COPY temporary_schema.table_name TO ‘/path/table_name.csv’
WITH CSV QUOTE AS ‘”‘ ESCAPE AS ‘\’;
The “table_name.csv” file containing the preprocessed data will now be saved to “/path/” on the machine hosting the PostgreSQL database.
Note: All the SQL used in the steps 1 to 3 can be combined into an SQL script and be sequentially executed altogether. Similar SQL commands can be used to preprocess the data on other database systems as well (like Oracle).
STEP 4: Transfer the CSV File To the HANA Server
Use SCP or any FTP client to transfer the “table_name.csv” file to “/path-on-hana-server/” on the server hosting the SAP HANA database.
STEP 5: Prepare the Control File
Prepare a plain text file, named “table_name.ctl” with the following contents:
INTO TABLE target_schema.”table_name”
RECORD DELIMITED BY ‘\n’
FIELDS DELIMITED BY ‘,’
OPTIONALLY ENCLOSED BY ‘”‘
ERROR LOG ‘ table_name.err’
Then, save the “table_name.ctl” also on the server hosting the SAP HANA database. In this example, I will be saving it in the same location as the “table_name.csv” file, which is “/path-on-hana-server/”.
STEP 6: Execute the Import on SAP HANA
As mentioned earlier, please make sure that you already have an empty table, called “table_name”, in the “target_schema” on your SAP HANA instance. Pease also make sure that this empty table has a correctly translated table structure with the SAP HANA data-types correctly identified. Please note that the list of supported data-types in SAP HANA can be found here: http://help.sap.com/hana/html/_csql_data_types.html
Now, execute the following command on SAP HANA Studio’s SQL Console:
IMPORT FROM ‘/path-on-hana-server/table_name.ctl’;
This will start loading the preprocessed data on to the SAP HANA database.
If followed correctly, these steps will successfully load all the data of a table containing free-text, on to the SAP HANA database. However, please check the “table_name.err” file on the SAP HANA server to confirm that no error has occurred during this process.
Happy migrating your database to SAP HANA. 🙂