
Loading large CSV files to SAP HANA

SAP HANA makes it very easy to import a CSV file into a database. Multiple tutorials on the HANA Academy detail the process of importing a CSV file into a table in a HANA database.

Though importing CSV data into HANA is simple, importing a large CSV file can be a challenge. A large CSV file can't be imported using HANA Studio because the import process is very slow. Loading large CSV files into HANA is normally accomplished either through a control file or by importing directly from the CSV file:

IMPORT FROM CONTROL FILE '/data/data.ctl';

where the contents of the /data/data.ctl file are:

IMPORT DATA INTO TABLE "MYTABLE" FROM '/data/data.csv'
RECORD DELIMITED BY '\n'
FIELD DELIMITED BY ',';

or

IMPORT FROM CSV FILE '/data/data.csv' INTO "MYTABLE"
WITH
RECORD DELIMITED BY '\n'
FIELD DELIMITED BY ',';

Both methods are similar, and all techniques described in this article apply to both.

1. Download or copy the CSV file to an appropriate directory on the HANA appliance. This lets the powerful HANA appliance do the work of processing the CSV file. If you get a "Cannot open CSV file" error during the CSV import and your path is correct, ensure that the UNIX user hdbadm has read permission on the directory. If it doesn't, change the permissions on the directory where the CSV and CTL files will reside.
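For example, a quick check and fix from the appliance's shell (a sketch; the /data path and the sapsys group are illustrative):

# Verify that hdbadm can read the directory
ls -ld /data

# If not, hand the directory to hdbadm, or open up read/execute permissions
chown -R hdbadm:sapsys /data
chmod -R 755 /data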

2. Create a sample data set from the large data file and create the target table based on the sample. To create a sample data set of 1,000 records, use:

head -n 1000 data.csv > sampledatafile.csv

Copy sampledatafile.csv to the HANA Studio machine and import it using HANA Studio. This makes it easy to create the target table and adjust the column definitions. If definitions for the columns are available, edit the table definition in HANA Studio. Alternatively, use ALTER TABLE commands to adjust the table definition once the initial column definitions have been created, as sketched below.
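For example, if the sample import created a column that is too narrow for the full data set, it can be widened afterwards (table and column names are illustrative):

ALTER TABLE "MYTABLE" ALTER ("DESCRIPTION" NVARCHAR(500));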

3. When importing a large number of rows, use the BATCH clause in the IMPORT command for faster performance.

IMPORT FROM CSV FILE '/data/data.csv' INTO "MYTABLE"
WITH
RECORD DELIMITED BY '\n'
FIELD DELIMITED BY ','
BATCH 1000;

4. Always include an error file in the import command and check the error file after the import to ensure that no errors were generated. Ensure that the UNIX user hdbadm has write permission on the directory where the error file is written; otherwise, no error file will be generated.

IMPORT FROM CSV FILE '/data/data.csv' INTO "MYTABLE"
WITH
RECORD DELIMITED BY '\n'
FIELD DELIMITED BY ','
ERROR LOG '/log/error.log';


By default, the generated error file is read-only. If the script is executed a second time with the same error file name, the file will not be refreshed. Delete the log file after each execution.
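For example, before rerunning the import (path as used above):

rm /log/error.log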

5. If the CSV load should fail when it encounters any errors, use "FAIL ON INVALID DATA". The load will then fail as soon as it encounters the first error.

IMPORT FROM CSV FILE '/data/data.csv' INTO "MYTABLE"
WITH
RECORD DELIMITED BY '\n'
FIELD DELIMITED BY ','
FAIL ON INVALID DATA;

6. Before starting the import, ensure that enough memory is available on the HANA appliance and set the AUTOCOMMIT property to false. In HANA Studio, right-click in the SQL editor, choose "Show in Properties", and set AUTOCOMMIT to false.

7. While importing data, ensure that the column widths are set correctly in the target table. The CSV import will fail with the error "inserted value is too large for column" if any column is not wide enough, and rows containing an oversized value will be rejected during the import. If definitions for the input data are not available, first import the data into a staging table with generously sized columns.
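A minimal sketch of the staging approach (table names, column names, and widths are illustrative):

-- Staging table with generous column widths
CREATE COLUMN TABLE "MYTABLE_STAGE" (
"ID" NVARCHAR(100),
"DESCRIPTION" NVARCHAR(5000)
);

-- After loading the CSV into the staging table, move the rows into the
-- properly typed target table with explicit casts
INSERT INTO "MYTABLE"
SELECT CAST("ID" AS INTEGER), "DESCRIPTION" FROM "MYTABLE_STAGE";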

8. Manually inspect a sample of the CSV file and check the data format. If there are any dates in the CSV file, verify the date column format. By default, HANA expects dates to be in YYYY-MM-DD format. If the dates are in any other format, use DATE FORMAT in the import clause to specify it.

IMPORT FROM CSV FILE '/data/data.csv' INTO "MYTABLE"
WITH
RECORD DELIMITED BY '\n'
FIELD DELIMITED BY ','
DATE FORMAT 'MM/DD/YYYY';

See the SAP HANA SQL reference for the IMPORT statement for more details on DATE FORMAT.

9. Ensure that the data doesn't contain stray special characters such as unescaped double quotes ("). If special characters are part of the data set, clean the input data using sed scripts.

e.g., a description field may contain the value “Length of rod: 56″ and further description”. In this case, the ″ after 56 needs to be cleaned before the import will work.
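For example, a sed one-liner along these lines replaces an inch mark that follows a digit (the pattern is illustrative and must be adapted to the actual data; -i edits the file in place with GNU sed):

sed -i 's/\([0-9]\)"/\1 in./g' data.csv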

10. After the CSV import, check the table's runtime properties in HANA Studio and ensure that "Number of entries" looks correct. Verify the sizes of the main and delta storage. Perform a delta merge to move the data from delta to main storage and reduce memory usage.
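Both checks can also be done in SQL; a sketch, assuming the table name from the examples above:

-- Trigger a delta merge on the freshly loaded table
MERGE DELTA OF "MYTABLE";

-- Verify row count and memory usage of main and delta storage
SELECT TABLE_NAME, RECORD_COUNT, MEMORY_SIZE_IN_MAIN, MEMORY_SIZE_IN_DELTA
FROM M_CS_TABLES WHERE TABLE_NAME = 'MYTABLE';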

Any other suggestions are welcome.


      18 Comments
      Former Member

      If you can write XS server code, there's another option.

       

      Like you, I had large (1GB) CSV files to load but did not have direct access to the HANA machine (no FTP or shell access). Processing files from my workstation using ODBC or HANA Studio was much too slow.

       

      I ended up writing an XS service to process CSV files uploaded via HTTP. I created a drag-and-drop HTML page to initiate the upload. The combination is fast and gives complete control over the back-end processing.

       

      The key to making this work is to use PreparedStatement.executeBatch(). This little gem can commit a million rows of prepared insert statements in about a second.
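      A minimal XSJS sketch of the batching idea (the table, columns, and insertRows helper are illustrative assumptions, not the commenter's actual code):

      // Classic XS ($.db) batch insert into a two-column target table
      function insertRows(rows) {
          var conn = $.db.getConnection();
          var pstmt = conn.prepareStatement(
              'INSERT INTO "MYSCHEMA"."MYTABLE" VALUES (?, ?)');
          for (var i = 0; i < rows.length; i++) {
              pstmt.setString(1, rows[i][0]);
              pstmt.setString(2, rows[i][1]);
              pstmt.addBatch(); // queue the row instead of executing it now
          }
          pstmt.executeBatch(); // send all queued inserts in one call
          conn.commit();
          pstmt.close();
          conn.close();
      }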

       

      The main limitation is the size of an HTML upload.  For me, this occurred at around 30MB. So, I split my data into approximately 30 chunks of 1 million rows each.  The upload page can queue lots of uploads so I was able to drag all the files onto the page in one step.

       

      You can read more about it here: Mikey's Blog: Bulk loading HANA

      Former Member (Blog Post Author)

      Thanks Mikey for your comment. I hadn't considered this option. I will try this out. If I can upload a million rows in a second, that will be great.

       

      I am also getting concerned about the stability of the HANA server if you don't have enough space to load your data. I tried loading a 5GB data file on a 20GB Cloudshare HANA instance and it hung. I had to shut down and restart the HANA database.

       

      I was assuming twice the file size as my rule of thumb (for a 5GB file, 10GB of HANA storage), but apparently I need more space.

      John Appleby

      Covered this some while back in Best Practices for SAP HANA Data Loads and Inside SAP HANA - optimising data load performance and tuning

       

      If you want big data loads to fly, then make sure you increase the number of threads to the number of cores in your system, and set the batch to a good number. Usually a batch of 100000 is a good number, but you can tweak this to suit your data.

       

      IMPORT FROM '/hana/FACT_02_2010.ctl' WITH THREADS 40 BATCH 100000;

       

      Hope this helps.

      Former Member (Blog Post Author)

      Thanks John for your comments. Your original article was the inspiration for my article.

       

      I ran some sample tests based on your input.

       

      I had a table with 6 columns and a CSV file with 50 million rows. As I increased my batch size, my load performance improved significantly.

       

      No batch size - 1318 sec

      100,000 - 364 sec

      1,000,000 - 29 sec

      5,000,000 - 1.3 sec

       

      But when I increased the number of threads, my performance actually degraded. With the thread count increased to 4, my load performance with the same batch sizes was 3 to 4 times worse than without any threads. The import reference says that the number of threads specifies how many objects are loaded concurrently, so threads help when there are multiple objects to load. For a single table load, increasing threads brought no improvement.

       

      Keep in mind that I performed my tests on a 20GB, 4-core Cloudshare HANA appliance, so my results may not match an actual production HANA appliance.

       

      Maybe you will have more insight into these numbers.

       

      Thanks.

      John Appleby

      Try partitioning your table before the load. But a 4-core appliance may be part of the problem.

       

      John

      Former Member

      How does this work with a HANA Cloud Platform (HCP) instance? Is there a way to upload a large .csv file to the HCP instance?

      Former Member

      I figured out that there is no way to do that yet - the only way to import data into a cloud instance is to use HANA Studio (which is not yet enterprise-ready because of file size limitations [most likely due to memory limits] and, e.g., a buggy implementation within the "com.sap.ndb.studio.bi.filedataupload.ui.wizards" component).

      John Appleby

      I'm thinking that Matthias Steiner and friends may have some input here.

       

      What I have done in the past is use the SAP HANA XS Table Import feature to load data; I've tested that with CSV files of 1GB+ and it is quite effective.

       

      Data Provisioning Using Table Import - SAP HANA Developer Guide for SAP HANA Studio - SAP Library

      Former Member

      Could you provide a document with an example of how to do this?

      Lars Breddemann

      Former Member

      Hi John,

      When using the hdbsql client locally with the control file approach to load huge CSV data into the HANA database, invoking the batch file containing IMPORT FROM CONTROL FILE 'xxxxx.ctl' gives the error "Cannot open control file xxxxx.ctl".

      My control file and CSV are on my local computer. Do these files need to be on the HANA server for this to work, or is there another approach? I can upload through HANA Studio, but it fails when I upload a file larger than 4 MB.

      Please help.

      John Appleby

      Yes, they must be on the server.

      Former Member

      Hi John,

       

      I have 3 CSV files of approximately 30GB each. I can't upload them to the HANA server. The files are on an application server; at most, I can download them to my local machine. In this case, can I still use the control file approach? If not, what is recommended?

       

      thanks and best regards.

       

      Fahad

      John Appleby

      You can import from a local CSV, yes. It may be slow.

       

      SAP HANA Academy - Importing Data from CSV file - YouTube

      Former Member

      Hi John,

       

      Thanks for your quick reply. What if the CSV file is neither on the local machine nor on the HANA server? In that case, how can I load data into HANA from the remote CSV file? Can I put the path of the remote CSV file into the IMPORT FROM CSV FILE command?

       

      Best regards

       

      Fahad

      John Appleby

      There are four choices:

       

      1) Move the file to the HANA system

      2) Move the file to the system with HANA Studio

      3) Install HANA Studio on the 3rd system

      4) Mount the filesystem remotely to HANA Studio

      Hakeem Tolulope OKUSAGA

      Is there any limit to the size of the data that can be imported from a flat file?

      I get an internal error when I try to upload a large file.

      Daniel Klein

      The first two links in the blog post are not valid anymore, but the videos can still be found on YouTube: