In this blog I will introduce how to import data from flat files to SAP HANA Cloud.
There are several methods you can use, depending on your requirements and the data size.
Overall, I would recommend using DBeaver to manually import CSV files up to ~1GB.
Use the “IMPORT FROM” SQL statement to import larger files from AWS or Azure file storage.
If you need to import files other than CSV or to pre-process your data, Python is the most versatile tool.
If you need to import many files based on a pattern and/or at repeated intervals, use the SAP HANA Smart Data Integration File adapter.
1. SAP Business Application Studio / SAP Web IDE
As mentioned in this blog, when the data import is connected to modelling tasks in SAP Web IDE Full-Stack or SAP Business Application Studio, you can insert small test files of up to 10MB directly into your database projects. Place the files you want to import into the design-time definition of the HDI container and define the import via an .hdbtabledata HDI artefact.
This approach only supports .csv files.
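For illustration, a minimal .hdbtabledata artefact could look like the sketch below. The table, file, and column names are hypothetical placeholders, not taken from a real project:

```json
{
  "format_version": 1,
  "imports": [
    {
      "target_table": "NYCFT",
      "source_data": {
        "data_type": "CSV",
        "file_name": "NYCFT.csv",
        "has_header": true
      },
      "import_settings": {
        "import_columns": ["ID", "NAME"]
      }
    }
  ]
}
```

The CSV file itself is placed next to this artefact in the HDI container's design-time folder, and the import runs on deployment.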
Take care when working with Git: most Git repositories have size limits. It is not good practice to check large files into a Git repository, nor to put them into a project in SAP Web IDE.
2. SAP HANA Database Explorer
You can use the data import function of the SAP HANA Database Explorer to import CSV files of up to 200MB from your local PC. From this UI, you can also import CSV and Parquet files from Amazon S3 and Microsoft Azure.
Access the database explorer from your SAP HANA Cloud instance on the SAP Cloud Platform Cockpit.
Within the Database Explorer, you can access all your HDI containers. Right-click on the database or HDI container into which you want to import data, and select “Import Data”.
The “Import Data” wizard will open up. You can select the CSV file, then your import target table.
If you want to import files directly from Amazon S3 or Azure, you need to register the IaaS certificate into SAP HANA Cloud first. For details, refer to section ‘5. “IMPORT FROM” SQL statement’ of this blog.
If you choose Amazon S3 as the import source in the UI, you need to set the S3 region and the S3 path, which is made up of your IAM user's access key and secret key, your bucket name, and the object ID. Just enter them as requested by the UI.
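As a sketch, the S3 source path used by the wizard (and later by the “IMPORT FROM” statement) can be assembled like this. All values below are placeholders, not real credentials:

```python
# Assemble the S3 source path expected by SAP HANA Cloud.
# Region, keys, bucket, and object name are all placeholders.
region = "eu-central-1"
access_key = "MY_ACCESS_KEY"
secret_key = "MY_SECRET_KEY"
bucket_name = "my-bucket"
object_id = "NYCFT.csv"

s3_path = f"s3-{region}://{access_key}:{secret_key}@{bucket_name}/{object_id}"
print(s3_path)
```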
Select the table where you want to import data.
In the import options, you can choose the file format (CSV, Parquet), whether to use the first row as column names, the delimiters, and the date formats.
Set the error handling, and you can start importing your file!
3. DBeaver
If you use SAP HANA Cloud as a data warehouse, I definitely recommend DBeaver to manage your data. DBeaver has a very intuitive data import function that lets users import data from CSV files.
I was able to import a 100MB CSV file from my client to SAP HANA Cloud in 4 minutes with DBeaver.
As a reference, importing that same file to a SAP HANA 2.0 SPS05 instance located on my local network takes 12 seconds with DBeaver.
4. Python
If you need to pre-process data before inserting it into SAP HANA Cloud, Python offers a powerful and flexible environment.
In the notebook, start by importing the hana_ml library,
establish the connection to SAP HANA Cloud,
read your CSV file with pandas,
create a dataframe in the SAP HANA Cloud database containing your data,
and finally close the connection.
Here is the Python code I used. The hana_ml dataframe allows developers to work with data in SAP HANA as they would with a pandas dataframe. This means you can prepare the data to fit your needs. Learn more about an actual use case in this blog by Andreas.
!pip install hana_ml
import hana_ml
print(hana_ml.__version__)

import hana_ml.dataframe as dataframe

# Instantiate the connection object
conn = dataframe.ConnectionContext(address='hostname.hanacloud.ondemand.com',
                                   port=443,
                                   user='user',
                                   password='password',
                                   encrypt='true')

# Send a basic SELECT statement and display the result
sql = 'SELECT 12345 FROM DUMMY'
df_remote = conn.sql(sql)
print(df_remote.collect())

# Read the CSV file with pandas
import pandas as pd
df_data = pd.read_csv('NYCFT.csv', sep=',')
df_data.head(5)

# Create a table in the SAP HANA Cloud database containing the data
df_remote = dataframe.create_dataframe_from_pandas(connection_context=conn,
                                                   pandas_df=df_data,
                                                   table_name='NYCFT_PYTHON',
                                                   force=True,
                                                   replace=False)

# Close the connection
conn.close()
The data import finished in 9 minutes. This confirms the results of Lars Breddermann that DBeaver is around twice as fast as Python regarding data import to SAP HANA Cloud.
I used the hana_ml library to import data the simplest way I could, but you can also use other libraries such as SQL Alchemy.
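For illustration, a SQLAlchemy-style connection string could be built as follows. This sketch assumes the sqlalchemy-hana dialect is installed; the host and credentials are placeholders for your own instance:

```python
# Build a SQLAlchemy connection URL for SAP HANA Cloud.
# Host, user, and password are placeholders.
host = "hostname.hanacloud.ondemand.com"
user = "user"
password = "password"

url = f"hana://{user}:{password}@{host}:443"
print(url)

# With sqlalchemy and sqlalchemy-hana installed, you could then run, e.g.:
# engine = sqlalchemy.create_engine(url)
# df_data.to_sql("NYCFT_SQLA", engine, if_exists="replace", index=False)
```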
5. “IMPORT FROM” SQL statement
Finally, if you have a very large dataset, the fastest way to import data to HANA Cloud is to use the built-in “IMPORT FROM” SQL statement.
Follow the documentation to enable data import directly from your IaaS platform. For this guide, I used AWS.
First, you will need an AWS account with S3 enabled, and your files must be uploaded to S3. I used the same file as in the DBeaver and Python examples to compare performance.
In the IAM section, create a user with API access. You will receive an Access key ID and Secret access key.
Grant the necessary access permissions to your IAM user. I granted the AmazonS3ReadOnlyAccess policy to my user, which gives access to all files on S3 for my account. You can also give a more granular access if necessary.
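As an example of more granular access, a read-only policy scoped to a single bucket could look like the sketch below (the bucket name is a placeholder):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::my-bucket",
        "arn:aws:s3:::my-bucket/*"
      ]
    }
  ]
}
```

Attaching a policy like this to the IAM user restricts the import to that one bucket instead of all files on S3 for the account.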
REST API-based adapters communicate with the endpoint through an HTTP client. To ensure secure communication, the adapters require client certificates from the endpoint services. Connections to an SAP HANA Cloud instance, which are based on the hanaodbc adapter, also require an SSL certificate.
Copy each root certificate text and create a certificate from this text which can be saved in a personal security environment (PSE).
Execute these SQL statements against your SAP HANA Cloud database:
- Create a PSE.
CREATE PSE HTTPS;
- Create a certificate.
CREATE CERTIFICATE FROM ' -----BEGIN CERTIFICATE----- ... -----END CERTIFICATE----- ' COMMENT <comment>;
The created certificate is now contained in the CERTIFICATES system view and can be added to a PSE store.
- Add the certificate created above to the PSE:
SELECT CERTIFICATE_ID FROM CERTIFICATES WHERE COMMENT = '<comment>';
ALTER PSE <pse_name> ADD CERTIFICATE <certificate_id>;
- Set the purpose of the newly created PSE to REMOTE SOURCE:
SET PSE <pse_name> PURPOSE REMOTE SOURCE;
This makes the usage of the certificates in the PSE store available to all remote sources.
You can then run the import with a single statement:
IMPORT FROM CSV FILE 's3-<region>://<access_key>:<secret_key>@<bucket_name>/<object_id>.csv' INTO <TARGET_TABLE_NAME> [with options];
Useful options include THREADS <n>, BATCH <n>, SKIP FIRST <n> ROW, and FAIL ON INVALID DATA.
For my 100MB CSV file, the command took 34 seconds to execute.
6. SAP HANA SDI File adapter
The SAP HANA SDI File adapter is particularly useful for these use cases:
- SharePoint access
- SharePoint on Office365
- Pattern-based reading: reading multiple files in a directory that match a user-defined pattern
- Real-time file replication (only APPEND)
In conclusion, each file import method has its merits depending on your use case. Choose the one that fits your needs.
Thank you for reading,
Special thanks to Daisuke Ikeguchi for his support in writing this blog.