Technical Articles
Sohit Tandon

Uploading S3 data into SAP HANA Cloud, Data Lake


In this blog, we will explore how to load data from an AWS S3 bucket into the SAP HANA Cloud, Data Lake. We will load a CSV file and a Parquet file.

Prerequisites

  • You should have the EXECUTE privilege on the SYSRDL#CG.REMOTE_EXECUTE procedure (see the example after this list).
  • You should have read access to the objects in the AWS S3 bucket.
  • The objects in the AWS S3 bucket should use CESU-8 character encoding.
  • You should know the column delimiter and the row delimiter of the data residing in the AWS S3 bucket.
  • You should also know the AWS Access Key ID and Secret Access Key.
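
If the EXECUTE privilege is missing, a user with the authority to grant privileges on the SYSRDL#CG schema objects can assign it. A minimal sketch, where LOAD_USER is a hypothetical database user name used only for illustration:

-- LOAD_USER is a placeholder; replace it with the database user
-- that will run the REMOTE_EXECUTE calls shown in this blog.
GRANT EXECUTE ON SYSRDL#CG.REMOTE_EXECUTE TO LOAD_USER;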

 

Let’s have a look at the AWS S3 bucket. It contains the CSV files that we will upload to the data lake.

AWS%20bucket%20with%20csv%20files

AWS bucket with csv files

Log into the SAP HANA Cloud console and open the SQL console. Create a table in the data lake with the same structure as the data in the Amazon S3 bucket.

CALL SYSRDL#CG.REMOTE_EXECUTE ('CREATE TABLE DEMO2
    (first_name VARCHAR(10),
     last_name VARCHAR(13),
     city VARCHAR(19),
     state VARCHAR(2),
     phone1 VARCHAR(12))');

Note that all commands inside REMOTE_EXECUTE have to be enclosed in single quotes.

Next, grant the SELECT privilege on the table to the data lake container role so that the table can be accessed from SAP HANA Cloud. In an SQL console connected to the SAP HANA Cloud database, execute:

CALL SYSRDL#CG.REMOTE_EXECUTE ('GRANT SELECT ON DEMO2 TO SYSRDL#CG_ROLE');

Now load the table using the LOAD TABLE command shown below.

CALL SYSRDL#CG.REMOTE_EXECUTE ('LOAD TABLE DEMO2
(first_name, last_name, city, state, phone1 ''\x0a'')
USING FILE ''s3://hscbucket/demo2.txt''
DELIMITED BY '',''
ACCESS_KEY_ID ''SAMPLEACCESSKEY''
SECRET_ACCESS_KEY ''samplesecretaccesskey''
REGION ''ap-southeast-1''
ESCAPES OFF
');

In the above example, the field delimiter is a comma (,) and the row delimiter is a newline ('\x0a'), which is specified after the last column in the column list.

We need to specify the location of the file in S3, the AWS region, the Access Key ID, and the Secret Access Key in the LOAD TABLE command.

Note that each single quotation mark inside the LOAD TABLE command has to be escaped with another single quote, because the entire command is passed to REMOTE_EXECUTE as one string literal.
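
As a small illustration of this quoting rule (the INSERT statement here is only an example, not part of the load flow), the inner statement INSERT INTO DEMO2 (first_name) VALUES ('John') is wrapped with every inner single quote doubled:

-- The data lake receives and executes: INSERT INTO DEMO2 (first_name) VALUES ('John')
CALL SYSRDL#CG.REMOTE_EXECUTE ('INSERT INTO DEMO2 (first_name) VALUES (''John'')');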

Here is a screenshot of the commands in the SQL editor.


SQL commands in the editor

 

The LOAD TABLE command can also load gzip-compressed files; the file in S3 must have the .gz extension to be recognized as compressed.
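
For example, a gzip-compressed copy of the same file could be loaded with an otherwise identical statement; demo2.txt.gz is an assumed object name used only for illustration:

CALL SYSRDL#CG.REMOTE_EXECUTE ('LOAD TABLE DEMO2
(first_name, last_name, city, state, phone1 ''\x0a'')
USING FILE ''s3://hscbucket/demo2.txt.gz''
DELIMITED BY '',''
ACCESS_KEY_ID ''SAMPLEACCESSKEY''
SECRET_ACCESS_KEY ''samplesecretaccesskey''
REGION ''ap-southeast-1''
ESCAPES OFF
');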

Now we will navigate to Remote Sources and choose SYSRDL#CG_SOURCE, which is our data lake. We will choose the schema SYSRDL#CG. There we can find the table DEMO2.


Searching the data lake schema

Now we will check the box next to the table and click Create Virtual Object to create a virtual table in SAP HANA Cloud. The dialog box below will open.

 


Creating Virtual Table in SAP HANA Cloud

 

Once the virtual table is created, we can navigate to Tables and see DEMO2 listed there.


Virtual table DEMO2 in DBADMIN schema

 

We can right-click the virtual table and preview the data.


Preview of data in the table
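
Alternatively, the virtual table can be queried directly from the SQL console. A minimal check, assuming the virtual table was created as DEMO2 in the DBADMIN schema as shown above:

-- Query the virtual table; the rows are fetched from the data lake through the remote source.
SELECT * FROM DBADMIN.DEMO2 LIMIT 10;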

 

Now we will load a Parquet file from the S3 bucket. Here is a screenshot of the Parquet file userdata1.parquet residing in my S3 bucket.

 


parquet file in the S3 bucket

 

First, let’s create a table in the data lake with the same schema as the Parquet file.

CALL SYSRDL#CG.REMOTE_EXECUTE ('CREATE TABLE PARQUET_TAB
    (C1 DATETIME,
     C2 INT,
     C3 VARCHAR(50),
     C4 VARCHAR(50),
     C5 VARCHAR(50),
     C6 VARCHAR(50),
     C7 VARCHAR(50),
     C8 VARCHAR(50),
     C9 VARCHAR(50),
     C10 VARCHAR(50),
     C11 DOUBLE,
     C12 VARCHAR(50),
     C13 VARCHAR(50))
');

 

Next, grant the SELECT privilege on the table to the data lake container role. In an SQL console connected to the SAP HANA Cloud database, execute:

CALL SYSRDL#CG.REMOTE_EXECUTE ('GRANT SELECT ON PARQUET_TAB TO SYSRDL#CG_ROLE');

Next, we will load the file using the LOAD TABLE command below. Note that we need to specify FORMAT parquet along with the location of the file in S3. The AWS region, Access Key ID, and Secret Access Key also need to be specified in the command.

CALL SYSRDL#CG.REMOTE_EXECUTE ('LOAD TABLE PARQUET_TAB (C1, C2, C3, C4, C5, C6, C7, C8, C9, C10, C11, C12, C13)
USING FILE ''s3://hscbucket/userdata1.parquet''
FORMAT parquet
ACCESS_KEY_ID ''SAMPLEACCESSKEYID''
SECRET_ACCESS_KEY ''Secret/Access+Key''
REGION ''ap-southeast-1''
ESCAPES OFF
');

Here is a screenshot of the commands executed in the SQL console.


SQL console screenshot

Navigate to Remote Sources in the Database Explorer and choose SYSRDL#CG_SOURCE. Then choose SYSRDL#CG as the schema, enter the name of the table in the Object field, and click Search.


Data lake schema search

 

The newly created table will appear. Now click the checkbox and then click Create Virtual Objects. The dialog box below will appear.

 


Creating virtual table in SAP HANA Cloud

 

Choose Create, and a virtual table will be created in the DBADMIN schema. You can navigate to the Tables section and select the table to see its schema.


PARQUET_TAB table schema

Now we can right-click and display the data that was loaded from the Parquet file.


Data preview of PARQUET_TAB table
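
As a quick sanity check, we can also count the loaded rows through the virtual table (assuming it was created as PARQUET_TAB in the DBADMIN schema, as shown above):

-- Row count of the data loaded from the Parquet file.
SELECT COUNT(*) FROM DBADMIN.PARQUET_TAB;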

 

Conclusion

The SAP HANA Cloud, Data Lake can easily ingest data from S3 using the LOAD TABLE command. The command is versatile enough to load both CSV and Parquet format files. It can also load compressed files and multiple tables in parallel.

Comments
      Ethan Zhang

      May I ask how to upload data to the HANA data lake from Google Cloud Storage? Is it supported yet?

      Mustafa KUTLU

      Yes, you can use the same LOAD TABLE statement to load data into a Data Lake Relational Engine table from your Google Cloud Storage bucket.

      Please see the document below for details:

      Loading Data from Google Cloud Storage