Technical Articles
Scripting import of multiple files into SAP HANA Cloud from a cloud storage (Amazon S3)
Being a cloud-native product, SAP HANA Cloud does not give you access to the underlying OS shell and file system. Therefore it is not possible to load a file from the database server as you can with on-premise SAP HANA. But through its broad data integration capabilities, SAP HANA Cloud lets you import and export data using popular cloud storage services.
In this post I show you the steps I took to import TPC-DS data files into tables in a regular schema of my trial SAP HANA Cloud database, and share some of the gotchas. My database instance is hosted in the eu10 region of SAP Business Technology Platform (BTP), where the infrastructure is provided by Amazon Web Services. So, I use Amazon Simple Storage Service (S3) as the cloud storage in this exercise.
At the time of writing this post the Object Store service is only available in SAP BTP productive accounts, and I plan to return to that service another time. So, here I will use the S3 service directly from AWS. Currently S3 is included in an AWS account's free tier for 12 months after account creation.
Before we move on, just a few words about the dataset and why I chose it.
What is TPC-DS?
The TPC Benchmark DS (TPC-DS) is "a decision support benchmark that models several generally applicable aspects of a decision support system". While I am not going to use it for any benchmarking (which would not even make sense in the trial environment), I liked the fact that it provides the snowflake-modeled schema of a retail business's decision system. It is much closer to a real customer case than most of the single-file datasets out there.
Store Sales Snow Flake Schema (source: The Making of TPC-DS)
Instead of providing a data set, it provides two generators:
- the data generator dsdgen, which I used here,
- the query generator dsqgen, which could be nice if you want to practice knowledge gained during "The First Step Towards SAP HANA Query Optimization".
What's more, the data generator allows you to create datasets of different sizes, starting from 1GB of raw data. So, it is a volume that is nice to work with. I plan to play with it after this exercise too.
The flow of the exercise…
…in general:
- Generate TPC-DS data files using the dsdgen program.
- Move the files to an S3 bucket using the aws client.
- Import data from the files into the SAP HANA Cloud database using the hdbsql client.
…and required tools
To accomplish my goal I am using the command line more than UIs, to keep this text compact and to script repetitive tasks. This is more efficient when working with more files and bigger data, like in this exercise.
If you are looking for a general introduction to the import of data into SAP HANA Cloud, then I would recommend starting with:
While all required command-line tools can be installed and used from any OS shell, here I am going to use the Linux-based AWS CloudShell service, mostly for performance reasons: the generated files are moved within the same region of the same cloud infrastructure.
AWS CloudShell is a free service, but with only 1GB of persistent storage, a limitation I needed to take into consideration when generating files.
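If you want to see how much of that persistent storage is already used in your session, a quick check (plain Linux, nothing CloudShell-specific) is:
# Show size, usage, and free space of the filesystem backing the home directory
df -h ~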
At the moment CloudShell is not available in the AWS Frankfurt region eu-central-1, so I had to open it in eu-west-1.
The code repository
All code files are available in the SAP Tech Bytes repository: https://github.com/SAP-samples/sap-tech-bytes/tree/2021-02-22-saphanacloud-db-import-tpcds-datafiles-from-s3.
You can create the files by following this post, copy their code from the links provided in the post, or clone the repository:
git clone https://github.com/SAP-samples/sap-tech-bytes.git \
--branch 2021-02-22-saphanacloud-db-import-tpcds-datafiles-from-s3 \
~/Projects/saphc-hdb-import-tpcds
So, let’s go.
Create a file with project’s environment variables
Create a project folder ~/Projects/saphc-hdb-import-tpcds:
mkdir -p ~/Projects/saphc-hdb-import-tpcds
cd ~/Projects/saphc-hdb-import-tpcds
To keep it consistent between different commands and scripts, and to make it customizable for your needs, let's create a file ~/Projects/saphc-hdb-import-tpcds/env_tpcds.sh to store and configure environment variables.
# Local working folders
myTemp_folder=~/tmp
myProjects_folder=~/Projects
# TPC-DS Kit location and scale factor (1 = ~1GB of raw data)
tpcds_kit_folder=${myProjects_folder}/tpcds-kit
tpcds_kit_tools_folder=${tpcds_kit_folder}/tools
tpcds_scale_factor=1
# Local staging folder and the S3 object prefix, e.g. sf0001/data
datasets_root_folder=~/Datasets
prefixObj=sf$(printf "%04d" ${tpcds_scale_factor})/data
# AWS region, globally unique S3 bucket name, and the read-only IAM user with its key file
aws_region=eu-central-1
s3bucket=tpcds4hdbc
aws_ro_s3user=S3TPC-reader
aws_key_file_folder=${myTemp_folder}
aws_key_file=${aws_ro_s3user}-access-key.json
# Fixed identity for hdbuserstore and the SAP HANA Cloud database endpoint
HDB_USE_IDENT=AWSCloudShell
saphc_hdb_endpoint=8e1a286a-21d7-404d-8d7a-8c77d2a77050.hana.trial-eu10.hanacloud.ondemand.com:443
Modify the file to set your own:
- SAP HANA Cloud database endpoint in the saphc_hdb_endpoint variable, and
- globally unique S3 bucket name in the s3bucket variable (tpcds4hdbc in my case).
Gotchas:
- Only the content of the user's home directory is persisted between your AWS CloudShell sessions, so take this into consideration when setting up the folder structure.
- AWS CloudShell virtual machines change their internal IP addresses and therefore their host names. To prevent issues with SAP HANA Clients, the HDB_USE_IDENT environment variable is set.
- I have experienced AWS CloudShell resetting the user session once in a while, which causes the work folder and environment variables to be reset. You will find me adding source ~/Projects/saphc-hdb-import-tpcds/env_tpcds.sh at the beginning of every block of OS commands to make sure they are executed properly (an optional way to automate this is sketched right after this list).
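As an optional convenience (my own habit, not part of the scripts in the repository), you could also append the source command to ~/.bashrc, which lives in the persisted home directory, so the variables are loaded automatically in every new CloudShell session:
# Append once; every new shell will then source the project's environment variables
echo 'source ~/Projects/saphc-hdb-import-tpcds/env_tpcds.sh' >> ~/.bashrc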
Generate TPC-DS data files and store them in S3
I used the TPC-DS generator from https://github.com/gregrahn/tpcds-kit, because it can be compiled on macOS too (but that's a different story).
In AWS CloudShell run the following commands.
sudo yum install --assumeyes gcc make flex bison byacc git
source ~/Projects/saphc-hdb-import-tpcds/env_tpcds.sh
git clone https://github.com/gregrahn/tpcds-kit.git ${tpcds_kit_folder}
cd ${tpcds_kit_folder}/tools
make OS=LINUX
ls -l ds*
The make command will produce a lot of warnings, but it was OK to ignore them and move on.
Now you should have the TPC-DS data generator dsdgen and the query generator dsqgen available locally.
If executed without any parameters, ./dsdgen will generate the smallest sample (1GB in size) with all data files in the current directory. The -SCALE option allows generating bigger samples, but even with 1GB we face the limitation of the CloudShell persistent storage.
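To get a feel for dsdgen before running the full script below, you could generate just one small table into the temporary folder; a quick sketch using the same flags as the script that follows (call_center chosen only because it is tiny):
source ~/Projects/saphc-hdb-import-tpcds/env_tpcds.sh
cd ${tpcds_kit_tools_folder}
mkdir -p ${myTemp_folder}
# Generate only the call_center table at the configured scale factor into ~/tmp
./dsdgen -QUIET Y -FORCE -TABLE call_center -SCALE ${tpcds_scale_factor} -DIR ${myTemp_folder}
ls -lh ${myTemp_folder}/call_center.dat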
The approach I took was to generate a data file for each table listed in tpcds.sql and to move (not copy) it to the S3 bucket defined in the s3bucket variable in the file env_tpcds.sh. Remember to choose your own globally unique name for your S3 bucket!
I want the bucket to be close to my SAP HANA Cloud database instance, so I used the flag --region ${aws_region}, where I set this variable to eu-central-1 in env_tpcds.sh.
cd ~/Projects/saphc-hdb-import-tpcds/
source ~/Projects/saphc-hdb-import-tpcds/env_tpcds.sh
aws s3 mb s3://${s3bucket} --region ${aws_region}
aws s3 ls ${s3bucket} --summarize
To script the creation and the movement of data files to the S3 bucket, I created an executable Bash script ~/Projects/saphc-hdb-import-tpcds/gen_tpcds_s3_init.sh with the following code…
#!/bin/bash
set -e

# Load the project's environment variables from the script's own directory
source ${0%/*}/env_tpcds.sh
log_file=${myTemp_folder}/${0##*/}__$(date +%Y%m%d%H%M).log

if [ ! -f ${tpcds_kit_tools_folder}/dsdgen ]; then
  echo "You should have dsdgen installed and the environment configured in env_tpcds.sh!"
  exit 1
fi

datasets_tpcds_files_folder=${datasets_root_folder}/tpcds-data/${prefixObj}/init
echo ${datasets_tpcds_files_folder}
mkdir -p ${datasets_tpcds_files_folder}
mkdir -p ${myTemp_folder}
touch ${log_file}

# Table names from tpcds.sql; *_returns tables are skipped because dsdgen generates them together with the corresponding *_sales tables
tables=$(grep table ${tpcds_kit_tools_folder}/tpcds.sql | grep --invert-match returns | cut -d' ' -f3)

for t in ${tables}
do
  echo "Working on the table: "$t | tee -a ${log_file}
  # Generate the data file(s) for the table at the configured scale factor
  ${tpcds_kit_tools_folder}/dsdgen -QUIET Y -FORCE \
    -TABLE $t -SCALE ${tpcds_scale_factor} \
    -DIR ${datasets_tpcds_files_folder} -DISTRIBUTIONS ${tpcds_kit_tools_folder}/tpcds.idx
  # Log the tail and line count of each generated file, then rename .dat to .csv
  for f in ${datasets_tpcds_files_folder}/*.dat; do
    tail -3 $f >> ${log_file}
    echo -e "$(wc -l $f) \n" >> ${log_file}
    mv -- "$f" "${f%.dat}.csv"
  done
  # Move (not copy) the files to the S3 bucket to stay within CloudShell's 1GB persistent storage
  aws s3 mv --recursive ${datasets_root_folder}/tpcds-data s3://${s3bucket}
done
…and execute it.
Let's check the generated files and their total size in the bucket.
source ~/Projects/saphc-hdb-import-tpcds/env_tpcds.sh
aws s3 ls s3://${s3bucket} --recursive --human-readable --summarize
There are 25 files generated — one for each target database table.
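As a cross-check, the number of tables defined in tpcds.sql should match; the grep below mirrors the one used in the generator script, so the expected result is 25.
source ~/Projects/saphc-hdb-import-tpcds/env_tpcds.sh
# Count the "create table" statements in tpcds.sql
grep -ci "create table" ${tpcds_kit_tools_folder}/tpcds.sql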
Create a read-only user for the S3 bucket
SAP HANA Cloud will use S3's REST API to get the data from the bucket. This requires an AWS user with an access key and the necessary permissions.
It is a good idea to use an AWS user with minimal permissions, allowed only to read objects from the bucket.
Let's create a template file S3TPC-reader-inline-policy-template.json following the example from the AWS documentation, but allowing only read access to the bucket…
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ListObjectsInBucket",
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::${s3bucket}"
    },
    {
      "Sid": "AllowObjectRead",
      "Effect": "Allow",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::${s3bucket}/*"
    }
  ]
}
…and a script create_ro_s3user.sh to create an AWS user (S3TPC-reader) and its programmatic access key for authentication:
#!/bin/bash

# Load the project's environment variables from the script's own directory
source ${0%/*}/env_tpcds.sh
log_file=${myTemp_folder}/${0##*/}__$(date +%Y%m%d%H%M).log
mkdir -p ${myTemp_folder}

# Substitute the ${s3bucket} placeholder in the policy template with the real bucket name
inline_policy_template=$(cat S3TPC-reader-inline-policy-template.json)
echo ${inline_policy_template//'${s3bucket}'/${s3bucket}} > ${myTemp_folder}/${aws_ro_s3user}-inline-policy.json

# Create the IAM user and attach the read-only inline policy
aws iam create-user --user-name ${aws_ro_s3user}
aws iam put-user-policy --user-name ${aws_ro_s3user} \
 --policy-name inline_S3ReadTPCDSbucket \
 --policy-document file://${myTemp_folder}/${aws_ro_s3user}-inline-policy.json

# Create the programmatic access key and save the JSON response for later use by the import script
aws iam create-access-key --user-name ${aws_ro_s3user} > ${myTemp_folder}/${aws_ro_s3user}-access-key.json
cat ${myTemp_folder}/${aws_ro_s3user}-access-key.json
A copy of the access key is written to a file in the temporary directory; it will be read during the execution of the SQL statements to provide the user's key and secret values.
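Before moving on, you can quickly verify that the key file is in place and parses correctly; the jq expression below is the same one used later by the import script (only the key id is printed, the secret stays in the file).
source ~/Projects/saphc-hdb-import-tpcds/env_tpcds.sh
# Print the access key id from the saved JSON response
jq -r ".AccessKey.AccessKeyId" ${aws_key_file_folder}/${aws_key_file}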
Import the data from S3 bucket into SAP HANA Cloud tables
The following steps could be done from SAP HANA Database Explorer or any other SQL client supporting SAP HANA Cloud. I will use the command-line hdbsql within AWS CloudShell.
For this activity we need to do the following steps:
- Install SAP HANA Clients
- Create a database schema and tables to store the data
- Set host certificates required by SAP HANA
- Import data into tables
Install SAP HANA Clients
wget --no-cookies \
--header "Cookie: eula_3_1_agreed=tools.hana.ondemand.com/developer-license-3_1.txt" \
"https://tools.hana.ondemand.com/additional/hanaclient-latest-linux-x64.tar.gz" \
-P /tmp
tar -xvf /tmp/hanaclient-latest-linux-x64.tar.gz -C /tmp/
source ~/Projects/saphc-hdb-import-tpcds/env_tpcds.sh
/tmp/client/hdbinst --batch --hostname=${HDB_USE_IDENT}
SAP HANA Clients should be installed into the default directory ~/sap/hdbclient thanks to the flag --batch.
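A quick way to confirm the installation landed in the expected location is simply to list the two executables used in the next steps:
# Both binaries should exist in the default client installation directory
ls -l ~/sap/hdbclient/hdbsql ~/sap/hdbclient/hdbuserstore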
For a general introduction to hdbsql I would recommend starting with these tutorials:
Add your database user to the SAP HANA Clients user store
I will use the flag -i to interactively provide the password of my DBAdmin database user.
source ~/Projects/saphc-hdb-import-tpcds/env_tpcds.sh
~/sap/hdbclient/hdbuserstore -i -H ${HDB_USE_IDENT} \
SET HANACloudTrial_DBAdmin ${saphc_hdb_endpoint} DBAdmin
Let’s check if we can connect to the database with this user key.
~/sap/hdbclient/hdbsql -U HANACloudTrial_DBAdmin \
"SELECT Current_User FROM DUMMY"
Gotchas:
- As mentioned above, the use of -H ${HDB_USE_IDENT} in hdbuserstore will prevent losing database user keys between AWS CloudShell sessions, as the internal host name changes with each session (the stored keys can be listed as shown below this list).
- If you have problems connecting, then check my articles Secure connection from HDBSQL to SAP HANA Cloud and Allow connections to SAP HANA Cloud instance from selected IP addresses.
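To list the keys currently stored under the fixed identity, the standard LIST command of hdbuserstore can be used with the same -H flag; it prints the key name, endpoint, and user, but not the password.
source ~/Projects/saphc-hdb-import-tpcds/env_tpcds.sh
~/sap/hdbclient/hdbuserstore -H ${HDB_USE_IDENT} LIST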
Create a schema and tables
The TPC-DS Kit provides a file tpcds-kit/tools/tpcds.sql with ANSI-compliant SQL statements to create all required tables. I'll create a dedicated schema TPCDS for them.
Log on to the database using hdbsql and the database user key created above.
~/sap/hdbclient/hdbsql -A -U HANACloudTrial_DBAdmin
…and execute the following statements.
CREATE SCHEMA TPCDS;
SET SCHEMA TPCDS;
\input "/home/cloudshell-user/Projects/tpcds-kit/tools/tpcds.sql"
SELECT TABLE_NAME FROM TABLES WHERE SCHEMA_NAME='TPCDS';
Set up certificates for the S3 remote source in SAP HANA Cloud
The setup of trust follows the steps described in https://help.sap.com/viewer/c82f8d6a84c147f8b78bf6416dae7290/2020_04_QRC/en-US/342122ab55684a20a2e076e5bee4109c.html.
While still in the hdbsql session connected to the database, check if a certificate collection (also called PSE, or Personal Security Environment) with the REMOTE SOURCE purpose already exists.
\al off
SELECT * FROM PSES WHERE PURPOSE='REMOTE SOURCE';
If not, then let's create it with the name TLS_SOURCES. But if it already exists, then skip the step of the PSE creation and replace the PSE name TLS_SOURCES in the SQL statements below with the one from your database instance.
CREATE PSE TLS_SOURCES;
SET PSE TLS_SOURCES PURPOSE REMOTE SOURCE;
Gotchas:
- Only one collection with an unqualified purpose may exist. For more details see: https://help.sap.com/viewer/c82f8d6a84c147f8b78bf6416dae7290/2020_04_QRC/en-US/75d0cfec8e4f44c3a649d26e9cefa314.html.
Exit the hdbsql session to return to the OS shell.
\quit
Now we need to create an entry for the public certificate of the virtual-hosted-style URI of the S3 bucket, which is https://tpcds4hdbc.s3.eu-central-1.amazonaws.com in my case.
To do this programmatically, install openssl with yum in AWS CloudShell.
sudo yum install --assumeyes openssl
…and use it to retrieve the certificate for SSL access to HTTPS port 443.
source ~/Projects/saphc-hdb-import-tpcds/env_tpcds.sh
openssl x509 -in <(openssl s_client -connect ${s3bucket}.s3.${aws_region}.amazonaws.com:443 -prexit 2>/dev/null)
source ~/Projects/saphc-hdb-import-tpcds/env_tpcds.sh
~/sap/hdbclient/hdbsql -U HANACloudTrial_DBAdmin \
"CREATE CERTIFICATE FROM '$(openssl x509 -in <(openssl s_client -connect ${s3bucket}.s3.${aws_region}.amazonaws.com:443 -prexit 2>/dev/null))' COMMENT 'S3-${aws_region}';"
~/sap/hdbclient/hdbsql -A -U HANACloudTrial_DBAdmin \
"SELECT CERTIFICATE_ID, SUBJECT_COMMON_NAME, COMMENT FROM CERTIFICATES WHERE COMMENT='S3-${aws_region}';"
Now that the certificate is created in the SAP HANA Cloud database, let's add it to the certificate collection TLS_SOURCES (or to your existing certificate collection with the REMOTE SOURCE purpose).
source ~/Projects/saphc-hdb-import-tpcds/env_tpcds.sh
export s3CertID=$(~/sap/hdbclient/hdbsql -U HANACloudTrial_DBAdmin -xa "SELECT CERTIFICATE_ID FROM CERTIFICATES WHERE COMMENT='S3-${aws_region}';")
~/sap/hdbclient/hdbsql -U HANACloudTrial_DBAdmin "ALTER PSE TLS_SOURCES ADD CERTIFICATE ${s3CertID};"
~/sap/hdbclient/hdbsql -U HANACloudTrial_DBAdmin -A "SELECT PSE_NAME, CERTIFICATE_ID, SUBJECT_NAME FROM PSE_CERTIFICATES WHERE PSE_NAME='TLS_SOURCES';"
While I promised we would focus on working with the command line, it is good to know that the same results can be achieved from SAP HANA Cloud Cockpit.
Gotchas:
- It is important to note that this certificate expires on September 1st of this year (i.e. 2021). Certificates should be periodically checked and updated in real systems to keep applications running (a quick check is sketched below)!
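A sketch of such a check, reading the validity window from the CERTIFICATES system view (column names as documented for that view):
source ~/Projects/saphc-hdb-import-tpcds/env_tpcds.sh
# Show the validity period of the imported S3 certificate
~/sap/hdbclient/hdbsql -A -U HANACloudTrial_DBAdmin \
 "SELECT SUBJECT_COMMON_NAME, VALID_FROM, VALID_UNTIL FROM CERTIFICATES WHERE COMMENT='S3-${aws_region}';"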
Import data from files in S3 bucket into SAP HANA Cloud tables
To script the import of the 25 data files into the corresponding database tables, I created an executable Bash script ~/Projects/saphc-hdb-import-tpcds/imp_s3_hdbc_init.sh with the following code…
#!/bin/bash
set -e

# Load the project's environment variables from the script's own directory
source ${0%/*}/env_tpcds.sh
log_file=${myTemp_folder}/${0##*/}__$(date +%Y%m%d%H%M).log

if [ ! -f ~/sap/hdbclient/hdbsql ]; then
  echo "You should have ~/sap/hdbclient/hdbsql installed!"
  exit 1
fi

# Read the AWS access key id and secret from the JSON file saved by create_ro_s3user.sh
keyid=$(jq -r ".AccessKey.AccessKeyId" ${aws_key_file_folder}/${aws_key_file})
keysecret=$(jq -r ".AccessKey.SecretAccessKey" ${aws_key_file_folder}/${aws_key_file})

# All 25 table names from tpcds.sql (this time including the *_returns tables)
tables=$(grep table ${tpcds_kit_tools_folder}/tpcds.sql | cut -d' ' -f3)

for t in ${tables}
do
  echo -n "Inserting into $t"
  query="IMPORT FROM CSV FILE 's3-${aws_region}://${keyid}:${keysecret}@${s3bucket}/${prefixObj}/init/${t}.csv' INTO TPCDS.${t} WITH FIELD DELIMITED BY '|' THREADS 2 FAIL ON INVALID DATA;"
  # Uncomment to log the IMPORT statement with the secret masked
  # echo ${query//${keysecret}/'***'}
  # Truncate the target table, run the import, and log the output
  ~/sap/hdbclient/hdbsql -U HANACloudTrial_DBAdmin -f "TRUNCATE TABLE TPCDS.${t}" >> ${log_file}
  ~/sap/hdbclient/hdbsql -U HANACloudTrial_DBAdmin -f "${query}" >> ${log_file}
  echo ": " $(~/sap/hdbclient/hdbsql -U HANACloudTrial_DBAdmin -xa "SELECT COUNT(1) FROM TPCDS.${t}") " records"
done
…and execute it.
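Besides the per-table counts printed by the script, a single statement can summarize what landed in the schema; here is a sketch using the M_TABLES monitoring view (my choice of view, not part of the original script):
source ~/Projects/saphc-hdb-import-tpcds/env_tpcds.sh
# List record counts of all tables in the TPCDS schema
~/sap/hdbclient/hdbsql -A -U HANACloudTrial_DBAdmin \
 "SELECT TABLE_NAME, RECORD_COUNT FROM M_TABLES WHERE SCHEMA_NAME='TPCDS' ORDER BY RECORD_COUNT DESC;"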
For this step I also described an alternative approach: Import of multiple files into SAP HANA Cloud from a cloud storage (Amazon S3) using SAP Data Intelligence 3.1.
Bingo! The data is loaded and we can run some sample queries, like those generated by TPC-DS…
SET SCHEMA TPCDS;
SELECT TOP 10
i_brand_id AS brand_id,
i_brand AS brand,
i_manufact_id,
i_manufact,
d_year,
sum(ss_ext_sales_price) AS ext_price
FROM
date_dim,
store_sales,
item,
customer,
customer_address,
store
WHERE d_date_sk = ss_sold_date_sk
AND ss_item_sk = i_item_sk
AND i_manager_id = 8
AND d_moy = 11
AND ss_customer_sk = c_customer_sk
AND c_current_addr_sk = ca_address_sk
AND substr( ca_zip, 1, 5 ) <> substr( s_zip, 1, 5 )
AND ss_store_sk = s_store_sk
GROUP BY
i_brand,
i_brand_id,
i_manufact_id,
i_manufact,
d_year
ORDER BY
ext_price DESC,
d_year DESC,
i_brand ASC,
i_brand_id ASC,
i_manufact_id ASC,
i_manufact ASC;
…but that's an exercise for another time!
FYI, it is possible to import data from S3 buckets directly into the Relational Data Lake of SAP HANA Cloud, as described by Sohit Tandon in Uploading S3 data into SAP HANA Cloud, Data Lake.
Till next episode then,
-Vitaliy aka @Sygyzmundovych
Hi Vitaliy,
Nice article and write-up on TPC-DS. I was able to follow it till the end.
For the dsqgen input, I am not able to find a query template list for HANA. The only options available in the kit were for db2, netezza, oracle and ansi.
Hello and thank you for trying it out.
Yes, some queries I tried were generated using the ansi option, as SAP HANA is ANSI compliant: https://help.sap.com/viewer/4fe29514fd584807ac9f2a04f6754767/2.0.05/en-US/09940a1ed38c4ceb80770632997b286b.html
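For reference, a dsqgen call along the lines of the kit's documentation (paths and options as in gregrahn/tpcds-kit; adjust them to your folders) would look like this:
source ~/Projects/saphc-hdb-import-tpcds/env_tpcds.sh
cd ${tpcds_kit_tools_folder}
mkdir -p ${myTemp_folder}/queries
# Generate queries from the ANSI templates shipped with the kit
./dsqgen -DIRECTORY ../query_templates -INPUT ../query_templates/templates.lst \
 -DIALECT ansi -SCALE ${tpcds_scale_factor} -OUTPUT_DIR ${myTemp_folder}/queries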
But then playing with these queries and tuning them may require a separate exercise 🙂
Best regards.
Thanks Witalij,
Around 10 queries had to be fixed by using the add_days function in HANA instead of the + symbol. There are still 3 queries that have errors with the select ... grouping statement. Not sure how to rewrite those queries.
Regards.
Hi Witalij,
Great tutorial!
I am stuck on the last step: importing files from S3 to HANA.
[cloudshell-user@ip-10-0-12-190 ~]$ ~/Projects/saphc-hdb-import-tpcds/imp_s3_hdbc_init.sh
Logging to a file /home/cloudshell-user/tmp/imp_s3_hdbc_init.sh__202107020905.log
Inserting into dbgen_version* 2: general error: Invalid path format SQLSTATE: HY000
I am starting to investigate, but maybe you already have an idea for the solution.
Regards,
Leszek
The way I read it is that the path of the input file is incorrect.
Do you have any more details in the file /home/cloudshell-user/tmp/imp_s3_hdbc_init.sh__202107020905.log?
Here is the content of the log file:
TRUNCATE TABLE TPCDS.dbgen_version
0 rows affected (overall time 207.520 msec; server time 4787 usec)
IMPORT FROM CSV FILE 's3-us-east-2://:@myhanabucket/sf0001/data/init/dbgen_version.csv' INTO TPCDS.dbgen_version WITH FIELD DELIMITED BY '|' THREADS 2 FAIL ON INVALID DATA;
Seems like the ${keyid} and ${keysecret} are empty.
I removed the AWS user (with policies, etc.) and recreated it. This helped. Great step-by-step tutorial. Thanks Witalij.
Sorry for the late comment. I was on vacation 🙂