Setting Up Initial Access to HANA Cloud data lake Files
The HANA Cloud, data lake supports storage of any type of data in its native format. Managed file storage provides a place to securely store any type of file without requiring you to set up and configure storage on an external hyperscaler account. This is very useful if you need a place to put files for ingestion into HANA Cloud, data lake IQ for high-speed SQL analysis, or if you need to extract data for any reason. Hopefully, in the near future, HANA data lake Files will be easily accessible from HANA Cloud, HANA databases as well.
Setting up and configuring access to SAP HANA Cloud, data lake Files for the first time can be a difficult process, especially if you are coming from a database background and are not familiar with object storage or REST APIs.
Here is a process I have used to test HANA Data Lake Files. Because HANA Data Lake Files manages user security and access via certificates, you need to generate signed certificates to set up user access. If you don’t have access to a signing authority, you can create a CA and a signed client certificate, then update the HDL Files configuration using the process below, which leverages OpenSSL. I have used it many times, so it should work for you.
First you need to create and upload a CA bundle. You can generate the CA using the OpenSSL command:
openssl genrsa -out ca.key 2048
Next, you create the CA’s public certificate (valid for 200 days in this case). Provide at least a common name and fill in the other fields as desired.
openssl req -x509 -new -key ca.key -days 200 -out ca.crt
Now you need to create a signing request for the client certificate. Provide at least a common name and fill in the other fields as desired.
openssl req -new -nodes -newkey rsa:2048 -out client.csr -keyout client.key
Finally, create the client certificate (valid for 100 days in this case):
openssl x509 -days 100 -req -in client.csr -CA ca.crt -CAkey ca.key -CAcreateserial -out client.crt
*Note* – Make sure the subject fields are not all exactly the same between the CA and client certificates; otherwise the client certificate is assumed to be self-signed and the certificate validation below will fail.
To verify the certificate was signed by a given CA (so that when you upload the CA certificate to HANA data lake you know it can be used to validate your client certificate):
openssl verify -CAfile ca.crt client.crt
Then open your instance in HANA Cloud Central and choose “Manage File Container” to set up your HANA Data Lake Files user.
Edit the configuration and choose “Add” in the “Trusts” section. Copy or upload the ca.crt you generated earlier and click “Apply”. However, don’t close the “Manage File Container” screen just yet.
Now we can configure our user to enable them to access the managed file storage.
Scroll down to the “Authorizations” section and choose “Add”. A new entry will appear.
Choose a role for your user from the drop-down (by default there are admin and user roles).
Here is where things get a little tricky. You need to add the pattern string from your client certificate so that when you make a request, the storage gateway (the entry point into HANA data lake files) can determine which user to validate you against.
You have two options for generating the pattern string. You can use the following OpenSSL command to generate it (omit the “subject= ” prefix that will appear in the output):
openssl x509 -in client.crt -nameopt RFC2253 -subject -noout
Alternatively, you can use the “generate pattern” option on the screen, which will open a dialog box that allows you to upload/paste your client certificate and will automatically generate the pattern for you. Note that we do not save the certificate, only the pattern string:
Click “Apply” to add the pattern string to your authorizations entry.
Note that the pattern string also allows wildcards, so you can authorize a whole class of certificates with a certain role. If a certificate matches the pattern of multiple authorizations, the one that is used is governed by the “Rank” value set for each authorization entry.
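For example, with hypothetical subject values, a wildcard pattern along these lines would authorize any client certificate issued with that OU and O, regardless of its common name:

```
CN=*,OU=myteam,O=ExampleCorp
```

The exact wildcard semantics are defined by the HDL Files configuration, so check the pattern against your own certificate subjects before relying on it.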
You should now be able to access and use HANA Data Lake Files via the REST api.
Here is a sample curl command that works for me and should validate that you have a successful connection (the instance ID and Files REST API endpoint can be copied from the instance details in HANA Cloud Central). Use the client certificate and key that you generated above and used to create your authorization.
Note that curl can be a little tricky – I was on Windows and could not get the Windows 10 version of curl to work for me. I ended up downloading a newer curl version (7.75.0), which did work; however, I had to use the '--insecure' option to skip validation of the HANA Cloud server certificate because I wasn’t sure how to access the Windows certificate store from curl.
curl --insecure -H "x-sap-filecontainer: <instance_id>" --cert ./client.crt --key ./client.key "https://<Files REST API endpoint>/webhdfs/v1/?op=LISTSTATUS" -X GET
The above command should return (for an empty HANA Data Lake):
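For an empty container, a successful call returns a WebHDFS-style JSON document with an empty FileStatus list, something like this (exact formatting may vary):

```json
{
  "FileStatuses": {
    "FileStatus": []
  }
}
```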
Now you should be all set to use HANA Data Lake Files to store any type of file in HANA Cloud. For the full set of supported REST APIs and arguments for managing files, see the documentation.
Thanks for reading!
Would you have an example demonstrating how I can PUT local files (e.g., compressed zip files) or files from external storage (Azure Data Lake or S3) to HDL (uncompress and load, or retain as a compressed zip file)? Also, how do I use Postman or Swagger instead of the curl commands with the certificates?
This would be really helpful in trying out some scenarios in which I can move image or video or blob files from local/ external storage to HDL and back.
The REST API is documented here. There is a tool shipped with the HANA data lake client called hdlfscli, which is a wrapper for the REST API and allows you to connect to and interact with your instance.
To use Postman to access HANA data lake Files, you need to add the client certificates for the HANA data lake Files endpoint to your list of certificates in your settings.
You can also use standard HTTP libraries from Java, Python, etc. to access HANA data lake Files. An example of an upload (PUT) request for a file using a Python script would look something like this:
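A minimal sketch using only the Python standard library follows. The endpoint, container ID, and file paths are placeholders you must replace with your own instance values, and the `op=CREATE&data=true` upload operation is assumed to work analogously to the `LISTSTATUS` call shown above:

```python
import http.client
import ssl

# Placeholders -- copy these from the instance details in HANA Cloud Central.
ENDPOINT = "<Files REST API endpoint>"   # host name only, no https:// prefix
FILE_CONTAINER = "<instance_id>"

def build_create_url(remote_path):
    """Build the WebHDFS-style CREATE URL for uploading to remote_path."""
    return f"/webhdfs/v1{remote_path}?op=CREATE&data=true"

def upload_file(local_path, remote_path):
    """PUT a local file into the HDL Files container, authenticating
    with the client certificate and key generated earlier."""
    ctx = ssl.create_default_context()
    ctx.load_cert_chain("./client.crt", "./client.key")
    conn = http.client.HTTPSConnection(ENDPOINT, context=ctx)
    with open(local_path, "rb") as f:
        conn.request(
            "PUT",
            build_create_url(remote_path),
            body=f.read(),
            headers={
                "x-sap-filecontainer": FILE_CONTAINER,
                "Content-Type": "application/octet-stream",
            },
        )
    resp = conn.getresponse()
    return resp.status, resp.reason
```

A call such as `upload_file("scustom_dl_file.csv", "/user/home/scustom_dl_file.csv")` would then upload the file; check the returned status code to confirm the request was authorized.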
We want to use the existing connection from Connection Management.
We can access the connection properties via the SAP DI library and get the connection details, but the problem is that it stores the certificate as 'authToken', and there is no direct way of passing 'authToken' to HTTPSConnection. Is there a way we can use a connection auth token instead of a cert and key?
I am trying it over Jupyter Lab.
You should be able to package the client certificate/key into a keystore and use that to access HANA Cloud, data lake Files from Jupyter.
You might want to check out:
Using SAP HANA Cloud, data lake Files with a Jupyter Notebook and PySpark | SAP Blogs
Hello Nishant Gopinath
It might be easier to do the same thing "in-house" ... 😉
See the Blog - SAP (Sybase) IQ – the hidden treasure …
Best Regards Roland
Thanks Jason. I am able to run with the swagger.json in Postman and also use the sample Python script to transfer local files to HDL.
That's great to hear Nishant.
Thanks for your explanation of data lake Files. I recreated the steps mentioned above and face an unauthorized error in the final step (i.e., loading data into the relational database table). Can you please help me with it?
This error is almost always due to a mistake in the certificate configuration. I would revisit those steps.
I have recreated the certificate steps and I'm able to upload the file to the data lake using curl along with the certificate. We're facing an issue while using the LOAD TABLE command from DB Explorer with the HDLADMIN DB user.
Can you please help me configure the authorization pattern to validate the certificate so the table can be loaded?
Below is the authorization pattern configured in the cockpit:
Below is my load table syntax:
LOAD TABLE SCUSTOM_FILES_NEW ( "MANDT" ,"ID" ,"NAME" ,"FORM" ,"STREET" ,"POSTBOX","POSTCODE","CITY" ,"COUNTRY" ,"REGION" ,"TELEPHONE" ,"CUSTTYPE" ,"DISCOUNT" ,"LANGU" ,"EMAIL" ,"WEBUSER") FROM 'hdlfs:///user/home/scustom_dl_file.csv' format ascii --SKIP 1 DELIMITED BY ',' ESCAPES OFF;
Below is error message:
Could not execute 'LOAD TABLE SCUSTOM_FILES_NEW ( "MANDT" ,"ID" ,"NAME" ,"FORM" ,"STREET" ...' Error: (dberror) [-1014083]: Error during HDLFS operation: Unauthorized. Please verify request and authorizations. Path: [/user/home/scustom_dl_file.csv] Query: [op=GETFILESTATUS] FileContainer: [ca3e0097-3d06-4708-ac6f-2xqg40b0b6af] Client: [CN=iqserver,OU=ca3e0097-3d06-. -- (oslib/hos_HDLFS.cxx 1970)
Thanks in advance.
The HDL instance has certificate access automatically configured between the HDL RE and HDL Files containers in a single instance. **BUT** if you configure a user home directory for HDL Files, only a database user with the same name as the name specified in the HDL Files client certificate common name (CN) would be able to access files in that directory from an HDL RE connection. Only that user (or users with impersonation privileges) can access data stored in the user home directory.
So, in this case, you would have to create a user named "iqserver" and grant them LOAD permissions on the HDLADMIN.SCUSTOM_FILES_NEW table in order to load it.
Then you could connect as that user and issue the LOAD or if you are connected as HDLADMIN (or some other admin user) you could use the SETUSER statement to impersonate the iqserver user and do the LOAD.
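Assuming the iqserver user has been created and granted the LOAD privilege as described, the impersonation route from an HDLADMIN connection might look like this (a sketch; SETUSER as in SAP IQ):

```sql
-- Connected as HDLADMIN: temporarily act as the iqserver user,
-- whose name matches the client certificate CN
SETUSER iqserver;

-- ... issue the LOAD TABLE statement from above here ...

-- Revert to the original connection user
SETUSER;
```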
Your alternative would be to store the file outside of the specified user home directory structure, which would make it accessible to any HDL Files user certificate, including the HDL RE one.
Do you have a Java example to upload (PUT) a file to HDL?