Hana SDI Adapter for Azure Data Lake
Design goal
When dealing with large amounts of data, there are different storage concepts. SAP favors multi-temperature storage inside the database, but the majority of projects use an object store – in the case of Microsoft Azure that would be the Azure Data Lake. It is cheap, can grow to virtually any size, is secure and can handle all types of data.
To make this data accessible via Hana as well, a new SDI adapter is needed. I wrote one that provides access to the data in Azure Data Lake and Azure Service Bus. With this adapter the data can be used by Hana on-premise, Hana Cloud, SAP Data Warehouse Cloud, S/4Hana or SAP BW.
Regarding the supported payload formats, Avro and Json are the popular formats in Azure. In the example below, the Data Lake contains millions of files, each representing one patient record in the FHIR standard. Whenever new files are uploaded to Azure, Azure Event Grid posts this information into a Service Bus queue the adapter is listening on. The adapter itself works with any Json or Avro structure, no limitations.
My sample file in Azure looks like this
{
  "resourceType": "Patient",
  "id": "example",
  "identifier": [
    { "use": "usual", "system": "urn:oid:1.2.36.146.595.217.0.1" }
  ],
  "active": true,
  ..
  ..
All these files should be queryable via Hana like any other table.
Requirements
This creates a couple of requirements for this Data Lake adapter:
- Parse Json and Avro files.
- As these file formats allow for nested data, a mechanism to blend the nested Json/Avro data into the relational Hana data model must exist.
- When files are changed, deleted or new files arrive, the changes must become visible in Hana right away.
- One Data Lake storage contains different kinds of files, each with their own structure, which must be treated independently.
The solution uses the concept of “views” in the Azure Data Lake adapter to define the transformation from the nested data into Hana tables. For the above example the user creates a file named patient.view with the following content:
{
"namefilter": "patient.*\\.json",
"from": "$",
"columns": [
{
"accessor": "$.id",
"alias": "ID",
"datatype": "VARCHAR(256)"
},
{
"accessor": "$.identifier[0].system",
"alias": "SYSTEM_IDENTIFIER",
"datatype": "VARCHAR(40)"
},
{
"accessor": "$.gender",
"alias": "GENDER",
"datatype": "VARCHAR(10)"
},
{
"accessor": "$.birthDate",
"alias": "BIRTHDATE",
"datatype": "DATE",
"format": "yyyy-M-d"
}
],
"queuename": "filechanges"
}
Apart from the view projection definitions (from and columns) with their information about:
- What nesting level is driving the row generation? from: $ means the root level, so one row per file.
- Which field to output?
- What should be its Hana column name?
- What should be its Hana data type?
- What is the format of the value in case a data type conversion is needed, e.g. from String to Date?
a name and directory filter can be specified as well. Because this format definition makes sense only for Json files containing patient data, and all these files follow the pattern patient*.json in the Data Lake storage, a namefilter is used. Hence this view considers only those files.
The last setting in the above example is the Enterprise Service Bus (“ESB”) queue name into which Azure puts all file changes. This is another feature of the Azure Data Lake: it allows triggering events whenever a file is modified, and via Azure Event Grid these events can be put into an ESB queue.
Consuming Azure Data Lake data
This view is imported as a virtual table just like with any other SDI adapter, either via SQL or via the various UIs. As the definition file is called patient.view, its remote name is patient and the chosen virtual table name here is v_patient. Selecting this virtual table returns the data contained in the files.
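For reference, the SQL route looks roughly like the sketch below; the remote source name AzureDataLake is an assumption, the remote object name patient comes from the patient.view file:
CREATE VIRTUAL TABLE v_patient AT "AzureDataLake"."<NULL>"."<NULL>"."patient"; -- remote source name is a placeholder
SELECT ID, SYSTEM_IDENTIFIER, GENDER, BIRTHDATE FROM v_patient; -- the columns defined in patient.view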
Creating the Hana remote source object
The remote source definition asks for the URL of the Data Lake endpoint and optionally – if realtime subscriptions will be used also – for the Enterprise Service Bus connection string. Both values can simply be copied from Azure.
As a Storage Account in Azure can have multiple containers, the file system is a mandatory parameter and it can be augmented with additional sub directories. For example the setting /main/FHIR configures this remote source to read data only from the file system called “main” and only within the “FHIR” directory and its sub directories.
The security keys for the Storage Account and the Service Bus are placed into the secure credential storage.
The only additional parameter needed is an adapter-local directory where the user places all view definition files.
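For completeness, creating such a remote source via SQL would look roughly like the following sketch. The adapter name, agent name and the property names inside the CONFIGURATION XML are assumptions here; the adapter's installation guide lists the real ones.
-- Sketch only: adapter name, agent name and all property names below are assumptions
CREATE REMOTE SOURCE "AzureDataLake" ADAPTER "AzureDataLakeAdapter"
AT LOCATION AGENT "DPAGENT" CONFIGURATION
'<?xml version="1.0" encoding="UTF-8"?>
<ConnectionProperties name="connection">
  <PropertyEntry name="endpoint">https://mystorageaccount.dfs.core.windows.net</PropertyEntry>
  <PropertyEntry name="filesystem">/main/FHIR</PropertyEntry>
  <PropertyEntry name="viewdirectory">/usr/sap/dpagent/views</PropertyEntry>
</ConnectionProperties>'
WITH CREDENTIAL TYPE 'PASSWORD' USING
'<CredentialEntry name="credential">
  <user>mystorageaccount</user>
  <password>my-storage-account-key</password>
</CredentialEntry>';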
Realtime Subscriptions
Using Hana SDI realtime subscriptions, changes in the Data Lake are captured and the target table in Hana gets updated with the new data immediately.
CREATE COLUMN TABLE t_patient LIKE v_patient;
CREATE REMOTE SUBSCRIPTION s_patient
ON v_patient TARGET TABLE t_patient;
ALTER REMOTE SUBSCRIPTION s_patient QUEUE;
INSERT INTO t_patient select * from v_patient; // Initial load
ALTER REMOTE SUBSCRIPTION s_patient DISTRIBUTE;
SELECT * FROM t_patient; // Returns old and changed file contents
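Whether the subscription is active and applying changes can be verified via the standard Hana monitoring view – a quick sanity check, assuming the subscription name from above:
SELECT SUBSCRIPTION_NAME, STATE FROM M_REMOTE_SUBSCRIPTIONS WHERE SUBSCRIPTION_NAME = 'S_PATIENT';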
The corresponding setup in Azure is very easy to achieve. All that needs to be done is to define a new event on the file system and use a previously created Service Bus queue as the endpoint.
Summary
With that, not only can all Azure Data Lake data be read in a federated mode, but the data can also be cached in Hana tables for best query performance, and this data is always current thanks to the realtime subscriptions. These changes are even transactionally consistent.
Thanks for sharing, I believe we will see an increasing interest in real-time integration, so the more we know about this, the better we'll get there.
Hi Werner,
great blog!
If one wants to establish this scenario with SAP HANA Cloud and Azure Data Lake, I guess the DP Agent is still needed? In this case it would not make much sense to go from cloud to on-premise to cloud again. Therefore: Is it possible to install the DP Agent on a VM in Azure?
Regards,
Michael
Michael Kaufmann Yes, of course. Any small Linux VM will do.
Would be nice if each adapter could be packaged as a Docker image with the dpagent already included, wouldn't it? I will make the suggestion to SAP.
Hi Werner,
any update regarding your suggestion?
Thanks!
I have suggested and discussed it with the developers last year - no response.
There is a Docker image for DP agent: https://github.com/entmike/hana-dpagent,
but it would be nice to have an official one.
Maybe I'll file an idea at influence.sap.com.
I'm thinking about migrating our DP agents to a PaaS service like RedHat OpenShift.
Do you have any experience/feedback regarding such a setup?
The docker image is nice but actually just another way of installing the full blown dpagent. My proposal goes much further than that:
Great find & great use-case!
I don't find any mention of this adapter when looking into the PAM documentation or Installation guide for SDI. Is there any documentation on this adapter?
Simon Vandelanotte If you send me an email, I will grant you access to the github repo.
Note: This is not an SAP product. I wrote the adapter for a customer and others can participate in it.
Dear Werner
we have a use case as well for the Azure Data Lake, is it possible to have the AzureDataLake adapter in HANA SDI ?
regards
Harikishore
Harikishore Sreenivasalu If you send me an email I will grant you access to the github repo.
Werner
We have a use case as well for the Azure Data Lake, is it possible to have the AzureDataLake adapter in HANA SDI? My email id: ugandhar.nandam@gmail.com
Hi Werner ,
We have similar use case for Azure Data Lake adapter in HANA SDI . Could you please provide the installation reference. my mail id (needara2021@gmail.com)
We need help with the code to have an adapter for AZURE Data Lake in SAP Commissions.
my email marki@asardigital.com
Dear Werner Dähn
We need help with the code to have an adapter for AZURE Data Lake in SAP Commissions on HANA.
my email marki@asardigital.com
Hi, please share the script for creating the remote source, acruzto2002@gmail.com
Regards.
Aristobulo Cruz I will bring the documentation up to standard in the next weeks.
thanks Werner, are there other steps to perform before creating the adapter?
CREATE REMOTE SOURCE "AzureDataLakeAdapter" ADAPTER "AzureSQLDB"
AT LOCATION AGENT "DPA_HNPRD" CONFIGURATION
'<?xml version="1.0" encoding="UTF-8"?>
<ConnectionProperties name="azuredb">
Error: (dberror) [474]: invalid adapter name: AzureSQLDB:
Yes, there are the usual SDI steps to create an agent, adapter and adapter location.
see the first two commands here:
https://blogs.sap.com/2015/08/20/hana-adapter-sdk-interaction-via-sql/
The adapter name of AzureSQLDB is fixed, you want to enable this exact adapter.
Hi Werner, yes I can do that, but if the adapter name isn't registered with the agent, the adapter creation script doesn't work.
CREATE ADAPTER "AzureSQLDB" PROPERTIES 'display_name=Azure Adapter;description=Azure Adapter' AT LOCATION DPSERVER;
Could not execute 'CREATE ADAPTER "AzureSQLDB" PROPERTIES 'display_name=Azure Adapter;description=Azure Adapter' AT ...'
Error: (dberror) [403]: internal error: Cannot get adapter capabilities: exception 151002: Adapter type AzureSQLDB not registered.
You have created an agent called DPAGENT that points to the dpagent installation where the adapter is deployed, yes?
In that case, the SQL to execute would be
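CREATE ADAPTER "AzureSQLDB" AT LOCATION AGENT "DPAGENT"; -- presumably the intended statement: registers the adapter deployed on that agent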
This contacts the DPAGENT and queries the metadata of the AzureSQLDB adapter, which was deployed there by you (via the agentconfig tool). I am a bit confused because in the create-remote-source statement you used "DPA_HNPRD" as the agent location.
And I made a mistake as well. Are we talking about AzureDataLake or AzureSQLDB? The first is about files in the Azure Blob Storage, the second is a SQL Server database.
Above you wrote AzureSQLDB in one SQL statement and that is what I saw first.
The AzureDataLakeAdapter has quite comprehensive documentation, I thought.
e.g. https://github.com/rtdi/HanaSDIAzureDataLakeAdapter/blob/master/docs/Installation.md
Hi Werner, I have a data provisioning agent DPA_HNPRD, and I need the AzureDataLakeAdapter for a demo with a json file.
Please confirm the documentation; I get an error with this url:
https://github.com/rtdi/HanaSDIAzureDataLakeAdapter/blob/master/docs/Installation.md
so thanks.
You understood that this adapter is something I wrote, that you must install it in SDI and that I must grant you access to the github repository so you can download it?
Werner,
I'm trying to create remote source using the adapter provided by you but facing the below issue.
Can you please help?
Also, can you tell me the purpose of the Data Lake API in HANA SDI mentioned in the document? If required, can you let me know how to install it in HANA?
Regards,
Ugandhar
The Azure Data Lake API is just the Microsoft API this adapter uses to access the Azure Blob Storage. Microsoft provides multiple APIs for that and which one to use is our choice. So there is nothing to install in Hana or in the adapter.
The error message says that you can log in to the Data Lake endpoint, only there is no "filetest" folder within that endpoint. Can you post a screenshot of what objects are in this file system?
Thank you Werner for the update. I've created everything from scratch again and the remote source is created successfully. But I'm not able to read data from the json file.
Followed all the steps but could not get data, please let me know what the issue is.
Created Event Subscriptions.
Created .view file in the Dictionary Directory as well.
Screenshots below for reference.
Can't see the problem based on the information you provided either. What I just did was:
The view looks fine. Filter is okay, accessor is okay, balance is a number but defined as a VARCHAR, so it will get converted. No problem either. The queue does not play a role here yet, as you read from the virtual table only.
Yet you get zero records and no error. So either the sampleData.json file is empty or the remote source does not point to that directory.
btw, can we discuss this in a github issue? Otherwise we clog this post too much.
If nothing helps, let's have an online webmeeting.
Happy to connect.
Please let me know when we can connect.
Hi Werner,
Thank you for this nice blog. I also have a use case to connect to an Azure Datalake via SDI, could you please provide the installation reference?
Kr,
Gilles (gilles.vandenbleeken@ontexglobal.com)
Hi Werner,
I have one customer asking for the same use case. Could you please grant me access to your github for the azuredatalakeadapter - prathyusha.garimella@sap.com. I can do a quick POC. Does it have any limitations with regards to volume or data types?
Thanks,
Prathy
Hi Werner Dähn,
Thank you for the nice blog. Could you please provide us access to the github repo? We have a use case to connect to Azure Data Lake using SDI.
Please send me an email with your github username and I will add you to the repo.
Hi Werner Dähn,
Thank you for the prompt reply! Please find my details below
Email: XXXX
Github user id : XXX
Thank you so much! Deployed successfully!
Hi Werner Dähn
we have a use case, can you please share the azure data lake adapter repo access? My mail: mukkamala2004@yahoo.com
Hi Werner Dähn,
Thank you for this nice and helpful blog. We have a requirement to set up an integration between BW/4Hana and the data lake, with read and write-back to the database. I feel this will help perform the integration. Kindly share the access and provide any document reference. Also, I would like to understand the approximate effort (in terms of number of hours) needed to complete this setup?
Thanks and Regards,
Amrith Kumar V M
Werner Dähn - Could you please grant me access to your github for azuredatalakeadapter - my email id: amrithsap22@gmail.com
done
Great blog Werner. I could not find the adapter though in SAP SDI or on any SAP KBAs. Would you be able to help send some docs on how to create this adapter for Data Lake? Thanks a lot.
Robert Camangon
robert.camangon@ghd.com
done
Thanks Werner. Just a question: do you know if it is possible to deploy a custom adapter from the SQL console, and if so, what the command is? Asking since we have HA in place for Staging and Prod and we are only able to register the agent and adapters using the SQL console, as we get the error "Received fatal alert: unrecognized name" when doing it via agentcli.bat. SAP therefore suggested just registering via the SQL console and that worked.
Thanks a lot.
Robert
Yes, the config tools also interact with the Hana database via SQL only.
https://blogs.sap.com/2015/08/20/hana-adapter-sdk-interaction-via-sql/
Hello Werner Dähn,
We also need the azure datalake adapter, can you please give us access to the repository.
This is the email: eusebio.hernandez@cemosa.es and our GitHub user is CEMOSA-DES
Many thanks and regards.
Yasser.