Technical Articles
Werner Dähn

Hana SDI Adapter for Azure Data Lake

Design goal

When dealing with large amounts of data, there are competing concepts. SAP favors multi-temperature storage inside the database, but the majority of projects use an object store – in the case of Microsoft Azure, that would be the Azure Data Lake. It is cheap, can grow to virtually any size, is secure and can handle all types of data.

To make this data accessible via Hana as well, a new SDI adapter is needed. I wrote one that provides access to the data in Azure Data Lake and Azure Service Bus. With this adapter the data can be consumed by Hana on-premise, Hana Cloud, SAP Data Warehouse Cloud, S/4Hana or SAP BW.

Regarding payload formats, Avro and Json are the popular choices in Azure. In the example below, the Data Lake contains millions of files, each representing one patient record in the FHIR standard. Whenever new files are uploaded to Azure, Azure Event Grid posts this information into a Service Bus queue the adapter is listening on. The adapter itself works with any Json or Avro format, no limitations.
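Purely to illustrate this event flow, the following Python sketch parses a Service Bus message carrying an Event Grid "BlobCreated" notification and extracts the path of the changed file. The message content and the changed_file helper are inventions for this sketch; only the eventType and subject field layout follows the documented Event Grid schema for Blob Storage events.

```python
import json

# Illustrative Service Bus message carrying an Azure Event Grid
# "BlobCreated" notification. The field layout follows the Event Grid
# schema; all concrete values here are made up.
message = json.dumps({
    "eventType": "Microsoft.Storage.BlobCreated",
    "subject": "/blobServices/default/containers/main/blobs/FHIR/patient_0001.json",
    "data": {"url": "https://example.dfs.core.windows.net/main/FHIR/patient_0001.json"},
})

def changed_file(event_json):
    """Return the container-relative path of the changed blob, or None."""
    event = json.loads(event_json)
    if not event.get("eventType", "").startswith("Microsoft.Storage.Blob"):
        return None
    # subject format: /blobServices/default/containers/<container>/blobs/<path>
    _, _, path = event["subject"].partition("/blobs/")
    return path or None

print(changed_file(message))  # FHIR/patient_0001.json
```

An adapter listening on the queue only needs this path to know which file to re-read.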

My sample file in Azure looks like this:

{
  "resourceType": "Patient",
  "id": "example",
  "identifier": [
    { "use": "usual", "system": "urn:oid:1.2.36.146.595.217.0.1" }
  ],
  "active": true,
  ..
  ..

and all files should be queryable via Hana like any other table.

 

Requirements

 

This creates a couple of requirements for this Data Lake adapter:

  • Parse Json and Avro files.
  • As these file formats allow for nested data, a mechanism must exist to blend the nested Json/Avro data into the relational Hana data model.
  • When files are changed or deleted, or new files arrive, the changes must be visible in Hana right away.
  • One Data Lake storage holds different kinds of files, each with its own structure, and each kind must be treated independently.

The solution uses the concept of “views” in the Azure Data Lake adapter to define the transformation from the nested data into Hana tables. For the above example the user creates a file named patient.view with the following content:

{
	"namefilter": "patient.*\\.json",
	"from": "$",
	"columns": [
		{
			"accessor": "$.id",
			"alias": "ID",
			"datatype": "VARCHAR(256)"
		},
		{
			"accessor": "$.identifier[0].system",
			"alias": "SYSTEM_IDENTIFIER",
			"datatype": "VARCHAR(40)"
		},
		{
			"accessor": "$.gender",
			"alias": "GENDER",
			"datatype": "VARCHAR(10)"
		},
		{
			"accessor": "$.birthDate",
			"alias": "BIRTHDATE",
			"datatype": "DATE",
			"format": "yyyy-M-d"
		}
	],
	"queuename": "filechanges"
}

Apart from the view projection definitions (from and columns) with their information about:

  • What nesting level is driving the row generation? from: $ means the root level, so one row per file.
  • Which field to output?
  • What should be its Hana column name?
  • What should be its Hana data type?
  • What is the format of the value in case a data type conversion is needed, e.g. from String to Date?

a name and directory filter can be specified as well. Because this format definition makes sense only for json files containing patient data, and all these files follow the pattern patient*.json in the Data Lake storage, a namefilter is used. Hence this view considers only those files.
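For illustration, the namefilter can be checked against a few hypothetical file names with a standard regular expression engine (assuming the adapter matches the pattern against the entire file name):

```python
import re

# "namefilter" value from patient.view; assuming the adapter matches it
# against the entire file name.
namefilter = re.compile(r"patient.*\.json")

files = ["patient_0001.json", "patient_archive.json",
         "observation_0001.json", "patient.avro"]  # invented names
matching = [f for f in files if namefilter.fullmatch(f)]
print(matching)  # ['patient_0001.json', 'patient_archive.json']
```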

The last setting in the above example is the name of the Azure Service Bus queue into which Azure puts all file changes. This is another feature of the Azure Data Lake: it allows triggering events whenever a file is modified, and using Azure Event Grid these events can be routed into a Service Bus queue.
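To illustrate the semantics of the patient.view definition above, here is a small Python sketch that mimics what the adapter conceptually does for the root level from: $, one row per file: each accessor walks into the nested document and the alias becomes the Hana column name. The resolve helper and the gender/birthDate sample values are my own simplification and invented data, not the adapter's actual code.

```python
import re

# Subset of patient.view, reduced to what this sketch needs.
view_columns = [
    {"accessor": "$.id", "alias": "ID"},
    {"accessor": "$.identifier[0].system", "alias": "SYSTEM_IDENTIFIER"},
    {"accessor": "$.gender", "alias": "GENDER"},
    {"accessor": "$.birthDate", "alias": "BIRTHDATE"},
]

# Sample document; gender and birthDate values are invented to complete
# the truncated example from above.
patient = {
    "resourceType": "Patient",
    "id": "example",
    "identifier": [{"use": "usual", "system": "urn:oid:1.2.36.146.595.217.0.1"}],
    "active": True,
    "gender": "male",
    "birthDate": "1974-12-25",
}

def resolve(doc, accessor):
    """Walk a JsonPath-like accessor such as $.identifier[0].system."""
    value = doc
    for name, index in re.findall(r"\.(\w+)|\[(\d+)\]", accessor):
        value = value[name] if name else value[int(index)]
    return value

row = {c["alias"]: resolve(patient, c["accessor"]) for c in view_columns}
print(row["ID"], row["BIRTHDATE"])  # example 1974-12-25
```

In the real adapter the BIRTHDATE column would additionally be converted to a DATE using the format pattern yyyy-M-d; the sketch leaves the value as a string.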

 

 

Consuming Azure Data Lake data

This view is imported as a virtual table just like with any other SDI adapter, either via SQL or via the various UIs. As the definition file is called patient.view, its remote name is patient and the chosen virtual table name is v_patient. In the above screenshot this table was selected and it returned the data contained in the files.

 

Creating the Hana remote source object

The remote source definition asks for the URL of the Data Lake endpoint and optionally – if realtime subscriptions will be used as well – for the Service Bus connection string. Both values can simply be copied from Azure.

 

As a Storage Account in Azure can have multiple containers, the file system is a mandatory parameter, and it can be augmented with additional subdirectories. For example, the setting /main/FHIR configures this remote source to read only from the file system called “main”, and within it only from the “FHIR” directory and its subdirectories.

The security keys for the Storage Account and the Service Bus are placed into the secure credential storage.

The only additional parameter needed is an adapter-local directory where the user places all view definition files.

 

Realtime Subscriptions

Using Hana SDI realtime subscriptions, changes in the Data Lake are captured and the target table in Hana gets updated with the new data immediately.

CREATE COLUMN TABLE t_patient LIKE v_patient;

CREATE REMOTE SUBSCRIPTION s_patient 
ON v_patient TARGET TABLE t_patient;

ALTER REMOTE SUBSCRIPTION s_patient QUEUE;

INSERT INTO t_patient select * from v_patient; // Initial load

ALTER REMOTE SUBSCRIPTION s_patient DISTRIBUTE;

SELECT * FROM t_patient; // Returns old and changed file contents

 

 

The corresponding setup in Azure is easy to achieve. All that needs to be done is to define a new Event on the File System and use a previously created Service Bus queue as the endpoint.

 

Summary

With that, all Azure Data Lake data can not only be read in federated mode, but also be cached in Hana tables for best query performance, and thanks to the realtime subscriptions this cached data is always current. The changes are even transactionally consistent.

 

 


      37 Comments
      Mario De Felipe

      Thanks for sharing, I believe we will see an increasing interest in real-time integration, so the more we know about this, the better we'll get there.

      Michael Kaufmann

      Hi Werner,

      great blog!

      If one wants to establish this scenario with SAP HANA Cloud and Azure Data Lake, I guess the DP Agent is still needed? In this case it would not make much sense to go from Cloud to on-premise to Cloud again. Therefore: is it possible to install the DP Agent on a VM in Azure?

      Regards,
      Michael

      Werner Dähn
      Blog Post Author

      Michael Kaufmann Yes, of course. Any small Linux VM will do.

      Would be nice if each adapter could be packaged as a docker image with the dpagent included already, wouldn't it? I will make the suggestion to SAP.

      Michael Bauer

      Hi Werner,

       

      any update regarding your suggestion?

      Thanks!

      Werner Dähn
      Blog Post Author

      I have suggested and discussed it with the developers last year - no response.

      Michael Bauer

      There is a Docker image for DP agent: https://github.com/entmike/hana-dpagent,
      but it would be nice to have an official one.
      Maybe I'll file an idea at influence.sap.com.

      I'm thinking about migrating our DP agents to a PaaS service like RedHat OpenShift.
      Do you have any experience/feedback regarding such a setup?

      Werner Dähn
      Blog Post Author

      The docker image is nice but actually just another way of installing the full-blown dpagent. My proposal goes much further than that:

      • Remove all the extra software being installed, like the Data Services Engine. It is not needed to run e.g. an Oracle adapter.
      • Provide a Web UI to configure everything instead of command line tools.
      • Monitoring and support via the Web UI.
      • One docker = one adapter = one agent, to make everything very small and simple.
      • Remove the old communication patterns; the JDBC method is a superset of all.
      Simon Vandelanotte

      Great find & great use-case!

       

      I don't find any mention of this adapter when looking into the PAM documentation or Installation guide for SDI. Is there any documentation on this adapter?

      Werner Dähn
      Blog Post Author

      Simon Vandelanotte If you send me an email, I will grant you access to the github repo.

      Note: This is not an SAP product; I wrote the adapter for a customer and others can participate in it.

      Harikishore Sreenivasalu

      Dear Werner

      we have a use case as well for the Azure Data Lake, is it possible to have the AzureDataLake adapter in HANA SDI ?

      regards

      Harikishore

      Werner Dähn
      Blog Post Author

      Harikishore Sreenivasalu If you send me an email I will grant you access to the github repo.

      Ugandhar Nandam

      Werner

      We have a use case as well for the Azure Data Lake, is it possible to have the AzureDataLake adapter in HANA SDI? My email id: ugandhar.nandam@gmail.com

      ramesh k

      Hi Werner ,

      We have similar use case for Azure Data Lake adapter in HANA SDI . Could you please provide the installation reference. my mail id (needara2021@gmail.com)

      MARK IVANOWSKI

      We need help with the code to have an adapter for AZURE Data Lake in SAP Commissions.

      my email marki@asardigital.com

      MARK IVANOWSKI

      Dear Werner Dähn

      We need help with the code to have an adapter for AZURE Data Lake in SAP Commissions on HANA.

      my email marki@asardigital.com

       

      Aristobulo Cruz

      Hi, please share the script for creating the remote source, acruzto2002@gmail.com

      Regards.

      Werner Dähn
      Blog Post Author

      Aristobulo Cruz I will bring the documentation up to standard in the next few weeks.

       

      CREATE REMOTE SOURCE "xxxxx" ADAPTER "AzureSQLDB"
      AT LOCATION AGENT "yyyyy" CONFIGURATION
      	'<?xml version="1.0" encoding="UTF-8"?>
      	<ConnectionProperties name="azuredb">
      		<PropertyEntry name="jdbcurl">jdbc:sqlserver://zzzzz.database.windows.net:1433;database=rtditest1;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;loginTimeout=30;</PropertyEntry>
      		<PropertyEntry name="jdbcurlrt"></PropertyEntry>
      		<PropertyEntry name="pollsleep"></PropertyEntry>
      	</ConnectionProperties>'
      WITH CREDENTIAL TYPE 'PASSWORD' USING
        '<CredentialEntry name="credential">
      	<user>aaaaaa</user>
          <password>bbbbbb</password>
        </CredentialEntry>';
      Aristobulo Cruz

      thanks Werner, did you perform any other step before creating the adapter?

       

      CREATE REMOTE SOURCE "AzureDataLakeAdapter" ADAPTER "AzureSQLDB"
      AT LOCATION AGENT "DPA_HNPRD" CONFIGURATION
      '<?xml version="1.0" encoding="UTF-8"?>
      <ConnectionProperties name="azuredb">

       

      Error: (dberror) [474]: invalid adapter name: AzureSQLDB:

       

       

      Werner Dähn
      Blog Post Author

      Yes, there are the usual SDI steps to create an agent, adapter and adapter location.

       

      see the first two commands here:

      https://blogs.sap.com/2015/08/20/hana-adapter-sdk-interaction-via-sql/

       

      The adapter name AzureSQLDB is fixed; you want to enable this exact adapter.

      Aristobulo Cruz

      Hi Werner, yes I can do that, but if the adapter name isn't registered in the agent, the adapter creation script doesn't work.

      CREATE ADAPTER "AzureSQLDB" PROPERTIES 'display_name=Azure Adapter;description=Azure Adapter' AT LOCATION DPSERVER;

       

      Could not execute 'CREATE ADAPTER "AzureSQLDB" PROPERTIES 'display_name=Azure Adapter;description=Azure Adapter' AT ...'
      Error: (dberror) [403]: internal error: Cannot get adapter capabilities: exception 151002: Adapter type AzureSQLDB not registered.

      Werner Dähn
      Blog Post Author

      You have created an agent called DPAGENT that points to the dpagent installation where the adapter is deployed, yes?

       

      In that case, the SQL to execute would be

      CREATE ADAPTER "AzureSQLDB" AT LOCATION AGENT "DPAGENT";

      This contacts the DPAGENT and queries the metadata of the AzureSQLDB adapter, which was deployed there by you (via the agentconfig tool). I am a bit confused because in the create-remote-source you used as agent location "DPA_HNPRD".

      Werner Dähn
      Blog Post Author

      And I made a mistake as well: are we talking about AzureDataLake or AzureSQLDB? The first is about files in the Azure Blob Storage, the second about a SQL Server database.

      Above you wrote AzureSQLDB in one SQL statement, and that was the first one I saw.

      The AzureDataLakeAdapter has quite comprehensive documentation, I thought.

      e.g. https://github.com/rtdi/HanaSDIAzureDataLakeAdapter/blob/master/docs/Installation.md

       

       

      CREATE REMOTE SOURCE "DATALAKE_RTDI" ADAPTER "AzureDataLakeAdapter"
      AT LOCATION AGENT "desktop" CONFIGURATION
      	'<?xml version="1.0" encoding="UTF-8"?>
      	<ConnectionProperties name="azuredatalake">
      	  <PropertyEntry name="DataLakeEndpoint">https://xxxx.dfs.core.windows.net/</PropertyEntry>
      	  <PropertyEntry name="RootDir">/main</PropertyEntry>
      	  <PropertyEntry name="DefinitionDirectory">c:\temp\ADL\views</PropertyEntry>
      	  <PropertyEntry name="ServiceBusConnectionString">Endpoint=sb://yyyyy.servicebus.windows.net/;SharedAccessKeyName=RootManageSharedAccessKey</PropertyEntry>
      	</ConnectionProperties>'
      WITH CREDENTIAL TYPE 'PASSWORD' USING
        '<CredentialEntry name="StorageKey">
          <password>bbbb</password>
        </CredentialEntry>
        <CredentialEntry name="ServiceBusKey">
          <password>aaaa</password>
        </CredentialEntry>';
      Aristobulo Cruz

      Hi Werner, I have a data provisioning agent DPA_HNPRD. I need the AzureDataLakeAdapter for a demo with a json file.

       

      Please confirm the documentation; I get an error with this url:

       

      https://github.com/rtdi/HanaSDIAzureDataLakeAdapter/blob/master/docs/Installation.md

       

      so thanks.

       

      Werner Dähn
      Blog Post Author

      You understood that this adapter is something I wrote, that you must install it in SDI, and that I must grant you access to the repository on github so you can download it?

      Ugandhar Nandam

      Werner,

      I'm trying to create a remote source using the adapter provided by you but am facing the below issue.

      Can you please help?

      Also, can you tell me what the purpose of the Data Lake API in HANA SDI mentioned in the document is? If required, can you let me know how to install it in HANA.

      Regards,

      Ugandhar

      Werner Dähn
      Blog Post Author

      The Azure Data Lake API is just the Microsoft API this adapter uses to access the Azure Blob Storage. Microsoft provides multiple APIs for that, and which one to use is our choice. So there is nothing to install in Hana or the adapter.

       

      The error message says that you can log in to the Data Lake endpoint, only there is no "filetest" folder within that endpoint. Can you post a screenshot of what objects are in this filesystem?

      Ugandhar Nandam

      Thank you Werner for the update. I've created everything from scratch again and the remote source is created successfully. But I'm not able to read data from the json file.

      Followed all the steps but could not get data; please let me know what the issue is.

      Created Event Subscriptions.

      Created .view file in the Dictionary Directory as well.

      Screenshots below for reference.

      Werner Dähn
      Blog Post Author

      I can't see the problem based on the information you provided either. What I just did was:

      1. Create a remote source with the root directory "main".
      2. Placed the sampleData.json file there with similar contents to yours: an array of records with guid, balance and name.
      3. Created the view file with the same settings (first three fields only)
      4. Imported the view file as virtual table
      5. executed a select * from that virtual table
      6. Got all three records of the json file

      The view looks fine. The filter is okay, the accessor is okay; balance is a number but defined as a VARCHAR, so it will get converted, no problem either. The queue does not play a role here yet, as you read from the virtual table only.

      Yet you get zero records and no error. So either the sampleData.json file is empty or the remote source does not point to that directory.

      btw, can we discuss this in a github issue? Otherwise we clog this post too much.

      If nothing helps, let's have an online webmeeting.

      Ugandhar Nandam

      Happy to connect.

      Please let me know when we can connect.

      Gilles Van den Bleeken

      Hi Werner,

       

      Thank you for this nice blog. I also have a use case to connect to an Azure Datalake via SDI, could you please provide the installation reference?

       

      Kr,

      Gilles (gilles.vandenbleeken@ontexglobal.com)

      Prathyusha Garimella

      Hi Werner,

      I have one customer asking for the same use case. Could you please grant me access to your github for azuredatalakeadapter - prathyusha.garimella@sap.com. I can do quick POC. Does it have any limitations with regards to volume or data type ?

      Thanks,

      Prathy

      Jithendra Tirakala

      Hi Werner Dähn,

      Thank you for the nice blog. Could you please give us access to the github repo? We have a use case to connect to Azure Data Lake using SDI.

      Werner Dähn
      Blog Post Author

      Please send me an email with your github username and I will add you to the repo.

      Jithendra Tirakala

      Hi Werner Dähn,

      Thank you for the prompt reply! Please find my details below

      Email: XXXX

      Github user id : XXX

      Jithendra Tirakala

      Thank you so much! Deployed successfully!

      Bhaskara-Rao Mukkamala

      Hi Werner Dähn

       

      we have a use case; can you please share access to the Azure Data Lake adapter repo. My Mail: mukkamala2004@yahoo.com

      Amrith Kumar V M

      Hi Werner Dähn,

       

      Thank you for this nice and helpful blog. We have a requirement to set up an integration between BW/4HANA and the data lake, with read and write-back to the database. I feel this will help perform the integration; kindly share the access and provide any document reference. Also, I would like to understand the approximate effort (in terms of number of hours) needed to complete this setup.

      Thanks and Regards,

      Amrith Kumar V M