werner_daehn
Active Contributor

Design goal


When dealing with large amounts of data, there are different storage concepts. SAP favors multi-temperature storage inside the database, but the majority of projects use an object store - in the case of Microsoft Azure, that would be the Azure Data Lake. It is cheap, can grow to virtually any size, is secure and can handle all types of data.

To make this data accessible via Hana as well, a new SDI adapter is needed. I wrote one that provides access to the data in Azure Data Lake and Azure Service Bus. With this adapter the data can be used by Hana on premise, Hana Cloud, SAP Data Warehouse Cloud, S/4Hana or SAP BW.

Regarding the supported payload formats, Avro and Json are the popular ones in Azure. In the example below, the Data Lake contains millions of files, each representing one patient record in the FHIR standard. Whenever new files are uploaded to Azure, Azure Event Grid posts this information into a Service Bus queue the adapter is listening on. The adapter works with any Json or Avro format, though - no limitations.

My sample file in Azure looks like this:
{
  "resourceType": "Patient",
  "id": "example",
  "identifier": [
    { "use": "usual", "system": "urn:oid:1.2.36.146.595.217.0.1" }
  ],
  "active": true,
  ..
  ..

and all files should be queryable via Hana like any other table.



 

Requirements


 

This creates a couple of requirements for this Data Lake adapter:

  • Parse Json and Avro files.

  • As these file formats allow for nested data, a mechanism to blend the nested Json/Avro data into the relational Hana data model must exist.

  • When files are changed, deleted or new files arrive, the changes must be visible in Hana right away.

  • One Data Lake storage contains different kinds of files, each with its own structure, which must be treated independently.


The solution uses the concept of "views" in the Azure Data Lake adapter to define the transformation from the nested data into Hana tables. For the above example, the user creates a file named patient.view with the following content:
{
  "namefilter": "patient.*\\.json",
  "from": "$",
  "columns": [
    {
      "accessor": "$.id",
      "alias": "ID",
      "datatype": "VARCHAR(256)"
    },
    {
      "accessor": "$.identifier[0].system",
      "alias": "SYSTEM_IDENTIFIER",
      "datatype": "VARCHAR(40)"
    },
    {
      "accessor": "$.gender",
      "alias": "GENDER",
      "datatype": "VARCHAR(10)"
    },
    {
      "accessor": "$.birthDate",
      "alias": "BIRTHDATE",
      "datatype": "DATE",
      "format": "yyyy-M-d"
    }
  ],
  "queuename": "filechanges"
}

The view projection definitions (from and columns) carry the following information:

  • What nesting level is driving the row generation? from: $ means the root level, so one row per file.

  • Which field to output?

  • What should be its Hana column name?

  • What should be its Hana data type?

  • What is the format of the value in case a data type conversion is needed, e.g. from String to Date?


In addition to the projection definitions, a name and directory filter can be specified. Because this format definition makes sense only for json files containing patient data, and all these files follow the pattern patient*.json in the Data Lake storage, a namefilter is used. Hence this view considers only those files.

The last setting in the above example is the Service Bus queue name where Azure puts all file change events. This is another feature of the Azure Data Lake: it can trigger events whenever a file is modified, and using Azure Event Grid these events can be routed into a Service Bus queue.
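
To make the mapping concrete, the relational shape this patient view exposes in Hana, derived from the column list above, would be the following. This is an illustration only - the table name is made up and the actual virtual table is created in the next section.

-- Illustration only: explicit DDL equivalent of the structure the patient view exposes
CREATE COLUMN TABLE patient_shape (
    "ID"                VARCHAR(256), -- from $.id
    "SYSTEM_IDENTIFIER" VARCHAR(40),  -- from $.identifier[0].system
    "GENDER"            VARCHAR(10),  -- from $.gender
    "BIRTHDATE"         DATE          -- from $.birthDate, parsed with format yyyy-M-d
);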

 

 

Consuming Azure Data Lake data


This view is imported as a virtual table just like with any other SDI adapter, either via SQL or via the various UIs. As the definition file is called patient.view, its remote name is patient and the chosen local name is v_patient. In the screenshot above this table was selected and it returned the data contained in the files.
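
For the SQL route, the virtual table creation could look roughly like the sketch below. The remote source name AZUREDATALAKE is a placeholder; the remote object name patient comes from the view file.

-- Sketch: expose the remote "patient" view as a virtual table
-- (the remote source name is a placeholder)
CREATE VIRTUAL TABLE v_patient
    AT "AZUREDATALAKE"."<NULL>"."<NULL>"."patient";

-- Federated query reading the Data Lake files directly
SELECT ID, GENDER, BIRTHDATE FROM v_patient;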


 

Creating the Hana remote source object


The remote source definition asks for the URL of the Data Lake endpoint and optionally - if realtime subscriptions are to be used as well - for the Service Bus connection string. Both values can simply be copied from Azure.

 

As a Storage Account in Azure can have multiple containers, the file system is a mandatory parameter, and it can be augmented with additional sub directories. For example, the setting /main/FHIR configures this remote source to read data only from the file system called "main", and within that only from the "FHIR" directory and its sub directories.

The security keys for the Storage Account and the Service Bus are placed into the secure credential storage.

The only additional parameter needed is an adapter-local directory where all view definition files are placed by the user.
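
For reference, creating such a remote source via SQL could look roughly like the sketch below. The adapter name, agent name and all property names are assumptions for illustration only; the actual names come from the adapter's registration and its remote source description.

-- Sketch with placeholder names: adapter, agent, property names and values are illustrative
CREATE REMOTE SOURCE "AZUREDATALAKE" ADAPTER "AzureDataLakeAdapter"
AT LOCATION AGENT "MYAGENT"
CONFIGURATION '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<ConnectionProperties name="connection">
    <PropertyEntry name="endpoint">https://mystorageaccount.dfs.core.windows.net</PropertyEntry>
    <PropertyEntry name="filesystem">/main/FHIR</PropertyEntry>
    <PropertyEntry name="viewdirectory">/usr/sap/dataprovagent/views</PropertyEntry>
    <PropertyEntry name="servicebusconnection">Endpoint=sb://mynamespace.servicebus.windows.net/;...</PropertyEntry>
</ConnectionProperties>'
WITH CREDENTIAL TYPE 'PASSWORD'
USING '<CredentialEntry name="credential">
    <user>mystorageaccount</user>
    <password>storage-account-access-key</password>
</CredentialEntry>';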


 

Realtime Subscriptions


Using Hana SDI realtime subscriptions, changes in the Data Lake are captured and the target table in Hana gets updated with the new data immediately.
CREATE COLUMN TABLE t_patient LIKE v_patient;

CREATE REMOTE SUBSCRIPTION s_patient
ON v_patient TARGET TABLE t_patient;

ALTER REMOTE SUBSCRIPTION s_patient QUEUE; -- start capturing changes

INSERT INTO t_patient SELECT * FROM v_patient; -- initial load

ALTER REMOTE SUBSCRIPTION s_patient DISTRIBUTE; -- apply queued and all future changes

SELECT * FROM t_patient; -- returns old and changed file contents

 


 

The corresponding setup in Azure is very easy to achieve. All that needs to be done is to define a new Event on the file system and use a previously created Service Bus queue as the endpoint.


 

Summary


With that, not only can all Azure Data Lake data be read in federated mode, but the data can also be cached in Hana tables for best query performance - and this data is always current, thanks to the realtime subscriptions. The changes are even transactionally consistent.

 

 