As soon as I finished my earlier post about the integration of the SAP Vora with the Azure HDInsight I started to wonder if it would be possible to reuse the Spark Extension with the Azure Databricks. It’s a new analytics platform based on the Apache Spark that enables an active collaboration between data scientists, technical teams and the business. It offers a fully managed Spark clusters that can be created within seconds and automatically terminated when not used. It can even scale up and down depending on the current utilization. As all Azure services, the Databricks natively connects to other cloud platform services like the Data Lake storage or Azure Data Factory. A big differentiator comparing to HDInsight is a tight integration with Azure AD to manage users and permissions which increase the security of your landscape.
That’s enough of theory, but if you’d like to get more information why the Azure Databricks is the way to go, please visit official Microsoft Documentation .
What I would like to present today is how to build the Spark cluster using Azure Databricks, connect it to the SAP Vora engine and expose the table to SAP HANA. The scope is very similar to the post about HDInsight and I will even re-use parts of the code. If something is not clear enough, please check the previous episode as I explained there how to get all required connection information or how to create a service principal.
CREATE DATABRICKS CLUSTER
As always, the first thing we need to do is to define a resource in Azure Portal. When using the Azure Databricks you’re billed based on the used virtual machines and the processing capability per hour (DBU). At this step we just define the service – we will deploy the cluster later.
There are two pricing tiers available. I used the Standard one, but in case you need additional features like ODBC endpoints or indexing / caching support then choose the Premium tier. A detailed summary of what’s included in each tier can be found on Microsoft pricing details.
I want the Databricks cluster to be integrated with my existing VNet as it will simplify the connection to the SAP Vora on AKS. During the deployment, Azure will create two new subnets in selected VNet and assign the IP ranges according to our input. If you don’t want to put the cluster to an existing network, then the Azure can deploy a separate VNet just for the Databricks. Communication would still be possible but an additional effort is required to enable VNet peering or expose the SAP Vora service to the internet.
The service is created within seconds and you can access the Databricks blade. Click on the Launch Workspace to log in to the service.
Sometimes it can take a couple of minutes to access the workspace for the first time, so please be patient.
Now we can define a cluster. Choose Clusters in the menu and then click on the Create Cluster button. Here we can decide about basic settings like autoscaling limits or cluster termination after a certain amount of time (this way you don’t pay for the compute when the cluster is not used). When the cluster is stopped, all VMs are deleted but the configuration is retained by the Databricks service. Later on, if you start the cluster again all virtual machines will be redeployed using the same configuration.
I decided to go with the smallest virtual machine size and allowed the cluster termination after 60 minutes of inactivity.
The Create Cluster screen is also a place where you can enter Spark configuration for the SAP Vora connection. Expand the Advanced Options and enter the following settings:
spark.sap.vora.host = <vora-tx-coordinator hostname> spark.sap.vora.port = <vora-tx-coordinator port> spark.sap.vora.username = <SAP Data Hub username> spark.sap.vora.password = <SAP Data Hub password> spark.sap.vora.tenant = <SAP Data Hub tenant> spark.sap.vora.tls.insecure = true
If you’d like to connect through SSH to cluster nodes, you can also enter the public key in the SSH tab (I won’t use this feature in this blog).
It takes a couple of minutes to provision VMs. You can see them in the Azure Portal:
When the cluster is ready you can see it in the Databricks workspace:
To access SAP Vora we will require the Spark Extension – the same that was used in the previous blog. There are two ways of installing the library – you can do it directly on the cluster or set up automatic installation. I think the second approach is better if you plan to use the Vora connection continuously – whenever the Databricks cluster is provisioned the extension is installed without additional interaction.
Choose Workspace from the menu create a new library in the Shared section:
Select JAR as the library type and upload the spark-sap-datasources-spark2.jar file.
On the next screen mark the Install automatically checkbox and confirm by clicking Install. The service deploys the library to all cluster nodes.
When the installation completes, we can execute the first test to verify the connectivity between Databricks and the SAP Vora. Go back to the Workspace menu and create a new Notebook in the user area:
Enter and execute the following script:
import sap.spark.vora.PublicVoraClientUtils import sap.spark.vora.config.VoraParameters val client = PublicVoraClientUtils.createClient(spark) client.query("select * from SYS.VORA_DATABASE").foreach(println) client.query("select * from SYS.SCHEMAS").foreach(println)
When the cluster finished the processing, it returned information about the database and available schemas. It means the addon was installed successfully and we can establish the connection to the SAP Vora.
EXPOSE THE DATA TO SAP VORA AND HANA DATABASE
The Databricks offers its unique distributed filesystem called DBFS. All data stored in the cluster are persisted in the Azure Blob Storage, therefore, you won’t lose them even if you terminate the VMs. But you can also access the Azure Data Lake Storage from the Databricks by mounting a directory on the internal filesystem.
The Azure Data Lake Storage is using POSIX permissions, so we need to use the Service Principal to access the data. What I like in the Databricks configuration is that you can use Azure Key Vault service to store the service principal credentials instead of typing them directly to configuration. In my opinion, it’s a much safer approach and it also simplifies the setup, as there is only a single place of truth. Otherwise, if you’d like to change the assigned service principal you’d have to go through all scripts and manually replace the password.
Create a new Key Vault service and then open following webpage to create a Secret Scope inside the Databricks service:
In my case it is:
Type the desired Name that we will reference later in the script. You can find the DNS name and resource ID in the properties of the Key Vault. If you are using a Standard tier of the Databricks you need to select All Users in the Manage Permissions list, otherwise, you will encounter an issue during saving.
In the created Azure Key Vault service add a new entry under “Create a secret” and type the service principal key in the Value field.
The bellow script mounts the ADLS storage in the Databricks File System. Modify the values in brackets with your environment details.
val configs = Map( "dfs.adls.oauth2.access.token.provider.type" -> "ClientCredential", "dfs.adls.oauth2.client.id" -> "<your-service-client-id>", "dfs.adls.oauth2.credential" -> dbutils.secrets.get(scope = "<scope-name>", key = "<key-name>"), "dfs.adls.oauth2.refresh.url" -> "https://login.microsoftonline.com/<your-directory-id>/oauth2/token") // Optionally, you can add <your-directory-name> to the source URI of your mount point. dbutils.fs.mount( source = "adl://<your-data-lake-store-account-name>.azuredatalakestore.net/<your-directory-name>", mountPoint = "/mnt/<mount-name>", extraConfigs = configs)
Add new section with the script in the existing notebook and execute it:
We can check if that’s working correctly by listing the files inside the directory:
The file is available so we can create SAP Vora table where we will upload the data:
import sap.spark.vora.PublicVoraClientUtils import sap.spark.vora.config.VoraParameters val client = PublicVoraClientUtils.createClient(spark) client.execute("""CREATE TABLE "default"."airports" ("Airport_ID" INTEGER, "Name" VARCHAR(70), "City" VARCHAR(40), "Country" VARCHAR(40), "IATA" VARCHAR(3), "ICAO" VARCHAR(4), "Latitude" DOUBLE, "Longitude" DOUBLE, "Altitude" INTEGER, "Timezone" VARCHAR(5), "DST" VARCHAR(5), "Tz" VARCHAR(40), "Type" VARCHAR(10), "Source" VARCHAR(15)) TYPE STREAMING STORE ON DISK""")
Read the airports.dat file to a DataFrame. The inferSchema option automatically detects each field type and we won’t need any additional data conversion.
val df = spark.read.option("inferSchema", true).csv("dbfs:/mnt/DemoDatasets/airports.dat") df.show()
And finally, we can push the data from the DataFrame into SAP Vora table:
import org.apache.spark.sql.SaveMode df.write. format("sap.spark.vora"). option("table", "airports"). option("namespace", "default"). mode(SaveMode.Append). save()
The only outstanding activity is to configure SAP Vora Remote Source in the HANA database…
…and display the content of virtual table:
My experience when working with Azure Databricks was excellent. I think the required configuration is much more straightforward compared to the HDInsight and I really enjoy the possibility to terminate the cluster when not active. As Databricks is the future way to work with Big Data I’m happy that existing SAP Vora extension can be used without any workarounds.