I think everyone who ever worked with big data is already familiar with the Hadoop clusters. There are many Hadoop distributions available, like the HortonWorks or Cloudera. But you can also provision the cluster in the Cloud. Microsoft Azure offers service called HDInsights that can be used to quickly spin up a big data clusters without thinking about underlying infrastructure. It seamlessly integrates with Azure Data Lake Storage or Azure Storage Account creating a resilient and cost-effective solution. If you prefer PaaS services as I do, then you should definitely check it out.
But the reason to write this post is not to list the advantages of the HDInsight over other Hadoop distributions. In my last two posts, I focused on the SAP DataHub and if you read one of Frank’s Schuler articles about its architecture then you already know that the integral part of the solution is SAP Vora. And as the HDInsight meets the SAP Vora requirement I was wondering if I can enable integration between those two products.
The SAP Data Hub and SAP Vora run on Kubernetes engine and if you’d like to follow this guide I highly recommend to use my previous post where I describe how to make it running in Azure.
To enable the integration between SAP Vora and the Hadoop cluster we need to install the Spark extension available to download from SAP Launchpad. Depending if you’re running a pure Vora or the DataHub + Vora you should download the correct package:
SAP Vora 2 Spark Extension – when you have a pure SAP Vora installation
SAP Data Hub Spark Extension – when working with SAP Data Hub
There are a few differences between those products but in most cases, it uses the same code base. The major difference I noticed is different handling of authentication, as the SAP Data Hub implements multitenancy. In the following guide, I will focus on the SAP Data Hub extension, but the steps in the majority will work exactly the same for the SAP Vora plug-in. To get a full understanding of how to implement the connection I decided to follow the manual installation mode. As always I recommend getting familiar with the Installation Guide which contains much valuable information.
PROVISION HDINSIGHT CLUSTER
I think deployment of HDInsight cluster is quite complex compared to other Azure PaaS services and requires a bit of Hadoop understanding. But if you follow my steps you should not encounter any issues. The very first question to consider is what sort of the HDInsight cluster you need. The Azure choice is quite large and you can choose from seven different configurations that include solution optimized for data analytics (Spark), NoSQL (HBase) or messaging (Kafka).
When creating an HDInsight cluster in Hadoop I found it better to switch to the Custom settings as it gives much more control over the server configuration. As I want to focus on the data analytics I decided to go with the Spark cluster. If you choose a different type there it may be required to perform additional configuration.
Let’s begin! Open HDInsight from the available services and choose the name of the cluster. Type the default password, which will be used also to connect to the cluster nodes through SSH.
The second step focuses on the networking so I choose to connect the service to my default VNet – it will simplify the connection to the SAP Vora. To separate the traffic between different Azure services I created additional subnet just for HDInsight workload:
The thirds step is all about the storage. There is a significant difference between HDInsight and other Hadoop distributions. Usually the storage is distributed across nodes that belong to the cluster, but HDInsight natively integrates with Data Lake or Storage Account. Such implementation have a number of benefits like lower cost of storing the data or possibility to destroy the cluster and keeping the data at the same time.
You can choose to use the ADLS or the Storage Account as the underlying file system. I decided to go with the Data Lake as it’s optimized for parallel processing and therefore suits the HDInsight better. The blob storage has certain limitations, like the maximum available space.
There is also a requirement to configure the Service Principal to enable access to ADLS storage. You can create the account directly in the portal. The advantage is that it automatically creates the certificate that will be used for authentication which makes the process easier.
The service principal requires POSIX (rwx) permissions to read and write to Data Lake storage. Click on the Access entry and you’ll be redirected to a wizard, where you select the account and choose permissions.
Confirm by clicking Select and choose Run on the next screen.
Leave the default settings on the optional step Application and stop for a moment when deciding about the cluster size. Depending on the usage and resiliency requirements you can choose how many worker nodes you want to deploy. As I will use the cluster just for testing purposes I decided to go with a single worker node. You can adjust the number and the size of nodes after the cluster is provisioned. To ensure the system is highly available you can’t decrease the number of head nodes below two.
You don’t have to maintain any custom script actions in the sixth step so you can jump to cluster validation and deployment:
INSTALL SAP DATA HUB SPARK EXTENSION
After around 20 minutes the cluster is deployed and we can start the installation of the Spark extension. You can log in to the operating system of the master node using the following command:
Execute following commands to list all nodes that belongs to the cluster:
export CLUSTERNAME=<cluster_name> curl -u admin -sS -G "https://$CLUSTERNAME.azurehdinsight.net/api/v1/clusters/$CLUSTERNAME/hosts" | jq '.items.Hosts.host_name'
Download the Spark extension to a temp directory and unzip the file. We need to upload the spark-sap-datasources-spark2.jar to ADLS filesystem and then we will write a script that will distribute the file to all HDInsight nodes.
unzip DHSPARKINT04_1-70003482.ZIP hadoop fs -mkdir /SAPVora/ hadoop fs -put SAPDataHub-SparkExtension/spark-sap-datasources-spark2.jar /SAPVora/
Now create a script that will distribute the file to each cluster node and upload it to the ADLS:
vi distribute-Spark.sh #!/bin/bash mkdir /opt/SAPVora/ hadoop fs -get /SAPVora/spark-sap-datasources-spark2.jar /opt/SAPVora/
We can automate the distribution the file the Spark extension file using the HDInsight Script Action. Go to Azure portal and open the cluster configuration. Choose Script Action from the menu and click Submit New.
The script type should be set to Custom. Type the desired script name. The uploaded script URL follows the format:
The script should be executed on head nodes and worker nodes.
It takes a few seconds to execute the script. If everything went well, you will see a green mark next to the script name:
You can verify the file exists on the master node:
ls -ltr /opt/SAPVora
The first step in order to configure the SAP Data Hub Spark Extension is to identify the hostname and the port number of the tx-coordinator service on the SAP Data Hub cluster.
Log in to the Kubernetes cluster dashboard using following PowerShell command from your local computer:
az aks browse --resource-group <AKS_Resource_Group> --name <AKS_Name>
In the Kubernetes Dashboard click on Services and find the vora-tx-coordinator-ext service.
You’re interested in the node hostname and the endpoint port
Note: your SAP Vora and the HDInsight cluster have to be able to communicate with each other. My scenario uses the same VNet for both services, but if that’s not the case for you, you should expose the vora-tx-coordinator-ext externally.
Now we can configure the Spark2 component through the Ambari UI, which can be accessed through a web browser. The URL follows the format:
Do not enter the configuration directly to the files on the nodes, as whenever you restart the Spark service the settings will be overwritten.
Click on Spark2 and then choose Configs from the top menu. In the long list of files find Custom spark2-defaults. Add bellow parameters:
spark.sap.vora.host = <vora-tx-coordinator hostname> spark.sap.vora.port = <vora-tx-coordinator port> spark.sap.vora.username = <SAP Data Hub username> spark.sap.vora.password = <SAP Data Hub password> spark.sap.vora.tenant = <SAP Data Hub tenant> spark.sap.vora.tls.insecure = true
Save the settings and restart affected services. We can now validate the connection between HDInsight and the SAP Vora. Go back to the Master Node of the Hadoop cluster and launch Scala:
spark-shell --jars /opt/SAPVora/spark-sap-datasources-spark2.jar
Execute following commands in the editor:
import sap.spark.vora.PublicVoraClientUtils import sap.spark.vora.config.VoraParameters val client = PublicVoraClientUtils.createClient(spark) client.query("select * from SYS.VORA_DATABASE").foreach(println) client.query("select * from SYS.SCHEMAS").foreach(println)
If the above commands were executed successfully and you didn’t receive any exceptions it means the installation is complete and we can try to execute a more advanced test!
LOAD DATA TO SAP VORA AND ACCESS THEM IN SPARK
Currently, SAP Vora extension for Spark doesn’t support Data Lake authentication using certificates, therefore we need to set up a new Service Principal that will be used to access the ADLS from the Vora tools. If you have followed my previous post about extracting data you can re-use the same username and password. If not, then execute the following script:
$SPName = "dataextractSP" $URI = "http://" + $SPName $app = New-AzADApplication -DisplayName $SPName -IdentifierUris $URI $servicePrincipal = New-AzADServicePrincipal -ApplicationId $app.ApplicationId $BSTR = [System.Runtime.InteropServices.Marshal]::SecureStringToBSTR($servicePrincipal.Secret) $password = [System.Runtime.InteropServices.Marshal]::PtrToStringAuto($BSTR) [Runtime.InteropServices.Marshal]::ZeroFreeBSTR($BSTR) $tenant = Get-AzTenant Write-Host '' Write-Host 'Below information will be required during SAP Data Hub configuration:' Write-Host 'Service Principal Name :' $SPName Write-Host 'Service Principal Tenant ID :' $tenant.Id Write-Host 'Service Principal Client ID :' $app.ApplicationId Write-Host 'Service Principal Key :' $password
Assign the full permissions to the Data Lake in Azure Portal. There is an option that I selected to save the permissions as default. Otherwise, if you create new files or directories the service principal won’t inherit the access from the parent.
Log in to HDInsight through SSH and download the test dataset containing airports from around the world and upload it to the Azure Data Lake:
Now log in to the SAP Vora Tools and create a new relational table using the service principal credentials and pointing to the downloaded file:
In the next step I corrected the columns names and the data types. I think SAP Vora tries to identify the correct data type based on the first row only (which is a bad idea) so initially I run into many mapping issues. But after setting below parameters the import went fine.
You can see a data preview in the Vora Tools:
Now let’s try to access the table in Spark and do some operations. Open Spark and establish a connection with SAP Vora as we did before. But this time instead of selecting a dummy data I will display a few rows from the uploaded dataset.
spark-shell --jars /opt/SAPVora/spark-sap-datasources-spark2.jar import sap.spark.vora.PublicVoraClientUtils import sap.spark.vora.config.VoraParameters val client = PublicVoraClientUtils.createClient(spark) client.query("""select "Country", "IATACode", "Name", "City" from "default"."airports" where "Country"='Poland'""").foreach(println)
Let’s import that list to a DataFrame count the rows:
val airport_DF = spark.read. format("sap.spark.vora"). option("table", "airports"). option("namespace", "default"). load()
We can save the DataFrame to Hive Metastore:
airport_DF.write. format("orc"). saveAsTable("airport_orc") spark.sql("""SELECT * FROM airportxxx_orc WHERE COUNTRY='Poland'""").show()
And to finish I’m going to create a table in SAP Vora and transfer subset of data from the Hive metastore.
import org.apache.spark.sql.SaveMode client.execute("""CREATE TABLE "default"."airports_poland" ("City" VARCHAR(40), "Name" VARCHAR(70), "ICAOCode" CHAR(4)) TYPE STREAMING STORE ON DISK""") val airports_poland_DF = spark.sql("""SELECT City, Name, ICAOCode FROM airport_orc WHERE COUNTRY='Poland'""") airports_poland_DF.count() airports_poland_DF.show() airports_poland_DF.write. format("sap.spark.vora"). option("table", "airports_poland"). option("namespace", "default"). mode(SaveMode.Append). save()
You can verify the table exist in the SAP Vora Tools:
ACCESS SAP VORA FROM THE HANA DATABASE
There is one more thing I would like to present. The big advantage of the SAP VORA is that it can expose the data stored in the Hadoop cluster to the SAP HANA database. We have a sample data uploaded to the storage, so let’s try to set up the connection.
The SAP HANA Wire protocol is enabled by default on the SAP Vora installation, so we just need to configure the remote connection in SAP HANA. You can get the endpoint details from the Kubernetes Dashboard. The vora-tx-coordinator-ext service expose additional port that should be used:
To create a remote connection, you can either use SQL command or the SAP HANA Studio.
CREATE REMOTE SOURCE HDInsight_VORA ADAPTER "voraodbc" CONFIGURATION 'ServerNode=<hostname>:<port>;Driver=libodbcHDB;sslValidateCertificate=false;encrypt=true' WITH CREDENTIAL TYPE 'PASSWORD' USING 'user=<tenant>\<username>;password="<password>"';
You can also do it fromt the HANA studio:
Now I’m able to create a virtual table:
And display the data preview:
The test is completed. As you can see the integration between the HDInsight and the SAP Vora works without any issues and you can easily access the tables from the Spark or HANA database. If you are interested in building more advanced applications check out the SAP Vora Developer Guide.