In another part of the SAP HANA Academy’s SAP HANA Vora series, Tahir Hussain Babar (Bob) walks through how to configure the SAP HANA Spark Controller in nine tutorial videos. With the SAP HANA Spark Controller, you can read your SAP HANA Vora tables from SAP HANA.
Every script that Bob uses throughout this nine-part series can be found here on GitHub.
How to Install Hive and Load Data
In the first video of the series, Bob walks through how to install Hive on a Hadoop cluster in Ambari. Before starting this tutorial series, Bob already has a Hadoop cluster with SAP HANA Vora running. In order to connect from SAP HANA, you must have a Hive Metastore. In either Ambari or Cloudera, one of the easiest ways to get one is to install Hive. To conclude the video, Bob loads some test data to ensure that the Hive installation was successful.
First in Ambari Bob chooses to add Hive as a service and follows the steps in the add services wizard. Bob adds his Hive password in the customize services step and keeps all of the wizard’s defaults during the Hive installation. Next, Bob stops and starts all of the services in Ambari.
Then, connected to the Vora instance in PuTTY, Bob logs in as the Hive user, logs into Hive and tests the connection by running a simple command. Next, as the EC2 user, Bob loads a simple table called SHA_Employee into his media folder. Then, again as the Hive user, Bob runs the create table select statement to create the table in Hive and confirms its existence.
Afterwards, Bob loads the SHA_Employee.dat file into HDFS and gives access rights to the Hive user. Back in Hive, Bob runs a select * to view the data in the table.
How to Install the SAP HANA Spark Controller
In the second video Bob details how to install the SAP HANA Spark Controller.
In PuTTY, as the EC2 user, Bob creates a new folder in the media folder called Spark Controller. Six of the files are available for free online, while two of the files are on the SAP Service Marketplace. Bob loads all six of the publicly available jar files, plus the AWS jar file and the Spark Controller rpm file from SAP.
Bob then installs the SAP HANA Spark Controller rpm file. The SAP HANA Spark Controller creates a user called hanaes and Bob logs in as that user and examines the contents of the Spark Controller.
Finally Bob loads the AWS file into the Spark Controller lib folder.
Placing Third Party Files into HDFS
Bob, in the series’ third video, walks through how to place the third party jar files and the spark assembly file into HDFS.
As the HDFS user in PuTTY, Bob creates a new sub folder in the lib folder for the third party files. Bob then puts the Spark Controller jar file into the Spark lib folder so that HDFS is aware of the jar file. Next, Bob puts the four third party files into the third party folder.
Then Bob checks to confirm that the files exist. Finally, Bob creates another folder for the hanaes user and allocates the rights to it. This folder will be used by the SAP HANA Spark Controller to cache files.
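The HDFS steps in this video can be sketched as a small Python helper that only builds the `hdfs dfs` commands without running them. This is a rough illustration, not Bob's script: the folder names and jar names passed in are hypothetical placeholders, and the real paths should be taken from the GitHub scripts.

```python
import shlex


def hdfs_staging_commands(lib_dir, third_party_dir, controller_jar,
                          third_party_jars, cache_dir, owner="hanaes"):
    """Build the hdfs dfs commands for staging the jars, as argument lists.

    All paths are supplied by the caller; nothing here is a real default
    of the SAP HANA Spark Controller.
    """
    cmds = [
        # Create the sub folder that will hold the third party jars.
        ["hdfs", "dfs", "-mkdir", "-p", third_party_dir],
        # Put the Spark Controller jar into the lib folder.
        ["hdfs", "dfs", "-put", controller_jar, lib_dir],
    ]
    # Put each third party jar into the third party folder.
    cmds += [["hdfs", "dfs", "-put", jar, third_party_dir]
             for jar in third_party_jars]
    # List the folder to confirm that the files exist.
    cmds.append(["hdfs", "dfs", "-ls", third_party_dir])
    # Create the cache folder and hand it over to the controller user.
    cmds.append(["hdfs", "dfs", "-mkdir", "-p", cache_dir])
    cmds.append(["hdfs", "dfs", "-chown", owner, cache_dir])
    return cmds


def as_shell(cmds):
    """Render the commands as shell lines for review before running them."""
    return "\n".join(shlex.join(cmd) for cmd in cmds)
```

Printing `as_shell(...)` first and pasting the lines into PuTTY keeps the review step that Bob performs by hand in the video.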
Changing the Configuration Files
In the fourth video of the series Bob shows how to configure the SAP HANA Spark Controller by modifying the hanaes-site.xml file.
In PuTTY as the hanaes user, Bob enters the Spark Controller conf folder and opens the hanaes-site.xml file to check the paths. The hanaes server port (7860 in Bob’s example) must be opened on the Hadoop machine. If you’re using a multi-node system then you must open the port for the master node.
Note that when the data is returned to SAP HANA a range of ports are opened up. Therefore you must make sure that the SAP HANA box can connect to this range of ports on your Hadoop server. The easiest way to accomplish this is to whitelist the SAP HANA box when connecting to the Hadoop box to allow all ports.
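Before moving on, it can help to verify from the SAP HANA box that the ports are actually reachable. The following is a minimal sketch using only the Python standard library; the host and the list of ports to probe are whatever applies to your own system (7860 is simply the port from Bob's example).

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def unreachable_ports(host, ports, timeout=3.0):
    """Return the subset of ports that could not be reached on host."""
    return [port for port in ports if not port_is_open(host, port, timeout)]
```

For example, `unreachable_ports("my-hadoop-master", [7860])` returning an empty list suggests the firewall rule is in place. Note that a port only answers if something is listening on it, so this check is most meaningful once the Spark Controller has been started.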
Next you must insert the host name of your machine for your Spark Controller next to <value> underneath sap.hana.es.driver.host. Bob leaves a . in the IP address to assist with the testing later on.
Scrolling down through the rest of the hanaes-site.xml file Bob inserts the correct HDP version name for both the Spark yarn and Spark driver property. Also, you should change the minExecutors and maxExecutors values so they’re optimally tuned to get the best performance out of your system. Then Bob pastes in a bit of code from GitHub that sets the value for the Spark executor memory.
After saving the file, Bob copies the AWS.resolver jar file from the lib folder and then pastes it in as a final property in the hanaes-site.xml file to make sure it’s aware of the AWS resolver.
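Bob makes these changes by hand in a text editor, but the same kind of edit can be scripted. The sketch below assumes that hanaes-site.xml follows the usual Hadoop layout of a `<configuration>` root containing `<property>` elements with `<name>` and `<value>` children; that layout is an assumption here, so check it against your own file first. Only sap.hana.es.driver.host comes from the video; any other property name you pass in is your own.

```python
import xml.etree.ElementTree as ET


def get_property(root, name):
    """Return the value of the named property, or None if it is absent."""
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return None


def set_property(root, name, value):
    """Set the named property, adding a new <property> element if needed."""
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            value_el = prop.find("value")
            if value_el is None:
                value_el = ET.SubElement(prop, "value")
            value_el.text = value
            return prop
    prop = ET.SubElement(root, "property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = value
    return prop
```

A typical use would be `tree = ET.parse("hanaes-site.xml")`, then `set_property(tree.getroot(), "sap.hana.es.driver.host", "<your host>")`, then `tree.write("hanaes-site.xml")`. ElementTree drops XML comments by default when parsing, so keep a backup of the original file.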
In the fifth video in the series, Bob details how to configure the hive-site.xml file. This file tells the SAP HANA Spark Controller all of the configuration details about the Hive deployment.
In PuTTY, Bob goes into the Hive conf folder to find the hive-site.xml file and places it into the SAP HANA Spark Controller conf folder. Bob also places the hive-site.xml file into the Spark Controller conf folder. In the hive-site.xml file, Bob removes the s from the delay and time values and inserts the value below for the Hive security authorization manager.
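Removing the trailing s from the delay and time values is easy to get wrong by hand, so here is a small Python sketch of that one step. It assumes the affected values look like a whole number followed by a single s (for example 5s); the property names in the usage note are hypothetical, and you should list only the properties Bob changes in the video.

```python
import re

# A whole number followed by a lone "s", for example "5s" or "1800s".
_SECONDS_VALUE = re.compile(r"(\d+)s")


def strip_seconds_suffix(value):
    """Return the value without its trailing "s", or unchanged if it has none."""
    match = _SECONDS_VALUE.fullmatch(value.strip())
    return match.group(1) if match else value


def clean_time_values(properties, names):
    """Return a copy of properties with the suffix stripped for listed names only."""
    return {
        key: strip_seconds_suffix(val) if key in names else val
        for key, val in properties.items()
    }
```

Calling `clean_time_values({"example.retry.delay": "5s"}, {"example.retry.delay"})` gives back the same mapping with the value changed to "5". Restricting the change to an explicit list of names avoids touching values that merely happen to end in s.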
The next step must be performed for every single node in your cluster. Find the yarn-shuffle jar file and copy it to the Hadoop yarn folder for each node.
Next, in Ambari Bob configures the mapreduce.application.classpath in the advanced mapred-site option to include the current version of HDP he is using. Then in Yarn Bob appends spark_shuffle to the end of mapreduce_shuffle in the yarn.nodemanager.aux-services.
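The yarn.nodemanager.aux-services value is a comma-separated list, so appending spark_shuffle amounts to a small list operation. A minimal sketch of that edit, written so that running it twice does not add the entry twice:

```python
def append_aux_service(current, service="spark_shuffle"):
    """Append service to a comma-separated aux-services value if it is missing."""
    services = [item.strip() for item in current.split(",") if item.strip()]
    if service not in services:
        services.append(service)
    return ",".join(services)
```

With the default Ambari value, `append_aux_service("mapreduce_shuffle")` returns `"mapreduce_shuffle,spark_shuffle"`, which is the value to paste into the Yarn configuration in Ambari.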
Then Bob shows how to add the custom property shown below to Yarn.
Finally, Bob restarts the Yarn and MapReduce services in Ambari.
How to Start the Spark Controller
Continuing on with the series Bob shows how to start the SAP HANA Spark Controller.
As the hanaes user in PuTTY, Bob goes into the bin folder of the Spark Controller folder. To start the Spark Controller, enter ./hanaes start. To confirm it has started, Bob pulls up the log file and sees that there is an error due to an unknown host. After fixing his host name in the hanaes-site.xml file (removing the . he placed in the IP address earlier), Bob successfully starts the Spark Controller. Bob confirms this by checking that his four third party files in HDFS have been registered.
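The unknown host error in this video comes from a stray . in the IP address. If your driver host is given as an IP address rather than a host name, a quick sanity check with Python's standard ipaddress module would catch that kind of typo before the controller is started. This is only a sketch of such a check and is not part of the Spark Controller itself.

```python
import ipaddress


def is_valid_ip(value):
    """Return True if value parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(value)
        return True
    except ValueError:
        return False
```

For example, `is_valid_ip("10.0.0.5")` is true while `is_valid_ip("10.0.0.5.")` is false. The check does not apply to host names, which need to be resolvable through DNS or the hosts file instead.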
Creating a Remote Data Source for HIVE Tables in SAP HANA Studio
In the seventh video of the series, Bob shows how to create a remote data source in SAP HANA Studio that will connect to the Hive system through the SAP HANA Spark Controller. Connecting to a Hive table will ensure that the SAP HANA Spark Controller is working.
In Bob’s SAP HANA SPS10 version of SAP HANA Studio he logs into his SAP HANA system and manually creates a user. Then Bob creates a new system using that recently created user and opens a SQL console in this new system. Bob enters the syntax shown below to establish a connection to the SAP HANA Spark Controller.
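Bob's exact statement is in the GitHub scripts and should be treated as the reference. Purely as an illustration of the general shape of such a statement, the Python sketch below assembles a CREATE REMOTE SOURCE command. The adapter name sparksql, the configuration keys (server, port, ssl_mode) and the credential format are assumptions recalled from SAP's Spark Controller documentation and are not confirmed by this post; the remote source name, server and password are hypothetical placeholders.

```python
def create_remote_source_sql(name, server, port=7860, user="hanaes",
                             password="<password>"):
    """Build a CREATE REMOTE SOURCE statement for the Spark Controller.

    Assumptions (verify against SAP documentation and Bob's script):
    the "sparksql" adapter name, the configuration keys and the
    credential string layout. Inputs are not escaped, so pass only
    trusted values.
    """
    config = f"server={server};port={port};ssl_mode=disabled"
    return (
        f'CREATE REMOTE SOURCE "{name}" ADAPTER "sparksql" '
        f"CONFIGURATION '{config}' "
        f"WITH CREDENTIAL TYPE 'PASSWORD' "
        f"USING 'user={user};password={password}'"
    )
```

The resulting string can be pasted into the SQL console once the server, port and credentials match your own hanaes-site.xml settings.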
Once his connection is established, Bob creates a new virtual table and views its creation documented in the log file. Finally, Bob can open his Spark_Employe Spark table in HANA Studio and sees it populated with data from Hive.
Configuring the SAP HANA Spark Controller to Connect to SAP HANA Vora Tables
In the second to last video in the series, Bob details how to configure the SAP HANA Spark Controller to connect to a list of SAP HANA Vora tables.
In PuTTY as the root EC2 user, Bob navigates to the hanaes-site.xml file and inserts a few properties. The properties are for the SAP HANA Vora host and the Spark Vora Zkurls (Zookeeper Server). For each of the properties insert the IP Address(es) and port number(s) for each of the nodes that are running SAP HANA Vora and/or the Zookeeper Server in your system.
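When several nodes are involved, the property values become lists of address and port pairs. The sketch below builds such a list in Python. It assumes the values are written as comma-separated host:port entries, which is an assumption to check against Bob's script; the hosts and ports in the usage note are hypothetical.

```python
def join_endpoints(endpoints):
    """Join (host, port) pairs into a comma-separated host:port string."""
    parts = []
    for host, port in endpoints:
        port = int(port)
        if not host or not 0 < port < 65536:
            raise ValueError(f"bad endpoint: {host!r}:{port!r}")
        parts.append(f"{host}:{port}")
    return ",".join(parts)
```

For a two-node Zookeeper setup, `join_endpoints([("10.0.0.5", 2181), ("10.0.0.6", 2181)])` returns `"10.0.0.5:2181,10.0.0.6:2181"`, ready to paste as the property value.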
Next you must put the spark-sap-datasources jar file in the SAP HANA Spark Controller lib folder to enable the Spark extensions.
Finally, Bob stops the Spark Controller before starting it again and confirms that it’s running by viewing the log file.
How to Connect to SAP HANA Vora Tables in SAP HANA Studio
In the final video of the series Bob shows how to connect to SAP HANA Studio so that it can read SAP HANA Vora tables.
First, Bob goes to this file on the SAP HANA Academy’s GitHub page and goes over the instructions starting on line 212.
Bob sudos to the HDFS user. He then creates a folder for the Vora user in HDFS and gives access to the Vora user. Next, Bob creates a simple test table by running the commands shown below.
Now that the test.csv table is in the HDFS system, Bob creates an SAP HANA Vora table based on it. Bob starts Spark Shell and then creates the table using the SAP SQL context. Bob covers this step in much more depth in this video in the Vora series.
Back in SAP HANA Studio, Bob refreshes the remote data source. Bob now sees the testtable in the spark_velocity (or Vora) folder. Bob finally creates a virtual table and puts it in his newly created SAP HANA Studio schema and views the contents with a data preview in SAP HANA Studio.
For more SAP HANA Vora tutorial videos, please check out the playlist.
SAP HANA Academy – Over 1,200 free tutorial videos on SAP HANA, SAP Analytics and the SAP HANA Cloud Platform.