In another part of the SAP HANA Academy’s SAP HANA Vora series, Tahir Hussain Babar (Bob) walks through how to configure the SAP HANA Spark Controller in nine tutorial videos. With the SAP HANA Spark Controller, you will be able to read your SAP HANA Vora tables from SAP HANA.


Every script that Bob uses throughout this nine-part series can be found here on GitHub.


How to Install Hive and Load Data

Screen Shot 2015-10-23 at 4.17.01 PM.png

In the first video of the series, Bob walks through how to install Hive on a Hadoop cluster in Ambari. Before starting this tutorial series, Bob already has a Hadoop cluster with SAP HANA Vora running. In order to connect from SAP HANA, you must have a Hive Metastore. In either Ambari or Cloudera, one of the easiest ways to get this is to install Hive. To conclude the video, Bob loads some test data to confirm that the Hive installation was successful.


First, in Ambari, Bob chooses to add Hive as a service and follows the steps in the add services wizard. Bob adds his Hive password in the customize services step and keeps all of the wizard’s defaults during the Hive installation. Next, Bob stops and starts all of the services in Ambari.

Screen Shot 2015-10-26 at 7.11.36 PM.png

Then, connected to the Vora instance in PuTTY, Bob logs in as the Hive user, logs into Hive and tests the connection by running a simple command. Next, as the EC2 user, Bob loads a simple table called SHA_Employee into his media folder. Then, again as the Hive user, Bob runs the create table select statement to create the table in Hive and confirms its existence.

Screen Shot 2015-10-26 at 7.15.45 PM.png

Afterwards, Bob loads the SHA_Employee.dat file into HDFS and gives access rights to the Hive user. Back in Hive, Bob runs a select * to view the data in the table.

Screen Shot 2015-10-26 at 7.21.40 PM.png

How to Install the SAP HANA Spark Controller

Screen Shot 2015-10-26 at 2.41.11 PM.png

In the second video, Bob details how to install the SAP HANA Spark Controller.


In PuTTY, as the EC2 user, Bob creates a new folder in the media folder called Spark Controller. Six of the files are freely available online, while the other two are on the SAP Service Marketplace. Bob loads the six publicly available jar files, along with the AWS jar file and the Spark Controller rpm file from SAP.

Screen Shot 2015-10-26 at 7.35.13 PM.png

Bob then installs the SAP HANA Spark Controller rpm file. The SAP HANA Spark Controller creates a user called hanaes, so Bob logs in as that user and examines the contents of the Spark Controller.

Screen Shot 2015-10-26 at 7.37.59 PM.png

Finally, Bob loads the AWS file into the Spark Controller lib folder.


Placing Third Party Files into HDFS

Screen Shot 2015-10-26 at 2.42.35 PM.png

In the series’ third video, Bob walks through how to place the third-party jar files and the Spark assembly file into HDFS.


As the HDFS user in PuTTY, Bob creates a new subfolder in the lib folder for the third-party files. Bob then puts the Spark Controller jar file into the Spark lib folder, which makes HDFS aware of the jar file. Next, Bob puts the four third-party files into the third-party folder.

Screen Shot 2015-10-26 at 9.38.34 PM.png

Then Bob checks to confirm that the files exist. Finally, Bob creates another folder for the hanaes user and allocates the rights to it. This folder will be used by the SAP HANA Spark Controller to cache files.


Changing the Configuration Files

Screen Shot 2015-10-26 at 2.45.51 PM.png

In the fourth video of the series, Bob shows how to configure the SAP HANA Spark Controller by modifying the hanaes-site.xml file.


In PuTTY, as the hanaes user, Bob enters the Spark Controller conf folder and opens the hanaes-site.xml file to check the paths. The hanaes server port (7860 in Bob’s example) must be opened on the Hadoop machine. If you’re using a multi-node system, then you must open the port on the master node.

Screen Shot 2015-10-29 at 11.01.43 AM.png

Note that when the data is returned to SAP HANA, a range of ports is opened. Therefore, you must make sure that the SAP HANA box can connect to this range of ports on your Hadoop server. The easiest way to accomplish this is to whitelist the SAP HANA box on the Hadoop box so that it can connect on all ports.
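To check the port requirement described above from the SAP HANA side, a small connectivity test can help. The following is only a sketch in Python, which Bob does not use in the video. The function name port_is_reachable is made up for illustration, and the only value taken from the video is port 7860.

```python
import socket


def port_is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        # create_connection resolves the host name and opens a TCP connection.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable host names.
        return False


# Example call (the host name is a placeholder, not taken from the video):
# port_is_reachable("your-hadoop-master-host", 7860)
```

A result of False only tells you that the connection failed. It does not tell you whether a firewall rule, a stopped service or a wrong host name is the cause.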


Next, you must insert the host name of the machine running your Spark Controller next to <value> underneath sap.hana.es.driver.host. Bob deliberately leaves a . in the IP address so that he can demonstrate the resulting error during testing later on.

Screen Shot 2015-10-29 at 11.06.13 AM.png
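For readers who would rather generate the entry than type it, here is a minimal Python sketch of what such a property could look like. It assumes that hanaes-site.xml uses the usual Hadoop-style property/name/value layout, which the <value> tag mentioned above suggests but which you should confirm against your own file. The helper make_property and the host name are hypothetical.

```python
import xml.etree.ElementTree as ET


def make_property(name: str, value: str) -> str:
    """Build one Hadoop-style <property> element and return it as text."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = value
    return ET.tostring(prop, encoding="unicode")


# The host name is a placeholder; use the host name of your own Spark Controller machine.
print(make_property("sap.hana.es.driver.host", "ip-10-0-0-1.ec2.internal"))
```

The printed element can then be compared with, or pasted into, the matching entry in the file.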

Scrolling down through the rest of the hanaes-site.xml file, Bob inserts the correct HDP version name for both the Spark yarn and Spark driver properties. You should also tune the minExecutors and maxExecutors values to suit your system’s resources. Then Bob pastes in a bit of code from GitHub that sets the value for the Spark executor memory.

Screen Shot 2015-10-29 at 11.19.39 AM.png

After saving the file, Bob copies the AWS.resolver jar file from the lib folder and then pastes it in as a final property in the hanaes-site.xml file to make sure it’s aware of the AWS resolver.

Screen Shot 2015-10-29 at 11.22.41 AM.png

Configuring Hadoop

Screen Shot 2015-10-26 at 2.50.56 PM.png

In the next video in the series, Bob details how to configure the hive-site.xml file. This file tells the SAP HANA Spark Controller all of the configuration details about the Hive deployment.


In PuTTY, Bob goes into the Hive conf folder to find the hive-site.xml file and places it into the SAP HANA Spark Controller conf folder. Bob also places the hive-site.xml file into the Spark Controller conf folder. In the hive-site.xml file, Bob removes the s from the delay and time values and inserts the value below for the Hive security authorization manager.

Screen Shot 2015-11-12 at 11.45.47 AM.png
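Bob makes the hive-site.xml edits by hand, but the idea of dropping the trailing s from the delay and time values can be sketched in Python as below. This is an illustration only. The property names in SAMPLE are invented placeholders, not real Hive settings, and the sketch assumes the affected values are whole numbers followed by a single s.

```python
import re
import xml.etree.ElementTree as ET

# Matches values such as "5s" or "1800s" (digits followed by one "s").
SECONDS_SUFFIX = re.compile(r"^(\d+)s$")

# Invented sample content; the real hive-site.xml holds different properties.
SAMPLE = """<configuration>
  <property><name>example.retry.delay</name><value>5s</value></property>
  <property><name>example.session.timeout</name><value>1800s</value></property>
  <property><name>example.flag</name><value>true</value></property>
</configuration>"""


def strip_seconds_suffix(value: str) -> str:
    """Return the value without its trailing "s" if it looks like "<digits>s"."""
    match = SECONDS_SUFFIX.match(value.strip())
    return match.group(1) if match else value


def cleaned_values(xml_text: str) -> dict:
    """Map each property name to its value with any seconds suffix removed."""
    root = ET.fromstring(xml_text)
    return {
        prop.findtext("name"): strip_seconds_suffix(prop.findtext("value", default=""))
        for prop in root.iter("property")
    }


for name, value in cleaned_values(SAMPLE).items():
    print(name, "=", value)
```

Values that do not match the pattern, such as true, are left untouched, so the function is safe to run over every property.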

The next step must be performed on every single node in your cluster. Find the yarn-shuffle jar file and copy it to the Hadoop yarn folder on each node.


Next, in Ambari, Bob configures the mapreduce.application.classpath in the advanced mapred-site option to include the current version of HDP he is using. Then, in Yarn, Bob appends spark_shuffle after mapreduce_shuffle in the yarn.nodemanager.aux-services property.

Screen Shot 2015-11-12 at 11.58.19 AM.png
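The aux-services value is a comma-separated list, so the change amounts to adding one more entry. Bob edits the field directly in Ambari. The Python sketch below only illustrates the resulting value, using a made-up helper called append_aux_service.

```python
def append_aux_service(current: str, service: str = "spark_shuffle") -> str:
    """Append a service to a comma-separated aux-services value if it is missing."""
    services = [item.strip() for item in current.split(",") if item.strip()]
    if service not in services:
        services.append(service)
    return ",".join(services)


print(append_aux_service("mapreduce_shuffle"))  # mapreduce_shuffle,spark_shuffle
```

Because the helper checks for an existing entry first, running it twice does not add spark_shuffle a second time.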

Then Bob shows how to add the custom property shown below to Yarn.

Screen Shot 2015-11-12 at 11.58.19 AM.png

Finally, Bob restarts the Yarn and MapReduce services in Ambari.


How to Start the Spark Controller

Screen Shot 2015-10-26 at 2.52.11 PM.png

Continuing on with the series, Bob shows how to start the SAP HANA Spark Controller.


As the hanaes user in PuTTY, Bob goes into the bin folder of the Spark Controller folder. To start the Spark Controller, enter ./hanaes start. To confirm it has started, Bob pulls up the log file and sees that there is an error due to an unknown host. After fixing his host name in the hanaes-site.xml file (removing the . he placed in the IP address earlier), Bob successfully starts the Spark Controller. Bob confirms this by seeing that his four third-party files in HDFS have been registered.

Screen Shot 2015-11-12 at 12.54.56 PM.png
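Bob reads the log by eye, but picking out the ERROR lines can also be sketched in a few lines of Python. The pattern below is inferred from the log excerpts quoted in the comments under this post (date, time, level, component, message), so treat it as an assumption rather than a documented format. The function name find_errors is made up, and the second sample message is shortened.

```python
import re

# Assumed line layout: "YY/MM/DD HH:MM:SS LEVEL Component: message"
LOG_LINE = re.compile(
    r"^(?P<time>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) (?P<component>[^:\s]+): (?P<message>.*)$"
)


def find_errors(lines):
    """Return (time, component, message) for every ERROR-level line."""
    errors = []
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match and match.group("level") == "ERROR":
            errors.append(
                (match.group("time"), match.group("component"), match.group("message"))
            )
    return errors


sample = [
    "16/03/25 13:20:57 INFO Server: Starting Spark Controller",
    "16/03/25 13:21:12 ERROR LiveListenerBus: SparkListenerBus has already stopped!",
]
for time, component, message in find_errors(sample):
    print(time, component, message)
```

To use it on a real log, pass an open file object instead of the sample list. The location of the log file depends on your installation.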

Creating a Remote Data Source for HIVE Tables in SAP HANA Studio

Screen Shot 2015-10-26 at 2.56.04 PM.png

In the seventh video of the series, Bob shows how to create a remote data source in SAP HANA Studio that will connect to the Hive system through the SAP HANA Spark Controller. Connecting to a Hive table will ensure that the SAP HANA Spark Controller is working.


In his SAP HANA SPS10 version of SAP HANA Studio, Bob logs into his SAP HANA system and manually creates a user. Then Bob creates a new system using that recently created user and opens a SQL console in this new system. Bob enters the syntax shown below to establish a connection to the SAP HANA Spark Controller.

Screen Shot 2015-11-12 at 2.16.38 PM.png

Once his connection is established, Bob creates a new virtual table and views its creation documented in the log file. Finally, Bob can open his Spark_Employe Spark table in HANA Studio and see it populated with data from Hive.

Screen Shot 2015-11-12 at 2.25.13 PM.png

Configuring the SAP HANA Spark Controller to Connect to SAP HANA Vora Tables

Screen Shot 2015-10-26 at 2.59.42 PM.png

In the second-to-last video in the series, Bob details how to configure the SAP HANA Spark Controller to connect to a list of SAP HANA Vora tables.


In PuTTY, as the root EC2 user, Bob navigates to the hanaes-site.xml file and inserts a few properties. The properties are for the SAP HANA Vora host and the Spark Vora Zkurls (Zookeeper Server). For each of the properties, insert the IP address(es) and port number(s) of every node that is running SAP HANA Vora and/or the Zookeeper Server in your system.

Screen Shot 2015-11-12 at 3.41.33 PM.png
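If you have several nodes, the property values become lists of host and port pairs. The Python sketch below builds such lists under two assumptions that are not confirmed in the video: that the values are comma-separated, and that the ports are 2022 for Vora and 2181 for Zookeeper, as mentioned in the comments under this post. The node names and the helper join_endpoints are placeholders.

```python
def join_endpoints(hosts, port):
    """Join host names into a comma-separated list of host:port pairs."""
    return ",".join(f"{host}:{port}" for host in hosts)


# Placeholder private DNS names; replace them with the nodes in your own cluster.
nodes = ["ip-10-0-0-1.ec2.internal", "ip-10-0-0-2.ec2.internal"]

vora_hosts = join_endpoints(nodes, 2022)      # assumed Vora port
zookeeper_urls = join_endpoints(nodes, 2181)  # assumed Zookeeper port

print(vora_hosts)
print(zookeeper_urls)
```

Check the separator and the ports against the screenshot above and your own cluster before copying the output into the file.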

Next, you must put the spark-sap-datasources jar file in the SAP HANA Spark Controller lib folder to enable the Spark extensions.

Screen Shot 2015-11-12 at 3.45.24 PM.png

Finally, Bob stops the Spark Controller before starting it again and confirms that it’s running by viewing the log file.


How to Connect to SAP HANA Vora Tables in SAP HANA Studio

Screen Shot 2015-10-26 at 3.01.09 PM.png

In the final video of the series, Bob shows how to connect to SAP HANA Studio so that it can read SAP HANA Vora tables.


First, Bob goes to this file on the SAP HANA Academy’s GitHub page and goes over the instructions starting on line 212.


Bob sudos to the HDFS user, then creates a folder for the Vora user in HDFS and gives the Vora user access to it. Next, Bob creates a simple test table by running the commands shown below.

Screen Shot 2015-11-12 at 4.05.03 PM.png

Now that the test.csv file is in HDFS, Bob creates an SAP HANA Vora table based on it. Bob starts Spark Shell and then creates the table using SAP SQL context. Bob covers this step in much more depth in this video in the Vora series.


Back in SAP HANA Studio, Bob refreshes the remote data source. Bob now sees the testtable in the spark_velocity (or Vora) folder. Finally, Bob creates a virtual table, puts it in his newly created SAP HANA Studio schema and views the contents with a data preview in SAP HANA Studio.


Screen Shot 2015-11-12 at 4.11.33 PM.png


For more SAP HANA Vora tutorial videos please check out the playlist


SAP HANA Academy – Over 1,200 free tutorial videos on SAP HANA, SAP Analytics and the SAP HANA Cloud Platform.


Follow us on Twitter @saphanaacademy and connect with us on LinkedIn.


6 Comments


  1. Przemyslaw Swiecicki

    Hi Bob! I have a problem starting the Spark Controller after doing this configuration. I successfully ran the Spark Controller for Hive from the previous videos, but I have a problem connecting it to Vora. After I run ./hanaes restart, I get this error in the log:


    16/03/25 13:20:57 INFO Server: Starting Spark Controller

    16/03/25 13:21:12 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerBlockManagerAdded(1458912072362,BlockManagerId(2, master.cluster, 51670),1111511531)


    What is the issue here? Can you please help me? Thanks!

    1. Tom Flanagan Post author

      Hi Przemyslaw,

      Have you fixed the host name in the hanaes-site.xml? In Bob’s example his host IP has a . instead of a dash.

      Best,

      Tom

      1. Przemyslaw Swiecicki

        Hi Tom!

        Yes, I was able to connect to Hive tables as in videos 6 and 7. The problem occurs only in the 8th video, after re-configuring the hanaes-site.xml file to read Vora tables.

      2. Przemyslaw Swiecicki

        After restarting my instances, I don’t get this error anymore. Now, after ./hanaes start, it just stops at “Starting Spark Controller” and freezes there.



        Here is the full log:


        SLF4J: Class path contains multiple SLF4J bindings.

        SLF4J: Found binding in [jar:file:/usr/sap/spark/controller/lib/spark-sap-datasources-1.0.0-assembly.jar!/org/slf4j/impl/StaticLoggerBinder.class]

        SLF4J: Found binding in [jar:file:/usr/sap/spark/controller/lib/external/spark-assembly-1.4.1-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]

        SLF4J: Found binding in [jar:file:/usr/hdp/2.3.2.0-2950/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]

        SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.

        SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]

        16/03/30 07:44:00 INFO HanaESConfig: Loaded HANA Extended Store Configuration Found Spark Libraries. Proceeding with Current Class Path

        16/03/30 07:44:01 INFO Server: Starting Spark Controller




        Can you please help me with this problem?


        Best,


        Przemek.

        1. Tom Flanagan Post author

          Hi Przemek,

          I’ve never encountered something like this in the log. Did you insert the IP address and port number for each of your nodes as properties into the hanaes-site.xml file?

          1. Przemyslaw Swiecicki

            Hi Tom,

            Yes. I have two nodes, both with Vora and Zookeeper installed. In the hanaes-site.xml file I inserted the private DNS names in the Vora property and the Zookeeper property (as Bob did, with ports 2022 and 2181 and ec2.internal before them). It still freezes on “Starting Spark Controller”.
