In the second installment of a four-part video series on SAP Smart Data Access for Apache Spark, Tahir Hussain ‘Bob’ Babar demonstrates how to start the three Apache Spark servers. Bob also details how to use Hadoop Beeline to test the connectivity from Apache Spark to Apache HDFS.

Background Information (0:39 – 2:28)

In this lesson set on Apache Spark, Bob is using two machines: a Linux machine that contains SAP HANA, and a Windows machine, connected to it, that has HDFS and Apache Spark.

To use Spark, a master server must be started first. Many worker servers can be attached to the master server, and these worker servers can be on many different machines. Once the master and worker servers are started, a third and final server, the thrift server, is started. The thrift server enables connectivity from Spark to HDFS.

Once the servers are started, Beeline can be used to run SQL statements over JDBC against the HDFS system to test the connection between Spark and HDFS.

Starting the Apache Spark Master Server on a Windows Box (2:28 – 5:56)

Go to the bin folder of your Apache Spark installation and open the spark-class.cmd file to see all of the commands that will be used.

Next, open a command line and navigate to the bin directory of Spark. Enter spark-class to begin its execution. Some parameters must be supplied to start the appropriate servers in order.

On the next line, enter spark-class org.apache.spark.deploy.master.Master. Once the master server starts, a message will appear that reads “INFO master.Master: I have been elected leader! New state: ALIVE”.

To test the server, copy the MasterWebUI address that appears above the elected-leader line in the command prompt window into a web browser and confirm that the master's status is ALIVE.
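As a rough illustration of what to look for in the master's console output, the following Python sketch scans captured log text for the elected-leader message. The function name and the sample text are hypothetical; only the message wording comes from the output described above.

```python
def master_is_alive(log_text: str) -> bool:
    """Return True if the captured master output reports the ALIVE state."""
    marker = "I have been elected leader! New state: ALIVE"
    return any(marker in line for line in log_text.splitlines())


# Hypothetical captured output; the first line is a made-up placeholder.
sample = (
    "INFO example.Placeholder: unrelated start-up line\n"
    "INFO master.Master: I have been elected leader! New state: ALIVE\n"
)
print(master_is_alive(sample))
```

A check like this is only a convenience; the MasterWebUI remains the clearer confirmation that the master is up.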

Assigning a Worker Server (5:56 – 8:07)

The master server only acts as a name service, so a worker server must be assigned to it. The worker server is connected to the master server using an IP address and an assigned port. The IP address is listed five lines up from the bottom in the command prompt as well as on the first line of the MasterWebUI. In this example, Bob’s IP address and port are given as spark://10.80.38.82:7077.

In a new command line, go to the bin folder again. This time, enter spark-class org.apache.spark.deploy.worker.Worker spark://10.80.38.82:7077 and press return. The final argument should be your own Spark master IP address and port.

A message will then appear stating that the worker server has successfully registered with the master. After refreshing the MasterWebUI, it will list a single worker server that is now alive.
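Because the worker command differs between installations only in its final argument, it can be assembled from the master address. The Python sketch below is one way to do that, assuming the address has the spark://host:port shape shown above; the helper name and the validation pattern are illustrative and not part of Spark.

```python
import re
from typing import List

WORKER_CLASS = "org.apache.spark.deploy.worker.Worker"


def worker_command(master_url: str) -> List[str]:
    """Build the spark-class arguments that attach a worker to a master."""
    # Assumed shape: spark://<host>:<port>, e.g. spark://10.80.38.82:7077
    if not re.fullmatch(r"spark://[^\s:/]+:\d+", master_url):
        raise ValueError(f"not a spark://host:port address: {master_url!r}")
    return ["spark-class", WORKER_CLASS, master_url]


# Address taken from the example above; replace it with your own.
print(" ".join(worker_command("spark://10.80.38.82:7077")))
```

The printed line can be pasted into a command prompt opened in the Spark bin folder.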

Enabling Spark to Speak to Hadoop by Launching a Thrift Server (8:07 – 10:13)

The thrift server is a service that runs in Spark and is used to connect Spark to a Hadoop system. When a Hadoop system is started, a thrift server is automatically started on port 10001.

Go to the list of services, right click on Apache Hadoop hiveserver2 and stop the currently running thrift server. Now launch a new command line and navigate back to the Spark bin folder. Enter spark-class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2. This will use the same port as the hiveserver2 that was just stopped. Once the thrift server is started, a SparkUI address will appear in the command prompt window, and if you visit that address in a web browser you will see that the thrift server has started.
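Since the Spark thrift server reuses the port of the service that was just stopped, it can help to confirm that the port is actually free before launching it, and that it is accepting connections afterwards. The following Python sketch does this with a plain TCP connection attempt; the function name is hypothetical, and port 10001 on the local machine is assumed from the description above.

```python
import socket


def port_in_use(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    # Assumed values: thrift server port 10001 on the local machine.
    state = "in use" if port_in_use("127.0.0.1", 10001) else "free"
    print(f"Port 10001 is {state}")
```

A successful connection only shows that some process is listening on the port, not which one, so the SparkUI check described above is still worth doing.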

Testing the Connection with Beeline (10:13 – 15:33)

Launch a new command line, go again to the bin folder of the Spark installation, and enter beeline. Then enter !connect jdbc:hive2://10.80.38.82:10001 (your master server IP address, a colon, and the thrift server port) and press enter. You will be prompted for a user name and password, but if you did the default Hadoop installation, just press return twice to access the Beeline system.
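The connection string is simply the host and thrift server port placed after the jdbc:hive2:// prefix. As a small sketch, with hypothetical helper names, the line to type at the Beeline prompt can be built like this:

```python
def hive2_jdbc_url(host: str, port: int) -> str:
    """Return a HiveServer2-style JDBC URL for the given host and port."""
    return f"jdbc:hive2://{host}:{port}"


def beeline_connect_line(host: str, port: int) -> str:
    """Return the !connect line to type at the Beeline prompt."""
    return f"!connect {hive2_jdbc_url(host, port)}"


# Values from the example above; substitute your own address and port.
print(beeline_connect_line("10.80.38.82", 10001))
# prints: !connect jdbc:hive2://10.80.38.82:10001
```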

Bring the thrift server command prompt window back into view, then enter show databases; in the Beeline command prompt window and watch the statement being processed in the thrift server window. After seeing that there are two databases listed in the Beeline window, enter select * from live2.connections; to return the rows of that table.

You have now connected and confirmed that the Apache Spark system can talk to the Hadoop system through the thrift server. The SparkUI in the web browser will list all of the SQL statements that have been run.
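The same two statements could also be run without the interactive prompt. The sketch below only builds the argument lists and does not launch anything; it assumes Beeline's -u (connection URL) and -e (statement to execute) options, which are not shown in the video, and the function name is hypothetical.

```python
from typing import List


def beeline_commands(host: str, port: int, statements: List[str]) -> List[List[str]]:
    """Build one beeline argument list per SQL statement."""
    url = f"jdbc:hive2://{host}:{port}"
    # Assumption: -u passes the JDBC URL and -e passes a statement to run.
    return [["beeline", "-u", url, "-e", stmt] for stmt in statements]


# Address, port, and statements taken from the walkthrough above.
for argv in beeline_commands(
    "10.80.38.82",
    10001,
    ["show databases;", "select * from live2.connections;"],
):
    print(argv)
```

Issuing one invocation per statement keeps each result separate, at the cost of reconnecting each time.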

Check out Bob’s video on how to start the master, worker, and thrift servers of Apache Spark and how to test the connectivity with Beeline.

SAP HANA Academy – over 500 free tutorial technical videos on using SAP HANA.


-Tom Flanagan

SAP HANA Academy

Follow @saphanaacademy
