
Carsten Mönning and Waldemar Schiller


Part 1 - Single node Hadoop on Raspberry Pi 2 Model B (~120 mins), http://bit.ly/1dqm8yO

Part 2 - Hive on Hadoop (~40 mins), http://bit.ly/1Biq7Ta

Part 3 - Hive access with SAP Lumira (~30 mins), http://bit.ly/1cbPz68
Part 4 - A Hadoop cluster on Raspberry Pi 2 Model B(s) (~45 mins)


Part 4 - A Hadoop cluster on Raspberry Pi 2 Model B(s) (~45 mins)

In Parts 1-3 of this blog series, we worked our way towards a single node Hadoop and Hive implementation on a Raspberry Pi 2 Model B. We showcased a simple word count processing example with the help of HiveQL, both on the Hive command line and via a standard SQL layer over Hive/Hadoop in the form of the Apache Hive connector of the SAP Lumira desktop trial edition. The single node Hadoop/Hive setup represented just another SAP Lumira data source, allowing us to observe the actual SAP Lumira-Hive server interaction in the background.


This final part of the series comes full circle by showing how to move from the single node to a multi-node Raspberry Pi Hadoop setup. We will restrict ourselves to introducing a second node only, with the principle extending naturally to three or more nodes.


Master node configuration

Within our two-node cluster setup, "node1" will be set up as the master node, with "node2" acting as a slave node only. Set the hostname of the master node, as required, in the file /etc/hostname.
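
A minimal way of doing this from the command line might look as follows (hostname as used throughout this series):


     echo "node1" | sudo tee /etc/hostname     # persist the hostname
     sudo hostname node1                       # apply it without a reboot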


To keep things nice and easy, we will 'hard-code' the nodes' IP settings in the local hosts file instead of setting up a proper DNS service. That is, open the master node hosts file with a text editor of your choice, for example via sudo leafpad /etc/hosts, and modify it as follows:


     192.168.0.110     node1

     192.168.0.111     node2


Remember in this context that, in Part 1 of this blog series, we edited the /etc/network/interfaces file of node1 so that the local Ethernet interface eth0 was assigned the static IP address 192.168.0.110. The master node IP address in the hosts file above therefore needs to reflect this specific setting.
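
For reference, the corresponding eth0 stanza from Part 1 looks roughly like the sketch below; the netmask and gateway values are assumptions and need to match your local network:


     auto eth0
     iface eth0 inet static
          address 192.168.0.110
          netmask 255.255.255.0
          gateway 192.168.0.1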


Next, edit the file /opt/hadoop/etc/hadoop/masters to indicate which host will be operating as master node (here: node1) by adding a single line consisting of the entry node1. Note that in the case of older Hadoop versions, you need to set up the masters file in /opt/hadoop/conf instead. Strictly speaking, the "masters" file only tells Hadoop which machine(s) should operate a secondary namenode, whilst the "slaves" file provides the list of machines which should run as datanodes in the cluster. Modify the file /opt/hadoop/etc/hadoop/slaves accordingly by adding the list of host names, for example:


     node1
     node2
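
If you prefer to create both files directly from the shell, something along the following lines will do (paths assume the /opt/hadoop installation location used throughout this series):


     echo "node1" | sudo tee /opt/hadoop/etc/hadoop/masters
     printf "node1\nnode2\n" | sudo tee /opt/hadoop/etc/hadoop/slaves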

You may remember from Part 1 of the series that the Hadoop configuration files are not held globally, i.e. each node in a Hadoop cluster holds its own set of configuration files which need to be kept in sync by the administrator using, for example, rsync. Keeping the configuration of a cluster of significant size in sync represents one of the key challenges when operating a Hadoop environment. A discussion of the various means available for managing a cluster configuration is beyond the scope of this blog; you will find useful pointers in [1].
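
By way of illustration only, a minimal sync might push the configuration directory from node1 to every other host listed in the slaves file; this sketch assumes the /opt/hadoop path from Part 1 and the password-less SSH access set up further below:


     for host in $(cat /opt/hadoop/etc/hadoop/slaves); do
          [ "$host" = "node1" ] && continue     # skip the master itself
          rsync -avxP /opt/hadoop/etc/hadoop/ hduser@$host:/opt/hadoop/etc/hadoop/
     done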

In Part 1, we configured the Hadoop system for operation in pseudo-distributed mode. This time round, we need to modify the relevant configuration files for operation in fully distributed mode by referring to the master node determined in the hosts file above (here: node1). Note that under YARN there is only a single resource manager for the cluster, operating on the master node.

core-site.xml - Common configuration settings for Hadoop Core.

hdfs-site.xml - Configuration settings for the HDFS daemons, i.e. the namenode, the secondary namenode and the datanodes.

mapred-site.xml - General configuration settings for the MapReduce daemons. Since we are running MapReduce using YARN, the MapReduce jobtracker and tasktrackers are replaced with a single resource manager running on the namenode.

File: core-site.xml - Change the host name from localhost to node1

  <configuration>
    <property>
      <name>hadoop.tmp.dir</name>
      <value>/hdfs/tmp</value>
    </property>
    <property>
      <name>fs.default.name</name>
      <value>hdfs://node1:54310</value>
    </property>
  </configuration>


File: hdfs-site.xml - Update the replication factor from 1 to 2

    

  <configuration>
    <property>
      <name>dfs.replication</name>
      <value>2</value>
    </property>
  </configuration>

File: mapred-site.xml.template ("mapred-site.xml", if dealing with older Hadoop versions) - Change the host name from localhost to node1

  <configuration>
    <property>
      <name>mapred.job.tracker</name>
      <value>node1:54311</value>
    </property>
  </configuration>

Assuming that you worked your way through Parts 1-3 with the specific Raspberry Pi device that you are now turning into the master node, you need to delete its HDFS storage:


     sudo rm -rf /hdfs/tmp/*


This completes the master node configuration.

Slave node configuration


When planning to set up a proper Hadoop cluster consisting of considerably more than two Raspberry Pis, you may want to use an SD card cloning programme such as Win32 Disk Imager to copy the node1 configuration above onto the future slave nodes. See, for example, http://bit.ly/1imyCXv for a step-by-step guide to cloning a Raspberry Pi SD card.

For each of these clones, modify the /etc/network/interfaces and /etc/hostname files, as described above, by replacing the node1 entries with the corresponding clone host name and IP address, for example as sketched below.
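
By way of example, the changes for a clone that is to become node2 might read as follows (IP address as per the hosts file above):


     sudo sed -i 's/node1/node2/' /etc/hostname
     sudo sed -i 's/192.168.0.110/192.168.0.111/' /etc/network/interfaces
     sudo reboot     # let the new hostname and IP address take effect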


Alternatively, and assuming that the Java environment, i.e. both the Java run-time environment and the JAVA_HOME environment variable, is already set up on the relevant node as described in Part 1, use rsync to distribute the node1 configuration to the other nodes in your local Hadoop network. More specifically, run the following command on the master node to push the installation to the slave node (here: node2):


     sudo rsync -avxP /opt/hadoop/ hduser@node2:/opt/hadoop/


This way, the files in the Hadoop directory of the master node are distributed automatically to the Hadoop folder of the slave node. When dealing with a two-node setup as described here, however, you may simply want to work your way through Part 1 for node2. Having already done so for node1, you are likely to find this pretty easy-going.

The public SSH key generated in Part 1 of this blog series and stored in id_rsa.pub (and then appended to the list of SSH authorised keys in the file authorized_keys) on the master node needs to be shared with all slave nodes to allow for seamless, password-less node communication between master and slaves. Therefore, switch to the hduser on the master node via su hduser and add ~/.ssh/id_rsa.pub from node1 to ~/.ssh/authorized_keys on slave node node2 via:


          ssh-copy-id -i ~/.ssh/id_rsa.pub hduser@node2


You should now have password-less SSH access from the master node to the slave node.
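
A quick way of verifying this is to run, as hduser on node1:


     ssh hduser@node2 hostname     # should print "node2" without prompting for a password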


Cluster launch


Format the Hadoop file system and launch both the file system services, i.e. the namenode, datanodes and secondary namenode, and the YARN resource manager on node1, i.e.:


     hadoop namenode -format


     /opt/hadoop/sbin/start-dfs.sh

     /opt/hadoop/sbin/start-yarn.sh


When dealing with an older Hadoop version using the original MapReduce service, the start scripts to be used are /opt/hadoop/bin/start-dfs.sh and /opt/hadoop/bin/start-mapred.sh, respectively.


To verify that the Hadoop cluster daemons are running properly, launch the jps command on the master node. You should be presented with a list of services comprising the namenode, the secondary namenode and a datanode on the master node, and a datanode on each slave node. In the case of the master node, the list of services should look something like this, i.e., amongst other things, both the single YARN resource manager and the secondary namenode are operational:
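
The exact process identifiers will differ, of course, but the master node output should resemble the following illustrative example:


     hduser@node1 ~ $ jps
     2014 NameNode
     2133 DataNode
     2362 SecondaryNameNode
     2416 ResourceManager
     2534 NodeManager
     2857 Jps

In addition, the namenode and resource manager web interfaces, by default reachable at http://node1:50070 and http://node1:8088, respectively, offer a convenient overview of the live datanodes and cluster resources.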

If you find yourself in need of issue diagnostics at any point, consult the log4j log files in the logs subdirectory of the Hadoop installation directory first. If preferred, you can separate the log files from the Hadoop installation directory by setting a new log directory via HADOOP_LOG_DIR in the hadoop-env.sh script.
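
A minimal sketch of such a change, assuming a (hypothetical) target directory /var/log/hadoop and the hduser account from Part 1:


     sudo mkdir -p /var/log/hadoop
     sudo chown hduser /var/log/hadoop
     # add to /opt/hadoop/etc/hadoop/hadoop-env.sh:
     export HADOOP_LOG_DIR=/var/log/hadoop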

The picture shows what a two-node cluster setup may look like. In this specific case, the nodes are powered by the power bank on the right-hand side of the picture.

And this is really pretty much all there is to it. We hope that this four-part blog series helped to take some of the mystery out of the Hadoop world for you and that this Lab project demonstrated how easily and cheaply an admittedly simple "Big Data" setup can be implemented on truly commodity hardware such as Raspberry Pis. We shall have a look at combining this setup with the world of Data Virtualization and, possibly, Open Data in the not-too-distant future.

Links

A Hadoop data lab project on Raspberry Pi - Part 1/4 - http://bit.ly/1dqm8yO
A Hadoop data lab project on Raspberry Pi - Part 2/4 - http://bit.ly/1Biq7Ta

A Hadoop data lab project on Raspberry Pi - Part 3/4 - http://bit.ly/1cbPz68

A BOBI document dashboard with Raspberry Pi - http://bit.ly/1Mv2Rv5

Jonas Widriksson blog - http://www.widriksson.com/raspberry-pi-hadoop-cluster/

How to clone your Raspberry Pi SD card for super easy reinstallations - http://bit.ly/1imyCXv

References

[1] T. White, "Hadoop: The Definitive Guide", 3rd edition, O'Reilly, USA, 2012
