A Hadoop data lab project on Raspberry Pi – Part 1/4
Carsten Mönning and Waldemar Schiller
Hadoop has developed into a key enabling technology for all kinds of Big Data analytics scenarios. Although Big Data applications have started to move beyond the classic batch-oriented Hadoop architecture towards near real-time architectures such as Spark, Storm, etc., a thorough understanding of the Hadoop, MapReduce and HDFS principles and of services such as Hive, HBase, etc. operating on top of the Hadoop core remains one of the best starting points for getting into the world of Big Data. Renting a Hadoop cloud service or even getting hold of an on-premise Big Data appliance will get you Big Data processing power but no real understanding of what is going on behind the scenes.
To inspire your own little Hadoop data lab project, this four-part blog will provide a step-by-step guide for the installation of open source Apache Hadoop from scratch on Raspberry Pi 2 Model B over the course of the next three to four weeks. Hadoop is designed for operation on commodity hardware so it will do just fine for tutorial purposes on a Raspberry Pi. We will start with a single node Hadoop setup, move on to the installation of Hive on top of Hadoop, and then use the Apache Hive connector of the free SAP Lumira desktop trial edition to visually explore a Hive database. We will finish the series with the extension of the single node setup to a Hadoop cluster on multiple, networked Raspberry Pis. If things go smoothly, and depending on your level of Linux expertise, you can expect your Hadoop Raspberry Pi data lab project to be up and running within approximately 4 to 5 hours.
We will use a simple, widely known processing example (word count) throughout this blog series. No prior technical knowledge of Hadoop, Hive, etc. is required. Some basic Linux/Unix command line skills will prove helpful throughout. We are assuming that you are familiar with basic Big Data notions and the Hadoop processing principle. If not, you will find useful pointers in the references listed at the end of this post and at http://hadoop.apache.org/. Further useful references will be provided over the course of this multi-part blog.
Part 1 – Single node Hadoop on Raspberry Pi 2 Model B (~120 mins)
Part 2 – Hive on Hadoop (~40 mins), http://bit.ly/1Biq7Ta
Part 3 – Hive access with SAP Lumira (~30 mins), http://bit.ly/1cbPz68
Part 4 – A Hadoop cluster on Raspberry Pi 2 Model B(s) (~45 mins), http://bit.ly/1eO766g
Part 1 – Single node Hadoop on Raspberry Pi 2 Model B (~120 mins)
To get going with your single node Hadoop setup, you will need the following Raspberry Pi 2 Model B bits and pieces:
- One Raspberry Pi 2 Model B, i.e. the latest Raspberry Pi model featuring a quad core CPU with 1 GB RAM.
- 8GB microSD card with NOOBS (“New Out-Of-the-Box Software”) installer/boot loader pre-installed (https://www.raspberrypi.org/tag/noobs/).
- Wireless LAN USB card.
- Micro USB power supply, heat sinks and HDMI display cable.
- Optional, but recommended: A case to hold the Raspberry circuit board.
To make life a little easier for yourself, we recommend going for a Raspberry Pi accessory bundle, which typically comes with all of these components pre-packaged and will set you back approx. € 60-70.
We intend to install the latest stable Apache Hadoop and Hive releases available from any of the Apache Software Foundation download mirror sites, http://www.apache.org/dyn/closer.cgi/hadoop/common/, alongside the free SAP Lumira desktop trial edition, http://saplumira.com/download/, i.e.
- Hadoop 2.7.2
- Hive 1.1.0
- SAP Lumira 1.23 desktop edition
The initial Raspberry setup procedure is described by, amongst others, Jonas Widriksson at http://www.widriksson.com/raspberry-pi-hadoop-cluster/. His blog also provides some pointers in case you are not starting off with a Raspberry Pi accessory bundle but prefer obtaining the hardware and software bits and pieces individually. We will follow his approach for the basic Raspbian setup in this part, but updated to reflect Raspberry Pi 2 Model B-specific aspects and providing some more detail on various Raspberry Pi operating system configuration steps. To keep things nice and easy, we are assuming that you will be operating the environment within a dedicated local wireless network, thereby avoiding any firewall and port setting (and the Hadoop node & rack network topology) discussion. The basic Hadoop installation and configuration descriptions in this part draw on the references listed at the end of this post.
The subsequent blog parts will be based on this basic setup.
Raspberry Pi setup
Powering on your Raspberry Pi will automatically launch the pre-installed NOOBS installer on the SD card. Select “Raspbian”, a Debian 7 Wheezy-based Linux distribution for ARM CPUs, from the installation options and wait for its subsequent installation procedure to complete. Once the Raspbian operating system has been installed successfully, your Raspberry Pi will reboot automatically and you will be asked to provide some basic configuration settings using raspi-config. Note that since we are assuming that you are using NOOBS, you will not need to expand your SD card storage (menu option Expand Filesystem). NOOBS will already have done so for you. By the way, if you want or need to run NOOBS again at some point, press and hold the shift key on boot and you will be presented with the NOOBS screen.
What you might want to do though is to set a new password for the default user “pi” via configuration option Change User Password. Similarly, set your internationalisation options, as required, via option Internationalisation Options.
More interestingly in our context, go for menu item Overclock and set a CPU speed to your liking taking into account any potential implications for your power supply/consumption (“voltmodding”) and the life-time of your Raspberry hardware. If you are somewhat optimistic about these things, go for the “Pi2” setting featuring 1GHz CPU and 500 MHz RAM speeds to make the single node Raspberry Pi Hadoop experience a little more enjoyable.
Under Advanced Options, followed by submenu item Hostname, set the hostname of your device to “node1”. Selecting Advanced Options again, followed by Memory Split, set the GPU memory to 32 MB.
Finally, under Advanced Options, followed by SSH, enable the SSH server and reboot your Raspberry Pi by selecting <Finish> in the configuration menu. You will need the SSH server to allow for Hadoop cluster-wide operations.
Once rebooted and with your “pi” user logged in again, the basic configuration setup of your Raspberry device has been successfully completed and you are ready for the next set of preparation steps.
To make life a little easier, launch the Raspbian GUI environment by entering startx in the Raspbian command line. (Alternatively, you can use, for example, the vi editor, of course.) Use the GUI text editor, “Leafpad”, to edit the /etc/network/interfaces text file and change the local ethernet settings for eth0 from DHCP to the static IP address 192.168.0.110. Also add the netmask and gateway entries matching your local network. This is the preparation for our multi-node Hadoop cluster which is the subject of Part 4 of this blog series.
Check whether a nameserver entry is present in file /etc/resolv.conf and points to a valid DNS server. Restart your device afterwards.
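The edited eth0 entry might look like the sketch below. Only the address 192.168.0.110 comes from the text above; the netmask 255.255.255.0 and the gateway 192.168.0.1 are assumptions that need to match your own router. The script writes to a scratch file (a hypothetical path) so that you can review the stanza before copying it into /etc/network/interfaces in place of the existing eth0 DHCP line.

```shell
# Sketch: write a static eth0 stanza to a scratch file for review.
# Only the address comes from the text above; netmask and gateway are
# assumptions and must match your own local network.
OUT="${OUT:-/tmp/interfaces.sketch}"   # hypothetical scratch file

cat > "$OUT" <<'EOF'
auto eth0
iface eth0 inet static
    address 192.168.0.110
    netmask 255.255.255.0
    gateway 192.168.0.1
EOF

cat "$OUT"
```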
Hadoop is written in Java, and the Hadoop 2.7 line requires Java 7 or later. Check whether the pre-installed Java environment is in place by executing:
java -version
The response should report Java 1.8, i.e. Java 8.
Hadoop user & group accounts
Set up dedicated user and group accounts for the Hadoop environment to separate the Hadoop installation from other services. The account IDs can be chosen freely, of course. We are sticking here with the ID examples in Widriksson’s blog posting, i.e. group account ID “hadoop” and user account ID “hduser” within this and the sudo user groups.
sudo addgroup hadoop
sudo adduser --ingroup hadoop hduser
sudo adduser hduser sudo
SSH server configuration
Generate an RSA key pair with an empty passphrase to allow the “hduser” to access slave machines seamlessly. The public key will be stored in a file with the default name “id_rsa.pub” and then appended to the list of SSH authorised keys in the file “authorized_keys”. Note that this public key file will need to be shared by all Raspberry Pis in a Hadoop cluster (Part 4).
ssh-keygen -t rsa -P ""
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
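Since the same public key gets distributed to further nodes in Part 4, it can be handy to make the append step repeatable. The sketch below is one way of doing so; append_key is a hypothetical helper, not part of OpenSSH, and the paths in the usage comment are the defaults mentioned above.

```shell
# Sketch: append a public key to an authorized_keys file only if it is
# not already listed, then restrict the file to its owner.
# append_key is a hypothetical helper, not part of OpenSSH.
append_key() {
    pub="$1"    # e.g. ~/.ssh/id_rsa.pub
    auth="$2"   # e.g. ~/.ssh/authorized_keys
    touch "$auth"
    # -F fixed string, -x whole line, -f read the pattern from the key file
    grep -qxFf "$pub" "$auth" || cat "$pub" >> "$auth"
    chmod 600 "$auth"
}

# Usage on the Pi, as hduser:
#   append_key ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys
```

Running the helper twice leaves a single entry in the file, whereas a plain append would add the key again each time.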
Hadoop installation & configuration
Similar to the Raspbian installation & configuration description above, we will talk you through the basic Hadoop installation first, followed by the various environment variable and configuration settings.
You need to get your hands on the latest stable Hadoop version (here: version 2.7.2) so initiate the download from any of the various Apache mirror sites (here: spacedump.net).
Once the download has been completed, unpack the archive to a sensible location; /opt is a typical choice.
sudo mkdir -p /opt
sudo tar -xvzf hadoop-2.7.2.tar.gz -C /opt/
Following extraction, rename the newly created hadoop-2.7.2 folder into something a little more convenient such as “hadoop”.
sudo mv /opt/hadoop-2.7.2 /opt/hadoop
Running, for example, ls -al /opt, you will notice that your “pi” user is the owner of the “hadoop” directory, as expected. To allow for the dedicated Hadoop user “hduser” to operate within the Hadoop environment, change the ownership of the Hadoop directory to “hduser”.
sudo chown -R hduser:hadoop /opt/hadoop
This completes the basic Hadoop installation and we can proceed with its configuration.
Switch to the “hduser” and add the export statements listed below to the end of the shell startup file ~/.bashrc. Instead of using the standard vi editor, you could, of course, make use of the Leafpad text editor within the GUI environment again.
Export statements to be added to ~/.bashrc:
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
This way both the Java and the Hadoop installation as well as the Hadoop binary paths become known to your user environment. Note that you may add the JAVA_HOME setting to the hadoop-env.sh script instead, as shown below.
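The paragraph above mentions the Hadoop installation and binary paths, for which no export lines are listed. A minimal sketch of what they might look like follows. The variable name HADOOP_INSTALL is an assumption; /opt/hadoop is the installation directory chosen earlier.

```shell
# Sketch of the Hadoop-related exports for ~/.bashrc.
# HADOOP_INSTALL is an assumed variable name; /opt/hadoop is the
# installation directory chosen earlier in this part.
export HADOOP_INSTALL=/opt/hadoop
# bin holds the hadoop and hdfs commands, sbin the start/stop scripts
export PATH="$PATH:$HADOOP_INSTALL/bin:$HADOOP_INSTALL/sbin"
```

After reloading the shell startup file with source ~/.bashrc, running hadoop version should report the installed release.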
Apart from these environment variables, modify the /opt/hadoop/etc/hadoop/hadoop-env.sh script as follows. If you are using an older version of Hadoop, this file can be found in: /opt/hadoop/conf/. Note that in case you decide to relocate this configuration directory, you will have to pass on the directory location when starting any of the Hadoop daemons (see daemon table below) using the --config option.
Hadoop assigns 1 GB of memory to each daemon so this default value needs to be reduced via parameter HADOOP_HEAPSIZE to allow for Raspberry Pi conditions. The JAVA_HOME setting for the location of the Java implementation may be omitted if already set in your shell environment, as shown above. Finally, set the datanode’s Java virtual machine to client mode. (Note that with the Raspberry Pi 2 Model B’s ARMv7 processor, this ARMv6-specific setting is not strictly necessary anymore.)
# The java implementation to use. Required, if not set in the home shell
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
# The maximum amount of heap to use, in MB. Default is 1000.
# Command specific options appended to HADOOP_OPTS when specified
export HADOOP_DATANODE_OPTS="-Dcom.sun.management.jmxremote $HADOOP_DATANODE_OPTS -client"
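To get a feeling for why the default heap size does not fit, the rough worst-case arithmetic below may help. It is a sizing sketch rather than a tuning recommendation: the daemon count of five (including the node manager started alongside the resource manager) is our own tally, the heap value is a ceiling rather than a reservation, and the operating system needs memory too, so a sensible HADOOP_HEAPSIZE value will sit below the even split printed here.

```shell
# Rough worst-case heap budget for a single-node setup (a sizing sketch,
# not a tuning recommendation).
RAM_MB=1024            # Raspberry Pi 2 Model B
GPU_MB=32              # memory split chosen in raspi-config
DAEMONS=5              # namenode, secondary namenode, datanode,
                       # resource manager, node manager (our own tally)
DEFAULT_HEAP_MB=1000   # Hadoop default per daemon

AVAILABLE_MB=$((RAM_MB - GPU_MB))
WORST_CASE_MB=$((DAEMONS * DEFAULT_HEAP_MB))
PER_DAEMON_MB=$((AVAILABLE_MB / DAEMONS))

echo "available: ${AVAILABLE_MB} MB, default worst case: ${WORST_CASE_MB} MB"
echo "even split per daemon: ${PER_DAEMON_MB} MB"
```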
Hadoop daemon properties
With the environment settings completed, you are ready for the more advanced Hadoop daemon configurations. Note that the configuration files are not held globally, i.e. each node in a Hadoop cluster holds its own set of configuration files which need to be kept in sync by the administrator using, for example, rsync.
Modify the following files, as shown below, to configure the Hadoop system for operation in pseudo-distributed mode. You can find these files in directory /opt/hadoop/etc/hadoop. In the case of older Hadoop versions, look for the files in: /opt/hadoop/conf
|core-site.xml|Common configuration settings for Hadoop Core.|
|hdfs-site.xml|Configuration settings for HDFS daemons: the namenode, the secondary namenode and the datanodes.|
|mapred-site.xml|General configuration settings for MapReduce daemons. Since we are running MapReduce using YARN, the MapReduce jobtracker and tasktrackers are replaced with a single resource manager running on the namenode.|
File: mapred-site.xml.template (“mapred-site.xml”, if dealing with older Hadoop versions)
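As a minimal sketch of what such a pseudo-distributed configuration might contain, the script below generates the site files into a scratch directory for review. The property names are standard Hadoop 2.x settings, but the values are assumptions: port 54310, a replication factor of 1 for a single datanode, and /hdfs/tmp (the local directory created in the next section) as Hadoop's temporary directory. A yarn-site.xml entry for the MapReduce shuffle service is included because MapReduce jobs on YARN rely on it. write_site is a hypothetical helper.

```shell
# Sketch: minimal pseudo-distributed settings, written to a scratch
# directory for review. The port, the values and the scratch path are
# assumptions, not the authors' exact configuration.
CONF_DIR="${CONF_DIR:-/tmp/hadoop-conf-sketch}"
mkdir -p "$CONF_DIR"

# write_site FILE NAME VALUE [NAME VALUE ...]
# Hypothetical helper that emits one <property> element per pair.
write_site() {
    file="$CONF_DIR/$1"
    shift
    {
        echo '<?xml version="1.0"?>'
        echo '<configuration>'
        while [ "$#" -ge 2 ]; do
            printf '  <property><name>%s</name><value>%s</value></property>\n' "$1" "$2"
            shift 2
        done
        echo '</configuration>'
    } > "$file"
}

write_site core-site.xml   fs.defaultFS hdfs://localhost:54310 hadoop.tmp.dir /hdfs/tmp
write_site hdfs-site.xml   dfs.replication 1
write_site mapred-site.xml mapreduce.framework.name yarn
write_site yarn-site.xml   yarn.nodemanager.aux-services mapreduce_shuffle

ls "$CONF_DIR"
```

Once you are happy with the generated files, copy them into /opt/hadoop/etc/hadoop as “hduser”.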
Hadoop Distributed File System (HDFS) creation
HDFS has been automatically installed as part of the Hadoop installation. Create a local tmp directory for HDFS to store temporary test data and change the directory ownership to your Hadoop user of choice. A new HDFS installation needs to be formatted prior to use. This is achieved via the namenode -format option.
sudo mkdir -p /hdfs/tmp
sudo chown hduser:hadoop /hdfs/tmp
sudo chmod 750 /hdfs/tmp
hadoop namenode -format
Launch HDFS and YARN daemons
Hadoop comes with a set of scripts for starting and stopping the various daemons. In Hadoop 2.x, they can be found in the sbin directory of the Hadoop installation. Since you are dealing with a single node setup, you do not need to tell Hadoop about the various machines in the cluster to execute any script on, and you can simply execute the start-dfs.sh and start-yarn.sh scripts straightaway to launch the Hadoop file system (namenode, datanode and secondary namenode) and YARN resource manager daemons. If you need to stop these daemons, use the stop-dfs.sh and stop-yarn.sh scripts, respectively.
Check the resource manager web UI at http://localhost:8088 for a node overview. Similarly, http://localhost:50070 will provide you with details on your HDFS. If you find yourself in need of issue diagnostics at any point, consult the log4j.log file in the Hadoop installation directory /logs first. If preferred, you can separate the log files from the Hadoop installation directory by setting a new log directory in HADOOP_LOG_DIR and adding it to script hadoop-env.sh.
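Besides the web UIs, a quick command line check can confirm that the daemons are up. The sketch below assumes that the JDK's jps tool is on your path and that the daemons show up under their usual class names; missing_daemons is a hypothetical helper.

```shell
# missing_daemons: hypothetical helper that reads `jps` output on stdin
# and prints every expected Hadoop daemon that is not listed.
missing_daemons() {
    listing=$(cat)
    for d in NameNode SecondaryNameNode DataNode ResourceManager NodeManager; do
        # jps prints "<pid> <main class>", so compare the second field exactly
        echo "$listing" \
            | awk -v d="$d" '$2 == d { found = 1 } END { exit !found }' \
            || echo "$d"
    done
}

# Usage on the Pi, as hduser:
#   jps | missing_daemons
```

An empty result means that all five daemons are running. Any name printed points to a daemon whose log file is worth a look.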
With all the implementation work completed, it is time for a little Hadoop processing example.
We will run some word count statistics on the standard Apache Hadoop license file to give your Hadoop core setup a simple test run. The word count executable represents a standard element of your Hadoop examples jar file. To get going, you need to upload the Apache Hadoop license file into the HDFS root directory.
hadoop fs -copyFromLocal /opt/hadoop/LICENSE.txt /license.txt
Run word count against the license file and write the result into the HDFS output directory /license-out.txt.
hadoop jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar wordcount /license.txt /license-out.txt
You can copy the HDFS output directory to your local home directory via:
hadoop fs -copyToLocal /license-out.txt ~/
Have a look at ~/license-out.txt/part-r-00000 with your preferred text editor to see the word count results. Each line holds a word and its number of occurrences, separated by a tab.
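If you would like to cross-check a few of the counts without Hadoop, a plain shell pipeline gives a reasonable approximation. The sketch below splits on whitespace and sorts by word in byte order. It is not the Hadoop implementation, so edge cases in tokenisation may differ; local_wordcount is a hypothetical helper.

```shell
# local_wordcount: hypothetical helper that counts whitespace-separated
# words in a file and prints "word<TAB>count", sorted by word.
local_wordcount() {
    tr -s '[:space:]' '\n' < "$1" \
        | grep -v '^$' \
        | LC_ALL=C sort \
        | uniq -c \
        | awk '{ print $2 "\t" $1 }'
}

# Usage on the Pi:
#   local_wordcount /opt/hadoop/LICENSE.txt | head
```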
Links
Apache Software Foundation Hadoop Distribution – http://www.apache.org/dyn/closer.cgi/hadoop/common/
Jonas Widriksson blog – http://www.widriksson.com/raspberry-pi-hadoop-cluster/
SAP Lumira desktop trial edition – http://saplumira.com/download/
A BOBI document dashboard with Raspberry Pi – http://bit.ly/1Mv2Rv5
References
[1] V. S. Agneeswaran, “Big Data Beyond Hadoop”, Pearson, USA, 2014
[2] K. Shvachko, H. Kuang, S. Radia and R. Chansler, “The Hadoop Distributed File System”, Proc. of MSST 2010, 05/2010
[3] T. White, “Hadoop: The Definitive Guide”, 3rd edition, O’Reilly, USA, 2012