A Hadoop data lab project on Raspberry Pi – Part 2/4
Carsten Mönning and Waldemar Schiller
Part 1 – Single node Hadoop on Raspberry Pi 2 Model B (~120 mins), http://bit.ly/1dqm8yO
Part 2 – Hive on Hadoop (~40 mins)
Part 3 – Hive access with SAP Lumira (~30mins), http://bit.ly/1cbPz68
Part 4 – A Hadoop cluster on Raspberry Pi 2 Model B(s) (~45mins), http://bit.ly/1eO766g
Part 2 – Hive on Hadoop (~40 mins)
Following on from the Hadoop core installation on a Raspberry Pi 2 Model B in Part 1 of this blog series, in this Part 2, we will proceed with installing Apache Hive on top of HDFS and show its basic principles with the help of last part’s word count Hadoop processing example.
Hive represents a distributed relational data warehouse featuring a SQL-like query language, HiveQL, inspired by the MySQL SQL dialect. A high-level comparison of the HiveQl and SQL is provided in . For a HiveQL command reference, see: https://cwiki.apache.org/confluence/display/Hive/LanguageManual.
The Hive data sits in HDFS with HiveQL queries getting translated into MapReduce jobs by the Hadoop run-time environment. Whilst traditional relational data warehouses enforce a pre-defined meta data schema when writing data to the warehouse, Hive performs schema on read, i.e., the data is checked when a query is launched against it. Hive alongside the NoSQL data warehouse HBase represent frequently used components of the Hadoop data processing layer for external applications to push query workloads towards data in Hadoop. This is exactly what we are going to do in Part 3 of this series when connecting to the Hive environment via the SAP Lumira Apache Hive standard connector and pushing queries through this connection against the word count output file.
The latest stable Hive release will operate alongside the latest stable Hadoop release and can be obtained from Apache Software Foundation mirror download sites. Initiate the download, for example, from spacedump.net and unpack the latest stable Hive release as follows. You may also want to rename the binary directory to something a little more convenient.
tar -xzvf apache-hive-1.1.0-bin.tar.gz
mv apache-hive-1.1.0-bin hive-1.1.0
Add the paths to the Hive installation and the binary directory, respectively, to your user environment.
Make sure your Hadoop user chosen in Part 1 (here: hduser) has ownership rights to your Hive directory.
chown -R hduser:hadoop hive
To be able to generate tables within Hive, run the Hadoop start scripts start-dfs.sh and start-yarn.sh (see also Part 1). You may also want to create the following directories and access settings.
hadoop fs -mkdir /tmp
hadoop fs -chmod g+w /tmp
hadoop fs -mkdir /user/hive/warehouse
hadoop fs -chmod g+w /user/hive/warehouse
Strictly speaking, these directory and access settings assume that you are intending to have more than one Hive user sharing the Hadoop cluster and are not required for our current single Hive user scenario.
By typing in hive, you should now be able to launch the Hive command line interface. By default, Hive issues information to standard error in both interactive and noninteractive mode. We will see this effect in action in Part 3 when connecting to Hive via SAP Lumira. The -S parameter of the hive statement will suppress any feedback statements.
Typing in hive –service help will provide you with a list of all available services :
Command-line interface to Hive. The default service.
Hive operating as a server for programmatic client access via, for example, JDBC and ODBC. Http, port 10000. Port configuration parameter HIVE_PORT.
|Hive web interface for exploring the Hive schemas. Http, port: 9999. Port configuration parameter hive.hwi.listen.port.|
|jar||Hive equivalent to hadoop jar. Will run Java applications in both the Hadoop and Hive classpath.|
|metastore||Central repository of Hive meta data.|
If you are curious about the Hive web interface, launch hive –service hwi, enter http://localhost:9999/hwi in your browser and you will be shown something along the lines of the screenshot below.
If you run into any issues, check out the Hive error log at /tmp/$USER/hive.log. Similarly, the Hadoop error logs presented in Part 1 can prove useful for Hive debugging purposes.
An example (continued)
Following on from our word count example in Part 1 of this blog series, let us upload the word count output file into Hive’s local managed data store. You need to generate the Hive target table first. Launch the Hive command line interface and proceed as follows.
create table wcount_t(word string, count int) row format delimited fields terminated by ‘\t’ stored as textfile;
In other words, we just created a two-column table consisting of a string and an integer field delimited by tabs and featuring newlines for each new row. Note that HiveQL expects a command line to be finished with a semicolon.
The word count output file can now be loaded into this target table.
load data local inpath ‘~/license-out.txt/part-r-00000’ overwrite into table wcount_t;
Effectively, the local file part-r-00000 is stored in the Hive warehouse directory which is set to user/hive/warehouse by default. More specifically, part-r-00000 can be found in Hive Directory user/hive/warehouse/wcount_t and you may query the table contents.
select * from wcount_t;
If everything went according to plan, your screen should show a result similar to the screenshot extract below.
If so, it means you managed to both install Hive on top of Hadoop on Raspberry Pi 2 Model B and load the word count output file generated in Part 1 into the Hive data warehouse environment. In the process, you should have developed a basic understanding of the Hive processing environment, its SQL-like query language and its interoperability with the underlying Hadoop environment.
In the next part of this series, we will bring the implementation and configuration effort of Parts 1 & 2 to fruition by running SAP Lumira as a client against the Hive server and will submit queries against the word count result file in Hive using standard SQL with the Raspberry Pi doing all the MapReduce work. Lumira’s Hive connector will translate these standard SQL queries into HiveQL so that things appear pretty standard from the outside. Having worked your way through the first two parts of this blog series, however, you will be very much aware of what is actually going on behind the scene.
Apache Software Foundation Hive Distribution – Index of /hive
Apache Hive wiki – https://cwiki.apache.org/confluence/display/Hive/GettingStarted
Apache Hive command reference – https://cwiki.apache.org/confluence/display/Hive/LanguageManual
A Hadoop data lab project Part 1 – http://bit.ly/1dqm8yO
A BOBI document dashboard with Raspberry Pi – http://bit.ly/1Mv2Rv5
Configuring Hive ports – http://docs.hortonworks.com/HDP2Alpha/index.htm#Appendix/Ports_Appendix/Hive_Ports.htm
 T. White, “Hadoop: The Definitive Guide”, 3rd edition, O’Reilly, USA, 2012