Hello World program in Hadoop using Hortonworks Sandbox
This blog is part of the series My Learning Journey for Hadoop. In this blog I will focus on running a Hello World program in Hadoop using the Hortonworks Sandbox. The Hello World program will use three components of Hadoop: HDFS, HCatalog, and Hive.
Prerequisite: You can start directly from this blog and finish the Hello World example, but for a better understanding I suggest going through the previous blogs in the series My Learning Journey for Hadoop.
Step 1: Prepare Hadoop Environment
Instead of installing Hadoop from scratch, we will be using the sandbox system provided by Hortonworks.
The Hortonworks Sandbox is a personal, portable Hadoop environment that also comes with a dozen interactive Hadoop tutorials. Follow the steps below to install it.
- Download and install Oracle VirtualBox from http://download.virtualbox.org/virtualbox/4.3.12/VirtualBox-4.3.12-93733-Win.exe
- Download the Hortonworks Sandbox image (.ova file) from http://hortonassets.s3.amazonaws.com/2.1/virtualbox/Hortonworks_Sandbox_2.1.ova
- Open Oracle VirtualBox, click File -> Import Appliance, select the .ova file you just downloaded, and click Next.
- Take a quick look at the Appliance Settings, which show the guest OS the Sandbox will run on.
- Click the Import button. The import takes a few minutes.
- Click Start to boot the Hortonworks Sandbox environment.
- You will notice the virtual machine booting up and loading its configuration.
- Finally, the sandbox IP address is displayed in the console window.
Done!! We have successfully installed and configured the Hortonworks Sandbox.
Step 2: First glance at Hortonworks sandbox
- Once the virtual machine has started, open a browser and enter the sandbox IP address. Fill in the form and register.
- The home page after registration shows different tutorials on Hadoop.
- You can check the different components integrated with the Hadoop Sandbox by typing /about after the IP address.
Step 3: Write a Hello World program
Let’s take the simplest use case: a database consisting of a single CSV file.
Input: A CSV file with two columns, “cities” and “temperature”.
Output: The cities with the maximum and minimum temperature.
We will load the CSV file into HDFS, register the data with HCatalog, and process it using Hive.
Download the example CSV file from here: https://drive.google.com/file/d/1yXGya4UB2n4F40hPJ4FysQxQagkQT_WG/view?usp=sharing
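If you would rather create the file yourself, a minimal version could look like the fragment below. Note that these city names and temperatures are purely illustrative, not the actual contents of the downloaded file:

```csv
cities,temperature
Delhi,45
Shimla,2
Mumbai,32
```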
Load The Data to Hadoop Distributed File System (HDFS)
Click on the File Browser to upload the example data into HDFS. Click the Upload Files button and select the CSV file.
You will notice that the file is uploaded into HDFS. You can check the contents of the data file with a single click.
Register the data with HCatalog
After loading the data into HDFS, we need to register it with HCatalog to make the data available to all processing languages such as Pig and Hive.
Click on the HCatalog icon and choose Create a New Table from a file. Give the table a meaningful name (for example, “City_Temparature_List”).
Upload the example data file which you have just downloaded.
You will notice the raw content of the data file appear. Here you can adjust the schema metadata, i.e. define the encoding, rename columns, change column types, and so on. For this very first example, we leave everything as it is.
Click the Create Table button to register the data file in HCatalog. You will notice that the data file is registered successfully and appears in HCatalog.
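Behind the scenes, the HCatalog wizard issues a CREATE TABLE statement roughly equivalent to the sketch below. This assumes the table name from our example and the two columns from our CSV file; the exact DDL the Sandbox generates may differ:

```sql
-- Rough HiveQL equivalent of the HCatalog "create table from file" wizard.
-- Table and column names follow the example above; adjust to match your file.
CREATE TABLE City_Temparature_List (
  cities STRING,
  temperature INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
```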
Process the Data using Hive
We will use Hive to query the data. Hive provides a mechanism to query the data using a SQL-like language called HiveQL.
Click on the Beeswax tool, which gives you an interactive interface to Hive.
Since we have already registered our table in HCatalog, Hive has access to it. Execute a query to find the city with the maximum temperature.
The output should look like the image below.
Similarly, we can run a query to find the city with the minimum temperature.
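If you want to type the queries into Beeswax yourself, they might look like the following sketch, assuming the table and column names from the example above:

```sql
-- City with the maximum temperature
SELECT cities, temperature
FROM City_Temparature_List
ORDER BY temperature DESC
LIMIT 1;

-- City with the minimum temperature
SELECT cities, temperature
FROM City_Temparature_List
ORDER BY temperature ASC
LIMIT 1;
```

ORDER BY with LIMIT 1 is a simple way to pick a single extreme row; on large datasets a MAX()/MIN() aggregation joined back to the table would avoid a full sort, but for this tiny example either approach works.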
Congratulations!!! We have completed our very first Hadoop example: loading the data into HDFS, registering it with HCatalog, and finally executing Hive queries to get results from the data.
Check the blog series My Learning Journey for Hadoop for further learning materials.