Hello World program in Hadoop using Hortonworks Sandbox
This blog is part of the series My Learning Journey for Hadoop. In this blog I will focus on running a Hello World program in Hadoop using the Hortonworks Sandbox. The Hello World program will use three components of Hadoop: HDFS, HCatalog, and Hive.
Prerequisite: You can start directly from this blog and finish the Hello World example, but for a better understanding I suggest going through the previous blogs in the series My Learning Journey for Hadoop.
Step 1: Prepare Hadoop Environment
Instead of installing Hadoop from scratch, we will be using the sandbox system provided by Hortonworks.
The Hortonworks Sandbox is a personal, portable Hadoop environment that also comes with a dozen interactive Hadoop tutorials. Follow the steps below to install it.
- Download and install Oracle VirtualBox from http://download.virtualbox.org/virtualbox/4.3.12/VirtualBox-4.3.12-93733-Win.exe
- Download the Hortonworks Sandbox image (.ova file) from http://hortonassets.s3.amazonaws.com/2.1/virtualbox/Hortonworks_Sandbox_2.1.ova
- Open Oracle VirtualBox, click File -> Import Appliance, select the .ova file you just downloaded, and click Next.
- Take a quick look at the Appliance Settings, which show the guest OS the Sandbox will run on.
- Click the Import button. The import takes a few minutes.
- Click Start to boot the Hortonworks Sandbox environment.
- You will notice the virtual machine booting up and loading its configuration.
- Finally, the sandbox IP address is displayed in the console window.
Done!! We have successfully installed and configured the Hortonworks Sandbox.
Step 2: First glance at Hortonworks sandbox
- Once the virtual machine has started, open a browser and enter the sandbox IP address. Fill in the form and register.
- The home page after registration shows different tutorials on Hadoop.
- You can check the different components integrated with the Hadoop Sandbox by typing /about after the IP address.
Step 3: Write a Hello World program
Let’s take the simplest use case: a database consisting of a single CSV file.
Input: A CSV file with two columns, “cities” and “temperature”.
Output: The cities with the maximum and minimum temperature.
We will load the CSV file into HDFS, register the data with HCatalog, and process it using Hive.
Download the example CSV file from here: https://drive.google.com/file/d/1yXGya4UB2n4F40hPJ4FysQxQagkQT_WG/view?usp=sharing
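If you would rather create the file yourself, a minimal version could look like the fragment below. Note that these city names and temperatures are purely illustrative, not the actual contents of the downloaded file:

```csv
cities,temperature
Delhi,45
Shimla,2
Mumbai,32
```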
Load The Data to Hadoop Distributed File System (HDFS)
Click on the File Browser to upload the example data into HDFS. Click the Upload Files button and select the CSV file.
You will notice that the file is uploaded into HDFS. You can check the contents of the data file with a single click.
Register the data with HCatalog
After loading the data into HDFS, we need to register it with HCatalog to make the data available to all processing languages such as Pig and Hive.
Click on the HCatalog icon and choose Create a New Table from a file. Give the table a meaningful name (for example, “City_Temparature_List”).
Upload the example data file which you have just downloaded.
You will notice the raw content of the data file appear. Here you can adjust the schema metadata, i.e. define the encoding, rename columns, change column types, and so on. For this very first example, we leave everything as it is.
Click the Create Table button to register the data file in HCatalog. You will notice that the data file is registered successfully and appears in HCatalog.
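Behind the scenes, the HCatalog wizard issues a CREATE TABLE statement roughly equivalent to the sketch below. This assumes the table name from our example and the two columns from our CSV file; the exact DDL the Sandbox generates may differ:

```sql
-- Rough HiveQL equivalent of the HCatalog "create table from file" wizard.
-- Table and column names follow the example above; adjust to match your file.
CREATE TABLE City_Temparature_List (
  cities STRING,
  temperature INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
```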
Process the Data using Hive
We will use Hive to query the data. Hive provides a mechanism to query the data using a SQL-like language called HiveQL.
Click on the Beeswax tool, which gives you an interactive interface to Hive.
Since we have already registered our table in HCatalog, Hive has access to it. Execute a query to find the city with the maximum temperature.
The output should look like the image below.
Similarly, we can run a query to find the city with the minimum temperature.
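If you want to type the queries into Beeswax yourself, they might look like the following sketch, assuming the table and column names from the example above:

```sql
-- City with the maximum temperature
SELECT cities, temperature
FROM City_Temparature_List
ORDER BY temperature DESC
LIMIT 1;

-- City with the minimum temperature
SELECT cities, temperature
FROM City_Temparature_List
ORDER BY temperature ASC
LIMIT 1;
```

ORDER BY with LIMIT 1 is a simple way to pick a single extreme row; on large datasets a MAX()/MIN() aggregation joined back to the table would avoid a full sort, but for this tiny example either approach works.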
Congratulations!!! We have completed our very first Hadoop example: loading the data into HDFS, registering it with HCatalog, and finally executing Hive queries to get results from the data.
Check the blog series My Learning Journey for Hadoop for further learning materials.