From my previous blog (http://scn.sap.com/blogs/SomaMukherjee/2015/08/06/hadoop-overview) you have got basic idea about Hadoop. Now let’s go a little deeper and learn different components of Hadoop ecosystem. This blog talks about one component of Hadoop ecosystem that is Hive.

You have learnt how the data is stored in Hadoop file system. But we should know how we can use that data for analysis.

What is Hive?

Hive is a data warehouse system for Hadoop that facilitates ad-hoc queries and the analysis of large datasets stored in Hadoop. Hive provides a SQL-like language called HiveQL. Due its SQL-like interface, Hive is increasingly becoming the technology of choice for using Hadoop.

The data is organized in three different formats in HIVE.

Tables: They are very similar to RDBMS tables and contains rows and columns. Hive is just layered over the Hadoop File System (HDFS), hence tables are directly mapped to directories of the filesystems. It also supports tables stored in other native file systems.

Partitions: Hive tables can have more than one partition. They are mapped to subdirectories and file systems as well.

Buckets: In Hive, data may be divided into buckets. Buckets are stored as files in partition in the underlying file system.

Why do Hive and Pig exist when they seem to do much of the same thing?


Pig: Pig is a high-level platform for creating MapReduce programs used in Hadoop. The language for this platform is called Pig Latin.

Hive

Pig

Because of its SQL like query language it’s often used as the interface to an Apache Hadoop based data warehouse. Hive is considered friendlier and more familiar to users who are used to using SQL for querying data.

Pig fits in through its data flow strengths where it takes on the tasks of bringing data into Apache Hadoop and working with it to get it into the form for querying.

Hive is for structured data

Pig is for unstructured data.

Developed by Yahoo

Developed by Facebook

Language name HiveQL

Language name Pig Latin

Hive has web interface

Pig does not have web interface

HiveQL is a declarative language like SQL

PigLatin is a data flow language. The output of one PigLatin construct can be sent as input to another PigLatin construct and so on.

Steps to explore Hive on your own:

  1. Need a system with Hadoop installed, up and running.

You can install the sandbox system and start working on Hadoop.

Please find the links below,

Download Virtual box:

https://www.virtualbox.org/wiki/Downloads

Download the sandbox for virtual box:

http://hortonworks.com/hdp/downloads/

   2. Start the sandbox system.

   3. Launch the Hive using below link.

http://127.0.0.1:8000/

  4. Click on the Hive UI on the top left corner.

Here you can write queries and explore more about Hive.

QUERIES

Function

MySQL

HiveQL

Retrieving information

SELECT from_columns FROM table WHERE conditions;

SELECT from_columns FROM table WHERE conditions;

All values

SELECT * FROM table;

SELECT * FROM table;

Some values

SELECT * FROM table WHERE rec_name = “value”;

SELECT * FROM table WHERE rec_name = “value”;

Multiple criteria

SELECT * FROM table WHERE rec1=”value1” AND rec2=”value2”;

SELECT * FROM TABLE WHERE rec1 = “value1” AND rec2 = “value2”;

Selecting specific columns

SELECT column_name FROM table;

SELECT column_name FROM table;

Retrieving unique output records

SELECT DISTINCT column_name FROM table;

SELECT DISTINCT column_name FROM table;

Sorting

SELECT col1, col2 FROM table ORDER BY col2;

SELECT col1, col2 FROM table ORDER BY col2;

Sorting backward

SELECT col1, col2 FROM table ORDER BY col2 DESC;

SELECT col1, col2 FROM table ORDER BY col2 DESC;

Counting rows

SELECT COUNT(*) FROM table;

SELECT COUNT(*) FROM table;

Grouping with counting

SELECT owner, COUNT(*) FROM table GROUP BY owner;

SELECT owner, COUNT(*) FROM table GROUP BY owner;

Maximum value

SELECT MAX(col_name) AS label FROM table;

SELECT MAX(col_name) AS label FROM table;

Selecting from multiple tables (Join same table using alias w/”AS”)

SELECT pet.name, comment FROM pet, event WHERE pet.name = event.name;

SELECT pet.name, comment FROM pet JOIN event ON (pet.name = event.name);

METADATA

Function

MySQL

HiveQL

Listing databases

SHOW DATABASES;

SHOW DATABASES;

Listing tables in a database

SHOW TABLES;

SHOW TABLES;

Creating a database

CREATE DATABASE db_name;

CREATE DATABASE db_name;

Dropping a database

DROP DATABASE db_name;

DROP DATABASE db_name (CASCADE);

General Information:

Hive supports all the common primitive data formats such as BIGINT, BINARY, BOOLEAN, CHAR, DECIMAL, DOUBLE, FLOAT, INT, SMALLINT, STRING, TIMESTAMP, and TINYINT. In addition, analysts can combine primitive data types to form complex data types, such as struts, maps and arrays.

Pros 

Similarity with SQL

Multiple schemas can be projected on the same data — no ETL required

Works with popular query tools that connect via JDBC and ODBC

Able to query very large data sets interactively

Cons

DDL requires a bit of learning and different thinking

Tuning Hive is very different from tuning RDBMS

Here are the basics of Hive which will enable you to start exploring more about Hive. If you want to know more about how sap is connecting HANA to Hadoop, you can follow the below links.

http://scn.sap.com/thread/3773791

http://scn.sap.com/community/developer-center/hana/blog/2013/08/22/smart-data-access-and-hadoop

http://scn.sap.com/thread/3226387

References:

http://hortonworks.com/hadoop-tutorial/how-to-process-data-with-apache-hive/

http://stackoverflow.com/questions/3356259/difference-between-pig-and-hive-why-have-both

http://www.dezyre.com/article/difference-between-pig-and-hive-the-two-key-components-of-hadoop-ecosystem/79

http://www.thecloudavenue.com/2012/12/introduction-to-apache-hive-and-pig.html

http://www.orzota.com/hive-tutorial-for-beginners/

http://www.infoworld.com/article/2608271/hadoop/hadoop-review-apache-hive-brings-real-time-queries-to-hadoop.html

http://hortonworks.com/wp-content/uploads/downloads/2013/08/Hortonworks.CheatSheet.SQLtoHive.pdf

http://blog.sqlauthority.com/2013/10/21/big-data-data-mining-with-hive-what-is-hive-what-is-hiveql-hql-day-15-of-21/

To report this post you need to login first.

2 Comments

You must be Logged on to comment or reply to a post.

Leave a Reply