
This blog is the second in the series on using SAP HANA One and Hadoop to gain insights from Wikipedia page hit data. To see how I got the data from Wikipedia into AWS S3 storage, check out “Using SAP HANA to analyze Wikipedia data – Preparing the Data”. Hive is a great tool for people who know any variation of the SQL language to process huge amounts of data locked up in delimited text files. You can then use the results for further analysis with tools like SAP HANA One and the new SAP Lumira data visualization tool. HiveQL looks a lot like ANSI SQL, but without data manipulation language (DML) statements like the INSERT VALUES, UPDATE and DELETE commands. Data is essentially loaded into the Hive data warehouse using the LOAD DATA command, or Hive is simply told where to find the data in an AWS S3 bucket using the ALTER TABLE RECOVER PARTITIONS command, which I will demonstrate. For the complete list of Hive commands, check out the language manual at https://cwiki.apache.org/confluence/display/Hive/LanguageManual. You will need to correlate the Hive features in the manual with the Hive versions that AWS supports. As of today, the current Hive version on AWS EMR is 0.8.1.
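
To make the distinction concrete, here is a minimal sketch of the two loading styles (the table and file names are placeholders invented just for this illustration; the real table definition for the Wikipedia data comes later in this post). LOAD DATA physically moves files into Hive's warehouse, while an EXTERNAL table simply points at files that stay where they already are:

-- Hypothetical managed table: LOAD DATA copies the local file into Hive's warehouse directory.
CREATE TABLE hits_staging (pagename STRING, hits BIGINT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
LOAD DATA LOCAL INPATH '/tmp/hits.tsv' INTO TABLE hits_staging;

-- Hypothetical external table: the data stays put and Hive only references its location.
CREATE EXTERNAL TABLE hits_external (pagename STRING, hits BIGINT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://my-example-bucket/hits/';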

In this blog post, I will focus on the interaction between the “Elastic MapReduce Hive” and “S3 Storage” blocks below.

Blog 2 diagram.png                   

Process flow for running Hive jobs on AWS Elastic MapReduce (EMR)

The objective for this blog is to create a set of output files from Hive that contain the data for the Wikipedia pages that have 100 or more hits for any hour within the month of April. I am going to assume that the Wikipedia data was copied over to the S3 storage in a directory structure that looks like this: ‘s3://wikipedia.pagecounts/year=2013/month=04/day=01/hour=00’. The results from the Hive query that I will demonstrate will go into the directory ‘s3://wikipedia-pagecounts-hive-results/year=2013/month=04/’. To get there, I will show how to set up an AWS EMR cluster for Hive and run the needed HiveQL commands using the AWS SSH client.

NOTE: If you are trying this from home, you will need to change the bucket names to match the unique bucket names on your system.

Create an AWS EMR cluster for Hive using an interactive console

Sign into AWS account

Go to Services->Elastic MapReduce

Click on ‘Create New Job Flow’

Enter the job flow name, select ‘Amazon Distribution’ for the Hadoop version, ‘Run your own application’ under ‘Create a Job Flow’, and ‘Hive Program’ for the job flow type as shown below. Click ‘Continue’.

Blog 1 01.png

Select ‘Start an Interactive Hive Session’ and click ‘Continue’

Blog 1 02.png

You can operationalize the process by using the “Execute a Hive Script” option, where EMR runs the script and terminates the cluster when the job is complete. Since I’m showing you how to do this on your own, I’m going with the interactive session.

Select the instance types and count for the master and core instance groups as shown below and press Continue.

Blog 1 03.png

You can reduce costs by checking Request Spot Instances and typing in the price that you are willing to pay for the instances. In this case, an m1.xlarge instance normally runs 48 cents per hour. On the spot market, the average going rate is 5.2 cents per hour. The catch is that if the spot price rises above your bid, AWS will terminate your instances. 😯 I have a video that shows how this works and I’ll add the link here once it’s up on YouTube.

Select the Amazon EC2 Key Pair and press Continue.

Blog 1 04.png

Note: If you aren’t using the Interactive Hive Session, you don’t need to specify the key file.

Leave the option to ‘Proceed with no Bootstrap options’ checked and click Continue.

Blog 1 05.png

You can use the Bootstrap options to install other software programs needed for your Hadoop operations. For this demonstration, choosing Hive provides everything that we need.

You are now ready to create the cluster. Just click on ‘Create a new job flow’ to start the process.

Blog 1 06.png

EMR then displays the following dialog to let you know that it has started the job.

Blog 1 07.png

To see the results, click on the ‘view my job flows and check on job flow status’ link.

Connect to the Master node from an ssh client

EMR creates a total of 9 EC2 instances, and the process takes around 10 minutes to complete. To track the progress of the instance creation, go to Services -> EC2. You should see 9 new instances that EMR creates for the master and slave nodes.

Blog 1 08.png

Once the state of all the new instances changes to “running”, select the row corresponding to the Master node by looking at the Security Groups column. Note the address of this node: In this case, it is ec2-54-226-84-23.compute-1.amazonaws.com.

Now that all the instances are running for the master and slaves, you need to go back to the EMR dashboard to make sure the cluster is ready. You should see a green circle next to the cluster along with a status value of either WAITING or RUNNING as shown below.

Blog 1 09.png

Note that the Master Public DNS Name highlighted above matches the address for the master in the EC2 console.

You are now ready to connect to the master node using the AWS SSH client.

Navigate to the EC2 console and click on the master node instance. Then right click on the node and select the Connect command.

Blog 1 10.png 

In the Connect to an instance dialog, change the User name to hadoop. If you leave it as root, you will not get the environment setup for running Hadoop and Hive. Next, type in the Private key path for your .pem file and then click on the Launch SSH client button.

Blog 1 11.png 

If all goes well, you should see the SSH client pop up. If you don’t see it, you may have to install the Java plug-in for your browser.

Blog 1 12.png 

You are now ready to start the Hive command line interface. To get hive running, type in the command hive and press Enter.

Blog 1 13.png 

You are now ready to run hive commands.

NOTE: The hive command also accepts other command line parameters, for example the name of a HiveQL script to run, which is handy for automating query execution. For now, I’ll show how to run the commands in interactive mode.
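
As a quick illustration (the script file name below is just a placeholder), you could save the statements from this post into a file and run it without the interactive prompt, or pass a single statement with the -e option:

hive -f wikipedia_pagecounts.hql

hive -e "SELECT COUNT(*) FROM pagecounts;"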

Create a Hive table with partitions so that the Year, Month, Day, and Hour values are set based on the structure of the S3 bucket

To create a table in Hive that will reference the Wikipedia data, copy the text below and then paste it into the SSH client using the Ctrl-Ins keystroke or using the Edit | Paste command. I’m using the PARTITIONED BY clause to tell Hive that it should pull in the Year, Month, Day and Hour values from the directory names. These values will look just like other column names in SELECT queries and are not included in the main column list for the table definition.

CREATE EXTERNAL TABLE pagecounts (projectcode STRING, pagename STRING, hitsperhour BIGINT, bytesperhour BIGINT) PARTITIONED BY (year STRING, month STRING, day STRING, hour STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\32' LOCATION 's3://wikipedia.pagecounts/';

I’m using the EXTERNAL keyword so that Hive doesn’t move the data into its warehouse on the cluster. This way, when you terminate the cluster, the data is still in your S3 bucket. Note also that Hive commands are always terminated with a semi-colon (;). This has tripped me up more than once. 🙂
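
If you want to confirm that the table and its partition columns came out the way you expect, an optional sanity check (not part of the original walkthrough) is to describe the table:

DESCRIBE pagecounts;

The DESCRIBE EXTENDED pagecounts; variation also lists the table properties, including the external LOCATION you just specified.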

Next, we need to tell Hive where to find the data. In this case, I’m using the following statement to point the table to the first partition that I have for March.

ALTER TABLE pagecounts ADD PARTITION (year='2013', month='03', day='01', hour='00');

Once you execute the command, you can try out your first HiveQL statement that actually creates a map-reduce job to do something. Let’s get a count of records using the following statement.

SELECT COUNT(*) from pagecounts;

The result should look something like this:

Blog 1 14.png 

Notice that we only got back a count of 7,297,874. This is far short of the 16,671,880,003 records of data that are in the three-month set that I have in the S3 bucket. This is because the ALTER TABLE statement only initialized one partition.

To make all the data available, you need to issue the following command:

ALTER TABLE pagecounts RECOVER PARTITIONS;

With the RECOVER PARTITIONS clause, Hive goes through each of the folders and parses out the values for Year, Month, Day and Hour to initialize the entire data set. If you add more folders of data while the EMR cluster is active and you want to see that data in your results, you will need to reissue the command above to let Hive know there is more data.
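
To confirm which partitions Hive now knows about, an optional check I’m adding here for illustration is to list them:

SHOW PARTITIONS pagecounts;

You should see one row per year/month/day/hour folder that the RECOVER PARTITIONS command picked up.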

If you run the “SELECT COUNT(*) from pagecounts;” command again, it will take a much longer time to complete as Hive needs to process all the data. For example, on the 8 data node cluster with m1.xlarge servers, the command takes around 1,800 seconds or 30 minutes!

In terms of scale, Hive processed 2,214 individual gz files (745.3 GB of uncompressed data) and then read each line to come up with the count. It took 51 seconds to read one hour’s worth of data. Fortunately, Hadoop tries to use as much of the processing power of the cluster as possible, so instead of taking 51 seconds times 2,214 files, or 112,914 seconds, the time was reduced to 1,800 seconds – 63 times faster. Interestingly enough, each of the m1.xlarge systems has 8 virtual processors and there are 8 data nodes, for 64 total virtual processors. When we ran the first COUNT(*) against the single hour partition, Hive only generated one mapper and one reducer to process the query. This means that we essentially got linear scaling with the number of processors when computing the COUNT(*) over the full data set.

Use a HiveQL statement to count the number of records that will be loaded into SAP HANA One

In order to limit the data you want Hive to process, you will want to use a WHERE clause. The following example counts the number of rows that we intend to import for April.

SELECT COUNT(*) from pagecounts where year='2013' and month='04' and hitsperhour > 99;

The count for this query was 11,756,984 records. I decided to go with hitsperhour > 99 as this produces a result with a significant amount of data that catches interesting events.

With respect to scale, by going with the partitioned table, the map-reduce job ignored all the data in the March and May folders. The result was that this query ran in 670 seconds or 2.7 times faster than processing all the data.
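
If you want to verify the pruning yourself (an optional step I’m adding; the exact plan output varies by Hive version), you can ask Hive for the extended query plan and check that only April partition paths show up in it:

EXPLAIN EXTENDED SELECT COUNT(*) from pagecounts where year='2013' and month='04' and hitsperhour > 99;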

Run a HiveQL statement that outputs data for the month back to the S3 bucket for future loading into SAP HANA One

Now it’s time to create the output files from Hive so that we can import the data into HANA. You will first need to create a new bucket for the output results using the S3 console. In this example, I called the bucket “s3://wikipedia-pagecounts-hive-results”. You will need to come up with a different name if you are trying this with your own S3 account.

The following command tells Hive to run the SELECT statement portion of the command and then output the results to the specified bucket/directory – overwriting whatever was there.

insert overwrite directory 's3://wikipedia-pagecounts-hive-results/year=2013/month=04/'

SELECT projectcode, pagename, year, month, day, hour, hitsperhour, bytesperhour FROM pagecounts where year='2013' and month='04' and hitsperhour > 99;

Using the 8 data node cluster, the copy operation took 968 seconds to copy over 11,756,984 records.

NOTE: I learned the hard way that you need to use a different bucket than the one used for the source files. During my experiments, I mistakenly used the ‘s3://wikipedia.pagecounts/’ bucket without a directory specification. The end result was that the command finished processing almost instantly. That’s because the OVERWRITE clause deleted all the sub-directories where I had the source data and then had nothing to process. 😯

You can run Hadoop file system commands within the Hive shell by using the dfs command. The following command lists the files that Hive created in the prior step.

dfs -ls  s3://wikipedia-pagecounts-hive-results/year=2013/month=04/;

To see what the contents of the file look like, you can use the tail command to view the end of the file. Note, there is a -cat option to output the contents of the file, but that would send approximately 2 million rows of data to the console – not a good thing. The cat command and Big Data should never mix. 😉

dfs -tail  s3://wikipedia-pagecounts-hive-results/year=2013/month=04/000000;

Blog 1 15.png 

Notice the ^A values used for field delimiters. We will need to replace this non-printable character with a “|” pipe character before loading into the HANA table. I’ll cover how to make this replacement in the next blog for this series.
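
If you want to experiment with the replacement right away, one possible approach (just a sketch, and not necessarily the method the next post will use; the output file name is a placeholder) is to stream a result file through tr from the regular shell on the master node, translating the Ctrl-A (octal 001) delimiter into pipes:

hadoop fs -cat s3://wikipedia-pagecounts-hive-results/year=2013/month=04/000000 | tr '\001' '|' > pagecounts_2013_04_000000.csv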

At this point, you can run these two commands to put the March and May data into their respective folders.

insert overwrite directory 's3://wikipedia-pagecounts-hive-results/year=2013/month=03/'

SELECT projectcode, pagename, year, month, day, hour, hitsperhour, bytesperhour FROM pagecounts where year='2013' and month='03' and hitsperhour > 99;

insert overwrite directory 's3://wikipedia-pagecounts-hive-results/year=2013/month=05/'

SELECT projectcode, pagename, year, month, day, hour, hitsperhour, bytesperhour FROM pagecounts where year='2013' and month='05' and hitsperhour > 99;

Terminate the AWS EMR cluster after verifying the results

I hope you are thinking to yourself – that looked easy. It really is. It’s now time to tear down the EMR cluster. To do that, you want to use the EMR console.

Go to Services->Elastic MapReduce, right click on the EMR job flow name and select ‘Terminate job flow’. This single action terminates all nine m1.xlarge EC2 instances used for the cluster. Don’t attempt to terminate the individual instances using the EC2 dashboard.

I hope that you enjoyed this little journey using AWS Elastic MapReduce and Hive to boil 745.3 GB of uncompressed data down to 559.5 MB, while still keeping more than 30 million rows of interesting data to play with in SAP HANA One and SAP Lumira.

In the next blog post, I will show you how to transform the Hive results to create a staging table in SAP HANA One.
