
Welcome to part 3 in this series demonstrating how to analyze Wikipedia page hit data using Hadoop with SAP HANA One and Hive. To see how we got here, check out the first blog post in the series, “Using SAP HANA to analyze Wikipedia data – Preparing the Data”. There are numerous ways to get data from AWS Elastic MapReduce (Hadoop) and Hive into SAP HANA One. SAP put together a video on using SAP BusinessObjects Data Services to import data from Hadoop into SAP HANA at http://www.youtube.com/watch?v=ls_MGp8R7Yk. The challenge with this approach is that Data Services is not easily available to AWS SAP HANA One users. Another way to import data from an active Hadoop cluster is through Apache Sqoop via a JDBC connection to your SAP HANA One database. With Sqoop, the operative word is “active”: you need to keep your AWS EMR cluster running to use it. If you keep an EMR cluster with nine m1.xlarge EC2 instances running at 48 cents per hour per instance, you are looking at approximately $100 per day just to keep the cluster alive. The beauty of using AWS EMR with S3 storage is that you can still access the data in the S3 bucket after terminating the EMR cluster.
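For anyone curious what the Sqoop route would look like, here is a rough sketch of a Sqoop export against a HANA JDBC endpoint. This is purely illustrative and nothing I ran for this series: it assumes the SAP HANA JDBC driver (ngdbc.jar) has been copied into Sqoop’s lib directory, and the host, credentials, table name and export directory are all placeholders.

# Illustrative sketch only - assumes ngdbc.jar is on Sqoop's classpath;
# the host, user, password, table and export directory are placeholders.
sqoop export \
  --connect "jdbc:sap://<your-hana-host>:30015/" \
  --driver com.sap.db.jdbc.Driver \
  --username SYSTEM --password '<your-password>' \
  --table WIKI_PAGEHITS \
  --export-dir /user/hive/warehouse/pagehits \
  --input-fields-terminated-by '\001'

Even with a sketch like this, the cluster still has to stay up for the duration of the export, which is exactly the cost problem described above.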

In part 2 of this series “Reducing the amount of working data for SAP HANA to process using Amazon Elastic MapReduce and Hive”, I showed how you can use Hive to come up with a reasonable working set of data for SAP HANA One to process. The result was the creation of three files for each month stored in the “s3://wikipedia-pagecounts-hive-results/” bucket in the directories named “year=2013/month=03”, “year=2013/month=04” and “year=2013/month=05”.

In this blog post, I will show how to import the data generated by Hive using just the resources available to you on AWS and SAP HANA One, using the components highlighted in the diagram below.

Blog 301.png

Note: This is the first time in the series that I’m actually going to use SAP HANA One. If you want to follow along, you will need an active SAP HANA One database up and running. You can get started by going to http://www.saphana.com/community/try and following the steps in the “Try SAP HANA One” box. I’m also assuming that you have downloaded SAP HANA Studio to either your local computer or to an EC2 instance on AWS. For instructions on configuring SAP HANA Studio, check out http://www.saphana.com/docs/DOC-2438. I prefer using a Windows Server 2012 EC2 instance that is at least an m1.xlarge instance. The reason I like the xlarge instances is that they have a high-bandwidth network connection that makes data operations within the same AWS data center fast – much faster than downloading data to a local machine and then moving it back to AWS for processing.

Let’s get started!

Copy the S3 files to your SAP HANA One Linux server

If you are going to do a lot of work with AWS S3 storage, I recommend getting one of the many tools available for working with S3. The one I use is the S3 Browser by NetSDK Software; they have a freeware edition available at http://s3browser.com/. NetSDK also offers a free trial of TntDrive, which lets you map an S3 bucket as a network drive, at http://tntdrive.com/. If you have experience with other S3 browsers, please share your suggestions as a comment on the blog.

I’m going to provide instructions on how to download the files using the AWS S3 console.

Making a local copy of the Hive files to your computer

First, create a local directory that you will use to download the nine Hive files. I’m going to use C:\wiki-data for this example.

Sign into your AWS account and then navigate to Services -> S3. If you followed along, navigate to the unique S3 bucket you created in part 2; for me, that is the “s3://wikipedia-pagecounts-hive-results/” bucket. Then, navigate to the year=2013/month=03/ directory as shown below.

Blog 302.png

Next, select the first file named 000000 and then right click on the file and select the download command.

Blog 303.png

AWS displays a message box to download the file. You need to right click on the Download link and choose the Save link as… command in your browser.

Blog 304.png

Next, navigate to your C:\wiki-data directory and save the file with the name 2013-03-000000 without a file extension. Once the download is complete, click the OK button to dismiss the message box.

Blog 305.png

You then need to repeat this operation for the 000001 and 000002 files in the month=03/ folder. Then, do the same for the three files in the month=04/ and month=05/ directories.

NOTE: As a public service and for a limited time, I made the Hive files available for download using the following URLs:


http://wikipedia-pagecounts-hive-results.s3.amazonaws.com/year=2013/month=03/000000

http://wikipedia-pagecounts-hive-results.s3.amazonaws.com/year=2013/month=03/000001

http://wikipedia-pagecounts-hive-results.s3.amazonaws.com/year=2013/month=03/000002

http://wikipedia-pagecounts-hive-results.s3.amazonaws.com/year=2013/month=04/000000

http://wikipedia-pagecounts-hive-results.s3.amazonaws.com/year=2013/month=04/000001

http://wikipedia-pagecounts-hive-results.s3.amazonaws.com/year=2013/month=04/000002

http://wikipedia-pagecounts-hive-results.s3.amazonaws.com/year=2013/month=05/000000

http://wikipedia-pagecounts-hive-results.s3.amazonaws.com/year=2013/month=05/000001

http://wikipedia-pagecounts-hive-results.s3.amazonaws.com/year=2013/month=05/000002

Just click on the link to download the file and use the Save As command to save the files into your C:\wiki-data directory.

Copy the local files to your SAP HANA One instance

If you haven’t started your SAP HANA One instance, now would be a good time. 🙂 Navigate to your EC2 console, right click on your SAP HANA One instance and choose the Connect command.

Blog 306.png

For the user name, use root. For the Private key path, enter the path to your .pem key file. Then, click on the Launch SSH Client button.

In the SSH client, create a directory called /wiki-data and then grant rights to the directory so that the HANA instance can access it, using the following Linux commands:

hana:/ # mkdir /wiki-data                                                      

hana:/ # chmod 777 wiki-data

In the AWS SSH client, go to the Plugins menu and choose the SFTP File Transfer … command.

Blog 307.png

On the left side of the dialog, navigate to your C:\wiki-data directory. Then, on the right side, click on the ChDir button, type in /wiki-data for the path and click OK. The dialog should look like the one shown below.

Blog 308.png

Then, multi-select all of the files on the left side and click the -> button in the middle of the dialog to start the transfer.

You should see the following dialog that shows the progress of the copy operation.

Blog 309.png

Again, this goes very fast if you use an EC2 instance on AWS as your client. You can see in the dialog above that I got an upload speed of only 509 KB/sec when transferring from my local machine.

Blog 310.png

When using my xlarge EC2 instance with high network bandwidth, you can see the transfer rate was 37 MB/sec! Yes – 72 times faster!

Once your copy operation is complete, click the Close button to dismiss the SFTP dialog.

NOTE: I decided to copy the data to the Linux system like this for a couple of reasons. The first is that SAP HANA Studio’s import command supports only a limited set of characters, like comma, colon and semicolon, as the field delimiter for columns. Second, I want to show how you can use the IMPORT command to load data instead of using the HANA Studio import wizard.

Replacing the ^A field delimiters with a pipe “|” delimiter for importing into HANA

The IMPORT FROM command makes it easy to import text-delimited files with the CSV FILE option, but there is a catch. You can’t specify a non-printable character using the FIELD DELIMITED BY clause. Since the Hive output files use the non-printable ^A character, we need a way to transform the character into a printable character that doesn’t conflict with the Wikipedia data. I used the Linux grep command to verify that there are no pipe characters in the Hive files.
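If you want to run that check yourself once the files are on the Linux server, one quick way is the command below; a count of 0 for every file confirms that the pipe character is safe to use as the delimiter.

# Count the lines containing a pipe character in each Hive output file;
# every file should report 0
grep -c '|' /wiki-data/2013-*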

In order to convert the field delimited values, we are going to use the Linux sed command. Here is what the command looks like for performing the substitution:

sed 's/\x01/|/g' 2013-03-000000 > 2013-03-000000.csv


The s/ introduces the substitution expression. Linux represents the Ctrl-A character as \x01, and the | between the second and third slashes is the replacement character. The trailing g flag tells sed to replace every occurrence on each line rather than just the first. The last two arguments are the input file and, via the > redirection, the output file. Note that without the redirection, sed simply writes the result to standard output; to modify the input file in place you would use sed’s -i option.
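For example, if you ever want to overwrite a file in place instead of writing a separate .csv, GNU sed’s -i option handles that (use it with care, since the raw Hive output is lost):

# Replace the ^A delimiters in place - the original file is overwritten
sed -i 's/\x01/|/g' 2013-03-000000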

To make the replacement, first change directory in the SSH client using the cd command below.

hana:~ # cd /wiki-data

You can then check to see that all the files are in place by using the ls command below.

hana:/wiki-data # ls -l

total 1798800

-rw-r--r-- 1 root root 256709072 Jun 30 21:34 2013-03-000000

-rw-r--r-- 1 root root 256564106 Jun 30 21:34 2013-03-000001

-rw-r--r-- 1 root root 155604882 Jun 30 21:34 2013-03-000002

-rw-r--r-- 1 root root 256407589 Jun 30 21:34 2013-04-000000

-rw-r--r-- 1 root root 256251290 Jun 30 21:34 2013-04-000001

-rw-r--r-- 1 root root 36089431 Jun 30 21:34 2013-04-000002

-rw-r--r-- 1 root root 256520276 Jun 30 21:34 2013-05-000000

-rw-r--r-- 1 root root 256179168 Jun 30 21:34 2013-05-000001

-rw-r--r-- 1 root root 109761031 Jun 30 21:34 2013-05-000002

Now, you are ready to issue the following nine sed commands:

sed 's/\x01/|/g' 2013-03-000000 > 2013-03-000000.csv

sed 's/\x01/|/g' 2013-03-000001 > 2013-03-000001.csv

sed 's/\x01/|/g' 2013-03-000002 > 2013-03-000002.csv

sed 's/\x01/|/g' 2013-04-000000 > 2013-04-000000.csv

sed 's/\x01/|/g' 2013-04-000001 > 2013-04-000001.csv

sed 's/\x01/|/g' 2013-04-000002 > 2013-04-000002.csv

sed 's/\x01/|/g' 2013-05-000000 > 2013-05-000000.csv

sed 's/\x01/|/g' 2013-05-000001 > 2013-05-000001.csv

sed 's/\x01/|/g' 2013-05-000002 > 2013-05-000002.csv
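Alternatively, if you would rather not paste nine nearly identical commands, a small shell loop (assuming the file naming pattern used above) does the same job:

# Convert every downloaded Hive file in one pass; each file gets a
# matching .csv with the ^A delimiters replaced by pipe characters
for f in 2013-0[345]-00000[0-2]; do
    sed 's/\x01/|/g' "$f" > "$f.csv"
done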

As a quick verification, you can use the tail command as shown below to see the last few lines of the file.

hana:/wiki-data # tail 2013-05-000002.csv

www.wd|Special:AutoLogin|2013|05|20|16|438|782851

zh.mw|zh|2013|05|20|16|13888|207241012

zh|%E6%88%91%E6%84%9B%E4%BD%A0|2013|05|20|16|132|48055

zh|%E7%99%BE%E5%BA%A6|2013|05|20|16|122|1679770

zh|%E9%80%B2%E6%93%8A%E7%9A%84%E5%B7%A8%E4%BA%BA|2013|05|20|16|391|14259683

zh|%E9%87%91%E9%99%B5%E5%8D%81%E4%B8%89%E9%92%97|2013|05|20|16|167|1894824

zh|File:Otto_Hahn_(Nobel).jpg|2013|05|20|16|217|2557464

zh|Special:Search|2013|05|20|16|229|683531

zh|Special:\xE9\x9A\x8F\xE6\x9C\xBA\xE9\xA1\xB5\xE9\x9D\xA2|2013|05|20|16|3099|2031107

zh|Wikipedia:%E9%A6%96%E9%A1%B5|2013|05|20|16|1530|29640480

The last step is to use the Linux wc (word count) command as shown below to see how many rows to expect when importing into HANA:

hana:/wiki-data # wc -l *.csv

   5177628 2013-03-000000.csv

   5109305 2013-03-000001.csv

   3173608 2013-03-000002.csv

   5176167 2013-04-000000.csv

   5080170 2013-04-000001.csv

   1500647 2013-04-000002.csv

   4755634 2013-05-000000.csv

   4868995 2013-05-000001.csv

   2169557 2013-05-000002.csv

  37011711 total


We can use the total number of lines to verify the total number of records imported into HANA.


Importing the delimited files into a staging table

It’s time now to import the delimited text files into a staging table. The reason I use the term “staging table” is that we will eventually want to enhance our data model so that we have meaningful values for the language, project and date values. The first step is to import the data as-is into the staging table and validate it. In the next blog post in this series, I’ll show how to enhance the data model by creating a fact table out of the staging table and adding the dimension tables.

I’m going to assume that you have already installed SAP HANA Studio and that you have connected it to your SAP HANA One database instance. If you haven’t done this before, follow the instructions documented in the topic “Configuring HANA One and Installing SAP HANA Studio”.

Go ahead and start SAP HANA Studio. Then right click on your system and select the SQL Console command as shown below.

Blog 311.png

This command opens up a new query window that you can use to execute HANA SQL statements.

You can copy and paste the following commands into the query window:

-- Create the schema for the Wikipedia page hit data and related tables

CREATE SCHEMA "WIKIDATA";

-- Create the COLUMN table for the staging data

CREATE COLUMN TABLE "WIKIDATA"."STAGE-PAGEHITS"

("PROJECTCODE" VARCHAR(50),

"PAGENAME" VARCHAR(2000),

"YEAR" VARCHAR(4),

"MONTH" VARCHAR(2),

"DAY" VARCHAR(2),

"HOUR" VARCHAR(2),

"PAGEHITCOUNTFORHOUR" BIGINT,

"BYTESDOWNLOADEDFORHOUR" BIGINT

);

Press the F8 key to execute the two statements.

To import the first file into the staging table, paste the following statement at the end of the window.

-- Import the first delimited text file

IMPORT FROM CSV FILE '/wiki-data/2013-03-000000.csv'

INTO "WIKIDATA"."STAGE-PAGEHITS"

WITH RECORD DELIMITED BY '\n'

FIELD DELIMITED BY '|'

ERROR LOG '/wiki-data/import-2013-03-000000.err';

Next, select the six lines as shown above and then press the F8 – Execute – command to run just the selected IMPORT statement. NOTE: I recommend that you always use the ERROR LOG clause for IMPORT FROM statements so that you can capture any errors that may have occurred. Check out my blog post titled “A simple rule to live by when using the SAP HANA IMPORT FROM command – Don’t forget the ERROR LOG clause” for the full story. We are about to find out why. 🙂

Go ahead and run the following select statement in the query editor to get the count of imported records.

SELECT COUNT(*) FROM "WIKIDATA"."STAGE-PAGEHITS";

The count should come back as 5,177,628, but the result from the query was 5,177,032! So what happened? Time to switch over to the SSH client and use the SFTP File Transfer plugin to copy the “/wiki-data/import-2013-03-000000.err” file onto your local computer’s C:\wiki-data directory as shown below.

Blog 312.png
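If you would rather skip the copy, you can also peek at the error log directly in the SSH session; the last few lines are usually enough to see what went wrong.

# Show the last few rejected records and their error messages
tail /wiki-data/import-2013-03-000000.err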

In SAP HANA Studio, use the File | Open File command to open the .err file.

Blog 313.png

If you look at the results, you will notice that we got multiple errors as shown below.

Blog 314.png

It turns out there are records with no value for the BYTESDOWNLOADEDFORHOUR column. The \N in the example above is the special Hive notation for a NULL value. We have a couple of options to deal with this problem. First, we can simply ignore these records. Second, we could go back to Hive and create two output results: one Hive INSERT statement would add the clause “AND bytesperhour IS NOT NULL” and the other would use “AND bytesperhour IS NULL”. The third option is to use a VARCHAR(50) to import the value as text and later convert it to a BIGINT value for the actual fact table. I’ll take the third approach, as I believe that no record should be left behind. 😉
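To give a sense of what that third option looks like later on, here is a sketch (not the final fact-table load, which comes in the next post) of converting the text column back to BIGINT while treating Hive’s \N marker as NULL:

-- Sketch only: treat Hive's \N marker as NULL and cast everything else to BIGINT
SELECT CASE WHEN "BYTESDOWNLOADEDFORHOUR" = '\N' THEN NULL
            ELSE TO_BIGINT("BYTESDOWNLOADEDFORHOUR")
       END AS "BYTESDOWNLOADEDFORHOUR"
FROM "WIKIDATA"."STAGE-PAGEHITS";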

Go back to the query window in HANA Studio and run the following drop table statement:

Blog 315.png
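In case you want to type it rather than read it off the screenshot, the drop statement amounts to:

-- Remove the staging table so it can be recreated with the revised column type
DROP TABLE "WIKIDATA"."STAGE-PAGEHITS";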

Notice that I placed the DROP statement before the CREATE TABLE statement. This is to avoid accidentally dropping a loaded table in the event I forget to select just the statement I want to execute in the query window. 😯

Now, change the CREATE TABLE statement to use VARCHAR(50) for the BYTESDOWNLOADEDFORHOUR column as shown below. Then, select the entire CREATE TABLE statement and press F8.
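For reference, the revised table definition should look like this:

-- Revised staging table: BYTESDOWNLOADEDFORHOUR as text so rows containing \N import cleanly
CREATE COLUMN TABLE "WIKIDATA"."STAGE-PAGEHITS"
("PROJECTCODE" VARCHAR(50),
"PAGENAME" VARCHAR(2000),
"YEAR" VARCHAR(4),
"MONTH" VARCHAR(2),
"DAY" VARCHAR(2),
"HOUR" VARCHAR(2),
"PAGEHITCOUNTFORHOUR" BIGINT,
"BYTESDOWNLOADEDFORHOUR" VARCHAR(50)
);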

Next, select the IMPORT FROM and the SELECT COUNT(*) commands and press F8.

Success! This time the count comes back as 5,177,628, matching the line count for the first file.

It’s now time to use the following commands to import the remaining 8 files:

-- Import the remaining files

IMPORT FROM CSV FILE '/wiki-data/2013-03-000001.csv' INTO "WIKIDATA"."STAGE-PAGEHITS" WITH RECORD DELIMITED BY '\n' FIELD DELIMITED BY '|' ERROR LOG '/wiki-data/import-2013-03-000001.err';

IMPORT FROM CSV FILE '/wiki-data/2013-03-000002.csv' INTO "WIKIDATA"."STAGE-PAGEHITS" WITH RECORD DELIMITED BY '\n' FIELD DELIMITED BY '|' ERROR LOG '/wiki-data/import-2013-03-000002.err';

IMPORT FROM CSV FILE '/wiki-data/2013-04-000000.csv' INTO "WIKIDATA"."STAGE-PAGEHITS" WITH RECORD DELIMITED BY '\n' FIELD DELIMITED BY '|' ERROR LOG '/wiki-data/import-2013-04-000000.err';

IMPORT FROM CSV FILE '/wiki-data/2013-04-000001.csv' INTO "WIKIDATA"."STAGE-PAGEHITS" WITH RECORD DELIMITED BY '\n' FIELD DELIMITED BY '|' ERROR LOG '/wiki-data/import-2013-04-000001.err';

IMPORT FROM CSV FILE '/wiki-data/2013-04-000002.csv' INTO "WIKIDATA"."STAGE-PAGEHITS" WITH RECORD DELIMITED BY '\n' FIELD DELIMITED BY '|' ERROR LOG '/wiki-data/import-2013-04-000002.err';

IMPORT FROM CSV FILE '/wiki-data/2013-05-000000.csv' INTO "WIKIDATA"."STAGE-PAGEHITS" WITH RECORD DELIMITED BY '\n' FIELD DELIMITED BY '|' ERROR LOG '/wiki-data/import-2013-05-000000.err';

IMPORT FROM CSV FILE '/wiki-data/2013-05-000001.csv' INTO "WIKIDATA"."STAGE-PAGEHITS" WITH RECORD DELIMITED BY '\n' FIELD DELIMITED BY '|' ERROR LOG '/wiki-data/import-2013-05-000001.err';

IMPORT FROM CSV FILE '/wiki-data/2013-05-000002.csv' INTO "WIKIDATA"."STAGE-PAGEHITS" WITH RECORD DELIMITED BY '\n' FIELD DELIMITED BY '|' ERROR LOG '/wiki-data/import-2013-05-000002.err';

SELECT COUNT(*) FROM "WIKIDATA"."STAGE-PAGEHITS";

Success again! The total COUNT(*) value was 37,011,711 and this matches the number of lines for all 9 CSV files.

Blog 317.png

You can also see below that each of the .err files has a length of zero, indicating no errors.

Blog 318.png
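You can confirm the same thing from the SSH session by listing the error logs and checking their sizes:

# Every .err file should show a size of 0 bytes if all rows imported
ls -l /wiki-data/*.err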

Exploring the data in HANA Studio

The Open Data Preview command in HANA Studio provides a great way to make sure the data is consistent with how you think it should look. To launch the command, go to the navigation pane, expand the WIKIDATA schema, navigate to Tables, right click on the STAGE-PAGEHITS table and choose the Open Data Preview command as shown below.

Blog 319.png

HANA Studio displays the first 200 rows in the table in no particular order. Here are just a few of the rows you might see.

Blog 320.png

You can use the Analysis tab to do basic charting of data in the table. Let’s say we want to see how many rows there are per month. To try it out, drag the MONTH column into the Label axis region and then drag the DAY column into the Values axis. You should see something like this.

Blog 321.png

Now, to see the page hits per month, click on the little X next to DAY (Count) to remove the field and then drag the PAGEHITCOUNTFORHOUR field into the Values axis.

Blog 322.png
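If you prefer to double-check the chart with plain SQL, an equivalent aggregate query against the staging table would be something like this:

-- Total page hits per month across the staging table
SELECT "YEAR", "MONTH", SUM("PAGEHITCOUNTFORHOUR") AS "TOTALPAGEHITS"
FROM "WIKIDATA"."STAGE-PAGEHITS"
GROUP BY "YEAR", "MONTH"
ORDER BY "YEAR", "MONTH";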

It’s very interesting that March has almost twice the page hits as April and May.  I’ll need to do some digging around in the next blog post to see if this is a data error or just a lot of people wanting to research their favorite sports team in the month of March. 🙂

At this point, we have a staging table created. In the next blog post for this series, I will show how to enhance the data model to create a page hit fact table and add dimension tables for dates, project codes and language codes.

I hope you have enjoyed the series so far.


10 Comments


  1. Bill Ramos Post author

    There is an interesting optimization that I can make by making the files public using a URL in AWS like this: http://wikipedia-pagecounts-hive-results.s3.amazonaws.com/year=2013/month=03/000000

    I can use the Linux wget command in my SAP HANA One SSH client to download the files – bypassing the SFTP console. For example:

    wget http://wikipedia-pagecounts-hive-results.s3.amazonaws.com/year=2013/month=03/000000 --output-document=2013-03-000000

    This command downloads the file and saves the result to the file specified by the --output-document=<file name> option.

    Best of all, because this is all in the same data center, I got a smoking transfer rate of 31.7 MB/second. 😎 Even better – no need to download the files to my local hard drive with the S3 console and then upload them again! 😀

  2. Aron MacDonald

    Hi Bill,

    Excellent work and nice easy to follow steps.

    It’s a bit more expensive than your option, but you could alternatively have run a very small AWS Hadoop cluster (2 x m1.medium) at $5.76 per day to keep the cluster alive.

    If you are using Cloudera Manager, then you can also temporarily spin up AWS instances (nodes) in the cluster (similar to EMR MapReduce), depending on how fast you need the results via Hive. E.g., temporarily add 8 m1.xlarge EC2 instances and then decommission them after usage.

    You can then ultimately terminate the 2 medium instances when you have no further need of them.

    http://blog.cloudera.com/blog/2013/03/how-to-create-a-cdh-cluster-on-amazon-ec2-via-cloudera-manager/

    It’s definitely more work to set up and operate than Amazon EMR, but you would be able to remove the need for the separate Ubuntu server and the S3 bucket.

    In addition, you would also have the option to use Impala SQL (instead of Hive SQL) to return your summarized query results 10+ times faster than Hive.

    Running your own Hadoop cluster also provides the option of automating the entire data collection process, from data download right through to HANA imports (e.g., if your Hadoop cluster can access your HANA box with the appropriate security and network settings), without the need for BODS, though it would require a custom build.


    For ease of use and cost effectiveness, though, your example is spot on.

    All the best

    Aron

  3. Divya Mahajan

    Hi Bill,

    Here is a slightly different method to transfer files from S3 that uses the s3cmd S3 utilities for Linux.

    Step 1: Install the S3 utilities

       Log in as root to your HANA One or HANA developer instance. Execute the following commands in your PuTTY / SSH session:

       zypper addrepo http://s3tools.org/repo/SLE_11/s3tools.repo

       zypper install s3cmd

    Step 2:  Configure S3cmd. You need your Amazon account Access Key and Secret Key

       s3cmd --configure

    Step 3: Transfer the files

      mkdir /sap/hanadata

         cd /sap/hanadata

         s3cmd get s3://wikipedia-pagecounts-hive-results/year=2013/month=03/2013-03-00000.csv .

    The transfer rates with this approach are very fast.

    1. Bill Ramos Post author

      Hi Divya,

          Great suggestion! I definitely need to check this out for secure – private data transfers. I have to admit – using wget is really easy and fast too. Secretly, I really need to spend a little more time trying to get s3fs working in the HANA One box. I really want to use S3 as my backup location instead of using EBS – much cheaper for one. The other reason is that the backups would be available for restoration on other HANA One instances 😉 .

      Thanks again for the tip!

      Bill

  4. Naveen Byreddy

    Hi Bill.

    Is it possible to get the original Wikipedia files and the logic used to produce the Hive output files, so that it becomes an end-to-end solution from Hadoop to HANA analysis? At the moment we only have the Hive results output file; it would be helpful to get the original files and see how the logic was implemented to produce the Hive results. We need that for a demo POC we are doing. Thanks so much.

    thanks

      1. Naveen Byreddy

        HI Bill,

        Thanks so much for your reply. Now we are having an issue when we follow the Ubuntu steps in the classic wizard to connect with PuTTY: the IP or public DNS name is not accepted. My computer is on the SAP network, and I am getting a connection timeout error. Do you have any suggestions?

        thanks

        Naveen.

  5. Naveen Byreddy

    Thanks so much, Bill. We are trying that. I am at the EC2 Ubuntu step as in the document, but a firewall issue is coming up and my computer is not connecting to AWS using PuTTY. Any idea how to solve the firewall issue?

    thanks,

    Naveen.

    1. Bill Ramos Post author

      Hi Naveen,

      You should check on AWS to see if the security group has the required ports open. I believe you need port 22 open to connect. I use the default security group and that seems to work.

      Regards,

      Bill

