Importing Wikipedia Hive data into SAP HANA One
Welcome to part 3 in this series of demonstrating how to analyze Wikipedia page hit data using Hadoop with SAP HANA One and Hive. To see how we got here, check out the first blog post in the series at “Using SAP HANA to analyze Wikipedia data – Preparing the Data”. There are numerous ways to get data from AWS Elastic MapReduce (Hadoop) and Hive into Sap HANA One. SAP put together a video on using SAP BusinessObjects DataServices to import data from Hadoop into SAP HANA at http://www.youtube.com/watch?v=ls_MGp8R7Yk. The challenge with this approach is that Data Services is not easily available for AWS SAP HANA One users. Another way to import data from an active Hadoop cluster is through Apache Sqoop via a JDBC connection to your SAP HANA One database. With Sqoop, the operative word here is having an “active” Hadoop cluster. This means you need to keep your ASW EMR cluster active to run Sqoop. If you are going to keep an EMR cluster with nine m1.xlarge EC2 instances running at 48 cents per hour per instance, you are looking at approximately $100 per day to keep the cluster alive. The beauty of using AWS EMR with S3 storage is that you can access the data in the S3 bucket after terminating the EMR cluster.
In part 2 of this series “Reducing the amount of working data for SAP HANA to process using Amazon Elastic MapReduce and Hive”, I showed how you can use Hive to come up with a reasonable working set of data for SAP HANA One to process. The result was the creation of three files for each month stored in the “s3://wikipedia-pagecounts-hive-results/” bucket in the directories named “year=2013/month=03”, “year=2013/month=04” and “year=2013/month=05”.
In this blog post, I will show how to import the data generated into Hive using just the resources available to you on AWS and SAP HANA One using the components highlighted in the diagram below.
Note: This is the first time that I’m actually going to be using SAP HANA One. If you want to follow along, you will need an active SAP HANA One database up and running. You can get started by going to http://www.saphana.com/community/try and following the steps in the “Try SAP HANA One” box. I’m also assuming that you have downloaded SAP HANA Studio to either your local computer or to an EC2 instance on AWS. For instructions on configuring SAP HANA Studio, check out http://www.saphana.com/docs/DOC-2438. I prefer using a Windows Server 2012 Server EC2 instance that is at least an m1.xlarge instance. The reason that I like the xlarge instances is that they use a high network bandwidth connection that makes data operations within the same AWS data center fast – much faster than using a local machine and then having to move data locally and then back to AWS for processing.
Let’s get started!
Copy the S3 files to your SAP HANA One Linux server
If you think you are going to do lots of work with AWS S3 storage, I recommend you get one of the many available tools that are out there for working with S3 storage. The one I use is the S3 Browser by NetSDK Software. They have a freeware edition available at http://s3browser.com/. NetSDK also has a free trial of TntDrive that allows you to map an S3 bucket as a network drive at http://tntdrive.com/. If you have experience with other S3 browsers, please share your suggestions as a comment to the blog.
I’m going to provide instructions on how to download the files using the AWS S3 console.
Making a local copy of the Hive files to your computer
First, create a local directory that you will use to download the nine Hive files. I’m going to use C:\wiki-data for this example.
Sign into your AWS account and then navigate to Services -> S3. If you followed along, you will navigate to the unique S3 bucket name you created in part 2. For me, I’m using the “s3://wikipedia-pagecounts-hive-results/” bucket that I created in part 2. Then, navigate to the year=2013/month=03/ directory as shown below.
Next, select the first file named 000000 and then right click on the file and select the download command.
AWS displays a message box to download the file. You need to right click on the Download link and choose the Save link as… command in your browser.
Next, navigate to your C:\wiki-data directory and save the file with the name 2013-13-000000 without a file extension. Once the download is complete, click the OK button to dismiss the message box.
You then need to repeat this operation for the 000001 and 000002 files in the month=03\ folder. Then, do the same for the three files for the month=04\ and month=05\ directories.
NOTE: As a public service and for a limited time, I made the Hive files available for download using the following URLs:
Just click on the link to download the file and use the Save As command to save the files into your C:\wiki-data directory.
Copy the local files to your SAP HANA One instance
If you haven’t started you SAP HANA One instance, now would be a good time. J Navigate to your EC2 console, right click on your SAP HANA One instance and choose the Connect command.
For the user name, use root. For the Private key path, enter in the path name for your pem key file. Then, click on the Launch SSH Client button.
In the SSH client, create a directory called /wiki-data and then grant rights to the directory so that the HAHA instance can access it with the following Linux commands:
hana:/ # mkdir /wiki-data
hana:/ # chmod 777 wiki-data
In the AWS SSH client, go to the Plugins menu and choose the SFTP File Transfer … command.
On the left side of the dialog, navigate to your C:\wiki-data directory. Then, on the right side, click on the ChDir button, type in /wiki-data for the path and click OK. The dialog should look like the one shown below.
Then, multi-select all of the files on the left site and click the –> button in the middle of the dialog to start the transfer.
You should see the following dialog that shows the progress of the copy operation.
Again, this goes very fast if you used an EC2 instance on S3 as your client. You can see in the dialog above, that I got an upload speed of 509 kb/sec.
When using my xlarge EC2 instance with high network bandwidth, you can see the transfer rate was 37 MB/sec! Yes – 72 times faster!
Once your copy operation is complete, click the Close button to dismiss the SFTP dialog.
NOTE: I decided to copy the data to the Linux system like this for a couple of reasons. The first one is that SAP HANA Studio’s import command supports a limited number of characters like comma, colon and semi-colon as the delimited value for columns. Second, I want to show how you can use the IMPORT command to load data instead of using the HANA Studio import wizard.
Replacing the ^A field delimiters with a pipe “|” delimiter for importing into HANA
The IMPORT FROM command makes it easy to import text-delimited files with the CSV FILE option, but there is a catch. You can’t specify a non-printable character using the FIELD DELIMITED BY clause. Since the Hive output files use the non-printable ^A character, we need a way to transform the character into a printable character that doesn’t conflict with the Wikipedia data. I used the Linux grep command to verify that there are no pipe characters in the Hive files.
In order to convert the field delimited values, we are going to use the Linux sed command. Here is what the command looks like for performing the substitution:
sed ‘s/\x01/|/g’ 2013-03-000000 > 2013-03-000000.csv
The s/ indicates the string expression to replace. In this case, the Linux represents the Ctrl-A character as \x01. The /| tells sed to replace the expression with the “|” character. The /g parameter tells sed to replace all instances in the stream. The last two parameters are input file and the output file. If you don’t specify the output file, sed makes the replacement within the input file.
To make the replacement, first change directory in the SSH client using the cd command below.
hana:~ # cd /wiki-data
You can then check to see that all the files are in place by using the ls command below.
hana:/wiki-data # ls -l
-rw-r–r– 1 root root 256709072 Jun 30 21:34 2013-03-000000
-rw-r–r– 1 root root 256564106 Jun 30 21:34 2013-03-000001
-rw-r–r– 1 root root 155604882 Jun 30 21:34 2013-03-000002
-rw-r–r– 1 root root 256407589 Jun 30 21:34 2013-04-000000
-rw-r–r– 1 root root 256251290 Jun 30 21:34 2013-04-000001
-rw-r–r– 1 root root 36089431 Jun 30 21:34 2013-04-000002
-rw-r–r– 1 root root 256520276 Jun 30 21:34 2013-05-000000
-rw-r–r– 1 root root 256179168 Jun 30 21:34 2013-05-000001
-rw-r–r– 1 root root 109761031 Jun 30 21:34 2013-05-000002
Now, you are ready to issue the following nine sed commands:
sed ‘s/\x01/|/g’ 2013-03-000000 > 2013-03-000000.csv
sed ‘s/\x01/|/g’ 2013-03-000001 > 2013-03-000001.csv
sed ‘s/\x01/|/g’ 2013-03-000002 > 2013-03-000002.csv
sed ‘s/\x01/|/g’ 2013-04-000000 > 2013-04-000000.csv
sed ‘s/\x01/|/g’ 2013-04-000001 > 2013-04-000001.csv
sed ‘s/\x01/|/g’ 2013-04-000002 > 2013-04-000002.csv
sed ‘s/\x01/|/g’ 2013-05-000000 > 2013-05-000000.csv
sed ‘s/\x01/|/g’ 2013-05-000001 > 2013-05-000001.csv
sed ‘s/\x01/|/g’ 2013-05-000002 > 2013-05-000002.csv
As a quick verification, you can use the tail command as shown below to see the last few lines of the file.
hana:/wiki-data # tail 2013-05-000002.csv
The last step we want to perform is to use the following Linux wc – word count – command to see how many rows to expect for importing into HANA:
hana:/wiki-data # wc -l *.csv
We can use the total number of lines to verify the total number of records imported into HANA.
Importing the delimited files into a staging table
It’s time now to import the delimited text files into a staging table. The reason that I use the term “staging table” is because we will eventually want to enhance our data mode so that we have meaningful values for language, project and the date values. The first step is to import the data as is into the staging table to validate the data. In the next blog for this series, I’ll show how to enhance the data model by creating a fact table out of the staging table and add the dimension tables.
I’m going to assume that you have already installed SAP HANA Studio and that you have connected it to your SAP HANA One database instance. If you haven’t done this before, follow the instructions in documented in the topic “Configuring HANA One and Installing SAP HANA Studio”.
Go ahead and start SAP HANA Studio. Then right click on your system and select the SQL Console command as shown below.
This command opens up a new query window that you can use to execute HANA SQL statements.
You can copy and paste the following commands into the query window:
— Create the schema for the Wikipedia page hit data and related tables
CREATE SCHEMA “WIKIDATA”;
— Create the COLUMN table for the staging data
CREATE COLUMN TABLE “WIKIDATA”.“STAGE-PAGEHITS”
Press the F8 key to execute the two statements.
To import the first file into the staging table, paste the following statement at the end of the window.
— Import the first delimited text file
IMPORT FROM CSV FILE ‘/wiki-data/2013-03-000000.csv’
WITH RECORD DELIMITED BY ‘\n’
FIELD DELIMITED BY ‘|’
ERROR LOG ‘/wiki-data/import-2013-03-000000.err’;
Next, select the six lines as shown above and then press the F8 – Execute – command to run just the selected IMPORT statement. NOTE: I recommend that you always use the ERROR LOG clause for IMPORT FROM statements so that you can capture any errors that may have occurred. Check out my blog post titled “A simple rule to live by when using the SAP HANA IMPORT FROM command – Don’t forget the ERROR LOG clause” for the full story. We are about to find out why. 🙂
Go ahead and run the following select statement in the query editor to get the count of imported records.
SELECT COUNT(*) FROM “WIKIDATA”.“STAGE-PAGEHITS”;
You should come back with a value of 5,177,628, but the result from the query was 5,177,032! You should ask – what happened? Time to switch over to the SSH client and use the SFTP File Transfer plugin to copy the “/wiki-data/import-2013-03-000000.err” file onto your local computer’s c:\wiki-data directory as shown below.
In SAP HANA Studio, File | Open File command to open up .err file.
If you look at the results, you will notice that we got multiple errors as shown below.
It turns out there are records with no value for the BTYESDOWNLOADEDFORHOUR column. The \N in the example above is the special Hive notation for a NULL value. We have a couple of options to deal with this problem. First, we can simply ignore these records. Second, we could go back to Hive and create two output results. One Hive INSERT statement would add the clause “AND bytesperhour IS NOT NULL” and the other would have “AND bytesperhour IS NULL”. The third option is to use a VARCHAR(50) to import the value as text and later convert the value to BIGINT value for the actual fact table. I’ll take the third approach as I believe that no record should be left behind. 😉
Go back to the query window in HANA Studio and run the following drop table statement:
Notice that I placed the command before the CREATE table statement. This is to prevent dropping the table in the event I forget to select the statement to execute in the query window. 😯
Now, change the CREATE TABLE statement to use VARCHAR(50) for the BYTESDOWNLOADEDFORHOUR column. Then, select the entire CREATE TABLE statement and press F8.
Next, select the IMPORT FROM and the SELECT COUNT(*) commands and press F8.
Success! You should now see the correct result of records as shown below.
It’s now time to use the following commands to import the remaining 8 files:
— Import the remaining files
IMPORT FROM CSV FILE ‘/wiki-data/2013-03-000001.csv’ INTO “WIKIDATA”.“STAGE-PAGEHITS” WITH RECORD DELIMITED BY ‘\n’ FIELD DELIMITED BY ‘|’ ERROR LOG ‘/wiki-data/import-2013-03-000001.err’;
IMPORT FROM CSV FILE ‘/wiki-data/2013-03-000002.csv’ INTO “WIKIDATA”.“STAGE-PAGEHITS” WITH RECORD DELIMITED BY ‘\n’ FIELD DELIMITED BY ‘|’ ERROR LOG ‘/wiki-data/import-2013-03-000002.err’;
IMPORT FROM CSV FILE ‘/wiki-data/2013-04-000000.csv’ INTO “WIKIDATA”.“STAGE-PAGEHITS” WITH RECORD DELIMITED BY ‘\n’ FIELD DELIMITED BY ‘|’ ERROR LOG ‘/wiki-data/import-2013-04-000000.err’;
IMPORT FROM CSV FILE ‘/wiki-data/2013-04-000001.csv’ INTO “WIKIDATA”.“STAGE-PAGEHITS” WITH RECORD DELIMITED BY ‘\n’ FIELD DELIMITED BY ‘|’ ERROR LOG ‘/wiki-data/import-2013-04-000001.err’;
IMPORT FROM CSV FILE ‘/wiki-data/2013-04-000002.csv’ INTO “WIKIDATA”.“STAGE-PAGEHITS” WITH RECORD DELIMITED BY ‘\n’ FIELD DELIMITED BY ‘|’ ERROR LOG ‘/wiki-data/import-2013-04-000002.err’;
IMPORT FROM CSV FILE ‘/wiki-data/2013-05-000000.csv’ INTO “WIKIDATA”.“STAGE-PAGEHITS” WITH RECORD DELIMITED BY ‘\n’ FIELD DELIMITED BY ‘|’ ERROR LOG ‘/wiki-data/import-2013-05-000000.err’;
IMPORT FROM CSV FILE ‘/wiki-data/2013-05-000001.csv’ INTO “WIKIDATA”.“STAGE-PAGEHITS” WITH RECORD DELIMITED BY ‘\n’ FIELD DELIMITED BY ‘|’ ERROR LOG ‘/wiki-data/import-2013-05-000001.err’;
IMPORT FROM CSV FILE ‘/wiki-data/2013-05-000002.csv’ INTO “WIKIDATA”.“STAGE-PAGEHITS” WITH RECORD DELIMITED BY ‘\n’ FIELD DELIMITED BY ‘|’ ERROR LOG ‘/wiki-data/import-2013-05-000002.err’;
SELECT COUNT(*) FROM “WIKIDATA”.“STAGE-PAGEHITS”;
Success again! The total COUNT(*) value was 37,011,711 and this matches the number of lines for all 9 CSV files.
You can also see below, that each of the .err files has a zero length indicating no errors.
Exploring the data in HANA Studio
The Open Data Preview command in HANA Studio provides a great way to make sure the data is consistent with how you think it should look. To launch the command, go to the navigation pane, expand out the WIKIDATA schema, navigate to Tables, right click on the STAGE-PAGEHITS table and choose the Open Data Preview Command as shown below.
HANA Studio displays the first 200 rows in the table in no particular order. Here are just a few of the rows you might see.
You can use the Analysis tab to do basic charting of data in the table. Let’s say we want to see how many rows there are per month. To try it out, drag the MONTH column into the Label axis region and then drag the DAY column into the Values axis. You should see something like this.
Now, to see the page hits per month, click on the little X next to DAY (Count) to remove the field and then drag the PAGEHITCOUNTFORHOUR field into the Values axis.
It’s very interesting that March has almost twice the page hits as April and May. I’ll need to do some digging around in the next blog post to see if this is a data error or just a lot of people wanting to research their favorite sports team in the month of March. 🙂
At this point, we have a staging table created. In the next blog post for this series, I will show how to enhance the data model to create a page hit fact table and add dimension tables for dates, project codes and language codes.
I hope you have enjoyed the series so far.