Back at the end of April, I posted a blog titled “SAP HANA – Hot N Cold – HANA for Hot Data - Hadoop for Cold Data”. In my quest to find a “Big Data” dataset, I discovered that the Wikimedia Foundation – the group that brings the world Wikipedia – maintains hourly page view statistics for all of its projects. Anyone can access this data at http://dumps.wikimedia.org/other/pagecounts-raw/.

UPDATE - October 23 2013 - You can now run the scenario for free at http://bit.ly/TryHANAHadoop starting at my 3rd blog post in this series. Check out http://scn.sap.com/community/hana-in-memory/blog/2013/10/23/try-out-sap-hana-in-a-big-data-scenario for the details.

Legal notice: Page view statistics for Wikimedia projects provided by the Wikimedia Foundation, Inc. at http://dumps.wikimedia.org/other/pagecounts-raw/ under the Creative Commons Attribution-ShareAlike 3.0 Unported License (see http://creativecommons.org/licenses/by-sa/3.0/). The original statistics have been filtered, sorted and visualized.

Each hourly file is a gzip-compressed, space-delimited text file containing the language.project code, the page name, the number of hits for that hour, and the number of bytes returned. To give you an idea of the scope of the data, for the three months of March, April and May of 2013 there were 2,214 files totaling 200.8 GB of compressed data. Uncompressed, the data comes to 745.2 GB, representing 16,671,880,003 records. Clearly, we are getting into the realm of Big Data in terms of volume. This is also more data than AWS SAP HANA One can handle out of the box, as its data drive is only 190 GB.
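
If you want a feel for the raw records, you can peek inside one of the hourly files once you have it locally. The file name and the values below are only illustrative, but the four space-delimited fields follow the layout just described:

zcat pagecounts-20130401-000000.gz | head -3

A line looks roughly like "en Main_Page 123456 7890123456", that is: project code, page name, hits for the hour, bytes returned (the numbers here are made up).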

In this blog series, I will document the steps and gotchas that I went through so that I could use SAP HANA One and Lumira to analyze the data to uncover some interesting observations about how people use Wikipedia. At a high level, these steps include:

  1. Copying the data from Wikipedia to an AWS S3 bucket – this blog post
  2. Using AWS Elastic MapReduce (EMR) and Hive to reduce the data to a reasonable working set for SAP HANA One
  3. Loading the data into SAP HANA One column table for use as a staging table
  4. Creating a star-schema in HANA with dimension tables for the project, language and date information
  5. Building an Analytic & Calculation View from the Wikipedia HANA tables for use with SAP Lumira
  6. Analyzing Wikipedia Data with SAP Lumira

Here is a high-level view of the technologies that I will present in this series. This first blog covers the left side of the diagram where I will show how to copy data from Wikipedia into an AWS S3 bucket.


Background for the project

Back in March 2013, Sandy Wang from The Globe and Mail did a webcast titled “SAP HANA One – A Customer Story: Merging HANA, Pig, Hadoop & the Cloud” that inspired the scenario I’m going to run through. Sandy used SAP HANA One to analyze clickstream data from Adobe Omniture. The Globe and Mail, like many former print media companies, is transitioning to digital media, and its goal was to determine the optimal balance between free news stories that pull in audiences for advertising and subscription-based access. Clickstream data is also very private, since it involves individual user access, and is most often proprietary. Wikipedia does not track clickstream data, or at least does not make it available, so we are using page hit data as a publicly available proxy to reconstruct this exercise.

We now have a “business case”, the data, and the tools to make sense of Wikipedia’s page hit data, so let’s begin the adventure.

Loading Wikipedia data into S3 storage

If you want to follow along at home, you will need an AWS account with the following services available:

  1. S3 storage that will be used to hold the Wikipedia data for processing.
  2. An EC2 instance that will be used to copy the data from Wikipedia to S3 storage.

AWS EMR makes it easy to process uncompressed or gzip-compressed files with Hive, but the data needs to be in S3 storage first. In addition, Hive can process the data using partitions, which will let us associate a year, month, day and hour with each page hit; the trick is putting the files into a directory structure that accommodates partitioning. The steps below show how to copy the data and place it into the correct directory structure so that Hive can process it.
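
To make the target layout concrete, the hourly file for midnight UTC on April 1, 2013 needs to end up under a key such as the one below (the bucket name matches the one created in the next step; the layout is what matters):

wikipedia.pagecounts/year=2013/month=04/day=01/hour=00/pagecounts-20130401-000000.gz

Hive can then treat year, month, day and hour as partition columns that map directly onto those directory names.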

Task: Create an S3 bucket to store Wikipedia pagecount files

Sign into your AWS account

Go to Services->S3

Click on ‘Create Bucket’

Enter name ‘wikipedia.pagecounts’, choose region ‘US-Standard’

Note: I already created an S3 bucket with this name, so you will need to pick a different one. I chose the US Standard region because I wanted to keep the data close to N. Virginia, which is where the SAP HANA One instance I’m using is located.

Click on Create

Task: Create an EC2 Ubuntu instance for automating the copy operation

Switch to Services-> EC2

Click on ‘Instances’ and ‘Launch Instance’

Follow the steps in the classic wizard and select the Ubuntu Server 13.04 to launch

Note: I’m using an Ubuntu instance because it seems to be the most reliable for the s3fs program that I used to make an S3 bucket look like a Linux drive. I’m also using a different server from the SAP HANA One instance as I don’t want to potentially interrupt the HANA database with non-certified software.

Once the instance is launched, right-click on the instance name and click on ‘Start’

Right click and select ‘Connect’

Enter private key path  (.pem file location)

Note: You can use the same .pem file used for your SAP HANA One instance – assuming both machines are in the same AWS data center.
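
If you prefer a local terminal over the browser-based client, the connection looks something like the following (the key path and host name are placeholders you need to fill in; “ubuntu” is the default user on Ubuntu AMIs):

chmod 400 /path/to/your-key.pem

ssh -i /path/to/your-key.pem ubuntu@<public-dns-of-your-instance>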

Task: Enter the commands in the Ubuntu VM window to build the s3fs tool (hosted on Google Code) and mount the S3 bucket as a drive on the Ubuntu machine

Note: You can copy/paste the commands shown below into the SSH client

sudo -s

Runs the commands as root.

mkdir /mnt/wikipedia

Prepares the mount point for the S3 storage.

apt-get install gcc

Installs the GNU C compiler; the C++ compiler and other build tools needed for s3fs come with the build-essential package installed below.

apt-get install make

Downloads the make command for building the s3fs program.

wget http://s3fs.googlecode.com/files/s3fs-1.67.tar.gz

Downloads s3fs, an open source utility that creates a FUSE-based file system backed by S3 storage.

tar zxf s3fs-1.67.tar.gz

Decompresses the source files.

cd s3fs-1.67

Go to the source directory to start the build process.

sudo apt-get install build-essential libxml2-dev libfuse-dev libcurl4-openssl-dev

Installs the additional development libraries (libxml2, FUSE and libcurl/OpenSSL) needed to build s3fs.

./configure

Runs the configuration step for building the s3fs code.

make

Builds the program.

make install

Installs the compiled s3fs program.

touch /etc/passwd-s3fs && chmod 640 /etc/passwd-s3fs &&  echo 'AccessKey:SecretKey' > /etc/passwd-s3fs

This adds your S3 AccessKey and SecretKey (replace the placeholders with your own values) into the Linux /etc/passwd-s3fs file so that s3fs knows how to connect to the S3 bucket. You can get this information from the AWS security credentials page for your account.

s3fs wikipedia.pagecounts /mnt/wikipedia

This command hooks up the S3 bucket to the mount point directory.

cd /mnt/wikipedia

Navigate to the mount point to start working with the S3 storage.
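
Before kicking off a large download, a quick sanity check that the bucket is really mounted doesn’t hurt (a rough sketch; the test file name is arbitrary):

df -h /mnt/wikipedia

touch s3fs-test && ls -l s3fs-test && rm s3fs-test

If the touch/ls/rm round trip works, s3fs is talking to the bucket and you can start copying.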

Task: Enter the command in the Ubuntu VM window to start downloading the pagecount files for the month of April 2013 from Wikipedia and uploading them to the S3 bucket

nohup wget -r -np -nc -nd http://dumps.wikimedia.org/other/pagecounts-raw/2013/2013-04/ &

This command uses the Linux wget command to pull down the Wikipedia page hits data for April 2013 and put the files in the current directory, which is the mounted S3 bucket, so they land directly in S3 storage. Just change the year and month values in the URL for the month that you want to copy over.
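
Because the download runs unattended under nohup, it is worth verifying it afterwards. The month’s directory listing includes an md5sums.txt file that wget pulls down along with the data, so a rough check (a sketch only; it re-reads every file through the s3fs mount, which is slow) would be:

md5sum -c md5sums.txt 2>/dev/null | grep -v 'OK$'

Any file names that show up in the output did not check out and should be fetched again.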

Task: Create a shell script that builds a year/month/day/hour directory structure and moves each pagecounts file into its location based on the filename

Create a file for the script below using the command

nano createfolders.sh

Copy the following code

for i in pagecount*
do
    year=${i#pagecounts-}
    year=${year%????-??????.gz}
    month=${i#pagecounts-????}
    month=${month%??-??????.gz}
    day=${i#pagecounts-??????}
    day=${day%-??????.gz}
    hour=${i#pagecounts-????????-}
    hour=${hour%????.gz}
    mkdir -p "year=$year/month=$month/day=$day/hour=$hour"
    mv "$i" "year=$year/month=$month/day=$day/hour=$hour"
done

This shell script parses out the year, month, day and hour values from each of the pagecounts file names and then moves the files into a directory structure that is compatible with Hive partitioning. For example, the file for midnight on April 1, 2013 goes into the directory named year=2013/month=04/day=01/hour=00.

Save (Ctrl+X, Y, Enter) and run it by typing

sh createfolders.sh
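
A quick way to confirm the script did what you expect (a sketch; adjust the year to whichever month you pulled down): the partition tree should contain one .gz file per hour, so a full month like April 2013 works out to 30 days x 24 hours = 720 files.

find year=2013 -type d | head

find year=2013 -name 'pagecounts-*.gz' | wc -l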

Task: Remove all the files that are not pagecounts files

rm projectcounts*

rm index.html

rm createfolders.sh

rm nohup.out

rm md5sums.txt

These are other files that should be deleted, assuming everything worked as expected. Note that the projectcounts files have a structure similar to the pagecounts files and, if not deleted, could inflate the page hit and byte values when the Hive job runs.

Task: Optionally, unmount the drive using the following commands

cd ..

umount /mnt/wikipedia

Close the window and stop the instance by right-clicking on the instance name and selecting the Stop command so that compute billing stops. If you don’t need the instance anymore, terminate it to avoid the remaining storage charges.

With these basics, you could operationalize the process to copy down the files on an hourly basis, then use Hive to create smaller hourly subsets of the data to eventually process with SAP HANA One. That’s it for this blog post.
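
As a rough sketch of what that hourly job could look like (entirely hypothetical and untested: it assumes the s3fs mount from earlier is active, GNU date is available, and the published file names start with pagecounts-YYYYMMDD-HH even though the seconds portion varies):

#!/bin/bash
# Hypothetical hourly fetch (sketch): pull the previous hour's pagecounts file
# and drop it straight into the Hive-friendly partition path on the S3 mount.
cd /mnt/wikipedia || exit 1

# Previous full hour in UTC (the dumps are stamped in UTC).
prev=$(date -u -d '1 hour ago' +%Y%m%d%H)
year=${prev:0:4}; month=${prev:4:2}; day=${prev:6:2}; hour=${prev:8:2}

mkdir -p "year=$year/month=$month/day=$day/hour=$hour"
cd "year=$year/month=$month/day=$day/hour=$hour"

# The seconds in the published file name vary, so let wget's accept pattern
# match it instead of guessing the exact name.
wget -r -np -nd -nc -A "pagecounts-$year$month$day-$hour*" \
  "http://dumps.wikimedia.org/other/pagecounts-raw/$year/$year-$month/"

Dropped into cron, a script along these lines would keep the partition tree growing hour by hour, ready for the Hive step in the next post.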

In the next blog post of this series, I’ll show how you can use Hive to extract out a subset of the data for loading into SAP HANA One.
