Build your own Wikipedia Keynote Part 1 – Build and load data
So by now you may have seen the Wikipedia page counts model that we built for the keynote. I’ll blog later on about the logistical challenges of getting 30TB of flat files and building a system to load it in 10 days, but since SAP TechEd & d-code is a conference with a load of developers, I’m going to spend some time showing you how to build your own.
The beauty of SAP HANA is that you can build this whole model inside the platform using one tool – HANA Studio.
Background
The background is that Wikipedia publishes a list of Page view statistics for Wikimedia projects, which are consolidated page views per hour by title and by project. These go back to 2007 and now total nearly 250bn rows. It's a fascinating dataset because the cardinality is really rough on databases: there are over 4m articles in the English Wikipedia alone, and far more across all projects.
The total dataset is around 30TB of flat files, which translates into around 6TB of HANA database memory. The SAP HANA Developer Edition is a 60GB Amazon EC2 cloud system, of which around 30GB of RAM is comfortably available for data, so you can fit roughly 0.5% of the overall dataset. That means we can comfortably fit around 3 weeks of data. This should be enough for you to have some fun!
So how do you get started? Trust me, it's very easy to do!
Step 1 – get SAP HANA Developer Edition
N.B. This is free of charge from SAP, but you will have to pay Amazon EC2 fees. Be mindful of this and turn off the system when you’re not using it.
It takes a few minutes to set up, because you have to configure your AWS account to receive the HANA Developer Edition AMI, but Craig Cmehil and Thomas Grassl have done a great job of making this easy, so please go ahead and configure the HANA Developer Edition!
You can of course use an on-premise version or any other HANA instance, though our scripts do assume that you have internet access; if your system doesn't, you will need to adapt them. That's part of the fun, right?
Step 2 – Create and build the model
For the purposes of this exercise, this couldn't be easier as there's just one database table. Note that we use a HASH partition on TITLE. In the big model we actually use a multilevel partition with a range on date as well, but you won't need this for just 3 weeks of data. The HASH partition is really handy because we are mostly searching for a specific title, so a scan only has to hit 1/16th of the data, which helps rather than hurts performance.
Also note that there's a 2bn row limit on partition sizes in HANA, and we don't want to get anywhere near that (I recommend 200-300m rows max as a target). HASH partitioning is neat because it evenly distributes values between partitions.
Also note that we use a GENERATED ALWAYS column for the date. Most of the time we're not interested in the timestamp, and it's very expensive to process the attribute vector of timestamps when you only need the date. Materializing the date allows for a minimum of 24x more efficient time series processing.
CREATE USER WIKI PASSWORD "Initial123";
DROP TABLE "WIKI"."PAGECOUNTS";
CREATE COLUMN TABLE "WIKI"."PAGECOUNTS" (
  "WIKITIME" TIMESTAMP,
  "WIKIDATE" DATE GENERATED ALWAYS AS TO_DATE("WIKITIME"),
  "PROJECT" VARCHAR(25),
  "TITLE" VARCHAR(2048),
  "PAGEVIEWS" BIGINT,
  "SIZE" BIGINT) PARTITION BY HASH("TITLE") PARTITIONS 16;
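Once some data is loaded, you can sanity-check that the HASH partitioning really is spreading rows evenly, and see how much of your ~30GB you are using. This is just a quick sketch against the standard monitoring view M_CS_TABLES, which reports one row per partition of a column table:

SELECT PART_ID, RECORD_COUNT, ROUND(MEMORY_SIZE_IN_TOTAL/1024/1024) AS MEMORY_MB
FROM M_CS_TABLES
WHERE SCHEMA_NAME = 'WIKI' AND TABLE_NAME = 'PAGECOUNTS'
ORDER BY PART_ID;

If the partitions come back wildly uneven, something is off with the TITLE values being loaded.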
Step 3 – Download and load the data
The friendly folks at Your.org maintain an excellent mirror of the Wikipedia Page Views data. There are a few challenges with this data, from a HANA perspective.
First, it comes in hourly files of around 100MB, which means you have to process a lot of files. So we wrote a batch script that processes them in bulk (we used a modified form of this script on all 70,000 files, with far more parallel processing than your AWS developer instance can cope with!).
Second, they are gzipped, and we don't want to unzip the whole lot, as that would be huge and take a lot of time. So the script unzips them to a RAM disk location for speed of processing.
Third, the files are space delimited and, to save space, don't contain the date and time. For efficient batch loading into HANA without an ETL tool like Data Services, we reformat each file before writing it to the RAM disk, so that it contains the timestamp as the first column and is CSV formatted with quotes around the titles.
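For reference, the load itself then boils down to a plain server-side CSV import per reformatted file. Roughly speaking (this is a sketch only; the actual path is wherever the script writes its RAM disk files, so treat the file name below as a placeholder), each import looks something like:

-- hypothetical file name on the RAM disk; the script generates the real one
IMPORT FROM CSV FILE '/path/to/ramdisk/pagecounts-20141001-000000.csv'
INTO "WIKI"."PAGECOUNTS"
WITH RECORD DELIMITED BY '\n'
FIELD DELIMITED BY ','
OPTIONALLY ENCLOSED BY '"'
THREADS 4 BATCH 200000;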
Anyhow, the script is attached as hanaloader.sh. You need to copy this script to your AWS system and run it as the HANA user. Sit back and relax for an hour whilst it loads. The script is uploaded as a txt file, so please remember to rename it to .sh.
Please follow these instructions to run the script:
— login to the server as root and run the following:
mkdir /vol/vol_HDB/sysfiles/wiki
chown hdbadm:sapsys /vol/vol_HDB/sysfiles/wiki
chmod u=rwx /vol/vol_HDB/sysfiles/wiki
su - hdbadm
cd /vol/vol_HDB/sysfiles/wiki
— place wikiload.sh in this folder
— edit wikidownload.sh as described in the comments in the file
— Once ready run as follows:
./wikidownload.sh 2014 10
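While it runs, you can keep an eye on progress from the SQL console with a quick query against the table we created earlier, for example:

SELECT COUNT(*) AS LOADED_ROWS, MAX("WIKITIME") AS LATEST_HOUR
FROM "WIKI"."PAGECOUNTS";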
Step 4 – Install HANA Studio
Whilst this is loading, go ahead and get SAP Development Tools for Eclipse installed. If you have access to SAP Software Downloads, you could alternatively use HANA Studio. Make sure you are on at least Revision 80, because otherwise the developer tools won’t work.
Step 5 – Testing
Well now you have a database table populated with 3 weeks of Wikipedia page views. You can test a few SQL scripts to make sure it works, for example:
SELECT WIKIDATE, SUM(PAGEVIEWS) FROM WIKI.PAGECOUNTS GROUP BY WIKIDATE;
SELECT WIKIDATE, SUM(PAGEVIEWS) FROM WIKI.PAGECOUNTS WHERE TITLE = 'SAP' GROUP BY WIKIDATE;
Note how when you filter, performance dramatically improves. This is the way that HANA works – it’s far faster to scan (3bn scans/sec/core) than it is to aggregate (16m aggs/sec/core). That’s one of the keys to HANA’s performance.
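If you're curious what the database actually does with the filtered query, you can ask for the plan. A minimal sketch (EXPLAIN PLAN writes its output to the EXPLAIN_PLAN_TABLE view):

EXPLAIN PLAN SET STATEMENT_NAME = 'wiki_filter' FOR
SELECT WIKIDATE, SUM(PAGEVIEWS) FROM WIKI.PAGECOUNTS WHERE TITLE = 'SAP' GROUP BY WIKIDATE;

SELECT OPERATOR_NAME, OPERATOR_DETAILS, EXECUTION_ENGINE
FROM EXPLAIN_PLAN_TABLE WHERE STATEMENT_NAME = 'wiki_filter';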
Next Steps
This is just the first of a multi-part series. Here’s what we’re going to build next:
Part 2: Building the OLAP model. We use the HANA Developer Perspective to build a virtual model that allows efficient processing and ad-hoc reporting in Lumira.
Part 3: Predictive Analysis. We build a predictive model that allows on the fly prediction of future page views.
Part 4: Web App. We expose the model via OData and build a simple web app on the top using SAP UI5.
I just want to say a big thanks to Werner Steyn, Lars Breddemann, Brenton O'Callaghan and Lloyd Palfrey for their help with putting all this together.
Stay tuned for the next steps!
Hi John,
thanks a lot for sharing.
Great work!
/Weiss
Hi John,
Thanks a heap for this!
The demo and the subsequent blog were both awesome. Kudos!
-Abhik
P.S.: This is quite off-topic but when's the Graph Database coming out? I am eagerly waiting to try out something similar with Facebook and Twitter data with the graph database.
Glad you enjoyed! Graph comes with SPS09 in November.
HANA Dev Edition volume size is being increased right now to make this even easier to try out.
Hi,
Is it possible to set this up on SAP Developer instance running on Azure as well?
The Developer instance on Azure should work as well.
The WIKI data loaded... can't wait for "Part 2: Building the OLAP model."
Hi John,
I ran into an error while executing the .sh file. Please refer to the screenshot:
Please guide me on how to resolve it.
Regards
Kumar
Hi Kumar,
The "command not found" errors are all on empty lines in the script.
Looks like a "Carriage Return and Line Feed" problem. How did you copy/paste the script text from your PC to the Linux server?
Can you post a screenshot of the script as shown in vi, and one as shown with more?
Hi Bouten,
As John has mentioned - I downloaded the script and edited it, saved it from a txt file to a .sh file, and uploaded it using an FTP client to the server, into the respective folder.
But when I execute it, it gives the above-mentioned error.
From your point, if I remove all the empty lines from the script then it should work, right?
Please advise.
Regards
Kumar
I think the problem is due to the fact that you edited the file on Windows and then did the upload. You have the "Carriage Return and Line Feed" problem.
Deleting the empty lines will not work, as the "Carriage Return and Line Feed" problem is at the end of every text line.
The best way is to upload the original txt file from John to the server and then edit the file on the SAP HANA server. Use vi for that.
Documentation: Basic vi commands
Hi Bouten,
I am still getting the same error but at a different line - please refer to the screenshot:
Any thoughts ?
Regards
Kumar
Did you upload the original file that John posted here? Did you do an FTP binary upload?
I downloaded the original file and extracted it. Without opening the file, I renamed it from txt to .sh and uploaded it. Editing was done using vi on the HANA server.
Just checked the file here on SCN and that one gives problems.
John Appleby, was the file changed over the last few days?
@kumar Solution: On the SAP HANA server run the command: dos2unix wikidownload.sh
This will strip all the carriage returns and make it a Unix file again.
Hi Bouten
I created a new file on the HANA server (wikiload.sh), copied John's script into it, saved it, and it's running fine now.
Thanks for your guidance 🙂
Regards
Kumar
Hi Bouten
I ran into another issue: Something went wrong 🙁
Any thoughts?
Regards
Kumar
You made a few mistakes.
In the text above John does a mkdir /vol/vol_HDB/sysfiles/wiki; note there is a '/' in front of the full path.
The first line should read #!/bin/bash and that isn't a comment line.
Please post a screenshot of the content of the load.sh script.
Wow thanks for the help Hay, I was traveling yesterday 🙂
What I usually do is to open the file in Windows, copy the contents into the clipboard and then do:
cat > wikiload.sh
*PASTE*
ctrl-D
chmod +x wikiload.sh
This works around any problems with CR/LF. Hope this helps if you haven't got it fixed already Kumar.
No problem John.
Hi John,
I'm using your way, however I still get the carriage return error (the file name ends with %0D):
: No such file or directory-050000.gz
MD5 mismatch. Downloading file again
rm: cannot remove `pagecounts-20141001-050000.gz\r': No such file or directory
--2014-11-18 16:52:03-- ftp://dumps.wikimedia.your.org/pub/wikimedia/dumps/other/pagecounts-raw/2014/2014-10/pagecounts-20141001-050000.gz%0D
Connecting to ###:8080... connected.
Proxy request sent, awaiting response... 404 Not Found
2014-11-18 16:52:09 ERROR 404: Not Found.
--2014-11-18 16:52:09-- ftp://dumps.wikimedia.your.org/pub/wikimedia/dumps/other/pagecounts-raw/2014/2014-10/pagecounts-20141001-060000.gz%0D
I'm using the new files attached here. Thanks.
Finally found out the CR was due to the downloaded md5sums.txt.
I ran dos2unix -a md5sums.txt and reran the script with the command that downloads md5sums.txt commented out, and the pagecounts*.gz files get downloaded.
Is this normal, or is anyone else facing this?
Hi John & Hay
Thanks for the help.
Regards
Kumar 🙂
Hi John
Any update on when the remaining parts will be published?
Regards
Kumar 🙂
Hi John,
Good day to you.
I tried to load the data into rev91 and it failed with the error below:
* 2: general error: Cannot load table from the specified path - /2012-12. Please check configuration: csv_import_path_filter. SQLSTATE: HY000
However, it tested fine in rev82 and rev85. Just wondering, has the table structure changed in rev91?
Hope to hear from you soon!
Thanks,
Nicholas Chang
Yes, SPS09 has a limit on CSV imports. Go to the configuration screen in HANA Studio and search for "csv". You will see enable_csv_import_path_filter.
Change it from true to false and you will be good.
Good to know! You can also achieve this by running this SQL as an appropriate user:
ALTER SYSTEM ALTER CONFIGURATION ('indexserver.ini', 'system') set ('import_export', 'enable_csv_import_path_filter') = 'false' with reconfigure;
Good to know! Thanks John!
Hi John,
Thank you for sharing. I could do it!
I could not find part 2 and others. Are they already available?
Guilherme.
Hi Guilherme,
any news on finding part 2?
Roberto Falk