Build your own Wikipedia Keynote Part 1 – Build and load data
So by now you may have seen the Wikipedia page counts model that we built for the keynote. I’ll blog later on about the logistical challenges of getting 30TB of flat files and building a system to load it in 10 days, but since SAP TechEd & d-code is a conference with a load of developers, I’m going to spend some time showing you how to build your own.
The beauty of the SAP HANA platform is that you can build this whole model inside the HANA platform using one tool – HANA Studio
The background is that Wikipedia publishes a list of Page view statistics for Wikimedia projects, which are consolidated page views per hour by title and by project. These go back to 2007 and are now a total of nearly 250bn rows. It’s a fascinating dataset because the cardinality is really rough on databases, there are over 4m articles in the English version alone and well over that including all projects.
The total dataset is around 30TB of flat files, which translates into around 6TB of HANA database memory required. The SAP HANA Developer Edition is a 60GB Amazon EC2 cloud system, and so you can comfortably fit around 30GB of RAM, so you can fit around 0.5% of the overall dataset. That means we can comfortably fit around 3 weeks of data. This should be enough for you to have some fun!
So how do you get started? Trust me, its very easy to do!
Step 1 – get SAP HANA Developer Edition
N.B. This is free of charge from SAP, but you will have to pay Amazon EC2 fees. Be mindful of this and turn off the system when you’re not using it.
It takes a few minutes to setup, because you have to configure your AWS account to receive the HANA Developer Edition AMI, but Craig Cmehil and Thomas Grassl have done a great job of making this easy, so please go ahead and configure the HANA Developer Edition!
You can of course use an on-premise version or any other HANA instance, though our scripts do assume that you have internet access, so your system doesn’t, then you will adapt them. That’s part of the fun, right!
Step 2 – Create and build the model
For the purposes of this exercise, this couldn’t be easier as there’s just one database table. Note that we use a HASH partition on TITLE. In the big model, we actually use a multilevel partition with a range on date as well, but you won’t need this for just 3 weeks. The HASH partition is really handy as we are mostly searching for a specific title, so we can be sure that we’ll only hit 1/16th of the data for a scan. This won’t hurt performance.
Also note that there’s a 2bn row limit to partition sizes in HANA, and we don’t want to get near to that (I recommend 2-300m rows max as a target). HASH partitioning is neat, because it evenly distributes values between partitions.
Also note that we use a generated always statement for date. Most of the time we’re not interested in timestamp, and it’s very expensive to process the attribute vector of timestamps when you only need the date. Materializing the date allows for a minimum of 24x more efficient time series processing.
CREATE USER WIKI PASSWORD “Initial123”;
DROP TABLE “WIKI”.”PAGECOUNTS”;
CREATE COLUMN TABLE “WIKI”.”PAGECOUNTS” (
“WIKIDATE” DATE GENERATED ALWAYS AS to_date(“WIKITIME”),
“SIZE” BIGINT) PARTITION BY HASH(TITLE) PARTITIONS 16;
Step 3 – Download and load the data
First, it comes in hourly files of around 100MB, which means you have to process a lot of files. So, we wrote a batch script that allows processing of a lot of files (we used this script on all 70,000 files in a modified form to allow for much more parallel processing than your AWS developer instance can cope with!).
Second, they are gzipped, and we don’t want to unzip the whole lot as that would be huge and takes a lot of time. So the script unzips them to a RAM disk location for speed of processing.
Third, the files are space delimited and don’t contain the date and time in them, to save space. For efficient batch loading into HANA without an ETL tool like Data Services, we reformat the file before writing to RAM disk, to contain the timestamp as the first column, and be CSV formatted with quotes around the titles.
Anyhow, the script is attached as hanaloader.sh. You need to copy this script to your AWS system and run it as the HANA user. Sit back and relax for an hour whilst it loads. The script is uploaded as a txt file so please remember to rename as .sh
Please follow these instructions to run the script:
— login to the server as root and run the following:
chown hdbadm:sapsys /vol/vol_HDB/sysfiles/wiki
chmod u=wrx /vol/vol_HDB/sysfiles/wiki
su – hdbadm
— place wikiload.sh in this folder
— edit wikidownload.sh as described in the comments in the file
— Once ready run as follows:
./ wikidownload.sh 2014 10
Step 4 – Install HANA Studio
Whilst this is loading, go ahead and get SAP Development Tools for Eclipse installed. If you have access to SAP Software Downloads, you could alternatively use HANA Studio. Make sure you are on at least Revision 80, because otherwise the developer tools won’t work.
Step 5 – Testing
Well now you have a database table populated with 3 weeks of Wikipedia page views. You can test a few SQL scripts to make sure it works, for example:
SELECT WIKIDATE, SUM(PAGEVIEWS) FROM WIKI.PAGECOUNTS GROUP BY WIKIDATE;
SELECT WIKIDATE, SUM(PAGEVIEWS) FROM WIKI.PAGECOUNTS WHERE TITLE = ‘SAP’ GROUP BY WIKIDATE;
Note how when you filter, performance dramatically improves. This is the way that HANA works – it’s far faster to scan (3bn scans/sec/core) than it is to aggregate (16m aggs/sec/core). That’s one of the keys to HANA’s performance.
This is just the first of a multi-part series. Here’s what we’re going to build next:
Part 2: Building the OLAP model. We use the HANA Developer Perspective to build a virtual model that allows efficient processing and ad-hoc reporting in Lumira.
Part 3: Predictive Analysis. We build a predictive model that allows on the fly prediction of future page views.
Part 4: Web App. We expose the model via OData and build a simple web app on the top using SAP UI5.
Keep tuned for next steps!