John Appleby

Build your own Wikipedia Keynote Part 1 – Build and load data

So by now you may have seen the Wikipedia page counts model that we built for the keynote. I’ll blog later on about the logistical challenges of getting 30TB of flat files and building a system to load it in 10 days, but since SAP TechEd & d-code is a conference with a load of developers, I’m going to spend some time showing you how to build your own.

The beauty of SAP HANA is that you can build this whole model inside the platform using just one tool – HANA Studio.

Background

The background is that Wikipedia publishes Page view statistics for Wikimedia projects – consolidated page views per hour, by title and by project. These go back to 2007 and now total nearly 250bn rows. It's a fascinating dataset because the cardinality is really rough on databases: there are over 4m articles in the English Wikipedia alone, and well over that across all projects.

The total dataset is around 30TB of flat files, which translates into around 6TB of HANA database memory. The SAP HANA Developer Edition is a 60GB Amazon EC2 cloud system, so you can comfortably fit around 30GB of data in memory – roughly 0.5% of the overall dataset, or about 3 weeks of data. This should be enough for you to have some fun!

So how do you get started? Trust me, it's very easy to do!

Step 1 – get SAP HANA Developer Edition

N.B. This is free of charge from SAP, but you will have to pay Amazon EC2 fees. Be mindful of this and turn off the system when you’re not using it.

It takes a few minutes to set up, because you have to configure your AWS account to receive the HANA Developer Edition AMI, but Craig Cmehil and Thomas Grassl have done a great job of making this easy, so please go ahead and configure the HANA Developer Edition!

You can of course use an on-premise version or any other HANA instance, though our scripts do assume that you have internet access; if your system doesn't, you will need to adapt them. That's part of the fun, right?

Step 2 – Create and build the model

For the purposes of this exercise, this couldn't be easier as there's just one database table. Note that we use a HASH partition on TITLE. In the big model we actually use a multi-level partition with a range on date as well, but you won't need this for just 3 weeks. The HASH partition is really handy because we are mostly searching for a specific title, so partition pruning means any scan only has to touch 1/16th of the data, and the partitioning certainly won't hurt performance.

Also note that there's a 2bn row limit on partition sizes in HANA, and we don't want to get anywhere near that (I recommend 200-300m rows per partition as a target). HASH partitioning is neat because it distributes values evenly across partitions.

Also note that we use a GENERATED ALWAYS column for the date. Most of the time we're not interested in the timestamp, and it's very expensive to process the attribute vector of timestamps when you only need the date. Materializing the date allows for a minimum of 24x more efficient time series processing.

CREATE USER WIKI PASSWORD "Initial123";

DROP TABLE "WIKI"."PAGECOUNTS";

CREATE COLUMN TABLE "WIKI"."PAGECOUNTS" (
    "WIKITIME" TIMESTAMP,
    "WIKIDATE" DATE GENERATED ALWAYS AS to_date("WIKITIME"),
    "PROJECT" VARCHAR(25),
    "TITLE" VARCHAR(2048),
    "PAGEVIEWS" BIGINT,
    "SIZE" BIGINT
) PARTITION BY HASH(TITLE) PARTITIONS 16;
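If you want to see the GENERATED ALWAYS column in action before loading anything, you can insert a single test row and check that WIKIDATE is derived from WIKITIME automatically. This is just a minimal sketch with made-up sample values ('en', 'SAP'); delete the row again straight afterwards so the real load starts from an empty table:

-- Insert one hypothetical row; WIKIDATE is not in the column list,
-- it is filled in by the generated column.
INSERT INTO "WIKI"."PAGECOUNTS" ("WIKITIME", "PROJECT", "TITLE", "PAGEVIEWS", "SIZE")
    VALUES ('2014-10-01 13:00:00', 'en', 'SAP', 1, 1);

-- WIKIDATE should come back as 2014-10-01 without ever being inserted explicitly.
SELECT "WIKITIME", "WIKIDATE", "TITLE" FROM "WIKI"."PAGECOUNTS";

-- Remove the test row again before the real load.
DELETE FROM "WIKI"."PAGECOUNTS" WHERE "TITLE" = 'SAP' AND "PAGEVIEWS" = 1;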

Step 3 – Download and load the data

The friendly folks at Your.org maintain an excellent mirror of the Wikipedia Page Views data. There are a few challenges with this data, from a HANA perspective.

First, it comes in hourly files of around 100MB, which means you have to process a lot of files. So we wrote a batch script that processes them in bulk (we used a modified form of this script on all 70,000 files, with far more parallelism than your AWS developer instance can cope with!).

Second, they are gzipped, and we don't want to unzip the whole lot, as that would be huge and take a lot of time. So the script unzips them to a RAM disk location for speed of processing.

Third, to save space the files are space-delimited and don't contain the date and time. For efficient batch loading into HANA without an ETL tool like Data Services, we reformat each file before writing it to the RAM disk, so that it contains the timestamp as the first column and is CSV-formatted with quotes around the titles.
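To give an idea of what the load itself looks like, here is a minimal sketch of how one such reformatted file could be loaded manually with HANA's IMPORT FROM CSV FILE statement. The file path is hypothetical, and the attached script takes care of the download, reformat and load steps for you:

-- Hypothetical path on the RAM disk; the five CSV fields map to the physical
-- columns, and the generated WIKIDATE column is filled in automatically.
-- Exact options may vary slightly by HANA revision.
IMPORT FROM CSV FILE '/path/to/ramdisk/pagecounts.csv'
    INTO "WIKI"."PAGECOUNTS"
    WITH COLUMN LIST ("WIKITIME", "PROJECT", "TITLE", "PAGEVIEWS", "SIZE")
         RECORD DELIMITED BY '\n'
         FIELD DELIMITED BY ','
         OPTIONALLY ENCLOSED BY '"'
         THREADS 4
         BATCH 200000;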

Anyhow, the script is attached as hanaloader.sh. You need to copy this script to your AWS system and run it as the HANA user. Sit back and relax for an hour whilst it loads. The script is uploaded as a txt file, so please remember to rename it to .sh.

Please follow these instructions to run the script:

— log in to the server as root and run the following:


     mkdir /vol/vol_HDB/sysfiles/wiki
     chown hdbadm:sapsys /vol/vol_HDB/sysfiles/wiki
     chmod u=rwx /vol/vol_HDB/sysfiles/wiki
     su - hdbadm
     cd /vol/vol_HDB/sysfiles/wiki


— place wikiload.sh in this folder


— edit wikidownload.sh as described in the comments in the file


— Once ready run as follows:

     ./wikidownload.sh 2014 10
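Once the script is running, you can keep an eye on progress from the SQL console and check that the HASH partitioning is spreading rows evenly across the 16 partitions. Here is a quick sketch against the standard M_CS_TABLES monitoring view:

-- Row count and in-memory size per partition; with HASH partitioning the
-- 16 partitions should show roughly equal record counts.
SELECT PART_ID,
       RECORD_COUNT,
       ROUND(MEMORY_SIZE_IN_TOTAL / 1024 / 1024) AS MEMORY_MB
  FROM M_CS_TABLES
 WHERE SCHEMA_NAME = 'WIKI'
   AND TABLE_NAME = 'PAGECOUNTS'
 ORDER BY PART_ID;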

Step 4 – Install HANA Studio

Whilst this is loading, go ahead and get SAP Development Tools for Eclipse installed. If you have access to SAP Software Downloads, you could alternatively use HANA Studio. Make sure you are on at least Revision 80, because otherwise the developer tools won’t work.

Step 5 – Testing

Well, now you have a database table populated with 3 weeks of Wikipedia page views. You can run a few SQL queries to make sure it works, for example:

SELECT WIKIDATE, SUM(PAGEVIEWS) FROM WIKI.PAGECOUNTS GROUP BY WIKIDATE;

SELECT WIKIDATE, SUM(PAGEVIEWS) FROM WIKI.PAGECOUNTS WHERE TITLE = 'SAP' GROUP BY WIKIDATE;

Note how performance improves dramatically when you filter. This is the way HANA works – it's far faster to scan (3bn scans/sec/core) than it is to aggregate (16m aggs/sec/core). That's one of the keys to HANA's performance.
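If you want to see the cost of aggregation for yourself, compare the filtered query above with an unfiltered top-N aggregation over a single day. This is a small sketch in which the project code 'en' and the date are just example values:

-- Top 10 most-viewed titles in the English project for one example day.
-- This has to aggregate across every title, so expect it to be noticeably
-- slower than the TITLE-filtered query above.
SELECT TOP 10 TITLE, SUM(PAGEVIEWS) AS TOTAL_VIEWS
  FROM WIKI.PAGECOUNTS
 WHERE PROJECT = 'en' AND WIKIDATE = '2014-10-01'
 GROUP BY TITLE
 ORDER BY TOTAL_VIEWS DESC;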

Next Steps

This is just the first of a multi-part series. Here’s what we’re going to build next:

Part 2: Building the OLAP model. We use the HANA Developer Perspective to build a virtual model that allows efficient processing and ad-hoc reporting in Lumira.

Part 3: Predictive Analysis. We build a predictive model that allows on the fly prediction of future page views.

Part 4: Web App. We expose the model via OData and build a simple web app on top using SAPUI5.

I just want to say a big thanks to Werner Steyn, Lars Breddemann, Brenton O'Callaghan and Lloyd Palfrey for their help with putting all this together.

Stay tuned for the next steps!

      30 Comments
      Jorge Weiss

      Hi John,

      thanks a lot for sharing.

      Great work!

      /Weiss

      Abhik Gupta

      Hi John,

      Thanks a heap for this!

      The demo and the subsequent blog were both awesome. Kudos!

      -Abhik

      P.S.: This is quite off-topic but when's the Graph Database coming out? I am eagerly waiting to try out something similar with Facebook and Twitter data with the graph database.

      John Appleby (Blog Post Author)

      Glad you enjoyed! Graph comes with SPS09 in November.

      Craig Cmehil

      HANA Dev Edition volume size is being increased right now to make this even easier to try out.

      Former Member

      Hi,

        Is it possible to set this up on SAP Developer instance running on Azure as well?

      Craig Cmehil

      The Developer instance on Azure should work as well.

      Henricus Bouten

      The WIKI data loaded... can't wait for "Part 2: Building the OLAP model."

      Kumar Mayuresh

      Hi John,

      I ran into an error while executing the .sh file. Please refer to the screenshot:

      /wp-content/uploads/2014/11/wikierror_579256.png

      Please guide how to resolve the same.

      Regards

      Kumar

      Henricus Bouten

      Hi Kumar,

      The command not found errors are all on empty lines in the script.

      Looks like a "Carriage Return and Line Feed" problem. How did you copy/paste the script text from your PC to the Linux server?

      Can you post a screenshot of the script, once shown in vi and once with more?

      Kumar Mayuresh

      Hi Bouten,

      As John mentioned, I downloaded the script and edited it, then saved it from a txt file to a .sh file and uploaded it via an FTP client to the server, into the respective folder.

      But when I execute it, it gives the above-mentioned error.

      From your point

      The command not found errors are all on empty lines in the script.

      If I remove all the empty lines from the script, then it should work, right?

      Please advise.

      Regards

      Kumar

      Henricus Bouten

      I think the problem is due to the fact that you edited the file on Windows and then did the upload. You have the "Carriage Return and Line Feed" problem.

      Deleting the empty lines will not work, as the "Carriage Return and Line Feed" problem is at the end of every text line.

      The best way is to upload the original txt file from John to the server and then edit the file on the SAP HANA server. Use vi for that.

      Documentation: Basic vi commands

      Kumar Mayuresh

      Hi Bouten,

      I am still getting the same error, but at a different line. Please refer to the screenshot:

      /wp-content/uploads/2014/11/error_579701.jpg

      Any thoughts ?

      Regards

      Kumar

      Henricus Bouten

      Did you upload the original file that John posted here? Did you do an FTP binary upload?

      Kumar Mayuresh

      I downloaded the original file and extracted it. Without opening the file I renamed it from txt to .sh and uploaded it; editing was done using vi on the HANA server.

      Henricus Bouten

      Just checked the file here on SCN and that one gives problems.

      John Appleby, was the file changed over the last few days?

      @kumar Solution: On the SAP HANA server run the command: dos2unix wikidownload.sh

      This will strip all the "Carriage Return and Line Feed" characters and make it a Unix file again.

      Kumar Mayuresh

      Hi Bouten

      I created a new file on the HANA server (wikiload.sh), copied John's script into it and saved it, and it's running fine now.

      Thanks for your guidance 🙂

      Regards

      Kumar

      Kumar Mayuresh

      Hi Bouten

      I ran into another issue: something went wrong 🙁

      /wp-content/uploads/2014/11/2_579854.jpg

      Any thoughts

      Regards

      Kumar

      Henricus Bouten

      You made a few mistakes.

      1. Your mkdir was in the /usr/sap/HDBHDB00 directory.
        In the text above John does mkdir /vol/vol_HDB/sysfiles/wiki; there is a '/' in front of the full path.
      2. From the error I see now it looks like you are missing the first few lines in your script.
        The first line should read #!/bin/bash and that isn't a comment line.

      Please post a screenshot of the content of the load.sh script.


      John Appleby (Blog Post Author)

      Wow, thanks for the help Hay, I was traveling yesterday 🙂

      What I usually do is to open the file in Windows, copy the contents into the clipboard and then do:

      cat > wikiload.sh

      *PASTE*

      ctrl-D

      chmod +x wikiload.sh

      This works around any problems with CR/LF. Hope this helps if you haven't got it fixed already Kumar.

      Henricus Bouten

      No problem John.

      Nicholas Chang

      Hi John,

      I'm using your way, however I still get the carriage return error (the file name ends with %0D):

      : No such file or directory-050000.gz

      MD5 mismatch. Downloading file again

      rm: cannot remove `pagecounts-20141001-050000.gz\r': No such file or directory

      --2014-11-18 16:52:03--  ftp://dumps.wikimedia.your.org/pub/wikimedia/dumps/other/pagecounts-raw/2014/2014-10/pagecounts-20141001-050000.gz%0D

      Connecting to ###:8080... connected.

      Proxy request sent, awaiting response... 404 Not Found

      2014-11-18 16:52:09 ERROR 404: Not Found.

      --2014-11-18 16:52:09--  ftp://dumps.wikimedia.your.org/pub/wikimedia/dumps/other/pagecounts-raw/2014/2014-10/pagecounts-20141001-060000.gz%0D

      I'm using the new files attached here. Thanks.

      Nicholas Chang

      Finally found out the CR was due to the downloaded md5sums.txt.

      I ran dos2unix -a md5sums.txt and reran the script with the command that downloads md5sums.txt commented out, and the pagecounts*.gz files get downloaded.

      Is this normal, or is anyone else facing this?

      Kumar Mayuresh

      Hi John & Hay

      Thanks for the help.

      Regards

      Kumar 🙂

      Kumar Mayuresh

      Hi John

      Any update on when the remaining parts will be published?

      Regards

      Kumar 🙂

      Nicholas Chang

      Hi John,

      Good day to you.

      I tried to load the data into rev91 and it failed with the error below:

      * 2: general error: Cannot load table from the specified path - /2012-12. Please check configuration: csv_import_path_filter. SQLSTATE: HY000

      However, it tested fine in rev82 and rev85. Just wondering, has the table structure changed in rev91?

      Hope to hear from you soon!

      Thanks,

      Nicholas Chang

      John Appleby (Blog Post Author)

      Yes, SPS09 has a limit on CSV imports. Go to the configuration screen in HANA Studio and search for "csv". You will see enable_csv_import_path_filter.

      Change it from true to false and you will be good.


      Alan McShane

      Good to know! You can also achieve this by running this SQL as an appropriate user:

      ALTER SYSTEM ALTER CONFIGURATION ('indexserver.ini', 'system') set ('import_export', 'enable_csv_import_path_filter') = 'false' with reconfigure;

      Nicholas Chang

      Good to know! Thanks John!

      Former Member

      Hi John,

      Thank you for sharing. I was able to do it!

      I could not find part 2 and the others. Are they available yet?

      Guilherme.

      Roberto Falk

      Hi Guilherme,

      any news on finding part 2?

      Roberto Falk