Skip to Content

In this article, I’m going to explain through experience why you need to break up UPDATE statements as part of a data loading process. And unlike the hombre from the movie “Treasure of the Sierra Madre” —

http://www.deadredart.com/madre.jpg

Image from: http://www.deadredart.com/madre.jpg

You can’t ignore issues like this when working in relatively scare memory environments like HANA One.

Wikipedia and Big Data

I’ve been working on a “Big Data” project using AWS Hive against Wikipedia’s page hit history files for the month of April 2013 to do aggregations over the data and then load it into SAP HANA One and then use the new Lumira tools to do analysis on the results. Each hour, Wikipedia kicks out a file that averages around 7 million records that show the Wikipedia project code, the page name hit, the number of hits and the number of bytes served for that page during the hour. If you go back to my SAP HANA – Hot N Cold – HANA for Hot Data – Hadoop for Cold Data post, my inclination was to just process this using AWS Hive. I’m sticking to my guns – especially after my little experiment going with just a HANA One approach.

What I did was use the Linux wget command to pull down the .gz files for each day for April 2013 into separate directories on a 500 GB EBS drive that I added to my AWS HANA One instance (blog post pending on this ๐Ÿ™‚ ). For April 1 2013, I loaded up 167,879,044 records for the 24 hour period. On disk, the uncompressed files took up 8.45 GB of disk space. Now – multiple this by 30 days – I’m looking at 253.66 GB of uncompressed data. You can see – I’m going to push the boundaries of HANA One here. Not to mention – Wikipedia is a great target to test your Big Data skills.

Loading up HANA

What I did was create the following COLUMN table to import the data:

CREATE COLUMN TABLE “WIKIPEDIA”.“PAGEHITS”

(

“PROJNAME” VARCHAR(50),

“PAGENAME” VARCHAR(500),

“PAGEHITCOUNTFORHOUR” BIGINT CS_FIXED,

“BYTESFORHOUR” BIGINT CS_FIXED,

“HITYEAR” VARCHAR(4),

“HITMONTH” VARCHAR(2),

“HITDAY” VARCHAR(2)

)

I then did a series of 30 IMPORT statements that looked like this to load the data.

IMPORT FROM CSV FILE ‘/wiki-data/year=2013/month=04/day=01/pagecounts-20130401-010000’

INTO wikipedia.pagehits

WITH RECORD DELIMITED BY ‘\n’

FIELD DELIMITED BY ‘ ‘;

I did a COUNT(*) query to make sure I got all the rows and then went to do the following UPDATE statement to set the HITYEAR, HITMONTH and HITDAY values for each of the records.

UPDATE wikipedia.pagehits SET hityear = ‘2013’, hitmonth = ’04’,hitday=’01’ WHERE hitday IS NULL;

After 10 minutes plus minutes of grinding away, I got the following error:

Could not execute ‘update wikipedia.pagehits set hityear = ‘2013’, hitmonth = ’04’, hitday=’01”
in 10:50.841 minutes .

[129]: transaction
rolled back by an internal error: Memory allocation failed

Ouch! Sure enough, I looked at the system memory using the System Monitor in SAP HANA Studio and it showed the Database Resident Memory was taking 31+ GB of RAM!

Time for Batches

It’s obvious now after looking over all the DELTA MERGE architecture diagrams that HANA really doesn’t like to do large updates against 167 million records. Although it’s a bit surprising as I was updating integer values that were all the same value. I hope one of the SAP engineers can help explain this in detail. Regardless, my next step was to check an make sure that I could at least import and update one hour at a time.

So, after importing each hour of data which ranged from about 55 seconds to 70 seconds after 24 loads to load around 7 million records each time. I then performed the following update statement to set the date values.

UPDATE wikipedia.pagehits SET hityear = ‘2013’, hitmonth = ’04’,hitday=’01’ WHERE hitday IS NULL;

The update statements averaged around 30 seconds each. Best of all, I was able to process an entire day at one time. I also watched the Database Memory like a hawk and saw it go from 0.3 GB to 7.7 GB after 20 import and update statements. Then an amazing thing happened after the next load – Database Memory dropped down to 3.9 GB. What I suspect happened is that the compression algorithm kicked in and realized that lots of the page names were really duplicates and decided to compress the data – sweet! ๐Ÿ™‚

At the end of the operation, memory size was at 4.1 GB. Plenty of room left to add more data. More importantly – I didn’t run out of memory with the smaller batches. After restarting the server and then loading the table into memory, my memory size is sitting at 3.2 GB – I got even more compression when the table was loaded off of disk!

What’s Next

So, back to our hombre

Yes – you do need batches! If you think you are pushing the memory limits of your HANA instance – consider breaking up big set based operations like updating all of your inventory items cost by 10% into smaller batches.

Back to Big Data and HANA. A good friend of mine, Buck Woody (http://twitter.com/buckwoody) once asked someone – which is faster – a dump truck or a sports car? If you are a DBA at heart – your answer it should be – it all depends. If you are trying to move 10 tons of compost from the nursery to your home – it’s going to take a lot more trips (time) in a sports car compared to the one trip needed for the dump truck.

With AWS, I was able to load all the needed Wikipedia files into S3 storage in their compressed form and then use a Hive job to get the top 1000 page hits for English language Wikipedia articles  for each day in April using 6 beefy EC2 instances in under four hours. No import and update time was needed. No trying to configure the HANA One instance to increase the size of the 190 GB data drive to 400 GB to hold the uncompressed data. Plenty of headroom to process more that one month of data without having to unload the data from HANA One to load in the next months of data.

I hope you found this article interesting. Please rate this as you see fit if you want to see more “real-life” experiences from me in my HANA adventures.

Regards,

Bill Ramos

P.S. Be sure to check out my interview with SAP at: Five Questions for… Bill Ramos

To report this post you need to login first.

6 Comments

You must be Logged on to comment or reply to a post.

  1. Former Member

    Hi Bill,

    Great idea to try.

    I was just giving it a go as well.

    You may have accidentally missed a field in your COLUMN table shown above, when you posted this.

    To load just 1 file, with one hours data, I had to add a project name field.

    CREATE COLUMN TABLE “WIKIPEDIA”.”PAGEHITS”

    (

    “PROJNAME” VARCHAR(50),

    “PAGENAME” VARCHAR(500),

    “PAGEHITCOUNTFORHOUR” BIGINT CS_FIXED,

    “BYTESFORHOUR” BIGINT CS_FIXED,

    “HITYEAR” VARCHAR(4),

    “HITMONTH” VARCHAR(2),

    “HITDAY” VARCHAR(2)

    );

    Thanks again

    Aron

    (0) 
  2. Former Member

    Hi Bill,

    Great idea for people to try as well.

    One small observation, you may have accidentally left the leading field for wiki project name out of your example above.

    The following works for me:

    CREATE COLUMN TABLE “WIKIPEDIA”.”PAGEHITS”

    (

    “PROJNAME” VARCHAR(50),

    “PAGENAME” VARCHAR(500),

    “PAGEHITCOUNTFORHOUR” BIGINT CS_FIXED,

    “BYTESFORHOUR” BIGINT CS_FIXED,

    “HITYEAR” VARCHAR(4),

    “HITMONTH” VARCHAR(2),

    “HITDAY” VARCHAR(2)

    );

    Regards

    Aron

    (0) 
      1. Former Member

        Hi Bill,

        I’ll try not to repeat myself this time, sorry for that.  ๐Ÿ˜ณ

        While I eagerly await your AWS HIVE approach I thought I’d try and see how many WIKIPEDIA pagehit files I could load before my AWS HANA box ground to a halt.

        First I tried a 17Gb AWS HANA Developer box. The best I could load was 10 files, representing 10 hours of pagehit stats. ๐Ÿ™

        Next I tried a 34Gb AWS HANA Developer box.

        So far I’ve managed to load 48 files(hours),  but the box is struggling under the weight so I don’t plan to push it further.

        After a bit of tuning, using a batch loading schedule (shell script), I managed to load the first days records (24 files) in 1hr 45 minutes. At the 34 file mark though the poor little box started to show signs of stress, for example the IMPORT statement which was taking under 20 seconds to complete, per file, suddenly jumped to over 2 Minutes.

        The first 34 files loaded in 3 hours and the remaining 15 files loaded in 3 hours.

        From what I can see from the stats I recorded, the HANA box was consistently unloading the “PAGEHITS” table from memory (presumably triggered by some system process which required the memory), causing subsequent import/update tasks to firstly need to LOAD the data back into memory in-order to exectue. I’m sure this done for a good reason, to stop the system crashing or something equally bad, however it’d be nice to know how to predict this in advance.

        At the 34 file mark I had recorded the following stats:

        – “PUBLIC.M_HOST_RESOURCE_UTILIZATION”  had 12Gb FREE PHYSICAL MEMORY

        – M_CS_Columns for “PAGEHITS” showed as Loaded(True),16Gb uncompressed,  4.3Gb Main, 0 Gb Delta

        – In “PAGEHITS” Day 1 had 178.3 Million records

        – In “PAGEHITS” Day 2 had 71 Million records

        Finally after the 48 files eventually loaded:

        Day 1 = 178M records@3.1Gb compressed,11.8Gb uncompressd (from M_CS_Columns)

        Day 1&2 = 366M records@6.0Gb compressed,23Gb uncomprssd (from M_CS_Columns)

        HANA One with 60Gb may grind to halt with less than 10 days of data.

        A 16Tb Hana box  though might manage a couple of years of pagehit stats, if you had nothing better to use it for. ๐Ÿ˜‰

        I’ll give my HANA box a well deserved rest now.

        Regards

        Aron

        BTW: I increased “PAGENAME” to VARCHAR(2000) because in some files I noticed some larger lengths. I also disabled AUTOMERGE on the table and controlled the MERGE DELTA during the load process.

        (0) 
  3. Lars Breddemann

    Hi Bill,

    actually I don’t find it too surprising to see that some memory is required to do the update – especially since we don’t know if the table had been already merged between the data load and the update statement.

    And looking at this example, I’m pretty sure that other DBMS would also need some time to work on this request. (just looking at the in-place updates from NULL to {hityear = ‘2013’, hitmonth = ’04’,hitday=’01’} … ).
    And sure enough: we do compress data during the delta merge operation as it basically moves data from the change-optimized inbound stage (delta store) to the read-optimized main store.
    More on the merge operation can be found here: http://scn.sap.com/docs/DOC-27558.
    BTW: I feel that it’s a common pattern that the big data capabilities are easily over-generalized. Not all functions will allow for the same amount of parallelization and speed up. With your reference to the truck-sports car-example you pretty much nailed this.
    The same kind of misconception occurs with topics like heavy parallel query workload and/or scale out scenarios.
    Anyway… I like your blog.
    It’s just what I want in a blog post: personal experiences nicely written and not just plain documentation or “I don’t know how to format text, but, hey look, I still can publish blogs…“-crap.
    Thanks for that.
    (0) 

Leave a Reply