
UPDATE! Thanks to a comment by Alan Suffolk I fixed my scripts a little bit: I moved end = time.clock() and time_taken = end - start to right after cur.execute(query), because that's exactly when the SAP HANA query finishes…you can see the new processing time in the image below…I also fixed the Generate_File signature he spotted in the generator scripts…
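In other words, the measurement pattern is now the following (a minimal sketch to show where the timer stops, not the exact code from the scripts below):

    import time

    def timed_execute(cur, query):
        # Stop the clock right after execute(): that's when the database has
        # finished its work; fetching rows and building HTML is client-side
        start = time.clock()  # time.clock() is what the scripts below use
        cur.execute(query)
        return time.clock() - start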

My good friend Pakdi Decnud gave me a great idea while we were having lunch the very same day of the SAP CodeJam Montreal event.

Pakdi told me…"Why don't you make a comparison between SAP HANA and, let's say…MongoDB?"…I thought it was of course a great idea…so yesterday I started exploring MongoDB…so you may ask yourselves…"Why then are you talking about MySQL and PostgreSQL?"

Easy answer…and here are my thoughts…

  • I really don't get MongoDB…the whole No-SQL thing is really alien to me…
  • MongoDB is "Document" based, meaning that you create collections of documents, not databases nor tables…
  • MongoDB doesn't support Inner Joins, and aggregates need a framework that is weirder than MongoDB itself…
  • MongoDB is not meant for enterprise applications

That's why I decided to shift gears a little bit and grab the two databases most used by start-ups and developers…MySQL and PostgreSQL.

For this blog, I wanted to have a lot of information…so as always, I grabbed my beloved Python and created a little script to generate 1 million records for each of two tables. One script per table.

The structure of the tables is as follows…

DOC_HEADER

Field Name    Data Type   Length
DOCUMENT_ID   VARCHAR     8
YEAR          VARCHAR     4
AREA          VARCHAR     2
NAME          VARCHAR     20
LAST_NAME     VARCHAR     20

DOC_DETAIL


Field Name    Data Type   Length
DOCUMENT_ID   VARCHAR     8
YEAR          VARCHAR     4
AREA          VARCHAR     2
AMOUNT        BIGINT
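For reference, this is roughly the DDL behind those two tables (just a sketch: I actually created them through the graphical tools, with no primary keys or indexes on purpose):

    CREATE TABLE DOC_HEADER (
        DOCUMENT_ID VARCHAR(8),
        YEAR        VARCHAR(4),
        AREA        VARCHAR(2),
        NAME        VARCHAR(20),
        LAST_NAME   VARCHAR(20)
    );

    CREATE TABLE DOC_DETAIL (
        DOCUMENT_ID VARCHAR(8),
        YEAR        VARCHAR(4),
        AREA        VARCHAR(2),
        AMOUNT      BIGINT
    );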

And here are the scripts to generate the 1 million records in a nice .CSV file…

Doc_Header_Generator.py

import random
import csv

names = ["Anne", "Gigi", "Juergen", "Ingo", "Inga", "Alvaro", "Mario",
         "Julien", "Mike", "Michael", "Karin", "Rui", "John", "Rocky", "Sebastian",
         "Kai-Yin", "Hester", "Katrin", "Uwe", "Vitaliy"]

last_names = ["Hardy", "Read", "Schmerder", "Sauerzapf", "Bereza", "Tejada",
              "Herger", "Vayssiere", "Flynn", "Byczkowski", "Schattka",
              "Nogueira", "Mayerhofer", "Ongkowidjojo", "Wieczorek", "Gau", "Hilbrecht",
              "Staehr", "Kylau", "Rudnytskiy"]

area = ["001", "002", "003", "004", "005", "006", "007", "008", "009", "010"]

year = ["2000", "2001", "2002", "2003", "2004", "2005", "2006", "2007",
        "2008", "2009", "2010", "2011", "2012"]

def Generate_File(pNumber):
    # Write pNumber random header records to Doc_Header.csv
    iNumber = 0
    c = csv.writer(open("Doc_Header.csv", "wb"))
    while iNumber < pNumber:
        queries = []
        r_doc_id = random.randrange(1, 999999)
        r_names = random.randrange(0, 20)
        r_lastnames = random.randrange(0, 20)
        r_areas = random.randrange(0, 10)
        r_years = random.randrange(0, 13)
        iNumber += 1
        queries.append(r_doc_id)
        queries.append(year[r_years])
        queries.append(str(area[r_areas]))
        queries.append(names[r_names])
        queries.append(last_names[r_lastnames])
        c.writerow(queries)

num_files = int(raw_input("How many records?: \n"))
Generate_File(num_files)

Doc_Detail_Generator.py

import random
import csv

area = ["001", "002", "003", "004", "005", "006", "007", "008", "009", "010"]

year = ["2000", "2001", "2002", "2003", "2004", "2005", "2006", "2007",
        "2008", "2009", "2010", "2011", "2012"]

def Generate_File(pNumber):
    # Write pNumber random detail records to Doc_Detail.csv
    iNumber = 0
    c = csv.writer(open("Doc_Detail.csv", "wb"))
    while iNumber < pNumber:
        queries = []
        r_doc_id = random.randrange(1, 999999)
        r_areas = random.randrange(0, 10)
        r_years = random.randrange(0, 13)
        r_amount = random.randrange(0, 10000, 1)
        iNumber += 1
        queries.append(r_doc_id)
        queries.append(year[r_years])
        queries.append(str(area[r_areas]))
        queries.append(r_amount)
        c.writerow(queries)

num_files = int(raw_input("How many records?: \n"))
Generate_File(num_files)

With the two files ready, I uploaded them to MySQL, PostgreSQL and SAP HANA.
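I did the uploads through each database's own tools, but in case you want to script that step too, here is a minimal sketch for PostgreSQL (it assumes psycopg2, the database from the scripts below, and the quoted table names shown above; the MySQL equivalent is in the comment):

    import psycopg2

    conn = psycopg2.connect("dbname=P075400 user=postgres password=root")
    cur = conn.cursor()
    # COPY streams each .CSV file straight into its table
    for table, csv_file in [("DOC_HEADER", "Doc_Header.csv"),
                            ("DOC_DETAIL", "Doc_Detail.csv")]:
        with open(csv_file) as f:
            cur.copy_expert('COPY "%s" FROM STDIN WITH CSV' % table, f)
    conn.commit()
    # MySQL equivalent (needs LOCAL INFILE enabled on client and server):
    # LOAD DATA LOCAL INFILE 'Doc_Header.csv' INTO TABLE DOC_HEADER
    #     FIELDS TERMINATED BY ','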

To measure the speed, I created three Python scripts using…yes…again Bottle.

The basic idea is to join the two tables, select the Document_Id, Year, Area and the sum of Amount.
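In plain SQL, that idea looks like this (a generic form; each script below phrases it in its own dialect):

    SELECT H.DOCUMENT_ID, H.YEAR, H.AREA, SUM(D.AMOUNT)
      FROM DOC_HEADER H
      JOIN DOC_DETAIL D
        ON D.DOCUMENT_ID = H.DOCUMENT_ID
       AND D.YEAR = H.YEAR
       AND D.AREA = H.AREA
     GROUP BY H.DOCUMENT_ID, H.YEAR, H.AREA;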

Let’s start with the MySQL Script…

MySQL_Bottle_Documents.py

from bottle import get, run
import mysql.connector
import time

@get('/show_query')
def show_form():
    counter = 0
    start = time.clock()
    conn = mysql.connector.Connect(host='localhost', user='root',
                                   password='root', database='P075400')
    cur = conn.cursor()
    query = '''SELECT A.DOCUMENT_ID, A.YEAR, A.AREA, SUM(B.AMOUNT)
                  FROM DOC_HEADER AS A INNER JOIN DOC_DETAIL AS B
                  ON A.DOCUMENT_ID = B.DOCUMENT_ID
                      AND A.YEAR = B.YEAR
                      AND A.AREA = B.AREA
                  GROUP BY A.DOCUMENT_ID, A.YEAR, A.AREA'''
    cur.execute(query)
    output = "<DIV ALIGN='CENTER'><TABLE BORDER='1'>"
    output += "<TR BGCOLOR='#B9C9FE'>"
    output += "<TH>Document</TH><TH>Year</TH>"
    output += "<TH>Area</TH><TH>Amount</TH>"
    output += "</TR>"
    for row in cur:
        counter += 1
        document_id = str(row[0])
        year = str(row[1])
        area = str(row[2])
        amount = str(row[3])
        output += "<TR BGCOLOR='#E8EDFF'>"
        output += '''<TD>%s</TD><TD>%s</TD>
                     <TD>%s</TD><TD>%s</TD>''' % (document_id, year,
                                                  area, amount)
        output += "</TR>"
    output += "</TABLE>"
    end = time.clock()
    time_taken = end - start
    output += "<H1>MySQL - %s records in %s seconds</H1></DIV>"\
    % (counter, time_taken)
    return output

run(host='localhost', port=8080)

I let the script run…and after more than one hour…I simply got bored and interrupted the process…keep in mind that I didn't create a single primary key or index on any of the three databases, so MySQL had to brute-force the join…
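A hedged and untested guess at what might have helped: composite indexes on the three join columns, something like…

    CREATE INDEX IDX_DOC_HEADER ON DOC_HEADER (DOCUMENT_ID, YEAR, AREA);
    CREATE INDEX IDX_DOC_DETAIL ON DOC_DETAIL (DOCUMENT_ID, YEAR, AREA);

…but that would break the "keep everything primitive" rule, so all three databases stayed untouched.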

So, I continued with PostgreSQL…

PostgreSQL_Bottle_Documents.py

from bottle import get, run
import psycopg2
import time

@get('/show_query')
def show_form():
    counter = 0
    start = time.clock()
    conn = psycopg2.connect("dbname=P075400 user=postgres password=root")
    cur = conn.cursor()
    query = '''SELECT "DOC_HEADER"."DOCUMENT_ID", "DOC_HEADER"."YEAR",
                "DOC_HEADER"."AREA", SUM("DOC_DETAIL"."AMOUNT") FROM
                public."DOC_HEADER", public."DOC_DETAIL" WHERE
                "DOC_HEADER"."DOCUMENT_ID" = "DOC_DETAIL"."DOCUMENT_ID"
                AND "DOC_HEADER"."YEAR" = "DOC_DETAIL"."YEAR"
                AND "DOC_HEADER"."AREA" = "DOC_DETAIL"."AREA"
                GROUP BY "DOC_HEADER"."DOCUMENT_ID", "DOC_HEADER"."YEAR",
                "DOC_HEADER"."AREA"'''
    cur.execute(query)
    output = "<DIV ALIGN='CENTER'><TABLE BORDER='1'>"
    output += "<TR BGCOLOR='#B9C9FE'>"
    output += "<TH>Document</TH><TH>Year</TH>"
    output += "<TH>Area</TH><TH>Amount</TH>"
    output += "</TR>"
    for row in cur:
        counter += 1
        document_id = str(row[0])
        year = str(row[1])
        area = str(row[2])
        amount = str(row[3])
        output += "<TR BGCOLOR='#E8EDFF'>"
        output += '''<TD>%s</TD><TD>%s</TD>
                     <TD>%s</TD><TD>%s</TD>''' % (document_id, year,
                                                  area, amount)
        output += "</TR>"
    output += "</TABLE>"
    end = time.clock()
    time_taken = end - start
    output += "<H1>PostgreSQL - %s records in %s seconds</H1></DIV>"\
    % (counter, time_taken)
    return output

run(host='localhost', port=8080)

This time…I was lucky…

PostgreSQL.png

Out of the 2 million records, PostgreSQL managed to aggregate the amount field and generate 7669 records in 36 seconds…not bad at all…

For SAP HANA, I decided to take full advantage of Calculation Views, so I created the following…

CV_Documents.png

I joined both tables, used a projection, applied the aggregation and specified the result…
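If you prefer script to screenshots, the body of that view corresponds roughly to this SQLScript (a sketch from memory using the old CE functions, not the code HANA generates; the column lists are taken from the table definitions above):

    lt_header = CE_COLUMN_TABLE("DOC_HEADER");
    lt_detail = CE_COLUMN_TABLE("DOC_DETAIL");
    lt_join = CE_JOIN(:lt_header, :lt_detail,
                      ["DOCUMENT_ID", "YEAR", "AREA"]);
    lt_proj = CE_PROJECTION(:lt_join,
                      ["DOCUMENT_ID", "YEAR", "AREA", "AMOUNT"]);
    var_out = CE_AGGREGATION(:lt_proj,
                      [SUM("AMOUNT") AS "AMOUNT"],
                      ["DOCUMENT_ID", "YEAR", "AREA"]);

…and then I wrote this Python script…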

SAPHANA_Bottle_Documents.py

from bottle import get, run
import dbapi
import time

@get('/show_query')
def show_form():
    counter = 0
    start = time.clock()
    conn = dbapi.connect('******', 30015, '********', '*******')
    cur = conn.cursor()
    # Note: the type and procedure creation will fail if they already
    # exist, so drop them first when re-running
    queries = ['''create type test_out as table (DOCUMENT_ID VARCHAR(8),
               YEAR VARCHAR(4), AREA VARCHAR(2), AMOUNT BIGINT)''',
               '''create procedure pyTest (OUT out_test TEST_OUT)\
               language sqlscript reads sql data with result view ProcView as\
               begin\
               out_test = CE_CALC_VIEW("_SYS_BIC"."blag/CV_DOCUMENTS",
               ["DOCUMENT_ID","YEAR","AREA","AMOUNT"]);\
               end''']
    for query in queries:
        cur.execute(query)
        conn.commit()
    query = '''select DOCUMENT_ID, YEAR, AREA, AMOUNT from ProcView'''
    cur.execute(query)
    ret = cur.fetchall()
    output = "<DIV ALIGN='CENTER'><TABLE BORDER='1'>"
    output += "<TR BGCOLOR='#B9C9FE'>"
    output += "<TH>Document</TH><TH>Year</TH>"
    output += "<TH>Area</TH><TH>Amount</TH>"
    output += "</TR>"
    for row in ret:
        counter += 1
        document_id = str(row["DOCUMENT_ID"])
        year = str(row["YEAR"])
        area = str(row["AREA"])
        amount = str(row["AMOUNT"])
        output += "<TR BGCOLOR='#E8EDFF'>"
        output += '''<TD>%s</TD><TD>%s</TD>
                     <TD>%s</TD><TD>%s</TD>''' % (document_id, year,
                                                  area, amount)
        output += "</TR>"
    output += "</TABLE>"
    end = time.clock()
    time_taken = end - start
    output += "<H1>SAP HANA - %s records in %s seconds</H1></DIV>"\
    % (counter, time_taken)
    return output

run(host='localhost', port=8080)

After the execution…I couldn’t be happier…here’s the result…

SAP_HANA.png

SAP HANA managed the same 2 million records…generated the same 7669 aggregated records in only 18 seconds…that's twice as fast as PostgreSQL and…well…let's just say…way faster than MySQL…

Now…tell me that SAP HANA is not the coolest and fastest database around…I dare you :)

By doing that little fix on my Python script for SAP HANA…the new processing time, without generating the Bottle table, is…

SAP_HANA_Fixed.png


28 Comments


  1. Nigel James

    Great comparison Alvaro.

    Wow, only twice as fast? I would have expected 1.8 seconds for HANA at least. Perhaps I have drunk too much KoolAid, but it is a tad more than twice the price.

    What hardware were MySQL and PostgreSQL running on? What if you installed them on an SSD?

    What about MongoDB or CouchDB v HANA? It is sold as being able to handle unstructured data also.

    You don't show the CREATE TABLE queries - did you index the primary keys?

    There is always much more than meets the eye in these types of comparisons, but this is excellent and much food for thought.

    1. Alvaro Tejada Galindo Post author

      Nigel:

      It's only twice as fast because it's a plain selection…no optimization done :) and I'm not an SAP HANA expert…so I'm sure a lot of people can do it better than me ;)

      Both MySQL and PostgreSQL are running on my laptop…8GB RAM, SSD, Windows 7, 2.8GHz. SAP HANA is running on Amazon Web Services…

      At the beginning of the blog I explained why I didn't use MongoDB…I just can't understand it :( Same goes for CouchDB and all the No-SQL gang…

      I didn't use any CREATE TABLE statements…I used the graphical interfaces…no primary keys or indices for any of the 3 DBs…I wanted to keep it as primitive as possible…

      I agree with you completely…that's why I used "friendly comparison" in the title…what I did can't be considered a professional benchmark…but it's still fun to see and it was fun to do ;)

      Greetings,

      Blag.

                1. Alvaro Tejada Galindo Post author

                  Uwe:

                  Learning ABAP is like riding a bicycle…you never forget it ;) I have already joined both groups and will take a look at them as soon as I can… :)

                  Greetings,

                  Blag.

    2. Nigel James

      On a second reading of your code, Blag, I would be interested in the time to do the select rather than the time to output the table, so if you could kindly put

          end = time.clock()

          time_taken = end - start

      after

          ret = cur.fetchall()

      and

          cur.execute(query)

      That would be an interesting comparison.

      Cheers,

      Nigel

      (0) 
      1. Alvaro Tejada Galindo Post author

        Done :)

        PostgreSQL -> 28.219

        SAP HANA -> 17.787

        The difference decreased, but…the Python connector for SAP HANA is still in beta…and while PostgreSQL only does a SELECT…SAP HANA is creating a TABLE TYPE and a STORED PROCEDURE, buffering the content of the CALCULATION VIEW in a temporary table and then doing a SELECT on that temp table…way more steps in less time :)

        Greetings,

        Blag.

        1. Nigel James

          So the Python connector is at fault? Is it possible that 16 of those seconds are transporting the data from AWS to your computer?

          What if you had a Postgres server on AWS?

          Just trying to get a Pink Lady v Granny Smith comparison (two types of apples)

          1. Alvaro Tejada Galindo Post author

            I wouldn't say "faulty"…but it hasn't been released to customers yet…I just use it because I like Python :)

            Regarding PostgreSQL on AWS…well…I already have 3 servers…SAP HANA, R and SMP…adding one more would only increase my monthly bill ;)

            I like to believe that I have planted the seed…I will leave it to others to run more comparisons and show the results :)

            Greetings,

            Blag.

      2. Alvaro Tejada Galindo Post author

        Nigel! Sorry man…I failed to see that you were pointing me in the right direction…

        after

            ret = cur.fetchall()

        and

            cur.execute(query)

        I had only done the ret = cur.fetchall() part yesterday…today I did the cur.execute(query) part, and you can really see the difference now ;)

        Next time…I will be more careful with your wise advice! :D

        Greetings,

        Blag.

  2. Former Member

    I dare you!

    After fixing the error in your Python scripts, I loaded the data and ran the same queries in Sybase IQ (v.15.4). This took 6.3 seconds without any tuning. With elementary tuning, I can run it in 5.6 seconds. Of this, 1.7 seconds is taken by sending the results to the client so the actual query processing time is only 3.9 seconds. With more tuning, this can certainly be improved.

    This is on a pretty old 2-core Linux machine.

    The thing is that everyone here always seems to think that the magic words "in-memory" are all it takes. But it's not: IQ will keep this dataset in memory as well; the smart query processing algorithms are what count here. Sybase IQ has been on the market since the mid-1990s and is a proven product.

    1. Alvaro Tejada Galindo Post author

      Alan:

      Thanks for your comment…but…could you be so kind as to point out the errors in my Python scripts? Clearly, I haven't spotted them :) Did you use the same 1 million records for each table? The ones generated by my scripts?

      BTW…as I said already, I didn't want to use any tuning at all…

      Sure, my HANA numbers are slow…but as I always say…I'm not an SAP HANA expert, so I'm sure a lot of people can do it better than me ;)

      I don't have IQ installed, so could you share an image with the processing time?

      On second thought…I moved the end = time.clock() and time_taken = end - start to just after the cur.execute(query), because the ret = cur.fetchall() is on the Python side and the SAP HANA selection has already finished…the new processing time is 1.180 seconds…will update the blog with the new image ;) Maybe I was having network delays caused by my Internet provider…

      Greetings,

      Blag.

      1. Former Member

        Yes, I loaded 1m rows as generated by your scripts into both tables.

        The error comes out because this function is declared with two parameters but called with only one:

        def Generate_File(pSchema, pNumber)

        Generate_File(num_files)

        I'm not sure what you mean by making an IQ image available. I'm running on a local system here. You can download the free IQ Express edition if you want.

          1. Former Member

            Here is one that took 3.6 seconds. This can definitely be improved but I'm not going to spend more time on this now.

            For some reason the image didn't get published. It is in the comment below. Click on the image to get a sharp view.

    1. Alvaro Tejada Galindo Post author

      Abhishek:

      I have never used MongoDB…so I have never done any comparison between it and SAP HANA :) So…no comments on my side…

      Greetings,

      Blag.

      Development Culture.

