MySQL, PostgreSQL and SAP HANA – A friendly comparison

UPDATE! Thanks to a comment by Alan Suffolk, I fixed my script a little bit: I moved the end = time.clock() and time_taken = end - start to just after the cur.execute(query), because that's exactly when the SAP HANA query ends…you can see the new processing time in the image below…

My good friend Pakdi Decnud gave a great idea while we were having lunch the very same day of the SAP CodeJam Montreal event.

Pakdi told me…”Why don’t you make a comparison between SAP HANA and, let’s say…MongoDB?”…I thought that it was of course a great idea…so yesterday I started exploring MongoDB…so you may ask yourselves…”Why then are you talking about MySQL and PostgreSQL?”

Easy answer…and here are my thoughts…

  • I really don’t get MongoDB…the whole NoSQL thing is really alien to me…
  • MongoDB is “document” based, meaning that you create collections of documents, not databases or tables…
  • MongoDB doesn’t support inner joins, and aggregates need a framework that’s even weirder than MongoDB itself…
  • MongoDB is not meant for enterprise applications

That’s why I decided to shift a little bit and grab the two databases most used by start-ups and developers…MySQL and PostgreSQL.

For this blog, I wanted to have a lot of information…so as always, I grabbed my beloved Python and created a little script to generate 1 million records for each of two tables. One script per table.

The structure of the tables is as follows…

DOC_HEADER

Field Name    Data Type   Length
DOCUMENT_ID   VARCHAR     8
YEAR          VARCHAR     4
AREA          VARCHAR     2
NAME          VARCHAR     20
LAST_NAME     VARCHAR     20

DOC_DETAIL


Field Name    Data Type   Length
DOCUMENT_ID   VARCHAR     8
YEAR          VARCHAR     4
AREA          VARCHAR     2
AMOUNT        BIGINT
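For anyone who wants to follow along without a database server, here is a minimal sketch of the two tables using Python’s built-in sqlite3 (SQLite treats the VARCHAR lengths as advisory; the real DDL for MySQL, PostgreSQL or SAP HANA would use the same column definitions):

```python
import sqlite3

# In-memory database just to illustrate the schema described above;
# swap the connection for MySQL, PostgreSQL or SAP HANA as needed.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("""CREATE TABLE DOC_HEADER (
    DOCUMENT_ID VARCHAR(8),
    YEAR        VARCHAR(4),
    AREA        VARCHAR(2),
    NAME        VARCHAR(20),
    LAST_NAME   VARCHAR(20))""")

cur.execute("""CREATE TABLE DOC_DETAIL (
    DOCUMENT_ID VARCHAR(8),
    YEAR        VARCHAR(4),
    AREA        VARCHAR(2),
    AMOUNT      BIGINT)""")

tables = [r[0] for r in cur.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(tables)  # ['DOC_DETAIL', 'DOC_HEADER']
```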

And here are the scripts to generate the 1 million records in a nice .CSV file…

Doc_Header_Generator.py

import random
import csv

names = ["Anne", "Gigi", "Juergen", "Ingo", "Inga", "Alvaro", "Mario",
         "Julien", "Mike", "Michael", "Karin", "Rui", "John", "Rocky", "Sebastian",
         "Kai-Yin", "Hester", "Katrin", "Uwe", "Vitaliy"]

last_names = ["Hardy", "Read", "Schmerder", "Sauerzapf", "Bereza", "Tejada",
              "Herger", "Vayssiere", "Flynn", "Byczkowski", "Schattka",
              "Nogueira", "Mayerhofer", "Ongkowidjojo", "Wieczorek", "Gau", "Hilbrecht",
              "Staehr", "Kylau", "Rudnytskiy"]

area = ["001", "002", "003", "004", "005", "006", "007", "008", "009", "010"]

year = ["2000", "2001", "2002", "2003", "2004", "2005", "2006", "2007",
        "2008", "2009", "2010", "2011", "2012"]

def Generate_File(pNumber):
    iNumber = 0
    c = csv.writer(open("Doc_Header.csv", "wb"))
    while iNumber < pNumber:
        queries = []
        r_doc_id = random.randrange(1, 999999)
        r_names = random.randrange(0, 20)
        r_lastnames = random.randrange(0, 20)
        r_areas = random.randrange(0, 10)
        r_years = random.randrange(0, 13)
        iNumber += 1
        queries.append(r_doc_id)
        queries.append(year[r_years])
        queries.append(area[r_areas])
        queries.append(names[r_names])
        queries.append(last_names[r_lastnames])
        c.writerow(queries)

num_files = input("How many records?: \n")
Generate_File(num_files)

Doc_Detail_Generator.py

import random
import csv

area = ["001", "002", "003", "004", "005", "006", "007", "008", "009", "010"]

year = ["2000", "2001", "2002", "2003", "2004", "2005", "2006", "2007",
        "2008", "2009", "2010", "2011", "2012"]

def Generate_File(pNumber):
    iNumber = 0
    c = csv.writer(open("Doc_Detail.csv", "wb"))
    while iNumber < pNumber:
        queries = []
        r_doc_id = random.randrange(1, 999999)
        r_areas = random.randrange(0, 10)
        r_years = random.randrange(0, 13)
        r_amount = random.randrange(0, 10000)
        iNumber += 1
        queries.append(r_doc_id)
        queries.append(year[r_years])
        queries.append(area[r_areas])
        queries.append(r_amount)
        c.writerow(queries)

num_files = input("How many records?: \n")
Generate_File(num_files)
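The scripts above are Python 2 (binary-mode csv files, eval-style input()). For readers on Python 3, a sketch of the same generator loop; the file name and the record count of 100 are illustrative parameters, not values from the original scripts:

```python
import csv
import random

area = ["%03d" % i for i in range(1, 11)]    # "001" .. "010"
year = [str(y) for y in range(2000, 2013)]   # "2000" .. "2012"

def generate_file(filename, number):
    # newline="" is the Python 3 idiom for csv files (was mode "wb" in Python 2)
    with open(filename, "w", newline="") as f:
        writer = csv.writer(f)
        for _ in range(number):
            writer.writerow([random.randrange(1, 999999),
                             random.choice(year),
                             random.choice(area),
                             random.randrange(0, 10000)])

generate_file("Doc_Detail.csv", 100)
```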

With the two files ready, I uploaded them to MySQL, PostgreSQL and SAP HANA.

To measure the speed, I created three Python scripts using…yes…Bottle again.

The basic idea is to join the two tables, select the Document_Id, Year, Area and the sum of Amount.
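That join-and-aggregate pattern can be sketched end to end with Python’s built-in sqlite3 standing in for the real servers (the table contents here are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE DOC_HEADER (DOCUMENT_ID, YEAR, AREA, NAME, LAST_NAME)")
cur.execute("CREATE TABLE DOC_DETAIL (DOCUMENT_ID, YEAR, AREA, AMOUNT)")

cur.execute("INSERT INTO DOC_HEADER VALUES ('42', '2012', '001', 'Alvaro', 'Tejada')")
cur.executemany("INSERT INTO DOC_DETAIL VALUES (?, ?, ?, ?)",
                [('42', '2012', '001', 100),
                 ('42', '2012', '001', 250),
                 ('42', '2011', '002', 999)])  # no matching header row

# Join on document, year and area, then sum the amounts per group
rows = cur.execute("""SELECT A.DOCUMENT_ID, A.YEAR, A.AREA, SUM(B.AMOUNT)
                      FROM DOC_HEADER AS A
                      INNER JOIN DOC_DETAIL AS B
                        ON A.DOCUMENT_ID = B.DOCUMENT_ID
                       AND A.YEAR = B.YEAR
                       AND A.AREA = B.AREA
                      GROUP BY A.DOCUMENT_ID, A.YEAR, A.AREA""").fetchall()
print(rows)  # [('42', '2012', '001', 350)]
```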

Let’s start with the MySQL Script…

MySQL_Bottle_Documents.py

from bottle import get, run
import mysql.connector
import time

@get('/show_query')
def show_form():
    counter = 0
    start = time.clock()
    conn = mysql.connector.Connect(host='localhost', user='root',
                                   password='root', database='P075400')
    cur = conn.cursor()
    query = '''SELECT A.DOCUMENT_ID, A.YEAR, A.AREA, SUM(B.AMOUNT)
                  FROM DOC_HEADER AS A INNER JOIN DOC_DETAIL AS B
                  ON A.DOCUMENT_ID = B.DOCUMENT_ID
                      AND A.YEAR = B.YEAR
                      AND A.AREA = B.AREA
                  GROUP BY A.DOCUMENT_ID, A.YEAR, A.AREA'''
    cur.execute(query)
    output = "<DIV ALIGN='CENTER'><TABLE BORDER='1'>"
    output += "<TR BGCOLOR='#B9C9FE'>"
    output += "<TH>Document</TH><TH>Year</TH>"
    output += "<TH>Area</TH><TH>Amount</TH>"
    output += "</TR>"
    for row in cur:
        counter += 1
        document_id = str(row[0])
        year = str(row[1])
        area = str(row[2])
        amount = str(row[3])
        output += "<TR BGCOLOR='#E8EDFF'>"
        output += '''<TD>%s</TD><TD>%s</TD>
                     <TD>%s</TD><TD>%s</TD>''' % (document_id, year,
                                                  area, amount)
        output += "</TR>"
    output += "</TABLE>"
    end = time.clock()
    time_taken = end - start
    output += "<H1>MySQL - %s records in %s seconds</H1></DIV>"\
    % (counter, time_taken)
    return output

run(host='localhost', port=8080)

I let the script run…and after more than one hour…I simply got bored and interrupted the process…

So, I continued with PostgreSQL…
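One plausible cause for MySQL’s behaviour (a commenter below asks the same question) is that the join columns carried no indexes, which forces a scan of 1 million detail rows for every header row. A hedged sketch of the DDL one would add before rerunning the query; sqlite3 stands in for MySQL here, and the index names are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE DOC_HEADER (DOCUMENT_ID, YEAR, AREA, NAME, LAST_NAME)")
cur.execute("CREATE TABLE DOC_DETAIL (DOCUMENT_ID, YEAR, AREA, AMOUNT)")

# One composite index per table on the three join columns;
# the same CREATE INDEX statements work on MySQL.
cur.execute("CREATE INDEX idx_header ON DOC_HEADER (DOCUMENT_ID, YEAR, AREA)")
cur.execute("CREATE INDEX idx_detail ON DOC_DETAIL (DOCUMENT_ID, YEAR, AREA)")

index_names = sorted(r[0] for r in cur.execute(
    "SELECT name FROM sqlite_master WHERE type='index'"))
print(index_names)  # ['idx_detail', 'idx_header']
```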

PostgreSQL_Bottle_Documents.py

from bottle import get, run
import psycopg2
import time

@get('/show_query')
def show_form():
    counter = 0
    start = time.clock()
    conn = psycopg2.connect("dbname=P075400 user=postgres password=root")
    cur = conn.cursor()
    query = '''SELECT "DOC_HEADER"."DOCUMENT_ID", "DOC_HEADER"."YEAR",
                "DOC_HEADER"."AREA", SUM("DOC_DETAIL"."AMOUNT") FROM
                public."DOC_HEADER", public."DOC_DETAIL" WHERE
                "DOC_HEADER"."DOCUMENT_ID" = "DOC_DETAIL"."DOCUMENT_ID"
                AND "DOC_HEADER"."YEAR" = "DOC_DETAIL"."YEAR"
                AND "DOC_HEADER"."AREA" = "DOC_DETAIL"."AREA"
                GROUP BY "DOC_HEADER"."DOCUMENT_ID", "DOC_HEADER"."YEAR",
                "DOC_HEADER"."AREA"'''
    cur.execute(query)
    output = "<DIV ALIGN='CENTER'><TABLE BORDER='1'>"
    output += "<TR BGCOLOR='#B9C9FE'>"
    output += "<TH>Document</TH><TH>Year</TH>"
    output += "<TH>Area</TH><TH>Amount</TH>"
    output += "</TR>"
    for row in cur:
        counter += 1
        document_id = str(row[0])
        year = str(row[1])
        area = str(row[2])
        amount = str(row[3])
        output += "<TR BGCOLOR='#E8EDFF'>"
        output += '''<TD>%s</TD><TD>%s</TD>
                     <TD>%s</TD><TD>%s</TD>''' % (document_id, year,
                                                  area, amount)
        output += "</TR>"
    output += "</TABLE>"
    end = time.clock()
    time_taken = end - start
    output += "<H1>PostgreSQL - %s records in %s seconds</H1></DIV>"\
    % (counter, time_taken)
    return output

run(host='localhost', port=8080)

This time…I was lucky…

PostgreSQL.png

Out of 2 million records, PostgreSQL managed to aggregate the amount field and generate 7669 records in 36 seconds…not bad at all…

For SAP HANA, I decided to take full advantage of Calculation Views, so I created the following…

CV_Documents.png

I joined both tables, used a projection, applied the aggregation and specified the result…then I wrote this Python script…

SAPHANA_Bottle_Documents.py

from bottle import get, run
import dbapi
import time

@get('/show_query')
def show_form():
    counter = 0
    start = time.clock()
    conn = dbapi.connect('******', 30015, '********', '*******')
    cur = conn.cursor()
    queries = ['''create type test_out as table (DOCUMENT_ID VARCHAR(8),
               YEAR VARCHAR(4), AREA VARCHAR(2), AMOUNT BIGINT)''',
               '''create procedure pyTest (OUT out_test TEST_OUT)\
               language sqlscript reads sql data with result view ProcView as\
               begin\
               out_test = CE_CALC_VIEW("_SYS_BIC"."blag/CV_DOCUMENTS",
               ["DOCUMENT_ID","YEAR","AREA","AMOUNT"]);\
               end''']
    for query in queries:
        cur.execute(query)
        conn.commit()
    query = '''select DOCUMENT_ID, YEAR, AREA, AMOUNT from ProcView'''
    cur.execute(query)
    end = time.clock()
    time_taken = end - start
    ret = cur.fetchall()
    output = "<DIV ALIGN='CENTER'><TABLE BORDER='1'>"
    output += "<TR BGCOLOR='#B9C9FE'>"
    output += "<TH>Document</TH><TH>Year</TH>"
    output += "<TH>Area</TH><TH>Amount</TH>"
    output += "</TR>"
    for row in ret:
        counter += 1
        document_id = str(row["DOCUMENT_ID"])
        year = str(row["YEAR"])
        area = str(row["AREA"])
        amount = str(row["AMOUNT"])
        output += "<TR BGCOLOR='#E8EDFF'>"
        output += '''<TD>%s</TD><TD>%s</TD>
                     <TD>%s</TD><TD>%s</TD>''' % (document_id, year,
                                                  area, amount)
        output += "</TR>"
    output += "</TABLE>"
    output += "<H1>SAP HANA - %s records in %s seconds</H1></DIV>"\
    % (counter, time_taken)
    return output

run(host='localhost', port=8080)

After the execution…I couldn’t be happier…here’s the result…

SAP_HANA.png

SAP HANA managed the same 2 million records…generating the same 7669 aggregated records in only 18 seconds…that’s twice as fast as PostgreSQL and…well…let’s just say…way faster than MySQL…

Now…tell me that SAP HANA is not the coolest and fastest Database around…I dare you 🙂

By doing that little fix in my Python script for SAP HANA…the new processing time, without generating the Bottle table, is…

SAP_HANA_Fixed.png
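The timing fix described in the update above (measure up to cur.execute(), leave the client-side fetchall() out of the figure) can be sketched like this; sqlite3 stands in for the HANA dbapi module, and time.perf_counter() replaces time.clock(), which was removed in Python 3.8:

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE T (N)")
cur.executemany("INSERT INTO T VALUES (?)", [(i,) for i in range(10000)])

start = time.perf_counter()
cur.execute("SELECT N, N * 2 FROM T")
exec_time = time.perf_counter() - start    # attributed to the database

start = time.perf_counter()
rows = cur.fetchall()
fetch_time = time.perf_counter() - start   # client-side transfer cost
# (SQLite streams rows lazily, so on a real client/server database
#  the split between the two phases is even cleaner.)

print(len(rows))  # 10000
```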

28 Comments
  • Great comparison Alvaro.

    Wow, only twice as fast? I would have expected 1.8 seconds for HANA at least. Perhaps I have drunk too much KoolAid, but it is a tad more than twice the price.

    What hardware were mysql and PostgreSQL running on? What if you installed them on an SSD ?

    What about MongoDB or couchDB v HANA? It is sold as being able to handle unstructured data also.

    You don’t show the CREATE TABLE queries – did you index the primary keys?

    There is always much more than meets the eye in these types of comparisons, but this is excellent and much food for thought.

  • I dare you!

    After fixing the error in your Python scripts, I loaded the data and ran the same queries in Sybase IQ (v.15.4). This took 6.3 seconds without any tuning. With elementary tuning, I can run it in 5.6 seconds. Of this, 1.7 seconds is taken by sending the results to the client so the actual query processing time is only 3.9 seconds. With more tuning, this can certainly be improved.

    This is on a pretty old 2-core Linux machine.

    The thing is that everyone here always seems to think that the magic words “in-memory” are what it is all about. But it’s not: IQ will keep this dataset in memory as well; the smart query processing algorithms are what counts here. Sybase IQ has been on the market since the mid-1990s and is a proven product.

    • Alan:

      Thanks for your comment…but…could you be so kind as to point out the errors in my Python scripts? Clearly, I haven’t spotted them 🙂 Did you use the same 1 million records for each table? The ones generated by my scripts?

      BTW…as I said already, I didn’t want to use any tuning at all…

      Sure, my HANA numbers are slow…but as I always say…I’m not an SAP HANA Expert, so I’m sure a lot of people can do it better than me 😉

      I don’t have IQ installed, so could you share an image with the processing time?

      On second thought…I moved the end = time.clock() and time_taken = end - start to just after the cur.execute(query), because the ret = cur.fetchall() happens on the Python side, after the SAP HANA selection has already finished…the new processing time is 1.180 seconds…I will update the blog with the new image 😉 Maybe I was having network delays caused by my Internet provider…

      Greetings,

      Blag.

      • Yes, I loaded 1m rows as generated by your scripts into both tables.

        The error comes out because this function is declared with two parameters but called with only one:

        def Generate_File(pSchema, pNumber)

        Generate_File(num_files)

        I’m not sure what you mean by making an IQ image available. I’m running on a local system here. You can download the free IQ Express edition if you want.