Since its introduction in 2011, SAP tries to push HANA very heavily and there is a lot of marketing buzz over this new product. For a freelance consultant focused on SAP Sybase database products, like me, it is next to impossible to ignore HANA in year 2013. So, I decided not to rely to marketing slogans and check what HANA is, what it can do, and, importantly, what HANA is NOT. I put my first impressions to this blog post; hopefully other HANA-related posts will follow. Note that I’m not a HANA expert (yet 🙂 ) and I’m writing these rows as a person with a lot of experience with IQ and some other RDBMSs and trying to learn HANA.
So, why to compare HANA and IQ? Both are designed for data warehouse environment, both are column-based (with some support of row-based data), both provide a data compression out-of-the-box and highly-parallel. Years ago, much like SAP for HANA today, Sybase claimed that IQ processed data so fast that aggregation tables are not really needed, because the aggregations can be just performed on-the-fly. Well, experience with a number of big projects showed me how problematic that statement was, and it is only a single example.
According to SAP, the strong point of HANA is its ability to utilize CPU cache, which is much faster than accessing the main memory (0.5 – 15 ns. vs. 100 ns.). Currently, IQ and other Sybase RDBMSs lack this capability. Therefore, I decided to build a test environment which allows performing of queries that answer a number of conditions:
- The query should be performed fully in memory, otherwise it is not fare to compare IQ and HANA. In HANA, queries are executed fully in memory, once relevant columns are loaded into the RAM.
- The query should perform a lot of logical I/Os and should be hardly optimizable using indexes. Otherwise, the effect of using CPU cache may be not clear enough.
- The query should take at least a number of seconds to finish. Since both IQ and HANA (very unfortunately) don’t provide the number of logical I/Os performed by a query, we may compare response times only. If the query finishes in small milliseconds, the comparison of response times may be problematic.
Some notes about the test environment:
For IQ, I used 16-core RHEL server with hyper-threading turned on (32 cores visible to OS) and 140GB RAM available. I used IQ 16.0 SP01 for my tests.
For HANA, I had to use HANA SPS6 Developer Edition on a Cloudshare VM, which provides HANA on a Linux server with 24GB RAM. However, only 19.5 GB is actually available from the Linux point of view (free –m output) and most of this memory is allocated by various HANA processes. In fact, less than 3GB RAM is available for user data in HANA. I only wish that SAP would allow us to download HANA and install it on any server that answers to HANA’s requirements for CPUs, but it seems that the SAP’s policy is to distribute HANA as a part of appliances only, so I don’t expect free HANA download any time soon.
This brings us to an additional requirement for the test: the test dataset should be relatively small, because of severe RAM restrictions imposed by HANA Developer Edition on Cloudshare.
Finally, I decided to base my tests on a relatively narrow table that represents information about phone calls (for those involved in Telecom industry, it is like short and very much simplified CDRs). Here is the structure of the table:
create table CDRs (
CDR_ID unsigned bigint, — Phone conversation ID
CC_ORIG varchar(3), — Country code of the call originatior
AC_ORIG varchar(2), — Area code of the call originatior
NUM_ORIG varchar(15), — Phone number of the call originatior
CC_DEST varchar(3), — Country code of the call destination
AC_DEST varchar(2), — Area code of the call destination
NUM_DEST varchar(15), — Phone number of the call destination
STARTTIME datetime, — Start time of the conversation
ENDTIME datetime, — End time of the conversation
DURATION unsigned int — Duration of the conversation in seconds
I developed a stored procedure that fills this table in SAP Sybase ASE row-by-row according to some meaningful logic and prepared delimited files for IQ and HANA. The input files are available upon request. At first, I planned to run tests on a dataset with 900 million rows, but I finally discovered that I have to go down to 15 million rows because of the VM memory limitations mentioned above.
Important note about the terminology. In IQ, inserting of the data from a delimited file into a database table is called LOAD, and retrieving of the data from a table to a delimited file is called EXTRACT. In HANA, the inserting is called IMPORT and the retrieving is called EXPORT. The term LOAD in HANA has a totally different meaning – it means loading of a whole table, or some of its columns, to the memory from disk, when the data is already in the database.
IMPORT functionality in HANA is not similar to IQ, at all. Actually, it contains two phases: IMPORT and MERGE. During the first phase, the data is imported to a “delta store” in an uncompressed form. Then, the data from the “delta store” is merged into “main store”, where the table data is actually resided. The merge is performed automatically, when a configurable threshold is crossed (for example, the size of the “delta store” becomes too big). To ensure that the imported data is fully inside the “main store”, a manual MERGE may be required. The memory requirements during the MERGE process are quite interesting, maybe I will write about it in a different post. It is pretty much possible that you will be able to IMPORT the data, but will not have enough memory to MERGE it; it happened to me a number of times during my tests. I would recommend you to read more about HANA architecture here: http://www.saphana.com/docs/DOC-1073, Chapter 9.
Given the significant difference between the test systems (a powerful dedicated server for IQ vs. small VM for HANA), I didn’t plan to compare the data load performance between IQ and HANA. However, so far I see HANA performing the IMPORT using not more than 1.5 core of 4 available, thus underutilizing the available hardware. The MERGE phase, though, is executed in a much more parallel way. The bottom line is that IQ seems outperform HANA in data loading, possibly quite by far. I will probably return to this topic in one of following posts, additional tests with larger dataset are required.
Now, we come to the data compression. Since IQ and HANA approach the indexing quite differently, I chose to compare the compression without non-default indexes in both IQ and HANA. It appears that IQ provides better data compression and needs 591M to store 15,000,000 rows, while HANA needs 748M to store the same data. HANA provides a number of compression algorithms for columns, which are chosen automatically, according to the data type and data distribution. However, it seems that neither of compression algorithms offered by HANA contains LZW-like compression used by IQ. I’d prefer to test the compression on a more representative data set (15,000,000 is way too small) and play with different HANA compression algorithms. I hope one of future posts will be dedicated to this topic.
Finally, the data is inside the database and we are ready to query it. To answer the test conditions mentioned above, I chose the following query:
a.CDR_ID CDR_ID_1, b.CDR_ID CDR_ID_2,
a.NUM_ORIG NUM_A, a.NUM_DEST NUM_B, a.STARTTIME STARTTIME_1, a.ENDTIME ENDTIME_1,
b.NUM_DEST NUM_C, b.STARTTIME STARTTIME_2, b.ENDTIME ENDTIME_2,
from CDRs a, CDRs b
where a.NUM_DEST = b.NUM_ORIG
and datediff(ss, a.ENDTIME, b.STARTTIME) between 5 and 60
order by a.STARTTIME;
This query finds cases when a person A called person B and then the person B called person C almost immediately (in 60 seconds). This query has to perform a lot of logical I/O by its very definition. With my test data set, this query returns 31 rows.
In IQ, this query takes 6.6 seconds while executed fully in memory and when all relevant indexes are in place. The query uses sort-merge join and runs with relatively high degree of parallelism, allocating about 60% of 32 CPU cores available.
In HANA, the same query takes only 1 second with no indexes in place! Remember, that in my tests HANA is running on a small VM with just 4 virtual CPU cores! The query finishes so fast that I cannot measure the degree of parallelism. Creation of indexes on NUM_ORIG and NUM_DEST reduces the response time to 900 ms.
A note about indexes in HANA: HANA offers only two index types and, by default, it chooses the index type automatically. In my tests, I have found that indexes improve query performance in HANA, sometimes significantly. Unfortunately, I have not found any indication of index usage in HANA query plans, even when some indexes were used by the query for sure. The role of the optimizer statistics in the query plan generation is also not very clear to me. I hope to prepare a separate post about query processing in HANA, stay tuned!
Another amazing and totally unexpected finding in HANA – index creation on NUM_DEST (varchar(15)) takes 194 ms. Index on DURATION (int) is created in 12ms!
My conclusions so far:
- HANA in-memory processing is not just about caching, it is much more than that. HANA allows us to achieve incredible performance for resource-intensive queries. Things that seem impossible with other databases, column-based or row-based, may become possible with HANA.
- Loading of the data into HANA requires careful resource and capacity planning. Merging of the inserted data with the rest of the table may require much more memory that you have probably thought. Particularly, to perform the merge, both old and new version of the table should fit into memory.
- It is pretty much possible that storing of aggregations in HANA is not required indeed, at least in most of the cases. Of course, I need a more representive result set to verify it.
- IQ and HANA can be used together in the same system, where they can solve different problems and store different data. HANA is very good for real-time data processing, or for queries that must be executed very quickly. However, it is not feasible to store the whole multi-terabyte data warehouse in HANA’s memory in most of the cases, at least not in year 2013. At this point, IQ enters into the game. It is very efficient in massive data loading and data storage, and can answer queries with less strict response time requirements very efficiently. In some scenarios, the raw data can be loaded into IQ, and then, after some refining inside IQ, imported into HANA
This post has been originally published in my personal blog at http://leonid-gvirtz.typepad.com/blog.