The Column Store VIP paradox, lazy SAP HANA startup, and other tales
First of all, please accept my most humble apologies for not having written to you in ages… I’ve been awfully busy with N projects in parallel processing (including a short and definitely deserved vacation), and to make things a bit dramatic, I’ve been having a kind of TBWIB (Temporary Blog Writing and Inspiration Block).
While getting ready for my next tour de force, I just stumbled upon a couple of quirky curiosities I’d like to share with you.
Probably you have heard, not for the first time, that one of the architecture VIPs in the SAP HANA saga is the famous, unmatched Column Store with its beautiful dictionary and index vectors, full of highly compressible integers.
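For those who like to see the dictionary and the index vector in the flesh, here is a minimal toy sketch in Python. It is only an illustration of the general idea of dictionary encoding, not SAP HANA's actual implementation; the function names and the sample column are made up for this example.

```python
from typing import List, Tuple


def dictionary_encode(values: List[str]) -> Tuple[List[str], List[int]]:
    """Split a column into a sorted dictionary and an index vector of value IDs."""
    # The dictionary holds each distinct value exactly once, in sorted order.
    dictionary = sorted(set(values))
    # Map every distinct value to its position (value ID) in the dictionary.
    value_id = {value: i for i, value in enumerate(dictionary)}
    # The index vector replaces each cell by its small integer value ID.
    index_vector = [value_id[value] for value in values]
    return dictionary, index_vector


def dictionary_decode(dictionary: List[str], index_vector: List[int]) -> List[str]:
    """Rebuild the original column from dictionary and index vector."""
    return [dictionary[i] for i in index_vector]


# Hypothetical sample column with many repeated values.
city_column = ["Berlin", "Walldorf", "Berlin", "Paris", "Walldorf", "Berlin"]
dictionary, index_vector = dictionary_encode(city_column)

print(dictionary)    # distinct values, stored once
print(index_vector)  # one small integer per row
```

The index vector contains nothing but small integers with lots of repetition, which is exactly the kind of data that compresses well.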
Truth is, SAP HANA is not only about column storage and data compression, but also about on-the-fly data aggregation in-memory vs. old-fashioned materialized aggregation, about high performance engine optimization for calculations, and about taking advantage of high parallel processing capabilities, beyond pure load balancing for high user load.
All this only to start with…
SAP HANA is also about In-Memory computing (surprise, surprise) in the sense that it minimizes the transport of data among the different IT architecture layers, compared to traditional systems, where data needs to be transported from disk, to the application, to RAM, … and finally to the CPU cache, where all calculations ultimately take place (also for old-fashioned, traditional systems).
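SAP HANA keeps the disk layer out of the regular query path. Disk is still there for persistence (savepoints and logs, more on that below), and for backup and archiving operations.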
Well, well, and what happens when you, dear administrator, happen to restart SAP HANA?
- Will everything get automatically loaded into main memory?
- Will only Column Store tables get loaded into main memory?
- Will only the data needed by some queries get loaded into main memory?
- Will row store data remain confined to the chasms of disk storage?
The answer is… Jein (German for “yes and no”).
During SAP HANA database startup (a restart may take about 10 minutes), some optimization takes place to keep startup fast. Only essential data is loaded into memory directly. This includes relevant query data, which may be only a part of the table data (remember, “select * from” queries tend to be crappy… we need to be much more selective if we want to ensure high query performance).
During startup, the last savepoint must be restored and logs must be redone…
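To make “restore the last savepoint, then redo the logs” a bit more tangible, here is a small toy model in Python. It assumes a very simplified world where the database is a key/value dictionary and every log entry carries a sequence number; the names `recover`, `savepoint_seq` and the sample data are hypothetical and have nothing to do with HANA's real internals.

```python
from typing import Dict, List, Tuple

# A log entry in this toy model: (sequence number, key, new value).
LogEntry = Tuple[int, str, int]


def recover(savepoint: Dict[str, int],
            savepoint_seq: int,
            redo_log: List[LogEntry]) -> Dict[str, int]:
    """Rebuild the in-memory state from a savepoint plus the redo log."""
    # Step 1: start from a copy of the last savepoint image.
    state = dict(savepoint)
    # Step 2: replay, in order, only the changes made after the savepoint.
    for seq, key, value in sorted(redo_log):
        if seq > savepoint_seq:
            state[key] = value
    return state


# Hypothetical data: the savepoint already contains log entries 1 and 2.
savepoint = {"stock_A": 10, "stock_B": 5}
redo_log = [
    (1, "stock_A", 10),
    (2, "stock_B", 5),
    (3, "stock_A", 7),   # changed after the savepoint
    (4, "stock_C", 1),   # inserted after the savepoint
]

recovered = recover(savepoint, savepoint_seq=2, redo_log=redo_log)
print(recovered)
```

The more recent the savepoint, the fewer log entries have to be replayed, which is one reason why the restart time depends on what happened shortly before the shutdown.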
There is also a pre-load option for entire column store tables or individual columns, which allows you to mark important data containers for in-memory pre-load after startup.
What may sound a bit paradoxical (but is not) is that while column store data is only “lazily” loaded into memory on demand, row store data is fully loaded into memory during startup, and stays there.
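The difference between eager row store loading, the pre-load option, and lazy column loading can be sketched in a few lines of Python. This is only a toy model under my own assumptions: the class `ToyStartup`, its parameters and the container names are invented for the illustration and are not an SAP HANA API.

```python
from typing import Dict, List, Set


class ToyStartup:
    """Toy model: row store and pre-load containers are loaded at startup,
    every other column is loaded on first access."""

    def __init__(self, disk: Dict[str, List[int]],
                 row_store: Set[str], preload: Set[str]) -> None:
        self.disk = disk                       # persisted data ("on disk")
        self.memory: Dict[str, List[int]] = {}  # what is currently in RAM
        # Eager part of the startup: row store plus everything marked for pre-load.
        for name in row_store | preload:
            self._load(name)

    def _load(self, name: str) -> None:
        # Copy the persisted data into main memory.
        self.memory[name] = list(self.disk[name])

    def read(self, name: str) -> List[int]:
        # Lazy part: a column is loaded only when a query first touches it.
        if name not in self.memory:
            self._load(name)
        return self.memory[name]


# Hypothetical persisted containers.
disk = {
    "row.config": [1, 2],
    "col.sales.amount": [100, 200, 300],
    "col.sales.region": [0, 1, 0],
    "col.archive.notes": [9],
}

db = ToyStartup(disk, row_store={"row.config"}, preload={"col.sales.amount"})
loaded_after_startup = sorted(db.memory)

db.read("col.sales.region")  # first query touching this column
loaded_after_query = sorted(db.memory)

print(loaded_after_startup)
print(loaded_after_query)
```

Right after startup only the row store table and the pre-load column are in memory. The region column arrives with the first query that needs it, and the archive column stays on disk until somebody asks for it.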
By the way, dictionary tables (which are the mother of Column Store) are row store tables. This may sound a bit funny, but is quite logical, and allows us to, e.g. avoid chicken-and-eggy, infinite regressy issues.
The startup phase is a slightly anomalous phase… Later on, after a somewhat slower start, everything will be in-memory business as usual.
As you can see, Column Store may be one of the VIPs in HANA town, but it has no monopoly on memory usage, and it could not work alone anyway…
It is like in real life, and real business: Great achievements are always the result of several people’s ideas and work, beyond space-time limitations. In one way or the other we are working together not only with our teams and colleagues and friends, but also with our ancestors or with people we have never seen in our lives, but whom we may have felt, read or listened to.
Founder and Managing Director
Great article. Thanks
I am happy to hear that. Thanks, Chris.
Let's see if I can find similar dark stuff to write about soon. I'll try.