Last week I shared my thoughts on how and why SAP HANA is the one big invention in IT of the last few years. I also tried to make my point on why HANA could need some help to unfold its full potential – read here (Why SAP HANA needs help to unfold its full potential).
This time I’d like to give a technical deep dive into it. My attempt will be to write a technical document on a complex and difficult topic in a way that people without a master’s degree in computer science can understand.
Of course this ambition must fail in one way or the other. I will either lose my audience along the way if I am too accurate, or, if I stay too much on the surface, I will fail my own ambition of explaining how SAP HANA works on the inside. So prepare to either twist your brain into a Gordian knot or to find yourself with yet another technical document about SAP HANA that does not really explain it.
The basic principles
SAP HANA is an appliance. That means, if you buy SAP HANA, you get a box of certified hardware with the software pre-installed on it. Currently you can get the appliance from IBM, HP, Cisco and Fujitsu-Siemens. All vendors offer HANA in three sizes (S – Small, M – Medium and L – Large). Which box is right for you of course depends on what you intend to do with it – or, more accurately, on how much data you need to store and work on.
With SAP HANA, all data in the database and all applications that work on that data are kept in RAM at all times. This speeds up computing tremendously because no computing time is wasted waiting for data to be loaded from disk into RAM. With all data in RAM, the only wait cycles occur while data is loaded from RAM into the CPU cache.
The organizational structure of the data storage in that RAM-based database comes with another new and inventive concept. It is no longer that of a normal relational table with rows and columns; instead, the system stores all data column-wise. Take a look at figure 1 to see why that matters.
Figure 1 – differences in data storage between a table, a row store and a column store (taken from the SAP HANA developer guide)
As you can see, if data is stored in a column store, the CPU can access it from adjacent memory locations and work on it directly. You can also apply compression algorithms to it, because a single column typically contains many duplicate values.
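To make this a bit more concrete, here is a toy Python sketch – not HANA code, just the general idea – of how a column store lays the values of one attribute next to each other and how dictionary encoding compresses the duplicates:

```python
# Toy illustration of column-wise storage with dictionary encoding.
# This is NOT HANA's actual implementation, just the principle.

rows = [
    ("Smith", "DE", 100),
    ("Jones", "DE", 200),
    ("Smith", "US", 150),
]

# Row store: each record's fields sit next to each other in memory.
row_store = rows

# Column store: all values of one attribute sit next to each other,
# so a scan over one attribute touches contiguous memory.
names, countries, amounts = (list(col) for col in zip(*rows))

def dictionary_encode(column):
    """Store each distinct value once; the column becomes small integer codes."""
    dictionary = sorted(set(column))
    codes = [dictionary.index(value) for value in column]
    return dictionary, codes

name_dict, name_codes = dictionary_encode(names)
print(name_dict)   # ['Jones', 'Smith']
print(name_codes)  # [1, 0, 1]
```

The duplicate "Smith" is stored only once; the column itself shrinks to a list of small integers, which is exactly the kind of compression the duplicate-heavy columns mentioned above make possible.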
The previous points apply generically to any in-memory computing platform. But as said, HANA is an appliance, and that means the environment is predictable for the software – predictable down to the bits. The software kernel (that is, the database kernel as well as the software and script compiler above it) knows exactly how much data can be put in any RAM location and what the fastest access path to those locations is, with respect to the hardware layout. With this knowledge, it is possible to further optimize the distribution of the workload over the multiple cores of the box. Take a look at figure 2 to see how this can be used effectively.
Figure 2 – distribution of column stores to different CPU cores (taken from HANA developer guide)
Imagine this: you have millions upon millions of data points to compute and analyze. This data could come from financial transactions, or from geological or even meteorological surveys. Whatever the source, the data has something in common: it is too big to fit into a single RAM module – even the biggest one. Because of this you will almost always need to split the data and distribute it over multiple RAM locations. With SAP HANA this is not a problem at all. It is, in fact, perfectly in sync with the basic architecture of HANA itself.
HANA makes use of massively parallel computing by distributing the data to work on over different CPU cores. This distribution can be done on a column-by-column basis, in which column A would be processed by core 1, column B by core 2 and so forth.
The distribution can also happen within one column. In the above figure, you can see that core 3 works on one part of the data, while core 4 processes a different part of the same column. This massively parallel distribution of work speeds up work on big data tremendously.
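The intra-column case can be sketched in a few lines of Python. Again, this is only an illustration of the principle, not HANA internals: one column is split into chunks, each chunk is aggregated by a separate worker (standing in for a CPU core), and the partial results are combined at the end.

```python
# Sketch of intra-column parallelism (illustration only, not HANA code).
# Threads stand in for CPU cores; HANA itself runs native threads
# pinned across the cores of the box.
from concurrent.futures import ThreadPoolExecutor

def parallel_sum(column, workers=4):
    """Split one column into chunks, aggregate each chunk in its own
    worker, then combine the partial results."""
    chunk_size = (len(column) + workers - 1) // workers
    chunks = [column[i:i + chunk_size]
              for i in range(0, len(column), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(sum, chunks))
    return sum(partials)

column = list(range(100_000))
print(parallel_sum(column) == sum(column))  # True
```

In the figure above, cores 3 and 4 each play the role of one of these workers on different parts of the same column.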
Together, these architectural concepts give HANA a considerable head start over any other database on the market.
The basic architecture
The software part of SAP HANA is comprised of individual components. There are components for session management, for SQL and MDX statements, for calculation and of course for accessing the data, either column- or row-wise. Finally you have the usual suspects: the replication server and the transaction, authorization and metadata managers.
Figure 3 – basic architecture of SAP HANA
A request comes in either as an SQL script or as an MDX request. The SQL parser or the calc engine then determines the required access optimization and redirects the request to one of the relational engines. Which one depends on the data storage in question: as said, HANA usually stores all data in a column store, but you can also choose to store data in a row fashion, although this is usually much slower. The relational engine accesses the persistence layer via the page manager, which provides the actual memory locations.
Size does matter
HANA was developed to distribute the workload over multiple cores and between multiple server systems.
Figure 4 – the complete architecture
As you can see, if one HANA appliance is too small for your particular data size, you can have multiple boxes connected to each other. This connection is made with super-fast fibre channel links with a TCP protocol on top. But of course, accessing data in the memory of box A from a CPU in box B is something completely different from the case where RAM and CPU are in the same box.
Therefore you will always lose performance in cases where you have to distribute your data over multiple boxes. To stay with my above example of meteorological data, there could be terabytes of data to analyze. You will most probably need more than one box to load all that data into RAM. Now, if you happen to need to compare or analyze data that sits on different boxes, the request has to leave one box and fetch data from another. That means you are no longer talking in nanoseconds, but in microseconds or even milliseconds – several orders of magnitude slower!
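The trade-off can be sketched as follows (a toy illustration with invented numbers, not a HANA mechanism): data is range-partitioned across two "boxes", a query confined to one partition stays local, and a query that compares values from both partitions must first ship data across the network.

```python
# Sketch of the scale-out trade-off (illustration only, made-up data).
# Yearly rainfall values are range-partitioned across two "boxes".

box_a = {"year": [2010, 2011], "rainfall": [830, 910]}
box_b = {"year": [2012, 2013], "rainfall": [760, 1005]}

# Local query: everything needed lives on box A -> RAM speed.
local_max = max(box_a["rainfall"])

# Cross-box query: box B's values have to travel over the network to
# box A (or vice versa) before the CPUs can compare them - this
# transfer is the orders-of-magnitude penalty described above.
gathered = box_a["rainfall"] + box_b["rainfall"]  # simulated network transfer
global_max = max(gathered)
print(local_max, global_max)  # 910 1005
```

The arithmetic is trivial; the point is that the second query pays a network round trip the first one never sees, which is why data that is compared together should be kept on the same box whenever possible.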
Sounds bad, but in fact it is not that bad at all. In most of the usual situations, distributing data and computing over multiple boxes may even give you an extra performance boost, as you can share the workload. It is only in those rare cases where values that should be read and compared together – and therefore kept close to each other – really have to be spread across boxes that this becomes a problem.
But it shows one very important aspect that needs to be considered when buying an SAP HANA system – sizing. Make sure the size of your HANA appliances accurately matches your requirements, both today and tomorrow.
To be continued…