2 Billion Rows is Not a Limit
I recently got into a discussion regarding the 2 Billion row ‘limit’ for HANA tables / partitions and when this limit might be removed.
In fact this is not a ‘limit’ or a ‘bug’ but rather its intrinsic to how HANA works. Erich suggested that I transfer the information to a blog which I’ve been happy to do.
As we know HANA is a primarily a column store, that is we expect to see the bulk of the data we hold held as large in-memory column stores.
At certain times we need to input or to reconstruct rows of data, and we need to keep track of which where the corresponding attributes of a row occur within the separate columns.
We define our ‘rows’ not by using keys for rows and holding all the attributes together juxtaposed physicall as a row, but rather by their position within the column – so that we can materialise the N’th row by picking up the column attributes at the N’th position in each column.
For this we need to hold a position value – and we’ve chosen 32 bits to do this , with one bit reserved to mark ‘deleted. rows This gives us just over 2Bn values to represent a row position (the 2 Bn value will be familiar with anyone familiar with powers of two, as 2**31 = 2,147,483,648).
As my colleague Ruediger Karl pointed out we can regard this as a separate data type; the data type that defines the row position of a ‘row’ in our table.
This looks like a good tradeoff between the the number of rows we can keep in a table / partition and the space occupied.
If we were to increase the limit, say to 33 bits, then this would be pretty awkward, as we’d most likely have to take at least another 8 bits, maybe even 16 or 32 – otherwise we have values that don’t align to word boundaries. But taking just 4 bytes gives us a good range of values to play with.
Consider what happens when we are doing a query and scanning a column to find values that qualify in a WHERE clause. Most of the values will be filtered out with rows, possible scattered throughout the table being passed on in an intermediate result ready to be combined with other intermediate results. At some point we will reach the end of our query, and at that point we need to materialise our results and pass the results back to the user. At that point we will likely need to pick up other column values from our table that correspond to the positions of the filtered values.
That is, we may have scanned the ACCOUNT_TYPE column of our ACCOUNT table to find all those of type ‘A, D or E’ then examine the ACCOUNT_AMT column of the same table to find those with values of > 100,000 Euro, then return the account number, holders name and account name.
At each stage we need to keep track of the positions of the values we’re dealing with, that is when we’ve found the positions in the ACCOUNT_TYPE column we need to remember these and inspect the corresponding positions in the ACCOUNT_AMT column to check their values. We’ll then end up with a list of positions in the table that qualify for both criteria, and we’ll then go into ACCOUNT_NUMBER, ACCOUNT_HOLDER_NAME and ACCOUNT_NAME columns to pick up the corresponding values in those columns. In order to do this we clearly need to keep track of the row positions – and that’s the job of this internal data type which allows us to carry forward not just the values we found qualified, but the positions in the table where we found them.
This is not really a limit because we can divide any column table into partitions, as many as we like pretty much and this 2Bn limit applies to each partition, not to whole tables, so if we need more than 2Bn rows then all we need to do is partition the data into two or more partitions. Note that this applies whether or not we have a single server HANA cluster or a multi-server cluster. For a multi-server cluster we must partition the table so we can put one or more partitions on each server in order to spread the data. But here again we may discover that we need to put more than one partition on each server since we need to hold more than 2Bn rows per server. For example, a little while ago I worked on a project where we had 60 Bn rows spread over six HANA servers, therefore we needed at least 60 / 2 / 6 = 5 partitions per server. In practice we have many more partitions per server because we didn’t want to have the partitions up against their limit.
As an aside this is not just about the absolute storage limit, but also about performance and operational considerations. I’ve read that for performance a recommendation is to have not more than 100m rows per partition – so to hold 2bn rows you might want 20 partitions of 100m rows each. But depending on the workload, it may not be necessary to go for 20 partitions since one of the things HANA does is dynamically subdivide partitions into sub ranges and assign different tasks to each subrange – so it can increase the parallelism even if the table hasn’t been explicitly subdivided. If so then 2 or 3 partitions would be sufficient, say if the table was seldom updated. As ever ‘it depends’.
Operationally we also need to consider if the tables are being updated having one big table gives problems with doing a MERGE after updates, where we merge the updates that have arrive,and are being held in the delta store with the main table. The way HANA does this is to replicate the partion (or whole table if it’s a single partition) to create a new version. If they have one 2Bn partition then we’d need to replicate all of that into the new version – ie. They have 4Bn rows ‘in flight’ during the operation. But if we partition the data into multiple partitions, say of 100m rows each then they can do these one at a time on a rolling basis – so we only need to replicate 100m rows at any one time, that is we just need to replicate 5% of the data at any one time, not 100%. This consumes less resource as well as smoothing resource consumption.
So, when we need to hold more than 2Bn rows, we simply partition the table and each partition can use a full 2Bn rows, as we said in practice we’ll probably want the parititon to hold less than this in each partition.
For the mathematically inclined, another way to think about this is if we want to exceed the 2Bn limit then we’ll need more than our 32-1 bits to represent all the rows that might be processed, but we can create these by adding the partition number to the front of our row position, if we like we can have hundreds of partitions if we wish and identify any row position with the ‘Partition | RowId’ combination.
The use of fixed position within the column store is fundamental to enabing the use of column stores, and HANA’s clever delta merge process allows us to keep track of and adjust allour row positions as our partitions have rows added and deleted.
For those interested in knowing more this topic is extensively coverred in Hasso’s book:
and also in the In-Memory Data Management Course offered by the Open HPI