2 Billion Rows is Not a Limit
I recently got into a discussion regarding the 2 Billion row ‘limit’ for HANA tables / partitions and when this limit might be removed.
In fact this is not a ‘limit’ or a ‘bug’ but rather it’s intrinsic to how HANA works. Erich suggested that I transfer the information to a blog, which I’ve been happy to do.
As we know, HANA is primarily a column store; that is, we expect the bulk of the data we hold to be held as large in-memory column stores.
At certain times we need to input or to reconstruct rows of data, and we need to keep track of where the corresponding attributes of a row occur within the separate columns.
We define our ‘rows’ not by using keys for rows and holding all the attributes together, juxtaposed physically as a row, but rather by their position within the column – so that we can materialise the N’th row by picking up the column attributes at the N’th position in each column.
For this we need to hold a position value – and we’ve chosen 32 bits to do this, with one bit reserved to mark ‘deleted’ rows. This gives us just over 2Bn values to represent a row position (the 2Bn value will be familiar to anyone familiar with powers of two, as 2**31 = 2,147,483,648).
As my colleague Ruediger Karl pointed out, we can regard this as a separate data type: the data type that defines the row position of a ‘row’ in our table.
This looks like a good tradeoff between the number of rows we can keep in a table / partition and the space occupied.
If we were to increase the limit, say to 33 bits, then this would be pretty awkward, as we’d most likely have to take at least another 8 bits, maybe even 16 or 32 – otherwise we have values that don’t align to word boundaries. But taking just 4 bytes gives us a good range of values to play with.
Consider what happens when we are doing a query and scanning a column to find values that qualify in a WHERE clause. Most of the values will be filtered out, with the qualifying rows, possibly scattered throughout the table, being passed on in an intermediate result ready to be combined with other intermediate results. At some point we will reach the end of our query, and at that point we need to materialise our results and pass them back to the user. At that point we will likely need to pick up other column values from our table that correspond to the positions of the filtered values.
That is, we may have scanned the ACCOUNT_TYPE column of our ACCOUNT table to find all those of type ‘A’, ‘D’ or ‘E’, then examine the ACCOUNT_AMT column of the same table to find those with values of > 100,000 Euro, then return the account number, holder’s name and account name.
At each stage we need to keep track of the positions of the values we’re dealing with, that is when we’ve found the positions in the ACCOUNT_TYPE column we need to remember these and inspect the corresponding positions in the ACCOUNT_AMT column to check their values. We’ll then end up with a list of positions in the table that qualify for both criteria, and we’ll then go into ACCOUNT_NUMBER, ACCOUNT_HOLDER_NAME and ACCOUNT_NAME columns to pick up the corresponding values in those columns. In order to do this we clearly need to keep track of the row positions – and that’s the job of this internal data type which allows us to carry forward not just the values we found qualified, but the positions in the table where we found them.
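To make this concrete, the query being described would look something like the one below. The ACCOUNT table and its columns are just the illustrative names used above, not a real schema:

-- Scan ACCOUNT_TYPE, then ACCOUNT_AMT, carrying the qualifying row positions
-- forward; only at the end are the three output columns materialised.
SELECT ACCOUNT_NUMBER, ACCOUNT_HOLDER_NAME, ACCOUNT_NAME
FROM ACCOUNT
WHERE ACCOUNT_TYPE IN ('A', 'D', 'E')
  AND ACCOUNT_AMT > 100000;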
This is not really a limit because we can divide any column table into partitions – as many as we like, pretty much – and this 2Bn limit applies to each partition, not to whole tables. So if we need more than 2Bn rows then all we need to do is partition the data into two or more partitions. Note that this applies whether we have a single-server HANA system or a multi-server cluster. For a multi-server cluster we must partition the table so we can put one or more partitions on each server in order to spread the data. But here again we may discover that we need to put more than one partition on each server since we need to hold more than 2Bn rows per server. For example, a little while ago I worked on a project where we had 60 Bn rows spread over six HANA servers, therefore we needed at least 60 / 2 / 6 = 5 partitions per server. In practice we had many more partitions per server because we didn’t want to have the partitions up against their limit.
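As a sketch of what that partitioning looks like in SQL – the ACCOUNT table, its columns and the partition count here are purely illustrative, not taken from the project mentioned above:

-- Create the table hash-partitioned from the start; in a scale-out system
-- the partitions can then be distributed over the servers.
CREATE COLUMN TABLE ACCOUNT (
    ACCOUNT_NUMBER BIGINT NOT NULL,
    ACCOUNT_TYPE   NVARCHAR(1),
    ACCOUNT_AMT    DECIMAL(15,2)
) PARTITION BY HASH (ACCOUNT_NUMBER) PARTITIONS 30;

-- An existing, unpartitioned table can also be repartitioned later:
ALTER TABLE ACCOUNT PARTITION BY HASH (ACCOUNT_NUMBER) PARTITIONS 30;

Each of the 30 partitions then has its own 2Bn row ceiling.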
As an aside this is not just about the absolute storage limit, but also about performance and operational considerations. I’ve read that for performance a recommendation is to have not more than 100m rows per partition – so to hold 2bn rows you might want 20 partitions of 100m rows each. But depending on the workload, it may not be necessary to go for 20 partitions since one of the things HANA does is dynamically subdivide partitions into sub ranges and assign different tasks to each subrange – so it can increase the parallelism even if the table hasn’t been explicitly subdivided. If so then 2 or 3 partitions would be sufficient, say if the table was seldom updated. As ever ‘it depends’.
Operationally we also need to consider whether the tables are being updated: having one big table gives problems with doing a MERGE after updates, where we merge the updates that have arrived, and are being held in the delta store, with the main table. The way HANA does this is to replicate the partition (or the whole table if it’s a single partition) to create a new version. If we have one 2Bn partition then we’d need to replicate all of that into the new version – i.e. we have 4Bn rows ‘in flight’ during the operation. But if we partition the data into multiple partitions, say of 100m rows each, then we can do these one at a time on a rolling basis – so we only need to replicate 100m rows at any one time, that is we just need to replicate 5% of the data at any one time, not 100%. This consumes less resource as well as smoothing resource consumption.
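For a partitioned table the merge can be requested per partition rather than for the whole table, so only that partition’s main store is rebuilt at any one time. A sketch, assuming the MERGE DELTA ... PART syntax and the illustrative ACCOUNT table from above:

-- Merge the delta store of one partition at a time, on a rolling basis.
MERGE DELTA OF ACCOUNT PART 1;
MERGE DELTA OF ACCOUNT PART 2;
-- ...and so on for the remaining partitions.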
So, when we need to hold more than 2Bn rows, we simply partition the table and each partition can use a full 2Bn rows – though as we said, in practice we’ll probably want each partition to hold fewer than this.
For the mathematically inclined, another way to think about this is that if we want to exceed the 2Bn limit then we’ll need more than our 32 − 1 bits to represent all the rows that might be processed, but we can create these by adding the partition number to the front of our row position – we can have hundreds of partitions if we wish and identify any row position with the ‘Partition | RowId’ combination.
The use of fixed positions within the column store is fundamental to enabling the use of column stores, and HANA’s clever delta merge process allows us to keep track of and adjust all our row positions as our partitions have rows added and deleted.
For those interested in knowing more, this topic is extensively covered in Hasso’s book, and also in the In-Memory Data Management course offered by the Open HPI.
Cheers, H
So I've been pondering this for a while. The attribute vector in HANA is a variable size, as I understand it, with a maximum of 32-bits. It's not clear to me what the variable sizes are, but my guess is they are 1,2,4,8,16,32-bit or some subset of this.
Sybase IQ doesn't have that limit though, and it also uses dictionary encoding, as I understand it. Neither does DB2 or Oracle. IQ has a 48-bit limit, or 281,474,976,710,656 rows.
What I don't understand is what difference in design between IQ/DB2/Oracle's column stores and HANA means that an attribute vector > 32 bits is not allowed?
Is it simply that HANA uses SIMD instructions with a 128-bit data path, and using a 33-bit (or 64-bit) data type would mean you could only process 2 values in parallel rather than 4?
Or...?
Hi John, Good to hear the clatter of your keys! The 2Bn row thing is not related to the size of the register for SIMD or the compression factor for the values held in the dictionary - they are separate things. Let me explain.
The 2Bn row 'limit' comes from the fact that we implicitly hold the position of every 'row' in the database to keep track of where that attribute value came from - we may need to go back to the 'table' - that is, the collection of columns that make up the table - and pick out another attribute in the N'th position that corresponds to the N'th position of another column that we need to access. We don't identify the rows in HANA as part of the row structure; instead we uniquely identify them by their position in the attribute vector.
For example if we test a column and find that the values at positions 2, 22 and 30 match our criteria then we most likely have to look at the 2nd, 22nd and 30th position in another column to pick up the corresponding values we've found in our first column - remember we're accessing columns not whole rows.
If we look at the values at position 2,22 and 30 in our 2nd column and find that those at position 22 and 30 in our second column further qualify then we may have to go to multiple columns and pick out all the values at their 22nd and 30th positions in order to 'reconstruct' the rows and return the complete result. (We're doing 'late materialisation')
Therefore whilst we have intermediate results we need to keep track of the positions that the filtered values came from, so we effectively need a 'RowPosition' attribute - an internal attribute that allows us to keep track of this. It so happens we've chosen 4 bytes, which is a good compromise between a good range of values and a reasonable overhead for space. So think of it as an extra attribute in the 'row' that has the RowId and which any intermediate result will pick up to keep track of the values.
The other thing you mention is the size of the registers etc for SIMD. This is a similar sounding but different thing.
As you know, when we use dictionary compression we reduce our column attribute values to the minimum number of bits necessary to hold the data values. This is simply information theory - for a given range of values, what is the minimum number of bits needed to hold the values we need to hold? If I recall correctly we don't have a fixed size for this - we can compress the values down to 1, 2, 5, 15 or however many bits we need to hold the range of values in question. All the values in a column are held in this same minimum length of bits (except maybe if we are using run-length encoding etc.)
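A rough way to see this per column is a query along the following lines - assuming the M_CS_COLUMNS monitoring view exposes the distinct-value count and compression type (check the documentation for your HANA revision); the schema and table names are illustrative:

-- The number of distinct values in a column's dictionary determines roughly
-- how many bits each encoded value in the attribute vector needs (about log2 of it).
SELECT COLUMN_NAME, DISTINCT_COUNT, COMPRESSION_TYPE
FROM M_CS_COLUMNS
WHERE SCHEMA_NAME = 'MY_SCHEMA'   -- illustrative schema name
  AND TABLE_NAME  = 'ACCOUNT';    -- illustrative table name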
Thus depending on the column data and the level of compression we get we can pack different numbers of values into the registers. I'm pretty sure that we don't insist on them being on word / half word boundaries either.
So the 2Bn 'limit' isn't related to the number of values we can pack into a register; it determines how many total values we can store in a table / partition. The number of values we can pack into a register at any one time is simply a factor of the compression factor for the binary encoded data coming from our dictionary - if you get high compression you can process more values at a time. The trade-off here is that by holding the minimal number of bits you take less data space overall, and you can pack more values in a register - but it's a bit more complicated doing the calculations as you may need to do some bit shifting, because the values don't necessarily align on byte, half-word or word boundaries within the register.
So, to summarise: the 2Bn limit for a single partition (you can have loads of partitions) is due to the RowId you need to hold; the number of values you can pack into a register comes from the compression you can get for the overall values - these factors are independent.
One of the really, really clever things about this scheme and the way HANA does updating / merges is the way that it can automatically adjust the number of bits it uses to compress values. As you're aware periodically we take the updates that have arrived into the system and which are held in the Delta Store and merge them into the main column table. If at this point we discover a value that needs more bits to represent it than we've seen previously then we rebuild the dictionary using an increased number of bits and then switch it into use. This means we can always guarantee to be using the minimal number of bits but are not afraid of a value turning up that might blow our encoding scheme - on each MERGE we get the chance to increase the number of bits we're using. This is a neat example of how the different features of HANA support each other enabling us to do things in a completely new, and more efficient, manner.
This is explained in a lot more detail in Hasso's book (the detailed yellow covered one) and the course material for the In-Memory MOOC from the Hasso Plattner Institute.
As we said above the 2Bn 'limit' isn't really a limit - if you need more rows just use more partitions 🙂
Right, the row-id is for the attribute vector. That's limited to a maximum of a 32-bit value for a single partition, and that's what causes the 2bn row limit.
Other databases don't have this limit. I don't know about DB2 or Oracle, but IQ is definitely a 48-bit value.
Now, you can reasonably argue that because of HANA's delta design, partitions bigger than 2-300m rows start to get expensive to maintain, plus partitions can allow you to do query pruning, and therefore with HANA's database design, partitions are necessary regardless. Therefore the 2bn row partition limit makes a lot of sense, and doesn't cause an impact in the real world. That's the case you make above and I agree with it.
But it still doesn't explain why the development team chose a hard limit of a 32-bit attribute vector for HANA and 48-bit for IQ.
BTW have you ever seen what happens when you exceed the limit? Imagine the following scenario:
- Disable automerge
- Insert 1.99bn rows
- Merge delta
- Insert 101m rows < what happens here?
- Merge delta < what happens here?
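Spelled out in SQL (big_table is just a placeholder name; a similar test appears further down in the thread):

ALTER TABLE big_table DISABLE AUTOMERGE;
INSERT INTO big_table (SELECT TOP 1990000000 * FROM big_table);
MERGE DELTA OF big_table;
INSERT INTO big_table (SELECT TOP 101000000 * FROM big_table);  -- what happens here?
MERGE DELTA OF big_table;                                       -- what happens here?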
John, I'm guessing here but remember some other databases don't rely on row position but have another way of identifying a row - some have an explicit RowId - and also some will be assuming a non-partitioned solution, therefore the total number of rows can reasonably be expected to be higher than 2Bn. The 32 bits are just a reasonable compromise between allowable range and space; they could have chosen another.
I think you'll find that the system will throw an exception if the limit is exceeded as in your example above - but that's just a matter of sizing the system with the right number of partitions to avoid that.
If you have a single node, single partition system and the table is read-only then 2Bn gives enough range for most people - as you've said before, if you're updating then you probably want 100-200m rows per partition for update and merge efficiency, as good practice, which is well within the 2Bn range.
You can regard the partition number as an extra number of bits tacked onto the RowId to give you a higher range if more are needed - rather than requiring every row to carry a much larger overhead when in most cases and for most organisations it wouldn't be needed.
Yes I think it is a very acceptable trade-off between performance, storage and capability. It's never caused me a problem in the real world!
Wondering if Mark Mumy has an opinion, I bet he knows why IQ is 48-bit.
I also wondered how long it would be before the HANA engine has auto-partitioning to hide all this. It could choose a partition structure that best represents the dataset based on the queries. That would be enormously powerful!
BTW the insert into the delta store will work, and if you do COUNT(*) on the table, it will have over 2bn rows. When you come to merge, it will fail the merge. This can only happen if you do bulk loads, so is a very rare thing (that's how I know about this scenario). In the real world, the database sends an alert when you hit 300m rows in a partition, so you will fix big partitions long before they become a problem!
Much like HANA, the IQ limit of 48 bits was really a tradeoff between the space needed to track rows and the likelihood that a table would exceed 281 trillion rows. Most systems had a row limit that 4 bytes could address: 2-4 billion depending on whether or not you needed to hold a bit for other uses.
At the time, IQ could have certainly used anything higher than 4 bytes. 8 bytes would be wasteful if you think about it. That's 18,446,744,073,709,551,616 rows! But a 5 byte (40 bit) value is too small at just 1,099,511,627,776 rows. OK, not small but certainly something that would be hit in the foreseeable future, some years ago. And it would have been quite a limitation in our latest HANA and IQ Guinness 12 PB system where we had tables with more than 3 trillion rows each.
Go up 1 byte to 6 and you jump to 281 trillion rows. Plus it's a nice boundary in the programming world making things a tad neater.
While massive by today's standards, even 281 trillion rows could eventually be reached. I would love to see that, though!
I understand why HANA has the limit that it does of 2^31. RAM real estate is limited and expensive as compared to disk.
With memory getting much larger and with more data being pushed into the in-memory space for performance reasons, we will likely want to revisit this at some point.
I can still remember that whopping 100 MB database that I worked with just 20 years ago. And the first time I crossed into the 1 TB realm. Now, building petabyte HANA and IQ systems is becoming more mainstream.
Mark
We have a 2bn+ bseg and it's partitioned (as you would expect), so yeah, "no real world problems". 🙂
John, Thanks for that - the result you saw makes sense to me. If you insert rows then they go into the Delta store. When you do the COUNT(*) you'd do that for the main store and for the Delta store and then combine the result. But when you come to do the MERGE, the main store needs to be reconstructed, at which point the system discovers it would need to keep track of more than 2Bn values - hence the error. See also the additional comment from Lars below.
Hi Henry,
as this is already a rather technical discussion I might as well chip in here.
One thing that needs a bit more clarification in my eyes is the distinction between the logical row and the offset of a row in the main store value vector.
The logical row obviously has to cover not just one main store but needs to be uniquely identifiable across the whole table.
This means it needs to cover potentially one main store and two delta stores for up to 1000 partitions.
So, the row_id (and I think that is what Ruediger Karl referred to) does in fact not declare the offset position of a row in the main store value vector.
It is merely a sequence number for the order in which records were entered into the table.
This also applies to updates, as every update of a record is effectively a delete/invalidate+insert operation.
Also, the offset position of any row within the main store can change, when the sort order of the table is changed during the compression optimization.
The actual offset/position of a record within a value vector is only explicitly stored, when an inverted index is used.
Clearly this is not a factual explanation why there is this limit in the way we have it but I think you made the point well by saying that it is not a limit for any practical matters.
Cheers,
Lars
Lars, great post, thank you for the point about the delta stores. When I wrote the original post one of the main things I was trying to get across was the need to carry forward row positions in order to do late materialisation of results so we know where to go in order to reconstruct complete rows having filtered results using previously processed columns. So, I was mainly thinking about the use of positional representation in this context. Do you have any comments on this?
My understanding is that we keep track of the positions within a partition rather than an entire table and that if we want to uniquely identify a row we need to know the partition number and the offset within the partition to do this - if we were to do this for 1,000 partitions of up to 2Bn values (per partition) then we'd need more than 32 bits to do that - i.e. we'd need 32 bits for each partition (2Bn values) plus log2(1,000) ≈ 10 bits, or 42 bits in all (OK, 2**10 is 1,024).
My view is that the use of positional representation for rows, married with an MPP architecture that makes use of multiple partitions, is a really neat and elegant solution to the problem of doing highly efficient large-scale processing. The insight that you can combine both, and then maintain order through use of the MERGE process, is the key to allowing the use of column stores for both OLTP and analytics; the researchers were persistent enough to get to the point of realising that this combination of features could provide a general-purpose processing system with much higher update and analytic performance. Each feature on its own wouldn't provide the effect - but the combination does - and illustrates the genius of this approach (which is why others are starting to try and copy it, albeit in a limited way - the ability to do real-time updates, or the use of a single data representation, being typical limitations).
So, any and all comments on this are very welcome 🙂
Hi Henry,
first off I agree that the current column store and the query engines that have been built on top of it make good use of the parallel processing and SIMD options of the underlying hardware.
I don't think that it is the best thing since sliced bread either - but that's just personal view and I am not saying there are so many better solutions out there. Maybe it's just that I've become used to the concepts.
Coming back to your point:
Yes, late materialization plays an important role for the query processing in SAP HANA.
But the offset position of a specific value in the main vector is really only used to 'glue' the record back together. To be able to do that, the processing engine indeed needs to also track the partition the row is coming from - very similar to the case when multiple tables are involved in a query.
So, during query processing the record-references indeed allow to address way more than the 2 bln rows that a single main vector can hold. And sure enough these references can take more than 31 bits.
Fortunately it's often possible to get rid of actual record references rather early in the query processing. Filter push-down and early aggregation are the "magic" here.
As an example: for multi-partition/table aggregations in the OLAP engine, SAP HANA makes up a shared dictionary for all selected values on-the-fly.
The record materialization then simply refers to this dictionary instead of going back to the original dictionaries in the table columns.
One thing I didn't like so much in the HPI lectures was that they didn't take the step to show query processing on a live SAP HANA system.
Things like the OLAP engine, query transformation and optimization had been left out, and one could get the impression that it is just the column store data structure that makes the performance.
But as always the data structure is only half the story. You also need to have good algorithms that work on those data structures.
And these algorithms really need some more documentation/explanation.
- Lars
Lars, if I understand correctly, this is also something which causes a practical problem for HANA for very large-scale use cases. Like you say, partitions are no different to tables in many respects, just the query pruning and semantics are automated.
You can of course create a bunch of separate tables, join them in a calc view, apply a constant value for query pruning and you have a very similar model to HANA's native partition concept.
In both cases you need to have (I think!) a separate dictionary for each partition. In cases where the dictionary is different for each partition (date in a range partition, or the hash predicate, for example) then this has zero effect.
But there are problems when the dictionary is similar for each partition (customer, product etc.) because the dictionary is duplicated each time.
Now I suppose that those dictionaries are (generally) smaller, but as you approach the 1000-partition limit, this could cause quite an explosion in dictionary sizes.
Of course it also means there is a 2-trillion (1000x2bn) table size limit in theory, and more like a 100-200bn limit in practice, assuming your partitions are rather sparser than 2bn.
Ok, this time I'll try to answer more compactly 🙂
Dictionary entry duplication: correct.
2 trillion row limit per table: correct.
Whether any of these things actually imposes a problem is a different story.
Since partitioning is primarily a tool to shard a table over several nodes, each partition needs its own local dictionary.
Also, this design not only gives rise to problems, but also allows for optimized partitioning in cases where the dictionaries are not near-complete overlaps for most columns.
Plus: the risk for that is lowered as the primary key (the column(s) with the highest number of distinct values) should be part of the partitioning key.
Scenarios like the 2 trillion rows (worth a whole year of inserting ~65,000 records/sec) may need additional considerations on the application level anyhow.
To put this into perspective: even logging Wikipedia's page accesses on its most active page (MAIN_PAGE, ~15,000,000 views/day) would require "only" 173 records/sec.
I'd say, we're safe - for now 🙂
Funny what kind of stuff I find in my archives...
--> 2 bln error message
Started: 2014-02-24 18:34:36
Could not execute 'insert into lars.log (select top 1000000000 * from lars.log)' in 54:42.783 minutes .
SAP DBTech JDBC: [2048]: column store error: [17] Docid overflow in current delta and delta1 and main, table: LARS:LOG$delta_2$en;location=dewdftzldc03:39103
Cheers,
Lars
Now try:
alter table lars.log disable automerge;
insert into lars.log (select top 900000000 * from lars.log);
merge delta of lars.log;
insert into lars.log (select top 200000000 * from lars.log);
merge delta of lars.log;
Should be fun 🙂
As you can tell by the date I did this test a long time ago.
This system and with it the test case already went straight to /dev/null.
All things must pass as George said.
Dear Experts,
What will happen if the number of rows reaches this limit and new entries are required in the table? What is the impact if we don't partition the table in time and entries continue to increase?
Jitendra
You'll get an error. Best to go ahead and partition the table before it gets to 2bn.
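One way to stay ahead of this is to check row counts per partition regularly - a sketch, assuming the M_CS_TABLES monitoring view with its RECORD_COUNT and PART_ID columns:

-- List column-store tables / partitions whose row count has passed the
-- 300m alerting threshold mentioned earlier, largest first.
SELECT SCHEMA_NAME, TABLE_NAME, PART_ID, RECORD_COUNT
FROM M_CS_TABLES
WHERE RECORD_COUNT > 300000000
ORDER BY RECORD_COUNT DESC;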
Dear Friends,
> Henry wrote: As we said above the 2Bn 'limit' isn't really a limit - if you need more rows just use more partitions
IMHO the 2bn row limit in HANA for physical tables is a rudiment in the IoT world.
But why is the 2bn row limit needed in HANA for temporary in-memory tables (starting with #_SYS_QO_COL_XXXX)? In the VizPlan they are created at nodes -> 'Column Search (Create Temp Index)'.
For example, on unions for intermediate results:
SAP DBTech JDBC: [2048]: column store error: search table error: [2598] column search intermediate result exceeds 2 billion rows limitation
Why is this limitation needed in HANA? What is the benefit?
HANA is an in-memory DB, but no matter how much RAM you have (tens or hundreds of TB), you can process in memory only less than 2 billion rows. It's a paradoxical situation, isn't it?
How can HANA process more than 2^31 (2B) rows in memory? Why does HANA not partition this type of table (in-memory temp tables) in memory, if SAP does not want to increase the data type from int4?
Some heavy queries (for example with COUNT DISTINCT) have much more than 2B rows in real life.
Hi all
I'm getting a short dump in our (SAP HANA) source system when TDMS is trying to read a partitioned table there. It has over 4.7 billion rows but is divided into 6 smaller partitions.
The dump is:
Database error text: "SQL message: transaction rolled back by an internal
error: Search result size limit exceeded: 4789062768"
With it being partitioned this shouldn't happen, but then I thought, what if it's not when it reads the ACTUAL table, but rather a temporary table? I'm not an ABAPER (BASIS background) so don't fully understand the code, but it short dumps here:
272 *BOC: DIXITAJ: 21/3/2016: Performance improvement of select query
273 * SELECT (lt_fieldname) FROM (ld_tabname)
274 * INTO ld_tst_zeile .
275 * EXIT.
276 * ENDSELECT.
277 IF gv_new_selection = ' '. "old code
>>>>> SELECT (lt_fieldname) FROM (ld_tabname)
279 INTO ld_tst_zeile .
280 EXIT.
281 ENDSELECT.
I'm just guessing that a temporary table is involved, that ISN'T partitioned, hence the limit/dump.
Is it possible to partition a temporary table? If so, I guess the TDMS code needs modification.
Thanks
Ross
Hi All,
If I want to partition the CDPOS table, which contains more than 2 billion records, which method is good for me? As per my analysis I am going with the HASH method using the command below; please suggest if any change is needed.
How many partitions will be good?
Alter table SAPHANADB.CDPOS partition by HASH(OBJECTID) partitions 5;
Regards,
Nilesh