HANA Table Partitioning – First Thoughts
- HANA provides table partitioning options that are logically similar to other RDBMSs
- Partitioning is important for spreading tables across multi-node HANA implementations – facilitating increased query parallelization
- Effective implementation of partitioning is important in maintaining a consistent Delta Merge overhead
- Partitions can be managed explicitly – i.e. add/remove ranges; split/merge partitions
- Simple partitioning syntax
Introduction – First Thoughts
SAP produce a brief guide to this – http://help.sap.com/hana/hana_db_part_en.pdf
I was interested to find out how/why you might deploy explicit partitioning in HANA. Now, I understand that HANA will dynamically partition tables for a query – i.e. to use as many cores as possible at 2Gb/sec per core. Tables need to be partitioned to spread them across multiple hosts. So why else might you partition? SAP list the 'usual' partitioning benefits –
- load balancing
- parallelization (? – I thought HANA was going to do this for me …)
- partition pruning (i.e. – can go straight to a sub set of partitions)
- Explicit Partition Handling i.e. managing time series data
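As a sketch of partition pruning – table and column names here are illustrative, with the range syntax as described in the SAP partitioning guide – a date predicate that falls within one range means only that partition needs to be scanned:

```sql
-- Illustrative only: a range-partitioned column table where a date
-- predicate lets HANA prune the scan to a single partition
CREATE COLUMN TABLE orders (
    order_id   INTEGER,
    order_date DATE,
    amount     DECIMAL(15,2),
    PRIMARY KEY (order_id, order_date)   -- partitioning column in the key
)
PARTITION BY RANGE (order_date)
    (PARTITION '2013-01-01' <= VALUES < '2014-01-01',
     PARTITION '2014-01-01' <= VALUES < '2015-01-01',
     PARTITION OTHERS);

-- Only the 2014 partition is relevant to this query
SELECT SUM(amount) FROM orders
 WHERE order_date >= '2014-03-01' AND order_date < '2014-04-01';
```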
All well and good. OK – a few HANA-only considerations :-
- A non partitioned table is limited to 2 billion rows
- delta merge processing time – if modifications are restricted to a subset of partitions then there will be less delta merge (or the merge is spread across hosts)* to complete
- Only available on column store tables (HANA does row and column store)
- Typically used in distributed landscapes – but can be beneficial on single host systems **
* Delta Merge? – a key part of the design of HANA and a topic in itself. To summarise in one sentence – data changes are stored in a separate area, with its own data dictionary; this area is periodically merged with the main store (Shall we say HANA is fast – but a dictionary encoded column store is going to have to compromise somewhere …)
** OK. Is that where 'we' get to improve parallelization – by splitting tables across nodes? I.e. HANA can partition tables 'on the fly' WITHIN A NODE; to use more than one node for a query we need to distribute data across nodes. That makes sense.
2 billion row limit? I'm guessing that's a facet of a dictionary encoded column store which has restricted itself to integers for 'rowids'. This initially sounds a bit restrictive. However, I think by the time you get to billion-row-plus tables you'd be considering partitioning on most RDBMS systems anyway. And of course with HANA we need to consider Delta Merge processing. Also, big tables are going to reside in big systems – which are likely to have more than one node – and you'd probably want to split that big table across those nodes. So this row limit is not looking like an issue: with billion-row-plus tables you'd probably have gone for partitioning for other reasons, not because of the 2 billion row limit.
Types of Partitioning
HANA provides two categories of table partitions ; Single Level and Multi-Level. Let's look at Single Level first. This looks pretty familiar (I'm talking from a Sybase ASE/SQL Server background) :-
- Hash
- Round Robin
- Range
Syntax is simple. You can specify the number of partitions or – and this seems to be SAP's recommendation – use the GET_NUM_SERVERS() function. A restriction in Single Level table partitioning is that the columns used to partition the table must be part of the primary key (this applies to hash and range; round robin partitioned tables cannot have a primary key defined at all)
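A minimal sketch of the Single Level syntax – the table and column names are my own, illustrative ones:

```sql
-- Hash partitioning, one partition per server in the landscape
CREATE COLUMN TABLE sales_h (
    order_id   INTEGER,
    amount     DECIMAL(15,2),
    PRIMARY KEY (order_id)           -- partitioning column is part of the key
)
PARTITION BY HASH (order_id) PARTITIONS GET_NUM_SERVERS();

-- Round robin with an explicit partition count (no primary key)
CREATE COLUMN TABLE sales_rr (
    order_id   INTEGER,
    amount     DECIMAL(15,2)
)
PARTITION BY ROUNDROBIN PARTITIONS 4;
```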
Slightly harder to visualise than Single Level partitioning, Multi-Level partitioning :-
- Allows partitioning on non key values (a restriction in hash and range)
- Works well with time series data – if older data is not updated (or rarely updated) then there will be little or no need to complete Delta Merge processing on these partitions. This facilitates keeping the time (and resources) required to complete Delta Merge processing (for a table) constant
Three types of Multi-Level partition :-
- Hash-Range. 'Most typically used'. The example given shows time series data split evenly across nodes – i.e. if you have 4 nodes and month-based data, each node will have 25% of January, 25% of February, etc.
- Round robin-Range
- Range-Range
I can see that Hash-Range allows for maximum parallelization – i.e. highly likely to engage all nodes. Certainly Delta Merge will be completed on all nodes. So if data modifications are restricted to the current month (or quarter etc.) the Delta Merge should be consistent AND will utilize all available nodes – so it’s completed as fast as possible
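A Hash-Range sketch of the above (again, names and date ranges are illustrative). Note that the second-level range column here is not part of the primary key – the relaxation Multi-Level partitioning allows:

```sql
-- First level: hash on the key spreads rows evenly across hosts
-- Second level: range by month keeps current-period modifications
-- (and hence Delta Merge work) confined to one set of sub-partitions
CREATE COLUMN TABLE sales_hr (
    order_id   INTEGER,
    order_date DATE,
    amount     DECIMAL(15,2),
    PRIMARY KEY (order_id)
)
PARTITION BY HASH (order_id) PARTITIONS GET_NUM_SERVERS(),
             RANGE (order_date)
    (PARTITION '2014-01-01' <= VALUES < '2014-02-01',
     PARTITION '2014-02-01' <= VALUES < '2014-03-01',
     PARTITION OTHERS);
```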
- Any partition involving a range function can be managed by adding or removing ranges
- Partition type of a table can be changed
- Split partitions
- Merge partitions
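The management operations above can be sketched like this, assuming a hypothetical table orders that is range-partitioned on order_date:

```sql
-- Add a range for the new year; drop an expired one
ALTER TABLE orders ADD PARTITION '2015-01-01' <= VALUES < '2016-01-01';
ALTER TABLE orders DROP PARTITION '2013-01-01' <= VALUES < '2014-01-01';

-- Change the partition scheme by specifying a complete new layout
ALTER TABLE orders PARTITION BY RANGE (order_date)
    (PARTITION '2014-01-01' <= VALUES < '2015-01-01',
     PARTITION OTHERS);

-- Merge everything back into an unpartitioned table
ALTER TABLE orders MERGE PARTITIONS;
```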
Excellent info, thanks a ton for this.
These are my doubts:
1. Can we add partitions to tables that have already been created?
2. Best practices: how should we partition, and on what basis (Dimension / Fact)?
3. What are the pre-requisites, issues, risks and concerns to keep in mind for partitioning?
Really appreciate your help.