Properties of Analytic Applications : Interesting ways to store and read data : Part 3 – Concept of the column store
Turning the concept of ROLAP on its head – column based approaches :
To recollect :
Current ROLAP Approach :
Row based Indexing and retrieval
This basically has every row stored separately and indexed according to needs / application design. This would mean that to fetch
Data the row indices are read.
Typical indices used : F Table Indices etc.
Random access for the table is quite fast since access cannot be predicted. This is okay considering current datawarehouses since
There is a row based index on the cube which is ued for random access and we create smaller cubes for smaller indexes in the form of aggregates and query the same as required.
The size of the table cannot be optimized – this is because the index tends to grow pretty large and storage is not efficient ( not a very big concern considering the cost of storage !!!) but then this becomes a hugely limiting factors in Very Large Databases ( 100 TB sorts ) – not sure what is the delimiting factor for a VLDB but have given my own interpretation of the same.
Column based approach :
Column Oriented databases follow a different approach , this involves indexing the columns together rather than the rows – this is because the unique values in the columns is lesser and hence the indexes are smaller.
What if instead of looking at the rows – if I were to index each column separately ? this would mean that :
1. Repeated values are not stored
2. Query on the database is much easier in terms of processing since much lesser records are fetched
3. Lends itself to sparse queries – queries where the number of columns accessed are lesser compared to the columns in the actual table
1. Loading data into these tables is an extremely processing intensive affair
2. When the number of parameters for the query increases then the query performance starts to degrade and in some cases can be below that of a row based approach
Here there are 4 records scanned for a list of customers …. Even if we consider more filter conditions :
Customers for 01.2008 – this will scan 3 records….
However if we consider a column based approach :
There will be three columns :
Joining these three tables together and querying based on the user requirement will be faster due to lesser values being looked for but however if all three columns are being accessed – then the query in some cases might end up being slower than the row based query.
Now the next thing to talk about is storing this data which would involve compression of the same also. Will talk about that in a separate blog .
Also there is a lot of work done in this area by a person called Daniel J Abadi who teaches at MIT – A lot of published material also which can be looked into.
And continuing onwards to towards to goal of explaining the concept behind BIA …
Cheers for the weekend,