Properties of Analytic Applications : Interesting ways to store and read data : Part 4 – Building a Column Store
Approaches to building a column store:
There are multiple ways in which the column store described in the previous blog can be built. Each offers better performance in some respect, but each comes with its own complexities and disadvantages:
Approach 1: Keep the storage the same, change the schema
Vertically partition a row-store database. The rows of the table are broken up into multiple tables, each consisting of the table key and one attribute. This means there is a separate table for each attribute in the original table, and depending on the query only the necessary tables are accessed. This is somewhat similar to the star schema, except that this design demands a separate dimension for each attribute.
There is one more way to achieve this: Aggregates can get you there to a certain extent, but Aggregates tend to be query-specific, and for a true implementation of this approach a table per column needs to be built.
Here too, the number of attributes accessed determines the performance: the fewer the attributes, the better the performance; as more attributes are accessed, the cost becomes almost equal to, or even greater than, that of the plain row store.
The pros and cons of this approach:
1. Heavy data redundancy – the table key is repeated in every attribute table, which is unacceptable for very large databases.
2. The existing architecture need not be redefined – only the star schema needs to be thought through – no change to the SQL / DBMS is required.
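Approach 1 can be sketched in a few lines. This is a minimal illustration, not a real DBMS: the table name, column names, and values are invented for the example.

```python
# Sketch of Approach 1: vertical partitioning of a row table.
# One (key -> attribute) table is built per attribute; a query
# touching only some attributes reads only those tables.

rows = [
    {"key": 1, "region": "EMEA", "revenue": 100},
    {"key": 2, "region": "APAC", "revenue": 250},
    {"key": 3, "region": "EMEA", "revenue": 175},
]

# Break the row table into one table per attribute.
# Note the redundancy: the key is repeated in every attribute table.
attribute_tables = {
    attr: {r["key"]: r[attr] for r in rows}
    for attr in ("region", "revenue")
}

# A query over "revenue" alone reads just that table;
# the "region" table is never touched.
total_revenue = sum(attribute_tables["revenue"].values())
print(total_revenue)  # 525
```

The redundancy listed as con 1 is visible directly: each of the two attribute tables carries its own full copy of the keys.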
Approach 2: Keep the schema the same, change the storage
Modify the storage layer of the DBMS to store data in columns rather than rows. At the logical level the schema looks no different.
However, at the storage (disk) level, instead of storing the table row-by-row, the table is stored column-by-column. This avoids storing the table keys again for each column, reducing both the redundancy and the amount of data stored.
As with the previous approach, only those columns that are relevant for a particular query need to be accessed and merged together. Once this merging has taken place, the normal (row-store) query executor can process the query as usual.
Redundancy at the column level would still remain, but the whole table is now partitioned by columns. For example, a table with 10 columns and 1000 rows becomes 10 column stores of 1000 values each. No change is required to the existing schema or code – the application simply fetches the columns it needs. Imagine partitioning the cube on each characteristic in the cube: that is what is suggested here, and it is a physical partition, not a logical one.
Pros and cons :
1. Extremely friendly for reading data – shows a definite improvement in read performance for random queries, although the benefit degrades as the number of columns accessed increases.
2. Poor write performance – imagine writing one row of data across so many columns, updating indexes, etc.
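Both halves of Approach 2 can be sketched together: columnar physical storage, with a small merge step that rebuilds rows so an unchanged row-store executor can take over. The column names and data are again purely illustrative.

```python
# Sketch of Approach 2: same logical schema, columnar physical storage.
# Columns are stored positionally, so no per-column key copies are needed.

columns = {
    "region":  ["EMEA", "APAC", "EMEA", "APAC"],
    "revenue": [100, 250, 175, 300],
    "units":   [10, 25, 17, 30],
}

def reconstruct_rows(needed):
    """Merge only the requested columns back into rows, so the
    normal (row-store) query executor can process them as usual."""
    return list(zip(*(columns[c] for c in needed)))

# This query needs region and revenue only; "units" is never read.
merged = reconstruct_rows(["region", "revenue"])
emea_total = sum(rev for region, rev in merged if region == "EMEA")
print(emea_total)  # 275
```

The write-performance con is also visible here: inserting one logical row means appending to every one of the column lists, rather than writing a single contiguous record.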
Approach 3: Modify both storage and schema
This is approaches 1 and 2 put together, doing away with both sets of disadvantages:
1. Keeping the storage simple,
2. Avoiding the reconstruction of row-level data by the query processor – basically keeping the data in one (columnar) format and changing the query executor to read data in the same fashion.
Pros and cons:
1. If an operator like MAX or DISTINCT is applied to a column, that column alone can be processed.
2. Selection and aggregation on a column-based data store result in fewer rows being merged, leading to performance improvements.
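A sketch of what a column-aware executor does differently, under the same invented columns as before: operators like MAX run over a single column with no row reconstruction, and selection yields row positions so aggregation reads only the matching values of the target column.

```python
# Sketch of Approach 3: the query executor operates directly on
# columns, so row-level data is never reconstructed.

columns = {
    "region":  ["EMEA", "APAC", "EMEA", "APAC"],
    "revenue": [100, 250, 175, 300],
}

# MAX on one column touches that column alone.
max_revenue = max(columns["revenue"])

# Selection produces row positions; aggregation then reads only
# the matching positions of the target column, so far fewer
# values are merged than in a full row reconstruction.
positions = [i for i, r in enumerate(columns["region"]) if r == "APAC"]
apac_total = sum(columns["revenue"][i] for i in positions)

print(max_revenue, apac_total)  # 300 550
```

Compared with Approach 2, note that no tuples are ever materialized: the executor itself was changed to consume columnar data.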
Now that the approaches to building a column store have been detailed, the next stage is the next step in performance improvement: compressing the storage.