Properties of Analytic Applications : Interesting ways to store and read data : Part 2 – Data I/O OLTP versus OLAP
Data Storage on disk
Database tables are mapped to a one dimensional interface. This is because ultimately the rows and columns that form part of the database have to be referenced to a one dimensional reference on the disk for the information to be stored. This data is retrieved from the disk while the same is accessed.
The most common approach is to map the two dimensional data row by row to this reference point. This way all the information is stored in its entirety.
This approach also lends itself to better inserts which means it is easier to write data into the table since each row is referenced separately and same is the case for updates also.
This is the case with almost all relational databases. By this I mean that :
- Data is stored in a row by row fashion
- Data is retrieved by accessing this time.
Now this design lends itself to situations where :
- A lot of data writes happen
- Querying of the table happens at the most granular level
- Data volumes are not very high ( near future data exists )
The above symptoms / scenarios are typical of an OLTP system. Now the big question … how does this lend itself to a OLAP scenario where :
- A lot of data reads happen
- Querying of the table is uncontrolled and almost always ad-hoc
- Volume of data is very high
With this scenario it can be seen that retrieving this information is quite expensive since for every query each row has to be scanned. This would mean that even if the row has unrelated information ( additional data fields ) which is not required for the query , it is nevertheless fetched.
Well, Have done the difficult part…. Now some questions :
- What am I trying to tell you ?
- Where is this leading to ?
- Does this mean that my data warehouse is dead ?
A lot of heretical questions … but then …
1. Databases have been designed this way all the way from the 1970’s , not much has changed from the time codd’s rules were promulgated… but then those were the times when 1MB used to be a lot of space and disk was extremely expensive and business process automation was key where the intent was to store data in a relational way for easy retrieval for operational day to day needs.
Now the scenario is quite different …
- CPUs have become much more faster and capable and scalable
- Data volumes have increased manifold
By the 80’s people started looking for analytical information from their databases – here this would mean data across months / years etc – this would mean that a lot of data has to be parsed. Here for example in a table of 1 million rows , if I give the following query…
Select distinct(customer) from table.
This would ideally scan the entire table to give me the number of distinct records. But then I might have only about 300 customers for whom I have data.
Where is this leading to ?
Ultimately the aim of this blog is to try and arrive at what happens inside the BIA and how it works .. but then this requires a lot of explanation of new terms – Am trying to explain all the required before going on to the main topic.
Is my datawarehouse dead ?
Not really – only that it can be used much more efficiently. Your investments are safe!!
Coming back to the discussion….
Going back to the query … is there any easier way to get this out ? I would ideally like to scan only 300 records instead of 1 million records…
The question is how do we do this and can the same be achieved with the existing database architecture or something radical is required.
One particular approach is to have a separate database ( read data warehouse ) for analytical queries and have the transactional systems go on as expected. That’s where we are .. but then the underlying concern of running relational queries itself is challenged which comes to asking are relational databases out of their time ?