I recently read this from a BI reporting tool vendor about an upcoming webinar they were broadcasting:
Would business people really know how to wade through raw, hard-to-understand data? Consider these raw ECC names (<table>.<column>) without the benefit of the data dictionary:
I think the answer is: not until a data modeler has imposed some structure on it via a more user-friendly view or a BI tool’s semantic layer. Whether the underlying platform is a traditional relational database or a tool like Hadoop/Hive from the unstructured-data-processing world, some type of user-friendly structure should be imposed in order to “empower business people” and “enable self-service.”
In this entry, I would like to cover data modeling tasks associated with a very frequent type of project that we get in the Database & Technology Services (DTS) consulting group: offloading the reporting functions running in an OLTP (online transaction processing) database into an analytic database like SAP Sybase IQ or SAP HANA. I’m not talking about migrating ready-made reporting systems like BW, but rather in-house-built reporting. Typically, the motivations to offload reporting from OLTP systems are to
- Take some of the performance load off of the OLTP database
- Speed up the performance of report generation/execution
- Take the first steps into data warehousing by leveraging/extending/transforming operational reporting into a full-fledged data warehouse
In order to get quick time-to-value from the project, customers generally want to take the schema directly from the OLTP database and instantiate the equivalent on the analytic database. Most clients are using some type of BI front-end tool like BusinessObjects so existing reports will run with minimal changes, if any. Once the schema has been ported, data provisioned, and reports are functional again, some additional considerations should be taken into account in order get the maximum performance of the analytic engine.
A well-designed OLTP schema has been normalized (usually to third normal form), so one of the first items to look at is whether or not to denormalize some tables.
- In database theory, the search space for finding an optimal query plan is O(n·2^(n-1)), where n is the number of relations in the query; this grows exponentially, not polynomially, with n. For queries with a “large” number of tables and joins, eliminating some joins by denormalizing will speed up query performance, for example, joins to attribute tables to retrieve description information.
- Schema changes over time (adding or removing columns) are more efficient in column stores than in row stores because no reorganization of the table is needed.
- Update performance suffers when data is denormalized, since a single change must be applied to every redundant copy.
- Redundant data will take additional storage. However, this may not be such a big deal in a columnar database, since you’ll get some amount of compression; for dimension tables, any expansion of disk space should be insignificant relative to the overall database size.
- Additional data management is needed to keep redundant copies of data consistent across tables.
Consider denormalizing dimension tables if your schema changes rarely, and decide which slowly-changing-dimension technique (Type 1, 2, 3, or some hybrid) you will adopt.
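To make the trade-off concrete, here is a minimal sketch of join elimination via denormalization. It uses SQLite through Python purely as a stand-in for an analytic database, and the table and column names (sales, product, product_desc) are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Normalized: fact rows join to an attribute (dimension) table for descriptions.
cur.execute("CREATE TABLE sales (sale_id INTEGER, product_id INTEGER, amt REAL)")
cur.execute("CREATE TABLE product (product_id INTEGER, product_desc TEXT)")
cur.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                [(1, 10, 100.0), (2, 10, 50.0), (3, 20, 75.0)])
cur.executemany("INSERT INTO product VALUES (?, ?)",
                [(10, "Widget"), (20, "Gadget")])

normalized = cur.execute("""
    SELECT p.product_desc, SUM(s.amt)
    FROM sales s JOIN product p ON s.product_id = p.product_id
    GROUP BY p.product_desc ORDER BY p.product_desc""").fetchall()

# Denormalized: the description is carried on every fact row, so the join
# disappears, at the cost of redundant storage and extra maintenance
# whenever a description changes.
cur.execute("""CREATE TABLE sales_denorm AS
    SELECT s.sale_id, s.product_id, p.product_desc, s.amt
    FROM sales s JOIN product p ON s.product_id = p.product_id""")

denormalized = cur.execute("""
    SELECT product_desc, SUM(amt) FROM sales_denorm
    GROUP BY product_desc ORDER BY product_desc""").fetchall()

print(normalized == denormalized)  # same answer, one fewer join
```

The report gets the same answer either way; the denormalized form simply removes one relation from the optimizer’s search space.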
One of the key killers of query performance in analytic databases is the preservation of OLTP-style, row-by-row processing, typically done using cursors within stored procedures. Often, these stored procedures encode “business rules” because they produce the result set for a report. The business rules consist of if-then-else or case-when-then-else constructs that compare column values against hard-coded values. Take the example of storing a year’s worth of data in one row: a column exists for each month of the year, alongside a year column:
(PK, fiscal_yr, jan_amt, feb_amt, …, dec_amt,…)
In order to break out values for each month of a particular year, you’ll see code like this:
select …, sum(case @selectedMonth when 1 then jan_amt when 2 then feb_amt … when 12 then dec_amt else 0 end) from …
Perhaps there was a legitimate reason why time was not represented as a formal dimension in the data model, but if time-based queries are a regular characteristic of your reporting workload, then you need to seriously consider making time a formal dimension and redesigning your model to eliminate every-row evaluations, which are expensive and will not scale well as data volumes grow. In general, if you have queries that force the database to evaluate every row’s column values, consider redesigning your data model and refactoring your stored procedures to eliminate those evaluations and, instead, use joins of fact to dimension tables to achieve the same results.
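Here is a small sketch of that remodeling, again using SQLite through Python as a stand-in and hypothetical table names (yearly_amt, monthly_amt); for brevity, only three month columns are shown:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# One row per year, one column per month (the OLTP-style layout).
cur.execute("CREATE TABLE yearly_amt (pk INTEGER, fiscal_yr INTEGER, "
            "jan_amt REAL, feb_amt REAL, mar_amt REAL)")  # ... through dec_amt
cur.execute("INSERT INTO yearly_amt VALUES (1, 2013, 10.0, 20.0, 30.0)")

# OLTP style: a CASE expression is evaluated against every row.
selected_month = 2
case_style = cur.execute("""
    SELECT SUM(CASE ? WHEN 1 THEN jan_amt WHEN 2 THEN feb_amt
                      WHEN 3 THEN mar_amt ELSE 0 END)
    FROM yearly_amt""", (selected_month,)).fetchone()[0]

# Remodeled: one fact row per (year, month). The month is now data,
# not a column name, so a plain constraint replaces the CASE.
cur.execute("CREATE TABLE monthly_amt "
            "(fiscal_yr INTEGER, month_no INTEGER, amt REAL)")
cur.executemany("INSERT INTO monthly_amt VALUES (?, ?, ?)",
                [(2013, 1, 10.0), (2013, 2, 20.0), (2013, 3, 30.0)])
dim_style = cur.execute(
    "SELECT SUM(amt) FROM monthly_amt WHERE month_no = ?",
    (selected_month,)).fetchone()[0]

print(case_style, dim_style)  # identical results
```

With the month stored as a value, the database can use a simple predicate (and, in a real analytic engine, its indexes and column pruning) instead of computing a CASE expression for every row.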
Here’s an example of using cursors:
declare MY_CURSOR dynamic scroll cursor for
    select … from table_x where col_y = @inputVarValue and …;
open MY_CURSOR;
myLOOP1: loop
    fetch next MY_CURSOR into @myVar;
    if sqlstate = err_notfound then leave myLOOP1 end if;
    if <some-condition> then … end if;
end loop myLOOP1;
close MY_CURSOR;
These types of constructs, which force the database to process a large data set row by row, need to be refactored so that any “business rules” are captured in the data model itself as dimension tables/attributes; then you can simply use joins and constraints in your queries to get the results you desire.
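As a sketch of that refactoring, the example below contrasts row-at-a-time processing with a single set-based statement. It uses SQLite through Python (the row-by-row loop stands in for a stored-procedure cursor), and the tables and the business rule (orders, rule_dim, “EAST orders over 100 are PRIORITY”) are all hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE orders (order_id INTEGER, region TEXT, amt REAL)")
cur.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(1, "EAST", 500.0), (2, "EAST", 50.0), (3, "WEST", 500.0)])

# Cursor style: fetch one row at a time and apply a hard-coded rule.
cursor_result = []
for order_id, region, amt in cur.execute(
        "SELECT order_id, region, amt FROM orders ORDER BY order_id"):
    if region == "EAST" and amt > 100.0:     # hard-coded business rule
        cursor_result.append((order_id, "PRIORITY"))
    else:
        cursor_result.append((order_id, "STANDARD"))

# Set-based style: the rule lives in a dimension table, and one join
# classifies every row in a single statement.
cur.execute("CREATE TABLE rule_dim (region TEXT, min_amt REAL, label TEXT)")
cur.execute("INSERT INTO rule_dim VALUES ('EAST', 100.0, 'PRIORITY')")
set_result = cur.execute("""
    SELECT o.order_id, COALESCE(r.label, 'STANDARD')
    FROM orders o LEFT JOIN rule_dim r
      ON o.region = r.region AND o.amt > r.min_amt
    ORDER BY o.order_id""").fetchall()

print(cursor_result == set_result)
```

Changing the rule now means updating a row in rule_dim rather than editing procedural code, and the database engine is free to process the whole set at once.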
At one customer, someone had designed a table where one column held a value that had to be evaluated to interpret the value and meaning of another column in the same table. Because the value of that second column could be numeric or character data, its datatype was made varchar(n), forcing conversions in some cases. This meant that the semantic meaning and value of the second column were heterogeneous. In fact, the designer actually named some of the columns col1, col2, col3, etc.! I thought to myself: this was clearly a case where the designer came from the hierarchical-database school, where the application was responsible for traversing and interpreting the values of columns in records. Please, if you’re going to use a relational database, keep the data in a specific column homogeneous and give the column a meaningful name.
I/O Consequences of Row vs Column Stores
In a row-based RDBMS, a ‘select * from table_x where <some-condition>’ has the same paging activity as ‘select col_i from table_x where <some-condition>’. Because a row resides on a particular page, the page containing the entire row gets fetched from disk whether you’re retrieving one column or all columns, so the I/O cost of fetching one or all columns can be thought of as equivalent. In columnar databases, however, pages contain column values, so a ‘select *’ has a heavier cost than a single-column select (‘select col_i’). Therefore, you should examine your queries and modify them so that you retrieve only what you need, to reduce I/O.
Circling back to the quote cited at the beginning of this blog entry, many organizations want to enable end-user self-service for reports. If end users are expected to understand what data is available, they have every reason to expect each data item to have an understandable name and consistent semantics. If your data model contains cryptic names that must be preserved from legacy systems, such as 8-character-constrained names or generic names like col1, col2, etc., you owe it to your users to at least build views with intuitive names for the underlying columns or provide a business-friendly semantic layer using BI tools.
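Such a renaming view is a one-statement fix. A minimal sketch, again in SQLite via Python, with a hypothetical legacy table (t_ord01) and invented business names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Legacy table with cryptic, 8-character-constrained names.
cur.execute("CREATE TABLE t_ord01 (col1 INTEGER, col2 TEXT, col3 REAL)")
cur.execute("INSERT INTO t_ord01 VALUES (42, 'ACME', 99.5)")

# A thin view gives business users intuitive names without touching the
# underlying schema that legacy loads still depend on.
cur.execute("""
    CREATE VIEW customer_order AS
    SELECT col1 AS order_id, col2 AS customer_name, col3 AS order_total
    FROM t_ord01""")

row = cur.execute("SELECT order_id, customer_name, order_total "
                  "FROM customer_order").fetchone()
print(row)  # (42, 'ACME', 99.5)
```

Point the BI tool’s semantic layer (or the users themselves) at the view, and the legacy names never have to surface in a report.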