Introduction to Multidimensional Databases
The purpose for this post is to familiarize myself more with the terms and concepts around multidimensional databases and data warehouses, this post is part of a series that will talk more about business intelligence and will be an introduction for upcoming posts about SAP technologies in general and the SAP Business Intelligence technology stack.
The multidimensional data model emerged as it was found out that the previously wide spread model (the relational model) was efficient is performing “online” transactions but did not perform well when the objective was to analyze data (especially on large scale). So, the multidimensional model was designed specifically to express support for data analysis and is the core component in the Business Intelligence (BI) industry.
The multidimensional model presents data as facts with associated numerical values called measures or dimensions that characterize the facts and they are “mostly” textual. Queries on the model usually aggregate measure values over a range of dimension values to produce results such as sale per month and city. Multidimensional models have three important application areas, they are:
- Data warehousing: A data warehouse is typically a large repository of integrated data obtained from different resources for the purpose of data analysis (Data marts are subsets of a Data Warehouse). Feeding data into the warehouse is done using the Extract-Transform-Load (ETL) process: First the data is extracted from the operational source systems (ERP, CRM … etc.) and the then the transformation process is applied in order to unify the data into the warehouse format; most of the data integration techniques are done in this phase (cleansing, matching … etc.). Finally the loading is process is applied to import the data to the warehose. ETL can sometimes be referred to as ETT (Extract-Transform-Transport).
- Online Analytical Processing (OLAP) Systems: Systems that provide fast answers to queries that aggregate large amounts of data to find overall trends; the results are presented in multidimensional fashion. As opposed to the well known Online Analytical Transaction Processing (OLTP) the focus is on data analysis rather than transactions.Moreover, when doing OLTP the focus is on a single transaction whereas OLAP systems keep track of each individual transaction. OLAP systems generally never delete nor update its data; only additions of new data takes place periodically, thus OLAP systems are optimized for retrieving and summarizing large amounts of data. OLAP systems come mainly in three broad categories:
- Data Mining : The aim is to discover unknown knowledge in the the large data available
Dimensions are used for three purposes; the labeling, selection(filtering) and the grouping of data at a desired level of detail (they are often textual and considered as labels for data). A dimension is organized into a containment-like hierarchy composed of number of levels, each of which represents a specific level of details. The instances of the dimensions are typically called dimension values or members; each value or member belongs to a particular level. for example, in the figure above T,State and City are dimension levels; each level may have a number of properties; each property is associated with a simple non-hierarchical information i.e. City -> population. Boston, Florida … etc. are all dimension values or members.The dimension hierarchy is defined in the meta-data of the cube (the data structure in multidimensional models that we will discuss later); by doing that we will avoid duplicated definitions of the hierarchy afterwards. In some cases we might have defined multiple hierarchy definitions for example for Fiscal and Calendar year. Most models require that dimensions form a balanced tree so that it will have uniform weight everywhere however this is not the case always.A conformed dimension is a set of data attributes that have been physically implemented in multiple database tables using the same structure, attributes, domain values, definitions and concepts in each implementation. A conformed dimension cuts across many facts. Dimensions are conformed when they are either exactly the same (including keys) or one is a perfect subset of the other
A measure has two components, a numerical property and a formula (usually an aggregation function such as sum or average). Measures generally represent the properties of a chosen fact. There are three classes of measures:
- Additive measures: can be combined meaningfully along any dimension (ex: sales over location and time) this causes no overlap between the real world phenomena that caused the individual values.
- Semi-additive measures: values cannot be combined along one or more of the dimensions (usually the time dimension); for example it doesn’t make sense to sum inventory over time as the same inventory item may be counted several times.
- Non-additive measures: Cannot be combined along any dimension usually because of the chosen formula; for example when averages for lower level values cannot be combined with averages of higher level values.
Facts are the objects that present the subject of the analysis. They are mostly defined by their combination of dimension values. a fact has a certain granularity which is determined by the levels from which its dimension values are drawn. There are mainly two types of facts:
- Event Facts: they usually model events in the real world; meaning that there is exactly one fact for each unique event in the overall real-world process that is captures. For example with a fine granularity there will be one fact for each particular sale of a specific item; with a coarser granularity there would be one fact for each day when one or more sale of a specific item occurred If event facts did not have any measures they are called measurelss facts or factless facts, for example the record that represents a contact with a customer to inform him about a new offer.
- Snapshot Facts (state Facts): models the stats of a given process at a certain point in time.
Event facts are made to represent events that can happen unpredictably in time; however snapshot facts are typically made to represent the states at predefined times.
A cube is a multidimensional data structure for capturing and analyzing data. It generalizes the tabular spreadsheet such as there can be any number of dimensions (in contrast to only two in the tabular spreadsheets). The terms hypercube and cube are used interchangeably as sometimes the term cube can infer that it hold only three dimensions. A collection of related cubes is commonly referred to as a multidimensional database or multidimensional data warehouse.A cube consists of a set of cells at each of the dimensions intersections; a non empty cell is a fact. A fact has a number of measures associated with it. A sparse cube is one with high percentage of empty cells; a dense cube is one with high percentage of facts. Cubes tend to become increasingly sparse with increasing dimensionality and with increasingly finer granularities of the dimension value.
A pivot table is a two dimensional table of data with associated subtotals and totals. It may also allow the user to use hierarchies to drill down or roll up. It can be also nested into several dimensions on one axis. It can be also pivoted such as the dimensions can be rotated (swapping x and y).
Multidimensional Querying (MDX)
Dedicated querying languages that offers a number of ranking and sorting function on multidimensional data
There are two principal ways of representing dimensions:
A star schema has one dimension table for each dimension, this table has a key column and one column for each level of the dimension (except T). Furthermore, a star schema has a Fact Table that hold a row for each multidimensional fact, it has a column for each dimension. They primary key in the dimension tables is typically a surrogate key (ID), this will result in better storage, prevention of key-reuse problems and more efficient query processing. However, we notice that there is redundancy in higher level data, however it is not a problem space-wise and it does not cause any update problems thus it is often good idea to use such redundant dimensions tables to optimize performance.
It is much similar to the star schema, however it contains several dimension tables for each dimension; this will result in removing the redundancy found in star schemas. The result of this is that querying the schema is now harder since several joins must be applied; it is also more time consuming to compute the results.
Analysis and Querying
The support for analysis and querying on cubes is done using these operations:
- Rolling up: Rolling up causes the data view to go up to a higher cross grained view
- Drilling down and Drilling Out: The opposite of rolling up, the data view becomes more fine grained and detailed it offers more details by expanding one or more dimensions from a non-T level to a lower level. Drilling out occurs when a drill down is done by including an additional dimension at a non-t level, after a drill out the measure values are spread out among more cells.
- Slicing and Dicing: This happens when an analyst wishes to consider a subset of the cube, so he selects a specific value for a dimension; this is a slice. It is possible to slice the resulting slice further in what is called a dice. Slicing generally refer to filtering out data, and dicing refers to grouping out the filtered data.
- Drill Across: This is done when we do operation on more than one cube that share one or more conformed dimensions. The data from these cubes is combines by these shared dimensions, this in relational terms corresponds to a Full Outer-join.
- Pivot: allows an analyst to rotate the cube in space to see its various face