As announced in our fourth BIA blog in our series “BIA Changes Everything!”, we will now focus on aggregates. (If you haven’t had a change to look at the fourth blog, you can find it BIA BLOG Series, Part IV: BIA Monitoring and Maintenance.)
Before BIA was available, aggregates provided the most effective way to improve query performance. Customers that have deployed BIA, on the other hand, have removed most – if not all – aggregates. The reasons are obvious:
It’s much faster to roll-up data into BIA than into aggregates.
Change-runs complete much earlier if there are no (or very few) aggregates.
Aggregates use up disk space.
The creation of aggregates requires a detailed understanding of the business requirements, e.g., what queries are executed most frequently, what filter values are used, what is the drill-down path, etc.
The usage of aggregates needs to be monitored on a regular basis: While some aggregates may no longer be needed, new aggregates are required if there are any changes to the underlying cube, queries or user behavior.
In short, aggregates have a negative impact on the storage requirements, data latency and administration. In an earlier blog, we made some recommendations regarding the BIA deployment process. For instance, until you have gained more experience with BIA, it may be a good idea to deactivate aggregates rather than deleting them. Please note that while it is fine to have both active aggregates and a BIA index for a particular cube, the roll-ups and change-runs are only faster if you have eliminated (or deactivated) the aggregates. You wouldn’t benefit from active aggregates anyway, since they are never used when executing a query. (Unless, of course, there is a serious issue with BIA and you deactivate the indexes or have configured the so-called database failover.)
Why – you might wonder – do we have a blog on aggregates and BIA if the basic message is to utilize BIA rather than aggregates? The answer is simple: Using BIA instead of aggregates has an impact on how cubes and queries should be designed. As we mentioned in our first blog: BIA changes everything!
Before focusing on design issues, let’s briefly discuss whether there are still reasons to keep (active) aggregates. Let’s consider the following statements that I have come across many times (even in an FAQ document published by SAP!).
- If you have already a good response time, then there is no need to delete the aggregates and create a BIA index.
- An aggregate may provide a better performance than a BIA index.
The first statement can easily be dismissed. If you don’t have any aggregates and the performance is good, then you don’t need to create a BIA index. (There are some customers that consider creating BIA indexes even in this case, since BIA may help reduce the load on the database server. You might consider this approach if the load on your BI server is continuously very high or if you are paying a (hardware) license fee based on the CPU utilization of your BI server.)
If you have aggregates, then you probably had some performance issues. Since the impact BIA has on roll-ups, change-runs, disk space and maintenance is not as bad as that of having aggregates, I would recommend to delete the aggregates and utilize BIA instead.
What if aggregates perform better than BIA? First of all, there are very few cases where this is actually the case. I have seen a lot of statistics where executing a query based on an aggregate seemed to provide a (slightly) better performance than utilizing a BIA index. Consider the following (actual) statistics from the very first BIA customer installation in the US:
Test environment containing an InfoCube with 30 million records
415,000 records selected; 7,400 records returned
The response time without an aggregate was 773 seconds; 2.3 seconds with a perfectly tuned aggregate (i.e. an aggregate that contained exactly the required 415,000 records) and 2.8 seconds with BIA (no OLAP cache). The customer immediately decided to switch off the aggregate despite the fact that it performed slightly better than BIA. Clearly, the aggregate could not be used if the user changed the filter conditions to be less restrictive (since the aggregate contains only the records for the current condition). We initially concluded that the reason why the aggregate performs slightly better was that it contained exactly the 415,000 records whereas BIA had to select the same records out of 30 million records that were included in the BIA index. It took us some time before we realized that this analysis was wrong. What really happened was that since the aggregate was so small, it was completely stored in the main buffer of the database! In other words, please be aware that the statistics obtained from running queries in a test environment may not be meaningful. The chances that aggregates remain in the database buffer in a production environment, however, are much lower.
The aggregates that you should not delete are those that you may have created in order to improve the performance of OpenHub. BIA was designed to improve the performance of ad-hoc reporting – not data extraction from an InfoCube. In other words, the OpenHub interface will read the data from an InfoCube (or aggregate) even if an active BIA index exists.
As promised, let’s now have a look at what impact aggregates and BIA have on the design of queries and InfoCubes. As I mentioned in the first blog of this series, it is often the case that data in an InfoCube is aggregated whereas DSOs are often used to store line item information. Many people, including Ralph Kimball, have argued that data should always be stored at the most granular level . There are a lot of benefits if data is stored this way and we will provide a comprehensive list in a future blog. As we all know, performance problems resulting from the fact that a large number of records may have to be retrieved from an InfoCube can often be eliminated by creating aggregates. However, in addition to the issues mentioned at the beginning of this blog, aggregates have another major disadvantage that is often overlooked by BI developers and analysts. The response time of performing a drill-down depends on whether a suitable aggregate is available that contains all required data. Consider a user who is executing an ad-hoc query. The first few drill-downs may be very fast since suitable aggregates exist. Subsequent drill-downs, however, may time out since no aggregate is available and a huge amount of data has to be retrieved from the InfoCube. Can the next drill-down be completed within a few seconds since an aggregate can be utilized or will it take a very long time? From a user’s perspective, the response time seems to be non-predictable.
Here’s how most developers solve this particular problem:
Queries are created with mandatory filter values. The values that can be entered are restricted in such a way that the number of records that can be selected in the cube is reduced considerably. Furthermore, aggregates are created in such a way that an applicable aggregate can be utilized no matter what filter values are entered in the selection screen of the query. The advantage of this approach is that good response times can always be guaranteed. The above-mentioned “unpredictability of the response time” when performing a drill-down no longer exists. We are using filter values to restrict the data in such a way that an aggregate can be built to guarantee a good query response time!
It is very unfortunate that most queries are designed like this, but there are very few alternatives to the above-mentioned approach. The main problem is that we no longer provide a user with access to all information that is stored in the InfoCube. In order to achieve a better performance, we are making it very tough for a user to analyze the data. The more restricted the values in the selection screen of the query are, the easier it is to create an aggregate and the better the performance. Consequently, the team that is developing queries and reports need to work closely with those who are creating the aggregates.
Needless to say that the mandatory variables in the selection screen of a query may no longer be required if the cube is (BIA-) indexed. Since many queries for a particular InfoCube only differ by their selection screens and the characteristics in the rows and/or columns, it is easy to see that a deployment of BIA may lead to fewer queries that need to be created and maintained.
Before making these changes, two guidelines need to be followed:
When designing queries, please ensure that most characteristics are defined as free characteristics. Since the number of records selected in the InfoCube could be huge (since there are no mandatory filters in the selection screen), executing a query that has a lot of characteristics in the rows or columns (instead of free characteristics) will have a very large result set which could lead to all sorts of problems.
Furthermore, since users may not be used to ad-hoc reporting (due to the fact that the number of records selected in the cube used to be severely restricted through the use of mandatory filter values), a user training may be required. While the vast majority of users are experts in analyzing data, I have seen a lot of users that execute queries which display thousands of records in a report. Ideally, a drill-down should followed by a drill-up or by setting a filter value before performing a drill-down by another characteristic. Also, please be patient: For some users it may take some time before they start utilizing NW BI functionality to perform ad-hoc reporting in the system rather than running reports in order to load data into Excel for further analysis.
Since aggregates are no longer used, the above-mentioned cooperation between the team that creates the queries and the team that creates aggregates is – obviously – no longer required. Allowing certain users to create their own queries is – at least from a performance perspective – no longer an issue.
This concludes our fifth blog on BIA. Previously published blogs can be accessed below: