
The new semantic layer in BI 4.0 gives you the ability to do a couple of magical, powerful things that weren't possible in prior releases. Namely, you can join data from heterogeneous data sources (the so-called and much-awaited "Multi-Source Universe" feature), and you can also query tables from BW DSOs and InfoProviders without the need for a BEx query.

Join Small to Big

Before we get too far into the architecture, I'd like to explain a key feature that makes this possible. Imagine that you have two databases, one called CRM and one called SALES. They are on separate network segments. You would like to build a universe that joins data between these databases, and you would rather not do the ETL work to build a data mart for this. You want a table called CRM.ACCOUNTS to be joined with a table called SALES.ORDERS.

The new "Data Federation Service" in BI 4.0 makes this possible. This service hosts the engine behind the legacy Data Federator 3.x product. The DF Service (henceforth, DF) can do many of the things a relational database engine can do:

- optimize SQL queries

- query data sources

- cache intermediate results in Java memory or in disk cache

- join recordsets

- apply functions, aggregations, filters, etc.

Let's say your CRM.ACCOUNTS table has 1,500 rows, while SALES.ORDERS contains 12,000,000 rows. If you were the DF engine, how would you join these tables together when they come from two different databases?

One way is to SELECT * on both tables, fetching the results into memory and then joining them together in temp space. This is obviously the most expensive approach! Every time a business user refreshes a WebI report, do you want them doing a full table scan on your SALES.ORDERS table? What would that do to your transactional application running on the SALES database? This would also transfer quite a bit of data from SALES to the SIA node running the DF Service, and significantly bloat the memory on that machine while it joins 12,000,000 rows to 1,500 rows.
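
To make the contrast concrete, here is a rough sketch of what the pushed-down queries would look like under this naive plan (illustrative only; the actual SQL that DF generates depends on the connector):

-- Naive plan: full scans pushed down to both sources
SELECT * FROM CRM.ACCOUNTS;    -- 1,500 rows fetched into DF
SELECT * FROM SALES.ORDERS;    -- all 12,000,000 rows fetched into DF memory/disk cache
-- DF then joins the two recordsets itself in its own temp space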

Thankfully, DF has features that can avoid this. But first, the DF engine needs to know the approximate sizes of the tables involved in this join. As with most databases, DF uses table statistics to create more optimal query plans. The first thing you need to do is refresh those statistics -- check out Section 3.3 in the Data Federation Administration Tool Guide for details on how to refresh or manually override table and column statistics.

Once statistics are refreshed, DF knows that SALES.ORDERS is much larger than CRM.ACCOUNTS. Thus, it presumes that a table scan on SALES.ORDERS should be avoided unless absolutely necessary. If you were the database engine, what could you do to avoid getting all rows from SALES.ORDERS? You could perform this query in four steps:

  1. Since it is smaller, SELECT * from CRM.ACCOUNTS and cache it to memory.
  2. Find all distinct values on the CRM.ACCOUNTS field that participates in the join with SALES.ORDERS.
  3. Fetch only those records from SALES.ORDERS where the values in the join field match the distinct values you found in step 2. Note: if this column has an index, this filter will be MUCH faster.
  4. As potentially matching rows arrive from SALES, compare them to the cached data from CRM.ACCOUNTS and enforce the join criteria. Send the results to the process that requested it.
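
In rough SQL terms, the two pushed-down queries look something like this (an illustrative sketch; the SQL that DF actually generates depends on the connector):

-- Step 1: fetch the small table and cache it in DF
SELECT * FROM CRM.ACCOUNTS;

-- Steps 2 and 3: fetch only the matching rows from the large table, filtered
-- by the distinct join-key values found in the cached CRM.ACCOUNTS results
SELECT <projection>
FROM SALES.ORDERS
WHERE SALES.ORDERS.<keyField> IN (<distinct values from CRM.ACCOUNTS>);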

What's a Semi-Join?

DF refers to this algorithm as a semi-join. This algorithm assumes that it is faster to fetch a filtered subset of data from a large table than it is to fetch the entire large table. In a typical example, I've seen semi-join reduce the query time by a factor of 1,000. Trust me, you want this feature!

But semi-join can be a fickle thing! It is a feature of the DF optimizer, and as such, it is only enabled when a user submits a query that can benefit from semi-join. In the example above, DF required a priori knowledge that CRM.ACCOUNTS was orders of magnitude smaller than SALES.ORDERS. If your statistics have never been refreshed, every row count and cardinality is 0. Thus, DF assumes every table is infinitely small. So you have to refresh your statistics! DF also needs the query to be an inner or left-outer join from CRM.ACCOUNTS to SALES.ORDERS. A right-outer join would require all values from SALES.ORDERS, so it would not be possible to reduce the data fetched from SALES.ORDERS using a semi-join. There are also some other ways that semi-join can get excluded from the query plan (making the query run slower), but I'd rather discuss those in a follow-up article.
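
As a rough illustration of the join-direction requirement (ACCOUNT_ID is just a made-up join field for this example):

-- Semi-join is possible: only the ORDERS rows that match ACCOUNTS are needed
SELECT <projection>
FROM CRM.ACCOUNTS A
INNER JOIN SALES.ORDERS O ON O.ACCOUNT_ID = A.ACCOUNT_ID;

-- Semi-join is not possible: a right outer join needs every row of SALES.ORDERS
SELECT <projection>
FROM CRM.ACCOUNTS A
RIGHT OUTER JOIN SALES.ORDERS O ON O.ACCOUNT_ID = A.ACCOUNT_ID;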

How do you know if semi-join is being used for multiple sources? Open the Data Federation Administration Tool, or DFAT [I say "dee-fat", btw] and go to the "Query Monitoring" tab. Find the query that you executed from a report and click the [+] to expand the query. You will see the list of queries that were submitted to various data connectors in order to satisfy the multi-source query. You will also see the row counts retrieved from each query, as well as the execution times for these queries. Hopefully you see 1,500 rows fetched from the query to CRM. In the query submitted to SALES, you should see that there is a filter on the field that joins SALES.ORDERS to CRM.ACCOUNTS. Since this filter exists in the query pushed down to the source, you should see far fewer than 12,000,000 rows fetched from SALES.

But what if semi-join isn't being activated?

Let's say you still see 12,000,000 rows being fetched from SALES, and that table scan takes 5 minutes to run. This is obviously a performance problem, and it won't give your users acceptable query times on this universe. It will also hurt the scalability of the SALES database and the BI environment that has to cache and join 12,000,000 rows.

The most likely problem is that your statistics are wrong.

  1. From the DFAT "Query Monitoring" tab, copy the multi-source query that is performing poorly.
  2. Go to the DFAT "Query Panel" tab and paste the query into the SQL Text box.
  3. In the upper right, select the drop-down menu on the "Run" button. Select the "Explain Statistics" action.

In the results panel, you will now see a list of all tables and columns required to satisfy the given SQL Text. Examine the table row counts and the field cardinality values. Are these values an accurate reflection of the true size of the data in your actual source system? Yes, this might require a trip to your friendly neighborhood DBA.
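
If you want to sanity-check those numbers yourself before bothering the DBA, a couple of ordinary queries against the source (assuming you have read access to it) will give you the true values:

-- True row count of the large table
SELECT COUNT(*) FROM SALES.ORDERS;

-- True cardinality (number of distinct values) of the join field
SELECT COUNT(DISTINCT <keyField>) FROM SALES.ORDERS;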

There are also a number of other reasons semi-join could be de-activated. Section 3.4.5 of the DFAT Guide talks about these, but I'll be discussing these later in a follow-up article.

Semi-Join and Relational BW Jco

Great, but how does this relate to BW Jco? In BI 4.0, you have a few ways to access BW:

- Legacy .unv OLAP Universe: requires a BEx query; connects via MDX

- WebI connection to BICS: requires a BEx query; does not allow the use of a Universe

- Relational .unx Universe: does not require a BEx query; exposes InfoProviders or DSOs to end-users for BI self-service

For more info on when to use these different options, check out didier.mazoue's Q&A about how the Semantic Layer should be used with BW.

If you use the relational .unx option, the queries are satisfied by DF. This is because DF can expose InfoProviders, InfoObjects, and DSOs to business users as if they were just tables and columns in a dimensional model. When DF submits a query to BW, it does not use SQL. Instead, it makes RFC calls to DF-specific functions installed on BW (these are the "DF Façade" functions). These functions can only query one BW table at a time. For example, if you were querying an ordinary relational database on tables D0MATERIAL and T0MATERIAL, you could simply write one SELECT statement with the join criteria between the tables. The DF Façade functions, however, only allow one table per function call. Thus, DF must first query D0MATERIAL and then query T0MATERIAL in two separate RFC function calls.
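
In SQL terms, the difference looks roughly like this (illustrative only -- the RFC calls to the DF Façade are not SQL at all, and MATERIAL is just a made-up join field):

-- Ordinary relational source: one pushed-down query performs the join
SELECT <projection>
FROM D0MATERIAL D
INNER JOIN T0MATERIAL T ON T.MATERIAL = D.MATERIAL;

-- BW via the DF Façade: one table per RFC call, so the join happens inside DF
-- (conceptual equivalents of the two separate calls)
SELECT <projection> FROM D0MATERIAL;
SELECT <projection> FROM T0MATERIAL;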

jonathan.haun3 recently discussed the pros and cons of the new Relational BW Jco Universe feature. However, he implied that every call to a BW table requires all data from the table to be fetched into DF memory and joined with all data from any other BW table involved in the join. This is not entirely true, because DF can use the semi-join operator!

Just as with the CRM and SALES tables above, DF can analyze whether a query between two BW tables can benefit from the semi-join operator. As long as the statistics indicate that one of the BW tables is much larger than the other BW table, DF will try to generate a semi-join. Here's how it satisfies a semi-join between two tables from BW:

  1. Fetch all required data from the smaller BW table
  2. Fetch from the larger BW table only those rows which match the join field values found in the results from the smaller BW table
  3. Join these results together in memory in the DF Service in the BI processing tier

Just as semi-join can radically improve the response time for a multi-source Universe query, I have seen similar results when semi-join gets activated in Jco relational queries to BW. This is especially true when the larger BW table has a secondary index on the field(s) involved in the join...that way, BW can use the database index to quickly return only the rows needed to evaluate the semi-join.
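
On a plain relational source, giving the optimizer that kind of help is as simple as indexing the join field; on BW you would define an equivalent secondary index through the ABAP Dictionary rather than with raw DDL. A generic sketch, reusing the placeholders from the earlier example:

-- Generic DDL sketch: index the join field of the larger table so that the
-- semi-join's pushed-down IN filter can use an index range scan
CREATE INDEX IDX_ORDERS_KEY ON SALES.ORDERS (<keyField>);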

Advanced Semi-Join Settings

When a semi-join is activated by the optimizer, it means that DF will fetch some values from a smaller table and then use those values to filter out data in the larger table. There are a few methods by which DF can perform this filtering. Section 4.3.1 of the DFAT Guide briefly mentions the connector options for the property "semiJoinExecutionStrategies", but I want to make sure you know what they actually mean.

  • "I" -- semi-join will filter the larger table based on an IN list of values.  For example,

SELECT <projection>
FROM SALES.ORDERS
WHERE SALES.ORDERS.<keyField> IN (<value1>, <value2>, ... <valueN>)

  • "T" -- semi-join will first create a temporary table in the database, populate the temporary table with the key values necessary to satisfy the join, and then filter from the larger table based on the values in the temporary table. For example:

CREATE TABLE <tempTableName> (...
INSERT INTO <tempTableName> VALUES (...
SELECT <projection>
FROM SALES.ORDERS
WHERE SALES.ORDERS.<keyField> IN (SELECT <keyField> FROM <tempTableName>)
DROP TABLE <tempTableName>;

  • "P" -- semi-join will use a prepared statement to parameterize a filter from the larger table based on values from the smaller table.  For example:

SELECT <projection>
FROM SALES.ORDERS
WHERE SALES.ORDERS.<keyField> = ?

This SELECT statement will be executed once for each unique value found in the smaller table.

By default, this connector property is set to the system parameter SEMI_JOIN_EXECUTION_STRATEGIES, so you can control it at a global level. You can customize this for each connector, though. The "I,T,P" setting is probably good enough for most folks, but I want to make sure you understand the impact of each setting:

"I" takes a little bit more time in the query optimizer. This is because it has to format a potentially large SQL SELECT statement. Actually, if there are many values found in your smaller table (e.g. CRM.ACCOUNTS) it might do more than one SELECT statement. The connector property maxValuesInInClause limits the number of values in a single IN clause. You may have to adjust this parameter if your source database doesn't like receiving tons of IN values in a really massive SELECT statement. For example:

Your query to the "smaller" table in the semi-join returns 2500 unique key values. Your maxValuesInInClause is set to 1000. DF creates three different SELECT statements to the "larger" table:

SELECT <projection>
FROM SALES.ORDERS
WHERE SALES.ORDERS.<keyField> IN (<value1>, <value2>, ... <value1000>);

SELECT <projection>
FROM SALES.ORDERS
WHERE SALES.ORDERS.<keyField> IN (<value1001>, <value1002>, ... <value2000>);

SELECT <projection>
FROM SALES.ORDERS
WHERE SALES.ORDERS.<keyField> IN (<value2001>, <value2002>, ... <value2500>);

This is okay as long as the connector property semiJoinMaxQueries is set to 3 or higher. If you had 4500 values from the "smaller" table and maxValuesInInClause is set to 1000, the optimizer would need to query the "larger" table with 5 SELECT statements. 5 is greater than semiJoinMaxQueries=3. Thus, the optimizer would look at the priority of semiJoinExecutionStrategies and use the next strategy on the list. If you had it set to "I,T,P", it would decide to use a temporary table.

"T" takes a little more time because it needs to create a temporary table, insert values into the table, submit the SELECT, and then drop the temporary table. This option is nice when, for example, 4500 values are returned by your "smaller" table yet maxValuesInInClause=1000 and semiJoinMaxQueries=3. Another drawback is that the user who connects to the data source must have permissions to create a temporary table and INSERT INTO the table. If these permissions do not exist, the "T" option will obviously throw a database error.

The "P" option is useful when large "IN" lists cannot be used with a data source. As the number of values from the "smaller" table increases, this also means that DF will submit more parameterized SELECT statements to your data source. Do you really want DF running 4500 SELECT statements to your data source? On the other hand, if you know that the row count from the "smaller" tables for a given connector will always be small (less than 20 values), the "P" strategy can be a sensible choice, since DF only needs to run a handful of parameterized SELECT statements.

Next Steps

In a follow-up blog post, I will provide an example of a BW Jco query that ran inefficiently (e.g. no semi-join) so that I can demonstrate how DF statistics, connector settings, and system settings can be used to make your reports run optimally. Until then, let me know if you have questions and I can try to point you in the right direction.
