#BI4 Performance Optimization for Multi-Source Universes & #SAPBW Jco Queries — Part 1
The new semantic layer in BI 4.0 gives you the ability to do a couple of magical, powerful things that weren’t possible in prior releases. Namely, you can join data from heterogenous data sources (the so-called and much-awaited “Multi-Source Universe” feature), and you can also query tables from BW DSOs and InfoProviders without the need of a BEx query.
Join Small to Big
Before we get too far into the architecture, I’d like to explain a key feature that makes this possible. Imagine that you have two databases, one called CRM and one called SALES. They are on separate network segments. You would like to build a universe that joins data between these databases, and you would rather not do the ETL work to build a data mart for this. You want a table called CRM.ACCOUNTS to be joined with a table called SALES.ORDERS.
The new “Data Federation Service” in BI 4.0 makes this possible. This service hosts the engine behind the legacy Data Federator 3.x product. The DF Service (henceforth, DF) can do many of the things a relational database engine can do:
– optimize SQL queries
– query data sources
– cache intermediate results in Java memory or in disk cache
– join recordsets
– apply functions, aggregations, filters, etc.
Let’s say your CRM.ACCOUNTS table has 1,500 rows, while SALES.ORDERS contains 12,000,000 rows. If you were the DF engine, how would you join these tables together when they come from two different databases?
One way is to SELECT * on both tables, fetching the results into memory and then joining them together in temp space. This is obviously the most expensive! Every time a business user refreshes a WebI report, do you want them doing a full table scan on your SALES.ORDERS table? What would that do to your transactional application running on the SALES database?? This would also transfer quite a bit of data from SALES to the SIA node running the DF Service, and significantly bloat the memory on this machine while it joins 12,000,000 rows to 1,500 rows.
Thankfully, DF has features that can avoid this. But first, the DF engine needs to know the approximate sizes of the tables involved in this join. As with most databases, DF uses table statistics to create more optimal query plans. The first thing you need to do is refresh those statistics — check out Section 3.3 in the the Data Federation Administration Tool Guide for details on how to refresh or manually override table and column statistics.
Once statistics are enabled, DF knows that SALES.ORDERS is much larger than CRM.ACCOUNTS. Thus, it presumes that a table scan on SALES.ORDERS should be avoided unless absolutely necessary. If you were the database engine, what could you do to avoid getting all rows from SALES.ORDERS? You could perform this query in two steps:
- Since it is smaller, SELECT * from CRM.ACCOUNTS and cache it to memory.
- Find all distinct values on the CRM.ACCOUNTS field that participates in the join with SALES.ORDERS
- Fetch only those records from SALES.ORDERS where the values in the join field match the distinct values you found in step (2.). Note: if this column has an index, this filter will be MUCH faster
- As potentially matching rows arrive from SALES, compare them to the cached data from CRM.ACCOUNTS and enforce the join criteria. Send the query to the process that requested it.
What’s a Semi-Join?
DF refers to this algorithm as a semi-join. This algorithm assumes that it is faster to fetch a filtered subset of data from a large table than it is to fetch the entire large table. In a normal example, I’ve seen semi-join reduce the query time by a factor of 1,000. Trust me, you want this feature!
But semi-join can be a fickle thing! It is a feature of the DF optimizer, and as such, it is only enabled after a user submits a query that can benefit from semi-join. In the example above, DF required a priori knowledge that CRM.ACCOUNTS was an order of magnitude smaller than SALES.ORDERS. If your statistics have never been refreshed, every row count and cardinality is 0. Thus, DF assumes every table is infinitely small. So you have to refresh your statistics! DF also needs the query to be an inner or left-outer join from CRM.ACCOUNTS to SALES.ORDERS. A right-outer join would require all values from SALES.ORDERS, so it would not be possible to reduce the data fetched from SALES.ORDERS using a semi-join. There are also some other ways that semi-join can get excluded from the query plan (e.g. making it run slower), but I’d rather discuss those in a follow-up article.
How do you know if semi-join is being used for multiple sources? Open the Data Federation Administration Tool, or DFAT [I say “dee-fat”, btw] and go to the “Query Monitoring” tab. Find the query that you executed from a report and click the [+] to expand the query. You will see the list of queries that were submitted to various data connectors in order to satisfy the multi-source query. You will also see the row counts retrieved from each query, as well as the execution times for these queries. Hopefully you see 1,500 rows fetched from the query to CRM. In the query submitted to SALES, you should see that there is a filter on the field that joins SALES.ORDERS to CRM.SALES. Since this filter exists in the query pushed down to the source, you see much fewer than 12,000,000 rows fetched from SALES.
But what if semi-join isn’t being activated?
Let’s say you still see 12,000,000 rows being fetched from SALES, and that table scan takes 5 minutes to run. This is obviously a performance problem, and won’t allow your users to query this universe. It will also hurt the scalability of the SALES database and the BI environment that has to cache and join 12,000,000 rows.
The most likely problem is that your statistics are wrong.
- From the DFAT “Query Monitoring” tab, copy the multi-source query that is performing poorly.
- Go to the DFAT “Query Panel” tab and paste the query into the SQL Text box.
- In the upper right, select the drop-down menu on the “Run” button. Select the “Explain Statistics” action.
In the results panel, you will now see a list of all tables and columns required to satisfy the given SQL Text. Examine the table row counts and the field cardinality values. Are these values an accurate reflection of the true size of the data in your actual source system? Yes, this might require a trip to your friendly neighborhood DBA.
There are also a number of other reasons semi-join could be de-activated. Section 3.4.5 of the DFAT Guide talks about these, but I’ll be discussing these later in a follow-up article.
Semi-Join and Relational BW Jco
Great, but how does this relate to BW Jco? In BI 4.0, you have a couple of ways to access BW:
– Legacy .unv OLAP Universe: requires a BEx query; connects via MDX
– WebI connection to BICS: requires a BEx query; does not allow the use of a Universe
– Relational .unx Universe: does not require a BEx query; exposes InfoProviders or DSOs to end-users for BI self-service
For more info on when to use these different options, check out Didier Mazoue‘s Q&A about how the Semantic Layer should be used with BW.
If you use the relational .unx option, the queries are satisfied by DF. This is because DF can expose InfoProviders, InfoObjects, and DSOs to business users as if they were just tables and columns in a dimensional model. When DF submits a query to BW, it does not use SQL. Instead, it makes RFC calls to DF-specific functions installed on BW (these are the “DF Façade” functions). These functions can only query one BW table at a time. For example, if you were querying a database on tables D0MATERIAL and T0MATERIAL, you would simply create one SELECT statement and provide the join criteria between these tables. The DF Façade functions only allow one table queried per each function call. Thus, it must first query D0MATERIAL and then query T0MATERIAL in two separate RFC function calls.
Jonathan Haun recently discussed the pro’s and con’s of the new Relational BW Jco Universe feature. However, he implied that every call to a BW table requires all data from the table to be fetched into DF memory and joined with all data from any other BW table involved in the join. This is not entirely true, because DF can use the semi-join operator!
Just as with the CRM and SALES tables above, DF can analyze whether a query between two BW tables can benefit from the semi-join operator. As long as the statistics indicate that one of the BW tables is much larger than ther other BW table, DF will try to generate a semi-join. Here’s how it satisfies a semi-join between two tables from BW:
- Fetch all required data from the smaller BW table
- Fetch from the larger BW table only those rows which match the join field values found in the results from the smaller BW table
- Join these results together in memory in the DF Service in the BI processing tier
Just as semi-join can radically improve the response time for a multi-source Universe query, I have seen similar results when semi-join gets activated in Jco relational queries to BW. This is especially true when the larger BW table has a secondary index on the field(s) involved in the join…that way, BW can use the database index to quickly filter out only those rows that are needed to evaluate the semi-join.
Advanced Semi-Join Settings
When a semi-join is activated by the optimizer, it means that DF will fetch some values from a smaller table and then use those values to filter out data in the larger table. There are a few methods that DF can perform this filtering. Section 4.3.1 of the DFAT Guide briefly mentions the connector options for property “semiJoinExecutionStrategies”, but I want to make sure you know what they actually mean.
- “I” — semi-join will filter the larger table based on an IN list of values. For example,
WHERE SALES.ORDERS.<keyField> IN (<value1>, <value2>,… <valueN>)
- “T” — semi-join will first create a temporary table in the database, populate the temporary table with the key values necessary to satisfy the join, and then filter from the larger table based on the values in the temporary table. For example:
CREATE TABLE <tempTableName> (…
INSERT INTO <tempTableName> VALUES (…
WHERE SALES.ORDERS.<keyField> IN (SELECT <keyfield> FROM <tempTableName>)
DROP TABLE <tempTableName>;
- “P” — semi-join will use a prepared statement to parameterize a filter from the larger table based on values from the smaller table. For example:
WHERE SALES.ORDERS.<keyField> = ?
This SELECT statement will be executed once for each unique value found in the smaller table.
By default, this connector property is set to the system parameter SEMI_JOIN_EXECUTION_STRATEGIES, so you can control it at a global level. You can customize this for each connector, though. The “I,T,P” setting is probably good enough for most folks, but I want to make sure you understand the impact of each setting:
“I” takes a little bit more time in the query optimizer. This is because it has to format a potentially large SQL SELECT statement. Actually, if there are many values found in your smaller table (e.g. CRM.ACCOUNTS) it might do more than one SELECT statement. The connector property maxValuesInInClause limits the number of values in a single IN clause. You may have to adjust this parameter if your source database doesn’t like receiving tons of IN values in a really massive SELECT statement. For example:
Your query to the “smaller” table in the semi-join returns 2500 unique key values. Your maxValuesInInClause is set to 1000. DF creates five different SELECT statements to the “larger” table:
WHERE SALES.ORDERS.<keyField> IN (<value1>, <value2>,…<value1000>);
WHERE SALES.ORDERS.<keyField> IN (<value1001>, <value1002>,…<value2000>);
WHERE SALES.ORDERS.<keyField> IN (<value2001>, <value2002>,…<value2500>);
This is okay as long as the connector property semiJoinMaxQueries is set to at least 3 or higher. If you had 4500 values from the “smaller” table and maxValuesInInClause is set to 1000, the optimizer would need to query the “larger” table with 5 SELECT statements. 5 is greater than semiJoinMaxQueries=3. Thus, the optimizer would look at the priority of semiJoinExecutionStrategies and use the next strategy on the list. If you had it set to “I,T,P” it would decide to use a temporary table.
“T” takes a little more time because it needs to create a temporary table, insert values into the table, submit the SELECT, and then drop the temporary table. This option is nice when, for example, 4500 values are returned by your “smaller” table yet maxValuesInInClause=1000 and semiJoinMaxQueries=3. Another drawback is that the user who connects to the data source must have permissions to create a temporary table and INSERT INTO the table. If these permissions do not exist, the “T” option will obviously throw a database error.
The “P” option is useful when large “IN” lists cannot be used with a data source. As the number of values from the “smaller” table increases, this also means that DF will submit more parameterized SELECT statements to your data source. Do you really want DF running 4500 SELECT statements to your data source? On the other hand, if you know that the row count from the “smaller” tables for a given connector will always be small (less than 20 values)
In a follow-up blog post, I will provide an example of a BW Jco query that ran inefficiently (e.g. no semi-join) so that I can demonstrate how DF statistics, connector settings, and system settings can be used to make your reports run optimally. Until then, let me know if you have questions and I can try to point you in the right direction.
Thank for the information. However, for the record I am mentioning
semi-join on my blog posting. I noticed in a guide from SAP that the functions
COUNT() and AVERAGE() are not pushed down to the BW aggregation engine and
managed at the DF engine. DF BW connector guide: http://www.sdn.sap.com/irj/scn/go/portal/prtroot/docs/library/uuid/d071c7aa-efb9-2e10-4f94-fe9f4fe747e9?QuickLink=index&overridelayout=true&52110838211258
I have found that when I am joining two large InfoProviders (for
example SALES_ORDER_HEADER and SALES_ORDER_DETAILS) and I am also counting the number of SALES in the HEADER and counting the number of ORDER LINE items in the footer that the semi-join features does not seem to matter. I had a report where the requirement was to show a count of ORDERS and ORDER LINES by YEAR and a few other dimensions. In this case a very large portion of the data was transferred from the two largest InfoProviders to DF. There were no filters because users wanted to see the totals for the last 5 years. The header was about 2.5 million
rows and the footer was about 6.5 million rows. From all appearances over 7
million rows were transferred (over 550 MB of data) to DF. With a traditional Universe and a single RDMS (Relation Database Management System) source, the join and subsequently the aggregation would be managed at the RDMS level. This would result in only a few records being transferred from the database to Web Intelligence. The question is then, if I am joining large InfoProviders and using aggregation functions that are not pushed down to BW, is there an option to reduce the data transferred and managed by DF?
A couple of questions -- are the stats refreshed on all these tables and fields? Also, can you go to the Query Monitoring tab and get the SQL Text submitted from the report as well as the SQL Text submitted for the BW connector? The latter doesn't look like SQL -- it is a set of parameters passed to the DF Façade RFC functions. It looks more like this:
TABLE_NAME:IABCD123 CONTAINS_COUNT_STAR:false IS_GROUP_BY_ALONE:false READ_COLS:0MATERIAL SAP_RANGES: PROJ_COLS:
I checked with the developers -- you are seeing expected behavior. DF in BI 4.0 cannot push down AVG or COUNT functions to BW. This still follows Thomas Zurek's advice on when to use OLAP connectivity (in BI 4.0 that means BICS & BEx) and when to use relational (DF via BW Jco).
Great blog, giving lot of technical information.What is your opinion on building a .Unx universe on top of BW cube/DSO using relation connection?I am particulary interested in knowing if the data federator engine causes any performance issues , if I use a relational conenction approach to build a non- auto generated Universe on top of BW info Providers?Basically instead of building a mutli source .Unx,I want to use the same approach and build a .Unx for BW info provider instead auto generated BICS Universe .
In my opinion, you should only use the Jco Relational connection to BW when you want to fetch specific rows or join/union them to other recordsets.
The better question is when you should NOT use the Jco Relational connection to BW -- which is easy. Do NOT use it if you are trying to do analytic/OLAP queries to an InfoProvider. BEx and BICS are the tool you should use for that kind of thing. You should not expect to force the Jco Relational connection to do the same kinds of analytic queries. Hence, you end up seeing problems like Jonathan had above.
The best uses of the Jco Relational connection are to provide small amounts of row-oriented details, filtered on specific criteria. For instance, line item details. But I have also seen customers get carried away with this...they start to write WebI reports where they want to extract 14 million rows of line item detail data from a DSO. This is ridiculous -- if you just want the data out of BW and into some other place, use OpenHub. Don't try to use WebI and the Jco Relational connection to do end-user ETL.
Yep, makes sense and BICS is the way for huge extraction .The only reason I was thinking of JCO is the power of creating my own .Unx than the auto generated one.Agreed auto generated reduces lot of work but also takes away some flexibilities at UNV layer.It would be nice if SAP gives some relational extraction way out of BW ,like the HANA relational connection in the Universe.
Big thanks for the blog and also the response.
There's also an old blog post about why to use relational access to BW with Data Federator from Thomas Zurek. The entire article refers to BW 7.0x and BOE 3.x and DF 3.0, but the principles are still basically valid.
This is one of the best technical blogs i've seen for a long while. Many thanks!
Speaking of DF statistics - can't wait for the details on those, also have a question -- if DF statistics show over 10 million cardinalities, how should this be addressed ?
10 million in an I-table is fine -- as long as it is being joined to smaller D tables that DF can use to activate the semi-join rule. The customers who have had the most problems are the ones who try to join big I-table to big I-table on key fields directly WITHOUT trying to put a smaller conformed dimension table between the two of them. That is a recipe for disaster because DF knows that it cannot benefit from the semi-join rule.
Thanks for sharing your experience on optimizing MSU using DFAT . I have a question, When we perform stats in DFAT tool can these collection of stats be automated? As currently i see an overhead when the volume grow exponentially then we need to engage with admin folks to redo the stats. Is there any better way so that these stats happens once in a while. Also when we migrate reports built using MSU do the stats need to be collected manually?? Need your help here..
Hi Nitin -- there is currently no way to automatically collect stats in DFAT. There used to be a way to schedule this in Data Federator 3.0 but that no longer exists. It's definitely a feature that our larger customers will need but it does not currently exist. The same is true for migrating reports from Dev -> Test -> Prod -- the statistics need to be refreshed in the new landscape where the reports have been migrated.
Scott, Looking for input on below DFAT parameter and SAP Note.
DFAT Global Parameter: ACTIVATE_SEMI_JOIN_DIMENSION_RUNTIME_CARDINALITY_LIMIT
SAP Note 1762023 - How do you ensure optimal performance of Web Intelligence reporting off of multi-source Universes with semi-joins?
DFAT Admin Guide includes for this Parameter: "Whether to activate cardinality limitation computed at runtime for semi-join. If this option is activated, the runtime cardinality of a semijoin dimension is compared to the expected value, if the runtime value is greater than the expected value, the dimension is discarded." This statement is not real clear and there is no further detail in SAP Notes or web for this DFAT parameter. We've not tested, but if this DFAT parameter is set as “FALSE” what can we expect with report results?
Exceptional. Thanks for sharing this.
Wow! Very good blog since I was actually questionning how can it be when multi-source is used on big tables
Hi there, I have a very similar scenario. Small Table ( fact table - 70k), Big Table (Master Data 12M). BO4.1 SP05
A. DFAT Glogal parameter:
1. ACTIVATE_SEMI_JOIN_RULE --> true
2. SEMI_JOIN_EXECUTION_STRATEGIES --> I,T,P
3. ACTIVATE_SEMI_JOIN_DIMENSION_RUNTIME_CARDINALITY_LIMIT -> false
B. DFAT Statistics activated in both table and columns used
At the moment first it goes to the master Data table instead of fact table. It doesn't filter properly. From my understanding first it should access to the fact table, and then filter it in the Master data table using these values. Is that correct? Or it just works with two fact tables?
It is working now (master data tables and fact table). Needed to adapt some global parameters and the configuration on DFAT . Thanks. Very helpful topic!!
Hello, I like your post. Very interresting. I'm so scare about multi-source. I'm the administrator of a big BO platform.
It's good that BO give parameters options, it's necessary. But at same time very dangerous since developper of univers and report don't take care of this usually.
Personnaly, I still see the DF module good to build Prove of concept before send these 2 or mores multi-source into a reel dataware house at the same place.
I can change my mind maybe!!!
Excellent information about multi source capabilities.
will semi - join applicable for both connection with relation database ?
It's very nice, do we have part 2 for this topic ?