Assessing data federation performance in the context of SQL Query Design
By now, we are all familiar with the term Data Federation , which is the concept of virtually accessing distributed data without “copying” over data to the landscape where for eg., the analysis is actually done.
Oftentimes its advantageous to let the data be where its originally residing , as it helps make sure the processing power of the data’s native stores are efficiently utilized to the fullest . At the same time , when this data is analyzed by laying it over the data from a different data store , then we unlock doors to a powerful analysis that makes the data even more meaningful.
In our previous blog, we penned down a how-to guide to integrate data stored in Amazon S3 via Amazon Athena onto the SAP Datasphere thereby illustrating the method to virtualize and do powerful visual data analysis through SAP Analytics Cloud stories in the SAP Datasphere.
But at the end of the day , there are these SQL queries (among other factors) under the hood that determine the performance of your data analysis stories.
In this blog, let’s try to understand the performance of the (federated data) stories in the context of SQL query design.
DISCLAIMER : The query and timing mentioned here are purely rudimentary representations to show the basic comparative difference observation ONLY and in no way represents the official performance with a tuned external data set or a tuned model (where the actual best query timings may be at sub-second levels) and this does not come from any of the official product teams either !!
- We have a sales data table of 5 Million rows residing in S3 that can be queried via Amazon Athena
- We have a local( SAP Datasphere) Item Category master table mapping items to Categories
- We have created a view that joins the Remote Athena table and Local ItemCategory table
- We have created Story to visualize/analyze the overlaid data .
Lets assume our Sales table(in Athena) has these columns we are interested in(SalesID, ItemType, OrderDate, saleschannel, region, country, ItemPrice, UnitsSold, TotalCost)
ItemCategory table(in local/SAP Warehouse Cloud) just has 2 columns: ItemType , ItemCategory
Use Case I :
Do aggregated calculation based on a Column belonging to local( SAP Datasphere) table ONLY.
In the below , the totalRevenue is from Remote Athena table and itemCategory is from Local table.
The SQL for the above in SAP Datasphere translates to:
SELECT "ItemCategory", sum("totalrevenue") FROM "AthenaSalesRem_ItemCatLocal_Jo" GROUP BY "ItemCategory"
And on Athena the SQL to federate data for above case translates to:
SELECT "salesdata"."iitemtype", "salesdata"."totalrevenue" FROM "salesdata" "salesdata"
Looking at Athena Query history log, we see it had logged a query time of 11.23 seconds as in this case we federated “all” the data from remote to local first before joining and aggregating on a local table column.
Also the end to end response time as seen from SAP Datasphere Story page “loading” for this use case is approx 12.78 seconds , which leads to huge negative experience for user.
Use Case II :
On the other hand, lets add just one remote column(country) to the aggregation such as:
SELECT "salesdata"."country", "salesdata"."iitemtype", SUM("salesdata"."totalrevenue") FROM "salesdata" "salesdata" GROUP BY "salesdata"."country","salesdata"."iitemtype" as seen below.
Athena’s Query history log shows a mere 1.8 secs now, as this time , there was aggregation happening on remote side of things and only the aggregated results were federated for join and further aggregation on local column.
Interestingly a third use case of doing an Average on totalrevenue & grouping by local column (ItemCategory) has the same results as use case I above, meaning data is federated out fully before joining and aggregation happens on local [and it logged 10.25 secs]
Similarly a fourth use case of Average on total revenue and group by “country”(remote) and Itemcategory(local) behaves exactly the same as use case-II above [as Athena executes below query before federating Only the aggregated data back and logged just 1.81 sec in the Athena Query log]. Interestingly, in this use case, the SUM and count are sent back from Athena before Average is processed on SAP Datasphere side along with joins and further aggregations.
SELECT "salesdata"."region", "salesdata"."iitemtype", SUM("salesdata"."totalcost"), COUNT("salesdata"."totalcost") FROM "salesdata" "salesdata" GROUP BY "salesdata"."region","salesdata"."iitemtype"
Though the observations here are trivial and may be obvious to many, the goal of the exercise here is to emphasize the direct correlation between SAP Datasphere story (SQL Query design) and data federation performance especially in cross-source and multi-cloud data federation scenarios . This in particular may not be that obvious to business users that are trying to build charts with grouped data , especially when the models they use, source the data from disparate external sources.
Please let me know if this helped get a tad little insight into Analytical Queries in data federation use cases.
Thanks for reading !