The ABAP CDS development framework offers enormous possibilities, and many new features have been added since its release. An excellent introduction to CDS can be found here: Getting started with ABAP Core Data Services. It is now quite easy to create extraordinarily rich views on application data, which allow you to describe a wide variety of application data models in a unified way. This unified description of application models allows us to access the very same application model via generic frameworks in different contexts. For example, the same model might be used in analytical or transactional applications, from different UIs, via APIs, or again in another CDS view. Finally, we leverage the power of SAP HANA with this model, since all queries on top of such models are executed in SAP HANA.

However, this enormously powerful approach comes with some challenges with respect to performance. Since this is essentially a model-driven and declarative approach, the details of how a specific query on top of such models is executed are fully controlled by the generic SAP HANA SQL query optimizer. While SAP HANA in general does an excellent job with respect to performance, not every query on a complex model over large or very large data sets will run with the desired performance and sufficiently low resource consumption. One reason is complex calculations on a large data set that cannot be avoided if the expected and correct result is to be produced. Furthermore, as with any software, SAP HANA has some limitations, and not every conceivable algorithmic optimization is implemented. Finally, there are fundamental limitations that apply to any SQL query optimizer: inevitably, the optimizer will sometimes not find the best possible algorithm, and performance can then be much worse than expected. All three reasons are elaborated further below.

CDS views and view entities are enormously powerful, and you can solve a lot of application problems in a very elegant way and with huge flexibility. However, as with all other powerful tools, CDS must be used with care! After some very general explanations of the advantages, but also of the potential drawbacks with respect to performance of a CDS-based approach, this blog post and the subsequent series will explain some typical pitfalls and help you set up safeguards for a successful implementation of CDS-based applications. It will also offer a guide to the tools for performance analysis.

There is already a series of blog posts available that introduces the topic quite well:

This series will complement it by examining additional aspects:

  • Guidelines on when and for which purposes to use CDS views and, even more importantly, when it is not advisable to rely on a CDS-view-based implementation

  • Which implementation patterns you should avoid, and which you should actively apply, if a CDS-based solution is a promising approach from a functional and performance perspective

  • Overview of tools for performance analysis

  • Appropriate testing strategies that indicate whether sufficient performance will also be achieved in the final production system


But before going into the details, I want you to understand, at least at a high level, how things are wired from a technical perspective and what the consequences are with respect to performance.

One first needs to understand that every CDS view is represented as an SQL view on SAP HANA – so nothing fancy. The power of CDS comes from the fact that every view can be used again like a table – or, in formal terms of relational algebra, as a relation – and this allows us to use each CDS view to abstract a certain piece of application logic into a view. It is then straightforward to build complex application models just by stacking CDS views in multiple layers, each of them representing a certain piece of application logic (a minimal sketch of such a stack follows the list of factors below).

This is where the performance challenges originate. Since it is so easy to build extraordinarily complex models, this is simply what happens. Sometimes a single CDS view is built on hundreds of underlying views, with multiple joins and expressions in each of them. One can easily imagine that it is quite challenging for the SAP HANA optimizer and execution engine to execute an arbitrary select query on such a model with the desired performance. If the data volume in the underlying tables (or base relations, in more formal terms) is small or moderate – say, not more than a few million data records – performance will still be acceptable in most cases, with queries typically not running longer than a few seconds. However, as soon as the data volume grows significantly beyond this, such queries might run much longer and might consume a huge amount of memory if – and this is very important – the data volume is not reduced very efficiently at an early stage of query processing. This in turn depends on multiple factors:

  • It must be possible to reduce the data volume at an early stage of the query processing from an algorithmic/logical point of view

  • The SAP HANA query optimizer needs to find the best possible algorithm given the specific query and data constellation

  • SAP HANA supports all the transformations required to get from the model and the query on top of the model to a final algorithm which allows an early data reduction
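
To make the stacking a bit more concrete, here is a minimal sketch in plain SQL of what such a stack looks like on the database. The table, view, and column names are hypothetical, and real CDS views are written in CDS DDL rather than SQL, but each of them ends up as an SQL view of this kind on SAP HANA.

```sql
-- Layer 1: expose the raw item table as a basic view (hypothetical names)
CREATE VIEW sales_order_item_basic AS
  SELECT sales_order_id,
         item_number,
         net_amount
  FROM   sales_order_item;

-- Layer 2: enrich the items with header attributes via a join
CREATE VIEW sales_order_item_enriched AS
  SELECT i.sales_order_id,
         i.item_number,
         h.customer_id,
         h.order_date,
         i.net_amount
  FROM   sales_order_item_basic AS i
         INNER JOIN sales_order AS h
                 ON h.sales_order_id = i.sales_order_id;

-- Layer 3: an aggregated model on top of layer 2
CREATE VIEW sales_volume_per_customer AS
  SELECT   customer_id,
           SUM(net_amount) AS sales_volume
  FROM     sales_order_item_enriched
  GROUP BY customer_id;
```

Every query against sales_volume_per_customer is expanded by SAP HANA into the whole stack; real application stacks are often many layers deeper, which is exactly where the optimizer has to work hard.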


For each of the above factors I know examples where this fails.

Let me try to illustrate this for the first bullet point. Let us assume we have 100 million sales order line items and, say, 20 million sales orders in the system. A query that just calculates the total sales volume would be very quick on SAP HANA, running in about 100 milliseconds. This is because in SAP HANA, aggregation of data on a single table is a very efficient means of reducing data. However, if I want to query my top 10 customers with respect to sales volume, this will take roughly 10 seconds, so two orders of magnitude more. This is because the customer attribute is stored in the sales order table, and we now need to join 100 million records with 20 million records before we can start to aggregate the data. After aggregation, the data needs to be sorted, and only then can the top 10 customers be picked. This is inevitably expensive! So, aggregation on top of a single table is quick and an efficient method to reduce data. However, if very large tables need to be joined, or some calculations must be executed on large data volumes before aggregation, this can become expensive.
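
In plain SQL and with hypothetical table and column names, the two queries look roughly like this:

```sql
-- Total sales volume: aggregation on a single table, very efficient in SAP HANA
SELECT SUM(net_amount) AS total_sales_volume
FROM   sales_order_item;

-- Top 10 customers by sales volume: the customer is stored on the header,
-- so 100 million items must first be joined with 20 million orders,
-- then aggregated and sorted before the top 10 can be picked
SELECT   h.customer_id,
         SUM(i.net_amount) AS sales_volume
FROM     sales_order_item AS i
         INNER JOIN sales_order AS h
                 ON h.sales_order_id = i.sales_order_id
GROUP BY h.customer_id
ORDER BY sales_volume DESC
LIMIT 10;
```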

Another obvious way to reduce data is filtering. If I run a query on the model that exposes the sales volume per customer, a filter on a single customer will reduce the data quite early. The best algorithm starts by selecting all sales orders of this single customer, joins the corresponding sales order line items, and then aggregates the sales volume per customer (Figure 1). If the customer has only a few thousand orders and line items, the query will run in a few milliseconds. Of course, if many millions of sales orders and line items are assigned to the customer, the run time will again be in the range of a few seconds.


Figure 1
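
Using the hypothetical sales_volume_per_customer view from the stacking sketch above, the query and the plan of Figure 1 can be sketched like this:

```sql
-- Query on the model with a filter on a single (hypothetical) customer
SELECT customer_id,
       sales_volume
FROM   sales_volume_per_customer
WHERE  customer_id = '4711';

-- Plan corresponding to Figure 1 (the ideal case):
--   1. pick the sales orders of customer '4711'      (small intermediate result)
--   2. join only these orders with their line items  (small join)
--   3. aggregate the net amounts per customer        (cheap)
-- The filter is pushed below the join and the aggregation, so only the data
-- of this one customer is ever touched.
```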


A filter on the sales volume alone, on the other hand, will not be efficient at all. Before the filter can be applied, the sales volume of all customers needs to be calculated. So, this query will run with about the same performance as my top 10 list of customers by sales volume. This is a perfect example where a reduction of the data volume at an early stage of query processing is simply not possible, even though, looking at the model from the outside, there is a quite selective filter.
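
The corresponding query against the same hypothetical view shows why no early reduction is possible here:

```sql
-- Filter on the aggregated measure only
SELECT customer_id,
       sales_volume
FROM   sales_volume_per_customer
WHERE  sales_volume > 1000000;

-- sales_volume only exists after all line items have been joined with their
-- orders and aggregated per customer, so the filter cannot be pushed down to
-- the base tables; it is applied at the very end on the full result.
```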

The second point is a little harder to explain. Let us take again the above example of a query that selects the sales volume of a single customer on top of a model that exposes the sales volume per customer. The most naive way to run this query would be to simply calculate the sales volume of all customers and then apply the customer filter on top of this result: first join all sales orders with the corresponding line items, then aggregate all line items per customer, and finally apply the filter on the customer (Figure 2).


Figure 2
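
Written out in SQL, the naive plan of Figure 2 corresponds roughly to evaluating the inner aggregation completely before the outer filter (hypothetical names again):

```sql
-- Naive execution: compute the sales volume of ALL customers first ...
SELECT customer_id,
       sales_volume
FROM   ( SELECT   h.customer_id,
                  SUM(i.net_amount) AS sales_volume
         FROM     sales_order AS h
                  INNER JOIN sales_order_item AS i
                          ON i.sales_order_id = h.sales_order_id
         GROUP BY h.customer_id ) AS all_customers
-- ... and only then discard everything except the one requested customer
WHERE  customer_id = '4711';
```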


Of course, this would deliver the correct results, but it would be slow. The task of the query optimizer is to find a fast way to execute the same query while still delivering the correct results. To do this, the optimizer applies relational algebra (Relational algebra - Wikipedia), which is nothing but a set of algebraic rules that allow one algorithm to be transformed into another without changing the result. For our example of sales volume per customer and a query with a filter on the customer, one can still enumerate all the alternatives. Two of them have already been mentioned: the naive one (Figure 2) and the optimal one (Figure 1). But there are more options: join all sales orders to all sales order line items (or vice versa, all line items to all orders), then apply the filter on the customer and aggregate the sales volume of all items of that customer. Another possibility is to aggregate the sales volume of the sales order line items per sales order, join this with the sales orders, aggregate again, and filter by customer (Figure 3). And if you zoom in, there are even more alternatives, for example, the choice of join algorithm (hash join, nested loop join, index join).


Figure 3
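
The alternative of Figure 3 – aggregating the line items per sales order before the join – would correspond roughly to this formulation (hypothetical names):

```sql
-- Pre-aggregate the line items per sales order ...
SELECT   h.customer_id,
         SUM(agg.order_volume) AS sales_volume
FROM     sales_order AS h
         INNER JOIN ( SELECT   sales_order_id,
                               SUM(net_amount) AS order_volume
                      FROM     sales_order_item
                      GROUP BY sales_order_id ) AS agg
                 ON agg.sales_order_id = h.sales_order_id
-- ... then join with the headers, filter on the customer, and aggregate again
WHERE    h.customer_id = '4711'
GROUP BY h.customer_id;
```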


If the model grows – for example, if more joins are added – each permutation of the joins is also a valid alternative; joining n tables already allows n! different join orders.

So, if the model grows, the number of valid algorithms explodes. For a model with many dozens of joins, unions, and expressions, the number of possible correct algorithms can easily exceed the number of stars in the universe. It is almost a miracle, and fascinating, that in such situations the query optimizer still finds a reasonably fast algorithm in many cases, but of course beyond a handful of joins the likelihood that the optimizer will fail to find a good execution plan grows. To learn more about query optimization, see here: https://open.sap.com/courses/hanasql1.

The third bullet point has much to do with the specific requirements of CDS as a language that supports business applications. To be able to build semantically rich models, CDS supports a set of special functions, such as currency and unit conversions or date functions. Unlike for the well-described standard relational operators such as joins and unions, the possible algebraic transformations for these functions need to be considered quite carefully in order not to introduce functional regressions. Since functional correctness is even more important than performance, the initial implementation of supported transformations is quite conservative, and this in turn can lead to unexpected performance issues in specific constellations. Over time, however, the optimizer improves; for example, with SAP HANA SPS07 some additional transformations for currency conversions have been introduced. Conditions (case expressions) are also quite challenging from the optimizer's perspective. There are myriads of possible transformations, depending on the specific pattern of the expression. Here the set of implemented transformations is limited on purpose, to avoid a combinatorial explosion of possible algorithms, which would make it even more difficult to find a reasonable execution plan.
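
A simple illustration with a case expression, again in plain SQL with hypothetical names: a filter that refers to a calculated column cannot always be translated back into a filter on stored columns, so depending on the implemented transformations it may only be applied after the expression has been evaluated for every row.

```sql
-- A view that derives a calculated amount via a case expression
CREATE VIEW order_item_reporting AS
  SELECT sales_order_id,
         item_number,
         currency,
         CASE WHEN currency = 'EUR'
              THEN net_amount
              ELSE net_amount * exchange_rate  -- simplistic stand-in for a real
         END AS amount_eur                     -- currency conversion function
  FROM   sales_order_item;

-- The filter refers to the calculated column, not to a stored column, so the
-- optimizer may have to evaluate the case expression for all rows before it
-- can reduce the data.
SELECT sales_order_id,
       item_number,
       amount_eur
FROM   order_item_reporting
WHERE  amount_eur > 1000;
```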

So even though SAP HANA does a fantastic job of delivering satisfactory performance in many situations, you need to be aware of the limitations. If, for example, reliable and predictable performance of a repetitive, identical task is of utmost importance, using a query on top of a complex CDS model over very large tables might not be the best idea. If you do it anyway, this has to be verified by very thorough testing, because you have extremely limited means to influence the algorithms chosen by SAP HANA, and the likelihood that they are not the best possible ones is not small. Performance might also vary quite significantly depending on the data constellation; variations of multiple orders of magnitude cannot be excluded. On the other hand, having a CDS model gives you huge advantages in terms of flexibility. The virtual model allows you to expose almost any conceivable application property in a way that can easily be consumed via different channels. This allows you to ask and answer a lot of questions that would otherwise not be possible at all, or only with huge effort. As long as the data volume is small to moderate, this will also work with mostly acceptable performance. However, depending on the specific model and the query on top of it, with very large data volumes a sufficient reduction of the data volume at an early stage of query processing will not always happen, and there are situations where performance and resource consumption are no longer acceptable for the respective use case.

The subsequent parts of this series will explain how you can minimize such situations. One part is about avoiding certain patterns in CDS implementations that are known to be potentially harmful in terms of performance, and about how to find those anti-patterns in existing CDS implementations with the available analysis tools.

SAP Notes

  • SAP Note 1794297 (Secondary Indexes for S/4HANA and the business suite on HANA)