Partitioning as the backbone of optimized calculation performance in PaPM
During the past two years of working with SAP Profitability and Performance Management (PaPM), I have often questioned my knowledge of partitioning. Is it the standard IT concept of using parallelized threads? What do I need to focus on in order to optimize execution? I have come up with a few explanations and an example that answer the following question: How does it work in PaPM? When presenting the partitioning feature to clients or partners, our best practice is to present it within the Sample Content for Profitability and Cost Management, which comes with the product itself.
As you probably know, the idea of partitioning is to split the dataset into subsets and to run the calculation logic on those subsets in several parallel threads at the same time. Applications running on SAP HANA rely on its capabilities and speed, but there are still some settings that need to be configured to make the execution faster. With that in mind, users should analyze a few important factors:
- Data volume – if the input data is big enough you should think about partitioning
- Hardware utilization – to take advantage of the parallelization of available CPU cores
- Memory overload – high memory consumption while working with a large dataset
- HANA limitations – although large datasets can be processed with PaPM, there is a constraint of 2 billion rows per table or partition.
Going through the blog from beginning to end, I will try to answer the following questions:
- How to choose the right field for the partitioning key, while respecting the cardinality and functional constraints?
- How to set partitioning at environment level in the Modeling Environment of PaPM?
- How do the results of the processing function look after a partitioned execution?
Let’s get familiar with some of the points that determine whether partitioning settings would magically affect your execution and make it faster!
Partitioning works better when we respect the cardinality (the number of unique values of a field in a dataset). You should partition on a field whose cardinality is not too high: each package triggers one thread, so using too many partitions can lead to poor performance.
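To make the cardinality check more tangible, here is a minimal Python sketch that counts the distinct values of a few candidate fields on a small in-memory sample. The rows, the field names (company, cost_center, document_no) and the helper name cardinality are made up for illustration; in a real system you would run an equivalent distinct count on the model table itself.

```python
from typing import Dict, Iterable, List

def cardinality(rows: Iterable[Dict[str, str]], fields: List[str]) -> Dict[str, int]:
    """Count the distinct values of each candidate field."""
    distinct = {field: set() for field in fields}
    for row in rows:
        for field in fields:
            distinct[field].add(row[field])
    return {field: len(values) for field, values in distinct.items()}

# Hypothetical sample data, not taken from the Sample Content.
sample_rows = [
    {"company": "1000", "cost_center": "CC01", "document_no": "D001"},
    {"company": "1000", "cost_center": "CC02", "document_no": "D002"},
    {"company": "2000", "cost_center": "CC01", "document_no": "D003"},
    {"company": "2000", "cost_center": "CC03", "document_no": "D004"},
]

result = cardinality(sample_rows, ["company", "cost_center", "document_no"])
print(result)
```

In this toy sample, document_no has one distinct value per row, so it would produce as many packages as there are rows and is a poor candidate, while company stays low.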
You should also pay attention to the underlying hardware possibilities in order to balance the number of planned partitions and threads. There is no silver bullet for every customer. The performance quality always depends on the specific use case or project scope and the nature of the values in a table.
Before discussing possible functional constraints, we should first mention data dependencies, bearing in mind that the created subsets (packages) should be disjoint. Partitioned execution increases the calculation speed because packages are calculated independently of each other, and the calculation stays consistent only if processing the data in multiple packaged calls does not affect the result. This is usually the case in all kinds of linear, step-down, or network processes. It is usually not the case in circular and iterative processes. For example, if the partition field is the company and the processing functions rely on intercompany transactions, then the packaged execution will not give the expected results. You should try different partitioning columns in test systems and check which one gives you the best performance.
The field chosen for the partition key can be one of the following:
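The intercompany example can be illustrated with a small Python sketch that looks for rows pointing into a different package than their own. The field names company and partner_company, the sample rows and the helper cross_package_rows are assumptions made for this illustration only; they are not PaPM objects.

```python
from typing import Dict, List

def cross_package_rows(rows: List[Dict[str, str]], key: str, partner: str) -> List[Dict[str, str]]:
    """Return rows whose partner value belongs to a different package than the row itself."""
    return [row for row in rows if row[partner] and row[partner] != row[key]]

# Hypothetical transactions; an empty partner means no intercompany relation.
transactions = [
    {"company": "1000", "partner_company": "", "amount": "50"},
    {"company": "1000", "partner_company": "2000", "amount": "20"},
    {"company": "2000", "partner_company": "2000", "amount": "10"},
]

dependent = cross_package_rows(transactions, key="company", partner="partner_company")
print(len(dependent))
```

If such rows exist and the processing functions use them, packages split by company are not independent, and a different partition field should be tested.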
- A field used in the existing model tables (company, cost center, country, etc.)
- The second option could be a combination of a few fields that are not dependent on (related to) each other. This option is very sensitive because of the functional requirements of the specific use case. The packages should follow a functional logic in which the datasets remain independent or sequentially dependent. The possible options that could be used for a field combination are the following:
- company and fiscal year
- internal activity and market region field
- other combinations of unrelated fields
The abovementioned combinations could be used to create IDs for the partitioning field. But how can you do it? For example, you can generate IDs for those field groups by using a simple window function, as in the screenshot below, where I have used the ROW_NUMBER() function to generate IDs for the partitioning field based on the combination of the Company code (RBUKRS) and Scenario (GSCEN) fields, ordered by Posting date (BUDAT).
- The third way of populating a field that will be used for a partition key is to dynamically fill its IDs in any processing function. It is implemented by splitting the original dataset into packages that have an equal or similar number of rows. In the following example, we have used exactly this method for creating a partition key.
For example, in the Sample Content for Profitability and Cost Management (Environment ID: SXP), the View function creates IDs for the field Partition key (GPK), which is the field on which the partitioning will be implemented. This Sample Content contains the scalar parameter Duplication Parameter (:I_DUPLICATION), which can be used to multiply the “Plan and Forecast Data” model tables as many times as the parameter specifies. In the “Advanced” tab, the Iteration Counter is maintained using the Duplication Parameter to execute this function multiple times, iterating from the “Low” to the “High” value of this parameter, as you can see in the screenshot below.
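The idea behind the third option, splitting the dataset into packages with an equal or similar number of rows, can be sketched in a few lines of Python. This is only an illustration of the principle using a round-robin assignment; it is not how the View function in the Sample Content computes GPK, and the helper name assign_partition_keys and the R1..Rn key naming are assumptions.

```python
from typing import List

def assign_partition_keys(row_count: int, packages: int) -> List[str]:
    """Give each row a key R1..Rn so that package sizes differ by at most one row."""
    if packages < 1:
        raise ValueError("packages must be at least 1")
    # Round-robin: row 0 -> R1, row 1 -> R2, ..., then start again at R1.
    return [f"R{(index % packages) + 1}" for index in range(row_count)]

# Ten hypothetical rows spread over four packages.
keys = assign_partition_keys(row_count=10, packages=4)
print(keys)
```

With ten rows and four packages, two packages receive three rows and two receive two rows, which is as even as the split can get.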
The previous section covered the possible ways of determining the field that the partitioning relates to. Now let’s see an example of creating a partitioning that can be deployed for parallel execution in a process. In the Modeling Environment, looking at the Sample Content for Profitability and Cost Management, you can create partitioning settings as shown in the screenshot below.
For each partitioning, we need to define the run mode that is going to be used during the execution. Let’s explain what different run modes stand for. In the screenshot below you can see the possible options:
The run mode chosen for the specific partitioning in the example of this Sample Content is PPDP, which stands for Parallel, Packaged, Dialog, and Partitioned. “Parallel” implies that control returns promptly without waiting for the function execution to finish. “Packaged” implies that each range of the partitioning triggers its own execution, restricted to the field values of that range. “Dialog” means that the function execution is triggered in another task opened in dialog mode. The last letter (P) indicates that range partitioning has been applied. If you want to learn more about the different run modes, please refer to this page from SAP Help Portal.
After configuring the run mode for the partitioning, you will have to determine the ranges that its packages relate to. Once again, please bear in mind that the data should be distributed evenly across the packages so that each package/partition has roughly the same data volume. The processing time will then be roughly the same for each of them.
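A quick way to judge how evenly the data is distributed is to count the rows per package and compare the largest package with the smallest one. The following Python sketch does that on a made-up list of partition key values; the helper names package_sizes and skew_ratio and the sample values are assumptions for illustration, and the row limit constant simply restates the HANA constraint mentioned at the beginning.

```python
from collections import Counter
from typing import Dict, List

HANA_ROW_LIMIT = 2_000_000_000  # rows per table or partition, as mentioned earlier

def package_sizes(keys: List[str]) -> Dict[str, int]:
    """Count how many rows fall into each package."""
    return dict(Counter(keys))

def skew_ratio(sizes: Dict[str, int]) -> float:
    """Largest package divided by smallest package; 1.0 means perfectly even."""
    return max(sizes.values()) / min(sizes.values())

# Hypothetical partition key values of ten rows.
gpk_values = ["R1"] * 5 + ["R2"] * 4 + ["R3"] * 1
sizes = package_sizes(gpk_values)
ratio = skew_ratio(sizes)
too_large = [key for key, rows in sizes.items() if rows > HANA_ROW_LIMIT]
print(sizes, ratio, too_large)
```

A ratio far above 1.0, as in this sample, suggests that one package will run much longer than the others, so the ranges or the partition field are worth revisiting.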
Ranges determine the specific unique value that each partition will relate to. In our example you can see that range R01 will be a separate partition from range R02 because it uses R1 as a value in the field GPK (partition key).
The level is an important setting that controls how many packages are executed in parallel or in sequence, depending on the available hardware or resources. Levels basically behave like a hierarchy: if there are dependencies among packages, then level 1 packages, for example, will wait until the level 0 packages have been executed.
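As a plain Python analogy of this behavior, the sketch below runs all packages of one level in parallel and starts the next level only after the previous one has finished. It is not how PaPM schedules its work processes; the function run_by_level, the level numbers and the range IDs are assumptions chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import groupby
from typing import Callable, List, Tuple

def run_by_level(packages: List[Tuple[int, str]],
                 work: Callable[[str], str],
                 max_workers: int = 4) -> List[str]:
    """Run packages level by level; packages within one level run in parallel."""
    finished: List[str] = []
    ordered = sorted(packages)  # sorts by level first, then by range ID
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for _, group in groupby(ordered, key=lambda item: item[0]):
            range_ids = [range_id for _, range_id in group]
            # extend() consumes every result of this level,
            # so the next level starts only after this one is done.
            finished.extend(pool.map(work, range_ids))
    return finished

# Hypothetical packages as (level, range ID) pairs.
packages = [(1, "R03"), (0, "R01"), (0, "R02")]
log = run_by_level(packages, work=lambda range_id: f"{range_id} done")
print(log)
```

Here R01 and R02 share level 0 and can run side by side, while R03 on level 1 waits for both of them.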
But what if it is too exhausting for you to enter all the different values in different ranges? PaPM provides a simpler option! If the environment field or BW InfoObject that the partitioning relates to already has its master data maintained, you don’t need to enter ranges manually, because the master data values are pulled in as the selections for each package.
Now that you have maintained the partitioning settings at the environment level, let’s point out the place where chunking (data packaging) happens! The spot is the Advanced tab in the Function attributes of the first function that reads from the models (model views, model tables, etc.) pointing to your data sources. You need to register the partitioning field as a package selection field in the function attributes of these first functions (like Views or Joins).
The key thing to keep in mind is that the models should be defined as Input in the Input tab, because packaging is applied only to the function on the Input tab, and not to rules. This means that the whole input data will be filtered based on the package selection and then packaged. In the example of the SXP sample content, JOGPD is the Join function and MTPFD is the Model table.
Now that you know how the partitioning definition is maintained, I would like to show you where in the model this defined partitioning should be used. In general, the defined partitioning setting is used in the main function, that is, the function which has a NetWeaver trigger and/or is explicitly triggered (either a function which is maintained as a process template activity and/or is of processing type Executable). In the previously mentioned example, the Sample Content for Profitability and Cost Management, it is the function that is explicitly triggered: the one that is maintained as a process template activity. All versions of this Sample Content before version 9 have ALPCU (Allocate Products and Services to Customer) as the executable function. In version 9, it is JOBSD (Calculate Additional KPI and Join Final Results). In the Function Attributes of this function, the user needs to set the partitioning ID that was previously created at the environment level (PAR1):
After maintaining partitioning settings in this function, the system will know that it has to consider the run mode, packaging, and parallelization settings of the corresponding partitioning when executing this function. In order to partition the Y table of the function in a database you will have to regenerate this function. After running this function, you can analyze the detailed information of each package in the Application log of the results. You will notice from the log that there are multiple runs of this function with the same run ID but with different package IDs. From the package IDs, you can identify which package ID corresponds to which specific package value you had configured in the partitioning settings. You can also notice that package ID is a concatenation of partition ID and range ID. In the following screenshots of the mentioned Application logs, the specific package selections are presented.
If you were wondering what happens with the processes when you initiate the run from your executable function with parallelization settings, you can check it in the screenshots below. In the first one, you can see the range partitioning applied to the Y table of JOBSD (the Calculate Additional KPI and Join Final Results function) in SAP HANA Studio. The second screenshot is taken in transaction SM50 in NetWeaver, where you can see multiple ABAP processes triggered for the packages of the mentioned executable function (JOBSD from the Sample Content for Profitability and Cost Management, version 9).
Being in the role of a modeling user, you should initially pay attention to partitioning feasibility, while analyzing the dataset from the functional point of view, respecting cardinality and package dependency. The other side of the coin is the analysis of technical capabilities, considering hardware utilization and memory overload, trying to overcome HANA limitations.
I hope now you are more familiar with configuration possibilities in PaPM regarding partitioning and parallelization. I am sure that you are going to use it in the best way in order to improve execution time and memory usage, achieving the best possible performance of PaPM models.
Many thanks for taking the time to read my first blog entry. I am looking forward to hearing your opinions in comments. If you found this post useful, feel free to share it with your colleagues and partners.
Until next time!