Reducing Runtime of CPI-DS Tasks by Avoiding Self Joins
I have been working in CPI-DS team for more than a year. My company is one of the renowned names in oil and gas industry and we use SAP-IBP for Sales and Operations Planning with CPI-DS being used for integrating data from SAP HANA and legacy systems. My CPI-DS team comprises of experienced CPI-DS professionals with average work experience of more than 10 years.
When I joined the team, I got to learn a lot about technical aspects of running CPI-DS but, I found that joins were extensively used even in applications where it can be avoided. The problem with joins is that as the input data increased by X times, the data points in the joins increased by X square times where two joins were used and X cube times where 3 joins were used and so on. Avoiding joins helped me to reduce the runtime of CPI-DS jobs from 5 hours to 5 minutes in one of the applications. I have explained below one of the methods of avoiding joins:
Input Data consists of the following attributes:
- Unit Cost
- Other Attributes
Here, ‘MaxPrice’ filters out all the entries except the one with maximum unit cost for any location and product combination. An Inner join is then used in the next transform to get the other attributes from the source for the selected entry.
Logic Without Joins
The ‘prdlocgeomktkey’ transform orders all the records by the unit cost in descending order so that for a particular location and product combination, the maximum unit cost always comes at the top. The ‘genrownum’ transform then generates unique numbers for same combinations of location and product. The first entry in a location product combination will be numbered as 1 and the subsequent duplicate entries will be numbered from 2 onwards. As the maximum unit cost for a particular location and product combination will always have the row number generated as 1, it can then be filtered out in the next transform using the logic that row number should be one.
The below image shows how the runtime reduced by 98% by merely changing a small logic.
CPI-DS servers are not very proficient in handling huge amounts of data and calculations. Therefore, joins should be kept specific to small datasets and should be avoided as much as possible. Most of the times self joins can be avoided by using multiple functions such as ordering and generating row number by groups.
I would also like to learn from the community how you have eliminated joins or used any kind of innovative logic in transforms to reduce runtime. Please comment if you have done anything similar in your field of work.