Predicting bad SAP performance (Part 4)
In my last part I described how I started collecting data from FRUN. Now the next question is:
Which metrics will I use to evaluate the performance of an SAP system?
I don’t have much time to check out the FRUN metrics individually. The productive FRUN system is also already under heavy load, so I cannot easily activate new metrics for monitoring. I have to stick with what is already available.
In table MAI_UDM_STORE, I select the rows with CATEGORY=’PERFORM’. I further limit my selection to productive systems and ABAP stacks. This leaves me with 26 so-called TYPES, which are the performance metrics I will use. The machine learning model can decide later which of these are relevant for predicting the overall performance and which ones are not.
This brings me to another important point. How do I determine if the performance is actually good or bad? This is a very important question, because I do not have the time to manually evaluate all available data to identify and label the cases of actual bad performance. I want to automate this and generate lots of labeled data for machine learning.
In this case, I simply use the standard evaluation from SAP FRUN for these metrics. Each of these 26 metrics gets a simple rating of OK, WARNING or CRITICAL, depending on preconfigured thresholds. While this standard rating from SAP FRUN might not be ideal, it is a valid starting point and a huge time saver.
At each point in time, I simply define the “performance health” from how the 26 metrics were rated. A rating of OK is awarded 2 points, WARNING just 1 point, and CRITICAL 0 points. That sum is then normalized to a value between 0 and 100%, i.e. divided by the maximum of 52 points.
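To make the scoring concrete, here is a minimal Python sketch of the calculation described above. The function name performance_health and the constant RATING_POINTS are my own choices for illustration; they are not FRUN names. How ratings other than OK, WARNING and CRITICAL should be handled is not covered here, so the sketch simply rejects them.

```python
# Points per rating, as described in the text.
RATING_POINTS = {"OK": 2, "WARNING": 1, "CRITICAL": 0}


def performance_health(ratings):
    """Return the performance health in percent (0 to 100).

    `ratings` holds one rating string per metric for a single system
    at a single point in time. Unknown ratings raise a KeyError
    (an assumption of this sketch, not FRUN behavior).
    """
    ratings = list(ratings)
    if not ratings:
        raise ValueError("at least one rating is required")
    points = sum(RATING_POINTS[r] for r in ratings)
    max_points = 2 * len(ratings)  # every metric rated OK
    return 100.0 * points / max_points
```

For example, 20 metrics rated OK, 4 rated WARNING and 2 rated CRITICAL give 44 of 52 points, which is a health of about 84.6%.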
If all performance evaluations are OK, then the performance health rating is 100% (= best case). If all performance evaluations are CRITICAL, then the health rating is 0% (= worst case). This can be easily calculated with a simple SQL statement from the database. In the next step, I can identify the incidents where SAP systems suffered from bad performance, and even get some idea of how long each problem persisted and how high the impact was.
In a way, I now have something I could call “Anomaly Detection”. In my big database, I can now identify incidents showing abnormally bad SAP performance. This will be the basis for the next steps in the series, where I tackle “Anomaly Prediction”.
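One possible way to turn the health values into incidents is sketched below in Python. It treats every uninterrupted run of samples below a threshold as one incident and records its duration and its lowest health value as a rough measure of impact. The threshold of 80%, the names Incident and find_incidents, and the sample data are assumptions made up for this illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    start: datetime    # first sample below the threshold
    end: datetime      # last sample below the threshold
    min_health: float  # lowest health seen, as a rough impact measure

    @property
    def duration(self):
        return self.end - self.start


def find_incidents(samples, threshold=80.0):
    """Group consecutive samples below `threshold` into incidents.

    `samples` is a list of (timestamp, health_percent) tuples for one
    system, sorted by timestamp. The threshold is an assumed value.
    """
    incidents = []
    current = None
    for ts, health in samples:
        if health < threshold:
            if current is None:
                current = Incident(start=ts, end=ts, min_health=health)
            else:
                current.end = ts
                current.min_health = min(current.min_health, health)
        elif current is not None:
            # Health recovered, so the running incident is closed.
            incidents.append(current)
            current = None
    if current is not None:
        incidents.append(current)
    return incidents


# Made-up sample data: one health value every 5 minutes.
t0 = datetime(2024, 1, 1, 12, 0)
healths = [100.0, 90.0, 70.0, 60.0, 85.0, 100.0, 50.0]
samples = [(t0 + timedelta(minutes=5 * i), h) for i, h in enumerate(healths)]
for incident in find_incidents(samples):
    print(incident.start, incident.duration, incident.min_health)
```

Note that an incident consisting of a single sample gets a duration of zero here, because the duration is measured between the first and the last bad sample.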