My Adventures in Machine Learning 3
I have started the collection of 68 metrics. Currently I have roughly 142 days of data for each of my BW systems. Today I focus on the BWonOracle systems because for the two BWonHANA systems several metrics are missing. First of all, I would like to get confident that the data is reliable, and I would like to detect hidden connections. For the reliability I performed some manual checks. They turned out quite good, it looks like the error for virtually all metrics should be well below 5%. Luckily for me, for detecting unknown connections there is already a tool available:
The Correlation Matrix
This correlation matrix is a standard feature from Pandas. It can be plotted very easily via the Seaborn package. I computed the matrix for 62 of my 68 metrics because I don’t have enough data for the last 6 metrics (on workporcess utilization) yet. The outcome is really useful:
Figure 1: My correlation matrix (62 metrics from 13 BWonOracle systems), click to enlarge
What can we see here? The range of correlations coefficients varies from 1 (on the main diagonal) down to -0.35. There are several combinations outside the main diagonal with correlation coefficients close to 1. On the other side of the spectrum, there is no real negative correlation. The lowest numbers are coefficients of correlations between a workload and a performance metric, e.g:
- DTP_REQUESTS vs. AVG_CALL: -0.347577
- IP_INFOPAKIDS vs. AVG_CALL: -0.324468
These are no real negative correlations but more like artefacts in the data. Systems with a high workload (lots of InfoPackages and DTPs being processed) run on better hardware. So there the average Oracle PL/SQL routine runs faster than on the low workload systems with cheaper hardware. When I check individual systems, then some more negative correlations appear, but the absolute value of the coefficient of correlation typically remains smaller than 0.7. I haven’t found any meaningful negative correlation in my data yet. From the kind of data that I collect, this was to be expected. The pandas tool just confirmed my expectations.
There are strong positive correlations in the data however. This is the list of the top correlated metrics and what I make of them:
- DV_SEGMENTS vs. DV_TOTAL_PARTS: 0.999997
On BW systems, an Oracle segment is typically a table or an index partition.
- DV_TOTAL_PARTS vs. DV_COLD_PARTS: 0.993075
Almost all of these partitions are cold data.
- DV_SEGMENTS vs. DV_COLD_PARTS: 0.993068
Dito. Almost all segments are cold data.
- DV_WARM_GB vs. DV_DSO_GB: 0.989642
DSO data is predominantly warm (active) data.
- IP_INFOPAKIDS vs. IP_RECORDS: 0.949045
The number of rows per InfoPackage seems to be relatively constant.
- IP_RECORDS vs. IP_SECONDS: 0.940835
The throuput for InfoPackages seems to be very stable.
- DTP_LOG vs. LOG_GB: 0.935215
The amount of DTP redo of the total LOG redo is quite constant (ranges per SID between 15 and 43%).
- IP_INFOPAKIDS vs. IP_SECONDS: 0.930577
The InfoPackages have low variation in runtime.
Some correlations are less strong, but also very interesting:
- LOG_GB vs. INSERTS: 0.802269
SQL insert commands are the highest driver of redo log generation. The actual data distribution can be seen from a scatter ploit:
Figure 2: My sample scatter plot for metrics LOG_GB and INSERTS, click to enlarge
Then I found some strange behaviour for the BWA on initial BW query steps:
- Q_INISTEPS vs. NO_BIA: 0.837903
The initial steps for BW queries typically do not profit much from the BWA. See also:
Q_INISTEPS vs. BIA_USED: 0.113119
Q_NAVSTEPS vs. BIA_USED: 0.402409
This problem would be worth a deeper investigation.
Data Quality Check
As I stated in the beginning, so far there is not much data collected for the workprocess utilization, but I can compare the little data that is already available. Metric BTC_SUM is derived from table TBTCO, while metric WP_BTC_AVG is derived from table /SDF/MON_WPINFO. Both should be highly correlated because of the simple connection: BTC_SUM=WP_BTC_AVG*86400. There are minor rounding errors, but the coefficient of correlation looks great:
- BTC_SUM vs. WP_BTC_AVG: 0.973407
This gives me confidence that both data sources are of a high quality and the other metrics derived from them should be also fine. Additionally, I have performed other manual checks of the data quality to ensure that I can rely on that data. This had to be done manually and ate up quite some time. As a plus, I get more familiar with the data trove.
As a summary, there seem to be not many close correlations between my metrics. The list of strong positive correlations is surprisingly short. When I focus on a single system or very small subsets of the 13 systems, then more correlations show up. However, that means they are too individual and cannot be generalized. SAP BW systems seem to be less uniform than I expected.
My next step will be to get a better understanding of the overall workload of the BW systems. There I will create an overall ranking to see which are the heavy workload and which ones are the low workload systems. Additionally I’ll start a first experiment to cluster them.