The Correlation Matrix

mark_foerster · ‎02-02-2020

I have started the collection of 68 metrics. Currently I have roughly 142 days of data for each of my BW systems. Today I focus on the BWonOracle systems because for the two BWonHANA systems several metrics are missing. First of all, I would like to get confident that the data is reliable, and I would like to detect hidden connections. For the reliability I performed some manual checks. They turned out quite good, it looks like the error for virtually all metrics should be well below 5%. Luckily for me, for detecting unknown connections there is already a tool available:

The Correlation Matrix

This correlation matrix is a standard feature from Pandas. It can be plotted very easily via the Seaborn package. I computed the matrix for 62 of my 68 metrics because I don't have enough data for the last 6 metrics (on workporcess utilization) yet. The outcome is really useful:

Figure 1: My correlation matrix (62 metrics from 13 BWonOracle systems), click to enlarge

Negative Correlations

What can we see here? The range of correlations coefficients varies from 1 (on the main diagonal) down to -0.35. There are several combinations outside the main diagonal with correlation coefficients close to 1. On the other side of the spectrum, there is no real negative correlation. The lowest numbers are coefficients of correlations between a workload and a performance metric, e.g:

DTP_REQUESTS vs. AVG_CALL: -0.347577

IP_INFOPAKIDS vs. AVG_CALL: -0.324468

These are no real negative correlations but more like artefacts in the data. Systems with a high workload (lots of InfoPackages and DTPs being processed) run on better hardware. So there the average Oracle PL/SQL routine runs faster than on the low workload systems with cheaper hardware. When I check individual systems, then some more negative correlations appear, but the absolute value of the coefficient of correlation typically remains smaller than 0.7. I haven't found any meaningful negative correlation in my data yet. From the kind of data that I collect, this was to be expected. The pandas tool just confirmed my expectations.

Positive Correlations

There are strong positive correlations in the data however. This is the list of the top correlated metrics and what I make of them:

DV_SEGMENTS vs. DV_TOTAL_PARTS: 0.999997

On BW systems, an Oracle segment is typically a table or an index partition.

DV_TOTAL_PARTS vs. DV_COLD_PARTS: 0.993075

Almost all of these partitions are cold data.

DV_SEGMENTS vs. DV_COLD_PARTS: 0.993068

Dito. Almost all segments are cold data.

DV_WARM_GB vs. DV_DSO_GB: 0.989642

DSO data is predominantly warm (active) data.

IP_INFOPAKIDS vs. IP_RECORDS: 0.949045

The number of rows per InfoPackage seems to be relatively constant.

IP_RECORDS vs. IP_SECONDS: 0.940835

The throuput for InfoPackages seems to be very stable.

DTP_LOG vs. LOG_GB: 0.935215

The amount of DTP redo of the total LOG redo is quite constant (ranges per SID between 15 and 43%).

IP_INFOPAKIDS vs. IP_SECONDS: 0.930577

The InfoPackages have low variation in runtime.

<p>

Some correlations are less strong, but also very interesting:

LOG_GB vs. INSERTS: 0.802269

SQL insert commands are the highest driver of redo log generation. The actual data distribution can be seen from a scatter ploit:

Figure 2: My sample scatter plot for metrics LOG_GB and INSERTS, click to enlarge

Then I found some strange behaviour for the BWA on initial BW query steps:

Q_INISTEPS vs. NO_BIA: 0.837903

The initial steps for BW queries typically do not profit much from the BWA. See also:

Q_INISTEPS vs. BIA_USED: 0.113119

Q_NAVSTEPS vs. BIA_USED: 0.402409

This problem would be worth a deeper investigation.

Data Quality Check

As I stated in the beginning, so far there is not much data collected for the workprocess utilization, but I can compare the little data that is already available. Metric BTC_SUM is derived from table TBTCO, while metric WP_BTC_AVG is derived from table /SDF/MON_WPINFO. Both should be highly correlated because of the simple connection: BTC_SUM=WP_BTC_AVG*86400. There are minor rounding errors, but the coefficient of correlation looks great:

BTC_SUM vs. WP_BTC_AVG: 0.973407

This gives me confidence that both data sources are of a high quality and the other metrics derived from them should be also fine. Additionally, I have performed other manual checks of the data quality to ensure that I can rely on that data. This had to be done manually and ate up quite some time. As a plus, I get more familiar with the data trove.

Outlook

As a summary, there seem to be not many close correlations between my metrics. The list of strong positive correlations is surprisingly short. When I focus on a single system or very small subsets of the 13 systems, then more correlations show up. However, that means they are too individual and cannot be generalized. SAP BW systems seem to be less uniform than I expected.

My next step will be to get a better understanding of the overall workload of the BW systems. There I will create an overall ranking to see which are the heavy workload and which ones are the low workload systems. Additionally I'll start a first experiment to cluster them.

My Adventures in Machine Learning 3

The Correlation Matrix

Negative Correlations

Positive Correlations

Data Quality Check

Outlook

SAP PI for Beginners

ABAP 7.40 Quick Reference

Fiori: technical installation and configuration of one app from A - Z