My Adventures in Machine Learning 2

mark_foerster · ‎01-27-2020

My project literally starts from scratch. I have some experience in operating SAP BW systems and troubleshooting various performance issues, but that always was human experience applied to a specific technical problem. Now there is an (at least for me) completely new field: Applying Machine Learning to operations. During the last years I read a lot about the AI hype. AI was applied almost exclusively to business cases. This means to me stuff inside the SAP applications or some other business software layers. I haven't really seen a case for trying to apply AI or Machine Learning below, in the ordinary operations. Sure, there is lots of automation in operations, but hardly anyone that I am aware of talks about AI or ML in operations. This makes it hard to use some best practices which might guide me.

What data to collect?

Before I can dive into the Machine Learning part, I need to talk about data first. At least I can use my experience in operations to gather data which I believe to be relevant. The better the data quality and the more relevant the metrics, the better the performance of my Machine Learning models will be. My approach tries to follow the Data Science Pipeline from Professor Naumann of the HPI:

capture: data is captured by SAP BW in its database tables

extraction: data is extracted by me via SQL statements

curation: metrics are calculated by the very same SQL statements

storage: I collect my metrics in a new relational database

search: not required, the data volume is pretty small

sharing: not required, this is just a prototype

querying: done in Anaconda/Jupyter

analysis: done in Anaconda/Jupyter

visualization: done in Anaconda/Jupyter

Now if you remember my schematically view of the BW system, this was the guideline to identify BW-specific workload and performance metrics:

Figure 1: created by me

To start chronologically in this picture, let's start on the top: I want to collect details on the InfoPackages and DTPs. Actually, they are about loading rows, the require some processing time and there are different types of InfoPackages. Since I don't want to burden the productive systems with my data extraction, I decided to only collect data once a day and aggregated on days. For workload metrics this will be a daily sum and for performance metrics this will be a daily average. Now I have to give my metrics some fancy names. I get some inspiration by the column names of the base tables and try to use some prefix for each metric group.

Table RSDDSTATWHM provides the data on InfoPackages. My metrics are:

IP_INFOPAKIDS: number of InfoPackages loaded on this day

IP_SECONDS: overall processing time for these InfoPackages

IP_RECORDS: total number of rows loaded by these InfoPackages

IP_AVG: average processing time per row on that day

Table RSDDSTATDTP has got the equivalent data on DTPs, but with more details in the columns. Here my metrics are:

DTP_REQUESTS: number of DTP requests on that day

DTP_DATAPAKS: number of DTP data packages which have been loaded

DTP_SECONDS: overall processing time for these DTPs

DTP_RECORDS: number of rows loaded by these DTPs

DTP_GB: total DTP data volume in GB on that day

DTP_AVG: average processing time per row on that day

Table RSDDSTATINFO provides details on BW queries. Here we see. Please note that there is a huge difference between the first navigational step (STEPCNT='0001') of a BW query and all subsequent steps. The first navigational step can easily run for 10 minutes, even with a BWA in place. The next navigational steps within that query are way faster, typically faster than 1 second. So, my metrics also provide details on the initial navigational steps:

Q_NAVSTEPS: total number of navigational steps on that day

Q_INISTEPS: number of initial navigational steps on that day

Q_USERS: number of users who executed BW queries

Q_QUERIES: number of different BW queries which were executed

Q_RT_SUM: total sum of the runtime for all BW queries

Q_RT_AVG: average runtime of a navigational step

Q_RT_MED: medium runtime of a navigational step

Q_RT_INI: average runtime of the initial navigational steps

RSDDSTATBIAUSE is a table containing usage metrics of the BWA. For better or worse, I just collect them verbatim, sine I know of no other source for BWA usage data. The column names are relatively self-explaining:

BIA_USED, BIA_INA(ctive), DB_FALLBACK, BIA_NOTPOSS(ible), NO_BIA

Depending on the complexity of a BW query, many calls to the BWA could be triggered and registered in BIA_USED.

Then I collect 15 metrics for BWonOracle on the overall data volume and 7 metrics for BWonHANA. This is mainly because transaction DBACOCKPIT doesn't keep a long history. Probably these metrics are not suitable for training models, but maybe I find some interesting correlations:

DV_SEGMENTS: total number of segments in the Oracle database

DV_TABLES: total number of tables in the Oracle database

DV_NUMROWS: total number of rows in these tables (according to the optimizer statistics)

DV_DSO_GB: size of all DSOs (tables+indexes) in GB

DV_IC_GB: size of all InfoCubes (tables+indexes) in GB

DV_PSA_GB: size of the PSA (tables+indexes) in GB

DV_ADM_GB: size of all RS* tables+indexes in GB

DV_TOTAL_PARTS: total number of table+index partitions in Oracle

DV_EMPTY_PARTS: allocated partitions without any rows in them

DV_COLD_PARTS: partitions which have not been accessed for 42 days

DV_WARM_PARTS: partitions which have been accessed in the last 42 days

DV_COLD_GB: size of all segments which have not been accessed for 42 days

DV_WARM_GB: size of all segments which have been accessed in the last 42 days

Also, I collect some standard SQL metrics, both on Oracle (DBA_HIST_SQLSTAT) and HANA (HOST_SQL_PLAN_CACHE). In some later blog post I'll examine whether some significant differences can be found between BWonOracle versus BWonHANA.

SELECTS: total number of SELECT statements on that day

AVG_SELECT: average runtime of a SELECT statement

INSERTS: total number of INSERT statements on that day

AVG_INSERT: average runtime of an INSERT statement

UPDATES: total number of UPDATE statements on that day

AVG_UPDATE: average runtime of an UPDATE statement

DELETES: total number of DELETE statements on that day

AVG_DELETE: average runtime of a DELETE statement

CALLS: total number of database procedure callls (PL/SQL or SQLscript)

AVG_CALL: average runtime of a database procedure call

LOG_GB: amount of log information written on that day

For BWonOracle, I collect some more available metrics on the database workload from DBA_HIST_SYSSTAT:

LOGICAL_READS, PHYSICAL_READS, DB_BLOCK_CHANGES, TABLE_SCANS

Finally I collect some metrics on BW process chains (RSPCLOGCHAIN and RSPCPROCESSLOG), SAP batchjobs (TBTCO) and workprocess utilization (/SDF/SMON_WPINFO😞

PC_CHAINS: number of different process chains which were executed

PC_RUNS: number of process chains which have run

PC_TOTAL: total runtime of all process chains on that day

PC_AVG: average runtime of a process chain

PC_MED: medium runtime of a process chain

PC_FAIL: number of failed process chains on that day

BTC_JOBS: total number of batchjobs which ran on that day

BTC_SUM: total runtime of all batchjobs on that day

BTC_AVG: average runtime of a batchjob

BTC_MED: medium runtime of a batchjob

BTC_FAIL: number of failed batchjobs on that day

WP_DIA_AVG: number of DIA workprocesses which were in use on average

WP_DIA_PCT: percentage of the DIA workprocesses which were in use on average

WP_DIA_MAX: maximum number of DIA workprocesses which were active simultaneously

WP_BTC_AVG: number of BTC workprocesses which were in use on average

WP_BTC_PCT: percentage of the BTC workprocesses which were in use on average

WP_BTC_MAX: maximum number of BTC workprocesses which were active simultaneously

So far for the relevant metrics which I decided to collect. There is a handful of additional metrics, but I collect them solely for operative purposes. I just want to know my SAP BW systems in great detail.

68 Metrics later

To quickly sum up the metrics: At the beginning I didn't know what was relevant for later on, so I simply collected what looked interesting and was relatively easy to collect. My motto was: The more metrics, the better. If I had access to detailed data, I collected the average and the median value for performance metrics. If I had access only to preaggregated data I had to be satisfied with the average value. It was a somewhat tedious task, but I had to lay a solid foundation for the next steps. Anyway, I started my automated data collection and was eager to explore the treasures. Stay tuned for the next blog posts where the fun part beings.