I blogged about a performance issue in sales order management last September. That post focused on problems caused by table buffering changes. This time, there are reported issues with manufacturing response time, both dialog and batch.

 

First I looked at dialog response time using ST03 (expert mode). I picked a few of the most-used transactions starting with "M". We keep 5 months online, so I viewed average response times back to July 2007. For each month, I copied the values to a spreadsheet after switching the screen to Excel viewing mode.

It's easier for me to see the trends with a picture, so I graphed the first 3 columns: dialog steps, time, and time per step. I often compute the average instead of using the downloaded column. Below are the views I captured.







 

I saw different trends: some transactions show degrading performance over time, while others remain constant. This means the problem is not system-wide; perhaps it lies in the tables or indexes used by MB51? As that transaction showed the largest change, I looked at it more closely (note the slope of the yellow line on the first graph compared to the others).

 




The week view showed higher numbers than the month views had led me to expect, although a few readings were under 2 seconds. I stopped charting after looking at 2 application servers, as I felt a pattern was beginning to emerge.

When I added a trendline, I saw increasing response time on these app servers. The change from December to mid-January was about half a second, which might start to be noticeable to users.

I posted emails asking if end users were reporting worse performance, and didn't hear of anything significant, although one business analyst mentioned that using date ranges in MB51 seemed to be less efficient than he thought it should be. 

 



The day view widened the recorded response time ranges.   This time I looked at one more app server, and skipped the weekend data.

 

The last drill-down level I looked at was response time per user for MB51 on each application server on a specific day. This is the only way I've found to understand who is doing what without running traces or analyses.




I used a scatter plot to display each user's response time for that day. There were a few readings over 2 seconds, including a surprising maximum of over 4 minutes. I later found out from the Basis team that we were getting dumps where other transactions hit the maximum time allowed, but those are not recorded in the ST03 times.


The last plot for that day omitted the top user; the resulting trend shows the vast majority of users seeing 1 or 2 seconds.

 

After looking at ST03, I switched gears the next day and started looking at ST04 to find what was showing up in the database's shared SQL statement cache. I also looked at a different perspective in ST03: the 10 minute snapshot.

The SQL cache review highlighted an SAP program (RM07DOCS, the report behind MB51) as having many disk reads. Several executions of one ABAP statement were in the cache, hinting that the code was being run in different ways.
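ST04 is just a window onto Oracle's shared SQL area, so the same list can be pulled straight from the database. A rough sketch only (it assumes SELECT access to the V$ views; the threshold is arbitrary and the column names are standard Oracle ones, nothing SAP-specific):

-- Illustrative only: statements with the most physical reads in the shared SQL cache
SELECT sql_text, executions, disk_reads, buffer_gets
  FROM v$sql
 WHERE disk_reads > 10000
 ORDER BY disk_reads DESC;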

 

Here's one I found:

SELECT
  T_01."ANLN1", T_01."ANLN2", T_01."APLZL", T_01."AUFNR", T_01."AUFPL",
[...]
FROM
  "MKPF" T_00, "MSEG" T_01
WHERE
  ( T_01."MANDT" = '040' AND T_00."MANDT" = T_01."MANDT" AND T_00."MBLNR" = T_01."MBLNR"
    AND T_00."MJAHR" = T_01."MJAHR" ) AND T_00."MANDT" = '###' AND T_01."MATNR" = '##'
  AND T_00."BUDAT" BETWEEN '20080121' AND '20080122' AND T_01."WERKS" = '##'

 

Note: I changed literal values to number signs.

This is a join between MSEG and MKPF, with the former being the larger (50GB?). The explain plan lists a nested loop, with MSEG accessed via index "M" and MKPF via its primary key index. The material number was entered but not the year; supplying the year as well might be more efficient. Index "M" also keys off material number, with plant (WERKS) as the next key, so this should work well. But ST04 shows almost 30,000 disk reads for this statement, meaning it could be better.
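For illustration only, here is roughly what the generated WHERE clause could look like if the user also supplied the material document year on the selection screen. The shortened column list, the year value, and the masked literals are mine; this is not a statement I captured in ST04:

SELECT T_01."MBLNR", T_01."MATNR", T_01."WERKS", T_01."MENGE"
  FROM "MKPF" T_00, "MSEG" T_01
 WHERE T_01."MANDT" = '###'
   AND T_00."MANDT" = T_01."MANDT"
   AND T_00."MBLNR" = T_01."MBLNR"
   AND T_00."MJAHR" = T_01."MJAHR"
   AND T_01."MJAHR" = '2008'          -- hypothetical: year supplied by the user
   AND T_01."MATNR" = '##'
   AND T_01."WERKS" = '##'
   AND T_00."BUDAT" BETWEEN '20080121' AND '20080122'

The extra MJAHR filter on MSEG should cut down the rows driving the nested loop into MKPF.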

ST03 10 minute detail report
    
Started    Server      Transaction  Program   T  Scr.  WP  User  Response time (ms)  Time in WPs (ms)
10:01:39   ap06_PRD_0  MB51         RM07DOCS  D  0120   0  T                 14,808            14,726
10:03:54   ap02_PRD_0  MB51         RM07DOCS  D  0120   8  L                 18,629            18,536
10:05:12   ap03_PRD_0  MB51         RM07DOCS  D  0120  31  Y                 33,836            33,518
10:07:24   ap07_PRD_0  MB51         RM07DOCS  D  0120   2  E                169,999           169,858
10:08:46   ap07_PRD_0  MB51         RM07DOCS  D  0120   8  C                 38,617            38,536
10:11:37   ap07_PRD_0  MB51         RM07DOCS  D  0120   2  E                 37,659            37,659
 

An email sent to a user returned with the feedback:

"... pero lo corrio sin parametros, ya  hable con ellos para que no vuelva a pasar" 

(Maybe Blag or Ignacio can help me out here) 

 

Some of the above will be easier once I get Central Performance History working - see these earlier blogs [7606 and 7607] for where I'm headed with that.

I'll try to blog on the batch run time analysis when I have time. 

 

[All opinions expressed are mine, no one else's]

[04-Feb-2008 Update] 

 

Problem user or problem system? 

Ignacio rightly challenged my assertion that performance problems are caused by users, as it should be up to the developers and infrastructure team to deliver quality customer solutions.  However, sometimes the budget isn't big enough, and sometimes, you inherit a system designed for a previous problem while users are solving the next one.

Problem user or user error? 

There is turnover in the user community, so last year's training class was effective for those who attended, but not new hires.  This dilemma requires ongoing learning, whether virtual, viral or vocal.  Part of my job is determining where those gaps occur.  Then there are refresher courses, or just reminders, for those who have forgotten the basics such as which columns allow the fastest return on query investment.

Then there are the "top" users, who are the exception to the rule, and who sometimes  need to submit the dinosaur killer query to get their job done.  The systems people need to recognize who they are, and perhaps steer them to a quiet time to prevent contention with other users.

Nuts & Bolts 

After posting my original blog, I have continued research with the Basis and DBA teams.  Below are findings as of 04-Feb-2008.  Our Unicode conversion is next week, so some performance aspects may change due to index rebuilds.  One index on MKPF did not get transported to production this weekend as expected, meaning I'll need to wait another 2 weeks before that impact is seen.

Table rows (TST)

MSEG 145,342,400
MKPF  19,374,700
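For anyone repeating this, a quick way to cross-check table size and row counts at the Oracle level is the data dictionary. A sketch only; it assumes SELECT access to the DBA_* views and reasonably current optimizer statistics:

-- Allocated size per table, in GB
SELECT segment_name,
       ROUND(SUM(bytes) / 1024 / 1024 / 1024, 1) AS size_gb
  FROM dba_segments
 WHERE segment_name IN ('MSEG', 'MKPF')
   AND segment_type = 'TABLE'
 GROUP BY segment_name;

-- Row counts from the last statistics run (cheaper than COUNT(*) on 145 million rows)
SELECT table_name, num_rows
  FROM dba_tables
 WHERE table_name IN ('MSEG', 'MKPF');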
 

Custom Indexes we have created over the past 10 years on MSEG and MKPF

MSEG~Z1

     
  1. WERKS  
  2. BWART  
  3. KOSTL

MSEG~Z2

     
  1. LIFNR  
  2. WERKS  
  3. BWART

MSEG~Z3

     
  1. LGORT
  2. WERKS
  3. BWART
  4. SOBKZ

MSEG~Z4

     
  1. AUFNR  
  2. MATNR  
  3. WERKS  
  4. BWART

MKPF~Z1

     
  1. MANDT  
  2. CPUDT

MKPF~Z2

     
  1. MANDT   
  2. BLDAT

MKPF~Z3

     
  1. XBLNR
New for February 2008:

MKPF~Z4

     
  1. USNAM   
  2. BUDAT
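At the database level, the new index amounts to something like the DDL below. This is only a sketch: in reality we define the index in the ABAP Dictionary (SE11) and let SAP generate and transport it, and the storage clauses, exact name, and whether MANDT is added as the leading column (as in MKPF~Z1 and ~Z2) depend on the system:

-- Sketch only; the real DDL comes from the ABAP Dictionary / transport
CREATE INDEX "MKPF~Z4" ON "MKPF" ("USNAM", "BUDAT");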

PRD - Oracle 9i / non-Unicode


SQL Plan for above generated code

TST - Oracle 10g / Unicode

New index is in place; transport into production 02-Feb-2008. 

PRD - a different SQL statement from the same ABAP statement

The Plan

TST plan

Well, this is not as good as the other: the optimizer decided the date wasn't selective enough (the range is a month instead of one day now) and went for index Z1 on MSEG. This will find all the rows matching that plant (and the 2 BWARTs), then weed through MKPF by MBLNR. Ugh -- what I like to call the Mumble Number.

Shrinking the date range in the ST05 explain-plan demo screen shows how the plan changes.

T_00 . "BUDAT" BETWEEN   '20070526' AND '20070527'

The other improvement to this plan is a switch from NESTED LOOP to HASH JOIN.
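One way to test that without touching the ABAP is to replay the statement through the ST05 explain-plan screen (or SQL*Plus) with an Oracle hint forcing the join method. This is just an experiment sketch with a trimmed column list and masked values; the hint names the statement's table aliases and is not something we have added to RM07DOCS:

SELECT /*+ USE_HASH(T_01) */
       T_00."MBLNR", T_00."BUDAT", T_01."MATNR", T_01."WERKS", T_01."MENGE"
  FROM "MKPF" T_00, "MSEG" T_01
 WHERE T_00."MANDT" = '###'
   AND T_01."MANDT" = T_00."MANDT"
   AND T_01."MBLNR" = T_00."MBLNR"
   AND T_01."MJAHR" = T_00."MJAHR"
   AND T_01."WERKS" = '##'
   AND T_00."BUDAT" BETWEEN '20070526' AND '20070527'

Comparing the plan and buffer gets with and without the hint shows whether the hash join really is the cheaper path.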

Conclusions?

 

Not yet!  There is a transport in flight, as well as Unicode and Oracle 10 migration.  We could see improvements from better user education, and maybe pre-filling columns to steer users away from open-ended wildcard queries.
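As an illustration of the open-ended problem (the values here are invented, not taken from our traces), the difference shows up directly in the generated WHERE clause -- the first form gives the optimizer nothing selective to work with:

-- Open-ended: user supplied only a month-long posting date range
SELECT COUNT(*)
  FROM "MKPF" T_00, "MSEG" T_01
 WHERE T_00."MANDT" = '###' AND T_01."MANDT" = T_00."MANDT"
   AND T_01."MBLNR" = T_00."MBLNR" AND T_01."MJAHR" = T_00."MJAHR"
   AND T_00."BUDAT" BETWEEN '20080101' AND '20080131';

-- Constrained: material, plant, and a two-day range supplied
SELECT COUNT(*)
  FROM "MKPF" T_00, "MSEG" T_01
 WHERE T_00."MANDT" = '###' AND T_01."MANDT" = T_00."MANDT"
   AND T_01."MBLNR" = T_00."MBLNR" AND T_01."MJAHR" = T_00."MJAHR"
   AND T_01."MATNR" = '##' AND T_01."WERKS" = '##'
   AND T_00."BUDAT" BETWEEN '20080121' AND '20080122';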

 

Stay tuned.
