Monitoring, standardised reporting and virtualisat...

Former Member · ‎02-24-2010

A few months ago, I got a call about a performance issue in our BI 7.0 system. The issue was primarily identified as poor performance and instability of the J2EE stack. After a lot of discussion and analysis between various teams, I learned a few new things. The issue also got me thinking about monitoring tools and standardisation.

The problem

The Basis team, the OS team and SAP had already done some analysis (or so I was told 🙂). This is the information I was given:

The BI J2EE engine is not stable and has severe performance issues.
The Java server processes are performing poorly because of heavy swap space usage (>90%). (I am using swapping and paging interchangeably here, just to avoid the details.)
The SAP CCMS monitoring node is also showing a "red alert" (default configuration).
Disk swap space was increased to bring the percentage used below 50%.
But the performance issue still remained.

About 50% swap space used, with no Paging and a lot of free RAM

Analysis

Trusting this information, I tried a quick, intuitive analysis.

1. If swap space is used, it would mean that swapping is happening. This would be true of the Linux and AIX implementations of swap, in the default configuration.
2. Swapping is really critical for J2EE performance. The recommendation would be NO swapping.
3. Then how would increasing the swap space to reduce the percentage used help, when swapping is still happening?
4. I checked ST06 for swap utilisation and got confused. Swap space was heavily used, but no paging was happening!
5. After some work and thought, I realized that this instance was on HP-UX. (There were two instances on Linux as well.)
6. I searched the documentation and found out how swapping is implemented in HP-UX. It is different from Linux (which I did not know before). HP-UX keeps "reserving" swap space even if it is not used (at least this system was configured like that). Linux and AIX, by contrast, only show swap usage when swap is actually being used. SAPOSCOL does not differentiate between "reserved" and "used".

Reserved swap appears as used swap in total.

7. Point clarified, swap space is not the issue.
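To make the difference between "reserved" and "used" concrete, here is a minimal sketch in Python. It is not how SAPOSCOL works internally; the class, the function names and all the numbers are hypothetical and only chosen to resemble the situation described above (about 50% reported, almost nothing really in use).

```python
from dataclasses import dataclass


@dataclass
class SwapSample:
    """One hypothetical reading of swap counters, all in MB."""
    total_mb: float     # configured swap space
    reserved_mb: float  # reserved for processes, nothing written to it yet
    used_mb: float      # actually holding paged-out data


def reported_percent(sample: SwapSample) -> float:
    """Percentage shown by a collector that adds reserved and used together."""
    return 100.0 * (sample.reserved_mb + sample.used_mb) / sample.total_mb


def in_use_percent(sample: SwapSample) -> float:
    """Percentage of swap that really holds paged-out data."""
    return 100.0 * sample.used_mb / sample.total_mb


# Made-up figures, not taken from the real system.
sample = SwapSample(total_mb=20000, reserved_mb=9800, used_mb=200)
print(f"reported: {reported_percent(sample):.1f}%")
print(f"in use:   {in_use_percent(sample):.1f}%")
```

With these assumed figures the combined value looks alarming while the swap that is really in use is negligible, which is the same confusion ST06 caused here.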

Solution

The problem was not swapping, but rather some configuration issues which, combined with some heavy queries, were causing the instability.

Questions

The instability problem was solved separately. But now the question was, "What to do about monitoring?"

Answers

Implement OS level alerting tools.

Use the page-out rate as the primary criterion for checking swap usage (recommendation from an HP-UX expert).
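A check based on the page-out rate could look roughly like the sketch below. It assumes the OS exposes a cumulative page-out counter that is sampled at two points in time; the function names and the warning and critical thresholds are hypothetical and would need tuning for a real system.

```python
def page_out_rate(prev_count: int, curr_count: int, interval_s: float) -> float:
    """Pages written out per second between two samples of a cumulative counter."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    # A smaller current value means the counter was reset (e.g. after a reboot);
    # report 0 for that interval instead of a negative rate.
    delta = max(0, curr_count - prev_count)
    return delta / interval_s


def swap_alert(rate: float, warn: float = 5.0, crit: float = 20.0) -> str:
    """Map a page-out rate to an alert colour. Thresholds are assumptions."""
    if rate >= crit:
        return "red"
    if rate >= warn:
        return "yellow"
    return "green"


# Swap space may look heavily "used", but with no page-outs the alert stays green.
print(swap_alert(page_out_rate(1000, 1000, 60)))
# Sustained page-outs are what should raise the alert.
print(swap_alert(page_out_rate(1000, 2500, 60)))
```

The point of the design is that the alert follows actual paging activity, so it behaves the same whether the platform reserves swap in advance or not.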

Related observations

I came across a similar situation with CPU utilisation reporting as well. Here, SAPOSCOL on HP-UX reports the CPU utilisation of a single process as more than 100%, while overall CPU utilisation is well below 100%! This reporting is also confusing and different from other platforms.

CPU % utilisation is 1,110 %!!!
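One possible reading of such a figure, which I have not confirmed for SAPOSCOL on HP-UX, is that the per-process value is summed over all CPUs, so 100% would mean one fully busy CPU. Under that assumption the value can be normalised by the number of logical CPUs. The function name and the CPU count of 16 below are hypothetical.

```python
def normalised_cpu_percent(per_process_percent: float, logical_cpus: int) -> float:
    """Scale a 'sum over all CPUs' percentage to a 0-100% share of the whole machine.

    Assumes 100% in the input means one fully busy CPU.
    """
    if logical_cpus < 1:
        raise ValueError("logical_cpus must be at least 1")
    return per_process_percent / logical_cpus


# 1,110% on an assumed 16-CPU machine is a large but plausible share.
print(normalised_cpu_percent(1110.0, 16))
```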

Afterthoughts

This exercise got me thinking:

Can we improve the monitoring tools to provide standardised reporting for attributes of the same category, e.g. memory, CPU etc.?
Would it not help shield people working at the technology layer from variations in the lower layers?
It may also help to configure the monitoring tools once and "re-use" the configuration even if the underlying infrastructure technology keeps on changing.
It could reduce the documentation and analysis effort in a hybrid infrastructure environment, e.g. the mixed Linux and HP-UX environment in the example above.
Can this be taken as one aspect of virtualisation? After all, virtualisation means providing abstraction from the lower-level layer's details and variations.

It does not mean that all platforms are the same. It is just about creating an abstraction layer that makes standardised, re-usable reporting possible. I expect such features would make administrators' lives a little bit easier.
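As a rough illustration of such an abstraction layer, the sketch below converts platform-specific swap readings into one common report, so that a single alert rule can be re-used across platforms. Everything in it is hypothetical: the adapter functions, the raw dictionary keys and the 50% threshold are assumptions for the example, not an existing SAP or OS interface, and all values are assumed to be in MB.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class SwapReport:
    """Platform-independent view of swap, all values in MB."""
    platform: str
    total_mb: float
    in_use_mb: float    # swap that really holds paged-out data
    reserved_mb: float  # reserved but untouched (0 where the platform has no such notion)

    @property
    def in_use_percent(self) -> float:
        return 100.0 * self.in_use_mb / self.total_mb


def from_hpux(raw: Dict[str, float]) -> SwapReport:
    # Assumed raw keys; reserved swap is kept apart from used swap.
    return SwapReport("hpux", raw["total"], raw["used"], raw["reserved"])


def from_linux(raw: Dict[str, float]) -> SwapReport:
    # Assumed raw keys; used swap is derived as total minus free.
    return SwapReport("linux", raw["SwapTotal"], raw["SwapTotal"] - raw["SwapFree"], 0.0)


ADAPTERS: Dict[str, Callable[[Dict[str, float]], SwapReport]] = {
    "hpux": from_hpux,
    "linux": from_linux,
}


def normalise(platform: str, raw: Dict[str, float]) -> SwapReport:
    """Pick the adapter for the platform and return the common report."""
    return ADAPTERS[platform](raw)


def swap_is_critical(report: SwapReport, threshold_percent: float = 50.0) -> bool:
    """One alert rule, configured once, applied to every platform."""
    return report.in_use_percent > threshold_percent


hp = normalise("hpux", {"total": 20000, "used": 200, "reserved": 9800})
lx = normalise("linux", {"SwapTotal": 8000, "SwapFree": 2000})
print(hp.platform, hp.in_use_percent, swap_is_critical(hp))
print(lx.platform, lx.in_use_percent, swap_is_critical(lx))
```

Supporting another platform would then mean writing one more adapter, while the alert configuration stays untouched.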