Besides cases of missing cats and dime store fortunes, there hasn’t been much for the ABAP Detective to write home about. In periods of economic retrenchment, it’s often the mundane stories that prevail. However, I’ve been following a few leads that might be of interest.
With virtualization comes great fuzziness. Or something like that. All I know is that the “good old days” of waiting for the 8:03 streetcar to take you to your job, I mean for the batch scheduler to run your job, are over. It’s now Zip Cars and event management and all kinds of new-fangled technology. But what hasn’t changed is that bad hardware gets noticed. Sorry, not “bad hardware”. I mean “socially maladjusted leased equipment resources.”
As more and more systems are running together in the same virtual environments, it’s been getting harder and harder to tail suspects. I’ve been rolling out new tools, employing old techniques in new ways, and trying to stay ahead of the yahoos. It isn’t easy. One way I try to find the dirty jobs is to prowl the network back streets with a simple ping sweep. It isn’t elegant, it has nothing to do with Solution Manager, and high-falutin’ aficionados look down their noses at it, but for me, a well-oiled set of ping times beats any number of fancy tools for revealing patterns.
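For anyone who wants to walk the same back streets, here is a minimal sketch of a ping sweep in Python. It is not the tooling behind the charts in this post, just an illustration. The function names (`parse_rtt_ms`, `ping_once`, `sweep`) are made up for this example, and the `-c 1 -W 1` flags assume a Linux-style `ping`; other platforms spell the count and timeout options differently.

```python
import re
import subprocess

# Matches the round-trip time in a typical ping reply line, e.g. "time=0.42 ms".
_RTT = re.compile(r"time[=<]([\d.]+)\s*ms")


def parse_rtt_ms(ping_output):
    """Return the first round-trip time in milliseconds, or None if no reply."""
    match = _RTT.search(ping_output)
    return float(match.group(1)) if match else None


def ping_once(host, timeout_s=1):
    """Send one echo request to host (assumes Linux-style ping flags)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        capture_output=True,
        text=True,
    )
    return parse_rtt_ms(result.stdout)


def sweep(hosts):
    """Ping each host once; hosts that did not answer map to None."""
    return {host: ping_once(host) for host in hosts}
```

Calling `sweep(["host-a", "host-b"])` (placeholder host names) on a schedule and logging the results with a timestamp gives the raw material for the kind of charts shown here.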
The first indication was in late May or early June. Funny stuff in the network layers. Lots of possible suspects, from bad configuration, to incorrect high availability designs, to overcrowding somewhere. We looked and looked, and though a few “oops” were corrected, intermittent slowdowns continued. And if there’s anything a trained ABAP detective hates to hear, it’s not just “the system is slow”, it’s “the system is slow, sometimes, but I’m not sure when.”
The first chart above shows ping times in early June. What I began to notice was not the usual messy chart where users got very busy moving data during work hours, tying up the phone lines (with generally legitimate work), but spikes at regular intervals. And bad spikes too, the kind that leave you dizzy with their height. Pointy spikes, with that possible random timing which baffles the normal detective. Not that I’m a normal detective of course.
Now we’re looking at July. As I reviewed the evidence, adding more tails and searches, and talking out possible theories with cohorts, I began to realize much of the noise happened when most folks are sleeping. Again, the opposite of normal business traffic. Very suspicious. One theory, still not disproven, is that there are large file transfers happening. But who’d be smuggling such large shipments and not be known to someone, somewhere? With a maximum ping time approaching one second, I began to get more and more aware that this problem was spreading, and I still didn’t have a case to take to the grand jury.
Another set of ping times, through the end of July and into August (the missing data on the 30th/31st was due to a power outage; even electric sheep probes need to sleep sometime). The problem began to creep into user awareness. I take this to mean that the underlying issue may have been there for weeks, and it was after multiple glitches, with associated coffee-corner conversations, that the back office talk started turning into help desk reported issues. But that’s good in a way, as it confirms my suspicions that something bad is occurring, and I haven’t wasted my investigation time.
Ping one system
One last ping chart, and then on to ABAP transactions and more SAP-centric technologies. The above shows a real-time chart (at the time I made it). Once again, hats off to Tobias Oetiker and his SmokePing tool, along with the RRD database.
ST03 by transaction
Above is the depiction of database response time, for a single dialog transaction, over about a 30-minute period. If anyone is interested in how to do this (without Solution Manager), drop a blog comment and I’ll write the process in a wiki page. The huge delays between 07:20 and 07:30 can’t be missed. And since times aren’t recorded until the transaction is complete, the root cause started happening no later than 07:20.
ST03 by transaction
Different system shown here, different transaction, and a different time zone (5 hours later than the one above). Again, spikes at 20 minutes after the hour, lasting 5 to 10 minutes. Whether users notice this enough to complain, I am not certain. Anyone awake during that time window may be battle-hardened from dealing with overnight batch (overnight from my perspective, not theirs), so this delay may not trigger a reaction.
ST03 by time
| Interval | Number of Steps | T Response Time (s) | Ø Response Time (ms) | T CPU Time (s) | Ø CPU Time (ms) | T Database Time (s) | Ø DB Time (ms) |
The last entry in my case journal so far is a report on dialog response time by time period. It’s unfortunate that the base reporting lumps midnight to 6AM as one time slot. You’d think by now, someone would have realized that SAP systems run around the clock, and losing detail for part of the day is unacceptable.
Yes, the system is slowing down during certain hours of the day. Yes, it’s being caused by large amounts of traffic, or by constricted virtual network piping. No, I don’t have enough evidence to get a warrant. I’m detecting.
[to be continued]