Previously (Return of the ABAP Detective – A Case of Peripheral Damage), I shared a dilemma: occasional network “blips” that are causing application issues in our SAP and other environments. Though I don’t have a root cause or even a suspect in the interrogation room, I thought I’d share a few more investigation techniques for tracking down members of the elusive Dragon Network.
Since the last update, I’ve been the lead sleuth trying to uncover clues in the increasingly complex matrix of interconnected systems. The architectural designs call for redundancy at every layer, from power supplies through processing units through storage and network components. An unfortunate side effect of this large amount of circuitry is the additional time required to find where faults may arise. Built-in failover or recovery methods are now ingrained nearly everywhere, although the reporting of faults among different families of hardware and software leaves much to be desired. When it works, great; when something either fails silently, or intermittently dies and comes back, that is where I earn my bread.
At the risk of being redundant myself, I’ll share a couple charts that are similar to my previous post, though with differences I will explain. And then on to new territory.
The first chart is another set of ping results. This morning we had several “attacks” and a few users noticed. I’ve seen other days with similar spikes, but whether the average user sees enough to notice probably depends on who’s doing what and how long the delays are.
What’s different in this chart is a cut/paste of a timescale from the local users’ perspective. I tend to look at the performance charts in my timeframe, but the trouble tickets and other traces are going to be in their time.
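Lining up the two timescales by hand is error-prone, so here is a minimal Python sketch of the conversion. It assumes my clock is on EDT (UTC-4) and the users are on British Summer Time (UTC+1), which holds for August but not year-round; the function name and the sample timestamp are purely illustrative.

```python
from datetime import datetime, timedelta, timezone

# Assumed fixed offsets, valid for August only: US Eastern daylight
# time for my charts, British Summer Time for the users' tickets.
EDT = timezone(timedelta(hours=-4), "EDT")
BST = timezone(timedelta(hours=1), "BST")

def to_user_time(my_stamp, fmt="%Y-%m-%d %H:%M:%S"):
    """Convert a timestamp from my (EDT) timeframe to the users' (UK) clock."""
    mine = datetime.strptime(my_stamp, fmt).replace(tzinfo=EDT)
    return mine.astimezone(BST)

# Hypothetical timestamp read off one of my charts.
print(to_user_time("2010-08-20 04:17:00").strftime("%H:%M %Z"))
```

Fixed offsets keep the sketch dependency-free; for anything that has to survive a daylight saving changeover, a real time zone database would be the safer choice.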
The second chart shows an SAP transaction, database response time, over 30 minutes or so. I have more data, but this should suffice to view the hits. Since these were pulled from ST03 data on the system, which runs in remote time even though it’s local to my office, the times correspond to the upper scale above (marked “UK” time).
What struck me as odd was a lag (I think) between when the network spikes are recorded and the highest times shown above. My theory is that this is a result of ST03 data only being recorded at the very end of a transaction step, so if there is a delay of a few seconds or more, the user’s metrics will be recorded later than the blip. Note the trough around 9:17, with the biggest spike closer to 9:24. I think a lot of work bunched up, like in a logjam, and then was let loose. That should explain it, right?
I was advised to add traceroute to my beat, another network analysis tool that is typically used either for idle curiosity (how do my packets grow?) or on a manual basis for trouble-shooting. In fact, the manual even advises:
This program is intended for use in network testing, measurement and management. It should be used primarily for manual fault isolation. Because of the load it could impose on the network, it is unwise to use traceroute during normal operations or from automated scripts.
However, when additional firepower is needed to stop the Dragon gang, let’s not skimp on our delay-fighting tools. I’d normally have run traceroute like this:
$ tracert weblogs.sdn.sap.com
over a maximum of 30 hops:
2 28 ms 26 ms 26 ms 10.4.1.1
3 29 ms 27 ms 27 ms at-1-0-1-...
4 30 ms 27 ms 28 ms so-7-0-...
5 33 ms 29 ms 29 ms as0-0-...
6 33 ms 30 ms 30 ms 0.so-5-1-0...
7 33 ms 258 ms 33 ms xe-11-3-0.edge1...
8 46 ms 146 ms 306 ms POS6-0.GW2....
9 33 ms 153 ms 30 ms ae-92-92.ebr2....
10 38 ms 36 ms 311 ms 0.ge-0-1-0.gw2.sr.sonic.net [220.127.116.11]
11 39 ms 35 ms 251 ms gig50.dist1-1.sr.sonic.net [18.104.22.168]
12 40 ms 140 ms 307 ms weblogs.sdn.sap.com [22.214.171.124]
It’s supposed to show network hop times, through an algorithm that includes a bit of overhead, and it exposes a bit more of the network topology than some might like. I’ve blotted out a few details above for that reason. You should see that my DSL link costs about 25 milliseconds, which isn’t bad compared to some of the other hops.
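For anyone who would rather let a script do the squinting, here is a minimal Python sketch that parses tracert-style rows and flags hops where the three probes disagree badly. The function names and the 100 ms threshold are my own assumptions, not part of any tool, and the sample rows are a few of the lines from the listing above, truncated hostnames and all.

```python
import re

# One tracert row: hop number, three probes ("28 ms", "<1 ms" or "*"),
# then whatever hostname or address follows.
ROW = re.compile(r"^\s*(\d+)\s+((?:(?:<?\d+\s+ms|\*)\s+){3})(\S.*)$")

def parse_tracert(text):
    """Return one dict per hop row; headers and other lines are skipped."""
    hops = []
    for line in text.splitlines():
        m = ROW.match(line)
        if not m:
            continue
        # Timed-out probes ("*") simply contribute no value.
        times = [int(t) for t in re.findall(r"<?(\d+)\s+ms", m.group(2))]
        hops.append({"hop": int(m.group(1)),
                     "host": m.group(3).strip(),
                     "times_ms": times})
    return hops

def jittery_hops(hops, threshold_ms=100):
    """Hops whose slowest and fastest probes differ by at least threshold_ms."""
    return [h for h in hops
            if h["times_ms"]
            and max(h["times_ms"]) - min(h["times_ms"]) >= threshold_ms]

SAMPLE = """\
over a maximum of 30 hops:
2 28 ms 26 ms 26 ms 10.4.1.1
7 33 ms 258 ms 33 ms xe-11-3-0.edge1...
10 38 ms 36 ms 311 ms 0.ge-0-1-0.gw2.sr.sonic.net [220.127.116.11]
12 40 ms 140 ms 307 ms weblogs.sdn.sap.com [22.214.171.124]
"""

for h in jittery_hops(parse_tracert(SAMPLE)):
    print(h["hop"], h["host"], h["times_ms"])
```

On these rows the quiet DSL hop is ignored and hops 7, 10 and 12 are flagged. Looking at the spread within a hop, rather than the raw times, is a deliberate choice: a slow but steady hop is just distance, while a hop that swings from 33 ms to 258 ms is the kind of blip worth chasing.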
Astute observers will see that I’ve used the Windows OS command line version, which threw away a couple vowels when borrowing the code and grammar from UNIX. Normally you’d expect the UNIX version to be more consonant heavy.
I didn’t like the way that traceroute formats the output, so I looked around for an equivalent power tools version (the way I found fping far superior to ping) and discovered mtr. See, for instance, http://www.bitwizard.nl/mtr/ – the version I built was 0.80. Similar programs are probably available under Windows or your other OS of choice.
Data collections are now underway, and I can share preliminary results. I am skeptical, unfortunately, that this is going to lead us straight to the villain’s hideaway, but one never knows.
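A collection loop along these lines could be sketched in Python as below. This is an illustration under assumptions, not my exact script: it assumes mtr is on the PATH, uses its report mode with three probe cycles and DNS lookups switched off, and the function names, log file and pause between runs are all hypothetical. Mindful of the manual’s warning about automated scripts, it keeps each run short and spaces the runs out.

```python
import subprocess
import time
from datetime import datetime

def build_mtr_command(host, cycles=3):
    """mtr in report mode: a fixed number of cycles, numeric hops only."""
    return ["mtr", "--report", "--report-cycles", str(cycles), "--no-dns", host]

def collect(host, logfile, runs=1, pause_s=60):
    """Append a timestamp and one mtr report per run to logfile."""
    with open(logfile, "a") as log:
        for i in range(runs):
            # Timestamp each run so it can be matched to the ping charts.
            stamp = datetime.now().astimezone().strftime("%a %b %d %H:%M:%S %Z %Y")
            log.write(stamp + "\n")
            # Raises FileNotFoundError if mtr is not installed.
            result = subprocess.run(build_mtr_command(host),
                                    capture_output=True, text=True)
            log.write(result.stdout)
            log.flush()
            if i < runs - 1:
                time.sleep(pause_s)  # go easy on the network between runs
```

Calling collect("weblogs.sdn.sap.com", "mtr.log", runs=10) would append ten stamped reports, a minute apart, to mtr.log.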
Fri Aug 20 14:48:36 EDT 2010
With little data in hand, it’s unwise to project too far. But so far, no run has produced more than 1 packet drop out of 3, the worst case is around 250 ms, and the times correspond to the ping reports.
We’re getting closer, I think.