
Previously (Return of the ABAP Detective – A Case of Peripheral Damage), I shared a dilemma with occasional network “blips” that have been causing application issues in our SAP and other environments. Though I don’t have a root cause, or even a suspect in the interrogation room, I thought I’d share a few more investigation techniques for tracking down members of the elusive Dragon Network.

Since the last update, I’ve been the lead sleuth trying to uncover clues in the increasingly complex matrix of interconnected systems. The architectural designs call for redundancy at every layer, from power supplies through processing units to storage and network components. An unfortunate side effect of all this circuitry is the additional time required to find where faults arise. Built-in failover and recovery methods are now ingrained nearly everywhere, although the reporting of faults across different families of hardware and software leaves much to be desired. When it works, great; when it silently fails, or intermittently dies and comes back, is where I earn my bread.

At the risk of being redundant myself, I’ll share a couple of charts similar to those in my previous post, though with differences I will explain. And then on to new territory.

 

[Chart 1: ping results]

The first chart is another set of ping results. This morning we had several “attacks,” and a few users noticed. I’ve seen other days with similar spikes, but whether the average user sees enough to notice probably depends on who’s doing what and how long the delays last.

 

What’s different in this chart is a cut-and-paste of a timescale from the local users’ perspective. I tend to look at the performance charts in my own time zone, but the trouble tickets and other traces are going to be in theirs.
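
On the collection side, the raw numbers behind a chart like this can come from fping, which I’ll mention again below. A sweep along these lines is only a sketch; the host name and the one-second pacing are placeholders, not the production setup:

$ fping -C 60 -p 1000 -q appserver.example.com   # 60 probes, one per second; -q suppresses per-probe chatter and prints the round-trip list at the end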

 

[Chart 2: SAP database response time from ST03]

The second chart shows an SAP transaction measurement, database response time, over 30 minutes or so. I have more data, but this should suffice to show the hits. Since these were pulled from ST03 data on the system, which runs in remote time even though it’s local to my office, the times correspond to the upper scale above (marked “UK” time).

 

What struck me as odd was a lag (I think) between when the network spikes are recorded and the highest times shown above. My theory is that this is a result of ST03 data being recorded only at the very end of a transaction step, so if there is a delay of a few seconds or more, the user’s metrics are recorded later than the blip itself. Note the trough around 9:17, with the biggest spike closer to 9:24. I think a lot of work bunched up, like in a logjam, and then was let loose. That should explain it, right?

 

New territory

 

I was advised to add traceroute to my beat, another network analysis tool that is typically used either out of idle curiosity (how do my packets grow?) or on a manual basis for troubleshooting. In fact, the manual even advises:

 

This program is intended for use in network testing, measurement and management. It should be used primarily for manual fault isolation. Because of the load it could impose on the network, it is unwise to use traceroute during normal operations or from automated scripts.

 

However, when additional firepower is needed to stop the Dragon gang, let’s not skimp on our delay-fighting tools.  I’d normally have run traceroute like this:

$ tracert weblogs.sdn.sap.com

Tracing route to weblogs.sdn.sap.com [64.142.8.108]
over a maximum of 30 hops:

  1     3 ms     1 ms     1 ms  dslrouter [192.168.1.1]
  2    28 ms    26 ms    26 ms  10.4.1.1
  3    29 ms    27 ms    27 ms  at-1-0-1-...
  4    30 ms    27 ms    28 ms  so-7-0-...
  5    33 ms    29 ms    29 ms  as0-0-...
  6    33 ms    30 ms    30 ms  0.so-5-1-0...
  7    33 ms   258 ms    33 ms  xe-11-3-0.edge1...
  8    46 ms   146 ms   306 ms  POS6-0.GW2....
  9    33 ms   153 ms    30 ms  ae-92-92.ebr2....
 10    38 ms    36 ms   311 ms  0.ge-0-1-0.gw2.sr.sonic.net [64.142.0.205]
 11    39 ms    35 ms   251 ms  gig50.dist1-1.sr.sonic.net [208.201.224.30]
 12    40 ms   140 ms   307 ms  weblogs.sdn.sap.com [64.142.8.108]

 

It’s supposed to show per-hop network times, via an algorithm that adds a bit of overhead, and it exposes a bit more of the network topology than some might like. I’ve blotted out a few details above for that reason. You can see that my DSL link is about 25 milliseconds, which isn’t bad compared to some of the other hops.

 

Astute observers will see that I’ve used the Windows command-line version, which threw away a couple of vowels when borrowing the code and grammar from UNIX. Normally you’d expect the UNIX version to be more consonant-heavy.
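
For comparison, the vowel-rich UNIX spelling of the same probe would look roughly like this; the flags shown (numeric output, three probes per hop, a two-second wait) are just common conservative choices, not a transcript of an actual run:

$ traceroute -n -q 3 -w 2 weblogs.sdn.sap.com   # -n skip DNS lookups, -q probes per hop, -w seconds to wait per probe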

 

I didn’t like the way traceroute formats its output, so I looked around for an equivalent power-tools version (the way I found fping far superior to ping) and discovered mtr. See, for instance, http://www.bitwizard.nl/mtr/ – the version I built was 0.80. Similar programs are probably available under Windows or your other OS of choice.
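
mtr normally runs as a continuously updating display, but it also has a one-shot report mode that is easier to log. Something along these lines, where the host, cycle count, and five-minute sleep are placeholders rather than the production collection, would produce date-stamped reports like the excerpt below:

$ mtr -r -c 3 weblogs.sdn.sap.com                                                     # -r report mode, -c number of probe cycles
$ while true; do date; mtr -r -c 3 weblogs.sdn.sap.com; sleep 300; done >> mtr.log    # crude unattended logging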

Data collections are now underway, and I can share preliminary results. I am skeptical, unfortunately, that this is going to lead us straight to the villain’s hideaway, but one never knows.

Fri Aug 20 14:48:36 EDT 2010
HOST: hammer...              Loss%   Snt   Last    Avg   Best   Wrst  StDev
  1.|-- t....com              0.0%     3    0.3    0.4    0.3    0.4    0.0
  2.|-- m...com               0.0%     3    0.3    0.3    0.3    0.4    0.0
  3.|-- m...com               0.0%     3    0.4    0.4    0.3    0.4    0.0
  4.|-- t....com              0.0%     3  251.3  251.0  250.7  251.3    0.3

 

With little data so far, it’s unwise to project too far. But no run has yet produced more than one dropped packet out of three, the worst case is around 250 ms, and the times correspond to the ping reports.

 

We’re getting closer, I think.


4 Comments


  1. Jeppe Lærke Kristensen
    These are some interesting articles (just read the first post as well)

    Do you know of the EEM tool (End-user Experience Monitoring)? You might be able to use it to find out where the issue lies.

    This tool can replay a recorded process from multiple locations (robots/agents) and report the response time, which can then be analysed with RCA in Solution Manager diagnostics. The wiki for EEM: http://wiki.sdn.sap.com/wiki/display/EEM/Home

    Do you know of a wiki or other documentation on the transaction ST03, and what each of the columns in the measurements refers to?

    Thank you for this blog

    1. Jim Spath Post author
      Jeppe:  I know “of” the EEM tool.  However, I’m not confident it can help us yet.  For one thing, the various components of Solution Manager are unstable and require their own administrative effort.  For another, I’ve been trying to get a class and it hasn’t happened yet (See blogs SAP Virtual Training Class turns into Physical Vacation and Back On Course with SAP Education, Virtually. No Really — Join Me! ).

      As for ST03, one wiki page we started is here: ST03 Workload Monitoring – How To.  I can develop this further if you’re interested. 

      Jim

  2. Bert Ernie
    …the next part of the story.

    I had a few similar issues in the past. Did you know that the SAP tool niping can be helpful with network issues as well?

    I also had no ping issues, but problems with SAPGui. Then I learned that larger pings (~1500 bytes) fail as well. Most ping implementations can send larger packets: the -l option on Windows, -s on others.
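
    (If you want to try the large-packet test, the commands are roughly as below; the host is a placeholder, and 1472 bytes is just the usual payload that fills a 1500-byte Ethernet MTU once the IP and ICMP headers are added.)

    $ ping -s 1472 appserver.example.com      # UNIX/Linux: -s sets the ICMP payload size in bytes
    C:\> ping -l 1472 appserver.example.com   # Windows: -l sets the payload size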

    One time I saw a huge network load generated by frequent SAP buffer invalidations/resyncs. However, it only really hurt when a network switch was misbehaving at the same time 🙂

    In your case it may be interesting to measure the network traffic on your server's interface and see if there are peaks at the same time.

    Nice blog and good luck finding the problem, Michael

    1. Jim Spath Post author
      Michael: I’ve used niping a bit, though it is probably not available on the workstation I’m using without some effort.  The mtr code compiled quickly and cleanly.  If I do run passes with niping I’ll post an update.
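
      If it comes to that, the pass would look roughly like the lines below; the host name is a placeholder, and the options are from memory, so check niping's own usage output before relying on them:

      $ niping -s &                                          # on one side: start the niping test server
      $ niping -c -H appserver.example.com -B 1500 -L 100    # on the other: 100 round trips with 1500-byte buffers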
      We’re measuring traffic in many other ways that I have not documented here, including where we’re seeing device failovers due to dropped packets.  We will get to the bottom of this.  Or it will suddenly vanish when someone steps on the correct cord in the switch room :-).

      Jim

