Pssst, ABAP Detective, you got a second?

JimSpath · ‎06-17-2012

.

I was working the undercover squad, trying to take out a mob of no-ops that had infiltrated the infrastructure when I got a tip about a Code 16. Though I'm not on the regular ABAP force, I'm called in when the evidence points outside the straight and narrow ABAP world, to the seamy underbelly, I mean, the virtualization layer. With the site reorganization, my old case notes might be hard to find, so I've started tagging them with jim.spath/blog/tags#/?tags=abap_detective. Maybe that will help the future Dr. Watsons. Suffice it to say that while I've not been reading as much ABAP as in the past, the transaction volumes, spurious behaviors and fatal endings have not escaped my vision.

The report, as often happens, was ambiguous. A job ended with a 16. More than one job, in fact. For those unfamiliar with ABAP policing protocol, a 16 is pretty bad news. You can survive an 8, certainly a 4, and definitely a 2 or a 1. But a 16, that's a one way trip to the place where they put a tag on your process. The place where the dumps go. It's not pretty. Sherlock Holmes had the Sign of Four. We have the Sign of Sixteen.

Unfortunately, the early warning signs of a 16 can be misleading. What the dispatchers told me was running the sheet again made the error go away. And the message they showed me from their snitches didn't mean anything. I should know better than to trust someone else's interpretation of a messages. In my world, you make your own error messages by running the bad code on your own. What better way to run bad code out of town, than to know the subroutines, log locations, and seedy variants.

Job scheduling

By no small coincidence, I started seeing error messages on another case in another domain that ended up solving the perp of the reports above. First a little background on that case, and then I tie the two together. We have a scheduling tool that interfaces with SAP. Like the voice of the dispatcher that comes on the police radio band and says it's time for your shift to be over. Anyway, it has a database tier, and an application tier, and various agents and customers. For the past couple years (much longer than I'd like to admit), we would have periodic time drifts. I don't mean the kind where you've worked a double shift, and your cruising down the highway in your Crown Victoria, startling to the horns of oncoming traffic jarring you awake, I'm talking about milliseconds of time added or subtracted per day. Wait, you might be saying, there's no quantum mechanics or relativity allowed in the ABAP world! Hang on, we're getting there.

What I observed was that one of the two systems used by this application had the wrong time. Looking at various logs (I keep scrupulous records, as any good ABAP or code detective would), I found the times diverting by a second or so per month. No big deal, right? And in most cases, if a system were bounced monthly (perhaps to apply bandages to bleeding code injuries), that time difference would probably vanish. But these systems stay running for months at a time, no breaks. All along, I suspected the OS, where the network time protocol (NTP) was misbehaving. I'll go more into that social deviance after a short digression about system time synchronization.

Ignoring daylight savings time, or other time zone changes, it's good practice to use a reference clock to keep your many systems on the proper time. In the UNIX world (and most other operating systems can use this, or a variation of it), a demon runs on the system, checking with one or more other systems what time they have. Assuming you've done your homework, you'd have some internal systems that are the primary reference for the remainder, and external systems that those check. A few reference sites:

http://www.ntp.org/
US Naval Observatory - http://tycho.usno.navy.mil/what.html
University of Delaware (U.S.) http://www.eecis.udel.edu/~mills/ntp.html
http://support.ntp.org/bin/view/Servers/WebHome - About the Public NTP Time Server Lists

Briefly (way too briefly), there are reference machines one can (and should) use to keep your systems in line. Failure to do so will have dire consequences, as we relate shortly. In the meantime, the main clue I had was a periodic time drift, indicating a synchronization failure. I assumed the ntp daemon (service) was corrupt. But it wasn't. Before asking for debugging to be turned on (which would reset the system clocks, causing the evidence to once again be hidden for weeks), I read some manual pages (gasp, I know), and ran some diagnostics.

Diagnostics

$ ntpdate -d time01.internal

12 Jun 09:23:43 ntpdate[20709430]: 3.4y

transmit(...)

delay 0.04185, dispersion 0.00038

offset 2.448537

12 Jun 09:23:43 ntpdate[20709430]: step time server 4.3.2.1 offset 2.448537

.

Log 1

The "-d" flag says to show what adjustments would be made, if permitted. This showed a 2.4 second (approximately) gap. While we're not dealing with atomic clocks here, and the thinnest of blue lines, anything over 1 second is bad.

$ ntpdate -d time02.internal

12 Jun 09:24:55 ntpdate[31981758]: 3.4y

transmit(...)

delay 0.04184, dispersion 0.00043

offset 0.276432

12 Jun 09:24:55 ntpdate[31981758]: adjust time server 4.3.2.5 offset 0.276432

.

Log 2

This says the other time reference site, if we relied on it, is around one-quarter second difference from the system I ran the command on. In other words, the two reference sites are out of synch with each other. As Mark Knopfler sang, "one of them must be wrong."

What about external reference clocks?

$ ntpdate -d 2.north-america.pool.ntp.org

12 Jun 09:25:31 ntpdate[40042622]: 3.4y

transmit(207.32.191.59)

...

delay 0.04903, dispersion 0.00076

offset 2.504520

12 Jun 09:25:31 ntpdate[40042622]: step time server 207.32.191.59 offset 2.504520

.

Log 3

Also wrong by over 2 seconds. With several other reference sites (from the NTP pools), I determined that we had a crooked timekeeper, right in our ranks.

Back to ABAP. What happens when the database and application server don't agree on what time it is? Well, this, for one:

`Date`	`Time`	`Message text`	`Message class`	`Message no.`	`Message type`
`06/12/2012`	`22:51:15`	`Job started`	`00`	`516`	`S`
`06/12/2012`	`22:51:16`	`Step 001 started (program [], variant [], user ID [])`	`00`	`550`	`S`
`06/12/2012`	`22:51:25`	`... file is being processed`	`BA`	`238`	`S`
`06/12/2012`	`22:55:02`	`ABAP/4 processor: ZDATE_ILLEGAL_LOCTIME`	`00`	`671`	`A`
`06/12/2012`	`22:55:02`	`Job cancelled`	`00`	`518`	`A`

Table 1

I am showing similar errors in text, and in screen shot mode, to help the search engines, and to make the point. Other indications, as hinted above, don't show what caused the code 16. This shows "illegal local time". The typical root causes are incorrect daylight savings time settings (one app server has one rule or setting, the other has another set), or one or more systems just aren't running a time service. But we were! Why were we out of synch?

Figure 1

What led me to look for the above was an message from years back, that is still in the ABAP systems: "big time diff".

Dev messages

$ grep -il "big time" dev*

dev_w3.old

dev_w51

dev_w52

dev_w53

dev_w53.old

[...]

M Tue Jun 12 04:13:37 2012

M *** WARNING => big time diff 2265953896 [1396155766 3662109662]for stamp STAT_READSEQ, correct ???

Log 4

I found these on a few application server traces, but then, I knew where to look. Often, what is seen in the trace files has a different interpretation in the job log.

Why were we out of synch?

Unfortunately, I can't reveal my sources, but I'll share a few lessons learned.

Don't use just 2 reference sites. Use 3.
Use the NTP pools as reference.
Make sure your reference sites exist.

For further reading, a few SCN forum posts, and a slightly interesting blog, more about daylight savings time than reference. Oh, and SAP Notes. Better to read them before finding your watch has stopped.

SAPonLinuxNotes	Linux reports "big time diff XXX for stamp YYY"
http://scn.sap.com/thread/1198057	ORA-1555 occurred... *** WARNING => big time diff
http://scn.sap.com/thread/739725	N4S Test Drive under SuSe 10.3 ... *** WARNING => big time diff
http://scn.sap.com/thread/1632847	Workprocesses stops and lock accumulates ... *** WARNING => big time diff
http://scn.sap.com/thread/1579656	BW System crashed due to DB6_ADM_RUNSTATS... *** WARNING => big time diff
Daylight Saving Time and Slowing Down The Time	Daylight Saving Time and Slowing Down The Time

447839	ZDATE_ILLEGAL_LOCTIME
481835	Analyzing the time zone settings
606374	"WARNING: big time diff" in wp trace files on AIX
679615	"big time diff" message in developer trace

Time to punch out the time clock and call this a wrap sheet.