Blog #5 – System log analysis

    A. When to use these tools

Looking at the system logs occasionally is not a bad idea, even if you are a DBA with no responsibilities for the overall system. Knowing what is “normal” in these logs can be very useful when a problem has been encountered; it will help you see quickly any unusual entries. I will discuss some specific issues where looking at system logs can help determine what the problem is.

    B. What to look for

Unix systems often can put a lot of messages into a system log. Typically, most of them will not have a direct bearing on running ASE. There are some specific things, though, that can be looked for. Most system logs contain the name of the process that hit an error, and looking for “dataserver” or “backupserver”, for example, can help spot messages that may be pertinent. As one example, on a Linux system running ASE, you might run:

grep  -i dataserver  /var/log/messages

One possible message might be:

Jul  9 09:44:23 redhead kernel: EXT4-fs (sda2): Unaligned AIO/DIO on inode 3824145 by dataserver; performance will be poor.

This is a warning that the ASE in question was using a file system device on an unaligned file system, and as a result there may be an impact on disk I/O performance.  Note that this particular message often occurs when running a 2K page ASE on a 4K block filesystem. More details may be found at https://forums.oracle.com/forums/thread.jspa?messageID=10582441.


    C. What tools to use

On Linux, Solaris, and HP-UX platforms the system log is a plain text file. Its location varies by platform, and may be configured by your system administrator to go someplace other than the default. The default locations are:

Linux – /var/log/messages and the ‘dmesg’ utility

Solaris – /var/adm/messages

HP-UX –  /var/adm/syslog/syslog.log

On AIX, the log is kept in a binary format, and the errpt program needs to be used to look at it. The complete output can be obtained using “errpt –a”, you can use the man page on errpt to check out other possible options.

    D. Examples

There are many possible messages that may show up in a system log. As mentioned above, a good idea would be to familiarize yourself with “normal” messages in order to quickly spot exceptional ones. Here are some examples of exceptional messages:

  1. A disk error (note that for this type of error the system log did not include ‘dataserver’ )

From the Solaris system log:

Nov 28 10:59:10 testsys1 scsi: [ID 107833 kern.warning]
WARNING: /pci@8,600000/scsi@1/sd@0,1 (sd36):
Nov 28 10:59:10 testsys1 SCSI transport failed: reason ‘reset’: retrying
command
Nov 28 10:59:10 testsys1 scsi: [ID 107833 kern.warning]
WARNING: /pci@8,600000/scsi@1/sd@0,2 (sd37):
Nov 28 10:59:10 testsys1 Error for Command: read(10) Error Level: Retryable
Nov 28 10:59:10 testsys1 scsi: [ID 107833 kern.notice] Requested Block:
38078968 Error Block: 38078968
Nov 28 10:59:10 testsys1 scsi: [ID 107833 kern.notice] Vendor: SUN
Serial Number: 37820045-00
Nov 28 10:59:10 testsys1 scsi: [ID 107833 kern.notice] Sense Key: Unit
Attention
Nov 28 10:59:10 testsys1 scsi: [ID 107833 kern.notice] ASC: 0x29 (power on,
reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0

The corresponding messages in the ASE errorlog:


       01:00000:02398:2013/11/28 10:59:10.99 server Page Information from first read attempt: Page read from disk ppageno = 16461 pobjid = 1052530783, pindid = 0 pnextpg = 0, pprevpg = 0 plevel = 0, pstat = 0x891 pts_hi = 3, pts_lo = -969580731

       01:00000:02398:2013/11/28 10:59:10.99 server Page Information from second read attempt: Page read from disk ppageno = 2707 pobjid = 988530555, pindid = 0 pnextpg = 0, pprevpg = 0 plevel = 0, pstat = 0x801 pts_hi = 3, pts_lo = 834066651

01:00000:02398:2013/11/28 10:59:10.99 server  End diagnostics for read failure

01:00000:02398:2013/11/28 10:59:10.99 server  Suspect a Device or OS problem,Second IO successfull

01:00000:02398:2013/11/28 10:59:10.99 server  Error: 694, Severity: 24, State:1

01:00000:02398:2013/11/28 10:59:10.99 server  An attempt was made to read

logical page ‘16461407’, virtpage ‘825608303’ from virtual device ’49’ for object ‘988530555’ in database ’19’. The page was not read successfully. You may have a device problem or an operating system problem.

Note here that the page number reported by ASE is quite different than the device block number reported by the OS. This is quite normal; and both numbers are useful in looking at the problem. The system log does give the actual device (sd37) which can be difficult to determine from within ASE due to volume managers, striping, etc. One good clue here: any time you see a 69x error in the ASE errorlog it is a good idea to take a look at the OS log to see if there may have been a corresponding OS disk error reported. Taking care of disk errors while they are still recoverable (as was this case) is much better than trying to reconstruct all the data on a completely dead disk.

    2. An issue where ASE crashed

Here is a system log message on a Linux box where ASE experienced a crash. The ASE log did not have any messages;  so no clues were available there.

Jun  9 19:53:20 redhead abrt[24996]: saved core dump of pid 54913 (/test/asecarina/SMP/release/ASE-15_0/bin/dataserver) to /var/spool/abrt/ccpp-1402363992-54913.new/coredump (357540128 bytes)


The valuable information here is the location of the core dump (/var/spool/abrt/ccpp-1402363992-54913.new/coredump). This core dump can be sent to SAP support for analysis to determine the cause of the crash. There would be no indication in any ASE log or in the $SYBASE directory that a core dump had been created.

    3. ASE was experiencing random errors and corruptions

In this case, a lot of diagnosis was done looking at possible causes of problems within ASE before someone thought to look at the OS log. When that was done multiple messages of the following type were seen:

May 12 00:45:41 solsys2 UDBH Syndrome 0x64 Memory Module Slot D: J3400
May 12 00:45:41 solsys2 SUNW,UltraSPARC-IV: [ID 905878 kern.info] [AFT0] errID 0x00005b2e.0036704c Corrected Memory Error on Slot D: J3400 is Intermittent
May 12 00:45:41 solsys2 SUNW,UltraSPARC-IV: [ID 330920 kern.info] [AFT0] errID 0x00005b2e.0036704c ECC Data Bit 7 was in error and corrected


Even though the OS was reporting that the memory errors were corrected, after replacing the indicated memory board the ASE errors disappeared. Although your system administrators may be scanning for such errors, the fact that he OS reported the errors as corrected might make them dismiss them as a possible cause for the ASE issues. As we saw in this case; that may not always be true.

    4. ASE seeing timeslices and hangs

The ASE errorlog was showing repeated messages like:

03:00000:00089:2013/09/14 14:12:42.08 kernel  timeslice -501, current process infected

In addition, two complete hangs were reported, where ASE was unresponsive to login attempts.

After analysis of memory dumps showed that the hangs occurred when processes were trying to complete a network send; it was suggested to examine the system logs.

The system log (/var/log/messages on Linux) contained several  messages with timestamps that correlated with the timeslices in the ASE errorlog:

Sep 14 14:13:52 linsys1 kernel: INFO: task dataserver:8589 blocked for more than 120 seconds.


Investigating that error message on Google it was discovered that there was a known issue on RedHat Linux (Bugzilla ID # 525898 ) that was corrected in versions 5.7 and up. After upgrading their system to a newer version the timeslices (and system log messages) stopped.

    E. Conclusion

While careful monitoring of system logs may not be in a DBA’s job description; when issues arise within the system that affect ASE those logs are a very good place to check for possible errors. Becoming familiar with where the logs are and what they typically look like can be a valuable tool in helping to quickly locate ASE issues related to system events.

To report this post you need to login first.

Be the first to leave a comment

You must be Logged on to comment or reply to a post.

Leave a Reply