Blog #5 – System log analysis
A. When to use these tools
Looking at the system logs occasionally is not a bad idea, even if you are a DBA with no responsibilities for the overall system. Knowing what is “normal” in these logs can be very useful when a problem has been encountered; it will help you see quickly any unusual entries. I will discuss some specific issues where looking at system logs can help determine what the problem is.
B. What to look for
Unix systems often can put a lot of messages into a system log. Typically, most of them will not have a direct bearing on running ASE. There are some specific things, though, that can be looked for. Most system logs contain the name of the process that hit an error, and looking for “dataserver” or “backupserver”, for example, can help spot messages that may be pertinent. As one example, on a Linux system running ASE, you might run:
grep -i dataserver /var/log/messages
One possible message might be:
Jul 9 09:44:23 redhead kernel: EXT4-fs (sda2): Unaligned AIO/DIO on inode 3824145 by dataserver; performance will be poor.
This is a warning that the ASE in question was using a file system device on an unaligned file system, and as a result there may be an impact on disk I/O performance. Note that this particular message often occurs when running a 2K page ASE on a 4K block filesystem. More details may be found at https://forums.oracle.com/forums/thread.jspa?messageID=10582441.
C. What tools to use
On Linux, Solaris, and HP-UX platforms the system log is a plain text file. Its location varies by platform, and may be configured by your system administrator to go someplace other than the default. The default locations are:
Linux – /var/log/messages and the ‘dmesg’ utility
Solaris – /var/adm/messages
HP-UX – /var/adm/syslog/syslog.log
On AIX, the log is kept in a binary format, and the errpt program needs to be used to look at it. The complete output can be obtained using “errpt –a”, you can use the man page on errpt to check out other possible options.
There are many possible messages that may show up in a system log. As mentioned above, a good idea would be to familiarize yourself with “normal” messages in order to quickly spot exceptional ones. Here are some examples of exceptional messages:
- A disk error (note that for this type of error the system log did not include ‘dataserver’ )
From the Solaris system log:
Nov 28 10:59:10 testsys1 scsi: [ID 107833 kern.warning]
WARNING: /pci@8,600000/scsi@1/sd@0,1 (sd36):
Nov 28 10:59:10 testsys1 SCSI transport failed: reason ‘reset’: retrying
Nov 28 10:59:10 testsys1 scsi: [ID 107833 kern.warning]
WARNING: /pci@8,600000/scsi@1/sd@0,2 (sd37):
Nov 28 10:59:10 testsys1 Error for Command: read(10) Error Level: Retryable
Nov 28 10:59:10 testsys1 scsi: [ID 107833 kern.notice] Requested Block:
38078968 Error Block: 38078968
Nov 28 10:59:10 testsys1 scsi: [ID 107833 kern.notice] Vendor: SUN
Serial Number: 37820045-00
Nov 28 10:59:10 testsys1 scsi: [ID 107833 kern.notice] Sense Key: Unit
Nov 28 10:59:10 testsys1 scsi: [ID 107833 kern.notice] ASC: 0x29 (power on,
reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0
The corresponding messages in the ASE errorlog:
01:00000:02398:2013/11/28 10:59:10.99 server Page Information from first read attempt: Page read from disk ppageno = 16461 pobjid = 1052530783, pindid = 0 pnextpg = 0, pprevpg = 0 plevel = 0, pstat = 0x891 pts_hi = 3, pts_lo = -969580731
01:00000:02398:2013/11/28 10:59:10.99 server Page Information from second read attempt: Page read from disk ppageno = 2707 pobjid = 988530555, pindid = 0 pnextpg = 0, pprevpg = 0 plevel = 0, pstat = 0x801 pts_hi = 3, pts_lo = 834066651
01:00000:02398:2013/11/28 10:59:10.99 server End diagnostics for read failure
01:00000:02398:2013/11/28 10:59:10.99 server Suspect a Device or OS problem,Second IO successfull
01:00000:02398:2013/11/28 10:59:10.99 server Error: 694, Severity: 24, State:1
01:00000:02398:2013/11/28 10:59:10.99 server An attempt was made to read
logical page ‘16461407’, virtpage ‘825608303’ from virtual device ’49’ for object ‘988530555’ in database ’19’. The page was not read successfully. You may have a device problem or an operating system problem.
Note here that the page number reported by ASE is quite different than the device block number reported by the OS. This is quite normal; and both numbers are useful in looking at the problem. The system log does give the actual device (sd37) which can be difficult to determine from within ASE due to volume managers, striping, etc. One good clue here: any time you see a 69x error in the ASE errorlog it is a good idea to take a look at the OS log to see if there may have been a corresponding OS disk error reported. Taking care of disk errors while they are still recoverable (as was this case) is much better than trying to reconstruct all the data on a completely dead disk.
2. An issue where ASE crashed
Here is a system log message on a Linux box where ASE experienced a crash. The ASE log did not have any messages; so no clues were available there.
Jun 9 19:53:20 redhead abrt: saved core dump of pid 54913 (/test/asecarina/SMP/release/ASE-15_0/bin/dataserver) to /var/spool/abrt/ccpp-1402363992-54913.new/coredump (357540128 bytes)
The valuable information here is the location of the core dump (/var/spool/abrt/ccpp-1402363992-54913.new/coredump). This core dump can be sent to SAP support for analysis to determine the cause of the crash. There would be no indication in any ASE log or in the $SYBASE directory that a core dump had been created.
3. ASE was experiencing random errors and corruptions
In this case, a lot of diagnosis was done looking at possible causes of problems within ASE before someone thought to look at the OS log. When that was done multiple messages of the following type were seen:
May 12 00:45:41 solsys2 UDBH Syndrome 0x64 Memory Module Slot D: J3400
May 12 00:45:41 solsys2 SUNW,UltraSPARC-IV: [ID 905878 kern.info] [AFT0] errID 0x00005b2e.0036704c Corrected Memory Error on Slot D: J3400 is Intermittent
May 12 00:45:41 solsys2 SUNW,UltraSPARC-IV: [ID 330920 kern.info] [AFT0] errID 0x00005b2e.0036704c ECC Data Bit 7 was in error and corrected
Even though the OS was reporting that the memory errors were corrected, after replacing the indicated memory board the ASE errors disappeared. Although your system administrators may be scanning for such errors, the fact that he OS reported the errors as corrected might make them dismiss them as a possible cause for the ASE issues. As we saw in this case; that may not always be true.
4. ASE seeing timeslices and hangs
The ASE errorlog was showing repeated messages like:
03:00000:00089:2013/09/14 14:12:42.08 kernel timeslice -501, current process infected
In addition, two complete hangs were reported, where ASE was unresponsive to login attempts.
After analysis of memory dumps showed that the hangs occurred when processes were trying to complete a network send; it was suggested to examine the system logs.
The system log (/var/log/messages on Linux) contained several messages with timestamps that correlated with the timeslices in the ASE errorlog:
Sep 14 14:13:52 linsys1 kernel: INFO: task dataserver:8589 blocked for more than 120 seconds.
Investigating that error message on Google it was discovered that there was a known issue on RedHat Linux (Bugzilla ID # 525898 ) that was corrected in versions 5.7 and up. After upgrading their system to a newer version the timeslices (and system log messages) stopped.
While careful monitoring of system logs may not be in a DBA’s job description; when issues arise within the system that affect ASE those logs are a very good place to check for possible errors. Becoming familiar with where the logs are and what they typically look like can be a valuable tool in helping to quickly locate ASE issues related to system events.