System performance monitoring

  1. When to use these tools.

It is often useful to take an overall look at the system when there are ASE performance issues. Any sort of resource shortage or contention on the system can directly affect the ability of ASE to perform properly, and identifying these resource bottlenecks can allow ASE to work at optimum levels. Tuning ASE to reduce resource consumption can often help in these cases, but it can be quite difficult to identify where the contention exists using the diagnostic tools inside ASE. For instance, a lack of CPU resources on the system may show up in ASE as high CPU busy. But if the reaction to high CPU busy is to increase the number of ASE engines, the situation will be made worse, as that simply increases the overall CPU contention on the system.

There are four main areas of system resources that we will look at:

  1. CPU
  2. Memory
  3. Disk
  4. Network

Each of these areas has specific tools that can be used to measure and evaluate performance. Note that there are also many third-party system performance tools available; we will not be discussing those, but most of the time they report metrics that you can relate to the ones provided by the system tools. For example, memory paging is the same whether you see it in vmstat output or in a nice graph generated by some third-party tool.

I will not be discussing each field in these outputs (there are a lot), but rather will go through the most important metrics for diagnosing each system resource. If you are interested in a complete discussion of the fields, I would suggest running “man <command>” to get the full documentation.

I have put together some scripts for various Unix and Linux platforms that run the commands whose output is discussed below. They can be found on SCN at System debugging and analysis techniques scripts.

     1. CPU

There are several key numbers to look at when analyzing CPU usage. The high-level view can start with user cpu %, system cpu %, and idle %. These can be found, for instance, in the output from the vmstat.sh script. That output might look something like:

kthr memory            page            disk          faults      cpu

r b w   swap free  re  mf pi po fr de sr s0 sd sd sd   in sy   cs us sy id

1 0 0 78426408 47056720 214 264 1134 73 73 0 0 0 5 5 5 6040 141463 3626 8 3 89

3 0 0 64302616 28740520 1 2 0  0  0 0  0  0 0  0  1 3997 865005 4517 17 8 75

3 0 0 64302576 28740560 0 0 0  1  1 0  0  0 0  0  0 3948 855812 4526 17 7 77

3 0 0 64302648 28740640 5 41 0 0  0  0 0  0  0 0  1 4135 859738 4619 17 7 76

2 0 0 64302768 28740768 0 0 0  0  0 0  0  0 0  0  1 4093 857408 4600 17 7 76

2 0 0 64302280 28740248 8 51 0 1  1  0 0  0  0 0  0 4132 852127 4581 17 7 76

3 0 0 64301992 28739984 15 121 0 2 2 0  0  0 0  0  0 4184 856176 4715 17 7 76

3 0 0 64301504 28739528 0 0 0  0  0 0  0  0 0  0  1 3942 864103 4517 17 7 76

2 0 0 64301496 28739528 2 13 0 1  1  0 0  0  0 0  0 3973 857463 4566 17 7 76

2 0 0 64299944 28737896 441 1775 0 0 0 0 0 0 0  0  0 4207 853593 5116 18 8 74

A couple of points to note here: 1) On many platforms the first line shows the averages since the system was last booted, which means you can probably ignore it when looking at current issues. 2) The last three columns are user busy % (us), system busy % (sy), and idle % (id). In this particular case, we can see that the system is currently roughly 75% idle, indicating a large reserve of CPU cycles. But what this doesn’t show us is whether the CPU usage is being spread evenly across all CPUs, or whether a few are very busy. For that, look at the mpstat.sh output. One sample interval might look like:
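The averaging described above can be done with a short awk one-liner. This is a minimal sketch, not part of the vmstat.sh scripts: it skips the two header lines plus the since-boot first sample, then averages the last three columns. The sample rows are taken from the vmstat output shown above (truncated for brevity).

```shell
# Average the us/sy/id columns of a vmstat run. NR > 3 skips the two header
# lines and the first sample (since-boot averages on many platforms).
avg=$(awk 'NR > 3 { us += $(NF-2); sy += $(NF-1); id += $NF; n++ }
           END { printf "avg us=%.0f sy=%.0f id=%.0f", us/n, sy/n, id/n }' <<'EOF'
kthr memory page disk faults cpu
r b w swap free re mf pi po fr de sr s0 sd sd sd in sy cs us sy id
1 0 0 78426408 47056720 214 264 1134 73 73 0 0 0 5 5 5 6040 141463 3626 8 3 89
3 0 0 64302616 28740520 1 2 0 0 0 0 0 0 0 0 1 3997 865005 4517 17 8 75
3 0 0 64302576 28740560 0 0 0 1 1 0 0 0 0 0 0 3948 855812 4526 17 7 77
3 0 0 64302648 28740640 5 41 0 0 0 0 0 0 0 0 1 4135 859738 4619 17 7 76
EOF
)
echo "$avg"
```

Using the last three fields (NF-2 through NF) rather than fixed column numbers makes the same snippet work across platforms whose vmstat layouts differ in the middle columns.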

CPU minf mjf xcal  intr ithr csw icsw migr smtx  srw syscl  usr sys wt idl

  0    0 0   15   206  106  185 55   44    9 0 180725   74  17 0  9

  1   10 0    8   258 127  326    7 63    8    0 8412    5   2 0  93

  3    0 0    6   346  169  458 7   61    8 0  4341    2 2   0  96

  4   22 0   10   222  107  267 5   43   10 0 11051    4   2 0  94

  5    0 0    7   184   76  226 18   81   13 0 141239   36  15 0  50

  6    0 0    3   456  273  398 6   41    7 0  6665    2 2   0  96

  7    9 0  235   421  139  437 12  101   16 0 49070   10   6 0  83

16    0 0    6   153   65  169 56   39    6 0 168353   81  16   0  3

17    0 0    3   262  126  363 8   55    9 0 13042    4   3 0  93

19    0 0    3   321  156  420 7   57    9 0  4608    1 1   0  98

20    0 0    3   227  104  273 5   38    9 0 10610    5   5 0  90

21    0 0   10   159   65  192 12   68    9 0 204277   37  23 0  40

22    0 0    2   440  147  369 6   39    7 0  8532    2 1   0  97

23    0 0    9   391  174  531 16  108   16 0 46734    9   6 0  85

Here we see that the system has 14 CPUs: two are less than 10% idle, two are around 40-50% idle, and the rest are nearly completely idle. The overall average is about 75% idle, but we may have a couple of very busy jobs, each using up nearly a complete CPU. This tells us that we may have to do some more detailed analysis to determine what those jobs are (which will be a topic in a future blog).
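Spotting the overloaded CPUs by eye gets tedious on large boxes; a small awk filter over the mpstat output can list any CPU whose idle% (idl, the last column) falls below a threshold. This is a sketch, using a few sample rows from the output above:

```shell
# List CPUs from an mpstat sample whose idle% (last column) is below a
# threshold, to spot uneven load across CPUs.
busy=$(awk -v limit=10 'NR > 1 && $NF < limit { b = b (b ? " " : "") $1 }
                        END { print (b ? b : "none") }' <<'EOF'
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 0 0 15 206 106 185 55 44 9 0 180725 74 17 0 9
1 10 0 8 258 127 326 7 63 8 0 8412 5 2 0 93
5 0 0 7 184 76 226 18 81 13 0 141239 36 15 0 50
16 0 0 6 153 65 169 56 39 6 0 168353 81 16 0 3
EOF
)
echo "CPUs under 10% idle: $busy"
```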

The ratio of user cpu % (usr) to system cpu % (sys) is also important. In general, ASE does not make a lot of system calls to process a typical query. Therefore, we would expect system cpu % (the percentage of time spent handling system calls) to be fairly low. If it is high, it may mean either that ASE is making an unexpectedly large number of system calls or that the OS is not handling them very efficiently.

Correlating the overall cpu busy with ASE “busyness” can be a challenge. The measures that ASE uses differ from those of the OS, in that ASE counts time spent in the scheduler looking for work as idle. But it may still be using cpu cycles to search for runnable processes, or making system calls to poll for completed disk I/Os or incoming network packets. As a result, the OS would view the engine as busy while ASE metrics would show it as either Idle or I/O Busy. In addition, when using threaded mode in versions 15.7 and newer, ASE will show up as a single process (with multiple threads) to the OS. It can become even more difficult to separate things out when other jobs running on the server are also using CPU. I’ll discuss later on how to help separate things out.
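Since a threaded-mode ASE appears to the OS as one process, per-thread views are needed to see what its individual engine threads are doing. On Linux the thread list for a process is visible under /proc/&lt;pid&gt;/task, and tools like "ps -L" or "top -H" can show per-thread CPU. A minimal sketch, demonstrated against the current shell since an actual ASE pid is site-specific:

```shell
# Count the threads of a process via /proc (Linux-specific). With a real
# ASE dataserver pid, each engine thread would appear as a separate entry.
pid=$$    # stand-in pid; substitute the ASE dataserver pid in practice
nthreads=$(ls /proc/$pid/task | wc -l)
echo "process $pid has $nthreads thread(s)"
```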

     2. Memory

The memory in a Unix system can be divided into two main categories – memory used by the OS and memory used by processes. The biggest concern when looking at memory usage is determining whether there is enough memory contention to cause performance degradation. We can again start with vmstat output to get a general overall view of the system. The columns will vary by system type, but here is a sample Linux output:

procs -----------memory---------- ---swap-- -----io---- --system-- ------cpu------

r b   swpd   free buff    cache   si so    bi    bo in   cs us sy id wa st

6 1    660 14931620 100688 21177656    0    0 801   234    1 1 13  1 84  2  0

6 1    660 14930736 100696 21177648    0    0 18940 4613 4204 15503 35  2 60  3  0

6 1    660 14931612 100700 21177676    0    0 21538 3534 3993 14152 34  2 60  5  0

7 0    660 14931984 100712 21177668    0    0 14154 3556 4033 13898 32  1 62  5  0

7 1    660 14931964 100712 21177680    0 0 14618  4522 4346 15921 33  2 61 4  0

5 1    660 14932324 100716 21177684    0    0 11058 4319 4025 14549 30  1 63  6  0

5 1    660 14932628 100724 21177672    0    0 12654 5710 4392 16194 30  2 62  5  0

5  1    660 14933004 100728 21177680    0 0 11048  2845 4018 13458 32  1 64 4  0

4 1    660 14933252 100728 21177680    0    0 10880 3587 4222 14738 28  1 67  4  0

5 1    660 14933872 100736 21177680    0    0 11352 5302 4724 17347 31  2 64  4  0

The most important columns to look at here are “si” and “so”. These indicate paging in from (si) and out to (so) swap space on disk. Ideally, these values should always be 0. Any non-zero value indicates some amount of memory contention, and the higher the values, the worse the contention. Many modern OSes attempt to use as much memory as is available for functions such as the file system buffer cache, so do not be too concerned if the “free” column has a relatively low value. As long as there is no paging going on, a low free value will not hurt performance; in fact, it can improve performance by letting the OS make good use of more of the system memory.
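The si/so check is easy to automate when scanning a long vmstat capture. A sketch, using the Linux column positions shown above (si is field 7, so is field 8) and a couple of sample rows from that output:

```shell
# Count vmstat samples with non-zero swap-in (si) or swap-out (so); any
# hit means real paging to swap, i.e. memory contention.
paging=$(awk 'NR > 2 && ($7 > 0 || $8 > 0) { n++ } END { print n+0 }' <<'EOF'
procs memory swap io system cpu
r b swpd free buff cache si so bi bo in cs us sy id wa st
6 1 660 14931620 100688 21177656 0 0 801 234 1 1 13 1 84 2 0
6 1 660 14930736 100696 21177648 0 0 18940 4613 4204 15503 35 2 60 3 0
EOF
)
echo "samples showing swap activity: $paging"
```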

The vmstat.sh scripts also use the “-s” option to show collected memory statistics. In general, these values are accumulated totals since the last system boot. As such, they will not show immediate issues, but they can be very useful in determining whether there has been any memory contention since the system was last booted. A partial example from Solaris on a system with no memory contention shows:

        0 swap ins

        0 swap outs

        0 pages swapped in

        0 pages swapped out

473997441 total address trans. faults taken

35643878 page ins

  2140018 page outs

254314432 pages paged in

16457040 pages paged out

Note that a distinction is made between memory pages “swapped out” and pages “paged out”. The “paged out” count includes pages from the file system buffer cache that were written to disk, which is a normal function of the OS, so it does not indicate any contention. An example from a Linux system that has seen memory contention shows:

     24734348 total memory

13923988  used memory

1477148  active memory

11455092  inactive memory

10810360  free memory

346920  buffer memory

12267784  swap cache

25165816  total swap

26636  used swap

25139180  free swap

2302227251 pages paged in

8418742410 pages paged out

65435 pages swapped in

1923934 pages swapped out

Here we see that a significant number of pages have been swapped out, but the used swap space is not very large, indicating that the contention occurred sometime in the past. If you see pages getting swapped in and out on a regular basis, you can be sure that overall system performance is being impacted by memory contention. We will look later on at some tools to help us determine what processes are using the memory.
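Since the "vmstat -s" counters are cumulative since boot, a quick filter on the swap lines gives a yes/no answer on whether the box has ever hit contention. A sketch using the Linux figures shown above:

```shell
# Check the cumulative "pages swapped out" counter from vmstat -s style
# output; a non-zero value means memory contention at some point since boot.
verdict=$(printf '%s\n' '65435 pages swapped in' '1923934 pages swapped out' |
  awk '/pages swapped out/ { print ($1 > 0 ? "contention since boot" : "no contention since boot") }')
echo "$verdict"
```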

     3. Disk

There are two primary measurements to look at for disk I/O. One is the overall amount of traffic to a particular device, and the second is the average response time taken by the disk subsystem to complete a request. On most platforms the iostat tool can be used to show these values (on HP-UX we would use sar with the “-d” option instead).

Here is an iostat example output when there was very heavy write activity on a couple of disks:

Device:          rrqm/s  wrqm/s  r/s    w/s    rsec/s  wsec/s   avgrq-sz avgqu-sz await svctm  %util

sda               0.00     1.00 0.00    1.20     0.00 16.00        13.33     0.01 4.83   3.83   0.46

sdb               0.00     0.00 0.00  932.20     0.00 224614.40   240.95     0.96 1.03   1.03  95.66

sdc               0.00     0.00 0.00  933.20     0.00 224870.40   240.97     0.47 0.51   0.51  47.14

dm-0              0.00     0.00 0.00    2.00     0.00 16.00         8.00     0.01 2.90   2.30   0.46

Here we can see that sdb and sdc are seeing heavy writes, but the average service times (svctm) are quite low, 1 millisecond or less. The only way to improve disk speeds with this profile would be to move load to other disks.

*Note* According to the iostat man page on recent Linux versions, the service time (svctm) should not be used; instead, use the average wait time (await) as the measurement of performance. This is due to how the Linux kernel collects disk statistics.
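Following that advice, a scan for slow devices can key on the await column instead of svctm. A sketch: in the iostat -x layout shown earlier, await is the 10th field, and the sample rows come from that output (the 20 ms threshold is the rule-of-thumb bottleneck level discussed below).

```shell
# Flag devices whose await (field 10 in this iostat -x layout) exceeds a
# threshold in milliseconds.
slow=$(awk -v limit=20 'NR > 1 && $10 > limit { s = s (s ? " " : "") $1 }
                        END { print (s ? s : "none") }' <<'EOF'
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 1.00 0.00 1.20 0.00 16.00 13.33 0.01 4.83 3.83 0.46
sdb 0.00 0.00 0.00 932.20 0.00 224614.40 240.95 0.96 1.03 1.03 95.66
sdc 0.00 0.00 0.00 933.20 0.00 224870.40 240.97 0.47 0.51 0.51 47.14
EOF
)
echo "devices with await > 20ms: $slow"
```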

From a different platform, we can see what high service times look like:

r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
2.8   162.7   7.0  649.5  0.0 25.4    0.0  153.3   0 100 md/d1
53.1  297.5 306.2  912.0  0.0 46.3    0.0  102.8   0 100 md/d11

82.4  142.6 181.7  442.4  0.0 25.2    0.0  111.9   0 100 md/d21

The service times here (asvc_t) are quite high – over 100 milliseconds. While it is difficult to generalize, since different types of disk devices have different speeds, a general rule of thumb for disk average service times is:
< 10 milliseconds = OK (though very fast disks such as SSD should only be 1-2 milliseconds)
10-20 milliseconds = maybe OK, but could be a problem
>20 milliseconds = there is a disk bottleneck
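The rule of thumb above can be encoded as a small helper for use in monitoring scripts. This is just a sketch of the three-band classification, with the function name being my own:

```shell
# Classify an average disk service time (in milliseconds) per the rule of
# thumb: <10 ms OK, 10-20 ms borderline, >20 ms a bottleneck.
classify() {
  awk -v ms="$1" 'BEGIN {
    if (ms < 10)       print "OK"
    else if (ms <= 20) print "maybe OK, could be a problem"
    else               print "disk bottleneck"
  }'
}
classify 4.8     # a healthy device
classify 153.3   # like md/d1 in the sample above
```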

Slower or overloaded disks can obviously have a large impact on any database server's performance. If an ASE seems to be running a bit sluggish, it is always worthwhile to make sure there is not some disk issue contributing to the slower performance.

     4. Network

Since many network issues, such as improper routing and problems due to traffic load, take place in the network hardware, it can be difficult to diagnose them from the system itself. The best tool for looking at what the system sees is netstat. The netstat.sh script will show some of the more useful metrics.

The first output from netstat.sh shows the traffic and error levels for each network interface:

Iface       MTU Met    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg

eth0       1500 0 1349503855      0      0 0 1574347028      0      0 0

eth1       1500 0        0      0 0      0        6 0      0      0

eth2       1500 0  9107774      0 0      0      294 0      0      0

eth3       1500 0 743127146      0 0      0 7354573006      0 0      0

Note that on most platforms the netstat output columns do not line up very well, mostly because of the very high values some columns may contain. Here we see 4 interfaces, one of which is almost never used, but we don’t see any errors (RX-ERR and TX-ERR) or drops (RX-DRP and TX-DRP) being reported. Ideally, that is what should show up; errors in this output generally indicate a problem with the interface itself.
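Even with the ragged alignment, the error and drop columns sit at fixed field positions, so summing them is straightforward. A sketch over two sample interfaces from the output above (the Flg values are assumed, since that column was cut off in the capture):

```shell
# Sum the error and drop counters from netstat -i style output; anything
# non-zero generally points at the interface itself.
# Fields: RX-ERR=$5, RX-DRP=$6, TX-ERR=$9, TX-DRP=$10 in this layout.
errs=$(awk 'NR > 1 { e += $5 + $6 + $9 + $10 } END { print e+0 }' <<'EOF'
Iface MTU Met RX-OK RX-ERR RX-DRP RX-OVR TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0 1500 0 1349503855 0 0 0 1574347028 0 0 0 BMRU
eth3 1500 0 743127146 0 0 0 7354573006 0 0 0 BMRU
EOF
)
echo "interface errors and drops: $errs"
```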

We can also get metrics on a per-protocol basis from netstat. Here is an example of TCP values on Solaris:

TCP     tcpRtoAlgorithm     = 4     tcpRtoMin           = 400

        tcpRtoMax           = 60000     tcpMaxConn          = -1

        tcpActiveOpens      =2681247    tcpPassiveOpens     =358412

        tcpAttemptFails     =2305528 tcpEstabResets      =183254

        tcpCurrEstab        = 198     tcpOutSegs          =250743921

        tcpOutDataSegs      =1896837601 tcpOutDataBytes     =3429284834

        tcpRetransSegs      =136413     tcpRetransBytes     =182285123

        tcpOutAck           =42999509   tcpOutAckDelayed    =864043

        tcpOutUrg           = 334     tcpOutWinUpdate     = 509

        tcpOutWinProbe      = 4890     tcpOutControl       =5976598

tcpOutRsts          =2385762    tcpOutFastRetrans   =    16

        tcpInSegs           =181700271

        tcpInAckSegs        =892374093  tcpInAckBytes       =2170564957

        tcpInDupAck         =1820706    tcpInAckUnsent      = 0

        tcpInInorderSegs    =181655424 tcpInInorderBytes   =1103918392

        tcpInUnorderSegs    =288318 tcpInUnorderBytes   =388521280

        tcpInDupSegs        = 12909     tcpInDupBytes       =609222

        tcpInPartDupSegs    = 70     tcpInPartDupBytes   = 23647

        tcpInPastWinSegs    = 0     tcpInPastWinBytes   = 0

        tcpInWinProbe       = 70     tcpInWinUpdate      = 4871

        tcpInClosed         = 1034     tcpRttNoUpdate      =609634

        tcpRttUpdate        =891264817 tcpTimRetrans       =1393901

        tcpTimRetransDrop   = 65     tcpTimKeepalive     = 20402

        tcpTimKeepaliveProbe=  4744 tcpTimKeepaliveDrop =    53

        tcpListenDrop       = 0     tcpListenDropQ0     = 0

        tcpHalfOpenDrop     = 0     tcpOutSackRetrans   =123211

Yes, there are a *lot* of numbers there. The two main values that tell us whether we are seeing network issues are retransmissions (tcpTimRetrans, tcpRetransSegs) and drops (tcpTimRetransDrop). Any time we see retransmissions or drops, we know that there have been some network issues that slowed down responses to users; but it is mostly a matter of percentages (i.e. the ratio of tcpRetransSegs to tcpOutDataSegs). In this case that ratio is a little more than .007 percent of all packets, which is low enough not to be a big issue. Generally, if this ratio starts to hit 1% or higher, it indicates that the network may have a problem.
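The percentage quoted above comes straight from the two counters in the Solaris output. A quick check of the arithmetic:

```shell
# Retransmission percentage from the netstat counters above:
# tcpRetransSegs / tcpOutDataSegs * 100.
ratio=$(awk 'BEGIN { printf "%.4f", 136413 / 1896837601 * 100 }')
echo "retransmission ratio: ${ratio}%"
```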


I hope these examples can help you determine whether your system is running with resource contention and, if so, where you might start looking to improve performance.

