Skip to Content

Today a customer asked about how to monitor recovery activity on MaxDB.

The typical questions here are: “Is it still running?” and “How long will it take?”

While the last question cannot be answered easily (and mostoften not at all), since the time necessary to finish a recovery largely depends on the performance of
the hardware and the data volume, the first question can rather easilybe answered.

First step – X_CONS 

The first step to check the recovery activity is to check what the database is currently doing at all.
Since a recovery can only be done in ADMIN mode, we can be pretty sure that everything we will see in the ‘x_cons show all’ output will elong to the recovery.

So let’s check this. I just started the recovery of some logbackupfiles on mytest db (MaxDB 7.6.04 Build 11/Win32):

C:\WINDOWS>x_cons DB760 show ac

SERVERDB: DB760

ID   UKT  Win   TASK       APPL Current         Timeout Region     Wait
          tid   type        pid state          priority cnt try    item
T11    6  0x81C Pager           Vvectorio             0 0               671(s)
T12    6  0x81C Pager           Vvectorio             0 0               671(s)
T13    6  0x81C Pager           Vvectorio             0 0               671(s)
T18    7  0xD5C User       1560 JobWait BckRec        0 0               3599(s)
T28    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
T29    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
T30    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
T31    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
T32    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
T33    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
T34    4  0xEDC BUPmed          AsynWaitRead          0 0               1130(s)
T35    4  0xEDC BUPvol          JobWait Redo          0 0               1130(s)
T51    4  0xEDC Savepnt         PagerWaitWritr        0 0               1361(s)
T52    4  0xEDC RedoRea         Vsuspend (204)        0 0               1130(s)
T53    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
T54    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
T55    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
T56    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
T57    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
T58    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
T59    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
T60    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
T61    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
T62    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
T63    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
T64    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
T65    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
T66    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
T67    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
T68    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
T69    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
T70    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
T71    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
T72    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
T73    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
T74    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
T75    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
T76    4  0xEDC RedoExe         IO Wait (R)           0 0      2        2697(s)
T77    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
T78    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
T79    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
T80    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
T81    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
T82    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
T83    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
T84    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
T85    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
T86    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
T87    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
T88    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
T89    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
T90    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
T91    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
T92    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
T93    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
T94    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
T95    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
T96    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
T97    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
T98   4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
T99    4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
T100   4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)
T101   4  0xEDC RedoExe         NoRedoJob(231)        0 0               1130(s)

Console command finished (2008-06-11 10:55:33).

Ok, the first task we see is:

ID   UKT  Win   TASK       APPL Current         Timeout Region     Wait


T18    7  0xD5C User       1560 JobWait BckRec        0 0               3599(s)

This is the user tasks that actually started the recovery.
It is in state “JobWaitBckRec” (waiting for the completion of a backup or a recovery) and does no work for itself.
Instead there are several usertasks active with actually performing thenecessary work for the recovery.

This work can be splitted in:

A) Reading data from the log-backup

We see this in this line: 

ID   UKT  Win   TASK       APPL Current         Timeout Region     Wait
          tid   type        pid state          priority cnt try    item

T34    4  0xEDC BUPmed          AsynWaitRead          0 0               1130(s) 
T35    4  0xEDC BUPvol          JobWait Redo          0 0               1130(s)

B) Creating internal recovery files from the data and analyze the recovery files, so that transactions that don’t need to be redone aren’t actually redone and to maximize parallelity of recovery.

In our list above this task is actually doing this:

ID   UKT  Win   TASK       APPL Current         Timeout Region     Wait
          tid   type        pid state          priority cnt try    item

T52    4  0xEDC RedoRea         Vsuspend (204)        0 0               1130(s) 

Hmm… what could Vsuspend (204) mean here? Unfortuately not all suspends have names that actually tells what’s going on.
But you can help yourself with this command:

C:\WINDOWS>x_cons DB760 show suspends
SERVERDB: DB760
List of suspend reasons:
========================

total suspends:   10530
DC OvFlow(043) :       2 (   0.02% ) b13get_node: sysbuf_overflow (SVP)   :-(
IOWait(W)(044) :       5 (   0.05% ) b13pfree_pno                         :-(
PagerWaitWritr :      13 (   0.12% ) Pager_Controller::WaitForPagerWritReply
DB FULL  (198) :      13 (   0.12% ) FBM_HandleDBFull
Vsuspend (203) :    1566 (  14.87% ) kb39read_wait
Vsuspend (204) :     886 (   8.41% ) kb39write_wait
JobWait BckRec :      12 (   0.11% ) SrvTasks_BackupServer::WaitForAnyJobFini
JobWait Redo   :       3 (   0.03% ) Rst_RedoManager::RedoLog
Vsuspend (224) :       6 (   0.06% ) kb39wait_for_redo
Vsuspend (227) :    1608 (  15.27% ) RedoReader waits for finished jobs
Vsuspend (228) :       1 (   0.01% ) RedoReader waits for end of all RedoTask
Prep-End (230) :       1 (   0.01% ) bd13GetNode
NoRedoJob(231) :    3434 (  32.61% ) Rst_RedoTrafficControl::ExecuteJobs()
SyncRedo (232) :    2559 (  24.30% ) Rst_RedoTrafficControl::WaitForIoSeq()
LogIOwait(234) :       1 (   0.01% ) Log_Queue::UserTaskEOTReady
SVP-wait (243) :       3 (   0.03% ) Log_Savepoint::StartSavepointAndWait
No-Work  (255) :     417 (   3.96% ) Task is waiting for work

Console command finished (2008-06-11 11:15:31). 

Although it does display which suspends occured the most often, it does not show us how long those suspends lasted – so for performance analysis this view is rather useless.

Anyhow, for each suspend we see a description next to it and for our Vsuspend (204) we find“kb39write_wait”.
“KB” stands for “Kommunikationsbasisschicht” or “Communication Basislayer” in english and refers to db internal functions within the data access manager.
So we can conclude that our task T52 is currently writing out some internal redo files.

C) Perform the recovery,that is redo the transaction that got committed and rollback those that either got rolled back as well in the first place or that did not finish.

In our case it’s just this task working on this:

ID   UKT  Win   TASK       APPL Current         Timeout Region     Wait
          tid   type        pid state          priority cnt try    item

T76    4  0xEDC RedoExe         IO Wait (R)           0 0      2        2697(s) 

To perform the recovery it has to read the redo files and perform the changes on the pages of the database.
If those pages aren’t found in the cache than the task has to read it from the data volume as it just happens here in our example.

D) During all this, pages are changed in the database, so the pager needs to write out those ‘dirty’ pages from time to time via a savepoint.

In the tasklist we find these tasks for that:

ID   UKT  Win   TASK       APPL Current         Timeout Region     Wait
          tid   type        pid state          priority cnt try    item

T11    6  0x81C Pager           Vvectorio             0 0               671(s)
T12    6  0x81C Pager           Vvectorio             0 0               671(s)
T13    6  0x81C Pager           Vvectorio             0 0               671(s)
[...]
T51    4  0xEDC Savepnt         PagerWaitWritr        0 0               1361(s)    

Task 51 triggered the Savepoint and the asynchronous writing of the data pages
is than done with one task per volume (in this case there are just three data volumes).
To optimize the write performance the data is written out in groups of
blocks and thus we see “Vvectorio” for the suspend reason.

Detailed information – knldiag/knlmsg file 

So with x_cons we have quite a good option to see if there is anything happening in the database.
More detailed information on the recovery process itself can be found in the knldiag(knlmsg) file of the database. 

The following is an excerpt from the knldiag file during the recovery.
I did only minor reformatting to ease reading:

2008-06-11
10:54:57      0xD5C     20006 Log      0 queues, flushmode is 'MaximizeSafety', devstate is 'Cleared'
10:54:57      0xD5C     20007 Log      Oldest not saved is ioseq NIL @ off 1
10:54:57      0xD5C     20009 Log      LogVolume not completely written until now
10:54:57      0xD5C     20005 Log      Restart from ioseq 1 @ off 1 to ioseq NIL @ off NIL
10:54:57      0xD5C     20010 Log      Result after checking the log device: 'Ok'
10:54:57      0xD5C         9 Log      The number of active logging-queues has been increased to 1

10:54:57      0xEDC        11 Rst      88 redo transactions readable and 55 redo tasks available.
10:54:57      0xEDC         6 Restart  recovering log from tape from IOSeq '1' until IOSeq: 'NIL'
10:54:57      0xEDC     52024 RESTORE  200 pages <- "sapdb\backup\db760log.001"

10:54:58      0xEDC     20012 Log      redo-read#50000:TR465841(2)[2396]@NIL.1936'InsertRecord':20080611:102906
10:54:59      0xEDC     20012 Log      redo-read#100000:TR465963(37514)[3812]@NIL.3808'InsertRecord':20080611:103101
10:54:59      0xEDC     20012 Log      redo-read#150000:TR465963(87493)[4597]@NIL.5600'InsertRecord':20080611:103103
10:54:59      0xEDC     20012 Log      redo-read#200000:TR465992(36925)[5147]@NIL.5784'DeleteRecord':20080611:103129

10:55:01      0xEDC     20012 Log      redo-read#250000:TR465992(86904)[5591]@NIL.1752'DeleteRecord':20080611:103130
10:55:01      0xEDC     20012 Log      redo-read#300000:TR466018(36398)[6055]@NIL.1680'DeleteRecord':20080611:103155
10:55:01      0xEDC     20012 Log      redo-read#350000:TR466018(86377)[6498]@NIL.5712'DeleteRecord':20080611:103156
10:55:01      0xB4C     19626 IO       Async I/O thread stopped, 'C:\sapdb\backup\db760log.001'
10:55:01      0xEDC     52024 RESTORE  4712 pages <- "apdb\backup\db760log.001"
10:55:01      0xD5C     52012 RESTORE  new tape required 4300
10:55:01      0xD5C         1 Backup   Backupmedium #1 (C:\sapdb\backup\db760log.001)                                        
10:55:01      0xD5C         6 KernelCo  +   Backup error occured, Error code 4300 "new_hostfile_required"
10:55:01      0xD5C     20000 Log       +   no update of LogInfoPage: DeviceState = Cleared, LogIsEmpty = true
10:55:02      0xF68     19625 IO       Async I/O thread started, 'C:\sapdb\backup\db760log.002
10:55:02      0xEDC     52101 RESTORE  Filetype: file
10:55:02      0xEDC     52024 RESTORE  200 pages <- "sapdb\backup\db760log.002"
10:55:02      0xEDC     20012 Log      redo-read#400000:TR466023(36334)[6943]@NIL.1896'DeleteRecord':20080611:103159
10:55:02      0xEDC     20012 Log      redo-read#450000:TR466023(86292)[7387]@NIL.6216'DeleteRecord':20080611:103200
10:55:02      0xEDC     20012 Log      redo-read#500000:TR466027(36247)[7832]@NIL.1752'DeleteRecord':20080611:103204
10:55:03      0xEDC     20012 Log      redo-read#550000:TR466027(86227)[8274]@NIL.5928'DeleteRecord':20080611:103205
10:55:03      0xEDC     20012 Log      redo-read#600000:TR466032(36142)[8718]@NIL.3840'DeleteRecord':20080611:103208
10:55:03      0xEDC     20012 Log      redo-read#650000:TR466032(86121)[9161]@NIL.1320'DeleteRecord':20080611:103210
10:55:03      0xEDC     20012 Log      redo-read#700000:TR466037(36098)[9604]@NIL.5424'DeleteRecord':20080611:103212
10:55:03      0xEDC     20012 Log      redo-read#750000:TR466037(86098)[10047]@NIL.1176'DeleteRecord':20080611:103214
10:55:04      0xEDC     20012 Log      redo-read#800000:TR466038(36096)[10491]@NIL.4992'DeleteRecord':20080611:103216
10:55:04      0xEDC     20012 Log      redo-read#850000:TR466038(86096)[10934]@NIL.744'DeleteRecord':20080611:103217
[...]
10:55:31      0xAD8     19625 IO       Async I/O thread started, 'C:\sapdb\backup\db760log.010
10:55:31      0xEDC     52101 RESTORE  Filetype: file
10:55:31      0xEDC     52024 RESTORE  200 pages <- "sapdb\backup\db760log.010"

10:55:33      0xEDC     20012 Log      redo-read#4350000:TR466074(86047)[45950]@NIL.6008'UpdateRecord':20080611:104003
10:55:33      0xEDC     20012 Log      redo-read#4400000:TR466075(36045)[46791]@NIL.8016'DeleteRecord':20080611:104024
10:55:34      0xEDC     20012 Log      redo-read#4450000:TR466075(86045)[47234]@NIL.3768'DeleteRecord':20080611:104034

10:55:39      0xEDC     20012 Log      last-redo-read empty errlist#4463954:TR466075(99999)[47357]@NIL.7728'DeleteRecord':20080611:105311
10:55:39      0xEDC        11 Log      Transaction 466075 is open after log reading and forced to rollback 0 entries.

10:59:18      0xEDC         5 Rst      RedoControl: End of redo processing
10:59:18      0xEDC         4 Rst       +   RedoControl: OldestCommitSeq: 1, RedoTasks: 55 , MaxRedoTrans: 88
10:59:18      0xEDC         6 Rst       +   RedoControl: EndOfLogReached: true, RedoIsAborted: false, RedoProcessingDisabled: false
10:59:18      0xEDC        12 Rst       +   RedoControl: redo reader T0 wait reason:  isNotWaiting, waiting redo executer:  0
10:59:18      0xEDC         8 Rst       +   RedoControl: Transactions processed: 2582 (2582)

10:59:19 0xEDC 20039 Admin Kernel state: 'REDO LOG' finished

[...]  

 Here we see that redo-entries are read and redofiles are created:

10:55:03      0xEDC     20012 Log      redo-read#700000:TR466037(36098)[9604]@NIL.5424'DeleteRecord':20080611:103212 

At the end of the recovery there will be a summary, here marked with messagetype RST:

10:59:18      0xEDC         5 Rst      RedoControl: End of redo processing
10:59:18      0xEDC         4 Rst       +   RedoControl: OldestCommitSeq: 1, RedoTasks: 55 , MaxRedoTrans: 88
10:59:18      0xEDC         6 Rst       +   RedoControl: EndOfLogReached: true, RedoIsAborted: false, RedoProcessingDisabled: false
10:59:18      0xEDC        12 Rst       +   RedoControl: redo reader T0 wait reason:  isNotWaiting, waiting redo executer:  0
10:59:18      0xEDC         8 Rst       +   RedoControl: Transactions processed: 2582 (2582)

as well as the “we’re done!” message:

2008-06-11 10:59:19 0xEDC 20039 Admin Kernel state: 'REDO LOG' finished 

Further reading

All these information are presented in a much better way in the  “MaxDB Internals” course material.
It’s available for free at http://maxdb.sap.com/training/
Even more information can be found in the MaxDBWikihere in SDN  

 

Best regards,
Lars

To report this post you need to login first.

2 Comments

You must be Logged on to comment or reply to a post.

  1. Nelis Lamprecht
    Hi Lars,

    Thanks for the information, as usual your blogs are very informative and I always take time to read them.

    I thought I’d just add a comment here on a little trick I use to monitor my restores in Linux/Unix platforms running MaxDB. Using the program “watch” which is common on a lot of OS’s I run the following command:

    watch -n 1 ‘grep -B 20 “current write position” knldiag’

    This basically will display all data as it is being written to the knldiag file in real time.

    So I see something along the lines of:

    2008-07-03 16:46:17 31802 11597 IO Open ‘/tmp/nsr_pipe’ successfull, fd: 97
    2008-07-03 16:46:17 31802 11565 startup DEVi started
    2008-07-03 16:46:17 31428 11000 vasynope ‘/tmp/nsr_pipe’ devno 60 T56 succeeded
    2008-07-03 16:46:17 31428 52101 RESTORE Filetype: pipe
    2008-07-03 16:46:17 31428 52024 RESTORE 200 pages <- “/tmp/nsr_pipe”
    2008-07-03 16:46:22 31428 52024 RESTORE 15000 pages <- “/tmp/nsr_pipe”
    2008-07-03 16:46:27 31428 52024 RESTORE 33200 pages <- “/tmp/nsr_pipe”
    2008-07-03 16:46:32 31428 52024 RESTORE 50000 pages <- “/tmp/nsr_pipe”

    Since I already know how many pages there are in total I can get a good idea of how much further it has to go and a rough estimate on how long it will take going by the times of the written pages.

    Anyway, not sure if this is useful but it works for me 🙂

    Keep well.

    Nelis

    (0) 
    1. Lars Breddemann Post author
      Hi Nelis,

      thanks for your kind reply and for your handy way to monitor the knldiag file.
      Just a few comments to that:
      1) with 7.7 onwards the MaxDB Kernel will create an KNLMSG file instead of the knldiag file. This KNLMSG will be in XML format – so direct grepping into the file won’t be easily possible anymore.

      2) as written in the blog the Reading of the log pages does only make up one part of the recovery and most often the question “is it still doing anything?” comes up, when this part is done but the database is still not open.

      best regards,
      Lars

      (0) 

Leave a Reply