I can repair your corrupt database!

lbreddemann · ‎03-21-2007

Years in database support, endless experience and access to a whole bunch of internal notes and guidelines has enabled me (as well as all DB Supporters in SAP) to repair corrupt databases. This obscure technique is of course only available to the initiated and should be used with care. But when applied it works just like a wand: puff – the database is OK again.

Now I’m going to share this secret with you. The method is called: Restore and Recovery.

Yes, I know what you might think right now: “Restore and Recovery? That’s no repairing at all and the once wrong data will be wrong again!”

To reply to this, we’ve to think about, what a corruption really is. It may well be that the database handles its own data wrong and produces corruptions. Yes, that’s possible but really seldom. In those cases there will usually be a bug fix or a repair program to fix this. And in some cases the DB-Developers will have to find a custom solution.

But these are really, really seldom cases.

Corruptions WILL appear

The majority of corruptions reported by the database are caused OUTSIDE the database. Now, what do I mean by that?

The database writes out data to the disk/files usually by calling some OS-function. The OS then hands over the request to the device driver, this in turn accesses the controller. The controller should manage the physical disks in which the read/write heads are moved to the correct positions. Basically it’s a rather long chain of systems calling other subsystems functionality. Once the DB got an “OK” from the OS back for it’s I/O call, than the DB is out of the game. From there the DB software has no influence on what happens to the data.

As any of these systems may (and will at some time) fail, unwanted changes to the data appear – that is: corruptions will be detected by the DB. As you easily can imagine, the more data blocks there are and the more I/O there is done the higher is the probability that data is changed by accident.

This is nothing that we can prevent by any means from happening.

Since we don’t want to loose data the actions taken usually include making copies of the data. In e.g. RAID systems redundancy is used to handle lost data if a disk fails. Unfortunately the error detection of storage systems is rather chunky compared to the checks done by the page-I/O interface of the DB. That’s the reason why in most cases “all lights are green” on the storage system, while the DB already reports corruptions.

What way out?

What do we know until now? Corruptions do happen and cannot be avoided. They most often “happen” outside the database and are usually only discovered by the database.

So how is Restore and Recovery related to this?

Once a corruption has been detected by the database software there are usually two different situations:

The corruption appeared on some data that is necessary for each and every transaction in the system. Due to the corruption there is now a production down situation. We need a quick solution.
The corruption appeared on some barely used data – no business impact given here. Resolving the issue is important, but not so time related and may be analyzed “in-depth”

What may sound funny is: the action plan for both scenarios is the same.

As we can be pretty sure, that a Restore and Recovery will remove the corruption, it may not be the fastest option to get the database back online. Therefore corruption handling always includes identifying what database objects are broken. If these are secondary information (like indexes or status tables which may be regenerated by the application) then it may be quicker to avoid the recovery.

But again: in most cases Restore and Recovery is the only option to get the correct data back.

Luckily such an opportunity is just what Backups are taken for.

Sadly, taking Backups seem to be trained heavily, Restore and Recovery much less.

From support point of view the real problems start here: Restore and Recovery had NOT been tested and trained for the production system. Nobody at customer side knows how to handle the external backup system. At worst it never had been used for a Restore at all.

And even if the procedure to Restore and Recover the database is clear and works as it should, there are far too many customers that realize in that very moment: all my backups already include the corruption. It was just no problem until now, since the corrupt blocks haven’t been touched for a while.

This is the moment where finger pointing starts andthe manual repair is asked for. But: finding out whom to blame – if possible at all, and that’s rarely the case – won’t help a bit here. No vendor of any of the systems that have touched the now corrupt data can make them right again – they don’t know how the correct block has locked like, how should they?

So keeping a corruption free backup and being able to recover it is the key to solve corruption issues.

Its DBAs responsibility

How can one be sure that the backup is not corrupt? Since it’s unlikely that a corruption simply disappears again, making a backup and run a database structure validation just after it already is a valid proof. Even more secure would be to actually perform the Restore and Recovery on a second system and do all the database consistency checks on the recovered instance.

Basically that’s the core job of the DBA. Being able to Restore and Recover the database. When awakened middle in the night, it’s nice to know how to tune statements but it’s CRUCIAL to be able to Restore/Recover the database blindfolded.

The mantra for each and every DBA therefore should be: I do make backups, I do restore them, I do check them, I do make backups …

Practical Advice

In the more than 350 corruption related messages that had been handled up to date this year in SAP DB-Support, most often these notes have been handed to the customer. As the notes are very instructive and provide a sound knowledge of how to handle corruptions it’s a clear advice that they should be read BEFORE any problem appears:

Oracle:

#540463 - FAQ: Consistency Checks + Block Corruptions
#365481 - Block corruptions
#23345 - Consistency check of ORACLE database

MaxDB:

#940420 - FAQ: Database structure check (VERIFY)
#26837 - MaxDB: Data corruption
#846890 - FAQ: MaxDB Administration
#727099 - Using database structure checks (Verify)

Of course there are many other notes targeting special types of corruptions, but knowing these for your database system will spare you a lot of stress.