The bottom line of everything I wrote (including support notes, SDN blogs and replies in customer messages) has always been:
“The DBA is the one single responsible person to protect the company from data loss – by setting up a backup strategy that actually works.”
Although this is still what I consider the core responsibility of any DBA, I have changed my mind a bit. Corruptions of data do occur. Every day, everywhere.
More recent file systems like ZFS include mechanisms to recognize such corruption as early as possible (most often this means at the next user access). There are also check programs available that scan the disks for failures. So file systems finally provide the same corruption detection facilities that DBMSs have offered for years.
Corruptions and mistakes are common – they will happen.
But has this made the situation around corruptions any better? Not for me.
The core problem is not that corruptions can occur or that they may go unnoticed for a long time. The very core of the problem is that running a production system without a backup that protects from data loss is not only possible: it is the default setup, and it usually takes great effort to change this.
If you think about it, this is not what one would call a user-friendly, safe design. In fact, it looks more like no design at all.
Although all DBMS vendors claim to deliver business data infrastructure, what is on the market is more a ‘raw-byte-storing engine’ than a business-process data container.
The whole chain of acquiring data, streamlining it to the business processes, evaluating it and finally archiving it for later use (e.g. regulatory validations) is totally automated today. But not the care taken to protect it from loss.
What should be changed? Think like a shepherd!
What would be the alternative to the current state of affairs? What would be the best thing to do if you realize that you can no longer say for sure that your data is protected, even though that is a fundamental prerequisite for your work? Well, to me the right answer would be: stop changing the data and stop adding new data.
Think of a shepherd who herds 100 sheep with his collie. Now the collie gets ill and cannot help the shepherd for some days. The shepherd would probably leave the sheep in the barn until the collie gets well again. I doubt that the shepherd would go and accept even more sheep to take care of while his best friend (the dog, that is) is still ill. He simply could not guarantee their safety.
The DBA is considered the shepherd for the data. Yet although his collie (the backup, this time) is not able to help him, more and more data is pumped into the databases he is responsible for. That does not sound too clever to me.
It would be much better if the database itself simply stopped working once a certain amount of data has been changed since the last successful backup verification. It is not enough to check when the backup was taken – it is crucial that the backup has been tested successfully for recoverability.
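Such a guard could look like the following sketch. Everything here is hypothetical – the function name, the thresholds and their values are illustration, not a feature of any existing DBMS:

```python
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical thresholds -- neither is a real DBMS parameter.
MAX_CHANGES_SINCE_VERIFIED_BACKUP = 100_000  # changed rows tolerated
MAX_VERIFIED_BACKUP_AGE = timedelta(days=1)  # how old a verification may be

def writes_allowed(changes_since_verified_backup: int,
                   last_verified_backup: Optional[datetime],
                   now: datetime) -> bool:
    """Allow writes only while a recent, successfully *verified* backup exists.

    Mirrors the 'log full' idea: once recoverability can no longer be
    guaranteed, the database refuses further changes.
    """
    if last_verified_backup is None:
        return False  # no backup ever verified: refuse all changes
    if now - last_verified_backup > MAX_VERIFIED_BACKUP_AGE:
        return False  # last verification is too old
    return changes_since_verified_backup <= MAX_CHANGES_SINCE_VERIFIED_BACKUP
```

The key point of the sketch is that the clock starts at the last *verified* backup, not at the last backup run.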
Now you perhaps say “Wait a minute! Stop the database because a backup was not verified? Business won’t accept that – the database needs to be available, no matter what.”
“Business” is typically not concerned with the inner workings of the data infrastructure it uses – it just has to work. As soon as no guaranteed-working backup is available, the infrastructure effectively no longer works, but this fact is hidden from the users who rely on the recoverability of their information systems.
Therefore, it is fair to consider this a bug in the DBMS software: a critical error (insufficient backups) is not handled correctly. It is ignored, if it is recognized at all. So it would be the DBMS vendors’ turn to come up with a solution for this.
With SAP’s CCMS, system admins can at least get a warning – if they choose to. Nevertheless, they still have to check for it, and that is not enough. If it is possible to run a productive database without good backups, then this is a design bug – not a user error.
The database needs to stop when it cannot be guaranteed that all data can be recovered. Period. In fact, a similar mechanism is already in place in your database – it is called the “archiver stuck” or “log full” situation, where the database halts because transaction logs can no longer be written away. Unfortunately, the log backups are of no use if the last data backup is not “good” …
In times when companies run hundreds or thousands of database instances, this point is even more important. There are so many databases around that are not even recognized as databases (e.g. the SAP content server database or liveCache). And yes, there are cases where customers lost data because of that. Sometimes it was just the work of the last week, sometimes it was the work of the last year.
Finger pointing – does it help?
Who is to blame here? Is it the admin? Of course, it was his responsibility to take care of the backup, so he has to get some “feedback” on this.
But in most cases I have seen, the DBA did take care of the backup – once, when he set it up in the scheduler. After that, he thought: “It’s running automatically, so my job is done.”
This is a false sense of security, no doubt about that.
But at the same time, the task of ensuring that the backup strategy works is unnecessarily hard. There is no automatic process that performs a backup, restores the backup, checks the restored database for errors and reports the outcome to the DBA in an easy-to-use manner.
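The orchestration itself would not even be hard. The sketch below wires the three steps together; the step callables stand in for vendor-specific backup, restore and consistency-check tools, which are assumptions here, not references to any real product:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class RecoverabilityReport:
    """Outcome of one backup-restore-verify cycle, reported to the DBA."""
    backup_ok: bool
    restore_ok: bool
    check_ok: bool

    @property
    def recoverable(self) -> bool:
        # Only a fully successful cycle proves recoverability.
        return self.backup_ok and self.restore_ok and self.check_ok

def verify_backup_strategy(run_backup: Callable[[], bool],
                           restore_to_scratch: Callable[[], bool],
                           check_restored_db: Callable[[], bool]
                           ) -> RecoverabilityReport:
    """Back up, restore to a scratch system, then check the restored copy.

    Each step runs only if the previous one succeeded, so a failed
    backup can never produce a misleading 'check ok'.
    """
    backup_ok = run_backup()
    restore_ok = restore_to_scratch() if backup_ok else False
    check_ok = check_restored_db() if restore_ok else False
    return RecoverabilityReport(backup_ok, restore_ok, check_ok)
```

In practice each callable would shell out to the vendor’s tools (backup utility, restore onto a sandbox host, consistency checker). The point of the design is that the DBA acts on the final `recoverable` flag – not on the mere existence of a backup file.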
Ask for improvement!
So what can you do now, as a user of the DBMS? As long as the state of affairs remains as it is: check your backups, check your restores. Check everything – twice, three times! Once your data is gone because your restore did not work for some reason, nobody will take responsibility.
Not SAP, not the DBMS vendors, not the hardware vendors.
However, to change this, ask your DBMS vendor for a solution to this issue. Or SAP. Or the storage system vendor. Or even better: ask these companies to sit down together and come up with something that connects these components.