Now step back and start thinking!

lbreddemann · ‎03-08-2009

The default reaction

When working in support you get to see a lot of bugs, mistakes and errors.

That's just how things go, of course, and there's not much to whine about.

Anyhow, there is one thing happening over and over again that makes me wonder why people are doing it.
It's the 'do-again-if-first-try-failed' approach to address anything that does not work as expected.
The program aborts?
Run it again and see if it does it again.
The database won't get online?
Type in 'startup' again and see if this fixes the problem.

This is very much what we do in a conversation when we have the impression that the other person did not get the message: we simply say it again. When coping with IT issues, however, it's usually a nonsense action.
Such behavior is just reacting to an unexpected situation - sometimes you get the impression that the decision to repeat the action that just failed did not even pass through the cerebral cortex, and that the fingers typed in the commands autonomously.

So why is it so bad to just react to the error message that way?
Hasn't a restart always fixed any problem with that damn old Windows 3.11?

It's bad because the changes caused by the retry (just starting a database, an SAP instance, etc., already changes things!) could make things a lot worse.
One of the simpler examples of this is log files that are written in a cyclic way. Restarting often enough is the surest way to overwrite all the important error messages that would enable an analysis later on.
I cannot tell you how often there was no chance to figure out what really happened because of this.
Another example is the "Duplicate Key" error when re-running an ABAP transport.
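
To make the log file example a bit more concrete, here is a minimal sketch of what "save the evidence first" could look like before anything gets restarted. It is only an illustration under assumptions: the function name preserve_logs and the directory names are made up for this example, and where your database or SAP instance actually writes its logs and traces depends on your installation.

```python
import shutil
import time
from pathlib import Path


def preserve_logs(log_dir, backup_root):
    """Copy every regular file in log_dir into a new timestamped folder.

    Hypothetical helper: log_dir and backup_root are whatever paths
    apply to your system; nothing here is specific to a product.
    """
    source = Path(log_dir)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = Path(backup_root) / f"logs-{stamp}"

    # Never reuse a folder from an earlier call, so that a second
    # snapshot cannot overwrite the first one.
    suffix = 0
    while target.exists():
        suffix += 1
        target = Path(backup_root) / f"logs-{stamp}-{suffix}"
    target.mkdir(parents=True)

    copied = []
    for entry in sorted(source.iterdir()):
        if entry.is_file():
            # copy2 also keeps the modification time of the original,
            # which matters when you later reconstruct a timeline.
            shutil.copy2(entry, target / entry.name)
            copied.append(entry.name)
    return target, copied
```

Calling something like preserve_logs("/path/to/work", "/path/to/backup") before the second 'startup' costs a few seconds and keeps the first error messages around, no matter what the retry does to the original files.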

Solution in sight?

Clearly I cannot provide a general solution to this.
It appears to me to be a kind of default reaction, so it won't be easily changed by a blog, a training course or a fancy IT operating standard.
Nevertheless I think it may be a good idea to appeal to everybody reading this.

The next time you run a command and it does not do what you expected, take your hands from the keyboard and the mouse.
Sit back for just a short moment.

Ask yourself:
What just happened?
Why did the command not work as I expected?
What is the difference between what I wanted and what I got?

If you have difficulty answering these questions, it may be a sign that your ideas about the system behavior are not clear enough.
Only with a clear set of expectations will you be able to figure out that a system is misbehaving.
This is true even if your expectation of the system behavior turns out to be wrong.
Note: Usually an error message is not a sign of misbehavior but a pointer from the developers to the cause of the problem.

It may also be the case that you do not have enough information about the current system state.
Checking the error logs and trace files may help with that.
Re-checking the documentation for the meaning of the error message you see is also usually a very good idea.

Looking back on quite a number of situations that could be called an 'IT disaster', I can really say that regardless of how bad the initial problem was, the really bad stuff (the kind that neither a developer nor Mr. Wolfe could help you with) was caused within the first couple of hours, sometimes within minutes, when the person at the keyboard just reacted.
Calling vendor support then is a bit like taking the dead horse to the vet (OK, not exactly).
But honestly - how many of you would try to fix your brand new BMW X5 yourself when it honks as you try to start it?

OK, enough ranting for today.
Maybe this little blog post gave you a thought that might save your system one day.
I hope it does and wish you the best.