This is a true story – I know because I was in at the death. It is also something of a cautionary tale and a public service announcement. Names and all identifying details have been withheld for obvious reasons.
The original problem:
IT reports our SAP ECC system is looking rather full. We need to make some space… what can we get rid of?
The original solution:
Business and IT get together. How about we purge some of those old workflows? There are millions of them out there and no-one’s sure what they do anyway so surely no-one cares about them anymore.
Ground Zero of the disaster to come:
IT says “no worries”, and then thinks ummmmmm… we know we haven’t got around to setting up archiving yet, but never mind, no one cares about that data so we’ll just physically delete it all. Nooooooooooooooo!!!
Mistake #1 – Having no one who understands the business criticality of your workflows
Business knew they were running workflows but, apart from being somewhat aware that the top 2 or 3 headliners were running, they weren’t quite sure:
- What other workflows were running
- How many workflows were being used for critical business functions
- What would be the impact if the workflows stopped running
- Whether there were any alternatives for processing the same data if the workflow was down for any reason
- What sort of regression testing was needed to make sure workflows were still ok after any changes
Because no-one in the business knew the impact of the workflows, no-one was able to raise informed objections to the workflow deletion plan or put in place an effective regression testing plan to make sure that no damage had been done by the deletion.
The Business did know that occasionally workflows stopped and they did their best to start them up again as soon as possible, but had no idea why they stopped or how to prevent it from happening again. Which brings us to…
Mistake #2 – Having no one who understands the technical administration of your workflows
There was no workflow administrator for the system. There was no-one with adequate workflow skills who knew how to check the health of the workflow environment or restart work items that had failed or diagnose why some of the workflows would stop from time to time. So when it came to deleting workflows there was no-one who knew how it should be done.
So when IT management asked their team to:
“Physically delete all workflows that are fully completed and older than the nominated date.”
What actually happened on the ground was … well maybe it’s time for a 10-point pop quiz called Spot-the-Obvious-Error. This is what was executed:
- The program used to delete the workflows was RSWWWIDE (i.e. transaction SWWL).
- The variant used when running the program included the following settings:
  - Work Item Type = * (i.e. any type of work item)
  - Status = * (i.e. any status)
  - Work item creation date > the nominated date
  - Work item end date > the nominated date
  - Delete immediately – Yes
Award yourself 5 points and the Real Workflow Developer badge if you picked up that that was THE WRONG PROGRAM. Even in Development and Test systems the physical deletion program to use is RSWWWIDE_TOPLEVEL (i.e. transaction SWWL_TOPLEVEL) which makes sure that we always delete a parent workflow with all of its child work items so that there are no orphaned lonely little work items stuck sitting in inboxes with nowhere to go.
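To see why deleting item by item leaves orphans behind, here is a minimal Python sketch. It is an illustration only and says nothing about how RSWWWIDE or RSWWWIDE_TOPLEVEL are actually implemented: the `WorkItem` model and the functions `delete_single_item`, `delete_top_level` and `orphans` are hypothetical names made up for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class WorkItem:
    item_id: int
    parent_id: Optional[int]  # None marks a top-level workflow (assumed model)


def delete_single_item(items: List[WorkItem], item_id: int) -> List[WorkItem]:
    """Remove exactly one item and nothing else (the risky style)."""
    return [it for it in items if it.item_id != item_id]


def delete_top_level(items: List[WorkItem], top_id: int) -> List[WorkItem]:
    """Remove a top-level workflow together with all of its descendants."""
    doomed = {top_id}
    grew = True
    while grew:  # keep sweeping until no further descendants are found
        grew = False
        for it in items:
            if it.parent_id in doomed and it.item_id not in doomed:
                doomed.add(it.item_id)
                grew = True
    return [it for it in items if it.item_id not in doomed]


def orphans(items: List[WorkItem]) -> List[int]:
    """IDs of items whose parent no longer exists."""
    ids = {it.item_id for it in items}
    return [it.item_id for it in items
            if it.parent_id is not None and it.parent_id not in ids]


# Hypothetical data: workflow 1 has child 2, which has child 3; 4 is unrelated.
items = [WorkItem(1, None), WorkItem(2, 1), WorkItem(3, 2), WorkItem(4, None)]
print(orphans(delete_single_item(items, 1)))  # item 2 is left without a parent
print(orphans(delete_top_level(items, 1)))    # nothing is left dangling
```

In this toy model, deleting only the parent strands its child, while deleting from the top level down takes the whole family with it.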
Award yourself the remaining 5 points if you noticed that they used a greater-than symbol instead of a less-than symbol. That’s right: instead of deleting the oldest lost-in-the-mists-of-time workflows, they deleted the most current workflows. So a few days later, when strange things started to happen with the work items in people’s inboxes and in a bunch of workflow-supported applications…
OMG Moment #1: WE DELETED THE WRONG WORKFLOWS!!!
At this point they thought it might be a good idea to get some help… maybe even call in some workflow expertise… ah well, better late than never.
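For anyone who wants the quiz answer spelled out, the difference between what was requested and what was executed can be sketched in a few lines of Python. This is a toy model, not SAP code: the `Item` tuple, the status labels, the dates and the two selection functions are all assumptions invented for illustration.

```python
from datetime import date
from typing import NamedTuple


class Item(NamedTuple):
    name: str
    status: str      # hypothetical status labels, not real SAP status codes
    created: date
    ended: date


def intended_selection(item: Item, cutoff: date) -> bool:
    """What management asked for: fully completed AND older than the cutoff."""
    return item.status == "COMPLETED" and item.ended < cutoff


def executed_selection(item: Item, cutoff: date) -> bool:
    """What was run: any status, created and ended AFTER the cutoff."""
    return item.created > cutoff and item.ended > cutoff


cutoff = date(2010, 1, 1)  # hypothetical nominated date
items = [
    Item("old", "COMPLETED", date(2008, 3, 1), date(2008, 3, 5)),
    Item("recent", "COMPLETED", date(2010, 6, 1), date(2010, 6, 3)),
    Item("recent-error", "ERROR", date(2010, 6, 10), date(2010, 6, 11)),
]

# Dry run: list what each selection would hit before anything is deleted.
print([i.name for i in items if intended_selection(i, cutoff)])
print([i.name for i in items if executed_selection(i, cutoff)])
```

The two selections do not overlap at all in this example. A dry run that lists the selected items and their date range before anything is deleted is one cheap way to catch an inverted comparison like this.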
Mistake #3 – Running background jobs in restored backup systems
By the time the situation was properly assessed, production had been running for several days, so the decision was taken to migrate the deleted workflows from a restored backup copy of the production system. Not easy (there were over 30 tables with complex relationships to migrate) and even with our best guys on the case it was several more days before we ran the migration into a current copy of the production system and started the regression testing.
Strange things started to happen. Work items in inboxes threw bizarre errors. Random work items from one workflow were showing in the log of another completely different workflow. Two leave requests from 2 different employees were listed in the same workflow. The top level workflow of a PM notification workflow appeared halfway down the log of an SD Billing workflow. Some workflows had references to work items that didn’t even exist. In other words, a big fat mess.
Once we realised what was happening it didn’t take us long to find the culprit – a batch job had accidentally been run in the restored backup system.
OMG Moment #2: WE ACCIDENTALLY CORRUPTED OUR BACKUP SYSTEM!!!
What had happened was that as soon as the backup system came up, the basis guys had as usual gone into the system and stopped all the background jobs. But the backup was a snapshot of a working production system complete with frequently running batch jobs. The most frequent workflow batch job is the Deadline Monitoring job which by default runs every 3 minutes. If you are lucky and the snapshot was taken just after the job last ran, you have just under 3 minutes to stop it. If you are unlucky, the snapshot could be taken less than 30 seconds before the next job runs. If you are really unlucky – Angry Wheelchair Man Darwin Award Winner unlucky – you might have less than 10 seconds before the next job runs and no time to stop it at all.
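To put rough numbers on that luck, here is a back-of-the-envelope Python sketch. It assumes the snapshot moment falls uniformly at random within one 3-minute job cycle, which is my simplification and not something measured on the system; the function name `chance_job_runs_first` is made up for the example.

```python
def chance_job_runs_first(interval_s: float, reaction_s: float) -> float:
    """Probability that the next run is due before you can react,
    assuming the snapshot moment is uniform over one job cycle."""
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    # Clamp the reaction time into [0, interval] so the result stays in [0, 1].
    return min(max(reaction_s, 0.0), interval_s) / interval_s


# 180 seconds = the 3-minute default interval mentioned above.
for reaction in (10, 30, 60):
    print(reaction, round(chance_job_runs_first(180, reaction), 3))
```

Under that assumption, a team that needs 10 seconds to stop the job loses the race about 1 time in 18, at 30 seconds it is 1 in 6, and at a full minute it is 1 in 3. Those are not odds to rely on.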
Why did it matter that the batch job was running? Well, to put it simply… batch jobs change things… after all, that’s why we run them. But in a restored backup system you don’t want them making changes to your supposedly pristine as-it-was-on-the-day snapshot of production.
What the Deadline Monitoring job does is check for any exceeded deadlines and, if it finds any, trigger the corresponding response in the workflow. This typically involves irreversibly changing the status of existing work items, creating new work items, and creating new relationships between existing parent workflows and new child work items. Because the backup system was several days older than the production system, there were hundreds of deadlines that had been exceeded.
Time for a rethink…. and to try a couple of things… we realised we could exclude the new work items, we might be able to resume workflows where a work item had changed its status, but we still had an insurmountable problem…parent workflows now had relationships to the wrong child work item ids.
- Because the backup was using the same number range as the real production system, and because in the meantime in the real production system many new workflows had been created, many of the work item ids created by the deadline monitoring job had already been assigned to new work items from completely different workflows
- Because the work item number range was (like most number ranges) buffered, in the production system there were gaps in the number range sequence so some of the work item ids had been skipped altogether leaving parent workflows pointing to non-existent child work items
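Both effects can be reproduced in a small Python sketch. It is a toy model under stated assumptions: the ids, the `classify_links` function and the dictionaries are invented for illustration and do not reflect the real workflow tables.

```python
from typing import Dict, List, Tuple


def classify_links(
    backup_links: Dict[int, List[int]],
    production_owner: Dict[int, str],
    last_id_at_snapshot: int,
) -> Dict[Tuple[int, int], str]:
    """Judge each parent -> child link copied from the restored backup.

    Ids up to last_id_at_snapshot existed when the snapshot was taken.
    Higher ids were handed out separately in backup and in production.
    """
    report = {}
    for parent, children in backup_links.items():
        for child in children:
            if child <= last_id_at_snapshot:
                verdict = "existed at snapshot"
            elif child in production_owner:
                # Same id, different work item: production reused the number.
                verdict = "collides with " + production_owner[child]
            else:
                # Production skipped this id (buffered number range).
                verdict = "points at nothing"
            report[(parent, child)] = verdict
    return report


# Hypothetical ids: in the backup, the deadline job gave parent 900 the new
# children 1001 and 1002; production used 1001 elsewhere and skipped 1002.
backup_links = {900: [950, 1001, 1002]}
production_owner = {1001: "a different leave request"}
report = classify_links(backup_links, production_owner, 1000)
for link, verdict in report.items():
    print(link, verdict)
```

Only the link that already existed at snapshot time survives the move intact. The other two either hijack somebody else’s work item or point into thin air, which is what we were seeing in the logs.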
It was as if you were in the middle of a complex game of cards, and someone grabbed the deck, threw half the cards away, replaced the remaining cards with part of a Tarot pack, reshuffled everything and randomly distributed the remaining cards among all the participants by throwing them up in the air over the card table. Not happy!
We needed to restore from a pristine system, where the workflows were exactly as they were before the purge. And you guessed it, by now the 2-week cycle of backups had overwritten all of the other previous backups, so all we had was a saved copy of the current backup taken before the migration but after the deadline job had run – i.e. the already corrupted backup system.
OMG Moment #3: WE OVERWROTE ALL OUR OTHER BACKUP SYSTEMS!!!
We finally came to the depressing assessment that the workflows could not be safely recovered at all. Aaaaarrgh….
The Biggest Mistake of All – Thinking this can’t happen to you
Sure some mistakes were made, but none of the individual mistakes were any worse than I’ve seen happen on any other site. It was the combination of a couple of mistakes with the physical deletion of workflows that made for the perfect storm.
When you are putting in a new system it’s easy to put archiving on the next phase list and never quite get around to it, leaving the temptation to physically delete old workflows. How would archiving have helped? Archiving is a much more controlled process that ensures only completed or cancelled workflows over a certain age (not an arbitrary date range) are removed. Plus you never fully lose the archived data so if you do accidentally archive too much at least you can still access the details and aren’t left with no details at all of what you’ve lost.
So what were the positives coming out of all of this?
- We have a lot of people in both Business and IT who are now much better educated about their workflows
- Everyone is now convinced that workflow administration needs to be done
- In reviewing all their workflows, we’ve identified a number of workflows that can be reworked or deactivated to reduce the load on their system
- The cause of the unexpectedly stopping workflows turned out to be one easily fixed setting in transaction SWEQADM
- IT is reinvestigating archiving
If you do nothing else as a result of reading this blog….
Go have a quiet word with your system administrators about how they handle frequently running batch jobs when restoring backup systems.
So why do you never physically delete workflows in production?
Because… if you accidentally stuff it up there may be no way back.
BTW – In case you were wondering, logical deletion is fine. Logical deletion is simply another term for Cancellation – no data is lost or removed.
Next time anyone suggests physically deleting workflows from a production system, please show them this blog, and keep repeating the mantra…
In Production systems we ARCHIVE
BTW – if you want to “do the right thing” and archive workflows… a good place to start is with these SAP Notes… Note 573656 – Collective note relating to Archiving in workflow and Note 1084132 – Archive information structures after upgrade to Release 640