Don’t try this at home: Why you should never physically delete workflows
This is a true story – I know because I was in at the death. It is also something of a cautionary tale and a public service announcement. Names and all identifying details have been withheld for obvious reasons.
The original problem:
IT reports our SAP ECC system is looking rather full. We need to make some space… what can we get rid of?
The original solution:
Business and IT get together. How about we purge some of those old workflows? There are millions of them out there and no-one’s sure what they do anyway so surely no-one cares about them anymore.
Ground Zero of the disaster to come:
IT says “no worries”, and then thinks ummmmmm… we know we haven’t got around to setting up archiving yet, but never mind, no one cares about that data so we’ll just physically delete it all. Nooooooooooooooo!!!
Mistake #1 – Having no one who understands the business criticality of your workflows
Business knew they were running workflows but, apart from being somewhat aware that the top 2 or 3 headliners were running, they weren’t quite sure:
- What other workflows were running
- How many workflows were being used for critical business functions
- What would be the impact if the workflows stopped running
- Whether there were any alternatives for processing the same data if the workflow was down for any reason
- What sort of regression testing was needed to make sure workflows were still ok after any changes
Because no-one in the business knew the impact of the workflows, no-one was able to raise informed objections to the workflow deletion plan or put in place an effective regression testing plan to make sure that no damage had been done by the deletion.
The Business did know that occasionally workflows stopped and they did their best to start them up again as soon as possible, but had no idea why they stopped or how to prevent it from happening again. Which brings us to…
Mistake #2 – Having no one who understands the technical administration of workflows
There was no workflow administrator for the system. There was no-one with adequate workflow skills who knew how to check the health of the workflow environment or restart work items that had failed or diagnose why some of the workflows would stop from time to time. So when it came to deleting workflows there was no-one who knew how it should be done.
So when IT management asked their team to:
“Physically delete all workflows that are fully completed and older than the nominated date.”
What actually happened on the ground was … well maybe it’s time for a 10-point pop quiz called Spot-the-Obvious-Error. This is what was executed:
- The program used to delete the workflows was RSWWWIDE (i.e. transaction SWWL).
- The variant used when running the program included the following settings:
  - Work Item Type = * (i.e. any type of work item)
  - Status = * (i.e. any status)
  - Work item creation date > the nominated date
  - Work item end date > the nominated date
  - Delete immediately – Yes
Award yourself 5 points and the Real Workflow Developer badge if you picked up that that was THE WRONG PROGRAM. Even in Development and Test systems the physical deletion program to use is RSWWWIDE_TOPLEVEL (i.e. transaction SWWL_TOPLEVEL) which makes sure that we always delete a parent workflow with all of its child work items so that there are no orphaned lonely little work items stuck sitting in inboxes with nowhere to go.
Award yourself the remaining 5 points if you noticed that they used a greater than symbol instead of a less than symbol. That’s right: instead of deleting the oldest lost-in-the-mists-of-time workflows, they deleted the most current workflows. So a few days later when strange things started to happen with the work items in people’s inboxes and in a bunch of workflow-supported applications ….
OMG Moment #1: WE DELETED THE WRONG WORKFLOWS!!!
At this point they thought it might be a good idea to get some help… maybe even call in some workflow expertise… ah well, better late than never.
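The two slips in the selection criteria are easiest to see side by side. The Python below is purely an illustrative sketch and says nothing about how RSWWWIDE works internally: the `WorkItem` record, the status names and the sample dates are all made up for the example. It contrasts the selection that was requested with the one that was executed, and adds the sort of dry-run guard that would have flagged the reversed comparison before anything was removed.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class WorkItem:
    # Hypothetical record; real work items carry far more fields.
    wi_id: int
    created: date
    ended: Optional[date]  # None while the item is still open
    status: str            # illustrative status names, not SAP's codes


def select_requested(items: List[WorkItem], cutoff: date) -> List[WorkItem]:
    """What management asked for: completed AND ended BEFORE the cutoff."""
    return [w for w in items
            if w.status == "COMPLETED" and w.ended is not None and w.ended < cutoff]


def select_as_executed(items: List[WorkItem], cutoff: date) -> List[WorkItem]:
    """What the variant did: any status, created and ended AFTER the cutoff.

    Assumption for this sketch: items with no end date are not matched.
    """
    return [w for w in items
            if w.created > cutoff and w.ended is not None and w.ended > cutoff]


def check_only_older(selection: List[WorkItem], cutoff: date) -> None:
    """Dry-run guard: reject a selection holding anything on or after the cutoff."""
    too_new = [w.wi_id for w in selection if w.ended is None or w.ended >= cutoff]
    if too_new:
        raise ValueError(f"selection includes recent work items: {too_new}")


cutoff = date(2010, 1, 1)
items = [
    WorkItem(1, date(2005, 1, 10), date(2005, 1, 12), "COMPLETED"),
    WorkItem(2, date(2011, 6, 1), date(2011, 6, 3), "COMPLETED"),
    WorkItem(3, date(2011, 6, 2), date(2011, 6, 4), "CANCELLED"),
    WorkItem(4, date(2011, 6, 5), None, "STARTED"),
]

requested_ids = [w.wi_id for w in select_requested(items, cutoff)]
executed_ids = [w.wi_id for w in select_as_executed(items, cutoff)]
```

In this toy data the two selections do not overlap at all, which is the whole problem: one flipped operator and a wildcard status turn "old and finished" into "recent and anything".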
Mistake #3 – Running background jobs in restored backup systems
By the time the situation was properly assessed, production had been running for several days, so the decision was taken to migrate the deleted workflows from a restored backup copy of the production system. Not easy (there were over 30 tables with complex relationships to migrate) and even with our best guys on the case it was several more days before we ran the migration into a current copy of the production system and started the regression testing.
Strange things started to happen. Work items in inboxes threw bizarre errors. Random work items from one workflow were showing in the log of another completely different workflow. Two leave requests from 2 different employees were listed in the same workflow. The top level workflow of a PM notification workflow appeared half way down the log of an SD Billing workflow. Some workflows had references to work items that didn’t even exist. In other words, a big fat mess.
Once we realised what was happening it didn’t take us long to find the culprit – a batch job had accidentally been run in the restored backup system.
OMG Moment #2: WE ACCIDENTALLY CORRUPTED OUR BACKUP SYSTEM!!!
What had happened was that as soon as the backup system came up, the basis guys had as usual gone into the system and stopped all the background jobs. But the backup was a snapshot of a working production system complete with frequently running batch jobs. The most frequent workflow batch job is the Deadline Monitoring job which by default runs every 3 minutes. If you are lucky and the snapshot was taken just after the job last ran, you have just under 3 minutes to stop it. If you are unlucky, the snapshot could be taken less than 30 seconds before the next job runs. If you are really unlucky – Angry Wheelchair Man Darwin Award Winner unlucky – you might have less than 10 seconds before the next job runs and no time to stop it at all.
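To put rough numbers on that window, here is a back-of-the-envelope sketch. It assumes the snapshot moment falls uniformly at random within the job's cycle, which is my simplification rather than anything measured on the system, and it ignores the time needed for the restored system to come up and for someone to log on. The function name is hypothetical.

```python
def chance_window_shorter_than(seconds: float, interval_seconds: float = 180.0) -> float:
    """Probability that the next scheduled run is due in under `seconds`.

    Assumes the snapshot was taken at a uniformly random point in the
    job's cycle (default: the 3-minute interval mentioned above).
    """
    if interval_seconds <= 0:
        raise ValueError("interval must be positive")
    # Clip to the valid range so the result is always between 0 and 1.
    clipped = max(0.0, min(seconds, interval_seconds))
    return clipped / interval_seconds


under_30 = chance_window_shorter_than(30)  # 30 / 180
under_10 = chance_window_shorter_than(10)  # 10 / 180
```

Under that assumption, about 1 restore in 6 leaves less than 30 seconds to react, and about 1 in 18 leaves less than 10 seconds. Not long odds for something you only find out about afterwards.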
Why did it matter that the batch job was running? Well, to put it simply… batch jobs change things… after all, that’s why we run them. But in a restored backup system you don’t want them making changes to your supposedly pristine as-it-was-on-the-day snapshot of production.
What the Deadline Monitoring job does is check for any exceeded deadlines, and if exceeded triggers the corresponding response in the workflow. This typically involves irreversibly changing the status of existing work items, creating new work items, and creating new relationships between existing parent workflows and new child work items. Because the backup system was several days older than the production system, there were hundreds of deadlines that had been exceeded.
Time for a rethink…. and to try a couple of things… we realised we could exclude the new work items, we might be able to resume workflows where a work item had changed its status, but we still had an insurmountable problem…parent workflows now had relationships to the wrong child work item ids.
- Because the backup was using the same number range as the real production system, and because in the meantime in the real production system many new workflows had been created, many of the work item ids created by the deadline monitoring job had already been assigned to new work items from completely different workflows
- Because the work item number range was (like most number ranges) buffered, in the production system there were gaps in the number range sequence so some of the work item ids had been skipped altogether leaving parent workflows pointing to non-existent child work items
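The two failure modes in the bullets above can be shown with a toy model. This is only a sketch under simplifying assumptions: the ids, the single `last_id_at_snapshot` marker and the `classify_links` helper are all hypothetical, and the real data is spread over 30-odd tables rather than one dictionary. It checks each parent-child link taken from the restored backup against what production actually holds under the same id.

```python
from typing import Dict, List, Tuple


def classify_links(
    backup_links: List[Tuple[int, int]],
    last_id_at_snapshot: int,
    production_parent_of: Dict[int, int],
) -> Dict[int, str]:
    """Classify each (parent_id, child_id) link found in the restored backup.

    production_parent_of maps a work item id in production to its parent.
    """
    verdicts: Dict[int, str] = {}
    for parent_id, child_id in backup_links:
        if child_id <= last_id_at_snapshot:
            # Existed before the snapshot, so both systems agree on it.
            verdicts[child_id] = "ok"
        elif child_id not in production_parent_of:
            # Id skipped in production (buffered number range gap).
            verdicts[child_id] = "dangling"
        elif production_parent_of[child_id] != parent_id:
            # Id reused in production by a completely different workflow.
            verdicts[child_id] = "collision"
        else:
            verdicts[child_id] = "ok"
    return verdicts


last_id_at_snapshot = 1000
# Parent 500 existed at snapshot time with child 900; the deadline job in
# the restored backup then created children 1001 to 1003 for it.
backup_links = [(500, 900), (500, 1001), (500, 1002), (500, 1003)]
# Meanwhile production handed out 1001 and 1003 to other workflows and
# skipped 1002 altogether.
production_parent_of = {900: 500, 1001: 777, 1003: 888}

verdicts = classify_links(backup_links, last_id_at_snapshot, production_parent_of)
```

Here only the pre-snapshot link (900) survives; 1001 and 1003 collide with unrelated workflows and 1002 points at nothing.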
It was as if you were in the middle of a complex game of cards, and someone grabbed the deck, threw half the cards away, replaced the remaining cards with part of a Tarot pack, reshuffled everything and randomly distributed the remaining cards among all the participants by throwing them up in the air over the card table. Not happy!
We needed to restore from a pristine system, where the workflows were exactly as they were before the purge. And you guessed it, by now the 2-week cycle of backups had overwritten all of the other previous backups, so all we had was a saved copy of the current backup taken before the migration but after the deadline job had run – i.e. the already corrupted backup system.
OMG Moment #3: WE OVERWROTE ALL OUR OTHER BACKUP SYSTEMS!!!
We finally came to the depressing assessment that the workflows could not be safely recovered at all. Aaaaarrgh…. D:<
The Biggest Mistake of All – Thinking this can’t happen to you
Sure some mistakes were made, but none of the individual mistakes were any worse than I’ve seen happen on any other site. It was the combination of a couple of mistakes with the physical deletion of workflows that made for the perfect storm.
When you are putting in a new system it’s easy to put archiving on the next phase list and never quite get around to it, leaving the temptation to physically delete old workflows. How would archiving have helped? Archiving is a much more controlled process that ensures only completed or cancelled workflows over a certain age (not an arbitrary date range) are removed. Plus you never fully lose the archived data so if you do accidentally archive too much at least you can still access the details and aren’t left with no details at all of what you’ve lost.
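The rule described above (finished workflows only, and only once they have reached a minimum age) can be written as a simple predicate. This is a sketch of the rule as stated in this blog, not of SAP's archiving implementation; the status names and the `min_age_days` parameter are assumptions for illustration.

```python
from datetime import date, timedelta
from typing import Optional

# Illustrative status names; only finished workflows qualify.
ARCHIVABLE_STATUSES = {"COMPLETED", "CANCELLED"}


def is_archivable(status: str, ended: Optional[date],
                  today: date, min_age_days: int) -> bool:
    """True only for finished workflows that ended at least min_age_days ago."""
    if status not in ARCHIVABLE_STATUSES or ended is None:
        return False
    # Age is measured relative to today, not against a hand-typed date.
    return (today - ended) >= timedelta(days=min_age_days)
```

Because the age is relative to the run date, there is no date comparison for anyone to type the wrong way round, and anything still running or sitting in error is never touched.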
So what were the positives coming out of all of this?
- We have a lot of people in both Business and IT who are now much better educated about their workflows
- Everyone is now convinced that workflow administration needs to be done
- In reviewing all their workflows, we’ve identified a number of workflows that can be reworked or deactivated to reduce the load on their system
- The cause of the unexpectedly stopping workflows turned out to be one easily fixed setting in transaction SWEQADM
- IT is reinvestigating archiving
If you do nothing else as a result of reading this blog….
Go have a quiet word with your system administrators about how they handle frequently running batch jobs when restoring backup systems.
So why do you never physically delete workflows in production?
Because… if you accidentally stuff it up there may be no way back.
BTW – In case you were wondering, logical deletion is fine. Logical deletion is simply another term for Cancellation – no data is lost or removed.
Next time anyone suggests physically deleting workflows from a production system, please show them this blog, and keep repeating the mantra…
In Production systems we ARCHIVE
Latest Update:
BTW – if you want to “do the right thing” and archive workflows… a good place to start is with these SAP Notes… Note 573656 – Collective note relating to Archiving in workflow and Note 1084132 – Archive information structures after upgrade to Release 640
Hi Jocelyn!
wow, what a nightmare...
IMHO, this is a perfect example of why Operations needs to be involved right at the start of any project. How can they possibly have any idea of the complexity of the solution they are supporting if they only see it during an end of project handover as the solution is thrown over the fence into support.
This should be a mandatory reading for anyone running workflow!
Great blog, BTW.
AT
Thanks Alisdair,
Unfortunately this is what happens when the workflows were written years ago and support has changed hands several times without proper handover. Very very sad.
Hi Jocelyn
I felt this sinking feeling in my stomach as I read this. My sympathies to the customer, whoever they are.
I am fortunate to have a good working relationship with the Basis team, and of course, since I have been here since dirt was invented, I have a reasonably good handle on things. But it's always good to be reminded of just how far astray things can go.
I know that Karin Tillotson would be interested in this blog too, as her interest in ILM makes her a perfect advocate for Archiving.
Thanks so much for sharing!
Sue
Hi Sue, Yes it was very very sad... and I suspect something that a lot of Basis teams don't consider... I'm finding with many teams going offshore it becomes even harder to ensure they have sufficient skills and experience. Even more so when the business is not fully aware of what really needs to be supported and what proper support should look like.
Thanks for thinking of Karin... yes definitely a reason for archiving!
Cheers,
Jocelyn
Now I cannot sleep tonight. Thank you for the nightmares.
Thanks Uwe... we had plenty of sleepless nights sorting it all out... posted the blog in the hope that I can save a few other customers from the same waking nightmare.
Thanks Jocelyn, I'll provide a link to here the next time (usually once a week) someone on SDN advises someone else to use SWWL to remove workitems from a user's inbox.
Important point about the batch jobs, I've seen it happen too.
regards
Rick
Thanks Rick - yes... love our SCN but the quality of the advice can vary... always great getting ideas but need to pass them through the commonsense check with the extra fine toothcomb filter turned on for production environments.. link away!
This was a great read, especially as we are merging two system together and there is big question mark on how to handle the workflows. After migrating the tables (completed workflows only) from one system to another we also had many issues - it turned out that it was due to the workflow having different versions in the source and target system. After synchronizing the versions it looks better but still not perfect.
Hi Richard, That is a tricky one... worth blogging about yourself perhaps? There's bound to be others out there looking to consolidate systems and wondering how to go about it.
Rgds,
Jocelyn
Hi Jocelyn,
thanks for the article. I'm thinking of working on a project next year to archive our workflows. can you point to, or give me any advice on how to proceed?
Thanks.
Hi Chris, Glad you liked it.
Nothing dramatic in the archiving of workflows ... just the usual set up in transaction SARA and using object WORKITEM. You may need to think about what happens to any attachments ... as these are often objects in their own right and are likely to require a separate approach.
The one thing I would watch out for is how many dead workflows you have in ERROR status that no-one has resolved. Workflows have to be completed or cancelled to be archived so you may need to logically delete them en masse first. The SAP_WAPI function modules are good for creating programs to do this, or you can use transaction SWIA but if it's thousands doing it manually may be too tedious.
BTW the sign of good workflow administration for me... hardly anything sitting in ERROR status. Try transaction SWF_GMP to see how healthy/unhealthy your workflow environment is.
Hope that helps.
Hi Jocelyn,
You can now logically delete multiple workflows quite easily in SWIA, a welcome improvement! I agree, there should not be a need for this in a well-run system.
Thanks for mentioning SWF_GMP, I didn't know about that one.
regards
Rick
Yes thanks Rick - I too know and love the option to logically delete en masse in SWIA.
Transaction SWF_GMP is a great one! Traffic light overview of whether your workflow environment is looking ok including: are the batch jobs running; and counts of pending and error work items.
BTW If it times out just trying to run the transaction assume that system is in a *really* bad way. Has only happened to me once and it was a shocker.
See this too often these days with untrained and not very knowledgeable "BASIS" personnel making bad assumptions about being able to remove data. Data Archiving is a very poorly understood concept with many customers.
I agree with you Jeff. IMO archiving impact needs to be thought through first by the affected business personnel... and only then should IT be involved in technical setup and execution. Archiving or purging without considering the business impact on not just running operations but also audit (and sometimes legal) tracking requirements is unwise to say the least.
Thanks for sharing all these with us. Cheers!!!
Thanks Yi Qing New - glad you like it. Good lessons!
Hi Jocelyn,
WUG brought me here...anyway, did i get this right? it wasn't the SWWL but how and when it was executed. was the expectation that there would be an undo somewhere for it? it won't save anybody any space, but if a user doesn't wish to receive any items in their inbox any more, you can simply change the agent on the task and all those workitems get rerouted to whichever new agent is assigned to it from then on and the past ones as well. it may not work for all scenarios like only trying to delete workitems up to certain date, but it does save one from the impossible task of recreating the deleted items.
Hi Jocelyn,
I learnt lot from this blog..Thanks you so much
Regards,
Syam
Thanks Syam, glad it was helpful! That's why I write them. 🙂
Thanks for sharing this valuable information, I learned so many points in this.
Quite interesting to know the performance of background jobs in workflows.
--
Murali Krishna
Thanks Murali - yes that little conundrum of the scheduled-every-minute batch job hadn't occurred to me either until I saw it in action.. very nasty!
HI Jocelyn,
Thanks the lot for giving Awareness and Knowledge.
Regards,
Ragav
HI Jocelyn,
Thanks for do's N dont's..
Quite informative and helpful facts...good leanings for us.
BR
Ansumesh
Great article!!! "Archive is the answer not delete"....lesson to learn by reading than from experience
Loved reading it! Cheers!!
Awesome blog Jocelyn, Bang on target. Very Informative.
Thanks for posting excellent stuffs..
Regards,
Naveen
Quite informative and helpful . learnt lot from this article ..
Hi Jocelyn,
Very nice and helpful article.
Thanks for sharing the knowledge.
Regards,
Siddhant
Hi Jocelyn,
Thank you for the article.
Nice and helpful.
Regards,
Yuksel AKCINAR
Hi Jocelyn,
Thanks for the great article. I was searching on how to delete the workflow logs and I was about to use SWWL (just in sandbox), but after doing more search it lead me to this blog. Even though it's just in Sandbox , I would bet it would have been night mare.. Though this document is OLD, it's always GOLD
Thank you,
Justin.
Jocelyn,
I was just contemplating deleting some single-step-background-task-nobody-cares-about-em workflows from Production (we have 7 years' worth).
It felt wrong, so I decided to go back and read your cautionary blog again.
I really don't think that *I* would make the mistakes you mentioned (run the wrong program, set the wrong selection criteria) but you're absolutely right: the risk to the Production system is simply too great.
So I'm off to fire up SWW_SARA, and start archiving...
thanks again
Paul
Always better safe than sorry... ! If it's that @#$!@#$@ IDOC task I'd also be tempted to make an exception... but even so the slightest slip is fatal in a production environment.
Great blog, thank you. Lots of helpful information you have provided here. Instant fan.
You can never be too careful when deleting data from Production. That's why we have "data archiving" process in place.
This should be a cautionary tale for those who decide to use SWWL in prd.
The whole Workflow environment is quite unstable. Bad thing can happen even if you follow protocol. When I started to read your post I was about 99% sure that you will not be able to recover that and I was right.
I was using a home made process of restarting workflows with a table update to attempt to restart the workflow. However the standard SAP logic will delete all workflows if in the process one of the workflows gets a shortdump. So I still do it, because in our system there are no shortdumps in production, however I know of colleagues who tried the same method and lost workflows.
My point is that people should not mess around with the workflow system - even if they know what they are doing, because it is not 100% stable.
Interesting comment! I have worked with SAP workflow for over 10 years, at multiple client sites, and have found it to be solid as a rock.
Perhaps you could start a discussion in the SAP Workflow forum on the problems you experienced with your 'home made process' - we're all keen to hear.
cheers
Paul
This is the workaround. Workflow deadline error
Because this is a workaround and not standard, definitely not an SAP error.
It does work if the workflow has no serious problem what causes shortdump.
But if a shortdump happens all workflows are deleted.
See reply from colleague.
"Table SWWWIDH - Workflow Runtime: Deadline Monitoring of Work Items
Select all entries where Field STATUS is '02' - Error.
Set field to '01' - Active for all entries and save."
Now I understand why you think SAP workflow is unstable. It isn't. You are the one causing the problems. I think you missed the part in the SAP training that said you should never update SAP standard tables. How can SAP possibly cater for random updates by users?
Well, SAP doesnt restart these workflows on its own.
Actually my workaround does restart the workflows provided no short dump happen.
So my workaround is actually usefull.
And I mentioned that the failure in not an SAP error.
But my workaround is there because of missing SAP functionality.
Hi Jozsef, Appreciate what you are trying to do but have to agree with Rick & Rob that the way you are doing it is causing bigger problems than it is solving. My suggestion would be to look at events and the workflow APIs to build a less risky solution. It's not wrong that workflow doesn't restart these workflows - there's a problem with the design of the flowchart that needs to be resolved.
There are even workflow APIs that allow you to adjust the deadlines, and APIs that allow you to safely restart the workflows.
There's a bigger risk here of setting a precedent for others that could invalidate their support agreement. Changing tables directly risks introducing inconsistencies that destroy stability.
Workflow is one of the most widely used, mature, and well behaved solutions in the suite. Like all solutions it is important to respect the design of the solution & work with it not against it. Which is the contention of my blog - the problem is not that workflows were deleted but that they were deleted in production - an environment where data deletion should never be performed on auditable data.
I'm glad you are on the forum as it is good to share challenges & solutions - to test them against others & hopefully avoid costly mistakes by getting others to review. You might also want to raise your challenge under the Customer Connection initiative to challenge SAP to fill gaps
Wishing you all the best with your endeavours & hoping these discussions have given you some food for thought
Jocelyn
Hi Jocelyn,
I agree that I shouldn't have posted the blog entry. 🙂
I cannot edit it any more otherwise I would add the comment to pls check the consistency of your workflows before you try and the consequences of the shortdump.
On the other hand we have to solve the problems what happens and for example this solution spares like €80K labor cost for the client each time the failure happens and like 200 complicated workflows are stopped and must be retraced and recreated manually.
Doesn't happen often but it happens.
I am aware of the risks involved. 🙂
Cheers,
Jozsef
Hi Jozsef, Suggest you add a comment on your own blog and perhaps a link back to this discussion.
Rgds,
Jocelyn
Hi Jozsef
If you want your blog removed and can't do it yourself, go to your blog and press alert moderator and ask for the Global Mods to remove it for you.
Regards
Colleen
Like Paul, I find your comments interesting, but unfortunately a bit vague.
For example, you write: "When I started to read your post I was about 99% sure that you will not be able to recover that and I was right."
-> I'm not sure what your point is? If I use a tool that physically deletes stuff, I don't expect to be able to recover it. That's what SWWL does. How does this make things "unstable"?
As to the rest of your post, what 'standard SAP logic deletes workflows'? I know of no standard SAP logic that does this (apart from SWWL which is designed to do exactly that).
I would also encourage you to post your problems in the forum. I would be very interested indeed if this is a real SAP issue.
Regards,
Mike
"I was using a home made process of restarting workflows with a table update"
That's a big mistake right there,if you mean a SAP standard table.
From my experience only workflow development used to be unstable some twelve years ago. Workflow processing not.
You're approaching business process modeling wrong when you have to restart a workflow with a table update.
If ever the need arises to restart a specific Business Process using SAP workflow then of course you to do that by triggering an event which is both the starting event and the terminating event of a workflow, see SAP's standard approach to Purchase Order approval with the SIGNIFICANTLYCHANGED event for more info on the subject.
Cheers, Rob.
Jocelyn Dart
This Blog with all its Black Comedy totally takes the cake (after 6 frustrating hours with BUILD; and 3 painful hours with Web IDE)
Also, I have noticed a trend of late; the comments section is longer than the Blog itself!
Regards
Arijit
Jocelyn,
We intend to delete the entries from the SWFREVTLOG table using the program RSWFEVTLOGDEL (T Code RSWELOGD). We are doing this activity as part of ARCHIVING.
Would like to know if there is any impact due to this activity anywhere within the System.
What are the dependencies that these table entries are linked to?
Kindly opine.
K.Kiran.