Chronology of a major incident and a sweet revenge
Saturday 6 pm:
Thirty hours of blood, sweat and tears are just over.
Taking a deep breath now.
One last email.
Subject: Finally end of work
all mails are cleaned up from SOSG
N. has probably received 363000 emails; unfortunately I can’t help him delete them from his inbox in Lotus Notes.
This was just 1/10th of what was expected to be sent to him based on the number of workitems, which we deleted as well.
It is not yet clear why so many emails were created for 62832 Idocs with syntax errors; hoping to find the root cause on Monday.
Are you curious now? Want to know what happened?
Murphy’s Law hit me with full force.
December 23rd 2014 around 8 pm:
Waiting for my Chimichanga and sipping a Mojito when suddenly the Blackberry turned into a disco light because of incoming mails. By the time I was ready to pay I already had 3600 mails for failed Idocs from the purchase order migration.
Thursday November 19th, 11 pm:
A very busy year is coming to its end, and so is one of our biggest SAP system mergers, where I am team lead of Master Data.
Vendors, materials and classification are already loaded, customers will follow next week, and I tried to put some finishing touches on material classification while preparing the last slides for the Fit/Gap workshop with Japan, which is scheduled for Friday morning 7:30 am CET. Have to get out of bed early, 8 hours time difference to Tokyo.
My test load to add an additional material class with just one characteristic to 175000 materials had just ended, all had posted well, but I found 860 Idocs with syntax errors. An Idoc segment was skipped in the conversion process of LSMW but the Idoc itself was created. Actually the entire Idoc creation should have been skipped. I did not want those Idocs with syntax errors in the production system and made a tiny little change in my LSMW.
Friday 6 am: Get up, start the PC, connect via VPN to the company (had to do this before brushing my teeth; last week I needed 40 minutes to log in and wanted to make sure that I am connected before my meeting starts). Start brewing coffee, go into the bathroom, come back for breakfast, wonder why I have only coffee for one and a half cups instead of 3.
Friday 7:30 am:
Fit/Gap analysis for the next system merger, 4 companies in Japan. Kon’nichiwa. Sayōnara.
Friday 9:45 – 10:45 am: Driving to work
Friday 11:00 am:
Some office gossip: “Believe it or not, again no authority to display a vendor master in the Japanese system. I guess we will have the same experience even in the 100th project.” 😡 Still no Coke in the machine 😥 . 7 km Stop ‘n’ Go again this morning 😈
Finally exporting the LSMW project from the test system and importing it to the production system. Reading the source file, executing the conversion.
Friday 12 pm noon:
Lunch break. Quick. Starting the Idoc creation, uuh, is this system slow today, one can see each number in the counter. Okay, let’s go to the canteen. Ctrl+L
Friday 12:32 pm:
Back from the canteen. Log in again. What’s that, a pop-up with an error message, the IDoc creation was cancelled.
E0070 EDI: Table passed to EDI_SEGMENTS_ADD_BLOCK is empty. (self explanatory)
Checking BD87, no CLFMAS Idoc in status 64 there. Maybe the Idoc creation was cancelled because of transports that got imported during lunch time, our usual time for emergency imports during day time.
Friday 12:40 pm:
Start LSMW Idoc creation again. Pop-up. Same error message. Process cancelled. 😯
Friday 12:50 pm:
Chat with a Basis guy.
Me: “Can it be that the lock table in the production system is smaller than in our test system?”
Basis guy: “No – it allows actually 6 times more records than in the test system”
Me: “I am confused. Did the same yesterday in the test system and it worked without an issue. Thanks anyway”
Friday 1 pm:
Chat with a team mate.
I am sending a screenshot. “Do you know this message?”
He: “Yes, quite familiar to me, what are you doing?”
Me: “Classification load”
He: “I am doing classification right now, can you wait for 2 hours? Maybe our activities clashed.”
Me: “Okay I can wait, but let us check if we could have a conflict”
So we checked his object and his setup of Idoc processing for LSMW and found out that it did not follow the desired approach. But we did not talk any further about my error message. I then tried the conversion for just 1 material classification… and got a different error message: the system was trying to read beyond the end of the records. This message was known to me, but not well known, as I had seen it only 2 or 3 times in the last couple of years, but I remembered that an entry had to be deleted from a file. After having done this I converted a single record again and displayed the converted data. Then I saw that this Idoc had too few segments.
I remembered my change from last night. Checked the coding and found that I had mixed up <> and = in an IF condition, which reversed what I wanted and removed a segment from those Idocs which had been good in the test. Shame on me. 😳
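For readers who have never been bitten by this, here is a minimal sketch of how a swapped comparison operator inverts a filter. It is plain Python, not my actual LSMW coding (which is ABAP), and the empty-value check and all names are assumptions made up for illustration.

```python
def build_segments(values, skip_when):
    """Return the values that become IDoc segments; skip_when decides what is dropped."""
    return [v for v in values if not skip_when(v)]

# Hypothetical characteristic values; the empty string stands for a bad record.
values = ["RED", "", "BLUE"]

# Intended rule: skip a record only when its value is empty (the "=" case).
intended = build_segments(values, lambda v: v == "")

# Mixed-up rule: the operator is swapped (the "<>" case), so the good records are skipped.
swapped = build_segments(values, lambda v: v != "")

print(intended)
print(swapped)
```

With the swapped operator exactly the records that were fine get dropped, which matches what I saw: Idocs that were good in the test suddenly had a segment missing.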
Friday 2:15 pm:
H: “You already know what happens in the production system?” Screenshot of BD87 showing 62832 Idocs in status 60 Syntax error. “N. got already 60000 emails.”
Me: “Not really, I am still analyzing, worst case he gets about 340000.” (this was an estimate, about 170 thousand expected classification updates, and 2 times executed)
Friday 2:22 pm:
Email arrived with copy to system coordinator. “N. has already 67000 emails. SBWP and Lotus Notes get a load and performance test”
Friday 2:25 – 4:00 pm:
Chat and then telephone call with the Basis guy. I explained the situation. He checked the system and saw that there were several processes active with my user. I said that I had nothing scheduled, that what I did online had failed, that I saw a fluctuating number of up to 50 locks in the lock table for my user, and that I had even logged off from the system in the meantime. So he used his Basis power to kill all active processes with my user. Then we went into the SOSG transaction and deleted all waiting unprocessed emails to N.
Friday 5:15 – 8:50 pm:
Driving home, even more traffic jams than in the morning, driving my daughter to Karate training, having dinner, starting my DVD player: “The company you keep”
Friday 9:50 pm:
I hear the sound of Keith Emerson’s Inferno. Actually the sound for colleagues on my smart phone. (I will not tell you the sound reserved for the boss)
A: Hi Jürgen, Sorry to call you late Friday night but we have trouble, can’t get Idocs to the TM system nor any emails with certificates to our customers.
Me: Wait a second or 40 minutes till I am logged in. I will check, we had an incident today but we cleared everything.
Saturday 0:22 am:
Mail to A.
Subject: Fighting against windmills
It looks like deleting the mails from the queue is not helping. I feel that this job is re-creating the emails until they get the status that they have been sent.
I deleted certainly more than 100000 till now, in buckets of 1000, when those 1000 are gone and I refresh SOSG then I have again 1000 with a more recent time. I can’t even find an active job with our function user xxxxx while the emails are still increasing
And I have no authority for WE11 to delete the faulty Idocs, just in case that their existence is causing this trouble
Saturday 1:03 am:
Forwarding the mail to the Basis guy. “The mess is back and I am out of ideas. I have 62832 Idocs with syntax errors but N. has already got 127440 mails. I have already deleted more than 100000, can’t find an active job, don’t know why the mails are being sent several times.”
Saturday 2:00 am:
Giving up with deleting, going to bed.
Saturday 6:45 am:
Getting up again. PC, bathroom, PC. Starting the coffee machine. Just one new mail per second in SOSG. Almost happy 😛 .
Start deleting 5000 mails that are waiting to be processed. Refresh SOSG, uuh 8 mails per second now. Deleting, deleting, deleting
Saturday 9:39 am:
Calling Help Desk. Voice mail. Leaving my cell phone number.
Bake some rolls while waiting to be called back.
Reach for the coffee pot, feels light today, no coffee 😯 😥 😡, machine broken. Instant coffee; the water was just hot when the telephone rang. An American from the international help desk, pretty tired; it took 5 minutes until he had written down my name. He informed me that it is Saturday and nobody is working. Okay, I knew this already, even in Germany we know what Saturday means. He promised to call the voice mail in Frankfurt. The coffee water was cold again.
Saturday 10:08 am:
I called our system coordinator and explained the situation. He tried to find the people on duty and found a few, but none from SAP Basis, and he promised to call me back when he found one.
I had just put the strawberry jam on my first roll when the telephone rang again. Frankfurt was calling. No idea where this Frankfurt was, it sounded like Iran or Afghanistan. I listened to his broken German and think he wanted to explain to me that nobody is there on Saturday. Finally I interrupted him and asked if he personally could help me. After he said “No” I told him not to steal my time, as I was already in contact with somebody and waiting to be called back.
Saturday 10:19 am
The system coordinator called back, and gave me the number of a Basis girl who was on duty yesterday.
Saturday 10:21 am
You can see from the time that calls between real men are short and to the point. Now calling the girl. We checked the system together, searched for the invisible job, deleted mails, and brainstormed about how to stop the ongoing creation of these mails and what could actually be causing it. Even though we did not believe that it was the root cause, we decided to delete the Idocs with the syntax error. Easier said than done, our production system is well secured. So we went on searching SCN for options to switch the Idoc status from 60 to 68. Of course we found a discussion telling how to do it (RC1_IDOC_SET_STATUS). A quick test in our test system, then into production and changing the status of the 62832 Idocs. The mails were still coming. We guessed that deleting the workitems might help. She wanted to call another colleague, so we hung up.
Saturday 11:30 am
The water was hot again and I could just get my first instant coffee and my first bite into the roll when the telephone rang again.
It was H., who had informed me yesterday about the email flooding. I told him to call N., who should delete his workitems from the business workplace.
Saturday 11:50 am
H. called me again: “The workplace does not open. It dumps because of too much content. Must be a lot more than 62832 workitems. He has to leave for his son’s soccer game but will take the company smartphone with him, even though it is Saturday.”
Saturday 11:56 am
Call from the Basis girl. Her colleague found the job which is creating emails from workflow items. The external scheduler starts it every 30 minutes. They killed the job. Then they removed the task ID TS00008074 from the variant, to be safe that the next start does not care about the workitems again.
But now we have to get rid of the workitems, somehow, quick. Searching SCN. Found “How to delete the 7 lakhs workitem from user’s inbox in workflow” and the answer of Jocelyn Dart.
From there jumping to her blog “Don’t try this at home: Why you should never physically delete workflows”.
Saying to our Basis girl “Jocelyn Dart, I trust her advice, let us try the logical deletion”. Of course first in the test system. But then quickly in production.
I executed SWIA for the workitems with a range of 1 hour. Friday noon to 1 pm: 780000 workitems. The Basis girl tried it for all of Friday and the system dumped.
We finally found that the workitem creation had stopped already Friday afternoon when the other Basis guy had killed all the processes with my user.
This invisible job ran for 4 hours and created about 3.2 million workitems.
Saturday 12:32 pm
Finally breakfast. We ran SWIA with 3 people in parallel for workitem buckets of 15 minutes, each of which held between 150000 and 200000 workitems.
Saturday 1:45 pm
15 minutes till the flower shops close. Have to get some flowers for my mom, who celebrates her birthday on Sunday.
Saturday 6:00 pm
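The bucket approach is easy to plan up front. The following is a minimal sketch in plain Python, not something we actually ran; the window from Friday noon to 4 pm, the year in the dates and the round-robin split across people are my assumptions for illustration.

```python
from datetime import datetime, timedelta

def time_buckets(start, end, minutes):
    """Split the interval [start, end) into consecutive windows of the given length."""
    step = timedelta(minutes=minutes)
    buckets = []
    cursor = start
    while cursor < end:
        buckets.append((cursor, min(cursor + step, end)))
        cursor += step
    return buckets

# Assumed selection window: Friday noon to 4 pm (the year is an assumption).
start = datetime(2015, 11, 20, 12, 0)
end = datetime(2015, 11, 20, 16, 0)
buckets = time_buckets(start, end, 15)

# Deal the windows round-robin to three people working in parallel.
people = 3
assignments = {p: buckets[p::people] for p in range(people)}

for p, windows in assignments.items():
    first = windows[0][0].strftime("%H:%M")
    print(f"person {p}: {len(windows)} windows, first one starts at {first}")
```

Each window would then be entered by hand as the creation time range on the SWIA selection screen; the sketch only produces the list of ranges so that nobody selects the same bucket twice.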
🙂 issue resolved
To round off the story: N. was the person migrating purchase orders last year, and my name was entered as agent in the WE20 partner profile.
To avoid getting mails for dinner again, I had made him the agent for post processing.
Details on this setup in general can be found in OSS note 116610 – IDoc: Notifications from IDoc processing
In the end not a bad decision, as I was still able to act during this time.
The job (program RSWUWFML2) which creates emails from workitems is a very old job in our system; nobody knows anymore who originally scheduled it. Some years ago, when the external job scheduler was implemented, this job got migrated to the new scheduler. After some days of talking with colleagues, the conclusion is that this job is still wanted and needed for our MDM ALE process. We may need to exclude the migration guys from the emails; we know anyway if something failed and do not need extra reminders by email.
The SWIA transaction is not well designed: you only realize that the execution has finished when the hourglass disappears. No message, no change in the screen, no option to schedule in background. It was a bit boring to stare at a screen for about 30 minutes just to catch the moment when the cursor changes its appearance.
The Idocs are created from the converted data in LSMW, which actually executes the program /SAPDMC/SAP_LSMW_IDOCS_CREATE
And if something unusual happens, then it is probably IDOC_ERROR_WORKFLOW_START which creates the workitems.
While this incident taught us that our system is able to create 3.2 million workitems in about 4 hours and can send about 172000 emails in 30 hours, we still do not know why our SAP system created 3.2 million workitems for 62000 Idocs with syntax errors.
So if somebody has a hint, we would really appreciate it if you let us know.
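For anyone who wants to chew on the numbers, here is a small back-of-the-envelope calculation in Python. It only uses the figures quoted in this post; the variable names are made up, and the rates are averages over the whole period, not measurements.

```python
# Figures as quoted in the post.
workitems = 3_200_000
idocs = 62_832
creation_hours = 4
emails = 172_000
email_hours = 30

# Average ratios derived from those figures.
workitems_per_idoc = workitems / idocs
workitems_per_second = workitems / (creation_hours * 3600)
emails_per_second = emails / (email_hours * 3600)

print(f"{workitems_per_idoc:.1f} workitems per Idoc")
print(f"{workitems_per_second:.0f} workitems created per second")
print(f"{emails_per_second:.2f} emails sent per second")
```

That is roughly 51 workitems for every faulty Idoc, which is the part we cannot explain.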