The ABAP Detective Copies A Client
In most logic cases, I am the detective, not the client. This time I was the accidental client of an enterprise that needed help. Not the mandant kind of client, but the kind that uses someone’s software.
It started on a clear Tuesday afternoon. I tried to pitch a mitzvah to a registered charity, only I hit the out-of-order sign. The one that says, hey, we’ll be right back after a little housekeeping. I was a little puzzled because most places like to do their tidy-ups on a weekend, when typical office workers or shoppers are sleeping or swapped out.
When I considered the outage planning, I figured this site, which I’m not naming for privacy reasons but which might be found with a little optimized searching, did a lot of business on weekends when religious meetings mainly occur around these parts. Collecting and tallying would be impacted then, maybe less so in the middle of the week.
Tuesday was the 27th, though I don’t have details on when the application was unavailable. The optimistic message that things would occur “throughout the day” indicated probable downtime under 24 hours, or even 12.
Because I’m a detective, I clicked the link to show the outage status, learning that a platform migration of some type was being executed. It appeared to be a database change, though whether it was a point patch, a major step up (or more than one step), or switching the database type from one vendor to another was uncertain. I am certain the “back end” was not SAP HANA based on the terminology in the notes (later I learned this was a move from on-premises to the cloud).
When the estimated service restart time came and went, I refreshed the status page, realizing the team making the planned changes was in trouble. If I had to guess, everything went well in the development and testing layers, but the production change attempt had surprises. If you’ve been in that seat, you know the feeling.
9:09 AM · Sep 28, 2022
there was not the right level of confidence so one of the backup plans was implemented
posted: [Sep 28, 2022 – 00:55 PDT]
More than one plan? Belts and suspenders! In the dim past, I worked for government agencies which sometimes, but not always, had a way of expression that allowed for as few of the facts to be shared as possible. Like a politician, promise to do everything but don’t take a stand. Sharing details like, oops, we didn’t know this application had an interface to a third party that also needs to be converted. A “level of confidence” hints at test failures. Having multiple recovery plans is all to the good.
11:53 AM · Sep 28, 2022
our team began a process to manually upload each database table individually
From an outsider perspective, this sounded quite labor-intensive and potentially error-prone. Without more inside information, I wouldn’t know if this was a better option than others. I didn’t see any indication that the “go back to square one” option of abandoning the planned system upgrade was considered. I know that can happen, but it can then throw a monkey wrench into related changes in other systems.
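If I were checking a table-by-table load myself, I’d want a count and a fingerprint for each table on both sides before trusting it. Here is a minimal sketch of that idea, assuming the rows can be read into memory as tuples. The table names, sample rows, and the `table_fingerprint` and `compare_tables` helpers are all my inventions for illustration, not anything the team reported using.

```python
import hashlib


def table_fingerprint(rows):
    """Return (row_count, digest) for an iterable of row tuples, ignoring row order."""
    row_digests = sorted(
        hashlib.sha256(repr(tuple(row)).encode("utf-8")).hexdigest() for row in rows
    )
    combined = hashlib.sha256("".join(row_digests).encode("utf-8")).hexdigest()
    return len(row_digests), combined


def compare_tables(source, target):
    """Map each source table name to 'ok', 'missing', or 'mismatch'."""
    report = {}
    for name, rows in source.items():
        if name not in target:
            report[name] = "missing"
        elif table_fingerprint(rows) == table_fingerprint(target[name]):
            report[name] = "ok"
        else:
            report[name] = "mismatch"
    return report


# Hypothetical sample data: 'donors' arrived in a different order,
# 'gifts' arrived empty, and 'funds' was never uploaded.
source = {
    "donors": [(1, "A"), (2, "B")],
    "gifts": [(10, 1, 25.0)],
    "funds": [(7, "General")],
}
target = {
    "donors": [(2, "B"), (1, "A")],
    "gifts": [],
}

report = compare_tables(source, target)
for table, status in report.items():
    print(f"{table}: {status}")
```

Sorting the per-row digests makes the comparison indifferent to load order, which matters when tables are uploaded by hand. A real check would more likely run counts and checksums inside each database rather than pulling rows out.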
At this point, the service interruption was close to 36 hours, instead of an ideal 4 or maybe 8. With small teams, there is a big chance some key staff had not slept, depending on their inter-team discipline and handoff capabilities.
7:25 PM · Sep 28, 2022
optimizations which must be completed before we can restore public access
This sounded promising, as the message related to improving performance, not a pass/fail indication that the upgrade was problematic. Were I involved in a team that needed to review the actions and results later, one of my first tasks would be to document a timeline, with objective observations (e.g., “export completed and returned code X at time Y”).
I figured SQL optimizations, as those are often needed in re-platforming, but this could also include load balancing and network or memory capacity sizing.
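Back to that timeline for the review: the point is to record what happened and when, not who to blame. A rough sketch of how I might keep one, assuming each observation is a timestamp, a step, and a result. The `Event` class, the helper names, and every entry in the sample are made up for illustration; they are not the actual events of this outage.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Event:
    when: datetime
    step: str
    result: str


def format_timeline(events):
    """Return one line per event, in time order, with hours elapsed since the first."""
    ordered = sorted(events, key=lambda e: e.when)
    start = ordered[0].when
    lines = []
    for e in ordered:
        hours = (e.when - start).total_seconds() / 3600
        lines.append(f"{e.when:%Y-%m-%d %H:%M} (+{hours:5.1f}h) {e.step}: {e.result}")
    return lines


# Invented observations, deliberately entered out of order.
events = [
    Event(datetime(2030, 1, 1, 8, 0), "export", "completed, return code 0"),
    Event(datetime(2030, 1, 1, 6, 0), "site", "maintenance page enabled"),
    Event(datetime(2030, 1, 1, 12, 30), "import", "failed, return code 8"),
]

lines = format_timeline(events)
for line in lines:
    print(line)
```

Sorting on the timestamp means notes can be jotted down in whatever order people remember them, and the elapsed column makes it easy to see where the hours went.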
Was there user acceptance testing? With a diversified customer base, perhaps including anonymous donors, I think there were some key or power users that might have been given early looks, though I don’t have evidence of this. Monitoring the trouble queue is certainly one way to check for data discrepancies or peculiarities in an offhand way.
9:23 AM · Sep 29, 2022
Why do they call the after-report a “post mortem” when the subject should definitely be alive then?
Just a grammar thing. Pushing office culture to use more peaceful, neutral and clear terms is challenging.
What happened next?
After about 4 times the expected outage time, my client screen came back on the browser and subsequent actions were nominal, as the astronauts say.
But indications are that there were ongoing issues even after the public-facing site came back online. Batch (recurring) transactions may have been omitted, but hopefully not duplicated. That’s an impact that can’t always be tested (as happened here).
I’m glad I wasn’t on the team that promised a short outage and failed to deliver; as a customer, er, client, I’m happy to have seen the ongoing details of actions behind the scenes and to have had the ability to pause my transactions. Were I more pressed to complete my requests, I might not be so calm.
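One way to look for both problems after the fact is to reconcile what the schedule says should have run against what actually posted. A minimal sketch, assuming each recurring transaction can be identified by a donor and a due date. The `reconcile` helper, the key format, and the sample data are hypothetical, since I have no view into how this site stores its schedules.

```python
from collections import Counter


def reconcile(expected_keys, posted_keys):
    """Return (missing, duplicated) keys, each as a sorted list."""
    posted = Counter(posted_keys)
    missing = sorted(k for k in set(expected_keys) if posted[k] == 0)
    duplicated = sorted(k for k, n in posted.items() if n > 1)
    return missing, duplicated


# Hypothetical schedule: three recurring gifts due on the same day.
expected = [
    ("D1", "2030-01-01"),
    ("D2", "2030-01-01"),
    ("D3", "2030-01-01"),
]
# Hypothetical postings: D2 never ran, D3 ran twice.
posted = [
    ("D1", "2030-01-01"),
    ("D3", "2030-01-01"),
    ("D3", "2030-01-01"),
]

missing, duplicated = reconcile(expected, posted)
print("missing:", missing)
print("duplicated:", duplicated)
```

Omitted runs can be replayed once they are known, while duplicates need refunds, which is why I’d rather see the first list populated than the second.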