The Perfect Storm, Part 1
Maybe you, too, have fond memories of a board game called Clue. If you’re not familiar with it, each player chooses a token, perhaps Professor Plum, Miss Scarlet, or Colonel Mustard, to represent himself or herself while moving around the game board, seeking clues to solve a murder mystery. Eventually, a player declares something like, “It was Professor Plum in the library with the lead pipe,” and if the guesser is correct, the game is won.
My sisters and I loved that game, and these days I think of it often. Unfortunately, the mystery I have been working on is not so easily solved. What has caused the poor performance of one of our systems in the ERP landscape? The suspects are many: the upgrade back in 2008, which solved some performance issues but seemed to result in new ones; any of the myriad hot fixes and custom patches installed since then; the new database design that came with the upgrade, which resulted in a seemingly uncontrolled proliferation of tables; the Web Dynpro user interface on our SAP portal; the custom integration middleware, coded by a third party, whose source code we did not have; something in the network connections between the connected SAP systems, the app server, and the database server; or some peculiarity of our use case, our security role design, or our global 24/7 security support organization.
After that 2008 upgrade, system performance gradually deteriorated to the point that, last October, the system essentially stopped running, apparently because of database growth. We discovered, much to our dismay, that the solution had no archiving functionality. The vendor’s product engineers gave us some guidance on deleting thousands of tables that were never needed in the first place, and eventually we cobbled together our own way to archive records and wrote stored procedures to access them.
We limped along for months, working daily with a crew from the vendor’s technical support group, never really satisfied with the performance despite the increasing time and resources spent on the system’s care and feeding. The straw that broke the camel’s back came in April: our discovery that the solution had no way to recover smoothly and keep running after a server failover in a high-availability SAP landscape. After a week of downtime, we made the best of a bad situation: we took down the problem application, moved the remaining users over to a development system, and took down the production system completely.
Our plan was to back up the servers, uninstall everything (all the apps, hot fixes, and patches), create a new database, and reinstall everything but the problem application. Fortunately, we have a manual workaround for that application, and we will survive without it. The rebuilt production system is back up and being tested. The next step is to upgrade to a newer release that seems to offer improved performance.
So if I’ve seemed a bit preoccupied or out of touch recently, this is what I’ve been dealing with for the past six months. Somewhere along the way, my system configuration and compliance job devolved into a hybrid of intensive care unit nurse and air traffic controller, coordinating the efforts of a large and no doubt weary team: the vendor’s technical support staff and product engineers, my server support folks, my DBA, my in-house developer, my SAP security teammates, and my key users.
I wish I could say that we have identified the root cause of all this frustration and inconvenience to our users, but I am afraid that we may never know. We have no plans to reinstall the application that seemed to be most problematic, and with a clean database and a new release for the rest of it, we are hopeful that we are on the road to recovery. The unsolved mystery may nag at me for a long time, but moving on may be our best option. Options for a longer-term resolution are still under consideration, but I certainly hope that we end up with a simpler architecture, with fewer moving parts at risk of failure.