What does uptime mean to you?
In the small startup that a few of us set up back in 2002, we ran Linux on more than 50% of the desktops. My desktop held the record for the longest uptime: 173 days. Even then, it only had to be shut down because of a planned power-system maintenance during which we ran out of both mains power and UPS. I bragged about my system to innumerable friends. Now, what is this fixation with a system that doesn’t shut down? As an individual, my PC carried out a few not-so-critical tasks during the night: checking out the latest copy of a few tools from CVS and building them, and running a nightly build of a big piece of software. No big deal.
But imagine the smile on your face when you log in first thing in the morning and all of this is already done and ready. Once in a while, things took a turn for the worse, and the most catastrophic case was having to press Ctrl+Alt+Backspace to kill the X server after a GUI update.
There are many aspects of system unavailability that just do not sit well with our natural reactions. I lost power in my apartment last week when a water leak (which turned into a fountain that landed on a power socket) shorted my mains. Now take this to the next level with enterprise users, especially customers whose trucks queue up because the SD processing needed to let them deliver goods per plan cannot be carried out. This is a scary picture, right? It means lost business, inefficiency and annoyance.
What is the right answer to this problem? Is it the cloud? Many of you may have seen the AWS outage from the not-so-distant past: http://aws.amazon.com/message/67457/ . So the real answer lies in our ability to think engineering!
Each of the layers that run our application has to be designed for resilience if we really want to avoid an outage for our end customer:
- Application development layer: basically, the software we have built (and the bugs it may inherently carry)
- Application server(s)
- DB server(s)
- Infrastructure layer: network/power/environment/disaster recovery mechanism
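To see why every layer matters, here is a rough back-of-the-envelope sketch in Python. It assumes the layers fail independently and that the application is up only when all of them are up, so the overall availability is the product of the per-layer figures. The numbers and names below are made up for illustration and are not measurements from any real landscape.

```python
from math import prod

# Hypothetical per-layer availability (fraction of time the layer is up).
# These figures are illustrative assumptions, not measured values.
LAYER_AVAILABILITY = {
    "application": 0.999,
    "app_server": 0.9995,
    "db_server": 0.9995,
    "infrastructure": 0.9999,
}


def serial_availability(layers):
    """Availability of a stack that needs every layer up (independent failures assumed)."""
    return prod(layers.values())


def redundant_availability(single, copies):
    """Availability of `copies` identical, independent instances where one is enough."""
    return 1.0 - (1.0 - single) ** copies


def downtime_hours_per_year(availability):
    """Expected downtime in hours over a 365-day year."""
    return (1.0 - availability) * 365 * 24


if __name__ == "__main__":
    overall = serial_availability(LAYER_AVAILABILITY)
    print(f"overall availability: {overall:.5f}")
    print(f"expected downtime: {downtime_hours_per_year(overall):.1f} hours/year")
```

The stack as a whole is always less available than its weakest layer, which is why making only one layer resilient (for example, only clustering the DB server) does not by itself protect the end customer.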
Recently, a customer reported a problem with the transport move from quality to production to prepare the live instance: the process would take a few hours, which was more downtime than they could afford. This got me checking the possibilities we have at hand to resolve it.
We merged the TRs to minimize the number of objects (TR create, sort and compress) and followed the instructions from SAP Notes 1223360 (Performance optimization during import) and 1069417 (Gen and syntax check of programs). This may be applicable for you as well.
Meanwhile, there are several other techniques that we could consider:
One is to create a clone of the productive system, start operations on it, and, while this is happening, collect delta logs on the “old” system that can then be migrated into the new environment.
Another, even in the cloud, is to provision a fail-safe mechanism at a geographically distributed site to avoid outages.
Apart from these techniques, customers also plan a staggered go-live into a new solution to provision for, or avoid, these types of surprises.
The pure-play technology vendors can help squeeze the last few drops of juice out of this equation by enabling hot plug/unplug or cluster techniques. Beyond this, there is still quite some latitude in the application development area to check how much “dependency” we have per module and across modules during an upgrade. What is the path of least resistance to roll back and, more importantly, to ensure that data consistency is maintained? It would be really nice to have widgets that carry out dependency analysis, give me a list of the transactions that are important for me to support on a Sunday night, and check whether my planned upgrade impacts any of them. With the advances we have made in code transparency and usage analysis over the last couple of years, I am pretty certain that we will have a very granular view in the not-so-distant future, and that we can run our SAP functionality for years without any real outage.
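As a thought experiment, the kind of dependency check I have in mind could look something like the Python sketch below. It assumes we already have a "uses" map (which object depends on which) from some code or usage analysis. It then walks that map backwards from the objects touched by the planned upgrade and reports which of the transactions we must keep alive are affected. All names here (`USES`, `TX_DELIVERY`, `MOD_PRICING` and so on) are hypothetical placeholders, not real SAP objects or APIs.

```python
from collections import defaultdict, deque

# Hypothetical "uses" map: each key depends on the objects listed for it.
USES = {
    "TX_DELIVERY": ["MOD_SHIPPING"],
    "TX_BILLING": ["MOD_PRICING"],
    "TX_REPORT": ["MOD_REPORTING"],
    "MOD_SHIPPING": ["MOD_PRICING", "MOD_STOCK"],
}


def impacted_transactions(uses, changed_objects, critical_transactions):
    """Return the critical transactions that depend, directly or transitively,
    on any object changed by the planned upgrade."""
    # Invert the map: for each object, who depends on it?
    used_by = defaultdict(set)
    for obj, deps in uses.items():
        for dep in deps:
            used_by[dep].add(obj)

    # Breadth-first walk from the changed objects towards their dependents.
    seen = set(changed_objects)
    queue = deque(changed_objects)
    while queue:
        current = queue.popleft()
        for dependent in used_by[current]:
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)

    return sorted(seen & set(critical_transactions))


if __name__ == "__main__":
    sunday_night = ["TX_DELIVERY", "TX_BILLING"]
    print(impacted_transactions(USES, ["MOD_STOCK"], sunday_night))
```

With this toy data, a change to `MOD_STOCK` touches only the delivery transaction, while a change to `MOD_PRICING` would touch both delivery and billing. The hard part in practice is getting a trustworthy dependency map in the first place; the traversal itself is the easy bit.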