ASE Cluster Edition now supports rolling upgrades.
A frequently requested feature from customers considering ASE Cluster Edition is support for rolling upgrades. While this was originally slated for the ASE 16 Cluster Edition release planned for later this year, the good news is that support for this feature has been expedited and will be available in the in-market version, starting with ASE 15.7 CE sp133 as currently planned (as usual, all caveats apply about plans changing, but the key is that it will be fairly soon).
What does this mean? In the future, when patches are released for ASE Cluster Edition, the patch documentation will state whether the patch is certified for a rolling upgrade. If it is, the DBA can apply the patch without shutting down the cluster. One requirement, of course, is that the cluster nodes must be using private installs of the ASE binaries rather than a single shared cluster file system implementation. With this support, there are three methods of minimizing or eliminating downtime when upgrading ASE Cluster Edition: rolling upgrades, minimal down-time upgrades, and replication-based upgrades for major releases.
Rolling Upgrades
If the patch is certified for a rolling upgrade, the DBA can apply the patch in a rolling fashion with zero downtime by using the following steps:
• Use the workload manager to fail over/migrate logical clusters off the node to be patched
• Once all workload has been migrated off the node, shut it down
• Patch the local binary copy for that node
• Restart that node of the cluster – at this point it should rejoin the cluster
• Fail back/migrate workload onto the node using the workload manager
• Repeat for each node in the cluster
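The per-node sequence above can be sketched as a simple loop. This is only an illustration of the ordering: the `step` callback and the action names are hypothetical placeholders, not ASE or workload manager commands, and a real run would verify each step before moving on.

```python
from typing import Callable, List


def rolling_upgrade(nodes: List[str], step: Callable[[str, str], None]) -> None:
    """Apply the rolling-upgrade sequence to one node at a time.

    `step(action, node)` is a hypothetical hook; in practice each action
    would be carried out by the DBA or by site-specific tooling.
    """
    for node in nodes:
        step("migrate_off", node)   # move logical clusters off this node
        step("shutdown", node)      # node is idle, so it can be stopped
        step("patch", node)         # patch this node's private binary copy
        step("restart", node)       # node should rejoin the cluster
        step("migrate_back", node)  # fail back / migrate workload onto it


def record_steps(nodes: List[str]) -> List[str]:
    """Dry run: return the ordered list of actions instead of executing them."""
    log: List[str] = []
    rolling_upgrade(nodes, lambda action, node: log.append(f"{action}:{node}"))
    return log


if __name__ == "__main__":
    for entry in record_steps(["node1", "node2"]):
        print(entry)
```

The point of the structure is that only one node is out of the cluster at any time, so the remaining nodes keep serving the migrated workload.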
As an interesting point of reference, the lead engineer on this project reviewed patches for earlier releases and noted that most of them would have been certifiable for rolling upgrades. This led to the decision to release this capability ahead of plan.
Minimal Down-time Upgrades
This capability has implicitly always been available and should have been used as a best practice. To understand what is gained from this method, you must first understand the full downtime of a normal upgrade:
• Users are kicked off the system
• The DBMS is shut down
• The software patch is applied (takes tens of minutes or longer)
• The DBMS is restarted (can take multiple minutes)
• Application access is restored
The question is whether this can be reduced for clustered DBMS implementations when the patch is not certified for rolling upgrades. The answer, of course, is “yes”, by following a best practice that some term “minimal down-time upgrades”.
In this strategy, the nodes of the cluster are thought of as belonging to one of at least two sets. The first set contains the nodes that are upgraded first, while the second set contains the nodes that provide services while the first set is patched. For example, in a 4-node cluster, you might put 2 nodes in each set. For a 3-node cluster, perhaps 2 in the first set and 1 in the second. The process then is as follows:
• Use the workload manager to fail over/migrate all workloads to the second set of nodes
• Shut down the first set of nodes
• Patch the first set of nodes
• Shut down the second set of nodes
• Restart the first set of nodes and verify that the logical clusters (LCs) are all pointing to these nodes
• Restart the applications
• Patch the second set of nodes
• Restart the second set of nodes
• Re-distribute the workload using the workload manager as desired
You might have noticed that there is downtime in the middle, between shutting down the second set of nodes and restarting the first set. However, this should be just the time it takes to start the cluster nodes and not the tens of minutes that would also be necessary if patching happened in between. Hence, this method is sometimes referred to as the “minimal down-time” approach.
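To make the difference concrete, here is a small arithmetic sketch. The durations are made-up illustrative values, not measurements; actual shutdown, patch, and restart times depend on the system.

```python
def normal_upgrade_outage(shutdown_min: float, patch_min: float, restart_min: float) -> float:
    """Outage when everything happens while users are off the system."""
    return shutdown_min + patch_min + restart_min


def minimal_downtime_outage(shutdown_min: float, restart_min: float) -> float:
    """Outage for the two-set approach.

    Patching is done while the other set of nodes is still serving users,
    so only the shutdown of the second set and the restart of the first
    set fall inside the outage window.
    """
    return shutdown_min + restart_min


if __name__ == "__main__":
    # Hypothetical durations in minutes, chosen only for illustration.
    shutdown, patch, restart = 2, 30, 5
    print("normal upgrade outage:  ", normal_upgrade_outage(shutdown, patch, restart))
    print("minimal down-time outage:", minimal_downtime_outage(shutdown, restart))
```

With these assumed numbers the outage drops from 37 minutes to 7, because the patch time moves out of the window in which no node is serving applications.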
Also, you need to be careful in defining the sets of nodes to take down at once. If a logical cluster doesn’t span both sets of nodes with its primary and failover nodes, then whether the applications associated with that logical cluster remain available depends on the down routing mode. This can be exploited in situations where some applications need higher availability than others: non-critical applications would be down for the full upgrade, while others would only be unavailable during the restart of the first upgraded nodes.
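A quick way to sanity-check a proposed split is to list the logical clusters that do not have nodes in both sets. The sketch below assumes the logical cluster layout has already been written down as a plain mapping from name to node names (primary plus failover); the names and the data structure are hypothetical, and nothing here queries ASE itself.

```python
from typing import Dict, List, Set


def lcs_not_spanning(
    logical_clusters: Dict[str, Set[str]],
    first_set: Set[str],
    second_set: Set[str],
) -> List[str]:
    """Return the logical clusters that lack a node in one of the two sets.

    `logical_clusters` maps a logical cluster name to all of its nodes
    (primary and failover). Clusters returned here will lose all of their
    nodes at some point during a two-set upgrade.
    """
    at_risk = []
    for name, nodes in logical_clusters.items():
        in_first = bool(nodes & first_set)
        in_second = bool(nodes & second_set)
        if not (in_first and in_second):
            at_risk.append(name)
    return sorted(at_risk)


if __name__ == "__main__":
    # Hypothetical 4-node layout.
    layout = {
        "sales_lc": {"node1", "node3"},      # spans both sets
        "reporting_lc": {"node1", "node2"},  # only in the first set
    }
    print(lcs_not_spanning(layout, {"node1", "node2"}, {"node3", "node4"}))
```

Whether a logical cluster flagged this way is actually unavailable still depends on its down routing mode, so the output is a list to review rather than a verdict.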
Major Upgrades/Avoiding Down-time
One of the bigger risks to system availability comes from applying major upgrades to the DBMS. While this doesn’t happen as often as patching, most of the time such upgrades affect the system catalogs, the cluster interconnect protocols or other DBMS internals in ways that prevent a rolling upgrade. While the minimal down-time approach above could still be used, many businesses want even better application availability, including:
• Ability to avoid down-time to the maximum extent possible, even the restart time, as that could be tens of minutes depending on memory size, tempdb size(s), database recovery times, etc.
• Ability to run different major releases for a period of time to allow rolling back the upgrade if significant problems are experienced post upgrade
Across the industry, there is only one solution for this: replicating the data to another cluster (or non-clustered) system. This can be done by physical replication of log records or by logical replication using SAP Replication Server. Generally, logical replication is the most tenable solution, as it allows changes to be sent in either direction. For example, some customers like to perform the upgrade, flip to the upgraded system and run for a period of time (e.g., 2 weeks), then flip back to the un-upgraded environment and run there for an equal time frame as a second affirmation, before finally flipping back to the upgraded system and upgrading the replicated copy.
Using a replicated copy, the only outage to the application is during the switch itself, which can be made transparent to the end user via middle tier components. The degree of transparency may differ depending on the component’s ability to understand database connection contexts, etc. The simplest form is a hardware switch, which causes the application to receive a connection drop message. The application should then attempt to reconnect and, if successful (which it likely would be), resubmit any in-flight transactions.
One consideration that SAP is looking into is melding the upcoming ASE HADR technology with Cluster Edition. ASE HADR technology (planned for the upcoming ASE 16 sp02 release, with the same caveat as earlier about anything “planned”) allows fully independent ASE installations to be viewed as an HADR cluster with fully transparent client application failover and other capabilities that in the past were only available with ASE/HA or ASE Cluster Edition. However, such an implementation is a long term future consideration at this point.
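The reconnect-and-resubmit behavior described above can be sketched generically. This is a minimal illustration, not a driver API: `connect` and `work` are hypothetical callables supplied by the application, and the built-in `ConnectionError` stands in for whatever error the real database driver raises on a dropped connection.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_reconnect(
    connect: Callable[[], object],
    work: Callable[[object], T],
    max_attempts: int = 3,
    delay_s: float = 0.0,
) -> T:
    """Run `work(conn)`; on a dropped connection, reconnect and resubmit."""
    conn = connect()
    for attempt in range(1, max_attempts + 1):
        try:
            return work(conn)
        except ConnectionError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            time.sleep(delay_s)  # give the switch time to complete
            conn = connect()     # new connection, now to the other system


class DroppingOnce:
    """Test double that drops the first submission, as a switch would."""

    def __init__(self) -> None:
        self.connections = 0
        self.dropped = False

    def connect(self) -> int:
        self.connections += 1
        return self.connections

    def transfer(self, conn: int) -> str:
        if not self.dropped:
            self.dropped = True
            raise ConnectionError("connection dropped during switch")
        return f"committed on connection {conn}"


if __name__ == "__main__":
    sim = DroppingOnce()
    print(run_with_reconnect(sim.connect, sim.transfer))
```

Resubmitting is only safe when the application knows the interrupted transaction did not commit, or when the work is idempotent, so real applications usually need a check for that before retrying.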
In summary, with support for rolling upgrades, ASE Cluster Edition is now an even higher availability solution than previously – and ASE CE matches the capabilities of competitive cluster solutions with respect to online upgrades.