Consultant Field Note – Recovering from a Warm Standby failure
Most of you who read this blog entry are technically knowledgeable about SAP SRS (Replication Server) and, like me, have found tricks that involve changing system tables to create quick fixes. All of this is based on our in-depth understanding of the underlying technology.
Recently I had an assignment to debug an SAP SRS warm-standby scenario. During the “admin switch_active” command (making the Standby ASE Server the new Primary ASE Server) we encountered a failure in the master database replication switch due to network packet failures. My Customer was told that all they needed to do was to go into the RSSD database and “fix” the rs_databases table to emulate a working warm standby system that had already been switched over. This was done through a series of updates to the table, removing errant status bits, etc. After the updates, the SRS was restarted and life would continue with the former Standby side as the Primary ASE Server and vice versa.
By and large this seems to work; however, we discovered that it is not a silver bullet. It was not 100% effective: whether it worked depended on the type of failure and on how replication was meant to continue after the error condition.
The ‘go to’ method to ultimately fix a broken warm-standby connection is to drop and recreate the internal configuration of the standby side. This fix can only be delivered through RCL (Replication Command Language) scripts. While “drop” implies a rebuilding effort, this is not a drastic step: it is not time consuming and really only involves seconds of effort.
For this Customer, the master databases on the Primary and Standby were already in sync and replication-aware, because we always did the warm standby switch on a quiesced system (this is simply following best practices for our warm standby environment). The following steps are what we used to resync the master database, make it replication ready, and resolve the master warm standby replication issue.
1. Obtain the generation id from the primary master database. In the primary ASE Server, master database, type: dbcc gettrunc and note the gen_id value.
2. Drop connection to standby master. In the SRS Server, type: drop connection to <Standby ASE>.master
3. In standby ASE Server, master database, truncate the rs_lastcommit table by typing: truncate table rs_lastcommit
4. In the primary ASE Server, stop the active Replication Agent by typing: sp_stop_rep_agent master
5. In the SRS’s RSSD, zero the locator by typing: exec rs_zeroltm <Primary ASE>, master
6. In the primary ASE Server, reset the generation id (Step 1 gen_id + 1), forcing the system to ignore any previous records. This step is optional; we do it for an extra level of comfort. In the primary ASE Server, master database, type: dbcc settrunc (ltm, gen_id, <original gen_id + 1>)
7. In the primary ASE Server, restart the master Rep Agent by typing: exec sp_start_rep_agent master
8. Recreate the connection to the Standby master. In the SRS Server, type:
create connection to <Standby ASE Server>.master
set error class rs_sqlserver_error_class
set function string class rs_sqlserver_function_class
set username <replication maintenance user>
set password <password>
with log transfer on
as standby for <Logical Server>.<Logical database master>
Resume any suspended connections from the SRS to the master database by typing: resume connection to <ASE Server>.master.
9. Test using rs_ticket and adding a dummy login. In the primary ASE Server, master database, type: exec rs_ticket 'step 9 ok'. In the standby ASE Server, master database, type: select * from master..rs_ticket_history and look for the 'step 9 ok' record.
Entered manually, steps 1 to 9 took 5 minutes. We could have shortened that time considerably had we used shell scripts and batch files. The take-away point is that, although there were 9 steps, they were easy to implement and gave us one solution regardless of the master error condition.
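To make the scripting idea concrete, here is a minimal sketch in Python (rather than a shell or batch file) that only builds the text of steps 2 to 9 in order; it does not connect to any server. The function names build_resync_steps and render, the sample server names, and the maintenance user name are hypothetical and exist only for illustration. The gen_id is assumed to have been read by hand from dbcc gettrunc in step 1, the settrunc arguments are written in quoted form, and the resume command is assumed to target the standby connection.

```python
from typing import List, Tuple

Step = Tuple[str, str]  # (where to run it, command text)


def build_resync_steps(primary_ase: str, standby_ase: str,
                       logical_server: str, logical_db: str,
                       maint_user: str, current_gen_id: int) -> List[Step]:
    """Return the commands for steps 2 to 9 in order, as text only."""
    if current_gen_id < 0:
        raise ValueError("gen_id from dbcc gettrunc must not be negative")
    new_gen_id = current_gen_id + 1  # step 6: original gen_id + 1

    # Step 8: the password stays a placeholder so it is never stored here.
    create_connection = "\n".join([
        f"create connection to {standby_ase}.master",
        "set error class rs_sqlserver_error_class",
        "set function string class rs_sqlserver_function_class",
        f"set username {maint_user}",
        "set password <password>",
        "with log transfer on",
        f"as standby for {logical_server}.{logical_db}",
    ])

    primary_master = f"{primary_ase}, master database"
    standby_master = f"{standby_ase}, master database"
    return [
        ("SRS", f"drop connection to {standby_ase}.master"),           # step 2
        (standby_master, "truncate table rs_lastcommit"),              # step 3
        (primary_master, "sp_stop_rep_agent master"),                  # step 4
        ("RSSD", f"exec rs_zeroltm {primary_ase}, master"),            # step 5
        (primary_master,
         f"dbcc settrunc('ltm', 'gen_id', {new_gen_id})"),             # step 6
        (primary_master, "exec sp_start_rep_agent master"),            # step 7
        ("SRS", create_connection),                                    # step 8
        ("SRS", f"resume connection to {standby_ase}.master"),         # step 8
        (primary_master, "exec rs_ticket 'step 9 ok'"),                # step 9
        (standby_master, "select * from master..rs_ticket_history"),   # step 9
    ]


def render(steps: List[Step]) -> str:
    """Format the steps as an operator checklist, each command ended by 'go'."""
    blocks = [f"-- run in: {where}\n{command}\ngo" for where, command in steps]
    return "\n\n".join(blocks)


if __name__ == "__main__":
    # Sample values only; substitute your own server and user names.
    print(render(build_resync_steps("PRIM_ASE", "STBY_ASE", "LOGICAL_DS",
                                    "master", "master_maint", 3)))
```

The output is a checklist grouped by where each command runs (SRS, RSSD, primary or standby master), so it still has to be split per target before being fed to a client such as isql. Keeping the generation step separate from execution lets you review the exact commands before anything touches a server.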
If, and only if, this does NOT work, you need to drop the Active warm standby connection along with the Standby warm standby connection. Keeping master replication build scripts is always a good idea and is highly recommended in any environment.
Sometimes going back to first principles and keeping it simple is the best solution of all.
This Field Notes Series is dedicated to observations from the field taken from personal consulting experiences. The examples used have been created for this blog and do not reflect any existing SAP Customer configuration.