Issues with system replication in a scaled out HANA system having unequal number of nodes between primary and secondary
Hi,
Recently we faced an issue with system replication in our production system where the “Replication Status Details” was showing “Savepoint Pending”. If you try to unregister the secondary system and bring it up, it won’t be successful. While troubleshooting this, we discovered a gap in HANA that causes this issue. This blog is about that gap and what to do to fix the issue in such a scenario.
Scope: This issue will only occur in a scaled-out environment where the number of nodes in the primary and secondary is not equal. For example, in our case we have one node less in the secondary than in the primary. This issue won’t occur in a single-node system or in a scaled-out system where the primary and secondary both have the same number of nodes.
Issue: System replication status shows “Savepoint Pending” for all the nodes. When trying to unregister the secondary and bring it up, the system won’t come up. Error message:
e sr_nameserver TopologyUtil.cpp(02196) : ### WARNING: The persistence of at least one service is not initialized correctly.
e sr_nameserver TopologyUtil.cpp(02197) : ### In order to initialize the secondary site you can …
e sr_nameserver TopologyUtil.cpp(02198) : ### – Re-Register the secondary site by executing sr_register, start the secondary site and wait until all services have been initialized. Afterwards the system can be used after executing sr_unregister or sr_takeover
e sr_nameserver TopologyUtil.cpp(02199) : ### – Re-Create the persistence of the secondary site from a backup
e sr_nameserver TopologyUtil.cpp(02200) : ### – Re-Install the HANA System
If you try to re-register the secondary with the primary, after the replication completes the “Replication Status Details” will again show “Savepoint Pending” for all the nodes. If you then try to unregister the secondary, it will throw the error message shown above.
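For reference, here is roughly how you can check the replication status from the command line instead of HANA Studio. This is only a sketch: the SID, instance number, user and password are placeholders, and systemReplicationStatus.py is only shipped with more recent revisions, so adapt it to your own system.

```
# On the primary master node, as the <sid>adm user
cd /usr/sap/<SID>/HDB<instance>/exe/python_support
python systemReplicationStatus.py

# The same "Replication Status Details" information can be read from the
# monitoring view M_SERVICE_REPLICATION (add -d SYSTEMDB on multi-tenant systems):
hdbsql -i <instance> -u SYSTEM -p <password> \
  "SELECT HOST, PORT, REPLICATION_STATUS, REPLICATION_STATUS_DETAILS FROM M_SERVICE_REPLICATION"
```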
Scenario when this issue can occur: This issue occurs in the following scenario –
- You have a scaled-out system in both primary and secondary.
- You don’t have an equal number of nodes in the primary and secondary. For example, your primary has 8 worker and 2 standby nodes (total 10) whereas your secondary has 8 worker and 1 standby node (total 9). So you have 1 node in the primary that is not mapped to any node in the secondary. See the tables below.
- Any worker node in primary has failed over to a standby node.
- You try to unregister/register the secondary after the node failover in primary.
Why this issue occurs in this situation: Let’s take the example of the node configuration below for a HANA system.
Table 1 – Normal node configuration in primary and secondary

| Primary node | Role | Secondary node | Role |
|---|---|---|---|
| Node 10 | Master | Node 50 | Master |
| Node 11 | Worker | Node 51 | Worker |
| Node 12 | Worker | Node 52 | Worker |
| Node 13 | Worker | Node 53 | Worker |
| Node 14 | Worker | Node 54 | Worker |
| Node 15 | Worker | Node 55 | Worker |
| Node 16 | Worker | Node 56 | Worker |
| Node 17 | Worker | Node 57 | Worker |
| Node 18 | Standby | Node 58 | Standby |
| Node 19 | Standby | Not mapped | |

Table 2 – Node configuration in primary after failover

| Primary node | Role |
|---|---|
| Node 10 | Master |
| Node 11 | Worker |
| Node 12 | Standby |
| Node 13 | Worker |
| Node 14 | Worker |
| Node 15 | Worker |
| Node 16 | Worker |
| Node 17 | Worker |
| Node 18 | Standby |
| Node 19 | Worker |

Table 3 – Node configuration after registering the secondary with the primary after the node failover in the primary

| Primary node | Role | Secondary node | Role |
|---|---|---|---|
| Node 10 | Master | Node 50 | Master |
| Node 11 | Worker | Node 51 | Worker |
| Node 12 | Standby | Node 52 | Standby |
| Node 13 | Worker | Node 53 | Worker |
| Node 14 | Worker | Node 54 | Worker |
| Node 15 | Worker | Node 55 | Worker |
| Node 16 | Worker | Node 56 | Worker |
| Node 17 | Worker | Node 57 | Worker |
| Node 18 | Standby | Node 58 | Standby |
| Node 19 | Worker | Not mapped | |
Table 1 shows the normal node configuration in a primary and secondary system. Note that the secondary has one node fewer than the primary, so standby node 19 in the primary is not mapped to any node in the secondary.
Table 2 shows the node configuration in the primary after a node failed. In this example, let’s say node 12 failed because of an issue and standby node 19 took its place as a worker.
Table 3 shows the node configuration in the primary and secondary if you unregister the secondary and then register it back after the node failover happened in the primary.
Note: You can check the node configuration in the secondary using the python script “landscapeHostConfiguration.py” in your python_support directory.
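As a quick sketch (the SID and instance number are placeholders), the script can be run like this on any node of the site you want to inspect:

```
# Run as the <sid>adm user
cd /usr/sap/<SID>/HDB<instance>/exe/python_support
python landscapeHostConfiguration.py

# Compare the "IndexServer Config Role" and "IndexServer Actual Role" columns
# (the column names can vary slightly by revision). After the failover in our
# example, node 19 shows the actual role "worker" while node 12 shows "standby".
```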
In this scenario, when you register your secondary with the primary, the topology of the secondary adjusts itself to the primary automatically. So node 52 in the secondary, which corresponds to node 12 in the primary (the one which failed and is currently standby), also gets assigned as standby. So now we have two standbys in the secondary. As the secondary has one node fewer than the primary, the current worker node 19, which is not mapped to any node in the secondary, does not get replicated over to the secondary. Hence the “Replication Status Details” shows “Savepoint Pending” for all the nodes, and you won’t be able to successfully unregister the secondary and bring it up. It will complain that “The persistence of at least one service is not initialized correctly”. Here it is actually referring to the unmapped node in the primary, which is currently acting as a worker after the node failover. The data on that node has not been replicated over to the secondary, though the trace files do not mention this specifically.
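You can also see how the hosts of the two sites are mapped to each other with hdbnsutil. The sketch below assumes a reasonably recent revision; the exact output format varies:

```
# As the <sid>adm user on the secondary (or primary) master node
hdbnsutil -sr_state

# In the "Host Mappings:" section you would expect every data-carrying
# primary host to be mapped to a secondary host. In the scenario above,
# node 19 (now a worker holding data) has no secondary host mapped to it,
# which is why its persistence never gets initialized on the secondary.
```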
Solution:
- Restart your primary system. This will cause the node configuration to automatically revert to its original configuration (as shown in Table 1).
- Try to unregister the secondary. It will fail with the same error message as mentioned above in the “Issue” section.
- Update the nameserver.ini file on the secondary to ensure that only one node is marked as standby there, as per your original configuration (node 58 in this example should be the only standby in the secondary). This is needed because when you registered the secondary after the node failover in the primary, the nameserver.ini and topology.ini files in the secondary automatically adjusted themselves to match the primary and therefore marked two nodes as standby: node 58, which was the original standby, and node 52, because its corresponding primary node 12 became standby after it failed.
- Force register the secondary with the primary. This allows the secondary to adjust its topology to match the primary again, i.e. with only one standby node (node 58). A command-level sketch of these steps follows after this list.
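The following is a command-level sketch of the four steps above. All host names, the SID, the instance number, the site name and the replication mode are placeholders, the exact hdbnsutil options depend on your revision, and I am assuming --force_full_replica as the way to force the re-registration; adapt everything to your own landscape before using it.

```
# All commands as the <sid>adm user

# 1) Restart the primary system so the roles revert to the original layout (Table 1)
sapcontrol -nr <instance> -function StopSystem HDB
sapcontrol -nr <instance> -function StartSystem HDB

# 2) On the secondary master node, try to unregister
#    (expected to fail with the "persistence ... not initialized correctly" error)
hdbnsutil -sr_unregister

# 3) Edit nameserver.ini on the secondary (typically under
#    /usr/sap/<SID>/SYS/global/hdb/custom/config/) so that only the original
#    standby host (node 58 in this example) is marked as standby
vi /usr/sap/<SID>/SYS/global/hdb/custom/config/nameserver.ini

# 4) Force-register the secondary with the primary and start it
hdbnsutil -sr_register --name=<secondary_site> \
  --remoteHost=<primary_master_host> --remoteInstance=<instance> \
  --replicationMode=sync --force_full_replica
sapcontrol -nr <instance> -function StartSystem HDB
```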
Now, after the replication completes, you won’t get the “Savepoint Pending” status again, and you will be able to unregister your secondary and bring it up successfully.
I have reported this issue to SAP via an OSS message. They will be releasing a note about this issue shortly and will also incorporate a permanent fix into their development plan for future releases.
Thanks.
Arindam
Hi,
Thanks for sharing. This is weird, as according to SAP a standby node in the secondary site is optional.
Cheers,
Nicholas Chang
Hi Nicholas,
Yes, as per SAP it is not mandatory to have an equal number of nodes in the primary and secondary. But if you do not have an equal number of nodes in the primary and secondary, you will encounter this issue. SAP told me they are going to fix this bug in a HANA 2.0 release.
Thanks,
Arindam
Hello Arindam,
Do you have the SAP note if already released for this scenario?
Best Regards,
Tanmeya
Hi Tanmeya,
Yes, OSS note 2452107 (“Wrong host mapping after hdbnsutil -sr_register if primary has more standby hosts than the secondary”) has been released for this scenario.
However, note that if you have already tried to unregister and register your secondary after the node failover took place in the primary, the nameserver.ini file on the secondary will have been changed automatically to the incorrect standby node configuration. You won’t be able to unregister it in this situation; it will fail with the error message mentioned in this blog. You will need to perform step 3 in the solution section above to get around that. This is not mentioned in the note.
Thanks,
Arindam
This issue is fixed in newer revisions of HANA.