
Hi,

Recently we faced an issue with system replication in our production system where the “Replication Status Details” showed “Savepoint Pending”. Attempting to unregister the secondary system and bring it up on its own would not succeed either. While troubleshooting this, I discovered a gap in HANA that causes this behaviour. This blog describes the issue and what to do to fix it in such a scenario.

Scope: This issue only occurs in a scaled-out environment where the number of nodes in primary and secondary is not equal. For example, in our case we have one node fewer in secondary than in primary. This issue won’t occur in a single-node system or in a scaled-out system where primary and secondary have the same number of nodes.

Issue: The system replication status shows “Savepoint Pending” for all the nodes. When you try to unregister the secondary and bring it up, the system won’t come up and logs the following error message:

e sr_nameserver TopologyUtil.cpp(02196) : ### WARNING: The persistence of at least one service is not initialized correctly.
e sr_nameserver TopologyUtil.cpp(02197) : ### In order to initialize the secondary site you can …
e sr_nameserver TopologyUtil.cpp(02198) : ### – Re-Register the secondary site by executing sr_register, start the secondary site and wait until all services have been initialized. Afterwards the system can be used after executing sr_unregister or sr_takeover
e sr_nameserver TopologyUtil.cpp(02199) : ### – Re-Create the persistence of the secondary site from a backup
e sr_nameserver TopologyUtil.cpp(02200) : ### – Re-Install the HANA System

If you re-register the secondary with the primary, then after the replication completes the “Replication Status Details” will again show “Savepoint Pending” for all the nodes. If you then try to unregister the secondary again, it throws the same error message shown above.

Scenario when this issue can occur: This issue occurs in the following scenario –

  1. You have a scaled-out system in both primary and secondary.
  2. You don’t have an equal number of nodes in primary and secondary. For example, your primary has 8 worker and 2 standby nodes (total 10) whereas your secondary has 8 worker and 1 standby node (total 9). So you have 1 node in primary that is not mapped to any node in secondary. See the tables below.
  3. Any worker node in primary has failed over to a standby node.
  4. You try to unregister/register the secondary after the node failover in primary.
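To confirm you are in this situation, you can check the host mapping between the two sites with hdbnsutil. The sketch below is a dry run – the run helper only prints each command instead of executing it; drop the echo to execute for real as the <sid>adm user:

```shell
#!/bin/sh
# Dry-run sketch: print the command an admin would use to inspect the
# replication topology. Replace 'echo' with direct execution on a real host.
run() { echo "+ $*"; }

# Shows the replication state, including the mapping of primary hosts to
# secondary hosts. An unmapped worker (node 19 after the failover) stands out.
run hdbnsutil -sr_state
```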

Why this issue occurs in this situation: Let’s take the example of the below node configuration of a HANA system.

Table 1 – Normal node configuration in primary and secondary

Primary           Secondary
Node 10  Master   Node 50  Master
Node 11  Worker   Node 51  Worker
Node 12  Worker   Node 52  Worker
Node 13  Worker   Node 53  Worker
Node 14  Worker   Node 54  Worker
Node 15  Worker   Node 55  Worker
Node 16  Worker   Node 56  Worker
Node 17  Worker   Node 57  Worker
Node 18  Standby  Node 58  Standby
Node 19  Standby  Not mapped

Table 2 – Node configuration after failover in primary

Primary
Node 10  Master
Node 11  Worker
Node 12  Standby
Node 13  Worker
Node 14  Worker
Node 15  Worker
Node 16  Worker
Node 17  Worker
Node 18  Standby
Node 19  Worker

Table 3 – Node configuration after registering secondary with primary after node failover in primary

Primary           Secondary
Node 10  Master   Node 50  Master
Node 11  Worker   Node 51  Worker
Node 12  Standby  Node 52  Standby
Node 13  Worker   Node 53  Worker
Node 14  Worker   Node 54  Worker
Node 15  Worker   Node 55  Worker
Node 16  Worker   Node 56  Worker
Node 17  Worker   Node 57  Worker
Node 18  Standby  Node 58  Standby
Node 19  Worker   Not mapped

 

Table 1 shows the normal node configuration in a primary and secondary system. Note that the secondary has one node fewer than the primary, so standby node 19 in primary is not mapped to any node in secondary.

Table 2 shows the node configuration in primary after a node failed. In this example, let’s say node 12 failed because of an issue and standby node 19 took its place as a worker.

Table 3 shows the node configuration in primary and secondary if you try to unregister the secondary and then register it back after the node failover happened in primary.

Note: You can check the node configuration in secondary using the python script “landscapeHostConfiguration.py” in your python_support directory.
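For example (a dry-run sketch – the SID and instance number 00 are placeholders for your system; the run helper echoes instead of executing, so drop the echo to run it for real as <sid>adm):

```shell
#!/bin/sh
# Dry-run sketch of checking the node roles via the python_support directory.
run() { echo "+ $*"; }

run cd /usr/sap/SID/HDB00/exe/python_support
run python landscapeHostConfiguration.py
```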

In this scenario, when you register your secondary with the primary, the topology of the secondary adjusts itself to the primary automatically. So node 52 in secondary, which corresponds to node 12 in primary (the one that failed and is currently standby), also gets assigned as standby. So now we have 2 standbys in secondary. As the secondary has one node fewer than the primary, the current worker node 19, which is not mapped to any node in secondary, does not get replicated over to the secondary. Hence the “Replication Status Details” shows “Savepoint Pending” for all the nodes and you won’t be able to successfully unregister the secondary and bring it up. It will complain that “The persistence of at least one service is not initialized correctly”. Here it is actually referring to the unmapped node in primary which is currently acting as a worker after the node failover: the data in that node has not been replicated over to the secondary, though the trace files do not mention this specifically.
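Put differently, after such a register the secondary ends up with two hosts in the standby role. The snippet below illustrates how you might spot that – note that the [landscape] content is a simplified, hypothetical sketch of what the role entries in nameserver.ini can look like, not a verbatim copy of the real file:

```shell
#!/bin/sh
# Build a simplified sample of the secondary's nameserver.ini [landscape]
# section as it looks after the faulty register: two hosts in standby role.
cat > /tmp/landscape_sample.ini <<'EOF'
[landscape]
roles_node50 = master
roles_node51 = worker
roles_node52 = standby
roles_node53 = worker
roles_node58 = standby
EOF

# Count the standby entries; anything other than 1 here signals the problem.
grep -c 'standby$' /tmp/landscape_sample.ini
```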

Solution:

  1. Restart your primary system. This causes the node configuration to automatically revert to its original state (as shown in Table 1).
  2. Try to unregister the secondary. It will fail with the same error message mentioned above in the “Issue” section.
  3. Update the nameserver.ini file on the secondary to ensure that only one node is marked as standby there, as per your original configuration (node 58 in this example should be the only standby in secondary). This is needed because when you registered the secondary after the node failover in primary, the nameserver.ini and topology.ini files on the secondary automatically adjusted themselves to the primary and hence marked 2 nodes as standby – node 58, which was the original standby, and also node 52, because the corresponding node 12 in primary became standby after it failed.
  4. Force register the secondary with the primary. This allows the secondary to automatically adjust its topology to the primary as well, i.e. only one standby node (node 58).
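The four steps above can be sketched as commands. This is a dry run with placeholder values (SID, instance number 00, primary host node10, site name SITEB are all assumptions – adjust them to your landscape); the run helper prints each command instead of executing it, so drop the echo to run for real as <sid>adm:

```shell
#!/bin/sh
# Dry-run sketch of the recovery steps; all names below are placeholders.
run() { echo "+ $*"; }

# 1. Restart the primary so the failed node falls back to its original role.
run HDB stop   # on the primary
run HDB start  # on the primary

# 2. Try to unregister the secondary (expected to fail at this point).
run hdbnsutil -sr_unregister

# 3. Edit nameserver.ini on the secondary so only the original node is standby.
run vi /usr/sap/SID/SYS/global/hdb/custom/config/nameserver.ini

# 4. Re-register the secondary with a forced full replica.
run hdbnsutil -sr_register --name=SITEB --remoteHost=node10 \
    --remoteInstance=00 --replicationMode=sync --force_full_replica
```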

Now, after the replication completes, you won’t get the “Savepoint Pending” status again, and you will also be able to unregister your secondary and bring it up successfully.

 

I have reported this issue to SAP via an OSS message. They will be releasing a note about it shortly and will also incorporate a permanent fix into their development plan for future releases.

Thanks.

Arindam

 


2 Comments


    1. Arindam Deb Post author

      Hi Nicholas,

      Yes, as per SAP it is not mandatory to have equal number of nodes in primary and secondary. But if you do not have equal number of nodes in primary and secondary, you will encounter this issue. SAP is going to fix this bug in some release in HANA 2.0, that’s what they told me.

      Thanks,
      Arindam

