System Replication Implementation and Testing (part 1)
My name is Man-Ted Chan and I’m from the SAP HANA product support team. Recently I’ve been seeing a few issues with High Availability (HA) environments that use system replication, so I’m writing this piece on setting up HA, along with some troubleshooting tips and SAP notes.
To avoid confusion with the terminology I will refer to another posting on the SCN:
- System Replication is NOT Host Auto-Failover
- System Replication is NOT Scale Out
- System Replication is Disaster Tolerance (DT) / Disaster Recovery (DR)
- System Replication synchronizes data between two data centers (Site A and Site B)
- There is always one (logical) primary and one secondary system, e.g. site A is primary and site B is secondary. After a takeover, site B is the (logical) primary system. Thus, the primary and secondary roles change, whereas site A and site B always refer to a physical instance.
- A takeover makes a secondary system function as the primary system. Note that this explicitly does not include changing the state of the primary (in exceptional/disaster situations, the secondary must not depend on having access to the primary site to be able to change the state).
- Failback: a return to the original setup, e.g. a takeover from the backup site to the preferred site. The preferred site may have better internet connectivity, be easier for clients to reach, etc.
Also I’ve had to break up this blog into two parts as I hit a limit on the number of images that can be in a single blog posting.
Prerequisites
- Have separate primary and secondary servers with HANA installed, with an equal number of services and nodes. The revision of HANA on the secondary server has to be equal to or newer than that of the primary.
- The secondary system has the same SAP system ID and instance number.
- Ports 3<instance number>15 and 3<instance number + 1>15 must be available
- The primary server must have a backup available
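As a quick illustration of the port pattern in the list above, here is a minimal sketch that prints the two port numbers for a given two-digit instance number. The function name replication_ports is my own and not part of any SAP tool, and the sketch does not handle instance number 99.

```shell
#!/bin/sh
# Hypothetical helper (not an SAP tool): print the two ports from the
# prerequisites, 3<instance number>15 and 3<instance number + 1>15.
replication_ports() {
    nr=$1
    # Drop one leading zero so values such as "08" are not read as octal.
    n=${nr#0}
    next=$((n + 1))
    printf '3%02d15 3%02d15\n' "$n" "$next"
}

replication_ports 00   # prints: 30015 30115
```

You could feed the result into whatever port check you normally use on your hosts.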
Setting up System Replication
These are steps from, but done in an SP09 environment:
I have included screen caps, tests, and log snippets
Setting up primary
When setting up system replication, a backup needs to exist. As a test, I will show what happens when there is no backup:
Right click on your primary system and select ‘Configure System Replication…’
As we can see, we cannot proceed with the replication as there is no backup. In the next few images we will create the backup.
Afterwards, try to create the replication again. Please note that the field ‘Primary System Logical Name’ can be whatever you want; I chose the name ‘primary’.
After this is run, the following can be found in the nameserver trace:
==== Starting hdbnsutil, version 1.00.090.00.1416514886 (fa/newdb100_rel),
i Basis TraceStream.cpp(00469) : MaxOpenFiles: 1048576
i Basis TraceStream.cpp(00472) : Server Mode: L2 Delta
i Basis ProcessorInfo.cpp(00713) : Using GDT segment limit to determine current CPU ID
i Basis Timer.cpp(00650) : Using RDTSC for HR timer
i Memory AllocatorImpl.cpp(01326) : Allocators activated
i Memory AllocatorImpl.cpp(01342) : Using big block segment size 8388608
i Basis TopologyUtil.cpp(03894) : command: hdbnsutil -sr_enable --name=primary --sapcontrol=1
w Environment Environment.cpp(00295) : Changing environment set SSL_WITH_OPENSSL=0
i sr_nameserver TopologyUtil.cpp(02581) : successfully enabled system as system replication source site
If you want to use the command line to enable this replication, run the following:
hdbnsutil -sr_enable --name=<Primary System Logical Name>
After this your system is now enabled for system replication
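If you would rather confirm this from a script than by reading the trace by hand, one option is to search the nameserver trace for the success message shown in the trace output above. This is only a sketch: the function name is my own, and the path to the trace file is something you have to supply for your system.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the given nameserver trace file contains
# the success message that hdbnsutil -sr_enable wrote in the trace above.
# $1: path to a nameserver trace file (placeholder, depends on your system)
sr_enabled_in_trace() {
    grep -q 'successfully enabled system as system replication source site' "$1"
}

# Example call (replace the path with your own trace file):
# sr_enabled_in_trace /path/to/nameserver_trace && echo "source site enabled"
```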
Setting up the secondary node
You will have to stop the HANA system on the secondary server prior to setting up the replication. Right click on the server and select ‘Configuration and Monitoring’ -> ‘Configure System Replication…’ again. Please note that the SID is the same.
At this step, name the replication in the ‘Secondary System Logical Name’ field and enter the host of the primary system from above (note that the instance number is not editable).
Replication mode options that are available are the following:
- Synchronous with full sync option (mode=sync. Full sync is configured with the parameter [system_replication]/enable_full_sync) means that the log write is successful when the log buffer has been written to the logfile of the primary and the secondary instance. In addition, when the secondary system is disconnected (for example, because of network failure) the primary system suspends transaction processing until the connection to the secondary system is re-established. No data loss occurs in this scenario.
- Synchronous (mode=sync) means the log write is considered as successful when the log entry has been written to the log volume of the primary and the secondary instance.
- Synchronous in memory (mode=syncmem) means the log write is considered as successful, when the log entry has been written to the log volume of the primary and sending the log has been acknowledged by the secondary instance after copying to memory.
- Asynchronous (mode=async): The primary system sends redo log buffers to the secondary system asynchronously. The primary system commits a transaction when it has been written to the log file of the primary system and sent to the secondary system through the network. It does not wait for confirmation from the secondary system. This option provides better performance because it is not necessary to wait for log I/O on the secondary system. Database consistency across all services on the secondary system is guaranteed. However, it is more vulnerable to data loss. Data changes may be lost on takeover.
The above is from the SAP HANA Admin guide:
This can be done via the command line:
hdbnsutil -sr_register --remoteHost=<primary hostname> --remoteInstance= --mode=<sync/syncmem/async> --name=<Secondary System Logical Name>
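Since a mistyped mode is easy to miss, here is a small sketch that checks the mode against the three values listed above and then prints the registration command instead of running it. The function name build_sr_register and its argument order are my own invention; only the hdbnsutil options come from the command above.

```shell
#!/bin/sh
# Hypothetical helper: validate the replication mode, then print (not run)
# the hdbnsutil registration command for review.
# Arguments: <primary hostname> <instance number> <mode> <secondary logical name>
build_sr_register() {
    host=$1
    instance=$2
    mode=$3
    name=$4
    case $mode in
        sync|syncmem|async) ;;
        *)
            echo "unknown replication mode: $mode" >&2
            return 1
            ;;
    esac
    printf 'hdbnsutil -sr_register --remoteHost=%s --remoteInstance=%s --mode=%s --name=%s\n' \
        "$host" "$instance" "$mode" "$name"
}

# Example with made-up values:
build_sr_register primaryhost 00 syncmem secondary
```

Printing the command first lets you check it before running it as the <sid>adm user on the stopped secondary.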
During this registration I ran into the following error
I then ran it via the command line to show the error
I checked the listed nameserver trace to see if there is any other information
==== Starting hdbnsutil, version 1.00.090.00.1416514886 (fa/newdb100_rel),
e Configuration ConfigStoreManager.cpp(00693) : Configuration directory does not exist.
- TopologyUtil.cpp(03894) : command: hdbnsutil -sr_register --remoteHost=xxxxx509 --remoteInstance=00 --mode=sync -name=sec
e sr_nameserver TNSClient.cpp(06778) : remoteHost does not match with any host of the source site. all hosts of source and target site must be able to resolve all hostnames of both sites correctly
From this error we can see that the landscapes of the two systems do not match. I checked the landscape on the primary and the secondary.
Here we can see that on the secondary server there is the ‘sapstartsrv’ process. After this is resolved, re-run the wizard or enter the hdbnsutil command again.
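The error text also says that all hosts of both sites must be able to resolve each other's hostnames, so it can be worth ruling that out as well. Below is a rough sketch using the standard Linux getent tool; the function name is my own, and you would run it on every host of both sites with the full list of hostnames.

```shell
#!/bin/sh
# Hypothetical helper: report which of the given hostnames this host can
# resolve. Returns non-zero if at least one name fails to resolve.
check_hosts_resolve() {
    status=0
    for host in "$@"; do
        if getent hosts "$host" >/dev/null 2>&1; then
            echo "ok      $host"
        else
            echo "FAILED  $host"
            status=1
        fi
    done
    return $status
}

# Example call (replace with the hostnames of your primary and secondary):
# check_hosts_resolve primaryhost secondaryhost
```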
‘Initial full data shipping’ is the equivalent of running hdbnsutil -sr_register --force_full_replica
If this parameter is set, a full data shipping is initiated. Otherwise a delta data shipping is attempted.
If you run this via the command line, you will have to start up the secondary server manually.
For more information on the hdbnsutil options, please refer to the following reference guide
Checking the status of the replication
You can check the status of the replication in studio and via the command line, as the screen caps below show.
Check the nameserver trace for the following success messages upon startup to confirm the replication:
TREXNameServer.cpp(12634) : called registerDatacenter from registrator=xxxx301545c
i sr_nameserver TREXNameServer.cpp(12776) : registerDatacenter; new disaster recovery site id =2
i sr_nameserver TREXNameServer.cpp(12864) : matched host xxxx509 to xxxx301545c
i sr_nameserver TREXNameServer.cpp(15138) : volume 1 successfully initialized for system replication
i sr_nameserver TREXNameServer.cpp(15138) : volume 2 successfully initialized for system replication
i sr_nameserver TREXNameServer.cpp(15138) : volume 4 successfully initialized for system replication
i sr_nameserver TREXNameServer.cpp(15138) : volume 3 successfully initialized for system replication
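As the snippet shows, the volumes are not necessarily reported in order, so on a larger system it can help to pull the volume IDs out of the trace and sort them. This is a sketch only: the function name is my own, and the trace file path is a placeholder for your system.

```shell
#!/bin/sh
# Hypothetical helper: list the IDs of all volumes that the given nameserver
# trace reports as initialized for system replication, sorted numerically.
# $1: path to a nameserver trace file (placeholder, depends on your system)
initialized_volumes() {
    sed -n 's/.*volume \([0-9][0-9]*\) successfully initialized for system replication.*/\1/p' "$1" |
        sort -n
}

# Example call (replace the path with your own trace file):
# initialized_volumes /path/to/nameserver_trace
```

Comparing this list with the volumes you expect on the primary is a quick way to spot one that never initialized.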
Please note that when you add the replication server you’ll notice that you cannot open the Administration panel or run SQL queries. So you will not be able to check the data in the replication server.
Instead you are opening Diagnosis mode. The screen cap below shows the difference between the two.
Click here for part 2