System Replication Implementation and Testing (part 2)

former_member182824 · ‎09-23-2015

Hi again,

My name is Man-Ted Chan and I’m from the SAP HANA product support team. This is part 2 of my High Availability/System Replication blog; part 1 can be found here.

This post continues where the last one left off.

How to turn off replication

First we will unregister the secondary server (SAP Note 1945676 covers the correct usage of hdbnsutil -sr_unregister). This means no more data from the primary will go to this server.

After the secondary has been unregistered, we can run hdbnsutil -sr_state to confirm this.

However, if you check the primary node you will see that the replication is still enabled, but no server for the replication is listed.

Next we can disable the replication on the primary (from the command line this is hdbnsutil -sr_disable).

Once this is done, you can check the replication tab and hdbnsutil -sr_state.

Other things tested during this phase

As a test, I stopped the primary to see what happens to the replication. No automated takeover occurs, but the following network communication errors appear in the trace files:

e Stream NetworkChannelCompletion.cpp(00524) : NetworkChannelCompletionThread #2 NetworkChannel FD 28 [0x00007fc028072818] {refCnt=3, idx=2} 10.97.22.172/0_tcp->10.97.22.172/30103_tcp ConnectWait,[---c]

: Error in asynchronous stream event: exception 1: no.2110001 (Basis/IO/Stream/impl/NetworkChannelCompletion.cpp:450)

Generic stream error: getsockopt, Event=EPOLLERR - , rc=111: Connection refused
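If you need to sift through longer trace files, a small script can pull out the error entries. The following Python sketch is only an illustration. It assumes the trimmed line layout shown in this post (level, component, source file with line number, message); real trace files carry additional prefixes such as thread IDs and timestamps, so the pattern would need adjusting. The names parse_trace and errors_only are made up for this example and are not part of any SAP tool.

```python
import re
from typing import Dict, List

# Assumed layout: "<level> <component> <file>(<line>) : <message>"
ENTRY = re.compile(
    r"^(?P<level>[a-z])\s+(?P<component>\S+)\s+"
    r"(?P<source>[\w.]+)\((?P<line>\d+)\)\s*:\s*(?P<message>.*)$"
)


def parse_trace(text: str) -> List[Dict[str, str]]:
    """Split trace text into entries; unmatched lines continue the previous entry."""
    entries: List[Dict[str, str]] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = ENTRY.match(line)
        if match:
            entries.append(match.groupdict())
        elif entries:
            # Continuation line: append it to the message of the last entry.
            entries[-1]["message"] += " " + line.lstrip(": ").strip()
    return entries


def errors_only(entries: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only entries with level e (error) or f (fatal)."""
    return [entry for entry in entries if entry["level"] in ("e", "f")]
```

Calling errors_only(parse_trace(text)) on the excerpt above would return a single entry for the Stream component, with the "Connection refused" continuation lines folded into its message.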

Please note that if you stop the replication server, the primary server will raise the following alert:

ReplicationError with state INFO with event ID 1 occurred at <DATE> on xxxx36f509:30007. Additional info: Communication channel closed

Associated with Alert ID 78

The following errors will be found in the trace files:

e TNS TNSClient.cpp(00671) : sendRequest dr_getremotereplicationinfo to xxxx301545c:30001 failed with NetException. data=(I)drsender=1|

e sr_nameserver TNSClient.cpp(06880) : error when sending request 'dr_getremotereplicationinfo' to xxxx301545c:30102: connection refused,location=xxxx301545c:30001

i EventHandler EventManagerImpl.cpp(00602) : acknowledge: ReplicationEvent(): Communication channel closed

If you run into this alert in your own system, you should check whether the secondary node is down (can you start it, or was there a crash?).
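One quick way to tell a stopped service from a network problem is to try a plain TCP connection to the port named in the trace. The sketch below is a minimal example using only the Python standard library. The function name port_is_open is invented for illustration, and the host and port you pass in are your own values (for instance the xxxx301545c:30001 pair from the trace above).

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False

# Example call (placeholder host name, adjust to your landscape):
# port_is_open("secondary-host", 30001)
```

A False result only says the port could not be reached from where the script runs. It does not say whether the service is down or a firewall is in the way, so it is a first check rather than a diagnosis.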

How to perform a takeover

*Please note that a takeover should only be performed if there is an issue with the primary or if you would like zero downtime during a HANA upgrade.

Right click on the secondary node and open “Configure System Replication”.

At an OS level you will see the takeover process

To perform the takeover via the command prompt you would run the following on the secondary server:

hdbnsutil -sr_takeover

*After the takeover a new server had to be set up, so the server name changes from 301545c to 59e3753f1.

Please note that on your replication server you will now be able to open the admin panel and not just the diagnosis mode (in the diagnosis mode only the ‘Processes’, ‘Diagnosis Files’, and ‘Emergency Information’ tabs are available).

On the old primary server and the old replication server, we can check Landscape->System Replication and see that there is no replication.

Since the replication hasn’t been disabled, we will see the communication errors again on the original primary:

i EventHandler EventManagerImpl.cpp(00780) : --removeAllEvents: ReplicationEvent(): Communication channel closed

On the old replication server the nameserver trace will show the following during the takeover if it was successful

i sr_nameserver TREXNameServer.cpp(15647) : re-assign for databaseId 2 volume 2 returned successfully

i sr_nameserver TREXNameServer.cpp(15647) : re-assign for databaseId 2 volume 4 returned successfully

i sr_nameserver TREXNameServer.cpp(15647) : re-assign for databaseId 2 volume 3 returned successfully

i sr_nameserver TREXNameServer.cpp(15703) : issueing "/usr/sap/MV1/SYS/global/hdb/install/bin/hdbupdrep -s MV1 --user_store_key=SRTAKEOVER -b"

i sr_nameserver TREXNameServer.cpp(15686) : reconfiguring all services

Check the global.ini and nameserver.ini on the secondary node (the primary will not change)

/usr/sap/MV1/global/hdb/custom/config> cat global.ini

[system_replication]

site_id = 2

mode = sync

actual_mode = primary

site_name = rep

mo-59e3753f1:/usr/sap/MV1/global/hdb/custom/config> cat nameserver.ini

[landscape]

id = 55de6934-1b45-7f0a-e100-00000a6116ac

master = mo-59e3753f1:30001

worker = mo-59e3753f1

active_master = mo-59e3753f1:30001

idsr = 55f36543-7352-8161-e100-00000a61131b

roles_mo-59e3753f1 = worker
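If you would rather check this from a script than by eye, the [system_replication] section can be read with a standard ini parser. The following is a minimal Python sketch, assuming you pass in the content of global.ini only (without the shell prompt line shown above). The function names are made up for this example.

```python
import configparser
from typing import Dict


def replication_settings(global_ini_text: str) -> Dict[str, str]:
    """Return the [system_replication] section of global.ini as a dict."""
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.optionxform = str  # keep parameter names exactly as written
    parser.read_string(global_ini_text)
    if not parser.has_section("system_replication"):
        return {}
    return dict(parser["system_replication"])


def acts_as_primary(global_ini_text: str) -> bool:
    """True if actual_mode is 'primary', as in the global.ini shown above."""
    return replication_settings(global_ini_text).get("actual_mode") == "primary"
```

With the global.ini content shown above, acts_as_primary returns True because actual_mode is set to primary.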

Memory

In order to minimize memory consumption, the following parameters should be set in the secondary system:

1) global.ini/[system_replication]/preload_column_tables = false

2) global.ini/[memorymanager]/global_allocation_limit = <size_of_row_store + 20%>

If the parameter "preload_column_tables" is set to "true" on the secondary side, the secondary system will dynamically load tables into memory according to the preload information shipped from the primary side.

During the takeover procedure, the "global_allocation_limit" should be increased on the secondary side to the same value as on the primary side.
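As a worked example of the "row store + 20%" rule above, the arithmetic can be sketched in a few lines of Python. This assumes the row store size is known in bytes and that global_allocation_limit is entered in MB; please verify the unit for your revision before using the result. The function name is invented for this illustration.

```python
MIB = 1024 * 1024


def secondary_allocation_limit_mb(row_store_bytes: int, headroom_percent: int = 20) -> int:
    """Return the row store size plus headroom, rounded up to whole MB."""
    if row_store_bytes < 0 or headroom_percent < 0:
        raise ValueError("size and headroom must not be negative")
    scaled = row_store_bytes * (100 + headroom_percent)
    # Integer ceiling division avoids floating point rounding surprises.
    return -(-scaled // (100 * MIB))


if __name__ == "__main__":
    # Hypothetical example: a 100 GiB row store.
    print(secondary_allocation_limit_mb(100 * 1024**3))  # prints 122880
```

Remember that this reduced limit only makes sense while the system is a secondary; during a takeover the value should be raised to match the primary, as described above.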

Memory is also consumed on the primary in async mode: log entries are collected in a log buffer and then sent over to the secondary. The amount of memory this buffer takes up is set by:

global.ini -> [system_replication] -> logshipping_async_buffer_size = <size_in_byte>

Tracing

For additional information during a takeover, please run the following:

alter system alter configuration ('nameserver.ini','SYSTEM') SET ('trace','failover')='debug' with reconfigure;

alter system alter configuration ('nameserver.ini','SYSTEM') SET ('trace','ha_provider')='debug' with reconfigure;

Perform the failover test. Once done, you can turn off this tracing:

alter system alter configuration ('nameserver.ini','SYSTEM') UNSET ('trace','failover') with reconfigure;

alter system alter configuration ('nameserver.ini','SYSTEM') UNSET ('trace','ha_provider') with reconfigure;
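If you run failover tests repeatedly, generating the statements avoids forgetting to switch a trace off again. The Python sketch below only builds the SQL text shown above; it does not connect to the database, and how you execute the statements (hdbsql, studio, a driver) is left open. The function name is made up for this example.

```python
from typing import List

# Trace components used for takeover analysis in this post.
TAKEOVER_TRACE_COMPONENTS = ("failover", "ha_provider")


def takeover_trace_statements(enable: bool, level: str = "debug") -> List[str]:
    """Build the SET or UNSET statements for the takeover trace components."""
    statements = []
    for component in TAKEOVER_TRACE_COMPONENTS:
        if enable:
            change = f"SET ('trace','{component}')='{level}'"
        else:
            change = f"UNSET ('trace','{component}')"
        statements.append(
            "alter system alter configuration ('nameserver.ini','SYSTEM') "
            f"{change} with reconfigure;"
        )
    return statements
```

takeover_trace_statements(True) returns the two SET statements and takeover_trace_statements(False) the two UNSET statements listed above.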

For general tracing during replication, you can set global.ini->trace->sr_dataaccess = debug and global.ini->trace->stream = debug in the SAP HANA studio. This adds additional tracing to the indexserver trace.

References

System Replication Configuration Parameters

http://help.sap.com/saphelp_hanaplatform/helpdata/en/0c/d257970d514abd8ddf9ee1f45f3bca/content.htm?f...

Issues Encountered

Misc.

-After SP9, users ran into Alert 79 (Configuration Parameter Mismatch). To resolve this you can set global.ini->system_replication->keep_old_style_alert = false

The ini files will still be mismatched, but the alert will stop appearing. You can check the mismatches manually, or go to /usr/sap/<SID>/global/hdb/custom/config and copy the files from the primary to the secondary. Do not overwrite the global.ini->system_replication and nameserver.ini->landscape sections, as this will break replication. Another option is to run the following SQL script to find the differences:

HANA_Replication_SystemReplication_ParameterDeviations
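For the manual check, a file-level comparison can also be scripted. The following Python sketch compares the text of an ini file from the primary with the same file from the secondary and skips the two sections that must not be aligned. It is not what the SQL script above does internally, just an independent illustration; the function names and the SKIP table are my own, and it only sees what is in the files you pass in.

```python
import configparser
from typing import List, Optional, Tuple

# Sections that are expected to differ between the sites (see text above).
SKIP = {
    "global.ini": {"system_replication"},
    "nameserver.ini": {"landscape"},
}

Deviation = Tuple[str, str, Optional[str], Optional[str]]


def load_ini(text: str) -> configparser.ConfigParser:
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.optionxform = str  # keep parameter names exactly as written
    parser.read_string(text)
    return parser


def ini_deviations(file_name: str, primary_text: str, secondary_text: str) -> List[Deviation]:
    """List (section, key, primary value, secondary value) for differing parameters."""
    skip = SKIP.get(file_name, set())
    prim, sec = load_ini(primary_text), load_ini(secondary_text)
    deviations: List[Deviation] = []
    sections = (set(prim.sections()) | set(sec.sections())) - skip
    for section in sorted(sections):
        p = dict(prim[section]) if prim.has_section(section) else {}
        s = dict(sec[section]) if sec.has_section(section) else {}
        for key in sorted(set(p) | set(s)):
            if p.get(key) != s.get(key):
                deviations.append((section, key, p.get(key), s.get(key)))
    return deviations
```

A value of None in the result means the parameter is missing on that side. Some differences are intentional, for example a reduced global_allocation_limit on the secondary, so the output is a list to review rather than a list to fix blindly.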

Network Related

-‘Communication Channel Closed’ errors: the replication server is either down or there is a networking error. (Check whether the HANA services are running; if they are, talk to your networking team about blocked ports.)

-(DataAccess/impl/DisasterRecoveryProtocol.cpp:3478) Asynchronous Replication Buffer is Overloaded exception throw location:

This error occurs only if you choose ASYNC replication, and it can occur when the network is slow. You can check your network statistics with the following table:

HOST_VOLUME_IO_TOTAL_STATISTICS or run the SQL script

HANA_Replication_SystemReplication_Bandwidth

If you need to resolve this issue before looking into your network, you can do one of the following:

1) Change the replication mode: hdbnsutil -sr_changemode --mode=sync|syncmem

2) Set global.ini->system_replication->logshipping_async_wait_on_buffer_full = false. This will temporarily decouple the synchronization.
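To get a feeling for why a slow network overloads the asynchronous buffer, a back-of-the-envelope estimate helps. The Python sketch below uses a deliberately simple model that is my own assumption, not SAP's internal logic: the buffer fills at the difference between the rate at which log is generated and the rate at which the network ships it. All numbers in the example are hypothetical.

```python
from typing import Optional


def seconds_until_buffer_full(
    buffer_bytes: int, log_rate_bytes_s: float, network_rate_bytes_s: float
) -> Optional[float]:
    """Estimate how long the async buffer lasts; None if the network keeps up."""
    backlog_rate = log_rate_bytes_s - network_rate_bytes_s
    if backlog_rate <= 0:
        return None  # the network drains the buffer at least as fast as it fills
    return buffer_bytes / backlog_rate


if __name__ == "__main__":
    MIB = 1024 * 1024
    # Hypothetical: 64 MiB buffer, 30 MiB/s log generation, 14 MiB/s network.
    print(seconds_until_buffer_full(64 * MIB, 30 * MIB, 14 * MIB))  # prints 4.0
```

In this model a larger logshipping_async_buffer_size only buys time during a burst; if log generation stays above the network throughput, the buffer still fills up eventually.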

Registration fails

Issue:

Unable to contact primary site error: at 30001

Solution:

Check the host name you have entered. Some things to check:

The hostnames are unique

The secondary host name is not a substring of the primary

Do not use the IP address
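The three checks above are easy to run over a list of host names before you attempt the registration. Here is a minimal Python sketch; the function name and the wording of the messages are made up for this example, and the comparison is done case-insensitively as an assumption on my part.

```python
import ipaddress
from typing import Iterable, List


def _is_ip(name: str) -> bool:
    try:
        ipaddress.ip_address(name)
    except ValueError:
        return False
    return True


def hostname_problems(primary_hosts: Iterable[str], secondary_hosts: Iterable[str]) -> List[str]:
    """Return a list of findings; an empty list means none of the three checks failed."""
    primary = [h.strip().lower() for h in primary_hosts]
    secondary = [h.strip().lower() for h in secondary_hosts]
    problems: List[str] = []
    for host in primary + secondary:
        if _is_ip(host):
            problems.append(f"{host}: IP address used instead of a host name")
    for sec in secondary:
        for prim in primary:
            if sec == prim:
                problems.append(f"{sec}: host name is used on both sites")
            elif sec in prim:
                problems.append(
                    f"{sec}: secondary host name is a substring of primary host name {prim}"
                )
    return problems
```

For example, hostname_problems(["hanaprod01"], ["hanaprod"]) reports the substring case, while the two host names used in this post (mo-301545c and mo-59e3753f1) pass all three checks.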

Issue:

f sr_nameserver TREXNameServer.cpp(10651) : remoteHost does not match with any host of the source site. Please ensure that all hosts of source and target site can resolve all hostnames of both sites correctly.

Solution:

Run the following query and check that the host name you entered for registration matches one of the host names it returns:

select name from m_topology_tree where path = '/host/'

Startup of secondary fails

Issue:

Secondary nameserver startup fails after registration of secondary to primary: TREXNameServer.cpp(02876) : source site is not active, cannot start secondary site. Please run hdbnsutil -sr_takeover in case of a disaster or start primary site first. -> stopping instance ..

Solution:

Do not use secondary hostnames that are substring of primary hostnames.

Issue:

nameserver server:30001 not responding.

collecting information ...

error: source system and target system have overlapping logical

hostnames; each site must have a unique set of logical hostnames.

hdbrename can be used to change names;

failed.

Solution:

This is caused by connection timeouts, but if you see it for only a few services, check whether the landscapes are the same.

MultiDB issue

Issue:

"unhandled ltt exception: exception 1000003:

Index 1 out of range [0, 0)" when i check the sr_state after running

Solution:

Resolved in revisions 97.01 and 102

Takeover

Issue:

i LogReplay RowStoreTransactionCallback.cc(00226) : starting master-slave DTX consistency check

e LogReplay RowStoreTransactionCallback.cc(00264) : Slave volume 3 is not available

Solutions:

Resolved in rev 74.04 and 82

Work around:

1) Add following INI parameters as 'false' in indexserver.ini and statisticserver.ini

[transaction]

check_slave_on_master_restart = false

check_global_trans_consistency = false

2) Then restart your system.

Issue:

From time to time the takeover process hangs

The following trace entries are written to the trace file, and there is a time gap of 30 minutes in the trace:

[11596]{-1}[-1/-

w Backup BackupMonitor_TransferQueue.cpp(00048) : Master index server not available!

i PersistenceManag PersistenceManagerImpl.cpp(02359) : Activating periodic savepoint, frequency 300

e TrexNet Channel.cpp(00362) : active channel 33 from 53223 to 127.0.0.1:30001: reading failed with timeout error; timeout=1800000ms elapsed

Solution:

There is no workaround; this issue is fixed in revisions 85.02 and 90.

Issue:

If a takeover is performed on a secondary system where not all tenants could be taken over (e.g. because they were not initialized yet), then the takeover flag is not removed from the topology (/topology/datacenters/takeover/*).

Solution:

Resolved in HANA 10.1

Crash on secondary

indexserver crash at DataRecovery::LoggerImpl::IsSecondaryBackupHistoryComplete on the secondary system.

The bug is fixed as of revision 90 so a permanent solution is available via an upgrade.

In the interim the workaround to the issue is the setting of the parameter [system_replication] ensure_backup_history = false within the global.ini file.

The setting of this parameter disables the maintenance of the backup history. The takeover process is not affected by this parameter but full recovery scenarios after takeover (using old primary data/log backups with new primary log backups) may be impacted.

SAP Notes

1995412 - Secondary site of System Replication runs out of disk space due to closed data shipping connection

1945676 - Correct usage of hdbnsutil -sr_unregister

2057595 - FAQ: SAP HANA High Availability

2100052 - How to disable parameter mismatch alert for system replication

2050830 - Registering a secondary system via HANA Studio fails with error 'remoteHost does not match with any host of the source site'

2021186 - Garbage collection takes a long time during HANA service restart

2075771 - SAP HANA DB: System Replication - Possible persistence corruption on secondary site

1852017 - Error 10061 when connecting SAP Instances to failed over HANA nodes

2063657 - HANA System Replication takeover decision guideline

2062631 - high availability limitation for SAN storage

2129651 - Indexserver crash caused by inconsistent log position when startup

1681092 - Multiple SAP HANA DBMSs (SIDs) on one SAP HANA system

2033624 - System replication: Secondary system hangs during takeover

2081563 - secondary system's replication mode and replication status changed to "UNKNOWN"

2135107 - Log segment for backup history is still missing after reconnect with log shipping
