By Steve Soumah

SAP HANA Hands on tests (part 4.2): HANA replication failback

Where am I at?

Previously I performed a takeover test, killing the primary system, which was hdbtest1.

The hdbtest2 node took over the HDB and I could restart my ECC application (it needed a restart due to my lab configuration; see my previous blog: SAP HANA Hands on tests (part 4.1): HANA replication takeover).

Now, at first I wanted to perform the failback only, but in the end I'll also perform a "Near Zero DownTime" update of the HANA platform and then a failback.

Start situation:

[Screenshot: /wp-content/uploads/2015/08/failback0_781476.png]

HDB is running on Node2 as the primary instance.

Node1 failed a few days ago and is therefore no longer in sync.

The HDB software is still installed on node1 (hdbtest1).

I'm currently running the following version of HDB: 1.00.093.00.1424770727 (fa/newdb100_rel)

Target:

  • HDB at version 1.00 revision 97 (SPS 09).
  • Node1 back as primary.
  • Node2 back as standby.

How:

Basically, it should take these main steps:

  • Get my HANA node1 back into the configuration as a standby node.
  • Perform the software update on this standby HDB node (hdbtest1).
  • Take over: node1 is back as the primary node.
  • Update HDB node2.
  • Put HDB node2 back into the configuration as the standby host.
  • Perform the required post-update steps.
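The plan above can be sketched as a command sequence. This is only a dry run that echoes each step rather than executing it; the hostnames, the instance name HTL and the site name HTLPRIM come from this lab, while the site name HTLSEC and the hdblcm call are my own placeholders and must be adapted to your landscape:

```shell
#!/bin/sh
# Dry-run sketch of the failback + Near Zero DownTime update plan.
# run() only echoes each command; drop the echo to execute for real (as the htladm OS user).
run() { echo "+ $*"; }

# 1. Register node1 as a secondary of the current primary (node2)
run hdbnsutil -sr_register --remoteHost=hdbtest2 --remoteInstance=HTL --mode=syncmem --name=HTLPRIM --force_full_replica

# 2. Update the HDB software on the standby node hdbtest1 (hdblcm from the extracted update media)
run ./hdblcm --action=update

# 3. Take over: node1 becomes the primary again
run hdbnsutil -sr_takeover

# 4. Update node2, then register it back as a standby of node1 (site name is a placeholder)
run hdbnsutil -sr_register --remoteHost=hdbtest1 --remoteInstance=HTL --mode=syncmem --name=HTLSEC
```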

Where to get some information:

First of all: RTNG! (Read The Notes and Guides.)

There are lots of notes and guides around for this topic.

The main ones I followed:

I also read this excellent blog about HANA SPS updates:

Let's go!

Get HANA node1 back in the game!

Having a look at the HDB replication statuses, this is the situation after my previous simulated crash and takeover:

hdbtest2:

[Screenshot: /wp-content/uploads/2015/08/failback2_781983.png]

hdbtest1:

[Screenshot: /wp-content/uploads/2015/08/failback2_bis_781984.png]

As you can see, the situation is not clean: two systems claim to be primary. So the first step is to clean this up.

At first I thought I would have to clean up using unregister commands and then add the host back in.

It turns out that you only need to register the system again in the configuration.

So all I had to do was register the system again using the hdbnsutil tool, adding the option --force_full_replica:

hdbnsutil -sr_register --remoteHost=hdbtest2 --remoteInstance=HTL --mode=syncmem --name=HTLPRIM --force_full_replica

[Screenshot: /wp-content/uploads/2015/08/failback3_782166.png]

The hdbtest1 node is now in syncmem mode instead of primary.

hdbtest2 is aware of the change in the topology: hdbtest1 is now seen as the secondary_host:

[Screenshot: /wp-content/uploads/2015/08/failback4_782167.png]

Now we restart the hdbtest1 node.
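The restart itself is the standard instance start. A dry-run sketch (commands are only echoed; they would be run as the htladm OS user, and the sapcontrol instance number 00 is an assumption for this lab):

```shell
# Dry-run: echo the commands instead of executing them (as the htladm OS user).
run() { echo "+ $*"; }

# Start the instance; a registered secondary reconnects and resyncs on startup
run HDB start

# Check that all services come up (instance number 00 assumed)
run sapcontrol -nr 00 -function GetProcessList
```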

The system replication is triggered on startup:

[Screenshot: /wp-content/uploads/2015/08/failback5_782222.png]
[Screenshot: /wp-content/uploads/2015/08/failback6_782223.png]

The replication is back: hdbtest2 is replicating to hdbtest1:

[Screenshot: /wp-content/uploads/2015/08/failback7_782232.png]


The replication is back online.
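The replication status can also be checked from the command line rather than the Studio screens above, either with hdbnsutil or from SQL via the monitoring view M_SERVICE_REPLICATION. A dry-run sketch (commands are only echoed; the instance number 00 and the hdbsql secure-store key MYKEY are placeholders):

```shell
# Dry-run: echo the status commands instead of executing them (as the htladm OS user).
run() { echo "+ $*"; }

# Replication state as seen by the name server (site names, mode, peers)
run hdbnsutil -sr_state

# The same information from SQL, queried on the primary side
run hdbsql -i 00 -U MYKEY "SELECT HOST, SECONDARY_HOST, REPLICATION_MODE, REPLICATION_STATUS FROM M_SERVICE_REPLICATION"
```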


That said, for the failback to be complete, I will return to the initial situation while updating HANA using the Near Zero DownTime update concept:

  • Node1 as primary
  • Node2 as standby


Next steps will be described here:

SAP HANA Hands on tests (part 4.3): Near Zero DownTime update using replication


      2 Comments

      Ilke Tutku Senol

      Hello Steve,

      Is there any way to stop replication to the DR side for a while, open the DR DB for a test case, and then resume replication afterwards? We do this with Oracle Data Guard but couldn't find a way in HANA.

      The only way seems to be to cancel replication, do the test, and restart replication from the start.

      Regards

      Tutku

      Former Member
      Hi Tutku,

      Did you find any additional information on your test case? We are in the same scenario: we have to test our DR systems periodically, and we DON'T want to fail back.

      So far in my tests I have been doing the same:

      1. Put up network blocks to prevent the Primary site from talking to the Secondary DR site
      2. Run takeover on the Secondary DR
      3. Bring up the Secondary DR system and application and have users test
      4. Bring the Secondary DR system down
      5. Drop the Secondary DR database
      6. Uninstall replication on the Primary database (seems to require a restart, which is not really a good option)
      7. Install the Secondary DR database
      8. Turn off the network blocks
      9. Reinstall replication on both systems
      10. Do an initial sync from Primary to Secondary DR
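      For reference, steps 1 and 2 of a list like this can be sketched as commands. This is a dry run that only echoes each step; the source IP is a documentation placeholder, iptables would need root on the DR host, and the takeover would run as the sidadm OS user:

```shell
# Dry-run: echo each command instead of executing it.
run() { echo "+ $*"; }

# 1. Block traffic from the Primary site (placeholder source IP, as root on the DR host)
run iptables -A INPUT -s 192.0.2.10 -j DROP

# 2. Promote the DR system (as the sidadm OS user on the DR host)
run hdbnsutil -sr_takeover
```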

      Let me know if you or anyone else comes up with a better way. All the tests assume a failback situation; we definitely do NOT want that, or any risk of failing data from the DR back to the Primary.

      We also don't want to risk the Primary site becoming unresponsive. We once had saturated bandwidth combined with a global.ini mismatch, which caused production to slow to a crawl while trying to send changes to the Secondary site.

      Still debating whether to use the built-in replication for this task or stick with a homegrown rsync and log-shipping solution instead.