
Overview

 

 

This blog is part of a series of troubleshooting blogs that each tell the story of how an issue got resolved. I will include the entire troubleshooting process to give you a fully transparent account of what went on. I hope you find these interesting. Please leave feedback in the comments if you like the format or see things I can improve on 🙂

 

Let’s get started!

 

 

Problem Description

 

 

Registering the secondary site for System Replication fails with the error “remoteHost does not match with any host of the source site”.

 

 

Environment Details

 

 

This incident occurred on SAP HANA Revision 73.

 

 

Symptoms

 

 

Running the following command:

 

hdbnsutil -sr_register --name=SITEB --remoteHost=<hostname primary> --remoteInstance=<inst> --mode=<sync mode>

This gives the following error:

 

adding site …, checking for inactive nameserver …,
nameserver <hostname_secondary>:3<inst>01 not responding.,
collecting information …,
Error while registering new secondary site: remoteHost does not match with any host of the source site. please ensure that all hosts of source and target site can resolve all hostnames of both sites correctly.,
See primary master nameserver tracefile for more information at <hostname_primary>, failed.
trace file nameserver_<hostname_secondary>00000.000.trc may contain more error details.]

 

 

SAP HANA Studio showed a similar error.

[Screenshot: studio system replication error]

 

 

Troubleshooting

 

The error message suggests that the primary and secondary systems could not reach each other, or could not resolve each other's host names, when performing sr_register.

 

First, when dealing with System Replication, it is always good to double-check that all the prerequisites have been completed. Refer to the SAP HANA Administration Guide for this (http://help.sap.com/hana/SAP_HANA_Administration_Guide_en.pdf).

 

 

Let’s make sure the network connectivity is fine between the primary master nodes and the secondary master nodes.

 

 

Are the servers able to ping each other?

 

From the O/S, type “ping <hostname>”. Perform this from the primary to secondary and secondary to primary.

 

 

In this customer’s case, ping was successful.
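Here is a minimal sketch of the ping check, assuming bash and the standard `ping` and `timeout` utilities. The helper name `check_reachable` is made up for illustration; run it on the primary against the secondary's host name and on the secondary against the primary's.

```shell
#!/usr/bin/env bash
# Illustrative helper (not part of SAP HANA): ping a host three times
# and report whether at least one reply came back.
check_reachable() {
    local host="$1"
    # timeout bounds the whole call, including slow name resolution
    if timeout 10 ping -c 3 "$host" >/dev/null 2>&1; then
        echo "reachable: $host"
    else
        echo "NOT reachable: $host"
        return 1
    fi
}
```

For example, `check_reachable <hostname_secondary>` on the primary. Keep in mind that a default ping sends small packets, so a successful result only shows that small packets get through.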

 

 

What about firewalls? Could the ports be blocked?

 

 

From the O/S, type “telnet <hostname> <port>”. Perform this from the primary to the secondary and from the secondary to the primary.
The port to use is the SQL port, in this case 3<instance number>15.

 

 

In this customer’s case, the telnet connection was successful as well.
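If telnet is not installed, the same check can be sketched in bash alone. This assumes a bash build with `/dev/tcp` support and the `timeout` utility; `hana_port` and `check_port` are hypothetical helper names, not SAP tools.

```shell
#!/usr/bin/env bash
# Illustrative helpers (not part of SAP HANA).

# Build a port number of the form 3<instance number><suffix>,
# e.g. instance 00 and suffix 15 give the SQL port 30015.
hana_port() {
    local inst="$1" suffix="$2"
    case "$inst" in
        [0-9][0-9]) printf '3%s%s\n' "$inst" "$suffix" ;;
        *) echo "instance number must be two digits" >&2; return 1 ;;
    esac
}

# Try to open a TCP connection through bash's /dev/tcp pseudo-device.
check_port() {
    local host="$1" port="$2"
    if timeout 5 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null; then
        echo "open: $host:$port"
    else
        echo "closed or filtered: $host:$port"
        return 1
    fi
}
```

A call would look like `check_port <hostname> "$(hana_port <inst> 15)"`. Since the error message names the name server port 3<inst>01, it is probably worth testing that port the same way with suffix `01`.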

 

 

 

Comparing the host files between the primary and secondary sites

The customer noticed an error in the /etc/hosts file: the short name was not filled in correctly. They fixed this, but the problem still occurred 🙁
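Comparing two hosts files by eye is error-prone, so here is a rough sketch of one way to do it. It assumes you have copied the other site's /etc/hosts to a local file; `normalize_hosts` is an illustrative helper that strips comments and extra whitespace so that `diff` only shows real differences.

```shell
#!/usr/bin/env bash
# Illustrative helper: reduce a hosts file to comparable lines by
# dropping comments and blank lines, squeezing whitespace, and sorting.
normalize_hosts() {
    sed -e 's/#.*//' \
        -e 's/[[:space:]]\{1,\}/ /g' \
        -e 's/^ //' \
        -e 's/ $//' "$1" \
        | grep -v '^$' \
        | sort
}
```

With the secondary's file saved as, say, /tmp/hosts.secondary (a made-up path), `diff <(normalize_hosts /etc/hosts) <(normalize_hosts /tmp/hosts.secondary)` lists the entries that differ. `getent hosts <hostname>` on each server shows what the resolver actually returns for a given name.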

 

 

 

 

Network Communication and System Replication

SAP Note 1876398 – Network configuration for System Replication in HANA SP6 describes a related problem.

 

 

 

The symptoms in the note match what we are experiencing: “When using SAP HANA Support Package 6, a System Replication secondary system may not be able to establish a connection to the primary system.”

 

The note explains: “Therefore, the listener hears only on the local network. System Replication also uses the infrastructure for internal network communication for exchanging data between the name servers of the primary and the secondary system. Therefore, the name servers of the two systems can no longer communicate with each other in this case.”

 

 

It is worth noting that this is a very common cause of the issue, but in the customer’s case it was not the problem.
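One way to see where a port is listening is to look at the output of `ss -ltn`. The sketch below assumes that "local" in the note means the loopback address, which is my reading and not something the note excerpt states; `listener_scope` is a hypothetical helper that reads `ss -ltn` output on standard input.

```shell
#!/usr/bin/env bash
# Illustrative helper: classify where a TCP port is listening, based on
# `ss -ltn` output read from standard input.
listener_scope() {
    local port="$1"
    awk -v port="$port" '
        $1 == "LISTEN" && $4 ~ (":" port "$") {
            addr = $4
            sub(":" port "$", "", addr)   # strip the trailing :<port>
            found = 1
            # anything other than loopback is reachable from the network
            if (addr != "127.0.0.1" && addr != "[::1]") network = 1
        }
        END {
            if (!found)       print "not listening"
            else if (network) print "network"
            else              print "local only"
        }'
}
```

Usage would be along the lines of `ss -ltn | listener_scope 3<inst>01` on the primary. A result of "local only" would fit the situation the note describes; the note itself remains the reference for the actual fix.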

 

 

 

 

Strace

 

 

We ran an strace; here is some of the output.

 

 

sendto(13,"?\0\50\50\50\60\0\0\0\1\2\6,\0\0\0dr_gethdbversion"..., 86, 0, NULL, 0) = 86
recvfrom(13,0x7f1bd94549264, 8337, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=13,events=POLLIN|POLLPRI}], 1, -1) = 1 ([{fd=13, revents=POLLIN}])
recvfrom(13,"\323\346\v\333\333\333\333\333\333\333\333F\1I\nhdbversionI\0221.00."..., 8337, 0, NULL, NULL) = 52
recvfrom(13,0x7fff22c5277f, 1, 2, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
recvfrom(13,0x7fff22c528bf, 1, 2, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
gettid()                                = 35760
sendto(13,"?\0\32\33\45\33\0\0\0\1\2\0033\0\0\0dr_registerdatac"..., 413, 0,NULL, 0) = 413
recvfrom(13,0x7f1bd9745564, 8337, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=13,events=POLLIN|POLLPRI}], 1, -1

 

 

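The small dr_gethdbversion exchange completes normally, but after the 413-byte dr_registerdatac… request is sent, the final poll() appears to wait forever for a reply. This looks like some sort of packet loss.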
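The post does not show how the trace was taken, so the following is only a sketch under my own assumptions: capture the registration with something like `strace -f -tt -o /tmp/sr_register.strace hdbnsutil -sr_register ...` (the output path is made up), then check whether the last logged call ever returned. `last_call_status` is an illustrative helper.

```shell
#!/usr/bin/env bash
# Illustrative helper: print the last syscall line of an strace log and
# say whether a return value was logged for it.
last_call_status() {
    local last
    # ignore blank lines, keep the final logged call
    last="$(grep -v '^[[:space:]]*$' "$1" | tail -n 1)"
    echo "last call: $last"
    case "$last" in
        *") = "*) echo "status: returned" ;;
        *)        echo "status: still waiting (no return value logged)" ;;
    esac
}
```

On a trace that ends like the one above, `last_call_status /tmp/sr_register.strace` would report the unfinished poll() as still waiting. The pattern match is deliberately simple and could be fooled by unusual string arguments.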

 

 

 

Involving the Networking Team

 

 

 

We involved the customer’s networking team and found that the MTU size was set to 9000. They set the MTU size to 1500, then ran the register step again, and it worked! The registration completed!
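To check the MTU yourself on a Linux host, the value for each interface can be read from sysfs. This is a small sketch, not the networking team's procedure; `list_mtus` is an illustrative helper, and the optional directory argument exists only to make it easy to try out.

```shell
#!/usr/bin/env bash
# Illustrative helper: print "<interface> <mtu>" for every interface
# under a sysfs-style directory (default: /sys/class/net on Linux).
list_mtus() {
    local dir="${1:-/sys/class/net}" f
    for f in "$dir"/*/mtu; do
        [ -r "$f" ] || continue
        printf '%s %s\n' "$(basename "$(dirname "$f")")" "$(cat "$f")"
    done
}
```

Running `list_mtus` without arguments prints one line per interface. `ip link show <interface>` shows the same value, and `ip link set dev <interface> mtu 1500` (as root) changes it until the next network restart.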

 

 

The networking team did not explain exactly what was going on, but we suspect they ran a tcpdump to check for packet loss.
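Without a packet capture, a common way to test whether full-size frames survive the path is to ping with the "don't fragment" flag and a payload sized to the MTU. This is a sketch assuming IPv4 and the Linux iputils `ping`; it is not necessarily what the networking team did, and both helper names are made up.

```shell
#!/usr/bin/env bash
# Largest ICMP echo payload that fits in one unfragmented IPv4 packet:
# MTU minus 20 bytes of IPv4 header minus 8 bytes of ICMP header.
icmp_payload_for_mtu() {
    echo $(( $1 - 28 ))
}

# Send three pings of that size with the "don't fragment" flag set.
probe_mtu() {
    local host="$1" mtu="$2"
    ping -M do -c 3 -s "$(icmp_payload_for_mtu "$mtu")" "$host"
}
```

If `probe_mtu <hostname> 1500` gets replies but `probe_mtu <hostname> 9000` does not, something on the path is not passing jumbo frames. That would also be consistent with small packets, such as a default ping, getting through while larger ones are lost.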

 

 

** The MTU size may need to be changed back later for performance optimization; see SAP Note 2081065 – Troubleshooting SAP HANA Network **

 

 

 

Disclaimer

 

This blog detailed the steps that SAP and the customer took to work through a problem towards a resolution. This may not be the exact resolution for every incident that has the same symptoms. If you are encountering the same issue, you can review these steps with your HANA Administrator and Networking team.


1 Comment


  1. Lars Breddemann

    Hey Jimmy,

    Nice to see more SAP colleagues to type up content here on SCN.

    Some more posts and I can finally retire 🙂

    Since you asked for feedback and there seem to be more blog posts to come here are my thoughts to this one:

    • structure: the structure with the different title sections looks like a SAP note or KBA. Very “form-based” if you will  and very generalized structure. This is a blog post – which should be rather a personal text – so why not structure the text along the story you’re telling?
    • formatting and linking: the text can become a lot more consumer friendly if you link all the sap notes and documentation you reference.
    • analysis steps: while you lay out the general course of analysis, you leave out the “juicy” bits and don’t show the actual calls of “ping”, “strace” – this doesn’t enable users to do it themselves.
    • solution unclear: ok, so the MTU size was set to 9000. So, what? Why are jumbo frames not supported? what is a MTU or jumbo frames and why the heck does it lead to a problem?
      How to check the actual setting of MTU on SLES anyhow?
      Also: where is it documented that MTU needs to be set to 1500? For some SAN systems for HANA (e.g. Hitachi) it’s explicitly documented to set MTU to 9000 (http://www.hds.com/assets/pdf/sap-hana-tailored-datacenter-integration-with-hitachi-vsp.pdf), so it would be good to have some docu for the SAP HANA network interfaces.
    • the disclaimer: this is kind of cheesy and looks as if you don’t actually stand behind your text. It doesn’t help SAP legally, and if the whole text looked less like a SAP note, the risk of mistaking it for official documentation would be lower.

    That would be my thoughts about the text.

    However, writing really is a matter of practice (isn’t everything?). Thus, by all means: keep on writing and publishing!

    I’m looking forward for more posts and interesting bits of knowledge.

    Despite my criticism, I learned something new here: checking the MTU size could help with connection problems with hdbnsutil.

    Cheers,

    Lars
