IndexServer process yellow in sapcontrol
Today I solved an interesting HANA startup issue and wanted to share it with you.
Collecting Symptoms
yellow status
HANA did not start up on the worker node in my scale-out landscape. On Linux, I logged in as sidadm and called
sapcontrol -nr <instance number> -function GetProcessList
This showed me “YELLOW” for the process hdbindexserver.
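To spot such processes at a glance, the output can be filtered for every status other than GREEN. The following is a minimal sketch, assuming the usual comma-separated row layout of GetProcessList (name, description, dispstatus, ...); the function name non_green is made up for illustration and is not part of sapcontrol.

```shell
# Print "name: status" for every process row whose status is not GREEN.
# Assumption: rows look like "name, description, dispstatus, textstatus, ...".
# The header row and any banner lines are skipped because their third
# field is not one of the status colours.
non_green() {
  awk -F', ' '$3 ~ /^(YELLOW|RED|GRAY)$/ { print $1 ": " $3 }'
}

# Usage on a live system (as sidadm):
#   sapcontrol -nr <instance number> -function GetProcessList | non_green
```

On a healthy node the filter prints nothing, which makes it convenient for comparing the nodes of a scale-out landscape.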
long startup
The nameserver process had taken an unusually long time to start.
clean shutdown impossible
HDB stop did not work; I had to run killall -9 hdbindexserver to bring HANA down.
strace
Now I wanted to know what hdbindexserver was doing, so I looked up its PID and attached strace to it:
ps -A | grep hdbindexserver
19416 ? 00:03:13 hdbindexserver
strace -p 19416
Process 19416 attached – interrupt to quit
epoll_wait(17,
According to man 2 epoll_wait, the process is waiting for an I/O event on a file descriptor. That does not help us much; you will see the same output for a sane HANA installation. Let’s move on and find out what its threads and child processes are doing:
strace -ffs 9999 -p 19416
[pid 19438] <… futex resumed> ) = -1 ETIMEDOUT (Connection timed out)
[pid 19436] <… futex resumed> ) = -1 ETIMEDOUT (Connection timed out)
[pid 19438] futex(0x7ffe86826870, FUTEX_WAIT_PRIVATE, 0, {0, 1000000} <unfinished …>
Same here – man 2 futex tells me it is waiting for some value to change. And running this in a sane HANA environment gives me the same output.
Strace cannot help us.
trace files
Proceeding as described here, I took a look at the log files (or “trace files”). I even set saptracelevel to 5 in /usr/sap/<SID>/HDB<NR>/exe/config/global.ini. But the indexserver’s trace file only contained a few lines after startup:
hostname:/usr/sap/<SID>/HDB<NR>/hostname/trace> cat indexserver_hostname.30003.000.trc
[…]
[25327]{-1}[-1/-1] 2014-09-18 11:04:33.548012 i Service_Startup translog.cc(01634) : Activating private log buffering mode
[25327]{-1}[-1/-1] 2014-09-18 11:04:33.548051 i assign TREXIndexServer.cpp(00730) : persistence started with volume 6
[25419]{-1}[-1/-1] 2014-09-18 11:09:34.658969 w Logger SavepointImpl.cpp(02447) : NOTE: BACKUP DATA needed to ensure recoverability of the database
That was all, even with saptracelevel set to 5. However, there was one hint in the nameserver traces:
hostname:/usr/sap/<SID>/HDB<NR>/hostname/trace> cat nameserver_alert_hostname.trc
[…]
[10155]{-1}[-1/-1] 2014-09-18 12:02:03.918610 e TNS TNSClient.cpp(00800) : sendRequest setstarting to master:30001 failed with NetException. data=(S)databaseid=2|host=hostname|port=30001|(I)type=3|(B)watchdog=0|(N)node=host|hostname|nameserver|…|…|…|
[10155]{-1}[-1/-1] 2014-09-18 12:02:03.918647 e NameServer TREXNameServer.cpp(09839) : master nameserver@hostname:30001 not responding. retry in 5 sec
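Hints like this are easy to miss in a long trace file, so it can help to print only the error entries. This is a minimal sketch, assuming every entry follows the layout shown above (bracketed prefix, date, time, then a one-letter level such as i, w or e); the function name trace_errors is hypothetical.

```shell
# Print only error-level ("e") entries from trace lines read on stdin.
# Assumption: whitespace-separated fields are
#   1: [..]{..}[../..] prefix, 2: date, 3: time, 4: level, 5+: component and message
trace_errors() {
  awk '$4 == "e"'
}

# Usage in the trace directory:
#   trace_errors < nameserver_alert_hostname.trc
```

Continuation lines of multi-line messages do not carry the level field, so this filter only shows the first line of each error.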
network
Since the previous symptom points to a network problem, we drill down on that. Indeed, the command lsof -P -p 18994 (where 18994 is the PID of the name server) showed many more established connections on a sane node than on this node. From the server with the error, it was possible to connect to any port on the master name server, but connections to the server with the error failed. The easiest way to find that out is a telnet <host> <portnumber>.
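If telnet is not installed, the same check can be done with bash alone. The following is a minimal sketch, assuming bash with /dev/tcp support and the coreutils timeout command are available; the function name port_open is made up for illustration.

```shell
# Return 0 if a TCP connection to host:port can be opened within 3 seconds,
# non-zero otherwise. Uses bash's built-in /dev/tcp redirection, so the
# inner command is run by bash explicitly.
port_open() {
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Usage: run the check in both directions, once from each host.
#   port_open <host> <portnumber> && echo open || echo blocked
```

Testing in both directions matters here, because a firewall may let outbound connections through while dropping inbound ones.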
Reason
On the server with the error, the firewall was up, which prevented HANA from starting. At least to me this was counter-intuitive, as I had regarded the HANA worker node as the initiator of the communication, and these (outbound) requests were not blocked by the firewall. Evidently the other nodes also need to open connections to the worker, and those inbound connections were blocked.
Solution
The solution was to stop the firewall, in this case on SUSE Linux, with the commands
/etc/init.d/SuSEfirewall2_init stop
/etc/init.d/SuSEfirewall2_setup stop
and then to disable the firewall in the tool that is started with the command
yast2 firewall
Then HDB start worked fine.
More Solutions
If I find more solutions to the problem “IndexServer process yellow in sapcontrol” I will paste them here. You can, too, in the comments section.