Troubleshooting the Fault Manager
This post provides information on the key troubleshooting issues you might encounter while using the Fault Manager, and the various diagnostic and monitoring tools you can use to fix them. It also details recommendations on configuring your Fault Manager and SAP Host Agent. The post includes the following:
— Troubleshooting HADR System/Fault Manager Issues
— Miscellaneous Issues
— Recommendations
Troubleshooting HADR System/Fault Manager Issues
When the Root Partition is Full
On one of the hosts running the primary or companion servers, the Fault Manager heartbeat log file (dev_hbeat) may grow very large. As a result, the host’s root partition fills up and the asehostctrl command fails.
Resolution: Use the following command to check the size of the dev_hbeat file and determine whether its growth is causing the failure:
sudo du -sh /usr/sap/hostctrl/work/dev_hbeat
16G /usr/sap/hostctrl/work/dev_hbeat
To resolve this issue, delete the dev_hbeat file. If the dev_hbeat file does not consume much space, check the other files on the partition.
When the ASE Cockpit Frequently Displays Timeout Messages
This indicates that the sapdbctrl calls from the Fault Manager are timing out.
Resolution: Increase the timeout period for sapdbctrl by raising the value of the ha/syb/dbctrl_timeout parameter in the Fault Manager profile file. The default value of the parameter is 30 seconds. After you have made the change, restart the Fault Manager using the restart command:
$SYBASE/FaultManager/bin/sybdbfm restart
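Editing the profile by hand works fine, but the change can also be scripted. The sketch below is illustrative only: set_profile_param is a hypothetical helper, the value 60 is an arbitrary example rather than a recommended setting, and the "key = value" line layout is assumed from the 'ha/syb/trace = 3' example later in this post.

```shell
# Set "key = value" in a profile file: update the line if the key exists,
# otherwise append it. The result goes to stdout so it can be reviewed
# before it replaces the real profile.
set_profile_param() {
    file=$1 key=$2 value=$3
    awk -v k="$key" -v v="$value" '
        {
            # The parameter name is everything before the first "=",
            # with whitespace stripped.
            name = $0
            sub(/=.*/, "", name)
            gsub(/[ \t]/, "", name)
            if (name == k && index($0, "=") > 0) { print k " = " v; found = 1 }
            else print
        }
        END { if (!found) print k " = " v }
    ' "$file"
}

# Example (review SYBHA.PFL.new, then copy it over the original):
#   set_profile_param SYBHA.PFL ha/syb/dbctrl_timeout 60 > SYBHA.PFL.new
```

Writing to a separate file first leaves the original profile untouched if anything looks wrong.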
When Fault Manager Calls to the SAP Host Control Fail
Resolution: Refer to the following logs and search for the errors:
— Fault Manager log (<installation_directory>/log/FaultManager.log)
— SAP Host Agent log (/usr/sap/hostctrl/work/dev_sapdbctrl)
Generally, start with the Fault Manager log and check for the command that has failed. For example, if you suspect that the error is caused by a system heartbeat failure, search the Fault Manager log for TASK = HEARTBEAT_CHECK. Then search the SAP Host Agent log for HEARTBEAT_CHECK at the same timestamp. For a correct diagnosis, ensure that the system clocks of the Fault Manager host and the SAP Host Agent host are in sync. It’s recommended to use trace level 3 (for maximum verbose output) while debugging SAP Host Agent issues.
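The two-log search described above can be sketched as a small helper. This is an assumption-laden example: find_task is a hypothetical name, and the pattern simply tolerates optional spaces around the equals sign, since the exact spacing in your log may differ.

```shell
# Print every line of a log that mentions a given Fault Manager task,
# with line numbers, allowing optional spaces around "=".
find_task() {
    log=$1 task=$2
    grep -nE "TASK *= *${task}" "$log"
}

# Example: compare the timestamps printed for the same task in both logs.
#   find_task <installation_directory>/log/FaultManager.log HEARTBEAT_CHECK
#   grep -n HEARTBEAT_CHECK /usr/sap/hostctrl/work/dev_sapdbctrl
```

The line numbers make it easier to jump to the surrounding entries once a matching timestamp has been found.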
The SAP Host Agent is a software component that can accomplish many lifecycle management tasks, such as operating system monitoring, database monitoring, system instance control and so on. It contains several sub-modules, including the SAP Host Control. The SAP Host Control runs within the SAP Host Agent under the sapadm user. For more information, refer to the SAP Host Agent architectural overview.
Error While Stopping the Fault Manager
While using the stop command to shut down the Fault Manager, you see this message:
fault manager did not change to mode UNKNOWN within 60 seconds. fault manager running, pid = 15922, fault manager overall status = OK, currently executing in mode DIAGNOSE
Resolution: Re-execute the stop command. Don’t stop the Fault Manager using the kill -9 operating system command.
The sybdbfm Utility Displays a “No Fault Manager Found” Message
When using the sybdbfm utility, you may see this message:
no fault manager found for current working directory error: stop failed.
Most likely, you are not running the sybdbfm command from the directory where the profile file and other Fault Manager-generated files (such as sp_sybdbfm and stat_sybdbfm) are located.
Resolution: Re-execute the sybdbfm command from the directory where these files are located.
Replication Status Messages
Even though the primary and companion HADR nodes are healthy (db host and db status are both OK), the sanity report still displays the replication status as one of the following:
— DEAD
— SUSPENDED
— UNKNOWN
— ASYNC_OK
Resolution: Refer to the Replication Server error logs for information.
Fault Manager Could Not Create a Connection to the Host Agent
The Fault Manager error log indicates (as shown below) that the Fault Manager could not create a connection to the Host Agent.
***LOG Q0I=> NiPConnect2: 10.172.162.61:1128: connect (111: Connection refused)
[/bas/CGK_MAKE/src/base/ni/nixxi.cpp 3324]
*** ERROR => NiPConnect2: SiPeekPendConn failed for hdl 6/sock 6
(SI_ECONN_REFUSE/111; I4; ST; 10.172.162.61:1128) [nixxi.cpp 3324]
Resolution: Check if the sapstartsrv process is running by executing the following command:
ps -aef | grep sapstartsrv
Normally, when the SAP Host Agent is started, the sapstartsrv process starts automatically with it. If the sapstartsrv process is not running, start it, then restart the SAP Host Agent.
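One caveat with ps piped into grep is that the grep process itself can show up in the output. A minimal sketch of a stricter check follows. It assumes a Linux-style ps where -o comm= prints only the bare command name, and is_running is a hypothetical helper name.

```shell
# Return success if a process with exactly this command name is running.
# Matching on the command name alone avoids the false hit that
# "ps | grep name" can get from the grep process itself.
is_running() {
    ps -e -o comm= | grep -qxF -- "$1"
}

if is_running sapstartsrv; then
    echo "sapstartsrv is running"
else
    echo "sapstartsrv is not running"
fi
```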
Miscellaneous Issues
- Ensure that you have write permissions for the SAP ASE installation directory, the Fault Manager installation and execution directories, and the /tmp directory. The Fault Manager creates temporary directories under /tmp, and adds temporary files. In the absence of appropriate permissions, SAP Host Agent calls fail. Also, it’s important to prevent the /tmp directory from becoming full. If /tmp is full, the Fault Manager cannot create temporary files. Check the status of /tmp by executing the df -k /tmp command. If this command shows 100 percent usage, make room in /tmp.
- Verify that the GLIBC (GNU C Library) version is 2.7 or later. The Fault Manager is built with GLIBC version 2.7, therefore the hosts running it must use GLIBC version 2.7 or later. Use the following command to check the GLIBC version:
ldd --version
- Make sure you enter the correct passwords for sa, DR_admin, and sapadm.
- Set the appropriate value for file descriptors: A file descriptor is an integer number that uniquely represents an opened file in the operating system. Verify that the user limit value (file descriptor) for open files is set to an adequate number (4096 or more) before you configure the HADR system for large databases.
To determine the number of file descriptors to which your system is set, enter the following command:
  - For C-shell:
limit descriptors
  - For Bourne shell:
ulimit -n
To change the value for the file descriptor (for instance, 4096), enter:
  - For C-shell:
limit descriptors 4096
  - For Bourne shell:
ulimit -n 4096
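The GLIBC and file descriptor checks above can be combined into a quick preflight script. This is a sketch under stated assumptions: version_ge and fd_limit_ok are hypothetical helper names, and the script assumes that the last field on the first line of ldd --version output is the GLIBC version, which holds for common glibc builds but is not guaranteed everywhere.

```shell
# True when version $1 is at least version $2, comparing major and minor
# numerically (so 2.11 counts as newer than 2.7, unlike a string comparison).
version_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        split(a, x, "."); split(b, y, ".")
        if (x[1] + 0 != y[1] + 0) exit !(x[1] + 0 > y[1] + 0)
        exit !(x[2] + 0 >= y[2] + 0)
    }'
}

# True when the open-file limit of the current shell is at least $1.
fd_limit_ok() {
    limit=$(ulimit -n)
    [ "$limit" = "unlimited" ] || [ "$limit" -ge "$1" ]
}

# Assumption: the last field of the first output line is the GLIBC version.
glibc=$(ldd --version 2>/dev/null | awk 'NR == 1 { print $NF }')
version_ge "${glibc:-0}" 2.7 || echo "GLIBC ${glibc:-not found} is older than 2.7"
fd_limit_ok 4096 || echo "open-file limit $(ulimit -n) is below 4096"
```

The script prints nothing when both checks pass. The numeric comparison matters because a version such as 2.11 sorts before 2.7 when compared as text.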
Recommendations
Increase the Trace Level for Troubleshooting
Set the trace level (essentially, the level of detail in the error log) to its highest level on the SAP Host Agent and the Fault Manager so your error log output is as detailed as possible.
- For the Fault Manager: Set the trace level with the ha/syb/trace parameter in the profile file (SYBHA.PFL), then restart the Fault Manager (using the $SYBASE/FaultManager/bin/sybdbfm restart command). For example, to get the maximum verbose information, set the trace level to 3 by adding the line ‘ha/syb/trace = 3’ to the SYBHA.PFL file. The SYBHA.PFL file is located in the installation directory of the Fault Manager on all platforms. Increasing the trace level increases the number of log entries, and may increase the file size. You may choose one of the following values for the ha/syb/trace parameter:
  - 1 – Basic verbose output
  - 2 – Medium verbose output
  - 3 – Maximum verbose output
- For the SAP Host Agent: Set the trace level in the profile file, and restart the SAP Host Agent using the saphostexec program. For example, to get the maximum verbose output, add the line service/trace = 3 to the host profile (/usr/sap/hostctrl/exe/host_profile). The profile file is located in:
  - (UNIX): /usr/sap/hostctrl/exe/host_profile
  - (Windows): %ProgramFiles%\SAP\hostctrl\exe\host_profile
Hi,
I am having an error installing Fault Manager:
– Root user
– ldd version 2.11
-Linux SUSE 12
– hostagent 721 patch23
-Installing in a different host than ERP1 and ERP2 (Primary and standby SAP Servers)
-ulimit 4096
ERROR
2017 01/27 17:37:54.824 (11876) loading executable /usr/sap/SYB/SYS/exe/run/sybdbfm for heartbeat to SAPHostAgent tools.
2017 01/27 17:37:54.824 (11876) upload executable /usr/sap/SYB/SYS/exe/run/sybdbfm.
2017 01/27 17:37:54.824 (11876) ERROR: cannot open file /usr/sap/SYB/SYS/exe/run/sybdbfm for read.
2017 01/27 17:37:54.824 (11876) bootstrap failed.
2017 01/27 17:37:57.825 (11876) start bootstrap.
I don't understand why it is asking for an SAP directory. Also, the SID for ERP is PRD, not SYB.
Finally, I couldn't find fault_manager_responses.txt in the $SYBASE/log directory (secondary server).
Any clue?
Hello,
This blog refers to HADR for SAP ASE for custom applications, so it does not apply to HADR for SAP ASE for Business Suite.
Note that in a Business Suite environment, Fault Manager is currently not supported, as stated in SAP Note 1891560 - Disaster Recovery Setup with SAP Replication Server :
General Limitations for SAP Replication Server 15.7.1:
SAP Netweaver Business Warehouse (BW) or systems using SAP BW features like SAP SCM APO, SAP SEM, and SAP Solution Manager are currently not supported.
SAP Replication Server 15.7.1 SP200 and higher is not supported for SAP ASE 15.7.
SRS SP200 and higher requires SAP ASE 16.0 as a minimum. The versions that are supported for SAP ASE 16.0 are specified below.
Important: Fault Manager is not supported for HADR for Business Suite environments.
Regards,
Cris
Hi Cris,
I didn't notice this limitation when I checked the note.
I have these versions:
ASE SRS
In ASE Cockpit I can see both servers, with status green (primary) and grey (standby), and replication works fine. But if Fault Manager is not supported, what tool should I use? What's next?
Regards.
Hello Fernando,
There are still issues preventing Fault Manager from being supported for the Business Suite, even though ASE 16 SP02 PL05 HF1 is supported for HADR.
DBA Cockpit is the recommended tool when running SAP applications on SAP ASE, and its HADR options have been enhanced. ASE Cockpit has not been specifically designed for ASE for Business Suite, and customers running SAP applications on ASE are usually not even aware of its existence 🙂
The advantage of the Fault Manager is that it monitors the health of the components of an HADR environment (ASE, SRS, RMA) for you and takes actions automatically depending on their health. Without it, you can still set up your HADR environment, monitor it, and take the actions needed yourself.
HTH
Regards,
Cris
Hi Fernando,
Can you please tell me which tool or software you use for automatic failover with ASE HADR for Business Suite? I have also configured ASE HADR for Business Suite and am looking for a mechanism to fail over automatically.
Thanks
Abu
Great blog!
SAP ASE 16.0 is required as a minimum for SRS SP200
Thanks
Hi,
I am having trouble getting the Fault Manager to work.
I have 3 Windows 2008 R2 VMs (HADR1, HADR2 and HADR3). I have ASE 16 SP03 PL03 installed on HADR1 and HADR3, in HADR mode. sap_status path indicates that the path is active for all databases and that replication can occur.
I installed the Fault Manager on HADR2, but when I start it I get errors on HADR1 and HADR3.
On HADR1 - dev_sybdbfm
2018 01/31 11:54:22.726 (1520) start HeartBeatClient.
2018 01/31 11:54:22.726 (1520) sybdbfm exe directory is C:\Program Files\SAP\hostctrl\exe\ASE1
2018 01/31 11:54:22.726 (1520) check_create: 0
2018 01/31 11:54:22.726 (1520) HeartBeatClient started.
2018 01/31 11:54:22.726 (1520) starting heartbeat thread (client) for HADR2:13777.
2018 01/31 11:54:25.737 (1520) start H2HServer.
2018 01/31 11:54:25.737 (1520) starting heartbeat server at: HADR1:13797.
2018 01/31 11:54:25.737 (1520) starting heartbeat thread (server) for HADR1:13797.
2018 01/31 11:54:25.737 (1520) thread status 1 at 217EC7C.
2018 01/31 11:54:25.737 (1520) heartbeat server started.
2018 01/31 11:54:25.737 (1520) HeartBeatServer started.
2018 01/31 11:54:28.748 (1520) HeartBeatSanityCheck: start.
2018 01/31 11:54:28.748 (1520) dbctrl call cnt: 0 .
2018 01/31 11:54:28.748 (1520) executing: asehostctrl -function GetDatabaseStatus -dbname HA1 -dbtype syb -dbinstance ASE1 .
2018 01/31 11:54:28.748 (1520) starting control call.
2018 01/31 11:54:30.183 (1520) Error: Database not found
2018 01/31 11:54:30.183 (1520) dbctrl call cnt reset: 0 .
2018 01/31 11:54:30.183 (1520) control call ended.
2018 01/31 11:54:30.183 (1520) call_saphostctrl completed ok.
2018 01/31 11:54:30.183 (1520) check saphostctrl running (F00)....
2018 01/31 11:54:30.183 (1520) terminateThread (F00).
2018 01/31 11:54:30.183 (1520) ThrExitCode returned (0).
2018 01/31 11:54:30.183 (1520) call exited (exit code 1).
2018 01/31 11:54:30.183 (1520) terminateThread (F00) done.
2018 01/31 11:54:30.183 (1520) ThrDetach returned (5).
2018 01/31 11:54:30.183 (1520) terminate ctrl thread done.
2018 01/31 11:54:30.183 (1520) saphostctrl executed.
2018 01/31 11:54:30.183 (1520) dbctrl call cnt reset 2: 0 .
2018 01/31 11:54:30.183 (1520) database is UNKNOWN.
On HADR1 - dev_sapdbctrl
Wed Jan 31 11:32:52 2018
[PID 1820] ODBC driver for Sybase Adaptive Server is not installed.
[PID 1820] DBConfigPath is C:\SAP\sapdbctrl-config
[PID 1820] LiveUpdateOption Status
[PID 1820] ODBC driver for Sybase Adaptive Server is not installed.
[PID 1820] Wed Jan 31 11:32:52 2018 INTERNAL_ERROR sybServer.cpp:5016:SYB_Server::isExisting DESCRIPTION: Cfg file not found: C:\SAP\HA1.cfg LAST ERROR: (0) : The operation completed successfully.
[PID 1820] SAP ASE Server instance ASE1 does not exist.
[PID 1820] LiveUpdateOption retrieving db status failed
sapparam: sapargv(argc, argv) has not been called!
sapparam(1c): No Profile used.
sapparam: SAPSYSTEMNAME neither in Profile nor in Commandline
On HADR3 - dev_sybdbfm
2018 01/31 11:54:27.035 (1668) start HeartBeatClient.
2018 01/31 11:54:27.035 (1668) sybdbfm exe directory is C:\Program Files\SAP\hostctrl\exe\ASE1
2018 01/31 11:54:27.035 (1668) check_create: 0
2018 01/31 11:54:27.035 (1668) HeartBeatClient started.
2018 01/31 11:54:27.035 (1668) starting heartbeat thread (client) for HADR2:13787.
2018 01/31 11:54:30.046 (1668) start H2HClient.
2018 01/31 11:54:30.046 (1668) HeartBeatClient started.
2018 01/31 11:54:30.046 (1668) starting heartbeat thread (client) for HADR1:13797.
2018 01/31 11:54:33.103 (1668) HeartBeatSanityCheck: start.
2018 01/31 11:54:33.103 (1668) dbctrl call cnt: 0 .
2018 01/31 11:54:33.103 (1668) executing: asehostctrl -function GetDatabaseStatus -dbname HA1 -dbtype syb -dbinstance ASE1 .
2018 01/31 11:54:33.103 (1668) starting control call.
2018 01/31 11:54:34.492 (1668) Error: Database not found
2018 01/31 11:54:34.492 (1668) dbctrl call cnt reset: 0 .
2018 01/31 11:54:34.492 (1668) control call ended.
2018 01/31 11:54:34.492 (1668) call_saphostctrl completed ok.
2018 01/31 11:54:34.492 (1668) check saphostctrl running (FA0)....
2018 01/31 11:54:34.492 (1668) terminateThread (FA0).
2018 01/31 11:54:34.492 (1668) ThrExitCode returned (0).
2018 01/31 11:54:34.492 (1668) call exited (exit code 1).
2018 01/31 11:54:34.492 (1668) terminateThread (FA0) done.
2018 01/31 11:54:34.492 (1668) ThrDetach returned (5).
2018 01/31 11:54:34.492 (1668) terminate ctrl thread done.
2018 01/31 11:54:34.492 (1668) saphostctrl executed.
2018 01/31 11:54:34.492 (1668) dbctrl call cnt reset 2: 0 .
2018 01/31 11:54:34.492 (1668) database is UNKNOWN.
On HADR3 - dev_sapdbctrl
Wed Jan 31 11:54:26 2018
[PID 1616] DBConfigPath is C:\SAP\sapdbctrl-config
[PID 1616] LiveUpdateOption LUT_Start_Heartbeat
[PID 1616] Wed Jan 31 11:54:26 2018 INTERNAL_ERROR sybProcess.cpp:754:SybProcess::readInfo DESCRIPTION: OpenProcess failed for PID 4 LAST ERROR: (0) : The operation completed successfully.
[PID 1616] Wed Jan 31 11:54:26 2018 INTERNAL_ERROR sybProcess.cpp:754:SybProcess::readInfo DESCRIPTION: OpenProcess failed for PID 3848 LAST ERROR: (87) : The parameter is incorrect.
[PID 1616] heartbeat started.[PID 1616]
Wed Jan 31 11:54:29 2018
[PID 1616] Wed Jan 31 11:54:29 2018 INTERNAL_ERROR sybProcess.cpp:754:SybProcess::readInfo DESCRIPTION: OpenProcess failed for PID 4 LAST ERROR: (5) : Access is denied.
[PID 1616] LiveUpdateOption start Heartbeat ok.
[PID 2284]
Wed Jan 31 11:54:34 2018
[PID 2284] ODBC driver for Sybase Adaptive Server is not installed.
[PID 2284] Wed Jan 31 11:54:34 2018 INTERNAL_ERROR sybServer.cpp:5016:SYB_Server::isExisting DESCRIPTION: Cfg file not found: C:\SAP\HA1.cfg LAST ERROR: (0) : The operation completed successfully.
[PID 2284] SAP ASE Server instance ASE1 does not exist.
[PID 2284] *** ERROR => 'Get database status' failed: Database not found [sapdbctrl.cp 3690]
[PID 696]
Wed Jan 31 11:54:40 2018
[PID 696] lookup of secstore path failed.
[PID 696] ODBC driver for Sybase Adaptive Server is not installed.
[PID 696] DBConfigPath is C:\SAP\sapdbctrl-config
[PID 696] LiveUpdateOption Status
[PID 696] ODBC driver for Sybase Adaptive Server is not installed.
[PID 696] Wed Jan 31 11:54:40 2018 INTERNAL_ERROR sybServer.cpp:5016:SYB_Server::isExisting DESCRIPTION: Cfg file not found: C:\SAP\HA1.cfg LAST ERROR: (0) : The operation completed successfully.
[PID 696] SAP ASE Server instance ASE1 does not exist.
[PID 696] LiveUpdateOption retrieving db status failed
sapparam: sapargv(argc, argv) has not been called!
sapparam(1c): No Profile used.
sapparam: SAPSYSTEMNAME neither in Profile nor in Commandline
On HADR2 - dev_sybdbfm
2018 01/31 11:54:25.074 (2708) read password from secstore rc (0)
2018 01/31 11:54:25.074 (2708) executing: asehostctrl -host HADR3 -user sapadm ******** -function LiveDatabaseUpdate -dbname HA1 -dbtype syb -dbinstance ASE1 -timeout 30 -updatemethod Execute -updateoption TASK=HEARTBEAT_STARTUP .
2018 01/31 11:54:25.074 (2708) starting control call.
2018 01/31 11:54:29.488 (2708) Webmethod returned successfully
2018 01/31 11:54:29.488 (2708) Operation ID: 000C29E6D6E41ED881CEA33293C7FC5B
2018 01/31 11:54:29.488 (2708) ----- Response data ----
2018 01/31 11:54:29.488 (2708) LogMsg/Text=Executing LiveDatabaseUpdate
2018 01/31 11:54:29.488 (2708) START_HEARTBEAT=ok
2018 01/31 11:54:29.488 (2708) LogMsg/Text=LiveDatabaseUpdate successfully executed
2018 01/31 11:54:29.488 (2708) ----- Log messages ----
2018 01/31 11:54:29.488 (2708) Info: saphostcontrol: Executing LiveDatabaseUpdate
2018 01/31 11:54:29.488 (2708) Info: saphostcontrol: LiveDatabaseUpdate successfully executed
2018 01/31 11:54:29.488 (2708) dbctrl call cnt reset: 0 .
2018 01/31 11:54:29.488 (2708) control call ended.
2018 01/31 11:54:29.488 (2708) call_saphostctrl completed ok.
2018 01/31 11:54:29.488 (2708) check saphostctrl running (9BC)....
2018 01/31 11:54:29.488 (2708) terminateThread (9BC).
2018 01/31 11:54:29.488 (2708) ThrExitCode returned (0).
2018 01/31 11:54:29.488 (2708) call exited (exit code 0).
2018 01/31 11:54:29.488 (2708) terminateThread (9BC) done.
2018 01/31 11:54:29.488 (2708) ThrDetach returned (5).
2018 01/31 11:54:29.488 (2708) terminate ctrl thread done.
2018 01/31 11:54:29.488 (2708) saphostctrl executed.
2018 01/31 11:54:29.488 (2708) dbctrl call cnt reset 2: 0 .
2018 01/31 11:54:29.488 (2708) heartbeat: success.
2018 01/31 11:54:29.488 (2708) heartbeat client started.
2018 01/31 11:54:29.488 (2708) SimpleFetch: select convert(integer, convert(varchar,@@version_number) + substring(convert(varchar,@@sbssav),6,2) + substring(convert(varchar,@@sbssav),9,2))
2018 01/31 11:54:29.551 (2708) SimpleFetch out: 160000303
2018 01/31 11:54:29.551 (2708) FM will acknowledge ASYNC request
2018 01/31 11:54:29.551 (2708) SimpleFetch: sp_configure 'FM Enabled',1
2018 01/31 11:54:29.660 (2708) SimpleFetch out: FM Enabled
2018 01/31 11:54:29.707 (2708) Config option 'FM Enabled' changed on Primary ASE to 1
2018 01/31 11:54:29.707 (2708) SimpleFetch: sp_configure 'FM Enabled',1
2018 01/31 11:54:34.730 (2708) SQLGetDiagRec 0
2018 01/31 11:54:34.730 (2708) ERROR in function SimpleFetch (1427) (SQLExecDirect failed): (30149) [HYT00] [SAP][ASE ODBC Driver]Th
2018 01/31 11:54:34.730 (2708) ERROR in function SimpleFetch (1427) (SQLExecDirect failed): (30086) [HY008] [SAP][ASE ODBC Driver]Operation Canceled.
2018 01/31 11:54:34.730 (2708) Failed to execute statement sp_configure 'FM Enabled',1 on Standby
2018 01/31 11:54:34.730 (2708) bootstrap finished.
2018 01/31 11:54:34.730 (2708) *** sanity check report (1)***.
2018 01/31 11:54:34.730 (2708) node 1: server HADR1, site SITE01.
2018 01/31 11:54:34.730 (2708) db host status: UNKNOWN.
2018 01/31 11:54:34.730 (2708) db status DB INDOUBT hadr status PRIMARY.
2018 01/31 11:54:34.730 (2708) node 2: server HADR3, site SITE02.
2018 01/31 11:54:34.730 (2708) db host status: UNKNOWN.
2018 01/31 11:54:34.730 (2708) db status DB INDOUBT hadr status STANDBY.
2018 01/31 11:54:34.730 (2708) replication status: UNKNOWN.
2018 01/31 11:54:34.730 (2708) insert status to fault manager status table.
2018 01/31 11:54:34.730 (2708) omitting insert into fault manager status table as db is not in status ok.
2018 01/31 11:54:34.730 (2708) omitting insert into fault manager status table as db is not in status ok.
2018 01/31 11:54:34.730 (2708) sybdbfm server mode.
2018 01/31 11:54:34.730 (2708) Virtual memory used by current process (bytes): 21737472
Any idea what the problem could be? From those error messages I have no clue what the underlying issue is.
Best regards,
Juan Vega