This appendix explains how to troubleshoot system problems.
When you encounter a failure, follow this general procedure:
Use df to make sure that each filesystem on a shared disk is mounted by only one node; no filesystem should be mounted by both nodes at the same time.
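For example, you can compare the mounts on both nodes from one place. This is a minimal sketch; the hostnames xfs-ha1 and xfs-ha2 are placeholders, and it assumes rsh access between the nodes:

#!/bin/sh
# List local mounts on each node; a shared filesystem that appears
# in both listings is mounted twice and must be unmounted on one node.
for node in xfs-ha1 xfs-ha2; do
        echo "=== $node ==="
        rsh $node df -l
done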
On both nodes, look for the cause of the failure in /var/adm/SYSLOG or in the file you specified in /etc/syslog.conf (see the section “Messages From IRIS FailSafe” in Chapter 6).
Diagnose and repair the problem using the information in the remainder of this appendix.
If the failure caused a failover and the failed node is in standby state, after repairing the problem you can bring both nodes back to normal state by following the procedure in the section “Moving a Node From Standby State to Normal or Degraded State” in Chapter 6.
If the IRIS FailSafe system does not start, follow these steps:
Make sure that IRIS FailSafe is chkconfig'd on.
Make sure that /var/ha/ha.conf is identical on both nodes in the cluster by entering this command on each node:
# /usr/etc/ha_cfgchksum
0x12a2390e
The checksums output by the commands should be identical.
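To compare the two checksums without reading them by eye, a sketch like the following can be used (the hostname xfs-ha2 is a placeholder; it assumes rsh access to the other node):

#!/bin/sh
# Compare the local ha.conf checksum with the other node's copy.
local_sum=`/usr/etc/ha_cfgchksum`
remote_sum=`rsh xfs-ha2 /usr/etc/ha_cfgchksum`
if [ "$local_sum" = "$remote_sum" ]; then
        echo "ha.conf checksums match: $local_sum"
else
        echo "checksum MISMATCH: local $local_sum, remote $remote_sum"
fi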
Check the format and contents of /var/ha/ha.conf using the ha_cfgverify command:
# /usr/etc/ha_cfgverify
Ensure that there are no errors from ha_cfgverify.
Verify the network interfaces and serial connections using the procedures in the sections “Testing the Public Network Interfaces” and “Testing the Serial Connections” in Chapter 5.
Look at /var/adm/SYSLOG to see what errors are printed out by the IRIS FailSafe daemons. When a node in the IRIS FailSafe cluster starts up normally, the following SYSLOG messages appear:
ha_appmon[6141]: Received XRELEASE_PEER
ha_nc[6135]: Received JOINING
ha_nc[6135]: New state: NC_JOINING
ha_nc[6135]: Received REJOIN
ha_appmon[6141]: Received XACQUIRE
ha_appmon[6141]: Received START_REMMON
ha_nc[6135]: New state: NC_NORMAL
If an IRIS FailSafe cluster consisting of a CHALLENGE S node and a larger CHALLENGE node experiences a complete power failure, the cluster may not start up IRIS FailSafe successfully. The problem is that the CHALLENGE S comes up quickly, gets no response from the other node, times out, and does not start IRIS FailSafe. The larger CHALLENGE node comes up more slowly, detects that IRIS FailSafe is not running on the CHALLENGE S, and goes into standby state.
To start IRIS FailSafe manually on the two nodes, enter this command on each node:
# /etc/init.d/failsafe start
To prevent this problem in the future, modify the configuration file /var/ha/ha.conf as follows:
Increase the value of the long-timeout parameter. A suggested value is 90.
Increase the values of all of the start-monitor-time parameters. A suggested value is 120. It must be larger than the value of the long-timeout parameter.
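For example, the changed parameters might look like this (a sketch only; the blocks of ha.conf that contain these parameters vary with your configuration, and there is one start-monitor-time parameter per monitored application):

long-timeout = 90
...
start-monitor-time = 120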
To install the new configuration file, follow the directions in the section “Upgrade Procedure C” in Chapter 7.
If you see SCSI bus-related errors after configuring the cluster, or nonexistent devices show up in hinv, follow these steps:
Verify that the SCSI host IDs of the two nodes are different (by convention, 0 and 2) by running nvram.
To change the SCSI host ID, enter this command:
# nvram -v scsihostid id
Verify that the SCSI IDs of all disks and other peripherals on the same SCSI bus are distinct from each other and from the SCSI host IDs of the two nodes in the cluster (usually 0 and 2); the example after the note below shows one way to check.
Note: This convention works when only one internal disk per node is used. If a second internal disk is used in either node, that disk's SCSI ID cannot be identical to either SCSI host ID.
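On each node, nvram scsihostid prints the node's SCSI host ID, and hinv -c disk lists the controller and unit number of every disk the node sees; no unit number on the shared bus may equal either host ID. Illustrative output (your IDs and controllers will differ):

# nvram scsihostid
2
# hinv -c disk
Disk drive: unit 9 on SCSI controller 5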
If you cannot access a network interface, check if the interface is configured up and the network interface is working as described below.
The procedure uses this ha.conf fragment as an example:
node xfs-ha1 {
        interface xfs-ha1-ec0 {
                name = ec0
                ip-address = 190.0.2.2
                netmask = 0xffffff00
                broadcast-addr = 190.0.2.255
        }
        ...
}
Follow these steps:
Get information about the interface from this ifconfig command:
# /usr/etc/ifconfig ec0
ec0: flags=c63<UP,BROADCAST,NOTRAILERS,RUNNING,FILTMULTI,MULTICAST>
        inet 190.0.2.2 netmask 0xffffff00 broadcast 190.0.2.255
The UP in the first line of output indicates that the interface is configured up.
If the interface is not configured up, add the interface to the /etc/config/netif.options file as described in the section “Configuring Network Interfaces” in Chapter 3.
If the interface is configured up, check if the network interface is working by entering this command:
# /usr/etc/netstat -I ec0
Check the output for input errors (Ierrs) or output errors (Oerrs) on the interface.
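The output resembles the following (illustrative values); nonzero or growing counts in the Ierrs or Oerrs columns point to a problem with the interface or the network:

Name  Mtu   Network  Address  Ipkts    Ierrs Opkts   Oerrs Coll
ec0   1500  190.0.2  xfs-ha1  1593581  0     954304  0     134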
If you cannot access a node using the network, run netstat -i as described in the section “Getting Information About Interfaces” in Chapter 6 to see if the IP address to which you are trying to connect is configured on one of the node's public interfaces.
Note: You cannot access an IP address associated with a private interface from a node on the public network.
Because high-availability IP addresses are configured by IRIS FailSafe, they might not be configured if IRIS FailSafe is not started. Also, the IP address might have been taken over by the other node.
If you suspect a problem with one of the serial cables connected to the remote power control unit or to the system controller port of the other node (possibly because you have received mail that indicates a problem), use the procedure below to determine if there is a problem. If you suspect a particular cable, perform this procedure on the node whose serial port is connected to the cable. If necessary, follow this procedure on both nodes.
To stop the automatic communication on the serial line, enter this command:
# /usr/etc/ha_admin -m stop hostname
ha_admin: Stopped monitoring the serial connection to hostname
Manually send messages on the serial line by entering this command:
# /usr/etc/ha_spng -i 10 -f reset-tty -d sys-ctlr-type -w password
In this command:

reset-tty is the value of the reset-tty parameter for this node in the configuration file /var/ha/ha.conf.

sys-ctlr-type is the value of the sys-ctlr-type parameter for this node in /var/ha/ha.conf. Omit the -d sys-ctlr-type option if there is no sys-ctlr-type parameter or if it is set to CHAL.

password is the unencrypted password for this node's system controller. Omit the -w password option if the node is a CHALLENGE node or if it is an Origin node with the default system controller password.
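For example, on an Origin node whose ha.conf sets reset-tty to /dev/ttyd2 and sys-ctlr-type to MSC, with a non-default system controller password (all values here are hypothetical):

# /usr/etc/ha_spng -i 10 -f /dev/ttyd2 -d MSC -w mypassword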
Check the return code from the command in step 2:
# echo variable
0
If you are using csh, variable is $status. If you are running sh, variable is $?. The zero output indicates normal operation.
If the return code was zero, restart automatic communication on the serial line by entering this command:
# /usr/etc/ha_admin -m start hostname
ha_admin: Started monitoring the serial connection to hostname
There is no problem with the serial connection, so skip the remainder of this procedure.
If the return code from ha_spng was non-zero, try re-seating the serial cable connectors.
If re-seating the cables didn't work, replace the cable by following the directions in the section “Replacing the Serial Cable” in Chapter 8.
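The whole check can also be scripted. A minimal sketch, assuming the other node is xfs-ha2 and the same hypothetical ha_spng arguments as above:

#!/bin/sh
# Stop automatic monitoring of the serial line, probe it manually,
# and restart monitoring only if the probe succeeds.
/usr/etc/ha_admin -m stop xfs-ha2
if /usr/etc/ha_spng -i 10 -f /dev/ttyd2 -d MSC -w mypassword; then
        echo "serial connection OK; restarting monitoring"
        /usr/etc/ha_admin -m start xfs-ha2
else
        echo "serial probe failed: re-seat or replace the cable"
fi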
To check that a node is licensed for plexing of XLV logical volumes, that XLV logical volumes are visible to the node, and that they are owned by the correct node, follow these steps:
Verify that the node is licensed for plexing and that the plexing software is installed:
# xlv_mgr
xlv_mgr> show config
Allocated subvol locks: 30  locks in use: 7
Plexing license: present
Plexing support: present
Maximum subvol block number: 0x7fffffff
If you have just installed the plexing software, you must reboot the system for the plexing support to be included in the kernel.
Verify that the node sees the volume. In an IRIS FailSafe environment, a node accesses (and mounts filesystems on) only those XLV volumes that it owns. To see all the volumes in the cluster, enter these commands:
xlv_mgr> show all
Volume: vol1 (complete)
Volume: vol2 (complete)
Volume: shared_vol (complete)
xlv_mgr> quit
This command shows all the XLV volumes in the cluster.
If the cluster is using a CHALLENGE RAID storage system, stop the RAID agent by entering this command:
# /etc/init.d/raid5 stop
To see volumes owned by a node, enter this command on that node:
# xlv_assemble -ln
VOL vol2 flags=0x1, [complete]
DATA flags=0x0() open_flag=0x0() device=(192, 4)
PLEX 0 flags=0x0
VE 0 [active]
        start=0, end=687999, (cat)grp_size=1
        /dev/dsk/dks5d9s0 (688000 blks)
PLEX 1 flags=0x0
VE 0 [active]
        start=0, end=687999, (cat)grp_size=1
        /dev/dsk/dks5d9s1 (688000 blks)
This command displays only the volumes owned by this node.
If a volume is owned by the wrong node, first unmount the filesystem mounted on that volume, then change the ownership with xlv_mgr. For example, to make vol2 owned by xfs-ha2:
# umount /vol2
# xlv_mgr
xlv_mgr> change nodename xfs-ha2 vol2
set node name "xfs-ha2" for object "vol2" done
Run xlv_assemble -l on both nodes:
# xlv_assemble -l
Restart the RAID agent, if you stopped it in step 3, by entering this command:
# /etc/init.d/raid5 start
If you are having trouble mounting filesystems on shared disks, manually execute the mount command that the IRIS FailSafe software would execute on each node. Follow these steps:
In your /var/ha/ha.conf file, look for the filesystem block for a filesystem that does not mount successfully. For example:
filesystem fs1 {
        mount-point = /shared
        mount-info {
                fs-type = xfs
                volume-name = vol1
                mode = rw,noauto
        }
}
This filesystem block is for the filesystem mounted at /shared.
Look for the volume block for this filesystem; it is the volume block whose label is vol1, the value of volume-name. For example:
volume vol1 {
        server-node = xfs-ha1
        backup-node = xfs-ha2
        devname = vol1
        devname-owner = root
        devname-group = sys
        devname-mode = 0600
}
Follow the XLV volume ownership procedure earlier in this appendix to make sure that the volume is owned by this node.
Mount the filesystem with this command:
# mount -t xfs -o rw,noauto /dev/xlv/vol1 /shared
Unmount the filesystem with this command:
# umount /shared
Note: Do not omit this step. Data corruption could result.
Repeat steps 1 through 5 for every filesystem that doesn't mount successfully.
On the other node, repeat steps 1 through 6. When following the XLV volume ownership procedure this time, change the owner of the volume to the second node.
If you cannot access a filesystem on a shared disk over NFS, make sure that the network interface is ifconfig'd up and that the filesystem is mounted, as explained in the sections “Trouble Accessing a Network Interface” and “Trouble Mounting Filesystems” in this appendix. Also verify that the IP alias you are using is specified correctly in the configuration file /var/ha/ha.conf.
The procedure below exports the filesystem manually from both nodes and checks to see if a client can access it. It uses this ha.conf fragment as an example:
nfs shared1 {
        filesystem = shared1
        export-point = /shared1
        export-info = rw
        ip-address = 190.0.2.3
}
Follow these steps:
Verify that the value of ip-address is a high-availability IP address. See the section “NFS Blocks” in Chapter 4 for more information.
Verify that /shared1 is mounted. If it is not, follow the instructions in the section “Trouble Mounting Filesystems” in this appendix.
Export the filesystem by entering this command:
# exportfs -i -o rw /shared1
Verify that the filesystem is exported:
# exportfs
/shared1 -rw
From a client, mount the filesystem and verify that you can access it. For example:
# mount -t nfs -o rw 190.0.2.3:/shared1 /shared1
On the client, change to /shared1, verify that the filesystem has been mounted, then unmount it:
# cd /shared1
# ls
# cd /
# umount /shared1
Unexport and unmount the filesystem:
# exportfs -u /shared1
# umount /shared1
Note: Do not omit this step. Data corruption could result.
Repeat steps 2 through 7 with the filesystem mounted on the other node.
This type of error message is normal when nodes in the cluster boot up:
error: could not bind to 190.0.2.1 port 80 (Cannot assign requested address)
error: could not bind to 190.0.2.2 port 80 (Cannot assign requested address)
These messages appear because IRIS FailSafe starts up the Netscape FastTrack Server and the Netscape Enterprise Server before it configures the network interfaces up. They are harmless and can be ignored.
If a Netscape server is not responding after the network and the Netscape server have been installed and configured, make sure that the configured addresses are accessible. Follow these steps:
If you have multiple Netscape servers, verify that the file /etc/config/ns_fasttrack.options exists (it is created by IRIS FailSafe when it starts the Netscape servers):
# ls /etc/config/ns_fasttrack.options
Enable and start the Netscape FastTrack server if used:
# chkconfig ns_fasttrack on
# /etc/init.d/ns_fasttrack start
Enable and start the Netscape Enterprise server if used:
# chkconfig ns_enterprise on
# /etc/init.d/ns_enterprise start
Run a Web browser, such as Netscape, and try to access some Web pages exported by the server.
These messages are written to /var/adm/SYSLOG when the IRIS FailSafe Web monitoring script detects that the Netscape httpd daemons are no longer responding:
Nov 29 11:25:36 6D:xfs-ha2 syslog[775]: /usr/etc/ha_exec: command /usr/etc/http_ping failed error=1. retrying
Nov 29 11:25:36 5B:xfs-ha2 root: Failed to ping local webserver [port: 80]
Nov 29 11:25:36 6D:xfs-ha2 ha_appmon[241]: webserver_xfs-ha1 local monitoring failed: status = 3
Nov 29 11:25:36 6D:xfs-ha2 ha_nc[235]: Received LOCMONFAIL
Notice that the first line reports that /usr/etc/http_ping was executed by /usr/etc/ha_exec and failed. Try to determine why the /usr/etc/http_ping command is failing. Possible reasons are these:
The Netscape server is not configured correctly for a high-availability IP address.
The Netscape server has not been started. Check the webserver entries in /var/ha/ha.conf.
No high-availability IP address has been configured on the interface. Check the interface-pair blocks of /var/ha/ha.conf to verify the configuration.
After you have fixed the problem, try executing the command that failed (/usr/etc/http_ping, in this case) from the command line to verify that the problem has been solved.
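For example, from the Bourne shell (the command is shown without arguments, as it appears in the SYSLOG entry above; a zero exit status means the web server responded):

# /usr/etc/http_ping
# echo $?
0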
IRIS FailSafe uses application-specific failover scripts. If a script fails, the error is logged to /var/adm/SYSLOG. A sample entry might look like the following:
Nov 29 11:07:32 5B:xfs-ha1 root: ERROR: /sbin/xlv_mgr -c "change nodename xfs-ha2 shared_vol"
Nov 29 11:07:32 6D:xfs-ha1 ha_appmon[238]: takeback script exited with status 3
Nov 29 11:07:32 6D:xfs-ha1 ha_nc[232]: process appmon died with status 1
Normally, the other node in the cluster restarts this node and takes over its services. To diagnose what caused the original script failure, follow these steps:
Shut the cluster down by following the procedure in the section “Shutting Down IRIS FailSafe” in Chapter 6.
Re-enter the command that failed, using the same options. In the example above, the command is the call to /sbin/xlv_mgr.
Rerun the appropriate script from /var/ha/actions (in this case, /var/ha/actions/takeback). For example:
# /var/ha/actions/takeback `/usr/etc/ha_cfgchksum`
Make the appropriate fixes, which are most likely to be fixing errors in configuration, and bring the cluster back up.
The ha_admin command times out when the node is transitioning from one state to another. Retry the command after a few minutes.
When IRIS FailSafe starts up, the ha_statd daemon can generate this message:
ha_statd: Error sending message to rpc.statd
ha_statd: rpc.statd is a back revision
If you see this message, follow these steps:
Check to see what options the network status monitor daemon, /usr/etc/rpc.statd, was started with by entering this command:
# ps -ef | grep rpc.statd
   root   211     1  0   May 31  ?        0:00 /usr/etc/rpc.statd -h
   root  4215  4213  2 14:48:57 ttyq3    0:00 grep rpc.statd
Look for the -h option (shown in this output). The error message shown above appears if the -h option wasn't used.
If the -h option wasn't used, add -h to the first line of the file /etc/config/statd.options as described in the section “Configuring NFS Filesystems” in Chapter 3.
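After the edit, the file should contain the option; for example (if the file already contains other options, add -h to the same line):

# cat /etc/config/statd.options
-h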
If you edited /etc/config/statd.options, restart rpc.statd by entering this command:
# /etc/init.d/network start
A false failover occurs when there is a failover for no apparent reason. IRIS FailSafe monitors the high-availability services, the other node in the cluster, the networks, and the connections between the nodes. If a service, a network, or the other node doesn't respond to the monitoring within a timeout period and this monitoring fails a specified number of times (the retry value), a failure is declared and a failover is initiated.
To prevent false failovers, tune the timeout and retry values in the configuration file /var/ha/ha.conf so that they detect true failures promptly but do not declare a failure because of node or network loading. See the section “Preventing False Failovers” in Chapter 4 for more information.
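For reference, the monitoring times and retry counts appear in the action-timer blocks of ha.conf. A hypothetical fragment (the block name and values here are illustrative only; see Chapter 4 for the parameters that apply to your configuration):

action-timer interfaces {
        start-monitor-time = 60
        lmon-probe-time = 60
        lmon-timeout = 120
        retry-count = 5
}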
This section lists errors that might be logged to /var/adm/SYSLOG.
date 6D:node1 syslog[3195]: /usr/etc/ha_exec: /usr/etc/ha_probe -h 127.0.0.1 -p ha_ifa failed(exit status = 1). retrying
date 6D:node1 ha_ifa[3148]: ifa_mon: No packets for ec0
date 6D:node1 ha_ifa[3148]: ifa_mon: Interface failure(s)
date 6D:node1 ha_appmon[3268]: am_send_event LOCMONFAIL to NC
date 6D:node1 ha_nc[1583]: Received LOCMONFAIL
date 6D:node1 ha_nc[1583]: state NORMAL, event LOCMONFAIL (0x11)
date 6D:node1 ha_nc[1583]: Sending mesg: REMOTE_STOPMON (0x30200019) to NC on node2-pr
This series of messages is the result of a local monitoring failure. In this example, the interface agent failed. (Look for the name of the process that failed, ha_ifa in this example, in the arguments of the ha_probe command shown on the first line.) The message No packets for ec0 in the second line indicates that the problem is related to the interface ec0. Check this interface for network problems.
date 6D:node1 ha_appmon[351]: Retrying remote heartbeat monitor, node 2 (lost_count = 2)
date 6D:node1 ha_appmon[351]: lost_heartbeat: heartbeat_2 failed
date 6D:node1 ha_appmon[1692]: am_send_event NOHEARTBEAT to NC
date 6D:node1 ha_nc[319]: Received NOHEARTBEAT
date 6D:node1 ha_nc[319]: state NORMAL, event NOHEARTBEAT (0x12)

This series of messages shows a heartbeat failure: this node stopped receiving heartbeat messages from the other node and reported a NOHEARTBEAT event to the node controller.
date 5B:node1 root: ERROR: /etc/init.d/informix stop
date 6D:node1 ha_appmon[4916]: giveaway script exited with status 3
date 6D:node1 ha_nc[4904]: process appmon died with status 1
date 6D:node1 ha_nc[4904]: node controller exiting
date 6D:node1 ha_hbeat[4901]: process nc died with status 1
date 6D:node1 ha_hbeat[4901]: FailSafe shutting down
This series of messages is an example of a failover script failure. In this example, the giveaway script failed. giveback, takeback, and takeover script failures produce similar SYSLOG output. In this example, the command /etc/init.d/informix stop failed. Check the directory /var/ha/logs for more information on the command failure.
wd95_5: WD95 saw SCSI reset
This message is benign. A CHALLENGE node resets the SCSI bus in the process of starting up. For a dual-hosted system like IRIS FailSafe, the other node on the shared SCSI bus also sees the resets.
checksum mismatch error
Either the configuration files on the two nodes do not match or the configuration file has been changed after the IRIS FailSafe daemons were started.
Make sure that the /var/ha/ha.conf files are the same on both nodes and then reboot both nodes in the cluster.
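A sketch of the fix, run from the node whose ha.conf is known to be good (the hostname xfs-ha2 is a placeholder):

# rcp /var/ha/ha.conf xfs-ha2:/var/ha/ha.conf
# /usr/etc/ha_cfgchksum
# rsh xfs-ha2 /usr/etc/ha_cfgchksum

The two checksums should now match; then reboot both nodes.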
read_conf error
error return from read_config
nc_readconfig: Bad config file ...
The IRIS FailSafe daemons could not start up because the /var/ha/ha.conf file on the node is invalid. In some cases, the error message also indicates the missing or invalid parameters. Run ha_cfgverify and fix any problems identified.
Kill failed (%d) after remote monitor failed or no heartbeat
The local node tried to take over the other node's services but failed because it could not reset the other node. The local node went into the standby state.
Check the serial connection between the nodes. If a remote power control unit is used, check it as well. After the problem has been fixed, use the command ha_admin -rf as described in the section “Moving a Node From Standby State to Normal or Degraded State” in Chapter 6.
assumed power failed
This node detected that the heartbeat and the failsafe mechanism have both failed on the remote node and assumed that the remote node experienced a power failure. This node then took over the other node's services.
Monitoring of reset-tty on xxx failed
The IRIS FailSafe system could not communicate with either the remote power control unit (CHALLENGE S) or the system controller port (all other systems). Although the system continues to function, this situation prevents a node from taking over from another node if another failure occurs.
Check the physical connections. Test the serial connection using the serial connection procedure earlier in this appendix.
mail script failed
The application monitor detected a change in the state of the cluster, tried to send mail to the recipient specified in the /var/ha/ha.conf file, and failed.
Verify the mail configuration on your nodes.
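A quick check is to send a test message by hand to the recipient named in ha.conf (the address shown is a placeholder, and the path to the Mail command may vary on your system):

# echo "IRIS FailSafe mail test" | /usr/sbin/Mail -s "failsafe test" admin@example.com

If the message does not arrive, fix the mail configuration before relying on IRIS FailSafe notifications.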
xlv_plexd[31]: DIOCXLVPLEXCOPY on xxx (dev 192.4) failed: No such device or address
The plex revive operation was interrupted because the underlying device went away. This usually happens because the shared volume has been given away to the other node in the cluster.
This message is benign.
New state: NC_ERROR
The IRIS FailSafe software has detected an internal inconsistency and has suspended operation. The nodes of the cluster are still running.
Verify the software and hardware configuration; reboot this node. Report this problem to Silicon Graphics Technical Support.
lost_heartbeat: heartbeat_<nodenumber> failed
Retrying heartbeat monitor over <public interface>
The cluster is in normal state, but a failure was detected in the private network.
Repair the private network. Follow the instructions in the section “Moving Heartbeat Messages to the Private Network” in Chapter 6 to switch the heartbeat back to the private network.
connect to opsnc failed
In a mixed OPS/IRIS FailSafe configuration, the ha_killd daemon could not connect to the OPS node controller on the Indy workstation that is running the IRISconsole software.
Verify that the cables from the Indy workstation running the IRISconsole are properly attached to the nodes in the IRIS FailSafe cluster, that the Indy workstation is running, and that the opsnc daemon is running on the Indy workstation.
open_tty failed -- cannot monitor node
This message applies to clusters running both Oracle Parallel Server (OPS) and IRIS FailSafe. If you see this message and the cluster is not running OPS, the problem is that the configuration parameter reset-host has been set. Remove reset-host to solve the problem.
If the cluster is running OPS, the IRIS FailSafe system cannot open the serial connection to the remote power control unit or the system controller port of the other node in the cluster.
Verify that the serial connection is hooked up and, if a remote power control unit is used, that it is powered on. Test the serial connection using the serial connection procedure earlier in this appendix.