Appendix B. System Troubleshooting

This appendix explains how to troubleshoot problems on an IRIS FailSafe system.

The major sections in this appendix are as follows:

  • General Troubleshooting Procedure
  • IRIS FailSafe System Does Not Start
  • No IRIS FailSafe After a Power Failure
  • Duplicate SCSI IDs
  • Trouble Accessing a Network Interface
  • Trouble Accessing a Node Over the Network
  • Trouble With the Serial Connection
  • Trouble With Logical Volumes
  • Trouble Mounting Filesystems
  • Trouble Accessing a Filesystem Over NFS
  • Netscape Server Warning Messages at Startup
  • Netscape Server Not Responding
  • Netscape Daemons Not Responding to Monitoring
  • Failover Script Fails
  • ha_admin Times Out
  • Error Message From ha_statd
  • False Failovers
  • Errors Logged to /var/adm/SYSLOG

General Troubleshooting Procedure

When you encounter a failure, follow this general procedure:

  1. Use df to make sure that all filesystems on shared disks are mounted only on one node; no filesystem should be simultaneously mounted by two nodes (see the example after these steps).

  2. Look in /var/adm/SYSLOG or the file you specified in /etc/syslog.conf (see the section “Messages From IRIS FailSafe” in Chapter 6) on both nodes for causes of failure.

  3. Diagnose and repair the problem using the information in the remainder of this appendix.

  4. If the failure caused a failover and the failed node is in standby state, after repairing the problem you can bring both nodes back to normal state by following the procedure in the section “Moving a Node From Standby State to Normal or Degraded State” in Chapter 6.
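
For step 1, one quick check is to run df on both nodes and compare the output. This is only a sketch; the node names xfs-ha1 and xfs-ha2 are the examples used elsewhere in this guide, and rsh access between the nodes is assumed:

# df -l
# rsh xfs-ha2 df -l

No filesystem on a shared disk should appear in both listings.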

IRIS FailSafe System Does Not Start

If the IRIS FailSafe system does not start, follow these steps:

  1. Make sure that IRIS FailSafe is chkconfig'd on (see the example at the end of this procedure).


    Note: If two failovers have occurred within the period of time specified by MIN_UPTIME in /etc/init.d/failsafe (300 seconds, by default), the IRIS FailSafe software automatically performs chkconfig failsafe off.


  2. Make sure that /var/ha/ha.conf is identical on both nodes in the cluster by entering this command on each node:

    # /usr/etc/ha_cfgchksum
    0x12a2390e
    

    The checksums output by the commands should be identical.

  3. Check the format and contents of /var/ha/ha.conf using the ha_cfgverify command:

    # /usr/etc/ha_cfgverify 
    

    Ensure that there are no errors from ha_cfgverify.

  4. Verify the network interfaces and serial connections using the procedures in the sections “Testing the Public Network Interfaces” and “Testing the Serial Connections” in Chapter 5.

  5. Look at /var/adm/SYSLOG to see what errors are printed out by the IRIS FailSafe daemons. When a node in the IRIS FailSafe cluster starts up normally, the following SYSLOG messages appear:

    ha_appmon[6141]: Received XRELEASE_PEER
    ha_nc[6135]: Received JOINING
    ha_nc[6135]: New state: NC_JOINING
    ha_nc[6135]: Received REJOIN
    ha_appmon[6141]: Received XACQUIRE
    ha_appmon[6141]: Received START_REMMON
    ha_nc[6135]: New state: NC_NORMAL
    
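To confirm step 1, check whether the failsafe flag is currently on and, if it has been turned off (for example, by the MIN_UPTIME mechanism described in the note above), turn it back on:

# chkconfig | grep failsafe
# chkconfig failsafe on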

No IRIS FailSafe After a Power Failure

If an IRIS FailSafe cluster consisting of a CHALLENGE S node and a CHALLENGE node that is not a CHALLENGE S experiences a complete power failure, the cluster may not start IRIS FailSafe successfully. The problem is that the CHALLENGE S comes up quickly, gets no response from the other node, times out, and does not start IRIS FailSafe. The larger CHALLENGE node comes up more slowly, detects that IRIS FailSafe is not running on the CHALLENGE S, and goes into standby state.

To start IRIS FailSafe manually on the two nodes, enter this command on each node:

# /etc/init.d/failsafe start

To prevent this problem in the future, modify the configuration file /var/ha/ha.conf as follows:

  • Increase the value of the long-timeout parameter. A suggested value is 90.

  • Increase the values of all of the start-monitor-time parameters. A suggested value is 120. It must be larger than the value of the long-timeout parameter.
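
As a sketch, the edited entries in /var/ha/ha.conf would look like the following (the blocks that contain these parameters are not shown here; a typical configuration file has one long-timeout value and several start-monitor-time values):

long-timeout = 90
...
start-monitor-time = 120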

To install the new configuration file, follow the directions in the section “Upgrade Procedure C” in Chapter 7.

Duplicate SCSI IDs

If you see SCSI bus-related errors after configuring the cluster, or nonexistent devices show up in hinv, follow these steps:

  1. Verify that the SCSI host IDs of the two nodes are different (by convention, 0 and 2) by running nvram.

    To change the SCSI host ID, enter this command:

    # nvram -v scsihostid id 
    

  2. Verify that the SCSI IDs of all disks and other peripherals on the same SCSI bus have distinct SCSI unit numbers and that they are different from the SCSI host IDs of the two nodes in the cluster (usually 0 and 2).


    Note: This convention works when only one internal disk per node is used. If a second internal disk is used in either node, that disk's SCSI ID cannot be identical to either SCSI host ID.
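
For example, to display the SCSI host ID of a node and the unit numbers of the devices it sees, enter these commands on each node (the output varies with your hardware):

# nvram scsihostid
# hinv -c disk

The host IDs reported on the two nodes must differ, and no disk or other peripheral on the shared bus may use either of those IDs.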


Trouble Accessing a Network Interface

If you cannot access a network interface, check that the interface is configured up and that it is working, as described below.

The procedure uses this ha.conf fragment as an example:

node xfs-ha1
{
        interface xfs-ha1-ec0
        {
                name = ec0
                ip-address = 190.0.2.2
                netmask = 0xffffff00
                broadcast-addr = 190.0.2.255
        }

...
}

Follow these steps:

  1. Get information about the interface from this ifconfig command:

    # /usr/etc/ifconfig ec0
    ec0: flags=c63<UP,BROADCAST,NOTRAILERS,RUNNING,FILTMULTI,MULTICAST>
            inet 190.0.2.2 netmask 0xffffff00 broadcast 190.0.2.255
    

    The UP in the first line of output indicates that the interface is configured up.

  2. If the interface is not configured up, add the interface to the /etc/config/netif.options file as described in the section “Configuring Network Interfaces” in Chapter 3.

  3. If the interface is configured up, check if the network interface is working by entering this command:

    # /usr/etc/netstat -I ec0 
    

    Check the output to see if there are input errors or output errors for the interface.

Trouble Accessing a Node Over the Network

If you cannot access a node using the network, run netstat -i as described in the section “Getting Information About Interfaces” in Chapter 6 to see if the IP address to which you are trying to connect is configured on one of the node's public interfaces.


Note: You cannot access an IP address associated with a private interface from a node on the public network.

Because high-availability IP addresses are configured by IRIS FailSafe, they might not be configured if IRIS FailSafe is not started. Also, the IP address might have been taken over by the other node.
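
As a quick check, list the configured interfaces on the node and then, from a client on the public network, try to reach the address with ping (190.0.2.2 is the example address from the previous section; substitute the address you are trying to reach):

# /usr/etc/netstat -i
# /usr/etc/ping 190.0.2.2

Interrupt ping with Ctrl-C after a few packets. If the address is not listed by netstat -i or does not answer ping, IRIS FailSafe may not have configured it, or it may have been taken over by the other node.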

Trouble With the Serial Connection

If you suspect a problem with one of the serial cables connected to the remote power control unit or to the system controller port of the other node (possibly because you have received mail that indicates a problem), use the procedure below to determine if there is a problem. If you suspect a particular cable, perform this procedure on the node whose serial port is connected to the cable. If necessary, follow this procedure on both nodes.

  1. To stop the automatic communication on the serial line, enter this command:

    # /usr/etc/ha_admin -m stop hostname 
    ha_admin: Stopped monitoring the serial connection to hostname
    

  2. Manually send messages on the serial line by entering this command:

    # /usr/etc/ha_spng -i 10 -f reset-tty -d sys-ctlr-type -w password 
    

    reset-tty and sys-ctlr-type are the values of the reset-tty and sys-ctlr-type parameters for this node in the configuration file /var/ha/ha.conf. Omit the -d sys-ctlr-type option if there is no sys-ctlr-type parameter or if it is set to CHAL. password is the unencrypted password for this node's system controller; omit the -w password option if the node is a CHALLENGE node or an Origin node with the default system controller password.

  3. Check the return code from the command in step 2:

    # echo variable 
    0
    

    If you are using csh, variable is $status. If you are running sh, variable is $?. The zero output indicates normal operation.

  4. If the return code was zero, restart automatic communication on the serial line by entering this command:

    # /usr/etc/ha_admin -m start hostname 
    ha_admin: Started monitoring the serial connection to hostname
    

    There is no problem with the serial connection, so skip the remainder of this procedure.

  5. If the return code from ha_spng was non-zero, try re-seating the serial cable connectors.

  6. Re-test the serial line by performing steps 2 and 3.

  7. If re-seating the cables didn't work, replace the cable by following the directions in the section “Replacing the Serial Cable” in Chapter 8.
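
Putting the steps together, a successful test on a CHALLENGE node (so the -d and -w options are omitted) might look like this under csh, assuming a hypothetical reset-tty value of /dev/ttyd2 and a remote node named xfs-ha2:

# /usr/etc/ha_admin -m stop xfs-ha2
ha_admin: Stopped monitoring the serial connection to xfs-ha2
# /usr/etc/ha_spng -i 10 -f /dev/ttyd2
# echo $status
0
# /usr/etc/ha_admin -m start xfs-ha2
ha_admin: Started monitoring the serial connection to xfs-ha2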

Trouble With Logical Volumes

To check that a node is licensed for plexing of XLV logical volumes, that the volumes are visible to the node, and that they are owned by the correct node, follow these steps:

  1. Verify that the node is licensed for plexing and that the plexing software is installed:

    # xlv_mgr
    xlv_mgr> show config
    Allocated subvol locks: 30 locks in use: 7
    Plexing license: present
    Plexing support: present
    Maximum subvol block number: 0x7fffffff
    

    If you have just installed the plexing software, you must reboot the system for the plexing support to be included in the kernel.

  2. Verify that the node sees the volume. In an IRIS FailSafe environment, a node accesses (and mounts filesystems on) only those XLV volumes that it owns. To see all the volumes in the cluster, enter these commands:

    xlv_mgr> show all
    Volume: vol1 (complete)
    Volume: vol2 (complete)
    Volume: shared_vol (complete)
    xlv_mgr> quit
    

    This command shows all the XLV volumes in the cluster.

  3. If the cluster is using a CHALLENGE RAID storage system, stop the RAID agent by entering this command:

    # /etc/init.d/raid5 stop 
    

  4. To see volumes owned by a node, enter this command on that node:

    # xlv_assemble -ln
    
    VOL vol2        flags=0x1, [complete]
    DATA    flags=0x0()     open_flag=0x0() device=(192, 4)
    PLEX 0  flags=0x0
    VE 0    [active]
            start=0, end=687999, (cat)grp_size=1
            /dev/dsk/dks5d9s0 (688000 blks)
    PLEX 1  flags=0x0
    VE 0    [active]
            start=0, end=687999, (cat)grp_size=1
            /dev/dsk/dks5d9s1 (688000 blks)
    

    This command displays only the volumes owned by this node.

  5. If a volume is owned by the wrong node, change its ownership with xlv_mgr (for example, to make vol2 owned by xfs-ha2), after first unmounting the filesystem mounted on that volume (vol2):

    # umount /vol2
    # xlv_mgr
    xlv_mgr> change nodename xfs-ha2 vol2
    set node name "xfs-ha2" for object "vol2" done
    

  6. Run xlv_assemble -l on both nodes:

    # xlv_assemble -l 
    

  7. Restart the RAID agent, if you stopped it in step 3, by entering this command:

    # /etc/init.d/raid5 start 
    

Trouble Mounting Filesystems

If you are having trouble mounting filesystems on shared disks, execute the mount directive that would be executed by the IRIS FailSafe software on each node. Follow these steps:

  1. In your /var/ha/ha.conf file, look for the filesystem block for a filesystem that does not mount successfully. For example:

    filesystem fs1
    {
            mount-point = /shared
            mount-info
            {
                    fs-type = xfs
                    volume-name = vol1
                    mode = rw,noauto
            }
    }
    

    This filesystem block is for the filesystem mounted at /shared.

  2. Look for the volume block for this filesystem; it is the volume block whose label is vol1, the value of volume-name. For example:

    volume vol1
    {
            server-node = xfs-ha1
            backup-node = xfs-ha2
            devname = vol1
            devname-owner = root
            devname-group = sys
            devname-mode = 0600
    }
    

  3. Follow the procedure in the section “Trouble With Logical Volumes” in this appendix to make sure that the volume is owned by this node.

  4. Mount the filesystem with this command:

    # mount -t xfs -o rw,noauto /dev/xlv/vol1 /shared 
    

  5. Unmount the filesystem with this command:

    # umount /shared
    


    Note: Do not omit this step. Data corruption could result.


  6. Repeat steps 1 through 5 for every filesystem that doesn't mount successfully.

  7. On the other node, repeat steps 1 through 6. While following the procedure in the section “Trouble With Logical Volumes,” do change the owner of the volume so that it is owned by the second node.

Trouble Accessing a Filesystem Over NFS

If you cannot access a filesystem on a shared disk over NFS, make sure that the network interface is ifconfig'd up and the filesystem is mounted, as explained in the sections “Trouble Accessing a Network Interface” and “Trouble Mounting Filesystems” in this appendix. Also verify that the IP alias you are using is specified correctly in the configuration file /var/ha/ha.conf.

The procedure below exports the filesystem manually from both nodes and checks to see if a client can access it. It uses this ha.conf fragment as an example:

nfs shared1
{
        filesystem = shared1
        export-point = /shared1
        export-info = rw
        ip-address = 190.0.2.3
}

Follow these steps:

  1. Verify that the value of ip-address is a high-availability IP address. See the section “NFS Blocks” in Chapter 4 for more information.

  2. Verify that /shared1 is mounted. If it is not, follow the instructions in the section “Trouble Mounting Filesystems” in this appendix.

  3. Export the filesystem by entering this command:

    # exportfs -i -o rw /shared1 
    

  4. Verify that the filesystem is exported:

    # exportfs
    /shared1 -rw 
    

  5. From a client, mount the filesystem and verify that you can access it. For example:

    # mount -t nfs -o rw 190.0.2.3:/shared1 /shared1
    

  6. On the client, change to /shared1, verify that the filesystem has been mounted, then unmount it:

    # cd /shared1
    # ls
    # cd /
    # umount /shared1
    

  7. Unexport and unmount the filesystem:

    # exportfs -u /shared1
    # umount /shared1
    


    Note: Do not omit this step. Data corruption could result.


  8. Repeat steps 2 through 7 with the filesystem mounted on the other node.

Netscape Server Warning Messages at Startup

This type of error message is normal when nodes in the cluster boot up:

error: could not bind to 190.0.2.1 port 80 (Cannot assign requested address)
error: could not bind to 190.0.2.2 port 80 (Cannot assign requested address)

These messages appear because IRIS FailSafe starts up the Netscape FastTrack Server and the Netscape Enterprise Server before it configures the network interfaces up. They are harmless and can be ignored.

Netscape Server Not Responding

If a Netscape server is not responding after the network and the Netscape server have been installed and configured, make sure that the configured addresses are accessible. Follow these steps:

  1. If you have multiple Netscape servers, verify that the file /etc/config/ns_fasttrack.options exists (it is created by IRIS FailSafe when it starts Netscape servers):

    # ls /etc/config/ns_fasttrack.options
    

  2. Enable and start the Netscape FastTrack server if used:

    # chkconfig ns_fasttrack on
    # /etc/init.d/ns_fasttrack start 
    

  3. Enable and start the Netscape Enterprise server if used:

    # chkconfig ns_enterprise on
    # /etc/init.d/ns_enterprise start 
    

  4. Run a Web browser, such as Netscape, and try to access some Web pages exported by the server.

Netscape Daemons Not Responding to Monitoring

These messages are written to /var/adm/SYSLOG when the IRIS FailSafe Web monitoring script detects that the Netscape httpd daemons are no longer responding:

Nov 29 11:25:36 6D:xfs-ha2 syslog[775]: /usr/etc/ha_exec: command /usr/etc/http_ping failed error=1. retrying
Nov 29 11:25:36 5B:xfs-ha2 root: Failed to ping local webserver [port: 80]
Nov 29 11:25:36 6D:xfs-ha2 ha_appmon[241]: webserver_xfs-ha1 local monitoring failed: status = 3
Nov 29 11:25:36 6D:xfs-ha2 ha_nc[235]: Received LOCMONFAIL

Notice that the first line reports that /usr/etc/http_ping was executed by /usr/etc/ha_exec and failed. Try to determine why the /usr/etc/http_ping command is failing. Possible reasons are these:

  • The Netscape server is not configured correctly for a high-availability IP address.

  • The Netscape server has not been started. Check the webserver entries in /var/ha/ha.conf.

  • A high-availability IP address has not been configured on the interface. Check the interface-pair blocks in /var/ha/ha.conf to verify this.

After you have fixed the problem, try executing the command that failed (/usr/etc/http_ping, in this case) from the command line to verify that the problem has been solved.
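
For example, to re-run the failed command by hand under csh and confirm that it now succeeds (the SYSLOG entry above shows http_ping being invoked with no arguments):

# /usr/etc/http_ping
# echo $status
0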

Failover Script Fails

IRIS FailSafe uses application-specific failover scripts. If a script fails, the error is logged to /var/adm/SYSLOG. A sample entry might look like the following:

Nov 29 11:07:32 5B:xfs-ha1 root: ERROR: /sbin/xlv_mgr -c "change nodename xfs-ha2 shared_vol"
Nov 29 11:07:32 6D:xfs-ha1 ha_appmon[238]: takeback script exited with status 3
Nov 29 11:07:32 6D:xfs-ha1 ha_nc[232]: process appmon died with status 1

Normally, the other node in the cluster restarts this node and takes over its services. To diagnose what caused the original script failure, follow these steps:

  1. Shut the cluster down by following the procedure in the section “Shutting Down IRIS FailSafe” in Chapter 6.

  2. Re-enter the command that failed, with the same command options; in the SYSLOG entries above, this is the call to /sbin/xlv_mgr (see the example after these steps).

  3. Rerun the appropriate script from /var/ha/actions (in this case, /var/ha/actions/takeback). For example:

    # /var/ha/actions/takeback `/usr/etc/ha_cfgchksum`
    

  4. Make the appropriate fixes, most likely corrections to the configuration, and bring the cluster back up.
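
For the sample SYSLOG entries above, steps 2 and 3 amount to re-entering the logged xlv_mgr command and then rerunning the takeback script:

# /sbin/xlv_mgr -c "change nodename xfs-ha2 shared_vol"
# /var/ha/actions/takeback `/usr/etc/ha_cfgchksum`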

ha_admin Times Out

The ha_admin command times out when the node is transitioning from one state to another. Retry the command after a few minutes.

Error Message From ha_statd

When ha_cfgverify is run automatically as part of the startup of IRIS FailSafe, it can generate this message:

ha_statd: Error sending message to rpc.statd
ha_statd: rpc.statd is a back revision

If you see this message, follow these steps:

  1. Check to see what options the network status monitor daemon, /usr/etc/rpc.statd, was started with by entering this command:

    # ps -ef | grep rpc.statd
        root   211     1  0   May 31 ?       0:00 /usr/etc/rpc.statd -h
        root  4215  4213  2 14:48:57 ttyq3   0:00 grep rpc.statd 
    

    Look for the -h option (shown in this output). The error message shown above appears if the -h option wasn't used.

  2. If the -h option wasn't used, add -h to the first line of the file /etc/config/statd.options as described in the section “Configuring NFS Filesystems” in Chapter 3 (see the example after these steps).

  3. If you edited /etc/config/statd.options, restart rpc.statd by entering this command:

    # /etc/init.d/network start
    
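After the edit described in step 2, the first line of /etc/config/statd.options contains the -h option; for example (any options already present in your file should be kept):

# cat /etc/config/statd.options
-h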

False Failovers

A false failover occurs when there is a failover for no apparent reason. IRIS FailSafe monitors the high-availability services, the other node in the cluster, the networks, and the connections between the nodes. If a service, a network, or the other node doesn't respond to the monitoring within a timeout period and this monitoring fails a specified number of times (the retry value), a failure is declared and a failover is initiated.

To prevent false failovers, tune the timeout and retry values in the configuration file /var/ha/ha.conf so that they detect true failures promptly but are not triggered by normal node or network loading. See the section “Preventing False Failovers” in Chapter 4 for more information.

Errors Logged to /var/adm/SYSLOG

This section lists errors that might be logged to /var/adm/SYSLOG.

date 6D:node1 syslog[3195]: /usr/etc/ha_exec: /usr/etc/ha_probe -h 127.0.0.1 -p ha_ifa failed(exit status = 1). retrying
date 6D:node1 ha_ifa[3148]: ifa_mon: No packets for ec0
date 6D:node1 ha_ifa[3148]: ifa_mon: Interface failure(s)
date 6D:node1 ha_appmon[3268]: am_send_event LOCMONFAIL to NC
date 6D:node1 ha_nc[1583]: Received LOCMONFAIL
date 6D:node1 ha_nc[1583]: state NORMAL, event LOCMONFAIL (0x11)
date 6D:node1 ha_nc[1583]: Sending mesg: REMOTE_STOPMON (0x30200019) to NC on node2-pr

This series of messages is the result of a local monitoring failure. In this example, the interface agent failed. (Look for the name of the process that failed, ha_ifa in this example, in the arguments of the ha_probe command shown on the first line.) The message No packets for ec0 in the second line indicates that the problem is related to the interface ec0. Check this interface for network problems.

date 6D:node1 ha_appmon[351]: Retrying remote heartbeat monitor, node 2 (lost_count = 2)
date 6D:node1 ha_appmon[351]: lost_heartbeat: heartbeat_2 failed
date 6D:node1 ha_appmon[1692]: am_send_event NOHEARTBEAT to NC
date 6D:node1 ha_nc[319]: Received NOHEARTBEAT
date 6D:node1 ha_nc[319]: state NORMAL, event NOHEARTBEAT (0x12)

This series of messages is an example of heartbeat failure. Check the private network between the nodes.

date 5B:node1 root: ERROR: /etc/init.d/informix stop
date 6D:node1 ha_appmon[4916]: giveaway script exited with status 3
date 6D:node1 ha_nc[4904]: process appmon died with status 1
date 6D:node1 ha_nc[4904]: node controller exiting
date 6D:node1 ha_hbeat[4901]: process nc died with status 1
date 6D:node1 ha_hbeat[4901]: FailSafe shutting down

This series of messages is an example of a failover script failure. In this example, the giveaway script failed. giveback, takeback, and takeover script failures produce similar SYSLOG output. In this example, the command /etc/init.d/informix stop failed. Check the directory /var/ha/logs for more information on the command failure.

wd95_5: WD95 saw SCSI reset

This message is benign. A CHALLENGE node resets the SCSI bus in the process of starting up. For a dual-hosted system like IRIS FailSafe, the other node on the shared SCSI bus also sees the resets.

checksum mismatch error

Either the configuration files on the two nodes do not match or the configuration file has been changed after the IRIS FailSafe daemons were started.

Make sure that the /var/ha/ha.conf files are the same on both nodes and then reboot both nodes in the cluster.

read_conf error
error return from read_config
nc_readconfig: Bad config file ...

The IRIS FailSafe daemons could not start up because the /var/ha/ha.conf file on the node is invalid. In some cases, the error message also indicates the missing or invalid parameters. Run ha_cfgverify and fix any problems identified.

Kill failed (%d) after remote monitor failed or no heartbeat

The local node tried to take over the other node's services but failed because it could not reset the other node. The local node went into the standby state.

Check the serial connection between the nodes. If a remote power control unit is used, check it as well. After the problem has been fixed, use the command ha_admin -rf as described in the section “Moving a Node From Standby State to Normal or Degraded State” in Chapter 6.

assumed power failed

This node detected that the heartbeat and the failsafe mechanism have both failed on the remote node and assumed that the remote node experienced a power failure. This node then took over the other node's services.

Monitoring of reset-tty on xxx failed

The IRIS FailSafe system could not communicate with either the remote power control unit (CHALLENGE S) or the system controller port (all other systems). Although the system continues to function, this situation prevents a node from taking over from another node if another failure occurs.

Check the physical connections. Test the serial connection using the procedure in the section “Trouble With the Serial Connection” in this appendix.

mail script failed

The application monitor detected a change in the state of the cluster, tried to send mail to the recipient specified in the /var/ha/ha.conf file, and failed.

Verify the mail configuration on your nodes.
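
As a simple check that mail can be delivered from a node, send a test message by hand (the recipient root is only a placeholder; use the mail recipient specified in your /var/ha/ha.conf):

# echo "IRIS FailSafe mail test" | mail root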

xlv_plexd[31]: DIOCXLVPLEXCOPY on xxx (dev 192.4) failed: No such device or address

The plex revive operation was interrupted because the underlying device went away. This situation usually occurs because the shared volume has been given away to the other node in the cluster.

This message is benign.

New state: NC_ERROR

The IRIS FailSafe software has detected an internal inconsistency and has suspended operation. The nodes of the cluster are still running.

Verify the software and hardware configuration; reboot this node. Report this problem to Silicon Graphics Technical Support.

lost_heartbeat: heartbeat_<nodenumber> failed
Retrying heartbeat monitor over <public interface>

The cluster is in normal state, but a failure was detected in the private network.

Repair the private network. Follow the instructions in the section “Moving Heartbeat Messages to the Private Network” in Chapter 6 to switch the heartbeat back to the private network.

connect to opsnc failed

In a mixed OPS/IRIS FailSafe configuration, the ha_killd daemon could not connect to the OPS node controller on the Indy workstation that is running the IRISconsole software.

Verify that the cables from the Indy workstation running the IRISconsole are properly attached to the nodes in the IRIS FailSafe cluster, that the Indy workstation is running, and that the opsnc daemon is running on the Indy workstation.

open_tty failed -- cannot monitor node

This message applies to clusters running both Oracle Parallel Server (OPS) and IRIS FailSafe. If you see this message and the cluster is not running OPS, the problem is that the configuration parameter reset-host has been set. Remove reset-host to solve the problem.

If the cluster is running OPS, the IRIS FailSafe system cannot open the serial connection to the remote power control unit or the system controller port of the other node in the cluster.

Verify that the serial connection is hooked up and that the remote power control unit, if used, is powered on. Test the serial connection using the procedure in the section “Trouble With the Serial Connection” in this appendix.