Chapter 10. System Recovery and Troubleshooting

This chapter provides information on FailSafe system recovery, and includes sections on the following topics:

Overview of System Recovery

When a FailSafe system experiences problems, you can use some of the FailSafe features and commands to determine where the problem is located.

FailSafe provides the following tools to evaluate and recover from system failure:

  • Log files

  • Commands to monitor status of system components

  • Commands to start, stop, and fail over highly available services

Keep in mind that the FailSafe logs may not capture system problems that do not translate into FailSafe problems. For example, if a CPU goes bad or hardware maintenance is required, FailSafe may not be able to detect and log these failures.

In general, when evaluating system problems of any nature on a FailSafe configuration, you should determine whether you need to shut down a node to address those problems.

When you shut down a node, perform the following steps (example cmgr commands for stopping and starting HA services follow the list):

  1. Stop FailSafe HA services on that node

  2. Shut down the node to perform needed maintenance and repair

  3. Start up the node

  4. Start FailSafe HA services on that node
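
For example, to stop and later restart FailSafe HA services on a single node with cmgr, reusing the syntax shown in "Node Error Recovery" later in this chapter (the node and cluster names web-node3 and web-cluster are placeholders; the force option shown there is only needed to ignore errors):

cmgr> stop ha_services in node web-node3 for cluster web-cluster

cmgr> start ha_services in node web-node3 for cluster web-cluster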

It is important that you explicitly stop FailSafe HA services before shutting down a node, where possible, so that FailSafe does not interpret the node shutdown as node failure. If FailSafe interprets the service interruption as node failure, there could be unexpected ramifications, depending on how you have configured your resource groups and your application failover domain.

When you shut down a node to perform maintenance, you may need to change your FailSafe configuration to keep your system running.

Identifying the Cluster Status

When you encounter a problem, identify the cluster status by answering the following questions:

  • Are the cluster processes (cmond, crsd, fs2d, and cad) and HA processes (ha_cmsd, ha_gcd, ha_srmd, ha_fsd, and ha_ifd) running? (See the example after this list.)

  • Are the cluster, node, and resource group states consistent on each node? Run the cluster_status command on each node and compare, or run the GUI connecting to each node in the cluster.

  • Which nodes are in the FailSafe membership? Check the status with the cluster_status and cmgr commands, and see the /var/adm/SYSLOG file.

  • Which nodes are in the cluster database (fs2d) membership? See the /var/cluster/ha/log/fs2d_log files on each node.

  • Is the database consistent on all nodes? Determine this by logging in to each node and examining the /var/cluster/ha/log/fs2d_log file and the database checksum.
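
For example, a quick first pass on each node might look like the following; the cluster_status script path, the grep pattern, and the use of sum(1) on the database file are assumptions, so adjust them for your installation:

# ps -ef | grep cluster                           (cad, cmond, crsd, fs2d)
# ps -ef | egrep "ha_cmsd|ha_gcd|ha_srmd|ha_fsd|ha_ifd"
# /var/cluster/cmgr-scripts/cluster_status        (compare cluster, node, and resource group states)
# grep -i membership /var/cluster/ha/log/fs2d_log | tail
# sum /var/cluster/cdb/cdb.db                     (compare this checksum across nodes)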

Locating Problems

To locate the problem, do the following:

  • Examine the following log files:

    /var/cluster/ha/log/cad_log
    /var/cluster/ha/log/cli_Hostname
    /var/cluster/ha/log/crsd_Hostname
    /var/cluster/ha/log/fs2d_Hostname

    • Search for errors in all log files. Examine all messages within the timeframe in question (see the example after this list).

    • Trace errors to the source. Try to find an event that triggered the error.

  • Gather process accounting data.

  • Use the icrash commands.

  • Use detailed information from the view area in the GUI to drill down to specific configuration information.

  • Run the Test Connectivity task in the GUI.

  • Get a dump of the cluster database. You can extract such a dump with the following command:

    # /usr/cluster/bin/cdbutil -c 'gettree #' > dumpfile

  • Determine which nodes are in the FailSafe membership with the cluster_status command.
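
For example, a simple starting point for the log search is to look for errors across all of the log files and then narrow the search to the timeframe in question (the grep patterns below are only illustrations):

# grep -i error /var/cluster/ha/log/*
# grep "Sep 2 1[12]:" /var/cluster/ha/log/cad_log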

Common Problems

The following are common problems and solutions:

Timed-out Resource Monitor Script

You may be able to diagnose why a monitor action script has timed out by examining the process accounting data. This assumes that you have previously enabled either extended accounting or Comprehensive System Accounting on all production servers, as recommended in “Enabling System Accounting” in Chapter 3.

Do the following:

  • Determine the timeframe of the problem as a start-time and end-time. You may need to convert this time to GMT depending on how you have your system configured.

  • Determine the process ID (PID) of the monitor task that timed out. This can be done by looking at the srmd log or by finding the monitor task with a long elapsed time in the accounting data (see the sketch after this list).

  • Select the accounting records of interest and create a PID tree from the output.
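
A rough sketch of finding the PID from the srmd log, assuming the log on this node is /var/cluster/ha/log/srmd_web-node3 (the filename and the grep pattern are assumptions; the message wording in your log may differ):

# grep -i monitor /var/cluster/ha/log/srmd_web-node3 | tail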

GUI Will Not Run

If the GUI will not run, check the following:

  • Is the license properly installed?

  • Are the cluster daemons running?

  • Are the tcpmux and tcpmux/sgi_sysadm services enabled in the /etc/inetd.conf file?

  • Are the inetd or tcp wrappers interfering? This may be indicated by connection refused or login failed messages.

Log Files Consume Too Much Disk Space

If the log files are consuming too much disk space, you should rotate them according to the directions in the FailSafe Administrator's Guide for SGI InfiniteStorage. You may also want to consider choosing a less-verbose log level.

Unable to Define a Node

If you are unable to define a node, it may be that there are hostname resolution problems. See the information about hostname resolution rules in the FailSafe Administrator's Guide for SGI InfiniteStorage.

System is Hung

The following may cause the system to hang:

  • Overrun disk drives.

  • Heartbeat was lost. In this case, you will see a message that mentions the withdrawal of a node.

  • As a last resort, do a nonmaskable interrupt (NMI) of the system and contact SGI. (The NMI tells the kernel to panic the node so that an image of memory is saved and can be analyzed later.) For more information, see the owner's guide for the node.

    Make vmcore.#.comp, unix.#, /var/adm/SYSLOG, and cluster log files available.

You Cannot Log In

If you cannot log in to a FailSafe node, you can use one of the following commands, assuming the node you are on is listed in the other nodes' .rhosts files:

# rsh hostname ksh -i
# rsh hostname csh -i

Power Failure

In the case of a power failure, the first node to join the cluster will wait for the number of seconds specified by the _CMS_WAIT_FOR_ALL_TIMEOUT parameter before attempting to start resource groups. This delay allows the other nodes time to join the cluster.

To modify this value, use the cmgr command modify ha_parameters to set a value for node_wait, or use the Set FailSafe HA Parameters GUI task. For more information, see “Set FailSafe HA Parameters” in Chapter 6.

If the value is not set for the cluster, FailSafe calculates this value by multiplying the node-timeout value by the number of nodes.
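
For example (illustrative values only): with a node-timeout of 15 seconds in a four-node cluster, the first node to join after a power failure would wait 15 x 4 = 60 seconds before attempting to start resource groups.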

Error Messages During Remote Installation

If you are performing a remote installation, you may see messages such as the following:

cdb-exitop: can't run remotely - scheduling to run later

When you perform a remote or miniroot install, the exitop commands are deferred until cluster services are started, at which time the exitop commands will be run. For more information, see “Differences When Performing a Remote or Miniroot Install” in Chapter 4.

Disabling Resource Groups for Maintenance

If you must disable resources, such as when you want to perform maintenance on a node, use the following procedure:

  1. Offline the resource groups by using the offline_detach option or the offline_detach_force option (if the resource group is in error). For more information, see “Resource Group Recovery” and “Resource Group Maintenance and Error Recovery”. (An example appears after this list.)

  2. Perform the needed maintenance.

  3. Reboot the node.

  4. Online the resource group.
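
For example, assuming a resource group web-rg in cluster web-cluster that is not in an error state, steps 1 and 4 might look like the following (the offline_detach syntax here is an assumption based on the offline_detach_force form shown in “Resource Group Recovery”):

cmgr> admin offline_detach web-rg in cluster web-cluster

cmgr> admin online resource_group web-rg in cluster web-cluster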

Ensuring that Resource Groups are Deallocated

Performing an admin offline_force does not guarantee that all resource groups are offline. If you run into errors, such as with an admin move command, you should verify that the resource groups have been deallocated.

Checking for Exclusivity

After performing an admin offline_force, you should run the exclusive script with the appropriate arguments to verify that the resource in question is not running, or perform a check similar to that done by the script.
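
For example, for an IP_address resource you could perform a check similar to the one the exclusive script makes, verifying that the address is no longer configured or answering anywhere. The address 192.0.2.45 is hypothetical, and the exact check depends on the resource type:

# /usr/etc/ping 192.0.2.45                (should not answer once the resource is deallocated)
# /usr/etc/netstat -ian | grep 192.0.2.45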

Stopping Resources Manually

You must stop resources according to their execution order, from highest to lowest. Use the exclusive scripts to verify whether or not a resource needs to be stopped.

FailSafe Log Files

FailSafe maintains system logs for each of the FailSafe daemons. You can customize the system logs according to the level of logging you wish to maintain. Table 10-1 shows the levels of messages.

For information on setting logging for cad, cmond, and fs2d, see “Configure System Files” in Chapter 4. For information on setting up log configurations, see “Set Log Configuration” in Chapter 6, “Configuration”.

Table 10-1. Message Levels

Normal
    Normal messages report on the successful completion of a task. An example of a normal message is as follows (the <N notation indicates a normal message):

    Wed Sep 2 11:57:25.284 <N ha_gcd cms 10185:0> Delivering TOTAL membership (S# 1, GS# 1)

Error/Warning
    Error or warning messages indicate that an error has occurred or may occur soon. These messages may result from using the wrong command or improper syntax. An example of a warning message is as follows (the <W notation indicates a warning; <E indicates an error):

    Wed Sep 2 13:45:47.199 <W crsd crs 9908:0 crs_config.c:634> CI_ERR_NOTFOUND, safer - no such node

SYSLOG
    All normal and error messages are also logged to SYSLOG. SYSLOG messages include the symbol <CI> in the header to indicate that they are cluster-related messages. An example of a SYSLOG message is as follows:

    Wed Sep 2 12:22:57 6X:safe syslog: <<CI> ha_cmsd misc 10435:0> CI_FAILURE, I am not part of the enabled cluster anymore

Debug
    Debug messages appear in the log group file when the logging level is set to debug0 or higher (using the GUI) or to 10 or higher (using cmgr). The following message is logged at debug0 (note the D0 in the message) or at log level 10:

    Thu Sep 27 14:43:24.233 <D0 ha_fsd fsd 57540:0 fs_failsafe.c:1471> Determine oldest state: coordinator: perf22/0x10001

Examining the log files should enable you to see the nature of the system error. By noting the time of the error and observing the activity of the various daemons in the logs immediately before the error occurred, you may be able to determine the situation that caused the failure.


Note: Many megabytes of disk space can be consumed on the server when debug levels are used in a log configuration.

See Appendix C, “System Messages”.

FailSafe Membership and Resets

In looking over the actions of a FailSafe system on failure to determine what has gone wrong and how processes have transferred, it is important to consider the concept of FailSafe membership. When failover occurs, the runtime failover domain can include only those nodes that are in the FailSafe membership.

FailSafe Membership and Tie-Breaker Node

Nodes can enter into the FailSafe membership only when they are not disabled and they are in a known state. This ensures that data integrity is maintained because only nodes within the FailSafe membership can access the shared storage. If nodes that are outside the membership and are not controlled by FailSafe were able to access the shared storage, two nodes might try to access the same data at the same time; this situation would result in data corruption. For this reason, disabled nodes do not participate in the membership computation.


Note: No attempt is made to reset nodes that are configured disabled before confirming the FailSafe membership.


FailSafe membership in a cluster is based on a quorum majority. For a cluster to be enabled, more than 50% of the nodes in the cluster must be in a known state, able to talk to each other, using heartbeat control networks. This quorum determines which nodes are part of the FailSafe membership that is formed.

If there are an even number of nodes in the cluster, it is possible that there will be no majority quorum; there could be two sets of nodes, each consisting of 50% of the total number of nodes, unable to communicate with the other set of nodes. In this case, FailSafe uses the node that you configured as the tiebreaker node when you set your FailSafe parameters. If no tiebreaker node was configured, FailSafe uses the node with the lowest node ID on which HA services have been started.
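
For example (illustrative numbers): in a four-node cluster, at least three nodes must normally be able to communicate with each other over the heartbeat networks to form a membership. If the cluster splits into two sets of two nodes each, the set containing the tiebreaker node forms the membership; if no tiebreaker node was configured, the set containing the node with the lowest node ID on which HA services have been started forms it.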

The nodes in a quorum attempt to reset the nodes that are not in the quorum. Nodes that can be reset are declared DOWN in the membership; nodes that cannot be reset are declared UNKNOWN. Nodes in the quorum are UP.

If a new majority quorum is computed, a new membership is declared whether any node could be reset or not.

If at least one node in the current quorum has a current membership, the nodes will proceed to declare a new membership if they can reset at least one node.

If all nodes in the new tied quorum are coming up for the first time, they will try to reset and proceed with a new membership only if the quorum includes the tiebreaker node.

If a tied subset of nodes in the cluster had no previous membership, then the subset of nodes in the cluster with the tiebreaker node attempts to reset nodes in the other subset of nodes in the cluster. If at least one node reset succeeds, a new membership is confirmed.

If a tied subset of nodes in the cluster had previous membership, the nodes in one subset of nodes in the cluster attempt to reset nodes in the other subset of nodes in the cluster. If at least one node reset succeeds, a new membership is confirmed. The subset of nodes in the cluster with the tiebreaker node resets immediately; the other subset of nodes in the cluster attempts to reset after some time.

Resets are done through system controllers connected to tty ports through serial lines. Periodic serial line monitoring never stops. If the estimated serial line monitoring failure interval and the estimated heartbeat loss interval overlap, the cause is likely a power failure at the node being reset.

No Membership Formed

When no FailSafe membership is formed, you should check the following areas for possible problems:

  • Is the ha_cmsd FailSafe membership daemon running? Is the fs2d database daemon running?

  • Can the nodes communicate with each other? Are the control networks configured as heartbeat networks?

  • Can the control network addresses be reached by a ping command issued from peer nodes?

  • Are the quorum majority or tie rules satisfied? Look at the cmsd log to determine membership status.

  • If a reset is required, are the following conditions met?

    • Is the crsd node control daemon up and running?

    • Is the reset serial line in good health?

    You can look at the crsd log for the node you are concerned with, or execute an admin ping and admin reset command on the node to check this.
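
For example, the following cmgr command (shown again in “Serial Cable Failure Recovery” later in this chapter) verifies the serial reset connection to node web-node3 without actually resetting it:

cmgr> admin ping node web-node3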

Status Monitoring

FailSafe allows you to monitor and check the status of specified clusters, nodes, resources, and resource groups. You can use this feature to isolate the location of system problems.

You can monitor the status of the FailSafe components continuously through their visual representation in the GUI view area. Using the cmgr command, you can display the status of the individual components by using the show command.

For information on status monitoring and on the meaning of the states of the FailSafe components, see “System Status” in Chapter 8, “FailSafe System Operation”.

XVM Alternate Path Failover

The following messages in the /var/adm/SYSLOG file indicate that XVM has detected a failure in the disk path and has successfully used alternate path failover for lun0; that is, the cluster is in a degraded state and requires attention from the system administrator (line breaks added for readability):

Jun 26 13:21:55 5A:gold2 unix: NOTICE: xvm_serverpal_iodone: done with retry ior 0xa800000041afe100
 for physvol 0xa8000000008eae00 
Jun 26 13:21:55 6A:gold2 unix: dksc 50050cc002004b23/lun0vol/c4p3: <6>SCSI driver error: device 
 does not respond to selection
Jun 26 13:21:55 6A:gold2 unix: dksc 50050cc002004b23/lun0vol/c4p3: <6>SCSI driver error: device 
 does not respond to selection
Jun 26 13:21:55 4A:gold2 unix: WARNING: XVM: WRITE I/O error - errno 5, dev 0x134, bp 0xa800000059bcdf80,
 b_flags 0x100400c, b_addr 0x0, b_pages 0xa800000202127600, io_resid -4611686018427387904, io_error 
 0xa800000000000000
Jun 26 13:21:55 6A:gold2 unix: dksc 50050cc002004b23/lun0vol/c4p3: <6>SCSI driver error: device does 
 not respond to selection
Jun 26 13:21:55 4A:gold2 unix: WARNING: XVM: WRITE I/O error - errno 5, dev 0x134, bp 0xa80000024bef9980,
 b_flags 0x100400c, b_addr 0x0, b_pages 0xa8000000025aa040, io_resid -4611686018427387904, io_error 
 0xa800000200000000
Jun 26 13:21:55 4A:gold2 unix: WARNING: XVM: WRITE I/O error - errno 5, dev 0x134, bp 0xa800000011037600, 
 b_flags 0x100400c, b_addr 0x0, b_pages 0xa8000000024fc3c0, io_resid -6341068275337592832, io_error 
 0xa800000000000000
Jun 26 13:21:55 4A:gold2 unix: WARNING: XVM: WRITE I/O error - errno 5, dev 0x134, bp 0xa800000259171680, 
 b_flags 0x100400c, b_addr 0x0, b_pages 0xa8000000024e5a80, io_resid 0, io_error 0x0
Jun 26 13:21:55 4A:gold2 unix: WARNING: XVM: failover successful. Failover from dev 0x134 to dev 0x156 
 (/hw/module/001c01/Ibrick/xtalk/14/pci/1/scsi_ctlr/0/node/50050cc002004b23/port/2/lun/0/disk/volume/block)
 physvol 0x1b7
Jun 26 13:21:55 5A:gold2 unix: NOTICE: xvm_serverpal_iodone: done with retry ior 0xa800000041afe100 for 
 physvol 0xa8000000008eaa00

These messages in the SYSLOG file would be produced by the following /etc/failover.conf file:

#ident $Revision: 1.33 $
#
#       This is the configuration file for table configured failover support.
#
#       Please see the failover (7m) manual page for details on failover and
#       on how to use this file.
#
#disable_target_lun_check
lun2    50050cc002004b23/lun2/c4p3 \
        50050cc002004b23/lun2/c3p2 \
        50050cc002004b23/lun2/c3p3 \
        50050cc002004b23/lun2/c4p2 
lun1    50050cc002004b23/lun1/c4p3 \
        50050cc002004b23/lun1/c3p2 \
        50050cc002004b23/lun1/c3p3 \
        50050cc002004b23/lun1/c4p2 
lun0    50050cc002004b23/lun0/c4p3 \
        50050cc002004b23/lun0/c3p2 \
        50050cc002004b23/lun0/c3p3 \
        50050cc002004b23/lun0/c4p2

Dynamic Control of FailSafe HA Services

FailSafe allows you to perform a variety of administrative tasks that can help you troubleshoot a system with problems without bringing down the entire system. These tasks include the following:

  • You can add or delete nodes from a cluster without affecting the FailSafe HA services and the applications running in the cluster.

  • You can add or delete a resource group without affecting other online resource groups.

  • You can add or delete resources from a resource group while it is still online.

  • You can change FailSafe parameters such as the heartbeat interval and the node timeout and have those values take immediate effect while the services are up and running.

  • You can start and stop FailSafe HA services on specified nodes.

  • You can move a resource group online, or take it offline.

  • You can stop the monitoring of a resource group by putting the resource group into maintenance mode. This is not an expensive operation because it does not stop and start the resource group; it just puts the resource group in a state where it is not available to FailSafe.

  • You can reset individual nodes.

For information on how to perform these tasks, see Chapter 6, “Configuration”, and Chapter 8, “FailSafe System Operation”.

Recovery Procedures

The following sections describe various recovery procedures that you can perform when different FailSafe components fail. Procedures for the following situations are provided:

Single-Node Recovery

When one of the nodes in a two-node cluster is intended to stay down for maintenance or cannot be brought up, a set of procedures must be followed so that the database on the surviving node knows that the other node is down and therefore should not be considered in the failover domain. Without these procedures, the resources cannot come online because half or more of the failover domain is down.

See the procedure in “Two-Node Clusters: Single-Node Use” in Chapter 8.

Cluster Error Recovery

Use the following procedure if status of the cluster is UNKNOWN in all nodes in the cluster:

  1. Check to see if there are control networks that have failed (see “Control Network Failure Recovery”).

  2. Determine if there are sufficient nodes in the cluster that can communicate with each other using control networks in order to form a quorum. (At least 50% of the nodes in the cluster must be able to communicate with each other.) If there is an insufficient number of nodes, stop HA services on the nodes that cannot communicate (using the force option); this will change the number of nodes used in the quorum calculation.

  3. If there are no hardware configuration problems, do the following:

    • Detach all resource groups that are online in the cluster (if any)

    • Stop HA services in the cluster

    • Restart HA services in the cluster

    See “Resource Group Recovery”.

For example, the following cmgr command detaches the resource group web-rg in cluster web-cluster:

cmgr> admin detach resource_group web-rg in cluster web-cluster

To stop HA services in the cluster web-cluster and ignore errors (force option), use the following command:

cmgr> stop ha_services for cluster web-cluster force

To start HA services in the cluster web-cluster, use the following command:

cmgr> start ha_services for cluster web-cluster

Resource Group Recovery

The fact that a resource group is in an error state does not mean that all resources in the resource group have failed. However, to get the resources back into an online state, you must first set them to the offline state. You can do this without actually taking the resources offline by using the following cmgr command:

admin offline_detach_force RGname [in cluster Clustername]

For example:

cmgr> admin offline_detach_force RG1 in cluster test-cluster


Caution: You should use the InPlace_Recovery failover policy attribute when using this command. This attribute specifies that the resources will stay on the same node where they were running at the time when the offline_detach_force command was run.


Node Error Recovery

When a node is not able to talk to the majority of nodes in the cluster, the SYSLOG will display a message that the CMSD is in a lonely state. Another problem you may see is that a node is getting reset or going to an unknown state.

Use the following procedure to resolve node errors:

  1. Verify that the control networks in the node are working (see “Control Network Failure Recovery”).

  2. Verify that the serial reset cables to reset the node are working (see “Serial Cable Failure Recovery”).

  3. Verify that the sgi-cmsd port is the same in all nodes in the cluster.

  4. Check the node configuration; it should be consistent and correct.

  5. Check SYSLOG and cmsd logs for errors. If a node is not joining the cluster, check the logs of the nodes that are part of the cluster.

  6. If there are no hardware configuration problems, stop HA services in the node and restart HA services.

    For example, to stop HA services in the node web-node3 in the cluster web-cluster, ignoring errors (force option), use the following command:

    cmgr> stop ha_services in node web-node3 for cluster web-cluster force

    For example, to start HA services in the node web-node3 in the cluster web-cluster, use the following command:

    cmgr> start ha_services in node web-node3 for cluster web-cluster

Resource Group Maintenance and Error Recovery

To do simple maintenance on an application that is part of the resource group, use the following procedure. This procedure stops the monitoring of the resources in the resource group while maintenance mode is on. You must turn maintenance mode off after the application maintenance is done.


Caution: If there is a node failure on the node where resource group maintenance is being performed, the resource group is moved to another node in the failover policy domain.

For example:

  1. To put a resource group web-rg in maintenance mode, use the following cmgr command:

    cmgr> admin maintenance_on resource_group web-rg in cluster web-cluster

  2. The resource group state changes to ONLINE_MAINTENANCE. Do whatever application maintenance is required. (Rotating application logs is an example of simple application maintenance).

  3. To remove a resource group web-rg from maintenance mode, use the following command:

    cmgr> admin maintenance_off resource_group web-rg in cluster web-cluster

    The resource group state changes back to ONLINE.

Perform the following procedure when a resource group is in an ONLINE state and has an SRMD EXECUTABLE ERROR:

  1. Look at the SRM logs (default location: /var/cluster/ha/logs/srmd_Nodename) to determine the cause of failure and the resource that has failed. Search for the ERROR string in the SRMD log file:

    Wed Nov 3 04:20:10.135
    <E ha_srmd srm 12127:1 sa_process_tasks.c:627>
    CI_FAILURE, ERROR: Action (start) for resource (192.0.2.45) of type
    (IP_address) failed with status (failed)

  2. Check the script logs on that same node for IP_address start script errors.

  3. Fix the cause of failure. This might require changes to resource configuration or changes to resource type stop/start/failover action timeouts.

  4. After fixing the problem, move the resource group offline with the force option and then move the resource group online in the cluster.

    For example, the following command moves the resource group web-rg in the cluster web-cluster offline and ignores any errors:

    cmgr> admin offline resource_group web-rg in cluster web-cluster force

    The following command moves the resource group web-rg in the cluster web-cluster online:

    cmgr> admin online resource_group web-rg in cluster web-cluster

    The resource group web-rg should be in an ONLINE state with no error.

Use the following procedure when a resource group is not online but is in an error state. Most of these errors occur as a result of the exclusivity process. This process, run when a resource group is brought online, determines if any resources are already allocated somewhere in the failure domain of a resource group. Note that exclusivity scripts return that a resource is allocated on a node if the script fails in any way. In other words, unless the script can determine that a resource is not present, it returns a value indicating that the resource is allocated.

Some possible error states include: SPLIT RESOURCE GROUP (EXCLUSIVITY), NODE NOT AVAILABLE (EXCLUSIVITY), NO AVAILABLE NODES in failure domain. See “Resource Group Status” in Chapter 8, for explanations of resource group error codes.

  1. Look at the failsafe and SRMD logs (default directory: /var/cluster/ha/logs, files: failsafe_Nodename, srmd_Nodename) to determine the cause of the failure and the resource that failed.

    For example, suppose that the task of moving a resource group online results in a resource group with error state SPLIT RESOURCE GROUP (EXCLUSIVITY). This means that parts of a resource group are allocated on at least two different nodes. One of the failsafe logs will have the description of which nodes are believed to have the resource group partially allocated:

    [Resource Group:RGname]:Exclusivity failed -- RUNNING on Node1 and Node2

    [Resource Group:RGname]:Exclusivity failed -- PARTIALLY RUNNING on Node1 and PARTIALLY RUNNING on Node2

    At this point, look at the srmd logs on each of these nodes for exclusive script errors to see what resources are believed to be allocated. In some cases, a misconfigured resource will show up as a resource that is allocated. This is especially true for Netscape_web resources.

  2. Fix the cause of the failure. This might require changes to resource configuration or changes to resource type start/stop/exclusivity timeouts.

  3. After fixing the problem, move the resource group offline with the force option and then move the resource group online.

Perform the following checks when a resource group shows a no more nodes in AFD error:

  1. Check whether all nodes in the failover domain are missing from the FailSafe membership. Check CMSD logs for errors.

  2. Check the SRMD/script logs on all nodes in the failover domain for start/monitor script errors.

There are a few double failures that can occur in the cluster that will cause resource groups to remain in a non-highly-available state. At times a resource group might be stuck in an offline state. A resource group might also stay in an error state on a node even when a new node joins the cluster and the resource group can migrate to that node to clear the error. When these circumstances arise, do the following:

  1. If the resource group is offline, try to move it online.

  2. If the resource group is stuck on a node, detach the resource group and then bring it back online again. This should clear many errors (see the example after this list).

  3. If detaching the resource group does not work, force the resource group offline, then bring it back online.

  4. If commands appear to be hanging or not working properly, detach all resource groups, then shut down the cluster and bring all resource groups back online.
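
For example, using the cmgr commands shown elsewhere in this chapter, with web-rg and web-cluster as placeholders:

cmgr> admin detach resource_group web-rg in cluster web-cluster

cmgr> admin offline resource_group web-rg in cluster web-cluster force

cmgr> admin online resource_group web-rg in cluster web-cluster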

See “Take a Resource Group Offline” in Chapter 8, for information on detaching resource groups and forcing resource groups offline.

Clear Resource Error State

Use this procedure when a resource that is not part of a resource group is in an ONLINE state with an error. This can happen when the addition or removal of resources from a resource group fails.

Do the following:

  1. Look at the SRM logs to determine the cause of failure and the resource that has failed. The default location is:

    /var/cluster/ha/logs/srmd_Nodename

  2. Fix the problem that caused the failure. This might require changes to resource configuration or changes to resource type stop/start/failover action timeouts.

  3. Clear the error state with the GUI or the cmgr command:

    • Use the Clear Resource Error State GUI task. Provide the following information:

      • Resource Type: select the type of the resource

      • Resource in Error State: select the name of the resource that should be cleared from the error state

      Click OK to complete the task.

    • Use the cmgr admin offline_force command to move the resource offline. For example, to remove the error state of resource web-srvr of type Netscape_Web, making it available to be added to a resource group, enter the following:

      cmgr> admin offline_force resource web-srvr of resource_type Netscape_Web in cluster web-cluster

Control Network Failure Recovery

Control network failures are reported in the cmsd logs. The default location of the cmsd log is /var/cluster/ha/logs/cmsd_Nodename. Follow this procedure when the control network fails:

  1. Use the ping command to check whether the control network IP address is configured in the node (see the example after this list).

  2. Check node configuration to see whether the control network IP addresses are correctly specified.

    The following cluster_mgr command displays node configuration for web-node3:

    cmgr> show node web-node3

  3. If IP names are specified for control networks instead of IP addresses in XX.XX.XX.XX notation, check to see whether IP names can be resolved using DNS. You should use IP addresses instead of IP names.

  4. Check whether the heartbeat interval and node timeouts are correctly set for the cluster. These HA parameters can be seen by using the cluster_mgr show ha_parameters command.
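
For example, to check a control network address from a peer node (the address 192.0.2.3 is hypothetical; on IRIX the ping command typically resides in /usr/etc):

# /usr/etc/ping 192.0.2.3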

Serial Cable Failure Recovery

Serial cables are used for resetting a node when there is a node failure. Serial cable failures are reported in the crsd logs. The default location for the crsd log is /var/cluster/ha/log/crsd_Nodename.

Check the node configuration to see whether the serial cable connection is correctly configured.

The following cmgr command displays the node configuration for web-node3:

cmgr> show node web-node3

Use the admin ping command to verify the serial cables. The following command reports serial cable problems in node web-node3:

cmgr> admin ping node web-node3

Cluster Database Sync Failure

If the cluster database synchronization fails, use the following procedure:

  1. Check for the following message in the SYSLOG file on the target node:

    Starting to receive CDB sync series from machine <node1_node_ID>
    ...
    Finished receiving CDB sync series from machine <node1_node_ID>

  2. Check for control network or portmapper/rpcbind problems (see the example after this list).

  3. Check the node definition in the cluster database.

  4. Check the SYSLOG and fs2d logs on the source node.
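
For example, the following command lists the RPC services registered with the portmapper on the source node, which helps rule out portmapper/rpcbind problems (web-node2 is a placeholder; on IRIX, rpcinfo typically resides in /usr/etc):

# /usr/etc/rpcinfo -p web-node2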

Cluster Database Maintenance and Recovery

When the entire cluster database must be reinitialized, stop HA services on all nodes in the cluster, and then execute the following command on all nodes in the cluster:

# /usr/cluster/bin/cdbreinit /var/cluster/cdb/cdb.db

This command stops cluster processes, reinitializes the database and restarts all cluster processes. The contents of the cluster database will be automatically synchronized with other nodes if other nodes in the pool are available.

Otherwise, the cluster database must be restored from backup at this point. For instructions on backing up and restoring the cluster database, see “Cluster Database Backup and Restore” in Chapter 8.

GUI Will Not Run

If the GUI will not run, check the following:

  • Are the cluster daemons running?

    When you first install the software, the following daemons should be running:

    • fs2d

    • cmond

    • cad

    • crsd

    To determine which daemons are running, enter the following:

    # ps -ef | grep cluster

    The following shows an example of the output when just the initial daemons are running (for readability, whitespace has been removed):

    fs6 # ps -ef | grep cluster
    root 31431     1 0 12:51:36 ?     0:14 /usr/lib32/cluster/cbe/fs2d /var/cluster/cdb/cdb.db #
    root 31456 31478 0 12:53:01 ?     0:03 /usr/cluster/bin/crsd -l
    root 31475 31478 0 12:53:00 ?     0:08 /usr/cluster/bin/cad -l -lf /var/cluster/ha/log/cad_log --append_log
    root 31478     1 0 12:53:00 ?     0:00 /usr/cluster/bin/cmond -L info -f /var/cluster/ha/log/cmond_log
    root 31570 31408 0 14:01:52 pts/0 0:00 grep cluster

    If you do not see these processes, go to the logs to see what the problem might be. If you must restart the daemons, enter the following:

    # /etc/init.d/cluster start

  • Are the tcpmux and tcpmux/sgi_sysadm services enabled in the /etc/inetd.conf file?

    The following line is added to the /etc/inetd.conf file when sysamd_base is installed:

    tcpmux/sgi_sysadm stream tcp nowait root   ?/usr/sysadm/bin/sysadmd sysadmd

    If the tcpmux line is commented out, you must uncomment it and then run the following:

    # kill -HUP inetd

  • Are the inetd or tcp wrappers interfering? This may be indicated by connection refused or login failed messages.

  • Are you connecting to an IRIX node? The fsmgr command can only be executed on an IRIX node. The GUI may be run from a node running an operating system other than IRIX via the Web if you connect the GUI to an IRIX node.

GUI and cmgr Inconsistencies

If the GUI is displaying information that is inconsistent with the FailSafe cmgr command, restart the cad process on the node to which the GUI is connected by executing the following command:

# killall cad

The cluster administration daemon is restarted automatically by the cmond process.

GUI Does Not Report Information

If the GUI is not reporting configuration information and status, perform the following steps:

  1. Check the information using the cmgr command. If cmgr is reporting correct information, there is a GUI update problem.

  2. If there is a GUI update problem, kill the cad daemon on that node. Wait for a few minutes to see whether cad gets correct information. Check the cad logs on that node for errors.

  3. Check the CLI logs on that node for errors.

  4. If the status information is incorrect, check the cmsd or fsd logs on that node.

Using the cdbreinit Command

When the cluster database is not synchronized on all nodes in the cluster, you can run the cdbreinit command to recover. The cdbreinit command should be run on the node that is not in sync.

Perform the following steps.


Note: Perform each step on all the nodes before proceeding to the next step in the recovery procedure.


  1. Stop FailSafe HA services in the cluster using the GUI or cmgr.

  2. Stop cluster processes on all nodes in the pool:

    # /etc/init.d/cluster stop
    # killall fs2d

  3. Run cdbreinit on the node where the cluster database is not in sync.

  4. Start cluster processes on all nodes in the pool:

    # /etc/init.d/cluster start

  5. Wait a few minutes for the cluster database to synchronize. There will be cluster database sync messages in the SYSLOG on the node.

  6. Start FailSafe HA services in the cluster.

Action Script Configuration Errors

If you try to execute an action script that is missing or does not have the correct permissions, you will get an error message (see “ha_srmd Error Message” in Appendix C.) After fixing the problem, you must send a SIGHUP signal to the ha_srmd process on each node so that it rereads the configuration. Use the following command line on each node:

# killall -HUP ha_srmd

When ha_srmd receives the SIGHUP signal, it will reread the resource type configuration. If ha_srmd finds errors in the resource type configuration, the errors will be logged in the SYSLOG or ha_srmd logs.

CXFS Metadata Server Relocation

If the relocate-mds attribute in the CXFS resource definition is set to true, FailSafe uses the umount command with the -k option to move a resource during a CXFS metadata server relocation. The umount -k command kills all server processes that are using the CXFS filesystem.
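
A minimal illustration of the effect, assuming a hypothetical CXFS mount point /mnt/cxfs1:

# umount -k /mnt/cxfs1     (kills processes using the filesystem before unmounting it)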

Other Problems with CXFS Coexecution

For information on solving problems involving coexecution with CXFS, see the troubleshooting chapter of the CXFS Administration Guide for SGI InfiniteStorage.

Reporting Problems to SGI

When reporting a problem about a FailSafe node to SGI, you should retain the following information:

  • System core files in /var/adm/crash, including:

    analysis.number
    unix.number
    vmcore.number.comp

  • Output about the cluster obtained from the cxfsdump utility, which is shipped in the cluster_services.sw.base software product. (Although it was written primarily for CXFS, it also provides cluster information applicable to FailSafe.) You can run this utility immediately after noticing a problem. It collects the following:

    • Information from the following files:

      /var/adm/SYSLOG
      /var/cluster/ha/log/*
      /etc/failover.conf
      /var/sysgen/stune
      /etc/hosts

    • Output from the following commands:

      /usr/cluster/bin/cdbutil gettree '#'
      /usr/sbin/versions -n
      /usr/sbin/systune
      /sbin/hinv -vm
      /sbin/xvm show -v phys
      /sbin/xvm show -top -v vol
      /usr/sbin/scsifo -d
      /usr/etc/netstat -ia