Chapter 11. General Troubleshooting

This chapter contains the following:

Also see the following platform-specific sections:

For more advanced cluster troubleshooting, see the CXFS Administration Guide for SGI InfiniteStorage.

Identifying Problems

This section provides tips about identifying problems:

Is the Client-Only Node Configured Correctly?

To determine the current configuration of a node in a cluster, run the following command on a CXFS server-capable administration node:

/usr/cluster/bin/cxfs-config -all

For more information, see “Verifying the Cluster Status” in Chapter 10.

Confirm that the host type, private network, and failure hierarchy are configured correctly, and that no warnings or errors are reported. You should rectify any warnings or errors before proceeding with further troubleshooting.

Is the Client-Only Node in Membership?

To determine if the node is in the cluster membership, use the tools described in “Verifying the Cluster Status” in Chapter 10.

If the client is not in membership, see the following:

Is the Client-Only Node Mounting All Filesystems?

To determine if the node has mounted all configured filesystems, use the tools described in “Verifying the Cluster Status” in Chapter 10.

If the client has not mounted all filesystems, see the following:

Can the Client-Only Node Access All Filesystems?

To determine if the client-only node can access a filesystem, navigate the filesystem and attempt to create a file.

If the filesystem appears to be empty, the mount may have failed or been lost. See “Determining If a Client-Only Node Is Fenced” and “Verifying Access to XVM Volumes” in Chapter 10.

If accessing the filesystem hangs the viewing process, see “Filesystem Appears to Be Hung”.
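The access check described above can be scripted. The following is a minimal Python sketch, not part of CXFS: the function name probe_directory and the .probe_ file prefix are hypothetical. It counts the directory entries (to spot a mount that appears empty) and then attempts to create and remove a temporary file. If the filesystem is hung, these calls will hang as well.

```python
import os
import tempfile


def probe_directory(directory):
    """Return (entry_count, writable) for the given directory.

    entry_count is the number of entries seen before the probe file is
    created; writable is True if a temporary file could be created there.
    """
    entry_count = len(os.listdir(directory))
    try:
        # The temporary file is deleted automatically when the block exits.
        with tempfile.NamedTemporaryFile(dir=directory, prefix=".probe_"):
            pass
        writable = True
    except OSError:
        writable = False
    return entry_count, writable
```

An entry count of 0 on a filesystem that should contain data suggests that the mount failed or was lost; a False second value means that the create attempt was refused.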

Are There Error Messages?

When determining the state of the client-only node, you should check error message logs to help identify any problems.

Appendix A, “Operating System Path Differences”, lists the location of the cxfs_client log file for each platform. This log is also displayed in the Windows version of cxfs_info.

Each platform has its own system log for kernel error messages, which may also capture CXFS messages. See the following:

Various logs are also located on the CXFS server-capable administration nodes. For more information, see the CXFS Administration Guide for SGI InfiniteStorage.

What Is the Network Status?

Use the netstat command on a client-only node to determine the network status.

For example, to determine if you have a bad connection, you could enter the following from a DOS console on the Windows platform:

C:\Documents and Settings\cxfsqa>netstat -e -s

The Linux, Mac OS X, and Windows platforms support the -s option, which shows per-protocol statistics. The Linux and Windows systems also support the -e option, which shows Ethernet statistics. See the netstat(1) man page for information about options.
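As an illustration only, the option support described above can be captured in a small Python helper that builds the netstat command line for the local platform. The table NETSTAT_OPTIONS and the function netstat_command are hypothetical names, and the table reflects only the statement in this section; check the man page on your system before relying on it.

```python
import platform

# Options per platform, taken from the description in this section.
# platform.system() reports Mac OS X as "Darwin".
NETSTAT_OPTIONS = {
    "Linux": ["-e", "-s"],
    "Windows": ["-e", "-s"],
    "Darwin": ["-s"],
}


def netstat_command(system=None):
    """Return the netstat argument list for the named (or local) platform."""
    system = system or platform.system()
    options = NETSTAT_OPTIONS.get(system)
    if options is None:
        raise ValueError("no netstat options recorded for " + system)
    return ["netstat"] + options
```

The returned list can be passed to subprocess.run() on a client-only node to capture the statistics for later comparison.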

What Is the Status of XVM Mirror Licenses?

To view the current status of XVM mirror licenses, use the following command and search for the line containing the keyword mirrors:

xvm show -subsystem

For example:

# xvm show -subsystem
XVM Subsystem Information:
--------------------------
apivers:                 26
config gen:              33
privileged:              1
clustered:               1
cluster initialized:     1
user license enabled:    1
local mirrors enabled:   1
cluster mirrors enabled: 1
snapshot enabled:        1
snapshot max blocks:     -1
snapshot blocks used:    0
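Searching the output for the mirrors lines can be automated. The following Python sketch parses text in the format shown above; the function names are hypothetical, and it assumes (the output does not say so explicitly) that a value of 1 means the feature is enabled.

```python
SAMPLE_OUTPUT = """\
XVM Subsystem Information:
--------------------------
apivers:                 26
local mirrors enabled:   1
cluster mirrors enabled: 1
snapshot enabled:        1
"""


def parse_subsystem(output):
    """Map each 'name: value' line of the output to a dictionary entry."""
    fields = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        # Skip the title and separator lines, which carry no value.
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields


def mirror_status(output):
    """Return {field name: True/False} for every field mentioning mirrors.

    Assumption: a value of "1" means enabled.
    """
    return {
        name: value == "1"
        for name, value in parse_subsystem(output).items()
        if "mirrors" in name
    }
```

Calling mirror_status() on the captured command output gives one entry per mirror license line, which is easier to check in a script than the full listing.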

Typical Problems and Solutions

This section contains the following typical problems that apply to any platform:

cdb Error in the cxfs_client Log

The following errors in the cxfs_client log may indicate that the client is not found in the cluster database:

cxfs_client: cis_client_run querying CIS server
cxfs_client: cis_cdb_go ERROR: Error returned from server: cdb error (6)

Run the cxfs-config command on the metadata server and verify that the client's hostname appears in the cluster database. For additional information about the error, review the /var/cluster/ha/log/fs2d_log file on the metadata server.
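To spot this condition in a long log, the log lines can be scanned for the error pattern shown above. The following Python sketch is illustrative only: the function name find_cdb_errors is hypothetical, and the pattern is derived solely from the two sample lines in this section.

```python
import re

# Matches the cis_cdb_go error line shown above and captures the error code.
CDB_ERROR = re.compile(r"cis_cdb_go ERROR: .*cdb error \((\d+)\)")

SAMPLE_LOG = [
    "cxfs_client: cis_client_run querying CIS server",
    "cxfs_client: cis_cdb_go ERROR: Error returned from server: cdb error (6)",
]


def find_cdb_errors(lines):
    """Return the numeric cdb error codes found in the given log lines."""
    codes = []
    for line in lines:
        match = CDB_ERROR.search(line)
        if match:
            codes.append(int(match.group(1)))
    return codes
```

To scan a real log, pass an open file object for the cxfs_client log (see Appendix A for its location on each platform) in place of SAMPLE_LOG.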

Unable to Achieve Membership

If cxfs_info does not report that CMS is UP, do the following:

  1. Check that cxfs_client is running. See one of the following sections as appropriate for your platform:

  2. Look for other warnings and error messages in the cxfs_client log file. See Appendix A, “Operating System Path Differences” for the location of the log file on different platforms.

  3. Check cxfs-config output on the CXFS server-capable administration node to ensure that the client is correctly configured and is reachable via the configured CXFS private network. For example:

    admin# /usr/cluster/bin/cxfs-config -all

  4. Check that the client is enabled into the cluster by running clconf_info on a CXFS server-capable administration node.

  5. Look in the system log on the CXFS metadata server to ensure the server detected the client that is attempting to join membership and check for any other CXFS warnings or errors.

  6. Check that the metadata server has the node correctly configured in its hostname lookup scheme (/etc/hosts file or DNS).

  7. If you are still unable to resolve the problem, reboot the client node.

  8. If rebooting the client node in step 7 did not resolve the problem, restart the cluster administration daemons (fs2d, cad, cmond, and crsd) on the metadata server. This step may result in a temporary delay in access to the filesystem from all nodes.

  9. If restarting cluster administration daemons in step 8 did not solve the problem, reboot the metadata server. This step may result in the filesystems being unmounted on all nodes.
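The hostname lookup check in step 6 can be performed with a short script run on the metadata server. The following Python sketch uses the standard resolver, which consults whatever lookup scheme the host is configured to use; the function name resolve is hypothetical.

```python
import socket


def resolve(hostname):
    """Return the sorted list of addresses the local resolver gives for hostname.

    An empty list means the name could not be resolved.
    """
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # the address is the first element of sockaddr.
    return sorted({info[4][0] for info in infos})
```

Compare the returned addresses for the client's hostname with the private network address reported by cxfs-config; an empty list or a mismatch points to a lookup configuration problem.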

Filesystem Appears to Be Hung

If any CXFS filesystem activity appears to be hung, do the following:

  1. Check that the client is still in membership and the filesystem is mounted according to cxfs_info.

  2. Check on the metadata server for any messages that are more than a few seconds old (known as stuck messages). For example, running icrash as root on IRIX shows that the following message was received from cell 4 more than four minutes ago:

    # icrash
    >>>> mesglist
    Cell:1
    THREAD ADDR         MSG ID TYPE CELL MESSAGE                          Time(Secs)
    ================== ======= ==== ==== ================================ ==========
    0xa80000004bc86400    10fc  Rcv    4                 I_dsxvn_allocate       4:20

  3. If there is a stuck message, gather information for SGI support:

    • Find the stack trace for the stuck thread. For example:

      >>>> kthread 0xa80000004bc86400

               KTHREAD TYPE                ID             WCHAN NAME
      =============================================================================
      a80000004bc86400    1         100000534  c000000002748008 mtcp_notify
      =============================================================================
      1 kthread struct found

      >>>> defkthread 0xa80000004bc86400

      Default kthread is 0xa80000004bc86400

      >>>> trace

      ===============================================================================
      STACK TRACE FOR XTHREAD 0xa80000004bc86400 (mtcp_notify):

       1 istswtch[../os/swtch.c: 1526, 0xc00000000021764c]
       2 swtch[../os/swtch.c: 1026, 0xc000000000216de8]
       3 thread_block[../os/ksync/mutex.c: 178, 0xc00000000017dc8c]
       4 sv_queue[../os/ksync/mutex.c: 1595, 0xc00000000017f36c]
       5 sv_timedwait[../os/ksync/mutex.c: 2205, 0xc0000000001800a0]
       6 sv_wait[../os/ksync/mutex.c: 1392, 0xc00000000017f038]
       7 xlog_state_sync[../fs/xfs/xfs_log.c: 2986, 0xc0000000002a535c]
       8 xfs_log_force[../fs/xfs/xfs_log.c: 361, 0xc0000000002a25dc]
       9 cxfs_dsxvn_wait_inode_safe[../fs/cxfs/server/cxfs_dsxvn.c: 2011, 0xc00000000046a594]
      10 dsvn_getobjects[../fs/cxfs/server/dsvn.c: 3266, 0xc0000000004676fc]
      11 I_dsxvn_allocate[../fs/cxfs/server/cxfs_dsxvn.c: 1406, 0xc0000000004699c8]
      12 dsxvn_msg_dispatcher[../IP27bootarea/I_dsxvn_stubs.c: 119, 0xc000000000456768]
      13 mesg_demux[../cell/mesg/mesg.c: 1130, 0xc000000000408e88]
      14 mtcp_notify[../cell/mesg/mesg_tcp.c: 1100, 0xc0000000004353d8]
      15 tsv_thread[../cell/tsv.c: 303, 0xc000000000437738]
      16 xthread_prologue[../os/swtch.c: 1638, 0xc00000000021782c]
      17 xtresume[../os/swtch.c: 1686, 0xc0000000002178f8]
      ===============================================================================

    • Run cxfsdump on the metadata server.

    • Run cxfsdump on the client that has the stuck message.

    • If possible, force the client that has the stuck message to generate a crash dump.

  4. Reboot the client that has the stuck message. This is required for CXFS to recover.
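The stuck-message check in step 2 can be sketched as a small parser over saved mesglist output. The following Python code is illustrative only: the function names and the 10-second default threshold are hypothetical, and it assumes that the Time(Secs) column uses colon-separated fields (so that 4:20 means 4 minutes 20 seconds, consistent with the example above).

```python
SAMPLE_MESGLIST = """\
Cell:1
THREAD ADDR         MSG ID TYPE CELL MESSAGE                          Time(Secs)
================== ======= ==== ==== ================================ ==========
0xa80000004bc86400    10fc  Rcv    4                 I_dsxvn_allocate       4:20
"""


def parse_age(text):
    """Convert an age such as '20' or '4:20' to seconds (assumed format)."""
    seconds = 0
    for part in text.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def stuck_messages(output, threshold_secs=10):
    """Return the message rows whose age exceeds threshold_secs."""
    stuck = []
    for line in output.splitlines():
        fields = line.split()
        # Data rows have six columns and begin with a thread address.
        if len(fields) != 6 or not fields[0].startswith("0x"):
            continue
        thread, msg_id, kind, cell, message, age = fields
        age_secs = parse_age(age)
        if age_secs > threshold_secs:
            stuck.append({
                "thread": thread,
                "msg_id": msg_id,
                "type": kind,
                "cell": int(cell),
                "message": message,
                "age_secs": age_secs,
            })
    return stuck
```

The thread value of each returned row is the address to pass to the kthread and defkthread commands in step 3, and the cell value identifies the client that sent the message.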

Determining If a Client-Only Node Is Fenced

To determine if a client-only node is fenced, log in to a CXFS server-capable administration node and use the hafence(1M) command. A fenced port is displayed as status=disabled.

In the following example, none of the ports that have been registered as CXFS host ports are fenced:

admin# /usr/cluster/bin/hafence -q
Switch[0] "brocade04" has 16 ports
Port 4 type=FABRIC status=enabled hba=210000e08b0042d8 on host o200c
Port 5 type=FABRIC status=enabled hba=210000e08b00908e on host cxfs30
Port 9 type=FABRIC status=enabled hba=2000000173002d3e on host cxfssun3

All switch ports can also be shown with hafence:

admin# /usr/cluster/bin/hafence -v
Switch[0] "brocade04" has 16 ports
Port 0 type=FABRIC status=enabled hba=2000000173003b5f on host UNKNOWN
Port 1 type=FABRIC status=enabled hba=2000000173003adf on host UNKNOWN
Port 2 type=FABRIC status=enabled hba=210000e08b023649 on host UNKNOWN
Port 3 type=FABRIC status=enabled hba=210000e08b021249 on host UNKNOWN
Port 4 type=FABRIC status=enabled hba=210000e08b0042d8 on host o200c
Port 5 type=FABRIC status=enabled hba=210000e08b00908e on host cxfs30
Port 6 type=FABRIC status=enabled hba=2000000173002d2a on host UNKNOWN
Port 7 type=FABRIC status=enabled hba=2000000173003376 on host UNKNOWN
Port 8 type=FABRIC status=enabled hba=2000000173002c0b on host UNKNOWN
Port 9 type=FABRIC status=enabled hba=2000000173002d3e on host cxfssun3
Port 10 type=FABRIC status=enabled hba=2000000173003430 on host UNKNOWN
Port 11 type=FABRIC status=enabled hba=200900a0b80c13c9 on host UNKNOWN
Port 12 type=FABRIC status=disabled hba=0000000000000000 on host UNKNOWN
Port 13 type=FABRIC status=enabled hba=200d00a0b80c2476 on host UNKNOWN
Port 14 type=FABRIC status=enabled hba=1000006069201e5b on host UNKNOWN
Port 15 type=FABRIC status=enabled hba=1000006069201e5b on host UNKNOWN

When the client-only node joins membership, any fences on any switch ports connected to that node should be lowered and the status changed to enabled.
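On a switch with many ports, the fenced ports can be picked out of saved hafence output with a short script. The following Python sketch is illustrative only: the function name fenced_ports is hypothetical, and the pattern is derived solely from the port lines shown in the examples above.

```python
import re

# Matches one port line of hafence output, as shown in the examples above.
PORT_LINE = re.compile(
    r"Port (\d+) type=(\S+) status=(\S+) hba=(\S+) on host (\S+)"
)

SAMPLE_HAFENCE = """\
Switch[0] "brocade04" has 16 ports
Port 9 type=FABRIC status=enabled hba=2000000173002d3e on host cxfssun3
Port 11 type=FABRIC status=enabled hba=200900a0b80c13c9 on host UNKNOWN
Port 12 type=FABRIC status=disabled hba=0000000000000000 on host UNKNOWN
"""


def fenced_ports(output):
    """Return (port, hba, host) for every port reported as status=disabled."""
    fenced = []
    for line in output.splitlines():
        match = PORT_LINE.match(line.strip())
        if not match:
            continue  # the switch header line and blank lines are skipped
        port, _port_type, status, hba, host = match.groups()
        if status == "disabled":
            fenced.append((int(port), hba, host))
    return fenced
```

If a port connected to a node that is in membership appears in the returned list, the fence on that port has not been lowered.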

However, if the node still does not have access to the storage, do the following:

No HBA WWPNs Are Detected

On most platforms, the cxfs_client software automatically detects the world wide port names (WWPNs) of any supported host bus adapters (HBAs) in the system that are connected to a switch that is configured in the cluster database. These HBAs will then be available for fencing.

However, if no WWPNs are detected, there will be messages about loading the HBA/SNIA library.

See the following:

Membership Is Prevented by Firewalls

If a client has trouble obtaining membership, verify that the system firewall is configured for CXFS use. See “Configure Firewalls for CXFS Use” in Chapter 2.

Devices Are Unknown

You can run the cxfs-reprobe script on a client-only node (other than Windows) to look for devices and perform a SCSI bus reset if necessary. cxfs-reprobe will also issue an XVM probe to tell XVM that there may be new devices available:

client# /var/cluster/cxfs_client-scripts/cxfs-reprobe

Verifying the XVM Mirror Licenses on Client-Only Nodes

To view the current status of XVM mirror licenses on client-only nodes, use the following command and search for the line containing the keyword mirrors:

xvm show -subsystem

For example:

client# xvm show -subsystem
XVM Subsystem Information:
--------------------------
apivers:                 26
config gen:              33
privileged:              1
clustered:               1
cluster initialized:     1
user license enabled:    1
local mirrors enabled:   1
cluster mirrors enabled: 1
snapshot enabled:        1
snapshot max blocks:     -1
snapshot blocks used:    0

Reporting Problems to SGI

When reporting a problem with a client-only node, it is important to retain the appropriate information; having access to this information will greatly assist SGI in the process of diagnosing and fixing problems. The methods used to collect required information for problem reports are platform-specific: