This chapter contains the following:
Also see the following platform-specific sections:
For more advanced cluster troubleshooting, see the CXFS Administration Guide for SGI InfiniteStorage.
This section provides tips about identifying problems:
To determine the current configuration of a node in a cluster, run the following command on a CXFS server-capable administration node:
For more information, see “Verifying the Cluster Status” in Chapter 10.
Confirm that the host type, private network, and failure hierarchy are configured correctly, and that no warnings or errors are reported. You should rectify any warnings or errors before proceeding with further troubleshooting.
To determine if the node is in the cluster membership, use the tools described in “Verifying the Cluster Status” in Chapter 10.
If the client is not in membership, see the following:
To determine if the node has mounted all configured filesystems, use the tools described in “Verifying the Cluster Status” in Chapter 10.
If the client has not mounted all filesystems, see the following:
To determine if the client-only node can access a filesystem, navigate the filesystem and attempt to create a file.
If the filesystem appears to be empty, the mount may have failed or been lost. See “Determining If a Client-Only Node Is Fenced” and “Verifying Access to XVM Volumes” in Chapter 10.
If accessing the filesystem hangs the viewing process, see “Filesystem Appears to Be Hung”.
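The access check described above can be scripted. The following is a minimal sketch, not part of CXFS: the function name `can_create_file` and the `cxfs_probe_` file prefix are hypothetical, and the directory you pass in is assumed to be the mount point of the CXFS filesystem you want to test. Note that if the filesystem is hung, this probe will hang in the same way an interactive attempt would.

```python
import os
import tempfile


def can_create_file(directory):
    """Try to list a directory, then create and remove a scratch file in it.

    Returns (ok, detail). The name and return shape are illustrative only.
    """
    try:
        entries = os.listdir(directory)
    except OSError as exc:
        return False, f"cannot list directory: {exc}"
    try:
        # mkstemp creates a uniquely named file, so existing data is untouched.
        fd, path = tempfile.mkstemp(prefix="cxfs_probe_", dir=directory)
    except OSError as exc:
        return False, f"cannot create file: {exc}"
    os.close(fd)
    os.remove(path)
    if not entries:
        # An empty directory may mean the mount failed or was lost.
        return True, "directory is empty; the mount may have failed or been lost"
    return True, "ok"
```

A result of `(True, "ok")` only shows that the directory is writable; it does not by itself prove that the directory is a mounted CXFS filesystem.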
When determining the state of the client-only node, you should check error message logs to help identify any problems.
Appendix A, “Operating System Path Differences” lists the location of the cxfs_client log file for each platform. This log is also displayed in the Windows version of cxfs_info.
Each platform also has its own system log for kernel error messages, which may capture CXFS messages as well. See the following:
Various logs are also located on the CXFS server-capable administration nodes. For more information, see the CXFS Administration Guide for SGI InfiniteStorage.
Use the netstat command on a client-only node to determine the network status.
For example, to determine if you have a bad connection, you could enter the following from a DOS console on the Windows platform:
C:\Documents and Settings\cxfsqa>netstat -e -s
The Linux, Mac OS X, and Windows platforms support the -s option, which shows per-protocol statistics. The Linux and Windows systems also support the -e option, which shows Ethernet statistics. See the netstat(1) man page for information about options.
To view the current status of XVM mirror licenses, use the following command and search for the line containing the keyword mirrors:
xvm show -subsystem
# xvm show -subsystem
XVM Subsystem Information:
--------------------------
apivers: 26
config gen: 33
privileged: 1
clustered: 1
cluster initialized: 1
user license enabled: 1
local mirrors enabled: 1
cluster mirrors enabled: 1
snapshot enabled: 1
snapshot max blocks: -1
snapshot blocks used: 0
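Searching for the mirrors lines can be automated. The sketch below assumes the output has one `key: value` pair per line, as in the example above, and that a value of 1 means the feature is enabled; both are assumptions drawn from that single example. The function name `mirror_license_status` is hypothetical.

```python
def mirror_license_status(xvm_output):
    """Extract the 'mirrors' lines from captured `xvm show -subsystem` output.

    Returns a dict mapping each field name to True (value 1) or False.
    Assumes one "key: value" pair per line; this is not an official parser.
    """
    status = {}
    for line in xvm_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and "mirrors" in key:
            status[key.strip()] = value.strip() == "1"
    return status
```

For the example output above, both `local mirrors enabled` and `cluster mirrors enabled` would be reported as `True`.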
This section contains the following typical problems that apply to any platform:
The following errors in the cxfs_client log may indicate that the client is not found in the cluster database:
cxfs_client: cis_client_run querying CIS server
cxfs_client: cis_cdb_go ERROR: Error returned from server: cdb error (6)
Run the cxfs-config command on the metadata server and verify that the client's hostname appears in the cluster database. For additional information about the error, review the /var/cluster/ha/log/fs2d_log file on the metadata server.
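To look for this error in a long log, a small scan such as the following may help. It is a sketch only: the regular expression is derived from the single message shown above, and the function name `find_cdb_errors` is hypothetical. The caller supplies the log lines, because the location of the cxfs_client log differs by platform (see Appendix A).

```python
import re

# Pattern based on the one example message in this section (an assumption).
CDB_ERROR = re.compile(r"cis_cdb_go ERROR: .*cdb error \((\d+)\)")


def find_cdb_errors(log_lines):
    """Return (line_number, error_code) for each cdb error found."""
    hits = []
    for number, line in enumerate(log_lines, start=1):
        match = CDB_ERROR.search(line)
        if match:
            hits.append((number, int(match.group(1))))
    return hits
```

For example, `find_cdb_errors(open(path))` scans a file line by line, where `path` is the platform-specific log location.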
If cxfs_info does not report that CMS is UP, do the following:
1. Check that cxfs_client is running. See one of the following sections as appropriate for your platform:
   “Start/Stop cxfs_client for SGI ProPack Client-Only Nodes” in Chapter 7
   “Start/Stop the CXFS Client Service for Windows” in Chapter 9
2. Look for other warnings and error messages in the cxfs_client log file. See Appendix A, “Operating System Path Differences” for the location of the log file on different platforms.
3. Check cxfs-config output on the CXFS server-capable administration node to ensure that the client is correctly configured and is reachable via the configured CXFS private network. For example:
   admin# /usr/cluster/bin/cxfs-config -all
4. Check that the client is enabled into the cluster by running clconf_info on a CXFS server-capable administration node.
5. Look in the system log on the CXFS metadata server to ensure that the server detected the client attempting to join membership, and check for any other CXFS warnings or errors.
6. Check that the metadata server has the node correctly configured in its hostname lookup scheme (/etc/hosts file or DNS).
7. If you are still unable to resolve the problem, reboot the client node.
8. If rebooting the client node in step 7 did not resolve the problem, restart the cluster administration daemons (fs2d, cad, cmond, and crsd) on the metadata server. This step may result in a temporary delay in access to the filesystem from all nodes.
9. If restarting cluster administration daemons in step 8 did not solve the problem, reboot the metadata server. This step may result in the filesystems being unmounted on all nodes.
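The hostname lookup check in the steps above can be sketched as follows. This assumes Python is available on the machine whose lookup scheme you want to test, which may not be true of every metadata server. The function name `resolve_addresses` and the `resolver` parameter are hypothetical; the parameter exists only so that the lookup can be replaced in tests.

```python
import socket


def resolve_addresses(hostname, resolver=socket.getaddrinfo):
    """Resolve a hostname using the local lookup scheme (hosts file or DNS).

    Returns (addresses, error): a sorted list of unique addresses and None,
    or an empty list and the resolver's error text.
    """
    try:
        results = resolver(hostname, None)
    except socket.gaierror as exc:
        return [], str(exc)
    # Each result is (family, type, proto, canonname, sockaddr);
    # the address string is the first element of sockaddr.
    addresses = sorted({info[4][0] for info in results})
    return addresses, None
```

Compare the returned addresses with the address configured for the client on the CXFS private network. An empty list, or an address on a different network, suggests that the lookup scheme needs correcting.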
If any CXFS filesystem activity appears to be hung in the filesystem, do the following:
Check that the client is still in membership and the filesystem is mounted according to cxfs_info.
Check on the metadata server to see if any messages are older than a few seconds (known as stuck messages). For example, on IRIX running icrash as root, the following message was received from cell 4 more than four minutes ago:
# icrash
>>>> mesglist
Cell:1
THREAD ADDR        MSG ID  TYPE CELL MESSAGE                          Time(Secs)
================== ======= ==== ==== ================================ ==========
0xa80000004bc86400 10fc    Rcv  4    I_dsxvn_allocate                 4:20
If there is a stuck message, gather information for SGI support:
Find the stack trace for the stuck thread. For example:
>>>> kthread 0xa80000004bc86400
KTHREAD TYPE ID WCHAN NAME
=============================================================================
a80000004bc86400 1 100000534 c000000002748008 mtcp_notify
=============================================================================
1 kthread struct found
>>>> defkthread 0xa80000004bc86400
Default kthread is 0xa80000004bc86400
>>>> trace
===============================================================================
STACK TRACE FOR XTHREAD 0xa80000004bc86400 (mtcp_notify):
 1 istswtch[../os/swtch.c: 1526, 0xc00000000021764c]
 2 swtch[../os/swtch.c: 1026, 0xc000000000216de8]
 3 thread_block[../os/ksync/mutex.c: 178, 0xc00000000017dc8c]
 4 sv_queue[../os/ksync/mutex.c: 1595, 0xc00000000017f36c]
 5 sv_timedwait[../os/ksync/mutex.c: 2205, 0xc0000000001800a0]
 6 sv_wait[../os/ksync/mutex.c: 1392, 0xc00000000017f038]
 7 xlog_state_sync[../fs/xfs/xfs_log.c: 2986, 0xc0000000002a535c]
 8 xfs_log_force[../fs/xfs/xfs_log.c: 361, 0xc0000000002a25dc]
 9 cxfs_dsxvn_wait_inode_safe[../fs/cxfs/server/cxfs_dsxvn.c: 2011, 0xc00000000046a594]
10 dsvn_getobjects[../fs/cxfs/server/dsvn.c: 3266, 0xc0000000004676fc]
11 I_dsxvn_allocate[../fs/cxfs/server/cxfs_dsxvn.c: 1406, 0xc0000000004699c8]
12 dsxvn_msg_dispatcher[../IP27bootarea/I_dsxvn_stubs.c: 119, 0xc000000000456768]
13 mesg_demux[../cell/mesg/mesg.c: 1130, 0xc000000000408e88]
14 mtcp_notify[../cell/mesg/mesg_tcp.c: 1100, 0xc0000000004353d8]
15 tsv_thread[../cell/tsv.c: 303, 0xc000000000437738]
16 xthread_prologue[../os/swtch.c: 1638, 0xc00000000021782c]
17 xtresume[../os/swtch.c: 1686, 0xc0000000002178f8]
===============================================================================
Run cxfsdump on the metadata server.
Run cxfsdump on the client that has the stuck message.
If possible, force the client that has the stuck message to generate a crash dump.
Reboot the client that has the stuck message. This is required for CXFS to recover.
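The stuck-message check in the procedure above could be applied to captured mesglist output roughly as follows. This sketch rests on several assumptions taken from the one example row: each message row has six whitespace-separated fields, starts with a `0x` thread address, and reports its age as minutes:seconds (the example's 4:20 is described as more than four minutes). The 10-second default threshold and the function names are hypothetical.

```python
def parse_age(text):
    """Convert an age such as '4:20' (or plain seconds) into seconds."""
    seconds = 0
    for part in text.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def stuck_messages(mesglist_output, threshold_secs=10):
    """Return the message rows that are older than threshold_secs."""
    stuck = []
    for line in mesglist_output.splitlines():
        fields = line.split()
        # Skip prompts, headers, and rulers; keep only message rows.
        if len(fields) != 6 or not fields[0].startswith("0x"):
            continue
        thread, msg_id, kind, cell, message, age = fields
        try:
            age_secs = parse_age(age)
        except ValueError:
            continue
        if age_secs > threshold_secs:
            stuck.append({
                "thread": thread,
                "cell": cell,
                "message": message,
                "age_secs": age_secs,
            })
    return stuck
```

The `thread` value of each result is the address to pass to the kthread and defkthread commands when gathering the stack trace.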
To determine if a client-only node is fenced, log in to a CXFS server-capable administration node and use the hafence(1M) command. A fenced port is displayed as status=disabled.
In the following example, none of the ports that have been registered as CXFS host ports are fenced:
admin# /usr/cluster/bin/hafence -q
Switch "brocade04" has 16 ports
  Port 4 type=FABRIC status=enabled hba=210000e08b0042d8 on host o200c
  Port 5 type=FABRIC status=enabled hba=210000e08b00908e on host cxfs30
  Port 9 type=FABRIC status=enabled hba=2000000173002d3e on host cxfssun3
All switch ports can also be shown with hafence:
admin# /usr/cluster/bin/hafence -v
Switch "brocade04" has 16 ports
  Port 0 type=FABRIC status=enabled hba=2000000173003b5f on host UNKNOWN
  Port 1 type=FABRIC status=enabled hba=2000000173003adf on host UNKNOWN
  Port 2 type=FABRIC status=enabled hba=210000e08b023649 on host UNKNOWN
  Port 3 type=FABRIC status=enabled hba=210000e08b021249 on host UNKNOWN
  Port 4 type=FABRIC status=enabled hba=210000e08b0042d8 on host o200c
  Port 5 type=FABRIC status=enabled hba=210000e08b00908e on host cxfs30
  Port 6 type=FABRIC status=enabled hba=2000000173002d2a on host UNKNOWN
  Port 7 type=FABRIC status=enabled hba=2000000173003376 on host UNKNOWN
  Port 8 type=FABRIC status=enabled hba=2000000173002c0b on host UNKNOWN
  Port 9 type=FABRIC status=enabled hba=2000000173002d3e on host cxfssun3
  Port 10 type=FABRIC status=enabled hba=2000000173003430 on host UNKNOWN
  Port 11 type=FABRIC status=enabled hba=200900a0b80c13c9 on host UNKNOWN
  Port 12 type=FABRIC status=disabled hba=0000000000000000 on host UNKNOWN
  Port 13 type=FABRIC status=enabled hba=200d00a0b80c2476 on host UNKNOWN
  Port 14 type=FABRIC status=enabled hba=1000006069201e5b on host UNKNOWN
  Port 15 type=FABRIC status=enabled hba=1000006069201e5b on host UNKNOWN
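On a switch with many ports, picking out the fenced ones by eye is error-prone. The following sketch lists ports reported as status=disabled in captured hafence output. The regular expression is inferred from the example output in this section and is an assumption rather than a documented format; the function name `fenced_ports` is hypothetical.

```python
import re

# Field layout inferred from the hafence examples in this section.
PORT_LINE = re.compile(
    r"Port\s+(\d+)\s+type=(\S+)\s+status=(\S+)\s+hba=([0-9a-fA-F]+)\s+on host\s+(\S+)"
)


def fenced_ports(hafence_output):
    """Return a list of dicts for ports whose status is 'disabled'."""
    fenced = []
    for match in PORT_LINE.finditer(hafence_output):
        port, port_type, status, hba, host = match.groups()
        if status == "disabled":
            fenced.append({"port": int(port), "hba": hba.lower(), "host": host})
    return fenced
```

Applied to the `hafence -v` example above, this would report only port 12.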
When the client-only node joins membership, any fences on any switch ports connected to that node should be lowered and the status changed to enabled.
However, if the node still does not have access to the storage, do the following:
Check that the HBA WWPNs were correctly identified. See “Verifying the I/O Fencing Configuration” in Chapter 10.
Check the cxfs_client log file for warnings or errors while trying to determine the HBA WWPNs. See “No HBA WWPNs are Detected”.
Log into the Fibre Channel switch. Check the status of the switch ports and confirm that the WWPNs match those identified by cxfs_client.
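When comparing WWPNs by hand, formatting differences can hide a match: the hafence examples in this chapter print a WWPN as 16 plain hexadecimal digits, whereas switch interfaces often print colon-separated bytes. The sketch below normalizes both forms before comparing. The function names are hypothetical, and the accepted input formats are assumptions.

```python
def normalize_wwpn(text):
    """Reduce a WWPN to 16 lowercase hex digits, or raise ValueError."""
    digits = text.strip().lower().replace(":", "").replace("-", "")
    if digits.startswith("0x"):
        digits = digits[2:]
    if len(digits) != 16 or any(c not in "0123456789abcdef" for c in digits):
        raise ValueError(f"not a 64-bit WWPN: {text!r}")
    return digits


def wwpns_missing_on_switch(client_wwpns, switch_wwpns):
    """Return client WWPNs that do not appear among the switch's WWPNs."""
    client = {normalize_wwpn(w) for w in client_wwpns}
    switch = {normalize_wwpn(w) for w in switch_wwpns}
    return sorted(client - switch)
```

An empty result means that every WWPN identified by cxfs_client is visible on the switch. Any WWPN that is returned deserves a closer look at cabling, zoning, or HBA detection.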
On most platforms, the cxfs_client software automatically detects the world wide port names (WWPNs) of any supported host bus adapters (HBAs) in the system that are connected to a switch that is configured in the cluster database. These HBAs will then be available for fencing.
However, if no WWPNs are detected, there will be messages about loading the HBA/SNIA library.
See the following:
If a client has trouble obtaining membership, verify that the system firewall is configured for CXFS use. See “Configure Firewalls for CXFS Use” in Chapter 2.
You can run the cxfs-reprobe script on a client-only node (other than Windows) to look for devices and perform a SCSI bus reset if necessary. cxfs-reprobe will also issue an XVM probe to tell XVM that there may be new devices available:
To view the current status of XVM mirror licenses on client-only nodes, use the following command and search for the line containing the keyword mirrors:
xvm show -subsystem
client# xvm show -subsystem
XVM Subsystem Information:
--------------------------
apivers: 26
config gen: 33
privileged: 1
clustered: 1
cluster initialized: 1
user license enabled: 1
local mirrors enabled: 1
cluster mirrors enabled: 1
snapshot enabled: 1
snapshot max blocks: -1
snapshot blocks used: 0
When reporting a problem with a client-only node, it is important to retain the appropriate information; having access to this information will greatly assist SGI in the process of diagnosing and fixing problems. The methods used to collect required information for problem reports are platform-specific:
“Reporting SGI ProPack Client-Only Nodes Problems” in Chapter 7