In a CXFS environment, also see the troubleshooting information in the CXFS 7 Administrator Guide for SGI InfiniteStorage.
To enter the cluster domain, the appropriate CXFS services must be started. See “CXFS Service Requirements for Cluster Domain” in Chapter 2.
If you execute the xvm(8) command when the services are started, you will automatically enter the cluster domain. If the services are not started, you will automatically enter the local domain.
If you try to explicitly enter the cluster domain when the services are not started, you will get an error:
# xvm -d cluster
could not start up in the specified domain: domain name is invalid or unavailable
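A script that wraps xvm may want to recognize this failure. The following is a minimal sketch, assuming the message text is exactly as shown above; the helper name xvm_domain_status is hypothetical and is not part of XVM. It reads previously captured xvm output on standard input and makes no assumption about the exit status of xvm.

```shell
# Hypothetical helper: classify captured output from an attempt to enter
# the cluster domain. Reads the captured output on stdin.
xvm_domain_status() {
    if grep -q 'could not start up in the specified domain'; then
        echo "unavailable"
    else
        echo "ok"
    fi
}
```

For example, if the output was saved to a file, you could run xvm_domain_status < saved_output to see whether the cluster domain was unavailable.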
If the cluster or host that currently owns a disk no longer exists, you must perform a steal operation. This situation might occur after a mistaken give operation or after deleting a host or cluster.
If the steal command is used to take a disk from a running system, the configuration can become inconsistent and the disk may appear as both owned and foreign. You can use the reprobe command to recover from this situation; see “Removing Configuration Information with the reprobe Command” in Chapter 5.
To avoid problems like this, see “Give Rather than Steal Ownership” in Chapter 3.
When a node in a CXFS cluster crashes, a mirror may start reviving. This happens when the node that crashed was using the mirror and may have left the mirror in a dirty state, with the legs of the mirror unequal. When this occurs, XVM must forcibly resynchronize all of the legs. This can be a lengthy process.
If mirror revives seem slow on your system, you may need to reconfigure the mirror revive resources set by the xvm_max_revive_rsc and xvm_max_revive_threads XVM system tunable kernel parameters.
If a mirror leg is disconnected while a change is occurring, the leg or the parents or children of the mirror may temporarily display an inconsistent state in the xvm CLI show command output. This is expected behavior and does not require any administrative action.
If you notice an inconsistent state that is not associated with a mirror, contact SGI Support for assistance.
If a hardware problem causes the system to switch the path to an XVM physvol, you can run the following command to return to the preferred path after solving the hardware issue:
xvm:cluster> foswitch -preferred phys
You can also execute the foswitch command directly from the shell prompt:
# xvm foswitch -preferred phys
The objects displayed and acted on by xvm commands depend upon the domain setting; if you are in the cluster domain, only objects owned by the cluster will be shown.
For example, the following output shows a case in which you must switch from the cluster domain to the local domain in order to see the physvol lp:
xvm:cluster> show phys/*
phys/first      2339536896 online,cluster,accessible
phys/fourth     2339536896 online,cluster,accessible
phys/second     2339536896 online,cluster,accessible
xvm:cluster> set domain local
xvm:local> show phys/*
phys/lp        11112861696 online,local,accessible
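When comparing what is visible in each domain, it can help to reduce the show output to physvol names and their domains. The following is a minimal sketch, assuming one object per line in the three-column layout of the example above (object path, size, comma-separated state list with the domain as the second state field). The helper name phys_domains is hypothetical and is not part of XVM.

```shell
# Hypothetical helper: print "name domain" pairs from `show phys/*` output
# read on stdin. Assumes three columns per object line: path, size, and a
# comma-separated state list whose second field is the domain.
phys_domains() {
    awk '$1 ~ /^phys\// {
        split($3, state, ",")   # e.g. online,cluster,accessible
        print $1, state[2]
    }'
}
```

Prompt lines such as xvm:cluster> show phys/* are skipped because they do not begin with phys/.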
For system dump analysis, use the crash(8) tool provided with SLES or RHEL.
To enable the collection of crash dumps, do the following:
Install the following RPMs, where kernelrev matches your installed kernel:
RHEL:
kernel-debuginfo-kernelrev
kernel-debuginfo-common-kernelrev
system-config-kdump-kernelrev
SLES:
kernel-default-debuginfo-kernelrev
kdump-version
kexec-tools-version
For example, for the SLES 2.6.27.19-5-default kernel, you would require the kernel-default-debuginfo-2.6.27.19-5.1 RPM, which would install the following file:
/usr/lib/debug/boot/vmlinux-2.6.27.19-5-default.debug
When you install the RHEL kdump-version RPM, it will automatically add the following information to the kernel lines in the /boot/grub/menu.lst file:
crashkernel=256m-:128M@16M
Note: When you install the kdump RPM, kdump is automatically enabled.
Reboot, which activates the kernel and reserves the required memory. You will see the following message on the console:
Loading kdump done
Verify that the machine is set up correctly by requesting an NMI from the console:
console# echo "c" > /proc/sysrq-trigger
Note: If there are several old dump files, the oldest one might be deleted by this process.
For example:
console# echo "c" > /proc/sysrq-trigger
SysRq : Trigger a crashdump
Initializing cgroup subsys cpuset
Initializing cgroup subsys cpu
... (pages of output)
Look for lines such as the following at the end of the output:
Saving dump using makedumpfile
-------------------------------------------------------------------------------
Copying data                   : [ 100 %]
The dumpfile is saved to /root/var/crash/2009-10-28-13:05/vmcore.
makedumpfile Completed.
-------------------------------------------------------------------------------
Generating README              Finished.
Copying System.map             Finished.
Copying kernel                 Finished.
Copying kernel.debug           Finished.
Then the machine will reboot normally.
Go to the /var/crash directory and look for the dump directories, which are named according to the date and time. Each date directory will contain the files required for analysis. For example:
# cd /var/crash
console# ls
2009-10-13-21:02/  2009-10-26-15:55/
# ls -1 2009-10-26-15:55
README.txt
System.map-2.6.27.19-5-default
vmlinux-2.6.27.19-5-default.debug
vmlinux-2.6.27.19-5-default.gz
vmcore
For more information, see the crash(8) man page.
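Two of the checks in this procedure can be scripted. The following is a minimal sketch, not a supported tool: the helper names has_crashkernel and latest_vmcore are hypothetical. It assumes that the running kernel's command line can be read from /proc/cmdline and that dump directories use the date-and-time names shown above, which sort lexically in chronological order.

```shell
# Sketch only: helper names are hypothetical, not part of kdump or XVM.

# Succeeds if the given kernel command-line file (default /proc/cmdline)
# contains a crashkernel= memory reservation.
has_crashkernel() {
    grep -q 'crashkernel=' "${1:-/proc/cmdline}"
}

# Prints the path of the vmcore in the newest dump directory under the
# given crash directory (default /var/crash). Relies on the shell expanding
# the date-named directories in sorted order. Fails if no vmcore is found.
latest_vmcore() {
    crashdir="${1:-/var/crash}"
    newest=""
    for core in "$crashdir"/*/vmcore; do
        if [ -f "$core" ]; then
            newest="$core"
        fi
    done
    if [ -n "$newest" ]; then
        echo "$newest"
    else
        return 1
    fi
}
```

After the reboot, has_crashkernel with no argument checks the running kernel, and latest_vmcore with no argument prints the most recent dump under /var/crash.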
Note: The sgidb tool is supported for systems that have CXFS installed.
Before reporting a problem to SGI, you should do the following:
Retain any messages that appeared in the system logs immediately before the system exhibited the problem.
If there was a system crash, obtain a system dump if possible.
If you suspect that there are problems with a particular client (such as if you see error messages indicating that a client is timing out), collect a system dump if possible or else a kernel-stack backtrace for all threads from that client. If it is unclear which client is causing the problem, collect this information for all clients.
For Linux systems, you can obtain kernel-stack backtraces of all threads by running the following crash(8) command:
foreach bt > somefile
For more information, see the crash(8) man page.
Note: Normally, you can use the crash command to get the kernel-stack backtraces on a running system (without having to first crash the system in order to get a dump).
If XVM has cluster problems, you should also provide system logs and system dumps from other nodes in the cluster. This can be helpful if you encounter problems such as the following:
The xvm command will not enter the cluster domain
You cannot mount a CXFS filesystem
You see messages about a client not responding
If your system is set up with KDB, retain the debugger information from the KDB built-in kernel debugger after a system kernel panic.
Provide the core files and the following associated information:
The application that created the core file:
file core_filename
The binaries listed by the following command:
ldd application_path
Gather the XVM subsystem parameters by executing the following:
xvm:cluster> show -subsystem
For example:
xvm:local> show -subsystem
XVM Subsystem Information:
--------------------------
apivers: 19
config gen: 15
privileged: 1
clustered: 0
cluster initialized: 0

| Parameter | Description |
|---|---|
| apivers | The version of the application programming interface (API) that XVM is using. All nodes in the cluster must use the same version. |
| config gen | A generation number that increments every time the XVM configuration in the kernel is changed. |
| privileged | Indicates whether the current invocation of the xvm CLI is capable of making configuration changes (1) or not (0). |
| clustered | Indicates whether the kernel is cluster-aware (1) or not (0). |
| cluster initialized | Indicates whether the CXFS cluster services that allow XVM to operate in the cluster domain have been initialized (1) or not (0). |
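Two of the collection steps above can be sketched as shell helpers. This is an illustration under stated assumptions, not a supported tool: the helper names write_bt_commands and subsys_value are hypothetical. The first writes a crash(8) command file containing the foreach bt command shown earlier; the second assumes that show -subsystem prints one "name: value" pair per line, as in the example output.

```shell
# Sketch only: helper names are hypothetical, not part of XVM or crash.

# Write a crash(8) command file ($1) that saves kernel-stack backtraces for
# all threads to the named output file ($2) and then leaves crash.
write_bt_commands() {
    printf 'foreach bt > %s\nexit\n' "$2" > "$1"
}

# Print the value of one parameter ($1) from `show -subsystem` output read
# on stdin. Assumes one "name: value" pair per line; leading whitespace on
# a line is ignored.
subsys_value() {
    awk -F': *' -v key="$1" '
        { name = $1; sub(/^[ \t]+/, "", name) }
        name == key { print $2 }
    '
}
```

A command file written this way could be given to crash with its -i option; check the crash(8) man page on your distribution for the remaining arguments. Applied to the example output, subsys_value 'cluster initialized' prints 0, which indicates that the CXFS cluster services have not been initialized on that node.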