Chapter 11. Troubleshooting

In a CXFS environment, also see the troubleshooting information in the CXFS 7 Administrator Guide for SGI InfiniteStorage.

Common Problems

Cannot Switch Domains

To enter the cluster domain, the appropriate CXFS services must be started. See “CXFS Service Requirements for Cluster Domain” in Chapter 2.

If you execute the xvm(8) command when the services are started, you will automatically enter the cluster domain. If the services are not started, you will automatically enter the local domain.

If you try to explicitly enter the cluster domain when the services are not started, you will get an error:

# xvm -d cluster
could not start up in the specified domain:  domain name is invalid or unavailable
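
A script that depends on the cluster domain can check for this error before it continues. The following is a minimal sketch, not an SGI-supplied tool: the function name xvm_domain_unavailable is hypothetical, and it only matches the error text shown above against output that you have already captured.

```shell
# Hypothetical helper: succeeds (exit status 0) if the captured xvm output
# contains the domain-startup error shown above.
xvm_domain_unavailable() {
    case "$1" in
        *"could not start up in the specified domain"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example using the message from this section. In practice, the text would
# be captured from the xvm command itself.
msg='could not start up in the specified domain:  domain name is invalid or unavailable'
if xvm_domain_unavailable "$msg"; then
    echo "cluster domain unavailable: check that the CXFS services are started"
fi
```

Matching on the message text rather than on an exit status is a deliberate choice here, because this section documents only the message.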

Need to Steal

If the cluster or host that currently owns a disk no longer exists, you must perform a steal operation. This situation might occur after a mistaken give operation or after deleting a host or cluster.

Disk is Both Owned and Foreign

If the steal command is used to take a disk from a running system, the configuration can become inconsistent and the disk may appear as both owned and foreign. You can use the reprobe command to recover from this situation; see “Removing Configuration Information with the reprobe Command” in Chapter 5.

To avoid problems like this, see “Give Rather than Steal Ownership” in Chapter 3.

Mirror Revives on Recovery in a Cluster

When a node in a CXFS cluster crashes, a mirror may start reviving. This happens when the node that crashed was using the mirror and may have left the mirror in a dirty state, with the legs of the mirror unequal. When this occurs, XVM must forcibly resynchronize all of the legs. This can be a lengthy process.

Slow Mirror Revives

If mirror revives seem slow on your system, you may need to reconfigure the mirror revive resources set by the xvm_max_revive_rsc and xvm_max_revive_threads XVM system tunable kernel parameters.

Volume Element in Inconsistent State

If a mirror leg is disconnected while a change is occurring, the leg or the parents or children of the mirror may temporarily display an inconsistent state in the xvm CLI show command output. This is expected behavior and does not require any administrative action.

If you notice an inconsistent state that is not associated with a mirror, contact SGI Support for assistance.

Troubleshooting Strategies

Returning to Preferred Path

If a hardware problem causes the system to switch the path to an XVM physvol, you can run the following command to return to the preferred path after solving the hardware issue:

xvm:cluster> foswitch -preferred phys

You can also execute the foswitch command directly from the shell prompt:

# xvm foswitch -preferred phys

Switching Domains to Find All Objects

The objects displayed and acted on by xvm commands depend upon the domain setting; if you are in the cluster domain, only objects owned by the cluster will be shown.

For example, the following output displays the scenario where you must switch from the cluster domain to the local domain in order to see the physvol lp:

xvm:cluster> show phys/*
phys/first               2339536896 online,cluster,accessible
phys/fourth              2339536896 online,cluster,accessible
phys/second              2339536896 online,cluster,accessible
xvm:cluster> set domain local
xvm:local> show phys/*
phys/lp                  11112861696 online,local,accessible 
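
To see at a glance which physvols belong to which domain, you can filter saved show output by the domain keyword in the status column. The following is a sketch under stated assumptions: the function name physvols_in_domain is hypothetical, and it assumes the three-column layout (name, size, comma-separated status) seen in the output above.

```shell
# Hypothetical helper: read "show phys/*" output on stdin and print the names
# of physvols whose status column lists the given domain (cluster or local).
physvols_in_domain() {
    awk -v dom="$1" '
        $1 ~ /^phys\// {
            n = split($3, flags, ",")
            for (i = 1; i <= n; i++)
                if (flags[i] == dom) { print $1; break }
        }'
}

# Sample input copied from the output above.
sample='phys/first               2339536896 online,cluster,accessible
phys/fourth              2339536896 online,cluster,accessible
phys/second              2339536896 online,cluster,accessible
phys/lp                  11112861696 online,local,accessible'

# Print only the physvols owned by the local domain.
printf '%s\n' "$sample" | physvols_in_domain local
```

Prompt lines such as xvm:cluster> are ignored because they do not begin with phys/.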

Using the System Dump Analysis Tool

For system dump analysis, use the crash(8) tool provided with SLES or RHEL.

To enable the collection of crash dumps, do the following:

  1. Install the following RPMs, where kernelrev matches your installed kernel:

    • RHEL:

      • kernel-debuginfo-kernelrev

      • kernel-debuginfo-common-kernelrev

      • system-config-kdump-kernelrev

    • SLES:

      • kernel-default-debuginfo-kernelrev

      • kdump-version

      • kexec-tools-version

    For example, for the SLES 2.6.27.19-5-default kernel, you would require the kernel-default-debuginfo-2.6.27.19-5.1 RPM, which installs the following file:

    /usr/lib/debug/boot/vmlinux-2.6.27.19-5-default.debug

    When you install the RHEL kdump-version RPM, it will automatically add the following information onto the kernel lines in the /boot/grub/menu.lst file:

    crashkernel=256m-:128M@16M


    Note: When you install the kdump RPM, kdump is automatically enabled.


  2. Reboot, which activates the kernel and reserves the required memory. You will see the following message on the console:

    Loading kdump                                 done

  3. Verify that the machine is set up correctly by requesting an NMI from the console:

    console# echo "c">/proc/sysrq-trigger


    Note: If there are several old dump files, the oldest one might be deleted by this process.

    For example:

    console# echo "c">/proc/sysrq-trigger
    SysRq : Trigger a crashdump
    Initializing cgroup subsys cpuset
    Initializing cgroup subsys cpu
    ...
    (pages of output)
    

    The key information to look for is in lines such as the following at the end of the output:

    Saving dump using makedumpfile
    -------------------------------------------------------------------------------
    Copying data                       : [ 100 %]
    
    The dumpfile is saved to /root/var/crash/2009-10-28-13:05/vmcore.
    
    makedumpfile Completed.
    -------------------------------------------------------------------------------
    Generating README              Finished.
    Copying System.map             Finished.
    Copying kernel                 Finished.
    Copying kernel.debug           Finished.

    Then the machine will reboot normally.

  4. Go to the /var/crash directory and look for the dump directories, which are named according to the date and time. Each date directory contains the files required for analysis. For example:

    # cd /var/crash
    # ls
    2009-10-13-21:02/  2009-10-26-15:55/
    # ls -1 2009-10-26-15:55
    README.txt
    System.map-2.6.27.19-5-default
    vmlinux-2.6.27.19-5-default.debug
    vmlinux-2.6.27.19-5-default.gz
    vmcore

For more information, see the crash(8) man page.
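
The checks in steps 2 and 4 can be scripted. The following is a minimal sketch, not part of the SGI or distribution tooling: the function names latest_dump_dir and has_crashkernel are hypothetical. It assumes that dump directories use the date-and-time naming shown above and that each contains a file named vmcore.

```shell
# Hypothetical helper: print the most recent dump directory under the given
# root (default /var/crash) that contains a vmcore file. The date-and-time
# directory names sort chronologically, so a plain lexical sort is enough.
latest_dump_dir() {
    root="${1:-/var/crash}"
    for d in "$root"/*/; do
        if [ -f "${d}vmcore" ]; then
            printf '%s\n' "${d%/}"
        fi
    done | sort | tail -n 1
}

# Hypothetical helper: succeed if a kernel command line (for example, the
# contents of /proc/cmdline) contains a crashkernel= parameter.
has_crashkernel() {
    case " $1 " in
        *" crashkernel="*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example: report the newest dump, if any exists.
dump=$(latest_dump_dir /var/crash)
if [ -n "$dump" ]; then
    echo "latest dump: $dump/vmcore"
fi
```

After the reboot in step 2, you could pass the contents of /proc/cmdline to has_crashkernel to confirm that the crashkernel parameter took effect.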


Note: The sgidb tool is supported for systems that have CXFS installed.


Using SGI Knowledgebase

If you encounter problems, see the SGI Customer Portal:

https://support.sgi.com

Then click on Search Knowledgebase and select the type of search you want to perform.

If you need further assistance, contact SGI Support.

Reporting Problems to SGI

Before reporting a problem to SGI, you should do the following:

  • Retain any messages that appeared in the system logs immediately before the system exhibited the problem.

  • If there was a system crash, obtain a system dump if possible.

  • If you suspect that there are problems with a particular client (such as if you see error messages indicating that a client is timing out), collect a system dump if possible or else a kernel-stack backtrace for all threads from that client. If it is unclear which client is causing the problem, collect this information for all clients.

    For Linux systems, you can obtain kernel-stack backtraces of all threads by running the following crash(8) command:

    foreach bt > somefile

    For more information, see the crash(8) man page.


    Note: Normally, you can use the crash command to get the kernel-stack backtraces on a running system (without having to first crash the system in order to get a dump).


  • If XVM has cluster problems, you should also provide system logs and system dumps from other nodes in the cluster. This can be helpful if you encounter problems such as the following:

    • The xvm command will not enter the cluster domain

    • You cannot mount a CXFS filesystem

    • You see messages about a client not responding

  • If your system is set up with KDB, retain the debugger information from the KDB built-in kernel debugger after a system kernel panic.

  • Provide the core files and the following associated information:

    • The application that created the core file:

      file core_filename

    • The binaries listed by the following command:

      ldd application_path

  • Gather the XVM subsystem parameters by executing the following:

    xvm:cluster> show -subsystem

    For example:

    xvm:local> show -subsystem
    XVM Subsystem Information:
    --------------------------
    apivers:              19
    config gen:           15
    privileged:           1
    clustered:            0
    cluster initialized:  0

    The parameters in this output are as follows:

    • apivers: The version of the application programming interface (API) that XVM is using. All nodes in the cluster must use the same version.

    • config gen: A generation number that increments every time the XVM configuration in the kernel is changed.

    • privileged: Indicates whether the current invocation of the xvm CLI is capable of making configuration changes (1) or not (0).

    • clustered: Indicates whether the kernel is cluster-aware (1) or not (0).

    • cluster initialized: Indicates whether the CXFS cluster services that allow XVM to operate in the cluster domain have been initialized (1) or not (0).
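
When you gather this information from several nodes, it can help to extract individual values from saved output, for example to compare apivers across the cluster. The following is a sketch under stated assumptions: the function name subsystem_field is hypothetical, and it assumes the "name: value" layout shown in the example above.

```shell
# Hypothetical helper: read "show -subsystem" output on stdin and print the
# value of the named parameter, for example "cluster initialized".
subsystem_field() {
    awk -F': *' -v key="$1" '
        { sub(/^[ \t]+/, "", $1) }      # tolerate indented output
        $1 == key { print $2; exit }'
}

# Sample input copied from the example above.
sample='XVM Subsystem Information:
--------------------------
apivers:              19
config gen:           15
privileged:           1
clustered:            0
cluster initialized:  0'

if [ "$(printf '%s\n' "$sample" | subsystem_field 'cluster initialized')" = "0" ]; then
    echo "CXFS cluster services for the cluster domain are not initialized"
fi
```

The parameter name is compared as a whole string, so clustered does not match the cluster initialized line.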