Chapter 5. Identifying and Correcting Failures

This chapter describes how the Storage System Manager window displays information about array health and outlines procedures you can use to detect and correct problems that may arise with managed arrays. This chapter is organized as follows:

  • “Array Health Indicators in the Storage System Manager Window”

  • “Checking a Faulted Array”

  • “Faulted LUN”

  • “Faulted Disk Module”

  • “Faulted SP”

  • “Faulted LCC”

  • “Faulted Fan”

  • “Faulted Power Supply”

  • “Faulted SPS”

Array Health Indicators in the Storage System Manager Window

The Storage System Manager window displays fault information in several ways; Figure 5-1 points out these features.

Figure 5-1. Array Health Indicators in the Storage System Manager Window



Note: When the Storage System Manager window is minimized to an icon, a slanted line appears across the array icon graphic to indicate that one or more managed arrays are faulted, as shown in Figure 5-2.

Figure 5-2. Storage System Manager Icon With Faulted Array(s)


Fault indication features are explained in separate sections below.

Array Status Button

The array status button at the lower left of the window indicates array health:

  • Normal (button is gray) indicates that ssmgui detects no failures in the managed arrays

  • Fault (button is amber) indicates that ssmgui detects a failure in one or more arrays, or that one or more arrays are inaccessible

Clicking this button displays information on system faults; Figure 5-3 shows an example.

Figure 5-3. SSM Information Window (Array Fault Button)


Array Icons in the Array Selection Area

The color of an icon for an array indicates its health:

  • Gray indicates that no failure is detected

  • Amber indicates a fault in some part of the array; a small F appears at the lower right of the icon

The graphic for an array icon indicates the status of the array; see “Enclosure Health and Accessibility” in Chapter 2.

If you are managing many arrays, checking the Array Status button is more convenient than clicking the array icons. This button also appears in the Array Configuration window.

Auto Poll Indicator

The Auto Poll indicator at the lower right of the window shows the status of automatic polling. You can make sure that you are getting the latest status information on the arrays in two ways:

  • Enable automatic polling for the managed arrays and for ssmgui, as explained in “ssmagent Polling Interval and Polling Requests” in Chapter 4.

  • Manually poll the managed arrays periodically by clicking the Manual Poll button at the right end of the array toolbar; this button has a letter M at its lower right.

If any array that you want to monitor does not appear in the Storage System Manager window, see “Using the Host Administration Window to Add Servers” in Chapter 2.

Checking a Faulted Array

To determine the state of each component in a faulted array, follow these steps:

  1. In the Storage System Manager window, select the faulted array for which you want information.

  2. Open the Components For Array or Equipment View window for the selected array, either by clicking the monitor array button near the middle of the array toolbar or by choosing Monitor from the Array menu.

    For an array with multiple enclosures, the Components For Array window opens; for an array with only one enclosure, the Equipment View window opens.

  3. If the Components For Array window opens:

    • Select the faulted enclosures (amber icons).

    • Open the Equipment View window for the selected enclosures by choosing Monitor from the Chassis menu, or by double-clicking the icon for the enclosure.


    Note: The icon type in the Components For Array window indicates the kind of enclosure; see “Interpreting Array Icons” in Chapter 2.


  4. In the Equipment View window for each enclosure, look for amber icons. Figure 5-4 shows an example.

    Figure 5-4. Equipment View Window With Faulted Power Supply


The right field in this window displays system components, also known as customer-replaceable units (CRUs), although end users replace only disk modules. The rest of this chapter and the Origin FibreVault and Fibre Channel RAID Owner's Guide contain information on replacing components.

A white icon, which also displays the letter E (for empty), indicates a missing component.

If you do not know the name of the CRU represented by an amber icon, position the cursor over the icon for a few seconds without moving it. The RAID GUI displays a description of the CRU near the icon, as well as in the status bar.
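You can also check component status from the command line. The following is a minimal sketch, assuming the getcrus subcommand described in Chapter 6; devicename stands for the device name of an SP in the array:

      ssmcli -d devicename getcrus

Its output lists the state of each CRU in the enclosure.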

The external Fibre Channel Hub is transparent to ssmgui and the command-line interfaces.


Note: For configurations with multiple disk enclosures, you can cause the LEDs on specific disk modules to flash as an aid to locating them. In the Components window or Equipment View toolbar, use the second button from the right to start flashing the disk module LEDs; use the button at the far right to stop LED flashing.


Faulted LUN

If a LUN is faulted, its icon is amber. Figure 5-5 shows an example with the faulted LUN selected.

Figure 5-5. Faulted LUNs


If a LUN is inaccessible, its icon appears in the Unowned LUNs field in the Array Configuration window.

Click the button at the far right of the LUN (upper) toolbar (or choose Describe Fault Indication from the LUN menu) to display fault information. Figure 5-6 shows an example.

Figure 5-6. LUN Fault Information


For complete information on the LUN, display its windows as explained in “Using the LUN Information Windows” in Chapter 4.
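From the command line, you can display comparable LUN details with ssmcli. A minimal sketch, assuming the getlun subcommand described in Chapter 6; lun-number is the number of the LUN to examine:

      ssmcli -d devicename getlun lun-number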


Note: If a DPE has two LCCs (A and B), each DAE cabled to it must also have two LCCs. If a DAE has no LCC B, binding LUNs on SP B is restricted in some circumstances: if SP A uses disk modules 00 through 09 (that is, all the disk modules in the DPE) in LUNs, you cannot bind any LUNs on SP B, because the connection to the disk modules in the DAE is missing. There is no path through LCC B (which is absent) and no path through the disk modules in the DPE (all are bound on SP A).


Faulted Disk Module

An amber disk module icon indicates that the disk module it represents is in one of these states:

  • failed: either powered on but inaccessible, or powered off by the SP, which can happen when a disk module has been replaced with one of the wrong capacity

  • removed from the enclosure after the ssmagent started running

This section contains the following topics:

  • “Fibre Channel Disk Module Types”

  • “Disk Module Failure and Powering Off, Unbinding, or Taking Disks Offline”

  • “LUN Integrity and Disk Module Failure”

  • “Replacing a Disk Module”

  • “Replacing a Database or Cache Vault Disk Module”

  • “Rebuilding a RAID 5, 3, 1, or 1/0 LUN”

  • “Rebuilding a RAID 0 LUN”

Fibre Channel Disk Module Types

Replace a failed disk module only with another Silicon Graphics FC-AL disk module, following the disk-replacement procedures explained in the Origin FibreVault and Fibre Channel RAID Owner's Guide.

Special rules apply when you replace database or cache vault disk modules:

  • The first three disk modules in the first enclosure (DPE) in a chain, or in the only enclosure in the array, contain licensed internal code (LIC). These disk modules, with disk IDs 00, 01, and 02, are known as database disk modules. At least two are required for DPE operation.

    If all three are removed at the same time while the array is powered on, contact with the array is lost. (The SP fault LED illuminates.) If you remove these three disk modules with the array powered off, label them as you remove them, because they must be reinstalled in their original positions.

  • If the array uses write caching, the disk modules that the array uses for its cache vault are 00 through 08.

    If a cache vault disk module fails, the array dumps its write cache image to the remaining modules in the vault. Then it writes all dirty (modified) pages to disk and disables write caching. Write caching remains disabled until a replacement disk module is inserted and the array rebuilds the LUN with the replacement module in it. The Write Cache State field in the Cache section of the SP Information window indicates whether array write caching is enabled or disabled (see “SP Cache Information” in Chapter 4); you can also check this state from the command line, as sketched after this list.
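A minimal command-line check of the cache state, assuming the getcache subcommand described in Chapter 6:

      ssmcli -d devicename getcache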

“Replacing a Database or Cache Vault Disk Module” has guidelines for replacing these disk modules.

Disk Module Failure and Powering Off, Unbinding, or Taking Disks Offline

The system operator can replace or reinsert any failed disk module without powering off the array or interrupting user applications.

If the failed disk module is part of a RAID 5, 3, 1, or 1/0 LUN, the LUN can continue functioning without interrupting user applications.

Unbinding a RAID disk module is equivalent to taking it offline. Unbinding is required only in these situations:

  • enough disk modules in a LUN have failed to compromise the LUN's data redundancy (see “LUN Integrity and Disk Module Failure”)

  • an individual disk LUN has failed

  • you are moving disk modules from one array to another

LUN Integrity and Disk Module Failure

If a failed disk module is part of a RAID 5, 3, 1, or 1/0 LUN, you can replace the disk module without powering off the array or interrupting applications. If the array contains a hot spare on standby, the SP automatically rebuilds the failed module on the hot spare. When you replace a disk module in a RAID 5, 3, 1, or 1/0 LUN, the SP equalizes the new module, and then begins to reconstruct the data. While rebuilding occurs, users have uninterrupted access to information on the LUN. (For more information on how these LUNs are rebuilt, see “Rebuilding a RAID 5, 3, 1, or 1/0 LUN” and “Rebuilding a RAID 0 LUN”.)

RAID disk modules do not have to be unbound to be replaced unless enough modules in a LUN fail to compromise the LUN's data redundancy:

  • RAID 0: a single disk module fails

  • RAID 1/0: both modules in a pair

  • RAID 1: both modules in a pair

  • RAID 3: two modules in a LUN (RAID 3 does not use a second hot spare)

  • RAID 5: two modules in a LUN

In these cases, the LUN becomes unowned (not accessible by either SP). After you unbind the affected LUN(s) and replace the disk modules (one at a time), you rebind the affected LUN(s). If the data on the failed disks was backed up, restore it to the new disks.

A hot spare is unowned until it becomes part of a LUN when one of the LUN's disk modules fails. The failure of an unowned hot spare does not make any LUN inaccessible.

If an individual disk LUN fails, you must unbind it, replace the disk, and rebind the LUN.

If you want to move disk modules from one array to another, back up the data, unbind the LUN(s), move each disk module one at a time to its new location, rebind, and restore the backed-up data.
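A sketch of the unbind step from the command line, assuming the unbind subcommand described in Chapter 6 (the LUN number 3 is hypothetical; unbinding destroys all data on the LUN, so complete the backup first):

      ssmcli -d devicename unbind 3

After moving the disk modules one at a time to their new locations, rebind them (see “Binding Disk Modules” in Chapter 2) and restore the backed-up data.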

Replacing a Disk Module

Removing the wrong disk module can introduce an additional fault that shuts down the LUN containing the failed module.


Note: For configurations with multiple disk enclosures, you can cause the LEDs on specific disk modules to flash as an aid to locating them. In the Components window or Equipment View toolbar, use the second button from the right to start flashing the disk module LEDs; use the button at the far right to stop LED flashing.

Before removing a disk module, follow these steps:

  1. Check the module's amber check or fault LED. If it is illuminated, or if both LEDs are off, the host is bypassing the module, which indicates a failure.

  2. Double-click the disk icon to read fault information (see “Disk Module Error Information” in Chapter 4).

  3. Read the event log for the SP that owns the LUN containing the faulty disk module for a message about the disk module; see “Displaying an SP Event Log” in Chapter 4. (A command-line alternative is sketched after these steps.)

  4. Check for any other messages that indicate a related failure, such as a failure of a SCSI bus or a general shutdown of an enclosure. Such a message could mean the disk module itself has not failed.

  5. A message about the disk module contains its module ID. If the disk module button does not show this ID, choose Show Disk IDs from the View menu of the Array Configuration window to see the location of the disk module. Figure 5-7 shows an example.

    Figure 5-7. Array Configuration Window: Disk Module IDs


  6. If the disk is an individual disk LUN, you must unbind the LUN before you replace the disk module, and rebind the LUN afterward. Unbinding and rebinding is not necessary for other LUN types.

  7. After you confirm the failure of a disk module, replace it following instructions in the Origin FibreVault and Fibre Channel RAID Owner's Guide. Be sure to use only Silicon Graphics FC RAID disk modules (part number 9470192).
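For step 3, you can also read the SP event log from the command line. A minimal sketch, assuming the getlog subcommand described in Chapter 6; the module ID 02 is hypothetical:

      ssmcli -d devicename getlog | grep 02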

Replacing a Database or Cache Vault Disk Module

Special rules apply when you replace a database or cache vault disk module; see “Fibre Channel Disk Module Types”.

If you must replace all database disk modules, follow these steps:

  1. Unbind the LUN(s) containing the disk modules you want to replace, following instructions in “Unbinding a LUN” in Chapter 3.

  2. One by one, replace all failed disk modules except for the one in slot 00; leave the disk module in slot 00 in place. (See instructions in the Origin FibreVault and Fibre Channel RAID Owner's Guide.)

  3. Bind the disk modules into the desired LUNs, following instructions in “Binding Disk Modules” in Chapter 2.


    Tip: Use a minibind in this case: ssmcli bind with the -z option. For information on the -z option, see page 146 in Chapter 6. (A sketch of a minibind command follows these steps.)

    When you bind the LUNs, the SP copies the licensed internal code from the 00 disk module onto the other two database disk modules (01 and 02).

  4. Unbind the LUN containing the 00 disk module.

  5. Replace this disk module and rebind the LUN.
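As an illustration of the minibind in step 3, the sketch below binds hypothetical disk modules into a RAID 5 LUN. The RAID type, LUN number, and disk IDs are placeholders, and the argument order is an assumption; see the bind command and its -z option in Chapter 6 for the actual syntax:

      ssmcli -d devicename bind r5 0 00 01 02 03 04 -z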

Rebuilding a RAID 5, 3, 1, or 1/0 LUN

Although you can remove a disk module within a RAID 5, 3, 1, or 1/0 LUN without damaging the data on the LUN, do so only when the disk module has actually failed.

If the array contains a hot spare on standby, the SP automatically rebuilds the failed module on the hot spare. When you replace a disk module in a RAID 5, 3, 1, or 1/0 LUN, the SP equalizes the new module, and then begins to reconstruct the data. While rebuilding occurs, you have uninterrupted access to information on the LUN.

You can use the Disk Information window (see Figure 5-8) to follow the status of the new disk module during the rebuilding process. To open this window, choose Configuration from the Array menu; in the Array Configuration window, double-click the disk module you are rebuilding.

Figure 5-8. Disk Information Window: Configuration Information


The Status field shows these states:

  1. Powering Up: The hot spare or replacement disk module is being powered on.

  2. Rebuilding: The SP rebuilds the data on the hot spare.

  3. Equalizing: Data from a hot spare is being copied onto a replacement disk module.

  4. Enabled: The hot spare is fully integrated into the LUN, or the failed disk module has been replaced with a new module and the SP is copying the data from the hot spare onto the new module.

  5. Ready: The copy is complete. The LUN consists of the disk modules in the original slots and the hot spare is on standby.

(For full information on this window, see “Disk Module Configuration Information” in Chapter 4.)

Rebuilding occurs at the same time as user I/O. The rebuild time specified when the LUN is bound determines the duration of the rebuild process and the amount of SP resources dedicated to rebuilding. A short rebuild time consumes many resources and may significantly degrade performance; a long rebuild time consumes fewer resources and has less effect on performance. You can determine the rebuild period by looking at the configuration section of the LUN Information window (see “LUN Configuration Information” in Chapter 4).
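If a rebuild is degrading performance more than expected, the rebuild time can in some cases be adjusted after binding. A sketch, assuming the chglun subcommand described in Chapter 6; the -l (LUN number) and -r (rebuild time, in hours) option letters and their values are assumptions:

      ssmcli -d devicename chglun -l 3 -r 4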

Rebuilding a RAID 0 LUN

If one of the disk modules in a RAID 0 LUN fails, the LUN changes state from owned to unowned. The state does not change back to owned after you replace the failed disk module.

Follow these steps to rebuild the RAID 0 LUN:

  1. Replace the failed disk module following instructions in the Origin FibreVault and Fibre Channel RAID Owner's Guide.

  2. Make sure the RAID 0 LUN is not in use.

  3. Reset the LUN's ownership state to owned in one of the following ways:

    • Determine the SP that owns the RAID 0 LUN and reboot that SP with the ssmcli rebootSP command; see “rebootSP” in Chapter 6. (A sketch follows these steps.)

    or

    • Use hinv to determine the LUN ID and enter the following command:

      scsicontrol -i lunid 
      

      For example:

      scsicontrol -i sc2d0l4 
      

    or

    • Use ssmcli trespass to change LUN ownership:

      ssmcli -d devicename trespass lun lun-number
      

      For example:

      ssmcli -d sc2d0l4 trespass lun 4
      

      This operation does not require a reboot. For more information, see “trespass” in Chapter 6.
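For the first alternative in step 3, the reboot command has this general form, where devicename is the device name of the SP that owns the LUN (see “rebootSP” in Chapter 6):

      ssmcli -d devicename rebootSP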

Faulted SP

An amber SP icon in the Equipment View window indicates that the SP has failed. When an SP fails:

  • One or more LUNs may become inaccessible and the array's performance can degrade if array read or write caching was enabled.

  • The SP's check or service light turns on, along with the check or service light on the front of the array.

  • If the array has a second SP and a number of LUNs are inaccessible, you may want to transfer control of the LUNs to the working SP. Follow instructions in “Transferring Control of a LUN (Manual Trespass)” in Chapter 3.

To display more information on the fault, double-click the amber SP icon. The SP Information window appears; see “Using the SP Information Windows” in Chapter 4.

An authorized Silicon Graphics System Support Engineer can replace the SP under power, without interrupting applications. Call your service provider.


Caution: If array write caching is enabled, it must be disabled before an SP is replaced. See “Enabling or Disabling Array Caching” in Chapter 3 for instructions.
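Write caching can also be disabled from the command line before the replacement. A sketch, assuming the setcache subcommand described in Chapter 6; the -wc option letter and the 0 (off) value are assumptions:

      ssmcli -d devicename setcache -wc 0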


Faulted LCC

An amber link control card (LCC) icon in the Equipment View window indicates that the LCC has failed. In addition, the LCC's fault light turns on, along with the service light on the front of the array.

When an LCC fails, the SP it is connected to loses access to its LUNs, and the array's performance can degrade. If the array has a second LCC and a number of LUNs are inaccessible, you may want to transfer control of the LUNs to the SP that is connected to the working LCC. Follow instructions in “Transferring Control of a LUN (Manual Trespass)” in Chapter 3.

Double-click the amber icon to display more information on the fault. See the Origin FibreVault and Fibre Channel RAID Owner's Guide for information on replacing the LCC.

Faulted Fan

An amber fan pack icon in the Equipment View window indicates that the fan group it represents is in one of the following states:

  • Faulted: The fan group has failed.

  • Removed: The fan group was removed after the ssmagent started running.

  • Not Present: The fan group failed or a fan pack was removed before the agent started running.

If one fan in a pack fails, the other fans speed up to compensate so that the array can continue operating. If a second fan fails and the temperature rises, the array shuts down after about two minutes to prevent damage to the disk modules from overheating.

Double-click the amber icon to display more information on the fault. ssmgui refers to the drive fan pack as the rear fan (FAN A in the command-line interface), and to the SP fan pack as the front fan (FAN B in the CLI). A DPE has both types of fan pack; a DAE has only the rear fan pack.

If you see an amber fan icon, replace the entire fan pack as soon as possible; call your service provider. The replacement fan pack must be on hand for this process. If the fan pack is removed for more than two minutes, the SPs and the disk modules power off; they power on when a functional replacement fan pack is installed. See the Origin FibreVault and Fibre Channel RAID Owner's Guide for information on replacing the fan pack.

If the fan pack is removed or disabled (more than one fan fails), the disk modules spin down. Note that the green disk module LEDs and the green (active) SP LEDs remain illuminated; though the disk modules and SP(s) are shut down, they are not faulted. The disk modules and SPs automatically power back on when a functional fan pack is installed.

Faulted Power Supply

An amber power supply icon in the Equipment View window indicates that the power supply it represents is in one of the following states:

  • Down: Failed or removed after the ssmagent started running.

  • Not Present: Failed or removed before the agent started running.

A Fibre Channel RAID enclosure has a power supply A; it can also have an optional power supply B. An array with two power supplies can recover from the failure of one power supply and provide uninterrupted service while the defective power supply is replaced. If a second power supply fails or is removed, the entire enclosure shuts down immediately.

In enclosures with no SPS, failure of the AC distribution system (line cord, utility power, and so on) also immediately shuts down the entire enclosure.

Double-click the amber icon to display more information on the fault. See the Origin FibreVault and Fibre Channel RAID Owner's Guide for information on replacing the power supply.


Note: When a DPE shuts down, the operating system loses contact with the LUNs. When the DPE powers on again, you might need to reboot the server to re-establish operating system access to the LUNs; in this case, you must also restart ssmagent, as explained in “Restarting ssmagent” in Appendix A.


Faulted SPS

An amber SPS icon in the Equipment View window indicates a problem with the standby power supply (SPS), also known as the battery backup unit (BBU). Double-click the amber icon to display more information on the fault.

Also, check the status lights on the SPS. These indicate when the SPS has an internal fault, when it is recharging, and when its battery pack needs replacing. (See the Origin FibreVault and Fibre Channel RAID Owner's Guide for a full description of this and other system component hardware.)

If the SPS status lights indicate an internal fault, the SPS might still be able to run, but the SP disables write caching. The array can use write caching only when a fully charged, working SPS is present. After the faulted SPS is replaced with a functional SPS, the SP automatically re-enables write caching when it detects the presence of the functional SPS. (If a second fully charged, working SPS is present, write caching continues.)

If the SPS status lights indicate that a battery pack needs replacing, or indicate any other failure, contact your service provider. In an array with a second SPS that is healthy, an authorized Silicon Graphics System Support Engineer can replace the faulted SPS while the DPE is powered on, without interrupting applications.