This chapter describes how the Challenge RAID storage system operates in the case of failure of a system component (field-replaceable unit) other than a disk module: a fan module, power supply, optional battery backup unit, or storage-control processor board.
Read this chapter after you have determined that one of these components has failed, as indicated by the cabinet fault light, the output of the raidcli getcrus command, or the graphical user interface Equipment View, as explained in “Getting Information About Other Components” in Chapter 3.
|Note: These components can be replaced only by qualified Silicon Graphics System Service Engineers or other qualified service providers. Only disk modules are owner replaceable or end-user replaceable; Chapter 5 provides instructions.|
Call the Silicon Graphics hotline to order a replacement module:
The Challenge RAID storage system has two or three redundant power supplies, or VSCs (voltage semi-regulated converters): VSC A, VSC B, and, optionally, VSC C. If the storage system has three power supplies, it can recover from power supply component failures and provide uninterrupted service while the defective component is replaced.
In a storage system with two power supplies, one power supply can fail or be removed without affecting the disk modules. If the second power supply also fails, the entire chassis shuts down immediately. Any defective power supply must be replaced immediately by a Silicon Graphics System Service Engineer.
Failure of the AC distribution system (line cord, utility power, and so on) also shuts down the entire chassis immediately.
|Note: When the storage system shuts down, the operating system loses contact with the LUNs. When the storage system starts up automatically, you may need to reboot it to let the operating system access the LUNs.|
In RAID5GUI, an amber power supply button in the Equipment View indicates that the power supply this button represents is in one of the following states:
Down: the power supply failed or was removed after the agent started running
Not Present: the power supply failed or was removed before the agent started running
These states appear in /var/adm/SYSLOG; if alarms are enabled in RAID5GUI, they appear in e-mail messages, on-screen alarm messages, or both, depending on the setting.
Each Challenge RAID storage system or rack chassis assembly has one fan module, containing six fans wired in two groups (FAN A and FAN B) of three fans each. If any fan fails, the fan fault light on the back of the fan module turns on, and all other fans speed up to compensate. The storage system can run after one fan fails; however, if another fan failure occurs and temperature rises, the storage system shuts down after two minutes.
If the fault light comes on, if you see a fan fault in the raidcli getcrus command output, or if a fan button is amber in the RAID5GUI Summary View, have the entire fan unit replaced as soon as possible by a Silicon Graphics System Service Engineer.
An amber fan button in the Equipment View of RAID5GUI indicates that the fan group this button represents is down (the fan group failed or the fan module was removed after the agent started running) or not present (the fan group failed or the fan module was removed before the agent started running). (These states appear in /var/adm/SYSLOG; if alarms are enabled in RAID5GUI, they appear in e-mail messages, on-screen alarm messages, or both, depending on the setting.)
|Note: Leaving the rear door open for more than two minutes can cause the storage system to overheat. If the temperature within a Challenge RAID chassis reaches an unsafe level, the power system shuts down the storage system immediately.|
When the optional battery backup unit fails, the following events occur:
The battery backup unit's service light turns on.
If the storage system is using the cache, the storage system's performance may become sluggish while it writes any modified pages from the cache to disk.
Caching is disabled; the cache state as shown in the raidcli getcache command or in the SP Cache Summary window is “Disabled.” Caching is not re-enabled until the battery backup unit is replaced with a fully charged one.
The raidcli getcrus output has three possible states for the battery backup unit: Faulted (removed), Charging, and Present (fully charged or charging). If the battery backup unit takes longer than an hour to charge, it shuts itself off and transitions to the “Faulted” state.
If the fault light comes on, if the battery backup unit state is shown as “Faulted” in the raidcli getcrus command output, or if a BBU button is amber in the RAID5GUI Summary View, have the battery backup unit replaced as soon as possible by a Silicon Graphics System Service Engineer.
After a power outage, a BBU takes 15 minutes to recharge. From total depletion, recharging takes an hour or less.
Each week, the SP initiates a BBU self-test to ensure that the BBU's monitoring circuitry is working. While the test runs, storage-system caching is disabled, but communication with the server continues. I/O performance may decrease during the test. When the test is finished, storage-system caching is re-enabled automatically. The default time for the BBU test to start is 1:00 a.m. on Sunday.
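Because caching is disabled while the self-test runs, it can help to know when the next test will start. The following is a minimal sketch in Python that computes the next occurrence of the default start time (Sunday, 1:00 a.m.). The function name next_bbu_self_test is hypothetical, and the sketch assumes the default schedule has not been changed and that the server's clock matches the storage system's.

```python
from datetime import datetime, timedelta

def next_bbu_self_test(now):
    """Return the next Sunday 1:00 a.m. strictly after `now`.

    Assumes the default BBU self-test schedule described in the text.
    """
    # Python's weekday(): Monday is 0, Sunday is 6.
    days_ahead = (6 - now.weekday()) % 7
    candidate = (now + timedelta(days=days_ahead)).replace(
        hour=1, minute=0, second=0, microsecond=0
    )
    # If it is already Sunday and past 1:00 a.m., move to the following week.
    if candidate <= now:
        candidate += timedelta(days=7)
    return candidate

# Example: a Wednesday at noon.
print(next_bbu_self_test(datetime(2024, 1, 3, 12, 0)))
```

A scheduler could use the returned time to avoid starting cache-sensitive workloads during the test window.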
When an SP fails, the storage system's performance may become sluggish if the storage system is using caching. You may find that one or more physical disk units are inaccessible. The failed SP's service light turns on, and the service light on the front of the storage system turns on.
You can determine SP failure in these ways:
View the unsolicited event log, which contains the message “Peer Controller: Removed.” View this log with raidcli getlog or by clicking Log in the RAID5GUI Summary View, as explained in Chapter 3.
Use raidcli getcrus to verify the failed SP, as explained in Chapter 3. The getcrus output indicates the failed SP, for example, “SPA State: Not Present.”
In RAID5GUI, click on the SP's button in the Equipment View or Summary View to display the SP Summary window, as explained in Chapter 3.
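The getcrus check described above can be scripted once the command's output has been captured as text. The following is a minimal sketch in Python, not a description of the actual raidcli output format: it assumes each component is reported on a line shaped like the example “SPA State: Not Present,” and that “Present” is the only healthy state. The function name find_unhealthy and the SPB sample line are hypothetical.

```python
import re

# Assumed line shape, based on the single example in the text:
# "SPA State: Not Present"
STATE_LINE = re.compile(r"^\s*(?P<name>.+?)\s+State:\s*(?P<state>.+?)\s*$")

def find_unhealthy(getcrus_text, healthy=("Present",)):
    """Return {component: state} for each component not in a healthy state."""
    problems = {}
    for line in getcrus_text.splitlines():
        match = STATE_LINE.match(line)
        if match and match.group("state") not in healthy:
            problems[match.group("name")] = match.group("state")
    return problems

# Hypothetical captured output; only the SPA line comes from the text.
sample = """\
SPA State: Not Present
SPB State: Present
"""
print(find_unhealthy(sample))
```

Lines that do not match the assumed shape are ignored, so the sketch should be checked against real getcrus output before it is relied on.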
If the Challenge RAID storage system has a second SP and a number of LUNs are inaccessible, you might want to transfer control of the LUNs to the surviving SP. See “Transferring Control of a LUN” in Chapter 8.