Appendix C. Storage-Control Processor Event-Log Error Codes

This appendix lists the hexadecimal error codes and messages that can appear in the storage-control processor unsolicited event log. The messages are given in numerical order.

SP error codes fall into these groups:

The 0x800–series codes indicate transient events, typical of mass storage systems, from which the storage system recovers. These messages are not included in this appendix.

Many 0x801 messages, however, indicate a problem with either a disk module or SCSI bus. If these messages refer to the same disk module, that disk module is reaching the end of its useful life. If they refer to disk modules on the same SCSI bus, the SCSI bus or components that affect the bus are becoming problematic.
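The 0x801 rule of thumb above can be sketched as a small tally over parsed log entries. This is only an illustration under stated assumptions: the appendix does not describe the log's layout, so the (code, module ID) pair format is hypothetical. The idea that the leading letter of a module ID names its bus is borrowed from the 0xa02 entry, and the threshold of 3 is arbitrary.

```python
from collections import Counter

def summarize_0x801(entries, threshold=3):
    """Tally 0x801 events per disk module and per SCSI bus.

    entries: iterable of (code, module_id) pairs, e.g. (0x801, "B2").
    The pair format and the threshold are assumptions for illustration.
    """
    by_module = Counter()
    by_bus = Counter()
    for code, module_id in entries:
        if code != 0x801:
            continue
        by_module[module_id] += 1
        by_bus[module_id[0]] += 1  # leading letter names the bus (assumed)
    # A single module that keeps reporting 0x801 is suspect on its own.
    suspect_modules = sorted(m for m, n in by_module.items() if n >= threshold)
    # A bus is suspect when the events are spread over several of its modules.
    suspect_buses = sorted(
        b for b, n in by_bus.items()
        if n >= threshold and len({m for m in by_module if m[0] == b}) > 1
    )
    return suspect_modules, suspect_buses
```

Repeated events on one module point at that module; events spread across several modules on one bus point at the bus or the components that affect it.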

0x601 (SP Powerup; Rev 0x%08x)

The system is coming up.
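The 0x%08x in the 0x601 message text appears to be a printf-style placeholder: the revision is printed as eight zero-padded lowercase hexadecimal digits. A brief illustration, using a made-up revision value and a hypothetical helper name:

```python
def format_powerup(revision):
    """Render the 0x601 message text for a given revision number."""
    return "SP Powerup; Rev 0x%08x" % revision

print(format_powerup(0x1A2B))  # SP Powerup; Rev 0x00001a2b
```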

0x602 CRU Enabled

The specified disk module has been enabled and is ready for use. This message appears after you rebuild or register the LUN to which the module belongs.

0x603 CRU Rebuild Started

The storage system started rebuilding the RAID 5, 3, 1, or 1/0 LUN to which the disk module belongs, or the disk module is a hot spare.

0x604 CRU Rebuild Complete

The storage system has finished rebuilding a RAID 5, 3, 1, or 1/0 LUN.

0x608 CRU Ready

The specified drive module is powered on and ready for binding or registering the LUN to which it belongs.

0x610

The disk module could not be physically formatted, and thus cannot be used in the storage system. Make sure the disk module is a valid model. If the model is valid, then consult your service provider for recovery steps.

0x621

The storage system has begun the background checkpoint verification of the accuracy and completeness of the disk module parity check data. This message may appear after you replace an SP or transfer control of LUNs from one SP to another.

0x622

The storage system has completed background checkpoint verification of the accuracy and completeness of the parity check data in a RAID 5, 3, 1/0 or 1 LUN.

0x630 Fan Installed

The storage system has detected that a fan module has been installed or replaced.

0x631

The storage system has detected that a VSC has been installed or replaced.

0x633

The storage system has detected an increase in a fan's speed, perhaps because temperature rose or another fan failed.

0x634

Speed of a fan has returned to normal.

0x635

Virtual sector data error. The storage system has detected a data inconsistency in a disk sector.

0x636

BBU was removed from the storage-system chassis.

0x637 BBU Recharging

BBU is recharging.

0x638 BBU Enabled

BBU has become ready.

0x639 System Cache Enabled

Storage-system cache has become ready.

0x640

The storage system has finished reconstructing the disk mirror.

0x641

The background rebuild operation was aborted before it completed.

0x643 SP Initializing

The storage system began formatting this disk module.

0x644 SP Inserted

A peer (second) SP was detected in this chassis.

0x650 CRU Signature Error

CRU signature error occurred.

0x654 Cache Dumping

The storage system has started dumping the storage-system cache to the vault disks.

0x657 Cache Dumping Completed

The storage system has finished dumping the system cache to the vault disks.

0x658

Storage-system caching was enabled by the storage system or system operator.

0x659 Cache Disabled By User

Storage-system caching was disabled by the storage system or system operator. The storage system disables caching if the BBU is not fully charged or an SP, vault disk, or fan fails; see error 0x908.

0x660 AC Power Failure

After an AC power failure, the storage system dumped the cache to the vault and turned off power to the BBU (it does this to minimize drain on the BBU).

0x663

BBU self-test started.

0x664 Cache Disabled By BBU Test

In preparation for its weekly BBU test, the storage system disabled caching. This message is followed by a 0x663 message.

0x665 0x666

The storage system detected a change to a single-SP (0x665) or dual-SP (0x666) configuration.

0x667 0x668

Host cache is recovering or has recovered. The storage system started recovering the cache (0x667) or finished recovering it (0x668).

0x66A Soft Vault Load Failure

The vault load failed when no cache dirty pages existed. This situation occurs most often when you change both the RAID–3 memory size and the write cache size at the same time.

0x901 Hard SCSI Bus Error

An abnormal SCSI bus or disk module event was detected and could not be cleared through retry operations. This error message is often followed by an explicit CRU message as in 0xa07.

0x903 Fan Removed

The fan module shut down or was removed.

0x904 VSC Shutdown-Removed

A VSC unit has been shut down or removed from the storage system.

0x905 Chassis overtemperature

The storage system found that its internal temperature is too high. It tries to correct an overtemperature condition by increasing fan speed. Check for any obvious problems, such as obstruction of cooling vents or excessive room temperature.

0x906 Unit Shutdown

A failure in a CRU (which may be a fan or disk module) has made further access to the LUN impossible. If this unit has redundant CRUs (for example, it is a RAID 5 LUN), a failure in two CRUs is needed to produce this error. The SP shut down the LUN and the server can no longer access it.

If this message appears along with the 0x905 or 0xa06 message, replacing a defective fan module may restore access to the LUN. If the problem is with disk modules, do not replace the disk modules; instead call your service provider.
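The guidance for 0x906 amounts to a simple check of which other codes accompany it. A minimal sketch, assuming the log has already been reduced to a set of integer codes (the function name and the returned strings are illustrative only):

```python
def triage_unit_shutdown(codes_seen):
    """Suggest a first action when 0x906 (Unit Shutdown) is in the log.

    codes_seen: set of integer event codes found in the log.
    Returns None if 0x906 is absent.
    """
    if 0x906 not in codes_seen:
        return None
    if codes_seen & {0x905, 0xA06}:
        # Overtemperature or chassis shutdown accompanies the unit shutdown.
        return "check fan modules; replacing a defective fan may restore access"
    # Without a cooling-related code, disk modules are the likelier cause.
    return "if disk modules are at fault, do not replace them; call your service provider"
```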

0x907 Fatal Firmware Error

A fatal firmware error has occurred; as a result, the program running in the SP has reset the SP. The SP was restarted and continued normally. Consult your service provider.

0x908 Fault - Cache Disabling

The storage system is disabling caching because of a system fault. The problem might be that

  • the BBU is not ready (it is not present or is not fully charged)

  • one or more vault disks are missing or being rebuilt

  • a fan fault occurred

  • an SP failed

To recover, either identify the problem and fix it or wait for the storage system to fix it (for example, wait for the BBU to reach full charge or for the vault disks to be rebuilt). When the fault no longer exists, the storage system automatically re-enables storage-system caching.
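The conditions listed for 0x908 can be read as a checklist: caching comes back only when every one of them is clear. A sketch of that checklist, with hypothetical boolean inputs standing in for whatever status the management software reports:

```python
def cache_fault_reasons(bbu_present, bbu_charged, vault_ok, fans_ok, sps_ok):
    """Return the 0x908 conditions that currently block caching.

    All parameters are booleans; their names are illustrative and not
    taken from any storage-system interface.
    """
    reasons = []
    if not (bbu_present and bbu_charged):
        reasons.append("BBU not ready")
    if not vault_ok:
        reasons.append("vault disk missing or rebuilding")
    if not fans_ok:
        reasons.append("fan fault")
    if not sps_ok:
        reasons.append("SP failed")
    return reasons
```

An empty list corresponds to the state in which the storage system re-enables caching automatically.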

0x909 Vault Dump Failure

A fault caused the storage system to try to dump the cache to the vault. The cache dump failed because two or more vault disks are missing or have failed.

Try replacing one or more disk modules in the vault (for 20 slot, A0, B0, C0, D0, or E0; for 10 slot, A0-A4). A power failure, or double SP failure, while the vault is failed and the caching is enabled makes any LUN that has pages in the cache inaccessible; for any such LUN, you need to replace the bad modules, unbind, rebind, software format, and load the lost data from backup. At system power-on, error 0x90A occurs for the inaccessible LUN(s).

0x90A Can't Assign - Cache Dirty

The storage system cannot enable caching for the LUN or the storage system because the cache contains modified unwritten (dirty) pages for the LUN that it cannot write to the LUN. This error condition can happen, for example, if two disk modules failed in a LUN that was using caching. This error may occur on power-on after the error condition that caused message 0x909.

Make sure there are usable disk modules in all vault slots (20–slot chassis, A0, B0, C0, D0, or E0; 10–slot chassis, A0-A4) and cycle power to restore the cache. If the vault is not the problem, you may need to unbind the LUN for which the dirty pages are destined. Then, you can enable caching for the LUN and the storage system. Eventually, you must replace the failed module(s), rebind and soft format the failed LUN, and reload from backup.
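The vault slot lists given for 0x909 and 0x90A can be captured in a small lookup. This sketch reads "A0-A4" as slots A0 through A4 inclusive, which is an interpretation; the function and its inputs are hypothetical.

```python
# Vault slots by chassis size, as listed in the 0x909 and 0x90A entries.
VAULT_SLOTS = {
    20: ["A0", "B0", "C0", "D0", "E0"],
    10: ["A0", "A1", "A2", "A3", "A4"],  # "A0-A4" read as an inclusive range
}

def missing_vault_disks(chassis_slots, usable_slots):
    """List vault slots that do not currently hold a usable disk module.

    chassis_slots: 10 or 20.
    usable_slots: collection of slot IDs known to hold usable modules.
    """
    return [s for s in VAULT_SLOTS[chassis_slots] if s not in usable_slots]

print(missing_vault_disks(20, {"A0", "B0", "D0", "E0"}))  # ['C0']
```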

0x90B Cache Unit Failed

The storage system cannot define the cache because the existing cache contains modified unwritten (dirty) pages.

This error can occur if you try to change cache parameters while the caching is active; if so, disable caching, wait for caching to be disabled, and retry. If the problem does not result from changing cache parameters, check for one or more failed LUNs and if you find one, fix it. If that is not the problem, you may need to unbind the LUN(s) to which the unwritten pages belong; the ID of the LUN(s) is part of the accompanying 0x90A error message.

0x90C Image Larger Than Memory

The cache was dumped to the vault, but cannot be restored to SP memory because an SP has too little memory to accept the cache image. This can happen if an SP fails and you replace it with an SP that has less memory than the one you removed.

To recover, remove the SP that has too little memory, install the correct amount of memory on it, and reinsert it.

0x90D BBU Removed

A BBU failed or was removed. The cache is dumped to the vault; caching is disabled; and the cache is flushed to disk. Caching cannot be enabled until the problem is corrected either by replacing the BBU, if it failed, or by reinstalling it. When the fault is fixed, caching is re-enabled automatically.

0x90E BBU Disabled, Says Ready

BBU test was unable to turn off the BBU. The BBU is probably faulty. The cache is dumped to the vault; caching is disabled; and the cache is flushed to disk. Caching cannot be enabled until the problem is fixed. Replace the BBU. When the fault is fixed, caching is re-enabled automatically.

0x90F Host Cache Recovered With Errors

A non-mirrored cache recovery failed to recover the cache pages for some, but not all, cached LUNs. It does not apply to a storage system with two SPs. Contact your service provider.

0x910 Host Cache Recovery Failed

A non-mirrored cache failed to recover information for all cached LUNs. It does not apply to a storage system with two SPs. Contact your service provider.

0x920 (Hard Media Error)

The sector remap failed. The SP could not remap a defective sector on the disk module. This is a hard media error. The disk's remap area may be full. The disk is unusable; replace it. If the disk was an individual unit or part of an unmirrored RAID–0 group, you will need to bind the replacement disk into a replacement unit, soft format it, and reload its filesystem's data from backup.

0x921 Host Vault Load Failed

The SP encountered errors while trying to load the cache image from disk. This message may indicate multiple disk failures. Probably any LUN with write-cached pages will be inaccessible and must be unbound. To identify such a LUN, look for a 0x90A “Can't Assign” message and read the entry for that message in this appendix.

0x922 Host Vault Load Inconsistent

The SP found inconsistencies in the cache image on disk. This may indicate a failure or abort of the cache dump. Probably any LUN with write-cached pages will be inaccessible and must be unbound. To identify such a LUN, look for a 0x90A “Can't Assign” message and read the entry for that message in this appendix.

0x923 Host Vault Load Failed Bitmap Ok

The SP successfully read the control portion of the cache image on disk, but found the data portion to be incomplete. This means that a failure or abort occurred during the cache dump. Probably any LUN with write-cached pages will be inaccessible and must be unbound. To identify such a LUN, look for a 0x90A “Can't Assign” message and read the entry for that message in this appendix.

0x924 Host Vault Disks Scrambled

The SP found that the order of the vault disks containing the cache image was different from their order when the cache image was dumped to disk. This means the disks were swapped at power-off. You must restore the disks to their original order before the SP can load the cache image.
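If you keep a record of which disk occupied each vault slot before power-off, restoring the original order for 0x924 is a position-by-position comparison. A sketch, assuming disks are tracked by some identifier such as a serial number (the identifiers and the function are hypothetical; the SP itself does not report this list):

```python
def misplaced_vault_disks(order_at_dump, order_now):
    """Compare two vault disk orderings slot by slot.

    Both arguments are sequences of disk identifiers in slot order.
    Returns (slot_index, expected_disk, found_disk) for every mismatch.
    """
    return [
        (i, expected, found)
        for i, (expected, found) in enumerate(zip(order_at_dump, order_now))
        if expected != found
    ]

print(misplaced_vault_disks(["d1", "d2", "d3"], ["d1", "d3", "d2"]))
# [(1, 'd2', 'd3'), (2, 'd3', 'd2')]
```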

0x926 R3 Can't Assign, No Memory 0x928 R3 Can't Init, No Memory

The SP does not have enough memory available for the RAID–3 physical disk unit. This error occurs when the storage system is powered up after SP memory has been removed, or when ownership of the unit is transferred to a peer SP that does not have enough memory.

0xa02 Failed SCSI Bus

An internal SCSI bus has failed. The CRU number displayed corresponds to the bus number (A0 means bus A, B0 means bus B, and so on). The failure resulted from a bad cable or cable connection, bad terminator, bad SCSI chip on an SP, or a bad device. All disk modules on that internal bus are now inaccessible to the SP. A RAID 5, 3, 1, or 1/0 LUN, or software mirror can continue if the other disk module(s) are on other internal buses.

It is unlikely that the other SP (if any) will be able to use the bus. Call your Silicon Graphics service provider.
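The CRU-to-bus convention in the 0xa02 entry (A0 means bus A, B0 means bus B) makes it straightforward to list which modules of a LUN became inaccessible. A sketch, assuming every module ID starts with its bus letter; the LUN layout shown is hypothetical:

```python
def bus_of(cru_id):
    """Return the bus letter for a CRU ID such as "A0" or "b3"."""
    letter = cru_id[:1].upper()
    if not letter.isalpha():
        raise ValueError("unexpected CRU ID: %r" % cru_id)
    return letter

def modules_lost(failed_bus_cru, lun_modules):
    """List the modules of one LUN that sit on the failed bus."""
    bus = bus_of(failed_bus_cru)
    return [m for m in lun_modules if bus_of(m) == bus]

lun = ["A1", "B1", "C1", "D1", "E1"]  # hypothetical RAID 5 LUN, one module per bus
print(modules_lost("B0", lun))        # ['B1']
```

A LUN spread one module per bus, as in this example, loses a single module to a bus failure, which is the case in which a redundant LUN can continue.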

0xa05 NOVRAM Uninitialized

The nonvolatile memory on the SP is not initialized. No server I/O operations can occur. Consult your service provider.

0xa06 Chassis Shutdown

A second VSC failure has occurred, or a fan module has been inoperative for more than two minutes. The SP is powering off all modules in the chassis. Someone must correct the problem (perhaps by inserting a new fan module) before powering on again.

0xa07 CRU Powered Down

The specified disk module has been powered off by the SP, has failed, or has been removed from the chassis.

0xa08 Data Base Sync Error

The SP cannot determine the correct virtual configuration of all LUNs in the storage system. Some LUNs may be unusable. Contact your service provider.

0xa09 Drive Too Small

For a redundant LUN, a replacement disk module was inserted, but it has a smaller capacity than the other disk module(s) in the LUN. The rebuild operation cannot begin until someone removes the replacement disk module and inserts a module of the correct size.

0xa11 SP Removed

The other SP in this chassis has failed. You can force the working SP to take over the failed SP's LUNs via the secondary route.

0xa12 Hard Cache Memory Error

The storage system detected a hard error in SP memory in a storage system with non-mirrored caching. This error does not apply to a storage system with two SPs.