Chapter 8. System Maintenance and Troubleshooting

This chapter contains hardware-specific information that can be helpful if you are having trouble with your SGI 2400 or 2800 rackmount server.

Maintaining Your Hardware and Software

This section gives you some basic guidelines to follow to keep your hardware and software in good working order.

Hardware Dos and Don'ts

To keep your system in good running order, follow these guidelines:

  • Do not enclose the system in a small, poorly ventilated area (such as a closet), crowd other large objects around it, or drape anything (such as a jacket or blanket) over it.

  • Do not place terminals on top of the system chassis.

  • Do not connect cables or add other hardware components while the system is turned on.

  • Do not power off the system frequently; leave it running overnight and on weekends, if possible.

  • Do not leave the key switch in the Diagnostics position.

  • Do not place liquids, food, or heavy objects on the system, terminal, or keyboard.

  • Ensure that all cables are plugged in completely.

  • Ensure that the system has power surge protection.

  • Route all external cables away from foot traffic.

Software Dos and Don'ts

When your system is up and running, follow these guidelines:

  • Do not turn off power to a system that is currently running software.

  • Do not use the root account unless you are performing administrative tasks.

  • Make regular backups (weekly for the whole system, nightly for individual users) of all information.

  • Protect all accounts with a password. Refer to the IRIX Admin: Backup, Security, and Accounting manual for information about setting a root password.

System Problem Categories

The behavior of a system that is not working correctly falls into three broad categories:

  • Operational: You are able to log in to the system, but it does not respond as usual.

  • Marginal: You are not able to start up the system fully, but you can reach the System Maintenance menu or PROM monitor.

  • Faulty: You cannot reach the System Maintenance menu or PROM monitor.

If the behavior of your system is operational or marginal, first check for error messages on the MSC display, then perform a physical inspection using the checklist in the following section. If all the connections seem solid, restart the system. If the problem persists, run the diagnostic tests from the System Maintenance menu or PROM monitor. See your IRIX Admin: System Configuration and Operation manual for more information about diagnostic tests.

If your system is faulty, turn the power to the main unit off and on. If this does not help, contact your system administrator.
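The triage described above can be sketched as a small decision function. This is an illustrative sketch only: the function and dictionary names are hypothetical, and it assumes that the three categories listed earlier correspond, in order, to the terms operational, marginal, and faulty used in the preceding paragraphs.

```python
# Illustrative sketch; all names are hypothetical and not part of IRIX
# or the System Controller firmware.

def classify(can_log_in: bool, can_reach_prom: bool) -> str:
    """Map observed symptoms to one of the three problem categories."""
    if can_log_in:
        return "operational"   # system responds, but not as usual
    if can_reach_prom:
        return "marginal"      # System Maintenance menu or PROM monitor only
    return "faulty"            # neither is reachable


# First steps suggested by this chapter, in the order they should be tried.
_SOFT_STEPS = [
    "check the MSC display for error messages",
    "perform the physical inspection",
    "restart the system",
    "run the diagnostic tests",
]
FIRST_STEPS = {
    "operational": _SOFT_STEPS,
    "marginal": _SOFT_STEPS,
    "faulty": [
        "turn the power to the main unit off and on",
        "contact your system administrator",
    ],
}

if __name__ == "__main__":
    category = classify(can_log_in=False, can_reach_prom=True)
    for step in FIRST_STEPS[category]:
        print(f"{category}: {step}")
```

The sketch keeps the category decision separate from the list of steps so that either can be adjusted for a particular site without touching the other.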

XIO Board Slots Not Functioning

If some of the XIO board slots are not functioning, verify that your system module has the required number of CPU Node boards and that the boards are installed in the appropriate slots. See Chapter 2, “Chassis Tour,” for additional information.

Physical Inspection Checklist

Check every item on this list:

  • Make sure the terminal and main unit power switches are turned on.

  • If the system has power, check the System Controller display for any messages, then reset the system.

Before you continue, shut down the system and turn off the power.

Verify these connections:

  • The terminal cable is connected securely to the rear of the terminal and to the appropriate connector on the BaseIO panel.

  • The terminal power cable is securely connected to the terminal at one end and to the power source at the other end.

  • The keyboard cable is securely connected to the keyboard at one end and to the terminal at the other end.

  • The system power cable is securely installed in the receptacle in the system chassis and in the proper AC outlet.

  • The network cable is connected to the appropriate port, and the key or lock used to secure the network connection is engaged.

  • Serial port cables are securely installed in their corresponding connectors.

When you finish checking the hardware connections, turn on the power to the main unit and then to the terminal; then reboot the system. If your system continues to fail, restore the system software and files using the procedures described in the IRIX Admin: Backup, Security, and Accounting manual. If the system fails to respond at all, call your service organization.

MSC Shutdown

Under specific circumstances, the MSC may shut down the system. Usually this occurs when the operating environment becomes too warm because of fan failure, high ambient temperatures, or a combination of the two.

The System Controller automatically shuts down the system and lights the “Over Temperature Fault” LED if any of the following situations occurs:

  • failure of two or more of the system's nine fans

  • failure of one fan plus a high ambient temperature

  • failure of any (critical) fan directly responsible for cooling the power supply or a router board

  • an unacceptably high ambient temperature

Only the last situation can be dealt with completely by the end user. The first three require a service call by a qualified support technician.
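The shutdown conditions listed above can be expressed as a short decision sketch. This is not the System Controller's actual firmware logic: the function names, the three-level ambient scale, and the reason strings are assumptions made for illustration, and the source gives no numeric temperature thresholds.

```python
# Illustrative sketch only; names and the Ambient scale are hypothetical.
from enum import Enum
from typing import List


class Ambient(Enum):
    NORMAL = 0
    HIGH = 1          # "high ambient temperature"
    UNACCEPTABLE = 2  # "unacceptably high ambient temperature"


MULTI_FAN = "two or more fans failed"
FAN_PLUS_HEAT = "one fan failed and ambient temperature is high"
CRITICAL_FAN = "critical fan failed"
AMBIENT_ONLY = "ambient temperature unacceptably high"


def shutdown_reasons(failed_fans: int, critical_fan_failed: bool,
                     ambient: Ambient) -> List[str]:
    """Return every listed condition that applies; empty means no shutdown."""
    reasons = []
    if failed_fans >= 2:
        reasons.append(MULTI_FAN)
    if failed_fans >= 1 and ambient is not Ambient.NORMAL:
        reasons.append(FAN_PLUS_HEAT)
    if critical_fan_failed:
        reasons.append(CRITICAL_FAN)
    if ambient is Ambient.UNACCEPTABLE:
        reasons.append(AMBIENT_ONLY)
    return reasons


def user_can_resolve(reasons: List[str]) -> bool:
    """Only a pure ambient-temperature shutdown can be fixed without service."""
    return reasons == [AMBIENT_ONLY]


if __name__ == "__main__":
    r = shutdown_reasons(1, False, Ambient.HIGH)
    print(r, "user can resolve:", user_can_resolve(r))
```

Here `failed_fans` is taken to be the total number of failed fans, including any critical fan; a real controller would read these values from its own sensors.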

Fixing the MSC Shutdown

If you determine that a critical fan or fans have failed, you should immediately place a service call. The system is not usable until the faulty fan(s) are replaced.

If the problem involves the combined failure of a single noncritical fan and a high ambient temperature, you should place a service call. You may be able to keep the system running by lowering the ambient temperature of the operating environment while waiting for service.

For example, you could:

  • lower the air conditioning temperature

  • move the system to a cooler environment

  • use one or more portable fans to circulate more air around the system

  • use a portable air-conditioner to lower the temperature of the system

If the problem is simply a high ambient temperature, you will need to either lower the work environment temperature or move the system to an area with a lower ambient temperature.

Recovering from a System Crash

Your system might have crashed if it fails to boot or respond normally to input devices such as the keyboard. The most common form of system crash is terminal lockup, a situation in which your system fails to accept any commands from the keyboard. When a system crashes, data is sometimes damaged or lost.

Using the methods described in the following paragraphs, you can fix most problems that occur when a system crashes. You can prevent additional problems by recovering your system properly after a crash.

The following list presents a number of ways to recover your system from a crash. The simplest method, rebooting the system, is presented first. If it fails, go on to the next method, and so on. Here is an overview of the different crash recovery methods:

  • rebooting the system

    Rebooting usually fixes problems associated with a simple system crash.

  • restoring system software

    If you do not find a simple hardware connection problem and you cannot reboot the system, a system file might be damaged or missing. In this case, you need to copy system files from the installation tapes to your hard disk. Some site-specific information might be lost.

  • restoring from backup tapes

    If restoring system software fails to recover your system fully, you must restore from backup tapes. Complete and recent backup tapes contain copies of important files. Some user- and site-specific information might be lost.

Refer to your IRIX Admin: Backup, Security, and Accounting manual for instructions for each of the recovery methods listed above.
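The escalation order described above (try the simplest method first, and move to the next only if it fails) can be sketched as follows. The function name and the stand-in callables are hypothetical; each real recovery step is a manual procedure documented in the IRIX Admin manual, not a function call.

```python
# Illustrative sketch only; recover() and the lambdas are placeholders
# for manual procedures, not real IRIX commands.
from typing import Callable, List, Optional, Tuple

Method = Tuple[str, Callable[[], bool]]


def recover(methods: List[Method]) -> Optional[str]:
    """Try each method in order; return the name of the first that succeeds."""
    for name, attempt in methods:
        if attempt():
            return name
    return None  # nothing worked: call your service organization


# Stand-ins: each callable reports whether that step brought the system back.
methods = [
    ("reboot the system", lambda: False),
    ("restore system software", lambda: True),
    ("restore from backup tapes", lambda: True),
]

if __name__ == "__main__":
    print(recover(methods))
```

Because later methods risk losing more user- and site-specific information, the loop stops at the first method that succeeds rather than running them all.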