Chapter 3. System Overview


This chapter provides an overview of the physical and architectural aspects of your SGI Prism XL accelerator-enabled cluster. The system is built around SGI's new STIX Architecture, which provides an accelerator-agnostic, power-efficient cluster server infrastructure in a small-footprint rack.

The Prism XL system is built on a network-connected group of “sticks” (dual-node enclosures) clustered together as an integrated system. The cluster is managed through the SGI Management Center software package installed on the system's “head node” server. Each stick contains two “slices.” A slice consists of a full-length, full-height, double-wide PCIe Gen2 x16 slot and a single-socket (C32) AMD 4100-powered node board. The stick's built-in fans and auto-sensing power supply can detect ambient changes and keep the unit operating in most environments. Each slice supports up to two 2.5-inch SATA disk drives and two optional 1.8-inch solid-state disks (SSDs), allowing large amounts of data storage in a single stick. There are two available GigE ports on each slice (node), and each Prism XL cluster comes with a fully configured GigE network.
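The per-slice figures above can be rolled up into cluster totals. The following is a minimal Python sketch, not an SGI tool: the names `SliceLimits` and `cluster_totals` are hypothetical, and the numbers are the per-slice maximums quoted in this chapter (actual configurations may include fewer drives or SSDs).

```python
from dataclasses import dataclass

SLICES_PER_STICK = 2  # each stick holds two slices (nodes)


@dataclass(frozen=True)
class SliceLimits:
    """Per-slice maximums as described in this chapter (hypothetical helper)."""
    accelerator_slots: int = 1  # one double-wide PCIe Gen2 x16 slot
    cpu_sockets: int = 1        # single-socket (C32) node board
    sata_drives: int = 2        # up to two 2.5-inch SATA drives
    ssds: int = 2               # two optional 1.8-inch SSDs
    gige_ports: int = 2         # two available GigE ports


def cluster_totals(sticks, per_slice=SliceLimits()):
    """Return maximum resource counts for a given number of sticks."""
    if sticks < 0:
        raise ValueError("sticks must be non-negative")
    slices = sticks * SLICES_PER_STICK
    return {
        "nodes": slices,
        "accelerator_slots": slices * per_slice.accelerator_slots,
        "cpu_sockets": slices * per_slice.cpu_sockets,
        "sata_drives": slices * per_slice.sata_drives,
        "ssds": slices * per_slice.ssds,
        "gige_ports": slices * per_slice.gige_ports,
    }


if __name__ == "__main__":
    for name, count in cluster_totals(10).items():
        print(f"{name}: {count}")
```

For example, 10 sticks yield 20 nodes, 20 accelerator slots, and up to 40 SATA drives.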

The SGI Prism XL systems can run parallel programs using a message-passing library such as the Message Passing Interface (MPI). The Prism XL cluster system uses a distributed memory scheme, as opposed to a shared memory system like that used in the SGI Altix UV family of high-performance compute servers. Instead of passing pointers into a shared virtual address space, parallel processes in an application pass messages, and each process has its own dedicated processor and address space.
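The distributed memory model can be illustrated with a small sketch. The example below does not use MPI; it substitutes Python's standard `multiprocessing` module so that it runs on a single machine without extra software. Each worker is a separate process with its own address space, and data moves only through explicit messages. The function names are hypothetical, and the "fork" start method is an assumption that limits the sketch to POSIX systems such as Linux.

```python
import multiprocessing as mp


def worker(conn):
    """Runs in its own process with its own address space."""
    chunk = conn.recv()    # receive this worker's share of the data as a message
    conn.send(sum(chunk))  # send the partial result back as a message
    conn.close()


def distributed_sum(values, nprocs):
    """Sum `values` by scattering chunks to `nprocs` worker processes."""
    # Assumption: the "fork" start method (POSIX-only) keeps the sketch short.
    ctx = mp.get_context("fork")
    links = []
    for rank in range(nprocs):
        parent_end, child_end = ctx.Pipe()
        proc = ctx.Process(target=worker, args=(child_end,))
        proc.start()
        parent_end.send(values[rank::nprocs])  # scatter: one strided chunk per rank
        links.append((proc, parent_end))

    total = 0
    for proc, parent_end in links:
        total += parent_end.recv()             # gather the partial sums
        proc.join()
    return total


if __name__ == "__main__":
    print(distributed_sum(list(range(100)), 4))  # prints 4950
```

A real MPI program follows the same pattern of scatter, local compute, and gather, but the processes run on different nodes and the messages travel over the cluster interconnect.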

This chapter consists of the following sections:

System Components

The Prism XL cluster system features the following major components:

Tall rack. This is a custom rack used for both the compute and I/O rack. Up to 63 sticks can be installed in each rack in a multi-rack system. Head nodes require 2U of space reserved in a rack; individual Gigabit Ethernet and InfiniBand switches require 1U of space each. 42U tall racks are available, depending on the configuration ordered. See Chapter 4 for additional information on system racks used with the SGI Prism XL.

Note: While the 42U tall rack can technically support up to 63 installed sticks, stand-alone systems require room to support switch hardware. A maximum of 54 sticks would normally be installed in a rack with the expectation of supporting switch fabric hardware in the rack.

Stick Compute Unit. This enclosure contains the compute and memory components, PCIe accelerator slots, disk drives, optional solid-state disks (SSDs), and networking PCIe cards. Note that the optional half-height PCIe slots are generally used for network interconnect.

Note: PCIe options may be limited; check with your SGI sales or support representative.

Gigabit Ethernet switch. The Gigabit Ethernet switches used in the Prism XL accelerator cluster are primarily for administrative functionality within the cluster. See Chapter 2 for more information on Ethernet switch use within the system.

InfiniBand interconnect fabric switch. The InfiniBand switch(es) provide the main data transfer fabric for the system and switch high-speed, converged InfiniBand network traffic for the cluster. See Chapter 2 for additional information on the IB switch.

Compute Stick Features

The basic enclosure within the Prism XL system is the dual-node “stick”. Each stick enclosure supports two AMD-based compute nodes with two x16 PCIe accelerator slots that support a number of different accelerator cards. Check with your SGI sales or service representative for a complete list of supported accelerator card options. Two additional low-profile x8 PCIe cards (generally used for I/O and networking) are supported in each stick.

A hardware accelerator is a separate computational device (mounted on a PCIe card) that is connected to the node's Central Processing Unit (CPU). In effect, it is a co-processor that contains a very large number of functional units together with its own memory, and it can dramatically increase processing speed for suitable workloads. Note that the accelerator does not replace the CPU but complements it. Parts of the code that can be processed in parallel are sent to the accelerators.
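Because only the parallel parts of a program are offloaded, the overall gain depends on how much of the run time those parts represent. A minimal sketch using Amdahl's law (a standard estimate, not something specific to the Prism XL) shows the effect; the function name and the sample figures are illustrative assumptions, and the model ignores data transfer time between CPU and accelerator.

```python
def overall_speedup(parallel_fraction, accelerator_speedup):
    """Estimate whole-program speedup with Amdahl's law.

    parallel_fraction:   share of run time that can be offloaded (0.0 to 1.0)
    accelerator_speedup: how much faster the offloaded part runs (> 0)
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if accelerator_speedup <= 0:
        raise ValueError("accelerator_speedup must be positive")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / accelerator_speedup)


if __name__ == "__main__":
    # Illustrative inputs only: 90% of run time offloaded, running 20x faster.
    print(f"{overall_speedup(0.9, 20):.2f}x")  # prints 6.90x
```

Even with a 20x faster accelerator, the 10% of the work that stays on the CPU limits the overall gain to roughly 6.9x in this example.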

A 19-inch rack for this cluster server houses all stick enclosures (up to 45 sticks), switches (up to 8 InfiniBand and 2 GigE), plus the cluster head node server in a single rack.

Figure 3-1 shows an example of the compute/memory/PCIe accelerator “stick”.

Figure 3-1. “Stick” Compute/Memory/PCIe Acceleration Enclosure Example

The SGI Prism XL system requires a minimum of one tall rack (see Figure 3-2) with enough single-phase or three-phase power distribution units (PDUs) to provide outlets for the sticks and accompanying support hardware installed in the rack. Each single-phase PDU has 8 outlets. Each three-phase PDU has 21 outlets. The head node requires one or two outlets, depending on configuration.

You can also add RAID and non-RAID disk storage to your rack system; factor this into the number of required outlets.
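The outlet planning described above reduces to simple arithmetic. The sketch below is a hypothetical planning aid, not an SGI utility. The PDU outlet counts (8 and 21) come from this section; the number of outlets each stick needs is not stated here, so `outlets_per_stick` is an assumption you must supply for your configuration.

```python
import math

SINGLE_PHASE_OUTLETS = 8   # outlets per single-phase PDU (from this section)
THREE_PHASE_OUTLETS = 21   # outlets per three-phase PDU (from this section)


def pdus_needed(sticks, outlets_per_stick, head_node_outlets=2,
                storage_outlets=0, outlets_per_pdu=SINGLE_PHASE_OUTLETS):
    """Return (total outlets required, number of PDUs required).

    outlets_per_stick is an assumption supplied by the caller; this chapter
    does not state how many outlets a stick uses.
    """
    total = sticks * outlets_per_stick + head_node_outlets + storage_outlets
    return total, math.ceil(total / outlets_per_pdu)


if __name__ == "__main__":
    # Illustrative inputs: 18 sticks at one outlet each, head node on two outlets.
    print(pdus_needed(18, 1))                                       # (20, 3)
    print(pdus_needed(18, 1, outlets_per_pdu=THREE_PHASE_OUTLETS))  # (20, 1)
```

Pass any added RAID or non-RAID storage through `storage_outlets` so that it is counted in the total.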

Figure 3-2. Tall System Rack Example

System Architecture

The Prism XL system is a switch-interconnected cluster that scales PCIe-based accelerators to very large “petascale” processing levels.

System Switches

Two types of switches are offered in the Prism XL system. Gigabit Ethernet switches are primarily used as an “administrative” interconnect. Optionally, a small Prism XL cluster could use gigabit Ethernet switches for both administrative and message passing fabric functionality.

InfiniBand switches are primarily used to deliver a high-bandwidth, low-latency interconnect between the compute nodes and the head node. Transfer rates of up to 40 gigabits per second are possible.
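To put the two fabrics in perspective, the following sketch estimates the best-case time to move a block of data at a given link rate. It is a rough illustration only: the function name is hypothetical, the 40 Gb/s figure is the peak rate quoted above, and the calculation ignores link encoding, protocol overhead, and latency, so real transfers take longer.

```python
def ideal_transfer_seconds(num_bytes, link_gbps):
    """Best-case transfer time: payload bits divided by the raw link rate."""
    if link_gbps <= 0:
        raise ValueError("link_gbps must be positive")
    return (num_bytes * 8) / (link_gbps * 1e9)


if __name__ == "__main__":
    payload = 10_000_000_000  # 10 GB, an illustrative data set size
    print(f"InfiniBand (40 Gb/s): {ideal_transfer_seconds(payload, 40):.1f} s")
    print(f"GigE (1 Gb/s):        {ideal_transfer_seconds(payload, 1):.1f} s")
```

Under these idealized assumptions, 10 GB moves in about 2 seconds over a 40 Gb/s link and about 80 seconds over Gigabit Ethernet, which is why Ethernet is suggested as a message passing fabric only for small clusters.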

Chapter 2, “Prism XL System Interconnect Overview” includes additional descriptive information on the system switches.

Modularity and Scalability

The Prism XL acceleration systems are modular cluster systems. The components are primarily housed in two-node building blocks referred to as “sticks”. Free-standing SGI compute servers, called “head nodes”, are used to provision, access, and administer the Prism XL cluster systems. Optional mass storage may be added to the systems.

You can add different types of PCIe options to a system rack to achieve the desired system configuration. You can configure and scale the systems around accelerator processing capability, memory size or InfiniBand fabric I/O capability. The air-cooled sticks offer two complete nodes in each enclosure. A water-chilled rack option expands a single rack's compute density with added heat dissipation capability for the Prism XL components.

Reliability, Availability, and Serviceability (RAS)

The Prism XL cluster system components have the following features to increase the reliability, availability, and serviceability (RAS) of the systems.

Power and cooling:
- Optional redundant power supplies are available in the system head nodes.
- A rack-level water-chilled cooling option is available for systems with high-density configurations.
- Sticks have overcurrent protection at the node and power supply level.
System monitoring:
- Stick-level BMCs monitor the internal voltage, power, and temperature of the nodes.
- Each stick power supply has a failure LED that indicates the power status; LEDs are readable at the end of the stick.
- Systems support remote console and maintenance activities.
Error detection and correction:
- External memory transfers are protected by cyclic redundancy check (CRC) error detection. If a memory packet fails its checksum, it is retransmitted.
- Nodes exceed SECDED standards by detecting and correcting 4-bit and 8-bit DRAM failures.
- Detection of all double-component 4-bit DRAM failures occurs within a pair of DIMMs.
- 32 bits of error-correcting code (ECC) are used for each 256 bits of data.
- Automatic retry of uncorrected errors occurs to eliminate potential soft errors.
Power-on and boot:
- Automatic testing occurs after you power on the system nodes. (These power-on self-tests, or POSTs, are also referred to as power-on diagnostics, or PODs.)
- Processors and memory are automatically de-allocated when a self-test failure occurs.
- Boot times are minimized.
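The detect-and-retransmit behavior listed under error detection and correction can be sketched in software. The example below is only an analogy: the hardware performs this on memory packets, whereas the sketch uses Python's `zlib.crc32` on byte strings. The packet layout, the function names, and the `channel` callable are all hypothetical.

```python
import zlib


def make_packet(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 of the payload (illustrative packet layout)."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def check_packet(packet: bytes) -> bool:
    """Recompute the CRC over the payload and compare with the stored value."""
    payload, stored = packet[:-4], packet[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == stored


def send_with_retry(payload: bytes, channel, max_attempts: int = 3):
    """Send through `channel` (a hypothetical callable that returns the packet
    as received) and retransmit whenever the CRC check fails."""
    for attempt in range(1, max_attempts + 1):
        received = channel(make_packet(payload))
        if check_packet(received):
            return received[:-4], attempt
    raise IOError("packet failed its CRC check on every attempt")


def flaky_channel_factory():
    """Build a test channel that corrupts one bit on the first transfer only."""
    calls = {"count": 0}

    def channel(packet: bytes) -> bytes:
        calls["count"] += 1
        if calls["count"] == 1:
            corrupted = bytearray(packet)
            corrupted[0] ^= 0x01  # flip a single bit to simulate a soft error
            return bytes(corrupted)
        return packet

    return channel


if __name__ == "__main__":
    data, attempts = send_with_retry(b"node data", flaky_channel_factory())
    print(data, attempts)  # b'node data' 2
```

The first transfer is rejected because the flipped bit changes the checksum, and the second, clean transfer is accepted.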

Optional System Components

Availability of optional components for the systems may vary based on new product introductions or end-of-life components. Some options are listed in this manual; others may be introduced after this document goes to production status. Check with your SGI sales or support representative for the most current information on available product options not discussed in this manual.
