This chapter provides an overview of IRIX GSN version 2.1.
Gigabyte System Network (GSN) is a full-duplex, error-free, flow-controlled communications protocol that simultaneously provides a full gigabyte (8 gigabits) of data transfer in each direction (6.4 gigabits of data plus 1.6 gigabits for control and HIPPI-6400 protocol information). Table 1-1 compares theoretical GSN data rates to the theoretical rates of other communications protocols.
Peak User Payload Rate[a]
Sustained User Payload Rate
500 MBaud on 20 lines
1256 MBaud on one line
622 MBaud on one line
[a] Peak rate is the rate required for the hardware's direct-memory-access (DMA) engine when the hardware has a small input queue.
[b] All rates are decimal, not binary (that is, they are base ten, not base two); for example, giga is 1,000,000,000.
The following sections describe the SGI GSN products:
The GSN products offered by SGI consist of multiple components that implement the following protocols:
SGI GSN hardware: copper-based Gigabyte System Network (GSN, also known as HIPPI-6400 or SuperHIPPI) hardware for use in XIO slots.
IP–over–GSN driver (gsn#) included in IRIX GSN. This component is the interface between the GSN hardware and the Internet Protocol (IP) with its associated transport-layer protocols: TCP, UDP, ICMP, and so on. Requires IRIX 6.5.9m or 6.5.9f or later.
ST–over–GSN driver (gsn#) included in IRIX GSN. This component interfaces the GSN hardware to the Scheduled Transfer Protocol (ST). Requires IRIX 6.5.9f or later.
HARP (HIPPI Address Resolution Protocol) driver included in IRIX GSN. This component provides Internet-to-GSN hardware mapping service and interfaces to the HARP daemon. Requires IRIX 6.5.9m or 6.5.9f or later.
Address resolution protocol server (harpd daemon) and client functionality shipped with IRIX GSN. The dynamic HARP component handles HIPPI-6400 clients. IRIX HARP also supports static table lookup for handling HIPPI systems that do not support HARP.
IRIX sockets-based application programming interface (API) to the IP network stack (driver) for use by customers who want to develop or port applications to send/receive data through the IP–over–GSN subsystem. Available with IRIX 6.5.9m or 6.5.9f and subsequent versions.
IRIX sockets-based application programming interface (API) to the ST network stack (driver) for use by customers who want to develop or port applications to send/receive data through the ST–over–GSN subsystem. Available with IRIX 6.5.9f and subsequent versions.
The SGI GSN hardware and IRIX GSN software support the following network stacks (illustrated in Figure 1-1):
ST-over-GSN: applications that use the IRIX GSN product's Scheduled Transfer (ST) programmatic interface to send/receive data over GSN. Applications that use this interface include the IRIX utilities shipped with the IRIX GSN product and customer-developed ST applications.
Note: Each gsn# network interface services two main protocols: ST and IP. The INET address that the customer assigns to an instance of gsn# is shared by the ST-over-GSN and IP-over-GSN stacks. Some of the upper-layer address processing (for example, routing) that is performed on the address applies to both IP and ST traffic.
GSN (also called HIPPI–6400 or SuperHIPPI)
Information Technology - High-performance Parallel Interface - 6400 Mbit/s Physical Layer (HIPPI-6400-PH), ISO/IEC 11518-10, NCITS (ANSI) standard.
Information Technology - High-performance Parallel Interface - 6400 Mbit/s Switch Control (HIPPI-6400-SC), T11.1, Project 1231-D, Rev. 2.5, August 1998, working draft for NCITS (ANSI). Only those functions that apply to GSN endpoints.
Information Technology - Scheduled Transfer Protocol (ST), T11.1, Project 1245-D, Rev. 2.6, December 1998, working draft for NCITS (ANSI).
RFC 2067, IP over HIPPI
Other standard internet protocols provided with IRIX (IP versions 4 and 6, NFS versions 2 and 3, TCP, UDP, ICMP, and so on.)
To obtain copies of the GSN and ST documents, see the Web site http://www.hippi.org, or contact the American National Standards Institute (ANSI) at 11 West 42nd Street, New York, New York 10036, telephone: 212-642-4900. For RFCs, see the Web site http://info.internet.isi.edu/in-notes/rfc.
IRIX GSN 2.1 requires IRIX 6.5.9m for TCP/UDP and 6.5.9f for full TCP/UDP/STP support. Use the versions command to verify the version of IRIX that is currently running on the system. The version number (indicated by the -n option) must be equal to or greater than the version shown in the following example:
% versions -n eoe
I  eoe  1275719131  IRIX Execution Environment, 6.5.9m
% hinv -v | grep HUB
HUB in Module #/Slot 1: Revision 5 Speed 97.50 MHz (enabled)
HUB in Module #/Slot 2: Revision 5 Speed 97.50 MHz (enabled)
HUB in Module #/Slot 3: Revision 5 Speed 97.50 MHz (enabled)
HUB in Module #/Slot 4: Revision 5 Speed 97.50 MHz (enabled)
The following sections provide an overview of the protocols that make up and interoperate with IRIX GSN. Figure 1-1 illustrates the GSN protocol stacks.
Gigabyte System Network (GSN) is a set of ANSI standards (listed in “Standards Compliance”) that defines physical and data link layers for a very high-speed communications protocol. The GSN protocol is also known by two other names: HIPPI–6400 and SuperHIPPI. Throughout this document, the term GSN is used for this entire set of protocols, except when referring to an item from a specific ANSI standard, in which case the term from the ANSI document's title is used (for example, HIPPI–6400-PH micropacket).
The following terms have specific meanings when used within the context of GSN:
Physical link: One section of HIPPI–6400–PH cable (copper or fiber-optic) that connects two HIPPI–6400-PH elements. Each element can be either a switch or an endpoint. Each physical link is a full-duplex link composed of two simplex links; each simplex link carries data in only one direction; the two streams of data in the full-duplex link flow in opposite directions. The path (virtual connection) between an original point of transmission (the originating source) and a final point for reception (the final destination) can involve numerous physical links.
Element: Any component of a HIPPI-6400 fabric or system that is able to receive, process, and send HIPPI-6400 Admin micropackets in a manner that conforms with the HIPPI-6400 standard. Each HIPPI-6400 element contains both a source and a destination. For example, the SuMAC chip in an SGI GSN product is a GSN element.
Source: The transmitting element located at one end of a physical link. An upper-layer entity (host, network–layer interface, or program) that uses the GSN subsystem is sometimes loosely referred to as the source. However, it is more correct to call these software entities upper-layer protocols (that is, source ULPs). An “originating source” refers to the element that first transmitted a micropacket; an element that is retransmitting the micropacket (for example, a switch) is simply a source.
Destination: The receiving element located at the other end of a physical link. An upper-layer entity (host, network–layer interface, or program) that receives communications through the GSN subsystem is sometimes loosely referred to as the destination. However, it is more correct to call these software entities upper-layer protocols (that is, destination ULPs). A “final destination” refers to the element that is the ultimate receiver for a micropacket; an element that receives, then retransmits a micropacket (for example, a switch) is simply a destination.
Endpoint: A final destination or an originating source of GSN traffic. An endpoint may have only one GSN port. A single system may have many endpoints (for example, an Origin module with two SGI GSN products has two endpoints).
Switch: A node that is located along the route between two endpoints. GSN traffic passes through the switch on its way to a destination endpoint. A switch must have at least two, and usually has more, GSN ports.
Fabric: All the HIPPI nodes (switches, endpoint devices, extenders) that are physically interconnected and communicate using the same physical–layer protocol. One GSN fabric can be logically divided into multiple upper-layer address spaces (that is, networks). For example, a single GSN fabric can support multiple IP networks. And, conversely, one logical network can include members from multiple HIPPI fabrics.
Hop count: A number used in HIPPI–6400 Admin micropackets to specify the number of elements through which the micropacket should be forwarded. Each time a micropacket exits an element, the hop count is decremented by one. See “GSN Admin Micropackets” for further details.
The GSN protocol provides 6.4 gigabits of user data per second from source to destination (in each direction) over either copper-based or fiber–optics-based physical media. The protocol is point-to-point, full-duplex, and flow-controlled. It uses small fixed-size micropackets (illustrated in Figure 1-4 and Figure 1-6) and up to four interleaved logical datastreams (channels) per point–to–point connection.
Each physical link is composed of two simplex links that connect two HIPPI-6400 elements; data flows in only one direction on each simplex link. Both simplex links are required for a connection because control information for each datastream travels in the reverse direction (that is, along the other simplex link of the connection). This design provides a full-duplex connection between two endpoints.
The GSN data rate is stated as 6.4 gigabits of user data per second on each simplex link; however, each link physically carries a total of 8 gigabits (1 gigabyte) of data (user and control) every second. The following items describe the GSN bandwidth:
At the physical layer (that is, on the wire), GSN uses a dual-edged 250-million-cycles-per-second clock, which results in 500 million transmission events per second. Said another way, GSN operates at 500 MBaud.
For each baud, GSN transmits 20 bits: 16 bits of data and 4 bits of control, all encoded with a 4b/5b scheme. The encoding consumes 20% of the total bandwidth, and, of the remaining bandwidth, 20% is used by the HIPPI-6400 control bits. This leaves a user bandwidth of 6.4 gigabits (6400 megabits) per second.
The available bandwidth for user data is therefore 6400 megabits per second, which is 6.4 gigabits (0.8 gigabytes) per second in each direction.
Table 1-2 summarizes the mathematical calculations:
Total physical signal-carrying capacity: 20 simultaneous signals multiplied by 500 MBaud, which is 10 billion signals per second in each direction.
Bandwidth available for protocols: the rate in the row above, minus the bandwidth used by 4b/5b encoding.
Bandwidth available for users (that is, layers above the HIPPI-6400 layer): the rate in the row above, minus the amount used by GSN control information. GSN control = 4 of the 20 bits (20% of 8 Gbits).
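The arithmetic in Table 1-2 can be checked in a few lines of Python. This is an illustrative calculation only (decimal units, as the table uses); each overhead step removes 20%:

```python
# GSN bandwidth derivation, per Table 1-2 (decimal units, integer math).
BAUD = 500_000_000      # dual-edged 250 MHz clock -> 500 MBaud
LINES = 20              # 16 data bits + 4 control bits per baud

raw = BAUD * LINES            # total signalling capacity per direction
after_4b5b = raw * 4 // 5     # 4b/5b encoding costs 20%
user = after_4b5b * 4 // 5    # 4 of the 20 bits are GSN control

assert raw == 10_000_000_000         # 10 billion signals per second
assert after_4b5b == 8_000_000_000   # 8 Gbit/s available to protocols
assert user == 6_400_000_000         # 6.4 Gbit/s (0.8 GB/s) of user data
```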
Each simplex link can carry up to four logical datastreams (virtual channels). These virtual channels are allocated for control traffic, low-latency traffic, and bulk traffic to avoid the latency/blocking issues that occur when a single channel attempts to handle both bulk and interactive traffic.
Each virtual channel is commonly implemented as a queue; micropackets are selected alternately from the active queues and placed onto the physical link in an interleaved fashion, as illustrated in Figure 1-2. Not all four channels need to be active on every connection. All the micropackets belonging to a single GSN Message always travel through the same channel, even when the message traverses switches along its way to the final destination. The restrictions for the data that can be carried on each channel are described in Table 1-3.
Virtual channel 0 (VC0): Carries GSN Messages that do not exceed 68 micropackets of TYPE data (about 2176 bytes of upper-layer data). For ST–over–GSN traffic, ST data channel 0 maps to this GSN channel; all ST control operations (for example, Request_To_Send and Clear_To_Send) travel on this virtual channel.
Virtual channel 1 (VC1): Carries GSN Messages that do not exceed 4100 micropackets of TYPE data (about 128 Kbytes of upper-layer data) and Admin micropackets in which the COMMAND field specifies a request or a command (that is, not a response). IP–over–GSN traffic is carried on this VC. For ST–over–GSN traffic, ST data channel 1 maps to this GSN channel.
Virtual channel 2 (VC2): Carries GSN Messages that do not exceed 4100 micropackets of TYPE data (about 128 Kbytes of upper-layer data) and Admin micropackets in which the COMMAND field specifies a response. For ST–over–GSN traffic, ST data channel 2 maps to this GSN channel.
Virtual channel 3 (VC3): Carries GSN Messages of up to 134,217,728 micropackets of TYPE data (about 4 Gbytes of upper-layer data). This channel requires that the final destination endpoint agree to accept the Message via a flow–controlled protocol such as Scheduled Transfer. For ST–over–GSN traffic, ST data channel 3 maps to this GSN channel.
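Given the per-channel Message limits above (each data micropacket carries 32 bytes), one can sketch how a sender might steer a Message to a channel by size. This is a hypothetical selection policy for illustration, not the actual IRIX GSN behavior, which maps ST data channels and IP traffic to fixed VCs as described above:

```python
# Illustrative mapping of message size to a GSN virtual channel,
# using the limits from Table 1-3 (micropackets of TYPE data).
LIMITS = {0: 68, 1: 4100, 3: 134_217_728}  # VC2 carries responses

def pick_vc(n_bytes, is_response=False):
    """Pick the lowest-latency VC whose Message limit fits the data."""
    packets = -(-n_bytes // 32)      # ceiling division: 32 bytes/packet
    if is_response:
        return 2 if packets <= 4100 else 3
    for vc, limit in sorted(LIMITS.items()):
        if packets <= limit:
            return vc
    raise ValueError("message exceeds the 4-Gbyte VC3 limit")

assert pick_vc(2176) == 0         # exactly 68 packets: small message
assert pick_vc(2177) == 1         # 69 packets: medium message
assert pick_vc(10_000_000) == 3   # bulk traffic
```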
The micropacket is the basic protocol data unit for GSN. Each GSN micropacket is 32-bytes of data accompanied by 8 bytes (64 bits) of control information. The TYPE field within the control bits indicates the format and purpose of the micropacket's 32 bytes of data. The VC field determines which virtual channel carries the micropacket. Some of the control bits that accompany a 32-byte chunk of data refer to that chunk of data (for example, the VC and TYPE fields), and some bits refer to the datastream traveling in the opposite direction on the other physical link (for example, the credits in the CR field that allow the reader/receiver of the control bits to transmit more data for its own datastream). Figure 1-3 illustrates the control bits and Table 1-4 describes them. Table 1-5 summarizes the different TYPEs of GSN micropackets.
Name of Field
Number of Bits in Field
Applies to Data in Which Link
Virtual channel selector for this micropacket (binary values):
Type of micropacket: see Table 1-5
Virtual channel for which the credits (in CR field) apply.
Credits: number of credits the source (that is, the receiver of these control bits) can add to the data transfer on the virtual channel indicated in the VCR field. (See “GSN Flow Control” for further explanation.)
Reception sequence number:
Transmission sequence number:
End-to-end checksum. Checksum for all data bytes of the GSN Message, up to and including the bytes in this micropacket. This checksum is verified by the final destination.
Link checksum. Checksum for the 32 bytes of data and the first 48 bits of control information in this micropacket. This checksum is verified by each GSN element at the end of a link.
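The 64 bits of control information can be modeled as a packed word. In the Python sketch below, the field order and widths are illustrative placeholders, not the normative HIPPI-6400-PH layout (consult the standard for the real bit positions); only the 64-bit total matches the text:

```python
# Sketch of packing GSN control bits into a 64-bit word.
# Field widths are placeholders chosen to sum to 64 bits.
FIELDS = [           # (name, width in bits), most significant first
    ("VC",   2),     # virtual channel of this micropacket
    ("TYPE", 4),     # micropacket type (see Table 1-5)
    ("VCR",  2),     # virtual channel the credits apply to
    ("CR",   8),     # credits granted to the opposite direction
    ("RSEQ", 8),     # reception sequence number
    ("TSEQ", 8),     # transmission sequence number
    ("ECRC", 16),    # end-to-end checksum over the whole Message
    ("LCRC", 16),    # link checksum over this micropacket
]

def pack(values):
    word = 0
    for name, width in FIELDS:
        word = (word << width) | (values[name] & ((1 << width) - 1))
    return word

def unpack(word):
    values = {}
    for name, width in reversed(FIELDS):
        values[name] = word & ((1 << width) - 1)
        word >>= width
    return values

v = {"VC": 1, "TYPE": 6, "VCR": 3, "CR": 5,
     "RSEQ": 10, "TSEQ": 11, "ECRC": 0xBEEF, "LCRC": 0xCAFE}
assert unpack(pack(v)) == v
assert sum(w for _, w in FIELDS) == 64   # control word is 64 bits
```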
Most of the GSN micropacket TYPEs are related to control and management of the GSN link. Only three TYPEs of micropackets are passed to the upper layers: Admin, Header, and Data micropackets. The Admin micropacket (illustrated in Figure 1-4) is used by upper-layer GSN administrative programs to manage and configure a GSN fabric; hence, the Admin micropacket is defined by the Switch Control ANSI standard (HIPPI-6400-SC). The Header and Data micropackets are used to create GSN Messages (illustrated in Figure 1-6) that carry user-level data.
Description of the Micropacket
Supported by IRIX GSN Hardware?
Reset: Causes the receiving HIPPI-6400-PH device to reset the local link (that is, the physical link between this sender and the device at the other end of the physical link).
Reset_Ack: Acknowledges that the Reset micropacket was received and that the HIPPI-6400-PH link reset was completed.
Initialize: Causes the receiving HIPPI-6400 device to reinitialize.
Initialize_Ack: Acknowledges that the Initialize micropacket was received and that the HIPPI-6400-PH initialization procedure was completed.
Not applicable (NA)
Null: Contains no data in the 32-byte data area; there may be valid information in the Control Bits. This type is transmitted only when there is nothing else to transmit; it keeps the physical link active/alive.
Data: Contains data for a GSN (HIPPI-6400) Message (illustrated in Figure 1-6).
Header: Contains the header information for a GSN (HIPPI-6400) Message (illustrated in Figure 1-6).
Credit-only: Contains only valid credits (VCR and CR fields of Control Bits) that allow the transmitter to send more data. The micropacket contains no data in the 32-byte data area. This type is transmitted only when there are no Admin, Header, or Data micropackets awaiting transmission.
Admin: Used for administering GSN switches and endpoints. Format for the Admin micropacket is defined by the HIPPI–6400–SC standard. A number of functions (commands) are supported, including: ping another GSN device, request the ULA of a remote GSN device, and set up broadcast capability for a GSN fabric.
One of the functions of the Admin micropacket is to allow each switch on a GSN fabric to discover the fabric's physical configuration and each endpoint to discover the universal LAN MAC address (ULA) that its switch has assigned to it. This functionality is not available on every GSN product; however, when it is implemented, it works as follows.
Upon starting, an endpoint transmits an Admin micropacket that asks the device at the other end of the link to identify its function (for example, endpoint or switch). If the device is a switch, the endpoint asks it for an assigned ULA; if the device is another endpoint, the local endpoint uses its locally assigned ULA (which might be stored in the hardware's PROM).
Upon starting, a switch transmits Admin micropackets that ask other devices to identify their functions (for example, switch or endpoint). The switch sends one such request to each hop (successive hardware device) down each of its links until an endpoint is reached. Upon discovering each endpoint or switch, it uses Admin micropackets to exchange ULA information with that device. As the responses arrive, the switch constructs a map (spanning tree) of its fabric. Once this map has been constructed, a micropacket destined for a known endpoint (that is, any endpoint discovered within that fabric) can be delivered.
Note: This fabric discovery scheme does not solve the problem of how each endpoint comes to know the ULA for the other endpoints with which it wants to communicate. That problem can be solved by an upper-layer address resolution mechanism (for example, HARP or another network–layer address resolution mechanism). For details, see “Address Resolution for GSN”.
A GSN destination (receiving) endpoint controls the flow of micropackets by periodically releasing credits to the source. Each credit represents memory at the destination for one GSN micropacket. Each credit gives the source permission to send one additional micropacket on a specific channel. The destination gives credits to the source in the control bits (CR and VCR bits) that accompany the destination's own micropackets. Note that the credits travel in the opposite direction from the data, as illustrated in Figure 1-5, and can accompany micropackets traveling on any of the GSN virtual channels for the connection.
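The credit mechanism can be illustrated with a toy model (this is a sketch of the concept, not the SuMAC implementation; the class and names are invented):

```python
# Minimal model of GSN credit-based flow control on one virtual
# channel: the destination grants credits, the source consumes one
# credit per micropacket and stalls when it has none left.
class SimplexChannel:
    def __init__(self, initial_credits):
        self.credits = initial_credits   # buffers free at destination
        self.sent = 0

    def try_send(self):
        if self.credits == 0:
            return False      # must wait for CR bits from the receiver
        self.credits -= 1
        self.sent += 1
        return True

    def receive_credits(self, cr):
        # The CR field arrives on the *other* simplex link of the pair.
        self.credits += cr

ch = SimplexChannel(initial_credits=2)
assert ch.try_send() and ch.try_send()
assert not ch.try_send()      # out of credits: flow is blocked
ch.receive_credits(1)         # destination freed one micropacket buffer
assert ch.try_send()
assert ch.sent == 3
```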
The GSN Message is the basic data transfer unit between source and final destination endpoints. Each Message is composed of one initial Header micropacket followed by zero or more Data micropackets (illustrated in Figure 1-6). The micropackets of a Message are sequentially ordered and all travel over the same virtual channel using the same originating source (S_ULA value) and final destination (D_ULA value). The last micropacket in a Message has a bit set (the TAIL flag) to indicate that the Message is complete. Figure 1-6 illustrates a complete GSN Message.
When the GSN Header micropacket is carrying an IP datagram (EtherType=0x0800), the 8 bytes of payload in the Header micropacket are the first 8 bytes of the IP header. (Note that the 8 bytes immediately preceding the Payload are an 802.2 SNAP header.) When the GSN Header micropacket is carrying an ST transfer (EtherType=0x8181), the payload bytes in the Header micropacket are the initial 8 bytes of the ST Header.
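A receiver uses the EtherType in the Header micropacket to decide how to interpret the payload. The following sketch shows that dispatch using the two EtherType values named above (the handler labels are invented; 0x0000, raw user data, falls through to the default):

```python
# Dispatch on the EtherType carried in a GSN Header micropacket.
ETHERTYPE_IP = 0x0800   # payload = first 8 bytes of the IP header
ETHERTYPE_ST = 0x8181   # payload = first 8 bytes of the ST Header

def classify(ethertype):
    """Name the stack that should receive this GSN Message."""
    return {ETHERTYPE_IP: "ip-over-gsn",
            ETHERTYPE_ST: "st-over-gsn"}.get(ethertype, "other")

assert classify(0x0800) == "ip-over-gsn"
assert classify(0x8181) == "st-over-gsn"
assert classify(0x0000) == "other"      # raw, unencapsulated user data
```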
Every HIPPI element is capable of processing GSN (HIPPI–6400-SC) Admin micropackets. These micropackets configure elements, discover the fabric topology, and maintain the elements of a GSN fabric. The TYPE field of the control bits (illustrated in Figure 1-3) indicates that a micropacket is of the Admin type. Admin micropackets have the format illustrated in Figure 1-4.
Most HIPPI-6400 elements have two ports: one leading toward the fabric and the other leading toward the host/core. For example, a link-end element (such as the SuMAC ASIC) has one port connected to the physical link (that is, to the fabric) and the other port connected to additional GSN logic (which may be another local element) on an adapter board. Note that a GSN system may contain more than one element; this fact is important in understanding the processing of Admin micropackets.
An Admin micropacket can enter an element through either port, as illustrated in Figure 1-7. Each Admin micropacket is either processed and responded to or forwarded to the next element through the element's other port, as illustrated in Figure 1-7. A response to an Admin micropacket always exits the element through the same port by which the original Admin micropacket arrived.
The hop count field in the Admin micropacket determines when the Admin packet is acted upon/processed. The count indicates the number of elements (hops) through which the Admin micropacket is propagated/forwarded before it is processed. As long as the hop count is greater than zero, the receiving element decrements the hop count by one and transmits the Admin micropacket out the element's other port (which leads to another element), as illustrated in Figure 1-8. When the count is zero, the receiving element processes the micropacket and responds, as illustrated in Figure 1-9. Figure 1-10 through Figure 1-12 show examples of various hop count values and the manner in which hop count determines which element acts on and responds to the micropacket.
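The hop-count rule can be sketched as follows (the element names in the example chain are invented; real chains are walks through SuMAC-style link elements and switch ports):

```python
# Sketch of the Admin hop-count rule: an element forwards the
# micropacket out its other port (decrementing the count) until the
# count reaches zero, at which point it processes it and responds.
def walk(hop_count, elements):
    """Return the element that processes the Admin micropacket."""
    for element in elements:
        if hop_count == 0:
            return element    # count exhausted: process and respond
        hop_count -= 1        # otherwise forward to the next element
    return None               # count larger than the chain: no handler

chain = ["sumac", "adapter-logic", "switch-port", "far-endpoint"]
assert walk(0, chain) == "sumac"          # first element handles it
assert walk(2, chain) == "switch-port"    # two hops, third element
assert walk(9, chain) is None             # ran off the end of the chain
```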
Table 1-6 lists the administrative commands that are available with Admin micropackets.
Required (R) or Optional (O) for Switches and Endpoints
Are you there?
Yes I am here (and functioning).
Here is your “element address”.
Status (for example, I have started using the assigned address).
I am a <switch/endpoint> element.
What are you?
I am a <switch/endpoint> element.
Assign me a ULA.
Here is your ULA.
R for switches
Give me the data from this Admin register.
Here is the data you requested.
Put this data into this Admin register.
Status (for example, the data has been written).
I received an invalid/unrecognized/ unsupported Admin micropacket.
Give me a list of all the ULAs connected to you.
Here is the list.
R for switches
For all traffic containing the specified ULA, change the route (output port) to a specified (new) port ID.
R for switches
Give me the port ID that I must use to contact the specified ULA.
Here is the port ID.
R for switches
Scheduled Transfer (ST) is an upper-layer protocol that can be implemented to operate over a number of physical–layer subsystems, including GSN, ATM, FDDI, and Ethernet. This section describes the main characteristics of the ST protocol. For the sake of introduction and ease of understanding, many of the less important functional details of ST are not covered in this description. Refer to the ANSI standard (listed in the section “Overview of Protocols”) for complete details.
The most salient feature of ST is that it prepares both endpoints for the data movement before any data is transmitted. The first step in the preparation is to create a condition (state) called a virtual connection or VC (described in “ST Connection Setup Sequence”). The second step is a handshake that allocates memory for the data movement and exposes this memory to the other endpoint (described in “ST Data Movement Sequences Including Memory Allocation”). There are two kinds of memory–allocation handshake: one provides memory that is used once (described in “Single-use Memory Data Movements”); the other provides memory that is used many times until released (described in “Persistent Memory Data Movements”). The two endpoints exchange ST control operations to accomplish these prearrangements. Only after these prearrangements are complete can the first data movement begin; the data movement is performed with ST data operations.
The following terms have specific meanings within the context of ST:
Operation: The ST protocol data unit. It is composed of a 40–byte header and variable–length data ranging from 0 bytes to 4 gigabytes. Each ST operation is transmitted as one GSN Message, as illustrated in Figure 1-13.
Sequence: A series of operations that occur in a specific order and accomplish an ST protocol task.
Initiator: The ST endpoint that sends the first operation within an ST sequence. The endpoint that acts as initiator during one sequence (for example, the connection setup) can act as the responder in a subsequent sequence (for example, the data movement).
Responder: The other (not the initiator) ST endpoint participating in an ST sequence.
Slot: Memory at an ST destination that is reserved for holding one incoming ST Header.
The Operation is the basic protocol data unit for ST. Each ST Operation is carried within a single GSN Message, composed of two or more HIPPI–6400 micropackets, as illustrated in Figure 1-13.
ST operations (listed in Table 1-7) are commonly grouped into the following categories:
Connection management operations: used to set up and tear down a VC
Control operations: used to manage a VC (for example, status or flow control)
Data operation: used to transmit ST payload (upper-layer data) and/or data checksum during data movement sequences
Name of Operation
Sequence in Which Operation is Used
Request_Connection (RC): Requests that a VC be created. Issued by any endpoint. First operation of setup sequence.
Connection_Answer (CA): Response to RC. Accepts (creates VC) or rejects the RC. Second (and last) operation of setup sequence.
Request_Disconnect (RD): Indicates that sender (initiator) is tearing down the VC. Issued by either endpoint of VC. First operation of teardown sequence.
Disconnect_Answer (DA): Response to RD. Indicates that the sender (responder) is tearing down the VC. Second operation of teardown sequence.
Disconnect_Complete (DC): Response to DA. Indicates sender (initiator) has finished tearing down VC. Third (and last) operation of teardown sequence.
Request_Memory_Region (RMR): Requests that responder expose memory. First operation of persistent memory sequence.
Memory_Region_Available (MRA): Response to RMR. Exposes responder's memory to initiator.
Get (GET): Issuer (initiator) is destination for the data movement. Exposes initiator's memory to receive the requested data. Data comes from source's exposed persistent memory region. RMR/MRA handshake must have occurred.
FetchOp (FETCHOP): Issuer (initiator) is destination for the result of an atomic fetch-and-operate on the source's exposed persistent memory region. RMR/MRA handshake must have occurred.
FetchOp_Response: Response to FETCHOP.
Request_To_Send (RTS): Issued by source (=initiator for write or =responder for read). Indicates issuer is ready to transmit data; asks the other endpoint to expose single-use memory. First operation of write sequence.
Request_To_Receive (RTR): First operation for a read sequence. Indicates issuer is ready to receive data. Issuer becomes the initiator of the read sequence.
Clear_To_Send (CTS): Response to RTS. Gives source permission to transmit one block of data. Exposes single-use memory for that data.
Data (DATA): Carries ST payload and/or checksum; used in every data movement sequence. Sent by data source, which can be either initiator or responder within the data movement sequence.
Request_Answer (RA): Response to an RTS, RTR, RMR, GET, or FETCHOP. Accepts, rejects, or pauses the request to which it is responding.
Request_State: Requests VC status information. Issued by either endpoint.
Request_State_Response: Communicates VC state information. Response to either an RC operation or a DATA operation in which the Send_state flag (within the ST Header) is set.
Abort Data Movement
End (END): Terminates an in-progress data movement (read/write transfer or a persistent memory region) by causing the allocated memory to be released; leaves VC open. Issued by either endpoint.
Abort Data Movement
End_Ack: Response to END. Indicates responder has aborted the associated data movement.
The ST Header (illustrated in Figure 1-14) carries the information that implements the ST protocol features. Some of the parameters that are communicated within the ST Header are:
Type of operation (listed in Table 1-7)
Data channel through which this operation travels, which, for ST–over–GSN, maps directly to GSN virtual channels (summarized in Table 1-3)
Number of memory spaces (slots for holding ST Headers) that are currently available at each endpoint for this data channel (that is, VC)
Port values for initiator and responder within each VC
Key values for initiator and responder within each VC
Length of the data to be moved from one endpoint to the other
Block number for use in tracking progress, managing flow control and resource allocation, and performing striping within a data movement
Memory address (buffer index and offset) to use for the data movement
Checksum for the operation
Identification numbers for tracking and sequencing operations: DATA operations, FETCHOP operations, GET operations, and REQUEST_STATE_RESPONSE operations within each VC
The following are some of the endpoint behaviors that can be controlled by the operation's ST Header:
Whether or not the destination for a data movement supports reception of out-of-order Blocks
Whether or not the operation's ST Header should be delivered to the destination's upper-layer protocol (ULP)
Whether or not the destination ULP should be interrupted when this operation arrives
Request status information from the endpoint receiving this ST Header
Inform initiator that responder is rejecting a request
Pause the transmission during a data movement
ST defines sequences of operations for accomplishing various tasks, including the following:
To open a connection between two endpoints and negotiate the parameters associated with the virtual connection. (See “ST Connection Setup Sequence”.)
To perform a data movement including the handshake that allocates memory at the destination. (See “ST Data Movement Sequences Including Memory Allocation”.)
To control the data flow during the data movement, thus enabling full-rate, non-congested data flow between the endpoints. (See “ST Flow Control Sequences”.)
To tear down a connection. (See “ST Connection Teardown Sequence”.)
Each ST sequence allows the two endpoints to exchange a set of control parameters and information. The parameters are carried in the ST Header (illustrated in Figure 1-14). Each type of operation uses the Header fields differently and exchanges a different set of parameters.
Before any ST data can be exchanged, a Virtual Connection (VC) must be set up between the initiator and the responder. Upon successful completion of this exchange, each endpoint will have stored a set of parameters associated with the VC and will have set aside some resources for exclusive use by this VC. Three of the stored parameters are used (as a tuplet) for identifying/validating operations that arrive to the VC. The verification tuplet consists of: the remote endpoint's ST port number, the local endpoint's ST port number, and the key value that the local endpoint has assigned to this VC. Figure 1-15 illustrates how these identification parameters are set up.
Note: The initiator for the connection setup sequence is the endpoint that sends the first control operation for the sequence (that is, the Request_Connection).
The connection setup sequence consists of two control operations: a Request_Connection sent by the initiator, followed by a Connection_Answer sent by the responder. Figure 1-15 and Figure 1-16 illustrate different subsets of the information exchanged in one successful connection setup sequence. Figure 1-17 illustrates a connection setup sequence in which the responder refuses to create the VC.
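The verification tuplet established by this sequence (remote ST port, local ST port, local key) can be sketched as a small class. The port and key values below are invented for illustration; they are not defaults from the ST standard:

```python
# Sketch of the VC verification tuplet set up during connection
# setup: an arriving operation is matched against the stored
# (remote port, local port, local key) before it is accepted.
class VirtualConnection:
    def __init__(self, remote_port, local_port, local_key):
        self.tuplet = (remote_port, local_port, local_key)

    def accepts(self, src_port, dst_port, key):
        # The peer simply echoes back the key this endpoint assigned;
        # the key is "unique" only at the endpoint that issued it.
        return (src_port, dst_port, key) == self.tuplet

vc = VirtualConnection(remote_port=7, local_port=42, local_key=0xA1)
assert vc.accepts(7, 42, 0xA1)        # valid operation for this VC
assert not vc.accepts(7, 42, 0xB2)    # wrong key: rejected
assert not vc.accepts(8, 42, 0xA1)    # wrong remote port: rejected
```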
The ST connection setup sequence negotiates and sets the following parameters and resources that remain in effect for the duration of the VC:
I_Port and R_Port
ST port value on which endpoint (initiator and responder) wants to receive all communication associated with this VC.
I_Key and R_Key
Locally unique identification number (key) for use in verifying and identifying this VC. Each endpoint gives the other endpoint a key, which the other simply echoes back in each communication; the key means nothing to the remote end and is only “unique” at the endpoint where it was assigned.
I_Bufsize and R_Bufsize
Size of the buffers used by each endpoint for data it receives on this VC.
I_Slots and R_Slots
Initial number of “slots” available at each endpoint. Each slot indicates memory that has been set aside for storing ST headers that are received on this VC. Each slot normally consists of one 40-byte data structure.
Number of Clear_to_Sends that the source would like to have outstanding (available) at all times during the data movement.
I_MaxSTU and R_MaxSTU
Maximum size STU that each endpoint is willing to receive. The other endpoint must respect this size when transmitting on this VC.
Identity of the protocol being encapsulated (carried) within the ST Messages on this VC. For example, for IP datagrams, the EtherType is 0x0800; when the ST Messages carry user data that is not enclosed in any additional protocol, the EtherType is 0x0000. The initiator specifies this parameter.
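The negotiated parameters above can be pictured as a small per-endpoint record; a sender must then respect the peer's maximum STU size on every data operation. The field names below follow the text, but the record layout and the `check_stu` helper are illustrative, not part of the ST specification:

```python
# Sketch of the parameter set each endpoint might hold for one VC after
# a successful connection setup. Illustrative only.
from dataclasses import dataclass

@dataclass
class VCParams:
    port: int      # ST port for this VC (I_Port or R_Port)
    key: int       # locally assigned key (I_Key or R_Key)
    bufsize: int   # receive buffer size (I_Bufsize or R_Bufsize)
    slots: int     # initial slot count (I_Slots or R_Slots)
    max_stu: int   # largest STU this endpoint will accept

def check_stu(stu_len, peer):
    # The other endpoint must respect the peer's MaxSTU when transmitting.
    return stu_len <= peer.max_stu
```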
When an endpoint no longer wants a VC, it initiates the connection teardown sequence illustrated in Figure 1-18. This sequence is not used to terminate data movements. (See “ST Termination Sequence for a Data Movement”.)
This section describes ST data movement sequences. Each ST data movement sends upper-layer (user) data from one endpoint (the source) to one other endpoint (that is, one final destination). The entire data transfer is controlled by the VC parameters negotiated during one ST connection setup procedure (described in “ST Connection Setup Sequence”) or renegotiated during the data movement. The setup sequence must be completed before any data movement sequence is initiated.
The data movement sequences consist of two to five operations, exchanged between the VC's two endpoints (the memory-allocation handshake), followed by one or more data operations. There are five different data movement sequences, as summarized in Table 1-8. The initiator controls which sequence is used, depending on the type of memory it wants to have allocated, the type of functionality it desires for the data movement, and the role it wants to assume in the transfer.
The memory-allocation handshakes allow either of the following types of memory to be allocated for receipt of the data:
Persistent memory: a region of memory that is used over and over for the transfers that occur within that virtual connection, as described in “Persistent Memory Data Movements”
Single-use memory: a region of memory that is written once, then released, as described in “Single-use Memory Data Movements”
Table 1-8 summarizes the five data movement sequences and indicates where each sequence is illustrated:
(Table 1-8 distinguishes, for each sequence, whether the initiator acts as the data source or as the data destination, and points to the figure that illustrates the sequence, for example, Figure 1-22.)
|Note: Within a data movement sequence, the initiator is the endpoint that sends the first control operation for the sequence (for example, Request_to_Send or Request_Memory_Region), regardless of whether it operates as the data transmitter (source) or receiver (destination).|
Table 1-9 summarizes the data size ranges for each type of data movement. As illustrated in Figure 1-19, the data is first chunked into one or more Blocks; the maximum size for a Block is negotiated during the memory allocation handshake. Each Block is divided into one or more scheduled transfer units (STU; the data for one data operation); the maximum size for the STU was negotiated during the connection setup sequence. Any ST data movement that is larger than the VC's maximum STU size requires multiple data operations. Each STU (that is, each data operation) is transmitted as one GSN Message. The flow-control mechanism for user data (described in “ST Flow Control Sequences”) operates at the Block level.
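The chunking described above (transfer into Blocks, Blocks into STUs, one STU per GSN Message) can be sketched directly. The function below is an illustrative model; the sizes used in the usage note are arbitrary examples:

```python
# Sketch of ST data chunking: a movement is split into Blocks (maximum
# size negotiated in the memory-allocation handshake), and each Block
# into STUs (maximum size negotiated at connection setup). Each STU is
# carried as one GSN Message.

def chunk(total_len, max_block, max_stu):
    """Return a list of Blocks, each a list of STU lengths."""
    blocks = []
    remaining = total_len
    while remaining > 0:
        block_len = min(remaining, max_block)
        stus = []
        left = block_len
        while left > 0:
            stus.append(min(left, max_stu))
            left -= stus[-1]
        blocks.append(stus)
        remaining -= block_len
    return blocks
```

For example, a 10000-byte movement with a 4096-byte Block limit and a 1024-byte STU limit becomes three Blocks, the last of which holds a short final STU. Flow control then operates per Block, as the text notes.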
Data Movement Type                 Maximum Length for Data Movement Sequence
Single-use Memory: Write           2^64 minus 1 bytes, or unlimited
Single-use Memory: Read            2^64 minus 1 bytes, or unlimited
Persistent Memory: each Put        2^48 minus 1 bytes, or VC's max_STU (one Block)
Persistent Memory: each Get        2^16 minus 1 bytes, or VC's max_STU (one Block)
Persistent Memory: each FetchOp    8 bytes (one Block)
The persistent memory sequences consist of a few control operations (the memory–allocation handshake) followed by any number of Put, Get, and/or FetchOp sequences. The persistent memory handshake allocates one or more memory regions at the responding endpoint. These regions are then used multiple times; each buffer within each region is used over and over during the life of the virtual connection. When properly used, this method provides permanent, low–latency delivery, in which an unlimited number of transfers can be performed with no intervening overhead. There is an important caveat: the low latency on this type of data transfer depends on the speed at which the memory can be made available for the next use. This type of transfer works best for small (or fixed-size) data and for applications for which the transmission rate is well understood, so that the memory can be sized in a manner that allows it to be recycled within an acceptable period of time. It is the responsibility of the upper-layer applications to manage flow control and prevent precipitous overwriting of the memory region.
Once a persistent memory region has been allocated at the responder endpoint, the initiator can move data in or out of it in three manners:
Put sequence (illustrated in Figure 1-20)
One data operation (STU) that writes any portion of or the entire persistent memory region at the responder. This sequence can be repeated over and over with no intervening operations.
Get sequence (illustrated in Figure 1-21)
A GET control operation to expose memory at the initiator for receiving the requested data, followed by any number of data operations. Each data operation moves a portion or all of the data from the responder's allocated memory into the initiator's memory. Multiple GETs can be outstanding (occurring) simultaneously to different or shared portions of the persistent memory region.
FetchOp sequence (illustrated in Figure 1-22 and Figure 1-23)
A FETCHOP control operation to expose memory at the initiator for receiving the retrieved data and to specify the desired function (increment, decrement, or clear). Then, a single data operation (one STU) that moves one 64-bit Block of data from the responder's memory into the initiator's memory. When the data arrives successfully at the initiator, the initiator issues a completion control message, at which point the responder performs the specified function on its own copy of the data. If the completion does not arrive within a timeout period, the responder retransmits the data. Note that, unlike PUT and GET, this data movement sequence is atomic.
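The FetchOp exchange can be modeled as follows: the responder returns the current 64-bit value, but applies the requested function to its own copy only after the initiator's completion arrives, which is what makes the sequence atomic and retransmittable. This is an illustrative model, not the wire protocol:

```python
# Sketch of the FetchOp responder behavior described above. The value
# returned to the initiator is the pre-operation value; the function
# (increment, decrement, or clear) is applied only on completion.

MASK64 = 0xFFFFFFFFFFFFFFFF  # FetchOp data is one 64-bit Block

class FetchOpResponder:
    def __init__(self, value=0):
        self.value = value
        self.pending = None  # operation awaiting a completion message

    def fetchop(self, op):
        # Transmit the current value; remember the op until completed.
        # (On timeout the same value would simply be retransmitted.)
        self.pending = op
        return self.value

    def complete(self):
        # Completion received: now apply the function to our own copy.
        if self.pending == "increment":
            self.value = (self.value + 1) & MASK64
        elif self.pending == "decrement":
            self.value = (self.value - 1) & MASK64
        elif self.pending == "clear":
            self.value = 0
        self.pending = None
```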
A persistent memory region is terminated (released) with an End operation, as described in “ST Termination Sequence for a Data Movement”.
The single–use memory movement sequence consists of a few control operations (the memory-allocation handshake) that allocate memory at the destination endpoint, followed by one or more data operations for a specified amount of data. The data transfer uses the destination's allocated memory once; each buffer is used only once during the life of the transfer. This method allows high-bandwidth delivery after an initial delay for the allocation of resources: the transfer provides for a limited number of back–to–back writes or reads with no intervening overhead. This method is efficient for large, variable-length data.
A data transfer can be aborted (terminated before all the data has been transferred) with an End operation, as described in “ST Termination Sequence for a Data Movement”.
Figure 1-24 illustrates the data transfer sequence used when the initiator is the data source. Figure 1-25 illustrates the sequence used when the initiator is the data destination. Each illustration includes the memory allocation handshake.
Flow control operates differently for data transfers and ST operations. Each is explained below.
ST endpoints implement strict flow control for all data transfers done to single-use memory. For this purpose they use the Request_To_Send (RTS) and Clear_To_Send (CTS) control operations. There can be multiple CTSs generated in response to one RTS, as explained below and summarized in Table 1-10.
The ST flow control sequence regulates both the number of data transfer events that occur between the two endpoints and the size of these events. Before any data is transferred, the data transmitter (source) generates an RTS, in which it specifies the maximum size block of data and the number of blocks that it wants to send right now. The specified (requested) size and number do not oblige the receiver to give permission for that size or number; these are only suggestions that, if followed, could make the transfer more efficient.
The data receiver (destination) generates one or more CTSs in response to each RTS. In each CTS, the receiver gives the source permission to transmit one block of data; the number of CTSs issued by the receiver cannot exceed the number of “requested blocks” specified in the RTS. In the first CTS for the data movement, the receiver indicates the block size that it is willing to receive during this data movement; the block size must be no larger than the maximum block size specified in the associated RTS. Before issuing each CTS, the receiver must allocate the amount of memory specified by the block size in that CTS. See Figure 1-24 and Figure 1-25 for illustrations of the flow control sequence.
|Note: ST does not use flow control for persistent memory data movements: Put, Get, and FetchOp.|
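The RTS/CTS accounting just described can be sketched as follows: the source requests a number of blocks and a maximum block size, and the destination may issue at most that many CTSs, each granting exactly one block no larger than the (possibly smaller) block size it committed to in its first CTS. Illustrative model only:

```python
# Sketch of single-use memory flow control at the destination. Memory
# for each granted block must be allocated before the CTS is issued.

class Destination:
    def __init__(self, mem_per_block):
        self.mem_per_block = mem_per_block  # what we can afford to allocate
        self.cts_left = 0
        self.block_size = None

    def on_rts(self, max_block, nblocks):
        # The RTS values are suggestions; we may grant a smaller block
        # size, but never a larger one, and never more than nblocks CTSs.
        self.cts_left = nblocks
        self.block_size = min(max_block, self.mem_per_block)

    def issue_cts(self):
        # One CTS = permission to transmit exactly one block.
        if self.cts_left == 0:
            return None
        self.cts_left -= 1
        return self.block_size
```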
Transfer Event Parameter   Negotiation
Number of events           In RTS: the number of blocks the source would like
                           to send at this time.
                           In CTS: with each CTS, the destination gives the
                           source permission to transmit one block of data.
Size of each event         In RTS: the requested maximum block size for the
                           transfer events associated with this RTS.
                           In first CTS: the block size that will be used for
                           these transfer events.

Note: When the source does the actual data transfer, the size is not controlled by the RTS maximum block size; it is limited by the block size specified in the CTS.
Flow control for the ST Headers of ST operations is managed with a mechanism called slot allocation. Each slot represents memory that has been allocated at an endpoint to hold one incoming ST Header while it awaits processing. All incoming ST Headers use one slot, except Request_Connection operations and Data operations that have the Silent flag set.
|Note: Data operations with the Silent flag set do not occupy a slot because the ST Header for these operations is not passed to the receiving endpoint (and hence is not stored). The Request_Connection operation does not occupy a slot because the VC does not yet exist when this operation arrives. An implementation may keep a queue of slots associated with Port 0 (the port at which Request_Connection operations arrive), but the queue is not required: if the endpoint drops the request, the only consequence is that the initiator tries again until it succeeds.|
During the setup sequence for a VC, each endpoint communicates to the other endpoint the number of slots it has allocated for that VC. Updates for slot availability are communicated during normal operation with Request_State_Response operations. (See “ST Status Sequences” for details.) Each source keeps track of the number of outstanding operations (that is, slot-consuming ST Headers that it sends) and makes sure that it does not send more operations than the destination can handle.
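The source-side bookkeeping described above can be sketched as a simple counter of advertised slots; Silent data operations bypass the check because their headers are never stored. Illustrative model, not the driver's actual implementation:

```python
# Sketch of slot-based flow control for ST Headers: a source must not
# have more slot-consuming headers outstanding than the destination has
# advertised for this VC.

class SlotTracker:
    def __init__(self, peer_slots):
        self.available = peer_slots  # advertised at VC setup

    def can_send(self, silent=False):
        # Silent data operations consume no slot at the destination.
        return silent or self.available > 0

    def sent(self, silent=False):
        if not silent:
            self.available -= 1

    def on_request_state_response(self, slots):
        # Slot-availability update received from the destination.
        self.available = slots
```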
During normal operation, the endpoints for a VC can use either of two status sequences (illustrated in Figure 1-26 and Figure 1-27) to obtain information from the other endpoint about its state and status.
The information that can be exchanged with this mechanism includes:
number of currently available slots for this VC
highest Block received for a data movement
reception status for a specific Block
The following data movements do not have a natural ending:
a persistent memory region
a data transfer of unlimited size
To terminate either of the above data movements and release the associated resources, either endpoint initiates the termination sequence illustrated in Figure 1-28. In addition, this sequence can be used to abort a data transfer of specific length before all the data has been transferred.
GSN virtual channels are designed to carry specific sizes of data (see Table 1-3). The various ST data channels (DCs) that exist within ST virtual connections (VCs) can take advantage of these sized GSN channels. The IRIX ST–over–GSN stack routes any ST operation with DC=0 to GSN channel 0, DC=1 to GSN channel 1, and so on. Each ST application (that is, each ST port) is required to have one data channel (DC_0) for its control operations and one or more other channels (DCs 1, 2, and/or 3) for its data operations. Note that each GSN channel is shared by many VCs; for example, DC_0 of every ST VC shares GSN channel 0. Figure 1-29 shows an example of ST VCs using their data channels (DC values) to make effective use of the four GSN channels.
This section explains how logical networks are created on GSN and HIPPI fabrics. The discussion assumes that you have a thorough understanding of the concept of a logical network, the format of INET addresses, and the use of subnet masks to divide a single INET network address space into smaller networks, called Logical IP Subnets (LISs).
|Note: For complete details on INET address subnetting and the netmask, see the comments in the /etc/config/ifconfig.options file, the man page for inet(7F), the man page for ifconfig(1M), and the online IRIS InSight document IRIX Admin: Networking and Mail.|
There are three basic concepts that underlie the discussion in this section. Each is discussed in more detail in subsequent sections:
|Basic Concept #1|
The hosts connected to a GSN or HIPPI fabric do not have to function as one logical network whose addresses all come from one address space.
|Basic Concept #2|
A LIS (one address space) can include hosts from physically different GSN and/or HIPPI fabrics, as long as there is a bridging switch between the fabrics.
|Basic Concept #3|
Within a GSN or HIPPI fabric, direct communication (without use of an intermediate router) between INET hosts can occur only when (1) the network interfaces involved in the exchange have addresses that come from the same logical address space (for example, they are members of the same LIS), and (2) both hosts have access to an address resolution mechanism.
The hosts connected to a GSN or HIPPI fabric do not have to function as one network address space. The hosts can be organized into smaller groupings (for example, based on function, project, or hardware manufacturer). Each grouping of hosts is a separate logical network or a LIS. Each LIS is assigned a sequence of network-layer addresses (that is, a unique address space). Figure 1-31 illustrates this concept.
A group's address space can be the complete range of addresses for an INET network address (192.0.2.0 to 192.0.2.255), or it can be a portion of the range (for example, subnet 192.0.2.0 to 192.0.2.31). Membership in a group is determined for each GSN network interface (for example, each gsn#) by the INET address associated with the interface (in the /etc/config/netif.options file) and the netmask value (in the ifconfig.options or the ifconfig-#.options file). The netmask value defines the size of the address space for each group. For example, a netmask value of 0xFFFFFF00 creates an address range that provides 256 individual host addresses. However, netmask value 0xFFFFFFE0 (shown in Figure 1-30) creates eight LISs in which each LIS can have up to 32 “host” addresses.
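The netmask arithmetic above is standard INET subnetting and can be checked directly: two interfaces belong to the same LIS exactly when their addresses agree under the netmask. A small sketch using Python's standard ipaddress module:

```python
# Two GSN interfaces are members of the same LIS when their INET
# addresses match after the netmask is applied. With 0xFFFFFFE0 the
# 192.0.2.0/24 space splits into eight LISs of 32 addresses each.

import ipaddress

def same_lis(addr_a, addr_b, netmask):
    a = int(ipaddress.IPv4Address(addr_a))
    b = int(ipaddress.IPv4Address(addr_b))
    return (a & netmask) == (b & netmask)
```

With netmask 0xFFFFFFE0, 192.0.2.5 and 192.0.2.30 share a LIS (both fall in 192.0.2.0-31) while 192.0.2.40 does not; with 0xFFFFFF00 all three are in one 256-address space.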
A logical network or a LIS can include hosts from physically different GSN and HIPPI fabrics, as long as there is a “bridging” communication path between the fabrics. Hosts that are members of the same INET address space (thus benefitting from the services provided by broadcast and routing) do not have to be physically attached to the same physical medium (fabric). Figure 1-32 illustrates this concept.
Direct communication between INET hosts (without use of an intermediate router) can occur only when the network interfaces involved in the exchange are members of the same logical address space (network or LIS). Contact with members outside one's own LIS requires use of an INET address router.
This rule is true even when a shared hardware connection (for example, a switch) exists between the two hosts that belong to different LISs. For example, for two hosts attached to the same switch, a message from host A in LIS 1, if sent to host B in LIS 2, must go through host C, an INET router. The benefit is that, no matter where a GSN network adapter is physically located or relocated, it continues to function as a member of the same LIS. Notice that no address or LIS–membership change is required when an endpoint is physically relocated.
The following facts explain why this concept exists:
GSN switches do not resolve network–layer (INET) addresses.
The local INET routing software (for example, IRIX's routed) does not maintain complete paths to destinations that are not members of the same LIS.
Before transmission of an IP packet, a GSN hardware address (ULA) must be discovered for the destination. This step requires the services of a HARP server.
Each HARP server maintains mappings only for its own LIS. (However, in the IRIX implementation, a single HARP daemon can act as a HARP server for multiple LISs at the same time.)
The basic concepts summarized in “GSN Fabrics and Logical Networks” make the examples described in this section possible.
Figure 1-31 and Figure 1-32 show examples of subnetting within two different GSN fabric configurations. The LIS addressing used in these examples (summarized in Figure 1-30) is identical. The examples use network INET address 192.0.2, so that each host address is 192.0.2.xxx. Hosts in LIS_1 use addresses between 192.0.2.0 and 192.0.2.31; those in LIS_2 use addresses between 192.0.2.32 and 192.0.2.63, and so on.
If you want a single-fabric site to have multiple address spaces, you can use multiple INET network addresses, or you can use a netmask to divide a single INET address space into smaller chunks (referred to as LISs). Likewise, in a multiple–fabric site, you can group all the hosts into one logical address space, or into multiple LISs regardless of each host's location.
Figure 1-31 illustrates a GSN fabric that has one switch to which all the network interfaces are attached (that is, all endpoints in this fabric have a direct physical link to one another). The example shows two LISs. Communication from A in LIS_1 to C in LIS_2 passes through the router (network interfaces J and H). Messages do not go directly from endpoint A to C, because of the concept explained in “Basic Concept #3”.
Figure 1-32 illustrates a different configuration for the same address space and network interfaces (“hosts”) used in Figure 1-31. This configuration is a two-switch fabric. In this example, A, B, E, J, K, and L belong to LIS_1, while C, D, F, G, and H belong to LIS_2. The system with network interfaces H and J continues to perform as the router between the two LISs. Just as in the first example (Figure 1-31), communication directed to C in LIS_2 from A in LIS_1 goes first to the router (J/H), even though both A and C are physically attached to the same switch. Most importantly, notice that the router has been moved to a different switch, and yet the INET addressing is identical to that used in the first configuration. The hardware changes do not affect the addressing. Also note that a router for an LIS does not need to share a switch with the members of its LISs, as illustrated by router J in relation to hosts A and B and router H in relation to hosts C and D.
This section describes how network (OSI layer three) addresses are mapped (resolved) to physical (OSI layer-one) addresses in a GSN fabric. This section assumes that you are familiar with standard Internet ARP (RFC 826, Ethernet Address Resolution Protocol) and Inverse ARP (RFC 2390, Inverse Address Resolution Protocol) protocols.
When a network–layer address is locally associated with (configured to) an IRIX GSN or IRIS HIPPI subsystem, address mapping is needed between network–layer addresses and physical-layer addresses so that communication can occur between the local network-layer entity and remote network-layer entities. The GSN/HIPPI physical address is known as the Universal LAN MAC Address or ULA. For IRIX, the default network protocol stack is the Internet Protocol and the network address is the INET address. The address resolution scheme for IP/ST–over–GSN is defined by RFC 2835, IP and ARP over HIPPI–6400, as described in the section “HARP Address Resolution”.
|Note: Each INET address (AF_INET) can support multiple protocols. For example, in IRIX 6.5, INET addresses support both the IP suite of protocols (PF_INET) and the ST protocol (PF_ST). For further details, see the man page for inet(7).|
To transmit data to another network-layer entity within the GSN fabric, each network-layer stack in the GSN fabric needs two addresses for each destination:
The network-layer address for the destination host. In IRIX, this information is supplied by the static “hosts” database or the dynamic NIS server.
The physical-layer address for the destination endpoint. This information is supplied by the static HARP table or the dynamic HARP server. See “HARP Address Resolution” for details.
A GSN fabric is said to support broadcasting when all of the switches of that fabric provide broadcasting. The behavior for HARP clients and HARP servers is different depending on whether the underlying GSN/HIPPI fabric supports broadcasting.
When the fabric does not support broadcasting, at least one host behaves as a HARP server for each defined LIS. All other hosts on the LIS are HARP clients.
HARP servers act as centralized repositories for IP-to-ULA mappings. As each host on an LIS initializes its GSN interface, it registers its IP-to-ULA mapping with each of the LIS's servers. The servers save this mapping information internally. When a host needs to communicate with another host via GSN, it queries the HARP server for the destination host's IP-to-ULA mapping. A HARP client will typically save mapping information it has received from the HARP server in its own local cache for faster subsequent mappings.
Address mappings are not permanent, so HARP clients must reregister with all HARP servers periodically. If HARP clients wish to locally maintain a cache of address mappings for other hosts, they must periodically validate these mappings with a HARP server as well.
When broadcast is supported by all switches in the fabric, there are no HARP servers. HARP's behavior is almost identical to standard ARP: when a host needs to perform an IP-to-ULA mapping, it broadcasts an ARP request using the broadcast ULA (FF:FF:FF:FF:FF:FF). The host for which the mapping is requested can identify its own IP address in the request packet, and sends a reply to the requestor with its ULA. All other hosts ignore the request.
The address resolution protocol for HIPPI networks is specified in the HARP RFCs. The protocol works with fabrics that provide broadcasting and with those that do not. One of the first tasks of each HARP client is to determine if its underlying fabric supports broadcasting, as described in “Determining Fabric Support for Broadcast”.
HARP provides a dynamic, client/server-based address resolution service. The protocol makes it possible for each IP/ST-over-HIPPI endpoint (client) within a network to register or communicate its own INET address and ULA, and to discover the ULAs for hosts with whom it wants to communicate. The HARP server maintains a kernel-resident lookup table that maps INET addresses to ULAs. HARP occurs in two phases: a registration phase (summarized in the section “HARP Registration Phase”) and a normal operation phase (summarized in “HARP Normal Operation Phase”).
When an LIS includes one or more endpoints that do not support dynamic HARP, static mappings for those endpoints must be added to the address resolution table at the HARP server (as described in the section “HARP Normal Operation Phase”).
A host determines whether it is on a broadcast- or nonbroadcast-capable LIS during its initialization phase by sending a request for its own address to the broadcast ULA (FF:FF:FF:FF:FF:FF). If the underlying fabric is a broadcast medium, the sending host will receive a copy of this packet, as will every other host on the LIS. If it does not receive this packet, the host is not on a broadcast medium. (To ensure that a single lost packet does not result in the host being brought up in the wrong mode, a host may send multiple self-identification packets during the initialization phase.)
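The self-test described above reduces to a simple loop: send a few self-addressed packets to the broadcast ULA and conclude the fabric supports broadcast if any copy comes back. The function below is a sketch; `send_and_listen` stands in for the real driver interface and is purely hypothetical:

```python
# Sketch of the broadcast-capability test: on a broadcast medium, a
# packet sent to the broadcast ULA is delivered back to the sender.
# Multiple attempts guard against a single lost packet forcing the
# host into the wrong mode.

def fabric_supports_broadcast(send_and_listen, attempts=3):
    # send_and_listen() returns True if our own self-identification
    # packet was received back from the fabric.
    for _ in range(attempts):
        if send_and_listen():
            return True
    return False
```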
If a host discovers that it is on a broadcast fabric, the HARP registration phase described in the following section is skipped (because there are no HARP servers with which to register), and HARP immediately enters the operational phase, described in “HARP Normal Operation Phase”.
During initialization of each GSN device, a HARP client on a non-broadcast medium will register its address pair (INET address and ULA) with each HARP server on its LIS. This is done by transmitting an InARP request to each HARP server. The InARP request contains the IP-to-ULA mapping of the client, and it requests the IP-to-ULA mapping of the server in reply. (Since InARP requests are sent to ULAs, each client must know the ULAs for all servers on the LIS. For IRIX, this information is contained in the configuration file /etc/config/harpd.options. For details, see “Edit harpd.options File” in Chapter 2.)
Because HARP clients can be brought up before HARP servers, a client might not receive replies to all (or any) of the InARP requests that it transmits. For each nonresponding HARP server, a HARP client will periodically retransmit the InARP request.
When at least one HARP server has responded with an InARP reply, the HARP client gains the ability to resolve unknown IP-to-ULA mappings on the LIS; the client then transitions from the registration phase to the operational phase.
The client enters HARP's operational phase under one of the following circumstances:
When a host determines that its GSN is connected to a broadcast-capable medium
When a host on nonbroadcast-capable medium has successfully registered with a HARP server
In the operational phase, a host can resolve IP-to-ULA mappings that it does not have in its local HARP table by issuing ARP requests. On broadcast-capable media, these requests are transmitted to the broadcast ULA (FF:FF:FF:FF:FF:FF); on nonbroadcast-capable media, these requests are transmitted to a HARP server.
When a host receives an ARP reply, it places the reply's IP-to-ULA mapping into its local mapping table for subsequent mappings of this address.
While in the operational phase, all HARP clients on nonbroadcast media must periodically reregister their own IP-to-ULA mappings. This reregistration is accomplished by sending either an ARP request or an InARP request to each HARP server for the LIS. Since, according to the HARP protocol, servers "forget" about clients they have not heard from in 20 minutes, this reregistration must occur in shorter intervals. In IRIX, by default this reregistration occurs every 15 minutes.
If a HARP server does not respond to the reregistration request, the HARP client must assume that the server is no longer functioning and cannot be used as a target for mapping requests. If no server is responding to a HARP client's reregistration requests, the client must fall back to the HARP registration phase.
HARP clients must also revalidate or remove from their local mapping table all entries that are more than 15 minutes old. Clients revalidate by sending ARP requests to the server or (on broadcast media) directly to the hosts whose mapping entry is to expire. An entry that has been revalidated is valid for another 15 minutes. If no reply (or a NAK reply) is received for an ARP request, the address for which the request was sent must be considered unmappable and is removed from the local mapping table.
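The cache-aging rule above (revalidate entries older than 15 minutes, drop entries that cannot be revalidated) can be sketched as follows. Timestamps are plain seconds and the `revalidate` callback stands in for an ARP request to the server or host; both are illustrative:

```python
# Sketch of client-side HARP cache aging. An entry that is revalidated
# is good for another 15 minutes; one that draws no reply (or a NAK)
# is considered unmappable and removed.

VALID_SECS = 15 * 60

def age_cache(cache, now, revalidate):
    """cache maps INET address -> (ula, last_validated_seconds)."""
    for ip in list(cache):
        ula, stamp = cache[ip]
        if now - stamp >= VALID_SECS:
            if revalidate(ip):           # ARP reply received
                cache[ip] = (ula, now)
            else:                        # no reply or NAK
                del cache[ip]
```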
When a host within a HIPPI/GSN LIS does not support dynamic HARP, the system administrator needs to add a static entry for that host to each HARP client (for broadcast capable networks) or to each HARP server's database (for nonbroadcast-capable networks). Static entry definitions can be placed into the HARP daemon configuration file (/etc/config/harpd.options), or they can be added manually to the mapping database by using the gsnarp utility. Each entry in the database must map a ULA (IEEE or MAC address) to an INET address.
These guidelines explain how to select a system to provide HARP services (that is, be the HARP server) when the HIPPI fabric does not support broadcasting. It is not necessary to identify a system for this purpose when the fabric supports broadcasting.
From among the members of the LIS, at least one system must be chosen to be a HARP server. For redundancy, at least two systems should be selected for this purpose. (When no HARP servers are available on an LIS, no HARP address resolution can occur, so the only members of the LIS that can intercommunicate are hosts whose HARP entries are statically defined.)
To ensure that every HARP server's database contains a complete mapping for all registered hosts, all hosts in an LIS must identify the same systems as HARP servers.
The IRIX GSN implementation of the ST protocol uses the same address resolution scheme as is used for IP–over–GSN. See “Address Resolution for GSN” for the details.
|Note: Each gsn# network interface services two protocols: ST and IP. The INET address assigned to an instance of gsn# is shared by the ST-over-GSN and IP-over-GSN stacks. Some of the upper-layer address processing (for example, routing) that is performed on the address applies to both IP and ST traffic.|
These entries are loaded when the harpd daemon is initialized via the harpd configuration file (by default, /etc/config/harpd.options; for details, see “Edit harpd.options File” in Chapter 2). Alternatively, the administrator can add them individually via the gsnarp -s command. The administrator can remove static entries via gsnarp -d.
The description in this section applies to systems running IRIX 6.5.9f (or later) and to network interfaces for the Internet Protocol suite (INET address over GSN subsystem) and Scheduled Transfer (ST–over–GSN) protocol.
With each restart (for example, a power-on, a reboot, or an init 0 command), the startup routine probes for hardware on all the modules connected to the CrayLink interconnect fabric. All the slots and links in all the modules within the fabric are probed. The routine then creates a hierarchical filesystem, called the hardware graph, that lists all the located hardware. The top of the hardware graph is visible at /hw. For complete details, see the man page for hwgraph(4). After the hardware graph is completed, the ioconfig program assigns a unit number to each located device that needs a number. Other programs (for example, hinv and each device's driver) read this assigned number and use it.
The XIO slots are searched (probed for a device) in the order shown below; this order is not the same sequence as the XIO slot numbering. For example, the device in XIO slot 4 is located before the device in slot 2 and, because of this, may have a lower unit number than the device in slot 2. After the first power on, you can edit the /etc/ioconfig.conf file to assign unit numbers that are convenient for you. Your changes are used during each subsequent power on. See the ioconfig(1M) man page for further details.
On an initial system startup, ioconfig groups devices into classes/types and assigns hardware unit numbers sequentially within each class. It records these assignments in the /etc/ioconfig.conf file; for example, if two SGI GSN products are found, they are numbered unit 0 (gsn0) for the first one found and unit 1 (gsn1) for the second. When an SGI GSN product is a two-board solution, both boards are associated with a single unit number. On subsequent startups, ioconfig distinguishes between hardware that it has seen before and new items. To previously seen items, it assigns the same hardware unit numbers (those recorded in the ioconfig.conf file). To new hardware, it assigns new sequential numbers and records them. It never reassigns a number, even if the device that had the number is removed and leaves a gap in the numbering. For example, in a system with two instances of some class of device, if unit0 is removed, the next restart results in the system listing only unit1; if a new board is then installed in a new location, it is listed as unit2.
New items are differentiated from previously seen items through the hardware graph listing (that is, the path under /hw/module/#/slot/io#/...). The database of previously seen devices is kept in the file /etc/ioconfig.conf. A replacement board (with the exact same hardware device name) that is installed into the location of an old board (so that it has the same hardware graph listing) is assigned the old board's unit number, but a board that is moved from one location to another is assigned a new number. For example, in a two-device system with ioconfig.conf entries illustrated below, if unit0 is moved to a different slot, the next restart results in a new item in the ioconfig.conf file. The hinv command lists unit1 (an original board in its original slot) and unit2 (the board that has been moved to a new slot), but not unit0. For more information about the hardware graph and ioconfig, see the man pages for hwgraph(4) and ioconfig(1M).
Initial entries for two devices:

0 /hw/module/1/slot/io8/xio_gsn/device
1 /hw/module/1/slot/io4/xio_gsn/device
0 /hw/gsn/0
1 /hw/gsn/1

Entries after unit0 is moved:

0 /hw/module/1/slot/io8/xio_gsn/device
1 /hw/module/1/slot/io4/xio_gsn/device
2 /hw/module/1/slot/io5/xio_gsn/device
1 /hw/gsn/1
2 /hw/gsn/2
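The numbering behavior described above can be sketched as a small model (illustrative only, not SGI's implementation): previously seen hardware-graph paths keep their recorded unit numbers, new paths receive fresh sequential numbers, and a number is never reused even after its device is removed or moved.

```python
# Sketch of ioconfig-style persistent unit numbering (illustrative).

def assign_units(db, present_paths):
    """db: dict of hardware-graph path -> unit number (the role played
    by /etc/ioconfig.conf). present_paths: paths found on this boot.
    Returns path -> unit for the devices present now; db is updated
    in place and old entries are never removed or renumbered."""
    next_unit = max(db.values(), default=-1) + 1
    for path in present_paths:
        if path not in db:            # new hardware: next sequential number
            db[path] = next_unit
            next_unit += 1
    return {p: db[p] for p in present_paths}

# Two devices recorded on the initial startup:
db = {"/hw/module/1/slot/io8/xio_gsn/device": 0,
      "/hw/module/1/slot/io4/xio_gsn/device": 1}

# unit0's board is moved from slot io8 to slot io5: its old entry stays
# in the database, and the board in the new location is treated as new
# hardware, so it becomes unit2.
now = assign_units(db, ["/hw/module/1/slot/io4/xio_gsn/device",
                        "/hw/module/1/slot/io5/xio_gsn/device"])
```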
The two-board SGI GSN product occupies two XIO slots that are logically associated with a single device (one unit number). The device has two XIO slots and two hardware graph entries. All links (for example, the short, or convenience, path /hw/gsn/#) point to the XIO slot for the main SGI GSN board. All located SGI GSN hardware devices can be displayed with /sbin/hinv or the find command.
As the startup process continues, it calls the network drivers and protocol software modules so that they can create their network and programmatic interfaces. For GSN, this step works in the following manner:
For each located SGI GSN device (port), the startup process creates short (/hw/gsn/#) and long (/hw/module/#/slot/io#/xio_gsn) entries in the hardware graph. Then, the initialization scripts create a symbolic link in /dev that points to the device's entry in the hardware graph.
For each located GSN hardware device, the startup routine creates an entry in the hardware inventory database that can be displayed by hinv.
For each located hardware device, the IRIX GSN driver creates a logical network interface and assigns it a number that matches the hardware. For example, if the only hardware device is /hw/gsn/2, then the only network interface created is gsn2.
The ifconfig command searches the netif.options file for IP–over–GSN network interface names (for example, gsn0, gsn1, gsn2), associates each network interface with the hardware that is specified, then configures and enables each interface.
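As a sketch, a netif.options entry for a GSN interface might look like the following. The variable names and the interface slot number shown are assumptions based on the usual layout of /etc/config/netif.options; verify them against the comments in the file on your system.

```
if2name=gsn0
if2addr=gate2-$HOSTNAME
```

With an entry like this, the second logical network interface is bound to the gsn0 device and given the INET address that resolves for gate2-$HOSTNAME.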
ST requires that the endpoints and their associated resources be set up before any data movement can proceed, whereas IP operates on a store-and-forward basis: the IP endpoints and intermediate hosts provide resources, such as target buffers, dynamically. ST is connection-oriented, and the endpoints retain state information such as packet sequence numbers; IP is connectionless and does not guarantee sequential delivery of packets.
The logical IP subnets on GSN can be independent of the underlying GSN physical network. Refer to “Consequences and Examples”.
The following list summarizes notable differences between ST and IP; when ST borrows from IP (the INET address, routing protocol, ARP, and so on), it gains the network-layer functions that IP provides.

Network-layer functions provided by IP:
- network-layer routing within an LIS
- routing between LISs and inter-LIS forwarding
- multiple hop routing (more than one intermediate hardware device, such as a switch, concentrator, or hub, between endpoints)
- broadcasting to all members of an LIS (only if the physical layer supports this functionality)
- broadcasting to all members attached to a physical fabric (only if the physical layer supports this functionality)

Data handling between source and final destination:
- IP: store and forward; finds path and resources along the way
- ST: direct delivery from source to final destination; path and resources established and open before the data transfer starts
For SGI GSN release 1.0, only the copper-based medium is supported.
Flow control is a mechanism for preventing data loss caused by a source transmitting data faster than the destination can process it. Without flow control, the destination drops incoming data when it has no free memory in which to store it.
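The principle can be illustrated with a simplified credit-style model (a sketch only, not the HIPPI-6400 wire protocol): the source transmits only while the destination has free buffer space, stalling rather than dropping data when none is available.

```python
# Illustrative credit-based flow control (simplified model).
from collections import deque

def send_with_flow_control(data, dest_buffer_slots):
    """Deliver all items of `data` through a destination buffer of
    `dest_buffer_slots` entries. The source stalls (instead of the
    destination dropping data) whenever no buffer credit is available."""
    assert dest_buffer_slots > 0
    buffer = deque()
    delivered = []
    credits = dest_buffer_slots          # credits = free destination slots
    for item in data:
        while credits == 0:              # source stalls until a credit returns
            delivered.append(buffer.popleft())   # destination frees a slot
            credits += 1
        buffer.append(item)              # transmit one item, consume a credit
        credits -= 1
    delivered.extend(buffer)             # destination drains what remains
    return delivered

# Every item arrives, in order, regardless of how small the buffer is.
```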
For IRIX GSN, the Scheduled Transfer Protocol (ST) is an additional default stack; ST shares the INET address used by IP.