Chapter 1. Overview of IRIX GSN

This chapter provides an overview of IRIX GSN version 2.1.

Gigabyte System Network (GSN) is a full-duplex, error-free, flow-controlled communications protocol that simultaneously provides a full gigabyte per second (8 gigabits per second) of data transfer in each direction: 6.4 gigabits per second of user data plus 1.6 gigabits per second of control and HIPPI-6400 protocol information. Table 1-1 compares theoretical GSN data rates to the theoretical rates of other communications protocols.

Table 1-1. GSN Compared to Other Communication Technologies

Protocol             | Baud Rate              | Peak User Payload Rate[a] | Sustained User Payload Rate
---------------------|------------------------|---------------------------|----------------------------
GSN (copper)         | 500 MBaud on 20 lines  | 6.4 gigabits/sec.[b]      | 6.365 gigabits/sec.
Gigabit Ethernet     | 1250 MBaud on one line | 1.0 gigabit/sec.          | 0.924 gigabits/sec.
ATM OC12c over SONET | 622 MBaud on one line  | 0.622 gigabits/sec.       | 0.541 gigabits/sec.

[a] The peak rate is the rate at which the hardware's direct memory access (DMA) must operate when the hardware has only a small input queue.

[b] All rates are decimal, not binary (that is, they are base ten, not base two); for example, giga means 1,000,000,000.


SGI GSN Products

The following sections describe the SGI GSN products:

Components of Products

The GSN products offered by SGI consist of the following hardware and software components:

  • SGI GSN hardware: copper-based Gigabyte System Network (GSN, also known as HIPPI-6400 or SuperHIPPI) hardware for use in XIO slots.

  • IP–over–GSN driver (gsn#) included in IRIX GSN. This component is the interface between the GSN hardware and the Internet Protocol (IP) with its associated transport-layer protocols: TCP, UDP, ICMP, and so on. Requires IRIX 6.5.9m or 6.5.9f or later.

  • ST–over–GSN driver (gsn#) included in IRIX GSN. This component interfaces the GSN hardware to the Scheduled Transfer Protocol (ST). Requires IRIX 6.5.9f or later.

  • HARP (HIPPI Address Resolution Protocol) driver included in IRIX GSN. This component provides Internet-to-GSN hardware mapping service and interfaces to the HARP daemon. Requires IRIX 6.5.9m or 6.5.9f or later.

  • Address resolution protocol server (harpd daemon) and client functionality shipped with IRIX GSN. The dynamic HARP component handles HIPPI-6400 clients. IRIX HARP also supports static table lookup for handling HIPPI systems that do not support HARP.

  • IRIX sockets-based application programming interface (API) to the IP network stack (driver) for use by customers who want to develop or port applications to send/receive data through the IP–over–GSN subsystem. Available with IRIX 6.5.9m or 6.5.9f and subsequent versions.

  • IRIX sockets-based application programming interface (API) to the ST network stack (driver) for use by customers who want to develop or port applications to send/receive data through the ST–over–GSN subsystem. Available with IRIX 6.5.9f and subsequent versions.
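Because the IP-over-GSN subsystem is reached through the standard IRIX BSD sockets API, applications need no GSN-specific calls; they simply use an INET address that is assigned to a gsn# interface. The following minimal sketch shows a TCP sender of the kind that could run over the IP-over-GSN stack; the peer hostname ("gsn-peer") and port number (5001) are illustrative assumptions, not part of the product.

    /* Minimal TCP sender over the IP-over-GSN stack, using ordinary BSD
     * sockets. The hostname "gsn-peer" and port 5001 are illustrative;
     * any address that resolves to a remote gsn# interface would do. */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <netdb.h>
    #include <sys/socket.h>
    #include <netinet/in.h>

    int main(void)
    {
        struct hostent *hp = gethostbyname("gsn-peer"); /* peer's gsn# address */
        if (hp == NULL) {
            fprintf(stderr, "cannot resolve gsn-peer\n");
            return 1;
        }

        int s = socket(AF_INET, SOCK_STREAM, 0);        /* ordinary TCP socket */
        if (s < 0) {
            perror("socket");
            return 1;
        }

        struct sockaddr_in sin;
        memset(&sin, 0, sizeof sin);
        sin.sin_family = AF_INET;
        sin.sin_port = htons(5001);                     /* illustrative port */
        memcpy(&sin.sin_addr, hp->h_addr_list[0], hp->h_length);

        if (connect(s, (struct sockaddr *)&sin, sizeof sin) < 0) {
            perror("connect");
            return 1;
        }

        const char msg[] = "hello over IP-over-GSN\n";
        if (write(s, msg, sizeof msg - 1) < 0)          /* carried on gsn# */
            perror("write");

        close(s);
        return 0;
    }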

GSN Within IRIX Network Stacks

The SGI GSN hardware and IRIX GSN software support the following network stacks (illustrated in Figure 1-1):

  • IP-over-GSN: applications that use the standard IRIX interface (BSD sockets) to send/receive data using the IP suite of protocols.

  • ST-over-GSN: applications that use the IRIX GSN product's Scheduled Transfer (ST) programmatic interface to send/receive data over GSN. Applications that use this interface include the IRIX utilities shipped with the IRIX GSN product and customer-developed ST applications.

  • ARP for HIPPI/GSN (HARP): automatically resolves physical-layer HIPPI–6400 ULA addresses to and from network-layer addresses (IP and ST).


    Note: Each gsn# network interface services two main protocols: ST and IP. The INET address that the customer assigns to an instance of gsn# is shared by the ST-over-GSN and IP-over-GSN stacks. Some of the upper-layer address processing (for example, routing) that is performed on the address applies to both IP and ST traffic.


    Figure 1-1. IRIX GSN Modules Within OSI-style Network Protocol Stack


Standards Compliance

IRIX GSN complies with the following industry standards:

  • GSN (also called HIPPI–6400 or SuperHIPPI)

    • Information Technology - High-performance Parallel Interface - 6400 Mbit/s Physical Layer (HIPPI-6400-PH), ISO/IEC 11518-10, NCITS (ANSI) standard.

    • Information Technology - High-performance Parallel Interface - 6400 Mbit/s Switch Control (HIPPI-6400-SC), T11.1, Project 1231-D, Rev. 2.5, August 1998, working draft for NCITS (ANSI). Only those functions that apply to GSN endpoints are implemented.

  • ST-over-GSN

    • Information Technology - Scheduled Transfer Protocol (ST), T11.1, Project 1245-D, Rev. 2.6, December 1998, working draft for NCITS (ANSI).

  • IP-over-GSN

    • RFC 2067, IP over HIPPI

    • Other standard internet protocols provided with IRIX (IP versions 4 and 6, NFS versions 2 and 3, TCP, UDP, ICMP, and so on.)

  • IRIX HARP

    • RFC 2835, IP and ARP over HIPPI–6400, December 1998

      To obtain copies of the GSN and ST documents, see the Web site http://www.hippi.org, or contact the American National Standards Institute (ANSI) at 11 West 42nd Street, New York, New York 10036, telephone: 212-642-4900. For RFCs, see the Web site http://info.internet.isi.edu/in-notes/rfc.

GSN Product Names

The following strings are used to identify the GSN product:

  • Name for software image: gsn
    (for example, versions gsn or showprods gsn)

  • Hardware inventory name for each adapter:
    GSN 1-Port adapter and GSN 2-Port adapter

  • Name for each logical IP or ST network interface: gsn#
    (for example, ifconfig gsn0 up)

Compatibility Issues

IRIX GSN 2.1 requires IRIX 6.5.9m for TCP/UDP and 6.5.9f for full TCP/UDP/STP support. Use the versions command to verify the version of IRIX that is currently running on the system. The version number (indicated by the -n option) must be equal to or greater than the version shown in the following example:

% versions -n eoe 
I eoe 1275719131  IRIX Execution Environment, 6.5.9m

The SGI GSN hardware requires the system's HUB ASICs to be version 5. Use this command to verify the version of the HUB on each Node board:

% hinv -v | grep HUB
HUB in Module #/Slot 1: Revision 5 Speed 97.50 Mhz (enabled)
HUB in Module #/Slot 2: Revision 5 Speed 97.50 Mhz (enabled)
HUB in Module #/Slot 3: Revision 5 Speed 97.50 Mhz (enabled)
HUB in Module #/Slot 4: Revision 5 Speed 97.50 Mhz (enabled)

Overview of Protocols

The following sections provide an overview of the protocols that make up and interoperate with IRIX GSN. Figure 1-1 illustrates the GSN protocol stacks.

What is GSN?

Gigabyte System Network (GSN) is a set of ANSI standards (listed in “Standards Compliance”) that defines physical and data link layers for a very high-speed communications protocol. The GSN protocol is also known by two other names: HIPPI–6400 and SuperHIPPI. Throughout this document, the term GSN is used for this entire set of protocols, except when referring to an item from a specific ANSI standard, in which case the term from the ANSI document's title is used (for example, HIPPI–6400-PH micropacket).

GSN Terminology

The following terms have specific meanings when used within the context of GSN: 

Physical link
 

 One section of HIPPI–6400–PH cable (copper or fiber-optic) that connects two HIPPI–6400-PH elements. Each element can be either a switch or an endpoint. Each physical link is a full-duplex link composed of two simplex links; each simplex link carries data in only one direction; the two streams of data in the full-duplex link flow in opposite directions. The path (virtual connection) between an original point of transmission (the originating source) and a final point for reception (the final destination) can involve numerous physical links.

Element
 

 Any component of a HIPPI-6400 fabric or system that is able to receive, process, and send HIPPI-6400 Admin micropackets in a manner that conforms with the HIPPI-6400 standard. Each HIPPI-6400 element contains both a source and a destination. For example, the SuMAC chip in an SGI GSN product is a GSN element.

Source
 

 The transmitting element located at one end of a physical link. An upper-layer entity (host, network–layer interface, or program) that uses the GSN subsystem is sometimes loosely referred to as the source. However, it is more correct to call these software entities upper-layer protocols (that is, source ULPs). An “originating source” refers to the element that first transmitted a micropacket; an element that is retransmitting the micropacket (for example, a switch) is simply a source.

Destination
 

 The receiving element located at the other end of a physical link. An upper-layer entity (host, network–layer interface, or program) that receives communications through the GSN subsystem is sometimes loosely referred to as the destination. However, it is more correct to call these software entities upper-layer protocols (that is, destination ULPs). A “final destination” refers to the element that is the ultimate receiver for a micropacket; an element that receives, then retransmits a micropacket (for example, a switch) is simply a destination.

Endpoint
 

 A final destination or an originating source of GSN traffic. An endpoint may have only one GSN port. A single system may have many endpoints (for example, an Origin module with two SGI GSN products has two endpoints).

Switch
 

 A node that is located along the route between two endpoints. GSN traffic passes through the switch on its way to a destination endpoint. A switch must have at least two, and usually has more, GSN ports.

Fabric
 

 All the HIPPI nodes (switches, endpoint devices, extenders) that are physically interconnected and communicate using the same physical–layer protocol. One GSN fabric can be logically divided into multiple upper-layer address spaces (that is, networks). For example, a single GSN fabric can support multiple IP networks. Conversely, one logical network can include members from multiple HIPPI fabrics.

Hop count
 

 A number used in HIPPI–6400 Admin micropackets to specify the number of elements through which the micropacket should be forwarded. Each time a micropacket exits an element, the hop count is decremented by one. See “GSN Admin Micropackets” for further details.

GSN Overview

The GSN protocol provides 6.4 gigabits of user data per second from source to destination (in each direction) over either copper-based or fiber–optics-based physical media.[1] The protocol is point-to-point, full-duplex, and flow-controlled. It uses small fixed-size micropackets (illustrated in Figure 1-4 and Figure 1-6) and up to four interleaved logical datastreams (channels) per point–to–point connection.

GSN Physical Layer

Each physical link is composed of two simplex links that connect two HIPPI-6400 elements; data flows in only one direction on each simplex link. Both simplex links are required for a connection because control information for each datastream travels in the reverse direction (that is, along the other simplex link of the connection). This design provides a full-duplex connection between two endpoints.

The GSN data rate is stated as 6.4 gigabits of user data per second on each simplex link; however, each link physically carries a total of 8 gigabits (1 gigabyte) of data (user and control) every second. The following items describe the GSN bandwidth:

  • At the physical layer (that is, on the wire), GSN uses a dual-edged 250-million-cycle-per-second clock, which results in 500 million transmission events per second. Said another way, GSN operates at 500 MBaud.

  • For each baud, GSN transmits 16 bits of user and protocol data plus 4 bits of 4b/5b encoding information. This means that 20% of the total bandwidth is overhead for the encoding and, of the remaining bandwidth, another 20% is overhead for the HIPPI-6400 protocol (the control bits that accompany each micropacket). This results in a user bandwidth of 6.4 gigabits (6400 megabits) per second.

  • The available bandwidth for user data is 6400 megabits per second, which is 6.4 gigabits or 0.8 gigabytes per second in each direction.

    Table 1-2 summarizes the mathematical calculations:

    Table 1-2. GSN Bandwidth Calculations

    Item | Bandwidth | Calculation Details
    -----|-----------|--------------------
    Total physical signal carrying capacity | 10 GBaud | 20 simultaneous signals multiplied by 500 MBaud, which is 10 billion signals per second in each direction.
    Bandwidth available for protocols | 8.0 Gbits/s | Rate in row above, minus bandwidth used by 4b/5b encoding.
    Bandwidth available for users (that is, layers above the HIPPI-6400 layer) | 6.4 Gbits/s | Rate in row above, minus amount used by GSN control information. GSN control = 4 of the 20 bits (20% of 8 Gbits).
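The three rows of Table 1-2 amount to a short calculation; the sketch below simply restates them as arithmetic.

    /* Restates the Table 1-2 bandwidth derivation as arithmetic. */
    #include <stdio.h>

    int main(void)
    {
        double baud  = 500e6;           /* 500 MBaud per signal line */
        double lines = 20.0;            /* 20 simultaneous signals */
        double raw   = baud * lines;    /* 10 GBaud of raw signalling capacity */
        double proto = raw * 0.8;       /* 4b/5b encoding consumes 20%: 8 Gbit/s */
        double user  = proto * 0.8;     /* GSN control consumes another 20%: 6.4 Gbit/s */

        printf("raw %.1f Gbaud, protocol %.1f Gbit/s, user %.1f Gbit/s\n",
               raw / 1e9, proto / 1e9, user / 1e9);
        return 0;
    }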


GSN Virtual Channels

Each simplex link can carry up to four logical datastreams (virtual channels). These virtual channels are allocated for control traffic, low-latency traffic, and bulk traffic to avoid the latency/blocking issues that occur when only a single channel is attempting to handle both bulk and interactive traffic.

Each virtual channel is commonly implemented as a queue; micropackets are selected alternately from the active queues and placed onto the physical link in an interleaved fashion, as illustrated in Figure 1-2. Not all four channels need to be active on every connection. All the micropackets belonging to a single GSN Message always travel through the same channel, even when the message traverses switches along its way to the final destination. The restrictions for the data that can be carried on each channel are described in Table 1-3. 

Table 1-3. Data Restrictions for Each GSN Virtual Channel

Virtual Channel | Description
----------------|------------
0 | Carries GSN Messages that do not exceed 68 micropackets of TYPE data (about 2176 bytes of upper-layer data). For ST–over–GSN traffic, ST data channel 0 maps to this GSN channel; all ST control operations (for example, Request_To_Send and Clear_To_Send) travel on this virtual channel.
1 | Carries GSN Messages that do not exceed 4100 micropackets of TYPE data (about 128 Kbytes of upper-layer data) and Admin micropackets in which the COMMAND field specifies a request or a command (that is, not a response). IP–over–GSN traffic is carried on this VC. For ST–over–GSN traffic, ST data channel 1 maps to this GSN channel.
2 | Carries GSN Messages that do not exceed 4100 micropackets of TYPE data (about 128 Kbytes of upper-layer data) and Admin micropackets in which the COMMAND field specifies a response. For ST–over–GSN traffic, ST data channel 2 maps to this GSN channel.
3 | Carries GSN Messages that do not exceed 134,217,728 micropackets of TYPE data (about 4 Gbytes of upper-layer data). This channel requires that the final destination endpoint agree to accept this Message via a flow–controlled protocol such as Scheduled Transfer. For ST–over–GSN traffic, ST data channel 3 maps to this GSN channel.

Figure 1-2. GSN Micropackets from Virtual Channels Interleaved in Datastream


GSN Micropacket

The micropacket is the basic protocol data unit for GSN. Each GSN micropacket is 32-bytes of data accompanied by 8 bytes (64 bits) of control information. The TYPE field within the control bits indicates the format and purpose of the micropacket's 32 bytes of data. The VC field determines which virtual channel carries the micropacket. Some of the control bits that accompany a 32-byte chunk of data refer to that chunk of data (for example, the VC and TYPE fields), and some bits refer to the datastream traveling in the opposite direction on the other physical link (for example, the credits in the CR field that allow the reader/receiver of the control bits to transmit more data for its own datastream). Figure 1-3 illustrates the control bits and Table 1-4 describes them. Table 1-5 summarizes the different TYPEs of GSN micropackets.

Figure 1-3. GSN Micropacket Control Bits


Table 1-4. GSN Micropacket Control Bits

Name of Field | Number of Bits in Field | Description | Applies to Data in Which Link
--------------|-------------------------|-------------|------------------------------
VC   | 2  | Virtual channel selector for this micropacket (binary values): 00=channel_0; 01=channel_1; 10=channel_2; 11=channel_3. | This one
TYPE | 4  | Type of micropacket; see Table 1-5. | This one
T    | 1  | Tail: 0=more micropackets follow to complete this GSN Message; 1=this is the last micropacket for this Message. | This one
E    | 1  | Error: 0=this GSN Message is OK so far; 1=an unrecoverable error was detected for this Message. | This one
VCR  | 2  | Virtual channel for which the credits (in CR field) apply. | Other
CR   | 6  | Credits: number of credits the source (that is, the receiver of these control bits) can add to the data transfer on the virtual channel indicated in the VCR field. (See “GSN Flow Control” for further explanation.) | Other
RSEQ | 8  | Reception sequence number: acknowledgment for the highest-received sequence number (TSEQ) for data micropackets on the other link. | Other
TSEQ | 8  | Transmission sequence number: the sequence number associated with this micropacket. | This one
ECRC | 16 | End-to-end checksum: checksum for all data bytes of the GSN Message, up to and including the bytes in this micropacket. This checksum is verified by the final destination. | This one
LCRC | 16 | Link checksum: checksum for the 32 bytes of data and the first 48 bits of control information in this micropacket. This checksum is verified by each GSN element at the end of a link. | This one
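The 64 control bits of Table 1-4 can be pictured as a packed structure. The following sketch is only illustrative: it lists the fields in table order using C bit-fields, and makes no claim about the actual bit ordering on the wire or inside the SuMAC hardware.

    /* Illustrative layout of the 64 control bits that accompany each 32-byte
     * GSN micropacket (field names and widths from Table 1-4). The ordering
     * here is an assumption, not the on-the-wire layout. */
    #include <stdint.h>

    struct gsn_control_bits {
        uint64_t vc   : 2;   /* virtual channel selector for this micropacket */
        uint64_t type : 4;   /* micropacket TYPE (see Table 1-5) */
        uint64_t t    : 1;   /* tail: 1 = last micropacket of this Message */
        uint64_t e    : 1;   /* error: 1 = unrecoverable error for this Message */
        uint64_t vcr  : 2;   /* channel to which the CR credits apply (other link) */
        uint64_t cr   : 6;   /* credits granted to the other link's source */
        uint64_t rseq : 8;   /* highest TSEQ received on the other link */
        uint64_t tseq : 8;   /* sequence number of this micropacket */
        uint64_t ecrc : 16;  /* end-to-end checksum, verified by final destination */
        uint64_t lcrc : 16;  /* link checksum, verified at each link end */
    };                       /* field widths sum to 64 bits */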

Most of the GSN micropacket TYPEs are related to control and management of the GSN link. Only three TYPEs of micropackets are passed to the upper layers: Admin, Header, and Data micropackets. The Admin micropacket (illustrated in Figure 1-4) is used by upper-layer GSN administrative programs to manage and configure a GSN fabric; hence, the Admin micropacket is defined by the Switch Control ANSI standard (HIPPI-6400-SC). The Header and Data micropackets are used to create GSN Messages (illustrated in Figure 1-6) that carry user-level data. 

Table 1-5. Types of GSN Micropackets

Type (Name) | Type (Hexadecimal) | Description of the Micropacket | Supported by IRIX GSN Hardware?
------------|--------------------|--------------------------------|--------------------------------
Reset | 2 | Causes the receiving HIPPI-6400-PH device to reset the local link (that is, the physical link between this sender and the device at the other end of the physical link). | Y
Reset_Ack | 3 | Acknowledges that the Reset micropacket was received and that the HIPPI-6400-PH link reset was completed. | Y
Initialize | 4 | Causes the receiving HIPPI-6400 device to reinitialize. | Y
Initialize_Ack | 5 | Acknowledges that the Initialize micropacket was received and that the HIPPI-6400-PH initialization procedure was completed. | Y
Reserved | 6 | Not applicable (NA) | NA
Null | 7 | Contains no data in the 32-byte data area; there may be valid information in the control bits. This type is transmitted only when there is nothing else to transmit; it keeps the physical link active/alive. | Y
Data | 8 | Contains data for a GSN (HIPPI-6400) Message (illustrated in Figure 1-6). | Y
Header | 9 | Contains the header information for a GSN (HIPPI-6400) Message (illustrated in Figure 1-6). | Y
Credit-only | A | Contains only valid credits (VCR and CR fields of the control bits) that allow the transmitter to send more data. The micropacket contains no data in the 32-byte data area. This type is transmitted only when there are no Admin, Header, or Data micropackets awaiting transmission. | Y
Reserved | B-E | NA | NA
Admin | F | Used for administering GSN switches and endpoints. The format for the Admin micropacket is defined by the HIPPI–6400–SC standard. A number of functions (commands) are supported, including: ping another GSN device, request the ULA of a remote GSN device, and set up broadcast capability for a GSN fabric. | Y

One of the functions of the Admin micropacket is to allow each switch on a GSN fabric to discover the fabric's physical configuration and each endpoint to discover the universal LAN MAC address (ULA) that its switch has assigned to it. This functionality is not available on every GSN product; however, when it is implemented, it works as follows:

  • Upon starting, an endpoint transmits an Admin micropacket that asks the device at the other end of the link to identify its function (for example, whether it is an endpoint or a switch). If the device is a switch, the endpoint asks for an assigned ULA; if the device is another endpoint, the local endpoint uses its locally assigned ULA (which might be stored in the hardware's PROM).

  • Upon starting, a switch transmits Admin micropackets that ask other devices for their functions (for example, whether each is a switch or an endpoint). The switch sends one such request to each hop (successive hardware device) down each of its links until an endpoint is reached. Upon discovering an endpoint or a switch, it uses Admin micropackets to exchange ULA information with that device. As it receives responses to these Admin requests, the switch constructs a map (spanning tree) of its fabric. Once this map has been constructed, a micropacket destined for a known endpoint (that is, any endpoint discovered within that fabric) can be delivered.


    Note: This fabric discovery scheme does not solve the problem of how each endpoint comes to know the ULA for the other endpoints with which it wants to communicate. That problem can be solved by an upper-layer address resolution mechanism (for example, HARP or another network–layer address resolution mechanism). For details, see “Address Resolution for GSN”.


    Figure 1-4. GSN Admin Micropacket


GSN Flow Control

A GSN destination (receiving) endpoint controls the flow of micropackets by periodically releasing credits to the source.[2] Each credit represents memory at the destination for one GSN micropacket. Each credit gives the source permission to send one additional micropacket on a specific channel. The destination gives credits to the source in the control bits (CR and VCR bits) that accompany the destination's own micropackets. Note that the credits travel in the opposite direction from the data, as illustrated in Figure 1-5, and can accompany micropackets traveling on any of the GSN virtual channels for the connection.
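A minimal sketch of this credit accounting, from the source's side: one counter per virtual channel, decremented for each micropacket sent and increased by whatever credits arrive in the CR/VCR fields of the opposite datastream. The structure and function names are illustrative, not driver code.

    /* Illustrative credit accounting for GSN flow control: one counter per
     * virtual channel on a connection. Names are hypothetical. */
    #include <stdbool.h>

    struct vc_credit_state {
        unsigned credits[4];              /* micropackets we may still send, per VC */
    };

    /* Control bits arrived on the opposite simplex link: the destination
     * grants 'cr' more credits for virtual channel 'vcr'. */
    static void credit_grant(struct vc_credit_state *s, unsigned vcr, unsigned cr)
    {
        s->credits[vcr & 3] += cr;
    }

    /* Called before transmitting one micropacket on channel 'vc': consumes a
     * credit if the destination has advertised buffer space, else refuses. */
    static bool credit_consume(struct vc_credit_state *s, unsigned vc)
    {
        if (s->credits[vc & 3] == 0)
            return false;                 /* must wait for more credits */
        s->credits[vc & 3]--;
        return true;
    }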

Figure 1-5. GSN Flow Control


GSN Message

The GSN Message is the basic data transfer unit between source and final destination endpoints. Each Message is composed of one initial Header micropacket followed by zero or more Data micropackets (illustrated in Figure 1-6). The micropackets of a Message are sequentially ordered and all travel over the same virtual channel using the same originating source (S_ULA value) and final destination (D_ULA value). The last micropacket in a Message has a bit set (the TAIL flag) to indicate that the Message is complete. Figure 1-6 illustrates a complete GSN Message.

Figure 1-6. GSN Message Composed of Header and Data Micropackets


When the GSN Header micropacket is carrying an IP datagram (EtherType=0x0800), the 8 bytes of payload in the Header micropacket are the first 8 bytes of the IP header. (Note that the 8 bytes immediately preceding the Payload are an 802.2 SNAP header.) When the GSN Header micropacket is carrying an ST transfer (EtherType=0x8181), the payload bytes in the Header micropacket are the initial 8 bytes of the ST Header.
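The EtherType carried in the 802.2 SNAP header of the Header micropacket is what lets a receiving driver steer each GSN Message to the correct upper layer. A minimal dispatch sketch, assuming the EtherType has already been extracted from the SNAP header (the handler names are hypothetical):

    /* Steer a received GSN Message by the EtherType found in its SNAP header.
     * ip_input() and st_input() are hypothetical placeholders. */
    #include <stdint.h>

    #define ETHERTYPE_IP 0x0800   /* Message carries an IP datagram */
    #define ETHERTYPE_ST 0x8181   /* Message carries an ST transfer */

    extern void ip_input(const void *msg, unsigned len);   /* hypothetical */
    extern void st_input(const void *msg, unsigned len);   /* hypothetical */

    void gsn_dispatch(uint16_t ethertype, const void *msg, unsigned len)
    {
        switch (ethertype) {
        case ETHERTYPE_IP:
            ip_input(msg, len);   /* hand the Message to the IP-over-GSN stack */
            break;
        case ETHERTYPE_ST:
            st_input(msg, len);   /* hand the Message to the ST-over-GSN stack */
            break;
        default:
            break;                /* unknown encapsulation: drop */
        }
    }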

GSN Admin Micropackets

Every HIPPI element is capable of processing GSN (HIPPI–6400-SC) Admin micropackets. These micropackets configure elements, discover the fabric topology, and maintain the elements of a GSN fabric. The TYPE field of the control bits (illustrated in Figure 1-3) indicates that a micropacket is of the Admin type. Admin micropackets have the format illustrated in Figure 1-4.

Most HIPPI-6400 elements have two ports: one leading toward the fabric and the other leading toward the host/core. For example, a link end element (such as the SuMAC ASIC) has one port connected to a physical link/the fabric and the other port connected to additional GSN logic (which may be another local element) on an adapter board. Notice that a GSN system may contain more than one element; this fact is important in understanding the processing of Admin micropackets.

An Admin micropacket can enter an element through either port, as illustrated in Figure 1-7. Each Admin micropacket is either processed and responded to or forwarded to the next element through the element's other port, as illustrated in Figure 1-7. A response to an Admin micropacket always exits the element through the same port by which the original Admin micropacket arrived.

Figure 1-7. Dual-port HIPPI–6400-PH Elements


The hop count field in the Admin micropacket determines when the Admin packet is acted upon/processed. The count indicates the number of elements (hops) through which the Admin micropacket is propagated/forwarded before it is processed. As long as the hop count is greater than zero, the receiving element decrements the hop count by one and transmits the Admin micropacket out the element's other port (which leads to another element), as illustrated in Figure 1-8. When the count is zero, the receiving element processes the micropacket and responds, as illustrated in Figure 1-9. Figure 1-10 through Figure 1-12 show examples of various hop count values and the manner in which hop count determines which element acts on and responds to the micropacket.
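The hop-count rule reduces to a few lines of code: forward and decrement while the count is greater than zero, otherwise process the micropacket and send the response back out the arrival port. The function names below are illustrative placeholders.

    /* Illustrative hop-count handling for an Admin micropacket arriving at a
     * two-port HIPPI-6400 element. forward(), process(), and respond() are
     * hypothetical placeholders. */
    struct admin_upkt { unsigned hop_count; /* ... other Admin fields ... */ };

    extern void forward(struct admin_upkt *p, int port);
    extern int  process(struct admin_upkt *p);
    extern void respond(int result, int port);

    void admin_receive(struct admin_upkt *p, int arrival_port)
    {
        int other_port = (arrival_port == 0) ? 1 : 0;

        if (p->hop_count > 0) {
            p->hop_count--;                     /* one hop consumed */
            forward(p, other_port);             /* pass it to the next element */
        } else {
            respond(process(p), arrival_port);  /* act on it; the answer always
                                                   exits the port it arrived on */
        }
    }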

Table 1-6 lists the administrative commands that are available with Admin micropackets.

Figure 1-8. Hop Count >0 Indicates Forward Admin Micropacket


Figure 1-9. Hop Count =0 Indicates Process Admin Micropacket


Figure 1-10. Hop Count Example: hop_count = 0


Figure 1-11. Hop Count Example: hop_count = 1


Figure 1-12. Hop Count Example: hop_count = 2


Table 1-6. GSN Admin Micropacket Commands

Admin Command | Description | Required (R) or Optional (O) for Switches and Endpoints
--------------|-------------|--------------------------------------------------------
Ping | Are you there? | O
*_response | Yes I am here (and functioning). | R
Set_element_address | Here is your “element address”. | O
*_response | Status (for example, I have started using the assigned address). | O
Reset | Initialize yourself. | O
Exchange_element_function | I am a <switch/endpoint> element. What are you? | R
*_response | I am a <switch/endpoint> element. | R
ULA_request | Assign me a ULA. | R
*_response | Here is your ULA. | R for switches
Read_register | Give me the data from this Admin register. | O
*_response | Here is the data you requested. | O
Write_register | Put this data into this Admin register. | O
*_response | Status (for example, the data has been written). | O
Invalid_command | I received an invalid/unrecognized/unsupported Admin micropacket. | R
ULA_list_request | Give me a list of all the ULAs connected to you. | O
*_response | Here is the list. | R for switches
Port_remap | For all traffic containing the specified ULA, change the route (output port) to a specified (new) port ID. | O
*_response | Status. | R for switches
Port_map_request | Give me the port ID that I must use to contact the specified ULA. | O
*_response | Here is the port ID. | R for switches


What is ST?

Scheduled Transfer (ST) is an upper-layer protocol that can be implemented to operate over a number of physical–layer subsystems, including GSN, ATM, FDDI, and Ethernet. This section describes the main characteristics of the ST protocol. For the sake of introduction and ease of understanding, many of the less important functional details of ST are not covered in this description. Refer to the ANSI standard (listed in the section “Overview of Protocols”) for complete details.

ST Overview

The most salient feature of ST is that it prepares both endpoints for the data movement before any data is transmitted. The first step in the preparation is to create a condition (state) called a virtual connection or VC (described in “ST Connection Setup Sequence”). The second step is a handshake that allocates memory for the data movement and exposes this memory to the other endpoint (described in “ST Data Movement Sequences Including Memory Allocation”). There are two kinds of the memory–allocation handshake: one provides memory that is used once (described in “Single-use Memory Data Movements”); the other provides memory that is used many times until released (described in “Persistent Memory Data Movements”). The two endpoints exchange ST control operations to accomplish these prearrangements. Only after these prearrangements are complete can the first data movement begin; the data movement is performed with ST data operations.

ST Terminology

The following terms have specific meanings within the context of ST:

operation 

The ST protocol data unit. It is composed of a 40–byte header and variable–length data ranging from 0 bits to 4 gigabits (illustrated in Figure 1-13). Each ST operation is transmitted as one GSN Message, as illustrated in Figure 1-13.

sequence 

A series of operations that occur in a specific order and accomplish an ST protocol task.

initiator 

The ST endpoint that sends the first operation within an ST sequence. The endpoint that acts as initiator during one sequence (for example, the connection setup) can act as the responder in a subsequent sequence (for example, the data movement).

responder 

The other (not the initiator) ST endpoint participating in an ST sequence.

slot 

Memory at an ST destination that is reserved for holding one incoming ST Header.

ST Operations

The Operation is the basic protocol data unit for ST. Each ST Operation is carried within a single GSN Message, composed of two or more HIPPI–6400 micropackets, as illustrated in Figure 1-13.

Figure 1-13. ST Operation


ST operations (listed in Table 1-7) are commonly grouped into the following categories:

  • Connection management operations: used to set up and tear down a VC

  • Control operations: used to manage a VC (for example, status or flow control)

  • Data operation: used to transmit ST payload (upper-layer data) and/or data checksum during data movement sequences

Table 1-7. ST Operations

Name of Operation | Acronym | Category | Sequence in Which Operation is Used | Description
------------------|---------|----------|-------------------------------------|------------
Request_Connection | RC | connection management | Setup | Requests that a VC be created. Issued by any endpoint. First operation of setup sequence.
Connection_Answer | CA | connection management | Setup | Response to RC. Accepts (creates VC) or rejects the RC. Second (and last) operation of setup sequence.
Request_Disconnect | RD | connection management | Teardown | Indicates that sender (initiator) is tearing down the VC. Issued by either endpoint of VC. First operation of teardown sequence.
Disconnect_Answer | DA | connection management | Teardown | Response to RD. Indicates that the sender (responder) is tearing down the VC. Second operation of teardown sequence.
Disconnect_Complete | DC | connection management | Teardown | Response to DA. Indicates sender (initiator) has finished tearing down VC. Third (and last) operation of teardown sequence.
Request_Memory_Region | RMR | control | Data Movement: Persistent | Requests that responder expose memory. First operation of persistent memory sequence.
Memory_Region_Available | MRA | control | Data Movement: Persistent | Response to RMR. Exposes responder's memory to initiator.
Get | GET | control | Data Movement: Persistent | Issuer (initiator) is destination for the data movement. Exposes initiator's memory to receive the requested data. Data comes from source's exposed persistent memory region. RMR/MRA handshake must have occurred.
FetchOp | FETCHOP | control | Data Movement: Persistent | Issuer (initiator) is destination for the data movement. Exposes initiator's memory to receive the requested data. Data comes from source's exposed persistent memory region. RMR/MRA handshake must have occurred.
FetchOp_Complete | FC | control | Data Movement: Persistent | Response to FETCHOP.
Request_To_Send | RTS | control | Data Movement: Single-use | Issued by the source (the initiator for a write, or the responder for a read). Indicates issuer is ready to transmit data; asks responder to expose single-use memory. First operation of write sequence.
Request_To_Receive | RTR | control | Data Movement: Single-use | First operation for a read sequence. Indicates issuer is ready to receive data. Issuer becomes the initiator of the read sequence.
Clear_To_Send | CTS | control | Data Movement: Single-use | Response to RTS. Gives source permission to transmit one block of data. Exposes single-use memory for that data.
Data | DATA | data | Data Movement | Carries ST payload and/or checksum; used in every data movement sequence. Sent by data source, which can be either initiator or responder within the data movement sequence.
Request_Answer | RA | control | Data Movement | Response to an RTS, RTR, RMR, GET, or FETCHOP. Accepts, rejects, or pauses the request to which it is responding.
Request_State | RS | control | Status | Requests VC status information. Issued by either endpoint.
Request_State_Response | RSR | control | Status | Communicates VC state information. Response to either an RS operation or a DATA operation in which the Send_state flag (within the ST Header) is set.
End | END | control | Abort Data Movement | Terminates an in-progress data movement (read/write transfer or a persistent memory region) by causing the allocated memory to be released; leaves VC open. Issued by either endpoint.
End_Ack | EA | control | Abort Data Movement | Response to END. Indicates responder has aborted the associated data movement.


ST Header

The ST Header (illustrated in Figure 1-14) carries the information that implements the ST protocol features. Some of the parameters that are communicated within the ST Header are:

  • Type of operation (listed in Table 1-7)

  • Data channel through which this operation travels, which, for ST–over–GSN, maps directly to GSN virtual channels (summarized in Table 1-3)

  • Number of memory spaces (slots for holding ST Headers) that are currently available at each endpoint for this data channel (that is, VC)

  • Port values for initiator and responder within each VC

  • Key values for initiator and responder within each VC

  • Length of the data to be moved from one endpoint to the other

  • Block number for use in tracking progress, managing flow control and resource allocation, and performing striping within a data movement

  • Memory address (buffer index and offset) to use for the data movement

  • Checksum for the operation

  • Identification numbers for tracking and sequencing operations: DATA operations, FETCHOP operations, GET operations, and REQUEST_STATE_RESPONSE operations within each VC

The following are some of the endpoint behaviors that can be controlled by the operation's ST Header:

  • Whether or not the destination for a data movement supports reception of out-of-order Blocks

  • Whether or not the operation's ST Header should be delivered to the destination's upper-layer protocol (ULP)

  • Whether or not the destination ULP should be interrupted when this operation arrives

  • Request status information from the endpoint receiving this ST Header

  • Inform initiator that responder is rejecting a request

  • Pause the transmission during a data movement

    Figure 1-14. ST Header

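As an aid to reading the two lists above, the following structure gathers the ST Header parameters into one place. The field names, types, and widths are illustrative assumptions only; the authoritative layout is defined by the ANSI ST standard and shown in Figure 1-14.

    /* Illustrative collection of the ST Header parameters listed above.
     * Field names and types are assumptions made for readability; see the
     * ANSI ST standard and Figure 1-14 for the real layout. */
    #include <stdint.h>

    struct st_header_params {
        uint8_t  opcode;         /* type of operation (Table 1-7) */
        uint8_t  data_channel;   /* DC; maps to a GSN virtual channel (Table 1-3) */
        uint8_t  slots;          /* slots currently available for this channel */
        uint16_t i_port, r_port; /* initiator and responder ST port values */
        uint32_t i_key, r_key;   /* key values assigned by each endpoint */
        uint64_t length;         /* length of the data to be moved */
        uint32_t block_number;   /* progress, flow control, striping */
        uint32_t buffer_index;   /* memory address for the data movement ... */
        uint32_t offset;         /* ... buffer index plus offset */
        uint32_t checksum;       /* checksum for the operation */
        uint32_t op_id;          /* ID for tracking DATA/FETCHOP/GET/RSR operations */
    };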

ST Sequences

ST defines sequences of operations for accomplishing various tasks; these sequences are described in the subsections that follow.

Each ST sequence allows the two endpoints to exchange a set of control parameters and information. The parameters are carried in the ST Header (illustrated in Figure 1-14). Each type of operation uses the Header fields differently and exchanges a different set of parameters.

ST Connection Setup Sequence

Before any ST data can be exchanged, a Virtual Connection (VC) must be set up between the initiator and the responder. Upon successful completion of this exchange, each endpoint will have stored a set of parameters associated with the VC and will have set aside some resources for exclusive use by this VC. Three of the stored parameters are used (as a tuplet) for identifying/validating operations that arrive to the VC. The verification tuplet consists of: the remote endpoint's ST port number, the local endpoint's ST port number, and the key value that the local endpoint has assigned to this VC. Figure 1-15 illustrates how these identification parameters are set up.


Note: The initiator for the connection setup sequence is the endpoint that sends the first control operation for the sequence (that is, the Request_Connection).

The connection setup sequence consists of two control operations: a Request_Connection sent by the initiator, followed by a Connection_Answer sent by the responder. Figure 1-15 and Figure 1-16 illustrate different subsets of the information exchanged in one successful connection setup sequence. Figure 1-17 illustrates a connection setup sequence in which the responder refuses to create the VC.

The ST connection setup sequence negotiates and sets the following parameters and resources that remain in effect for the duration of the VC:

  • I_Port and R_Port
    ST port value on which endpoint (initiator and responder) wants to receive all communication associated with this VC.

  • I_Key and R_Key
    Locally unique identification number (key) for use in verifying and identifying this VC. Each endpoint gives the other endpoint a key, which the other simply echoes back in each communication; the key means nothing to the remote end and is only “unique” at the endpoint where it was assigned.

  • I_Bufsize and R_Bufsize
    Size of the buffers used by each endpoint for data it receives on this VC.

  • I_Slots and R_Slots
    Initial number of “slots” available at each endpoint. Each slot indicates memory that has been set aside for storing ST headers that are received on this VC. Each slot normally consists of one 40-byte data structure.

  • CTS_req
    Number of Clear_to_Sends that the source would like to have outstanding (available) at all times during the data movement.

  • I_MaxSTU and R_MaxSTU
    Maximum size STU that each endpoint is willing to receive. The other endpoint must respect this size when transmitting on this VC.

  • EtherType
    Identity of the protocol being encapsulated (carried) within the ST Messages on this VC. For example, for IP datagrams, the EtherType is 0x0800; when the ST Messages carry user data that is not enclosed in any additional protocol, the EtherType is 0x0000. The initiator specifies this parameter.

    Figure 1-15. ST Connection Setup Sequence: Identification Parameters Only


    Figure 1-16. ST Connection Setup Sequence: VC Parameters Only


    Figure 1-17. ST Connection Setup Sequence: Rejection

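A sketch of the verification tuplet described above: after setup, every operation that arrives for the VC can be checked against the stored remote port, local port, and locally assigned key. The structure and function names are illustrative.

    /* Illustrative check of the VC verification tuplet established during
     * connection setup (Figure 1-15). Names are hypothetical. */
    #include <stdbool.h>
    #include <stdint.h>

    struct st_vc {
        uint16_t remote_port;   /* ST port number of the remote endpoint */
        uint16_t local_port;    /* ST port number of this endpoint */
        uint32_t local_key;     /* key this endpoint assigned to the VC */
    };

    /* An incoming operation is accepted only if the sender echoes back our
     * key and uses the port values negotiated during setup. */
    static bool st_vc_matches(const struct st_vc *vc,
                              uint16_t src_port, uint16_t dst_port, uint32_t key)
    {
        return src_port == vc->remote_port &&
               dst_port == vc->local_port &&
               key      == vc->local_key;
    }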

ST Connection Teardown Sequence

When an endpoint no longer wants a VC, it initiates the connection teardown sequence illustrated in Figure 1-18. This sequence is not used to terminate data movements. (See “ST Termination Sequence for a Data Movement”.)

Figure 1-18. ST Connection Teardown Sequence


ST Data Movement Sequences Including Memory Allocation

This section describes ST data movement sequences. Each ST data movement sends upper-layer (user) data from one endpoint (the source) to one other endpoint (that is, one final destination). The entire data transfer is controlled by the VC parameters negotiated during one ST connection setup procedure (described in “ST Connection Setup Sequence”) or renegotiated during the data movement. The setup sequence must be completed before any data movement sequence is initiated.

The data movement sequences consist of two to five operations, exchanged between the VC's two endpoints (the memory-allocation handshake), followed by one or more data operations. There are five different data movement sequences, as summarized in Table 1-8. The initiator controls which sequence is used, depending on the type of memory it wants to have allocated, the type of functionality it desires for the data movement, and the role it wants to assume in the transfer.

The memory-allocation handshake allocates one of two types of memory for receipt of the data: persistent memory, which is used repeatedly until released, or single-use memory, in which each buffer is used only once.

Table 1-8 summarizes the five data movement sequences and indicates where each sequence is illustrated:  

Table 1-8. Data Movement Sequences

                                  | Persistent Memory           | Single-use Memory
----------------------------------|-----------------------------|------------------
Initiator wants to be source      | Figure 1-20                 | Figure 1-24
Initiator wants to be destination | Figure 1-21 and Figure 1-22 | Figure 1-25

 



Note: Within a data movement sequence, the initiator is the endpoint that sends the first control operation for the sequence (for example, Request_to_Send or Request_Memory_Region), regardless of whether it operates as the data transmitter (source) or receiver (destination).

Table 1-9 summarizes the data size ranges for each type of data movement. As illustrated in Figure 1-19, the data is first chunked into one or more Blocks; the maximum size for a Block is negotiated during the memory allocation handshake. Each Block is divided into one or more scheduled transfer units (STU; the data for one data operation); the maximum size for the STU was negotiated during the connection setup sequence. Any ST data movement that is larger than the VC's maximum STU size requires multiple data operations. Each STU (that is, each data operation) is transmitted as one GSN Message. The flow-control mechanism for user data (described in “ST Flow Control Sequences”) operates at the Block level. 

Table 1-9. Data Sizes Possible for Data Movements

Data Movement Type | Minimum Length for Data Movement Sequence | Maximum Length for Data Movement Sequence
-------------------|-------------------------------------------|------------------------------------------
Single-use Memory: Write | 1 byte | 2^64 minus 1 byte, or unlimited
Single-use Memory: Read | 1 byte | 2^64 minus 1 byte, or unlimited
Persistent Memory: each Put | 1 byte | 2^48 minus 1 byte, or the VC's max_STU (one Block)
Persistent Memory: each Get | 1 byte | 2^16 minus 1 byte, or the VC's max_STU (one Block)
Persistent Memory: each FetchOp | 8 bytes | 8 bytes (one Block)

Figure 1-19. Data Handling for ST Data Movements


Persistent Memory Data Movements

The persistent memory sequences consist of a few control operations (the memory–allocation handshake) followed by any number of Put, Get, and/or FetchOp sequences. The persistent memory handshake allocates one or more memory regions at the responding endpoint. These regions are then used multiple times; each buffer within each region is used over and over during the life of the virtual connection. When properly used, this method provides permanent, low–latency delivery, in which an unlimited number of transfers can be performed with no intervening overhead. There is an important caveat: the low latency on this type of data transfer depends on the speed at which the memory can be made available for the next use. This type of transfer works best for small (or fixed-size) data and for applications for which the transmission rate is well understood, so that the memory can be sized in a manner that allows it to be recycled within an acceptable period of time. It is the responsibility of the upper-layer applications to manage flow control and prevent precipitous overwriting of the memory region.

Once a persistent memory region has been allocated at the responder endpoint, the initiator can move data in or out of it in three manners:

  • Put sequence (illustrated in Figure 1-20) 
    One data operation (STU) that writes any portion of or the entire persistent memory region at the responder. This sequence can be repeated over and over with no intervening operations.

  • Get sequence (illustrated in Figure 1-21) 
    A GET control operation to expose memory at the initiator for receiving the requested data, followed by any number of data operations. Each data operation moves a portion or all of the data from the responder's allocated memory into the initiator's memory. Multiple GETs can be outstanding (occurring) simultaneously to different or shared portions of the persistent memory region.

  • FetchOp sequence (illustrated in Figure 1-22 and Figure 1-23) 
    A FETCHOP control operation to expose memory at the initiator for receiving the retrieved data and to specify the desired function (increment, decrement, or clear). Then, a single data operation (one STU) that moves one 64-bit Block of data from the responder's memory into the initiator's memory. When the data arrives successfully at the initiator, the initiator issues a completion control message, at which point the responder performs the specified function on its own copy of the data. If the completion does not arrive within a timeout period, the responder retransmits the data. Note that, unlike PUT and GET, this data movement sequence is atomic.

A persistent memory region is terminated (released) with an End operation, as described in “ST Termination Sequence for a Data Movement”.

Figure 1-20. ST Data Movement Sequence: Persistent Memory—Put


Figure 1-21. ST Data Movement Sequence: Persistent Memory—Get


Figure 1-22. ST Data Movement Sequence: Persistent Memory—FetchOp


Figure 1-23. Example of FetchOp


Single-use Memory Data Movements

The single–use memory movement sequence consists of a few control operations (the memory-allocation handshake) that allocate memory at the destination endpoint, followed by one or more data operations for a specified amount of data. The data transfer uses the destination's allocated memory once; each buffer is used only once during the life of the transfer. This method allows high-bandwidth delivery after an initial delay for the allocation of resources: the transfer provides for a limited number of back–to–back writes or reads with no intervening overhead. This method is efficient for large, variable-length data.

A data transfer can be aborted (terminated before all the data has been transferred) with an End operation, as described in “ST Termination Sequence for a Data Movement”.

Figure 1-24 illustrates the data transfer sequence used when the initiator is the data source. Figure 1-25 illustrates the sequence used when the initiator is the data destination. Each illustration includes the memory allocation handshake.

Figure 1-24. ST Data Movement Sequence: Single-use Memory with Initiator as Source


Figure 1-25. ST Data Movement Sequence: Single-use Memory with Initiator as Destination


ST Flow Control Sequences

Flow control operates differently for data transfers and ST operations. Each is explained below.

Data Transfer Flow Control 

ST endpoints implement strict flow control for all data transfers done to single-use memory. For this purpose they use the Request_To_Send (RTS) and Clear_To_Send (CTS) control operations. There can be multiple CTSs generated in response to one RTS, as explained below and summarized in Table 1-10.

The ST flow control sequence regulates both the number of data transfer events that occur between the two endpoints and the size of these events. Before any data is transferred, the data transmitter (source) generates an RTS, in which it specifies the maximum size block of data and the number of blocks that it wants to send right now. The specified (requested) size and number do not oblige the receiver to give permission for that size or number; these are only suggestions that, if followed, could make the transfer more efficient.

The data receiver (destination) generates one or more CTSs in response to each RTS. In each CTS, the receiver gives the source permission to transmit one block of data; the number of CTSs issued by the receiver cannot exceed the number of “requested blocks” specified in the RTS. In the first CTS for the data movement, the receiver indicates the block size that it is willing to receive during this data movement; the block size must be no larger than the maximum block size specified in the associated RTS. Before issuing each CTS, the receiver must allocate the amount of memory specified by the block size in that CTS. See Figure 1-24 and Figure 1-25 for illustrations of the flow control sequence.


Note: ST does not use flow control for persistent memory data movements: Put, Get, and FetchOp.


Table 1-10. ST Flow Control Sequence

Transfer Event Parameter | Source Specifies | Destination Specifies
-------------------------|------------------|----------------------
Number of events | In RTS: number of blocks the source would like to send at this time. Limitations: none. | In CTS: with each CTS, the destination gives the source permission to transmit 1 block of data. Limitations: the destination must not issue more CTSs than the source has requested in its RTS, and must allocate memory for each CTS it generates.
Size of each event | In RTS: requested maximum block size for transfer events associated with this RTS. Limitations: none. (Note: when the source does the actual data transfer, the size is not controlled by the RTS maximum block size; it is limited by the block size specified in the CTS.) | In first CTS: block size that will be used for these transfer events. Limitations: the block size must not exceed the maximum size specified in the associated RTS.
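A minimal sketch of the destination's side of these rules, assuming single-use memory: it may issue at most as many CTSs as the RTS requested blocks, and it must allocate one block-sized buffer before each CTS. All names are illustrative.

    /* Illustrative destination-side RTS/CTS accounting for single-use memory
     * transfers (rules from Table 1-10). Names are hypothetical. */
    #include <stdlib.h>

    struct xfer_state {
        unsigned cts_remaining;   /* CTSs we may still issue for this RTS */
        size_t   block_size;      /* block size granted in the first CTS */
    };

    /* On receipt of an RTS: remember how many blocks the source wants to send
     * and choose a block size no larger than the source's requested maximum. */
    void on_rts(struct xfer_state *x, unsigned requested_blocks, size_t max_block)
    {
        x->cts_remaining = requested_blocks;   /* upper bound on CTSs */
        x->block_size    = max_block;          /* a smaller size is also legal */
    }

    /* Issue one CTS: allocate memory for exactly one block first, and never
     * exceed the number of blocks requested in the RTS. Returns the buffer
     * backing the CTS, or NULL if no further CTS may be sent. */
    void *issue_cts(struct xfer_state *x)
    {
        if (x->cts_remaining == 0)
            return NULL;
        void *buf = malloc(x->block_size);     /* memory backing this CTS */
        if (buf != NULL)
            x->cts_remaining--;
        return buf;
    }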


Operation Flow Control 

Flow control for the ST Headers of ST operations is managed with a mechanism called slot allocation. Each slot represents memory that has been allocated at an endpoint to hold one incoming ST Header while it awaits processing. All incoming ST Headers use one slot, except Request_Connection operations and Data operations that have the Silent flag set.


Note: Data operations with the Silent flag set do not occupy a slot because the ST Header for these operations is not passed to the receiving endpoint (and hence is not stored). The Request_Connection operation does not occupy a slot because the VC does not yet exist when this operation arrives. An implementation may have a queue of slots associated with Port 0 (the port at which the Request_Connection arrives), but the queue is not required because the only consequence of the endpoint dropping the request is that the initiator tries again until it succeeds.

During the setup sequence for a VC, each endpoint communicates to the other endpoint the number of slots it has allocated for that VC. Updates for slot availability are communicated during normal operation with Request_State_Response operations. (See “ST Status Sequences” for details.) Each source keeps track of the number of outstanding operations (that is, slot-consuming ST Headers that it sends) and makes sure that it does not send more operations than the destination can handle.
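From the sending side, slot accounting amounts to a simple counter of the slots the destination has advertised for the VC, decremented for each slot-consuming ST Header sent and refreshed when the destination reports availability. A sketch with illustrative names:

    /* Illustrative source-side slot accounting for ST operation flow control.
     * Silent DATA operations and Request_Connection are exempt because they
     * do not occupy a slot at the destination. Names are hypothetical. */
    #include <stdbool.h>

    struct slot_state {
        unsigned slots_free;      /* slots the destination currently advertises */
    };

    /* Called before sending a slot-consuming ST Header. */
    bool slot_reserve(struct slot_state *s)
    {
        if (s->slots_free == 0)
            return false;         /* destination cannot hold another ST Header */
        s->slots_free--;
        return true;
    }

    /* Called when a Request_State_Response (or the setup answer) reports how
     * many slots are currently available for this VC. */
    void slot_update(struct slot_state *s, unsigned advertised)
    {
        s->slots_free = advertised;
    }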

ST Status Sequences

During normal operation, the endpoints for a VC can use either of two status sequences (illustrated in Figure 1-26 and Figure 1-27) to obtain information from the other endpoint about its state and status.

Figure 1-26. Status Sequence Using Request_State


Figure 1-27. Status Sequence Using S Flag in ST Header


The information that can be exchanged with this mechanism includes:

  • number of currently available slots for this VC

  • highest Block received for a data movement

  • reception status for a specific Block

ST Termination Sequence for a Data Movement

The following data movements do not have a natural ending:

  • a persistent memory region

  • a data transfer of unlimited size

To terminate either of the above data movements and release the associated resources, either endpoint initiates the termination sequence illustrated in Figure 1-28. In addition, this sequence can be used to abort a data transfer of specific length before all the data has been transferred.

Figure 1-28. Termination Sequence


Example of ST Virtual Connections and GSN Channels

GSN virtual channels are designed to carry specific sizes of data (see Table 1-3). The various ST data channels (DCs) that exist within ST virtual connections (VCs) can take advantage of these sized GSN channels. The IRIX ST–over–GSN stack routes any ST operation with DC=0 to GSN channel 0, DC=1 to GSN channel 1, and so on. For example, each ST application (that is, each ST port) is required to have one data channel (DC_0) for its control operations and one or more other channels (DCs 1, 2, and/or 3) for its data operations. Note that each GSN channel is shared by many VCs; for example, DC_0 of every ST VC shares GSN channel 0. Figure 1-29 shows an example of ST VCs using their data channels (DC values) to make effective use of the four GSN channels.

Figure 1-29. Example of ST Virtual Connections Using Multiple GSN Virtual Channels


GSN Fabrics and Logical Networks

This section explains how logical networks are created on GSN and HIPPI fabrics. The discussion assumes that you have a thorough understanding of the concept of a logical network, the format of INET addresses, and the use of subnet masks to divide a single INET network address space into smaller networks, called Logical IP Subnets (LISs).


Note: For complete details on INET address subnetting and the netmask, see the comments in the /etc/config/ifconfig.options file, the man page for inet(7F), the man page for ifconfig(1M), and the online IRIS InSight document IRIX Admin: Networking and Mail.

There are three basic concepts that underlie the discussion in this section. Each is discussed in more detail in subsequent sections:

Basic Concept #1
 

  The hosts connected to a GSN or HIPPI fabric do not have to function as one logical network whose addresses all come from one address space.

Basic Concept #2
 

  A LIS (one address space) can include hosts from physically different GSN and/or HIPPI fabrics, as long as there is a bridging switch between the fabrics.

Basic Concept #3
 

  Within a GSN or HIPPI fabric, direct communication (without use of an intermediate router) between INET hosts can occur only when (1) the network interfaces involved in the exchange have addresses that come from the same logical address space (for example, they are members of the same LIS), and (2) both hosts have access to an address resolution mechanism.

Basic Concept #1

The hosts connected to a GSN or HIPPI fabric do not have to function as one network address space. The hosts can be organized into smaller groupings (for example, based on function, project, or hardware manufacturer). Each grouping of hosts is a separate logical network or a LIS. Each LIS is assigned a sequence of network-layer addresses (that is, a unique address space). Figure 1-31 illustrates this concept.

A group's address space can be the complete range of addresses for an INET network address (192.0.2.0 to 192.0.2.255), or it can be a portion of the range (for example, subnet 192.0.2.0 to 192.0.2.31). Membership in a group is determined for each GSN network interface (for example, each gsn#) by the INET address associated with the interface (in the /etc/config/netif.options file) and the netmask value (in the ifconfig.options or the ifconfig-#.options file). The netmask value defines the size of the address space for each group. For example, a netmask value of 0xFFFFFF00 creates an address range that provides 256 individual host addresses. However, netmask value 0xFFFFFFE0 (shown in Figure 1-30) creates eight LISs in which each LIS can have up to 32 “host” addresses.
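Membership in a group can be checked mechanically: two interfaces belong to the same LIS exactly when their INET addresses agree under the netmask. A small sketch using the 0xFFFFFFE0 mask from Figure 1-30; the specific host addresses are illustrative.

    /* Two interfaces are in the same LIS when their INET addresses match
     * under the netmask (0xFFFFFFE0 gives eight LISs of 32 addresses each). */
    #include <stdio.h>
    #include <stdint.h>

    static int same_lis(uint32_t a, uint32_t b, uint32_t netmask)
    {
        return (a & netmask) == (b & netmask);
    }

    int main(void)
    {
        uint32_t netmask = 0xFFFFFFE0;
        uint32_t host_a  = 0xC0000202;   /* 192.0.2.2,  in LIS_1 (192.0.2.0-31)  */
        uint32_t host_b  = 0xC0000210;   /* 192.0.2.16, in LIS_1                 */
        uint32_t host_c  = 0xC0000222;   /* 192.0.2.34, in LIS_2 (192.0.2.32-63) */

        printf("A and B in same LIS? %d\n", same_lis(host_a, host_b, netmask)); /* 1 */
        printf("A and C in same LIS? %d\n", same_lis(host_a, host_c, netmask)); /* 0 */
        return 0;
    }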

Basic Concept #2

A logical network or a LIS can include hosts from physically different GSN and HIPPI fabrics, as long as there is a “bridging” communication path between the fabrics. Hosts that are members of the same INET address space (thus benefitting from the services provided by broadcast and routing) do not have to be physically attached to the same physical medium (fabric). Figure 1-32 illustrates this concept.

Basic Concept #3

Direct communication between INET hosts (without use of an intermediate router) can occur only when the network interfaces involved in the exchange are members of the same logical address space (network or LIS). Contact with members outside one's own LIS requires use of an INET address router.

This rule is true even when a shared hardware connection (for example, a switch) exists between the two hosts that belong to different LISs. For example, for two hosts attached to the same switch, a message from host A in LIS 1, if sent to host B in LIS 2, must go through host C, an INET router. The benefit is that, no matter where a GSN network adapter is physically located or relocated, it continues to function as a member of the same LIS. Notice that no address or LIS–membership change is required when an endpoint is physically relocated.

The following facts explain why this concept exists:

  • GSN switches do not resolve network–layer (INET) addresses.

  • The local INET routing software (for example, the IRIX routed daemon) does not maintain complete paths to destinations that are not members of the same LIS.

  • Before transmission of an IP packet, a GSN hardware address (ULA) must be discovered for the destination. This step requires the services of a HARP server.

  • Each HARP server maintains mappings only for its own LIS. (However, in the IRIX implementation, a single HARP daemon can act as a HARP server for multiple LISs at the same time.)

Consequences and Examples

The basic concepts summarized in “GSN Fabrics and Logical Networks” make the examples described in this section possible.

Figure 1-31 and Figure 1-32 show examples of subnetting within two different GSN fabric configurations. The LIS addressing used in these examples (summarized in Figure 1-30) is identical. The examples use network INET address 192.0.2, so that each host address is 192.0.2.xxx. Hosts in LIS_1 use addresses between 192.0.2.0 and 192.0.2.31; those in LIS_2 use addresses between 192.0.2.32 and 192.0.2.63, and so on.

Figure 1-30. Subnet Mask for Examples


If you want a single-fabric site to have multiple address spaces, you can use multiple INET network addresses, or you can use a netmask to divide a single INET address space into smaller chunks (referred to as LISs). Likewise, in a multiple–fabric site, you can group all the hosts into one logical address space, or into multiple LISs regardless of each host's location.

Figure 1-31 illustrates a GSN fabric that has one switch to which all the network interfaces are attached (that is, all endpoints in this fabric have a direct physical link to one another). The example shows two LISs. Communication from A in LIS_1 to C in LIS_2 passes through the router (network interfaces J and H). Messages do not go directly from endpoint A to C, because of the concept explained in “Basic Concept #3”.

Figure 1-31. Single-switch GSN Fabric with LISs


Figure 1-32 illustrates a different configuration for the same address space and network interfaces (“hosts”) used in Figure 1-31. This configuration is a two-switch fabric. In this example, A, B, E, J, K, and L belong to LIS_1, while C, D, F, G, and H belong to LIS_2. The system with network interfaces H and J continues to perform as the router between the two LISs. Just as in the first example (Figure 1-31), communication directed to C in LIS_2 from A in LIS_1, goes first to the router (J/H), even though both A and C are physically attached to the same switch. But, most importantly, notice that the router has been moved to a different switch, and yet, the INET addressing is identical to that used in the first configuration. The hardware changes do not affect the addressing. Also note that a router for an LIS does not need to share a switch with the members of its LISs, as illustrated by router J in relation to hosts A and B and router H in relation to hosts C and D.

Figure 1-32. Multiple-switch GSN Fabric with LISs


Figure 1-33. LIS Membership That Spans Fabrics


Address Resolution for GSN

This section describes how network-layer (OSI layer 3) addresses are mapped (resolved) to physical-layer (OSI layer 1) addresses in a GSN fabric. This section assumes that you are familiar with the standard Internet ARP (RFC 826, Ethernet Address Resolution Protocol) and Inverse ARP (RFC 2390, Inverse Address Resolution Protocol) protocols.

When a network–layer address is locally associated with (configured to) an IRIX GSN or IRIS HIPPI subsystem, address mapping is needed between network–layer addresses and physical-layer addresses so that communication can occur between the local network-layer entity and remote network-layer entities. The GSN/HIPPI physical address is known as the Universal LAN MAC Address or ULA. For IRIX, the default network protocol stack is the Internet Protocol and the network address is the INET address.[3] The address resolution scheme for IP/ST–over–GSN is defined by RFC 2835, IP and ARP over HIPPI–6400, as described in the section “HARP Address Resolution”.


Note: Each INET address (AF_INET) can support multiple protocols. For example, in IRIX 6.5, INET addresses support both the IP suite of protocols (PF_INET) and the ST protocol (PF_ST). For further details, see the man page for inet(7).

To transmit data to another network-layer entity within the GSN fabric, each network-layer stack in the GSN fabric needs two addresses for each destination:

  • The network-layer address for the destination host. In IRIX, this information is supplied by the static “hosts” database or the dynamic NIS server (see the sketch that follows this list).

  • The physical-layer address for the destination endpoint. This information is supplied by the static HARP table or the dynamic HARP server. See “HARP Address Resolution” for details.
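
For example, the first of these two lookups can be performed with the standard gethostbyname() routine, as in the minimal sketch below. The host name hostA is hypothetical, and the ULA lookup performed by HARP for the second item is not shown:

#include <stdio.h>
#include <string.h>
#include <netdb.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int
main(void)
{
    struct hostent *hp = gethostbyname("hostA");   /* hosts database or NIS lookup */
    struct in_addr  in;

    if (hp == NULL || hp->h_addr_list[0] == NULL) {
        fprintf(stderr, "hostA: no INET address found\n");
        return 1;
    }
    memcpy(&in, hp->h_addr_list[0], sizeof(in));

    /* The physical-layer ULA for this address is resolved separately by HARP. */
    printf("INET address for hostA: %s\n", inet_ntoa(in));
    return 0;
}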

HARP and Broadcast Support

A GSN fabric is said to support broadcasting when all of the switches in that fabric provide broadcasting. The behavior of HARP clients and HARP servers differs depending on whether the underlying GSN/HIPPI fabric supports broadcasting.

No Broadcast Support

When the fabric does not support broadcasting, at least one host behaves as a HARP server for each defined LIS. All other hosts on the LIS are HARP clients.

HARP servers act as centralized repositories for IP-to-ULA mappings. As each host on an LIS initializes its GSN interface, it registers its IP-to-ULA mapping with each of the LIS's servers. The servers save this mapping information internally. When a host needs to communicate with another host via GSN, it queries the HARP server for the destination host's IP-to-ULA mapping. A HARP client will typically save mapping information it has received from the HARP server in its own local cache for faster subsequent mappings.

Address mappings are not permanent, so HARP clients must reregister with all HARP servers periodically. If HARP clients wish to locally maintain a cache of address mappings for other hosts, they must periodically validate these mappings with a HARP server as well.

Broadcast Support

When broadcast is supported by all switches in the fabric, there are no HARP servers. HARP's behavior is almost identical to standard ARP: when a host needs to perform an IP-to-ULA mapping, it broadcasts an ARP request using the broadcast ULA (FF:FF:FF:FF:FF:FF). The host that owns the requested IP address recognizes its own address in the request packet and sends a reply containing its ULA to the requestor. All other hosts ignore the request.
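
The decision that each host makes when such a broadcast request arrives is simple, as the following sketch shows. The structure layout and the reply_with_ula() helper are hypothetical stand-ins for illustration; they are not the RFC 2835 wire format or an IRIX interface:

#include <stdio.h>

/* Hypothetical, simplified view of a received mapping request. */
struct map_request {
    unsigned long requested_ip;   /* the IP address the sender wants mapped */
    unsigned long sender_ip;
};

static const unsigned long my_ip = 0xC0000211UL;   /* 192.0.2.17, a hypothetical local address */

/* Stand-in for "send our ULA back to the requestor". */
static void
reply_with_ula(unsigned long to_ip)
{
    printf("sending our ULA to 0x%08lx\n", to_ip);
}

/* Every host sees every request sent to FF:FF:FF:FF:FF:FF; only the host
 * that owns the requested IP address answers. */
static void
on_broadcast_request(const struct map_request *req)
{
    if (req->requested_ip == my_ip)
        reply_with_ula(req->sender_ip);
    /* all other hosts simply ignore the request */
}

int
main(void)
{
    struct map_request req = { 0xC0000211UL, 0xC0000220UL };

    on_broadcast_request(&req);
    return 0;
}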

 

HARP Address Resolution

The address resolution protocol for HIPPI networks is specified in the HARP RFCs. The protocol works with fabrics that provide broadcasting and with those that do not. One of the first tasks of each HARP client is to determine if its underlying fabric supports broadcasting, as described in “Determining Fabric Support for Broadcast”.

HARP provides a dynamic, client/server-based address resolution service. The protocol makes it possible for each IP/ST-over-HIPPI endpoint (client) within a network to register or communicate its own INET address and ULA, and to discover the ULAs for hosts with whom it wants to communicate. The HARP server maintains a kernel-resident lookup table that maps INET addresses to ULAs. HARP occurs in two phases: a registration phase (summarized in the section “HARP Registration Phase”) and a normal operation phase (summarized in “HARP Normal Operation Phase”).

When an LIS includes one or more endpoints that do not support dynamic HARP, static mappings for those endpoints must be added to the address resolution table at the HARP server (as described in the section “HARP Normal Operation Phase”).

Determining Fabric Support for Broadcast

A host determines whether it is on a broadcast- or nonbroadcast-capable LIS during its initialization phase by sending a request for its own address to the broadcast ULA (FF:FF:FF:FF:FF:FF). If the underlying fabric is a broadcast medium, the sending host will receive a copy of this packet, as will every other host on the LIS. If it does not receive this packet, the host is not on a broadcast medium. (To ensure that a single lost packet does not result in the host being brought up in the wrong mode, a host may send multiple self-identification packets during the initialization phase.)

If a host discovers that it is on a broadcast fabric, the HARP registration phase described in the following section is skipped (because there are no HARP servers with which to register), and HARP immediately enters the operational phase, described in “HARP Normal Operation Phase”.
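
The probe logic amounts to sending a few requests for one's own address to the broadcast ULA and checking whether any copy comes back. The sketch below shows only that control flow; send_self_probe() and heard_own_probe() are hypothetical stand-ins for the transmit and receive steps, not IRIX interfaces:

#include <stdio.h>

#define PROBE_ATTEMPTS 3   /* several probes guard against a single lost packet */

/* Hypothetical stand-ins: transmit a request for our own address to
 * FF:FF:FF:FF:FF:FF, and check whether a copy of it was received. */
static void send_self_probe(void) { }
static int  heard_own_probe(void) { return 0; }

/* Returns 1 if the LIS is broadcast-capable, 0 if it is not. */
static int
fabric_supports_broadcast(void)
{
    int i;

    for (i = 0; i < PROBE_ATTEMPTS; i++) {
        send_self_probe();
        if (heard_own_probe())
            return 1;        /* we saw our own packet: broadcast medium */
    }
    return 0;                /* nothing came back: nonbroadcast medium */
}

int
main(void)
{
    if (fabric_supports_broadcast())
        printf("broadcast fabric: skip registration, enter operational phase\n");
    else
        printf("nonbroadcast fabric: register with the HARP servers first\n");
    return 0;
}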

HARP Registration Phase

During initialization of each GSN device, a HARP client on a non-broadcast medium will register its address pair (INET address and ULA) with each HARP server on its LIS. This is done by transmitting an InARP request to each HARP server. The InARP request contains the IP-to-ULA mapping of the client, and it requests the IP-to-ULA mapping of the server in reply. (Since InARP requests are sent to ULAs, each client must know the ULAs for all servers on the LIS. For IRIX, this information is contained in the configuration file /etc/config/harpd.options. For details, see “Edit harpd.options File” in Chapter 2.)

Because HARP clients can be brought up before HARP servers, a client might not receive replies to all (or any) of the InARP requests that it transmits. For each nonresponding HARP server, a HARP client will periodically retransmit the InARP request.

When at least one HARP server has responded with an InARP reply, the HARP client gains the ability to resolve unknown IP-to-ULA mappings on the LIS; the client then transitions from the registration phase to the operational phase.
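
The registration phase is essentially a retry loop over the configured server ULAs, as sketched below. NSERVERS, send_inarp_request(), and got_inarp_reply() are hypothetical stand-ins for illustration (the real server list comes from /etc/config/harpd.options); they are not IRIX interfaces:

#include <stdio.h>

#define NSERVERS 2   /* number of HARP server ULAs configured for the LIS */

/* Hypothetical stand-ins: send an InARP request carrying our own
 * IP-to-ULA mapping to server i, and check whether server i replied. */
static void send_inarp_request(int i) { printf("InARP request to server %d\n", i); }
static int  got_inarp_reply(int i)    { (void)i; return 0; }

int
main(void)
{
    int i, replies = 0;

    /* One retransmission round; a real client repeats this periodically
     * for every server that has not yet answered. */
    for (i = 0; i < NSERVERS; i++) {
        send_inarp_request(i);
        if (got_inarp_reply(i))
            replies++;
    }

    if (replies > 0)
        printf("registered with %d server(s): enter operational phase\n", replies);
    else
        printf("no replies yet: remain in registration phase and retry\n");
    return 0;
}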

Figure 1-34. HARP Registration


HARP Normal Operation Phase

The client enters HARP's operational phase under one of the following circumstances:

  • When a host determines that its GSN interface is connected to a broadcast-capable medium

  • When a host on a nonbroadcast-capable medium has successfully registered with a HARP server

In the operational phase, a host can resolve IP-to-ULA mappings that it does not have in its local HARP table by issuing ARP requests. On broadcast-capable media, these requests are transmitted to the broadcast ULA (FF:FF:FF:FF:FF:FF); on nonbroadcast-capable media, these requests are transmitted to a HARP server.

When a host receives an ARP reply, it places the reply's IP-to-ULA mapping into its local mapping table for subsequent mappings of this address.

While in the operational phase, all HARP clients on nonbroadcast media must periodically reregister their own IP-to-ULA mappings. This reregistration is accomplished by sending either an ARP request or an InARP request to each HARP server for the LIS. Since, according to the HARP protocol, servers "forget" about clients they have not heard from in 20 minutes, this reregistration must occur at shorter intervals. In IRIX, by default, this reregistration occurs every 15 minutes.

If a HARP server does not respond to the reregistration request, the HARP client must assume that the server is no longer functioning and cannot be used as a target for mapping requests. If no server is responding to a HARP client's reregistration requests, the client must fall back to the HARP registration phase.

HARP clients must also revalidate or remove from their local mapping table all entries that are more than 15 minutes old. Clients revalidate by sending ARP requests to the server or (on broadcast media) directly to the hosts whose mapping entry is to expire. An entry that has been revalidated is valid for another 15 minutes. If no reply (or a NAK reply) is received for an ARP request, the address for which the request was sent must be considered unmappable and is removed from the local mapping table.
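
The client-side timing described in the two preceding paragraphs can be summarized in a short sketch. Only the 15-minute interval comes from this discussion; the entry layout and the revalidate() helper are hypothetical, not the IRIX implementation:

#include <stdio.h>
#include <time.h>

#define REVALIDATE_SECS (15 * 60)   /* entries older than 15 minutes must be revalidated */

/* Hypothetical local mapping-table entry: one IP-to-ULA pair plus the
 * time it was last registered or revalidated. */
struct harp_entry {
    unsigned long ip;
    unsigned char ula[6];
    time_t        last_confirmed;
};

/* Stand-in for "send an ARP or InARP request for this entry and wait":
 * returns 1 if the mapping was confirmed, 0 on no reply or a NAK reply. */
static int revalidate(struct harp_entry *e) { (void)e; return 0; }

/* Returns 1 if the entry stays in the table, 0 if it must be removed. */
static int
refresh_entry(struct harp_entry *e, time_t now)
{
    if (now - e->last_confirmed < REVALIDATE_SECS)
        return 1;                   /* still fresh */
    if (revalidate(e)) {
        e->last_confirmed = now;    /* valid for another 15 minutes */
        return 1;
    }
    return 0;                       /* unmappable: remove from the local table */
}

int
main(void)
{
    struct harp_entry e = { 0xC0000212UL, { 0x08, 0x00, 0x69, 0x00, 0x00, 0x01 }, 0 };

    printf("entry %s\n", refresh_entry(&e, time(NULL)) ? "kept" : "removed");
    return 0;
}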

Static Address Resolution

When a host within a HIPPI/GSN LIS does not support dynamic HARP, the system administrator needs to add a static entry for that host to each HARP client (for broadcast-capable networks) or to each HARP server's database (for nonbroadcast-capable networks). Static entry definitions can be placed in the HARP daemon configuration file (/etc/config/harpd.options), or they can be added manually to the mapping database by using the gsnarp utility. Each entry in the database must map a ULA (IEEE or MAC address) to an INET address.

Guidelines for Selecting a HARP Server

These guidelines explain how to select a system to provide HARP services (that is, be the HARP server) when the HIPPI fabric does not support broadcasting. It is not necessary to identify a system for this purpose when the fabric supports broadcasting.

From among the members of the LIS, at least one system must be chosen to be a HARP server. For redundancy, at least two systems should be selected for this purpose. (When no HARP servers are available on an LIS, no HARP address resolution can occur, so the only members of the LIS that are able to intercommunicate are hosts whose HARP entries are statically defined.)

To ensure that every HARP server's database contains a complete mapping for all registered hosts, all hosts in an LIS must identify the same systems as HARP servers.

How Address Resolution Works for ST–over–GSN

The IRIX GSN implementation of the ST protocol uses the same address resolution scheme as is used for IP–over–GSN. See “Address Resolution for GSN” for the details.


Note: Each gsn# network interface services two protocols: ST and IP. The INET address assigned to an instance of gsn# is shared by the ST-over-GSN and IP-over-GSN stacks. Some of the upper-layer address processing (for example, routing) that is performed on the address applies to both IP and ST traffic.


IRIX HARP Table

The HARP table is a list of address mappings. Each entry (mapping) consists of an IP address and a GSN ULA. Each entry is either a dynamic entry or a static entry, as explained below.

Static Entries

These entries are loaded from the harpd configuration file when the harpd daemon is initialized (by default, /etc/config/harpd.options; for details, see “Edit harpd.options File” in Chapter 2). Alternatively, the administrator can add static entries individually with the gsnarp -s command and remove them with gsnarp -d.

Dynamic Entries

IRIX HARP maintains the dynamic entries in its HARP table in conformance with the HARP standard. It adds entries as it learns about them, refreshes them as they are reregistered by their owners (the clients), and ages and deletes entries as they go stale.

Assignment of Unit Numbers and Network Interfaces to GSN Hardware

The description in this section applies to systems running IRIX 6.5.9f (or later) and to network interfaces for the Internet Protocol suite (IP–over–GSN) and the Scheduled Transfer (ST–over–GSN) protocol.

Assignment of Unit Numbers to Hardware

With each restart (for example, a power-on, a reboot, or an init 0 command), the startup routine probes for hardware on all the modules connected into the CrayLink interconnect fabric. All the slots and links in all the modules within the fabric are probed. The routine then creates a hierarchical filesystem, called the hardware graph, that lists all the located hardware. The top of the hardware graph is visible at /hw. For complete details, see the man page for hwgraph(4). After the hardware graph is completed, the ioconfig program assigns a unit number to each located device that needs a number. Other programs (for example, hinv and each device's driver) read this assigned number and use it.

The XIO slots are searched (probed for a device) in the order shown below; this order is not the same sequence as the XIO slot numbering. For example, the device in XIO slot 4 is located before the device in slot 2 and, because of this, may have a lower unit number than the device in slot 2. After the first power on, you can edit the /etc/ioconfig.conf file to assign unit numbers that are convenient for you. Your changes are used during each subsequent power on. See the ioconfig(1M) man page for further details.

  1. slot 8

  2. slot 11

  3. slot 10

  4. slot 7

  5. slot 12

  6. slot 9

  7. slot 4

  8. slot 2

  9. slot 6

  10. slot 5

  11. slot 3

On an initial system startup, ioconfig groups devices into classes/types and assigns hardware unit numbers sequentially within each class. It records these assignments in the /etc/ioconfig.conf file; for example, if two SGI GSN products are found, they are numbered unit 0 (gsn0) for the first one found and unit 1 (gsn1) for the second one. When an SGI GSN product is a two-board solution, both boards are associated with a single unit number. On subsequent startups, ioconfig distinguishes between hardware that it has seen before and new items. To previously seen items, it assigns the same hardware unit numbers (those that are recorded in the ioconfig.conf file). To new hardware, it assigns new sequential numbers and records them. It never reassigns a number, even if the device that had the number is removed and leaves a gap in the numbering. For example, in a system with two instances of some class of device, if unit0 is removed, the next restart results in the system listing only unit1; if a new board is then installed in a new location, it is listed as unit2.

New items are differentiated from previously seen items through the hardware graph listing (that is, the path under /hw/module/#/slot/io#/...). The database of previously seen devices is kept in the file /etc/ioconfig.conf. A replacement board (with the exact same hardware device name) that is installed into the location of an old board (so that it has the same hardware graph listing) is assigned the old board's unit number, but a board that is moved from one location to another is assigned a new number. For example, in a two-device system with ioconfig.conf entries illustrated below, if unit0 is moved to a different slot, the next restart results in a new item in the ioconfig.conf file. The hinv command lists unit1 (an original board in its original slot) and unit2 (the board that has been moved to a new slot), but not unit0. For more information about the hardware graph and ioconfig, see the man pages for hwgraph(4) and ioconfig(1M).

Initial entries for two devices: 
0 /hw/module/1/slot/io8/xio_gsn/device
1 /hw/module/1/slot/io4/xio_gsn/device
0 /hw/gsn/0
1 /hw/gsn/1
Entries after unit0 is moved: 
0 /hw/module/1/slot/io8/xio_gsn/device
1 /hw/module/1/slot/io4/xio_gsn/device
2 /hw/module/1/slot/io5/xio_gsn/device
1 /hw/gsn/1
2 /hw/gsn/2
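
As noted earlier in this section, you can edit /etc/ioconfig.conf so that the unit numbers suit you. For example, a hypothetical edit of the initial two-device entries shown above, making the slot io4 device unit 0 and the slot io8 device unit 1, would begin: 
0 /hw/module/1/slot/io4/xio_gsn/device
1 /hw/module/1/slot/io8/xio_gsn/device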

The two-board SGI GSN product occupies two XIO slots that are logically associated with a single device (one unit number). The device has two XIO slots and two hardware graph entries. All links (for example, the short or convenience path, /hw/gsn/#) point to the XIO slot for the main SGI GSN board. All located SGI GSN hardware devices can be displayed with the /sbin/hinv or find command.

Assignment of Network Interface to Hardware Device

As the startup process continues, it calls the network drivers and protocol software modules so that they can create their network and programmatic interfaces. For GSN, this step works in the following manner:

  • For each located SGI GSN device (port), the startup process creates short (/hw/gsn/#) and long (/hw/module/#/slot/io#/xio_gsn) entries in the hardware graph. Then, the initialization scripts create a symbolic link in /dev that points to the device's entry in the hardware graph.

  • For each located GSN hardware device, the startup routine creates an entry in the hardware inventory database that can be displayed by hinv.

  • For each located hardware device, the IRIX GSN driver creates a logical network interface and assigns it a number that matches the hardware. For example, if the only hardware device is /hw/gsn/2, then the only network interface created is gsn2.

  • The ifconfig command searches the netif.options file for IP–over–GSN network interface names (for example, gsn0, gsn1, gsn2), associates each network interface with the hardware that is specified, then configures and enables each interface, as illustrated by the hypothetical fragment that follows this list.
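
For example, a hypothetical netif.options fragment that names a GSN interface might look like the following. The variable names and values are assumptions shown for illustration only; follow the comments in your /etc/config/netif.options file and the instructions in Chapter 2: 
if2name=gsn0
if2addr=hostA-gsn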

Comparison of ST to IP

ST requires that the endpoints and their associated resources be set up before any data movement can proceed, whereas IP acts on a store-and-forward basis: IP endpoints and intermediate hosts dynamically provide resources such as target buffers. ST is connection-oriented, and the endpoints retain state information such as packet sequence numbers; IP is a connectionless protocol that does not guarantee sequential delivery of packets.

The logical IP subnets on GSN can be independent of the underlying GSN physical network. Refer to “Consequences and Examples”.

The table below lists notable differences between ST and IP.

Table 1-11. ST vs IP

Capability                                            IP     ST     ST Borrowing From IP[a]

network-layer routing within an LIS                   yes    no     yes

routing between LISs and inter-LIS forwarding         yes    no     yes

multiple-hop routing (more than one intermediate
hardware device, such as a switch, concentrator,
or hub, between endpoints)                            yes    no     no

broadcasting to all members of an LIS                 yes    no     yes

broadcasting to all members attached to a physical
fabric                                                only if the physical layer supports this
                                                      functionality (applies to all three columns)

encapsulation                                         yes    no     no

data handling between source and final destination:
    IP:                        store and forward; finds path/resources along the way
    ST:                        direct delivery from source to final destination; path/resources
                               established and open before data transfer starts
    ST borrowing from IP:      direct delivery; path/resources established and open before
                               data transfer starts

[a] ST when it is borrowing from IP (INET address, routing protocol, ARP, etc.).




[1] For SGI GSN release 1.0, only the copper-based medium is supported.

[2] Flow control is a mechanism for preventing data loss that is caused by a source transmitting data faster than the destination can process it. Without flow control, the destination drops incoming data when it does not have memory available (free) in which to store the data.

[3] For IRIX GSN, the Scheduled Transfer Protocol is an additional default stack; ST shares the INET address used by IP.