Chapter 4. Performance-Driven Programming in Array 3.0

Array 3.0 offers developers a rich set of compilers, libraries, system services, and development tools. Most of these facilities are documented separately. This chapter surveys the development tools and provides pointers to their documentation, as well as taking a deeper look at the Array Services library functions. The main topics include:

Basic Array Application Tuning Strategy

Quite often, new applications developed for an Array system run with satisfactory job execution time. When execution times are not satisfactory, the developer must tune the application for improved performance.

An efficient, systematic tuning strategy helps you achieve the best possible performance in the minimum development time. One such tuning strategy is outlined in this section.

Tuning Single-Node Performance

The first step in tuning a parallel application—and it is a large step—is to tune it for best performance on a single node. Tuning any program begins with instrumenting the program to identify the parts where it spends excess time. Analyze these parts to determine whether algorithmic changes are possible. Optimizations of program logic and algorithms, when they are possible, always yield the largest improvements at the lowest cost.

When you are sure the algorithm is optimal and its coding is logically correct, examine the program for library use, cache use, software pipelining, and SMP performance.

Library Selection

Many numerical libraries are available, some from Silicon Graphics, Inc. and some from other sources, both commercial and public domain. Every implementation of a standard algorithm has different characteristics: accuracy bounds and error propagation, raw speed, and speed as a function of problem size.

The SGI Cray Scientific Library (SCSL) includes algorithms that are carefully coded and optimized for Silicon Graphics, Inc. hardware. SCSL supersedes the older CHALLENGEcomplib product. Even so, keep a selection of libraries available and try them all. Some library sources are listed in Table 4-1.

Table 4-1. Information Sources: General Numerical Libraries


Book or URL

Book Number

Directory of WWW software sources


Vast collection of numerical software


Index to math and statistical software


Visualization tools for physics


Volume Renderer and other tools


Some library collections (including CHALLENGEcomplib) are specifically tuned for parallel execution. Pointers to some sources of parallel libraries are listed in Table 4-2.

Table 4-2. Information Sources: Libraries for Parallel Computation


Book, Reference Page, or URL

Book Number

CHALLENGEcomplib overview


Center for Research in Parallel Computation


HPPC software collection


HPCC Vendors “Mall”


Tuning Information

Information sources about performance tuning and the Workshop tools are listed in Table 4-3.

Table 4-3. Information Sources: Performance Analysis Tools


Book, Reference Page, or URL

Book Number

Developer Magic overviews

Developer Magic: ProDev WorkShop Overview



Developer Magic: Debugger User's Guide


Performance Analysis

Developer Magic: Performance Analyzer User's Guide


Performance Analysis

Developer Magic: Static Analyzer User's Guide


Performance Analysis (command-line tools)

MIPSpro Compiling and Performance Tuning Guide


Performance Tuning on Origin2000 and Onyx2

Performance Tuning Optimization for Origin2000 and Onyx2, online only at



Software Pipelining

The MIPSpro compilers sold by Silicon Graphics, Inc. support software pipelining, in which the compiler structures the machine code of compute-intensive loops to optimally schedule operations through the R8000 or R10000 CPU. Proper use of software pipelining can make an immense difference in the execution speed of certain loops. However, you must sometimes adjust the source code of a loop before the compiler recognizes it as eligible for pipeline treatment.

Software pipelining is discussed in the books shown in Table 4-4.

Table 4-4. Information Sources: Software Pipelining


Book, Reference Page, or URL

Book Number

Software pipelining

MIPSpro 64-Bit Porting and Transition Guide


Software pipelining

MIPSpro Compiling and Performance Tuning Guide


SMP Performance

You can make a computation-intensive program faster by applying multiple CPUs in parallel within a single node of an array (within any Challenge-class server). Explore single-node parallelism thoroughly before you consider multinode parallel execution, for three reasons: first, software such as Power Fortran (77 and 90) and IRIS Power C provides extensive support for parallel execution within a node; second, it is relatively easy to start, run, and test a program within one node; and finally, a multinode program must be written to the more complex models provided by High Performance Fortran (HPF), MPI, or PVM.

If you are sure that a program will be a multinode program, or if you are working on a program that is already written to use MPI or PVM, you can still consider some parallel execution within each node. The parallelizing directives of Power Fortran or Power C can be applied within a single source module. For example, you might use MPI to distribute an array in 1 MB sections to each node, but use single-node parallelism in the DO-loop that processes one section.

Parallel Performance Goals

When a program runs efficiently on a single node, but you find you still need to recruit more CPU cycles to it, you can look for further performance gains through multinode parallelism.

Be aware first that multinode parallelism is far from a panacea. Consider Amdahl's Law: if p is the fraction of execution time that can run in parallel, the overall speedup can never exceed 1/(1-p), no matter how many processors are applied. When less than 95% of the execution time can be parallelized, the speedup limit is below 20, and the returns from adding processors diminish quickly; fewer than 18 processors are enough to realize most of the potential benefit. In that case, parallelism within a single node is the most appropriate strategy (presuming at least one node has sufficient CPUs installed).

When the parallelization potential is above 95%, parallelizing across the nodes of the Array can yield a worthwhile speedup.

Designing Appropriate Parallel Algorithms

When designing a multinode application, your basic strategy should be to maximize the work done on the data within any node before communication between nodes is required. In general, aim to communicate between nodes as rarely as possible, and when communication is needed, to use the largest message units possible. This strategy maximizes the use of the high bandwidth and lower latencies of the bus within a node, as compared to the relatively slower communications of the HIPPI network.

While the implementation of your internode parallel algorithms may use a shared memory model (HPF) or message-passing model (MPI, PVM), the design of the fundamental algorithms is conceptually similar to the design of good parallel algorithms for a single-node program. To achieve the performance advantages of distributed multiprocessing, you must

  • Partition the application into concurrent processes

  • Implement efficient communication between the parallel processes to synchronize and exchange data

  • Balance the workload among the parallel processes

  • Minimize communication overhead, so that processors are well utilized

Test and Debug on a Single-node Server

If you design your applications to accept variable numbers of processors and problem sizes, often you can debug your fundamental program logic within a single node. When executing within one node you can apply all the debugging and visualization tools of the Workshop suite.

Parallel Programming and Communication Paradigms

Several models of parallel computation are available for the IRIX and Array 3.0. These models are discussed and compared in detail in the book listed in Table 4-5.

Table 4-5. Information Sources: Parallel Computation Models


Book, Reference Page, or URL

Book Number

Models of Parallel Computation

Topics In IRIX Programming


The models are summarized here, but read the chapter of the book shown for details and the latest version information.

Shared-Memory Communication

IRIX provides several facilities that permit parallel processes to communicate by directly reading and writing the same address space.

Shared Memory within One Node

Three different interfaces provide shared-memory facilities within a single node.

  • IRIX native shared memory

  • POSIX 1003.1b shared memory (available as a patch to IRIX 6.2)

  • System V Release 4 compatible shared memory

A variety of coordination primitives are available for synchronization and mutual exclusion within nodes:

  • IRIX native mutual exclusion locks, semaphores, and barriers

  • POSIX 1003.1b semaphores (available as a patch to IRIX 6.2)

  • System V Release 4 compatible semaphores

You can call on these facilities directly in C programs. The POWER C and POWER Fortran runtime modules for parallel execution use the IRIX native shared memory to communicate, since IRIX shared-memory support is closely integrated with IRIX lightweight processes.

Shared Memory Between Nodes

The current Array architecture restricts direct shared-memory IPC to intranode communication. High Performance Fortran (HPF) provides a form of indirect shared memory for internode communication.

Message-Passing IPC

The message-passing communication model provides a “mail delivery” paradigm for interprocess communication. A collection of data items is given an identifier and “mailed” to a destination process, which subsequently receives it.

The message-passing model makes a clean separation between program modules, affording some protection from accidental changes to shared memory. However, any message-passing facility must incur some delay compared to shared memory, due to buffering, abstraction, and (sometimes) copying and transmission overheads.

Message Passing within a Node

In a C program, you can call on either of two interfaces for queue-based message passing within one node:

  • System V Release 4 message queues

  • POSIX 1003.1b message queues

The POSIX implementation (available as a patch for IRIX 6.2) uses shared memory and operates primarily in user space for minimal overhead. The SVR4 library is included primarily for compatibility, and incurs the overhead of kernel calls.

Distributed Message Passing

Three abstract models are supported to allow the exchange of arbitrary messages between processes operating in the same or different nodes of an array:

  • MPI (Message Passing Interface) is the preferred message-passing facility for Silicon Graphics, Inc. Array systems. The version of MPI distributed with Array 3.0 is carefully tuned to take maximum advantage of the HIPPI interconnect, and of shared memory within a node.

  • PVM (Parallel Virtual Machine) is supported for compatibility, and has been tuned to some extent to work correctly in an Array system.

  • IRIX contains standard support for sockets, with which you can write programs that communicate between any two nodes on the internet. Array Services uses sockets to pass commands and messages between nodes (see “Using Array Services Commands”).

Hybrid Models

If you are preparing a new application, Silicon Graphics, Inc. recommends that you plan as follows:

  • For implicit parallelism within a node, use the compiler facilities of the MIPSpro Fortran and C compilers, aided by Power Fortran and IRIS Power C.

  • For explicit communication within a node, use either IRIX native shared memory, POSIX shared memory, or POSIX message queues.

  • For distributed parallel execution, use MPI.

There is no requirement that applications use any one set of facilities exclusively. For example, the following common models are possible, among others:

  • Shared-memory program with n processes in one node.

  • Message-passing program with n processes in one node.

  • Hybrid application with n processes in one node, using a combination of message passing and shared memory.

  • Message-passing program with n processes distributed over p nodes, n>p.

  • Hybrid application with n processes over p nodes, communicating between nodes via MPI but using shared memory to coordinate multiple processes within each node.

However, when designing a program to use a hybrid model, you must be aware that the MPI library is not “thread-safe,” that is, it has global variables with values that can be destroyed if it is executed by two lightweight processes concurrently. The MPI library should be entered by only one process in any share group. This is discussed in more detail in the MPI and PVM User's Guide, 007-3286-xxx.

Locality, Latency, and Bandwidth

All forms of interprocess communication incur some delay. The time t(s) required to communicate a message containing s bytes of data to another process can be roughly separated into a fixed overhead latency L, independent of message size, and a size-dependent overhead that represents the message size divided by the communication bandwidth B. The following formula is often used to approximate the time to transmit a message of length s:

t(s) = L + s/B

MPI Communication Delays

Array 3.0 contains an optimized protocol stack supporting MPI on HIPPI, with separate design approaches for short messages and long ones. Special attention is paid to latency for shorter messages, which are more common. The advice given in “Reducing the Effect of Communication Delay”, to use fewer, longer messages, remains valid, but the reduced latency relative to prior versions should improve the performance of many programs.

Other conditions can affect MPI's use of HIPPI. When four or more applications contend for the use of one adapter, MPI does not use that adapter (the limit is 8 applications per adapter on an Origin2000 node). When a node does not have a HIPPI adapter, or when the maximum number of MPI applications are already contending for all available adapters, an internode MPI transfer uses a socket instead. This case is not optimized and is slower.

TCP/IP Communication Delays

The basic IRIX support for TCP/IP is not changed for Array 3.0, and does not take advantage of the special MPI tuning. As a result, you are strongly advised to construct a distributed application using MPI, not sockets.

When you must use sockets, be aware that the bandwidth you can achieve can vary over an extremely wide range depending on several factors. The most important are: the size of the socket buffer (the SO_SNDBUF and SO_RCVBUF options of setsockopt()) and the size of the message.

Typical performance with a 62 KB socket buffer and a stream of 16 KB messages is approximately 15 MB/second. Much higher speeds, up to 60 MB/sec and more, can be achieved using larger, page-aligned transfers, with buffer pages locked in memory. Rates of 90 MB/sec can be achieved by highly-tuned benchmark programs.

Reducing the Effect of Communication Delay

To effectively exploit the memory hierarchy of the Array, you must be aware of the effects of communication latency and bandwidth while designing your program. Your best strategy is to send fewer, longer messages, and to overlap communication with useful work when possible.

Do Not Use Message Aggregation

You can sometimes reduce message count by aggregating small messages to the same destination into one larger message. You may find an existing MPI program going to some effort to block, or aggregate, messages. This is no longer recommended.

The HIPPI support in Array 3.0 is designed to incur low latency for small messages. The extra program logic needed to block and unblock messages will likely cost as much time as it saves. It is true in general that the fewer the messages, the better; but with Array 3.0 you should simply send a small message as soon as it is ready.

Overlap Processing with Communication

Try to design the application to overlap message delays with computation. MPI permits a process to send messages asynchronously: the MPI_Isend() function returns immediately, before the message has been sent. MPI also permits asynchronous receipt: the MPI_Irecv() function posts a receive and returns immediately, without waiting for data to arrive.

Asynchronous communication permits the sender or receiver to continue computation while data is transferred by the system. Figure 4-1 shows the potential performance advantages of asynchronous communication. Figure 4-1a is a time line of two standard communicating processes. Figure 4-1b is a time line of the same two processes using asynchronous sends and receives. Here, much of the communication time is hidden behind computation.

Figure 4-1. Gaining Efficiency Through Asynchronous Communication


When implementing asynchronous communication in MPI, it is important to use both MPI_Isend() and MPI_Irecv(). Neither MPI_Send() nor MPI_Isend() can begin to transfer data until a matching receive has been posted. Using MPI_Irecv() allows your program to post that receive earlier rather than later.

Asynchronous communication can in practice hide most communication delays, but it can require program restructuring. Optimize an existing program in other ways first; possibly the increased complexity in program logic will not be necessary.
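Assuming a standard MPI installation, the overlap pattern described above can be sketched as follows; the ring exchange pattern and message size are illustrative only:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double send[1024], recv[1024];
    MPI_Request sreq, rreq;
    int next = (rank + 1) % size;          /* ring exchange */
    int prev = (rank + size - 1) % size;

    /* Post the receive early, then the send; both calls
       return immediately, before any data moves. */
    MPI_Irecv(recv, 1024, MPI_DOUBLE, prev, 0, MPI_COMM_WORLD, &rreq);
    for (int i = 0; i < 1024; i++) send[i] = rank + i;
    MPI_Isend(send, 1024, MPI_DOUBLE, next, 0, MPI_COMM_WORLD, &sreq);

    /* ... useful computation overlaps the transfer here ... */

    /* Only block when the data is actually needed. */
    MPI_Wait(&sreq, MPI_STATUS_IGNORE);
    MPI_Wait(&rreq, MPI_STATUS_IGNORE);
    if (rank == 0) printf("exchange complete\n");
    MPI_Finalize();
    return 0;
}
```

Compile with the MPI compiler wrapper (for example, `mpicc`) and launch with `mpirun`; with one rank the process simply exchanges with itself.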

Array Services Library

Array Services consists of a configuration database, a daemon (arrayd) that runs in each node to provide services, and several user-level commands. The facilities of Array Services are also available to developers through the Array Services library, a set of functions through which you can interrogate the configuration database and call on the services of arrayd.

The commands of Array Services are covered in “Using Array Services Commands”. The administration of Array Services is described in “About Array Configuration” and topics that follow it. These topics are useful background information for understanding the Array Services library.

Array Services Library Overview

The programming interface to Array Services is declared in the header file /usr/include/arraysvcs.h. The object code is located in /usr/lib, and is linked into a program by specifying -larray during compilation. The library is distributed in o32, n32, and 64-bit versions (not all are necessarily installed). The functions are documented by reference pages in volume 3.

The library functions can be grouped into these categories:

  • Functions to connect to Array Services daemons in the local or other nodes, and to get and set arrayd options.

  • Functions to interrogate the Array Services configuration database, listing arrays, nodes, and attributes of arrays and nodes.

  • Functions to allocate Array Session Handles (ASHs), to query active ASHs and to change the relationship between PIDs and ASHs.

  • A function to execute a command as for the array command (see “Operation of Array Commands”).

  • A function to execute any arbitrary user command on an array node.

These functions are examined in following topics.

Data Structures

The Array Services functions work with a number of data structures that are declared in arraysvcs.h. In general, each data structure is allocated by one particular function, which returns a pointer to the structure as the function's result. Your code uses the returned structure, possibly passing it as an argument to other functions.

When your code is finished with a structure, it is expected to call a specific function that frees that type of structure. If your code does not free each structure, a memory leak results.

The data structures and their contents are summarized in Table 4-6.

Table 4-6. Array Services Data Structures



Freed By Function


Name and attributes of an Array.



List of asarray_t structures.



List of ASH values.



Describes output of executing an array command on one node, including temporary files and socket numbers.

freed as part of a list


List of command results, one ascmdrslt_t per node where an array command was executed.



Configuration data about one node: machine name and attributes.

freed as part of a list


List of asmachine_t structures, one per machine in the queried array

asfreemachinelist()


List of PID values.


Error Message Conventions

The functions of the Array Services library have a complicated convention for error return codes. The reference pages related to this convention are listed in Table 4-7.

Table 4-7. Error Message Functions




Discusses the error code conventions and some macro functions used to extract subfields from an error code.


Constructs an error code value from its component parts.


Returns a descriptive string for a given error code value.


Prints a descriptive string, with a specified heading string, on stderr.

In general, each function sets a value in the global aserrorcode, which has type aserror_t (not necessarily an int). An error code is a structured value with these parts:

  • aserrno is a general error number similar to those declared in sys/errno.h.

  • aserrwhy documents the cause of the error.

  • aserrwhat documents the component that detected the error.

  • aserrextra may give additional information.

Macro functions to extract these subfields from the global aserrorcode are provided.

Connecting to Array Services Daemons

The functions listed in Table 4-8 are used to open a connection between the node where your program runs and an instance of arrayd in the same or another node.

Table 4-8. Functions for Connections to Array Services Daemons




Establishes a logical connection to arrayd in a specified node, returning a token that represents that connection for use in other functions.


Close an arrayd connection created by asopenserver().


Return the local options currently in use by an instance of arrayd.


Return the default options in effect at an instance of arrayd.


Set new options for an instance of arrayd.

The key function is asopenserver(). It takes a nodename as a character string (as a user would give it in the -s option; see “Summary of Common Command Options”), and optionally a socket number to override the default arrayd socket number. This function opens a socket connection to the specified instance of arrayd. The returned token (type asserver_t) stands for that connection and is passed to other functions.

The functions for getting and setting server options can change the configured options shown in Table 4-9. To set these options is the programmatic equivalent of passing command line options in an Array Services command (see “About Array Configuration” and “Summary of Common Command Options”).

Table 4-9. Server Options Functions Can Query or Change






Timeout interval for any request to this server.



Timeout interval for connecting to this server.



Whether or not Array Services requests should be forwarded through the local arrayd, or sent directly (the -F option).



The local authentication key (the -Kl command option).



The remote authentication key (-Kr command option).



In default options only, the default socket number.



The hostname for this connection.

Database Interrogation

The functions summarized in Table 4-10 are used to interrogate the configuration database used by arrayd in a specified node (see “About Array Configuration”).

Table 4-10. Functions for Interrogating the Configuration




Return the array name and all attributes strings for the default array known to a specified server, in an asarray_t structure.


Return the names of all arrays, with their attribute strings, from a specified server, as an asarraylist_t structure.


Return the names of all machines, with their attribute strings, from a specified server, as an asmachinelist_t structure.


Search for a particular attribute name in a list of attribute strings, and return its value.

Using these functions you can extract any arrayname, nodename, or attribute that is known to an arrayd instance you have opened.

Managing Array Service Handles

The functions summarized in Table 4-11 are used to create and interrogate ASH values.

Table 4-11. Functions for Managing Array Service Handles




Allocate a new ASH value. The value is only created; it is not applied to any process.


Returns a list of PID values associated with an ASH at a specified server, as an aspidlist_t structure.


Returns the ASH associated with a specified PID.


Change the ASH of the calling process.

The asallocash() function is like the command ainfo newash (see “About Array Session Handles (ASH)”). Only a program with root privilege can use the setash() system function to change the ASH of the current process. Unprivileged processes can create new ASH values but cannot change their ASH.

The functions summarized in Table 4-12 are used to enumerate the active ASH values at a specified node. In each case, the list of ASH values is returned in an asashlist_t structure.

Table 4-12. Functions for ASH Interrogation




Return active ASH values from one node or all nodes of a specified Array via a specified server.


Return active ASH values from an Array by name.


Return active ASH values known to a specified server node.


Return active ASH values in the local node.


Test to see if an ASH is global.

Executing an array Command

The ascommand() function is the programmatic equivalent of the array command (see “Operation of Array Commands” and the array(1) reference page). This command has many options and can be used to execute commands in three distinct modes.

The command to be executed must be prepared in an ascmdreq_t structure, which contains the following fields:

typedef struct ascmdreq {
   char *array;            /* Name of target array */
   int flags;              /* Option flags */
   int numargs;            /* Number of arguments */
   char **args;            /* Cmd arguments (ala argv) */
   int ioflags;            /* I/O flags for interactive commands */
   char rsrvd[100];        /* reserved for expansion: init to 0's */
} ascmdreq_t;

Your program must prepare this structure in order to execute a command. The option flags allow for the same controls as the command line options of array.

The result of the command is returned as an ascmdrsltlist_t structure, which is a vector of ascmdrslt_t structures, one for each node at which the command was executed. Each ascmdrslt_t contains the following fields:

typedef struct ascmdrslt {
    char     *machine;  /* Name of responding machine */
    ash_t    ash;       /* ASH of running command */
    int      flags;     /* Result flags */
    aserror_t error;    /* Error code for this command */
    int      status;    /* Exit status */
    char     *outfile;  /* Name of output file */
    int      ioflags;   /* I/O connections (see ascmdreq_t) */
    int      stdinfd;   /* File descriptor for command's stdin */
    int      stdoutfd;  /* File descriptor for command's stdout */
    int      stderrfd;  /* File descriptor for command's stderr */
    int      signalfd;  /* File descriptor for sending signals */
} ascmdrslt_t;

The fields machine, ash, flags, error, and status reflect the result of the command execution in that machine. The other fields depend on the mode of execution.

Normal Batch Execution

To execute a command in the normal way, waiting for it to complete and collecting its output, you do not set either ASCMDREQ_NOWAIT or ASCMDREQ_INTERACTIVE in the command option flags.

Control returns from ascommand() when the command is complete on all nodes. If the ASCMDREQ_OUTPUT flag was specified, and if the command definition does not specify a MERGE subentry (see “Summary of Command Definition Syntax”), the outfile result field contains the name of a temporary file containing one node's output stream.

When the command is implemented with a MERGE subentry, there is only one output file no matter how many nodes are invoked. In this case, the returned list contains only one ascmdrslt_t structure. It contains the ASCMDRSLT_MERGED and ASCMDREQ_OUTPUT flags, and the outfile result field contains the name of a temporary file containing the merged output.
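A sketch of normal batch execution follows. The structure declarations come from arraysvcs.h as shown above, but the exact ascommand() calling sequence and the result-list traversal are assumptions, so treat this as an outline rather than working code:

```c
#include <string.h>
#include <arraysvcs.h>   /* libarray declarations; link with -larray */

/* Hypothetical sketch: run "uptime" on every node of the default
   array and collect output. Error handling is abbreviated, and the
   ascommand() signature is assumed, not documented here. */
void run_uptime(asserver_t server)
{
    static char *args[] = { "uptime" };
    ascmdreq_t cmd;

    memset(&cmd, 0, sizeof(cmd));    /* zeroes rsrvd, as required */
    cmd.array = NULL;                /* assumed: NULL selects the default array */
    cmd.flags = ASCMDREQ_OUTPUT;     /* collect each node's output file */
    cmd.numargs = 1;
    cmd.args = args;

    ascmdrsltlist_t *results = ascommand(server, &cmd);
    /* Inspect each ascmdrslt_t (machine, status, outfile), then
       free the list with its libarray free function (Table 4-6). */
}
```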

Immediate Execution

When a command has no useful output and should execute concurrently with the calling program, you specify the ASCMDREQ_NOWAIT option. In this case, output cannot be collected because no program will be waiting to use it. Control returns as soon as the command has been distributed. The result structures do not reflect the command's result but only the result of trying to start it.

Interactive Execution

You can start a command in such a way that your program has direct interaction with the input and output streams of the command process in every node. When you do this, your program can supply input and inspect output in near real time.

To establish interactive execution, specify ASCMDREQ_INTERACTIVE in the command option flag. Also set one or more of the following flags in the ioflags field:


Requests a socket attached to the command's stdin.


Requests a socket attached to the command's stdout.


Requests a socket attached to the command's stderr.


Requests a socket that can be used to deliver signals.

As with ASCMDREQ_NOWAIT, control returns as soon as the command has been distributed. Each result structure contains file descriptors for the requested sockets for the command process in that node.

Your program writes data into the stdinfd file descriptor of one node in order to send data to the stdin stream in that node. Your program reads data from the stdoutfd file descriptor to read one node's output stream.

You will typically use either the select() or the poll() system function to learn when one of the sockets is ready for use. You may choose to start subprocesses with fork() to handle I/O to the sockets of each node (see the select(2), poll(2), and fork(2) reference pages). You may also create subprocesses with sproc(), but keep in mind that libarray is not thread-safe, so it should be used from only one process in a share group.

Executing a User Command

The asrcmd() function allows a program to initiate any user command string on a specified node. This provides a powerful facility for remote execution that does not require root privilege, as the standard rcmd() function does (compare the asrcmd(3) and rcmd(3) reference pages).

The asrcmd() function takes arguments specifying the target node and the command to execute (see the asrcmd(3) reference page for the full argument list).

The returned value (as with rcmd()) is a socket that represents the standard input and output streams of the executing command. Optionally, a separate socket for the standard error stream can be obtained.