You design a program to perform computations in parallel in order to get higher performance, by bringing more hardware to bear on the problem concurrently. In order to succeed, you need to understand the hardware architecture of the target system, and also the software interfaces that are available.
The purpose of this chapter is to give a high-level overview of parallel programming models and of the hardware that they use. The parallel models are discussed in more detail in following chapters.
The O2, Indy, and Indigo workstations have single CPUs. Although they can perform I/O operations in parallel with computing, they can execute only one stream of instructions at a time, and time-share the CPU across all active processes.
The CHALLENGE and Onyx systems (and their POWER versions) are symmetric multiprocessor (SMP) computers. In these systems at least 2, and as many as 36, identical microprocessors access a single, common memory and a common set of peripherals through a high-speed bus.
The OCTANE workstation is a two-CPU SMP.
The POWER CHALLENGEarray comprises 2 or more POWER CHALLENGE systems connected by a high-speed local HIPPI network. Each node in the array is an SMP with 2 to 36 CPUs. Nodes do not share a common memory; communication between programs in different nodes passes through sockets. However, the entire array can be administered and programmed as a single entity.
An Origin2000 system provides nodes each containing two or four CPUs, connected in systems of 2 to 128 nodes by a high-speed connection fabric. All system memory is uniformly addressable, but there is a time penalty for the use of nonlocal memory (see “Using Origin2000 Nonuniform Memory”).
Most programs have a single thread of execution that runs as if it were in a uniprocessor, employing the facilities of a single CPU. The IRIX operating system applies CPUs to different programs in order to maximize system throughput.
You can write a program so that it makes use of more than one CPU at a time. The software interface that you use for this is the parallel programming model. The IRIX operating system gives you a variety of such interfaces. Each one is designed around a different set of assumptions about the hardware, especially the memory system.
Each model is implemented using a different library of code linked with your program. In some cases you can design a mixed-model program, but in general this is a recipe for confusion.
It might seem a contradiction, but it is possible to execute some parallel programs on a uniprocessor. Obviously you would not do this expecting the best performance. However, it is often easier to debug a parallel program by running it in the more predictable environment of a single CPU, whether on a multiprocessor or on a uniprocessor workstation. Also, you might deliberately restrict a parallel program to one CPU in order to establish a performance baseline.
Most parallel programming libraries adapt to the available hardware. They run concurrently on multiple CPUs when the CPUs are available (up to some programmer-defined limit). They run on a limited number, or even just one CPU, when necessary. For example, the Fortran programmer can control the number of CPUs used by a MIPSpro Fortran 77 program by setting environment variables before the program starts (see Chapter 11, “Statement-Level Parallelism”).
The key memory issue for parallel execution is this: Can one process access data in memory that belongs to another concurrent process, and if so, what is the time penalty for doing so? The answer depends on the hardware architecture, and determines the optimal programming model.
The CHALLENGE/Onyx system architecture uses a high-speed system bus to connect all components of the system.
One component is the physical memory system, which plugs into the bus and is equally available to all other components. Other units that plug into the system bus are I/O adapters, such as the VME bus adapter. CPU modules containing MIPS R4000, R8000, or R10000 CPUs are also plugged into the system bus.
In the CHALLENGE/Onyx architecture, the single, common memory has these features:
There is a single address map; that is, the same word of memory has the same address in every CPU.
There is no time penalty for communication between processes because every memory word is accessible in the same amount of time from any CPU.
All peripherals are equally accessible from any process.
The OCTANE workstation also uses a single, common memory that is accessible from either of its CPUs in the same amount of time.
The effect of a single, common memory is that processes running in different CPUs can share memory and can update the identical memory locations concurrently. For example, suppose there are four CPUs available to a Fortran program that processes a large array of data. You can divide a single DO-loop so that it executes concurrently on the four CPUs, each CPU working in one-fourth of the array in memory.
As another example, IRIX allows processes to map a single segment of memory into the virtual address spaces of two or more concurrent processes (see Chapter 3, “Sharing Memory Between Processes”). Two processes can transfer data at memory speeds, one putting the data into a mapped segment and the other process taking the data out. They can coordinate their access to the data using semaphores located in the shared segment (see Chapter 4, “Mutual Exclusion”).
In an Array system, such as a POWER CHALLENGEarray, each node is a computer built on the CHALLENGE/Onyx architecture. However, the only connection between nodes is the high-speed HIPPI network. The system does not offer a single system memory; instead, there is a separate memory subsystem in each node. The effect is that:
There is not a single address map. A word of memory in one node cannot be addressed at all from another node.
There is a time penalty for some interprocess communication. When data passes between programs in different nodes, it passes over the HIPPI network, which takes longer than a memory-to-memory transfer.
Peripherals are accessible only in the node to which they are physically attached.
Nevertheless, it is possible to design an application that executes concurrently in multiple nodes of an Array. The message-passing interface (MPI) is designed specifically for this.
The Origin2000 system uses a memory hierarchy. A certain amount of memory is a physical part of each node. The hardware creates the image of a single system memory. The memory installed in any node can be accessed from any other node as if all memory were local. However, the node number is part of the physical address of a word of memory. There is a multiple-level hierarchy of speed: memory in the same node as the CPU is accessed in the least amount of time, while memory in any other node takes an additional fraction of a microsecond to access. The time penalty depends on the relative location of the two nodes in the system.
These are the results of this design:
There is a single address map. A word of memory can be addressed from any node.
There is a time penalty for some accesses, depending on the node that requests the memory and the node that contains it. However, this time cost is far smaller than the cost of communicating over a socket and a network link.
Peripherals are accessible from any node, but there is a time penalty for access to a peripheral from a node other than the one to which the peripheral is attached.
The implications of these features are explored at more length under “Using Origin2000 Nonuniform Memory”.
Each parallel programming model can be characterized by two features:
The relative size of the unit of computation that executes in parallel: a single statement, a function, or an entire process.
The basic mechanism by which the independent, concurrent units of the program exchange data and synchronize their activity.
A summary comparison of the available models is shown in Table 10-1.

Table 10-1  Comparison of Parallel Models

Model                                     Granularity                               Communication
Power Fortran, IRIS POWER C               Looping statement (DO or for statement)   Shared variables in a single user address space.
Ada 95 tasks                              Ada task                                  Shared variables in a single user address space.
POSIX threads                             Thread (C function)                       Shared variables in a single user address space.
Lightweight UNIX processes (sproc())      Lightweight process                       Arena memory segment in a single user address space.
General UNIX processes (fork(), exec())   Process                                   Arena segment mapped to multiple address spaces.
Shared Memory (SHMEM)                     Process                                   Memory-to-memory copy.
Message-Passing Interface (MPI)           Process                                   Memory copy within node; special HIPPI Bypass interface between nodes.
Parallel Virtual Machine (PVM)            Process                                   Memory copy within node; HIPPI network between nodes.
A UNIX process consists of an address space, a large set of process state values, and one thread of execution. The main task of the IRIX kernel is to create processes and to dispatch them to different CPUs to maximize the utilization of the system.
IRIX contains a variety of interprocess communication (IPC) mechanisms, which are discussed in Chapter 2, “Interprocess Communication.” These mechanisms can be used to exchange data and to coordinate the activities of multiple, asynchronous processes within a single-memory system. (Processes running in different nodes of an Array must use one of the distributed models; see “Message-Passing Models”.)
In traditional UNIX practice, one process creates another with the system call fork(), which makes a duplicate of the calling process, after which the two copies execute in parallel. Typically the new process immediately uses the exec() function to load a new program. (The fork(2) reference page contains a complete list of the state values that are duplicated when a process is created. The exec(2) reference page details the process of creating a new program image for execution.)
IRIX also supports the system function sproc(), which creates a lightweight process. A process created with sproc() shares some of its process state values with its parent process (the sproc(2) reference page details how this sharing is specified).
In particular, a process made with sproc() does not have its own address space. It continues to execute in the address space of the original process. In this respect, a lightweight process is like a thread (see “Thread-Level Parallelism”). However, a lightweight process differs from a thread in two significant ways:
A lightweight process still has a full set of UNIX state values. Some of these, for example the table of open file descriptors, can be shared with the parent process, but in general a lightweight process carries most of the state information of a process.
Dispatch of lightweight processes is done in the kernel, and has the same overhead as dispatching any process.
The library support for statement-level parallelism is based on the use of lightweight processes (see “Statement-Level Parallelism”).
A thread is an independent execution state within the context of a larger program. The concept of a thread is well-known, but the most common formal definition of threads and their operation is provided by POSIX standard 1003.1c, “System Application Program Interface—Amendment 2: Threads Extension.”
There are three key differences between a thread and a process:
A UNIX process has its own set of UNIX state information, for example, its own effective user ID and set of open file descriptors.
Threads exist within a process and do not have distinct copies of these UNIX state values. Threads share the single state belonging to their process.
Normally, each UNIX process has a unique address space of memory segments that are accessible only to that process (lightweight processes created with sproc() share all or part of an address space).
Threads within a process always share the single address space belonging to their process.
Processes are scheduled by the IRIX kernel. A change of process requires two context changes, one to enter the kernel domain and one to return to the user domain of the next process. The change from the context of one process to the context of another can entail many instructions.
In contrast, threads are scheduled by code that operates largely in the user address space, without kernel assistance. Thread scheduling can be faster than process scheduling.
The POSIX standard for multithreaded programs is supported by IRIX 6.2 with patches 1361, 1367, and 1389 installed, and in all subsequent releases of IRIX.
In addition, the Silicon Graphics implementation of the Ada 95 language includes support for multitasking Ada programs, using what are essentially threads, although not implemented with the POSIX library. For a complete discussion of the Ada 95 task facility, refer to the Ada 95 Reference Manual, which installs with the Ada 95 compiler (GNAT) product.
MIPSpro Fortran 77 supports compiler directives that command parallel execution of the bodies of DO-loops. The MIPSpro POWER Fortran 77 product is a preprocessor that automates the insertion of these directives in a serial program.
MIPSpro Fortran 90 supports parallelizing directives similar to MIPSpro Fortran 77, and the MIPSpro POWER Fortran 90 product automates their placement.
MIPSpro POWER C supports compiler pragmas that command parallel execution of segments of code. The IRIS POWER C analyzer automates the insertion of these pragmas in a serial program.
In all three languages, the run-time library—which provides the execution environment for the compiled program—contains support for parallel execution. The compiler generates library calls. The library functions create lightweight processes using sproc(), and distribute loop iterations among them.
The run-time support can adapt itself dynamically to the number of available CPUs. Alternatively, you can control it—either using program source statements, or using environment variables at execution time—to use a certain number of CPUs.
Statement-level parallel support is based on the use of common variables in memory, and so it can be used only within the bounds of a single-memory system, such as a CHALLENGE system or a single node of a POWER CHALLENGEarray system.
One way to design a parallel program is to think of each thread of execution as operating in an independent address space, communicating with other threads by exchanging discrete units of data as messages through an abstract, formal interface for message exchange.
The threads of a program designed in this way can be distributed over different computers. Three message-passing execution models are supported by Silicon Graphics systems. Each defines and implements a formal, abstract model for data exchange. Two of the models allow a computation to be distributed across the nodes of a multiple-memory system, without having to reflect the system configuration in the source code. The programming models are:
Shared Memory Model (SHMEM)
Message-Passing Interface (MPI)
Parallel Virtual Machine (PVM)
All three models are briefly summarized in the following topics, and are discussed in more detail in Chapter 14, “Message-Passing Parallelism.” Support for all three is included in the Message-Passing Toolkit (MPT) product. For an overview of MPT, see this URL:
The SHMEM library has been used for some time on Cray systems and is now available for all Silicon Graphics multiprocessors as part of the MPT. A program built on SHMEM is a process-level parallel program. Each process runs in a separate address space. The SHMEM library routines are used to exchange data, and to coordinate execution, between the processes.
SHMEM routines support remote data transfer through put operations, which transfer data to a different process, and get operations, which transfer data from a different process. Other supported operations include data broadcast, data reduction, barrier synchronization, and atomic memory updates such as fetch-and-increment on remote or local data objects.
SHMEM operations are all memory-to-memory, and as a result have extremely high bandwidth and low latency. However, a SHMEM-based program cannot be distributed across multiple systems.
MPI is a standard programming interface for the construction of a portable, parallel application in Fortran 77 or in C, especially when the application can be decomposed into a fixed number of processes operating in a fixed topology (for example, a pipeline, grid, or tree). MPI has wide use on many large computers.
A highly tuned, efficient implementation of MPI is part of the MPT. Within a single system, MPI messages are moved memory-to-memory. Between nodes of a Silicon Graphics Array system, MPI messages are passed over the HIPPI network. Latency and bandwidth are intermediate between memory-to-memory data exchange and socket-based network communication.
PVM is an integrated set of software tools and libraries that emulates a general-purpose, flexible, heterogeneous, concurrent computing framework on interconnected computers of varied architecture. Using PVM, you can create a parallel application that executes as a set of concurrent processes on a set of computers that can include uniprocessors, multiprocessors, and nodes of Array systems.
Like MPI, PVM is in wide use on many types of supercomputers, including Cray systems. An implementation of PVM is included in the MPT. PVM is discussed in more detail in Chapter 14, “Message-Passing Parallelism.”