You can use statement-level parallelism in three language packages: Fortran 77, Fortran 90, and C. This execution model is unique in that you begin with a normal, serial program, and you can always return the program to serial execution by recompiling. Every other parallel model requires you to plan and write a parallel program from the start.
Software support for statement-level parallelism is available from Silicon Graphics and from independent vendors.
The parallel features of the three languages from Silicon Graphics are documented in detail in the manuals listed in Table 11-1.
Table 11-1. Documentation for Statement-Level Parallelism

C Language Reference Manual
    Covers all pragmas, including parallel ones.
IRIS Power C User's Guide
    Use of the Power C source analyzer to place pragmas automatically.
MIPSpro Fortran 77 Programmer's Guide
    General use of Fortran 77, including parallelizing assertions and directives.
MIPSpro Power Fortran 77 Programmer's Guide
    Use of the Power Fortran source analyzer to place directives automatically.
MIPSpro Fortran 90 Programmer's Guide
    General use of Fortran 90, including parallelizing assertions and directives.
MIPSpro Power Fortran 90 Programmer's Guide
    Use of the Power Fortran 90 source analyzer to place directives automatically.
In addition to these products from Silicon Graphics, the High Performance Fortran (HPF) compiler from the Portland Group compiles Fortran 90 augmented with the HPF extensions, and supports automatic parallelization. (Refer to http://www.pgroup.com for more information.)
The FORGE products from Applied Parallel Research (APRI) contain a Fortran 77 source analyzer that can insert parallelizing directives, although not the directives supported by MIPSpro Fortran 77. (Refer to http://www.apri.com for more information.)
In each of the three languages, the language compiler supports explicit statements that command parallel execution (#pragma lines for C; directives and assertions for Fortran). However, placing these statements can be a demanding, error-prone task. It is easy to create a suboptimal program, or worse, a program that is incorrect in subtle ways. Furthermore, small changes in program logic can invalidate parallel directives in ways that are hard to foresee, so it is difficult to modify a program that has been manually made parallel.
For each language, there is a source-level program analyzer that is sold as a separate product (IRIS POWER C, MIPSpro Power Fortran 77, MIPSpro Power Fortran 90). The analyzer identifies sections of the program that can safely be executed in parallel, and automatically inserts the parallelizing directives. After any logic change, you can run the analysis again, so that maintenance is easier.
The source analyzer makes conservative assumptions about the way the program uses data. As a result, it often is unable to find all the potential parallelism. However, the analyzer produces a detailed listing of the program source, showing each segment that could or could not be parallelized, and why. Directed by this listing, you insert source assertions that give the analyzer more information about the program.
The method of creating an optimized parallel program is as follows:
1. Write a complete application that runs on a single processor.
2. Completely debug and verify the correctness of the program in serial execution.
3. Apply the source analyzer and study the listing it produces.
4. Add assertions to the source program. These are not explicit commands to parallelize, but high-level statements that describe the program's use of data.
5. Repeat steps 3 and 4 until the analyzer finds as much parallelism as possible.
6. Run the program on a single-memory multiprocessor.
When the program requires maintenance, you make the necessary logic changes and, simultaneously, remove any assertions about the changed code—unless you are certain that the assertions are still true of the modified logic. Then repeat the preceding procedure from step 2.
The run-time library for all three languages is the same, libmp. It is documented in the mp(3) reference page. libmp uses IRIX lightweight processes to implement parallel execution (see Chapter 12, “Process-Level Parallelism”).
When a parallel program starts, the run-time support creates a pool of lightweight processes using the sproc() function. Initially the extra processes are blocked, while one process executes the opening passage of the program. When execution reaches a parallel section, the run-time library code unblocks as many processes as necessary. Each process begins to execute the same block of statements. The processes share global variables, while each allocates its own copy of variables that are local to one iteration of a loop, such as a loop index.
When a process completes its portion of the work of that parallel section, it returns to the run-time library code, where it picks up another portion of work if any work remains, or suspends until the next time it is needed. At the end of the parallel section, all extra processes are suspended and the original process continues to execute the serial code following the parallel section.
You can specify the number of lightweight processes that are started by a program. In IRIS POWER C, you can use #pragma numthreads to specify the exact number of processes to start, but it is not a good idea to embed this number in a source program. In all implementations, the run-time library by default starts enough processes so there is one for each CPU in the system. That default is often too high, since typically not all CPUs are available for one program.
The run-time library checks an environment variable, MP_SET_NUM_THREADS, for the number of processes to start. You can use this environment variable to choose the number of processes used by a particular run of the program, thereby tuning the program's requirements to the system load. You can even force a parallelized program to execute on a single CPU when necessary.
MIPSpro Fortran 77 and MIPSpro Fortran 90 also recognize additional environment variables that specify a range of process numbers, and use more or fewer processes within this range as system load varies. (See the Programmer's Guide for the language for details.)
At certain points the multiple processes must wait for one another before continuing. They do this by waiting in a busy loop for a certain length of time, and then by blocking until they are signaled. You can specify the amount of time a process spins before it blocks, using source directives, an environment variable, or system functions (see the Programmer's Guide for the language for details).
Most parallel sections are loops. The benefit of parallelization is that some iterations of the loop are executed in one CPU, concurrent with other iterations of the same loop in other CPUs. But how are the different iterations distributed across processes? The languages support four possible methods of scheduling loop iterations, as summarized in Table 11-2.
Table 11-2. Loop Scheduling Types

SIMPLE
    Each process executes ⌊N/P⌋ iterations starting at Q*⌊N/P⌋. The first process to finish takes the remainder chunk, if any.
DYNAMIC
    Each process executes C iterations of the loop, taking the next undone chunk and returning for another chunk until none are left undone.
INTERLEAVE
    Each process executes chunks of C iterations starting at C*Q, then at a stride of C*P: iterations C*Q, C*(Q+P), C*(Q+2P), ...
GSS
    Each process executes chunks of decreasing size: N/(2P), N/(4P), ...
The variables used in Table 11-2 are as follows:

N
    Number of iterations in the loop, determined from the source or at run-time.
P
    Number of available processes, set by default or by environment variable (see “Controlling the Degree of Parallelism”).
Q
    Number of a process, from 0 to P-1.
C
    “Chunk” size, set by directive or by environment variable.
The effects of the scheduling types depend on the nature of the loops being parallelized. For example:
The SIMPLE method works well when N is relatively small. However, unless N is evenly divided by P, there will be a time at the end of the loop when fewer than P processes are working, and possibly only one.
The DYNAMIC and INTERLEAVE methods allow you to set the chunk size to control the span of an array referenced by each process. You can use this to reduce cache effects. When N is very large so that not all data fits in memory, INTERLEAVE may reduce the amount of paging compared to DYNAMIC.
The guided self-scheduling (GSS) method is good for triangular matrices and other algorithms where loop iterations become faster toward the end.
You can use source directives or pragmas within the program to specify the scheduling type and chunk size for particular loops. Where you do not specify the scheduling, the run-time library uses a default method and chunk size. You can establish this default scheduling type and chunk size using environment variables.
In any statement-level parallel program, memory cache contention can harm performance. This subject is covered under “Dealing With Cache Contention”.
When a statement-parallel program runs in an Origin2000 or Onyx2 system, the location of the program's data can affect performance. These issues are covered at length under “Using Origin2000 Nonuniform Memory”.