Chapter 7. Optimizing for Multiprocessors

This chapter contains these sections:

Overview of Parallel Optimization

Using MIPSpro POWER Fortran 90—an extended version of MIPSpro Fortran 90 with extra features for optimization—you can compile your program so that, when you run it in a Silicon Graphics multiprocessor with available CPUs, your program will recruit the power of additional CPUs to work in parallel on the data. You request parallel optimization in one of two ways.

Automatic Parallelization

If you specify the -pfa driver option (see “Specifying Optimization Levels”), the compiler analyzes the program and attempts to parallelize every loop. You can see the result in the listing file (see “Setting the Listing Level”). In some cases the compiler will needlessly parallelize loops that do not contain enough work to justify parallel execution; and in other cases the compiler will not be able to parallelize loops owing to data dependencies.

Explicit Parallelization

You can also direct the compiler to parallelize specific loops, or specific procedures, or show it how to handle specific data dependencies. You do this by omitting the -pfa driver option, and instead writing directives in the program.

You have the choice of two different models for explicit parallelization:

  • A simple model is based on the use of the DOACROSS directive. With it you enable specified DO-loops to execute in parallel, so that multiple iterations of the loop execute concurrently.

  • A more general model is based on the Parallel Computing Forum (PCF) directives. With them, you can parallelize both looping and nonlooping sections of code, and you can specify critical sections and single-process sections of code.

This chapter discusses techniques for analyzing your program and converting it to multiprocessing operations. Chapter 8, “Compiling and Debugging Parallel Fortran,” gives compilation and debugging instructions for parallel processing.

Parallel Execution

The basic idea of parallel optimization is that you can distribute parts of the program's work to two or more independent threads. Each thread executes asynchronously on a different CPU. By doing more than one piece of work at a time, your program finishes sooner than if all the work were done by one process on one CPU.

Process Structure

The processes that participate in the parallel execution of a program are arranged in a master/slave organization. The original process, created when the program is first loaded, is the master. It creates zero or more slave processes to assist it. The master process and each of the slave processes are called a thread of execution, or simply a thread.

Note: The term “thread” is used here as a convenient term for an independent executable entity within a program. Do not assume that it means any particular implementation of threads, for example “POSIX threads.”

When the master process reaches a parallelized section of the program—which is usually a loop—the master assigns some of the work to each slave. The slaves and master execute concurrently, each on a different part of the loop or different section of code. As each slave completes its portion of the loop, it waits for further signals from the master, while the master resumes normal execution.

By default, the number of threads is set equal to the number of CPUs on the particular machine.You can control the number of threads used, either by setting environment variables before running the program, or from within the program by calling library routines.

Program Design

For multiprocessing to work correctly, the code executed by one thread must not depend on results produced by another thread. This is because you cannot predict the timing relationship between one thread and another. In particular, when a loop is parallelized, each iteration of the loop must produce the same answer regardless of when any other iteration of the loop is executed. Not all loops have this property. Loops without it cannot be correctly executed in parallel. (The same is true of loop unrolling. See “Assertions About Data Dependence”.)

In the more general model of parallel execution, you specify sections of code that can run in parallel. When there are data dependencies between them, you can resolve these by delimiting a critical section that can be executed by only one process at a time.

Dynamic Scheduling

Since a long-running program spends most of its time within loops, parallel optimization focuses on the DO loop. The compiler tries to arrange it so that different iterations of the DO loop execute in parallel on multiple CPUs. For example, suppose a DO loop consisting of 20000 iterations will run on a machine with four available CPUs. Using the SIMPLE scheduling method (described in following topics), the first 5000 iterations run on one CPU, the next 5000 on another, and so on. The total execution time of that loop will be 1/4th the time for the nonparallel loop, plus the overhead time it takes to recruit, initialize, and release the added CPUs.

The multiprocessing code adjusts itself at run time to the number of CPUs actually present on the machine. Thus, if a 200-iteration loop is moved to a machine with only two available CPUs, it is automatically scheduled as two blocks of 100 iterations each, without any need to recompile or relink. In fact, multiprocessing code can be run on single-processor machines.

Parallel Directives

You control the parallelization of your program by writing directives in it. You have a choice of two families of directives (or you can mix them in the same program).

To provide compatibility for existing parallel programs, Silicon Graphics supports the directives for parallelism used by Sequent Computer Corporation. These directives allow you to parallelize specified DO-loops, while leaving the details to the Fortran compiler. To use this model, see “Writing Simple Parallel Loops”.

You can also use the proposed Parallel Computing Forum (PCF) standard (ANSI-X3H5 91-0023-B Fortran language binding) directives. With these directives you can specify more general kinds of parallelization, and you can coordinate between the threads. To use the PCF model, see “Using PCF Directives”. (The section “Writing Simple Parallel Loops” has important conceptual material you should read first.)

The directives are compiled with your source program. In addition, you manage parallel execution at run time using environment variables that are tested by the run-time library. Also, there are a number of special library routines that permit more direct, run-time control over the parallel execution (refer to “Run-Time Control of Multiprocessing” for more information.)

Writing Simple Parallel Loops

Six multiprocessing directives are used to parallelize specified loops:


Specify multiprocessing parameters


Continue a C$DOACROSS directive to multiple lines


Identify a line of code executed only when multiprocessing is active.


Specify the way a loop is divided across CPUs


Specify the units of work into which a loop is divided.


Load a local copy of a COMMON block from the master process's version.

Note: In fixed-format source, directives start with “C,” as comments must. In free-format source, they start with “!” instead, but are otherwise the same. In all cases, directives must start in the first column to be recognized (see “Recognizing Directives and Assertions”).

When you compile a program with MIPSpro Fortran 90 (without the “Power” option), directives related to multiprocessing are treated as comments. This allows the identical source to be compiled with a single-processing compiler or by Fortran without the multiprocessing option.

The C$COPYIN directive is described under “Using Local COMMON Blocks”. The other directives are described in the topics that follow.

Syntax of C$DOACROSS

The essential compiler directive for multiprocessing is C$DOACROSS. This directive marks a DO loop to be run in parallel, and specifies the details of how that loop is to be executed. The directive applies only to the following statement, which must be a DO loop. The C$DOACROSS directive has the form

C$DOACROSS [clause [ [,] clause ...]

where valid values for each optional clause are

IF (logical_expression)
{LOCAL | PRIVATE} (item[,item ...] ) 
{SHARED | SHARE} (item[,item ...])
{LASTLOCAL | LAST LOCAL} (item[,item ...])
REDUCTION (item[,item ...])
{CHUNK=integer_expression | BLOCKED(integer_expression)}

The preferred form of the directive (as generated by WorkShop Pro MPF) uses the optional commas between clauses. The C$& directive can be used to extend C$DOACROSS to multiple lines, so each clause can be written on one line.

!$DOACROSS IF (N.GT.10000), CHUNK=100,

Using the IF Clause

You use the IF clause to decide at run time whether the loop is actually executed in parallel. The logical expression is tested at run-time. If its value is .TRUE., the loop is executed in parallel. If the expression is .FALSE., the loop is executed serially. Typically, the expression tests the number of times the loop will execute to be sure there is enough work in the loop to justify the overhead of parallel execution (see “Adjusting the Work Quantum”). In some cases, it tests the number of threads available (see “Using mp_numthreads and mp_set_numthreads”).

Using the LOCAL, LASTLOCAL, and SHARE Clauses

The LOCAL, SHARE, and LASTLOCAL clauses specify lists of variables that need special treatment when used within the loop controlled by C$DOACROSS. Only the names of variables can appear in these clauses. An array variable is listed by name only, without any subscripts. Names of variables in COMMON blocks can not appear in a LOCAL list (but see “Using Local COMMON Blocks”). A variable can appear in only one of these clauses.

The LOCAL, SHARE, LASTLOCAL and REDUCTION lists are discussed at more length under “Analyzing Data Dependencies for Multiprocessing”.

Using the LOCAL Clause

The LOCAL clause gives a list of variables that can be localized to each slave thread. Each iteration of the loop receives a private, uninitialized copy of a LOCAL variable. You should specify a variable as LOCAL when its value is calculated and used in the course of a single iteration of the loop and its value does not depend on any other iteration of the loop.

PRIVATE is a synonym for LOCAL, but LOCAL is preferred.

Using the LASTLOCAL Clause

The LASTLOCAL clause, like the LOCAL clause, specifies variables that can be localized to each iteration of the loop. In addition, the compiler generates code to copy the final value of the variable from the local copy of whichever slave process executes the logically-final iteration, and saves this value in the named variable for use in the serial code following the loop.

The loop iteration variable is given the LASTLOCAL attribute by default. However, if you do not need the value of the iteration variable after the loop, you can save a little time by specifying it as LOCAL instead.

The phrase LAST LOCAL is a synonym for LASTLOCAL, but LASTLOCAL is preferred.

Using the SHARE Clause

The SHARE clause specifies variables that must be common to all slave processes. When a variable is declared as SHARE, all iterations of the loop can safely share a single copy of the variable. You should declare a variable SHARE when:

  • it is not modified in the loop

  • it is an array in which each iteration of the loop accesses a different element

All variables except the loop-iteration variable are SHARE by default. The word SHARED is a synonym for SHARE, but SHARE is preferred.

Using the REDUCTION Clause

The REDUCTION clause specifies variables involved in a reduction operation. In a reduction operation, the compiler keeps local copies of the variables but combines them when it exits the loop. For an example and more discussion, see “Dealing With Reduction”. For the relationship between reduction analysis and optimization levels, see “Controlling General Optimizations”.

An element of the REDUCTION list must be a scalar variable and cannot be an array. However, it can be an individual element of an array specified by subscript.

One element of an array can be used in a reduction operation, while other elements of the array are used in other ways. To allow for this, if an element of an array appears in the REDUCTION list, the entire array can also appear in the SHARE list.

The four types of reductions supported are sum, product, min, and max. Sum and product reductions are recognized through the use of the + and * operators. The min and max reductions are recognized through the use of the MIN and MAX intrinsic functions.

The compiler confirms that the reduction expression is legal by making some simple checks. The compiler does not, however, check all statements in the DO loop for illegal reductions. You must ensure that the reduction variable is used correctly in a reduction operation.

Using the CHUNK and MP_SCHEDTYPE Clauses

The CHUNK and MP_SCHEDTYPE clauses affect the way the compiler divides the work among the slave threads. These clauses do not affect the correctness of the loop. They are useful for tuning the performance of critical loops. See “Balancing the Load With Interleaving” for more details.

For the MP_SCHEDTYPE=mode clause, mode can have one of the following five values:


Divide the loop by the available CPUs and give each slave thread a contiguous group of iterations.


Dynamically vary the amount of work per thread, allocating smaller units as the loop approaches the end.


Use environment variables to manage scheduling.


Slave threads compete for CHUNK-sized assignments.


Parcel out CHUNK-sized assignments to CPUs in rotation.

SIMPLE scheduling is the default unless CHUNK is specified; then DYNAMIC is the default. You can use different modes in different loops.

The CHUNK clause supplements the DYNAMIC and INTERLEAVE modes only. Instead of the CHUNK option, you can specify the –WK,–chunk=n driver option.


The simple scheduling method (MP_SCHEDTYPE=SIMPLE) divides the iterations of the loop among processes by dividing iterations into as many contiguous pieces as there are slave threads, assigning one piece to each process. The SIMPLE method has the lowest overhead, since each slave thread receives one assignment of work, after which it is finished. Use it when every loop iteration takes the same amount of time. When some iterations take longer than others, one slave can fall behind. The completion time of the loop is the completion of the most heavily-loaded CPU.

STATIC is a synonym for SIMPLE, but SIMPLE is preferred.


The Guided Self-Scheduling algorithm (GSS) divides the iterations of the loop into pieces whose size varies depending on the number of iterations remaining. The initial pieces are not sufficient to finish the loop. When a slave thread finishes its piece, it returns for another piece of work.

By parceling out relatively large pieces to start with and relatively small pieces toward the end, the system can achieve good load balancing while reducing the number of slave entries into the critical section. Use GSS when there are relatively few slave CPUs and they are shared with other programs.

GUIDED is a synonym for GSS, but GSS is preferred.


In dynamic scheduling, the iterations of the loop are broken into pieces of the size specified with the CHUNK clause. (CHUNK=1 is the default.)

As each slave process finishes a piece, it enters a critical section to take the next available piece. The smaller the CHUNK, the more of these entries that will occur, increasing overhead.


The interleave method breaks the iterations into pieces of the size specified by the CHUNK option, and execution of those pieces is interleaved among the processes. For example, if there are four processes and CHUNK=2, then the first process will execute iterations 1–2, 9–10, 17–18, …; the second process will execute iterations 3–4, 11–12, 19–20,…; and so on. Although this is more complex than the simple method, it is still a fixed schedule with only a single scheduling decision. (This scheduling type is discussed further under “Balancing the Load With Interleaving”.)

INTERLEAVED (with a final “D”) is a synonym for INTERLEAVE, but the latter is preferred.


You can defer the choice of the scheduling method until run time using MP_SCHEDTYPE=RUNTIME. In this case, the scheduling routine examines values in environment variables to select one of the other methods. See “Environment Variables for RUNTIME Scheduling” for more details. Use this when you want to experiment with the performance of different scheduling types without recompiling.


Example 7-1 shows a simple loop marked for parallel execution. The default scheduling is SIMPLE. By default, variable I is LASTLOCAL while A and B are SHARE.

Example 7-1. Simple Parallel Loop

   DO 10 I = 1, 100
      A(I) = B(I)

If you know that the value of I is not required following the loop, it can be made LOCAL. Example 7-2 shows the loop with the use of the variables specified.

Example 7-2. Simple Parallel Loop With LOCAL, SHARE

   DO 10 I = 1, 100
      A(I) = B(I)

The loop in Example 7-3 uses a variable X that is set and used locally to each iteration of the loop. Time will be saved by making this variable LOCAL as shown, so that each slave thread has its own copy. Since the loop variable I is not used after the loop, it is marked LOCAL also. This loop illustrates a parallel call to a function, SQRT. For more discussion, see “Parallel Procedure Calls”.

Example 7-3. Parallel Loop With LOCAL and Function Call

   DO 10 I = 1, N
      X = SQRT(A(I))
      B(I) = X*C(I) + X*D(I)

In the loop shown in Example 7-4, the final values of I and X are needed after the loop completes. This gives a reason for the use of LASTLOCAL. Also, Example 7-4 illustrates the use of the C$& directive to continue the directive.

Example 7-4. Parallel Loop With LASTLOCAL

   DO 10 I = M, K, N
      X = D(I)**2
      Y = X + X
      DO 20 J = I, ITOP
         A(I,J) = A(I,J) + B(I,J) * C(I,J) *X + Y
   PRINT*, I, X

Note that in Example 7-4, J is listed as LOCAL. It is important to see that the inner loop using J is not a parallel loop (it has no C$DOACROSS directive). Each slave thread executes all the iterations for this inner loop within each iteration of the outer loop that it handles.

The variable J would be SHARE by default, but this would produce incorrect results. As multiple slave threads attempted to execute copies of the inner loop, they would interfere with each other's assignments to J. Hence it is important to specify J as LOCAL so that each slave has an independent copy.

Using C$

The C$ directive is considered a comment line except when multiprocessing. A line beginning with C$ is treated as a conditionally compiled Fortran statement. The rest of the line contains a standard Fortran statement. The statement is compiled only if multiprocessing is turned on. In this case, the “C$” or “!$” prefix is treated as blanks. You can use these directives to insert debugging statements, or to insert arbitrary code into the multiprocessed version. In Example 7-5, a diagnostic PRINT statement executes only when the program executes in multiprocessing mode.

Example 7-5. Use of “C$” Conditional Code

    DO I = 1, 100


The C$MP_SCHEDTYPE=mode directive acts as default MP_SCHEDTYPE clause for following C$DOACROSS directives. The mode value is any of the modes listed in the section called “Using the CHUNK and MP_SCHEDTYPE Clauses”. Any following C$DOACROSS directive that does not have an explicit MP_SCHEDTYPE clause is given the value specified in the last directive prior to the line, rather than the normal default.

The C$CHUNK=integer_expression directive affects the CHUNK clause of a C$DOACROSS in the same way that the C$MP_SCHEDTYPE directive affects the MP_SCHEDTYPE clause for all C$DOACROSS directives in scope. Both directives are in effect from the place they occur in the source until another corresponding directive is encountered or the end of the procedure is reached.

You can also invoke this functionality from the command line during a compile. The –mp_schedtype=schedule_type and –chunk=integer command line options have the effect of implicitly putting the corresponding directives as the first lines in the file.


The compiler does not support direct nesting of C$DOACROSS loops. For example, the following is illegal and generates a compilation error:

   DO I = 1, N
      DO J = 1, N
         A(I,J) = B(I,J)
      END DO

However, a different form of nesting is allowed. A procedure that uses C$DOACROSS can be called from within a parallel region. This can be useful if a single procedure is called from several different places, sometimes from within a parallel loop and sometimes not.

Nesting in this way does not increase the parallelism. When the first C$DOACROSS loop is encountered, that loop is run in parallel. This fully occupies all the slave threads. If, in the parallel loop, a call is made to a routine that itself has a C$DOACROSS, that inner loop is executed serially.

Parallel Procedure Calls

It is possible to use functions and subroutines (intrinsic or user-defined) within a parallel loop. However, it is up to you to make sure that parallel invocations of a procedure do not interfere with one another. Intrinsic functions such as SQRT return a value that depends only on the input arguments; they do not modify global data and they do not use static storage. We say that such a function has no side effects.

Except for RANDOM_SEED and RANDOM_NUMBER, the standard Fortran 90 intrinsic functions have no side effects and can safely be called from a parallel loop. For the most part, the Fortran library functions listed in “Support for IRIX Kernel Functions” do have side effects and can not safely be included in a parallel loop.

For user-written procedures, it is the responsibility of the programmer to ensure that the routines can be correctly multiprocessed.

Caution: Do not use the –static option when compiling routines called within a parallel loop. This converts procedure local variables into static variables which cannot be used in parallel threads.

Tip: You cannot call RANDOM_NUMBER within a parallel loop because the slave threads, running concurrently within the function, would interfere with each other updating the seed values. Repeated or nonrandom values could be returned. In order to use random numbers in a parallel loop, first create an array containing one number for each iteration of the loop. Then treat that array as SHARE within the loop. Example 7-6 illustrates the technique.

Example 7-6. Loop Using Random Numbers

    DO I = 1,N
      X(I) = PERTURB(X(I),R(I))

Analyzing Data Dependencies for Multiprocessing

The essential condition required to parallelize a loop correctly is that each iteration of the loop must be independent of all other iterations. If a loop meets this condition, then the order in which the iterations of the loop execute is not important. They can be executed backward or even at the same time, and the answer is still the same.

This property is captured by the notion of data independence. For a loop to be data-independent, no iterations of the loop can write a value into a memory location that is read or written by any other iteration of that loop. It is all right if the same iteration reads or writes a memory location repeatedly, as long as no other iterations do. It is all right if many iterations read the same location, as long as none of them write to it.

In a Fortran program, memory locations are represented by variable names. So, to determine if a particular loop can be run in parallel, examine the way variables are used in the loop. Because data dependence occurs only when memory locations are modified, pay particular attention to variables that appear on the left-hand side of assignment statements. If a variable is not modified, there is no data dependence associated with it. (Remember that a variable can be modified through the action of a procedure call.)

The Fortran compiler supports four kinds of variable usage within a parallel loop: SHARE, LOCAL, LASTLOCAL, and REDUCTION. The basic meanings of these keywords are discussed under “Using the LOCAL, LASTLOCAL, and SHARE Clauses” and “Using the REDUCTION Clause”.

It is often difficult to analyze loops for data dependence information. Each use of each variable must be examined to see if it fulfills the criteria for LOCAL, LASTLOCAL, SHARE, or REDUCTION. If all of the uses of all variables conform, the loop can be parallelized. If not, the loop cannot be parallelized as it stands, but possibly can be rewritten into an equivalent parallel form (see “Breaking Data Dependencies”).

MIPSpro Power Fortran 90 analyzes loops for data dependence and automatically inserts the required C$DOACROSS directives when it determines that a loop has data independence. When Power Fortran 90 cannot determine whether the loop is independent, it produces a listing file detailing where the problems lie. You can use directives to specify the use of variables, assisting the compiler (see “Assertions About Data Dependence”).

Simple Independence

The loop in Example 7-7 demonstrates simple data independence.

Example 7-7. Loop With Data Independence

      DO 10 I = 1,N
10    A(I) = X + B(I)*C(I)

Each iteration writes to a different location in A, and none of the variables appearing on the right-hand side is modified. This loop can be correctly run in parallel. All the variables are SHARE except for I, which is either LOCAL or LASTLOCAL, depending on whether its last value is used later.

Simple Dependence

The loop in Example 7-8 refers to A(I) on the left-hand side and A(I-1) on the right. This means that one iteration of the loop writes to a location in A and the next iteration reads from that same location. Because different iterations of the loop read and write the same memory location, this loop cannot be run in parallel.

Example 7-8. Loop With Stride-1 Dependence

      DO 20 I = 2,N
20    A(I) = B(I) - A(I-1)

The loop in Example 7-9 looks like the one in Example 7-8. The difference is that the stride of the DO loop (the fixed increment between each iteration) is now 2 rather than 1. Now A(I) is always an even-numbered element of A, while A(I-1) is always an odd-numbered element that never receives an assignment under the expression A(I). None of the data locations on the right-hand side is ever the same as any of the data locations written to on the left-hand side. There is no dependence. The loop can be run in parallel. Arrays A and B can be declared SHARE, while variable I should be declared LOCAL or LASTLOCAL.

Example 7-9. Loop With Stride-2 Dependence

      DO 20 I = 2,N,2
20    A(I) = B(I) - A(I-1)

Complicated Independence

At first glance, the loop in Example 7-10 looks like it cannot be run in parallel because it uses both W(I) and W(I-K). Closer inspection reveals that because the value of I varies between K+1 and 2*K, the value of I-K goes from 1 to K. This means that the W(I-K) term varies from W(1) up to W(K), while the W(I) term varies from W(K+1) up to W(2*K). So W(I-K) in any iteration of the loop is never the same memory location as W(I) in any other iterations. Because there is no data overlap, there are no data dependencies. This loop can be run in parallel. Variables W, B, and K can be declared SHARE, while variable I should be declared LOCAL or LASTLOCAL.

Example 7-10. Loop With Apparent Dependence

DO I = K+1, 2*K
   W(I) = W(I) + B(I,K) * W(I-K)

This example points out a general rule: the more complex the expression used to index an array, the harder it is to analyze. If the arrays in a loop are indexed only by the loop index variable, the analysis is usually straightforward, though tedious. Fortunately, in practice most array indexing expressions are simple.

An Inconsequential Data Dependence

There is a data dependence in Example 7-11 because it is possible that at some point I will be the same as INDEX, so there will be a data location that is being read and written by different iterations of the loop.

Example 7-11. Loop With Inconsequential Dependence

DO I = 1, N
   A(I) = A(INDEX)

In this special case, you can ignore the dependence. You know that when I and INDEX are equal, the value written into A(I) is exactly the same as the value that is already there. The fact that some iterations of the loop read the value before it is written and some after it is written is not important, because they all get the same value. Therefore, this loop can be parallelized. Array A can be declared SHARE, while variable I should be declared LOCAL or LASTLOCAL.

Use of Local Variable

In Example 7-12, each iteration of the loop reads and writes the variable X. However, no loop iteration ever needs the value of X from any other iteration. X is used as a temporary variable; its value does not survive from one iteration to the next. This loop can be parallelized by declaring X to be a LOCAL variable within the loop.

Example 7-12. Loop With Local Variable Use

DO I = 1, N
   X = A(I)*A(I) + B(I)
   B(I) = X + B(I)*X

Note that B(I) is both read and written by the loop. This is not a problem because each iteration has a different value for I, so each iteration uses a different B(I). The same B(I) is allowed to be read and written as long as it is done by the same iteration of the loop. The loop can be run in parallel. Arrays A and B can be declared SHARE, while variable I should be declared LOCAL or LASTLOCAL.

Rewritable Data Dependence

In Example 7-13, the value of INDX survives the loop iteration and is carried into the next iteration. This loop cannot be parallelized as it is written.

Example 7-13. Loop With Dependence

INDX = 0
DO I = 1, N
   INDX = INDX + I
   A(I) = B(I) + C(INDX)

Making INDX a LOCAL variable does not work because you need the value of INDX computed in the previous iteration. It is possible to rewrite this loop to make it parallel (see “Breaking Data Dependencies”).

Exit Branch

The loop in Example 7-14 contains an exit branch; that is, under certain conditions the flow of control exits the loop. The Fortran compiler cannot parallelize loops containing exit branches. While one slave thread might discover the exit condition, other slave threads working on later iterations would continue to run.

Example 7-14. Loop With Exit Branch

   DO I = 1, N
      IF (A(I) .LT. EPSILON) GOTO 320
      A(I) = A(I) * B(I)

Local Array

In Example 7-15, each iteration of the loop uses the same locations in the D array.

Example 7-15. Loop With Local Array Use

DO I = 1, N
   D(1) = A(I,1) - A(J,1)
   D(2) = A(I,2) - A(J,2)
   D(3) = A(I,3) - A(J,3)
   TOTAL_DISTANCE(I,J) = SQRT(D(1)**2 + D(2)**2 + D(3)**2)

However, closer inspection reveals that the entire D array is being used as a temporary. This can be multiprocessed by declaring D to be LOCAL. The compiler allows arrays (even multidimensional arrays) to be LOCAL variables with one restriction: the size of the array must be known at compile time. The dimension bounds must be constants; the LOCAL array cannot have been declared using a variable or the asterisk syntax. Arrays TOTAL_DISTANCE and A can be declared SHARE, while array D and variable I should be declared LOCAL or LASTLOCAL.

Breaking Data Dependencies

Many loops that have data dependencies can be rewritten so that some or all of the loop can be run in parallel. The essential idea is to locate the expressions in the loop that cannot be made parallel and try to find another way to express them that does not depend on any other iteration of the loop. If this is not possible, try to pull the statements out of the loop and into a separate loop, allowing the remainder of the original loop to be run in parallel.

The first step is to analyze the loop to discover the data dependencies (see “Analyzing Data Dependencies for Multiprocessing”). You can use WorkShop Pro MPF with MIPSpro Fortran 90 to identify the problem areas.

Sometimes the dependencies in a loop cannot be broken, and you must either accept the serial execution rate or try to discover a new parallel method of solving the problem. The rest of this section is devoted to a series of examples on how to deal with common situations. These are by no means exhaustive but cover some situations that happen in practice.

Value Derived From Iteration Count

Example 7-16 is the same as Example 7-13. INDX has a value derived from the iteration count. The programmer, almost by instinct, has written code to build this value incrementally—this is obviously the “efficient” way to do it in serial code. However, because each value of INDX depends on the value in a previous iteration, the loop cannot be paralellized.

Example 7-16. Loop With Dependence

   INDX = 0
   DO I = 1, N
      INDX = INDX + I
      A(I) = B(I) + C(INDX)

In fact, INDX can be derived directly from the current iteration number without reference to preceding iterations. As shown in Example 7-17, this can be done using code that would be “inefficient” in a serial program, since it does an “unnecessary” multiply and divide in each loop.

Example 7-17. Loop With Dependence Removed

   DO I  = 1, N
      INDX = (I*(I+1))/2
      A(I) = B(I) + C(INDX)
   END DO 

As a result, INDX can be designated a LOCAL variable, and the loop can now be multiprocessed.

Indirect Indexing

The loop in Example 7-18 cannot be parallelized. It is the final statement that causes problems.

Example 7-18. Loop With Indirect Indexing

   DO 100 I = 1, N
      IX = INDEXX(I)
      IY = INDEXY(I)

The indexes IXX and IYY are computed in a complex way and depend on the values from the IXOFFSET and IYOFFSET arrays. It cannot be said that TOTAL(IXX,IYY) in one iteration of the loop will always be different from TOTAL(IXX,IYY) in every other iteration of the loop.

Example 7-19 shows that the assignment to TOTAL can be pulled out into a separate loop by expanding IXX and IYY into arrays that retain intermediate values.

Example 7-19. Loop With Dependency Split to Other Loop

   DO I  = 1, N
      IX = INDEXX(I)
      IY = INDEXY(I)
   DO 100 I = 1, N

Here, IXX and IYY have been turned into arrays to hold all the values computed by the first loop. The first loop (containing most of the work) can now be run in parallel. Only the second loop must still be run serially. This will be true if IXOFFSET or IYOFFSET are permutation vectors.

Tip: Temporary arrays such as IXX and IYY in Example 7-19 can be ALLOCATABLE, allocated just before needed and released after.

Before we leave this example, note that if we were certain that the value for IXX was always different in every iteration of the loop, then the original loop could be run in parallel. It could also be run in parallel if IYY was always different. If IXX (or IYY) is always different in every iteration, then TOTAL(IXX,IYY) is never the same location in any iteration of the loop, and so there is no data conflict.

This sort of knowledge is, of course, program-specific and should always be used with great care. It may be true for a particular data set, but to run the original code in parallel as it stands, you need to be sure it will always be true for all possible input data sets—and you need to document the dependence on this assertion, so that future program maintenance does not make it invalid.

Dealing With Recurrence

Example 7-20 shows a simple example of recurrence, which exists when a value computed in one iteration is immediately used by another iteration.

Example 7-20. Loop With Recurrence Relation

DO I = 1,N
   X(I) = X(I-1) + Y(I)

There is no good way of running this loop in parallel. If this type of construct appears in a critical loop, try pulling the statement(s) out of the loop as in the previous example. Sometimes an inner loop encloses the recurrence; in that case, try to parallelize the outer loop.

Dealing With Reduction

The operation in Example 7-21 is known as a reduction. Reductions occur when the elements of an array of values are combined and reduced to a single value. This example is a sum reduction because the combining operation is addition.

Example 7-21. Loop With a Sum Reduction

ASUM  = 0.0
DO I = 1,N
   ASUM = ASUM + A(I)

Since the value of ASUM is carried from one loop iteration to the next, this loop cannot be parallelized. However, because this example merely sums the elements of A(I), we can rewrite it to accumulate multiple, independent subtotals. Then we can do much of the work in parallel. This approach is shown in Example 7-22. (The mp_numthreads library call is discussed under “Using mp_numthreads and mp_set_numthreads”.)

Example 7-22. Loop With Partitioned Sum Reduction

      PARTIAL_ASUM(K) = 0.0
! The first thread does 1 through IPIECE_SIZE, the second
! does IPIECE_SIZE + 1 through 2*IPIECE_SIZE, and so on.
! If M is not evenly divisible by num_threads, the MIN
! expression makes the last piece small.
      END DO
! Finally, add up the partial sums
   ASUM = 0.0

The outer loop on K can be run in parallel. The array pieces for the partial sums are contiguous; that is, each inner loop processes a span of elements of A(I) that are adjacent in memory. This results in good cache utilization.

The approach illustrated in Example 7-22 is needed so often that automatic support is provided by the REDUCTION clause of C$DOACROSS. All of Example 7-22 can be reduced to the simple code of Example 7-23.

Example 7-23. Loop With Automatic Partitioned Reduction

   ASUM = 0.0
   DO I = 1, N
      ASUM = ASUM + A(I)

Types of Reductions

You can use the REDUCTION clause to automatically partition reductions based on four types of reduction operations:


S = S+A(I)


P = P*A(I)


L = MIN(L,A(I))


M = MAX(M,A(I))

Multiple reductions are supported in a single loop, as shown in Example 7-24.

Example 7-24. Loop With Four Reductions

   DO I = 1,N
      BG_SUM  = BG_SUM + A(I)
      BG_PROD = BG_PROD * A(I)
      BG_MIN  = MIN(BG_MIN, A(I))
      BG_MAX  = MAX(BG_MAX, A(I)

The compiler recognizes the type of reduction based on the operator or intrinsic function used. The number of available threads is calculated at runtime, and the arrays of partial values are allocated automatically.

Note: A partitioned reduction may not produce results identical to the serial reduction. Because computer arithmetic has limited precision, round-off errors accumulate in different ways when you sum the values in a different order. The final answer can differ, usually only in the last few decimal places. Either answer is “correct.” If the difference is significant, neither answer is trustworthy.

Inner Reduction

One further example of a reduction transformation is noteworthy. Consider the nested loops in Example 7-25.

Example 7-25. Reduction Nested in Outer Loop

   DO I = 1, N
      TOTAL = 0.0
      DO J = 1, M
         TOTAL = TOTAL + A(J,I)
      END DO
      B(I) = C(I) * TOTAL

The inner loop could be parallelized with a REDUCTION clause, and this would be a reasonable optimization in the case provided that N is small and M is large.

However, first consider the outer loop. The variable TOTAL fulfills the criteria for a LOCAL variable in the outer loop: the value of TOTAL in each iteration of the outer loop does not depend on the value of TOTAL in any other iteration of the outer loop. The outer loop can be parallelized, as shown in Example 7-26.

Example 7-26. Parallel Outer Loop With Inner Reduction

   DO I = 1, N
      TOTAL = 0.0
      DO J = 1, M
         TOTAL = TOTAL + A(J,I)
      END DO
      B(I) = C(I) * TOTAL

Iterations of the outer loop are performed in parallel. Each one has its own copy of I, J, and TOTAL, and executes the inner loop serially.

Coding Reductions Manually

When you have a reduction that is not a simple sum, product, MIN or MAX operation, it cannot be parallelized automatically. However, you can apply the technique shown in Example 7-22. The idea of partitioning an array to permit parallel computation, and then combining the partial results, is an important technique for breaking data dependence. This situation turns up again and again in various contexts and guises.

Tip: If multiple CPUs are not available, a program such as Example 7-22 will waste time in overhead operations. You can write two versions of a reduction, one serial and one parallel, and encapsulate them in subroutines. Then you can dynamically choose which to execute using the C$ directive (see “Using C$”).

Adjusting the Work Quantum

A certain amount of overhead is needed to initialize a parallel loop. If the work done in the loop is small, the parallel loop can actually run slower. To avoid this, make the amount of work inside the multiprocessed region as large as possible.

Using a Loop Interchange

In the nested loops of Example 7-27 you could choose to parallelize the J loop or the I loop. In general, try to parallelize the outermost DO loop because it encloses the most work. However, you cannot parallelize the K loop in Example 7-27 because different iterations of the K loop read and write the same values of A(I,J).

Example 7-27. Nested Loops

DO K = 1, N
   DO I = 1, N
      DO J = 1, N
         A(I,J) = A(I,J) + B(I,K) * C(K,J)
      END DO

In this example, the I loop is the outermost one that can be parallelized. However, using the technique called loop interchange, you can reorder the loops to make this one the outermost, as shown in Example 7-28.

Example 7-28. Nested Loops, Interchanged

DO I = 1, N
   DO K = 1, N
      DO J = 1, N
         A(I,J) = A(I,J) + B(I,K) * C(K,J)
      END DO

Now the parallelizable loop encloses more work and shows better performance. In practice, relatively few loops can be reordered in this way. However, it does occasionally happen that several loops in a nest of loops are candidates for parallelization. In such a case, it is usually best to parallelize the outermost one.

Conditional Parallelism

Occasionally, the only loop available to be parallelized has a fairly small amount of work. It may be worthwhile to force certain loops to run without parallelism or to select between a parallel version and a serial version, on the basis of the length of the loop.

The first loop in Example 7-29 can never execute more thanfour iterations. It is not worth parallelizing such a loop unless there is an extraordinary amount of work in the body of the loop.

Example 7-29. Loop Not Worth Parallelizing

   J = (N/4) * 4
   DO I = J+1, N
      A(I) = A(I) + X*B(I)
   DO I = 1, J, 4
      A(I) = A(I) + X*B(I)
      A(I+1) = A(I+1) + X*B(I+1)
      A(I+2) = A(I+2) + X*B(I+2)
      A(I+3) = A(I+3) + X*B(I+3)

The second loop in Example 7-29 has been optimized using manual loop unrolling of order four. Even so, this loop is worth parallelizing if N is big enough. In general, the overhead of initiating a parallel loop is roughly equivalent to 1,000 floating-point operations. Let F be the approximate number of floating-point operations in the loop, and let P be the number of available CPUs. If parallelization is to save time, the approximate inequality F-F/P>1000 must hold.

The revision in Example 7-30 uses the IF clause on the DOACROSS directive to test if time can be saved by parallelization.

Example 7-30. Loop With Conditional Parallelization

   J = (N/4) * 4
   DO I = J+1, N
      A(I) = A(I) + X*B(I)
!$ IF ((J*8)-(J*8/MP_NUMTHREADS()).GE.1000)
   DO I = 1, J, 4 ! 8 flops per iter.
      A(I) = A(I) + X*B(I)
      A(I+1) = A(I+1) + X*B(I+1)
      A(I+2) = A(I+2) + X*B(I+2)
      A(I+3) = A(I+3) + X*B(I+3)
   END DO 

Cache Effects

The cache memory of current Silicon Graphics systems has a major effect on performance. A CPU runs at or near its theoretical maximum speed only when all data and instructions are present in the cache. Whenever the CPU must fetch data from main memory, execution slows.

The technique for the best cache performance in Fortran is quite simple: make the loop step through the array in the same way that the array is laid out in memory. This means stepping through the array:

  • without skipping any elements (with a “stride” of 1)

  • with the leftmost subscript varying the fastest

Note that this optimization does not depend on multiprocessing, nor is it required in order for multiprocessing to work correctly. However, when you divide work between multiple CPUs, it is easy to introduce nonsequential array access. Always try to divide work so that each slave thread works on a contiguous span of array elements.

Parallelizing a Matrix Multiply

The loops in Example 7-31 are the same as in Example 7-28. In order to get the most work into the outer loop, the I loop was interchanged with the K loop.

Example 7-31. Nested Loops, Interchanged

DO I = 1, N
   DO K = 1, N
      DO J = 1, N
         A(I,J) = A(I,J) + B(I,K) * C(K,J)
      END DO

Unfortunately, to get the best cache performance, the I loop should be innermost. This is because I is the leftmost index in the references to arrays A and B. As the example stands, the innermost statement touches elements of A with a stride of J, and elements of B with a stride of K, resulting in unnecessarily frequent cache misses.

At the same time, to get the best multiprocessing performance, the outermost loop should be parallelized. In this case, you can perform one additional loop interchange of the I and J loops, and get the best of both optimizations. This is illustrated in Example 7-32.

Example 7-32. Nested Loops Interchanged for Cache Performance

DO J = 1, N
   DO K = 1, N
      DO I = 1, N
         A(I,J) = A(I,J) + B(I,K) * C(K,J)
      END DO

Trade-Offs Between Optimizations

Sometimes you must choose between the possible optimizations and their costs. The loop in Example 7-33 can be parallelized on I but not on J.

Example 7-33. Vector Reduction

DO J = 1, N
   DO I = 1, M
      A(I) = A(I) + B(J)*C(I,J)

As shown in Example 7-34, you could interchange the loops to put I on the outside, thus getting a bigger work quantum.

Example 7-34. Parallelized Vector Reduction

DO I = 1, M
   DO J = 1, N
      A(I) = A(I) + B(J)*C(I,J)

However, putting J on the inside means that the loop steps through the C array with a stride of I—the second subscript varies the fastest. Supposing that C is, in total, much larger than the cache, all of C will be pumped through the cache, I times.

Another approach is to forego the loop interchange, and to parallelize only the inner loop. The inner loop can be seen as a sum reduction into A(I). The result is shown in Example 7-35.

Example 7-35. Vector Reduction With Parallel Inner Loop

DO J = 1, N
   DO I = 1, M
      A(I) = A(I) + B(J)*C(I,J)

However, this approach entails initiating parallel execution J times; so M needs to be large for this approach to show any improvement.

You must trade off the various possible optimizations to find the combination that is right for the particular job.

Balancing the Load With Interleaving

When the Fortran compiler divides a loop into pieces, by default it uses the simple method of separating the iterations into contiguous blocks of equal size for each process. However, some iterations can take significantly longer to complete than other iterations. This can be the natural result of the algorithm used.

At the end of a parallel region, the program has to wait for all slave threads to complete their tasks. If the work is not divided evenly, time is wasted waiting for the slowest process to finish.

Example 7-36. Loop With Iterations of Different Lengths

DO I = 1, N
   DO J = 1, I
      A(J, I) = A(J, I) + B(J)*C(I)

The code segment in Example 7-36 can be parallelized on the outer loop. Because the inner loop goes from 1 to I, the first iterations of the outer loop will end long before the last iterations of the outer loop. In this example, this is easy to predict, so you can change the program as shown in Example 7-37. (See also “Using mp_numthreads and mp_set_numthreads”.)

Example 7-37. Loop With Balanced Iterations

      DO I = K, N, NUM_THREADS
         DO J = 1, I
            A(J, I) = A(J, I) + B(J)*C(I)
         END DO
      END DO

In this rewritten version, instead of breaking up the I loop into contiguous blocks, it has been broken into interleaved blocks. Thus, each execution thread receives some small values of I and some large values of I, so that each slave thread has about the same amount of work to do. Interleaving usually, but not always, cures a load balancing problem.

You can use the MP_SCHEDTYPE clause to automatically perform this desirable transformation.

      DO 20 I = 1, N
         DO 10 J = 1, I
            A (J,I) = A(J,I) + B(J)*C(J)

Note that interleaving can cause poor cache performance because the array is no longer stepped through at a stride of 1. You can improve performance somewhat by adding a CHUNK=integer_expression clause. Usually 4 or 8 is a good value for integer_expression. Each small chunk has stride 1 to improve cache performance, while the chunks are interleaved to improve load balancing.

Interleaving is one possible scheduling mode. Both interleaving and the SIMPLE scheduling method are examples of fixed schedules; the iterations are assigned to processes by a single decision made when the loop is entered. For more complex loops, it may be desirable to use DYNAMIC or GSS schedules.

Comparing the output from pixie or from prof allows you to see how well the load is being balanced so you can compare the different methods of dividing the load. Refer to the discussion of the MP_SCHEDTYPE clause in “Using the CHUNK and MP_SCHEDTYPE Clauses” for more information.

Even when the load is perfectly balanced, iterations may still take varying amounts of time to finish because of random factors. One process may take a page fault, another may be interrupted to let a different program run, and so on. Because of these unpredictable events, some time can be spent waiting for the last processes to complete, even with near-perfect balance.

Run-Time Control of Multiprocessing

A number of features are provided so that you can control the use of multiprocessing at runtime. This section provides a brief explanation of these features, which are documented in the mp(3f) reference page.

Using mp_block and mp_unblock

The mp_block library function puts the slave threads into a blocked state using the IRIX system function blockproc. The slave threads stay blocked until a call is made to mp_unblock. These routines are useful if the job has bursts of parallelism separated by long stretches of single processing, as with an interactive program. You can block the slave processes so they consume CPU cycles only as needed, thus freeing the machine for other users. The Fortran system automatically unblocks the slaves on entering a parallel region should you neglect to do so.

Using mp_setup, mp_create, and mp_destroy

The mp_setup, mp_create, and mp_destroy library calls create and destroy threads of execution. This can be useful if the job has only one parallel portion or if the parallel parts are widely scattered. When you destroy the extra execution threads, they cannot consume system resources, but they must be re-created when needed. Frequent creation of threads can degrade performance. The mp_block and mp_unblock routines should be used in almost all cases.

The mp_setup library call takes no arguments. It creates the default number of processes as defined by previous calls to mp_set_numthreads (see “Using mp_numthreads and mp_set_numthreads”), by the environment variable MP_SET_NUMTHREADS (see “Environment Variables for Scheduling Control”), or by the number of CPUs on the current hardware platform. mp_setup is called automatically when the first parallel loop is entered to initialize the slave threads.

The mp_create call takes a single integer argument, the total number of execution threads desired. Note that the total number of threads includes the master thread. Thus, mp_create(n) creates one thread less than the value of its argument. mp_destroy takes no arguments—it destroys all the slave execution threads, leaving the master untouched.

When the slave threads end, they generate a SIGCLD signal. If your program has changed the signal handler to catch SIGCLD, it must be prepared to deal with this signal when mp_destroy is executed. This signal also occurs when the program exits; mp_destroy is called as part of normal cleanup when a parallel Fortran job terminates.

Using mp_blocktime

The Fortran slave threads wait by “spinning” (repetitive testing of a lock) until there is work to do. This makes them immediately available when a parallel region is reached. However, spinning consumes CPU resources. After a certain maximum amount of spinning, the slaves block themselves through blockproc. Once the slaves are blocked, it requires a system call to unblockproc to activate the slaves again (refer to the unblockproc(2) reference page for details). This slows the startup of a parallel loop.

This trade-off between response time and CPU usage can be adjusted with the mp_blocktime call. mp_blocktime takes a single integer argument that specifies the number of times to spin before blocking. By default, it is set to 10,000,000; this takes roughly one second. If called with an argument of 0, the slave threads will not block themselves no matter how much time has passed. Explicit calls to mp_block, however, do still block the threads.

This automatic blocking is transparent to the program. Blocked threads are automatically unblocked when a parallel region is reached.

Using mp_numthreads and mp_set_numthreads

Occasionally, you may want to know how many execution threads are available (for example, in order to call mp_setup, or for a test in a C$DOACROSS IF expression). The mp_numthreads function takes no arguments. It returns the total number of execution threads available for this job. The count includes the master thread.

The mp_set_numthreads subroutine takes a single-integer argument and changes the default number of threads to the specified value. A subsequent call to mp_setup will use the specified value rather than the original defaults.

Note: This call has an effect only when mp_setup is called. If the slave threads have already been created, this call will not change their number. To change the number of threads, use mp_destroy, then mp_set_numthreads, then mp_setup.

Using mp_my_threadnum

The mp_my_threadnum function takes no arguments. It returns a number indicating the number of thread executing the call. If there are n execution threads, the function call returns a value between zero and n – 1. The master thread is always thread zero. This function can be useful when parallelizing certain kinds of loops. Most of the time, the loop index variable can be used for the same purpose. Occasionally, the loop index may not be accessible, as, for example, when an external routine is called from within the parallel loop. This routine provides a mechanism for those cases.

Environment Variables for Scheduling Control

The MP_SET_NUMTHREADS, MP_BLOCKTIME, and MP_SETUP environment variables act as an implicit call to the corresponding routines of the same name, but they take effect at program start-up time.

For example, the csh command


causes the program to create two threads regardless of the number of CPUs actually on the machine, just like the source statement


Similarly, the sh commands


prevent the slave threads from autoblocking, just as does the statement

call mp_blocktime (0)

For compatibility with older releases, the environment variable NUM_THREADS is supported as a synonym for MP_SET_NUMTHREADS.

To help support networks with several multiprocessors and several CPUs, the environment variable MP_SET_NUMTHREADS also accepts an expression involving integers +, –, min, max, and the special symbol all, which stands for “the number of CPUs on the current machine.”

For example, the following command selects the number of threads to be two fewer than the total number of CPUs (but always at least one):

% setenv MP_SET_NUMTHREADS max(1,all-2)

Environment Variables for Load-Sensitive Scheduling

In an environment with long running jobs and varying workloads, it may be preferable to vary the number of threads during execution of some jobs.

If the environment variable MP_SUGNUMTHD has a non-null value when the program starts, the run-time library creates an additional, asynchronous process that periodically wakes up and monitors the system load. When idle processors exist, this process increases the number of threads, up to the maximum set by MP_SET_NUMTHREADS. When the system load increases, the process decreases the number of threads, possibly to as few as one. When MP_SUGNUMTHD has no value, this feature is disabled and multithreading works as before.

The environment variables MP_SUGNUMTHD_MIN and MP_SUGNUMTHD_MAX are used to limit this feature as desired. When MP_SUGNUMTHD_MIN is set to an integer value between 1 and MP_SET_NUMTHREADS, the process will not decrease the number of threads below that value.

When MP_SUGNUMTHD_MAX is set to an integer value between the minimum number of threads and MP_SET_NUMTHREADS, the process does not increase the number of threads above that value.

If you set any value in the environment variable MP_SUGNUMTHD_VERBOSE, informational messages are written to stderr whenever the process changes the number of threads in use.

Calls to mp_numthreads and mp_set_numthreads are taken as a sign that the application depends on the number of threads in use. The number in use is frozen upon either of these calls; and if MP_SUGNUMTHD_VERBOSE is set, a message to that effect is written to stderr.

Environment Variables for RUNTIME Scheduling

These environment variables specify the type of scheduling to use on DOACROSS loops that have their scheduling type set to RUNTIME. For example, the following csh commands cause loops with the RUNTIME scheduling type to be executed as interleaved loops with a chunk size of 4:

% setenv CHUNK 4 

The defaults are the same as on the C$DOACROSS directive: if neither variable is set, SIMPLE scheduling is assumed; if MP_SCHEDTYPE is set but CHUNK is not set, a CHUNK of 1 is assumed. If CHUNK is set, but MP_SCHEDTYPE is not, DYNAMIC scheduling is assumed.

Using mp_setlock, mp_unsetlock, mp_barrier

The mp_setlock, mp_unsetlock, and mp_barrier subroutines provide convenient (although limited) access to the locking and barrier functions provided by the IRIX functions ussetlock, usunsetlock, and barrier. These subroutines are convenient because you do not need to initialize them; calls such as usconfig and usinit are done automatically. The limitation is that there is only one lock and one barrier. For most programs, this amount is sufficient. If your program requires more complex or flexible locking facilities, use the ussetlock family of subroutines directly.

Using Local COMMON Blocks

Variables in COMMON blocks are static, and if there is an assignment to a static variable within a loop, the loop can't be parallelized.

A special ld option allows named COMMON blocks to be local to a process. Each process in the parallel job gets its own private copy of the common block. This can be helpful in converting certain types of loops into parallel form.

Only a named COMMON can be made process-local (blank COMMON may not be made local). The COMMON block must not be initialized by executable code, not by DATA statements.

To create a local COMMON block, give the special loader directive –Xlocal followed by a list of COMMON block names. Note that the external name of a COMMON block known to the loader has a trailing underscore and is not surrounded by slashes. For example, the command

% f90 –mp a.out –Xlocal foo_

makes the COMMON block /FOO/ a local COMMON block in the resulting a.out file. You can specify multiple –Xlocal options if necessary.

It is occasionally desirable to be able to copy values from the master thread's version of the COMMON block into the slave thread's version. The special directive C$COPYIN allows this. It has the form

C$COPYIN item [, item …]

Each item must be a member of a localized COMMON block. It can be a variable, an array, an individual element of an array, or the entire COMMON block.

For example,

!$COPYIN x,y, /foo/, a(i)

propagates the values for x and y, all the values in the COMMON block FOO, and the ith element of array a. All these items must be members of local COMMON blocks. Note that this directive is translated into executable code, so in this example i is evaluated at the time this statement is executed.

Compatibility With sproc

The parallelism used in Fortran is implemented using the IRIX system call sproc. It is not a good idea to attempt to use both parallel loops and explicit sproc calls. It is possible, but there are several restrictions:

  • Any explicit threads you create may not execute $DOACROSS loops. Only the original thread is allowed to do this.

  • The calls to routines like mp_block and mp_destroy apply only to the threads created by mp_create or to those automatically created when the Fortran job starts; they have no effect on any user-created threads.

  • Calls to routines such as m_get_numprocs do not apply to threads created explicitly. However, the Fortran-created slave threads are ordinary subprocesses; so using the system function kill with the arguments 0 and a signal number (for example, kill (0,9)) to signal all members of the process group will kill the threads used to execute C$DOACROSS.

  • If you choose to intercept the SIGCLD signal, you must be prepared to receive this signal when the threads used for the C$DOACROSS loops exit; this occurs when mp_destroy is called or at program termination.

  • Note in particular that m_fork is implemented using sproc, so it is not legal to m_fork a family of processes that each subsequently executes C$DOACROSS loops. Only the original thread can execute C$DOACROSS loops.

DOACROSS Implementation

This section discusses how multiprocessing is implemented in a loop controlled by C$DOACROSS. This information is useful when you use a debugger or interpret the results of an execution profile.

Loop Transformation

When the Fortran compiler encounters a C$DOACROSS directive, it puts the body of the corresponding DO loop into a separate subroutine and replaces the loop with a call to a special library routine __mp_parallel_do.

The newly created routine is named by appending .pregion to the name of the original routine, followed by the number of the parallel loop in the routine, where 0 is the first loop. For example, the first parallel loop in a routine named foo is named foo.pregion0, the second parallel loop is foo.pregion1, and so on.

If a loop occurs in the main routine and if that routine has not been given a name by the PROGRAM statement, its name is assumed to be main. Any variables declared to be local in the original C$DOACROSS statement are declared as local variables in the created routine. References to SHARE variables are resolved by referring back to the original routine.

Because the created routine is now just a DO loop, the routine uses subroutine arguments to specify which part of the loop a particular process is to execute. The created routine has three arguments: the starting value for the index, the number of times to execute the loop, and a special flag word.

Consider the subroutine in Example 7-38.

Example 7-38. Subroutine With Parallel Loop

   REAL A(*), B(*), C(*)
   DO I = 1, N
      X = A(I)*B(I)
      C(I) = X + X**2
   C(N) = A(1) + B(2)

The compiler generates the new subroutine shown in Example 7-39 to contain the parallelized loop code. The name of the generated routine is derived from the containing subroutine, EXAMPLE.

Example 7-39. Generated “pregion” Subroutine

      X = A(I)*B(I)
      C(I) = X + X**2
      I = I + 1

Executing Created Routines

The set of processes that cooperate to execute the parallel Fortran job are members of a process share group created by sproc, anIRIX system call. The process share group is created by special startup routines that are used only when the executable is linked with the –mp option, which enables multiprocessing.

The first process is the master process. It executes all the nonparallel portions of the code. The other processes are slave processes; they are controlled by the routine mp_slave_control. When they are inactive, they wait in the special routine __mp_slave_wait_for_work.

The __mp_parallel_do routine divides the work and signals the slaves. The master process then calls the created routine to do its share of the work. When a slave is signaled, it wakes up from the wait loop, calculates which iterations of the spooled DO loop it is to execute, and then calls the routine with the appropriate arguments. When the routine returns, the slave reports that it has finished and returns to __mp_slave_wait_for_work.

When the master completes its execution of its portion of the spooled routine, it waits in the special routine mp_wait_for_loop_completion until all the slaves have completed processing. The master then returns to the main routine and continues execution.

Using PCF Directives

The compiler supports a more general model of parallelism, in addition to the simple loop-level parallelism offered by the C$DOACROSS directive. This model is based on the work done by the Parallel Computing Forum (PCF), which itself formed the basis for the proposed ANSI-X3H5 standard. The compiler supports this model through compiler directives, rather than extensions to the source language.

The main concept in this model is the parallel section, which can be any arbitrary section of code (not just a DO loop). Within the parallel region, you designate work-sharing constructs to specify how the work is divided among separate threads. The parallel region can also contain a critical section construct, where exactly one process executes at a time.

The master thread executes the user program until it reaches a parallel region. It then spawns one or more slave threads that begin executing code at the beginning of a parallel region. Each thread executes all the code in the region until a work-sharing construct is encountered. Each thread then executes some portion of the work sharing construct, and resumes executing the parallel region code. At the end of the parallel region, all the threads synchronize, and the master thread continues execution of the user program.

The PCF directives, summarized in Table 7-1, implement the general model of parallelism. They look like Fortran comments: always starting in column one, and beginning with a “C$PAR” in fixed-form source or “!$PAR”' in free-form source.

The compiler recognizes these directives when multiprocessing is enabled with either the –mp or -pfa driver option. If multiprocessing is not enabled, the compiler treats these statements as comments.

Table 7-1. Summary of PCF Directives




Ensures that each process waits until all processes reach the barrier before proceeding.



Ensures that the enclosed block of code is executed by only one process at a time by using a lock variable.



Encloses a parallel region, which includes work-sharing constructs and critical sections.


Precedes a single DO loop for which separate iterations are executed by different processes. This directive is equivalent to the C$DOACROSS directive.



Separate iterations of the enclosed loop are executed by different processes. This directive must be inside a parallel region.



Parcels out each block of code in turn to a process.


Signifies a starting line for an individual section within a parallel section.



Ensures that the enclosed block of code is executed by exactly one process.


Continues a PCF directive onto multiple lines.


Occasionally, the clauses in PCF directives are longer than one line. You can use the C$PAR & directive to continue a directive onto multiple lines. For example,

!$PAR PARALLEL local(i,j)
!$PAR& shared(a,n,index_x,index_y,cur_max,
!$PAR& big_max,bmax_x,bmax_y)

Parallel Region

A parallel region encloses any number of PCF constructs (described in “Using PCF Directives”). It signifies the boundary within which slave threads execute. Slave threads are created at entry to the region if necessary. A user program can contain any number of parallel regions. The syntax of the parallel region is

C$PAR PARALLEL [clause [[,] clause]...]

where valid clauses are

IF ( logical_expression )
{LOCAL | PRIVATE}(item [,item ...])
{SHARED | SHARE}(item [,item ...])

The IF, LOCAL, and SHARED clauses have the same meaning as in the C$DOACROSS directive (refer to “Writing Simple Parallel Loops”).

Note: The preferred form of the directive uses no commas between the clauses. The SHARED keyword is preferred over SHARE, and LOCAL is preferred over PRIVATE.

In Example 7-40, all threads enter the parallel region and call the subroutine named foo, passing the number of the thread executing the call.

Example 7-40. Subroutine With Parallel Region

   subroutine ex1(index)
      integer i
      i = mp_my_threadnum()
      call foo(i)

PCF Work-Sharing Constructs

The principal PCF constructs are the work-sharing constructs. (The other types are critical sections and barriers.) The work-sharing constructs direct the application of slave threads to code. The work-sharing constructs are:

  • parallel DO

  • PDO

  • parallel sections

  • single process

All master and slave threads synchronize at the bottom of any work-sharing construct. None of the threads continue past the end of the construct until they all have completed execution within that construct.

If specified, the PDO, parallel section, and single process constructs must appear inside of a parallel region, which creates the threads. The parallel DO construct cannot. Specifying a parallel DO construct inside of a parallel region produces a syntax error.

Parallel DO

The parallel DO construct is the same as the C$DOACROSS directive—it calls for parallelizing the single DO loop that immediately follows it.

Conceptually, parallel DO is the same as a parallel region containing exactly one PDO construct and no other code. Each thread inside the enclosing parallel region executes separate iterations of the loop within the parallel DO construct. The syntax of the parallel DO construct is

C$PAR PARALLEL DO [clause [[,] clause]...]

“Syntax of C$DOACROSS” describes valid values for clause. The only difference is in the MP_SCHEDTYPE=mode clause. For the C$PAR PARALLEL DO directive, the keyword MP_SCHEDTYPE= is optional; you can simply specify mode.


PDO is a generalization of parallel DO to loops of any kind. Each thread inside the enclosing parallel region executes a separate iteration of the loop within the PDO construct. The syntax of the PDO construct, which can only be specified within a parallel region, is

C$PAR PDO [clause [[,] clause]...]

where valid values for clause are

{LOCAL | PRIVATE} (item[,item ...])
{LASTLOCAL | LAST LOCAL} (item[,item ...]

LOCAL, LASTLOCAL, sched, and chunk have the same meaning as in the C$DOACROSS directive (refer to “Writing Simple Parallel Loops”). Note in particular that it is legal to declare a data item as LOCAL in a PDO even if it was declared as SHARED in the enclosing parallel region. The (ORDERED) clause is equivalent to a sched clause of DYNAMIC and a chunk clause of 1. The parentheses are required.

LASTLOCAL is preferred over LAST LOCAL and LOCAL is preferred over PRIVATE.

The END PDO directive is optional. If specified, this directive must appear immediately after the end of a loop. The optional NOWAIT clause specifies that each process should proceed directly to the code immediately following the directive. If you do not specify NOWAIT, the processes will wait until all have reached the directive before proceeding.

Example 7-41 shows an example of the PDO construct.

Example 7-41. Simple PDO Structure

subroutine ex2(a,n)
   real a(n)
!$PAR PARALLEL local(i) shared(a)
   do i = 1, n
      a(i) = a(i) + 1.0

The effect of this example could be achieved with a parallel DO or with a C$DOACROSS directive. In fact, the compiler recognizes this as a special case and generates the same (more efficient) code as for a C$ DOACROSS directive.

Parallel Sections

The PCF parallel sections construct is a parallel version of the Fortran 90 SELECT statement. Each block of code is parcelled out in turn to a separate thread. The syntax of the parallel sections construct is

C$PAR PSECTION[S] [clause]
      code] ...

The only valid value for clause is

{LOCAL | PRIVATE} (item [,item])

LOCAL is preferred over PRIVATE and has the same meaning as for the C$DOACROSS directive (refer to “Syntax of C$DOACROSS”). Note in particular that it is legal to declare a data item as LOCAL in a parallel sections construct even if it was declared as SHARED in the enclosing parallel region.

The optional NOWAIT clause specifies that each process should proceed directly to the code immediately following the directive. If you do not specify NOWAIT, the processes will wait until all have reached the END PSECTION directive before proceeding.

Parallel sections must appear within a parallel region. They can contain critical section constructs (described in “Critical Sections”) but cannot contain any of the following types of constructs:

  • PDO

  • parallel DO or C$ DOACROSS

  • single process

Each section is executed in parallel, depending on the number of processes available. The code blocks are assigned to threads one at a time, in the order specified. Each code block is executed by only one thread.

Example 7-42 illustrates parallel sections. The first thread to enter the parallel sections construct executes the first section; the second thread executes the second section; and a third thread, if one exists, executes the third section. If the parallel region is executed by only two threads, whichever thread first finishes its section executes the remaining section.

Example 7-42. Parallel Sections

subroutine ex3(a,n1,b,n2,c,n3)
   real a(n1), b(n2), c(n3)
!$PAR PARALLEL local(i) shared(a,b,c)
   do i = 1, n1
      a(i) = 0.0
   end do
   do i = 1, n2
      b(i) = 0.5
   call normalize(c,n3)
   do i = 1, n3
      c(i) = c(i) + 1.0

This example has only three sections, so if more than three threads are in the parallel region, the fourth and higher threads wait at the C$PAR END PSECTION directive until all threads are finished.

This example uses DO loops, but a parallel section can contain any code—provide it has no data dependency on other sections. Be aware of the significant overhead of a parallel construct. Make sure the amount of work performed is enough to outweigh the extra overhead.

The sections within a parallel sections construct are assigned to threads one at a time, from the top down. There is no other implied ordering to the operations within the sections. In particular, a later section cannot depend on the results of an earlier section, unless some form of explicit synchronization is used. If there is such explicit synchronization, you must be sure that the lexical ordering of the blocks is a legal order of execution.

Single Process

The single process construct, which can only be specified within a parallel region, ensures that a block of code is executed by exactly one process. The syntax of the single process construct is


The only valid value for clause is

{LOCAL | PRIVATE} (item [,item])

LOCAL is preferred over PRIVATE and has the same meaning as for the C$ DOACROSS directive (refer to “Syntax of C$DOACROSS”). Note in particular that it is legal to declare a data item as LOCAL in a single process construct even if it was declared as SHARED in the enclosing parallel region.

The optional NOWAIT clause specifies that each process should proceed directly to the code immediately following the directive. If you do not specify NOWAIT, the processes will wait until all have reached the directive before proceeding.

This construct is semantically equivalent to a parallel sections construct with only one section. The single process construct provides a more descriptive syntax.The first thread to reach the single process section executes the code in that block. All other threads wait at the end of the section until the code has been executed.

Critical Sections

The critical section construct protects a block of code with a lock so that it is executed by only one thread at a time. Another process arriving at the critical section must wait until the current process has finished it. Threads do not synchronize at the bottom of a critical section, as they do at the end of a work-sharing construct.

The critical section construct can appear anywhere in a program, inside and outside a parallel region and even within a C$ DOACROSS loop. The syntax of the critical section construct is

C$PAR CRITICAL SECTION [ ( lock_variable ) ]

The lock_variable is an optional integer variable that must be initialized to zero. The parenthesis are required around its name. If you do not specify lock_variable, the compiler automatically supplies one.

Multiple critical section constructs inside the same parallel region are normally independent of each other. However, if they use the same explicit lock_variable, they are linked and only one process can execute in any of the linked critical sections at one time.

Barrier Constructs

A barrier construct ensures that each process waits until all processes reach the barrier before proceeding. There is an implicit barrier at the end of each work-sharing construct (unless NOWAIT is specified). The syntax of the barrier construct is



The three work-sharing constructs, PDO, PSECTION, and SINGLE PROCESS, must be executed by all the threads executing in the parallel region (or none of the threads). The following is illegal:

         if (mp_my_threadnum() .gt. 5) then
            many_processes = .true.

This code will hang forever when run with enough processes. One or more process will be stuck at the C$PAR END SINGLE PROCESS directive waiting for all the threads to arrive. But threads with numbers less than 6 never take the appropriate branch, and never encounter the construct.

However, the following kind of simple looping is supported:

!$PAR PARALLEL local(i,j) shared(a)
         do i= 1,n
            do j = 2,n

The distinction here is that all of the threads encounter the work-sharing construct, they all complete it, and they all loop around and encounter it again.

Note that this restriction does not apply to the critical section construct, which operates on one thread at a time without regard to any other threads.

Parallel regions cannot be lexically nested inside of other parallel regions, nor can work-sharing constructs be nested. However, as an aid to writing library code, you can call an external routine that contains a parallel region even from within a parallel region. In this case, only the first region is actually run in parallel. Therefore, you can create a parallelized routine without accounting for whether it will be called from within an already parallelized routine.

A Few Words About Efficiency

The more general PCF constructs are typically slower than the special case parallelism offered by the C$DOACROSS directive. They are slower because of the extra synchronization required. When a C$DOACROSS loop executes, there is a synchronization point at entry and another at exit. When a parallel region executes, there is a synchronization point at entry to the region, another at each entry to a work-sharing construct, another at each exit from a work-sharing construct, and one at exit from the region. Thus, several separate C$DOACROSS loops typically execute faster than a single parallel region with several PDO constructs. Limit your use of the parallel region construct to those few cases that actually need it.