Chapter 8. Compiling and Debugging Parallel Fortran

This chapter gives instructions on how to compile and debug a parallel Fortran program. It contains the following sections:

  • “Compiling and Running”

  • “Profiling a Parallel Fortran Program”

  • “Debugging Parallel Fortran”

This chapter assumes you have read Chapter 7, “Optimizing for Multiprocessors,” and have reviewed the techniques and vocabulary for parallel processing in the IRIX environment.

Compiling and Running

After you have written a program for parallel processing, you should first debug your program in a single-processor environment by compiling it without parallel optimization. You can also debug your program using the CASEvision/WorkShop debugger, which is sold as a separate product. After your program has executed successfully on a single processor, you can compile it for multiprocessing.

To turn on multiprocessing, use the driver option –mp. This option causes the compiler to generate multiprocessing code for the particular files being compiled. When linking, you can combine object files produced with and without the –mp option. When you use either f90 or ld to link a program containing any object files compiled with –mp, you must again use –mp so that the correct libraries are linked.

Using the –static Option

The –static driver option causes procedure-local variables to be allocated statically rather than on the process stack, as they normally are. However, the multiprocessing implementation requires some use of the stack so that multiple threads of execution can run the same code simultaneously. Therefore, parallel code regions are effectively compiled with the –automatic option, even when the routine enclosing them is compiled with –static.

This means that SHARE variables in a parallel section behave with –static semantics, but that LOCAL variables in a parallel section do not (see “Using the LOCAL, LASTLOCAL, and SHARE Clauses”).
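For example, in the following sketch (the routine and variable names are hypothetical), the LOCAL variables i and tmp are given a separate stack copy in each thread, even if this file is compiled with –static, while the SHARE variables a, b, n, and scale refer to a single shared copy:

      subroutine scale_add(a, b, n, scale)
      integer n, i
      real a(n), b(n), scale, tmp
!$doacross local(i, tmp), share(a, b, n, scale)
      do i = 1, n
         tmp = scale * b(i)
         a(i) = a(i) + tmp
      end do
      end

If tmp were statically allocated, all the threads would update a single copy of it and race with one another; treating the parallel region as if it were compiled with –automatic is what prevents this.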

Finally, if a parallel region calls an external procedure, that procedure cannot be compiled with –static. As noted under “Parallel Procedure Calls”, to call a procedure that uses static variables from multiple, concurrent threads would create race conditions and incorrect results. You can mix static and multiprocessed object files in the same executable; the restriction is that static variables cannot be modified from within a parallel section.

Examples of Compiling

The following examples illustrate compiling code with –mp. The command line

% f90 –mp foo.f

compiles and links the Fortran program foo.f into a multiprocessor executable.

In this example

% f90 –c –mp –O2 snark.f

the Fortran routines in the file snark.f are compiled with multiprocess code generation enabled. The optimizer is also used. A standard snark.o object file is produced, which must then be linked:

% f90 –mp –o boojum snark.o bellman.o

Here, the –mp option signals the linker to use the Fortran multiprocessing library. The file bellman.o need not have been compiled with the –mp option, although it could have been.

After linking, the resulting executable can be run like any standard executable. Creating multiple execution threads, and running, synchronizing, and terminating them, are all handled automatically.

When an executable has been linked with –mp, the Fortran initialization routines determine how many parallel threads of execution to create. This determination occurs each time the program starts; the number of threads is not compiled into the code. The default is to use whichever is less: 4, or the number of processors on the machine. (This number is the value returned by the system call sysmp(MP_NAPROCS); see the sysmp(2) reference page.) You can override the default using the environment variable MP_SET_NUMTHREADS or a library call, as discussed under “Run-Time Control of Multiprocessing”.
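For example, to run the executable boojum from the preceding example with exactly two threads, regardless of how many processors the machine has, you could set the variable before starting the program (C shell syntax shown):

% setenv MP_SET_NUMTHREADS 2

% boojum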

Profiling a Parallel Fortran Program

After converting a program, you need to examine execution profiles to judge the effectiveness of the transformation. Good execution profiles are crucial for focusing your effort on the loops that consume the most time.

IRIX provides profiling tools that can be used on Fortran parallel programs. Both pixie (see the pixie(1) reference page) and pc-sample profiling can be used. On jobs that use multiple threads, both methods create a separate profile data file for each thread. You can use the standard profile analyzer prof (see the prof(1) reference page) to examine this output. The MIPS Compiling and Performance Tuning Guide has details about using prof and pixie.
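As an illustration, one possible pixie session on the executable from the earlier example is sketched below; the exact names of the per-thread data files vary, so check the pixie(1) and prof(1) reference pages for the details on your system:

% pixie boojum

% boojum.pixie

% prof –pixie boojum

The first command writes an instrumented copy of the program, boojum.pixie. Running that copy produces a counts file for each thread, and prof –pixie then reports where the counted instructions were spent.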

The profile of a Fortran parallel job is different from a standard profile. As mentioned in “DOACROSS Implementation”, to produce a parallel program, the compiler pulls the parallel DO loops out into separate subroutines, one routine for each loop. Each of these loops is shown as a separate procedure in the profile. Comparing the amount of time spent in each loop by the various threads shows how well the workload is balanced.

In addition to the loops, the profile shows the special routines that actually do the multiprocessing. The __mp_parallel_do routine is the synchronizer and controller. Slave threads wait for work in the routine __mp_slave_wait_for_work. The less time they wait, the more time they work. This gives a rough estimate of how parallel the program is.
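To picture why each loop gets its own entry, it may help to see, very roughly, what the compiler does to a DOACROSS loop. The sketch below is conceptual only; the names are illustrative and are not what the compiler actually generates. The body of the loop becomes a separate subroutine, and __mp_parallel_do arranges for each thread to run a portion of the iterations through it:

!     A conceptual sketch only; the generated routine's real name differs.
      subroutine foo_loop_1(a, b, c, lo, hi)
      integer lo, hi, i
      real a(*), b(*), c(*)
      do i = lo, hi
         a(i) = b(i) * c(i)
      end do
      end

Because such a routine appears as its own line in the profile, comparing the time the different threads spend in it is a direct measure of how evenly the iterations were divided.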

Debugging Parallel Fortran

This section presents some techniques to assist in debugging a parallel program.

General Debugging Hints

  • Debugging a multiprocessed program is much more difficult than debugging a single-processor program. Therefore you should do as much debugging as possible on the single-processor version.

  • Try to isolate the problem as much as possible. Ideally, try to reduce the problem to a single C$DOACROSS loop or PCF parallel section.

  • Before debugging a multiprocessed program, change the order of the iterations of the parallel DO loop in the single-processor version. If the loop can be multiprocessed, then the iterations can execute in any order and produce the same answer. If the loop cannot be multiprocessed, changing the order frequently causes the single-processor version to fail, and standard single-process debugging techniques can be used to find the problem.

Example 8-1 contains a bug: the two references to a have the indexes in reverse order. If the indexes were in the same order (if both were a(i,j) or both were a(j,i)), the loop could be multiprocessed. As written, there is a data dependency, so the C$DOACROSS is a mistake.

Example 8-1. Erroneous C$DOACROSS


!$doacross local(i,j)
   do i = 1, n
      do j = 1, n
         a(i,j) = a(j,i) + x*b(i)
      end do
   end do

Because a (correct) multiprocessed loop can execute its iterations in any order, the example could be rewritten as shown in Example 8-2.

Example 8-2. Corrected use of C$DOACROSS


!$doacross local(i,j)
      do i = n, 1, -1
         do j = 1, n
            a(i,j) = a(j,i) + x*b(i)
         end do
      end do

This loop no longer gives the same answer as the original, even when compiled without the –mp option, which reduces the failure to a normal single-process debugging problem. When a multiprocessed loop gives the wrong answer, make the following checks.

  • Check the LOCAL variables when the code runs correctly as a single process but fails when multiprocessed. Carefully check any scalar variables that appear in the left-hand side of an assignment statement in the loop to be sure they are all declared LOCAL. Be sure to include the index of any loop nested inside the parallel loop.

    A related problem occurs when you need the final value of a variable but the variable is declared LOCAL rather than LASTLOCAL. If the use of the final value happens several hundred lines farther down, or if the variable is in a COMMON block and the final value is used in a completely separate routine, a variable can look as if it is LOCAL when in fact it should be LASTLOCAL. To combat this problem, simply declare all the LOCAL variables LASTLOCAL when debugging a loop.

  • Check for EQUIVALENCE problems. Two variables of different names may in fact refer to the same storage location if they are associated through an EQUIVALENCE.

  • Check for the use of uninitialized variables. Some programs assume that uninitialized variables have the value 0. This works with the –static option, but without it, an uninitialized variable takes whatever value is left on the stack. When you compile with –mp, the program executes differently and the stack contents are different. Suspect this type of problem when a program compiled with –mp and run on a single processor gives a different result than when it is compiled without –mp. One way to track down a problem of this type is to compile suspected routines with –static. If an uninitialized variable is the problem, fix it by initializing the variable rather than by continuing to compile with –static.

  • Try compiling with the –C option for range checking on array references. If arrays are indexed out of bounds, a memory location may be referenced in unexpected ways. This is particularly true of adjacent arrays in a COMMON block.

  • If the analysis of the loop was incorrect, one or more arrays that are SHARE may have data dependencies. This sort of error is seen only when running multiprocessed code. When stepping through the code in the debugger, the program executes correctly. In fact, this sort of error often is seen only intermittently, with the program working correctly most of the time.

    The most likely candidates for this error are arrays with complicated subscripts. If the array subscripts are simply the index variables of a DO loop, the analysis is probably correct. If the subscripts are more involved, they are a good choice to examine first.

    If you suspect this type of error, as a final resort print out all the values of all the subscripts on each iteration through the loop. Then use the uniq command (see the uniq(1) reference page) to look for duplicates. If duplicates are found, there is a data dependency. A sketch of this technique follows.
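For example (the routine, array, and subscript names here are hypothetical), you might record every subscript actually used by the suspect loop while running on a single processor:

      subroutine suspect(a, b, lookup, n, x)
      integer n, i, k
      integer lookup(n)
      real a(n), b(n), x
      open (unit=90, file='subscripts.out')
      do i = 1, n
         k = lookup(i)
         write (90, *) k
         a(k) = a(k) + x*b(i)
      end do
      close (90)
      end

Sorting subscripts.out and piping it through uniq –d prints only the repeated values. Any output means that two different iterations reference the same element of a, so the loop has a data dependency and must not be multiprocessed as written.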