This chapter contains the following sections:
“Overview” explains when to optimize PFA execution.
“Controlling Code Execution” describes how to control whether PFA runs eligible loops in parallel.
“Controlling PFA Code Transformations” describes how to control the various transformations performed by PFA.
“Performing Inlining and Interprocedural Analysis” describes inlining and interprocedural analysis and explains how and when to perform these procedures.
“Controlling Fortran Language Elements” explains how to control standard Fortran elements with command line options to PFA.
“Controlling Directives and Assertions” explains how to override PFA directives and assertions with command line options.
“Controlling PFA I/O” explains how to customize the names of PFA input and output files.
“Obsolete Syntax” lists obsolete PFA command line options.
To customize how PFA executes an entire program, you can specify various command line options when you run PFA directly or when you specify PFA as part of a compile. Chapter 2, “How to Use PFA,” explains both procedures. For a complete summary of the PFA command line options, refer to Appendix A, “PFA Command Line Options.”
When you modify a program so that its loops can run in parallel, it is usually best to change the code so that PFA can run the loop in parallel automatically, rather than forcing the loop to run in parallel by inserting a C$DOACROSS directive yourself. If you force code to run in parallel, you (and not PFA) must verify that no subsequent modification introduces data dependencies. Forcing code that contains data dependencies to run in parallel can produce serious (and difficult-to-find) errors. Rewriting the loop so that PFA recognizes it as safe to run in parallel allows PFA to check future modifications for potential data dependencies.
This section describes how to control whether eligible loops are run in parallel and how to specify a work threshold for loops.
The -CONCURRENTIZE option (or -C) converts eligible loops to run in parallel. This is the default value for this option. The -NOCONCURRENTIZE option (or -NCONC) prevents PFA from converting loops to run in parallel.
The -MINCONCURRENT=n option (or -MC=n) specifies the minimum amount of work needed inside the loop to make executing a loop in parallel profitable. The integer n is a count of the number of operations (for example, add, multiply, load, store) in the loop, multiplied by the number of times the loop will be executed.
If the loop does not contain at least this much work, the loop will not be run in parallel. If the loop bounds are not constants, an IF clause will be automatically added to the PFA-generated C$DOACROSS directive to test at run time if sufficient work exists.
If you do not specify this option, PFA runs all loops containing 500 or more operations in parallel.
For example, given the original loop
      do 2 i = 1,n
        x(i) = y(i) * z(i)
    2 continue
PFA generates the following transformed loop:
C$DOACROSS IF (N .GT. 100), SHARE (N,X,Y,Z), LOCAL(I)
      DO 3 I=1,N
        x(i) = y(i)*z(i)
    3 CONTINUE
The IF clause ensures that n is large enough to make running the loop in parallel profitable (otherwise, the loop runs serially). If the loop bound is a small constant (such as 10) instead of n, PFA does not generate a DOACROSS directive for the loop, and the listing file states that the loop does not contain enough work. Conversely, if the bound is a large constant (such as 1000), PFA generates the DOACROSS directive without the IF clause.
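The work estimate behind this threshold is simple arithmetic, and a rough sketch may make it easier to picture. The program below is only an illustration: the per-iteration operation count NOPS is an assumed value chosen for the example (PFA's own counting rules are not spelled out here), and the function name ENOUGH is hypothetical. MINWRK is set to 500, the default threshold given above.

```fortran
      program worktest
c     Illustrative only: NOPS is an assumed per-iteration count,
c     not a value reported by PFA.
      implicit none
      integer minwrk, nops
      parameter (minwrk = 500, nops = 5)
      logical enough
      external enough
      print *, 'n =   10: ', enough(10, nops, minwrk)
      print *, 'n = 1000: ', enough(1000, nops, minwrk)
      end

      logical function enough(n, nops, minwrk)
c     True when (operations per iteration) * (iterations) reaches
c     the minimum amount of work.
      implicit none
      integer n, nops, minwrk
      enough = (n * nops) .ge. minwrk
      end
```

With these assumed numbers, a trip count of 10 falls well below the threshold and a trip count of 1000 clears it, which corresponds to the small-constant and large-constant cases described above.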
This section discusses the various ways in which you can control the standard transformations that PFA performs.
You can control the thresholds for internal table size and routine complexity in order to analyze larger and more complex routines.
The -ARCLIMIT=n option (or -ARCLM=n) controls the size of the internal table used to store data dependence information (arcs). If this table overflows, PFA stops analyzing the loop and the PFA listing file shows the message
too many stmts/dd arcs
Increasing ARCLIMIT might allow PFA to analyze the loop but at the cost of additional processing time.
The -LIMIT= n option (or -LM=n) controls the amount of time PFA can spend trying to determine whether a loop is safe to run in parallel. PFA estimates how much time is required to analyze each loop nest construct. If an outer loop looks like it would take too much time to analyze, PFA ignores the outer loop and recursively visits the inner loops.
Larger limits often allow PFA to generate parallel code for deeply nested loop structures that it might not otherwise be able to run safely in parallel. However, with larger limits PFA can also take more time to analyze a program. (The limit does not correspond to the DO loop nest level. It is an estimate of the number of loop orderings that PFA can generate from a loop nest.) This option has the same effect as the global C*$* LIMIT(n) directive.
Note: You do not usually need to change these limits.
The -OPTIMIZE=n option (or -O=n) sets the optimization level. The higher the optimization level, the more of the code PFA optimizes and the longer PFA runs. Programs written to run in parallel often do not need advanced transformations; for these programs, a lower optimization level is enough. Valid values for n are
0   Avoids converting loops to run in parallel.
1   Converts loops to run in parallel without using advanced data dependence tests. Enables loop interchanging.
2   Determines when scalars need last-value assignment using lifetime analysis. Also uses more powerful data dependence tests to find loops that can run safely in parallel. This level allows reductions in loops that execute concurrently but only if the -ROUNDOFF option is set to 2. (Refer to the following section for details about the -ROUNDOFF option.)
3   Breaks data dependence cycles using special techniques and additional loop interchanging methods, such as interchanging triangular loops. This level also implements special-case data dependence tests.
4   Generates two versions of a loop, if necessary, to break a data-dependent arc. This level also implements more-exact data dependence tests and allows special index sets (called wraparound variables) to convert more code to run in parallel.
5   Fuses two adjacent loops if it is legal to do so (that is, there are no data dependencies) and if the loops have the same control values. In certain limited cases, this level recognizes arrays as local variables. This level is the default.
This option has the same effect as the global C*$* OPTIMIZE(n) directive described in Chapter 5, “Fine-Tuning PFA.”
Note: If you want to use the -UNROLL command line option, set the -OPTIMIZE option to 4 or higher (the default optimization level is above this threshold).
The -ROUNDOFF=n option controls round-off transformations. The values of n described here are
0   Suppresses any round-off transformations. This is the default.
2   Allows reductions to be performed in parallel. The valid reduction operators are addition, multiplication, min, and max. This value is one of the most commonly specified user options.
3   Recognizes REAL induction variables. Permits memory management transformations (refer to “Memory Management Transformations”).
When executing reductions in parallel, PFA processes values in a different order from the original serial code. Round-off errors accumulate differently and produce a slightly different answer. Some algorithms are sensitive to this variation, and so, by default, PFA does not run reductions in parallel. Usually, these tiny variations are irrelevant, and you can allow PFA to process a reduction in parallel allowing more loops to be run in parallel.
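The effect of summation order can be seen without any parallel hardware. The following sketch is a hand-written illustration, not PFA output: it adds the same single-precision values once in serial order and once as four partial sums that are combined at the end, which is roughly how a reduction split across four processes might proceed. The array size, the number of chunks, and the test data are all arbitrary choices made for this example.

```fortran
      program redsum
c     Compare a serial sum with a chunked sum of the same data.
      implicit none
      integer n, nchunk
      parameter (n = 100000, nchunk = 4)
      real x(n), serial, part(nchunk), par
      integer i, j, lo, hi
      do i = 1, n
         x(i) = 1.0 / real(i)
      enddo
c     Serial order: one running total over all elements.
      serial = 0.0
      do i = 1, n
         serial = serial + x(i)
      enddo
c     Chunked order: one partial sum per chunk, then combine.
      par = 0.0
      do j = 1, nchunk
         part(j) = 0.0
         lo = (j - 1)*(n/nchunk) + 1
         hi = j*(n/nchunk)
         do i = lo, hi
            part(j) = part(j) + x(i)
         enddo
         par = par + part(j)
      enddo
      print *, 'serial sum  =', serial
      print *, 'chunked sum =', par
      print *, 'difference  =', serial - par
      end
```

Both loops compute the same mathematical sum, but the two totals may disagree in the low-order digits because the rounding errors accumulate in a different order.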
The -SCALAROPT=n option controls the scalar transformations that PFA performs. Valid values for n are
0   Performs no scalar transformations.
1   Enables dead code elimination, pulling loop invariants, forward substitution, and conversion of IF-GOTO into IF-THEN-ELSE.
2   Enables induction variable recognition, loop unrolling, loop fusion, array expansion, scalar promotion, and floating invariant IF tests. (Loop fusion also requires -OPTIMIZE=5.)
3   Enables the memory management transformations (refer to “Memory Management Transformations”). (Memory management also requires -ROUNDOFF=3.) This is the default value.
The -UNROLL=n option (or -UR=n) unrolls scalar inner loops when PFA cannot run the loops in parallel. n specifies the number of times to replicate the loop body. The default is 4. Specify a small power of two for the unroll value, such as two, four, or eight. Disable unrolling by setting -UNROLL=1.
The -UNROLL2=m option (or -UR2=m) allows you to adjust the number of operations used by the -UNROLL option. Selecting a larger value for -UNROLL2 allows PFA to unroll loops containing more calculations. This form of unrolling applies only to the innermost loops in a nest of loops. You can unroll loops whether they execute serially or concurrently.
PFA counts the number of array references and arithmetic operations in the loop. It unrolls the loop until it reaches either the number of operations specified by the -UNROLL2 option or the number of iterations specified by -UNROLL.
When PFA unrolls a loop, it replicates the body of the loop a certain number of times, making the loop run faster. However, unrolling loops also increases the program size.
For example, if the original program is
      do i = 1,100
        a(i) = b(i) + c(i)*d(i)
      enddo
the unrolled program (unrolling of order 4) is
      do i = 1,100,4
        a(i) = b(i) + c(i)*d(i)
        a(i+1) = b(i+1) + c(i+1)*d(i+1)
        a(i+2) = b(i+2) + c(i+2)*d(i+2)
        a(i+3) = b(i+3) + c(i+3)*d(i+3)
      enddo
The second (unrolled) version runs faster than the original version. The reason for the improvement is that SGI processors have separate add and multiply hardware, allowing addition and multiplication operations to run simultaneously. In the original program, the processor has to do the multiplication, wait for it to complete, then do the addition. In the second case, the processor can do the first multiplication, wait for it to complete, then overlap the second multiplication and the first addition, then the third multiplication and the second addition, and so on.
The additions require nearly no additional time because all but the last one are completed within the time it takes the (previous) multiplication to complete. If the loop already contains many computations (for example, many lines of code, many additions and multiplications), then unrolling it might help a little but not much.
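The example above uses a trip count of 100, which divides evenly by the unroll factor. When it does not, an unrolled loop needs a short cleanup loop for the leftover iterations. The following hand-written sketch shows one common way to arrange this; it is an illustration under assumed values (a trip count of 103 and simple test data), not the code PFA generates.

```fortran
      program unroll4
c     Unroll by 4 with a cleanup loop, then check the result.
      implicit none
      integer n
      parameter (n = 103)
      real a(n), a2(n), b(n), c(n), d(n)
      integer i, nmain
      logical same
      do i = 1, n
         b(i) = real(i)
         c(i) = 0.5
         d(i) = real(n - i)
      enddo
c     Original (rolled) loop.
      do i = 1, n
         a(i) = b(i) + c(i)*d(i)
      enddo
c     Unrolled loop covers the largest multiple of 4 iterations.
      nmain = n - mod(n, 4)
      do i = 1, nmain, 4
         a2(i)   = b(i)   + c(i)*d(i)
         a2(i+1) = b(i+1) + c(i+1)*d(i+1)
         a2(i+2) = b(i+2) + c(i+2)*d(i+2)
         a2(i+3) = b(i+3) + c(i+3)*d(i+3)
      enddo
c     Cleanup loop handles the remaining 0 to 3 iterations.
      do i = nmain + 1, n
         a2(i) = b(i) + c(i)*d(i)
      enddo
      same = .true.
      do i = 1, n
         if (a(i) .ne. a2(i)) same = .false.
      enddo
      print *, 'unrolled result matches: ', same
      end
```

Each element is computed by the same expression in both versions, so only the loop structure changes, not the values stored.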
When -ROUNDOFF and -SCALAROPT are both set to 3, PFA attempts to do outer loop unrolling (to improve register utilization) and automatic loop blocking (also called tiling) to improve cache utilization.
Outer loop unrolling is a standard hand-optimization technique. Note that the -UNROLL and -UNROLL2 options apply to inner-loop unrolling. Outer-loop unrolling can occur even if inner-loop unrolling is disabled.
Loop blocking is a complex transformation that is applicable when the loop nesting depth is greater than the dimensions of the data arrays being manipulated. The canonical example is the simple matrix multiply, where a three-deep nest of loops operates on two-dimensional arrays.
The simple method repeatedly sweeps over the entire array. If the array is too large to fit into the cache, this can result in a large amount of memory traffic. A better method is to break the arrays up into blocks, where each block is small enough to fit into the cache, and then sweep over each block in turn (rather than over the whole array). The code to do this is often ugly and complicated. PFA attempts to ease the burden of writing block-style algorithms by automatically generating the block version from the simple version. Note, however, that blocking does not help the more common case where the algorithm touches each array element exactly once (for example, a two-dimensional array inside of a two-deep loop nest). Because in this case the data is not being reused, blocking does not apply.
For example, given the loop nest
      do k = 1,n
        do j = 1,n
          do i = 1,n
            a(i,j) = a(i,j) + b(i,k)*c(k,j)
          enddo
        enddo
      enddo
using the option -r=3, PFA produces the listing below:
      II3 = 1
      II1 = MOD (N - 1, 682) + 1
      II2 = II1
      II10 = N - 7
      II11= (II10 + 7) / 8
      DO 4 II4=1, N, 682
        II8 = II3 + II2 - 1
        DO 2 K=1, II10, 8
C$DOACROSS SHARE(N,K,C,II3,II8,A,B),LOCAL(DD1,DD2,
C$& DD3, DD4,DD5,DD6,DD7,DD8,DD9,J,I)
          DO 2 J=1,N
            DD2 = C(K,J)
            DD3 = C(K+1,J)
            DD4 = C(K+2,J)
            DD5 = C(K+3,J)
            DD6 = C(K+4,J)
            DD7 = C(K+5,J)
            DD8 = C(K+6,J)
            DD9 = C(K+7,J)
            DO 2 I=II3, II8, 1
              DD1 = A(I,J)
              DD1 = DD1 + B(I,K) * DD2
              DD1 = DD1 + B(I, K+1) * DD3
              DD1 = DD1 + B(I, K+2) * DD4
              DD1 = DD1 + B(I, K+3) * DD5
              DD1 = DD1 + B(I, K+4) * DD6
              DD1 = DD1 + B(I, K+5) * DD7
              DD1 = DD1 + B(I, K+6) * DD8
              DD1 = DD1 + B(I, K+7) * DD9
              A(I,J) = DD1
    2   CONTINUE
        II7 = II11 * 8 + 1
        II9 = II3 + II2 - 1
        DO 3 K=II7, N, 1
C$DOACROSS SHARE(N,K,C,II3,II9,A,B),LOCAL(DD10,J,I)
          DO 3 J=1,N
            DD10 = C(K,J)
            DO 3 I=II3,II9,1
              A(I,J) = A(I,J) + B(I,K) * DD10
    3   CONTINUE
        II3 = II3 + II2
        II2 = 682
    4 CONTINUE
Obviously, PFA's version is more complicated than the original, but it runs significantly faster.
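For comparison, a much smaller hand-blocked version of the same matrix multiply may make the idea easier to follow. This is a sketch written for illustration only: the block size NB, the matrix size, and the test data are arbitrary assumptions, and the loop structure is one simple way to block by hand rather than a reproduction of what PFA generates.

```fortran
      program blkmm
c     Hand-written blocking sketch; NB is an arbitrary block size
c     chosen for illustration, not a value PFA would pick.
      implicit none
      integer n, nb
      parameter (n = 50, nb = 16)
      real a(n,n), a2(n,n), b(n,n), c(n,n)
      integer i, j, k, ii, kk
      logical same
      do j = 1, n
         do i = 1, n
            b(i,j) = real(mod(i + j, 7))
            c(i,j) = real(mod(i * j, 5))
            a(i,j) = 0.0
            a2(i,j) = 0.0
         enddo
      enddo
c     Simple version: each value of k sweeps the whole of A.
      do k = 1, n
         do j = 1, n
            do i = 1, n
               a(i,j) = a(i,j) + b(i,k)*c(k,j)
            enddo
         enddo
      enddo
c     Blocked version: reuse one NB-by-NB piece of B for all j
c     before moving on to the next piece.
      do kk = 1, n, nb
         do ii = 1, n, nb
            do k = kk, min(kk + nb - 1, n)
               do j = 1, n
                  do i = ii, min(ii + nb - 1, n)
                     a2(i,j) = a2(i,j) + b(i,k)*c(k,j)
                  enddo
               enddo
            enddo
         enddo
      enddo
      same = .true.
      do j = 1, n
         do i = 1, n
            if (a(i,j) .ne. a2(i,j)) same = .false.
         enddo
      enddo
      print *, 'blocked result matches: ', same
      end
```

The MIN calls handle the partial blocks at the array edges when the matrix size is not a multiple of the block size. For each element of the result, the blocked loops still visit k in ascending order, so the two versions accumulate the same terms in the same order.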
Function and subroutine calls create an obstacle to parallelization. PFA provides three ways of dealing with this obstacle:
Assert that the external routine is safe for concurrent execution (see “C*$* ASSERT CONCURRENT CALL”).
Inline the routine by replacing the call to the external routine with the actual code.
Perform interprocedural analysis (IPA) by analyzing the external routine ahead of time and using the results of that analysis when a reference to the routine is encountered.
Inlining and IPA tend to be slow, memory-intensive operations. Attempting to inline all routines everywhere they occur can take a lot of time and use a lot of system resources. Inlining should usually be restricted to a few time-critical places.
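To make the inlining option concrete, the sketch below shows a loop whose body is hidden behind a call, followed by the same loop after the call has been replaced by the routine's code by hand. The routine name SCALE1 and its contents are hypothetical and exist only for this illustration; the sketch does not show PFA output.

```fortran
      program inldemo
c     Hand-written illustration; SCALE1 is a hypothetical routine.
      implicit none
      integer n
      parameter (n = 8)
      real x(n), y1(n), y2(n)
      integer i
      logical same
      do i = 1, n
         x(i) = real(i)
      enddo
c     Version 1: the loop body is hidden behind a call.
      do i = 1, n
         call scale1(x(i), y1(i))
      enddo
c     Version 2: the call replaced by the routine's code.
      do i = 1, n
         y2(i) = 2.0*x(i) + 1.0
      enddo
      same = .true.
      do i = 1, n
         if (y1(i) .ne. y2(i)) same = .false.
      enddo
      print *, 'inlined loop matches: ', same
      end

      subroutine scale1(xin, yout)
      implicit none
      real xin, yout
      yout = 2.0*xin + 1.0
      end
```

In the second version, everything the loop reads and writes is visible in the loop body itself, which is the property that inlining is meant to provide.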
This section discusses the three steps for inlining or IPA:
Specify which routines will be inlined (or interprocedurally analyzed).
Specify which source files and libraries will be searched to find the routines.
Specify which occurrences of those routines are to be inlined (or analyzed).
If you do not specify list for the -INLINE option, PFA attempts to inline all eligible routines.
The options listed in Table 4-1 tell PFA where to search for the routines specified with the -INLINE or -IPA option. If you do not specify either option, PFA searches the current source file by default.
Long Option Name
Short Option Name
Current Source File
Current Source File
If one of the names in list is a directory, then all appropriate files in that directory will be used. PFA assumes files with the extension .f are Fortran source and files with the extension .klib are PFA-produced libraries.
Specify multiple files and directories with the same option by using a colon-separated list. For example,
Note: These options by themselves do not initiate inlining or IPA. They only specify where to look for the routines. Use them in conjunction with the appropriate -INLINE or -IPA option.
When performing inlining and IPA, PFA analyzes the routines in the source program. Normally, inlining is done directly from a source file. However, when inlining the same set of routines in many different programs, it is more efficient to create a preanalyzed library of the routines. Use the -INLINE_CREATE=name option (or -INCR=name) to create a library of prepared routines (for later use with the -INLINE_FROM_LIBRARIES option). PFA gives the library file it creates the name you specify; for maximum compatibility, use the filename extension .klib (for example, samp.klib).
The library used to do IPA does not have to be generated from the same source that will be linked into the running program. Using this capability can cause errors, but it can also be useful. For example, you could write a library of hand-optimized assembly language routines, then construct a PFA-compatible IPA library using Fortran routines that mimic the behavior of the assembly code. Thus, you can do parallelism analysis with IPA correctly but still call the hand-optimized assembly routines. Use the following procedure to create and use a PFA library:
Create a library by passing the source program directly through pfa. Library creation is done by PFA and should not be done at the same time as an ordinary compilation. For example, the following command line creates a library called samp.klib for the source program samp.f:
% /usr/lib/pfa -INLINE_CREATE=samp.klib samp.f
Compile the program with pfa:
% f77 -pfa keep -WK,-INFL=samp.klib samp.f
Note: Libraries created for inlining contain complete information and can be used for inlining or IPA. Libraries created for IPA contain only summary information and can be used only for IPA.
The loop level, depth, and manual options control which occurrences of the routines specified with the -INLINE or -IPA option are actually inlined or analyzed.
The -INLINE_LOOPLEVEL=n (or -INLL=n) and -IPA_LOOPLEVEL=n (or -IPALL=n) options limit PFA to occurrences within deeply nested loops. A value of 1 restricts PFA to routines called at the most deeply nested level; a value of 2 restricts PFA to the deepest and second-deepest levels; and so on.
To determine which level is most deeply nested, PFA constructs a call graph to account for nesting due to loops that occur farther up the call chain. If you do not specify either option, the loop level is 10.
The -INLINE_DEPTH=n (or -IND) option restricts the number of times PFA will continue to attempt inlining on already inlined routines. For example, suppose you use PFA to inline the routine foo. However, foo itself contains a call to bar. Should PFA now attempt a second inlining depth and inline bar? And if bar calls baz, should PFA inline three deep? This option provides control over this process, as routines are only inlined to the specified depth.
As a special case, if you specify the value -1, only routines that do not reference other routines are inlined (that is, only leaf routines are inlined). Note that the extension to -2, -3, and so on is not supported; only -1 is. Note also that there is no -IPA_DEPTH option.
The -INLINE_MAN option turns on recognition of the C*$*INLINE directive. This directive (described in Chapter 5, “Fine-Tuning PFA”) allows you to select individual occurrences of routines to be inlined. -IPA_MAN is the analogous option for the C*$*IPA directive (also described in Chapter 5, “Fine-Tuning PFA”).
Several conditions make a routine ineligible for inline expansion or IPA:
Dummy arguments do not match the actual arguments in number, type, shape, or size.
The calling program and called routine have conflicting declarations for the same COMMON block.
The calling program and the called routine have conflicting EQUIVALENCE statements.
The routine to be inlined has a SAVE, ENTRY, or NAMELIST statement.
The routine to be inlined has a DATA loaded variable.
The routine to be inlined is too long (the limit is about 600 lines).
This section explains how to control various Fortran 77 language elements.
Allows equivalence variables to refer to the same memory location inside one loop. For more information, see Chapter 5, “Fine-Tuning PFA.”
Instructs PFA to use a temporary variable within the optimized loop and assign the last value to the original scalar if PFA determines that the scalar can be reused before it is assigned. This value is important when a scalar is assigned in a loop run in parallel. For more information, see Chapter 5, “Fine-Tuning PFA.”
Allows for parameter aliasing in a subprogram. For more information, see Chapter 5, “Fine-Tuning PFA.”
By default, PFA assumes that a program conforms to the ANSI (and VMS) standard; therefore, the default is -ASSUME=EL.
The -DLINES option tells PFA to treat the letter D in column one as if the letter were a character space. PFA then parses the rest of that line as a normal Fortran 77 statement. The -NODLINES option tells PFA to treat these lines as though they were comments. These options are useful for excluding or including debugging lines. f77 passes this option to PFA automatically when you specify the f77 -d_lines option.
This option, which is the default, does not execute a DO loop whose termination condition is initially satisfied. f77 passes the -ONETRIP option to PFA automatically when you specify the f77 -one_trip option.
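The default behavior described above can be checked with a very small test. The program below is a hand-written sketch (the variable name NTRIPS is chosen only for this example): the loop's initial value already exceeds its limit, so under Fortran 77 rules the body is skipped.

```fortran
      program ztrip
c     Count how many times a DO loop body runs when its initial
c     value already exceeds its limit.
      implicit none
      integer i, ntrips
      ntrips = 0
      do 10 i = 5, 1
         ntrips = ntrips + 1
   10 continue
      print *, 'iterations executed: ', ntrips
      end
```

With zero-trip semantics, the program reports zero iterations. A compiler operating in one-trip mode would execute the body once before testing the limit and would report one iteration.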
Performs a lifetime analysis on a procedure's variables to determine those that need to have their value saved across invocations of the procedure. When it finds such a variable, PFA generates a SAVE statement for the variable.
Does not generate SAVE statements. This is the default value.
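As an illustration of the kind of variable this analysis is concerned with, the sketch below shows a routine whose local variable must keep its value from one call to the next. The routine name NEXTID and the variable LAST are hypothetical, and the SAVE statement here is written by hand; the example does not show what PFA itself emits.

```fortran
      program savdemo
c     Call a routine that relies on a saved local variable.
      implicit none
      integer k, nextid
      external nextid
      do k = 1, 3
         print *, 'call', k, ' returned', nextid()
      enddo
      end

      integer function nextid()
c     LAST must keep its value between calls, so it is saved.
      implicit none
      integer last
      save last
      data last /0/
      last = last + 1
      nextid = last
      end
```

Because LAST is saved, successive calls return 1, 2, and 3. Without the SAVE statement, a Fortran 77 compiler is not required to preserve the value of LAST between invocations.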
The -SCAN=n option controls the number of columns that PFA assumes to be significant. PFA ignores anything beyond the specified column number. The default value for n is 72. Specifying any of the following f77 options automatically sets this option: -col72, -col120, or -extend_source.
Setting the -SYNTAX=c option (or -SY=c) alters the interpretation of the Fortran input to be in compliance with other standards. c is one of the following values:
Interprets the source in strict compliance with the ANSI Fortran 77 standard.
Interprets the source in compliance with the VMS Fortran standard but without the additional SGI extensions.
If you do not specify this option, PFA uses the same rules as the standard SGI Fortran compiler (refer to the Fortran 77 Programmer's Guide for details).
This section discusses the options you can use to select whether PFA accepts a specific directive or assertion. You can use these options to override directives and assertions that are specified in the source program.
The -DIRECTIVES=list option specifies the directives and assertions to accept. The -NODIRECTIVES option tells PFA to ignore all directives and assertions. This option is useful when you suspect unsafe directives are causing problems with program execution.
Note: Some directives are called assertions because they assert program characteristics that PFA cannot verify. (For example, an assertion could assert that subroutine x contains no data dependencies.) Even so, you might want PFA to use such an assertion when optimizing. Refer to Chapter 1, “Overview of PFA,” for more information about directives and assertions.
Valid values for list are any combination of the values
Accepts Cray CDIR$ directives; CDIR$IVDEP ignores certain data dependencies in a loop. But because of differences between SGI hardware and a Cray machine, these data dependencies are not always safe to ignore on SGI hardware. To be safe, PFA does not recognize the CDIR$IVDEP directive by default. You can, at your own risk, turn on Cray-directive recognition, which will cause PFA to treat this Cray directive as if it were a C*$*ASSERT DO (CONCURRENT) assertion.
Accepts C*$* directives.
Accepts C$ directives. PFA recognizes the directives C$DOACROSS, C$, and C$&. (For more information, see the Fortran 77 Programmer's Guide.) If a C$DOACROSS directive appears, PFA does not examine or alter the loop to which the directive applies. This allows you to mix code that you converted to parallel execution with code that PFA converted to parallel execution.
For example, specifying -DIRECTIVES=K enables PFA directives only, whereas -DIRECTIVES=CK enables both Cray and PFA directives. Adding A to the DIRECTIVES sequence also enables PFA assertions. Any combination of options is acceptable.
If you do not specify either option, PFA will accept all assertions, PFA C*$* directives, all C$ directives, and VAST CVD$ directives.
This section describes command line options you can use to name PFA input and output. You do not need to use these options unless you want to change the default names. In particular, some versions of the make(1) utility assume that files ending in .l are lex(1) input files. To perform automatic makes without overwriting the PFA listing file, use a different suffix for the listing filename.
Use the -INPUT=file.f option to specify the Fortran source file that PFA reads as input. If you do not specify this option, PFA assumes that a command line argument not preceded by a dash is the input filename.
The -FORTRAN=file option specifies the name of the PFA intermediate file (that is, the transformed source). If you do not specify this filename, PFA names the intermediate file file.m, where file is the name of the input file. For details about the intermediate file, refer to Chapter 3, “Utilizing PFA Output.”
The -LIST=file option specifies the name of the PFA listing file. If you do not specify this filename, PFA names the listing file file.l, where file is the name of the input file. For details about the listing file, refer to Chapter 3, “Utilizing PFA Output.”
Table 4-2 lists obsolete PFA command line options.
Long Option Name
Short Option Name
PFA now accepts new syntax for some of the command line options (particularly the syntax for inlining). For compatibility with older versions, these options are translated into their newer equivalents in Table 4-3. Whenever possible, avoid the older syntax; support for it might be withdrawn in the future.