Chapter 4. Customizing PFA Execution

This chapter contains the following sections:

Overview

To customize how PFA executes an entire program, you can specify various command line options when you run PFA directly or when you specify PFA as part of a compile. Chapter 2, “How to Use PFA,” explains both procedures. For a complete summary of the PFA command line options, refer to Appendix A, “PFA Command Line Options.”

Controlling Code Execution

When modifying a program so that its loops can run in parallel, change the code so that PFA can run each loop in parallel automatically. Avoid forcing a loop to run in parallel by directly inserting a C$DOACROSS directive. If you force code to run in parallel, you (and not PFA) must verify that no subsequent modification introduces data dependencies. Forcing code that contains data dependencies to run in parallel can produce serious (and difficult-to-find) errors. Rewriting the loop so that PFA recognizes it as safe to run in parallel allows PFA to check future modifications for potential data dependencies.
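The following sketch illustrates the kind of data dependence that makes a forced directive dangerous. It is not PFA output; the program name, array names, and sizes are hypothetical. The same recurrence is executed once in its original order and once in reverse order, with the reversed loop standing in for a parallel run in which later iterations execute before earlier ones.

```fortran
      program deptst
c     Illustrative sketch; all names and sizes are hypothetical.
      integer n
      parameter (n = 8)
      real b(n), c(n), e(n)
      integer i
      do 10 i = 1, n
         b(i) = real(i)
         c(i) = 1.0
         e(i) = 1.0
 10   continue
c     Recurrence: iteration i reads c(i-1), written by iteration i-1.
c     Run in the original order, c(n) accumulates every b(i).
      do 20 i = 2, n
         c(i) = c(i-1) + b(i)
 20   continue
c     The same statement run in reverse order, standing in for a
c     parallel run in which later iterations execute first.
      do 30 i = n, 2, -1
         e(i) = e(i-1) + b(i)
 30   continue
      print *, 'original order: c(n) =', c(n)
      print *, 'reordered:      e(n) =', e(n)
      end
```

In the original order c(n) is 36, while the reordered loop leaves e(n) at 9, because each iteration reads a neighbor that has not yet been updated. A loop of this form is one that should not be forced to run in parallel.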

This section describes how to control whether eligible loops are run in parallel and how to specify a work threshold for loops.

Running Code in Parallel

The -CONCURRENTIZE option (or -CONC) converts eligible loops to run in parallel. This is the default. The -NOCONCURRENTIZE option (or -NCONC) prevents PFA from converting loops to run in parallel.

Specifying a Work Threshold

The -MINCONCURRENT=n option (or -MC=n) specifies the minimum amount of work needed inside a loop to make executing it in parallel profitable. The integer n is a count of the number of operations (for example, add, multiply, load, store) in the loop, multiplied by the number of times the loop will be executed.

If the loop does not contain at least this much work, the loop will not be run in parallel. If the loop bounds are not constants, an IF clause is automatically added to the PFA-generated C$DOACROSS directive to test at run time whether sufficient work exists.

If you do not specify this option, PFA runs all loops containing 500 or more operations in parallel.

For example, given the original loop

      do 2 i =1,n
         x(i) = y(i) * z(i)
2     continue

PFA generates the following transformed loop:

C$DOACROSS IF (N .GT. 100), SHARE (N,X,Y,Z), LOCAL(I)
      DO 3 I=1,N
         x(i) = y(i)*z(i)
3     CONTINUE

The IF clause ensures that n is large enough to make running the loop in parallel profitable (otherwise, the loop runs serially). If the loop bound is a small constant (such as 10) instead of n, PFA does not generate a C$DOACROSS directive for the loop, and the listing file states that the loop does not contain enough work. Conversely, if the bound is a large constant (such as 1000), PFA generates the C$DOACROSS directive without the IF clause.
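The run-time test performed by the IF clause is roughly equivalent to choosing by hand between two copies of the loop. The sketch below assumes the threshold of 100 from the example above; the program name thresh, the routine name vmul, and the array size are hypothetical, and the code is an illustration rather than what PFA generates.

```fortran
      program thresh
c     Hand-written sketch of the effect of the IF clause; the
c     routine name and array size are hypothetical.
      integer nmax
      parameter (nmax = 200)
      real x(nmax), y(nmax), z(nmax)
      integer i
      do 10 i = 1, nmax
         y(i) = real(i)
         z(i) = 2.0
 10   continue
      call vmul(5, x, y, z)
      call vmul(nmax, x, y, z)
      print *, x(1), x(nmax)
      end

      subroutine vmul(n, x, y, z)
      integer n, i
      real x(n), y(n), z(n)
      if (n .gt. 100) then
c        Enough work: this copy carries the directive.
C$DOACROSS SHARE(N,X,Y,Z), LOCAL(I)
         do 20 i = 1, n
            x(i) = y(i) * z(i)
 20      continue
      else
c        Too little work: plain serial loop, no directive.
         do 30 i = 1, n
            x(i) = y(i) * z(i)
 30      continue
      endif
      end
```

A compiler that does not recognize C$DOACROSS treats the directive as a comment, so both branches compute the same result; only the execution strategy differs.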

Controlling PFA Code Transformations

This section discusses the various ways in which you can control the standard transformations that PFA performs.

Controlling Size/Complexity Thresholds

You can control the thresholds for internal table size and routine complexity in order to analyze larger and more complex routines.

Controlling Internal Table Size

The -ARCLIMIT=n option (or -ARCLM=n) controls the size of the internal table used to store data dependence information (arcs). If this table overflows, PFA stops analyzing the loop and the PFA listing file shows the message

too many stmts/dd arcs

Increasing ARCLIMIT might allow PFA to analyze the loop but at the cost of additional processing time.

Specifying a Complexity Limit

The -LIMIT=n option (or -LM=n) controls the amount of time PFA can spend trying to determine whether a loop is safe to run in parallel. PFA estimates how much time is required to analyze each loop nest construct. If an outer loop looks like it would take too much time to analyze, PFA ignores the outer loop and recursively visits the inner loops.

Larger limits often allow PFA to generate parallel code for deeply nested loop structures that it might not otherwise be able to run safely in parallel. However, with larger limits PFA can also take more time to analyze a program. (The limit does not correspond to the DO loop nest level. It is an estimate of the number of loop orderings that PFA can generate from a loop nest.) This option has the same effect as the global C*$* LIMIT(n) directive.


Note: You do not usually need to change these limits.


Setting the Optimization Level

The -OPTIMIZE=n option (or -O=n) sets the optimization level. The higher you set the optimization level, the more code is optimized and the longer PFA runs. Programs that are written for running in parallel often do not need advanced transformations. For these programs, a lower optimization level is enough. Valid values for n are

0 

Avoids converting loops to run in parallel.

1 

Converts loops to run in parallel without using advanced data dependence tests. Enables loop interchanging.

2 

Determines when scalars need last-value assignment using lifetime analysis. Also uses more powerful data dependence tests to find loops that can run safely in parallel. This level allows reductions in loops that execute concurrently but only if the -ROUNDOFF option is set to 2. (Refer to the following section for details about the -ROUNDOFF option.)

3 

Breaks data dependence cycles using special techniques and additional loop interchanging methods, such as interchanging triangular loops. This level also implements special-case data dependence tests.

4 

Generates two versions of a loop, if necessary, to break a data dependence arc. This level also implements more-exact data dependence tests and allows special index sets (called wraparound variables) to convert more code to run in parallel.

5  

Fuses two adjacent loops if it is legal to do so (that is, there are no data dependencies) and if the loops have the same control values. In certain limited cases, this level recognizes arrays as local variables. This level is the default.

This option has the same effect as the global C*$* OPTIMIZE(n) directive described in Chapter 5, “Fine-Tuning PFA.”


Note: If you want to use the -UNROLL command line option, set the -OPTIMIZE option to 4 or higher (the default optimization level is above this threshold).


Controlling Variations in Round Off

The -ROUNDOFF=n option (or -R=n) controls the amount of variation in round off that PFA will allow. Valid values for n are the integers

0–1 

Suppresses any round-off transformations. This is the default.

2 

Allows reductions to be performed in parallel. The valid reduction operators are addition, multiplication, min, and max. This value is one of the most commonly specified user options.

3 

Recognizes REAL induction variables. Permits memory management transformations (refer to “Memory Management Transformations”).

When executing reductions in parallel, PFA processes values in a different order from the original serial code. Round-off errors accumulate differently and produce a slightly different answer. Some algorithms are sensitive to this variation, so, by default, PFA does not run reductions in parallel. Usually, however, these tiny variations are irrelevant, and allowing PFA to process reductions in parallel lets more loops run in parallel.
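The effect of summation order can be seen in a small, deliberately extreme sketch. The data values and variable names below are chosen for illustration only. The program adds four values serially and then as two partial sums, which is how two threads might divide a reduction.

```fortran
      program rndoff
c     Sketch of why reduction order matters in single precision.
c     The data values are chosen for illustration only.
      real x(4), s, p1, p2
      integer i
      data x / 1.0e8, 1.0, -1.0e8, 1.0 /
c     Serial reduction: elements are added in their original order.
      s = 0.0
      do 10 i = 1, 4
         s = s + x(i)
 10   continue
c     Two partial sums, as two threads might compute them, which
c     are then combined.
      p1 = x(1) + x(2)
      p2 = x(3) + x(4)
      print *, 'serial sum   =', s
      print *, 'partial sums =', p1 + p2
      end
```

On a system that evaluates REAL expressions in IEEE single precision, the serial sum is 1.0 and the combined partial sums give 0.0; systems that keep intermediate values in higher precision may print other results. Most real data produces far smaller differences than this.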

Controlling the Number of Scalar Optimizations

The -SCALAROPT=n option (or -SO=n) controls the number of standard scalar optimizations attempted by PFA. Valid values for n are the integers

0 

Performs no scalar transformations.

1 

Enables dead code elimination, pulling loop invariants, forward substitution, and conversion of IF-GOTO into IF-THEN-ELSE.

2 

Enables induction variable recognition, loop unrolling, loop fusion, array expansion, scalar promotion, and floating invariant IF tests. (Loop fusion also requires -OPTIMIZE=5.)

3 

Enables the memory management transformations (refer to “Memory Management Transformations”). (Memory management also requires -ROUNDOFF=3.) This is the default value.

Enabling Loop Unrolling

The -UNROLL=n option (or -UR=n) unrolls scalar inner loops when PFA cannot run the loops in parallel. n specifies the number of times to replicate the loop body. The default is 4. Specify a small power of two for the unroll value, such as two, four, or eight. Disable unrolling by setting -UNROLL=1.

The -UNROLL2=m option (or -UR2=m) allows you to adjust the number of operations used by the -UNROLL option. Selecting a larger value for -UNROLL2 allows PFA to unroll loops containing more calculations. This form of unrolling applies only to the innermost loops in a nest of loops. You can unroll loops whether they execute serially or concurrently.

PFA counts the number of array references and arithmetic operations in the loop. It unrolls the loop until it reaches either the number of operations specified by the -UNROLL2 option or the number of iterations specified by -UNROLL.

When PFA unrolls a loop, it replicates the body of the loop a certain number of times, making the loop run faster. However, unrolling loops also increases the program size.

For example, if the original program is

do i = 1,100
   a(i) = b(i) + c(i)*d(i) 
enddo

the unrolled program (unrolling of order 4) is

do i = 1,100,4
   a(i) = b(i) + c(i)*d(i) 
   a(i+1) = b(i+1) + c(i+1)*d(i+1) 
   a(i+2) = b(i+2) + c(i+2)*d(i+2) 
   a(i+3) = b(i+3) + c(i+3)*d(i+3) 
enddo

The second (unrolled) version runs faster than the original version. The reason for the improvement is that SGI processors have separate add and multiply hardware, allowing addition and multiplication operations to run simultaneously. In the original program, the processor has to do the multiplication, wait for it to complete, then do the addition. In the second case, the processor can do the first multiplication, wait for it to complete, then overlap the second multiplication and the first addition, then the third multiplication and the second addition, and so on.

The additions require almost no additional time because all but the last one complete while the next multiplication is in progress. If the loop already contains many computations (for example, many lines of code or many additions and multiplications), unrolling it is likely to help only a little.
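The example above uses a trip count that is a multiple of the unroll factor. When it is not, the leftover iterations need separate handling. The sketch below shows one way to do this by hand, with a remainder loop after the unrolled loop; the remainder loop, the program name, and the variable m are assumptions made for illustration and do not represent PFA output.

```fortran
      program unrl
c     Sketch of unrolling by 4 when the trip count is not a
c     multiple of 4. The remainder loop is an assumption about how
c     such a case can be handled by hand, not PFA output.
      integer n
      parameter (n = 10)
      real a(n), a4(n), b(n), c(n), d(n)
      integer i, m
      do 10 i = 1, n
         b(i) = real(i)
         c(i) = 2.0
         d(i) = real(n - i)
 10   continue
c     Original loop.
      do 20 i = 1, n
         a(i) = b(i) + c(i)*d(i)
 20   continue
c     Unrolled main loop covers the first m iterations.
      m = n - mod(n, 4)
      do 30 i = 1, m, 4
         a4(i)   = b(i)   + c(i)*d(i)
         a4(i+1) = b(i+1) + c(i+1)*d(i+1)
         a4(i+2) = b(i+2) + c(i+2)*d(i+2)
         a4(i+3) = b(i+3) + c(i+3)*d(i+3)
 30   continue
c     Remainder loop handles the last mod(n,4) iterations.
      do 40 i = m + 1, n
         a4(i) = b(i) + c(i)*d(i)
 40   continue
c     Check the unrolled result against the original loop.
      do 50 i = 1, n
         if (a(i) .ne. a4(i)) print *, 'mismatch at', i
 50   continue
      print *, 'comparison finished'
      end
```

With n = 10, the unrolled loop handles iterations 1 through 8 and the remainder loop handles iterations 9 and 10.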

Memory Management Transformations

When -ROUNDOFF and -SCALAROPT are both set to 3, PFA attempts to do outer loop unrolling (to improve register utilization) and automatic loop blocking (also called tiling) to improve cache utilization.

Outer loop unrolling is a standard hand-optimization technique. Note that the -UNROLL and -UNROLL2 options apply to inner-loop unrolling. Outer-loop unrolling can occur even if inner-loop unrolling is disabled.

Loop blocking is a complex transformation that is applicable when the loop nesting depth is greater than the number of dimensions of the data arrays being manipulated. The canonical example is the simple matrix multiply, where a three-deep nest of loops operates on two-dimensional arrays.

The simple method repeatedly sweeps over the entire array. If the array is too large to fit into the cache, this can result in a large amount of memory traffic. A better method is to break the arrays up into blocks, where each block is small enough to fit into the cache, and then sweep over each block in turn (rather than over the whole array). The code to do this is often ugly and complicated.

PFA attempts to ease the burden of writing block-style algorithms by automatically generating the block version from the simple version. Note, however, that blocking does not help the more common case where the algorithm touches each array element exactly once (for example, a two-dimensional array inside of a two-deep loop nest). Because in this case the data is not being reused, blocking does not apply.

For example, given the loop nest

do k = 1,n
   do j = 1,n
      do i = 1,n
         a(i,j) = a(i,j) + b(i,k)*c(k,j)
      enddo
   enddo
enddo

using the option -r=3, PFA produces the listing below:

   II3 = 1
   II1 = MOD (N - 1, 682) + 1
   II2 = II1
   II10 = N - 7
   II11= (II10 + 7) / 8
   DO 4  II4=1, N, 682
   II8 = II3 + II2 - 1
   DO 2 K=1, II10, 8
C$DOACROSS SHARE(N,K,C,II3,II8,A,B),LOCAL(DD1,DD2,
C$& DD3,DD4,DD5,DD6,DD7,DD8,DD9,J,I)
   DO 2 J=1,N
   DD2 = C(K,J)
   DD3 = C(K+1,J)
   DD4 = C(K+2,J)
   DD5 = C(K+3,J)
   DD6 = C(K+4,J)
   DD7 = C(K+5,J)
   DD8 = C(K+6,J)
   DD9 = C(K+7,J)
      DO 2 I=II3, II8, 1
      DD1 = A(I,J)
         DD1 = DD1 +  B(I,K) * DD2
         DD1 = DD1 +  B(I, K+1) *  DD3
         DD1 = DD1 +  B(I, K+2) *  DD4
         DD1 = DD1 +  B(I, K+3) *  DD5
         DD1 = DD1 +  B(I, K+4) *  DD6
         DD1 = DD1 +  B(I, K+5) *  DD7
         DD1 = DD1 +  B(I, K+6) *  DD8
         DD1 = DD1 +  B(I, K+7) *  DD9
     A(I,J) = DD1
2    CONTINUE
     II7 = II11 * 8 + 1
     II9 = II3 + II2 - 1
     DO 3 K=II7, N, 1
C$DOACROSS SHARE(N,K,C,II3,II9,A,B),LOCAL(DD10,J,I)
   DO 3 J=1,N
   DD10 = C(K,J)
      DO 3 I=II3,II9,1
         A(I,J) = A(I,J) + B(I,K) * DD10
3     CONTINUE
      II3 = II3 + II2
      II2 = 682
4     CONTINUE

Obviously, PFA's version is more complicated than the original, but it runs significantly faster.
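The core idea behind the listing is easier to see in a stripped-down, hand-written form. The sketch below strip-mines only the I loop, so that a band of rows is finished before the next band is started, and checks the result against the simple version. The block size nb, the program name, and the array names are hypothetical; the sketch omits the unrolling and the parallel directives that PFA adds.

```fortran
      program blkmm
c     Hand-written sketch of strip-mining the I loop of a matrix
c     multiply. The block size nb and all names are hypothetical;
c     this is not the code that PFA generates.
      integer n, nb
      parameter (n = 6, nb = 4)
      real a(n,n), ab(n,n), b(n,n), c(n,n)
      integer i, j, k, is, ie
      do 12 j = 1, n
         do 11 i = 1, n
            a(i,j)  = 0.0
            ab(i,j) = 0.0
            b(i,j)  = real(i + j)
            c(i,j)  = real(i - j)
 11      continue
 12   continue
c     Simple version: each K step sweeps over all of A.
      do 23 k = 1, n
         do 22 j = 1, n
            do 21 i = 1, n
               a(i,j) = a(i,j) + b(i,k)*c(k,j)
 21         continue
 22      continue
 23   continue
c     Strip-mined version: rows is..ie are finished before moving on.
      do 34 is = 1, n, nb
         ie = min(is + nb - 1, n)
         do 33 k = 1, n
            do 32 j = 1, n
               do 31 i = is, ie
                  ab(i,j) = ab(i,j) + b(i,k)*c(k,j)
 31            continue
 32         continue
 33      continue
 34   continue
c     Check the strip-mined result against the simple version.
      do 42 j = 1, n
         do 41 i = 1, n
            if (a(i,j) .ne. ab(i,j)) print *, 'mismatch', i, j
 41      continue
 42   continue
      print *, 'comparison finished'
      end
```

Because the K loop keeps its original order inside each band, every element of the result accumulates its terms in the same sequence as in the simple version, so the two results agree exactly.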

Performing Inlining and Interprocedural Analysis

Function and subroutine calls create an obstacle to parallelization. PFA provides three ways of dealing with this obstacle:

  • Assert that the external routine is safe for concurrent execution (see “C*$* ASSERT CONCURRENT CALL”).

  • Inline the routine by replacing the call to the external routine with the actual code.

  • Perform interprocedural analysis (IPA) by analyzing the external routine ahead of time and using the results of that analysis when a reference to the routine is encountered.

Inlining and IPA tend to be slow, memory-intensive operations. Attempting to inline all routines everywhere they occur can take a lot of time and use a lot of system resources. Inlining should usually be restricted to a few time-critical places.

This section discusses the three steps for inlining or IPA:

  1. Specify which routines will be inlined (or interprocedurally analyzed).

  2. Specify which source files and libraries will be searched to find the routines.

  3. Specify which occurrences of those routines are to be inlined (or analyzed).

Specifying Routines for Inlining or IPA

PFA supports the -INLINE=list option (or -IN=list), which specifies the routines to be inlined, and the -IPA=list option, which specifies the routines to be analyzed interprocedurally. list is a colon-separated list of routine names. For example,

-INLINE=jump:more

If you do not specify list, PFA will attempt to inline all eligible routines.

Specifying Where to Search for Routines

The options listed in Table 4-1 tell PFA where to search for the routines specified with the -INLINE or -IPA option. If you do not specify either option, PFA searches the current source file by default.

Table 4-1. Inlining and IPA Search Command Line Options

Long Option Name               Short Option Name   Default Value

-INLINE_FROM_FILES=list        -INFF=list          Current source file

-IPA_FROM_FILES=list           -IPAFF=list         Current source file

-INLINE_FROM_LIBRARIES=list    -INFL=list          None

-IPA_FROM_LIBRARIES=list       -IPAFL=list         None

If one of the names in list is a directory, then all appropriate files in that directory will be used. PFA assumes files with the extension .f are Fortran source and files with the extension .klib are PFA-produced libraries.

Specify multiple files and directories with the same option by using a colon-separated list. For example,

-INLINE_FROM_FILES=file1:file2:file3


Note: These options by themselves do not initiate inlining or IPA. They only specify where to look for the routines. Use them in conjunction with the appropriate -INLINE or -IPA option.


Creating a Library

When performing inlining and IPA, PFA analyzes the routines in the source program. Normally, inlining is done directly from a source file. However, when inlining the same set of routines in many different programs, it is more efficient to create a preanalyzed library of the routines. Use the -INLINE_CREATE=name option (or -INCR=name) to create a library of prepared routines (for later use with the -INLINE_FROM_LIBRARIES option). PFA gives the library file the name you specify; for maximum compatibility, use the filename extension .klib (for example, samp.klib).

The -IPA_CREATE=name option (or -IPACR=name) is the analogous option for IPA.

The library used to do IPA does not have to be generated from the same source that will be linked into the running program. Using this capability can cause errors, but it can also be useful. For example, you could write a library of hand-optimized assembly language routines, then construct a PFA-compatible IPA library using Fortran routines that mimic the behavior of the assembly code. Thus, you can do parallelism analysis with IPA correctly but still call the hand-optimized assembly routines.

Use the following procedure to create and use a PFA library:

  1. Create a library by passing the source program directly through pfa. Library creation is done by PFA and should not be done at the same time as an ordinary compilation. For example, the following command line creates a library called samp.klib for the source program samp.f:

    % /usr/lib/pfa -INLINE_CREATE=samp.klib samp.f
    

  2. Compile the program with pfa:

    % f77 -pfa keep -WK,-INFL=samp.klib samp.f
    


Note: Libraries created for inlining contain complete information and can be used for inlining or IPA. Libraries created for IPA contain only summary information and can be used only for IPA.


Specifying Occurrences

The loop level, depth, and manual options allow you to control which occurrences of the routines specified with the -INLINE or -IPA option are actually inlined or analyzed.

Loop Level

The -INLINE_LOOPLEVEL=n (or -INLL=n) and -IPA_LOOPLEVEL=n (or -IPALL=n) options limit PFA to occurrences within deeply nested loops. A value of 1 restricts PFA to routines called at the most deeply nested level; a value of 2 restricts PFA to the deepest and second-deepest levels; and so on.

To determine which level is most deeply nested, PFA constructs a call graph to account for nesting due to loops that occur farther up the call chain. If you do not specify either option, the loop level is 10.

Depth

The -INLINE_DEPTH=n option (or -IND=n) restricts the number of times PFA will continue to attempt inlining on already inlined routines. For example, suppose you use PFA to inline the routine foo, and foo itself contains a call to bar. Should PFA now attempt a second inlining depth and inline bar? And if bar calls baz, should PFA inline three deep? This option controls the process: routines are inlined only to the specified depth.

As a special case, if you specify the value –1, only routines that do not reference other routines are inlined (that is, only leaf routines are inlined). Note that the extension to –2, –3, and so on is not supported; only –1 is. Note also that there is no -IPA_DEPTH option.

Manual

The -INLINE_MAN option turns on recognition of the C*$*INLINE directive. This directive (described in Chapter 5, “Fine-Tuning PFA”) allows you to select individual occurrences of routines to be inlined. -IPA_MAN is the analogous option for the C*$*IPA directive (also described in Chapter 5, “Fine-Tuning PFA”).

Conditions That Prevent Inlining or IPA

Several conditions make a routine ineligible for inline expansion or IPA:

  • Dummy arguments do not match the actual arguments in number, type, shape, or size.

  • The calling program and called routine have conflicting declarations for the same COMMON block.

  • The calling program and the called routine have conflicting EQUIVALENCE statements.

  • The routine to be inlined has a SAVE, ENTRY, or NAMELIST statement.

  • The routine to be inlined has a variable initialized in a DATA statement.

  • The routine to be inlined is too long (the limit is about 600 lines).

Controlling Fortran Language Elements

This section explains how to control various Fortran 77 language elements.

Global Assumptions

The -ASSUME=list option (or -AS=list) controls certain global assumptions of a program. list consists of any combination of the following values:

E 

Allows equivalence variables to refer to the same memory location inside one loop. For more information, see Chapter 5, “Fine-Tuning PFA.”

L 

Instructs PFA to use a temporary variable within the optimized loop and assign the last value to the original scalar if PFA determines that the scalar can be reused before it is assigned. This value is important when a scalar is assigned in a loop run in parallel. For more information, see Chapter 5, “Fine-Tuning PFA.”

P 

Allows for parameter aliasing in a subprogram. For more information, see Chapter 5, “Fine-Tuning PFA.”

By default, PFA assumes that a program conforms to the ANSI (and VMS) standard; therefore, the default is -ASSUME=EL.
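The last-value assignment controlled by the L value can be pictured with a small hand-written sketch. The temporary t1, the test on the final iteration, and the variable tlast (used here only so that the two results can be compared) are hypothetical; the code that PFA actually generates may differ.

```fortran
      program lastv
c     Sketch of last-value assignment. The temporary t1 and the
c     test on the final iteration are hypothetical; PFA's actual
c     generated code may differ.
      integer n
      parameter (n = 5)
      real a(n), a2(n), b(n), t, t1, tlast
      integer i
      do 10 i = 1, n
         b(i) = real(i)
 10   continue
c     Original loop: the scalar t is assigned in every iteration
c     and read again after the loop.
      do 20 i = 1, n
         t = 2.0 * b(i)
         a(i) = t + 1.0
 20   continue
c     Rewritten loop: each iteration uses its own temporary, and
c     only the final iteration stores into the saved scalar.
      do 30 i = 1, n
         t1 = 2.0 * b(i)
         a2(i) = t1 + 1.0
         if (i .eq. n) tlast = t1
 30   continue
      print *, 'original t =', t, ' rewritten t =', tlast
      end
```

Both values are 10.0 here. In a parallel run each thread would hold its own copy of the temporary, which is why the value needed after the loop has to be stored explicitly.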

Debugging Lines

The -DLINES option tells PFA to treat the letter D in column one as if the letter were a character space. PFA then parses the rest of that line as a normal Fortran 77 statement. The -NODLINES option tells PFA to treat these lines as though they were comments. These options are useful for excluding or including debugging lines. f77 passes this option to PFA automatically when you specify the f77 -d_lines option.

DO Loop Execution

The -ONETRIP option (or -1) provides compatibility with older versions of Fortran in which a DO loop is always executed at least once. The -NOONETRIP option (or -N1) conforms to the Fortran 77 standard.

The -NOONETRIP option, which is the default, does not execute a DO loop whose termination condition is initially satisfied. f77 passes the -ONETRIP option to PFA automatically when you specify the f77 -one_trip option.
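The difference between the two conventions shows up only in loops whose initial value is already past the limit. The short program below, whose names are hypothetical, counts the iterations of such a loop.

```fortran
      program onetrp
c     Counts the iterations of a DO loop whose initial value is
c     already past its limit.
      integer i, ntrip
      ntrip = 0
      do 10 i = 5, 1
         ntrip = ntrip + 1
 10   continue
      print *, 'iterations executed:', ntrip
      end
```

Under the Fortran 77 rules the loop body is skipped and ntrip remains 0. Under one-trip semantics the body would execute once, leaving ntrip at 1.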

Variable Saving Across Invocations

The -SAVE=c option (or -SV=c) specifies whether a procedure's variables are saved across invocations. c is one of the following values:

A 

Performs a lifetime analysis on a procedure's variables to determine those that need to have their value saved across invocations of the procedure. When it finds such a variable, PFA generates a SAVE statement for the variable.

M 

Does not generate SAVE statements. This is the default value.

Significant Columns

The -SCAN=n option controls the number of columns that PFA assumes to be significant. PFA ignores anything beyond the specified column number. The default value for n is 72. Specifying any of the following f77 options automatically sets this option: -col72, -col120, or -extend_source.

Fortran Standard

Setting the -SYNTAX=c option (or -SY=c) alters the interpretation of the Fortran input to be in compliance with other standards. c is one of the following values:

A 

Interprets the source in strict compliance with the ANSI Fortran 77 standard.

V 

Interprets the source in compliance with the VMS Fortran standard but without the additional SGI extensions.

If you do not specify this option, PFA uses the same rules as the standard SGI Fortran compiler (refer to the Fortran 77 Programmer's Guide for details).

Controlling Directives and Assertions

This section discusses the options you can use to select whether PFA accepts a specific directive or assertion. You can use these options to override directives and assertions that are specified in the source program.

Selecting Directives and Assertions

The -DIRECTIVES=list option specifies the directives and assertions to accept. The -NODIRECTIVES option tells PFA to ignore all directives and assertions. This option is useful when you suspect unsafe directives are causing problems with program execution.


Note: Some directives are called assertions because they assert program characteristics that PFA cannot verify but that you might want PFA to use when optimizing. (For example, an assertion could state that subroutine x contains no data dependencies.) Refer to Chapter 1, “Overview of PFA,” for more information about directives and assertions.

Valid values for list are any combination of the values

A 

Accepts assertions.

C 

Accepts Cray CDIR$ directives; CDIR$IVDEP ignores certain data dependencies in a loop. But because of differences between SGI hardware and a Cray machine, these data dependencies are not always safe to ignore on SGI hardware. To be safe, PFA does not recognize the CDIR$IVDEP directive by default. You can, at your own risk, turn on Cray-directive recognition, which will cause PFA to treat this Cray directive as if it were a C*$*ASSERT DO (CONCURRENT) assertion.

K 

Accepts C*$* directives.

S 

Accepts C$ directives. PFA recognizes the directives C$DOACROSS, C$, and C$&. (For more information, see the Fortran 77 Programmer's Guide.) If a C$DOACROSS directive appears, PFA does not examine or alter the loop to which the directive applies. This allows you to mix code that you converted to parallel execution with code that PFA converted to parallel execution.

V 

Accepts VAST CVD$ directives.

For example, specifying -DIRECTIVES=K enables PFA directives only, whereas -DIRECTIVES=CK enables both Cray and PFA directives. Adding A to the DIRECTIVES sequence also enables PFA assertions. Any combination of options is acceptable.

If you do not specify either option, PFA will accept all assertions, PFA C*$* directives, all C$ directives, and VAST CVD$ directives.

Controlling PFA I/O

This section describes command line options you can use to name PFA input and output. You do not need to use these options unless you want to change the default names. In particular, some versions of the make(1) utility assume that files ending in .l are lex(1) input files. To perform automatic makes without overwriting the PFA listing file, use a different suffix for the listing filename.

Use the -INPUT=file.f option to specify the name of the Fortran source program PFA input file. If you do not specify this option, PFA assumes that a command line argument not preceded by a dash is the input filename.

The -FORTRAN=file option specifies the name of the PFA intermediate file (that is, the transformed source). If you do not specify this filename, PFA names the intermediate file file.m, where file is the name of the input file. For details about the intermediate file, refer to Chapter 3, “Utilizing PFA Output.”

The -LIST=file option specifies the name of the PFA listing file. If you do not specify this filename, PFA names the listing file file.l, where file is the name of the input file. For details about the listing file, refer to Chapter 3, “Utilizing PFA Output.”

Obsolete Syntax

Table 4-2 lists obsolete PFA command line options.

Table 4-2. Obsolete Options

Long Option Name    Short Option Name   Default Value

-EXPAND             -X, -EX             off

-CREATE             -CR                 off

-LIBRARY            -LIB                off

-LIMIT2             -LM2                5000

PFA now accepts new syntax for some of the command line options (particularly the syntax for inlining). For compatibility with older versions, PFA translates the obsolete options into their newer equivalents, as shown in Table 4-3. Whenever possible, do not use the older syntax; support for it might be withdrawn in the future.

Table 4-3. Obsolete Options and Their Equivalents

Old Version              New Version

-EXPAND=A                -INLINE

-EXPAND=M                -INLINE_MAN

-LIBRARY=name            -INLINE_FROM_LIBRARIES=name

-CREATE -LIBRARY=name    -INLINE_CREATE=name

-LIMIT2=n                -ARCLIMIT=n