All MIPSpro compilers share a common optimizer phase, controlled by the use of the -On driver option. MIPSpro Fortran 90 also has source-level scalar optimization, controlled with source assertions and the -WK driver option.
Use of the common optimizer with any language is documented in the MIPS Compiling and Performance Tuning Guide. This chapter covers optimizer features that are unique to Fortran 77 and Fortran 90, and the use of source-level scalar optimizations, that is, optimizations that are equally effective on uniprocessor and multiprocessor machines.
You can use the compiler to perform various scalar optimizations by specifying any of the options listed in Table 4-1 on the command line. Specify the options in a comma-separated list with no intervening blanks, following the –WK option, as follows:
% f90 f77options -WK,option[,option] ... file
The -WK option passes its arguments to the optimization phases that are invoked by the Fortran front end.
Defaults are set for the options in Table 4-1 by the -On option. These defaults are reflected in the pfa(1) and fopt(1) reference pages, and are usually correct. You use specific values of these options only when dealing with special performance-tuning issues unique to your application.
option on with –scalaropt=2 or –optimize=5
depends on –O option
depends on –O option
depends on –O option
The –scalaropt=n option (or –so=n) controls the level of scalar optimizations that the compiler performs. Valid values for n are:
0   Disables all scalar optimizations.
1   Enables simple scalar optimizations: dead code elimination, global forward substitution of variables, and conversion of IF-GOTO to IF-THEN-ELSE.
2   Enables the full range of scalar optimizations: floating of invariant IF statements out of loops, array expansion, loop fusion, loop peeling, induction variable recognition, and loop rerolling and unrolling (if –roundoff is greater than zero; see “Controlling Variations in Round Off”).
3   Enables memory management transformations if –roundoff=3 (see “Performing Memory Management Transformations”). Performs dead-code elimination during output conversion.
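As a rough illustration of one of the simple transformations listed above, the following sketch contrasts an IF-GOTO construct with an equivalent IF-THEN-ELSE block. The function names SIGN_GOTO and SIGN_BLOCK are hypothetical and chosen only for this example; the code the optimizer actually produces is not shown here and may differ.

```fortran
program if_goto_demo
  implicit none
  integer :: i

  ! Both forms are expected to return the same value for every input.
  do i = -1, 1
     print *, i, sign_goto(i), sign_block(i)
  end do

contains

  ! Unstructured form: control flow expressed with IF ... GOTO.
  integer function sign_goto(k)
    integer, intent(in) :: k
    if (k < 0) goto 10
    sign_goto = 1
    goto 20
10  sign_goto = -1
20  continue
  end function sign_goto

  ! Structured form, similar in spirit to the result of converting
  ! IF-GOTO to IF-THEN-ELSE.
  integer function sign_block(k)
    integer, intent(in) :: k
    if (k < 0) then
       sign_block = -1
    else
       sign_block = 1
    end if
  end function sign_block

end program if_goto_demo
```

The structured form makes the two alternatives explicit, which is generally easier for later analysis phases to reason about.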
There is no default value for -scalaropt. If you do not specify it, this option can still be in effect through the –O option.
Optimization level can also be influenced with a directive in the source file; see “Setting Optimization Level”.
The –optimize=n option (or –o=n) sets the optimization level. Each optimization level is cumulative (that is, level 5 performs everything up to and including level 5). You can also modify the optimization level on a loop-by-loop basis by using a directive within the source file (see “Setting Optimization Level”). Valid values for n are:
1   Performs only simple optimizations. Enables induction variable recognition.
2   Performs lifetime analysis to determine when last-value assignment of scalars is necessary.
3   Recognizes triangular loops and attempts loop interchanging to improve memory referencing. Uses special case data dependence tests. Also, recognizes special index sets called wraparound variables.
4   Generates two versions of a loop, if necessary, to break a data dependence arc.
5   Enables array expansion and loop fusion.
Although higher optimization levels increase performance, they also increase compilation time. There is no default value for this option. If you do not specify it, this option can still be in effect through the –O option.
The output of the following example is described for –optimize=1, –optimize=2, and –optimize=5 to illustrate the range of this option. (This example also uses -pfa and –minconcurrent=0.)
      ASUM = 0.0
      DO 10 I = 1,M
         DO 10 J = 1,N
            ASUM = ASUM + A(I,J)
            C(I,J) = A(I,J) + 2.0
   10 CONTINUE
At –optimize=1, the compiler sees the summation in ASUM as an intractable data dependence between iterations and does not try to optimize the loop. At –optimize=2 (perform lifetime analysis and do not interchange around reduction) the compiler is able to recognize the reduction and automatically introduce a reduction parallelization:
      ASUM = 0.
C$DOACROSS SHARE(M,N,A,C),LOCAL(I,J),REDUCTION(ASUM)
      DO 3 I=1,M
         DO 2 J=1,N
            ASUM = ASUM + A(I,J)
            C(I,J) = 2. + A(I,J)
    2    CONTINUE
    3 CONTINUE
Specifying –optimize=5 (loop interchange around reduction to improve memory referencing) produces the following:
      ASUM = 0.
C$DOACROSS SHARE(N,M,A,C),LOCAL(J,I),REDUCTION(ASUM)
      DO 3 J=1,N
         DO 2 I=1,M
            ASUM = ASUM + A(I,J)
            C(I,J) = 2. + A(I,J)
    2    CONTINUE
    3 CONTINUE
The –assume=list option (or –as=list) controls certain global assumptions of a program. You can also control most of these assumptions with various assertions (see “Using Assertions”). The default is –assume=cel.
The list can contain the following letters:
a   Assumes procedure argument aliasing is used, meaning that different dummy arguments could refer to the same object. This practice is forbidden by the Fortran 77 and Fortran 90 standards. This option provides a method of dealing with programs that use argument aliasing anyway.
b   Assumes array subscripts may go outside the declared array bounds.
c   Places constants used in subroutine or function calls in temporary variables.
e   Assumes that two or more variable names defined in EQUIVALENCE statements may be used to refer to the same memory location inside one DO loop nest.
l   Uses temporary variables within an optimized loop and assigns the last value to the original scalar, if the compiler determines that the scalar can be reused before it is assigned.
By default, the compiler assumes that a program conforms to the Fortran 90 standard, that is, –assume=el, and includes –assume=c to simplify some analysis and inlining. When your program conforms strictly to language standards, you can specify the –noassume option.
The following command compiles the Fortran program source.f90, assuming that its array subscripts may go outside the declared bounds:
% f90 -WK,-assume=b source.f90
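The argument aliasing that the –assume option is meant to accommodate can be sketched as follows. The routine name SHIFT_ADD and the program are hypothetical, written only for illustration; the call is deliberately nonstandard because both dummy arguments overlap in storage.

```fortran
! Hypothetical routine: correct only if A and B do not overlap.
subroutine shift_add(a, b, n)
  implicit none
  integer, intent(in) :: n
  real :: a(n), b(n)
  integer :: i
  do i = 1, n
     b(i) = a(i) + 1.0
  end do
end subroutine shift_add

program alias_demo
  implicit none
  real :: x(5)
  external shift_add

  x = 1.0
  ! Nonstandard call: A is associated with x(1:4) and B with x(2:5),
  ! so B(i) and A(i+1) name the same storage location.
  call shift_add(x(1), x(2), 4)
  print *, x
end program alias_demo
```

If the loop runs strictly in order, each iteration reads the value stored by the previous one. An optimizer that assumes the arguments are distinct is free to reorder or overlap the iterations, so the printed values can depend on the optimization level unless the aliasing assumption is declared.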
The –recursion (or –rc) option tells the compiler that subroutines and functions in the source program can be called recursively (that is, a subroutine or function calls itself, or it calls another routine that calls it). The presence of recursion affects storage allocation decisions.
This option is enabled by default. To disable it, specify –norecursion (or –nrc). You can control this assumption on a procedure basis by using a directive within the source file (see “Ignoring Data Dependence Conflicts”).
Unsafe transformations can occur unless the –recursion option is enabled for each recursive routine that the compiler processes.
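The effect of recursion on storage allocation can be seen in a small sketch. The function SUM_TO is hypothetical; the point is that its local variable must have a separate copy for every active call.

```fortran
program recursion_demo
  implicit none
  print *, sum_to(4)

contains

  ! Hypothetical recursive function: adds the integers 1 through n.
  recursive function sum_to(n) result(s)
    integer, intent(in) :: n
    integer :: s
    integer :: local   ! needs a separate copy for each active call

    local = n
    if (n <= 0) then
       s = 0
    else
       ! The nested calls assign their own LOCAL; this call's copy
       ! must still hold n when control returns here.
       s = sum_to(n - 1) + local
    end if
  end function sum_to

end program recursion_demo
```

If LOCAL were given a single static location, the nested calls would overwrite it before the addition takes place, which is the kind of unsafe outcome that the recursion assumption guards against.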
If you request one, the optimization phases will produce a listing showing how your source text was modified. This enables you to check line by line to see which loops were unrolled, how temporary variables were inserted, and so on. The basic option for receiving a listing is -listoptions=letters (or -lo=letters). The letters that can be given include:
display calling tree
display program unit names as processed
annotated listing of original program
annotated listing of output program
For example, the command
f90 ...options... -O5 -WK,-lo=CT testmodule.f90
produces a file named testmodule.L containing a listing of the calling tree as deduced by the optimizer, and an annotated listing of the modified program.
The –fuse option enables loop fusion, an optimization that transforms two adjacent loops into a single loop. The use of data-dependence tests allows fusion of more loops than is possible with standard techniques. You must also specify –scalaropt=2 or –optimize=5 to enable loop fusion.
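Loop fusion can be pictured with a hand-written sketch. The program below is hypothetical and performs the fusion manually; it is not compiler output. It compares two adjacent loops with a single fused loop that does the same work.

```fortran
program fusion_demo
  implicit none
  integer, parameter :: n = 8
  real :: a(n), b(n), c(n), b2(n), c2(n)
  integer :: i

  do i = 1, n
     a(i) = real(i)
  end do

  ! Two adjacent loops over the same index range.
  do i = 1, n
     b(i) = a(i) * 2.0
  end do
  do i = 1, n
     c(i) = b(i) + 1.0
  end do

  ! Fused form: one pass, and b2(i) is reused while it is still
  ! close at hand.
  do i = 1, n
     b2(i) = a(i) * 2.0
     c2(i) = b2(i) + 1.0
  end do

  ! Both forms perform the same operations on each element.
  print *, all(b == b2) .and. all(c == c2)
end program fusion_demo
```

Fusion is legal here because iteration i of the second loop depends only on iteration i of the first; a data-dependence test is what establishes this in the general case.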
When a loop contains an IF statement whose condition cannot change from one iteration to another (a loop-invariant IF), the compiler performs the same test for every iteration. The code can often be made more efficient by floating the IF statement out of the loop and putting the THEN and ELSE sections into their own loops. This process is called invariant IF floating.
The –each_invariant_if_growth and the –max_invariant_if_growth options control limits on invariant IF floating. This process generally involves duplicating the body of the loop, which can increase the amount of code considerably.
The –each_invariant_if_growth=n option (or –eiifg=n) controls the rewriting of IF statements nested within loops. This option specifies a limit on the number of executable statements in a nested IF statement. If the number of statements in the loop exceeds this limit, the compiler does not rewrite the code. If there are fewer statements, the compiler improves execution speed by interchanging the loop and IF statements. This process becomes complicated when there is other code in the loop, since a copy of the other code must be included in both the THEN and ELSE loops.
Valid values for n are from 0 to 100; the default is 20. The code in Example 4-1 illustrates a loop-invariant IF.
      DO I = ...
         section-1
         IF ( ) THEN
            section-2
         ELSE
            section-3
         ENDIF
         section-4
      ENDDO
Floating the invariant IF out of the loop produces the following code:
      IF ( ) THEN
         DO I = ...
            section-1
            section-2
            section-4
         ENDDO
      ELSE
         DO I = ...
            section-1
            section-3
            section-4
         ENDDO
      ENDIF
When sections 1 and 4 are large, the extra code generated can cost a program more time, through cache contention, extra paging, and so on, than it gains from evaluating fewer Boolean expressions. The –each_invariant_if_growth option provides a maximum size (in number of lines of executable code) of the duplicate sections, above which the compiler does not try to float an invariant IF statement outside a loop.
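A concrete, hand-written instance of invariant IF floating is sketched below. The program and the variable DOUBLED are hypothetical; the floated version is written out manually to show the shape of the transformation, not copied from compiler output.

```fortran
program invariant_if_demo
  implicit none
  integer, parameter :: n = 10
  real :: a(n), b(n)
  logical :: doubled
  integer :: i

  doubled = .true.

  ! Original form: the invariant test is repeated in every iteration.
  do i = 1, n
     if (doubled) then
        a(i) = 2.0 * real(i)
     else
        a(i) = real(i)
     end if
  end do

  ! Floated form: one test, and the loop is duplicated in each branch.
  if (doubled) then
     do i = 1, n
        b(i) = 2.0 * real(i)
     end do
  else
     do i = 1, n
        b(i) = real(i)
     end do
  end if

  print *, all(a == b)
end program invariant_if_demo
```

The loop appears twice in the floated form, which is the code growth that the two limits are designed to keep in check.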
You can limit the total amount of additional code generated in a program unit through invariant IF floating by specifying the –max_invariant_if_growth option. This option (or –miifg=n) specifies an upper bound on the total number of additional lines of code the compiler can generate in each program unit through invariant-IF floating. This limit is applied on a per-subroutine basis. For example, if a subroutine is 400 lines long and –miifg=500, the compiler can add at most 100 lines in the process of invariant IF floating. The default for n is 500.
Note: Other compiler optimizations can add or delete lines, so the final number of lines might differ from the value specified with –miifg.
Both limits can be controlled on a loop-by-loop basis with a directive within the source (see “Setting Invariant IF Floating Limits”).
The –roundoff=n option (or –r=n) controls the amount of variation in roundoff error produced by optimization. If an arithmetic reduction is accumulated in a different order than is used in the scalar program, the roundoff error is accumulated differently and the final result might differ from the output of the original program. Although the difference is usually insignificant, certain restructuring transformations performed by the compiler must be disabled to obtain exactly the same answers as the scalar program.
The values you can specify for n are cumulative. For example, –roundoff=3 performs what is described for level 3 in addition to what is listed for the previous levels. Valid values for n are:
0   Suppresses any transformations that change roundoff error.
1   Performs expression simplification (which might generate various overflow or underflow errors) for expressions with operands between binary and unary operators, expressions that are inside trigonometric intrinsic functions returning integer values, and after forward substitution. Enables strength reduction. Performs intrinsic function simplification for max and min. Enables code floating if –scalaropt is at least 1. Allows loop interchanging around serial arithmetic reductions, if –optimize is at least 4. Allows loop rerolling, if –scalaropt is at least 2.
2   Allows loop interchanging around arithmetic reductions if –optimize is at least 4. For example, the floating point expression A/B/C is computed as A/(B*C).
3   Recognizes REAL (float) induction variables if –scalaropt is greater than 2 or –optimize is at least 1. Enables sum reductions. Enables memory management optimizations if –scalaropt=3 (see “Performing Memory Management Transformations” for details about memory management transformations).
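The sensitivity of a reduction to evaluation order can be observed directly. The following sketch is a hypothetical test program, not part of the compiler distribution; it accumulates the same terms in two different orders and also compares A/B/C with A/(B*C). The exact values printed depend on the machine and the compiler, so none are quoted here.

```fortran
program roundoff_demo
  implicit none
  integer, parameter :: n = 1000000
  real :: fwd, bwd, a, b, c
  integer :: i

  ! Accumulate the terms 1/i in ascending order of i.
  fwd = 0.0
  do i = 1, n
     fwd = fwd + 1.0 / real(i)
  end do

  ! Accumulate the same terms in descending order of i.
  bwd = 0.0
  do i = n, 1, -1
     bwd = bwd + 1.0 / real(i)
  end do

  print *, 'forward sum  =', fwd
  print *, 'backward sum =', bwd
  print *, 'difference   =', fwd - bwd

  ! Two algebraically equal forms of the same quotient; the rounded
  ! results may or may not agree.
  a = 1.0
  b = 3.0
  c = 7.0
  print *, 'A/B/C - A/(B*C) =', a / b / c - a / (b * c)
end program roundoff_demo
```

A nonzero difference between the two sums is the kind of variation that the lower –roundoff levels are intended to rule out.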
Consider the code fragment in Example 4-3.
      ASUM = 0.0
      DO 10 I = 1,M
         DO 10 J = 1,N
            ASUM = ASUM + A(I,J)
            C(I,J) = A(I,J) + 2.0
   10 CONTINUE
When –roundoff=1, the compiler does not transform the summation reduction. The compiler distributes the loop, as shown in Example 4-4.
      ASUM = 0.
      DO 2 J=1,N
         DO 2 I=1,M
            ASUM = ASUM + A(I,J)
    2 CONTINUE
      DO 3 J=1,N
         DO 3 I=1,M
            C(I,J) = A(I,J) + 2.
    3 CONTINUE
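When –roundoff=2 and –optimize is at least 4, the compiler can instead interchange the loops around the reduction without distributing them:
      ASUM = 0.
      DO 10 J=1,N
         DO 2 I=1,M
            ASUM = ASUM + A(I,J)
            C(I,J) = A(I,J) + 2.
    2    CONTINUE
   10 CONTINUE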
When –roundoff=3 and –optimize=5, the compiler recognizes REAL induction variables. Example 4-6 shows such a loop.
      ASUM = 0.0
      X = 0.0
      DO 10 I = 1,N
         ASUM = ASUM + A(I)*COS(X)
         X = X + 0.01
   10 CONTINUE
When –roundoff=3 and –optimize=5, this loop is transformed in the way shown in Example 4-7.
      ASUM = 0.
      X = 0.
      DO 10 I=1,N
         ASUM = ASUM + A(I) * COS ((I - 1) * 0.01)
   10 CONTINUE
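The reason this transformation is tied to the roundoff level can be checked with a short sketch. The program below is hypothetical; it compares the accumulated value of X with the closed form (I - 1) * 0.01 at every iteration and reports the largest gap it finds. The size of the gap depends on the machine arithmetic, so no value is quoted.

```fortran
program induction_demo
  implicit none
  integer, parameter :: n = 100000
  real :: x, maxdiff
  integer :: i

  x = 0.0
  maxdiff = 0.0
  do i = 1, n
     ! X holds the accumulated value used by the original loop;
     ! real(i - 1) * 0.01 is the value used by the transformed loop.
     maxdiff = max(maxdiff, abs(x - real(i - 1) * 0.01))
     x = x + 0.01
  end do

  print *, 'largest gap between the two forms =', maxdiff
end program induction_demo
```

Replacing the running sum with the closed form removes the dependence between iterations, at the cost of a slightly different argument to COS in each iteration.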
The nine intrinsic functions ASIN, ACOS, ATAN, COS, EXP, LOG, SIN, TAN, and SQRT have a scalar (element by element) version and a special version optimized for vectors. When you use -O3 optimization, the compiler uses the vector versions if it can. On the MIPS R8000 and R10000 processors, the vector function is significantly faster than the scalar version, but has a few restrictions on its use.
To apply the vector intrinsics, the compiler searches for loops of the following form:
      real a(10000), b(10000)
      do j = 1, 1000
         b(2*j) = sin(a(3*j))
      enddo
The compiler can recognize the eight functions ASIN, ACOS, ATAN, COS, EXP, LOG, SIN, and TAN when they are applied between elements of named variables in a loop (SQRT is not recognized automatically). The compiler automatically replaces the loop with a single call to a special, vectorized version of the function.
The compiler cannot use the vector intrinsic when the input is based on a temporary result or when the output replaces the input. In the following example, only certain functions can be vectorized:
      real a(400,400), b(400,400), c(400,400), d(400,400)
      call xx(a,b,c,d)
      do j = 100,300,2
         do i = 100,300,3
            a(i,j) = 1.23*i + a(i,j)
            b(i,j) = sin(a(i,j) + 1.0)
            a(i,j) = log(a(i,j))
            c(i,j) = sin(c(i,j)) / cos(d(i,j))
            d(i+30,j-10) = tan( d(j,i) )
         enddo
      enddo
      call xx(a,b,c,d)
      end
In the preceding example:
the first SIN call is applied to a temporary value and cannot be vectorized
the LOG call can be vectorized
results from the second SIN call and first COS call are used in temporary expressions and cannot be vectorized
the TAN call can be vectorized
The vector intrinsics are limited in the following ways:
The SQRT function is not used automatically in the current release.
The single-precision COS, SIN, and TAN functions are valid only for arguments whose absolute value is less than or equal to 2**28.
The double-precision COS, SIN, and TAN functions are valid only for arguments whose absolute value is less than or equal to PI*2**19.
The vector functions assume that the input and output arrays either coincide completely, or do not overlap. They do not check for partial overlap, and will produce unpredictable results if it occurs.
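The hazard of partial overlap can be sketched without calling the vector routines themselves. In the hypothetical program below, a whole-array assignment stands in for a vectorized call, on the assumption that such a call reads all of its inputs before storing any outputs; the actual behavior of the vector intrinsics under partial overlap is unpredictable, as stated above.

```fortran
program overlap_demo
  implicit none
  integer, parameter :: n = 6
  real :: a(n), b(n)
  integer :: j

  a = 0.5
  b = 0.5

  ! Element-by-element loop: each iteration reads the value that the
  ! previous iteration just stored (input and output partly overlap).
  do j = 1, n - 1
     a(j + 1) = sin(a(j))
  end do

  ! Whole-array form: the right-hand side is evaluated completely
  ! before any element of the left-hand side is stored.
  b(2:n) = sin(b(1:n-1))

  print *, a
  print *, b
  print *, 'max difference =', maxval(abs(a - b))
end program overlap_demo
```

The two forms disagree because the loop carries a value from one iteration to the next. When input and output coincide completely or are disjoint, no such carried value exists and the two forms agree.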
If you need to disable use of vector intrinsics while still compiling at -O3 level, you can do so. Specify the option -OPT:vector_intrinsics=OFF.
f90 -64 -mips4 -O3 -OPT:vector_intrinsics=OFF trig.f
The MIPSpro Fortran 77 Programmer's Guide gives a method of calling the vector intrinsic functions directly. This method cannot be used from Fortran 90 (because the C functions that implement vector intrinsics do not have names ending in an underscore). However, you can write a “wrapper” in C or in Fortran 77 that calls the vector functions, and call this from Fortran 90.
The –aggressive=letter option (or –ag=letter) performs optimizations that are normally forbidden. In order to use this option, your program must be a single file so that the compiler can analyze all of it simultaneously.
The only available value for letter is a, which instructs the compiler to add padding to Fortran COMMON blocks. This optimization provides favorable alignments of the virtual addresses. This option does not have a default value.
% f90 -WK,-ag=a program.f
Unfortunately, it is not always possible to add padding to a COMMON. Fortran allows different routines to have different definitions of COMMON. Therefore, when using this option the entire program must be in a single source file, so the compiler can check for equivalent COMMON definitions.
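The difficulty can be seen in a sketch with two differing definitions of one COMMON block. The block name BLK and the routine SHOW are hypothetical.

```fortran
program common_demo
  implicit none
  real :: a(3), b(3)
  common /blk/ a, b       ! block viewed as two 3-element arrays

  a = 1.0
  b = 2.0
  call show
end program common_demo

subroutine show
  implicit none
  real :: whole(6)
  common /blk/ whole      ! same storage viewed as one 6-element array

  ! whole(4:6) is expected to coincide with b(1:3) in the main program.
  print *, whole
end subroutine show
```

If padding were inserted between A and B in the main program only, WHOLE(4) would no longer coincide with B(1). The compiler therefore has to see every definition of the block before it can decide whether padding is safe.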
The –arclimit=integer option (or –arclm=integer) sets the size of the internal table that the compiler uses to store data dependence information. The default value for integer is 5000.
The compiler dynamically allocates the dependence data structure on a loop-nest-by-loop-nest basis. If a loop contains too many dependence relationships and cannot be represented in the dependence data structure, the compiler will stop analyzing the loop. Increasing the value of –arclimit allows the compiler to analyze larger loops.
Note: Most users do not need to change this value. Also, the number of data dependencies (and the time required to do the analysis) is potentially nonlinear in the length of the loop. Very long loops (several hundred lines) may be impossible to analyze regardless of the value of –arclimit.
Memory management transformations are based on information about the characteristics of the hardware memory and cache.
The compiler recognizes the following memory management command line options when specified with the -WK option:
–cacheline          Specifies the width of the memory channel between cache and main memory.
–cachesize          Specifies the data cache size.
–fpregisters        Specifies number of available floating-point registers.
–dpregisters        Specifies number of available double precision registers.
–setassociativity   Specifies which memory management transformation to use.
The –cacheline=n option (or –chl=n) specifies the width of the memory channel, in bytes, between the cache and main memory. The default value for n is 4. The correct value for the Power Challenge and Power Indigo2 systems is 128.
The –cachesize=n option (or –chs=n) specifies the size of the data cache, in kilobytes, for which to optimize. The default value for n is 256 kilobytes. You can obtain the cache size for a given machine with the hinv(1) command.
The –setassociativity=n option (or –sasc=n) provides information on the mapping of physical addresses in main memory to cache pages. The default value for n, 1, means that a datum in main memory can be put in only one place in the cache. If this cache page is already in use, its contents must be rewritten or flushed so that the newly accessed page can be copied into the cache. The recommended value is 1 for all machines except the POWER CHALLENGE and POWER Onyx series, where you should set it to 4.
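The role of associativity can be sketched with a simplified model in which an address maps to a cache set by simple modulo arithmetic. This model, the function SET_INDEX, and the sample addresses are assumptions made for illustration; real hardware mappings can be more involved. The cache size and line width are the 256-kilobyte and 128-byte figures quoted earlier in this section.

```fortran
program cache_map_demo
  implicit none
  integer, parameter :: cache_bytes = 256 * 1024
  integer, parameter :: line_bytes = 128
  integer :: addr1, addr2

  ! Two sample byte addresses exactly one cache size apart.
  addr1 = 4096
  addr2 = addr1 + cache_bytes

  print *, 'associativity 1, sets:', set_index(addr1, 1), set_index(addr2, 1)
  print *, 'associativity 4, sets:', set_index(addr1, 4), set_index(addr2, 4)

contains

  ! Simplified model: the line number of the address, taken modulo
  ! the number of sets in the cache.
  integer function set_index(addr, ways)
    integer, intent(in) :: addr, ways
    integer :: nsets
    nsets = cache_bytes / (line_bytes * ways)
    set_index = mod(addr / line_bytes, nsets)
  end function set_index

end program cache_map_demo
```

In this model the two addresses fall in the same set at either setting. With an associativity of 1 that set holds a single line, so the second access displaces the first; with an associativity of 4 the set has room for both.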
The –dpregisters=n option (or –dpr=n) specifies the number of double-precision registers each CPU has. The –fpregisters=n option (or –fpr=n) specifies the number of single precision (that is, ordinary floating point) registers each CPU has.
You should specify the same value for both –dpregisters and –fpregisters. The default values for n are 16 for both options. For 32-bit compilations, Silicon Graphics recommends that you do not specify 16, although that is what the hardware supports. It is better to specify a smaller value such as 12, to provide extra registers in case the compiler needs them. In 64-bit mode, where the hardware supports 32 registers, specify 28.