This chapter contains the following sections:
“Overview of Scalar Optimization” provides an overview of the scalar optimization command-line options.
“Performing General Optimizations” describes the general scalar optimizations you can enable from the command line.
“Performing Advanced Optimizations” describes the advanced scalar optimizations you can enable from the command line.
You can use the compiler to perform various scalar optimizations by specifying any of the options listed in Table 5-1 from the command line. Specify the options in a comma-separated list following the –pfa option without any intervening blanks, as follows:
% f77 f77options -pfa,option[,option] ... file
Note: These options specifically control optimizations performed by the Fortran front end. The defaults are usually sufficient. Use these options when trying to extract the last bit of performance from your code.
You can also initiate many of these optimizations with compiler directives (see Chapter 7, “Fine-Tuning Power Fortran”).
The –On option directly initiates basic optimizations.
This section discusses the general optimizations that you can enable.
The –fuse option enables loop fusion, an optimization that transforms two adjacent loops into a single loop. The use of data-dependence tests allows fusion of more loops than is possible with standard techniques. You must also specify –scalaropt=2 or –optimize=5 to enable loop fusion.
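As an illustrative sketch (the arrays and bounds here are invented, not taken from this manual), loop fusion rewrites two adjacent loops with matching bounds into one:

```fortran
C     Before fusion: two adjacent loops with identical bounds
      DO 10 I = 1,N
         A(I) = B(I) + 1.0
10    CONTINUE
      DO 20 I = 1,N
         C(I) = A(I) * 2.0
20    CONTINUE

C     After fusion: a single loop, one pass over the data
      DO 30 I = 1,N
         A(I) = B(I) + 1.0
         C(I) = A(I) * 2.0
30    CONTINUE
```

Fusion reduces loop overhead and improves locality, since A(I) is still in a register when the second statement uses it.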
The –assume=list option (or –as=list) controls certain global assumptions of a program. You can also control most of these assumptions with various assertions (see “Assertions”). The default is –assume=cel.
list can contain the following characters:
a: Allows procedure argument aliasing, that is, different subroutine or function arguments referring to the same object. This practice is forbidden by the Fortran 77 standard; this flag provides a way of dealing with programs that use argument aliasing anyway.

b: Allows array subscripts to go outside the declared bounds.

c: Places constants used in subroutine or function calls in temporary variables.

e: Allows variables in EQUIVALENCE statements to refer to the same memory location inside one DO loop nest.

l: Uses temporary variables within an optimized loop and assigns the last value to the original scalar, if the compiler determines that the scalar can be reused before it is assigned.
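The argument aliasing permitted by the a flag can be pictured as follows (a hypothetical example; the SCALE routine and its arrays are invented for illustration):

```fortran
      REAL X(100)
C     Both dummy arguments of SCALE refer to the same array X.
C     The Fortran 77 standard forbids this; -assume=a tells the
C     compiler to allow for it when optimizing.
      CALL SCALE(X, X, 100)
      END

      SUBROUTINE SCALE(A, B, N)
      REAL A(N), B(N)
      DO 10 I = 1, N
         A(I) = B(I) * 2.0
10    CONTINUE
      END
```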
By default, the compiler assumes that a program conforms to the Fortran 77 standard, that is, –assume=el, and includes –assume=c to simplify some analysis and inlining. You can disable the default values by specifying the –noassume option.
When a loop contains an IF statement whose condition does not change from one iteration to another (loop-invariant), the compiler performs the same test for every iteration. The code can often be made more efficient by floating the IF statement out of the loop and putting the THEN and ELSE sections into their own loops. This process is called invariant IF floating.
The –each_invariant_if_growth and the –max_invariant_if_growth options control limits on invariant IF floating. This process generally involves duplicating the body of the loop, which can increase the amount of code considerably.
The –each_invariant_if_growth=integer option (or –eiifg=integer) controls the rewriting of IF statements nested within loops. This option specifies a limit on the number of executable statements in a nested IF statement. If the number of statements in the loop exceeds this limit, the compiler does not rewrite the code. If there are fewer statements, the compiler improves execution speed by interchanging the loop and IF statements.
Valid values for integer are from 0 to 100; the default is 20.
This process becomes complicated when there is other code in the loop, since a copy of the other code must be included in both the THEN and ELSE loops.
For example, the compiler transforms the following loop:

      DO I = ...
         section-1
         IF ( ) THEN
            section-2
         ELSE
            section-3
         ENDIF
         section-4
      ENDDO

into:

      IF ( ) THEN
         DO I = ...
            section-1
            section-2
            section-4
         ENDDO
      ELSE
         DO I = ...
            section-1
            section-3
            section-4
         ENDDO
      ENDIF
When sections 1 and 4 are large, the extra code generated can slow a program down (through cache contention, extra paging, and so on) more than the reduced number of IF tests speeds it up. The –each_invariant_if_growth option provides a maximum size (in number of lines of executable code) for sections 1 and 4, below which the compiler tries to float an invariant IF statement out of a loop.
This can be controlled on a loop-by-loop basis with the C*$* EACH_INVARIANT_IF_GROWTH (integer) directive within the source (see “Setting Invariant IF Floating Limits” in Chapter 7).
You can limit the total amount of additional code generated in a program unit through invariant IF floating by specifying the –max_invariant_if_growth option.
The –max_invariant_if_growth=integer option (or –miifg=integer) specifies an upper bound on the total number of additional lines of code the compiler can generate in each program unit through invariant IF floating. This limit is applied on a per subroutine basis. For example, if a subroutine is 400 lines long and –miifg=500, the compiler can add at most 100 lines in the process of invariant IF floating. The default for integer is 500.
Note: Other compiler optimizations can add or delete lines, so the final number of lines might differ from the value specified with –miifg.
This can be controlled on a loop-by-loop basis with the C*$* MAX_INVARIANT_IF_GROWTH (integer) directive within the source (see “Setting Invariant IF Floating Limits” in Chapter 7).
The –optimize=integer option (or –o=integer) sets the optimization level. Each optimization level is cumulative (that is, level 5 performs everything up to and including level 5). You can also modify the optimization level on a loop-by-loop basis by using the C*$* OPTIMIZE(integer) directive within the source (see “Optimization Level” in Chapter 7).
Valid values for integer are as follows:
1: Performs only simple optimizations. Enables induction variable recognition.

2: Performs lifetime analysis to determine when last-value assignment of scalars is necessary.

3: Recognizes triangular loops and attempts loop interchanging to improve memory referencing. Uses special-case data dependence tests. Also recognizes special index sets called wrap-around variables.

4: Generates two versions of a loop, if necessary, to break a data dependence arc.

5: Enables array expansion and loop fusion.
Although higher optimization levels increase performance, they also increase compilation time.
Output of the following example is described for –optimize=1, –optimize=2, and –optimize=5 to illustrate the range of this option. (This example also uses –minconcurrent=0.)
      ASUM = 0.0
      DO 10 I = 1,M
      DO 10 J = 1,N
         ASUM = ASUM + A(I,J)
         C(I,J) = A(I,J) + 2.0
10    CONTINUE
At –optimize=1, the compiler sees the summation in ASUM as an intractable data dependence between iterations and does not try to optimize the loop. Specifying –optimize=2 (perform lifetime analysis and do not interchange around reduction) produces the following:
      ASUM = 0.
C$DOACROSS SHARE(M,N,A,C),LOCAL(I,J),REDUCTION(ASUM)
      DO 3 I=1,M
      DO 2 J=1,N
         ASUM = ASUM + A(I,J)
         C(I,J) = 2. + A(I,J)
2     CONTINUE
3     CONTINUE
Specifying –optimize=5 (loop interchange around reduction to improve memory referencing) produces the following:
      ASUM = 0.
C$DOACROSS SHARE(N,M,A,C),LOCAL(J,I),REDUCTION(ASUM)
      DO 3 J=1,N
      DO 2 I=1,M
         ASUM = ASUM + A(I,J)
         C(I,J) = 2. + A(I,J)
2     CONTINUE
3     CONTINUE
The –roundoff=integer option (or –r=integer) controls the amount of variation in round-off error produced by optimization. If an arithmetic reduction is accumulated in a different order than in the scalar program, the round-off error is accumulated differently and the final result might differ from the output of the original program. Although the difference is usually insignificant, certain restructuring transformations performed by the compiler must be disabled to obtain exactly the same answers as the scalar program.
The values you can specify for integer are cumulative. For example, –roundoff=3 performs what is described for level 3, in addition to what is listed for the previous levels. Valid values for integer are as follows:
0: Suppresses any transformations that change round-off error.

1: Performs expression simplification, which might generate various overflow or underflow errors, for expressions with operands between binary and unary operators, expressions that are inside trigonometric intrinsic functions returning integer values, and after forward substitution. Enables strength reduction. Performs intrinsic function simplification for max and min. Enables code floating if –scalaropt is at least 1. Allows loop interchanging around serial arithmetic reductions, if –optimize is at least 4. Allows loop rerolling, if –scalaropt is at least 2.

2: Allows loop interchanging around arithmetic reductions if –optimize is at least 4. For example, the floating point expression A/B/C is computed as A/(B*C).

3: Recognizes REAL (float) induction variables if –scalaropt is greater than 2 or –optimize is at least 1. Enables sum reductions. Enables memory management optimizations if –scalaropt=3 (see “Performing Memory Management Transformations” for details about memory management transformations).
Consider the following code segment.
      ASUM = 0.0
      DO 10 I = 1,M
      DO 10 J = 1,N
         ASUM = ASUM + A(I,J)
         C(I,J) = A(I,J) + 2.0
10    CONTINUE
When –roundoff=1, the compiler does not transform the summation reduction. The compiler distributes the loop:
      ASUM = 0.
      DO 2 J=1,N
      DO 2 I=1,M
         ASUM = ASUM + A(I,J)
2     CONTINUE
      DO 3 J=1,N
      DO 3 I=1,M
         C(I,J) = A(I,J) + 2.
3     CONTINUE
When –roundoff=2 and –optimize=5 (reduction variable identification and loop interchange around arithmetic reduction), the original code becomes:
      ASUM = 0.
      DO 10 J=1,N
      DO 2 I=1,M
         ASUM = ASUM + A(I,J)
         C(I,J) = A(I,J) + 2.
2     CONTINUE
10    CONTINUE
When –roundoff=3 and –optimize=5, the compiler recognizes REAL induction variables. In this example, the compiler performs forward substitution of the transformed induction variable X, transforming the following code:

      ASUM = 0.0
      X = 0.0
      DO 10 I = 1,N
         ASUM = ASUM + A(I)*COS(X)
         X = X + 0.01
10    CONTINUE

into:

      ASUM = 0.
      X = 0.
      DO 10 I=1,N
         ASUM = ASUM + A(I) * COS ((I - 1) * 0.01)
10    CONTINUE
The –scalaropt=integer option sets the level of scalar optimization. Valid values for integer are as follows:

0: Disables all scalar optimizations.

1: Enables simple scalar optimizations: dead-code elimination, global forward substitution of variables, and conversion of IF-GOTO to IF-THEN-ELSE.

2: Enables the full range of scalar optimizations: floating invariant IF statements out of loops, loop rerolling and unrolling (if –roundoff is greater than zero), array expansion, loop fusion, loop peeling, and induction variable recognition.

3: Enables memory management transformations if –roundoff=3 (see “Performing Memory Management Transformations” for details about memory management transformations). Performs dead-code elimination during output conversion.
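As a hypothetical sketch of the simple level 1 transformations (the variables here are invented for illustration), global forward substitution and IF-GOTO conversion rewrite code such as:

```fortran
C     Before: IF-GOTO control flow, and K always holds N + 1
      K = N + 1
      IF (A(I) .LE. 0.0) GOTO 10
      B(I) = A(I) / K
      GOTO 20
10    B(I) = 0.0
20    CONTINUE

C     After: K is forward-substituted and the branches become
C     a structured IF-THEN-ELSE
      IF (A(I) .GT. 0.0) THEN
         B(I) = A(I) / (N + 1)
      ELSE
         B(I) = 0.0
      ENDIF
```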
Unlike the –scalaropt command-line option, the C*$* SCALAR OPTIMIZE directive sets the level of loop-based optimizations (for example, loop fusion) only, and not straight-code optimizations (for example, dead-code elimination).
Refer to “Controlling Scalar Optimizations” in Chapter 7 for details about the C*$* SCALAR OPTIMIZE directive.
The nine intrinsic functions ASIN, ACOS, ATAN, COS, EXP, LOG, SIN, TAN, and SQRT each have a scalar (element-by-element) version and a special version optimized for vectors. When you use –O3 optimization, the compiler uses the vector versions if it can. On the MIPS R8000 and R10000 processors, the vector function is significantly faster than the scalar version, but has a few restrictions on its use.
To apply the vector intrinsics, the compiler searches for loops of the following form:
      real a(10000), b(10000)
      do j = 1, 1000
         b(2*j) = sin(a(3*j))
      enddo
The compiler can recognize the eight functions ASIN, ACOS, ATAN, COS, EXP, LOG, SIN, and TAN when they are applied between elements of named variables in a loop (SQRT is not recognized automatically). The compiler automatically replaces the loop with a single call to a special, vectorized version of the function.
The compiler cannot use the vector intrinsic when the input is based on a temporary result or when the output replaces the input. In the following example, only certain functions can be vectorized.
      real a(400,400), b(400,400), c(400,400), d(400,400)
      call xx(a,b,c,d)
      do j = 100,300,2
         do i = 100,300,3
            a(i,j) = 1.23*i + a(i,j)
            b(i,j) = sin(a(i,j) + 1.0)
            a(i,j) = log(a(i,j))
            c(i,j) = sin(c(i,j)) / cos(d(i,j))
            d(i+30,j-10) = tan( d(j,i) )
         enddo
      enddo
      call xx(a,b,c,d)
      end
In the preceding code,
the first SIN call is applied to a temporary value and cannot be vectorized
the LOG call can be vectorized
results from the second SIN call and first COS call are used in temporary expressions and cannot be vectorized
the TAN call can be vectorized
The vector intrinsics are limited in the following ways:
The SQRT function is not used automatically in the current release (but it can be called directly; see “Calling Vector Functions Directly”).
The single-precision COS, SIN, and TAN functions are valid only for arguments whose absolute value is less than or equal to 2**28.
The double-precision COS, SIN, and TAN functions are valid only for arguments whose absolute value is less than or equal to PI*2**19.
The vector functions assume that the input and output arrays either coincide completely, or do not overlap. They do not check for partial overlap, and produce unpredictable results if it occurs.
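For example, a loop of the following form (a hypothetical case) reads and writes partially overlapping sections of the same array, so the vector version would produce unpredictable results:

```fortran
      real a(10000)
C     Input elements a(1)..a(999) and output elements a(2)..a(1000)
C     partially overlap; the vector intrinsics do not check for this.
      do j = 1, 999
         a(j+1) = sin(a(j))
      enddo
```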
If you need to disable the vector intrinsics while still compiling at the –O3 level, specify the –OPT:vector_intrinsics=OFF option:
f77 -64 -mips4 -O3 -OPT:vector_intrinsics=OFF trig.f
The vector intrinsic functions are C functions that can be called directly using the techniques discussed in the MIPSpro Fortran 77 Programmer's Guide. The prototype of one function is as follows:
__vsinf( void *from, void *dest, int count, int fromstride, int deststride )
Note the two leading underscore characters in the name. The arguments are as follows:

from: Address of the first element of the source array.

dest: Address of the first element of the destination array.

count: Number of elements to process.

fromstride: Number of elements to advance in the source array.

deststride: Number of elements to advance in the destination array.
For example, the compiler converts a loop of this form:
      real a(10000), b(10000)
      do j = 1, 1000
         b(2*j) = sin(a(3*j))
      enddo
into nonlooping code of this form:
      real a(10000), b(10000)
      call __VSINF$(%REF(A(3)),%REF(B(2)),%VAL(1000),%VAL(3),%VAL(2))
All the vector intrinsic functions have the same prototype as the one shown above for __vsinf. The names of the available vector functions are shown in Table 5-2.
This section describes advanced optimization techniques you can use to obtain maximum performance.
The –aggressive=letter option (or –ag=letter) performs optimizations that are normally forbidden. When using this option, your program must be a single file, so that the compiler can analyze all of it simultaneously.
The only available value for letter is a, which instructs the compiler to add padding to Fortran COMMON blocks. This optimization provides favorable alignments of the virtual addresses. This option does not have a default value:
% f77 -pfa,-ag=a program.f
For example, on a machine with a 64-kilobyte direct-mapped cache, a COMMON definition such as the following:
COMMON /alpha/ a(128,128),b(128,128),c(128,128)
can degrade performance if your program contains the following statement:
a(i,j) = b(i,j) * c(i,j)
All three of the arrays a, b, and c have the same starting virtual address modulo the cache size, and so every access to the array elements causes a cache miss. It would be much better to add some padding between each of the arrays to force the virtual addresses to be different. The –aggressive=a option does exactly this.
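The effect can be pictured as hand-inserted padding (a sketch only; the actual pad sizes are chosen by the compiler):

```fortran
C     Each 128x128 REAL array occupies exactly 64 KB, so a, b, and c
C     would all map to the same cache lines.  The pads shift b and c
C     to different cache addresses.  (Pad sizes are illustrative.)
      COMMON /alpha/ a(128,128), pad1(32), b(128,128), pad2(32),
     &               c(128,128)
```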
Unfortunately, this transformation is not always possible. Fortran allows different routines to have different definitions of COMMON. If some other routine contained the definition
COMMON /alpha/ scratch(49152)
the compiler could not arbitrarily add padding. Therefore, when using this option the entire program must be in a single source file, so the compiler can check for this sort of occurrence.
You can use the –arclimit option to increase the size of the dependence data structure, enabling the compiler to perform more optimizations. (Most users do not need to change this value.) The compiler dynamically allocates the dependence data structure on a loop-nest-by-loop-nest basis. If a loop contains too many dependence relationships and cannot be represented in the dependence data structure, the compiler stops analyzing the loop. Increasing the value of –arclimit allows the compiler to analyze larger loops.

Note: The number of data dependencies (and the time required to do the analysis) is potentially nonlinear in the length of the loop. Very long loops (several hundred lines) may be impossible to analyze regardless of the value of –arclimit.
When both –roundoff and –scalaropt are set to 3, the compiler attempts to perform outer loop unrolling (to improve register utilization) and automatic loop blocking (to improve cache utilization).
Normal loop unrolling (enabled with the –unroll and –unroll2 options) applies to the innermost loop in a nest of loops. In outer loop unrolling, one of the other loops (typically the next innermost) is unrolled. In certain situations, this technique (also called “unroll and jam”) can greatly improve the register utilization.
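A sketch of the idea on a matrix-vector product (illustrative code, not actual compiler output): the outer loop is unrolled by two and the copies are jammed into a single inner loop, so each X(J) is loaded once and reused:

```fortran
C     Before: the inner loop loads X(J) for one row at a time
      DO 20 I = 1, N
         DO 10 J = 1, N
            Y(I) = Y(I) + A(I,J) * X(J)
10       CONTINUE
20    CONTINUE

C     After unroll-and-jam by two (N assumed even): X(J) is reused
C     for both Y(I) and Y(I+1), halving the loads of X
      DO 40 I = 1, N, 2
         DO 30 J = 1, N
            Y(I)   = Y(I)   + A(I,J)   * X(J)
            Y(I+1) = Y(I+1) + A(I+1,J) * X(J)
30       CONTINUE
40    CONTINUE
```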
Loop blocking is a transformation that can be applied when the loop nesting depth is greater than the dimensions of the data arrays being manipulated. For example, the simple matrix multiply uses a nest of three loops operating on two-dimensional arrays. The simple approach repeatedly sweeps across the entire arrays. A better approach is to break the arrays up into blocks, each block being small enough to fit into the cache, and then make repeated sweeps over each (in-cache) block. (This technique is also sometimes called “tiles” or “tiling.”) However, the code needed to implement a block style algorithm is often very complex and messy. This automatic transformation allows you to write the simpler method, and have the compiler transform it into the more complex and efficient block method.
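For a matrix multiply, the blocked form the compiler aims for can be sketched as follows (illustrative; NB is a block size chosen to fit the cache, and N is assumed to be a multiple of NB):

```fortran
C     Blocked (tiled) matrix multiply: repeated sweeps stay within
C     an NB x NB block of B and a column strip of A and C, so the
C     working set fits in the data cache.
      DO 60 JJ = 1, N, NB
      DO 60 KK = 1, N, NB
         DO 50 J = JJ, JJ+NB-1
         DO 50 K = KK, KK+NB-1
            DO 40 I = 1, N
               C(I,J) = C(I,J) + A(I,K) * B(K,J)
40          CONTINUE
50       CONTINUE
60    CONTINUE
```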
The compiler recognizes the following memory management command-line options when specified with the -pfa option:
–cacheline specifies the width of the memory channel between cache and main memory.
–cachesize specifies the data cache size.
–fpregisters specifies the number of single-precision registers; the compiler uses this value when choosing an unrolling factor.

–dpregisters specifies the number of DOUBLE PRECISION registers; the compiler uses this value to ensure that registers do not overflow during loop unrolling.
–setassociativity specifies which memory management transformation to use.
The –cacheline=integer option (or –chl=integer) specifies the width of the memory channel, in bytes, between the cache and main memory. The default value for integer is 4. Refer to Table 5-3 for the recommended setting for your machine.
The –cachesize=integer option (or –chs=integer) specifies the size of the data cache, in kilobytes, for which to optimize. The default value for integer is 256 kilobytes. Refer to Table 5-3 for the recommended setting for your machine. You can obtain the cache size for a given machine with the hinv(1) command. This option is generally useful only in conjunction with the other memory management transformations.
(Table 5-3, not reproduced here, lists the recommended cache settings for each machine, including the POWER Series 4D/100, POWER Series 4D/200, and R4000 machines such as the Crimson.)
The –setassociativity=integer option (or –sasc=integer) provides information on the mapping of physical addresses in main memory to cache pages. The default value for integer, 1, indicates that a datum in main memory can be placed in only one location in the cache. If that cache page is already in use, its contents must be rewritten or flushed so that the newly accessed page can be copied into the cache. Silicon Graphics recommends you set this value to 1 for all machines except the POWER CHALLENGE™ series, where you should set it to 4.
The –dpregisters=integer option (or –dpr=integer) specifies the number of DOUBLE PRECISION registers each processor has. The –fpregisters option (or –fpr=integer) specifies the number of single precision (that is, ordinary floating point) registers each processor has.
Silicon Graphics recommends you specify the same value for both –dpregisters and –fpregisters. The default value for integer is 16 for both options. When compiling in 32-bit mode, do not specify 16, although that is what the hardware supports; it is better to specify a smaller value, such as 12, to provide extra registers in case the compiler needs them. In 64-bit mode, where the hardware supports 32 registers, specify 28 for integer.
The –unroll and the –unroll2 options control how the compiler unrolls scalar loops. When loops cannot be optimized for concurrent execution, loop execution is often more efficient when the loops are unrolled. (Fewer iterations with more work per iteration require less overhead overall.) You must also specify –scalaropt=2 when using these options.
The –unroll=integer (or –ur=integer) option directs the compiler to unroll inner loops. integer specifies the number of times to replicate the loop. The default value is 4.
0: Uses the default values to unroll.

integer: Unrolls, at most, this many iterations.
The –unroll2=weight (or –ur2=weight) option specifies an upper bound on the number of operations in a loop when unrolling it with the –unroll option. The default value for weight is 100. The compiler unrolls an inner loop until the number of operations (the amount of work) in the unrolled loop is close to this upper bound, or until the number of iterations specified in the –unroll option is reached, whichever occurs first.
For the –unroll2 option, the compiler analyzes a given loop by computing an estimate of the computational work inside the loop for one iteration. This rough estimate is obtained by adding the number of assignments, IF statements, subscripts, and arithmetic operations in the loop body.
The following example uses the C*$* UNROLL directive (see “Enabling Loop Unrolling” in Chapter 7) to specify 8 for the maximum number of iterations to unroll and 100 for the maximum “work per unrolled iteration.” (This is equivalent to specifying –pfa,–unroll=8,–unroll2=100.)
C*$*UNROLL(8,100)
      DO 10 I = 2,N
         A(I) = B(I)/A(I-1)
10    CONTINUE
This example has:

1 assignment

0 IF statements

3 subscripts

2 arithmetic operators

for a weighted sum of 6 (the work for one iteration).
This weighted sum is then divided into 100 to give a potential unrolling factor of 16. However, the example has also specified 8 for the maximum number of unrolled iterations. The compiler takes the minimum of the two values (8) and unrolls that many iterations. (The maximum number of iterations the compiler unrolls is 100.)
In this case (an unknown number of iterations), the compiler generates two loops—the primary unrolled loop and a cleanup loop to ensure that the number of iterations in the main loop is a multiple of the unrolling factor.
The result is the following example.
      INTEGER I1
C*$*UNROLL(8,100)
      I1 = MOD (N - 1, 8)
      DO 2 I=2,I1+1
         A(I) = B(I) / A(I-1)
2     CONTINUE
      DO 10 I=I1+2,N,8
         A(I) = B(I)/A(I-1)
         A(I+1) = B(I+1) / A(I)
         A(I+2) = B(I+2) / A(I+1)
         A(I+3) = B(I+3) / A(I+2)
         A(I+4) = B(I+4) / A(I+3)
         A(I+5) = B(I+5) / A(I+4)
         A(I+6) = B(I+6) / A(I+5)
         A(I+7) = B(I+7) / A(I+6)
10    CONTINUE
The –directives=list option specifies which directives and assertions the compiler accepts. list can contain the following characters:

a: Accepts assertions.

c: Accepts Cray CDIR$ directives.

k: Accepts Silicon Graphics C*$* and C$PAR directives.

p: Accepts parallel programming directives.

s: Accepts Sequent® C$ directives.

v: Accepts VAST CVD$ directives.
The default value for list is ackpv. For example, –pfa,–directives=k enables Silicon Graphics directives only, whereas –pfa,–directives=kas enables Silicon Graphics directives and assertions and Sequent directives.
To disable all of the above options, enter –nodirectives or –directives (without any values for list) on the command line. Chapter 7, “Fine-Tuning Power Fortran,” describes the Silicon Graphics, Cray, Sequent, and VAST directives the compiler accepts.
Assertions are similar in form to directives, but they assert program characteristics that the compiler can use in its optimizations. In addition to specifying a in list, you can control whether the compiler accepts assertions using the C*$* ASSERTIONS and C*$* NO ASSERTIONS directives (refer to “Using Assertions” in Chapter 7).
The –recursion option (or –rc) allows the compiler to call subroutines and functions in the source program recursively (that is, a subroutine or function calls itself, or it calls another routine that calls it). Recursion affects storage allocation decisions.
This option is enabled by default. To disable it, specify –norecursion (or –nrc).
Unsafe transformations can occur unless the –recursion option is enabled for each recursive routine that the compiler processes.