Now that you have been introduced to the SN0 architecture, it is time to address “tuning,” that is, how to make your programs run their fastest on this architecture. One answer, in the SN0 environment, is to make the program use multiple CPUs in parallel. However, this should be the last step. First, make your program run as efficiently as possible as a single process.
The process of tuning can be divided into the following steps:

1. Make sure the program produces the right answers. This is especially important when first porting a program to IRIX 6.5, 64-bit computing, and SN0.

2. Use existing, tuned library code.

3. Profile the program's execution and analyze its dynamic behavior. The many tools for this purpose are covered in Chapter 4, “Profiling and Analyzing Program Behavior”.

4. Make the compiler produce the most efficient code. The use of the many interrelated compiler options is covered in Chapter 5, “Using Basic Compiler Optimizations”.

5. Modify the program to access memory efficiently. This is covered in Chapter 6, “Optimizing Cache Utilization”.

6. Exploit the compiler's ability to optimize loops. Loop optimization is covered in Chapter 7, “Using Loop Nest Optimization”.

Steps 1 and 2 are covered in this chapter. When you have done all these things and the program still takes too long, Chapter 8, “Tuning for Parallel Processing”, covers parallelization.
You can make any program run as fast as you want—if you don't insist on correct results. However, the very first rule in tuning is to make sure that the program generates the correct results. Although this may seem obvious, it is easy to forget to check the answers along the way as you make performance improvements. A lot of frustration can be avoided by making only one change at a time, verifying that the program is correct after each change.
In addition to problems introduced by performance changes, correctness issues may be raised merely by porting a code from one system to another, or by compiling to a different environment. Most programs that are written in a high-level language and adhere to the language standard will port with no difficulties. This is particularly true for standard-compliant FORTRAN 77 or Fortran 90 programs. C programs are more likely to cause problems, because C code more often depends on the sizes of data types and other implementation-defined behavior.
IRIX 6.5 is a 64-bit operating system. Although programs can be (and often are) compiled and run in a 32-bit address space, you may need to take advantage of the larger address space and so compile in the 64-bit Application Binary Interface (ABI). (For a survey of the available ABIs and the differences among them, see the ABI(5) man page.) All modules of a program must use the same ABI.
When a program is written in a portable manner, you can compile it for a different ABI merely by specifying one of the compiler flags: -o32, -n32, or -64. However, a change of ABI can cause a program to fail, because of a variety of subtle problems. All of these issues are discussed in depth in the MIPSpro 64-Bit Porting and Transition Guide listed in “Related Documents”.
The oldest ABI is the "old" 32-bit environment (compiler flag -o32). It was the only ABI before IRIX version 6. Programs compiled -o32 are limited to the MIPS I and MIPS II instruction sets, on the assumption that they must be able to execute on an older computer. If you want to use the capabilities of the MIPS III or MIPS IV instruction sets—and you do, if performance is a goal—you need to compile to a different ABI.
With the “new” 32-bit ABI (compiler flag -n32), the program uses a 32-bit address space. Pointers and long integers are 32 bits in size, so there are fewer portability issues (for example, a binary file containing 32-bit integers written by an old program can be read by an -n32 program using the identical header file).
The advantage of using -n32 is that it allows programs to use either the MIPS III or the MIPS IV ISA (selected by the -mips3 and -mips4 compiler flags). These ISAs provide more registers, and wider ones, and use a faster protocol for subroutine calls than the old 32-bit ABI does. The compiler defaults to this ABI on R10000/R12000/R14000 machines.
When compiled with the -64 flag, the program runs in a 64-bit address space. This permits definition of immense arrays of data. Pointers and long integers are 64 bits in length. Other than that, -64 and -n32 are essentially the same. That is, there is no performance advantage to the use of -64; indeed, because pointers take up more memory, a program that uses many pointers may run more slowly when compiled -64. Use -64 only when your program requires memory data totaling more than 2 GB, or when your program calls the MPI subroutine library.
Unless you require your program to run on a MIPS R4000 system (such as an Indy workstation), use one of the combinations -n32 -mips4 or -64 -mips4. If backward compatibility with the R4x00 CPU is needed, use -n32 -mips3.
On all non-R8000 systems prior to the R10000, the default execution environment is -o32 and -mips2. For R10000/R12000/R14000 systems, the default execution environment is -n32 and -mips4. On Challenge and Onyx R8000-based systems, these defaults are -64 -mips4. Because the defaults vary by system type, and because you may compile on a workstation for execution on a server (or vice versa), it is wise to specify the desired ABI and ISA explicitly in every compile. The best way to do this is with a makefile, as described under “Using a Makefile” in Chapter 5.
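Such a makefile might contain, for example (a sketch only; the program and file names are illustrative):

```make
# Fix the ABI and ISA explicitly, so the compile does not depend
# on the defaults of the machine where make happens to be run.
ABI    = -n32 -mips4
FFLAGS = $(ABI) -O2
CFLAGS = $(ABI) -O2

prog: main.o
	f77 $(FFLAGS) -o prog main.o
```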
A common example of such a porting problem is a Fortran program that assumes that all variables are initialized to zero. While this initialization occurred on most older machines, it is not common on newer, stack-based architectures. When a program making such an assumption is ported to IRIX, it can generate incorrect results.
Although fixing the program is the “right thing to do,” it does require some effort. An alternative that requires almost no effort is to use the -static flag, which causes all variables to be allocated from a static data area rather than the stack. Static variables are initialized to zero. There is a penalty to the easy way out: use of the -static flag hurts program performance; a 20 percent penalty is common.
Other types of porting failures can occur as well. The results generated on one computer may disagree with those from another because of differences in the floating point format, or because of differences in the generated code that cause the roundoff error to change. You need to understand the particular program to determine if these differences are significant.
In addition, some programs may even fail to compile on a new system if language extensions are not supported or if they are provided through a different syntax. Fortran pointers provide a good example of this; there is no pointer type in the FORTRAN 77 standard, and different systems provide this extension in incompatible ways. In such situations, the code must be modified to be compatible with the new compiler. Similarly, if a program makes calls to a particular vendor's libraries, these calls must be replaced with equivalent entry points from the new vendor's libraries.
To verify the existence and syntax of Fortran features in SGI compilers, see the compiler guides listed in “Related Documents”.
Finally, programs sometimes have mistakes in them that by pure luck have not yet caused problems but that, when compiled on a new system, cause errors to appear. Writing beyond the end of an array can show this behavior: whether storing past the end of the array causes a problem depends on how the compiler lays out variables in memory.
For Fortran, use the -check_bounds flag to instruct the compiler to generate run-time subscript range checking. If an out-of-bounds array reference is made, the program aborts and displays an error message indicating the illegal reference and the line of code where it occurred.
The quickest and easiest way to improve a program's performance is to link it with libraries already tuned for the target hardware. The standard math library is so tuned, and there are optional libraries that can provide performance benefits: libfastm, CHALLENGEcomplib, and SCSL.
For the standard libraries such as libc, hardware-specific versions are automatically linked, based on the compiler's information about the target system. The compiler assumes that the current system is the target for execution, but you can tell it otherwise with compiler options. The -TARG option group (described in the cc(1), f90(1), and f77(1) man pages) describes the target system by CPU type or by SGI processor board type (“IP” number). Alternatively, you can use the -r flag to name a particular MIPS CPU.
The standard math library includes special “vector intrinsics,” that is, vectorized versions of the functions vacos, vasin, and others (plus single-precision versions whose names are made by appending an f). These functions are designed to take maximum advantage of the pipelining characteristics of the CPU when processing a vector of numbers.
You can write explicit calls to the vector functions, passing input and output vectors. In certain circumstances, the loop-nest optimizer of the Fortran compilers recognizes a vector loop and automatically replaces it with a vector intrinsic call (see the compiler documentation listed under “Related Documents”).
The standard math library is described in the math(3) man page. This library is linked in automatically by Fortran. When you link using the ld or cc commands, use the -lm flag.
The libfastm library provides highly optimized versions of a subset of the routines found in the standard libm math library. Optimized scalar versions of sin, cos, tan, exp, log, and pow for both single and double precision are included. To link libfastm, append -lfastm to the end of the link line: cc -o modules... -lfastm.
Separate versions of libfastm exist for the different CPU types. When the target system is not the one used for compiling, explicitly specify the target CPU. For details, see the libfastm(3) man page.
CHALLENGEcomplib is a library of mathematical routines that carries out linear algebra operations, Fast Fourier Transforms (FFTs), and convolutions. It contains implementations of the well-known public domain libraries LINPACK, EISPACK, LAPACK, FFTPACK, and the Level-1, -2, and -3 BLAS, but all tuned specifically for high performance on the MIPS IV ISA. In addition, the library contains proprietary sparse solvers, also tuned to the hardware.
Many CHALLENGEcomplib routines have been parallelized to use multiple CPUs concurrently to shorten solution time. A parallelized algorithm runs in a one-CPU system (or when a program is confined to a single CPU), but it incurs additional overhead. CHALLENGEcomplib comes in both sequential and parallel versions. To link the sequential version, add -lcomplib.sgimath to the link line:
% f77 -o modules -lcomplib.sgimath -lfastm
To link the parallel version, use -lcomplib.sgimath_mp; in addition, the -mp flag must be used when linking in the parallel code:
% f77 -mp -o modules -lcomplib.sgimath_mp -lfastm
SCSL, the SGI Scientific Library, is an optimized library that contains FFT routines, Level-1, -2, and -3 BLAS, and LAPACK routines, many of them parallelized. SCSL is distributed as a separate product from IRIX, but is available at no charge to IRIX users. SCSL ships automatically to all SN0 systems.
Note that the user interface for the FFT routines is different from the interface used in CHALLENGEcomplib. Some extended BLAS routines, which are available in libsci but not in CHALLENGEcomplib, have been included in SCSL.
You link SCSL into your program by using -lscs for the single-threaded version, or -lscs_mp for the parallelized version of the library.
Tip: Both CHALLENGEcomplib and SCSL define names also found in libfastm, so if you want to call libfastm, it should be named last in the link command.
Many working UNIX applications compile and run correctly without modification in IRIX. Nevertheless, dependency on vendor-specific interfaces, or on compiler behavior, or incorrect assumptions about data representation, can cause errors, and the first task is to make sure the program processes its full range of input to produce correct answers. Then, if the program uses external math functions, link it with a tuned library and again make sure that correct answers emerge. Then the program is ready to be tuned.