This guide explains how to tune programs for best performance on the Silicon Graphics Origin2000, Onyx2, and Origin200 multiprocessor systems. The material is meant for two different uses:
As a self-paced study course, to be used by any software developer who is writing or maintaining an application to run on an Origin system.
As an outline and supplementary material for a course delivered in person by a Silicon Graphics System Support Engineer.
The guide also contains a glossary of terms related to performance tuning and to the hardware concepts of SN0 (Scalable Node 0, the name for the Silicon Graphics server architecture).
The guide is written for experienced programmers who are familiar with IRIX commands and with either the C or the Fortran programming language. The focus is on achieving the highest possible performance by exploiting the features of IRIX, the MIPS R10000 CPU, and the SN0 architecture.
The material assumes that you know the basics of software engineering and that you are familiar with standard methods and data structures. If you are new to software design, to UNIX, to IRIX, or to Silicon Graphics hardware, this guide will not help you learn these things.
Chapter 1, “Understanding SN0 Architecture,” describes the features of the SN0 architecture that affect performance, in particular the cache-coherent nonuniform memory architecture (CC-NUMA).
Chapter 2, “SN0 Memory Management,” reviews general programming issues for these systems and the programming practices that lead to good (and bad) performance.
Chapter 3, “Tuning for a Single Process,” covers tuning for single-process performance in detail, showing how to take best advantage of the R10000 CPU and cache memory, how to use the profiling tools, and how to select among the many compiler options.
Chapter 8, “Tuning for Parallel Processing,” discusses tuning issues for parallel programs, including points on how to avoid cache contention and how to distribute virtual memory segments to different nodes.
Appendix A, “Bentley's Rules Updated,” is a summary of the performance-tuning guidelines first published by Jon Bentley in the out-of-print classic Writing Efficient Programs, updated for the modern world of superscalar CPUs and multiprocessors.
Appendix B, “R10000 Counter Event Types,” describes the meanings of the event counter registers in the R10000 CPU and their use for tuning.
Appendix C, “Useful Scripts and Code,” contains several longer examples and scripts mentioned in the text.
The material covered in this book is related to other works in the Silicon Graphics library.
All of the following books can be read online from the Tech Pubs Library at http://techpubs.sgi.com/library.
MIPS R10000 Microprocessor User Guide, Version 2.0, 007-2490-001, is the authoritative guide to the internal operations of the CPU chip used in SN0 systems.
Origin and Onyx2 Theory of Operations Manual, 007-3439-nnn, covers the basic design of the SN0 architecture.
Origin and Onyx2 Programmer's Reference Manual, 007-3410-nnn, has additional details of SN0 physical and virtual addressing and other topics.
MIPSpro Compiling and Performance Tuning Guide, 007-2360-nnn, covers compiler and linker use that is common to all the compilers, including the many optimization directives and command-line options.
MIPSpro 64-Bit Porting and Transition Guide, 007-2391-nnn, discusses the problems that arise when porting from a 32-bit to a 64-bit computing environment, and has some discussion of optimization features.
The Fortran compilers are documented in MIPSpro Fortran 77 Programmer's Guide, 007-2361-nnn, and MIPSpro Fortran 90 Commands and Directives Reference Manual, 007-3696-nnn. These books address general run-time issues, have some discussion of performance tuning, and document compiler directives, including the OpenMP directives for parallel processing.
MIPSpro C and C++ Pragmas, 007-3587-nnn, covers parallelization and other directives for C programming.
MIPSpro Auto-Parallelizing Option Programmer's Guide, 007-3572-nnn, documents how the C, C++, and Fortran compilers automatically parallelize serial programs.
SpeedShop User's Guide, 007-3311-nnn, documents the tuning and profiling tools mentioned in this book.
Topics in IRIX Programming, 007-2478-nnn, details the available models for parallel programming and documents a number of advanced programming topics.
Message Passing Toolkit: MPI Programmer's Manual, 007-3687-nnn, and Message Passing Toolkit: PVM Programmer's Manual, 007-3686-nnn, document the use of these popular libraries for parallel programming.
IRIX Admin: System Configuration and Operation, 007-2859-nnn, documents the commands the system administrator uses, including the system tuning variables.
The foundation of the SN0 multiprocessor design is explained in Scalable Shared-Memory Multiprocessing by Daniel Lenoski and Wolf-Dietrich Weber (San Francisco: Morgan Kaufmann, 1995).
A particularly good book on parallel programming is Practical Parallel Programming by Barr E. Bauer (Academic Press, 1992; ISBN 0120828103). Although it is not current for Silicon Graphics compilers and SN0 hardware, it has good conceptual material.
Courses and information about parallel and distributed programming are available on the Internet. The following are some useful links:
The Boston University Scientific Computing and Visualization Group offers a number of useful tutorials on such topics as parallel programming in Fortran 90 and the use of MPI. The URL is http://scv.bu.edu/SCV/Tutorials/ .
The web page for the Computational Science and Engineering Graduate Option Program at the University of Illinois at Urbana-Champaign links to the lecture notes for courses on parallel computation and parallel numerical algorithms. The URL is http://www.cse.uiuc.edu/ .
The entire text of Designing and Building Parallel Programs by Ian Foster (Addison-Wesley 1995; ISBN 0-201-57594-9) is available online, with a wealth of supplementary material and links related to parallel programming. The URL is http://www.mcs.anl.gov/dbpp/ .
The NAS Parallel Benchmarks are a suite of programs that are used to compare the performance of parallel systems and parallelizing compilers. The benchmarks are described in a report found at the following URL: http://science.nas.nasa.gov/Pubs/TechReports/RNRreports/dbailey/RNR-94-007/html/npbspec.html .
The reference pages for the compilers and tools are detailed and informative. Look up the following reference pages using the InfoSearch facility (under IRIX 6.5, found in the desktop Toolchest menu under Help > Man Pages). From the InfoSearch window you can print copies of these pages for study and for reference.
cc(1), CC(1), f77(1), and f90(1) each document the operation and main option groups for one compiler. These pages are very similar because most options are processed by the common back end and linker that all the compilers share.
ipa(5) documents the -IPA option subgroup, controlling the interprocedural analysis phase of all compilers.
lno(5) documents the -LNO option group, controlling loop-nest optimization for all compilers.
opt(5) documents the -OPT option group, controlling general optimizations for all compilers.
math(3) details the standard math library used by all programs. Specially tuned libraries are described in libfastm(3) and sgimath(3) (this very long page lists all BLAS, LAPACK, and other functions).
ld(1) documents the linker; rld(1) documents the runtime linker; and dso(5) documents the format of dynamic shared objects (runtime-linkable libraries).
The auto-parallelizing feature is documented in auto_p(5). It generates code that uses the multiprocessing library; its runtime control is documented in mp(3F) for Fortran 77, pe_environ(5) for Fortran 90, and mp(3C) for C and C++. The newer OpenMP runtime features are in omp_threads(3), omp_lock(3), and omp_nested(3).
The reference pages perfex(1), speedshop(1), ssrun(1), and prof(1) document the software tools used to profile and analyze your programs.
Different text fonts are used in this book to indicate different kinds of information, as shown in the following examples:

Terms that are defined in the Glossary. You can click such terms to link to their definitions.
    Example: This performance problem is generically referred to as cache contention.

Names of IRIX commands and command-line options.
    Example: Compile with cc -LNO:off. Check the CPU clock rate with hinv.

Names of filesystems, paths, linkable libraries, and files.
    Example: Examine /etc/config. Devices appear in the /hw filesystem.

Names of routines, functions, and procedures when used as names.
    Example: Two common library functions are printf() and wait().

User input and program statements or expressions, when they must be typed exactly as shown.
    Example: Use c$doacross mp_schedtype=simple to parallelize with the basic scheduling. Enter y when prompted.

Program and mathematical variables used as names, and variable elements of program expressions.
    Example: Applying p CPUs to a program does not result in a speedup of p times. The feedback file is written as program.n.fb.