- 64-bit address space
- Selecting an ABI and ISA
- adi2 example program
- Program adi2
- aliasing models
- Understanding Aliasing Models
- Amdahl's law
- Understanding Parallel Speedup and Amdahl's Law
- awk script for
- Awk Script for Amdahl's Law Estimation
- execution time given n and p
- Predicting Execution Time with n CPUs
- parallel fraction p
- Understanding Amdahl's Law
- parallel fraction p given speedup(n)
- Calculating the Parallel Fraction of a Program
- speedup(n) given p
- Understanding Amdahl's Law
- superlinear speedup
- Understanding Superlinear Speedup
- application binary interface (ABI)
- Selecting an ABI and ISA
- 64-bit
- 64-Bit ABI
- new 32-bit
- New 32-Bit ABI
- old 32-bit
- Old 32-Bit ABI
- arithmetic error
- Understanding Arithmetic Standards
- array padding
- Using Array Padding to Prevent Thrashing
- Diagnosing and Eliminating Cache Thrashing
- Using Array Padding
- auto-parallelizing
- Compiling Serial Code for Parallel Execution
- Bentley, Jon
- Bentley's Rules Updated
- cache
- and hardware event counter
- Primary Cache Use
- blocking
- Understanding Cache Blocking
- Controlling Cache Blocking
- cache miss
- Understanding Level-One and Level-Two Cache Use
- coherent
- Understanding Cache Coherency
- Cache Coherency Events
- compiler's model of
- Adjusting the Optimizer's Cache Model
- contention in
- Diagnosing Cache Problems
- correcting
- Correcting Cache Contention in General
- event 31 reveals
- Diagnosing Cache Problems
- Identifying False Sharing
- diagnosing problems in
- Identifying Cache Problems with Perfex and SpeedShop
- Diagnosing Cache Problems
- directory-based
- Memory Overhead Bits
- Understanding Directory-Based Coherency
- false sharing of
- Identifying False Sharing
- L1
- Level-1 Cache
- Understanding Level-One and Level-Two Cache Use
- Primary Cache Use
- L2
- Level-Two Cache
- Understanding Level-One and Level-Two Cache Use
- Secondary Cache Use
- line size
- Understanding Level-One and Level-Two Cache Use
- data structure blocking for
- Data Structure Augmentation
- on-chip
- Cache Architecture
- operation of
- Understanding Cache Coherency
- Understanding Directory-Based Coherency
- Understanding Level-One and Level-Two Cache Use
- principles of use
- Principles of Good Cache Use
- proper use of
- Principles of Good Cache Use
- Using Other Cache Techniques
- array padding
- Using Array Padding
- blocking data for
- Understanding Cache Blocking
- Controlling Cache Blocking
- grouping related data for
- Grouping Data Used at the Same Time
- loop fusion for
- Understanding Loop Fusion
- parallel execution issues
- Diagnosing Cache Problems
- stride-one access for
- Using Stride-One Access
- transposition for
- Understanding Transpositions
- set-associative
- Understanding Level-One and Level-Two Cache Use
- thrashing in
- Understanding Cache Thrashing
- snoopy
- Coherency Methods
- thrashing
- Understanding Cache Thrashing
- Diagnosing and Eliminating Cache Thrashing
- cache coherence
- and hardware event counter
- Cache Coherency Events
- cache coherency
- Understanding Cache Coherency
- cache line
- Understanding Level-One and Level-Two Cache Use
- call hierarchy profile
- Profiling the Call Hierarchy
- compiler directive
- See directive
- compiler feedback file
- Creating a Compiler Feedback File
- compiler flag
- See compiler option
- compiler option
- -32
- Old 32-Bit ABI
- -64
- 64-Bit ABI
- recommended
- Understanding Compiler Options
- -apo
- Compiling an Auto-Parallel Version of a Program
- -check_bounds
- Computational Differences
- Using Array Padding
- -clist
- Reading the Transformation File
- default
- Understanding Compiler Options
- -fb
- Creating a Compiler Feedback File
- Passing a Feedback File
- -flist
- Reading the Transformation File
- for cache model
- Adjusting the Optimizer's Cache Model
- IEEE_arithmetic
- Exploit Algebraic Identities
- -INLINE
- Using Manual Inlining
- Using Automatic Inlining
- -IPA
- Requesting IPA
- forcedepth
- Using Automatic Inlining
- inline
- Using Automatic Inlining
- space
- Using Automatic Inlining
- -LNO
- Using Loop Nest Optimization
- blocking
- Adjusting Cache Blocking Block Sizes
- fission
- Controlling Fission and Fusion
- gather_scatter
- Understanding Gather-Scatter
- ignore_pragmas
- Requesting LNO
- interchange=off
- Using Loop Interchange
- outer_unroll
- Controlling Loop Unrolling
- prefetch
- Controlling Prefetching
- vintr
- Vector Intrinsics
- -mips3
- New 32-Bit ABI
- -mips4
- New 32-Bit ABI
- Recommended Starting Options
- -n32
- New 32-Bit ABI
- Recommended Starting Options
- -On
- Setting Optimization Level with -On
- -O2
- Recommended Starting Options
- -O3
- for SWP
- Enabling Software Pipelining with -O3
- -Ofast
- versus -O3
- Compile -O3 or -Ofast for Critical Modules
- -Olimit
- Using Automatic Inlining
- -OPT
- alias
- Understanding Aliasing Models
- cray_ivdep
- Breaking Other Dependencies
- IEEE_arithmetic
- Recommended Starting Options
- IEEE Conformance
- IEEE_NaN_inf
- IEEE Conformance
- liberal_ivdep
- Breaking Other Dependencies
- reorg_common
- Using Array Padding
- roundoff
- Roundoff Control
- -r10000
- Standard Math Library
- Setting Target System with -TARG
- -r5000
- Standard Math Library
- Setting Target System with -TARG
- -r8000
- Standard Math Library
- Setting Target System with -TARG
- roundoff
- Exploit Algebraic Identities
- -S
- Reading Software Pipelining Messages
- -static
- Uninitialized Variables
- -TARG
- Setting Target System with -TARG
- -TENV
- Profiling Exception Frequency
- X
- Controlling the Level of Speculation
- copying
- to reduce TLB thrashing
- Using Copying to Circumvent TLB Thrashing
- correctness
- Getting the Right Answers
- CPU
- See MIPS CPU
- CrayLink
- Hub and NUMAlink
- data distribution
- Using Data Distribution Directives
- and dplace
- Using _DSM_VERBOSE
- directives for
- Understanding Directive Syntax
- Distribute directive
- Using Distribute for Loop Parallelization
- mapping types
- Understanding Distribution Mapping Options
- ONTO clause
- Understanding the ONTO Clause
- page placement
- Using the Page_Place Directive for Custom Mappings
- redistribution
- Understanding the Redistribution Directives
- reshaped
- Using Reshaped Distribution Directives
- restrictions
- Restrictions of Reshaped Distribution
- data placement
- Scalability and Data Placement
- for libmp programs
- Tuning Data Placement for MP Library Programs
- modifying code for
- Modifying the Code to Tune Data Placement
- DAXPY
- Understanding Software Pipelining
- and alias model
- Understanding Aliasing Models
- loop fusion of
- Understanding Loop Fusion
- with indirection
- Breaking Other Dependencies
- debugging
- possible with -O2
- Start with -O2 for All Modules
- use -O0 for
- Use -O0 for Debugging
- dependency
- Breaking Other Dependencies
- directive
- blocking size
- Adjusting Cache Blocking Block Sizes
- for data distribution
- Fortran Source with Directives
- Using Data Distribution Directives
- Distribute
- Using Distribute for Loop Parallelization
- page place
- Using the Page_Place Directive for Custom Mappings
- syntax
- Understanding Directive Syntax
- for loop interchange
- Using Loop Interchange
- for loop nest optimizer
- Requesting LNO
- for loop unrolling
- Controlling Loop Unrolling
- for parallel execution
- Fortran Source with Directives
- affinity clause
- Using Parallel Do with Distributed Data
- Understanding the AFFINITY Clause for Data
- Understanding the AFFINITY Clause for Threads
- nest clause
- Understanding the NEST Clause
- for prefetching
- Controlling Prefetching
- ivdep
- Breaking Other Dependencies
- OpenMP
- Fortran Source with Directives
- dlook
- Applying dlook
- dplace
- Non-MP Library Programs and Dplace
- disables data distribution directives
- Using _DSM_VERBOSE
- enable migration with
- Enabling Page Migration
- library interface to
- Using the dplace Library for Dynamic Placement
- not for use with libmp
- Non-MP Library Programs and Dplace
- placement file
- Placement File Syntax
- distribute statement
- Assigning Threads to Memories
- memories statement
- Using the memories Statement
- threads statement
- Using the threads Statement
- set page size with
- Changing the Page Size
- Using Larger Page Sizes to Reduce TLB Misses
- specify topology with
- Specifying the Topology
- with MPI
- Using dplace with MPI 3.1
- dprof
- Applying dprof
- dynamic page migration
- Dynamic Page Migration
- Enabling Page Migration
- Trying Dynamic Page Migration
- administration
- Trying Dynamic Page Migration
- enabling
- Trying Dynamic Page Migration
- environment variable
- _DSM_MIGRATION
- Trying Dynamic Page Migration
- Experimenting with Migration Levels
- _DSM_PPM
- Advanced Options
- _DSM_ROUND_ROBIN
- Trying Round-Robin Placement
- _DSM_VERBOSE
- Using _DSM_VERBOSE
- for SpeedShop
- Identifying False Sharing
- in dplace placement file
- Using Environment Variables in Placement Files
- MP_SET_NUMTHREADS
- Controlling a Parallelized Program at Run Time
- MPI_DSM_OFF
- Using dplace with MPI 3.1
- PAGESIZE_*
- Using Larger Page Sizes to Reduce TLB Misses
- SGI_ABI
- Specifying the ABI
- SpeedShop use of
- Sampling Through Other Hardware Counters
- TRAP_FPE
- Understanding Treatment of Underflow Exceptions
- event counter
- See hardware event counter
- exception
- event counter overflow
- R10000 Counter Event Types
- from speculative execution
- Permitting Speculative Execution
- handling
- Using Exception Profiling
- profiling occurrence of
- Using Exception Profiling
- TLB miss
- Understanding TLB and Virtual Memory Use
- underflow
- Understanding Treatment of Underflow Exceptions
- exception profile
- Using Exception Profiling
- false sharing
- Memory Contention
- Identifying False Sharing
- fast Fourier transform (FFT)
- Understanding Transpositions
- data placement for
- First-Touch Placement with Multiple Data Distributions
- feedback file
- Creating a Compiler Feedback File
- use of
- Passing a Feedback File
- FFT
- See fast Fourier transform (FFT)
- first-touch placement
- Using First-Touch Placement
- Programming For First-Touch Placement
- floating-point exception
- See exception
- floating-point status register (FSR)
- Understanding Treatment of Underflow Exceptions
- graduated instruction
- Graduated Instructions
- hardware event counter
- R10000 Counter Event Types
- branch instructions
- Branching Instructions
- cache coherency
- Cache Coherency Events
- cache use
- Primary Cache Use
- clock cycles
- Clock Cycles
- event 21
- Displaying Operation Counts
- Finding and Removing Memory Access Problems
- event 31
- Sampling Through Other Hardware Counters
- Finding and Removing Memory Access Problems
- Diagnosing Cache Problems
- Identifying False Sharing
- event 4
- Finding and Removing Memory Access Problems
- instruction counts
- Instructions Issued and Done
- lock instructions
- Lock-Handling Instructions
- profiling from
- Sampling through Hardware Event Counters
- Sampling Through Other Hardware Counters
- TLB miss
- Virtual Memory Use
- hardware graph
- Indicating Resource Affinity
- hardware trap
- See exception, page fault, TLB
- hub
- SN0 Organization
- Hub and NUMAlink
- cache coherency support
- Understanding Directory-Based Coherency
- hypercube
- SN0 Organization
- SN0 Memory Distribution
- ideal time profile
- Using Ideal Time Profiling
- IEEE 754
- Understanding Arithmetic Standards
- versus optimization
- IEEE Conformance
- IEEE arithmetic
- Understanding Arithmetic Standards
- inlining
- Understanding Inlining
- automatic versus manual
- Understanding Inlining
- manual with -INLINE
- Using Manual Inlining
- instruction scheduling
- Setting Target System with -TARG
- Understanding Software Pipelining
- instruction set architecture (ISA)
- MIPS I
- Old 32-Bit ABI
- MIPS II
- Old 32-Bit ABI
- MIPS III
- Old 32-Bit ABI
- New 32-Bit ABI
- MIPS IV
- MIPS IV Instruction Set Architecture
- New 32-Bit ABI
- interprocedural analysis (IPA)
- Exploiting Interprocedural Analysis
- applied during link step
- Compiling and Linking with IPA
- features of
- Exploiting Interprocedural Analysis
- requesting
- Requesting IPA
- -IPA
- See compiler option, -IPA
- IRIX
- memory management in
- SN0 Memory Management
- porting to
- Dealing with Porting Issues
- lazy evaluation
- Lazy Evaluation
- ld
- performs IPA
- Compiling and Linking with IPA
- library
- BLAS
- CHALLENGEcomplib Library
- SCSL Library
- CHALLENGEcomplib
- Exploiting Existing Tuned Code
- CHALLENGEcomplib Library
- EISPACK
- CHALLENGEcomplib Library
- LAPACK
- CHALLENGEcomplib Library
- SCSL Library
- libc
- Standard Math Library
- libfastm
- Exploiting Existing Tuned Code
- libfastm Library
- Recommended Starting Options
- libfpe
- Using Exception Profiling
- Understanding Treatment of Underflow Exceptions
- libmp
- Controlling a Parallelized Program at Run Time
- conflicts with dplace
- Non-MP Library Programs and Dplace
- data placement with
- Tuning Data Placement for MP Library Programs
- page migration with
- Trying Dynamic Page Migration
- Experimenting with Migration Levels
- page size control
- Using Larger Page Sizes to Reduce TLB Misses
- round-robin placement with
- Trying Round-Robin Placement
- LINPACK
- CHALLENGEcomplib Library
- SCSL
- Exploiting Existing Tuned Code
- SCSL Library
- library routine
- bzero
- Initializing to Zero
- calloc
- Initializing to Zero
- dplace_file
- Using the dplace Library for Dynamic Placement
- dplace_line
- Using the dplace Library for Dynamic Placement
- dsm_home_threadnum
- Using Dynamic Placement Information
- handle_sigfpes
- Using Exception Profiling
- sasum
- Using Reshaped Distribution Directives
- sscal
- Using Reshaped Distribution Directives
- -LNO
- See loop nest optimizer (LNO) and compiler option -LNO
- loop fission
- Using Loop Fission
- loop fusion
- by LNO
- Using Loop Fusion
- manual
- Understanding Loop Fusion
- loop interchange
- Using Loop Interchange
- disabling
- Using Loop Interchange
- loop nest optimizer (LNO)
- Using Loop Nest Optimization
- cache blocking by
- Controlling Cache Blocking
- controlling
- Adjusting Cache Blocking Block Sizes
- disable loop transformation
- Requesting LNO
- gather-scatter by
- Understanding Gather-Scatter
- loop fission by
- Using Loop Fission
- loop fusion by
- Using Loop Fusion
- loop interchange
- Using Loop Interchange
- loop unrolling
- Using Outer Loop Unrolling
- prefetching by
- Prefetch Overhead and Unrolling
- requesting
- Requesting LNO
- transformed source file
- Reading the Transformation File
- vector intrinsic transformation
- Vector Intrinsics
- loop peeling
- Using Loop Fusion
- loop unrolling
- and roundoff
- Roundoff Control
- and SWP
- Using Outer Loop Unrolling
- by loop nest optimizer (LNO)
- Using Outer Loop Unrolling
- with loop interchange
- Combining Loop Interchange and Loop Unrolling
- makefile
- example
- Basic Makefile
- use of
- Using a Makefile
- math libraries
- Exploiting Existing Tuned Code
- vector intrinsics
- Standard Math Library
- matrix multiply
- cache blocking of
- Controlling Cache Blocking
- loop unrolling of
- Using Outer Loop Unrolling
- memory use in
- Understanding Cache Blocking
- performance of
- Understanding Cache Blocking
- memory
- 64-bit addressing
- Selecting an ABI and ISA
- administrator setup
- Using Larger Page Sizes to Reduce TLB Misses
- Trying Dynamic Page Migration
- bus-based
- Memory for Multiprocessors
- Scalability in Multiprocessors
- cache directory bits
- Memory Overhead Bits
- contention for
- Memory Contention
- distributed versus shared
- Shared Memory Multiprocessing
- error correction bits
- Memory Overhead Bits
- hierarchy
- Understanding the Levels of the Memory Hierarchy
- latency of
- SN0 Latencies and Bandwidths
- Degrees of Latency
- locality management
- Memory Locality Management
- management by IRIX
- SN0 Memory Management
- page fault
- Understanding TLB and Virtual Memory Use
- paged virtual
- Understanding TLB and Virtual Memory Use
- parallel execution tuning
- Finding and Removing Memory Access Problems
- physical address display
- Page Address Routine va2pa()
- placement
- first-touch
- Using First-Touch Placement
- Programming For First-Touch Placement
- round-robin
- Using Round-Robin Placement
- Trying Round-Robin Placement
- prefetching
- Understanding Prefetching
- Using Prefetching
- stride
- Using Stride-One Access
- virtual
- Understanding Level-One and Level-Two Cache Use
- See also page
- memory locality domain (MLD)
- Memory Locality Management
- Memory Locality Domain Use
- memory locality domain set (MLDS)
- Memory Locality Domain Use
- Message-Passing Interface (MPI)
- Message-Passing Models MPI and PVM
- dplace with
- Using dplace with MPI 3.1
- perfex with
- Using perfex with MPI
- MIPS CPU
- architecture of
- Understanding MIPS R10000 Architecture
- Understanding Prefetching
- event counters in
- R10000 Counter Event Types
- issued versus graduated instruction
- Graduated Instructions
- off-chip cache
- Level-Two Cache
- on-chip cache
- Cache Architecture
- out-of-order execution
- Executing Out of Order
- R10000
- speculative execution
- Hardware Speculative Execution
- underflow control
- Understanding Treatment of Underflow Exceptions
- R4000
- Specifying the ABI
- R8000
- Specifying the ABI
- Dealing with Software Pipelining Failures
- Software Speculative Execution
- underflow ignored on
- Understanding Treatment of Underflow Exceptions
- specify to compiler
- Standard Math Library
- speculative execution
- Speculative Execution
- superscalar features
- Superscalar CPU Features
- See also hardware event counter
- MIPS IV ISA
- MIPS IV Instruction Set Architecture
- and IEEE 754
- IEEE Conformance
- prefetch in
- Understanding Prefetching
- MP library
- See library, libmp
- MPI
- See Message-Passing Interface (MPI)
- mpirun
- with perfex
- Using perfex with MPI
- node
- SN0 Organization
- SN0 Node Board
- CPU in
- CPUs and Memory
- nonuniform memory access (NUMA)
- SN0 Memory Distribution
- Dealing With Nonuniform Access Time
- and parallel program
- Parallel Programs under NUMA
- and single-threaded program
- Single-Threaded Programs under NUMA
- numeric error
- Understanding Arithmetic Standards
- OpenMP directives
- Fortran Source with Directives
- C pragmas for
- C and C++ Source with Pragmas
- -OPT
- See compiler option, -OPT
- optimization level
- Setting Optimization Level with -On
- out-of-order execution
- Executing Out of Order
- packing
- Packing
- page
- Understanding TLB and Virtual Memory Use
- migration of
- Dynamic Page Migration
- Enabling Page Migration
- Trying Dynamic Page Migration
- size of
- Dynamic Page Migration
- Policy Modules
- Single-Threaded Programs under NUMA
- Using Larger Page Sizes to Reduce TLB Misses
- set with dplace
- Changing the Page Size
- valid sizes
- Using Larger Page Sizes to Reduce TLB Misses
- page fault
- Understanding TLB and Virtual Memory Use
- parallel execution
- affinity clause
- Using Parallel Do with Distributed Data
- Understanding the AFFINITY Clause for Data
- Understanding the AFFINITY Clause for Threads
- Amdahl's law
- Understanding Parallel Speedup and Amdahl's Law
- auto-parallelizing
- Compiling Serial Code for Parallel Execution
- data placement for
- Scalability and Data Placement
- memory access tuning for
- Finding and Removing Memory Access Problems
- nest clause
- Understanding the NEST Clause
- parallel fraction p
- Understanding Amdahl's Law
- Ensuring That the Program Is Properly Parallelized
- programming models for
- Explicit Models of Parallel Computation
- scalability of
- Scalability in Multiprocessors
- Scalability and Data Placement
- topology
- Specifying the Topology
- tuning SN0 for
- Tuning Parallel Code for SN0
- perfex
- Analyzing Performance with perfex
- absolute event counts
- Taking Absolute Counts of One or Two Events
- analytic output
- Getting Analytic Output with the -y Option
- awk script to parse
- Awk Script for Perfex Output
- cache use analysis
- Identifying Cache Problems with Perfex and SpeedShop
- library interface
- Collecting Data over Part of a Run
- statistical counts
- Taking Statistical Counts of All Events
- performance
- aphorisms about
- Bentley's Rules Updated
- of matrix multiply
- Understanding Cache Blocking
- of parallel program
- Parallel Programs under NUMA
- of single-threaded program
- Single-Threaded Programs under NUMA
- performance techniques
- algebraic identities
- Exploit Algebraic Identities
- array padding
- Using Array Padding to Prevent Thrashing
- Using Array Padding
- avoiding tests
- Combining Tests
- cache blocking
- Understanding Cache Blocking
- Controlling Cache Blocking
- caching
- Principles of Good Cache Use
- code motion
- Code Motion Out of Loops
- combining related functions
- Combine Paired Computation
- common block padding
- Exploiting Interprocedural Analysis
- common subexpressions
- Eliminate Common Subexpressions
- constant propagation
- Exploiting Interprocedural Analysis
- copying
- Using Copying to Circumvent TLB Thrashing
- coroutines
- Use Coroutines
- data structure augmentation
- Data Structure Augmentation
- dead function elimination
- Exploiting Interprocedural Analysis
- dead variable elimination
- Exploiting Interprocedural Analysis
- gather-scatter
- Understanding Gather-Scatter
- inlining
- Exploiting Interprocedural Analysis
- Collapse Procedure Hierarchies
- interpreters
- Interpreters
- lazy evaluation
- Lazy Evaluation
- loop fission
- Using Loop Fission
- loop fusion
- Understanding Loop Fusion
- Using Loop Fusion
- Loop Fusion
- loop interchange
- Using Loop Interchange
- loop unrolling
- Using Outer Loop Unrolling
- Loop Unrolling
- packing
- Packing
- precomputation
- Store Precomputed Results
- Precompute Logical Functions
- prefetching
- Using Prefetching
- recursion elimination
- Transform Recursive Procedures
- short-circuiting
- Short-Circuit Monotone Functions
- software pipelining
- Understanding Software Pipelining
- speculative execution
- Permitting Speculative Execution
- transposition
- Understanding Transpositions
- policy module (PM)
- Memory Locality Management
- Policy Modules
- Parallel Virtual Machine (PVM)
- Message-Passing Models MPI and PVM
- POSIX threads
- C Source Using POSIX Threads
- pragma
- See directive
- precomputation
- Store Precomputed Results
- prefetching
- Understanding Prefetching
- Using Prefetching
- controlling
- Controlling Prefetching
- manual
- Using Manual Prefetching
- overhead of
- Prefetch Overhead and Unrolling
- pseudo
- Using Pseudo-Prefetching
- prof
- default report
- Displaying Profile Reports from Sampling
- feedback file
- Creating a Compiler Feedback File
- ideal time report
- Default Ideal Time Profile
- line numbers inaccurate with optimization
- Including Line-Level Detail
- option -archinfo
- Displaying Operation Counts
- option -butterfly
- Displaying Ideal Time Call Hierarchy
- option -feedback
- Creating a Compiler Feedback File
- Passing a Feedback File
- option -heavy
- Displaying Profile Reports from Sampling
- Including Line-Level Detail
- option -lines
- Including Line-Level Detail
- simplifying report
- Removing Clutter from the Report
- profiling
- address space usage
- Using Address Space Profiling
- cache usage
- Identifying Cache Problems with Perfex and SpeedShop
- call hierarchy
- Profiling the Call Hierarchy
- ideal time for
- Using Ideal Time Profiling
- Identifying Cache Problems with Perfex and SpeedShop
- opcode counts
- Displaying Operation Counts
- sampling for
- Understanding Sample Time Bases
- Identifying Cache Problems with Perfex and SpeedShop
- tools for
- Profiling Tools
- program correctness
- Getting the Right Answers
- R4000
- See MIPS CPU
- R8000
- See MIPS CPU
- R10000
- See MIPS CPU
- roundoff
- Roundoff Control
- round-robin placement
- Using Round-Robin Placement
- Trying Round-Robin Placement
- scalability
- Scalability in Multiprocessors
- and bus architecture
- Scalability in Multiprocessors
- and data placement
- Scalability and Data Placement
- and shared memory
- Scalability and Shared, Distributed Memory
- smake
- Using a Makefile
- SN0
- CrayLink
- Hub and NUMAlink
- hub
- SN0 Organization
- Hub and NUMAlink
- Input/Output
- SN0 Input/Output
- latencies
- SN0 Latencies and Bandwidths
- node
- SN0 Organization
- SN0 Node Board
- router
- SN0 Organization
- XIO
- SN0 Organization
- XIO Connection
- SN0 architecture
- Understanding SN0 Architecture
- building blocks of
- SN0 Organization
- hypercube
- SN0 Organization
- SN0 Memory Distribution
- nonuniform memory access (NUMA)
- SN0 Memory Distribution
- snoopy cache
- Coherency Methods
- software pipelining (SWP)
- Exploiting Software Pipelining
- compiler report in
- script to extract
- Software Pipeline Script swplist
- compiler report in .s
- Reading Software Pipelining Messages
- Using Outer Loop Unrolling
- dereferenced pointer defeats
- Improving C Loops
- effect of alias model
- Understanding Aliasing Models
- enable with -O3
- Enabling Software Pipelining with -O3
- failure cause
- Dealing with Software Pipelining Failures
- global variables defeat
- Improving C Loops
- loop unrolling with
- Using Outer Loop Unrolling
- of DAXPY loop
- Pipelining the DAXPY Loop
- speculative execution
- Speculative Execution
- Permitting Speculative Execution
- hardware driven
- Hardware Speculative Execution
- software-driven
- Software Speculative Execution
- SpeedShop
- Using SpeedShop
- sample time bases
- Understanding Sample Time Bases
- See also prof, ssrun
- ssrun
- exception trace
- Profiling Exception Frequency
- experiment types
- Understanding Sample Time Bases
- ideal time trace
- Capturing an Ideal Time Trace
- Passing a Feedback File
- output filename format
- Performing ssrun Experiments
- shell script to run
- Shell Script ssruno
- usertime experiment
- Displaying Usertime Call Hierarchy
- using
- Performing ssrun Experiments
- stride
- Using Stride-One Access
- superlinear speedup
- Understanding Superlinear Speedup
- superscalar
- Superscalar CPU Features
- -SWP
- See compiler option, -SWP
- swplist shell script
- Reading Software Pipelining Messages
- system routine
- mmap
- C and C++ Source Using UNIX Processes
- Initializing to Zero
- sproc
- C and C++ Source Using UNIX Processes
- sysmp
- Advanced Options
- syssgi
- Using Dynamic Placement Information
- thread
- C Source Using POSIX Threads
- TLB
- See translation lookaside buffer (TLB)
- translation lookaside buffer (TLB)
- Understanding TLB and Virtual Memory Use
- miss
- Understanding TLB and Virtual Memory Use
- hardware counter
- Virtual Memory Use
- thrashing elimination
- Diagnosing and Eliminating TLB Thrashing
- copying
- Using Copying to Circumvent TLB Thrashing
- larger page size
- Using Larger Page Sizes to Reduce TLB Misses
- transposition
- Understanding Transpositions
- trap
- See exception
- uninitialized variable, avoiding
- Uninitialized Variables
- vector intrinsic function
- Standard Math Library
- and LNO
- Vector Intrinsics
- virtual memory
- Understanding TLB and Virtual Memory Use
- XIO
- SN0 Organization
- XIO Connection
- zero-fill
- Initializing to Zero