List of Figures

Figure 1-1. Block Diagrams of 4-, 8-, 16-, 32-, and 64-CPU SN0 Systems
Figure 1-2. Block Diagram and Approximate Appearance of a Node Board
Figure 1-3. Block Diagram of Memory, Hub, and Cache Directory
Figure 1-4. XIO and XBOW Provide I/O Attachment to a Node
Figure 2-1. Program Address Space Versus SN0 Architecture
Figure 2-2. Parallel Process Memory Access Pattern Versus SN0 Architecture
Figure 2-3. Parallel Processes at Opposite Corners of SN0 System
Figure 2-4. Parallel Processes and Memory at Bad Locations
Figure 2-5. Parallel Program Ideally Placed in SN0 System
Figure 2-6. Parallel Program Mapped to a Pair of MLDs
Figure 2-7. Parallel Program Mapped to an MLD Set with Hypercube Topology and Affinity to a Graphics Device
Figure 2-8. Parallel Program Mapped through MLDs to Hardware
Figure 4-1. Code Residence Versus Data Reference
Figure 4-2. On-Screen Plot of dprof Output
Figure 4-3. dprof Output for a Program with Poor Memory
Figure 4-4. dprof Output for a Program with Good Memory
Figure 5-1. DAXPY Software Pipeline Schedule
Figure 5-2. Inlining and the Call Hierarchy
Figure 6-1. Processing Directions in adi2.f
Figure 6-2. Memory Use in Matrix Multiply
Figure 6-3. Cache Blocking of Matrix Multiplication
Figure 6-4. Schematic of Data Motion in Radix-2 Fast Fourier Transform
Figure 7-1. Table of Loop-Unrolling Parameters for Matrix Multiply
Figure 7-2. Performance of Vector Intrinsic Functions on an Origin 2000
Figure 8-1. Possible Speedup for Different Values of p
Figure 8-2. Performance of Weather Model Before and After Tuning
Figure 8-3. Calculated Bandwidth for Different Placement Policies
Figure 8-4. Calculated Iteration Times for Different Placement Policies
Figure 8-5. Cumulative Run Time for Different Placement Policies
Figure 8-6. Effect of Migration Level on Iteration Time
Figure 8-7. Effect of Page Granularity in First-Touch Allocation
Figure 8-8. Data Partition for NAS FT Kernel
Figure 8-9. NAS FT Kernel Data Redistributed
Figure 8-10. Some Possible Regular Distributions for Four Processors
Figure 8-11. Possible Outcomes of Distribute ONTO Clause
Figure 8-12. Reshaped Distribution of Three-Dimensional Array for Four CPUs
Figure 8-13. Copying by Cache Lines for Summation
Figure 8-14. Placement File and Its Results