This short glossary contains only terms related to performance tuning, as used in this guide. For a full glossary of Silicon Graphics technical terms, see the Silicon Graphics Glossary of Terms, 007-1859-xxx.
A mathematical approximation (not really a law) stating the ideal speedup that should result when a program is executed in parallel: S(n) = 1 / ((p / n) + (1 - p)), where n is the number of CPUs applied and p is the fraction of the program code that can be executed by parallel threads. When p is small (much of the program executes serially), S(n) is near 1.0, in other words, there is no speedup. When p is near 1.0 (most of the program can be executed in parallel), S(n) is near 1/n, in other words, linear speedup.
The amount of data transferred per second over a given path. Common types are:
Theoretical bandwidth, the maximum units per second under perfect conditions
Available bandwidth, units per second of useful data after subtracting protocol overhead and turnaround delays in normal traffic situations
Bisection bandwidth, the maximum available bandwidth between two arbitrarily-chosen points in the system allowing for bottlenecks.
See also latency.
A sequence of program statements that contains no labels and no branches. A basic block can only be executed completely (because it contains no labels, it cannot be entered other than at the top; because it contains no branches, it cannot be exited before the end), and in sequence (no internal labels or branches). Any program can be decomposed into a sequence of basic blocks. The compiler optimizes units of basic blocks. An ideal profile counts the use of each basic block.
See speculative execution.
In common language, a hiding-place for valuables, provisions, and so on. In computing, a separate memory where the most recently used items are kept in anticipation of their being needed again. The MIPS R10000 CPU contains an on-chip cache for instructions and another for data (the primary cache or L1 cache). The SN0 node provides a secondary cache (or L2 cache) of either 1 MB or 4 MB, organized as 128-byte units called cache lines.
An optimization technique that modifies a program so that it works on blocks of data that fit in cache, and deals completely with each block before going on to another.
In a multiprocessor, each CPU has its own cache; hence any memory item can appear in multiple copies among the various caches. Cache coherency hardware ensures that all cached copies are identical to memory. The basic method is to recognize when one CPU modifies memory, and to automatically invalidate any other copies of the changed data. See cache, directory, snoopy cache.
When multiple CPUs update the same cache line, each has to become the exclusive owner of that line in turn. This can drastically slow execution. Not the same phenomenon as cache thrashing, but caused by memory contention or by false sharing.
A design for cache coherency used in the SN0 architecture. Each secondary cache
cache line of 128 bytes is augmented with a set of directory bits, one bit for each CPU that could have a cached copy of that line. When a CPU accesses a cache line, its bit is set in the directory of the node that owns that line. When a CPU modifies memory, all CPUs whose directory bits are set in the directory of the associated line are notified to invalidate their copies of the line.
The unit of memory fetched and stored in a cache. Cache line size is a function of the hardware. In the MIPS R10000 CPU, the primary cache line is 32 bytes. In the SN0 (and most other Silicon Graphics systems) the secondary cache line is 128 bytes. See cache.
When the CPU looks for a particular memory word in a cache and does not find it. Execution of the current instruction has to be delayed until the cache line has been loaded from memory. A superscalar CPU can continue to execute other instructions out of order while waiting.
Cache lines are assigned locations in cache based on the bits of their physical address. It is possible for cache lines from unrelated parts of memory to contend for the same location in cache. When a program cycles through multiple arrays that happen to have similar base addresses, it is possible for every fetched cache line to displace another recently fetched line. The SN0 uses a “two-way set-associative” cache that can deal with alternating references to two arrays with similar address bits, but it can be defeated by a loop that refers to three or more arrays. This “thrashing” effectively negates the cache, reducing access to memory speed. It is prevented by padding arrays to force their starting addresses to differ in the least-significant bits. See also false sharing, which is quite different.
See cache blocking.
See hardware counter.
When one instruction or program statement requires the result of another as its input, and so depends on the completion of the other. A loop contains a dependency when one iteration of the loop requires the result of a prior iteration; such loops cannot be parallelized. In a superscalar CPU, an instruction cannot complete until its dependencies are satisfied by the completion of the instructions that prepare its operand values.
See cache directory.
page migration that is initiated by the operating system, based on hardware counters in the hub that show which nodes are using any given page of memory. Dynamic page migration can be turned on for the whole system (not recommended) or for a specific program by running it under dplace.
When two or more variables that are not logically related happen to fall into the same cache line, and one variable is updated, a reference to any of the other variables cause the cache line to be fetched anew from memory even though the data in those variables has not changed. In effect, false sharing is unintentional memory contention. False sharing is eliminated by putting volatile variables in cache lines that are shared only by variables that are logically associated with them and that are updated at the same time.
The default rule for IRIX memory allocation in the SN0: new memory pages are allocated in the node where they are requested, or in the nearest node to it. See also round robin allocation.
A compiler optimization technique for improving the speed of a loop that iterates over an array, testing each element and operating on only selected elements. Because such a loop contains a conditional branch, software pipelining is less effective in it. The loop is converted into two loops: one that tests all elements, collecting the indexes of the target elements in a scratch array; and a second loop that iterates over the scratch array and operates on the selected elements. The second loop, because it contains no conditional branch, can be more effectively optimized.
In a superscalar CPU, instructions are executed out of the order in which they were written. In order to ensure that the program state changes only in ways that the programmer expects, any instruction that completes early is delayed until the instructions that logically preceded it in the program text have also completed. Then the instruction “graduates” (as a student graduates from school) and its effect on the program state is made permanent. Under speculative execution, some instructions are executed and then discarded, never to graduate. An R10000 CPU hardware counter can count instructions issued and graduated.
In the MIPS R10000 CPU, one of two registers that can be programmed to count events such as cache miss, graduated instruction, and so on. Used generating a sampled profile.
In the SN0 architecture, the custom circuit that manages memory access in a node, directing the data flowing between attached I/O devices and CPU chips to memory in the node and through a router to other nodes.
In topology, the four-dimensional figure whose faces are cubes of identical size (as the cube is the 3D figure whose faces are squares of identical size). In an SN0 system with 32 or more CPUs, the CPU nodes are connected by data paths that follow the edges of a hypercube.
A compiler optimization technique in which the call to a procedure is replaced by a copy of the body of the procedure, with the actual parameters substituted for the parameter variables. Inlining has two immediate effects: to eliminate the overhead of calling the procedure, and to increase the size of the code. It also has the side-effect of giving the compiler more statements to practice optimization upon. For one example among many, when one of the actual parameter values is a constant, it may be possible to eliminate many of the conditional statements from the inserted code because, with that actual data, they are unreachable.
A compiler optimization technique. Normally a compiler analyzes one procedure at a time in isolation, making the most conservative assumptions about the possible range of argument values it might receive and about the actions of the procedures that it calls. Using IPA, the compiler examines all calls to each procedure to determine the actual ranges of arguments, and to determine whether or not a called procedure can ever modify shared data. The information gained from IPA permits the compiler to do more aggressive optimization, but to get it, the entire program text must be available at once. The current Silicon Graphics compilers implement IPA by performing only syntax analysis at “compile” time, and performing the actual compilation at “link” time, when all modules are brought together.
The IRIX kernel, since release 6.4, is structured as a set of independent, interruptible, lightweight threads. These “kthreads” (kernel threads) are used to implement system functions and to handle interrupts and such background tasks as disk buffering. Some kthreads are used to execute the code of threads in user-level programs. However, a kthread and a thread are quite different in their abilities and execution environment.
Regarding memory or a communications path, the time elapsed from the instant a request is issued until the response begins to arrive. When communications consists of many short exchanges, latency dominates the data rate. When messages are long enough that the transmission time of a message is larger than latency time, bandwidth dominates. Regarding a CPU, latency is the time from loading an instruction to completing it.
A highly optimized version of the standard Fortran math library.
A compiler optimization technique, the opposite of loop fusion, in which a loop whose body is too large for effective optimization or software pipelining is automatically split into two loops. May also be done when the loop body contains inner loops such that, after the fission, loop interchange can be performed.
A compiler optimization technique in which two adjacent loops that have the same induction variable range are merged. The statements in the loop bodies are put together within a single loop structure. Benefits include halving the time spent on loop overhead; putting more statements in the basic block to give the compiler more material for software pipelining; most important, using cache data more effectively. See also loop peeling.
A compiler optimization technique in which the loop variables of inner and outer loops are interchanged. This is done either to reduce the stride of the loop, to improve its cache use, or to make cache blocking or loop unrolling possible.
A compiler optimization technique in which the statements in nested loops are transformed to give better cache use or shorter instruction sequences.
A compiler optimization technique in which some of the first (or last) iterations of a loop are made into separate statements preceding (following) the loop proper. This is done in order to permit loop fusion with an adjacent loop that uses a smaller range of indexes.
A compiler optimization technique in which the body of a loop is replicated several times, and the increment of the loop variable is adjusted to match (for example, instead of using array element a(i) in one statement, the loop has four statements using elements a(i), a(i+1), a(i+2), and a(i+3), and then increments i by 4). This divides the loop overhead by the amount of unrolling, but more importantly gives the compiler more statements in the basic block to optimize, for example making software pipelining easier. See also loop nest optimization.
When the code in two or more CPUs repeatedly updates the same memory locations, execution is slowed due to cache contention.
The degree to which data is located close to the CPU (or CPUs) that access it. In the SN0 architecture, there is a (small) time penalty for access to memory on a different node, so the operating system applies many strategies to keep data close to the CPUs that use it.
An abstract representation of physical memory, used within the IRIX operating system. An MLD is a source for memory allocation having a prescribed type of memory locality, for example, the two nodes in one corner of a hypercube can yield memory that is close to either node.
A measure of the speed of execution of a program. A program contains many non-floating-point instructions, so the MFLOPS achieved on the same hardware, varying the compiler options, is an indication of how effectively the compiler optimized the code. Often, MFLOPS is reported through an approximate calculation, based on counting the number of arithmetic operations in the source code for a crucial loop body and multiplying by the number of iterations. However, the hardware counter of the R10000 CPU can count actual floating point graduated instructions, giving an absolute measure of MFLOPS.
A measure of the speed of execution of a CPU. A superscalar CPU such as the MIPS R10000 CPU can normally finish 1.5 instructions per clock cycle, or about 300 MIPS for a CPU with a 200MHZ clock.
The basic building-block of the SN0 architecture, a single board carrying two R10000 CPUs, the secondary cache for each, some memory, and the hub chip that controls data flow to and from memory.
A computer memory design that has a single physical address space, but in which parts of memory take longer to access than other parts. In the SN0, memory in the local node is accessed fastest; memory in another node takes longer by about 100 nanoseconds per router.
Finding the solution that is the best fit to the available resources. See tuning.
Moving a page of memory data from one node to another one that is closer to the CPU that is using the data in the page. Page migration can be explicitly programmed using Fortran directives or runtime calls on the dplace program. See also dynamic page migration.
Causing a program to execute as more than one thread operating in parallel on the same data. Often specified by using C pragmas or Fortran directives to make loops run in parallel, or by writing explicitly parallel code that starts multiple processes or pthreads.
The IRIX kernel data structure type that specifies one memory locality domain (MLD) and other rules regarding memory allocation from that domain.
An instruction that notifies the CPU that a particular cache line will be needed by a following instruction, permitting memory fetch to overlap execution. The prefetch instruction is simply a load that does not use its operand. Current Silicon Graphics compilers can generate prefetch instructions automatically.
The on-chip cache memory, closest to the execution units and with smallest latency. In the MIPS R10000 CPU, primary cache is 32KB organized in 32-byte units. Also called L1 (level-one) cache. See cache, secondary cache.
To analyze the execution of a program by counting the number of times it executes any given statement. An ideal profile is done by modifying the executable so that it contains instructions to count the entry to each basic block. A sampled profile is done by executing the unmodified executable under a monitor that interrupts it at regular intervals and records the point of execution at each interruption. A sampled profile emphasizes different kinds of behavior depending on the chosen sampling time base.
To order data references in a loop so that a harmless reference to the next cache line is executed well in advance of the use of the data from that cache line. A manual version of prefetch often used to hide the latency of loading the primary cache.
A thread of a user program, created and controlled using the POSIX 1003.1c standard interface. An alternative to parallelization using multiple IRIX processes. Not related in any way to kthread.
An alternative to first touch allocation in which new pages are distributed in rotation to each node. This prevents the hub chip in one node from having to serve all requests for memory from one program.
In the SN0, the custom board that routes memory access between nodes.
The quality of scaling, or increasing in proportion, in a regular fashion based on the availability of resources. In hardware, increasing in performance as a smooth function of increasingly complex circuitry; or conversely, increasing in cost as a smooth function of capacity. In software, improving execution time as a smooth function of available CPUs and memory (see Amdahl's Law).
Cache external to the CPU chip; also called L2 (level-two) cache. See cache, primary cache.
A design for cache coherency used in the CHALLENGE and Onyx systems. Each CPU board monitors all bus traffic. When it observes a memory write, it invalidates any cached copy it might have of the same memory. This is the simplest way to implement cache coherency, but it depends on having all CPUs on a single memory bus, which is not the case in the SN0.
A compiler technique in which the compiler generates sequences of instructions that are carefully tailored to take maximal advantage of the multiple execution units of a superscalar CPU.
A superscalar CPU can decode and begin processing multiple instructions in parallel. When it decodes a branch instruction, the condition tested by the branch may not yet be known, because the instruction that prepares that condition is still executing or perhaps is suspended pending yet another unknown result. Rather than suspending execution pending completion of the branch input operand, the CPU predicts the direction the branch will go, and continues to decode and execute instructions along that path. However, it does so in a way that any changes in program state that result can be retracted. If the prediction turns out to be wrong, the results of the speculatively executed instructions are discarded. Experience with the branch prediction logic of the MIPS CPUs shows that branch prediction is typically correct well over 90% of the time. Branch mispredictions can be counted in a hardware counter.
Name for a family of similar algorithms that combine the value from a cell in an input array with the values in its neighbor cells to make a new value for the same cell of an output array. A 2D stencil operates on the 9-cell neighborhood comprising a(i,j) and its eight neighbors (a(i-i,j), a(i+1,j), and so on). A 3D stencil operates on the 27-cell neighborhood around a(i,j,k). Conway's Game of Life and other cellular automata are usually implemented using a stencil algorithm, as are many graphics transformations such as smoothing and edge-finding. Stencil algorithms are both CPU-intensive and memory-intensive, and are challenging subjects for compiler optimizations.
The distance between the memory addresses of array elements that are touched in a loop. A stride-one loop touches successive array elements, and hence scans memory consecutively. This uses cache memory most efficiently because all of a cache line is used before the next cache line is fetched, and the loop never returns to a cache line after using it. Strides greater than one are less efficient in memory use, but are easy to create accidentally given Fortran array semantics. The compiler can sometimes use loop nest optimization or loop interchange to shorten the stride.
When adding more CPUs to parallel execution of a program, the program normally speeds up in conformity with Amdahl's Law. However, when adding CPUs accidentally relieves some other bottleneck, the speedup can exceed the number of CPUs added. This is superlinear speedup: improvement out of proportion to the hardware added.
In a RISC CPU, the ability to execute more than one instruction per clock cycle, achieved by having multiple parallel execution units. The MIPS R10000 CPU routinely achieves instruction rates of 1.5 to 2 times its clock rate. It uses five, independent execution units, and is able to execute instructions out of order, so that it can complete some instructions while waiting for the memory operands of others to arrive. See also speculative execution and software pipelining.
Loosely, any independent execution of program code. A thread is most commonly represented by an IRIX process; or by a POSIX thread using the pthread library. When discussing parallelization of a program, it is usually assumed that there are as many CPUs as the program has threads, so all threads can execute concurrently. This is not assumed in the general case; the threads of a multithreaded program are dispatched by IRIX as CPUs are available, and some may run while others wait. See also kthread.
The ways in which the SN0 nodes are connected in general (see also hypercube); but in particular the relationship between the nodes in which the various threads of a parallel program are executed. Typical program topologies are cluster (which minimizes the distance between nodes), cube, and hypercube.
A table within the CPU that associates virtual addresses to physical page addresses. The TLB has limited size, but all entries are searched in parallel, simultaneously with the decoding of the instruction (hence “lookaside”). When the operand address of an instruction is found in the TLB, the CPU can present the needed physical address to the memory without delay. (The TLB is in essence a cache for page table entries.) When the TLB lookup fails, the CPU traps to an OS routine that locates the desired virtual page in a large, in-memory page table that defines the virtual address space of the process. When that lookup succeeds, the OS loads one entry of the TLB to address the page, and resumes execution. When that lookup fails, the process has suffered a page fault.
Finding the executable version of a program that uses the least of a critical resource on a given hardware platform. The critical resource is usually time; the tuned program runs fastest. However, tuning for least memory or I/O use is also possible. Tuning is done by: altering the program's source code to use a different algorithm; by compiling the program with a different compiler or with different compiler options to get a more highly optimized executable; or changing the numbers of CPUs or placement of memory at execution time.
A Fortran (usually) math function that is implemented using hardware vector operations. The compiler can substitute vector intrinsic calls automatically.