This appendix describes the counters in R10000 systems. The counters in R12000 and R14000 systems are somewhat different; for a description of the R12000/R14000 counters, see the r10k_counters(5) man page.
The MIPS R10000 CPU contains two 32-bit hardware counters, each of which can be assigned to count any one of 16 events. The counters can be used in two ways. First, they can be used to tabulate the frequency of events in a particular program, for example, counting all instructions, or counting floating-point instructions.
Second, the CPU can be conditioned to take a trap when one of the counters overflows. If the counter is preloaded to be just N short of an overflow, the CPU will trap when exactly N events have occurred. The ssrun command uses this feature to sample a program's state at regular intervals; for example, every 32K graduated instructions (see “Sampling through Hardware Event Counters” in Chapter 4).
The IRIX kernel extends the utility of the counters by giving each process its own set of virtual counter registers. When the kernel preempts a process, it saves the process's current counter values, just as it saves the other machine registers. When the kernel resumes execution of a process, it restores the counter values along with the rest of the machine registers. In this way, every process can accumulate its own counts accurately, even though it shares the CPU with the kernel and many other processes. (See the r10k_counters(5) man page.)
Table B-1 summarizes the types of events that can be counted by the R10000 CPU in either Counter 0 or Counter 1. The table is ordered by the event type number as used in the R10000 special register that controls event counting. The same event numbers are used by the perfex and ssrun commands (see “Profiling Tools” in Chapter 4). A detailed discussion of the events follows the table.
Counter 0 events:

 0  Cycles
 1  Instructions issued to functional units
 2  Memory data access (load, prefetch, sync, cacheop) issued
 3  Memory stores issued
 4  Store conditionals issued
 5  Store conditionals failed
 6  Branches decoded
 7  Quadwords written back from L2 cache
 8  Correctable ECC errors on L2 cache
 9  L1 cache misses (instruction)
10  L2 cache misses (instruction)
11  L2 cache way mispredicted (instruction)
12  External intervention requests
13  External invalidate requests
14  Instructions done (in chip rev 2.x, virtual coherence)
15  Instructions graduated

Counter 1 events:

16  Cycles
17  Instructions graduated
18  Memory data loads graduated
19  Memory data stores graduated
20  Store conditionals graduated
21  Floating-point instructions graduated
22  Quadwords written back from L1 cache
23  TLB refill exceptions
24  Branches mispredicted
25  L1 cache misses (data)
26  L2 cache misses (data)
27  L2 cache way mispredicted (data)
28  External intervention request hits in L2 cache
29  External invalidate request hits in L2 cache
30  Stores, or prefetches with store hint, to CleanExclusive L2 cache blocks
31  Stores, or prefetches with store hint, to Shared L2 cache blocks
This section lists the events in related groups for clarity. There are a few bugs in the counting algorithms in early runs of the R10000 chip (through revisions 2.x), so counts can differ between systems even when all other factors are the same. Roughly speaking, R10000 revision 2.x chips are used in machines manufactured before 1997, and revision 3.2 in 1997 and later.
Either counter can be incremented on each CPU clock cycle (event 0 or event 16). This permits counting cycles along with any other event. Note that a 32-bit counter overflows after only about 21 seconds at 200 MHz.
Several counters record when instructions of different kinds are “issued,” that is, taken from the input queue and assigned to a functional unit for processing. Some instructions can be issued more than once before they are complete, and some can be issued and then discarded (speculative execution). As a result, issued instructions reflect the amount of the work the CPU does, but only graduated instructions (see “Graduated Instructions” following) reflect the effective work toward completion of the algorithm.
This counter is incremented on each cycle by the sum of these factors:
Integer operations marked as “done” in the active list. Zero, 1, or 2 operations can be so marked on each cycle.
Floating point operations issued to an FPU. Zero, 1, or 2 can be issued per cycle. In early revs of the chip, a conditionally or tentatively issued FP instruction can be counted as issued multiple times.
Load and store instructions issued to the address calculation unit on the previous cycle: 0 or 1 per cycle. In early chips, prefetch instructions are not counted as issued, and loads and stores are counted each time they are issued to the address calc unit, which can be multiple times per instruction.
A load/store instruction can be reissued if it does not complete its tag check cycle. One case occurs when the data cache tag array is busy for the required bank because of an external operation (a refill or an invalidate cycle), or if the CPU initiated a refill on the previous cycle to the same bank. This case usually generates just one extra “issue.”
Another case occurs when the Miss Handling Table is already busy with four operations and cannot accept another. In this case, the instruction is re-issued up to once every four cycles as long as the Miss Handling Table is full. If several instructions are waiting, a spurious issue could be counted every cycle.
This counter is incremented when a load instruction was issued to the address-calc unit on the previous cycle. Unlike the combined count in Event 1, this counts each load instruction only once. Since revision 3.x, prefetch instructions are counted with issued loads. See the discussion of “Issued Versus Graduated Loads and Stores”.
This counter is incremented when a store instruction was issued to the address-calc unit on the previous cycle. Store-conditional instructions are included in this count. Unlike the combined count in Event 1, this counts each store only once. See also “Issued Store Conditionals (Event 4)”, and the discussion of “Issued Versus Graduated Loads and Stores”.
Beginning with chip revision 3.x, this counter's meaning is changed. It is incremented on the cycle after either ALU1, ALU2, FPU1, or FPU2 marks an instruction as “done.” Done is not the same as graduated, because an instruction that is done, while complete, can still be discarded if it is on a speculative branch line that was mispredicted.
This counter had a different meaning in early revisions of the chip, see “Virtual Coherency Conditions (Old Event 14)”.
An instruction “graduates” when it is complete and can no longer be discarded. The R10000 CPU issues instructions whenever it can, and it executes instructions in any sequence it can (and sometimes executes instructions on speculation only to discard them). However, it graduates instructions in the sequence they were written, so that graduation means that an instruction has had its final effect on the visible state of registers and memory. This predictability makes graduated instructions a more reliable, repeatable count of the instructions executed by a program than the issued-instructions counts. An ideal profile run, which counts executed instructions by software, should agree with the count of graduated instructions for the same program and input. (However, revision 2.x chips can slightly undercount graduated instructions, as compared to an ideal profile run.)
Load and store instructions (and prefetch, load conditional, and store conditional) all require address translation and possibly cache operations. These instructions can be issued, delayed, retracted, and reissued before they are finished and graduate. For this reason, issued loads and stores always outnumber graduated loads and stores. However, the difference between a load or store issued and one graduated can provide valuable insight into performance problems.
Specifically, after a load or store is issued it can often be killed due to contention for some on-chip or off-chip resource, such as a tag-bank read-port. In that case the instruction is reissued, and so counted twice by the issued counter but only once by the graduated counter.
Normally reissues are rare, so issued and graduated counts should correspond reasonably well. But sometimes resource contention can become a performance bottleneck, and in that case the number of issued loads and stores can soar. If you see the number of issued loads or stores greatly exceeding the number graduated, look at the count of mispredicted branches. If it remains low, you can guess that the load/store pipeline is suffering some kind of resource contention, causing those instructions to be issued repeatedly. The R10000 CPU can issue at most one load or store per cycle, so excessive reissues can seriously degrade performance.
This counter is incremented by the number of instructions that were graduated on the previous cycle. Integer multiply and divide instructions are counted as two instructions.
Same as Event 15. Supporting this count in either counter register permits counting graduated instructions along with any other value.
This counter is incremented by the number of loads that graduated on the previous cycle. Prefetch instructions are included in this count. Up to four loads can graduate in one cycle. (Through revision 2.x, when a store graduates on a given cycle, loads that graduate on that cycle do not increment this counter.)
This counter is incremented on the cycle after a store graduates. At most one store can graduate per cycle. Prefetch-exclusive instructions (which indicate an intent to modify memory) are counted here, as are store conditional instructions (see “Graduated Store Conditionals (Event 20)”).
The R10000 CPU predicts whether any branch will be taken before the branch is issued. Execution continues along the predicted line. The instructions executed speculatively can be done—incrementing event 14, see “Instructions Done (Event 14)”—but if the prediction proves to be false, they are discarded and do not increment event 15 (“Graduated Instructions (Event 15)”).
In old chips (revision 2.x), this counter is incremented when any branch instruction is decoded. This includes both conditional and unconditional branches. Although conditional branches have been predicted at this point, they may still be discarded due to an exception or a prior mispredicted branch.
Starting with revision 3.x, this counter is incremented when a conditional branch is determined to have been correctly or incorrectly predicted. This occurs once for every branch that the program actually executes. The count in current chips does not include unconditional branches, and it does not include branches that are decoded on speculation and then discarded. The current definition of the counter enables meaningful comparison with event 24 (“Mispredicted Branches (Event 24)”).
The determination of prediction correctness is known as the branch being “resolved.” Some branches depend on conditions in the floating-point unit. Multiple floating-point branches can be resolved in a single cycle, but this counter can be incremented by only 1 in any cycle. As a result, when multiple FP conditional branches are resolved in a single cycle, this count will be incorrect (low). This is a relatively rare event.
This counter is incremented on the cycle after a branch is restored because it was mispredicted.
The R10000 CPU should predict a high percentage of branches correctly. The compilers order branches based on heuristic evaluation of their likelihood of being taken, or on the basis of a feedback file created from a trace. If the count of event 24 is more than a few percent of event 6 (“Decoded Branches (Event 6)”), something is wrong with the compiler options or something is unusual about the algorithm. You can use prof to generate a feedback file containing actual branch frequency, and the compiler can use such a feedback file to order the machine instructions to minimize mispredictions. (See “Passing a Feedback File” in Chapter 5.)
The R10000 primary cache consists of a separate instruction cache (i-cache) and data cache (d-cache), both contained on the CPU chip. Together the primary caches are called the level-one (L1) cache. This small area (2 × 32 KB) is organized in 16-byte units called quadwords. Instructions and data are fetched and stored between the primary cache and the secondary cache in quadwords. Several counters document L1 cache use.
This counter is incremented on the cycle after a request to refill a line of the primary instruction cache is entered into the Miss Handling Table. The count indicates that a needed instruction word was not in the primary cache. The relationship between this counter and event 1 (“Issued Instructions (Event 1)”) indicates the effectiveness of the L1 i-cache.
This counter is incremented one cycle after a request to refill a line of the primary data cache is entered into the Miss Handling Table. The count indicates that a needed operand was not in the primary cache. The affected instruction is suspended and reissued when the needed quadword arrives.
The secondary cache, also called the level-two (L2) cache, is external to the R10000 chip. MIPS documentation for the CPU does not describe the L2 cache in detail because its design is not dictated by the CPU chip. The design of the L2 cache is part of the total system design. For example, the L2 cache of the Power Challenge 10000 system is very different from the L2 cache of the SN0 systems. The counters for L2 cache events are defined in CPU terms, not in terms of the cache design.
This counter is incremented on each cycle that the data for a quadword is written back from the secondary cache to the system-interface unit. In SGI systems, the L2 cache is organized as 128-byte lines. One cache line is 8 quadwords, so this counter is updated by 8 on each cache line writeback.
The interface between the CPU chip and the L2 cache includes lines for an ECC code. This counter is incremented on the cycle following the correction of a single-bit error in a quadword read from the secondary cache data array. A small number of single-bit errors is inevitable in high-density logic systems. However, any significant count in this counter is cause for concern.
This counter is incremented the cycle after the fourth quadword of a cache line is written from memory into the secondary cache, when the cache refill was initially triggered by a primary instruction cache miss. Data misses are counted in “Secondary Data Cache Misses (Event 26)”.
This counter is incremented when the secondary cache control begins to retry an access because it hit in the “way,” or bank, that was not predicted, and the event that initiated the access was an instruction fetch. Data mispredictions are counted in “Data Misprediction from Scache Way Prediction Table (Event 27)”.
This counter is incremented the cycle after the second quadword of a cache line is written from memory into the secondary cache, when the cache refill was initially triggered by a primary data cache miss. Instruction misses are counted in “Secondary Instruction Cache Misses (Event 10)”.
This counter is incremented when the secondary cache control begins to retry an access because it hit in the “way,” or bank, that was not predicted, and the event that initiated the access was not an instruction fetch. Instruction mispredictions are counted in “Instruction Misprediction from Scache Way Prediction Table (Event 11)”.
This counter is incremented on the cycle after an update request is issued for a clean line in the secondary cache. An update request is the result of processing a store instruction or a prefetch instruction with the exclusive option. In the SN0, the update request from this CPU will cause the hub chip to update memory and possibly to initiate other cache coherency operations (see “Cache Coherency Events”).
This counter is incremented on the cycle after an update request is issued for a shared line in the secondary cache. An update request is the result of processing a store instruction or a prefetch instruction with the exclusive option. In the SN0, the update request from this CPU will cause the hub chip to update memory and possibly to initiate other cache coherency operations (see “Cache Coherency Events”). For example, if other CPUs have a copy of the modified cache line, they will be sent invalidations (“External Invalidations (Event 13)”).
This event is the best indication of cache contention between CPUs. The CPU that accumulates a high count of event 31 is repeatedly modifying shared data.
In a cache-coherent multiprocessor, the system signals to the CPU when the CPU has to take action to maintain the coherency of the primary and secondary cache data. In the SN0 architecture, it is the hub chip that signals the CPU when some other CPU has invalidated memory (see “SN0 Node Board” in Chapter 1 and “Understanding Directory-Based Coherency” in Chapter 1).
From the standpoint of the CPU, an “intervention” is a signal stating that some other CPU in the system wants to use data from a cache line that this CPU has. (Other CPUs can tell from the directory bits stored in memory that this CPU has a copy of a cache line; see “Understanding Directory-Based Coherency” in Chapter 1). The other CPU requests the status of the cache line, and requests a copy of the line if it is not the same as memory.
An “invalidation” is a notice that another CPU has modified memory data that this CPU has in its cache. This CPU needs to invalidate (that is, discard) its copy of the data.
This counter is incremented on the cycle after an intervention is received and the intervention is not an invalidate type.
This counter is incremented on the cycle after an external invalidate is entered.
This is an obsolete counter definition valid only through chip revision 2.x. Beginning with revision 3.x, event 14 has a different meaning (see “Instructions Done (Event 14)”).
This counter is incremented on the cycle after an external intervention is determined to have hit in the L2 cache, necessitating a response of status and possibly a copy of the cache line.
This counter is incremented on the cycle after an external invalidate request is determined to have hit in the L2 cache, necessitating an action.
This event is a good indicator of cache contention. The CPU that produces a high count of event 29 is being slowed because it is using shared data that is being updated by a different CPU. The CPU doing the updating generates event 31 (“Store or Prefetch-Exclusive to Shared Block in Scache (Event 31)”).
The CPU uses a table, the translation lookaside buffer (TLB), to map virtual addresses to physical addresses. The TLB points to a limited number of pages. When a virtual address cannot be found in the TLB, the CPU traps to a fixed location. Operating system code in real memory analyzes the miss against the page tables in memory that describe the process's full address space. If the virtual address is valid, the trap code eventually loads one or more new entries into the TLB and resumes execution.
A TLB miss involves at least some in-memory processing. It can precipitate extensive processing to write pages out, allocate a page frame, and read a page from backing store. For this reason, the number of TLB misses, averaged per second of the program's run (after initial startup transients), is an important performance metric.
The Load Linked instruction and Store Conditional instruction (LL/SC) are used to implement mutual exclusion objects such as locks and semaphores. A Load Linked instruction loads a word from memory and simultaneously tags its cache line. The matching Store Conditional tests the target cache line: if the line has not been updated since the Load Linked was executed, the Store Conditional succeeds, modifies memory, and removes the tag from the cache line. If the target line has been modified, the Store Conditional fails.
LL/SC instructions should never account for a significant share of program execution time. When they do, it indicates some kind of contention or false-sharing problem involving mutual exclusion between asynchronous threads.
This counter is incremented when a store-conditional instruction was issued to the address-calc unit on the previous cycle. By subtracting this count from event 3 (“Issued Stores (Event 3)”), you can isolate nonconditional (normal) store instructions. This counter cannot count a given instruction more than once.
This counter is incremented when a store-conditional instruction fails; that is, when the target cache line was modified between the completion of the Load Linked and the execution of the Store Conditional. A failed instruction also graduates and is counted in event 20 (“Graduated Store Conditionals (Event 20)”).
A small proportion of failed SC instructions is to be expected when asynchronous threads use mutual exclusion. However, anything more than a few percent of failures indicates a performance problem, either because the shared resource is overused (bad design) or because the target cache line is occupied by too many modifiable variables (false sharing).
This counter is incremented on the cycle following the graduation of any store-conditional instruction, including one counted in event 5 (“Failed Store Conditionals (Event 5)”).