Chapter 3. Controlling CPU Workload

This chapter describes how to use IRIX kernel features to make the execution of a real-time program predictable. Each of these features works in some way to dedicate hardware to your program's use, or to reduce the influence of unplanned interrupts on it. The main topics covered are:

  • Using Priorities and Scheduling Queues

  • Controlling Kernel Threads

  • Minimizing Overhead Work

  • Minimizing Interrupt Response Time

Using Priorities and Scheduling Queues

The default IRIX scheduling algorithm is designed for a conventional time-sharing system, where the best results are obtained by favoring I/O-bound processes and discouraging CPU-bound processes. However, IRIX supports a variety of scheduling disciplines that are optimized for parallel processes. You can take advantage of these in different ways to suit the needs of different programs.

Note: You can use the methods discussed here to make a real-time program more predictable. However, to reliably achieve a high frame rate, you should plan to use the REACT/Pro Frame Scheduler described in Chapter 4, “Using the Frame Scheduler”.

Scheduling Concepts

In order to understand the differences between scheduling methods you need to know some basic concepts.

Tick Interrupts

In normal operation, the kernel pauses to make scheduling decisions every 10 milliseconds (ms) in every CPU. The duration of this interval, which is called the “tick” because it is the metronomic beat of the scheduler, is defined in the sys/param.h file. Every CPU is normally interrupted by a timer every tick interval. (However, the CPUs in a multiprocessor are not necessarily synchronized. Different CPUs may take tick interrupts at different times.)

During the tick interrupt the kernel updates accounting values, does other housekeeping work, and chooses which process to run next—usually the interrupted process, unless a process of superior priority has become ready to run. The tick interrupt is the mechanism that makes IRIX scheduling “preemptive”; that is, it is the mechanism that allows a high-priority process to take a CPU away from a lower-priority process.

Before the kernel returns to the chosen process, it checks for pending signals, and may divert the process into a signal handler.

You can stop the tick interrupt in selected CPUs in order to keep these interruptions from interfering with real-time programs—see “Making a CPU Nonpreemptive”.

Time Slices

Each process has a guaranteed time slice, which is the amount of time it is normally allowed to execute without being preempted. By default the time slice is 10 ticks, or 100 ms, on a multiprocessor system and 2 ticks, or 20 ms, on a uniprocessor system. A typical process is usually blocked for I/O before it reaches the end of its time slice.

At the end of a time slice, the kernel chooses which process to run next on the same CPU based on process priorities. When executable processes have the same priority, the kernel runs them in turn.

Understanding the Real-Time Priority Band

A real-time thread can select one of a range of 256 priorities (0-255) in the real-time priority band, using POSIX interfaces sched_setparam() or sched_setscheduler(). The higher the numeric value of the priority, the more important the thread. The range of priorities is shown in Figure 3-1.

It is important to consider the needs of the application and how it should interact with the rest of the system before selecting a real-time priority. In making this decision, consider the priorities of the system threads.

IRIX manages system threads to handle kernel tasks, such as paging and interrupts. System daemon threads execute in the priority range 90 through 109, inclusive. System device driver interrupt threads execute in the priority range 200 through 239, inclusive.

An application can set the priorities of its threads above those of the system threads, but this can adversely affect the behavior of the system. For example, if the disk interrupt thread is blocked by a higher priority user thread, disk data access is delayed until the user thread completes.

Setting the priorities of application threads within or above the system thread range requires an advanced understanding of IRIX system threads and their priorities. The priorities of the IRIX system threads are found in /var/sysgen/mtune/kernel. If necessary, you can change these defaults using the systune command, although this is not recommended for most users (see the systune(1M) man page for details).

Many soft real-time applications simply need to execute ahead of time-share applications, so priority range 0 through 89 is best suited. Since time-share applications are not priority scheduled, a thread running at the lowest real-time priority (0) still executes ahead of all time-share applications. At times, however, the operating system briefly promotes time-share threads into the real-time band to handle time-outs and avoid priority inversion. In these special cases, the promoted thread's real-time priority is never boosted higher than 1.
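The POSIX calls named above can be sketched as follows. This is a hedged, portable-C illustration (the function name set_rt_priority is invented here): because the numeric range of SCHED_FIFO priorities is implementation-defined, the sketch queries the legal range with sched_get_priority_min() and sched_get_priority_max() rather than assuming the IRIX 0-255 band.

```c
#include <errno.h>
#include <sched.h>
#include <stdio.h>

/* Request a real-time (SCHED_FIFO) priority for the calling process.
 * The legal numeric range is implementation-defined, so it is queried
 * and the request is clamped into it. Returns 0 on success, -1 on
 * failure (for example, EPERM when the caller lacks privilege). */
int set_rt_priority(int want)
{
    int lo = sched_get_priority_min(SCHED_FIFO);
    int hi = sched_get_priority_max(SCHED_FIFO);
    if (lo == -1 || hi == -1)
        return -1;

    struct sched_param sp;
    sp.sched_priority = want < lo ? lo : (want > hi ? hi : want);

    if (sched_setscheduler(0, SCHED_FIFO, &sp) == -1) {
        if (errno == EPERM)
            fprintf(stderr, "no privilege for real-time priority\n");
        return -1;
    }
    return 0;
}
```

An unprivileged process receives EPERM, which the sketch reports rather than treating as fatal.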

Figure 3-1. Real-Time Priority Band

Real-Time Priority Band

Note: Applications running ahead of system threads cannot depend on system services unless they observe the system responsiveness timing guidelines given below.

Interactive real-time applications (such as digital media) need low-latency responses from the operating system, but changing interrupt thread behavior is undesirable. In this case, the priority range 110 through 199, inclusive, is the best choice, allowing execution ahead of system daemons but behind interrupt threads. Applications in this range typically cooperate with a device driver; in that case, the correct priority for the application thread is the priority of the device driver interrupt thread minus 50. If the application is multithreaded and multiple priorities are warranted, then no thread's priority should be greater than the priority of the device driver interrupt thread minus 50. Note that threads running at a higher priority than system daemon threads should never run for more than a few milliseconds at a time, in order to preserve system responsiveness.

Hard real-time applications can use priorities 240 through 254 for the most deterministic behavior and the lowest latencies. However, if threads running at this priority range ever reach the state where they consume 100% of the system's processor cycles, the system becomes completely unresponsive. Threads running at a higher priority than the interrupt threads should never run for more than a few hundred microseconds at a time, to preserve system responsiveness.

Priority 255, the highest real-time priority, should not be used by applications. This priority is reserved for system use to handle timers for urgent real-time applications and kernel debugger interrupts. Applications running at this priority risk hanging the system.

The proprietary IRIX interface for selecting a real-time priority, schedctl(), is supported for binary compatibility, but is not the interface of choice. The nondegrading real-time priority range of schedctl() is remapped onto the POSIX real-time priority band as priorities 90 through 118, as shown in Table 3-1.

Table 3-1. schedctl() Real-Time Priority Range Remapping

Notice the large gap between the first two priorities; it preserves the scheduling semantics of schedctl() threads and system daemons.

Real-time users are encouraged to use tools such as par and irixview to observe the actual priorities and dynamic behaviors of all threads on a running system (see the par(1) and irixview(1) man pages for details).

Understanding Affinity Scheduling

Affinity scheduling is a special scheduling discipline used in multiprocessor systems. You do not have to take action to benefit from affinity scheduling, but you should know that it is done.

As a process executes, it causes more and more of its data and instruction text to be loaded into the processor cache. This creates an “affinity” between the process and the CPU. No other process can use that CPU as effectively, and the process cannot execute as fast on any other CPU.

The IRIX kernel notes the CPU on which a process last ran, and notes the amount of the affinity between them. Affinity is measured on an arbitrary scale.

When the process gives up the CPU—either because its time slice is up or because it is blocked—one of three things can happen to the CPU:

  • The CPU runs the same process again immediately.

  • The CPU spins idle, waiting for work.

  • The CPU runs a different process.

The first two actions do not reduce the process's affinity. But when the CPU runs a different process, that process begins to build up an affinity while simultaneously reducing the affinity of the earlier process.

As long as a process has any affinity for a CPU, it is dispatched only on that CPU if possible. When its affinity has declined to zero, the process can be dispatched on any available CPU. The result of the affinity scheduling policy is that:

  • I/O-bound processes, which execute for short periods and build up little affinity, are quickly dispatched whenever they become ready.

  • CPU-bound processes, which build up a strong affinity, are not dispatched as quickly because they have to wait for “their” CPU to be free. However, they do not suffer the serious delays of repeatedly “warming up” a cache.

Using Gang Scheduling

You can design a real-time program as a family of cooperating, lightweight processes, created with sproc(), sharing an address space. These processes typically coordinate their actions using locks or semaphores (see “Synchronization and Communication” in Chapter 2).

When process A attempts to seize a lock that is held by process B, one of two things can happen, depending on whether or not process B is running concurrently in another CPU:

  • If process B is not currently active, process A spends a short time in a “spin loop” and then is suspended. The kernel selects a new process to run. Time passes. Eventually process B runs and releases the lock. More time passes. Finally process A runs and now can seize the lock.

  • When process B is concurrently active on another CPU, it typically releases the lock while process A is still in the spin loop. The delay to process A is negligible, and the overhead of multiple passes into the kernel and out again is avoided.
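The spin-then-block behavior described above can be sketched with a toy lock built on C11 atomics. This is not the IRIX lock primitive; the type and function names are hypothetical, and sched_yield() stands in for the kernel suspending the caller and selecting a new process.

```c
#include <sched.h>
#include <stdatomic.h>

/* Toy spin-then-yield lock (hypothetical; not the IRIX primitive).
 * A short user-space spin is cheap when the holder is running
 * concurrently on another CPU; otherwise the caller gives up the CPU. */
typedef struct { atomic_flag held; } spinlock_t;

#define SPINLOCK_INIT { ATOMIC_FLAG_INIT }

void spin_lock(spinlock_t *l)
{
    for (;;) {
        /* brief spin: succeeds quickly if the holder releases soon */
        for (int i = 0; i < 1000; i++)
            if (!atomic_flag_test_and_set_explicit(&l->held,
                                                   memory_order_acquire))
                return;
        sched_yield();  /* holder not running: give up the CPU */
    }
}

void spin_unlock(spinlock_t *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```

In the second scenario the inner loop succeeds almost immediately; in the first, the sched_yield() call models the suspension and rescheduling overhead the text describes.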

In a system with many processes, the first scenario is common even when processes A, B, and their siblings have real-time priorities. Clearly it is better if processes A and B are always dispatched concurrently.

Gang scheduling achieves this. Any process in a share group can initiate gang scheduling. Then all the processes that share that address space are scheduled as a unit, using the priority of the highest-priority process in the gang. IRIX tries to ensure that all the members of the share group are dispatched when any one of them is dispatched.

You initiate gang scheduling with a call to schedctl(), as sketched in Example 3-1.

Example 3-1. Initiating Gang Scheduling

#include <sys/schedctl.h>
#include <errno.h>
#include <stdio.h>

if (-1 == schedctl(SCHEDMODE,SGS_GANG))
   if (EPERM == errno)
      fprintf(stderr,"You forgot to suid again\n");

You can turn gang scheduling off again with another call, passing SGS_FREE in place of SGS_GANG.

Changing the Time Slice Duration

You can change the length of the time slice for all processes from its default (100 ms, multiprocessor systems/20 ms, uniprocessor systems) using the systune command (see the systune(1M) man page). The kernel variable is slice_size; its value is the number of tick intervals that make up a slice. There is probably no good reason to make a global change of the time-slice length.

You can change the length of the time slice for one particular process using the schedctl() function (see the schedctl(2) man page), as shown in Example 3-2.

Example 3-2. Setting the Time-Slice Length

#include <sys/schedctl.h>
#include <stdio.h>
int setMyTimeSliceInTicks(const int ticks)
{
   int ret = schedctl(SLICE,0,ticks);
   if (-1 == ret)
      { perror("schedctl(SLICE)"); }
   return ret;
}

You can lengthen the time slice for the parent of a process group that is gang-scheduled (see “Using Gang Scheduling”). This keeps members of the gang executing concurrently longer.

Controlling Kernel Threads

In some situations, kernel threads, like user threads, must run on specific processors or with other special behavior. The XThread Control Interface (XTCI) was added in IRIX 6.5.16 to control these special behaviors. Users can add XTHREAD entries in the /var/sysgen/system/ file. Kernel threads not mentioned operate with default behavior. After the file is modified, you must run autoconfig to reconfigure the system.

You can enter up to 32 XTHREAD entries in the file. In the event that conflicting entries are found, to preserve compatibility, XTCI entries defer to the legacy /var/sysgen/master.d/sgi interface. Entries cannot combine any of the BOOT, FLOAT, or CPU options. Specific interface options include the following:

XTHREAD: name[*] [BOOT] [FLOAT] [STACK s] [PRI p] [CPU m...n]

Options are defined as follows:

XTHREAD: 

Specifies kernel thread control. Any line beginning with XTHREAD: controls kernel threads. All of the information must be on the same line.

name[*] 

Specifies a kernel thread name. Any kernel thread with a name equal to name is affected by the directives that follow it. If [*] follows, any thread whose name begins with name is affected. The list of kernel system and interrupt threads is available through the icrash command and the separate product IRIXview.

BOOT 

Indicates that the thread will stay within the boot cpuset if one exists.

FLOAT 

Indicates that the thread will never be bound to a CPU.

STACK s 

Specifies the starting thread stack size.

PRI p 

Specifies the starting thread CPU scheduling priority.

CPU m...n  

Specifies a list of CPUs on which to attempt to place the thread, if possible. Threads that cannot be placed on their CPU list are considered FLOAT. This is comparable to the sysmp() MP_MUSTRUN command for user threads.


To keep the kernel's onesec system thread within the boot cpuset, the following entry is placed within the /var/sysgen/system/ file:

XTHREAD: onesec BOOT
All names of kernel interrupt threads for handling VME devices begin with vme_intrd. To force all of these threads to run on processor 1, with a priority of 210, for example, the following entry is used:

XTHREAD: vme_intrd* CPU 1 PRI 210

The names of the kernel threads for supporting the /dev/random pseudo device begin with randproc and end with the number of the processor to which they have been bound. For example, to keep the kernel thread assigned to processor 4 from being bound to it, the following entry is used:

XTHREAD: randproc4 FLOAT

Minimizing Overhead Work

A certain amount of CPU time must be spent on general housekeeping. Since this work is done by the kernel and triggered by interrupts, it can interfere with the operation of a real-time process. However, you can remove almost all such work from designated CPUs, leaving them free for real-time work.

First decide how many CPUs are required to run your real-time application (regardless of whether it is to be scheduled normally, or as a gang, or by the Frame Scheduler). Then apply the following steps to isolate and restrict those CPUs. The steps are independent of each other. Each needs to be done to completely free a CPU.

Assigning the Clock Processor

Every CPU that uses normal IRIX scheduling takes a “tick” interrupt that is the basis of process scheduling. However, one CPU does additional housekeeping work for the whole system, on each of its tick interrupts. You can specify which CPU has these additional duties using the privileged mpadmin command (see the mpadmin(1) man page). For example, to make CPU 0 the clock CPU (a common choice), use:

mpadmin -c 0 

The equivalent operation from within a program uses sysmp() as shown in Example 3-3 (see also the sysmp(2) man page).

Example 3-3. Setting the Clock CPU

#include <sys/sysmp.h>
#include <stdio.h>
int setClockTo(int cpu)
{
   int ret = sysmp(MP_CLOCK,cpu);
   if (-1 == ret) perror("sysmp(MP_CLOCK)");
   return ret;
}

Unavoidable Timer Interrupts

In machines based on the R4x00 CPU, even when the clock and fast timer duties are removed from a CPU, that CPU still gets an unwanted interrupt as a 5-microsecond “blip” every 80 seconds. Systems based on the R8000 and R10000 CPUs are not affected, and processes running under the Frame Scheduler are not affected even by this small interrupt.

Isolating a CPU from Sprayed Interrupts

By default, the Origin, Onyx 2, CHALLENGE, and Onyx systems direct I/O interrupts from the bus to CPUs in rotation (called spraying interrupts). You do not want a real-time process interrupted at unpredictable times to handle I/O. The system administrator can isolate one or more CPUs from sprayed interrupts by placing the NOINTR directive in the configuration file /var/sysgen/system/. The syntax is:

NOINTR cpu# [cpu#]...

Before the NOINTR directive takes effect, the kernel must be rebuilt using the command /etc/autoconfig -vf, and rebooted.

Redirecting Interrupts

To minimize latency of real-time interrupts, it is often necessary to direct them to specific real-time processors. This process is called interrupt redirection.

A device interrupt can be redirected to a specific processor using the DEVICE_ADMIN directive in the /var/sysgen/system/ file.

The DEVICE_ADMIN and the NOINTR directives are typically used together to guarantee that the target processor only handles the redirected interrupts.

For example, adding the following lines to the system configuration file ensures that CPU 1 handles only PCI interrupt 4:

DEVICE_ADMIN: /hw/module/1/slot/io1/baseio/pci/4 INTR_TARGET=/hw/cpunum/1

On the Origin 3000 series, if a DEVICE_ADMIN directive is used to redirect an interrupt, the hardware limitations might not allow the interrupt to be redirected to the requested CPU. If this occurs, you will see the following message on the console and in the system log (hwgraph path and CPU number as appropriate for each case):

WARNING:Override explicit interrupt targetting: /hw/module/001c10/Ibrick/xtalk/15/pci/4/ei(0x2f8),unable to target CPU 4

For a threaded interrupt handler, the directive will still ensure that the interrupt handler thread is given control on the specified CPU.

If the interrupt handler is non-threaded and interrupt redirection is requested to ensure the handler runs on a particular CPU, choice of the interrupt CPU is critical. A device on a particular PCI bus can interrupt CPUs only on one Processor Interface (PI), either PI-0 or PI-1. (A device can still interrupt CPUs on any node, but it can interrupt only those on one PI.) At boot time, it is determined which CPUs are interruptible from which PCI bus. Once determined, the set of interruptible CPUs for a particular PCI bus should not change from boot to boot, unless a system configuration change is made, such as disabling the CPU and reconfiguring the I/O.

If you receive the previously mentioned warning message, indicating that an interrupt redirect failed, you can perform the following procedure to determine to which CPUs an interrupt can be directed, and then change the DEVICE_ADMIN directive accordingly. From the message, you know that CPU 4 is on a PI that cannot receive interrupts from the device in question. As shown in the following example, output from an ls command indicates which PI the CPU is on. In this case, it is PI-0, as indicated by the 0 in the path.

o3000% ls -l /hw/cpunum/4 
lrw------- ...   4 -> /hw/module/001c13/node/cpubus/0/a

You now know that the PCI bus that this device is on can interrupt CPUs only on PI-1. Using this knowledge and output from the ls command, you can choose an interruptible CPU. As shown from the output from the ls command in the following example, changing the DEVICE_ADMIN directive to use CPU 2, 3, 6, or 7 will allow you to work around this hardware limitation.

o3000% ls -l /hw/cpunum
lrw------- ...  0 -> /hw/module/001c10/node/cpubus/0/a
lrw------- ...  1 -> /hw/module/001c10/node/cpubus/0/b
lrw------- ...  2 -> /hw/module/001c10/node/cpubus/1/a
lrw------- ...  3 -> /hw/module/001c10/node/cpubus/1/b
lrw------- ...  4 -> /hw/module/001c13/node/cpubus/0/a
lrw------- ...  5 -> /hw/module/001c13/node/cpubus/0/b
lrw------- ...  6 -> /hw/module/001c13/node/cpubus/1/a
lrw------- ...  7 -> /hw/module/001c13/node/cpubus/1/b

Another possible solution is to also direct the PCI error interrupt for the bus. The PCI error interrupt is the first interrupt for the bus assigned to a CPU; if it is assigned to a CPU on a different PI, it will cause the targetting of the selected device to fail. As an example, to target the error interrupt to CPU 4, put the following directive into the file:

DEVICE_ADMIN: /hw/module/001c10/Ibrick/xtalk/15/pci INTR_TARGET=/hw/cpunum/4

Note: The actual DEVICE_ADMIN directive varies depending on the system's hardware configuration.

Before the directives take effect, the kernel must be rebuilt using the command /etc/autoconfig -vf, and rebooted.

Understanding the Vertical Sync Interrupt

In systems with dedicated graphics hardware, the graphics hardware generates a variety of hardware interrupts. The most frequent of these is the vertical sync interrupt, which marks the end of a video frame. The vertical sync interrupt can be used by the Frame Scheduler as a time base (see “Vertical Sync Interrupt” in Chapter 4). Certain GL and Open GL functions are internally synchronized to the vertical sync interrupt (for an example, refer to the gsync(3g) man page).

All the interrupts produced by dedicated graphics hardware have a lower priority than other hardware interrupts. All graphics interrupts, including the vertical sync interrupt, are directed to CPU 0. They are not “sprayed” in rotation, and they cannot be directed to a different CPU.

Restricting a CPU from Scheduled Work

For best performance of a real-time process or for minimum interrupt response time, you need to use one or more CPUs without competition from other scheduled processes. You can exert three levels of increasing control: restricted, isolated, and nonpreemptive.

In general, the IRIX scheduling algorithms run a process that is ready to run on any CPU. This is modified by considerations of

  • Affinity — CPUs are made to execute the processes that have developed affinity to them

  • Processor group assignments — The pset command can force a specified group of CPUs to service only a given scheduling queue

You can restrict one or more CPUs from running any scheduled processes at all. The only processes that can use a restricted CPU are processes that you assign to those CPUs.

Note: Restricting a CPU overrides any group assignment made with pset. A restricted CPU remains part of a group, but does not perform any work you assign to the group using pset.

You can find out the number of CPUs that exist, and the number that are still unrestricted, using the sysmp() function as in Example 3-4.

Example 3-4. Number of Processors Available and Total

#include <sys/sysmp.h>
int CPUsInSystem = sysmp(MP_NPROCS);
int CPUsNotRestricted = sysmp(MP_NAPROCS);

To restrict one or more CPUs, you can use mpadmin. For example, to restrict CPUs 4 and 5, you can use

mpadmin -r 4
mpadmin -r 5

The equivalent operation from within a program uses sysmp() as in Example 3-5 (see also the sysmp(2) man page).

Example 3-5. Restricting a CPU

#include <sys/sysmp.h>
#include <stdio.h>
int restrictCpuN(int cpu)
{
   int ret = sysmp(MP_RESTRICT,cpu);
   if (-1 == ret) perror("sysmp(MP_RESTRICT)");
   return ret;
}

You remove the restriction, allowing the CPU to execute any scheduled process, with mpadmin -u or with sysmp(MP_EMPOWER).


Assigning Work to a Restricted CPU

After restricting a CPU, you can assign processes to it using the command runon (see the runon(1) man page). For example, to run a program on CPU 3, you could use

runon 3 ~rt/bin/rtapp

The equivalent operation from within a program uses sysmp() as in Example 3-6 (see also the sysmp(2) man page).

Example 3-6. Assigning the Calling Process to a CPU

#include <sys/sysmp.h>
#include <stdio.h>
int runMeOn(int cpu)
{
   int ret = sysmp(MP_MUSTRUN,cpu);
   if (-1 == ret) perror("sysmp(MP_MUSTRUN)");
   return ret;
}

You remove the assignment, allowing the process to execute on any available CPU, with sysmp(MP_RUNANYWHERE). There is no command equivalent.

The assignment to a specified CPU is inherited by processes created by the assigned process. Thus if you assign a real-time program with runon, all the processes it creates run on that same CPU. More often you want to run multiple processes concurrently on multiple CPUs. There are three approaches you can take:

  1. Use the REACT/Pro Frame Scheduler, letting it restrict CPUs for you.

  2. Let the parent process be scheduled normally using a nondegrading real-time priority. After creating child processes with sproc(), use schedctl(SCHEDMODE,SGS_GANG) to cause the share group to be gang-scheduled. Assign a processor group to service the gang-scheduled process queue.

    The CPUs that service the gang queue cannot be restricted. However, if yours is the only gang-scheduled program, those CPUs are effectively dedicated to your program.

  3. Let the parent process be scheduled normally. Let it restrict as many CPUs as it has child processes. Have each child process invoke sysmp(MP_MUSTRUN,cpu) when it starts, each specifying a different restricted CPU.
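Approach 3 can be sketched in portable C. Since sysmp() exists only on IRIX, this sketch substitutes the Linux analogue sched_setaffinity() for sysmp(MP_MUSTRUN); the function names are illustrative, and the parent's restriction of the CPUs is omitted here.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

/* Bind the calling process to one CPU (Linux stand-in for
 * IRIX sysmp(MP_MUSTRUN, cpu)). */
static int bind_self_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof set, &set);
}

/* Parent creates one child per CPU; each child pins itself to a
 * distinct CPU before starting its time-critical work. */
int start_pinned_children(int ncpus)
{
    for (int cpu = 0; cpu < ncpus; cpu++) {
        pid_t pid = fork();
        if (pid == -1)
            return -1;
        if (pid == 0) {                  /* child */
            if (bind_self_to_cpu(cpu) == -1)
                perror("sched_setaffinity");
            /* ... time-critical work would run here ... */
            _exit(0);
        }
    }
    while (wait(NULL) > 0)               /* reap all children */
        ;
    return 0;
}
```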

Isolating a CPU from TLB Interrupts

When the kernel changes the address space in a way that could invalidate TLB entries held by other CPUs, it broadcasts an interrupt to all CPUs, telling them to update their translation lookaside buffers (TLBs).

You can isolate the CPU so that it does not receive broadcast TLB interrupts. When you isolate a CPU, you also restrict it from scheduling processes. Thus isolation is a superset of restriction, and the comments in the preceding topic, “Restricting a CPU from Scheduled Work”, also apply to isolation.

The isolate command is mpadmin -I; the function is sysmp(MP_ISOLATE, cpu#). After isolation, the CPU synchronizes its TLB and instruction cache only when a system call is executed. This removes one source of unpredictable delays from a real-time program and helps minimize the latency of interrupt handling.

Note: The REACT/Pro Frame Scheduler automatically restricts and isolates any CPU it uses.

When an isolated CPU executes only processes whose address space mappings are fixed, it receives no broadcast interrupts from other CPUs. Actions by processes in other CPUs that change the address space of a process running in an isolated CPU can still cause interrupts at the isolated CPU. Among the actions that change the address space are:

  • Causing a page fault. When the kernel needs to allocate a page frame in order to read a page from swap, and no page frames are free, it invalidates some unlocked page. This can render TLB and cache entries in other CPUs invalid. However, as long as an isolated CPU executes only processes whose address spaces are locked in memory, such events cannot affect it.

  • Extending a shared address space with brk(). Allocate all heap space needed before isolating the CPU.

  • Using mmap(), munmap(), mprotect(), shmget(), or shmctl() to add, change or remove memory segments from the address space; or extending the size of a mapped file segment when MAP_AUTOGROW was specified and MAP_LOCAL was not. All memory segments should be established before the CPU is isolated.

  • Starting a new process with sproc(), thus creating a new stack segment in the shared address space. Create all processes before isolating the CPU; or use sprocsp() instead, supplying the stack from space allocated previously.

  • Accessing a new DSO using dlopen() or by reference to a delayed-load external symbol (see the dlopen(3) and DSO(5) man pages). This adds a new memory segment to the address space but the addition is not reflected in the TLB of an isolated CPU.

  • Calling cacheflush() (see the cacheflush(2) man page).

  • Using DMA to read or write the contents of a large (many-page) buffer. For speed, the kernel temporarily maps the buffer pages into the kernel address space, and unmaps them when the I/O completes. However, these changes affect only kernel code. An isolated CPU processes a pending TLB flush when the user process enters the kernel for an interrupt or service function.

Isolating a CPU When Performer Is Used

The Performer graphics library supplies utility functions to isolate CPUs and to assign Performer processes to the CPUs. You can read the code of these functions in the file /usr/src/Performer/src/lib/libpfutil/lockcpu.c. They use CPUs starting with CPU number 1 and counting upward. The functions can restrict as many as 1+2×pipes CPUs, where pipes is the number of graphical pipes in use (see the pfuFreeCPUs(3pf) man page for details). The functions assume these CPUs are available for use.

If your real-time application uses Performer for graphics—which is the recommended approach for high-performance simulators—you should use the libpfutil functions with care. You may need to replace them with functions of your own. Your functions can take into account the CPUs you reserve for other time-critical processes. If you already restrict one or more CPUs, you can use a Performer utility function to assign Performer processes to those CPUs.

Making a CPU Nonpreemptive

After a CPU has been isolated, you can turn off the dispatching “tick” for that CPU (see “Tick Interrupts”). This eliminates the last source of overhead interrupts for that CPU. It also ends preemptive process scheduling for that CPU. This means that the process now running continues to run until one of the following events occurs:

  • The process gives up control voluntarily by blocking on a semaphore or lock, requesting I/O, or calling sginap().

  • The process calls a system function and, when the kernel is ready to return from the system function, a process of higher priority is ready to run.

Some effects of this change within the specified CPU include the following:

  • IRIX no longer ages degrading priorities. Priority aging is done on clock tick interrupts.

  • IRIX no longer preempts a low-priority process when a high-priority process becomes executable, except when the low-priority process calls a system function.

  • Signals (other than SIGALRM) can only be delivered after I/O interrupts or on return from system calls. This can extend the latency of signal delivery.

Normally an isolated CPU runs only a few, related, time-critical processes that have equal priorities, and that coordinate their use of the CPU through semaphores or locks. When this is the case, the loss of preemptive scheduling is outweighed by the benefit of removing the overhead and unpredictability of interrupts.

To make a CPU nonpreemptive you can use the mpadmin command. For example, to isolate CPU 3 and make it nonpreemptive, you can use

mpadmin -I 3
mpadmin -D 3

The equivalent operation from within a program uses sysmp() as shown in Example 3-7 (see the sysmp(2) man page).

Example 3-7. Making a CPU nonpreemptive

#include <sys/sysmp.h>
#include <stdio.h>
int stopTimeSlicingOn(int cpu)
{
   int ret = sysmp(MP_NONPREEMPTIVE,cpu);
   if (-1 == ret) perror("sysmp(MP_NONPREEMPTIVE)");
   return ret;
}

You reverse the operation with sysmp(MP_PREEMPTIVE) or with mpadmin -C.

Minimizing Interrupt Response Time

Interrupt response time is the time that passes between the instant when a hardware device raises an interrupt signal, and the instant when—interrupt service completed—the system returns control to a user process. IRIX guarantees a maximum interrupt response time on certain systems, but you have to configure the system properly to realize the guaranteed time.

Maximum Response Time Guarantee

In properly configured systems, interrupt response time is guaranteed not to exceed 50 microseconds for Origin 3000 and Onyx 3 series systems and not to exceed 100 microseconds for Origin 2000 and Onyx 2 series systems.

This guarantee is important to a real-time program because it puts an upper bound on the overhead of servicing interrupts from real-time devices. You should have some idea of the number of interrupts that will arrive per second. Multiplying this by 50 microseconds yields a conservative estimate of the amount of time in any one second devoted to interrupt handling in the CPU that receives the interrupts. The remaining time is available to your real-time application in that CPU.

Components of Interrupt Response Time

The total interrupt response time includes these sequential parts:

Hardware latency

The time required to make a CPU respond to an interrupt signal.

Software latency

The time required to dispatch an interrupt thread.

Device service time

The time the device driver spends processing the interrupt and dispatching a user thread.

Mode switch

The time it takes for a thread to switch from kernel mode to user mode.

The parts are diagrammed in Figure 3-2 and discussed in the following topics.

Figure 3-2. Components of Interrupt Response Time


Hardware Latency

When an I/O device requests an interrupt, it activates a line in the VME or PCI bus interface. The bus adapter chip places an interrupt request on the system internal bus, and a CPU accepts the interrupt request.

The time taken for these events is the hardware latency, or interrupt propagation delay. In Challenge or Onyx systems, the typical propagation delay is 2 microseconds. The worst-case delay can be much greater. The worst-case hardware latency can be significantly reduced by not placing high-bandwidth DMA devices such as graphics or HIPPI interfaces on the same hardware unit (POWERChannel-2 in the Challenge, module and hub chip in the Origin) used by the interrupting devices.

Software Latency

The primary function of interrupt dispatch is to determine which device triggered the interrupt and dispatch the corresponding interrupt thread. Interrupt threads are responsible for calling the device driver and executing its interrupt service routine.

While interrupt dispatch is executing, all interrupts for that processor are masked until it completes. Any pending interrupts are dispatched before interrupt threads execute. Thus, the handling of an interrupt could be delayed by one or more devices.

In order to achieve 50-microsecond response time, you must ensure that the time-critical devices supply the only interrupts directed to that CPU (see “Redirecting Interrupts”).

Kernel Critical Sections

Most of the IRIX kernel code is noncritical and executed with interrupts enabled. However, certain sections of kernel code depend on exclusive access to shared resources. Spin locks are used to control access to these critical sections. Once in a critical section, the interrupt level is raised in that CPU. New interrupts are not serviced until the critical section is complete.

Although most kernel critical sections are short, there is no guarantee on the length of a critical section. In order to achieve 50-microsecond response time, your real-time program must avoid executing system calls on the CPU where interrupts are handled. The way to ensure this is to restrict that CPU from running normal processes (see “Restricting a CPU from Scheduled Work”) and isolate it from TLB interrupts (see “Isolating a CPU from TLB Interrupts”)—or to use the Frame Scheduler.

You may need to dedicate a CPU to handling interrupts. However, if the interrupt-handling CPU has power well above that required to service interrupts—and if your real-time process can tolerate interruptions for interrupt service—you can use the isolated CPU to execute real-time processes. If you do this, the processes that use the CPU must avoid system calls that do I/O or allocate resources, for example, fork(), brk(), or mmap(). The processes must also avoid generating external interrupts with long pulse widths (see “External Interrupts” in Chapter 6).

In general, processes in a CPU that services time-critical interrupts should avoid all system calls except those for interprocess communication and for memory allocation within an arena of fixed size.

Device Service Time

The time spent servicing an interrupt should be negligible. The interrupt handler should do very little processing, only wake up a sleeping user process and possibly start another device operation. Time-consuming operations such as allocating buffers or locking down buffer pages should be done in the request entry points for read(), write(), or ioctl(). When this is the case, device service time is minimal.
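The shape of such a minimal handler might be sketched as follows. This is illustrative only: the routine name, softc structure, and READ_DEVICE_STATUS macro are invented; wakeup() is the kernel routine described under "Dispatch User Thread" below.

```c
/* Hypothetical minimal interrupt routine for an IRIX character driver.
   mydev_intr, mydev_softc, sc_waiting, and READ_DEVICE_STATUS are
   invented names for illustration. */
void
mydev_intr(int unit)
{
    struct mydev_softc *sc = &mydev_softc[unit];

    /* Capture hardware state; no buffer allocation or page locking
       here -- that work belongs in the read()/write()/ioctl()
       entry points. */
    sc->sc_status = READ_DEVICE_STATUS(sc);

    /* Only wake the sleeping user process (and possibly start
       another device operation). */
    if (sc->sc_waiting) {
        sc->sc_waiting = 0;
        wakeup(&sc->sc_waiting);
    }
}
```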

Device drivers supplied by SGI indeed spend negligible time in interrupt service. Device drivers from third parties are an unknown quantity. Hence the 50-microsecond guarantee is not in force when third-party device drivers are used on the same CPU at a superior priority to the time-critical interrupts.

Dispatch User Thread

Typically, the result of the interrupt is to make a sleeping thread runnable. The runnable thread is entered in one of the scheduler queues. (This work may be done while still within the interrupt handler, as part of a device driver library routine such as wakeup().)

Mode Switch

A number of instructions are required to exit kernel mode and resume execution of the user thread. Among other things, this is when the kernel checks for software signals addressed to this process and redirects control to the signal handler. If a signal handler is to be entered, the kernel might have to extend the size of the stack segment. (This cannot happen if the stack was extended before it was locked.)

Minimal Interrupt Response Time

To summarize, you can ensure interrupt response time of 50 microseconds or less for one specified device interrupt provided you configure the system as follows:

  • The interrupt is directed to a specific CPU, not “sprayed”; and is the highest-priority interrupt received by that CPU.

  • The interrupt is handled by an SGI-supplied device driver, or by a device driver from another source that promises negligible processing time.

  • That CPU does not receive any other “sprayed” interrupts.

  • That CPU is restricted from executing general UNIX processes, isolated from TLB interrupts, and made nonpreemptive—or is managed by the Frame Scheduler.

  • Any process you assign to that CPU avoids system calls other than interprocess communication and allocation within an arena.

When these things are done, interrupts are serviced in minimal time.
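The kernel-call side of this configuration might be sketched as follows, assuming MP_RESTRICT, MP_ISOLATE, and MP_NONPREEMPTIVE are the sysmp() commands corresponding to the mpadmin operations above (see the sysmp(2) man page). Treat this as a sketch, not a complete program; the function name and ordering are illustrative.

```c
#include <stdio.h>
#include <sys/sysmp.h>

/* Sketch: dedicate one CPU to time-critical interrupt handling.
   The sequence and error handling are illustrative assumptions. */
int dedicateCpu(int cpu)
{
    if (sysmp(MP_RESTRICT, cpu) == -1 ||      /* no general UNIX processes */
        sysmp(MP_ISOLATE, cpu) == -1 ||       /* isolate from TLB interrupts */
        sysmp(MP_NONPREEMPTIVE, cpu) == -1) { /* stop scheduler time-slicing */
        perror("sysmp");
        return -1;
    }
    return 0;
}
```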

Tip: If interrupt service time is a critical factor in your design, consider the possibility of using VME programmed I/O to poll for data, instead of using interrupts. It takes at most 4 microseconds to poll a VME bus address (see “PIO Access” in Chapter 6). A polling process can be dispatched one or more times per frame by the Frame Scheduler with low overhead.