Chapter 2. How IRIX and REACT/pro Support Real-Time Programs

This chapter provides an overview of the real-time support for programs in IRIX and REACT/pro.

Some of the features mentioned here are discussed in more detail in the following chapters of this guide. For details on other features, see the man pages or the other manuals cited. The main topics are surveyed in the sections that follow.

Kernel Facilities for Real-Time Programs

The IRIX kernel has a number of features that are valuable when you are designing your real-time program. These are described in the following sections.

Kernel Optimizations

The IRIX kernel has been optimized for performance in a multiprocessor environment. Some of the optimizations are as follows:

  • Instruction paths to system calls and traps are optimized, including some hand coding, to maximize cache utilization.

  • In the real-time dispatch class (described further in “Using Priorities and Scheduling Queues” in Chapter 3), the run queue is kept in priority-sorted order for fast dispatching.

  • Floating point registers are saved only if the next process needs them, and restored only if saved.

  • The kernel tries to redispatch a process on the same CPU where it most recently ran, looking for some of its data remaining in cache (see “Understanding Affinity Scheduling” in Chapter 3).

Special Scheduling Disciplines

The default IRIX scheduling algorithm is designed to ensure fairness among time-shared users. Under this “earnings-based” scheduler, the kernel credits each process group with a certain number of microseconds on each dispatch cycle. The process with the fattest “bank account” is dispatched first. A process that exhausts its “bank account” is preempted.

POSIX Real-Time Policies

While the earnings-based scheduler is effective at scheduling time-share applications, it is not suitable for real time. For deterministic scheduling, IRIX provides the POSIX real-time policies: first-in-first-out and round robin. These policies share a real-time priority band consisting of 256 priorities. Processes scheduled using the POSIX real-time policies are not subject to “earnings” controls. For more information about scheduling, see “Understanding the Real-Time Priority Band” in Chapter 3 and the realtime(5) man page.

Gang Scheduling

When your program is structured as a share group of processes (created using sproc()), you can request that all the processes of the group be scheduled as a “gang.” The kernel runs all the members of the gang concurrently, provided there are enough CPUs available to do so. This helps to ensure that, when members of the process group coordinate through the use of locks, a lock is usually released in a timely manner. Without gang scheduling, the process that holds a lock may not be scheduled in the same interval as another process that is waiting on that lock.

For more information, see “Using Gang Scheduling” in Chapter 3.

Locking Virtual Memory

IRIX allows a process to lock all or part of its virtual memory into physical memory, so that it cannot be paged out and a page fault cannot occur while it is running.

Memory locking prevents unpredictable delays caused by paging. Of course, locked memory is not available for the address spaces of other processes. The system must have enough physical memory to hold the locked address space plus a minimum working set for other activities.

The system calls used to lock memory, such as mlock() and mlockall(), are discussed in detail in Topics in IRIX Programming (see “Related Publications and Sites”).

Mapping Processes and CPUs

Normally IRIX tries to keep all CPUs busy, dispatching the next ready process to the next available CPU. (This simple picture is complicated by the needs of affinity scheduling and gang scheduling.) Because the number of ready processes changes constantly, dispatching is effectively unpredictable: a normal process cannot predict how often or when it will next be able to run. For normal programs this does not matter, as long as each process continues to run at a satisfactory average rate.

Real-time processes cannot tolerate this unpredictability. To reduce it, you can dedicate one or more CPUs to real-time work. There are two steps:

  1. Restrict one or more CPUs from normal scheduling, so that they can run only the processes that are specifically assigned to them.

  2. Assign one or more processes to run on the restricted CPUs.

A process on a dedicated CPU runs when it needs to run, delayed only by interrupt service and by kernel scheduling cycles (if scheduling is enabled on that CPU). For details, see “Assigning Work to a Restricted CPU” in Chapter 3. The REACT/pro Frame Scheduler takes care of both steps automatically; see “REACT/pro Frame Scheduler”.

Controlling Interrupt Distribution

In normal operations, CPUs receive frequent interrupts:

  • I/O interrupts from devices attached to, or near, that CPU.

  • The scheduling clock interrupts every CPU once per time-slice interval of 10 milliseconds.

  • Whenever interval timers expire (see “Timers and Clocks”), a CPU that handles timers receives timer interrupts.

  • When the map of virtual to physical memory changes, a TLB interrupt is broadcast to all CPUs.

These interrupts can make the execution time of a process unpredictable. However, you can designate one or more CPUs for real-time use, and keep interrupts of these kinds away from those CPUs. The system calls for interrupt control are discussed further in “Minimizing Overhead Work” in Chapter 3. The REACT/pro Frame Scheduler also takes care of interrupt isolation.

REACT/pro Frame Scheduler

Many real-time programs must sustain a fixed frame rate. In such programs, the central design problem is that the program must complete certain activities during every frame interval.

The REACT/pro Frame Scheduler is a process execution manager that schedules activities on one or more CPUs in a predefined, cyclic order. The scheduling interval is determined by a repetitive time base, usually a hardware interrupt.

The Frame Scheduler makes it easy to organize a real-time program as a set of independent, cooperating threads. The Frame Scheduler manages the housekeeping details of reserving and isolating CPUs. You concentrate on designing the activities and implementing them as threads in a clean, structured way. It is relatively easy to change the number of activities, or their sequence, or the number of CPUs, even late in the project. For detailed information about the Frame Scheduler, see Chapter 4, “Using the Frame Scheduler”.

Synchronization and Communication

In a program organized as multiple, cooperating processes, the processes need to share data and coordinate their actions in well-defined ways. IRIX with REACT provides the following mechanisms, which are surveyed in the topics that follow:

  • Shared memory allows a single segment of memory to appear in the address spaces of multiple processes.

  • Semaphores are used to coordinate access from multiple processes to resources that they share.

  • Locks provide a low-overhead, high-speed method of mutual exclusion.

  • Barriers make it easy for multiple processes to synchronize the start of a common activity.

  • Signals provide asynchronous notification of special events or errors. IRIX supports signal semantics from all major UNIX heritages, but POSIX-standard signals are recommended for real-time programs.

Shared Memory Segments

IRIX allows you to map a segment of memory into the address spaces of two or more processes at once. The block of shared memory can be read concurrently, and possibly written, by all the processes that share it. IRIX supports the POSIX and the SVR4 models of shared memory, as well as a system of shared arenas unique to IRIX. These facilities are covered in detail in Topics in IRIX Programming (see “Related Publications and Sites”).

Semaphores

A semaphore is a flexible synchronization mechanism used to signal an event, limit concurrent access to a resource, or enforce mutual exclusion of critical code regions.

IRIX implements industry standard POSIX and SVR4 semaphores, as well as its own arena-based version. All three versions are discussed in Topics in IRIX Programming (see “Related Publications and Sites”). While the interfaces and semantics of each type are slightly different, the way they are used is fundamentally the same.

Semaphores have two primary operations that allow threads to atomically increment or decrement the value of a semaphore. With POSIX semaphores, these operations are sem_post() and sem_wait(), respectively (see sem_post(3) and sem_wait(3) for additional information).

When a thread decrements a semaphore and causes its value to become less than zero, the thread blocks; otherwise, the thread continues without blocking. A thread blocked on a semaphore typically remains blocked until another thread increments the semaphore.

The wakeup order depends on the version of semaphore being used:

  • POSIX: the highest-priority thread that has waited the longest (priority-based)

  • Arena: the process that has waited the longest (FIFO-based)

  • SVR4: the process that has waited the longest (FIFO-based)


Tip: SGI recommends POSIX semaphores for synchronizing real-time threads, because they queue blocked threads in priority order and outperform the other semaphore versions under low or no contention.


Following are examples of using semaphores:

  • To implement a lock using POSIX semaphores, an application initializes a semaphore to 1, and uses sem_wait() to acquire the semaphore and sem_post() to release it.

  • To use semaphores for event notification, an application initializes the semaphore to 0. Threads waiting for the event to occur call sem_wait(), while threads signaling the event use sem_post().

Locks

A lock is a mutually exclusive synchronization object that represents a shared resource. A process that wants to use the resource sets a lock and later releases the lock when it is finished using the resource.

As discussed in “Semaphores”, a lock is functionally the same as a semaphore with a count of 1. The set-lock operation acquires a lock for exclusive use of a resource. On a multiprocessor system, one important difference between a lock and a semaphore is the behavior when the resource is not immediately available: a semaphore always blocks the process, while a lock causes the process to spin until the resource becomes available.

A lock, on a multiprocessor system, is set by “spinning.” The program enters a tight loop using the test-and-set machine instruction to test the lock's value and to set it as soon as the lock is clear. In practice the lock is often already available, and the first execution of test-and-set acquires the lock. In this case, setting the lock takes a trivial amount of time.

When the lock is already set, the process spins on the test a certain number of times. If the process that holds the lock is executing concurrently on another CPU, and if it releases the lock during this time, the spinning process acquires the lock instantly. There is essentially no latency between release and acquisition, and no overhead from entering the kernel for a system call.

If the process has not acquired the lock after a certain number of spins, it defers to other processes by calling sginap(). When the lock is released, the process resumes execution.

The recommended locks for pthreads are pthread mutexes. A mutex can be configured in several ways; an important decision is your choice of priority protection protocol, which determines how the lock deals with priority inversion. The protocols are as follows:

  • Priority inheritance (standard) temporarily boosts the priority of the lock holder to the priority of the highest-priority thread that is waiting for the lock.

  • Priority ceiling (standard) temporarily boosts the priority of any lock holder to the priority of the highest-priority pthread that might take the lock. This may give more deterministic performance in some situations.

  • Nonpreemptive (specific to IRIX) gives much lower locking overhead than the standard protocols above. While a pthread holds a mutex that uses this protocol, other pthreads cannot preempt it. This avoids some cases of priority inversion: in the textbook situation where a medium-priority thread keeps a low-priority lock holder from making progress on a mutex that a high-priority thread wants, this protocol prevents the medium-priority thread from running while the low-priority thread holds the lock.

For more information on locks, refer to Topics in IRIX Programming (see “Related Publications and Sites”), and to the pthread_mutex_lock(3P), pthread_mutexattr_setprotocol(3P), usnewlock(3), ussetlock(3), and usunsetlock(3) man pages.

Mutual Exclusion Primitives

IRIX supports library functions that perform atomic (uninterruptible) sample-and-set operations on words of memory. For example, test_and_set() copies the value of a word and stores a new value into the word in a single operation, while test_then_add() samples a word and then replaces it with the sum of the sampled value and a new value.

These primitive operations can be used as the basis of mutual-exclusion protocols using words of shared memory. For details, see the test_and_set(3p) man page.

The test_and_set() and related functions are based on the MIPS R4000 instructions Load Linked and Store Conditional. Load Linked retrieves a word from memory and tags the processor data cache “line” from which it comes. The following Store Conditional tests the cache line. If any other processor or device has modified that cache line since the Load Linked was executed, the store is not done. The implementation of test_then_add() is comparable to the following assembly-language loop:

1:
    ll    retreg, offset(targreg)   # load linked: sample the target word
    add   tmpreg, retreg, valreg    # add the new value to the sample
    sc    tmpreg, offset(targreg)   # store conditional: fails if modified
    beq   tmpreg, 0, 1b             # store failed; branch back and retry

The loop continues trying to load, augment, and store the target word until it succeeds. Then it returns the value retrieved. For more details on the R4000 machine language, see one of the books listed in “Related Publications and Sites”.

The Load Linked and Store Conditional instructions operate only on memory locations that can be cached. Uncached pages (for example, pages implemented as reflective shared memory, see “Reflective Shared Memory”) cannot be set by the test_and_set() function.

Signals

A signal is a notification of an event, sent asynchronously to a process. Some signals originate from the kernel: for example, the SIGFPE signal that notifies of an arithmetic overflow; or SIGALRM that notifies of the expiration of a timer interval (for the complete list, see the signal(5) man page). The Frame Scheduler issues signals to notify your program of errors or termination. Other signals can originate within your own program.

Signal Latency

The time that elapses from the moment a signal is generated until your signal handler begins to execute is known as signal latency. Signal latency can be long (as real-time programs measure time) and highly variable. (Some of the factors are discussed under “Signal Delivery and Latency” in Chapter 4.) In general, use signals only to deliver infrequent messages of high priority. Do not use the exchange of signals as the basis for scheduling in a real-time program.


Note: Signals are delivered at particular times when using the Frame Scheduler. See “Using Signals Under the Frame Scheduler” in Chapter 4.


Signal Families

In order to receive a signal, a process must establish a signal handler, a function that is entered when the signal arrives.

There are three UNIX traditions for signals, and IRIX supports all three. They differ in the library calls used, in the range of signals allowed, and in the details of signal delivery (see Table 2-1). Real-time programs should use the POSIX interface for signals.

Table 2-1. Signal Handling Interfaces

Function                             SVR4-compatible Calls     BSD 4.2 Calls               POSIX Calls
set and query signal handler         sigset(2), signal(2)      sigvec(3), signal(3)        sigaction(2), sigsetops(3), sigaltstack(2)
send a signal                        sigsend(2), kill(2)       kill(3), killpg(3)          sigqueue(2)
temporarily block specified signals  sighold(2), sigrelse(2)   sigblock(3), sigsetmask(3)  sigprocmask(2)
query pending signals                -                         -                           sigpending(2)
wait for a signal                    sigpause(2)               sigpause(3)                 sigsuspend(2), sigwait(2), sigwaitinfo(2), sigtimedwait(2)

The POSIX interface supports the following 64 signal types:

1-31     Same as BSD

32       Reserved by the IRIX kernel

33-48    Reserved by the POSIX standard for system use

49-64    Reserved by POSIX for real-time programming

Signals with smaller numbers have priority for delivery. The low-numbered BSD-compatible signals, which include all kernel-produced signals, are delivered ahead of real-time signals; and signal 49 takes precedence over signal 64. (The BSD-compatible interface supports only signals 1-31. This set includes two user-defined signals.)

IRIX supports POSIX signal handling as specified in IEEE 1003.1b-1993. This includes FIFO queueing of new signals while a signal type is held, up to a system maximum of queued signals. (The maximum can be adjusted using systune; see the systune(1) man page.)

For more information on the POSIX interface to signal handling, refer to Topics in IRIX Programming and to the signal(5), sigaction(2), and sigqueue(2) man pages.

Timers and Clocks

Real-time programs often need a source of timer interrupts, and some need a way to create high-precision timestamps. IRIX provides both. IRIX supports the POSIX clock and timer facilities as specified in IEEE 1003.1b-1993, as well as the BSD itimer facility. The timer facilities are covered in Topics in IRIX Programming.

Hardware Cycle Counter

The hardware cycle counter is a high-precision hardware counter that is updated continuously. The precision of the cycle counter depends on the system in use, but on most systems it is a 64-bit counter.

You sample the cycle counter by calling the POSIX function clock_gettime() specifying the CLOCK_SGI_CYCLE clock type.

The frequency with which the cycle counter is incremented also depends on the hardware system. You can obtain the resolution of the clock by calling the POSIX function clock_getres().


Note: The cycle counter is synchronized only to the CPU crystal and is not intended as a perfect time standard. If you use it to measure intervals between events, be aware that it can drift by as much as 100 microseconds per second, depending on the hardware system in use.


Interchassis Communication

SGI systems support three methods for connecting multiple computers:

  • Standard network interfaces let you send packets or streams of data over a local network or the Internet.

  • Reflective shared memory (provided by third-party manufacturers) lets you share segments of memory between computers, so that programs running on different chassis can access the same variables.

  • External interrupts let one Challenge, Onyx, or Origin system signal another.

Socket Programming

One standard, portable way to connect processes in different computers is to use the BSD-compatible socket I/O interface. You can use sockets to communicate within the same machine, between machines on a local area network, or between machines on different continents.

For more information about socket programming, refer to one of the networking books listed in “Related Publications and Sites”.

Message-Passing Interface (MPI)

The Message-Passing Interface (MPI) is a standard architecture and programming interface for designing distributed applications. SGI supports MPI in the Power Challenge Array product. For the MPI standard, see http://www.mcs.anl.gov/mpi.

The performance of both sockets and MPI depends on the speed of the underlying network. The network that connects nodes (systems) in an Array product has a very high bandwidth.

Reflective Shared Memory

Reflective shared memory consists of hardware that makes a segment of memory appear to be accessible from two or more computer chassis. The Challenge and Onyx implementation consists of VME bus devices in each computer, connected by a very high-speed, point-to-point network.

The VME bus address space of the memory card is mapped into process address space. Firmware on the card handles communication across the network, so as to keep the memory contents of all connected cards consistent. Reflective shared memory is slower than real main memory but faster than socket I/O. Its performance is essentially that of programmed I/O to the VME bus, which is discussed under “PIO Access” in Chapter 6.

Reflective shared memory systems are available for SGI equipment from several third-party vendors. The details of the software interface differ with each vendor. However, in most cases you use mmap() to map the shared segment into your process's address space (see Topics in IRIX Programming as well as the usrvme(7) man page).

External Interrupts

The Origin, Challenge, and Onyx systems support external interrupt lines for both incoming and outgoing external interrupts. Software support for these lines is described in the IRIX Device Driver Programmer's Guide and the ei(7) man page. You can use the external interrupt as the time base for the Frame Scheduler. In that case, the Frame Scheduler manages the external interrupts for you. (See “Selecting a Time Base” in Chapter 4.)