A kernel-level device driver consists of a module of subroutines that supply services to the kernel. The subroutines are public entry points in the driver. When events occur, the kernel calls these entry points. The driver takes action and returns a result code.
This chapter discusses when the driver entry points are called, what parameters they receive, and what actions they are expected to take. For a conceptual overview of the kernel and drivers, see “Kernel-Level Device Control”. For details on how a driver is compiled, linked, and added to IRIX, see Chapter 10, “Building and Installing a Driver.”
Note: This chapter discusses device drivers. The entry point conventions for STREAMS drivers are covered in Chapter 16, “STREAMS Drivers.” Additional entry points supported only for PCI drivers are covered in Chapter 15, “PCI Device Drivers.”
The primary topics covered in this chapter are:
“Summary of Driver Structure” summarizes the entry points and how they are made known to the kernel.
“Driver Flag Constant” documents a public constant the driver must supply.
“Initialization Entry Points” documents the entry points that are called at boot time and when a loadable driver is loaded.
“Open and Close Entry Points” documents the entry points called by the open() and close() kernel functions.
“Control Entry Point” documents the entry point called by the ioctl() kernel function.
“Data Transfer Entry Points” documents the entry points called by the read() and write() kernel functions.
“Poll Entry Point” documents the entry point called by the poll() kernel function.
“Memory Map Entry Points” documents the entry points called by the mmap() kernel function.
“Interrupt Entry Point” documents the entry point called to handle a device interrupt.
“Support Entry Points” documents the entry points that support kernel operations and system administration.
“Planning for Multiprocessor Use” points out methods for writing a multiprocessor-aware driver.
A driver consists of a binary object module in ELF format stored in the /var/sysgen/boot directory. As a program, the driver consists of a set of functional entry points that supply services to the IRIX kernel. There is a large set of entry points to cover different situations, but no single driver supports all possible entry points.
The entry points that a driver supports must be named according to a specified convention. The lboot command uses entry point names to build tables used by the kernel.
The device driver makes known which entry points it supports by giving them public names in its object module. The lboot command links together the object modules of drivers and other kernel modules to make a bootable kernel. lboot recognizes the entry points by the form of their names.
A device driver must be described by a file in the /var/sysgen/master.d directory (see “Master Configuration Database”). One of the items in that configuration file specifies the driver prefix, a string of 1 to 14 characters that is unique to that driver. For example, the prefix of the SCSI driver is scsi_.
The prefix string is defined in the /var/sysgen/master.d file only. The string does not have to appear as a constant in the driver, and the name of the driver object file does not have to correspond to the prefix (although the object module typically has a related name).
The lboot command recognizes driver entry points by searching the driver object module for public names that begin with the prefix string. For example, the entry point for the open() operation must have a name that consists of the prefix string followed by the letters “open.”
In this book, entry point names are written as follows: pfxopen, where pfx stands for the driver's prefix string.
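The naming rule can be illustrated with a short user-space sketch. This is not lboot's code; the function name entry_point_suffix and the prefix hypo_ are hypothetical, chosen only to show how a public name splits into a prefix and an entry point name.

```c
#include <stddef.h>
#include <string.h>

/*
 * Sketch only: a user-space imitation of prefix matching, not lboot itself.
 * Returns the entry point part of a public symbol (for example "open")
 * when the symbol begins with the driver prefix, or NULL when it does not.
 */
const char *entry_point_suffix(const char *prefix, const char *symbol)
{
    size_t n = strlen(prefix);

    if (n == 0 || strncmp(symbol, prefix, n) != 0)
        return NULL;            /* symbol does not carry this prefix   */
    if (symbol[n] == '\0')
        return NULL;            /* the prefix alone names no entry     */
    return symbol + n;          /* "hypo_" + "open" yields "open"      */
}
```

With the hypothetical prefix hypo_, the symbol hypo_open yields "open", while scsi_open yields NULL because it belongs to a different driver.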
The IRIX kernel maintains tables that allow it to dispatch calls to device drivers quickly. These tables are built by lboot based on the device major numbers and the names of the driver entry points. The tables are named as follows:
bdevsw | Table of block device drivers |
cdevsw | Table of character device drivers |
fmodsw | Table of STREAMS drivers |
vfssw | Table of filesystem modules (not related to device drivers) |
The tables for block and character drivers have one row for each major device number, and one column for each possible driver entry point. As lboot loads a driver, it fills in that driver's row of a switch table with the addresses of the driver's entry points. Where an entry point is not defined, lboot leaves the address of a null routine that returns the ENODEV error code.
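The idea of a switch-table row whose empty slots point at a null routine can be sketched in plain C. The structure below is a much-reduced, hypothetical stand-in for a real switch table row; the names sketch_cdevsw, sketch_nodev, sketch_fill_row, and the hypo_ functions are assumptions made for illustration.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical entry point type; real entry points take arguments. */
typedef int (*entry_fn)(void);

/* Hypothetical, much-reduced row of a character switch table. */
struct sketch_cdevsw {
    entry_fn d_open;
    entry_fn d_close;
    entry_fn d_read;
    entry_fn d_write;
};

/* Null routine installed for every entry point the driver lacks. */
static int sketch_nodev(void)
{
    return ENODEV;
}

/* Replace each undefined (NULL) slot with the null routine. */
void sketch_fill_row(struct sketch_cdevsw *row)
{
    if (row->d_open == NULL)  row->d_open  = sketch_nodev;
    if (row->d_close == NULL) row->d_close = sketch_nodev;
    if (row->d_read == NULL)  row->d_read  = sketch_nodev;
    if (row->d_write == NULL) row->d_write = sketch_nodev;
}

/* A hypothetical driver that defines only open and close. */
static int hypo_open(void)  { return 0; }
static int hypo_close(void) { return 0; }

/* Build the row for the hypothetical driver above. */
struct sketch_cdevsw sketch_example_row(void)
{
    struct sketch_cdevsw row = { hypo_open, hypo_close, NULL, NULL };

    sketch_fill_row(&row);
    return row;
}
```

Because every slot holds a valid function address after filling, a caller can dispatch through any column without testing for NULL; a call through an undefined slot simply reports ENODEV.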
The sizes of the switch tables are fixed at boot time in order to minimize kernel data space. The table sizes are tunable parameters that can be set with systune (see the systune(1) reference page).
When a driver is loaded dynamically (see “Configuring a Loadable Driver”), the associated row of the switch table is not filled at link time but rather is filled when the driver is loaded. When you add new, loadable drivers, you might need to specify a larger switch table. The IRIX Administration: System Configuration and Operation book documents these tunable parameters.
The names of all possible driver entry points and their purposes are summarized in Table 8-1. STREAMS drivers are covered in Chapter 16.
Table 8-1. Entry Points in Alphabetic Order
Entry Point | Purpose | Discussion | Reference Page |
---|---|---|---|
pfx attach | PCI device attach entry point. | | |
pfx close | Note the device is not in use. | close(D2) | |
pfx devflag | Constant flag bits for driver features. | devflag(D1) | |
pfx detach | PCI device detach entry point. | | |
pfx edtinit | Initialize driver from VECTOR statement. | edtinit(D2) | |
pfx halt | Prepare for system shutdown. | halt(D2) | |
pfx init | Initialize driver at load or boot time. | init(D2) | |
pfx intr | Handle device interrupt (not used). | intr(D2) | |
pfx ioctl | Implement control operations. | ioctl(D2) | |
pfx map | Implement memory-mapping (IRIX). | map(D2) | |
pfx mmap | Implement memory-mapping (SVR4). | mmap(D2) | |
pfx open | Connect a process to a device. Connect a stream module. | open(D2) | |
pfx poll | Implement device event test. | poll(D2) | |
pfx print | Display diagnostic about block device. | print(D2) | |
pfx read | Implement device input. | read(D2) | |
pfx rput | STREAMS message on read queue. | put(D2) | |
pfx size | Return logical size of block device. | size(D2) | |
pfx srv | STREAMS service queued messages. | srv(D2) | |
pfx start | Initialize driver at load or boot time. | start(D2) | |
pfx strategy | Input/output for a block device. | strategy(D2) | |
pfx unload | Prepare loadable module for unloading. | unload(D2) | |
pfx unmap | Note the end of a memory mapping. | unmap(D2) | |
pfx wput | STREAMS message on write queue. | put(D2) | |
pfx write | Implement device output. | write(D2) |
The use of entry points in different types of drivers is summarized in Table 8-2. The columns of Table 8-2 show the different types of drivers. The table cells show whether a given entry point is optional (O), required (R), or not allowed (N).
Table 8-2. Use of Driver Entry Points
Entry Point | Character | Block | Pseudo | STREAMS |
---|---|---|---|---|
pfx attach | R (PCI) | R (PCI) | N | N |
pfx close | R | R | R | R |
pfx detach | R (PCI) | R (PCI) | N | N |
pfx devflag | O | O | O | O |
pfx edtinit | O | O | N | N |
pfx halt | O | O | O | O |
pfx init | O | O | O | O |
pfx intr | O | O | N | N |
pfx ioctl | O | O | O | N |
pfx map | O | O | O | N |
pfx mmap | O | O | O | N |
pfx open | R | R | R | R |
pfx poll | O | O | N | O |
pfx print | N | O | N | N |
pfx read | O | N | O | N |
pfx rput | N | N | N | R |
pfx size | N | R | N | N |
pfx srv | N | N | N | R |
pfx start | O | O | O | O |
pfx strategy | N | R | N | N |
pfx unload | O | O | O | O |
pfx unmap | O | O | O | N |
pfx wput | N | N | N | R |
pfx write | O | N | O | N |
As can be seen from Table 8-2, no driver supports all entry points.
A minimal driver for a character device supports pfxinit(), pfxopen(), pfxread(), pfxwrite(), and pfxclose(). The pfxioctl() and pfxpoll() entry points are optional. (The pfxattach() and pfxdetach() entry points are also required for a PCI device.)
A minimal pseudo-device driver supports pfxstart(), pfxopen(), pfxmap(), pfxunmap(), and pfxclose() (the latter two as stubs).
A minimal block device driver supports pfxedtinit(), pfxopen(), pfxsize(), pfxstrategy(), and pfxclose(). (The pfxattach() and pfxdetach() entry points are also required for a PCI device.)
Any device driver or STREAMS module should define a public name pfxdevflag as a static integer. This integer contains a bitmask with zero or more of the following flags, which are declared in sys/conf.h:
D_MP | The driver is prepared for multiprocessor systems. |
D_WBACK | The driver handles its own cache-writeback operations. |
D_MT | The driver is prepared for a multithreaded kernel. |
D_OLD | The driver implements IRIX 4.x semantics. |
The flag names are declared in the header file sys/conf.h. A typical definition would resemble the following:
int testdrive_devflag = D_MP;
A STREAMS module should also provide this flag, but the only relevant bit value for a STREAMS driver is D_MP (see “Driver Flag Constant”).
The flag value is saved in the kernel switch table with the driver's entry points (see “Kernel Switch Tables”).
When a driver does not define a pfxdevflag, lboot saves a word containing D_OLD by default. See “Flag D_OLD”.
You specify D_MP in pfxdevflag to tell lboot that your driver is designed to operate in a multiprocessor system. The top half of the driver is designed to cope with multiple concurrent entries in multiple CPUs. The top and bottom halves synchronize through the use of semaphores or locks and do not rely on interrupt masking for critical sections. These issues are discussed further under “Planning for Multiprocessor Use”.
When D_MP is not present in pfxdevflag, IRIX ensures that the driver code, including the upper-half entry points and the interrupt handler, executes only on CPU 0 of a multiprocessor. This ensures behavior similar to a uniprocessor, but can cause a performance bottleneck when either the device or CPU 0 is heavily used.
You specify D_WBACK in pfxdevflag to tell lboot that a block driver performs any necessary cache write-back operations through explicit calls to dki_dcache_wb() and related functions (see the dki_dcache_wb(D3) reference page).
When D_WBACK is not present in pfxdevflag, the physiock() function ensures that all cached data related to buf_t structures is written back to main memory before it enters the driver's strategy routine. (See the physiock(D3) reference page and “Entry Point strategy()”.)
The D_MT flag is defined in IRIX 6.2 but has no effect in that release. The next major release of IRIX will run driver interrupt routines as threads of control within the kernel address space. D_MT indicates that this driver understands that it can be run as one or more cooperating threads, and uses kernel synchronization primitives to serialize access to driver common data structures.
The D_OLD flag exists only to retain compatibility with certain drivers written originally for IRIX 4.x. It changes two features of the kernel-to-driver interface:
The first argument to the pfxopen() entry is a dev_t value instead of the pointer-to-dev_t that is now standard.
The driver sets its return code by storing it into a global, u.u_error, instead of returning it as the result of the function call.
D_OLD is incompatible with D_MP.
When a driver has no pfxdevflag constant, lboot assumes it is a D_OLD driver.
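The flag rules above can be summarized in a small sketch. The bit values shown are placeholders invented for this example (the real values come from the system header), and the helper names sketch_effective_flags and sketch_flags_consistent are hypothetical.

```c
#include <stddef.h>

/* Placeholder bit values for illustration only; not the real header values. */
#define D_MP    0x1
#define D_WBACK 0x2
#define D_MT    0x4
#define D_OLD   0x8

/* What a block driver with the hypothetical prefix hypo_ might declare. */
int hypo_devflag = D_MP | D_WBACK;

/*
 * Imitates the default described above: a driver with no devflag
 * constant (modeled here as a NULL pointer) is treated as D_OLD.
 */
int sketch_effective_flags(const int *devflag)
{
    return (devflag == NULL) ? D_OLD : *devflag;
}

/* D_OLD is incompatible with D_MP; returns 1 when the mask is acceptable. */
int sketch_flags_consistent(int flags)
{
    return !((flags & D_OLD) && (flags & D_MP));
}
```

The practical consequence is that a multiprocessor-aware driver must always define its devflag constant explicitly, because the default carries D_OLD and so cannot be combined with D_MP.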
The kernel calls a driver to initialize itself at any of three different entry points, as follows:
pfx init | Initialize self-defining hardware or a pseudo-device. |
pfx edtinit | Initialize a hardware device based on VECTOR data. |
pfx start | General initialization. |
Each call has different abilities. A driver may define any combination of the three entry points. It is not uncommon to define both a pfxstart() and one of pfxedtinit() or pfxinit().
The initialization entry points of ordinary (nonloadable) drivers are called during system startup, after interrupts have been enabled and before the message “The system is coming up” is displayed. In all cases, interrupts are enabled and basic kernel services are available at this time. However, other loadable or optional kernel modules might not have been initialized, depending on the sequence of statements in the files in /var/sysgen/system.
Whenever a driver is initialized, the entry points are called in the following sequence:
pfxinit() is called first.
pfxedtinit() is called once for each VECTOR statement in reverse order of the VECTOR statements found in /var/sysgen/system files.
pfxstart() is called last.
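The calling sequence can be modeled in a few lines of user-space C. This sketch does not call real entry points; it only records the order in which the kernel would call them, and every name in it (sketch_log, sketch_initialize_driver) is hypothetical.

```c
#include <stddef.h>

#define SKETCH_MAX_CALLS 16

/* Log of simulated calls: 'i' = init, 'e' = edtinit, 's' = start. */
struct sketch_log {
    char kind[SKETCH_MAX_CALLS];
    int  vector[SKETCH_MAX_CALLS];  /* VECTOR index for 'e', otherwise -1 */
    int  count;
};

/* Append one call to the log, ignoring calls beyond the log capacity. */
static void sketch_record(struct sketch_log *log, char kind, int vector)
{
    if (log->count < SKETCH_MAX_CALLS) {
        log->kind[log->count] = kind;
        log->vector[log->count] = vector;
        log->count++;
    }
}

/*
 * Record the initialization sequence for a driver named in nvectors
 * VECTOR statements, numbered 0..nvectors-1 in the order they are coded.
 */
void sketch_initialize_driver(struct sketch_log *log, int nvectors)
{
    int v;

    log->count = 0;
    sketch_record(log, 'i', -1);            /* pfxinit() is called first   */
    for (v = nvectors - 1; v >= 0; v--)     /* pfxedtinit(), reverse order */
        sketch_record(log, 'e', v);
    sketch_record(log, 's', -1);            /* pfxstart() is called last   */
}
```

For three VECTOR statements the recorded order is init, edtinit for statement 2, 1, then 0, and finally start.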
A loadable driver (see “Loadable Drivers”) is initialized any time it is loaded. This can occur more than once, if the driver is loaded, unloaded, and reloaded. When a loadable driver is configured for autoregister, it is loaded with other drivers during system startup. (For more information on autoregister, see “Configuring a Loadable Driver”.) Such a driver is initialized at system startup time along with the nonloadable drivers.
The pfxinit() entry point is called once during system startup or when a loadable driver is loaded. It receives no input arguments; its prototype is simply:
void pfxinit(void);
You can use this entry point to initialize a hardware device that is self-defining; that is, all the information the driver needs is either coded into the driver or can be obtained by probing the device itself. You can also use pfxinit() to initialize a pseudo-device driver; that is, a driver that does not have real hardware attached.
A driver that is brought into the system by a USE or INCLUDE line in a system configuration file (see “Configuring a Kernel”) typically initializes in the pfxinit() entry point.
The pfxedtinit() entry is designed to initialize devices that are configured using the VECTOR statement in the system configuration file (see “System Configuration Files”). The entry point name is a contraction of “early device table initialization.”
The VECTOR statement specifies hardware details about a device on the VME, GIO, or EISA bus (on systems that have one of those buses), including iospace addresses, interrupt level, and an integer parameter. The VECTOR statement can specify a “probe” parameter that lets the kernel test for the existence of the specified hardware.
When the kernel processes a VECTOR statement during bootstrap and the probe is successful (or no probe is specified), the kernel stores the VECTOR parameters in a structure of type edt_t. (This structure is declared in sys/edt.h.)
Each time the kernel loads a driver that is named in a VECTOR statement, the kernel calls the driver's pfxedtinit() entry one time for each VECTOR statement that named that driver and had a successful probe (or that had no probe). VECTOR statements are processed in reverse sequence to the order in which they are coded in /var/sysgen/system files.
The prototype of the pfxedtinit() entry is
void pfxedtinit(edt_t *e);
The edt_t contains at least the following fields (see the system(4) reference page for the corresponding VECTOR parameters):
e_bus_type | Integer specifying the bus type; constant values are declared in sys/edt.h, for example ADAP_VME, ADAP_GIO, or ADAP_EISA. |
e_adap | Integer specifying the adapter (bus) number. |
e_ctlr | Value from the VECTOR ctlr= parameter; typically the device minor number. |
e_space | Array of up to three I/O space structures of type iospace_t. |
The difference between pfxinit() and pfxedtinit() is that pfxedtinit() is parameterized with information from the VECTOR line, and is called once for each VECTOR line that is associated with real hardware.
A driver that uses pfxedtinit() needs to save the edt_t information in a data structure. If the driver supports multiple devices—that is, if it can be called for multiple VECTOR statements—it needs to allocate an array or chain of structures, and save new data on each entry.
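One way to save the per-device information is a fixed array indexed by controller number, sketched below. The type sketch_edt_t is a stand-in that carries only two of the fields listed above, not the real edt_t, and the sketch assumes the driver's own convention that the ctlr= value doubles as the minor device number. The limit HYPO_MAX_UNITS and the function names are hypothetical.

```c
#include <stddef.h>

#define HYPO_MAX_UNITS 4        /* hypothetical limit on supported devices */

/* Stand-in for edt_t holding only two of its fields. */
typedef struct sketch_edt {
    int e_adap;                 /* adapter (bus) number          */
    int e_ctlr;                 /* value of the ctlr= parameter  */
} sketch_edt_t;

/* Per-device record kept by the driver. */
struct hypo_unit {
    int          present;       /* nonzero once edtinit has seen this unit */
    sketch_edt_t edt;           /* private copy of the VECTOR data         */
};

static struct hypo_unit hypo_units[HYPO_MAX_UNITS];

/* Called once per VECTOR statement: copy the data into the unit table. */
void hypo_edtinit(const sketch_edt_t *e)
{
    if (e == NULL || e->e_ctlr < 0 || e->e_ctlr >= HYPO_MAX_UNITS)
        return;                 /* out of range: ignore this statement */
    hypo_units[e->e_ctlr].edt = *e;
    hypo_units[e->e_ctlr].present = 1;
}

/* Later entry points locate a unit from the minor device number. */
const struct hypo_unit *hypo_find_unit(int minor)
{
    if (minor < 0 || minor >= HYPO_MAX_UNITS || !hypo_units[minor].present)
        return NULL;
    return &hypo_units[minor];
}
```

Copying the structure, rather than keeping the pointer passed in, is a deliberately cautious choice in this sketch: it makes no assumption about how long the kernel's copy remains valid.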
The pfxstart() entry point is called at system startup, and whenever a loadable driver is loaded. It is called after pfxedtinit() and pfxinit(), but before any other entry point such as pfxopen(). The pfxstart() entry point receives no arguments; its prototype is simply
void pfxstart(void);
The pfxstart() entry point is a suitable place to allocate a poll-head structure using phalloc(), as discussed in “Use and Operation of poll(2)”.
The pfxopen() and pfxclose() entries for block and character devices are called when a device comes into use and when use of it is finished. For a conceptual overview of the open() process, see “Overview of Device Open”.
The kernel calls a device driver's pfxopen() entry when a process executes the open() system call on any device special file (see the open(2) reference page). It is also called when a process executes the mount() system call on a block device (see the mount(2) reference page). (For the pfxopen() entry point of a STREAMS driver, see “Entry Point open()”.)
The prototype of pfxopen() is as follows:
int pfxopen(dev_t *devp, int oflag, int otyp, cred_t *crp);
The argument values are
*devp | Pointer to a dev_t value from which you can extract both the major and minor device numbers. |
oflag | Flag bits specifying user mode options on the open() call. |
otyp | An integer flag specifying the source of the call: a user process opening a character device or block device, or another driver. |
crp | A cred_t object—an opaque structure for use in authentication. Standard access privileges to the special device file have already been verified. |
Note: When the driver's pfxdevflag entry contains D_OLD or when pfxdevflag is not defined, the first argument to pfxopen() is a dev_t value, not a pointer to a dev_t value. See “Flag D_OLD”.
The open(D2) reference page discusses the kind of work the pfxopen() entry point can do. In general, the driver is expected to verify that this user process is permitted access in the way specified in oflag (reading, writing, or both) for the device specified in *devp. If access is not allowable, the driver returns a nonzero error code from sys/errno.h, for example ENOMEM or EBUSY.
When the driver supports a single device with no logical unit divisions, the device number is of little interest except for diagnostic displays. When the driver supports multiple devices, or a device with multiple logical units, the minor device number is the key to locating the device information. The device number can also encode device options, as discussed under “Minor Device Number”.
When the driver supports the pfxedtinit() entry, the driver needs a way to associate the different edt_t structures passed to pfxedtinit() with the device numbers passed to pfxopen() and other routines. One solution is to require that the ctlr= value from the VECTOR statement—which is passed in the e_ctlr field of edt_t—must be the same as the device minor number.
The otyp flag distinguishes between the following possible sources of this call to pfxopen() (the constants are defined in sys/open.h).
a call to open a character device (OTYP_CHR)
a call to open a block device (OTYP_BLK)
a call to mount a block device as a filesystem (OTYP_MNT)
a call to open a block device as a swapping device (OTYP_SWP)
a call direct from a device driver at a higher level (OTYP_LYR)
Typically a driver is written only to be a character driver or a block driver, and can be called only through the switch table for that type of device. When this is the case, the otyp value has little use.
It is possible to have the same driver treated as both block and character, in which case the driver needs to know whether the open() call addressed a block or character special device. It is possible for a block device to support different partitions with different uses, in which case the driver might need to record the fact that a device has been mounted, or opened as a swap device.
With all open types except OTYP_LYR, pfxopen() is called for every open or mount operation, but pfxclose() is called only when the last close or unmount occurs. The OTYP_LYR feature is used almost exclusively by drivers distributed with IRIX, like the host adapter SCSI driver (see “Host Adapter Concepts”). For each open of this type, there is one call to pfxclose().
The interpretation of the open mode flags is up to the designer of the driver. Four modes can be requested (declared in sys/file.h):
FREAD | Input access wanted. |
FWRITE | Output access wanted (both FREAD and FWRITE may be set, corresponding to O_RDWR mode). |
FNDELAY or FNONBLOCK | Return at once, do not sleep if the open cannot be done immediately. |
FEXCL | Request exclusive use of the device. |
You decide which of the flags have meaning with respect to the abilities of this device. You can return an EINVAL error when an unsupported mode is requested.
A key decision is whether the device can be opened only by one process at a time, or by multiple processes. If multiple opens are supported, a process can still request exclusive access with the FEXCL mode.
When the device can be used by only one process, or when FEXCL access is supported, the driver must keep track of the fact that the device is open. When the device is busy, the driver can test the FNDELAY and FNONBLOCK flags; if either is set, it can return EBUSY. Otherwise, the driver should sleep until the device is free; this requires coordination with the pfxclose() entry point.
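The exclusive-access decision can be sketched as a pure function of the driver's bookkeeping and the open mode flags. The flag values below are placeholders rather than the values from the system header, SK_MUST_SLEEP is a hypothetical marker for the point where a real driver would sleep, and the hypo_ names are invented for this example.

```c
#include <errno.h>

/* Placeholder mode bits for illustration; not the real header values. */
#define SK_FREAD     0x01
#define SK_FWRITE    0x02
#define SK_FNDELAY   0x04
#define SK_FNONBLOCK 0x08
#define SK_FEXCL     0x10

/* Hypothetical marker: a real driver would sleep here and then retry. */
#define SK_MUST_SLEEP (-1)

/* Bookkeeping a driver keeps for one device. */
struct hypo_state {
    int open_count;     /* opens accepted since the last close */
    int exclusive;      /* nonzero while an FEXCL open is held */
};

/* Decide the outcome of an open request; returns 0, EBUSY, or the marker. */
int hypo_open_decide(struct hypo_state *st, int oflag)
{
    /* Busy if someone holds exclusive use, or if exclusive use is
     * requested while the device is already open. */
    int busy = st->exclusive || ((oflag & SK_FEXCL) && st->open_count > 0);

    if (busy) {
        if (oflag & (SK_FNDELAY | SK_FNONBLOCK))
            return EBUSY;           /* caller asked not to wait */
        return SK_MUST_SLEEP;       /* wait until close releases the device */
    }
    st->open_count++;
    if (oflag & SK_FEXCL)
        st->exclusive = 1;
    return 0;
}

/* Work done on the last close: clear the state so a waiter can proceed. */
void hypo_close_release(struct hypo_state *st)
{
    st->open_count = 0;
    st->exclusive = 0;
    /* A real driver would also wake any process sleeping in open. */
}
```

Resetting the count to zero in the close routine reflects the rule that the close entry point is called only once, on the last close, no matter how many opens were accepted.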
The cred_t object passed to pfxopen(), pfxclose(), and pfxioctl() can be used with the drv_priv() function to find out if the effective calling user ID is privileged or not (see the drv_priv(D3) reference page). Do not examine the object in detail, since its contents are subject to change from release to release.
In a block device driver, the pfxsize() entry point will be called soon after pfxopen() (see “Entry Point size()”). It is typically best to calculate or read the device capacity at open time, and save it to be reported from pfxsize().
If your driver is, or might be, compiled to the 64-bit model for use with a 64-bit IRIX kernel, and if it supports the pfxioctl() or pfxpoll() entry points, the driver should test and save the user process's programming model during an open. For details, see “Handling 32-Bit and 64-Bit Execution Models”.
The kernel calls the pfxclose() entry when the last process calls close() or umount() for the device special file. It is important to know that when the device can be opened by multiple processes, pfxclose() is not called for every close() function, but only when the last remaining process closes the device and no other processes have it open.
The function prototype and arguments of pfxclose() are
int pfxclose(dev_t dev, int flag, int otyp, cred_t *crp);
The arguments are the same as were passed to pfxopen(), except that the first is a dev_t value rather than a pointer to one. However, the flag argument is not necessarily the same as at any particular call to open().
It is up to you to design the meaning of “close” for this type of device. The close(D2) reference page discusses some of the actions the driver can do. Some considerations are:
If the device is opened and closed frequently, you may decide to retain dynamic data structures.
If the device can perform an action such as “rewind” or “eject,” you decide whether that action should be done upon close. Possibly the choice of acting or not acting can be set by an ioctl() call; or possibly the choice can be encoded into the device minor number—for example, the no-rewind-on-close option is encoded in certain tape minor device numbers.
If the pfxopen() entry point supports exclusive access, and it can be waiting for the device to be free, pfxclose() must release the wait.
The pfxclose() entry can detect an error and report it with a return code. However, the file is closed or unmounted regardless.
The pfxioctl() entry point is called by the kernel when a user process executes the ioctl() system call (see the ioctl(2) reference page). This entry point is allowed in character drivers only. Block device drivers do not support it, and STREAMS drivers pass control information as messages.
For an overview of the relationship between the user process, kernel, and the control entry point, see “Overview of Device Control”.
The prototype of the entry point is
int pfxioctl(dev_t dev, int cmd, void *arg, int mode, cred_t *crp, int *rvalp);
The argument values are
dev | A dev_t value from which you can extract the major and minor device numbers. |
cmd | The request value specified in the ioctl() call. |
arg | The optional argument value specified in the ioctl() call, or NULL if none was specified. |
mode | Flag bits specifying the open() mode, as associated with the file descriptor passed to the ioctl() system function. |
crp | A cred_t object—an opaque structure for use in authentication, describing the process that is in-context. Standard access privileges to the special device file have already been verified. |
*rvalp | The integer result to be returned to the user process. |
It is up to the device driver to interpret the cmd and arg values in the light of the mode and other arguments. When the arg value is a pointer to data in the process address space, the driver uses the copyin() kernel function to copy the data into kernel space, and the copyout() function to return updated values. (See the copyin(D3) and copyout(D3) reference pages, and also “Transferring Data”.)
The command numbers supported by pfxioctl() are arbitrary; but the recommended practice is to make sure that they are different from those of any other driver. One method to achieve this is suggested in the ioctl(D2) reference page.
The ioctl() entry point may need to interpret a structure prepared in the user process. In a 64-bit system, the user process can be either a 32-bit or a 64-bit program. For discussion of this issue, see “Handling 32-Bit and 64-Bit Execution Models”.
The kernel returns 0 to the ioctl() system function unless the pfxioctl() function returns an error code. In the event of an error, the kernel returns the code the driver places in *rvalp, if any, or -1. To ensure that the user process sees a specific error code, set the code in *rvalp, and return that value.
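The pattern of copying an argument structure in, acting on it, and copying a result out can be sketched in user space. Everything here is an assumption made for illustration: the command numbers, the hypo_speed structure, the reduced argument list (dev, mode, and crp are omitted), and the sk_copyin() and sk_copyout() helpers, which merely imitate the kernel functions with memcpy().

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical command numbers and argument structure. */
#define HYPO_GET_SPEED 1
#define HYPO_SET_SPEED 2

struct hypo_speed {
    int baud;
};

static int hypo_current_baud = 9600;    /* simulated device setting */

/* User-space stand-ins for copyin()/copyout(): 0 on success, -1 on failure. */
static int sk_copyin(const void *user, void *kern, size_t n)
{
    if (user == NULL)
        return -1;
    memcpy(kern, user, n);
    return 0;
}

static int sk_copyout(const void *kern, void *user, size_t n)
{
    if (user == NULL)
        return -1;
    memcpy(user, kern, n);
    return 0;
}

/* Reduced control entry point: interpret cmd, move data through arg. */
int hypo_ioctl(int cmd, void *arg, int *rvalp)
{
    struct hypo_speed sp;

    switch (cmd) {
    case HYPO_SET_SPEED:
        if (sk_copyin(arg, &sp, sizeof sp) != 0)
            return EFAULT;              /* bad user address       */
        if (sp.baud <= 0)
            return EINVAL;              /* reject nonsense values */
        hypo_current_baud = sp.baud;
        *rvalp = 0;
        return 0;
    case HYPO_GET_SPEED:
        sp.baud = hypo_current_baud;
        if (sk_copyout(&sp, arg, sizeof sp) != 0)
            return EFAULT;
        *rvalp = 0;
        return 0;
    default:
        return EINVAL;                  /* unknown command */
    }
}
```

The driver never dereferences arg directly; it works on a private copy, which is the habit the real copyin() and copyout() functions are meant to enforce.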
The pfxread() and pfxwrite() entry points are supported by character device drivers and pseudo-device drivers that allow reading and writing. They are called by the kernel when the user process calls the read(), readv(), write(), or writev() system function.
The pfxstrategy() entry point is required of block device drivers. It is called by the kernel when either a filesystem or the paging subsystem needs to transfer a block of data.
The pfxread() and pfxwrite() entry points are similar to each other—only the direction of data transfer differs. The prototypes of the functions are
int pfxread (dev_t dev, uio_t *uiop, cred_t *crp);
int pfxwrite(dev_t dev, uio_t *uiop, cred_t *crp);
The arguments are
dev | A dev_t value from which you can extract both the major and minor device numbers. |
*uiop | A uio_t object—a structure that defines the user's buffer memory areas. |
crp | A cred_t object—an opaque structure for use in authentication. Standard access privileges to the special device file have already been verified. |
A character device driver using PIO transfers data in the following steps:
If there is a possibility of a timeout, start a timeout delay (see “Waiting for Time to Pass”).
Initiate the device operation as required.
Transfer data between the device and the buffer represented by the uio_t (see “Transferring Data Through a uio_t Object”).
If it is necessary to wait for an interrupt, put the process to sleep (see “Waiting and Mutual Exclusion”).
When data transfer is complete, or when an error occurs, clear any pending timeout and return the final status of the operation. If the return code is 0, the final state of the uio_t determines the byte count returned by the read() or write() call.
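The steps above can be sketched as a user-space simulation of a read. The structures sk_uio and sk_device are simplified stand-ins invented for this example (the real uio_t has more fields), the timeout is modeled as a simple flag, and the sleep in step 4 is only marked by a comment because the simulated device always has its data ready.

```c
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for a uio_t describing one user buffer. */
struct sk_uio {
    char  *base;            /* next free byte of the user's buffer */
    size_t resid;           /* bytes still wanted by the caller    */
};

/* Simulated PIO device with data waiting in a FIFO. */
struct sk_device {
    const char *fifo;       /* next unread byte in the device  */
    size_t      avail;      /* bytes the device has ready      */
    int         timeout_armed;
};

/* Stand-in for a uio transfer: copy up to n bytes and update the uio. */
static size_t sk_move_to_user(const char *src, size_t n, struct sk_uio *uio)
{
    if (n > uio->resid)
        n = uio->resid;     /* never overrun the user's buffer */
    memcpy(uio->base, src, n);
    uio->base  += n;
    uio->resid -= n;
    return n;
}

/* Reduced read entry point following the five steps. */
int hypo_read(struct sk_device *dev, struct sk_uio *uio)
{
    size_t moved;

    dev->timeout_armed = 1;                 /* 1: start a timeout          */
                                            /* 2: initiate (nothing to do
                                             *    for the simulated FIFO)  */
    moved = sk_move_to_user(dev->fifo, dev->avail, uio);    /* 3: transfer */
    dev->fifo  += moved;
    dev->avail -= moved;
                                            /* 4: a real driver might sleep
                                             *    here awaiting interrupt  */
    dev->timeout_armed = 0;                 /* 5: clear timeout, report    */
    return 0;
}
```

After a successful return, the caller's byte count follows from how far resid has dropped, which mirrors the rule that the final state of the uio_t determines the count returned by read().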
A device driver that supports both character and block interfaces must have a pfxstrategy() routine in which it performs the actual I/O. For example, the Silicon Graphics disk drivers support both character and block driver interfaces, and perform all I/O operations in the pfxstrategy() function. However, the pfxread(), pfxwrite() and pfxioctl() entries supported for character-type access also need to perform I/O operations. They do this by calling the pfxstrategy() routine indirectly, using the kernel function physiock() or uiophysio() (see the physiock(D3) and uiophysio(D3) reference pages, and see “Waiting for Block I/O to Complete”).
Both the physiock() and uiophysio() functions take care of the housekeeping needed to interface to the pfxstrategy() entry, including the work of allocating a buffer and a buf_t structure, locking buffer pages in memory, and waiting for I/O completion. Both routines require the uio_t to describe only a single segment of data (uio_iovcnt of 1). Although they are very similar, the two functions differ in the following ways:
physiock() returns EINVAL if the initial offset is not a multiple of 512 bytes. If this is a requirement of your pfxstrategy() routine, use physiock(); if not, use uiophysio().
physiock() is compatible with SVR4, while uiophysio() is unique to IRIX.
Example 8-1 shows the skeleton of a hypothetical driver in which the pfxread() entry does its work through the pfxstrategy() entry.
int
hypo_read (dev_t dev, uio_t *uiop, cred_t *crp)
{
    /* ...validate the operation... */
    return physiock(hypo_strategy, /* our strategy entry */
                    0,             /* allocate temp buffer & buf_t */
                    dev,           /* dev_t arg for strategy */
                    B_READ,        /* direction flag for buf_t */
                    uiop);
}
The pfxwrite() entry would be identical except for passing B_WRITE instead of B_READ.
This dual-entry strategy is required only in a driver that supports both character and block access.
A block device driver does not directly support system calls by user processes. Instead, it provides services to a filesystem such as XFS, or to the memory paging subsystem of IRIX. These subsystems call the pfxstrategy() entry point to read and write data in whole blocks.
Calls to pfxstrategy() are not directly related in time to system functions called by a user process. For example, a filesystem may buffer many blocks of data in memory, so that the user process may execute dozens or hundreds of write() calls without causing an entry to the device driver. When the user function closes the file or calls fsync()—or when for unrelated reasons the filesystem needs to free some buffers—the filesystem calls pfxstrategy() to write numerous blocks of data.
In a driver that supports the character interface as well, the pfxstrategy() entry can be called indirectly from the pfxread(), pfxwrite() and pfxioctl() entries, as described under “Calling Entry Point strategy() From Entry Point read() or write()”.
The prototype of the pfxstrategy() entry point is
int pfxstrategy(struct buf *bp);
The argument is the address of a buf_t structure, which gives the strategy routine the information it needs to perform the I/O:
The dev_t containing major and minor device numbers
The direction of the transfer (read or write)
The location of the buffer in kernel memory
The amount of data to transfer
The starting block number on the device
For more on the contents of the buf_t structure, see “Structure buf_t” and the buf(D4) reference page.
The driver uses the information in the buf_t to validate the data transfer and programs the device to start the transfer. Then it stores the address of the buf_t where the interrupt handler can find it (see “Interrupt Entry Point”) and calls biowait() to wait for the operation to complete. For the next step, see “Completing Block I/O” (see also the biowait(D3) reference page).
The pfxpoll() entry point is called by the kernel when a user process calls the poll() or select() system function asking for status on a character special device. To implement it, you need to understand the IRIX implementation of poll().
The IRIX version of poll() allows a process to wait for events of different types to occur on any combination of devices, files, and STREAMS (see the poll(2) and select(2) reference pages). It is possible for multiple processes to be waiting for events on the same device.
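Seen from the user process, a request of this kind might look like the following sketch. It uses the standard poll() interface; the helper name hypo_wait_readable() is hypothetical, and any file descriptor (a pipe, for instance) can stand in for a device special file when trying it out.

```c
#include <poll.h>

/* Ask whether fd is readable, waiting up to timeout_ms milliseconds.
 * Returns the reported event flags, 0 on timeout, or -1 on error. */
int hypo_wait_readable(int fd, int timeout_ms)
{
    struct pollfd pfd;
    int n;

    pfd.fd = fd;
    pfd.events = POLLIN;    /* the event this process is interested in */
    pfd.revents = 0;

    n = poll(&pfd, 1, timeout_ms);
    if (n < 0)
        return -1;          /* poll() itself failed */
    if (n == 0)
        return 0;           /* timed out with no event */
    return pfd.revents;     /* events reported for fd */
}
```

With a timeout of 0 the call only tests for pending events; with a positive timeout it blocks until an event is reported or the time expires.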
It is up to you as the designer of a driver to decide which of the events that are documented in poll(2) are meaningful for your device. Other requested events simply never happen to the device.
Much of the complexity of poll() is handled by the IRIX kernel, but the kernel requires the assistance of any device driver that supports poll(). The driver is expected to allocate and hold a pollhead structure (declared in sys/poll.h) for each minor device that it supports. Allocation is simple; the driver merely calls the phalloc() kernel function. (The pfxstart() entry point is a suitable place for this call; see “Entry Point start()”.)
There are two phases to the operation of poll(). When the system function is called, the kernel calls the pfxpoll() entry point to find out if any requested events are pending at this time. If the kernel finds any events pending (on this or any other polled object), the poll() function returns to the user process. Nothing further is required.
However, when no requested event has happened, the user process expects the poll() function to block until an event has occurred. The kernel cannot implement this delay by repeatedly testing for events; that would be too inefficient. The kernel must rely on device drivers to notify it when an event has occurred.
A device driver that supports pfxpoll() is required to notify the kernel whenever an event that the driver supports has occurred. The driver does this by calling a kernel function, pollwakeup(), passing the pollhead structure for the affected device, and bit flags for the events that have taken place. In the event that one or more user processes are blocked in a poll(), waiting for an event from this device, the call to pollwakeup() will release the sleeping processes. For an example, see “Calling pollwakeup()”.
If the device in question does not support interrupts, the driver cannot support poll() unless it can somehow get control to discover an event and report it to pollwakeup(). One possibility is that the driver could simulate interrupts by setting a succession of itimeout() delays. On each timeout the driver would test its device for a change of status, call pollwakeup() when an event has occurred, and schedule a new delay. (See “Waiting for Time to Pass”.)
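The timeout-driven scheme can be sketched as a small user-space model. Nothing here calls the kernel: hypo_pollwakeup() and hypo_schedule() are hypothetical stand-ins for pollwakeup() and itimeout(), and struct hypo_timed_unit and HYPO_POLLIN are invented for illustration.

```c
#define HYPO_POLLIN 0x0001   /* stand-in for the POLLIN event flag */

struct hypo_timed_unit {
    int last_status;   /* device status seen on the previous timeout */
    int wakeups;       /* number of simulated pollwakeup() calls */
    int timer_armed;   /* nonzero while a new delay is scheduled */
};

/* Stand-in for pollwakeup(): only counts the notification. */
static void hypo_pollwakeup(struct hypo_timed_unit *u, short event)
{
    (void)event;
    u->wakeups++;
}

/* Stand-in for itimeout(): only records that a delay is pending. */
static void hypo_schedule(struct hypo_timed_unit *u)
{
    u->timer_armed = 1;
}

/* Body of the timeout function; now_status is what the device reports. */
void hypo_tick(struct hypo_timed_unit *u, int now_status)
{
    u->timer_armed = 0;                      /* this timeout has fired */
    if (now_status != u->last_status) {      /* status changed since last look */
        u->last_status = now_status;
        if (now_status != 0)
            hypo_pollwakeup(u, HYPO_POLLIN); /* report the event */
    }
    hypo_schedule(u);                        /* always arrange the next look */
}
```

The model reports an event only on a change of status, so a process is not wakened repeatedly for the same condition.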
The prototype for pfxpoll() is as follows:
int pfxpoll(dev_t dev, short events, int anyyet, short *reventsp, struct pollhead **phpp); |
The argument values are
dev | A dev_t value from which you can extract the major and minor device numbers. |
events | Bit-flags for the events the user process is testing, as passed to poll() and declared in sys/poll.h. |
reventsp | A field to receive the bit-flags of events that have occurred, or to receive 0x0000 if no requested events have occurred. |
anyyet and phpp | When anyyet is zero and no events have occurred, the kernel requires the address of the pollhead structure for this minor device to be returned in *phpp. |
Example 8-2 shows the pfxpoll() code of a hypothetical device driver. Only three event tests are supported: POLLIN and POLLRDNORM (treated as equivalent) and POLLOUT. The device driver maintains an array of pointers to pollhead structures, one for each supported minor device. These are presumably allocated with phalloc() during initialization.
    #include <sys/types.h>
    #include <sys/poll.h>
    #include <sys/ddi.h>

    /* One pollhead per minor device, allocated during initialization */
    struct pollhead *phds[MAXMINORS];
    #define OUR_EVENTS (POLLIN|POLLOUT|POLLRDNORM)

    int
    hypo_poll(dev_t dev, short events, int anyyet,
              short *reventsp, struct pollhead **phpp)
    {
        minor_t dminor = geteminor(dev);
        short happened = 0;
        short wanted = events & OUR_EVENTS; /* discard unsupported events */

        if (wanted & (POLLIN|POLLRDNORM)) {
            if (device_has_data_ready(dminor))
                happened |= (POLLIN|POLLRDNORM);
        }
        if (wanted & POLLOUT) {
            if (device_ready_for_output(dminor))
                happened |= POLLOUT;
        }
        if (device_pending_error(dminor))
            happened |= POLLERR;
        if (0 == (*reventsp = happened)) {
            /* Nothing pending: hand back the pollhead if requested */
            if (!anyyet)
                *phpp = phds[dminor];
        }
        return 0;
    }
The code in Example 8-2 begins by discarding any unsupported event flags that might have been requested. Then it tests the remaining flags against the device status. If the device has an uncleared error, the code inserts the POLLERR event. If no events were detected, and if the kernel requested it, the address of the pollhead structure for this minor device is returned.
A user process requests memory mapping by calling the system function mmap(). When the mapped object is a character device special file, the kernel calls the pfxmmap() or pfxmap() entry to validate and complete the mapping. To understand these entry points, you must understand the mmap() system function.
The purpose of the mmap() system function (see the mmap(2) reference page) is to make the contents of a file directly accessible as part of the virtual address space of the user process. The results depend on the kind of file that is mapped:
When the mapped object is a normal file, the process can load and store data from the file as if it were an array in memory.
When the mapped object is a character device special file, the process can load and store data from device registers as if they were memory variables.
When the mapped object is a block of memory owned and prepared by a pseudo-device driver, the process gains access to some special piece of memory data that it would not normally be able to access.
In all cases, access is gained through normal load and store instructions, without the overhead of calling system functions such as read(). Furthermore, the same mapping can be executed by other processes, in which case the same memory, or file, or device is shared by multiple, concurrent processes. This is how shared memory segments are achieved.
The mmap() system function takes four key parameters:
the file descriptor for an open file, which can be either a normal disk file or a device special file
an offset within that file at which the mapped data is to start. For a normal file, this is a file offset; for a device file, it represents an address in the address space of the device or the bus
the length of data to be mapped
protection flags, showing whether the mapped data is read-only or read-write
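As a user-level illustration of these parameters, the sketch below maps part of an open file read-only. The helper name hypo_map_readonly() is hypothetical. Besides the four parameters listed above, the standard mmap() call also takes an address hint and sharing flags; the sketch passes NULL and MAP_SHARED for these as an assumption.

```c
#include <sys/types.h>
#include <sys/mman.h>
#include <stddef.h>

/* Map len bytes of the open file fd, starting at offset off, read-only.
 * Returns the mapped address, or NULL on failure. */
void *hypo_map_readonly(int fd, off_t off, size_t len)
{
    void *p = mmap(NULL,        /* let the system choose the address */
                   len,         /* length of data to be mapped */
                   PROT_READ,   /* protection: read-only */
                   MAP_SHARED,  /* mapping may be shared with other processes */
                   fd,          /* descriptor of the open file */
                   off);        /* offset at which mapped data starts */
    return (p == MAP_FAILED) ? NULL : p;
}
```

Once the call succeeds, the process reads the data with ordinary loads through the returned pointer and releases the mapping with munmap().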
When the mapped object is a normal file, the filesystem implements the mapping. The filesystem does not call the block device driver for assistance in mapping a file. It does call the block device driver pfxstrategy() entry to read and write blocks of file data as necessary, but the mapping of pages of data into pages of memory is controlled in the filesystem code.
When the mapped object is a device special file, the mmap() parameters are passed to the device driver at either its pfxmmap() or pfxmap() entry point. The device driver interprets the parameters in the context of the device, and uses a kernel function to create the mapping.
Once a device or kernel memory has been mapped into some user address space, the mapping persists until the user process terminates or calls munmap() (see the munmap(2) reference page). In particular, the mapping does not end simply because the device special file is closed. You cannot assume, in the pfxclose() or pfxunload() entry points, that all mappings to devices have ended.
The pfxmap() entry point can be defined in either a character or a block driver (it is the only mapping entry point that a block driver can supply). The function prototype is
int pfxmap(dev_t dev, vhandl_t *vt, off_t off, int len, int prot); |
The argument values are
dev | A dev_t value from which you can extract both the major and minor device numbers. |
vt | The address of an opaque structure that describes the assigned address in the user process address space. The structure contents are subject to change. |
off, len | The offset and length arguments passed to mmap() by the user process. |
prot | Flags showing the access intentions of the user process. |
The first task of the driver is to verify that the access specified in prot is allowed. The next task is to validate the off and len values: do they fall in the valid address space of the device?
When the device driver approves of a mapping, it uses a kernel function, v_mapphys(), to establish the mapping. This function (documented in the v_mapphys(D3) reference page) takes the vhandl_t, an address in kernel cached or uncached memory, and a length. It makes the specified region of kernel space a part of the address space of the user process.
For example, a pseudo-device driver that intends to share kernel virtual memory with user processes would first allocate the memory:
caddr_t kaddr = kmem_alloc(len, KM_CACHEALIGN); |
It would then use the address of the allocated memory with the vhandl_t value it had received to map the allocated memory into the user space:
v_mapphys(vt, kaddr, len); |
Note: There are no special precautions to take when mapping cached memory into user space, or when mapping device registers or bus addresses. However, you should almost never map uncached memory into user space. The effects of uncached memory access are hardware dependent and differ between multiprocessors and uniprocessors. Among uniprocessors, the IP26 CPU module has highly restrictive rules for the use of uncached memory (see “Uncached Memory Access in the IP26 CPU”). In general, mapping uncached memory makes a driver nonportable and is likely to lead to subtle failures that are hard to resolve.
Example 8-3 contains an edited fragment of code from a Silicon Graphics device driver. This pseudo-device driver, whose prefix is flash_, provides access to “flash” PROM in certain computer models. It allows a user process to map the PROM into user space.
    int
    flash_map(dev_t dev, vhandl_t *vt, off_t off, long len)
    {
        long offset = (long) off; /* Actual offset in flash prom */

        /* Don't allow requests which exceed the flash prom size */
        if ((offset + len) > FLASHPROM_SIZE)
            return ENOSPC;
        /* Don't allow non page-aligned offsets */
        if ((offset % NBPC) != 0)
            return EIO;
        /* Only allow mapping of entire pages */
        if ((len % NBPC) != 0)
            return EIO;
        return v_mapphys(vt, FLASHMAP_ADDR + offset, len);
    }
When the driver allocates some memory resource associated with the mapping, and when more than one mapping can be active at a time, the driver needs to tag each memory resource so it can be located when the pfxunmap() entry point is called. One answer is to use the vt_gethandle() macro defined in sys/region.h. This macro takes a pointer to a vhandl_t and returns a unique pointer-sized integer that can be used to tag allocations. No other information in sys/region.h is supported for driver use.
The pfxmmap() (note: two letters “m”) entry can be used only in a character device driver. The prototype is
int pfxmmap(dev_t dev, off_t off, int prot); |
The argument values are
dev | A dev_t value from which you can extract both the major and minor device numbers. |
off | The offset argument passed to mmap() by the user process. |
prot | Flags showing the access intentions of the user process. |
The function is expected to return the page frame number (PFN) that corresponds to the offset off in the device address space. A PFN is an address divided by the page size. (See “Working With Page and Sector Units” for page unit conversion functions.)
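The PFN arithmetic is simple enough to check by hand. The sketch below shows it in portable C; the helper names, the 4,096-byte page size used in the usage note, and the device base address are all assumptions for illustration, and a real driver would use the kernel's own page unit conversion functions.

```c
#include <stdint.h>

/* Page frame number of an address: the address divided by the page size. */
uint64_t hypo_addr_to_pfn(uint64_t addr, uint64_t page_size)
{
    return addr / page_size;
}

/* PFN for offset off from the start of a device whose registers
 * begin at dev_base (both values hypothetical). */
uint64_t hypo_offset_to_pfn(uint64_t dev_base, uint64_t off,
                            uint64_t page_size)
{
    return hypo_addr_to_pfn(dev_base + off, page_size);
}
```

For example, with a page size of 4,096 bytes, a device based at 0x1f000000 and an offset of 0x2000 yield the address 0x1f002000 and the PFN 0x1f002.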
This entry point is supported only for compatibility with SVR4. When the kernel needs to map a character device, it looks first for pfxmap(). It calls pfxmmap() only when pfxmap() is not available. The differences between the two entry points are as follows:
This entry point receives no vhandl_t argument, so it cannot use v_mapphys(). It has to calculate a page frame number, which means that it has to be aware of the current page size (obtainable from the ptob() kernel function, see the ptob(D3) reference page).
This entry point does not receive a length argument, so it has to assume a default length for every map (typically the page size).
When a mapping is created using this entry point, the pfxunmap() entry is not called.
The kernel calls the pfxunmap() entry point to end a mapping that was created using the pfxmap() entry point. This entry should be supplied, even if it is an empty function, when the pfxmap() entry point is supplied. If it is not supplied, the munmap() system function returns the ENODEV error.
The pfxunmap() entry point is only called when the mapped region has been completely unmapped by all processes. For example, suppose a parent process calls mmap() to map a device. Then the parent creates one or more child processes using sproc(). Each child shares the address space, including the mapped segment. A process in the share group can terminate, or can explicitly call munmap() for the segment or part of the segment; these actions do not result in a call to pfxunmap(). Only when the last process with access to the segment has fully unmapped the segment is pfxunmap() called.
On entry, the kernel has completed unmapping the object from the user process address space. This entry point does not need to do anything to affect the user address space; it only needs to release any resources that were allocated to support the mapping.
The prototype is
int pfxunmap(dev_t dev, vhandl_t *vt); |
The argument values are
dev | A dev_t value from which you can extract both the major and minor device numbers. |
vt | The address of an opaque structure that describes the assigned address in the user process address space. |
If the driver allocated no resources to support a mapping, no action is needed here; the entry point can consist of a “return 0” statement.
When the driver does allocate memory to support a mapping, and supports multiple mappings, the driver needs to identify the resource associated with this particular mapping in order to release it. The vt_gethandle() macro returns a unique number based on the vt argument; this can be used to identify resources.
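One way to keep track of per-mapping resources is a small table keyed by the tag. The following user-space sketch is an assumption-laden model: the table size, the names hypo_map_remember() and hypo_map_forget(), and the rule that a tag of 0 marks a free slot are all hypothetical. In a driver, the tag would come from vt_gethandle() in the map entry point and again in the unmap entry point.

```c
#include <stddef.h>
#include <stdint.h>

#define HYPO_MAXMAPS 8   /* assumed limit on concurrent mappings */

/* One slot per active mapping; tag 0 marks a free slot (an assumption). */
struct hypo_mapslot {
    uintptr_t tag;       /* pointer-sized value identifying the mapping */
    void     *resource;  /* memory allocated to support the mapping */
};

static struct hypo_mapslot hypo_maps[HYPO_MAXMAPS];

/* Record a resource under tag; returns 0, or -1 when the table is full. */
int hypo_map_remember(uintptr_t tag, void *resource)
{
    for (size_t i = 0; i < HYPO_MAXMAPS; i++) {
        if (hypo_maps[i].tag == 0) {
            hypo_maps[i].tag = tag;
            hypo_maps[i].resource = resource;
            return 0;
        }
    }
    return -1;
}

/* Find and forget the resource for tag; returns NULL when unknown. */
void *hypo_map_forget(uintptr_t tag)
{
    for (size_t i = 0; i < HYPO_MAXMAPS; i++) {
        if (tag != 0 && hypo_maps[i].tag == tag) {
            void *r = hypo_maps[i].resource;
            hypo_maps[i].tag = 0;          /* slot is free again */
            hypo_maps[i].resource = NULL;
            return r;                      /* caller releases it */
        }
    }
    return NULL;
}
```

In a multiprocessor-aware driver, a table like this is shared data and would itself need to be protected by a lock.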
In traditional UNIX, when a hardware device presents an interrupt, the kernel locates the device driver for the device and calls the pfxintr() entry point (see the intr(D2) reference page). In current practice, a driver must register a specific interrupt handler for each device. The kernel functions for doing this are bus-specific, and are discussed in the bus-specific chapters. For example, the means of registering a PCI interrupt handler is discussed in Chapter 19, “PCI Device Attachment.” However, the discussion of interrupts that follows is still relevant to any interrupt handler.
In principle an interrupt can happen at any time. Normally an interrupt occurs because at some previous time, the driver initiated a device operation. Some devices can interrupt without a preceding command.
The association between an interrupt and the driver is established in different ways depending on the hardware.
For devices on the SCSI bus, all interrupts are handled by a single, low-level driver which notifies a callback function (see Chapter 13, “SCSI Device Drivers”).
For devices on the PCI bus, the driver registers an interrupt handler using pci_intr_connect() at the time the device is attached (“Attaching a Device”).
When an interrupt occurs, the system is in an unknown state. As a result, the interrupt handler can use only a restricted set of kernel services, and no services that can sleep. In general, the interrupt handler implements the following tasks.
When the driver supports multiple logical units, use ivec to locate the data structure for the interrupting unit.
Determine the reason for the interrupt by interrogating the device.
When the interrupt is a response to a device operation, note the success or failure of the command.
If the driver top half is waiting for the interrupt, waken it.
If the driver supports polling, and the interrupt represents a pollable event, call pollwakeup().
If the device is not in an error state and another operation is waiting to be started, start it.
The details of each of these tasks depend on the hardware and on the design of the data structures used by the driver top half.
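The order of these tasks can be sketched as a user-space model. Everything in it is hypothetical: the unit table, the status bits, and the counters that stand in for waking the top half and starting the next operation. The pollwakeup() step is left out to keep the model short.

```c
#define HYPO_MAXUNITS 4
#define HYPO_ST_DONE  0x1   /* assumed status bit: command completed */
#define HYPO_ST_ERROR 0x2   /* assumed status bit: command failed */

struct hypo_intr_unit {
    int status_reg;   /* stand-in for a device status register */
    int last_error;   /* result noted for the top half */
    int top_waiting;  /* nonzero when the top half awaits this interrupt */
    int wakeups;      /* times the top half was wakened */
    int queued;       /* operations waiting to be started */
    int started;      /* operations started by the handler */
};

static struct hypo_intr_unit hypo_units[HYPO_MAXUNITS];

void hypo_intr(int ivec)
{
    struct hypo_intr_unit *u;
    int status;

    /* Locate the data structure for the interrupting unit. */
    if (ivec < 0 || ivec >= HYPO_MAXUNITS)
        return;                       /* not our device */
    u = &hypo_units[ivec];

    /* Determine the reason by interrogating the device
     * (in this model, reading the status also clears it). */
    status = u->status_reg;
    u->status_reg = 0;
    if (!(status & (HYPO_ST_DONE | HYPO_ST_ERROR)))
        return;                       /* nothing to report */

    /* Note the success or failure of the command. */
    u->last_error = (status & HYPO_ST_ERROR) ? 1 : 0;

    /* Waken the top half if it is waiting. */
    if (u->top_waiting) {
        u->top_waiting = 0;
        u->wakeups++;
    }

    /* Start the next operation unless the device is in error. */
    if (!u->last_error && u->queued > 0) {
        u->queued--;
        u->started++;
    }
}
```

Note that the handler never waits for anything; each step either completes at once or is skipped.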
In a uniprocessor system, there is only one CPU and when it is executing the interrupt handler, nothing else is executing. An interrupt handler can only be preempted by an interrupt of higher priority—which would be an interrupt for a different driver, and so would have no conflicts with this driver over the use of data.
In a multiprocessor, an interrupt can be taken on any CPU, while other CPUs continue to execute kernel or user code.
In a multiprocessor, when an interrupt must be handled by a driver that is not marked as multiprocessor-aware (see “Flag D_MP”), the interrupt may be received on some other CPU, but the driver interrupt entry point is always executed on CPU 0.
In a multiprocessor, when the driver is multiprocessor-aware, one or more other CPUs can execute in the driver's top-half entry points while another CPU executes the driver's interrupt entry point. An interrupt handler written for a multiprocessor must not assume that it has exclusive use of the driver's data (see “Planning for Multiprocessor Use”).
It is theoretically possible in a multiprocessor for a device to interrupt; for one CPU to enter the interrupt handler; and for the device to interrupt again, resulting in multiple concurrent entries to the same interrupt handler. However, IRIX prevents this. You can assume that your interrupt handler code is entered serially, and not used concurrently by multiple CPUs.
Speed in exiting the interrupt handler is critical to system performance. In a uniprocessor, the system is doing nothing else while it executes the handler, and it cannot respond to interrupts of a lower priority. In a multiprocessor, interrupts can be taken by different CPUs. While a CPU executes a handler, that CPU cannot respond to lower-priority interrupts, but other CPUs can be processing user-level code or responding to other interrupts.
In a block device driver, an I/O operation is represented by the buf_t structure. The pfxstrategy() routine starts operations and waits for them to complete (see “Entry Point strategy()”).
The interrupt entry point sets the residual count in b_resid. It can post an error using bioerror(). It posts the operation complete and wakens the pfxstrategy() routine by calling biodone(). If the pfxstrategy() entry has set the address of a completion callback function in the b_iodone field of the buf_t, biodone() invokes it. (For more discussion, see “Waiting for Block I/O to Complete”.)
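The completion sequence can be modelled in user space as follows. This is a sketch under stated assumptions, not the kernel's implementation: struct hypo_iobuf carries only a few buf_t-style fields, HYPO_B_DONE is an invented flag, and hypo_bioerror() and hypo_biodone() are simplified stand-ins for bioerror() and biodone().

```c
#include <stddef.h>

#define HYPO_B_DONE 0x1   /* assumed flag: operation complete */

struct hypo_iobuf {
    size_t b_bcount;                        /* bytes requested */
    size_t b_resid;                         /* bytes not transferred */
    int    b_error;                         /* 0 when no error */
    int    b_flags;                         /* HYPO_B_DONE once complete */
    void (*b_iodone)(struct hypo_iobuf *);  /* optional completion callback */
};

/* Stand-in for bioerror(): post an error on the buffer. */
static void hypo_bioerror(struct hypo_iobuf *bp, int err)
{
    bp->b_error = err;
}

/* Stand-in for biodone(): mark complete and invoke any callback. */
static void hypo_biodone(struct hypo_iobuf *bp)
{
    bp->b_flags |= HYPO_B_DONE;
    if (bp->b_iodone != NULL)
        bp->b_iodone(bp);
}

/* Interrupt-side completion; moved is the byte count the device reports. */
void hypo_complete(struct hypo_iobuf *bp, size_t moved, int dev_error)
{
    /* The residual count is whatever was requested but not moved. */
    bp->b_resid = (moved < bp->b_bcount) ? bp->b_bcount - moved : 0;
    if (dev_error != 0)
        hypo_bioerror(bp, dev_error);
    hypo_biodone(bp);
}
```

The order matters: the residual count and any error are stored before the operation is posted complete, so that whoever is wakened sees the final values.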
In a character device driver, the driver top half typically awaits an interrupt by sleeping on a semaphore or synchronizing variable, and the interrupt routine posts the semaphore (see “Waiting for a General Event”). Error information must be passed in driver variables according to some local convention.
When the interrupt represents an event that can be reported by the driver's pfxpoll() entry point (see “Entry Point poll()”), the interrupt handler must report the event to the kernel, in case some user process is waiting in a poll() call. Hypothetical code to do this is shown in Example 8-4.
    void
    hypo_intr(int ivec)
    {
        struct hypo_dev_info *pinfo;

        if (!(pinfo = find_dev_info(ivec)))
            return; /* not our device */
        ...
        if (pinfo->have_data_flag)
            pollwakeup(pinfo->phead, POLLIN|POLLRDNORM);
        if (pinfo->output_ok_flag)
            pollwakeup(pinfo->phead, POLLOUT);
        ...
    }
Certain driver entry points are used to support the operations of the kernel or the administration of the system.
The pfxunload() entry point is called when the kernel is about to dynamically remove a loadable driver from the running system. The prototype is
int pfxunload(void); |
A driver can be unloaded either because all its devices are closed and a timeout has elapsed, or because the operator has used the ml command (see the ml(1) reference page). The kernel does not unload a driver unless the driver provides a pfxunload() entry point. Without this entry point, the driver can be dynamically loaded, but then remains in memory.
It is not easy to retain state information about the device over the time when the driver is not in memory. The entire text and data of a loadable driver, including static variables, are removed and reloaded. Only global variables defined in the descriptive file (see “Describing the Driver in /var/sysgen/master.d”) remain in memory after the driver is unloaded. Be sure not to store any addresses of driver code or driver static variables in global variables, since these addresses will be different when the driver is reloaded.
The driver may have allocated dynamic memory. This should be released, because the addresses of allocated memory will be lost when the driver is unloaded, and more will be allocated if the driver is reloaded. For example, the driver should use phfree() to release a pollhead structure allocated by phalloc() (see “Use and Operation of poll(2)”, and the phalloc(D3) and phfree(D3) reference pages). It is also the time to release any PIO maps (see “Inactivating Maps and Releasing Objects”), and to release any process handles (see “Sending a Process Signal”).
The driver is not required to unload. If the driver should not be unloaded at this time, it returns a nonzero return code to the call, and the kernel does not unload it. There are several reasons why a driver should not be unloaded.
The kernel calls pfxunload() only when no device special files managed by the driver are open. If any device had been opened, the pfxclose() entry has been called. However, if any device was mapped through the pfxmap() entry, the mapping could still exist. If the driver has any resources tied up in association with a memory mapping, it should return a nonzero value to the pfxunload call.
A driver should never permit unloading when there is any kind of pointer to the driver held in any kernel data structure. It is a frequent design error to unload when there is a live pointer to the driver. Unpredictable kernel panics often result.
One example of a live pointer to a driver is a pending callback function. Any pending itimeout() or bufcall() timers should be cancelled before returning 0 from pfxunload(). A driver for the PCI bus can register an interrupt handler, and should unregister an interrupt handler (see “Unloading”) before it permits unloading.
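The decision logic described above can be summarized in a user-space model. The state structure, the helper functions, and the choice of EBUSY as the nonzero return code are all assumptions for illustration; the helpers merely stand in for the kernel calls that cancel callbacks, unregister the interrupt handler, and release pollhead structures.

```c
#include <errno.h>

struct hypo_state {
    int active_maps;       /* mappings created and not yet unmapped */
    int pending_timeouts;  /* callbacks scheduled and not yet fired */
    int intr_registered;   /* nonzero while an interrupt handler is registered */
    int pollheads;         /* pollhead structures still allocated */
};

/* Stand-ins for the kernel calls that cancel, unregister, and release. */
static void hypo_cancel_timeouts(struct hypo_state *s) { s->pending_timeouts = 0; }
static void hypo_disconnect_intr(struct hypo_state *s) { s->intr_registered = 0; }
static void hypo_free_pollheads(struct hypo_state *s)  { s->pollheads = 0; }

/* Returns 0 to permit unloading, nonzero to refuse. */
int hypo_unload(struct hypo_state *s)
{
    /* A live mapping still ties up driver resources: refuse. */
    if (s->active_maps > 0)
        return EBUSY;

    /* Remove every pointer into the driver before agreeing to go. */
    hypo_cancel_timeouts(s);
    hypo_disconnect_intr(s);

    /* Release dynamic memory whose addresses would otherwise be lost. */
    hypo_free_pollheads(s);
    return 0;
}
```

The refusal test comes first, so that a driver that declines to unload is left fully operational.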
The kernel calls the pfxhalt() entry point, if one exists, while performing an orderly system shutdown (see the halt(1) reference page). No other driver entry points are called after this one. The prototype is simply
void pfxhalt(void); |
Since the system is shutting down, there is no point in returning allocated memory. The only purpose this entry point can serve is to leave the device in a safe and stable condition. For example, this is the place at which a disk driver could command the heads of the drive to move to a safe zone for power off.
The driver cannot assume that interrupts are disabled or enabled. The driver cannot block waiting for device actions, so whatever commands it issues to the device must take effect immediately.
The pfxsize() entry point is required of block device drivers. It reports the size of the device in “sector” units, where a “sector” size is declared as NBPSCTR in sys/param.h (currently 512). The prototype is
int pfxsize(dev_t dev); |
The device major and minor numbers can be extracted from the dev argument. The entry point is not called until pfxopen() has been called. Typically the driver will calculate the size of the medium during pfxopen().
Since the int return value is 32 bits in all systems, the largest possible block device is 1,024 gigabytes: $(2^{31} \times 512) / 1{,}024^{3} = 1{,}024$.
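The arithmetic behind this limit can be checked with a few lines of C. The helper name and the HYPO_NBPSCTR constant are hypothetical; the constant simply repeats the 512-byte sector size quoted above.

```c
#include <stdint.h>

#define HYPO_NBPSCTR 512u   /* sector size quoted in the text */

/* Convert a sector count to gigabytes (units of 1,024^3 bytes),
 * discarding any fraction. */
uint64_t hypo_sectors_to_gb(uint64_t sectors)
{
    return (sectors * HYPO_NBPSCTR) / (1024ull * 1024ull * 1024ull);
}
```

Strictly, the largest positive value of a 32-bit int is one less than 2 to the 31st power, so the largest size that can actually be reported is one sector short of 1,024 gigabytes.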
The pfxprint() entry point is called from the kernel to display a diagnostic message when an error is detected on a block device. The prototype and the complete logic of the entry point is shown in Example 8-5.
    #include <sys/cmn_err.h>
    #include <sys/ddi.h>

    int
    hypo_print(dev_t dev, char *str)
    {
        cmn_err(CE_NOTE, "Error on dev %d: %s\n", geteminor(dev), str);
        return 0;
    }
The pfxioctl() entry point can be passed a data structure from the user process address space; that is, the arg value can be a pointer to a structure or an array of data. In order to interpret such a structure, the driver has to know the execution model for which the user process was compiled.
The execution model is specified when code is compiled. The 32-bit model (compiler option -32 or -n32) uses 32-bit address values and a long int contains 32 bits. The 64-bit model (compiler option -64) uses 64-bit address values and a long int contains 64 bits. (The size of an unqualified int is 32 bits in both models.) The execution model is sometimes casually called the “ABI” (Application Binary Interface), but this is an improper use of that term: an ABI comprises calling conventions, public names, and structure definitions, as well as the execution model.
An IRIX kernel compiled to the 32-bit model contains 32-bit drivers and supports only 32-bit user processes. A kernel compiled to the 64-bit model contains 64-bit drivers, but it supports user processes compiled to either 32-bit or 64-bit models. Therefore, in a 64-bit kernel, a driver can be asked to interpret data produced by a 32-bit program.
This is true only of the pfxioctl() and pfxpoll() entry points. Other driver entry points move data to and from user space as streams or blocks of bytes—not as a structure with fields to be interpreted.
Since in other respects it is easy to make your driver portable between 64-bit and 32-bit systems, you should design your driver so that it can handle the case of operating in a 64-bit kernel, receiving ioctl() requests alternately from 32-bit and 64-bit programs.
The simplest way to do this is to define the arguments passed to the entry points in such a way that they have the same precision in either system. However, this is not always possible. To handle the general case, the driver must know to which model the user process was compiled.
You find this out by calling the userabi() kernel function (for which, unfortunately, there is no reference page available).
The prototype of userabi() (declared in sys/ddi.h) is
int userabi(__userabi_t *); |
If there is no user process context, userabi() returns ESRCH. Otherwise it fills out a __userabi_t structure and returns 0. The structure of type __userabi_t (declared in sys/types.h) contains the fields listed below:
uabi_szint | Size of a user int (4). |
uabi_szlong | Size of a user long (4 or 8). |
uabi_szptr | Size of a user address (4 or 8). |
uabi_szlonglong | Size of a user long long (8). |
Store the value of uabi_szptr when opening a device. Then you can use it to choose between 32-bit and 64-bit declarations of a structure passed to pfxioctl() or an address passed to pfxpoll().
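The choice between declarations might be sketched as follows. This is a portable user-space model: the structures hypo_args32 and hypo_args64 and the function hypo_decode_args() are hypothetical, and raw is assumed to point at bytes the driver has already copied in from the user process. The user_szptr argument plays the role of the saved uabi_szptr value.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical ioctl argument as laid out by a 32-bit program. */
struct hypo_args32 {
    uint32_t buf_addr;   /* user address, 4 bytes */
    uint32_t buf_len;    /* user long, 4 bytes */
};

/* The same argument as laid out by a 64-bit program. */
struct hypo_args64 {
    uint64_t buf_addr;   /* user address, 8 bytes */
    uint64_t buf_len;    /* user long, 8 bytes */
};

/* Decode raw according to the user pointer size (4 or 8), widening the
 * result to the 64-bit form. Returns 0, or -1 for an unexpected size. */
int hypo_decode_args(const void *raw, int user_szptr,
                     struct hypo_args64 *out)
{
    if (user_szptr == 4) {
        struct hypo_args32 a32;
        memcpy(&a32, raw, sizeof a32);
        out->buf_addr = a32.buf_addr;   /* widen each field */
        out->buf_len  = a32.buf_len;
        return 0;
    }
    if (user_szptr == 8) {
        memcpy(out, raw, sizeof *out);  /* layout already matches */
        return 0;
    }
    return -1;
}
```

Converting to a single internal form at the entry point keeps the rest of the driver free of model-specific code.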
Multiprocessor computers are a central part of the Silicon Graphics product line and will become increasingly common in the future. A device driver that is not multiprocessor-ready can be used in a multiprocessor, but it is likely to cause a performance bottleneck. A multiprocessor-ready driver, on the other hand, works well in a uniprocessor with little if any loss of speed.
A multiprocessor has two or more CPU modules, all of the same type. The CPUs execute independently, but all share the same main memory. Any CPU can execute the code of the IRIX kernel, and it is common for two or more CPUs to be executing kernel code, including driver code, simultaneously.
The original UNIX architecture assumed a uniprocessor hardware environment with a hierarchy of interrupt levels. Ordinary code could be preempted by an interrupt, but an interrupt handler could only be preempted by an interrupt at a higher level.
This assumed hardware environment was reflected in the design of device drivers and kernel support functions.
In a uniprocessor, an upper-half driver entry point such as pfxopen() cannot be preempted except by an interrupt. It has exclusive access to driver variables except for those changed by the interrupt handler.
Once in an interrupt handler, no other code can possibly execute except an interrupt of a higher hardware level. The interrupt handler has exclusive access to driver variables.
The interrupt handler can use kernel functions such as splhi() to set the hardware interrupt mask, blocking interrupts of all kinds, and thus getting exclusive access to all memory including kernel data structures.
All of these assumptions fail in a multiprocessor.
Upper-half entry points can be entered concurrently on multiple CPUs. For example, one CPU can be executing pfxopen() while another CPU is in pfxstrategy(). Exclusive use of driver variables cannot be assumed.
An interrupt can be taken on one CPU while upper-half routines or a timeout function execute concurrently on other CPUs. The interrupt routine cannot assume exclusive use of driver variables.
Interrupt-level functions such as splhi() are meaningless, since at best they set the interrupt mask on the current CPU only. Other CPUs can accept interrupts at all levels. The interrupt handler can never gain exclusive access to kernel data.
The process of making a driver multiprocessor-ready consists of changing all code whose correctness depends on uniprocessor assumptions.
Whenever a common resource can be updated by two processes concurrently, the resource must be protected by a lock that represents the exclusive right to update the resource. Before changing the resource, the software acquires the lock, claiming exclusive access. After changing the resource, the software releases the lock.
The IRIX kernel provides a set of functions for creating and using locks. It provides another set of functions for creating and using semaphore objects, which are like locks but sometimes more flexible. Both sets of functions are discussed under “Waiting and Mutual Exclusion”.
Sometimes the lock is not available—some other process executing in another CPU has acquired the lock. When this happens, the requesting process is delayed in the lock function until the lock is free. To delay, or sleep, is allowed for upper-half entry points, because they execute (in effect) as subroutines of user processes.
Interrupt handlers and timeout functions are not permitted to sleep. They have no process identity and so there is no mechanism for saving and restoring their state. An interrupt handler can test a lock, and can claim the lock conditionally, but if a lock is already held, the handler must have some alternate way of storing data.
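The idea of claiming a lock conditionally can be illustrated in portable C. The sketch below is a model only: a C11 atomic_flag stands in for a kernel lock, and all the names, the queue length, and the policy of setting data aside are assumptions. Draining the set-aside values is omitted, and in a real driver that step needs its own care.

```c
#include <stdatomic.h>

#define HYPO_QLEN 4   /* assumed capacity of the alternate storage */

static atomic_flag hypo_lock = ATOMIC_FLAG_INIT;  /* stand-in for a kernel lock */
static int hypo_shared_count;                     /* resource the lock protects */
static int hypo_deferred[HYPO_QLEN];              /* alternate storage */
static int hypo_ndeferred;

/* Conditional acquire: returns 1 when the lock was obtained, 0 when held. */
static int hypo_trylock(void)
{
    return !atomic_flag_test_and_set(&hypo_lock);
}

static void hypo_unlock(void)
{
    atomic_flag_clear(&hypo_lock);
}

/* Interrupt-side update that never waits. Returns 1 when the value was
 * applied directly, 0 when it was set aside, -1 when it had to be dropped. */
int hypo_intr_post(int value)
{
    if (hypo_trylock()) {
        hypo_shared_count += value;   /* exclusive access obtained */
        hypo_unlock();
        return 1;
    }
    if (hypo_ndeferred < HYPO_QLEN) {
        hypo_deferred[hypo_ndeferred++] = value;  /* lock busy: set aside */
        return 0;
    }
    return -1;                        /* no room left */
}
```

An upper-half routine, by contrast, could simply wait for the lock, since it is allowed to sleep.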
When designing an upper-half entry point, keep in mind that it could be executed concurrently with any other upper-half entry point, and that the one entry point could even be executed concurrently by multiple CPUs. Only a few entry points are immune:
The pfxinit(), pfxedtinit(), and pfxstart() entry points cannot be entered concurrently with each other or any other entry point (pfxstart() could be entered concurrently with the interrupt handler).
The pfxunload() and pfxhalt() entry points cannot be entered concurrently with any other entry point except for stray interrupts.
Certain entry points have no cause to use shared data; for example, pfxsize() and pfxprint() normally do not need to take any precautions.
Other upper-half entry points, and all STREAMS entry points, can be entered concurrently by multiple CPUs, when the driver is multiprocessor-aware.
You can deal with concurrency at different levels of sophistication.
If you do not set the D_MP flag in a character driver or STREAMS driver (see “Flag D_MP”), the driver is executed only on CPU 0. As a result, upper-half entry points cannot execute concurrently, and the interrupt handler cannot run in true concurrency with an upper-half routine (although it can preempt an upper-half routine as it does in a uniprocessor).
The result is that user processes are serialized for the use of the driver for any purpose. Since CPU 0 is often busy with other housekeeping activities, access to the driver can have a latency that is long and variable.
You can create a single lock for upper-half serialization. Each upper-half function begins with read-only operations such as extracting the device minor number and testing and validating arguments. You allow these to execute concurrently on any CPU (the D_MP flag is set).
In each entry point, when the preliminaries are complete, you acquire the single lock, and release it just before returning. The result is that processes are serialized for I/O through the driver. If the driver supports only a single device, processes would be serialized in any case, waiting for the device to operate. Since the upper half can execute on any CPU, latency is more predictable.
When the driver supports multiple minor devices, you will normally have a data structure per device, indexed by the device minor number. Typically an upper-half routine is concerned only with one minor device. You can define a lock in the data structure for the minor device, and acquire that lock as soon as the device number is known.
This permits concurrent execution of upper-half requests for different minor devices, while serializing access to any one device.
Upper-half entry points prepare work for the device to do, and the interrupt routine reports the completion of the device action. In a block device driver, this communication is relatively simple. In a character driver, you have more design options. The kernel functions mentioned in the following topics are covered under “Waiting and Mutual Exclusion”.
In a block device driver, the pfxstrategy() routine initiates a read or a write based on a buf_t structure (see “Entry Point strategy()”), and leaves the address of the buf_t where the interrupt routine can find it. Then pfxstrategy() calls the biowait() kernel function to wait for completion.
The pfxintr() entry point updates the buf_t (using bioerror() if necessary) and then uses biodone() to mark the buf_t as complete. This ends the wait for pfxstrategy(). These kernel functions are multiprocessor-aware.
In a character driver that supports interrupts, you design your own coordination mechanism. The simplest approach (not recommended) is to call the kernel function sleep() in the upper half and wakeup() in the interrupt routine. Alternatively, you can use a semaphore, calling psema() in the upper half and vsema() in the interrupt handler.
If you need to allow for timeouts, you have to deal with the complication that the timeout function can be called concurrently with an interrupt. When you use a semaphore, the interrupt routine can use vsema() to post completion, and the timeout function can use cvsema() to post it only if it has not already been posted.
As a general approach, you can convert a uniprocessor driver to make it multiprocessor-safe in the following steps:
If it currently uses the D_OLD flag (or has no pfxdevflag constant), convert it to use the current interface, with a pfxdevflag of 0x00.
Make sure it works in the original uniprocessor at the current release of IRIX.
Test it on a multiprocessor, where it still runs only on CPU 0.
Begin adding semaphores, locks, and other exclusion and synchronization tools. Since the driver still runs serially on CPU 0, it will never wait for a lock, but the coordination between upper half and interrupt handler should work.
Add the D_MP flag and test on a multiprocessor.
In performing the conversion, you can use calls to spl..() functions as signs that work is needed. These functions are used for mutual exclusion in a uniprocessor, and they are all ineffective or unnecessary in a multiprocessor-safe driver.
The code in Example 8-6 shows typical logic in a uniprocessor character driver.
    s = splvme();
    flag |= WAITING;
    while (flag & WAITING) {
        sleep(&flag, PZERO);
    }
    splx(s);
The upper half calls the splvme() function with the intention of blocking interrupts, and thus preventing execution of this driver's interrupt handler while the flag variable is updated. In a multiprocessor this is ineffective because at best it sets the interrupt level on the current CPU. The interrupt handler can execute on another CPU and change the variable.
The corresponding interrupt handler is sketched in Example 8-7.
    if (flag & WAITING) {
        wakeup(&flag);
        flag &= ~WAITING;
    }
The interrupt handler could execute on another CPU and test the flag after the upper half has called splvme() but before it has set WAITING in flag. The handler then sees no waiter and skips the wakeup() call, so the upper half goes to sleep with nothing to wake it. The interrupt is effectively lost. This would happen rarely, and it would be hard to reproduce and hard to trace.
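The failing interleaving can be replayed deterministically in a small user-space model of the logic in Example 8-6 and Example 8-7. This is an illustrative sketch only: the model_ names are hypothetical, the two CPUs are simulated by calling the steps in a chosen order, and a boolean records whether a wakeup would have been issued.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_WAITING 0x01          /* hypothetical flag bit */

static unsigned model_flag;         /* shared by upper half and handler */
static bool     model_wakeup_sent;  /* records whether wakeup was issued */

static void model_reset(void)
{
    model_flag = 0;
    model_wakeup_sent = false;
}

/* Upper-half step from Example 8-6: announce the wait. */
static void model_upper_set_waiting(void)
{
    model_flag |= MODEL_WAITING;
}

/* Handler logic from Example 8-7. */
static void model_handler(void)
{
    if (model_flag & MODEL_WAITING) {
        model_wakeup_sent = true;
        model_flag &= ~MODEL_WAITING;
    }
}

/* True when the upper half would sleep with no wakeup on the way. */
static bool model_upper_would_hang(void)
{
    return (model_flag & MODEL_WAITING) && !model_wakeup_sent;
}
```

Calling model_handler() before model_upper_set_waiting() reproduces the lost interrupt: the flag is left set and no wakeup was recorded. Calling them in the opposite order clears the flag and records the wakeup.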
A more reliable, and simpler, technique is to use a semaphore. The driver defines a global semaphore:
    static sema_t sleeper;
A driver with multiple devices would have a semaphore per device, perhaps as an array of sema_t items indexed by device minor number.
The semaphore (or array) would be initialized in the pfxinit() or pfxstart() entry point with a starting value of 0, so that a psema() call waits until the semaphore is posted:
    void hypo_start()
    {
        ...
        initnsema(&sleeper, 0, "sleeper");
    }
After the upper half started a device operation, it would await the interrupt using psema():
    psema(&sleeper, PZERO);
The PZERO argument makes the wait immune to signals. If the driver should wake up when a signal is sent to the calling process (such as SIGINT or SIGTERM), the second argument can be PCATCH. A return value of -1 indicates the semaphore was posted by a signal, not by a vsema() call.
The interrupt handler would use vsema() or cvsema() to post the semaphore. The use of cvsema() ensures that the semaphore is not incremented past 1, in the event that it is posted from more than one location (as from a timeout or a signal handler).