Guaranteed-rate I/O, or GRIO for short, is a mechanism that enables a user application to reserve part of a system's I/O resources for its exclusive use. For example, it can be used to enable “real-time” retrieval and storage of data streams. It manages the system resources among competing applications, so the actions of new processes do not affect the performance of existing ones. GRIO can read and write only files on a real-time subvolume of an XFS filesystem.
This chapter explains important guaranteed-rate I/O concepts, describes how to configure a system for GRIO, and provides instructions for creating an XLV logical volume for use with applications that use GRIO.
|Note: By default, IRIX supports four GRIO streams (concurrent uses of GRIO). To increase the number of streams to 40, you can purchase the High Performance Guaranteed-Rate I/O—5-40 Streams software option. For more streams, you can purchase the High Performance Guaranteed-Rate I/O—Unlimited Streams software option. See the grio Release Notes for information on purchasing these software options and obtaining the required NetLS licenses.|
The GRIO mechanism is designed for use in an environment where many different processes attempt to access scarce I/O resources simultaneously. GRIO provides a way for applications to determine whether resources are already fully utilized, in which case attempts to make further use of them would have a negative performance impact.
If the system is running a single application that needs access to all the system resources, the GRIO mechanism does not need to be used. Since there is no competition, the application gains nothing by reserving the resources before accessing them.
Applications negotiate with the system to make a GRIO reservation, an agreement by the system to provide a portion of the bandwidth of a system resource for a period of time. The only resources supported by GRIO are files residing within a real-time subvolume of an XFS filesystem.
A GRIO reservation is described as the number of bytes per second the application will receive from or transmit to the resource starting at a specific time and continuing for a specific period. The application issues a reservation request to the system, which either accepts or rejects the request. If the reservation is accepted, the application can begin accessing the resource at the reserved time, and it can expect that it will receive the reserved number of bytes per second throughout the time of the reservation. If the system rejects the reservation, it returns the maximum amount of bandwidth that can be reserved for the resource at the specified time. The application can determine if the available bandwidth is sufficient for its needs and issue another reservation request for the lower bandwidth, or it can schedule the reservation for a different time. The GRIO reservation continues until it expires, the file is closed, or an explicit grio_remove_request() library call is made (for more information, see the grio_remove_request(3X) reference page).
If a process has a rate guarantee on a file, any reference by that process to that file uses the rate guarantee, even if a different file descriptor is used. However, any other process that accesses the same file does so without a guarantee or must obtain its own guarantee. This is true even when the second process has inherited the file descriptor from the process that obtained the guarantee.
Sharing file descriptors between processes in a process group is supported for files used for GRIO, but the processes do not share the guarantee. If a process inherits an open file descriptor from a parent process and wants to have a rate guarantee on the file, the file must be closed and reopened before grio_request(3X) is called. If the sproc(2) system call is used with the PR_SFDS attribute to keep the open file table synchronized, the automatic removal of rate guarantees on last close of a file is not supported. The rate guarantee is removed when the reservation time expires or the process explicitly calls grio_remove_request(3X).
In addition to specifying the amount and duration of the reservation, the application must specify the type of guarantee desired. Each guarantee is a hard guarantee or a soft guarantee. Each guarantee is also a Video on Demand (VOD) guarantee or a non-VOD guarantee. The next few sections describe these types of guarantees and give an example that illustrates the differences between VOD and non-VOD guarantees.
Hard guarantees are possible only when the disks that are used for the real-time subvolume meet the requirements listed in the section “Hardware Configuration Requirements for GRIO” in this chapter.
A consequence of these disk configuration requirements is that incorrect data can occasionally be returned to the application without an error notification, but the I/O requests complete within the guaranteed time. If an application requests a hard guarantee and some part of the system configuration makes the granting of a hard guarantee impossible, the reservation is rejected. The application can then issue a reservation request with a soft guarantee.
A soft guarantee means the system tries to achieve the desired rate, but there may be circumstances beyond its control that cause it to fail. For example, if a non-real-time disk is on the same SCSI bus as real-time disks and there is a disk data error on the non-real-time disk, the driver retries the request to recover the data. This could cause the rate guarantee on the real-time disk to be missed.
VOD (Video On Demand) is a special type of rate guarantee applied to either hard or soft guarantees. It allows more streams to be supported per disk drive, but requires that the application provide careful control of when and where I/O requests are issued.
VOD guarantees are supported only when using a striped volume. The application must time multiplex the I/O requests to different drives at different times. A process stream can only access a single disk during any one second. Therefore, the stripe unit must be set to the number of kilobytes of data that the application needs to access per second per stream of data. (The stripe unit is set using xlv_make(1M) when volume elements are created.) If the process tries to access data on a different disk during a time period, it is suspended until the appropriate time period.
With VOD reservations, if the application does not read the file sequentially, but rather skips around in the file, performance suffers. For example, if disks are four-way striped, it could take as long as four seconds (one second for each disk in the stripe) for the first I/O request after a seek to complete.
Assume the system has eight disks, each supporting twenty-three 64 KB operations per second. For non-VOD GRIO, if an application needs 512 KB of data each second, the eight disks would be arranged in an eight-way stripe. The stripe unit would be 64 KB. Each application read/write operation would be 512 KB and would cause concurrent read/write operations on each disk in the stripe. The application could access any part of the file at any time, provided that the read/write operation always started at a stripe boundary. This provides 23 process streams with 512 KB of data each second.
With a VOD guarantee, the eight drives would instead be given an optimal I/O size of 512 KB. Each drive can support seven such operations each second; the higher rate (7 x 512 KB versus 23 x 64 KB) is achievable because the larger transfer size requires less seeking. Again the drives would be arranged in an eight-way stripe, but with a stripe unit of 512 KB. Each drive supports seven 512 KB streams per second, for a total of 8 * 7 = 56 streams. Each of the 56 streams is assigned a time period; there are eight different time periods with seven different processes in each, so 56 processes access data in a given time unit. At any given second, the processes in a single time period are allowed to access only a single disk.
Using a VOD guarantee more than doubles the number of streams that can be supported with the same number of disks. The trade-off is that the timing tolerances are stringent. Each stream must issue its read/write operation within its one-second period; if the process issues the call too late, the request blocks until the next time period for that process on the disk, which in this example could mean a delay of up to eight seconds. To receive the rate guarantee, the application must access the file sequentially. The time periods move sequentially down the stripe, allowing each process to access the next 512 KB of the file.
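The arithmetic behind the eight-disk example can be checked with shell arithmetic. This is only a back-of-envelope sketch of the figures quoted above (8 disks, each able to do 23 x 64 KB operations or 7 x 512 KB operations per second); it is not part of any GRIO tool.

```shell
disks=8
nonvod_streams=23              # 512 KB/s streams; every request spans all 8 disks
vod_streams=$((disks * 7))     # 7 time-multiplexed 512 KB streams per disk
nonvod_disk_kb=$((23 * 64))    # per-disk throughput, non-VOD, in KB/s
vod_disk_kb=$((7 * 512))       # per-disk throughput, VOD, in KB/s
echo "non-VOD: $nonvod_streams streams; VOD: $vod_streams streams"
```

The per-disk throughput figures (1472 KB/s non-VOD versus 3584 KB/s VOD) show where the extra streams come from: larger transfers waste less time seeking.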
The system daemon is ggd(1M). It is started from the script /etc/rc2.d/S94grio when the system is started. It is always started; unlike some other daemons, it is not turned on and off with chkconfig(1M). A lock file is created in the /tmp directory to prevent two copies of the daemon from running simultaneously. The daemon reads the GRIO configuration files /etc/grio_config and /etc/grio_disks.
/etc/grio_config describes the various I/O hardware paths on the system, starting with the system bus and ending with the individual peripherals such as disk and tape drives. It also describes the bandwidth capabilities of each component. The format of this file is described in the section “/etc/grio_config File Format” in this chapter. If you want a soft rate guarantee, you must edit this file. See step 9 in the section “Example: Setting Up an XLV Logical Volume for GRIO” in this chapter for more information.
The utility cfg(1M) is used to automatically generate an /etc/grio_config configuration file for a system's configuration. A checksum is appended to the end of the file by cfg. When the ggd daemon reads the configuration information, it validates the checksum. You can edit /etc/grio_config to tune the performance characteristics to fit a given application. See the next section, “Configuring the ggd Daemon,” for more information.
/etc/grio_disks describes the performance characteristics for the types of disk drives that may be found on the system. You can edit the file to add support for new drive types. The format of this file is described in the section “/etc/grio_disks File Format” in this chapter.
The library /usr/lib/libgrio.so contains a collection of routines that enable an application to establish a GRIO session. The library routines are the only way in which an application program can communicate with the ggd daemon.
Put only real-time subvolume volume elements on a single disk (not log or data subvolume volume elements). This configuration is recommended for soft guarantees and required for hard guarantees.
Only SCSI disks can be used for real-time subvolumes. IPI, ESDI, and other non-SCSI disks cannot be used.
For GRIO with hard guarantees, each disk used for hard guarantees must be on a controller whose disks are used exclusively for real-time subvolumes. These controllers cannot have any devices other than SCSI disks on their buses. Any other devices could prevent the disk from accessing the SCSI bus in a timely manner and cause the rate to be missed.
The drive firmware in each disk used in the real-time subvolume must have the predictive failure analysis and thermal recalibration features disabled. All disk drives have been shipped from Silicon Graphics this way since March 1994.
For hard guarantees, the disk drive retry and error correction mechanisms must be disabled for all disks used in the real-time subvolume. See the section “Disabling Disk Error Recovery” in this chapter for more information.
When possible, disks used in the real-time subvolume of an XLV volume should have the RC (read continuous) bit enabled. This allows the disks to perform faster, but at the penalty of occasionally returning incorrect data (without giving an error). Enabling the RC bit is part of the procedure described in the section “Disabling Disk Error Recovery.”
Disks used in the data and log subvolumes of the XLV logical volume must not have their retry mechanisms disabled. The data and log subvolumes contain information critical to the filesystem and cannot afford an occasional disk error.
SCSI disks in XLV logical volumes used by GRIO applications that require hard guarantees must have their parameters modified to prevent the disk from performing automatic error recovery. When the drive does error recovery, its performance degrades and there can be lengthy delays in completing I/O requests. When the drive error recovery mechanisms are disabled, occasionally invalid data is returned to the user without an error indication. Because of this, the integrity of data stored on an XLV real-time subvolume is not guaranteed.
The fx(1M) utility is used in expert mode to set the drive parameters for real-time operation. Table 5-1 shows the disk drive parameters that must be changed for GRIO.
Auto bad block reallocation (read)
Auto bad block reallocation (write)
Delay for error recovery (disabling this parameter enables the read continuous (RC) bit)
|Caution: Setting disk drive parameters must be performed correctly on approved disk drive types only. Performing the procedure incorrectly, or performing it on an unapproved type of disk drive, could severely damage the disk drive. Setting disk drive parameters should be performed only by experienced system administrators.|
fx reports the disk drive type after the controller test on a line that begins with Scsi drive type. The approved disk drive types whose parameters can be set for real-time operation are:
SGI 0664N1D 6s61
SGI 0664N1D 4I4I
The procedure for setting disk drive parameters is shown in the example below. It uses the parameters shown in Table 5-1 for a disk drive on controller 131, unit 1.
fx -x
fx version 5.3, Nov 18, 1994
fx: "device-name" = (dksc) <Enter>
fx: ctlr# = (0) 131
fx: drive# = (1) 1
fx: lun# = (0)
...opening dksc(131,1,0)
...controller test...OK
Scsi drive type == SGI 0664N1D 6s61

----- please choose one (? for help, .. to quit this menu)-----
[exi]t        [d]ebug/       [l]abel/       [b]adblock/
[exe]rcise/   [r]epartition/
fx> label
----- please choose one (? for help, .. to quit this menu)-----
[sh]ow/       [sy]nc         [se]t/         [c]reate/
fx/label> show
----- please choose one (? for help, .. to quit this menu)-----
[para]meters  [part]itions   [b]ootinfo     [a]ll
[g]eometry    [s]giinfo      [d]irectory
fx/label/show> parameters
----- current drive parameters-----
Error correction enabled
Enable data transfer on error
Don't report recovered errors
Do delay for error recovery
Don't transfer bad blocks
Error retry attempts 10
Do auto bad block reallocation (read)
Do auto bad block reallocation (write)
Drive readahead enabled
Drive buffered writes disabled
Drive disable prefetch 65535
Drive minimum prefetch 0
Drive maximum prefetch 65535
Drive prefetch ceiling 65535
Number of cache segments 4
Read buffer ratio 0/256
Write buffer ratio 0/256
Command Tag Queueing disabled
----- please choose one (? for help, .. to quit this menu)-----
[para]meters  [part]itions   [b]ootinfo     [a]ll
[g]eometry    [s]giinfo      [d]irectory
fx/label/show> ..
----- please choose one (? for help, .. to quit this menu)-----
[sh]ow/       [sy]nc         [se]t/         [c]reate/
fx/label> set
----- please choose one (? for help, .. to quit this menu)-----
[para]meters  [part]itions   [s]giinfo      [g]eometry
[m]anufacturer_params        [b]ootinfo
fx/label/set> parameters
fx/label/set/parameters: Error correction = (enabled) <Enter>
fx/label/set/parameters: Data transfer on error = (enabled) <Enter>
fx/label/set/parameters: Report recovered errors = (disabled) <Enter>
fx/label/set/parameters: Delay for error recovery = (enabled) disable
fx/label/set/parameters: Err retry count = (10) <Enter>
fx/label/set/parameters: Transfer of bad data blocks = (disabled) <Enter>
fx/label/set/parameters: Auto bad block reallocation (write) = (enabled) disable
fx/label/set/parameters: Auto bad block reallocation (read) = (enabled) disable
fx/label/set/parameters: Read ahead caching = (enabled) <Enter>
fx/label/set/parameters: Write buffering = (disabled) <Enter>
fx/label/set/parameters: Drive disable prefetch = (65535) <Enter>
fx/label/set/parameters: Drive minimum prefetch = (0) <Enter>
fx/label/set/parameters: Drive maximum prefetch = (65535) <Enter>
fx/label/set/parameters: Drive prefetch ceiling = (65535) <Enter>
fx/label/set/parameters: Number of cache segments = (4) <Enter>
fx/label/set/parameters: Enable CTQ = (disabled) <Enter>
fx/label/set/parameters: Read buffer ratio = (0/256) <Enter>
fx/label/set/parameters: Write buffer ratio = (0/256) <Enter>
* * * * * W A R N I N G * * * * *
about to modify drive parameters on disk dksc(131,1,0)! ok? yes
----- please choose one (? for help, .. to quit this menu)-----
[para]meters  [part]itions   [b]ootinfo     [a]ll
[g]eometry    [s]giinfo      [d]irectory
fx/label/set> ..
----- please choose one (? for help, .. to quit this menu)-----
[sh]ow/       [sy]nc         [se]t/         [c]reate/
fx/label> ..
----- please choose one (? for help, .. to quit this menu)-----
[exi]t        [d]ebug/       [l]abel/       [a]uto
[b]adblock/   [exe]rcise/    [r]epartition/ [f]ormat
fx> exit
label info has changed for disk dksc(131,1,0). write out changes? (yes) <Enter>
The files /etc/grio_disks, /etc/grio_config, and /etc/config/ggd.options can be modified as described below to configure and tune the ggd daemon. After any of these files have been modified, ggd must be restarted. Give these commands to restart ggd:
/etc/init.d/grio stop
/etc/init.d/grio start
When ggd is restarted, current rate guarantees are lost.
Some ways to configure and tune ggd are:
You can edit /etc/grio_config to tune the performance characteristics to fit a given application. See the section “/etc/grio_config File Format” for information about the format of this file. ggd must then be started with the -d c option so that the file checksum is not validated. To do this, create or edit the file /etc/config/ggd.options and add -d c.
Run ggd as a real-time process. If the system has more than one CPU and you are willing to dedicate an entire CPU to performing GRIO requests, add the -c cpunum option to the file /etc/config/ggd.options. This causes CPU cpunum to be marked isolated, restricted to running selected processes, and nonpreemptive. After ggd has been restarted, you can confirm that the CPU has been marked by giving this command (cpunum is 3 in this example):
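Combining the two options discussed in this section, a /etc/config/ggd.options file that disables checksum validation and dedicates CPU 3 to GRIO would contain the single line below. (The CPU number 3 is only an illustration, matching the mpadmin example; substitute your own cpunum.)

```
-d c -c 3
```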
mpadmin -s
processors: 0 1 2 3 4 5 6 7
unrestricted: 0 1 2 5 6 7
isolated: 3
restricted: 3
preemptive: 0 1 2 4 5 6 7
clock: 0
fast clock: 0
Applications using GRIO should mark their processes as real-time and runnable only on CPU cpunum. The sysmp(2) reference page explains how to do this.
To mark an additional CPU for real-time processes after ggd has been restarted, give these commands:
mpadmin -rcpunum2
mpadmin -Icpunum2
mpadmin -Ccpunum2
This section gives an example of configuring a system for GRIO as described in previous sections: creating an XLV logical volume with a real-time subvolume, making a filesystem on the volume and mounting it, and configuring and restarting the ggd daemon. It assumes that the disk partitions have been chosen following the guidelines in the section “Hardware Configuration Requirements for GRIO” and that the disk drive parameters have already been modified as described in the section “Disabling Disk Error Recovery.”
To construct the volume and filesystem, you need the following information:

vol_name, the name of the volume with a real-time subvolume.

rate, the rate at which applications using this volume will access the data. rate is the number of bytes per second per stream, divided by 1K. This information may be available in published information about the applications or from the developers of the applications. Remember that the GRIO system allows each stream to issue only one read/write request each second; the stream must obtain all the data it needs in one second from a single read call.

num_disks, the number of disks that will be included in the real-time subvolume of the volume.

stripe_unit, the amount of data written to one disk before writing to the next when the real-time disks are striped (striping is required for Video on Demand and recommended otherwise). It is expressed in 512-byte sectors.
For non-VOD guarantees: rate * 1K / (num_disks * 512)
For VOD guarantees: rate * 1K / 512

extent_size, the filesystem extent size, used as the extsize argument to mkfs.
For non-VOD guarantees: rate * 1K
For VOD guarantees: rate * 1K * num_disks

opt_IO_size, the optimal I/O size, used as the cfg -d argument.
For non-VOD guarantees, it should be an even factor of stripe_unit, but not less than 64.
For VOD guarantees: the same as rate.
Table 5-2 gives examples for the values of these variables.
Variable       Type of Guarantee   Value for This Example
vol_name       all                 Matches the last component of the device name for the volume, /dev/dsk/xlv/vol_name
rate           all                 512 (assume 512 KB per second per stream)
num_disks      all                 4 (assume 4 disks)
stripe_unit    hard or soft        512 * 1K / (4 * 512) = 256 sectors
stripe_unit    VOD hard or soft    512 * 1K / 512 = 1024 sectors
extent_size    hard or soft        512 * 1K
extent_size    VOD hard or soft    512 * 1K * 4
opt_IO_size    hard or soft        128/1 = 128 or 128/2 = 64 are possible
opt_IO_size    VOD hard or soft    Same as rate
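The example values for this volume can be cross-checked with shell arithmetic. This is only a sketch, assuming the non-VOD stripe unit is the per-disk share of the rate and the VOD stripe unit is the full rate (as in the eight-disk example earlier in this chapter); it is not a GRIO tool.

```shell
rate=512        # KB per second per stream
num_disks=4
nonvod_stripe_unit=$((rate * 1024 / (num_disks * 512)))   # 512-byte sectors
vod_stripe_unit=$((rate * 1024 / 512))                    # 512-byte sectors
nonvod_extent=$((rate * 1024))                            # bytes: rate * 1K
vod_extent=$((rate * 1024 * num_disks))                   # bytes: rate * 1K * num_disks
echo "stripe_unit: $nonvod_stripe_unit (non-VOD) / $vod_stripe_unit (VOD) sectors"
echo "extent_size: $nonvod_extent (non-VOD) / $vod_extent (VOD) bytes"
```

Note that the non-VOD stripe unit works out to 128 KB per disk, which is why 128 or 64 are the possible opt_IO_size values in the table above.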
Create an xlv_make(1M) script file that creates the XLV logical volume. (See the section “Using xlv_make to Create Volume Objects” in Chapter 4 for more information.) Example 5-1 shows an example script file for a volume.
# Configuration file for logical volume vol_name. In this
# example, data and log subvolumes are partitions 0 and 1 of
# the disk at unit 1 of controller 1. The real-time
# subvolume is partition 0 of the disks at units 1-4 of
# controller 2.
#
vol vol_name
data
plex
ve dks1d1s0
log
plex
ve dks1d1s1
rt
plex
ve -stripe -stripe_unit stripe_unit dks2d1s0 dks2d2s0 dks2d3s0 dks2d4s0
show
end
exit
Run xlv_make to create the volume:

xlv_make script_file

script_file is the xlv_make script file you created in step 2.
Make an XFS filesystem with a real-time subvolume on the volume:

mkfs -r extsize=extent_size /dev/dsk/xlv/vol_name
To mount the filesystem immediately, give these commands:
mkdir mountdir
mount /dev/dsk/xlv/vol_name mountdir
mountdir is the full pathname of the directory that is the mount point for the filesystem.
To mount the filesystem automatically, add this line to /etc/fstab:

/dev/dsk/xlv/vol_name mountdir xfs rw,raw=/dev/rdsk/xlv/vol_name 0 0
Check whether the file /etc/grio_config exists. If it exists and you see OPTSZ=65536 for each device, skip to step 9. Otherwise, create /etc/grio_config by running cfg with the optimal I/O size you chose:

cfg -d opt_IO_size
If you want a soft rate guarantee, edit /etc/grio_config and remove RT=1 from the lines for disks where software retry is required (see the section “/etc/grio_config File Format” in this chapter for more information).
/etc/init.d/grio stop
/etc/init.d/grio start
Now the user application can be started. Files created in the real-time subvolume of the volume can be accessed using guaranteed-rate I/O.
The /etc/grio_config file describes the configuration of the system I/O devices. The information in this file is used by the ggd daemon to construct a tree that describes the relationships between the components of the I/O system and their bandwidths. In order to grant a rate guarantee on a disk device, the ggd daemon checks that each component in the I/O path from the system bus to the disk device has sufficient available bandwidth.
There are two basic types of records in /etc/grio_config: component records and relationship records. Each record occupies a single line in the file. Component records describe the I/O attributes for a single component in the I/O subsystem. CPU and memory components are described in the file, as well, but do not currently affect the granting or refusal of a rate guarantee.
The format of component records is:
componentname= parameter=value parameter=value ... (descriptive text)
componentname is a text string that identifies a single piece of hardware present in the system. Some componentnames are:
The machine itself. There is always one SYSTEM component.
A CPU board in slot n. It is attached to SYSTEM.
A memory board in slot n. It is attached to SYSTEM.
An I/O board with n as its internal location identifier. It is attached to SYSTEM.
An I/O adaptor. It is attached to IOBn at location m.
SCSI controller number n. It is attached to an I/O adaptor.
Disk device m attached to SCSI controller n.
parameter can be one of the following:

OPTSZ      The optimal I/O size of the component, in bytes
NUM        The number of OPTSZ I/O requests supported by the component each second
SLOT       The backplane slot number where the component is located, if applicable (not used on all systems)
VER        The CPU type of the system (for example, IP22, IP19, and so on; not used on all systems)
NUMCPUS    The number of CPUs attached to the component (valid only for CPU components; not used on all systems)
MHZ        The MHz value of the CPU (valid only for CPU components; not used on all systems)
CTLRNUM    The SCSI controller number of the component (valid only for SCSI devices)
UNIT       The SCSI unit number of the component (valid only for SCSI devices)
RT         Set to 1 if the disk is in a real-time subvolume (remove this parameter for soft guarantees)
The value is the integer or text string value assigned to the parameter. The string enclosed in parentheses at the end of the line describes the component.
Some examples of component records taken from /etc/grio_config on an Indy™ system are shown below. Each record is a single line, even if it is shown on multiple lines here.
SYSTEM= OPTSZ=65536 NUM=5000 (IP22)
The componentname SYSTEM refers to the system bus. It supports five thousand 64 KB operations per second.
CPU= OPTSZ=65536 NUM=5000 SLOT= 0 VER=IP22 NUMCPUS=1 MHZ=100
This describes a 100 MHz CPU board in slot 0. It supports five thousand 64 KB operations per second.
CTR0= OPTSZ=65536 NUM=100 CTLRNUM=0 (WD33C93B,D)
This describes SCSI controller 0. It supports one hundred 64 KB operations per second.
DSK0U0= OPTSZ=65536 NUM=23 CTLRNUM=0 UNIT=1 (SGI SEAGATE ST31200N9278)
This describes a SCSI disk attached to SCSI controller 0 at SCSI unit 1. It supports twenty-three 64 KB operations per second.
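The NUM figures in these records are what ggd adds up along each I/O path when deciding whether to grant a guarantee. As a rough illustration (using the numbers from the component records above; the simple division below is only a sketch of the idea, not ggd's actual admission-control algorithm), a controller rated at one hundred 64 KB operations per second can fully back at most four disks rated at twenty-three operations per second each:

```shell
ctlr_ops=100   # NUM for CTR0: 64 KB operations per second
disk_ops=23    # NUM for each disk on that controller
supported=$((ctlr_ops / disk_ops))
echo "disks fully supported by the controller: $supported"
```

Beyond that point, the controller rather than the disks becomes the component that causes reservation requests to be rejected.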
The format of relationship records is:

component: attached_component1 attached_component2 ...
These records indicate that if a guarantee is requested on attached_component1, the ggd daemon must determine if component also has the necessary bandwidth available. This is performed recursively until the SYSTEM component is reached.
Some examples of relationship records taken from /etc/grio_config on an Indy system are:
SYSTEM: CPU

This describes the CPU board as being attached to the system bus.
CTR0: DSK0U0

This describes SCSI disk 1 as being attached to SCSI controller 0.
The file /etc/grio_disks contains information that describes I/O bandwidth parameters of the various types of disk drives that can be used on the system. The ggd daemon and cfg contain built-in knowledge for the disks supported by Silicon Graphics for optimal I/O sizes of 64K, 128K, 256K, and 512K. To add additional disks or to specify a different optimal I/O size, you must add additional information to the /etc/grio_disks file.
The records in /etc/grio_disks are of the form:
ADD "disk id string" optimal_iosize number_optio_per_second
The first field is always the keyword ADD. The next field is a 28-character string that is the drive manufacturer's disk ID string. The next field is an integer denoting the optimal I/O size of the device in bytes. The last field is an integer denoting the number of optimal I/O size requests that the disk can satisfy in one second.
Some examples of these records are:
ADD "SGI SEAGATE ST31200N9278" 64K 23
ADD "SGI 0664N1D 4I4I" 64K 23
Both of these disk drives support twenty-three 64 KB requests per second.
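A record like the ones above can be picked apart with standard shell string operations. This is only a parsing sketch to make the field layout concrete, not ggd's actual parser:

```shell
line='ADD "SGI SEAGATE ST31200N9278" 64K 23'
id=${line#*\"}; id=${id%\"*}   # the disk ID string between the quotes
set -- ${line##*\"}            # the fields after the closing quote
size=$1                        # optimal I/O size
count=$2                       # optimal-size requests per second
echo "id=$id size=$size count=$count"
```

The quoted disk ID must be matched exactly as the drive reports it, which is why it is treated as a single field here.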
If you change this file, you must restart ggd to have your changes take effect. See the section “Configuring the ggd Daemon” in this chapter for more information.