This chapter describes how a user defined as an NQS operator can monitor and control NQS on the local execution server. Most of these actions are performed using the commands of the qmgr utility. A user defined as an NQS manager can perform all of the actions described in this chapter.
This chapter discusses the following topics:
Overview of operator actions
Starting and stopping NQS on the local execution server
Monitoring the local NQS server, including displaying system parameters, monitoring the status of queues and requests, and displaying the list of machines currently defined in the network
Controlling the availability of local NQS queues, including enabling, disabling, starting, and stopping queues
Manipulating batch requests on the local execution server
An NQS operator can perform the following actions on the local NQS server (these actions may also be performed by an NQS manager):
Controlling the availability of NQS:
Starting up and shutting down the system
Enabling and disabling NQS queues
Starting and stopping NQS queues
Monitoring NQS by using various qmgr show commands and by using the cqstatl or qstat utility to display the following:
The current values of system parameters
The status of NQS queues
A detailed description of the characteristics of NQS queues
The status of individual requests in NQS queues
Controlling the processing of batch requests that users submit to the system:
Moving requests between queues or within queues
Holding and releasing requests in various ways
Modifying characteristics of a request
Deleting requests
The nqeinit script is responsible for starting all NQE components. It calls qstart to start the NQS component. You can also start the NQS component by itself by executing the qstart command (as follows); qstart automatically starts the other NQS processes.
% qstart options
Note: For customers who plan to install PALs on UNICOS systems that run only the NQE subset (NQS and FTA components), the qstart -c option should be used to direct the NQS console file into a directory that does not require root to write to it.
See the qstart(8) man page for a complete description of the syntax of the qstart command.
The qstart command is a shell script that starts nqsdaemon and passes qmgr startup commands to the daemon. By default, qstart performs the following functions:
Starts nqsdaemon
Initializes the system's NQS environment with commands defined in the qmgr input file
On startup, the nqeinit and qstart commands check whether the NQS database has been created. If it has not, the commands create it. During the creation of the NQS database, default values are taken from the following file:
/nqebase/etc/nqs_config
This process occurs only once, when NQS is first configured, unless you delete the NQS database structure. After the database is created, nqs_config is not used again.
During each startup of NQS, /nqebase/etc is examined for a file called NQS.startup. If the file exists, NQS uses it as input to qmgr as NQS is being started. If it does not exist, NQS is brought up and the qmgr start all command is issued. You can see a log of the startup activity in the NQS log directory ($NQE_SPOOL/log).
The directory structure under $NQE_SPOOL/log is as follows:
# cd $NQE_SPOOL
# ls
ftaqueue  log  nlbdir  spool
# ls log
nqscon.26553.out  nqslog  qstart.26553.out
Except for the nqslog file, the files have the UNIX process ID appended, which makes the files unique across invocations of qstart and qstop.
After the initial boot of NQS on a system, you should make subsequent database changes interactively using qmgr. After changes are made, you can use the qmgr snap file command to save the current configuration. To update the standard configuration file, you can replace the nqs_config file with the snap file.
Note: In order to use the qmgr snap file command, you must have NQS manager privilege.
You do not need an NQS.startup file to preserve changes made through qmgr. Any qmgr changes are written to the NQS database and preserved across shutdowns and startups. You should keep a copy of your changes in either the nqs_config or the NQS.startup file in case your database is ever corrupted.
After nqsdaemon is initialized, qstart determines whether the NQS spool database (usually in $NQE_SPOOL/spool) exists. If qstart finds that the database is missing, the /nqebase/etc/nqs_config configuration file is sent as input to qmgr. This file should contain the complete NQS database configuration, including the machine IDs, queues, and global limits. You can copy the output from a qmgr snap file to /nqebase/etc/nqs_config when you are doing an upgrade and an NQS database must be constructed. NQS executes the /nqebase/etc/NQS.startup file each time it starts up, whether or not an NQS database exists.
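For example, a minimal sketch of capturing the current configuration with snap and installing it as the new nqs_config (the snapshot path is illustrative; see the qmgr(8) man page for the exact snap syntax):

# qmgr
Qmgr: snap file = /tmp/nqs.snap
Qmgr: quit
# cp /tmp/nqs.snap /nqebase/etc/nqs_config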
The changes that you make to the /nqebase/etc/nqs_config file are not propagated automatically into the active NQS spool database. Usually, the database is in $NQE_SPOOL/spool when the system is booted; therefore, the configuration is already loaded (except when the system is initially installed or when a clean $NQE_SPOOL/spool is being used).
To ensure that the NQS daemon is running, use the following command:
# /nqebase/bin/qping -v
nqs-2550 qping: INFO
The NQS daemon is running.
The following is a sample NQS.startup file, which the site can configure. A copy of the file, as it is shown in this example, is contained in /nqebase/examples.
# USMID %Z%%M% %I% %G% %U%
#
#       NQS.startup - Qmgr commands to start up NQS.
#
#       Example of site-configurable script that resides in /nqebase/etc/config;
#       /nqebase/etc/qstart will invoke /nqebase/etc/NQS.startup unless the
#       -i option is specified.
#
#
# Set general parameters.
#
# Fewer timeouts on qstat, qmgr, etc., if nqsdaemon locked in memory.
#
lock local_daemon
#
# Site may not run accounting during nonprime time to save disk space.
#
set accounting on
#
# Could have put checkpoint directory in /tmp/nqs/chkpnt
# for a benchmark; so make sure it is in usual place
# for prime time.
#
set checkpoint_directory =(/usr/spool/nqe/spool/private/root/chkpnt)
#
# Could ensure that NQS will not fill up /usr/spool/nqe by
# setting the log_file to /tmp/nqs/nqslog.
#
set log_file =/usr/spool/nqe/log/nqslog
#
# Debug level may have been set to 3 (verbose) for testing;
# reduce log file size for prime time by reducing level.
#
set debug 0
#
# There are many NQS message types that can be turned on and
# off separately.  Set some message types on when wanting
# more information logged for areas of interest or set all
# message types on when tracking problems.  Note that
# "set message_type on all" will turn on all but "flow"
# messages.  Using "set message_type on flow" should only be
# used very selectively in controlled situations for tracking
# a very difficult problem.
#
set message_type off all
#
# NQS message headers in the log file can now be set to short
# (same as always) or long, which includes information on where
# the message is being issued from.  Use long when tracking
# problems.
#
set message_header short
#
# NQS log files can be automatically segmented based on time, size,
# or at each startup of NQS.  The directory to which NQS will
# segment the log file is specified with the "set segment directory"
# command.
#
set segment directory /usr/spool/nqe/log
#
# NQS can now automatically segment the NQS log file each time NQS
# is started.
#
set segment on_init on
#
# Name "batch" is not magic; corresponds to hypothetical pipe_queue.
# See nqs/example/qmgr.example.
#
set default batch_request queue batch
#
# Number of hours during which a pipe queue can be unreachable;
# a routing request that exceeds this limit is marked as failed.
#
set default destination_retry time 72
#
# Number of minutes to wait before trying a pipe queue destination
# that was unreachable at the time of the last attempt.
#
set default destination_retry wait 5
#
# Turn on the default return of the user's job log file.
#
set default job_log on
#
# Assumes NQS account exists on the Cray Research system
# and belongs to group root.  If -me or -mb on qsub; this is
# who mail will be from.
#
set mail nqs
#
#
# Now set global parameters.
#
set global batch_limit = 20
set global user_limit = 20
set global group_limit = 15
#
# Consider this number an "oversubscription"; it does not
# have to be the size of the machine.
#
set global memory_limit = 256Mw
#
# Consider this number an "oversubscription"; it does not
# have to be the size of the SSD.
#
set global quickfile_limit = 1Gw       # Y-MP Only
#
# The numbers in each tape group must reflect
# local site configuration.
#
set global tape_limit a = 10
set global tape_limit b = 5
set global tape_limit c = 5
set global tape_limit d = 0
set global tape_limit e = 0
set global tape_limit f = 0
set global tape_limit g = 0
set global tape_limit h = 0
#
# Maximum number of pipe requests allowed to run concurrently.
#
set global pipe_limit = 5
#
# Capture configuration status at startup; written to
# /usr/spool/nqe/log/qstart.out unless otherwise specified.
# These commands are entirely optional, but handy
# for reference if problems occur.
#
sho queues
sho parameters
sho mids
sho man
#
# Now NQS will begin scheduling jobs.
#
start all_queues
The nqestop script is responsible for stopping all NQE components. It calls qstop to stop the NQS component. You can also stop the NQS component by itself by executing the qstop(8) command as follows:
% qstop options
See the qstop(8) man page for a complete description of the syntax of the qstop command.
Caution: When you stop NQS by using the qstop command, be aware that you have not stopped NQE. The Network Load Balancer (NLB) collector daemon uses logdaemon services, and the qstop command does not stop NLB collectors; use nqestop(8) to stop all of NQE.
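For example, run as root on the execution server (a sketch):

# qstop        (stops NQS only; NLB collectors keep running)
# nqestop      (stops all NQE components)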
When the qmgr shutdown command shuts down NQS, all processes that make up a restartable running batch request on the local host are killed. On UNICOS, UNICOS/mk, and IRIX systems, NQS saves an image of the request in a file in the checkpoint directory, using the chkpnt(2) system call on UNICOS and UNICOS/mk systems and the cpr(1) command on IRIX systems.
For UNICOS, UNICOS/mk, and IRIX systems, for a batch request to be considered restartable, it must meet the recoverability criteria of job and process recovery, as described in General UNICOS System Administration, publication SG-2301 or in the Checkpoint and Restart Operation Guide, publication 007-3236-001, a Silicon Graphics publication.
When NQS is restarted, checkpointed requests are recovered from the images in their respective restart files. They resume processing from the point of suspension before the shutdown. After a request is restarted successfully, the restart file is kept until the request completes (in which case, the file is removed) or the request is checkpointed again (in which case, the file is overwritten).
The following is a sample NQS.shutdown file, which the site can configure. A copy of the file, as it is shown in this example, is contained in /nqebase/examples.
# USMID %Z%%M% %I% %G% %U%
#
#       NQS.shutdown - Qmgr commands to shut down NQS.
#
#       Example of site-configurable script that resides in /nqebase/etc/config;
#       /nqebase/etc/qstop invokes /etc/config/NQS.shutdown, unless the
#       -i option is specified.
#
stop all
#
# The "set accounting off" line has been removed from this example
# as it was found that setting accounting off at NQS.shutdown time
# turned NQS accounting off before accounting records had been given
# their appropriate terminating status and this caused problems with
# the post-processing accounting routine summaries of the accounting
# data.
#
# The 60-second grace period means shutdown will take longer than 60
# seconds.  The nqsdaemon sends a SIGSHUTDN signal to the processes
# of all running requests and then waits for the number of seconds
# specified by the grace period.  After the grace period, nqsdaemon
# attempts to checkpoint all running requests that are restartable;
# i.e., qsub option -nc NOT specified.  nqsdaemon will send SIGKILL to
# processes of requests that were not checkpointed, including those
# for which the checkpoint failed.  All checkpointed requests and
# rerunnable requests (i.e., qsub option -nr NOT specified) will be
# requeued to be restarted or rerun when NQS is next initiated.
#
shutdown 60
Several qmgr commands are available to monitor NQS; these commands all begin with show.
You also can use the NQE GUI Status display and the cqstatl or the qstat command to gather valuable information about requests and queues.
To display the current values for system parameters, use the following qmgr command:
show parameters
An example of the display follows:
Qmgr: sho parameters
sho para
     Checkpoint directory = /nqebase/version/pendulum/database/spool/private/root/chkpnt
     Debug level = 1
     Default batch_request queue = nqenlb
     Default destination_retry time = 72 hours
     Default destination_retry wait = 5 minutes
     Default return of a request's job log is OFF
     Global batch group run-limit: unspecified
     Global batch run-limit: 5
     Global batch user run-limit: 2
     Global MPP Processor Element limit: unspecified
     Global memory limit: unlimited
     Global pipe limit: 5
     Global quick-file limit: unspecified
     Global tape-drive a limit: unspecified
     Global tape-drive b limit: unspecified
     Global tape-drive c limit: unspecified
     Global tape-drive d limit: unspecified
     Global tape-drive e limit: unspecified
     Global tape-drive f limit: unspecified
     Global tape-drive g limit: unspecified
     Global tape-drive h limit: unspecified
     Job Initiation Weighting Factors:
         Fair Share Priority     = 0 (Fair Share is not enabled)
         Requested CPU Time      = 0
         Requested Memory        = 0
         Time-in-Queue           = 1
         Requested MPP CPU Time  = 0
         Requested MPP PEs       = 0
         User Specified Priority = 0
     Job Scheduling:
         Configured = nqs normal
         Active     = nqs normal
     Log_file = /nqebase/version/pendulum/database/spool/../log/nqslog
     MESSAGE_Header = Short
     MESSAGE_Types:
         Accounting        OFF   CHeckpoint        OFF   COMmand_flow      OFF
         CONfig            OFF   DB_Misc           OFF   DB_Reads          OFF
         DB_Writes         OFF   Flow              OFF   NETWORK_Misc      OFF
         NETWORK_Reads     OFF   NETWORK_Writes    OFF   OPer              OFF
         OUtput            OFF   PACKET_Contents   OFF   PACKET_Flow       OFF
         PROTOCOL_Contents OFF   PROTOCOL_Flow     OFF   RECovery          OFF
         REQuest           OFF   ROuting           OFF   Scheduling        OFF
         USER1             OFF   USER2             OFF   USER3             OFF
         USER4             OFF   USER5             OFF
     Mail account = root
     Netdaemon = /nqebase/bin/netdaemon
     Network client = /nqebase/bin/netclient
     Network retry time = 31 seconds
     Network server = /nqebase/bin/netserver
     NQS daemon accounting is OFF
     NQS daemon is not locked in memory
     Periodic_checkpoint concurrent_checkpoints = 1
     Periodic_checkpoint cpu_interval = 60
     Periodic_checkpoint cpu_status on
     Periodic_checkpoint max_mem_limit = 32mw
     Periodic_checkpoint max_sds_limit = 64mw
     Periodic_checkpoint min_cpu_limit = 60
     Periodic_checkpoint scan_interval = 60
     Periodic_checkpoint status off
     Periodic_checkpoint time_interval = 180
     Periodic_checkpoint time_status off
     SEgment Directory = NONE
     SEgment On_init = OFf
     SEgment Size = 0 bytes
     SEgment Time_interval = 0 minutes
     Sequence number for next request = 0
     Snapfile = /nqebase/version/pendulum/database/spool/snapfile
     Validation type = validation files
This information also is displayed by the following qmgr command (along with the list of managers and operators and any user access restrictions on queues):
show all
The following qmgr command displays only the global limits (a subset of the output displayed in the preceding command):
show global_parameters
All entries in the show parameters display, except for the Sequence number for next request, can be configured by qmgr commands.
Note: Some parameters are not enforced if the operating system does not support the feature, such as MPP processing elements or checkpointing.
The following list explains each entry and shows the qmgr command (in parentheses) that is used to change the entry. Most of these commands can be used only by an NQS manager; commands that can be issued by an NQS operator are prefixed by a dagger (†).
Display entry | Description |
Debug level | The level of information written to the log file (set debug). |
Default batch request queue | The queue to which requests are submitted if the user does not specify a queue (set [no_]default batch_request queue). |
Default destination retry time | The maximum time that NQS tries to send a request to a pipe queue (set default destination_retry time). |
Default destination retry wait | The interval between successive attempts to retry sending a request (set default destination_retry wait). |
Default return of request's job log | Determines whether the default action is to send a job log to users when the request is complete (set default job_log on|off). |
Global batch group run-limit | The maximum number of batch requests that all users of a group can run concurrently (†set global group_limit). |
Global batch run-limit | The maximum number of batch requests that can run simultaneously at a host (†set global batch_limit). |
Global batch user run-limit | The maximum number of batch requests any one user can run concurrently (†set global user_limit). |
Global MPP Processor Element limit | The maximum number of MPP processing elements available to all batch requests running concurrently (†set global mpp_pe_limit). Valid only on the Cray MPP systems and optionally enabled on IRIX systems (see “Enabling per Request Limits on Memory Size, CPU Time, and Number of Processors on IRIX Systems” in Chapter 5). |
Global memory limit | The maximum memory that can be allocated to all batch requests running concurrently (†set global memory_limit). |
Global pipe limit | The maximum number of requests that can be routed simultaneously at the host (†set global pipe_limit). |
Global quick-file limit | The maximum quickfile space that can be allocated to all batch requests running concurrently on NQS on UNICOS systems (†set global quickfile_limit). |
Global tape-drive x limit | The maximum number of the specified tape drives that can be allocated to all batch requests running concurrently (†set global tape_limit). |
Job Initiation Weighting Factors | The weight of the specified factor in selecting the next request initiated in a queue (set sched_factor cpu|memory|share|time|mpp_cpu|mpp_pe|user_priority). |
Job Scheduling option(s) | The job scheduling type used by nqsdaemon (set job_scheduling). |
Log file | The path name of the log file (set log_file). |
MESSAGE_Header | The NQS message header (set message_header long|short). |
MESSAGE_Types | The type of NQS messages sent to message destinations such as the NQS log file or the user's terminal (set message_type off|on). |
Mail account | The name that appears in the From: field of mail sent to users by NQS (set mail). |
Netdaemon | The name of the TCP/IP network daemon (set [no_]network daemon). |
Network client | The name of the network client process (set network client). |
Network retry time | The maximum number of seconds a network function can fail before being marked as completely failed (set network retry_time). |
Network server | The name of the network server process (set network server). |
NQS daemon accounting | Determines whether NQS daemon accounting is enabled (set accounting off|on). |
NQS daemon is/is not locked in memory | The state of the nqsdaemon process; that is, whether or not it is locked in memory (†lock local_daemon and †unlock local_daemon). |
Periodic_checkpoint concurrent_checkpoints | The maximum number of checkpoints that can occur simultaneously during a periodic checkpoint scan interval (valid only on UNICOS, UNICOS/mk, and IRIX systems) (set periodic_checkpoint concurrent_checkpoints). |
Periodic_checkpoint cpu_interval | The default CPU time interval for periodic checkpointing (valid only on UNICOS, UNICOS/mk, and IRIX systems) (set periodic_checkpoint cpu_interval). |
Periodic_checkpoint cpu_status on | Determines whether NQS periodic checkpoints are initiated based on the CPU time used by a request (valid only on UNICOS, UNICOS/mk, and IRIX systems) (set periodic_checkpoint cpu_status off|on). |
Periodic_checkpoint max_mem_limit | The maximum memory limit for a request that can be periodically checkpointed (valid only on UNICOS, UNICOS/mk, and IRIX systems) (set periodic_checkpoint max_mem_limit). |
Periodic_checkpoint max_sds_limit | The maximum SDS limit for a request that can be periodically checkpointed (valid only on UNICOS systems) (set periodic_checkpoint max_sds_limit). |
Periodic_checkpoint min_cpu_limit | The minimum CPU limit for a request that can be periodically checkpointed (valid only on UNICOS, UNICOS/mk, and IRIX systems) (set periodic_checkpoint min_cpu_limit). |
Periodic_checkpoint scan_interval | The time interval in which NQS scans running requests to find those eligible for periodic checkpointing (valid only on UNICOS, UNICOS/mk, and IRIX systems) (set periodic_checkpoint scan_interval). |
Periodic_checkpoint status off | Determines whether NQS will examine running requests to see whether they can be checkpointed and then schedule them for checkpointing (on). If the status is off, no requests are periodically checkpointed (valid only on UNICOS, UNICOS/mk, and IRIX systems) (set periodic_checkpoint status off|on). |
Periodic_checkpoint time_interval | The default wall-clock time interval for periodic checkpointing (valid only on UNICOS, UNICOS/mk, and IRIX systems) (set periodic_checkpoint time_interval). |
Periodic_checkpoint time_status off | Determines whether NQS periodic checkpoints are initiated based on wall-clock time used by a request (valid only on UNICOS, UNICOS/mk, and IRIX systems) (set periodic_checkpoint time_status off|on). |
SEgment Directory | The name of the directory containing log file segments (set segment directory). |
SEgment ON_init | Determines whether the log file is segmented at initialization (set segment on_init off|on). |
SEgment Size | The maximum size of an NQS log file before it is segmented (set segment size). |
SEgment Time_interval | The maximum time that can elapse before the NQS log file is segmented (set segment time_interval). |
Sequence number for next request | The next sequence number that will be assigned to a batch request. |
Snap_file | The default name of the file that will receive output from the snap command (set snapfile). |
Validation type | The type of user validation that will be performed on user requests (set [no_]validation). |
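For example, an operator or manager could adjust several of these parameters in a single interactive qmgr session; the values shown here are illustrative only:

# qmgr
Qmgr: set debug 1
Qmgr: set default destination_retry time 48
Qmgr: set mail nqs
Qmgr: show global_parameters
Qmgr: quit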
To display a list of the currently defined NQS managers and operators for this server, use the following qmgr command:
show managers
In the resulting display, users who are managers are indicated by a :m suffix; operators are indicated by a :o suffix, as follows:
Qmgr: show managers
show managers
root:m
snowy:m
fred:o
xyz:o
The list always includes root (as a manager).
To display a brief summary status of all currently defined batch and pipe queues, use the following qmgr command:
show queue
An example of the display follows; it is identical to that produced by the cqstatl -s or qstat -s command:
Qmgr: show queue
show queue

----------------------------- NQS BATCH QUEUE SUMMARY -----------------------------
QUEUE NAME              LIM TOT ENA STS QUE RUN
----------------------- --- --- --- --- --- ---
nqebatch                  5   0 yes  on   0   0
----------------------- --- --- --- --- --- ---
pendulum                  5   0           0   0
----------------------- --- --- --- --- --- ---

----------------------------- NQS PIPE QUEUE SUMMARY ------------------------------
QUEUE NAME              LIM TOT ENA STS QUE ROU
----------------------- --- --- --- --- --- ---
nqenlb                    1   0 yes  on   0   0
----------------------- --- --- --- --- --- ---
pendulum                  5   0           0   0
----------------------- --- --- --- --- --- ---
To display a more detailed summary, use the following qmgr command:
show long queue
An example of the display follows; it is identical to that produced by the cqstatl or qstat command.
Qmgr: show long queue
show long queue

----------------------------- NQS BATCH QUEUE SUMMARY -----------------------------
QUEUE NAME             LIM TOT ENA STS QUE RUN WAI HLD ARR EXI
---------------------- --- --- --- --- --- --- --- --- --- ---
nqebatch                 5   0 yes  on   0   0   0   0   0   0
---------------------- --- --- --- --- --- --- --- --- --- ---
latte                    5   0           0   0   0   0   0   0
---------------------- --- --- --- --- --- --- --- --- --- ---

----------------------------- NQS PIPE QUEUE SUMMARY ------------------------------
QUEUE NAME             LIM TOT ENA STS QUE ROU WAI HLD ARR DEP DESTINATIONS
---------------------- --- --- --- --- --- --- --- --- --- --- -------------
nqenlb                   1   0 yes  on   0   0   0   0   0   0
cool                     1   0 yes  on   0   0   0   0   0   0 [email protected]
---------------------- --- --- --- --- --- --- --- --- --- --- -------------
latte                    5   0           0   0   0   0   0   0
---------------------- --- --- --- --- --- --- --- --- --- --- -------------
The columns in these summary displays are described as follows:
Column name | Description |
QUEUE NAME | Indicates the name of the queue. |
LIM | Indicates the maximum number of requests that can be processed in this queue at any one time. This is the run limit that the NQS manager defines for the queue. For batch queues, LIM indicates the maximum number of requests that can execute in this queue at one time. After this limit is reached, additional requests will be queued until a request already in the queue completes execution. For pipe queues, LIM indicates the maximum number of requests that can be routed at one time. |
TOT | Indicates the total number of requests currently in the queue. |
ENA | Indicates whether the queue was enabled to accept requests. If the queue is disabled (ENA is no), the queue will not accept requests. |
STS | Indicates whether the queue has been started. If a queue was not started (STS is off), the queue will accept requests, but it will not process them. |
QUE | The number of requests that are waiting to be processed. |
ROU | The number of pipe queue requests that are currently being routed to another queue. |
RUN | The number of batch queue requests that are currently running. |
WAI | The number of requests that are waiting to be processed at a specific time. |
HLD | The number of requests the NQS operator or manager has put into the hold state. |
ARR | The number of requests currently arriving from other queues. |
DEP | The number of requests currently terminating their processing. |
DESTINATIONS | The destinations that were defined for the queue by the NQS manager. |
The last line of each summary shows the total figures for the use of the queues by users at the NQS host (pendulum and latte in the preceding examples), except for the LIM column, which shows the global pipe limit for this system.
When you use a cqstatl command rather than qmgr show commands, you can limit the display to NQS pipe queues by using the -p option, or limit the display to NQS batch queues by using the -b option.
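For example (a sketch, assuming the default NQS_SERVER is used):

% cqstatl -b      (batch queues only)
% cqstatl -p      (pipe queues only)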
To display details of all the characteristics defined for NQS queues, use the cqstatl -f or the qstat -f command. You can restrict the display to specific queues by naming them on the command line.
You must separate queue names with a space character.
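For example, to request full displays for two of the queues used in this chapter (the queue names are illustrative):

% cqstatl -f nqebatch nqenlb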
An example of a full display for the batch queue nqebatch follows:
% cqstatl -f nqebatch
-------------------------------------
NQS BATCH QUEUE: [email protected]
Status: ENABLED/INACTIVE
-------------------------------------
Priority: 30
<ENTRIES>
  Total: 0      Running: 0      Queued: 0
  Waiting: 0    Holding: 0      Arriving: 0     Exiting: 0
<RUN LIMITS>
  Queue: 5      User: unspecified       Group: unspecified
<COMPLEX MEMBERSHIP>
<RESOURCE USAGE>
                              LIMIT             ALLOCATED
  Memory Size                 unspecified       0kw
  Quick File Space            unspecified       0kw
  MPP Processor Elements      unspecified       0
<REQUEST LIMITS>
                              PER-PROCESS       PER-REQUEST
  type a Tape Drives                            unspecified
  type b Tape Drives                            unspecified
  type c Tape Drives                            unspecified
  type d Tape Drives                            unspecified
  type e Tape Drives                            unspecified
  type f Tape Drives                            unspecified
  type g Tape Drives                            unspecified
  type h Tape Drives                            unspecified
  Core File Size              unspecified
  Data Size                   unspecified
  Permanent File Space        unspecified       unspecified
  Memory Size                 unspecified       unspecified
  Nice Increment              0
  Quick File Space            unspecified       unspecified
  Stack Size                  unspecified
  CPU Time Limit              unlimited         unlimited
  Temporary File Space        unspecified       unspecified
  Working Set Limit           unspecified
  MPP Processor Elements                        unspecified
  MPP Time Limit              unspecified       unspecified
  Shared Memory Limit                           unspecified
  Shared Memory Segments                        unspecified
  MPP Memory Size             unspecified       unspecified
<ACCESS>
  Route: Unrestricted   Users: Unrestricted
<CUMULATIVE TIME>
  System Time: 904.22 secs      User Time: 59.14 secs
Some parameters are not enforced if the operating system does not support the feature, such as MPP processor elements.
An example of a full display for the pipe queue nqenlb follows:
% cqstatl -f nqenlb
----------------------------------
NQS PIPE QUEUE: [email protected]
Status: ENABLED/INACTIVE
----------------------------------
Priority: 63
<ENTRIES>
  Total: 0      Running: 0      Queued: 0
  Waiting: 0    Holding: 0      Arriving: 0     Departing: 0
<DESTINATIONS>
<SERVER>
  /nqebase/bin/pipeclient CRI_DS
<ACCESS>
  Route: Unrestricted   Users: Unrestricted
<CUMULATIVE TIME>
  System Time: 269.92 secs      User Time: 95.21 secs
The cqstatl or qstat command also can be used to display details of characteristics defined for NQS queues on other hosts. The difference is that you must give the name of the remote host where the queues are located. For example, for the cqstatl command, you could include the -h option to specify the remote host, as follows:
cqstatl -d nqs -h target_host -f queues
You also can change your NQS_SERVER environment variable to specify the remote host.
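For example, a sketch of pointing the client commands at a remote server through NQS_SERVER (the host name target_host is a placeholder):

% setenv NQS_SERVER target_host       (csh/tcsh syntax)
$ export NQS_SERVER=target_host       (sh/ksh syntax)
% cqstatl -f queues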
You can display a summary of the requests currently in the queue if you include the queue-name parameter in the qmgr commands show queue and show long queue:
show queue queue-name
show long queue queue-name
If no requests are currently in the queue, the following message is displayed:
nqs-2224 qmgr: CAUTION no requests in queue queue-name
You can restrict the display to the requests in the queue that were submitted by a specific user by including the name of that user after the queue-name argument:
show queue queue-name username
show long queue queue-name username
An example of the display for a batch queue follows:
Qmgr: show long queue nqebatch
show long queue nqebatch

------------------------------- NQS BATCH REQUEST SUMMARY -------------------------------
IDENTIFIER  NAME   USER     LOCATION/QUEUE          JID    PRTY REQMEM REQTIM ST
----------- ------ -------- ----------------------- ------ ---- ------ ------ ---
39.latte    STDIN  mary     [email protected]   10644  ---  262144 **     R
The columns in this display have the following descriptions:
The show queue display has much of this information, but it does not contain the USER, QUEUE, JID, and PRTY fields.
The following example display is a summary of all requests in the NQS pipe queue nqenlb:
Qmgr: show long queue nqenlb
show long queue nqenlb

------------------------------ NQS PIPE REQUEST SUMMARY ------------------------------
IDENTIFIER     NAME     OWNER    USER     LOCATION/QUEUE          PRTY ST
-------------- -------- -------- -------- ----------------------- ---- ---
40.latte       STDIN    1201     mary     [email protected]   1    Q
The columns in this display have the following meanings:
Column name | Description |
IDENTIFIER | The request identifier (as displayed when the request was submitted). |
NAME | The name of the script file, or stdin if the request was created from standard input. |
OWNER | The user name under which the request was submitted. |
USER | The user name under which the request will be executed. |
LOCATION/QUEUE | The queue in which the request currently resides. |
PRTY | The intraqueue priority of the request. |
ST | An indication of the current state of the request (in this case, Q means the request is queued and ready to be routed to a batch queue). |
The show queue display has much of this information in it, but it does not contain the OWNER, USER, QUEUE, and PRTY fields.
A list of all requests that run in all NQS queues can be displayed by root or an NQS manager (otherwise, the display shows only the submitting user's requests). Use the cqstatl or qstat command with the -a option; for example:
cqstatl -a
To display a list of NQS systems that are currently defined in the machine ID (mid) database, use the following qmgr command:
show mid
An example of the display follows:
Qmgr: sho mid
sho mid

MID       PRINCIPAL NAME   AGENTS   ALIASES
--------  --------------   ------   -------
10111298  latte            nqs      latte.cray.com
10653450  pendulum         nqs
10671237  gust             nqs      gust.cray.com
This display shows that three systems are currently defined; the descriptions of these entries are explained in “Defining NQS Hosts in the Network” in Chapter 5.
The following qmgr command displays information on a specific mid or host:
show mid hostname|mid
An example of such a display follows:
Qmgr: show mid latte
show mid latte

MID       PRINCIPAL NAME   AGENTS   ALIASES
--------  --------------   ------   -------
10111298  latte            nqs      latte.cray.com
Before an NQS queue can receive and process requests submitted by a user, it must meet the following requirements:
The queue must be enabled so that it can receive requests. If a queue is disabled, it can still process requests already in the queue, but it cannot receive any new requests.
The queue must be started so that it can process requests that are waiting in the queue. If a queue is stopped, it can still receive new requests from users, but it will not try to process them.
The NQS operator (or manager) can control these states, thereby controlling the availability of a queue.
For example, you can disable all queues before shutting down the NQS system and prevent users from submitting any additional requests. The requests that are already in the queues will be processed.
Another example might be when no destination for a pipe queue will be available for some time (for example, the destination host may not be functioning). You can stop the queue rather than having NQS repeatedly try to send requests and deplete system resources. There is a system limit to the amount of time a request can wait in a queue.
To enable a queue and allow requests to be placed in it, use the following qmgr command:
enable queue queue-name
If the queue is already enabled, no action is performed.
To disable a queue and prevent any additional requests from being placed in it, use the following qmgr command:
disable queue queue-name
If the queue is already disabled, no action is performed. If requests are already in the queue, these can still be processed (if the queue has been started).
An NQS manager also can prevent or enable individual users' access to a queue (see “Restricting User Access to Queues” in Chapter 5).
In the following example, you want to prevent users from sending requests to the NQS queue called express for a few hours. To disable the queue, use the following qmgr command:
disable queue express
When you want to enable the queue, use the following qmgr command:
enable queue express
To start an NQS queue and allow it to process any requests waiting in it, use the following qmgr command:
start queue queue-name
If the queue is already started, no action is performed.
To start all NQS queues, use the following qmgr command:
start all_queues
To stop a queue and prevent any further requests in it from being processed, use the following qmgr command:
stop queue queue-name
If the queue is already stopped, no action is performed. A request that had already begun processing before the command was issued is allowed to complete processing. All other requests in the queue, and any new requests placed in the queue, are not processed until the queue is started.
To stop all NQS queues, use the following qmgr command:
stop all_queues
You can also suspend the processing of a specific request in a queue (instead of the entire queue); see “Suspending, Resuming, and Rerunning Executing Requests”.
In the following example, you can stop all NQS queues except the queue called cray1-std, by first stopping all queues and then restarting cray1-std, as follows:
stop all_queues
start queue cray1-std
The following qmgr command starts all of the queues:
start all_queues
You can perform certain actions by using qmgr commands on requests that were submitted by users. To delete a request, you must either have operator privileges on NQS or be the owner of the request.
To move a request from one NQS queue to another, use the following qmgr command:
move request = request queue_name
A request can be moved only if it has not yet started processing (that is, has not yet started being transferred to another queue).
The request argument is the request identifier. It corresponds to the value displayed under the IDENTIFIER column of the summary request display (see “Displaying Summary Status of User Requests in NQS Queues”). The queue_name argument is the name of the queue to hold the request.
To move more than one request, use the following qmgr command:
move request = (requests) queue_name
You must separate request identifiers with a space or a comma and enclose them all in parentheses.
To move all requests that have not started processing in one queue to another queue, use the following qmgr command:
move queue from_queue_name to_queue_name
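The following sketch combines these forms; the request identifiers and queue names are illustrative only:

Qmgr: move request = 39 nqebatch
Qmgr: move request = (40, 41) nqebatch
Qmgr: move queue nqenlb nqebatch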
The following qmgr command schedules queued requests manually, changing their position in a queue:
schedule request request(s) [order]
The request argument is one or more request identifiers of the request to be rescheduled. If you specify more than one request, you must enclose them all in parentheses and separate them with a space character or a comma.
The order argument can be one of the following:
first, before all other requests in the queue, including restarted or rerun requests
next, after all other qmgr scheduled requests in the queue
now, immediately initiating the request and bypassing all NQS limit checks
system, moving a previously scheduled request back to system-controlled scheduling
You can use the schedule command only on queued requests.
If a queue has a higher interqueue priority, the requests in that queue are usually initiated before those in queues with a lower interqueue priority.
The following example repositions the requests 100, 101, and 102 to the order 101, 102, and then 100:
schedule request 100 next
schedule request 101 now
schedule request 102 first
Request 101 is scheduled immediately, before all other queued requests.
Request 102 is scheduled to go before all other requests scheduled by qmgr (but not before those scheduled with the schedule now command, which take precedence).
Request 100 is scheduled to go after all requests scheduled by qmgr, but before system-scheduled requests.
To hold specific queued requests so that they are not processed, use the following qmgr command:
hold request request(s) [grace-period]
The request argument is one or more request identifiers of the requests to be held. If you specify more than one request, you must enclose them all in parentheses and separate them with a space character or a comma.
The grace-period argument is the number of seconds between when the command is issued and when the request is actually held; the default is 60 seconds.
Held requests become ineligible for processing; they are removed from the run queue and their NQS resources are released. On UNICOS, UNICOS/mk, and IRIX systems, requests that are running can be held; they are checkpointed and remain held until they are released. On other platforms, running requests cannot be held because they cannot be checkpointed.
You can hold more than one request at a time. Until it is released, you cannot process a held request (for instance, move it to another queue).
To release requests that are held, use the following qmgr command:
release request request(s)
The qmgr command release request does not affect any request that is in a state other than being held. No grace period exists for releasing a request.
In the following example, the request with identifier 123 is immediately held to prevent it from being routed to a destination by using the following qmgr command:
hold request 123 0
When you want to begin processing the request, release the request by using the following qmgr command:
release request 123
To suspend specific requests that are running and make their processes ineligible for execution, use the following qmgr command:
suspend request request(s)
The request argument is one or more request identifiers of the requests to be suspended. If you specify more than one request, you must enclose them all in parentheses and separate them with a space character or a comma.
The resources of the request are not released with the suspend command.
You can suspend more than one request at a time.
When you suspend a request, its processes are sent a SIGSTOP signal; when you resume processing of the request, a SIGCONT signal is sent. On UNICOS and UNICOS/mk systems, NQS uses the suspend(2) system call.
You cannot process a suspended request (for instance, route it to another queue) until it is resumed.
To resume requests that are suspended, use the following qmgr command:
resume request request(s)
The resume request command does not affect any request that is in a state other than suspension.
To abort and immediately requeue a running request, use the following qmgr command:
rerun request request(s)
This command kills processes of the request and the request is returned to the queued state at its current priority. If a request is not running or cannot be rerun, no action is taken. Users can specify that a request cannot be rerun by using the NQE GUI Submit display, the cqsub or qsub command with the -nr option, or the qalter -r command.
In the following example, the request with identifier 222 is suspended and its processes become ineligible for execution:
suspend request 222
When you want processing of the request to continue, resume the request by using the following qmgr command:
resume request 222
The following qmgr command changes characteristics of a request in a queue:
modify request request option=limit
The request argument is the request identifier of the request to be modified.
The limit argument overrides the limit specified when the request was submitted and overrides the default limit defined for the batch queue in which the request is to be executed. NQS considers the new limit for selecting the batch queue in which the request will execute.
The option can set the following per-process and per-request attributes; the actual name of the option is given in parentheses:
Attribute | Description | |
CPU time limit | Per-request and per-process CPU time that can be used; on CRAY T3E MPP systems, this limit applies to command PEs (commands and single-PE applications) (rtime_limit and ptime_limit). | |
priority | NQS user specified priority of a request. To enable user specified priority for job initiation scheduling, you should set the user_priority weighting factor to nonzero and set all of the other weighting factors to 0. Then, only user specified priority will be used in determining a job's intraqueue priority for job initiation. | |
Memory limit | Per-request and per-process memory that can be used (rmemory_limit and pmemory_limit). | |
Nice value | Execution priority of the processes that form the batch request; specified by a nice increment between 1 and 19 (nice_limit). | |
Per-request permanent file space limit | Per-request permanent file space that can be used (rpermfile_limit). | |
Per-process permanent file space limit | Per-process permanent file space that can be used (ppermfile_limit). | |
Per-request MPP processing elements (CRAY T3D systems) or MPP application processing elements (CRAY T3E systems) or number of processors (IRIX systems) limit; optionally enabled on IRIX systems, see “Enabling per Request Limits on Memory Size, CPU Time, and Number of Processors on IRIX Systems” in Chapter 5 | Per-request Cray MPP processing elements (PEs) limit (-l mpp_p). | |
Per-process Cray MPP residency time limit | For CRAY T3D MPP systems, sets the per-process wallclock residency time limit, or for CRAY T3E MPP systems, sets the per-process CPU time limit for application PEs (multi-PE applications) for subsequent acceptance of a batch request into the specified batch queue (-l p_mpp_time_limit) . | |
Per-request Cray MPP residency time limit | For CRAY T3D MPP systems, sets the per-request wallclock residency time limit, or for CRAY T3E MPP systems, sets the per-request CPU time limit for application PEs (multi-PE applications) for subsequent acceptance of a batch request into the specified batch queue (-l r_mpp_time_limit) . | |
Per-request quick-file size limit | (UNICOS systems with SSDs) The user's per-process secondary data segments (SDS) limit, which is set to the value of the user's user database (UDB) entry for per-process SDS limits (-lQ). | |
Per-request shared memory size limit | Per-request shared memory size limit (-l shm_limit) . | |
Per-request shared memory segment limit | Per-request shared memory segment limit (-l shm_segment). | |
MPP memory size | Per-process and Per-request MPP application PE memory size limits (-l p_mpp_m,mpp_m). |
You can add a suffix to the memory limit using one of the following characters (these are case sensitive), which denote the units of size. A word is 8 bytes on UNICOS, UNICOS/mk, and 64-bit IRIX systems, and a word is 2 bytes on other supported systems.
Suffix | Units |
b | Bytes (default) |
w | Words |
kb | Kilobytes (2^10 bytes) |
kw | Kilowords (2^10 words) |
mb | Megabytes (2^20 bytes) |
mw | Megawords (2^20 words) |
gb | Gigabytes (2^30 bytes) |
gw | Gigawords (2^30 words) |
In this example, a request with identifier 1340 is waiting to execute in an NQS queue. The submitter of the request wants to increase the memory allocated to the request to 500 Mwords. To do this, you can enter the following qmgr command:
modify request 1340 rmemory_limit=500mw
The following qmgr commands delete a request that is in an NQS queue:
abort request, which sends a SIGKILL signal to each of the processes of a running request.
delete request, which deletes requests that are not currently running.
In either case, the request is removed from the queue and cannot be retrieved. The original script file of the request is unaffected by these commands.
To delete all requests in a particular NQS queue, use one of the following qmgr commands:
abort queue, which sends a SIGKILL signal to all running requests in the specified queue
purge queue, which purges all waiting, held, and queued requests in the specified queue but allows running requests to complete
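For example, the following sketch applies both commands to the queue nqebatch (the queue name is illustrative); abort queue also kills running requests, whereas purge queue lets them complete:

Qmgr: abort queue nqebatch
Qmgr: purge queue nqebatch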
To delete requests using the NQE GUI Status window, select the request to be deleted by clicking on the request line; then select Delete from the Actions menu.
You can also use the cqdel or the qdel command to delete requests; an example of the syntax follows:
cqdel -u username requestids
The username argument is the name of the user who submitted the request.
After a request is in an NQS queue at a remote system, you can use the NQE GUI Status window or the cqdel or qdel command to delete or signal the request (the qmgr subcommands affect only requests in local NQS queues). You must authenticate yourself by entering the correct password or having an entry in your .rhosts or .nqshosts file and in the .rhosts or .nqshosts file of the job's owner. You do not have any special privileges to delete other users' requests on remote NQS systems, even if you are an NQS operator or manager on both the local and the remote host.
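For example, a minimal sketch of deleting your own request 40.latte on a remote server, assuming NQS_SERVER is used to select the server (the host name and request identifier are illustrative):

% setenv NQS_SERVER remote_host
% cqdel 40.latte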
Note: If you are using the NQE database, you only need system access to the NQE database from your remote host. For more information, see Chapter 9, “NQE Database”, and the nqedbmgr(8) man page.
To delete requests using the NQE GUI Status window, select the request to be deleted by clicking on the request line; then select Delete Job from the Actions menu. For more information about using the cqdel and qdel commands to delete or signal requests on NQS, see the cqdel(1) and qdel(1) man pages.
When a process within a job has been terminated by a SIGUME or SIGRPE signal or a SIGPEFAILURE signal (UNICOS/mk systems only), NQS requeues the job rather than deleting it if either of the following is true:
The job is rerunnable
The job is restartable and has a restart file
Applications running on a CRAY T3E system are killed when a PE assigned to the application goes down. NQS is now notified when a job is terminated by a SIGPEFAILURE signal. NQS will requeue the job and either restart or rerun the job, as applicable.
Periodic checkpointing should be enabled so that restart files are available and jobs terminated by a SIGPEFAILURE signal (a downed PE) can be restarted rather than rerun.
By default, each NQS job is both rerunnable and restartable. These defaults can be changed through use of the qsub -nr and -nc options and through use of the qalter -r n and -c n options. The job's owner can specify the qsub options and use the qalter command to modify the job rerun and/or restart attributes. An administrator can also use the qalter command to modify any job's rerun and/or restart attributes.
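For example, a hedged sketch of marking a job neither rerunnable nor restartable (the request identifier is illustrative; see the qalter man page for the exact option placement):

% qalter -r n -c n 39.latte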
If NQS requeues a job because the job was terminated by either the SIGUME or SIGRPE signals, the following message is written into the system syslog, the NQS log file, and the job's log file:
Request <1.subzero>: Request received SIGRPE or SIGUME signal; request requeued.
If NQS requeues a job because the job was terminated by the SIGPEFAILURE signal, the following message is written into the system syslog, the NQS log file, and the job's log file:
Request <1.subzero>: Request received SIGPEFAILURE signal; request requeued.
As a requeued job, the job will be reinitiated after it is selected by the NQS scheduler. The qmgr schedule request now command can be used to force the job to be reinitiated immediately.
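For example, using the request identifier from the message above:

Qmgr: schedule request 1.subzero now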
The following actions can be expected when a job is terminated by either a SIGRPE or SIGUME signal or a SIGPEFAILURE signal (UNICOS/mk systems only):
For a job that has default rerun and restart attributes, the job is requeued and rerun.
For a job that has default rerun and restart attributes and has a restart file associated with it, the job is requeued and restarted from the restart file.
For a job that has the no-rerun attribute and has no restart file, the job is deleted.
For a job that has the no-rerun attribute but does have a restart file, the job is requeued and restarted from the restart file.
For a job that has the no-restart attribute and uses the default rerun attribute, the job is requeued and rerun.
For a job that has the no-rerun and no-restart attributes, the job is deleted.