Chapter 6. Operating NQS

This chapter describes how a user defined as an NQS operator can monitor and control NQS on the local execution server. Most of these actions are performed using the commands of the qmgr utility. A user defined as an NQS manager can perform all of the actions described in this chapter.

This chapter discusses the following topics:

  • Overview of Operator Actions

  • Starting NQS

  • NQS Shutdown and Restart

  • Monitoring NQS

  • Controlling the Availability of NQS Queues

  • Manipulating Requests

  • Recovering Jobs Terminated Because of Hardware Problems

Overview of Operator Actions

An NQS operator can perform the following actions on the local NQS server (these actions may also be performed by an NQS manager):

  • Controlling the availability of NQS:

    • Starting up and shutting down the system

    • Enabling and disabling NQS queues

    • Starting and stopping NQS queues

  • Monitoring NQS by using various qmgr show commands and by using the cqstatl or qstat utility to display the following:

    • The current values of system parameters

    • The status of NQS queues

    • A detailed description of the characteristics of NQS queues

    • The status of individual requests in NQS queues

    To display the detailed characteristics of NQS queues, you must use the cqstatl or qstat command.

  • Controlling the processing of batch requests that users submit to the system:

    • Moving requests between queues or within queues

    • Holding and releasing requests in various ways

    • Modifying characteristics of a request

    • Deleting requests

Starting NQS

The nqeinit script is responsible for starting all NQE components; it calls qstart to start the NQS component. You can also start NQS by itself by executing the qstart command (as follows), which automatically starts the other NQS processes:

% qstart options


Note: For customers who plan to install PALs on UNICOS systems that run only the NQE subset (NQS and FTA components), the qstart -c option should be used to direct the NQS console file into a directory that does not require root to write to it.

See the qstart(8) man page for a complete description of the syntax of the qstart command.
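
For example, a site in the situation described in the note above could start NQS with the console file directed to a directory that is writable without root privilege; the directory shown here is hypothetical:

% qstart -c $NQE_SPOOL/console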

The qstart command is a shell script that starts nqsdaemon and passes qmgr startup commands to the daemon. By default, qstart performs the following functions:

  • Starts nqsdaemon

  • Initializes the system's NQS environment with commands defined in the qmgr input file

On startup, the nqeinit and qstart commands check whether the NQS database has been created. If it has not, the commands create it. During the creation of the NQS database, default values are taken from the following file:

/nqebase/etc/nqs_config

This process occurs only once, when NQS is first configured, unless you delete the NQS database structure. After the database is created, nqs_config is not used again.

During each startup of NQS, /nqebase/etc is examined for a file called NQS.startup. If the file exists, NQS uses it as input to qmgr as NQS is being started. If it does not exist, NQS is brought up and the qmgr start all command is issued. You can see a log of the startup activity in the NQS log directory ($NQE_SPOOL/log).

The directory structure under $NQE_SPOOL/log is as follows:

# cd $NQE_SPOOL
# ls
ftaqueue   log    nlbdir   spool
# ls log
nqscon.26553.out  nqslog      qstart.26553.out

Except for the nqslog file, the files have the UNIX process ID appended, which makes the files unique across invocations of qstart and qstop.

After the initial boot of NQS on a system, you should make subsequent database changes interactively using qmgr. After changes are made, you can use the qmgr snap file command to save the current configuration. To update the standard configuration file, you can replace the nqs_config file with the snap file.


Note: In order to use the qmgr snap file command, you must have NQS manager privilege.

You do not need an NQS.startup file to preserve changes made through qmgr; any qmgr changes are written to the NQS database and preserved across shutdowns and startups. However, you should keep a copy of your changes in either the nqs_config or NQS.startup file in case your database is ever corrupted.
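
The following interactive sketch illustrates this workflow; the snapshot file name is arbitrary, and the exact form of the snap command is described on the qmgr(8) man page:

# qmgr
Qmgr: snap file = /tmp/nqs.snap
Qmgr: quit
# cp /tmp/nqs.snap /nqebase/etc/nqs_config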

After nqsdaemon is initialized, qstart determines whether the NQS spool database (usually in $NQE_SPOOL/spool) exists. If qstart finds that the database is missing, the /nqebase/etc/nqs_config configuration file is sent as input to qmgr. This file should contain the complete NQS database configuration, including the machine IDs, queues, and global limits. You can copy the output from a qmgr snap file to /nqebase/etc/nqs_config when you are doing an upgrade and an NQS database must be constructed. NQS executes the /nqebase/etc/NQS.startup file each time it starts up, whether or not an NQS database exists.

The changes that you make to the /nqebase/etc/nqs_config file are not propagated automatically into the active NQS spool database. Usually, the database is in $NQE_SPOOL/spool when the system is booted; therefore, the configuration is already loaded (except when the system is initially installed or when a clean $NQE_SPOOL/spool is being used).

To ensure that the NQS daemon is running, use the following command:

# /nqebase/bin/qping -v
nqs-2550 qping: INFO
  The NQS daemon is running.

The following is a sample NQS.startup file, which the site can configure. A copy of the file, as it is shown in this example, is contained in /nqebase/examples.

# USMID %Z%%M%  %I%     %G% %U%
#
#   NQS.startup - Qmgr commands to start up NQS.
#
#   Example of site-configurable script that resides in /nqebase/etc/config;
#   /nqebase/etc/qstart will invoke /nqebase/etc/NQS.startup unless the
#   -i option is specified.
#
#
#   Set general parameters.
#
#   Fewer timeouts on qstat, qmgr, etc., if nqsdaemon locked in memory.
#
lock local_daemon
#
#   Site may not run accounting during nonprime time to save disk space.
#
set accounting on
#
#   Could have put checkpoint directory in /tmp/nqs/chkpnt
#   for a benchmark; so make sure it is in usual place
#   for prime time.
#
set checkpoint_directory =(/usr/spool/nqe/spool/private/root/chkpnt)
#
#   Could ensure that NQS will not fill up /usr/spool/nqe by
#   setting the log_file to /tmp/nqs/nqslog.
#
set log_file =/usr/spool/nqe/log/nqslog
#
#   Debug level may have been set to 3 (verbose) for testing;
#   reduce log file size for prime time by reducing level.
#
set debug 0
#
#   There are many NQS message types that can be turned on and
#   off separately. Set some message types on when wanting
#   more information logged for areas of interest or set all
#   message types on when tracking problems. Note that
#   "set message_type on all" will turn on all but "flow"
#   messages. Using "set message_type on flow" should only be
#   used very selectively in controlled situations for tracking
#   a very difficult problem.
#
set message_type off all
#
#   NQS message headers in the log file can now be set to short
#   (same as always) or long, which includes information on where
#   the message is being issued from. Use long when tracking
#   problems.
#
set message_header short
#
#   NQS log files can be automatically segmented based on time, size,
#   or at each startup of NQS. The directory to which NQS will
#   segment the log file is specified with the "set segment directory"
#   command.
#
set segment directory /usr/spool/nqe/log
#
#   NQS can now automatically segment the NQS log file each time NQS
#   is started.
#
set segment on_init on
#
#   Name "batch" is not magic; corresponds to hypothetical pipe_queue.
#   See nqs/example/qmgr.example.
#
set default batch_request queue batch
#
#   Number of hours during which a pipe queue can be unreachable;
#   a routing request that exceeds this limit is marked as failed.
#
set default destination_retry time 72
#
#   Number of minutes to wait before trying a pipe queue destination
#   that was unreachable at the time of the last attempt.
#
set default destination_retry wait 5
#
#   Turn on the default return of the user's job log file.
#
set default job_log on
#
#   Assumes NQS account exists on the Cray Research system
#   and belongs to group root. If -me or -mb on qsub; this is
#   who mail will be from.
#
set mail nqs
#
#
#   Now set global parameters.
#
set global batch_limit = 20
set global user_limit = 20
set global group_limit = 15
#
#   Consider this number an "oversubscription"; it does not
#   have to be the size of the machine.
#
set global memory_limit = 256Mw
#
#   Consider this number an "oversubscription"; it does not
#   have to be the size of the SSD.
#
set global quickfile_limit = 1Gw   # Y-MP Only
#
#   The numbers in each tape group must reflect
#   local site configuration.
#
set global tape_limit a = 10
set global tape_limit b = 5
set global tape_limit c = 5
set global tape_limit d = 0
set global tape_limit e = 0
set global tape_limit f = 0
set global tape_limit g = 0
set global tape_limit h = 0
#
#   Maximum number of pipe requests allowed to run concurrently.
#
set global pipe_limit = 5
#
#   Capture configuration status at startup; written to
#   /usr/spool/nqe/log/qstart.out unless otherwise specified.
#   These commands are entirely optional, but handy
#   for reference if problems occur.
#
sho queues
sho parameters
sho mids
sho man
#
#    Now NQS will begin scheduling jobs.
#
start all_queues

NQS Shutdown and Restart

The nqestop script is responsible for stopping all NQE components. It calls qstop to stop the NQS component. You can also stop an individual NQS process by executing the qstop(8) command as follows:

% qstop options

See the qstop(8) man page for a complete description of the syntax of the qstop command.


Caution: When you stop NQS by using the qstop command, you should be aware that you have not stopped NQE. The Network Load Balancer (NLB) collector daemon uses logdaemon services. The qstop command does not stop NLB collectors; use nqestop(8) to stop all of NQE.

When the qmgr shutdown command shuts down NQS, all processes that make up a restartable running batch request on the local host are killed. On UNICOS, UNICOS/mk, and IRIX systems, NQS saves an image of the request in a file in the checkpoint directory, using the chkpnt(2) system call on UNICOS and UNICOS/mk systems and the cpr(1) command on IRIX systems.

For UNICOS, UNICOS/mk, and IRIX systems, for a batch request to be considered restartable, it must meet the recoverability criteria of job and process recovery, as described in General UNICOS System Administration, publication SG-2301 or in the Checkpoint and Restart Operation Guide, publication 007-3236-001, a Silicon Graphics publication.

When NQS is restarted, checkpointed requests are recovered from the images in their respective restart files. They resume processing from the point of suspension before the shutdown. After a request is restarted successfully, the restart file is kept until the request completes (in which case, the file is removed) or the request is checkpointed again (in which case, the file is overwritten).

The following is a sample NQS.shutdown file, which the site can configure. A copy of the file, as it is shown in this example, is contained in /nqebase/examples.

# USMID %Z%%M%  %I%     %G% %U%
#
#   NQS.shutdown - Qmgr commands to shut down NQS.
#
#   Example of site-configurable script that resides in /nqebase/etc/config;
#   /nqebase/etc/qstop invokes /etc/config/NQS.shutdown, unless the
#   -i option is specified.
#
stop all
#
#   The "set accounting off" line has been removed from this example
#   as it was found that setting accounting off at NQS.shutdown time
#   turned NQS accounting off before accounting records had been given
#   their appropriate terminating status and this caused problems with
#   the post-processing accounting routine summaries of the accounting
#   data.
#
#   The 60-second grace period means shutdown will take longer than 60
#   seconds. The nqsdaemon sends a SIGSHUTDN signal to the processes
#   of all running requests and then waits for the number of seconds
#   specified by the grace period. After the grace period, nqsdaemon
#   attempts to checkpoint all running requests that are restartable;
#   i.e., qsub option -nc NOT specified. nqsdaemon will send SIGKILL to
#   processes of requests that were not checkpointed, including those
#   for which the checkpoint failed. All checkpointed requests and
#   rerunnable requests (i.e., qsub option -nr NOT specified) will be
#   requeued to be restarted or rerun when NQS is next initiated.
#
shutdown 60


Monitoring NQS

Several qmgr commands are available to monitor NQS; these commands all begin with show.

You also can use the NQE GUI Status display and the cqstatl or the qstat command to gather valuable information about requests and queues.

Displaying System Parameters

To display the current values for system parameters, use the following qmgr command:

show parameters

An example of the display follows:

Qmgr: sho parameters
sho para

  Checkpoint directory =
/nqebase/version/pendulum/database/spool/private/root/chkpnt
  Debug level = 1
  Default batch_request queue = nqenlb
  Default destination_retry time = 72 hours
  Default destination_retry wait = 5 minutes
  Default return of a request's job log is OFF
  Global batch group run-limit: unspecified
  Global batch run-limit:       5
  Global batch user run-limit: 2
  Global MPP Processor Element limit:   unspecified
  Global memory limit:          unlimited
  Global pipe limit:            5
  Global quick-file limit:      unspecified
  Global tape-drive a limit:    unspecified
  Global tape-drive b limit:    unspecified
  Global tape-drive c limit:    unspecified
  Global tape-drive d limit:    unspecified
  Global tape-drive e limit:    unspecified
  Global tape-drive f limit:    unspecified
  Global tape-drive g limit:    unspecified
  Global tape-drive h limit:    unspecified
  Job Initiation Weighting Factors:
 Fair Share Priority = 0    (Fair Share is not enabled)
    Requested CPU Time  = 0
    Requested Memory    = 0
    Time-in-Queue       = 1
    Requested MPP CPU Time  = 0
    Requested MPP PEs       = 0
    User Specified Priority = 0
Job Scheduling:   Configured = nqs normal   Active = nqs normal
  Log_file = /nqebase/version/pendulum/database/spool/../log/nqslog
  MESSAGE_Header = Short
MESSAGE_Types:
    Accounting         OFF    CHeckpoint         OFF    COMmand_flow       OFF
    CONfig             OFF    DB_Misc            OFF    DB_Reads           OFF
    DB_Writes          OFF    Flow               OFF    NETWORK_Misc       OFF
    NETWORK_Reads      OFF    NETWORK_Writes     OFF    OPer               OFF
    OUtput             OFF    PACKET_Contents    OFF    PACKET_Flow        OFF
    PROTOCOL_Contents  OFF    PROTOCOL_Flow      OFF    RECovery           OFF
    REQuest            OFF    ROuting            OFF    Scheduling         OFF
    USER1              OFF    USER2              OFF    USER3              OFF
    USER4              OFF    USER5              OFF
  Mail account = root
  Netdaemon = /nqebase/bin/netdaemon
  Network client = /nqebase/bin/netclient
  Network retry time = 31 seconds
  Network server = /nqebase/bin/netserver
  NQS daemon accounting is OFF
NQS daemon is not locked in memory
  Periodic_checkpoint concurrent_checkpoints = 1
  Periodic_checkpoint cpu_interval = 60
  Periodic_checkpoint cpu_status on
  Periodic_checkpoint max_mem_limit = 32mw
  Periodic_checkpoint max_sds_limit = 64mw
  Periodic_checkpoint min_cpu_limit = 60
  Periodic_checkpoint scan_interval = 60
  Periodic_checkpoint status off
  Periodic_checkpoint time_interval = 180
  Periodic_checkpoint time_status off
  SEgment Directory = NONE
  SEgment On_init = OFf
  SEgment Size = 0 bytes
  SEgment Time_interval = 0 minutes
  Sequence number for next request = 0
  Snapfile = /nqebase/version/pendulum/database/spool/snapfile
  Validation type = validation files

This information also is displayed by the following qmgr command (along with the list of managers and operators and any user access restrictions on queues):

show all

The following qmgr command displays only the global limits (a subset of the output displayed in the preceding command):

show global_parameters

All entries in the show parameters display, except for the Sequence number for next request, can be configured by qmgr commands.


Note: Some parameters are not enforced if the operating system does not support the feature, such as MPP processing elements or checkpointing.

The following list explains each entry and shows the qmgr command (in parentheses) that is used to change the entry. Most of these commands can be used only by an NQS manager; commands that can be issued by an NQS operator are prefixed by a dagger (†).

Display entry
 

Description

Debug level
 

The level of information written to the log file (set debug).

Default batch request queue
 

The queue to which requests are submitted if the user does not specify a queue (set [no_]default batch_request queue).

Default destination retry time
 

The maximum time that NQS tries to send a request to a pipe queue (set default destination_retry time).

Default destination retry wait
 

The interval between successive attempts to retry sending a request (set default destination_retry wait).

Default return of request's job log
 

Determines whether the default action is to send a job log to users when the request is complete (set default job_log on|off).

Global batch group run-limit
 

The maximum number of batch requests that all users of a group can run concurrently (†set global group_limit).

Global batch run-limit
 

The maximum number of batch requests that can run simultaneously at a host (†set global batch_limit).

Global batch user run-limit
 

The maximum number of batch requests any one user can run concurrently (†set global user_limit).

Global MPP Processor Element limit
 

The maximum number of MPP processing elements available to all batch requests running concurrently (†set global mpp_pe_limit). Valid only on the Cray MPP systems and optionally enabled on IRIX systems (see “Enabling per Request Limits on Memory Size, CPU Time, and Number of Processors on IRIX Systems” in Chapter 5).

Global memory limit
 

The maximum memory that can be allocated to all batch requests running concurrently (†set global memory_limit).

Global pipe limit
 

The maximum number of requests that can be routed simultaneously at the host (†set global pipe_limit).

Global quick-file limit
 

The maximum quickfile space that can be allocated to all batch requests running concurrently on NQS on UNICOS systems (†set global quickfile_limit).

Global tape-drive x limit
 

The maximum number of the specified tape drives that can be allocated to all batch requests running concurrently (†set global tape_limit).

Job Initiation Weighting Factors
 

The weight of the specified factor in selecting the next request initiated in a queue (set sched_factor cpu|memory|share|time|mpp_cpu|mpp_pe|user_priority).

Job Scheduling option(s)
 

The job scheduling type used by nqsdaemon (set job_scheduling).

Log file
 

The path name of the log file (set log_file).

MESSAGE_Header
 

The NQS message header (set message_header long|short).

MESSAGE_Types
 

The type of NQS messages sent to message destinations such as the NQS log file or the user's terminal (set message_type on|off).

Mail account
 

The name that appears in the From: field of mail sent to users by NQS (set mail).

Netdaemon
 

The name of the TCP/IP network daemon (set [no_]network daemon).

Network client
 

The name of the network client process (set network client).

Network retry time
 

The maximum number of seconds a network function can fail before being marked as completely failed (set network retry_time).

Network server
 

The name of the network server process (set network server).

NQS daemon accounting
 

Determines whether NQS daemon accounting is enabled (set accounting off|on).

NQS daemon is/is not locked in memory
 

The state of the nqsdaemon process; that is, whether or not it is locked into memory (†lock local_daemon and †unlock local_daemon).

Periodic_checkpoint concurrent_checkpoints
 

The maximum number of checkpoints that can occur simultaneously during a periodic checkpoint scan interval (valid only on UNICOS, UNICOS/mk, and IRIX systems) (set periodic_checkpoint concurrent_checkpoints).

Periodic_checkpoint cpu_interval
 

The default CPU time interval for periodic checkpointing (valid only on UNICOS, UNICOS/mk, and IRIX systems) (set periodic_checkpoint cpu_interval).

Periodic_checkpoint cpu_status on
 

Determines whether NQS periodic checkpoints are initiated based on the CPU time used by a request (valid only on UNICOS, UNICOS/mk, and IRIX systems) (set periodic_checkpoint cpu_status off|on).

Periodic_checkpoint max_mem_limit
 

The maximum memory limit for a request that can be periodically checkpointed (valid only on UNICOS, UNICOS/mk, and IRIX systems) (set periodic_checkpoint max_mem_limit).

Periodic_checkpoint max_sds_limit
 

The maximum SDS limit for a request that can be periodically checkpointed (valid only on UNICOS systems) (set periodic_checkpoint max_sds_limit).

Periodic_checkpoint min_cpu_limit
 

The minimum CPU limit for a request that can be periodically checkpointed (valid only on UNICOS, UNICOS/mk, and IRIX systems) (set periodic_checkpoint min_cpu_limit).

Periodic_checkpoint scan_interval
 

The time interval in which NQS scans running requests to find those eligible for periodic checkpointing (valid only on UNICOS, UNICOS/mk, and IRIX systems) (set periodic_checkpoint scan_interval).

Periodic_checkpoint status off
 

Determines whether NQS will examine running requests to see whether they can be checkpointed and then schedule them for checkpointing (on). If the status is off, no requests are periodically checkpointed (valid only on UNICOS, UNICOS/mk, and IRIX systems) (set periodic_checkpoint status off|on).

Periodic_checkpoint time_interval
 

The default wall-clock time interval for periodic checkpointing (valid only on UNICOS, UNICOS/mk, and IRIX systems) (set periodic_checkpoint time_interval).

Periodic_checkpoint time_status off
 

Determines whether NQS periodic checkpoints are initiated based on wall-clock time used by a request (valid only on UNICOS, UNICOS/mk, and IRIX systems) (set periodic_checkpoint time_status off|on).

SEgment Directory
 

The name of the directory containing log file segments (set segment directory).

SEgment ON_init
 

Determines whether the log file is segmented at initialization (set segment on_init off|on).

SEgment Size
 

The maximum size of an NQS log file before it is segmented (set segment size).

SEgment Time_interval
 

The maximum time that can elapse before the NQS log file is segmented (set segment time_interval).

Sequence number for next request
 

The next sequence number that will be assigned to a batch request.

Snap_file
 

The default name of the file that will receive output from the snap command (set snapfile).

Validation type
 

The type of user validation that will be performed on user requests (set [no_]validation).

Displaying a List of Managers and Operators

To display a list of the currently defined NQS managers and operators for this server, use the following qmgr command:

show managers

In the resulting display, users who are managers are indicated by a :m suffix; operators are indicated by a :o suffix, as follows:

Qmgr: show managers
show managers

   root:m
   snowy:m
   fred:o
   xyz:o

The list always includes root (as a manager).

Displaying a Summary Status of NQS Queues

To display a brief summary status of all currently defined batch and pipe queues, use the following qmgr command:

show queue

An example of the display follows; it is identical to that produced by the cqstatl -s or qstat -s command:

Qmgr: show queue
show queue
-----------------------------
NQS BATCH QUEUE SUMMARY
-----------------------------
QUEUE NAME              LIM TOT ENA STS QUE RUN
----------------------- --- --- --- --- --- ---
nqebatch                  5   0 yes  on   0   0
----------------------- --- --- --- --- --- ---
pendulum                  5   0           0   0
----------------------- --- --- --- --- --- ---
----------------------------
NQS PIPE QUEUE SUMMARY
----------------------------
QUEUE NAME              LIM TOT ENA STS QUE ROU
----------------------- --- --- --- --- --- ---
nqenlb                    1   0 yes  on   0   0
----------------------- --- --- --- --- --- ---
pendulum                  5   0           0   0
----------------------- --- --- --- --- --- ---

To display a more detailed summary, use the following qmgr command:

show long queue

An example of the display follows; it is identical to that produced by the cqstatl or qstat command.

Qmgr: show long queue
show long queue
-----------------------------
NQS BATCH QUEUE SUMMARY
-----------------------------
QUEUE NAME             LIM TOT ENA STS QUE RUN  WAI HLD ARR EXI
---------------------- --- --- --- --- --- ---  --- --- --- ---
nqebatch                 5   0 yes  on   0   0    0   0   0   0
---------------------- --- --- --- --- --- ---  --- --- --- ---
latte                    5   0           0   0    0   0   0   0
---------------------- --- --- --- --- --- ---  --- --- --- ---
----------------------------
NQS PIPE QUEUE SUMMARY
----------------------------
QUEUE NAME             LIM TOT ENA STS QUE ROU  WAI HLD ARR DEP  DESTINATIONS
---------------------- --- --- --- --- --- ---  --- --- --- ---  -------------
nqenlb                   1   0 yes  on   0   0    0   0   0   0
cool                     1   0 yes  on   0   0    0   0   0   0  [email protected]
---------------------- --- --- --- --- --- ---  --- --- --- ---  -------------
latte                    5   0           0   0    0   0   0   0
---------------------- --- --- --- --- --- ---  --- --- --- ---  -------------

The columns in these summary displays are described as follows:

Column name

Description

QUEUE NAME

Indicates the name of the queue.

LIM

Indicates the maximum number of requests that can be processed in this queue at any one time. This is the run limit that the NQS manager defines for the queue.

For batch queues, LIM indicates the maximum number of requests that can execute in this queue at one time. After this limit is reached, additional requests will be queued until a request already in the queue completes execution.

For pipe queues, LIM indicates the maximum number of requests that can be routed at one time.

TOT

Indicates the total number of requests currently in the queue.

ENA

Indicates whether the queue was enabled to accept requests. If the queue is disabled (ENA is no), the queue will not accept requests.

STS

Indicates whether the queue has been started. If a queue was not started (STS is off), the queue will accept requests, but it will not process them.

QUE

The number of requests that are waiting to be processed.

ROU

The number of pipe queue requests that are currently being routed to another queue.

RUN

The number of batch queue requests that are currently running.

WAI

The number of requests that are waiting to be processed at a specific time.

HLD

The number of requests the NQS operator or manager has put into the hold state.

ARR

The number of requests currently arriving from other queues.

DEP

The number of requests currently terminating their processing.

DESTINATIONS

The destinations that were defined for the queue by the NQS manager.

The last line of each display shows the total figures for the use of the queues by users at the NQS host (pendulum and latte in the preceding examples), except for the LIM column, which shows the applicable global limit for this system.

When you use a cqstatl command rather than qmgr show commands, you can limit the display to NQS pipe queues by using the -p option, or limit the display to NQS batch queues by using the -b option.
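
For example, the following command limits the display to the NQS batch queues:

cqstatl -b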

Displaying All the Characteristics Defined for an NQS Queue

To display details of all the characteristics defined for NQS queues, use the cqstatl -f or the qstat -f command. You can restrict the display to specific queues by naming the queues on the command line.

You must separate queue names with a space character.
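
For example, to display the full characteristics of the two queues used in the examples that follow, you could enter:

cqstatl -f nqebatch nqenlb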

An example of a full display for the batch queue nqebatch follows:

% cqstatl -f nqebatch
-------------------------------------
NQS BATCH QUEUE: nqebatch@latte          Status:          ENABLED/INACTIVE
-------------------------------------
                                              Priority:       30
<ENTRIES>
        Total:          0
        Running:        0       Queued:         0       Waiting:        0
        Holding:        0       Arriving:       0       Exiting:        0
<RUN LIMITS>
        Queue:          5       User: unspecified       Group: unspecified
<COMPLEX MEMBERSHIP>
<RESOURCE USAGE>
                                   LIMIT                ALLOCATED
        Memory Size             unspecified                    0kw
        Quick File Space        unspecified                    0kw
        MPP Processor Elements  unspecified                    0
<REQUEST LIMITS>
                                PER-PROCESS             PER-REQUEST
        type a Tape Drives                              unspecified
        type b Tape Drives                              unspecified
        type c Tape Drives                              unspecified
        type d Tape Drives                              unspecified
        type e Tape Drives                              unspecified
        type f Tape Drives                              unspecified
        type g Tape Drives                              unspecified
        type h Tape Drives                              unspecified
        Core File Size          unspecified
        Data Size               unspecified
        Permanent File Space    unspecified             unspecified
        Memory Size             unspecified             unspecified
        Nice Increment                 0
        Quick File Space        unspecified             unspecified
        Stack Size              unspecified
        CPU Time Limit           unlimited               unlimited
        Temporary File Space    unspecified             unspecified
        Working Set Limit       unspecified
        MPP Processor Elements                          unspecified
        MPP Time Limit          unspecified             unspecified
        
        Shared Memory Limit                               unspecified
        Shared Memory Segments                            unspecified
        MPP Memory Size          unspecified             unspecified

<ACCESS>
        Route: Unrestricted            Users: Unrestricted
<CUMULATIVE TIME>
        System Time:    904.22 secs     User Time:      59.14 secs

Some parameters are not enforced if the operating system does not support the feature, such as MPP processor elements.

An example of a full display for the pipe queue nqenlb follows:

% cqstatl -f nqenlb
----------------------------------
NQS PIPE QUEUE: nqenlb@latte           Status:      ENABLED/INACTIVE
----------------------------------
                                          Priority:       63
<ENTRIES>
     Total:          0
     Running:        0       Queued:         0       Waiting:        0
     Holding:        0       Arriving:       0       Departing:      0

<DESTINATIONS>

<SERVER>
        /nqebase/bin/pipeclient CRI_DS

<ACCESS>
        Route: Unrestricted            Users: Unrestricted

<CUMULATIVE TIME>
        System Time:    269.92 secs     User Time:      95.21 secs

The cqstatl or qstat command also can be used to display details of characteristics defined for NQS queues on other hosts. The difference is that you must give the name of the remote host where the queues are located. For example, for the cqstatl command, you could include the -h option to specify the remote host, as follows:

cqstatl -d nqs -h target_host -f queues

You also can change your NQS_SERVER environment variable to specify the remote host.
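
For example, in a POSIX shell you could set NQS_SERVER and then request the full display from that host (target_host is a placeholder for the remote host name):

NQS_SERVER=target_host; export NQS_SERVER
cqstatl -f queues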

Displaying Summary Status of User Requests in NQS Queues

You can display a summary of the requests currently in the queue if you include the queue-name parameter in the qmgr commands show queue and show long queue:

show queue queue-name
show long queue queue-name

If no requests are currently in the queue, the following message is displayed:

nqs-2224 qmgr: CAUTION
no requests in queue queue-name

You can restrict the display to the requests in the queue that were submitted by a specific user by including the name of that user after the queue-name argument:

show queue queue-name username
show long queue queue-name username

An example of the display for a batch queue follows:

Qmgr: show long queue nqebatch
show long queue nqebatch
-------------------------------
NQS BATCH REQUEST SUMMARY
-------------------------------
IDENTIFIER NAME  USER     LOCATION/QUEUE   JID  PRTY REQMEM REQTIM ST
---------- ----- -------- ---------------- ---- ---- ------ ------ ---
39.latte   STDIN mary     nqebatch@latte     10644  --- 262144     ** R

The columns in this display have the following descriptions:

Column name 

Description

IDENTIFIER 

The request identifier (as displayed when the request was submitted).

NAME 

The name of the script file.

USER 

The user name under which the request will be executed at the NQS system.

LOCATION/QUEUE 

The NQS queue in which the request currently resides.

JID 

The job identifier for the request at the NQS system.

PRTY 

The nice value of the request.

REQMEM 

The number of kilowords of memory the request is using.

REQTIM 

The number of seconds of CPU time remaining to the request.

ST 

An indication of the current state of the request. This can be composed of a major and a minor status value. Major status values are as follows:

A

Arriving

C

Checkpointed

D

Departing

E

Exiting

H

Held

Q

Queued

R

Running

S

Suspended

U

Unknown state

W

Waiting

See the cqstatl(1) and qstat(1) man pages for a list of the minor values.

The show queue display has much of this information, but it does not contain the USER, QUEUE, JID, and PRTY fields.

The following example display is a summary of all requests in the NQS pipe queue nqenlb:

Qmgr: show long queue nqenlb
show long queue nqenlb
------------------------------
NQS PIPE REQUEST SUMMARY
------------------------------
IDENTIFIER    NAME    OWNER    USER     LOCATION/QUEUE        PRTY ST
------------- ------- -------- -------- --------------------- ---- ---
40.latte      STDIN   1201     mary     nqenlb@latte             1 Q

The columns in this display have the following meanings:

Column name

Description

IDENTIFIER

The request identifier (as displayed when the request was submitted).

NAME

The name of the script file, or stdin if the request was created from standard input.

OWNER

The user name under which the request was submitted.

USER

The user name under which the request will be executed.

LOCATION/QUEUE

The queue in which the request currently resides.

PRTY

The intraqueue priority of the request.

ST

An indication of the current state of the request (in this case, Q means the request is queued and ready to be routed to a batch queue).

The show queue display has much of this information in it, but it does not contain the OWNER, USER, QUEUE, and PRTY fields.

A list of all requests in all NQS queues can be displayed by root or an NQS manager (otherwise, the display shows only the submitting user's requests). Use the cqstatl or qstat command with the -a option; for example:

cqstatl -a


Note: For information about displaying requests in the NQE database, see Chapter 9, “NQE Database”.


Displaying the Mid Database

To display a list of NQS systems that are currently defined in the machine ID (mid) database, use the following qmgr command:

show mid

An example of the display follows:

Qmgr: sho mid
sho mid
   MID      PRINCIPAL NAME   AGENTS    ALIASES
 --------   --------------   ------    -------
 10111298   latte            nqs       latte.cray.com
 10653450   pendulum         nqs
 10671237   gust             nqs       gust.cray.com

This display shows that three systems are currently defined; the descriptions of these entries are explained in “Defining NQS Hosts in the Network” in Chapter 5.

The following qmgr command displays information on a specific mid or host:

show mid hostname|mid

An example of such a display follows:

Qmgr: show mid latte
show mid latte
   MID      PRINCIPAL NAME   AGENTS    ALIASES
 --------   --------------   ------    -------
 10111298   latte            nqs       latte.cray.com

Controlling the Availability of NQS Queues

Before an NQS queue can receive and process requests submitted by a user, it must meet the following requirements:

  • The queue must be enabled so that it can receive requests. If a queue is disabled, it can still process requests already in the queue, but it cannot receive any new requests.

  • The queue must be started so that it can process requests that are waiting in the queue. If a queue is stopped, it can still receive new requests from users, but it will not try to process them.

The NQS operator (or manager) can control these states, thereby controlling the availability of a queue.

For example, you can disable all queues before shutting down the NQS system and prevent users from submitting any additional requests. The requests that are already in the queues will be processed.

Another example might be when no destination for a pipe queue will be available for some time (for example, the destination host may not be functioning). You can stop the queue rather than having NQS repeatedly try to send requests and deplete system resources. There is a system limit to the amount of time a request can wait in a queue.

Enabling and Disabling Queues

To enable a queue and allow requests to be placed in it, use the following qmgr command:

enable queue queue-name

If the queue is already enabled, no action is performed.

To disable a queue and prevent any additional requests from being placed in it, use the following qmgr command:

disable queue queue-name

If the queue is already disabled, no action is performed. If requests are already in the queue, these can still be processed (if the queue has been started).

An NQS manager also can prevent or enable individual users' access to a queue (see “Restricting User Access to Queues” in Chapter 5).

Example of Enabling and Disabling Queues

In the following example, you want to prevent users from sending requests to the NQS queue called express for a few hours. To disable the queue, use the following qmgr command:

disable queue express

When you want to enable the queue, use the following qmgr command:

enable queue express

Starting and Stopping Queues

To start an NQS queue and allow it to process any requests waiting in it, use the following qmgr command:

start queue queue-name

If the queue is already started, no action is performed.

To start all NQS queues, use the following qmgr command:

start all_queues

To stop a queue and prevent any further requests in it from being processed, use the following qmgr command:

stop queue queue-name

If the queue is already stopped, no action is performed. A request that had already begun processing before the command was issued is allowed to complete processing. All other requests in the queue, and any new requests placed in the queue, are not processed until the queue is started.

To stop all NQS queues, use the following qmgr command:

stop all_queues

You can also suspend the processing of a specific request in a queue (instead of the entire queue); see “Suspending, Resuming, and Rerunning Executing Requests”.

Example of Starting and Stopping Queues

The following example stops all NQS queues except the queue called cray1-std by first stopping all queues and then restarting cray1-std:

stop all_queues
start queue cray1-std

The following qmgr command starts all of the queues:

start all_queues

Manipulating Requests

You can use qmgr commands to perform certain actions on requests that users have submitted. To delete a request, you must either have operator privileges on NQS or be the owner of the request.

Moving Requests between NQS Queues

To move a request from one NQS queue to another, use the following qmgr command:

move request = request queue_name

A request can be moved only if it has not yet started processing (that is, has not yet started being transferred to another queue).

The request argument is the request identifier. It corresponds to the value displayed under the IDENTIFIER column of the summary request display (see “Displaying Summary Status of User Requests in NQS Queues”). The queue_name argument is the name of the queue to hold the request.

To move more than one request, use the following qmgr command:

move request = (requests) queue_name

You must separate request identifiers with a space or a comma and enclose them all in parentheses.

To move all requests that have not started processing in one queue to another queue, use the following qmgr command:

move queue from_queue_name to_queue_name

Example of Moving Requests between Queues

The following example moves the two requests with request identifiers 123 and 135 from the current queue to another queue called fast:

move request = (123,135) fast

You do not have to specify the queue in which these requests currently reside.
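
Similarly, to move all requests that have not started processing from the queue bqueue15 to the queue fast (queue names used here for illustration only), use the following qmgr command:

move queue bqueue15 fast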

Moving Requests within a Queue

The following qmgr command schedules queued requests manually, changing their position in a queue:

schedule request request(s) [order]

The request argument is one or more request identifiers of the request to be rescheduled. If you specify more than one request, you must enclose them all in parentheses and separate them with a space character or a comma.

The order argument can be one of the following:

  • first, before all other requests in the queue, including restarted or rerun requests

  • next, after all other qmgr scheduled requests in the queue

  • now, immediately initiating the request and bypassing all NQS limit checks

  • system, moving a previously scheduled request back to system-controlled scheduling

You can use the schedule command only on queued requests.

If a queue has a higher interqueue priority, the requests in that queue are usually initiated before those in queues with a lower interqueue priority.

Example of Moving Requests within a Queue

The following example repositions the requests 100, 101, and 102 to the order 101, 102, and then 100:

schedule request 100 next
schedule request 101 now
schedule request 102 first

Request 101 is scheduled immediately, before all other queued requests.

Request 102 is scheduled to go before all other requests scheduled by qmgr (but not before those scheduled with the schedule now command, which take precedence).

Request 100 is scheduled to go after all requests scheduled by qmgr, but before system-scheduled requests.

Holding and Releasing Requests

To hold specific queued requests so that they are not processed, use the following qmgr command:

hold request request(s) [grace-period]

The request argument is one or more request identifiers of the requests to be held. If you specify more than one request, you must enclose them all in parentheses and separate them with a space character or a comma.

The grace-period argument is the number of seconds between when the command is issued and when the request is actually held; the default is 60 seconds.

Held requests become ineligible for processing; they are removed from the run queue and their NQS resources are released. On UNICOS, UNICOS/mk, and IRIX systems, running requests can be held; such requests are checkpointed and remain held until they are released. On other platforms, running requests cannot be held because they cannot be checkpointed.

You can hold more than one request at a time. Until it is released, you cannot process a held request (for instance, move it to another queue).

To release requests that are held, use the following qmgr command:

release request request(s)

The qmgr command release request does not affect any request that is in a state other than being held. No grace period exists for releasing a request.

Example of Holding and Releasing Requests

In the following example, the request with identifier 123 is immediately held to prevent it from being routed to a destination by using the following qmgr command:

hold request 123 0

When you want to begin processing the request, release the request by using the following qmgr command:

release request 123

Suspending, Resuming, and Rerunning Executing Requests

To suspend specific requests that are running and make their processes ineligible for execution, use the following qmgr command:

suspend request request(s) 

The request argument is one or more request identifiers of the requests to be suspended. If you specify more than one request, you must enclose them all in parentheses and separate them with a space character or a comma.

The resources of the request are not released with the suspend command.

You can suspend more than one request at a time.

When you suspend a request, its processes are sent a SIGSTOP signal; when you resume processing of the request, a SIGCONT signal is sent. On UNICOS and UNICOS/mk systems, NQS uses the suspend(2) system call.

You cannot process a suspended request (for instance, route it to another queue) until it is resumed.

To resume requests that are suspended, use the following qmgr command:

resume request request(s)

The resume request command does not affect any request that is in a state other than suspension.

To abort and immediately requeue a running request, use the following qmgr command:

rerun request request(s)

This command kills processes of the request and the request is returned to the queued state at its current priority. If a request is not running or cannot be rerun, no action is taken. Users can specify that a request cannot be rerun by using the NQE GUI Submit display, the cqsub or qsub command with the -nr option, or the qalter -r command.

Example of Suspending and Resuming Requests

In the following example, the request with identifier 222 is suspended and its processes become ineligible for execution:

suspend request 222

When you want to resume processing of the request, use the following qmgr command:

resume request 222

Modifying the Characteristics of Requests in Queues

The following qmgr command changes characteristics of a request in a queue:

modify request request option=limit

The request argument is the request identifier of the request to be modified.

The limit argument overrides the limit specified when the request was submitted and overrides the default limit defined for the batch queue in which the request is to be executed. NQS considers the new limit for selecting the batch queue in which the request will execute.

The option can set the following per-process and per-request attributes; the actual name of the option is given in parentheses:

Attribute 

Description

CPU time limit 

Per-request and per-process CPU time that can be used; on CRAY T3E MPP systems, this limit applies to command PEs (commands and single-PE applications) (rtime_limit and ptime_limit).

priority 

NQS user specified priority of a request.

To enable user specified priority for job initiation scheduling, you should set the user_priority weighting factor to nonzero and set all of the other weighting factors to 0. Then, only user specified priority will be used in determining a job's intraqueue priority for job initiation.

Memory limit 

Per-request and per-process memory that can be used (rmemory_limit and pmemory_limit).

Nice value 

Execution priority of the processes that form the batch request; specified by a nice increment between 1 and 19 (nice_limit).

Per-request permanent file space limit 

Per-request permanent file space that can be used (rpermfile_limit).

Per-process permanent file space limit 

Per-process permanent file space that can be used (ppermfile_limit).

Per-request MPP processing elements (CRAY T3D systems) or MPP application processing elements (CRAY T3E systems) or number of processors (IRIX systems) limit; optionally enabled on IRIX systems, see “Enabling per Request Limits on Memory Size, CPU Time, and Number of Processors on IRIX Systems” in Chapter 5 

Per-request Cray MPP processing elements (PEs) limit (-l mpp_p).

Per-process Cray MPP residency time limit 

For CRAY T3D MPP systems, sets the per-process wallclock residency time limit, or for CRAY T3E MPP systems, sets the per-process CPU time limit for application PEs (multi-PE applications) for subsequent acceptance of a batch request into the specified batch queue (-l p_mpp_time_limit) .

Per-request Cray MPP residency time limit 

For CRAY T3D MPP systems, sets the per-request wallclock residency time limit, or for CRAY T3E MPP systems, sets the per-request CPU time limit for application PEs (multi-PE applications) for subsequent acceptance of a batch request into the specified batch queue (-l r_mpp_time_limit) .

Per-request quick-file size limit 

(UNICOS systems with SSDs) The user's per-process secondary data segments (SDS) limit, which is set to the value of the user's user database (UDB) entry for per-process SDS limits (-lQ).

Per-request shared memory size limit 

Per-request shared memory size limit (-l shm_limit) .

Per-request shared memory segment limit 

Per-request shared memory segment limit (-l shm_segment).

MPP memory size 

Per-process and Per-request MPP application PE memory size limits (-l p_mpp_m,mpp_m).

You can append one of the following suffixes to a memory limit to denote the units of size; the suffixes are case sensitive. A word is 8 bytes on UNICOS, UNICOS/mk, and 64-bit IRIX systems, and a word is 2 bytes on other supported systems.

Suffix

Units

b

Bytes (default)

w

Words

kb

Kilobytes (2^10 bytes)

kw

Kilowords (2^10 words)

mb

Megabytes (2^20 bytes)

mw

Megawords (2^20 words)

gb

Gigabytes (2^30 bytes)

gw

Gigawords (2^30 words)

Example of Modifying a Queued Request

In this example, a request with identifier 1340 is waiting to execute in an NQS queue. The submitter of the request wants to increase the memory allocated to the request to 500 Mwords. To do this, you can enter the following qmgr command:

modify request 1340 rmemory_limit=500mw

Deleting Requests in NQS Queues

The following qmgr commands delete a request that is in an NQS queue:

  • abort request, which sends a SIGKILL signal to each of the processes of a running request.

  • delete request, which deletes requests that are not currently running.

The request is removed from the queue and cannot be retrieved. The original script file of the request is unaffected by this command.

To delete all requests in a particular NQS queue, use one of the following qmgr commands:

  • abort queue, which sends a SIGKILL signal to all running requests in the specified queue

  • purge queue, which purges all waiting, held, and queued requests in the specified queue but allows running requests to complete

To delete requests using the NQE GUI Status window, select the request to be deleted by clicking on the request line; then select Delete from the Actions menu.

You can also use the cqdel or the qdel command to delete requests; an example of the syntax follows:

cqdel -u username requestids

The username argument is the name of the user who submitted the request.

Example of Deleting a Queued Request

To delete all requests waiting in the queue bqueue15, use the following qmgr command:

purge queue bqueue15

Deleting or Signaling Requests in Remote NQS Queues

After a request is in an NQS queue at a remote system, you can use the NQE GUI Status window or the cqdel or qdel command to delete or signal the request (the qmgr subcommands affect only requests in local NQS queues). You must authenticate yourself by entering the correct password or having an entry in your .rhosts or .nqshosts file and in the .rhosts or .nqshosts file of the job's owner. You do not have any special privileges to delete other users' requests on remote NQS systems, even if you are an NQS operator or manager on both the local and the remote host.


Note: If you are using the NQE database, you only need system access to the NQE database from your remote host. For more information, see Chapter 9, “NQE Database”, and the nqedbmgr(8) man page.

To delete requests using the NQE GUI Status window, select the request to be deleted by clicking on the request line; then select Delete Job from the Actions menu. For more information about using the cqdel and qdel commands to delete or signal requests on NQS, see the cqdel(1) and qdel(1) man pages.

Recovering Jobs Terminated Because of Hardware Problems

When a process within a job has been terminated by a SIGUME or SIGRPE signal or a SIGPEFAILURE signal (UNICOS/mk systems only), NQS requeues the job rather than deleting it if either of the following is true:

  • The job is rerunnable

  • The job is restartable and has a restart file

Applications running on a CRAY T3E system are killed when a PE assigned to the application goes down. NQS is now notified when a job is terminated by a SIGPEFAILURE signal. NQS will requeue the job and either restart or rerun the job, as applicable.

Periodic checkpointing should be enabled so that restart files are available; jobs terminated by a SIGPEFAILURE signal from a downed PE can then be restarted rather than rerun.

By default, each NQS job is both rerunnable and restartable. These defaults can be changed with the qsub -nr and -nc options and with the qalter -r n and -c n options. The job's owner can specify the qsub options and use the qalter command to modify the job's rerun and restart attributes. An administrator can also use the qalter command to modify any job's rerun and restart attributes.
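
For example, the following command marks request 123 (an arbitrary request identifier) as neither rerunnable nor restartable; see the qalter man page for the complete syntax:

qalter -r n -c n 123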

If NQS requeues a job because the job was terminated by either the SIGUME or SIGRPE signals, the following message is written into the system syslog, the NQS log file, and the job's log file:

Request <1.subzero>: Request received SIGRPE or SIGUME
signal; request requeued.

If NQS requeues a job because the job was terminated by the SIGPEFAILURE signal, the following message is written into the system syslog, the NQS log file, and the job's log file:

Request <1.subzero>: Request received SIGPEFAILURE signal; request requeued.

The requeued job is reinitiated after it is selected by the NQS scheduler. The qmgr schedule request now command can be used to force the job to be reinitiated immediately.
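
For example, to force the requeued request shown in the preceding messages (sequence number 1) to be reinitiated immediately, enter the following qmgr command:

schedule request 1 now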

The following actions can be expected when a job is terminated by either a SIGRPE or SIGUME signal or a SIGPEFAILURE signal (UNICOS/mk systems only):

  • For a job that has default rerun and restart attributes, the job is requeued and rerun.

  • For a job that has default rerun and restart attributes and has a restart file associated with it, the job is requeued and restarted from the restart file.

  • For a job that has the no-rerun attribute and has no restart file, the job is deleted.

  • For a job that has the no-rerun attribute but does have a restart file, the job is requeued and restarted from the restart file.

  • For a job that has the no-restart attribute and uses the default rerun attribute, the job is requeued and rerun.

  • For a job that has the no-rerun and no-restart attributes, the job is deleted.