Chapter 1. Seeing the Big Picture

This chapter provides an overview of the Network Queuing Environment (NQE): its components, how it processes your work, its user interfaces, and the basic tasks that you perform with it.

The remaining chapters of this guide describe in detail all user tasks.

You also may want to read Introducing NQE, publication IN-2153, which provides a quick overview of how to perform basic user tasks. You can access this publication online by using the Cray DynaWeb server.

NQE Components and NQE Cluster Components

NQE is a set of clients and servers that lets you submit requests to be executed across a load-balanced network of hosts. NQE supports computing with a large number of nodes in a large network, and it supports two basic models:

  • The NQE database model that supports up to 36 servers and hundreds of clients.

  • The NQS model that supports an unlimited number of NQS servers and hundreds of clients.

The grouping of servers and clients is referred to as an NQE cluster. The servers provide reliable, unattended processing and management of the NQE cluster. Users who have long-running requests and a need for reliability can submit batch requests to an NQE cluster.

Batch requests are shell scripts that are executed independently from an interactive terminal session. You submit requests from NQE clients and they are executed at NQS server nodes. You also can log on to nodes and submit requests. You can monitor and control the progress of a batch request through the NQE components in the NQE cluster.

The following sections describe the NQE components and the NQE cluster components.

NQE Components

NQE includes the following components:

  • The NQE client provides the user interfaces to NQE: the NQE GUI (accessed through the nqe command) and a command line interface. From a workstation, it supports the submission, monitoring, and control of batch requests that execute on the nodes. NQE clients are intended to run on every node in the NQE cluster where users need an interactive interface to the cluster.

    For a description of the user interfaces, see “User Interfaces”.

  • The Network Queuing System (NQS) initiates requests on NQS servers. An NQS server is the host on which NQS runs. Your default NQS server is designated by your system administrator and is specified in the NQE configuration file (nqeinfo); you can submit your request to a specific NQS server by setting the NQS_SERVER environment variable, which overrides the default value of NQS_SERVER defined by your system administrator.

  • The Network Load Balancer (NLB) provides status and control of work scheduling within the group of components in the NQE cluster. This information is then used to load balance batch requests across NQS servers in the NQE cluster. The NLB offers NQS a list of servers, in order of preference, to run a request; NQS uses the list to route the request.

  • The NQE database provides a central repository for batch requests in the NQE cluster. The NQE scheduler uses the NQE database as an alternative mechanism for distributing work. The NQE scheduler examines each request and determines when and on which execution node the request will run. The lightweight server (LWS) verifies validation, submits the copy of a request to NQS, and obtains the exit status of completed requests from NQS.

  • The File Transfer Agent (FTA) provides asynchronous and synchronous file transfer. You can queue your transfers so that they are retried if a network link fails.


Note: If you are running NQE without a license on a Cray PVP system, only the NQS and FTA components are accessible.


NQE Cluster Components

The NQE cluster can contain the following components:

  • The Network Load Balancer (NLB) server, which receives and stores information from the NLB collectors in the NLB database that it manages. For more information on the NLB, see Chapter 15, “Monitoring Machine Load”.

  • The NQE database server, which serves connections from clients, the scheduler, the monitor, and lightweight server (LWS) components in the cluster to add, modify, or remove data from the NQE database. Currently, NQE uses the mSQL database. For more information on the NQE database server, see “Submitting a Request to the NQE Database” in Chapter 4.

  • The NQE scheduler, which analyzes data in the NQE database and makes scheduling decisions. For more information on the NQE scheduler, see NQE Administration, publication SG-2150.

  • The NQE database monitor, which monitors the state of the database and which NQE database components are connected. For more information on the NQE database monitor, see NQE Administration, publication SG-2150.

  • NQE clients (running on numerous machines), which contain software that lets users submit, monitor, and control requests by using either the NQE graphical user interface (GUI) or the command line interface. From clients, users also can monitor request status, delete or signal requests, monitor machine load, and receive request output by using the FTA.

The machines in your network where you run NQS are usually machines that have a large execution capacity. Job requests can be submitted from components in an NQE cluster, but they will only be initiated on an NQS server node.

FTA can be used from any NQS server node to transfer data to and from any node in the network by using the ftpd daemon. It also can provide file transfer by communicating with ftad daemons that incorporate network peer-to-peer authorization, which is a more secure method than ftp.

On NQS servers, you need to run a collector process to gather information about the machine for load balancing and request status for the NQE GUI Status and Load windows. The collector forwards this data to the NLB server.

The NLB server runs on one or more NQE nodes in a cluster, but it is easiest to run it initially on the first node where you install NQE. Redundant NLB servers ensure that the NLB database has a greater availability if an NLB server cannot be reached through the cluster.


Note: The NQE database must be on only one NQE node; there is no redundancy.


How NQE Works

This section describes how NQE processes your work: the general flow of a request and how a request moves through NQS queues.

Work Flow

Using NQE, you can submit a request to NQS or to the NQE database. The following sections describe the work flow of a request submitted to each of these destinations. For more information about submitting requests, see Chapter 4, “Submitting Requests”.

Flow of a Request Submitted to NQS by Using the NLB

When you submit a request to NQS, by default NQS solicits information from the NLB to determine which NQE execution node will receive and process the request.


Note: Your site may have changed the defaults; contact your system administrator if your environment seems to work differently.

Figure 1-1 shows how a request flows through NQE when you send a request directly to NQS by using the NLB. The steps are as follows:

  1. From your client workstation, you submit your request to schedule and initiate a batch job. For information about the user interfaces, see “User Interfaces”.

  2. Through the NQE client, your request enters NQS on your NQS server (as indicated by your NQS_SERVER environment variable).

  3. NQS solicits information from the NLB about the most appropriate servers and queues for your request.

  4. The NLB uses the system load information received from other NQE nodes in the network and offers NQS a list of servers, in order of preference, to run a request.

  5. Using this information, NQS sends the request to the most appropriate destination in the NQE cluster. It may queue the request locally at your NQS server. The request is assigned a unique NQS request identifier (requestid).

  6. From your client workstation, you monitor your request by using the NQE GUI Status window or the cqstatl command. (From a node, you can also use the qstat command to monitor your request.)

  7. The request executes on the host selected in step 5.

  8. When the job request completes, standard output and standard error files are returned to you by default at your client workstation.

    Figure 1-1. Work Flow through NQE Using the NLB with NQS


Flow of a Request Submitted to the NQE Database

When you submit a request to the NQE database, it works with an administrator-defined NQE scheduler to analyze your request and to determine which NQS server will receive and process the request.

When the scheduler has chosen a server for your request, a copy of your request is sent to the NQE node. The original request remains in the NQE database. Because the original request remains in the NQE database, if a problem occurs during execution and the copy of the request is lost, a new copy can be submitted for processing.

For more information about submitting a request to the NQE database, see “Submitting a Request to the NQE Database” in Chapter 4.

Figure 1-2 shows how a request flows through NQE when you send a request to the NQE database. The steps are as follows:

  1. From your client workstation, you submit your request to schedule and initiate a batch job. For information about the user interfaces, see “User Interfaces”.

  2. Through the NQE client, your request is sent to the NQE database. A request submitted to the NQE database is called a task, and it is assigned a unique task identifier (tid).

  3. The NQE scheduler examines the request in the NQE database and determines when and where the request will run. The scheduled node can be any NQS server node in the NQE cluster.

  4. The lightweight server (LWS) on the scheduled NQE node receives a copy of the request from the NQE database.

  5. The LWS submits a local request to NQS. The request is placed in a local batch queue to run. The request is assigned a unique NQS request identifier (requestid). The LWS updates the NQE database with this information.

  6. You can monitor the status of your request by using the NQE GUI Status window. The status information is obtained from the NQE database and displayed on your client workstation.

  7. The request executes on the host selected in step 3.

  8. When the job request completes, standard output and standard error files are returned to you by default at your client workstation.

  9. When the job request completes, the NQE database is updated with exit information. However, the request is not deleted from the NQE database immediately so that you can continue to get information about the request and its status. Your system administrator determines how long data remains in the NQE database after the request has completed.

    Figure 1-2. Work Flow through NQE Using the NQE Database and Its Scheduler


NQS Queues

To process your request, NQE may send it through a series of queues. A queue is a list of job requests waiting to be scheduled and initiated.

NQS has three types of queues:

  • Batch queues initiate job requests. Generally, a job request in a batch queue is executing or waiting for resources so that it can execute.

  • Pipe queues route requests. A pipe queue sends the request to another queue for further processing. This other queue could be on any NQS server in the NQE cluster. It could be a batch queue that will initiate the request or a pipe queue that will route it further. If your request cannot enter the queue to which it was sent, NQS sends you a mail message that explains the problem.

    Pipe queues are not used if you send your job request to the NQE database.

  • Destination-selection queues load-balance job requests. These are pipe queues that do not have a preset destination. Instead, destinations are determined by load-balancing policies.

    When a request enters a destination-selection pipe queue, NQS queries the NLB for a list of destinations that could process your request. The NLB returns a list of destinations that is ordered according to the administrator-defined policy at your site. If for some reason the first destination cannot accept the request, the second is tried, and so on.

    Destination-selection pipe queues are not used if you send your request to the NQE database.
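
The fallback behavior described for destination-selection queues (try the first destination, then the second, and so on) can be pictured with a small shell sketch. This is only a model of the ordering logic, not NQS or NLB code: the host names and the can_accept function are hypothetical stand-ins, and in this sketch only hostB is assumed to accept the request.

```shell
#!/bin/sh
# Hypothetical ordered list of destinations, best choice first,
# standing in for the list that the NLB would return.
destinations="hostA hostB hostC"

# Hypothetical acceptance check: in this sketch only hostB accepts requests.
can_accept() {
    [ "$1" = "hostB" ]
}

# Walk the list in order and stop at the first destination that accepts.
chosen=""
for dest in $destinations; do
    if can_accept "$dest"; then
        chosen=$dest
        break
    fi
    echo "$dest cannot accept the request; trying the next destination"
done
echo "request routed to: ${chosen:-none}"
```

Because the list is already ordered by the site policy, the loop never needs to compare destinations; it only needs to find the first one that can take the request.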

Figure 1-3 shows an example of how requests submitted to NQS may flow from your client workstation through NQS queues.

Figure 1-3. Detail of Work Flow through NQE When Submitting Directly to NQS

Figure 1-4 shows an example of how requests submitted to the NQE database flow from your client workstation through an NQS batch queue.

Figure 1-4. Detail of Work Flow When Submitting to the NQE Database


User Interfaces

You can use the NQE graphical user interface (GUI) or a command line interface to perform most of the functions described in this guide. The following sections provide a brief overview of these interfaces. (You also can submit your request by using a World Wide Web (WWW) interface; for further information, ask your system administrator.)

NQE Graphical User Interface

The NQE GUI is similar to a Motif interface. To access the NQE GUI, execute the nqe command. Figure 1-5 shows the initial NQE GUI button bar window that will appear:

Figure 1-5. Initial NQE GUI Button Bar Window


To access a window, use the left mouse button and click on the button once.

You can use the NQE GUI for the following tasks:

  • Use the Submit window to do the following:

    • Open and edit a job script

    • Save changes made to a job script

    • Submit a request to NQE

    • Launch a request on a periodic basis

    • From within the Submit window, reset your configuration preferences for the request you are submitting

    • View, segment, delete, or reset your NQE GUI log

    • Set or unset your password

    • Configure and save your job-related options (job profile)

  • Use the Status window to do the following:

    • View updated status of your requests (the window is refreshed periodically)

    • View updated status of your FTA file transfers (the window is refreshed periodically)

    • Delete a request

    • Send a specified signal to a request

    • View the detailed status of a request

    • Set or unset your password

Context-sensitive help is displayed as you glide your mouse cursor over a menu or field name in the Status window; a brief description of the menu or field appears at the bottom of the display.

  • Use the Load window to do the following:

    • Display continually updated system load information for machines in the group of execution nodes in the NQE cluster

    • Display data about a specific host

    • Display the same data that is provided on the main Load window, but have it grouped by host rather than by type of data

  • Use the Config window to do the following:

    • Set your preferences for the following: a specific NQS server; the default job profile; and the temporary, job script, job output, job profile, and NQE GUI log directories

    • View your currently set preferences

  • To display the current NQE version number and copyright information in the Submit, Status, and Config windows, use the left mouse button and click once on the Cray Research logo button.

  • To access online help, use the left mouse button and click once on the Help button.

  • To exit the NQE GUI, use the left mouse button and click on the Exit button.

When the mouse pointer is within a display area of a specific NQE GUI window, you can use the ALT key and the underscored letter from the menu bar to pop up submenus and to select more submenu options. An alternative way to do this is to use the F10 key to activate the menu bar and then use the cursor movement keys to select submenus and options.

For a summary of the NQE GUI displays and functions, see the nqe(1) man page.

Command Line Interface

NQE provides a command line interface for the following user functions. Each of the commands listed in this section is documented on a man page (man pages are provided in online form only).

You can issue the following commands from any NQE node because all NQE nodes contain the NQE client software:

Command     Description

cevent      Posts, reads, and deletes job-dependency event information
cqdel       Deletes or signals a request that is either running or awaiting processing
cqstatl     Displays the status of NQE work through a line-mode, static display
cqsub       Submits a script file to NQE for execution
ilb         Executes a load-balanced interactive command; for an overview of the ilb command, see “Using the ilb Command”; for detailed information about the ilb command, see the ilb(1) man page

You can issue the following commands only on an NQE node on which the NQE server components are installed. These commands are not installed on NQE clients, and they do not recognize the NQS_SERVER environment variable:

Command     Description

ftua        Transfers a file interactively
qalter      Alters the attributes of one or more NQS requests
qchkpnt     Checkpoints an NQS request on a UNICOS, UNICOS/mk, or IRIX system
qdel        Deletes or signals an NQS request
qlimit      Displays NQS batch limits for the local host
qmsg        Writes messages to stderr, stdout, or the job log file of an NQS batch request
qping       Determines whether the local NQS daemon is running and responding to requests
qstat       Displays the status of NQS queues, requests, and queue complexes
qsub        Submits a batch request to NQS
rft         Transfers a file to and from a remote system

For a list of all user-level man pages provided online, see Appendix A, “Man Page List”.

Preparing to Use NQE

To use NQE, you must set certain environment variables. For an explanation of which environment variables you must set, see “Setting Environment Variables” in Chapter 2. For a list of optional environment variables you can set, see Chapter 9, “Customizing Your Environment”.
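
As one illustration of setting an environment variable, the following sketch sets the NQS_SERVER variable described earlier in this chapter, using Bourne shell syntax. The host name nqs1.example.com is a placeholder chosen for this example; the variables your site actually requires, and their values, come from your system administrator and from Chapter 2.

```shell
#!/bin/sh
# Placeholder host name; substitute the NQS server that you want to use.
NQS_SERVER=nqs1.example.com
# Export the variable so that commands started from this shell inherit it.
export NQS_SERVER

echo "Requests from this session are directed to: $NQS_SERVER"
```

Placing lines such as these in your shell startup file would apply the setting to every session; setting the variable at the prompt affects only the current session.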

To submit requests to the NQE database, you must have a database user account (dbuser) that has user privileges. Your NQE administrator controls who has access to the database and from which client host. For information about how to specify your database user name, see “NQE Database Authorization” in Chapter 2, or “Specifying a Database User Name for Your Request” in Chapter 4.

By default, NQS uses file validation to authorize users. NQS also may be configured to use password validation or both file and password validation.

For additional information about preparing to use NQE and about validation files, see Chapter 2, “Preparing to Use NQE”.

Creating Batch Requests

Before you submit a batch request, you usually will create a script file that contains the UNIX commands that make up the request. To create this file, use any text editor (such as vi). You also can create a batch request from within the NQE GUI Submit window.

A batch request can be one command, such as ls (which lists files). Usually, however, batch requests contain several commands.
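
A minimal sketch of such a script file follows. The file name myjob.sh and the choice of commands are examples only; any UNIX commands could appear in the script. The sketch writes the file and then checks it for syntax errors without running it.

```shell
#!/bin/sh
# Write a small request script; the file name myjob.sh is only an example.
cat > myjob.sh <<'EOF'
#!/bin/sh
# Print the date and time at which the request starts.
date
# Show the name of the host on which the request runs.
uname -n
# List the files in the working directory.
ls
EOF

# Check the script for syntax errors without executing it.
sh -n myjob.sh && echo "myjob.sh: syntax OK"
```

A file like this is what you would later hand to the cqsub or qsub command, as described in “Submitting Requests”.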

On UNICOS, UNICOS/mk, or IRIX systems, you can checkpoint an executing request at any point during its execution by including qchkpnt(1) statements within the script file; each checkpoint saves the current image of the request in a restart file. You then can use the restart file to restart the job from a known point if a system interrupt occurs.

For more information about creating a batch request, see Chapter 3, “Creating Batch Requests”. For more detailed information about customizing a batch request, see Chapter 5, “Customizing Requests”.

Submitting Requests

You can submit a request to run under UNIX or under the Distributed Computing Environment (DCE). For information about how to submit a request to DCE, see Chapter 4, “Submitting Requests”.

To submit a request to NQE, you can use either the NQE GUI or the command line interface.

To use the NQE GUI, key in the nqe command at the prompt and, using the left mouse button, click once on the Submit button of the initial NQE GUI button bar.

To use the command line interface to submit a batch request to NQE, use the cqsub or qsub command. For a complete list of the options, see the cqsub(1) or qsub(1) man page.

For more information about submitting a request for execution, see Chapter 4, “Submitting Requests”.

Monitoring Requests and Queues

To view where your request is in the NQE network, use the NQE GUI Status window or the cqstatl or qstat command.

When you use the NQE GUI Status window, the default window shows the status of all of your requests in the NQE cluster. Using the NQE GUI has the following advantages over using the cqstatl or qstat command:

  • A display is refreshed periodically. To get new information, you do not need to reissue a command.

  • A request is easy to find. All of your requests are displayed on the main NQE GUI Status window; you do not need to specify a specific node.

When you use the cqstatl or qstat command, you can obtain information about all queues on that NQE node. The NQE GUI does not display any information about queue structures; only queues that contain requests are displayed through the NQE GUI.

For detailed information about monitoring requests, see Chapter 10, “Monitoring Requests”. For detailed information about monitoring queues, see Chapter 11, “Monitoring Queues”.

Examining Output

After your batch request completes, NQS returns standard output and standard error files to you. If you use the NQE GUI, the files are written to your home directory by default. If you use the command line interface, by default the files are written to the directory you were in when you issued the cqsub or qsub command.

For information about working with output files, see Chapter 6, “Working with Output Files”. For information about communicating with a request, see Chapter 7, “Communicating with Requests”.

Deleting or Signaling Requests

You can delete or signal a request that you have submitted to NQE. The request may be executing or may be waiting to execute on an NQS server node.


Note: You can send any UNIX signal to a request. Your request script could be written to trap the signal and then take some appropriate action, rather than to abort.

To delete an executing request, you can use either the NQE GUI Status window or the cqdel or qdel command. If you use the cqdel or qdel command, you must specify a UNIX signal to send to the request. You can send one of several signals to a request; one of the most common is the SIGKILL signal, which aborts a running process.

Standard output, standard error, and job log files are still produced for an executing request that is deleted by a signal. These files record the execution of the request up to the moment that the signal is received.
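
The earlier note about trapping a signal can be sketched as follows. This is an illustration under stated assumptions, not an NQE-supplied script: the file names trapjob.sh and trapjob.out are examples, and the script is run and signaled locally with the standard kill command as a stand-in for signaling a request that is executing on an NQS server.

```shell
#!/bin/sh
# Write a request script that traps SIGTERM (trapjob.sh is an example name).
cat > trapjob.sh <<'EOF'
#!/bin/sh
# Handler: report what happened, then exit cleanly instead of aborting.
on_term() {
    echo "SIGTERM received; saving partial results"
    exit 0
}
trap on_term TERM

echo "request started"
# Stand-in for long-running work, done in short steps so that the
# handler runs soon after the signal arrives.
while :; do
    sleep 1
done
EOF

# Run the script in the background and send it SIGTERM. This is a local
# stand-in for signaling a request through NQE.
sh trapjob.sh > trapjob.out 2>&1 &
job_pid=$!
sleep 2
kill -TERM "$job_pid"

# Collect the exit status of the script; 0 means the handler exited cleanly.
job_status=0
wait "$job_pid" || job_status=$?
cat trapjob.out
```

The SIGKILL signal cannot be trapped, so a handler of this kind helps only with signals such as SIGTERM that a script is allowed to catch.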

For more information about deleting a request, see Chapter 12, “Deleting Requests”. For more information about sending a signal to a request, see Chapter 13, “Signaling Requests”.

Transferring Files

You can transfer files between remote systems on a network either from within a batch request or interactively by using the NQE File Transfer Agent (FTA). The ftua and rft commands transfer files. The ftua interface to FTA is similar to the TCP/IP ftp utility. File transfers can be initiated on NQE nodes only.

You might choose to use FTA for the following reasons:

  • You can queue your transfers. You can execute file transfers immediately or queue them for later execution. If the transfer is queued, it is executed after you leave the utility, letting you proceed to other tasks.

  • You can display queued transfers. If you have issued a file transfer request in queue mode, you can display details about the request. To view the status of an FTA transfer, you can use either the NQE GUI or the qls command.

  • Your transfers are retried. If your file transfer fails for some transient reason (such as a network link failing), FTA automatically requeues the transfer. Retries are useful in batch requests because your requests will not abort if a transfer cannot occur when it is first tried.

  • You do not have to provide passwords. FTA provides network peer-to-peer authorization (NPPA). NPPA lets you transfer files without specifying passwords in batch request files or in .netrc files and without transmitting passwords over the network. For more information on NPPA, see “Using NPPA” in Chapter 14.

  • It provides both synchronous and asynchronous reliable file transfer. If a transient error condition occurs during the transfer, transfers are retried. Retries are useful when transferring files from within an NQS request.

    If you disable the synchronous feature by selecting the -nowait option, the transfers are done in asynchronous fashion but are still reliable.
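
The value of retries inside a batch request can be pictured with a generic shell sketch. It does not show how FTA is implemented or invoked: the retry and flaky_transfer functions are hypothetical, and flaky_transfer simply simulates a transfer that fails twice before the network link recovers.

```shell
#!/bin/sh
# Run a command up to max_tries times, stopping at the first success.
retry() {
    max_tries=$1
    shift
    try=1
    while [ "$try" -le "$max_tries" ]; do
        if "$@"; then
            return 0
        fi
        try=$((try + 1))
    done
    return 1
}

# Hypothetical stand-in for a transfer that fails twice and then succeeds.
attempts=0
flaky_transfer() {
    attempts=$((attempts + 1))
    [ "$attempts" -ge 3 ]
}

# The surrounding script continues normally once any attempt succeeds.
if retry 5 flaky_transfer; then
    echo "transfer succeeded after $attempts attempts"
else
    echo "transfer failed after $attempts attempts"
fi
```

With FTA, this kind of retry handling is provided for you, so a request script does not need its own loop around each transfer.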

To transfer files from within a batch request, use the rft command. The rft command has the following advantages over other file transfer commands:

  • It is a one-line interface to FTA. This makes it easier to use in batch job requests.

  • rft provides an option that deletes the local file on the completion of a transfer. This is useful when transferring files at the end of an NQS request to the system from which you submitted the request.

For more information about using FTA, see Chapter 14, “Transferring Files”.

Using the ilb Command

The ilb utility lets you execute a command on a machine chosen by the NLB. Enter the ilb command followed by the command you wish to execute. The NLB is then queried to determine which machine to log you into. Once the login process is complete, the command is executed and I/O is connected to your terminal or pipeline.

The .ilbrc file contains login and initialization information used during the automated login process. The .ilbrc file must reside in your home directory on the local host. The default system ilbrc file contains information that the ilb utility uses to establish connections to remote systems.

The following example executes the uname command on the system chosen by the NLB and returns the output to the user's terminal:

$ ilb uname -a
Attempting connection to pendulum...
ILB Info: Using /usr/bsd/rlogin to connect, based on PATH.
Executing uname -a...
IRIX pendulum 6.2 06101030 IP22

For detailed information about the ilb command, see the ilb(1) man page.