Chapter 1. Overview

The Cray Network Queuing Environment (NQE) lets you submit batch requests to execute across a load-balanced network of hosts. NQE supports computing in a large network: up to 36 servers and hundreds of clients when using the NQE database model, or an unlimited number of NQS servers. This grouping of servers and clients is referred to as an NQE cluster. Batch requests are shell scripts that are executed independently of any interactive terminal session.

You submit requests from NQE clients, and they are executed on NQS server nodes. You can also log in to a node and submit requests from there. You can monitor and control the progress of a batch request through the NQE components in the NQE cluster.

This chapter describes the NQE components, the components of an NQE cluster, and the tasks you can perform using NQE.

NQE Components and NQE Cluster Components

NQE includes the following components:

  • The NQE client provides the client user interface to NQE. It supports the submission, monitoring, and control of work from your workstation; the batch requests themselves execute on the NQS server nodes. NQE clients are intended to run on every node in the NQE cluster where users need an interactive interface to the cluster. The client provides the NQE graphical user interface (GUI), accessed through the nqe(1) command, and a command line interface. Both interfaces are documented in the NQE User's Guide, publication SG-2148, and the NQE GUI has context-sensitive help.

    For information about using the NQE GUI, see “User Interfaces” in Chapter 2.

  • The Network Queuing System (NQS) initiates requests on NQS servers. An NQS server is the host on which NQS runs. Your default NQS server is designated by your system administrator and is specified in the NQE configuration file (nqeinfo). You can submit a request to a specific NQS server by using the NQE GUI Config window or by setting the NQS_SERVER environment variable, which overrides the default value of NQS_SERVER defined by your system administrator.

  • The Network Load Balancer (NLB) provides status and control of work scheduling within the NQE cluster. The status information is used to load balance batch requests across the NQS servers in the cluster. The NLB offers NQS a list of servers, in order of preference, on which to run a request; NQS uses the list to route the request.

  • The NQE database provides a central repository for batch requests in the NQE cluster. The NQE scheduler uses the NQE database as an alternative mechanism for distributing work. The NQE scheduler examines each request and determines when and on which execution node the request will run. The NQE database lightweight server (LWS) verifies validation, submits the copy of a request that is in the NQE database to NQS, and obtains the exit status of completed requests from NQS.

  • The File Transfer Agent (FTA) provides asynchronous and synchronous file transfer. You can queue your transfers so that they are retried if a network link fails.

Note: Your system administrator can tell you if all the NQE components, or only the NQS and FTA components, are available to you.
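
The precedence described above for choosing an NQS server (an NQS_SERVER value in your environment overrides the administrator's default) can be sketched as follows. This is only an illustration of the override rule, not NQE code: the function name, the host names, and the choice to treat an empty NQS_SERVER as unset are all assumptions, and the site default is passed in directly rather than read from nqeinfo.

```python
import os


def resolve_nqs_server(site_default, environ=None):
    """Return the NQS server a request would be sent to.

    site_default stands in for the value the administrator configured.
    A non-empty NQS_SERVER in the environment takes precedence over it.
    (Treating an empty value as "unset" is an assumption of this sketch.)
    """
    env = os.environ if environ is None else environ
    override = env.get("NQS_SERVER", "").strip()
    return override or site_default


# Hypothetical host names, used only for illustration.
print(resolve_nqs_server("hostA", {}))                       # no override set
print(resolve_nqs_server("hostA", {"NQS_SERVER": "hostB"}))  # user override
```

Passing the environment as a parameter keeps the rule easy to check without changing the real environment of your session.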

The NQE cluster can contain the following components:

  • The Network Load Balancer (NLB) server, which receives and stores information from the NLB collectors in the NLB database that it manages.

  • The NQE database server, which serves connections from clients, the scheduler, the monitor, and the lightweight server (LWS) components in the cluster to add, modify, or remove data from the NQE database. Currently, NQE uses the mSQL database.

  • The NQE scheduler, which analyzes data in the NQE database and makes scheduling decisions.

  • The NQE database monitor, which monitors the state of the database and tracks which NQE database components are connected.

  • NQE clients (running on numerous machines), which contain software that lets users submit, monitor, and control requests by using either the NQE GUI or the command line interface. From clients, users can also monitor request status, delete or signal requests, monitor machine load, and receive request output by using FTA.

The machines in your network where you run NQS are usually machines that have a large execution capacity. Requests can be submitted from components in an NQE cluster, but they are initiated only on an NQS server node.

FTA can be used from any NQS server node to transfer data to and from any node in the network by using the ftpd daemon. It also can provide file transfer by communicating with ftad daemons that incorporate network peer-to-peer authorization, which is a more secure method than ftp.

On NQS server nodes, you need to run a collector process to gather information about the machine for load balancing and request status for the NQE GUI Status and Load windows. The collector forwards this data to the NLB server.
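
To make the flow from collector data to an ordered server list more concrete, here is a minimal sketch of how load reports could be ranked into a preference order. It is not the NLB's actual policy, which this chapter does not describe: the LoadReport class, the cpu_idle metric, the "most idle first" rule, and the host names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LoadReport:
    host: str               # NQS server node the collector runs on
    cpu_idle: float         # hypothetical metric: fraction of CPU idle, 0.0 to 1.0
    reachable: bool = True  # whether the node is currently reporting


def preference_order(reports):
    """Return host names ordered from most to least preferred.

    Assumed policy: skip unreachable hosts, prefer the most idle host,
    and break ties alphabetically so the result is deterministic.
    """
    usable = [r for r in reports if r.reachable]
    ranked = sorted(usable, key=lambda r: (-r.cpu_idle, r.host))
    return [r.host for r in ranked]


reports = [
    LoadReport("hostA", cpu_idle=0.20),
    LoadReport("hostB", cpu_idle=0.75),
    LoadReport("hostC", cpu_idle=0.90, reachable=False),
]
order = preference_order(reports)
print(order)  # the first entry is the most preferred host
```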

The NLB server runs on one or more NQE nodes in a cluster, but it is easiest to run it initially on the first node where you install NQE. Redundant NLB servers give the NLB database greater availability if one NLB server cannot be reached.

Note: The NQE database must be on only one NQE node; there is no redundancy.

Tasks You Can Perform Using NQE

NQE lets you do the following tasks:

  • Submit batch requests to a node in the NQE cluster that is running NQS.

  • Route requests automatically without knowing the names of queues or hosts.

  • Monitor the status of requests across the NQE cluster. Request status is refreshed at configurable intervals.

  • Signal a request in the NQE cluster.

  • Delete a request in the NQE cluster.

  • Allow independent jobs to communicate with each other by using the cevent(1) command.

  • Communicate with other UNIX systems running public domain versions of NQS.

  • Monitor the machine load across the NQE cluster.

  • Recover and restart your requests (see your NQE User's Guide, publication SG-2148, to determine which systems support this capability).

  • Transfer files both interactively and from within batch requests on NQE servers without specifying a password.

  • Queue file transfers so that you can go on to other work.

  • Automatically retry file transfers that encounter network failures.

  • Customize your NQE environment.
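
The idea behind queued file transfers that are retried after a network failure can be illustrated with a small sketch. This is not how FTA is implemented; it only shows the general queue-and-retry pattern. The function names, the max_attempts limit, the use of ConnectionError to signal a failed link, and the file names are assumptions made for the example.

```python
from collections import deque


def run_transfer_queue(transfers, send, max_attempts=3):
    """Process queued transfers, requeuing any that hit a network error.

    transfers: iterable of transfer descriptions (plain names here).
    send: callable that performs one transfer and raises ConnectionError
          on a network failure.
    Returns a (completed, failed) pair of lists.
    """
    queue = deque((t, 1) for t in transfers)
    completed, failed = [], []
    while queue:
        transfer, attempt = queue.popleft()
        try:
            send(transfer)
            completed.append(transfer)
        except ConnectionError:
            if attempt < max_attempts:
                # Put the transfer at the back of the queue to try again later.
                queue.append((transfer, attempt + 1))
            else:
                failed.append(transfer)
    return completed, failed


# Demonstration with a fake sender whose link fails once for one file.
attempts = {}


def flaky_send(name):
    attempts[name] = attempts.get(name, 0) + 1
    if name == "results.dat" and attempts[name] == 1:
        raise ConnectionError("link down")


done, gave_up = run_transfer_queue(["input.dat", "results.dat"], flaky_send)
print(done, gave_up)
```

Because a failed transfer goes to the back of the queue, other transfers continue while it waits for its next attempt.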