This chapter describes new features of NQE.
|Note: Because customers may be upgrading to NQE 3.3 from various NQE release levels, this chapter describes the features that were added in each NQE release from NQE 2.0 through NQE 3.3, to give you a complete picture of the NQE features that may be new to you.|
For compatibilities and differences that users may see when upgrading to the NQE 3.3 release, see Chapter 3, “Compatibilities and Differences”.
|Note: Customers with valid NQE 3.1.x or NQE 3.2.x FLEXlm licenses do not need new licenses for the NQE 3.3 release.|
The following features were added in the NQE 3.3 release:
Support was added for the following operating system levels: UNICOS/mk 2.0.2, UNICOS 10.0, IRIX 6.5, and HP-UX 10.10. (HP-UX 10.10 was initially supported in the NQE 3.2.2 release.) For a complete list of operating systems and release levels supported with the NQE 3.3 release, see “Operating Systems Supported by NQE 3.3” in Chapter 1.
Miser integration on Origin systems is supported. NQE supports the submission of jobs that specify Miser resources.
Distributed Computing Environment (DCE) support was enhanced as follows:
Ticket forwarding and inheritance is now supported on selected platforms. This feature lets users submit jobs in a DCE environment without providing passwords. Ticket forwarding is supported on all NQE platforms except Digital UNIX systems. Ticket inheritance is supported only on UNICOS and IRIX systems.
IRIX systems now support access to DCE resources for jobs submitted to NQE.
Support for tasks that use a password for DCE authentication is available on all NQE 3.3 platforms.
The following NQE database enhancements were made:
The NQE database now supports a maximum of 36 simultaneous client and execution server connections.
The MAX_SCRIPT_SIZE variable was added to the nqeinfo file, which lets an administrator limit the size of the script file submitted to the NQE database. If the MAX_SCRIPT_SIZE variable is set to 0 or is not set, a script file of unlimited size is allowed. The script file is stored in the NQE database; if the file is bigger than MAX_SCRIPT_SIZE, it can affect the performance of the NQE database.
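As an illustrative sketch (the nqeinfo file's variable syntax and the units of MAX_SCRIPT_SIZE are assumptions here, not taken from this overview), an administrator might limit submitted scripts as follows:

```
# nqeinfo file fragment: reject script files larger than about 1 MB.
# A value of 0, or leaving the variable unset, allows unlimited size.
MAX_SCRIPT_SIZE=1000000
```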
The compactdb(8) command was added to let administrators compress the NQE database, nqedb, on demand.
The user exit template for T3E mixed mode scheduling is available in /usr/src/lib/libuex/nqs_uex_jobselect.template.4. T3E systems can have PEs with various memory sizes and clock speeds; NQS can identify jobs with special PE needs and apply those needs to scheduling decisions.
Applications running on a CRAY T3E system are killed when a PE assigned to the application goes down. NQS is now notified when a job is terminated by a SIGPEFAILURE signal (UNICOS/mk systems only). NQS will requeue the job and either restart or rerun the job, as applicable, when the job is selected by the NQS scheduling code to be initiated.
Overall performance of the Network Load Balancer (NLB) collector under UNICOS was improved, and the collector now provides additional information.
The capabilities of the NQE database scheduler (LWS) have been extended.
The security enhancements to UNICOS/mk systems are supported with this NQE release.
The csuspend utility has two new command line options: -l loopcount and -p period. csuspend suspends or enables batch processing based on the amount of interactive use, which is determined by calls to sar. These options give the administrator greater control over how sar is invoked and, consequently, over how frequently csuspend checks whether to suspend or start NQE.
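As a hedged sketch of the new options (the values shown are arbitrary, chosen only for illustration):

```
# Check interactive use via sar for 10 iterations (-l loopcount),
# sampling over 60-second periods (-p period); semantics as
# described above.
csuspend -l 10 -p 60
```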
The qstart(8) and qstop(8) commands now let an administrator execute programs immediately before and after the NQS daemon starts (NQE_ETC/qstart.pre and NQE_ETC/qstart.pst) and immediately before and after the NQS daemon is shut down (NQE_ETC/qstop.pre and NQE_ETC/qstop.pst), where NQE_ETC is defined in the nqeinfo file. The administrator must create these files, and they must be executable.
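A minimal sketch of such a hook, assuming NQE_ETC resolves to /nqebase/etc (a hypothetical value) and logging to a hypothetical file:

```
# Create a qstart.pre hook that records each NQS daemon startup;
# the file must exist and be executable, as noted above.
cat > /nqebase/etc/qstart.pre <<'EOF'
#!/bin/sh
echo "NQS daemon starting: `date`" >> /usr/adm/nqs_hooks.log
EOF
chmod +x /nqebase/etc/qstart.pre
```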
All nqeinfo file variables are now documented on the new nqeinfo(5) man page; this man page is also accessible through the Help facility of the NQE configuration (nqeconfig) utility.
The ability to start and stop one or more specific NQE components at a time was added to the nqeinit and nqestop scripts.
Support was added for checkpointing and restarting the jobs that run on UNICOS/mk systems. This feature was initially supported in the NQE 3.2.1 release.
Array Services support was added for UNICOS systems. Array Services let you manage related processes as a single unit, including processes running across multiple machines. Array Services use array sessions to group these related processes together through use of a unique identifier called an array session handle (ASH). A global ASH is needed when the processes within an array session are not all running on the local node. The NQE request server now asks for a global ASH before initiating the job. NQE logs the global ASH associated with the job in a log message in the user's job log. The global ASH associated with a job is shown in a Global ASH: field in an NQE job log display. A job log display can be requested by supplying the NQE job identifier when using the qstat -j or cqstatl -j command, or the job log can be displayed through the NQE graphical user interface (GUI) by clicking on a specific job within the Status display and then selecting the Actions->Job Log menu. The global ASH for a job is also entered into the NQS log file.
Political scheduling support was added for CRAY T3E systems running the UNICOS/mk operating system. This support includes obtaining fair-share information by using the multilayered user fair-share scheduling environment (MUSE) and scheduling a job for immediate execution with preferential CPU priority (prime job). (This feature was initially supported in the NQE 3.2.1 release.)
Fair-share scheduling within NQS continues to be enabled by using the qmgr set sched_factor share command with a nonzero share weighting value. If the CRAY T3E political scheduling daemon is available, NQS provides a list of user ID/account ID (UID/ACID) pairs to the political scheduling daemon. The MUSE factors returned to NQS are used in the NQS priority formula to calculate NQS job ordering priorities.
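For example (the weighting value is arbitrary):

```
# Enable fair-share scheduling within NQS by giving the share
# factor a nonzero weighting:
qmgr
Qmgr: set sched_factor share 10
```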
An optional prime keyword has been added to the qmgr schedule request now command. The prime keyword sets a prime job status for the request, which gives the request preferential CPU scheduling. If the job is already running, NQS notifies the political scheduling daemon that this job needs prime job status. If the job is queued, NQS initiates the job immediately and then notifies the political scheduling daemon. The qstat -f command display will show the job priority as SCHEDULED NOW PRIME, and the psview -rRM display will show the job as a prime job.
If a job has prime job status, the qmgr schedule request system command is used to remove the NOW state of the job with NQS and to remove the prime job status. Consequently, the job will no longer receive preferential CPU scheduling.
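A hedged sketch of the prime job commands (the request identifier and its placement on the command line are assumptions, not taken from this overview):

```
# Give request 1234 prime job status, initiating it immediately
# if it is queued:
Qmgr: schedule request 1234 now prime
# Later, remove the NOW state and the prime job status:
Qmgr: schedule request 1234 system
```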
Support was added to enforce the pmppmemlim and jmppmemlim batch UDB limits on CRAY T3E systems. These limits are used to restrict the amount of memory made available to processes and jobs running on CRAY T3E application PEs. The NQS system that resides on a CRAY T3E system will obtain these limits from the system's UDB. (This feature was initially supported in the NQE 3.2.1 release.)
The TCL_RCFILE and TK_RCFILE configuration variables have been removed from the nqeinfo file. This eliminates the need for the $HOME environment variable to be set in system startup and shutdown scripts.
The NQE_DEFAULT_COMPLIST configuration variable in the nqeinfo file has replaced the NQE_TYPE configuration variable, which defines the list of NQE components to be started or stopped. The NQE 3.3 release is shipped with the NQE_DEFAULT_COMPLIST variable set to the following components:
NQS, COLLECTOR, NLB
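For example, a host that should run only NQS and the NLB collector might set the variable as follows (assuming the nqeinfo file's VAR=value form):

```
# nqeinfo file fragment: nqeinit and nqestop will start and stop
# only the components named here.
NQE_DEFAULT_COMPLIST=NQS,COLLECTOR
```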
NQS now sets the USER environment variable to the same value as the LOGNAME environment variable before it initiates a job. This was added to accommodate the platforms that use the USER environment variable instead of LOGNAME.
NQS now supports per-request limits for CPU usage, memory usage, and the number of processors when running on IRIX platforms. The per-request usage of these resources is displayed by the NQE GUI and the cqstatl and qstat commands. Requests that exceed the limits will be terminated. The periodic checkpointing of requests based on accumulated CPU time is also supported.
CPU and memory scheduling weighting factors were added for application PEs. They can be specified with the qmgr set sched_factor command as the mpp_cpu, mpp_pe, and user_priority options. The NQS scheduling weighting factors are used with the NQS priority formula to calculate the intraqueue job initiation priority for NQS runnable jobs. This feature also restores the user-specified priority scheduling functionality (specified by the cqsub -p and qsub -p commands).
The -f option was added to the qdel(1) command; this option specifies that no request output will be returned to the user. This option behaves similarly to the -k option except that the user's standard error, standard output, and job log files are not returned to the user or stored at the execution node in the NQS failed directory.
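For example (the request identifier is illustrative):

```
# Delete request 2040 without returning or saving its standard
# error, standard output, or job log files:
qdel -f 2040
```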
Year 2000 support for NQE has been completed.
The project ID is added to the end of the current accounting records in the NQS accounting file (nqsacct) written by the NQS daemon accounting on IRIX systems.
NQE documentation was revised for this release; for additional information, see “NQE Documentation” in Chapter 4.
The following features were added in the NQE 3.2 release:
NQE supports all Origin systems, including the Origin200 and the Origin2000 models.
Support was added for the following operating system levels: Digital UNIX 4.0, IRIX 6.3, IRIX 6.4, Solaris 2.5, and UNICOS/mk 1.4. Also, IRIX 5.3 is again supported.
|Note: On Solaris 2.5 systems using the Distributed Computing Environment (DCE) with NQE 3.2, when a user submits a request using the “per-process working set size limit” option (-lw if using the command line interface), the user must specify 7mw or larger, or the request will not be accepted.|
For IRIX systems, in addition to the existing NQE installation process, the inst installation utility is supported to install NQE.
Performance improvements were made to the NQE GUI to improve response time.
Support was added for checkpointing of jobs running on a Cray PVP system that invoke CRAY T3D processes.
Support was added for checkpointing and restarting of jobs running on IRIX 6.4 systems.
NQE no longer enforces restrictions on the number of client commands that can be used simultaneously to submit jobs, delete or signal jobs, and obtain the status of jobs.
The craylmd vendor license daemon was removed.
|Note: Customers with valid NQE 3.1 FLEXlm licenses do not need new licenses for the NQE 3.2 release.|
A site that is licensed to use full NQE may use all clients on all NQE platforms. However, on systems running only the NQE subset (only the NQS and FTA components), only the standard “q” commands (not the client commands) on the native platform are supported.
NQS recovers UNICOS PVP jobs that are terminated because of hardware problems. If a job is terminated through receipt of a SIGRPE or SIGUME signal, NQS requeues the job rather than deleting it if the job is rerunnable or if the job is restartable and has a restart file. For compatibilities and differences information, see “Job Recovery After a SIGRPE or SIGUME Signal” in Chapter 3.
The -Rf option is supported on the cqsub(1) command. The -Rf option forces the request to be restarted from a checkpoint image. This option is supported on UNICOS systems only.
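For example (assuming a previously checkpointed, restartable request, and showing the option exactly as named above):

```
# Submit the request so that it is forced to restart from its
# checkpoint image (UNICOS systems only):
cqsub -Rf jobscript
```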
For the NQE database, the system object trace_level attribute was added, and the NQEDB_TRACE_LEVEL environment variable was removed. System object trace levels are now set either through a TRACE_LEVEL event or by setting the system object trace_level attribute to the desired value and posting a TINIT event to that system object. The nqedbmgr post event and the nqedbmgr ask commands may be used to change the trace level.
The -f option was added to the ftad(8) command so that a system administrator may specify the nqeinfo file to be used for configuration information.
The variable HOSTNAME_TIMEOUT was added to the nqeinfo file. HOSTNAME_TIMEOUT specifies, in minutes, the interval in which to refresh the values (cached in a linked list) that were returned by calls made to the gethostbyname function. If you specify a value of 0, the list is not refreshed. If you specify a value of -1, no list will be generated, and the gethostbyname function will be called directly. You can use the nqeconfig(8) command to configure HOSTNAME_TIMEOUT.
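For example (assuming the nqeinfo file's VAR=value form):

```
# nqeinfo file fragment: refresh the cached gethostbyname() results
# every 30 minutes; 0 = never refresh, -1 = no cache at all.
HOSTNAME_TIMEOUT=30
```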
The DCE integrated login capability is supported on Digital UNIX systems. A future NQE release will support this capability on the other systems that support DCE.
Support was added for the multilevel security feature (Cray ML-Safe) on UNICOS/mk systems.
NQE documentation was revised for this release as follows:
The NQE Release Overview and Installation Bulletin, publication RO-5237, was separated into the following two documents as of the NQE 3.2 release:
NQE Release Overview, publication RO-5237 3.2
NQE Installation, publication SG-5236 3.2
These two publications are provided in printed form with the NQE release package.
All NQE 3.2 publications were revised in online form:
Introducing NQE, publication IN-2153 2/97
NQE User's Guide, publication SG-2148 3.2
NQE Administration, publication SG-2150 3.2
NQE Release Overview, publication RO-5237 3.2
NQE Installation, publication SG-5236 3.2
All NQE 3.2 publications are available from the Cray Research Online Software Publications Library, which is now available publicly at the following URL:
All NQE 3.2 publications are also provided on the Cray DynaWeb CD-ROM that is included with your NQE 3.2 release package.
PostScript files of all NQE 3.2 publications are available as follows:
On the Cray DynaWeb CD-ROM that is included with the NQE 3.2 release package
Through anonymous FTP at the following location:
Man pages were revised; also, the nqsdaemon(8) man page was added to the NQE man page set. Man pages are provided in online form only as part of the NQE release package.
The following features were added in the NQE 3.1 release:
For UNICOS and UNICOS/mk systems, beginning with the NQE 3.1 release, NQE is released asynchronously from UNICOS and UNICOS/mk; NQS and FTA are packaged in the NQE product for all platforms. Documentation that was previously provided in the Network Queuing System (NQS) User's Guide, publication SG-2105, the UNICOS NQS and NQE Administrator's Guide, publication SG-2305, and the FTA User and Administrator Manual, publication SG-2144, has been incorporated into the NQE documentation set.
For Cray Research software licensing information, see “Licensing Agreement Information for UNICOS and UNICOS/mk Systems” in Chapter 5.
|Note: Cray PVP systems that do not have an NQE license are limited to accessing and using only the NQE subset (NQS and FTA components).|
The vendor license daemon is named craylmd on all supported platforms. This was a change for some platforms.
A new NQE version maintenance utility was provided; it is invoked by using the nqemaint(8) command.
A new NQE configuration utility was provided; it is invoked by using the nqeconfig(8) command. This NQE configuration utility may be used to modify NQE configuration values (the nqeinfo file contents) and specify additional variables not included in the default configuration.
For workstation platforms, nqeconfig(8) is also automatically invoked and used during the NQE installation process.
The cload(1) command was removed; the functionality is provided through the NQE GUI Load window. The NQE GUI is invoked by using the nqe(1) command.
For NQE 3.1 running on AIX 4.2 systems, support was added for using the Distributed Computing Environment (DCE) processing capability when submitting a job to NQE. The NQE_AUTHENTICATION variable and the NQE_DCE_BIN variable support this feature.
For UNICOS systems, DCE 1.1 is the supported DCE release. In addition to the DCE/DFS features added in the NQE 3.0 release, support for DCE/DFS ticket refresh was added for UNICOS systems. The NQE_DCE_REFRESH variable, which defines the refresh interval (in minutes), was added.
On UNICOS and UNICOS/mk systems that run the full NQE release, the nlbconfig(8) utility is used instead of the fta.conf(5) utility to configure FTA. However, the fta.conf(5) utility is still used on Cray PVP systems that are running UNICOS with only the NQE subset (NQS and FTA components).
Online documentation was made available using the Cray DynaWeb server. The Cray DynaWeb server lets users access information using a World Wide Web browser, such as Netscape. CrayDoc is no longer installed as part of the NQE release. Documentation for the Cray DynaWeb server includes the Online Software Publications Installation Guide, publication SG-6105, and the Online Software Publications Administrator's Guide, publication SG-6104.
The NQE 3.0.1 release supports Silicon Graphics workstations, POWER CHALLENGE systems, and POWER CHALLENGEarray systems running the IRIX 6.2 operating system. The array sessions and project names features are supported on the POWER CHALLENGEarray product when it is used in a POWERnode configuration.
The following features were added in the NQE 3.0.1 release:
Array sessions support was added. NQE logs the global ASH associated with the job in a log message in the user's job log. The global ASH associated with a job is shown in a Global ASH: field in an NQE job log display. A job log display can be requested by supplying the NQE job identifier when using the qstat -j or cqstatl -j command, or the job log can be displayed through the NQE GUI by clicking on a specific job within the Status display and then selecting the Actions->Job Log menu. The global ASH for a job is also entered into the NQS log file.
Project names support was added. For NQE jobs that are both submitted and executed on the local host, NQE sets an appropriate project name against which the job will run. If an explicitly requested project name is provided, NQE uses that project name. If an explicitly requested project name is not provided, NQE uses the value of the NQS_ACCOUNTNAME environment variable. If the NQS_ACCOUNTNAME environment variable is not set, NQE uses the project name in effect at job submission.
For NQE jobs that execute on a remote host, NQE determines the appropriate project name for the job as follows: If an explicitly requested project name is provided, that project name is used. If an explicitly requested project name is not provided, the user's default project name on the remote host is used.
A user can explicitly request a project name when submitting an NQE job by using the qsub -A or cqsub -A command or by setting the NQE GUI Submit -> Configure -> General Options -> Account/Project name field. The project name associated with a job is shown in an Account/Project: field in NQE full status displays. A full status display can be requested by supplying the NQE job identifier when using the qstat -f or cqstatl -f command, or it can be requested through the NQE GUI by double clicking on a selected job within the Status display.
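For example, submitting with an explicit project name and then checking it (the project name and request identifier are illustrative):

```
# Submit a job under project "chem", then view the full status
# display, which includes the Account/Project: field:
cqsub -A chem jobscript
cqstatl -f 1234.hostname
```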
The NQE 3.0 release supports all platforms except UNICOS and UNICOS/mk systems. The new features in NQE 3.0 are supported on UNICOS and UNICOS/mk systems as of the NQE 3.1 release.
The following features were added in the NQE 3.0 release:
The NQE GUI was added; it is invoked by the nqe(1) command. The NQE GUI lets you do the following:
Use the Submit window to open and edit a job script; to save changes made to a job script; to submit a request to NQE; to view, segment, delete, or reset your NQE GUI log; and to set or unset your password. A launching capability enables you to submit a request at a specific time or at repeating intervals. Also use the Submit window to set (configure) and to save your job-related options (job profiles).
Use the Status window to verify the status of your requests and FTA file transfers. Also use the Status window to delete a request, to send a specified signal to a request, to get a detailed status of a request, and to set or unset your password. Context-sensitive help is provided with the status windows; as you glide your mouse cursor over a menu or field name in a window, a brief description of the menu or field appears at the bottom of the display.
Use the Load window to view a continually updated display of the system load for machines in the execution server complex. You also can view data about a specific configured host, and you can view the data that is provided on the main window but have it grouped by host rather than by data type.
Use the Config window to set (configure) specific user preferences and to view how NQE-related variables are currently set.
The nqeinfo file is supported on NQE clients. NQS_SERVER may be set in the nqeinfo file, which eliminates the need for client users to set the NQS_SERVER environment variable if they want to use the default setting.
The cqstat(1) command was removed and replaced with the new NQE GUI Status window capability.
Support was added for the NQE database, which is an mSQL database, and the NQE scheduler. You can use the NQE GUI to submit requests to the NQE database and to obtain a status of requests that you have submitted. You can also use the command line interface for NQE database requests; the following command line interface changes were made to support this feature:
The -d option was added to the cqdel, cqstatl, and cqsub user commands to specify the request's destination (NQE database or NQS); the default destination is NQS.
The -u option of the cqdel, cqstatl, and cqsub user commands was expanded to support specifying the NQE database user name.
The nqedbmgr(8) command was added for administration of the NQE database.
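A hedged sketch of the new options (the destination keyword and user name shown are assumptions, not taken from this overview):

```
# Submit to the NQE database instead of the default NQS destination,
# giving an NQE database user name:
cqsub -d nqedb -u dbuser jobscript
# Obtain status of NQE database requests the same way:
cqstatl -d nqedb -u dbuser
```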
The NQE scheduler is written in Tcl. This feature lets a site create its own scheduler rather than using NLB destination selection and NQS.
Support was added for cluster-wide rerun of requests submitted to the NQE database.
Support was added for using the Distributed Computing Environment (DCE) processing capability when submitting a job to NQE. The NQE_AUTHENTICATION variable and the NQE_DCE_BIN variable were added to support this feature.
FTA was enhanced to support byte-level recovery. You can also use the NQE GUI Status window to verify the status of your FTA file transfers. The following three FTA variables were added: NQE_FTA_NLBCOLLECT, which activates logging of transfer information to the NLB; NQE_FTA_SMARTE_RESTART, which enables FTA to do byte-level restarts of partially complete transfers; and NQE_FTA_STAT_INTERVAL, which determines the time interval (in seconds) between progress updates sent to the NLB.
Support was added to change job resource limits (qalter -l option).
The PPermfile_limit option was added to the qmgr MODify Request subcommand so that you can modify the NQS per-process permanent file space limit for a request that was already submitted.
A new customized collector feature lets you store arbitrary information in the NLB. This information is periodically sent to the NLB, along with the data that is usually stored and updated by NQE. After the customized data is in the NLB, it can be used in policies or displays just as any other data. NQE collects and stores the customized data, but you define, generate, and update it. The ccollect -C filename option and the NQE_CUSTOM_FILE_LIST variable were added to support this new feature. The ccollect(8) command was modified to support this feature.
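For example (the file name is hypothetical, and the internal format of the custom object file is not described in this overview):

```
# Send the custom objects in the named file to the NLB along with
# the data the collector usually stores and updates:
ccollect -C /usr/spool/nqe/custom_objects
```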
The ilb utility was added in this release. It executes a command on a machine chosen by the NLB. To use the utility, enter the ilb(1) command followed by the command you want to execute. The NLB is queried to log you on to the appropriate machine. After the login process is complete, the command is executed, and I/O is connected to your terminal or pipeline.
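For example:

```
# Run "make all" on whatever machine the NLB selects; after the
# NLB-directed login completes, the command executes with I/O
# connected back to this terminal:
ilb make all
```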
For UNICOS systems, the NQS config.h file customization variables were moved into the nqeinfo file. An appendix in NQE Administration, publication SG-2150, lists each new nqeinfo file variable name, its equivalent config.h file variable name, the default setting, and a description of the variable's function.
The NQE 2.0 release supports all platforms except UNICOS and UNICOS/mk systems. The new features in NQE 2.0 are supported on UNICOS and UNICOS/mk systems as of the NQE 3.1 release.
The following features were added in the NQE 2.0 release:
Job dependency was added through the cevent(1) command. Job dependency provides the ability for job scripts or NQS requests to specify interdependency among events and the ability to display related status information. For example, a script might specify that request testB is to be run after request testA completes. Data concerning the events is stored in the NLB server database.
The csuspend(8) command was added; it suspends NQE batch activity when interactive use occurs. csuspend is invoked on a server.
The nqeinfo file was enhanced to support new configuration parameters that you can optionally add if your configuration requires their use. The parameters configure default paths for NQE requests, validation behavior for client commands, NQS shell invocation, the NQS temporary directory for requests, and the host name for the NQS server.
System administrators can set up a World Wide Web (WWW) interface to selected NQE functions. Once it is installed, users can access the interface through WWW clients, such as Mosaic or Netscape. Through this interface, users can submit a batch request from a file or enter it interactively, obtain status on requests, delete requests, and signal requests. Users can also view and save output files that are produced.
The cqsub -p option assigns a Unified Resource Manager (URM) priority increment. The -p option has an effect only if the request is running on a UNICOS system. On UNICOS systems, this priority is passed to URM during request registration. URM adds this value as an increment value to the priority that it calculates for the request. The priority is an integer between 0 and 63. If you do not use this option, the default is 1.
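For example (the increment value is arbitrary, within the documented range of 0 to 63):

```
# Register the request with a URM priority increment of 10
# (effective only when the request runs on a UNICOS system):
cqsub -p 10 jobscript
```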
Additional data objects in the NLB server database provide more information about NQE servers and requests.
Four new values for resources were added to the cqsub -l option, as follows (these values are used on CRAY T90 and Cray MPP systems):
The -l p_mpp_t value specifies the per-process Cray MPP residency time limit.
The -l mpp_t value specifies the per-request Cray MPP residency time limit.
The -l shm_l[imit] value specifies the per-request shared memory size limit.
The -l shm_s[egments] value specifies the per-request shared memory segment limit.
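A hedged sketch combining two of the new values (the name=value form and the units shown are assumptions, not taken from this overview):

```
# Request a per-request Cray MPP residency time of 3600 seconds and
# a per-request shared memory limit of 64 Mwords:
cqsub -l mpp_t=3600 -l shm_limit=64mw jobscript
```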