Chapter 14. Read This Before You Start NQE

This chapter includes important information that you should be aware of before you start NQE.

For information about compatibility issues and differences that your users may experience after upgrading to NQE 3.3, see the NQE Release Overview, publication RO-5237 3.3.

Checkpointing Function on IRIX Systems

The checkpointing function is supported on IRIX 64-bit systems; NQE 3.3 requires IRIX release 6.4.1 or later for the checkpointing function to work.


Note: The new environment variable NQE_SHEPHERD_PID was added so that the qchkpnt(1) command will work on 64-bit IRIX systems. NQE_SHEPHERD_PID is added to the initiated job's environment only on 64-bit IRIX systems. The value of NQE_SHEPHERD_PID is the shepherd PID for the job.
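Because NQE_SHEPHERD_PID is present only on 64-bit IRIX systems, a job script that logs or reports the shepherd PID may need to test for the variable first. The following is a minimal sketch; shepherd_status is a hypothetical helper name and is not part of NQE.

```shell
#!/bin/sh
# Sketch only: report the shepherd PID if NQE placed it in the job environment.
# shepherd_status is a hypothetical helper name, not an NQE command.
shepherd_status() {
    if [ -n "${NQE_SHEPHERD_PID:-}" ]; then
        echo "shepherd PID: $NQE_SHEPHERD_PID"
    else
        # Expected on any system other than 64-bit IRIX.
        echo "NQE_SHEPHERD_PID not set"
    fi
}
```

Calling shepherd_status near the top of a job script records whether the variable was available to that job.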


Checkpoint and Restart Operations on IRIX Systems

Checkpoint and restart operations work by default on IRIX 6.4.1 or later systems for jobs that have the following two characteristics:

  • The job was submitted using the default two-shell invocation (see section 5.3 in the NQE User's Guide, publication SG-2148)

  • Direct job output was requested when the job was submitted (if the command-line interface was used, this means the -ro option was used with either the qsub(1) or cqsub(1) command)

Jobs that do not have these characteristics cannot be restarted when using NQE as installed by default. In order for these jobs to be restartable, you must change the permissions on the $NQEBASE/spool/private directory from 700 to 711. The permission change must be made manually after the system has been installed and configured.


Caution: Changing these permissions may open a security hole and allow any user access to the internal NQE directories. If a user inadvertently or maliciously alters the files located there, running jobs may fail and NQE operations may be corrupted. Sites that are willing to accept the security risk in order to provide checkpoint and restart operations to all jobs should make the permission modification.


This problem will be corrected in a future release of NQE and the IRIX checkpoint and restart software, allowing all jobs to be checkpointed while still maintaining proper security.
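For sites that accept the risk described in the caution above, the manual permission change could be sketched as follows. The helper name open_private_spool is hypothetical; the directory path and the 711 mode come from the text above.

```shell
#!/bin/sh
# Sketch only: relax a directory from mode 700 to 711 and show the result.
# open_private_spool is a hypothetical helper name, not an NQE command.
open_private_spool() {
    dir="$1"
    if [ ! -d "$dir" ]; then
        echo "not a directory: $dir" >&2
        return 1
    fi
    chmod 711 "$dir" || return 1
    # Print the directory entry so the new mode can be confirmed by eye.
    ls -ld "$dir"
}
```

A typical invocation, run with sufficient privileges after NQE is installed and configured, would be open_private_spool "$NQEBASE/spool/private".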

-Rf Option Supported on cqsub(1) Command

The -Rf option is now supported on the cqsub(1) command. The -Rf option forces the request to be restarted from a checkpoint image. This option is supported only on UNICOS systems.

Possible Need to Modify the ilbrc File

The ilbrc file in the nqebase/etc directory may need to be modified, depending on your system configuration. This configuration file controls how ilb behaves when logging in to a remote machine. Currently, ilbrc contains references to /usr/bin/telnet and /usr/bin/rlogin. The administrator should ensure that these paths are correct on each of the machines on which NQE is installed.
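One way to confirm that the paths referenced by ilbrc are valid on a given machine is to test each one for an executable file. The following is a minimal sketch; check_ilb_paths is a hypothetical helper name, and the sketch does not parse ilbrc itself.

```shell
#!/bin/sh
# Sketch only: report whether each path given as an argument is executable.
# check_ilb_paths is a hypothetical helper name, not an NQE command.
check_ilb_paths() {
    status=0
    for p in "$@"; do
        if [ -x "$p" ]; then
            echo "ok: $p"
        else
            echo "missing or not executable: $p"
            status=1
        fi
    done
    return $status
}
```

For the paths currently named in ilbrc, the call would be check_ilb_paths /usr/bin/telnet /usr/bin/rlogin, repeated on each machine on which NQE is installed.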

Cray DynaWeb Server


Note: This was also a dependency for accessing the NQE 3.1 and 3.2 online documentation using the Cray DynaWeb server application that is provided with the NQE release package.

A Cray DynaWeb server is required to access the following NQE 3.3 online documentation:

  • NQE Release Overview, publication RO-5237 3.3

  • NQE Installation, publication SG-5236 3.3

  • Introducing NQE, publication IN-2153 3.3

  • NQE User's Guide, publication SG-2148 3.3

  • NQE Administration, publication SG-2150 3.3

For additional information, see the Cray DynaWeb documentation that is included with your NQE 3.3 release package.

Hewlett-Packard Patch Required before You Start NQE


Note: This was also a dependency for the NQE 3.1 and 3.2 releases.


NQE 3.3 requires the following patch from Hewlett-Packard for HP-UX 9.x: PHCO_5341 Patch.

This patch fixes the following problem:

 s700_800 9.x mkdir -p will not cross a read-only NFS mount.

The patch can be accessed through the Hewlett-Packard Support Line Services World Wide Web page at the following URL:

http://us.external.hp.com

Hewlett-Packard Systems Requirement for whatis File


Note: This was also a dependency for the NQE 3.1 and 3.2 releases.


After you have installed NQE 3.3 on Hewlett-Packard systems, but before you run the man -k command, you must merge the whatis file from NQE with the whatis file in /usr/lib if you want whatis or man -k operations to pick up NQE entries. The easiest way to do this is to use cat(1) to concatenate the /usr/lib/whatis file and the NQE whatis file, pipe the concatenated file through sort, and then replace /usr/lib/whatis with the result.
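The merge described above could be sketched as follows. The helper name merge_whatis is hypothetical, the location of the NQE whatis file is site-specific and must be supplied by the administrator, and the backup copy with the .orig suffix is an added precaution that the procedure above does not require.

```shell
#!/bin/sh
# Sketch only: merge the NQE whatis file into the system whatis file.
# merge_whatis is a hypothetical helper name, not an NQE command.
merge_whatis() {
    system_whatis="$1"   # for example, /usr/lib/whatis
    nqe_whatis="$2"      # location of the NQE whatis file (site-specific)
    tmp="${system_whatis}.new.$$"

    [ -r "$system_whatis" ] && [ -r "$nqe_whatis" ] || return 1

    # Concatenate both files and sort the combined entries.
    cat "$system_whatis" "$nqe_whatis" | sort > "$tmp" || { rm -f "$tmp"; return 1; }

    # Assumption: keep a backup of the original before replacing it.
    cp "$system_whatis" "${system_whatis}.orig" || { rm -f "$tmp"; return 1; }
    mv "$tmp" "$system_whatis"
}
```

The first argument would normally be /usr/lib/whatis; the second is the path to the whatis file delivered with NQE.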

Problem with Shutdown/Restart of NQE Database


Note: This was also a dependency for the NQE 3.1 and 3.2 releases.


Following the shutdown of the NQE database (mSQL database), it may not be possible to restart the NQE database daemon, msqld, until 3 or 4 minutes have passed. If you try to restart msqld too soon, a bind system call failure (port in use) appears in the msqld_log file. The problem is caused by the way in which TCP/IP handles the closing of connections: the TIME_WAIT TCP connection state is used to allow for the proper reuse of port numbers.

To work around the problem, do not attempt to restart msqld until the MSQL_TCP_PORT (603) has been released by TCP/IP. Use the following command to verify that the port is available:

# netstat -a | grep 603

You should not see any entries of the following forms, which indicate that the MSQL_TCP_PORT 603 is still in use:

latte.603     latte.974     8192  0 8192 0 TIME_WAIT
localhost.603 localhost.974 8192  0 8192 0 LAST_ACK


Note: If the mSQL port number is defined in /etc/services, the previous command may not show the active port because the name of the port appears as the service name. In this case, you should use the netstat -a | grep msql command.
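The check above can be wrapped in a loop that waits until the port no longer appears in the netstat output. The following is a minimal sketch; port_listed and wait_for_port_release are hypothetical helper names, the 10-second polling interval and the default of 30 attempts are assumptions, and the pattern is stricter than a plain grep 603 so that unrelated numbers containing 603 do not match.

```shell
#!/bin/sh
# Sketch only. Both function names are hypothetical, not NQE commands.

# Reads "netstat -a" output on stdin; succeeds if any address on a line
# ends in .PORT or :PORT (PORT may be a number or a service name).
port_listed() {
    port="$1"
    grep -E "[.:]${port}[[:space:]]" >/dev/null
}

# Polls netstat until the port is no longer listed or the attempts run out.
wait_for_port_release() {
    port="$1"
    tries="${2:-30}"          # assumption: 30 attempts
    while [ "$tries" -gt 0 ]; do
        if ! netstat -a 2>/dev/null | port_listed "$port"; then
            return 0          # port no longer appears; safe to restart msqld
        fi
        sleep 10              # assumption: 10-second polling interval
        tries=$((tries - 1))
    done
    return 1                  # port still listed after all attempts
}
```

A call such as wait_for_port_release 603 returns success once no matching entry remains. If the port is defined in /etc/services, pass the service name instead, for example wait_for_port_release msql.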


QSUB_WORKDIR Value for Requests Submitted through the NQE Database


Note: This was also a dependency for the NQE 3.1 and 3.2 releases.


When you submit a request, the QSUB_WORKDIR environment variable is set to the directory that was current when the request was submitted. However, for requests that are submitted through the NQE database, QSUB_WORKDIR is set to the job owner's $HOME directory on the NQE database LWS. For job owners whose $HOME directory is within the Distributed File System (DFS), it is set to the NQE_BIN directory instead.

NLB Differences for UNICOS/mk Systems


Note: This was also a dependency for the NQE 3.1 and 3.2 releases.


The Network Load Balancer (NLB) database information is supplied by the ccollect(8) program. On UNICOS/mk systems, ccollect(8) relies on the sar(1) command to retrieve system performance data. Currently, the sar command on UNICOS/mk systems does not provide memory or swapping usage statistics. As a result, the NQE GUI Load window for memory demand displays a fixed value of 96% when reporting on UNICOS/mk systems.

In addition, the following NLB attributes are not meaningful for UNICOS/mk systems:

NLB_A_SWAPPING
NLB_SWAPPING
NLB_SWAPSIZE
NLB_SWAPFREE
NLB_FREEMEM
NLB_A_FREEMEM

UNICOS Sites Using ftp Interface or USCP Interface


Note: This was also a dependency for the NQE 3.1 and 3.2 releases.


This section describes additional installation changes needed for sites using the UNICOS ftp(1) interface or the USCP interface to NQS. The NQS commands qsub(1), qstat(1), and qdel(1) have been moved from /usr/bin to /nqebase/bin; for USCP, the qmsg(1) command has also been moved from /usr/bin to /nqebase/bin.

These commands and their absolute path names are used in the ftp interface and in the USCP interface to NQS. Sites that want to use the ftp or USCP interface to NQS must create symbolic links in /usr/bin that point to the commands in /nqebase/bin. This can be done after NQE is installed.

For the ftp interface, enter the following commands to delete the old commands from /usr/bin and create symbolic links to the versions available in NQE 3.3:

rm /usr/bin/qsub
rm /usr/bin/qstat
rm /usr/bin/qdel

ln -s /nqebase/bin/qsub /usr/bin/qsub
ln -s /nqebase/bin/qdel /usr/bin/qdel
ln -s /nqebase/bin/qstat /usr/bin/qstat

For the USCP interface, enter the following commands to delete the old commands from /usr/bin and create symbolic links to the versions available in NQE 3.3:

rm /usr/bin/qsub
rm /usr/bin/qstat
rm /usr/bin/qdel
rm /usr/bin/qmsg

ln -s /nqebase/bin/qsub /usr/bin/qsub
ln -s /nqebase/bin/qdel /usr/bin/qdel
ln -s /nqebase/bin/qstat /usr/bin/qstat
ln -s /nqebase/bin/qmsg /usr/bin/qmsg
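The two command sequences above differ only in the list of commands, so they can also be expressed as a single loop. The following is a minimal sketch; link_nqs_commands is a hypothetical helper name, and the directories are passed as arguments so that the loop can be tried against a scratch directory before it is run against /usr/bin.

```shell
#!/bin/sh
# Sketch only: replace each named command in DEST_DIR with a symbolic link
# to the same command in SRC_DIR.
# link_nqs_commands is a hypothetical helper name, not an NQE command.
link_nqs_commands() {
    src_dir="$1"
    dest_dir="$2"
    shift 2
    for cmd in "$@"; do
        rm -f "$dest_dir/$cmd"                          # delete the old command
        ln -s "$src_dir/$cmd" "$dest_dir/$cmd" || return 1
    done
}
```

For the ftp interface the call would be link_nqs_commands /nqebase/bin /usr/bin qsub qstat qdel; for the USCP interface, add qmsg to the end of the list.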

Command Paths Changed for UNICOS Systems


Note: This was also a dependency for the NQE 3.1 and 3.2 releases.


Previous releases of NQS and NQX on Cray Research systems were bundled with UNICOS and installed in /etc, /usr/bin, /usr/lib, /usr/include, and /usr/man. NQE 3.3 on UNICOS and UNICOS/mk platforms is released as an asynchronous product and is installed in the /nqebase directory (that is, /nqebase on UNICOS, UNICOS/mk, and Solaris systems or /usr/craysoft/nqe on all other supported platforms). For a description of the NQE directory structure, see Chapter 3, “NQE Directory Structure”.

This path change affects both administrators and end users of NQS, FTA, and NQX. Users must be notified of the new command location so their user environments can be changed to access the commands from /nqebase/bin. For example, this command path change will affect user cron jobs, job submission scripts, and any user programs that reference NQS, FTA, or NQX commands.

The system files that set up user environments can be modified to add /nqebase/bin to the default path. The modules package can be used to set up the appropriate path to the NQE commands. See Chapter 16, “Using modules with NQE”, for more information about the modules package.
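For sites that modify a Bourne-style system profile rather than using the modules package, the path change could be sketched as follows. The helper name add_to_path is hypothetical, and which system file the fragment belongs in depends on the site and the login shells in use.

```shell
#!/bin/sh
# Sketch only: append a directory to PATH unless it is already present.
# add_to_path is a hypothetical helper name, not an NQE command.
add_to_path() {
    dir="$1"
    case ":$PATH:" in
        *":$dir:"*) ;;                 # already in PATH; leave it unchanged
        *) PATH="$PATH:$dir" ;;
    esac
    export PATH
}
```

Calling add_to_path /nqebase/bin from the profile makes the NQE commands available to login sessions. Scripts and cron jobs that use absolute path names for NQS, FTA, or NQX commands still need to be edited by hand.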

UNICOS/mk Changes to /etc/rc and /etc/config/rcoptions


Note: This was also a dependency for the NQE 3.1 and 3.2 releases.


For UNICOS/mk systems, the RC_NQSLOGFILE variable has been removed from the /etc/config/rcoptions file. The /etc/config/rcoptions file is created by the UNICOS/mk installation tool. In addition, the code in the /etc/rc script that references the RC_NQSLOGFILE variable has been removed. The RC_NQSLOGFILE variable was used to provide a way to move the current NQS log file to logfile.date during system startup.

The NQS qmgr(8) command provides log file segmentation at NQS startup and periodically as desired. For further information, see the qmgr(8) man page.

NQS Customized Variables Now May Be Defined in the nqeinfo File


Note: This was also a dependency for the NQE 3.1 and 3.2 releases.


NQS customized variables may be defined in the nqeinfo file by using the nqeconfig(8) command. These customized variables were made available in the NQE 3.0 release for workstation platforms by manually editing the nqeinfo file. Prior to the NQE 3.1 release, these customized variables were available on UNICOS systems within the /usr/src/net/nqe/src/nqs/include/config.h file. For a list of each config.h file variable and its equivalent nqeinfo file entry, see NQE Administration, publication SG-2150. For further information about the nqeconfig(8) command, see Appendix B, “Configuring NQE Variables”, or the nqeconfig(8) man page.

DCE Considerations


Note: This was also a dependency for the NQE 3.1 and 3.2 releases.


If you plan to use DCE with NQE, please note the following considerations:

  • NQE support of DCE/DFS does not include support for an installation running DFS-only file space. The NQE spool and binary trees, among other components, must reside in UNIX file space.


    Note: On UNICOS systems, this release of NQE supports only DCE/DFS version 1.1.


  • UNICOS and IRIX systems must have the DCE integrated login feature enabled in order for DCE credentials to be passed through NQE. For further information, see Section 14, “Configuring DCE/DFS” in NQE Administration, publication SG-2150 and Cray DCE Client Services/Cray DCE DFS Server Release Overview, publication RO-5225.

  • For IBM AIX 4.2 systems to run NQE 3.3 with DCE/DFS, you must install the IBM APAR ix59568 patch; otherwise, system crashes may occur, unless DCE/DFS is disabled. AIX customers may obtain the patch by calling 1-800-CALLAIX and requesting the fix for APAR ix59568.


    Note: This patch is not needed when running NQE 3.3 without DCE/DFS on IBM AIX 4.2 systems.


  • On HP-UX systems, NQE is configured to provide DCE authentication by adding the NQE_AUTHENTICATION variable, set to dce, in the nqeinfo file. If this is done on an HP-UX system that does not have DCE installed, NQE jobs initiated on this HP-UX system abort. The following mail message is sent to the job owner:

    Request aborted via a signal.
    Request deleted.
    
    Aborting signal was: 6

    The following message is written into the NQS log file:

    /lib/dld.sl: Can't find path for shared library: libc_r.sl

    If this occurs, stop NQE on the HP-UX system and remove the NQE_AUTHENTICATION variable from the nqeinfo file on that system. After you restart NQE, jobs on this non-DCE HP-UX system run without this error.

FLEXlm Documentation

As of the NQE 3.0 release, the PostScript file containing the Flexible License Manager End User Manual is no longer provided with NQE. For more information on FLEXlm, access the GLOBEtrotter Software, Inc., World Wide Web page at the following URL:

http://www.globetrotter.com

Also, you can order the Flexible License Manager End User Manual from GLOBEtrotter Software, Inc., or from the Cray Research Distribution Center.