Chapter 16. Solving Problems

This chapter describes some problems that you may encounter when using NQE, along with their possible solutions.

If your request does not complete successfully, you will receive either an error message in the standard error file or a mail message that describes the problem; your job log may also be appended to your mail, depending on which cqsub or qsub command options are set.

Commands Do Not Execute

Before you can use the NQE commands, you must add the /nqebase/bin directory (and the /nqebase/etc directory, to use administrator commands) to your search path. Before you can use the man pages (which describe the NQE commands and their options), you must add the /nqebase/man directory to your man page search path (usually the MANPATH environment variable). For a description of how to set these variables, see “Setting Environment Variables” in Chapter 2.
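As a minimal sketch for sh or ksh users, the search paths could be extended as shown below. The /nqebase prefix comes from the text above; the use of MANPATH for man pages is an assumption about your system, and csh users would use setenv instead. Chapter 2 remains the authoritative description.

```shell
# Sketch for sh/ksh; /nqebase is the installation prefix named in the text.
NQE_BASE=/nqebase

# Add the NQE user commands to the command search path.
PATH="$PATH:$NQE_BASE/bin"
# PATH="$PATH:$NQE_BASE/etc"    # uncomment to use administrator commands

# Assumption: man(1) on this system honors MANPATH. If MANPATH was not
# set before, setting it may replace the system default man path.
MANPATH="${MANPATH:+$MANPATH:}$NQE_BASE/man"

export PATH MANPATH
```

Placing these lines in your .profile makes the setting take effect at each login.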

Requests Not Queued

When you try to submit a request to NQS, you may receive the following message:

NQS local daemon is not present at local host.
Retry later.

This message indicates that the NQS system is not running. You can retry later, contact your NQE administrator, or try a different NQS server node. If you do not know which systems are NQS server nodes in your NQE cluster, contact your NQE administrator.

If you try to submit a request without specifying a queue name, you may get the following message:

No request queue specified, and no local default has been defined.
Request not queued.

The message indicates that no default queue is currently defined. You can do any of the following:

  • Ask the NQE administrator to define a default queue.

  • Using the NQE GUI, select General Options on the Configure menu of the Submit window, enter the queue name in the Queue name field, and apply the change.

  • Define your own default queue by using the QSUB_QUEUE environment variable (see “Unsuccessful Submissions” in Chapter 4).

  • Using either the cqsub or qsub command, specify a queue by using the -q queuename option.
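The last two options in the preceding list can be sketched for sh or ksh as follows. The queue name batch, the helper function submit_with_queue, and the script name in the usage comment are hypothetical; only the QSUB_QUEUE variable and the -q option come from the text above.

```shell
# Hypothetical queue name; use one that your NQE administrator has defined.
QSUB_QUEUE=batch
export QSUB_QUEUE

# Alternative: name the queue explicitly on the command line with -q.
# $1 is the request script to submit.
submit_with_queue() {
    if command -v cqsub >/dev/null 2>&1; then
        cqsub -q "$QSUB_QUEUE" "$1"
    else
        echo "cqsub not found; check your search path" >&2
        return 127
    fi
}

# Usage (hypothetical script name):
#   submit_with_queue myrequest.sh
```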

When you submit a request, you may get the following message:

Access denied at local host.

This message indicates that the queue to which you submitted the request has access restrictions that prohibit your request from entering the queue. The restriction can be either of the following:

  • The queue is pipeonly, which means that it can accept a request only from a pipe queue.

  • Only certain users or user groups can submit requests to the queue.

To determine the access restriction, use the cqstatl -f or the qstat -f command (see “Displaying Queue Details” in Chapter 11).

If the UNICOS MLS feature or the UNICOS/mk security enhancements are enabled on your system, and the requested label does not dominate the submission label, you may receive the following message:

Unanticipated transaction failure at local host.

Requests Not Executing

If you find that a request remains in a queue for a long time, check the Status column of the NQE GUI Status window or the ST column of the cqstatl or qstat display, which lists a status code that will help you determine the problem. The following are possible reasons:

  • A letter W can indicate the following:

    • Your request is waiting for a license (see “No Licenses Are Available”).

    • A pipe queue's destination queue is disabled.

    • A pipe queue's destination is at a remote system that is not available. NQS automatically retries sending the request at periodic intervals.

  • A letter Q that has a substatus code of qs indicates the queue has stopped. Only the NQE administrator can start and enable NQS queues.

  • A letter Q that has a substatus code beginning with c, g, or q (except qs) indicates that a limit has been reached for the queue complex, globally on the NQS server, or for a queue. You can wait for resources to become available or delete the request.

  • Your request may be accepted into an NQS queue and later be deleted. Check your electronic mail for a message such as the following:

    The request could not be routed to any of the possible pipe
    queue destinations because of the following reason(s)  :
    
    No account authorization at transaction peer;

    If you receive this message, one of the following is true:

    • Password checking is in use and either you did not request a password prompt (by selecting Set Password on the Actions menu of the NQE GUI Submit window, by using cqsub -P, or by setting the NQS_PASSWORD_NEEDED environment variable), or you supplied a password that is not valid.

    • Validation file checking is in use and no correct entry exists in the validation file for the user name you used to submit the request. See Chapter 2, “Preparing to Use NQE”.

    • UNICOS MLS or UNICOS/mk security enhancements are enabled to run on the remote system and the workstation access list (WAL) does not allow access to NQS services.

    See “-h Option Displays Error” for more information about troubleshooting validation files.

Another possible reason that your request is not executing is that NQS, when it reads your .cshrc file or your request script file, encounters commands that it cannot invoke, which can prevent your request from running. These problems are reported in your standard error file. NQS also sets an environment variable called ENVIRONMENT to the value BATCH for each NQS-initiated job. You can check this variable within a .profile, .login, or .cshrc script to differentiate between interactive and batch sessions, for example, to avoid performing terminal setup operations for a batch job. A benefit of NQS initiating the batch job as a login shell is that .profile, .login, or .cshrc scripts are run and your environment is set up as expected.
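The ENVIRONMENT check might look like the following fragment for a .profile (sh or ksh). This is a sketch only: the function name is_batch_session is hypothetical, and the terminal setup commands are left as a comment because they are site specific.

```shell
# Succeeds when the shell was started by NQS for a batch request.
is_batch_session() {
    [ "${ENVIRONMENT:-}" = "BATCH" ]
}

if is_batch_session; then
    # Batch request: skip terminal setup, which needs a tty.
    :
else
    # Interactive login: terminal setup commands (for example, stty)
    # would go here.
    :
fi
```

In a .cshrc or .login file, the equivalent test uses csh syntax and checks $?ENVIRONMENT before comparing the value.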

If the UNICOS multilevel security (MLS) feature or the UNICOS/mk security enhancements are enabled on the remote system, you cannot submit a request to a remote host if the remote host has a workstation access list (WAL) entry for the host of origin that restricts your access to NQS services.

If the UNICOS MLS feature or the UNICOS/mk security enhancements are enabled on your system, the request is deleted if a requested execution label is not within the authorized range, and you may receive the following mail message:

System security violation.
Request deleted.

Bad session security attributes.

Connection Failure Messages

If you are using a client command to talk directly to an NQS server and a network error condition exists that can be retried, you will receive messages such as the following:

Retrying connection to NQS_SERVER on host ice (147.111.21.90) . . .
QUESRV: ERROR: Failed to connect to NQS_SERVER at ice [port 607]
NETWORK: ERROR: NQS network daemon not responding

The server may have just come up, it may not be listening, or the network may be busy. The client will retry the connection to the server for 30 seconds. You also can verify that the value of the NQS_SERVER environment variable is set to a host running NQS.
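To check the NQS_SERVER setting from sh or ksh, a small helper such as the following could be used. The function name show_nqs_server is hypothetical, and the sketch only reports the value; it does not test whether the named host is actually running NQS.

```shell
# Report the current NQS_SERVER setting for this shell.
show_nqs_server() {
    if [ -n "${NQS_SERVER:-}" ]; then
        echo "NQS_SERVER is set to: $NQS_SERVER"
    else
        echo "NQS_SERVER is not set"
    fi
}

show_nqs_server
```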

If you are using a client command or the NQE GUI to connect to the NQE database, and the database server is not up or does not exist, you will see a message such as the following:

Connect: Connection refused
NETWORK: ERROR: NQE Database connection failure:
        Can't connect to MSQL server on Latte

Authorization Failure Messages

The default validation type in NQS is file validation. File validation requires a .rhosts or .nqshosts file in the login directory of your account on all of the NQE node systems where your request may run. This applies even if your target NQS server is your local machine.

For information about validation files, see “NQE Database Authorization” in Chapter 2.

If you encounter authorization failure messages, consider these possible causes:

  • A .rhosts or .nqshosts file may not exist for the user name under which the request will be run. This may be the case if the $HOME environment variable points to a location that does not have these files.

  • The proper hostname-username pair is not contained in a validation file. You must have an entry in a validation file for every NQE node that the Network Load Balancer (NLB) or the NQE database may select to run your request.

  • A password is required and was not supplied, or the wrong password was supplied.

  • Depending on your local network configuration, host names of nodes might have to be fully qualified (that is, you may have to include ice.site.com rather than simply ice).

  • You may have created both .rhosts and .nqshosts files, and only the .rhosts file is correct. When the .nqshosts file exists, NQS ignores the .rhosts file.
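Several of the preceding checks can be sketched as a shell function. The function name check_validation_entry is hypothetical, and the sketch assumes that each validation file line holds a host name followed by a user name, separated by white space. It follows the rule stated above that .rhosts is ignored when .nqshosts exists.

```shell
# Look for a "hostname username" entry in the validation file that
# NQS would consult in $HOME.
# Returns 0 if found, 1 if not found, 2 if no validation file exists.
check_validation_entry() {
    host=$1
    user=$2
    if [ -f "$HOME/.nqshosts" ]; then
        file="$HOME/.nqshosts"      # takes precedence over .rhosts
    elif [ -f "$HOME/.rhosts" ]; then
        file="$HOME/.rhosts"
    else
        echo "no .nqshosts or .rhosts file in $HOME"
        return 2
    fi
    if awk -v h="$host" -v u="$user" \
        '$1 == h && $2 == u { found = 1 } END { exit !found }' "$file"
    then
        echo "$file: entry found for $host $user"
    else
        echo "$file: no entry for $host $user"
        return 1
    fi
}

# Usage, with the host and user names that appear in this chapter's examples:
#   check_validation_entry ice.site.com snow
```

Run the check on each NQE node where your request may execute, and try both the short and the fully qualified host name.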

NQE Database Authorization Failures

A client that tries to connect to the database without the proper validation receives an error such as the following:

latte$ cqstatl
NETWORK: ERROR: NQE Database connection failure:
         Connection disallowed.
latte$

To determine why you cannot connect, check with your database administrator.

Requests Disappear

If it appears that your request has disappeared, the following procedures may reveal its location (if these do not uncover the request, consult with your NQE administrator):

  • Enter either the cqstatl -a or the qstat -a command to display all of your requests on the local machine; this display may reveal the location of your request.

  • If you think that your request may have been sent to a remote system for execution, enter either the cqstatl -a -h hostname or the qstat -a -h hostname command to display all of your requests at the remote system.

  • Check to see whether you have received a mail message about the problem; this message may help you to determine the cause of the problem.

  • In the NQE GUI, display all of your requests in NQS batch queues in the cluster by using the left mouse button to click on the Status button. For information about interpreting this display, see “Using the NQE GUI Status Window” in Chapter 10.

  • Check to see whether the request has completed. Look for the standard output and standard error files that the request produced. Unless you specified a different location for these files, they should be in the directory from which you submitted your request. If they are not there, check your home directory on the NQS server at which your request executed.

    If you used an alternative user name, try the home directory of the user name you used to submit the request.

    If the request was executed at a remote system, try checking the remote system in the home directory of the user under whom the request was executed.
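The first two status checks in the preceding list can be combined in a small sh or ksh helper. The function name locate_request is hypothetical; the cqstatl options are the ones described above.

```shell
# Display all of your requests locally and, if a host name is given,
# on that remote system as well.
locate_request() {
    remote_host=${1:-}
    if ! command -v cqstatl >/dev/null 2>&1; then
        echo "cqstatl not found; check your search path" >&2
        return 127
    fi
    cqstatl -a
    if [ -n "$remote_host" ]; then
        cqstatl -a -h "$remote_host"
    fi
}

# Usage:
#   locate_request            # local machine only
#   locate_request hostname   # local machine, then the named host
```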

NQE Scheduler Not Scheduling

If the NQE scheduler is not scheduling the request, an NQS server may not be available to accept your request.

-h Option Displays Error

You may receive an error message when you issue a cqstatl, cqdel, qdel, or qstat command with the -h option to specify the host name of an NQS server, or when you use the NQE GUI to delete or signal a request on a specified host.

When using the cqstatl or qstat command, the following message may be displayed:

No account authorization at transaction peer.

When using the cqdel or qdel command or the NQE GUI, the following message may be displayed:

No account authorization on target host

When you encounter either of the preceding messages, consider these possible causes:

  • A .rhosts (or .nqshosts) file may not exist for this user. This may be the case if the $HOME environment variable points to a location that does not have these files.

  • The proper hostname-username pair is not contained in either of these files. NQS checks the .rhosts file only when the .nqshosts file does not exist.

  • A password is required and was not supplied, or the wrong password was supplied.

For further information, see Chapter 2, “Preparing to Use NQE”.

Resource Limits Exceeded

If your request exceeds resource limits, errors or unexpected results can occur. If an error occurs, your request proceeds with the next command. You should therefore examine the return values from commands and calls executed within a request to check whether an error has occurred.

Within a request file, you can verify the exit status of the last command executed by examining one of the following:

  • The ? variable for requests that use sh or ksh

  • The status variable for requests that use csh or tcsh

You also can look at the job log for exit status information.
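For a request that runs under sh or ksh, the exit status check can be sketched as follows. The helper name run_step is hypothetical, and true and false stand in for the real commands of a request.

```shell
# Run a command, report a nonzero exit status on standard error,
# and pass the status back to the caller.
run_step() {
    "$@"
    rc=$?                    # $? holds the status of the last command
    if [ "$rc" -ne 0 ]; then
        echo "step failed (exit status $rc): $*" >&2
    fi
    return "$rc"
}

run_step true
run_step false || echo "continuing after a failed step"
```

A csh or tcsh request would test $status immediately after the command instead of $?.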

Some limits violations can terminate your program, which may result in the generation of a core file. If your job does not complete successfully and there are no other errors indicated, you should look for a core file in the directories in which your job was executing.

Output Files Cannot Be Found

Unless you specified a different location for the standard output, standard error, and job log files, they should be in the directory you were working in when you submitted your request.

The first thing you should do is check your electronic mail. If NQS has tried to write your output to another directory, it sends you mail. For examples of mail messages, see “Using NQS Mail” in Chapter 4.

The most likely locations for your output are as follows:

  • Your home directory on the executing NQS server


    Note: For NQE database requests, the NQE database will send your output files to your home directory on your NQE client.


  • If you used the NQE GUI, or the cqsub -u or qsub -u command, the home directory of the user name you specified

  • The NQS failed directory

  • A location set by the NQE GUI output defaults, such as the job output directory (check these settings if you submitted the request from the NQE GUI)
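A search of the likely locations can be sketched as a shell function. The function name find_request_output is hypothetical, and the file name pattern (request name, then .o or .e, then the request sequence number) is an assumption inferred from the sample mail message in this chapter (STDIN.o1262 and STDIN.e1262); it does not apply if you named the output files yourself.

```shell
# Look for default-named stdout and stderr files of a request in the
# given directories. Returns 0 if at least one file is found.
find_request_output() {
    name=$1      # request name, for example STDIN
    seq=$2       # request sequence number, for example 1262
    shift 2
    found=1
    for dir in "$@"; do
        for f in "$dir/$name.o$seq" "$dir/$name.e$seq"; do
            if [ -f "$f" ]; then
                echo "$f"
                found=0
            fi
        done
    done
    return "$found"
}

# Usage: search the submission directory and your home directory.
#   find_request_output STDIN 1262 "$PWD" "$HOME"
```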

NQS uses the following methods, in the order listed, to return your output to you:

  1. NQS tries to send it by using NQS protocol. This method works if you submitted the request from an NQS server.

  2. NQS tries to use fta. This method works if you have a .netrc file containing a password on the execution node or if FTA has been configured between client and node to use NPPA (no passwords).

  3. NQS tries to use rcp. This method works if the .rhosts file on your submitting system has an entry for the NQE node that executed your request. Depending on your local network configuration, host names of nodes might have to be fully qualified (that is, you may have to include ice.site.com rather than simply ice).

  4. NQS sends you mail stating that it could not deliver your output to its first destination. It places your output in your home directory on the NQS execution server.

  5. NQS delivers your output to an administrative directory. To retrieve the files, you must contact your system administrator.

When using DCE/DFS, note the following:

  • After a request completes, NQS uses kdestroy to destroy any credentials obtained by NQS on behalf of the request's owner.


    Caution: On UNICOS systems, do not put a kdestroy within a request's job script; it will destroy the credentials obtained by NQS and prevent NQS from returning request output files into DFS space.


  • On UNIX platforms, no integrated login system feature is available. NQS on UNIX platforms obtains separate DCE credentials for request output return. Therefore, a kdestroy placed within a request's job script running on an NQE UNIX server will not affect the return of request output files into DFS space.

If the UNICOS MLS feature or the UNICOS/mk security enhancements are enabled on your system, the job output files are labeled with the job execution label. For jobs that are submitted locally, the return of the job output files may fail if the job submission directory label does not match the job execution label. For example, if a job is submitted from a level 0 directory, and the job is executed at a requested level 2, the job output files cannot be written to the level 0 directory. If the home directory of the UNICOS user under whom the job ran is not a level 2 directory, does not have a wildcard label, or is not a multilevel directory, the job output files cannot be returned to that directory either. The job output files will be stored in the NQS failed directory.

If the UNICOS MLS feature or the UNICOS/mk security enhancements are enabled on your system and you submitted a job remotely, the Internet Protocol Security Options (IPSO) type and label range, which are defined in the network access list (NAL) entry for the remote host, affect the return of the job output files. The following example shows a successful return of the job output files to a wildcard-labeled home directory on the execution host:

Message concerning NQS request: 1262.cool ended.
Request name:   STDIN
Request owner: snow
Mail sent at:   10:24:04 CST
Request exited normally.

_Exit() value was: 0.

  Stdout file staging event status:
  Destination: -o cool:/home/usr/snow/STDIN.o1262
    Output file could not be returned to primary destination.
    Output file successfully returned to backup destination
    in user home directory on the execution machine.

    Transaction failure reason at primary destination:
    File access denied at transaction peer.

  Stderr file staging event status:
  Destination: -e coal:/home/usr/snow/STDIN.e1262
    Output file could not be returned to primary destination.
    Output file successfully returned to backup destination
    in user home directory on the execution machine.

    Transaction failure reason at primary destination:
    File access denied at transaction peer.

For more information on locating output files, see “Finding Lost Output” in Chapter 6.

stdout Reports no access to tty

The following message is always displayed at the start of the standard output file for batch requests that have executed under the C shell:

Warning: no access to tty; thus no job control in this shell...

This message does not indicate that an error has occurred. It is simply a warning that the usual C shell job control options are not available because this is a batch request.

Job control is a means of controlling multiple shells or processes interactively. It is not available with a batch request because no interactive session (referred to as tty in the message) is associated with the job.

stderr Reports Many Syntax Errors

If the standard error file contains an excessive number of syntax errors, you may be using the wrong shell.

For information on selecting a shell, see “Specifying a Shell” in Chapter 5. Usually, it is shell flow control commands (such as if, while, for, and foreach) that cause errors when the wrong shell is used.
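As an illustration of why flow control commands fail under the wrong shell, the following sketch shows a loop in sh or ksh syntax, with the csh form of the same loop as a comment. The variable names are arbitrary.

```shell
# sh/ksh syntax: for ... do ... done
count=0
for item in a b c; do
    count=$((count + 1))
done
echo "processed $count items"

# The csh/tcsh form of the same loop is shown below. Each shell
# rejects the other's form with syntax errors.
#   @ count = 0
#   foreach item (a b c)
#       @ count++
#   end
```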

stderr Reports file not found

When a request begins execution, NQS assumes that any files you are trying to access are in your home directory (or the initial working directory you are in when you log in interactively).

To indicate that a file is somewhere else, do one of the following:

  • Use the full path name

  • Move to the correct directory by using the cd command in your request before you try to access the file
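Both approaches can be sketched in a request script for sh or ksh. The DATA_DIR variable, its /tmp default, and the file name input.dat are hypothetical placeholders for your own directory and file.

```shell
# Hypothetical data location; replace with the directory holding your files.
DATA_DIR=${DATA_DIR:-/tmp}

# Option 1: refer to the file by its full path name.
input_file="$DATA_DIR/input.dat"
if [ ! -r "$input_file" ]; then
    echo "cannot read $input_file" >&2
fi

# Option 2: change to the directory first, checking that cd succeeded,
# then use relative file names.
if cd "$DATA_DIR"; then
    echo "working in $(pwd)"
else
    echo "cannot change to $DATA_DIR" >&2
fi
```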

No Licenses Are Available

When you try to submit a request and no NQE licenses are available, your request is accepted, but it will be held in a wait state in NQS. You will see the following status messages:

Wlm
WAITING
License Unavailable

NQE tries to obtain a license every 3 minutes.

DCE/DFS Credentials Not Obtained

Failure to obtain DCE credentials results in a nonfatal error. The request will be initiated even if the attempt to obtain DCE credentials for the request owner fails. If DCE credentials are successfully obtained, the KRB5CCNAME environment variable is set within the request process that is initiated.
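Because the request starts even when credentials are missing, a job script may want to check the KRB5CCNAME variable before relying on DCE services. The following sh or ksh sketch uses a hypothetical function name and tests only whether the variable is set, not whether the credentials are still valid.

```shell
# Report whether NQS set a DCE credential cache name for this request.
report_dce_credentials() {
    if [ -n "${KRB5CCNAME:-}" ]; then
        echo "DCE credential cache: $KRB5CCNAME"
    else
        echo "KRB5CCNAME is not set; DCE credentials may not have been obtained"
        return 1
    fi
}

report_dce_credentials || echo "continuing without DCE credentials" >&2
```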

A restarted job correctly gets the new credentials obtained from NQS, but the KRB5CCNAME environment variable within the restart file is not reset to the new cache file name. After the job is restarted, a klist within the job script will incorrectly state that there are no credentials. As a result, DCE services are affected but not DFS, which continues to work with the new credentials.