This chapter helps you identify common problems that you may experience while administering NQE and describes their possible causes and resolutions.
For information on DCE/DFS issues, see Chapter 14, “Configuring DCE/DFS”.
For information on obtaining support from customer service, see “Calling Customer Service”.
The following categories of problems are included in this chapter:
Verifying that NQE components are running
Verifying that daemons are running
NQS troubleshooting
Client troubleshooting
NLB troubleshooting
NQE database troubleshooting
Network debugging
Determining Array Services status
Reinstalling the software
Calling customer service
To ensure that NQE components are running, complete the following steps:
Ensure that the daemons are running as described in “Verifying That Daemons Are Running”.
Determine whether the NLB server daemon is up and running by using the following command:
/nqebase/bin/nlbconfig -ping
Determine whether the nqsdaemon is up and running by using the following command:
/nqebase/bin/qping -v
Determine the systems that are candidates for accepting requests based on current NLB policies by using the following command:
/nqebase/bin/nlbpolicy -p nqs
Display load-balancing request information by selecting the Load button on the NQE GUI.
Display information about queues by using the cqstatl -f or qstat -f command.
Display NQS request information by selecting the Status button on the NQE GUI display or by using the cqstatl -f or qstat -f command.
If all of these steps display nonerror information, your NQE software is ready for use.
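The command-line checks in the preceding steps can be chained in a small wrapper. The following is a minimal sketch and is not part of NQE: the function name check_nqe_components and the NQE_BIN variable are hypothetical, and the sketch assumes that each command returns a nonzero exit status when its check fails, which you should confirm on your system.

```shell
#!/bin/sh
# Sketch only: run the NQE status commands from the steps above in order.
# NQE_BIN and check_nqe_components are illustrative names, not NQE features.
# Assumption: each command exits nonzero when its check fails.
NQE_BIN=${NQE_BIN:-/nqebase/bin}

check_nqe_components() {
    failed=0
    for check in "nlbconfig -ping" "qping -v" "nlbpolicy -p nqs"; do
        # $check is left unquoted on purpose so that it splits into the
        # command name and its option.
        if "$NQE_BIN"/$check >/dev/null 2>&1; then
            echo "ok:     $check"
        else
            echo "FAILED: $check"
            failed=1
        fi
    done
    return $failed
}
```

Calling check_nqe_components prints one line per command and returns nonzero if any command failed. The GUI checks in the remaining steps still have to be done by hand.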
To ensure that the daemons are running, complete the following steps:
Log in as root.
Depending on which ps command you have in your search path, use either the ps -e or the ps -aux command to display running processes. Some systems truncate the daemon names in your ps display.
If one or more of the daemons shown in Table 16-1 are configured to run on this node but are not running, NQE is not operational. In that case, you should have received an error from the nqeinit script, or an error should appear in one of the following files: the NQS log file (NQE_SPOOL/log/nqslog); the qstart.$$.out file (where $$ is the PID of the process running qstart); or the nqscon.$$.out file, which logs NQS activity until the NQS log daemon starts.
You also can check the license manager log files for any error information.
Table 16-1 describes the NQE server daemons.
Table 16-1. NQE Server Daemons
Component | Daemons |
---|---|
Collector | ccol NLBportnumber |
FTA | ftad (when FTA NPPA transfers are active) |
| fta (when any FTA transfers are active) |
Load balancer | nlbserver |
NQE database | msqld |
LWS | nqedb_lws.tcl |
NQS | logdaemon, netdaemon, and nqsdaemon |
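Comparing the ps display against Table 16-1 can be scripted. The following is a minimal sketch; the function name missing_daemons is hypothetical. It matches names as plain substrings, so a short name such as fta also matches ftad, and it cannot account for systems that truncate daemon names in the ps display.

```shell
#!/bin/sh
# Sketch only: report which expected daemons are absent from ps output.
# missing_daemons is an illustrative name. It reads ps output on stdin and
# takes the daemon names to look for as arguments.
missing_daemons() {
    ps_output=$(cat)
    for name in "$@"; do
        # -F matches the name literally; -q only sets the exit status.
        if ! printf '%s\n' "$ps_output" | grep -F -q -- "$name"; then
            echo "$name"
        fi
    done
}

# Example use (list only the daemons configured on this node):
#   ps -e | missing_daemons nlbserver msqld nqsdaemon logdaemon netdaemon
```

Any name that the function prints was not found in the process list.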
If you have checked all of the steps described in “Verifying That NQE Components Are Running”, and the NQE daemons still are not running, try the following steps:
Ensure that you have added the necessary NQE port numbers to your /etc/services file. You may have already had these port numbers assigned to other daemons and not realized it, or you may have made a typing error when you entered the information.
Ensure that you have configured all desired NQE components in the nqeinfo(5) file. Client software contains only commands; it has no daemons.
Check the /nqebase/nqeversion/host/nqeinfo file for typing errors. This file records your choices during installation. The host is the TCP/IP host name of the machine on which you installed the software. Ensure that all of the information in this file is correct. If not, you should reinstall the software as described in NQE Installation, publication SG-5236.
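A port number that is already assigned to another daemon can be hard to spot by eye. The following is a minimal sketch that lists every port/protocol pair a services file assigns to more than one service name; the function name port_conflicts is hypothetical, and the sketch assumes the usual "name port/protocol" layout of /etc/services.

```shell
#!/bin/sh
# Sketch only: list port/protocol pairs that a services file assigns to
# more than one service name. port_conflicts is an illustrative name.
port_conflicts() {
    services_file=${1:-/etc/services}
    awk '
        { sub(/#.*/, "") }               # drop comments
        NF >= 2 {
            if (($2 in owner) && owner[$2] != $1)
                print $2 ": " owner[$2] " and " $1
            else
                owner[$2] = $1
        }
    ' "$services_file"
}
```

Run port_conflicts with no argument to check /etc/services. No output means that no pair is claimed by two different names; it does not prove that the NQE entries themselves are present or correct.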
To ensure that NQE is running properly, try the following steps:
Ensure that the nqenlb and nqebatch queues exist, are enabled, and are started on each node NQS is configured for by using the NQE cqstatl or qstat command.
Ensure that TCP/IP is running between all NQE clients and servers by using the telnet and ftp commands between the client and server systems.
If you are using file validation, ensure that NQS validation is correct by checking that you have a valid account on all NQS servers and that the submitting user name matches the user name at the executing NQS server.
Check that the submitting account has an entry in either a .rhosts or .nqshosts file in the home directory of the target user on each NQS server. Entries must have the format hostname username.
Examine the NQS log file for any error messages. The NQS log file contains messages from the various NQS daemon processes, messages indicating successful completion of events, and messages indicating specific errors that may have occurred.
To determine the NQS log file path name, use the following qmgr command:
show parameters
To increase the amount of information recorded in the log file, use the following qmgr command:
set message_types on all
This section describes some common problems users may encounter in running NQS, along with their possible causes and resolutions.
If you cannot find a request, try the following steps to locate it.
The following may show the location of the request:
The NQE GUI Status window displays all requests in the complex.
The cqstatl -a and qstat -a commands display all requests on the server defined by the NQS_SERVER environment variable. The cqstatl -a command also displays an NQE database request summary (the NQE database server is defined by the MSQL_SERVER, MSQL_TCP_PORT, and NQE_DEST_TYPE environment variables).
The cqstatl -a -h remote-hostname and qstat -a -h remote-hostname commands display all requests at a remote host.
The request may have been completed. The user should check for the standard output and standard error files produced by the request. Unless the user specified a different location for these files, they should be in the directory from which the request was submitted. If they are not there, check the user's home directory on the remote system.
The user should check electronic mail. If NQS encounters problems running a request, it may send a mail message to the issuing user.
A request may be queued without being run for the following reasons.
A request may be queued without being run when it reaches an NQS batch queue, even though the NQS batch queue has not reached its limits. This could happen because the limits for the queue complex, the user's group, or the NQS server global limits were reached.
To get details of the request, including its requested resource limits, use the cqstatl or qstat command. Following is an example of the command:
cqstatl -f -h target_host requestids
Compare these limits with the limits defined for the batch queue by using the same command to receive details about the queue. Following is an example of the command:
cqstatl -f queues
When a pipe queue's destination list points to a disabled batch queue, the default retry time for re-queuing requests has no effect.
A pipe queue's destination list should not contain any elements that are disabled batch queues. In the event that this does occur, any jobs that are submitted to the pipe queue will remain in the pipe queue if they cannot be sent to any of the other destinations in the list. Because the disabled batch queue exists, NQS waits for the queue to become enabled or for it to be deleted before it moves the job from the pipe queue. To ensure that jobs are not sent to a particular destination in the pipe queue's list, remove the destination from the list instead of disabling the batch queue.
Unless the user specified a different location for these files, they should be in the directory that was current when the request was submitted. If they are not there, check the user's home directory or, on the remote system, the home directory of the user under whose account the request executed.
If you cannot find any output files, check for electronic mail messages sent to the user. NQS might have removed the request for a variety of reasons (for example, it requested more of a resource than was allowed for that user or the request violated your system's security). The mail message contains a description of the problem.
The user may get a mail message stating that the standard output and error files could not be written back to the user's system. The message indicates where these files were actually placed (usually in the home directory of the remote user under which the request was run). These files usually cannot be written back because the user's home directory does not contain a suitable validation file entry that authorizes NQS to write the output files. However, it also may be due to a problem with the network connection to the NQS system.
NQS tries the following methods to return output to a user:
NQS tries to send the output over an NQS protocol.
NQS tries to use FTA.
NQS tries to use rcp, which works if the user has a .rhosts entry for the server on the user's submitting system (NQE client).
NQS sends the user mail stating that it could not deliver the output to its first destination. It places the output in the user's home directory (or in the home directory of the target user) on the server that executed the request.
If for some reason NQS cannot write to the user's home directory on the server that executed the request (for example, if the permissions on the user's home directory are r-xr-xr-x), it writes the output to the $NQE_NQS_SPOOL/private/root/failed directory (only root can access this directory).
If the standard error file contains an excessive number of syntax errors, the user may be using the wrong shell. Usually, it is the shell flow control commands (such as if, while, for, and foreach) that cause errors when the wrong shell is used. Ensure that the user is using the correct shell strategy.
If the user receives an error message such as file not found or file does not exist in the standard error file, the batch request may have tried to access a file that could not be found.
NQS assumes that any files the user is trying to access are in the home directory of the user at the execution host. To indicate that a file is in another location, the user should either specify the full path name or use the cd command to move to that directory before trying to access the file within the batch request script.
NQS can shut off a queue or several queues (with no intervention from a manager or operator). The reason for the action might be that the nqsdaemon session reached a process limit or that it cannot allocate a resource. To determine the reason, you should examine the NQS log file. After determining the reason and correcting the problem, the queue(s) can be restarted manually.
Requests may unexpectedly switch states from queued to wait. A checkpointed UNICOS, UNICOS/mk, or IRIX request may be waiting for a specific process or a job ID to become available. When it becomes available, the request will remain in queued state until it executes.
Most qmgr commands can be executed only by an NQS manager or operator. An operator can issue only a subset of the commands that are available to the manager. If you try to use a command that you are not allowed to use, the following message is displayed:
NQS manager[TCML_INSUFFPRV ]: Insufficient privilege at local host.
If NQS or qmgr fails with a no local machine id message, and the site is using DNS (Domain Name Service) to resolve host names, follow these procedures:
Use the hostname command to find the true host name that NQS or qmgr would be using. For example:
# hostname
sn9031
Use the nslookup utility to find the name and aliases that the DNS server has for your host; for example:
# nslookup
Default Server: VGER.PRIUS.JNJ.COM
Address: 122.147.94.7
sn9031
Server: VGER.PRIUS.JNJ.COM
Address: 122.147.94.7
Name: Cray1.PRIUS.JNJ.com
Address: 122.147.92.39
Aliases: SN9031.PRIUS.JNJ.com
exit
The DNS on the system above says sn9031 has the name Cray1.PRIUS.JNJ.com and has an alias of SN9031.PRIUS.JNJ.com. Use qmgr commands to add the machine ID with those names. Assuming that the machine ID is to be 1, the command would be as follows:
# qmgr
ADd MId 1 sn9031 Cray1.PRIUS.JNJ.com SN9031.PRIUS.JNJ.com
exit
Host names and aliases are case sensitive here: you must enter them exactly as DNS shows them, including uppercase and lowercase letters. The result of the preceding example is shown by using the qmgr sho mi command as follows:
# qmgr sho mi
MID       PRINCIPAL NAME  AGENTS  ALIASES
--------  --------------  ------  -------
1         sn9031          nqs     Cray1.PRIUS.JNJ.com SN9031.PRIUS.JNJ.com
qmgr can now find the machine ID for the host name of the local machine.
If NQS appears to be running slowly, you may need to edit the /etc/hosts file.
Because NQS supports networks of machines, it uses the gethostbyname system call to get the host information for a request. The implementation makes calls to this routine while processing all requests, including requests that are submitted on a local machine for execution on the local machine.
The gethostbyname system call looks up the host name by doing a sequential search of the /etc/hosts.bin binary file (or the text version /etc/hosts if the binary file does not exist).
The bigger the host file, and the further down in the file the local machine name appears, the longer it will take for the system call to complete, and the longer the nqsdaemon will take to process requests.
To solve this problem, edit the /etc/hosts file so that the local machine appears at the top of the file, then use the mkbinhost command to make a new copy of the binary /etc/hosts.bin file.
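To see how far down the hosts file the local machine appears, you can count the entries that precede it. The following is a minimal sketch; the function name host_position is hypothetical, and the sketch reads only the text version of the file.

```shell
#!/bin/sh
# Sketch only: print the position of a host name among the entries of a
# hosts file (1 means it is the first entry). host_position is an
# illustrative name. The exit status is nonzero if the name is not found.
host_position() {
    hosts_file=$1
    host=$2
    awk -v host="$host" '
        { sub(/#.*/, "") }           # drop comments
        NF < 2 { next }              # skip blank or incomplete lines
        {
            entry++
            for (i = 2; i <= NF; i++)
                if ($i == host) { print entry; found = 1; exit }
        }
        END { if (!found) exit 1 }
    ' "$hosts_file"
}
```

For example, host_position /etc/hosts "$(hostname)" prints the entry number for the local machine, provided that the name reported by hostname is spelled the same way in the file. A large number suggests that moving the entry to the top is worthwhile.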
By default, the qstat(1) and cqstatl(1) commands show users who are not NQS managers only the jobs that they submitted themselves. An administrator can expand this scope. To let all users display the status of every job residing at that NQS node, use the NQE configuration utility (the nqeconfig(8) command) to set the nqeinfo file variable NQE_NQS_QSTAT_SHOWALL to 1.
This section describes several common problems users may encounter in running NQE clients, along with their possible causes and resolutions.
Note: If a user does not specify a user name when submitting a request, the user name must be the same on both the client host and the NQS server or account authorization will fail. If the NQE_NQS_NQCACCT environment variable is set to ORIGIN, the user must have the same account name on both the client and the server.
If the NQS_SERVER environment variable is not set, the client tries to connect to the NQS server on the local host. If the user cannot connect to a local NQS server, the following message is displayed:
QUESRV: ERROR: Failed to connect to NQS_SERVER at host[port]
obtaining UNIX environment failed
NETWORK: ERROR: Connect: Connection refused
Ensure that the NQS_SERVER environment variable is set. This must be set in each user's environment to contain the host name of the machine running the NQS server.
It also may be true that the NQS server is not running or that a networking problem exists between the workstation and the NQS server (see “Network Debugging”).
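A quick way to confirm that the variable is set in a user's environment is sketched below. The function name unset_vars is hypothetical; it checks only that each named variable has a nonempty value, not that the value names a reachable server.

```shell
#!/bin/sh
# Sketch only: print the names of variables that are unset or empty.
# unset_vars is an illustrative name. Pass plain variable names only,
# because the names are expanded with eval.
unset_vars() {
    for var in "$@"; do
        eval "value=\${$var:-}"
        [ -n "$value" ] || echo "$var"
    done
}

# Example use: unset_vars NQS_SERVER
```

If the function prints NQS_SERVER, the client falls back to the local host as described above.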
NQE clients are installed with set UID (suid), are owned by root, and use secure ports. This means that clients running at a specific NQE release level are rejected if they try to connect to an NQE server running at a different NQE release level. The following messages are displayed:
QUESRV: ERROR: Failed to retrieve status information
QUESRV: ERROR: Non-secure network port verification error at transaction peer
QUESRV: ERROR: Received response NONSECPORT (01462) for transaction NPK_QSTAT
If you receive these messages, you should upgrade your client software.
A request that remains in an NQS queue for a long time without executing usually indicates that NQS is not running or that the machine itself is not running.
However, you should check the status of the queue in which the request is waiting to see whether the queue is currently stopped. Enter the qmgr command show queue and look under the column STS, which displays on when the queue is started. A stopped queue cannot process requests that are waiting in it.
You also can check any request attributes specified on the cqsub or the qsub command line. Ensure that they match the attributes of the batch queue.
The default validation type in NQS is file validation. File validation requires a .rhosts (or .nqshosts) file in the home directory of your account on all of the NQS server systems on which your request may run. This applies even if your target NQS server is your local machine.
If a client command returns the message No account authorization at transaction peer, or if a cqsub command results in the message being returned in a mail message, one of the following may be true:
No .nqshosts or .rhosts file exists for this user. This may be the case if the $HOME environment variable points to a location that does not have these files.
The proper hostname username pair is not contained in either the .nqshosts or .rhosts file. NQS checks the .rhosts file only when the .nqshosts file does not exist. An example .rhosts file entry for the user jane to submit requests to host snow is as follows:
snow jane
The file mode of the .nqshosts or .rhosts file may need to be changed to 644 (rw-r--r--). This may not be true for all platforms that NQE supports.
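The first two conditions in the preceding list can be checked with a short script. The following is a minimal sketch; the function name has_validation_entry is hypothetical. It follows the rule stated above that .rhosts is consulted only when .nqshosts does not exist, and it does not check the file mode.

```shell
#!/bin/sh
# Sketch only: check for a "hostname username" entry in a validation file.
# has_validation_entry is an illustrative name. Returns 0 if the entry is
# found, 1 if it is not, and 2 if neither file exists.
has_validation_entry() {
    home_dir=$1
    host=$2
    user=$3
    if [ -f "$home_dir/.nqshosts" ]; then
        file=$home_dir/.nqshosts
    elif [ -f "$home_dir/.rhosts" ]; then
        file=$home_dir/.rhosts
    else
        echo "no .nqshosts or .rhosts file in $home_dir" >&2
        return 2
    fi
    # Compare the first two fields exactly, as typed.
    awk -v h="$host" -v u="$user" \
        '$1 == h && $2 == u { found = 1 } END { exit !found }' "$file"
}
```

For the example entry above, has_validation_entry "$HOME" snow jane returns 0 when run in the home directory that NQS will consult.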
Before you can use the NQE commands, you must add the NQE bin and etc directories to your search path. Before you can use the man pages (which tell you about the NQE commands and command options), you must add the NQE man directories to your search path. For a description of how to set these variables, see Introducing NQE, publication IN-2153.
If you submit requests by using the cqsub or qsub command and receive ctoken generation failure messages in the NQS log, it indicates that NQS tried to use FTA to deliver output files and did not find a user-supplied password, a .netrc file, or an NPPA S-key. The log also contains a message indicating that the FTA transfer failed.
When NQS returns output, it tries to use FTA. FTA checks first for a user-supplied password on the cqsub or qsub command. (To determine the syntax used to specify the FTA password, see the description of the -e option on the cqsub(1) or qsub(1) man page.)
If no password was supplied, FTA tries to send the output to the host and user (with an associated password) by using the user's .netrc file on the NQS execution host. If no .netrc file exists, FTA tries to send the file by using NPPA. If you have not defined S-keys as described in “Configuring S-keys” in Chapter 13, you will receive ctoken generation failure messages.
NQS then tries to return the output by using rcp.
You can ensure that you do not receive the messages by using any of the following methods:
Ensure that users supply a password on the cqsub or qsub command line.
Ensure that users have a .netrc file on the NQS execution host that contains an entry for the destination host and user name and an associated password.
Configure NPPA as described in Chapter 13, “FTA Administration”.
Change the configuration of the FTA default inet domain so that the nopassword flag is not set. “Configuring S-keys” in Chapter 13 describes how to make changes to the FTA configuration.
If your NQE Load display fails, ensure that your NLB_SERVER environment variable is set to a host that is running an NLB server. You do not have to specify the host port number unless the NLB server is configured with an NLB port number that is different from the default (604) provided during installation.
This section describes some common problems you may encounter in running NLB, along with their possible causes and resolutions.
If NLB commands report that they cannot determine the server location, no information is available to locate the server. NQE may not have been configured on the host. To connect to the server, define the NLB_SERVER environment variable or use the -s option on the command line.
If a collector is running on a host, but no data appears in the NLB server for that host, check the following:
The collector may be pointing to the wrong location. Check that the machine on which the collector is running is configured to point at the correct host name and TCP/IP port to connect properly to the server.
An ACL may be preventing the collector from entering data into the server. If you have an ACL for NLB or NQE GUI status data, it must include a record to allow the user ID running the collector to insert and delete data. This can be either an explicit entry for a user ID on the collector host, or an entry for that ID on all hosts using "*" for the host name.
If you have modified the policies file and the nlbconfig -pol command reports an error, the policies file that you downloaded to the server may contain a syntax error. If this is true, the nlbconfig -pol command fails; however, you do not receive an explanation of the error. To see the error, stop the server (with nlbconfig -kill) and start it again (with nlbserver). Any errors in parsing the policy file will be reported to stderr. The server does not treat an error in the policy file as fatal, but load balancing will not work until you have a correct policies file.
Collectors on Solaris systems may report incorrect data under the following conditions:
If a collector appears to be working, but reports zero memory size for a machine, you are running the collector from an account that does not have permission to read from /dev/kmem, which is used when finding memory size. You must run the collector from an account with permission to read from /dev/kmem.
If a collector appears to be running, but reports random data, check that sar is working correctly. The collector tries to execute /usr/bin/sar -crquw 1 to obtain data. If the command does not work, you must fix the system problem.
If swap space data seems to be invalid, check that /usr/sbin/swap -s works. If the command does not work, you must fix the system problem.
If you are using a policy and getting unexpected results, the wrong policy is probably being used. If a policy name is not defined, a default policy is used. The policy name might be defined incorrectly if you misspell the policy name either in the NQS queue definition or in the nlbpolicy command.
If you try to modify the ACL for object type C_OBJ, you will receive the message No privilege for requested operation. You cannot modify the ACL for C_OBJ because it is fixed. It consists of the master ACL plus WORLD read permission. These permissions are necessary because, if a user cannot read the C_OBJ object type, most commands will fail.
This section describes some common problems you may encounter in running the NQE database, along with their possible causes and resolutions.
A client trying to connect to the NQE database without the proper validation will result in an error such as the following:
NETWORK: ERROR: NQE Database connection failure: Connection disallowed.
latte$
If this occurs, verify that the user has an entry in the nqedbusers file. For additional information about the nqedbusers file, see Chapter 9, “NQE Database”.
If a user is trying to submit a request to the NQE database and the NQE database server is not up, the following message will be sent:
NETWORK: ERROR: NQE Database connection failure: Cannot connect to MSQL server on latte.
latte$
For additional information on checking network connectivity, see “Network Debugging”.
The following sections discuss how to diagnose network problems.
The first step in network debugging is to ensure that the machine that is being accessed can be reached by the local host.
To ensure that you can access a host through TCP/IP, use the ping command. This command sends a data packet to the destination host and does not rely on the presence of any specific service at the destination. For example, if the node name is rhubarb, enter the following command:
ping rhubarb
Entering the previous command can result in several replies:
ping: unknown host rhubarb
If you receive this reply, the host name is unknown on your machine; obtain its Internet address and ensure that this address is in your /etc/hosts file or the network information service (NIS).
sendto: Network is unreachable
If you receive this reply, your machine does not know how to send a packet to the destination machine. Consult with your network administrator and explain what you are trying to do, because the solution may require action on his or her part. For example, a routing table may need updating, a gateway machine may be down, or the network to which you are trying to connect cannot be reached.
no answer from rhubarb
If you receive this reply, your machine knows how to get to the destination node, but did not get a reply. This usually means that either the destination machine is shut down, or the destination machine is running but does not know how to send a packet to your machine. In this case, check the routing on the destination machine.
rhubarb is alive.
As far as ping is concerned, everything is working.
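The four replies and their meanings can be summarized as a lookup. The following is a minimal sketch; the function name explain_ping_reply is hypothetical, and the patterns are the reply texts quoted in this section. Other versions of ping word their replies differently, so treat the patterns as assumptions.

```shell
#!/bin/sh
# Sketch only: map a ping reply to the diagnosis given in this section.
# explain_ping_reply is an illustrative name; the patterns are the reply
# texts quoted above and may not match every version of ping.
explain_ping_reply() {
    case $1 in
        *"unknown host"*)
            echo "host name not known locally: check /etc/hosts or NIS" ;;
        *"Network is unreachable"*)
            echo "no route to destination: consult the network administrator" ;;
        *"no answer from"*)
            echo "no reply: destination down or its routing is wrong" ;;
        *"is alive"*)
            echo "host reachable" ;;
        *)
            echo "unrecognized reply" ;;
    esac
}
```

A call such as explain_ping_reply "$(ping rhubarb 2>&1)" works only if your ping prints a single reply and exits, as in the examples above.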
If the host-to-host network connection is running, the next step is to examine the individual connections on your machine and the machine with which you are trying to communicate.
The main tool for examining TCP/IP connections is the netstat command, which examines kernel data structures and prints the current state of all connections to and from the host on which it is run. For more information on how to use this command, see the operating system manuals for your machine.
When you specify the netstat(1) command without any parameters, it reports each open connection on a machine; netstat -a also reports the connections offered by servers.
The most important information to check is the protocol type (Proto), your local address (Local Address), and the status of the connection (state).
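To pick those three fields out of a long netstat display for one port, you can filter the output. The following is a minimal sketch; the function name connections_on_port is hypothetical. It assumes a six-column TCP layout (protocol, two queue counts, local address, foreign address, state) with the port as the last component of the local address, and it assumes numeric ports, so check both against your system's netstat.

```shell
#!/bin/sh
# Sketch only: from netstat output on stdin, print the protocol, local
# address, and state of TCP entries whose local port matches the argument.
# connections_on_port is an illustrative name. The column layout is an
# assumption; netstat output differs between operating systems.
connections_on_port() {
    port=$1
    awk -v port="$port" '
        $1 ~ /^tcp/ && NF >= 6 {
            # The port follows the last "." or ":" of the local address.
            n = split($4, parts, /[.:]/)
            if (parts[n] == port) print $1, $4, $6
        }
    '
}

# Example use, with the default NQS port mentioned in this chapter:
#   netstat -an | connections_on_port 607
```

An entry in a listening state on the server's port indicates that a server is offering the connection; no output means that nothing is bound to that port on this host.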
If you receive connection failed messages, ensure that your NQS_SERVER environment variable is set to a host that is running an NQS server. You do not have to specify the host port number unless the NQS server is configured with an NQS port number that is different from the default (607) provided during installation.
If a transient network error condition exists that allows the connection to be retried, you will receive messages such as the following:
Retrying connection to NQS_SERVER on host ice (127.111.21.90) . . .
Retrying connection to NQS_SERVER on host ice (127.111.21.90) . . .
QUESRV: ERROR: Failed to connect to NQS_SERVER at ice [port 607]
NETWORK: ERROR: NQS network daemon not responding
These messages indicate that the command received a connection refused error from the connect() system call or that an address it is trying to use is temporarily in use. This can happen, for example, when the server is just coming up or is not listening, or when the network is busy. NQS retries its connection to the server for 30 seconds.
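If you want a script of your own to wait for a server that is still coming up, a time-limited retry loop is one option. The following is a minimal sketch of that general technique and is not how NQS implements its own retries; the function name retry_for is hypothetical, and the sketch assumes that your date command supports the +%s format.

```shell
#!/bin/sh
# Sketch only: rerun a command until it succeeds or a time limit passes.
# retry_for is an illustrative name, not an NQE command.
# Usage: retry_for LIMIT_SECONDS PAUSE_SECONDS command [arguments...]
retry_for() {
    limit=$1
    pause=$2
    shift 2
    start=$(date +%s)
    while :; do
        "$@" && return 0
        now=$(date +%s)
        # Give up once the limit has elapsed.
        [ $((now - start)) -ge "$limit" ] && return 1
        sleep "$pause"
    done
}

# Example use (assumes qping exits nonzero while the server is down):
#   retry_for 30 5 /nqebase/bin/qping -v
```

The function returns 0 as soon as the command succeeds and 1 if the limit passes first.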
If a user's job project is displayed as 0 instead of the project specified in /etc/project, the Array Services daemon may not be running. For information about Array Services administration, see the Silicon Graphics publication Getting Started with Array Systems.
If you want to reinstall the software, complete the following steps:
Stop all NQE daemons by using the nqestop(8) command. If this does not stop all of them, or does not work, you can use the kill -9 command to stop them.
Delete the desired version of NQE using the nqemaint(8) utility.
Reinstall the software as described in NQE Installation, publication SG-5236.
To obtain support for NQE, contact your local call center or local service representative. The cs_support(7) man page provides NQE product support information.
You also can send a facsimile (fax) to +1-404-631-2226. To obtain service, you must have your NQE product customer number.
Before you call your local call center or local service representative, have the following information ready to help with your problem:
The exact error messages and unexpected behavior you are seeing.
The ps displays of all daemons that are running on the systems with which you are having problems.
A list of NQE binary files, their paths, their protections, and their sizes.
The qstat -f or cqstatl -f output from all queues in the batch complex.
Hardware platforms, model numbers, system software levels, and additional software patches applied to your NQE platforms.
Accounting information, if any, about the exit status of any daemons or commands aborting (such as copies of the corresponding NQE log file and any error messages sent to the user).