Chapter 6. Troubleshooting ONC3/NFS

This chapter suggests strategies for troubleshooting the ONC3/NFS environment, including automounting. This chapter contains these sections:

  • General Recommendations

  • Understanding the Mount Process

  • Identifying the Point of Failure

  • Troubleshooting NFS Common Failures

  • Understanding the Automount Process

  • Troubleshooting CacheFS

General Recommendations

If you experience difficulties with ONC3/NFS, review the ONC3/NFS documentation before trying to debug the problem. In addition to this guide, the ONC3/NFS Release Notes and the manual pages for mount(1M), nfsd(1M), showmount(1M), exportfs(1M), rpcinfo(1M), mountd(1M), inetd(1M), fstab(4), mtab(4), lockd(1M), statd(1M), automount(1M), and exports(4) contain information you should review. You do not have to understand them fully, but be familiar with the names and functions of relevant daemons and database files.

Be sure to check the console and /var/adm/SYSLOG for messages about ONC3/NFS or other activity that affects ONC3/NFS performance. Logged messages frequently provide information that helps explain problems and assists with troubleshooting.

Understanding the Mount Process

This section explains the interaction of the various players in the mount request. If you understand this interaction, the problem descriptions in this chapter will make more sense. Here is a sample mount request:

# mount krypton:/usr/src /n/krypton.src

These are the steps mount goes through to mount a remote file system:

  1. mount parses /etc/fstab.

  2. mount checks to see if the caller is the superuser and if /n/krypton.src is a directory.

  3. mount opens /etc/mtab and checks that this mount has not already been done.

  4. mount parses the first argument into the system krypton and remote directory /usr/src.

  5. mount calls library routines to translate the host name (krypton) to its Internet Protocol (IP) address. Depending on the host resolution order in /etc/resolv.conf, mount uses /etc/resolv.conf, the NIS databases, or the DNS databases to determine the NFS server. See resolver(4).

  6. mount calls krypton's portmap daemon to get the port number of mountd. See portmap(1M).

  7. mount calls krypton's mountd and passes it /usr/src.

  8. krypton's mountd reads /etc/exports and looks for the exported file system that contains /usr/src.

  9. krypton's mountd calls library routines to expand the host names and network groups in the export list for /usr.

  10. krypton's mountd performs a system call on /usr/src to get the file handle.

  11. krypton's mountd returns the file handle.

  12. mount does a mount system call with the file handle and /n/krypton.src.

  13. mount does a statfs(2) call to krypton's NFS server (nfsd).

  14. mount opens /etc/mtab and adds an entry to the end.

Any of these steps can fail, some of them in more than one way.
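The client-side parsing in step 4 can be sketched in shell with parameter expansion; the argument value is the one from the example above.

```shell
# Sketch of step 4: splitting the first mount argument into the server
# name and the remote directory at the first colon.
arg='krypton:/usr/src'
server=${arg%%:*}    # text before the first colon -> krypton
dir=${arg#*:}        # text after the first colon  -> /usr/src
echo "$server $dir"
```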

Identifying the Point of Failure

When analyzing an NFS problem, keep in mind that NFS, like all network services, has three main points of failure: the server, the client, and the network itself. The debugging strategy outlined below isolates each individual component to find the one that is not working.

Checking Out a Server

If a client is having NFS trouble, check first to make sure the server is up and running. From a client, give this command:

# /usr/etc/rpcinfo -p server_name | grep mountd

This checks whether the server's mount daemon (mountd) is registered with the portmapper. If the server is up and mountd is running, this command displays a list of programs, versions, protocols, and port numbers similar to this:

    100005    1   tcp   1035  mountd
    100005    1   udp   1033  mountd
    391004    1   tcp   1037  sgi_mountd
    391004    1   udp   1034  sgi_mountd

If the mountd server is running, use rpcinfo to check if the mountd server is ready and waiting for mount requests by using the program number and version for sgi_mountd returned above. Give this command:

# /usr/etc/rpcinfo -u server_name 391004 1

The system responds:

program 391004 version 1 ready and waiting

If these checks fail, log in to the server and check its /var/adm/SYSLOG for messages.
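The two rpcinfo checks above can be scripted. This sketch extracts the program number and version of sgi_mountd from captured rpcinfo -p output, so they can be fed to the follow-up rpcinfo -u check; sample output stands in for a live server.

```shell
# Hypothetical helper: pull the sgi_mountd program number and version out
# of rpcinfo -p output. On a live client you would capture the variable
# with: rpcinfo_output=$(/usr/etc/rpcinfo -p server_name)
rpcinfo_output='100005    1   tcp   1035  mountd
100005    1   udp   1033  mountd
391004    1   tcp   1037  sgi_mountd
391004    1   udp   1034  sgi_mountd'
prog_ver=$(echo "$rpcinfo_output" | awk '/sgi_mountd/ { print $1, $2; exit }')
echo "$prog_ver"
```

The resulting pair can then be passed directly to `rpcinfo -u server_name $prog_ver`.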

Checking Out a Client

If the server and the network are working, give the command ps -de to check your client daemons. inetd(1M), routed(1M), portmap, four biod(1M) daemons, and four nfsd daemons should be running. For example, the command ps -de produces output similar to this:

   PID TTY      TIME COMD
   103 ?        0:46 routed
   108 ?        0:01 portmap
   136 ?        0:00 nfsd
   137 ?        0:00 nfsd
   138 ?        0:00 nfsd
   139 ?        0:00 nfsd
   142 ?        0:00 biod
   143 ?        0:00 biod
   144 ?        0:00 biod
   145 ?        0:00 biod
   159 ?        0:03 inetd

If the daemons are not running on the client, check /var/adm/SYSLOG, and ensure that network and nfs chkconfig(1M) flags are on. Rebooting the client almost always clears the problem.
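A quick way to confirm the expected daemons are present is to search the ps -de output for each name. In this sketch the ps_output variable holds captured sample output; on a live client you would capture it with ps_output=$(ps -de).

```shell
# Check that each expected client daemon appears in the ps output and
# report any that are missing. Sample output stands in for a live client.
ps_output='  103 ?        0:46 routed
  108 ?        0:01 portmap
  136 ?        0:00 nfsd
  142 ?        0:00 biod
  159 ?        0:03 inetd'
missing=
for d in routed portmap nfsd biod inetd; do
  echo "$ps_output" | grep -q "$d" || missing="$missing $d"
done
echo "missing:$missing"
```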

Checking Out the Network

If the server is operative but your system cannot reach it, check the network connections between your system and the server and check /var/adm/SYSLOG. Visually inspect your network connection. You can also test the logical network connection with various network tools like ping(1M). You can also check other systems on your network to see if they can reach the server.

Troubleshooting NFS Common Failures

The sections below describe the most common types of NFS failures. They suggest what to do if your remote mount fails, and what to do when servers do not respond to valid mount requests.

Remote Mount Failed

When network or server problems occur, programs that access hard-mounted remote files fail differently from those that access soft-mounted remote files. Hard-mounted remote file systems cause programs to continue to try until the server responds again. Soft-mounted remote file systems return an error message after trying for a specified number of intervals. See fstab(4) for more information.
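For reference, /etc/fstab entries selecting each behavior might look like the following; the server and directory names follow the earlier example, and the full option list is in fstab(4) and mount(1M):

```
krypton:/usr/src  /n/krypton.src  nfs  rw,hard,intr  0 0
krypton:/usr/src  /n/krypton.src  nfs  rw,soft,retrans=5  0 0
```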

Programs that access hard-mounted file systems do not respond until the server responds. In this case, NFS displays this message both to the console window and to the system log file /var/adm/SYSLOG:

server not responding

On a soft-mounted file system, programs that access a file whose server is inactive get the message:

Connection timed out

Unfortunately, many IRIX programs do not check return conditions on file system operations, so this error message may not be displayed when accessing soft-mounted files. Nevertheless, an NFS error message is displayed on the console.

Programs Do Not Respond

If programs stop responding while doing file-related work, your NFS server may be inactive. You may see the message:

NFS server host_name not responding, still trying

The message includes the host name of the NFS server that is down. This is probably a problem either with one of your NFS servers or with the network hardware. Attempt to ping and rlogin(1C) to the server to determine whether the server is down. If you can successfully rlogin to the server, its NFS server function is probably disabled.

Programs can also hang if an NIS server becomes inactive.

If your system hangs completely, check the servers from which you have file systems mounted. If one or more of them is down, it is not cause for concern. If you are using hard mounts, your programs will continue automatically when the server comes back up, as if the server had not become inactive. No files are destroyed in such an event.

If a soft-mounted server is inactive, other work should not be affected. Programs that time out trying to access soft-mounted remote files fail, but you should still be able to use your other file systems.

If all of the servers are running, ask some other users of the same NFS server or servers if they are having trouble. If more than one client is having difficulty getting service, the problem probably lies with the server's NFS daemon, nfsd. Log in to the server and give the command ps -de to see if nfsd is running and accumulating CPU time. If it is not, you may be able to kill and then restart nfsd. If this does not work, reboot the server.

If other people seem to be able to use the server, check your network connection and the connection of the server.

Hangs Partway through Boot

If your workstation mounts local file systems after a boot but hangs when it normally would be doing remote mounts, one or more servers may be down or your network connection may be bad. This problem can be avoided entirely by using the background (bg) option to mount in /etc/fstab (see fstab(4)).
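An illustrative fstab entry using the bg option, so that a failed remote mount retries in the background instead of blocking the boot:

```
krypton:/usr/src  /n/krypton.src  nfs  rw,bg,hard,intr  0 0
```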

Everything Works Slowly

If access to remote files seems unusually slow, give this command on the server:

# ps -de

Check whether the server is being slowed by a runaway daemon. If the server seems to be working and other people are getting good response, make sure your block I/O daemons are running. To check block I/O daemons, give this command on the client:

# ps -de | grep biod

This command helps you determine whether processes are hung. Note the current accumulated CPU time, then copy a large remote file and again give this command:

# ps -de | grep biod

If there are no biods running, restart the processes by giving this command:

# /usr/etc/biod 4

If biod is running, check your network connection. The netstat(1) command netstat -i tells you if packets are being dropped. A packet is a unit of transmission sent across the network. Also, you can use nfsstat -c and nfsstat -s to tell if the client or server is retransmitting a lot. A retransmission rate of 5% is considered high. Excessive retransmission usually indicates a bad network controller board, a bad network transceiver, a mismatch between board and transceiver, a mismatch between your network controller board and the server's board, or any problem or congestion on the network that causes packet loss.
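The retransmission percentage can be computed from the calls and retrans counters that nfsstat -c reports. The counter values below are illustrative, not live output; a result at or above 5% warrants investigation.

```shell
# Compute the client retransmission rate from sample nfsstat-style
# counters: retransmissions as a percentage of total RPC calls.
calls=6000
retrans=300
rate=$(awk "BEGIN { printf \"%.1f\", $retrans * 100 / $calls }")
echo "retransmission rate: ${rate}%"
```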

Cannot Access Remote Devices

You cannot use NFS to mount a remote character or block device (that is, a remote tape drive or similar peripheral).

Understanding the Automount Process

This section explains in detail how the automounter works; understanding this process can help you troubleshoot automounter operation.

There are two distinct stages in the automounter's actions: the initial stage, at system startup, when /etc/init.d/network starts the automounter; and the mounting stage, when a user tries to access a file or directory on a remote system. These two stages, and the effect of map type (direct or indirect) on automounting behavior, are described below.

System Startup

At the initial stage, when /etc/init.d/network invokes automount, it opens a user datagram protocol (UDP) socket and registers it with the portmapper service as an NFS server port. It then starts a server daemon that listens for NFS requests on the socket. The parent process proceeds to mount the daemon at its mount points within the file system (as specified by the maps). Through the mount system call, it passes the server daemon's port number and an NFS file handle that is unique to each mount point. The arguments to the mount system call vary according to the kind of file system. For NFS file systems, the call is:

mount ("nfs", "/usr", &nfs_args);

where nfs_args contains the network address for the NFS server. By having the network address in nfs_args refer to the local process (the automount daemon), automount causes the kernel to treat it as if it were an NFS server. Once the parent process completes its calls to mount, it exits, leaving the automount daemon to serve its mount points.

Mounting

In the second stage, when the user actually requests access to a remote file hierarchy, the daemon intercepts the kernel NFS request and looks up the name in the map associated with the directory.

Taking the location (server:pathname) of the remote file system from the map, the daemon then mounts the remote file system under the directory /tmp_mnt. It answers the kernel's request by describing itself as a symbolic link; the kernel sends an NFS READLINK request, and the automounter returns a symbolic link to the real mount point under /tmp_mnt.

The Effect of Map Types

The behavior of the automounter is affected by whether the name is found in a direct or an indirect map. If the name is found in a direct map, the automounter emulates a symbolic link, as stated above. It responds as if a symbolic link exists at its mount point. In response to a GETATTR, it describes itself as a symbolic link. When the kernel follows up with a READLINK, it returns a path to the real mount point for the remote hierarchy in /tmp_mnt.

If, on the other hand, the name is found in an indirect map, the automounter emulates a directory of symbolic links. It describes itself as a directory. In response to a READLINK, it returns a path to the mount point in /tmp_mnt, and a readdir(3) of the automounter's mount point returns a list of the entries that are currently mounted.

Whether the map is direct or indirect, if the file hierarchy is already mounted and the symbolic link has been read recently, the cached symbolic link is returned immediately. Since the automounter is on the same system, the response is much faster than a READLINK to a remote NFS server. On the other hand, if the file hierarchy is not mounted, a small delay occurs while the mounting takes place.
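The indirection the automounter provides has the same effect as an ordinary symbolic link pointing into /tmp_mnt, as this sketch demonstrates. The directory names are illustrative and are built in a scratch directory rather than at a real automount point.

```shell
# Emulate the automounter's READLINK answer with a real symbolic link:
# a name at the "mount point" resolves to the actual mount under tmp_mnt.
scratch=$(mktemp -d)
mkdir -p "$scratch/tmp_mnt/n/krypton.src"
ln -s "$scratch/tmp_mnt/n/krypton.src" "$scratch/krypton.src"
resolved=$(readlink "$scratch/krypton.src")
echo "$resolved"
rm -rf "$scratch"
```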

Troubleshooting CacheFS

A common error message that can occur during a mount is No space left on device. The most likely cause of this error is inappropriate allocation of parameters for the cache (see “Cache Resource Parameters” for explanations about these parameters).

The following example shows this error for a CacheFS client machine named sluggo, caching data from a server neteng. One mount has been performed successfully for the cache /cache. A second mount was attempted and returned the error message No space left on device. The cfsadmin -l command returned the following:

cfsadmin: list cache FS information
     maxblocks        90%           (109109 blocks)
     minblocks        0%            (0 blocks)
     threshblocks     85%           (103047 blocks)
     hiblocks         85%           (92743 blocks)
     lowblocks        75%           (81832 blocks)
     maxfiles         90%           (188570 files)
     minfiles         0%            (0 files)
     threshfiles      85%           (178094 files)
     hifiles          85%           (160285 files)
     lowfiles         75%           (141428 files)
     maxfilesize      3MB
neteng:_home:_home
     flags            CFS_DUAL_WRITE
     popsize          65536
     fgsize           256

Current Usage:
     blksused         406
     filesused        30
     flags            CUSAGE_ACTIVE

The df command reported the usage statistics for /cache on sluggo. The following shows the df command and its returned information:

# df -i /cache
Filesystem  Type   blocks   use     avail  %use    iuse   ifree   %iuse Mounted
/dev/root   efs    1939714  1651288 288426 85%     18120  191402  9%    /

By default, minfiles and minblocks are both 0. This means that if any files or blocks are allocated, CacheFS uses threshfiles and threshblocks to determine whether to perform an allocation or fail with the error ENOSPC. CacheFS fails an allocation if the usage on the front file system is higher than threshblocks or threshfiles, whichever is appropriate for the allocation being done. In this example, the threshfiles value is 178094 files, but only 18120 inodes are in use. The threshblocks value is 103047 (8K) blocks, or 1648752 512-byte blocks. The df output shows that the total usage on the front file system is 1651288 512-byte blocks. This is larger than the threshold, so further block allocations fail.
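The block arithmetic above can be checked directly; the figures are the ones from the cfsadmin and df output shown earlier.

```shell
# Convert threshblocks from 8KB front-file-system blocks to the 512-byte
# blocks that df reports, and compare with the current usage.
threshblocks_8k=103047
used_512=1651288                        # "use" column from df on /dev/root
thresh_512=$((threshblocks_8k * 16))    # 8192 / 512 = 16
over=$((used_512 - thresh_512))
echo "threshold: $thresh_512  used: $used_512  over by: $over"
```

The overage is the minimum number of 512-byte blocks that must be freed before allocations succeed again.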

The possible resolutions for the error are:

  • Use cfsadmin to increase minblocks or threshblocks or both. Increasing threshblocks should be more effective since /dev/root is already 85% allocated.

  • Remove unnecessary files from /dev/root. At least 2536 512-byte blocks of data need to be removed; removing more makes the cache more useful. At the current level of utilization, CacheFS needs to continually throw away files to allow room for the new ones.

  • Use a separate disk partition for /cache.