Chapter 6. Troubleshooting ONC3/NFS

This chapter suggests strategies for troubleshooting the ONC3/NFS environment, including automatic mounting.

General Recommendations

If you experience difficulties with ONC3/NFS, review the ONC3/NFS documentation before trying to debug the problem. In addition to this guide, the ONC3/NFS Release Notes and the man pages for mount(1M), nfsd(1M), showmount(1M), exportfs(1M), rpcinfo(1M), mountd(1M), inetd(1M), fstab(4), mtab(4), lockd(1M), statd(1M), automount(1M), autofs(1M), and exports(4) contain information you should review. You do not have to understand them fully, but be familiar with the names and functions of relevant daemons and database files.

Be sure to check the console and /var/adm/SYSLOG for messages about ONC3/NFS or other activity that affects ONC3/NFS performance. Logged messages frequently provide information that helps explain problems and assists with troubleshooting.

Understanding the Mount Process

This section explains the interaction of the various players in the mount request. If you understand this interaction, the problem descriptions in this chapter will make more sense. Here is a sample mount request:

mount krypton:/usr/src /n/krypton.src

These are the steps mount goes through to mount a remote filesystem:

  1. mount parses /etc/fstab.

  2. mount checks to see if the caller is the superuser and if /n/krypton.src is a directory.

  3. mount opens /etc/mtab and checks that this mount has not already been done.

  4. mount parses the first argument into the system krypton and remote directory /usr/src.

  5. mount calls library routines to translate the hostname (krypton) to its Internet Protocol (IP) address. This hostname resolution is performed using the Unified Name Services defined in /etc/nsswitch.conf. See the nsswitch.conf(4) man page.

  6. mount calls krypton's portmap daemon to get the port number of mountd. See the portmap(1M) man page.

  7. mount calls krypton's mountd and passes it /usr/src.

  8. krypton's mountd reads /etc/exports and looks for the exported filesystem that contains /usr/src.

  9. krypton's mountd calls library routines to expand the hostnames and network groups in the export list for /usr/src.

  10. krypton's mountd performs a system call on /usr/src to get the file handle.

  11. krypton's mountd returns the file handle.

  12. mount does a mount system call with the file handle and /n/krypton.src.

  13. mount does a statfs call to krypton's NFS server (nfsd).

  14. mount opens /etc/mtab and adds an entry to the end.

Any of these steps can fail, some of them in more than one way.
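
Step 4 can be illustrated with a short shell function. This is only a sketch of the parsing idea, not the code that mount itself uses; the function name split_remote is hypothetical.

```shell
# split_remote SPEC
# Prints the server name and the remote directory from a
# "server:/path" mount argument, separated by a space.
# (split_remote is a hypothetical helper, not part of ONC3/NFS.)
split_remote() {
    case "$1" in
        *:*) ;;    # the argument must contain a colon
        *)   echo "split_remote: expected server:/path" >&2; return 1 ;;
    esac
    host=${1%%:*}    # text before the first colon
    dir=${1#*:}      # text after the first colon
    echo "$host $dir"
}
```

For the sample request, split_remote krypton:/usr/src prints the system krypton followed by the remote directory /usr/src. When a mount fails, checking each part separately (the hostname with rpcinfo, the directory against the server's export list with showmount -e) helps narrow down which step went wrong.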

Identifying the Point of Failure

When analyzing an NFS problem, keep in mind that NFS, like all network services, has three main points of failure: the server, the client, and the network itself. The debugging strategy outlined below isolates each individual component to find the one that is not working.

Verifying Server Status

If a client is having NFS trouble, check first to make sure the server is up and running. From a client, enter this command:

/usr/etc/rpcinfo -p server_name | grep mountd

This checks whether the mount daemon (mountd) is registered with the portmapper on the server. If it is, the command displays the program number, version, protocol, and port number of each registration, similar to this:

100005    1   tcp   1035  mountd
100005    1   udp   1033  mountd
391004    1   tcp   1037  sgi_mountd
391004    1   udp   1034  sgi_mountd

If the mountd server is running, use rpcinfo to check if the mountd server is ready and waiting for mount requests by using the program number and version for sgi_mountd returned above. Enter this command:

# /usr/etc/rpcinfo -u server_name 391004 1

The system responds:

program 391004 version 1 ready and waiting

If these fail, log in to the server and check its /var/adm/SYSLOG for messages.
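
The two checks above can be tied together with a small filter. The following is a sketch that assumes rpcinfo -p output has the five columns shown in the sample (program, version, protocol, port, service); the function name mountd_registrations is hypothetical.

```shell
# mountd_registrations
# Reads `rpcinfo -p` output on standard input and prints one line per
# mount daemon registration: service, program, version, protocol.
# (Hypothetical helper; assumes the five-column layout shown above.)
mountd_registrations() {
    awk '$5 == "mountd" || $5 == "sgi_mountd" { print $5, $1, $2, $3 }'
}
```

A typical use is /usr/etc/rpcinfo -p server_name | mountd_registrations. Each output line supplies the program number and version to pass to rpcinfo -u (or rpcinfo -t for a tcp registration). No output means that no mount daemon is registered with the server's portmapper.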

Verifying Client Status

If the server and the network are working, enter the command ps -de to check your client daemons. inetd, routed, portmap, four biod daemons, and four nfsd daemons should be running. For example, the command ps -de produces output similar to this:

   103 ?        0:46 routed
   108 ?        0:01 portmap
   136 ?        0:00 nfsd
   137 ?        0:00 nfsd
   138 ?        0:00 nfsd
   139 ?        0:00 nfsd
   142 ?        0:00 biod
   143 ?        0:00 biod
   144 ?        0:00 biod
   145 ?        0:00 biod
   159 ?        0:03 inetd

If the daemons are not running on the client, check /var/adm/SYSLOG, and ensure that the network and nfs chkconfig flags are on. Rebooting the client almost always clears the problem.
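
Counting the daemons by eye is error-prone on a busy system. The following sketch counts them from ps -de output; it assumes the command name is the last field on each line, as in the sample above, and the function name count_daemon is hypothetical.

```shell
# count_daemon NAME
# Reads `ps -de` output on standard input and prints how many
# processes have NAME as their command (the last field).
# (Hypothetical helper; assumes the layout shown in the sample.)
count_daemon() {
    awk -v name="$1" '$NF == name { n++ } END { print n + 0 }'
}
```

For example, ps -de | count_daemon biod should print 4 on a client in the state shown above, and a result of 0 for portmap or inetd points to a daemon that is not running.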

Verifying Network Status

If the server is operative but your system cannot reach it, check the network connections between your system and the server and check /var/adm/SYSLOG. Visually inspect your network connection, and test the logical network connection with network tools such as ping. Also check whether other systems on your network can reach the server.

Troubleshooting NFS Common Failures

The following sections describe the most common types of NFS failures. They suggest what to do if your remote mount fails, and what to do when servers do not respond to valid mount requests.

Troubleshooting Mount Failure

When network or server problems occur, programs that access hard-mounted remote files fail differently from those that access soft-mounted remote files. Hard-mounted remote filesystems cause programs to continue to try until the server responds again. Soft-mounted remote filesystems return an error message after trying for a specified number of intervals. See the fstab(4) man page for more information.

Programs that access hard-mounted filesystems do not respond until the server responds. In this case, NFS displays this message both to the console window and to the system log file /var/adm/SYSLOG:

server not responding

On a soft-mounted filesystem, programs that access a file on an inactive server get the following message:

Connection timed out

Unfortunately, many IRIX programs do not check return conditions on filesystem operations, so this error message may not be displayed when accessing soft-mounted files. Nevertheless, an NFS error message is displayed on the console.

Troubleshooting Lack of Server Response

If programs stop responding while doing file-related work, your NFS server may be inactive. You may see the message:

NFS server host_name not responding, still trying

The message includes the hostname of the NFS server that is down. This is probably a problem either with one of your NFS servers or with the network hardware. Attempt to connect to the server using ping and rlogin to determine whether the server is down. If you can successfully connect to the server using rlogin, its NFS server function is probably disabled.
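
When several servers are mounted, it can help to list every server named in these messages. The sketch below relies only on the message text shown above; any prefix that the logging facility adds to each line is ignored. The function name unresponsive_servers is hypothetical.

```shell
# unresponsive_servers
# Reads log text on standard input and prints each server named in an
# "NFS server <host> not responding" message, one per line, with
# duplicates removed.
# (Hypothetical helper; matches only the message text shown above.)
unresponsive_servers() {
    sed -n 's/.*NFS server \([^ ][^ ]*\) not responding.*/\1/p' | sort -u
}
```

For example, unresponsive_servers < /var/adm/SYSLOG prints the servers to test with ping and rlogin.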

Programs can also hang if an NIS server becomes inactive.

If your system hangs completely, check the servers from which you have file systems mounted. If one or more of them is down, it is not cause for concern. If you are using hard mounts, your programs will continue automatically when the server comes back up, as if the server had not become inactive. No files are destroyed in such an event.

If a soft-mounted server is inactive, other work should not be affected. Programs that time out trying to access soft-mounted remote files fail, but you should still be able to use your other filesystems.

If all of the servers are running, ask some other users of the same NFS server or servers if they are having trouble. If more than one client is having difficulty getting service, then the problem is likely with the server's NFS daemon nfsd. Log in to the server and enter the command ps -de to see if nfsd is running and accumulating CPU time. If not, you may be able to kill and then restart nfsd. If this does not work, reboot the server.

If other people seem to be able to use the server, check your network connection and the connection of the server.

Troubleshooting Remote Mount Failure

If your workstation mounts local file systems after a boot but hangs when it normally would be doing remote mounts, one or more servers may be down or your network connection may be bad. This problem can be avoided entirely by using the background (bg) option to mount in /etc/fstab (see the fstab(4) man page).
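
To find entries that could cause this hang, you can scan /etc/fstab for NFS entries that lack the bg option. The following is a sketch under two assumptions: that each entry uses the usual whitespace-separated fstab fields (filesystem, directory, type, options), and that NFS entries have the type nfs exactly. The function name nfs_without_bg is hypothetical.

```shell
# nfs_without_bg
# Reads fstab-format lines on standard input and prints the mount
# point of every nfs entry whose option list does not include bg.
# (Hypothetical helper; only entries of type "nfs" are examined.)
nfs_without_bg() {
    awk '
        /^[ \t]*#/ { next }              # skip comment lines
        $3 == "nfs" {
            n = split($4, opt, ",")      # options are comma-separated
            found = 0
            for (i = 1; i <= n; i++)
                if (opt[i] == "bg") found = 1
            if (!found) print $2
        }'
}
```

Run it as nfs_without_bg < /etc/fstab. Each directory it prints is a remote mount that is attempted in the foreground at boot.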

Troubleshooting Slow Performance

If access to remote files seems unusually slow, enter this command on the server:

ps -de

Check whether the server is being slowed by a runaway daemon. If the server seems to be working and other people are getting good response, make sure your block I/O daemons are running.

Note: The following text describes NFS version 2 on clients. NFS version 3 uses bio3d.

To check block I/O daemons, enter this command on the client:

ps -de | grep biod

This command helps you determine whether processes are hung. Note the current accumulated CPU time, then copy a large remote file and again enter this command:

ps -de | grep biod

If the accumulated CPU time has not increased, the biod processes may be hung. If there are no biods running, restart the processes by entering this command:

/usr/etc/biod 4
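
Comparing accumulated CPU time before and after the copy is easier with a total in seconds. The sketch below assumes that the TIME column is the next-to-last field of ps -de output and is written as minutes:seconds or hours:minutes:seconds; the function name cpu_seconds is hypothetical.

```shell
# cpu_seconds NAME
# Reads `ps -de` output on standard input and prints the total
# accumulated CPU time, in seconds, of all processes named NAME.
# (Hypothetical helper; assumes TIME is the next-to-last field.)
cpu_seconds() {
    awk -v name="$1" '
        $NF == name {
            n = split($(NF - 1), t, ":")    # TIME column, such as 0:46
            s = 0
            for (i = 1; i <= n; i++) s = s * 60 + t[i]
            total += s
        }
        END { print total + 0 }'
}
```

Run ps -de | cpu_seconds biod before and after copying the large remote file. A total that does not change suggests that the biod processes are not doing any work.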

If biod is running, check your network connection. The netstat command netstat -i reports errors and conditions that may help you determine why packets are being dropped. A packet is a unit of transmission sent across the network. Also, you can use nfsstat -c to tell if the client or server is retransmitting a lot. A retransmission rate of 5% is considered high. Excessive retransmission usually indicates a bad network controller board, a bad network transceiver, a mismatch between board and transceiver, a mismatch between your network controller board and the server's board, or any problem or congestion on the network that causes packet loss.
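
The retransmission rate is the number of retransmissions divided by the number of calls. The following sketch does that arithmetic for two counters that you read from the nfsstat -c output yourself; it does not parse nfsstat output, whose exact layout is not shown in this chapter. The function name retrans_percent is hypothetical.

```shell
# retrans_percent CALLS RETRANS
# Prints RETRANS as a percentage of CALLS, to one decimal place.
# (Hypothetical helper; the caller supplies both counters.)
retrans_percent() {
    awk -v calls="$1" -v retrans="$2" 'BEGIN {
        if (calls <= 0) { print "0.0"; exit }    # avoid dividing by zero
        printf "%.1f\n", retrans * 100 / calls
    }'
}
```

For example, 50 retransmissions in 1000 calls gives 5.0, which is at the level this section describes as high.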

Failure to Access Remote Devices

You cannot use NFS to mount a remote character or block device (that is, a remote tape drive or similar peripheral).

Troubleshooting automount

This section presents a detailed explanation of how automount works that can help you with troubleshooting automount operation.

There are two distinct stages in the automount command's actions: the initial stage, system startup, when /etc/init.d/network starts automount; and the mounting stage, when a user tries to access a file or directory on a remote system. These two stages, and the effect of map type (direct or indirect) on automounting behavior are described in the following subsections.

Role of automount in System Startup

At the initial stage, when /etc/init.d/network invokes automount, it opens a user datagram protocol (UDP) socket and registers it with the portmapper service as an NFS server port. It then starts a server daemon that listens for NFS requests on the socket. The parent process proceeds to mount the daemon at its mount points within the filesystem (as specified by the maps). Through the mount system call, it passes the server daemon's port number and an NFS file handle that is unique to each mount point. The arguments to the mount system call vary according to the kind of file system. For NFS file systems, the call is:

mount ("nfs", "/usr", &nfs_args);

where nfs_args specifies the network address for the NFS server. By having the network address in nfs_args refer to the local process (the automount daemon), automount causes the kernel to treat it as if it were an NFS server. Once the parent process completes its calls to mount, it exits, leaving the automount daemon to serve its mount points.

Daemon Action in the Mounting Process

In the second stage, when the user actually requests access to a remote file hierarchy, the daemon intercepts the kernel NFS request and looks up the name in the map associated with the directory.

Taking the location (server:pathname) of the remote filesystem from the map, the daemon then mounts the remote filesystem under the directory /tmp_mnt. It answers the kernel, saying it is a symbolic link. The kernel sends an NFS READLINK request, and the automounter returns a symbolic link to the real mount point under /tmp_mnt.

Effect of automount Map Types

The behavior of the automounter is affected by whether the name is found in a direct or an indirect map. If the name is found in a direct map, the automounter emulates a symbolic link, as stated above. It responds as if a symbolic link exists at its mount point. In response to a GETATTR, it describes itself as a symbolic link. When the kernel follows up with a READLINK, it returns a path to the real mount point for the remote hierarchy in /tmp_mnt.

If, on the other hand, the name is found in an indirect map, the automounter emulates a directory of symbolic links. It describes itself as a directory. In response to a READLINK, it returns a path to the mount point in /tmp_mnt, and a readdir of the automounter's mount point returns a list of the entries that are currently mounted.

Whether the map is direct or indirect, if the file hierarchy is already mounted and the symbolic link has been read recently, the cached symbolic link is returned immediately. Since the automounter is on the same system, the response is much faster than a READLINK to a remote NFS server. On the other hand, if the file hierarchy is not mounted, a small delay occurs while the mounting takes place.
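
When reading a map, you can usually tell which behavior applies from the form of the key. The sketch below assumes the common automounter map convention, which this chapter does not spell out: keys in a direct map are absolute pathnames, and keys in an indirect map are simple names relative to the map's mount point. The function name map_key_kind is hypothetical.

```shell
# map_key_kind KEY
# Prints "direct" for a key that is an absolute pathname and
# "indirect" for a simple name.
# (Hypothetical helper; relies on the map convention assumed above.)
map_key_kind() {
    case "$1" in
        /*) echo direct ;;
        *)  echo indirect ;;
    esac
}
```

A key such as /usr/src is classified as direct, so the automounter emulates a symbolic link at that path. A key such as src is classified as indirect, so it appears as an entry in the directory that the automounter emulates.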

Troubleshooting autofs

The autofs process is similar to the automount process, described in “Troubleshooting automount,” with the following exceptions:

  • autofs uses the autofsd daemon for mounting and unmounting.

  • In-place mounting is used instead of symbolic links (the /tmp_mnt links with /hosts are not used).

  • autofs accepts dynamic configuration changes; there is no need to restart autofsd.

  • autofs requires an /etc/auto_master file.

  • autofs uses the LoFS (loopback file system) to access local files.

  • To record who is requesting bad mounts in the log file, use systune to set the autofs_logging variable to 1 (autofs_logging=1).

Troubleshooting CacheFS

A common error message that can occur during a mount is No space left on device. The most likely cause of this error is inappropriate allocation parameters for the cache. The following example shows this error for a CacheFS client machine named nabokov, caching data from a server neteng. One mount has been performed successfully for the cache /cache. A second mount was attempted and returned the error message No space left on device.

The cfsadmin -l command returned the following:

    cfsadmin: list cache FS information
   Version        2   4  50
   maxblocks     90% (1745743 blocks)
    hiblocks     85% (1648757 blocks)
   lowblocks     75% (1454786 blocks)
   maxfiles      90% (188570 files)
    hifiles      85% (178094 files)
   lowfiles      75% (157142 files)

The df command reported the usage statistics for /cache on nabokov. The following shows the df command and its returned information:

# df -i /cache 
Filesystem  Type  blocks   use      avail   %use  iuse   ifree   %iuse  Mounted
/dev/root   xfs   1939714  1651288  288426  85%   18120  191402  9%     /

Whenever files or blocks are allocated, CacheFS uses hifiles and hiblocks to determine whether to perform the allocation or fail with the error ENOSPC. CacheFS fails an allocation if the usage on the front file system is higher than hiblocks or hifiles, whichever is appropriate for the allocation being done. In this example, the hifiles value is 178094, but only 18120 files are in use. The hiblocks value is 103047 (8K blocks) or 1648752 512-byte blocks. The df output shows the total usage on the front file system is 1651288 512-byte blocks. This is larger than the threshold, so further block allocations fail.

The possible resolutions for the error are:

  • Use cfsadmin to increase hiblocks. Increasing hiblocks should be effective since /dev/root is already 85% allocated.

  • Remove unnecessary files from /dev/root. At least 2536 512-byte blocks of data need to be removed; removing more makes the cache more useful. At the current level of utilization, CacheFS needs to discard many files to allow room for the new ones.

  • Use a separate disk partition for /cache.
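
The arithmetic behind the figures in this example can be reproduced as follows. This is a sketch that takes the two numbers quoted above as arguments; the function name blocks_over_threshold is hypothetical.

```shell
# blocks_over_threshold USED_512 HIBLOCKS_8K
# Converts an hiblocks threshold given in 8K blocks to 512-byte blocks
# (16 per 8K block) and prints how far USED_512 exceeds it.
# (Hypothetical helper; a negative result means usage is below the threshold.)
blocks_over_threshold() {
    threshold=$(( $2 * 16 ))
    echo $(( $1 - threshold ))
}
```

With the values from this example, blocks_over_threshold 1651288 103047 prints 2536, the minimum number of 512-byte blocks that must be freed before block allocations can succeed again.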