This chapter suggests strategies for troubleshooting the ONC3/NFS environment, including automatic mounting.
If you experience difficulties with ONC3/NFS, review the ONC3/NFS documentation before trying to debug the problem. In addition to this guide, the ONC3/NFS Release Notes and the reference pages for mount(1M), nfsd(1M), showmount(1M), exportfs(1M), rpcinfo(1M), mountd(1M), inetd(1M), fstab(4), mtab(4), lockd(1M), statd(1M), automount(1M), autofs(1M), and exports(4) contain information you should review. You do not have to understand them fully, but be familiar with the names and functions of relevant daemons and database files.
Be sure to check the console and /var/adm/SYSLOG for messages about ONC3/NFS or other activity that affects ONC3/NFS performance. Logged messages frequently provide information that helps explain problems and assists with troubleshooting.
This section explains the interaction of the various players in the mount request. If you understand this interaction, the problem descriptions in this chapter will make more sense. Here is a sample mount request:
mount krypton:/usr/src /n/krypton.src
These are the steps mount goes through to mount a remote filesystem:
mount checks to see if the caller is the superuser and if /n/krypton.src is a directory.
mount opens /etc/mtab and checks that this mount has not already been done.
mount parses the first argument into the system krypton and remote directory /usr/src.
mount calls library routines to translate the host name (krypton) to its Internet Protocol (IP) address. Depending on the host resolution order in /etc/resolv.conf, mount uses the NIS databases, the DNS databases, or /etc/hosts (local) to resolve the address of the NFS server. See resolver(4).
mount calls krypton's portmap daemon to get the port number of mountd. See portmap(1M).
mount calls krypton's mountd and passes it /usr/src.
krypton's mountd reads /etc/exports and looks for the exported filesystem that contains /usr/src.
krypton's mountd calls library routines to expand the hostnames and network groups in the export list for /usr/src.
krypton's mountd performs a system call on /usr/src to get the file handle.
krypton's mountd returns the file handle.
mount does a mount system call with the file handle and /n/krypton.src.
mount does a statfs call to krypton's NFS server (nfsd).
mount opens /etc/mtab and adds an entry to the end.
Any of these steps can fail, some of them in more than one way.
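Two of the local steps above (checking /etc/mtab for an existing mount and splitting the first argument into host and directory) can be sketched in shell. This is only an illustration of the logic, not the code that mount itself runs; the function names split_remote and already_mounted are hypothetical, and the mtab layout is assumed to be the usual fsname, directory, type, options field order.

```shell
#!/bin/sh
# Hypothetical helpers that mirror two of the local steps mount performs.

# Split a remote filesystem spec of the form host:/path into its parts.
split_remote() {
    spec=$1
    case $spec in
        *:/*) ;;
        *) echo "not a host:/path spec: $spec" >&2; return 1 ;;
    esac
    host=${spec%%:*}    # text before the first colon
    dir=${spec#*:}      # text after the first colon
    echo "$host $dir"
}

# Succeed if an mtab-format file already lists this spec at this mount point.
# Assumed field order: fsname dir type options freq passno.
already_mounted() {
    mtab=$1
    spec=$2
    mnt=$3
    awk -v s="$spec" -v m="$mnt" \
        '$1 == s && $2 == m { found = 1 } END { exit !found }' "$mtab"
}
```

For example, split_remote krypton:/usr/src prints "krypton /usr/src", and already_mounted /etc/mtab krypton:/usr/src /n/krypton.src succeeds only if that mount is already recorded.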
When analyzing an NFS problem, keep in mind that NFS, like all network services, has three main points of failure: the server, the client, and the network itself. The debugging strategy outlined below isolates each individual component to find the one that is not working.
If a client is having NFS trouble, check first to make sure the server is up and running. From a client, give this command:
/usr/etc/rpcinfo -p server_name | grep mountd
This checks whether the server is running. If the server is running, this command displays a list of programs, versions, protocols, and port numbers similar to this:
100005    1    tcp    1035    mountd
100005    1    udp    1033    mountd
391004    1    tcp    1037    sgi_mountd
391004    1    udp    1034    sgi_mountd
If the mountd server is running, use rpcinfo to check if the mountd server is ready and waiting for mount requests by using the program number and version for sgi_mountd returned above. Give this command:
/usr/etc/rpcinfo -u server_name 391004 1
The system responds:
program 391004 version 1 ready and waiting
If these fail, log in to the server and check its /var/adm/SYSLOG for messages.
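The program number and version used in the second rpcinfo command come from the output of the first. As a rough sketch, a small filter can pull them out so they do not have to be copied by hand. The function name rpc_prog_vers is hypothetical, and the filter assumes the five-column layout shown above (program, version, protocol, port, service).

```shell
#!/bin/sh
# Hypothetical filter: read `rpcinfo -p` output on standard input and print
# "program version" for the first line matching a service name and protocol.
# Assumed columns: program version protocol port service.
rpc_prog_vers() {
    awk -v svc="$1" -v proto="$2" \
        '$5 == svc && $3 == proto { print $1, $2; exit }'
}
```

A possible use is /usr/etc/rpcinfo -p server_name | rpc_prog_vers sgi_mountd udp, whose two output fields can then be passed to /usr/etc/rpcinfo -u server_name. An empty result suggests the service is not registered with the portmapper on that server.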
If the server and the network are working, give the command ps -de to check your client daemons. inetd, routed, portmap, four biod daemons, and four nfsd daemons should be running. For example, the command ps -de produces output similar to this:
PID    TTY    TIME    COMD
103    ?      0:46    routed
108    ?      0:01    portmap
136    ?      0:00    nfsd
137    ?      0:00    nfsd
138    ?      0:00    nfsd
139    ?      0:00    nfsd
142    ?      0:00    biod
143    ?      0:00    biod
144    ?      0:00    biod
145    ?      0:00    biod
159    ?      0:03    inetd
If the daemons are not running on the client, check /var/adm/SYSLOG, and ensure that network and nfs chkconfig flags are on. Rebooting the client almost always clears the problem.
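Scanning ps output by eye is easy to get wrong on a busy system. The following sketch counts the daemons named above in ps -de output and reports any that fall short. The function names are hypothetical, the expected counts are taken from the text, and the command name is assumed to be the last field of each ps line.

```shell
#!/bin/sh
# Count processes with a given command name in `ps -de` output read on
# standard input. The command name is assumed to be the last field.
count_daemon() {
    awk -v name="$1" '$NF == name { n++ } END { print n + 0 }'
}

# Read `ps -de` output on standard input and report daemons that are
# missing. Expected counts follow the text: one each of inetd, routed,
# and portmap, and four each of biod and nfsd.
check_client_daemons() {
    ps_out=$(cat)
    rc=0
    for want in inetd:1 routed:1 portmap:1 biod:4 nfsd:4; do
        name=${want%%:*}
        min=${want#*:}
        have=$(printf '%s\n' "$ps_out" | count_daemon "$name")
        if [ "$have" -lt "$min" ]; then
            echo "$name: expected $min, found $have"
            rc=1
        fi
    done
    return $rc
}
```

Running ps -de | check_client_daemons prints nothing and succeeds when all of the daemons are present; otherwise it prints one line per shortfall and returns a nonzero status.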
If the server is operative but your system cannot reach it, check the network connections between your system and the server and check /var/adm/SYSLOG. Visually inspect your network connection, and test the logical network connection with network tools such as ping. Also check whether other systems on your network can reach the server.
The sections below describe the most common types of NFS failures. They suggest what to do if your remote mount fails, and what to do when servers do not respond to valid mount requests.
When network or server problems occur, programs that access hard-mounted remote files fail differently from those that access soft-mounted remote files. Hard-mounted remote filesystems cause programs to continue to try until the server responds again. Soft-mounted remote filesystems return an error message after trying for a specified number of intervals. See fstab(4) for more information.
Programs that access hard-mounted filesystems do not respond until the server responds. In this case, NFS displays this message both to the console window and to the system log file /var/adm/SYSLOG:
server not responding
On a soft-mounted filesystem, programs that access a file whose server is inactive get the message:
Connection timed out
Unfortunately, many IRIX programs do not check return conditions on filesystem operations, so this error message may not be displayed when accessing soft-mounted files. Nevertheless, an NFS error message is displayed on the console.
If programs stop responding while doing file-related work, your NFS server may be inactive. You may see the message:
NFS server host_name not responding, still trying
The message includes the host name of the NFS server that is down. This is probably a problem either with one of your NFS servers or with the network hardware. Attempt to ping and rlogin to the server to determine whether the server is down. If you can successfully rlogin to the server, its NFS server function is probably disabled.
Programs can also hang if an NIS server becomes inactive.
If your system hangs completely, check the servers from which you have file systems mounted. If one or more of them is down, it is not cause for concern. If you are using hard mounts, your programs will continue automatically when the server comes back up, as if the server had not become inactive. No files are destroyed in such an event.
If a soft-mounted server is inactive, other work should not be affected. Programs that time out trying to access soft-mounted remote files fail, but you should still be able to use your other filesystems.
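Because the failure behavior depends on whether a filesystem is hard-mounted or soft-mounted, it helps to know which applies before interpreting a hang. A minimal sketch that classifies a comma-separated options string follows. The function name mount_kind is hypothetical, and the sketch assumes that a mount is hard unless the soft option is given; confirm that default in fstab(4).

```shell
#!/bin/sh
# Hypothetical helper: classify a comma-separated NFS options string.
# Assumption: a mount is hard unless "soft" appears in the options.
mount_kind() {
    case ",$1," in
        *,soft,*) echo soft ;;
        *)        echo hard ;;
    esac
}
```

For example, mount_kind rw,soft,bg prints "soft", and mount_kind rw,bg prints "hard".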
If all of the servers are running, ask some other users of the same NFS server or servers if they are having trouble. If more than one client is having difficulty getting service, then the problem is likely with the server's NFS daemon nfsd. Log in to the server and give the command ps -de to see if nfsd is running and accumulating CPU time. If not, you may be able to kill and then restart nfsd. If this does not work, reboot the server.
If other people seem to be able to use the server, check your network connection and the connection of the server.
If your workstation mounts local file systems after a boot but hangs when it normally would be doing remote mounts, one or more servers may be down or your network connection may be bad. This problem can be avoided entirely by using the background (bg) option to mount in /etc/fstab (see fstab(4)).
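To find entries that could stall a boot, one approach is to list the NFS entries whose options lack bg. The sketch below does this for any fstab-format file. The function name nfs_without_bg is hypothetical, and the field order (fsname, directory, type, options) and the type name nfs are assumptions to check against fstab(4).

```shell
#!/bin/sh
# Hypothetical helper: print the mount point of every NFS entry in an
# fstab-format file whose options field does not include "bg".
# Assumed field order: fsname dir type options freq passno.
nfs_without_bg() {
    awk '$1 !~ /^#/ && $3 == "nfs" && ("," $4 ",") !~ /,bg,/ { print $2 }' "$1"
}
```

Running nfs_without_bg /etc/fstab prints one mount point per line; no output means every NFS entry already mounts in the background.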
If access to remote files seems unusually slow, give this command on the server:
Check whether the server is being slowed by a runaway daemon. If the server seems to be working and other people are getting good response, make sure your block I/O daemons are running.
Note: The following text describes NFS version 2 on clients. NFS version 3 uses bio3d.
To check block I/O daemons, give this command on the client:
ps -de | grep biod
This command helps you determine whether processes are hung. Note the current accumulated CPU time, then copy a large remote file and again give this command:
ps -de | grep biod
If there are no biods running, restart the processes by giving this command:
If biod is running, check your network connection. The command netstat -i reports errors and conditions that may help you determine why packets are being dropped. A packet is a unit of transmission sent across the network. You can also use nfsstat -c to tell whether the client or server is retransmitting heavily. A retransmission rate of 5% is considered high. Excessive retransmission usually indicates a bad network controller board, a bad network transceiver, a mismatch between board and transceiver, a mismatch between your network controller board and the server's board, or any problem or congestion on the network that causes packet loss.
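The retransmission rate is the number of retransmissions divided by the total number of calls, expressed as a percentage. The sketch below performs that arithmetic on two counters read by hand from the nfsstat -c output. The function names are hypothetical, and the sample figures in the usage note are invented for illustration.

```shell
#!/bin/sh
# Print the percentage of calls that were retransmitted.
# Arguments: total calls, retransmissions.
retrans_pct() {
    awk -v calls="$1" -v retrans="$2" 'BEGIN {
        if (calls <= 0) { print "0.0"; exit }
        printf "%.1f\n", 100 * retrans / calls
    }'
}

# Succeed (exit status 0) when the rate reaches the 5% level that the
# text describes as high.
retrans_is_high() {
    awk -v calls="$1" -v retrans="$2" \
        'BEGIN { exit !(calls > 0 && 100 * retrans / calls >= 5) }'
}
```

With 2000 calls and 120 retransmissions, retrans_pct 2000 120 prints 6.0 and retrans_is_high 2000 120 succeeds, which would point toward the network causes listed above.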
This section presents a detailed explanation of how automount works that can help you with troubleshooting automount operation.
There are two distinct stages in automount's actions: the initial stage, system startup, when /etc/init.d/network starts automount; and the mounting stage, when a user tries to access a file or directory on a remote system. These two stages, and the effect of map type (direct or indirect) on automounting behavior, are described below.
At the initial stage, when /etc/init.d/network invokes automount, it opens a user datagram protocol (UDP) socket and registers it with the portmapper service as an NFS server port. It then starts a server daemon that listens for NFS requests on the socket. The parent process proceeds to mount the daemon at its mount points within the filesystem (as specified by the maps). Through the mount system call, it passes the server daemon's port number and an NFS file handle that is unique to each mount point. The arguments to the mount system call vary according to the kind of file system. For NFS file systems, the call is:
mount ("nfs", "/usr", &nfs_args);
where nfs_args contains the network address for the NFS server. By having the network address in nfs_args refer to the local process (the automount daemon), automount causes the kernel to treat it as if it were an NFS server. Once the parent process completes its calls to mount, it exits, leaving the automount daemon to serve its mount points.
In the second stage, when the user actually requests access to a remote file hierarchy, the daemon intercepts the kernel NFS request and looks up the name in the map associated with the directory.
Taking the location (server:pathname) of the remote filesystem from the map, the daemon then mounts the remote filesystem under the directory /tmp_mnt. It answers the kernel, saying it is a symbolic link. The kernel sends an NFS READLINK request, and the automounter returns a symbolic link to the real mount point under /tmp_mnt.
The behavior of the automounter is affected by whether the name is found in a direct or an indirect map. If the name is found in a direct map, the automounter emulates a symbolic link, as stated above. It responds as if a symbolic link exists at its mount point. In response to a GETATTR, it describes itself as a symbolic link. When the kernel follows up with a READLINK, it returns a path to the real mount point for the remote hierarchy in /tmp_mnt.
If, on the other hand, the name is found in an indirect map, the automounter emulates a directory of symbolic links. It describes itself as a directory. In response to a READLINK, it returns a path to the mount point in /tmp_mnt, and a readdir of the automounter's mount point returns a list of the entries that are currently mounted.
Whether the map is direct or indirect, if the file hierarchy is already mounted and the symbolic link has been read recently, the cached symbolic link is returned immediately. Since the automounter is on the same system, the response is much faster than a READLINK to a remote NFS server. On the other hand, if the file hierarchy is not mounted, a small delay occurs while the mounting takes place.
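The map lookup in the mounting stage can be approximated with a short filter, which is useful for checking that a key resolves to the location you expect. This is a simplified sketch, not the daemon's own logic: the function name map_lookup is hypothetical, the entry format is assumed to be key, optional -options, then a single server:pathname location, and continuation lines and multiple locations are not handled.

```shell
#!/bin/sh
# Hypothetical helper: look up a key in an indirect map read on standard
# input and print its server:pathname location.
# Assumed entry format: key [-options] location
map_lookup() {
    awk -v key="$1" '
        /^#/ || NF == 0 { next }            # skip comments and blank lines
        $1 == key {
            loc = ($2 ~ /^-/) ? $3 : $2     # skip an options field if present
            print loc
            found = 1
            exit
        }
        END { exit !found }'
}
```

Given a map line such as "src -ro krypton:/usr/src" (an invented entry), map_lookup src prints krypton:/usr/src; a key that is not in the map produces no output and a nonzero status.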
The autofs process is similar to the automount process, described in “Understanding the automount Process” with the following exceptions:
autofs uses the autofsd daemon for mounting and unmounting
In-place mounting is used instead of symbolic links (the /tmp_mnt links with /hosts are not used)
autofs accepts dynamic configuration changes; there is no need to restart autofsd
autofs requires an /etc/auto_master file
autofs uses the LoFS (loopback file system) to access local files
A common error message that can occur during a mount is No space left on device. The most likely cause of this error is inappropriate allocation parameters for the cache. The following example shows this error for a CacheFS client machine named sluggo, caching data from a server neteng. One mount has been performed successfully for the cache /cache. A second mount was attempted and returned the error message No space left on device. The cfsadmin -l command returned the following:
cfsadmin: list cache FS information
   maxblocks      90% (109109 blocks)
   threshblocks   85% (103047 blocks)
   hiblocks       85% (92743 blocks)
   lowblocks      75% (81832 blocks)
   maxfiles       90% (188570 files)
   threshfiles    85% (178094 files)
   hifiles        85% (160285 files)
   lowfiles       75% (141428 files)
   maxfilesize    3MB
neteng:_home:_home
   flags          CFS_DUAL_WRITE
   popsize        65536
   fgsize         256
Current Usage:
   blksused       406
   filesused      30
   flags          CUSAGE_ACTIVE
The df command reported the usage statistics for /cache on sluggo. The following shows the df command and its returned information:
df -i /cache
Filesystem   Type   blocks    use       avail    %use   iuse    ifree    %iuse   Mounted
/dev/root    efs    1939714   1651288   288426   85%    18120   191402   9%      /
If any files or blocks are allocated, CacheFS uses threshfiles and threshblocks to determine whether to perform an allocation or fail with the error ENOSPC. CacheFS fails an allocation if the usage on the front file system is higher than threshblocks or threshfiles, whichever is appropriate for the allocation being done. In this example, the threshfiles value is 178094, but only 18120 files are in use. The threshblocks value is 103047 (8K blocks) or 1648752 512-byte blocks. The df output shows the total usage on the front file system is 1651288 512-byte blocks. This is larger than the threshold, so further block allocations fail.
The possible resolutions for the error are:
Use cfsadmin to increase threshblocks. Increasing threshblocks should be effective since /dev/root is already 85% allocated.
Remove unnecessary files from /dev/root. At least 2536 512-byte blocks of data need to be removed; removing more makes the cache more useful. At the current level of utilization, CacheFS needs to throw away many files to allow room for the new ones.
Use a separate disk partition for /cache.
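The block figures in this example follow from a unit conversion: cfsadmin reports threshblocks in 8K blocks, while df reports usage in 512-byte blocks, so the threshold must be multiplied by 16 before the two are compared. A small sketch of that arithmetic follows; the function names are hypothetical.

```shell
#!/bin/sh
# Convert a count of 8K cache blocks to 512-byte blocks (8192 / 512 = 16).
to_512() {
    echo $(( $1 * 16 ))
}

# Print how many 512-byte blocks the front file system is over the
# threshblocks limit, or 0 if it is at or under the limit.
# Arguments: threshblocks (8K blocks), blocks in use (512-byte blocks).
blocks_over_threshold() {
    thresh_8k=$1
    used_512=$2
    limit=$(to_512 "$thresh_8k")
    if [ "$used_512" -gt "$limit" ]; then
        echo $(( used_512 - limit ))
    else
        echo 0
    fi
}
```

With the values from this example, to_512 103047 prints 1648752 and blocks_over_threshold 103047 1651288 prints 2536, the minimum number of 512-byte blocks that must be freed.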