SGI InfiniteStorage NAS Manager provides current and historical views of the state and performance of a NAS server, including CPU usage, disk and network throughput, and many other metrics. It also lets you view the connected clients and determine how each of them contributes to the current workload.
This chapter does not describe all of the details of each NAS Manager monitoring screen, because most screens are quite straightforward. Instead, it attempts to explain why the displayed information matters and how it can be sensibly interpreted.
This chapter discusses the following:
Figure 4-1 shows the top-level Monitoring screen.
The information provided by NAS Manager can be roughly broken down into “who” and “how much”. NAS Manager continuously gathers performance metrics and stores them in archives in /var/nasmgr/archives. Each month, a data reduction process is performed on the metrics gathered for the month. This reduces the size of the archives while retaining a consistent amount of information.
Although the size of metric archives has a bounded maximum, this can still be quite large depending on the configuration of the server and how many clients access it. For example, a server with a large number of filesystems could generate up to 100 Mbytes of archives per day. You should initially allow around 2 Gbytes of space for archive storage and monitor the actual usage for the first few weeks of operation.
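To keep an eye on the actual usage, you can check the size of the archive directory from the shell with the standard du command; for example:

# du -sh /var/nasmgr/archives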
Note: NAS Manager uses the International Electrotechnical Commission's International Standard names and symbols for binary multiples of units. In particular, this means that 1 MiB/s is 2^20 = 1,048,576 bytes per second. For more information on this standard, see the National Institute of Standards & Technology information about prefixes for binary multiples at:

http://physics.nist.gov/cuu/Units/binary.html
NAS Manager distinguishes between current and historic time. Current metrics are either drawn live from the server or are taken from the last few minutes of the metric archives. Historic metrics are taken exclusively from the metric archives. NAS Manager is able to display this historical information for three time periods:
Last hour
Last day (the previous 24 hours)
Last month (the previous 30 days)
Within bar graphs, NAS Manager uses color-coding to display the direction of data flow:
Figure 4-2 describes how NAS Manager color-codes the direction of data flow graphs. For an example of the result in a graph, see Figure 4-3.
NAS Manager provides a Summary menu selection at the top of the screen. This screen displays the following:
Click History to view the historical status of a parameter.
The screen displays ticks along the status bars labeled d (day) and h (hour). These represent the average value over the past day or hour, rather than the immediate value that is shown by the graph.
Figure 4-3 shows an example Summary screen.
In Figure 4-3, the bar graph for Network Throughput shows 34.8 MiB/s of data read/sent (the blue part of the graph) and 7.77 MiB/s of data written/received (the red part of the graph). If you were sending and receiving data at the same rate, there would be equal amounts of red and blue in the graph. For more information, see Figure 4-2.
The Alerts screen displays messages from the system logs. These provide informative messages, notifications of unusual events, and error conditions.
Only unacknowledged alerts are displayed. After a period of time, alerts are archived and will not be redisplayed. Acknowledged alerts are archived after 2 days and unacknowledged alerts are archived after 7 days. The /var/nasmgr/alerts/archive file contains all the archived alert messages.
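Assuming the archive is a plain text log (as its use here suggests), you can review older alerts directly from the shell; for example, to see the most recent archived messages:

# tail -50 /var/nasmgr/alerts/archive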
A resource is a finite capacity of the fileserver. NAS Manager contains a separate screen to display the utilization of each resource.
The following sections provide details about the resources:
Where multiple physical resources are bonded into a single logical resource (for example, load-balanced NICs and RAID volumes in a filesystem), NAS Manager shows the structure of the aggregated resource, and (where possible) shows metrics for both the aggregate and the component resources.
The Disk Space screen shows the number of bytes available on each filesystem. If the amount of disk space appears low on a filesystem on which disk quotas are enabled, you can use the Disk Quota screen to find out who is using the most disk space.
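As a quick command-line cross-check of what the Disk Space screen reports, the standard df command shows the space available on each mounted filesystem:

# df -h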
Disk quotas provide limits on the number of files and the amount of disk space a user is allowed to consume on each filesystem. A side effect of this is that they make it possible to see how much each user is currently consuming.
Because quotas are applied on a per-filesystem basis, the limits reported in the All Filesystems screen are not additive. This means that if a user has a 500-MiB disk space limit on filesystem A and a 500-MiB limit on filesystem B, the user cannot store a 1-GiB file because there is no single filesystem with a large-enough space allowance.
However, the current usage shown in the used column on the All Filesystems screen is additive, so you can use this screen to determine which users are currently consuming the most disk space. The All Filesystems screen highlights users who have exceeded the quota on any filesystem on which they have been allocated a quota.
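If you prefer the command line, and assuming the filesystems are XFS with quotas enabled, the xfs_quota utility can produce a similar per-user usage report; the mount point below is a hypothetical example:

# xfs_quota -x -c 'report -h' /mnt/data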
Disk operations occur when the result of a file operation is committed to disk. The most common types of disk operation are data reads and writes, but in some types of workload, metadata operations can be significant. Metadata operations include the following:
Truncating and removing files
Looking up filenames
Determining the size and types of files
Disk operations are measured in I/O per second (IOPS).
Disk throughput is the amount of data that is transferred to and from the disks. This is predominantly the result of reading and writing data.
The Disk Throughput and Disk IOPS screens display a bar graph for each active filesystem. For RAID filesystems, a separate graph is displayed for each volume element.
If the cache hit rate is low and the network throughput is high, the disk throughput will also be high. Usually, disk throughput holds steady at a little under the maximum bandwidth of the disk subsystem. If the disk throughput is consistently too high relative to the network throughput, this might indicate that the server has too little memory for the workload.
Under heavy loads, a fileserver must be able to sustain a high rate of disk operations. You can use the disk operations metrics in conjunction with other metrics to determine the characteristics of a workload so that you can tune the server. For example, high NIC utilization but few IOPS could indicate that a workload is being served straight from the cache. A large number of IOPS but low throughput (either disk or network) indicates a metadata-dominated load. You can determine the contributing operations or clients from the NFS screen, CIFS screen, and the various screens under the Clients category.
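Because NAS Manager gathers its metrics through Performance Co-Pilot (PCP), you can sample the same disk activity from the shell with pmval and the standard PCP disk metrics; the interval and sample count below are arbitrary:

# pmval -t 2 -s 10 disk.all.total        # total disk IOPS
# pmval -t 2 -s 10 disk.all.read_bytes   # disk read throughput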
The DMF Resources screens show how DMF is using its hardware, as described in the following sections:
For information about solving problems, see “DMF Error Messages”. For information on how NAS Manager displays user-generated DMF activity, see “DMF Activity”.
The following displays the tape library slot usage, which is the number of slots used by DMF or other applications or left vacant:
Monitoring -> Resources -> DMF -> Tape Libraries
The Tape Libraries screen is available only if the OpenVault tape subsystem is in use. This screen is unavailable if you are using Tape Management Facility (TMF). (You must choose a single method for handling tapes, either OpenVault or TMF.)
The following shows information about tape drives:
Monitoring -> Resources -> DMF -> Tape Drives
The Tape Drives screen provides information for each tape drive concerning its current state:
When the drive is in use, it also shows the following:
Activity (such as waiting)
Purpose (such as recall)
Details of the tape volume (such as volume name)
Note: This information is available only for DMF's tapes. Any other use, such as filesystem backups or direct tape use by users, is not shown; any such drives appear to be idle on this screen.
This screen also includes a link to the Reservation Delay History screen, which indicates when demand for tape drives exceeds the number available. This is purely a relative indication, to be compared visually with the equivalent indicator at other times; it has no useful numerical value.
The following shows the number of tape volumes in various states according to volume group (VG):
Monitoring -> Resources -> DMF -> Tape Volumes
Those volume groups that share an allocation group are shown together inside a box that indicates the grouping.
Because of their normally large number, full volumes are only shown numerically. Those in other states (such as empty) are shown graphically. History links show trends over time.
The following shows the proportions of files on DMF-managed filesystems that are migrated and not migrated:
Monitoring -> Resources -> DMF -> Filesystems
Gathering this information is expensive, so the screen is only updated when DMF needs this information for other purposes, and the time at which it was accurate is displayed. Updates are caused by the execution of the following DMF programs: dmfsfree, dmdaux (run at DMF startup), dmaudit, dmscanfs, dmhdelete, and dmselect.
The screen also displays the amount of offline data related to the filesystems and the over-subscription ratios (which are typically in the range of 10:1 to 1000:1, although they vary considerably from site to site). Because this is viewed from the filesystem perspective, the fact that migrated files may have more than one copy on the back-end media is not considered. That is, this is a measure of data that could be on disk but is not at the time, rather than a measure of the amount of back-end media being used.
The following shows Disk Cache Manager (DCM) disk caches:
Monitoring -> Resources -> DMF -> Caches
DCM disk caches are subject to the same update-frequency considerations as filesystems, as described in “DMF-Managed Filesystems”.
Dual-resident refers to cache files that have already been copied to back-end tape and can therefore be quickly removed from the cache if it starts filling. Non-dual-resident files would have tape copies made before they could be removed, which is much slower.
This section describes problems you may encounter when monitoring DMF with NAS Manager.
This screen requires statistics from DMF that are unavailable; check that DMF is running, including the "pmdadmf2" process. Make sure the DMF "EXPORT_METRICS" configuration parameter is enabled.
This message indicates that DMF is idle. When this occurs, perform the following procedure:
1. Check the version of DMF by running the dmversion command. It should report version 3.4.0.0 or later.

2. Check that the EXPORT_METRICS ON line has been added to /etc/dmf/dmf.conf after the TYPE base line. (A sketch of this part of the configuration file appears after this procedure.)

3. Run dmcheck to search the DMF configuration file for syntactic errors.

4. Check that DMF has been restarted after the change to /etc/dmf/dmf.conf was made in step 2.

5. Check that the data is being exported by DMF by running the following command:

   # dmarenadump -v

   If it is not, run the following commands as root:

   # cd /dmf/spool              # or equivalent at your site
   # rm base/arena
   # /etc/init.d/dmf restart
   # /etc/init.d/pcp stop
   # /etc/init.d/pcp start
   # /etc/init.d/nasmgr restart # if necessary

6. Check that the data is passing through PCP by running the following command:

   # pminfo -f dmf2

   If it is not, run the following commands as root:

   # cd /var/lib/pcp/pmdas/dmf2
   # ./Remove
   # ./Install
   # /etc/init.d/nasmgr restart
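As a reference for step 2, the relevant portion of /etc/dmf/dmf.conf might look like the following sketch; the paths shown are hypothetical, site-specific placeholders, and only the EXPORT_METRICS line is the point of interest:

define base
        TYPE            base           # the base configuration object
        HOME_DIR        /dmf/home      # hypothetical site-specific path
        SPOOL_DIR       /dmf/spool     # hypothetical site-specific path
        EXPORT_METRICS  ON             # enable metric export for NAS Manager
enddef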
No OpenVault-controlled library found.
This indicates that OpenVault is not running. Run the following command to verify that the ov_stat command is available:
# ls -lL /usr/bin/ov_stat
-rws--x--x 1 root sys 322304 Jul 22 2005 /usr/bin/ov_stat
If the file permissions are not -rws--x--x as shown above, run the following command to change the permissions:
# chmod 4711 /usr/bin/ov_stat |
Serving files places demands on the fileserver CPU as well as the I/O subsystem. The CPU helps with copying data to and from disks, calculating checksums, and other tasks. Table 4-1 shows the CPU metrics that NAS Manager reports.
Table 4-1. CPU Metrics Reported by NAS Manager
CPU Metric | Description
---|---|
Idle | Time that remained when the CPU could not find any tasks to run.
Wait | Time when a CPU was forced to do nothing while waiting for an event to occur. Typical causes of wait time are filesystem I/O and memory swapping.
Interrupt | Time the CPU spent processing requests from I/O devices. In a fileserver context, these are almost exclusively generated by disk operations or network packets and by switching between processes.
System | Time the CPU spent executing kernel code. This is usually dominated by NFS file serving and accessing data from disks.
User | Time when the CPU is devoted to running ordinary programs. The biggest consumers of user time in a fileserver would usually be the CIFS server, HTTP server, or FTP server.
CPU time is displayed as a percentage, where 100% is the total time available from a single CPU. This means that for an 8-CPU server, the total available CPU time is 800%.
In general, NFS workloads consume more system time, whereas CIFS, HTTP, and FTP workloads consume more user time. The NAS Manager performance monitoring infrastructure consumes only a small amount of user time.
The most useful problem indicator is consistently having little or no idle time. This can mean that the server is underpowered compared to the workload that is expected of it.
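To watch for this condition outside of NAS Manager, you can sample the PCP idle-time metric from the shell (the interval and sample count below are arbitrary); values consistently near zero suggest a saturated server:

# pmval -t 5 -s 12 kernel.all.cpu.idle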
The Network Throughput screen displays the amount of data transferred through each network interface card (NIC).
If an interface is load-balanced, NAS Manager displays throughput for both the aggregated (bonded) interface and its constituent interfaces.
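On Linux servers that use the kernel bonding driver for NIC load balancing, you can also inspect the state of an aggregated interface directly; bond0 below is a hypothetical interface name:

# cat /proc/net/bonding/bond0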
Note: The throughput displayed is total network throughput (which includes protocol headers), so real data transfer will be somewhat lower than this value. The Services category screens show the amount of real data transferred from a variety of perspectives.
A service is a task that is performed by the fileserver. While the primary service is fileserving, NAS Manager breaks this down by the different methods of accessing the server. The services known to NAS Manager are NFS, CIFS, and HTTP.
This section discusses the following screens available under the Services category:
NFS and CIFS traffic are the primary contributors to fileserver utilization.
Both NFS and CIFS services report statistics aggregated across all exports/shares as well as statistics for each export/share.
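For NFS, you can corroborate the reported operation mix with the standard nfsstat utility, which prints server-side totals by operation type:

# nfsstat -s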
Table 4-2 describes the statistics reported by both the NFS and CIFS screens. Table 4-3 describes additional information reported by the NFS screen.
Table 4-2. Statistics Reported by NFS and CIFS Screens
Graph | Description
---|---|
Throughput | Current incoming and outgoing traffic for the export/share. The NFS service Throughput graph includes all types of operations, whereas the CIFS graph only shows actual data transfer.
Operations by Type |
Read Block Sizes |
Write Block Sizes |
Table 4-3. Additional Information Reported by the NFS Screen
Both the CIFS and the NFS services gather like operations into a smaller number of operation classes. While these classes are largely similar, there are some differences. Table 4-4 and Table 4-5 summarize these classes.
Table 4-4. CIFS Operation Classes
Class | Description
---|---|
access | Operations that are used to perform access checking, including checks for file ownership, permission, size, and type
acl | All access control list (ACL) operations
close | File close operations
fsync | File data synchronization operations; these occur when an application requests that the contents of a file be flushed to disk
mknod | New file or directory creation, hard and symbolic link creation, file renaming, and device file creation operations
getattr | Operations that retrieve file attributes, such as access times
lock | All file locking operations, including lock requests and lock releases
lookup | Operations that result in filename translations; that is, operations that are applied to a filename rather than to a file handle, such as open
misc | All other operations, including operations such as quota check that are not frequently used
read | File read operations
readdir | Directory traversal operations
remove | File removal
rmdir | Directory removal
setattr | Operations that set file attributes, such as ownership and permissions
write | File write operations
xattr | Operations that manipulate XFS extended attributes
Table 4-5. NFS Operation Classes
Class | Description
---|---|
access | File accessibility tests; checks whether a client can open a particular file
commit | Commit request; requests that the server flush asynchronously written data to stable storage
fsinfo | Filesystem statistics and information requests, pathconf calls, and service availability tests
getattr | File attribute retrieval operations
mknod | New file or directory creation, hard and symbolic link creation, file renaming, and device file creation operations
lock | General lock operations not covered by other classes
lockd granted | Number of lock granting operations
lockd share | Number of export/share reservation operations
lookup | Operations that result in filename translations; that is, operations that are applied to a filename rather than to a file handle, such as open
read | File read operations and symbolic link resolution operations
readdir | Directory entry listing operations
readdirplus | Extended directory entry listing operations; returns the attributes of the directory entries as well as their names
remove | File deletion operations
setattr | File attribute setting operations, which include file truncations and changing permissions and ownership
write async | Asynchronous writes; the written data may be cached and scheduled for writing at a later time
write sync | Synchronous writes; these do not complete until the data is written to stable storage
xattr | Operations that manipulate XFS extended attributes
The HTTP screen reports HTTP metrics for default installations of the Apache server. NAS Manager does not display information for other HTTP servers.
Note: The HTTP information reported by NAS Manager does not include the traffic generated by accessing the NAS Manager interface itself.
The DMF Activity screen shows user-generated DMF activity from two points of view:
Number of requests being worked on (the Requests screen)
Rate of data throughput resulting from those requests (the Throughput screen)
Note: Values shown on the Requests and Throughput screens are averaged over the previous few minutes, so they are not necessarily integers as would be expected. This process also causes a slight delay in the display, which means that the values of DMF Activity screens do not necessarily match the current activity on the system, as seen in the DMF log files.
There are two distinct types of requests that are reflected in these screens:
Requests from the user to the DMF daemon. These are presented as an aggregate across the DMF server, and on a per-filesystem basis, using the label of Filesystems.
Requests from the DMF daemon to the subordinate daemons managing the back-end storage, the caches, the volume groups (VGs), and the media-specific processes (MSPs). Technically, caches are a variant of MSP despite their different purpose, hence the description Non-Cache MSP in the NAS Manager screens.
Sometimes, there is a 1:1 correspondence between a daemon request and a back-end request by cache, volume group, or MSP (such as when a file is being recalled from back-end media back to the primary DMF-managed filesystem), but this is frequently not the case. For example, migrating a newly created file to back-end media will result in one back-end request per copy, but deleting a migrated file results in a single daemon request but no back-end request at that time. Tape merges may cause a lot of activity within a volume group but none at the daemon level.
For the sake of clarity, the top-level requests and throughput screens, and their associated History screens, do not distinguish the different types of requests from each other. However, if you zoom in (via one of the Filesystems, Caches, Volume Groups, or MSPs links on the left-hand side), the resulting screen shows the broad categories as well as a breakdown by filesystem or by back-end storage group, as appropriate. This also applies to the related History screens.
A client is a computer running a program that accesses the fileserver. Clients are known to NAS Manager by their IP address; if multiple accessing programs are running on the same computer, they are all counted as a single client.
Note: Client information is gathered only for CIFS and NFS protocols.
The All Clients screen displays the clients sorted according to hostname. The other selections sort according to the chosen selection (such as by aggregate throughput).
From each of these screens, you can change the sorted display of the data without returning to the Monitoring screen.
Displaying the clients in this fashion is useful for pinpointing how the current set of clients is contributing to the workload profile. For example, if you notice an unusually large amount of network traffic on the Network Throughput screen, displaying the clients in order of aggregate throughput will quickly identify the clients responsible.
From the list of clients, you can display a detailed view of the NFS and CIFS traffic generated by a particular client. This is useful when trying to diagnose problems that affect only a single client or type of client. For example, by viewing the client detail, it may be obvious that throughput is limited by the client using very small read and write sizes. Continuing from the client details to the client history screen can help diagnose problems, such as hung NFS mounts.
The iSCSI screen displays a list of the connected iSCSI initiators and their targets.