Chapter 4. Performance Monitoring

SGI InfiniteStorage NAS Manager provides current and historical views of the state and performance of a NAS server. This includes CPU usage, disk and network throughput, and many other metrics. It also lets you view connected clients and determine how each of these contributes to the current workload.

This chapter does not describe all of the details of each NAS Manager monitoring screen, because most screens are quite straightforward. Instead, it attempts to explain why the displayed information matters and how it can be sensibly interpreted.

This chapter discusses the following:

  • "Metrics Collected"

  • "System Summary"

  • "System Alerts"

  • "Resources"

  • "Services"

  • "Clients"

Figure 4-1 shows the top-level Monitoring screen.

Figure 4-1. Monitoring Screen


Metrics Collected

The information provided by NAS Manager can be roughly broken down into “who” and “how much”. NAS Manager continuously gathers performance metrics and stores them in archives in /var/nasmgr/archives. Each month, a data-reduction process is performed on the metrics gathered for the month. This reduces the size of the archives while retaining a consistent amount of information.

Although the size of metric archives has a bounded maximum, this can still be quite large depending on the configuration of the server and how many clients access it. For example, a server with a large number of filesystems could generate up to 100 Mbytes of archives per day. You should initially allow around 2 Gbytes of space for archive storage and monitor the actual usage for the first few weeks of operation.
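
For example, to monitor the actual archive usage from a shell prompt, standard commands such as the following can be used (the path is the archive directory named above):

# du -sh /var/nasmgr/archives    # total size of the metric archives
# df -h /var                     # free space remaining on the containing filesystem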


Note: NAS Manager uses the International Electrotechnical Commission's International Standard names and symbols for binary multiples of units. In particular, this means that 1 MiB/s is 2^20 = 1048576 bytes per second. For more information on this standard, see the National Institute of Standards & Technology information about prefixes for binary multiples at:

http://physics.nist.gov/cuu/Units/binary.html
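
As a quick illustration of the difference between the binary and decimal prefixes, the following shell arithmetic shows the two values:

# echo $((2**20))    # 1 MiB = 1048576 bytes (binary prefix)
# echo $((10**6))    # 1 MB  = 1000000 bytes (decimal prefix)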

NAS Manager distinguishes between current and historic time. Current metrics are either drawn live from the server or are taken from the last few minutes of the metric archives. Historic metrics are taken exclusively from the metric archives. NAS Manager is able to display this historical information for three time periods:

  • Last hour

  • Last day (the previous 24 hours)

  • Last month (the previous 30 days)

Within bar graphs, NAS Manager uses color-coding to display the direction of data flow:

  • Red represents write and receive data flow

  • Blue represents read and send data flow

Figure 4-2 describes how NAS Manager color-codes the direction of data flow in graphs. For an example of the result in a graph, see Figure 4-3.

Figure 4-2. Color-Coding the Direction of Data Flow


System Summary

NAS Manager provides a Summary menu selection at the top of the screen. This screen displays the following:

  • CPU utilization

  • Disk space

  • Disk throughput

  • Network throughput

  • The number of NFS, CIFS, and iSCSI clients (if iSCSI targets have been created)

  • System uptime

  • Number of users

  • Load average

Click History to view the historical status of a parameter.

The screen displays ticks along the status bars labeled d (day) and h (hour). These represent the average value over the past day or hour, rather than the immediate value that is shown by the graph.

Figure 4-3 shows an example Summary screen.

Figure 4-3. Summary Screen


In Figure 4-3, the bar graph for Network Throughput shows 34.8 MiB/s of data read/sent (the blue part of the graph) and 7.77 MiB/s of data written/received (the red part of the graph). If you were sending and receiving data at the same rate, there would be equal amounts of red and blue in the graph. For more information, see Figure 4-2.

System Alerts

The Alerts screen displays messages from the system logs. These provide informative messages, notifications of unusual events, and error conditions.

Only unacknowledged alerts are displayed. After a period of time, alerts are archived and will not be redisplayed. Acknowledged alerts are archived after 2 days and unacknowledged alerts are archived after 7 days. The /var/nasmgr/alerts/archive file contains all the archived alert messages.
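
Because archived alerts are no longer displayed on this screen, you can review them from a shell prompt with standard commands; for example:

# tail -50 /var/nasmgr/alerts/archive         # the most recent archived alerts
# grep -i error /var/nasmgr/alerts/archive    # archived error messages only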

Resources

A resource is a finite capacity of the fileserver. NAS Manager contains a separate screen to display the utilization of each resource.

The following sections provide details about the resources:

  • "Disk Space"

  • "Disk Quota"

  • "Disk Throughput and Disk IOPS"

  • "DMF Resources"

  • "CPU Utilization"

  • "Network Throughput"

  • "Hardware Inventory"

Where multiple physical resources are bonded into a single logical resource (for example, load-balanced NICs and RAID volumes in a filesystem), NAS Manager shows the structure of the aggregated resource, and (where possible) shows metrics for both the aggregate and the component resources.

Disk Space

The Disk Space screen shows the number of bytes available on each filesystem. If the amount of disk space appears low on a filesystem on which disk quotas are enabled, you can use the Disk Quota screen to find out who is using the most disk space.
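
The standard df command provides an equivalent view from a shell prompt, which can be useful for cross-checking the Disk Space screen:

# df -h    # available space per mounted filesystem, in binary (1024-based) units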

Disk Quota

Disk quotas provide limits on the number of files and the amount of disk space a user is allowed to consume on each filesystem. A side effect of this is that they make it possible to see how much each user is currently consuming.

Because quotas are applied on a per-filesystem basis, the limits reported in the All Filesystems screen are not additive. This means that if a user has a 500-MiB disk space limit on filesystem A and a 500-MiB limit on filesystem B, the user cannot store a 1-GiB file because there is no single filesystem with a large-enough space allowance.

However, the current usage shown in the used column on the All Filesystems screen is additive, so you can use this screen to determine which users are currently consuming the most disk space. The All Filesystems screen highlights users who have exceeded the quota on any filesystem on which they have been allocated a quota.
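
If you need the same per-user information from a shell prompt, and assuming the filesystems are XFS with quotas enabled, a command such as the following reports usage and limits per user (/mnt/data is a hypothetical mount point):

# xfs_quota -x -c 'report -h' /mnt/data    # per-user blocks used, plus soft and hard limits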

Disk Throughput and Disk IOPS

Disk operations occur when the result of a file operation is committed to disk. The most common types of disk operation are data reads and writes, but in some types of workload, metadata operations can be significant. Metadata operations include the following:

  • Truncating and removing files

  • Looking up filenames

  • Determining the size and types of files

Disk operations are measured in I/O per second (IOPS).

Disk throughput is the amount of data that is transferred to and from the disks. This is predominantly the result of reading and writing data.

The Disk Throughput and Disk IOPS screens display a bar graph for each active filesystem. For RAID filesystems, a separate graph is displayed for each volume element.

If the cache hit rate is low and the network throughput is high, the disk throughput should also be high. Usually, the disk throughput holds steady at a little under the maximum bandwidth of the disk subsystem. If the disk throughput is consistently too high relative to the network throughput, this might indicate that the server has too little memory for the workload.

Under heavy loads, a fileserver must be able to sustain a high rate of disk operations. You can use the disk operations metrics in conjunction with other metrics to determine the characteristics of a workload so that you can tune the server accordingly. For example, high NIC utilization but few IOPS could indicate that a workload is being served straight from the cache. A large number of IOPS but low throughput (either disk or network) indicates a metadata-dominated load. You can determine the contributing operations or clients from the NFS screen, CIFS screen, and the various screens under the Clients category.
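
To observe the same relationship between IOPS and throughput from a shell prompt, and assuming the sysstat package is installed, iostat reports both for each disk device:

# iostat -x 5    # per-device request rates (IOPS) and data rates; column names vary by version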

DMF Resources

The DMF Resources screens show how DMF is using its hardware, as described in the following sections:

  • "OpenVault Tape Libraries"

  • "Tape Drives"

  • "Tape Volumes"

  • "DMF-Managed Filesystems"

  • "Disk Caches"

For information about solving problems, see “DMF Error Messages”. For information on how NAS Manager displays user-generated DMF activity, see “DMF Activity”.

OpenVault Tape Libraries

The following displays the tape library slot usage (the number of slots that are used by DMF, used by other applications, or vacant):

Monitoring -> Resources -> DMF -> Tape Libraries

The Tape Libraries screen is available only if the OpenVault tape subsystem is in use. This screen is unavailable if you are using Tape Management Facility (TMF). (You must choose a single method for handling tapes, either OpenVault or TMF.)

Tape Drives

The following shows information about tape drives:

Monitoring -> Resources -> DMF -> Tape Drives

The Tape Drives screen provides information for each tape drive concerning its current state:

  • Idle

  • Busy

  • Unavailable

When the drive is in use, it also shows the following:

  • Activity (such as waiting)

  • Purpose (such as recall)

  • Details of the tape volume (such as volume name)


Note: This information is available only for DMF's tapes. Any other use, such as filesystem backups or direct tape use by users, is not shown; any such drives appear to be idle on this screen.

This screen also includes a link to the Reservation Delay History screen, which indicates when demand for tape drives exceeds the number available. This is purely a relative indication, to be compared visually with the equivalent indicator at other times; it has no useful numerical value.

Tape Volumes

The following shows the number of tape volumes in various states according to volume group (VG):

Monitoring -> Resources -> DMF -> Tape Volumes

Those volume groups that share an allocation group are shown together inside a box that indicates the grouping.

Because of their normally large number, full volumes are only shown numerically. Those in other states (such as empty) are shown graphically. History links show trends over time.

DMF-Managed Filesystems

The following shows the proportions of files on DMF-managed filesystems that are migrated and not migrated:

Monitoring -> Resources -> DMF -> Filesystems

Gathering this information is expensive, so the screen is only updated when DMF needs this information for other purposes, and the time at which it was accurate is displayed. Updates are caused by the execution of the following DMF programs: dmfsfree, dmdaux (run at DMF startup), dmaudit, dmscanfs, dmhdelete, and dmselect.

The screen also displays the amount of offline data related to the filesystems and the over-subscription ratios (which are typically in the range of 10:1 to 1000:1, although they vary considerably from site to site). Because this is viewed from the filesystem perspective, the fact that migrated files may have more than one copy on the back-end media is not considered. That is, this is a measure of the data that could be on disk but currently is not, rather than a measure of the amount of back-end media being used.

Disk Caches

The following shows Disk Cache Manager (DCM) disk caches:

Monitoring -> Resources -> DMF -> Caches

DCM disk caches have similar issues to filesystems with regard to the frequency of updates as described in “DMF-Managed Filesystems”.

Dual-resident refers to cache files that have already been copied to back-end tape and can therefore be quickly removed from the cache if it starts filling. Non-dual-resident files would have tape copies made before they could be removed, which is much slower.

DMF Error Messages

This section describes problems you may encounter when monitoring DMF with NAS Manager.

DMF Statistics are Unavailable or DMF is Idle
This screen requires statistics from DMF that are unavailable; 
check that DMF is running, including the "pmdadmf2" process. 
Make sure the DMF "EXPORT_METRICS" configuration parameter is enabled.

This message appears when DMF statistics are unavailable or DMF is idle. When this occurs, perform the following procedure:

  1. Check the version of DMF by running the dmversion command. It should report version 3.4.0.0 or later.

  2. Check that an EXPORT_METRICS ON line has been added to /etc/dmf/dmf.conf after the TYPE base line. (See the example configuration stanza after this procedure.)

    Run dmcheck to search the DMF configuration file for syntactic errors.

  3. Check that DMF has been restarted after the change to /etc/dmf/dmf.conf was made in step 2.

  4. Check that the data is being exported by DMF by running the following command:

    # dmarenadump -v

    If it is not, run the following commands as root:

    # cd /dmf/spool                    # or equivalent at your site
    # rm base/arena                    # remove the old statistics arena file
    # /etc/init.d/dmf restart          # restart DMF (recreates the arena)
    # /etc/init.d/pcp stop             # restart the PCP collector daemons
    # /etc/init.d/pcp start
    # /etc/init.d/nasmgr restart       # if necessary

  5. Check that the data is passing through PCP by running the following command:

    # pminfo -f dmf2

    If it is not, run the following commands as root:

    # cd /var/lib/pcp/pmdas/dmf2       # PCP agent (PMDA) directory for DMF
    # ./Remove                         # remove the dmf2 agent from PCP
    # ./Install                        # reinstall and reactivate the agent
    # /etc/init.d/nasmgr restart
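
The following is a minimal sketch of the base object stanza in /etc/dmf/dmf.conf with metrics export enabled, as referenced in step 2 (all other parameters are omitted here and vary by site):

define base
    TYPE            base
    EXPORT_METRICS  ON
    # ... remaining base object parameters are site-specific ...
enddef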

OpenVault Library Is Missing
No OpenVault-controlled library found.

This indicates that OpenVault is not running. Run the following command to verify that the ov_stat command is available:

# ls -lL /usr/bin/ov_stat
-rws--x--x 1 root sys 322304 Jul 22 2005 /usr/bin/ov_stat

If the file permissions are not -rws--x--x as shown above, run the following command to change the permissions:

# chmod 4711 /usr/bin/ov_stat

CPU Utilization

Serving files places demands on the fileserver CPU as well as the I/O subsystem. The CPU helps with copying data to and from disks, calculating checksums, and other tasks. Table 4-1 shows the CPU metrics that NAS Manager reports.

Table 4-1. CPU Metrics Reported by NAS Manager

  • Idle time: Time that remained when the CPU could not find any tasks to run.

  • Wait time: Time when a CPU was forced to do nothing while waiting for an event to occur. Typical causes of wait time are filesystem I/O and memory swapping.

  • Interrupt time: Time the CPU spent processing requests from I/O devices. In a fileserver context, these are almost exclusively generated by disk operations or network packets and by switching between processes.

  • System time: Time the CPU spent executing kernel code. This is usually dominated by NFS file serving and accessing data from disks.

  • User time: Time when the CPU is devoted to running ordinary programs. The biggest consumers of user time in a fileserver would usually be the CIFS server, HTTP server, or FTP server.

CPU time is displayed as a percentage, where 100% is the total time available from a single CPU. This means that for an 8-CPU server, the total available CPU time is 800%.

In general, NFS workloads consume more system time, whereas CIFS, HTTP, and FTP workloads consume more user time. The NAS Manager performance monitoring infrastructure consumes only a small amount of user time.

The most useful problem indicator is consistently having little or no idle time. This can mean that the server is underpowered compared to the workload that is expected of it.
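
If you want to cross-check these figures outside of NAS Manager, and assuming the sysstat package is installed, mpstat reports a similar breakdown for each CPU:

# mpstat -P ALL 5    # per-CPU user, system, iowait, and idle percentages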

Network Throughput

The Network Throughput screen displays the amount of data transferred through each network interface card (NIC).

If an interface is load-balanced, NAS Manager displays throughput for both the aggregated (bonded) interface and its constituent interfaces.
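
Independent of NAS Manager, the composition of a bonded interface can be inspected through the Linux bonding driver; for example, assuming a bonded interface named bond0:

# cat /proc/net/bonding/bond0    # bonding mode, member interfaces, and link states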


Note: The throughput displayed is total network throughput (which includes protocol headers), so real data transfer will be somewhat lower than this value. The Services category screens show the amount of real data transferred from a variety of perspectives.


Hardware Inventory

The hardware inventory is a summary of the hardware configuration, including the CPUs, I/O controllers, memory, network controllers, and SCSI disks. The list of SCSI disks includes both the system root disk and the configured RAID logical units (LUNs).

Services

A service is a task that is performed by the fileserver. While the primary service is fileserving, NAS Manager breaks this down by the different methods of accessing the server. The services known to NAS Manager are NFS, CIFS, and HTTP.

This section discusses the following screens available under the Services category:

  • "NFS and CIFS"

  • "HTTP"

  • "DMF Activity"

  • "Versions"

NFS and CIFS

NFS and CIFS traffic are the primary contributors to fileserver utilization.

Both NFS and CIFS services report statistics aggregated across all exports/shares as well as statistics for each export/share.

Table 4-2 describes the statistics reported by both the NFS and CIFS screens. Table 4-3 describes additional information reported by the NFS screen.


Note: There is not a one-to-one correspondence between CIFS and NFS IOPS. The former measures operations that are received from a network client; the latter measures operations that are sent to a local filesystem.


Table 4-2. Statistics Reported by NFS and CIFS Screens

  • Throughput: Current incoming and outgoing traffic for the export/share. The NFS service Throughput graph includes all types of operations, whereas the CIFS graph only shows actual data transfer.

  • Operations by Type: Export/share operations by class.

  • Read Block Sizes: Reads by size.

  • Write Block Sizes: Writes by size.


Table 4-3. Additional Information Reported by the NFS Screen

  • IOPS: I/O per second for TCP and for UDP. (This is not needed for CIFS because CIFS always uses a TCP transport.)

  • Service Times: Number of operations falling into each service time band as tracked by the NFS server for each operation.
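
For a command-line view of comparable NFS server counters, the standard nfsstat utility can be used independently of NAS Manager:

# nfsstat -s    # NFS server operation counts, by protocol version and operation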

Both the CIFS and the NFS services gather like operations into a smaller number of operation classes. While these classes are largely similar, there are some differences. Table 4-4 and Table 4-5 summarize these classes.


Note: The NFS operation statistics measure classes of NFS protocol operations sent by clients. The CIFS statistics, on the other hand, measure classes of system calls performed by the CIFS server on behalf of a client. NAS Manager does not provide a way to measure which CIFS protocol operations are requested. This is because the mapping from protocol operations to system calls is much more complex for CIFS than it is for NFS; in some cases, a single protocol operation can result in many thousands of system calls. The system call information is much more reflective of performance than the CIFS protocol operations.


Table 4-4. CIFS Operation Classes

  • access: Operations that are used to perform access checking, including checks for file ownership, permission, size, and type

  • acl: All access control list (ACL) operations

  • close: File close operations

  • commit: File data synchronization operations; these occur when an application requests that the contents of a file be flushed to disk

  • create: New file or directory creation, hard and symbolic link creation, file renaming, and device file creation operations

  • getattr: Operations that retrieve file attributes, such as access times

  • lock_request: All file locking operations, including lock requests and lock releases

  • lookup: Operations that result in filename translations; that is, operations that are applied to a filename rather than to a file handle, such as open

  • misc: All other operations, including operations such as quota check that are not frequently used

  • read: File read operations

  • readdir: Directory traversal operations

  • remove: File removal

  • rmdir: Directory removal

  • setattr: Operations that set file attributes, such as ownership and permissions

  • write: File write operations

  • xattr: Operations that manipulate XFS extended attributes


Table 4-5. NFS Operation Classes

  • access: File accessibility tests; checks whether a client can open a particular file

  • commit: Commit requests; these request that the server flush asynchronously written data to stable storage

  • fsinfo: Filesystem statistics and information requests, pathconf calls, and service availability tests

  • getattr: File attribute retrieval operations

  • inode_mods: New file or directory creation, hard and symbolic link creation, file renaming, and device file creation operations

  • lockd: General lock operations not covered by other classes

  • lockd_granted: Number of lock granting operations

  • lockd_share: Number of export/share reservation operations

  • lookup: Operations that result in filename translations; that is, operations that are applied to a filename rather than to a file handle, such as open

  • read: File read operations and symbolic link resolution operations

  • readdir: Directory entry listing operations

  • readdirplus: Extended directory entry listing operations; returns the attributes of the directory entries as well as their names

  • remove: File deletion operations

  • setattr: File attribute setting operations, which include file truncations and changing permissions and ownership

  • write_async: Asynchronous writes; the written data may be cached and scheduled for writing at a later time

  • write_sync: Synchronous writes; these do not complete until the data is written to stable storage

  • xattr: Operations that manipulate XFS extended attributes


HTTP

The HTTP screen reports HTTP metrics for default installations of the Apache server. NAS Manager does not display information for other HTTP servers.


Note: The HTTP information reported by NAS Manager does not include the traffic generated by accessing the NAS Manager interface itself.


DMF Activity

The DMF Activity screen shows user-generated DMF activity from two points of view:

  • Number of requests being worked on (the Requests screen)

  • Rate of data throughput resulting from those requests (the Throughput screen)


Note: Values shown on the Requests and Throughput screens are averaged over the previous few minutes, so they are not necessarily integers as would be expected. This process also causes a slight delay in the display, which means that the values of DMF Activity screens do not necessarily match the current activity on the system, as seen in the DMF log files.

There are two distinct types of requests that are reflected in these screens:

  • Requests from the user to the DMF daemon. These are presented as an aggregate across the DMF server, and on a per-filesystem basis, using the label of Filesystems.

  • Requests from the DMF daemon to the subordinate daemons managing the back-end storage, the caches, the volume groups (VGs), and the media-specific processes (MSPs). Technically, caches are a variant of MSP despite their different purpose, hence the description Non-Cache MSP in the NAS Manager screens.

Sometimes, there is a 1:1 correspondence between a daemon request and a back-end request by cache, volume group, or MSP (such as when a file is being recalled from back-end media back to the primary DMF-managed filesystem), but this is frequently not the case. For example, migrating a newly created file to back-end media will result in one back-end request per copy, but deleting a migrated file results in a single daemon request but no back-end request at that time. Tape merges may cause a lot of activity within a volume group but none at the daemon level.

For the sake of clarity, the top-level requests and throughput screens, and their associated History screens, do not distinguish the different types of requests from each other. However, if you zoom in (via one of the Filesystems, Caches, Volume Groups, or MSPs links on the left-hand side), the resulting screen shows the broad categories as well as a breakdown by filesystem or by back-end storage group, as appropriate. This also applies to the related History screens.

Versions

The Versions screen displays the version numbers of key software packages that have been installed.

Clients

A client is a computer running a program that accesses the fileserver. Clients are known to NAS Manager by their IP address; if multiple accessing programs are running on the same computer, they are all counted as a single client.


Note: Client information is gathered only for CIFS and NFS protocols.

The All Clients screen displays the clients sorted according to hostname. The other selections sort according to the chosen selection (such as by aggregate throughput).

From each of these screens, you can change the sorted display of the data without returning to the Monitoring screen.

Displaying the clients in this fashion is useful for pinpointing how the current set of clients contributes to the workload profile. For example, upon noticing an unusually large amount of network traffic on the Network Throughput screen, changing the display to show clients in order of aggregate throughput will quickly identify the contributing clients.

From the list of clients, you can display a detailed view of the NFS and CIFS traffic generated by a particular client. This is useful when trying to diagnose problems that affect only a single client or type of client. For example, by viewing the client detail, it may be obvious that throughput is limited by the client using very small read and write sizes. Continuing from the client details to the client history screen can help diagnose problems, such as hung NFS mounts.

The iSCSI screen displays a list of the connected iSCSI initiators and their targets.