Chapter 14. Monitoring Status

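This chapter discusses the following:

  • “Methods to View System Status”

  • “Status in Log Files”

  • “Cluster, Node, and CXFS Filesystem Status”

  • “I/O Fencing Status”

  • “XVM Statistics”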

Methods to View System Status

You can view the system status in the following ways:


Note: You must run administration commands on a server-capable administration node; you run the cxfs_info status command on a client-only node.


  • Monitor log files. See “Status in Log Files”.

  • Use the CXFS GUI or the tail command to view the end of the /var/log/messages system log file on a server-capable administration node. (You can also view the system log file on client-only nodes.)

  • Keep continuous watch on the state of a cluster using the GUI view area or the following cxfs_admin command:

    cxfs_admin -i clustername -r -c "status interval=seconds"

  • Query the status of an individual node or cluster using the GUI or cxfs_admin.

  • Manually test the filesystems with the ls command.

  • Monitor the system with Performance Co-Pilot. You can use Performance Co-Pilot to monitor the read/write throughput and I/O load distribution across all disks and for all nodes in the cluster. The activity can be visualized, used to generate alarms, or archived for later analysis. You can also monitor XVM statistics.

    See the following:

    • Performance Co-Pilot for Linux User's and Administrator's Guide

    • Performance Co-Pilot for Linux Programmer's Guide

    • dkvis(1), pmie(1), pmieconf(1), and pmlogger(1) man pages


    Note: You must manually install the XVM statistics for the Performance Co-Pilot package; it is not installed by default.



Note: Administrative tasks must be performed using one of the following tools:

  • The CXFS GUI when it is connected to a server-capable administration node (a node that has the cluster_admin software package installed)

  • cxfs_admin (you must be logged in as root on a host that has permission to access the CXFS cluster database)



Status in Log Files

You should monitor the following for problems:

  • Server-capable administration node log: /var/log/messages. Look for a Membership delivered message, which indicates that a cluster was formed.

  • Events from the GUI and clconfd: /var/cluster/ha/log/cad_log

  • Kernel status: /var/cluster/ha/log/clconfd_hostname

  • Command line interface log: /var/cluster/ha/log/cli_hostname

  • Monitoring of other daemons: /var/cluster/ha/log/cmond_log

  • Reset daemon log: /var/cluster/ha/log/crsd_hostname

  • Output of the diagnostic tools such as the serial and network connectivity tests: /var/cluster/ha/log/diags_hostname

  • Cluster database membership status: /var/cluster/ha/log/fs2d_log

  • System administration log, which contains a list of the commands run by the GUI: /var/lib/sysadm/salog

For information about client-only nodes, see CXFS 5 Client-Only Guide for SGI InfiniteStorage.
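
The membership check in the list above can be scripted. The following is a minimal sketch and is not part of CXFS: the function name membership_events is hypothetical, and the sketch assumes that the log contains the literal text Membership delivered, as described above.

```shell
# Hypothetical helper: count "Membership delivered" messages in a log file.
# Assumes the message text appears literally in the log, as described above.
membership_events() {
    logfile="${1:-/var/log/messages}"
    # grep -c prints the number of matching lines (0 if there are none).
    grep -c "Membership delivered" "$logfile"
}
```

For example, running membership_events /var/log/messages on a server-capable administration node prints the number of matching lines. A result of 0 suggests that no membership has been recorded in that file.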

Cluster, Node, and CXFS Filesystem Status

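You can monitor system status with the following tools:

  • “CXFS GUI and Status”

  • “cxfs_admin and Status”

  • “cxfs_info and Status”

  • “clconf_info and Status”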

Also see “Key to Icons and States” in Chapter 10.

CXFS GUI and Status

You can monitor the status of the cluster, individual nodes, and CXFS filesystems by using the CXFS GUI connected to a server-capable administration node. For complete details about the GUI, see Chapter 10, “CXFS GUI”.

The easiest way to keep a continuous watch on the state of a cluster is to use the view area and choose the following:

Edit -> Expand All

The cluster status can be one of the following:

  • ACTIVE, which means the cluster is up and running.

  • INACTIVE, which means that CXFS services have not been started.

  • ERROR, which means that some nodes are in a DOWN state; that is, the cluster should be running, but it is not.

  • UNKNOWN, which means that the state cannot be determined because CXFS services are not running on the node performing the query.

To query the status of a node, you provide the logical name of the node. The node status can be one of the following:

  • UP, which means that CXFS services are started and the node is part of the CXFS kernel membership.

  • DOWN, which means that although CXFS services are started and the node is defined as part of the cluster, the node is not in the current CXFS kernel membership.

  • INACTIVE, which means that CXFS services have not been started.

  • UNKNOWN, which means that the state cannot be determined because CXFS services are not running on the node performing the query.

State information is exchanged by daemons that run only when CXFS services are started. A given server-capable administration node must be running CXFS services in order to report status on other nodes.

For example, CXFS services must be started on node1 in order for it to show the status of node2. If CXFS services are started on node1, then it will accurately report the state of all other nodes in the cluster. However, if node1's CXFS services are not started, it will report the following states:

  • INACTIVE for its own state, because it can determine that the start CXFS services task has not been run

  • UNKNOWN as the state of all other nodes, because the daemons required to exchange information with other nodes are not running, and therefore state cannot be determined

You can use the view area to monitor the status of the nodes. Select View: Nodes and Cluster.
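
The node1 and node2 reasoning above can be summarized as a small decision rule. The following sketch only models that rule; it does not query a real cluster, and the function name reported_state and its arguments are hypothetical.

```shell
# Hypothetical model of the reporting rule described above.
# Arguments:
#   $1  "started" if CXFS services are started on the querying node
#   $2  "self" if the target is the querying node, otherwise "other"
#   $3  the actual state of the target (UP, DOWN, or INACTIVE)
reported_state() {
    if [ "$1" = "started" ]; then
        echo "$3"          # services are running: the actual state is reported
    elif [ "$2" = "self" ]; then
        echo "INACTIVE"    # the node knows its own services are not started
    else
        echo "UNKNOWN"     # no daemons are running to exchange state
    fi
}
```

For example, reported_state stopped other UP prints UNKNOWN, which corresponds to node1 being unable to determine the state of node2 when CXFS services are not started on node1.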

cxfs_admin and Status

You can monitor the status of the cluster, individual nodes, and CXFS filesystems by using the cxfs_admin command on any host that has monitor access to the CXFS cluster database. For complete details about cxfs_admin, see Chapter 11, “cxfs_admin Command”.

To query node and cluster status, use the following cxfs_admin command on any host that has monitor access (see “Setting cxfs_admin Access Permissions” in Chapter 11) to the CXFS cluster database:

status

To continuously redisplay an updated status, enter an interval in seconds:

status interval=seconds

For example, to redisplay every 8 seconds:

cxfs_admin:mycluster> status interval=8

To stop the updates, send an interrupt signal (usually Ctrl+C).

The most common states for nodes include:

  • Disabled: The node is not allowed to join the cluster

  • Inactive: The node is not in cluster membership

  • Stable: The node is in membership and has mounted all of its filesystems

A node can have other transient states, such as Establishing membership.

The most common states for filesystems include:

  • Mounted: All enabled nodes have mounted the filesystem

  • Unmounted: All nodes have unmounted the filesystem

The cluster can have one of the following states:

  • Stable

  • node(s) not stable

  • filesystem(s) not stable

  • node(s), filesystem(s) not stable

Any other state (not mentioned above) requires attention by the administrator.

For example (a * character indicates a server-capable administration node):

cxfs_admin:clusterOne > status
Event at [ Jan 26 11:38:06 ]
Cluster         : clusterOne
Tiebreaker      : 
Client Licenses : enterprise   allocated 0 of 256
                  workstation  allocated 2 of 50
------------------  --------  --------  ----------------------------------------------------------------
Node                Cell ID   Age       Status
------------------  --------  --------  ----------------------------------------------------------------
bert *              1         5         Stable
cxfsxe5 *           0         26        Stable
cxfs3               4         0         Disabled
penguin17           2         1         Stable
pg-27               3         12        Stable
 
------------------  ------------------  ----------------------------------------------------------------
Filesystem          Server Name         Status
------------------  ------------------  ----------------------------------------------------------------
zj01s0              N/A                 0 of 5 nodes mounted, A server is trying to mount
zj01s1              N/A                 Unmounted
zj0ds2              cxfsxe5             Mounted [2 of 3 nodes]
 
------------------  ----------  ------------------------------------------------------------------------
Switch              Port Count  Known Fenced Ports
------------------  ----------  ------------------------------------------------------------------------
brocade26cp0        192         24, 25, 223
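
Output in this form can be filtered to show only the nodes that need attention. The following is a minimal sketch, assuming that the node table has the layout shown in the example above (a Node header row, an optional * after the node name, and the status as the last column). The function name unstable_nodes is hypothetical.

```shell
# Hypothetical filter: read cxfs_admin status output on standard input and
# print every node whose status is not "Stable".
# Assumes the table layout shown in the example above.
unstable_nodes() {
    awk '
        /^Node /       { in_nodes = 1; next }   # header row of the node table
        /^Filesystem / { in_nodes = 0 }         # the next table begins
        in_nodes && NF > 0 && $1 !~ /^-+$/ {
            # A "*" after the name marks a server-capable administration
            # node and shifts the remaining columns by one field.
            first = ($2 == "*") ? 5 : 4
            status = $first
            for (i = first + 1; i <= NF; i++) status = status " " $i
            if (status != "Stable") print $1 ": " status
        }
    '
}
```

Applied to the example output above, the filter would report only cxfs3 with the status Disabled. One possible use, assuming that a one-shot status command can be passed with -c in the same way as the status interval=seconds form shown earlier in this chapter, is cxfs_admin -i clustername -c "status" | unstable_nodes.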

cxfs_info and Status

You can monitor the status of the cluster, individual nodes, and CXFS filesystems by using the cxfs_info command on a client-only node. The path to cxfs_info varies by platform.

You can use the -e option to display information continuously, updating the screen when new information is available; use the -c option to clear the screen between updates. For less verbose output, use the -q (quiet) option.

For example, on a Solaris node named cxfssun4:

cxfssun4# /usr/cxfs_cluster/bin/cxfs_info
cxfs_client status [timestamp Sep 03 12:16:06 / generation 18879]

Cluster:
    sun4 (4) - enabled
Local:
    cxfssun4 (2) - enabled, state: stable, cms: up, xvm: up, fs: up
Nodes:
    cxfs27     enabled  up    1     
    cxfs28     enabled  up    0     
    cxfsnt4    enabled  up    3     
    cxfssun4   enabled  up    2     
    mesabi     enabled  DOWN  4     
Filesystems:
    lun1s0     enabled  mounted          lun1s0               /lun1s0
    mirror0    disabled unmounted        mirror0              /mirror0
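
The Nodes section of this output can be scanned for nodes that are down. The following is a minimal sketch, assuming the layout shown in the example above, in which each section starts with an unindented label ending in a colon and the third column of a node line is up or DOWN. The function name down_nodes is hypothetical.

```shell
# Hypothetical filter: read cxfs_info output on standard input and print
# the names of nodes listed as DOWN in the Nodes section.
# Assumes the section layout shown in the example above.
down_nodes() {
    awk '
        /^Nodes:/     { in_nodes = 1; next }   # start of the Nodes section
        /^[A-Za-z]+:/ { in_nodes = 0 }         # any other section label
        in_nodes && $3 == "DOWN" { print $1 }
    '
}
```

Applied to the example output above, the filter would print mesabi. On the Solaris node in the example, it could be used as /usr/cxfs_cluster/bin/cxfs_info | down_nodes.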

clconf_info and Status

You can monitor the status of the cluster, individual nodes, and CXFS filesystems by using the clconf_info command on a server-capable administration node, assuming that the cluster is up.

The clconf_info command has the following options:

-e

Waits for events from clconfd and displays the new information

-n nodename

Displays information for the specified logical node name

-p

Persists until the membership is formed

-q

(Quiet mode) Decreases verbosity of output. You can repeat this option to increase the level of quiet; that is, -qq specifies more quiet (less output) than -q.

-s

Sorts the output alphabetically by name for nodes and by device for filesystems. By default, the output is not sorted.

-v

(Verbose mode) Specifies the verbosity of output (-vv specifies more verbosity than -v). The default output for clconf_info is the maximum verbosity.

For example:

server-admin# /usr/cluster/bin/clconf_info

Event at [2004-04-16 09:20:59]

Membership since Fri Apr 16 09:20:56 2004
____________ ______ ________ ______ ______
Node         NodeID Status   Age    CellID
____________ ______ ________ ______ ______
leesa             0 inactive      -      0
whack             2 up           16      3
lustre            8 up            5      5
thud             88 up           16      1
cxfs2           102 DOWN          -      2
____________ ______ ________ ______ ______
2 CXFS FileSystems
/dev/cxvm/tp9500_0 on /mnt/cxfs0  enabled  server=(whack)  2 client(s)=(thud,lustre)  status=UP
/dev/cxvm/tp9500a4s0 on /mnt/tp9500a4s0  disabled  server=()  0 client(s)=()  status=DOWN

This command displays the following fields:

  • Node is the node name.

  • NodeID is the node ID.

  • Status is the status of the node, which may be up, DOWN, or inactive.

  • Age indicates the number of membership transitions in which the node has participated. The age is 1 the first time a node joins the membership and increments each time the membership changes. This number is dynamically allocated by the CXFS software (the user does not define the age).

  • CellID is the cell ID, which is allocated when a node is added into the cluster definition with the GUI or cxfs_admin. It persists until the node is removed from the cluster. The kernel also reports the cell ID in console messages.
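
The fields described above can be used to pick out nodes that are not up. The following is a minimal sketch, assuming that each node row has exactly the five columns listed above and a numeric NodeID. The function name nodes_not_up is hypothetical.

```shell
# Hypothetical filter: read clconf_info output on standard input and print
# each node whose Status column is not "up".
# A node row is assumed to have five fields with a numeric NodeID in field 2,
# which excludes the header, separator, and filesystem lines.
nodes_not_up() {
    awk 'NF == 5 && $2 ~ /^[0-9]+$/ && $3 != "up" { print $1 ": " $3 }'
}
```

Applied to the clconf_info example earlier in this section, the filter would report leesa as inactive and cxfs2 as DOWN. It could be used as /usr/cluster/bin/clconf_info | nodes_not_up on a server-capable administration node.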

You can also use the clconf_info command to monitor the status of the nodes in the cluster. It uses the same node states as the CXFS GUI. See “CXFS GUI and Status”.

I/O Fencing Status

To check the current fencing status, do one of the following:

  • Select View: Switches in the GUI view area

  • Use the show switch command within cxfs_admin

  • Use the hafence command as follows:

    /usr/cluster/bin/hafence -q

For example, the following output shows that all nodes are enabled:

server-admin# /usr/cluster/bin/hafence -q
  Switch[0] "ptg-brocade" has 8 ports
    Port 1 type=FABRIC status=enabled  hba=210000e08b0102c6 on host thunderbox
    Port 2 type=FABRIC status=enabled  hba=210000e08b01fec5 on host whack
    Port 5 type=FABRIC status=enabled  hba=210000e08b027795 on host thump
    Port 6 type=FABRIC status=enabled  hba=210000e08b019ef0 on host thud

A fenced port shows status=disabled. For example:

server-admin# /usr/cluster/bin/hafence -q
  Switch[0] "brocade04" has 16 ports
    Port 4 type=FABRIC status=enabled  hba=210000e08b0042d8 on host o200c
    Port 5 type=FABRIC status=enabled  hba=210000e08b00908e on host cxfs30
    Port 9 type=FABRIC status=enabled  hba=2000000173002d3e on host cxfssun3

Verbose (-v) output would be as follows:

server-admin# /usr/cluster/bin/hafence -v
  Switch[0] "brocade04" has 16 ports
    Port 0 type=FABRIC status=enabled  hba=2000000173003b5f on host UNKNOWN
    Port 1 type=FABRIC status=enabled  hba=2000000173003adf on host UNKNOWN
    Port 2 type=FABRIC status=enabled  hba=210000e08b023649 on host UNKNOWN
    Port 3 type=FABRIC status=enabled  hba=210000e08b021249 on host UNKNOWN
    Port 4 type=FABRIC status=enabled  hba=210000e08b0042d8 on host o200c
    Port 5 type=FABRIC status=enabled  hba=210000e08b00908e on host cxfs30
    Port 6 type=FABRIC status=enabled  hba=2000000173002d2a on host UNKNOWN
    Port 7 type=FABRIC status=enabled  hba=2000000173003376 on host UNKNOWN
    Port 8 type=FABRIC status=enabled  hba=2000000173002c0b on host UNKNOWN
    Port 9 type=FABRIC status=enabled  hba=2000000173002d3e on host cxfssun3
    Port 10 type=FABRIC status=enabled  hba=2000000173003430 on host UNKNOWN
    Port 11 type=FABRIC status=enabled  hba=200900a0b80c13c9 on host UNKNOWN
    Port 12 type=FABRIC status=disabled hba=0000000000000000 on host UNKNOWN
    Port 13 type=FABRIC status=enabled  hba=200d00a0b80c2476 on host UNKNOWN
    Port 14 type=FABRIC status=enabled  hba=1000006069201e5b on host UNKNOWN
    Port 15 type=FABRIC status=enabled  hba=1000006069201e5b on host UNKNOWN

A status of enabled for an UNKNOWN host indicates that the port is connected to a system that is not a node in the cluster. A status of disabled for an UNKNOWN host indicates that the node has been fenced (disabled), and the port may or may not be connected to a node in the cluster. A status of enabled with a specific host name indicates that the port is not fenced and is connected to the specified node in the cluster.
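
The interpretation rules above can be applied mechanically to each port line. The following is a minimal sketch, assuming the line format shown in the examples above (a status= field and the host name as the last word). The function name classify_ports and the wording of its output are hypothetical.

```shell
# Hypothetical filter: read hafence output on standard input and print a
# one-line interpretation of each port, following the rules described above.
classify_ports() {
    awk '
        $1 == "Port" {
            status = ""; host = $NF
            for (i = 1; i <= NF; i++)
                if ($i ~ /^status=/) status = substr($i, 8)
            if (status == "disabled")
                note = "fenced"
            else if (host == "UNKNOWN")
                note = "enabled, not a cluster node"
            else
                note = "enabled, cluster node " host
            print "port " $2 ": " note
        }
    '
}
```

For example, /usr/cluster/bin/hafence -v | classify_ports would label port 12 in the verbose output above as fenced and port 4 as connected to the cluster node o200c.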

To check current fail policy settings, use the show failpolicy command in cxfs_admin, the node information in the GUI, or the cms_failconf command as follows:

/usr/cluster/bin/cms_failconf -q

For example, the following output shows that all nodes except thud have the system default fail policy configuration. The node thud has been configured for fencing and resetting.

server-admin# /usr/cluster/bin/cms_failconf -q
CMS failure configuration:
        cell[0] whack    Reset Shutdown
        cell[1] thunder  Reset Shutdown
        cell[2] thud     Fence Reset
        cell[3] thump    Reset Shutdown
        cell[4] terry    Reset Shutdown
        cell[5] leesa    Reset Shutdown
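
Nodes with a non-default fail policy can be picked out of this output. The following is a minimal sketch, assuming the line format shown in the example above and assuming that the system default appears as Reset Shutdown, as it does in that example. The function name nondefault_failpolicy is hypothetical, and a different default can be passed as its first argument.

```shell
# Hypothetical filter: read cms_failconf -q output on standard input and
# print each node whose fail policy differs from the default.
# The default "Reset Shutdown" is an assumption taken from the example above.
nondefault_failpolicy() {
    def_policy="${1:-Reset Shutdown}"
    awk -v def_policy="$def_policy" '
        $1 ~ /^cell\[[0-9]+\]$/ {
            # Fields after the node name form the fail policy.
            policy = $3
            for (i = 4; i <= NF; i++) policy = policy " " $i
            if (policy != def_policy) print $2 ": " policy
        }
    '
}
```

Applied to the example output above, the filter would report only thud with the policy Fence Reset. It could be used as /usr/cluster/bin/cms_failconf -q | nondefault_failpolicy.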

XVM Statistics

You can use Performance Co-Pilot to monitor XVM statistics. To do this, you must enable the collection of statistics:

  • To enable the collection of statistics for the local host, enter the following:

    pmstore xvm.control.stats_on 1

  • To disable the collection of statistics for the local host, enter the following:

    pmstore xvm.control.stats_on 0

You can gather XVM statistics in the following ways:

  • By using the pmdumptext command from the SGI Foundation pcp-open RPM. It can be used to produce an ASCII report of selected metrics from the xvm group in the Performance Co-Pilot namespace of available metrics.

  • By using the pmgxvm command provided.

    If you have the pcp-sgi RPM from SGI ProPack, you can also use the pmchart command to view time-series data in the form of a moving graph. Figure 14-1 shows an example.

    Figure 14-1. pmgxvm chart
