This chapter describes the administrative tasks you perform to operate and monitor a FailSafe system, and how to perform those tasks using the FailSafe Manager GUI and the cmgr command.
Note: SGI recommends that you perform all FailSafe administration from one node in the pool so that the latest copy of the database will be available, even when there are network partitions.
Use the following procedure to redirect the console, which is required to get access to the console input and output on systems with only one serial/USB port that provides both L1 system controller and console support:
Edit the /etc/inittab file to use an alternate serial port.
Either issue an init q command or reboot.
For example, suppose you had the following in the /etc/inittab file (line breaks added for readability):
# on-board ports or on Challenge/Onyx MP machines, first IO4 board ports
t1:23:respawn:/sbin/suattr -C CAP_FOWNER,CAP_DEVICE_MGT,CAP_DAC_WRITE+ip -c "exec /sbin/getty ttyd1 console"    # alt console
t2:23:off:/sbin/suattr -C CAP_FOWNER,CAP_DEVICE_MGT,CAP_DAC_WRITE+ip -c "exec /sbin/getty -N ttyd2 co_9600"     # port 2
You could change it to the following:
# on-board ports or on Challenge/Onyx MP machines, first IO4 board ports
t1:23:off:/sbin/suattr -C CAP_FOWNER,CAP_DEVICE_MGT,CAP_DAC_WRITE+ip -c "exec /sbin/getty ttyd1 co_9600"        # port 1
t2:23:respawn:/sbin/suattr -C CAP_FOWNER,CAP_DEVICE_MGT,CAP_DAC_WRITE+ip -c "exec /sbin/getty -N ttyd2 console" # alt console
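After saving the edited /etc/inittab file, you can apply the change without rebooting by telling init to reread the file:

# init q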
Caution: Redirecting the console by using the above method works only when IRIX is running. To access the console when IRIX is not running (miniroot), you must physically reconnect the machine: unplug the serial hardware reset cable from the console/L1 port and then connect the console cable.
For more information, see “Origin 300, Origin 3200C, Onyx 300, and Onyx 3200C Console Support ” in Chapter 3.
This section discusses the procedure for using a single node in a two-node cluster in the following cases:
Only one node in a cluster is powered up after a power failure
One node in the cluster is down for an extended period for maintenance
The following procedure describes the steps required to use just one node in the cluster:
Create an emergency failover policy for each node. Each policy should look like the following example when displayed with the cmgr show failover_policy command, where ActiveNode is the name of the node using the policy (nodeA in the examples) and DownNode is the name of the nonfunctioning node (nodeB in the examples):
cmgr> show failover_policy emergency-ActiveNode Failover Policy: emergency-ActiveNode Version: 1 Script: ordered Attributes: Controlled_Failback InPlace_Recovery Initial AFD: ActiveNode |
For example, suppose you have two nodes, nodeA and nodeB. You would have two emergency failover policies:
cmgr> show failover_policy emergency-nodeA Failover Policy: emergency-nodeA Version: 1 Script: ordered Attributes: Controlled_Failback InPlace_Recovery Initial AFD: nodeA cmgr> show failover_policy emergency-nodeB Failover Policy: emergency-nodeB Version: 1 Script: ordered Attributes: Controlled_Failback InPlace_Recovery Initial AFD: nodeB |
For more information, see “Define a Failover Policy” in Chapter 6.
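For example, a hypothetical cmgr session for defining the emergency policy for nodeA might resemble the following sketch; the exact prompts and the full set of options are covered in "Define a Failover Policy" in Chapter 6:

cmgr> define failover_policy emergency-nodeA
Enter commands, when finished enter either "done" or "cancel"
failover_policy emergency-nodeA ? set attribute to Controlled_Failback
failover_policy emergency-nodeA ? set attribute to InPlace_Recovery
failover_policy emergency-nodeA ? set script to ordered
failover_policy emergency-nodeA ? set domain to nodeA
failover_policy emergency-nodeA ? done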
If a single node in a two-node cluster has just booted from a power failure and the other node is still powered off, the surviving node will form an active cluster. The resources will be in ONLINE READY state. They cannot move to ONLINE state because only half of the failover domain is active. The powered-off node will be in UNKNOWN state. At this point, you would want to apply the emergency policy, which contains only one node in the failover domain.
Note: If the nonfunctional node is in DOWN state (because it was reset by another node), then the resource groups will be in the ONLINE state rather than ONLINE READY state.
Change the state of all resource groups to offline. The last known state of these groups was online before the machines went down. This step tells the database to label the state of the resource groups appropriately in preparation for later steps. FailSafe will execute the new failover policy when the groups are made online.
Note: If the groups are already online on the surviving node (as they would be in a maintenance procedure), use the admin offline_detach command rather than the admin offline_force command, because you want to leave all resources running on that surviving node.
Use the following command:
admin offline_force resource_group RGname in cluster Clustername |
For example:
cmgr> set cluster test-cluster cmgr> show resource_groups in test-cluster Resource Groups: group1 group2 cmgr> admin offline_force resource_group group1 cmgr> admin offline_force resource_group group2 |
Modify each resource group to use the appropriate single-node emergency failover policy (the policy that contains the one node that is up). Use the following cmgr commands or the GUI:
modify resource_group RGname in cluster Clustername set failover_policy to emergency-ActiveNode |
For example, on nodeA:
cmgr> set cluster test-cluster cmgr> modify resource_group group1 Enter commands, when finished enter either "done" or "cancel" resource_group group1 ? set failover_policy to emergency-nodeA resource_group group1 ? done Successfully modified resource group group1 |
Mark the resource groups as online in the database. When HA services are started in future steps, the services will come online using the emergency failover policies.
admin online resource_group RGname in cluster Clustername |
For example:
cmgr> set cluster test-cluster cmgr> admin online resource_group group1 FailSafe daemon (ha_fsd) is not running on this local node or it is not ready to accept admin commands. Resource Group (group1) is online-ready. Failed to admin: online admin command failed cmgr> show status of resource_group group1 in cluster test-cluster State: Online Ready Error: No error Check resource group group1 status in an active node if HA services are active in cluster |
To resume using the down node, do the following:
Boot the down node. It will join the cluster and copy the cluster database from the other node.
Perform an offline_detach command on the resource groups in the cluster. This causes FailSafe to stop monitoring the resource group, but does not physically stop the processes on that group. FailSafe will report the status as offline and will not have any control over the group. The resources will remain in service.
Note: There are issues when performing an offline_detach operation with Auto_Recovery; see “Offline Detach Issues” in Chapter 3.
Use the following command:
admin offline_detach resource_group RGname [in cluster Clustername] |
cmgr> admin offline_detach resource_group group1 in cluster test-cluster |
Show the status of the resource groups to be sure that they now show as offline.
Note: The resources are still in service even though this command output shows them as offline.
show status of resource_group RGname in cluster Clustername |
For example:
cmgr> show status of resource_group group1 in cluster test-cluster |
Modify the resource groups to restore the original two-node failover policies they were using before the failure:
modify resource_group RGname in cluster Clustername set failover_policy to OriginalFailoverPolicy |
Note: This only restores the configuration for the static environment. The runtime environment will still be using the single-node policy at this time.
For example, if the normal failover policy was normal-fp :
cmgr> set cluster test-cluster cmgr> modify resource_group group1 Enter commands, when finished enter either "done" or "cancel" resource_group group1 ? set failover_policy to normal-fp resource_group group1 ? done Successfully modified resource group group1 cmgr> modify resource_group group2 Enter commands, when finished enter either "done" or "cancel" resource_group group2 ? set failover_policy to normal-fp resource_group group2 ? done Successfully modified resource group group2 |
Make the resource groups online in the cluster:
admin online resource_group RGname in cluster Clustername |
For example:
cmgr> admin online resource_group group1 in cluster test-cluster cmgr> admin online resource_group group2 in cluster test-cluster |
If you believe that the cluster is now healthy, you may want to move the resources back to their original nodes. (Because the original policies included the InPlace_Recovery attribute, all of the resources have remained on the node that has been active throughout this process.)
admin move resource_group RGname in cluster Clustername to node PrimaryOwner |
For example:
cmgr> admin move resource_group group1 in cluster test-cluster to node nodeB |
If you run into errors after entering the admin move command, see “Ensuring that Resource Groups are Deallocated” in Chapter 10.
This section describes several ways to monitor the status of the cluster, nodes, resources, and resource groups.
You can use the cluster_status command to monitor the cluster using a curses interface. For example, the following shows a two-node cluster configured for FailSafe only and cluster_status help text displayed:
# /var/cluster/cmgr-scripts/cluster_status * Cluster=nfs-cluster FailSafe=ACTIVE CXFS=Not Configured 08:45:12 Nodes = hans2 hans1 FailSafe = UP UP FailSafe HB =192.26.50.15 127.0.0.1 CXFS = ResourceGroup Owner State Error bartest-group Offline No error footest-group Offline No error bar_rg2 hans2 Online No error nfs-group1 hans2 Online No error foo_rg hans2 Online No error +-------+ cluster_status Help +--------+ | on s - Toggle Sound on event | | on r - Toggle Resource Group View | | on c - Toggle CXFS View | | j - Scroll up the selection | | k - Scroll down the selection | | TAB - Toggle RG or CXFS selection | | ENTER - View detail on selection | | h - Toggle help screen | | q - Quit cluster_status | +--- Press 'h' to remove help window --+ |
The above shows that a sound will be activated when a node or the cluster changes status. You can override the s setting by invoking cluster_status with the -m (mute) option. You can also use the arrow keys to scroll the selection.
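For example, to start the display with the sound muted:

# /var/cluster/cmgr-scripts/cluster_status -m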
Note: The cluster_status command can display no more than 128 CXFS filesystems.
The easiest way to keep a continuous watch on the state of a cluster is to use the GUI view area.
System components that are experiencing problems appear as blinking red icons. Components in transitional states also appear as blinking icons. If there is a problem in a resource group or node, the icon for the cluster turns red and blinks, as well as the resource group or node icon.
The cluster status can be one of the following:
ACTIVE, which means the cluster is up and running and there is a valid FailSafe membership.
INACTIVE, which means the start FailSafe HA services task has not been run and there is no FailSafe membership.
ERROR, which means that some nodes are in a DOWN state; that is, the cluster should be running, but it is not.
UNKNOWN, which means that the state cannot be determined because FailSafe HA services are not running on the node performing the query.
If you minimize the GUI window, the minimized icon shows the current state of the cluster. Green indicates that FailSafe HA services are active without an error, gray indicates that FailSafe HA services are inactive, and red indicates an error state.
The FailSafe Manager GUI uses a distinct icon for each of the following entities:

IRIX node
Cluster
Resource
Resource group
Resource type
Failover policy
Expanded tree
Collapsed tree
User name
GUI task for which execution privilege may be granted or revoked
Privileged command executed by a given GUI task

Icon color and animation indicate the component state, as follows:

Inactive or unknown (HA services may not be active)
Online-ready state for a resource group
Healthy and active or online
(blinking) Transitioning to healthy and active/online, or transitioning to offline
Maintenance mode, in which the resource is not monitored by FailSafe
(blinking red) Problems with the component
To query node and cluster status, use the following command:
show status of cluster Clustername |
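For example, to query the status of the cluster named test-cluster used elsewhere in this chapter:

cmgr> show status of cluster test-cluster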
You can use cmgr to query the status of a resource or to contact the system controller on a node, as described in the following subsections.
To query a resource status, use the following command:
show status of resource Resourcename of resource_type RTname [in cluster Clustername] |
If you have specified a default cluster, you do not need to specify a cluster when you use this command and it will show the status of the indicated resource in the default cluster.
This command displays the number of local monitoring failures, the monitor execution time parameters, and the maximum and minimum time taken to complete the monitoring script for the resource. For example, the following output shows that there have been no local monitoring failures:
cmgr> show status of resource 163.154.18.119 of resource_type IP_address in cluster nfs-cluster State: Online Error: None Owner: hans2 Flags: Resource is monitored locally Resource statistics Number of local monitoring failures: 0 Time of last local monitoring failure: Not applicable Total number of monitors 885 Maximum monitor execution time 998 Minimum monitor execution time 155 Last monitor execution time 222 Monitor timeout 40000 All times are in milliseconds |
To query the status of a resource group, you provide the name of the resource group and the cluster which includes the resource group. Resource group status includes the following components:
Resource group state
Resource group error state
Resource owner
These components are described in the following subsections.
If the node on which a resource group is online has a status of UNKNOWN, the status of the resource group will be reported as not available or as ONLINE-READY.
A resource group can be in one of several states, such as ONLINE, ONLINE-READY, ONLINE-MAINTENANCE, or OFFLINE.
When a resource group is ONLINE, its error status is continually being monitored. The error status indicates whether the group is running without error or has encountered a failure (for example, a monitoring or exclusivity failure).
You can use the view area to monitor the status of the resources in a FailSafe configuration:
Select View: Resources in Groups to see the resources organized by the groups to which they belong.
Select View: Groups owned by Nodes to see where the online groups are running. This view lets you observe failovers as they occur.
To query a resource group status, use the following cmgr command:
show status of resource_group RGname [in cluster Clustername] |
If you have specified a default cluster, you do not need to specify a cluster when you use this command and it will show the status of the indicated resource group in the default cluster.
To query the status of a node, you provide the logical node name of the node. The node status can be UP, DOWN, INACTIVE, or UNKNOWN.
When you start HA services, the node state transitions from INACTIVE to UP. The state may also pass briefly through UNKNOWN (that is, from INACTIVE to UNKNOWN to UP).
You can use the cluster_status command to monitor the status of the nodes in the cluster.
You can use the GUI view area to monitor the status of the clusters in a FailSafe configuration. Select View: Groups owned by Nodes to monitor the health of the default cluster, its resource groups, and the group's resources.
To query node status, use the following command:
show status of node nodename |
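For example, to query the status of the node hans1 shown in the earlier examples:

cmgr> show status of node hans1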
When FailSafe is running, you can determine whether the system controller on a node is responding with the following command:
admin ping node nodename |
This command uses the FailSafe daemons to test whether the system controller is responding.
You can verify reset connectivity on a node in a cluster even when the FailSafe daemons are not running by using the standalone option of the admin ping command:
admin ping standalone node nodename |
This command does not go through the FailSafe daemons, but calls the ping command directly to test whether the system controller on the indicated node is responding.
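For example, to test the system controller on a node named nodeB (a hypothetical name), first through the FailSafe daemons and then directly:

cmgr> admin ping node nodeB
cmgr> admin ping standalone node nodeB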
The haStatus script provides status and configuration information about clusters, nodes, resources, and resource groups in the configuration. This script is installed in the /var/cluster/cmgr-scripts directory. You can modify this script to suit your needs. See the haStatus man page for further information about this script.
The following examples show the output of the different options of the haStatus script.
# haStatus -help Usage: haStatus [-a|-i] [-c clustername] where, -a prints detailed cluster configuration information and cluster status. -i prints detailed cluster configuration information only. -c can be used to specify a cluster for which status is to be printed. “clustername” is the name of the cluster for which status is to be printed. # haStatus Tue Nov 30 14:12:09 PST 1999 Cluster test-cluster: Cluster state is ACTIVE. Node hans2: State of machine is UP. Node hans1: State of machine is UP. Resource_group nfs-group1: State: Online Error: No error Owner: hans1 Failover Policy: fp_h1_h2_ord_auto_auto Resources: /hafs1 (type: NFS) /hafs1/nfs/statmon (type: statd_unlimited) 150.166.41.95 (type: IP_address) /hafs1 (type: filesystem) havol1 (type: volume) # haStatus -i Tue Nov 30 14:13:52 PST 1999 Cluster test-cluster: Node hans2: Logical Machine Name: hans2 Hostname: hans2.dept.company.com Is FailSafe: true Is CXFS: false Nodeid: 32418 Reset type: powerCycle System Controller: msc System Controller status: enabled System Controller owner: hans1 System Controller owner device: /dev/ttyd2 System Controller owner type: tty ControlNet Ipaddr: 192.26.50.15 ControlNet HB: true ControlNet Control: true ControlNet Priority: 1 ControlNet Ipaddr: 150.166.41.61 ControlNet HB: true ControlNet Control: false ControlNet Priority: 2 Node hans1: Logical Machine Name: hans1 Hostname: hans1.dept.company.com Is FailSafe: true Is CXFS: false Nodeid: 32645 Reset type: powerCycle System Controller: msc System Controller status: enabled System Controller owner: hans2 System Controller owner device: /dev/ttyd2 System Controller owner type: tty ControlNet Ipaddr: 192.26.50.14 ControlNet HB: true ControlNet Control: true ControlNet Priority: 1 ControlNet Ipaddr: 150.166.41.60 ControlNet HB: true ControlNet Control: false ControlNet Priority: 2 Resource_group nfs-group1: Failover Policy: fp_h1_h2_ord_auto_auto Version: 1 Script: ordered Attributes: Auto_Failback Auto_Recovery Initial AFD: hans1 hans2 Resources: /hafs1 (type: NFS) /hafs1/nfs/statmon (type: statd_unlimited) 150.166.41.95 (type: IP_address) /hafs1 (type: filesystem) havol1 (type: volume) Resource /hafs1 (type NFS): export-info: rw,wsync filesystem: /hafs1 Resource dependencies statd_unlimited /hafs1/nfs/statmon filesystem /hafs1 Resource /hafs1/nfs/statmon (type statd_unlimited): InterfaceAddress: 150.166.41.95 Resource dependencies IP_address 150.166.41.95 filesystem /hafs1 Resource 150.166.41.95 (type IP_address): NetworkMask: 0xffffff00 interfaces: ef1 BroadcastAddress: 150.166.41.255 No resource dependencies Resource /hafs1 (type filesystem): volume-name: havol1 mount-options: rw,noauto monitor-level: 2 Resource dependencies volume havol1 Resource havol1 (type volume): devname-group: sys devname-owner: root devname-mode: 666 No resource dependencies Failover_policy fp_h1_h2_ord_auto_auto: Version: 1 Script: ordered Attributes: Auto_Failback Auto_Recovery Initial AFD: hans1 hans2 # haStatus -a Tue Nov 30 14:45:30 PST 1999 Cluster test-cluster: Cluster state is ACTIVE. Node hans2: State of machine is UP. 
Logical Machine Name: hans2 Hostname: hans2.dept.company.com Is FailSafe: true Is CXFS: false Nodeid: 32418 Reset type: powerCycle System Controller: msc System Controller status: enabled System Controller owner: hans1 System Controller owner device: /dev/ttyd2 System Controller owner type: tty ControlNet Ipaddr: 192.26.50.15 ControlNet HB: true ControlNet Control: true ControlNet Priority: 1 ControlNet Ipaddr: 150.166.41.61 ControlNet HB: true ControlNet Control: false ControlNet Priority: 2 Node hans1: State of machine is UP. Logical Machine Name: hans1 Hostname: hans1.dept.company.com Is FailSafe: true Is CXFS: false Nodeid: 32645 Reset type: powerCycle System Controller: msc System Controller status: enabled System Controller owner: hans2 System Controller owner device: /dev/ttyd2 System Controller owner type: tty ControlNet Ipaddr: 192.26.50.14 ControlNet HB: true ControlNet Control: true ControlNet Priority: 1 ControlNet Ipaddr: 150.166.41.60 ControlNet HB: true ControlNet Control: false ControlNet Priority: 2 Resource_group nfs-group1: State: Online Error: No error Owner: hans1 Failover Policy: fp_h1_h2_ord_auto_auto Version: 1 Script: ordered Attributes: Auto_Failback Auto_Recovery Initial AFD: hans1 hans2 Resources: /hafs1 (type: NFS) /hafs1/nfs/statmon (type: statd_unlimited) 150.166.41.95 (type: IP_address) /hafs1 (type: filesystem) havol1 (type: volume) Resource /hafs1 (type NFS): State: Online Error: None Owner: hans1 Flags: Resource is monitored locally export-info: rw,wsync filesystem: /hafs1 Resource dependencies statd_unlimited /hafs1/nfs/statmon filesystem /hafs1 Resource /hafs1/nfs/statmon (type statd_unlimited): State: Online Error: None Owner: hans1 Flags: Resource is monitored locally InterfaceAddress: 150.166.41.95 Resource dependencies IP_address 150.166.41.95 filesystem /hafs1 Resource 150.166.41.95 (type IP_address): State: Online Error: None Owner: hans1 Flags: Resource is monitored locally NetworkMask: 0xffffff00 interfaces: ef1 BroadcastAddress: 150.166.41.255 No resource dependencies Resource /hafs1 (type filesystem): State: Online Error: None Owner: hans1 Flags: Resource is monitored locally volume-name: havol1 mount-options: rw,noauto monitor-level: 2 Resource dependencies volume havol1 Resource havol1 (type volume): State: Online Error: None Owner: hans1 Flags: Resource is monitored locally devname-group: sys devname-owner: root devname-mode: 666 No resource dependencies # haStatus -c test-cluster Tue Nov 30 14:42:04 PST 1999 Cluster test-cluster: Cluster state is ACTIVE. Node hans2: State of machine is UP. Node hans1: State of machine is UP. Resource_group nfs-group1: State: Online Error: No error Owner: hans1 Failover Policy: fp_h1_h2_ord_auto_auto Resources: /hafs1 (type: NFS) /hafs1/nfs/statmon (type: statd_unlimited) 150.166.41.95 (type: IP_address) /hafs1 (type: filesystem) havol1 (type: volume) |
The Embedded Support Partner (ESP) consists of a set of daemons that perform various monitoring activities. You can choose to configure ESP so that it will log FailSafe events (the FailSafe ESP event profile is not configured in ESP by default).
FailSafe uses an event class ID of 77 and a description of IRIS FailSafe2.
If you want to use ESP for FailSafe, enter the following command to add the failsafe2 event profile to ESP:
# espconfig -add eventprofile failsafe2 |
FailSafe will then log ESP events for the following:
Daemon configuration error
Failover policy configuration error
Resource group allocation (start) failure
Resource group failures:
Allocation (start) failure
Release (stop) failure
Monitoring failure
Exclusivity failure
Failover policy failure
Resource group status:
online
offline
maintenance_on
maintenance_off
FailSafe shutdown (HA services stopped)
FailSafe started (HA services started)
You can use the espreport or launchESPartner commands to see the logged ESP events. See the esp man page and the Embedded Support Partner User Guide for more information about ESP.
While a FailSafe system is running, you can move a resource group online to a particular node, or you can take a resource group offline. In addition, you can move a resource group from one node in a cluster to another node in a cluster. The following subsections describe these tasks.
This section describes how to bring a resource group online.
Before you bring a resource group online for the first time, you should run the diagnostic tests on that resource group. Diagnostics check system configurations and perform some validations that are not performed when you bring a resource group online.
You cannot bring a resource group online in the following circumstances:
If the resource group has no members
If the resource group is currently running in the cluster
To bring a resource group fully online, HA services must be active. When HA services are active, an attempt is made to allocate the resource group in the cluster. However, you can also execute the command to bring the resource group online when HA services are not active; in that case, the resource group is marked to be brought online when HA services become active and is placed in ONLINE-READY state. FailSafe tries to bring a resource group in ONLINE-READY state online when HA services are started.
You can disable resource groups from coming online when HA services are started by using the GUI or cmgr to take the resource group offline, as described in “Take a Resource Group Offline”.
Caution: Before bringing a resource group online in the cluster, you must be sure that the resource group is not running on a disabled node (where HA services are not running). Bringing a resource group online while it is running on a disabled node could cause data corruption. For information on detached resource groups, see “Take a Resource Group Offline”.
Do the following:
Group to Bring Online: select the name of the resource group you want to bring online. The menu displays only resource groups that are not currently online.
Click on OK to complete the task.
To bring a resource group online, use the following command:
admin online resource_group RGname [in cluster Clustername] |
If you have specified a default cluster, you do not need to specify a cluster when you use this command.
cmgr> set cluster test-cluster cmgr> admin online resource_group group1 FailSafe daemon (ha_fsd) is not running on this local node or it is not ready to accept admin commands. Resource Group (group1) is online-ready. Failed to admin: online admin command failed cmgr> show status of resource_group group1 in cluster test-cluster State: Online Ready Error: No error Check resource group group1 status in an active node if HA services are active in cluster |
This section tells you how to take a resource group offline.
When you take a resource group offline, FailSafe takes each resource in the resource group offline in a predefined order. If any single resource gives an error during this process, the process stops, leaving all remaining resources allocated.
You can take a FailSafe resource group offline in any of the following ways:
Take the resource group offline. This physically stops the processes for that resource group and does not reset any error conditions. If this operation fails, the resource group will be left online in an error state.
Force the resource group offline. This physically stops the processes for that resource group but resets any error conditions. This operation cannot fail.
Detach the resource group. This causes FailSafe to stop monitoring the resource group, but does not physically stop the processes on that group. FailSafe will report the status as offline and will not have any control over the group. This operation should rarely fail.
Detach the resource group and force the error state to be cleared. This causes FailSafe to stop monitoring the resource group, but does not physically stop the processes on that group. FailSafe will report the status as offline and will not have any control over the group. In addition, all error conditions of the resource group will be reset. This operation should rarely fail.
If you do not need to stop the resource group and do not want FailSafe to monitor the resource group while you make changes, but you would still like to have administrative control over the resource group (for instance, to move that resource group to another node), you can put the resource group in maintenance mode using the Suspend Monitoring a Resource Group task on the GUI or the admin maintenance_on command of cmgr, as described in “ Suspend and Resume Monitoring of a Resource Group”.
If the fsd daemon is not running or is not ready to accept client requests, executing this task disables the resource group in the cluster database only. The resource group remains online and the command fails.
Enter the following:
Detach Only: check this box to stop monitoring the resource group. The resource group will not be stopped, but FailSafe will not have any control over the group.
Detach Force: check this box to stop monitoring the resource group. The resource group will not be stopped, but FailSafe will not have any control over the group. In addition, FailSafe will clear all errors.
Caution: The Detach Only and Detach Force settings leave the resource group's resources running on the node where the group was online. After stopping HA services on that node, do not bring the resource group online on another node in the cluster; doing so can cause data integrity problems. Instead, make sure that no resources are running on a node before stopping HA services on that node.
Force Offline: check this box to stop all resources in the group and clear all errors.
Group to Take Offline: select the name of the resource group you want to take offline. The menu displays only resource groups that are currently online.
Click on OK to complete the task.
To take a resource group offline, use the following command:
admin offline resource_group RGname [in cluster Clustername] |
To take a resource group offline with the force option in effect, forcing FailSafe to complete the action even if there are errors, use the following command:
admin offline_force resource_group RGname [in cluster Clustername] |
Note: An offline_force operation on a resource group can leave resources in the resource group running on the cluster. The operation will succeed even if not all resources in the resource group have been stopped, and FailSafe no longer tracks these resources. Take care to prevent resources from running on multiple nodes in the cluster.
To detach a resource group, use the following command:
admin offline_detach resource_group RGname [in cluster Clustername] |
To detach the resource group and force the error state to be cleared:
admin offline_detach_force resource_group RGname [in cluster Clustername] |
This causes FailSafe to stop monitoring the resource group, but does not physically stop the processes on that group. FailSafe will report the status as offline and will not have any control over the group. In addition, all error conditions of the resource group will be reset. This operation should rarely fail.
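For example, the following sketch uses the example group and cluster names from earlier in this chapter (substitute your own names):

cmgr> set cluster test-cluster
cmgr> admin offline resource_group group1
cmgr> admin offline_detach_force resource_group group2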
This section tells you how to move a resource group.
While FailSafe is active, you can move a resource group to another node in the same cluster.
Note: When you move a resource group in an active system, the command may appear to succeed even though the resource group remains online on the same node in the cluster. This can occur if the resource group fails to start on the node to which you are moving it. In that case, FailSafe fails the resource group over to the next node in the application failover domain, which may be the node on which it was originally running. Because FailSafe kept the resource group online, the command succeeds.
Enter the following:
Group to Move: select the name of the resource group to be moved. Only resource groups that are currently online are displayed in the menu.
Failover Domain Node: (optional) select the name of the node to which you want to move the resource group. If you do not specify a node, FailSafe will move the resource group to the next available node in the failover domain.
Click on OK to complete the task.
To move a resource group to another node, use the following command:
admin move resource_group RGname [in cluster Clustername] [to node Nodename] |
For example, to move resource group nfs-group1 running on node primary to node backup in the cluster nfs-cluster, do the following:
cmgr> admin move resource_group nfs-group1 in cluster nfs-cluster to node backup |
If you do not specify a node, the resource group's failover policy is used to determine the destination node for the resource group.
If you run into errors after entering the admin move command, see “Ensuring that Resource Groups are Deallocated” in Chapter 10.
This section describes how to stop monitoring of a resource group in order to put it into maintenance mode.
You can temporarily stop FailSafe from monitoring a specific resource group, which puts the resource group in maintenance mode. The resource group remains on the same node in the cluster but is no longer monitored by FailSafe for resource failures.
You can put a resource group into maintenance mode if you do not want FailSafe to monitor the group for a period of time. You may want to do this for upgrade or testing purposes, or if there is any reason that FailSafe should not act on that resource group. When a resource group is in maintenance mode, it is not being monitored and it is not highly available. If the resource group's owner node fails, FailSafe will move the resource group to another node and resume monitoring.
When you put a resource group into maintenance mode, resources in the resource group are in ONLINE-MAINTENANCE state. The ONLINE-MAINTENANCE state for the resource is seen only on the node that has the resource online. All other nodes will show the resource as ONLINE. The resource group, however, should appear as being in ONLINE-MAINTENANCE state in all nodes.
Do the following:
Group to Stop Monitoring: select the name of the group you want to stop monitoring. Only those resource groups that are currently online and monitored are displayed in the menu.
Click OK to complete the task.
This task lets you resume monitoring a resource group.
Once monitoring is resumed and assuming that the restart action is enabled, if the resource group or one of its resources fails, FailSafe will restart each failed component based on the failover policy.
Perform the following steps:
Group to Start Monitoring: select the name of the group you want to start monitoring. Only those resource groups that are currently online and not monitored are displayed in the menu.
Click OK to complete the task.
To put a resource group into maintenance mode, use the following command:
admin maintenance_on resource_group RGname [in cluster Clustername] |
If you have specified a default cluster, you do not need to specify a cluster when you use this command.
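For example, assuming the default cluster has been set, the following sketch places group1 into maintenance mode and later resumes monitoring; the corresponding resume command is admin maintenance_off:

cmgr> set cluster test-cluster
cmgr> admin maintenance_on resource_group group1
cmgr> admin maintenance_off resource_group group1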
You can stop the execution of FailSafe on all the nodes in a cluster or on a specified node only. See “Stop FailSafe HA Services” in Chapter 6.
You can use FailSafe to reset nodes in a cluster. This sends a reset command to the system controller port on the specified node. When the node is reset, other nodes in the cluster will detect this and remove the node from the active cluster, reallocating any resource groups that were allocated on that node onto a backup node. The backup node that is used depends on how you have configured your system.
After the node reboots, it will rejoin the cluster. Some resource groups might move back to the node, depending on how you have configured your system.
You can use the GUI to reset nodes in a cluster. This sends a reset command to the system controller port on the specified node. When the node is reset, other nodes in the cluster will detect the change and remove the node from the active cluster. When the node reboots, it will rejoin the FailSafe membership.
To reset a node, do the following:
Node to Reset: select the node to be reset.
Click on OK to complete the task.
When FailSafe is running, you can reset a node with the following command:
admin reset node nodename |
This command uses the FailSafe daemons to reset the specified node.
You can reset a node in a cluster even when the FailSafe daemons are not running by using the standalone option:
admin reset standalone node nodename |
The command above does not use the crsd daemon.
If you have defined the node but have not defined system controller information for it, you can use the following command line:
admin reset dev_name nodename of dev_type tty with sysctrl_type msc|mmsc|l2|l1 |
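For example, to reset a node named nodeB (a hypothetical name) through the FailSafe daemons, or directly when the daemons are not running:

cmgr> admin reset node nodeB
cmgr> admin reset standalone node nodeB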
When FailSafe is running, you can perform a power cycle on a node with the following command:
admin powerCycle node nodename |
This command uses the FailSafe daemons to powercycle the specified node.
You can powercycle a node in a cluster even when the FailSafe daemons are not running by using the standalone option:
admin powerCycle standalone node nodename |
This command does not go through the crsd daemon.
If the node has not been defined in the cluster database, you can use the following command line:
admin powerCycle dev_name nodename of dev_type tty with sysctrl_type msc|mmsc|l2|l1 |
When FailSafe is running, you can perform a nonmaskable interrupt (NMI) on a node with the following command:
admin nmi node nodename |
This command uses the FailSafe daemons to perform an NMI on the specified node.
You can perform an NMI on a node in a cluster even when the FailSafe daemons are not running by using the standalone option:
admin nmi standalone node nodename |
The above command does not use the crsd daemon.
If the node has not been defined in the cluster database, you can use the following command line:
admin nmi dev_name nodename of dev_type tty with sysctrl_type msc|mmsc|l2|l1 |
This section discusses several methods for backing up and restoring the cluster database.
If the database has been accidentally deleted from an individual node, you can replace it with a copy from another node. Do not use this method if the cluster database has been corrupted.
Do the following:
Stop the HA services and (if running) CXFS services.
Stop the cluster daemons by running the following command on each node:
# /etc/init.d/cluster stop |
Run cdbreinit on nodes that are missing the cluster database. Verify that cluster daemons are running.
Restart HA services and (if needed) CXFS services.
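A minimal sketch of this procedure on the node that is missing the database (assuming HA and CXFS services have already been stopped on it) might be:

# /etc/init.d/cluster stop
# /usr/cluster/bin/cdbreinit

Because cdbreinit restarts the cluster daemons, verify that they are running before restarting HA services.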
You can use the build_cmgr_script command from one node in the cluster to create a cmgr script that will recreate the node, cluster, switch, and filesystem definitions for all nodes in the cluster database. You can then later run the resulting script to recreate a database with the same contents; this method can be used for missing or corrupted cluster databases.
Note: The build_cmgr_script command does not recreate node-specific information for resources and resource types, or local logging information, because the cluster database does not replicate node-specific information. Therefore, if you reinitialize the cluster database, you will lose node-specific information. Because the generated script does not contain local logging information, it cannot be used as a complete backup/restore tool.
To perform a database backup, use the build_cmgr_script script from one node in the cluster, as described in “Creating a cmgr Script Automatically” in Chapter 5.
Caution: Do not make configuration changes while you are using the build_cmgr_script command.
By default, this creates a cmgr script in the following location:
/var/cluster/ha/tmp/cmgr_create_cluster_clustername_processID |
You can specify another filename by using the -o option.
To perform a restore on all nodes in the pool, do the following:
Stop HA services for all nodes in the cluster.
Stop the cluster database daemons on each node.
Remove all copies of the old database by using the cdbreinit command on each node.
Execute the cmgr script (which was generated by the build_cmgr_script script) on the node that is defined first in the script. This will recreate the backed-up database on each node.
Note: If you want to run the generated script on a different node, you must modify the generated script so that the node is the first one listed in the script.
Restart cluster database daemons on each node.
For example, to backup the current database, clear the database, and restore the database to all nodes, do the following:
On one node in the cluster: # /var/cluster/cmgr-scripts/build_cmgr_script -o /tmp/newcdb Building cmgr script for cluster clusterA ... build_cmgr_script: Generated cmgr script is /tmp/newcdb On one node: # stop ha_services for cluster clusterA On each node: # /etc/init.d/cluster stop On each node: # /usr/cluster/bin/cdbreinit On each node: # /etc/init.d/cluster start On the *first* node listed in the /tmp/newcdb script: # /tmp/newcdb |
The cdbBackup and cdbRestore commands backup and restore the cluster database and node-specific information, such as local logging information. You must run these commands individually for each node.
To perform a backup of the cluster, use the cdbBackup command on each node.
Caution: Do not make configuration changes while you are using the cdbBackup command.
To perform a restore, run the cdbRestore command on each node. You can use this method for either a missing or corrupted cluster database. Do the following:
Stop HA services.
Stop cluster services on each node.
Remove the old database by using the cdbreinit command on each node.
Stop cluster services again (these were restarted automatically by cdbreinit in the previous step) on each node.
Use the cdbRestore command on each node.
Start cluster services on each node.
For example, to backup the current database, clear the database, and then restore the database to all nodes, do the following:
On each node: # /usr/cluster/bin/cdbBackup On one node in the cluster: # stop ha_services for cluster clusterA On each node: # /etc/init.d/cluster stop On each node: # /usr/cluster/bin/cdbreinit On each node (again): # /etc/init.d/cluster stop On each node: # /usr/cluster/bin/cdbRestore On each node: # /etc/init.d/cluster start |
For more information, see the cdbBackup and cdbRestore man page.
Note: Do not perform a cdbDump while information is changing in the cluster database. Check the SYSLOG file for information to help determine when cluster database activity is occurring. As a rule of thumb, you should be able to perform a cdbDump if at least 15 minutes have passed since the last node joined the cluster or the last administration command was run.
To perform an XFS filesystem dump and restore, you must do the following:
Perform a backup of the cluster database using cdbBackup.
Perform the XFS filesystem dump with xfsdump.
Perform the XFS filesystem restore with xfsrestore.
Remove the existing cluster database:
# rm /var/cluster/cdb |
Restore the backed-up database by using cdbRestore.
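A minimal sketch of this sequence follows; the xfsdump and xfsrestore arguments shown (a level-0 dump of / to /dev/tape) are placeholders only, so substitute your own filesystem and media:

# /usr/cluster/bin/cdbBackup
# xfsdump -l 0 -f /dev/tape /
# xfsrestore -f /dev/tape /
# rm /var/cluster/cdb
# /usr/cluster/bin/cdbRestore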
This section discusses FailSafe log files and how to rotate them.
For information about log levels, see “Set Log Configuration” in Chapter 6.
You can run the /var/cluster/cmgr-scripts/rotatelogs script to copy all log files to a new location. This script saves log files with the day and the month name as a suffix. If you run the script twice in one day, it appends the current log file to the previously saved copy. The root crontab file has an entry to run this script weekly.
The script syntax is as follows:
/var/cluster/cmgr-scripts/rotatelogs [-h] [-d|-u] |
If no option is specified, the log files will be rotated. Options are as follows:
-h | Prints the help message. The log files are not rotated and other options are ignored. |
-d | Deletes saved log files that are older than one week before rotating the current log files. You cannot specify this option and -u. |
-u | Unconditionally deletes all saved log files before rotating the current log files. You cannot specify this option and -d . |
By default, the rotatelogs script will be run by crontab once a week, which is sufficient if you use the default log levels. If you plan to run with a high debug level for several weeks, you should reset the crontab entry so that the rotatelogs script is run more often.
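For example, a hypothetical root crontab entry that runs the script nightly at 2:00 a.m. instead of weekly (the schedule shown is illustrative only):

0 2 * * * /var/cluster/cmgr-scripts/rotatelogs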
On heavily loaded machines, or for very large log files, you may want to move resource groups and stop HA services before running rotatelogs.
You can use a script such as the following to copy large files to a new location. The files in the new location will be overwritten each time this script is run.
#!/bin/sh
# Argument is maximum size of a log file (in characters) - default: 500000
size=${1:-500000}
find /var/cluster/ha/log -type f ! -name '*.OLD' -size +${size}c -print | while read log_file; do
    cp ${log_file} ${log_file}.OLD
    echo '*** LOG FILE ROTATION ' `date` '***' > ${log_file}
done
The GUI lets you grant or revoke access to a specific GUI task for one or more specific users. By default, only root may execute tasks in the GUI. You cannot grant or revoke tasks for users with a user ID of 0.
Note: To maintain security, the root user must have a password. If root does not have a password, then all users can get access to any task.
Access to the task is allowed only on the node to which the GUI is connected; if you want to allow access on another node in the pool, you must connect the GUI to that node and grant access again.
GUI tasks and the cmgr command operate by executing underlying privileged commands which are normally accessible only to root. When granting access to a task, you are in effect granting access to all of its required underlying commands, which results in also granting access to the other GUI tasks that use the same underlying commands.
To see which tasks a specific user can currently access, select View: Users. Select a specific user to see details about the tasks available to that user.
To see which users can currently access a specific task, select View: Task Privileges. Select a specific task to see details about the users who can access it and the privileged commands it requires.
You can grant access to a specific task to one or more users at a time.
Note: Access to the task is only allowed on the node to which the GUI is connected; if you want to allow access on another node in the pool, you must connect the GUI to that node and grant access again.
Do the following:
Select the user or users for whom you want to grant access. You can use the following methods to select users:
Click to select one user at a time
Shift+click to select a block of users
Ctrl+click to toggle the selection of any one user, which allows you to select multiple users that are not contiguous
Click Select All to select all users
Click Next to move to the next page.
Select the task or tasks to grant access to, using the above selection methods. Click Next to move to the next page.
Confirm your choices by clicking OK.
Note: If more tasks than you selected are shown, then the selected tasks run the same underlying privileged commands as other tasks, such that access to the tasks you specified cannot be granted without also granting access to these additional tasks.
To see which tasks a specific user can access, select View: Users. Select a specific user to see details about the tasks available to that user.
To see which users can access a specific task, select View: Task Privileges. Select a specific task to see details about the users who can access it and the privileged commands it requires.
Suppose you wanted to grant user guest permission to define clusterwide and node-specific resources. You would do the following:
Select guest and click Next to move to the next page.
Select the tasks you want guest to be able to execute:
Click Define a Resource
Ctrl+click Redefine a Resource for a Specific Node
Ctrl+click Add or Remove Resources in Resource Group
Click Next to move to the next page.
Confirm your choices by clicking OK.
Figure 8-1 shows the privileged commands that were granted to user guest.
Note: Modify a Resource Definition is also displayed, even though the administrator did not explicitly select it; the privileged commands required by the selected tasks also grant access to this task.
Figure 8-2 shows the screen that is displayed when you select View: Users and click guest to display information in the details area of the GUI window. The privileged commands listed are the underlying commands executed by the GUI tasks.
Suppose you wanted to give user sys access to all tasks except adding or removing nodes from a cluster. The easiest way to do this is to select all of the tasks and then deselect the one you want to restrict. You would do the following:
Select sys and click Next to move to the next page.
Select the tasks you want sys to be able to execute:
Click Select All to highlight all tasks.
Deselect the task to which you want to restrict access. Ctrl+click Add/Remove Nodes in Cluster .
Click Next to move to the next page.
Confirm your choices by clicking OK.
You can revoke task access from one or more users at a time.
Note: Access to the task is only revoked on the node to which the GUI is connected; if a user has access to the task on multiple nodes in the pool, you must connect the GUI to those other nodes and revoke access again.
Do the following:
Select the user or users from whom you want to revoke task access. You can use the following methods to select users:
Click to select one user at a time
Shift+click to select a block of users
Ctrl+click to toggle the selection of any one user, which allows you to select multiple users that are not contiguous
Click Select All to select all users
Click Next to move to the next page.
Select the task or tasks to revoke access to, using the above selection methods. Click Next to move to the next page.
Confirm your choices by clicking OK.
Note: If more tasks than you selected are shown, then the selected tasks run the same underlying privileged commands as other tasks, such that access to the tasks you specified cannot be revoked without also revoking access to these additional tasks.
To see which tasks a specific user can access, select View: Users. Select a specific user to see details about the tasks available to that user.
To see which users can access a specific task, select View: Task Privileges. Select a specific task to see details about the users who can access it.
The ChecksumVersion variable is required for clusters running IRIX 6.5.22 or later. A cluster without this variable uses the old checksum behavior; the variable is not added to the cluster database automatically, so all clusters created prior to IRIX 6.5.22 must have it set manually.
If your cluster was created before IRIX 6.5.22, you must run the following command to add the ChecksumVersion variable to the cluster database and set it to the correct value. You must do this after all nodes in the cluster have been upgraded to IRIX 6.5.22 and are running normally.
For example, if the name of the cluster is gps, run the following cdbutil commands on one node in the cluster:
Note: Command arguments are case-sensitive.
# /usr/cluster/bin/cdbutil -i cdbutil> node #cluster#gps#ClusterAdmin cdbutil> create ChecksumVersion cdbutil> setvalue ChecksumVersion 1 cdbutil> quit |
Clusters with ChecksumVersion set to 1 must have all nodes running IRIX 6.5.22 or later. Do not downgrade nodes or add pre-6.5.22 nodes to the cluster without first setting ChecksumVersion to 0 (otherwise, the nodes will fail to form a membership). After you have upgraded all nodes to IRIX 6.5.22 or later, set ChecksumVersion to 1 by running the following commands on one node in the cluster:
# /usr/cluster/bin/cdbutil -i cdbutil> node #cluster#gps#ClusterAdmin cdbutil> setvalue ChecksumVersion 1 cdbutil> quit |
Note: The create step is not needed here because the variable should already be in the cluster database at this point.