Chapter 8. FailSafe System Operation

This chapter describes administrative tasks you perform to operate and monitor a FailSafe system. It describes how to perform tasks using the FailSafe Manager GUI and the cmgr command. The major sections in this chapter are as follows:

  • Redirecting the Console for Origin 300, Origin 3200C, Onyx 300, and Onyx 3200C

  • Two-Node Clusters: Single-Node Use

  • System Status

  • Embedded Support Partner (ESP) Logging of FailSafe Events

  • Resource Group Failover

  • Suspend and Resume Monitoring of a Resource Group

  • Stopping FailSafe

  • Resetting Nodes

  • Cluster Database Backup and Restore

  • Filesystem Dump and Restore

  • Rotating Log Files

  • Granting Task Execution Privileges to Users

  • Updating the Checksum Version for 6.5.21 and Earlier Clusters

Note: SGI recommends that you perform all FailSafe administration from one node in the pool so that the latest copy of the database will be available, even when there are network partitions.


Redirecting the Console for Origin 300, Origin 3200C, Onyx 300, and Onyx 3200C

Use the following procedure to redirect the console. This is required in order to access console input and output on systems that have only one serial/USB port, which provides both L1 system controller and console support:

  1. Edit the /etc/inittab file to use an alternate serial port.

  2. Either issue an init q command or reboot.

For example, suppose you had the following in the /etc/inittab file (line breaks added for readability):

# on-board ports or on Challenge/Onyx MP machines, first IO4 board ports
t1:23:respawn:/sbin/suattr -C CAP_FOWNER,CAP_DEVICE_MGT,CAP_DAC_WRITE+ip
-c "exec /sbin/getty ttyd1 console"    # alt console
t2:23:off:/sbin/suattr -C CAP_FOWNER,CAP_DEVICE_MGT,CAP_DAC_WRITE+ip 
-c "exec /sbin/getty -N ttyd2 co_9600"     # port 2   

You could change it to the following:

# on-board ports or on Challenge/Onyx MP machines, first IO4 board ports
t1:23:off:/sbin/suattr -C CAP_FOWNER,CAP_DEVICE_MGT,CAP_DAC_WRITE+ip
-c "exec /sbin/getty ttyd1 co_9600"        # port 1
t2:23:respawn:/sbin/suattr -C CAP_FOWNER,CAP_DEVICE_MGT,CAP_DAC_WRITE+ip
-c "exec /sbin/getty -N ttyd2 console" # alt console


Caution: Redirecting the console by using the above method works only when IRIX is running. To access the console when IRIX is not running (miniroot), you must physically reconnect the machine: unplug the serial hardware reset cable from the console/L1 port and then connect the console cable.

For more information, see “Origin 300, Origin 3200C, Onyx 300, and Onyx 3200C Console Support” in Chapter 3.

Two-Node Clusters: Single-Node Use

This section discusses the procedure for using a single node in a two-node cluster in the following cases:

  • Only one node in a cluster is powered up after a power failure

  • One node in the cluster is down for an extended period for maintenance

Using a Single Node

The following procedure describes the steps required to use just one node in the cluster:

  1. Create an emergency failover policy for each node. When displayed with cmgr, each policy should look like the following example, where ActiveNode is the name of the node using the policy (nodeA in the examples) and DownNode is the name of the nonfunctioning node (nodeB in the examples):

    cmgr> show failover_policy emergency-ActiveNode
    
    Failover Policy: emergency-ActiveNode
    Version: 1
    Script: ordered
    Attributes: Controlled_Failback InPlace_Recovery 
    Initial AFD: ActiveNode

    For example, suppose you have two nodes, nodeA and nodeB. You would have two emergency failover policies:

    cmgr> show failover_policy emergency-nodeA
    
    Failover Policy: emergency-nodeA
    Version: 1
    Script: ordered
    Attributes: Controlled_Failback InPlace_Recovery 
    Initial AFD: nodeA
    
    cmgr> show failover_policy emergency-nodeB
    
    Failover Policy: emergency-nodeB
    Version: 1
    Script: ordered
    Attributes: Controlled_Failback InPlace_Recovery 
    Initial AFD: nodeB

    For more information, see “Define a Failover Policy” in Chapter 6. (A sketch of a cmgr session that defines such a policy appears after this procedure.)

    If a single node in a two-node cluster has just booted from a power failure and the other node is still powered off, the surviving node will form an active cluster. The resources will be in ONLINE READY state. They cannot move to ONLINE state because only half of the failover domain is active. The powered-off node will be in UNKNOWN state. At this point, you would want to apply the emergency policy, which contains only one node in the failover domain.


    Note: If the nonfunctional node is in DOWN state (because it was reset by another node), then the resource groups will be in the ONLINE state rather than ONLINE READY state.


  2. Change the state of all resource groups to offline. The last known state of these groups was online before the machines went down. This step tells the database to label the state of the resource groups appropriately in preparation for later steps. FailSafe will execute the new failover policy when the groups are made online.


    Note: If the groups are already online on the surviving node (such as they would be in a maintenance procedure), you should use the admin offline_detach command rather than the admin offline_force command because the desire is to leave all resources running on that surviving node.

    Use the following command:

    admin offline_force resource_group RGname in cluster Clustername

    For example:

    cmgr> set cluster test-cluster
    cmgr> show resource_groups in test-cluster
    
    Resource Groups: 
            group1
            group2
    
    cmgr> admin offline_force resource_group group1
    cmgr> admin offline_force resource_group group2

  3. Modify each resource group to use the appropriate single-node emergency failover policy (the policy that contains the one node that is up). Use the following cmgr commands or the GUI:

    modify resource_group RGname in cluster Clustername
    set failover_policy to emergency-ActiveNode

    For example, on nodeA:

    cmgr> set cluster test-cluster
    cmgr> modify resource_group group1
    Enter commands, when finished enter either "done" or "cancel"
    
    resource_group group1 ? set failover_policy to emergency-nodeA
    resource_group group1 ? done
    Successfully modified resource group group1

  4. Mark the resource groups as online in the database. When HA services are started in future steps, the services will come online using the emergency failover policies.

    admin online resource_group RGname in cluster Clustername

    For example:

    cmgr> set cluster test-cluster
    cmgr> admin online resource_group group1
    FailSafe daemon (ha_fsd) is not running on this local node or it is not ready to accept admin commands.
    Resource Group (group1) is online-ready.
    
    Failed to admin:
            online
    
    admin command failed
    
    cmgr> show status of resource_group group1 in cluster test-cluster
    
    State: Online Ready
    Error: No error
    Check resource group group1 status in an active node if HA services are active in cluster
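
The emergency failover policies used in step 1 must already exist before they can be assigned to a resource group. The following sketch shows one possible cmgr session for creating the emergency-nodeA policy shown above; verify the exact prompts and subcommands against “Define a Failover Policy” in Chapter 6:

cmgr> define failover_policy emergency-nodeA
Enter commands, when finished enter either "done" or "cancel"

failover_policy emergency-nodeA ? set attribute to Controlled_Failback
failover_policy emergency-nodeA ? set attribute to InPlace_Recovery
failover_policy emergency-nodeA ? set script to ordered
failover_policy emergency-nodeA ? set domain to nodeA
failover_policy emergency-nodeA ? done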

Resuming Two-Node Use

To resume using the down node, do the following:

  1. Boot the down node. It will join the cluster and copy the cluster database from the other node.

  2. Perform an offline_detach command on the resource groups in the cluster. This causes FailSafe to stop monitoring the resource group, but does not physically stop the processes on that group. FailSafe will report the status as offline and will not have any control over the group. The resources will remain in service.


    Note: There are issues when performing an offline_detach operation with Auto_Recovery; see “Offline Detach Issues” in Chapter 3.


    Use the following command:

    admin offline_detach resource_group RGname [in cluster Clustername]

    For example:

    cmgr> admin offline_detach resource_group group1 in cluster test-cluster

    Show the status of the resource groups to be sure that they now show as offline.


    Note: The resources are still in service even though this command output shows them as offline.


    show status of resource_group RGname in cluster Clustername

    For example:

    cmgr> show status of resource_group group1 in cluster test-cluster

  3. Modify the resource groups to restore the original two-node failover policies they were using before the failure:

    modify resource_group RGname in cluster Clustername
    set failover_policy to OriginalFailoverPolicy


    Note: This only restores the configuration for the static environment. The runtime environment will still be using the single-node policy at this time.

    For example, if the normal failover policy was normal-fp :

    cmgr> set cluster test-cluster
    cmgr> modify resource_group group1
    Enter commands, when finished enter either "done" or "cancel"
    
    resource_group group1 ? set failover_policy to normal-fp
    resource_group group1 ? done
    Successfully modified resource group group1
    
    cmgr> modify resource_group group2
    Enter commands, when finished enter either "done" or "cancel"
    
    resource_group group2 ? set failover_policy to normal-fp
    resource_group group2 ? done
    Successfully modified resource group group2

  4. Make the resource groups online in the cluster:

    admin online resource_group RGname in cluster Clustername

    For example:

    cmgr> admin online resource_group group1 in cluster test-cluster
    cmgr> admin online resource_group group2 in cluster test-cluster

  5. It may be desirable to move the resources back to their original nodes if you believe that the cluster is now healthy. (Because the original policies included the InPlace_Recovery attribute, all of the resources have remained on the node that has been active throughout this process.)

    admin move resource_group RGname in cluster Clustername to node PrimaryOwner

    For example:

    cmgr> admin move resource_group group1 in cluster test-cluster to node nodeB

If you run into errors after entering the admin move command, see “Ensuring that Resource Groups are Deallocated” in Chapter 10.

System Status

This section describes the following:

  • Monitoring System Status with cluster_status

  • Monitoring System Status with the GUI

  • Querying Cluster Status with cmgr

  • Monitoring Resource and Reset Serial Line with cmgr

  • Resource Group Status

  • Node Status

  • Viewing System Status with the haStatus Script

Monitoring System Status with cluster_status

You can use the cluster_status command to monitor the cluster using a curses interface. For example, the following shows a two-node cluster configured for FailSafe only, with the cluster_status help text displayed:

# /var/cluster/cmgr-scripts/cluster_status
* Cluster=nfs-cluster  FailSafe=ACTIVE  CXFS=Not Configured         08:45:12 
      Nodes =       hans2        hans1
   FailSafe =          UP           UP
FailSafe HB =192.26.50.15    127.0.0.1
       CXFS =
       ResourceGroup           Owner           State           Error         

       bartest-group                         Offline        No error
       footest-group                         Offline        No error
             bar_rg2           hans2          Online        No error
          nfs-group1           hans2          Online        No error
              foo_rg           hans2          Online        No error


                 +-------+ cluster_status Help +--------+
                 |  on s  - Toggle Sound on event       |
                 |  on r  - Toggle Resource Group View  |
                 |  on c  - Toggle CXFS View            |
                 |     j  - Scroll up the selection     |
                 |     k  - Scroll down the selection   |
                 |   TAB  - Toggle RG or CXFS selection |
                 | ENTER  - View detail on selection    |
                 |     h  - Toggle help screen          |
                 |     q  - Quit cluster_status         |
                 +--- Press 'h' to remove help window --+


The above shows that a sound will be activated when a node or the cluster changes status. You can override the s setting by invoking cluster_status with the -m (mute) option. You can also use the arrow keys to scroll the selection.


Note: The cluster_status command can display no more than 128 CXFS filesystems.


Monitoring System Status with the GUI

The easiest way to keep a continuous watch on the state of a cluster is to use the GUI view area.

System components that are experiencing problems appear as blinking red icons. Components in transitional states also appear as blinking icons. If there is a problem in a resource group or node, the icon for the cluster turns red and blinks, as well as the resource group or node icon.

The cluster status can be one of the following:

  • ACTIVE, which means the cluster is up and running and there is a valid FailSafe membership.

  • INACTIVE, which means the start FailSafe HA services task has not been run and there is no FailSafe membership.

  • ERROR, which means that some nodes are in a DOWN state; that is, the cluster should be running, but it is not.

  • UNKNOWN, which means that the state cannot be determined because FailSafe HA services are not running on the node performing the query.

If you minimize the GUI window, the minimized icon shows the current state of the cluster. Green indicates FailSafe HA services active without an error, gray indicates FailSafe HA services are inactive, and red indicates an error state.

Key to Icons and States

The following tables show keys to the icons and states used in the FailSafe Manager GUI.

Table 8-1. Key to Icons

The GUI uses a distinct icon for each of the following entities:

  • IRIX node

  • Cluster

  • Resource

  • Resource group

  • Resource type

  • Failover policy

  • Expanded tree

  • Collapsed tree

  • User name

  • GUI task for which execution privilege may be granted or revoked

  • Privileged command executed by a given GUI task


The full legend for component states is as follows:

Table 8-2. Key to States

  • Inactive or unknown (HA services may not be active)

  • Online-ready state for a resource group

  • Healthy and active or online

  • (blinking) Transitioning to healthy and active/online, or transitioning to offline

  • Maintenance mode, in which the resource is not monitored by FailSafe

  • (blinking red) Problems with the component


Querying Cluster Status with cmgr

To query node and cluster status, use the following command:

show status of cluster Clustername
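
For example, using the cluster name from other examples in this chapter:

cmgr> show status of cluster test-cluster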

Monitoring Resource and Reset Serial Line with cmgr

You can use cmgr to query the status of a resource or to contact the system controller on a node, as described in the following subsections.

Querying Resource Status with cmgr

To query a resource status, use the following command:

show status of resource Resourcename of resource_type RTname [in cluster Clustername]

If you have specified a default cluster, you do not need to specify a cluster when you use this command; it will show the status of the indicated resource in the default cluster.

This command displays the number of local monitoring failures, the monitor execution time parameters, and the maximum and minimum time taken to complete the monitoring script for the resource. For example, the following output shows that there have been no local monitoring failures:

cmgr> show status of resource 163.154.18.119 of resource_type IP_address in cluster nfs-cluster

 State: Online
 Error: None
 Owner: hans2
 Flags: Resource is monitored locally
 Resource statistics
         Number of local monitoring failures: 0 
         Time of last local monitoring failure: Not applicable 
         Total number of monitors 885 
         Maximum monitor execution time 998 
         Minimum monitor execution time 155 
         Last monitor execution time 222 
         Monitor timeout 40000 
         All times are in milliseconds

Performing a ping of a System Controller with cmgr

To perform a ping operation on a system controller by providing the device name, use the following command:

admin ping dev_name devicename of dev_type tty with sysctrl_type SystemControllerType
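
For example, to ping an MSC system controller attached through the serial port device /dev/ttyd2 (the device name and controller type shown here are illustrative; substitute the values defined for your node):

cmgr> admin ping dev_name /dev/ttyd2 of dev_type tty with sysctrl_type msc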

Resource Group Status

To query the status of a resource group, you provide the name of the resource group and the cluster which includes the resource group. Resource group status includes the following components:

  • Resource group state

  • Resource group error state

  • Resource owner

These components are described in the following subsections.

If the node on which a resource group is online has a status of UNKNOWN, the status of the resource group will not be available or will be reported as ONLINE-READY.

Resource Group State

A resource group state can be one of the following:

ONLINE 

FailSafe is running on the local nodes. The resource group is allocated on a node in the cluster and is being monitored by FailSafe. It is fully allocated if there is no error; otherwise, some resources may not be allocated or some resources may be in an error state.

ONLINE-PENDING 

FailSafe is running on the local nodes and the resource group is in the process of being allocated. This is a transient state.

OFFLINE 

The resource group is not running or the resource group has been detached, regardless of whether FailSafe is running. When FailSafe starts up, it will not allocate this resource group.

OFFLINE-PENDING 

FailSafe is running on the local nodes and the resource group is in the process of being released (becoming offline). This is a transient state.

ONLINE-READY 

FailSafe is not running on the local node. When FailSafe starts up, it will attempt to bring this resource group online. No FailSafe process is running on the current node if this state is returned.

ONLINE-MAINTENANCE 

The resource group is allocated in a node in the cluster but it is not being monitored by FailSafe. If a node failure occurs while a resource group in ONLINE-MAINTENANCE state resides on that node, the resource group will be moved to another node and monitoring will resume. An administrator may move a resource group to an ONLINE-MAINTENANCE state for upgrade or testing purposes, or if there is any reason that FailSafe should not act on that resource for a period of time.

INTERNAL ERROR 

An internal FailSafe error has occurred and FailSafe does not know the state of the resource group. Error recovery is required. This could result from a memory error, bugs in a program, or communication problems.

DISCOVERY (EXCLUSIVITY) 

The resource group is in the process of going online: FailSafe is checking every node in the resource group's failover domain to determine whether any resource in the group is already allocated elsewhere. This is a transient state.

INITIALIZING 

FailSafe on the local node has yet to get any information about this resource group. This is a transient state.

Resource Group Error State

When a resource group is ONLINE, its error status is continually being monitored. A resource group error status can be one of the following:

NO ERROR 

Resource group has no error.

INTERNAL ERROR - NOT RECOVERABLE 

An internal error occurred; notify SGI if this condition arises.

NODE UNKNOWN 

Node that had the resource group online is in an unknown state. This occurs when the node is not part of the cluster. The last known state of the resource group is ONLINE, but the system cannot communicate with the node.

SRMD EXECUTABLE ERROR 

The start or stop action has failed for a resource in the resource group.

SPLIT RESOURCE GROUP (EXCLUSIVITY) 

FailSafe has determined that part of the resource group was running on at least two different nodes in the cluster.

NODE NOT AVAILABLE (EXCLUSIVITY) 

FailSafe has determined that one of the nodes in the resource group's failover domain was not in the FailSafe membership. FailSafe cannot bring the resource group online until that node is removed from the failover domain or HA services are started on that node.

MONITOR ACTIVITY UNKNOWN 

In the process of turning maintenance mode on or off, an error occurred. FailSafe can no longer determine if monitoring is enabled or disabled. Retry the operation. If the error continues, report the error to SGI.

NO AVAILABLE NODES 

A monitoring error has occurred on the last valid node in the FailSafe membership.

Resource Owner

The resource owner is the logical node name of the node that currently owns the resource.

Monitoring Resource Group Status with GUI

You can use the view area to monitor the status of the resources in a FailSafe configuration:

  • Select View: Resources in Groups to see the resources organized by the groups to which they belong.

  • Select View: Groups owned by Nodes to see where the online groups are running. This view lets you observe failovers as they occur.

Querying Resource Group Status with cmgr

To query a resource group status, use the following cmgr command:

show status of resource_group RGname [in cluster Clustername]

If you have specified a default cluster, you do not need to specify a cluster when you use this command; it will show the status of the indicated resource group in the default cluster.

Node Status

To query the status of a node, you provide the logical node name of the node. The node status can be one of the following:

UP 

This node is part of the FailSafe membership.

DOWN 

This node is not part of the FailSafe membership (no heartbeats) and this node has been reset. This is a transient state.

UNKNOWN 

This node is not part of the FailSafe membership (no heartbeats) and this node has not been reset (reset attempt has failed).

INACTIVE 

HA services have not been started on this node.

When you start HA services, node states transition from INACTIVE to UP. A node state may also transition from INACTIVE to UNKNOWN to UP.

Monitoring Node Status with cluster_status

You can use the cluster_status command to monitor the status of the nodes in the cluster.

Monitoring Cluster Status with the GUI

You can use the GUI view area to monitor the status of the clusters in a FailSafe configuration. Select View: Groups owned by Nodes to monitor the health of the default cluster, its resource groups, and the group's resources.

Querying Node Status with cmgr

To query node status, use the following command:

show status of node nodename

Performing a ping of the System Controller with cmgr

When FailSafe is running, you can determine whether the system controller on a node is responding with the following command:

admin ping node nodename

This command uses the FailSafe daemons to test whether the system controller is responding.

You can verify reset connectivity on a node in a cluster even when the FailSafe daemons are not running by using the standalone option of the admin ping command:

admin ping standalone node nodename

This command does not go through the FailSafe daemons, but calls the ping command directly to test whether the system controller on the indicated node is responding.
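
For example, to test the system controller on node hans2 (a node name used in other examples in this chapter), with and without the FailSafe daemons:

cmgr> admin ping node hans2
cmgr> admin ping standalone node hans2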

Viewing System Status with the haStatus Script

The haStatus script provides status and configuration information about clusters, nodes, resources, and resource groups in the configuration. This script is installed in the /var/cluster/cmgr-scripts directory. You can modify this script to suit your needs. See the haStatus man page for further information about this script.

The following examples show the output of the different options of the haStatus script.

# haStatus -help
Usage: haStatus [-a|-i] [-c clustername]
where,
 -a prints detailed cluster configuration information and cluster
status.
 -i prints detailed cluster configuration information only.
 -c can be used to specify a cluster for which status is to be printed.
 “clustername” is the name of the cluster for which status is to be
printed.

# haStatus
Tue Nov 30 14:12:09 PST 1999
Cluster test-cluster:
        Cluster state is ACTIVE.
Node hans2:
        State of machine is UP.
Node hans1:
        State of machine is UP.
Resource_group nfs-group1:
        State: Online
        Error: No error
        Owner: hans1
        Failover Policy: fp_h1_h2_ord_auto_auto
        Resources:
                /hafs1  (type: NFS)
                /hafs1/nfs/statmon      (type: statd_unlimited)
                150.166.41.95   (type: IP_address)
                /hafs1  (type: filesystem)
                havol1  (type: volume)
# haStatus -i
Tue Nov 30 14:13:52 PST 1999
Cluster test-cluster:
Node hans2:
        Logical Machine Name: hans2
        Hostname: hans2.dept.company.com
        Is FailSafe: true
        Is CXFS: false
        Nodeid: 32418
        Reset type: powerCycle
        System Controller: msc
        System Controller status: enabled
        System Controller owner: hans1
        System Controller owner device: /dev/ttyd2
        System Controller owner type: tty
        ControlNet Ipaddr: 192.26.50.15
        ControlNet HB: true
        ControlNet Control: true
        ControlNet Priority: 1
        ControlNet Ipaddr: 150.166.41.61
        ControlNet HB: true
        ControlNet Control: false
        ControlNet Priority: 2
Node hans1:
        Logical Machine Name: hans1
        Hostname: hans1.dept.company.com
        Is FailSafe: true
        Is CXFS: false
        Nodeid: 32645
        Reset type: powerCycle
        System Controller: msc
        System Controller status: enabled
        System Controller owner: hans2
        System Controller owner device: /dev/ttyd2
        System Controller owner type: tty
        ControlNet Ipaddr: 192.26.50.14
        ControlNet HB: true
        ControlNet Control: true
        ControlNet Priority: 1
        ControlNet Ipaddr: 150.166.41.60
        ControlNet HB: true
        ControlNet Control: false
        ControlNet Priority: 2
Resource_group nfs-group1:
        Failover Policy: fp_h1_h2_ord_auto_auto
                Version: 1
                Script: ordered
                Attributes: Auto_Failback Auto_Recovery
                Initial AFD: hans1 hans2
        Resources:
                /hafs1  (type: NFS)
                /hafs1/nfs/statmon      (type: statd_unlimited)
                150.166.41.95   (type: IP_address)
                /hafs1  (type: filesystem)
                havol1  (type: volume)
Resource /hafs1 (type NFS):
        export-info: rw,wsync
        filesystem: /hafs1
        Resource dependencies
        statd_unlimited /hafs1/nfs/statmon
        filesystem /hafs1
Resource /hafs1/nfs/statmon (type statd_unlimited):
        InterfaceAddress: 150.166.41.95
        Resource dependencies
        IP_address 150.166.41.95
        filesystem /hafs1
Resource 150.166.41.95 (type IP_address):
        NetworkMask: 0xffffff00
        interfaces: ef1
        BroadcastAddress: 150.166.41.255
        No resource dependencies
Resource /hafs1 (type filesystem):
        volume-name: havol1
        mount-options: rw,noauto
        monitor-level: 2
        Resource dependencies
        volume havol1
Resource havol1 (type volume):
        devname-group: sys
        devname-owner: root
        devname-mode: 666
        No resource dependencies
Failover_policy fp_h1_h2_ord_auto_auto:
        Version: 1
        Script: ordered
        Attributes: Auto_Failback Auto_Recovery
        Initial AFD: hans1 hans2
# haStatus -a
Tue Nov 30 14:45:30 PST 1999
Cluster test-cluster:
        Cluster state is ACTIVE.
Node hans2:
        State of machine is UP.
        Logical Machine Name: hans2
        Hostname: hans2.dept.company.com
        Is FailSafe: true
        Is CXFS: false
        Nodeid: 32418
        Reset type: powerCycle
        System Controller: msc
        System Controller status: enabled
        System Controller owner: hans1
        System Controller owner device: /dev/ttyd2
        System Controller owner type: tty
        ControlNet Ipaddr: 192.26.50.15
        ControlNet HB: true
        ControlNet Control: true
        ControlNet Priority: 1
        ControlNet Ipaddr: 150.166.41.61
        ControlNet HB: true
        ControlNet Control: false
        ControlNet Priority: 2
Node hans1:
        State of machine is UP.
        Logical Machine Name: hans1
        Hostname: hans1.dept.company.com
        Is FailSafe: true
        Is CXFS: false
        Nodeid: 32645
        Reset type: powerCycle
        System Controller: msc
        System Controller status: enabled
        System Controller owner: hans2
        System Controller owner device: /dev/ttyd2
        System Controller owner type: tty
        ControlNet Ipaddr: 192.26.50.14
        ControlNet HB: true
        ControlNet Control: true
        ControlNet Priority: 1
        ControlNet Ipaddr: 150.166.41.60
        ControlNet HB: true
        ControlNet Control: false
        ControlNet Priority: 2
Resource_group nfs-group1:
        State: Online
        Error: No error
        Owner: hans1
        Failover Policy: fp_h1_h2_ord_auto_auto
                Version: 1
                Script: ordered
                Attributes: Auto_Failback Auto_Recovery
                Initial AFD: hans1 hans2
        Resources:
                /hafs1  (type: NFS)
                /hafs1/nfs/statmon      (type: statd_unlimited)
                150.166.41.95   (type: IP_address)
                /hafs1  (type: filesystem)
                havol1  (type: volume)
Resource /hafs1 (type NFS):
        State: Online
        Error: None
        Owner: hans1
        Flags: Resource is monitored locally
        export-info: rw,wsync
        filesystem: /hafs1
        Resource dependencies
        statd_unlimited /hafs1/nfs/statmon
        filesystem /hafs1
Resource /hafs1/nfs/statmon (type statd_unlimited):
        State: Online
        Error: None
        Owner: hans1
        Flags: Resource is monitored locally
        InterfaceAddress: 150.166.41.95
        Resource dependencies
        IP_address 150.166.41.95
        filesystem /hafs1
Resource 150.166.41.95 (type IP_address):
        State: Online
        Error: None
        Owner: hans1
        Flags: Resource is monitored locally
        NetworkMask: 0xffffff00
        interfaces: ef1
        BroadcastAddress: 150.166.41.255
        No resource dependencies
Resource /hafs1 (type filesystem):
        State: Online
        Error: None
        Owner: hans1
        Flags: Resource is monitored locally
        volume-name: havol1
        mount-options: rw,noauto
        monitor-level: 2
        Resource dependencies
        volume havol1
Resource havol1 (type volume):
        State: Online
        Error: None
        Owner: hans1
        Flags: Resource is monitored locally
        devname-group: sys
        devname-owner: root
        devname-mode: 666
        No resource dependencies
# haStatus -c test-cluster
Tue Nov 30 14:42:04 PST 1999
Cluster test-cluster:
        Cluster state is ACTIVE.
Node hans2:
        State of machine is UP.
Node hans1:
        State of machine is UP.
Resource_group nfs-group1:
        State: Online
        Error: No error
        Owner: hans1
        Failover Policy: fp_h1_h2_ord_auto_auto
        Resources:
                /hafs1  (type: NFS)
                /hafs1/nfs/statmon      (type: statd_unlimited)
                150.166.41.95   (type: IP_address)
                /hafs1  (type: filesystem)
                havol1  (type: volume)

Embedded Support Partner (ESP) Logging of FailSafe Events

The Embedded Support Partner (ESP) consists of a set of daemons that perform various monitoring activities. You can choose to configure ESP so that it will log FailSafe events (the FailSafe ESP event profile is not configured in ESP by default).

FailSafe uses an event class ID of 77 and a description of IRIS FailSafe2.

If you want to use ESP for FailSafe, enter the following command to add the failsafe2 event profile to ESP:

# espconfig -add eventprofile failsafe2

FailSafe will then log ESP events for the following:

  • Daemon configuration error

  • Failover policy configuration error

  • Resource group allocation (start) failure

  • Resource group failures:

    • Allocation (start) failure

    • Release (stop) failure

    • Monitoring failure

    • Exclusivity failure

    • Failover policy failure

  • Resource group status:

    • online

    • offline

    • maintenance_on

    • maintenance_off

  • FailSafe shutdown (HA services stopped)

  • FailSafe started (HA services started)

You can use the espreport or launchESPartner commands to see the logged ESP events. See the esp man page and the Embedded Support Partner User Guide for more information about ESP.

Resource Group Failover

While a FailSafe system is running, you can move a resource group online to a particular node, or you can take a resource group offline. In addition, you can move a resource group from one node in a cluster to another node in a cluster. The following subsections describe these tasks.

Bring a Resource Group Online

This section describes how to bring a resource group online.

Bring a Resource Group Online with the GUI

Before you bring a resource group online for the first time, you should run the diagnostic tests on that resource group. Diagnostics check system configurations and perform some validations that are not performed when you bring a resource group online.

You cannot bring a resource group online in the following circumstances:

  • If the resource group has no members

  • If the resource group is currently running in the cluster

To bring a resource group fully online, HA services must be active. When HA services are active, an attempt is made to allocate the resource group in the cluster. However, you can also execute a command to bring the resource group online when HA services are not active. When HA services are not active, the resource group is marked to be brought online when HA services become active; the resource group is then in an ONLINE-READY state. FailSafe tries to bring a resource group in an ONLINE-READY state online when HA services are started.

You can disable resource groups from coming online when HA services are started by using the GUI or cmgr to take the resource group offline, as described in “Take a Resource Group Offline”.


Caution: Before bringing a resource group online in the cluster, you must be sure that the resource group is not running on a disabled node (where HA services are not running). Bringing a resource group online while it is running on a disabled node could cause data corruption. For information on detached resource groups, see “Take a Resource Group Offline”.

Do the following:

  1. Group to Bring Online: select the name of the resource group you want to bring online. The menu displays only resource groups that are not currently online.

  2. Click on OK to complete the task.

Bring a Resource Group Online with cmgr

To bring a resource group online, use the following command:

admin online resource_group RGname [in cluster Clustername]

If you have specified a default cluster, you do not need to specify a cluster when you use this command.

For example:

cmgr> set cluster test-cluster
cmgr> admin online resource_group group1
FailSafe daemon (ha_fsd) is not running on this local node or it is not ready to accept admin commands.
Resource Group (group1) is online-ready.

Failed to admin:
        online

admin command failed

cmgr> show status of resource_group group1 in cluster test-cluster

State: Online Ready
Error: No error
Check resource group group1 status in an active node if HA services are active in cluster

Take a Resource Group Offline

This section tells you how to take a resource group offline.

Take a Resource Group Offline with the GUI

When you take a resource group offline, FailSafe takes each resource in the resource group offline in a predefined order. If any single resource gives an error during this process, the process stops, leaving all remaining resources allocated.

You can take a FailSafe resource group offline in any of the following ways:

  • Take the resource group offline. This physically stops the processes for that resource group and does not reset any error conditions. If this operation fails, the resource group will be left online in an error state.

  • Force the resource group offline. This physically stops the processes for that resource group but resets any error conditions. This operation cannot fail.

  • Detach the resource group. This causes FailSafe to stop monitoring the resource group, but does not physically stop the processes on that group. FailSafe will report the status as offline and will not have any control over the group. This operation should rarely fail.

  • Detach the resource group and force the error state to be cleared. This causes FailSafe to stop monitoring the resource group, but does not physically stop the processes on that group. FailSafe will report the status as offline and will not have any control over the group. In addition, all error conditions of the resource group will be reset. This operation should rarely fail.

If you do not need to stop the resource group and do not want FailSafe to monitor the resource group while you make changes, but you would still like to have administrative control over the resource group (for instance, to move that resource group to another node), you can put the resource group in maintenance mode using the Suspend Monitoring a Resource Group task on the GUI or the admin maintenance_on command of cmgr, as described in “ Suspend and Resume Monitoring of a Resource Group”.

If the fsd daemon is not running or is not ready to accept client requests, executing this task disables the resource group in the cluster database only. The resource group remains online and the command fails.

Enter the following:

  1. Detach Only: check this box to stop monitoring the resource group. The resource group will not be stopped, but FailSafe will not have any control over the group.

  2. Detach Force: check this box to stop monitoring the resource group. The resource group will not be stopped, but FailSafe will not have any control over the group. In addition, FailSafe will clear all errors.


    Caution: The Detach Only and Detach Force settings leave the resource group's resources running on the node where the group was online. After stopping HA services on that node, do not bring the resource group online on another node in the cluster; doing so can cause data integrity problems. Instead, make sure that no resources are running on a node before stopping HA services on that node.


  3. Force Offline: check this box to stop all resources in the group and clear all errors.

  4. Group to Take Offline: select the name of the resource group you want to take offline. The menu displays only resource groups that are currently online.

  5. Click on OK to complete the task.

Take a Resource Group Offline with cmgr

To take a resource group offline, use the following command:

admin offline resource_group RGname [in cluster Clustername]

To take a resource group offline with the force option in effect, forcing FailSafe to complete the action even if there are errors, use the following command:

admin offline_force resource_group RGname [in cluster Clustername]


Note: Doing an offline_force operation on a resource group can leave resources in the resource group running on the cluster. The offline_force operation will succeed even though all resources in the resource group have not been stopped. FailSafe does not track these resources any longer. You should take care to prevent resources from running on multiple nodes in the cluster.

To detach a resource group, use the following command:

admin offline_detach resource_group RGname [in cluster Clustername]

To detach the resource group and force the error state to be cleared:

admin offline_detach_force resource_group RGname [in cluster Clustername]

This causes FailSafe to stop monitoring the resource group, but does not physically stop the processes on that group. FailSafe will report the status as offline and will not have any control over the group. In addition, all error conditions of the resource group will be reset. This operation should rarely fail.
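
For example, the following commands (reusing the cluster and resource group names from earlier examples) show the form of each of these variants:

cmgr> admin offline resource_group group1 in cluster test-cluster
cmgr> admin offline_force resource_group group1 in cluster test-cluster
cmgr> admin offline_detach resource_group group1 in cluster test-cluster
cmgr> admin offline_detach_force resource_group group1 in cluster test-cluster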

Move a Resource Group

This section tells you how to move a resource group.

Move a Resource Group with the GUI

While FailSafe is active, you can move a resource group to another node in the same cluster.


Note: When you move a resource group in an active system, you may find the unexpected behavior that the command appears to have succeeded, but the resource group remains online on the same node in the cluster. This can occur if the resource group fails to start on the node to which you are moving it. In this case, FailSafe will fail over the resource group to the next node in the application failover domain, which may be the node on which the resource group was originally running. Since FailSafe kept the resource group online, the command succeeds.

Enter the following:

  1. Group to Move: select the name of the resource group to be moved. Only resource groups that are currently online are displayed in the menu.

  2. Failover Domain Node: (optional) select the name of the node to which you want to move the resource group. If you do not specify a node, FailSafe will move the resource group to the next available node in the failover domain.

  3. Click on OK to complete the task.

Move a Resource Group with cmgr

To move a resource group to another node, use the following command:

admin move resource_group RGname [in cluster Clustername] [to node Nodename]

For example, to move resource group nfs-group1 running on node primary to node backup in the cluster nfs-cluster, do the following:

cmgr> admin move resource_group nfs-group1 in cluster nfs-cluster to node backup

If you do not specify a node, the resource group's failover policy is used to determine the destination node for the resource group.

If you run into errors after entering the admin move command, see “Ensuring that Resource Groups are Deallocated” in Chapter 10.

Suspend and Resume Monitoring of a Resource Group

This section describes how to stop monitoring of a resource group in order to put it into maintenance mode.

Suspend Monitoring a Resource Group with the GUI

You can temporarily stop FailSafe from monitoring a specific resource group, which puts the resource group in maintenance mode. The resource group remains on the same node in the cluster but is no longer monitored by FailSafe for resource failures.

You can put a resource group into maintenance mode if you do not want FailSafe to monitor the group for a period of time. You may want to do this for upgrade or testing purposes, or if there is any reason that FailSafe should not act on that resource group. When a resource group is in maintenance mode, it is not being monitored and it is not highly available. If the resource group's owner node fails, FailSafe will move the resource group to another node and resume monitoring.

When you put a resource group into maintenance mode, resources in the resource group are in ONLINE-MAINTENANCE state. The ONLINE-MAINTENANCE state for the resource is seen only on the node that has the resource online. All other nodes will show the resource as ONLINE. The resource group, however, should appear as being in ONLINE-MAINTENANCE state in all nodes.

Do the following:

  1. Group to Stop Monitoring: select the name of the group you want to stop monitoring. Only those resource groups that are currently online and monitored are displayed in the menu.

  2. Click OK to complete the task.

Resume Monitoring of a Resource Group with the GUI

This task lets you resume monitoring a resource group.

Once monitoring is resumed and assuming that the restart action is enabled, if the resource group or one of its resources fails, FailSafe will restart each failed component based on the failover policy.

Perform the following steps:

  1. Group to Start Monitoring: select the name of the group you want to start monitoring. Only those resource groups that are currently online and not monitored are displayed in the menu.

  2. Click OK to complete the task.

Putting a Resource Group into Maintenance Mode with cmgr

To put a resource group into maintenance mode, use the following command:

admin maintenance_on resource_group RGname [in cluster Clustername]

If you have specified a default cluster, you do not need to specify a cluster when you use this command.

Resume Monitoring of a Resource Group with cmgr

To move a resource group back online from maintenance mode, use the following command:

admin maintenance_off resource_group RGname [in cluster Clustername]
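
For example, to suspend and later resume monitoring of group1 in the default cluster set earlier in this chapter:

cmgr> set cluster test-cluster
cmgr> admin maintenance_on resource_group group1
cmgr> admin maintenance_off resource_group group1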

Stopping FailSafe

You can stop the execution of FailSafe on all the nodes in a cluster or on a specified node only. See “Stop FailSafe HA Services” in Chapter 6.

Resetting Nodes

You can use FailSafe to reset nodes in a cluster. This sends a reset command to the system controller port on the specified node. When the node is reset, other nodes in the cluster will detect this and remove the node from the active cluster, reallocating any resource groups that were allocated on that node onto a backup node. The backup node that is used depends on how you have configured your system.

After the node reboots, it will rejoin the cluster. Some resource groups might move back to the node, depending on how you have configured your system.

Reset a Node with the GUI

You can use the GUI to reset nodes in a cluster. This sends a reset command to the system controller port on the specified node. When the node is reset, other nodes in the cluster will detect the change and remove the node from the active cluster. When the node reboots, it will rejoin the FailSafe membership.

To reset a node, do the following:

  1. Node to Reset: select the node to be reset.

  2. Click on OK to complete the task.

Reset a Node with cmgr

When FailSafe is running, you can reset a node with the following command:

admin reset node nodename

This command uses the FailSafe daemons to reset the specified node.

You can reset a node in a cluster even when the FailSafe daemons are not running by using the standalone option:

admin reset standalone node nodename

The command above does not use the crsd daemon.

If you have defined the node but have not defined system controller information for it, you can use the following command line:

admin reset dev_name nodename of dev_type tty with sysctrl_type msc|mmsc|l2|l1

Power Cycle a Node with cmgr

When FailSafe is running, you can perform a power cycle on a node with the following command:

admin powerCycle node nodename

This command uses the FailSafe daemons to powercycle the specified node.

You can powercycle a node in a cluster even when the FailSafe daemons are not running by using the standalone option:

admin powerCycle standalone node nodename

This command does not go through the crsd daemon.

If the node has not been defined in the cluster database, you can use the following command line:

admin powerCycle dev_name nodename of dev_type tty with sysctrl_type msc|mmsc|l2|l1

Perform an NMI on a Node with cmgr

When FailSafe is running, you can perform a nonmaskable interrupt (NMI) on a node with the following command:

admin nmi node nodename

This command uses the FailSafe daemons to perform an NMI on the specified node.

You can perform an NMI on a node in a cluster even when the FailSafe daemons are not running by using the standalone option:

admin nmi standalone node nodename

The above command does not use the crsd daemon.

If the node has not been defined in the cluster database, you can use the following command line:

admin nmi dev_name nodename of dev_type tty with sysctrl_type msc|mmsc|l2|l1

Cluster Database Backup and Restore

This section discusses the following:

  • Restoring the Database from Another Node

  • Using build_cmgr_script for the Cluster Database

  • Using cdbBackup and cdbRestore for the Cluster Database and Logging Information

Restoring the Database from Another Node

If the database has been accidentally deleted from an individual node, you can replace it with a copy from another node. Do not use this method if the cluster database has been corrupted.

Do the following:

  1. Stop the HA services and (if running) CXFS services.

  2. Stop the cluster daemons by running the following command on each node:

    # /etc/init.d/cluster stop

  3. Run cdbreinit on nodes that are missing the cluster database. Verify that cluster daemons are running.

  4. Restart HA services and (if needed) CXFS services.
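
For example, assuming the cluster database is missing only on node hans1 (the node names and commands here simply reuse those shown elsewhere in this chapter), the sequence might look like the following:

On one node, stop HA services (see “Stop FailSafe HA Services” in Chapter 6), then on each node:
# /etc/init.d/cluster stop

On hans1 (the node missing the cluster database):
# /usr/cluster/bin/cdbreinit

On each node, restart the cluster daemons if they are not already running, and then restart HA services:
# /etc/init.d/cluster start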

Using build_cmgr_script for the Cluster Database

You can use the build_cmgr_script command from one node in the cluster to create a cmgr script that will recreate the node, cluster, switch, and filesystem definitions for all nodes in the cluster database. You can then later run the resulting script to recreate a database with the same contents; this method can be used for missing or corrupted cluster databases.


Note: The build_cmgr_script command does not recreate node-specific information for resources and resource types or local logging information, because the cluster database does not replicate node-specific information. Therefore, if you reinitialize the cluster database, you will lose node-specific information. The build_cmgr_script script does not contain local logging information, so it cannot be used as a complete backup/restore tool.

To perform a database backup, use the build_cmgr_script script from one node in the cluster, as described in “Creating a cmgr Script Automatically” in Chapter 5.


Caution: Do not make configuration changes while you are using the build_cmgr_script command.

By default, this creates a cmgr script in the following location:

/var/cluster/ha/tmp/cmgr_create_cluster_clustername_processID

You can specify another filename by using the -o option.

To perform a restore on all nodes in the pool, do the following:

  1. Stop HA services for all nodes in the cluster.

  2. Stop the cluster database daemons on each node.

  3. Remove all copies of the old database by using the cdbreinit command on each node.

  4. Execute the cmgr script (which was generated by the build_cmgr_script script) on the node that is defined first in the script. This will recreate the backed-up database on each node.


    Note: If you want to run the generated script on a different node, you must modify the generated script so that the node is the first one listed in the script.


  5. Restart cluster database daemons on each node.

For example, to back up the current database, clear the database, and restore the database to all nodes, do the following:

On one node in the cluster:
# /var/cluster/cmgr-scripts/build_cmgr_script -o /tmp/newcdb
Building cmgr script for cluster clusterA ...
build_cmgr_script: Generated cmgr script is /tmp/newcdb

On one node:
# stop ha_services for cluster clusterA

On each node:
# /etc/init.d/cluster stop

On each node:
# /usr/cluster/bin/cdbreinit

On each node:
# /etc/init.d/cluster start

On the *first* node listed in the /tmp/newcdb script:
# /tmp/newcdb

Using cdbBackup and cdbRestore for the Cluster Database and Logging Information

The cdbBackup and cdbRestore commands back up and restore the cluster database and node-specific information, such as local logging information. You must run these commands individually for each node.

To perform a backup of the cluster, use the cdbBackup command on each node.


Caution: Do not make configuration changes while you are using the cdbBackup command.

To perform a restore, run the cdbRestore command on each node. You can use this method for either a missing or corrupted cluster database. Do the following:

  1. Stop HA services.

  2. Stop cluster services on each node.

  3. Remove the old database by using the cdbreinit command on each node.

  4. Stop cluster services again (these were restarted automatically by cdbreinit in the previous step) on each node.

  5. Use the cdbRestore command on each node.

  6. Start cluster services on each node.

For example, to back up the current database, clear the database, and then restore the database to all nodes, do the following:

On each node:
# /usr/cluster/bin/cdbBackup

On one node in the cluster:
# stop ha_services for cluster clusterA

On each node:
# /etc/init.d/cluster stop

On each node:
# /usr/cluster/bin/cdbreinit

On each node (again):
# /etc/init.d/cluster stop

On each node:
# /usr/cluster/bin/cdbRestore

On each node:
# /etc/init.d/cluster start

For more information, see the cdbBackup and cdbRestore man page.


Note: Do not perform a cdbDump while information is changing in the cluster database. Check the SYSLOG file for information to help determine when cluster database activity is occurring. As a rule of thumb, you should be able to perform a cdbDump if at least 15 minutes have passed since the last node joined the cluster or the last administration command was run.


Filesystem Dump and Restore

To perform an XFS filesystem dump and restore, you must do the following:

  1. Perform a backup of the cluster database using cdbBackup.

  2. Perform the XFS filesystem dump with xfsdump.

  3. Perform the XFS filesystem restore with xfsrestore.

  4. Remove the existing cluster database:

    # rm /var/cluster/cdb

  5. Restore the backed-up database by using cdbRestore.
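
A sketch of the corresponding commands follows, assuming a dump to a local tape device /dev/tape and an XFS filesystem mounted at /hafs1 (both names are illustrative; see the xfsdump(1M) and xfsrestore(1M) man pages for the options appropriate to your site):

Back up the cluster database:
# /usr/cluster/bin/cdbBackup

Dump and later restore the XFS filesystem:
# xfsdump -f /dev/tape /hafs1
# xfsrestore -f /dev/tape /hafs1

Remove the existing cluster database and restore the backed-up copy:
# rm /var/cluster/cdb
# /usr/cluster/bin/cdbRestore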

Rotating Log Files

This section discusses the following:

  • Rotating All Log Files

  • Rotating Large Log Files

For information about log levels, see “Set Log Configuration” in Chapter 6.

Rotating All Log Files

You can run the /var/cluster/cmgr-scripts/rotatelogs script to copy all files to a new location. This script saves log files with the day and the month name as a suffix. If you run the script twice in one day, it will append the current log file to the previous saved copy. The root crontab file has an entry to run this script weekly.

The script syntax is as follows:

/var/cluster/cmgr-scripts/rotatelogs [-h] [-d|-u]

If no option is specified, the log files will be rotated. Options are as follows:

-h

Prints the help message. The log files are not rotated and other options are ignored.

-d

Deletes saved log files that are older than one week before rotating the current log files. You cannot specify this option and -u.

-u

Unconditionally deletes all saved log files before rotating the current log files. You cannot specify this option and -d.

By default, the rotatelogs script will be run by crontab once a week, which is sufficient if you use the default log levels. If you plan to run with a high debug level for several weeks, you should reset the crontab entry so that the rotatelogs script is run more often.

On heavily loaded machines, or for very large log files, you may want to move resource groups and stop HA services before running rotatelogs.
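
For example, to run rotatelogs nightly instead of weekly, the root crontab entry could be changed to something like the following (the time shown is arbitrary; -d deletes saved logs older than one week before rotating):

0 2 * * * /var/cluster/cmgr-scripts/rotatelogs -d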

Rotating Large Log Files

You can use a script such as the following to copy large files to a new location. The files in the new location will be overwritten each time this script is run.

#!/bin/sh
# Argument is maximum size of a log file (in characters) - default: 500000

size=${1:-500000}
find /var/cluster/ha/log -type f ! -name '*.OLD' -size +${size}c -print | while read log_file; do
        cp ${log_file} ${log_file}.OLD
        echo '*** LOG FILE ROTATION ' `date` '***' > ${log_file}
done

Granting Task Execution Privileges to Users

The GUI lets you grant or revoke access to a specific GUI task for one or more specific users. By default, only root may execute tasks in the GUI. You cannot grant or revoke tasks for users with a user ID of 0.


Note: To maintain security, the root user must have a password. If root does not have a password, then all users can get access to any task.

Access to the task is allowed only on the node to which the GUI is connected; if you want to allow access on another node in the pool, you must connect the GUI to that node and grant access again.

GUI tasks and the cmgr command operate by executing underlying privileged commands which are normally accessible only to root. When granting access to a task, you are in effect granting access to all of its required underlying commands, which results in also granting access to the other GUI tasks that use the same underlying commands.

To see which tasks a specific user can currently access, select View: Users. Select a specific user to see details about the tasks available to that user.

To see which users can currently access a specific task, select View: Task Privileges. Select a specific task to see details about the users who can access it and the privileged commands it requires.

Grant Task Access to a User or Users

You can grant access to a specific task to one or more users at a time.


Note: Access to the task is only allowed on the node to which the GUI is connected; if you want to allow access on another node in the pool, you must connect the GUI to that node and grant access again.

Do the following:

  1. Select the user or users for whom you want to grant access. You can use the following methods to select users:

    • Click to select one user at a time

    • Shift+click to select a block of users

    • Ctrl+click to toggle the selection of any one user, which allows you to select multiple users that are not contiguous

    • Click Select All to select all users

    Click Next to move to the next page.

  2. Select the task or tasks to grant access to, using the above selection methods. Click Next to move to the next page.

  3. Confirm your choices by clicking OK.


    Note: If more tasks than you selected are shown, then the selected tasks run the same underlying privileged commands as other tasks, such that access to the tasks you specified cannot be granted without also granting access to these additional tasks.


To see which tasks a specific user can access, select View: Users. Select a specific user to see details about the tasks available to that user.

To see which users can access a specific task, select View: Task Privileges. Select a specific task to see details about the users who can access it and the privileged commands it requires.

Granting Access to a Few Tasks

Suppose you wanted to grant user guest permission to define clusterwide and node-specific resources. You would do the following:

  1. Select guest and click Next to move to the next page.

  2. Select the tasks you want guest to be able to execute:

    1. Click Define a Resource

    2. Ctrl+click Redefine a Resource for a Specific Node

    3. Ctrl+click Add or Remove Resources in Resource Group

    Click Next to move to the next page.

  3. Confirm your choices by clicking OK.

Figure 8-1 shows the privileged commands that were granted to user guest.


Note: Modify a Resource Definition is also displayed, even though the administrator did not explicitly select it; the selected tasks require the same underlying privileged commands as this task.


Figure 8-1. Results of Granting a User Privilege


Figure 8-2 shows the screen that is displayed when you select View: Users and click guest to display information in the details area of the GUI window. The privileged commands listed are the underlying commands executed by the GUI tasks.

Figure 8-2. Displaying the Privileged Commands a User May Execute


Granting Access to Most Tasks

Suppose you wanted to give user sys access to all tasks except adding or removing nodes from a cluster. The easiest way to do this is to select all of the tasks and then deselect the one you want to restrict. You would do the following:

  1. Select sys and click Next to move to the next page.

  2. Select the tasks you want sys to be able to execute:

    1. Click Select All to highlight all tasks.

    2. Deselect the task to which you want to restrict access. Ctrl+click Add/Remove Nodes in Cluster.

    Click Next to move to the next page.

  3. Confirm your choices by clicking OK.

Revoke Task Access from a User or Users

You can revoke task access from one or more users at a time.


Note: Access to the task is only revoked on the node to which the GUI is connected; if a user has access to the task on multiple nodes in the pool, you must connect the GUI to those other nodes and revoke access again.


Do the following:

  1. Select the user or users from whom you want to revoke task access. You can use the following methods to select users:

    • Click to select one user at a time

    • Shift+click to select a block of users

    • Ctrl+click to toggle the selection of any one user, which allows you to select multiple users that are not contiguous

    • Click Select All to select all users

    Click Next to move to the next page.

  2. Select the task or tasks to revoke access to, using the above selection methods. Click Next to move to the next page.

  3. Confirm your choices by clicking OK.


    Note: If more tasks than you selected are shown, then the selected tasks run the same underlying privileged commands as other tasks, such that access to the tasks you specified cannot be revoked without also revoking access to these additional tasks.


To see which tasks a specific user can access, select View: Users. Select a specific user to see details about the tasks available to that user.

To see which users can access a specific task, select View: Task Privileges. Select a specific task to see details about the users who can access it.

Updating the Checksum Version for 6.5.21 and Earlier Clusters

The ChecksumVersion variable is required for clusters running IRIX 6.5.22 or later. A cluster created prior to IRIX 6.5.22 does not have this variable in its cluster database and will therefore use the old checksum behavior; such clusters must have the variable set manually.

If your cluster was created before IRIX 6.5.22, you must run the following command to add the ChecksumVersion variable to the cluster database and set it to the correct value. You must do this after all nodes in the cluster have been upgraded to IRIX 6.5.22 and are running normally.

For example, if the name of the cluster is gps, run the following cdbutil command on one node in the cluster:


Note: Command arguments are case-sensitive.


# /usr/cluster/bin/cdbutil -i
cdbutil> node #cluster#gps#ClusterAdmin
cdbutil> create ChecksumVersion
cdbutil> setvalue ChecksumVersion 1
cdbutil> quit

Clusters running IRIX 6.5.22 or later must have all nodes at 6.5.22 or later. Do not downgrade nodes or add earlier-release nodes to the cluster without first setting ChecksumVersion to 0 (otherwise, the nodes will fail to form a membership). After you have upgraded all nodes to IRIX 6.5.22 or later, set ChecksumVersion to 1 by running the following commands on one node in the cluster:

# /usr/cluster/bin/cdbutil -i
cdbutil> node #cluster#gps#ClusterAdmin
cdbutil> setvalue ChecksumVersion 1
cdbutil> quit


Note: The create step is missing here because the variable should already be in the cluster database at this point.