Chapter 6. Configuration

This chapter provides a summary of the steps required to configure a cluster using either the FailSafe Manager graphical user interface (GUI) or the cmgr command.


Note: For the initial installation, SGI highly recommends that you use the GUI guided configuration tasks. See “Guided Configuration with the GUI”.

SGI also recommends that you perform all FailSafe administration from one node in the pool so that the latest copy of the database is available even when there are network partitions.


The following sections describe the preliminary steps you should follow, information you must understand, the GUI guided configuration, and the various individual tasks using the GUI and cmgr.

Preliminary Steps

The cluster processes are started automatically when FailSafe and cluster subsystems from the IRIX CD are installed. Complete the following steps to ensure that you are ready to configure the initial cluster:

During the course of configuration, you will see various information-only messages in the log files.

Verify that the Cluster chkconfig Flag is On

Ensure that the output from chkconfig shows the following flag set to on:

# chkconfig
        Flag                 State               
        ====                 =====               
        cluster              on

If it is not, set it to on. For example:

# chkconfig cluster on
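
In a setup script, you might confirm the flag before continuing. The following is a minimal sketch, assuming the standard chkconfig output format shown above:

# Turn the cluster flag on only if it is not already on
if chkconfig | grep 'cluster  *on' > /dev/null; then
    echo "cluster flag is already on"
else
    chkconfig cluster on
fi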

Start the Cluster Daemons

Enter the following to start the cluster daemons:

# chkconfig cluster on
# /etc/init.d/cluster start

After you start highly available (HA) services, the following daemons are also started on a base FailSafe system (without optional plug-ins):

  • ha_fsd

  • ha_cmsd

  • ha_gcd

  • ha_srmd

  • ha_ifd
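
One way to confirm that these daemons are running after HA services start is to filter the process listing for them (the exact set may differ if optional plug-ins are installed):

# ps -ef | egrep 'ha_(fsd|cmsd|gcd|srmd|ifd)'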

Verify that the Cluster Daemons are Running

When you first install the software, the following cluster daemons should be running:

  • fs2d

  • cmond

  • cad

  • crsd

To determine which daemons are running, enter the following:

ps -ef | grep cluster

The following shows an example of the output when just the initial daemons are running; for readability, some whitespace has been removed:

# ps -ef | grep cluster
root 31431     1 0 12:51:36 ?     0:14 /usr/lib32/cluster/cbe/fs2d /var/cluster/cdb/cdb.db #
root 31456 31478 0 12:53:01 ?     0:03 /usr/cluster/bin/crsd -l
root 31475 31478 0 12:53:00 ?     0:08 /usr/cluster/bin/cad -l -lf /var/cluster/ha/log/cad_log --append_log
root 31478     1 0 12:53:00 ?     0:00 /usr/cluster/bin/cmond -L info -f /var/cluster/ha/log/cmond_log
root 31570 31408 0 14:01:52 pts/0 0:00 grep cluster

If you do not see these processes, go to the logs to see what the problem might be. If you must restart the daemons, enter the following:

# /etc/init.d/cluster restart

Determine the Hostname of the Node

When you are initially configuring the cluster, you must use the IP address or the value of /etc/sys_id when logging in to the GUI and when defining the nodes in the pool. The value of /etc/sys_id must match the hostname associated with the node's IP address in /etc/hosts. The value of /etc/sys_id is displayed by the hostname command. For example:

# hostname
fs6

Also, if you use nsd, you must configure your system so that local files are accessed before the network information service (NIS) or the domain name service (DNS). See “System File Configuration” in Chapter 3.
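
For example, a hosts entry in the nsd configuration file (/etc/nsswitch.conf) that searches local files before NIS and DNS might look like the following; see the referenced chapter for the authoritative setup:

hosts: files nis dns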


Caution: It is critical that these files are configured properly and that you enter the hostname for the nodes. See “Install FailSafe” in Chapter 4.


Name Restrictions

When you specify the names of the various components of a FailSafe system, the name cannot begin with an underscore (_) and cannot contain whitespace, an unprintable character, or any of the following characters: *, ?, \, #.

The following is the list of permitted characters for the name of a FailSafe component:

  • alphanumeric characters

  • /

  • .

  • - (hyphen)

  • _ (underscore)

  • :

  • =

  • @

  • '

These character restrictions apply whether you are configuring your system with the GUI or cmgr.
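
As an illustration only, the following shell sketch (a hypothetical helper, not part of FailSafe) rejects names that violate these rules:

# Return success only if $1 is a legal FailSafe component name
check_name() {
    case "$1" in
        _*) echo "invalid: leading underscore"; return 1 ;;
    esac
    # Reject anything outside the permitted character set
    if echo "$1" | grep "[^A-Za-z0-9/.:=@_'-]" > /dev/null; then
        echo "invalid: disallowed character"
        return 1
    fi
    echo "valid"
}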

Configuring Timeout Values and Monitoring Intervals

When you configure the components of a FailSafe system, you configure various timeout values and monitoring intervals that determine the application downtime of a highly available (HA) system when there is a failure. To determine reasonable values to set for your system, consider the following equations:

application_downtime = failure_detection + time_to_handle_failure + failure_recovery_time

Failure detection depends on the type of failure that is detected:

  • When a node goes down, there will be a node failure detection after the node timeout time, which is one of the parameters that you can modify. All failures that translate into a node failure (such as heartbeat failure and operating system failure) fall into this failure category. Node timeout has a default value of 15 seconds.

  • When there is a resource failure, there will be a monitor failure of a resource. The time this will take is determined by the following:

    • The monitoring interval for the resource type

    • The monitor timeout for the resource type

    • The number of restarts defined for the resource type, if the restart mode is configured on

    For information on setting values for a resource type, see “Define a Resource Type with the GUI”.

Reducing these values will result in a shorter failover time, but could also lead to a significant increase in FailSafe overhead, which will affect system performance and could lead to false failovers.

The time to handle a failure is something that the user cannot control. In general, this should take a few seconds.

The failure recovery time is determined by the total time it takes for FailSafe to perform the following:

  • Execute the failover policy script (approximately 5 seconds).

  • Run the stop action script for all resources in the resource group. This is not required for node failure; the failing node will be reset.

  • Run the start action script for all resources in the resource group.
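
For example, consider a node failure with the default 15-second node timeout, assuming (purely for illustration) that the start scripts for the resource group take about 20 seconds. No stop scripts run in this case because the failing node is reset:

application_downtime = 15 (node timeout) + a few seconds (failure handling)
                       + 5 (failover policy script) + 20 (start scripts)
                     = approximately 40 to 45 seconds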

Setting Configuration Defaults with cmgr

Certain cmgr commands require you to specify a cluster, node, or resource type. Before you configure the components of a FailSafe system, you can set defaults for these values that will be used if you do not specify an explicit value. The default values are in effect only for the current session of cmgr.

Use the following cmgr commands:

  • Default cluster:

    set cluster Clustername

    For example:

    cmgr> set cluster test-cluster

  • Default node:

    set node Nodename

    For example:

    cmgr> set node node1

  • Default resource type:

    set resource_type RTname

    For example:

    cmgr> set resource_type IP_address

To view the current default configuration values, use the following command:

show set defaults

Guided Configuration with the GUI

The GUI provides guided configuration task sets to help you configure your FailSafe cluster.

The node from which you run the GUI affects your view of the cluster. You should wait for a change to appear in the view area before making another change; the change is not guaranteed to be propagated across the cluster until the icons appear in the view area.

You should only make changes from one instance of the GUI running at any given time; changes made by a second GUI instance (a second invocation of fsmgr) may overwrite changes made by the first instance. However, multiple windows accessed via the File menu are all part of a single GUI instance; you can make changes from any of these windows.

Set Up a New Cluster


Note: Within the tasks, you can click on any blue text to get more information about that concept or input field. In every task, the cluster configuration will not update until you click OK.

The Set Up a New Cluster task in the Guided Configuration leads you through the steps required to create a new cluster. It encompasses tasks that are detailed elsewhere.

The GUI provides a convenient display of a cluster and its components. Use the view area to verify your progress and to avoid adding nodes too quickly.

Do the following:

  1. Click Define a Node to define the node to which you are connected (that is, the local node). The hostname that appears in /etc/sys_id is used for all node definitions. See “Define a Node”.


    Note: If you attempt to define a cluster or other object before the local node has been defined, you will get an error message that says:
    No nodes are registered on servername. You cannot define a cluster 
    until you define the node to which the GUI is connected. To do so, 
    click "Continue" to launch the "Set Up a New Cluster" task.



  2. (Optional) After the first node icon appears in the view area, click on step 2, Define a Node, to define the other nodes in the cluster. The hostname/IP-address pairings and priorities of the networks must be the same for each node in the cluster.


    Note: Do not define a second node until the icon for the first node appears in the view area. If you add nodes too quickly (before the database can include the node), errors will occur.


    Repeat this step for each node. For large clusters, SGI recommends that you define only the first three nodes and then continue on to the next step; add the remaining nodes after you have a successful small cluster.

  3. Click Define a Cluster to create the cluster definition. See “Define a Cluster”. Verify that the cluster appears in the view area; choose View: Nodes in Cluster.

  4. Click Add/Remove Nodes in Cluster to add the nodes to the new cluster. See “Add or Remove Nodes in the Cluster with the GUI”.

    Click Next to move to the second page of tasks.

  5. (Optional) Click Test Connectivity to verify that the nodes are physically connected. See “Test Connectivity with the GUI” in Chapter 9. (This test requires the proper configuration of the /etc/.rhosts file.)

  6. Click Start HA Services.

  7. Click Close. Clicking on Close exits the task; it does not undo the task.

Set Up a Highly Available Resource Group


Note: Within the tasks, you can click on any blue text to get more information about that concept or input field. In every task, the cluster configuration will not update until you click OK.

The Set Up a Highly Available Resource Group task leads you through the steps required to define a resource group. It encompasses tasks that are detailed elsewhere.

Do the following:

  1. Define a new resource. See “Define a Resource”.

  2. Add any required resource dependencies. See “Add/Remove Dependencies for a Resource Definition”.

  3. Verify the resources and dependencies. See “Test Resources with the GUI” in Chapter 9.

  4. Define a failover policy to specify where the resources can run. See “Define a Failover Policy”.

  5. Test the failover policies. See “Test Failover Policies with the GUI” in Chapter 9.

    Click Next to move to the next page.

  6. Define a resource group that uses the failover policy you defined earlier. See “Define a Resource Group”.

  7. Add or remove resources in the resource group. See “Add/Remove Resources in Resource Group”.

  8. Set the resources in the resource group to start when HA services are started. See “Bring a Resource Group Online” in Chapter 8.

  9. Start FailSafe HA services if they have not already been started. See “Start FailSafe HA Services with the GUI”.

Repeat these steps for each resource group.

Set Up an Existing CXFS Cluster for FailSafe

This task appears on the GUI if you also have CXFS installed.


Note: Within the tasks, you can click on any blue text to get more information about that concept or input field. In every task, the cluster configuration will not update until you click OK.

The Set Up an Existing CXFS Cluster for FailSafe task leads you through the steps required to convert existing CXFS nodes and the cluster to FailSafe. It encompasses tasks that are detailed elsewhere.

There is a single database for FailSafe and CXFS. If a given node applies to both products, ensure that any modifications you make are appropriate for both products.

Do the following:

  1. Click Convert a CXFS Cluster to FailSafe. This will change the cluster type to CXFS and FailSafe. See “Convert a CXFS Cluster to FailSafe with the GUI”.

  2. Use the CXFS GUI (or cmgr command) to stop CXFS services on the nodes to be converted. See the CXFS Administration Guide for SGI InfiniteStorage.

  3. Click Convert a CXFS Node to FailSafe to convert the local node (the node to which you are connected). A converted node can be of type CXFS and FailSafe or FailSafe. See “Convert a CXFS Node to FailSafe with the GUI”.

  4. Click Convert a CXFS Node to FailSafe to convert another node. Repeat this step for each node you want to convert.

  5. Click Start HA Services.

Fix or Upgrade Cluster Nodes

You can use the following tasks to fix or upgrade nodes:

Make Changes to Existing Cluster

You can make most cluster changes when HA services are active. To use the destructive option in FailSafe diagnostics, you must stop HA services on all nodes in the cluster. To make changes to network configuration (IP address, hostname, network interfaces) in a FailSafe node, you must stop HA and cluster services on all nodes in the pool.

See the following:

Optimize Node Usage

You can improve cluster performance by taking advantage of a particular node's hardware. For example, one node in the cluster may have a larger disk or a faster CPU.

Depending upon your situation, you may find the following tasks useful:

Customize FailSafe Failure Detection

You can do the following to customize how FailSafe monitors and fails over resource groups:

Customize Resource Group Failover Behavior

You can use various tasks to change failover behavior in the cluster or the resource group:

You can also create a custom failover policy script:

  1. Use the FailSafe Programmer's Guide for SGI Infinite Storage to write a custom failover script.

  2. Place the scripts in the /var/cluster/ha/policies directory, as shown in the sketch after these steps.

  3. Restart the FailSafe Manager GUI.

  4. Change the desired failover policy to use your new custom failover script. See “Modify a Failover Policy Definition”.

  5. Select View: Groups owned by Nodes in the GUI view area.

  6. Test the script by moving a resource group from one node to another, simulating failover. Watch the resource group behavior in the view area to confirm that failover behavior works as expected. See “Move a Resource Group with the GUI” in Chapter 8.
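
For example, steps 1 and 2 might look like the following from the shell, where my_policy is a hypothetical script name:

# cp my_policy /var/cluster/ha/policies/my_policy
# chmod 755 /var/cluster/ha/policies/my_policy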

Customize Resource Failover Behavior

You can customize resource failover behavior by editing existing action scripts or creating new scripts. Do the following:

  1. Make a copy of the action scripts you want to modify. Action scripts for each resource type are contained in the /var/cluster/ha/resource_types directory.

  2. Edit the copies or create new scripts. See the FailSafe Programmer's Guide for SGI Infinite Storage.

  3. Place the edited/new scripts in the appropriate subdirectory in /var/cluster/ha/resource_types, as shown in the sketch after these steps.

  4. Restart the FailSafe Manager GUI.

  5. Make use of the new scripts in the resource type. See “Define a Resource Type”, and “Modify a Resource Type Definition”.

  6. Define resources using the new resource type. See “Define a Resource”.

  7. Verify that FailSafe can manage the new custom resources. See “Test Resources with the GUI” in Chapter 9.

  8. Add the new resource to a resource group. See “Add/Remove Resources in Resource Group”.
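
For example, steps 1 through 3 might look like the following from the shell; the existing NFS resource type and the NFS_custom name are hypothetical:

# cd /var/cluster/ha/resource_types
# cp -r NFS NFS_custom
# vi NFS_custom/start NFS_custom/stop NFS_custom/monitor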

Redistribute Resource Load in Cluster

After setting up resource groups and observing how they fail over, you may want to distribute the resource groups differently to balance the load among the nodes in the cluster. Do the following:

  1. Determine the current load. For example, invoke the System Manager tool from the Toolchest, then launch the graphical system monitor window by selecting the System Performance category and then the View System Resources task to view various system load statistics. For more information, see the gr_osview man page.

  2. If you want to redistribute the resource groups among the nodes, see “Move a Resource Group with the GUI” in Chapter 8.

  3. If you want to create a new failover policy that uses nodes in a different order or uses different nodes, do the following:

Node Tasks

A node is an operating system (OS) image, usually an individual computer. A node can belong to only one cluster.

This use of the term node does not have the same meaning as a node in an SGI Origin 3000 or SGI 2000 system.

This section describes the following node configuration tasks:

Define a Node

This section describes how to define a node.

Define a Node with the GUI

The first node you define must be the node that you are logged in to, so that you can perform cluster administration.


Note: Within the tasks, you can click on any blue text to get more information about that concept or input field. In every task, the cluster configuration will not update until you click OK.

To define a node, do the following:

  1. Enter the following:

    • Hostname: Hostname of the node you are defining, such as mynode.company.com (this can be abbreviated to mynode if it is resolved on all nodes).


      Note: If you attempt to define a cluster or other object before the local node has been defined, you will get an error message that says:
      No nodes are registered on servername. You cannot define a cluster
      until you define the node to which the GUI is connected. To do so,
      click "Continue" to launch the "Set Up a New Cluster" task.



    • Logical Name: The same as the hostname, or an abbreviation of the hostname (such as lilly), or an entirely different name (such as nodeA). Logical names cannot begin with an underscore (_) or include any whitespace characters, and can be at most 255 characters.


      Note: If you want to rename a node, you must delete it and then define a new node.


    • Networks for Incoming Cluster Messages: Do the following:

      • Network: Enter the IP address or hostname of the private network. (The hostname must be resolved in the /etc/hosts file.) The priorities of the networks must be the same for each node in the cluster. For information about using the hostname, see “System File Configuration” in Chapter 3. For information about why a private network is required, see “Private Network” in Chapter 1.

      • Messages to Accept: Select the appropriate type. You can use the None setting if you want to temporarily define a network but do not want it to accept messages.

      • Click Add to add the network to the list.

        If you later want to modify the network, click the network in the list to select it, then click Modify.

        If you want to delete a network from the list, click the network in the list to select it, then click Delete.

    • Node ID: (Optional) An integer in the range 1 through 32767 that is unique among the nodes in the pool. If you do not specify a number, FailSafe will calculate an ID for you. The default ID is a 5-digit number based on the machine's serial number and other machine-specific information; it is not sequential.


      Caution: You must not change the node ID number after the node has been defined.


    • Partition ID: (Optional) Uniquely defines a partition in a partitioned SGI Origin 3000 system. If your system is not partitioned, leave this field empty.


      Note: Use the mkpart command to determine the partition ID value:

      • The -n option lists the partition ID (which is 0 if the system is not partitioned).

      • The -l option lists the bricks in the various partitions (use rack#.slot# format in the GUI).

        For example (output truncated here for readability):

        # mkpart -n
        Partition id = 1
        # mkpart -l
        partition: 3 = brick: 003c10 003c13 003c16 ...
        partition: 1 = brick: 001c10 001c13 001c16 ...

        You could enter one of the following for the Partition ID field:

        1
        001.10



      Click Next to move to the next page.

    • You can choose whether or not to use the system controller port to reset the node. If you want FailSafe to be able to use the system controller to reset the node, you select the Set Reset Parameters checkbox and provide the following information:

      • This node:

        • Port Type: select L1 (L1 system controller for Origin 300, Origin 3200C, Onyx 300, and Onyx 3200C systems), L2 (L2 system controller for Origin 3400, Origin 3800, Origin 300 with NUMAlink module, and Onyx 3000 series), MSC (module system controller for Origin 200, Onyx2 deskside, and SGI 2100, 2200 deskside systems), or MMSC (multimodule system controller for rackmount SGI 2400, SGI 2800, and Onyx2 systems).

        • Reset Method: The type of reset to be performed: power cycle, serial reset, or NMI (nonmaskable interrupt)

        • Port Password: system controller password for privileged commands, not the node's root password or PROM password. On some machines, the system administrator may not have set this password. If you wish to set or change the system controller port password, consult the hardware manual for your node.

        • Temporarily Disable Port: if you want to provide reset information now but do not want to allow the reset capability at this time, check this box. If this box is checked, FailSafe cannot reset the node.

      • Owner (node that sends reset command):

        • Logical Name: name of the node that sends the remote reset command. Serial cables must physically connect the node being defined and the owner node through the system controller port. At run time, the node must be defined in the pool.

          You can select a logical name or enter the logical name of a node that is not yet defined. However, you must define the node before you run the node connectivity diagnostics task.

        • TTY Device: name of the terminal port (TTY) on the owner node to which the system controller is connected. /dev/ttyd2 is the most commonly used port, except on Origin 3000, Origin 300, and Origin 350 systems (where /dev/ttyd3 or /dev/ttyd4 is commonly used). The other end of the cable connects to this node's system controller port, so the node can be controlled remotely by the other node.

      If you do not want to use the reset function at all, click the Set System Controller Parameters box to deselect (uncheck) it.

  2. Click OK to complete the task.

You can use the hostname or the IP address as input to the network interface field. However, using the hostname requires DNS on the nodes; therefore, you may want to use the actual IP address.
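
For example, hypothetical /etc/hosts entries pairing public and private addresses for two nodes might look like the following (all names and addresses are illustrations only):

192.26.50.14    fs6.company.com fs6
192.26.50.15    fs7.company.com fs7
10.0.0.14       priv-fs6
10.0.0.15       priv-fs7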


Note: Do not add a second node until the first node icon appears in the view area. The entire cluster status information is sent each time a change is made to the cluster database; therefore, the larger the configuration, the longer it will take.


Define a Node with cmgr

To define a node, use the following commands:

define node LogicalHostname
    set hostname to Hostname
    set nodeid to NodeID
    set node_function to server_admin|client_admin
    set partition_id to PartitionID
    set reset_type to powerCycle|reset|nmi
    set sysctrl_type to msc|mmsc|l2|l1
    set sysctrl_password to Password
    set sysctrl_status to enabled|disabled
    set sysctrl_owner to Node_sending_reset_command
    set sysctrl_device to port
    set sysctrl_owner_type to tty
    set is_failsafe to true|false
    set is_cxfs to true|false
    add nic IPaddressOrHostname (if DNS)
            set heartbeat to true|false
            set ctrl_msgs to true|false
            set priority to integer
    remove nic IPaddressOrHostname (if DNS)

There are additional commands that apply to CXFS; if you are running a coexecution cluster, see CXFS Administration Guide for SGI InfiniteStorage for more information.

Usage notes:

  • LogicalHostname can be the same as the hostname (such as mynode.company.com), an abbreviation of the hostname (such as mynode), or an entirely different name (such as nodeA). Logical names cannot begin with an underscore (_) or include any whitespace characters, and can be at most 255 characters.

  • hostname is the hostname as returned by the hostname command on the node being defined. Other nodes in the pool must all be able to resolve this hostname correctly via /etc/hosts or a name resolution mechanism. The default for hostname is the value for LogicalHostname; therefore, you must supply a value for this command if you use a value other than the hostname or an abbreviation of it for LogicalHostname.

  • nodeid is an integer in the range 1 through 32767 that is unique among the nodes in the pool. If you do not specify a number, FailSafe will calculate an ID for you. The default ID is a 5-digit number based on the machine's serial number and other machine-specific information; it is not sequential.


    Caution: You must not change the node ID number after the node has been defined.


  • node_function specifies the CXFS function of the node. If you use prompting mode, you must enter one of the following:

    • server_admin for a node that you wish to use as a CXFS metadata server in a coexecution cluster.

    • client_admin for a node that will not be used as a CXFS metadata server.

    A FailSafe node cannot have the client-only function; this function is for CXFS-only nodes.

  • partition_id uniquely defines a partition in a partitioned SGI Origin 3000 system.


    Note: Use the mkpart command to determine this value:

    • The -n option lists the partition ID (which is 0 if the system is not partitioned).

    • The -l option lists the bricks in the various partitions (use rack#.slot# format in cmgr).

      For example (output truncated here for readability):

      # mkpart -n
      Partition id = 1
      # mkpart -l
      partition: 3 = brick: 003c10 003c13 003c16 ...
      partition: 1 = brick: 001c10 001c13 001c16 ...

      You could enter one of the following for the Partition ID field:

      1
      001.10


    If your system is not partitioned, use a value of 0.

    To unset the partition ID, use a value of 0 or none.

  • reset_type can be one of the following:

    • powerCycle to turn power off and on

    • reset to perform a serial reset

    • nmi to perform a nonmaskable interrupt

  • sysctrl_type is the system controller type, based on the node hardware, as shown in Table 6-1.

  • sysctrl_password is the password for the system controller port, not the node's root password or PROM password. On some nodes, the system administrator may not have set this password. If you wish to set or change the system controller password, consult the hardware manual for your node.

  • sysctrl_status is either enabled or disabled. This allows you to provide information about the system controller but temporarily disable it by setting this value to disabled (meaning that FailSafe cannot reset the node). To allow FailSafe to reset the node, enter enabled.

  • sysctrl_owner is the logical name of the node that can reset this node via the system controller port. A node may reset another node when it detects that the node is not responding to heartbeat messages or is not responding correctly to requests. A serial hardware reset cable must physically connect one of the owner's serial ports to the system controller port of the node being defined. The owner must be a node in the pool. (You can specify the name of a node that is not yet defined. However, the owner must be defined as a node before the node connectivity diagnostic test is run and before the cluster is activated.)

  • sysctrl_device is the serial port. /dev/ttyd2 is the most commonly used port, except on Origin 3000, Origin 300, and Origin 350 systems (where /dev/ttyd3 or /dev/ttyd4 is commonly used).

  • sysctrl_owner_type must be tty for clusters running FailSafe (the network selection applies to clusters running CXFS only, without FailSafe coexecution.)

  • is_failsafe and is_cxfs specify the node type. If you are running just FailSafe on this node, set is_failsafe to true. If you are running both CXFS and FailSafe on this node in a coexecution cluster, set both values to true.

  • nic is the IP address or hostname of the private network. (The hostname must be resolved in the /etc/hosts file.)

    There can be up to eight network interfaces. SGI recommends that this network be private; see “Private Network” in Chapter 1.

    The priorities of the networks must be the same for each node in the cluster. For more information about using the hostname, see “System File Configuration” in Chapter 3. For information about why a private network is required, see “Private Network” in Chapter 1.


Note: The set hierarchy command is ignored for FailSafe-only nodes.


Table 6-1. System Controller Types

l1              l2                                  mmsc                    msc
Origin 300      Origin 3400                         SGI 2400 rackmount      Origin 200
Origin 3200c    Origin 3800                         SGI 2800 rackmount      Onyx2 deskside
Onyx 300        Origin 300 with NUMAlink module     Onyx2 rackmount         SGI 2100 deskside
Onyx 3200c      Onyx 3000 series                                            SGI 2200 deskside

Use the add nic command to define the network interfaces. When you enter this command, the following prompt appears:

NIC - nic#?

When this prompt appears, you use the following commands to specify the flags for the control network:

set heartbeat to true|false
set ctrl_msgs to true|false
set priority to integer

Use the following command from the node name prompt to remove a network interface:

remove nic IPaddress 

When you have finished defining a node, enter done.

The following example defines a FailSafe node called cm1a, with one controller:

cmgr> define node cm1a
Enter commands, you may enter "done" or "cancel" at any time to exit

cm1a? set hostname to cm1a
cm1a? set nodeid to 1
cm1a? set reset_type to powerCycle
cm1a? set sysctrl_type to msc
cm1a? set sysctrl_password to [ ]
cm1a? set sysctrl_status to enabled
cm1a? set sysctrl_owner to cm2
cm1a? set sysctrl_device to /dev/ttyd2
cm1a? set sysctrl_owner_type to tty
cm1a? set is_failsafe to true
cm1a? set is_cxfs to false
cm1a? add nic cm1
Enter network interface commands, when finished enter "done" 
or "cancel"

NIC - cm1 > set heartbeat to true
NIC - cm1 > set ctrl_msgs to true
NIC - cm1 > set priority to 0
NIC - cm1 > done
cm1a? done

If you have invoked cmgr with the -p option or you entered the set prompting on command, the display appears as in the following example:

cmgr> define node cm1a
Enter commands, when finished enter either "done" or "cancel" 

Hostname[optional]? cm1a
Is this a FailSafe node <true|false> ? true
Is this a CXFS node <true|false> ? false
Node Function <server_admin|client_admin> ? client_admin
Node ID ? 1
Reset type <powerCycle|reset|nmi> ? (powerCycle)
Do you wish to define system controller info[y/n]:y
Sysctrl Type <msc|mmsc|l2|l1>? (msc) msc
Sysctrl Password [optional]? ( )
Sysctrl Status <enabled|disabled>? enabled
Sysctrl Owner? cm2
Sysctrl Device? /dev/ttyd2
Sysctrl Owner Type <tty|network>? tty
Number of Network interfaces [2]? 2
NIC 1 - IP Address? 192.56.50.1
NIC 1 - Heartbeat HB (use network for heartbeats) <true|false>? true
NIC 1 - (use network for control messages) <true|false>? true
NIC 1 - Priority <1,2,...>? 1
NIC 2 - IP Address? 192.56.50.2
NIC 2 - Heartbeat HB (use network for heartbeats) <true|false>? true
NIC 2 - (use network for control messages) <true|false>? false
NIC 2 - Priority <1,2,...>? 2

Add or Remove Nodes in the Cluster

This section describes how to add or remove nodes.

Add or Remove Nodes in the Cluster with the GUI

After you have added nodes to the pool and defined the cluster, you can indicate which of those nodes to include in the cluster.


Note: Do not add or remove nodes until the cluster icon appears in the view area; select View: Nodes in Cluster.


Do the following:

  1. Add or remove the desired nodes:

    • To add a node, select its logical name from the Available Nodes menu and click Add. The node name will appear in the Nodes to Go into Cluster list.

    • To delete a node, click on its logical name in the Nodes to Go into Cluster list. (The logical name will be highlighted.) Then click Remove.

  2. Click OK to complete the task.
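
With cmgr, the equivalent operation uses the add node and remove node subcommands of modify cluster (see “Modify a Cluster Definition with cmgr”). The following is a minimal sketch with hypothetical cluster and node names:

cmgr> modify cluster test-cluster
Enter commands, when finished enter either "done" or "cancel"

test-cluster ? add node node3
test-cluster ? remove node node2
test-cluster ? done
Successfully modified cluster test-cluster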

Modify a Node Definition

This section describes how to modify a node definition.

Modify a Node Definition with the GUI


Note: If you want to rename a node, you must delete it and then define a new node.

To modify a node, do the following:

  1. Logical Name: select the logical name of the node. After you do this, information for this node will be filled into the various fields.

  2. Change the information in the appropriate field as follows:

    • Networks for Incoming Cluster Messages: the priorities of the networks must be the same for each node in the cluster.

      • Network: if you want to add a network for incoming cluster messages, enter the IP address or hostname into the Network text field and click Add .

        • If you want to modify a network that is already in the list, click the network in the list in order to select it. Then click on Modify. This moves the network out of the list and into the text entry area. You can then change it. To add it back into the list, click Add.

        • If you want to delete a network, click on the network in the priority list in order to select it. Then click Delete.

        • If you want to change the priority of a network, click the network in the priority list in order to select it. Then click the up and down arrows in order to move it to a different position in the list.

      • Partition ID: (optional) uniquely defines a partition in a partitioned SGI Origin 3000 system. If your system is not partitioned, leave this field empty.


        Note: Use the mkpart command to determine the partition ID value:

        • The -n option lists the partition ID (which is 0 if the system is not partitioned).

        • The -l option lists the bricks in the various partitions (use rack#.slot# format in cmgr).

          For example (output truncated here for readability):

          # mkpart -n
          Partition id = 1
          # mkpart -l
          partition: 3 = brick: 003c10 003c13 003c16 ...
          partition: 1 = brick: 001c10 001c13 001c16 ... 

          You could enter one of the following for the Partition ID field:

          1
          001.10



      Click Next to move to the next page.

    • You can choose whether or not to use the system controller port to reset the node. If you want FailSafe to be able to use the system controller to reset the node, you select the Set Reset Parameters checkbox and provide the following information:

      • This node:

        • Port Type: select L1 (L1 system controller for Origin 300, Origin 3200C, Onyx 300, and Onyx 3200C systems), L2 (L2 system controller for Origin 3400, Origin 3800, Origin 300 with NUMAlink module, and Onyx 3000 series), MSC (module system controller for Origin 200, Onyx2 deskside, and SGI 2100, 2200 deskside systems), or MMSC (multimodule system controller for rackmount SGI 2400, SGI 2800, and Onyx2 systems).

        • Port Password: the password for the system controller port, not the node's root password or PROM password. On some machines, the system administrator may not have set this password. If you wish to set or change the system controller port password, consult the hardware manual for your node.

        • Temporarily Disable Port: if you want to provide reset information now but do not want to allow the reset capability at this time, check this box. If this box is checked, FailSafe cannot reset the node.

      • Owner (node that sends reset command):

        • Logical Name: name of the node that sends the remote reset command. Serial cables must physically connect the node being defined and the owner node through the system controller port. At run time, the node must be defined in the pool.

          You can select a logical name or enter the logical name of a node that is not yet defined. However, you must define the node before you run the node connectivity diagnostics task.

        • TTY Device: name of the terminal port (TTY) on the owner node to which the system controller is connected, such as /dev/ttyd2. The other end of the cable connects to this node's system controller port, so the node can be controlled remotely by the other node.

      If you do not want to use the reset function at all, click the Set System Controller Parameters box to deselect (uncheck) it.

  3. Click OK to complete the task.

Modify a Node Definition with cmgr

To modify an existing node, use the following commands:

modify node LogicalHostname
    set hostname to Hostname
    set partition_id to PartitionID
    set reset_type to powerCycle|reset|nmi
    set sysctrl_type to msc|mmsc|l2|l1
    set sysctrl_password to Password
    set sysctrl_status to enabled|disabled
    set sysctrl_owner to node_sending_reset_command
    set sysctrl_device to port
    set sysctrl_owner_type to tty
    set is_failsafe to true|false
    set is_cxfs to true|false
    add nic IPaddress_Or_Hostname (if DNS)
            set heartbeat to true|false
            set ctrl_msgs to true|false
            set priority to integer
    remove nic IPaddress_Or_Hostname (if DNS)


Note: The set hierarchy command is ignored for FailSafe-only nodes.

The commands are the same as those used to define a node. You can change any of the information you specified when defining a node except the node ID. For details about the commands, see “Define a Node with cmgr”.

There are additional commands that apply to CXFS; if you are running a coexecution cluster, see CXFS Administration Guide for SGI InfiniteStorage for more information.


Caution: To change the node ID, you must delete the node and then define the node again with the new node ID.


Example of Partitioning

The following shows an example of partitioning an SGI Origin 3000 system:

# cmgr
Welcome to SGI Cluster Manager Command-Line Interface

cmgr> modify node n_preston
Enter commands, when finished enter either "done" or "cancel"

n_preston ? set partition_id to 1
n_preston ? done

Successfully modified node n_preston

To perform this function with prompting, enter the following:

# cmgr -p
Welcome to SGI Cluster Manager Command-Line Interface

cmgr> modify node n_preston
Enter commands, you may enter "done" or "cancel" at any time to exit

Hostname[optional] ? (preston.dept.company.com) 
Is this a FailSafe node <true|false> ? (true) 
Is this a CXFS node <true|false> ? (true) 
Node ID[optional] ? (606) 
Partition ID[optional] ? (0) 1
Reset type <powerCycle|reset|nmi> ? (powerCycle) 
Do you wish to modify system controller info[y/n]:n
Number of Network Interfaces ? (2) 
NIC 1 - IP Address ? (192.168.168.2) 
NIC 1 - Heartbeat HB (use network for heartbeats) <true|false> ? (true) 
NIC 1 - (use network for control messages) <true|false> ? (true) 
NIC 1 - Priority <1,2,...> ? (1) 
NIC 2 - IP Address ? (192.168.168.1) 
NIC 2 - Heartbeat HB (use network for heartbeats) <true|false> ? (true) 
NIC 2 - (use network for control messages) <true|false> ? (true) 
NIC 2 - Priority <1,2,...> ? (2) 
Node Weight ? (1) 

Successfully modified node n_preston

cmgr> show node n_preston
Logical Machine Name: n_preston
Hostname: preston.dept.company.com
Node Is FailSafe: true
Node Is CXFS: true
Nodeid: 606
Partition id: 1
Reset type: powerCycle
ControlNet Ipaddr: 192.168.168.2
ControlNet HB: true
ControlNet Control: true
ControlNet Priority: 1
ControlNet Ipaddr: 192.168.168.1
ControlNet HB: true
ControlNet Control: true
ControlNet Priority: 2
Node Weight: 1

To unset the partition ID, use a value of 0 or none.
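
For example, to unset the partition ID of the node used above:

cmgr> modify node n_preston
Enter commands, when finished enter either "done" or "cancel"

n_preston ? set partition_id to 0
n_preston ? done

Successfully modified node n_preston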

Convert a CXFS Node to FailSafe

This section tells you how to convert a CXFS node so that it also applies to FailSafe.

Convert a CXFS Node to FailSafe with the GUI

This task appears on the GUI if you also have CXFS installed.

To convert an existing CXFS node (of type CXFS) to type CXFS and FailSafe or type FailSafe, do the following:

  1. Stop CXFS services on the node to be converted using the CXFS GUI. See the CXFS Administration Guide for SGI InfiniteStorage.

  2. Convert the node:

    • Logical Name: select the logical name of the node.

    • Keep CXFS Settings:

      • To convert to type CXFS and FailSafe, click the checkbox

      • To convert to type FailSafe, leave the checkbox blank

    • Click OK to complete the task.


Note: If you want to rename a node, you must delete it and then define a new node.

To change other parameters, see “Modify a Node Definition with the GUI”. Ensure that modifications you make are appropriate for both FailSafe and CXFS.

Convert a CXFS Node to FailSafe with cmgr

To convert an existing CXFS node so that it also applies to FailSafe, use the modify command to change the setting.


Note: You cannot turn off FailSafe or CXFS for a node if the respective HA or CXFS services are active. You must first stop the services for the node.

For example, in normal mode:

cmgr> modify node cxfs6
Enter commands, when finished enter either "done" or "cancel"

cxfs6 ? set is_FailSafe to true
cxfs6 ? done

Successfully modified node cxfs6

For example, in prompting mode:

cmgr> modify node cxfs6
Enter commands, you may enter "done" or "cancel" at any time to exit

Hostname[optional] ? (cxfs6.americas.sgi.com) 
Is this a FailSafe node <true|false> ? (false) true
Is this a CXFS node <true|false> ? (true) 
Node ID[optional] ? (13203) 
Partition ID[optional] ? (0)
Reset type <powerCycle|reset|nmi> ? (powerCycle) 
Do you wish to modify system controller info[y/n]:n
Number of Network Interfaces ? (1) 
NIC 1 - IP Address ? (cxfs6) 
NIC 1 - Heartbeat HB (use network for heartbeats) <true|false> ? (true) 
NIC 1 - (use network for control messages) <true|false> ? (true) 
NIC 1 - Priority <1,2,...> ? (1) 
Node Weight ? (0) 

Successfully modified node cxfs6

Delete a Node

This section tells you how to delete a node.

Delete a Node with the GUI

You must remove a node from a cluster before you can delete the node from the pool. For information, see “Modify a Cluster Definition”.

To delete a node, do the following:

  1. Node to Delete: select the logical name of the node to be deleted.

  2. Click OK to complete the task.

Delete a Node with cmgr

To delete a node, use the following command:

delete node Nodename

You can delete a node only if the node is not currently part of a cluster. If a cluster currently contains the node, you must first modify that cluster to remove the node from it.

For example, suppose you had a cluster named cxfs6-8 with the following configuration:

cmgr> show cluster cxfs6-8
Cluster Name: cxfs6-8
Cluster Is FailSafe: true
Cluster Is CXFS: true
Cluster ID: 20
Cluster HA mode: normal
Cluster CX mode: normal


Cluster cxfs6-8 has following 3 machine(s)
        cxfs6
        cxfs7
        cxfs8

To delete node cxfs8, you would do the following in prompting mode (assuming that CXFS services and FailSafe HA services have been stopped on the node):

cmgr> modify cluster cxfs6-8
Enter commands, when finished enter either "done" or "cancel"

Is this a FailSafe cluster <true|false> ? (true) 
Is this a CXFS cluster <true|false> ? (true) 
Cluster Notify Cmd [optional] ? 
Cluster Notify Address [optional] ? 
Cluster HA mode <normal|experimental>[optional] ? (normal) 
Cluster ID ? (20) 
Number of Cluster FileSystems ? (0) 

Current nodes in cluster cxfs6-8:
Node - 1: cxfs6
Node - 2: cxfs7
Node - 3: cxfs8

Add nodes to or remove nodes from cluster cxfs6-8
Enter "done" when completed or "cancel" to abort

cxfs6-8 ? remove node cxfs8
cxfs6-8 ? done
Successfully modified cluster cxfs6-8

cmgr> show cluster cxfs6-8
Cluster Name: cxfs6-8
Cluster Is FailSafe: true
Cluster Is CXFS: true
Cluster ID: 20
Cluster HA mode: normal


Cluster cxfs6-8 has following 2 machine(s)
        cxfs6
        cxfs7

To delete cxfs8 from the pool, enter the following:

cmgr> delete node cxfs8

IMPORTANT: NODE cannot be deleted if it is a member of a cluster.
The LOCAL node can not be deleted if some other nodes are still defined.


Deleted machine (cxfs8).

Display a Node

This section tells you how to display a node.

Display a Node with the GUI

After you define nodes, you can display the following:

  • Nodes that have been defined (Nodes in Pool)

  • Nodes that are members of a specific cluster (Nodes in Cluster)

  • Attributes of a node

Click any name or icon in the view area to see detailed status and configuration information in the details area.

Display a Node with cmgr

After you have defined a node, you can display the node's parameters with the following command:

show node Nodename

A show node command for node cm1a would yield the following display:

cmgr> show node cm1a
Logical Machine Name: cm1a
Hostname: cm1a
Node Is FailSafe: true
Node is CXFS: false
Nodeid: 1
Reset type: powerCycle
System Controller: msc
System Controller status: enabled
System Controller owner: cm2
System Controller owner device: /dev/ttyd2
System Controller owner type: tty
ControlNet Ipaddr: 192.56.50.1
ControlNet HB: true
ControlNet Control: true
ControlNet Priority: 0

You can see a list of all of the nodes that have been defined with the following command:

show nodes in pool

For example:

cmgr> show nodes in pool

3 Machine(s) defined
        cxfs8
        cxfs6
        cxfs7

You can show the nodes in the cluster with the following command:

show nodes in cluster Clustername

For example, if node cxfs8 was in the pool but not in clusterA, you would see:

cmgr> show nodes in cluster clusterA

Cluster clusterA has following 2 machine(s)
        cxfs6
        cxfs7

Cluster Tasks

The cluster is the set of nodes in the pool that have been defined as a cluster. The cluster is identified by a simple name; this name must be unique within the pool. (For example, you cannot use the same name for the cluster and for a node.)

All nodes in the cluster are also in the pool. However, not all nodes in the pool are necessarily in the cluster; that is, the cluster may consist of a subset of the nodes in the pool. There is only one cluster per pool.

This section describes the following cluster configuration tasks:

Define a Cluster

This section tells you how to define a cluster.

Define a Cluster with the GUI

A cluster is a collection of nodes coupled to each other by a private network. A cluster is identified by a simple name. A given node may be a member of only one cluster.

To define a cluster, do the following:

  1. Enter the following:

    • Cluster Name: the logical name of the cluster. The name can have a maximum length of 255 characters. Clusters that share a network and use XVM must have unique names.

    • Cluster Mode: usually, you should choose the default Normal mode.

      Choosing Experimental turns off resetting so that you can debug the cluster without causing node resets. You should only use Experimental mode when debugging.

    • Notify Administrator (of cluster and node status changes):

      • By e-mail: this choice requires that you specify the e-mail program (by default, /usr/sbin/Mail) and the e-mail addresses of those to be notified. To specify multiple addresses, separate them with commas. FailSafe will send e-mail to the addresses whenever the status changes for a node or cluster. If you do not specify an address, notification will not be sent.

      • By other command: this choice requires that you specify the command to be run whenever the status changes for a node or cluster.

      • Never: this choice specifies that notification is not sent.

  2. Click OK to complete the task. This is a long-running task that might take a few minutes to complete.

Define a Cluster with cmgr

When you define a cluster with cmgr, you define a cluster and add nodes to the cluster with the same command. For general information, see “Define a Cluster”.

Use the following commands to define a cluster:

define cluster Clustername
    set is_failsafe to true|false
    set is_cxfs to true|false
    set notify_cmd to NotifyCommand
    set notify_addr to Email_address
    set ha_mode to normal|experimental
    set cx_mode to normal|experimental
    add node Node1name
    add node Node2name
    ...

Usage notes:

  • cluster is the logical name of the cluster. Logical names cannot begin with an underscore (_) or include any whitespace characters, and can be at most 255 characters.

  • is_failsafe and is_cxfs specify the cluster type. If you are running just FailSafe, set is_failsafe to true. If you are running a coexecution cluster, set both values to true.

  • notify_cmd is the command to be run whenever the status changes for a node or cluster.

  • notify_addr is the e-mail address to be notified of cluster and node status changes. To specify multiple addresses, separate them with commas. FailSafe will send e-mail to the addresses whenever the status changes for a node or cluster. If you do not specify an address, notification will not be sent. If you use the notify_addr command, you must specify the e-mail program (by default, /usr/sbin/Mail) as the NotifyCommand.

  • set ha_mode and set cx_mode should normally be set to normal. Setting the mode to experimental turns off resetting so that you can debug the cluster without causing node resets. You should only use experimental mode when debugging. The set cx_mode command applies only to CXFS, and the set ha_mode command applies only to FailSafe.
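
For example, within a define cluster or modify cluster session, the following sketch (the address is hypothetical) configures e-mail notification:

cluster test-cluster? set notify_cmd to /usr/sbin/Mail
cluster test-cluster? set notify_addr to admin@company.com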

This is a long-running task that might take a few minutes to complete. FailSafe also adds the resource types that are installed in the node to the new cluster; this process takes time.

The following shows the commands with prompting:

cmgr> define cluster Clustername
Enter commands, you may enter "done" or "cancel" at any time to exit

Is this a FailSafe cluster <true|false> ? true|false
Is this a CXFS cluster  <true|false> ? true|false
Cluster Notify Cmd [optional] ? 
Cluster Notify Address [optional] ? 
Cluster HA mode <normal|experimental> [optional] ? normal 
No nodes in cluster Clustername

Add nodes to or remove nodes from cluster Clustername
Enter "done" when completed or "cancel" to abort

Clustername ? add node Node1name
Clustername ? add node Node2name
...
Clustername ? done
Creating resource type MAC_address
Creating resource type IP_address
Creating resource type filesystem
Creating resource type volume
Successfully defined cluster Clustername

Added node <node1name> to cluster <clustername>
Added node <node2name> to cluster <clustername>

You should set the cluster to the default normal mode. Setting the mode to experimental turns off resetting so that you can debug the cluster without causing node resets. Never use experimental mode on a production cluster; use it only when debugging and only if directed to do so by SGI customer support.

For example:

cmgr> define cluster fs6-8
Enter commands, you may enter "done" or "cancel" at any time to exit

Is this a FailSafe cluster <true|false> ? true 
Is this a CXFS cluster  <true|false> ? false
Cluster Notify Cmd [optional] ? 
Cluster Notify Address [optional] ? 
Cluster HA mode <normal|experimental> [optional] ?

No nodes in cluster fs6-8

Add nodes to or remove nodes from cluster fs6-8
Enter "done" when completed or "cancel" to abort

fs6-8 ? add node fs6
fs6-8 ? add node fs7
fs6-8 ? add node fs8
fs6-8 ? done
Creating resource type MAC_address
Creating resource type IP_address
Creating resource type filesystem
Creating resource type volume
Successfully defined cluster fs6-8

Added node <fs6> to cluster <fs6-8>
Added node <fs7> to cluster <fs6-8>
Added node <fs8> to cluster <fs6-8>

To do this without prompting, enter the following:

cmgr> define cluster fs6-8
Enter commands, you may enter "done" or "cancel" at any time to exit

cluster fs6-8? set is_failsafe to true
cluster fs6-8? add node fs6
cluster fs6-8? add node fs7
cluster fs6-8? add node fs8
cluster fs6-8? done
Creating resource type MAC_address
Creating resource type IP_address
Creating resource type filesystem
Creating resource type volume
Successfully defined cluster fs6-8

Modify a Cluster Definition

This section tells you how to modify a cluster definition.

Modify a Cluster Definition with the GUI

To change how the cluster administrator is notified of changes in the cluster's state, do the following:

  1. Cluster Name: select the name of the cluster.

  2. Cluster Mode: usually, you should set the cluster to the default Normal mode. See “Define a Cluster”, for information about Experimental mode.

  3. Notify Administrator (of cluster and node status changes):

    • By e-mail: this choice requires that you specify the e-mail program (by default, /usr/sbin/Mail) and the e-mail addresses of those to be notified. To specify multiple addresses, separate them with commas. FailSafe will send e-mail to the addresses whenever the status changes for a node or cluster. If you do not specify an address, notification will not be sent.

    • By other command: this choice requires that you specify the command to be run whenever the status changes for a node or cluster.

    • Never: this choice specifies that notification is not sent.

  4. Click OK.

To modify the nodes that make up a cluster, see “Add or Remove Nodes in the Cluster with the GUI”.


Note: If you want to rename a cluster, you must delete it and then define a new cluster.


Modify a Cluster Definition with cmgr

The commands are as follows:

modify cluster Clustername
    set is_failsafe to true
    set is_cxfs to true
    set notify_cmd to NotifyCommand
    set notify_addr to EmailAddress
    set ha_mode to normal|experimental
    set cx_mode to normal|experimental
    add node Node1name
    add node Node2name
    ...
    remove node Node1name
    remove node Node2name...

The following is an example of adding a node to a cluster in prompting mode:

cmgr> modify cluster nfs-cluster
Enter commands, you may enter "done" or "cancel" at any time to exit

Is this a FailSafe cluster <true|false> ? (true)
Is this a CXFS cluster <true|false> ? (false)
Cluster Notify Cmd [optional] ?
Cluster Notify Address [optional] ?
Cluster HA mode <normal|experimental>[optional] ? (normal)

Current nodes in cluster nfs-cluster:
Node - 1: hans1


No networks in cluster nfs-cluster

Add nodes to or remove nodes/networks from cluster nfs-cluster
Enter "done" when completed or "cancel" to abort

nfs-cluster ? add node hans2
nfs-cluster ? done
Added node <hans2> to cluster <nfs-cluster>
Successfully modified cluster nfs-cluster


Note: All references to networks in the prompting-mode output are for CXFS clusters only. You must configure FailSafe networks as part of the node definition.


Convert a CXFS Cluster to FailSafe

This section tells you how to convert a CXFS cluster so that it also applies to FailSafe.

Convert a CXFS Cluster to FailSafe with the GUI

This task appears on the GUI if you also have CXFS installed.

To convert the information from an existing CXFS cluster (that is, of type CXFS) to create a cluster that also applies to FailSafe (that is, of type CXFS and FailSafe ), do the following:

  1. Cluster Name: select the name of the cluster.

  2. Click OK to complete the task.

The cluster will apply to both FailSafe and CXFS. To modify the nodes that make up a cluster, see “Add or Remove Nodes in the Cluster”.


Note: If you want to rename a cluster, you must delete it and then define a new cluster.


Convert a CXFS Cluster to FailSafe with cmgr

To convert a cluster with cmgr, use the modify cluster command with the following subcommands:

modify cluster Clustername
    set is_failsafe to true|false
    set is_cxfs to true|false
    set clusterid to clusterID

For example, to convert CXFS cluster TEST so that it also applies to FailSafe, enter the following:

cmgr> modify cluster TEST
Enter commands, when finished enter either "done" or "cancel"

TEST ? set is_failsafe to true

The cluster must support all of the functionalities (FailSafe and/or CXFS) that are turned on for its nodes; that is, if your cluster is of type CXFS, then you cannot modify a node that is part of the cluster so that the node is of type FailSafe or CXFS and FailSafe. However, the nodes do not have to support all the functionalities of the cluster; that is, you can have a node of type CXFS in a cluster of type CXFS and FailSafe.

Delete a Cluster

This section tells you how to delete a cluster.

Delete a Cluster with the GUI

You cannot delete a cluster that contains nodes; you must first remove all nodes from the cluster. See “Add or Remove Nodes in the Cluster with the GUI”.

To delete a cluster, do the following:

  1. Cluster to Delete: select the cluster name.

  2. Click OK to complete the task.

Delete a Cluster with cmgr

You cannot delete a cluster that contains nodes; you must first remove all nodes from the cluster.

To delete a cluster, use the following command:

delete cluster Clustername 

Example in normal mode:

cmgr> modify cluster fs6-8
Enter commands, when finished enter either "done" or "cancel"

fs6-8 ? remove node fs6
fs6-8 ? remove node fs7
fs6-8 ? remove node fs8
fs6-8 ? done
Successfully modified cluster fs6-8

cmgr> delete cluster fs6-8

cmgr> show clusters

cmgr>

Example using prompting:

cmgr> modify cluster fs6-8
Enter commands, you may enter "done" or "cancel" at any time to exit

Cluster Notify Cmd [optional] ? 
Cluster Notify Address [optional] ? 
Cluster HA mode <normal|experimental>[optional] ? (normal) 

Current nodes in cluster  fs6-8:
Node - 1: fs6
Node - 2: fs7
Node - 3: fs8

Add nodes to or remove nodes from cluster fs6-8
Enter "done" when completed or "cancel" to abort

fs6-8 ? remove node fs6
fs6-8 ? remove node fs7
fs6-8 ? remove node fs8
fs6-8 ? done
Successfully modified cluster fs6-8

cmgr> delete cluster fs6-8

cmgr> show clusters

cmgr>

Display a Cluster

This section tells you how to display a cluster.

Display a Cluster with the GUI

The GUI provides a convenient display of a cluster and its components. From the View selection, you can choose elements within the cluster to examine. To view details of the cluster, click on the cluster name or icon.

The status details will appear in the details area on the right side of the GUI screen.

Display a Cluster with cmgr

After you have defined a cluster, you can display the nodes in that cluster with the following commands:

show clusters
show cluster Clustername

For example:

cmgr> show clusters

1 Cluster(s) defined
        nfs-cluster

cmgr> show cluster nfs-cluster
Cluster Name: nfs-cluster
Cluster Is FailSafe: true
Cluster Is CXFS: false
Cluster HA mode: normal

Cluster nfs-cluster has following 2 machine(s)
        hans2
        hans1

Resource Type Tasks

A resource type is a particular class of resource. All of the resources in a particular resource type can be handled in the same way for the purposes of failover. Every resource is an instance of exactly one resource type.

This section describes the following resource type tasks:

Define a Resource Type

This section describes how to define a resource type.

Define a Resource Type with the GUI

The FailSafe software includes many predefined resource types. Resource types in the cluster are created for the FailSafe plug-ins installed in the node using the /usr/cluster/bin/cdb-create-resource-type script. Resource types that were not created when the cluster was configured can be added later using the resource type install command, as described in “Load a Resource Type with the GUI”.

If these predefined resource types fit the application you want to make into an HA service, you can reuse them. If none fits, you can define additional resource types. Complete information on defining resource types is provided in the FailSafe Programmer's Guide for SGI Infinite Storage. This manual provides a summary of that information.

To define a new resource type, do the following:

  1. Resource Type: specify the name of the new resource type, with a maximum length of 255 characters.

    Click Next to move to the next page.

  2. Specify settings for required actions (time values are in milliseconds):

    • Start/Stop Order: order of performing the action scripts for resources of this type in relation to resources of other types:

      • Resources are started in the increasing order of this value.

      • Resources are stopped in the decreasing order of this value.

      See the FailSafe Programmer's Guide for SGI Infinite Storage for a full description of the order ranges available.

    • Start Timeout: the maximum duration for starting a resource of this type.

    • Stop Timeout: the maximum duration for stopping a resource of this type.

    • Exclusive Timeout: the maximum duration for verifying that a resource of this type is not already running.

    • Monitor Timeout: the maximum duration for monitoring a resource of this type.

    • Monitor Interval: the amount of time between successive executions of the monitor action script; this is only valid for the monitor action script.

    • Monitor Start Time: the amount of time between starting a resource and beginning monitoring of that resource.

      Click Next to move to the next page.

  3. Specify settings for optional actions as needed:

    • Restart Enabled: check the box to enable restarting of the resource. You should enable restart if you want a resource of this type to automatically be restarted on the current node after a monitoring failure. Enabling restart can decrease application downtime.

      For example, suppose FailSafe detects that a resource's monitor action has failed:

      • If restart is disabled, FailSafe will immediately attempt to move the whole group to another node in the failover domain. The application will be down until the entire group is failed over.

      • If restart is enabled, FailSafe will attempt to restart the resource on the current node where the rest of the resource group is running. If this succeeds, the resource group will be made available as soon as the resource restarts; if this fails, only then will FailSafe attempt to move the whole group to another node in the failover domain.

      The local restart flag enables local failover:

      • If local restart is enabled and the resource monitor script fails, SRMD executes the restart script for the resource.

      • If the restart script is successful, SRMD continues to monitor the resource.

      • If the restart script fails or the restart count is exhausted, SRMD sends a resource group monitoring error to FSD. FSD itself is not involved in local failover.

      To determine the number of local monitoring failures, use the cmgr command show status of resource; for more information, see “Querying Resource Status with cmgr” in Chapter 8.

      When a resource is restarted, the other resources in the resource group are not restarted. It is not possible to perform a local restart of a resource by using the GUI or cmgr.

      If you find that you need to reset the restart counter for a resource type, you can put the resource group in maintenance mode and then remove it from maintenance mode. This process resets the counters for all resources in the resource group. For information on putting a resource group in maintenance mode, see “Suspend and Resume Monitoring of a Resource Group” in Chapter 8.

    • Restart Timeout: the maximum amount of time to wait before restarting the resource after a monitoring failure occurs.

    • Restart Count: the maximum number of attempts that FailSafe will make to restart a resource of this type on the current node. Enter an integer greater than zero.

    • Probe Enabled: check if you want FailSafe to verify that a resource of this type is configured on a node.

    • Probe Timeout: the maximum amount of time for FailSafe to attempt to verify that a resource of this type is configured on a node.

  4. Change settings for type-specific attributes: specify any attributes specific to the resource type. You must provide the following for each attribute:

    • Attribute key: name of the attribute

    • Data Type: select either String or Integer

    • Default Value: optionally, provide a default value

    For example, NFS requires the following attributes:

    • export-point, which takes a value that defines the export disk name. This name is used as input to the exportfs command. For example:

      export-point = /this_disk

    • export-info, which takes a value that defines the export options for the filesystem. These options are used in the exportfs command. For example:

      export-info = rw,wsync,anon=root

    • filesystem, which takes a value that defines the raw filesystem. This name is used as input to the mount command. For example:

      filesystem = /dev/xlv/xlv_object

    Click Add to add the attribute, and repeat as necessary for other attributes.

  5. Click OK to complete the task.

Define a Resource Type with cmgr

Use the following commands:

define resource_type RTname on node Nodename [in cluster Clustername]

define resource_type RTname [in cluster Clustername]
   set order to start/stop_Order_Number
   set restart_mode to 0|1
   set restart_count to Number_Of_Attempts
   add action ActionScriptname
           set exec_time to ExecutionTimeout
           set monitor_interval to MonitorInterval
           set monitor_time to MonitorTime
   add type_attribute Type-specific_Attributename
           set data_type to string|integer
           set default_value to Default
   add dependency RTname
   remove action ActionScriptname
   remove type_attribute Type-specific_Attributename
   remove dependency DependencyName

Usage notes:

  • resource_type is the name of the resource type to be defined, with a maximum length of 255 characters.

  • order is the order of performing the action scripts for resources of this type in relation to resources of other types:

    • Resources are started in the increasing order of this value

    • Resources are stopped in the decreasing order of this value

    See the FailSafe Programmer's Guide for SGI Infinite Storage for a full description of the order ranges available.

  • restart_mode is as follows:

    • 0 = Do not restart on monitoring failures (disable restart)

    • 1 = Restart a fixed number of times (enable restart)

    You should enable restart if you want a resource of this type to automatically be restarted on the current node after a monitoring failure. Enabling restart can decrease application downtime.

  • restart_count is the maximum number of attempts that FailSafe will make to restart a resource of this type on the current node. Enter an integer greater than zero.

  • action is the name of the action script (exclusive, start, stop, monitor, or restart). For more information, see “Action Scripts” in Chapter 1. The following time values are in milliseconds:

    • exec_time is the maximum time for executing the action script

    • monitor_interval is the amount of time between successive executions of the monitor action script (this is valid only for the monitor action script)

    • monitor_time is the amount of time between starting a resource and beginning monitoring of that resource

  • type_attribute is a type-specific attribute

    • data_type is either string or integer

    • default_value is the default value for the attribute

  • dependency adds a dependency upon the specified resource type (RTname)

By default, the resource type will apply across the cluster; if you wish to limit the resource type to a specific node, enter the node name when prompted. If you wish to enable restart mode, enter 1 when prompted.

For an example in normal mode, see the template for the cmgr command in the following file:

/var/cluster/cmgr-templates/cmgr-create-resource_type


Note: The cmgr-create-resource_type script provides a general mechanism for creating a resource type. Each existing resource type has a create_resource_type script in its directory, such as /var/cluster/ha/resource_types/statd_unlimited/create_resource_type.

The following example in prompting mode shows only the prompts and answers for two action scripts (start and stop) for a new resource type named newresourcetype.

cmgr> define resource_type newresourcetype

(Enter "cancel" at any time to abort)

Node[optional]?
Order ? 300
Restart Mode ? (0)
 
DEFINE RESOURCE TYPE OPTIONS
 
        0) Modify Action Script.
        1) Add Action Script.
        2) Remove Action Script.
        3) Add Type Specific Attribute.
        4) Remove Type Specific Attribute.
        5) Add Dependency.
        6) Remove Dependency.
        7) Show Current Information.
        8) Cancel. (Aborts command)
        9) Done. (Exits and runs command)
 
Enter option:1
 
No current resource type actions
 
Action name ? start
Executable timeout (in milliseconds) ? 40000
 
        0) Modify Action Script.
        1) Add Action Script.
        2) Remove Action Script.
        3) Add Type Specific Attribute.
        4) Remove Type Specific Attribute.
        5) Add Dependency.
        6) Remove Dependency.
        7) Show Current Information.
        8) Cancel. (Aborts command)
        9) Done. (Exits and runs command)
 
Enter option:1
 
Current resource type actions:
        start
 
Action name ? stop
Executable timeout (in milliseconds) ? 40000
 
        0) Modify Action Script.
        1) Add Action Script.
        2) Remove Action Script.
        3) Add Type Specific Attribute.
        4) Remove Type Specific Attribute.
        5) Add Dependency.
        6) Remove Dependency.
        7) Show Current Information.
        8) Cancel. (Aborts command)
        9) Done. (Exits and runs command)
 
Enter option:3
 
No current type specific attributes
 
Type Specific Attribute ? integer-att
Datatype ? integer
Default value[optional] ? 33 
 
        0) Modify Action Script.
        1) Add Action Script.
        2) Remove Action Script.
        3) Add Type Specific Attribute.
        4) Remove Type Specific Attribute.
        5) Add Dependency.
        6) Remove Dependency.
        7) Show Current Information.
        8) Cancel. (Aborts command)
        9) Done. (Exits and runs command)
 
Enter option:3
 
Current type specific attributes:
        Type Specific Attribute - 1: integer-att
 
Type Specific Attribute ? string-att
Datatype ? string
Default value[optional] ? rw
 
        0) Modify Action Script.
        1) Add Action Script.
        2) Remove Action Script.
        3) Add Type Specific Attribute.
        4) Remove Type Specific Attribute.
        5) Add Dependency.
        6) Remove Dependency.
        7) Show Current Information.
        8) Cancel. (Aborts command)
        9) Done. (Exits and runs command)
 
Enter option:5
 
No current resource type dependencies
 
Dependency name ? filesystem
 
        0) Modify Action Script.
        1) Add Action Script.
        2) Remove Action Script.
        3) Add Type Specific Attribute.
        4) Remove Type Specific Attribute.
        5) Add Dependency.
        6) Remove Dependency.
        7) Show Current Information.
        8) Cancel. (Aborts command)
        9) Done. (Exits and runs command)
 
Enter option:7

Current resource type actions:
        Action - 1: start
        Action - 2: stop

Current type specific attributes:
        Type Specific Attribute - 1: integer-att
        Type Specific Attribute - 2: string-att

No current resource type dependencies

Resource dependencies to be added:
        Resource dependency - 1: filesystem

        0) Modify Action Script.
        1) Add Action Script.
        2) Remove Action Script.
        3) Add Type Specific Attribute.
        4) Remove Type Specific Attribute.
        5) Add Dependency.
        6) Remove Dependency.
        7) Show Current Information.
        8) Cancel. (Aborts command)
        9) Done. (Exits and runs command)
 
Enter option:9
Successfully defined resource_type newresourcetype
 
cmgr> show resource_types
 
template
MAC_address
newresourcetype
IP_address
filesystem
volume

cmgr> exit
# 

Redefine a Resource Type for a Specific Node

This section describes how to define a resource type that applies to a specific node. You must connect the GUI to, or execute the cmgr command on, the node for which the resource type will be redefined.

Redefine a Resource Type for a Specific Node with the GUI

This task lets you take an existing clusterwide resource type and redefine it for use on the local node.

A resource type that is redefined for a specific node overrides a clusterwide definition with the same name; this allows an individual node to override global settings from a clusterwide resource type definition. You can use this feature if you want to have different script timeouts for a node or you want to restart a resource on only one node in the cluster.

For example, the IP_address resource has local restart enabled by default. If you would like to have an IP_address type without local restart for a particular node, you can make a copy of the IP_address clusterwide resource type with all of the parameters the same except for restart mode, which you set to 0.

Do the following:

  1. Local Node: the name of the local node is filled in for you. (If you want to make the resource type specific to a different node, you must connect the GUI to that node.)

  2. Clusterwide Resource Type: select the name of the resource type you want to redefine for the local node.

    Click Next to move to the next page.

  3. Change settings for required actions as needed (time values are in milliseconds):

    • Start/Stop Order: order of performing the action scripts for resources of this type in relation to resources of other types:

      • Resources are started in the increasing order of this value

      • Resources are stopped in the decreasing order of this value

      See the FailSafe Programmer's Guide for SGI Infinite Storage for a full description of the order ranges available.

    • Start Timeout: the maximum duration for starting a resource of this type.

    • Stop Timeout: the maximum duration for stopping a resource of this type.

    • Exclusive Timeout: the maximum duration for verifying that a resource of this type is not already running.

    • Monitor Timeout: the maximum duration for monitoring a resource of this type.

    • Monitor Interval: the amount of time between successive executions of the monitor action script; this is only valid for the monitor action script.

    • Monitor Start Time: the amount of time between starting a resource and beginning monitoring of that resource.

      Click Next to move to the next page.

  4. Change settings for optional actions as needed:

    • Restart Enabled: check the box to enable restarting of the resource. You should enable restart if you want a resource of this type to automatically be restarted on the current node after a monitoring failure. Enabling restart can decrease application downtime.

      For example, suppose FailSafe detects that a resource's monitor action has failed:

      • If restart is disabled, FailSafe will immediately attempt to move the whole group to another node in the failover domain. The application will be down until the entire group is failed over.

      • If restart is enabled, FailSafe will attempt to restart the resource on the current node where the rest of the resource group is running. If this succeeds, the resource group will be made available as soon as the resource restarts; if this fails, only then will FailSafe attempt to move the whole group to another node in the failover domain.

      The local restart flag enables local failover:

      • If local restart is enabled and the resource monitor script fails, SRMD executes the restart script for the resource.

      • If the restart script is successful, SRMD continues to monitor the resource.

      • If the restart script fails or the restart count is exhausted, SRMD sends a resource group monitoring error to FSD. FSD itself is not involved in local failover.

      When a resource is restarted, the other resources in the resource group are not restarted. It is not possible to perform a local restart of a resource by using the GUI or cmgr.

      If you find that you need to reset the restart counter for a resource type, you can put the resource group in maintenance mode and then remove it from maintenance mode. This process resets the counters for all resources in the resource group. For information on putting a resource group in maintenance mode, see “Suspend and Resume Monitoring of a Resource Group” in Chapter 8.

    • Restart Timeout: the maximum amount of time to wait before restarting the resource after a monitoring failure occurs.

    • Restart Count: the maximum number of attempts that FailSafe will make to restart a resource of this type on the current node. Enter an integer greater than zero.

    • Probe Enabled: check if you want FailSafe to verify that a resource of this type is configured on a node.

    • Probe Timeout: the maximum amount of time for FailSafe to attempt to verify that a resource of this type is configured on a node.

  5. Change settings for type-specific attributes; specify any attributes specific to the resource type. You must provide the following for each attribute:

    • Attribute key: specify the name of the attribute

    • Data Type: select either String or Integer

    • Default Value: optionally, provide a default value

    For example, NFS requires the following attributes:

    • export-point, which takes a value that defines the export disk name. This name is used as input to the exportfs command. For example:

      export-point = /this_disk

    • export-info, which takes a value that defines the export options for the filesystem. These options are used in the exportfs command. For example:

      export-info = rw,wsync,anon=root

    • filesystem, which takes a value that defines the raw filesystem. This name is used as input to the mount command. For example:

      filesystem = /dev/xlv/xlv_object

    Click Add to add the attribute, and repeat as necessary for other attributes.

  6. Click OK to complete the task.

Define a Node-Specific Resource Type with cmgr

With cmgr, you redefine a resource type for a specific node in much the same way as you define a clusterwide resource type, except that you specify a node on the command line. You must execute the command on that node.

Use the following command to define a node-specific resource type:

define resource_type RTname on node Nodename [in cluster Clustername]
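
For example, to disable local restart of IP_address resources on one node only, you could redefine the IP_address resource type for that node. The following minimal sketch assumes the node hans1 and the cluster nfs-cluster from earlier examples; a complete redefinition would also supply the order, action-script, and attribute settings described in “Define a Resource Type with cmgr”:

define resource_type IP_address on node hans1 in cluster nfs-cluster
        set restart_mode to 0
done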

Add/Remove Dependencies for a Resource Type

This section describes how to add dependencies to a resource type.

Add/Remove Dependencies for a Resource Type with the GUI

Like resources, a resource type can be dependent on one or more other resource types. If such a dependency exists, at least one instance of each of the dependent resource types must be defined.

For example, a resource type named Netscape_web might have resource type dependencies on the resource types named IP_address and volume. If a resource named ws1 is defined with the Netscape_web resource type, then the resource group containing ws1 must also contain at least one resource of the type IP_address and one resource of the type volume. Figure 6-1 shows these dependencies.

Figure 6-1. Dependencies


Enter the following information:

  1. Resource Type: select the resource type.

  2. Dependency Type: select the dependency type. Click Add to add the dependency to the list, or click Delete to remove it from the list.

  3. Click OK to complete the task.

Add/Remove Dependencies for a Resource Type with cmgr

When using cmgr, you add or remove dependencies when you define or modify the resource type.

For example, suppose the NFS resource type in nfs-cluster has a resource type dependency on the filesystem resource type. To change the NFS resource type to have a dependency on the IP_address resource type instead (and not on filesystem), do the following:

cmgr> show resource_type NFS in cluster nfs-cluster

Name: NFS
Predefined: true
....

Resource type dependencies
        filesystem

cmgr> modify resource_type NFS in cluster nfs-cluster
Enter commands, when finished enter either "done" or "cancel"

resource_type NFS ? remove dependency filesystem
resource_type NFS ? add dependency IP_address
resource_type NFS ? done
Successfully modified resource_type NFS

cmgr> show resource_type NFS in cluster nfs-cluster

Name: NFS
Predefined: true
....

Resource type dependencies
        IP_address

Load a Resource Type

This section describes how to install (load) a resource type.

Load a Resource Type with the GUI

When you define a cluster, FailSafe installs a set of resource type definitions that you can use; these definitions include default values. If you need to install additional standard SGI-supplied resource type definitions on the cluster, or if you delete a standard resource type definition and wish to reinstall it, you can load that resource type definition on the cluster.

The resource type definition you are loading cannot already exist on the cluster.

Load a Resource Type with cmgr

Use the following command to install a resource type on a cluster:

install resource_type RTname [in cluster Clustername]
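
For example, to reinstall the SGI-supplied NFS resource type definition in the cluster used in earlier examples (the cluster name is illustrative):

cmgr> install resource_type NFS in cluster nfs-cluster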

Modify a Resource Type Definition

This section describes how to modify a resource type.

Modify a Resource Type with the GUI

The process of modifying a resource type is similar to the process of defining a resource type.

Enter the following (time values are in milliseconds):

  1. Resource Type: select the name of the resource type to be modified.

    Click Next to move to the next page. The current settings for each field will be filled in for you.

  2. Start/Stop Order: order of performing the action scripts for resources of this type in relation to resources of other types:

    • Resources are started in the increasing order of this value.

    • Resources are stopped in the decreasing order of this value.

    See the FailSafe Programmer's Guide for SGI Infinite Storage for a full description of the order ranges available.

  3. Start Timeout: the maximum duration for starting a resource of this type.

  4. Stop Timeout: the maximum duration for stopping a resource of this type.

  5. Exclusive Timeout: the maximum duration for verifying that a resource of this type is not already running.

  6. Monitor Timeout: the maximum duration for monitoring a resource of this type.

  7. Monitor Interval: the amount of time between successive executions of the monitor action script; this is valid only for the monitor action script.

  8. Monitor Start Time: the amount of time between starting a resource and beginning monitoring of that resource.

    Click Next to move to the next page.

  9. Enable Restart: check the box to enable restarting of the resource. You should enable restart if you want a resource of this type to automatically be restarted on the current node after a monitoring failure. Enabling restart can decrease application downtime.

    For example, suppose FailSafe detects that a resource's monitor action has failed:

    • If restart is disabled, FailSafe will immediately attempt to move the whole group to another node in the failover domain. The application will be down until the entire group is failed over.

    • If restart is enabled, FailSafe will attempt to restart the resource on the current node where the rest of the resource group is running. If this succeeds, the resource group will be made available as soon as the resource restarts; if this fails, only then will FailSafe attempt to move the whole group to another node in the failover domain.

    The local restart flag enables local failover:

    • If local restart is enabled and the resource monitor script fails, SRMD executes the restart script for the resource.

    • If the restart script is successful, SRMD continues to monitor the resource.

    • If the restart script fails or the restart count is exhausted, SRMD sends a resource group monitoring error to FSD. FSD itself is not involved in local failover.

    When a resource is restarted, the other resources in the resource group are not restarted. It is not possible to perform a local restart of a resource by using the GUI or cmgr.

    If you find that you need to reset the restart counter for a resource type, you can put the resource group in maintenance mode and then remove it from maintenance mode. This process resets the counters for all resources in the resource group. For information on putting a resource group in maintenance mode, see “Suspend and Resume Monitoring of a Resource Group” in Chapter 8.

  10. Restart Timeout: the maximum amount of time to wait before restarting the resource after a monitoring failure occurs.

  11. Restart Count: the maximum number of attempts that FailSafe will make to restart a resource of this type on the current node. Enter an integer greater than zero.

  12. Probe Enabled: check if you want FailSafe to verify that a resource of this type is configured on a node.

  13. Probe Timeout: the maximum amount of time for FailSafe to attempt to verify that a resource of this type is configured on a node.

  14. Type-Specific Attributes: specify new attributes that are specific to the resource type, or modify an existing attribute by selecting its name. You must provide the following for each attribute:

    • Attribute key: specify the name of the attribute

    • Data Type: select either String or Integer

    • Default Value: (optional) provide a default value for the attribute


    Note: You cannot modify the type-specific attributes if there are any existing resources of this type.

    Click Add to add the attribute or Modify to modify the attribute, and repeat as necessary for other attributes. Click OK to complete the definition.

Modify a Resource Type with cmgr

Use the following commands to modify a resource type:

modify resource_type RTname [in cluster Clustername]
   set order to start/stop_OrderNumber
   set restart_mode to 0|1
   set restart_count to Number_Of_Attempts
   add action ActionScriptname
           set exec_time to ExecutionTimeout
           set monitor_interval to MonitorInterval
           set monitor_time to MonitorTime
   modify action ActionScriptname
           set exec_time to ExecutionTimeout
           set monitor_interval to MonitorInterval
           set monitor_time to MonitorTime
   add type_attribute Type-specificAttributename
           set data_type to string|integer
           set default_value to Default
   add dependency RTname
   remove action ActionScriptname
   remove type_attribute Type-specific_Attributename
   remove dependency Dependencyname

You modify a resource type using the same commands you use to define a resource type. See “Define a Resource Type with cmgr”.

You can display the current values of the resource type timeouts, allowing you to modify any of the action timeouts.

The following example shows how to increase the statd_unlimited resource type monitor executable timeout from 40 seconds to 60 seconds.

cmgr> modify resource_type statd_unlimited in cluster test-cluster
Enter commands, when finished enter either "done" or "cancel"

resource_type statd_unlimited? modify action monitor
Enter action parameters, when finished enter "done" or "cancel"

Current action monitor parameters:
        exec_time : 40000ms
        monitor_interval : 20000ms
        monitor_time : 50000ms

Action - monitor ? set exec_time to 60000
Action - monitor ? done
resource_type statd_unlimited ? done
Successfully modified resource_type statd_unlimited

The following example shows how to modify the resource type timeouts in prompting mode.

cmgr> modify resource_type statd_unlimited in cluster test-cluster


(Enter "cancel" at any time to abort)

Node[optional] ? 
Order ? (411) 
Restart Mode ? (0) 

MODIFY RESOURCE TYPE OPTIONS

        0) Modify Action Script.
        1) Add Action Script.
        2) Remove Action Script.
        3) Add Type Specific Attribute.
        4) Remove Type Specific Attribute.
        5) Add Dependency.
        6) Remove Dependency.
        7) Show Current Information.
        8) Cancel. (Aborts command)
        9) Done. (Exits and runs command)

Enter option:0
Current resource type actions:        
        stop
        exclusive
        start
        restart
        monitor

Action name ? monitor
Executable timeout (in milliseconds) ? (40000ms) 60000
Monitoring Interval (in milliseconds) ? (20000ms) 
Start Monitoring Time (in milliseconds) ? (50000ms) 

        0) Modify Action Script.
        1) Add Action Script.
        2) Remove Action Script.
        3) Add Type Specific Attribute.
        4) Remove Type Specific Attribute.
        5) Add Dependency.
        6) Remove Dependency.
        7) Show Current Information.
        8) Cancel. (Aborts command)
        9) Done. (Exits and runs command)

Enter option:9
Successfully modified resource_type statd_unlimited

Delete a Resource Type

This section describes how to delete a resource type.

Delete a Resource Type with the GUI

To delete a resource type with the GUI, enter the following:

  1. Resource Type to Delete: select the name of the resource type that you want to delete.


    Note: If you select a resource type that has been redefined for the local node, that special definition of the resource type will be deleted and the clusterwide resource type will be used instead.

    You cannot delete a clusterwide resource type if there are any resources of that type.


  2. Click OK to complete the task.

Delete a Resource Type with cmgr

Use the following command to delete a resource type:

delete resource_type RTname [in cluster Clustername]

Display a Resource Type

This section describes how to display resource types.

Display Resource Types with the GUI

Select View: Resource Types. You can then click on any of the resource type icons in the view area to examine the parameters of the resource type.

Display Resource Types with cmgr

Use the following commands to view resource types:

  1. To view the parameters of a defined resource type (RTname):

    show resource_type RTname [in cluster Clustername]

  2. To view all of the defined resource types:

    show resource_types [in cluster Clustername]

  3. To view all of the defined resource types that have been installed:

    show resource_types installed
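
For example, the following sketch lists the resource types defined in the example cluster; the output shown is illustrative and will vary with the plug-ins installed:

cmgr> show resource_types in cluster nfs-cluster

template
MAC_address
IP_address
filesystem
volume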

Resource Tasks

A resource is a single physical or logical entity that provides a service to clients or other resources. A resource is generally available for use on two or more nodes in a cluster, although only one node controls the resource at any given time. For example, a resource can be a single disk volume, a particular network address, or an application such as a web server.

This section describes the following resource tasks:

Define a Resource

This section describes how to define a new resource.

Define a Resource with the GUI

Resources are identified by a resource type and a resource name. A resource name identifies a specific instance of a resource type. All of the resources in a given resource type can be handled in the same way for the purposes of failover.

By default, resources apply clusterwide. However, you can use the Redefine a Resource for a Specific Node task to apply the resource to just the local node; see “Redefine a Resource for a Specific Node”.

Provide appropriate values for the following:

  1. Resource Type: the name of the resource type.

    A resource type can be defined for a specific logical node, or it can be defined for an entire cluster. A resource type that is defined for a node will override a clusterwide resource type definition of the same name; this allows an individual node to override global settings from a clusterwide resource type definition.

    The FailSafe software includes many predefined resource types, which are listed in the GUI. If these types fit the application you want to make into an HA service, you can reuse them. If none fits, you can define additional resource types.

  2. Resource: the name of the resource to define, with a maximum length of 255 characters. The name cannot begin with an underscore. XVM resource names must not begin with a slash (/).

    A resource is a single physical or logical entity that provides a service to clients or other resources. Examples include a single disk volume, a particular network HA IP address, or a specific application such as a web server. Particular resource types may have other naming requirements; see the sections below.

    You can define up to 100 resources in a FailSafe configuration.

    Click Next to move to the next page.

  3. Type-specific attributes: enter the attributes that apply to this resource. The following sections describe attributes for each resource type provided in the base FailSafe release; other attributes are available with FailSafe plug-in releases and are described in the documentation supplied with those releases. You can specify attributes for new resource types you create.

  4. Click OK to complete the task.

CXFS Attributes

The CXFS resource is the mount point of the CXFS filesystem, such as /shared_CXFS. In the Relocate Metadata server? field, you must specify whether the metadata server of the CXFS filesystem should be relocated (true) or not (false).

filesystem Attributes

The filesystem resource must be an XFS filesystem.

Any XFS filesystem that must be highly available should be configured as a filesystem resource. All XFS filesystems that you use as a filesystem resource must be created on XLV volumes on shared disks.

When you define a filesystem resource, the name of the resource should be the mount point of the filesystem. For example, an XFS filesystem that was created on an XLV volume xlv_vol and mounted on the /shared1 directory will have the resource name /shared1.

Specify the following parameters:

  • Volume Name: the name of the XLV volume associated with the filesystem. For example, for the filesystem created on the XLV volume xlv_vol, the volume name attribute will be xlv_vol as well.

  • Mount Options: the mount options to be passed to the -o option of the mount command when mounting the filesystem. The list of available options is provided on the fstab man page. The default is rw.

  • kill-nfsd-before-umount:

    • true stops the nfsd NFS server processes on the server before the filesystem is unmounted and then restarts the nfsd daemons after the umount is completed. This is the default.

    • false unmounts the filesystem without changing the nfsd status.

  • Monitoring Level: the monitoring level to be used for the filesystem.

    • 1 specifies to check whether the filesystem exists in /etc/mtab, as described in the mtab man page.

    • 2 specifies to check whether the filesystem is mounted using the stat command. This is a more-intrusive check that is more reliable if it completes on time; however, some loaded systems have been known to have problems with this level.
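
The same attributes can be set with cmgr when defining the resource (see “Define a Resource with cmgr” and Table 6-2). A minimal sketch for the /shared1 example above, assuming the nfs-cluster cluster from earlier examples:

define resource /shared1 of resource_type filesystem in cluster nfs-cluster
        set volume-name to xlv_vol
        set mount-options to rw
        set monitoring-level to 2
done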

IP_address Attributes

The IP_address resources are the IP addresses used by clients to access the HA services within the resource group. These HA IP addresses are moved from one node to another along with the other resources in the resource group when a failure is detected.

You specify the resource name of an IP_address resource in dot (“.”) notation. IP names that require name resolution should not be used. For example, 192.26.50.1 is a valid resource name of the IP_address resource type.

The HA IP address you define as a FailSafe resource must not be the same as the IP address of a node hostname or the IP address of a node's control network.

Specify the following parameters:

  • Network Mask: the network mask of the HA IP address.

  • Interfaces: a comma-separated list of interfaces on which the HA IP address can be configured. This ordered list is a superset of all the interfaces on all nodes where this HA IP address might be allocated. You can specify multiple interfaces to configure local restart of the HA IP address, if those interfaces are on the same node. However, using local IP failover requires that you make changes to the start and restart action scripts for the IP_address resource; see “Example: Local Failover of HA IP Address” in Chapter 7.

    The order of the list of interfaces determines the priority order for determining which HA IP address will be used for local restarts of the node.

  • Broadcast Address: the broadcast address for the HA IP address
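
In cmgr, the corresponding attribute names are NetworkMask, interfaces, and BroadcastAddress (see Table 6-2). A minimal sketch for the example address above; the mask, interface list, and broadcast values are illustrative:

define resource 192.26.50.1 of resource_type IP_address in cluster nfs-cluster
        set NetworkMask to 0xffffff00
        set interfaces to ef0,eg0
        set BroadcastAddress to 192.26.50.255
done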

MAC_address Attributes

The MAC address is the link-level address of the network interface. If MAC addresses are to be failed over, dedicated network interfaces are required.

The resource name of a MAC address is the MAC address of the interface. You can obtain MAC addresses by using the ha_macconfig2 command.

You must specify the Interface attribute, which is the interface that has to be reMAC-ed.

Only Ethernet interfaces are capable of undergoing the reMAC process.

volume Attributes

The volume resource type is the XLV volume used by the resources in the resource group.

When you define a volume resource, the resource name should be the name of the XLV volume. Do not specify the XLV device file name as the resource name. For example, the resource name for a volume might be xlv_vol but not /dev/xlv/xlv_vol or /dev/dsk/xlv/xlv_vol.

When an XLV volume is assembled on a node, a file is created in /dev/xlv. Even when you configure a volume resource in a FailSafe cluster, you can view that volume from only one node at a time, unless a failover has occurred.

You may be able to view a volume name in /dev/xlv on two different nodes after failover because when an XLV volume is shut down, the filename is not removed from that directory. Hence, more than one node may have the volume filename in its directory. However, only one node at a time will have the volume assembled. Use xlv_mgr to see which machine has the volume assembled.

Specify the following parameters:

  • Device Group: the group name of the XLV device file. The sys group is the default group name for XLV device files.

  • Device Owner: the user name (login name) of the owner of the XLV device file. root is the default owner for XLV device files.

  • Device Mode: the device file permissions, specified in octal notation. 600 mode is the default value for XLV device file permissions.
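
In cmgr, these parameters correspond to the devname-group, devname-owner, and devname-mode attributes (see “Specify Resource Attributes with cmgr”). A minimal sketch for the xlv_vol example, assuming the nfs-cluster cluster:

define resource xlv_vol of resource_type volume in cluster nfs-cluster
        set devname-group to sys
        set devname-owner to root
        set devname-mode to 600
done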

XVM Attributes

The XVM resource type is the local XVM volume used by FailSafe applications.

When you define an XVM resource, the resource name must be a unique string for all local XVM domains in the FailSafe cluster. The name should be the name of the volume without the preceding vol/ characters.

Specify the following parameters:

  • FS XVM owner: the XVM temporary owner for FailSafe. This value must not be the cluster name or a hostname known to any machine in the cluster. The default is fake_owner.

  • Device group: the XVM device group. The default is sys.

  • Physvol names: the names of the physical volumes that comprise the XVM volume, separated by commas (spaces are not accepted). These names do not contain the vol/ prefix that is displayed by the xvm command. (For example, if you enter bigvol here, the xvm show command would display this physvol as vol/bigvol.) You must enter a value in this field; there is no default.

  • Device mode: the XVM device mode permissions. The default is 600.

  • Device owner: the XVM device owner. The default is root.

For more details about XVM, see the XVM Volume Manager Administrator's Guide.
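
As a cmgr sketch using the attribute names from Table 6-2 (the resource name myvol and physvol name bigvol are illustrative):

define resource myvol of resource_type XVM in cluster nfs-cluster
        set fs_xvm_owner to fake_owner
        set physvol_names to bigvol
done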

Define a Resource with cmgr

Use the following command to define a clusterwide resource:

define resource Resourcename [of resource_type RTname] [in cluster Clustername]
    set Key to AttributeValue
    add dependency Dependencyname of type RTname
    remove dependency Dependencyname of type RTname    

Usage notes:

  • The resource name has a maximum length of 255 characters and cannot begin with an underscore. XVM resource names must not begin with a slash (/).

  • set Key specifies the name of the attribute, and AttributeValue sets its value

  • add dependency adds a dependency of the specified resource type (RTname)

  • remove dependency deletes a dependency of the specified resource type

When you use this command to define a resource, you define a clusterwide resource that is not specific to a node.

The legal values for set Key to AttributeValue will depend on the type of resource you are defining, as described in “Define a Resource”. For detailed information on how to determine the format for defining resource attributes, see “Specify Resource Attributes with cmgr ”.

When you are finished defining the resource and its dependencies, enter done to return to the cmgr prompt.

For example:

cmgr> define resource /hafs1/nfs/statmon of resource_type statd_unlimited in cluster nfs-cluster
resource /hafs1/nfs/statmon? set ExportPoint to /hafs1/subdir
resource /hafs1/nfs/statmon? done

The following section of a cmgr script defines a resource of resource type statd_unlimited:

define resource /hafs1/nfs/statmon of resource_type statd_unlimited in cluster nfs-cluster
        set ExportPoint to /hafs1/subdir
done

Specify Resource Attributes with cmgr

To see the format in which you can specify the user-specific attributes that you must set for a particular resource type, you can enter the following command to see the full definition of that resource type:

show resource_type RTname [in cluster Clustername]

For example, to see the attributes you define for a resource of resource type volume, enter the following command:

cmgr> show resource_type volume in cluster test-cluster

At the bottom of the resulting display, the following appears:

Type specific attribute: devname-group
        Data type: string
        Default value: sys
Type specific attribute: devname-owner
        Data type: string
        Default value: root
Type specific attribute: devname-mode
        Data type: string
        Default value: 600

This display reflects the format in which you can specify the group ID, the device owner, and the device file permissions for the volume:

  • devname-group specifies the group ID of the XLV device file

  • devname-owner specifies the owner of the XLV device file

  • devname-mode specifies the device file permissions

For example, to set the group ID to sys for a resource named A, enter the following command:

resource A? set devname-group to sys

Table 6-2 summarizes the attributes you specify for the predefined FailSafe resource types with the set Key to AttributeValue command.

Table 6-2. Resource Type Attributes

CXFS

  • relocate-mds: specifies whether the metadata server of the CXFS filesystem should be relocated. (The name of a CXFS resource is the mount point of the CXFS filesystem, such as /shared_CXFS.)

filesystem

  • volume-name: specifies the name of the XLV volume associated with the filesystem.

  • mount-options: specifies the mount options to be used for mounting the filesystem.

  • kill-nfsds-before-umount: when set to true, stops the nfsd NFS server processes running on the server before the filesystem is unmounted and then restarts nfsd after the umount is completed. When set to false, the unmount takes place without changing the nfsd status. The default is true.

  • monitoring-level: specifies the monitoring level to be used for the filesystem:

    • 1 specifies to check whether the filesystem exists in /etc/mtab, as described in the mtab man page.

    • 2 specifies to check whether the filesystem is mounted using the stat command. This is a more-intrusive check that is more reliable if it completes on time; however, some loaded systems have been known to have problems with this level.

IP_address

  • NetworkMask: specifies the subnet mask of the IP address.

  • interfaces: specifies a comma-separated list of interfaces on which the IP address can be configured.

  • BroadcastAddress: specifies the broadcast address for the IP address.

MAC_address

  • interface-name: specifies the name of the interface that has to be re-MACed.

volume

  • devname-group: specifies the group ID of the XLV device file.

  • devname-owner: specifies the owner of the XLV device file.

  • devname-mode: specifies the device file permissions.

XVM

  • fs_xvm_owner: specifies the XVM temporary owner for FailSafe. This value must not be the cluster name or a hostname known to any node in the cluster. The default is fake_owner.

  • devname_group: specifies the XVM device group. The default is sys.

  • physvol_names: specifies the names of the physical volumes, separated by commas. There is no default.

  • devname_mode: specifies the permission mode of the XVM device in octal. The default is 600.

  • devname_owner: specifies the XVM device owner. The default is root.


Redefine a Resource for a Specific Node

This section describes redefining a resource for a specific node. You must connect the GUI to, or execute the cmgr command on, the node for which the resource will be redefined.

Redefine a Resource for a Specific Node with the GUI

You can redefine an existing resource for a specific node from that node (the local node) only. Only existing clusterwide resources can be redefined.

You may want to use this feature when you configure heterogeneous clusters for an IP_address resource. For example, the resource 192.26.50.2 of type IP_address can be configured on a Gigabit Ethernet interface eg0 on a server and on a 100BASE-T interface ef0 on another server. The clusterwide resource definition for 192.26.50.2 will have the interfaces field set to ef0 and the node-specific definition for the first node will have eg0 as the interfaces field.

Provide appropriate values for the following:

  1. Local Node: the name of the node on which the GUI is currently running, which is provided for you. You can only redefine a resource for this node. To redefine a resource for a different node, you must connect the GUI to that node.


    Caution: You should make changes from only one instance of the GUI at any given time in the cluster. Changes made by a second GUI instance (a second invocation of fsmgr) may overwrite changes made by the first instance, because different GUI instances are updated independently at different times. (In time, however, independent GUI instances will provide the same information.) However, multiple windows accessed via the File menu are all part of a single GUI instance; you can make changes from any of these windows.


  2. Resource Type: select the resource type.

  3. Clusterwide Resource: name of the resource that you want to redefine for this node. Click Next to move to the next page.

  4. Type-specific attributes: change the information for each attribute as needed. For information about each attribute, see “Define a Resource with the GUI”.

  5. Click OK to complete the task.

Redefine a Resource for a Specific Node with cmgr

You can use cmgr to redefine a clusterwide resource to be specific to a node just as you define a clusterwide resource, except that you specify a node on the define resource command. You must execute the cmgr command on the node for which the resource will be redefined.

Use the following command to define a node-specific resource:

define resource Resourcename of resource_type RTname on node Nodename [in cluster Clustername]

If you have already specified a default cluster, you do not need to specify a cluster in this command because cmgr will use the default.
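
For example, to create the node-specific IP_address definition described in the GUI section above (the node name hans2 is illustrative):

cmgr> define resource 192.26.50.2 of resource_type IP_address on node hans2 in cluster nfs-cluster
resource 192.26.50.2? set interfaces to eg0
resource 192.26.50.2? done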

Add/Remove Dependencies for a Resource Definition

This section describes how to add and remove dependencies for a resource.

Add/Remove Dependencies for a Resource Definition with the GUI

A resource can be dependent on one or more other resources; if so, it will not be able to start (that is, be made available for use) unless the dependent resources are started as well. Dependent resources must be part of the same resource group.

As you define resources, you can define which resources are dependent on other resources. For example, a web server may depend on both an HA IP address and a filesystem. In turn, a filesystem may depend on a volume. This is shown in Figure 6-2.

Figure 6-2. Example of Resource Dependency


You cannot make resources mutually dependent. For example, if resource A is dependent on resource B, then you cannot make resource B dependent on resource A. In addition, you cannot define cyclic dependencies. For example, if resource A is dependent on resource B, and resource B is dependent on resource C, then resource C cannot be dependent on resource A. This is shown in Figure 6-3.

Figure 6-3. Mutual Dependency of Resources Is Not Allowed


Provide appropriate values for the following:

  1. Resource Type: select the name of the resource type.

  2. Resource: select the resource name.

  3. Dependency Type: select the resource type to be added to or deleted from the dependency list.

  4. Dependency Name: select the resource name to be added to or deleted from the dependency list. Click Add to add the displayed type and name to the list.

  5. Click OK to complete the task.

Add/Remove Dependencies for a Resource Definition with cmgr

To add or remove dependencies for a resource definition, use the modify resource command. For example:

cmgr> modify resource /hafs1/expdir of resource_type NFS in cluster nfs-cluster
Enter commands, when finished enter either "done" or "cancel"

Type specific attributes to modify with set command:

Type Specific Attribute - 1: export-info
Type Specific Attribute - 2: filesystem


Resource type dependencies to add or remove:

Resource dependency - 1: /hafs1         type: filesystem


resource /hafs1/expdir ? add dependency 100.102.10.101 of type IP_address 
resource /hafs1/expdir ? done
Successfully modified resource /hafs1/expdir

Modify a Resource Definition

This section describes how to modify a resource definition.

Modify a Resource Definition with the GUI

You can modify only the type-specific attributes for a resource. You cannot rename a resource after it has been defined; to rename a resource, you must delete it and define the new resource.

Provide appropriate values for the following:

  1. Resource Type: select the name of the resource type.

  2. Resource: select the name of resource to be modified. Click Next to move to the next page.

  3. Type-specific attributes: modify the attributes as needed. For information about attributes, see “Define a Resource with the GUI”.

  4. Click OK to complete the task.


Note: There are some resource attributes whose modification does not take effect until the resource group containing that resource is brought online again. For example, if you modify the export options of a resource of type NFS, the modifications do not take effect immediately; they take effect when the resource is brought online.


Modify a Resource Definition with cmgr

Use the following commands to modify a resource:

modify resource Resourcename [of resource_type RTname] on node Nodename [in cluster Clustername]

modify resource Resourcename [of resource_type RTname] [in cluster Clustername]
    set key to AttributeValue
    add dependency dependencyname of type typename
    remove dependency dependencyname of type typename    

You modify a resource using the same commands you use to define a resource. See “Define a Resource”.
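
For example, to change the export options of the NFS resource used in earlier examples (the new value is illustrative, and the session output is abbreviated):

cmgr> modify resource /hafs1/expdir of resource_type NFS in cluster nfs-cluster
Enter commands, when finished enter either "done" or "cancel"

resource /hafs1/expdir ? set export-info to rw,wsync,anon=root
resource /hafs1/expdir ? done
Successfully modified resource /hafs1/expdir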

Delete a Resource

This section describes how to delete a resource.

Delete a Resource with the GUI

A resource may not be deleted if it is part of a resource group. See “Add/Remove Resources in Resource Group”.

To delete a resource, provide the following:

  1. Resource Type: select the resource type.

  2. Resource to Delete: select the name of the resource to be deleted.

  3. Click OK to complete the task.

If you select a resource that has been redefined for the node to which the GUI is connected, the delete operation will delete the redefined resource definition and also put into effect the clusterwide resource definition.

If you select a clusterwide resource definition, the delete operation will delete this definition and make it unavailable for use in a resource group. Deleting a clusterwide resource definition will fail if the resource is part of any resource group.

Delete a Resource with cmgr

Use the following command to delete a resource definition:

delete resource Resourcename of resource_type RTtype [in cluster Clustername]
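
For example, to delete the NFS resource used in earlier examples:

cmgr> delete resource /hafs1/expdir of resource_type NFS in cluster nfs-cluster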

Display a Resource

You can display the following:

  • Attributes of a particular defined resource

  • All of the defined resources in a specified resource group

  • All the defined resources of a specified resource type


Caution: Anyone can use the GUI to view database information; therefore, you should not include any sensitive information in the cluster database.


Display a Resource with the GUI

The GUI provides a convenient display of resources through the view area. Select View: Resources in Groups to see all defined resources. The status of these resources will be shown in the icon (grey indicates offline). Alternatively, you can select View: Resources by Type or View: Resources owned by Nodes.

Display a Resource with cmgr

Use the following commands to display a resource:

  • To view the parameters of a single resource:

    show resource Resourcename of resource_type RTname

  • To view all of the defined resources in a resource group:

    show resources in resource_group RGname [in cluster Clustername]

  • To view all of the defined resources of a particular resource type in a specified cluster:

    show resources of resource_type RTname [in cluster Clustername]
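
For example, the following illustrative commands display the hypothetical resource /hafs and then every filesystem resource in cluster test-cluster:

cmgr> show resource /hafs of resource_type filesystem
cmgr> show resources of resource_type filesystem in cluster test-cluster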

Failover Policy Tasks

A failover policy is the method used by FailSafe to determine the destination node of a failover. A failover policy consists of the following:

  • Failover domain

  • Failover attributes

  • Failover script

FailSafe uses the failover domain output from a failover script along with failover attributes to determine on which node a resource group should reside.

The administrator must configure a failover policy for each resource group. A failover policy name must be unique within the pool.

This section describes how to define, modify, delete, and display a failover policy.

Define a Failover Policy

This section describes how to define a failover policy.

Define a Failover Policy with the GUI

Before you can configure your resources into a resource group, you must determine which failover policy to apply to the resource group. To define a failover policy, provide the following information:

  1. Failover Policy: enter the name of the failover policy, with a maximum length of 63 characters, which must be unique within the pool.

  2. Script: select the name of an existing failover script. The failover script generates the run-time failover domain and returns it to the FailSafe process. The FailSafe process applies the failover attributes and then selects the first node in the returned failover domain that is also in the current FailSafe membership:

    • ordered never changes the initial failover domain; when using this script, the initial and run-time failover domains are equivalent.

    • round-robin selects the resource group owner in a round-robin (circular) fashion. This policy can be used for resource groups that can be run in any node in the cluster.

    Failover scripts are stored in the /var/cluster/ha/policies directory. If the scripts provided with the release do not meet your needs, you can define a new failover script and place it in the /var/cluster/ha/policies directory. When you are using the FailSafe GUI, the GUI automatically detects your script and presents it to you as a choice. You can configure the cluster database to use your new failover script for the required resource groups. For information on defining failover scripts, see the FailSafe Programmer's Guide for SGI Infinite Storage.

  3. Failback: choose the name of the failover attribute, which is a value that is passed to the failover script and used by FailSafe for the purpose of modifying the run-time failover domain used for a specific resource group.

    You can specify the following classes of failover attributes:

    • Required attributes: either Auto_Failback or Controlled_Failback (mutually exclusive)

    • Optional attributes:

      • Auto_Recovery or InPlace_Recovery (mutually exclusive)

      • Critical_RG

      • Node_Failures_Only


    Note: The starting conditions for the attributes differ by class:

    • For required attributes, a node joins the FailSafe membership when the cluster is already providing HA services.

    • For optional attributes, HA services are started and the resource group is running in only one node in the cluster.


    Table 6-3 describes each attribute.

    Table 6-3. Failover Attributes

    Required attributes (mutually exclusive; you must specify exactly one):

    Auto_Failback
        Specifies that the resource group is made online based on the failover policy when the node joins the cluster. This attribute is best used when some type of load balancing is required.

    Controlled_Failback
        Specifies that the resource group remains on the same node when a node joins the cluster. This attribute is best used when client/server applications have expensive recovery mechanisms, such as databases or any application that uses TCP to communicate.

    Optional attributes:

    Auto_Recovery
        Specifies that the resource group is made online based on the failover policy even when an exclusivity check shows that the resource group is running on a node. This attribute is mutually exclusive with the InPlace_Recovery attribute. If you specify neither of these attributes, FailSafe will use this attribute by default if you have specified the Auto_Failback attribute.

    InPlace_Recovery
        Specifies that the resource group is made online on the same node where the resource group is running. This attribute is mutually exclusive with the Auto_Recovery attribute. If you specify neither of these attributes, FailSafe will use this attribute by default if you have specified the Controlled_Failback attribute.

    Critical_RG
        Allows monitor failure recovery to succeed even when there are resource group release failures. When resource monitoring fails, FailSafe attempts to move the resource group to another node in the application failover domain. If FailSafe fails to release the resources in the resource group, FailSafe puts the resource group into srmd executable error status. If this attribute is specified in the failover policy of the resource group, FailSafe will reset the node where the release operation failed and move the resource group to another node based on the failover policy.

    Node_Failures_Only
        Allows failover only when there are node failures; failover does not occur when there is a resource monitoring failure in the resource group. This attribute does not affect resource restarts on the local node. It is useful when using a hierarchical storage management system such as DMF; in this situation, you may want resource monitoring failures reported without automatic recovery, allowing operators to perform the recovery action manually if necessary.

    See the FailSafe Programmer's Guide for SGI Infinite Storage for a full discussion of example failover policies.

  4. Recovery: choose the recovery attribute:

    • Let FailSafe Choose means that FailSafe will determine the best attribute for the circumstances.

    • Automatic means that the group will be brought online on the initial node in the failover domain.

    • In Place means that the group will be brought online on the node where the group is already partially allocated.

  5. Critical Resource Group: check to toggle selection. Selecting this attribute allows monitor failure recovery to succeed even when there are resource group release failures.

    When resource monitoring fails, FailSafe attempts to move the resource group to another node in the failover domain:

    • If FailSafe fails to release the resources, it puts the resource group into srmd executable error state.

    • If you select the Critical Resource Group state, FailSafe will reset the node where the release operation failed and move the resource group to another node based on the failover policy.

  6. Node Failures Only: this attribute controls failover on resource monitoring failures. If you select this attribute, the resource group recovery (that is, failover to another node in the failover domain) is performed only when there are node failures.

  7. Other Attributes: enter any additional attributes to be used for failover. These optional attributes are determined by the user-defined failover scripts that you can write and place into the /var/cluster/ha/policies directory.

  8. Ordered Nodes in Failover Domain: a failover domain is the ordered list of nodes on which a given resource group can be allocated. The nodes listed in the failover domain must be defined for the cluster; however, the failover domain does not have to include every node in the cluster. The failover domain can also be used to statically load-balance the resource groups in a cluster.

    Examples:

    • In a four-node cluster, a set of two nodes that have access to a particular XLV volume should be the failover domain of the resource group containing that XLV volume.

    • In a cluster of nodes named venus, mercury, and pluto, you could configure the following initial failover domains for resource groups RG1 and RG2:

      • RG1: mercury, venus, pluto

      • RG2: pluto, mercury

    The administrator defines the initial failover domain when configuring a failover policy. The initial failover domain is used when a cluster is first booted. The ordered list specified by the initial failover domain is transformed into a run-time failover domain by the failover script. With each failure, the failover script takes the current run-time failover domain and potentially modifies it (for the ordered failover script, the order will not change); the initial failover domain is never used again. Depending on the run-time conditions, such as load and contents of the failover script, the initial and run-time failover domains may be identical.

    For example, suppose that the cluster contains three nodes named N1, N2, and N3; that node failure is not the reason for failover; and that the initial failover domain is as follows:

    N1 N2 N3

    The run-time failover domain will vary based on the failover script:

    • If ordered:

      N1 N2 N3

    • If round-robin:

      N2 N3 N1

    • If a customized failover script, the order could be any permutation, based on the contents of the script:

      N1 N2 N3                 N1 N3 N2 
      N2 N1 N3                 N2 N3 N1
      N3 N1 N2                 N3 N2 N1

    FailSafe stores the run-time failover domain and uses it as input to the next failover script invocation.

  9. Click OK to complete the task.

Complete information on failover policies and failover scripts, with an emphasis on writing your own failover policies and scripts, is provided in the FailSafe Programmer's Guide for SGI Infinite Storage.

Define a Failover Policy with cmgr

For details about failover policies, see “Define a Failover Policy with the GUI”.

Use the following to define a failover policy:

define failover_policy Policyname
    set attribute to Attributename
    set script to Scriptname
    set domain to Nodename

The following prompt appears:

failover_policy Policyname?

When you define a failover policy, you can set as many attributes and domains as your setup requires, executing the set attribute and set domain commands multiple times with different values. You can also specify multiple domains in one command of the following format:

set domain to Node1 Node2 Node3 ...

The components of a failover policy are described in detail in the FailSafe Programmer's Guide for SGI Infinite Storage and in summary in “Define a Failover Policy with the GUI”.

For example, suppose you have a failover policy named fp_ord with attributes Auto_Failback, Auto_Recovery, and Critical_RG and a failover domain of node2 node1. The primary node is node2 and the backup node is node1. The following is an example of defining the failover policy in normal mode:

cmgr> define failover_policy fp_ord
Enter commands, when finished enter either "done" or "cancel"

failover_policy fp_ord? set attribute to Auto_Failback
failover_policy fp_ord? set attribute to Auto_Recovery
failover_policy fp_ord? set attribute to Critical_RG
failover_policy fp_ord? set script to ordered
failover_policy fp_ord? set domain to node2 node1
failover_policy fp_ord? done

Modify a Failover Policy Definition

This section describes how to modify a failover policy.

Modify a Failover Policy Definition with the GUI

The process of modifying a failover policy is similar to defining a new policy. See “Define a Failover Policy with the GUI”.

Do the following:

  1. Failover Policy: select the name of the failover policy.

  2. Script: use the menu to select the name of an existing failover script:

    • ordered never changes the initial domain; when using this script, the initial and run-time domains are equivalent.

    • round-robin selects the resource group owner in a round-robin (circular) fashion. This policy can be used for resource groups that can be run in any node in the cluster.

    Failover scripts are stored in the /var/cluster/ha/policies directory. If the scripts provided with the release do not meet your needs, you can define a new failover script and place it in the /var/cluster/ha/policies directory. When you are using the FailSafe GUI, the GUI automatically detects your script and presents it to you as a choice. You can configure the cluster database to use your new failover script for the required resource groups. For information on defining failover scripts, see the FailSafe Programmer's Guide for SGI Infinite Storage.

  3. Failback: choose the name of the failover attribute. You can specify the following classes of failover attributes:

    • Required attributes: either Auto_Failback or Controlled_Failback (mutually exclusive)

    • Optional attributes:

      • Auto_Recovery or InPlace_Recovery (mutually exclusive)

      • Critical_RG

      • Node_Failures_Only


    Note: The starting conditions for the attributes differ by class:

    • For required attributes, a node joins the FailSafe membership when the cluster is already providing HA services.

    • For optional attributes, HA services are started and the resource group is running in only one node in the cluster.


    Table 6-3 describes each attribute.

    See the FailSafe Programmer's Guide for SGI Infinite Storage for a full discussion of example failover policies.

  4. Recovery: choose the recovery attribute, or let FailSafe choose the best attribute for the circumstances:

    • Automatic means that the group will be brought online on the initial node in the failover domain.

    • In Place means that the group will be brought online on the node where the group is already partially allocated.

  5. Critical Resource Group: check to toggle selection. Selecting this attribute allows monitor failure recovery to succeed even when there are resource group release failures.

    When resource monitoring fails, FailSafe attempts to move the resource group to another node in the failover domain:

    • If FailSafe fails to release the resources, it puts the resource group into srmd executable error state.

    • If you select the Critical Resource Group state, FailSafe will reset the node where the release operation fails and move the resource group to another node based on the failover policy.

  6. Node Failures Only: this attribute controls failover on resource monitoring failures. If you select this attribute, the resource group recovery (that is, failover to another node in the failover domain) is performed only when there are node failures.

  7. Other Attributes: enter any additional attributes to be used for failover. These optional attributes are determined by the user-defined failover scripts that you can write and place into the /var/cluster/ha/policies directory.

  8. Ordered Nodes in Failover Domain: a failover domain is the ordered list of nodes on which a given resource group can be allocated. The nodes listed in the failover domain must be defined for the cluster; however, the failover domain does not have to include every node in the cluster. The failover domain also can be used to statically load-balance the resource groups in a cluster.

    The administrator defines the initial failover domain when configuring a failover policy. The initial failover domain is used when a cluster is first booted. The ordered list specified by the initial failover domain is transformed into a run-time failover domain by the failover script. With each failure, the failover script takes the current run-time failover domain and potentially modifies it (for the ordered failover script, the order will not change); the initial failover domain is never used again. Depending on the run-time conditions, such as load and contents of the failover script, the initial and run-time failover domains may be identical.

    FailSafe stores the run-time failover domain and uses it as input to the next failover script invocation.

  9. Click OK to complete the task.

Complete information on failover policies and failover scripts, with an emphasis on writing your own failover policies and scripts, is provided in the FailSafe Programmer's Guide for SGI Infinite Storage.

Modify a Failover Policy Definition with cmgr

Use the following command to modify a failover policy:

modify failover_policy Policyname

You modify a failover policy using the same commands you use to define a failover policy. See “Define a Failover Policy with cmgr”.
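
For example, the following session sketch switches the hypothetical policy fp_ord from the ordered script to round_robin (the script names available on your system are those installed in the policies directory):

cmgr> modify failover_policy fp_ord
Enter commands, when finished enter either "done" or "cancel"

failover_policy fp_ord? set script to round_robin
failover_policy fp_ord? done
Successfully modified failover policy fp_ord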

Delete a Failover Policy

This section describes how to delete a failover policy.

Delete a Failover Policy with the GUI

This task lets you delete a failover policy. Deleting a failover policy does not delete the cluster nodes in the policy's failover domain.


Note: You cannot delete a failover policy that is currently being used by a resource group. You must first use the Modify Resource Group task to select a different failover policy for the resource group.


Do the following:

  1. Failover Policy to Delete: select a policy.

  2. Click OK to complete the task.

Delete a Failover Policy with cmgr

Use the following command to delete a failover policy:

delete failover_policy Policyname
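
For example, to delete the hypothetical policy fp_ord once no resource group references it:

cmgr> delete failover_policy fp_ord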

Display a Failover Policy

You can use FailSafe to display any of the following:

  • The components of a specified failover policy

  • All of the failover policies

  • All of the failover policy attributes

  • All of the failover policy scripts

Display a Failover Policy with the GUI

Select View: Failover Policies to see all defined failover policies in the view area. Select the name of a specific policy in the view area in order to see details about it in the details area.

Display a Failover Policy with cmgr

Use the following commands to display a failover policy:

  • To view all of the failover policies:

    show failover_policies

  • To view the parameters of a specific failover policy:

    show failover_policy Policyname

  • To view all of the failover policy attributes:

    show failover_policy attributes

  • To view all of the failover policy scripts:

    show failover_policy scripts
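
For example, the following illustrative commands list every defined policy and then examine the hypothetical policy fp_ord defined earlier:

cmgr> show failover_policies
cmgr> show failover_policy fp_ord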

Resource Group Tasks

A resource group is a collection of interdependent resources. A resource group is identified by a simple name; this name must be unique within a cluster.

This section describes how to define, modify, delete, and display a resource group, and how to add and remove resources in a resource group.

Define a Resource Group

This section describes how to define a resource group.

Define a Resource Group with the GUI

Resources are configured together into resource groups. A resource group is a collection of interdependent resources. If any individual resource in a resource group becomes unavailable for its intended use, then the entire resource group is considered unavailable. Therefore, a resource group is the unit of failover for FailSafe.

For example, a resource group could contain all of the resources that are required for the operation of a web server, such as the web server application itself, the HA IP address with which it communicates to the outside world, and the disk volumes containing the content that it serves.

When you define a resource group, you specify a failover policy. A failover policy controls the behavior of a resource group in failure situations.

Do the following:

  1. Failover Policy: select the name of the failover policy. This policy will determine which node will take over the services of the resource group upon failure.

  2. Resource Group Name: enter the name of the resource group, with a maximum length of 63 characters.

  3. Click OK to complete the task.

To add resources to the group, see “Add/Remove Resources in Resource Group”.


Note: FailSafe does not allow resource groups that do not contain any resources to be brought online.

You can define up to 100 resources configured in any number of resource groups.

Define a Resource Group with cmgr

Use the following command to define a resource group:

define resource_group RGname [in cluster Clustername]
    set failover_policy to Policyname
    add resource Resourcename of resource_type RTname
    remove resource Resourcename of resource_type RTname

Usage notes:

  • failover_policy specifies the failover policy name

  • resource specifies the resource name

  • resource_type specifies the resource type

For example:

cmgr> define resource_group group1 in cluster filesystem-cluster
Enter commands, when finished enter either "done" or "cancel"

resource_group group1? set failover_policy to fp_ord
resource_group group1? add resource 10.154.99.99 of resource_type IP_address
resource_group group1? add resource havol of resource_type volume
resource_group group1? add resource /hafs of resource_type filesystem
resource_group group1? done

For a full example of resource group creation using cmgr see “Example: Create a Resource Group” in Chapter 7.

Modify a Resource Group Definition

This section describes how to modify a resource group.

Modify a Resource Group Definition with the GUI

This task lets you change a resource group by changing its failover policy.

Do the following:

  1. Resource Group: select a resource group

  2. Failover Policy: select a failover policy

  3. Click OK to complete the task

To change the contents of the resource group, see “Add/Remove Resources in Resource Group”.

Modify a Resource Group Definition with cmgr

Use the following commands to modify a resource group:

modify resource_group RGname [in cluster Clustername]
    set failover_policy to Policyname
    add resource Resourcename of resource_type RTname
    remove resource Resourcename of resource_type RTname

For example:

cmgr> modify resource_group WS1 in cluster test-cluster

You modify a resource group using the same commands you use to define a resource group. See “Define a Resource Group with cmgr”.
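
For example, the following session sketch (the group, policy, and resource names are illustrative) changes the group's failover policy and adds an IP address resource:

cmgr> modify resource_group WS1 in cluster test-cluster
Enter commands, when finished enter either "done" or "cancel"

resource_group WS1? set failover_policy to fp_rr
resource_group WS1? add resource 10.154.99.99 of resource_type IP_address
resource_group WS1? done
Successfully modified resource group WS1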

Delete a Resource Group

This section describes how to delete a resource group.

Delete a Resource Group with the GUI

This task lets you delete an offline resource group. Deleting the group does not delete the individual resources that are members of the group.


Note: You cannot delete a resource group that is online.


Do the following:

  1. Resource Group: select the name of the resource group you want to delete. Only offline resource groups are listed.

  2. Click OK to complete the task.

Delete a Resource Group with cmgr

Use the following command to delete a resource group:

delete resource_group RGname [in cluster Clustername]

For example:

cmgr> delete resource_group WS1 in cluster test-cluster

Add/Remove Resources in Resource Group

This task lets you change a resource group by adding or removing resources.


Note: You cannot have a resource group online that does not contain any resources; therefore, FailSafe does not allow you to delete all resources from a resource group after the resource group is online. Likewise, FailSafe does not allow you to bring a resource group online if it has no resources.

Resources must be added and deleted in atomic units; this means that resources that are interdependent must be added and deleted together.


Note: All interdependent resources must be added to the same resource group.

Do the following:

  1. Resource Group: select a resource group. A list of existing resources in the group appears.

  2. To add a resource to the group:

    • Resource Type: select a resource type

    • Resource Name: select a resource name

    • Click Add

  3. To modify a resource in the group:

    • Select its name from the display window

    • Click Modify

  4. To delete a resource from the group:

    • Select its name from the display window

    • Click Delete

  5. Click OK to complete the task.

Display a Resource Group

This section describes how to display a resource group.

Display a Resource Group with the GUI

You can display the parameters of a defined resource group, and you can display all of the resource groups defined for a cluster.

Display a Resource Group with cmgr

Use the following commands to display a resource group:

  • To view a specific resource group:

    show resource_group RGname [in cluster Clustername]

    For example:

    cmgr> show resource_group small-rg in cluster test-cluster
    Resource Group: small-rg
            Cluster: test-cluster
            Failover Policy: test_fp
    
    Resources: 
            100.101.10.101  (type: IP_address)
            /hafs  (type: filesystem)
            havol  (type: volume)

  • To view all of the resource groups:

    show resource_groups [in cluster Clustername]

    For example:

    cmgr> show resource_groups in cluster test-cluster
    
    Resource Groups:
            bar-rg
            foo-rg
            small-rg

FailSafe HA Services Tasks

After you have configured your FailSafe system and run diagnostic tests on its components, you can activate FailSafe by starting the highly available (HA) services. You can start HA services on all of the nodes in a cluster or on a specified node only.

This section describes how to start and stop FailSafe HA services, set FailSafe HA parameters, and set the log configuration.

Start FailSafe HA Services

This section describes how to start FailSafe HA services.

Start FailSafe HA Services with the GUI

You can start FailSafe HA services on all of the nodes in a cluster or on a specified node only:

  1. Cluster Name: the name of the cluster is specified for you.

  2. One Node Only: if you want HA services to be started on one node only, choose its name. If you leave this field blank, HA services will be started on every node in the cluster.


Caution: When you start HA services on a subset of nodes, you should ensure that resource groups are running on only the started nodes. For example, if a cluster contains nodes N1, N2, and N3 and HA services are started on nodes N1 and N2 but not on node N3, you should verify that resource groups are not running on node N3. FailSafe will not perform exclusivity checks on nodes where HA services are not started.

When you start HA services, the following actions are performed:

  • All nodes in the cluster (or the selected node only) are enabled

  • FailSafe returns success to the user after modifying the cluster database

  • The local cmond gets notification from the fs2d daemon

  • The local cmond starts all HA processes (cmsd, gcd, srmd, fsd) and ifd

  • cmond sets the failsafe2 chkconfig flag to on

Start FailSafe HA Services with cmgr

Use the following command to start HA services:

start ha_services [on node Nodename] [for cluster Clustername]

For example:

  • To start HA services across the cluster:

    cmgr> start ha_services for cluster test-cluster

  • To start HA services just on node N1:

    cmgr> start ha_services on node N1 for cluster test-cluster

Stop FailSafe HA Services

This section describes how to stop FailSafe HA services.

Stop FailSafe HA Services with the GUI

You can stop HA services on all of the nodes in a cluster or on one specified node.


Note: This is a long-running task that might take a few minutes to complete.

Stopping a node or a cluster is a complex operation that involves several steps and can take several minutes. Aborting a stop operation can leave the nodes and the resources in an unintended state.

When stopping HA services on a node or for a cluster, the operation may fail if any resource groups are not in a stable clean state. Resource groups that are in transition will cause any stop HA services command to fail. In many cases, the command may succeed at a later time after resource groups have settled into a stable state.

After you have successfully stopped a node or a cluster, it should have no resource groups and all HA services should be gone.

Serially stopping HA services on every node in a cluster is not the same as stopping HA services for the entire cluster. In the former case, an attempt is made to keep resource groups online and highly available; in the latter case, resource groups are moved offline, as described in the following sections.

When you stop HA services, the FailSafe daemons perform the following actions:

  • A shutdown request is sent to the fsd daemon

  • fsd releases all resource groups and puts them in ONLINE-READY state

  • All nodes in the cluster database are disabled (one node at a time, with the local node last)

  • FailSafe waits until the node is removed from the FailSafe membership before disabling the node

  • The shutdown is successful only when no nodes remain in the FailSafe membership

  • cmond receives notification from the cluster database when nodes are disabled

  • The local cmond sends a SIGTERM to all HA processes and ifd

  • All HA processes clean up and exit with “don't restart” code

  • All other cmsd daemons remove the disabled node from the FailSafe membership

If HA services are stopped on one node, that node's online resource groups will be moved, according to the failover policy, to a node where HA services are active. If HA services are stopped on the cluster, all online resource groups will be taken offline, making them no longer highly available.

See the caution in “Start FailSafe HA Services with the GUI”.

Stopping HA Services on One Node

To stop HA services on one node, enter the following:

  • Force: click the checkbox to forcibly stop the services even if there are errors that would normally prevent them from being stopped.

    The operation of stopping a node tries to move all resource groups from the node to some other node and then tries to disable the node in the cluster, subsequently killing all HA processes.

    When HA services are stopped on a node, all resource groups owned by the node are moved to some other node in the cluster that is capable of maintaining these resource groups in an HA state. This operation will fail if there is no node that can take over these resource groups. This condition will always occur if the last node in a cluster is shut down when you stop HA services on that node.

    In this circumstance, you can specify the Force option to shut down the node even if resource groups cannot be moved or released. This will normally leave resource groups allocated in a non-HA state on that same node. Using the force option might result in the node getting reset. In order to guarantee that the resource groups remain allocated on the last node in a cluster, all online resource groups should be detached.

    If you wish to move resource groups offline that are owned by the node being shut down, you must do so prior to stopping the node.

  • Cluster Name: the name of the cluster is specified for you.

  • One Node Only: select the node name.

  • Click OK to complete the task.

Stopping HA Services on All Nodes in a Cluster

Stopping HA services across the cluster attempts to release all resource groups and disable all nodes in the cluster, subsequently killing all HA processes.

When a cluster is deactivated and the FailSafe HA services are stopped on that cluster, resource groups are moved offline or deallocated. If you want the resource groups to remain allocated, you must detach the resource groups before attempting to deactivate the cluster.

Serially stopping HA services on every node in a cluster is not the same as stopping HA services for the entire cluster. In the former case, an attempt is made to keep resource groups online and highly available while in the latter case resource groups are moved offline.

To stop HA services on all nodes, enter the following:

  • Force: click the checkbox to force the stop even if there are errors

  • Cluster Name: the name of the cluster is specified for you

  • One Node Only: leave this field blank

  • Click OK to complete the task

Stop FailSafe HA Services with cmgr

To stop FailSafe HA services, use the following command:

stop ha_services [on node Nodename] [for cluster Clustername] [force]

The force option will cause the stop to occur even if there are errors.

This is a long-running task that might take a few minutes to complete. The cmgr command will provide intermediate task status for such tasks. For example:

cmgr> stop ha_services for cluster nfs-cluster
Making resource groups offline
Stopping HA services on node node1
Stopping HA services on node node2

Set FailSafe HA Parameters

This section tells you how to set FailSafe HA parameters.

Set FailSafe HA Parameters with the GUI

This task lets you change how FailSafe monitors the cluster and detects the need for node resets and group failovers:

  1. Cluster Name: name of the cluster. This value is provided for you.

  2. Heartbeat Interval: the interval, in milliseconds, between heartbeat messages. This interval must be greater than 500 milliseconds and must not be greater than one-tenth the value of the node timeout period. The default is 1000 milliseconds (one second).

    The higher the number of heartbeats (smaller heartbeat interval), the greater the potential for slowing down the network. Conversely, the fewer the number of heartbeats (larger heartbeat interval), the greater the potential for reducing availability of resources.

  3. Node Timeout: if no heartbeat is received from a node within the node timeout period, the node is considered to be dead and is not considered part of the FailSafe membership.

    Enter a value in milliseconds. The node timeout must be at least 5 seconds. In addition, the node timeout must be at least 10 times the heartbeat interval for proper FailSafe operation; otherwise, false failovers may be triggered. It has a default value of 15000 milliseconds.

    Node timeout is a clusterwide parameter.

  4. Node Wait Time: the interval, in milliseconds, during which a node waits for other nodes to join the cluster before declaring a new FailSafe membership. If the value is not set for the cluster, FailSafe calculates this value by multiplying the Node Timeout value by the number of nodes.

  5. Powerfail Mode: check the box to turn it on. The powerfail mode indicates whether a special power failure algorithm should be run when no response is received from a system controller after a reset request. Powerfail is a node-specific parameter, and should be defined for the node that performs the reset operation.

  6. Tie-Breaker Node: select a node name. The FailSafe tiebreaker node is the node used to compute the FailSafe membership in situations where 50% of the nodes in a cluster can talk to each other. If you do not specify a tiebreaker node, the node with the lowest node ID number is used.

    You should configure a tiebreaker node even if there is an odd number of nodes in the cluster because one node may be stopped, leaving an even number of nodes to determine membership.

    In a cluster where the nodes are of different sizes and capabilities, the largest node in the cluster with the most important application or the maximum number of resource groups should be configured as the tiebreaker node.

Set FailSafe HA Parameters with cmgr

You can modify the FailSafe parameters with the following command:

modify ha_parameters [on node Nodename] [in cluster Clustername]
    set node_timeout to TimeoutValue
    set heartbeat to HeartbeatInterval
    set run_pwrfail to true|false
    set node_wait to NodeWaitTime
    set tie_breaker to TieBreakerNodename

Usage notes:

  • node_timeout is the node time-out period. If no heartbeat is received from a node within the node timeout period, the node is considered to be dead and is not considered part of the FailSafe membership.

    Enter a value in milliseconds. The node timeout must be at least 5 seconds. In addition, the node timeout must be at least 10 times the heartbeat interval for proper FailSafe operation; otherwise, false failovers may be triggered. It has a default value of 60000 milliseconds.

    node_timeout is a clusterwide parameter.

  • heartbeat is the interval, in milliseconds, between heartbeat messages. This interval must be greater than 500 milliseconds and must not be greater than one-tenth the value of the node timeout period. The default is 1000 milliseconds (one second).

    The higher the number of heartbeats (smaller heartbeat interval), the greater the potential for slowing down the network. Conversely, the fewer the number of heartbeats (larger heartbeat interval), the greater the potential for reducing availability of resources.

  • run_pwrfail indicates whether a special power failure algorithm should be run (true) when no response is received from a system controller after a reset request.

    Powerfail is a node-specific parameter, and should be defined for the node that performs the reset operation.

  • node_wait is the interval, in milliseconds, during which a node waits for other nodes to join the cluster before declaring a new FailSafe membership. If the value is not set for the cluster, FailSafe calculates this value by multiplying the node-timeout value by the number of nodes.

  • tie_breaker is the name of the node to act as the FailSafe tiebreaker.

    Setting tie_breaker to "" (no space between quotation marks) unsets the tie_breaker value. Unsetting the tie_breaker is equivalent to not setting the value in the first place. In this case, FailSafe will use the node with the lowest node ID as the tiebreaker node.
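
For example, the following session sketch (the cluster name, prompt text, and values are illustrative) raises the node timeout to 30 seconds and designates node3 as the tiebreaker; remember that the node timeout must remain at least 10 times the heartbeat interval:

cmgr> modify ha_parameters in cluster test-cluster
Enter commands, when finished enter either "done" or "cancel"

ha_parameters ? set node_timeout to 30000
ha_parameters ? set tie_breaker to node3
ha_parameters ? done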

Set Log Configuration

This section describes how to set log configuration.

Set Log Configuration with the GUI

FailSafe maintains system logs for each of the FailSafe daemons. You can customize the system logs according to the level of logging you wish to maintain. Changes apply as follows:

  • To all nodes in the pool for the cli and crsd log groups

  • To all nodes in the cluster for all other log groups

You can also customize the log group configuration for a specific node in the cluster or pool.

Default Log File Names

FailSafe logs both normal operations and critical errors to the SYSLOG file, as well as to individual log files for each log group.

To set the log configuration, enter the appropriate values for the following fields:

  1. Log Group: a log group is a set of processes that log to the same log file according to the same logging configuration. Each FailSafe daemon constitutes one log group.

    FailSafe maintains the following log groups:

    cli         Commands log
    crsd        Cluster reset services (crsd) log
    diags       Diagnostics log
    ha_agent    HA monitoring agents (ha_ifmx2) log
    ha_cmsd     FailSafe membership daemon (ha_cmsd) log
    ha_fsd      FailSafe daemon (ha_fsd) log
    ha_gcd      Group communication daemon (ha_gcd) log
    ha_ifd      Network interface monitoring daemon (ha_ifd) log
    ha_script   Action and failover policy scripts log
    ha_srmd     System resource manager (ha_srmd) log

  2. Log Level: the log level, specified as a character string with the GUI and numerically (0 to 19) with cmgr. The log level specifies the verbosity of the logging, controlling the number of log messages that FailSafe will write into an associated log group's file. There are 10 debug levels. Table 6-4 shows the logging levels as you specify them with the GUI and cmgr.

    Notifications of critical errors and normal operations are always sent to the SYSLOG file. Changes you make to the log level for a log group do not affect SYSLOG.

  3. Log File: a file that contains FailSafe notifications for a particular log group. Log file names beginning with a slash are absolute, while names not beginning with a slash are relative to the /var/cluster/ha/log directory.

    The FailSafe software appends the node name to the name of the log file you specify. For example, when you specify the log file name for a log group as /var/cluster/ha/log/cli, the file name will be /var/cluster/ha/log/cli_Nodename.

    Table 6-5 shows the default log file names.

  4. Click OK to complete the task.

Table 6-4. Log Levels

GUI Level               cmgr Level      Meaning
=========               ==========      =======
Off                     0               No logging
Minimal                 1               Logs notifications of critical errors and normal operation
Info                    2               Logs minimal notifications plus warnings
Default                 5               Logs all Info messages plus additional notifications
Debug0 through Debug9   10 through 19   Logs increasingly more debug information, including data
                                        structures. Many megabytes of disk space can be consumed
                                        on the server when debug levels are used in a log
                                        configuration.


Table 6-5. Default Log File Names

/var/cluster/ha/log/cmsd_Nodename
    FailSafe membership services daemon in node Nodename.

/var/cluster/ha/log/gcd_Nodename
    Group communication daemon in node Nodename.

/var/cluster/ha/log/srmd_Nodename
    System resource manager daemon in node Nodename.

/var/cluster/ha/log/failsafe_Nodename
    FailSafe daemon, a policy implementor for resource groups, in node Nodename.

/var/cluster/ha/log/Agent_Nodename
    Monitoring agent named Agent in node Nodename. For example, ifd_Nodename is the log file for the interface daemon monitoring agent that monitors interfaces and IP addresses and performs local failover of IP addresses.

/var/cluster/ha/log/crsd_Nodename
    Reset daemon in node Nodename.

/var/cluster/ha/log/script_Nodename
    Scripts in node Nodename.

/var/cluster/ha/log/cli_Nodename
    Internal administrative commands in node Nodename invoked by the GUI and cmgr.


Display Log Group Definitions with the GUI

To display log group definitions with the GUI, select the Log Group menu. The current log level and log file for that log group will be displayed in the task window, where you can change those settings if you desire.

Define Log Groups with cmgr

Use the following command to define a log group:

define log_group Groupname [on node Nodename] [in cluster Clustername]
    set log_level to Level
    add log_file Logfilename
    remove log_file Logfilename

Usage notes:

  • Specify the node name if you wish to customize the log group configuration for a specific node only. For details about legal values, see “Set Log Configuration with the GUI”.

  • log_level can have one of the following values:

    • 0 gives no logging

    • 1 logs notifications of critical errors and normal operation (these messages are also logged to the SYSLOG file)

    • 2 logs minimal notifications plus warnings

    • 5 through 7 log increasingly more detailed notifications

    • 10 through 19 log increasingly more debug information, including data structures

  • log_file is the file that contains FailSafe notifications for a particular log group. Log file names beginning with a slash are absolute, while names not beginning with a slash are relative to the /var/cluster/ha/log directory.

    The FailSafe software appends the node name to the name of the log file you specify. For example, when you specify the log file name for a log group as /var/cluster/ha/log/cli, the file name will be /var/cluster/ha/log/cli_Nodename.

    For a list of default log names, see Table 6-5.

Configure Log Groups with cmgr

You can configure a log group with the following command:

define log_group LogGroup on node Nodename [in cluster Clustername]

The LogGroup variable can be one of the following:

cli
crsd
diags
ha_agent
ha_cmsd
ha_fsd
ha_gcd
ha_ifd
ha_script
ha_srmd


Caution: Do not change the names of the log files. If you change the names, errors can occur.

For example, to define log group cli on node fs6 with a log level of 5:

cmgr> define log_group cli on node fs6 in cluster fs6-8

(Enter "cancel" at any time to abort)

Log Level ? (11) 5

CREATE LOG FILE OPTIONS

        1) Add Log File.
        2) Remove Log File.
        3) Show Current Log Files.
        4) Cancel. (Aborts command)
        5) Done. (Exits and runs command)

Enter option:5
Successfully defined log group cli

Modify Log Groups with cmgr

Use the following command to modify a log group:

modify log_group LogGroupname on node Nodename [in cluster Clustername]

You modify a log group using the same commands you use to define a log group. See “Define Log Groups with cmgr”.

For example, to change the log level of cli to be 10, enter the following:

cmgr> modify log_group cli on node fs6 in cluster fs6-8

(Enter "cancel" at any time to abort)

Log Level ? (2) 10

MODIFY LOG FILE OPTIONS

        1) Add Log File.
        2) Remove Log File.
        3) Show Current Log Files.
        4) Cancel. (Aborts command)
        5) Done. (Exits and runs command)

Enter option:5
Successfully modified log group cli

For example, to set the log level for the ha_script log group to 11, enter the following:

cmgr> modify log_group ha_script

log_group ha_script ? set log_level to 11
log_group ha_script ? done
Successfully modified log group ha_script

Display Log Group Definitions

This section describes how to display log group definitions.

Display Log Group Definitions with cmgr

Use the following command to display log group levels:

show log_groups

This command shows the currently defined log group names, their logging levels, and the log files. For example:

cmgr> show log_groups

ha_cmsd 13 /var/cluster/ha/log/cmsd
crsd 5 /var/cluster/ha/log/crsd
ha_gcd 5 /var/cluster/ha/log/gcd
ha_ifd 5 /var/cluster/ha/log/ifd
ha_srmd 14 /var/cluster/ha/log/srmd
ha_fsd 11 /var/cluster/ha/log/failsafe
cli 2 /var/cluster/ha/log/cli
ha_script 13 /var/cluster/ha/log/script
ha_agent 5 CommandName
diags 2 /var/cluster/ha/log/diags
clconfd 5 /var/cluster/ha/log/clconfd

In this example, ha_cmsd is the name of the log group used by the ha_cmsd daemon to log messages. The log verbosity level for ha_cmsd is 13; verbosity levels range from 0 (no messages) to 19 (most verbose). The log file used is /var/cluster/ha/log/cmsd. A node name suffix is added to all log file names.

Use the following command to see messages logged by a specific daemon on a specific node:

show log_group LogGroupName [on node Nodename]
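
For example, to view messages logged by the FailSafe membership daemon on a hypothetical node named fs6:

cmgr> show log_group ha_cmsd on node fs6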

To exit from the message display, press Ctrl-C.