Chapter 3. Creating a Failover Policy

This chapter describes how to create a failover policy. It discusses the contents of a failover policy (the failover domain, failover attributes, and failover scripts), the failover scripts provided with FailSafe, how to create a new failover script, and example failover policies.

Contents of a Failover Policy

A failover policy is the method by which a resource group is failed over from one node to another. A failover policy consists of the following:

  • Failover domain

  • Failover attribute

  • Failover scripts

FailSafe uses the failover domain output from a failover script along with failover attributes to determine on which node a resource group should reside.

The administrator must configure a failover policy for each resource group. The name of the failover policy must be unique within the pool.

Failover Domain

A failover domain is the ordered list of nodes on which a given resource group can be allocated. The nodes listed in the failover domain must be within the same cluster; however, the failover domain does not have to include every node in the cluster. The failover domain can also be used to statically load balance the resource groups in a cluster.

Examples:

  • In a four-node cluster, a set of two nodes that have access to a particular XLV volume may be the failover domain of the resource group containing that XLV volume.

  • In a cluster of nodes named venus, mercury, and pluto, you could configure the following initial failover domains for resource groups RG1 and RG2:

    • mercury, venus, pluto for RG1

    • pluto, mercury for RG2

The administrator defines the initial failover domain when configuring a failover policy. The initial failover domain is used when a cluster is first booted; the ordered list it specifies is transformed into a run-time failover domain by the failover script. With each failure, the failover script takes the current run-time failover domain and potentially modifies it (for the ordered failover script, the order never changes); the initial failover domain is never used again. Depending on run-time conditions (such as load) and on the contents of the failover script, the initial and run-time failover domains may be identical.

For example, suppose that the cluster contains three nodes named N1, N2, and N3; that node failure is not the reason for failover; and that the initial failover domain is as follows:

N1 N2 N3

The runtime failover domain will vary based on the failover script:

  • If ordered:

    N1 N2 N3

  • If round-robin:

    N2 N3 N1

  • If a customized failover script, the order could be any permutation, based on the contents of the script:

    N1 N2 N3                 N1 N3 N2
    N2 N1 N3                 N2 N3 N1
    N3 N1 N2                 N3 N2 N1

FailSafe stores the run-time failover domain and uses it as input to the next failover script invocation.

Failover Attributes

A failover attribute is a value that is passed to the failover script and used by FailSafe for the purpose of modifying the run-time failover domain used for a specific resource group.

You can specify the following classes of failover attributes:

  • Required attributes: either Auto_Failback or Controlled_Failback (mutually exclusive)

  • Optional attributes:

    • Auto_Recovery or InPlace_Recovery (mutually exclusive)

    • Critical_RG

    • Node_Failures_Only


Note: The starting conditions for the attributes differ by class:

  • For required attributes, a node joins the FailSafe membership when the cluster is already providing highly available services.

  • For optional attributes, highly available services are started and the resource group is running on only one node in the cluster.


Table 3-1 describes each attribute.

Table 3-1. Failover Attributes

Required attributes (specify exactly one):

  • Auto_Failback: Specifies that the resource group is made online based on the failover policy when the node joins the cluster. This attribute is best used when some type of load balancing is required. You must specify either this attribute or the Controlled_Failback attribute.

  • Controlled_Failback: Specifies that the resource group remains on the same node when a node joins the cluster. This attribute is best used when client/server applications have expensive recovery mechanisms, such as databases or applications that use TCP to communicate. You must specify either this attribute or the Auto_Failback attribute.

Optional attributes:

  • Auto_Recovery: Specifies that the resource group is made online based on the failover policy even when an exclusivity check shows that the resource group is running on a node. This attribute is mutually exclusive with the InPlace_Recovery attribute. If you specify neither of these attributes, FailSafe uses this attribute by default if you have specified the Auto_Failback attribute.

  • InPlace_Recovery: Specifies that the resource group is made online on the same node where the resource group is already running. This attribute is mutually exclusive with the Auto_Recovery attribute. If you specify neither of these attributes, FailSafe uses this attribute by default if you have specified the Controlled_Failback attribute.

  • Critical_RG: Allows monitor failure recovery to succeed even when there are resource group release failures. When resource monitoring fails, FailSafe attempts to move the resource group to another node in the application failover domain. If FailSafe fails to release the resources in the resource group, FailSafe puts the resource group into srmd executable error status. If the Critical_RG failover attribute is specified in the failover policy of the resource group, FailSafe resets the node where the release operation failed and moves the resource group to another node based on the failover policy.

  • Node_Failures_Only: Allows failover only when there are node failures; failover does not occur when there is a resource monitoring failure in the resource group. This attribute does not affect resource restarts on the local node. It is useful for sites using a hierarchical storage management system such as DMF, where resource monitoring failures should be reported without automatic recovery so that operators can perform the recovery action manually if necessary.


Failover Scripts

A failover script generates the run-time failover domain and returns it to the FailSafe process. The FailSafe process applies the failover attributes and then selects the first node in the returned failover domain that is also in the current FailSafe membership.
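Conceptually, the selection that FailSafe performs on the returned domain can be sketched as follows. This is an illustration of the rule only, not FailSafe code; runtime_domain and membership are placeholder variables:

# Illustration only (not FailSafe code): select the first node of the
# returned run-time failover domain that is also in the current
# FailSafe membership.
target=""
for node in ${runtime_domain}; do
    for member in ${membership}; do
        if [ "X${node}" = "X${member}" ]; then
            target=${node}
            break 2
        fi
    done
done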


Note: The run-time of the failover script must be capped to a system-definable maximum. Therefore, any external calls must be guaranteed to return quickly. If the failover script takes too long to return, FailSafe will kill the script process and use the previous run-time failover domain.
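If a custom failover script must run an external command, one way to keep its run time bounded is a simple watchdog such as the following sketch; mycommand and the five-second limit are placeholders, and FailSafe enforces its own timeout regardless:

# Sketch: run an external command but kill it if it has not returned
# within 5 seconds, so that the failover script itself returns promptly.
# mycommand is a placeholder for the external call.
mycommand &
cmdpid=$!
( sleep 5; kill ${cmdpid} 2>/dev/null ) &
watchdog=$!
wait ${cmdpid}
kill ${watchdog} 2>/dev/null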

Failover scripts are stored in the /var/cluster/ha/policies directory.

ordered

The ordered failover script is provided with the release. The ordered script never changes the initial domain; when using this script, the initial and run-time domains are equivalent. The script reads six lines from the input file and in case of errors logs the input parameters and/or the error to the script log.

The following example shows the contents of the ordered failover script:

#!/sbin/ksh
#
# $1 - input file
# $2 - output file
#
# line 1 input file - version
# line 2 input file - name
# line 3 input file - owner field
# line 4 input file - attributes
# line 5 input file - list of possible owners
# line 6 input file - application failover domain

DIR=/usr/cluster/bin
LOG="${DIR}/ha_cilog -g ha_script -s script"
FILE=/var/cluster/ha/policies/ordered

input=$1
output=$2
# Read input file
cat ${input} | read version
head -2 ${input} | tail -1 | read name
head -3 ${input} | tail -1 | read owner
head -4 ${input} | tail -1 | read attr
head -5 ${input} | tail -1 | read mem1 mem2 mem3 mem4 mem5 mem6 mem7 mem8
head -6 ${input} | tail -1 | read afd1 afd2 afd3 afd4 afd5 afd6 afd7 afd8

# Validate input file
${LOG} -l 1 "${FILE}:" `/bin/cat ${input}`

if [ "${version}" -ne 1 ] ; then
    ${LOG} -l 1 "ERROR: ${FILE}: Different version no. Should be (1) rather than
(${version})" ;
    exit 1;
elif [ -z "${name}" ]; then
    ${LOG} -l 1 "ERROR: ${FILE}: Failover script not defined";
    exit 1;
elif [ -z "${attr}" ]; then
    ${LOG} -l 1 "ERROR: ${FILE}: Attributes not defined";
    exit 1;
elif [ -z "${mem1}" ]; then
    ${LOG} -l 1 "ERROR: ${FILE}: No node membership defined";
    exit 1;
elif [ -z "${afd1}" ]; then
    ${LOG} -l 1 "ERROR: ${FILE}: No failover domain defined";
    exit 1;
fi

# Check that at least one node in the failover domain is in the node membership
found=0
for i in $afd1 $afd2 $afd3 $afd4 $afd5 $afd6 $afd7 $afd8; do
    for j in $mem1 $mem2 $mem3 $mem4 $mem5 $mem6 $mem7 $mem8; do
        if [ "X${j}" = "X${i}" ]; then
            found=1;
            break;
        fi
    done
done

if [ ${found} -eq 0 ]; then
    mem="("$mem1")"" ""("$mem2")"" ""("$mem3")"" ""("$mem4")"" ""("$mem5")""
""("$mem6")"" ""("$mem7")"" ""("$mem8")";
    afd="("$afd1")"" ""("$afd2")"" ""("$afd3")"" ""("$afd4")"" ""("$afd5")""
""("$afd6")"" ""("$afd7")"" ""("$afd8")";
    ${LOG} -l 1 "ERROR: ${FILE}: Policy script failed"
    ${LOG} -l 1 "ERROR: ${FILE}: " `/bin/cat ${input}`
    ${LOG} -l 1 "ERROR: ${FILE}: Nodes defined in membership do not match the
ones in failure domain"
    ${LOG} -l 1 "ERROR: ${FILE}: Parameters read from input file: version = 
$version, name = $name, owner = $owner,  attribute = $attr, nodes = $mem, afd = $afd"
    exit 1;
fi

# The failover domain is valid; write it unchanged to the output file
if [ ${found} -eq 1 ]; then
    rm -f ${output}
    echo $afd1 $afd2 $afd3 $afd4 $afd5 $afd6 $afd7 $afd8 > ${output}
    exit 0
fi
exit 1

round-robin

The round-robin failover script selects the resource group owner in a round-robin (circular) fashion. This policy can be used for resource groups that can run on any node in the cluster.

The following example shows the contents of the round-robin failover script:

#!/sbin/ksh
#
# $1 - input file
# $2 - output file
#
# line 1 input file - version
# line 2 input file - name
# line 3 input file - owner field
# line 4 input file - attributes
# line 5 input file - Possible list of owners
# line 6 input file - application failover domain


DIR=/usr/cluster/bin
LOG="${DIR}/ha_cilog -g ha_script -s script"
FILE=/var/cluster/ha/policies/round-robin

# Read input file
input=$1
output=$2
cat ${input} | read version
head -2 ${input} | tail -1 | read name
head -3 ${input} | tail -1 | read owner
head -4 ${input} | tail -1 | read attr
head -5 ${input} | tail -1 | read mem1 mem2 mem3 mem4 mem5 mem6 mem7 mem8
head -6 ${input} | tail -1 | read afd1 afd2 afd3 afd4 afd5 afd6 afd7 afd8

# Validate input file
${LOG} -l 1 "${FILE}:" `/bin/cat ${input}`

if [ "${version}" -ne 1 ] ; then
    ${LOG} -l 1 "ERROR: ${FILE}: Different version no. Should be (1) rather than
(${version})" ;
    exit 1;
elif [ -z "${name}" ]; then
    ${LOG} -l 1 "ERROR: ${FILE}: Failover script not defined";
    exit 1;
elif [ -z "${attr}" ]; then
    ${LOG} -l 1 "ERROR: ${FILE}: Attributes not defined";
    exit 1;
elif [ -z "${mem1}" ]; then
    ${LOG} -l 1 "ERROR: ${FILE}: No node membership defined";
    exit 1;
elif [ -z "${afd1}" ]; then
    ${LOG} -l 1 "ERROR: ${FILE}: No failover domain defined";
    exit 1;
fi


# Return 0 if $1 is in the membership and return 1 otherwise.
check_in_mem()
{
    for j in $mem1 $mem2 $mem3 $mem4 $mem5 $mem6 $mem7 $mem8; do
        if [ "X${j}" = "X$1" ]; then
            return 0;
        fi
    done
    return 1;
}

# Check if owner has to be changed. There is no need to change owner if
# owner node is in the possible list of owners.
check_in_mem ${owner}
if [ $? -eq 0 ]; then
    nextowner=${owner};
fi

# Search for the next owner
if [ "X${nextowner}" = "X" ]; then
    next=0;
    for i in $afd1 $afd2 $afd3 $afd4 $afd5 $afd6 $afd7 $afd8; do
        if [ "X${i}" = "X${owner}" ]; then
            next=1;
            continue;
        fi

        if [ "X${owner}" = "XNO ONE" ]; then
            next=1;
        fi

        if [ ${next} -eq 1 ]; then
            # Check if ${i} is in membership
            check_in_mem ${i};
            if [ $? -eq 0 ]; then
                # found next owner
                nextowner=${i};
                next=0;
                break;
            fi
        fi
    done
fi

if [ "X${nextowner}" = "X" ]; then
    # wrap round the afd list.
    for i in $afd1 $afd2 $afd3 $afd4 $afd5 $afd6 $afd7 $afd8; do
        if [ "X${i}" = "X${owner}" ]; then
            # Search for next owner complete
            break;
        fi

        # Previous loop should have found new owner
        if [ "X${owner}" = "XNO ONE" ]; then
            break;
        fi

        if [ ${next} -eq 1 ]; then
            check_in_mem ${i};
            if [ $? -eq 0 ]; then
                # found next owner
                nextowner=${i};
                next=0;
                break;
            fi
        fi
    done
fi

if [ "X${nextowner}" = "X" ]; then
    ${LOG} -l 1 "ERROR: ${FILE}: Policy script failed"
    ${LOG} -l 1 "ERROR: ${FILE}: " `/bin/cat ${input}`
    ${LOG} -l 1 "ERROR: ${FILE}: Could not find new owner"
    exit 1;
fi


# nextowner is the new owner
print=0;
rm -f ${output};

# Print the new afd to the output file
echo -n "${nextowner} " > ${output};
for i in $afd1 $afd2 $afd3 $afd4 $afd5 $afd6 $afd7 $afd8;
do
    if [ "X${nextowner}" = "X${i}" ]; then
        print=1;
    elif [ ${print} -eq 1 ]; then
        echo -n "${i} " >> ${output}
    fi
done

# Append the nodes that precede the new owner (wrap-around portion of the domain)
print=1;
for i in $afd1 $afd2 $afd3 $afd4 $afd5 $afd6 $afd7 $afd8; do
    if [ "X${nextowner}" = "X${i}" ]; then
        print=0;
    elif [ ${print} -eq 1 ]; then
        echo -n "${i} " >> ${output}
    fi
done

echo >> ${output};
exit 0;

Creating a New Failover Script

If the ordered or round-robin scripts do not meet your needs, you can create a new failover script and place it in the /var/cluster/ha/policies directory. You can then configure the cluster database to use your new failover script for the required resource groups.

Failover Script Interface

The following is passed to the failover script:

function(version, name, owner, attributes, possibleowners, domain)

version 

FailSafe version. The IRIX FailSafe 2.1.x release uses version number 1.

name 

Name of the failover script (used for error validations and logging purposes).

owner 

Logical name of the node that has (or had) the resource group online.

attributes 

Failover attributes (Auto_Failback or Controlled_Failback must be included).

possibleowners 

List of possible owners for the resource group. This list can be a subset of the current FailSafe membership.

domain 

Ordered list of nodes used at the last failover. (At the first failover, the initial failover domain is used.)

The failover script returns the newly generated run-time failover domain to FailSafe, which then chooses the node on which the resource group should be allocated by applying the failover attributes and FailSafe membership to the run-time failover domain.
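The following is a minimal sketch of a custom failover script that follows this interface. It is not shipped with FailSafe; the name mypolicy and its path are placeholders. It reads the six input lines in the same way as the supplied scripts and, as a trivial policy, returns the previous run-time failover domain unchanged; a real script would reorder the nodes before writing the output file.

#!/sbin/ksh
#
# Sketch of a custom failover script (placeholder name: mypolicy).
# $1 - input file, $2 - output file; the six input lines are the same
# as those documented for the ordered and round-robin scripts.

DIR=/usr/cluster/bin
LOG="${DIR}/ha_cilog -g ha_script -s script"
FILE=/var/cluster/ha/policies/mypolicy      # placeholder path

input=$1
output=$2

# Read the six lines of the input file
cat ${input} | read version
head -2 ${input} | tail -1 | read name
head -3 ${input} | tail -1 | read owner
head -4 ${input} | tail -1 | read attr
head -5 ${input} | tail -1 | read mem1 mem2 mem3 mem4 mem5 mem6 mem7 mem8
head -6 ${input} | tail -1 | read afd1 afd2 afd3 afd4 afd5 afd6 afd7 afd8

if [ "${version}" -ne 1 ]; then
    ${LOG} -l 1 "ERROR: ${FILE}: unsupported version (${version})"
    exit 1
fi

# Trivial policy: return the previous run-time failover domain unchanged.
rm -f ${output}
echo $afd1 $afd2 $afd3 $afd4 $afd5 $afd6 $afd7 $afd8 > ${output}
exit 0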

Creating a Failover Policy that Returns the Resource Group to the Same Node

When HA services are stopped and later restarted, a resource group continues to run on the same node (that is, the node where it was running at the time HA services were stopped), as long as that node is available in the cluster. The resource group switches to a different node in the failover domain only when that node is not available.

When resources are started, the node information can be stored in the configuration database as the resource group owner. When resources are stopped, the resource group owner information is removed. When FailSafe is started, the resource group ownership is read from the configuration database as part of the failover policy. The failover policy script uses this information to determine the node where the resource group should run.
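For example, the reordering step of a custom failover script could keep the stored owner at the head of the run-time failover domain, as in the following sketch (the variable names follow the supplied scripts; this is not shipped code):

# Sketch: if the stored owner appears in the failover domain, emit it
# first, followed by the remaining nodes in their existing order.
new=""
for i in $afd1 $afd2 $afd3 $afd4 $afd5 $afd6 $afd7 $afd8; do
    if [ "X${i}" = "X${owner}" ]; then
        new="${owner} ${new}"
    else
        new="${new} ${i}"
    fi
done
rm -f ${output}
echo ${new} > ${output}
exit 0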

Example Failover Policies

There are two general types of configuration, each of which can have from two through eight nodes:

  • N nodes that can potentially fail over their applications to any of the other nodes in the cluster.

  • N primary nodes that can fail over to M backup nodes. For example, you could have three primary nodes and one backup node.

This section shows example failover policies for the following configurations:

  • N primary nodes and one backup node (N+1)

  • N primary nodes and two backup nodes (N+2)

  • N primary nodes and M backup nodes (N+M)


    Note: The diagrams in the following sections illustrate the configuration concepts discussed here, but they do not address all required or supported elements, such as reset hubs.


N+1 Configuration

Figure 3-1 shows a specific instance of an N+1 configuration in which there are three primary nodes and one backup node. (This is also known as a star configuration.) The disks shown could each be disk farms.

Figure 3-1. N+1 Configuration Concept


You could configure the following failover policies for load balancing:

  • Failover policy for RG1:

    • Initial failover domain = A, D

    • Failover attributes = Auto_Failback, Critical_RG

    • Failover script = ordered

  • Failover policy for RG2:

    • Initial failover domain = B, D

    • Failover attribute = Auto_Failback

    • Failover script = ordered

  • Failover policy for RG3:

    • Initial failover domain = C, D

    • Failover attribute = Auto_Failback

    • Failover script = ordered

If node A fails, RG1 will fail over to node D. As soon as node A reboots, RG1 will be moved back to node A.

If you change the failover attribute to Controlled_Failback for RG1 and node A fails, RG1 will fail over to node D and will remain running on node D even if node A reboots.

Suppose resource group RG1 is online on node A in the cluster. When the monitor of one of the resources in RG1 fails, FailSafe attempts to move the resource group to node D. If the release of RG1 from node A fails, FailSafe will reset node A and allocate the resource group on node D. If the Critical_RG failover attribute were not specified, RG1 would instead be put into srmd executable error status.
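For reference, a policy such as the one shown above for RG1 is typically defined with the FailSafe administration tools. The following cmgr session is an approximate sketch only: the policy name fp_rg1 is a placeholder, and the exact prompts and subcommand syntax may differ by release, so verify them against the FailSafe administration documentation.

cmgr> define failover_policy fp_rg1
Enter commands, when finished enter either "done" or "cancel"

failover_policy fp_rg1? set attribute to Auto_Failback
failover_policy fp_rg1? set attribute to Critical_RG
failover_policy fp_rg1? set script to ordered
failover_policy fp_rg1? set domain to A D
failover_policy fp_rg1? done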

N+2 Configuration

Figure 3-2 shows a specific instance of an N+2 configuration in which there are four primary nodes and two backup nodes. The disks shown could each be disk farms.

Figure 3-2. N+2 Configuration Concept


You could configure the following failover policies for resource groups RG7 and RG8:

  • Failover policy for RG7:

    • Initial failover domain = A, E, F

    • Failover attribute = Controlled_Failback

    • Failover script = ordered

  • Failover policy for RG8:

    • Initial failover domain = B, F, E

    • Failover attribute = Auto_Failback

    • Failover script = ordered

If node A fails, RG7 will fail over to node E. If node E also fails, RG7 will fail over to node F. If A is rebooted, RG7 will remain on node F.

If node B fails, RG8 will fail over to node F. If B is rebooted, RG8 will return to node B.

N+M Configuration

Figure 3-3 shows a specific instance of an N+M configuration in which there are four primary nodes and each can serve as a backup node. The disk shown could be a disk farm.

Figure 3-3. N+M Configuration Concept


You could configure the following failover policies for resource groups RG5 and RG6:

  • Failover policy for RG5:

    • Initial failover domain = A, B, C, D

    • Failover attribute = Controlled_Failback

    • Failover script = ordered

  • Failover policy for RG6:

    • Initial failover domain = C, A, D

    • Failover attribute = Controlled_Failback

    • Failover script = ordered

If node C fails, RG6 will fail over to node A. When node C reboots, RG6 will remain running on node A. If node A then fails, RG6 will return to node C and RG5 will move to node B. If node B then fails, RG5 moves to node C.