Chapter 2. Writing the Action Scripts and Adding Monitoring Agents

This chapter describes how to write the action scripts required for a plug-in and how to add monitoring agents. It discusses the following topics:

  • “Set of Action Scripts”

  • “Understanding the Execution of Action Scripts”

  • “Preparation”

  • “Script Format”

  • “Steps in Writing a Script”

  • “Examples of Action Scripts”

  • “Monitoring Agents”

Set of Action Scripts

The action scripts are the set of scripts that determine how a resource is started, monitored, and stopped.


Caution: Multiple instances of scripts may be executed at the same time. For more information, see “Understanding the Execution of Action Scripts”.

The following set of action scripts can be provided for each resource type:

  • exclusive, which verifies that the resource is not already running

  • start, which starts the resource

  • stop, which stops the resource

  • monitor, which monitors the resource

  • restart, which restarts the resource on the same node when a monitoring failure occurs

The start, stop, and exclusive scripts are required for every resource type.


Note: The start and stop scripts must be idempotent; that is, they have the appearance of being run once but can in fact be run multiple times. For example, if the start script is run for a resource that is already started, the script must not return an error.

A monitor script is required, but if you wish, it may contain only a return-success function. A restart script is required if the application must be able to restart on the same node in case of failure; however, the restart script too may contain only a return-success function.
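
For example, the following is a minimal sketch of a monitor script body that simply reports success for each resource, using the scriptlib conventions described later in this chapter (myresource is a placeholder name):

monitor_myresource()
{
# report success for every resource passed to the script
for resource in ${HA_RES_NAMES}
do
    ha_write_status_for_resource ${resource} ${HA_SUCCESS};
done
}

monitor_myresource;

exit_script $HA_SUCCESS;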

Understanding the Execution of Action Scripts

Before you can write a new action script, you must understand how action scripts are executed. This section covers the following topics:

  • “When Action Scripts are Executed”

  • “Multiple Instances of a Script Executed at the Same Time”

  • “Differences between the exclusive and monitor Scripts”

  • “Successful Execution of Action Scripts”

  • “Failure of Action Scripts”

  • “Implementing Timeouts and Retrying a Command”

  • “Sending Signals”

When Action Scripts are Executed

Table 2-1 shows the circumstances under which action scripts are executed.

Table 2-1. Execution of Action Scripts

Script       Execution Conditions

exclusive    • A resource group is made online by the user
             • High-availability (HA) processes (ha_cmsd, ha_gcd, ha_fsd,
               ha_srmd, ha_ifd) are started

start        • A resource group is made online by the user
             • HA processes are started
             • A resource group fails over

stop         • A resource group is made offline
             • HA processes are stopped
             • A resource group fails over
             • A node is shut down or rebooted

monitor      • A resource group is online

restart      • The monitor script fails


Multiple Instances of a Script Executed at the Same Time

Multiple instances of the same script may execute at the same time. To avoid conflicts between instances, you can use the ha_filelock and ha_execute_lock commands to serialize the execution of commands across different instances of the same script.

For example, multiple instances of xlv_assemble should not be executed in a node at the same time. Therefore, the start script for volumes should execute xlv_assemble under the control of ha_execute_lock as follows:

${HA_CMDSPATH}/ha_execute_lock 30 \
    ${HA_SCRIPTTMPDIR}/lock.volume_assemble \
    "/sbin/xlv_assemble -l -s${VOLUME_NAME}"

All resources of the same resource type in a given resource group are passed as parameters to the action scripts.

The ha_execute_lock command takes the following arguments:

  • Number of seconds before the command times out waiting for the file lock

  • File to be used for locking

  • Command to be executed

The ha_execute_lock command tries to obtain a lock on the file once per second until the timeout expires. After obtaining the lock, it executes the command argument; on command completion, it releases the lock on the file.

Differences between the exclusive and monitor Scripts

Although the same check can be used in monitor and exclusive action scripts, they are used for different purposes. Table 2-2 summarizes the differences between the scripts.

Table 2-2. Differences Between the monitor and exclusive Action Scripts

exclusive   Executed on all nodes in the cluster.
monitor     Executed only on the node where the resource group (which
            contains the resource) is online.

exclusive   Executed before the resource is started in the cluster.
monitor     Executed while the resource is online in the cluster. (The
            monitor script could degrade the services provided by the HA
            server. Therefore, the check performed by the monitor script
            should be lightweight and less time-consuming than the check
            performed by the exclusive script.)

exclusive   Executed only once, before the resource group is made online
            in the cluster.
monitor     Executed periodically.

exclusive   Failure results in the resource group not coming online in
            the cluster.
monitor     Failure causes a resource group failover to another node or a
            restart of the resource on the local node.


Successful Execution of Action Scripts

Table 2-3 shows the state of a resource group after the successful execution of an action script for every resource within a resource group. To view the state of a resource group, use the FailSafe Manager graphical user interface (GUI) or the cmgr command.

Table 2-3. Successful Action Script Results

Event                                         Resource Group State   Script to Execute

Resource group is made online on a node      online                 start
Resource group is made offline on a node     offline                stop
Online status of the resource group          (No effect)            exclusive
Normal monitoring of online resource group   online                 monitor
Resource group monitoring failure            online                 restart


Failure of Action Scripts

Table 2-4 shows the state of the resource group and the error state when an action script fails. (There are no offline states with errors.)

Table 2-4. Failure of an Action Script

Failing Script   Resource Group State   Error State

exclusive        online                 exclusivity
monitor          online                 monitoring failure
restart          online                 monitoring failure
start            online                 srmd executable error
stop             online                 srmd executable error

When monitoring fails, FailSafe stops monitoring the resource. After recovering the resource, the system administrator must bring the resource group online again in order for FailSafe to resume monitoring it. For example, if a start script fails, the state of the resource and the resource group will be online and the error will be srmd executable error; FailSafe will then attempt to move the resource group to other nodes in the application failover domain.

Implementing Timeouts and Retrying a Command

You can use the ha_exec2 command to execute commands from action scripts with a timeout. This ensures that the action script completes within the specified time and that proper error messages are logged on failure or timeout. The retry capability is especially useful in monitor and exclusive action scripts.

To retry a command, use the following syntax:

/usr/cluster/bin/ha_exec2  timeout_in_seconds   number_of_retries   command

For example:

${HA_CMDSPATH}/ha_exec2 30 2 "umount /fs"

The above ha_exec2 command executes the umount /fs command line. If the command does not complete within 30 seconds, ha_exec2 kills it and tries again. The ha_exec2 command retries the umount command up to two times if it times out or fails.

The ha_exec2 command executes the command string passed as a parameter. If the command string completes execution, ha_exec2 returns the command's own exit code. However, if there is a failure, ha_exec2 returns one of the following special exit codes:

  • 100: the command could not be executed

  • 101: there was an invalid argument to ha_exec2

  • 102: the ha_exec2 command failed

  • 103: the command timed out and was killed by ha_exec2

  • 104: the command timed out and ha_exec2 could not kill the command

  • 105: the command exited with no error code
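
For example, a monitor script fragment can distinguish timeouts from ordinary failures by testing these codes; in the following sketch, /usr/local/bin/check_server is a hypothetical resource check:

# run the check with a 30-second timeout and two retries
${HA_CMDSPATH}/ha_exec2 30 2 "/usr/local/bin/check_server"
rc=$?
case $rc in
0)
    # check succeeded
    ;;
103|104)
    # the check timed out (and, for 104, could not be killed)
    ${HA_LOG} "check_server timed out (ha_exec2 exit code $rc)";
    ha_write_status_for_resource ${resource} ${HA_CMD_FAILED};
    exit_script $HA_CMD_FAILED;
    ;;
*)
    # any other nonzero code is an ordinary failure
    ${HA_LOG} "check_server failed (ha_exec2 exit code $rc)";
    ha_write_status_for_resource ${resource} ${HA_CMD_FAILED};
    exit_script $HA_CMD_FAILED;
    ;;
esac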

For more information, see the ha_exec2 man page.

Sending Signals

You can use the ha_exec2 command to send signals to specific processes. A process is identified by its name or its arguments.

For example:

${HA_CMDSPATH}/ha_exec2 -s 0 -t "SYBASE_DBSERVER"

The above command sends signal 0 (which checks whether the process exists) to all processes whose name or arguments match the SYBASE_DBSERVER string. The command returns 0 on success.

You should use the ha_exec2 command to check for server processes in the monitor script, rather than a ps -ef | grep command line.
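
For example, a monitor script fragment that checks for a hypothetical server process named myserverd might look like the following sketch:

# signal 0 sends no signal; it only checks that the process exists
${HA_CMDSPATH}/ha_exec2 -s 0 -t "myserverd"
if [ $? -ne 0 ]; then
    ${HA_LOG} "myserverd process not found";
    ha_write_status_for_resource ${resource} ${HA_CMD_FAILED};
    exit_script $HA_CMD_FAILED;
fi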

For more information, see the ha_exec2 man page.

Preparation

Before you can write the action scripts, you must do the following:

  • Understand the scriptlib functions described in Appendix B, “Using the Script Library”.

  • Familiarize yourself with the script templates provided in the following directory:

    /var/cluster/ha/resource_types/template

  • Read the man pages for the following commands:

    • cmgr

    • fs2d

    • ha_cilog

    • ha_cmsd

    • ha_exec2

    • ha_fsd

    • ha_gcd

    • ha_ifd

    • ha_ifdadmin

    • ha_macconfig2

    • ha_srmd

    • ha_statd2

    • haStatus

  • Familiarize yourself with the action scripts for other highly available services in /var/cluster/ha/resource_types that are similar to the scripts you wish to create.

  • Understand how to do the following actions for your application:

    • Verify that the resource is running

    • Verify that the resource can be run

    • Start the resource

    • Stop the resource

    • Check for the server processes

    • Do a simple query as a client and understand the expected response

    • Check for configuration file or directory existence (as needed)

  • Determine whether or not monitoring is required (see “Is Monitoring Necessary?”). However, even if monitoring is not needed, a monitor script is still required; in this case, it can contain only a return-success function.

  • Determine if a resource type must be added to the cluster database.

  • Understand the vendor-supplied startup and shutdown procedures.

  • Determine the configuration parameters for the application; these may be used in the action script and should be stored in the cluster database. Action scripts may read from the database.

  • Determine whether the resource type can be restarted on the local node and whether this action makes sense.

Is Monitoring Necessary?

In the following situations, you may not need to perform application monitoring:

  • Heartbeat monitoring is sufficient; that is, simply verifying that the node is alive (provided automatically by the base software) determines the health of the highly available service.

  • There is no process or resource that can be monitored. For example, the SGI Gauntlet Internet Firewall software performs IP filtering on firewall nodes. Because the filtering is done in the kernel, there is no process or resource to monitor.

  • A resource on which the application depends is already monitored. For example, monitoring some client-node resources might best be done by monitoring the file systems, volumes, and network interfaces they use. Because this is already done by the base software, additional monitoring is not required.


    Caution: Monitoring should be as lightweight as possible so that it does not affect system performance. Also, security issues may make monitoring difficult. If you are unable to provide a monitoring script with appropriate performance and security, consider a monitoring agent; see “Monitoring Agents”.


Types of Monitoring

There are two types of monitoring that may be accomplished in a monitor script:

  • Is the resource present?

  • Is the resource responding?

You can define multiple levels of monitoring within the monitor script, and the administrator can choose the desired level by configuring the resource definition in the cluster database. Ensure that the monitoring level chosen does not affect system performance. For more information, see the FailSafe Administrator's Guide for SGI InfiniteStorage.
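
For example, a monitor script could branch on a level attribute read from the resource definition, as in the following sketch. The monitor-level attribute, the myserverd process name, and the simple_query command are hypothetical; you would define equivalents for your own resource type:

# read the administrator-chosen monitoring level from the resource
# definition; default to the lightweight check if it is absent
ha_get_field "${HA_STRING}" monitor-level
if [ $? -eq 0 ]; then
    LEVEL="$HA_FIELD_VALUE"
else
    LEVEL=1
fi

case $LEVEL in
1)
    # level 1: is the resource present?
    ${HA_CMDSPATH}/ha_exec2 -s 0 -t "myserverd"
    ;;
*)
    # level 2 and above: is the resource responding?
    ${HA_CMDSPATH}/ha_exec2 10 1 "/usr/local/bin/simple_query"
    ;;
esac
if [ $? -ne 0 ]; then
    ha_write_status_for_resource ${resource} ${HA_CMD_FAILED};
    exit_script $HA_CMD_FAILED;
fi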

What are the Symptoms of Monitoring Failure?

Possible symptoms of failure include the following:

  • The resource returns an error code

  • The resource returns the wrong result

  • The resource does not return quickly enough

How Often Should Monitoring Occur?

You must determine the monitoring interval time and time-out value for the monitor script. The time-out must be long enough to guarantee that occasional anomalies do not cause false failovers. It will be useful for you to determine the peak load that the resource may need to sustain.

You must also determine if the monitor test should execute multiple times so that an application is not declared dead after a single failure. In general, testing more than once before declaring failure is a good idea.
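
The retry argument of ha_exec2, described earlier, is one way to accomplish this. In the following sketch, /usr/local/bin/simple_query is a hypothetical client query command:

# retry the query up to three times before declaring failure;
# each attempt has a 10-second timeout
${HA_CMDSPATH}/ha_exec2 10 3 "/usr/local/bin/simple_query"
if [ $? -ne 0 ]; then
    ha_write_status_for_resource ${resource} ${HA_CMD_FAILED};
    exit_script $HA_CMD_FAILED;
fi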

Examples of Testing for Monitoring Failure

The test should be simple and complete quickly, whether it succeeds or fails. Some examples of tests are as follows:

  • For a client/server resource that follows a well-defined protocol, the monitor script can make a simple request and verify that the proper response is received.

  • For a web server application, the monitor script can request a home page, verify that the connection was made, and ignore the resulting home page.

  • For a database, a simple request such as querying a table can be made.

  • For NFS, more complicated end-to-end monitoring is required. The test might consist of mounting an exported file system, checking access to the file system with a stat() system call to the root of the file system, and undoing the mount.

  • For a resource that writes to a log file, check that the size of the log file is increasing or use the grep command to check for a particular message. (A sketch of such a check follows this list.)

  • The following command can be used to determine quickly whether a process exists:

    /sbin/killall -0 process_name

    You can also use the ha_exec2 command to check if a process is running.

    The ha_exec2 command differs from killall in that it performs a more exhaustive check on the process name as well as process arguments. killall searches for the process using the process name only. The command line is as follows:

    /usr/cluster/bin/ha_exec2 -s 0 -t process_name


    Note: Do not use the ps command to check on a particular process because its execution can be too slow.
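
As an illustration of the log file check mentioned in the list above, the following sketch fails the monitor action if the log has not grown since the last monitor pass. The log file name and state file name are hypothetical, and the log file is assumed to exist:

LOGFILE=/var/adm/myapp.log
SIZEFILE=${HA_SCRIPTTMPDIR}/myapp_logsize.${resource}

# current size of the log file, in bytes
NEWSIZE=`/sbin/ls -l ${LOGFILE} | /usr/bin/awk '{print $5}'`
if [ -f ${SIZEFILE} ]; then
    OLDSIZE=`/sbin/cat ${SIZEFILE}`
    if [ ${NEWSIZE} -le ${OLDSIZE} ]; then
        ${HA_LOG} "log file ${LOGFILE} is not growing";
        ha_write_status_for_resource ${resource} ${HA_CMD_FAILED};
        exit_script $HA_CMD_FAILED;
    fi
fi
# remember the current size for the next monitor pass
echo ${NEWSIZE} > ${SIZEFILE}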


Script Format

Templates for the action scripts are provided in the following directory:

/var/cluster/ha/resource_types/template

The template scripts have the same general format. Following is the type of information in the order in which it appears in the template scripts:

  • Header information

  • Set local variables

  • Read resource information

  • Exit status

  • Perform the basic action of the script, which is the customized area you must provide

  • Set global variables

  • Verify arguments

  • Read input file


    Note: Action “scripts” can be of any form, such as a Bourne shell script, Perl script, or C language program. The rest of this chapter discusses Korn shell scripts.


The following sections show an example from the NFS start script.

Header Information

The header information contains comments about the resource type, script type, and resource configuration format. You must modify the code as needed.

Following is the header for the NFS start script:

#!/sbin/ksh

# **************************************************************************
# *                                                                        *
# *                  Copyright (C) 1998 Silicon Graphics, Inc.             *
# *                                                                        *
# *  These coded instructions, statements, and computer programs  contain  *
# *  unpublished  proprietary  information of Silicon Graphics, Inc., and  *
# *  are protected by Federal copyright law.  They  may  not be disclosed  *
# *  to  third  parties  or copied or duplicated in any form, in whole or  *
# *  in part, without the prior written consent of Silicon Graphics, Inc.  *
# *                                                                        *
# **************************************************************************
#ident "$Revision: 1.25 $"

# Resource type: NFS
# Start script NFS

#
# Test resource configuration information is present in the database in
# the following format
#
# resource-type.NFS
#

Set Local Variables

The set_local_variables() section of the script defines all of the variables that are local to the script, such as temporary file names or database keys. All local variables should use the LOCAL_ prefix. You must modify the code as needed.

Following is the set_local_variables() section from the NFS start script:

set_local_variables()
{
LOCAL_TEST_KEY=NFS
}

Read Resource Information

The get_xxx_info() function, such as get_nfs_info(), reads the resource information from the cluster database. $1 is the test resource name. If the operation is successful, a value of 0 is returned; if the operation fails, 1 is returned.

The information is returned in the HA_STRING variable. For more information about HA_STRING, see Appendix B, “Using the Script Library”.

Following is the get_nfs_info() section from the NFS start script:

get_nfs_info ()
{
     ha_get_info ${LOCAL_TEST_KEY} $1
     if [ $? -ne 0 ]; then
          return 1;
     else
          return 0;
     fi
}

Call ha_get_info with a third argument of any value to obtain all attributes and dependency information for a resource from the cluster database. Use ha_get_multi_fields to retrieve specific dependency information. The resource dependency information is returned in the $HA_FIELD_VALUE variable.
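
For example, the following sketch extends the get_nfs_info() example above into a hypothetical get_nfs_all_info() function; the literal all argument is arbitrary, because any third argument triggers the full lookup:

get_nfs_all_info ()
{
     # fetch all attributes and dependency information for the
     # resource named in $1; the information is returned in
     # the ${HA_STRING} variable
     ha_get_info ${LOCAL_TEST_KEY} $1 all
     if [ $? -ne 0 ]; then
          return 1;
     else
          return 0;
     fi
}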

Exit Status

In the exit_script() function, $1 contains the exit_status value. If cleanup actions are required, such as the removal of temporary files that were created as part of the process, place them before the exit line.

Following is the exit_script() section from the NFS start script:

exit_script()
{
${HA_DBGLOG} "Exit: exit_script()";
exit $1;
}


Note: If you call the exit_script function prior to normal termination, it should be preceded by the ha_write_status_for_resource function and you should use the same return code that is logged to the output file.
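
For example, the NFS scripts shown later in this chapter use the following pattern when exiting early because configuration information is missing:

# record the status for the resource, then exit with the same code
ha_write_status_for_resource ${resource} ${HA_NOCFGINFO};
exit_script ${HA_NOCFGINFO};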


Basic Action

This area of the script is the portion you must customize. The templates provide a minimal framework.

Following is the framework for the basic action from the start template:

start_template()
{
# for all template resources passed as parameter
for TEMPLATE in $HA_RES_NAMES
do
    #HA_CMD="command to start $TEMPLATE resource on the local machine";

    #ha_execute_cmd "string to describe the command being executed";

    ha_write_status_for_resource $TEMPLATE $HA_SUCCESS;
done
}


Note: When testing the script, you will add the following line to this area to obtain debugging information:
set -x


For examples of this area, see “Examples of Action Scripts”.

Set Global Variables

The set_global_variables() function sets all of the global definitions by reading in the scriptlib common library and calling ha_set_global_defs.

Following is the set_global_variables() function from the NFS start script:

set_global_variables()
{
     HA_DIR=/var/cluster/ha
     COMMON_LIB=${HA_DIR}/common_scripts/scriptlib

     # Execute the common library file
     . $COMMON_LIB

     ha_set_global_defs;
}

Verify Arguments

The ha_check_args() function verifies the arguments and stores them in the $HA_INFILE and $HA_OUTFILE variables. It returns 1 on error and 0 on success.

Following is the ha_check_args() section from the NFS start script:

ha_check_args $*;
if [ $? -ne 0 ]; then
     exit $HA_INVAL_ARGS;
fi

Read Input File

The ha_read_infile() function reads the input file and stores the resource names in the $HA_RES_NAMES variable. This function is defined in the scriptlib library. See “Read an Input File” in Appendix B.

Following is code from the NFS start script that calls the ha_read_infile() function:

# Read the input file and store the resource names in $HA_RES_NAMES 
# variable

ha_read_infile;

Complete the Action

Each action script ends with the following, which performs the action and writes the output status to the $HA_OUTFILE:

action_resourcetype;
 
exit_script $HA_SUCCESS

Following is the completion from the NFS start script:

start_nfs;
 
exit_script $HA_SUCCESS;

Steps in Writing a Script


Caution: Multiple copies of action scripts can execute at the same time. Therefore, all temporary filenames used by the scripts should be suffixed with script.$$ to make them unique, or you can use the resource name because it must be unique within the cluster.
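
For example, a script could name its temporary files as follows (myquery and status are placeholder names):

# unique per script instance: suffix with the process ID
TMPFILE=${HA_SCRIPTTMPDIR}/myquery.$$

# unique per resource: use the resource name, which is unique
# within the cluster
STATFILE=${HA_SCRIPTTMPDIR}/status.${resource}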

For each script, you must do the following:

  • Get the required variables

  • Check the variables

  • Perform the action

  • Check the action


    Note: The start and stop scripts are required to be idempotent; that is, they have the appearance of being run once but can in fact be run multiple times. For example, if the start script is run for a resource that is already started, the script must not return an error.

    All action scripts must return the status to the following file:

    /var/cluster/ha/log/script_nodename

Examples of Action Scripts

The following sections use portions of the NFS scripts as examples.


Note: The examples in this guide may not exactly match the released system.


start Script

The NFS start script does the following:

  1.  Creates a resource-specific NFS status directory.

  2.  Exports the specified export-point with the specified export-options.

Following is a section from the NFS start script:

# Start the resource on the local machine.
# Return HA_SUCCESS if the resource has been successfully started on the local
# machine and HA_CMD_FAILED otherwise.
#
start_nfs()
{
${HA_DBGLOG} "Entry: start_nfs()";

# for all nfs resources passed as parameter
for resource in ${HA_RES_NAMES}
do
NFSFILEDIR=${HA_SCRIPTTMPDIR}/${LOCAL_TEST_KEY}$resource
HA_CMD="/sbin/mkdir -p $NFSFILEDIR";
ha_execute_cmd "creating nfs status file directory";
if [ $? -ne 0 ]; then
   ${HA_LOG} "Failed to create ${NFSFILEDIR} directory";
   ha_write_status_for_resource ${resource} ${HA_NOCFGINFO};
   exit_script $HA_NOCFGINFO
fi

get_nfs_info $resource
if [ $? -ne 0 ]; then
    ${HA_LOG} "NFS: $resource parameters not present in CDB";
    ha_write_status_for_resource ${resource} ${HA_NOCFGINFO};
    exit_script ${HA_NOCFGINFO};
fi

ha_get_field "${HA_STRING}" export-info
if [ $? -ne 0 ]; then
    ${HA_LOG} "NFS: export-info not present in CDB for resource $resource";
    ha_write_status_for_resource ${resource} ${HA_NOCFGINFO};
    exit_script ${HA_NOCFGINFO};
fi
export_opts="$HA_FIELD_VALUE"

ha_get_field "${HA_STRING}" filesystem
if [ $? -ne 0 ]; then
    ${HA_LOG} "NFS: filesystem-info not present in CDB for resource
$resource";
    ha_write_status_for_resource ${resource} ${HA_NOCFGINFO};
    exit_script ${HA_NOCFGINFO};
fi
filesystem="$HA_FIELD_VALUE"
# Make the script idempotent, check to see if the NFS resource
# is already exported, if so return success. Remember that we
# might not have any export options.
retstat=0;
# Check to see if the NFS resource is already exported
# (without options)
/usr/etc/exportfs | grep "$resource$" >/dev/null 2>&1
retstat=$?
if [ $retstat -eq 1 ]; then
    # Check to see if the NFS resource is already exported
    # with options.
    /usr/etc/exportfs | grep "$resource " | grep "$export_opts$" >/dev/null 2>&1
    retstat=$?
fi
if [ $retstat -eq 1 ]; then
    # Before we try and export the NFS resource, make sure
    # filesystem is mounted.
    HA_CMD="/sbin/grep $filesystem /etc/mtab > /dev/null 2>&1";
    ha_execute_cmd "check if the filesystem $filesystem is mounted";
    if [ $? -eq 0 ]; then
	HA_CMD="/usr/etc/exportfs -i -o $export_opts $resource";
	ha_execute_cmd "export $resource directories to NFS clients";
	if [ $? -ne 0 ]; then
	    ha_write_status_for_resource ${resource} ${HA_CMD_FAILED};
	else
	    ha_write_status_for_resource ${resource} ${HA_SUCCESS};
	fi
    else
	${HA_LOG} "NFS: filesystem $filesystem not mounted"
	ha_write_status_for_resource ${resource} ${HA_CMD_FAILED};
    fi
else
    ha_write_status_for_resource ${resource} ${HA_SUCCESS};
fi
done
}

stop Script

The NFS stop script does the following:

  1. Unexports the specified export-point.

  2. Removes the NFS status directory.

Following is an example from the NFS stop script:

# Stop the nfs resource on the local machine.
# Return HA_SUCCESS if the resource has been successfully stopped on the local
# machine and HA_CMD_FAILED otherwise.
#
stop_nfs()
{

${HA_DBGLOG} "Entry: stop_nfs()";

# for all nfs resources passed as parameter
for resource in ${HA_RES_NAMES}
do
get_nfs_info $resource
if [ $? -ne 0 ]; then
    # NFS resource information not available.
    ${HA_LOG} "NFS: $resource parameters not present in CDB";
    ha_write_status_for_resource ${resource} ${HA_NOCFGINFO};
    exit_script ${HA_NOCFGINFO};
fi

ha_get_field "${HA_STRING}" export-info
if [ $? -ne 0 ]; then
    ${HA_LOG} "NFS: export-info not present in CDB for resource $resource";
    ha_write_status_for_resource ${resource} ${HA_NOCFGINFO};
    exit_script ${HA_NOCFGINFO};
fi
export_opts="$HA_FIELD_VALUE"

# Make the script idempotent, check to see if the filesystem
# is already exported, if so return success. Remember that we
# might not have any export options.

retstat=0;
# Check to see if the filesystem is already exported
# (without options)
/usr/etc/exportfs | grep "$resource$" >/dev/null 2>&1
retstat=$?
if [ $retstat -eq 1 ]; then
    # Check to see if the filesystem is already exported
    # with options.
    /usr/etc/exportfs | grep "$resource " | grep "$export_opts$" >/dev/null 2>&1
    retstat=$?
fi
if [ $retstat -eq 0 ]; then
    # Before we unexport the filesystem, check that it exists
    HA_CMD="/sbin/grep $resource /etc/mtab > /dev/null 2>&1";
    ha_execute_cmd "check if the export-point exists";
    if [ $? -eq 0 ]; then
	HA_CMD="/usr/etc/exportfs -u $resource";
	ha_execute_cmd "unexport $resource directories to NFS clients";
	if [ $? -ne 0 ]; then
	    ha_write_status_for_resource ${resource} ${HA_CMD_FAILED};
	else
	    ha_write_status_for_resource ${resource} ${HA_SUCCESS};
	fi
    else
	${HA_LOG} "NFS: filesystem $resource not found in export filesystem list, \
	unexporting anyway";
	HA_CMD="/usr/etc/exportfs -u $resource";
	ha_execute_cmd "unexport $resource directories to NFS clients";
	ha_write_status_for_resource ${resource} ${HA_SUCCESS};
    fi
else
    ha_write_status_for_resource ${resource} ${HA_SUCCESS};
fi

# remove the monitor nfs status file
NFSFILEDIR=${HA_SCRIPTTMPDIR}/${LOCAL_TEST_KEY}$resource
HA_CMD="/sbin/rm -rf $NFSFILEDIR";
ha_execute_cmd "removing nfs status file directory";
if [ $? -ne 0 ]; then
   ${HA_LOG} "Failed to delete ${NFSFILEDIR} directory";
   ha_write_status_for_resource ${resource} ${HA_NOCFGINFO};
   exit_script $HA_NOCFGINFO
fi
done
}

monitor Script

The NFS monitor script does the following:

  1. Verifies that the file system is mounted at the correct mount point.

  2. Requests the status of the exported file system.

  3. Checks the export-point.

  4. Requests NFS statistics and (based on the results) makes a Remote Procedure Call (RPC) to NFS as needed.

Following is an example from the NFS monitor script:

# Check if the nfs resource is allocated on the local node.
# This check must be lightweight and less intrusive than the
# exclusive check. This check is done when the resource has been
# allocated on the local node.
# Return HA_SUCCESS if the resource is running on the local node
# and HA_CMD_FAILED if the resource is not running on the local node.
# The list of the resources passed as input is in variable
# $HA_RES_NAMES
#
monitor_nfs()
{
${HA_DBGLOG} "Entry: monitor_nfs()";

for resource in ${HA_RES_NAMES}
do
get_nfs_info $resource
if [ $? -ne 0 ]; then
    # No resource information available.
    ${HA_LOG} "NFS: $resource parameters not present in CDB";
    ha_write_status_for_resource ${resource} ${HA_NOCFGINFO};
    exit_script ${HA_NOCFGINFO};
fi

ha_get_field "${HA_STRING}" filesystem
if [ $? -ne 0 ]; then
    # filesystem information not available.
    ${HA_LOG} "NFS: filesystem not present in CDB for resource $resource";
    ha_write_status_for_resource ${resource} ${HA_NOCFGINFO};
    exit_script ${HA_NOCFGINFO};
fi
fs="$HA_FIELD_VALUE";


# Check to see if the filesystem is mounted
HA_CMD="/sbin/mount | grep $fs  >> /dev/null 2>&1"
ha_execute_cmd "check to see if $fs is mounted"
if [ $? -ne 0 ]; then
    ${HA_LOG} "NFS: $fs not mounted";
    ha_write_status_for_resource ${resource} ${HA_CMD_FAILED};
    exit_script $HA_CMD_FAILED;
fi

# stat the filesystem
HA_CMD="/sbin/stat $resource";
ha_execute_cmd "stat mount point $resource"
if [ $? -ne 0 ]; then
    ${HA_LOG} "NFS: cannot stat $resource NFS export point";
    ha_write_status_for_resource ${resource} ${HA_CMD_FAILED};
    exit_script $HA_CMD_FAILED;
fi

# check the filesystem is exported
EXPORTFS="${HA_SCRIPTTMPDIR}/exportfs.$$"
/usr/etc/exportfs > $EXPORTFS 2>&1
HA_CMD="awk '{print \$1}' $EXPORTFS | grep $resource"
ha_execute_cmd " check the filesystem $resource is exported"
if [ $? -ne 0 ]; then
    ${HA_LOG} "NFS: failed to find $resource in exported filesystem list:-"
    ${HA_LOG} "`/sbin/cat ${EXPORTFS}`"
    rm -f $EXPORTFS;
    ha_write_status_for_resource ${resource} ${HA_CMD_FAILED};
    exit_script $HA_CMD_FAILED;
fi

rm -f $EXPORTFS

# create a file to hold the nfs stats. This file will be
# deleted in the stop script.
NFSFILE=${HA_SCRIPTTMPDIR}/${LOCAL_TEST_KEY}$resource/.nfsstat
NFS_STAT=`/usr/etc/nfsstat -rs | /usr/bin/tail -1 | /usr/bin/awk '{print $1}'`
if [ ! -f $NFSFILE ]; then
   ${HA_LOG} "NFS: creating stat file $NFSFILE";
   echo $NFS_STAT > $NFSFILE;
   if [ $NFS_STAT -eq 0 ];then
      # do some rpcinfo's
      exec_rpcinfo;
      if [ $? -ne 0 ]; then
	 ${HA_LOG} "NFS: exec_rpcinfo failed (1)";
	 ha_write_status_for_resource ${resource} ${HA_CMD_FAILED};
	 exit_script $HA_CMD_FAILED;
      fi
   fi
else
   OLD_STAT=`/sbin/cat $NFSFILE`
   if test "X${NFS_STAT}" = "X"; then
	${HA_LOG} "NFS: NFS_STAT is not set, reset to zero";
	NFS_STAT=0;
   fi
   if test "X${OLD_STAT}" = "X"; then
	${HA_LOG} "NFS: OLD_STAT is not set, reset to zero";
	OLD_STAT=0;
   fi
   if [ $NFS_STAT -gt $OLD_STAT ]; then
	echo $NFS_STAT > $NFSFILE;
   else
	echo $NFS_STAT > $NFSFILE;
	exec_rpcinfo;
	if [ $? -ne 0 ]; then
	   ${HA_LOG} "NFS: exec_rpcinfo failed (2)";
	   ha_write_status_for_resource $resource ${HA_CMD_FAILED};
	   exit_script $HA_CMD_FAILED;
	fi
   fi
fi
ha_write_status_for_resource $resource $HA_SUCCESS;
done
}

exclusive Script

The NFS exclusive script determines whether the file system is already exported. The check made by an exclusive script can be more expensive than a monitor check. FailSafe uses this script to determine if resources are running on a node in the cluster, and to thereby prevent starting resources on multiple nodes in the cluster.

Following is an example from the NFS exclusive script:

# Check if the nfs resource is running on the local node. This check can be
# more intrusive than the monitor check. This check is used to determine
# if the resource has to be started on a machine in the cluster.
# Return HA_NOT_RUNNING if the resource is not running on the local node
# and HA_RUNNING if the resource is running on the local node.
# The list of nfs resources passed as input is in variable
# $HA_RES_NAMES
#
exclusive_nfs()
{

${HA_DBGLOG} "Entry: exclusive_nfs()";

# for all resources passed as parameter
for resource in ${HA_RES_NAMES}
do
get_nfs_info $resource
if [ $? -ne 0 ]; then
    # No resource information available
    ${HA_LOG} "NFS: $resource parameters not present in CDB";
    ha_write_status_for_resource ${resource} ${HA_NOCFGINFO};
    exit_script ${HA_NOCFGINFO};
fi

SMFILE=${HA_SCRIPTTMPDIR}/showmount.$$
/etc/showmount -x >> ${SMFILE};
HA_CMD="/sbin/grep $resource ${SMFILE} >> /dev/null 2>&1"
ha_execute_cmd "checking for $resource exported directory"
if [ $? -eq 0 ];then
    ha_write_status_for_resource ${resource} ${HA_RUNNING};
    ha_print_exclusive_status ${resource} ${HA_RUNNING};
else
    ha_write_status_for_resource ${resource} ${HA_NOT_RUNNING};
    ha_print_exclusive_status ${resource} ${HA_NOT_RUNNING};
fi
rm -f ${SMFILE}
done
}

restart Script

The NFS restart script exports the specified export-point with the specified export-options.

Following is an example from the restart script for NFS:

# Restart nfs resource
# Return HA_SUCCESS if nfs resource failed over successfully or
# return HA_CMD_FAILED if nfs resource could not be failed over locally.
# Return HA_NOT_SUPPORTED if local restart is not supported for nfs
# resource type.
# The list of nfs resources passed as input is in variable
# $HA_RES_NAMES
#
restart_nfs()
{
${HA_DBGLOG} "Entry: restart_nfs()";

# for all nfs resources passed as parameter
for resource in ${HA_RES_NAMES}
do
get_nfs_info $resource
if [ $? -ne 0 ]; then
    ${HA_LOG} "NFS: $resource parameters not present in CDB";
    ha_write_status_for_resource ${resource} ${HA_NOCFGINFO};
    exit_script ${HA_NOCFGINFO};
fi

ha_get_field "${HA_STRING}" export-info
if [ $? -ne 0 ]; then
    ${HA_LOG} "NFS: export-info not present in CDB for resource $resource";
    ha_write_status_for_resource ${resource} ${HA_NOCFGINFO};
    exit_script ${HA_NOCFGINFO};
fi
export_opts="$HA_FIELD_VALUE"

HA_CMD="/usr/etc/exportfs -i -o $export_opts $resource";
ha_execute_cmd "export $resource directories to NFS clients";
if [ $? -ne 0 ]; then
   ha_write_status_for_resource ${resource} ${HA_CMD_FAILED};
else
   ha_write_status_for_resource ${resource} ${HA_SUCCESS};
fi
done
}

Monitoring Agents

If resources cannot be monitored using a lightweight check, you should use a monitoring agent. The monitor action script contacts the monitoring agent to determine the status of the resource in the node. The monitoring agent in turn periodically monitors the resource. Figure 2-1 shows the monitoring process.

Figure 2-1. Monitoring Process


Monitoring agents are useful for monitoring database resources. In databases, creating the database connection is costly and time-consuming. The monitoring agent maintains connections to the database, and it queries the database over those connections in response to monitor action script requests.

Monitoring agents are independent processes and can be started by the cmond process, although this is not required. For example, if a monitoring agent must be started when activating highly available services on a node, information about that agent can be added to the cmond configuration on that node. The cmond configuration is located in the /var/cluster/cmon/process_groups directory. Information about different agents should go into different files. The name of the file is not relevant to the activate/deactivate procedure.

If a monitoring agent exits or aborts, cmond will automatically restart the monitoring agent. This prevents monitor action script failures due to monitoring agent failures.

For example, the /var/cluster/cmon/process_groups/ip_addresses file contains information about the ha_ifd process that monitors network interfaces. It contains the following:

TYPE = cluster_agent
PROCS = ha_ifd
ACTIONS = start stop restart attach detach
AUTOACTION = attach


Note: The ACTIONS line above defines what cmond can do to the PROCS processes. These actions must be the same for every agent. (It does not refer to action scripts.)

If you create a new monitoring agent, you must also create a corresponding file in the /var/cluster/cmon/process_groups directory that contains similar information about the new agent. To do this, you can copy the ip_addresses file and modify the PROCS line to list the executables that constitute your new agent. These executables must be located in the /usr/cluster/bin directory. You should not modify the other configuration lines (TYPE, ACTIONS, and AUTOACTION).

Suppose you need to add a new agent called newagent that consists of processes ha_x and ha_y. The configuration information for this agent will be located in the /var/cluster/cmon/process_groups/newagent file, which will contain the following:

TYPE = cluster_agent
PROCS = ha_x ha_y
ACTIONS = start stop restart attach detach
AUTOACTION = attach

In this case, the software will expect two executables (/usr/cluster/bin/ha_x and /usr/cluster/bin/ha_y) to be present.