This chapter describes how to write the action scripts required for a plug-in and how to add monitoring agents. It discusses the following topics:
The action scripts are the set of scripts that determine how a resource is started, monitored, and stopped.
**Caution:** Multiple instances of scripts may be executed at the same time. For more information, see “Understanding the Execution of Action Scripts”.
The following set of action scripts can be provided for each resource type:
The start, stop, and exclusive scripts are required for every resource type.
**Note:** The start and stop scripts must be idempotent; that is, they have the appearance of being run once but can in fact be run multiple times. For example, if the start script is run for a resource that is already started, the script must not return an error.
A monitor script is required, but if you wish it may contain only a return-success function. A restart script is required if the application must have a restart ability on the same node in case of failure. However, the restart script may contain only a return-success function.
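The idempotence requirement can be illustrated with a minimal sketch. Everything here is hypothetical (the resource path and the state file stand in for real resource bookkeeping; this is not part of the FailSafe API): the start logic checks whether the resource is already active before acting, and reports success either way.

```shell
# Hypothetical idempotent start: "starting" a resource records it in a
# state file, and starting an already-started resource is a silent no-op.
STATEFILE=/tmp/demo_started.$$

start_resource() {
    # Already started? Then succeed without repeating the work.
    if grep -q "^$1\$" "$STATEFILE" 2>/dev/null; then
        return 0
    fi
    echo "$1" >> "$STATEFILE"    # the real start work would happen here
}

start_resource /export/data
start_resource /export/data    # second run must not fail or duplicate work
```

Running the function twice leaves exactly one entry in the state file and returns success both times, which is the behavior FailSafe expects when it re-runs start during failover.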
Before you can write a new action script, you must understand how action scripts are executed. This section covers the following topics:
Table 2-1 shows the circumstances under which action scripts are executed.
Table 2-1. Execution of Action Scripts
| Script | Execution Conditions |
|---|---|
| exclusive | A resource group is made online by the user |
| | High-availability (HA) processes (ha_cmsd, ha_gcd, ha_fsd, ha_srmd, ha_ifd) are started |
| start | A resource group is made online by the user |
| | HA processes are started |
| | A resource group fails over |
| stop | A resource group is made offline |
| | HA processes are stopped |
| | A resource group fails over |
| | A node is shut down or rebooted |
| monitor | A resource group is online |
| restart | The monitor script fails |
Multiple instances of the same script may be executed at the same time. To avoid conflicts, you can use the ha_filelock and ha_execute_lock commands to serialize commands across different instances of the same script.
For example, multiple instances of xlv_assemble should not be executed in a node at the same time. Therefore, the start script for volumes should execute xlv_assemble under the control of ha_execute_lock as follows:
```shell
${HA_CMDSPATH}/ha_execute_lock 30 ${HA_SCRIPTTMPDIR}/lock.volume_assemble \
    "/sbin/xlv_assemble -l -s${VOLUME_NAME}"
```
All resources of the same resource type in a given resource group are passed as parameters to the action scripts.
The ha_execute_lock command takes the following arguments:
- Number of seconds before the command times out waiting for the file lock
- File to be used for locking
- Command to be executed
The ha_execute_lock command tries to obtain a lock on the file every second for timeout seconds. After obtaining a lock on the file, it executes the command argument. On command completion, it releases the lock on the file.
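The locking behavior described above can be sketched in plain shell. This is an illustration of the semantics only, not the real ha_execute_lock; mkdir (which is atomic) stands in for the file locking that ha_filelock provides, and all paths are throwaway examples.

```shell
# Sketch of ha_execute_lock semantics: poll once per second for up to
# $1 seconds for the lock $2, then run the command $3 while holding it.
execute_with_lock() {
    _timeout=$1; _lock=$2; _cmd=$3
    _waited=0
    until mkdir "$_lock" 2>/dev/null; do      # atomic lock acquisition
        _waited=$((_waited + 1))
        if [ "$_waited" -ge "$_timeout" ]; then
            return 1                          # timed out waiting for the lock
        fi
        sleep 1
    done
    sh -c "$_cmd"; _rc=$?
    rmdir "$_lock"                            # release the lock on completion
    return $_rc
}

execute_with_lock 5 /tmp/lock.demo.$$ "echo volumes assembled"
```

Two script instances calling this with the same lock path run their commands one after the other rather than concurrently, which is the point of wrapping xlv_assemble this way.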
Although the same check can be used in monitor and exclusive action scripts, they are used for different purposes. Table 2-2 summarizes the differences between the scripts.
Table 2-2. Differences Between the monitor and exclusive Action Scripts
| exclusive | monitor |
|---|---|
| Executed on all nodes in the cluster. | Executed only on the node where the resource group (which contains the resource) is online. |
| Executed before the resource is started in the cluster. | Executed when the resource is online in the cluster. Because the monitor script could degrade the services provided by the HA server, its check should be lightweight and less time-consuming than the check performed by the exclusive script. |
| Executed only once, before the resource group is made online in the cluster. | Executed periodically. |
| Failure results in the resource group not coming online in the cluster. | Failure causes a resource group failover to another node or a restart of the resource on the local node. |
Table 2-3 shows the state of a resource group after the successful execution of an action script for every resource within a resource group. To view the state of a resource group, use the FailSafe Manager graphical user interface (GUI) or the cmgr command.
Table 2-3. Successful Action Script Results
| Event | Resource Group State | Script to Execute |
|---|---|---|
| Resource group is made online on a node | online | start |
| Resource group is made offline on a node | offline | stop |
| Online status of the resource group | (No effect) | exclusive |
| Normal monitoring of online resource group | online | monitor |
| Resource group monitoring failure | online | restart |
Table 2-4 shows the state of the resource group and the error state when an action script fails. (There are no offline states with errors.)
Table 2-4. Failure of an Action Script
| Failing Script | Resource Group State | Error State |
|---|---|---|
| exclusive | online | exclusivity |
| monitor | online | monitoring failure |
| restart | online | monitoring failure |
| start | online | srmd executable error |
| stop | online | srmd executable error |
When monitoring fails, FailSafe stops monitoring. After recovering the resource, the system administrator must bring the resource group online again in order for FailSafe to resume monitoring it. For example, if a start script fails, the state of the resource and the resource group will be online and the error will be srmd executable error. FailSafe attempts to move the resource group to other nodes in the application failover domain when the start script fails.
You can use the ha_exec2 command to execute action scripts using timeouts. This allows the action script to be completed within the specified time, and permits proper error messages to be logged on failure or timeout. The retry capability is especially useful in monitor and exclusive action scripts.
To retry a command, use the following syntax:
```shell
/usr/cluster/bin/ha_exec2 timeout_in_seconds number_of_retries command
```
For example:
```shell
${HA_CMDSPATH}/ha_exec2 30 2 "umount /fs"
```
The above ha_exec2 command executes the umount /fs command line. If the command does not complete within 30 seconds, ha_exec2 kills the umount command and retries it. The ha_exec2 command retries the umount command up to twice if it times out or fails.
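The timeout-and-retry behavior can be sketched with standard tools. This is a stand-in for ha_exec2, not the real command; it relies on the coreutils `timeout` utility and illustrative argument values.

```shell
# Sketch of ha_exec2-style semantics: run a command with a timeout,
# retrying it the given number of times if it fails or times out.
exec_with_retries() {
    _timeout=$1; _retries=$2; shift 2
    _attempt=0
    while :; do
        timeout "$_timeout" sh -c "$*"; _rc=$?
        [ "$_rc" -eq 0 ] && return 0                   # success: stop retrying
        _attempt=$((_attempt + 1))
        [ "$_attempt" -gt "$_retries" ] && return "$_rc"   # retries exhausted
    done
}

exec_with_retries 30 2 "true" && echo "command completed"
```

As with the real command, the final exit status is only reported after the last retry, so a transient failure that succeeds on a second attempt looks like a success to the caller.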
The ha_exec2 command executes the command string passed as a parameter. If the command string successfully completes execution and it returns an exit code, then that exit code is returned by ha_exec2. However, if there is a failure, the following special exit codes are returned by ha_exec2:
- 100: the command could not be executed
- 101: there was an invalid argument to ha_exec2
- 102: the ha_exec2 command failed
- 103: the command timed out and was killed by ha_exec2
- 104: the command timed out and ha_exec2 could not kill the command
- 105: the command exited with no error code
For more information, see the ha_exec2 man page.
You can use the ha_exec2 command to send signals to specific processes. A process is identified by its name or its arguments.
For example:
```shell
${HA_CMDSPATH}/ha_exec2 -s 0 -t "SYBASE_DBSERVER"
```
The above command sends signal 0 (which checks whether the process exists) to all processes whose name or arguments match the SYBASE_DBSERVER string. The command returns 0 on success.

You should use the ha_exec2 command to check for server processes in the monitor script instead of using a `ps -ef | grep` command line.
For more information, see the ha_exec2 man page.
Before you can write the action scripts, you must do the following:

- Understand the scriptlib functions described in Appendix B, “Using the Script Library”.
- Familiarize yourself with the script templates provided in the following directory:

  ```
  /var/cluster/ha/resource_types/template
  ```

- Read the man pages for the following commands:
  - cmgr
  - fs2d
  - ha_cilog
  - ha_cmsd
  - ha_exec2
  - ha_fsd
  - ha_gcd
  - ha_ifd
  - ha_ifdadmin
  - ha_macconfig2
  - ha_srmd
  - ha_statd2
  - haStatus
- Familiarize yourself with the action scripts for other highly available services in /var/cluster/ha/resource_types that are similar to the scripts you wish to create.
- Understand how to do the following actions for your application:
  - Verify that the resource is running
  - Verify that the resource can be run
  - Start the resource
  - Stop the resource
  - Check for the server processes
  - Do a simple query as a client and understand the expected response
  - Check for configuration file or directory existence (as needed)
- Determine whether or not monitoring is required (see “Is Monitoring Necessary?”). However, even if monitoring is not needed, a monitor script is still required; in this case, it can contain only a return-success function.
- Determine if a resource type must be added to the cluster database.
- Understand the vendor-supplied startup and shutdown procedures.
- Determine the configuration parameters for the application; these may be used in the action script and should be stored in the cluster database. Action scripts may read from the database.
- Determine whether the resource type can be restarted on the local node and whether this action makes sense.
In the following situations, you may not need to perform application monitoring:

- Heartbeat monitoring is sufficient; that is, simply verifying that the node is alive (provided automatically by the base software) determines the health of the highly available service.
- There is no process or resource that can be monitored. For example, the SGI Gauntlet Internet Firewall software performs IP filtering on firewall nodes. Because the filtering is done in the kernel, there is no process or resource to monitor.
- A resource on which the application depends is already monitored. For example, monitoring some client-node resources might best be done by monitoring the file systems, volumes, and network interfaces they use. Because this is already done by the base software, additional monitoring is not required.
**Caution:** Beware that monitoring should be as lightweight as possible so that it does not affect system performance. Also, security issues may make monitoring difficult. If you are unable to provide a monitoring script with appropriate performance and security, consider a monitoring agent; see “Monitoring Agents”.
There are two types of monitoring that may be accomplished in a monitor script:
- Is the resource present?
- Is the resource responding?
You can define multiple levels of monitoring within the monitor script, and the administrator can choose the desired level by configuring the resource definition in the cluster database. Ensure that the monitoring level chosen does not affect system performance. For more information, see the FailSafe Administrator's Guide for SGI InfiniteStorage.
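A multi-level monitor can be structured as a branch on a level attribute read from the resource definition. The following sketch is purely illustrative (the level numbers, the checks, and the use of pgrep are assumptions, not part of FailSafe): level 1 performs only a cheap presence check, and level 2 adds a stubbed client query.

```shell
# Hypothetical two-level monitor: level 1 checks only that the process
# exists; level 2 also issues a (stubbed) client query. In a real script
# the level would come from the resource definition in the cluster database.
monitor_by_level() {
    _level=$1; _proc=$2
    case "$_level" in
        1)  pgrep -x "$_proc" >/dev/null ;;          # presence check only
        2)  pgrep -x "$_proc" >/dev/null || return 1
            echo "query ok"                          # stand-in for a real client request
            ;;
        *)  return 2 ;;                              # unknown monitoring level
    esac
}

sleep 30 &
DEMO_PID=$!
monitor_by_level 1 sleep && echo "level-1 check passed"
kill $DEMO_PID
```

The higher the level, the more load the check places on the server, so the administrator's choice of level is a performance trade-off.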
Possible symptoms of failure include the following:

- The resource returns an error code
- The resource returns the wrong result
- The resource does not return quickly enough
You must determine the monitoring interval time and time-out value for the monitor script. The time-out must be long enough to guarantee that occasional anomalies do not cause false failovers. It will be useful for you to determine the peak load that the resource may need to sustain.
You must also determine if the monitor test should execute multiple times so that an application is not declared dead after a single failure. In general, testing more than once before declaring failure is a good idea.
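The test-more-than-once idea can be wrapped in a small helper. This is an illustrative sketch, not part of scriptlib: a check passed as a command is attempted several times, and failure is declared only if every attempt fails.

```shell
# Run a check up to $1 times, with a short pause between attempts,
# and report failure only if every attempt fails.
check_with_retries() {
    _tries=$1; shift
    _i=0
    while [ "$_i" -lt "$_tries" ]; do
        "$@" && return 0          # one success is enough
        _i=$((_i + 1))
        sleep 1                   # let a transient anomaly pass
    done
    return 1                      # all attempts failed: declare the resource dead
}

check_with_retries 3 true && echo "resource healthy"
```

With this pattern, an occasional anomaly (a slow response during a load spike, for instance) does not cause a false failover.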
The test should be simple and complete quickly, whether it succeeds or fails. Some examples of tests are as follows:
- For a client/server resource that follows a well-defined protocol, the monitor script can make a simple request and verify that the proper response is received.
- For a web server application, the monitor script can request a home page, verify that the connection was made, and ignore the resulting home page.
- For a database, a simple request such as querying a table can be made.
- For NFS, more complicated end-to-end monitoring is required. The test might consist of mounting an exported file system, checking access to the file system with a stat() system call to the root of the file system, and undoing the mount.
- For a resource that writes to a log file, check that the size of the log file is increasing or use the grep command to check for a particular message.
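The log-file test in the last item can be sketched as follows (hypothetical paths; this is the same compare-and-save pattern the NFS monitor script uses with its nfsstat counters):

```shell
# Succeed if the log file has grown since the previous invocation.
# The previous size is kept in a small state file between monitor runs.
log_is_growing() {
    _log=$1; _state=$2
    _new=$(wc -c < "$_log")
    _old=$(cat "$_state" 2>/dev/null || echo 0)
    echo "$_new" > "$_state"          # remember the size for the next run
    [ "$_new" -gt "$_old" ]           # growing => resource looks healthy
}

LOG=/tmp/demo.log.$$; STATE=/tmp/demo.state.$$
echo "service started" > "$LOG"
log_is_growing "$LOG" "$STATE" && echo "log is growing"
rm -f "$LOG" "$STATE"
```

Note that a quiet but healthy service would fail this check, so a real monitor script would fall back to a second test (as the NFS script falls back to rpcinfo) before declaring failure.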
The following command can be used to determine quickly whether a process exists:
```shell
/sbin/killall -0 process_name
```
You can also use the ha_exec2 command to check if a process is running.
The ha_exec2 command differs from killall in that it performs a more exhaustive check on the process name as well as process arguments. killall searches for the process using the process name only. The command line is as follows:
```shell
/usr/cluster/bin/ha_exec2 -s 0 -t process_name
```
**Note:** Do not use the ps command to check on a particular process because its execution can be too slow.
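On systems without ha_exec2, the same name-or-arguments check can be approximated with pgrep -f, which matches against the full command line rather than the process name alone. This is an illustration only, not a FailSafe command, and the demo process is a throwaway.

```shell
# Return 0 if some process's full command line matches the pattern,
# similar in spirit to "ha_exec2 -s 0 -t pattern".
proc_running() {
    pgrep -f "$1" >/dev/null
}

sleep 31 &                       # a throwaway process to look for
DEMO_PID=$!
proc_running "sleep 31" && echo "process found"
kill $DEMO_PID
```

Like ha_exec2, and unlike killall, this finds a server even when the distinguishing string appears only in its arguments rather than its executable name.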
Templates for the action scripts are provided in the following directory:
```
/var/cluster/ha/resource_types/template
```
The template scripts have the same general format. Following is the type of information in the order in which it appears in the template scripts:
- Header information
- Set local variables
- Read resource information
- Exit status
- Perform the basic action of the script, which is the customized area you must provide
- Set global variables
- Verify arguments
- Read input file
**Note:** Action “scripts” can be of any form, such as a Bourne shell script, Perl script, or C language program. The rest of this chapter discusses Korn shell scripts.
The following sections show an example from the NFS start script.
The header information contains comments about the resource type, script type, and resource configuration format. You must modify the code as needed.
Following is the header for the NFS start script:
```shell
#!/sbin/ksh
# **************************************************************************
# *                                                                        *
# * Copyright (C) 1998 Silicon Graphics, Inc.                              *
# *                                                                        *
# * These coded instructions, statements, and computer programs contain    *
# * unpublished proprietary information of Silicon Graphics, Inc., and     *
# * are protected by Federal copyright law. They may not be disclosed      *
# * to third parties or copied or duplicated in any form, in whole or      *
# * in part, without the prior written consent of Silicon Graphics, Inc.   *
# *                                                                        *
# **************************************************************************
#ident "$Revision: 1.25 $"

# Resource type: NFS
# Start script NFS
#
# Test resource configuration information is present in the database in
# the following format
#
# resource-type.NFS
#
```
The set_local_variables() section of the script defines all of the variables that are local to the script, such as temporary file names or database keys. All local variables should use the LOCAL_ prefix. You must modify the code as needed.
Following is the set_local_variables() section from the NFS start script:
```shell
set_local_variables()
{
    LOCAL_TEST_KEY=NFS
}
```
The get_xxx_info() function, such as get_nfs_info(), reads the resource information from the cluster database. $1 is the test resource name. If the operation is successful, a value of 0 is returned; if the operation fails, 1 is returned.
The information is returned in the HA_STRING variable. For more information about HA_STRING, see Appendix B, “Using the Script Library”.
Following is the get_nfs_info() section from the NFS start script:
```shell
get_nfs_info ()
{
    ha_get_info ${LOCAL_TEST_KEY} $1
    if [ $? -ne 0 ]; then
        return 1;
    else
        return 0;
    fi
}
```
Call ha_get_info with a third argument of any value to obtain all attributes and dependency information for a resource from the cluster database. Use ha_get_multi_fields to retrieve specific dependency information. The resource dependency information is returned in the $HA_FIELD_VALUE variable.
In the exit_script() function, $1 contains the exit_status value. If cleanup actions are required, such as the removal of temporary files that were created as part of the process, place them before the exit line.
Following is the exit_script() section from the NFS start script:
```shell
exit_script()
{
    ${HA_DBGLOG} "Exit: exit_script()";
    exit $1;
}
```
This area of the script is the portion you must customize. The templates provide a minimal framework.
Following is the framework for the basic action from the start template:
```shell
start_template()
{
    # for all template resources passed as parameter
    for TEMPLATE in $HA_RES_NAMES
    do
        #HA_CMD="command to start $TEMPLATE resource on the local machine";
        #ha_execute_cmd "string to describe the command being executed";

        ha_write_status_for_resource $TEMPLATE $HA_SUCCESS;
    done
}
```
**Note:** When testing the script, you will add a line to this area to obtain debugging information.
For examples of this area, see “Examples of Action Scripts”.
The following lines set all of the global and local variables and store the resource names in $HA_RES_NAMES.
Following is the set_global_variables() function from the NFS start script:
```shell
set_global_variables()
{
    HA_DIR=/var/cluster/ha
    COMMON_LIB=${HA_DIR}/common_scripts/scriptlib

    # Execute the common library file
    . $COMMON_LIB

    ha_set_global_defs;
}
```
The ha_check_args() function verifies the arguments and stores them in the $HA_INFILE and $HA_OUTFILE variables. It returns 1 on error and 0 on success.
Following is the ha_check_args() section from the NFS start script:
```shell
ha_check_args $*;
if [ $? -ne 0 ]; then
    exit $HA_INVAL_ARGS;
fi
```
The ha_read_infile() function reads the input file and stores the resource names in the $HA_RES_NAMES variable. This function is defined in the scriptlib library. See “Read an Input File” in Appendix B.
Following is code from the NFS start script that calls the ha_read_infile() function:
```shell
# Read the input file and store the resource names in $HA_RES_NAMES
# variable
ha_read_infile;
```
For each script, you must do the following:
- Get the required variables
- Check the variables
- Perform the action
- Check the action
**Note:** The start and stop scripts are required to be idempotent; that is, they have the appearance of being run once but can in fact be run multiple times. For example, if the start script is run for a resource that is already started, the script must not return an error.
All action scripts must return the status to the following file:
```
/var/cluster/ha/log/script_nodename
```
The following sections use portions of the NFS scripts as examples.
**Note:** The examples in this guide may not exactly match the released system.
The NFS start script does the following:
- Creates a resource-specific NFS status directory.
- Exports the specified export-point with the specified export-options.
Following is a section from the NFS start script:
```shell
# Start the resource on the local machine.
# Return HA_SUCCESS if the resource has been successfully started on the local
# machine and HA_CMD_FAILED otherwise.
#
start_nfs()
{
    ${HA_DBGLOG} "Entry: start_nfs()";

    # for all nfs resources passed as parameter
    for resource in ${HA_RES_NAMES}
    do
        NFSFILEDIR=${HA_SCRIPTTMPDIR}/${LOCAL_TEST_KEY}$resource
        HA_CMD="/sbin/mkdir -p $NFSFILEDIR";
        ha_execute_cmd "creating nfs status file directory";
        if [ $? -ne 0 ]; then
            ${HA_LOG} "Failed to create ${NFSFILEDIR} directory";
            ha_write_status_for_resource ${resource} ${HA_NOCFGINFO};
            exit_script $HA_NOCFGINFO
        fi

        get_nfs_info $resource
        if [ $? -ne 0 ]; then
            ${HA_LOG} "NFS: $resource parameters not present in CDB";
            ha_write_status_for_resource ${resource} ${HA_NOCFGINFO};
            exit_script ${HA_NOCFGINFO};
        fi

        ha_get_field "${HA_STRING}" export-info
        if [ $? -ne 0 ]; then
            ${HA_LOG} "NFS: export-info not present in CDB for resource $resource";
            ha_write_status_for_resource ${resource} ${HA_NOCFGINFO};
            exit_script ${HA_NOCFGINFO};
        fi
        export_opts="$HA_FIELD_VALUE"

        ha_get_field "${HA_STRING}" filesystem
        if [ $? -ne 0 ]; then
            ${HA_LOG} "NFS: filesystem-info not present in CDB for resource $resource";
            ha_write_status_for_resource ${resource} ${HA_NOCFGINFO};
            exit_script ${HA_NOCFGINFO};
        fi
        filesystem="$HA_FIELD_VALUE"

        # Make the script idempotent, check to see if the NFS resource
        # is already exported, if so return success. Remember that we
        # might not have any export options.
        retstat=0;

        # Check to see if the NFS resource is already exported
        # (without options)
        /usr/etc/exportfs | grep "$resource$" >/dev/null 2>&1
        retstat=$?
        if [ $retstat -eq 1 ]; then
            # Check to see if the NFS resource is already exported
            # with options.
            /usr/etc/exportfs | grep "$resource " | grep "$export_opts$" >/dev/null 2>&1
            retstat=$?
        fi

        if [ $retstat -eq 1 ]; then
            # Before we try and export the NFS resource, make sure
            # filesystem is mounted.
            HA_CMD="/sbin/grep $filesystem /etc/mtab > /dev/null 2>&1";
            ha_execute_cmd "check if the filesystem $filesystem is mounted";
            if [ $? -eq 0 ]; then
                HA_CMD="/usr/etc/exportfs -i -o $export_opts $resource";
                ha_execute_cmd "export $resource directories to NFS clients";
                if [ $? -ne 0 ]; then
                    ha_write_status_for_resource ${resource} ${HA_CMD_FAILED};
                else
                    ha_write_status_for_resource ${resource} ${HA_SUCCESS};
                fi
            else
                ${HA_LOG} "NFS: filesystem $filesystem not mounted"
                ha_write_status_for_resource ${resource} ${HA_CMD_FAILED};
            fi
        else
            ha_write_status_for_resource ${resource} ${HA_SUCCESS};
        fi
    done
}
```
The NFS stop script does the following:
- Unexports the specified export-point.
- Removes the NFS status directory.
Following is an example from the NFS stop script:
```shell
# Stop the nfs resource on the local machine.
# Return HA_SUCCESS if the resource has been successfully stopped on the local
# machine and HA_CMD_FAILED otherwise.
#
stop_nfs()
{
    ${HA_DBGLOG} "Entry: stop_nfs()";

    # for all nfs resources passed as parameter
    for resource in ${HA_RES_NAMES}
    do
        get_nfs_info $resource
        if [ $? -ne 0 ]; then
            # NFS resource information not available.
            ${HA_LOG} "NFS: $resource parameters not present in CDB";
            ha_write_status_for_resource ${resource} ${HA_NOCFGINFO};
            exit_script ${HA_NOCFGINFO};
        fi

        ha_get_field "${HA_STRING}" export-info
        if [ $? -ne 0 ]; then
            ${HA_LOG} "NFS: export-info not present in CDB for resource $resource";
            ha_write_status_for_resource ${resource} ${HA_NOCFGINFO};
            exit_script ${HA_NOCFGINFO};
        fi
        export_opts="$HA_FIELD_VALUE"

        # Make the script idempotent, check to see if the filesystem
        # is already exported, if so return success. Remember that we
        # might not have any export options.
        retstat=0;

        # Check to see if the filesystem is already exported
        # (without options)
        /usr/etc/exportfs | grep "$resource$" >/dev/null 2>&1
        retstat=$?
        if [ $retstat -eq 1 ]; then
            # Check to see if the filesystem is already exported
            # with options.
            /usr/etc/exportfs | grep "$resource " | grep "$export_opts$" >/dev/null 2>&1
            retstat=$?
        fi

        if [ $retstat -eq 0 ]; then
            # Before we unexport the filesystem, check that it exists
            HA_CMD="/sbin/grep $resource /etc/mtab > /dev/null 2>&1";
            ha_execute_cmd "check if the export-point exists";
            if [ $? -eq 0 ]; then
                HA_CMD="/usr/etc/exportfs -u $resource";
                ha_execute_cmd "unexport $resource directories to NFS clients";
                if [ $? -ne 0 ]; then
                    ha_write_status_for_resource ${resource} ${HA_CMD_FAILED};
                else
                    ha_write_status_for_resource ${resource} ${HA_SUCCESS};
                fi
            else
                ${HA_LOG} "NFS: filesystem $resource not found in export filesystem list, \
unexporting anyway";
                HA_CMD="/usr/etc/exportfs -u $resource";
                ha_execute_cmd "unexport $resource directories to NFS clients";
                ha_write_status_for_resource ${resource} ${HA_SUCCESS};
            fi
        else
            ha_write_status_for_resource ${resource} ${HA_SUCCESS};
        fi

        # remove the monitor nfs status file
        NFSFILEDIR=${HA_SCRIPTTMPDIR}/${LOCAL_TEST_KEY}$resource
        HA_CMD="/sbin/rm -rf $NFSFILEDIR";
        ha_execute_cmd "removing nfs status file directory";
        if [ $? -ne 0 ]; then
            ${HA_LOG} "Failed to delete ${NFSFILEDIR} directory";
            ha_write_status_for_resource ${resource} ${HA_NOCFGINFO};
            exit_script $HA_NOCFGINFO
        fi
    done
}
```
The NFS monitor script does the following:
- Verifies that the file system is mounted at the correct mount point.
- Requests the status of the exported file system.
- Checks the export-point.
- Requests NFS statistics and, based on the results, makes a Remote Procedure Call (RPC) to NFS as needed.
Following is an example from the NFS monitor script:
```shell
# Check if the nfs resource is allocated in the local node
# This check must be light weight and less intrusive compared to
# exclusive check. This check is done when the resource has been
# allocated in the local node.
# Return HA_SUCCESS if the resource is running in the local node
# and HA_CMD_FAILED if the resource is not running in the local node
# The list of the resources passed as input is in variable
# $HA_RES_NAMES
#
monitor_nfs()
{
    ${HA_DBGLOG} "Entry: monitor_nfs()";

    for resource in ${HA_RES_NAMES}
    do
        get_nfs_info $resource
        if [ $? -ne 0 ]; then
            # No resource information available.
            ${HA_LOG} "NFS: $resource parameters not present in CDB";
            ha_write_status_for_resource ${resource} ${HA_NOCFGINFO};
            exit_script ${HA_NOCFGINFO};
        fi

        ha_get_field "${HA_STRING}" filesystem
        if [ $? -ne 0 ]; then
            # filesystem not available.
            ${HA_LOG} "NFS: filesystem not present in CDB for resource $resource";
            ha_write_status_for_resource ${resource} ${HA_NOCFGINFO};
            exit_script ${HA_NOCFGINFO};
        fi
        fs="$HA_FIELD_VALUE";

        # Check to see if the filesystem is mounted
        HA_CMD="/sbin/mount | grep $fs >> /dev/null 2>&1"
        ha_execute_cmd "check to see if $fs is mounted"
        if [ $? -ne 0 ]; then
            ${HA_LOG} "NFS: $fs not mounted";
            ha_write_status_for_resource ${resource} ${HA_CMD_FAILED};
            exit_script $HA_CMD_FAILED;
        fi

        # stat the filesystem
        HA_CMD="/sbin/stat $resource";
        ha_execute_cmd "stat mount point $resource"
        if [ $? -ne 0 ]; then
            ${HA_LOG} "NFS: cannot stat $resource NFS export point";
            ha_write_status_for_resource ${resource} ${HA_CMD_FAILED};
            exit_script $HA_CMD_FAILED;
        fi

        # check the filesystem is exported
        EXPORTFS="${HA_SCRIPTTMPDIR}/exportfs.$$"
        /usr/etc/exportfs > $EXPORTFS 2>&1
        HA_CMD="awk '{print \$1}' $EXPORTFS | grep $resource"
        ha_execute_cmd "check the filesystem $resource is exported"
        if [ $? -ne 0 ]; then
            ${HA_LOG} "NFS: failed to find $resource in exported filesystem list:-"
            ${HA_LOG} "`/sbin/cat ${EXPORTFS}`"
            rm -f $EXPORTFS;
            ha_write_status_for_resource ${resource} ${HA_CMD_FAILED};
            exit_script $HA_CMD_FAILED;
        fi
        rm -f $EXPORTFS

        # create a file to hold the nfs stats. This will be
        # deleted in the stop script.
        NFSFILE=${HA_SCRIPTTMPDIR}/${LOCAL_TEST_KEY}$resource/.nfsstat
        NFS_STAT=`/usr/etc/nfsstat -rs | /usr/bin/tail -1 | /usr/bin/awk '{print $1}'`
        if [ ! -f $NFSFILE ]; then
            ${HA_LOG} "NFS: creating stat file $NFSFILE";
            echo $NFS_STAT > $NFSFILE;
            if [ $NFS_STAT -eq 0 ]; then
                # do some rpcinfo's
                exec_rpcinfo;
                if [ $? -ne 0 ]; then
                    ${HA_LOG} "NFS: exec_rpcinfo failed (1)";
                    ha_write_status_for_resource ${resource} ${HA_CMD_FAILED};
                    exit_script $HA_CMD_FAILED;
                fi
            fi
        else
            OLD_STAT=`/sbin/cat $NFSFILE`
            if test "X${NFS_STAT}" = "X"; then
                ${HA_LOG} "NFS: NFS_STAT is not set, reset to zero";
                NFS_STAT=0;
            fi
            if test "X${OLD_STAT}" = "X"; then
                ${HA_LOG} "NFS: OLD_STAT is not set, reset to zero";
                OLD_STAT=0;
            fi
            if [ $NFS_STAT -gt $OLD_STAT ]; then
                echo $NFS_STAT > $NFSFILE;
            else
                echo $NFS_STAT > $NFSFILE;
                exec_rpcinfo;
                if [ $? -ne 0 ]; then
                    ${HA_LOG} "NFS: exec_rpcinfo failed (2)";
                    ha_write_status_for_resource $resource ${HA_CMD_FAILED};
                    exit_script $HA_CMD_FAILED;
                fi
            fi
        fi

        ha_write_status_for_resource $resource $HA_SUCCESS;
    done
}
```
The NFS exclusive script determines whether the file system is already exported. The check made by an exclusive script can be more expensive than a monitor check. FailSafe uses this script to determine if resources are running on a node in the cluster, and to thereby prevent starting resources on multiple nodes in the cluster.
Following is an example from the NFS exclusive script:
```shell
# Check if the nfs resource is running in the local node. This check can be
# more intrusive than the monitor check. This check is used to determine
# if the resource has to be started on a machine in the cluster.
# Return HA_NOT_RUNNING if the resource is not running in the local node
# and HA_RUNNING if the resource is running in the local node
# The list of nfs resources passed as input is in variable
# $HA_RES_NAMES
#
exclusive_nfs()
{
    ${HA_DBGLOG} "Entry: exclusive_nfs()";

    # for all resources passed as parameter
    for resource in ${HA_RES_NAMES}
    do
        get_nfs_info $resource
        if [ $? -ne 0 ]; then
            # No resource information available
            ${HA_LOG} "NFS: $resource parameters not present in CDB";
            ha_write_status_for_resource ${resource} ${HA_NOCFGINFO};
            exit_script ${HA_NOCFGINFO};
        fi

        SMFILE=${HA_SCRIPTTMPDIR}/showmount.$$
        /etc/showmount -x >> ${SMFILE};
        HA_CMD="/sbin/grep $resource ${SMFILE} >> /dev/null 2>&1"
        ha_execute_cmd "checking for $resource exported directory"
        if [ $? -eq 0 ]; then
            ha_write_status_for_resource ${resource} ${HA_RUNNING};
            ha_print_exclusive_status ${resource} ${HA_RUNNING};
        else
            ha_write_status_for_resource ${resource} ${HA_NOT_RUNNING};
            ha_print_exclusive_status ${resource} ${HA_NOT_RUNNING};
        fi
        rm -f ${SMFILE}
    done
}
```
The NFS restart script exports the specified export-point with the specified export-options.
Following is an example from the restart script for NFS:
```shell
# Restart nfs resource
# Return HA_SUCCESS if nfs resource failed over successfully or
# return HA_CMD_FAILED if nfs resource could not be failed over locally.
# Return HA_NOT_SUPPORTED if local restart is not supported for nfs
# resource type.
# The list of nfs resources passed as input is in variable
# $HA_RES_NAMES
#
restart_nfs()
{
    ${HA_DBGLOG} "Entry: restart_nfs()";

    # for all nfs resources passed as parameter
    for resource in ${HA_RES_NAMES}
    do
        get_nfs_info $resource
        if [ $? -ne 0 ]; then
            ${HA_LOG} "NFS: $resource parameters not present in CDB";
            ha_write_status_for_resource ${resource} ${HA_NOCFGINFO};
            exit_script ${HA_NOCFGINFO};
        fi

        ha_get_field "${HA_STRING}" export-info
        if [ $? -ne 0 ]; then
            ${HA_LOG} "NFS: export-info not present in CDB for resource $resource";
            ha_write_status_for_resource ${resource} ${HA_NOCFGINFO};
            exit_script ${HA_NOCFGINFO};
        fi
        export_opts="$HA_FIELD_VALUE"

        HA_CMD="/usr/etc/exportfs -i -o $export_opts $resource";
        ha_execute_cmd "export $resource directories to NFS clients";
        if [ $? -ne 0 ]; then
            ha_write_status_for_resource ${resource} ${HA_CMD_FAILED};
        else
            ha_write_status_for_resource ${resource} ${HA_SUCCESS};
        fi
    done
}
```
If resources cannot be monitored using a lightweight check, you should use a monitoring agent. The monitor action script contacts the monitoring agent to determine the status of the resource in the node. The monitoring agent in turn periodically monitors the resource. Figure 2-1 shows the monitoring process.
Monitoring agents are useful for monitoring database resources. In databases, creating the database connection is costly and time consuming. The monitoring agent maintains open connections to the database and queries the database over those connections in response to monitor action script requests.
Monitoring agents are independent processes and can be started by the cmond process, although this is not required. For example, if a monitoring agent must be started when activating highly available services on a node, information about that agent can be added to the cmond configuration on that node. The cmond configuration is located in the /var/cluster/cmon/process_groups directory. Information about different agents should go into different files. The name of the file is not relevant to the activate/deactivate procedure.
If a monitoring agent exits or aborts, cmond will automatically restart the monitoring agent. This prevents monitor action script failures due to monitoring agent failures.
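The restart semantics can be illustrated with a minimal supervisor loop. This is a sketch of the behavior only, not the real cmond (which manages agents through its process-group configuration); the run cap exists just to keep the demo finite.

```shell
# Illustration of cmond-style restart semantics: whenever the supervised
# command exits, it is simply launched again, up to a cap for this demo.
supervise() {
    _max=$1; shift
    _runs=0
    while [ "$_runs" -lt "$_max" ]; do
        "$@"                     # run the agent; returns when the agent exits
        _runs=$((_runs + 1))     # the agent exited or aborted: restart it
    done
    echo "agent ran $_runs times"
}

supervise 3 true
```

Because the agent is relaunched as soon as it exits, a monitor action script that contacts the agent sees at most a brief window in which the agent is unavailable.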
For example, the /var/cluster/cmon/process_groups/ip_addresses file contains information about the ha_ifd process that monitors network interfaces. It contains the following:
```
TYPE = cluster_agent
PROCS = ha_ifd
ACTIONS = start stop restart attach detach
AUTOACTION = attach
```
**Note:** The ACTIONS line above defines what cmond can do to the PROCS processes. These actions must be the same for every agent. (It does not refer to action scripts.)
If you create a new monitoring agent, you must also create a corresponding file in the /var/cluster/cmon/process_groups directory that contains similar information about the new agent. To do this, you can copy the ip_addresses file and modify the PROCS line to list the executables that constitute your new agent. These executables must be located in the /usr/cluster/bin directory. You should not modify the other configuration lines (TYPE, ACTIONS, and AUTOACTION).
Suppose you need to add a new agent called newagent that consists of processes ha_x and ha_y. The configuration information for this agent will be located in the /var/cluster/cmon/process_groups/newagent file, which will contain the following:
```
TYPE = cluster_agent
PROCS = ha_x ha_y
ACTIONS = start stop restart attach detach
AUTOACTION = attach
```
In this case, the software expects two executables, /usr/cluster/bin/ha_x and /usr/cluster/bin/ha_y, to be present.