This chapter describes how to test action scripts without running FailSafe. It also provides tips on how to debug problems that you may encounter.
Note: Parameters are passed to the action scripts through an input file and an output file. Each line of the input file contains a resource name; each line of the output file contains a resource name and the script exit status for that resource.
Some general testing and debugging techniques you can use during testing are as follows:
To get debugging information, add the following line to each of your scripts in the main function of the script:
    set -x
To check that an application is running on a node:
Enter the following command on that node, where application is the name (or a portion of the name) of the executable for the application:
    ps -ef | grep application
Use appropriate commands provided by the application. For example, the FailSafe Informix option uses the Informix command onstat.
To show the status of a resource, use the following cmgr command:
    show status of resource resourcename of resource_type typename [in cluster clustername]
For example:
    cmgr> show status of resource /hafs1/subdir of resource_type NFS in cluster nfs-cluster
    State: Online
    Error: None
    Owner: hans2
    Flags: Resource is monitored locally
To show the status of a node, use the following cmgr command:
    show status of node nodename
For example:
    cmgr> show status of node hans2
    FailSafe status of node is UP.
    Machine (hans2) is not configured for CXFS.
To show the status of a resource group, use the following cmgr command:
    show status of resource_group RG_name in cluster clustername
For example:
    cmgr> show status of resource_group nfs-group1 in cluster nfs-cluster
    State: Online
    Error: No error
    Owner: hans2
The exclusive script returns an error when the resource is running on the local node. If the resource really is running on that node, this is expected behavior and not a bug in the exclusive action script.
If the resource group does not come online on the primary node, the cause may be a start script error or a monitor script error on the primary node. The srmd logs on the primary node show the nature of the failure.
If the action script failure status is timeout, increase the resource type timeout for that action. For the monitor script, you can also make the check more lightweight.
The resource type action script timeouts apply per resource. Therefore, if an action is performed on two resources, the script timeout is twice the configured resource type action timeout.
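As a rough illustration of this per-resource rule, the following sketch estimates the total time allowed for a script that is invoked with an input file listing several resources. The function name script_timeout and the 40-second value are assumptions made for the example; they are not FailSafe names or defaults.

```shell
#!/bin/sh
# script_timeout INPUT_FILE PER_RESOURCE_TIMEOUT
# Hypothetical helper: multiplies the per-resource timeout by the number
# of resources (non-empty lines) listed in the input file.
script_timeout() {
    count=$(grep -c . "$1")
    echo $(( count * $2 ))
}

# Example: two resources with an assumed 40-second timeout each
printf '%s\n' /disk1 /disk2 > /tmp/timeout_demo.$$
script_timeout /tmp/timeout_demo.$$ 40    # prints 80
rm -f /tmp/timeout_demo.$$
```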
If the resource group has a configuration error, check the srmd logs on the primary node for errors.
Action scripts that use the ${HA_LOG} and ${HA_DBGLOG} macros write their messages to the /var/cluster/ha/log/script_nodename file on each node in the cluster.
HA_LOG logs messages at log level 1 and HA_DBGLOG uses log level 11.
To test an action script, do the following:
Create an input file, such as /tmp/input, that contains the expected resource names. For example, to create a file that contains the resource named /disk1, do the following:
    # echo "/disk1" > /tmp/input
Create an input parameter file, such as /tmp/ipparamfile, as follows:
    # echo "ClusterName web-cluster" > /tmp/ipparamfile
Execute the action script as follows:
    # ./start /tmp/input /tmp/output /tmp/ipparamfile
Note: The use of the input parameter file is optional.
To allow messages written with HA_DBGLOG to be printed, change the log level from HA_NORMLVL to HA_DBGLVL by adding the following line after the set_global_variables statement in your script:
    HA_CURRENT_LOGLEVEL=$HA_DBGLVL
The output file will contain one of the following return values for the start, stop, monitor, and restart scripts:
    HA_SUCCESS=0
    HA_INVAL_ARGS=1
    HA_CMD_FAILED=2
    HA_NOTSUPPORTED=3
    HA_NOCFGINFO=4
The output file will contain one of the following return values for the exclusive script:
    HA_NOT_RUNNING=0
    HA_RUNNING=2
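When reading output files by hand, it can help to translate the numeric status back into its symbolic name. The following is a minimal sketch of such a lookup, built only from the values listed above; the function name decode_status and the UNKNOWN fallback are hypothetical and are not part of FailSafe.

```shell
#!/bin/sh
# decode_status SCRIPT_KIND STATUS
# Hypothetical helper: prints the symbolic name for a numeric status.
# SCRIPT_KIND "exclusive" uses the exclusive script values; any other
# kind (start, stop, monitor, restart) uses the general values.
decode_status() {
    case "$1" in
    exclusive)
        case "$2" in
        0) echo HA_NOT_RUNNING ;;
        2) echo HA_RUNNING ;;
        *) echo UNKNOWN ;;
        esac
        ;;
    *)
        case "$2" in
        0) echo HA_SUCCESS ;;
        1) echo HA_INVAL_ARGS ;;
        2) echo HA_CMD_FAILED ;;
        3) echo HA_NOTSUPPORTED ;;
        4) echo HA_NOCFGINFO ;;
        *) echo UNKNOWN ;;
        esac
        ;;
    esac
}

decode_status monitor 2      # prints HA_CMD_FAILED
decode_status exclusive 2    # prints HA_RUNNING
```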
Suppose you have a resource named /disk1. The syntax for the input and output files would be as follows:
Input file: <resourcename>
Output file: <resourcename> <status>
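Because each output line has the form resourcename status, the result of a test run can be checked mechanically. The sketch below assumes that resource names contain no whitespace and treats any nonzero status as a failure, which fits the start, stop, monitor, and restart scripts but not the exclusive script. The function name failed_resources is hypothetical.

```shell
#!/bin/sh
# failed_resources OUTPUT_FILE
# Hypothetical helper: prints the name of every resource whose status
# is not 0 and returns 1 if at least one such resource was found.
failed_resources() {
    any_failed=0
    while read -r name status; do
        # Skip blank lines
        [ -z "$name" ] && continue
        if [ "$status" != "0" ]; then
            echo "$name"
            any_failed=1
        fi
    done < "$1"
    return $any_failed
}

# Usage (assuming /tmp/opfile was written by a manual test run):
#   failed_resources /tmp/opfile || echo "at least one resource failed"
```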
The following example shows:
The exit status of the action script is 2 (reported by echo $?)
The exit status of the resource is 2 (the second field in /tmp/opfile)
Note: The use of anonymous in the log output indicates that the script was run manually. When the script is run by FailSafe, the full path to the script name is displayed.
    # echo "/disk1" > /tmp/ipfile
    # ./monitor /tmp/ipfile /tmp/opfile /tmp/ipparamfile
    # echo $?
    2
    # cat /tmp/opfile
    /disk1 2
    # tail /var/cluster/ha/log/script_heb1
    Tue Aug 25 11:32:57.437 <anonymous script 23787:0 Unknown:0> ./monitor: ./monitor called with /tmp/ipfile and /tmp/opfile
    Tue Aug 25 11:32:58.118 <anonymous script 24556:0 Unknown:0> ./monitor: check to see if /disk1 is mounted on /disk1
    Tue Aug 25 11:32:58.433 <anonymous script 23811:0 Unknown:0> ./monitor: /sbin/mount | grep /disk1 | grep /disk1 >> /dev/null 2>&1 exited with status 0
    Tue Aug 25 11:32:58.665 <anonymous script 24124:0 Unknown:0> ./monitor: stat mount point /disk1
    Tue Aug 25 11:32:58.969 <anonymous script 23525:0 Unknown:0> ./monitor: /sbin/stat /disk1 exited with status 0
    Tue Aug 25 11:32:59.258 <anonymous script 24431:0 Unknown:0> ./monitor: check the filesystem /disk1 is exported
    Tue Aug 25 11:32:59.610 <anonymous script 6982:0 Unknown:0> ./monitor:
    Tue Aug 25 11:32:59.917 <anonymous script 24040:0 Unknown:0> ./monitor: awk '{print \$1}' /var/cluster/ha/tmp/exportfs.23762 | grep /disk1 exited with status 1
    Tue Aug 25 11:33:00.131 <anonymous script 24418:0 Unknown:0> ./monitor: echo failed to find /disk1 in exported filesystem list:-
    Tue Aug 25 11:33:00.340 <anonymous script 24236:0 Unknown:0> ./monitor: echo /disk2
For additional information about a script's processing, see the /var/cluster/ha/log/script_nodename file.
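The manual steps of creating an input file, running the script, and inspecting the exit status and output file can be wrapped in a small helper when the same test is repeated often. The following is only a sketch: the function name run_action and the temporary file names are hypothetical, and the optional input parameter file is left out for brevity.

```shell
#!/bin/sh
# run_action SCRIPT RESOURCE [RESOURCE...]
# Hypothetical test helper: writes one resource name per line to a
# temporary input file, runs the action script with the input and output
# files as arguments, then prints the script exit status followed by the
# contents of the output file.
run_action() {
    script=$1
    shift
    infile=/tmp/ha_test_in.$$
    outfile=/tmp/ha_test_out.$$
    printf '%s\n' "$@" > "$infile"
    : > "$outfile"
    "$script" "$infile" "$outfile"
    status=$?
    echo "script exit status: $status"
    cat "$outfile"
    rm -f "$infile" "$outfile"
    return $status
}

# Usage (run from the directory that contains the action scripts):
#   run_action ./monitor /disk1
```

Returning the script's own exit status lets the helper be used directly in a conditional or a loop over several failure scenarios.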
The monitor script tests the liveness of applications and resources. The best way to test it is to induce a failure, run the script, and check whether the script detects the failure; then repeat the process for other failures.
Use this checklist for testing a monitor script:
Verify that the script detects failure of the application successfully
Verify that the script always exits with a return value
Verify that the script does not contain commands that can hang, such as those that use DNS for name resolution, or commands that run indefinitely, such as ping
Verify that the script completes before the time-out value specified in the configuration file
Verify that the script's return codes are correct
During testing, measure the time it takes for a script to complete and adjust the monitoring times in your script accordingly. To get a good estimate of the time required for the script to execute, run it under different system load conditions.
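One way to collect these timings, and to catch a script that hangs, is to run it under a simple watchdog. The sketch below is an illustration only: the function name timed_run is hypothetical, it assumes a date command that supports +%s (seconds since the epoch), and it measures whole seconds rather than fractions.

```shell
#!/bin/sh
# timed_run LIMIT_SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND in the background, kills it if it is
# still running after LIMIT_SECONDS, and reports the elapsed time and the
# exit status. A command killed by the watchdog's default TERM signal
# typically yields status 143 (128 + 15).
timed_run() {
    limit=$1
    shift
    start=$(date +%s)
    "$@" &
    pid=$!
    # Watchdog: detached from stdout/stderr so it never blocks callers
    ( sleep "$limit"; kill "$pid" ) > /dev/null 2>&1 &
    watchdog=$!
    wait "$pid"
    status=$?
    # The command finished (or was killed); the watchdog is no longer needed
    kill "$watchdog" 2> /dev/null
    wait "$watchdog" 2> /dev/null
    end=$(date +%s)
    echo "elapsed: $((end - start))s, status: $status"
    return $status
}

# Usage (the 30-second limit is an assumed value, not a FailSafe default):
#   timed_run 30 ./monitor /tmp/ipfile /tmp/opfile
```

Running the helper several times while the system is idle and again while it is heavily loaded gives a range of completion times to compare against the configured timeout.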