Chapter 5. Testing Scripts

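This chapter describes how to test action scripts without running FailSafe. It also provides tips on debugging problems that you may encounter. It covers the following:

  • General Testing and Debugging Techniques

  • Debugging Notes

  • Testing an Action Script

  • Special Testing Considerations for the monitor Script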


Note: Action scripts take an input file and an output file as parameters. Each line of the input file contains a resource name; each line of the output file contains a resource name followed by the script exit status for that resource.


General Testing and Debugging Techniques

Some general testing and debugging techniques you can use during testing are as follows:

  • To get debugging information, add the following line to the main function of each of your scripts:

    set -x

  • To check that an application is running on a node:

    • Enter the following command on that node, where application is the name (or a portion of the name) of the executable for the application:

      ps -ef | grep application  

    • Use appropriate commands provided by the application. For example, the FailSafe Informix option uses the Informix command onstat.

  • To show the status of a resource, use the following cmgr command:

    show status of resource resourcename of resource_type typename [in cluster clustername]

    For example:

    cmgr> show status of resource /hafs1/subdir of resource_type NFS in cluster nfs-cluster
    
    State: Online
    Error: None
    Owner: hans2
    Flags: Resource is monitored locally

  • To show the status of a node, use the following cmgr command:

    show status of node nodename

    For example:

    cmgr> show status of node hans2
    
    FailSafe status of node is UP.
    
    Machine (hans2) is not configured for CXFS.

  • To show the status of a resource group, use the following cmgr command:

    show status of resource_group RG_name in cluster clustername

    For example:

    cmgr> show status of resource_group nfs-group1 in cluster nfs-cluster
    
    State: Online
    Error: No error
    Owner: hans2
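
The ps -ef | grep application check described above can be wrapped in a small helper so that test runs can call it repeatedly. The following is a minimal sketch, not part of FailSafe; the function names match_process and is_running are hypothetical.

```shell
#!/bin/sh
# Sketch: wrap the "ps -ef | grep application" check in reusable functions.
# match_process and is_running are hypothetical names, not FailSafe functions.

# match_process PATTERN
# Reads a process listing on standard input and succeeds (exit 0) if any
# line other than a grep command line contains PATTERN.
match_process() {
    # "grep -v grep" drops the grep process that the pipeline itself starts;
    # it also hides any real process whose command line contains "grep".
    grep -v grep | grep -- "$1" > /dev/null
}

# is_running PATTERN
# Succeeds if a process matching PATTERN appears in the ps -ef listing.
is_running() {
    ps -ef | match_process "$1"
}
```

For example, is_running application && echo "application is running" prints the message only when a matching process is found on the local node.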

Debugging Notes

  • The exclusive script returns an error when the resource is running on the local node. If the resource really is running on that node, the error is expected and does not indicate a bug in the exclusive action script.

  • If the resource group does not come online on the primary node, the cause can be a start script error or a monitor script error on the primary node. The nature of the failure can be seen in the srmd logs of the primary node.

  • If the action script failure status is timeout, increase the resource type timeout for that action. For the monitor script, you can also make the check more lightweight.

  • Resource type action script timeouts apply per resource. If an action is performed on two resources, the script timeout is therefore twice the configured resource type action timeout.

  • If the resource group has a configuration error, check the srmd logs on the primary node for errors.

  • Action scripts that use the ${HA_LOG} and ${HA_DBGLOG} macros write their messages to the /var/cluster/ha/log/script_nodename file on each node in the cluster.

    HA_LOG logs messages at log level 1 and HA_DBGLOG uses log level 11.
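
The per-resource timeout rule in the notes above is simple arithmetic: the time allowed for a script scales with the number of resources it is asked to handle. The following sketch estimates that value from an input file containing one resource name per line. The function name effective_timeout is hypothetical, and the timeout value is whatever you configured for the resource type action.

```shell
#!/bin/sh
# Sketch: estimate the total script timeout for an input file.
# effective_timeout is a hypothetical helper, not a FailSafe function.

# effective_timeout INPUT_FILE PER_RESOURCE_TIMEOUT
# Prints the number of resources in INPUT_FILE multiplied by the
# configured per-resource action timeout.
effective_timeout() {
    # Count non-empty lines; each one names a resource.
    # grep -c exits nonzero when the count is 0, so fall back to 0.
    et_count=$(grep -c . "$1") || et_count=0
    echo $(( et_count * $2 ))
}
```

With two resource names in the input file and a configured timeout of 30, effective_timeout prints 60.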

Testing an Action Script

To test an action script, do the following:

  1. Create an input file, such as /tmp/input, that contains the expected resource names. For example, to create a file that contains the resource named /disk1, do the following:

    # echo "/disk1" > /tmp/input

  2. Create an input parameter file, such as /tmp/ipparamfile, as follows:

    # echo "ClusterName web-cluster" > /tmp/ipparamfile

  3. Execute the action script as follows:

    # ./start /tmp/input /tmp/output /tmp/ipparamfile


    Note: The use of the input parameter file is optional.


  4. To print messages written with HA_DBGLOG, change the log level from HA_NORMLVL to HA_DBGLVL by adding the following line after the set_global_variables statement in your script:

    HA_CURRENT_LOGLEVEL=$HA_DBGLVL
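
The manual steps above can be collected into a small wrapper for repeated test runs. This is a sketch under stated assumptions: the function name run_action and the temporary file names are hypothetical, and the optional input parameter file is omitted.

```shell
#!/bin/sh
# Sketch: run an action script by hand against a list of resource names.
# run_action is a hypothetical helper, not part of FailSafe.

# run_action SCRIPT RESOURCE...
# Writes each RESOURCE on its own line of a temporary input file, runs
# SCRIPT with the input and output file names, then prints the script's
# exit status followed by the contents of the output file.
run_action() {
    ra_script=$1
    shift
    ra_in=/tmp/input.$$
    ra_out=/tmp/output.$$

    # Build the input file: one resource name per line.
    : > "$ra_in"
    for ra_res in "$@"; do
        echo "$ra_res" >> "$ra_in"
    done

    # Remove any stale output file so old results are not misread.
    rm -f "$ra_out"

    ra_rc=0
    "$ra_script" "$ra_in" "$ra_out" || ra_rc=$?

    echo "exit status: $ra_rc"
    if [ -f "$ra_out" ]; then
        cat "$ra_out"
    else
        echo "no output file was written"
    fi

    rm -f "$ra_in" "$ra_out"
    return "$ra_rc"
}
```

For example, run_action ./start /disk1 runs the start script for the resource /disk1 and shows both the exit status of the script and the per-resource status it wrote.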

The output file will contain one of the following return values for the start, stop, monitor, and restart scripts:

HA_SUCCESS=0
HA_INVAL_ARGS=1
HA_CMD_FAILED=2
HA_NOTSUPPORTED=3
HA_NOCFGINFO=4

The output file will contain one of the following return values for the exclusive script:

HA_NOT_RUNNING=0
HA_RUNNING=2


Note: If you call the exit_script function prior to normal termination, call the ha_write_status_for_resource function first, and pass exit_script the same return code that is written to the output file.

Suppose you have a resource named /disk1. The syntax for the input and output files would be as follows:

  • Input file: <resourcename>

  • Output file: <resourcename> <status>
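
When reading an output file by hand, it can help to translate each numeric status into the symbolic names listed earlier in this section. The following sketch does that for the <resourcename> <status> format. The function names status_name and report_output are hypothetical; the code-to-name mapping is taken from the return values listed above.

```shell
#!/bin/sh
# Sketch: decode the statuses in an action script output file.
# status_name and report_output are hypothetical helpers.

# status_name ACTION CODE
# Prints the symbolic name for CODE. The exclusive script uses its own
# pair of values; all other actions share the general list.
status_name() {
    if [ "$1" = exclusive ]; then
        case $2 in
            0) echo HA_NOT_RUNNING ;;
            2) echo HA_RUNNING ;;
            *) echo UNKNOWN ;;
        esac
    else
        case $2 in
            0) echo HA_SUCCESS ;;
            1) echo HA_INVAL_ARGS ;;
            2) echo HA_CMD_FAILED ;;
            3) echo HA_NOTSUPPORTED ;;
            4) echo HA_NOCFGINFO ;;
            *) echo UNKNOWN ;;
        esac
    fi
}

# report_output ACTION OUTPUT_FILE
# Prints one line per resource: name, numeric status, symbolic name.
report_output() {
    while read -r ro_name ro_status; do
        # Skip blank lines.
        [ -n "$ro_name" ] || continue
        echo "$ro_name: $ro_status ($(status_name "$1" "$ro_status"))"
    done < "$2"
}
```

For example, report_output monitor /tmp/output prints /disk1: 2 (HA_CMD_FAILED) when the output file contains the line /disk1 2.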

The following example shows:

  • The exit status of the action script is 2

  • The exit status of the resource is 2


    Note: The use of anonymous indicates that the script was run manually. When the script is run by FailSafe, the full path to the script name is displayed.


    # echo "/disk1" > /tmp/ipfile
    # ./monitor /tmp/ipfile /tmp/opfile /tmp/ipparamfile
    # echo $?
    2
    # cat /tmp/opfile
    /disk1 2
    # tail /var/cluster/ha/log/script_heb1
    Tue Aug 25 11:32:57.437 <anonymous script 23787:0 Unknown:0> ./monitor:
    ./monitor called with /tmp/ipfile and /tmp/opfile
    Tue Aug 25 11:32:58.118 <anonymous script 24556:0 Unknown:0> ./monitor:
    check to see if /disk1 is mounted on /disk1
    Tue Aug 25 11:32:58.433 <anonymous script 23811:0 Unknown:0> ./monitor:
    /sbin/mount | grep /disk1 | grep /disk1 >> /dev/null 2>&1 exited with
    status 0
    Tue Aug 25 11:32:58.665 <anonymous script 24124:0 Unknown:0> ./monitor:
    stat mount point /disk1
    Tue Aug 25 11:32:58.969 <anonymous script 23525:0 Unknown:0> ./monitor:
    /sbin/stat /disk1 exited with status 0
    Tue Aug 25 11:32:59.258 <anonymous script 24431:0 Unknown:0> ./monitor:
    check the filesystem /disk1 is exported
    Tue Aug 25 11:32:59.610 <anonymous script 6982:0 Unknown:0> ./monitor:
    Tue Aug 25 11:32:59.917 <anonymous script 24040:0 Unknown:0> ./monitor:
    awk '{print \$1}' /var/cluster/ha/tmp/exportfs.23762 | grep /disk1 exited
    with status 1
    Tue Aug 25 11:33:00.131 <anonymous script 24418:0 Unknown:0> ./monitor:
    echo failed to find /disk1 in exported filesystem list:-
    Tue Aug 25 11:33:00.340 <anonymous script 24236:0 Unknown:0> ./monitor:
    echo /disk2

For additional information about a script's processing, see the /var/cluster/ha/log/script_nodename file.

Special Testing Considerations for the monitor Script

The monitor script tests the liveness of applications and resources. The best way to test it is to induce a failure, run the script, and check whether the script detects the failure; then repeat the process for another failure.

Use this checklist for testing a monitor script:

  • Verify that the script detects failure of the application successfully

  • Verify that the script always exits with a return value

  • Verify that the script does not contain commands that can hang, such as those that use DNS for name resolution, or commands that run forever, such as ping

  • Verify that the script completes before the timeout value specified in the configuration file

  • Verify that the script's return codes are correct

During testing, measure the time it takes for a script to complete and adjust the monitoring times in your script accordingly. To get a good estimate of the time required for the script to execute, run it under different system load conditions.
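
One way to take these measurements is to time each run and compare it against the configured timeout. The following is a minimal sketch; the function name time_action is hypothetical, and it assumes that the date command on your system supports the +%s format (seconds since the epoch). It only reports elapsed time in whole seconds and does not interrupt a script that hangs.

```shell
#!/bin/sh
# Sketch: time one run of an action script against a timeout.
# time_action is a hypothetical helper; "date +%s" is assumed to be
# available on the system where the test is run.

# time_action LIMIT COMMAND [ARG...]
# Runs COMMAND, prints its exit status and elapsed whole seconds, and
# succeeds only if the run finished in fewer than LIMIT seconds.
time_action() {
    ta_limit=$1
    shift

    ta_start=$(date +%s)
    ta_rc=0
    "$@" || ta_rc=$?
    ta_end=$(date +%s)

    ta_elapsed=$(( ta_end - ta_start ))
    echo "exit status $ta_rc, elapsed ${ta_elapsed}s (limit ${ta_limit}s)"

    # The return value reflects the timing check, not the script status.
    [ "$ta_elapsed" -lt "$ta_limit" ]
}
```

For example, time_action 30 ./monitor /tmp/input /tmp/output reports how long the monitor script took and fails if it needed 30 seconds or more. Repeating the call while the system is under different loads gives a range of values to compare against the configured timeout.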