Chapter 5. Testing IRIS FailSafe Configuration

This chapter explains how to test the IRIS FailSafe system configuration. The tests in each section of this chapter, except the last, are performed while the IRIS FailSafe software is not running. The last section describes how to test the running IRIS FailSafe software.

The sections in this chapter are as follows:

  • “Testing the Serial Connections”
  • “Testing the Private Network”
  • “Testing the Public Network Interfaces”
  • “Testing Logical Volumes”
  • “Testing Filesystems”
  • “Testing NFS Configuration”
  • “Testing Netscape Server Configuration”
  • “Testing System Behavior With IRIS FailSafe Running”

Note: Example pathnames in this chapter show IRIX 6.4 pathnames for logical volumes, such as /dev/xlv/shared1_vol, rather than IRIX 6.2 pathnames such as /dev/dsk/xlv/shared1_vol. On IRIX 6.2 nodes, use IRIX 6.2 pathnames.


Testing the Serial Connections

To test the serial connections between the IRIS FailSafe nodes, follow these steps:

  1. If a remote power control unit is used, confirm that it is powered on by checking that the display light on the front of the box is lit green. (The section “Replacing Batteries in the Remote Power Control Unit” in Chapter 8 explains how to change the batteries.)

  2. On both nodes, enter

    # /etc/init.d/failsafe stop 
    

  3. Enter this command on one node:

    # /usr/etc/ha_spng -i 10 -f reset-tty -d sys-ctlr-type -w password 
    

    The variables are:

    reset-tty  

    The value of the reset-tty parameter for this node in the configuration file /var/ha/ha.conf. An example is /dev/ttyd2.

    sys-ctlr-type  

    The value of the sys-ctlr-type parameter for this node in /var/ha/ha.conf (MMSC for Origin2000 or Onyx2 rack systems, MSC for Origin200 systems and deskside Origin2000 and Onyx2 systems, or CHAL for Challenge).

    Omit the -d sys-ctlr-type option if there is no sys-ctlr-type parameter for this node or if it is set to CHAL.

    password  

    The unencrypted password for this node's system controller.

    Omit the -w password option if the node is a CHALLENGE node or if it is an Origin node that uses the default system controller password.

  4. Check the return value of the ha_spng command by entering the first command below if you are using csh, or the second if you are using sh:

    # echo $status 
    
    # echo $? 
    

    If the return value is 0, the connection is good.

  5. If the return value is 1, verify the connections of the serial cable from each node's serial port to the remote power control unit or to the other node's system controller port.

  6. Repeat steps 3 through 5 on the second node.
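
If you run this test often, steps 3 through 5 can be wrapped in a short Bourne shell script. The following is a minimal sketch that assumes the example values /dev/ttyd2 and MSC; substitute the reset-tty, sys-ctlr-type, and password values from your own /var/ha/ha.conf:

    #!/bin/sh
    # Test the serial connection to the other node.  The device,
    # controller type, and password are placeholders taken from the
    # examples above; use the values from /var/ha/ha.conf.
    /usr/etc/ha_spng -i 10 -f /dev/ttyd2 -d MSC -w password
    if [ $? -eq 0 ]; then
        echo "serial connection is good"
    else
        echo "serial connection failed; check the cabling"
    fi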

Testing the Private Network

To test the private (heartbeat) network, follow these steps:

  1. Enter this command on one node:

    # /usr/etc/ping -r -c 3 priv-xfs-ha1 
    PING priv-xfs-ha1.eng.sgi.com (190.0.3.1): 56 data bytes
    64 bytes from 190.0.3.1: icmp_seq=0 ttl=254 time=3 ms
    64 bytes from 190.0.3.1: icmp_seq=1 ttl=254 time=2 ms
    64 bytes from 190.0.3.1: icmp_seq=2 ttl=254 time=2 ms
    

    priv-xfs-ha1 is the private IP address of the other node. Typical ping output, such as that shown, should appear.

  2. If the ping command fails, verify that the private network interface has been configured up using the ifconfig command, for example:

    # /usr/etc/ifconfig ec3
    ec3: flags=c63<UP,BROADCAST,NOTRAILERS,RUNNING,FILTMULTI,MULTICAST>
            inet 190.0.3.1 netmask 0xffffff00 broadcast 190.0.3.255
    

    The UP in the first line of output indicates that the interface is configured up.

  3. If the ping command fails and the private network interface has been configured up, verify that the private network cables are connected properly.

  4. Repeat steps 1 through 3 on the other node.
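
The checks in steps 1 and 2 can be combined into one Bourne shell sketch. The private IP address priv-xfs-ha1 and the interface name ec3 are the ones used in the examples above; substitute your own values:

    #!/bin/sh
    # Ping the other node over the private network; on failure, show
    # the local private interface so its UP flag can be checked.
    PRIV=priv-xfs-ha1        # private IP address of the other node
    IFACE=ec3                # local private network interface
    if /usr/etc/ping -r -c 3 $PRIV > /dev/null 2>&1; then
        echo "private network to $PRIV is good"
    else
        echo "ping of $PRIV failed; checking interface $IFACE"
        /usr/etc/ifconfig $IFACE
    fi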

Testing the Public Network Interfaces

The procedure below describes how to test the public interfaces on each node. It uses this portion of a configuration file as an example:

node xfs-ha1
{
        interface xfs-ha1-ec0
        {
                name = ec0
                ip-address = xfs-ha1
                netmask = 0xffffff00
                broadcast-addr = 190.0.2.255
        }
        ...
}
node xfs-ha2
...
interface-pair one {
        primary-interface = xfs-ha1-ec0
        secondary-interface = xfs-ha2-ec0
        re-mac = false
        netmask = 0xffffff00
        broadcast-addr = 190.0.2.255
        ip-aliases = ( stocks )
}

Follow these steps:

  1. To test the public network interfaces on the first node (xfs-ha1), enter the following command from a client:

    # /usr/etc/ping -c 3 xfs-ha1
    PING xfs-ha1.engr.sgi.com (190.0.2.1): 56 data bytes
    64 bytes from 190.0.2.1: icmp_seq=0 ttl=254 time=3 ms
    64 bytes from 190.0.2.1: icmp_seq=1 ttl=254 time=2 ms
    64 bytes from 190.0.2.1: icmp_seq=2 ttl=254 time=2 ms
    

    xfs-ha1 is an IP address for an interface on the node xfs-ha1.

  2. Repeat step 1 for the remaining public network interfaces on xfs-ha1.

  3. Repeat step 1 for all public interfaces of the other node in the cluster.
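
To test several public interfaces in one pass from a client, a loop such as the following sketch can be used. It lists only the fixed interface addresses from the example above, not the ip-aliases, because the aliases are available only while IRIS FailSafe is running; substitute every public IP address of both nodes:

    #!/bin/sh
    # Ping each public IP address in turn and report the result.
    # The address list is based on the example configuration file.
    for addr in xfs-ha1 xfs-ha2; do
        if /usr/etc/ping -c 3 $addr > /dev/null 2>&1; then
            echo "$addr: reachable"
        else
            echo "$addr: NOT reachable"
        fi
    done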

Testing Logical Volumes

Follow the procedure below to verify that XLV logical volumes have been configured properly. This portion of a configuration file is used as an example:

volume sharedsybase_vol
{
        server-node = xfs-ha1
        backup-node = xfs-ha2
        devname = shared_sybase
        devname-owner = sybase
        devname-group = sybase
        devname-mode = 0664
}

  1. If the cluster uses a CHALLENGE RAID storage system, enter this command on a node that is a primary node for volumes (xfs-ha1 in this example) to stop the RAID agent:

    # /etc/init.d/raid5 stop 
    

  2. On the same node, enter the following commands to assemble the XLV logical volume shared_sybase (the devname of the volume sharedsybase_vol):

    # xlv_mgr -c "change nodename xfs-ha1 shared_sybase"
    set node name xfs-ha1 for object shared_sybase done
    
    # xlv_assemble -l -s shared_sybase
    VOL shared_sybase       flags=0x1, [complete]         (node=xfs-ha1)
    DATA    flags=0x0()     open_flag=0x0() device=(192, 4)
    PLEX 0  flags=0x0
    VE 0    [active]
            start=0, end=3583999, (cat)grp_size=1
            /dev/dsk/dks5d1s0 (3584000 blks)
    

  3. Repeat step 2 for each of the other volumes with the same primary node.

  4. If you stopped the RAID agent in step 1, restart the RAID agent by entering this command:

    # /etc/init.d/raid5 start 
    

  5. On the same node, list all of the XLV logical volumes:

    # ls -l /dev/xlv
    total 0
    brw-rw-r--    1 sybase   sybase   192,  4 May 22 11:18 shared_sybase
    ...
    

    You should see all volumes that have this node listed as their server-node in the configuration file.

  6. Enter this command to read ten blocks from one of the XLV logical volumes (for example, /dev/xlv/shared_sybase) and discard them:

    # dd if=/dev/xlv/shared_sybase of=/dev/null count=10
    10+0 records in
    10+0 records out
    

    The output should match the output shown.

  7. Repeat step 6 for every volume in the configuration file for which this node is the primary node.

  8. If the other node serves as the primary node for any XLV logical volumes, repeat steps 1 through 7.
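
Steps 6 and 7 can be combined into a loop over the device names in /dev/xlv. The following sketch assumes that every entry in /dev/xlv has this node as its primary node; skip it if volumes served by the other node also appear there:

    #!/bin/sh
    # Read ten blocks from each assembled XLV logical volume and
    # discard them; report any volume that cannot be read.
    for vol in /dev/xlv/*; do
        if dd if=$vol of=/dev/null count=10 > /dev/null 2>&1; then
            echo "$vol: read test passed"
        else
            echo "$vol: read test FAILED"
        fi
    done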

Testing Filesystems

The procedure below tests filesystems configured for IRIS FailSafe by executing the mount commands that the IRIS FailSafe software would execute. These filesystem and volume sections of a configuration file are used as an example:

filesystem shared1
{
           mount-point = /shared1
           mount-info
           {
                   fs-type = xfs
                   volume-name = shared1_vol
                   mode = rw,noauto
           }
}
volume shared1_vol
{
        server-node = xfs-ha1
        backup-node = xfs-ha2
        devname = shared1_vol
        devname-owner = root
        devname-group = sys
        devname-mode = 0600
}

For each filesystem listed in the configuration file, follow this procedure:

  1. Identify the primary node for the filesystem by looking up the primary node (the server-node) of the XLV logical volume used by this filesystem. In the example above, volume-name is shared1_vol; look for the volume block with the label shared1_vol. Its server-node (primary node) is xfs-ha1.

  2. On the primary node, check to see if the XLV logical volume device name exists:

    # ls /dev/xlv/shared1_vol
    

  3. If the device name doesn't exist and you are using a CHALLENGE RAID storage system, stop the RAID agent:

    # /etc/init.d/raid5 stop 
    

  4. If the device name doesn't exist, enter the following commands to assemble the XLV logical volume shared1_vol:

    # xlv_mgr -c "change nodename xfs-ha1 shared1_vol"
    set node name xfs-ha1 for object shared1_vol done
    
    # xlv_assemble -l -s shared1_vol
    VOL shared1_vol         flags=0x1, [complete]         (node=xfs-ha1)
    DATA    flags=0x0()     open_flag=0x0() device=(192, 4)
    PLEX 0  flags=0x0
    VE 0    [active]
            start=0, end=3583999, (cat)grp_size=1
            /dev/dsk/dks5d1s0 (3584000 blks)
    

  5. If you stopped the RAID agent in step 3, restart it with this command:

    # /etc/init.d/raid5 start 
    

  6. On the primary node, mount the filesystem using a mount command that mimics the mount command given by IRIS FailSafe:

    # mount -t xfs -o rw,noauto /dev/xlv/shared1_vol /shared1
    

    The mount should be successful.

  7. Unmount the filesystem:

    # umount /shared1
    

  8. On the secondary node, check to see if the XLV logical volume device name exists:

    # ls /dev/xlv/shared1_vol
    

  9. If the device name doesn't exist and you are using a CHALLENGE RAID storage system, stop the RAID agent:

    # /etc/init.d/raid5 stop 
    

  10. If the device name doesn't exist, enter the following commands on the secondary node to assemble the XLV logical volume shared1_vol:

    # xlv_mgr -c "change nodename xfs-ha2 shared1_vol"
    set node name xfs-ha2 for object shared1_vol done
    
    # xlv_assemble -l -s shared1_vol
    VOL shared1_vol         flags=0x1, [complete]         (node=xfs-ha2)
    DATA    flags=0x0()     open_flag=0x0() device=(192, 4)
    PLEX 0  flags=0x0
    VE 0    [active]
            start=0, end=3583999, (cat)grp_size=1
            /dev/dsk/dks5d1s0 (3584000 blks)
    

  11. If you stopped the RAID agent in step 9, restart it with this command:

    # /etc/init.d/raid5 start 
    

  12. Mount the filesystem on the secondary node by entering the command from step 6 on the secondary node:

    # mount -t xfs -o rw,noauto /dev/xlv/shared1_vol /shared1
    

  13. Unmount the filesystem:

    # umount /shared1
    
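The mount and unmount checks performed on each node (steps 6 and 7 on the primary node, steps 12 and 13 on the secondary node) follow the same pattern and can be wrapped in a Bourne shell sketch, shown here with the example device and mount point:

    #!/bin/sh
    # Mount the filesystem the way IRIS FailSafe would, confirm that
    # the mount succeeded, and unmount it again.
    DEVICE=/dev/xlv/shared1_vol
    MOUNTPT=/shared1
    if mount -t xfs -o rw,noauto $DEVICE $MOUNTPT; then
        echo "mount of $MOUNTPT succeeded"
        umount $MOUNTPT
    else
        echo "mount of $MOUNTPT FAILED"
    fi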

Testing NFS Configuration

The procedure below tests the NFS configuration by exporting filesystems manually and determining whether a client can access them.

This NFS entry in ha.conf is used as an example:

nfs shared1
{
        filesystem = shared1
        export-point = /shared1
        export-info = rw
        ip-address = 190.0.2.3
}

For each NFS block in ha.conf, follow these steps:

  1. Mount /shared1 on either node as described in the section “Testing Filesystems.”

  2. Make sure the IP address is configured by entering this command on the node where /shared1 was mounted:

    # /usr/etc/ping -c 3 190.0.2.3
    PING 190.0.2.3 (190.0.2.3): 56 data bytes
    64 bytes from 190.0.2.3: icmp_seq=0 ttl=254 time=3 ms
    64 bytes from 190.0.2.3: icmp_seq=1 ttl=254 time=2 ms
    64 bytes from 190.0.2.3: icmp_seq=2 ttl=254 time=2 ms
    

  3. From the node on which it is mounted, export the filesystem:

    # exportfs -i -o rw /shared1
    

  4. Make sure the filesystem was exported:

    # exportfs
    /shared1 -rw
    

  5. Verify that you can mount the exported filesystem on a client by entering these commands from a client:

    # mkdir /tempmount
    # mount 190.0.2.3:/shared1 /tempmount 
    # umount /tempmount
    # rmdir /tempmount
    

  6. From the node on which the filesystem is mounted, unexport it and unmount it in preparation for running this test from the other node:

    # exportfs -u /shared1
    # umount /shared1
    

  7. Repeat steps 1 through 6 on the other node. Make sure you do not mount the filesystem simultaneously from both nodes.
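
The export, verify, and unexport sequence in steps 3, 4, and 6 can also be scripted. A minimal Bourne shell sketch using the example values:

    #!/bin/sh
    # Export the filesystem, confirm that it appears in the export
    # list, then unexport and unmount it.
    FS=/shared1
    exportfs -i -o rw $FS
    if exportfs | grep "^$FS" > /dev/null; then
        echo "$FS is exported"
    else
        echo "$FS is NOT exported"
    fi
    exportfs -u $FS
    umount $FS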

Testing Netscape Server Configuration

To test whether the Netscape servers are correctly configured, follow these steps:

  1. Start the Netscape FastTrack and Enterprise servers:

    # /etc/init.d/ns_fasttrack start 
    # /etc/init.d/ns_enterprise start 
    

  2. Run a Web browser, such as Netscape, on a client and try to access Web pages served by each server. (For a server-side check that the daemons are running, see the sketch after these steps.)

  3. Stop the Netscape FastTrack and Enterprise servers:

    # /etc/init.d/ns_fasttrack stop 
    # /etc/init.d/ns_enterprise stop 
    
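During step 2, you can also confirm on the server node that the Web server processes are running. The process name ns-httpd in the following sketch is an assumption; adjust the pattern to match what ps reports for your Netscape server release:

    #!/bin/sh
    # Look for running Netscape server processes.  The bracketed
    # pattern keeps grep from matching its own command line.
    if ps -ef | grep '[n]s-httpd' > /dev/null; then
        echo "Netscape server processes are running"
    else
        echo "no Netscape server processes found"
    fi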

Testing System Behavior With IRIS FailSafe Running

Testing system behavior with IRIS FailSafe running is divided into four phases, described in the following subsections: preparing for testing, checking normal operation, checking failover, and cleaning up after testing.

Preparing for Testing

  1. Edit the file /etc/init.d/failsafe on each node and change the value of MIN_UPTIME from the default (300 seconds) to 0. This change lets you perform multiple failovers in quick succession without the IRIS FailSafe software disabling itself because failovers are occurring too frequently. (The form of the edit is shown in the sketch after these steps.)

  2. Bring up IRIS FailSafe software by entering these two commands on each node:

    # chkconfig failsafe on 
    # /etc/init.d/failsafe start 
    
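The edit in step 1 is a one-line change to /etc/init.d/failsafe. The general form is shown below; the exact assignment in your copy of the script may differ slightly:

    MIN_UPTIME=0        # default is 300; restore this value after testing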

Checking Normal Operation

Follow this procedure to verify that the IRIS FailSafe cluster is operating normally:

  1. Verify that the nodes are in normal state by entering this command on each node:

    # /usr/etc/ha_admin -i 
    ha_admin: Node controller state normal 
    

    If either node has not reached normal state, wait a few minutes and try the command again. If normal state isn't reached, check the /var/adm/SYSLOG file on both nodes for errors. See Appendix B, “System Troubleshooting,” for troubleshooting information.

  2. Verify that NFS filesystems are exported by the cluster by mounting them from a client.

  3. Verify that Netscape servers on the cluster are working by running a browser on a client and viewing Web pages served by the Netscape servers.

  4. Check any other high-availability applications running on the cluster.
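
If you prefer not to rerun the ha_admin command in step 1 by hand, a Bourne shell sketch such as the following polls until the node reaches normal state (it matches the string shown in the step 1 output):

    #!/bin/sh
    # Poll the node state every 30 seconds, up to 10 times.
    try=0
    while [ $try -lt 10 ]; do
        if /usr/etc/ha_admin -i | grep normal > /dev/null; then
            echo "node has reached normal state"
            exit 0
        fi
        sleep 30
        try=`expr $try + 1`
    done
    echo "node did not reach normal state; check /var/adm/SYSLOG"
    exit 1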

Checking Failover

After you have confirmed that the cluster operates correctly when both nodes are active, confirm that the cluster functions correctly in the face of failures by performing the tests below. Each test is independent and should be performed on an IRIS FailSafe cluster that is operating normally.

  1. Power off one node in the cluster. The other node in the cluster should detect the failure and take over the services. If you have an active/backup configuration, power off the active node.

  2. Disconnect the private network. If you have enabled heartbeat messages to be sent over the public network, the cluster should continue to function as before. Otherwise, one node takes over the services of the other node. The node whose services were taken over is rebooted.

    If heartbeat messages are sent over the public network (hb-public-ipname is set to a fixed IP address), enter this command after reconnecting the private network to switch heartbeat messages back to the private network:

    # /usr/etc/ha_admin -x 
    

  3. Disconnect the public network from one of the active nodes in the cluster. The other node should take over the services.

  4. Forcibly unmount a filesystem on an active node to make it unavailable. The other node should take over the services, and the failed node enters standby state. (One way to simulate this failure is shown in the sketch after this list.)

  5. Kill the application daemons (for example, ns_fasttrack) on a node to make the service unavailable. The other node should take over the service. (See the sketch after this list.)

  6. Disconnect the serial line to the remote power control unit (if you are using CHALLENGE S) or to the system controller port (any other system). If you have configured the IRIS FailSafe software to send mail, it notifies the administrator of the failure and otherwise continues to function.

    When the cluster is in this state, neither node can take over if another failure occurs. After you have reconnected the serial line, resume monitoring it by executing /usr/etc/ha_admin -m start <node_name>.
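
The failures in steps 4 and 5 can be simulated from the command line. The sketch below is one way to do it; it assumes that your umount command supports the -k option (kill processes with files open on the filesystem) and that the Netscape daemons appear in ps output under a name like ns-httpd, so verify both on your system first:

    #!/bin/sh
    # Step 4: forcibly unmount a served filesystem (-k kills any
    # processes with files open on it).
    umount -k /shared1

    # Step 5: kill the Web server daemons.  The process name is an
    # assumption; adjust the pattern to match ps output.
    for pid in `ps -ef | grep '[n]s-httpd' | awk '{print $2}'`; do
        kill -9 $pid
    done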

Cleaning Up After Testing

Follow this procedure to return the nodes to normal state after testing:

  1. Edit the file /etc/init.d/failsafe on each node and return the value of MIN_UPTIME to its initial suggested value, 300.

  2. Restart IRIS FailSafe on both nodes by entering these commands on each node:

    # /etc/init.d/failsafe stop
    # /etc/init.d/failsafe start