This chapter explains how to test the IRIS FailSafe system configuration. The tests in every section of this chapter except the last are performed while the IRIS FailSafe software is not running. The last section describes how to test the running IRIS FailSafe software.
The sections in this chapter are as follows:
Note: Example pathnames in this chapter show IRIX 6.4 pathnames for logical volumes, such as /dev/xlv/shared1_vol, rather than IRIX 6.2 pathnames such as /dev/dsk/xlv/shared1_vol. On IRIX 6.2 nodes, use IRIX 6.2 pathnames.
To test the serial connections between the IRIS FailSafe nodes, follow these steps:
If a remote power control unit is used, confirm that it is powered on by checking that the display light on the front of the box is lit green. (The section “Replacing Batteries in the Remote Power Control Unit” in Chapter 8 explains how to change the batteries.)
On both nodes, enter
# /etc/init.d/failsafe stop
Enter this command on one node:
# /usr/etc/ha_spng -i 10 -f reset-tty -d sys-ctlr-type -w password
The variables are:
reset-tty: The value of the reset-tty parameter for this node in the configuration file /var/ha/ha.conf. An example is /dev/ttyd2.

sys-ctlr-type: The value of the sys-ctlr-type parameter for this node in /var/ha/ha.conf (MMSC for Origin2000 or Onyx2 rack systems, MSC for Origin200 systems and deskside Origin2000 and Onyx2 systems, or CHAL for CHALLENGE systems). The -d sys-ctlr-type option is omitted if there is no sys-ctlr-type parameter or it is set to CHAL.

password: The unencrypted password for this node's system controller. The -w password option is omitted if the node is a CHALLENGE node or if it is an Origin node with the default system controller password.
Check the return value of the command by entering the first command if you are using csh and the second command if you are using sh:
# echo $status
# echo $?
If the return value is 0, the connection is good.
If the return value is 1, verify the connections of the serial cable from each node's serial port to the remote power control unit or to the other node's system controller port.
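If you script this test in the Bourne shell, the exit status can be checked directly. The following is only a minimal sketch; it assumes the example values /dev/ttyd2 for reset-tty and MSC for sys-ctlr-type, so substitute the values from your own /var/ha/ha.conf and add or omit the -d and -w options under the conditions described above:

#!/bin/sh
# Sketch: test the serial connection used for resetting the other node.
# Replace /dev/ttyd2 and MSC with the reset-tty and sys-ctlr-type values
# from /var/ha/ha.conf; add -w if a system controller password is set.
/usr/etc/ha_spng -i 10 -f /dev/ttyd2 -d MSC
if [ $? -eq 0 ]
then
    echo "serial connection is good"
else
    echo "serial connection failed; check the serial cabling"
fi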
To test the private (heartbeat) network, follow these steps:
Enter this command on one node:
# /usr/etc/ping -r -c 3 priv-xfs-ha1
PING priv-xfs-ha1.eng.sgi.com (190.0.3.1): 56 data bytes
64 bytes from 190.0.3.1: icmp_seq=0 ttl=254 time=3 ms
64 bytes from 190.0.3.1: icmp_seq=1 ttl=254 time=2 ms
64 bytes from 190.0.3.1: icmp_seq=2 ttl=254 time=2 ms
priv-xfs-ha1 is the private IP address of the other node. Typical ping output, such as that shown, should appear.
If the ping command fails, verify that the private network interface has been configured up using the ifconfig command, for example:
# /usr/etc/ifconfig ec3
ec3: flags=c63<UP,BROADCAST,NOTRAILERS,RUNNING,FILTMULTI,MULTICAST>
        inet 190.0.3.1 netmask 0xffffff00 broadcast 190.0.3.255
The UP in the first line of output indicates that the interface is configured up.
If the ping command fails and the private network interface has been configured up, verify that the private network cables are connected properly.
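Both checks can be combined in a short script run on one node. This is a sketch only; it assumes the example interface name ec3 and private IP name priv-xfs-ha1 used above, which you should replace with the values from your own configuration:

#!/bin/sh
# Sketch: check that the private interface is up and that the other node
# answers on the private network. Replace ec3 and priv-xfs-ha1 with the
# interface name and private IP name from your own configuration.
IFNAME=ec3
PEER=priv-xfs-ha1

if /usr/etc/ifconfig $IFNAME | grep UP > /dev/null
then
    echo "$IFNAME is configured up"
else
    echo "$IFNAME is not up; configure it with ifconfig and try again"
    exit 1
fi

if /usr/etc/ping -r -c 3 $PEER > /dev/null
then
    echo "private network to $PEER is working"
else
    echo "cannot reach $PEER; check the private network cabling"
    exit 1
fi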
The procedure below describes how to test the public interfaces on each node. It uses this configuration file excerpt as an example:
node xfs-ha1
{
    interface xfs-ha1-ec0
    {
        name = ec0
        ip-address = xfs-ha1
        netmask = 0xffffff00
        broadcast-addr = 190.0.2.255
    }
    ...
}
node xfs-ha2
...
interface-pair one
{
    primary-interface = xfs-ha1-ec0
    secondary-interface = xfs-ha2-ec0
    re-mac = false
    netmask = 0xffffff00
    broadcast-addr = 190.0.2.255
    ip-aliases = ( stocks )
}
Follow these steps:
To test the public network interfaces on the first node (xfs-ha1), enter the following command from a client:
# /usr/etc/ping -c 3 xfs-ha1
PING xfs-ha1.engr.sgi.com (190.0.2.1): 56 data bytes
64 bytes from 190.0.2.1: icmp_seq=0 ttl=254 time=3 ms
64 bytes from 190.0.2.1: icmp_seq=1 ttl=254 time=2 ms
64 bytes from 190.0.2.1: icmp_seq=2 ttl=254 time=2 ms
xfs-ha1 is an IP address for an interface on the node xfs-ha1.
Repeat step 1 for the remaining public network interfaces on xfs-ha1.
Repeat step 1 for all public interfaces of the other node in the cluster.
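Because the same test is repeated for every public interface on both nodes, you may find it convenient to run it from the client in a small loop. This is a sketch only; the names xfs-ha1 and xfs-ha2 are the examples used in this chapter and should be replaced with the public IP names of your own nodes:

#!/bin/sh
# Sketch: run from a client; ping each public IP name of the cluster.
# Replace the list with the public interface names of your own nodes.
for addr in xfs-ha1 xfs-ha2
do
    if /usr/etc/ping -c 3 $addr > /dev/null
    then
        echo "$addr: reachable"
    else
        echo "$addr: NOT reachable"
    fi
done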
Follow the procedure below to verify that XLV logical volumes have been configured properly. This portion of a configuration file is used as an example:
volume sharedsybase_vol
{
    server-node = xfs-ha1
    backup-node = xfs-ha2
    devname = shared_sybase
    devname-owner = sybase
    devname-group = sybase
    devname-mode = 0664
}
On a node that is a primary node for volumes (xfs-ha1 in this example), enter this command to stop the RAID agent if the cluster uses a CHALLENGE RAID storage system:
# /etc/init.d/raid5 stop
On the same node, enter the following commands to assemble the XLV logical volume sharedsybase_vol:
# xlv_mgr -c "change nodename xfs-ha1 shared_sybase"
set node name xfs-ha1 for object shared_sybase done
# xlv_assemble -l -s shared_sybase
VOL shared_sybase flags=0x1, [complete] (node=xfs-ha1)
DATA flags=0x0() open_flag=0x0() device=(192, 4)
PLEX 0 flags=0x0
VE 0 [active] start=0, end=3583999, (cat)grp_size=1
/dev/dsk/dks5d1s0 (3584000 blks)
Repeat step 2 for each of the other volumes with the same primary node.
If you stopped the RAID agent in step 1, restart the RAID agent by entering this command:
# /etc/init.d/raid5 start
On the same node, list all of the XLV logical volumes on the node:
# ls -l /dev/xlv
total 0
brw-rw-r-- 1 sybase sybase 192, 4 May 22 11:18 shared_sybase
...
You should see all volumes that have this node listed as their server-node in the configuration file.
Enter this command to read ten blocks from one of the XLV logical volumes (for example, /dev/xlv/shared_sybase) and discard them:
# dd if=/dev/xlv/shared_sybase of=/dev/null count=10
10+0 records in
10+0 records out
The output should match the output shown.
Repeat step 6 for every volume in the configuration file for which this node is the primary node.
If the other node serves as the primary node for any XLV logical volumes, repeat steps 1 through 7.
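Because the assemble-and-read test is repeated for every volume a node serves, a small script can reduce the typing. The sketch below assumes the example names from this section (primary node xfs-ha1 and device name shared_sybase); substitute the server-node and devname values from your own configuration file, and stop and restart the RAID agent around it if the cluster uses a CHALLENGE RAID storage system:

#!/bin/sh
# Sketch: on the primary node, assemble each XLV logical volume it serves
# and read a few blocks from it. Replace NODE and VOLUMES with the
# server-node and devname values from your own configuration file.
NODE=xfs-ha1
VOLUMES="shared_sybase"

for vol in $VOLUMES
do
    xlv_mgr -c "change nodename $NODE $vol"
    xlv_assemble -l -s $vol
    if dd if=/dev/xlv/$vol of=/dev/null count=10
    then
        echo "$vol: read test passed"
    else
        echo "$vol: read test FAILED"
    fi
done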
The procedure below tests filesystems configured for IRIS FailSafe by executing the mount commands that the IRIS FailSafe software would execute. These filesystem and volume sections of a configuration file are used as an example:
filesystem shared1
{
    mount-point = /shared1
    mount-info
    {
        fs-type = xfs
        volume-name = shared1_vol
        mode = rw,noauto
    }
}
volume shared1_vol
{
    server-node = xfs-ha1
    backup-node = xfs-ha2
    devname = shared1_vol
    devname-owner = root
    devname-group = sys
    devname-mode = 0600
}
For each filesystem listed in the configuration file, follow this procedure:
Identify the primary node for the filesystem by looking up the primary node (the server-node) of the XLV logical volume used by this filesystem. In the example above, volume-name is shared1_vol; look for the volume block with the label shared1_vol. Its server-node (primary node) is xfs-ha1.
On the primary node, check to see if the XLV logical volume device name exists:
# ls /dev/xlv/shared1_vol
If the device name doesn't exist and you are using a CHALLENGE RAID storage system, stop the RAID agent:
# /etc/init.d/raid5 stop
If the device name doesn't exist, enter the following commands to assemble the XLV logical volume shared1_vol:
# xlv_mgr -c "change nodename xfs-ha1 shared1_vol"
set node name xfs-ha1 for object shared1_vol done
# xlv_assemble -l -s shared1_vol
VOL shared1_vol flags=0x1, [complete] (node=xfs-ha1)
DATA flags=0x0() open_flag=0x0() device=(192, 4)
PLEX 0 flags=0x0
VE 0 [active] start=0, end=3583999, (cat)grp_size=1
/dev/dsk/dks5d1s0 (3584000 blks)
If you stopped the RAID agent in step 3, restart it with this command:
# /etc/init.d/raid5 start
On the primary node, mount the filesystem using a mount command that mimics the mount command given by IRIS FailSafe:
# mount -t xfs -o rw,noauto /dev/xlv/shared1_vol /shared1
The mount should be successful.
Unmount the filesystem:
# umount /shared1
On the secondary node, check to see if the XLV logical volume device name exists:
# ls /dev/xlv/shared1_vol
If the device name doesn't exist and you are using a CHALLENGE RAID storage system, stop the RAID agent:
# /etc/init.d/raid5 stop
If the device name doesn't exist, enter the following commands on the secondary node to assemble the XLV logical volume shared1_vol:
# xlv_mgr -c "change nodename xfs-ha2 shared1_vol"
set node name xfs-ha2 for object shared1_vol done
# xlv_assemble -l -s shared1_vol
VOL shared1_vol flags=0x1, [complete] (node=xfs-ha2)
DATA flags=0x0() open_flag=0x0() device=(192, 4)
PLEX 0 flags=0x0
VE 0 [active] start=0, end=3583999, (cat)grp_size=1
/dev/dsk/dks5d1s0 (3584000 blks)
If you stopped the RAID agent in step 9, restart it with this command:
# /etc/init.d/raid5 start
Mount the filesystem on the secondary node by entering the command from step 6:
# mount -t xfs -o rw,noauto /dev/xlv/shared1_vol /shared1
Unmount the filesystem:
# umount /shared1
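The assemble, mount, and unmount steps can also be expressed as a short script run on each node in turn. The sketch below assumes the example filesystem from this section (volume shared1_vol mounted on /shared1 with options rw,noauto); substitute your own node name, volume, mount point, and options, never run it on both nodes at the same time, and stop and restart the RAID agent around the assembly if you use a CHALLENGE RAID storage system:

#!/bin/sh
# Sketch: assemble the volume if its device name is missing, then mount
# and unmount the filesystem the way IRIS FailSafe would. Run it on one
# node at a time, never on both nodes at once. Replace the values below
# with those from your own configuration file.
NODE=xfs-ha1            # use xfs-ha2 when running on the secondary node
VOL=shared1_vol
MNT=/shared1
OPTS=rw,noauto

if [ ! -b /dev/xlv/$VOL ]
then
    xlv_mgr -c "change nodename $NODE $VOL"
    xlv_assemble -l -s $VOL
fi

if mount -t xfs -o $OPTS /dev/xlv/$VOL $MNT
then
    echo "$MNT mounted successfully"
    umount $MNT
else
    echo "mount of $MNT FAILED"
fi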
The procedure below tests NFS configuration by exporting filesystems manually and determining if a client can access them.
This NFS entry in ha.conf is used as an example:
nfs shared1
{
    filesystem = shared1
    export-point = /shared1
    export-info = rw
    ip-address = 190.0.2.3
}
For each NFS block in ha.conf, follow these steps:
Mount /shared1 on either node as described in the section “Testing Filesystems.”
Make sure the IP address is configured by entering this command on the node where /shared1 was mounted:
# /usr/etc/ping -c 3 190.0.2.3
PING 190.0.2.3 (190.0.2.3): 56 data bytes
64 bytes from 190.0.2.3: icmp_seq=0 ttl=254 time=3 ms
64 bytes from 190.0.2.3: icmp_seq=1 ttl=254 time=2 ms
64 bytes from 190.0.2.3: icmp_seq=2 ttl=254 time=2 ms
From the node on which it is mounted, export the filesystem:
# exportfs -i -o rw /shared1
Make sure the filesystem was exported:
# exportfs
/shared1 -rw
Verify that you can mount the exported filesystem on a client by entering these commands from a client:
# mkdir /tempmount
# mount 190.0.2.3:/shared1 /tempmount
# umount /tempmount
# rmdir /tempmount
From the node on which the filesystem is mounted, unexport it and unmount it in preparation for running this test from the other node:
# exportfs -u /shared1
# umount /shared1
Repeat steps 1 through 6 on the other node. Make sure you do not mount the filesystem simultaneously from both nodes.
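The server-side portion of this test can likewise be scripted. The following sketch, run on the node where /shared1 is currently mounted, uses the example values from the nfs block above (export point /shared1, export options rw, and IP alias 190.0.2.3); substitute the values from your own ha.conf, and still perform the client-side mount test from a client by hand:

#!/bin/sh
# Sketch: run on the node where the filesystem is mounted; export it,
# confirm the export, then unexport it. Replace the values below with
# those from the nfs block in your own ha.conf.
EXPORT=/shared1
OPTIONS=rw
ALIAS=190.0.2.3

if /usr/etc/ping -c 3 $ALIAS > /dev/null
then
    echo "IP alias $ALIAS answers"
else
    echo "IP alias $ALIAS is not configured"
    exit 1
fi

exportfs -i -o $OPTIONS $EXPORT
if exportfs | grep "$EXPORT" > /dev/null
then
    echo "$EXPORT is exported"
else
    echo "export of $EXPORT FAILED"
fi
exportfs -u $EXPORT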
To test whether the Netscape servers are correctly configured, follow these steps:
Start the Netscape FastTrack and Enterprise servers:
# /etc/init.d/ns_fasttrack start
# /etc/init.d/ns_enterprise start
Run a Web browser, such as Netscape, on a client and try to access some Web pages served by the server.
Stop the Netscape FastTrack and Enterprise servers:
# /etc/init.d/ns_fasttrack stop
# /etc/init.d/ns_enterprise stop
Testing system behavior with IRIS FailSafe running is broken into four phases in the following subsections. The phases are preparing for testing, checking normal operation, checking failover, and cleaning up after testing.
Edit the file /etc/init.d/failsafe on each node and change the value of MIN_UPTIME from the default (300 seconds) to 0. This allows repeated failovers during testing without the FailSafe software disabling itself because failovers are occurring too frequently. (One way to script this edit is shown after this procedure.)
Bring up IRIS FailSafe software by entering these two commands on each node:
# chkconfig failsafe on
# /etc/init.d/failsafe start
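If you prefer not to edit /etc/init.d/failsafe by hand, the MIN_UPTIME change in the first step can be made with a sed substitution. This is a sketch only; it assumes the assignment is written in the file exactly as MIN_UPTIME=300, so check the file before using it:

#!/bin/sh
# Sketch: set MIN_UPTIME to 0 in /etc/init.d/failsafe on this node.
# Assumes the assignment appears as MIN_UPTIME=300; check the file first
# and keep the saved copy so the value can be restored after testing.
FILE=/etc/init.d/failsafe
cp $FILE $FILE.orig
sed 's/^MIN_UPTIME=300/MIN_UPTIME=0/' $FILE.orig > $FILE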
Follow this procedure to verify that the IRIS FailSafe cluster is operating normally:
Verify that the nodes are in normal state by entering this command on each node:
# /usr/etc/ha_admin -i
ha_admin: Node controller state normal
If either node has not reached normal state, wait a few minutes and try the command again; a polling sketch is shown after this procedure. If normal state still is not reached, check the /var/adm/SYSLOG file on both nodes for errors. See Appendix B, “System Troubleshooting,” for troubleshooting information.
Verify that NFS filesystems are exported by the cluster by mounting them from a client.
Verify that Netscape servers on the cluster are working by running a browser on a client and viewing Web pages served by the Netscape servers.
Check any other high-availability applications running on the cluster.
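Rather than rerunning ha_admin -i by hand while waiting for normal state, you can poll for it with a loop such as the following sketch; it assumes that ha_admin -i reports the word "normal" as shown in the example output above:

#!/bin/sh
# Sketch: wait up to five minutes for this node to reach normal state,
# checking every 30 seconds. Assumes ha_admin -i reports the word
# "normal" as shown in the example output above.
tries=10
while [ $tries -gt 0 ]
do
    if /usr/etc/ha_admin -i | grep normal > /dev/null
    then
        echo "node is in normal state"
        exit 0
    fi
    sleep 30
    tries=`expr $tries - 1`
done
echo "node did not reach normal state; check /var/adm/SYSLOG on both nodes"
exit 1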
After you have confirmed that the cluster operates correctly when both nodes are active, confirm that the cluster functions correctly in the face of failures by performing the tests below. Each test is independent and should be performed on an IRIS FailSafe cluster that is operating normally.
Power off one node in the cluster. The other node in the cluster should detect the failure and take over the services. If you have an active/backup configuration, power off the active node.
Disconnect the private network. If you have enabled heartbeat messages to be sent over the public network, the cluster should continue to function as before. Otherwise, one node takes over the services of the other node. The node whose services were taken over is rebooted.
If heartbeat messages are sent over the public network (hb-public-ipname is set to a fixed IP address), enter this command after reconnecting the private network to switch heartbeat messages back to the private network:
# /usr/etc/ha_admin -x
Disconnect the public network from one of the active nodes in the cluster. The other node should take over the services.
Forcibly unmount a filesystem on an active node to make it unavailable. The other node should take over the services. The failed node enters standby state.
Kill the application daemons (for example, ns_fasttrack) on a node to make the service unavailable. The other node should take over the service.
Disconnect the serial line to the remote power control unit (if you are using CHALLENGE S) or to the system controller port (any other system). If you have configured the IRIS FailSafe software to send mail, it notifies the administrator of the failure and otherwise continues to function.
When the cluster is in this state, neither node can take over if another failure occurs. After you have reconnected the serial line, you can resume monitoring of the serial line by executing /usr/etc/ha_admin -m start <node_name>.
Follow this procedure to return the nodes to normal state after testing:
Edit the file /etc/init.d/failsafe on each node and return the value of MIN_UPTIME to its initial suggested value, 300.
Restart IRIS FailSafe on both nodes by entering these commands on each node:
# /etc/init.d/failsafe stop
# /etc/init.d/failsafe start