Chapter 4. FailSafe Installation and System Preparation


Note: The procedures in this chapter assume that you have done the work described in Chapter 2, “Configuration Planning”.

The following steps are required for FailSafe installation and system preparation:

  • “Install FailSafe”

  • “Configure System Files”

  • “Set the corepluspid System Parameter”

  • “Set NVRAM Variables”

  • “Create XLV Logical Volumes and XFS Filesystems”

  • “Configure Network Interfaces”

  • “Configure the Ring Reset Serial Port”

  • “Install Patches”

  • “Install Performance Co-Pilot Software”

  • “Test the System”

  • “Modifications Required for Connectivity Diagnostics”

Install FailSafe

This section discusses the following:

  • “Install Procedure”

  • “Differences When Performing a Remote or Miniroot Install”

Install Procedure

Installing the FailSafe base CD requires about 10 MB of free space.
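
You can check the available space before installing; for example (the target filesystems shown are assumptions; most of the software installs under /usr and /var):

# df -k /usr /var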

To install the required software, do the following:

  1. On each node in the pool, upgrade to a supported release of IRIX according to the IRIX 6.5 Installation Instructions and the FailSafe product release notes:

    # relnotes failsafe2 [chapter_number]

    To verify that a given node has been upgraded, use the following command to display the currently installed system:

    # uname -aR

  2. Depending on the servers and storage in the configuration and the IRIX revision level, install the latest recommended patches. For information on recommended patches for each platform, see: http://bits.csd.sgi.com/digest/patches/recommended/

  3. On each node, install the version of the serial port server driver that is appropriate to the operating system. Use the CD that accompanies the serial port server. Reboot the system after installation.

    For more information, see the following documentation provided with the serial port server:

    • EL Serial Port Server Installation Guide (provided by Digi Corporation)

    • EL Serial Port Server Installation Guide Errata

  4. On each node, install the following software, in the order shown:

    sysadm_base.sw.dso 
    sysadm_base.sw.server 
    sysadm_cluster.sw.server
    cluster_admin.sw.base 
    cluster_control.sw.base
    cluster_services.sw.base
    cluster_services.sw.cli 
    failsafe2.sw 
    sysadm_failsafe2.sw.server

    When sysadm_base is installed, tcpmux service is added to the /etc/inetd.conf file.
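
    You can confirm that the entry was added after installation; for example (the exact entry text may vary):

    # grep tcpmux /etc/inetd.conf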


    Note: For systems that do not have sysadmdesktop installed, inst reports missing prerequisites. Resolve this conflict by installing sysadm_base.sw.priv, which provides a subset of the functionality of sysadmdesktop.sw.base and is included in this distribution, or by installing sysadmdesktop.sw.base from the IRIX distribution.

    If you try to install sysadm_base.sw.priv on a system that already has sysadmdesktop.sw.base, inst reports incompatible subsystems. Resolve this conflict by not installing sysadm_base.sw.priv. Similar conflicts occur if you try to install sysadmdesktop.sw.base on a system that already has sysadm_base.sw.priv.

    If the nodes are to be administered by a web-based version of the GUI, install these subsystems, in the order shown:

    java2_eoe.sw 
    java2_eoe.sw32
    sysadm_base.sw.client 
    sysadm_cluster.sw.client
    sysadm_failsafe2.sw.client
    sysadm_failsafe2.sw.web


    Caution: The GUI only operates with Java2 v1.4.1 Execution Environment (Sun JRE v1.4.1). This is the version of Java that is provided with the IRIX 6.5.x release.

    The SGI website also contains Java 1; however, you cannot use that version with the GUI. Using any Java version other than 1.4.1 will cause the GUI to fail.
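
    To check which Java release is installed on a node, you can use the versions command (shown only as a quick check; the output format varies by release):

    # versions java2_eoe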


  5. On each node, install application software and appropriate optional FailSafe plug-in software. For example, for NFS install the following:

    nfs.sw.nfs (If necessary; from IRIX, might already be present)
    failsafe2_nfs.sw

  6. If you want to run the administrative workstation (GUI client) from an IRIX desktop, install the following subsystems on the desktop:

    sysadm_failsafe2.sw.desktop
    sysadm_failsafe2.sw.client
    sysadm_base.sw.client
    sysadm_cluster.sw.client
    java2_eoe.sw
    java2_eoe.sw32

    If the administrative workstation is an IRIX machine that launches the GUI client from a Web browser that supports Java, install the 1.4.1 java_plugin from the IRIX applications CD. (However, launching the GUI from a Web browser is not the recommended method on IRIX; it is better to invoke the fsmgr command.)

    After installing the Java plug-in, you must close all browser windows and restart the browser.

  7. On the appropriate nodes, install other optional software, such as storage management or network board software.

  8. If the cluster is using plexed XLV logical volumes, do the following:

    1. Install a disk plexing license on each node in the /var/flexlm/license.dat file. For more information on XLV logical volumes and on XFS plexing and filesystems, see Chapter 2, “Configuration Planning”.

    2. Verify that the license has been successfully installed on each node in the cluster by using the xlv_mgr command:

      # xlv_mgr
      xlv_mgr> show config

      If the license is successfully installed, the following line appears:

      Plexing license: present

    3. Quit xlv_mgr.

  9. Install recommended patches for FailSafe.

    For instructions on installing a FailSafe patch, see “Install Patches”.

  10. Set the AutoLoad variable to yes; this can be done when you set host SCSI IDs, as explained in “Set NVRAM Variables”.


Note: For reference, Appendix A, “FailSafe Software”, summarizes systems to install on each component of a cluster or node.


Differences When Performing a Remote or Miniroot Install

If you perform a remote or miniroot install, the exitop commands are deferred until cluster services are started, at which time they are run. During the installation, you will see can't run remotely messages among the normal set of messages. For example:

Removing orphaned directories
Installing/removing files ..  94% 
Running exit-commands ..  94% 
cluster_admin.sw.base: ( $rbase/usr/cluster/bin/cdb-exitop )
cdb-exitop: can't run remotely - scheduling to run later
Running exit-commands ..  95% 
cluster_control.sw.base: ( $rbase/usr/cluster/bin/cluster_control-exitop )
cluster_control-exitop: can't run remotely - scheduling to run later
Running exit-commands ..  96% 
cluster_services.sw.base: ( $rbase/usr/cluster/bin/cluster_ha-exitop )
cluster_ha-exitop: can't run remotely - scheduling to run later
Running exit-commands ..  97% 
cxfs_cluster.sw.base: ( $rbase/usr/cluster/bin/cluster_cx-exitop )
cluster_cx-exitop: can't run remotely - scheduling to run later
Running exit-commands ..  99% 
Checking dependencies .. 100% Done.
Installations and removals were successful.
You may continue with installations or quit now.

If you see the above messages during the install process, you will see the following messages when the cluster services start up:

running /usr/cluster/bin/cdb-exitop
cdb-exitop: initializing CDB
cdb-exitop: success
running /usr/cluster/bin/cluster_control-exitop
cluster_control-exitop: success
running /usr/cluster/bin/cluster_ha-exitop
cluster_ha-exitop: Added HA keys to /var/cluster/cdb/cdb.db

         * * * * * * * * * * I M P O R T A N T * * * * * * * * * * * * *

         "sgi-cmsd" service MUST be added to /etc/services.
         Restart cluster processes after adding the entry.  Failure to do so
         will cause cluster and FailSafe services to function incorrectly.
         Please refer to the SGI FailSafe Administrator's Guide for more
         information.


         * * * * * * * * * * I M P O R T A N T * * * * * * * * * * * * *

         "sgi-gcd" service MUST be added to /etc/services.
         Restart cluster processes after adding the entry.  Failure to do so
         will cause cluster and FailSafe services to function incorrectly.
         Please refer to the SGI FailSafe Administrator's Guide for more
         information.

cluster_ha-exitop: success
running /usr/cluster/bin/cluster_cx-exitop
cluster_cx-exitop: Added CXFS keys to /var/cluster/cdb/cdb.db
cluster_cx-exitop: success

Configure System Files

This section discusses the following:

  • “/etc/services”

  • “/etc/config/cad.options”

  • “/etc/config/fs2d.options”

  • “/etc/config/cmond.options”

Also see the best-practices information in “System File Configuration” in Chapter 3.

/etc/services

Edit the /etc/services file so that it contains entries for sgi-cad and sgi-crsd before you install the cluster_admin product on each node in the pool. The port numbers assigned for these processes must be the same in all nodes in the pool.


Note: sgi-cad requires a TCP port for communication between FailSafe nodes.

The following shows an example of /etc/services entries for sgi-cad and sgi-crsd:

sgi-crsd        7500/udp           # Cluster Reset Services Daemon
sgi-cad         9000/tcp           # Cluster Admin daemon

Edit the /etc/services file so that it contains entries for sgi-cmsd and sgi-gcd on each node before starting highly available (HA) services on the node. The port numbers assigned for these processes must be the same in all nodes in the cluster.

The following shows an example of /etc/services entries for sgi-cmsd and sgi-gcd:

sgi-cmsd        7000/udp         # FailSafe Membership Daemon
sgi-gcd         8000/udp         # Group Communication Daemon
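
After you have edited the file on each node, you can confirm that all four entries are present and use the same port numbers on every node; for example (a quick check, not part of the formal procedure):

# egrep 'sgi-(cad|crsd|cmsd|gcd)' /etc/services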

/etc/config/cad.options

The /etc/config/cad.options file contains the list of parameters that the cad cluster administration daemon reads when the process is started. cad provides cluster information to the GUI.

The following options can be set in the cad.options file:

--append_log 

Append cad logging information to the cad log file instead of overwriting it.

--log_file filename 

Set the cad log file name. Alternatively, this can be specified as -lf filename.

-vvvv 

Verbosity level. The number of v characters indicates the level of logging. Setting -v logs the fewest messages. Setting -vvvv logs the highest number of messages.

The following example shows an /etc/config/cad.options file:

-vv -lf /var/cluster/ha/log/cad_nodename --append_log

The contents of the /etc/config/cad.options file cannot be modified using the cmgr command or the GUI.


Note: If you make a change to the cad.options file at any time other than initial configuration, you must restart the cad processes in order for these changes to take effect. You can do this by rebooting the nodes or by entering the following command:
# /etc/init.d/cluster restart



If you execute this command on a running cluster, it will remain up and running. However, the GUI will lose connection with the cad daemon; the GUI will prompt you to reconnect.


/etc/config/fs2d.options

The /etc/config/fs2d.options file contains the list of parameters that the fs2d daemon reads when the process is started. The fs2d daemon is the cluster database daemon that manages the distribution of the cluster database across the nodes in the pool.

Table 4-1 shows the options that can be set in the fs2d.options file.

Table 4-1. fs2d.options File Options

-logevents event_name

    Log selected events. The following event names may be used: all, internal, args, attach, chandle, node, tree, lock, datacon, trap, notify, access, storage. The default is all.

-logdest log_destination

    Set log destination. The following log destinations may be used: all, stdout, stderr, syslog, logfile. If multiple destinations are specified, the log messages are written to all of them. If logfile is specified, it has no effect unless the -logfile option is also specified. The default is logfile.

-logfile filename

    Set log filename. The default is /var/cluster/ha/log/fs2d_log.

-logfilemax maximum_size

    Set log file maximum size (in bytes). If the file exceeds the maximum size, any preexisting filename.old will be deleted, the current file will be renamed to filename.old, and a new file will be created. A single message will not be split across files. If -logfile is set, the default is 10000000.

-loglevel loglevel

    Set log level. The following log levels may be used: always, critical, error, warning, info, moreinfo, freq, morefreq, trace, busy. The default is info.

-trace trace_class

    Trace selected events. The following trace classes may be used: all, rpcs, updates, transactions, monitor. If you specify this option, you must also specify -tracefile and/or -tracelog; no tracing is done unless at least one of them is specified. The default is transactions.

-tracefile filename

    Set trace filename. There is no default.

-tracefilemax maximum_size

    Set trace file maximum size (in bytes). If the file exceeds the maximum size, any preexisting filename.old will be deleted, the current file will be renamed to filename.old, and a new file will be created.

-[no]tracelog

    [Do not] trace to the log destination. When this option is set, tracing messages are directed to the log destination or destinations. If there is also a trace file, the tracing messages are written there as well. The default is -tracelog.

-[no]parent_timer

    [Do not] exit when the parent exits. The default is -noparent_timer.

-[no]daemonize

    [Do not] run as a daemon. The default is -daemonize.

-l

    Do not run as a daemon.

-h

    Print usage message.

-o help

    Print usage message.

If you use the default values for these options, the system will be configured so that all log messages of level info or less, and all trace messages for transaction events, are sent to the /var/cluster/ha/log/fs2d_log file. When the file size reaches 10 MB, the file is renamed to fs2d_log.old and logging rolls over to a new file of the same name. A single message will not be split across files.
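
For example, with these defaults you can list the current and rolled-over log files as follows (shown only as an illustration):

# ls -l /var/cluster/ha/log/fs2d_log*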


Note: If you make a change to the fs2d.options file at any time other than initial configuration, you must restart the fs2d processes in order for those changes to take effect. You can do this by rebooting the nodes or by entering the following command:
# /etc/init.d/cluster restart



If you execute this command on a running cluster, it should remain up and running. However, the GUI will lose connection with the cad daemon; the GUI will prompt you to reconnect.


Example 1

The following example shows an /etc/config/fs2d.options file that directs logging and tracing information as follows:

  • All log events are sent to /var/adm/SYSLOG.

  • Tracing information for RPCs, updates, and transactions is sent to /var/cluster/ha/log/fs2d_ops1.

    When the size of this file exceeds 100,000,000 bytes, this file is renamed to /var/cluster/ha/log/fs2d_ops1.old and a new file /var/cluster/ha/log/fs2d_ops1 is created. A single message is not split across files.

(Line breaks added here only for readability.)

-logevents all -loglevel trace -logdest syslog -trace rpcs 
-trace updates -trace transactions -tracefile /var/cluster/ha/log/fs2d_ops1 
-tracefilemax 100000000

Example 2

The following example shows an /etc/config/fs2d.options file that directs all log and trace messages into one file, /var/cluster/ha/log/fs2d_nodeA, for which a maximum size of 100,000,000 bytes is specified. -tracelog directs the tracing to the log file.

(Line breaks added here only for readability.)

-logevents all -loglevel trace -trace rpcs -trace updates 
-trace transactions -tracelog -logfile /var/cluster/ha/log/fs2d_nodeA 
-logfilemax 100000000 -logdest logfile

/etc/config/cmond.options

The /etc/config/cmond.options file contains the list of parameters that the cmond cluster monitor daemon reads when the process is started. It also specifies the name of the file that logs cmond events. cmond provides a framework for starting, stopping, and monitoring process groups. See the cmond man page for more information.

The following options can be set in the cmond.options file:

-L log_level 

Set log level to log_level. The legal values for log_level are normal, critical, error, warning, info, frequent, and all.

-d 

Run in debug mode.

-l 

Lazy mode, where cmond does not validate its connection to the cluster database.

-t nap_interval 

The time interval in milliseconds after which cmond checks for liveliness of process groups it is monitoring.

-s 

Log messages to standard error.

A default cmond.options file is shipped with the following options. This default options file logs cmond events to the /var/cluster/ha/log/cmond_log file.

-L info -f /var/cluster/ha/log/cmond_log

Set the corepluspid System Parameter

Use the systune command to set the corepluspid flag to 1 on every node. If this flag is set, IRIX suffixes all core files with a process ID (PID), which prevents a core dump from being overwritten by the core dump of another process.
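
For example, you can display the current value and then set the flag as follows (the exact systune output and confirmation prompts may vary by IRIX release; see the systune man page):

# systune corepluspid
# systune corepluspid 1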

Set NVRAM Variables

During the hardware installation of FailSafe nodes, two non-volatile random-access memory (NVRAM) variables must be set:

  • The boot parameter AutoLoad must be set to yes. FailSafe requires the nodes to be automatically booted when they are reset or when the node is powered on.

  • The SCSI IDs of the nodes, specified by the scsihostid variable, must be different. This variable is important only when a cluster is configured with shared SCSI storage. If a cluster has no shared storage or is using shared Fibre Channel storage, setting scsihostid is not important.

You can check the setting of these variables with the following commands:

# nvram AutoLoad
Y
# nvram scsihostid 
0

To set these variables, use the following commands:

# nvram AutoLoad yes
# nvram scsihostid number 

number is the SCSI ID you choose. A node uses its SCSI ID on all buses attached to it. Therefore, you must ensure that no device attached to a node has number as its SCSI unit number. If you change the value of the scsihostid variable, you must reboot the system for the change to take effect.

Create XLV Logical Volumes and XFS Filesystems

You can create XLV logical volumes by following the instructions in the guide IRIX Admin: Disks and Filesystems.


Note: This section describes logical volume configuration using XLV logical volumes. For information on coexecution of FailSafe and CXFS filesystems (which use XVM logical volumes), see “Coexecution of CXFS and FailSafe” in Chapter 2. For information on creating CXFS filesystems, see the CXFS Administration Guide for SGI InfiniteStorage. For information on creating XVM logical volumes, see the XVM Volume Manager Administrator's Guide.

When you create XLV logical volumes and XFS filesystems, remember the following important points:

  • If the shared disks are not in a RAID storage system, you should create plexed XLV logical volumes.

  • Each XLV logical volume must be owned by the same node that is the primary node for the resources that use the logical volume. To simplify the management of the owners of volumes on shared disks, use the following recommendations:

    • Work with the volumes on a shared disk from only one node in the cluster.

    • After you create all the volumes on one node, you can selectively change the nodename to the other node using xlv_mgr.

  • If the XLV logical volumes you create are used as raw volumes (that is, with no filesystem) for storing database data, the database system may require that the device names (in /dev/rxlv and /dev/xlv) have specific owners, groups, and modes. If this is the case (see the documentation provided by the database vendor), use the chown and chmod commands to set the owner, group, and mode as required.

  • No filesystem entries are made in /etc/fstab for XFS filesystems on shared disks; FailSafe software mounts the filesystems on shared disks. However, to simplify system administration, consider adding comments to /etc/fstab that list the XFS filesystems configured for FailSafe (see the example after this list). A system administrator who sees mounted FailSafe filesystems in the output of the df command and then looks for them in /etc/fstab will know that they are managed by FailSafe.

  • Be sure to create the mount point directory for each filesystem on all nodes.
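
The following /etc/fstab fragment illustrates such comments; the volume and mount point names are examples only and are not part of the procedure:

# The following filesystems are managed by FailSafe and are NOT mounted from
# this file:
#   /dev/xlv/shared_vol1   /shared1   xfs
#   /dev/xlv/shared_vol2   /shared2   xfs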

Configure Network Interfaces

This section describes how to configure the network interfaces. The example shown in Figure 4-1 is used in the procedure.

Figure 4-1. Example Interface Configuration

Example Interface Configuration

  1. If possible, add every IP address, IP name, and highly available (HA) IP address (alias) for the nodes to /etc/hosts on one node.

    For example:

    190.0.2.1 xfs-ha1.company.com xfs-ha1
    190.0.2.3 stocks
    190.0.3.1 priv-xfs-ha1
    190.0.2.2 xfs-ha2.company.com xfs-ha2
    190.0.2.4 bonds
    190.0.3.2 priv-xfs-ha2


    Note: HA IP addresses that are used exclusively by HA services are not added to the file /etc/config/ipaliases.options. Similarly, if all IP aliases are used only by HA services, the ipaliases chkconfig flag should be off.
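
    For example, if all IP aliases on a node are HA IP addresses managed by FailSafe, you would turn the flag off (shown only as an illustration):

    # chkconfig ipaliases off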


  2. Add all of the IP addresses from step 1 to /etc/hosts on the other nodes in the cluster.

  3. If there are IP addresses, IP names, or HA IP addresses that you did not add to /etc/hosts in steps 1 and 2, verify that NIS is configured on all nodes by entering the following command on each node:

    # chkconfig | grep yp
    ...
            yp           on

    If the output shows that yp is off, you must start NIS. See the NIS Administrator's Guide for details.

  4. For IP addresses, IP names, and HA IP addresses that you did not add to /etc/hosts on the nodes in steps 1 and 2, verify that they are in the NIS database by entering the following command for each address:

    # ypmatch address hosts
    190.0.2.1 xfs-ha1.company.com xfs-ha1

    address is an IP address, IP name, or HA IP address. If ypmatch reports that address does not match, it must be added to the NIS database. See the NIS Administrator's Guide for details.

  5. On one node, add that node's interfaces and their IP addresses to the file /etc/config/netif.options. Do not add highly available (HA) IP addresses to the netif.options file.

    For the example in Figure 4-1, the public interface name and IP address lines are as follows:

    if1name=ec0
    if1addr=$HOSTNAME

    $HOSTNAME is an alias for an IP address that appears in /etc/hosts.

    If there are additional public interfaces, their interface names and IP addresses appear on lines such as the following:

    if2name=
    if2addr=

    In the example, the control network name and IP address are as follows:

    if3name=ec3
    if3addr=priv-$HOSTNAME

    The control network IP address in this example, priv-$HOSTNAME, is an alias for an IP address that appears in /etc/hosts.

  6. If there are more than eight interfaces on the node, change the value of if_num to the number of interfaces. For fewer than eight interfaces (as in the example in Figure 4-1), the line is as follows:

    if_num=8

  7. Repeat Steps 5 and 6 on the other nodes.

  8. Edit the /etc/config/routed.options file on each node so that the routes are not advertised over the control network. See the routed man page for a list of options.

    For example:

    -q -h -Prdisc_interval=45


    Note: The -q option is required for FailSafe to function correctly. This ensures that the heartbeat network does not get loaded with packets that are not related to the cluster.

    The options do the following:

    • Turn off advertising of routes

    • Cause host or point-to-point routes to not be advertised (provided there is a network route going the same direction)

    • Set the normal interval with which router discovery advertisements are transmitted to 45 seconds (and their lifetime to 135 seconds)

  9. Verify that FailSafe 2.x is turned off on each node, using the chkconfig command:

    # chkconfig | grep failsafe2
    ...
            failsafe2          off
    ...

    If failsafe2 is set to on on a node, enter this command on that node:

    # chkconfig failsafe2 off

    If FailSafe 1.x is present, you must also ensure that its chkconfig flag is not set to on for any node:

    # chkconfig | grep failsafe
    ...
            failsafe             off
    ...

    If failsafe is on on any node, enter this command on that node:

    # chkconfig failsafe off

  10. Configure an e-mail alias on each node that sends the FailSafe e-mail notifications of cluster transitions to a user outside of the cluster and to a user on the other nodes in the cluster.

    For example, if there are two nodes called xfs-ha1 and xfs-ha2, add the following to /usr/lib/aliases on xfs-ha1:

    fsafe_admin:[email protected],[email protected] 

    On xfs-ha2, add the following line to /usr/lib/aliases:

    fsafe_admin:[email protected],[email protected] 

    The alias you choose, fsafe_admin in this case, is the value you will use for the mail destination address when you configure your system. In this example, operations is the user outside the cluster and admin_user is a user on each node.
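
    After editing the aliases file, rebuild the alias database so that the new alias takes effect (this assumes the standard sendmail newaliases command):

    # newaliases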

  11. If the nodes use NIS -- that is, yp has been set to on using chkconfig -- or the BIND domain name server (DNS), switching to local name resolution is recommended. Modify the /etc/nsswitch.conf file so that it reads as follows:

    hosts:                  files nis dns 


    Note: Exclusive use of NIS or DNS for IP address lookup for the nodes has been shown to reduce availability in situations where the NIS service becomes unreliable.


  12. If you are using FDDI, finish configuring and verifying the new FDDI station, as explained in the FDDIXPress release notes and the FDDIXPress Administration Guide.

  13. Reboot all nodes to put the new network configuration into effect.

Configure the Ring Reset Serial Port

When using a ring reset configuration, you must turn off the getty process for the tty ports to which the reset serial cables are connected. Perform the following steps on each node:

  1. Determine which port is used for the reset serial line.

  2. Open the file /etc/inittab for editing.

  3. Find the line for the port by looking at the comments on the right for the port number from step 1.

  4. Change the third field of this line to off. For example:

    t2:23:off:/sbin/getty -N ttyd2 co_9600          # port 2

  5. Save the file.

  6. Enter these commands to make the change take effect:

    # killall getty
    # init q
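
You can verify that a getty process is no longer running on the reset port (ttyd2 in the example from step 4):

# ps -ef | grep ttyd2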

Install Patches

The procedures in this section describe how to install a FailSafe patch. The patch should be installed on all nodes.

Installing FailSafe 2.x and a FailSafe Patch at the Same Time

When you install FailSafe 2.x images and an upgrade patch together, the cluster processes must be stopped and started on each node after patch installation. This is because the FailSafe 2.x installation automatically starts the cluster processes and the patch installation does not automatically stop them, so the cluster processes will continue to run the unpatched shared libraries unless you restart them.

Do the following on each node:

  1. Install FailSafe 2.x images on the node. This includes the following products:

    cluster_admin
    cluster_control
    cluster_services
    failsafe2
    sysadm_base
    sysadm_failsafe2

  2. Install the FailSafe 2.x patch.

  3. In a UNIX shell, stop all cluster processes on the node:

    # /etc/init.d/cluster stop

  4. Verify that the cluster processes (cad, cmond, crsd, and fs2d) have stopped:

    # ps -ef | egrep '(cad|cmond|crsd|fs2d)'

  5. Start cluster processes on the node:

    # /etc/init.d/cluster start

You are now ready to run the FailSafe Manager GUI or the cmgr command to set up a FailSafe cluster.

Installing a FailSafe Patch on an Existing FailSafe 2.x Cluster

Using these instructions, you can install a FailSafe patch on each FailSafe 2.x node in turn, without shutting down the entire cluster and without interrupting the HA services provided by the cluster.


Note: Before installing a FailSafe patch, you should read the patch's release notes. These release notes may contain special instructions that are not provided in this procedure.

To install a FailSafe patch on each node in your FailSafe cluster, follow these steps:

  1. If you have the FailSafe GUI client software installed on a machine that is not a node, first install the patch client subsystems on that machine. The GUI client software subsystems are as follows, where xxxxxxx is the patch number:

    patchSGxxxxxxx.sysadm_base_sw.client
    patchSGxxxxxxx.sysadm_failsafe2_sw.client
    patchSGxxxxxxx.sysadm_failsafe2_sw.desktop

  2. Choose a node on which to install the patch. Start up the FailSafe GUI or cmgr command on that node.

    For convenience, connect the GUI to a node that you are not upgrading.


    Note: If you connect to the node that you are upgrading, then in a later step (when you stop HA services), FailSafe will no longer report accurate status to the GUI; in another later step (when you stop cluster services), the GUI will lose its connection.


    Use the following cmgr command to specify a default node (later commands in this procedure assume the cluster name has already been set):

    cmgr> set cluster clustername

  3. (Optional) If you wish to keep all resource groups running on the node during installation, take the resource groups offline using the detach option (that is, detach the resource groups). If you do this, FailSafe will stop monitoring the resources, which will continue to run on the node, and will not have any control over the resource groups. Otherwise, in the next step, the resources should migrate to another node automatically, assuming the failover policy is defined that way.

    If you are using the GUI, run the Take Resource Group Offline task and check the Detach Only checkbox.

    If you are using cmgr, execute the following command:

    cmgr> admin offline_detach resource_group groupname

  4. Stop HA services on the node. (When HA services stop, FailSafe will no longer be able to report current cluster and node state if the FailSafe GUI is connected to that node. To monitor the cluster state during installation, connect the FailSafe GUI to the node that you are not upgrading.)

    If you are using the GUI, run the Stop FailSafe HA Services task, specifying the node you are patching in the One Node Only field.

    If you are using cmgr, execute the following command:

    cmgr> stop ha_services on node nodename

    If you skipped optional step 3, FailSafe will attempt to migrate all resource groups off that node, but this will fail if there are no other available nodes in the resource group's failover domain. If an error occurs, either complete step 3 or move the resource group to the other node:

    If you are using the GUI, run the Move Resource Group task, specifying the node you are not patching in the Failover Domain Node field.

    If you are using cmgr, execute the following command:

    cmgr> admin move resource_group groupname to node nodename

  5. In a UNIX shell on the node you are upgrading, stop all cluster processes:

    # /etc/init.d/cluster stop

    When you are using the GUI, if the connection-lost dialog box appears, click No. If you wish to continue using the GUI, restart the GUI, connecting to a node you are not patching.

  6. Verify that the cluster processes (cad, cmond, crsd, and fs2d) have stopped:

    # ps -ef | egrep '(cad|cmond|crsd|fs2d)'

  7. Use chkconfig to turn off the cluster flag:

    # chkconfig cluster off


    Note: You cannot use the failsafe2 flag to turn off the HA services on a node. You must use the GUI or cmgr commands to stop HA services; these commands can be run from any node in the pool. If necessary, you can use the force option. For more information, see “Stop FailSafe HA Services” in Chapter 6.


  8. Install the patch on the node.

  9. Use chkconfig to turn on the cluster flag:

    # chkconfig cluster on

  10. Start cluster processes on the node:

    # /etc/init.d/cluster start

  11. Start HA services on the node.

    If you are using the GUI and you are running the GUI in a Web browser, do the following:

    1. Exit your browser.

    2. Restart the Web server on the node you have just patched.

    3. Restart the GUI, connecting to the patched node.

    4. Run the Start FailSafe HA Services task, specifying the node that you just patched in the One Node Only field.

      If the GUI claims that FailSafe HA services are active on the cluster, then you are using an unpatched client; in this case, run the cmgr command instead, run the GUI on a patched client, or run the GUI in a Web browser from the patched node.

    If you are using cmgr, execute the following command:

    cmgr> start ha_services on node nodename

  12. Monitor the resource groups and verify that they come back online on the upgraded node. This may take several minutes, depending on the types and numbers of resources in the groups.

    If you are using the GUI, select View: Groups Owned by Nodes in the view area. Confirm that the resource group icons indicate online status.


    Note: When you restart HA services on the upgraded node, it can take several minutes for the node and cluster to return to normal active state.

    If you are using cmgr, execute the following command:

    cmgr> show status of resource_group groupname

Repeat the above process for the other nodes. If you are using the GUI, remember to reconnect to the node that you have just upgraded. After completing the process for all nodes, you can continue to monitor and administer your upgraded cluster, defining additional new nodes if desired.

Install Performance Co-Pilot Software

You can deploy Performance Co-Pilot for FailSafe as a collector agent or as a monitor client:

  • Collector agents are installed on collector hosts, which are the nodes in the FailSafe cluster itself from which you want to gather statistics. Typically, each node in a FailSafe cluster is designated as a collector host.

  • A monitor client is installed on the monitor host, which is typically a workstation that has a display and is running the IRIS Desktop.

Installing the Collector Host

To install Performance Co-Pilot for FailSafe on the designated collector hosts, the following software components must already be installed:

  • The pcp_eoe.sw subsystem from IRIX 6.5.11 or later

  • FailSafe 2.1 or later

  • Performance Co-Pilot 2.1 or later

A collector license (PCPCOL) must also be installed on each of these nodes.

After this software is installed, you must install the following subsystems of Performance Co-Pilot for FailSafe on each collector host. Table 4-2 lists the subsystems required for a collector host and their approximate sizes.

Table 4-2. Performance Co-Pilot for FailSafe Collector Subsystems

Subsystem                 Size in KB

pcp_fsafe.man.pages       40
pcp_fsafe.man.relnotes    32
pcp_fsafe.sw.collector    128


To install the required subsystems on a collector host, do the following:

  1. Mount the FailSafe CD-ROM by inserting it into an available drive. You can access a local CD-ROM drive or a remote CD-ROM drive of another host over the network.

  2. Log in as root.

  3. Start the inst command:

    # inst

  4. Specify the installation location:

    • If you are installing from the local CD-ROM drive, enter the following:

      Inst> from /CDROM/dist

    • If you are installing from a remote drive, enter the following, where host is the name of the host with the CD-ROM drive that contains a mounted FailSafe CD-ROM:

      Inst> from host:/CDROM/dist

  5. Select the default subsystems in the pcp_fsafe package. The default subsystems are provided for easy installation onto multiple collector hosts:

    Inst> install default

  6. Ensure that there are no conflicts:

    Inst> conflicts

  7. Install the software:

    Inst> go

  8. Change to the /var/pcp/pmdas/fsafe directory:

    # cd /var/pcp/pmdas/fsafe

  9. Run the Install utility, which installs the FailSafe performance metrics into the Performance Co-Pilot performance metrics namespace:

    # ./Install

  10. Choose an appropriate configuration for installation of the fsafe Performance Metrics Domain Agent (PMDA):

    • collector, which collects performance statistics on this system

    • monitor, which allows this system to monitor local and/or remote systems

    • both, which allows collector and monitor configuration for this system

    For example, to choose just the collector, enter the following:

    Please enter c(ollector) or m(onitor) or b(oth) [b] c

Removing Performance Metrics from a Collector Host

If you wish to remove Performance Co-Pilot for FailSafe from a collector host, you must remove the Performance Co-Pilot for FailSafe metrics from the performance metrics namespace of that host. You can do this before removing the pcp_fsafe subsystem by performing the following steps:

  1. Change to the /var/pcp/pmdas/fsafe directory:

    # cd /var/pcp/pmdas/fsafe

  2. Run the Remove utility:

    # ./Remove

Installing the Monitor Host

To install Performance Co-Pilot for FailSafe on a designated monitor host, the following software components must already be installed on the node:

  • The pcp_eoe.sw subsystem of IRIX 6.5.11 or later, including the subsystem pcp_eoe.sw.monitor

  • Performance Co-Pilot 2.1 or later, including the subsystem pcp.sw.monitor

The monitor license (PCPMON) must also be installed on the monitor host.

After this software is installed, install the subsystems of Performance Co-Pilot for FailSafe listed in Table 4-3 on the monitor host.

Table 4-3. Performance Co-Pilot for FailSafe Monitor Subsystems

Subsystem                 Size in KB

pcp_fsafe.man.pages       40
pcp_fsafe.man.relnotes    32
pcp_fsafe.sw.monitor      516


To install the required subsystems for Performance Co-Pilot for FailSafe on a monitor host, do the following:

  1. Mount the Performance Co-Pilot for FailSafe CD-ROM by inserting it into an available drive. You can access a local CD-ROM drive or a remote CD-ROM drive of another host over the network.

  2. Log in as root.

  3. Start inst:

    # inst

  4. Specify the installation location:

    • If you are installing from the local CD-ROM drive, enter the following:

      Inst> from /CDROM/dist

    • If you are installing from a remote drive, enter the following, where host is the name of the host with the CD-ROM drive that contains a mounted Performance Co-Pilot for FailSafe CD-ROM:

      Inst> from host:/CDROM/dist

  5. Select the required subsystems in the pcp_fsafe package for a monitor configuration:

    Inst> keep pcp_fsafe.sw.collector
    Inst> install pcp_fsafe.sw.monitor

  6. Ensure that there are no conflicts before you install Performance Co-Pilot for FailSafe:

    Inst> conflicts

  7. Install the software:

    Inst> go

Test the System

This section discusses the following ways of testing the system:

  • “Private Network Interface”

  • “Serial Reset Connection”

Private Network Interface

For each private network on each node in the pool, enter the following, where nodeIPaddress is the IP address of the node:

# /usr/etc/ping -c 3 nodeIPaddress

Typical ping output should appear, such as the following:

PING IPaddress (190.x.x.x): 56 data bytes
64 bytes from 190.x.x.x: icmp_seq=0 ttl=254 time=3 ms
64 bytes from 190.x.x.x: icmp_seq=1 ttl=254 time=2 ms
64 bytes from 190.x.x.x: icmp_seq=2 ttl=254 time=2 ms

If ping fails, follow these steps:

  1. Verify that the network interface was configured up using ifconfig; for example:

    # /usr/etc/ifconfig ec3
    ec3: flags=c63<UP,BROADCAST,NOTRAILERS,RUNNING,FILTMULTI,MULTICAST>
    inet 190.x.x.x netmask 0xffffff00 broadcast 190.x.x.x

    The UP in the first line of output indicates that the interface was configured up.

  2. Verify that the cables are correctly seated.

Repeat this procedure on each node.

Serial Reset Connection

To test the serial hardware reset connections, do the following:

  1. Ensure that the nodes and the serial multiplexer are powered on.

  2. Start the cmgr command on one of the nodes in the pool:

    # cmgr

  3. Stop HA services on each node:

    stop ha_services for cluster clustername

    For example:

    cmgr> stop ha_services for cluster fs6-8

    Wait until the node has successfully transitioned to inactive state and the FailSafe processes have exited. This process can take a few minutes.

  4. Test the serial connections by entering one of the following:

    • To test the whole cluster, enter the following:

      test serial in cluster clustername

      For example:

      cmgr> test serial in cluster fs6-8
      Status: Testing serial lines ...
      Status: Checking serial lines using crsd (cluster reset services) from node fs8
      Success: Serial ping command OK.
      
      Status: Checking serial lines using crsd (cluster reset services) from node fs6
      Success: Serial ping command OK.
      
      Status: Checking serial lines using crsd (cluster reset services) from node fs7
      Success: Serial ping command OK.
      
      Notice: overall exit status:success, tests failed:0, total tests executed:1

    • To test an individual node, enter the following:

      test serial in cluster clustername node machinename

      For example:

      cmgr> test serial in cluster fs6-8 node fs7
      Status: Testing serial lines ...
      Status: Checking serial lines using crsd (cluster reset services) from node fs6
      Success: Serial ping command OK.
      
      Notice: overall exit status:success, tests failed:0, total tests executed:1

    • To test an individual node using just a ping, enter the following:

      admin ping node nodename

      For example:

      cmgr> admin ping node fs7
      
      ping operation successful

  5. If a command fails, make sure all the cables are seated properly and rerun the command.

  6. Repeat the process on other nodes in the cluster.

Modifications Required for Connectivity Diagnostics

If you want to use the connectivity diagnostics provided with FailSafe, ensure that the /.rhosts file on each administration node allows all the nodes in the cluster to have access to each other in order to run remote commands such as rsh. The connectivity tests execute a ping command from the local node to all nodes and from all nodes to the local node. To execute ping on a remote node, FailSafe uses rsh (as user root). For example, suppose you have a cluster with three nodes: fs0, fs1, and fs2. The /.rhosts file on each administration node would be as follows (the prompt denotes the node name):

fs0# cat /.rhosts 
fs1 root
fs1-priv root
fs2 root
fs2-priv root
 
fs1# cat /.rhosts
fs0 root
fs0-priv root
fs2 root
fs2-priv root
 
fs2# cat /.rhosts
fs0 root
fs0-priv root
fs1 root
fs1-priv root

Make sure that the mode of the .rhosts file is set to 600 (read and write access for the owner only).
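
For example (a straightforward way to set and confirm the mode):

# chmod 600 /.rhosts
# ls -l /.rhosts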

After you have completed running the connectivity tests, you may wish to disable rsh on all cluster nodes.