Chapter 1. Overview of the IRIS FailSafe System

This chapter provides an overview of the components and operation of the IRIS FailSafe system. It contains these major sections:

  • “What Is High Availability?”

  • “What Is IRIS FailSafe?”

  • “Hardware Components of an IRIS FailSafe Cluster”

  • “IRIS FailSafe Software Architecture”

  • “High-Availability Resources”

  • “High-Availability Applications”

  • “Failover and Recovery Processes”

  • “Node States”

  • “Overview of Configuring and Testing a New IRIS FailSafe Cluster”

  • “Overview of Upgrading an IRIS FailSafe Cluster From Release 1.1 to Release 1.2”

If your IRIS FailSafe system is running Release 1.1 of IRIS FailSafe and you plan to upgrade it to Release 1.2, you can skip to the last major section of this chapter, “Overview of Upgrading an IRIS FailSafe Cluster From Release 1.1 to Release 1.2,” for information about how to upgrade the system to Release 1.2.

What Is High Availability?

In the world of mission-critical computing, the availability of information and computing resources is extremely important. The availability of a system is affected by how long it is unavailable after a failure in any of its components. Different degrees of availability are provided by different types of systems:

  • Fault-tolerant systems (continuous availability). These systems use redundant components and specialized logic to ensure continuous operation and to provide complete data integrity. On these systems the degree of availability is extremely high. Some of these systems can also tolerate outages due to hardware or software upgrades (continuous availability). This solution is very expensive and requires specialized hardware and software.

  • High-availability systems. These systems survive single points of failure by using redundant off-the-shelf components and specialized software. They provide a lower degree of availability than the fault-tolerant systems, but at much lower cost. Typically these systems provide high availability only for client/server applications, and base their redundancy on cluster architectures with shared resources.

The Silicon Graphics high-availability solution, IRIS FailSafe, is based on a two-node cluster. This provides redundancy of processors and I/O controllers. The redundancy of storage is obtained through the use of dual-hosted CHALLENGE RAID devices and plexed (mirrored) disks.

If one of the nodes in the cluster or one of the nodes' components fails, the second node restarts the high-availability services of the failed node. In the client/server paradigm, the client does not care which of the two nodes in the cluster is providing the service. The clients see only a brief interruption of the service.

High-availability services are monitored by the IRIS FailSafe software. During normal operation, if a failure is detected on any of these components, a failover process is initiated on the surviving node. This process consists of isolating the failed node (to ensure data consistency), doing any recovery required by the failed over services, and quickly restarting the services on the surviving node.

In a high-availability system, each node serves as backup for the other node. Unlike the backup resources in a fault-tolerant system, which serves purely as redundant hardware for backup in case of failure, the resources of each node in a high-availability system can be used during normal operation.

What Is IRIS FailSafe?

The Silicon Graphics® IRIS FailSafe product provides a general facility for high-availability services. These services fall into two groups: high-availability resources and high-availability applications. High-availability resources are network interfaces, XLV logical volumes, and XFS filesystems that have been configured for IRIS FailSafe. Optional IRIS FailSafe products are available for these high-availability applications: NFS, the Netscape FastTrack Server, the Netscape Enterprise Server, Sybase, INFORMIX, and Oracle.

The Silicon Graphics IRIS FailSafe system consists of two Origin2000, Origin200, Onyx2, CHALLENGE, POWER CHALLENGE, or Onyx® servers that provide high-availability services. In some cases, the servers need not be the same model. For example, an IRIS FailSafe cluster can consist of a CHALLENGE L server and a CHALLENGE S server or an Origin200 server and an Origin2000 server. A cluster cannot consist of a CHALLENGE server and an Origin server.


Note: In the remainder of this guide, except where noted otherwise, the name Origin implies Origin2000, Origin200, and Onyx2 servers and the name CHALLENGE implies CHALLENGE, POWER CHALLENGE, and Onyx servers.

Disks are shared by physically attaching them to both the nodes in the system. The two servers (called nodes throughout this guide) and disks, along with IRIS FailSafe software, form an IRIS FailSafe cluster.

While running high-availability services, the nodes can run other applications that are not high-availability services. All high-availability services are owned and accessed by one node at a time.

The IRIS FailSafe system supports fast failover: if a high-availability service—interface, disk, application, or the node itself—fails, the service quickly (although not instantaneously) resumes because the second node in the system shuts down the failed node and takes over all of its services. To clients, the services on the second node are indistinguishable from the original services before failure occurred. It appears as if the original node has crashed and rebooted quickly. The clients notice a brief interruption in the high-availability service.

Two configurations are possible:

  • All high-availability services run on one node. The other node is the backup node. After failover, the services run on the backup node. In this case, the backup node is a hot standby for failover purposes only. The backup node can run other applications that are not high-availability services.

  • High-availability services run concurrently on both nodes. For each service, the other node serves as a backup node. For example, both nodes can be exporting different NFS filesystems. If a failover occurs, one node then exports all of the NFS filesystems.

The base software for the IRIS FailSafe system consists of IRIX 6.2 or IRIX 6.4, IRIX patches, IRIS FailSafe software, FDDI software (if FDDI networking is used), and software for the optional CHALLENGE RAID storage system.
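
To check what is already installed on a node, you can list installed software with the IRIX versions command. The grep patterns below are only assumptions about how the FailSafe subsystems are named; see the release notes for the exact image and subsystem names for your release.

    # List installed products briefly and look for FailSafe-related subsystems
    # (the patterns "fail" and "ha_" are assumptions, not official names).
    versions -b | grep -i fail
    versions -b | grep ha_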

There are IRIS FailSafe software options for some high-availability applications. Optional software includes

  • IRIS FailSafe NFS

  • IRIS FailSafe Web (for Netscape servers)

  • IRIS FailSafe INFORMIX

  • IRIS FailSafe Sybase

  • IRIS FailSafe Oracle

IRIS FailSafe and the Silicon Graphics Oracle Parallel Server (OPS) can co-exist on a cluster. IRIS FailSafe enhances OPS by providing IP failover in an OPS hardware configuration. However, OPS instances are not failed over. Oracle instances that are not OPS instances can be failed over by using the IRIS FailSafe Oracle option. IRIS FailSafe and OPS are not merged administratively, so different tools are required to maintain a combined system.

IRIS FailSafe supports ATM LAN emulation failover when FORE Systems ATM cards are used with a FORE Systems switch.

IRIS FailSafe provides a framework for making applications into high-availability services. If you want to add high-availability applications on an IRIS FailSafe cluster, you must write scripts to handle monitoring and failover functions. In addition, you must add the new high-availability applications to the IRIS FailSafe configuration file /var/ha/ha.conf to register the application and scripts with the IRIS FailSafe software. Developing these scripts and making additions to the configuration file is described in the IRIS FailSafe Programmer's Guide.
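
As a rough sketch only, a monitoring script is typically a shell script that checks whether its application is healthy on the local node and reports the result through its exit status. The application name mydb_server and the exit-status convention shown here are assumptions made for illustration; the argument and return conventions that IRIS FailSafe actually requires are defined in the IRIS FailSafe Programmer's Guide.

    #!/bin/sh
    # Hypothetical local monitoring check for an application called "mydb".
    # Assumed convention: exit 0 if the application appears healthy,
    # exit 1 otherwise (see the Programmer's Guide for the real interface).
    if ps -ef | grep mydb_server | grep -v grep > /dev/null 2>&1
    then
        exit 0
    else
        exit 1
    fi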

Hardware Components of an IRIS FailSafe Cluster

Figure 1-1 shows the IRIS FailSafe hardware components.

Figure 1-1. IRIS FailSafe System Components


The hardware components of the IRIS FailSafe system are as follows:

  • two CHALLENGE nodes or two Origin nodes

  • one or more interfaces on each node to one or more public networks (Ethernet or FDDI for CHALLENGE nodes and Ethernet for Origin nodes)

    These public interfaces attached to each node connect the node to one or more public networks, which link the cluster to clients. Each public interface has an IP address called a fixed IP address that doesn't move to the other node in the cluster during failover. Each public interface can have additional IP addresses, called high-availability IP addresses, that are transferred to an interface on the surviving node in case of failover.

  • a serial line from a serial port on each node to the other's Remote System Control port (or to the Silicon Graphics remote power control unit used with CHALLENGE S nodes in a cluster)

    A surviving node uses this line to reboot the failed node during takeover. This procedure ensures that the failed node is not using the shared disks when the surviving node takes them over.

  • one interface on each node for the private network (Ethernet or FDDI on CHALLENGE nodes and Ethernet on Origin nodes)

    One Ethernet or FDDI interface on each node is required for the private heartbeat connection, by which each node monitors the state of the other node. The IRIS FailSafe software also uses this connection to pass control messages between nodes. These interfaces, called private interfaces, have distinct IP addresses that are kept private for security reasons.

  • disk storage and SCSI bus shared by the nodes in the cluster

    The nodes in the IRIS FailSafe system share dual-hosted disk storage over a shared fast and wide SCSI bus. The bus is shared so that either node can take over the disks in case of failure. The hardware required for the disk storage is either:

    • CHALLENGE Vault peripheral enclosures with SCSI disks (CHALLENGE and Origin2000 nodes only)

    • CHALLENGE RAID deskside or rackmount storage systems; each chassis assembly has two storage-control processors (SPs) and at least five disk modules with caching enabled (CHALLENGE or Origin nodes)


Note: The IRIS FailSafe system is designed to survive a single point of failure. Therefore, when a system component fails, it must be restarted, repaired, or replaced as soon as possible to avoid the possibility of two or more failed components.


IRIS FailSafe Software Architecture

IRIS FailSafe software includes a set of processes running on each node and communicating with each other. These processes use scripts for failover and recovery operations and for monitoring the high-availability services on each node. These processes and scripts read IRIS FailSafe configuration information from a configuration file /var/ha/ha.conf.

The IRIS FailSafe daemons and scripts are shown in Figure 1-2. For each daemon or set of scripts, the diagram shows other daemons and scripts it communicates with and the communication path it uses.

Figure 1-2. IRIS FailSafe Software Architecture


IRIS FailSafe daemons and scripts are described in the following subsections.

Heartbeat Daemon

The heartbeat daemon ha_hbeat runs on each node and is the first IRIS FailSafe process to start. It is polled by the application monitor on the other node. These heartbeat messages enable each node to determine the liveliness of the other node. Heartbeat messages are passed on the private network. If there is a failure of the private network, heartbeat messages can be passed on the public network.

Node Controller

The node controller process ha_nc determines each node's current state. The node states are described in the section “Node States” in this chapter.

The node controllers in the cluster pass messages to each other over the private network. If there is a failure of the private network, the node controllers don't use the public network.

Application Monitor

On each node, the application monitor process ha_appmon monitors all services on both nodes and reports any failures to the node controller.

The application monitor polls the heartbeat daemon on the other node to determine its liveliness. It also executes the failover scripts during state transitions.

Because the application monitor ha_appmon is a multi-threaded process, you may see several instances of ha_appmon simultaneously running on a node when you look at the output of the ps command.
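
For example, you can list the IRIS FailSafe processes running on a node with a command such as the following; the exact set of daemons you see depends on the node's state:

    # Show the FailSafe daemons (ha_hbeat, ha_nc, ha_appmon, ha_killd, ha_ifa).
    ps -ef | grep ha_ | grep -v grep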

Kill Daemon

The kill daemon ha_killd on each node monitors the serial connection to the other node and provides the power-cycling capability.

Monitoring Scripts

For each high-availability service, a monitoring script for that service periodically checks all instances of the service on the local node to verify that they are still running or available.

Interface Agent

The interface agent ha_ifa monitors all local interfaces to determine if they are still functioning.

The interface agent uses the number of input packets as the criterion for determining whether a network interface is working. If it finds that the number of input packets on an interface is not increasing, the interface agent injects packets into the public network. This prevents false failovers in networks that do not have any I/O activity.
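
You can inspect the same per-interface counters yourself with netstat; the input packet column (typically labeled Ipkts) is the one the interface agent watches. This is only a manual spot check, which the interface agent performs automatically.

    # Display packet counts for all configured interfaces,
    # including the number of input packets per interface.
    netstat -i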

Failover Scripts

Each high-availability service has a failover script that contains the commands that are executed when the cluster performs failover and recovery operations. These operations are called takeback, takeover, giveaway, and giveback. Examples of the tasks that failover scripts do are shutting down instances of an application, starting up instances of an application on their backup node, and unmounting and mounting filesystems.

High-Availability Resources

This section discusses the high-availability resources that are provided on an IRIS FailSafe system.

Nodes

If a node crashes or hangs (for example, due to a parity error or bus error), it will not respond to the heartbeat message sent by the application monitor on the other node. The other (good) node takes over the failed node's services after resetting the failed node.

If a node fails, the interfaces, access to storage, and services also become unavailable. See the succeeding sections for descriptions of how the IRIS FailSafe system handles or eliminates these points of failure.

Network Interfaces and IP Addresses

Clients access the high-availability services provided by the IRIS FailSafe cluster using IP addresses. Each high-availability service can use multiple IP addresses. The IP addresses are not tied to a particular high-availability service; they can be shared by all the high-availability services in the cluster.

IRIS FailSafe uses the IP aliasing mechanism to support multiple IP addresses on a single network interface. Clients can use a high-availability service that uses multiple IP addresses even when there is only one network interface in the server node.

The IP aliasing mechanism allows an IRIS FailSafe configuration that has a node with multiple network interfaces to be backed up by a node with a single network interface. IP addresses configured on multiple network interfaces are moved to the single interface on the other node in case of a failure.

IRIS FailSafe requires that each network interface in a cluster have an IP address that does not fail over. These IP addresses, called fixed IP addresses, are used to monitor network interfaces. Each fixed IP address must be configured on a network interface at system boot time. All other IP addresses in the cluster are configured as high-availability IP addresses.

High-availability IP addresses are configured on a network interface. During failover and recovery, IRIS FailSafe moves them to a network interface on the other node. High-availability IP addresses are specified in the IRIS FailSafe configuration file /var/ha/ha.conf. IRIS FailSafe uses the ifconfig command to configure an IP address on a network interface and to move IP addresses from one interface to another.
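
For example, configuring a high-availability IP address on an interface amounts to adding it as an alias. The interface name ec0, the address, and the netmask below are placeholders; IRIS FailSafe issues the equivalent commands for you according to ha.conf.

    # Add a high-availability IP address as an alias on interface ec0
    # (placeholder interface name, address, and netmask).
    ifconfig ec0 alias 192.0.2.10 netmask 0xffffff00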

In some networking implementations, IP addresses cannot be moved from one interface to another by using only the ifconfig command. IRIS FailSafe uses re-MACing (MAC address impersonation) to support these networking implementations. Re-MACing moves the physical (MAC) address of a network interface to another interface. It is done by using the macconfig command. Re-MACing is done in addition to the standard ifconfig process that IRIS FailSafe uses to move IP addresses.


Note: Re-MACing can be used only on Ethernet networks. It cannot be used on FDDI networks.

Re-MACing is required when packets called gratuitous ARP packets are not passed through the network. These packets are generated automatically when an IP address is added to an interface (as in a failover process). They announce a new mapping of an IP address to a MAC address, which tells clients on the local subnet that a particular interface now has a particular IP address. Clients then update their internal ARP caches with the new MAC address for the IP address that has just moved from one interface to the other. When gratuitous ARP packets are not passed through the network, the internal ARP caches of subnet clients cannot be updated. In these cases, re-MACing is used to move the MAC address of the original interface to the new interface. Thus, both the IP address and the MAC address are moved to the new interface, and the internal ARP caches of clients do not need updating.

Re-MACing is not done by default; you must specify that it be done for each pair of primary and secondary interfaces that requires it. A procedure in the section “Planning Network Interface and IP Address Configuration” in Chapter 2 describes how you can determine whether re-MACing is required. In general, routers and PC/NFS clients may require re-MACed interfaces.

A side effect of re-MACing is that the original MAC address of an interface that has received a new MAC address is no longer available for use. Because of this, each network interface has to be backed up by a dedicated backup interface. This backup interface cannot be used by clients as a primary interface. (After a failover to this interface, packets sent to the original MAC address are ignored by every node on the network.) Each backup interface backs up only one network interface.

Disks

The IRIS FailSafe system includes shared SCSI-based storage in the form of one or more CHALLENGE RAID storage systems (for CHALLENGE or Origin nodes) or CHALLENGE Vaults (for CHALLENGE or Origin2000 nodes only) with plexed disks. All data for high-availability applications must be stored in XLV logical volumes on shared disks. If high-availability applications use filesystems, XFS filesystems must be used.

For CHALLENGE RAID storage systems, if a disk or disk controller fails, the RAID storage system is equipped to keep services available through its own capabilities.

With plexed XLV logical volumes on the disks in a CHALLENGE Vault, the XLV system provides redundancy. No participation of the IRIS FailSafe system software is required for a disk failure. If a disk controller fails, the IRIS FailSafe system software initiates the failover process.

Figure 1-3 shows disk storage takeover. The surviving node takes over the shared disks and recovers the logical volumes and filesystems on the disks. This process is expedited by the XFS filesystem, which supports fast recovery because it uses journaling technology that does not require the use of the fsck command for filesystem consistency checking.

Figure 1-3. Disk Storage Failover
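
In outline, taking over shared storage resembles the following manual steps, which the IRIS FailSafe failover scripts perform automatically; the volume name and mount point are placeholders.

    # Scan the shared disks and activate the XLV logical volumes found there.
    xlv_assemble

    # Mount the XFS filesystem from a shared volume (placeholder names);
    # XFS journal recovery happens at mount time, so no fsck is needed.
    mount -t xfs /dev/xlv/shared1_vol /shared1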


High-Availability Applications

Each application has a primary node and backup node. The primary node is the node on which the application runs when FailSafe is in normal state. When a failure of any high-availability resources or high-availability application is detected by IRIS FailSafe software, all high-availability resources on the failed node are failed over to the other node and the high-availability applications on the failed node are stopped. When these operations are complete, the high-availability applications are started on the backup node.

All information about high-availability applications, including the primary node and backup node for the application and monitoring scripts, is specified in the IRIS FailSafe configuration file. Monitoring scripts detect the failure of a high-availability application.

IRIS FailSafe option products provide monitoring scripts and failover scripts that make NFS, Web, Oracle, INFORMIX, and Sybase applications high-availability services.

The IRIS FailSafe software provides a framework for making applications high-availability services. By writing scripts and modifying the IRIS FailSafe configuration file, you can turn client/server applications into high-availability applications. For information, see the IRIS FailSafe Programmer's Guide.

Failover and Recovery Processes

When a failure is detected on one node (the node has crashed, hung, or been shut down, or a high-availability service is no longer operating), the other node performs a failover of the high-availability services that are being provided on the node with the failure (called the failed node). Failover makes all of the high-availability services, previously provided by both nodes in a cluster, available on the surviving node in the cluster. This is called degraded state. (Node states are more fully described in the next section, “Node States.”)

A failure in a high-availability service can be detected by IRIS FailSafe processes running on either node. Depending on which node detects the failure, the sequence of actions following the failure is different.

If the failure is detected by the IRIS FailSafe software running on the same node, the failed node performs these operations:

  • stops all high-availability applications running on the node

  • moves all high-availability resources (IP addresses and shared disks) to the other node

  • sends a message to the other node (surviving node) to start providing all high-availability resources and applications previously provided by the failed node

  • moves to a state called standby state

When it receives the message, the surviving node performs these operations:

  • transfers ownership of all the high-availability resources from the failed node to itself

  • starts offering the high-availability resources of the failed node and the applications that were running on the failed node

  • moves to degraded state

If the failure is detected by the IRIS FailSafe software running on the other node, the node detecting the failure (the surviving node) performs these operations:

  • using the serial connection between the nodes, reboots the failed node to prevent corruption of data

  • transfers ownership of all the high-availability resources from the failed node to itself

  • starts offering the high-availability resources of the failed node and the applications that were running on the failed node

  • moves to degraded state

When a failed node is coming back up (called a recovering node), it determines if the other node is running. There are three possible scenarios:

  • If IRIS FailSafe is not running on the other node or if the private interfaces or private network are not functioning, the recovering node does not begin providing high-availability services. It goes into standby state.

  • If IRIS FailSafe is running on the other node and controlled failback is configured on for the node, the recovering node doesn't begin providing high-availability services; it goes to a state called controlled failback state.

  • If IRIS FailSafe is running on the other node and controlled failback is configured off for the node, the surviving node shuts down the high-availability services for which it is the backup node, and the recovering node begins providing the high-availability services for which it is the primary node.

Normally, a node that experiences a failure automatically reboots and resumes providing high-availability services. This scenario works well for transient errors (as well as for planned outages for equipment and software upgrades). However, if there are persistent errors, automatic reboot can cause recovery and an immediate failover again. To prevent this, the IRIS FailSafe software checks how long the rebooted node has been up since the last time it was started. If the interval is less than five minutes (by default), the IRIS FailSafe software automatically does a chkconfig failsafe off on the failed node and does not start up the IRIS FailSafe software on this node. It also writes error messages to the console and /var/adm/SYSLOG.
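
Once the underlying problem has been fixed, you can re-enable IRIS FailSafe on the repaired node. The chkconfig flag failsafe is the one mentioned above; the init script path shown is an assumption, and rebooting the node after turning the flag on accomplishes the same thing.

    # Re-enable IRIS FailSafe on the repaired node.
    chkconfig failsafe on

    # Start the IRIS FailSafe software (assumed init script path),
    # or simply reboot the node.
    /etc/init.d/failsafe start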

Node States

Each node that is running IRIS FailSafe software is in one of the seven states described in Table 1-1.

Table 1-1. Node States

  • standalone: The node is coming up and IRIS FailSafe is starting up. This is a transient state.

  • joining: The node is continuing the process of coming up and joining the cluster. The node should never remain in this state for more than two or three minutes.

  • normal: The node is actively providing its own high-availability services.

  • degraded: The node is providing all high-availability services for the cluster; the other node is unavailable.

  • standby: This node has stopped monitoring the other node in the cluster and is no longer providing high-availability services because a local failure has been detected or an administrative command has moved the node to this state. Also, if a node cannot move to normal state during the joining phase, it moves to this state.

  • controlled failback: This node is no longer providing high-availability services, but it is monitoring the other node in the cluster and the services it is providing.

  • error: An unrecoverable failure has occurred.

Figure 1-4 diagrams the node states and the events that govern them.

Figure 1-4. IRIS FailSafe Node States and Transitions


Table 1-2 shows the possible combinations of states for two nodes. When the state is listed as “(none),” it means that IRIS FailSafe software is not running or the node is shut down.

Table 1-2. Possible Combinations of Node States

  • One node standalone or joining, the other node standalone, joining, normal, degraded, standby, or (none): These are transient state combinations that occur immediately after one or both nodes have been rebooted.

  • Both nodes normal: Both nodes are operating normally, providing the high-availability services for which they are the primary node.

  • One node degraded, the other node standby: The node in degraded state is providing all high-availability services. The node in standby state is not providing any high-availability services and is not performing monitoring.

  • One node degraded, the other node controlled failback: The node in degraded state is providing all high-availability services. The node in controlled failback state is not providing any high-availability services, but is performing monitoring.

  • One node degraded, the other node (none): The node in degraded state is providing all high-availability services while the other node isn't running IRIS FailSafe software or is shut down.

  • One node standby, the other node (none): The node in standby state is running IRIS FailSafe software, but is not providing any high-availability services. The other node isn't running IRIS FailSafe software or is shut down.


Overview of Configuring and Testing a New IRIS FailSafe Cluster

After the IRIS FailSafe cluster hardware has been installed, follow this general procedure to configure and test the IRIS FailSafe system:

  1. Become familiar with IRIS FailSafe terms by reviewing this chapter and the Glossary at the end of this guide.

  2. Plan the configuration of high-availability applications and services on the cluster using Chapter 2, “Planning IRIS FailSafe Configuration.”

  3. Perform various administrative tasks, including the installation of prerequisite software, that are required by IRIS FailSafe. The instructions are in Chapter 3, “Configuring Nodes for IRIS FailSafe.”

  4. Prepare the IRIS FailSafe configuration file as explained in Chapter 4, “Creating the IRIS FailSafe Configuration File.”

  5. Test the IRIS FailSafe system in three phases: test individual components prior to starting IRIS FailSafe software, test normal operation of the IRIS FailSafe system, and simulate failures to test the operation of the system after a failure occurs. The instructions are in Chapter 5, “Testing IRIS FailSafe Configuration.”

Overview of Upgrading an IRIS FailSafe Cluster From Release 1.1 to Release 1.2

Follow these guidelines in upgrading an IRIS FailSafe cluster from IRIS FailSafe 1.1 to IRIS FailSafe 1.2:

  • There is no need to make modifications to the configuration file /var/ha/ha.conf unless you make other changes, such as adding additional high-availability resources or upgrading the node hardware (for example, from a CHALLENGE node to an Origin node).

  • A cluster that includes any Origin systems must use version 1.2 configuration files.

  • If you are upgrading the release of IRIS FailSafe only, follow the upgrade instructions in the section “Upgrade Procedure A” in Chapter 7.

  • If you are making other changes as well, follow the instructions in the section “Choosing the Correct Upgrade Procedure” in Chapter 7 to choose the appropriate upgrade procedure.

These minor changes to the configuration file /var/ha/ha.conf have been made in the IRIS FailSafe 1.2 release:

  • A new parameter sys-ctlr-type has been added to the node block. This parameter describes the type of system controller in the node. It is optional for CHALLENGE nodes and required for Origin nodes.

  • The value of the devname parameter in the volume block is no longer a full pathname. It is now just the volume name (the portion of the pathname following /dev/dsk/xlv or /dev/xlv).

  • The monitoring-level parameter in webserver blocks now takes an additional value, 3.

  • The value of the parameter version-minor (in the internal block) is now 2.
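
As a hedged illustration of how two of these changes might look in /var/ha/ha.conf, the fragment below shows a node block containing sys-ctlr-type and a volume block containing the new devname form. The block layout, the node and volume names, and the value MSC are assumptions drawn from typical configurations; only the parameter names sys-ctlr-type and devname come from the list above, and the elided lines (...) stand for the other parameters your template requires.

    node xfs-ha1
    {
            ...
            sys-ctlr-type = MSC
    }

    volume shared1_vol
    {
            ...
            devname = shared1_vol
    }

Note in particular that devname is now just the volume name, shared1_vol, rather than the full pathname /dev/xlv/shared1_vol used in Release 1.1 configuration files.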