This chapter provides an overview of the components and operation of the FailSafe system. It contains the following:
FailSafe provides a general facility for making services highly available. Such a system survives a single point of failure by using redundant components and FailSafe software to provide highly available services for a cluster that contains multiple nodes.
If one of the nodes in the cluster or one of the nodes' components fails, a different node in the cluster restarts the highly available services of the failed node. To clients, the services on the replacement node are indistinguishable from the original services before failure occurred. It appears as if the original node has crashed and rebooted quickly. The clients notice only a brief interruption in the highly available service.
In a FailSafe environment, nodes can serve as backup systems for other nodes. Unlike the backup resources in a fault-tolerant system, which serve purely as redundant hardware for backup in case of failure, the resources of each node in a highly available system can be used during normal operation to run other applications that are not necessarily highly available services. All highly available services are owned by one node in the cluster at a time.
Highly available services are monitored by the FailSafe software. If a failure is detected on any of these components, a failover process is initiated. Using FailSafe, you can define a failover policy to establish which node will take over the services under what conditions. This process consists of resetting the failed node (to ensure data consistency), performing recovery procedures required by the failed over services, and quickly restarting the services on the node that will take them over.
FailSafe also supports selective failover in which individual highly available applications can be failed over to a backup node independent of the other highly available applications on that node.
The FailSafe, CXFS, DMF, and TMF products are integrated to provide a complete storage solution.
CXFS, the clustered XFS filesystem, allows groups of computers to coherently share large amounts of data while maintaining high performance. You can use FailSafe to provide highly available services (such as NFS or Web) running on a CXFS filesystem. This combination provides high-performance shared data access for highly available applications.
Figure 1-1 shows an example configuration.
For more information, see the CXFS Administration Guide for SGI InfiniteStorage and the CXFS MultiOS Client-Only Guide for SGI InfiniteStorage.
The Data Migration Facility (DMF) is a hierarchical storage management system for SGI environments. Its primary purpose is to preserve the economic value of storage media and stored data. The high I/O bandwidth of these environments is sufficient to overrun online disk resources. Consequently, capacity scheduling, in the form of native file system migration, has become an integral part of many computing environments and is a requirement for effective use of SGI systems.
The FailSafe DMF plug-in enables DMF and its resources to be moved from one server to another when a FailSafe failover occurs. If the server that is running FailSafe DMF crashes, DMF fails over to another server along with its filesystems.
For more information, see the FailSafe DMF Administrator's Guide for SGI InfiniteStorage.
The Tape Management Facility (TMF) supports processing of labeled tapes, including multifile volumes and multivolume sets. These capabilities are most important to customers who run production tape operations where tape label recognition and tape security are requirements.
The FailSafe TMF plug-in enables TMF and its resources to be failed over from one server to another when a failure occurs.
For more information, see the FailSafe Version 2 TMF Administrator's Guide.
This section discusses the following:
This section defines the terminology necessary to configure and monitor highly available services with FailSafe.
A cluster is the set of systems (nodes) configured to work together as a single computing resource. A cluster is identified by a simple name and a cluster ID. Only one cluster may be formed from a given pool of nodes.
Disks or logical units (LUNs) are assigned to clusters by recording the name of the cluster on the disk or LUN. Thus, if any disk is accessible (via a Fibre Channel connection) from machines in multiple clusters, then those clusters must have unique names. When members of a cluster send messages to each other, they identify their cluster via the cluster ID.
You should choose unique names and cluster IDs for each of the clusters within your organization.
A node is an operating system (OS) image, usually an individual computer. (This use of the term node does not have the same meaning as a node in an SGI Origin 3000 or SGI 2000 system.) A given node can be a member of only one pool and therefore only one cluster.
The pool is the set of nodes from which a particular cluster may be formed. Only one cluster may be configured from a given pool, and it need not contain all of the available nodes. (Other pools may exist, but each is disjoint from the others; they share no node or cluster definitions.)
A pool is formed when you connect to a given node and define that node in the cluster database using the FailSafe graphical user interface (GUI) or cmgr command. You can then add other nodes to the pool by defining them while still connected to the first node, or to any other node that is already in the pool. (If you were to connect to another node and then define it, you would be creating a second pool).
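For illustration, a hypothetical cmgr session that builds a two-node pool might look like the following sketch. The node names, the cmgr path, and the prompting behavior shown here are examples only; Chapter 6 describes the exact dialog for your release.

```
# Hypothetical session; node names and prompts are illustrative only.
# Run cmgr in prompting mode while connected to the first node so that
# both definitions land in the same pool.
/usr/cluster/bin/cmgr -p

cmgr> define node N1
# ...cmgr prompts for the hostname, node ID, network interfaces, and
#    reset information; finish the dialog with "done"...

cmgr> define node N2
# ...same dialog for the second node; because you are still connected
#    to N1, node N2 joins N1's pool instead of creating a new one...

cmgr> show nodes in pool
```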
Figure 1-2 shows the concepts of pool and cluster.
The cluster database contains configuration information about resources, resource groups, failover policy, nodes, clusters, logging information, and configuration parameters.
The database consists of a collection of files; you can view and modify the contents of the database by using the FailSafe Manager GUI and the cmgr, cluster_status, and clconf_info commands.
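For example, the following commands display the database contents and cluster state from the command line; the paths shown are typical of FailSafe installations, so confirm them against your release.

```
# Paths are typical defaults; verify them against your installed release.
/usr/cluster/bin/clconf_info              # print cluster, node, and resource
                                          # group information from the database
/var/cluster/cmgr-scripts/cluster_status  # display cluster and resource status
```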
If you are running FailSafe in coexecution with CXFS, they share the same cluster database and command-line interface commands. However, each product has its own GUI.
There are the following types of membership:
FailSafe membership is the list of FailSafe nodes in the cluster on which FailSafe can bring resource groups online:
The potential FailSafe membership is the set of all FailSafe nodes that are defined in the cluster and on which HA services have been enabled. Nodes are enabled when HA services are started. The enabled status is stored in the cluster database; if an enabled node goes down, its status will remain enabled to indicate that it is supposed to be in the membership.
The actual membership consists of the eligible nodes whose state is known and that are communicating with other FailSafe nodes using multiple heartbeat and control networks. If the highest priority private network is unavailable, the FailSafe heartbeat will fail over to the next available heartbeat network defined for the node.
Stopping HA services on a node (deactivating the node) is equivalent to removing it from the FailSafe cluster; deactivated nodes are not included in FailSafe membership calculations.
Cluster database membership (also known as fs2d membership) is the group of nodes in the pool where the cluster database is replicated. The fs2d daemon is the cluster database daemon that maintains membership of nodes where the database is replicated and keeps the database synchronized across database transactions.
For more details about membership, see FailSafe Architecture for SGI InfiniteStorage.
With CXFS coexecution, there is also CXFS membership. For more information about CXFS, see “Coexecution of CXFS and FailSafe” in Chapter 2, and the CXFS Administration Guide for SGI InfiniteStorage.
The quorum is the number of nodes required to form a cluster, which differs according to membership:
For FailSafe membership: >50% (a majority) of the nodes in the cluster where highly available (HA) services were started must be in a known state (successfully reset or talking to each other using heartbeat and control networks) to form and maintain a cluster.
For cluster database membership: 50% (half) of the nodes in the pool must be available to the fs2d daemon (and can therefore receive cluster database updates) to form and maintain a cluster.
Figure 1-3 shows an example of FailSafe and cluster database memberships. The figure describes the following:
A pool consisting of six nodes, N1 through N6.
A cluster that has been defined to have four nodes, N1 to N4.
HA services have been started on four nodes, N1 to N4. (HA services can only be started on nodes that have been defined as part of the cluster; however, not all nodes within the cluster must have HA services started.)
Three nodes are up (N1 through N3) and three nodes are down (N4 through N6).
The cluster database membership consists of N1 through N3, three of six nodes in the pool (50%).
The FailSafe membership also consists of nodes N1 through N3, three of four nodes where HA services were started.
If a network partition results in a tied membership, in which there are two sets of nodes each consisting of 50% of the cluster, a node from the set containing the tiebreaker node will attempt to reset a node in the other set in order to maintain a quorum. For more information, see FailSafe Architecture for SGI InfiniteStorage.
A private network is one that is dedicated to cluster communication and is accessible by administrators but not by users.
The cluster software uses the private network to send the heartbeat/control messages necessary for the cluster configuration to function. If there are delays in receiving heartbeat messages, the cluster software may determine that a node is not responding and will therefore remove it from the FailSafe membership.
Using a private network limits the traffic on the public network and therefore will help avoid unnecessary resets or disconnects.
The messaging protocol does not prevent snooping (viewing) or spoofing (in which one machine on the network masquerades as another); therefore, a private network is safer than a public network.
Because the performance and security characteristics of a public network could cause problems in the cluster, and because heartbeat is very timing-dependent (even small variations can cause problems), SGI recommends a dedicated private network to which all nodes are attached and over which heartbeat/control messages are sent.
In addition, SGI recommends that all nodes be on the same local network segment.
Note: If there are any network issues on the private network, fix them before trying to use FailSafe.
If you are running FailSafe in coexecution with CXFS, they use the same private network. (A private network is recommended for FailSafe, but is required for CXFS.) You can use multiple private networks.
A resource is a single physical or logical entity that provides a service to clients or other resources. For example, a resource can be a single disk volume, a particular network address, or an application such as a Web server. A resource is generally available for use over time on two or more nodes in a cluster, although it can be allocated to only one node at any given time.
Resources are identified by a resource name and a resource type.
A resource type is a particular class of resource. All of the resources in a particular resource type can be handled in the same way for the purposes of failover. Every resource is an instance of exactly one resource type.
A resource type is identified by a simple name; this name must be unique within the cluster. A resource type can be defined for a specific node or it can be defined for an entire cluster. A resource type that is defined for a specific node overrides a clusterwide resource type definition with the same name; this allows an individual node to override global settings from a clusterwide resource type definition.
The FailSafe software includes many predefined resource types. If these types fit the application you want to make highly available, you can reuse them. If none fit, you can create additional resource types by using the instructions in the FailSafe Programmer's Guide for SGI InfiniteStorage.
A resource name identifies a specific instance of a resource type. A resource name must be unique for a given resource type.
A resource group is a collection of interdependent resources. A resource group is identified by a simple name; this name must be unique within a cluster. Table 1-1 shows an example of the resources and their corresponding resource types for a resource group named Webgroup.
Table 1-1. Example Webgroup Resource Group
| Resource    | Resource Type |
|-------------|---------------|
| 10.10.48.22 | IP_address    |
| /fs1        | filesystem    |
| vol1        | volume        |
| web1        | Netscape_web  |
If any individual resource in a resource group becomes unavailable for its intended use, then the entire resource group is considered unavailable. Therefore, a resource group is the unit of failover.
Resource groups cannot overlap; that is, two resource groups cannot contain the same resource.
One resource can be dependent on one or more other resources; if so, it will not be able to start (that is, be made available for use) unless the dependent resources are also started. Dependent resources must be part of the same resource group and are identified in a resource dependency list. Resource dependencies are verified when resources are added to a resource group, not when resources are defined.
Note: All interdependent resources must be added to the same resource group.
Like resources, a resource type can be dependent on one or more other resource types. If such a dependency exists, at least one instance of each of the dependent resource types must be defined. A resource type dependency list details the resource types upon which a resource type depends.
For example, a resource type named Netscape_web might have resource type dependencies on resource types named IP_address and volume. If a resource named WS1 is defined with the Netscape_web resource type, then the resource group containing WS1 must also contain at least one resource of the type IP_address and one resource of the type volume. This is shown in Figure 1-4.
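As a sketch only, assembling the Webgroup example from Table 1-1 with cmgr might look like the following. The cluster name webcluster and the failover policy fp_web are hypothetical, the individual resources must already be defined, and the exact cmgr dialog for your release is given in Chapter 6.

```
# Hypothetical cmgr sketch; the exact sub-prompts differ by release.
cmgr> define resource_group Webgroup in cluster webcluster
# cmgr then prompts for the failover policy and lets you add resources,
# conceptually:
#   set failover_policy to fp_web
#   add resource 10.10.48.22 of resource_type IP_address
#   add resource /fs1 of resource_type filesystem
#   add resource vol1 of resource_type volume
#   add resource web1 of resource_type Netscape_web
#   done
```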
A failover is the process of allocating a resource group (or application) to another node, according to a failover policy. A failover may be triggered by the failure of a resource, a change in the FailSafe membership (such as when a node fails or starts), or a manual request by the administrator.
A failover policy is the method used by FailSafe to determine the destination node of a failover. A failover policy consists of the following:
Failover domain
Failover attributes
Failover script
FailSafe uses the failover domain output from a failover script along with failover attributes to determine on which node a resource group should reside.
The administrator must configure a failover policy for each resource group. A failover policy name must be unique within the pool.
A failover domain is the ordered list of nodes on which a given resource group can be allocated. The nodes listed in the failover domain must be within the same cluster; however, the failover domain does not have to include every node in the cluster.
The administrator defines the initial failover domain when creating a failover policy. This list is transformed into a run-time failover domain by the failover script; FailSafe uses the run-time failover domain along with failover attributes and the FailSafe membership to determine the node on which a resource group should reside. FailSafe stores the run-time failover domain and uses it as input to the next failover script invocation. Depending on the run-time conditions and contents of the failover script, the initial and run-time failover domains may be identical.
In general, FailSafe allocates a given resource group to the first node listed in the run-time failover domain that is also in the FailSafe membership; the point at which this allocation takes place is affected by the failover attributes.
A failover attribute is a string that affects the allocation of a resource group in a cluster. The administrator must specify system attributes (such as Auto_Failback or Controlled_Failback), and can optionally supply site-specific attributes.
A failover script is a shell script that generates a run-time failover domain and returns it to the ha_fsd process. The ha_fsd process applies the failover attributes and then selects the first node in the returned failover domain that is also in the current FailSafe membership.
The following failover scripts are provided with the FailSafe release:
ordered, which never changes the initial failover domain. When using this script, the initial and run-time failover domains are equivalent.
round-robin, which selects the resource group owner in a round-robin (circular) fashion. This policy can be used for resource groups that can run on any node in the cluster.
If these scripts do not meet your needs, you can create a new failover script using the information provided in the FailSafe Programmer's Guide for SGI InfiniteStorage.
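To make the pieces of a failover policy concrete, a hypothetical cmgr definition that combines the ordered script, the Auto_Failback attribute, and a two-node initial domain might look like the following sketch. The prompt syntax is illustrative only; see “Define a Failover Policy with the GUI” in Chapter 6 for the authoritative procedure.

```
# Hypothetical cmgr sketch; syntax is illustrative only.
cmgr> define failover_policy fp_web
# cmgr prompts for the script, the attributes, and the initial failover
# domain, conceptually:
#   set script to ordered            # never changes the initial domain
#   set attribute to Auto_Failback   # system-defined failover attribute
#   set domain to N1 N2              # ordered list: N1 preferred, N2 backup
#   done
```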
The action scripts determine how a resource is started, monitored, and stopped. There must be a set of action scripts specified for each resource type.
Following is the complete set of action scripts that can be specified for each resource type:
The release includes action scripts for predefined resource types. If these scripts fit the resource type that you want to make highly available, you can reuse them by copying them and modifying them as needed. If none fits, you can create additional action scripts by using the instructions provided in the FailSafe Programmer's Guide for SGI InfiniteStorage.
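As a rough illustration only, the kind of check a monitor action performs for a hypothetical web server resource amounts to something like the following; this is not the FailSafe action script interface, which is defined in the FailSafe Programmer's Guide for SGI InfiniteStorage.

```
#!/bin/sh
# Sketch of a monitor-style check for a hypothetical web server resource.
# A real FailSafe action script must follow the interface and exit
# conventions in the FailSafe Programmer's Guide for SGI InfiniteStorage.

# Consider the resource healthy if an httpd process is running.
if ps -e | grep '[h]ttpd' > /dev/null; then
    exit 0    # resource is running
else
    exit 1    # resource is not running; FailSafe treats this as a failure
fi
```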
A cluster process group is a group of application instances in a distributed application that cooperate to provide a service. Each application instance can consist of one or more operating system processes and spans only one node.
For example, distributed lock manager instances in each node would form a process group. By forming a process group, they can obtain process membership and reliable, ordered, atomic communication services.
Note: There is no relationship between an operating system process group and a cluster process group.
A plug-in is the set of software required to make an application highly available, including a resource type and action scripts. There are plug-ins provided with the base FailSafe release, optional plug-ins available for purchase from SGI, and customized plug-ins you can write using the instructions in the FailSafe Programmer's Guide for SGI InfiniteStorage.
Figure 1-5 shows an example of FailSafe hardware components, in this case for a two-node system.
The hardware components are as follows:
Up to eight nodes: Origin 300 and Origin 3000 series, Onyx 300 and Onyx 3000 series, Origin 200, Onyx2 deskside, SGI 2000 series, and Onyx2.
Two or more interfaces on each node for control networks.
At least two Ethernet or FDDI interfaces on each node are required for the control network heartbeat connection, by which each node monitors the state of other nodes. The FailSafe software also uses this connection to pass control messages between nodes. These interfaces have distinct IP addresses.
A serial line from a serial port on each node to a Remote System Control port on another node.
A node that is taking over the services of a failed node uses this line to reboot the failed node during takeover. This procedure ensures that the failed node is not using the shared disks when the replacement node takes them over.
An optional Etherlite network-based serial multiplexer (EL-16) (FAILSAFE-N_NODE) hardware component to reset machines in a cluster.
Disk storage and SCSI bus/Fibre Channel shared by the nodes in the cluster.
The nodes in the FailSafe system share multi-hosted disk storage over a shared fast and wide SCSI bus or Fibre Channel. The storage connection is shared so that either node can take over the disks in case of failure. The hardware required for the disk storage can be one of the following:
For an exact list of storage supported, please contact SGI support.
In addition, FailSafe supports ATM LAN emulation failover when FORE Systems ATM cards are used with a FORE Systems switch.
A FailSafe system supports the following disk connections:
SCSI disks can be connected to two machines only. Fibre Channel disks can be connected to multiple machines.
FailSafe supports the following highly available configurations:
These configurations provide redundancy of processors and I/O controllers. Redundancy of storage is obtained through the use of multihosted RAID disk devices and plexed (mirrored) disks.
You can use the following reset models when configuring a FailSafe system:
In a basic two-node configuration, the following arrangements are possible:
All highly available services run on one node. The other node is the backup node. After failover, the services run on the backup node. In this case, the backup node is a hot standby for failover purposes only. The backup node can run other applications that are not highly available services.
Highly available services run concurrently on both nodes. For each service, the other node serves as a backup node. For example, both nodes can be exporting different NFS filesystems. If a failover occurs, one node then exports all of the NFS filesystems.
FailSafe provides the following features to increase the flexibility and ease of operation of a highly available system:
FailSafe allows you to perform a variety of administrative tasks while the system is running:
Monitor applications. You can turn monitoring of an application on and off while FailSafe continues to run. This allows you to perform online application upgrades without bringing down the FailSafe system.
Manage resources. You can add resources while the FailSafe system is online.
Upgrade FailSafe software. You can upgrade FailSafe software on one node at a time without taking down the entire FailSafe cluster.
The unit of failover is a resource group. This limits the impact of a component failure to the resource group to which that component belongs, and does not affect other resource groups or services on the same node. The process in which a specific resource group is failed over from one node to another node while other resource groups continue to run on the first node is called fine-grain failover.
FailSafe allows you to fail over a resource group onto the same node. This feature enables you to configure a single-node system, where backup for a particular application is provided on the same machine, if possible. It also enables you to indicate that a specified number of local restarts be attempted before the resource group fails over to a different node.
You can perform all FailSafe administrative tasks by means of the FailSafe Manager graphical user interface (GUI). The GUI provides a guided interface to configure, administer, and monitor a FailSafe-controlled highly available cluster. The GUI also provides screen-by-screen help text.
If you want, you can perform administrative tasks directly by using the cmgr command, which provides a command-line interface for the administration tasks.
For more information, see the following:
This section discusses the highly available resources in a FailSafe system:
FailSafe detects if a node crashes or hangs (for example, due to a parity error or bus error). A different node, determined by the failover policy, resets the failed node and takes over the failed node's services.
If a node fails, its interfaces, access to storage, and services also become unavailable. See the following sections for descriptions of how the FailSafe system handles or eliminates these points of failure.
Clients access the highly available services provided by the FailSafe cluster using IP addresses. Each highly available service can use multiple IP addresses. The IP addresses are not tied to a particular highly available service; they can be shared by all the resources in a resource group.
FailSafe uses the IP aliasing mechanism to support multiple IP addresses on a single network interface. Clients can use a highly available service that uses multiple IP addresses even when there is only one network interface in the server node.
The IP aliasing mechanism allows a FailSafe configuration that has a node with multiple network interfaces to be backed up by a node with a single network interface. IP addresses configured on multiple network interfaces are moved to the single interface on the other node in case of a failure.
Note: That is, the hostname is bound to a different IP address that never moves.
FailSafe requires that each network interface in a cluster have an IP address that does not fail over. These IP addresses, called fixed IP addresses, are used to monitor network interfaces. The fixed IP address would be the same address you would use if you configured it as a normal system and put it on the network before FailSafe was even installed.
Each fixed IP address must be configured to a network interface at system boot up time. All other IP addresses in the cluster are configured as highly available (HA) IP addresses.
Highly available IP addresses are configured on a network interface. During failover and recovery processes, FailSafe moves them to another network interface in the other node. Highly available IP addresses are specified when you configure the FailSafe system. FailSafe uses the ifconfig command to configure an IP address on a network interface and to move IP addresses from one interface to another.
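For example (the interface name ef0, the address, and the netmask are illustrative), bringing a highly available IP address up as an alias on an interface is conceptually equivalent to the following.

```
# Illustrative only: ef0, the address, and the netmask are examples.
# On the node taking over the HA IP address, FailSafe runs the
# equivalent of:
ifconfig ef0 alias 10.10.48.22 netmask 0xffffff00

# The corresponding alias is removed from the interface on the node
# relinquishing the address (the removal syntax depends on your release).
```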
In some networking implementations, IP addresses cannot be moved from one interface to another by using only the ifconfig command. FailSafe uses media access control (MAC) address impersonation (re-MACing) to support these networking implementations.
Re-MACing moves the physical MAC address of a network interface to another interface. This is done by using the macconfig command. Re-MACing is done in addition to the standard ifconfig process that FailSafe uses to move IP addresses. This requires two network connections into the public network for each MAC address. For each MAC address being moved, a dedicated backup network interface is required. To do re-MACing in FailSafe, a resource of type MAC_Address is used.
Note: Re-MACing can be used only on Ethernet networks. It is usually not required for TCP/IP networks.
Re-MACing is required when packets called gratuitous ARP packets are not passed through the network. These packets are generated automatically when an IP address is added to an interface (as in a failover process). They announce a new mapping of an IP address to a MAC address. This tells clients on the local subnet that a particular interface now has a particular IP address. Clients then update their internal ARP caches with the new MAC address for the IP address. (The IP address just moved from interface to interface.) When gratuitous ARP packets are not passed through the network, the internal ARP caches of subnet clients cannot be updated. In these cases, re-MACing is used. This moves the MAC address of the original interface to the new interface. Thus, both the IP address and the MAC address are moved to the new interface and the internal ARP caches of clients do not need updating.
Re-MACing is not done by default; you must specify that it be done for each pair of primary and secondary interfaces that requires it. (See “Determining if Re-MACing is Required” in Chapter 2.) In general, routers and PC/NFS clients may require re-MACing interfaces.
A side effect of re-MACing is that the original MAC address of an interface that has received a new MAC address is no longer available for use. Because of this, each network interface has to be backed up by a dedicated backup interface. This backup interface cannot be used by clients as a primary interface. (After a failover to this interface, packets sent to the original MAC address are ignored by every node on the network.) Each backup interface backs up only one network interface.
FailSafe supports storage based on SCSI or Fibre Channel.
Plexing must be used to mirror disks in a JBOD configuration. If highly available applications use filesystems, XFS filesystems or CXFS filesystems must be used. When CXFS filesystems are used, they must be on XVM volumes.
Note: Neither SCSI storage nor Fibre Channel JBOD is supported in a storage area network (SAN) configuration and therefore cannot be used with CXFS.
The storage components should not have a single point of failure. All data should be in a RAID or should be mirrored. It is recommended that there be at least two paths from storage to the servers for redundancy.
For Fibre Channel RAID storage systems, if a disk or disk controller fails, the RAID storage system is equipped to keep services available through its own capabilities.
For all the above storage systems, if a disk or disk controller fails, either XLV or XVM will keep the service available through a redundant path as appropriate.
If no alternate paths are available to the storage subsystems, then FailSafe will initiate a failover process.
Figure 1-8 shows an example of disk storage failover in a two-node cluster.
Each application has a primary node and up to seven additional nodes that you can use as a backup node, according to the failover policy you define. The primary node is the node on which the application runs when FailSafe is in a normal state. When a failure of any highly available application is detected by FailSafe software, all resources in the affected resource group on the failed node are failed over to a different node and the resources on the failed node are stopped. When these operations are complete, the resources are started on the backup node.
All information about resources, including the primary node, components of the resource group, and failover policy is specified when you configure your FailSafe system with the GUI or with the cmgr command. Information on configuring the system is provided in Chapter 6, “Configuration”. Monitoring scripts detect the failure of a resource.
The FailSafe software provides a framework for making applications highly available services. By writing scripts and configuring the system in accordance with those scripts, you can turn client/server applications into highly available applications. For information, see the FailSafe Programmer's Guide for SGI InfiniteStorage.
A failure occurs when a node has crashed, hung, or been shut down, or when a highly available service is no longer operating. The node with the failure is called the failed node. A different node performs a failover of the highly available services that were being provided on the failed node. Failover allows all of the highly available services, including those provided by the failed node, to remain available within the cluster.
Depending on which node detects the failure, the sequence of actions following the failure is different.
If the failure is detected by the FailSafe software running on the same node, the failed node performs the following operations:
Stops the highly available resource group running on the failed node
Moves the highly available resource group to a different node, according to the defined failover policy for the resource group
Asks the new node to start providing all resource group services previously provided by the failed node
When it receives the message, the node that is taking over the resource group performs the following operations:
Transfers ownership of the resource group from the failed node to itself
Starts offering the resource group services that were running on the failed node
If the failure is detected by FailSafe software running on a different node, the node detecting the failure performs these operations:
Power-cycles the failed node (to prevent corruption of data) by using the serial hardware reset connection between the nodes
Transfers ownership of the resource group from the failed node to the other nodes in the cluster, based on the resource group failover policy
Starts offering the resource group services that were running on the failed node
When a failed node comes back up, whether or not the node automatically starts to provide highly available services again depends on the failover policy that you define.
For more information, see “Define a Failover Policy with the GUI” in Chapter 6.
Normally, a node that experiences a failure automatically reboots and resumes providing highly available services. This scenario works well for transient errors (as well as for planned outages for equipment and software upgrades).
For further information on FailSafe execution during startup and failover, see FailSafe Architecture for SGI InfiniteStorage.
After the FailSafe cluster hardware has been installed, use the following general procedure to configure and test the FailSafe system:
Become familiar with FailSafe terms by reviewing this chapter.
Plan the configuration of highly available applications and services on the cluster using Chapter 2, “Configuration Planning”.
Perform various administrative tasks, including the installation of prerequisite software, that are required by FailSafe, as described in Chapter 4, “FailSafe Installation and System Preparation”.
Define the configuration as explained in Chapter 6, “Configuration”.
Test the system. See “Testing the Configuration” in Chapter 3, and Chapter 9, “Testing the Configuration”.
As of IRIS FailSafe 2.1.6, FailSafe supports a cluster containing nodes running n-2 releases. For example, you can have nodes running FailSafe 2.1.6 (n), 2.1.5, and 2.1.4 in the same cluster. This policy lets you keep your cluster running and applications available during the upgrade process.
Each IRIS FailSafe release is paired with a given even-numbered IRIX release, and will also support the following odd-numbered release. For example, IRIS FailSafe 2.1.6 supports IRIX 6.5.22 and IRIX 6.5.23.