Chapter 1. Introduction

SGI FailSafe provides highly available services for as many as eight nodes in a cluster. These services are monitored by the FailSafe software. You can create additional highly available services by using the instructions in this guide to write your own plug-in, the set of scripts required to turn an application into a highly available service in conjunction with FailSafe software.

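This chapter contains the following:

  • Plug-ins

  • Characteristics that Permit an Application to be Highly Available

  • Overview of the Programming Steps

  • Administrative Commands for Use in Scripts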

For information about FailSafe terminology and the components, software layers, communication paths, and order of execution of action and failover scripts, see the FailSafe Administrator's Guide for SGI InfiniteStorage.

Plug-ins

A plug-in is the set of software required to make an application highly available, including a resource type and action scripts. There are plug-ins provided with the base FailSafe release, optional plug-ins available for purchase from SGI, and customized plug-ins you can write using the instructions in this guide.

The following tables show the provided and optional FailSafe plug-ins and their associated resource types.

Table 1-1. Provided Plug-Ins

Provided Plug-In      Resource Type
--------------------  -------------
CXFS file system      CXFS
IP addresses          IP_address
MAC addresses         MAC_address
XFS filesystems       filesystem
XLV logical volumes   volume
XVM volume manager    XVM


Table 1-2. Optional Plug-Ins

Optional Plug-In                  Resource Type
--------------------------------  -----------------------
IRIS FailSafe for DMF             DMF
IRIS FailSafe for NFS             NFS and statd_unlimited
IRIS FailSafe for Informix        INFORMIX_DB
IRIS FailSafe for Oracle          Oracle_DB
IRIS FailSafe for Samba           Samba
IRIS FailSafe for TMF             TMF
IRIS FailSafe for Web (Netscape)  Netscape_web

See the release notes for information about the specific releases that are supported.

If you want to create your own plug-in, or change the functionality of the provided failover scripts and action scripts by writing new scripts, you will use the instructions in this guide.


Note: If you require a customized plug-in but do not want to write it yourself, you can establish a contract with the Silicon Graphics Professional Services group to create customized scripts. See http://www.sgi.com/services/index.html.


Characteristics that Permit an Application to be Highly Available

The characteristics of an application that can be made highly available are as follows:

  • The application can be easily restarted and monitored. It should be able to recover from failures as does most client/server software. The failure could be a hardware failure, an operating system failure, or an application failure. If a node crashes and reboots, client/server software should be able to attach again automatically.

  • The application must have a start and stop procedure. When the application fails over, the instances of the application are stopped on one node using the stop procedure and restarted on the other node using the start procedure.

    Avoid applications that are started as a daemon from /etc/inetd.conf because typically everything in /etc/inetd.conf is already running. Trying to automatically edit /etc/inetd.conf could cause errors for other daemons started by this file.

    Many applications will have a start and stop procedure that belongs in the /etc/init.d directory. You can incorporate them into a custom /var/ha/resources script to appropriately start and stop the application. If the application also has a chkconfig flag, set it to off. The chkconfig flag should be set to on in the /var/ha/resources start script.

  • The application does not depend on the hostname or any identifier that is specific to a node.

  • The application can be moved from one node to another after failures.

    If the resource has failed, it must still be possible to run the resource stop procedure. In addition, the resource must recover from the failed state when the resource start procedure is executed on another node.

    Ensure that there is no affinity for a specific node.

  • The application does not depend on knowing the primary host name (as returned by hostname); that is, it can be configured to work with an IP address instead.

  • Other resources on which the application depends can be made highly available. If they are not provided by FailSafe and its optional products (see “Plug-ins”), you must make these resources highly available, using the information in this guide.


    Note: An application itself is not modified to make it highly available.
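The start and stop requirement above can be sketched as a small wrapper around an application's existing /etc/init.d script. This is only an illustration, not a FailSafe action script: the application name myapp, the INIT_SCRIPT variable, and the app_action function are hypothetical, and a real script would follow the templates described in Chapter 2.

```shell
#!/bin/sh
# Hypothetical wrapper: start or stop an application through its init script.
# INIT_SCRIPT and "myapp" are illustrative names, not part of FailSafe.
INIT_SCRIPT="${INIT_SCRIPT:-/etc/init.d/myapp}"

# app_action start|stop
# Runs the application's own procedure and returns its exit status.
app_action() {
    action="$1"
    case "$action" in
        start|stop) ;;
        *)
            echo "usage: app_action start|stop" >&2
            return 2
            ;;
    esac

    if [ ! -x "$INIT_SCRIPT" ]; then
        echo "init script $INIT_SCRIPT not found or not executable" >&2
        return 1
    fi

    # The application is not modified; only its existing procedure is called.
    "$INIT_SCRIPT" "$action"
}
```

Calling app_action stop on one node and app_action start on another mirrors what happens during failover. Because the wrapper only passes the exit status through, the init script itself must tolerate being stopped after a failure.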


Overview of the Programming Steps

To make an application highly available, follow these steps:

  1. Understand the application and determine the following:

    • The configuration required for the application, such as user names, permissions, data location (volumes), and so on. For more information about configuration, see the FailSafe Administrator's Guide for SGI InfiniteStorage.

    • The other resources on which the application depends. All interdependent resources must be part of the same resource group.

    • The resource type that best suits this application.

    • The number of instances of the resource type that will constitute the application. (Each instance of a given application, or resource type, is a separate resource.) For example, a web server may depend upon two filesystem resources.

    • The commands and arguments required to start, stop, and monitor this application (that is, the resources in the resource group).

    • The order in which all resources in the resource group must be started and stopped.

  2. Determine whether existing action scripts can be reused. If they cannot, write a new set of action scripts, using existing scripts and the templates in /var/cluster/ha/resource_types/template as a guide. See Chapter 2, “Writing the Action Scripts and Adding Monitoring Agents”.

  3. Determine whether the existing ordered or round-robin failover scripts can be reused for the resource group. If they cannot, write a new failover script. See Chapter 3, “Creating a Failover Policy”.

  4. Determine whether an existing resource type can be reused. If none applies, create a new resource type or modify an existing resource. See Chapter 4, “Defining a New Resource Type”.

  5. Configure the following in the cluster database (for more information, see the FailSafe Administrator's Guide for SGI InfiniteStorage):

    • Resource group

    • Resource type

    • Failover policy

  6. Test the action scripts and failover script. See Chapter 5, “Testing Scripts”, and “Debugging Notes” in Chapter 5.


    Note: Do not modify the scripts included with the FailSafe release. New or customized scripts must have different names from the files included with the release.
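Step 2 and the note above both point toward starting from the templates rather than editing released scripts. The following is a minimal sketch of that workflow, assuming the templates can simply be copied to a new location. The function name copy_template and the TEMPLATE_DIR variable are hypothetical, and the choice of destination directory is an assumption rather than a documented requirement.

```shell
#!/bin/sh
# Directory named in this guide that holds the template action scripts.
TEMPLATE_DIR="${TEMPLATE_DIR:-/var/cluster/ha/resource_types/template}"

# copy_template DEST_DIR
# Copies the templates into DEST_DIR and refuses to overwrite anything that
# already exists, so scripts shipped with the release are never modified.
copy_template() {
    dest="$1"
    if [ -z "$dest" ]; then
        echo "usage: copy_template DEST_DIR" >&2
        return 2
    fi
    if [ ! -d "$TEMPLATE_DIR" ]; then
        echo "template directory $TEMPLATE_DIR not found" >&2
        return 1
    fi
    if [ -e "$dest" ]; then
        echo "$dest already exists; choose a different name" >&2
        return 1
    fi
    cp -R "$TEMPLATE_DIR" "$dest"
}
```

For example, copy_template /var/cluster/ha/resource_types/myapp (a hypothetical resource type name) would give you a private copy to edit, and a second call with the same name would fail instead of overwriting your work.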


Administrative Commands for Use in Scripts

Table 1-3 shows the administrative commands available with FailSafe for use in scripts.

Table 1-3. FailSafe Administrative Commands for Use in Scripts

Command          Purpose
---------------  -------
ha_cilog         Logs messages to the script_nodename log files.
ha_execute_lock  Executes a command with a lock file that allows command execution to be serialized. The lock file prevents multiple instances of the same command from executing at the same time on a single node.
ha_exec2         Executes a command and retries the command on failure or timeout.
ha_filelock      Locks a file.
ha_fileunlock    Unlocks a file.
ha_ifdadmin      Communicates with the ha_ifd network interface agent daemon.
ha_http_ping2    Checks if a web server is running.
ha_macconfig2    Displays or modifies MAC addresses of a network interface.
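To make the serialization and retry behavior described for ha_execute_lock and ha_exec2 more concrete, the following sketch reimplements the two ideas in plain shell. It is not the FailSafe implementation and does not reproduce the actual arguments of those commands. The function names run_locked and run_with_retry, their parameters, and the exit status 75 are hypothetical, and the retry sketch covers failures only, not timeouts.

```shell
#!/bin/sh
# run_locked LOCKDIR COMMAND [ARGS...]
# Runs COMMAND only if LOCKDIR can be created. mkdir either creates the
# directory or fails, so at most one caller on a node holds the lock.
run_locked() {
    lockdir="$1"
    shift
    if ! mkdir "$lockdir" 2>/dev/null; then
        echo "lock $lockdir is held by another instance" >&2
        return 75
    fi
    status=0
    "$@" || status=$?
    # Release the lock whether or not the command succeeded.
    rmdir "$lockdir"
    return "$status"
}

# run_with_retry ATTEMPTS COMMAND [ARGS...]
# Runs COMMAND up to ATTEMPTS times and stops at the first success.
run_with_retry() {
    attempts="$1"
    shift
    try=1
    while [ "$try" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        try=$((try + 1))
    done
    return 1
}
```

The two helpers compose: run_with_retry 3 run_locked /tmp/myapp.lock some_command would retry a command that could not obtain the lock. In real action scripts, prefer the supported commands in Table 1-3 over hand-written equivalents like these.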


Caution: Do not use the scripts in /usr/sysadm/privbin. These are internal commands that have a different command line parameter scheme. The functionality of these commands may change in the future. These commands are not documented.