Chapter 2. Configuring FailSafe TMF

This chapter describes how to configure the FailSafe TMF plug-in.

The procedures described in this chapter assume that a cluster database that does not include TMF has already been created, installed, and tested as described in the FailSafe Administrator's Guide for SGI InfiniteStorage.

Verifying that TMF is Enabled

To run FailSafe TMF, the TMF software must be enabled. You should ensure that the output from chkconfig shows the following flag set to on:

# chkconfig | grep tmf
...
        tmf             on

If it is not, set it to on. For example:

# chkconfig tmf on
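
The check above can also be scripted. The following is a minimal sketch, assuming the chkconfig listing has the flag name in the first column and its state in the second, as in the output shown above. The function name tmf_flag_is_on is hypothetical and is not part of FailSafe or TMF.

```shell
# Sketch: succeed (exit status 0) only when the listing on standard input
# contains a line whose first field is "tmf" and whose second field is "on".
tmf_flag_is_on() {
    awk '$1 == "tmf" && $2 == "on" { found = 1 }
         END { exit(found ? 0 : 1) }'
}

# Intended use on a cluster node (not run here):
#   chkconfig | tmf_flag_is_on || chkconfig tmf on
```

Matching on whole fields rather than using a plain grep avoids a false match on another flag whose name merely contains the string tmf.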

Creating a TMF Resource Type

This subsection assumes that you are already familiar with the concept of resource types. Table 2-1 shows the resource attributes of a TMF resource type. “Configuring a TMF Resource” describes how to use these parameters when configuring a TMF resource.

Table 2-1. TMF Configuration Parameters and Attributes

Parameter/Attribute    Comments
resource-name          The name of the TMF resource; for example, egft.
device-group           The device group to be monitored. It must be a device group that is defined in the TMF configuration file tmf.config.
devices-minimum        The minimum number of devices in the specified device-group that must be available on a node.
devices-loaned         Currently unused; should be left at its default value.
email-addresses        The list of addresses to which email is sent when the monitor script detects that devices in the device-group have become unavailable. This may be a comma- or white-space-separated list of names.

The TMF resource type is not created at cluster creation time. You must create the resource type before a TMF resource is created. The TMF resource type must be installed if you want to add a TMF resource to a cluster that was created before the FailSafe TMF software was installed.

You can use one of the following methods to create the TMF resource type:

  • Run cmgr and manually create the resource type. For more information on cmgr, see the FailSafe Administrator's Guide for SGI InfiniteStorage.

  • Run cmgr and install the resource type, as follows:

    cmgr> install resource_type TMF in cluster eagan
    cmgr> show resource_types installed
    
    TMF
    NFS
    template
    Netscape_web
    statd_unlimited
    Oracle_DB
    MAC_address
    IP_address
    INFORMIX_DB
    filesystem
    volume

  • Use the template scripts supplied with FailSafe located in /var/cluster/cmgr-template/cmgr-create-resource_type.

  • Execute /var/cluster/ha/resource_types/TMF/create_resource_type, passing the path of the cluster database and the cluster name as arguments.

  • Run the FailSafe Manager GUI and use the Load Resource Type task to load the resource type. For more information on the FailSafe GUI, see the FailSafe Administrator's Guide for SGI InfiniteStorage.

Configuring a TMF Resource

The FailSafe TMF plug-in performs various functions for a TMF resource, as summarized in “FailSafe TMF Plug-In” in Chapter 1. This section describes how to configure a resource to perform each of these functions.

FailSafe TMF Configuration Parameters

Table 2-1 summarizes the FailSafe TMF plug-in configuration parameters.

The FailSafe TMF plug-in lets you specify device groups to monitor. You specify a device group through the resource attribute device-group; it applies to the particular resource that is being defined. A device group is the set of tape devices assigned to a named group in the TMF configuration file. This attribute is required for each resource that you create.

When you create a TMF resource, you must specify the minimum number of devices of a particular device group that must be configured and available for use. This value is specified as the resource attribute devices-minimum and is required for each resource. The default value for devices-minimum is 0, which means that the TMF plug-in takes no action even if no tapes are available. Many sites would not want FailSafe to take action when tapes are unavailable, because neither failover nor local restart will help in that situation.

When you create a TMF resource, you must specify a list of email addresses to notify when the monitoring scripts detect that devices in the device group have become unavailable. Specify this list through the resource attribute email-addresses as a comma- or white-space-separated list of names.
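
To illustrate what a comma- or white-space-separated list means in practice, the following sketch splits such a list into one address per line. It is not taken from the plug-in's scripts. The function name split_addresses and the example.com addresses are hypothetical.

```shell
# Sketch: print one address per line from a comma- or white-space-separated
# list passed as the first argument. Empty items are dropped.
split_addresses() {
    printf '%s\n' "$1" | tr ',\t ' '\n\n\n' | sed '/^$/d'
}

split_addresses 'admin@example.com, oncall@example.com   ops@example.com'
```

Commas, tabs, and blanks are all treated as separators, so mixed lists such as the one above yield the same three addresses.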

A TMF resource includes the resource attribute devices-loaned. This attribute is currently unused by the FailSafe TMF plug-in and should be left at its default assigned value.

Optional Configuration Specifications

There are other optional configuration specifications associated with this resource. They give the FailSafe TMF plug-in the information it requires to communicate with the tape library, and they tell the plug-in on which drives within the library it will force dismounts.

The FailSafe TMF plug-in can force a dismount of tapes from drives within the library. There may be various reasons why you might want to do this when a failover occurs. In the case of the data migration facility (DMF), you would want to ensure that any DMF tapes that were in use on a previous host are available to DMF on the new node after a failover. If these tapes were in drives assigned to the previous host, they must be ejected and returned to the library so that they are again accessible to DMF on the new host. You may want the FailSafe TMF plug-in to dismount only tape devices associated with a particular resource or you may not want the plug-in to dismount any tapes at all.

If you are using the tpsc tape driver, then in order for the plug-in to be able to force a dismount of tapes, the capabilities list specified for the device in the /var/sysgen/master.d/scsi file must not include the MTCAN_PREV capability.

The following example shows entries from this file for the STK 9840 and STK 9940 drives. The description for the 9840 drive does not include the MTCAN_PREV capability, but the description for the 9940 drive does include it.

/* STK 9840 drive */
{ STK9840, TPSTK9840, 3, 4, "STK", "9840", 0, 0, {0, 0, 0, 0},
  MTCAN_BSF | MTCAN_BSR | MTCANT_RET | MTCAN_CHKRDY | MTCAN_SPEOD |
  MTCAN_SEEK | MTCAN_APPEND | MTCAN_SILI | MTCAN_VAR | MTCAN_SETSZ |
  MTCAN_CHTYPEANY | MTCAN_COMPRESS,
  20, 8*60, 10*60, 3*60, 3*60, 16384, 256*1024,
  tpsc_default_dens_count, tpsc_default_hwg_dens_names,
  tpsc_default_alias_dens_names,
  {0}, 0, 0, 0,
  0, (u_char *)0 },

/* STK 9940 drive */
{ STK9840, TPSTK9840, 3, 4, "STK", "T9940A", 0, 0, {0, 0, 0, 0},
  MTCAN_BSF | MTCAN_BSR | MTCANT_RET | MTCAN_CHKRDY | MTCAN_PREV |
  MTCAN_SPEOD |
  MTCAN_SEEK | MTCAN_APPEND | MTCAN_SILI | MTCAN_VAR | MTCAN_SETSZ |
  MTCAN_CHTYPEANY | MTCAN_COMPRESS,
  20, 8*60, 10*60, 3*60, 3*60, 16384, 256*1024,
  tpsc_default_dens_count, tpsc_default_hwg_dens_names,
  tpsc_default_alias_dens_names,
  {0}, 0, 0, 0,
  0, (u_char *)0 },

If the device does have this capability specified and you are using the tpsc tape driver, you must do the following before the plug-in can force the dismount of tapes from devices of that type:

  1. Remove the MTCAN_PREV capability from the device's capabilities list.

  2. Perform an autoconfig.

  3. Reboot.
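
Before editing the file, it can help to confirm whether a given device entry actually lists the capability. The following is a minimal sketch that reads one device entry on standard input and reports whether MTCAN_PREV appears in it. The function name has_mtcan_prev is hypothetical, and the sketch assumes you have copied the single entry of interest out of /var/sysgen/master.d/scsi yourself.

```shell
# Sketch: succeed only if the text on standard input contains the identifier
# MTCAN_PREV as a whole word, so a longer name that merely contains it does
# not count as a match.
has_mtcan_prev() {
    awk '{
             n = split($0, word, /[^A-Za-z0-9_]+/)
             for (i = 1; i <= n; i++)
                 if (word[i] == "MTCAN_PREV") found = 1
         }
         END { exit(found ? 0 : 1) }'
}

# Intended use, with entry.txt being a hypothetical file that holds one
# device entry copied from the scsi file:
#   has_mtcan_prev < entry.txt && echo "MTCAN_PREV is set for this device"
```

Applied to the two entries shown earlier, the 9840 entry would not match and the 9940 entry would.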

If you are using the ts tape driver, then the /etc/config/tspd.config file must not specify the following for the device:

PREVENT_REMOVAL pathname yes

If the file does specify this setting for the device, you must edit the tspd.config file and restart the appropriate ts personality daemon.
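
A quick way to find affected devices is to list every pathname for which the setting is present. The sketch below assumes that each setting occupies one line with three white-space-separated fields, exactly as in the line shown above. The real tspd.config syntax may allow other forms, so treat the result as a hint only. The function name prevent_removal_paths is hypothetical.

```shell
# Sketch: print the pathname from every "PREVENT_REMOVAL pathname yes" line
# found on standard input. The three-field layout is an assumption based on
# the single line shown in the text.
prevent_removal_paths() {
    awk '$1 == "PREVENT_REMOVAL" && $3 == "yes" { print $2 }'
}

# Intended use on a cluster node (not run here):
#   prevent_removal_paths < /etc/config/tspd.config
```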

Some of the functions of the FailSafe TMF plug-in are performed through TMF; the plug-in issues commands to the TMF daemon to use these functions. However, the plug-in forces a dismount of a tape from a drive by issuing a command to the library software controlling the loader/library. In the case of Storage Technology Corporation (STK) hardware, the plug-in communicates its request to the Automated Cartridge System Library Software (ACSLS) that controls the loader. The plug-in uses an expect script that issues commands to log in to the loader and issue a dismount request to a drive.

The /etc/tmf/failsafe_tmf.config File

The /etc/tmf/failsafe_tmf.config file lets you configure additional features of the FailSafe TMF plug-in. This file exists on all hosts in the cluster, and should be edited as necessary on each machine.

The contents of the failsafe_tmf.config file are dependent on the drives assigned to each host in the cluster. If all hosts in the failover domain are configured through TMF to use exactly the same drives, then this file would be the same on each host in the failover domain. You must maintain this file on each host; a change on one host is unknown to the other hosts.

There are two different types of directives that you can specify in the failsafe_tmf.config file: the loader directive and the remote_devices directive. These are defined in the following subsections.

The loader Directive

The loader directive provides information about a TMF loader, which controls one or more tape devices that are members of TMF device groups being managed as FailSafe resources. There may be more than one such directive in this file. The loader information is used by the FailSafe TMF plug-in to force a dismount of tapes from drives that cannot be made available (that is, have tmstat states other than assn, free, conn, or idle) so that those tapes can be used via other tape devices in the same device group. The information is also used to force a dismount of tapes from drives that are only connected to other hosts, not this host (as described in “The remote_devices Directive”). If the file does not contain a loader directive, then the TMF plug-in will make no attempt to force a dismount of tapes from any drives.

The directive has the following format:

loader  lname  ltype  lhost  luser  lpswd

where:

lname

Name of the loader as defined in the TMF config file.

ltype

Type of the loader as defined in the TMF config file. (Currently only STKACS is supported for ltype.)

lhost

Server name of the loader as defined in the TMF config file.

luser

User name of the loader's administrator account. For STKACS, use acssa.

lpswd

Password for the loader's administrator account.

The TMF command /usr/sbin/tmmls shows the name of the loader and the server associated with it:

# tmmls
 loader        type  status m  server   old  m_pnd  d_pnd  r_qd  comp  avg
 operator  OPERATOR      UP A  IRIX       0      0      0     0     0    0(sec)
 wolfy       STKACS    DOWN A  wolfcree   0      0      0     0     0    0(sec)
 panther     STKACS    DOWN A  stk9710    0      0      0     0     0    0(sec)
 l180        STKACS      UP A  stk9710    0      0      0     0     0    0(sec)

For example, suppose you want to have the FailSafe TMF plug-in dismount drives that are in the l180 loader/library listed above. That library has the stk9710 server associated with it. The loader directive in the failsafe_tmf.config file would look like the following:

loader l180 STKACS stk9710 acssa acssapassword

In this case, the FailSafe TMF plug-in would force a dismount for each drive that is specified in the tmf.config file to be in the l180 loader/library and in the plug-in's drive group. If you do not want the plug-in to dismount any tape drives associated with a particular resource, you would not place a loader directive in the failsafe_tmf.config file.
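
Because a malformed loader directive is easy to miss, a small validation pass can be useful after editing the file. The following sketch checks only what this section states: six white-space-separated fields, with STKACS as the only supported ltype. It is not the parser used by the plug-in, and the function name check_loaders is hypothetical.

```shell
# Sketch: validate loader directives read from standard input.
# Prints "lname ltype lhost luser" for each well-formed directive (the
# password field is deliberately not echoed) and returns non-zero if any
# loader line is malformed.
check_loaders() {
    awk '
    $1 == "loader" {
        if (NF != 6) {
            printf("line %d: expected 6 fields, found %d\n", NR, NF) > "/dev/stderr"
            bad = 1
            next
        }
        if ($3 != "STKACS") {
            printf("line %d: unsupported ltype %s\n", NR, $3) > "/dev/stderr"
            bad = 1
            next
        }
        print $2, $3, $4, $5
    }
    END { exit(bad ? 1 : 0) }'
}

# Intended use on a cluster node (not run here):
#   check_loaders < /etc/tmf/failsafe_tmf.config
```

Since the file holds the loader administrator's password in clear text, the sketch avoids printing that field.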

The remote_devices Directive

The remote_devices directive provides information about one or more tape devices that are part of a TMF device group, but which are not visible on this host. An example would be where a library has four SCSI drives, and two drives are connected to each of two FailSafe hosts. If host A should crash, host B must be able to force a dismount of any tapes in A's drives so that they can then be used from host B. Because the drives are not visible on host B, the remote_devices directive provides the information needed to force a dismount of unseen drives.

The directive has the following format:

remote_devices  rname  lname  drvid ...

where:

rname

Name of the FailSafe TMF Resource that will dismount this drive.

lname

Name of the loader as defined in the TMF config file. There must be a loader directive for lname elsewhere in this file, or the remote_devices directive will be ignored.

drvid

The vendor ID of the drive on which to force a dismount. This is the unique name by which the loader identifies the drive. In the case of STKACS, this will be a comma-separated four-digit string listing the ACS, LSM, drive panel, and drive (for example, 0,0,1,3).


Note: No blanks should exist within the ID.


Multiple vendor IDs can be specified in the same remote_devices directive as long as they all pertain to the same loader. If all the vendor IDs will not fit on a single line, just add additional remote_devices directives for the same loader. For example, to enable the FailSafe TMF plug-in to force a dismount of the remote drives (0,0,1,0), (0,0,1,1), (0,0,1,2), and (0,0,1,3) in the l180 loader/library for resource tmf_eglf, the directive would be:

remote_devices  tmf_eglf  l180  0,0,1,0  0,0,1,1  0,0,1,2  0,0,1,3

If multiple FailSafe TMF resources are defined, only the resource named tmf_eglf will force a dismount of these drives.
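
The rules above (a matching loader directive must exist somewhere in the file, and each STKACS drive ID is four comma-separated numbers with no blanks) can be checked mechanically. The following sketch expands every usable remote_devices directive into one line per drive. It is only an illustration of those rules, not the plug-in's own logic, and the function name expand_remote_devices is hypothetical.

```shell
# Sketch: read failsafe_tmf.config-style text on standard input and print
# "rname lname drvid" for each drive named in a remote_devices directive.
# Directives whose loader has no loader directive anywhere in the input are
# skipped, and drive IDs that are not four comma-separated numbers are
# reported on standard error.
expand_remote_devices() {
    awk '
    $1 == "loader"         { known[$2] = 1 }
    $1 == "remote_devices" { saved[++n] = $0 }
    END {
        for (i = 1; i <= n; i++) {
            nf = split(saved[i], f)
            if (!(f[3] in known))
                continue    # no matching loader directive: ignored
            for (j = 4; j <= nf; j++) {
                if (f[j] ~ /^[0-9]+,[0-9]+,[0-9]+,[0-9]+$/)
                    print f[2], f[3], f[j]
                else
                    printf("bad drive ID: %s\n", f[j]) > "/dev/stderr"
            }
        }
    }'
}

# Intended use on a cluster node (not run here):
#   expand_remote_devices < /etc/tmf/failsafe_tmf.config
```

The remote_devices lines are saved and processed only at the end of the input, so a loader directive that appears later in the file than the remote_devices directive is still honored.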

Configuring Tapes and TMF

If drives that belong to a FailSafe TMF resource are configured on more than one machine in the FailSafe cluster, they should be configured consistently. The same tape driver (for example, ts or tpsc) should be used on each host where the drive is configured.

When configuring a FailSafe TMF resource, administrators should be aware of several parameters in the /etc/tmf/tmf.config file. The FailSafe TMF plug-in will try to start the loader associated with its device-group if it is not up. However, if the tmf.config file specifies status = UP for the loader, this step may not be necessary, and the devices may become available sooner.

A drive that is in a FailSafe TMF resource will be configured in the tmf.config file of one or more hosts within the cluster. It should be configured with status=down. All drives associated with the resource group must be unavailable for the exclusive script to indicate that the resource is not already running.

If the drives being used do not support persistent reserve, then they should be configured in the tmf.config file with access=shared. If the drives do support persistent reserve, then it is recommended that you use this feature when using the FailSafe TMF plug-in. To use persistent reserve, you should use the ts tape driver, and set access=exclusive in the tmf.config file. See the ts(7) man page for more information about using the ts tape driver. The access option should be consistent across all hosts in the FailSafe cluster where the drives are configured.

The -g option of the tmconfig command reassigns a device to a different device group name. The FailSafe TMF software does not support reassigning a device into a FailSafe TMF device group, because after a failover the FailSafe TMF plug-in on the node that takes over the resource would have no knowledge of the reassigned drive and would not be able to dismount tapes that are in it. Using tmconfig -g to move devices out of a FailSafe TMF device group will decrease the number of available drives that the monitor script sees. Also, in the case of failover or stop, the drive will be configured down.

Executing FailSafe TMF

The FailSafe TMF plug-in assumes that TMF is being used as the mounting service for tape devices associated with a tape library. Each time the plug-in is run, it will verify that TMF is up and running. If TMF is not running, the plug-in will start it. If TMF cannot be started by the plug-in, a failure will occur. Next, the plug-in will verify that the tape loader associated with the devices for a resource is up and accessible. If it is not up, the plug-in will configure it up using the /usr/sbin/tmconfig TMF command. If it cannot configure up the loader, a failure will occur.

The FailSafe TMF plug-in uses information supplied by the /etc/tmf/tmf.config file to identify what devices pertain to a particular resource. It uses that information in conjunction with the resource's remote_devices directive in the failsafe_tmf.config file to determine what actions need to be taken on tape drives defined by the resource.

The plug-in retrieves the values of the device-group and devices-minimum attributes for a particular resource. It then examines the TMF configuration file for information pertaining to drives belonging to the same device group as specified for the resource and stores the information for processing.

The plug-in then forces a dismount of tapes from any drives that are specified in the remote_devices directive in the failsafe_tmf.config file associated with the resource. Next, it verifies that the minimum number of devices of the specified type are available for use. A device is considered available if its status as displayed by the tmstat command is one of the following:

  • assn

  • idle

  • free

If devices are in the down state, the plug-in will use tmconfig to configure them up and make sure that they are available. If a device cannot be configured up, and the associated loader directive is in the failsafe_tmf.config file, the plug-in will force a dismount of the tape from that device.

If the FailSafe TMF plug-in does not find the required minimum number of drives to be available, a failure will occur.
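
The availability rule can be expressed as a simple count. The following sketch takes the devices-minimum value as its argument and reads one device status per line. It counts the three states listed above (assn, idle, and free) and succeeds only when the count reaches the minimum. The function name enough_devices is hypothetical. Extracting the status column from real tmstat output is left out because the column layout is not shown in this chapter.

```shell
# Sketch: succeed when at least $1 of the statuses read from standard input
# (one per line) are "assn", "idle", or "free".
enough_devices() {
    awk -v min="$1" '
        $1 == "assn" || $1 == "idle" || $1 == "free" { avail++ }
        END { exit((avail + 0 >= min + 0) ? 0 : 1) }'
}

# Example: three of these four devices count as available.
printf 'assn\ndown\nfree\nidle\n' | enough_devices 3 && echo "minimum met"
```

With a minimum of 0, the check succeeds even when no device is available, which corresponds to the default devices-minimum behavior described earlier.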


Note: The TMF Resource Type definition defines the number of local restarts to be a large number. This means that if a failure of the resource is detected on a host, the host will try to restart the resource before failing over. For information about changing this definition, see the FailSafe Administrator's Guide for SGI InfiniteStorage.


Creating a TMF Resource

After you have defined the resource type, you must define the TMF resources based on the resource type. Each resource requires a unique resource name. The following example defines a TMF resource named egft in the cluster eagan:

# cmgr
Welcome to SGI Cluster Manager Command-Line Interface

cmgr> set cluster eagan
cmgr> define resource egft of resource_type TMF
Enter commands, when finished enter either "done" or "cancel"

Type specific attributes to create with set command:

Type Specific Attributes - 1: device-group
Type Specific Attributes - 2: devices-minimum
Type Specific Attributes - 3: devices-loaned
Type Specific Attributes - 4: email-addresses

No resource type dependencies to add

resource egft ? set device-group EGLFT
resource egft ? set devices-minimum 4
resource egft ? set email-addresses [email protected]
resource egft ? done
Successfully defined resource egft


Note: The devices-loaned parameter is ignored and it should be left at its default value.

The device-group field is case sensitive. It must exactly match what is entered in the tmf.config file.

FailSafe, and the FailSafe TMF plug-in, allow you to specify unique values for each of the attributes on each of the hosts. For example, a FailSafe TMF resource could be defined so that host A had devices-minimum=3, but host B had devices-minimum=2.

Creating a TMF Resource Group

You can create a resource group by using either the FailSafe GUI or the cmgr command. For information, see the FailSafe Administrator's Guide for SGI InfiniteStorage.

To define an effective resource group, you must include all of the resources on which the TMF resource depends. The following example shows the creation of a typical resource group:

cmgr> show failover_policies

Failover Policies:
        t2
        dmfadmin
        dgroups
        ordered

cmgr> define resource_group tmfrg in cluster eagan
Enter commands, when finished enter either "done" or "cancel"

resource_group tmfrg ? set failover_policy to dgroups
resource_group tmfrg ? add resource egft of resource_type TMF
resource_group tmfrg ? done
Successfully created resource group tmfrg

cmgr> show resource_group tmfrg

Resource Group: tmfrg
        Cluster: eagan
        Failover Policy: dgroups

Resources: 
        egft (type: TMF)

cmgr> show failover_policy dgroups
Failover policy: dgroups
Version: 1
Script: ordered
Attributes: Auto_Failback Auto_Recovery
Initial AFD: guiness dublin

Testing the TMF Resource

To ensure that the TMF resource has been correctly configured, you can test individual actions by executing the scripts. Each script, located in /var/cluster/ha/resource_types/TMF, requires two arguments: an input file and an output file. These files contain resource names. Each script displays 0 if it executes successfully or a positive number that indicates the error type. For more information on error codes, see the FailSafe Programmer's Guide for SGI InfiniteStorage.

All TMF scripts assume they are being run under ksh.

In the following example, you test the start script by starting the TMF resource with the resource name tmfx.

$ cd /var/cluster/ha/resource_types/TMF
$ echo "tmfx" > /tmp/ipfile
$ ./start /tmp/ipfile /tmp/opfile

This should start the TMF resource named tmfx.

To view the individual script actions, you must edit the script and add the following to the action function:

set -x
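
The manual steps above (write the resource name to an input file, run the script with an input and an output file, then inspect the result) can be wrapped in a small helper. This is only a convenience sketch. The function name run_action and the temporary file names are hypothetical, and the sketch assumes the script's exit status is meaningful in addition to whatever it displays.

```shell
# Sketch: run one FailSafe action script for a single resource.
#   $1 = path to the action script (for example, the TMF start script)
#   $2 = resource name
# Prints the exit status and the contents of the output file, then returns
# the exit status of the script.
run_action() {
    infile=/tmp/ipfile.$$
    outfile=/tmp/opfile.$$
    printf '%s\n' "$2" > "$infile"
    : > "$outfile"
    "$1" "$infile" "$outfile"
    rc=$?
    echo "script exit status: $rc"
    echo "output file contents:"
    cat "$outfile"
    rm -f "$infile" "$outfile"
    return "$rc"
}

# Intended use on a test node (not run here):
#   run_action /var/cluster/ha/resource_types/TMF/start tmfx
```

Using per-process file names avoids overwriting /tmp/ipfile or /tmp/opfile left over from an earlier manual test.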

Testing the start Script

Use the following procedure to test the start script:

  1. Create a TMF resource for a device group specified in the tmf.config file. Do not run TMF on your test node.

  2. Perform the following actions:

    # echo "resource-name" > /tmp/ipfile
    # /var/cluster/ha/resource_types/TMF/start /tmp/ipfile /dev/null

  3. Check the /var/cluster/ha/log/script_nodename log file and verify from the log messages that the resource was started.

  4. Verify that the TMF daemon process tmdaemon was started. Using the TMF commands tmmls and tmstat, verify that the tape loader and tape drives of type resource_name that you defined came up and are available.


    Note: Running this test may force the dismount of tapes in drives as specified in the failsafe_tmf.config file.


Testing the stop Script

To stop the TMF resource, enter the following commands:

# echo "resource-name" > /tmp/ipfile
# /var/cluster/ha/resource_types/TMF/stop /tmp/ipfile /dev/null

Check to see if the resource is offline by using the TMF tmstat command to verify that the tape devices of type resource_name were configured down.

Testing Resource Group Failovers

You can test the failover policy by using either cmgr or the FailSafe GUI to move the resource group to another node in the cluster. To ensure that the resource group correctly failed over, use cmgr or the GUI to display the resource group states.

The following example uses cmgr to test the failover policy:

cmgr> admin offline resource_group TMF in cluster eagan

cmgr> admin move resource_group TMF in cluster eagan to node cm2

cmgr> admin online resource_group TMF in cluster eagan