FailSafe™ Administrator's Guide for SGI® InfiniteStorage

Document Number: 007-3901-013

Front Matter

List of Figures
List of Tables

Table of Contents

New Features in This Guide

About This Guide
Structure of This Guide
Related Documentation
Obtaining Publications
Reader Comments

1. Overview
High Availability with FailSafe
Complete Storage Solution: FailSafe, CXFS, DMF, and TMF
Cluster Environment
Additional Features
Highly Available Resources
Highly Available Applications
Failover and Recovery Processes
Overview of Configuring and Testing a New Cluster
Release Support Policy

2. Configuration Planning
Example of the Planning Process
Disk Configuration
XFS Filesystem Configuration
CXFS Filesystem Configuration
HA IP Address Configuration

3. Best Practices
Planning and Installing a FailSafe Cluster
Knowing the Tools
Administration and Operation
Avoiding Problems

4. FailSafe Installation and System Preparation
Install FailSafe
Configure System Files
Set the corepluspid System Parameter
Set NVRAM Variables
Create XLV Logical Volumes and XFS Filesystems
Configure Network Interfaces
Configure the Ring Reset Serial Port
Install Patches
Install Performance Co-Pilot Software
Test the System
Modifications Required for Connectivity Diagnostics

5. Administration Tools
FailSafe Manager GUI
cmgr Command

6. Configuration
Preliminary Steps
Name Restrictions
Configuring Timeout Values and Monitoring Intervals
Setting Configuration Defaults with cmgr
Guided Configuration with the GUI
Node Tasks
Cluster Tasks
Resource Type Tasks
Resource Tasks
Failover Policy Tasks
Resource Group Tasks
FailSafe HA Services Tasks

7. Configuration Examples
Example: Script to Define an SGI File Server 850 Cluster
Example: Script to Define an SGI SAN Server 1000 Cluster
Example: Script to Define a Three-Node Cluster
Example: Modify a Cluster to Include a CXFS Filesystem
Example: Export CXFS Filesystems
Example: Create a Resource Group
Example: Local Failover of HA IP Address

8. FailSafe System Operation
Redirecting the Console for Origin 300, Origin 3200C, Onyx 300, and Onyx 3200C
Two-Node Clusters: Single-Node Use
System Status
Embedded Support Partner (ESP) Logging of FailSafe Events
Resource Group Failover
Stopping FailSafe
Resetting Nodes
Cluster Database Backup and Restore
Filesystem Dump and Restore
Rotating Log Files
Granting Task Execution Privileges to Users
Updating the Checksum Version for 6.5.21 and Earlier Clusters

9. Testing the Configuration
Performing Diagnostic Tasks with the GUI
Performing Diagnostic Tasks with cmgr

10. System Recovery and Troubleshooting
Overview of System Recovery
Identifying the Cluster Status
Locating Problems
Common Problems
Disabling Resource Groups for Maintenance
Ensuring that Resource Groups are Deallocated
FailSafe Log Files
FailSafe Membership and Resets
Status Monitoring
XVM Alternate Path Failover
Dynamic Control of FailSafe HA Services
Recovery Procedures
CXFS Metadata Server Relocation
Other Problems with CXFS Coexecution
Reporting Problems to SGI

11. Upgrading and Maintaining Active Clusters
Adding a Node to an Active Cluster
Deleting a Node from an Active Cluster
Changing Control Networks in a Cluster
Upgrading OS Software in an Active Cluster
Upgrading FailSafe Software in an Active Cluster
Adding New Resource Groups or Resources in an Active Cluster
Adding a New Hardware Device in an Active Cluster

12. Performance Co-Pilot for FailSafe
Using the Visualization Tools
Performance Co-Pilot for FailSafe Performance Metrics
Performance Co-Pilot Gray Display

A. FailSafe Software
Subsystems on the CD
Subsystems for Servers and Workstations in the Pool
Additional Subsystems for Nodes in the FailSafe Cluster
Additional Subsystems for Workstations

B. Metrics Exported by Performance Co-Pilot for FailSafe

C. System Messages
SYSLOG Messages
Log File Error Messages