Linux FailSafe™ Administrator's Guide

Written by Joshua Rodman of SuSE, Inc. and Steven Levine and Jenn Byrnes of SGI

Illustrated by Dany Galgani and Chris Wengelski

Production by Adrian Daley and Glen Traefald

Engineering contributions by Scott Henry, Daniel Hurtubise, Vidula Iyer, Ashwinee Khaladkar, Herbert Lewis, Michael Nishimoto, Wesley Smith, Bill Sparks, Paddy Sreenivasan, Dan Stekloff, Rebecca Underwood, Mayank Vasa, Manish Verma

Legal Notice
Table of Contents
About This Guide
1. Audience
2. Structure of This Guide
3. Related Documentation
4. Conventions Used in This Guide
1. Overview of the Linux FailSafe System
1.1. High Availability and Linux FailSafe
1.2. Concepts
1.3. Additional Linux FailSafe Features
1.4. Linux FailSafe Administration
1.5. Hardware Components of a Linux FailSafe Cluster
1.6. Linux FailSafe Disk Connections
1.7. Linux FailSafe Supported Configurations
1.8. Highly Available Resources
1.9. Highly Available Applications
1.10. Failover and Recovery Processes
1.11. Overview of Configuring and Testing a New Linux FailSafe Cluster
1.12. Linux FailSafe System Software
2. Planning Linux FailSafe Configuration
2.1. Introduction to Configuration Planning
2.2. Disk Configuration
2.3. Logical Volume Configuration
2.4. Filesystem Configuration
2.5. IP Address Configuration
3. Installing Linux FailSafe Software and Preparing the System
3.1. Overview of Configuring Nodes for Linux FailSafe
3.2. Installing Required Software
3.3. Configuring System Files
3.4. Additional Configuration Issues
3.5. Choosing and Configuring Devices and Filesystems
3.6. Configuring Network Interfaces
3.7. Configuration for Reset
4. Linux FailSafe Administration Tools
4.1. The Linux FailSafe Cluster Manager Tools
4.2. Using the Linux FailSafe Cluster Manager GUI
4.3. Using the FailSafe Cluster Manager CLI
5. Linux FailSafe Cluster Configuration
5.1. Setting Configuration Defaults
5.2. Name Restrictions
5.3. Configuring Timeout Values and Monitoring Intervals
5.4. Cluster Configuration
5.5. Resource Configuration
5.6. Linux FailSafe System Log Configuration
5.7. Resource Group Creation Example
5.8. Linux FailSafe Configuration Example CLI Script
6. Configuration Examples
6.1. Linux FailSafe Example with Three-Node Cluster
6.2. cmgr Script
6.3. Local Failover of an IP Address
7. Linux FailSafe System Operation
7.1. Setting System Operation Defaults
7.2. System Operation Considerations
7.3. Activating (Starting) Linux FailSafe
7.4. System Status
7.5. Resource Group Failover
7.6. Deactivating (Stopping) Linux FailSafe
7.7. Resetting Nodes
7.8. Backing Up and Restoring Configuration With Cluster Manager CLI
8. Testing Linux FailSafe Configuration
8.1. Overview of FailSafe Diagnostic Commands
8.2. Performing Diagnostic Tasks with the Cluster Manager GUI
8.3. Performing Diagnostic Tasks with the Cluster Manager CLI
9. Linux FailSafe Recovery
9.1. Overview of FailSafe System Recovery
9.2. FailSafe Log Files
9.3. Node Membership and Resets
9.4. Status Monitoring
9.5. Dynamic Control of FailSafe Services
9.6. Recovery Procedures
10. Upgrading and Maintaining Active Clusters
10.1. Adding a Node to an Active Cluster
10.2. Deleting a Node from an Active Cluster
10.3. Changing Control Networks in a Cluster
10.4. Upgrading OS Software in an Active Cluster
10.5. Upgrading FailSafe Software in an Active Cluster
10.6. Adding New Resource Groups or Resources in an Active Cluster
10.7. Adding a New Hardware Device in an Active Cluster
Glossary
Index
List of Tables
1-1. Example Resource Group
1-2. Contents of /usr/lib/failsafe/bin
1-3. Administrative Commands for Use in Scripts
2-1. Logical Volume Configuration Parameters
2-2. Filesystem Configuration Parameters
2-3. IP Address Configuration Parameters
4-1. Available Templates
5-1. Log Levels
8-1. FailSafe Diagnostic Test Summary
List of Figures
1-1. Sample Linux FailSafe System Components
1-2. Disk Storage Failover on a Two-Node System
1-3. Software Layers
1-4. Read/Write Actions to the Cluster Configuration Database
1-5. Communication Path for a Node that is Not in a Cluster
1-6. Message Paths for Action Scripts and Failover Policy Scripts
2-1. Non-Shared Disk Configuration and Failover
2-2. Shared Disk Configuration for Active/Backup Use
2-3. Shared Disk Configuration for Dual-Active Use
3-1. Example Interface Configuration
6-1. Configuration Example