When a FailSafe system is running, you may need to perform various administration procedures without shutting down the entire cluster. This chapter provides instructions for performing upgrade and maintenance procedures on active clusters: adding and deleting a node, changing control networks, upgrading OS software, upgrading FailSafe software, adding new resource groups or resources, and adding a new hardware device.
Use the following procedure to add a node to an active cluster. This procedure assumes that the cluster_admin, cluster_control, cluster_ha, and failsafe2 products are already installed on this node.
Check control network connections from the node to the rest of the cluster using the ping command. Note the list of control network IP addresses.
Check the serial connections to reset this node. Note the name of the node that can reset this node.
Run node diagnostics. For information on FailSafe diagnostic commands, see Chapter 9, “Testing the Configuration”.
Make sure that the sgi-cad, sgi-crsd, sgi-cmsd, and sgi-gcd entries are present in the /etc/services file. The port numbers for these processes should match the port numbers on the other nodes in the cluster.
sgi-cad   7200/tcp    # Cluster admin daemon
sgi-crsd  7500/udp    # Cluster reset services daemon
sgi-cmsd  7000/udp    # FailSafe membership Daemon
sgi-gcd   8000/udp    # Group communication Daemon
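The presence check in this step can be scripted. The following is a minimal sketch, assuming a POSIX shell and awk. The function name missing_services is hypothetical and is not part of FailSafe. It prints each of the four service names that has no uncommented entry in the given services file.

```shell
# Print each FailSafe service name that has no uncommented entry in a
# services file. missing_services is a hypothetical helper name.
# Usage: missing_services [file]   (defaults to /etc/services)
missing_services() {
    services_file=${1:-/etc/services}
    for name in sgi-cad sgi-crsd sgi-cmsd sgi-gcd; do
        # The service name must be the first field of the line, so
        # commented-out entries such as "#sgi-cad" do not count.
        if ! awk -v n="$name" '$1 == n { found = 1 } END { exit !found }' "$services_file"; then
            echo "$name"
        fi
    done
}
```

No output means that all four entries are present. The sketch does not compare port numbers between nodes; that comparison still has to be made against the /etc/services file on the other nodes.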
Check if the HA services and cluster processes (cad, cmond, crsd) are running.
# ps -ef | grep cad
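The ps example above greps only for cad. A small helper can check all three cluster processes in one pass. This is a sketch, assuming a POSIX shell and a grep that supports -E. The name missing_daemons is hypothetical. It reads ps -ef output on standard input and prints the name of each process it does not find.

```shell
# Read `ps -ef` output on stdin and print each expected cluster process
# that does not appear. missing_daemons is a hypothetical helper name.
# Usage: ps -ef | missing_daemons
missing_daemons() {
    ps_output=$(cat)
    for daemon in cad cmond crsd; do
        # Match the name as a whole command word, with or without a
        # leading path, followed by a space or the end of the line.
        if ! printf '%s\n' "$ps_output" | grep -E "(^|[ /])$daemon( |\$)" >/dev/null; then
            echo "$daemon"
        fi
    done
}
```

If the helper prints nothing, all three processes were found. Any other process whose command line contains one of these names as a separate word would also count as a match, so treat the result as a quick check rather than proof.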
If HA services and cluster processes are not running, run the cdbreinit command. For example:
# /usr/cluster/bin/cdbreinit /var/cluster/cdb/cdb.db
Killing fs2d...
Removing database header file /var/cluster/cdb/cdb.db...
Preparing to delete database directory /var/cluster/cdb/cdb.db# !!
Continue[y/n]y
Removing database directory /var/cluster/cdb/cdb.db#...
Deleted CDB database at /var/cluster/cdb/cdb.db
Recreating new CDB database at /var/cluster/cdb/cdb.db with cdb-exitop...
fs2d
Created standard CDB database in /var/cluster/cdb/cdb.db
Please make sure that "sgi-cad" service is added to /etc/services file
If not, add the entry and restart cluster processes.
Please refer to SGI FailSafe administration manual for more information.
Modifying CDB database at /var/cluster/cdb/cdb.db with cluster_ha-exitop...
Modified standard CDB database in /var/cluster/cdb/cdb.db
Please make sure that "sgi-cmsd" and "sgi-gcd" services are added to /etc/services file before starting HA services.
Please refer to SGI FailSafe administration manual for more information.
Starting cluster control processes with cluster_control-exitop...
Please make sure that "sgi-crsd" service is added to /etc/services file
If not, add the entry and restart cluster processes.
Please refer to SGI FailSafe administration manual for more information.
Started cluster control processes
Restarting cluster admin processes with failsafe-exitop...
Use the GUI, the cmgr command, or the template script (/var/cluster/cmgr-templates/cmgr-create-node) to define the node.
Note: This node must be defined from one of the nodes that is already in the cluster.
Use the cmgr command or the GUI to add the node to the cluster.
For example, the following cmgr command adds the node web-node3 to the cluster web-cluster:
cmgr> modify cluster web-cluster
Enter commands, when finished enter either "done" or "cancel"
web-cluster ? add node web-node3
web-cluster ? done
Start highly available (HA) services on this node using the cmgr command or the GUI.
For example, the following cmgr command starts HA services on node web-node3 in cluster web-cluster:
cmgr> start ha_services on node web-node3 for cluster web-cluster
Add this node to the failure domain of the relevant failover policy. In order to do this, you must redefine the entire failover policy to include the additional node in the failure domain.
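The two cmgr steps in this procedure (adding the node to the cluster and starting HA services) can be collected into a command file for review. The following is a sketch only, assuming a POSIX shell. The function name add_node_commands is hypothetical, and it uses the "for cluster" form that the other start and stop ha_services examples in this chapter use. Feeding the resulting file to cmgr (for example with a script-file option such as -f) is an assumption that you should verify against the cmgr man page on your system.

```shell
# Print the cmgr commands that add an already-defined node to a cluster
# and start HA services on it. add_node_commands is a hypothetical name.
# Usage: add_node_commands node cluster
add_node_commands() {
    node=$1
    cluster=$2
    cat <<EOF
modify cluster $cluster
add node $node
done
start ha_services on node $node for cluster $cluster
EOF
}

# Example: write the commands to a file and inspect them before use.
# add_node_commands web-node3 web-cluster > /tmp/add-web-node3.cmgr
```

The sketch covers only these two steps. The node must already be defined, and the failover policy must still be redefined separately to include the node in its failure domain.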
Use the following procedure to delete a node from an active cluster. This procedure assumes that the node status is UP.
If resource groups are online on the node, use the cmgr command or the GUI to move them to another node in the cluster.
To move the resource groups to another node in the cluster, there should be another node available in the failover policy domain of the resource group. If you want to leave the resource groups running in the same node, use the cmgr command or the GUI to detach the resource group.
For example, the following command would leave the resource group web-rg running in the same node in the cluster web-cluster:
cmgr> admin detach resource_group web-rg in cluster web-cluster
Delete the node from the failure domains of any failover policies that use the node. In order to do this, the entire failover policy must be redefined, deleting the affected node from the failure domain.
Stop HA services on the node.
For example, to stop HA services on the node web-node3, use the following cmgr command. This command will move all the resource groups online on this node to other nodes in the cluster if possible:
cmgr> stop ha_services on node web-node3 for cluster web-cluster
If it is not possible to move resource groups that are online on node web-node3, the above command will fail. The force option is available to stop HA services in a node even in the case of an error. If there are resources that cannot be moved offline or deallocated properly, a side-effect of the offline force command will be to leave these resources allocated on the node.
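Perform the remaining steps (Steps 4 through 7: deleting the node from the cluster, removing its definition from the cluster database, stopping cluster processes and deleting the cluster database, and disabling startup at boot) only if the node must be deleted from the cluster database.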
Delete the node from the cluster.
For example, to delete node web-node3 from the web-cluster configuration, use the following cmgr command:
cmgr> modify cluster web-cluster
Enter commands, when finished enter either "done" or "cancel"
web-cluster ? remove node web-node3
web-cluster ? done
Remove node configuration from the cluster database.
The following cmgr command deletes the web-node3 node definition from the cluster database:
cmgr> delete node web-node3
Stop all cluster processes and delete the cluster database:
# /etc/init.d/cluster stop
# killall fs2d
# cdbdelete /var/cluster/cdb/cdb.db
Disable cluster and HA processes from starting when the node boots:
# chkconfig cluster off
# chkconfig failsafe2 off
Use the following procedure to change the control networks in a currently active cluster. This procedure is valid for a two-node cluster consisting of nodes node1 and node2. In this procedure, you must complete each step before proceeding to the next step.
Note: Do not perform any other administration operations during this procedure.
From either node, stop HA services on the cluster. Make sure all HA processes have exited on both nodes.
From node2, stop the cluster processes on node2:
# /etc/init.d/cluster stop
# killall fs2d
Make sure the fs2d process has been killed on node2.
From node1, modify the node1 and node2 definitions. Use the GUI or the following cmgr commands:
cmgr> modify node node1
Enter commands, when finished enter either "done" or "cancel"
node1?> remove nic old_nic_address
node1> add nic new_nic_address
NIC - new_nic_address set heartbeat to ...
NIC - new_nic_address set ctrl_msgs to ...
NIC - new_nic_address set priority to ...
NIC - new_nic_address done
node1? done
Repeat the same procedure to modify node2.
From node1, check if the node1 and node2 definitions are correct. Using cmgr on node1, execute the following commands to view the node definitions:
cmgr> show node node1
cmgr> show node node2
On both node1 and node2, modify the network interface IP addresses in /etc/config/netif.options and execute ifconfig to configure the new IP addresses on node1 and node2. Verify that the IP addresses match the node definitions in the cluster database.
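The local half of this verification (confirming that a given address is actually configured on an interface) can be sketched as follows. This assumes a POSIX shell and an ifconfig that prints addresses in the form "inet ADDRESS netmask ...". The function name has_inet_address is hypothetical, and the address in the usage comment is a placeholder.

```shell
# Succeed if the given IP address appears as an inet address in the
# ifconfig output read from stdin. has_inet_address is a hypothetical
# helper name.
# Usage: ifconfig -a | has_inet_address 192.0.2.10
has_inet_address() {
    # -F treats the dots literally; the trailing space prevents
    # 192.0.2.1 from matching 192.0.2.10.
    grep -F "inet $1 " >/dev/null
}
```

Run it on each node with the NIC address recorded in that node's definition. The helper does not read the cluster database, so the address to check still has to be taken from the node definition by hand.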
From node1, stop the cluster processes on node1:
# /etc/init.d/cluster stop
# killall fs2d
Make sure the fs2d process has been killed on node1.
From node2, execute the following command to start cluster processes on node2:
# /usr/cluster/bin/cdbreinit /var/cluster/cdb/cdb.db
Answer y to the prompt.
From node1, start cluster processes on node1:
# /etc/init.d/cluster start
The following messages should appear in the SYSLOG on node2:
Starting to receive CDB sync series from machine <node1_nodeID>
...
Finished receiving CDB sync series from machine <node1_nodeID>
Wait for approximately 60 seconds for the synchronization to complete.
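Instead of waiting a fixed time, you can poll the log for the completion message. The following is a minimal sketch, assuming a POSIX shell. The function names are hypothetical, and the log file path is passed as an argument because its location depends on the system (on IRIX the system log is typically /var/adm/SYSLOG).

```shell
# Succeed if the log contains a completed CDB sync series message.
# Usage: cdb_sync_finished logfile
cdb_sync_finished() {
    grep "Finished receiving CDB sync series from machine" "$1" >/dev/null 2>&1
}

# Poll the log until the sync completes or the timeout expires.
# Both function names are hypothetical helpers, not FailSafe commands.
# Usage: wait_for_cdb_sync logfile [timeout_seconds] [interval_seconds]
wait_for_cdb_sync() {
    log=$1
    remaining=${2:-60}
    interval=${3:-5}
    while [ "$remaining" -gt 0 ]; do
        cdb_sync_finished "$log" && return 0
        sleep "$interval"
        remaining=$((remaining - interval))
    done
    # One last check after the final sleep.
    cdb_sync_finished "$log"
}
```

The check matches any completion message in the file, including one left by an earlier synchronization, so confirm the timestamp of the matching line before starting HA services.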
From any node, start HA services in the cluster.
Use the following procedure on one node at a time if the OS software upgrade requires a reboot or will impact the FailSafe software. If you are uncertain, you should use this procedure.
The following procedure upgrades the OS software on node web-node3:
If resource groups are online on the node, use a cmgr command or the GUI to move them to another node in the cluster. To move the resource group to another node in the cluster, there should be another node available in the failover policy domain of the resource group.
The following cmgr command moves resource group web-rg to another node in the cluster web-cluster:
cmgr> admin move resource_group web-rg in cluster web-cluster
To stop HA services on the node web-node3, use the following cmgr command or the GUI. This command will move all the resource groups online on this node to other nodes in the cluster if possible.
cmgr> stop ha_services on node web-node3 for cluster web-cluster
If it is not possible to move resource groups that are online on node web-node3, the above command will fail. You can use the force option to stop HA services in a node even in the case of an error.
Perform the OS upgrade in the node web-node3.
After the OS upgrade, make sure cluster processes (cmond, cad, crsd) are running.
Restart HA services on the node. For example, the following cmgr command restarts HA services on the node:
cmgr> start ha_services on node web-node3 for cluster web-cluster
Make sure the resource groups are running on the most appropriate node after restarting HA services.
When you upgrade FailSafe software in an active cluster, you upgrade one node at a time in the cluster.
The following procedure upgrades FailSafe on node web-node3:
If resource groups are online on the node, use a cmgr command or the GUI to move them to another node in the cluster. To move the resource group to another node in the cluster, there should be another node available in the failover policy domain of the resource group.
For example, the following cmgr command moves resource group web-rg to another node in the cluster web-cluster:
cmgr> admin move resource_group web-rg in cluster web-cluster
To stop HA services on the node web-node3, use the following cmgr command or the GUI. This command will move all the resource groups online on this node to other nodes in the cluster if possible:
cmgr> stop ha_services on node web-node3 for cluster web-cluster
If it is not possible to move resource groups that are online on node web-node3, the above command will fail. You can use the force option to stop HA services in a node even in the case of an error.
Stop all cluster processes running on the node:
# /etc/init.d/cluster stop
Perform the FailSafe upgrade in the node web-node3.
After the FailSafe upgrade, check whether cluster processes (cmond, cad, crsd) are running. If not, restart cluster processes:
# chkconfig cluster on; /etc/init.d/cluster start
Restart HA services on the node. For example, the following cmgr command restarts HA services on the node:
cmgr> start ha_services on node web-node3 for cluster web-cluster
Make sure the resource groups are running on the most appropriate node after restarting HA services.
The following procedure adds a resource group and resources to an active cluster. To add resources to an existing resource group, perform resource configuration (Step 4), perform resource diagnostics (Step 5), and add resources to the resource group (Step 6).
Identify all the resources that have to be moved together. Running together on one node, these resources should be able to provide a service to the client, and they should be placed in one resource group. For example, the Netscape Web server mfg-web, its highly available (HA) IP address 192.26.50.40, and the filesystem /shared/mfg-web containing the Web configuration and document pages should be placed in the same resource group (for example, mfg-web-rg).
Configure the resources in all nodes in the cluster where the resource group is expected to be online. For example, this might involve configuring Netscape Web server mfg-web on nodes web-node1 and web-node2 in the cluster.
Create a failover policy. Determine the type of failover attribute required for the resource group. You can use the following cmgr template to create the failover policy:
/var/cluster/cmgr-templates/cmgr-create-failover_policy
Configure the resources in the cluster database. There are cmgr templates to create resources of various resource types in the /var/cluster/cmgr-templates directory. For example, the volume resource, the /shared/mfg-web filesystem, the 192.26.50.40 IP_address resource, and the mfg-web Netscape_web resource have to be created in the cluster database. Create the resource dependencies for these resources.
Run resource diagnostics. For information on the diagnostic commands, see Chapter 9, “Testing the Configuration”.
Create the resource group and add resources to it. You can use the following cmgr template to create the resource group and add resources to it:
/var/cluster/cmgr-templates/cmgr-create-resource_group
All resources that are dependent on each other should be added to the resource group at the same time. If resources are added to an existing resource group that is online in a node in the cluster, the resources are also made online on the same node.
Add new hardware devices to an active cluster one node at a time.
To add hardware devices to a node in an active cluster, follow the same procedure as when you upgrade OS software in an active cluster, as described in “Upgrading OS Software in an Active Cluster”. In summary:
You must move the resource groups offline and stop HA services in the node before adding the hardware device.
After adding the hardware device, make sure cluster processes are running and start HA services on the node.
To include the new hardware device in the cluster database, you must modify your resource configuration and your node configuration, where appropriate.