The order of execution is as follows:
FailSafe starts up either when the start ha_services command is issued in cmgr or as part of the node boot procedure. It then reads the resource group information from the cluster database.
FailSafe tells the system resource manager (SRM) to run exclusive scripts for all resource groups that are in the Online ready state.
SRM returns one of the following states for each resource group:
running
partially running
not running
If a resource group has a state of not running on a node where HA services have been started, the following occurs:
FailSafe runs the failover policy script associated with the resource group. The failover policy script takes the list of nodes that are capable of running the resource group (the failover domain) as a parameter.
The failover policy script returns an ordered list of nodes in descending order of priority (the run-time failover domain) where the resource group can be placed.
FailSafe sends a request to SRM to move the resource group to the first node in the run-time failover domain.
SRM executes the start action script for all resources in the resource group:
If the start script fails, the resource group is marked online on that node with the following error:
srmd executable error
If the start script is successful, SRM automatically starts monitoring those resources. After the specified start monitoring time passes, SRM executes the monitor action script for the resource in the resource group.
If the resource group is in the running or partially running state on only one node in the cluster, FailSafe runs the associated failover policy script:
If the highest priority node is the same node where the resource group is partially running or running, the resource group is made online on the same node. In the partially running case, FailSafe tells SRM to execute start scripts for all resources in the resource group.
If the highest priority node is another node in the cluster, FailSafe tells SRM to execute stop action scripts for resources in the resource group on other nodes. FailSafe then makes the resource group online on the highest priority node in the cluster.
If the resource group is in the running or partially running state on multiple nodes in the cluster, the resource group is marked with an exclusivity error. These resource groups will require operator intervention to become online in the cluster.
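The placement decisions described above can be summarized in a short Python sketch. This is an illustrative model only, not FailSafe code: the names State, ordered_policy, and plan_placement, and the action labels it returns, are hypothetical. The sketch assumes the exclusive scripts have already reported one state per node and that the failover policy simply preserves the order of the failover domain.

```python
from enum import Enum


class State(Enum):
    """States that SRM reports for a resource group on a node."""
    RUNNING = "running"
    PARTIALLY_RUNNING = "partially running"
    NOT_RUNNING = "not running"


def ordered_policy(failover_domain):
    """Hypothetical failover policy: keep the configured node order."""
    return list(failover_domain)


def plan_placement(states, failover_domain, policy=ordered_policy):
    """Return the (action, node) steps for one resource group.

    states maps a node name to the State reported for that node.
    The action labels are illustrative, not FailSafe identifiers.
    """
    active = [n for n, s in states.items() if s is not State.NOT_RUNNING]

    # Running or partially running on several nodes: operator must step in.
    if len(active) > 1:
        return [("exclusivity_error", None)]

    # The policy returns the run-time failover domain, highest priority first.
    runtime_domain = policy(failover_domain)
    target = runtime_domain[0]

    # Not running anywhere: start on the first node of the run-time domain.
    if not active:
        return [("start", target)]

    current = active[0]
    if current == target:
        if states[current] is State.PARTIALLY_RUNNING:
            # Start scripts are run for all resources in the group.
            return [("start", current)]
        return [("mark_online", current)]

    # Highest priority node differs: stop where it runs, then bring it
    # online on the target (modeled here as a start on that node).
    return [("stop", current), ("start", target)]
```

For example, with a failover domain of ["a", "b"] and the group reported as running only on "b", the sketch yields a stop on "b" followed by a start on "a".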
Figure 6-1 shows the message paths for action scripts and failover policy scripts.
When the start action script fails, the order of execution is as follows:
SRM notifies FailSafe of the start action script failure as a resource group failure.
FailSafe runs the failover policy script to determine the next node for the resource group.
FailSafe sends a request to SRM to release the resource group and allocate the resource group in the next node in the cluster.
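The start-failure sequence above amounts to a retry loop over candidate nodes. The following Python sketch illustrates that loop under stated assumptions: place_with_failover, policy, and start_on are hypothetical names, start_on stands in for SRM running the start action scripts on a node, and excluding a node after a failed start is an assumption of this sketch rather than documented FailSafe behavior. What FailSafe does when every node has failed is not described here, so the sketch simply returns None in that case.

```python
def place_with_failover(failover_domain, policy, start_on):
    """Try nodes in policy order until a start succeeds.

    failover_domain: list of node names able to run the resource group.
    policy: callable returning the candidates ordered by priority.
    start_on: callable(node) -> bool, True if the start scripts succeeded.
    Returns (node_or_None, list_of_nodes_attempted).
    """
    remaining = list(failover_domain)
    attempts = []
    while remaining:
        # Run the failover policy to determine the next node.
        runtime_domain = policy(remaining)
        if not runtime_domain:
            break
        node = runtime_domain[0]
        attempts.append(node)
        if start_on(node):
            return node, attempts
        # Start failed: treated as a resource group failure. Release the
        # group on this node and (assumption) do not retry the same node.
        remaining.remove(node)
    return None, attempts
```

With a domain of ["a", "b", "c"], a policy that keeps the given order, and a start that succeeds only on "c", the sketch attempts "a", then "b", and finally places the group on "c".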