This chapter tells you how to use Performance Co-Pilot for FailSafe to monitor the availability of a FailSafe cluster. For information about installing Performance Co-Pilot for FailSafe, see “Install Performance Co-Pilot Software” in Chapter 4.
Performance Co-Pilot provides the following:
An agent for exporting FailSafe heartbeat and resource monitoring statistics to the Performance Co-Pilot framework
3-D visualization tools for displaying these statistics in an intuitive presentation
The visualization of statistics provides valuable information about the availability of nodes and resources monitored by FailSafe. For example, it can highlight a degradation in monitoring response times that may indicate problems in the availability of services provided by the cluster.
Because Performance Co-Pilot for FailSafe is an extension to the Performance Co-Pilot framework, you can use other Performance Co-Pilot tools to analyze or present FailSafe monitoring statistics, and record Performance Co-Pilot for FailSafe metrics as archives for deferred analysis. You can also use Performance Co-Pilot to gather statistics about CPU and memory utilization, network and disk activity, and other performance metrics for each node in the cluster.
To view statistics about the FailSafe cluster, use the rmvis and hbvis commands.
The hbvis command constructs a display showing the distribution of heartbeat response times for every node in the cluster. Figure 12-1 shows an example display.
Key features of the display include the frequency of heartbeat responses that arrive at particular intervals within the timeout period and the frequency of heartbeat responses that have been missed (determined not to have arrived). The bar representing the frequency of missed heartbeat responses changes color to indicate the urgency of problems with availability of a node.
The rmvis command constructs a display of the resource monitoring response times for resources monitored on every node of the cluster. Figure 12-2 shows an example display.
The display is similar in concept to that of hbvis, showing the frequency of resource monitoring responses that arrive within the timeout period, and the frequency of responses that have timed out. The bar representing the frequency of resource responses that have timed out also changes color to indicate the urgency of problems with the availability of particular resources.
If a node has failed or a resource has failed over, its statistics will disappear from the display.
To run a visualization tool on the monitor host, use the -h option to specify an available collector host in the cluster (host):
% hbvis -h host
or
% rmvis -h host
The collector host specified can be any collector host that is a member of the cluster for which you wish to view statistics.
You can also access these tools from the following FailSafe GUI menus:
File -> Launch Resource Monitoring
File -> Launch Heartbeat Monitoring
There are various options available to alter the display provided by hbvis and rmvis when launched from the command line:
-H hostfile | Provides a file that lists the nodes that are to appear in the visualization. This is useful for limiting the number of nodes in the display, because the display takes longer to construct for clusters with more nodes.
-t interval | Sets the sampling time of the visualization. Lengthening the sampling time may improve application responsiveness, particularly for clusters with many nodes. Because FailSafe maintains the statistics, hbvis and rmvis always show the latest statistics available for the selected sampling time. For details about the interval option, see the pmview and PCPIntro man pages.
-r | Selects the FailSafe metrics that present a sampling of statistics taken from the time of the last statistical reset. This makes the visualization more sensitive to abrupt changes in the FailSafe monitoring statistics. Without the -r option, the statistics presented are sampled from the time ha_cmsd and/or ha_srmd was last restarted.
-R | Starts a new statistical sampling.
-v | (hbvis only) Provides a visualization of heartbeat statistics for each node in the cluster, from the point of view of the selected collector host only. (The collector host is selected using the -h option.) There is a graphical representation of heartbeat statistics for each node in the cluster as observed by the selected collector host.
-w | (hbvis only) Provides a visualization of the aggregate of heartbeat statistics for all nodes in the cluster, from the point of view of the selected collector host only. (The collector host is selected using the -h option.) There is only one graphical representation of heartbeat statistics for the entire cluster as observed by the selected collector host.
For a complete description of options, see the hbvis and rmvis man pages.
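As an illustration of combining these options, the following sketch writes a node list and assembles an hbvis or rmvis command line. The helper names (write_hostfile, vis_command) and the one-node-per-line file layout are assumptions made for this example; check the hbvis and rmvis man pages for the actual host file format. The command is printed rather than executed so that it can be reviewed first.

```shell
#!/bin/sh
# Hypothetical helpers; only the -h, -H, and -t options come from this chapter.

# write_hostfile FILE NODE...
# Writes the given node names to FILE, one per line (assumed layout).
write_hostfile() {
    file=$1
    shift
    printf '%s\n' "$@" > "$file"
}

# vis_command TOOL COLLECTOR HOSTFILE INTERVAL
# Prints (does not run) an hbvis or rmvis command line.
vis_command() {
    case $1 in
        hbvis|rmvis) ;;
        *) echo "unknown tool: $1" >&2; return 1 ;;
    esac
    echo "$1 -h $2 -H $3 -t $4"
}
```

For example, calling write_hostfile /tmp/nodes node1 node2 and then vis_command hbvis node1 /tmp/nodes 5 prints a command that limits the display to two (hypothetical) nodes and passes an interval of 5. See the PCPIntro man page for how interval values are interpreted.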
The hbvis and rmvis commands use the command pmview to display the 3-D visualization of FailSafe performance metrics. For a description of the various menu commands and controls in the visualization window, consult the man page for pmview.
Performance Co-Pilot tools such as pmlogger, pmchart, and pminfo can use the metrics exported by Performance Co-Pilot for FailSafe.
Appendix B, “Metrics Exported by Performance Co-Pilot for FailSafe”, provides a description of Performance Co-Pilot for FailSafe metrics. You can also display a description of metrics by using the following command:
% pminfo -tT -h host
(If you are logged in to a collector host, you can leave out the -h option.)
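To record metrics as archives for deferred analysis, pmlogger needs a configuration file that names the metrics to log. The sketch below generates such a file. The function name logger_config is hypothetical, and no FailSafe metric names are assumed; supply the names reported by pminfo or listed in Appendix B.

```shell
#!/bin/sh
# logger_config SECONDS METRIC...
# Prints a pmlogger configuration that logs each METRIC every SECONDS seconds.
# Metric names are supplied by the caller; none are assumed here.
logger_config() {
    secs=$1
    shift
    echo "log mandatory on every $secs seconds {"
    for metric in "$@"; do
        echo "    $metric"
    done
    echo "}"
}
```

One way to use this would be to redirect the output to a file, for example logger_config 10 metricname > fsafe.config, and then run pmlogger -h host -c fsafe.config archivename. Here metricname, fsafe.config, and archivename are placeholders.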
A gray display (that is, no colored rectangular bars appear on the node's gray baseplane) when using hbvis or rmvis may indicate one of the following:
The node is down.
If you wish to see only the nodes that are up, create a file that lists the nodes to be displayed and pass it to hbvis or rmvis with the -H option (or the PCP_FSAFE_NODES environment variable) so that a new picture of the cluster can be generated. See the hbvis and rmvis man pages for more details on the -H option.
The collector daemons have been killed on that node.
To solve this problem, restart pmdafsafe in one of the following ways:
If pmcd is still running, send pmcd the SIGHUP signal by entering the following:
# killall -HUP pmcd
If pmcd is not running, restart Performance Co-Pilot by entering the following:
# /etc/init.d/pcp start
The timeout and sampling settings are too short.
To change the sampling time, use the time controls available in the pmview window. The default sampling time is two seconds; you may need to lengthen it if the display is unsatisfactory.
Alternatively, there may be timeout issues between pmdafsafe and pmcd, or between pmcd and pmview. Refer to the man pages for pmcd and PCPIntro for information on how to change the timeout settings for the various Performance Co-Pilot tools.
The resource has failed over (for rmvis).
In this case, restart rmvis so that a new picture of the cluster can be generated.
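For the case above in which the collector daemons have been killed, the choice between the two restart commands can be sketched as follows. The function names are hypothetical, and the ps-based check is only one assumed way of detecting whether pmcd is running. The chosen command is printed so that it can be reviewed before being run as root.

```shell
#!/bin/sh
# Hypothetical helpers sketching the pmdafsafe restart steps.

# pmcd_running: succeeds if a process named pmcd appears in the process list.
pmcd_running() {
    ps -e | grep -w pmcd > /dev/null
}

# restart_command: prints the command suited to the current state of pmcd.
restart_command() {
    if pmcd_running; then
        # pmcd is still running: SIGHUP makes it restart its agents.
        echo "killall -HUP pmcd"
    else
        # pmcd is not running: restart Performance Co-Pilot as a whole.
        echo "/etc/init.d/pcp start"
    fi
}
```

Calling restart_command on the affected node prints one of the two commands given earlier.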