Appendix B. Metrics Exported by Performance Co-Pilot for FailSafe

This appendix lists the metrics implemented by pmdafsafe.

The fsafe.srm.all.* metrics are the same as the fsafe.srm.* metrics, except that the latest values obtained for all resources remain available even if ha_srmd or any of the resources themselves are not available.

Table B-1. Performance Co-Pilot Metrics

Metric

Description

fsafe.srm.status
fsafe.srm.all.status

Latest status of a monitoring event performed on a resource, for all resources configured to be monitored on this node.

fsafe.srm.timeout
fsafe.srm.all.timeout

The prescribed timeout, in milliseconds, for monitoring a resource.

fsafe.srm.probes
fsafe.srm.all.probes

Number of times a resource has been monitored, for all resources configured to be monitored on this node, since the time ha_srmd was started.

fsafe.srm.recent.probes

Number of times a resource has been monitored, for all resources configured to be monitored on this node, since a data collection reset (via fsafe.control.reset_srm).

fsafe.srm.timeouts
fsafe.srm.all.timeouts

Number of resource monitoring events that have timed out before declaring that resource as failed, for all resources configured to be monitored on this node, since the time the resources have last been available.

fsafe.srm.recent.timeouts

Number of resource monitoring events that have timed out before declaring that resource as failed, for all resources configured to be monitored on this node, since a data collection reset (via fsafe.control.reset_srm).

fsafe.srm.min_resp
fsafe.srm.all.min_resp

Approximate minimum time, in milliseconds, taken to complete a monitoring event on a resource, for all resources configured to be monitored on this node.

fsafe.srm.max_resp
fsafe.srm.all.max_resp

Approximate maximum time, in milliseconds, taken to complete a monitoring event on a resource, for all resources configured to be monitored on this node.

fsafe.srm.last_resp
fsafe.srm.all.last_resp

Approximate time, in milliseconds, taken in completing the most recent monitoring event on a resource, for all resources configured to be monitored on this node.

fsafe.srm.cumm_timeouts
fsafe.srm.all.cumm_timeouts

Cumulative number of resource monitoring events that have timed out, for all resources configured to be monitored on this node, since the time ha_srmd has started.

fsafe.srm.recent.cumm_timeouts

Cumulative number of resource monitoring events that have timed out, for all resources configured to be monitored on this node, since a data collection reset (via fsafe.control.reset_srm).

fsafe.srm.histo_20
fsafe.srm.all.histo_20

Fraction of monitoring events that have been received within 0-20% of the response time from 0 milliseconds to fsafe.srm.timeout, for all resources configured to be monitored on this node, since the time ha_srmd has started.

fsafe.srm.recent.histo_20

Fraction of monitoring events that have been received within 0-20% of the response time from 0 milliseconds to fsafe.srm.timeout, for all resources configured to be monitored on this node, since a data collection reset (via fsafe.control.reset_srm).

fsafe.srm.histo_40
fsafe.srm.all.histo_40

Fraction of monitoring events that have been received within 20-40% of the response time from 0 milliseconds to fsafe.srm.timeout, for all resources configured to be monitored on this node, since the time ha_srmd has started.

fsafe.srm.recent.histo_40

Fraction of monitoring events that have been received within 20-40% of the response time from 0 milliseconds to fsafe.srm.timeout, for all resources configured to be monitored on this node, since a data collection reset (via fsafe.control.reset_srm).

fsafe.srm.histo_60
fsafe.srm.all.histo_60

Fraction of monitoring events that have been received within 40-60% of the response time from 0 milliseconds to fsafe.srm.timeout, for all resources configured to be monitored on this node, since the time ha_srmd has started.

fsafe.srm.recent.histo_60

Fraction of monitoring events that have been received within 40-60% of the response time from 0 milliseconds to fsafe.srm.timeout, for all resources configured to be monitored on this node, since a data collection reset (via fsafe.control.reset_srm).

fsafe.srm.histo_80
fsafe.srm.all.histo_80

Fraction of monitoring events that have been received within 60-80% of the response time from 0 milliseconds to fsafe.srm.timeout, for all resources configured to be monitored on this node, since the time ha_srmd has started.

fsafe.srm.recent.histo_80

Fraction of monitoring events that have been received within 60-80% of the response time from 0 milliseconds to fsafe.srm.timeout, for all resources configured to be monitored on this node, since a data collection reset (via fsafe.control.reset_srm).

fsafe.srm.histo_100
fsafe.srm.all.histo_100

Fraction of monitoring events that have been received within 80-100% of the response time from 0 milliseconds to fsafe.srm.timeout, for all resources configured to be monitored on this node, since the time ha_srmd has started.

fsafe.srm.recent.histo_100

Fraction of monitoring events that have been received within 80-100% of the response time from 0 milliseconds to fsafe.srm.timeout, for all resources configured to be monitored on this node, since a data collection reset (via fsafe.control.reset_srm).
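The five histo_* metrics above split the span from 0 milliseconds to fsafe.srm.timeout into equal fifths. The following Python sketch illustrates that bucketing. The function names are hypothetical, and two details are assumptions rather than documented pmdafsafe behavior: a response that lands exactly on a boundary is counted in the lower bucket, and the fractions are taken over all events, including those that timed out.

```python
import math
from collections import Counter

BUCKETS = ("histo_20", "histo_40", "histo_60", "histo_80", "histo_100")


def histo_bucket(resp_ms, timeout_ms):
    """Return the histo_* suffix for one response time, or None if outside 0..timeout.

    Assumption: a response exactly on a boundary (for example 20% of the
    timeout) is counted in the lower bucket.
    """
    if timeout_ms <= 0:
        raise ValueError("timeout_ms must be positive")
    if resp_ms < 0 or resp_ms > timeout_ms:
        return None  # timed out or invalid: not part of the histogram
    # Fifths of the timeout: [0, 20%] -> 0, (20%, 40%] -> 1, and so on.
    index = max(math.ceil(5 * resp_ms / timeout_ms) - 1, 0)
    return BUCKETS[index]


def histo_fractions(resp_times_ms, timeout_ms):
    """Fraction of all monitoring events that fall in each bucket.

    Assumption: the denominator is every event, including timed-out ones.
    """
    counts = Counter(histo_bucket(r, timeout_ms) for r in resp_times_ms)
    total = len(resp_times_ms)
    return {b: (counts[b] / total if total else 0.0) for b in BUCKETS}
```

With a timeout of 1000 ms, a response of 300 ms would be counted under histo_40 in this sketch, and a response of 1500 ms would fall outside every bucket.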

fsafe.srm.frac_timeouts
fsafe.srm.all.frac_timeouts

Fraction of monitoring events that have timed out before declaring that resource as failed, for all resources configured to be monitored on this node, since the time the resources have last been available.

fsafe.srm.recent.frac_timeouts

Fraction of monitoring events that have timed out before declaring that resource as failed, for all resources configured to be monitored on this node, since a data collection reset (via fsafe.control.reset_srm).

fsafe.srm.frac_cumm_timeouts
fsafe.srm.all.frac_cumm_timeouts

Fraction of cumulative number of monitoring events that have timed out, for all resources configured to be monitored on this node, since the time ha_srmd has started.

fsafe.srm.recent.frac_cumm_timeouts

Fraction of cumulative number of monitoring events that have timed out, for all resources configured to be monitored on this node, since a data collection reset (via fsafe.control.reset_srm).

fsafe.srm.recent.timestamp

The time when a new collection of statistics was started for the fsafe.srm.recent.* metrics, after issuing a store to the metric fsafe.control.reset_srm.

fsafe.config.clustername

The name of this cluster.

fsafe.config.hostname

The names of all hosts in the cluster specified by fsafe.config.clustername.

fsafe.config.nnodes

Number of nodes in the cluster specified by fsafe.config.clustername.

fsafe.config.cms.interval

The cluster heartbeat event interval, in milliseconds.

fsafe.config.cms.timeout

The heartbeat event timeout for all nodes in the cluster, in milliseconds.

fsafe.config.cms.nbuckets

The number of heartbeat event response intervals per node, where each interval covers a time equal to the heartbeat event interval (fsafe.config.cms.interval) for segments of time until the heartbeat event timeout (fsafe.config.cms.timeout).
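The relationship among fsafe.config.cms.interval, fsafe.config.cms.timeout, and fsafe.config.cms.nbuckets can be illustrated with a short Python sketch. The function name is hypothetical, and the rounding is an assumption: when the timeout is not an exact multiple of the interval, this sketch adds a final, shorter interval that ends at the timeout. The PMDA itself may round differently.

```python
import math


def cms_bucket_edges(interval_ms, timeout_ms):
    """Return (start_ms, end_ms) pairs for the heartbeat response intervals.

    Assumption: the final interval is cut short at the timeout when the
    timeout is not an exact multiple of the heartbeat interval.
    """
    if interval_ms <= 0 or timeout_ms <= 0:
        raise ValueError("interval_ms and timeout_ms must be positive")
    nbuckets = math.ceil(timeout_ms / interval_ms)
    return [
        (i * interval_ms, min((i + 1) * interval_ms, timeout_ms))
        for i in range(nbuckets)
    ]
```

Under these assumptions, an interval of 1000 ms and a timeout of 5000 ms give five intervals, which is the count that fsafe.config.cms.nbuckets would be expected to report.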

fsafe.control.debug

Debugging flags for the fsafe PMDA. Storing a decimal integer value to this metric sets the flags, which control what information is written to the fsafe PMDA's log (normally /var/adm/pcplog/fsafe.log).

Reading this metric will return the currently assigned debugging flags as a decimal integer.

fsafe.control.reset_cms

Resets data collection statistics for all metrics gathered from ha_cmsd. When this metric is stored to, the data provided is ignored; it is the act of storing to this metric which causes the reset.

Reading this metric will return zero (0).

fsafe.control.reset_srm

Resets data collection statistics for all metrics gathered from ha_srmd. When this metric is stored to, the data provided is ignored; it is the act of storing to this metric which causes the reset.

Reading this metric will return zero (0).

fsafe.control.retry

Sets the number of retries permitted when contacting ha_cmsd or ha_srmd, including when the daemons indicate that they are busy.

Depending on which metrics are being read, and which daemon is required to obtain their values, some metrics may not be available, possibly producing the message "Try again. Information not currently available." This metric can be adjusted to increase the number of retries permitted when collecting metrics before giving up and displaying this message. A retry is performed approximately once every 100 ms.

Note that setting this metric does not alter how the fsafe PMDA handles more serious errors from ha_cmsd or ha_srmd.

Reading this metric will return the current retry count.
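The retry behavior described for fsafe.control.retry can be pictured with the following Python sketch. It is not the PMDA's implementation: DaemonBusy, fetch_with_retry, and the fetch callable are hypothetical names, and treating the retry count as additional attempts after one initial attempt is an assumption.

```python
import time


class DaemonBusy(Exception):
    """Hypothetical signal that ha_cmsd or ha_srmd is busy."""


def fetch_with_retry(fetch, retries, delay_s=0.1, sleep=time.sleep):
    """Call fetch() until it succeeds or the retry budget is spent.

    retries plays the role of fsafe.control.retry; delay_s mirrors the
    approximately 100 ms pause between attempts.
    """
    for attempt in range(retries + 1):  # one initial try plus the retries
        try:
            return fetch()
        except DaemonBusy:
            if attempt < retries:
                sleep(delay_s)  # wait before the next attempt
    raise RuntimeError("Try again. Information not currently available.")
```

In this sketch a larger retry count only lengthens the wait for a busy daemon. Any other exception propagates immediately, which parallels the note that more serious errors are handled separately.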

fsafe.cms.expected

The number of heartbeat events expected to have been received for each node in the cluster (excluding the collector host), since the time ha_cmsd has started.

fsafe.cms.recent.expected

The number of heartbeat events expected to have been received for each node in the cluster (excluding the collector host), since a data collection reset (via fsafe.control.reset_cms).

fsafe.cms.received

The number of heartbeat events actually received for each node in the cluster (excluding the collector host), since the time ha_cmsd has started.

fsafe.cms.recent.received

The number of heartbeat events actually received for each node in the cluster (excluding the collector host), since a data collection reset (via fsafe.control.reset_cms).

fsafe.cms.missed

The number of heartbeat events determined not to have been received for each node in the cluster (excluding the collector host), since the time ha_cmsd has started.

fsafe.cms.recent.missed

The number of heartbeat events determined not to have been received for each node in the cluster (excluding the collector host), since a data collection reset (via fsafe.control.reset_cms).

fsafe.cms.histo

Histogram of heartbeat event response times for events that have occurred within discrete heartbeat response intervals for each node in the cluster (excluding the collector host), since the time ha_cmsd has started.

The heartbeat response intervals are defined to be equal to the configured heartbeat event interval (fsafe.config.cms.interval), for a number of intervals up to the configured heartbeat event timeout (fsafe.config.cms.timeout).

fsafe.cms.recent.histo

Histogram of heartbeat event response times for events that have occurred within discrete heartbeat response intervals for each node in the cluster (excluding the collector host), since a data collection reset (via fsafe.control.reset_cms).

The heartbeat response intervals are defined to be equal to the configured heartbeat event interval (fsafe.config.cms.interval), for a number of intervals up to the configured heartbeat event timeout (fsafe.config.cms.timeout).

fsafe.cms.frac_received

Fraction of heartbeat events received over all expected events for each node in the cluster, since the time ha_cmsd has started.

fsafe.cms.recent.frac_received

Fraction of heartbeat events received over all expected events for each node in the cluster, since a data collection reset (via fsafe.control.reset_cms).

fsafe.cms.frac_missed

Fraction of heartbeat events determined not to have been received over all expected events for each node in the cluster, since the time ha_cmsd has started.

fsafe.cms.recent.frac_missed

Fraction of heartbeat events determined not to have been received over all expected events for each node in the cluster, since a data collection reset (via fsafe.control.reset_cms).

fsafe.cms.recent.timestamp

The time when a new collection of statistics was started for the fsafe.cms.recent.* metrics, after issuing a store to the metric fsafe.control.reset_cms.

fsafe.cms.pernode.expected

The number of heartbeat events expected to have been received for a particular node in the cluster, since the time ha_cmsd has started.

fsafe.cms.recent.pernode.expected

The number of heartbeat events expected to have been received for a particular node in the cluster, since a data collection reset (via fsafe.control.reset_cms).

fsafe.cms.pernode.received

The number of heartbeat events actually received for a particular node in the cluster, since the time ha_cmsd has started.

fsafe.cms.recent.pernode.received

The number of heartbeat events actually received for a particular node in the cluster, since a data collection reset (via fsafe.control.reset_cms).

fsafe.cms.pernode.missed

The number of heartbeat events determined not to have been received for a particular node in the cluster, since the time ha_cmsd has started.

fsafe.cms.recent.pernode.missed

The number of heartbeat events determined not to have been received for a particular node in the cluster, since a data collection reset (via fsafe.control.reset_cms).

fsafe.cms.pernode.histo

Histogram of heartbeat event response times for events that have occurred within discrete heartbeat response intervals for a particular node in the cluster, since the time ha_cmsd has started.

The heartbeat response intervals are defined to be equal to the configured heartbeat event interval (fsafe.config.cms.interval), for a number of intervals up to the configured heartbeat event timeout (fsafe.config.cms.timeout).

fsafe.cms.recent.pernode.histo

Histogram of heartbeat event response times for events that have occurred within discrete heartbeat response intervals for a particular node in the cluster, since a data collection reset (via fsafe.control.reset_cms).

The heartbeat response intervals are defined to be equal to the configured heartbeat event interval (fsafe.config.cms.interval), for a number of intervals up to the configured heartbeat event timeout (fsafe.config.cms.timeout).

fsafe.cms.pernode.frac_received

Fraction of heartbeat events received over all expected events for a particular node in the cluster, since the time ha_cmsd has started.

fsafe.cms.recent.pernode.frac_received

Fraction of heartbeat events received over all expected events for a particular node in the cluster, since a data collection reset (via fsafe.control.reset_cms).

fsafe.cms.pernode.frac_missed

Fraction of heartbeat events determined not to have been received over all expected events for a particular node in the cluster, since the time ha_cmsd has started.

fsafe.cms.recent.pernode.frac_missed

Fraction of heartbeat events determined not to have been received over all expected events for a particular node in the cluster, since a data collection reset (via fsafe.control.reset_cms).
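The frac_received and frac_missed metrics are described as fractions over all expected heartbeat events. The following Python sketch shows that arithmetic for one node. The function name is hypothetical, and returning 0.0 for both fractions when no events are expected yet is an assumption, not documented behavior.

```python
def heartbeat_fractions(expected, received, missed):
    """Return (frac_received, frac_missed) for one node.

    Assumption: both fractions are reported as 0.0 when no heartbeat
    events were expected yet.
    """
    if min(expected, received, missed) < 0:
        raise ValueError("counts must be non-negative")
    if expected == 0:
        return 0.0, 0.0
    return received / expected, missed / expected
```

For example, 190 received and 10 missed events out of 200 expected correspond to fractions of 0.95 and 0.05.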