This appendix lists the metrics implemented by pmdafsafe.
The fsafe.srm.all.* metrics are the same as the fsafe.srm.* metrics, except that they report the latest values obtained for all resources, even if ha_srmd or any of the resources themselves are not available.
Table B-1. Performance Co-Pilot Metrics
Metric | Description | |
---|---|---|
| Latest status of a monitoring event performed on a resource, for all resources configured to be monitored on this node. | |
| The prescribed timeout, in milliseconds, for monitoring a resource. | |
| Number of times a resource has been monitored, for all resources configured to be monitored on this node, since the time ha_srmd was started. | |
| Number of times a resource has been monitored, for all resources configured to be monitored on this node, since a data collection reset (via fsafe.control.reset_srm). | |
| Number of resource monitoring events that have timed out before declaring that resource as failed, for all resources configured to be monitored on this node, since the time the resources have last been available. | |
fsafe.srm.recent.timeouts | Number of resource monitoring events that have timed out before declaring that resource as failed, for all resources configured to be monitored on this node, since a data collection reset (via fsafe.control.reset_srm). | |
| Approximate minimum time, in milliseconds, taken to complete a monitoring event on a resource, for all resources configured to be monitored. | |
| Approximate maximum time, in milliseconds, taken to complete a monitoring event on a resource, for all resources configured to be monitored on this node. | |
| Approximate time, in milliseconds, taken in completing the most recent monitoring event on a resource, for all resources configured to be monitored on this node. | |
| Cumulative number of resource monitoring events that have timed out, for all resources configured to be monitored on this node, since the time ha_srmd has started. | |
fsafe.srm.recent.cumm_timeouts | Cumulative number of resource monitoring events that have timed out, for all resources configured to be monitored on this node, since a data collection reset (via fsafe.control.reset_srm). | |
| Fraction of monitoring events that have been received within 0-20% of the response time from 0 milliseconds to fsafe.srm.timeout, for all resources configured to be monitored on this node, since the time ha_srmd has started. | |
fsafe.srm.recent.histo_20 | Fraction of monitoring events that have been received within 0-20% of the response time from 0 milliseconds to fsafe.srm.timeout, for all resources configured to be monitored on this node, since a data collection reset (via fsafe.control.reset_srm). | |
| Fraction of monitoring events that have been received within 20-40% of the response time from 0 milliseconds to fsafe.srm.timeout, for all resources configured to be monitored on this node, since the time ha_srmd has started. | |
fsafe.srm.recent.histo_40 | Fraction of monitoring events that have been received within 20-40% of the response time from 0 milliseconds to fsafe.srm.timeout, for all resources configured to be monitored on this node, since a data collection reset (via fsafe.control.reset_srm). | |
| Fraction of monitoring events that have been received within 40-60% of the response time from 0 milliseconds to fsafe.srm.timeout, for all resources configured to be monitored on this node, since the time ha_srmd(1M) has started. | |
fsafe.srm.recent.histo_60 | Fraction of monitoring events that have been received within 40-60% of the response time from 0 milliseconds to fsafe.srm.timeout, for all resources configured to be monitored on this node, since a data collection reset (via fsafe.control.reset_srm). | |
| Fraction of monitoring events that have been received within 60-80% of the response time from 0 milliseconds to fsafe.srm.timeout, for all resources configured to be monitored on this node, since the time ha_srmd has started. | |
fsafe.srm.recent.histo_80 | Fraction of monitoring events that have been received within 60-80% of the response time from 0 milliseconds to fsafe.srm.timeout, for all resources configured to be monitored on this node, since a data collection reset (via fsafe.control.reset_srm). | |
| Fraction of monitoring events that have been received within 80-100% of the response time from 0 milliseconds to fsafe.srm.timeout, for all resources configured to be monitored on this node, since the time ha_srmd has started. | |
fsafe.srm.recent.histo_100 | Fraction of monitoring events that have been received within 80-100% of the response time from 0 milliseconds to fsafe.srm.timeout, for all resources configured to be monitored on this node, since a data collection reset (via fsafe.control.reset_srm). | |
| Fraction of monitoring events that have timed out before declaring that resource as failed, for all resources configured to be monitored on this node, since the time the resources have last been available. | |
fsafe.srm.recent.frac_timeouts | Fraction of monitoring events that have timed out before declaring that resource as failed, for all resources configured to be monitored on this node, since a data collection reset (via fsafe.control.reset_srm). | |
| Fraction of cumulative number of monitoring events that have timed out, for all resources configured to be monitored on this node, since the time ha_srmd has started. | |
fsafe.srm.recent.frac_cumm_timeouts | Fraction of cumulative number of monitoring events that have timed out, for all resources configured to be monitored on this node, since a data collection reset (via fsafe.control.reset_srm). | |
fsafe.srm.recent.timestamp | The time when a new collection of statistics was started for the fsafe.srm.recent.* metrics, after issuing a store to the metric fsafe.control.reset_srm. | |
fsafe.config.clustername | The name of this cluster. | |
fsafe.config.hostname | The names of all hosts in the cluster specified by fsafe.config.clustername. | |
fsafe.config.nnodes | Number of nodes in the cluster specified by fsafe.config.clustername. | |
fsafe.config.cms.interval | The cluster heartbeat event interval, in milliseconds. | |
fsafe.config.cms.timeout | The heartbeat event timeout for all nodes in the cluster, in milliseconds. | |
fsafe.config.cms.nbuckets | The number of heartbeat event response intervals per node, where each interval covers a time equal to the heartbeat event interval (fsafe.config.cms.interval) for segments of time until the heartbeat event timeout (fsafe.config.cms.timeout). | |
fsafe.control.debug | Sets the debugging flags for the fsafe PMDA when a decimal integer value is stored to this metric. The flags determine what information is written to the fsafe PMDA's log (normally /var/adm/pcplog/fsafe.log). Reading this metric returns the currently assigned debugging flags as a decimal integer. | |
fsafe.control.reset_cms | Resets data collection statistics for all metrics gathered from ha_cmsd. When this metric is stored to, the data provided is ignored; it is the act of storing to this metric which causes the reset. Reading this metric will return zero (0). | |
fsafe.control.reset_srm | Resets data collection statistics for all metrics gathered from ha_srmd. When this metric is stored to, the data provided is ignored; it is the act of storing to this metric which causes the reset. Reading this metric will return zero (0). | |
fsafe.control.retry | Sets the number of retries permitted when contacting ha_cmsd or ha_srmd, and when the daemons indicate that they are busy. Depending on which metrics are being read, and which daemon is required to obtain their values, some metrics may not be available, possibly producing the message "Try again. Information not currently available." This metric can be adjusted to increase the number of retries permitted when collecting metrics before giving up and displaying this message. A retry is performed approximately once every 100 ms. Note that setting this metric does not alter how the fsafe PMDA handles more serious errors from ha_cmsd or ha_srmd. Reading this metric will return the current retry count. | |
fsafe.cms.expected | The number of heartbeat events expected to have been received for each node in the cluster (excluding the collector host), since the time ha_cmsd has started. | |
fsafe.cms.recent.expected | The number of heartbeat events expected to have been received for each node in the cluster (excluding the collector host), since a data collection reset (via fsafe.control.reset_cms). | |
fsafe.cms.received | The number of heartbeat events actually received for each node in the cluster (excluding the collector host), since the time ha_cmsd has started. | |
fsafe.cms.recent.received | The number of heartbeat events actually received for each node in the cluster (excluding the collector host), since a data collection reset (via fsafe.control.reset_cms). | |
fsafe.cms.missed | The number of heartbeat events determined not to have been received for each node in the cluster (excluding the collector host), since the time ha_cmsd has started. | |
fsafe.cms.recent.missed | The number of heartbeat events determined not to have been received for each node in the cluster (excluding the collector host), since a data collection reset (via fsafe.control.reset_cms). | |
fsafe.cms.histo | Histogram of heartbeat event response times for events that have occurred within discrete heartbeat response intervals for each node in the cluster (excluding the collector host), since the time ha_cmsd has started. The heartbeat response intervals are defined to be equal to the configured heartbeat event interval (fsafe.config.cms.interval), for a number of intervals up to the configured heartbeat event timeout (fsafe.config.cms.timeout). | |
fsafe.cms.recent.histo | Histogram of heartbeat event response times for events that have occurred within discrete heartbeat response intervals for each node in the cluster (excluding the collector host), since a data collection reset (via fsafe.control.reset_cms). The heartbeat response intervals are defined to be equal to the configured heartbeat event interval (fsafe.config.cms.interval), for a number of intervals up to the configured heartbeat event timeout (fsafe.config.cms.timeout). | |
fsafe.cms.frac_received | Fraction of heartbeat events received over all expected events for each node in the cluster, since the time ha_cmsd has started. | |
fsafe.cms.recent.frac_received | Fraction of heartbeat events received over all expected events for each node in the cluster, since a data collection reset (via fsafe.control.reset_cms). | |
fsafe.cms.frac_missed | Fraction of heartbeat events determined not to have been received over all expected events for each node in the cluster, since the time ha_cmsd has started. | |
fsafe.cms.recent.frac_missed | Fraction of heartbeat events determined not to have been received over all expected events for each node in the cluster, since a data collection reset (via fsafe.control.reset_cms). | |
fsafe.cms.recent.timestamp | The time when a new collection of statistics was started for the fsafe.cms.recent.* metrics, after issuing a store to the metric fsafe.control.reset_cms. | |
fsafe.cms.pernode.expected | The number of heartbeat events expected to have been received for a particular node in the cluster, since the time ha_cmsd has started. | |
fsafe.cms.recent.pernode.expected | The number of heartbeat events expected to have been received for a particular node in the cluster, since a data collection reset (via fsafe.control.reset_cms). | |
fsafe.cms.pernode.received | The number of heartbeat events actually received for a particular node in the cluster, since the time ha_cmsd has started. | |
fsafe.cms.recent.pernode.received | The number of heartbeat events actually received for a particular node in the cluster, since a data collection reset (via fsafe.control.reset_cms). | |
fsafe.cms.pernode.missed | The number of heartbeat events determined not to have been received for a particular node in the cluster, since the time ha_cmsd has started. | |
fsafe.cms.recent.pernode.missed | The number of heartbeat events determined not to have been received for a particular node in the cluster, since a data collection reset (via fsafe.control.reset_cms). | |
fsafe.cms.pernode.histo | Histogram of heartbeat event response times for events that have occurred within discrete heartbeat response intervals for a particular node in the cluster, since the time ha_cmsd has started. The heartbeat response intervals are defined to be equal to the configured heartbeat event interval (fsafe.config.cms.interval), for a number of intervals up to the configured heartbeat event timeout (fsafe.config.cms.timeout). | |
fsafe.cms.recent.pernode.histo | Histogram of heartbeat event response times for events that have occurred within discrete heartbeat response intervals for a particular node in the cluster, since a data collection reset (via fsafe.control.reset_cms). The heartbeat response intervals are defined to be equal to the configured heartbeat event interval (fsafe.config.cms.interval), for a number of intervals up to the configured heartbeat event timeout (fsafe.config.cms.timeout). | |
fsafe.cms.pernode.frac_received | Fraction of heartbeat events received over all expected events for a particular node in the cluster, since the time ha_cmsd has started. | |
fsafe.cms.recent.pernode.frac_received | Fraction of heartbeat events received over all expected events for a particular node in the cluster, since a data collection reset (via fsafe.control.reset_cms). | |
fsafe.cms.pernode.frac_missed | Fraction of heartbeat events determined not to have been received over all expected events for a particular node in the cluster, since the time ha_cmsd has started. | |
fsafe.cms.recent.pernode.frac_missed | Fraction of heartbeat events determined not to have been received over all expected events for a particular node in the cluster, since a data collection reset (via fsafe.control.reset_cms). | |
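
The fractional metrics in the table can be illustrated with a small Python sketch. It is not the fsafe PMDA's implementation; it only shows how the five fsafe.srm histogram bands and the fsafe.cms fractions relate to the underlying counts as the descriptions read. The function names, the sample response times, and the boundary handling (a response time that falls exactly on a band edge is placed in the higher band, and a response time equal to or greater than the timeout is counted as a timeout) are assumptions made for illustration.

```python
from typing import Dict, Sequence, Tuple

# Band names mirror the fsafe.srm.histo_* metric suffixes.
BANDS = ("histo_20", "histo_40", "histo_60", "histo_80", "histo_100")


def srm_response_fractions(response_ms: Sequence[float],
                           timeout_ms: float) -> Dict[str, float]:
    """Fraction of monitoring events per 20% band of the range 0..timeout_ms.

    Assumption: events at or beyond timeout_ms count as timeouts
    and are reported under "frac_timeouts".
    """
    if timeout_ms <= 0:
        raise ValueError("timeout_ms must be positive")
    counts = dict.fromkeys(BANDS + ("frac_timeouts",), 0)
    for t in response_ms:
        if t >= timeout_ms:
            counts["frac_timeouts"] += 1
        else:
            # Scale to 0..5 and clamp to a valid band index (0..4).
            band = min(max(int(5 * t / timeout_ms), 0), 4)
            counts[BANDS[band]] += 1
    total = len(response_ms)
    return {name: (n / total if total else 0.0) for name, n in counts.items()}


def cms_fractions(expected: int, received: int, missed: int) -> Tuple[float, float]:
    """Return (frac_received, frac_missed), each taken over expected events."""
    if expected == 0:
        return 0.0, 0.0
    return received / expected, missed / expected


if __name__ == "__main__":
    # Hypothetical response times (ms) against a 1000 ms timeout.
    print(srm_response_fractions([100, 250, 450, 900, 1000], 1000))
    # Hypothetical heartbeat counts for one node.
    print(cms_fractions(expected=200, received=190, missed=10))
```

Both functions return fractions in the range 0 to 1, matching the "Fraction of ..." wording used for the histo_* and frac_* metrics. The same arithmetic applies to the recent variants, with the counts taken since the last data collection reset.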