This appendix discusses SYSLOG messages and the log file messages that FailSafe records for each daemon.
FailSafe logs both normal operations and critical errors to SYSLOG, as well as to individual log files for each log group.
In general, errors in the SYSLOG file take the following form:
timestamp priority_&_facility:hostname process[ID]: <internal_info> CODE message_text
For example:
Sep 7 11:12:59 6X:fs0 cli[5830]: < E clconf 0> CI_IPCERR_NOSERVER, clconf ipc: ipcclnt_connect() failed, file /var/cluster/ha/comm/clconfd-ipc_fs0
The following table shows the parts of the preceding SYSLOG message.
Table C-1. SYSLOG Error Message Format
Content | Part | Meaning |
---|---|---|
Sep 7 11:12:59 | Timestamp | September 7 at 11:12 AM. |
6X | Facility and level | 6X indicates an informational message. See the syslogd man page and the file /usr/include/sys/syslog.h. |
fs0 | Node name | The node whose logical name is fs0 is the node on which the process is running. |
cli[5830] | Process[ID] | The process sending the message is cli and its process ID number is 5830. |
<CI>E clconf 0 | Internal information: message source, logging subsystem, and thread ID | The message is from the cluster infrastructure (CI). E indicates that it is an error. The clconf command is the logging subsystem. 0 indicates that it is not multithreaded. |
CI_IPCERR_NOSERVER, clconf ipc | Internal error code | Information about the type of message; in this case, a message indicating that the server is missing. No error code is printed if it is a normal message. |
ipcclnt_connect() failed, file /var/cluster/ha/comm/clconfd-ipc_fs0 | Message text | A connection failed for the clconfd-ipc_fs0 file. |
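The format in Table C-1 can be split mechanically. The following Python sketch is one possible way to do so and is not part of FailSafe; the function name `parse_syslog_line`, the regular expressions, and the field names are assumptions made for illustration. It treats the first comma-terminated uppercase token after the internal information as the error code, which simplifies the table (the table groups `clconf ipc` with the code), and it returns `None` for lines that lack a process[ID] field.

```python
import re

# Hypothetical pattern for the SYSLOG layout described in Table C-1:
# timestamp, priority_&_facility:hostname, process[ID]:, <internal_info>, remainder.
SYSLOG_RE = re.compile(
    r"^(?P<timestamp>\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+"
    r"(?P<priority>\w+):(?P<node>\S+)\s+"
    r"(?P<process>[\w.-]+)\[(?P<pid>\d+)\]:\s+"
    r"<(?P<internal>[^>]*)>\s+"
    r"(?P<rest>.*)$"
)

# Assumption: an error code is an uppercase identifier (optionally followed
# by :number, as in CI_ERR_SYS:125) that ends at the first comma.
CODE_RE = re.compile(r"^(?P<code>[A-Z][A-Z0-9_]*(?::\d+)?),\s*(?P<text>.*)$")


def parse_syslog_line(line):
    """Return a dict of message parts, or None if the line does not match."""
    match = SYSLOG_RE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    rest = fields.pop("rest")
    code_match = CODE_RE.match(rest)
    if code_match:
        fields["code"] = code_match.group("code")
        fields["text"] = code_match.group("text")
    else:
        # Normal messages carry no error code.
        fields["code"] = None
        fields["text"] = rest
    fields["pid"] = int(fields["pid"])
    # Whitespace-separated internal fields, for example ['E', 'clconf', '0'].
    fields["internal"] = fields["internal"].split()
    return fields


if __name__ == "__main__":
    example = (
        "Sep 7 11:12:59 6X:fs0 cli[5830]: < E clconf 0> CI_IPCERR_NOSERVER, "
        "clconf ipc: ipcclnt_connect() failed, "
        "file /var/cluster/ha/comm/clconfd-ipc_fs0"
    )
    print(parse_syslog_line(example))
```

Keeping the code optional lets the same function handle both error messages and normal messages, which print no error code.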
Some of the following sections present only the message identifiers and text.
When you start HA services successfully, you will see a series of messages. The following shows the messages you would see after successfully starting HA services for nodes named hans1 and hans2 (line breaks added for readability):
Aug 14 15:01:23 6X:hans1 ha_cmsd[6431]: < N cms 0> FailSafe: ha_cmsd process started.
Aug 14 15:01:24 6X:hans1 ha_ifd[6430]: < N ifd 0> FailSafe: ha_ifd monitoring network interfaces
Aug 14 15:02:00 6X:hans1 ha_cmsd[6431]: < N cms 0> FailSafe Node Confirmed Membership: sqn 1 G_sqn = 1, ack false change 5
Aug 14 15:02:00 5B:hans1 Configuration changes
Aug 14 15:02:00 5B:hans1 Membership changes
Aug 14 15:02:00 5B:hans1 node hans1 [1] : UP incarnation 69 age 1:0
Aug 14 15:02:00 5B:hans1 node hans2 [2] : UP incarnation 168 age 1:0
Aug 14 15:02:04 6X:hans1 ha_gcd[6435]: < N gcd 0> FailSafe: ha_gcd initialization complete
Aug 14 15:02:17 6X:hans1 ha_srmd[6424]: < N srm 2> FailSafe: SRM ready to accept clients
Aug 14 15:02:48 6X:hans1 ha_fsd[6408]: < N fsd 0> FailSafe initialization complete
For all cli messages, only the last message from the command (which begins with CLI private command failed) is meaningful. You can ignore all other cli messages.
The following are example errors from the cli daemon.
CI_ERR_INVAL, CLI private command: failed (Machine (fs0) exists.)
You tried to create a new node definition with logical name fs0; however, that node name already exists in the cluster database. Choose a different name.
CI_ERR_INVAL, CLI private command: failed (IP address (128.162.89.33) specified for control network is fs0 is assigned to control network of machine (fs0).)
You specified the same IP address for two different control networks of node fs0. Use a different IP address.
CI_FAILURE, CLI private command: failed (Unable to validate hostname of machine (fs0) being modified.)
The DNS resolution of the fs0 name failed. To solve this problem, add an entry for fs0 in /etc/hosts on all nodes.
CI_IPCERR_NOPULSE, CLI private command: failed (Cluster state is UNKNOWN.)
The cluster state is UNKNOWN and the command could not complete. This is a transient error. However, if it persists, stop and restart the cluster daemons.
The following errors are sent by the crsd daemon.
CI_ERR_NOTFOUND, No logging entries found for group crsd, no logging will take place - Database entry #global#logging#crsd not found.
No crsd logging definition was found in the cluster database. This can happen if you start cluster processes without creating the database.
CI_ERR_RETRY, Could not find machine listing.
The crsd daemon could not find the local node in the cluster database. You can ignore this message if the local node definition has not yet been created.
CI_ERR_SYS:125, bind() failed.
The sgi-crsd port number in the /etc/services file is not unique, or there is no sgi-crsd entry in the file.
CI_FAILURE, Entry for sgi-crsd is missing in /etc/services.
The sgi-crsd entry is missing from the /etc/services file.
CI_FAILURE, Initialization failed, exiting.
This message ends a sequence of messages; see the messages that precede it to determine the cause of the failure.
The following errors are sent by the cmond daemon.
The following errors are sent by the fs2d daemon.
Error 9 writing CDB info attribute for node #cluster#elaine#machines#fs2#HA#status
An internal error occurred when writing to the cluster database. Retry the operation. If the error persists, stop and restart the cluster daemons. If the problem persists, clear the database, reboot, and re-create the database.
Error 9 writing CDB string value for node #cluster#elaine#machines#fs2#HA#status
An internal error occurred when writing to the cluster database. Retry the operation. If the error persists, stop and restart the cluster daemons. If the problem persists, clear the database, reboot, and re-create the database.
Failed to update CDB for node #cluster#elaine#HA#FileSystems#fs1#FSStatus
An internal error occurred when writing to the cluster database. Retry the operation. If the error persists, stop and restart the cluster daemons. If the problem persists, clear the database, reboot, and re-create the database.
Failed to update CDB for node #cluster#elaine#machines#fs2#HA#status
An internal error occurred when writing to the cluster database. Retry the operation. If the error persists, stop and restart the cluster daemons. If the problem persists, clear the database, reboot, and re-create the database.
Machine 101 machine_sync failed with lock_timeout error
The fs2d daemon was not able to synchronize the cluster database and the sync process timed out. This operation will be retried automatically by fs2d.
The following error is sent by the ha_srmd daemon:
Executable /var/cluster/ha/resource_types/template/stop does not have execute file permission, Skipping resource type template configuration because of configuration errors, Fix the configuration errors and send SIGHUP signal to ha_srmd process on all nodes in the FailSafe cluster
The stop action script does not have execute permission and therefore cannot be run. You must change the mode of the script to allow execute permission and then send a SIGHUP signal to the ha_srmd process on each node so that ha_srmd will reread the resource type configuration. If ha_srmd finds errors in the resource type configuration, errors will be sent to the SYSLOG or ha_srmd logs. Use the following command line on each node:
FailSafe maintains logs for each of the FailSafe daemons.
Log file messages take the following form:
daemon_log timestamp internal_process: message_text
For example:
cad_log:Thu Sep 2 17:25:06.092 cclconf_poll_clconfd: clconf_poll failed with error CI_IPCERR_NOPULSE
Table C-2 shows the parts of the preceding message.
Table C-2. Log File Error Message Format
Content | Part | Meaning |
---|---|---|
cad_log | Daemon identifier | The message pertains to the cad daemon. |
Sep 2 17:25:06.092 | Timestamp and process ID | September 2 at 5:25 PM, process ID 92. |
cclconf_poll_clconfd | Internal process information | Internal process information |
clconf_poll failed with error CI_IPCERR_NOPULSE | Message text | The clconfd daemon could not be contacted to get an update on the cluster's status. |
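The log file format in Table C-2 can be split in a similar way. The sketch below is illustrative only and is not a FailSafe tool; the function name `parse_log_line`, the regular expression, and the field names are assumptions. The digits after the seconds are kept as a separate string named `suffix`; Table C-2 describes that part as the process ID.

```python
import re

# Hypothetical pattern for the layout in Table C-2:
# daemon_log:weekday timestamp.suffix internal_process: message_text
LOG_RE = re.compile(
    r"^(?P<daemon_log>\w+):"
    r"(?P<weekday>\w{3})\s+"
    r"(?P<timestamp>\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})"
    r"\.(?P<suffix>\d+)\s+"
    r"(?P<internal_process>\w+):\s*"
    r"(?P<text>.*)$"
)


def parse_log_line(line):
    """Return a dict of the parts of a log file message, or None."""
    match = LOG_RE.match(line.strip())
    return match.groupdict() if match else None


if __name__ == "__main__":
    example = (
        "cad_log:Thu Sep 2 17:25:06.092 cclconf_poll_clconfd: "
        "clconf_poll failed with error CI_IPCERR_NOPULSE"
    )
    print(parse_log_line(example))
```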
The following are examples of messages from /var/cluster/ha/log/cad_log:
ccacdb_cam_open: failed to open connection to CAM server error 4
Internal message that can be ignored because the cad operation is automatically retried.
ccamail_cam_open: failed to open connection to CAM server error 4
Internal message that can be ignored because the cad operation is automatically retried.
ccicdb_cam_open: failed to open connection to CAM server error 4
Internal message that can be ignored because the cad operation is automatically retried.
cclconf_cam_open: failed to open connection to CAM server error 4
Internal message that can be ignored because the cad operation is automatically retried.
cclconf_poll_clconfd: clconf_poll failed with error CI_IPCERR_NOCONN
The clconfd daemon is not running or is not responding to external requests. If the error persists, stop and restart the cluster daemons.
cclconf_poll_clconfd: clconf_poll failed with error CI_IPCERR_NOPULSE
The clconfd daemon could not be contacted to get an update on the cluster's status. If the error persists, stop and restart the cluster daemons.
cclconf_poll_clconfd: clconf_poll failed with error CI_CLCONFERR_LONELY
The clconfd daemon does not have enough information to provide an accurate status of the cluster. It will automatically restart with fresh data and resume its service.
csrm_cam_open: failed to open connection to CAM server error 4
Internal message that can be ignored because the cad operation is automatically retried.
Could not execute notification cmd. system() failed. Error: No child processes
No mail message was sent because cad could not fork processes. Stop and restart the cluster daemons.
error 3 sending event notification to client 0x000000021010f078
GUI process exited without cleaning up.
error 8 sending event notification to client 0x000000031010f138
GUI process exited without cleaning up.
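The cad_log examples above fall into a few groups: messages that can be ignored, messages that need action only if they persist, and messages that call for a restart. The following Python sketch shows one way to encode that guidance as a lookup; it is not a FailSafe utility, and the names `CAD_LOG_RULES` and `advise` as well as the matching patterns are assumptions chosen for illustration. The advice strings paraphrase the explanations given above.

```python
import re

# (pattern, advice) pairs distilled from the cad_log examples in this section.
# The patterns are illustrative assumptions, not an official message catalog.
CAD_LOG_RULES = [
    (re.compile(r"_cam_open: failed to open connection to CAM server"),
     "ignore: the cad operation is retried automatically"),
    (re.compile(r"clconf_poll failed with error CI_IPCERR_(NOCONN|NOPULSE)"),
     "if it persists, stop and restart the cluster daemons"),
    (re.compile(r"clconf_poll failed with error CI_CLCONFERR_LONELY"),
     "no action: clconfd restarts itself with fresh data"),
    (re.compile(r"Could not execute notification cmd"),
     "stop and restart the cluster daemons"),
    (re.compile(r"error \d+ sending event notification to client"),
     "GUI process exited without cleaning up"),
]


def advise(message):
    """Return the advice for the first matching rule, or None if none match."""
    for pattern, advice in CAD_LOG_RULES:
        if pattern.search(message):
            return advice
    return None


if __name__ == "__main__":
    print(advise("cclconf_poll_clconfd: clconf_poll failed with error CI_IPCERR_NOCONN"))
```

An ordered list of rules keeps the first match authoritative, so more specific patterns can be placed ahead of general ones if the table grows.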
The following are examples of messages from /var/cluster/ha/log/cli_Hostname:
CI_CONFERR_NOTFOUND, No machines found in the CDB.
The local node is not defined in the cluster database.
CI_ERR_INVAL, Cluster (bob) not defined
The cluster called bob is not present in the cluster database.
CI_ERR_INVAL, CLI private command: failed (Cluster (bob) not defined)
The cluster called bob is not present in the cluster database.
CI_IPCERR_AGAIN, ipcclnt_connect(): file /var/cluster/ha/comm/clconfd-ipc_fs0 lock failed - Permission denied
The underlying command line interface (CLI) was invoked by a login other than root. Use cmgr(1M) only when you are logged in as root.
CI_IPCERR_NOPULSE, CLI private command: failed (Cluster state is UNKNOWN.)
The cluster state could not be determined. Check whether the clconfd(1M) daemon is running.
CI_IPCERR_NOPULSE, ipcclnt_pulse_internal(): server failed to pulse
The cluster state could not be determined. Check whether the clconfd(1M) daemon is running.
CI_IPCERR_NOSERVER, clconf ipc: ipcclnt_connect() failed, file /var/cluster/ha/comm/clconfd-ipc_fs0
The local node (fs0) is not defined in the cluster database.
CI_IPCERR_NOSERVER, Connection file /var/cluster/ha/comm/clconfd-ipc_fs0 not present.
The local node (fs0) is not defined in the cluster database.
The following are examples of messages from /var/cluster/ha/log/crsd_Hostname:
CI_CONFERR_INVAL, Nodeid -1 is invalid.
CI_CONFERR_INVAL, Error from ci_security_init().
CI_ERR_SYS:125, bind() failed.
CI_ERR_SYS:125, Initialization failed, exiting.
CI_ERR_NOTFOUND, Nodeid does not have a value.
CI_CONFERR_INVAL, Nodeid -1 is invalid.
For each of these messages, either the node ID was not provided in the node definition or the cluster processes were not running on that node when the node definition was created in the cluster database. This is a warning that optional information is not available when expected.
CI_ERR_NOTFOUND, SystemController information for node fs2 not found, requests will be ignored.
System controller information (optional information) was not provided for node fs2. Provide system controller information for node fs2 by modifying the node definition. This is a warning that optional information is not available when expected. Without this information, the node will not be reset if it fails, which might prevent the cluster from properly recovering from the failure.
CI_ERR_NOTFOUND, SystemController information for node fs0 not found, requests will be ignored.
The owner node specified in the node definition for the node with a node ID of 101 has not been defined. You must define the owner node.
CI_CRSERR_NOTFOUND, Reset request 0x10087d48 received for node 101, but its owner node does not exist.
The owner node specified in the node definition for the node with a node ID of 101 has not been defined. You must define the owner node.
The following are examples of messages from /var/cluster/ha/log/fs2d_Hostname:
Failed to copy global CDB to node fs1 (1), error 4
There are communication problems between the local node and node fs2. Check the control networks of the two nodes.
Communication failure send new quorum to machine fs2 (102) (error 6003)
There are communication problems between the local node and node fs2. Check the control networks of the two nodes.
Failed to copy CDB transaction to node fs2 (1)
There are communication problems between the local node and node fs2. Check the control networks of the two nodes.
Outgoing RPC to Hostname : NULL
If you see this message, check your Remote Procedure Call (RPC) setup. For more information, see the rpcinfo and portmap man pages.