Appendix C. System Messages

This appendix discusses the following:

  * SYSLOG messages

  * Log file error messages

SYSLOG Messages

FailSafe logs both normal operations and critical errors to SYSLOG, as well as to individual log files for each log group.

In general, errors in the SYSLOG file take the following form:

timestamp priority_&_facility : hostname process[ID]: <internal_info> CODE message_text 

For example:

Sep  7 11:12:59 6X:fs0 cli[5830]: < E clconf 0> CI_IPCERR_NOSERVER, clconf
ipc: ipcclnt_connect() failed, file /var/cluster/ha/comm/clconfd-ipc_fs0

The following table shows the parts of the preceding SYSLOG message.

Table C-1. SYSLOG Error Message Format

Content: Sep 7 11:12:59
Part:    Timestamp
Meaning: September 7 at 11:12 AM.

Content: 6X
Part:    Facility and level
Meaning: 6X indicates an informational message. See the syslogd man page
         and the file /usr/include/sys/syslog.h.

Content: fs0
Part:    Node name
Meaning: The process is running on the node whose logical name is fs0.

Content: cli[5830]
Part:    Process[ID]
Meaning: The process sending the message is cli and its process ID
         number is 5830.

Content: <CI>E clconf 0
Part:    Internal information: message source, logging subsystem, and
         thread ID
Meaning: The message is from the cluster infrastructure (CI). E indicates
         that it is an error. The logging subsystem is clconf. 0 indicates
         that the process is not multithreaded.

Content: CI_IPCERR_NOSERVER, clconf ipc
Part:    Internal error code
Meaning: Information about the type of message; in this case, the server
         is missing. No error code is printed for a normal message.

Content: ipcclnt_connect() failed, file /var/cluster/ha/comm/clconfd-ipc_fs0
Part:    Message text
Meaning: A connection failed for the clconfd-ipc_fs0 file.
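The parts listed in Table C-1 can be extracted mechanically. The following is a minimal, hypothetical Python sketch (it is not part of FailSafe, and the regular expression is an assumption based on the format shown above) that splits a SYSLOG line into those parts:

```python
import re

# Hypothetical helper (not a FailSafe tool): split a FailSafe SYSLOG
# line into the parts listed in Table C-1.
SYSLOG_RE = re.compile(
    r"^(?P<timestamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+"  # Sep  7 11:12:59
    r"(?P<facility_level>[^:\s]+):(?P<hostname>\S+)\s+"    # 6X:fs0
    r"(?P<process>\w+)\[(?P<pid>\d+)\]:\s+"                # cli[5830]:
    r"<(?P<internal>[^>]*)>\s+"                            # internal info
    r"(?P<text>.*)$"                                       # code and text
)

def parse_syslog_line(line):
    """Return a dict of message parts, or None if the line does not match."""
    m = SYSLOG_RE.match(line)
    return m.groupdict() if m else None
```

For the example message shown earlier, the sketch would report fs0 as the hostname, cli as the process, and 5830 as the process ID.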

Some of the following sections present only the message identifiers and text.

Normal Messages After Successfully Starting HA Services

When you start HA services successfully, you will see a series of messages. The following example shows the messages you would see after successfully starting HA services for nodes named hans1 and hans2 (line breaks added for readability):

Aug 14 15:01:23 6X:hans1 ha_cmsd[6431]: < N cms 0> FailSafe: ha_cmsd process started.
Aug 14 15:01:24 6X:hans1 ha_ifd[6430]: < N ifd 0> FailSafe: ha_ifd monitoring network interfaces
Aug 14 15:02:00 6X:hans1 ha_cmsd[6431]: < N cms 0> FailSafe Node Confirmed Membership: sqn 1 G_sqn = 1,
   ack false change 5
Aug 14 15:02:00 5B:hans1 Configuration changes
Aug 14 15:02:00 5B:hans1 Membership changes
Aug 14 15:02:00 5B:hans1 node hans1 [1] : UP  incarnation 69   age 1:0
Aug 14 15:02:00 5B:hans1 node hans2 [2] : UP  incarnation 168   age 1:0
Aug 14 15:02:04 6X:hans1 ha_gcd[6435]: < N gcd 0> FailSafe: ha_gcd initialization complete
Aug 14 15:02:17 6X:hans1 ha_srmd[6424]: < N srm 2> FailSafe: SRM ready to accept clients
Aug 14 15:02:48 6X:hans1 ha_fsd[6408]: < N fsd 0> FailSafe initialization complete
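As a rough illustration, a small Python check (hypothetical, not a FailSafe utility; the list of daemon names is taken from the startup messages above) could scan captured SYSLOG lines and report which of the core daemons never logged a message:

```python
# Hypothetical convenience check (not a FailSafe utility): given SYSLOG
# lines captured during startup, report which core FailSafe daemons
# never logged a message.
REQUIRED_DAEMONS = ("ha_cmsd", "ha_ifd", "ha_gcd", "ha_srmd", "ha_fsd")

def missing_daemons(syslog_lines):
    """Return the required daemons that do not appear in any line."""
    seen = {d for line in syslog_lines
              for d in REQUIRED_DAEMONS if d + "[" in line}
    return [d for d in REQUIRED_DAEMONS if d not in seen]
```

If, say, no ha_fsd message ever appeared, the check would return ["ha_fsd"], pointing at the daemon whose logs to examine first.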

cli Error Messages

For all cli messages, only the last message from the command (which begins with CLI private command: failed) is meaningful. You can ignore all other cli messages.

The following are example errors from the cli daemon.

CI_ERR_INVAL, CLI private command: failed (Machine (fs0) exists.)
 

You tried to create a new node definition with logical name fs0; however, that node name already exists in the cluster database. Choose a different name.

CI_ERR_INVAL, CLI private command: failed (IP address (128.162.89.33) specified for control network is fs0 is assigned to control network of machine (fs0).)
 

You specified the same IP address for two different control networks of node fs0. Use a different IP address.

CI_FAILURE, CLI private command: failed (Unable to validate hostname of machine (fs0) being modified.)
 

The DNS resolution of the fs0 name failed. To solve this problem, add an entry for fs0 in /etc/hosts on all nodes.

CI_IPCERR_NOPULSE, CLI private command: failed (Cluster state is UNKNOWN.)
 

The cluster state is UNKNOWN and the command could not complete. This is a transient error. However, if it persists, stop and restart the cluster daemons.

crsd Error Messages

The following errors are sent by the crsd daemon.

CI_ERR_NOTFOUND, No logging entries found for group crsd, no logging will take place - Database entry #global#logging#crsd not found.
 

No crsd logging definition was found in the cluster database. This can happen if you start cluster processes without creating the database.

CI_ERR_RETRY, Could not find machine listing.
 

The crsd daemon could not find the local node in the cluster database. You can ignore this message if the local node definition has not yet been created.

CI_ERR_SYS:125, bind() failed.
 

The sgi-crsd port number in the /etc/services file is not unique, or there is no sgi-crsd entry in the file.
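Both conditions can be checked mechanically. The following hypothetical Python sketch (not a FailSafe tool) scans /etc/services-style text for an sgi-crsd entry and verifies that its port/protocol pair is not shared with another service; the port numbers in the test data are illustrative only, not the actual sgi-crsd port:

```python
# Hypothetical sketch (not a FailSafe tool): inspect /etc/services-style
# text for a service entry and verify that its port/protocol pair is
# not assigned to any other service.
def check_service_port(services_text, name="sgi-crsd"):
    """Return (entry_found, port_is_unique)."""
    entry_port = None
    by_port = {}
    for raw in services_text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments
        fields = line.split()
        if len(fields) < 2:
            continue
        svc, port_proto = fields[0], fields[1]
        by_port.setdefault(port_proto, set()).add(svc)
        if svc == name:
            entry_port = port_proto
    if entry_port is None:
        return (False, False)
    return (True, len(by_port[entry_port]) == 1)
```

A (False, False) result corresponds to the missing-entry case reported by the CI_FAILURE message below; a (True, False) result corresponds to the non-unique port case of this bind() failure.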

CI_FAILURE, Entry for sgi-crsd is missing in /etc/services.
 

The sgi-crsd entry is missing from the /etc/services file.

CI_FAILURE, Initialization failed, exiting.
 

This message ends a sequence of messages; see the messages prior to this one to determine the cause of the failure.

cmond Error Messages

The following errors are sent by the cmond daemon.

Could not register for notification.cdb_error = 7
 

An error number of 7 indicates that the cluster database was not initialized when the cluster process was started.

This can happen if you run the cdbreinit command on one administration node while other administration nodes in the pool are still running fs2d and already have the node listed in the database.

Do the following:

  1. Execute the following command on the nodes that show the error:

    # /usr/cluster/bin/cdb-init-std-nodes

    This command will recreate the missing nodes without disrupting the rest of the database.

  2. If the error persists, force the daemons to restart by executing the following command:

    # /etc/init.d/cluster restart

    Verify that cmond is restarted.

  3. If the error persists, reinitialize the database on just the node that is having problems.

  4. If the error still persists, reinitialize all nodes in the cluster.

Process clconfd:343 of group cluster_cx exited, status = 3.
 

The clconfd process exited with status 3, meaning that the process will not be restarted by cmond. No corrective action is needed.

Process crsd:1790 of group cluster_control exited, status = 127
 

The crsd process exited with an error (nonzero) status. Look at the corresponding daemon logs for error messages.

fs2d Error Messages

The following errors are sent by the fs2d daemon.

Error 9 writing CDB info attribute for node #cluster#elaine#machines#fs2#HA#status
 

An internal error occurred when writing to the cluster database. Retry the operation. If the error persists, stop and restart the cluster daemons.

If the problem persists, clear the database, reboot, and re-create the database.

Error 9 writing CDB string value for node #cluster#elaine#machines#fs2#HA#status
 

An internal error occurred when writing to the cluster database. Retry the operation. If the error persists, stop and restart the cluster daemons.

If the problem persists, clear the database, reboot, and re-create the database.

Failed to update CDB for node #cluster#elaine#HA#FileSystems#fs1#FSStatus
 

An internal error occurred when writing to the cluster database. Retry the operation. If the error persists, stop and restart the cluster daemons.

If the problem persists, clear the database, reboot, and re-create the database.

Failed to update CDB for node #cluster#elaine#machines#fs2#HA#status
 

An internal error occurred when writing to the cluster database. Retry the operation. If the error persists, stop and restart the cluster daemons.

If the problem persists, clear the database, reboot, and re-create the database.

Machine 101 machine_sync failed with lock_timeout error
 

The fs2d daemon was not able to synchronize the cluster database and the sync process timed out. This operation will be retried automatically by fs2d.

ha_srmd Error Message

The following error is sent by the ha_srmd daemon:

Executable /var/cluster/ha/resource_types/template/stop does not have execute file permission, Skipping resource type template configuration because of configuration errors, Fix the configuration errors and send SIGHUP signal to ha_srmd process on all nodes in the FailSafe cluster
 

The stop action script does not have the correct execute permission and therefore cannot be run. You must change the mode of the script to allow execute permission and then send a SIGHUP signal to the ha_srmd process on each node so that ha_srmd rereads the resource type configuration. If ha_srmd finds errors in the resource type configuration, the errors are sent to the SYSLOG or ha_srmd logs. Use the following command line on each node:

# killall -HUP ha_srmd

Log File Error Messages

FailSafe maintains logs for each of the FailSafe daemons.

Log file messages take the following form:

daemon_log timestamp internal_process: message_text

For example:

cad_log:Thu Sep  2 17:25:06.092  cclconf_poll_clconfd: clconf_poll failed with error CI_IPCERR_NOPULSE

Table C-2 shows the parts in the preceding message.

Table C-2. Log File Error Message Format

Content: cad_log
Part:    Daemon identifier
Meaning: The message pertains to the cad daemon.

Content: Sep 2 17:25:06.092
Part:    Timestamp and process ID
Meaning: September 2 at 5:25 PM, process ID 92.

Content: cclconf_poll_clconfd
Part:    Internal process information
Meaning: The internal routine that issued the message.

Content: clconf_poll failed with error CI_IPCERR_NOPULSE
Part:    Message text
Meaning: The clconfd daemon could not be contacted to get an update on
         the cluster's status.
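As with the SYSLOG format, the parts in Table C-2 can be split out mechanically. The following hypothetical Python sketch (not part of FailSafe; the regular expression is an assumption based on the format shown above) parses a daemon log line:

```python
import re

# Hypothetical sketch (not part of FailSafe): split a FailSafe daemon
# log line into the parts listed in Table C-2.
LOG_RE = re.compile(
    r"^(?P<daemon>\w+)_log:"                        # cad_log:
    r"(?P<timestamp>.+?\d{2}:\d{2}:\d{2}\.\d+)\s+"  # Thu Sep  2 17:25:06.092
    r"(?P<internal>\w+):\s+"                        # cclconf_poll_clconfd:
    r"(?P<text>.*)$"                                # message text
)

def parse_log_line(line):
    """Return a dict of message parts, or None if the line does not match."""
    m = LOG_RE.match(line)
    return m.groupdict() if m else None
```

For the example line above, the sketch would identify cad as the daemon and cclconf_poll_clconfd as the internal routine.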


cad Messages

The following are examples of messages from /var/cluster/ha/log/cad_log:

ccacdb_cam_open: failed to open connection to CAM server error 4
 

Internal message that can be ignored because the cad operation is automatically retried.

ccamail_cam_open: failed to open connection to CAM server error 4
 

Internal message that can be ignored because the cad operation is automatically retried.

ccicdb_cam_open: failed to open connection to CAM server error 4
 

Internal message that can be ignored because the cad operation is automatically retried.

cclconf_cam_open: failed to open connection to CAM server error 4
 

Internal message that can be ignored because the cad operation is automatically retried.

cclconf_poll_clconfd: clconf_poll failed with error CI_IPCERR_NOCONN
 

The clconfd daemon is not running or is not responding to external requests. If the error persists, stop and restart the cluster daemons.


cclconf_poll_clconfd: clconf_poll failed with error CI_IPCERR_NOPULSE
 

The clconfd daemon could not be contacted to get an update on the cluster's status. If the error persists, stop and restart the cluster daemons.

cclconf_poll_clconfd: clconf_poll failed with error CI_CLCONFERR_LONELY
 

The clconfd daemon does not have enough information to provide an accurate status of the cluster. It will automatically restart with fresh data and resume its service.

csrm_cam_open: failed to open connection to CAM server error 4
 

Internal message that can be ignored because the cad operation is automatically retried.

Could not execute notification cmd. system() failed. Error: No child processes
 

No mail message was sent because cad could not fork processes. Stop and restart the cluster daemons.

error 3 sending event notification to client 0x000000021010f078
 

GUI process exited without cleaning up.

error 8 sending event notification to client 0x000000031010f138
 

GUI process exited without cleaning up.

cli Messages

The following are examples of messages from /var/cluster/ha/log/cli_Hostname:

CI_CONFERR_NOTFOUND, No machines found in the CDB.
 

The local node is not defined in the cluster database.

CI_ERR_INVAL, Cluster (bob) not defined
 

The cluster called bob is not present in the cluster database.

CI_ERR_INVAL, CLI private command: failed (Cluster (bob) not defined)
 

The cluster called bob is not present in the cluster database.

CI_IPCERR_AGAIN, ipcclnt_connect(): file /var/cluster/ha/comm/clconfd-ipc_fs0 lock failed - Permission denied
 

The underlying command line interface (CLI) was invoked by a login other than root. You should only use cmgr(1M) when you are logged in as root.

CI_IPCERR_NOPULSE, CLI private command: failed (Cluster state is UNKNOWN.)
 

The cluster state could not be determined. Check if the clconfd(1M) daemon is running.

CI_IPCERR_NOPULSE, ipcclnt_pulse_internal(): server failed to pulse
 

The cluster state could not be determined. Check if the clconfd(1M) daemon is running.

CI_IPCERR_NOSERVER, clconf ipc: ipcclnt_connect() failed, file /var/cluster/ha/comm/clconfd-ipc_fs0
 

The local node (fs0) is not defined in the cluster database.

CI_IPCERR_NOSERVER, Connection file /var/cluster/ha/comm/clconfd-ipc_fs0 not present.
 

The local node (fs0) is not defined in the cluster database.

crsd Errors

The following are examples of messages from /var/cluster/ha/log/crsd_Hostname:

CI_CONFERR_INVAL, Nodeid -1 is invalid.
CI_CONFERR_INVAL, Error from ci_security_init().
CI_ERR_SYS:125, bind() failed.
CI_ERR_SYS:125, Initialization failed, exiting.
CI_ERR_NOTFOUND, Nodeid does not have a value.
CI_CONFERR_INVAL, Nodeid -1 is invalid.
 

For each of these messages, either the node ID was not provided in the node definition or the cluster processes were not running in that node when node definition was created in the cluster database. This is a warning that optional information is not available when expected.

CI_ERR_NOTFOUND, SystemController information for node fs2 not found, requests will be ignored.
 

System controller information (optional information) was not provided for node fs2. Provide system controller information for node fs2 by modifying node definition. This is a warning that optional information is not available when expected. Without this information, the node will not be reset if it fails, which might prevent the cluster from properly recovering from the failure.

CI_ERR_NOTFOUND, SystemController information for node fs0 not found, requests will be ignored.
 

System controller information (optional information) was not provided for node fs0. Provide system controller information for node fs0 by modifying the node definition. This is a warning that optional information is not available when expected. Without this information, the node will not be reset if it fails, which might prevent the cluster from properly recovering from the failure.

CI_CRSERR_NOTFOUND, Reset request 0x10087d48 received for node 101, but its owner node does not exist.
 

The owner node specified in the node definition for the node with a node ID of 101 has not been defined. You must define the owner node.

fs2d Errors

The following are examples of messages from /var/cluster/ha/log/fs2d_Hostname:

Failed to copy global CDB to node fs1 (1), error 4
 

There are communication problems between the local node and node fs1. Check the control networks of the two nodes.

Communication failure send new quorum to machine fs2 (102) (error 6003)
 

There are communication problems between the local node and node fs2. Check the control networks of the two nodes.

Failed to copy CDB transaction to node fs2 (1)
 

There are communication problems between the local node and node fs2. Check the control networks of the two nodes.

Outgoing RPC to Hostname : NULL
 

If you see this message, check your Remote Procedure Call (RPC) setup. For more information, see the rpcinfo and portmap man pages.