Chapter 5. Monitoring Networked Systems with Events

EnlightenDSM is a powerful tool for monitoring host and network problems. The Events feature collects and saves status, configuration, performance and capacity information, automatically sending warning messages to system or network managers when an alarm condition occurs. Events can also take action under specific conditions using a process you specify.

The information collected by Events is also used by the Status Map feature to graphically display the state of your systems. These two powerful tools identify which hosts or pools are having problems, or are about to have problems, allowing network managers to anticipate errors and take quick action when problems arise.

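This chapter provides information about:

  • Using Events

  • Viewing System Status

  • Auditing Security Checks

  • Monitoring Logins, Processes, and CPUs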

Using Events

This section describes how to add, modify and delete event tests using EnlightenDSM's Events feature. For a detailed overview of Events capabilities and features, standards compliance, and the basics of building a testtab file, refer to Chapter 10, “Events,” in the EnlightenDSM Reference Manual.

How Events Works

The Events feature helps you anticipate problems by reporting each event as it occurs and, optionally, taking a corrective action that you define.

Data collected by Events can be used to assist in tasks such as:

  • system tuning

  • load balancing

  • resource planning and justification

  • upgrade requirement analysis

Events collects this data by monitoring the following:

  • memory subsystems

  • individual files

  • directory queues

  • filesystems

  • printer queues

  • critical processes

  • network statistics

  • hardware inventory

  • software inventory

  • user-provided data

An appropriate message can be sent to network managers, system managers, or both when an alarm condition occurs. Events can send alarms using one or more of the following methods:

  • SNMP (Simple Network Management Protocol) trap messages

  • e-mail

  • Programmable Events Processor (PEP) messages

Events can also pass the alarm for possible corrective action to a defined process. The same process can be assigned for all tests or a separate process can be specified for each test.

Events Test Types

There are two Events categories:

Group Tests—Also referred to as Internal tests, these tests are automatically configured with default configuration values. Refer to “Modifying an Existing Test” later in this chapter for more details. Group Tests include:

  • O/S Tests

  • File Systems Tests

  • Printer Tests

  • Process Tests

  • Inventory Tests

  • RPC Tests

  • VM (Virtual Memory) Tests

  • MBUF Tests

  • Ncache Tests

  • IP Tests

  • ICMP Tests

  • TCP Tests

  • UDP Tests

Item Tests—Tests configured by the administrator using the Events Configuration Add feature. Refer to “Adding an Event Test” in this chapter for more details. Item tests include:

Process Tests:

  • Instance of Process

  • Size of Process

  • Time Used by Process

  • API built-in tests

File Tests:

  • File Size

  • File accessed

  • File modified

  • File clamped

Adding an Event Test

To create a new test:

  1. Choose Configure from the Events menu. The Events Configuration window appears.

    Figure 5-1. Events Configuration window


    This window displays the hostname, the test group, whether the test is turned on or off, whether logging for the test is turned on or off, the severity level, and the test name for each test EnlightenDSM finds on your default host.

  2. Click the Add button to select the test type for this new event test. A Select New Event Type window appears. Select the type of test to be created.

    Figure 5-2. Select New Event Type window


  3. Enter the hostname that will contain the default settings for the new tests.

  4. From the Test Category field, choose the test type. The options are:

    • Files (the default)

    • Processes

    • Directories

  5. From the Sub-category field, specify a subcategory of the test types. If the Files option in the Test Category field was chosen, the type of file test can also be selected:

    • Size of File (the default)

    • Last Modified

    • Last Accessed

    • File Clamping

    If the Processes option in the Test Category field was chosen, the type of process test can also be chosen:

    • Instance of Process (the default)

    • Size of Process

    • Time Used by Process

    If the Directories option in the Test Category field was chosen, there are no further subcategories to choose from.

  6. Click the Apply button to display the Add Events Test window for the selected test type. Enter the parameters for the new event test.

    Figure 5-3. Add Events Test window


Entering Test Parameters

For more information on the formats of these fields, or to run Events from a command-line mode, refer to Chapter 10, “Events,” in the EnlightenDSM Reference Manual.

This section describes how to use each field and button in the Add Events Test window.

Hostnames field

Type the hostname(s) for this test (or click the right arrow button to choose from a pick list of all hosts within the current pool). Leave a blank space between hostnames for multiple entries.

Testname field

Enter the name of the test. This must be either the process name, or the full pathname of the file or directory to be monitored.

Arguments field

Enter an optional list of command arguments to use when matching a process name. This field can be used to differentiate between two instances of the same process by also matching the argument list used by each instance.

Units of Measure field

This read-only field shows what the standard units of measure are for this test.

Sub-command field

This read-only field shows the Events-defined subcommand (if any) this test will use during its execution.

Internal Test field

This read-only field indicates whether this test is an Events built-in test (Yes setting) or a user-defined test (No setting).

Test Group field

This read-only field shows the test group type for this test.

State field

This toggle turns the test On (the default) or Off.

Severity field

This toggle chooses the level of severity to assign this test from the following message types:

  • OK

  • Informational (the default)

  • Warning

  • Error

  • Severe

Use PEP field

Choose whether this test should use PEP to report its results and/or filter any action to be taken. The default setting is Yes.

Logging field

Choose whether logging should be enabled for this test. The default is Yes.

Delta field

If Logging is enabled, enter a “changed by” (delta) value. EnlightenDSM will record the most recent value measured by the test if that value differs by at least this delta amount from the previously logged value.

See Chapter 10, “Events,” in the EnlightenDSM Reference Manual for more details.
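The delta rule can be sketched in a few lines of Python. This is only an illustration of the behavior described above, not EnlightenDSM code: the function names are hypothetical, and the sketch assumes that the first measurement is always logged.

```python
def should_log(current, last_logged, delta):
    """Return True if `current` should be written to the log."""
    if last_logged is None:
        # Assumption: with no previously logged value, the first one is logged.
        return True
    # Log only when the value moved by at least `delta` since the last logged value.
    return abs(current - last_logged) >= delta


def logged_values(measurements, delta):
    """Return the subset of measurements that would be logged."""
    logged = []
    last = None
    for value in measurements:
        if should_log(value, last, delta):
            logged.append(value)
            last = value  # comparisons are made against the last *logged* value
    return logged


print(logged_values([10, 11, 12, 15, 16, 9], delta=3))
```

Note that each new value is compared with the previously logged value, not with the previous measurement, so a slow drift is still recorded once it adds up to the delta.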

Mailer field

Enter the mail program that should be used to deliver alarm messages. The default is /bin/mail. If another mail program is used, it must use the same syntax as the standard mail program for the target operating system.

User field

Enter the user(s) who should receive any alarm information. The default value is root. Leave a blank space between each user name for multiple entries. If this value is set to nobody, no mail will be sent.
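As a rough illustration of how the Mailer and User values combine, the following Python sketch builds the command line that a mail delivery step might run. The function name is hypothetical, and the sketch assumes that the mail program takes the recipients as arguments and reads the message body from standard input, as the standard mail program does.

```python
def mail_command(mailer="/bin/mail", users="root"):
    """Build the argument list for delivering an alarm, or None if mail is disabled."""
    recipients = users.split()  # user names are separated by blank spaces
    if not recipients or recipients == ["nobody"]:
        # Assumption: only the exact value "nobody" suppresses mail.
        return None
    return [mailer] + recipients


print(mail_command(users="root operator"))
print(mail_command(users="nobody"))
```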

Command field

Enter any executable this test should run when it sets an alarm. This can be a script or a compiled executable.

Age field

Enter how much time (in minutes) must elapse before a file in a directory is considered to have “aged”. Only files more than “aged” minutes old are counted as “old”.

This value is only available if this test will monitor a directory queue.
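The idea of counting aged files in a directory queue can be sketched as follows. This is a minimal Python illustration, not the Events implementation: the function name is hypothetical, and it assumes that a file's age is measured from its last modification time and that only regular files are counted.

```python
import os
import time


def count_aged_files(directory, age_minutes, now=None):
    """Count regular files in `directory` that are older than `age_minutes`."""
    now = time.time() if now is None else now
    cutoff = now - age_minutes * 60
    count = 0
    with os.scandir(directory) as entries:
        for entry in entries:
            # Subdirectories and symbolic links are skipped.
            if not entry.is_file(follow_symlinks=False):
                continue
            # Assumption: "age" is based on the modification time.
            if entry.stat(follow_symlinks=False).st_mtime < cutoff:
                count += 1
    return count
```

For example, `count_aged_files("/var/spool/myqueue", 30)` would report how many files have been waiting in a (hypothetical) queue directory for more than 30 minutes.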

Test Freq field

Specify how often in minutes to run this test. The default is every five minutes.

Alarm Freq field

Specify how long to wait in minutes before sending another new alarm about this test. The default is every hour.
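The interaction between Test Freq and Alarm Freq can be pictured with a small Python sketch. It is an assumption-laden illustration, not EnlightenDSM code: the class name is hypothetical, and it assumes that an alarm is suppressed until the full Alarm Freq interval has passed since the last alarm that was sent.

```python
class AlarmThrottle:
    """Suppress repeat alarms for one test until the alarm interval has elapsed."""

    def __init__(self, alarm_freq_minutes=60):
        self.alarm_freq_seconds = alarm_freq_minutes * 60
        self.last_sent = None  # time of the last alarm actually sent

    def should_send(self, now_seconds):
        """Return True if an alarm raised at `now_seconds` should be sent."""
        if self.last_sent is None or now_seconds - self.last_sent >= self.alarm_freq_seconds:
            self.last_sent = now_seconds
            return True
        return False


# A test that runs every 5 minutes and fails continuously for 2 hours
# sends an alarm only once per hour with the default Alarm Freq.
throttle = AlarmThrottle(alarm_freq_minutes=60)
sent = [t for t in range(0, 7201, 300) if throttle.should_send(t)]
print(sent)
```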

High Thresh field

Specify an absolute high-level alarm set point for the data you're measuring in this test. This can be an integer or floating-point value.

Low Thresh field

Specify an absolute low-level alarm set point for the data you're measuring in this test. This can be an integer or floating-point value.
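The two absolute thresholds amount to a simple range check, sketched below in Python. The function name is hypothetical, and whether Events treats a value exactly equal to a threshold as an alarm is not stated here, so the strict comparisons are an assumption.

```python
def check_absolute(value, high=None, low=None):
    """Return "high", "low", or None for a measured value."""
    # Assumption: a value must be strictly beyond a threshold to raise an alarm.
    if high is not None and value > high:
        return "high"
    if low is not None and value < low:
        return "low"
    return None


print(check_absolute(95.0, high=90, low=10))
print(check_absolute(50, high=90, low=10))
```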

Pos Rate field

Specify a positive percentage change alarm set point for the data you're measuring in this test. This threshold checks the percentage of change by comparing the current test value with the last measured value. This must be a floating-point value.

Neg Rate field

Specify a negative percentage change alarm set point for the data you're measuring in this test. This threshold checks the percentage of change by comparing the current test value with the last measured value. This must be a floating-point value.

Pos Jump field

Specify a positive incremental change alarm set point for the data you're measuring in this test. This threshold checks for the change of n points by comparing the current test value with the last measured value. This can be an integer or floating-point value.

Neg Jump field

Specify a negative incremental change alarm set point for the data you're measuring in this test. This threshold checks for the change of n points by comparing the current test value with the last measured value. This can be an integer or floating-point value.
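The difference between the rate thresholds (percentage change) and the jump thresholds (change in points) is easiest to see side by side. The following Python sketch is an illustration only: the function names are hypothetical, and it assumes that thresholds are entered as positive numbers, that reaching a threshold exactly counts as an alarm, and that no percentage can be computed when the last measured value is zero.

```python
def percent_change(current, previous):
    """Percentage change from `previous` to `current`, or None if undefined."""
    if previous == 0:
        return None  # assumption: no rate check is possible against zero
    return (current - previous) / abs(previous) * 100.0


def check_change(current, previous, pos_rate=None, neg_rate=None,
                 pos_jump=None, neg_jump=None):
    """Return the names of the change thresholds that the new value violates."""
    alarms = []
    change = current - previous          # incremental change in points
    rate = percent_change(current, previous)
    if rate is not None:
        if pos_rate is not None and rate >= pos_rate:
            alarms.append("pos_rate")
        if neg_rate is not None and rate <= -abs(neg_rate):
            alarms.append("neg_rate")
    if pos_jump is not None and change >= pos_jump:
        alarms.append("pos_jump")
    if neg_jump is not None and change <= -abs(neg_jump):
        alarms.append("neg_jump")
    return alarms


# A rise from 100 to 150 is a 50% change but only a 50-point jump.
print(check_change(150, 100, pos_rate=25.0, pos_jump=100))
```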

API File field

If you're creating an API test, specify the full pathname of the file that will hold the values you are monitoring.


Note: See Chapter 10, “Events,” in the EnlightenDSM Reference Manual for examples and more information on creating API tests.


API Data field

If an API test is being created, specify which field or column holds the data value to monitor. Use a digit prefaced by an `f' for a field number or by a `c' for a column number. The default assumes this value is a field number. If you're using a column designator, each character in any input file line/row is handled as one column.

API Label Field

If an API test is being created, specify which field in your file contains a descriptive word or label.
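To make the field and column designators concrete, here is a minimal Python sketch of how a line from an API file might be interpreted. It is not the Events parser: the function names are hypothetical, and the sketch assumes that fields are separated by white space, that numbering starts at 1, and that a column designator marks the character position where the value begins.

```python
def parse_designator(spec):
    """Return (kind, number) for a designator such as "f3", "c14", or "3"."""
    spec = spec.strip()
    kind = "f"  # with no prefix, the value is treated as a field number
    if spec[:1].lower() in ("f", "c"):
        kind, spec = spec[0].lower(), spec[1:]
    return kind, int(spec)


def extract(line, spec):
    """Extract the value that `spec` points at in one line of the API file."""
    kind, number = parse_designator(spec)
    if kind == "f":
        # Assumption: fields are white-space separated and numbered from 1.
        return line.split()[number - 1]
    # Assumption: each character is one column, numbered from 1, and the
    # value runs from that column up to the next white space.
    return line[number - 1:].split(None, 1)[0]


line = "disk0  busy  42.5"
print(extract(line, "f1"))   # label field
print(extract(line, "f3"))   # data value by field number
print(extract(line, "c14"))  # same data value by column number
```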

Logfile Clamping Regular Expressions field

Regular expressions can be used to define “types” of messages based on pattern matching. When one or more of these message types are found in a file, an alarm is sent to the agents specified in the test. Each time this test runs, it evaluates only those files added since the last occurrence of the test.
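The pattern-matching idea can be sketched in Python as follows. This is an illustration under stated assumptions, not the Events implementation: the message types and patterns are invented for the example, the function name is hypothetical, and the sketch assumes that "since the last occurrence of the test" can be tracked by remembering a byte offset into the logfile between runs.

```python
import re

# Hypothetical message types, each defined by a regular expression.
MESSAGE_TYPES = {
    "disk_error": re.compile(r"disk.*error", re.IGNORECASE),
    "login_failure": re.compile(r"login failed|authentication failure", re.IGNORECASE),
}


def scan_new_lines(path, offset, message_types=MESSAGE_TYPES):
    """Scan what was appended to `path` after byte `offset`.

    Returns (matches, new_offset), where matches is a list of
    (message_type, line) pairs and new_offset is passed to the next run.
    """
    matches = []
    with open(path, "rb") as logfile:
        logfile.seek(offset)  # skip everything examined by earlier runs
        for raw in logfile:
            line = raw.decode("utf-8", errors="replace").rstrip("\r\n")
            for name, pattern in message_types.items():
                if pattern.search(line):
                    matches.append((name, line))
        new_offset = logfile.tell()
    return matches, new_offset
```

A caller would store `new_offset` after each run and pass it back in on the next one, so that a message already reported does not raise a second alarm.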

Apply button

Click the Apply button to add the test configuration to the testtab files for all specified hosts. The Events process monitoring the data will also be updated.


Note: EnlightenDSM is updated immediately and the testtab file is updated two minutes later.


Set Defaults button

Click the Set Defaults button to set default values in the text fields based on the test type category you previously specified in the Select New Event Type window.

Modifying an Existing Test

Click the Modify button in the Events Configuration window (Figure 5-1) to modify an existing test configuration. A window similar to the Add Events Test window will appear.

Figure 5-4. Modify Events Test window


See the previous section, “Entering Test Parameters” for information about each field.

There are three differences between the Add and Modify Events Test windows:

  • The Testname field in the Modify window is read-only.

  • The Modify button (rather than the Add button) is used to save changes.

  • The Next button is used to modify additional test configurations if more than one test is selected for modification from the Events Configuration list. Clicking Next does not save your changes, so be sure to click Modify before moving on to the next test configuration.

Deleting an Event Test

Highlight the test to be deleted from the Events Configuration window. Click the Delete button. EnlightenDSM prompts you to confirm your action.


Note: Only tests you have added can be deleted. Built-in tests can be turned off, but cannot be deleted from the Events Configuration list.


Copying an Event Test

To create a new test using the parameters of an existing test, highlight the existing test and click Copy. The Add Events Test window will appear containing the original test parameters in each field. Edit this window as needed and click Apply to save the new test.

See “Entering Test Parameters” for information about each field.


Note: You can only copy tests you have added.


Viewing System Status

The Status Map uses information provided by Events to graphically display the current state of hosts and pools using color-coded icons. You can ignore the status, or query events and fix the problem.

This section describes how to interpret and navigate the Status Map, query events using the Status Map window, and clear the Status Map of Events.

Interpreting the Status Map

To view the Status Map, choose Status Map from the Events menu.

A map similar to the one in Figure 5-5 appears.

Figure 5-5. Example of a Status Map


The state of each host or pool icon is displayed according to the icon color:

  • Green - OK

  • White - Informational

  • Yellow - Warning

  • Blue - Error

  • Red - Severe

The color of the host icon reflects the highest priority event that has not been cleared. The color of a pool icon reflects the highest priority uncleared event for any host within that pool.

A blinking host or pool icon indicates the following:

  • If a host icon is blinking, an unacknowledged event has occurred for that host.

  • If a pool icon is blinking, an unacknowledged event has occurred for at least one host contained in the pool. An unacknowledged event is any event message that you have not yet acknowledged using the Status Map.
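The color and blinking rules can be summarized in a short Python model. It is only a sketch of the rules described above: the class and function names are hypothetical, and it assumes that priority increases in the order OK, Informational, Warning, Error, Severe.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    # Assumption: a higher number means a higher priority.
    OK = 0
    INFORMATIONAL = 1
    WARNING = 2
    ERROR = 3
    SEVERE = 4


COLORS = {
    Severity.OK: "green",
    Severity.INFORMATIONAL: "white",
    Severity.WARNING: "yellow",
    Severity.ERROR: "blue",
    Severity.SEVERE: "red",
}


@dataclass
class Event:
    severity: Severity
    cleared: bool = False
    acknowledged: bool = False


def host_state(events):
    """Return (color, blinking) for one host's events."""
    # Color: highest priority event that has not been cleared.
    worst = max((e.severity for e in events if not e.cleared), default=Severity.OK)
    # Blinking: at least one event has not been acknowledged.
    blinking = any(not e.acknowledged for e in events)
    return COLORS[worst], blinking


def pool_state(pool):
    """Return (color, blinking) for a pool given {hostname: [events]}."""
    all_events = [event for events in pool.values() for event in events]
    return host_state(all_events)


print(host_state([Event(Severity.WARNING), Event(Severity.ERROR, acknowledged=True)]))
print(host_state([Event(Severity.SEVERE, cleared=True)]))  # cleared but not acknowledged
```

The second call models an icon that is green but still blinking: the event has cleared itself, yet nobody has acknowledged it.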


Note: If you haven't recently viewed the Status Map and an icon is blinking green, look at the preceding activity for that host or pool. An alarm might have occurred and cleared itself between viewings. For example, if an unauthorized user logged on to your system and then exited, and an Events test had already checked for this type of alert, a green alert would appear.

All EnlightenDSM system administration functions act upon any hosts selected in the Status Map.

To change the Status Map background, place the cursor arrow over the map and click the right mouse button. A list of backgrounds appears from which to choose.

Navigating the Status Map

To navigate the Status Map:

  • If a pool icon is selected, all hosts within that pool are selected and can be managed as a single unit.

  • To perform administration functions on a subset of hosts in a pool, select those hosts by single-clicking on the host icons.

  • The standard mouse “select rectangle” or “sweep” methods can also be used to define a temporary group.

Querying Events

To search for and view logged Events alarms and messages:

  1. Click the Query Events button in the Status Map window. The Status Map Query Events window appears.

    Figure 5-6. Status Map Query Events window


  2. In the Query Hosts field, choose the hosts to be searched. The options are:

    • Hosts In Exception Pool (the default)

    • Hosts In Current Pool

    • Specific Host(s). Specify which host(s) to check for messages in the text field to the right. Leave a blank space between hostnames for multiple entries.

  3. Under Message Type, the options are:

    • Event Messages Only: alarm messages generated when the results of a test violate a predefined threshold.

    • Log Messages Only: informational messages logged when a test successfully runs without generating an alarm.

    • Both Messages (the default).

  4. Under Message Severity Level, the options are:

    • Severe Message

    • Info Message

    • Error Message

    • Okay Message

    • Warning Message

    Select one or more severity levels to use in the search. The default setting selects all levels of message severity.

  5. Use the Test Name Filter field to limit the search to a few tests. Type the entire test name or just the first few letters of the test name in this field.

    All test names containing all or part of the specified string are queried. Alternatively, click the right arrow button to choose from a pick list of all predefined standard Events tests. Leave a blank space between test names for multiple entries. The standard regular expression wildcards `*', `[]', and `?' can also be used in this field (for example, /home/*).

  6. In the Number of Messages per Host field, enter the number of messages per host to be searched. The most recent messages are displayed first. The counter buttons to the right also change the number displayed.

  7. Use the Time Between fields to limit the search to messages generated within a span of time. Enter the beginning and ending dates and times of messages to be searched. For a detailed description of date/time formats allowed in this field, refer to Appendix C of the EnlightenDSM Reference Manual.

  8. Click the Execute Query button to begin the search process. The results are displayed in the View Event Message window (Figure 5-7).

    Figure 5-7. Query results


All messages matching your search criteria are displayed in the list box. Each line includes the hostname, test name, logged value, units, severity, status, and timestamp.
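The way the query options narrow the result list can be sketched in Python. This is an illustration, not the query engine: the function names and the dictionary keys are hypothetical, and the sketch assumes that a filter entry without wildcards matches any test name that contains it, while an entry with wildcards must match the whole name.

```python
from fnmatch import fnmatchcase


def matches_filter(test_name, filter_text):
    """True if `test_name` matches any blank-separated entry in `filter_text`."""
    patterns = filter_text.split()
    if not patterns:
        return True  # an empty filter selects every test
    for pattern in patterns:
        if any(ch in pattern for ch in "*?["):
            if fnmatchcase(test_name, pattern):
                return True
        elif pattern in test_name:
            return True
    return False


def query(messages, severities=None, name_filter="", per_host=10):
    """Return matching messages, most recent first, at most `per_host` per host."""
    selected = [
        m for m in messages
        if (severities is None or m["severity"] in severities)
        and matches_filter(m["test"], name_filter)
    ]
    selected.sort(key=lambda m: m["timestamp"], reverse=True)
    counts = {}
    result = []
    for m in selected:
        counts[m["host"]] = counts.get(m["host"], 0) + 1
        if counts[m["host"]] <= per_host:
            result.append(m)
    return result


messages = [
    {"host": "alpha", "test": "/home/alice", "severity": "Warning", "timestamp": 1},
    {"host": "alpha", "test": "/home/bob", "severity": "Error", "timestamp": 2},
    {"host": "alpha", "test": "sendmail", "severity": "Error", "timestamp": 3},
]
for m in query(messages, severities={"Warning", "Error"}, name_filter="/home/*"):
    print(m["host"], m["test"], m["severity"])
```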

To modify a test, highlight the desired test and click the Reconfigure Test button. The Modify Events Test window for that test will appear. For details on using this window to modify a test, see “Modifying an Existing Test”.

Clearing an Event from the Status Map

An event can be cleared from the Status Map when the condition that triggered the event no longer exists. There are two ways to clear an event:

  • The event clears itself (for example, an Events CPU load test returns an OK result).

  • You correct the activity that caused the original event/problem to occur (for example, by correcting whatever is overloading the CPU).

After all events for a host or pool are cleared, the color for that icon is set to green (current status = OK).

Auditing Security Checks

EnlightenDSM provides a host of security features to check vital files, filesystem devices, boot and shutdown scripts, crontab contents, password integrity, group files, home directories, and break-in attempts. The findings from these security checks are placed in a logfile, allowing you to choose the functions that you want to audit.

For information on each one of the security features, refer to Chapter 6, “Security,” in the EnlightenDSM Reference Manual.

Monitoring Logins, Processes, and CPUs

The Activity Monitor in the User menu is used to easily monitor login activity, process statuses, and CPU usage.

The rest of this section details how to use each Activity Monitor option.

Monitoring Logins

To quickly see which users are currently logged into the system, go to the User menu, choose Activity Monitor and then Who Is Logged In. The Who Is Logged In window appears (Figure 5-8).

Figure 5-8. Who Is Logged In window


Each line in the window displays the hostname, user name, tty, login time, idle time, process ID, and the location of the tty (if available). To select one or more user accounts for further information, highlight the usernames and select one of the following options:

Writing Messages to Users

Click the Write button to write a new message directly to the selected user's mailbox. A window appears with a field for composing a message of any length. Press the Return key to send the message. The recipient can respond to this message, allowing for a two-way conversation.

Message

The Message command sends a predefined or custom form letter directly to the user's screen, instead of the user's mailbox. The recipient cannot reply to this message.

Logging Out

Click the Logout button to immediately terminate all highlighted work sessions. This command kills the initial shell process belonging to the marked users.


Note: Use this command with caution as it may also cause related user processes to be killed.


Displaying Processes

Click the Processes button to display a window of all processes currently running for the highlighted users. To further manipulate this information, see the next section, “Monitoring Process Status.”

Monitoring Process Status

To view a list of all active processes, choose Activity Monitor and then Process Status from the User menu. The Processes window appears (Figure 5-9).

Figure 5-9. Processes window


Highlight a process and use the command buttons in the window to perform the following actions.

Killing a Process

Click the Terminate button to immediately kill the highlighted process. A pop-up window will prompt you to verify your Terminate command.


Note: This command will not kill related processes, so if there are child processes running, they will become orphans. These orphans might terminate automatically, or you may have to kill them manually.


Hanging Up a Process

The Hangup command is similar to the Terminate command, except it provides enough time for the process to shut down properly. This means the process can close any files and terminate any child processes. A pop-up window will prompt you to verify your Hangup command.

Suspending and Continuing a Process

Clicking the Suspend button will stop a process from working but will not terminate it. The process is put on hold and can be re-activated later. Click Continue to re-activate a suspended process.
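On UNIX systems, actions like these are conventionally carried out by sending signals to the process. The following Python sketch shows one plausible mapping. This chapter does not state which signals EnlightenDSM sends, so the mapping, the dictionary name, and the function name are all assumptions made for illustration.

```python
import os
import signal

# Assumed mapping from the window's buttons to POSIX signals.
BUTTON_SIGNALS = {
    "Terminate": signal.SIGKILL,  # immediate kill; the process cannot clean up
    "Hangup": signal.SIGHUP,      # the process may catch this and shut down cleanly
    "Suspend": signal.SIGSTOP,    # put the process on hold
    "Continue": signal.SIGCONT,   # re-activate a suspended process
}


def send_action(pid, action):
    """Send the signal associated with `action` to process `pid`."""
    os.kill(pid, BUTTON_SIGNALS[action])
```

For example, `send_action(pid, "Suspend")` followed later by `send_action(pid, "Continue")` would pause and resume a process without terminating it.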

Changing Process Priorities

Click the Priority button to change the priority of a process. This priority determines when the CPU acts on a process. It may have a value from -20 to +20; the smaller the number, the higher the priority. You can enter the desired priority or use the arrow buttons to change the value.
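For readers who script this kind of change, the Python sketch below shows the same idea using the standard library. It is not how EnlightenDSM itself changes priorities: the function names are hypothetical, the range is the one given above, and the operating system may enforce a slightly different range or require administrator privileges to raise a priority.

```python
import os


def clamp_priority(value, minimum=-20, maximum=20):
    """Keep a requested priority within the range described above."""
    return max(minimum, min(maximum, int(value)))


def set_process_priority(pid, value):
    """Set the scheduling priority of process `pid` (smaller number = higher priority)."""
    os.setpriority(os.PRIO_PROCESS, pid, clamp_priority(value))


print(clamp_priority(25))
print(clamp_priority(-30))
```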

Monitoring CPU Usage

From the User menu, choose Activity Monitor, then CPU Summary. The Summary of Process By User window appears (Figure 5-10). This window shows all currently logged-in users, the current number of processes, and the total cumulative CPU usage for each active user.

Figure 5-10. Summary of Process By User window
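The per-user summary is a simple aggregation over the process list, as the following Python sketch shows. The function name and the input format (pairs of user name and CPU seconds) are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def summarize_by_user(processes):
    """Aggregate (user, cpu_seconds) pairs into per-user totals."""
    summary = defaultdict(lambda: {"processes": 0, "cpu_seconds": 0.0})
    for user, cpu_seconds in processes:
        summary[user]["processes"] += 1
        summary[user]["cpu_seconds"] += cpu_seconds
    return dict(summary)


sample = [("root", 1.5), ("alice", 0.5), ("root", 2.0)]
for user, totals in summarize_by_user(sample).items():
    print(user, totals["processes"], totals["cpu_seconds"])
```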


Graphing Processes

To graph the processes, highlight the users to be graphed and click the Graph button. A window appears displaying the highlighted items in a graphical format. Press and hold down the middle mouse button to rotate the graph in the direction you move the mouse.

Viewing Processes

To view the processes in detail, highlight the users to be viewed and click the Processes button. A window appears displaying all processes for the highlighted users. To further manipulate this information, see “Monitoring Process Status”.