Chapter 4. Configuring Additional Features


This chapter includes the following topics:

Enabling Hardware Event Tracker (HET) Notifications

The following topics contain information about HET:

Note: This documentation includes information about all SGI systems for the sake of completeness.

About HET

All of your SGI system's IPMI controllers send SNMP traps to the management node, which is the system admin controller (SAC) on an SGI ICE X system or the system management node (SMN) on an SGI UV system. The SGI Foundation Software's HET tools process these system alerts and send an email notification after critical hardware events occur.

On SGI ICE X and SGI UV platforms, the HET tools are configured by default. You do not need to perform any additional system configuration, but SGI recommends that you customize the email address to which the HET tools send critical event notifications. On SGI Rackable systems, you need to configure the system manually as noted in the het(8) man page. The het(8) man page also contains information about HET defaults and internal processes.

HET accumulates information about system events in the following log file:

/var/log/het/het_trap_processor.log

As an event-driven system monitoring tool, HET listens for system events. When HET receives information about an event, it converts the message from coded numbers into a readable form, as follows:

When a noncritical event occurs, HET simply logs the event. As an option, you can configure an email address to receive noncritical event notifications.
When a critical event occurs, HET logs the event and sends an email message. SGI recommends that you edit file /etc/sysconfig/het and specify an email address specific to your site. By default, HET sends event information to root@localhost. For more information about how to customize HET notifications, see “Customizing HET Notifications”.

The firmware for each baseboard management controller (BMC) and the firmware for each cooling node on an SGI ICE X system includes threshold values for each component. If a system condition falls below or rises above its threshold, the BMC sends a critical event alert. The following are examples of critical system events that cause an alert:
- Ambient air temperature outside of recommended range
- Voltage sensor unable to attain a critical low voltage
- Power supply failure
- Loss of redundant power supply
- Fan speed unable to attain a critical threshold or a loss of fan redundancy
- Board processor modules that exceed a critical temperature threshold
- Memory uncorrectable errors

Customizing HET Notifications

You can customize the email addresses to which event information about NON-RECOVERABLE events is sent. As an option, you can specify a site-specific email address for less-severe events, or all HET events, too.

The HET log file, /var/log/het/het_trap_processor.log, contains information about all HET events. You can consult this file periodically to monitor noncritical events.
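One way to review the log is to list only the event names with their timestamps. The following is a minimal sketch, not an SGI tool. The function name het_events is hypothetical, and the sketch assumes that every record line in your log uses the layout shown in “HET Examples” (the word dump, a timestamp, a thread tag, a field name, and a value).

```shell
# het_events LOGFILE
# Print "timestamp event-name" for each event record in a HET log.
# ASSUMPTION: record lines look like the example in "HET Examples":
#   dump <timestamp> [<thread>] <field> <value>
het_events() {
    # Field 4 is the field name; match "event" exactly so that
    # event1, event2, event3, and eventClassName are skipped.
    awk '$4 == "event" { print $2, $5 }' "$1"
}
```

For example, type het_events /var/log/het/het_trap_processor.log to list the events recorded so far.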

The following procedure explains how to configure an email address or email alias to receive HET notifications.

Procedure 4-1. To customize HET notifications

Log in as root and open the following file with a text editor:
/etc/sysconfig/het
On an SGI ICE X system, log into the system admin controller (SAC) node.

On an SGI UV system, log into the system management node (SMN). HET requires an SMN. If your system does not include an SMN, you cannot enable HET.
Search the file for the following string:
HET_MAIL_TROUBLE_TO
Change the default recipient, root, to be the email address of a person or the email alias of a group who can attend to the system when NON-RECOVERABLE events occur.
(Optional) Configure an email recipient for notifications about CRITICAL events.

Search the file for the following string:
HET_MAIL_NEWS_TO
Specify an email address or alias to receive CRITICAL event notifications.
Save and close file /etc/sysconfig/het.
(Optional) Configure an email recipient for all HET events.

Complete the following steps:
- Open file /etc/het.action.d/het_mail with a text editor.
- Search for the following lines in /etc/het.action.d/het_mail:
  # NOTE: Adjust if needed
  # Default is an empty mailing list audience for
  # non (NON-RECOVERABLE or CRITICAL) events.
  to=""
- Edit the to="" line to specify an email address or an email alias between the quotation marks.
- Save and close the file.
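If you prefer to script the edit to /etc/sysconfig/het, the following sketch shows one possible approach. It is an illustration under stated assumptions, not part of the HET tools: the function name set_het_recipient is hypothetical, and the sketch assumes that the file uses shell-style assignments, one per line, such as HET_MAIL_TROUBLE_TO="root". Examine your file before you use it.

```shell
# set_het_recipient FILE VARIABLE ADDRESS
# Rewrite the VARIABLE=... line in FILE so that it names ADDRESS.
# ASSUMPTION: FILE holds shell-style assignments, one per line,
# for example HET_MAIL_TROUBLE_TO="root".
# ADDRESS must not contain the characters |, &, or \.
set_het_recipient() {
    file=$1
    var=$2
    addr=$3
    tmp=$(mktemp) || return 1
    # Change only a line that begins with the variable name; comments
    # and all other settings pass through unchanged.
    if sed "s|^${var}=.*|${var}=\"${addr}\"|" "$file" > "$tmp"; then
        # Write back in place so that ownership and mode are kept.
        cat "$tmp" > "$file"
    fi
    rm -f "$tmp"
}
```

For example, as root, type set_het_recipient /etc/sysconfig/het HET_MAIL_TROUBLE_TO hw-oncall@example.com, where hw-oncall@example.com stands for your site's address or alias.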

HET Examples

The following is an example of a HET log file that contains critical information:

dump     2013-10-23.07.13.21 [het_process_thread:2] # begin ---------------------------------
dump     2013-10-23.07.13.21 [het_process_thread:2] agentAddr        172.24.0.2
dump     2013-10-23.07.13.21 [het_process_thread:2] het_type         ipmi
dump     2013-10-23.07.13.21 [het_process_thread:2] guid             r1lead
dump     2013-10-23.07.13.21 [het_process_thread:2] sn               X1------
dump     2013-10-23.07.13.21 [het_process_thread:2] alertSeverity    NONE
dump     2013-10-23.07.13.21 [het_process_thread:2] event            uncorrectableECC
dump     2013-10-23.07.13.21 [het_process_thread:2] sensorName       None-memory
dump     2013-10-23.07.13.21 [het_process_thread:2] sensorNumber     0x00
dump     2013-10-23.07.13.21 [het_process_thread:2] sensorTypeName   memory
dump     2013-10-23.07.13.21 [het_process_thread:2] eventClassName   discrete
dump     2013-10-23.07.13.21 [het_process_thread:2] event1           0x51
dump     2013-10-23.07.13.21 [het_process_thread:2] event2           0xff
dump     2013-10-23.07.13.21 [het_process_thread:2] event3           0x51
dump     2013-10-23.07.13.21 [het_process_thread:2] flap_count       1
dump     2013-10-23.07.13.21 [het_process_thread:2] # end -----------------------------------

The corresponding email message that HET sends is as follows:

X-Original-To: root
Delivered-To: [email protected]
Date: Wed, 18 Dec 2013 14:36:52 -0600
From: [email protected]
To: [email protected]
Subject: HET ALERT from cb9 - NON-RECOVERABLE
User-Agent: Heirloom mailx 12.2 01/07/07

The following HET(Hardware Environment Tracking) event has been recorded:
HET ALERT from cb9 - NON-RECOVERABLE

Event Details:

        EVENT                          uncorrectableECC
        HET                            r1i0n4
        LOCATION                       r1i0n4
        SENSOR                         None-memory
        SENSORNUMBER                   0x00
        SENSORTHRESHOLD                81
        SENSORTYPE                     memory
        SENSORVALUE                    255
        SEVERITY                       NON-RECOVERABLE
        SN                             X1------
        TYPE                           ipmi

CPU Frequency Scaling

CPU frequency scaling allows the operating system to automatically and dynamically scale the processor frequency. CPU frequency scaling needs to be enabled in a compute node image if you want to take advantage of the Intel Turbo Boost technology that is built into each processor.

The Intel Turbo Boost Technology allows processor cores to run faster than the base operating frequency as long as they are operating below the limits set for power, current, and temperature. The CPU frequency scaling setting also affects power consumption and enables you to manage power consumption. For example, you can theoretically cut power consumption in half if you clock the computer's processors from 2 GHz down to 1 GHz.

The following procedures pertain to CPU frequency:

Enabling or Disabling CPU Frequency Scaling

CPU frequency scaling is disabled by default on SGI ICE X systems. The following procedure explains how to enable or disable it in a compute node image.

Procedure 4-2. To control CPU frequency scaling

Log into the system admin controller (SAC) as root.
Use the cimage --list-images command to retrieve a list of the compute node images you can edit:

For example:
# cimage --list-images
image: compute-sles11sp3.mpt
        kernel: 3.0.76-0.11-default
        kernel: 3.0.76-0.11-trace
image: compute-sles11sp3
        kernel: 3.0.76-0.11-default
The previous example shows the names of two images: compute-sles11sp3.mpt and compute-sles11sp3.
Type the following command to change to the directory that contains the image you want to edit:
# chroot /var/lib/systemimager/images/image_name
For image_name, specify one of the compute node image names that the cimage command returned. For example, using the output from the preceding step, specify either compute-sles11sp3.mpt or compute-sles11sp3.
Use a text editor to open file /etc/modprobe.d/acpi-cpufreq.conf.
Note the following line in this file:
install acpi-cpufreq /bin/true
To enable CPU frequency scaling, insert a pound (#) character as the first character in this line, which makes the line appear as follows:
#install acpi-cpufreq /bin/true
To disable CPU frequency scaling, make sure that the install acpi-cpufreq /bin/true line does not contain a # character in column 1, which makes the line appear as follows:
install acpi-cpufreq /bin/true
Save and close file /etc/modprobe.d/acpi-cpufreq.conf.
Push the changes out to the compute nodes.

Perform the procedure in the following topic:

“Pushing Changes to the Compute Nodes” in Chapter 2
(Optional) Change the CPU frequency governor setting and configure turbo mode.

The default governor setting and the default turbo mode setting are appropriate for most SGI ICE X systems. If you want to change these settings, proceed to the following:

“(Optional) Changing the Governor Setting and Configuring Turbo Mode”
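The edit to /etc/modprobe.d/acpi-cpufreq.conf in the preceding procedure can also be scripted. The following is a minimal sketch under stated assumptions: the function name cpufreq_scaling is hypothetical, and the sketch assumes that the file contains the install line exactly as shown in the procedure, with no leading white space.

```shell
# cpufreq_scaling enable|disable FILE
#   enable:  comment out the "install acpi-cpufreq /bin/true" line
#   disable: remove the leading "#" from that line
# ASSUMPTION: the line appears exactly as shown in the procedure.
cpufreq_scaling() {
    mode=$1
    conf=$2
    case $mode in
        enable)  script='s|^install acpi-cpufreq /bin/true|#&|' ;;
        disable) script='s|^#install acpi-cpufreq /bin/true|install acpi-cpufreq /bin/true|' ;;
        *) echo "usage: cpufreq_scaling enable|disable FILE" >&2
           return 2 ;;
    esac
    tmp=$(mktemp) || return 1
    # Each expression matches only the opposite state, so running the
    # same mode twice leaves the file unchanged.
    if sed "$script" "$conf" > "$tmp"; then
        cat "$tmp" > "$conf"
    fi
    rm -f "$tmp"
}
```

For example, inside the chroot environment for the image, type cpufreq_scaling enable /etc/modprobe.d/acpi-cpufreq.conf. You still need to push the changes out to the compute nodes afterward.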

(Optional) Changing the Governor Setting and Configuring Turbo Mode

Use the procedure in this topic to change the governor setting and, optionally, to configure turbo mode. When you enable turbo mode, you enable the CPU frequency to exceed its nominal level for short periods of time, depending on the processor, temperature, current, power, and other factors. For general information about turbo mode, see the following website:

https://www-ssl.intel.com/content/www/us/en/architecture-and-technology/turbo-boost/turbo-boost-technology.html

The following procedure explains how to set the CPU frequency governor appropriately and how to configure turbo mode.

Procedure 4-3. To change the governor setting and configure turbo mode

Make sure that CPU frequency scaling is enabled.

For information, see “Enabling or Disabling CPU Frequency Scaling”.

Examine the following list and choose a power governor setting:

Governor Setting		Effect
`ondemand`		Dynamically switches between the available CPU frequencies when the CPU load reaches 95%. Default.
`performance`		Runs the CPUs at the maximum frequency.
`conservative`		Dynamically switches between the available CPU frequencies when the CPU load reaches 75%.
`powersave`		Runs the CPUs at the minimum frequency.
`userspace`		Runs the CPUs at user-specified frequencies.
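Before you set a governor, you might want to confirm that the running kernel offers it. The following sketch checks a name against the list that the Linux cpufreq subsystem exposes in sysfs. The function name governor_available is hypothetical. The sketch assumes that a cpufreq driver is loaded, so that the scaling_available_governors file exists for cpu0.

```shell
# governor_available NAME [LISTFILE]
# Return 0 if NAME is in the list of available governors, 1 if it is
# not, and 2 if the list cannot be read.
# ASSUMPTION: a cpufreq driver is loaded, so the sysfs file exists.
governor_available() {
    list=${2:-/sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors}
    [ -r "$list" ] || return 2
    # The file holds one line of space-separated governor names.
    for g in $(cat "$list"); do
        [ "$g" = "$1" ] && return 0
    done
    return 1
}
```

For example, on a compute node, type governor_available performance && echo available.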

Use the cimage --list-images command to retrieve a list of the compute node images you can edit:

For example:
# cimage --list-images
image: compute-sles11sp3.mpt
        kernel: 3.0.76-0.11-default
        kernel: 3.0.76-0.11-trace
image: compute-sles11sp3
        kernel: 3.0.76-0.11-default
The previous example shows the names of two images: compute-sles11sp3.mpt and compute-sles11sp3.
Type the following command to change to the directory that contains the image you want to edit:
# chroot /var/lib/systemimager/images/image_name
For image_name, specify one of the compute node image names that the cimage command returned. For example, using the output from the preceding step, specify either compute-sles11sp3.mpt or compute-sles11sp3.
Use one of the following platform-specific methods to change the setting:
- On RHEL platforms, complete the following steps:
  1. Open file /etc/sysconfig/cpuspeed.
  2. Search for the GOVERNOR= string.
  3. Edit the setting, adding the governor setting you chose in the previous step.
  4. Save and close the file.
  5. Type the following command:
    # service cpuspeed restart
- On SLES platforms, complete the following steps:
  1. Type the following command:
    # cpupower frequency-set -g governor
    
    For governor, specify the setting you chose in the previous step.
  2. Type the following command:
    # cpupower frequency-info
  3. Verify that the governor setting you specified appears in the current policy field of the command output.
  4. Use a text editor to edit the /etc/init.d/after.local file and add the following line:
    cpupower frequency-set -g governor
    
    The preceding line ensures that after each boot, the system sets the governor setting you specified.
Push the changes out to the compute nodes.

Perform the procedure in the following topic:

“Pushing Changes to the Compute Nodes” in Chapter 2

Use the cat(1) command to retrieve the list of available frequencies. For example:

# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies
3301000 3300000 3200000 3100000 3000000 2900000 2800000 2700000 2600000
2500000 2400000 2300000 2200000 2100000 2000000 1900000 1800000 1700000
1600000 1500000 1400000 1300000 1200000

The preceding output shows the available frequencies, listed in order from the highest, 3301000 kHz, to the lowest, 1200000 kHz.

On SGI systems, the second frequency listed is always the processor's nominal frequency. This is a 3.3 GHz processor, so 3300000 kHz is the nominal frequency.

You can also obtain the nominal frequency by typing the following command and examining the information in the model name field:

# cat /proc/cpuinfo

To enable turbo mode, use the cpupower command to set the maximum frequency to the nominal frequency of 3.3 GHz plus 1 MHz.

That is, specify a frequency of 3301 MHz. For example:
# cpupower frequency-set -u 3301MHz
Later, if you want to disable turbo mode, type the following command to set the maximum frequency back to the nominal frequency:
# cpupower frequency-set -u 3300MHz
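The nominal-plus-1-MHz value can be derived from the frequency list instead of being typed by hand. The following is a minimal sketch. The function name turbo_limit_mhz is hypothetical, and the sketch relies on the statement above that, on SGI systems, the second frequency listed is the nominal frequency.

```shell
# turbo_limit_mhz [FREQFILE]
# Print the nominal frequency plus 1 MHz, as a whole number of MHz.
# ASSUMPTION: the second value in the list is the nominal frequency
# in kHz, as stated for SGI systems.
turbo_limit_mhz() {
    freqs=${1:-/sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies}
    # Convert kHz to MHz, then add 1 MHz.
    awk 'NR == 1 && NF >= 2 { printf "%d\n", $2 / 1000 + 1 }' "$freqs"
}
```

For example, type cpupower frequency-set -u "$(turbo_limit_mhz)MHz" to set the maximum frequency without entering the value manually.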

Installing MPI on a Running SGI ICE X System

This topic explains how to install MPI on an SGI ICE X system. The instructions in this section update existing images instead of creating new ones. Note that it is easier to integrate MPI before you deploy the cluster.

The SGI® MPI and SGI® Accelerate™ software packages have embedded in them suggested package lists for each node type. The crepo command, used in the following example, makes use of these lists and recomputes the lists when new media is added and then selected.

File names in this example are just illustrations.

# crepo --add accelerate-1.0-cd1-media-rhel6-x86_64.iso
# crepo --add mpi-1.0-cd1-media-rhel6-x86_64.iso

Update the crepo selected repositories so that all repositories associated with the software distribution (distro) you are installing are present. For example, if you want MPI to work on RHEL 6, you might do something like this:

Show what is currently selected (the asterisks to the left):

# crepo --show
* SGI-Management-Center-1.7-rhel6 : /tftpboot/sgi/SGI-Management-Center-1.7-rhel6
* SGI-Foundation-Software-2.10-rhel6 : /tftpboot/sgi/SGI-Foundation-Software-2.10-rhel6
* SGI-XFS-XVM-2.5-for-RHEL-rhel6 : /tftpboot/sgi/SGI-XFS-XVM-2.5-for-RHEL-rhel6
* SGI-Accelerate-1.8-rhel6 : /tftpboot/sgi/SGI-Accelerate-1.8-rhel6
* SGI-Tempo-2.9.0-rhel6 : /tftpboot/sgi/SGI-Tempo-2.9.0-rhel6
* SGI-MPI-1.8-rhel6 : /tftpboot/sgi/SGI-MPI-1.8-rhel6
* Red-Hat-Enterprise-Linux-6.5 : /tftpboot/distro/rhel6.5
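To work with the selection in a script, you can filter the crepo --show output for the marked lines. The following sketch is an illustration only. The function name selected_repos is hypothetical, and the sketch assumes that each selected repository appears on a line that begins with an asterisk followed by the repository name, as in the preceding output.

```shell
# selected_repos
# Read `crepo --show` output on standard input and print the name of
# each selected repository, one per line.
# ASSUMPTION: selected lines have the form "* <name> : <path>".
selected_repos() {
    awk '$1 == "*" { print $2 }'
}
```

For example, type crepo --show | selected_repos to list only the names of the selected repositories.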

Unselect unrelated repositories:

# crepo --unselect SGI-Tempo-2.5-rhel6
Updating: /etc/opt/sgi/rpmlists/generated-compute-rhel6.5.rpmlist
Updating: /etc/opt/sgi/rpmlists/generated-service-rhel6.5.rpmlist
# crepo --unselect SGI-Foundation-Software-2.5-rhel6
Updating: /etc/opt/sgi/rpmlists/generated-compute-rhel6.5.rpmlist
Updating: /etc/opt/sgi/rpmlists/generated-service-rhel6.5.rpmlist
# crepo --unselect Red-Hat-Enterprise-Linux-6.5
Removing: /etc/opt/sgi/rpmlists/generated-compute-rhel6.5.rpmlist
Removing: /etc/opt/sgi/rpmlists/generated-service-rhel6.5.rpmlist

Select RHEL 6 related repositories:

# crepo --select Red-Hat-Enterprise-Linux-6.5
Updating: /etc/opt/sgi/rpmlists/generated-compute-rhel6.5.rpmlist
Updating: /etc/opt/sgi/rpmlists/generated-lead-rhel6.5.rpmlist
Updating: /etc/opt/sgi/rpmlists/generated-service-rhel6.5.rpmlist
# crepo --select SGI-Foundation-Software-2.5-rhel6
Updating: /etc/opt/sgi/rpmlists/generated-compute-rhel6.5.rpmlist
Updating: /etc/opt/sgi/rpmlists/generated-lead-rhel6.5.rpmlist
Updating: /etc/opt/sgi/rpmlists/generated-service-rhel6.5.rpmlist
# crepo --select SGI-XFS-XVM-2.5-for-RHEL-rhel6
Updating: /etc/opt/sgi/rpmlists/generated-compute-rhel6.5.rpmlist
Updating: /etc/opt/sgi/rpmlists/generated-lead-rhel6.5.rpmlist
Updating: /etc/opt/sgi/rpmlists/generated-service-rhel6.5.rpmlist

After you perform the preceding steps, the proper repositories are registered and selected, so you can operate on them by default. Because you are using an already deployed system, you need to update the existing images and, potentially, the existing service nodes themselves. This example uses the SGI suggested/generated rpmlists. If you have custom rpmlists, you need to manually reconcile the two lists for each node type. The list fragments in /var/opt/sgi/sgi-repodata/ may help you.

Note: The commands in this topic include a continuation character (\).

For a service node image, type the following:

# cinstallman --refresh-image --image service-rhel6.5 \
--rpmlist /etc/opt/sgi/rpmlists/generated-service-rhel6.5.rpmlist

For a compute node image, type the following:

# cinstallman --refresh-image --image compute-rhel6.5 \
--rpmlist /etc/opt/sgi/rpmlists/generated-compute-rhel6.5.rpmlist
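In the preceding commands, the image name and the rpmlist path share the same node type and distribution name. The following sketch prints the matching cinstallman command for review instead of running it, which can help you avoid a mismatch between the two. The function name refresh_image_cmd is hypothetical, and the sketch assumes that your images and generated rpmlists follow the naming pattern used in this example.

```shell
# refresh_image_cmd NODE_TYPE DISTRO
# Print (do not run) the cinstallman command that refreshes the image
# NODE_TYPE-DISTRO from its generated rpmlist.
# ASSUMPTION: names follow the pattern used in this topic, for example
# image compute-rhel6.5 and generated-compute-rhel6.5.rpmlist.
refresh_image_cmd() {
    printf 'cinstallman --refresh-image --image %s-%s --rpmlist /etc/opt/sgi/rpmlists/generated-%s-%s.rpmlist\n' \
        "$1" "$2" "$1" "$2"
}
```

For example, type refresh_image_cmd service rhel6.5 and compare the output with the rpmlist files that crepo reported before you run the command.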

Finally, you need to push the updated compute image to the rack leader controllers (RLCs).

Note: If the compute nodes are booted on the image and use NFS for their root file systems, you need to shut down the compute nodes before you can run this command.

# cimage --push-rack compute-rhel6.5 r"*"

To make sure that the compute nodes you are operating on use the compute image you just updated, run a command similar to the following:

# cimage --set compute-rhel6.5 2.6.32-71.el6.x86_64 "*"

You can find the available images and kernels using the cimage --list-images command.

If you have booted service or login nodes, you likely want to refresh those running nodes, too. (Alternatively, you can reinstall them.) The following is a refresh example:

# cinstallman --refresh-node --node service0 \
--rpmlist /etc/opt/sgi/rpmlists/generated-service-rhel6.5.rpmlist

Now reset or bring up the nodes, depending on the state in which you left them. If you want to bring up all nodes, the following command does not disrupt nodes that are already operating:

# cpower --system --up

Troubleshooting Configuration Changes

If a configuration change does not affect the SGI ICE X system in the intended manner, try one of the following approaches:

Edit the node image on the system admin controller (SAC). For example, you can reconfigure the service node image on the SAC and reimage the service nodes with that new image.
Edit the node customization scripts. For example, the compute node update scripts reside on the SAC in the /opt/sgi/share/per-host-customization/global directory.
