This chapter includes the following topics:
The following topics contain information about HET:
| Note: This documentation includes information about all SGI systems for the sake of completeness. |
All of your SGI system's IPMI controllers send SNMP traps to the management node, which is the system admin controller (SAC) on an SGI ICE X system or the system management node (SMN) on an SGI UV system. The SGI Foundation Software's HET tools process these system alerts and send an email notification after critical hardware events occur.
On SGI ICE X and SGI UV platforms, the HET tools are configured by default. You do not need to perform any additional system configuration, but SGI recommends that you customize the email address to which the HET tools send critical event notifications. On SGI Rackable systems, you need to configure the system manually as noted in the het(8) man page. The het(8) man page also contains information about HET defaults and internal processes.
HET accumulates information about system events in the following log file:
/var/log/het/het_trap_processor.log |
As an event-driven system monitoring tool, HET listens for system events. When HET receives information about an event, it converts the message from coded numbers into a readable form and then responds as follows:
When a noncritical event occurs, HET simply logs the event. As an option, you can configure an email address to receive noncritical event notifications.
When a critical event occurs, HET logs the event and sends an email message. SGI recommends that you edit file /etc/sysconfig/het and specify an email address specific to your site. By default, HET sends event information to root@localhost. For more information about how to customize HET notifications, see “Customizing HET Notifications”.
The firmware for each baseboard management controller (BMC) and the firmware for each cooling node on an SGI ICE X system includes threshold values for each component. If a system condition becomes too low or too high for its threshold, the BMC sends a critical event alert. The following are examples of critical system events that cause an alert:
Ambient air temperature outside of recommended range
Voltage sensor unable to attain a critical low voltage
Power supply failure
Loss of redundant power supply
Fan speed unable to attain a critical threshold or a loss of fan redundancy
Board processor modules that exceed a critical temperature threshold
Memory uncorrectable errors
You can customize the email addresses to which event information about NON-RECOVERABLE events is sent. As an option, you can specify a site-specific email address for less-severe events, or all HET events, too.
The HET log file, /var/log/het/het_trap_processor.log, contains information about all HET events. You can consult this file periodically to monitor noncritical events.
The following procedure explains how to configure an email address or email alias to receive HET notifications.
Procedure 4-1. To customize HET notifications
Log in as root and open the following file with a text editor:
/etc/sysconfig/het |
On an SGI ICE X system, log into the system admin controller (SAC) node.
On an SGI UV system, log into the system management node (SMN). HET requires an SMN. If your system does not include an SMN, you cannot enable HET.
Search the file for the following string:
HET_MAIL_TROUBLE_TO |
Change the default recipient, root, to be the email address of a person or the email alias of a group who can attend to the system when NON-RECOVERABLE events occur.
(Optional) Configure an email recipient for notifications about CRITICAL events.
Search the file for the following string:
HET_MAIL_NEWS_TO |
Specify an email address or alias to receive CRITICAL event notifications.
Save and close file /etc/sysconfig/het.
(Optional) Configure an email recipient for all HET events.
Complete the following steps:
Open file /etc/het.action.d/het_mail with a text editor.
Search for the following lines in /etc/het.action.d/het_mail:
# NOTE: Adjust if needed
# Default is an empty mailing list audience for
# non (NON-RECOVERABLE or CRITICAL) events.
to="" |
Edit the to="" line to specify an email address or an email alias between the quotation marks.
Save and close the file.
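The recipient changes in the preceding procedure can also be scripted. The following is a minimal sketch, not an SGI-supplied tool. It assumes that /etc/sysconfig/het holds shell-style NAME="value" assignments, which this chapter does not show explicitly. The helper name set_sysconfig_var and the example.com addresses are hypothetical, and the demonstration edits a scratch file rather than the live configuration file.

```shell
#!/bin/sh
# Sketch only: assumes the file holds shell-style NAME="value" assignments.
# set_sysconfig_var is a hypothetical helper, not part of HET.
set_sysconfig_var() {
    file=$1 name=$2 value=$3
    tmp=$(mktemp) || return 1
    # Rewrite the assignment that starts in column 1; keep all other lines.
    # If the variable is absent, append it at the end of the file.
    awk -v n="$name" -v v="$value" '
        index($0, n "=") == 1 { print n "=\"" v "\""; found = 1; next }
        { print }
        END { if (!found) print n "=\"" v "\"" }
    ' "$file" > "$tmp" && cat "$tmp" > "$file"
    rm -f "$tmp"
}

# Demonstration on a scratch file with an assumed layout.
demo=$(mktemp)
printf '%s\n' '# sample contents (assumed layout)' \
    'HET_MAIL_TROUBLE_TO="root"' \
    'HET_MAIL_NEWS_TO=""' > "$demo"
set_sysconfig_var "$demo" HET_MAIL_TROUBLE_TO "hw-oncall@example.com"
set_sysconfig_var "$demo" HET_MAIL_NEWS_TO "hw-news@example.com"
cat "$demo"
```

To apply the same change on a real system, you would pass /etc/sysconfig/het as the first argument while logged in as root on the SAC or SMN, after confirming that the file follows the assumed layout.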
The following is an example of a HET log file that contains critical information:
dump 2013-10-23.07.13.21 [het_process_thread:2] # begin ---------------------------------
dump 2013-10-23.07.13.21 [het_process_thread:2] agentAddr 172.24.0.2
dump 2013-10-23.07.13.21 [het_process_thread:2] het_type ipmi
dump 2013-10-23.07.13.21 [het_process_thread:2] guid r1lead
dump 2013-10-23.07.13.21 [het_process_thread:2] sn X1------
dump 2013-10-23.07.13.21 [het_process_thread:2] alertSeverity NONE
dump 2013-10-23.07.13.21 [het_process_thread:2] event uncorrectableECC
dump 2013-10-23.07.13.21 [het_process_thread:2] sensorName None-memory
dump 2013-10-23.07.13.21 [het_process_thread:2] sensorNumber 0x00
dump 2013-10-23.07.13.21 [het_process_thread:2] sensorTypeName memory
dump 2013-10-23.07.13.21 [het_process_thread:2] eventClassName discrete
dump 2013-10-23.07.13.21 [het_process_thread:2] event1 0x51
dump 2013-10-23.07.13.21 [het_process_thread:2] event2 0xff
dump 2013-10-23.07.13.21 [het_process_thread:2] event3 0x51
dump 2013-10-23.07.13.21 [het_process_thread:2] flap_count 1
dump 2013-10-23.07.13.21 [het_process_thread:2] # end ----------------------------------- |
The corresponding email message that HET sends is as follows:
X-Original-To: root
Delivered-To: [email protected]
Date: Wed, 18 Dec 2013 14:36:52 -0600
From: [email protected]
To: [email protected]
Subject: HET ALERT from cb9 - NON-RECOVERABLE
User-Agent: Heirloom mailx 12.2 01/07/07
The following HET(Hardware Environment Tracking) event has been recorded:
HET ALERT from cb9 - NON-RECOVERABLE
Event Details:
EVENT uncorrectableECC
HET r1i0n4
LOCATION r1i0n4
SENSOR None-memory
SENSORNUMBER 0x00
SENSORTHRESHOLD 81
SENSORTYPE memory
SENSORVALUE 255
SEVERITY NON-RECOVERABLE
SN X1------
TYPE ipmi |
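Because the HET log records every event, a short filter can make periodic review easier. The following sketch condenses each begin/end block of the log into one line. The field positions are inferred only from the log excerpt shown earlier, and summarize_het_log is a hypothetical helper name, so treat this as an illustration rather than a supported parser.

```shell
#!/bin/sh
# Sketch only: field positions are inferred from the sample log excerpt.
# summarize_het_log is a hypothetical helper, not part of HET.
summarize_het_log() {
    awk '
        # "# begin" opens a record: clear stored fields, remember the timestamp.
        $1 == "dump" && $4 == "#" && $5 == "begin" { split("", rec); ts = $2; next }
        # "# end" closes a record: print one summary line.
        $1 == "dump" && $4 == "#" && $5 == "end" {
            print ts, rec["guid"], rec["event"], rec["alertSeverity"]
            next
        }
        # Any other dump line is "key value" in fields 4 and 5.
        $1 == "dump" && NF >= 5 { rec[$4] = $5 }
    ' "$@"
}

# Demonstration on an abbreviated copy of the sample record.
sample=$(mktemp)
cat > "$sample" <<'EOF'
dump 2013-10-23.07.13.21 [het_process_thread:2] # begin ---------------------------------
dump 2013-10-23.07.13.21 [het_process_thread:2] guid r1lead
dump 2013-10-23.07.13.21 [het_process_thread:2] alertSeverity NONE
dump 2013-10-23.07.13.21 [het_process_thread:2] event uncorrectableECC
dump 2013-10-23.07.13.21 [het_process_thread:2] # end -----------------------------------
EOF
summarize_het_log "$sample"
```

On a live system, the argument would be /var/log/het/het_trap_processor.log.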
CPU frequency scaling allows the operating system to automatically and dynamically scale the processor frequency. CPU frequency scaling needs to be enabled in a compute node image if you want to take advantage of the Intel Turbo Boost technology that is built into each processor.
The Intel Turbo Boost Technology allows processor cores to run faster than the base operating frequency as long as they are operating below the limits set for power, current, and temperature. The CPU frequency scaling setting also affects power consumption and enables you to manage power consumption. For example, you can theoretically cut power consumption in half if you clock the computer's processors from 2 GHz down to 1 GHz.
The following procedures pertain to CPU frequency:
The procedure in this topic explains how to enable or disable CPU frequency scaling. CPU frequency scaling is disabled by default on SGI ICE X systems.
The following procedure explains how to change your CPU frequency scaling setting.
Procedure 4-2. To control CPU frequency scaling
Log into the system admin controller (SAC) as root.
Use the cimage --list-images command to retrieve a list of the compute node images you can edit:
For example:
# cimage --list-images
image: compute-sles11sp3.mpt
kernel: 3.0.76-0.11-default
kernel: 3.0.76-0.11-trace
image: compute-sles11sp3
kernel: 3.0.76-0.11-default |
The previous example shows the names of two images: compute-sles11sp3.mpt and compute-sles11sp3.
Type the following command to change to the directory that contains the image you want to edit:
# chroot /var/lib/systemimager/images/image_name |
For image_name, specify one of the compute node image names that the cimage command returned. For example, using the output from the preceding step, specify either compute-sles11sp3.mpt or compute-sles11sp3.
Use a text editor to open file /etc/modprobe.d/acpi-cpufreq.conf.
Note the following line in this file:
install acpi-cpufreq /bin/true |
To enable CPU frequency scaling, insert a pound (#) character as the first character in this line, which makes the line appear as follows:
#install acpi-cpufreq /bin/true |
To disable CPU frequency scaling, make sure that the install acpi-cpufreq /bin/true line does not contain a # character in column 1, which makes the line appear as follows:
install acpi-cpufreq /bin/true |
Save and close file /etc/modprobe.d/acpi-cpufreq.conf.
Push the changes out to the compute nodes.
Perform the procedure in the following topic:
(Optional) Change the CPU frequency governor setting and configure turbo mode.
The default governor setting and the default turbo mode setting are appropriate for most SGI ICE X systems. If you want to change these settings, proceed to the following:
“(Optional) Changing the Governor Setting and Configuring Turbo Mode”
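The edit to acpi-cpufreq.conf described in the preceding procedure amounts to toggling one comment character. In modprobe configuration, an install line tells modprobe to run the given command, here /bin/true, instead of loading the module, so commenting the line out lets the acpi-cpufreq module load. The following is a minimal sketch of that toggle. The helper name set_cpufreq_scaling is hypothetical, and the demonstration works on a scratch file.

```shell
#!/bin/sh
# Sketch only: set_cpufreq_scaling is a hypothetical helper.
# $1 = path to an acpi-cpufreq.conf file, $2 = enable | disable
set_cpufreq_scaling() {
    conf=$1 mode=$2
    tmp=$(mktemp) || return 1
    case $mode in
        enable)   # comment the line out so that the module can load
            sed 's|^install acpi-cpufreq /bin/true|#&|' "$conf" > "$tmp" ;;
        disable)  # drop the leading # so that /bin/true replaces the module load
            sed 's|^#install acpi-cpufreq /bin/true|install acpi-cpufreq /bin/true|' "$conf" > "$tmp" ;;
        *)  rm -f "$tmp"
            echo "usage: set_cpufreq_scaling FILE enable|disable" >&2
            return 2 ;;
    esac
    cat "$tmp" > "$conf"
    rm -f "$tmp"
}

# Demonstration on a scratch file that holds the default line.
conf_demo=$(mktemp)
echo 'install acpi-cpufreq /bin/true' > "$conf_demo"
set_cpufreq_scaling "$conf_demo" enable
cat "$conf_demo"
```

Both modes are idempotent, so running the same mode twice leaves the file unchanged. For a real image, the file argument would be the acpi-cpufreq.conf file inside the image directory under /var/lib/systemimager/images.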
Use the procedure in this topic to change the governor setting and, optionally, to configure turbo mode. When you enable turbo mode, you enable the CPU frequency to exceed its nominal level for short periods of time, depending on the processor, temperature, current, power, and other factors. For general information about turbo mode, see the following website:
The following procedure explains how to set the CPU frequency governor appropriately and how to configure turbo mode.
Procedure 4-3. To change the governor setting and configure turbo mode
Make sure that CPU frequency is enabled.
For information, see “Enabling or Disabling CPU Frequency Scaling”.
Examine the following list and choose a power governor setting:
| Governor Setting | Effect |
| ondemand | Dynamically switches between the available CPU frequencies when CPU load reaches 95%. Default. |
| performance | Runs the CPUs at the maximum frequency. |
| conservative | Dynamically switches between the available CPU frequencies when CPU load reaches 75%. |
| powersave | Runs the CPUs at the minimum frequency. |
| userspace | Runs the CPUs at user-specified frequencies. |
Use the cimage --list-images command to retrieve a list of the compute node images you can edit:
For example:
# cimage --list-images
image: compute-sles11sp3.mpt
kernel: 3.0.76-0.11-default
kernel: 3.0.76-0.11-trace
image: compute-sles11sp3
kernel: 3.0.76-0.11-default |
The previous example shows the names of two images: compute-sles11sp3.mpt and compute-sles11sp3.
Type the following command to change to the directory that contains the image you want to edit:
# chroot /var/lib/systemimager/images/image_name |
For image_name, specify one of the compute node image names that the cimage command returned. For example, using the output from the preceding step, specify either compute-sles11sp3.mpt or compute-sles11sp3.
Use one of the following platform-specific methods to change the setting:
On RHEL platforms, complete the following steps:
Open file /etc/sysconfig/cpuspeed.
Search for the GOVERNOR= string.
Edit the setting, adding the governor setting you chose in the previous step.
Save and close the file.
Type the following command:
# service cpuspeed restart |
On SLES platforms, complete the following steps:
Type the following command:
# cpupower frequency-set -g governor |
For governor, specify the setting you chose in the previous step.
Type the following command:
# cpupower frequency-info |
Verify that the governor setting you specified appears in the current policy field of the command output.
Use a text editor to edit the /etc/init.d/after.local file and add the following line:
cpupower frequency-set -g governor |
The preceding line ensures that after each boot, the system sets the governor setting you specified.
Push the changes out to the compute nodes.
Perform the procedure in the following topic:
Use the cat(1) command to retrieve the list of available frequencies. For example:
# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies
3301000 3300000 3200000 3100000 3000000 2900000 2800000 2700000 2600000 2500000 2400000 2300000 2200000 2100000 2000000 1900000 1800000 1700000 1600000 1500000 1400000 1300000 1200000 |
The preceding output shows the available frequencies, listed in order from the highest, 3301000 kHz, to the lowest, 1200000 kHz.
On SGI systems, the second frequency listed is always the processor's nominal frequency. This is a 3.3 GHz processor, so 3300000 kHz is the nominal frequency.
You can also obtain the nominal frequency by typing the following command and examining the information in the model name field:
# cat /proc/cpuinfo |
Use the cpupower command to set the frequency to the nominal frequency of 3.3 GHz plus 1 MHz.
That is, specify a frequency of 3301 MHz. For example:
# cpupower frequency-set -u 3301MHz |
Later, if you want to disable turbo mode, type the following command to set the maximum frequency back to the nominal frequency:
# cpupower frequency-set -u 3300MHz |
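The arithmetic in the turbo mode steps above can be sketched in a few lines of shell. The sketch relies on the convention stated earlier that the second entry in scaling_available_frequencies is the nominal frequency. The helper names nominal_mhz and turbo_mhz are hypothetical, and the script only prints the cpupower commands rather than running them.

```shell
#!/bin/sh
# Sketch only: relies on the stated convention that the second entry in
# scaling_available_frequencies is the nominal frequency.
# nominal_mhz and turbo_mhz are hypothetical helper names.
nominal_mhz() {
    # $1 = one line of space-separated frequencies in kHz, highest first
    echo "$1" | awk '{ print int($2 / 1000) }'
}
turbo_mhz() {
    # Turbo mode is requested as the nominal frequency plus 1 MHz.
    echo $(( $(nominal_mhz "$1") + 1 ))
}

freqs="3301000 3300000 3200000 3100000 1200000"   # abbreviated sample list
echo "enable turbo:  cpupower frequency-set -u $(turbo_mhz "$freqs")MHz"
echo "disable turbo: cpupower frequency-set -u $(nominal_mhz "$freqs")MHz"
```

On a live node, freqs would instead be read with cat from /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies.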
This topic explains how to install MPI on an SGI ICE X system. The instructions in this section update existing images instead of creating new ones. Integrating MPI before you deploy the cluster is easier.
The SGI® MPI and SGI® Accelerate™ software packages have embedded in them suggested package lists for each node type. The crepo command, used in the following example, makes use of these lists and recomputes the lists when new media is added and then selected.
File names in this example are just illustrations.
Register SGI MPI and SGI Accelerate with SGI Tempo, as follows:
# crepo --add accelerate-1.0-cd1-media-rhel6-x86_64.iso
# crepo --add mpi-1.0-cd1-media-rhel6-x86_64.iso |
Update the crepo selected repositories so that all repositories associated with the software distribution (distro) you are installing are present. For example, if you want MPI to work on RHEL 6, you might do something like this:
Show what is currently selected (the asterisks to the left):
# crepo --show
* SGI-Management-Center-1.7-rhel6 : /tftpboot/sgi/SGI-Management-Center-1.7-rhel6
* SGI-Foundation-Software-2.10-rhel6 : /tftpboot/sgi/SGI-Foundation-Software-2.10-rhel6
* SGI-XFS-XVM-2.5-for-RHEL-rhel6 : /tftpboot/sgi/SGI-XFS-XVM-2.5-for-RHEL-rhel6
* SGI-Accelerate-1.8-rhel6 : /tftpboot/sgi/SGI-Accelerate-1.8-rhel6
* SGI-Tempo-2.9.0-rhel6 : /tftpboot/sgi/SGI-Tempo-2.9.0-rhel6
* SGI-MPI-1.8-rhel6 : /tftpboot/sgi/SGI-MPI-1.8-rhel6
* Red-Hat-Enterprise-Linux-6.5 : /tftpboot/distro/rhel6.5 |
Unselect unrelated repositories:
# crepo --unselect SGI-Tempo-2.5-rhel6
Updating: /etc/opt/sgi/rpmlists/generated-compute-rhel6.5.rpmlist
Updating: /etc/opt/sgi/rpmlists/generated-service-rhel6.5.rpmlist
# crepo --unselect SGI-Foundation-Software-2.5-rhel6
Updating: /etc/opt/sgi/rpmlists/generated-compute-rhel6.5.rpmlist
Updating: /etc/opt/sgi/rpmlists/generated-service-rhel6.5.rpmlist
# crepo --unselect Red-Hat-Enterprise-Linux-6.5
Removing: /etc/opt/sgi/rpmlists/generated-compute-rhel6.5.rpmlist
Removing: /etc/opt/sgi/rpmlists/generated-service-rhel6.5.rpmlist |
Select RHEL 6 related repositories:
# crepo --select Red-Hat-Enterprise-Linux-6.5
Updating: /etc/opt/sgi/rpmlists/generated-compute-rhel6.5.rpmlist
Updating: /etc/opt/sgi/rpmlists/generated-lead-rhel6.5.rpmlist
Updating: /etc/opt/sgi/rpmlists/generated-service-rhel6.5.rpmlist
# crepo --select SGI-Foundation-Software-2.5-rhel6
Updating: /etc/opt/sgi/rpmlists/generated-compute-rhel6.5.rpmlist
Updating: /etc/opt/sgi/rpmlists/generated-lead-rhel6.5.rpmlist
Updating: /etc/opt/sgi/rpmlists/generated-service-rhel6.5.rpmlist
# crepo --select SGI-XFS-XVM-2.5-for-RHEL-rhel6
Updating: /etc/opt/sgi/rpmlists/generated-compute-rhel6.5.rpmlist
Updating: /etc/opt/sgi/rpmlists/generated-lead-rhel6.5.rpmlist
Updating: /etc/opt/sgi/rpmlists/generated-service-rhel6.5.rpmlist |
After you perform the preceding steps, the proper repositories are registered and selected, so subsequent commands operate on them by default. Because you are working with an already deployed system, you need to update the existing images and, potentially, the existing service nodes themselves. This example uses the SGI suggested/generated rpmlists. If you have custom rpmlists, you need to manually reconcile the two lists for each node type. The list fragments in /var/opt/sgi/sgi-repodata/ may help you.
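Before you refresh any images, it can help to confirm which repositories are selected. The following sketch extracts the selected repository names from crepo --show output. It assumes that selected entries are the lines that begin with an asterisk in the "* NAME : PATH" layout of the sample output, and that unselected entries lack the asterisk, which the sample does not show. The helper name selected_repos is hypothetical.

```shell
#!/bin/sh
# Sketch only: assumes selected repositories are the lines that begin with "*",
# in the "* NAME : PATH" layout of the sample output.
# selected_repos is a hypothetical helper, not part of SGI Tempo.
selected_repos() {
    awk '$1 == "*" { print $2 }' "$@"
}

# Demonstration on sample text taken from the crepo --show example.
selected_repos <<'EOF'
* SGI-Tempo-2.9.0-rhel6 : /tftpboot/sgi/SGI-Tempo-2.9.0-rhel6
* SGI-MPI-1.8-rhel6 : /tftpboot/sgi/SGI-MPI-1.8-rhel6
* Red-Hat-Enterprise-Linux-6.5 : /tftpboot/distro/rhel6.5
EOF
```

On a live SAC, you would pipe the output of crepo --show into selected_repos instead of using the sample text.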
| Note: The commands in this topic include a continuation character (\). |
For a service node image, type the following:
# cinstallman --refresh-image --image service-rhel6.5 \
    --rpmlist /etc/opt/sgi/rpmlists/generated-service-rhel6.5.rpmlist |
For a compute node image, type the following:
# cinstallman --refresh-image --image compute-rhel6.5 \
    --rpmlist /etc/opt/sgi/rpmlists/generated-compute-rhel6.5.rpmlist |
Finally, you need to push the updated compute image to the rack leader controllers (RLCs).
| Note: If the compute nodes are booted on the image and are using NFS for roots, you need to shut the compute nodes down before being able to run this command. |
# cimage --push-rack compute-rhel6.5 r"*" |
To make sure that the compute nodes you are operating on use the compute image you just updated, run a command similar to the following:
# cimage --set compute-rhel6.5 2.6.32-71.el6.x86_64 "*" |
You can find the available images and kernels using the cimage --list-images command.
If you have booted service or login nodes, you likely want to refresh those running nodes, too. Alternatively, you can reinstall them. The following is a refresh example:
# cinstallman --refresh-node --node service0 \
    --rpmlist /etc/opt/sgi/rpmlists/generated-service-rhel6.5.rpmlist |
Now reset or bring up the nodes, depending on the state in which you left them. If you want to bring up all nodes, the following command does not disrupt nodes that are already operating:
# cpower --system --up |
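The image refresh commands in this topic follow a regular naming pattern, which the following dry-run sketch makes explicit. It only prints commands and never runs them. The helper names rpmlist_path and print_refresh_plan are hypothetical, and the path pattern is inferred from the generated rpmlist names used in this example, so verify the names against your own system before you reuse them.

```shell
#!/bin/sh
# Sketch only: prints commands instead of running them.
# rpmlist_path and print_refresh_plan are hypothetical helpers; the path
# pattern is inferred from the generated rpmlist names in this topic.
rpmlist_path() {
    # $1 = node type (compute, service, lead), $2 = distro tag such as rhel6.5
    echo "/etc/opt/sgi/rpmlists/generated-$1-$2.rpmlist"
}
print_refresh_plan() {
    distro=$1
    # Refresh the service image first, then the compute image.
    for type in service compute; do
        echo "cinstallman --refresh-image --image $type-$distro --rpmlist $(rpmlist_path "$type" "$distro")"
    done
    # Push the updated compute image to the rack leader controllers.
    echo "cimage --push-rack compute-$distro r\"*\""
}

print_refresh_plan rhel6.5
```

Printing the plan first lets you review the image and rpmlist names before you type the real commands as root on the SAC.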
If a configuration change does not affect the SGI ICE X system in the intended manner, try one of the following approaches:
Edit the node image on the system admin controller (SAC). For example, you can reconfigure the service node image on the SAC and reimage the service nodes with that new image.
Edit the node customization scripts. For example, the compute node update scripts reside on the SAC in the /opt/sgi/share/per-host-customization/global directory.