This chapter discusses ways in which the user can tune the run-time environment to improve the performance of an MPI message passing application on SGI computers. None of these ways involve application code changes.
One of the most common problems with optimizing message passing codes on large shared memory computers is achieving reproducible timings from run to run. To reduce run-time variability, you can take the following precautions:
Do not oversubscribe the system. In other words, do not request more CPUs than are available and do not request more memory than is available. Oversubscribing causes the system to wait unnecessarily for resources to become available and leads to variations in the results and less than optimal performance.
Avoid interference from other system activity. Both the Linux and IRIX kernels use more memory on node 0 than on other nodes (node 0 is called the kernel node in the following discussion). If your application uses almost all of the available memory per processor, the memory for processes assigned to the kernel node can unintentionally spill over to nonlocal memory. By keeping user applications off the kernel node, you can avoid this effect.
Additionally, by restricting system daemons to run on the kernel node, you can also deliver an additional percentage of each application CPU to the user. One solution IRIX provides to solve this problem is the boot_cpuset(4) command. The boot_cpuset ( man boot_cpuset) capability allows creation of a cpuset that contains the init process and all of its descendants, effectively preventing system functions from interfering with batch jobs running on the rest of the machine.
Avoid interference with other applications. You can use cpusets or cpumemsets to address this problem also. You can use cpusets (for IRIX) or cpumemsets (for Linux) to effectively partition a large, distributed memory host in a fashion that minimizes interactions between jobs running concurrently on the system. See IRIX Admin: Resource Administration and the Linux Resource Administration Guide for information about cpusets and cpumemsets.
On a quiet, dedicated system, you can use dplace or the MPI_DSM_CPULIST shell variable to improve run-time performance repeatability. These approaches are not as suitable for shared, nondedicated systems.
Use a batch scheduler; for example, LSF from Platform Computing or PBSpro from Veridan. These batch schedulers use cpusets to avoid oversubscribing the system and possible interference between applications.
By default, the SGI MPI implementation buffers messages whose lengths exceed 64 bytes. Longer messages are buffered in a shared memory region to allow for exchange of data between MPI processes. In the SGI MPI implementation, these buffers are divided into two basic pools.
For messages exchanged between MPI processes within the same host, buffers from the ”per process” pool (called the “per proc” pool) are used. Each MPI process is allocated a fixed portion of this pool when the application is launched. Each of these portions is logically partitioned into 16-KB buffers.
For MPI jobs running across multiple hosts, a second pool of shared memory is available. Messages exchanged between MPI processes on different hosts use this pool of shared memory, called the “per host” pool. The structure of this pool is somewhat more complex than the “per proc” pool.
For an MPI job running on a single host, messages that exceed 64 bytes are handled as follows. For messages with a length of 16 KB or less, the sender MPI process buffers the entire message. It then delivers a message header (also called a control message) to a mailbox, which is polled by the MPI receiver when an MPI call is made. Upon finding a matching receive request for the sender's control message, the receiver copies the data out of the shared memory buffer into the application buffer indicated in the receive request. The receiver then sends a message header back to the sender process, indicating that the shared memory buffer is available for reuse. Messages whose length exceeds 16 KB are broken down into 16-KB chunks, allowing the sender and receiver to overlap the copying of data to and from shared memory in a pipeline fashion.
Because there is a finite number of these shared memory buffers, this can be a constraint on the overall application performance for certain communication patterns. You can use the MPI_BUFS_PER_PROC shell variable to adjust the number of buffers available for the “per proc” pool. Similarly, you can use the MPI_BUFS_PER_HOST shell variable to adjust the “per host” pool. You can use the MPI statistics counters to determine if retries for these shared memory buffers are occurring.
For information on the use of these counters, see “MPI Internal Statistics” in Chapter 5. In general, you can avoid excessive numbers of retries for buffers by increasing the number of buffers for the “per proc” pool or “per host” pool. However, you should keep in mind that increasing the number of buffers does consume more memory. Also, increasing the number of “per proc” buffers does potentially increase the probability for cache pollution (that is, the excessive filling of the cache with message buffers). Cache pollution can result in degraded performance during the compute phase of a message passing application.
There are additional buffering considerations to take into account when running an MPI job across multiple hosts. For further discussion of multihost runs, see “Tuning for Running Applications Across Multiple Hosts”.
For message transfers between MPI processes within the same host or transfers between partitions, it is possible under certain conditions to avoid the need to buffer messages. Because many MPI applications are written assuming infinite buffering, the use of this unbuffered approach is not enabled by default for MPI_Send. This section describes how to activate this mechanism by default for MPI_Send. For MPI_Isend, MPI_Sendrecv, MPI_Alltoall, MPI_Bcase , MPI_Allreduce, and MPI_Reduce, this optimization is enabled by default for large message sizes.
On IRIX systems, when global memory is used for single copy optimization, the sender's message data must reside in globally accessible memory. Globally accessible memory includes common block or static memory and memory allocated with the Fortran 90 allocate statement or MPI_Alloc_mem (with the SMA_GLOBAL_ALLOC environment variable set). In addition, applications linked against the SHMEM library can also access the LIBSMA symmetric heap via the shpalloc or shmalloc functions. Consequently, use of this feature might require changes to the application. Additional restrictions are described in “Avoiding Message Buffering -- Single Copy Methods” in Chapter 3.
The threshold for message lengths beyond which MPI attempts to use this single copy method is specified by the MPI_BUFFER_MAX shell variable. Its value should be set to the message length in bytes beyond which the single copy method should be tried. In general, a value of 2000 or higher is beneficial for most applications running on a single host. To disable default single copy, use the MPI_DEFAULT_SINGLE_COPY_OFF environment variable.
MPI can take advantage of the XPMEM driver, a special cross-partition device driver, available on both IRIX and Linux systems, that allows the operating system to copy data between two processes within the same host or across partitions.
On systems running IRIX, this feature requires IRIX 6.5.13 or greater and is available only on Origin 300 and 3000 series servers. This option is not available on servers running Trusted IRIX. The MPI library uses the XPMEM driver to enhance single copy optimization (within a host) to eliminate some of the restrictions with only a slight (less that 5 percent) performance cost over the more restrictive single copy optimization using globally accessible memory.
You can enable this optimization if you set the MPI_XPMEM_ON and MPI_BUFFER_MAX environment variables. Note that if the sender data resides in globally accessible memory, the data is copied using a bcopy process. Otherwise, the XPMEM driver is used to transfer the data. Using the XPMEM form of single copy is less restrictive in that the sender's data is not required to be globally accessible. It is available for ABI N32 as well as ABI 64. This optimization also can be used to transfer data between two different executable files on the same host or two different executable files across IRIX partitions.
|Note: Use of the XPMEM driver disables the ability to checkpoint/restart an MPI job.|
On IRIX systems, under certain conditions, the XPMEM driver can take advantage of the block transfer engine (BTE) to provide increased bandwidth. In addition to having MPI_BUFFER_MAX and MPI_XPMEM_ON set, the send and receive buffers must be cache-aligned and the amount of data to transfer must be greater than or equal to MPI_XPMEM_THRESHOLD . The default value for MPI_XPMEM_THRESHOLD is 8192.
On systems running Linux, use of the XPMEM driver is required to support single-copy message transfers between two processes within the same host or across partitions. On Linux systems, during job startup, MPI uses the XPMEM driver (via the xpmem kernel module) to map memory from one MPI process onto another. The mapped areas include the static region, private heap, and stack region of each process.
Memory mapping allows each process to directly access memory from the address space of another process. This technique allows MPI to support single copy transfers for contiguous data types from any of these mapped regions. For these transfers, whether between processes residing on the same host or across partitions, the data is copied using a bcopy process. A bcopy process is also used to transfer data between two different executable files on the same host or two different executable files across partitions. For data residing outside of a mapped region (a /dev/zero region, for example), MPI uses the XPMEM driver to copy the data.
Memory mapping is enabled by default on Linux. To disable it, set the MPI_MEMMAP_OFF environment variable. Memory mapping must be enabled to allow single-copy transfers, MPI-2 one-sided communication, and certain collective optimizations.
The MPI library takes advantage of NUMA placement functions that are available on IRIX and Linux systems. Usually, the default placement is adequate. Under certain circumstances, however, you might want to modify this default behavior. The easiest way to do this is by setting one or more MPI placement shell variables. Several of the most commonly used of these variables are discribed in the following sections. For a complete listing of memory placement related shell variables, see the MPI(1) man page.
The MPI_DSM_CPULIST shell variable allows you to manually select processors to use for an MPI application. At times, specifying a list of processors on which to run a job can be the best means to insure highly reproducible timings, particularly when running on a dedicated system.
This setting is treated as a comma and/or hyphen delineated ordered list that specifies a mapping of MPI processes to CPUs. If running across multiple hosts, the per host components of the CPU list are delineated by colons.
|Note: This feature will not be compatible with job migration features available in future IRIX releases. In addition, this feature should not be used with MPI applications that use either of the MPI-2 spawn related functions.|
Examples of settings are as follows:
Place three MPI processes on CPUs 8, 16, and 32.
Place the MPI process rank zero on CPU 32, one on 16, and two on CPU 8.
Place the MPI processes 0 through 7 on CPUs 8 to 15. Place the MPI processes 8 through 15 on CPUs 32 to 39.
Place the MPI processes 0 through 7 on CPUs 39 to 32. Place the MPI processes 8 through 15 on CPUs 8 to 15.
Place the MPI processes 0 through 7 on the first host on CPUs 8 through 15. Place MPI processes 8 through 15 on CPUs 16 to 23 on the second host.
Note that the process rank is the MPI_COMM_WORLD rank. The interpretation of the CPU values specified in the MPI_DSM_CPULIST depends on whether the MPI job is being run within a cpuset. If the job is run outside of a cpuset, the CPUs specify cpunum values given in the hardware graph (hwgraph(4)). When running within a cpuset, the default behavior is to interpret the CPU values as relative processor numbers within the cpuset. To specify cpunum values instead, you can use the MPI_DSM_CPULIST_TYPE (MPI(1)) shell variable.
The number of processors specified should equal the number of MPI processes that will be used to run the application. The number of colon delineated parts of the list must equal the number of hosts used for the MPI job. If an error occurs in processing the CPU list, the default placement policy is used. To insure linking of the MPI processes to the designated processors, you should also set the MPI_DSM_MUSTRUN shell variable on IRIX only.
Use the MPI_DSM_DISTRIBUTE shell variable to ensure that each MPI process will get a physical CPU and memory on the node to which it was assigned. On Linux systems, if this environment variable is used without specifying an MPI_DSM_CPULIST variable, it will cause MPI to assign MPI ranks starting at logical CPU 0 and incrementing until all ranks have been placed. On Linux systems, therefore, it is recommended that this variable be used only if running within a cpumemset or on a dedicated system.
Use the MPI_DSM_MUSTRUN shell variable to ensure that each MPI process will get a physical CPU and memory on the node to which it was assigned. It has been observed that using this shell variable has led to improved performance, especially on IRIX systems running version 6.5.7 and earlier. With the MPT 1.8 release, the MPI_DSM_MUSTRUN variable is deprecated on Linux. Use MPI_DSM_DISTRIBUTE instead.
The MPI_DSM_PPM shell variable allows you to specify the number of MPI processes to be placed on a node. Memory bandwidth intensive applications can benefit from placing fewer MPI processes on each node of a distributed memory host. On Origin 200 and Origin 2000 series servers, the default is to place two MPI processes on each node. On Origin 300 and Origin 3000 series servers, the default is four MPI processes per node. You can use the MPI_DSM_PPM shell variable to change these values. On Origin 300 and Origin 3000 series servers, setting MPI_DSM_PPM to 2 places one MPI process on each memory bus. On SGI Altix 3000 systems, setting MPI_DSM_PPM to 1 places one MPI process on each node.
You can use the PAGESIZE_DATA and PAGESIZE_STACK variables to request nondefault page sizes (in kilobytes). Setting these variables can be helpful for applications that experience frequent TLB misses. You can ascertain this condition by using the ssrun or perfex profiling tools. However, these variables should be used with caution. Generally, system administrators do not configure the system to have many large pages per node. If very large page sizes are requested, you might lose good memory locality if the operating system is able to satisfy the large page request only with remote memory.
|Note: Because these variables are associated with NUMA placement, disabling NUMA placement via the MPI_DSM_OFF shell variable disables the use of these page size shell variables.|
|Note: These shell variables are currently not available on Linux systems.|
The dplace tool offers another means of specifying the placement of MPI processes within a distributed memory host. This tool is available on both Linux and IRIX systems. Starting with IRIX 6.5.13, dplace and MPI interoperate to allow MPI to better manage placement of certain shared memory data structures when dplace is used to place the MPI job. If this interoperability feature is undesirable, you can set the MPI_DPLACE_INTEROP_OFF shell variable.
For instructions on how to use dplace with MPI, see the dplace(1) man page.
Hybrid MPI/OpenMP applications might require special memory placement features to operate efficiently on ccNUMA Origin servers. This section describes a preliminary method for achieving this memory placement.
The basic idea is to space out the MPI processes to accommodate the OpenMP threads associated with each MPI process. In addition, assuming a particular ordering of library init code (see the DSO man page), this method employs procedures to insure that the OpenMP threads remain close to the parent MPI process. This type of placement has been found to improve the performance of some hybrid applications significantly.
To take partial advantage of this placement option, the following requirements must be met:
When running the application, you must set the MPI_OPENMP_INTEROP shell variable.
To compile the application, you must use a MIPSpro compiler and the -mp compiler option. This hybrid model placement option is not available with other compilers.
The application must run on an Origin 300 or Origin 3000 series server.
To take full advantage of this placement option, you must be able to link the application such that the libmpi.so init code is run before the libmp.so init code. For instructions on how to link the hybrid application, see “Compiling and Linking IRIX MPI Programs” in Chapter 2. This linkage issue has been removed in the MIPspro 7.4 (and later versions) compilers. It may, however, remain in earlier compiler versions.
You can use an additional memory placement feature for hybrid MPI/OpenMP applications by using the MPI_DSM_PLACEMENT shell variable. Specification of a “threadroundrobin” policy results in the parent MPI process stack, data, and heap memory segments being spread across the nodes on which the child OpenMP threads are running.
MPI reserves nodes for this hybrid placement model based on the number of MPI processes and the number of OpenMP threads per process, rounded up to the nearest multiple of 4. For example, if 6 OpenMP threads per MPI process are going to be used for a 4 MPI process job, MPI will request a placement for 32 (4 X 8) CPUs on the host machine. You should take this into account when requesting resources in a batch environment or when using cpusets. In this implementation, it is assumed that all MPI processes start with the same number of OpenMP threads, as specified by the OMP_NUM_THREADS or equivalent shell variable at job startup.
This placement is not recommended if you set _DSM_PPM to a non-default value (for more information, see pe_environ). Also, it is suggested that the mustrun shell variables (MPI_DSM_MUSTRUN and _DSM_MUSTRUN) not be set when using this placement model.
On Linux systems the MPI_OPENMP_INTEROP variable is supported. However, the OpenMP threads are not actually pinned to a CPU but are free to migrate to any of the CPUs in the OpenMP thread group for each MPI rank. The pinning of the OpenMP thread to a specific CPU will be supported in a future release.
When you are running an MPI application across a cluster of hosts, there are additional run-time environment settings and configurations that you can consider when trying to improve application performance.
IRIX hosts can be clustered using a variety of high performance interconnects. You can use the XPMEM interconnect to cluster Origin 300 and Origin 3000 series servers as partitioned systems. Other high performance interconnects include GSN and Myrinet. If none of these interconnects is available, MPI relies on TCP/IP to handle MPI traffic between hosts.
Systems running Linux can use the XPMEM interconnect to cluster hosts as partitioned systems, or rely on TCP/IP as the multihost interconnect.
When launched as a distributed application, MPI probes for these interconnects at job startup. For details of launching a distributed application, see “Launching a Distributed Application” in Chapter 2. When a high performance interconnect is detected, MPI attempts to use this interconnect if it is available on every host being used by the MPI job. If the interconnect is not available for use on every host, the library attempts to use the next slower interconnect until this connectivity requirement is met. Table 6-1 specifies the order in which MPI probes for available interconnects.
Default Order of Selection
Environment Variable to Require Use
Environment Variable for Specifying Device Selection
The third column of Table 6-1, also indicates the environment variable you can set to pick a particular interconnect other than the default. For example, suppose you want to run an MPI job on a cluster supporting both GSN and Myrinet (GM) interconnects. By default, the MPI job would try to run over the GSN interconnect. If for some reason you wanted to use the Myrinet (GM) interconnect, you would set the MPI_USE_GM shell variable before launching the job. This would cause the MPI library to attempt to run the job using the Myrinet (GM) interconnect. If the Myrinet interconnect cannot be used, the job will fail.
The XPMEM interconnect is an exception in that it does not require that all hosts in the MPI job need to be reachable via the XPMEM device. Message traffic between hosts not reachable via XPMEM will go over the next fastest interconnect. Also, when you specify a particular interconnect to use, you can set the MPI_USE_XPMEM variable in addition to one of the other four choices.
In general, to insure the best performance of the application, you should allow MPI to pick the fastest available interconnect.
When running in cluster mode, be careful about setting the MPI_BUFFER_MAX value too low. Setting it less than 16384 bytes could lead to a significant increase in the number of small control messages sent over the interconnect, possibly leading to performance degradation.
In addition to the choice of interconnect, you should know that multihost jobs use different buffers from those used by jobs run on a single host. In the SGI implementation of MPI, all of the previously mentioned interconnects rely on the “per host” buffers to deliver long messages. The default setting for the number of buffers per host might be too low for many applications. You can determine whether this setting is too low by using the MPI statistics described earlier in this section.
In particular, you should examine the metric for retries allocating MPI per host buffers. High retry counts usually indicate that the MPI_BUFS_PER_HOST shell variable should be increased. Table 6-2 provides an example of application performance as a function of the number of “per host” message buffers. Here, the Fourier Transform (FT) class C benchmark was run on a cluster of four Origin 300 servers (32 CPUs each) using Myrinet. Note that the performance improves by almost a factor of three by increasing the MPI_BUFS_PER_HOST from the default of 32 buffers to 128 buffers per host.
Execution Time (secs)
When considering these MPI statistics, GSN users should also examine the counter for retries allocating MPI per host message headers. In cases in which this metric indicates high numbers of retries, it might be necessary to increase the MPI_MSGS_PER_HOST shell variable . Myrinet (GM) does not use this resource.
When using GSN or Myrinet high performance networks, MPI attempts to use all adapters (cards) available on each host in the job. You can modify this behavior by specifying specific adapter(s) to use. The fourth column of Table 6-1 indicates the shell variable to use for a given network. For details on syntax, see the MPI man page.
When using the TCP/IP interconnect, unless specified otherwise, MPI uses the default IP adapter for each host. To use a nondefault adapter, enter the adapter-specific host name on the mpirun command line.