Chapter 5. Performance Tuning

Chapter 5. Performance Tuning
Prev		Next

This chapter includes the following topics:

About Performance Tuning

After you analyze your code to determine where performance bottlenecks are occurring, you can turn your attention to making your programs run their fastest. One way to do this is to use multiple CPUs in parallel processing mode. However, this should be the last step. The first step is to make your program run as efficiently as possible on a single processor system and then consider ways to use parallel processing.

Intel provides tuning information, including information about the Intel processors, at the following website:

http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html .

This chapter describes the process of tuning your application for a single processor system and then tuning it for parallel processing. It also addresses how to improve the performance of floating-point programs and MPI applications.

Single Processor Code Tuning

Several basic steps are used to tune performance of single-processor code:

Get the expected answers and then tune performance. For details, see “Getting the Correct Results”.
Use existing tuned code, such as that found in math libraries and scientific library packages. For details, see “Using Tuned Code”.
Determine what needs tuning. For details, see “Determining Tuning Needs”.
Use the compiler to do the work. For details, see “Using Compiler Options to Optimize Performance ”.
Consider tuning cache performance. For details, see “Tuning the Cache Performance”.
Set environment variables to enable higher-performance memory management mode. For details, see “Managing Memory”.

Getting the Correct Results

One of the first steps in performance tuning is to verify that the correct answers are being obtained. After the correct answers are obtained, tuning can be done. You can verify answers by initially disabling specific optimizations and limiting default optimizations. This can be accomplished by using specific compiler options and by using debugging tools.

The following compiler options emphasize tracing and porting over performance:

Option	Purpose
`-O`	Disables all optimization. The default is `-O2`.
`-g`	Preserves symbols for debugging. In the past, using `-g` automatically put down the optimization level. In Intel compiler today, you can use `-O3` with `-g`.
`-fp-model`	Lets you specify compiler rules for the following: Value safety Floating-point (FP) expression evaluation FPU environment access Precise FP exceptions FP contractions The default is `-fp-model fast=1`. Note that `-mp` is an old option and is replaced by `-fp-model`.
`-r`, `-i`	Sets default real, integer, and logical sizes to 8 bytes, which are useful for porting codes from Cray, Inc. systems. This option explicitly declares intrinsic and external library functions.

For information about debugging tools that you can use to verify that correct answers are being obtained, see the following:

“About Debugging” in Chapter 2

Managing Heap Corruption Problems

You can use environment variables to check for heap corruption problems in programs that use glibc malloc/free dynamic memory management routines.

Set the MALLOC_CHECK_ environment variable to 1 to print diagnostic messages or to 2 to abort immediately when heap corruption is detected.

Overruns and underruns are circumstances in which an access to an array is outside the declared boundary of the array. Underruns and overruns cannot be simultaneously detected. The default behavior is to place inaccessible pages immediately after allocated memory.

Using Tuned Code

Where possible, use code that has already been tuned for optimum hardware performance.

The following mathematical functions should be used where possible to help obtain best results:

MKL, Intel's Math Kernel Library. This library includes BLAS, LAPACK, and FFT routines.
VML, the Vector Math Library, available as part of the MKL package (libmkl_vml_itp.so).
Standard Math library. Standard math library functions are provided with the Intel compiler's libimf.a file. If the -lm option is specified, glibc libm routines are linked in first.

Documentation is available for MKL and VML at the following website:

https://software.intel.com/en-us/intel-parallel-studio-xe-support/documentation

Determining Tuning Needs

Use the following tools to determine what points in your code might benefit from tuning:

Tool	Purpose
`time(1)`	Obtains an overview of user, system, and elapsed time.
`gprof(1)`	Obtains an execution profile of your program. This is a `pcsamp` profile. Use the `-p` compiler option to enable `gprof` use.
VTune	Monitors performance. This is an Intel performance monitoring tool. You can run it directly on your SGI UV system. The Linux server/Windows client is useful when you are working on a remote system.
`psrun`	Measures the performance of unmodified executables. This is a `PerfSuite` command-line utility. `psrun` takes as input a configuration XML document that describes the desired measurement. For more information, see the following website: http://perfsuite.ncsa.uiuc.edu/

For information about other performance analysis tools, see Chapter 2, “Performance Analysis and Debugging”.

Using Compiler Options to Optimize Performance

This topic describes several Intel compiler options that can optimize performance. In addition to the performance options and processor options that this topic describes, the following options might be useful to you:

The -help option displays a short summary of the ifort or icc options.
The -dryrun option displays the driver tool commands that ifort or icc generate. This option does not actually perform a compile.

For more information about the Intel compiler options, see the following:

https://software.intel.com/en-us/intel-parallel-studio-xe

Use the following options to help tune performance:

Option
	Purpose
`-fno-alias`
	Assumes no pointer aliasing. Pointer aliasing can create uncertainty about the possibility that two unrelated names might refer to the identical memory. Because of this uncertainty, the compiler assumes that any two pointers can point to the same location in memory. This can remove optimization opportunities, particularly for loops. Other aliasing options include `-ansi_alias` and `-fno_fnalias`. Note that incorrect alias assertions might generate incorrect code.
`-ip`
	Generates single file, interprocedural optimization. A related option, `-ipo` generates multifile, interprocedural optimization. Most compiler optimizations work within a single procedure, such as a function or a subroutine, at a time. This intra-procedural focus restricts optimization possibilities because a compiler is forced to make worst-case assumptions about the possible effects of a procedure. By using inter-procedural analysis, more than a single procedure is analyzed at once and code is optimized. It performs two passes through the code and requires more compile time.
`-O3`
	Enables `-O2` optimizations plus more aggressive optimizations, including loop transformation and prefetching. Loop transformations are found in a transformation file created by the compiler; you can examine this file to see what suggested changes have been made to loops. Prefetch instructions allow data to be moved into the cache before their use. A prefetch instruction is similar to a load instruction. Note that Level 3 optimization may not improve performance for all programs.
`-opt_report`
	Generates an optimization report and places it in the file specified by the `-opt_report_file` option.
`-prof_gen`, `-prof_use`
	Generates and uses profiling information. These options require a three-step compilation process: Compile with proper instrumentation using `-prof_gen`. Run the program on one or more training datasets. Compile with `-prof_use`, which uses the profile information from the training run.
`-S`
	Compiles and generates an assembly listing in the `.s` files and does not link. The assembly listing can be used in conjunction with the output generated by the `-opt_report` option to try to determine how well the compiler is optimizing loops.
`-vec-report`
	Controls information specific to the vectorizer. Intel Xeon series processors support vectorization, which can provide a powerful performance boost.
`-Ofast`
	Takes the place of, and is equivalent to specifying, the following options: `-ipo -O3 -no-prec-div -static -fp-model fast=2 -xHost`
`-diag-enable`
	Enables the Source Checker, which provides advanced diagnostics based on a detailed analysis of your source code. When enabled, the compiler performs static global analysis to find errors in software that the compiler does not typically detect. This general source code analysis tool is an additional diagnostic to help you debug your programs. You can use source code analysis options to detect the following types of potential errors in your compiled code: Incorrect usage of OpenMP directives Inconsistent object declarations in different program units Boundary violations Uninitialized memory Memory corruptions Memory leaks Incorrect usage of pointers and allocatable arrays Dead code and redundant executions Typographical errors or uninitialized variables Dangerous usage of unchecked input Source checker analysis performs a general overview check of a program for all possible values simultaneously. This is in contrast to run-time checking tools that run a program with a fixed set of values for input variables; such checking tools cannot easily check all edge effects. By not using a fixed set of input values, the source checker can check for obscure cases. In fact, you do not need to run the program for Source Checker because the analysis is performed at compilation time. The only requirement is a successful compilation. There are limitations to Source Checker analysis. Because the Source Checker does not fully interpret the analyzed program, it can generate so called false-positive messages. This is a fundamental difference between compiler errors and Source Checker errors. In the case of the source checker, you decide whether the generated error is legitimate and needs to be fixed.

The Intel compilers support additional options that are specific to each processor model. To determine the processor used in your system, examine the contents of the /proc/cpuinfo file. For example:

# cat /proc/cpuinfo | grep "model name"
model name      : Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
model name      : Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
model name      : Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
.
.
.

Use the information in Table 5-1 to determine the processor in your SGI system.

Table 5-1. SGI Systems and Intel Processors

SGI System	Intel Processor Model	Intel Processor Code Name
SGI UV 300	Xeon E7-8800 v2 series Xeon E7-8800 v3 series	Ivy Bridge EX Haswell EX
SGI UV 3000	Xeon E5-4600 v3 series	Haswell EP 4S
SGI UV 2000	Xeon E5-4600 series Xeon E5-4600 v2 series	Sandy Bridge EP 4S Ivy Bridge EP 4S
SGI UV 1000	Xeon 7500 series Xeon E7-8800 series	Nehalem EX Westmere EX
SGI ICE or SGI Rackable	Xeon E5-2600 series Xeon E5-2600 v2 series Xeon E5-2600 v3 series	Sandy Bridge EP Ivy Bridge EP Haswell EP

The following list shows the processor-specific compiler options:

Processor Option	Purpose
`-xAVX`	Generates instructions for the Sandy Bridge processors and the Ivy Bridge processors, which support Intel Advanced Vector Extensions (AVX) instructions.
`-xCORE-AVX2`	Generates instructions for Haswell processors, which support Intel AVX2 instructions.
`-xHost`	Generates instructions for the highest instruction set available on the compilation host processor.
`-xSSE4.2`	Generates instructions for Nehalem processors and Westmere processors, which support Intel SSE4.2 instructions.

Tuning the Cache Performance

The processor cache stores recently-used information in a place where it can be accessed quickly. This topic uses the following terms to describe cache performance tuning:

A cache line is the minimum unit of transfer from next-higher cache into this one.
A cache hit is reference to a cache line that is present in the cache.
A cache miss is reference to a cache line that is not present in this cache level and must be retrieved from a higher cache, from memory, or from swap space.
The hit time is the time to access the upper level of the memory hierarchy, which includes the time needed to determine whether the access is a hit or a miss.
A miss penalty is the time to replace a block in the upper level with the corresponding block from the lower level, plus the time to deliver this block to the processor. The time to access the next level in the hierarchy is the major component of the miss penalty.

There are several actions you can take to help tune cache performance:

Avoid large power-of-two (and multiples thereof) strides and dimensions that cause cache thrashing. Cache thrashing occurs when multiple memory accesses require use of the same cache line. This can lead to an unnecessary number of cache misses.

To prevent cache thrashing, redimension your vectors so that the size is not a power of two. Space the vectors out in memory so that concurrently accessed elements map to different locations in the cache. When working with two-dimensional arrays, make the leading dimension an odd number. For multidimensional arrays, change two or more dimensions to an odd number.

For example, assume that a cache in the hierarchy has a size of 256 KB, which is 65536 four-byte words. A Fortran program contains the following loop:
real data(655360,24) ... do i=1,23 do j=1,655360 diff=diff+data(j,i)-data(j,i+1) enddo enddo
The two accesses to data are separated in memory by 655360*4 bytes, which is a simple multiple of the cache size. They consequently load to the same location in the cache. Because both data items cannot simultaneously coexist in that cache location, a pattern of replace on reload occurs that considerably reduces performance.
Use a memory stride of 1 wherever possible. A loop over an array should access array elements from adjacent memory addresses. When the loop iterates through memory by consecutive word addresses, it uses every word of every cache line in sequence and does not return to a cache line after finishing it.

If memory strides other than 1 are used, cache lines could be loaded multiple times if an array is too large to be held in memory at one time.
Cache bank conflicts can occur if there are two accesses to the same 16-byte-wide bank at the same time.

A maximum of four performance monitoring events can be counted simultaneously.
Group together data that is used at the same time and do not use vectors in your code, if possible. If elements that are used in one loop iteration are contiguous in memory, it can reduce traffic to the cache and fewer cache lines will be fetched for each iteration of the loop.
Try to avoid the use of temporary arrays and minimize data copies.

Managing Memory

The following topics pertain to memory management:

Memory Use Strategies

The following are some general memory use goals and guidelines:

Register reuse. Do a lot of work on the same data before working on new data.
Cache reuse. The program is much more efficient if all of the data and instructions fit in cache. If the data and instructions do not fit in the cache, try to use what is in cache before using anything that is not in cache.
Data locality. Try to access data that is nearby in memory before attempting to access data that is far away in memory.
I/O efficiency. Perform a large amount of I/O operations all at once, rather than a little bit at a time. Do not mix calculations and I/O.

Memory Hierarchy Latencies

Memory is not arranged as a flat, random access storage device. It is critical to understand that memory is a hierarchy to get good performance. Memory latency differs within the hierarchy. Performance is affected by where the data resides.

CPUs that are waiting for memory are not doing useful work. Software should be hierarchy-aware to achieve best performance, so observe the following guidelines:

Perform as many operations as possible on data in registers.
Perform as many operations as possible on data in the cache(s).
Keep data uses spatially and temporally local.
Consider temporal locality and spatial locality.

Memory hierarchies take advantage of temporal locality by keeping more recently accessed data items closer to the processor. Memory hierarchies take advantage of spatial locality by moving contiguous words in memory to upper levels of the hierarchy.

Tuning Multiprocessor Codes

The following topics explain multiprocessor tuning, which consists of the following major steps:

Perform single processor tuning, which benefits multiprocessor codes also. For information, see “Single Processor Code Tuning”.
Determine the parts of your code that can be parallelized. For information, see “Data Decomposition”.
Choose the parallelization methodology for your code. For information, see “Measuring Parallelization and Parallelizing Your Code”.
Analyze your code to make sure it is parallelizing properly. For information, see Chapter 2, “Performance Analysis and Debugging”.
Determine if false sharing exists. False sharing refers to OpenMP, not MPI. For information, see “Fixing False Sharing”.
Tune for data placement. For information, see Chapter 4, “Data Process and Placement Tools”.
Use environment variables to assist with tuning. For information, see “Environment Variables for Performance Tuning ”.

Data Decomposition

In order to efficiently use multiple processors on a system, tasks have to be found that can be performed at the same time. There are two basic methods of defining these tasks:

Functional parallelism

Functional parallelism is achieved when different processors perform different functions. This is a known approach for programmers trained in modular programming. Disadvantages to this approach include the difficulties of defining functions as the number of processors grow and finding functions that use an equivalent amount of CPU power. This approach may also require large amounts of synchronization and data movement.
Data parallelism

Data parallelism is achieved when different processors perform the same function on different parts of the data. This approach takes advantage of the large cumulative memory. One requirement of this approach, though, is that the problem domain be decomposed . There are two steps in data parallelism:
1. Decompose the data.
  
  Data decomposition is breaking up the data and mapping data to processors. You, the programmer, can break up the data explicitly by using message passing (with MPI) and data passing (using the SHMEM library routines). Alternatively, you can employ compiler-based MP directives to find parallelism in implicitly decomposed data.
  
  There are advantages and disadvantages to implicit and explicit data decomposition:
  - Implicit decomposition advantages: No data resizing is needed. All synchronization is handled by the compiler. The source code is easier to develop and is portable to other systems with OpenMP or High Performance Fortran (HPF) support.
  - Implicit decomposition disadvantages : The data communication is hidden by the user
  - Explicit decomposition advantages : The programmer has full control over insertion of communication and synchronization calls. The source code is portable to other systems. Code performance can be better than implicitly parallelized codes.
  - Explicit decomposition disadvantages : Harder to program. The source code is harder to read and the code is longer (typically 40% more).
2. Divide the work among processors.

Measuring Parallelization and Parallelizing Your Code

When tuning for performance, first assess the amount of code that is parallelized in your program. Use the following formula to calculate the amount of code that is parallelized:

p=N(T(1)-T(N)) / T(1)(N-1)

In this equation, T(1) is the time the code runs on a single CPU and T(N) is the time it runs on N CPUs. Speedup is defined as T(1)/T(N).

If speedup/N is less than 50% (that is, N>(2-p)/(1- p)), stop using more CPUs and tune for better scalability.

You can use one of the following to display CPU activity:

The top(1) command.
The vmstat(8) command.
The open source Performance Co-Pilot tools. For example, pmval(1) (pmval kernel.percpu.cpu.user) or the visualization command pmchart (1).

Next, focus on using one of the following parallelization methodologies:

Using SGI MPI

The SGI Performance Suite includes the SGI Message Passing Toolkit (SGI MPT). SGI MPT includes both the SGI Message Passing Interface (SGI MPI) and SGI SHMEM. SGI MPI is optimized and more scalable for SGI UV series systems than the generic MPI libraries. SGI MPI takes advantage of the SGI UV architecture and SGI nonuniform memory access (NUMA) features.

Use the -lmpi compiler option to use MPI. For a list of environment variables that are supported, see the mpi(1) man page.

The MPIO_DIRECT_READ and MPIO_DIRECT_WRITE environment variables are supported under Linux for local XFS filesystems in SGI MPT version 1.6.1 and beyond.

MPI provides the MPI-2 standard MPI I/O functions that provide file read and write capabilities. A number of environment variables are available to tune MPI I/O performance. The mpi_io(3) man page describes these environment variables.

For information about performance tuning for MPI applications, see the following:

SGI MPI and SGI SHMEM User Guide
MPInside Reference Guide

Using OpenMP

OpenMP is a shared memory multiprocessing API, which standardizes existing practice. It is scalable for fine or coarse grain parallelism with an emphasis on performance. It exploits the strengths of shared memory and is directive-based. The OpenMP implementation also contains library calls and environment variables. OpenMP is included with the C, C++, and Fortran compilers.

To use OpenMP directives, specify the ifort -openmp or icc -openmp compiler options. These options use the OpenMP front-end that is built into the Intel compilers. The latest Intel compiler OpenMP run-time library name is libiomp5.so. The latest Intel compiler also supports the GNU OpenMP library as an either/or option, in other words, do not mix-and-match the GNU library with the Intel version.

For more information, see the OpenMP standard at the following website:

http://www.openmp.org/wp/openmp-specifications

Identifying OpenMP Nested Parallelism

The following Open MP nested parallelism output shows two primary threads and four secondary threads, called master/nested:

% cat place_nested
firsttask cpu=0
thread name=a.out oncpu=0 cpu=4 noplace=1 exact onetime thread name=a.out oncpu=0
cpu=1-3 exact thread name=a.out oncpu=4 cpu=5-7 exact

% dplace -p place_nested a.out
Master thread 0 running on cpu 0
Master thread 1 running on cpu 4
Nested thread 0 of master 0 gets task 0 on cpu 0 Nested thread 1 of master 0 gets task 1 on cpu 1
Nested thread 2 of master 0 gets task 2 on cpu 2 Nested thread 3 of master 0 gets task 3 on cpu 3
Nested thread 0 of master 1 gets task 0 on cpu 4 Nested thread 1 of master 1 gets task 1 on cpu 5
Nested thread 2 of master 1 gets task 2 on cpu 6 Nested thread 3 of master 1 gets task 3 on cpu 7

For more information, see the dplace(1) man page.

Using Compiler Options

You can use compiler options to invoke automatic parallelization. Use the -parallel or -par_report options to the ifort or icc compiler commands. These options show which loops were parallelized and the reasons why some loops were not parallelized. If a source file contains many loops, it might be necessary to add the -override_limits flag to enable automatic parallelization. The code generated by the -parallel option is based on the OpenMP API. The standard OpenMP environment variables and Intel extensions apply.

There are some limitations to automatic parallelization:

For Fortran codes, only DO loops are analyzed
For C/C++ codes, only for loops using explicit array notation or those using pointer increment notation are analyzed. In addition, for loops using pointer arithmetic notation are not analyzed, nor does it analyze while or do while loops. The compiler also does not check for blocks of code that can be run in parallel.

Identifying Opportunities for Loop Parallelism in Existing Code

Another parallelization optimization technique is to identify loops that have a potential for parallelism, such as the following:

Loops without data dependencies; a data dependency conflict occurs when a loop has results from one loop pass that are needed in future passes of the same loop.
Loops with data dependencies because of temporary variables, reductions, nested loops, or function calls or subroutines.

Loops that do not have a potential for parallelism are those with premature exits, too few iterations, or those where the programming effort to avoid data dependencies is too great.

Fixing False Sharing

If the parallel version of your program is slower than the serial version, false sharing might be occurring. False sharing occurs when two or more data items that appear not to be accessed by different threads in a shared memory application correspond to the same cache line in the processor data caches. If two threads executing on different CPUs modify the same cache line, the cache line cannot remain resident and correct in both CPUs, and the hardware must move the cache line through the memory subsystem to retain coherency. This causes performance degradation and reduction in the scalability of the application. If the data items are only read, not written, the cache line remains in a shared state on all of the CPUs concerned. False sharing can occur when different threads modify adjacent elements in a shared array. When two CPUs share the same cache line of an array and the cache is decomposed, the boundaries of the chunks split at the cache line.

If you suspect false sharing, take one of the following actions:

Use the information in the following manual to determine the appropriate hardware performance counter names relevant to false sharing:

http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html
On SGI UV systems, you can use the hubstats(1) command in the SGI Foundation Software suite to verify whether false sharing is occurring.

If false sharing is a problem, try the following solutions:

Use the hardware counter to run a profile that monitors storage to shared cache lines. This shows the location of the problem.
Revise data structures or algorithms.
Check shared data, static variables, common blocks, private variables, and public variables in shared objects.
Use critical regions to identify the part of the code that has the problem.

Environment Variables for Performance Tuning

You can use several different environment variables to assist in performance tuning. For details about environment variables used to control MPI behavior, see the mpi(1) man page.

Several OpenMP environment variables can affect the actions of the OpenMP library. For example, some environment variables control the behavior of threads in the application when they have no work to perform or are waiting for other threads to arrive at a synchronization semantic. Other environment variables can specify how the OpenMP library schedules iterations of a loop across threads. The following environment variables are part of the OpenMP standard:

OMP_NUM_THREADS. The default is the number of CPUs in the system.
OMP_SCHEDULE. The default is static.
OMP_DYNAMIC. The default is false.
OMP_NESTED. The default is false.

In addition to the preceding environment variables, Intel provides several OpenMP extensions, two of which are provided through the use of the KMP_LIBRARY variable.

The KMP_LIBRARY variable sets the run-time execution mode, as follows:

If set to serial, single-processor execution is used.
If set to throughput, CPUs yield to other processes when waiting for work. This is the default and is intended to provide good overall system performance in a multiuser environment.
If set to turnaround, worker threads do not yield while waiting for work. Setting KMP_LIBRARY to turnaround may improve the performance of benchmarks run on dedicated systems, where multiple users are not contending for CPU resources.

If your program generates a segmentation fault immediately upon execution, you might need to increase KMP_STACKSIZE. This is the private stack size for threads. The default is 4 MB. You may also need to increase your shell stacksize limit.

Understanding Parallel Speedup and Amdahl's Law

You can use multiple CPUs in the following ways:

Take a conventional program in C, C++, or Fortran, and have the compiler find the parallelism that is implicit in the code.
Write your source code to use explicit parallelism. In in the source code, specify the parts of the program that you want to execute asynchronously and how the parts are to coordinate with each other.

When your program runs on more than one CPU, its total run time should be less. But how much less? What are the limits on the speedup? That is, if you apply 16 CPUs to the program, should it finish in 1/16th the elapsed time?

This section covers the following topics:

Adding CPUs to Shorten Execution Time

You can distribute the work your program does over multiple CPUs. However, there is always some part of the program's logic that has to be executed serially, by a single CPU. This sets the lower limit on program run time.

Suppose there is one loop in which the program spends 50% of the execution time. If you can divide the iterations of this loop so that half of them are done in one CPU while the other half are done at the same time in a different CPU, the whole loop can be finished in half the time. The result is a 25% reduction in program execution time.

The mathematical treatment of these ideas is called Amdahl's law, for computer pioneer Gene Amdahl, who formalized it. There are two basic limits to the speedup you can achieve by parallel execution:

The fraction of the program that can be run in parallel, p, is never 100%.
Because of hardware constraints, after a certain point, there are diminishing benefits from each added CPU.

Tuning for parallel execution comes down to doing the best that you are able to do within these two limits. You strive to increase the parallel fraction, p, because in some cases even a small change in p (from 0.8 to 0.85, for example) makes a dramatic change in the effectiveness of added CPUs.

Then you work to ensure that each added CPU does a full CPU's work and does not interfere with the work of other CPUs. In the SGI UV architectures this means coding to accomplish the following:

Spreading the workload equally among the CPUs
Eliminating false sharing and other types of memory contention between CPUs
Making sure that the data used by each CPU are located in a memory near that CPU's node

Understanding Parallel Speedup

If half the iterations of a loop are performed on one CPU, and the other half run at the same time on a second CPU, the whole loop should complete in half the time. For example, consider the typical C loop in Example 5-1.

Example 5-1. Typical C Loop

for (j=0; j<MAX; ++j) {
   z[j] = a[j]*b[j];
}

The compiler can automatically distribute such a loop over n CPUs (with n decided at run time based on the available hardware), so that each CPU performs MAX/n iterations.

The speedup gained from applying n CPUs, Speedup(n), is the ratio of the single-CPU execution time to the n-CPU execution time: Speedup(n) = T(1) ÷ T(n). If you measure the single-CPU execution time of a program at 100 seconds, and the program runs in 60 seconds with two CPUs, Speedup(2) = 100 ÷ 60 = 1.67.

This number captures the improvement from adding hardware. T(n) ought to be less than T(1). If it is not, adding CPUs has made the program slower, and something is wrong. So Speedup(n) should be a number greater than 1.0, and the greater it is, the better. Intuitively you might hope that the speedup would be equal to the number of CPUs (twice as many CPUs, half the time) but this ideal can seldom be achieved.

Understanding Superlinear Speedup

You expect Speedup(n) to be less than n, reflecting the fact that not all parts of a program benefit from parallel execution. However, it is possible, in rare situations, for Speedup(n) to be larger than n. When the program has been sped up by more than the increase of CPUs it is known as superlinear speedup.

A superlinear speedup does not really result from parallel execution. It comes about because each CPU is now working on a smaller set of memory. The problem data handled by any one CPU fits better in cache, so each CPU runs faster than the single CPU can. A superlinear speedup is welcome, but it indicates that the sequential program was being held back by cache effects.

Understanding Amdahl's Law

There are always parts of a program that you cannot make parallel, where code must run serially. For example, consider the loop. Some amount of code is devoted to setting up the loop and allocating the work between CPUs. This housekeeping must be done serially. Then comes parallel execution of the loop body, with all CPUs running concurrently. At the end of the loop comes more housekeeping that must be done serially. For example, if n does not divide MAX evenly, one CPU must execute the few iterations that are left over.

Concurrency cannot speed up the serial parts of the program. Let p be the fraction of the program's code that can be made parallel (p is always a fraction less than 1.0.) The remaining fraction (1-p) of the code must run serially. In practical cases, p ranges from 0.2 to 0.99.

The potential speedup for a program is proportional to p divided by the CPUs you can apply, plus the remaining serial part, 1-p. As an equation, this appears as Example 5-2.

Example 5-2. Amdahl's law: Speedup(n) Given p

                  1 
Speedup(n) = ----------- 
             (p/n)+(1-p)

Suppose p = 0.8; then Speedup (2) = 1 / (0.4 + 0.2) = 1.67, and Speedup(4)= 1 / (0.2 + 0.2) = 2.5. The maximum possible speedup, if you could apply an infinite number of CPUs, would be 1 / (1-p). The fraction p has a strong effect on the possible speedup.

The reward for parallelization is small unless p is substantial (at least 0.8). To put the point another way, the reward for increasing p is great no matter how many CPUs you have. The more CPUs you have, the more benefit you get from increasing p. Using only four CPUs, you need only p= 0.75 to get half the ideal speedup. With eight CPUs, you need p= 0.85 to get half the ideal speedup.

There is a slightly more sophisticated version of Amdahl's law that includes communication overhead. This version shows that if the program has no serial part and you increase the number of cores, the following occurs:

The amount of computations per core diminishes.
The communication overhead increases (unless there is no communication and there is trivial parallelization).
The efficiency of the code and the speedup diminishes.

The equation is as follows:

Speedup(n) = n/(1+ a*( n-1) + n*(tc/ ts))

The preceding equation uses the following variables:

n is the number of processes.
a is the fraction of the given task not dividable into concurrent subtasks.
ts is the time to execute the task in a single processor.
tc is the communication overhead.

If a=0 and tc=0, there is no serial part and no communication. In this case, as in a trivial parallelization program, you see a linear speedup.

Calculating the Parallel Fraction of a Program

You do not have to guess at the value of p for a given program. Measure the execution times T(1) and T(2) to calculate a measured Speedup(2) = T(1) / T(2). The Amdahl's law equation can be rearranged to yield p when Speedup (2) is known, as in Example 5-3.

Example 5-3. Amdahl's law: p Given Speedup(2)

     2    SpeedUp(2) - 1 
p = --- * -------------- 
     1      SpeedUp(2)

Suppose you measure T(1) = 188 seconds and T(2) = 104 seconds.

SpeedUp(2) = 188/104 = 1.81 
p = 2 * ((1.81-1)/1.81) = 2*(0.81/1.81) = 0.895

In some cases, the Speedup(2) = T(1)/T(2) is a value greater than 2; in other words, a superlinear speedup. When this occurs, the formula in Example 5-3 returns a value of p greater than 1.0, which is clearly not useful. In this case, you need to calculate p from two other more realistic timings, for example T(2) and T(3). The general formula for p is shown in Example 5-4, where n and m are the two CPU counts whose speedups are known, n>m.

Example 5-4. Amdahl's Law: p Given Speedup( n) and Speedup( m)

                Speedup(n) - Speedup(m)
p  =  ------------------------------------------- 
   (1 - 1/n)*Speedup(n) - (1 - 1/m)*Speedup(m)

For more information about superlinear speedups, see the following:

“Understanding Superlinear Speedup”

Predicting Execution Time with n CPUs

You can use the calculated value of p to extrapolate the potential speedup with higher numbers of CPUs. The following example shows the expected time with four CPUs, if p=0.895 and T(1)=188 seconds:

Speedup(4)= 1/((0.895/4)+(1-0.895)) = 3.04 
T(4)= T(1)/Speedup(4) = 188/3.04 = 61.8

The calculation can be made routine using the computer by creating a script that automates the calculations and extrapolates run times.

These calculations are independent of most programming issues such as language, library, or programming model. They are not independent of hardware issues because Amdahl's law assumes that all CPUs are equal. At some level of parallelism, adding a CPU no longer affects run time in a linear way. For example, on some architectures, cache-friendly codes scale closely with Amdahl's law up to the maximum number of CPUs, but scaling of memory intensive applications slows as the system bus approaches saturation. When the bus bandwidth limit is reached, the actual speedup is less than predicted.

Gustafson's Law

Gustafson's law proposes that programmers set the size of problems to use the available equipment to solve problems within a practical fixed time. Therefore, if faster, more parallel equipment is available, larger problems can be solved in the same time. Amdahl's law is based on fixed workload or fixed problem size. It implies that the sequential part of a program does not change with respect to machine size (for example, the number of processors). However, the parallel part is evenly distributed by n processors. The effect of Gustafson's law was to shift research goals to select or reformulate problems so that solving a larger problem in the same amount of time would be possible. In particular, the law redefines efficiency as a need to minimize the sequential part of a program even if it increases the total amount of computation. The effect is that by running larger problems, it is hoped that the bulk of the calculation will increase faster than the serial part of the program, allowing for better scaling.

Floating-point Program Performance

Certain floating-point programs experience slowdowns due to excessive floating point traps called Floating-Point Software Assists (FPSWAs).

These slowdowns occur when the hardware fails to complete a floating-point operation and requests help (emulation) from software. This happens, for instance, with denormalized numbers.

The symptoms are a slower than normal execution and an FPSWA message in the system log. Use the dmesg(1) to display the message. The average cost of an FPSWA fault is quite high, around 1000 cycles/fault.

By default, the kernel prints a message similar to the following in the system log:

foo(7716): floating-point assist fault at ip 40000000000200e1
            isr 0000020000000008

The kernel throttles the message in order to avoid flooding the console.

It is possible to control the behavior of the kernel on FPSWA faults using the prctl(1) command. In particular, it is possible to get a signal delivered at the first FPSWA. It is also possible to silence the console message.

About MPI Application Tuning

When you design your MPI application, make sure to include the following in your design:

The pinning of MPI processes to CPUs
The isolating of multiple MPI jobs onto different sets of sockets and Hubs

You can achieve this design by configuring a batch scheduler to create a cpuset for every MPI job. MPI pins its processes to the sequential list of logical processors within the containing cpuset by default, but you can control and alter the pinning pattern using MPI_DSM_CPULIST. For more information about these programming practices, see the following:

The MPI_DSM_CPULIST discussion in the SGI MPI and SGI SHMEM User Guide.
The omplace(1) and dplace(1) man pages.
The SGI Cpuset Software Guide.

MPI Application Communication on SGI Hardware

On an SGI UV system, the following two transfer methods facilitate MPI communication between processes:

Shared memory
The global reference unit (GRU), which is part of the SGI UV Hub ASIC

The SGI UV series systems use a scalable nonuniform memory access (NUMA) architecture to allow the use of thousands of processors and terabytes of RAM in a single Linux operating system instance. As in other large, shared-memory systems, memory is distributed to processor sockets and accesses to memory are cache coherent. Unlike other systems, SGI UV systems use a network of Hub ASICs connected over NUMALink to scale to more sockets than any other x86-64 system, with excellent performance out of the box for most applications.

When running on SGI UV systems with SGI's Message Passing Toolkit (MPT), applications can attain higher bandwidth and lower latency for MPI calls than when running on more conventional distributed memory clusters. However, knowing your SGI UV system's NUMA topology and the performance constraints that it imposes can still help you extract peak performance. For more information about the SGI UV hub, SGI UV compute blades, Intel QPI, and SGI NUMALink, see your SGI UV hardware system user guide.

The MPI library chooses the transfer method depending on internal heuristics, the type of MPI communication that is involved, and some user-tunable variables. When using the GRU to transfer data and messages, the MPI library uses the GRU resources it allocates via the GRU resource allocator, which divides up the available GRU resources. It allocates buffer space and control blocks between the logical processors being used by the MPI job.

MPI Job Problems and Application Design

The MPI library chooses buffer sizes and communication algorithms in an attempt to deliver the best performance automatically to a wide variety of MPI applications, but user tuning might be needed to improve performance. The following are some application performance problems and some ways that you might be able to improve MPI performance:

Primary HyperThreads are idle.

Most high-performance computing MPI programs run best when they use only one HyperThread per core. When an SGI UV system has multiple HyperThreads per core, logical CPUs are numbered such that primary HyperThreads are the high half of the logical CPU numbers. Therefore, the task of scheduling only on the additional HyperThreads may be accomplished by scheduling MPI jobs as if only half the full number exists, leaving the high logical CPUs idle.

You can use the cpumap(1) command to determine if cores have multiple HyperThreads on your SGI UV system. The command's output includes the following:
- The number of physical and logical processors
- Whether HyperThreading is enabled
- How shared processors are paired
If an MPI job uses only half of the available logical CPUs, set GRU_RESOURCE_FACTOR to 2 so that the MPI processes can use all the available GRU resources on a hub, rather than reserving some of them for the idle HyperThreads. For more information about GRU resource tuning, see the gru_resource(3) man page.
Message bandwidth is inadequate.

Use either huge pages or transparent huge pages (THP) to ensure that your application obtains optimal message bandwidth.

To specify the use of hugepages, use the MPI_HUGEPAGE_HEAP_SPACE environment variable. The MPI_HUGEPAGE_HEAP_SPACE environment variable defines the minimum amount of heap space that each MPI process can allocate using huge pages. For information about this environment variable, see the MPI(1) man page.

To use THPs, see “Using Transparent Huge Pages (THPs) in MPI and SHMEM Applications ”.

Some programs transfer large messages via the MPI_Send function. To enable unbuffered, single-copy transport in these cases, you can set MPI_BUFFER_MAX to 0. For information about the MPI_BUFFER_MAX environment variable, see the MPI(1) man page.
MPI small or near messages are very frequent.

For small fabric hop counts, shared memory message delivery is faster than GRU messages. To deliver all messages within an SGI UV host via shared memory, set MPI_SHARED_NEIGHBORHOOD to host. For more information, see the MPI (1) man page.
Memory allocations are nonlocal.

MPI application processes normally perform best if their local memory is allocated near the socket assigned to use it. This cannot happen if memory on that socket is exhausted by the application or by other system consumption, for example, file buffer cache. Use the nodeinfo(1) command to view memory consumption on the nodes assigned to your job, and use bcfree(1) to clear out excessive file buffer cache. PBS Professional batch scheduler installations can be configured to issue bcfree commands in the job prologue.

For information about PBS Professional, including the availablilty of scripts, see the PBS Professional documentation and the bcfree(1) man page.

MPI Performance Tools

SGI supports several MPI performance tools. You can use the following tools to enhance or troubleshoot MPI program performance:

MPInside. MPInside is an MPI profiling tool that can help you optimize your MPI application. The tool provides information about data transferred between ranks, both in terms of speed and quantity.
SGI Perfboost. SGI PerfBoost uses a wrapper library to run applications compiled against other MPI implementations under the SGI Message Passing Toolkit (MPT) product on SGI platforms. The PerfBoost software allows you to run SGI MPT, which is a version of MPI optimized for SGI's large, shared-memory systems that can take advantage of the SGI UV Hub.
SGI PerfCatcher. SGI PerfCatcher uses a wrapper library to return MPI and SHMEM function profiling information. The information returned includes percent CPU time, total time spent per function, message sizes, and load imbalances. For more information, see the following:

For more information about the MPI performance tools, see the following:

The MPInside(1) man page.
The perfcatch(1) man page
MPInside Reference Guide
SGI MPI and SGI SHMEM User Guide

Using Transparent Huge Pages (THPs) in MPI and SHMEM Applications

On SGI UV systems, THP is important because it contributes to attaining the best GRU-based data transfer bandwidth in Message Passing Interface (MPI) and SHMEM programs. On newer kernels, the THP feature is enabled by default. If THP is disabled on your SGI UV system, see “Enabling Huge Pages in MPI and SHMEM Applications on Systems Without THP”.

On SGI ICE systems, if you use a workload manager, such as PBS Professional, your site configuration might let you enable or disable THP on a per-job basis.

The THP feature can affect the performance of some OpenMP threaded applications in a negative way. For certain OpenMP applications, some threads in some shared data structures might be forced to make more nonlocal references because the application assumes a smaller, 4-KB page size.

The THP feature affects users in the following ways:

Administrators:

To activate the THP feature on a system-wide basis, write the keyword always to the following file:
/sys/kernel/mm/transparent_hugepage/enabled
To create an environment in which individual applications can use THP if memory is allocated accordingly within the application itself, type the following:
# echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
To disable THP, type the following command:
# echo never > /sys/kernel/mm/transparent_hugepage/enabled
If the khugepaged daemon is taking a lot of time when a job is running, then defragmentation of THP might be causing performance problems. You can type the following command to disable defragmentation:
# echo never > /sys/kernel/mm/transparent_hugepage/defrag
If you suspect that defragmentation of THP is causing performance problems, but you do not want to disable defragmentation, you can tune the khugepaged daemon by editing the values in /sys/kernel/mm/transparent_hugepage/khugepaged.
MPI application programmers:

To determine whether THP is enabled on your system, type the following command and note the output:
% cat /sys/kernel/mm/transparent_hugepage/enabled
The output is as follows on a system for which THP is enabled:

[always] madvise never

In the output, the bracket characters ([ ]) appear around the keyword that is in effect.

Enabling Huge Pages in MPI and SHMEM Applications on Systems Without THP

If the THP capability is disabled on your SGI UV system, you can use the MPI_HUGEPAGE_HEAP_SPACE environment variable and the MPT_HUGEPAGE_CONFIG command to create huge pages.

The MPT_HUGEPAGE_CONFIG command configures the system to allow huge pages to be available to MPT's memory allocation interceptors. The MPI_HUGEPAGE_HEAP_SPACE environment variable enables an application to use the huge pages reserved by the MPT_HUGEPAGE_CONFIG command.

For more information, see the MPI_HUGEPAGE_HEAP_SPACE environment variable on the MPI(1) man page, or see the mpt_hugepage_config(1) man page.

Prev	Table of Contents	Next
Chapter 4. Data Process and Placement Tools		Chapter 6. Flexible File I/O