Chapter 2. Performance Analysis and Debugging

This chapter contains the following topics:

About Performance Analysis and Debugging

Tuning an application involves determining the source of performance problems and then rectifying those problems to make your programs run their fastest on the available hardware. Performance gains usually fall into one of three categories of measured time:

User CPU time, which is the time accumulated by a user process when it is attached to a CPU and is running.
Elapsed (wall-clock) time, which is the amount of time that passes between the start and the termination of a process.
System time, which is the amount of time spent performing kernel functions, such as system calls (sched_yield, for example) or the handling of floating-point errors.
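
As a rough illustration of these three categories, the following sketch uses the time keyword built into bash to print all three values for one command. The helper name report_times and the output format are illustrative assumptions and are not part of any SGI tool.

```shell
#!/usr/bin/env bash
# Sketch: report elapsed, user CPU, and system CPU time for one command.
# report_times is a hypothetical helper name.
# The bash `time` keyword honors the TIMEFORMAT variable:
#   %R = elapsed (wall-clock) seconds
#   %U = user CPU seconds
#   %S = system CPU seconds
report_times() {
    local TIMEFORMAT='elapsed=%R user=%U system=%S'
    time "$@"
}

# A busy loop accumulates mostly user CPU time.
report_times bash -c 'i=0; while [ "$i" -lt 100000 ]; do i=$((i + 1)); done'

# A sleeping process accumulates elapsed time but almost no CPU time.
report_times sleep 1
```

The time keyword writes its report to standard error, so redirect standard error if you want to capture the values in a file.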

Any application tuning process involves the following steps:

Analyzing and identifying a problem
Locating the problem in the code
Applying an optimization technique

The topics in this chapter describe how to analyze your code to determine performance bottlenecks. For information about how to tune your application for a single processor system and then tune it for parallel processing, see the following:

Chapter 5, “Performance Tuning”

Determining System Configuration

One of the first steps in application tuning is to determine the details of the system that you are running. Depending on your system configuration, different options might or might not provide good results.

The topology(1) command displays general information about SGI systems, with a focus on node information. This can include node counts for blades, node IDs, NASIDs, memory per node, system serial number, partition number, UV Hub versions, CPU to node mappings, and general CPU information. The topology(1) command is part of the SGI Foundation Software package.

The following is example output from two topology(1) commands:

uv-sys:~ # topology
System type: UV2000
System name: harp34-sys
Serial number: UV2-00000034
Partition number: 0
       8 Blades
     256 CPUs
      16 Nodes
  235.82 GB Memory Total
   15.00 GB Max Memory on any Node
       1 BASE I/O Riser
       2 Network Controllers
       2 Storage Controllers
       2 USB Controllers
       1 VGA GPU

uv-sys:~ # topology --summary --nodes --cpus
System type: UV2000
System name: harp34-sys
Serial number: UV2-00000034
Partition number: 0
       8 Blades
     256 CPUs
      16 Nodes
  235.82 GB Memory Total
   15.00 GB Max Memory on any Node
       1 BASE I/O Riser
       2 Network Controllers
       2 Storage Controllers
       2 USB Controllers
       1 VGA GPU

Index      ID        NASID CPUS     Memory
--------------------------------------------
    0 r001i11b00h0      0   16      15316 MB
    1 r001i11b00h1      2   16      15344 MB
    2 r001i11b01h0      4   16      15344 MB
    3 r001i11b01h1      6   16      15344 MB
    4 r001i11b02h0      8   16      15344 MB
    5 r001i11b02h1     10   16      15344 MB
    6 r001i11b03h0     12   16      15344 MB
    7 r001i11b03h1     14   16      15344 MB
    8 r001i11b04h0     16   16      15344 MB
    9 r001i11b04h1     18   16      15344 MB
   10 r001i11b05h0     20   16      15344 MB
   11 r001i11b05h1     22   16      15344 MB
   12 r001i11b06h0     24   16      15344 MB
   13 r001i11b06h1     26   16      15344 MB
   14 r001i11b07h0     28   16      15344 MB
   15 r001i11b07h1     30   16      15344 MB

CPU        Blade PhysID CoreID APIC-ID Family Model Speed L1(KiB) L2(KiB) L3(KiB)
---------------------------------------------------------------------------------
  0 r001i11b00h0     00     00       0      6    45  2599 32d/32i     256   20480
  1 r001i11b00h0     00     01       2      6    45  2599 32d/32i     256   20480
  2 r001i11b00h0     00     02       4      6    45  2599 32d/32i     256   20480
  3 r001i11b00h0     00     03       6      6    45  2599 32d/32i     256   20480
  4 r001i11b00h0     00     04       8      6    45  2599 32d/32i     256   20480
  5 r001i11b00h0     00     05      10      6    45  2599 32d/32i     256   20480
  6 r001i11b00h0     00     06      12      6    45  2599 32d/32i     256   20480
  7 r001i11b00h0     00     07      14      6    45  2599 32d/32i     256   20480
  8 r001i11b00h1     01     00      32      6    45  2599 32d/32i     256   20480
  9 r001i11b00h1     01     01      34      6    45  2599 32d/32i     256   20480
 10 r001i11b00h1     01     02      36      6    45  2599 32d/32i     256   20480
 11 r001i11b00h1     01     03      38      6    45  2599 32d/32i     256   20480
...

The cpumap(1) command displays logical CPUs and shows relationships between them in a human-readable format. Aspects displayed include hyperthread relationships, last level cache sharing, and topological placement. The cpumap(1) command gets its information from /proc/cpuinfo, the /sys/devices/system directory structure, and /proc/sgi_uv/topology. When creating cpusets, the numbers reported in the output section called Processor Numbering on Node(s) correspond to the mems argument you use to define a cpuset. The cpuset mems argument is the list of memory nodes that tasks in the cpuset are allowed to use.

For more information, see the SGI Cpuset Software Guide.

The following is example output:

uv# cpumap
Thu Sep 19 10:17:21 CDT 2013
harp34-sys.americas.sgi.com

This is an SGI UV
model name           : Genuine Intel(R) CPU @ 2.60GHz
Architecture         : x86_64
cpu MHz              : 2599.946
cache size           : 20480 KB (Last Level)

Total Number of Sockets                 : 16
Total Number of Cores                   : 128   (8 per socket)
Hyperthreading                          : ON
Total Number of Physical Processors     : 128
Total Number of Logical Processors      : 256   (2 per Phys Processor)

UV Information
 HUB Version:                            UVHub  3.0
 Number of Hubs:                         16
 Number of connected Hubs:               16
 Number of connected NUMAlink ports:     128
=============================================================================


Hub-Processor Mapping

  Hub Location      Processor Numbers -- HyperThreads in ()
  --- ----------    ---------------------------------------
    0 r001i11b00h0     0    1    2    3    4    5    6    7
                  (  128  129  130  131  132  133  134  135 )
    1 r001i11b00h1     8    9   10   11   12   13   14   15
                  (  136  137  138  139  140  141  142  143 )
    2 r001i11b01h0    16   17   18   19   20   21   22   23
                  (  144  145  146  147  148  149  150  151 )
    3 r001i11b01h1    24   25   26   27   28   29   30   31
                  (  152  153  154  155  156  157  158  159 )
    4 r001i11b02h0    32   33   34   35   36   37   38   39
                  (  160  161  162  163  164  165  166  167 )
    5 r001i11b02h1    40   41   42   43   44   45   46   47
                  (  168  169  170  171  172  173  174  175 )
    6 r001i11b03h0    48   49   50   51   52   53   54   55
                  (  176  177  178  179  180  181  182  183 )
    7 r001i11b03h1    56   57   58   59   60   61   62   63
                  (  184  185  186  187  188  189  190  191 )
    8 r001i11b04h0    64   65   66   67   68   69   70   71
                  (  192  193  194  195  196  197  198  199 )
    9 r001i11b04h1    72   73   74   75   76   77   78   79
                  (  200  201  202  203  204  205  206  207 )
   10 r001i11b05h0    80   81   82   83   84   85   86   87
                  (  208  209  210  211  212  213  214  215 )
   11 r001i11b05h1    88   89   90   91   92   93   94   95
                  (  216  217  218  219  220  221  222  223 )
   12 r001i11b06h0    96   97   98   99  100  101  102  103
                  (  224  225  226  227  228  229  230  231 )
   13 r001i11b06h1   104  105  106  107  108  109  110  111
                  (  232  233  234  235  236  237  238  239 )
   14 r001i11b07h0   112  113  114  115  116  117  118  119
                  (  240  241  242  243  244  245  246  247 )
   15 r001i11b07h1   120  121  122  123  124  125  126  127
                  (  248  249  250  251  252  253  254  255 )

=============================================================================

Processor Numbering on Node(s)

   Node    (Logical) Processors
  ------    -------------------------
     0      0    1    2    3    4    5    6    7  128  129  130  131  132  133  134  135
     1      8    9   10   11   12   13   14   15  136  137  138  139  140  141  142  143
     2     16   17   18   19   20   21   22   23  144  145  146  147  148  149  150  151
     3     24   25   26   27   28   29   30   31  152  153  154  155  156  157  158  159
     4     32   33   34   35   36   37   38   39  160  161  162  163  164  165  166  167
     5     40   41   42   43   44   45   46   47  168  169  170  171  172  173  174  175
     6     48   49   50   51   52   53   54   55  176  177  178  179  180  181  182  183
     7     56   57   58   59   60   61   62   63  184  185  186  187  188  189  190  191
     8     64   65   66   67   68   69   70   71  192  193  194  195  196  197  198  199
     9     72   73   74   75   76   77   78   79  200  201  202  203  204  205  206  207
    10     80   81   82   83   84   85   86   87  208  209  210  211  212  213  214  215
    11     88   89   90   91   92   93   94   95  216  217  218  219  220  221  222  223
    12     96   97   98   99  100  101  102  103  224  225  226  227  228  229  230  231
    13    104  105  106  107  108  109  110  111  232  233  234  235  236  237  238  239
    14    112  113  114  115  116  117  118  119  240  241  242  243  244  245  246  247
    15    120  121  122  123  124  125  126  127  248  249  250  251  252  253  254  255

=============================================================================

Sharing of Last Level (3) Caches

  Socket    (Logical) Processors
  ------    -------------------------
     0      0    1    2    3    4    5    6    7  128  129  130  131  132  133  134  135
     1      8    9   10   11   12   13   14   15  136  137  138  139  140  141  142  143
     2     16   17   18   19   20   21   22   23  144  145  146  147  148  149  150  151
     3     24   25   26   27   28   29   30   31  152  153  154  155  156  157  158  159
     4     32   33   34   35   36   37   38   39  160  161  162  163  164  165  166  167
     5     40   41   42   43   44   45   46   47  168  169  170  171  172  173  174  175
     6     48   49   50   51   52   53   54   55  176  177  178  179  180  181  182  183
     7     56   57   58   59   60   61   62   63  184  185  186  187  188  189  190  191
     8     64   65   66   67   68   69   70   71  192  193  194  195  196  197  198  199
     9     72   73   74   75   76   77   78   79  200  201  202  203  204  205  206  207
    10     80   81   82   83   84   85   86   87  208  209  210  211  212  213  214  215
    11     88   89   90   91   92   93   94   95  216  217  218  219  220  221  222  223
    12     96   97   98   99  100  101  102  103  224  225  226  227  228  229  230  231
    13    104  105  106  107  108  109  110  111  232  233  234  235  236  237  238  239
    14    112  113  114  115  116  117  118  119  240  241  242  243  244  245  246  247
    15    120  121  122  123  124  125  126  127  248  249  250  251  252  253  254  255

=============================================================================

HyperThreading

  Shared Processors
  -----------------
 (    0,  128) (    1,  129) (    2,  130) (    3,  131)
 (    4,  132) (    5,  133) (    6,  134) (    7,  135)
 (    8,  136) (    9,  137) (   10,  138) (   11,  139)
 (   12,  140) (   13,  141) (   14,  142) (   15,  143)
 (   16,  144) (   17,  145) (   18,  146) (   19,  147)
 (   20,  148) (   21,  149) (   22,  150) (   23,  151)
 (   24,  152) (   25,  153) (   26,  154) (   27,  155)
 (   28,  156) (   29,  157) (   30,  158) (   31,  159)
 (   32,  160) (   33,  161) (   34,  162) (   35,  163)
 (   36,  164) (   37,  165) (   38,  166) (   39,  167)
 (   40,  168) (   41,  169) (   42,  170) (   43,  171)
 (   44,  172) (   45,  173) (   46,  174) (   47,  175)
 (   48,  176) (   49,  177) (   50,  178) (   51,  179)
 (   52,  180) (   53,  181) (   54,  182) (   55,  183)
 (   56,  184) (   57,  185) (   58,  186) (   59,  187)
 (   60,  188) (   61,  189) (   62,  190) (   63,  191)
 (   64,  192) (   65,  193) (   66,  194) (   67,  195)
 (   68,  196) (   69,  197) (   70,  198) (   71,  199)
 (   72,  200) (   73,  201) (   74,  202) (   75,  203)
 (   76,  204) (   77,  205) (   78,  206) (   79,  207)
 (   80,  208) (   81,  209) (   82,  210) (   83,  211)
 (   84,  212) (   85,  213) (   86,  214) (   87,  215)
 (   88,  216) (   89,  217) (   90,  218) (   91,  219)
 (   92,  220) (   93,  221) (   94,  222) (   95,  223)
 (   96,  224) (   97,  225) (   98,  226) (   99,  227)
 (  100,  228) (  101,  229) (  102,  230) (  103,  231)
 (  104,  232) (  105,  233) (  106,  234) (  107,  235)
 (  108,  236) (  109,  237) (  110,  238) (  111,  239)
 (  112,  240) (  113,  241) (  114,  242) (  115,  243)
 (  116,  244) (  117,  245) (  118,  246) (  119,  247)
 (  120,  248) (  121,  249) (  122,  250) (  123,  251)
 (  124,  252) (  125,  253) (  126,  254) (  127,  255)
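
On systems where cpumap(1) is not available, a similar node-to-processor listing can be approximated by reading the standard Linux sysfs files directly. The following is a minimal sketch under that assumption. The helper name list_node_cpus is hypothetical, and the optional argument exists only so that the function can be pointed at a different directory.

```shell
#!/usr/bin/env bash
# Sketch: list the logical CPUs that belong to each NUMA node by reading sysfs.
# list_node_cpus is a hypothetical helper name.
# Nodes are printed in shell glob order (node0, node1, node10, ...).
list_node_cpus() {
    local base=${1:-/sys/devices/system/node}
    local node
    for node in "$base"/node[0-9]*; do
        # Skip the unmatched glob pattern and nodes without a cpulist file.
        [ -r "$node/cpulist" ] || continue
        printf '%s: %s\n' "${node##*/}" "$(cat "$node/cpulist")"
    done
}

list_node_cpus
```

The node numbers printed by this sketch are the same kind of memory node numbers that you pass in the cpuset mems argument.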

The x86info(1) command displays x86 CPU diagnostics information. If the command is not already installed, type one of the following commands to install it:

On Red Hat Enterprise Linux (RHEL) systems, type the following:
# yum install x86info.x86_64
On SLES systems, type the following:
# zypper install x86info

The following is an example of x86info(1) command output:

uv44-sys:~ # x86info
x86info v1.25.  Dave Jones 2001-2009
Feedback to .

Found 64 CPUs
--------------------------------------------------------------------------
CPU #1
EFamily: 0 EModel: 2 Family: 6 Model: 46 Stepping: 6
CPU Model: Unknown model.
Processor name string: Intel(R) Xeon(R) CPU           E7520  @ 1.87GHz
Type: 0 (Original OEM)  Brand: 0 (Unsupported)
Number of cores per physical package=16
Number of logical processors per socket=32
Number of logical processors per core=2
APIC ID: 0x0    Package: 0  Core: 0   SMT ID 0
--------------------------------------------------------------------------
CPU #2
EFamily: 0 EModel: 2 Family: 6 Model: 46 Stepping: 6
CPU Model: Unknown model.
Processor name string: Intel(R) Xeon(R) CPU           E7520  @ 1.87GHz
Type: 0 (Original OEM)  Brand: 0 (Unsupported)
Number of cores per physical package=16
Number of logical processors per socket=32
Number of logical processors per core=2
APIC ID: 0x6    Package: 0  Core: 0   SMT ID 6
--------------------------------------------------------------------------
CPU #3
EFamily: 0 EModel: 2 Family: 6 Model: 46 Stepping: 6
CPU Model: Unknown model.
Processor name string: Intel(R) Xeon(R) CPU           E7520  @ 1.87GHz
Type: 0 (Original OEM)  Brand: 0 (Unsupported)
Number of cores per physical package=16
Number of logical processors per socket=32
Number of logical processors per core=2
APIC ID: 0x10   Package: 0  Core: 0   SMT ID 16
-------------------------------------------------------------------------- 
                       ...

You can also use the uname command, which returns the kernel version and other machine information. For example:

uv44-sys:~ # uname -a
Linux uv44-sys 2.6.32.13-0.4.1.1559.0.PTF-default #1 SMP 2010-06-15 12:47:25 +0200 x86_64 x86_64 x86_64 GNU/Linux

For more system information, change to the /sys/devices/system/node/node0/cpu0/cache directory and list the contents. For example:

uv44-sys:/sys/devices/system/node/node0/cpu0/cache # ls
index0  index1  index2  index3

Change directory to index0 and list the contents, as follows:

uv44-sys:/sys/devices/system/node/node0/cpu0/cache/index0 # ls
coherency_line_size  level  number_of_sets  physical_line_partition  shared_cpu_list  shared_cpu_map  size  type  ways_of_associativity
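
Rather than listing each index directory by hand, you can summarize the cache hierarchy with a short loop over the files shown above. The following is a sketch only. The helper name show_cpu_caches and the output layout are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Sketch: print one summary line per cache level for a single CPU.
# show_cpu_caches is a hypothetical helper name; the default directory is the
# one used in the example above.
show_cpu_caches() {
    local dir=${1:-/sys/devices/system/node/node0/cpu0/cache}
    local idx
    for idx in "$dir"/index[0-9]*; do
        # Skip the unmatched glob pattern and incomplete entries.
        [ -r "$idx/level" ] || continue
        printf 'L%s %s size=%s line=%sB shared_with=%s\n' \
            "$(cat "$idx/level")" \
            "$(cat "$idx/type")" \
            "$(cat "$idx/size")" \
            "$(cat "$idx/coherency_line_size")" \
            "$(cat "$idx/shared_cpu_list")"
    done
}

show_cpu_caches
```

The shared_with field shows which logical CPUs share that cache, which is useful when you decide how to place threads.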

Sources of Performance Problems

The following three process types typically cause program execution performance slowdowns:

CPU-bound processes, which are processes that perform slow operations, such as sqrt or floating-point divides, or nonpipelined operations, such as switching between add and multiply operations.
Memory-bound processes, which consist of code that uses poor memory strides, occurrences of page thrashing or cache misses, or poor data placement in NUMA systems.
I/O-bound processes, which are processes that wait on synchronous I/O or formatted I/O. These are also processes that wait when there is library-level or system-level buffering.
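
A first, coarse way to tell these process types apart is to compare CPU time with elapsed time. The following sketch computes the share of elapsed time that a single-threaded process spent on a CPU. The helper name cpu_share is hypothetical, and the interpretation is a heuristic rather than a rule: a share near 100 suggests a CPU-bound or memory-bound process (time stalled on memory is still charged as CPU time), while a low share suggests that the process spent most of its time waiting, for example on I/O.

```shell
#!/usr/bin/env bash
# Sketch: percentage of elapsed time that was spent on a CPU.
# cpu_share is a hypothetical helper name.
# Usage: cpu_share ELAPSED_SECONDS USER_SECONDS SYSTEM_SECONDS
cpu_share() {
    awk -v e="$1" -v u="$2" -v s="$3" \
        'BEGIN { if (e <= 0) exit 1; printf "%.0f\n", 100 * (u + s) / e }'
}

cpu_share 10.0 9.2 0.3   # mostly computing
cpu_share 10.0 1.1 0.4   # mostly waiting
```

For multithreaded programs, the user time is summed across threads, so the result can exceed 100.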

The following topics describe some of the tools that can help pinpoint performance slowdowns:

Profiling with perf

Linux Perf Events provides a performance analysis framework. It includes hardware-level CPU performance monitoring unit (PMU) features, software counters, and tracepoints. The perf RPM comes with the operating system, includes man pages, and is not an SGI product.

For more information, see the following man pages:

perf(1)
perf-stat(1)
perf-top(1)
perf-record(1)
perf-report(1)
perf-list(1)
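
The following sketch shows one way the subcommands documented in these man pages are commonly combined. The helper names perf_counts and perf_hotspots, the PERF variable, and the executable name ./my_app are all assumptions made for illustration. Whether a given event can be counted depends on your kernel and its perf configuration.

```shell
#!/usr/bin/env bash
# Sketch: thin wrappers around perf stat, perf record, and perf report.
# PERF lets you point the wrappers at a different perf binary.
PERF=${PERF:-perf}

# Count hardware and software events for one run of a command.
perf_counts() {
    "$PERF" stat -- "$@"
}

# Sample a command with call graphs, then print a text report of hot spots.
perf_hotspots() {
    "$PERF" record -g -o perf.data -- "$@" &&
        "$PERF" report -i perf.data --stdio
}

# Usage (./my_app is a hypothetical executable):
#   perf_counts ./my_app
#   perf_hotspots ./my_app
#   perf list          # show the events available on this system
```

Starting with perf stat gives a quick overall picture. The record and report pair is more useful once you need to know which functions account for the time.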

Profiling with PerfSuite

PerfSuite is a set of tools, utilities, and libraries that you can use to analyze application software performance on Linux systems. You can use the PerfSuite tools to perform performance-related activities, ranging from assistance with compiler optimization reports to hardware performance counting, profiling, and MPI usage summarization. PerfSuite is Open Source software. It is approved for licensing under the University of Illinois/NCSA Open Source License (OSI-approved).

For more information, see one of the following websites:

http://perfsuite.ncsa.uiuc.edu/
http://perfsuite.sourceforge.net/
http://perfsuite.ncsa.illinois.edu

This website hosts NCSA-specific information about using PerfSuite tools.

PerfSuite includes the psrun utility, which gathers hardware performance information on an unmodified executable. For more information, see http://perfsuite.ncsa.uiuc.edu/psrun/.

Other Performance Analysis Tools

The following tools might be useful to you when you try to optimize your code:

The Intel® VTune™ Amplifier XE, which is a performance and thread profiler. This tool performs remote sampling experiments. The VTune data collector runs on the Linux system, and an accompanying graphical user interface (GUI), which you use to analyze the results, runs on an IA-32 Windows machine. VTune allows you to perform interactive experiments while connected to the host through its GUI. An additional tool, the Performance Tuning Utility (PTU), requires the Intel VTune license.

For information about Intel VTune Amplifier XE, see the following URL:

http://software.intel.com/en-us/intel-vtune-amplifier-xe#pid-3773-760

Intel Inspector XE, which is a memory and thread debugger. For information about Intel Inspector XE, see the following:

http://software.intel.com/en-us/intel-inspector-xe/

Intel Advisor XE, which is a threading design and prototyping tool. For information about Intel Advisor XE, see the following:

http://software.intel.com/en-us/intel-advisor-xe

About Debugging

The following debuggers are available on SGI platforms:

The Intel Debugger and GDB. You can start the Intel Debugger and GDB with the ddd command. The ddd command starts the Data Display Debugger, a GNU product that provides a graphical debugging interface.
TotalView. TotalView is a graphical debugger that you can use with MPI programs. For information about TotalView, including its licensing, see the following:

http://www.roguewave.com

DDT. The DDT debugger from Allinea Software is a graphical debugger. For information about DDT, see the following:

www.allinea.com/products/ddt

The following topics provide more information about some of the debuggers available on SGI systems:

Using the Intel Debugger

The Intel Debugger for Linux is the Intel symbolic debugger. It is part of Intel Parallel Studio XE Professional Edition and above, and it is based on the Eclipse GUI. The debugger works with the Intel® C and C++ compilers, the Intel® Fortran compilers, and the GNU compilers. It is available if your system is licensed for the Intel compilers. During installation, you are asked whether you want to install it. The idb command starts the GUI. The idbc command starts the command line interface. If you specify the -gdb option on the idb command, the shell command line provides user commands and debugger output similar to those of the GNU debugger.

You can use the Intel Debugger for Linux with single-threaded applications, multithreaded applications, serial code, and parallel code.

Figure 2-1 shows the GUI.

Figure 2-1. Intel® Debugger GUI

For more information, see the following:

http://software.intel.com/en-us/articles/idb-linux/

Using the GNU Data Display Debugger (GNU DDD)

GDB is the GNU debugger. The GDB debugger supports C, C++, Fortran, and Modula-2 programs. The following information pertains to these compilers:

When compiling C and C++ programs, include the -g option on the compiler command line. The -g option produces the DWARF 2 symbol database that GDB uses.
When using GDB for Fortran debugging, include the -g and -O0 options. Do not use GDB for Fortran debugging when compiling with -O1 or higher. The standard GDB debugger does not support Fortran 95 programs. To debug Fortran 95 programs, download and install the gdbf95 patch from the following website:

http://sourceforge.net/project/showfiles.php?group_id=56720

To verify that you have the correct version of GDB installed, use the gdb -v command. The output should appear similar to the following:
GNU gdb 5.1.1 FORTRAN95-20020628 (RC1) Copyright 2012 Free Software Foundation, Inc.

The Data Display Debugger provides a graphical debugging interface for the GDB debugger and other command line debuggers. To use GDB through a GUI, use the ddd command. Specify the --debugger option to specify the debugger you want to use. For example, specify --debugger "idb" to specify the Intel Debugger. Use the gdb command to start GDB's command line interface.
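
The compile options and debugger commands described in this topic can be sketched as follows. The helper names debug_build_c and debug_build_fortran, the use of gcc and gfortran as default compilers, and the file names are assumptions made for illustration. Substitute the compiler that you actually use.

```shell
#!/usr/bin/env bash
# Sketch: build with debugging symbols, then start a debugger.
# Usage: debug_build_c SOURCE OUTPUT
# CC and FC let you substitute a different compiler.

# C and C++: -g adds the symbol information that GDB needs.
debug_build_c() {
    "${CC:-gcc}" -g -o "$2" "$1"
}

# Fortran: use -g together with -O0 (no optimization).
debug_build_fortran() {
    "${FC:-gfortran}" -g -O0 -o "$2" "$1"
}

# Usage (my_app.c and my_app are hypothetical names):
#   debug_build_c my_app.c my_app
#   gdb ./my_app                   # GDB command line interface
#   ddd --debugger gdb ./my_app    # GDB through the DDD GUI
#   ddd --debugger idb ./my_app    # Intel Debugger through the DDD GUI
```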

When the debugger loads, the Data Display Debugger screen appears divided into panes that show the following information:

Array inspection
Source code
Disassembled code
A command line window to the debugger engine

From the View menu, you can switch these panes on and off.

Some commonly used commands can be found on the menus. In addition, the following actions can be useful:

To select an address in the assembly view, click the right mouse button, and select lookup. The gdb command runs in the command pane and shows the corresponding source line.
To display the current value of a variable, select the variable in the source pane, and click the right mouse button. Arrays appear in the array inspection window. You can print these arrays to PostScript by using the Menu>Print Graph option.
To view the contents of the register file, including general, floating-point, NaT, predicate, and application registers, select Registers from the Status menu. The Status menu also allows you to view stack traces or to switch OpenMP threads.

For a complete list of GDB commands, use the help option or see the documentation at the following website:

http://sourceware.org/gdb/documentation/

Note: The current instances of GDB do not report ar.ec registers correctly. If you are debugging rotating, register-based, software-pipelined loops at the assembly code level, try using the Intel Debugger for Linux.
