Tuning an application involves determining the source of performance problems and then rectifying those problems to make your programs run their fastest on the available hardware. Performance gains usually fall into one of three categories of measured time:
User CPU time, which is the time accumulated by a user process when it is attached to a CPU and is running.
Elapsed (wall-clock) time, which is the amount of time that passes between the start and the termination of a process.
System time, which is the amount of time spent performing kernel functions, such as system calls (sched_yield, for example) or the handling of floating-point errors.
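These three time categories can be observed from inside a program. The following is a minimal Python sketch, not an SGI tool, that assumes the standard os.times() and time.perf_counter() calls are available. The measure() and busy_loop() names are hypothetical and exist only for illustration:

```python
import os
import time


def measure(func, *args, **kwargs):
    """Run func and report user CPU, system, and elapsed time in seconds."""
    wall_start = time.perf_counter()
    cpu_start = os.times()
    result = func(*args, **kwargs)
    cpu_end = os.times()
    elapsed = time.perf_counter() - wall_start
    times = {
        "user": cpu_end.user - cpu_start.user,        # time running user code
        "system": cpu_end.system - cpu_start.system,  # time spent in the kernel
        "elapsed": elapsed,                           # wall-clock time
    }
    return result, times


def busy_loop(n):
    """Hypothetical CPU-bound workload used only for this illustration."""
    total = 0
    for i in range(n):
        total += i * i
    return total


if __name__ == "__main__":
    _, times = measure(busy_loop, 1_000_000)
    for name, seconds in times.items():
        print(f"{name:8s} {seconds:.3f} s")
```

For a single-threaded process, a sum of user and system time that is much smaller than the elapsed time suggests that the process spent most of its time waiting rather than computing.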
Any application tuning process involves the following steps:
Analyzing and identifying a problem
Locating the problem in the code
Applying an optimization technique
The topics in this chapter describe how to analyze your code to determine performance bottlenecks. For information about how to tune your application for a single-processor system and then tune it for parallel processing, see the following:
One of the first steps in application tuning is to determine the details of the system on which you are running. Depending on your system configuration, different options might or might not provide good results.
The topology(1) command displays general information about SGI systems, with a focus on node information. This can include node counts for blades, node IDs, NASIDs, memory per node, system serial number, partition number, UV Hub versions, CPU to node mappings, and general CPU information. The topology(1) command is part of the SGI Foundation Software package.
The following is example output from two topology(1) commands:
uv-sys:~ # topology
System type: UV2000
System name: harp34-sys
Serial number: UV2-00000034
Partition number: 0
8 Blades
256 CPUs
16 Nodes
235.82 GB Memory Total
15.00 GB Max Memory on any Node
1 BASE I/O Riser
2 Network Controllers
2 Storage Controllers
2 USB Controllers
1 VGA GPU
uv-sys:~ # topology --summary --nodes --cpus
System type: UV2000
System name: harp34-sys
Serial number: UV2-00000034
Partition number: 0
8 Blades
256 CPUs
16 Nodes
235.82 GB Memory Total
15.00 GB Max Memory on any Node
1 BASE I/O Riser
2 Network Controllers
2 Storage Controllers
2 USB Controllers
1 VGA GPU
Index ID NASID CPUS Memory
--------------------------------------------
0 r001i11b00h0 0 16 15316 MB
1 r001i11b00h1 2 16 15344 MB
2 r001i11b01h0 4 16 15344 MB
3 r001i11b01h1 6 16 15344 MB
4 r001i11b02h0 8 16 15344 MB
5 r001i11b02h1 10 16 15344 MB
6 r001i11b03h0 12 16 15344 MB
7 r001i11b03h1 14 16 15344 MB
8 r001i11b04h0 16 16 15344 MB
9 r001i11b04h1 18 16 15344 MB
10 r001i11b05h0 20 16 15344 MB
11 r001i11b05h1 22 16 15344 MB
12 r001i11b06h0 24 16 15344 MB
13 r001i11b06h1 26 16 15344 MB
14 r001i11b07h0 28 16 15344 MB
15 r001i11b07h1 30 16 15344 MB
CPU Blade PhysID CoreID APIC-ID Family Model Speed L1(KiB) L2(KiB) L3(KiB)
---------------------------------------------------------------------------------
0 r001i11b00h0 00 00 0 6 45 2599 32d/32i 256 20480
1 r001i11b00h0 00 01 2 6 45 2599 32d/32i 256 20480
2 r001i11b00h0 00 02 4 6 45 2599 32d/32i 256 20480
3 r001i11b00h0 00 03 6 6 45 2599 32d/32i 256 20480
4 r001i11b00h0 00 04 8 6 45 2599 32d/32i 256 20480
5 r001i11b00h0 00 05 10 6 45 2599 32d/32i 256 20480
6 r001i11b00h0 00 06 12 6 45 2599 32d/32i 256 20480
7 r001i11b00h0 00 07 14 6 45 2599 32d/32i 256 20480
8 r001i11b00h1 01 00 32 6 45 2599 32d/32i 256 20480
9 r001i11b00h1 01 01 34 6 45 2599 32d/32i 256 20480
10 r001i11b00h1 01 02 36 6 45 2599 32d/32i 256 20480
11 r001i11b00h1 01 03 38 6 45 2599 32d/32i 256 20480
...
The cpumap(1) command displays logical CPUs and shows relationships between them in a human-readable format. Aspects displayed include hyperthread relationships, last level cache sharing, and topological placement. The cpumap(1) command gets its information from /proc/cpuinfo, the /sys/devices/system directory structure, and /proc/sgi_uv/topology. When creating cpusets, the numbers reported in the output section called Processor Numbering on Node(s) correspond to the mems argument you use to define a cpuset. The cpuset mems argument is the list of memory nodes that tasks in the cpuset are allowed to use.
For more information, see the SGI Cpuset Software Guide.
The following is example output:
uv# cpumap
Thu Sep 19 10:17:21 CDT 2013
harp34-sys.americas.sgi.com
This is an SGI UV
model name : Genuine Intel(R) CPU @ 2.60GHz
Architecture : x86_64
cpu MHz : 2599.946
cache size : 20480 KB (Last Level)
Total Number of Sockets : 16
Total Number of Cores : 128 (8 per socket)
Hyperthreading : ON
Total Number of Physical Processors : 128
Total Number of Logical Processors : 256 (2 per Phys Processor)
UV Information
HUB Version: UVHub 3.0
Number of Hubs: 16
Number of connected Hubs: 16
Number of connected NUMAlink ports: 128
=============================================================================
Hub-Processor Mapping
Hub Location Processor Numbers -- HyperThreads in ()
--- ---------- ---------------------------------------
0 r001i11b00h0 0 1 2 3 4 5 6 7
( 128 129 130 131 132 133 134 135 )
1 r001i11b00h1 8 9 10 11 12 13 14 15
( 136 137 138 139 140 141 142 143 )
2 r001i11b01h0 16 17 18 19 20 21 22 23
( 144 145 146 147 148 149 150 151 )
3 r001i11b01h1 24 25 26 27 28 29 30 31
( 152 153 154 155 156 157 158 159 )
4 r001i11b02h0 32 33 34 35 36 37 38 39
( 160 161 162 163 164 165 166 167 )
5 r001i11b02h1 40 41 42 43 44 45 46 47
( 168 169 170 171 172 173 174 175 )
6 r001i11b03h0 48 49 50 51 52 53 54 55
( 176 177 178 179 180 181 182 183 )
7 r001i11b03h1 56 57 58 59 60 61 62 63
( 184 185 186 187 188 189 190 191 )
8 r001i11b04h0 64 65 66 67 68 69 70 71
( 192 193 194 195 196 197 198 199 )
9 r001i11b04h1 72 73 74 75 76 77 78 79
( 200 201 202 203 204 205 206 207 )
10 r001i11b05h0 80 81 82 83 84 85 86 87
( 208 209 210 211 212 213 214 215 )
11 r001i11b05h1 88 89 90 91 92 93 94 95
( 216 217 218 219 220 221 222 223 )
12 r001i11b06h0 96 97 98 99 100 101 102 103
( 224 225 226 227 228 229 230 231 )
13 r001i11b06h1 104 105 106 107 108 109 110 111
( 232 233 234 235 236 237 238 239 )
14 r001i11b07h0 112 113 114 115 116 117 118 119
( 240 241 242 243 244 245 246 247 )
15 r001i11b07h1 120 121 122 123 124 125 126 127
( 248 249 250 251 252 253 254 255 )
=============================================================================
Processor Numbering on Node(s)
Node (Logical) Processors
------ -------------------------
0 0 1 2 3 4 5 6 7 128 129 130 131 132 133 134 135
1 8 9 10 11 12 13 14 15 136 137 138 139 140 141 142 143
2 16 17 18 19 20 21 22 23 144 145 146 147 148 149 150 151
3 24 25 26 27 28 29 30 31 152 153 154 155 156 157 158 159
4 32 33 34 35 36 37 38 39 160 161 162 163 164 165 166 167
5 40 41 42 43 44 45 46 47 168 169 170 171 172 173 174 175
6 48 49 50 51 52 53 54 55 176 177 178 179 180 181 182 183
7 56 57 58 59 60 61 62 63 184 185 186 187 188 189 190 191
8 64 65 66 67 68 69 70 71 192 193 194 195 196 197 198 199
9 72 73 74 75 76 77 78 79 200 201 202 203 204 205 206 207
10 80 81 82 83 84 85 86 87 208 209 210 211 212 213 214 215
11 88 89 90 91 92 93 94 95 216 217 218 219 220 221 222 223
12 96 97 98 99 100 101 102 103 224 225 226 227 228 229 230 231
13 104 105 106 107 108 109 110 111 232 233 234 235 236 237 238 239
14 112 113 114 115 116 117 118 119 240 241 242 243 244 245 246 247
15 120 121 122 123 124 125 126 127 248 249 250 251 252 253 254 255
=============================================================================
Sharing of Last Level (3) Caches
Socket (Logical) Processors
------ -------------------------
0 0 1 2 3 4 5 6 7 128 129 130 131 132 133 134 135
1 8 9 10 11 12 13 14 15 136 137 138 139 140 141 142 143
2 16 17 18 19 20 21 22 23 144 145 146 147 148 149 150 151
3 24 25 26 27 28 29 30 31 152 153 154 155 156 157 158 159
4 32 33 34 35 36 37 38 39 160 161 162 163 164 165 166 167
5 40 41 42 43 44 45 46 47 168 169 170 171 172 173 174 175
6 48 49 50 51 52 53 54 55 176 177 178 179 180 181 182 183
7 56 57 58 59 60 61 62 63 184 185 186 187 188 189 190 191
8 64 65 66 67 68 69 70 71 192 193 194 195 196 197 198 199
9 72 73 74 75 76 77 78 79 200 201 202 203 204 205 206 207
10 80 81 82 83 84 85 86 87 208 209 210 211 212 213 214 215
11 88 89 90 91 92 93 94 95 216 217 218 219 220 221 222 223
12 96 97 98 99 100 101 102 103 224 225 226 227 228 229 230 231
13 104 105 106 107 108 109 110 111 232 233 234 235 236 237 238 239
14 112 113 114 115 116 117 118 119 240 241 242 243 244 245 246 247
15 120 121 122 123 124 125 126 127 248 249 250 251 252 253 254 255
=============================================================================
HyperThreading
Shared Processors
-----------------
( 0, 128) ( 1, 129) ( 2, 130) ( 3, 131)
( 4, 132) ( 5, 133) ( 6, 134) ( 7, 135)
( 8, 136) ( 9, 137) ( 10, 138) ( 11, 139)
( 12, 140) ( 13, 141) ( 14, 142) ( 15, 143)
( 16, 144) ( 17, 145) ( 18, 146) ( 19, 147)
( 20, 148) ( 21, 149) ( 22, 150) ( 23, 151)
( 24, 152) ( 25, 153) ( 26, 154) ( 27, 155)
( 28, 156) ( 29, 157) ( 30, 158) ( 31, 159)
( 32, 160) ( 33, 161) ( 34, 162) ( 35, 163)
( 36, 164) ( 37, 165) ( 38, 166) ( 39, 167)
( 40, 168) ( 41, 169) ( 42, 170) ( 43, 171)
( 44, 172) ( 45, 173) ( 46, 174) ( 47, 175)
( 48, 176) ( 49, 177) ( 50, 178) ( 51, 179)
( 52, 180) ( 53, 181) ( 54, 182) ( 55, 183)
( 56, 184) ( 57, 185) ( 58, 186) ( 59, 187)
( 60, 188) ( 61, 189) ( 62, 190) ( 63, 191)
( 64, 192) ( 65, 193) ( 66, 194) ( 67, 195)
( 68, 196) ( 69, 197) ( 70, 198) ( 71, 199)
( 72, 200) ( 73, 201) ( 74, 202) ( 75, 203)
( 76, 204) ( 77, 205) ( 78, 206) ( 79, 207)
( 80, 208) ( 81, 209) ( 82, 210) ( 83, 211)
( 84, 212) ( 85, 213) ( 86, 214) ( 87, 215)
( 88, 216) ( 89, 217) ( 90, 218) ( 91, 219)
( 92, 220) ( 93, 221) ( 94, 222) ( 95, 223)
( 96, 224) ( 97, 225) ( 98, 226) ( 99, 227)
( 100, 228) ( 101, 229) ( 102, 230) ( 103, 231)
( 104, 232) ( 105, 233) ( 106, 234) ( 107, 235)
( 108, 236) ( 109, 237) ( 110, 238) ( 111, 239)
( 112, 240) ( 113, 241) ( 114, 242) ( 115, 243)
( 116, 244) ( 117, 245) ( 118, 246) ( 119, 247)
( 120, 248) ( 121, 249) ( 122, 250) ( 123, 251)
( 124, 252) ( 125, 253) ( 126, 254) ( 127, 255)
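On Linux systems that expose NUMA information through sysfs, a node-to-CPU mapping similar to the Processor Numbering on Node(s) section can be read programmatically. The following Python sketch assumes that each /sys/devices/system/node/nodeN directory contains a cpulist file in the kernel's range-list format (for example, 0-7,128-135). The parse_cpu_list() and node_cpu_map() names are hypothetical. This is an illustration only and does not reproduce how cpumap(1) gathers its data:

```python
import os
import re


def parse_cpu_list(text):
    """Expand a range list such as '0-7,128-135' into a list of CPU numbers."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue  # an empty list means the node has no CPUs
        if "-" in part:
            low, high = part.split("-")
            cpus.extend(range(int(low), int(high) + 1))
        else:
            cpus.append(int(part))
    return cpus


def node_cpu_map(root="/sys/devices/system/node"):
    """Return {node number: [logical CPUs]} for every nodeN directory in root."""
    mapping = {}
    if not os.path.isdir(root):
        return mapping  # no NUMA information available at this path
    for entry in os.listdir(root):
        match = re.fullmatch(r"node(\d+)", entry)
        cpulist = os.path.join(root, entry, "cpulist")
        if match and os.path.isfile(cpulist):
            with open(cpulist) as f:
                mapping[int(match.group(1))] = parse_cpu_list(f.read())
    return dict(sorted(mapping.items()))


if __name__ == "__main__":
    for node, cpus in node_cpu_map().items():
        print(node, " ".join(str(cpu) for cpu in cpus))
```

The node numbers that form the keys of the returned mapping are the values that a cpuset mems argument would name.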
The x86info(1) command displays x86 CPU diagnostics information. If the x86info(1) command is not already installed, type one of the following commands to install it:
On Red Hat Enterprise Linux (RHEL) systems, type the following:
# yum install x86info.x86_64
On SLES systems, type the following:
# zypper install x86info
The following is an example of x86info(1) command output:
uv44-sys:~ # x86info
x86info v1.25. Dave Jones 2001-2009
Feedback to .
Found 64 CPUs
--------------------------------------------------------------------------
CPU #1
EFamily: 0 EModel: 2 Family: 6 Model: 46 Stepping: 6
CPU Model: Unknown model.
Processor name string: Intel(R) Xeon(R) CPU E7520 @ 1.87GHz
Type: 0 (Original OEM) Brand: 0 (Unsupported)
Number of cores per physical package=16
Number of logical processors per socket=32
Number of logical processors per core=2
APIC ID: 0x0 Package: 0 Core: 0 SMT ID 0
--------------------------------------------------------------------------
CPU #2
EFamily: 0 EModel: 2 Family: 6 Model: 46 Stepping: 6
CPU Model: Unknown model.
Processor name string: Intel(R) Xeon(R) CPU E7520 @ 1.87GHz
Type: 0 (Original OEM) Brand: 0 (Unsupported)
Number of cores per physical package=16
Number of logical processors per socket=32
Number of logical processors per core=2
APIC ID: 0x6 Package: 0 Core: 0 SMT ID 6
--------------------------------------------------------------------------
CPU #3
EFamily: 0 EModel: 2 Family: 6 Model: 46 Stepping: 6
CPU Model: Unknown model.
Processor name string: Intel(R) Xeon(R) CPU E7520 @ 1.87GHz
Type: 0 (Original OEM) Brand: 0 (Unsupported)
Number of cores per physical package=16
Number of logical processors per socket=32
Number of logical processors per core=2
APIC ID: 0x10 Package: 0 Core: 0 SMT ID 16
--------------------------------------------------------------------------
...
You can also use the uname command, which returns the kernel version and other machine information. For example:
uv44-sys:~ # uname -a
Linux uv44-sys 2.6.32.13-0.4.1.1559.0.PTF-default #1 SMP 2010-06-15 12:47:25 +0200 x86_64 x86_64 x86_64 GNU/Linux
For more system information, change to the /sys/devices/system/node/node0/cpu0/cache directory and list the contents. For example:
uv44-sys:/sys/devices/system/node/node0/cpu0/cache # ls
index0 index1 index2 index3
Change directory to index0 and list the contents, as follows:
uv44-sys:/sys/devices/system/node/node0/cpu0/cache/index0 # ls
coherency_line_size level number_of_sets physical_line_partition shared_cpu_list shared_cpu_map size type ways_of_associativity
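The files in each index directory are plain text, so a short script can collect them into one summary. The following Python sketch reads a subset of the attribute files listed above for every index directory under a cache directory. The read_cache_info() name is hypothetical, and the default path is the one used in the preceding example, which might differ on other systems:

```python
import os

# Attribute files to read from each indexN directory (a subset of the listing).
ATTRIBUTES = (
    "level",
    "type",
    "size",
    "coherency_line_size",
    "ways_of_associativity",
    "number_of_sets",
    "shared_cpu_list",
)


def read_cache_info(cache_dir):
    """Return one dictionary of attribute values per indexN directory."""
    caches = []
    for entry in sorted(os.listdir(cache_dir)):
        index_path = os.path.join(cache_dir, entry)
        if not entry.startswith("index") or not os.path.isdir(index_path):
            continue
        info = {"index": entry}
        for name in ATTRIBUTES:
            attr_path = os.path.join(index_path, name)
            if os.path.isfile(attr_path):  # skip attributes the kernel omits
                with open(attr_path) as f:
                    info[name] = f.read().strip()
        caches.append(info)
    return caches


if __name__ == "__main__":
    path = "/sys/devices/system/node/node0/cpu0/cache"
    if os.path.isdir(path):
        for cache in read_cache_info(path):
            print(cache)
    else:
        print("No cache directory found at", path)
```

Each dictionary contains only the attributes that were actually present, so the sketch tolerates kernels that expose a different set of files.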
The following three process types typically cause program execution performance slowdowns:
CPU-bound processes, which are processes that perform slow operations, such as sqrt or floating-point divides, or nonpipelined operations, such as switching between add and multiply operations.
Memory-bound processes, which are processes whose code uses poor memory strides, causes page thrashing or cache misses, or places data poorly on NUMA systems.
I/O-bound processes, which are processes that wait on synchronous I/O or formatted I/O. These are also processes that wait when there is library-level or system-level buffering.
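To illustrate why a poor memory stride can make a process memory bound, the following Python sketch counts how often consecutive accesses to a two-dimensional array move to a different cache line. It is a deliberately simple model and not a measurement: it assumes row-major storage, 8-byte elements, and 64-byte cache lines, and it ignores cache capacity, associativity, and hardware prefetching. The count_line_changes() name is hypothetical:

```python
def count_line_changes(rows, cols, order, elem_size=8, line_size=64):
    """Count accesses that touch a different cache line than the previous one.

    The array is assumed to be stored in row-major order. order is 'row'
    for a row-by-row traversal or 'column' for a column-by-column traversal.
    """
    if order == "row":
        indices = ((i, j) for i in range(rows) for j in range(cols))
    elif order == "column":
        indices = ((i, j) for j in range(cols) for i in range(rows))
    else:
        raise ValueError("order must be 'row' or 'column'")

    changes = 0
    previous_line = None
    for i, j in indices:
        byte_offset = (i * cols + j) * elem_size  # row-major address
        line = byte_offset // line_size
        if line != previous_line:
            changes += 1
            previous_line = line
    return changes


if __name__ == "__main__":
    for order in ("row", "column"):
        print(order, count_line_changes(4, 16, order))
```

In this model, the row-by-row traversal of a 4 x 16 array changes cache lines 8 times, and the column-by-column traversal changes cache lines on all 64 accesses. The 64-byte line size is an assumption; the coherency_line_size file described earlier reports the value for a given system.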
The following topics describe some of the tools that can help pinpoint performance slowdowns:
Linux Perf Events provides a performance analysis framework. It includes hardware-level CPU performance monitoring unit (PMU) features, software counters, and tracepoints. The perf RPM comes with the operating system, includes man pages, and is not an SGI product.
For more information, see the following man pages:
perf(1)
perf-stat(1)
perf-top(1)
perf-record(1)
perf-report(1)
perf-list(1)
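As an illustration of how perf stat can be driven from a script, the following Python sketch builds a perf stat command line and parses its machine-readable output. It assumes that perf is installed, that the -x option selects a field separator as described in perf-stat(1), and that each output line begins with the counter value, the unit, and the event name (the layout when no interval or per-CPU options are used). The function names and the choice of the cycles and instructions events are assumptions made for this example:

```python
import subprocess


def build_perf_stat_cmd(command, events=("cycles", "instructions")):
    """Return the argument list for 'perf stat' in comma-separated mode."""
    return ["perf", "stat", "-x", ",", "-e", ",".join(events), "--"] + list(command)


def parse_perf_stat_csv(text):
    """Map event name to counter value; None if the value is not numeric."""
    counts = {}
    for line in text.splitlines():
        fields = line.split(",")
        if len(fields) < 3 or not fields[2]:
            continue  # not a counter line (for example, program output)
        try:
            counts[fields[2]] = float(fields[0])
        except ValueError:
            counts[fields[2]] = None  # for example, an unsupported event
    return counts


def run_perf_stat(command, events=("cycles", "instructions")):
    """Run command under perf stat and return the parsed counters."""
    proc = subprocess.run(
        build_perf_stat_cmd(command, events),
        stderr=subprocess.PIPE,  # perf stat writes its report to stderr
        text=True,
        check=False,
    )
    return parse_perf_stat_csv(proc.stderr)


if __name__ == "__main__":
    # Only print the command here; running it requires perf to be installed.
    print(" ".join(build_perf_stat_cmd(["sleep", "1"])))
```

Because the measured program's own error output shares the same stream, the parser can misread lines that happen to contain commas. Treat the result as a starting point rather than a robust interface.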
PerfSuite is a set of tools, utilities, and libraries that you can use to analyze application software performance on Linux systems. You can use the PerfSuite tools to perform performance-related activities, ranging from assistance with compiler optimization reports to hardware performance counting, profiling, and MPI usage summarization. PerfSuite is open source software. It is licensed under the University of Illinois/NCSA Open Source License, which is OSI-approved.
For more information, see one of the following websites:
http://perfsuite.ncsa.illinois.edu
This website hosts NCSA-specific information about using PerfSuite tools.
PerfSuite includes the psrun utility, which gathers hardware performance information on an unmodified executable. For more information, see http://perfsuite.ncsa.uiuc.edu/psrun/.
The following tools might be useful to you when you try to optimize your code:
The Intel® VTune™ Amplifier XE, which is a performance and thread profiler. This tool does remote sampling experiments. The VTune data collector runs on the Linux system, and an accompanying graphical user interface (GUI) runs on an IA-32 Windows machine, which is used for analyzing the results. VTune allows you to perform interactive experiments while connected to the host through its GUI. An additional tool, the Performance Tuning Utility (PTU), requires the Intel VTune license.
For information about Intel VTune Amplifier XE, see the following URL:
http://software.intel.com/en-us/intel-vtune-amplifier-xe#pid-3773-760
Intel Inspector XE, which is a memory and thread debugger. For information about Intel Inspector XE, see the following:
Intel Advisor XE, which is a threading design and prototyping tool. For information about Intel Advisor XE, see the following:
The following debuggers are available on SGI platforms:
The Intel Debugger and GDB. You can start the Intel Debugger and GDB with the ddd command. The ddd command starts the Data Display Debugger, a GNU product that provides a graphical debugging interface.
TotalView. TotalView is a graphical debugger that you can use with MPI programs. For information about TotalView, including its licensing, see the following:
DDT. The DDT debugger from Allinea Software is a graphical debugger. For information about DDT, see the following:
The following topics provide more information about some of the debuggers available on SGI systems:
The Intel Debugger for Linux is the Intel symbolic debugger. The Intel Debugger is part of Intel Parallel Studio XE Professional Edition and above. This debugger is based on the Eclipse GUI. This debugger works with the Intel® C and C++ compilers, the Intel® Fortran compilers, and the GNU compilers. This product is available if your system is licensed for the Intel compilers. During installation, you are asked whether you want to install it. The idb command starts the GUI. The idbc command starts the command line interface. If you specify the -gdb option on the idb command, the shell command line provides user commands and debugger output similar to the GNU debugger.
You can use the Intel Debugger for Linux with single-threaded applications, multithreaded applications, serial code, and parallel code.
Figure 2-1 shows the GUI.
For more information, see the following:
GDB is the GNU debugger. The GDB debugger supports C, C++, Fortran, and Modula-2 programs. The following information pertains to these compilers:
When compiling with C and C++, include the -g option on the compiler command line. The -g option produces the dwarf2 symbols database that GDB uses.
When using GDB for Fortran debugging, include the -g and -O0 options. Do not use GDB for Fortran debugging when compiling with -O1 or higher. The standard GDB debugger does not support Fortran 95 programs. To debug Fortran 95 programs, download and install the gdbf95 patch from the following website:
http://sourceforge.net/project/showfiles.php?group_id=56720
To verify that you have the correct version of GDB installed, use the gdb -v command. The output should appear similar to the following:
GNU gdb 5.1.1 FORTRAN95-20020628 (RC1)
Copyright 2012 Free Software Foundation, Inc.
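The compiler options described above for GDB can be captured in a small build helper. The following Python sketch only assembles a command line and does not run it. The debug_compile_command() name is hypothetical, and the gcc and gfortran compiler names in the usage lines are assumptions; substitute the compilers installed on your system:

```python
def debug_compile_command(compiler, sources, output, fortran=False):
    """Build a compile command that produces symbols for GDB.

    C and C++ builds get -g. Fortran builds get -g and -O0, because
    debugging Fortran compiled at -O1 or higher is not recommended.
    """
    flags = ["-g"]
    if fortran:
        flags.append("-O0")
    return [compiler, *flags, "-o", output, *sources]


if __name__ == "__main__":
    print(" ".join(debug_compile_command("gcc", ["main.c"], "main")))
    print(" ".join(debug_compile_command("gfortran", ["solver.f90"], "solver",
                                         fortran=True)))
```

The returned list can be passed to a function such as subprocess.run() if you want the script to perform the build as well.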
The Data Display Debugger provides a graphical debugging interface for the GDB debugger and other command line debuggers. To use GDB through a GUI, use the ddd command. Use the --debugger option to select the debugger you want to use. For example, specify --debugger "idb" to select the Intel Debugger. Use the gdb command to start GDB's command line interface.
When the debugger loads, the Data Display Debugger screen appears divided into panes that show the following information:
Array inspection
Source code
Disassembled code
A command line window to the debugger engine
From the View menu, you can switch these panes on and off.
Some commonly used commands can be found on the menus. In addition, the following actions can be useful:
To select an address in the assembly view, click the right mouse button, and select lookup. The gdb command runs in the command pane and shows the corresponding source line.
Select a variable in the source pane, and click the right mouse button. The debugger displays the current value. Arrays appear in the array inspection window. You can print these arrays to PostScript by using the Menu>Print Graph option.
To view the contents of the register file, including general, floating-point, NaT, predicate, and application registers, select Registers from the Status menu. The Status menu also allows you to view stack traces or to switch OpenMP threads.
For a complete list of GDB commands, use the help option or see the documentation at the following website:
http://sourceware.org/gdb/documentation/
Note: The current instances of GDB do not report ar.ec registers correctly. If you are debugging rotating, register-based, software-pipelined loops at the assembly code level, try using the Intel Debugger for Linux.