Chapter 8. Suggested Shortcuts and Workarounds

This chapter includes the following topics:

Determining Process Placement

This topic describes methods that you can use to determine where different processes are running. This can help you understand your application structure and help you decide if there are obvious placement issues. Note that all examples use the C shell.

The following procedure explains how to set up the computing environment.

Procedure 8-1. To create the computing environment

Set up an alias as in this example, changing guest to your username:
% alias pu "ps -edaf|grep guest"
% pu
The pu command alias shows current processes.
Create the .toprc preferences file in your login directory to set the appropriate top(1) options.

If you prefer to use the top(1) defaults, delete the .toprc file.
% cat <<EOF>> $HOME/.toprc
YEAbcDgHIjklMnoTP|qrsuzV{FWX
2mlt
EOF
Inspect all processes, determine which CPU is in use, and create an alias file for this procedure.

The CPU number appears in the first column of the top(1) output:
% top -b -n 1 | sort -n | more
% alias top1 "top -b -n 1 | sort -n "
Use the following variation to produce output with column headings:
% alias top1 "top -b -n 1 | head -4 | tail -1;top -b -n 1 | sort -n"
View your processes, replacing guest with your username:
% top -b -n 1 | sort -n | grep guest
Use the following variation to produce output with column headings:
% top -b -n 1 | head -4 | tail -1;top -b -n 1 | sort -n | grep guest

The following topics present examples:

Example Using `pthreads`

The following example demonstrates simple usage with a program named th. The command sets the desired number of threads and runs the program. Notice the process hierarchy as shown by the PID and the PPID columns. The command usage is as follows, where n is the number of threads:

% th n

% th 4
% pu

UID       PID   PPID   C STIME TTY          TIME CMD
root      13784 13779  0 12:41 pts/3    00:00:00 login -- guest1
guest1    13785 13784  0 12:41 pts/3    00:00:00 -csh
guest1    15062 13785  0 15:23 pts/3    00:00:00 th 4   <-- Main thread
guest1    15063 15062  0 15:23 pts/3    00:00:00 th 4   <-- daemon thread
guest1    15064 15063 99 15:23 pts/3    00:00:10 th 4   <-- worker thread 1
guest1    15065 15063 99 15:23 pts/3    00:00:10 th 4   <-- worker thread 2
guest1    15066 15063 99 15:23 pts/3    00:00:10 th 4   <-- worker thread 3
guest1    15067 15063 99 15:23 pts/3    00:00:10 th 4   <-- worker thread 4
guest1    15068 13857  0 15:23 pts/5    00:00:00 ps -aef
guest1    15069 13857  0 15:23 pts/5    00:00:00 grep guest1

% top -b -n 1 | sort -n | grep guest1

LC %CPU   PID USER     PRI  NI  SIZE  RSS SHARE STAT %MEM   TIME COMMAND
 3  0.0 15072 guest1     16   0  3488 1536  3328 S     0.0   0:00 grep
 5  0.0 13785 guest1     15   0  5872 3664  4592 S     0.0   0:00 csh
 5  0.0 15062 guest1     16   0 15824 2080  4384 S     0.0   0:00 th
 5  0.0 15063 guest1     15   0 15824 2080  4384 S     0.0   0:00 th
 5 99.8 15064 guest1     25   0 15824 2080  4384 R     0.0   0:14 th
 7  0.0 13826 guest1     18   0  5824 3552  5632 S     0.0   0:00 csh
10 99.9 15066 guest1     25   0 15824 2080  4384 R     0.0   0:14 th
11 99.9 15067 guest1     25   0 15824 2080  4384 R     0.0   0:14 th
13 99.9 15065 guest1     25   0 15824 2080  4384 R     0.0   0:14 th
15  0.0 13857 guest1     15   0  5840 3584  5648 S     0.0   0:00 csh
15  0.0 15071 guest1     16   0 70048 1600 69840 S     0.0   0:00 sort
15  1.5 15070 guest1     15   0  5056 2832  4288 R     0.0   0:00 top

Now use dplace(1) with -s 2 to skip the main and daemon processes and place the remaining worker threads on CPUs 4 through 7:

% /usr/bin/dplace -s 2 -c 4-7 th 4
% pu

UID         PID  PPID  C STIME TTY          TIME CMD
root      13784 13779  0 12:41 pts/3    00:00:00 login -- guest1
guest1    13785 13784  0 12:41 pts/3    00:00:00 -csh
guest1    15083 13785  0 15:25 pts/3    00:00:00 th 4
guest1    15084 15083  0 15:25 pts/3    00:00:00 th 4
guest1    15085 15084 99 15:25 pts/3    00:00:19 th 4
guest1    15086 15084 99 15:25 pts/3    00:00:19 th 4
guest1    15087 15084 99 15:25 pts/3    00:00:19 th 4
guest1    15088 15084 99 15:25 pts/3    00:00:19 th 4
guest1    15091 13857  0 15:25 pts/5    00:00:00 ps -aef
guest1    15092 13857  0 15:25 pts/5    00:00:00 grep guest1

% top -b -n 1 | sort -n | grep guest1

LC %CPU   PID USER      PRI  NI  SIZE  RSS SHARE STAT %MEM   TIME COMMAND
 4 99.9 15085 guest1     25   0 15856 2096  6496 R     0.0   0:24 th
 5 99.8 15086 guest1     25   0 15856 2096  6496 R     0.0   0:24 th
 6 99.9 15087 guest1     25   0 15856 2096  6496 R     0.0   0:24 th
 7 99.9 15088 guest1     25   0 15856 2096  6496 R     0.0   0:24 th
 8  0.0 15095 guest1     16   0  3488 1536  3328 S     0.0   0:00 grep
12  0.0 13785 guest1     15   0  5872 3664  4592 S     0.0   0:00 csh
12  0.0 15083 guest1     16   0 15856 2096  6496 S     0.0   0:00 th
12  0.0 15084 guest1     15   0 15856 2096  6496 S     0.0   0:00 th
15  0.0 15094 guest1     16   0 70048 1600 69840 S     0.0   0:00 sort
15  1.6 15093 guest1     15   0  5056 2832  4288 R     0.0   0:00 top

Example Using OpenMP

The following example demonstrates simple OpenMP usage with a program name of md. Set the desired number of OpenMP threads and run the program as follows:

% alias pu "ps -edaf | grep guest1"
% setenv OMP_NUM_THREADS 4
% md

Use the pu alias and the top(1) command to see the output, as follows:

% pu

UID         PID  PPID  C STIME TTY          TIME CMD
root      21550 21535  0 21:48 pts/0    00:00:00 login -- guest1
guest1    21551 21550  0 21:48 pts/0    00:00:00 -csh
guest1    22183 21551 77 22:39 pts/0    00:00:03 md    <-- parent / main
guest1    22184 22183  0 22:39 pts/0    00:00:00 md    <-- daemon 
guest1    22185 22184  0 22:39 pts/0    00:00:00 md    <-- daemon helper
guest1    22186 22184 99 22:39 pts/0    00:00:03 md    <-- thread 1
guest1    22187 22184 94 22:39 pts/0    00:00:03 md    <-- thread 2
guest1    22188 22184 85 22:39 pts/0    00:00:03 md    <-- thread 3
guest1    22189 21956  0 22:39 pts/1    00:00:00 ps -aef
guest1    22190 21956  0 22:39 pts/1    00:00:00 grep guest1

% top -b -n 1 | sort -n | grep guest1

LC %CPU   PID USER      PRI  NI  SIZE  RSS SHARE STAT %MEM   TIME COMMAND
 2  0.0 22192 guest1     16   0 70048 1600 69840 S     0.0   0:00 sort
 2  0.0 22193 guest1     16   0  3488 1536  3328 S     0.0   0:00 grep
 2  1.6 22191 guest1     15   0  5056 2832  4288 R     0.0   0:00 top
 4 98.0 22186 guest1     26   0 26432 2704  4272 R     0.0   0:11 md
 8  0.0 22185 guest1     15   0 26432 2704  4272 S     0.0   0:00 md
 8 87.6 22188 guest1     25   0 26432 2704  4272 R     0.0   0:10 md
 9  0.0 21551 guest1     15   0  5872 3648  4560 S     0.0   0:00 csh
 9  0.0 22184 guest1     15   0 26432 2704  4272 S     0.0   0:00 md
 9 99.9 22183 guest1     39   0 26432 2704  4272 R     0.0   0:11 md
14 98.7 22187 guest1     39   0 26432 2704  4272 R     0.0   0:11 md

From the notation on the right of the pu list, you can see the -x 6 pattern, which is as follows:

Place 1, skip 2 of them, place 3 more [ 0 1 1 0 0 0 ].
Reverse the bit order and create the dplace(1) -x mask:
[ 0 0 0 1 1 0 ] --> [ 0x06 ] --> decimal 6
The dplace(1) command does not currently process hexadecimal notation for this bit mask.
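
The mask arithmetic above can be checked with a short sketch. This is only an illustration in Python, not part of dplace(1); the function name dplace_skip_mask is hypothetical, and the list follows the notation above, where 1 marks a process to skip and 0 marks a process to place, in creation order.

```python
def dplace_skip_mask(pattern):
    """Return the decimal mask for a skip pattern.

    pattern[i] is 1 if the i-th created process is skipped and 0 if it
    is placed. Reversing the bit order means that the first process
    maps to the least significant bit of the mask.
    """
    mask = 0
    for index, skip in enumerate(pattern):
        if skip:
            mask |= 1 << index
    return mask


# Pattern from the md example: place 1, skip 2, place 3 more.
pattern = [0, 1, 1, 0, 0, 0]
print(dplace_skip_mask(pattern))  # prints 6
```

Because dplace(1) does not accept hexadecimal notation for this mask, the decimal value is the one to pass to the -x option.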

The following example confirms that a simple dplace placement works correctly:

% setenv OMP_NUM_THREADS 4
% /usr/bin/dplace -x 6 -c 4-7 md
% pu
UID         PID  PPID  C STIME TTY          TIME CMD
root      21550 21535  0 21:48 pts/0    00:00:00 login -- guest1
guest1    21551 21550  0 21:48 pts/0    00:00:00 -csh
guest1    22219 21551 93 22:45 pts/0    00:00:05 md
guest1    22225 21956  0 22:45 pts/1    00:00:00 ps -aef
guest1    22226 21956  0 22:45 pts/1    00:00:00 grep guest1
guest1    22220 22219  0 22:45 pts/0    00:00:00 md
guest1    22221 22220  0 22:45 pts/0    00:00:00 md
guest1    22222 22220 93 22:45 pts/0    00:00:05 md
guest1    22223 22220 93 22:45 pts/0    00:00:05 md
guest1    22224 22220 90 22:45 pts/0    00:00:05 md

% top -b -n 1 | sort -n | grep guest1

LC %CPU   PID USER      PRI  NI  SIZE  RSS SHARE STAT %MEM   TIME COMMAND
 2  0.0 22228 guest1     16   0 70048 1600 69840 S     0.0   0:00 sort
 2  0.0 22229 guest1     16   0  3488 1536  3328 S     0.0   0:00 grep
 2  1.6 22227 guest1     15   0  5056 2832  4288 R     0.0   0:00 top
 4  0.0 22220 guest1     15   0 28496 2736 21728 S     0.0   0:00 md
 4 99.9 22219 guest1     39   0 28496 2736 21728 R     0.0   0:12 md
 5 99.9 22222 guest1     25   0 28496 2736 21728 R     0.0   0:11 md
 6 99.9 22223 guest1     39   0 28496 2736 21728 R     0.0   0:11 md
 7 99.9 22224 guest1     39   0 28496 2736 21728 R     0.0   0:11 md
 9  0.0 21551 guest1     15   0  5872 3648  4560 S     0.0   0:00 csh
15  0.0 22221 guest1     15   0 28496 2736 21728 S     0.0   0:00 md

Resetting System Limits

To regulate system limits on a per-user basis for applications that do not rely on limits.h, you can modify the limits.conf file. The system limits that you can modify include the maximum file size, maximum number of open files, maximum stack size, and so on. To view this file, type the following:

[user@machine user]# cat /etc/security/limits.conf
# /etc/security/limits.conf
#
#Each line describes a limit for a user in the form:
#
#<domain>        <type>  <item>  <value>
#
#Where:
#<domain> can be:
#        - an user name
#        - a group name, with @group syntax
#        - the wildcard *, for default entry
#
#<type> can have the two values:
#        - "soft" for enforcing the soft limits
#        - "hard" for enforcing hard limits
#
#<item> can be one of the following:
#        - core - limits the core file size (KB)
#        - data - max data size (KB)
#        - fsize - maximum filesize (KB)
#        - memlock - max locked-in-memory address space (KB)
#        - nofile - max number of open files
#        - rss - max resident set size (KB)
#        - stack - max stack size (KB)
#        - cpu - max CPU time (MIN)
#        - nproc - max number of processes
#        - as - address space limit
#        - maxlogins - max number of logins for this user
#        - priority - the priority to run user process with
#        - locks - max number of file locks the user can hold
#
#<domain>      <type>  <item>         <value>
#

#*               soft    core            0
#*               hard    rss             10000
#@student        hard    nproc           20
#@faculty        soft    nproc           20
#@faculty        hard    nproc           50
#ftp             hard    nproc           0
#@student        -       maxlogins       4

# End of file

For information about how to change these limits, see “Resetting the File Limit Resource Default”.
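
As an illustration of the four-field entry layout described in the file comments, the following Python sketch splits active entries into domain, type, item, and value. It is a simplified reader written for this example only; the function name parse_limits and the sample text are assumptions, and the sketch does not reproduce everything that pam_limits accepts.

```python
def parse_limits(text):
    """Return a list of (domain, type, item, value) tuples.

    Comment text after '#' and blank lines are ignored. Lines that do
    not have exactly four fields are skipped in this simplified sketch.
    """
    entries = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        if len(fields) == 4:
            entries.append(tuple(fields))
    return entries


# Hypothetical sample content in the same format as limits.conf.
sample = """
# commented-out entries are ignored
*          soft  nofile     8196
*          hard  nofile     8196
@student   -     maxlogins  4
"""

for domain, limit_type, item, value in parse_limits(sample):
    print(domain, limit_type, item, value)
```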

Resetting the File Limit Resource Default

Several large user applications use the value set in the limits.h file as a hard limit on file descriptors, and that value is fixed at compile time. Therefore, some applications might need to be recompiled in order to take advantage of the SGI system hardware.

To regulate these limits on a per-user basis for applications that do not rely on limits.h, you can modify the limits.conf file. This allows the administrator to set the allowed number of open files per user and per group. This also requires a one-line change to the /etc/pam.d/login file.

The following procedure explains how to change the /etc/pam.d/login file.

Procedure 8-2. To change the file limit resource default

Add the following line to /etc/pam.d/login:

session  required  /lib/security/pam_limits.so

Add the following line to /etc/security/limits.conf, where username is the user's login and limit is the new value for the file limit resource:
[username] hard nofile [limit]

The following command shows the new limit:

ulimit -H -n

Because of the large number of file descriptors that some applications require, such as MPI jobs, you might need to increase the system-wide limit on the number of open files on your SGI system. The default value for the file limit resource is 1024, which allows for approximately 199 MPI processes per host. You can increase the file descriptor value to 8196 to allow for more than 512 MPI processes per host by adding the following lines to the /etc/security/limits.conf file:

*     soft    nofile      8196
*     hard    nofile      8196

The ulimit -a command displays all limits, as follows:

sys:~ # ulimit -a
core file size          (blocks, -c) 1
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 511876
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) 55709764
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 511876
virtual memory          (kbytes, -v) 68057680
file locks                      (-x) unlimited
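
A program can also read the open-file limits that apply to it. The following is a minimal Python sketch that uses the standard resource module; the helper name describe is an assumption made for this example, and the values printed depend on the host and on the limits.conf settings in effect.

```python
import resource


def describe(value):
    """Format a limit value, showing RLIM_INFINITY as 'unlimited'."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return str(value)


# getrlimit returns the (soft, hard) pair for the calling process.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("open files: soft =", describe(soft), "hard =", describe(hard))
```

The soft value corresponds to the open files line of ulimit -a, and the hard value corresponds to the output of ulimit -H -n.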

Resetting the Default Stack Size

Some applications do not run well on an SGI system with a small stack size. To set a higher stack limit, follow the instructions in “Resetting the File Limit Resource Default” and add the following lines to the /etc/security/limits.conf file:

* soft stack 300000
* hard stack unlimited

These lines set a soft stack size limit of 300000 KB and an unlimited hard stack size for all users (and all processes).

Another method that does not require root privilege relies on the fact that many MPI implementations use ssh, rsh, or some sort of login shell to start the MPI rank processes. If you merely need to increase the soft limit, you can modify your shell's startup script. For example, if your login shell is bash, add a line similar to the following to your .bashrc file:

ulimit -s 300000

Note that SGI MPI allows you to set a larger stack size limit. To reset the limit, use the ulimit or limit shell command before launching an MPI program with mpirun(1) or mpiexec_mpt(1). MPT propagates the stack limit setting to all MPI processes in the job.

For more information on default settings, see “Resetting the File Limit Resource Default”.
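
Because a soft limit can be raised only as far as the hard limit allows, it can help to check both before changing the soft stack limit. The following Python sketch is an illustration only; the function name clamp_stack_request is hypothetical. Note that the resource module reports the stack limit in bytes, whereas ulimit -s and limits.conf use KB.

```python
import resource


def clamp_stack_request(requested_kb, hard_limit_bytes):
    """Return the soft stack limit, in bytes, that can be requested.

    The request is given in KB, as with ulimit -s. The result never
    exceeds the hard limit unless the hard limit is unlimited.
    """
    requested = requested_kb * 1024
    if hard_limit_bytes == resource.RLIM_INFINITY:
        return requested
    return min(requested, hard_limit_bytes)


soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
target = clamp_stack_request(300000, hard)
print("current soft stack limit (bytes):", soft)
print("soft stack limit that could be requested (bytes):", target)
```

The sketch only reports values. A process that wants to apply the result could pass it to resource.setrlimit before it starts its child processes.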

Avoiding Segmentation Faults

The default stack size in the Linux operating system is 8 MB (8192 KB). This value often needs to be increased to avoid segmentation fault errors. If your application fails immediately after it starts, check the stack size.

You can use the ulimit -a command to view the stack size, as follows:

uv44-sys:~ # ulimit -a
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
file size               (blocks, -f) unlimited
pending signals                 (-i) 204800
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 16384
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 204800
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

To change the value, use a command similar to the following:

uv44-sys:~ # ulimit -s 300000

A similar setting exists for OpenMP programs. If a program parallelized with OpenMP gets a segmentation fault as soon as it starts, increase the value of the KMP_STACKSIZE environment variable. The default size in Intel Compilers is 4 MB.

For example, to increase it to 64MB, use the following commands:

In the C shell, set the environment variable as follows:
setenv KMP_STACKSIZE 64M
In the Bash shell, set the environment variable as follows:
export KMP_STACKSIZE=64M

Resetting Virtual Memory Size

The virtual memory parameter vmemoryuse determines the amount of virtual memory available to your application.

If you are using the Bash shell, use commands such as the following when setting this limit:

ulimit -a
ulimit -v 7128960
ulimit -v unlimited

If you are using the C shell, use commands such as the following when setting this limit:

limit
limit vmemoryuse 7128960
limit vmemoryuse unlimited

For example, the following MPI program fails with a memory-mapping error because the virtual memory parameter, vmemoryuse, is set too low:

% limit vmemoryuse 7128960

% mpirun -v -np 4 ./program
MPI: libxmpi.so 'SGI MPI 4.9 MPT 1.14  07/18/06 08:43:15'
MPI: libmpi.so  'SGI MPI 4.9 MPT 1.14  07/18/06 08:41:05'
MPI: MPI_MSGS_MAX = 524288
MPI: MPI_BUFS_PER_PROC= 32
mmap failed (memmap_base) for 504972 pages (8273461248 bytes)
Killed

The program now succeeds when virtual memory is unlimited:

% limit vmemoryuse unlimited

% mpirun -v -np 4 ./program
MPI: libxmpi.so 'SGI MPI 4.9 MPT 1.14  07/18/06 08:43:15'
MPI: libmpi.so  'SGI MPI 4.9 MPT 1.14  07/18/06 08:41:05'
MPI: MPI_MSGS_MAX = 524288
MPI: MPI_BUFS_PER_PROC= 32

HELLO WORLD from Processor 0
HELLO WORLD from Processor 2
HELLO WORLD from Processor 1
HELLO WORLD from Processor 3
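
The numbers in the failing run explain the error. The following Python sketch compares the size of the failed mmap request with the vmemoryuse limit, assuming that the C shell limit value is expressed in KB. All values come from the example output above.

```python
# Values taken from the failing run shown earlier in this topic.
pages = 504972
request_bytes = 8273461248
limit_kb = 7128960          # assumed to be in KB, as the C shell reports it

page_size = request_bytes // pages
limit_bytes = limit_kb * 1024

print("page size (bytes):", page_size)                        # 16384
print("vmemoryuse limit (bytes):", limit_bytes)               # 7300055040
print("request exceeds limit:", request_bytes > limit_bytes)  # True
```

A single mapping request of about 8.27 GB cannot fit under a limit of about 7.3 GB, so the mmap call fails until the limit is raised or removed.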

Linux Shared Memory Accounting

The Linux operating system does not calculate memory utilization in a manner that is useful for certain applications in situations where regions are shared among multiple processes. This can lead to over-reporting of memory and to processes being killed by schedulers that erroneously detect memory quota violations.

The get_weighted_memory_size function weighs shared memory regions by the number of processes using the regions. Thus, if 100 processes each share a total of 10GB of memory, the weighted memory calculation shows 100MB of memory shared per process, rather than 10GB for each process.
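
The weighting can be sketched as follows. This Python fragment is only an illustration of the arithmetic described above, not the numatools implementation; the function name weighted_size and the use of decimal units (1 GB = 10^9 bytes) are assumptions made for the example.

```python
def weighted_size(regions):
    """Return the weighted size in bytes for one process.

    regions is a list of (size_bytes, sharer_count) pairs. Each region
    contributes its size divided by the number of processes sharing it.
    """
    return sum(size / sharers for size, sharers in regions)


GB = 10**9
MB = 10**6

# One 10 GB region shared by 100 processes.
per_process = weighted_size([(10 * GB, 100)])
print(per_process / MB, "MB per process")
```

A private region is simply a region with one sharer, so it contributes its full size to the total.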

Because this function applies mostly to applications with large shared-memory requirements, it is located in the SGI NUMA tools package and is provided by the libmemacct library, which is available from a package called memacct. The library function calls the numatools kernel module, which returns the weighted sum to the library, which in turn returns it to the application.

The usage statement for the memacct call is as follows:

cc ... -lmemacct
 #include <sys/types.h>
 extern int get_weighted_memory_size(pid_t pid);

The syntax of the memacct call is as follows:

int get_weighted_memory_size(pid_t pid);

The call returns the weighted memory (RSS) size for a pid, in bytes. This call weights the size of the shared regions by the number of processes accessing the region. When an error occurs, the call returns -1 and sets errno as follows:

`ESRCH`		Process pid was not found.
`ENOSYS`		The function is not implemented. Check if `numatools` kernel package is up-to-date.

Normally, the following errors should not occur:

`ENOENT`		Cannot open `/proc/numatools` device file.
`EPERM`		No read permission on `/proc/numatools` device file.
`ENOTTY`		Inappropriate `ioctl` operation on `/proc/numatools` device file.
`EFAULT`		Invalid arguments. The `ioctl()` operation performed by the function failed with invalid arguments.

For more information, see the memacct(3) man page.

OFED Tuning Requirements for SHMEM

You can specify the maximum number of queue pairs (QPs) for SHMEM applications when run on large clusters over an OFED fabric, such as InfiniBand. If the log_num_qp parameter is set to a number that is too low, the system generates the following message:

MPT Warning: IB failed to create a QP

SHMEM codes use the InfiniBand RC protocol for communication between all pairs of processes in the parallel job, which requires a large number of QPs. The log_num_qp parameter defines the log₂ of the number of QPs. The following procedure explains how to specify the log_num_qp parameter.

Procedure 8-3. To specify the log_num_qp parameter

Log into one of the hosts upon which you installed the MPT software as the root user.
Use a text editor to open file /etc/modprobe.d/libmlx4.conf.
Add a line similar to the following to file /etc/modprobe.d/libmlx4.conf:
options mlx4_core log_num_qp=21
By default, the maximum number of queue pairs is 2¹⁸ (262144). This is true across all platforms (RHEL 7.1, RHEL 6.6, SLES 12, and SLES 11SP3).
Save and close the file.
Repeat the preceding steps on other hosts.
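
Because log_num_qp is the base-2 logarithm of the QP count, the effect of a setting is easy to compute. The following Python sketch shows the counts for the default value and for the value used in the procedure, and finds the smallest setting that covers a given number of QPs. The function name min_log_num_qp and the target of one million QPs are hypothetical; how many QPs a particular job needs is not derived here.

```python
def min_log_num_qp(required_qps):
    """Return the smallest log_num_qp with 2**log_num_qp >= required_qps."""
    if required_qps < 1:
        raise ValueError("required_qps must be at least 1")
    return (required_qps - 1).bit_length()


print(2 ** 18)                    # default maximum: 262144
print(2 ** 21)                    # with log_num_qp=21: 2097152
print(min_log_num_qp(1_000_000))  # hypothetical target of one million QPs
```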

Setting Java Environment Variables

When Java software starts, it checks the environment in which it is running and configures itself to fit, assuming that it owns the entire environment. The default for some Java implementations (for example, IBM J9 1.4.2) is to start a garbage collection (GC) thread for every CPU it sees. Other Java implementations use other algorithms to decide the number of GC threads to start, but the number is generally 0.5 to 1 times the number of CPUs, which is appropriate on a 1- or 2-socket system.

However, this strategy does not scale well to systems with a larger core count. Java command line options let you control the number of GC threads that the Java virtual machine (JVM) will use. In many cases, a single GC thread is sufficient. In other cases, a larger number might be appropriate and can be set with the applicable environment variable or command line option. Properly tuning the number of GC threads for an application is an exercise in performance optimization, but a reasonable starting point is to use one GC thread per active worker thread.

For example:

For Oracle Java:

-XX:ParallelGCThreads

For IBM Java:

-Xgcthreads

An example command line option:

java -XX:+UseParallelGC -XX:ParallelGCThreads=1

As an administrator, you might choose to limit the number of GC threads to a reasonable value with an environment variable set in the global profile, for example the /etc/profile.local file, so casual Java users can avoid difficulties. The environment variable settings are as follows:

For Oracle Java:

JAVA_OPTIONS="-XX:ParallelGCThreads=1"

For IBM Java:

IBM_JAVA_OPTIONS="-Xgcthreads1"
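
If you follow the starting point of one GC thread per active worker thread, the option strings can be generated rather than typed by hand. The following Python sketch is a hypothetical helper, not part of any Java distribution; it only formats the two options named above for a given thread count.

```python
def gc_thread_options(worker_threads):
    """Return the GC thread option for each Java implementation."""
    if worker_threads < 1:
        raise ValueError("worker_threads must be at least 1")
    return {
        "oracle": f"-XX:ParallelGCThreads={worker_threads}",
        "ibm": f"-Xgcthreads{worker_threads}",
    }


options = gc_thread_options(4)
print(options["oracle"])
print(options["ibm"])
```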
