Chapter 12. Troubleshooting

This chapter discusses the following:

  • Diagnostic Tools

  • Problem Removing /rtcpus

Diagnostic Tools

You can use the following diagnostic tools:

  • Use the cat(1) command to view the /proc/interrupts file and determine which CPUs are servicing your interrupts:

    [user@linux user]% cat /proc/interrupts

    For an example, see Appendix A, “libreact API Example”.
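On systems with many CPUs, the raw /proc/interrupts table can be hard to scan. The following is a minimal sketch, not part of the product, that totals the per-CPU counts. The function names parse_interrupts and totals_per_cpu are hypothetical, and the sketch assumes the usual layout of a header row of CPU names followed by one row per interrupt source.

```python
def parse_interrupts(text):
    """Parse /proc/interrupts text into (cpu_names, {irq_label: {cpu: count}})."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return [], {}
    cpus = lines[0].split()              # header row: CPU0 CPU1 ...
    table = {}
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        counts = rest.split()[:len(cpus)]
        # Skip rows that do not carry one numeric count per CPU
        # (for example, a summary row with a single global count).
        if not sep or len(counts) < len(cpus):
            continue
        if not all(field.isdigit() for field in counts):
            continue
        table[label.strip()] = dict(zip(cpus, map(int, counts)))
    return cpus, table


def totals_per_cpu(cpus, table):
    """Sum the interrupt counts that each CPU has serviced."""
    return {cpu: sum(row[cpu] for row in table.values()) for cpu in cpus}
```

To use the sketch on a live system, read /proc/interrupts into a string and pass it to parse_interrupts. A CPU that is meant to be reserved for real-time work but shows a large or growing total is worth a closer look.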

  • Use the profile.pl(1) Perl script to do procedure-level profiling of a program and discover latencies. For more information, see the profile.pl(1) man page.

  • Use the following ps(1) command to see where your threads are running:

    [user@linux user]% ps -FC processname

    For an example, see Appendix A, “libreact API Example”.

    To see the scheduling policy, real-time priority, and current processor of all threads on the system, use the following command:

    [user@linux user]% ps -eLo pid,tid,class,rtprio,psr,cmd

    For more information, see the ps(1) man page.
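The output of the ps -eLo command above can be long. The following sketch, offered as an illustration only, groups the real-time threads by the processor they last ran on. The function names parse_ps_threads and realtime_by_cpu are hypothetical. The sketch assumes the six columns requested above, in that order, and treats the FF (SCHED_FIFO) and RR (SCHED_RR) classes as real-time.

```python
def parse_ps_threads(text):
    """Parse `ps -eLo pid,tid,class,rtprio,psr,cmd` output into a list of dicts."""
    rows = []
    for line in text.splitlines():
        parts = line.split(None, 5)      # cmd may itself contain spaces
        if len(parts) < 6:
            continue
        pid, tid, cls, rtprio, psr, cmd = parts
        try:
            row = {
                "pid": int(pid),
                "tid": int(tid),
                "class": cls,
                # ps prints "-" when a thread has no real-time priority
                "rtprio": None if rtprio == "-" else int(rtprio),
                "psr": int(psr),
                "cmd": cmd.rstrip(),
            }
        except ValueError:
            continue                     # header row or an unexpected line
        rows.append(row)
    return rows


def realtime_by_cpu(rows):
    """Group threads in the FF or RR scheduling class by processor number."""
    by_cpu = {}
    for row in rows:
        if row["class"] in ("FF", "RR"):
            by_cpu.setdefault(row["psr"], []).append(row)
    return by_cpu
```

One way to feed the sketch is to capture the command output with subprocess.run(..., capture_output=True, text=True) and pass its stdout attribute to parse_ps_threads.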

  • Use the top(1) command to display the processes that are consuming the most system resources. For more information, see the top(1) man page.

  • Use the strace(1) command to determine where an application spends most of its time and where large latencies may occur. The strace command is a flexible tool that traces the system calls and signals of an application. Following are several simple examples:

    • To see a per-system-call summary of the time used (percentage, seconds, calls, and errors) for a program named hello_world, use the following:

      [root@linux root]# strace -c hello_world
      execve("./hello_world", ["hello_world"], [/* 80 vars */]) = 0
      Hello World
      % time     seconds  usecs/call     calls    errors syscall
      ------ ----------- ----------- --------- --------- ----------------
       27.69    0.000139          28         5         3 open
       20.92    0.000105          15         7           mmap
       10.76    0.000054          54         1           write
        7.57    0.000038          13         3           fstat
        6.57    0.000033          17         2         1 stat
        5.98    0.000030          15         2           munmap
        4.58    0.000023          12         2           close
        4.38    0.000022          22         1           mprotect
        4.18    0.000021          21         1           madvise
        2.99    0.000015          15         1           read
        2.39    0.000012          12         1           brk
        1.99    0.000010          10         1           uname
      ------ ----------- ----------- --------- --------- ----------------
      100.00    0.000502                    27         4 total

    • You can record the actual chronological progression through a program with the following command (line breaks added for readability):

      [root@linux root]# strace -ttT hello_world
      14:21:03.974181 execve("./hello_world", ["hello_world"], [/* 80 vars */]) = 0
      ..
      14:21:03.976992 mmap(NULL, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0)
         = 0x2000000000040000 <0.000007>
      14:21:03.977053 write(1, "Hello World\n", 12Hello World
      ) = 12 <0.000008>
      14:21:03.977109 munmap(0x2000000000040000, 65536) = 0 <0.000009>
      14:21:03.977158 exit_group(0)           = ?

      The time stamps are displayed in the following format:

      hour:minute:second.microsecond

      The execution time of each system call, in seconds, is displayed at the end of the line in the following format:

      <second>


    Note: You can use the -p option to attach strace to a process that is already running.

    For more information, see the strace(1) man page.
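When a program makes many different system calls, it can help to sort the strace -c summary by the time spent per call. The following sketch is one possible way to do that and is not part of strace. The function names parse_strace_summary and slowest_per_call are hypothetical, and the sketch assumes the column layout shown in the example above, in which the errors column may be empty.

```python
def parse_strace_summary(text):
    """Parse the table printed by `strace -c` into a list of dicts."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows have 5 fields (no errors) or 6 fields (with errors).
        if len(fields) not in (5, 6) or fields[-1] == "total":
            continue
        try:
            row = {
                "percent": float(fields[0]),
                "seconds": float(fields[1]),
                "usecs_per_call": int(fields[2]),
                "calls": int(fields[3]),
                "errors": int(fields[4]) if len(fields) == 6 else 0,
                "syscall": fields[-1],
            }
        except ValueError:
            continue                     # separator rows or program output
        rows.append(row)
    return rows


def slowest_per_call(rows, n=3):
    """Return the n system calls with the highest average time per call."""
    return sorted(rows, key=lambda r: r["usecs_per_call"], reverse=True)[:n]
```

Applied to the hello_world summary above, open accounts for the most total time, but write has the highest cost per call (54 microseconds), which is the kind of distinction that matters when you are looking for latency rather than throughput.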

  • Use Linux Trace Toolkit Next Generation (LTTng) commands. See Chapter 11, “SLES LTTng”.

  • To find the CPU-to-core numbering scheme, examine the following fields in the /proc/cpuinfo file:

    processor
    physical id
    core id

    For example, the following output for a third-party x86-64 system shows that logical CPU 0 (processor 0) and CPU 2 (processor 2) are separate cores in the same socket (physical id 0):

    processor       : 0
    ...
    physical id     : 0
    siblings        : 2
    core id         : 0
    cpu cores       : 2
    
    
    processor       : 2
    ...
    physical id     : 0
    siblings        : 2
    core id         : 1
    cpu cores       : 2

    The following output shows two logical processors, CPU 0 (processor 0) and CPU 8 (processor 8), that are in different sockets:

    processor       : 0
    ..
    physical id     : 0
    siblings        : 16
    core id         : 0
    cpu cores       : 8
    
    processor       : 8
    ..
    physical id     : 1
    siblings        : 16
    core id         : 0
    cpu cores       : 8

    Note the following:

    • CPU 0 is housed in the first socket on the system (physical id 0). This socket has 8 CPU cores. Each of those cores has two logical CPUs if hyperthreading is enabled.

    • CPU 8 is housed in the second socket (physical id 1). This socket has 8 CPU cores. Each of those cores has two logical CPUs if hyperthreading is enabled.

    Both logical CPUs are in the first core of their respective sockets (core id 0).
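Reading these three fields by eye becomes tedious on large systems. The following sketch, which is an illustration under stated assumptions rather than a supported tool, builds the CPU-to-core map from the text of /proc/cpuinfo. The function names cpu_topology and cpus_by_core are hypothetical. The sketch assumes that each processor line starts a new record and that the physical id and core id lines, when present, follow it.

```python
def cpu_topology(text):
    """Map each logical CPU number to its 'physical id' and 'core id'."""
    topology = {}
    cpu = None
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if not sep or not value.isdigit():
            continue                     # elided lines and non-numeric fields
        if key == "processor":
            cpu = int(value)
            topology[cpu] = {"physical id": None, "core id": None}
        elif key in ("physical id", "core id") and cpu is not None:
            topology[cpu][key] = int(value)
    return topology


def cpus_by_core(topology):
    """Group logical CPUs that share the same (socket, core) pair."""
    groups = {}
    for cpu, ids in sorted(topology.items()):
        groups.setdefault((ids["physical id"], ids["core id"]), []).append(cpu)
    return groups
```

Logical CPUs that land in the same group share a physical core, so they are hyperthread siblings. For the second example above, CPU 0 maps to (0, 0) and CPU 8 maps to (1, 0), which places them in different sockets.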

Problem Removing /rtcpus

You should stop real-time processes before using the --disable option. If you do not, the script attempts to move the processes off the real-time CPUs. If it is unable to move them, it displays the following failure message:

 "*** Problem removing /rtcpus/rtcpu3. cpuset***
  Try again.  If that doesn't work check /dev/cpuset/rtcpus/rtcpu3/tasks
  for potential problem PIDS;