Chapter 4. Trace PMDA

This chapter provides an introduction to the design of the trace Performance Metrics Domain Agent (PMDA), in an effort to explain how to configure the agent optimally for a particular problem domain. This information supplements the functional coverage that the man pages provide for both the agent and the library interfaces.

The chapter also includes information on how to use the trace PMDA and the associated library (libpcp_trace) for instrumenting applications. On IRIX, the example programs are installed in /var/pcp/demos/trace from the pcp.sw.trace subsystem. On Linux, the trace PMDA is part of the pcp-2.3-rev package, and the example programs are installed in /usr/share/pcp/demo/trace.

4.1. Performance Instrumentation and Tracing

The pcp_trace library provides function calls for identifying sections of a program as transactions or events for examination by the trace PMDA (a user command called pmdatrace). The pcp_trace library is described in the pmdatrace(3) man page.

The monitoring of transactions using the Performance Co-Pilot (PCP) infrastructure begins with a pmtracebegin call. Time is recorded from there to the corresponding pmtraceend call (with matching tag identifier). A transaction in progress can be cancelled by calling pmtraceabort.

A second form of program instrumentation is available with the pmtracepoint function. This is a simpler form of monitoring that exports only the number of times a particular point in a program is passed. The pmtraceobs and pmtracecount functions have similar semantics to pmtracepoint, but allow an arbitrary numeric value to be passed to the trace PMDA.
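
To make these entry points concrete, the following C sketch shows one way the calls might be placed around a unit of work. It is illustrative only: the header <pcp/trace.h> and the -lpcp_trace link library are the conventional ones documented in pmdatrace(3), while the surrounding function, tag names, and values are invented for the example.

    #include <pcp/trace.h>    /* libpcp_trace entry points; link with -lpcp_trace */

    /* Hypothetical unit of work, instrumented with the calls described above. */
    void
    do_request(int failed, double queue_depth)
    {
        pmtracebegin("request");          /* start timing the "request" transaction */

        /* ... the work being measured ... */

        if (failed)
            pmtraceabort("request");      /* discard the transaction in progress */
        else
            pmtraceend("request");        /* report the service time for this tag */

        pmtracepoint("request-done");     /* count how often this point is passed */
        pmtraceobs("queue-depth", queue_depth);  /* export an arbitrary value */
    }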

The pmdatrace command is a PMDA that exports transaction performance metrics from application processes using the pcp_trace library; see the pmdatrace(1) man page for details.

For a complete introduction to performance tracing, refer to the Web-based PCP Tutorial, which contains the trace.html file covering this topic.

4.2. Trace PMDA Design

Trace PMDA design covers application interaction, sampling techniques, and configuring the trace PMDA.

4.2.1. Application Interaction

Figure 4-1 describes the general state maintained within the trace PMDA.

Figure 4-1. Trace PMDA Overview


Applications that are linked with the libpcp_trace library make calls through the trace Application Programming Interface (API). These calls result in interprocess communication of trace data between the application and the trace PMDA. This data consists of an identification tag and the performance data associated with that particular tag. The trace PMDA aggregates the incoming information and periodically updates the exported summary information to describe activity in the recent past.

As each protocol data unit (PDU) is received, its data is stored in the current working buffer. At the same time, the global counter associated with the particular tag contained within the PDU is incremented. The working buffer contains all performance data that has arrived since the previous time interval elapsed. For additional information about the working buffer, see Section 4.2.2.2.

4.2.2. Sampling Techniques

The trace PMDA employs a rolling-window periodic sampling technique. The arrival time of the data at the trace PMDA in conjunction with the length of the sampling period being maintained by the PMDA determines the recency of the data exported by the PMDA. Through the use of rolling-window sampling, the trace PMDA is able to present a more accurate representation of the available trace data at any given time than it could through use of simple periodic sampling.

The rolling-window sampling technique affects the metrics in Example 4-1:

Example 4-1. Rolling-Window Sampling Technique

trace.observe.rate
trace.counter.rate
trace.point.rate
trace.transact.ave_time
trace.transact.max_time
trace.transact.min_time
trace.transact.rate

The remaining metrics are either global counters, control metrics, or the last observation value seen. Section 4.3 documents all of the metrics exported by the trace PMDA in more detail.

4.2.2.1. Simple Periodic Sampling

The simple periodic sampling technique uses a single historical buffer to store the history of events that have occurred over the sampling interval. As events occur, they are recorded in the working buffer. At the end of each sampling interval, the working buffer (which at that time holds the historical data for the sampling interval just finished) is copied into the historical buffer, and the working buffer is then cleared, ready to hold new events from the sampling interval now starting.
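
As an illustration only (this is not the trace PMDA's actual data structure), simple periodic sampling can be sketched with a single working buffer and a single historical buffer:

    /* Illustrative sketch of simple periodic sampling; the buffer layout and
     * event counter are invented for this example. */
    typedef struct {
        unsigned int events;      /* events recorded during one interval */
    } samplebuf_t;

    static samplebuf_t working;   /* fills with events as they arrive */
    static samplebuf_t history;   /* holds the previous interval's data */

    void
    record_event(void)
    {
        working.events++;
    }

    void
    interval_elapsed(void)        /* called once per sampling interval */
    {
        history = working;        /* working buffer becomes the history */
        working.events = 0;       /* cleared, ready for the new interval */
    }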

4.2.2.2. Rolling-Window Periodic Sampling

In contrast to simple periodic sampling with its single historical buffer, the rolling-window periodic sampling technique maintains a number of separate buffers. One buffer is marked as the current working buffer, and the remainder of the buffers hold historical data. As each event occurs, the current working buffer is updated to reflect it.

At a specified interval, the current working buffer and the accumulated data that it holds are moved into the set of historical buffers, and a new working buffer is used. The specified interval is a function of the number of historical buffers maintained.

The primary advantage of the rolling-window sampling technique is seen at the point where data is actually exported. At this point, the data has a higher probability of reflecting a more recent sampling period than the data exported using simple periodic sampling.

As shown in Figure 4-2, the data collected over each sample duration and exported using the rolling-window sampling technique provides a more up-to-date representation of the activity during the most recently completed sample duration than simple periodic sampling does.

Figure 4-2. Sample Duration Comparison


The trace PMDA allows the length of the sample duration to be configured, as well as the number of historical buffers that are maintained. The rolling-window approach is implemented in the trace PMDA as a ring buffer (see Figure 4-1).

When the current working buffer is moved into the set of historical buffers, the least recent historical buffer is cleared of data and becomes the new working buffer.
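
Continuing the illustration from Section 4.2.2.1 (again a sketch, not the PMDA's internal code), the rolling window can be pictured as a small ring of buffers in which the oldest buffer is recycled as the new working buffer at each rotation:

    /* Illustrative ring of one working buffer plus historical buffers.
     * NBUF and the per-buffer counter are placeholders for the example. */
    #define NBUF 6                        /* 1 working + 5 historical buffers */

    static unsigned int buf[NBUF];        /* events recorded per subinterval */
    static int working = 0;               /* index of the current working buffer */

    void
    record_event(void)
    {
        buf[working]++;
    }

    void
    rotate(void)                          /* called at each rotation interval */
    {
        working = (working + 1) % NBUF;   /* least recent buffer is reused... */
        buf[working] = 0;                 /* ...after being cleared of old data */
    }

    unsigned int
    window_events(void)                   /* total held in the historical buffers */
    {
        unsigned int total = 0;
        for (int i = 0; i < NBUF; i++)
            if (i != working)
                total += buf[i];
        return total;
    }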

4.2.2.3. Rolling-Window Periodic Sampling Example

Consider a scenario in which you want to know the rate of transactions over the last 10 seconds. You set the sample duration for the trace PMDA to 10 seconds and fetch the metric trace.transact.rate. If eight transactions took place in the last 10 seconds, the transaction rate would be 8/10, or 0.8 transactions per second.

The trace PMDA does not actually work this way. Instead, it performs its calculations automatically at a subinterval of the sampling interval. Reconsider the 10-second scenario: it has a calculation subinterval of 2 seconds, as shown in Figure 4-3.

Figure 4-3. Sampling Intervals


If at 13.5 seconds you request the transaction rate, you receive a value of 0.7 transactions per second. In actual fact, the transaction rate was 0.8, but the trace PMDA did its calculations on the sampling interval from 2 seconds to 12 seconds, not on the interval from 3.5 seconds to 13.5 seconds; only seven transactions fell within the interval it used, giving 7/10, or 0.7 transactions per second. For efficiency, the trace PMDA recalculates the metrics covering the last 10 seconds once every 2 seconds, so the PMDA is not driven to perform a calculation each time a fetch request is received.

4.2.3. Configuring the Trace PMDA

The trace PMDA is configurable primarily through command-line options. The list of command-line options in Table 4-1 is not exhaustive, but it identifies those options which are particularly relevant to tuning the manner in which performance data is collected.

Table 4-1. Selected Command-Line Options

Access controls
    The trace PMDA offers host-based access control. This control allows and disallows connections from instrumented applications running on specified hosts or groups of hosts. Limits to the number of connections allowed from individual hosts can also be mandated.

Sample duration
    The interval over which metrics are to be maintained before being discarded is called the sample duration.

Number of historical buffers
    The data maintained for the sample duration is held in a number of internal buffers within the trace PMDA. These are referred to as historical buffers. This number is configurable so that the rolling window effect can be tuned within the sample duration.

Counter and observation metric units
    Since the data being exported by the trace.observe.value and trace.counter.count metrics are user-defined, the trace PMDA by default exports these metrics with a type of "none." A framework is provided that allows the user to make the type more specific (for example, bytes per second) and allows the exported values to be plotted along with other performance metrics of similar units by tools like pmchart.

Instance domain refresh
    The set of instances exported for each of the trace metrics can be cleared through the storable trace.control.reset metric.


4.3. Trace API

The libpcp_trace Application Programming Interface (API) is called from C, C++, Fortran, and Java. Each language has access to the complete set of functionality offered by libpcp_trace. In some cases, the calling conventions differ slightly between languages. This section presents an overview of each of the different tracing mechanisms offered by the API, as well as an explanation of their mappings to the actual performance metrics exported by the trace PMDA.

4.3.1. Transactions

Paired calls to the pmtracebegin and pmtraceend API functions result in transaction data being sent to the trace PMDA, along with a measure of the time interval between the two calls. This interval is the transaction service time. Using the pmtraceabort call causes the data for that particular transaction to be discarded. The trace PMDA exports transaction data through the trace.transact metrics listed in Table 4-2 (a short usage sketch follows the table):

Table 4-2. trace.transact Metrics

trace.transact.ave_time
    The average service time per transaction type. This time is calculated over the last sample duration.

trace.transact.count
    The running count for each transaction type seen since the trace PMDA started.

trace.transact.max_time
    The maximum service time per transaction type within the last sample duration.

trace.transact.min_time
    The minimum service time per transaction type within the last sample duration.

trace.transact.rate
    The average rate at which each transaction type is completed. The rate is calculated over the last sample duration.

trace.transact.total_time
    The cumulative time spent processing each transaction since the trace PMDA started running.

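For instance, a database query could be timed as a transaction, with the transaction aborted when the query fails so that the failure does not distort trace.transact.ave_time. The function, tag, and query routine below are invented for this sketch; the sketch also assumes the libpcp_trace convention, described in pmdatrace(3), that the routines return a negative value on error.

    #include <stdio.h>
    #include <pcp/trace.h>

    extern int run_query(const char *sql);    /* hypothetical application routine */

    int
    timed_query(const char *sql)
    {
        int sts;

        if ((sts = pmtracebegin("db-query")) < 0) {
            fprintf(stderr, "pmtracebegin failed: %d\n", sts);
            return run_query(sql);            /* still do the work, just untimed */
        }

        if ((sts = run_query(sql)) < 0)
            pmtraceabort("db-query");         /* failed: discard this transaction */
        else
            pmtraceend("db-query");           /* success: report the service time */

        return sts;
    }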

4.3.2. Point Tracing

Point tracing allows the application programmer to export metrics related to salient events. The pmtracepoint function is most useful when start and end points are not well defined. For example, this function is useful when the code branches in such a way that a transaction cannot be clearly identified, or when processing does not follow a transactional model, or when the desired instrumentation is akin to event rates rather than event service times. This data is exported through the trace.point metrics listed in Table 4-3:

Table 4-3. trace.point Metrics

trace.point.count
    Running count of point observations for each tag seen since the trace PMDA started.

trace.point.rate
    The average rate at which observation points occur for each tag within the last sample duration.


4.3.3. Observations and Counters

The pmtraceobs and pmtracecount functions have similar semantics to pmtracepoint, but also allow an arbitrary numeric value to be passed to the trace PMDA. The most recent value for each tag is then immediately available from the PMDA. Observation data is exported through the trace.observe metrics listed in Table 4-4 (a brief example follows the table):

Table 4-4. trace.observe Metrics

trace.observe.count
    Running count of observations seen since the trace PMDA started.

trace.observe.rate
    The average rate at which observations for each tag occur. This rate is calculated over the last sample duration.

trace.observe.value
    The numeric value associated with the observation last seen by the trace PMDA.

trace.counter
    Counter data is exported through the trace.counter metrics. The only difference between trace.counter and trace.observe metrics is that the numeric value of trace.counter must be a monotonic increasing count.

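For example (the function and tag names are invented for this sketch), an application might export the current depth of a work queue as an observation:

    #include <pcp/trace.h>

    /* Export the instantaneous queue depth for the "queue-depth" tag;
     * trace.observe.value then reports the most recent value seen. */
    void
    report_queue_depth(double depth)
    {
        pmtraceobs("queue-depth", depth);
    }

The counter call is used in the same way, but the value passed must be a monotonic increasing count, such as a running total of bytes processed since the application started.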

4.3.4. Configuring the Trace Library

The trace library is configurable through the use of environment variables listed in Table 4-5 as well as through the state flags listed in Table 4-6. Both provide diagnostic output and enable or disable the configurable functionality within the library.

Table 4-5. Environment Variables

PCP_TRACE_HOST
    The name of the host where the trace PMDA is running.

PCP_TRACE_PORT
    The TCP/IP port number on which the trace PMDA is accepting client connections.

PCP_TRACE_TIMEOUT
    The number of seconds to wait before timing out on the initial connection attempt to the trace PMDA. The default is three seconds.

PCP_TRACE_REQTIMEOUT
    The number of seconds to allow before timing out while awaiting acknowledgment from the trace PMDA after trace data has been sent to it. This variable has no effect in the asynchronous trace protocol (refer to Table 4-6).

PCP_TRACE_RECONNECT
    A list of positive values, in seconds, representing the delays that the libpcp_trace library routines use between successive attempts to reconnect to the trace PMDA after a connection has been lost. When the final value in the list is reached, that value is used for all subsequent reconnection attempts.

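As an example of how these variables might be used, an application can set them before making its first trace call. This is only a sketch: the host name and timeout value are placeholders, and it assumes the variables are read from the process environment when the library first connects.

    #include <stdlib.h>
    #include <pcp/trace.h>

    void
    setup_tracing(void)
    {
        /* Placeholder host and timeout; normally these would be set in the
         * environment before the application is started. */
        setenv("PCP_TRACE_HOST", "pmda-host.example.com", 1);
        setenv("PCP_TRACE_TIMEOUT", "5", 1);

        pmtracepoint("startup");    /* first trace call triggers the connection */
    }
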
The state flags listed in Table 4-6 are used to customize the operation of the libpcp_trace routines. These flags are registered through the pmtracestate call, and they can be set either individually or together; a short example follows the table.

Table 4-6. State Flags

PMTRACE_STATE_NONE
    The default. No state flags have been set; the fault-tolerant, synchronous protocol is used for communicating with the trace PMDA, and no diagnostic messages are displayed by the libpcp_trace routines.

PMTRACE_STATE_API
    High-level diagnostics. This flag simply displays entry into each of the API routines.

PMTRACE_STATE_COMMS
    Diagnostic messages related to establishing and maintaining the communication channel between the application and the PMDA.

PMTRACE_STATE_PDU
    The low-level details of the trace protocol data units (PDUs) are displayed as each PDU is transmitted or received.

PMTRACE_STATE_PDUBUF
    The full contents of the PDU buffers are dumped as PDUs are transmitted and received.

PMTRACE_STATE_NOAGENT
    Interprocess communication control. If this flag is set, interprocess communication between the instrumented application and the trace PMDA is skipped. This flag is a debugging aid for applications using libpcp_trace.

PMTRACE_STATE_ASYNC
    Asynchronous trace protocol. This flag enables the asynchronous trace protocol so that the application does not block awaiting acknowledgment PDUs from the trace PMDA. For the flag to be effective, it must be set before any other libpcp_trace entry points are used.

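For example, to enable the asynchronous protocol together with communication diagnostics before any other trace calls are made (the particular flag combination here is only an illustration):

    #include <pcp/trace.h>

    void
    init_tracing(void)
    {
        /* Flags may be set individually or OR'd together; PMTRACE_STATE_ASYNC
         * must be registered before any other libpcp_trace entry point is used. */
        pmtracestate(PMTRACE_STATE_ASYNC | PMTRACE_STATE_COMMS);
    }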

4.4. Instrumenting Applications to Export Performance Data

The relationship between an application, the libpcp_trace library, the trace PMDA and the rest of the PCP infrastructure is shown in Figure 4-4:

Figure 4-4. Application and PCP Relationship


The libpcp_trace library is designed to encourage application developers (independent software vendors and end-user customers) to embed calls in their code that enable application performance data to be exported. When combined with system-level performance data, this feature allows total performance and resource demands of an application to be correlated with application activity.

For example, developers can provide the following application performance metrics:

  • Computation state (especially for codes with major shifts in resource demands between phases of their execution)

  • Problem size and parameters (that is, degree of parallelism); throughput in terms of subproblems solved, iteration count, transactions, data sets inspected, and so on

  • Service time by operation type

The libpcp_trace library approach offers a number of attractive features:

  • A simple API for inserting instrumentation calls into an application as shown in the following example:

    pmtracebegin("pass 1");
    ...
    pmtraceend("pass 1");
    ...
    pmtraceobs("threads", N);

  • Trace routines that are called from C, C++, Fortran, and Java, and that are well suited to macro encapsulation for compile-time inclusion and exclusion (see the sketch following this list).

  • Shipped source code for a stub version of the library that enables the following:

    • Replacement by private debugging or development versions

    • Flexibility based on not being locked into an SGI program

    • Added functionality on SGI platforms, when the PCP version of the library is present

  • A PCP version of the library that allows numerical observations, the times measured between matching begin and end calls, and so on, to be shipped to a PCP agent and then exported into the PCP infrastructure. Because exporting is controlled by environment variables, the overhead is very low if the metrics are not being exported.

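The following sketch pulls these pieces together into a small, self-contained program. It is illustrative only: the macros, tags, and workload are invented, and the compile-time TRACE guard shows one way the calls can be encapsulated for inclusion or exclusion, as mentioned above.

    #include <stdio.h>
    #include <unistd.h>

    #ifdef TRACE
    #include <pcp/trace.h>                 /* compile with -DTRACE -lpcp_trace */
    #define TRACE_BEGIN(tag)   pmtracebegin(tag)
    #define TRACE_END(tag)     pmtraceend(tag)
    #define TRACE_OBS(tag, v)  pmtraceobs(tag, v)
    #else                                  /* tracing compiled out entirely */
    #define TRACE_BEGIN(tag)
    #define TRACE_END(tag)
    #define TRACE_OBS(tag, v)
    #endif

    int
    main(void)
    {
        int pass;

        for (pass = 1; pass <= 3; pass++) {
            TRACE_BEGIN("pass");           /* service time for each pass */
            sleep(1);                      /* stands in for the real work */
            TRACE_END("pass");
            TRACE_OBS("passes-done", (double)pass);
        }
        return 0;
    }
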
Once the application performance metrics are exported into the PCP framework, all of the PCP tools may be leveraged to provide performance monitoring and management, including:

  • Two- and three-dimensional visualization of resource demands and performance, showing concurrent system activity and application activity.


    Note: On Linux, visualization tools are not provided as part of the PCP for IA-64 Linux distribution.


  • Transport of performance data over the network for distributed performance management.

  • Archive logging for historical records of performance, most useful for problem diagnosis, postmortem analysis, performance regression testing, capacity planning, and benchmarking.

  • Automated alarms when bad performance is observed. These apply both in real time and when scanning archives.

  • A toolkit approach that encourages customization. For example, a complete PCP for XYZ package could be offered for performance monitoring of application XYZ on SGI and other Linux-based platforms.