Andi Kleen's blog

Tilting at windmills and other endeavors

A primer on Processor Trace timing

without comments

Processor Trace can be quite useful to understand the timing break down of programs. This post describes how the timing works and how to get the best timing resolution.

Processor Trace logs are packetized. A trace uses three kinds of timing packets: PSB TSC updates, MTC and CYC. PSB is the basic synchronization of the trace and provides a full TSC time stamp. However it is quite coarse grained. The MTC packet provides regular timing updates from the 24/25Mhz ART Always Running Timer. And then finally there is an optional cycle accurate mode that gives cycle accurate updates relative to the last MTC packet.

Note that Broadwell Processor Trace used a different much less accurate method only using the PSB timing updates. For accurate timing please use Skylake or later or Goldmont Atom or later..

By default Linux perf uses a mtc period of 3, which means there is an ART timing update ever 2^3=8 times, so roughly every 300ns. That is quite coarse grained, usually not fine grained enough for a program time break down. It is possible to increase the timing resolution at some trace overhead cost, longer decoding time, and a risk of additional data loss.

We can set the MTC period to 0 to get more frequent MTC updates. In addition by default perf script only reports us, so we should use the –ns option to force nano second timestamp output:

perf record -e intel_pt/mtc_period=0/ -a ...
perf script --ns --insn-trace --xed

Now in addition to that we can enable cycle accurate mode to get better resolution (at the cost of a significantly larger trace):

perf record -e intel_pt/mtc_period=0,cyc/ -a ...
perf script --ns --insn-trace --xed

When the resulting trace is too big it’s also possible to use the cyc_thresh=N option, which configured cycle packets only every 2^N cycles. This can be useful if the full accurate trace causes too much data loss.

When looking at the output we still only get updates roughly every 5 conditional branches or returns. This is related to how PT encodes the branches into packets. Cycle updates are only written to the trace when the CPU writes a packet for a branch, or another reason. Conditional branches and returns are encoded into TNT packets, and a TNT packet is always filled with 5 conditionals or returns before it is being written.

For returns we can disable the ‘return compression’ which leads to a guaranteed packet at every return, so at least we get a time stamp at the end of every function. Again this comes at significant additional trace overhead.

perf record -e intel_pt/noretcomp,cyc,mtc_period=0/ ...
perf script --ns --insn-trace --xed

With this we can see timing updates every 5 conditional branches or return. But often we want to time complete functions. Unfortunately function calls are direct, which are not logged by PT because it can be statically determined by the PT decoder. One track is to call the function through an indirect pointer, which results in a TIP packet, and therefore a timing update at the beginning of the function. However that requires modifying the code.

There is another trick that we can use to time individual functions. PT has address filters to enable/disable the trace. We can enable PT at the beginning of the function and disable it at the end. Disable/Enable has accurate time stamps, so that’s good enough to time the function:

perf record -e intel_pt/cyc,mtc_period=0,noretcomp/ --filter 'myfunc @ /path/to/executable' executable
perf script --ns --insn-trace --xed

The CPU supports upto two address filter region, so this trick works for two function at a time. We could time more by switching the filter regions over time. It can be used to time smaller program regions too, however it is required that they be entered and exited by a branch.

[Based on an earlier writeup from Adrian Hunter]

Written by therapsid

July 28th, 2020 at 12:36 am