Andi Kleen's blog

Tilting at windmills and other endeavors

Archive for the ‘pt’ Category

Cheat sheet for Intel Processor Trace with Linux perf and gdb

without comments

What is Processor Trace

Intel Processor Trace (PT) traces program execution (every branch) with low overhead.

This is a cheat sheet of how to use PT with perf for common tasks

It is not a full introduction to PT. Please read Adding PT to Linux perf or the links from the general PT reference page.

PT support in hardware

CPU Support
Broadwell (5th generation Core, Xeon v4) More overhead. No fine grained timing.
Skylake (6th generation Core, Xeon v5) Fine grained timing. Address filtering.
Goldmont (Apollo Lake, Denverton) Fine grained timing. Address filtering.

PT support in Linux

PT is supported in Linux perf, which is integrated in the Linux kernel.
It can be used through the “perf” command or through gdb.

There are also other tools that support PT: VTune, simple-pt, gdb, JTAG debuggers.

In general it is best to use the newest kernel and the newest Linux perf tools. If that is not possible older tools and kernels can be used. Newer tools can be used on an older kernel, but may not support all features

Linux version Support
Linux 4.1 Initial PT driver
Linux 4.2 Support for Skylake and Goldmont
Linux 4.3 Initial user tools support in Linux perf
Linux 4.5 Support for JIT decoding using agent
Linux 4.6 Bug fixes. Support address filtering.
Linux 4.8 Bug fixes.
Linux 4.10 Bug fixes. Support for PTWRITE and power tracing

Many commands require recent perf tools, you may need to update them rom a recent kernel tree.

This article covers mainly Linux perf and briefly gdb.


Only needed once.

Allow seeing kernel symbols (as root)

echo kernel.kptr_restrict=0' >> /etc/sysctl.conf
sysctl -p

Basic perf command lines for recording PT

ls /sys/devices/intel_pt/format

Check if PT is supported and what capabilities.

perf record -e intel_pt// program

Trace program

perf record -e intel_pt// -a sleep 1

Trace whole system for 1 second

perf record -C 0 -e intel_pt// -a sleep 1

Trace CPU 0 for 1 second

perf record --pid $(pidof program) -e intel_pt//

Trace already running program.

perf has to save the data to disk. The CPU can execute branches much faster than than the disk can keep up, so there will be some data loss for code that executes
many instructions. perf has no way to slow down the CPU, so when trace bandwidth > disk bandwidth there will be gaps in the trace. Because of this it is usually not a good idea
to try to save a long trace, but work with shorter traces. Long traces also take a lot of time to decode.

When decoding kernel data the decoder usually has to run as root.
An alternative is to use the script included with perf

perf script --ns --itrace=cr

Record program execution and display function call graph.

perf script by defaults “samples” the data (only dumps a sample every 100us).
This can be configured using the –itrace option (see reference below)

Install xed first.

perf script --itrace=i0ns --ns -F time,pid,comm,sym,symoff,insn,ip | xed -F insn: -S /proc/kallsyms -64

Show every assembly instruction executed with disassembler.

For this it is also useful to get more accurate time stamps (see below)

perf script --itrace=i0ns --ns -F time,sym,srcline,ip

Show source lines executed (requires debug information)

perf script --itrace=s1Mi0ns ....

Often initialization code is not interesting. Skip initial 1M instructions while decoding:

perf script --time 1.000,2.000 ...

Slice trace into different time regions Generally the time stamps need to be looked up first in the trace, as they are absolute.

perf report --itrace=g32l64i100us --branch-history

Print hot paths every 100us as call graph histograms

Install Flame graph tools first.

perf script --itrace=i100usg | > workload.folded workloaded.folded > workload.svg
google-chrome workload.svg

Generate flame graph from execution, sampled every 100us

Other ways to record data

perf record -a -e intel_pt// sleep 1

Capture whole system for 1 second

Use snapshot mode

This collects data, but does not continuously save it all to disk. When an event of interest happens a data dump of the current buffer can be triggered by sending a SIGUSR2 signal to the perf process.

perf record -a -e --snapshot intel_pt// sleep 1
*execute workload*

*event happens*
kill -USR2 $PERF_PID

*end of recording*
kill $PERF_PID>

Record kernel only, complete system

perf record -a -e intel_pt//k sleep 1

Record user space only, complete system

perf record -a -e intel_pt//u

Enable fine grained timing (needs Skylake/Goldmont, adds more overhead)

perf record -a -e intel_pt/cyc=1,cyc_thresh=2/ ...

echo $[100*1024*1024] > /proc/sys/kernel/perf_event_mlock_kb
perf record -m 512,100000 -e intel_pt// ...

Increase perf buffer to limit data loss

perf record -e intel_pt// --filter 'filter main @ /path/to/program' ...

Only record main function in program

perf record -e intel_pt// -a --filter 'filter sys_write' program

Filter kernel code (needs 4.11+ kernel)

perf record -e intel_pt// -a --filter 'start func1 @ program' --filter 'stop func2 @ program' program

Start tracing in program at main and stop tracing at func2.

perf archive
rsync -r ~/.debug other-system:

Transfer data to a trace on another system. May also require using if decoding

Using gdb

Requires a new enough gdb built with libipt. For user space only.

gdb program
record btrace pt

record instruction-history /m # show instructions
record function-history # show functions executed
prev # step backwards in time

For more information on gdb pt see the gdb documentation


The perf PT documentation

Reference for –itrace option (from perf documentation)

i synthesize "instructions" events
b synthesize "branches" events
x synthesize "transactions" events
c synthesize branches events (calls only)
r synthesize branches events (returns only)
e synthesize tracing error events
d create a debug log
g synthesize a call chain (use with i or x)
l synthesize last branch entries (use with i or x)
s skip initial number of events

Reference for –filter option (from perf documentation)

A hardware trace PMU advertises its ability to accept a number of
address filters by specifying a non-zero value in
/sys/bus/event_source/devices/ /nr_addr_filters.

Address filters have the format:

filter|start|stop|tracestop [/ ] [@]

- 'filter': defines a region that will be traced.
- 'start': defines an address at which tracing will begin.
- 'stop': defines an address at which tracing will stop.
- 'tracestop': defines a region in which tracing will stop.

is the name of the object file, is the offset to the
code to trace in that file, and is the size of the region to
trace. 'start' and 'stop' filters need not specify a .

If no object file is specified then the kernel is assumed, in which case
the start address must be a current kernel memory address.

can also be specified by providing the name of a symbol. If the
symbol name is not unique, it can be disambiguated by inserting #n where
'n' selects the n'th symbol in address order. Alternately #0, #g or #G
select only a global symbol. can also be specified by providing
the name of a symbol, in which case the size is calculated to the end
of that symbol. For 'filter' and 'tracestop' filters, if is
omitted and is a symbol, then the size is calculated to the end
of that symbol.

If is omitted and is '*', then the start and size will
be calculated from the first and last symbols, i.e. to trace the whole
If symbol names (or '*') are provided, they must be surrounded by white

The filter passed to the kernel is not necessarily the same as entered.
To see the filter that is passed, use the -v option.

The kernel may not be able to configure a trace region if it is not
within a single mapping. MMAP events (or /proc/ /maps) can be
examined to determine if that is a possibility.

Multiple filters can be separated with space or comma.

v2: Fix some typos/broken links

Written by therapsid

April 7th, 2017 at 8:55 pm

Intel Processor Trace resources

without comments

Intel Processor Trace (PT) can be used on modern Intel CPUs to trace execution. This page contains references for learning about and using Intel PT.

Basic information:


JTAG support

Other presentations


Research papers using PT (subset):


Written by therapsid

March 17th, 2017 at 4:46 pm

Announcing simple-pt — A simple Processor Trace implementation

without comments

Modern Intel Core CPUs (5th and 6th generation) have a Intel Processor Trace (PT) feature to trace branch execution with low overhead. This is useful for performance analysis and debugging.

simple-pt is a simple standalone driver and decoder tool to implement PT on Linux.

Starting with Linux 4.1 Linux already has a integrated PT implementation in perf (see ). simple-pt is an alternative implementation. It has many disadvantages over the perf PT implementation, such as:
- needs to run as root
- no long term tracing or sampling with interrupts
- no support for interactive debugging (use gdb 7.10 on perf for that)
- no support for histograms
- somewhat experimental
- not as well supported as perf

On the positive side simple-pt is:
- simple
- standalone. No kernel changes needed. Could be ported to older kernels or other operating systems
- easy to modify and experiment with
- more ftrace like decoding tool
- support for kprobes based triggers
- modular “unix style” design with simple tools that do only one thing each
- BSD licensed

Example output:

        % sptcmd  -c tcall taskset -c 0 ./tcall
        cpu   0 offset 1027688,  1003 KB, writing to ptout.0
        Wrote sideband to ptout.sideband
        % sptdecode --sideband ptout.sideband --pt ptout.0 | less
        frequency 32
        0        [+0]     [+   1] _dl_aux_init+436
                          [+   6] __libc_start_main+455 -> _dl_discover_osversion
                          [+  13] __libc_start_main+446 -> main
                          [+   9]     main+22 -> f1
                          [+   4]             f1+9 -> f2
                          [+   2]             f1+19 -> f2
                          [+   5]     main+22 -> f1
                          [+   4]             f1+9 -> f2
                          [+   2]             f1+19 -> f2
                          [+   5]     main+22 -> f1

Available from

Written by therapsid

August 17th, 2015 at 4:27 am

Posted in kernel,pt

Generating Flame graphs with Processor Trace

with 2 comments

How to generate a FlameGraph with Processor Trace. Everybody loves Flame Graphs.

Processor trace allows to do as very exact histograms of a program’s run time. Normal sampling has shadow effects, which can hide some details. Processor traces every branch, so it can be much more accurate than normal sampling.

You need a Intel Broadwell or Skylake CPU.
Running at 4.1 or later Linux kernel where perf supports PT.
You can verify the kernel supports pt with

ls /sys/devices/intel_pt

You need perf user tools built from
(this should soon be fixed when the user tools code is merged into Linux mainline)

Build perf with PT support

# set up https_proxy as needed
git clone
cd linux-perf/tools/perf

Copy the resulting perf binary to where you want to run it

Get the flamegraph code

git clone

Collect data from the workload. Best to not collect too long traces as they take much longer to process and may need too much disk space.

perf record -e intel_pt// workload (or -a sleep 1 to collect 1s globally)

Decode the data. This may take quite some time

perf script --itrace=i100usg | /path/to/FlameGraph/ | > workload.folded

The i100us means the trace decoder samples an instruction every 100us. This can be made more accurate (down to 1ns), at the cost of longer decoding time. The ‘g’ tells the decoder to add callgraphs.

Then generate the Flamegraph with

/path/to/FlameGraph/ workloaded.folded > workload.svg

Then view the resulting SVG in a SVG viewer, such as google chrome

google-chrome workload.svg

It is possible to click around.

Here’s a larger svg example from a gcc build (2.5MB). May need chrome or firefox to view.

In principle the trace also has support for more information not in normal sampling, such as determining the exact run time of individual functions from the trace. This is unfortunately not (yet?) supported by the Flame Graph tools.

Written by therapsid

August 5th, 2015 at 11:13 pm