What is Processor Trace
Intel Processor Trace (PT) traces program execution (every branch) with low overhead.
This is a cheat sheet of how to use PT with perf for common tasks.
PT support in hardware
|CPU|PT support|
|---|---|
|Broadwell (5th generation Core, Xeon v4)|More overhead. No fine grained timing.|
|Skylake (6th generation Core, Xeon v5)|Fine grained timing. Address filtering.|
|Goldmont (Apollo Lake, Denverton)|Fine grained timing. Address filtering.|
PT support in Linux
PT is supported in Linux perf, which is integrated in the Linux kernel.
It can be used through the “perf” command or through gdb.
In general it is best to use the newest kernel and the newest Linux perf tools. If that is not possible, older tools and kernels can be used. Newer tools can be used on an older kernel, but may not support all features.
|Kernel version|PT support|
|---|---|
|Linux 4.1|Initial PT driver|
|Linux 4.2|Support for Skylake and Goldmont|
|Linux 4.3|Initial user tools support in Linux perf|
|Linux 4.5|Support for JIT decoding using agent|
|Linux 4.6|Bug fixes. Support address filtering.|
|Linux 4.8|Bug fixes.|
|Linux 4.10|Bug fixes. Support for PTWRITE and power tracing|
Many commands require recent perf tools; you may need to update them from a recent kernel tree.
This article covers mainly Linux perf and briefly gdb.
Only needed once.
Allow seeing kernel symbols (as root)
echo 'kernel.kptr_restrict=0' >> /etc/sysctl.conf
Basic perf command lines for recording PT
Check whether PT is supported and which capabilities it has.
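One way to do this check from a shell is sketched here. It assumes the CPU flag appears as `intel_pt` in /proc/cpuinfo and that the kernel driver registers its PMU under /sys/bus/event_source/devices/intel_pt. The function name `pt_supported` and its two optional arguments are only illustrative.

```shell
# Sketch: report whether the CPU and the kernel expose Intel PT.
# Both paths are parameters (with the assumed system defaults) so the
# function can also be pointed at copies of these files.
pt_supported() {
    cpuinfo=${1:-/proc/cpuinfo}
    pmu_dir=${2:-/sys/bus/event_source/devices/intel_pt}

    # Assumption: the CPU advertises PT via an "intel_pt" flag.
    if ! grep -qw intel_pt "$cpuinfo" 2>/dev/null; then
        echo "no intel_pt CPU flag"
        return 1
    fi

    # Assumption: the perf PMU directory exists when the driver is active.
    if [ ! -d "$pmu_dir" ]; then
        echo "CPU flag present, but no intel_pt PMU"
        return 1
    fi

    echo "intel_pt available"
    # Print capability files, if the kernel exports any, as name=value.
    for f in "$pmu_dir"/caps/*; do
        if [ -f "$f" ]; then
            echo "$(basename "$f")=$(cat "$f")"
        fi
    done
    return 0
}
```

Calling `pt_supported` without arguments checks the running system.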
perf record -e intel_pt// program
perf record -e intel_pt// -a sleep 1
Trace whole system for 1 second
perf record -C 0 -e intel_pt// -a sleep 1
Trace CPU 0 for 1 second
perf record --pid $(pidof program) -e intel_pt//
Trace already running program.
perf has to save the data to disk. The CPU can execute branches much faster than the disk can keep up, so there will be some data loss for code that executes many instructions. perf has no way to slow down the CPU, so when trace bandwidth > disk bandwidth there will be gaps in the trace. Because of this it is usually not a good idea to try to save a long trace; work with shorter traces instead. Long traces also take a lot of time to decode.
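To see whether a recorded trace has such gaps, the decoder errors can be counted. This is a minimal sketch, assuming that `perf script --itrace=e` prints one line containing the text "instruction trace error" for each decode error; the exact wording may differ between perf versions. `count_pt_errors` is a hypothetical helper name.

```shell
# Sketch: count decoder error lines read from standard input.
# Assumption: each error line contains "instruction trace error".
count_pt_errors() {
    # grep -c prints 0 and returns non-zero when nothing matches,
    # so the exit status is normalized with "|| true".
    grep -c 'instruction trace error' || true
}
```

Possible usage: `perf script --itrace=e | count_pt_errors`. A non-zero count suggests recording a shorter trace or using a larger buffer.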
When decoding kernel data the decoder usually has to run as root.
An alternative is to use the perf-with-kcore.sh script included with perf.
perf script --ns --itrace=cr
Record program execution and display function call graph.
perf script by default “samples” the data (only dumps a sample every 100us).
This can be configured using the --itrace option (see reference below).
Install xed first.
perf script --itrace=i0ns --ns -F time,pid,comm,sym,symoff,insn,ip | xed -F insn: -S /proc/kallsyms -64
Show every assembly instruction executed with disassembler.
For this it is also useful to get more accurate time stamps (see below).
perf script --itrace=i0ns --ns -F time,sym,srcline,ip
Show source lines executed (requires debug information)
perf script --itrace=s1Mi0ns ....
Often initialization code is not interesting. Skip initial 1M instructions while decoding:
perf script --time 1.000,2.000 ...
Slice the trace into different time regions. Generally the time stamps need to be looked up first in the trace, as they are absolute.
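Because the time stamps are absolute, a window is usually computed relative to the first time stamp in the trace. The helper below is a sketch of that arithmetic; `time_window` is a hypothetical name, and the first time stamp has to be looked up separately.

```shell
# Sketch: turn an offset from the first time stamp into an absolute window.
# Usage: time_window FIRST_TS OFFSET_SECONDS LENGTH_SECONDS
# Prints "start,stop" in the form expected by perf script --time.
time_window() {
    awk -v first="$1" -v off="$2" -v len="$3" 'BEGIN {
        start = first + off
        printf "%.6f,%.6f\n", start, start + len
    }'
}
```

For example, if the trace starts at 100.5, `perf script --time "$(time_window 100.5 1 2)" ...` would select the region from 1 to 3 seconds after the start.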
perf report --itrace=g32l64i100us --branch-history
Print hot paths every 100us as call graph histograms
Install Flame graph tools first.
perf script --itrace=i100usg | stackcollapse-perf.pl > workload.folded
flamegraph.pl workload.folded > workload.svg
Generate flame graph from execution, sampled every 100us
Other ways to record data
perf record -a -e intel_pt// sleep 1
Capture whole system for 1 second
Use snapshot mode
This collects data, but does not continuously save it all to disk. When an event of interest happens a data dump of the current buffer can be triggered by sending a SIGUSR2 signal to the perf process.
perf record -a --snapshot -e intel_pt// sleep 1 &
PERF_PID=$!
kill -USR2 $PERF_PID
*end of recording*
Record kernel only, complete system
perf record -a -e intel_pt//k sleep 1
Record user space only, complete system
perf record -a -e intel_pt//u
Enable fine grained timing (needs Skylake/Goldmont, adds more overhead)
perf record -a -e intel_pt/cyc=1,cyc_thresh=2/ ...
echo $((100*1024*1024)) > /proc/sys/kernel/perf_event_mlock_kb
perf record -m 512,100000 -e intel_pt// ...
Increase perf buffer to limit data loss
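The second value given to -m is the size of the trace (AUX) buffer in pages. perf documents mmap page counts as powers of two, so it can help to compute one explicitly. This is a sketch assuming 4 KiB pages by default; `aux_pages_for_mib` is a hypothetical helper name.

```shell
# Sketch: smallest power-of-two page count that covers SIZE_MIB mebibytes.
# Usage: aux_pages_for_mib SIZE_MIB [PAGE_BYTES]
# Assumption: pages are 4096 bytes unless PAGE_BYTES is given.
aux_pages_for_mib() {
    mib=$1
    page_bytes=${2:-4096}
    # Pages needed, rounded up.
    want=$(( (mib * 1024 * 1024 + page_bytes - 1) / page_bytes ))
    pages=1
    while [ "$pages" -lt "$want" ]; do
        pages=$(( pages * 2 ))
    done
    echo "$pages"
}
```

Possible usage: `perf record -m 512,$(aux_pages_for_mib 64) -e intel_pt// ...` requests a 64 MiB trace buffer.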
perf record -e intel_pt// --filter 'filter main @ /path/to/program' ...
Only record main function in program
perf record -e intel_pt// -a --filter 'filter sys_write' program
Filter kernel code (needs 4.11+ kernel)
perf record -e intel_pt// -a --filter 'start func1 @ program' --filter 'stop func2 @ program' program
Start tracing in program at func1 and stop tracing at func2.
rsync -r ~/.debug perf.data other-system:
Transfer the trace data to another system for decoding. May also require using perf-with-kcore.sh if decoding kernel data.
Requires a new enough gdb built with libipt. For user space only.
record btrace pt
record instruction-history /m # show instructions
record function-history # show functions executed
reverse-step # step backwards in time
For more information on gdb pt see the gdb documentation.
Reference for --itrace option (from perf documentation)
i synthesize "instructions" events
b synthesize "branches" events
x synthesize "transactions" events
c synthesize branches events (calls only)
r synthesize branches events (returns only)
e synthesize tracing error events
d create a debug log
g synthesize a call chain (use with i or x)
l synthesize last branch entries (use with i or x)
s skip initial number of events
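As an illustration of how these letters combine, the helper below builds option strings such as the `g32l64i100us` used with perf report earlier. It is a sketch that assumes, as in that example, that a number after g sets the call chain size, a number after l sets the last branch size, and the value after i is the sampling period. `itrace_sampled` is a hypothetical name.

```shell
# Sketch: build an --itrace value for sampled "instructions" events.
# Usage: itrace_sampled PERIOD [CALLCHAIN_SIZE] [LAST_BRANCH_SIZE]
itrace_sampled() {
    period=$1
    chain=${2:-}
    lbr=${3:-}
    opts=""
    # g and l are only meaningful together with i (or x).
    if [ -n "$chain" ]; then opts="${opts}g${chain}"; fi
    if [ -n "$lbr" ]; then opts="${opts}l${lbr}"; fi
    echo "${opts}i${period}"
}
```

Possible usage: `perf report --itrace="$(itrace_sampled 100us 32 64)" --branch-history`.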
Reference for --filter option (from perf documentation)
A hardware trace PMU advertises its ability to accept a number of
address filters by specifying a non-zero value in
/sys/bus/event_source/devices/<pmu>/nr_addr_filters.
Address filters have the format:
filter|start|stop|tracestop <start> [/ <size>] [@<file name>]
- 'filter': defines a region that will be traced.
- 'start': defines an address at which tracing will begin.
- 'stop': defines an address at which tracing will stop.
- 'tracestop': defines a region in which tracing will stop.
<file name> is the name of the object file, <start> is the offset to the
code to trace in that file, and <size> is the size of the region to
trace. 'start' and 'stop' filters need not specify a <size>.
If no object file is specified then the kernel is assumed, in which case
the start address must be a current kernel memory address.
<start> can also be specified by providing the name of a symbol. If the
symbol name is not unique, it can be disambiguated by inserting #n where
'n' selects the n'th symbol in address order. Alternately #0, #g or #G
select only a global symbol. <size> can also be specified by providing
the name of a symbol, in which case the size is calculated to the end
of that symbol. For 'filter' and 'tracestop' filters, if <size> is
omitted and <start> is a symbol, then the size is calculated to the end
of that symbol.
If <size> is omitted and <start> is '*', then the start and size will
be calculated from the first and last symbols, i.e. to trace the whole
file.
If symbol names (or '*') are provided, they must be surrounded by white
space.
The filter passed to the kernel is not necessarily the same as entered.
To see the filter that is passed, use the -v option.
The kernel may not be able to configure a trace region if it is not
within a single mapping. MMAP events (or /proc/<pid>/maps) can be
examined to determine if that is a possibility.
Multiple filters can be separated with space or comma.
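The filter syntax can be composed mechanically. The helper below is a sketch that emits one filter in the form used by the examples earlier (`filter main @ /path/to/program`). `pt_filter` is a hypothetical name, and only the four documented actions are accepted.

```shell
# Sketch: compose one address filter string.
# Usage: pt_filter ACTION START [OBJECT_FILE]
pt_filter() {
    case $1 in
        filter|start|stop|tracestop) ;;
        *) echo "unknown filter action: $1" >&2; return 1 ;;
    esac
    if [ -n "${3:-}" ]; then
        # Symbol names are surrounded by white space, then "@ file".
        printf '%s %s @ %s\n' "$1" "$2" "$3"
    else
        # No object file given: the kernel is assumed.
        printf '%s %s\n' "$1" "$2"
    fi
}
```

Possible usage: `perf record -e intel_pt// --filter "$(pt_filter filter main /path/to/program)" ...`.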
v2: Fix some typos/broken links
Intel Processor Trace (PT) can be used on modern Intel CPUs to trace execution. This page contains references for learning about and using Intel PT.
- Intel Software Developer’s Manual Vol 3: low-level reference information on the Processor Trace format and registers (Chapter 36)
- Intel Processor Trace on Linux gives an overview of processor trace on Linux
- A tutorial web site for PT that contains many references
- Intel® Developer Forum 2015: Zoom-in on Your Code with Intel® Processor Trace and Supporting Tools (find SPCS012)
- Intel® Developer Forum 2014: Debug and Fine-grain Profiling with Intel® Processor Trace
- Efficient and large scale program flow tracing in Linux
- Adding processor trace to Linux describes the Linux perf Processor trace implementation.
- Reference documentation for PT on Linux perf
- simple-pt is an alternative reference PT implementation. It is implemented on Linux, but can also be used as a starting point to implement PT on other OS.
- The GNU debugger gdb supports PT on Linux for backward debugging
- Intel VTune Amplifier supports PT for performance analysis
- A Windows windbg processor trace plugin for debugging on Windows with PT.
- A reference Processor Trace decode library.
- A plugin for Linux crash to dump PT buffers (look for ptdump)
- The Lauterbach JTAG debugger supports PT
- Intel System Studio supports JTAG debugging with PT
- The SourcePoint for Intel debugger supports PT
- The honggfuzz fuzzer supports feedback fuzzing using PT
- Harnessing Intel Processor Trace on Windows for vulnerability discovery
Research papers using PT (subset):
- Failure Sketching: A technique for automated root cause analysis of in production failures
- Griffin: Guarding control flows using Intel Processor Trace
- Hardware-assisted instruction profiling and latency detection
- Inspector: Data Provenance using Intel Processor Trace
- Transparent and efficient CFI enforcement with Intel Processor Trace