Archive for the ‘performance’ Category

A primer on Processor Trace timing

Processor Trace can be quite useful for understanding the timing breakdown of programs. This post describes how the timing works and how to get the best timing resolution.

Processor Trace logs are packetized. A trace uses three kinds of timing packets: PSB TSC updates, MTC and CYC. PSB is the basic synchronization point of the trace and provides a full TSC time stamp, but it is quite coarse grained. The MTC packet provides regular timing updates from the 24/25 MHz Always Running Timer (ART). Finally, there is an optional cycle accurate mode that gives cycle accurate updates relative to the last MTC packet.

Note that Broadwell Processor Trace used a different, much less accurate method that relies only on the PSB timing updates. For accurate timing please use Skylake or later, or Goldmont Atom or later.

By default Linux perf uses an MTC period of 3, which means there is a timing update every 2^3 = 8 ART ticks, so roughly every 300ns. That is quite coarse grained, and usually not fine grained enough for a program time breakdown. It is possible to increase the timing resolution at the cost of some trace overhead, longer decoding time, and a risk of additional data loss.
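The arithmetic behind these numbers can be sketched in a few lines of shell. This is only an illustration: the helper name mtc_interval_ns is made up here and is not part of perf, and it assumes, as described above, that an MTC update arrives every 2^mtc_period ticks of a 24 or 25 MHz ART.

```shell
# Hypothetical helper, not part of perf: estimate the time between two
# MTC packets in nanoseconds.
# usage: mtc_interval_ns MTC_PERIOD ART_HZ
mtc_interval_ns() {
    # 2^MTC_PERIOD ART ticks per update, converted to ns (integer division).
    echo $(( (1 << $1) * 1000000000 / $2 ))
}

mtc_interval_ns 3 24000000   # perf default period on a 24 MHz ART
mtc_interval_ns 0 24000000   # mtc_period=0: one update per ART tick
```

With a 24 MHz ART this prints 333 for the default period and 41 for mtc_period=0, which shows how much resolution the smaller period buys before cycle accurate mode is even enabled.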

We can set the MTC period to 0 to get more frequent MTC updates. In addition, perf script only reports microseconds by default, so we should use the --ns option to force nanosecond time stamp output:

perf record -e intel_pt/mtc_period=0/ -a ...
perf script --ns --insn-trace --xed

Now in addition to that we can enable cycle accurate mode to get better resolution (at the cost of a significantly larger trace):

perf record -e intel_pt/mtc_period=0,cyc/ -a ...
perf script --ns --insn-trace --xed

When the resulting trace is too big it is also possible to use the cyc_thresh=N option, which configures cycle packets to be sent only every 2^N cycles. This can be useful if the fully accurate trace causes too much data loss.

When looking at the output we still only get updates roughly every 5 conditional branches or returns. This is related to how PT encodes the branches into packets. Cycle updates are only written to the trace when the CPU writes a packet for a branch, or for another reason. Conditional branches and returns are encoded into TNT packets, and a TNT packet is always filled with 5 conditionals or returns before it is written.

For returns we can disable the ‘return compression’ which leads to a guaranteed packet at every return, so at least we get a time stamp at the end of every function. Again this comes at significant additional trace overhead.

perf record -e intel_pt/noretcomp,cyc,mtc_period=0/ ...
perf script --ns --insn-trace --xed

With this we can see timing updates every 5 conditional branches and at every return. But often we want to time complete functions. Unfortunately function calls are usually direct, and direct calls are not logged by PT because their target can be statically determined by the PT decoder. One trick is to call the function through an indirect pointer, which results in a TIP packet, and therefore a timing update at the beginning of the function. However that requires modifying the code.

There is another trick that we can use to time individual functions. PT has address filters to enable/disable the trace. We can enable PT at the beginning of the function and disable it at the end. Disable/Enable has accurate time stamps, so that’s good enough to time the function:

perf record -e intel_pt/cyc,mtc_period=0,noretcomp/ --filter 'filter myfunc @ /path/to/executable' executable
perf script --ns --insn-trace --xed

The CPU supports up to two address filter regions, so this trick works for two functions at a time. We could time more by switching the filter regions over time. It can be used to time smaller program regions too, but they have to be entered and exited by a branch.
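Once the first and last time stamp of one traced region have been read off the perf script --ns output, the duration is a simple subtraction. The following is a minimal sketch with made-up helper names (pt_ts_to_ns and pt_duration_ns are not perf commands). It assumes the time stamps have the seconds.nanoseconds form with nine fractional digits that --ns prints, and the example values are invented for illustration.

```shell
# Hypothetical helpers, not part of perf.

# Convert a "seconds.nanoseconds" time stamp (nine fractional digits)
# to integer nanoseconds.
pt_ts_to_ns() {
    sec=${1%.*}
    frac=${1#*.}
    # Prefix the fraction with 1 so that leading zeros are not parsed
    # as octal, then subtract the extra 10^9 again.
    echo $(( $sec * 1000000000 + 1$frac - 1000000000 ))
}

# usage: pt_duration_ns START END
pt_duration_ns() {
    echo $(( $(pt_ts_to_ns "$2") - $(pt_ts_to_ns "$1") ))
}

# Invented example time stamps for one enable/disable pair.
pt_duration_ns 1234.000000100 1234.000000350
```

For the invented time stamps above this prints 250, i.e. the region ran for 250ns. Working in integer nanoseconds avoids the rounding that floating point tools would introduce for large absolute time stamps.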

[Based on an earlier writeup from Adrian Hunter]

Written by therapsid

July 28th, 2020 at 12:36 am

Posted in monitoring,performance,pt,tuning

Cheat sheet for Intel Processor Trace with Linux perf and gdb

What is Processor Trace

Intel Processor Trace (PT) traces program execution (every branch) with low overhead.

This is a cheat sheet of how to use PT with perf for common tasks.

It is not a full introduction to PT. Please read Adding PT to Linux perf or the links from the general PT reference page.

PT support in hardware

CPU	Support
Broadwell (5th generation Core, Xeon v4)	More overhead. No fine grained timing.
Skylake (6th generation Core, Xeon v5)	Fine grained timing. Address filtering.
Goldmont (Apollo Lake, Denverton)	Fine grained timing. Address filtering.

PT support in Linux

PT is supported in Linux perf, which is integrated in the Linux kernel.
It can be used through the “perf” command or through gdb.

There are also other tools that support PT: VTune, simple-pt, gdb, JTAG debuggers.

In general it is best to use the newest kernel and the newest Linux perf tools. If that is not possible, older tools and kernels can be used. Newer tools can be used on an older kernel, but may not support all features.

Linux version	Support
Linux 4.1	Initial PT driver
Linux 4.2	Support for Skylake and Goldmont
Linux 4.3	Initial user tools support in Linux perf
Linux 4.5	Support for JIT decoding using agent
Linux 4.6	Bug fixes. Support address filtering.
Linux 4.8	Bug fixes.
Linux 4.10	Bug fixes. Support for PTWRITE and power tracing

Many commands require recent perf tools. You may need to update them from a recent kernel tree.

This article covers mainly Linux perf and briefly gdb.

Preparations

Only needed once.

Allow seeing kernel symbols (as root)

echo 'kernel.kptr_restrict=0' >> /etc/sysctl.conf
sysctl -p

Basic perf command lines for recording PT

ls /sys/devices/intel_pt/format

Check if PT is supported and which capabilities are available.

perf record -e intel_pt// program

Trace program

perf record -e intel_pt// -a sleep 1

Trace whole system for 1 second

perf record -C 0 -e intel_pt// -a sleep 1

Trace CPU 0 for 1 second

perf record --pid $(pidof program) -e intel_pt//

Trace already running program.

perf has to save the data to disk. The CPU can execute branches much faster than the disk can keep up, so there will be some data loss for code that executes
many instructions. perf has no way to slow down the CPU, so when trace bandwidth > disk bandwidth there will be gaps in the trace. Because of this it is usually not a good idea
to try to save a long trace. It is better to work with shorter traces. Long traces also take a lot of time to decode.

When decoding kernel data the decoder usually has to run as root.
An alternative is to use the perf-with-kcore.sh script included with perf.

perf script --ns --itrace=cr

Record program execution and display function call graph.

perf script by default “samples” the data (only dumps a sample every 100us).
This can be configured using the --itrace option (see reference below).

Install xed first.

perf script --itrace=i0ns --ns -F time,pid,comm,sym,symoff,insn,ip | xed -F insn: -S /proc/kallsyms -64

Show every assembly instruction executed with disassembler.

For this it is also useful to get more accurate time stamps (see below)

perf script --itrace=i0ns --ns -F time,sym,srcline,ip

Show source lines executed (requires debug information)

perf script --itrace=s1Mi0ns ....

Often initialization code is not interesting. Skip initial 1M instructions while decoding.

perf script --time 1.000,2.000 ...

Slice trace into different time regions. Generally the time stamps need to be looked up first in the trace, as they are absolute.

perf report --itrace=g32l64i100us --branch-history

Print hot paths every 100us as call graph histograms

Install Flame graph tools first.

perf script --itrace=i100usg | stackcollapse-perf.pl > workload.folded
flamegraph.pl workload.folded > workload.svg
google-chrome workload.svg

Generate flame graph from execution, sampled every 100us

Other ways to record data

perf record -a -e intel_pt// sleep 1

Capture whole system for 1 second

Use snapshot mode

This collects data, but does not continuously save it all to disk. When an event of interest happens a data dump of the current buffer can be triggered by sending a SIGUSR2 signal to the perf process.

perf record -a --snapshot -e intel_pt// &
PERF_PID=$!
*execute workload*

*event happens*
kill -USR2 $PERF_PID

*end of recording*
kill $PERF_PID

Record kernel only, complete system

perf record -a -e intel_pt//k sleep 1

Record user space only, complete system

perf record -a -e intel_pt//u

Enable fine grained timing (needs Skylake/Goldmont, adds more overhead)

perf record -a -e intel_pt/cyc=1,cyc_thresh=2/ ...

echo $((100*1024*1024)) > /proc/sys/kernel/perf_event_mlock_kb
perf record -m 512,100000 -e intel_pt// ...

Increase perf buffer to limit data loss
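In -m 512,100000 the first number sizes the regular perf buffer and the second the AUX buffer that holds the PT data, both counted in pages. A rough sketch of what that means in memory follows. The helper name pt_pages_to_kib is made up for illustration, and the 4 KiB page size is an assumption that holds on typical x86 systems.

```shell
# Hypothetical helper, not part of perf: memory requested by an -m page
# count, in KiB. Assumes 4 KiB pages unless a page size in KiB is passed
# as the second argument.
# usage: pt_pages_to_kib PAGES [PAGE_KIB]
pt_pages_to_kib() {
    echo $(( $1 * ${2:-4} ))
}

pt_pages_to_kib 512      # regular perf buffer
pt_pages_to_kib 100000   # AUX buffer for PT data
```

Under the 4 KiB assumption this prints 2048 and 400000, so nearly all of the requested memory goes to the PT buffer. That is the amount the perf_event_mlock_kb limit has to accommodate for non-root users.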

perf record -e intel_pt// --filter 'filter main @ /path/to/program' ...

Only record main function in program

perf record -e intel_pt// -a --filter 'filter sys_write' program

Filter kernel code (needs 4.11+ kernel)

perf record -e intel_pt// -a --filter 'stop func2 @ program' program

Stop tracing at func2.

perf archive
rsync -r ~/.debug perf.data other-system:

Transfer the data so that the trace can be decoded on another system. This may also require using perf-with-kcore.sh if decoding
kernel code.

Using gdb

Requires a new enough gdb built with libipt. For user space only.

gdb program
start
record btrace pt                 # cannot be done before start. has to be redone every re-start
cont

record instruction-history /m    # show instructions
record function-call-history     # show functions executed
reverse-step                     # step backwards in time

For more information on gdb pt see the gdb documentation

References

The perf PT documentation

Reference for --itrace option (from perf documentation)

i    synthesize "instructions" events
b    synthesize "branches" events
x    synthesize "transactions" events
c    synthesize branches events (calls only)
r    synthesize branches events (returns only)
e    synthesize tracing error events
d    create a debug log
g    synthesize a call chain (use with i or x)
l    synthesize last branch entries (use with i or x)
s    skip initial number of events

Reference for --filter option (from perf documentation)

A hardware trace PMU advertises its ability to accept a number of address filters by specifying a non-zero value in /sys/bus/event_source/devices/<pmu>/nr_addr_filters.

Address filters have the format:

filter|start|stop|tracestop <start> [/ <size>] [@<file name>]

Where:
– ‘filter’: defines a region that will be traced.
– ‘start’: defines an address at which tracing will begin.
– ‘stop’: defines an address at which tracing will stop.
– ‘tracestop’: defines a region in which tracing will stop.

<file name> is the name of the object file, <start> is the offset to the
code to trace in that file, and <size> is the size of the region to
trace. ‘start’ and ‘stop’ filters need not specify a <size>.

If no object file is specified then the kernel is assumed, in which case
the start address must be a current kernel memory address.

<start> can also be specified by providing the name of a symbol. If the
symbol name is not unique, it can be disambiguated by inserting #n where
‘n’ selects the n’th symbol in address order. Alternately #0, #g or #G
select only a global symbol. <size> can also be specified by providing
the name of a symbol, in which case the size is calculated to the end
of that symbol. For ‘filter’ and ‘tracestop’ filters, if <size> is
omitted and <start> is a symbol, then the size is calculated to the end
of that symbol.

If <size> is omitted and <start> is ‘*’, then the start and size will
be calculated from the first and last symbols, i.e. to trace the whole
file.
If symbol names (or ‘*’) are provided, they must be surrounded by white
space.

The filter passed to the kernel is not necessarily the same as entered.
To see the filter that is passed, use the -v option.

The kernel may not be able to configure a trace region if it is not
within a single mapping. MMAP events (or /proc/<pid>/maps) can be
examined to determine if that is a possibility.

Multiple filters can be separated with space or comma.
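
For scripts that switch filters between runs, the filter strings used in this cheat sheet can be assembled with a small shell function. This is only a sketch: pt_filter is a made-up name and not a perf command, and it only covers the simple "action symbol [@ file]" form without an explicit size.

```shell
# Hypothetical helper, not part of perf: print an address filter string.
# usage: pt_filter ACTION SYMBOL [OBJECT_FILE]
pt_filter() {
    if [ -n "${3:-}" ]; then
        # Symbol names must be surrounded by white space.
        printf '%s %s @ %s\n' "$1" "$2" "$3"
    else
        # No object file given: the kernel is assumed.
        printf '%s %s\n' "$1" "$2"
    fi
}

pt_filter filter main /path/to/program
pt_filter stop func2 program
pt_filter filter sys_write
```

The result can then be passed on the command line, for example: perf record -e intel_pt// --filter "$(pt_filter filter main /path/to/program)" ...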

v2: Fix some typos/broken links

v3: Update gdb commands

Written by therapsid

April 7th, 2017 at 8:55 pm

Posted in kernel,monitoring,not-really,performance,pt,tuning

Intel Processor Trace resources

Intel Processor Trace (PT) can be used on modern Intel CPUs to trace execution. This page contains references for learning about and using Intel PT.

Basic information:

Intel Software Developer’s Manual Vol 3 low level reference information on Processor Trace trace format and registers (Chapter 36)
Intel Processor Trace on Linux gives an overview of processor trace on Linux
A tutorial web site for PT that contains many references
Intel® Developer Forum 2015: Zoom-in on Your Code with Intel® Processor Trace and Supporting Tools (find SPCS012)
Intel® Developer Forum 2014: Debug and Fine-grain Profiling with Intel® Processor Trace
Efficient and large scale program flow tracing in Linux

Implementations

Adding processor trace to Linux describes the Linux perf Processor trace implementation.
Reference documentation for PT on Linux perf
simple-pt is an alternative reference PT implementation. It is implemented on Linux, but can be also used as a starting point to implement PT on other OS.
The GNU debugger gdb supports PT on Linux for backward debugging
Intel VTune amplifier supports PT for performance analysis
A Windows windbg processor trace plugin for debugging on Windows with PT.
A reference Processor Trace decode library.
A plugin for Linux crash to dump PT buffers (look for ptdump)

JTAG support

The Lauterbach JTAG debugger supports PT
Intel system studio supports JTAG debugging with PT
The SourcePoint for Intel debugger supports PT

Andi Kleen's blog
