{"id":410,"date":"2017-04-07T20:55:07","date_gmt":"2017-04-07T20:55:07","guid":{"rendered":"http:\/\/halobates.de\/blog\/?p=410"},"modified":"2024-07-23T20:24:38","modified_gmt":"2024-07-23T20:24:38","slug":"cheat-sheet-for-intel-processor-trace-with-linux-perf-and-gdb","status":"publish","type":"post","link":"http:\/\/halobates.de\/blog\/p\/410","title":{"rendered":"Cheat sheet for Intel Processor Trace with Linux perf and gdb"},"content":{"rendered":"<h1>What is Processor Trace<\/h1>\n<p>Intel Processor Trace (PT) traces program execution (every branch) with low overhead.<\/p>\n<p>This is a cheat sheet on how to use PT with perf for common tasks.<\/p>\n<p>It is not a full introduction to PT. Please read <a href=\"https:\/\/lwn.net\/Articles\/648154\/\">Adding PT to Linux perf<\/a> or the links from the general <a href=\"http:\/\/halobates.de\/blog\/p\/406\">PT reference page<\/a>.<\/p>\n<h1>PT support in hardware<\/h1>\n<table>\n<tbody>\n<tr>\n<td>CPU<\/td>\n<td>Support<\/td>\n<\/tr>\n<tr>\n<td>Broadwell (5th generation Core, Xeon v4)<\/td>\n<td>More overhead. No fine grained timing.<\/td>\n<\/tr>\n<tr>\n<td>Skylake (6th generation Core, Xeon v5)<\/td>\n<td>Fine grained timing. Address filtering.<\/td>\n<\/tr>\n<tr>\n<td>Goldmont (Apollo Lake, Denverton)<\/td>\n<td>Fine grained timing. Address filtering.<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h1>PT support in Linux<\/h1>\n<p>PT is supported in Linux perf, which is integrated in the Linux kernel.<br \/>\nIt can be used through the &#8220;perf&#8221; command or through gdb.<\/p>\n<p>There are also other tools that support PT: <a href=\"https:\/\/software.intel.com\/en-us\/intel-vtune-amplifier-xe\">VTune<\/a>, <a href=\"https:\/\/github.com\/andikleen\/simple-pt\">simple-pt<\/a>, gdb, and JTAG debuggers.<\/p>\n<p>In general, it is best to use the newest kernel and the newest Linux perf tools. If that is not possible, older tools and kernels can be used. 
Newer tools can be used on an older kernel, but may not support all features.<\/p>\n<table>\n<tbody>\n<tr>\n<td>Linux version<\/td>\n<td>Support<\/td>\n<\/tr>\n<tr>\n<td>Linux 4.1<\/td>\n<td>Initial PT driver<\/td>\n<\/tr>\n<tr>\n<td>Linux 4.2<\/td>\n<td>Support for Skylake and Goldmont<\/td>\n<\/tr>\n<tr>\n<td>Linux 4.3<\/td>\n<td>Initial user tools support in Linux perf<\/td>\n<\/tr>\n<tr>\n<td>Linux 4.5<\/td>\n<td>Support for JIT decoding using agent<\/td>\n<\/tr>\n<tr>\n<td>Linux 4.6<\/td>\n<td>Bug fixes. Support address filtering.<\/td>\n<\/tr>\n<tr>\n<td>Linux 4.8<\/td>\n<td>Bug fixes.<\/td>\n<\/tr>\n<tr>\n<td>Linux 4.10<\/td>\n<td>Bug fixes. Support for PTWRITE and power tracing<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Many commands require recent perf tools; you may need to update them from a recent kernel tree.<\/p>\n<p>This article covers mainly Linux perf and briefly gdb.<\/p>\n<h1>Preparations<\/h1>\n<p>Only needed once.<\/p>\n<p>Allow seeing kernel symbols (as root)<\/p>\n<p><code><br \/>\necho 'kernel.kptr_restrict=0' &gt;&gt; \/etc\/sysctl.conf<br \/>\nsysctl -p<br \/>\n<\/code><\/p>\n<h1>Basic perf command lines for recording PT<\/h1>\n<p><code>ls \/sys\/devices\/intel_pt\/format<\/code><\/p>\n<p>Check whether PT is supported and which capabilities are available.<\/p>\n<p><code>perf record -e intel_pt\/\/ program<\/code><\/p>\n<p>Trace program<\/p>\n<p><code>perf record -e intel_pt\/\/ -a sleep 1<\/code><\/p>\n<p>Trace whole system for 1 second<\/p>\n<p><code>perf record -C 0 -e intel_pt\/\/ -a sleep 1<\/code><\/p>\n<p>Trace CPU 0 for 1 second<\/p>\n<p><code>perf record --pid $(pidof program) -e intel_pt\/\/<\/code><\/p>\n<p>Trace already running program.<\/p>\n<p>perf has to save the data to disk. The CPU can execute branches much faster than the disk can keep up, so there will be some data loss for code that executes<br \/>\nmany instructions. perf has no way to slow down the CPU, so when the trace bandwidth exceeds the disk bandwidth there will be gaps in the trace. 
Because of this, it is usually better to work with<br \/>\nshort traces than to try to save a long one. Long traces also take a long time to decode.<\/p>\n<p>When decoding kernel data, the decoder usually has to run as root.<br \/>\nAn alternative is to use the perf-with-kcore.sh script included with perf.<\/p>\n<p><code>perf script --ns --itrace=cr<\/code><\/p>\n<p>Display the function calls and returns of the recorded program execution.<\/p>\n<p>perf script by default &#8220;samples&#8221; the data (only dumps a sample every 100us).<br \/>\nThis can be configured using the --itrace option (see reference below).<\/p>\n<p>Install <a href=\"https:\/\/github.com\/intelxed\/xed\">xed<\/a> first.<\/p>\n<p><code>perf script --itrace=i0ns --ns -F time,pid,comm,sym,symoff,insn,ip | xed -F insn: -S \/proc\/kallsyms -64<\/code><\/p>\n<p>Show every executed assembly instruction, decoded with the disassembler.<\/p>\n<p>For this it is also useful to get more accurate time stamps (see below).<\/p>\n<p><code>perf script --itrace=i0ns --ns -F time,sym,srcline,ip<\/code><\/p>\n<p>Show source lines executed (requires debug information)<\/p>\n<p><code>perf script --itrace=s1Mi0ns ...<\/code><\/p>\n<p>Often initialization code is not interesting. 
Skip the initial 1M instructions while decoding.<\/p>\n<p><code>perf script --time 1.000,2.000 ...<\/code><\/p>\n<p>Slice the trace into different time regions. Generally, the time stamps need to be looked up in the trace first, as they are absolute.<\/p>\n<p><code>perf report --itrace=g32l64i100us --branch-history<\/code><\/p>\n<p>Print hot paths every 100us as call graph histograms<\/p>\n<p>Install <a href=\"https:\/\/github.com\/brendangregg\/FlameGraph\">Flame graph tools<\/a> first.<\/p>\n<p><code><br \/>\nperf script --itrace=i100usg | stackcollapse-perf.pl &gt; workload.folded<br \/>\nflamegraph.pl workload.folded &gt; workload.svg<br \/>\ngoogle-chrome workload.svg<br \/>\n<\/code><\/p>\n<p>Generate flame graph from execution, sampled every 100us<\/p>\n<h1>Other ways to record data<\/h1>\n<p><code>perf record -a -e intel_pt\/\/ sleep 1<\/code><\/p>\n<p>Capture whole system for 1 second<\/p>\n<p>Use snapshot mode<\/p>\n<p>This collects data, but does not continuously save it all to disk. 
When an event of interest happens, a data dump of the current buffer can be triggered by sending a SIGUSR2 signal to the perf process.<\/p>\n<p><code><br \/>\nperf record -a --snapshot -e intel_pt\/\/ &amp;<br \/>\nPERF_PID=$!<br \/>\n*execute workload*<\/code><\/p>\n<p><code> <\/code><\/p>\n<p><code>*event happens*<br \/>\nkill -USR2 $PERF_PID<\/code><\/p>\n<p><code> <\/code><\/p>\n<p><code>*end of recording*<br \/>\nkill $PERF_PID<br \/>\n<\/code><\/p>\n<p>Record kernel only, complete system<\/p>\n<p><code>perf record -a -e intel_pt\/\/k sleep 1<\/code><\/p>\n<p>Record user space only, complete system<\/p>\n<p><code>perf record -a -e intel_pt\/\/u<\/code><\/p>\n<p>Enable fine grained timing (needs Skylake\/Goldmont, adds more overhead)<\/p>\n<p><code>perf record -a -e intel_pt\/cyc=1,cyc_thresh=2\/ ...<\/code><\/p>\n<p><code><br \/>\necho $((100*1024*1024)) &gt; \/proc\/sys\/kernel\/perf_event_mlock_kb<br \/>\nperf record -m 512,100000 -e intel_pt\/\/ ...<br \/>\n<\/code><\/p>\n<p>Increase perf buffer to limit data loss<\/p>\n<p><code><br \/>\nperf record -e intel_pt\/\/ --filter 'filter main @ \/path\/to\/program' ...<br \/>\n<\/code><\/p>\n<p>Only record main function in program<\/p>\n<p><code><br \/>\nperf record -e intel_pt\/\/ -a --filter 'filter sys_write' program<br \/>\n<\/code><\/p>\n<p>Filter kernel code (needs 4.11+ kernel)<\/p>\n<p><code><br \/>\nperf record -e intel_pt\/\/ -a --filter 'stop func2 @ program' program<br \/>\n<\/code><\/p>\n<p>Stop tracing at func2.<\/p>\n<p><code><br \/>\nperf archive<br \/>\nrsync -r ~\/.debug perf.data other-system:<br \/>\n<\/code><\/p>\n<p>Transfer the trace data to another system. This may also require using perf-with-kcore.sh when decoding<br \/>\nkernel code.<\/p>\n<h1>Using gdb<\/h1>\n<p>Requires a new enough gdb built with libipt. For user space only.<\/p>\n<p><code><br \/>\ngdb program<br \/>\nstart<br \/>\nrecord btrace pt # cannot be done before start. 
has to be redone after every restart<br \/>\ncont<\/code><\/p>\n<p><code><br \/>\n<\/code><\/p>\n<p><code>record instruction-history \/m\t# show instructions<br \/>\nrecord function-call-history\t\t# show functions executed<br \/>\nreverse-step\t# step backwards in time<br \/>\n<\/code><\/p>\n<p>For more information on PT support in gdb, see the <a href=\"https:\/\/sourceware.org\/gdb\/onlinedocs\/gdb\/Process-Record-and-Replay.html\">gdb documentation<\/a>.<\/p>\n<h1>References<\/h1>\n<p>The <a href=\"https:\/\/git.kernel.org\/cgit\/linux\/kernel\/git\/torvalds\/linux.git\/tree\/tools\/perf\/Documentation\/intel-pt.txt\">perf PT documentation<\/a><\/p>\n<p>Reference for --itrace option (from perf documentation)<\/p>\n<p><code><br \/>\ni       synthesize \"instructions\" events<br \/>\nb       synthesize \"branches\" events<br \/>\nx       synthesize \"transactions\" events<br \/>\nc       synthesize branches events (calls only)<br \/>\nr       synthesize branches events (returns only)<br \/>\ne       synthesize tracing error events<br \/>\nd       create a debug log<br \/>\ng       synthesize a call chain (use with i or x)<br \/>\nl       synthesize last branch entries (use with i or x)<br \/>\ns       skip initial number of events<br \/>\n<\/code><\/p>\n<p>Reference for --filter option (from perf documentation)<\/p>\n<p><code>A hardware trace PMU advertises its ability to accept a number of<br \/>\naddress filters by specifying a non-zero value in<br \/>\n\/sys\/bus\/event_source\/devices\/&lt;pmu&gt;\/nr_addr_filters.<\/code><\/p>\n<p><code>Address filters have the format:<\/code><\/p>\n<p>filter|start|stop|tracestop &lt;start&gt; [\/ &lt;size&gt;] [@&lt;file name&gt;]<\/p>\n<p>Where:<br \/>\n&#8211; &#8216;filter&#8217;: defines a region that will be traced.<br \/>\n&#8211; &#8216;start&#8217;: defines an address at which tracing will begin.<br \/>\n&#8211; &#8216;stop&#8217;: defines an address at which 
tracing will stop.<br \/>\n&#8211; &#8216;tracestop&#8217;: defines a region in which tracing will stop.<\/p>\n<p>&lt;file name&gt; is the name of the object file, &lt;start&gt; is the offset to the<br \/>\ncode to trace in that file, and &lt;size&gt; is the size of the region to<br \/>\ntrace. &#8216;start&#8217; and &#8216;stop&#8217; filters need not specify a &lt;size&gt;.<\/p>\n<p>If no object file is specified then the kernel is assumed, in which case<br \/>\nthe start address must be a current kernel memory address.<\/p>\n<p>&lt;start&gt; can also be specified by providing the name of a symbol. If the<br \/>\nsymbol name is not unique, it can be disambiguated by inserting #n where<br \/>\n&#8216;n&#8217; selects the n&#8217;th symbol in address order. Alternately #0, #g or #G<br \/>\nselect only a global symbol. &lt;size&gt; can also be specified by providing<br \/>\nthe name of a symbol, in which case the size is calculated to the end<br \/>\nof that symbol. For &#8216;filter&#8217; and &#8216;tracestop&#8217; filters, if &lt;size&gt; is<br \/>\nomitted and &lt;start&gt; is a symbol, then the size is calculated to the end<br \/>\nof that symbol.<\/p>\n<p>If &lt;size&gt; is omitted and &lt;start&gt; is &#8216;*&#8217;, then the start and size will<br \/>\nbe calculated from the first and last symbols, i.e. to trace the whole<br \/>\nfile.<br \/>\nIf symbol names (or &#8216;*&#8217;) are provided, they must be surrounded by white<br \/>\nspace.<\/p>\n<p>The filter passed to the kernel is not necessarily the same as entered.<br \/>\nTo see the filter that is passed, use the -v option.<\/p>\n<p>The kernel may not be able to configure a trace region if it is not<br \/>\nwithin a single mapping. 
MMAP events (or \/proc\/&lt;pid&gt;\/maps) can be<br \/>\nexamined to determine if that is a possibility.<\/p>\n<p><code>Multiple filters can be separated with space or comma.<br \/>\n<\/code><\/p>\n<p>v2: Fix some typos\/broken links<\/p>\n<p>v3: Update gdb commands<\/p>\n","protected":false},"excerpt":{"rendered":"<p>What is Processor Trace Intel Processor Trace (PT) traces program execution (every branch) with low overhead. This is a cheat sheet of how to use PT with perf for common tasks It is not a full introduction to PT. Please read Adding PT to Linux perf or the links from the general PT reference page. [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[7,14,12,18,17,11],"tags":[],"_links":{"self":[{"href":"http:\/\/halobates.de\/blog\/wp-json\/wp\/v2\/posts\/410"}],"collection":[{"href":"http:\/\/halobates.de\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/halobates.de\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/halobates.de\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/halobates.de\/blog\/wp-json\/wp\/v2\/comments?post=410"}],"version-history":[{"count":23,"href":"http:\/\/halobates.de\/blog\/wp-json\/wp\/v2\/posts\/410\/revisions"}],"predecessor-version":[{"id":445,"href":"http:\/\/halobates.de\/blog\/wp-json\/wp\/v2\/posts\/410\/revisions\/445"}],"wp:attachment":[{"href":"http:\/\/halobates.de\/blog\/wp-json\/wp\/v2\/media?parent=410"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/halobates.de\/blog\/wp-json\/wp\/v2\/categories?post=410"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/halobates.de\/blog\/wp-json\/wp\/v2\/tags?post=410"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}