Archive for the ‘tuning’ Category
What is Processor Trace
Intel Processor Trace (PT) traces program execution (every branch) with low overhead.
This is a cheat sheet of how to use PT with perf for common tasks
PT support in hardware
|Broadwell (5th generation Core, Xeon v4)||More overhead. No fine grained timing.|
|Skylake (6th generation Core, Xeon v5)||Fine grained timing. Address filtering.|
|Goldmont (Apollo Lake, Denverton)||Fine grained timing. Address filtering.|
PT support in Linux
PT is supported in Linux perf, which is integrated in the Linux kernel.
It can be used through the “perf” command or through gdb.
In general it is best to use the newest kernel and the newest Linux perf tools. If that is not possible older tools and kernels can be used. Newer tools can be used on an older kernel, but may not support all features
|Linux 4.1||Initial PT driver|
|Linux 4.2||Support for Skylake and Goldmont|
|Linux 4.3||Initial user tools support in Linux perf|
|Linux 4.5||Support for JIT decoding using agent|
|Linux 4.6||Bug fixes. Support address filtering.|
|Linux 4.8||Bug fixes.|
|Linux 4.10||Bug fixes. Support for PTWRITE and power tracing|
Many commands require recent perf tools, you may need to update them rom a recent kernel tree.
This article covers mainly Linux perf and briefly gdb.
Only needed once.
Allow seeing kernel symbols (as root)
echo kernel.kptr_restrict=0' >> /etc/sysctl.conf
Basic perf command lines for recording PT
Check if PT is supported and what capabilities.
perf record -e intel_pt// program
perf record -e intel_pt// -a sleep 1
Trace whole system for 1 second
perf record -C 0 -e intel_pt// -a sleep 1
Trace CPU 0 for 1 second
perf record --pid $(pidof program) -e intel_pt//
Trace already running program.
perf has to save the data to disk. The CPU can execute branches much faster than than the disk can keep up, so there will be some data loss for code that executes
many instructions. perf has no way to slow down the CPU, so when trace bandwidth > disk bandwidth there will be gaps in the trace. Because of this it is usually not a good idea
to try to save a long trace, but work with shorter traces. Long traces also take a lot of time to decode.
When decoding kernel data the decoder usually has to run as root.
An alternative is to use the perf-with-kcore.sh script included with perf
perf script --ns --itrace=cr
Record program execution and display function call graph.
perf script by defaults “samples” the data (only dumps a sample every 100us).
This can be configured using the –itrace option (see reference below)
Install xed first.
perf script --itrace=i0ns --ns -F time,pid,comm,sym,symoff,insn,ip | xed -F insn: -S /proc/kallsyms -64
Show every assembly instruction executed with disassembler.
For this it is also useful to get more accurate time stamps (see below)
perf script --itrace=i0ns --ns -F time,sym,srcline,ip
Show source lines executed (requires debug information)
perf script --itrace=s1Mi0ns ....
Often initialization code is not interesting. Skip initial 1M instructions while decoding:
perf script --time 1.000,2.000 ...
Slice trace into different time regions Generally the time stamps need to be looked up first in the trace, as they are absolute.
perf report --itrace=g32l64i100us --branch-history
Print hot paths every 100us as call graph histograms
Install Flame graph tools first.
perf script --itrace=i100usg | stackcollapse-perf.pl > workload.folded
flamegraph.pl workloaded.folded > workload.svg
Generate flame graph from execution, sampled every 100us
Other ways to record data
perf record -a -e intel_pt// sleep 1
Capture whole system for 1 second
Use snapshot mode
This collects data, but does not continuously save it all to disk. When an event of interest happens a data dump of the current buffer can be triggered by sending a SIGUSR2 signal to the perf process.
perf record -a -e --snapshot intel_pt// sleep 1
kill -USR2 $PERF_PID
*end of recording*
Record kernel only, complete system
perf record -a -e intel_pt//k sleep 1
Record user space only, complete system
perf record -a -e intel_pt//u
Enable fine grained timing (needs Skylake/Goldmont, adds more overhead)
perf record -a -e intel_pt/cyc=1,cyc_thresh=2/ ...
echo $[100*1024*1024] > /proc/sys/kernel/perf_event_mlock_kb
perf record -m 512,100000 -e intel_pt// ...
Increase perf buffer to limit data loss
perf record -e intel_pt// --filter 'filter main @ /path/to/program' ...
Only record main function in program
perf record -e intel_pt// -a --filter 'filter sys_write' program
Filter kernel code (needs 4.11+ kernel)
perf record -e intel_pt// -a --filter 'start func1 @ program' --filter 'stop func2 @ program' program
Start tracing in program at main and stop tracing at func2.
rsync -r ~/.debug perf.data other-system:
Transfer data to a trace on another system. May also require using perf-with-kcore.sh if decoding
Requires a new enough gdb built with libipt. For user space only.
record btrace pt
record instruction-history /m # show instructions
record function-history # show functions executed
prev # step backwards in time
For more information on gdb pt see the gdb documentation
Reference for –itrace option (from perf documentation)
i synthesize "instructions" events
b synthesize "branches" events
x synthesize "transactions" events
c synthesize branches events (calls only)
r synthesize branches events (returns only)
e synthesize tracing error events
d create a debug log
g synthesize a call chain (use with i or x)
l synthesize last branch entries (use with i or x)
s skip initial number of events
Reference for –filter option (from perf documentation)
A hardware trace PMU advertises its ability to accept a number of
address filters by specifying a non-zero value in
Address filters have the format:
- 'filter': defines a region that will be traced.
- 'start': defines an address at which tracing will begin.
- 'stop': defines an address at which tracing will stop.
- 'tracestop': defines a region in which tracing will stop.
code to trace in that file, and
trace. 'start' and 'stop' filters need not specify a
If no object file is specified then the kernel is assumed, in which case
the start address must be a current kernel memory address.
symbol name is not unique, it can be disambiguated by inserting #n where
'n' selects the n'th symbol in address order. Alternately #0, #g or #G
select only a global symbol.
the name of a symbol, in which case the size is calculated to the end
of that symbol. For 'filter' and 'tracestop' filters, if
of that symbol.
be calculated from the first and last symbols, i.e. to trace the whole
If symbol names (or '*') are provided, they must be surrounded by white
The filter passed to the kernel is not necessarily the same as entered.
To see the filter that is passed, use the -v option.
The kernel may not be able to configure a trace region if it is not
within a single mapping. MMAP events (or /proc/
examined to determine if that is a possibility.
Multiple filters can be separated with space or comma.
v2: Fix some typos/broken links
This is a short non comprehensive (and somewhat biased) literature survey on how lock elision with Intel TSX can be used to improve performance to programs by increasing parallelism. Focus is on practical incremental improvements in existing software.
A basic introduction of TSX lock elision is available in Scaling existing lock based applications with Lock elision.
The papers below are on actual code using lock elision, not just how to implement lock elision itself. The basic rules of how to implement lock elision in a locking library are in chapter 12 of the Intel Optimization manual. Common anti-patterns (mistakes) while doing so are described in TSX Anti-patterns.
Lock elision has been used widely to speed up both production and research databases. A lot of work has been done on the SAP HANA in memory data which uses TSX in production today to improve performance of the B+Tree index and the delta tree data structures. This is described in Improving In-Memory Database Index Performance with TSX.
Several research databases went beyond just using TSX to speed up specific data structures, but map complete database transactions to hardware transaction. This requires much more changes to the databases, and careful memory layout, but can also give more gains. It generally goes beyond simple lock elision. This approach has been implemented by Leis in Exploiting HTM in Main memory databases. A similar approach is used in Using Restricted Transactional Memory to Build a Scalable In-Memory Database. Both see large gains on TPC-C style workloads.
For non in memory databases, a group at Berkeley used lock elision to improve parallelism in LevelDB and getting a 25% speedup on a 4 core system. Standard LevelDB is essentially a single lock system, so this was a nice speedup with only minor work compared to other efforts that used manual fine grained locking to improve parallelism in LevelDB. However it required special handling of condition variables, which are used for commit. For simpler key value stores Diegues used an automatic tuner with TSX transactions to get a 2x gain with memcached.
DrTM uses TSX together with RDMA to implement a fast distributed transaction manager using 2PL. TSX provides local isolation, while RDMA (which aborts transactions) provides remote isolation.
An attractive use of lock elision is to speed up locks implicit in language runtimes. Yoo implemented transparent. support for using TSX for Java synchronized sections in Early Experience on Transactional
Execution of Java TM Programs. The runtime automatically detects when a synchronized region does not benefit from TSX and disables lock elision then. This works transparently, but using it successfully may still need some changes in the program to minimize conflicts and other aborts, as described by Gil Tene in Understanding Hardware Transactional Memory. This support is available in JDK 8u40 and can be enabled with the -XX:+UseRTMLocking option.
Another interesting case for lock elision is improving parallelism for the Great Interpreter Lock (GIL) used in interpreters. Odaiara implemented this for Ruby in Eliminating Global Interpreter Locks in Ruby through HTM. They saw a 4.4x speedup on Ruby NPB, 1.6x in WEBrick and 1.2x speedup in Ruby on Rails on a 12 core system.
Hardware transactions can be used for auto-parallelizing existing loops, even when the compiler cannot prove that iterations are independent, by using the transactions to isolate individual iterations. Odaira implemented TSX loop speculation manually for workloads in SPECCpu and report a 11% speedup on a 4 core system. There are some limitations to this technique due to the lazy subscription problem described by Dice, but in principle it can be directly implemented in compilers.
Salamanca used TSX to implement tracing recovery code for Speculative Trace Optimization (STO) for loops. The basic principles are similar to the previous paper, but they implemented an automated prototype. They report 9% improvement on 4 cores with a number of benchmarks.
An older paper, which predates TSX, by Dice describes how to use Hardware Transactional Memory to simplify Work Stealing schedulers.
High Performance Computing
Yoo.et.al. use TSX lock elision to benchmark a number of HPC applications in Performance Evaluation of Intel TSX for High-Performance Computing. They report an average 41% speedup on a 4 core system. They also report an average 31% improvement in bandwidth when applying TSX to a user space TCP stack.
Hay explored lock elision for improving Parallel Discrete Event Simulations. He reports a speed ups of up to 28%.
Lock elision has been used widely to speed up parallel data structures. Normally applying lock elision to an existing data structure is very simple — elide the lock protecting it — but some tweaking of the data structure can often give better performance. Dementiev explores TSX for general fast scalable hash tables. Li uses TSX to implement scalable Cuckoo hash tables. Using TSX for hash tables is generally very straight forward. For tree data structures one need to be careful that tree rebalancing does not overwhelm the write set capacity of the hardware transactions., Repetti uses TSX to scale Patricia Tries. Siakavas explores TSX usage for scalable Red-Black Trees, similar in this paper. Bonnichsen uses HTM to improve BT Trees, reporting a speed up of 2x to 3x compared to earlier implementations. The database papers described above describe the rules needed to successfully elide BTrees.
Calciu uses TSX to implement a more scalable priority queue based on skip list and reports increased parallelism.
Memory allocation and Garbage Collection
One challenge with using garbage collection is that the worst case “stop the world” pauses from parallel garbage collectors limit the total heap size that can be supported in copying garbage collectors. The Chihuahua GC paper implements a prototype TSX based collector for the Jikes research java VM. They report upto 101% speedup in concurrent copying speed, and show that a simple parallel garbage collector can be implemented with limited efforts.
Another dog-themed GC, the Collie garbage collector (original paper predates TSX) presents a production quality parallel collector that minimizes pauses and allows scaling to large heaps. Opdahl has another description of the Collie algorithm It is presumably deployed now for TSX in Azul’s commercial ZING JVM product which claims to scale upto 2TB of heap memory.
StackTrack is an efficient algorithm to do automatic memory reclaim for parallel data structures using hardware transactions, out performing existing techniques such as hazard pointers. It requires recompiling the program with a special patched gcc compiler, and automatically creates variable-length transactions for functions freeing memory. The technique could be potentially used even without special compilers.
Kuzmaul uses TSX to implement a scalable SuperMalloc and reports good performance combined with relatively simple code. Dice et all report how an cache index aware malloc can improve TSX performance by improving utilization of the L1 cache.
Peters use lock elision to parallelize a micro kernel For a micro kernel a big lock is fine and finds that RTM lock elision outperforms fine grained locking due to less single thread overhead.
Processor trace allows to do as very exact histograms of a program’s run time. Normal sampling has shadow effects, which can hide some details. Processor traces every branch, so it can be much more accurate than normal sampling.
You need a Intel Broadwell or Skylake CPU.
Running at 4.1 or later Linux kernel where perf supports PT.
You can verify the kernel supports pt with
You need perf user tools built from https://github.com/virtuoso/linux-perf
(this should soon be fixed when the user tools code is merged into Linux mainline)
Build perf with PT support
# set up https_proxy as needed
git clone https://github.com/virtuoso/linux-perf
Copy the resulting perf binary to where you want to run it
Get the flamegraph code
git clone https://github.com/brendangregg/FlameGraph.git
Collect data from the workload. Best to not collect too long traces as they take much longer to process and may need too much disk space.
perf record -e intel_pt// workload (or -a sleep 1 to collect 1s globally)
Decode the data. This may take quite some time
perf script --itrace=i100usg | /path/to/FlameGraph/ | stackcollapse-perf.pl > workload.folded
The i100us means the trace decoder samples an instruction every 100us. This can be made more accurate (down to 1ns), at the cost of longer decoding time. The ‘g’ tells the decoder to add callgraphs.
Then generate the Flamegraph with
/path/to/FlameGraph/flamegraph.pl workloaded.folded > workload.svg
Then view the resulting SVG in a SVG viewer, such as google chrome
It is possible to click around.
Here’s a larger svg example from a gcc build (2.5MB). May need chrome or firefox to view.
In principle the trace also has support for more information not in normal sampling, such as determining the exact run time of individual functions from the trace. This is unfortunately not (yet?) supported by the Flame Graph tools.
Energy efficient servers – Blue prints for data center optimization from Gough/Steiner/Sanders is a new book on power tuning on servers that was recently published at Apress. I got my copy a few weeks ago and read it and it is great.
Disclaimer: I contributed a few pages to the book, but have no financial interest in its success.
As you probably already know power efficiency is very important for modern computing. It matters to mobile devices to extend battery time, it matters to desktops and servers to avoid exceeding the thermal/power capacity and lower energy costs.
Modern chips cannot run all their transistors at full speed at the same time due to the dark silicon problem. This results in the somewhat paradoxical situation that power management is needed, even if energy costs don’t matter, just to give the best performance (such as the highest Turbo frequencies)
Power management in modern systems is quite complex, with many different moving parts, hardware, operating systems, drivers, firmware, embedded micro-controllers working together to be as efficient as possible. I’m not aware of any good overview of all of this.
There is some lore around — for example you may have heard of race to idle, that is running as fast as possible to go idle again — but nothing really that puts it all into a larger context. BTW race-to-idle is not always a good idea, as the book explains.
The new book makes an attempt to explain all of this together for Intel servers (the basic concepts are similar on other systems and also on client systems).
It starts with a (short) introduction of the underlying physical principles and then moves on to the basic CPU and platform power management techniques, such as frequency scaling and idle state and thermal management. It has a discussion on modern memory subsystems and describes the trade offs between different DIMM configurations. It describes the power management differences between larger servers and micro servers. And there is a overview of thermal management and power supply, such as energy efficient power supplies and voltage regulators.
Then it moves on to an overview of the software involved in power management, including firmware, rack level power management software, and operating systems. Then there is an extensive chapter how to instrument and measure power management
Finally (and perhaps most valuable) the book lays out a systematic power tuning methodology, starting with measurements and then concrete steps to optimize existing workloads for the best power efficiency.
The book is written not as an academic text book, but intended for people who solve concrete problems on shipping systems. It is quite readable, explaining any complicated concepts. You can clearly tell the authors have deep knowledge on the topic. While the details are intended for Intel servers, I would expect the book to be useful even to people working on clients or also other architectures.
One possible issue with the book is that it may be too specific for today’s systems. We’ll see how well it ages to future systems. But right now, as it just came out, it it very up-to-date and a good guide. It has some descriptions of data center design (such as efficient cooling), but these parts are quite short and are clearly not the main focus.
Often when doing performance analysis or debugging, it boils down to stare at long text trace files with the less text viewer. Yes you can do a lot of analysis with custom scripts, but at some point it’s usually needed to also look at the raw data.
The first annoyance in less when opening a large file is the time it takes to count lines (less counts lines at the beginning to show you the current position as a percentage). The line counting has the easy workaround of hitting Ctrl-C or using less -n to disable percentage. But it would be still better if that wasn’t needed.
Nicolai Haenle speeded the process by about 20x in his less repository.
One thing that always bothered me was that searching in less is so slow. If you’re browsing a tens to hundreds of MB file file it can easily take minutes to search for a string. When browsing log and trace files searching over longer distances is often very important.
And there is no good workaround. Running grep on the file is much faster, but you can’t easily transfer the file position from grep to the less session.
Some profiling with perf shows that most of the time searching is spent converting each line. Less internally cleans up the line, convert it to canonical case, remove backspace bold, and some other changes. The conversion loop processes each character in a inefficient way. Most of the time this is not needed, so I replaced that with a quick check if the line contains any backspaces using the optimized strchr() from the standard C library. For case conversion the string search functions (either regular expression or fixed string search) can also handle case insensitive search directly, so we don’t need an extra conversion step. The default fixed string search (when the search string contains no regular expression meta characters) can be also done using the optimized C library functions.
The resulting less version searches ~85% faster on my benchmarks. I tried to submit the patch to the less maintainer, but it was ignored unfortunately. The less version in the repository also includes Nicolai’s speedup patches for the initial line counting.
One side effect of the patch is that less now defaults to case sensitive searches. The original less had a feature (or bug) to default to case-insensitive even without the -i option. To get case insensitive searches now “less -i” needs to be used.
[Edit: Fix typos]
I published a new article on TSX anti patterns: Common mistakes that people make when implementing TSX lock libraries or writing papers about them. Make sure you don’t fall into the same traps. Enjoy!
I published an introductory article on practical lock elision with Intel TSX at ACM Queue/CACM: Scaling existing lock-based applications with lock elision. Enjoy!
Modern CPUs are quite complicated and to understand the performance profiling often needs to be used. The CPUs have performance monitoring units (PMUs) that allow to count and sample a wide variety of events. Linux perf provides an interface to the PMU. It has been designed to provide an abstracted view of the PMU events, and provides a limited number of abstracted events for common situations. In addition it has an interface to access all the raw events. pmu-tools is my toolkit to make access to these raw events more user-friendly for Intel CPUs, and provide some additional functionality. It is not really an replacement for perf, just an addition. If the abstracted perf events work there is no need to use pmu-tools. But there are some situations where additional events are useful. Also it can be useful to experiment: if a “raw” pmu tool use case is useful it may move later into “abstracted” perf.
pmu-tools has a number of components: several of wrappers for perf, and some C libraries for programs. I’ll describe these different components in a number of posts.
git clone git://github.com/andikleen/pmu-tools cd pmu-tools
pmu-tools currently has no installer. I just run the tools from the source directory.
# export PATH=$PATH:$(pwd)
The first (and original component of pmu-tools) is ocperf. ocperf is a wrapper around the perf command line program that translates events from the full Intel event lists to perf format, and does some additional setup.
The command line is the same as normal perf, just in the ‘-e’ line you can also use Intel events. ocperf list outputs all the additional events.
# perf list | wc -l 544 # ocperf.py list | wc -l 1244
(the actual numbers will vary based on system setup and CPU)
As you can see ocperf adds a large number of additional events. I’m not describing all these events, but the ocperf event list includes a brief description. They can be used to analyze a wide variety of performance conditions
To use them just use a normal perf command line with ocperf
Count global remote node accesses
# ocperf.py stat -e offcore_response.demand_data.remote_dram_1 -a sleep 5
Sample conditional branches
# ocperf.py record -e br_inst_exec.cond my-program # ocperf.py report --stdio
Translate an event into the raw format to use directly with perf
# ocperf.py --print stat -e DTLB_MISSES.LARGE_WALK_COMPLETED perf -e r8049
The r8049 code can be used directly with perf or other tools that accept raw events.
ocperf translates the events and calls perf with the translated events. It also tries to translate them back in the output. This only works for “–stdio” output. When you are using the interactive browser (or the gtk UI) you will see the raw translated events in the output.
Another ocperf feature is to set the recommended Intel sampling period for an event (with -c default). By default perf uses an adaptive sampling period, that may use a lot of additional CPU time and is less predictible. This is only supported on some CPUs.
To set additional perf flags you can use the usual :XXX syntax
Count all the division operations in the kernel
# ocperf.py stat -e arith.div:k
ocperf currently only supports the old-style perf attribute syntax (with :xxx), not “cpu//”. This may change in future versions
Originally ocperf was just to handle “offcore events” (that is what the oc in the name stands for), but these days it is useful for far more.
First I should mention that oprofile recently added an “operf” tool. ocperf is not related to that tool and the name predates it.
Modern CPU cores are very fast at computation, and often spend large parts of their time waiting for something else (memory, IO, other cores) As you can imagine, profiling for that can be fairly important. Since Nehalem, Intel Core CPUs, have special offcore events to distinguish all the different “offcore” cases: L3 hit, memory hit, remote cache hit/miss etc. There are so many cases that the normal unit mask of a PMU event does not have enough bits to describe them, so separate registers are used instead. Originally perf didn’t know how to program these additional registers, so couldn’t profile offcore events.
ocperf was a workaround to program these registers directly from user space. This is fixed in recent perf versions (using the offcore_rsp attribute) and not needed anymore. But ocperf is still quite useful as it can directly generate all the needed masks from a predefined table. perf has some builtin offcoure events, but the set supplied by ocperf is larger and better documented.
And of course it still supports older kernels too, if you are not running the latest and greatest.
These days — in addition to translating events from the Intel events table — it also provides some additional workarounds, for example an offcore problem on Xeon E5 2600 series
I will write about more pmu-tools features in future posts.
Hope everyone who already has a Haswell with TSX is considering playing around with it. Chapter 12 of the optimization manual has a good methology.
There’s currently a push to solve the last issues in the TSX glibc to get the feature in glibc 2.18 including even the most obscure POSIX requirements. A nice solution to still provide deadlocks for PTHREAD_MUTEX_NORMAL has been found now, that doesn’t affect most programs. We also settled on removing support for mutex initializers that disable/enable elision to improve binary compatibility with old glibcs. This has the nice side effect that the glibc can internally ensure that no mutex has a elision flag set, when the CPU does not support RTM. This then allows to shave off at least one more check in the pthread_mutex_lock() fast path.
I published an article describing TSX fallback paths. Every RTM transaction needs a working fallback path, and not doing that properly is a common TSX newbie mistake.
And Roman collected all the available TSX resources in a nice overview page
Also a new version of PCM is available that supports TSX counting (no abort sampling). It doesn’t need a kernel driver, and should work on all Linux, Mac, Windows systems. Having some form of performance counting is fairly important for any TSX tuning
Also a new version of tsx-tools, including HLE and RTM compat headers, has been published.
And the HLE examples in the gcc documentation have been finally fixed to commit (not yet backported to 4.8). HLE requires the operation size of the acquire and release to match, and always aborts the transaction if that is not the case. __atomic_clear always casts the argument to bool for obscure reasons, so if the lock variable is not bool the operand size is likely to mismatch. The fix is to use __atomic_store_n instead, which doesn’t cast the pointer.
There are also some other issues in the HLE intrinsics in 4.8 currently (mostly fixed in 4.9). gcc won’t warn in all cases when it cannot generate HLE for an atomic primitive (e.g. when the primitive does not map to a single x86 instruction). And you still need to enable optimization to use the C++ HLE interface or some more complex __atomic intrinsics, as the gcc backend may otherwise not see the __ATOMIC_HLE_RELEASE flag as a constant.
Right now it is still safer to use the compat headers from tsx-tools which avoid all these problems.