Andi Kleen's blog » kernel Tilting at windmills and other endeavors Mon, 17 Aug 2015 04:27:33 +0000 Announcing simple-pt - A simple Processor Trace implementation Mon, 17 Aug 2015 04:27:33 +0000 therapsid Modern Intel Core CPUs (5th and 6th generation) have an Intel Processor Trace (PT) feature to trace branch execution with low overhead. This is useful for performance analysis and debugging.

simple-pt is a simple standalone driver and decoder tool to implement PT on Linux.

Since version 4.1, Linux already has an integrated PT implementation in perf (see ). simple-pt is an alternative implementation. It has many disadvantages compared to the perf PT implementation, such as:
- needs to run as root
- no long term tracing or sampling with interrupts
- no support for interactive debugging (use gdb 7.10 on perf for that)
- no support for histograms
- somewhat experimental
- not as well supported as perf

On the positive side simple-pt is:
- simple
- standalone. No kernel changes needed. Could be ported to older kernels or other operating systems
- easy to modify and experiment with
- more ftrace-like decoding tool
- support for kprobes-based triggers
- modular “unix style” design with simple tools that do only one thing each
- BSD licensed

Example output:

        % sptcmd  -c tcall taskset -c 0 ./tcall
        cpu   0 offset 1027688,  1003 KB, writing to ptout.0
        Wrote sideband to ptout.sideband
        % sptdecode --sideband ptout.sideband --pt ptout.0 | less
        frequency 32
        0        [+0]     [+   1] _dl_aux_init+436
                          [+   6] __libc_start_main+455 -> _dl_discover_osversion
                          [+  13] __libc_start_main+446 -> main
                          [+   9]     main+22 -> f1
                          [+   4]             f1+9 -> f2
                          [+   2]             f1+19 -> f2
                          [+   5]     main+22 -> f1
                          [+   4]             f1+9 -> f2
                          [+   2]             f1+19 -> f2
                          [+   5]     main+22 -> f1

Available from

Generating Flame graphs with Processor Trace Wed, 05 Aug 2015 23:13:04 +0000 therapsid How to generate a FlameGraph with Processor Trace. Everybody loves Flame Graphs.

Processor Trace allows very exact histograms of a program’s run time. Normal sampling has shadow effects, which can hide some details. Processor Trace records every branch, so it can be much more accurate than normal sampling.

You need an Intel Broadwell or Skylake CPU, running a 4.1 or later Linux kernel where perf supports PT.
You can verify the kernel supports pt with

ls /sys/devices/intel_pt

You need perf user tools built from
(this should soon be fixed when the user tools code is merged into Linux mainline)

Build perf with PT support

# set up https_proxy as needed
git clone
cd linux-perf/tools/perf

Copy the resulting perf binary to where you want to run it

Get the flamegraph code

git clone

Collect data from the workload. It is best not to collect overly long traces, as they take much longer to process and may need too much disk space.

perf record -e intel_pt// workload (or -a sleep 1 to collect 1s globally)

Decode the data. This may take quite some time

perf script --itrace=i100usg | /path/to/FlameGraph/ > workload.folded

The i100us means the trace decoder samples an instruction every 100us. This can be made more accurate (down to 1ns), at the cost of longer decoding time. The ‘g’ tells the decoder to add callgraphs.
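
As a rough illustration of what the folding step produces: the flame graph scripts consume one line per unique call stack, with the frames joined by semicolons from root to leaf, followed by a sample count. The sketch below is not the FlameGraph code itself; the function name fold_stacks and the sample stacks are made up for illustration.

```python
from collections import Counter

def fold_stacks(samples):
    """Collapse call stacks (root first) into folded 'a;b;c count' lines."""
    counts = Counter(";".join(stack) for stack in samples)
    # Sort for stable output; each line is "<frames> <count>".
    return ["%s %d" % (stack, n) for stack, n in sorted(counts.items())]

# Hypothetical samples: three decoded call stacks, two of them identical.
samples = [
    ["main", "f1", "f2"],
    ["main", "f1", "f2"],
    ["main", "f1"],
]
for line in fold_stacks(samples):
    print(line)
```

A shorter sampling interval in --itrace simply yields more such stacks, which is why decoding and folding take longer.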

Then generate the Flamegraph with

/path/to/FlameGraph/ workload.folded > workload.svg

Then view the resulting SVG in an SVG viewer, such as Google Chrome

google-chrome workload.svg

It is possible to click around.

Here’s a larger svg example from a gcc build (2.5MB). May need chrome or firefox to view.

In principle the trace also has support for more information not in normal sampling, such as determining the exact run time of individual functions from the trace. This is unfortunately not (yet?) supported by the Flame Graph tools.

Energy efficient servers book review Mon, 27 Jul 2015 06:14:29 +0000 therapsid Energy efficient servers – Blue prints for data center optimization by Gough/Steiner/Sanders is a new book on power tuning for servers that was recently published by Apress. I got my copy a few weeks ago, read it, and it is great.

Disclaimer: I contributed a few pages to the book, but have no financial interest in its success.

As you probably already know, power efficiency is very important for modern computing. It matters to mobile devices to extend battery time, and it matters to desktops and servers to avoid exceeding the thermal/power capacity and to lower energy costs.

Modern chips cannot run all their transistors at full speed at the same time due to the dark silicon problem. This results in the somewhat paradoxical situation that power management is needed, even if energy costs don’t matter, just to give the best performance (such as the highest Turbo frequencies).

Power management in modern systems is quite complex, with many different moving parts, hardware, operating systems, drivers, firmware, embedded micro-controllers working together to be as efficient as possible. I’m not aware of any good overview of all of this.

There is some lore around — for example you may have heard of race to idle, that is running as fast as possible to go idle again — but nothing really that puts it all into a larger context. BTW race-to-idle is not always a good idea, as the book explains.

The new book makes an attempt to explain all of this together for Intel servers (the basic concepts are similar on other systems and also on client systems).

It starts with a (short) introduction of the underlying physical principles and then moves on to the basic CPU and platform power management techniques, such as frequency scaling, idle states, and thermal management. It has a discussion of modern memory subsystems and describes the trade-offs between different DIMM configurations. It describes the power management differences between larger servers and micro servers. And there is an overview of thermal management and power supply, such as energy efficient power supplies and voltage regulators.

Then it moves on to an overview of the software involved in power management, including firmware, rack level power management software, and operating systems. Then there is an extensive chapter on how to instrument and measure power management.

Finally (and perhaps most valuable) the book lays out a systematic power tuning methodology, starting with measurements and then concrete steps to optimize existing workloads for the best power efficiency.

The book is written not as an academic text book, but intended for people who solve concrete problems on shipping systems. It is quite readable, explaining any complicated concepts. You can clearly tell the authors have deep knowledge on the topic. While the details are intended for Intel servers, I would expect the book to be useful even to people working on clients or also other architectures.

One possible issue with the book is that it may be too specific to today’s systems. We’ll see how well it ages with future systems. But right now, as it just came out, it is very up-to-date and a good guide. It has some descriptions of data center design (such as efficient cooling), but these parts are quite short and are clearly not the main focus.

The ebook version is currently available as a free download from the publisher after registration, or at amazon as a free kindle edition, or as a reasonably priced paperback.

Speeding up less Fri, 10 Jul 2015 20:26:05 +0000 therapsid Often when doing performance analysis or debugging, it boils down to staring at long text trace files with the less text viewer. Yes, you can do a lot of analysis with custom scripts, but at some point it’s usually necessary to also look at the raw data.

The first annoyance in less when opening a large file is the time it takes to count lines (less counts lines at the beginning to show you the current position as a percentage). The line counting has the easy workaround of hitting Ctrl-C or using less -n to disable percentage. But it would be still better if that wasn’t needed.

Nicolai Haenle sped up the process by about 20x in his less repository.

One thing that always bothered me was that searching in less is so slow. If you’re browsing a file of tens to hundreds of MB it can easily take minutes to search for a string. When browsing log and trace files, searching over longer distances is often very important.

And there is no good workaround. Running grep on the file is much faster, but you can’t easily transfer the file position from grep to the less session.

Some profiling with perf shows that most of the search time is spent converting each line. Less internally cleans up the line: it converts it to canonical case, removes backspace bold, and makes some other changes. The conversion loop processes each character in an inefficient way. Most of the time this is not needed, so I replaced it with a quick check whether the line contains any backspaces, using the optimized strchr() from the standard C library. For case conversion, the string search functions (either regular expression or fixed string search) can handle case insensitive search directly, so we don’t need an extra conversion step. The default fixed string search (when the search string contains no regular expression meta characters) can also be done using the optimized C library functions.
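
The idea of the fast path can be sketched in a few lines of Python. This is only an illustration of the technique, not the actual less patch (which is C); the names clean_line and search are hypothetical, and the slow path only handles backspace overstrikes, not the other cleanups less performs.

```python
def clean_line(line):
    """Remove backspace overstrikes ("b\\bb" -> "b"), skipping clean lines."""
    # Fast path: one optimized substring scan, analogous to strchr() in C.
    if "\b" not in line:
        return line
    # Slow path: a backspace erases the character before it.
    out = []
    for ch in line:
        if ch == "\b":
            if out:
                out.pop()
        else:
            out.append(ch)
    return "".join(out)

def search(lines, needle):
    """Return indexes of lines containing needle (case sensitive)."""
    return [i for i, line in enumerate(lines) if needle in clean_line(line)]

lines = ["plain text", "b\bbo\bol\bld\bd text", "other"]
print(search(lines, "bold"))
```

Since most lines in a typical trace file contain no backspaces at all, nearly every line takes the cheap path.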

The resulting less version searches ~85% faster on my benchmarks. I tried to submit the patch to the less maintainer, but it was ignored unfortunately. The less version in the repository also includes Nicolai’s speedup patches for the initial line counting.

One side effect of the patch is that less now defaults to case sensitive searches. The original less had a feature (or bug) to default to case-insensitive even without the -i option. To get case insensitive searches now “less -i” needs to be used.

[Edit: Fix typos]

toplev tutorial and manual Thu, 09 Jul 2015 17:57:01 +0000 therapsid toplev, part of pmu-tools, is a tool to determine the CPU bottleneck of workloads. Now finally there is a tutorial and manual available for toplev.

Adding Processor Trace support to Linux Thu, 09 Jul 2015 17:51:15 +0000 therapsid I published an article at LWN: Adding processor trace to Linux. It describes the Linux perf support for the Intel Processor Trace feature on Intel Broadwell and other CPUs. Processor Trace allows fine grained tracing of program control flow.

TSX anti patterns Thu, 27 Mar 2014 04:24:20 +0000 therapsid I published a new article on TSX anti patterns: Common mistakes that people make when implementing TSX lock libraries or writing papers about them. Make sure you don’t fall into the same traps. Enjoy!

Scaling existing lock-based applications with lock elision Wed, 12 Feb 2014 04:01:07 +0000 therapsid I published an introductory article on practical lock elision with Intel TSX at ACM Queue/CACM: Scaling existing lock-based applications with lock elision. Enjoy!

pmu-tools, part II: toplev Sun, 28 Jul 2013 06:53:49 +0000 therapsid In part 1 I gave an introduction to pmu-tools, and described ocperf, which allows low level access to the Intel defined CPU performance counter events.

toplev introduction

This part describes another component of pmu-tools: toplev. toplev builds on top of ocperf, but works at a much higher level.

perf record defaults to cycle sampling. Cycle sampling can tell roughly what part of the workload is taking up CPU time. It cannot directly tell why it is slow. If you have psychic super powers you may be able to figure it out from the source code. If not, using other measurements can help to narrow down the performance bottleneck.

ocperf has a lot of low level events to sample or count specific conditions, but using them requires some knowledge of the CPU to select the right events.

Another approach is to just count specific events. Many parts of the CPU have “stall cycles” counter support, that is they can count how long they are waiting for something. This can be used to compute “stall ratios” when divided by the total number of cycles.
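
As a minimal worked example of such a ratio (the counter values below are made up, and stall_ratio is just an illustrative helper, not part of perf or pmu-tools):

```python
def stall_ratio(stall_cycles, total_cycles):
    """Fraction of all cycles that a unit spent waiting for something."""
    if total_cycles == 0:
        return 0.0  # nothing was measured, avoid dividing by zero
    return stall_cycles / total_cycles

# Made-up counts: 250,000 stalled cycles out of 1,000,000 total cycles.
print("%.1f%%" % (100 * stall_ratio(250_000, 1_000_000)))
```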

The standard “perf stat” displays such ratios as “stalled-cycles-frontend” (the part of the CPU decoding instructions) and for the backend (that is the actual execution) as “stalled-cycles-backend”.

This assumes a very simplified CPU model. But modern out of order CPUs execute many instructions in parallel and try to execute something else in stall times. The stalls are only a performance problem if they actually bottleneck the execution, that is if there is nothing else to do that could hide the stall.

Also some workloads simply don’t do specific operations much (for example a workload that fits into the L1 cache does not do many memory operations), and evaluating the stall cycles of the memory subsystem may not be very useful, as they only cover very rare events.

So just looking at isolated ratios is not necessarily useful.

To avoid this problem we can compute a larger number of ratios for different units in the CPU, and then define a hierarchy of thresholds between ratios that define whether a specific ratio is meaningful or not.

This is described as the “Top Down” methodology in B.3.2 of the Intel optimization manual. More information on TopDown is in this article or in Ahmad Yasin’s ISCA workshop presentation. I didn’t invent it, I’m just implementing it.

The toplev tool in pmu-tools implements this methodology. It uses counting, not sampling, which means it can only tell you “what”, but not “where exactly in the program”. If interval mode is used (-I1000) it can also give a very rough “when”.

how toplev works

toplev automatically runs perf stat with the right counters and computes the thresholds and only displays meaningful bottlenecks. toplev defaults to a single 5 event model that already gives some useful information for Intel Core CPUs since Sandy Bridge.

simple model

The simple model has the advantage that it fits into the standard 4 performance counters without multiplexing, which makes it more reliable. More on that later.

For specific CPUs there is also a more detailed model available (enable with -d)

detailed model

The detailed model is a tree of different levels. The first level corresponds to the simplified model. Additional levels (max 4, default 2, using the -l option) can be used to narrow down specific issues more by going down the tree. Each level is only meaningful if the parent crossed its threshold.
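
The tree walk described above can be sketched as follows. This is a simplified illustration, not toplev's implementation: the Node class, the bottlenecks function, the 20% thresholds, and the ratios are all hypothetical.

```python
class Node:
    def __init__(self, name, ratio, threshold, children=()):
        self.name = name
        self.ratio = ratio          # measured fraction, 0.0 .. 1.0
        self.threshold = threshold  # hypothetical cut-off, not toplev's values
        self.children = list(children)

def bottlenecks(node, max_level, level=1):
    """Yield (level, name, ratio) for nodes above their threshold.

    A child is only considered when its parent crossed its own threshold.
    """
    if level > max_level or node.ratio <= node.threshold:
        return
    yield level, node.name, node.ratio
    for child in node.children:
        yield from bottlenecks(child, max_level, level + 1)

# Made-up ratios for a backend bound workload.
backend = Node("Backend Bound", 0.72, 0.20, [
    Node("Memory Bound", 0.43, 0.20),
    Node("Core Bound", 0.05, 0.20),
])
for level, name, ratio in bottlenecks(backend, max_level=2):
    print("%s%s: %.0f%%" % ("  " * (level - 1), name, ratio * 100))
```

With max_level=1 only the top node is reported; nodes below their threshold are suppressed together with their whole subtree.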

The detailed model requires running many more events to compute all the needed ratios. Since the CPU only has 4 (or 8 with HyperThreading off) general performance counters available, perf will need to multiplex (that is regularly re-program) the counters, which adds measurement errors.
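
To see where the multiplexing error comes from: when an event is only on a counter part of the time, its count is extrapolated by the ratio of the time the event was enabled to the time it was actually running. The sketch below shows that arithmetic under simplifying assumptions; it ignores fixed counters and per-event counter constraints, and the function names and numbers are hypothetical.

```python
import math

def groups_needed(num_events, num_counters=4):
    """How many sets of events must take turns on the hardware counters."""
    return math.ceil(num_events / num_counters)

def scale_count(raw_count, time_enabled, time_running):
    """Extrapolate a multiplexed count to the full measurement time."""
    if time_running == 0:
        return 0.0  # event never got onto a counter
    return raw_count * time_enabled / time_running

# Hypothetical: 10 events on 4 counters, one event counted 40% of the time.
print(groups_needed(10))
print(scale_count(1_000_000, 100, 40))
```

The extrapolation is only accurate if the workload behaves the same while the event is not being counted, which is why steady state workloads measure best.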

In general the lower levels are less reliable than the higher levels and should be taken with a grain of salt. But up to level 2 it generally works well.


First set up pmu-tools if you haven’t yet.

% git clone 

% cd pmu-tools 

% export PATH=$PATH:$(pwd) 

Let’s try a memory bound program. The STREAM benchmark is very memory bound. We use the simple (single threaded, not terribly optimized) version from numademo.

% numademo  100M stream
perf stat --log-fd 4 -x, -e {r100030d,r2c2,r19c,r10e,cycles} numademo 100M stream
Backend Bound:                              72.33%
    This category reflects slots where no uops are being delivered due to a lack
    of required resources for accepting more uops in the Backend of the pipeline. 

Let’s look a bit closer with a level 2 detailed model

% -d -l2 numademo  100M stream
perf stat --log-fd 4 -x, -e
numademo 100M stream
BE      Backend Bound:                      72.03%
    This category reflects slots where no uops are being delivered due to a lack
    of required resources for accepting more uops in the Backend of the pipeline.
BE/Mem  Memory Bound:                       43.18%
    This metric represents how much Memory subsystem was a bottleneck.
BE/Core Core Bound:                         18.90%
    This metric represents how much Core non-memory issues were a bottleneck.
RET     BASE:                               24.76%
    This metric represents slots fraction CPU was retiring uops not originated
    from the microcode-sequencer. 

So we’re memory bound as expected, but it’s only part of the problem.

With a level 3 measurement we can look even further. As you can see the underlying perf command line already gets really complicated for this, a tool like toplev is really needed to set it up.

% -d -l3 numademo  100M stream
perf stat --log-fd 4 -x, -e
r211,r8010,r4010,r3079,r1010,r1b1,r110,r111,r2010 numademo 100M stream
BE      Backend Bound:                      71.58%
    This category reflects slots where no uops are being delivered due to a lack
    of required resources for accepting more uops in the Backend of the pipeline.
BE/Mem  Memory Bound:                       43.66%
    This metric represents how much Memory subsystem was a bottleneck.
BE/Mem  L1 Bound:                           33.26%
    This metric represents how often CPU was stalled without missing the L1 data
BE/Core Core Bound:                         19.24%
    This metric represents how much Core non-memory issues were a bottleneck. 

BE/Core Ports Utilization:                  19.24%
    This metric represents cycles fraction application was stalled due to Core
    non-divider-related issues.
RET     BASE:                               25.08%
    This metric represents slots fraction CPU was retiring uops not originated
    from the microcode-sequencer.
RET     OTHER:                              87.89%
    This metric represents non-floating-point (FP) uop fraction the CPU has

This shows that numademo’s STREAM actually consists of more loads/stores than floating point operations. It’s not a very optimized version.

And finally a “real workload”, a kernel build with gcc. gcc has a lot of code, so the CPU’s instruction decoding frontend becomes a bottleneck, partly caused by branch mispredictions (which cause the frontend to do more work). This data is averaged over 4 cores.

FE      Frontend Bound:                     54.07%
    This category reflects slots where the Frontend of the processor undersupplies
    its Backend.
FE      Frontend Latency:                   39.53%
    This metric represents slots fraction CPU was stalled due to Frontend latency
BAD     Bad Speculation:                    11.75%
    This category reflects slots wasted due to incorrect speculations, which
    include slots used to allocate uops that do not eventually get retired and
    slots for which allocation was blocked due to recovery from earlier incorrect
BAD     Branch Mispredicts:                 11.66%
    This metric represents slots fraction CPU was impacted by Branch
RET     BASE:                               25.74%
    This metric represents slots fraction CPU was retiring uops not originated
    from the microcode-sequencer. 

Some caveats with TopDown

The topdown approach only works for CPU bound workloads. If the program’s performance is limited by something else (for example waiting for IO or blocking for other reasons) other methods need to be used.

The lower levels of the measurement tree are less reliable than the higher levels. They also rely on counter multiplexing and cannot use groups, which can cause larger measurement errors with non steady state workloads.

(If you don’t understand this terminology, it means measurements are much less accurate and it works best with programs that primarily do the same thing over and over.)

It’s recommended to measure the workload only after the startup phase by using interval mode or attaching later.

Level 1, or running without -d, is generally the most reliable. The lower tree levels have larger measurement errors. Level 2 usually also works well. Levels 3 and 4 can have some mismeasurements.

One of the events (even used by level 1) requires a recent enough kernel that understands its counter constraints. 3.10+ is safe.

Update 2013/07/28: Add links to other reference material on TopDown. Change “much less reliable” to “less reliable”.

pmu-tools part I Sat, 27 Jul 2013 02:50:18 +0000 therapsid Introduction

Modern CPUs are quite complicated, and understanding their performance often requires profiling. The CPUs have performance monitoring units (PMUs) that can count and sample a wide variety of events. Linux perf provides an interface to the PMU. It has been designed to provide an abstracted view of the PMU events, and provides a limited number of abstracted events for common situations. In addition it has an interface to access all the raw events. pmu-tools is my toolkit to make access to these raw events more user-friendly for Intel CPUs, and to provide some additional functionality. It is not really a replacement for perf, just an addition. If the abstracted perf events work there is no need to use pmu-tools. But there are some situations where additional events are useful. It can also be useful for experiments: if a “raw” pmu-tools use case proves useful it may later move into “abstracted” perf.

pmu-tools has a number of components: several wrappers for perf, and some C libraries for programs. I’ll describe these different components in a number of posts.

Getting pmu-tools

git clone git:// 

cd pmu-tools 

pmu-tools currently has no installer. I just run the tools from the source directory.

# export PATH=$PATH:$(pwd)


The first (and original) component of pmu-tools is ocperf. ocperf is a wrapper around the perf command line program that translates events from the full Intel event lists to perf format, and does some additional setup.

The command line is the same as normal perf, just in the ‘-e’ line you can also use Intel events. ocperf list outputs all the additional events.

# perf list | wc -l
544

# list | wc -l
1244

(the actual numbers will vary based on system setup and CPU)

As you can see, ocperf adds a large number of additional events. I’m not describing all these events, but the ocperf event list includes a brief description of each. They can be used to analyze a wide variety of performance conditions.

To use them just use a normal perf command line with ocperf

Count global remote node accesses

# stat -e offcore_response.demand_data.remote_dram_1 -a sleep 5 

Sample conditional branches

# record -e br_inst_exec.cond my-program 

# report --stdio 

Translate an event into the raw format to use directly with perf

perf  -e r8049

The r8049 code can be used directly with perf or other tools that accept raw events.
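
For simple events the raw code is just the event select code in the low byte and the unit mask in the next byte, written in hex after an "r". The sketch below shows that encoding; it is not how ocperf is implemented, the helper names are hypothetical, and further fields in higher bits (such as a counter mask) are deliberately ignored.

```python
def raw_event(event, umask):
    """Build a perf raw event string from an event select code and unit mask."""
    return "r%x" % ((umask << 8) | event)

def split_raw(raw):
    """Split a simple raw event string back into (event, umask)."""
    value = int(raw[1:], 16)
    return value & 0xff, (value >> 8) & 0xff

print(raw_event(0x49, 0x80))                 # r8049
print([hex(v) for v in split_raw("r2c2")])   # ['0xc2', '0x2']
```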

ocperf translates the events and calls perf with the translated events. It also tries to translate them back in the output. This only works for “--stdio” output. When you are using the interactive browser (or the gtk UI) you will see the raw translated events in the output.

Another ocperf feature is to set the recommended Intel sampling period for an event (with -c default). By default perf uses an adaptive sampling period, which may use a lot of additional CPU time and is less predictable. This is only supported on some CPUs.

To set additional perf flags you can use the usual :XXX syntax

Count all the division operations in the kernel

# stat -e arith.div:k 

ocperf currently only supports the old-style perf attribute syntax (with :xxx), not “cpu//”. This may change in future versions

ocperf background

Originally ocperf was just to handle “offcore events” (that is what the oc in the name stands for), but these days it is useful for far more.

First I should mention that oprofile recently added an “operf” tool. ocperf is not related to that tool and the name predates it.

Modern CPU cores are very fast at computation, and often spend large parts of their time waiting for something else (memory, IO, other cores). As you can imagine, profiling for that can be fairly important. Since Nehalem, Intel Core CPUs have special offcore events to distinguish all the different “offcore” cases: L3 hit, memory hit, remote cache hit/miss etc. There are so many cases that the normal unit mask of a PMU event does not have enough bits to describe them, so separate registers are used instead. Originally perf didn’t know how to program these additional registers, so it couldn’t profile offcore events.

ocperf was a workaround to program these registers directly from user space. This is fixed in recent perf versions (using the offcore_rsp attribute), so the workaround is not needed anymore. But ocperf is still quite useful as it can directly generate all the needed masks from a predefined table. perf has some builtin offcore events, but the set supplied by ocperf is larger and better documented.

And of course it still supports older kernels too, if you are not running the latest and greatest.

These days, in addition to translating events from the Intel events table, it also provides some additional workarounds, for example for an offcore problem on the Xeon E5 2600 series.

I will write about more pmu-tools features in future posts.
