Andi Kleen's blog

Tilting at windmills and other endeavors

toplev tutorial and manual

without comments

toplev, part of pmu-tools, is a tool to determine the CPU bottleneck of workloads. Now there is finally a tutorial and manual available for toplev.

Written by therapsid

July 9th, 2015 at 5:57 pm

Adding Processor Trace support to Linux

without comments

I published an article at LWN: Adding processor trace to Linux. It describes the Linux perf support for the Intel Processor Trace feature on Intel Broadwell and other CPUs. Processor Trace allows fine grained tracing of program control flow.

Written by therapsid

July 9th, 2015 at 5:51 pm

Posted in kernel

TSX anti patterns

without comments

I published a new article on TSX anti patterns: Common mistakes that people make when implementing TSX lock libraries or writing papers about them. Make sure you don’t fall into the same traps. Enjoy!

Written by therapsid

March 27th, 2014 at 4:24 am

Scaling existing lock-based applications with lock elision

without comments

I published an introductory article on practical lock elision with Intel TSX at ACM Queue/CACM: Scaling existing lock-based applications with lock elision. Enjoy!

Written by therapsid

February 12th, 2014 at 4:01 am

Posted in kernel,tsx,tuning

pmu-tools, part II: toplev

without comments

In part 1 I gave an introduction to pmu-tools, and described ocperf, which allows low level access to the Intel defined CPU performance counter events.

toplev introduction

This part describes another component of pmu-tools: toplev. toplev builds on top of ocperf, but works at a much higher level.

perf record defaults to cycle sampling. Cycle sampling can tell you roughly which part of the workload is taking up CPU time. It cannot directly tell you why it is slow. If you have psychic super powers you may be able to figure it out from the source code. If not, using other measurements can help to narrow down the performance bottleneck.

ocperf has a lot of low-level events for sampling or counting specific conditions, but using them requires some knowledge of the CPU to select the right events.

Another approach is to just count specific events. Many parts of the CPU have “stall cycles” counter support; that is, they can count how long they are waiting for something. These counts can be used to compute “stall ratios” when divided by the total number of cycles.

The standard “perf stat” displays such ratios as “stalled-cycles-frontend” for the frontend (the part of the CPU that decodes instructions) and “stalled-cycles-backend” for the backend (the actual execution).

This assumes a very simplified CPU model. Modern out-of-order CPUs execute many instructions in parallel and try to execute something else during stalls. The stalls are only a performance problem if they actually bottleneck the execution, that is, if there is nothing else to do that could hide the stall.

Also, some workloads simply don’t do specific operations much (for example, a workload that fits into the L1 cache does not perform many memory operations), so evaluating the stall cycles of the memory subsystem may not be very useful, as they only cover very rare events.

So just looking at isolated ratios is not necessarily useful.

To avoid this problem we can compute a larger number of ratios for different units in the CPU, and then define a hierarchy of thresholds that determines whether a specific ratio is meaningful or not.

This is described as the “Top Down” methodology in B.3.2 of the Intel optimization manual. More information on TopDown is in this article or in Ahmad Yasin’s ISCA workshop presentation. I didn’t invent it, I’m just implementing it.

The toplev tool in pmu-tools implements this methodology. It uses counting, not sampling, which means it can only tell you “what”, but not “where exactly in the program”. If interval mode is used (-I1000) it can also give a very rough “when”.

how toplev works

toplev automatically runs perf stat with the right counters, computes the thresholds, and only displays meaningful bottlenecks. toplev defaults to a single 5-event model that already gives some useful information for Intel Core CPUs since Sandy Bridge.

simple model

The simple model has the advantage that it fits into the standard 4 performance counters without multiplexing, which makes it more reliable. More on that later.

For specific CPUs there is also a more detailed model available (enable with -d).

detailed model

The detailed model is a tree of different levels. The first level corresponds to the simplified model. Additional levels (maximum 4, default 2, selected with the -l option) can be used to narrow down specific issues further by going down the tree. Each level is only meaningful if its parent crossed its threshold.
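
To make the gating concrete, here is a minimal C sketch of the idea (my own illustration, not toplev’s actual code; the node names, ratios and thresholds are made up): a node in the tree is only reported when it and all of its ancestors crossed their thresholds.

#include <stdio.h>

/* One node of a hypothetical TopDown-style tree; in a real tool the
   ratios would be computed from perf counter deltas. */
struct node {
    const char *name;
    double ratio;               /* fraction of pipeline slots, 0.0 .. 1.0 */
    double threshold;           /* "interesting" above this fraction */
    const struct node *parent;  /* NULL for level 1 nodes */
};

/* A node is meaningful only if it and every ancestor crossed the threshold. */
static int crossed(const struct node *n)
{
    return n->ratio > n->threshold &&
           (n->parent == NULL || crossed(n->parent));
}

int main(void)
{
    struct node backend = { "Backend Bound", 0.72, 0.20, NULL };
    struct node memory  = { "Memory Bound",  0.43, 0.20, &backend };
    struct node core    = { "Core Bound",    0.19, 0.20, &backend };
    const struct node *tree[] = { &backend, &memory, &core };

    for (unsigned i = 0; i < sizeof(tree) / sizeof(tree[0]); i++)
        if (crossed(tree[i]))
            printf("%-14s %5.1f%%\n", tree[i]->name, tree[i]->ratio * 100.0);
    return 0;
}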

The detailed model requires running many more events to compute all the needed ratios. Since the CPU only has 4 (or 8 with HyperThreading off) general performance counters available, perf will need to multiplex (that is regularly re-program) the counters, which adds measurement errors.

In general the lower levels are less reliable than the higher levels and should be taken with a grain of salt. But up to level 2 it generally works well.

Examples

First set up pmu-tools if you haven’t yet.

% git clone https://github.com/andikleen/pmu-tools 

% cd pmu-tools 

% export PATH=$PATH:$(pwd) 

Let’s try a memory bound program. The STREAM benchmark is very memory bound. We use the simple (single threaded, not terribly optimized) version from numademo.

% toplev.py numademo  100M stream 
... 
perf stat --log-fd 4 -x, -e {r100030d,r2c2,r19c,r10e,cycles} numademo 100M stream     
... 
Backend Bound:                              72.33%     
    This category reflects slots where no uops are being delivered due to a lack     
    of required resources for accepting more uops in the Backend of the pipeline. 

Let’s look a bit closer with a level 2 detailed model:

% toplev.py -d -l2 numademo  100M stream 
... 
perf stat --log-fd 4 -x, -e 
{r3079,r19c,r10401c3,r100030d,rc5,r10e,cycles,r400019c,r2c2,instructions}
{r15e,r60006a3,r30001b1,r40004a3,r8a2,r10001b1,cycles} 
numademo 100M stream 
... 
BE      Backend Bound:                      72.03%     
    This category reflects slots where no uops are being delivered due to a lack     
    of required resources for accepting more uops in the    Backend of the pipeline. 
BE/Mem  Memory Bound:                       43.18%     
    This metric represents how much Memory subsystem was a bottleneck. 
BE/Core Core Bound:                           18.90%     
    This metric represents how much Core non-memory issues were a bottleneck. 
RET     BASE:                               24.76%     
    This metric represents slots fraction CPU was retiring uops not originated     
    from the microcode-sequencer. 

So we’re memory bound as expected, but it’s only part of the problem.

With a level 3 measurement we can look even further. As you can see, the underlying perf command line already gets really complicated for this; a tool like toplev is really needed to set it up.

% toplev.py -d -l3 numademo  100M stream 
... 
perf stat --log-fd 4 -x, -e 
{r2ab,r19c,r2c2,r485,r480,r400019c,r187,cycles,r114,instructions},
{r4001879,r1002479,r40001a8,r4002479,r50005a3,r1001879,r10001a8,cycles,r12000ca3},
{r3079,r2c2,r20d1,r100030d,r10e,r50005a3,r4d1,cycles,r19c},
{r60006a3,cycles,r45f,r12000ca3,r8408},{r2c2,r10401c3,r100030d,rc5,r10e,cycles},
{r15e,r10401c3,r1fe6,rc5,r184015e,r480,cycles},
{r15e,r60006a3,r30001b1,r40004a3,r8a2,r10001b1,cycles,r114},
{r211,r8010,r4010,r1010,r1b1,r110,r111,r2010},
r211,r8010,r4010,r3079,r1010,r1b1,r110,r111,r2010 numademo 100M stream 
... 
BE      Backend Bound:                      71.58%     
    This category reflects slots where no uops are being delivered due to a lack     
    of required resources for accepting more uops in the Backend of the pipeline. 
BE/Mem  Memory Bound:                       43.66%     
    This metric represents how much Memory subsystem was a bottleneck. 
BE/Mem  L1 Bound:                           33.26%     
    This metric represents how often CPU was stalled without missing the L1 data     
    cache. 
BE/Core Core Bound:                         19.24%     
    This metric represents how much Core non-memory issues were a bottleneck. 

BE/Core Ports Utilization:                  19.24%     
    This metric represents cycles fraction application was stalled due to Core     
    non-divider-related issues. 
RET     BASE:                               25.08%     
    This metric represents slots fraction CPU was retiring uops not originated     
    from the microcode-sequencer. 
RET     OTHER:                              87.89%     
    This metric represents non-floating-point (FP) uop fraction the CPU has     
executed. 

This shows that numademo’s STREAM actually consists of more loads/stores than floating point operations. It’s not a really optimized version.

And finally a “real workload”, a kernel build with gcc. gcc has a lot of code, so the CPU’s instruction decoding frontend becomes a bottleneck, partly caused by branch mispredictions (which cause the frontend to do more work). This data is averaged over 4 cores.


FE      Frontend Bound:                     54.07%     
    This category reflects slots where the Frontend of the processor undersupplies     
    its Backend. 
FE      Frontend Latency:                   39.53%     
    This metric represents slots fraction CPU was stalled due to Frontend latency     
    issues. 
BAD     Bad Speculation:                    11.75%     
    This category reflects slots wasted due to incorrect speculations, which     
    include slots used to allocate uops that do not eventually get retired and     
    slots for which allocation was blocked due to recovery from earlier incorrect     
    speculation. 
BAD     Branch Mispredicts:                 11.66%     
    This metric represents slots fraction CPU was impacted by Branch     
    Missprediction. 
RET     BASE:                               25.74%     
    This metric represents slots fraction CPU was retiring uops not originated     
    from the microcode-sequencer. 

Some caveats with TopDown

The TopDown approach only works for CPU-bound workloads. If the program’s performance is limited by something else (for example waiting for IO or blocking for other reasons), other methods need to be used.

The lower levels of the measurement tree are less reliable than the higher levels. They also rely on counter multiplexing and cannot use groups, which can cause larger measurement errors with non-steady-state workloads.

(If you don’t understand this terminology: it means the measurements are much less accurate, and the tool works best with programs that primarily do the same thing over and over.)

It’s recommended to measure the workload only after the startup phase, by using interval mode or attaching later.

Level 1 (running without -d) is generally the most reliable. The lower tree levels have larger measurement errors. Level 2 usually also works well. Levels 3 and 4 can have some mismeasurements.

One of the events (used even by level 1) requires a recent enough kernel that understands its counter constraints. 3.10+ is safe.

Update 2013/07/28: Add links to other reference material on TopDown. Change “much less reliable” to “less reliable”.

Written by therapsid

July 28th, 2013 at 6:53 am

Posted in kernel,monitoring

pmu-tools part I

without comments

Introduction

Modern CPUs are quite complicated, and profiling is often needed to understand their performance. The CPUs have performance monitoring units (PMUs) that allow counting and sampling a wide variety of events. Linux perf provides an interface to the PMU. It has been designed to provide an abstracted view of the PMU events, and offers a limited number of abstracted events for common situations. In addition it has an interface to access all the raw events.

pmu-tools is my toolkit to make access to these raw events more user-friendly for Intel CPUs, and to provide some additional functionality. It is not really a replacement for perf, just an addition. If the abstracted perf events work there is no need to use pmu-tools. But there are some situations where additional events are useful. It can also be useful for experimenting: if a “raw” pmu-tools use case proves useful it may later move into “abstracted” perf.

pmu-tools has a number of components: several wrappers for perf, and some C libraries for programs. I’ll describe these different components in a number of posts.

Getting pmu-tools

git clone git://github.com/andikleen/pmu-tools 

cd pmu-tools 

pmu-tools currently has no installer. I just run the tools from the source directory.

# export PATH=$PATH:$(pwd)

ocperf

The first (and original component of pmu-tools) is ocperf. ocperf is a wrapper around the perf command line program that translates events from the full Intel event lists to perf format, and does some additional setup.

The command line is the same as for normal perf; just in the ‘-e’ argument you can also use Intel events. ocperf list outputs all the additional events.

# perf list | wc -l
544

# ocperf.py list | wc -l
1244

(the actual numbers will vary based on system setup and CPU)

As you can see, ocperf adds a large number of additional events. I’m not describing all these events here, but the ocperf event list includes a brief description of each. They can be used to analyze a wide variety of performance conditions.

To use them, just use a normal perf command line with ocperf:

Count global remote node accesses

# ocperf.py stat -e offcore_response.demand_data.remote_dram_1 -a sleep 5 

Sample conditional branches

# ocperf.py record -e br_inst_exec.cond my-program 

# ocperf.py report --stdio 

Translate an event into the raw format to use directly with perf

# ocperf.py --print stat -e  DTLB_MISSES.LARGE_WALK_COMPLETED
perf  -e r8049

The r8049 code can be used directly with perf or other tools that accept raw events.

ocperf translates the events and calls perf with the translated events. It also tries to translate them back in the output. This only works for “--stdio” output. When you are using the interactive browser (or the gtk UI) you will see the raw translated events in the output.

Another ocperf feature is to set the recommended Intel sampling period for an event (with -c default). By default perf uses an adaptive sampling period, which may use a lot of additional CPU time and is less predictable. This is only supported on some CPUs.

To set additional perf flags you can use the usual :XXX syntax.

Count all the division operations in the kernel

# ocperf.py stat -e arith.div:k 

ocperf currently only supports the old-style perf attribute syntax (with :xxx), not “cpu//”. This may change in future versions.

ocperf background

Originally ocperf was written just to handle “offcore events” (that is what the “oc” in the name stands for), but these days it is useful for far more.

First I should mention that oprofile recently added an “operf” tool. ocperf is not related to that tool and the name predates it.

Modern CPU cores are very fast at computation, and often spend large parts of their time waiting for something else (memory, IO, other cores). As you can imagine, profiling for that can be fairly important. Since Nehalem, Intel Core CPUs have had special offcore events to distinguish all the different “offcore” cases: L3 hit, memory hit, remote cache hit/miss, etc. There are so many cases that the normal unit mask of a PMU event does not have enough bits to describe them, so separate registers are used instead. Originally perf didn’t know how to program these additional registers, so it couldn’t profile offcore events.

ocperf was a workaround to program these registers directly from user space. This is fixed in recent perf versions (using the offcore_rsp attribute) and is not needed anymore. But ocperf is still quite useful, as it can directly generate all the needed masks from a predefined table. perf has some builtin offcore events, but the set supplied by ocperf is larger and better documented.

And of course it still supports older kernels too, if you are not running the latest and greatest.

These days, in addition to translating events from the Intel event tables, it also provides some additional workarounds, for example for an offcore problem on the Xeon E5 2600 series.

I will write about more pmu-tools features in future posts.

Written by therapsid

July 27th, 2013 at 2:50 am

TSX updates

without comments

Hope everyone who already has a Haswell with TSX is considering playing around with it. Chapter 12 of the optimization manual has a good methodology.

There’s currently a push to solve the last issues in the TSX glibc to get the feature into glibc 2.18, including even the most obscure POSIX requirements. A nice solution has now been found to still provide deadlocks for PTHREAD_MUTEX_NORMAL that doesn’t affect most programs. We also settled on removing support for mutex initializers that disable/enable elision, to improve binary compatibility with old glibcs. This has the nice side effect that glibc can internally ensure that no mutex has an elision flag set when the CPU does not support RTM. This then allows shaving off at least one more check in the pthread_mutex_lock() fast path.

I published an article describing TSX fallback paths. Every RTM transaction needs a working fallback path, and not doing that properly is a common TSX newbie mistake.
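
To make the point concrete for readers who have not used RTM yet, here is a minimal, hypothetical sketch of the pattern (my own illustration, not the article’s code or glibc’s implementation; it uses a hand-rolled test-and-set spinlock as the fallback and builds with gcc -mrtm):

#include <immintrin.h>

static volatile int lock;                 /* 0 = free, 1 = taken */

static void elided_lock(void)
{
    unsigned status = _xbegin();
    if (status == _XBEGIN_STARTED) {
        if (lock == 0)                    /* put the lock into the read set */
            return;                       /* speculate through the critical section */
        _xabort(0xff);                    /* lock is busy: abort to the fallback */
    }
    /* Fallback path: always needed, must work without RTM (aborts, old CPUs). */
    while (__atomic_exchange_n(&lock, 1, __ATOMIC_ACQUIRE))
        while (lock)
            ;                             /* spin until the lock looks free */
}

static void elided_unlock(void)
{
    if (lock == 0)
        _xend();                          /* we elided: commit the transaction */
    else
        __atomic_store_n(&lock, 0, __ATOMIC_RELEASE);
}

The important property is that the fallback acquires a real lock, so the program still makes progress when transactions keep aborting.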

And Roman collected all the available TSX resources in a nice overview page.

Also a new version of PCM is available that supports TSX counting (no abort sampling). It doesn’t need a kernel driver, and should work on all Linux, Mac, and Windows systems. Having some form of performance counting is fairly important for any TSX tuning.

Also a new version of tsx-tools, including HLE and RTM compat headers, has been published.

And the HLE examples in the gcc documentation have finally been fixed in a commit (not yet backported to 4.8). HLE requires the operand size of the acquire and the release to match, and always aborts the transaction if that is not the case. __atomic_clear always casts the argument to bool for obscure reasons, so if the lock variable is not bool the operand size is likely to mismatch. The fix is to use __atomic_store_n instead, which doesn’t cast the pointer.
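
As a minimal sketch of that fix (my own example rather than the exact one from the gcc documentation): keep the lock variable an int on both sides and release it with __atomic_store_n, so the acquire and release operand sizes match. Build with optimization and -mhle, e.g. gcc -O2 -mhle.

static int hle_lock;                      /* int on both acquire and release */

static void hle_acquire(void)
{
    /* xacquire-prefixed exchange: elides the lock when possible */
    while (__atomic_exchange_n(&hle_lock, 1,
                               __ATOMIC_ACQUIRE | __ATOMIC_HLE_ACQUIRE))
        while (__atomic_load_n(&hle_lock, __ATOMIC_RELAXED))
            ;                             /* wait until the lock looks free */
}

static void hle_release(void)
{
    /* __atomic_store_n keeps the int operand size, matching the acquire;
       __atomic_clear would operate on a bool and abort the elision. */
    __atomic_store_n(&hle_lock, 0,
                     __ATOMIC_RELEASE | __ATOMIC_HLE_RELEASE);
}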

There are also some other issues in the HLE intrinsics in 4.8 currently (mostly fixed in 4.9). gcc won’t warn in all cases when it cannot generate HLE for an atomic primitive (e.g. when the primitive does not map to a single x86 instruction). And you still need to enable optimization to use the C++ HLE interface or some more complex __atomic intrinsics, as the gcc backend may otherwise not see the __ATOMIC_HLE_RELEASE flag as a constant.

Right now it is still safer to use the compat headers from tsx-tools which avoid all these problems.

Written by therapsid

June 23rd, 2013 at 10:13 pm

TSX profiling

with one comment

I published a quick overview on how to do TSX profiling with Linux perf: Intel TSX profiling with Linux perf

This is a technical overview that assumes some prior knowledge of profiling. I apologize for the cumbersome title.

Written by therapsid

May 4th, 2013 at 2:53 am

Modern locking

without comments

I wrote a blog post and contributed to a paper on modern locking on Intel Xeon systems. My recent talk on this has also been covered by LWN here (still behind the paywall for a few days).

The summary is more or less: batch your locks. Don’t make critical sections too small. Having the smallest locks is not cool anymore.

Written by therapsid

April 30th, 2013 at 4:21 am

Program tuning as a resource allocation problem

without comments

As the saying goes, every problem in CS can be solved with another level of indirection (except the problem of too much indirection). This often leads to caching to improve performance: remember some results to handle them faster later. Caches are everywhere in computing these days: from the CPU itself (memory caches, TLBs, instruction caches, branch prediction), to databases (query caches, table caches), networking stacks (neighbor caches, routing caches, DNS caches), and IO (directory caches, OS page buffering). When I say cache here I mean this generalized cache concept, not just the caches of the CPU.

Caches usually improve performance. But only if your cache hit rates are high enough (and the cache latency low enough, but that’s a different discussion). So if you thrash the caches and the hit rate becomes very low, performance suffers. This is a problem that theoretical CS algorithm analysis largely ignored for a long time, but there has been some work on it recently (like cache-oblivious algorithms).

However this is only for the CPU caches, not for all the other caches that exist in a modern programming environment.

A cache is a shared resource in a program. Typically programs consist of many subsystems and libraries, likely written by different developers, and they all impact the various caches. Calling a library means that whoever wrote that library shares cache resources with you.

Now when tuning some sub-function you often have a trade-off between simplicity and performance. A common technique to improve performance is to replace computation with table lookups (despite the memory wall it is often still the fastest way). This will (usually) improve performance, but increases the cache footprint if the input values vary enough to cover larger parts of the table. Or you could add a cache, which is essentially an automatic table lookup. This will make the function faster, but also increase its footprint in the shared cache resources.
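
As a toy illustration of that trade-off (my own example, unrelated to any particular library): counting set bits with a 64 KB lookup table is cheap per call but drags a large table through the data caches when the inputs vary, while the pure computation touches no data at all.

#include <stdint.h>
#include <stdio.h>

static uint8_t popcnt_table[1 << 16];     /* 64 KB of data cache footprint */

static void init_table(void)
{
    for (unsigned i = 0; i < (1u << 16); i++)
        popcnt_table[i] = (uint8_t)__builtin_popcount(i);
}

static int popcnt_lookup(uint32_t x)      /* fast, but cache-hungry if x varies */
{
    return popcnt_table[x & 0xffff] + popcnt_table[x >> 16];
}

static int popcnt_compute(uint32_t x)     /* more instructions, no data footprint */
{
    x = x - ((x >> 1) & 0x55555555);
    x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
    return (int)((((x + (x >> 4)) & 0x0f0f0f0f) * 0x01010101) >> 24);
}

int main(void)
{
    init_table();
    printf("%d %d\n", popcnt_lookup(0xdeadbeef), popcnt_compute(0xdeadbeef));
    return 0;
}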

As a side note, anything with a table lookup is much harder to tune in a micro benchmark. If the footprint of a function is data-independent we can just call the function in a loop with the same input and measure the total time, divided by the number of iterations. But if the footprint is data-dependent we always need a representative input set, otherwise the unrealistic cache hit rate on the table skews the results badly.

Even if no table lookup is used, more complex logic will likely have more overhead in the instruction caches and branch prediction caches. So common optimizations often increase the footprint. But when the total program already has a large footprint, increasing it further may cause other time-critical subsystems to start thrashing their working sets.

Similar reasoning applies to other kinds of caches: a database sub-function may thrash the database engine’s query or table cache. A network subsystem may thrash the DNS cache. A 3D sub-function may thrash the 3D stack’s JIT code cache.

So you could say that the footprint of a function should not be higher than the proportion of the runtime it accounts for. If there are free resources it may be reasonable to take more than that, but even then it may be better to be frugal, because the increased resource usage will not pay off in better performance. Using fewer resources may also save power or leave more resources for other programs sharing the same system (for power that’s not always true: having better performance may help you in the race to idle).

So we may end up with an interesting trade-off when tuning a function. We have the choice between different algorithms with different resource footprints (e.g. table lookup vs computation). But the choice of which one to use depends on the rest of the program: whether the function is time critical enough and how many resources are used elsewhere.

So this is a nasty problem. Instead of being able to tune each function individually we actually have a global optimization problem over the whole program. The problem of resource allocation is non-decomposable. That is, when you change one small piece in a big program it may affect everyone else. When the program changes and increases resource usage somewhere, a lot of resource allocation may need to be redone to re-balance, which may include changing the algorithms of some old subsystem.

This is especially a problem for library functions, where the library author doesn’t really know what the calling program does. And on the other hand the library user didn’t write the library and doesn’t know what the library author optimized for.

One way would be to offer multiple variants optimized for different resource consumptions, so that the user can choose. This would be similar to how the algorithms in the C++ STL are tagged with their algorithmic complexity for different operations. One could imagine a library of algorithms that are tagged with their resource overhead. These tags would likely need to be equations based on the input set size. Also, since there are multiple resources (e.g. memory hierarchy, branch predictor, other software caches), it may need multiple indicators.
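
A hypothetical sketch of what such tagging could look like (all names, numbers, and formulas here are invented for illustration): each variant advertises a rough data footprint as a function of the input size, and the caller picks one that fits its cache budget.

#include <stddef.h>
#include <stdio.h>

struct variant {
    const char *name;
    size_t (*data_footprint)(size_t n);   /* estimated bytes touched */
};

static size_t table_footprint(size_t n)   { return 65536 + n * sizeof(int); }
static size_t compute_footprint(size_t n) { return n * sizeof(int); }

/* Ordered fastest first; pick the first variant that fits the budget. */
static const struct variant variants[] = {
    { "table-lookup", table_footprint },
    { "computed",     compute_footprint },
};

static const struct variant *pick(size_t n, size_t cache_budget)
{
    for (size_t i = 0; i < sizeof(variants) / sizeof(variants[0]); i++)
        if (variants[i].data_footprint(n) <= cache_budget)
            return &variants[i];
    return &variants[sizeof(variants) / sizeof(variants[0]) - 1];
}

int main(void)
{
    printf("chose: %s\n", pick(1000, 32 * 1024)->name);
    return 0;
}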

Modern caches are generally so complex that they are nearly impossible to analyze “on paper” (especially when you have multiple levels interacting), and need measurements instead.

So developing metrics for this would be interesting. Since the costs are paid elsewhere (you cause a cache miss somewhere else) it is hard to associate the two by classic profiling. We have no way to directly monitor the various caches. For software caches this could be improved by adding better tracing capabilities, but for hardware it’s harder.

One way would be subtractive monitoring, similar to cyclesoak. You run a “cachesoak” in parallel to the program; it touches memory and, using time measurements, measures how much of its working set has been displaced by another program running on the same core.
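
A rough sketch of such a probe (my own illustration; the buffer size, stride, and timing method are arbitrary choices): repeatedly walk a fixed working set and time the walk, so that displacement by the program under test shows up as longer walk times.

#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define WSIZE (256 * 1024)                  /* probe working set: 256 KB */

static volatile uint8_t probe[WSIZE];

static double walk_seconds(void)
{
    struct timespec a, b;
    unsigned sum = 0;

    clock_gettime(CLOCK_MONOTONIC, &a);
    for (size_t i = 0; i < WSIZE; i += 64)  /* one touch per cache line */
        sum += probe[i];
    clock_gettime(CLOCK_MONOTONIC, &b);
    (void)sum;
    return (double)(b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) * 1e-9;
}

int main(void)
{
    for (;;) {                              /* run pinned to the same core as the workload */
        printf("%.9f\n", walk_seconds());   /* longer walks = more displacement */
        fflush(stdout);
    }
}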

This technique has been used for attacking cryptographic algorithms, by exploiting the cache access patterns of table lookups in common ciphers, and, with OS paging, for password attacks starting in the 70s.

One could also use this subtractive technique for other software caches. For example, for a database, run a background thread that accesses the database in a controlled way and measures access times. The challenge of course would be to understand the caching behavior of the database well enough to get useful data.

This all still doesn’t tell you where the resource consumption happens. So to measure the resource consumption of each subsystem in a complex program you would need to run each of them individually against a cyclesoak test, ideally with a representative input data set, otherwise you may get unrealistic access patterns.

I mostly talked about memory hierarchy caches here, but of course this applies to other caches too. An interesting case is branch prediction caches in CPUs. A mispredicted branch can take nearly an order of magnitude more time than a predicted branch. Indirect branches need more resources than conditional branches (a full address versus a single bit). I wrote about optimizing conditional branches earlier. So a less complex algorithm may be slower, but use fewer resources.

It would be interesting to develop a metric for branch prediction resource consumption. One way may be to use a variant of cyclomatic complexity, but focused only on the hot paths of the program, using profile feedback.

So overall, allocating cache resources to different subsystems is hard. It would be good if we had better tools for this.

Written by therapsid

December 30th, 2012 at 8:58 pm