pmu-tools, part II: toplev

In part 1 I gave an introduction to pmu-tools and described ocperf, which allows low-level access to the Intel-defined CPU performance counter events.

toplev introduction

This part describes another component of pmu-tools: toplev. toplev builds on top of ocperf, but works at a much higher level.

perf record defaults to cycle sampling. Cycle sampling can tell you roughly which part of the workload is taking up CPU time. It cannot directly tell you why it is slow. If you have psychic super powers you may be able to figure it out from the source code. If not, other measurements can help to narrow down the performance bottleneck.

ocperf has a lot of low-level events to sample or count specific conditions, but using them requires some knowledge of the CPU to select the right events.

Another approach is to just count specific events. Many parts of the CPU have “stall cycles” counters, that is, they can count how long they are waiting for something. Dividing a stall count by the total number of cycles gives a “stall ratio”.

The standard “perf stat” displays such ratios as “stalled-cycles-frontend” (for the part of the CPU that decodes instructions) and “stalled-cycles-backend” (for the part that actually executes them).
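The stall counters can also be selected explicitly (./myprogram here is just a placeholder for your own workload):

% perf stat -e cycles,stalled-cycles-frontend,stalled-cycles-backend ./myprogram

The stall ratios are then simply the stalled-cycles counts divided by the cycles count.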

This assumes a very simplified CPU model. Modern out-of-order CPUs execute many instructions in parallel and try to execute something else during stall times. A stall is only a performance problem if it actually bottlenecks the execution, that is, if there is nothing else to do that could hide it.

Also some workloads simply don’t perform certain operations very often (for example, a workload that fits into the L1 cache does not perform many memory operations), so evaluating the stall cycles of the memory subsystem may not be very useful: the stalls only occur for rare events.

So just looking at isolated ratios is not necessarily useful.

To avoid this problem we can compute a larger number of ratios for different units in the CPU, and then define a hierarchy of thresholds that determines whether a specific ratio is meaningful or not.
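To illustrate the idea, here is a minimal sketch of such a threshold hierarchy in Python (my own illustration with made-up names and numbers, not toplev’s actual code): a ratio is only reported, and its children only examined, when it crosses its threshold.

# Minimal sketch of a threshold-gated metric hierarchy.
# All names, ratios and thresholds are made up for illustration.
class Node:
    def __init__(self, name, ratio, threshold, children=()):
        self.name = name            # metric name
        self.ratio = ratio          # measured ratio, 0.0 .. 1.0
        self.threshold = threshold  # minimum ratio to be meaningful
        self.children = children

def report(node, depth=0):
    # Only print a node, and descend into its children,
    # when the node crosses its own threshold.
    if node.ratio < node.threshold:
        return
    print("%s%-16s %5.1f%%" % ("  " * depth, node.name + ":", 100.0 * node.ratio))
    for child in node.children:
        report(child, depth + 1)

report(Node("Backend Bound", 0.72, 0.2, (
    Node("Memory Bound", 0.43, 0.2),    # above threshold: shown
    Node("Core Bound", 0.19, 0.2))))    # below threshold: suppressed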

This is described as the “Top Down” methodology in section B.3.2 of the Intel optimization manual. More information on TopDown is in this article or in Ahmad Yasin’s ISCA workshop presentation. I didn’t invent it, I’m just implementing it.

The toplev tool in pmu-tools implements this methodology. It uses counting, not sampling, which means it can only tell you “what”, but not “where exactly in the program”. If interval mode is used (-I1000) it can also give a very rough “when”.

how toplev works

toplev automatically runs perf stat with the right counters, computes the thresholds, and only displays meaningful bottlenecks. toplev defaults to a simple 5-event model that already gives some useful information for Intel Core CPUs since Sandy Bridge.

[Figure: simple model]

The simple model has the advantage that it fits into the standard 4 performance counters without multiplexing, which makes it more reliable. More on that later.

For specific CPUs there is also a more detailed model available (enabled with -d).

[Figure: detailed model]

The detailed model is a tree of different levels. The first level corresponds to the simple model. Additional levels (max 4, default 2, selected with the -l option) can be used to narrow down specific issues further by going down the tree. Each level is only meaningful if its parent crossed its threshold.

The detailed model requires running many more events to compute all the needed ratios. Since the CPU only has 4 (or 8 with HyperThreading off) general performance counters available, perf needs to multiplex (that is, regularly re-program) the counters, which adds measurement errors.
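When counters are multiplexed, perf extrapolates each count from the fraction of time the event was actually programmed on the hardware:

final_count = raw_count * time_enabled / time_running

This extrapolation is only exact if the workload behaves the same during the measured and unmeasured periods, which is where the error comes from.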

In general the lower levels are less reliable than the higher levels and should be taken with a grain of salt. But up to level 2 generally works well.

Examples

First set up pmu-tools if you haven’t yet.

% git clone https://github.com/andikleen/pmu-tools 

% cd pmu-tools 

% export PATH=$PATH:$(pwd) 

Let’s try a memory bound program. The STREAM benchmark is very memory bound. We use the simple (single threaded, not terribly optimized) version from numademo.

% toplev.py numademo  100M stream 
... 
perf stat --log-fd 4 -x, -e {r100030d,r2c2,r19c,r10e,cycles} numademo 100M stream     
... 
Backend Bound:                              72.33%     
    This category reflects slots where no uops are being delivered due to a lack     
    of required resources for accepting more uops in the Backend of the pipeline. 

Let’s look a bit closer with a level 2 detailed model:

% toplev.py -d -l2 numademo  100M stream 
... 
perf stat --log-fd 4 -x, -e 
{r3079,r19c,r10401c3,r100030d,rc5,r10e,cycles,r400019c,r2c2,instructions}
{r15e,r60006a3,r30001b1,r40004a3,r8a2,r10001b1,cycles} 
numademo 100M stream 
... 
BE      Backend Bound:                      72.03%     
    This category reflects slots where no uops are being delivered due to a lack     
    of required resources for accepting more uops in the Backend of the pipeline. 
BE/Mem  Memory Bound:                       43.18%     
    This metric represents how much Memory subsystem was a bottleneck. 
BE/Core Core Bound:                           18.90%     
    This metric represents how much Core non-memory issues were a bottleneck. 
RET     BASE:                               24.76%     
    This metric represents slots fraction CPU was retiring uops not originated     
    from the microcode-sequencer. 

So we’re memory bound as expected, but it’s only part of the problem.

With a level 3 measurement we can look even further. As you can see, the underlying perf command line already gets really complicated for this; a tool like toplev is really needed to set it up.

% toplev.py -d -l3 numademo  100M stream 
... 
perf stat --log-fd 4 -x, -e 
{r2ab,r19c,r2c2,r485,r480,r400019c,r187,cycles,r114,instructions},
{r4001879,r1002479,r40001a8,r4002479,r50005a3,r1001879,r10001a8,cycles,r12000ca3},
{r3079,r2c2,r20d1,r100030d,r10e,r50005a3,r4d1,cycles,r19c},
{r60006a3,cycles,r45f,r12000ca3,r8408},{r2c2,r10401c3,r100030d,rc5,r10e,cycles},
{r15e,r10401c3,r1fe6,rc5,r184015e,r480,cycles},
{r15e,r60006a3,r30001b1,r40004a3,r8a2,r10001b1,cycles,r114},
{r211,r8010,r4010,r1010,r1b1,r110,r111,r2010},
r211,r8010,r4010,r3079,r1010,r1b1,r110,r111,r2010 numademo 100M stream 
... 
BE      Backend Bound:                      71.58%     
    This category reflects slots where no uops are being delivered due to a lack     
    of required resources for accepting more uops in the Backend of the pipeline. 
BE/Mem  Memory Bound:                       43.66%     
    This metric represents how much Memory subsystem was a bottleneck. 
BE/Mem  L1 Bound:                           33.26%     
    This metric represents how often CPU was stalled without missing the L1 data     
    cache. 
BE/Core Core Bound:                         19.24%     
    This metric represents how much Core non-memory issues were a bottleneck. 

BE/Core Ports Utilization:                  19.24%     
    This metric represents cycles fraction application was stalled due to Core     
    non-divider-related issues. 
RET     BASE:                               25.08%     
    This metric represents slots fraction CPU was retiring uops not originated     
    from the microcode-sequencer. 
RET     OTHER:                              87.89%     
    This metric represents non-floating-point (FP) uop fraction the CPU has     
executed. 

This shows that numademo’s STREAM actually consists of more loads/stores than floating point operations. It’s not a really optimized version.
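That matches what the main STREAM kernel does: the triad loop a[i] = b[i] + s*c[i] performs three memory operations (two loads and one store) for every two floating point operations (one multiply, one add).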

And finally a “real workload”: a kernel build with gcc. gcc has a lot of code, so the CPU’s instruction decoding frontend becomes a bottleneck, partly caused by branch mispredictions (which cause the frontend to do extra work). This data is averaged over 4 cores.
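The exact invocation isn’t shown here; it was presumably something along the lines of (the -- keeps make’s own options away from toplev’s option parser):

% toplev.py -d -l2 -- make -j4 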


FE      Frontend Bound:                     54.07%     
    This category reflects slots where the Frontend of the processor undersupplies     
    its Backend. 
FE      Frontend Latency:                   39.53%     
    This metric represents slots fraction CPU was stalled due to Frontend latency     
    issues. 
BAD     Bad Speculation:                    11.75%     
    This category reflects slots wasted due to incorrect speculations, which     
    include slots used to allocate uops that do not eventually get retired and     
    slots for which allocation was blocked due to recovery from earlier incorrect     
    speculation. 
BAD     Branch Mispredicts:                 11.66%     
    This metric represents slots fraction CPU was impacted by Branch     
    Missprediction. 
RET     BASE:                               25.74%     
    This metric represents slots fraction CPU was retiring uops not originated     
    from the microcode-sequencer. 

Some caveats with TopDown

The TopDown approach only works for CPU bound workloads. If the program’s performance is limited by something else (for example waiting for IO or blocking for other reasons), other methods need to be used.

The lower levels of the measurement tree are less reliable than the higher levels. They also rely on counter multiplexing and cannot use groups, which can cause larger measurement errors with non-steady-state workloads.

(If you don’t understand this terminology: it means the measurements are much less accurate, and the method works best with programs that primarily do the same thing over and over.)

It’s recommended to measure the workload only after the startup phase, by using interval mode or by attaching later.
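For example, with interval mode the startup phase can simply be skipped when reading the results:

% toplev.py -I1000 -d -l2 numademo 100M stream 

prints one set of measurements per second, so the early intervals that cover initialization can be ignored.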

Level 1 (running without -d) is generally the most reliable. The lower tree levels have larger measurement errors. Level 2 usually also works well; levels 3 and 4 can have some mismeasurements.

One of the events (used even by level 1) requires a recent enough kernel that understands its counter constraints. 3.10+ is safe.

Update 2013/07/28: Add links to other reference material on TopDown. Change “much less reliable” to “less reliable”.
