Modern CPUs are quite complicated, and understanding their performance often requires profiling. CPUs have performance monitoring units (PMUs) that can count and sample a wide variety of events. Linux perf provides an interface to the PMU. It was designed to provide an abstracted view of the PMU events, and offers a limited number of abstracted events for common situations. In addition it has an interface to access all the raw events. pmu-tools is my toolkit to make access to these raw events more user-friendly on Intel CPUs, and to provide some additional functionality. It is not really a replacement for perf, just an addition. If the abstracted perf events work there is no need to use pmu-tools, but there are some situations where additional events are useful. It can also be useful for experimenting: if a “raw” pmu-tools use case proves useful, it may later move into “abstracted” perf.
pmu-tools has a number of components: several wrappers for perf, and some C libraries for programs. I’ll describe these different components in a number of posts.
# git clone git://github.com/andikleen/pmu-tools
# cd pmu-tools
pmu-tools currently has no installer. I just run the tools from the source directory.
# export PATH=$PATH:$(pwd)
The first (and original) component of pmu-tools is ocperf. ocperf is a wrapper around the perf command line program that translates events from the full Intel event lists to perf format, and does some additional setup.
The command line is the same as for normal perf, except that the ‘-e’ option also accepts Intel event names. ocperf list outputs all the additional events.
# perf list | wc -l
544
# ocperf.py list | wc -l
1244
(the actual numbers will vary based on system setup and CPU)
As you can see, ocperf adds a large number of additional events. I’m not describing all these events here, but the ocperf event list includes a brief description of each. They can be used to analyze a wide variety of performance conditions.
To use them, just use a normal perf command line with ocperf:
Count global remote node accesses
# ocperf.py stat -e offcore_response.demand_data.remote_dram_1 -a sleep 5
Sample conditional branches
# ocperf.py record -e br_inst_exec.cond my-program
# ocperf.py report --stdio
Translate an event into the raw format to use directly with perf
# ocperf.py --print stat -e DTLB_MISSES.LARGE_WALK_COMPLETED
perf -e r8049
The r8049 code can be used directly with perf or other tools that accept raw events.
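To make the raw format a little less opaque, here is a minimal sketch of how such a code is put together, assuming the simple layout where the event code sits in the low byte and the unit mask in the next byte (so r8049 reads as event 0x49 with unit mask 0x80). The function names are mine, not part of perf or ocperf, and events that need extra fields are not covered.

```python
from typing import Tuple


def raw_event(event_code: int, umask: int) -> str:
    """Build a perf raw event string from an event code and a unit mask."""
    # Assumed layout: unit mask in bits 8-15, event code in bits 0-7.
    config = (umask << 8) | event_code
    return f"r{config:x}"


def split_raw_event(raw: str) -> Tuple[int, int]:
    """Inverse of raw_event: return (event_code, umask)."""
    config = int(raw[1:], 16)  # drop the leading "r", parse the rest as hex
    return config & 0xFF, (config >> 8) & 0xFF


print(raw_event(0x49, 0x80))  # prints r8049

event, umask = split_raw_event("r8049")
print(hex(event), hex(umask))  # prints 0x49 0x80
```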
ocperf translates the events and calls perf with the translated events. It also tries to translate them back in the output. This only works for “--stdio” output; when you are using the interactive browser (or the gtk UI) you will see the raw translated events in the output.
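The translate-and-translate-back idea can be sketched in a few lines. This is only an illustration under my own assumptions, not how ocperf is actually written: the one-entry table and both function names are hypothetical, and the real tool works from the full Intel event lists.

```python
import re

# Hypothetical one-entry table; the real tool reads the full Intel event lists.
EVENT_TABLE = {"dtlb_misses.large_walk_completed": "r8049"}
REVERSE_TABLE = {raw: name for name, raw in EVENT_TABLE.items()}


def translate_events(event_list: str) -> str:
    """Translate a comma-separated -e argument, leaving unknown events alone."""
    events = event_list.split(",")
    return ",".join(EVENT_TABLE.get(e.lower(), e) for e in events)


def translate_output(text: str) -> str:
    """Replace known raw codes in text output with their symbolic names."""
    def lookup(match):
        # Anything that is not a known raw code is passed through unchanged.
        return REVERSE_TABLE.get(match.group(0), match.group(0))
    return re.sub(r"\br[0-9a-f]+\b", lookup, text)


print(translate_events("DTLB_MISSES.LARGE_WALK_COMPLETED,cycles"))
print(translate_output("1234 r8049"))
```

The backward step only works on text that passes through the wrapper, which is why an interactive UI that draws its own screen still shows the raw codes.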
Another ocperf feature is setting the recommended Intel sampling period for an event (with -c default); this is only supported on some CPUs. By default perf uses an adaptive sampling period, which may use a lot of additional CPU time and is less predictable.
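To see why a fixed period is more predictable, it helps to remember that with a fixed period one sample is taken every N occurrences of the event, so the number of samples follows directly from the event count. A small worked example, with a helper name and numbers that are purely illustrative (they are not recommended periods for any real event):

```python
def expected_samples(event_count: int, period: int) -> int:
    """Number of samples taken when one sample fires every `period` events."""
    if period <= 0:
        raise ValueError("period must be positive")
    return event_count // period


# Illustrative numbers only: 2 billion events sampled every 2 million events.
print(expected_samples(2_000_000_000, 2_000_000))  # prints 1000
```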
To set additional perf flags you can use the usual :XXX syntax.
Count all the division operations in the kernel
# ocperf.py stat -e arith.div:k
ocperf currently supports only the old-style perf attribute syntax (with :xxx), not “cpu//”. This may change in future versions.
Originally ocperf was written only to handle “offcore events” (that is what the oc in the name stands for), but these days it is useful for far more.
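As a rough sketch of what handling the :xxx syntax involves, a wrapper can split the flags off, translate only the name, and put the flags back. The table and the function name below are hypothetical and only meant to show the idea; this is not ocperf's actual code.

```python
# Hypothetical one-entry table, for illustration only.
EVENT_TABLE = {"dtlb_misses.large_walk_completed": "r8049"}


def translate_with_modifier(event: str) -> str:
    """Translate the name part of 'name:flags' and keep the flags as they are."""
    name, colon, flags = event.partition(":")
    raw = EVENT_TABLE.get(name.lower(), name)  # unknown names pass through
    return raw + colon + flags


print(translate_with_modifier("dtlb_misses.large_walk_completed:k"))  # prints r8049:k
print(translate_with_modifier("cycles"))  # prints cycles
```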
First I should mention that oprofile recently added an “operf” tool. ocperf is not related to that tool and the name predates it.
Modern CPU cores are very fast at computation, and often spend large parts of their time waiting for something else (memory, IO, other cores). As you can imagine, profiling for that can be fairly important. Since Nehalem, Intel Core CPUs have had special offcore events to distinguish all the different “offcore” cases: L3 hit, memory hit, remote cache hit/miss, etc. There are so many cases that the normal unit mask of a PMU event does not have enough bits to describe them, so separate registers are used instead. Originally perf didn’t know how to program these additional registers, so it couldn’t profile offcore events.
ocperf was a workaround to program these registers directly from user space. This is fixed in recent perf versions (using the offcore_rsp attribute), so the workaround is no longer needed. But ocperf is still quite useful as it can directly generate all the needed masks from a predefined table. perf has some builtin offcore events, but the set supplied by ocperf is larger and better documented.
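To give a feeling for what generating a mask from a table means, here is a minimal sketch that ORs together request and response bits and formats them as an event with an offcore_rsp attribute. Everything in it is an assumption for illustration: the bit positions are placeholders rather than values from the Intel tables, the event and umask values assume the first offcore response event, the cpu/.../ form assumes a recent perf, and the function name is mine.

```python
from typing import Iterable

# Placeholder bit positions (assumed); real values depend on the CPU model
# and come from the Intel event tables.
REQUEST_BITS = {"demand_data_rd": 1 << 0}
RESPONSE_BITS = {"remote_dram": 1 << 13, "local_dram": 1 << 14}


def offcore_event(requests: Iterable[str], responses: Iterable[str]) -> str:
    """Combine request and response bits into one offcore_rsp event string."""
    mask = 0
    for name in requests:
        mask |= REQUEST_BITS[name]
    for name in responses:
        mask |= RESPONSE_BITS[name]
    # event=0xb7,umask=0x1 is assumed to select the first offcore response event.
    return f"cpu/event=0xb7,umask=0x1,offcore_rsp={mask:#x}/"


print(offcore_event(["demand_data_rd"], ["remote_dram"]))
```

The point of the separate mask is that request and response types combine freely, which is far more combinations than fit into an 8-bit unit mask.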
And of course it still supports older kernels too, if you are not running the latest and greatest.
These days, in addition to translating events from the Intel event tables, ocperf also provides some additional workarounds, for example for an offcore problem on the Xeon E5 2600 series.
I will write about more pmu-tools features in future posts.