Andi Kleen's blog

Tilting at windmills and other endeavors

Archive for the ‘bloat’ Category

Program tuning as a resource allocation problem

without comments

As the saying goes, every problem in CS can be solved with another level of indirection (except the problem of too much indirection) This often leads to caching to improve performance. Remember some results to handle them faster. Caches are everywhere in computing these days: from the CPU itself (memory caches, TLBs, instruction caches, branch prediction), to databases (query caches, table caches), networking stacks (neighbor caches, routing caches, DNS caches) and IO (directory caches, OS page buffering). When I say cache here I mean this generalized cache concept, not just the caches of the CPUs.

Caches usually improve performance. But only if your cache hit rates are high enough (and the cache latency low enough, but that’s a different discussion). So if you thrash the caches and hit rate becomes very low performance suffers. This is a problem that theoretical CS algorithm analysis largely ignored for a long time, but there has some work on this been recently (like cache oblivious algorithms).

However this is only for the CPU caches, not for all the other caches that exist in a modern programming environment.

A cache is a shared resource in a program. Typically programs consist of many subsystems and libraries. They all impact various caches. They are likely written by different developers. Calling a library means whoever wrote that library shares cache resources with you.

Now when tuning some sub function you often have a trade off between simplicity and performance. A common technique to improve performance is to replace computation with table lookups (despite the memory wall it is often still the fastest way). This will (usually) improve performance, but increase the cache foot print if the input values vary enough to cover larger parts of the table. Or you could add a cache, which is essentially an automatic table lookup. This will make the function faster, but also increase its resource foot print in the shared cache resources.

As a side note anything with a table lookup is much harder to tune in a micro benchmark. If the foot print of a function is data independent we can just call the function in a loop with the same input to and measure the total time, divided by the iterations. But if the foot print is data dependent we always need a representative input set, otherwise the unrealistic cache hit rate on the table skews the results badly.

Even if no table lookup is used more complex logic will likely have more overhead in the instruction caches and branch prediction caches. So common optimizations often increase the foot print. But when the total program already has a large foot print increasing the total further may cause other time critical subsystems start thrashing their working set.

Similar reasoning applies to other kinds of caches: a database sub function may thrash the database engine query or table cache. A network subsystem may thrash the DNS cache. A 3d sub function may thrash the 3d stacks JIT code cache.

So you could say that the foot print of a function should not be higher than the proportion of runtime it executes. If there are free resources it may be reasonable to take more than that, but even then it may be better to be frugal because the increase resource usage will not pay off in better performance. Using less resources may also save power or leave more resources for other programs sharing the same system (for power that’s not always true, having better performance may help you in the race to idle)

So we may end up with an interesting trade off when tuning a function. We have the choice between different algorithms with different resource footprints (e.g. table lookup vs computation). But the choice which one to use depends on the rest of the program: whether the function is time critical enough and how much resources are used elsewhere.

So this is a nasty problem. Instead of being able to tune each function individually we actually have a global optimization problem over the whole program. The problem of resource allocation is non decomposable. That is when you change one small piece in a big program it may affect everyone else. When the program changes and increases resource usage somewhere a lot of resource allocation may need to be redone to re-balance, which may include changing algorithms of some old subsystem.

This is especially a problem for library functions where the program author doesn’t really know what the calling program does. And on the other hand the library user didn’t write the library and doesn’t know what the library author optimized for.

One way would be to offer multiple variants optimized for different resource consumptions, so that the user can chose. This would be similar to e.g. the algorithms in the C++ STL are tagged with their algorithmic complexity for different operations. One could imagine a library of algorithms that are tagged with their resource overhead. This likely would need to be equations based on the input set size. Also since there are multiple resources (e.g. memory hierarchy, branch predictor, other software caches) it may need multiple indicators.

Modern caches are generally so complex that they are nearly impossible to analyze “on paper” (especially when you have multiple level interacting), and need measurements instead.

So developing metrics for this would be interesting. Since the costs are paid elsewhere (you cause a cache miss somewhere else) it is hard to associate the two by classic profiling. We have no way to directly monitor the various caches. For software caches this could be improved by adding better tracing capabilities, but for hardware it’s harder.

One way would be subtractive monitoring, similar to cyclesoak. You run a “cachesoak” in parallel to the program and it touches memory and using time measurements measures how much of its working set has been displaced by another program running on the same core.

This technique has been used for attacking cryptographic algorithms, by exploiting the cache access patterns of table lookups in common ciphers, or with OS paging starting in the 70ies for password attacks

One could also use this subtractive technique for other software caches. For example for a database access run a background thread that accesses the database in a controlled way and measures access times. The challenge of course would be to understand the caching behavior of the database well enough to get useful data.

This all still doesn’t tell you where the resource consumption happens. So to measure the resource consumption of each subsystem in a complex program you would need to run each of it individually against a cyclesoak test. And ideally with a representative input data set, otherwise you may get unrealistic access patterns.

I mostly talked about memory hierarchy caches here, but of course this applies to other caches too. An interesting case is branch prediction caches in CPUs. A mispredicted branch can take take nearly an order of magnitude more time than a predicted branch. Indirect branches need more resources than conditional branches (full address versus single bit). I wrote about optimizing conditional branches earlier. So a less complex algorithm may be slower, but use less resources.

It would be interesting to develop a metric for the branch prediction resource consumption. One way may be to use a variant of cyclomatic complexity, but focus on the hot paths of the program only using profile feedback.

So overall allocating cache resources to different sub systems is hard. It would be good if we had better tools for this.

Written by therapsid

December 30th, 2012 at 8:58 pm


without comments

Chris Wilson pointed me to a fix for the xterm font rendering problems I’ve been seeing on my Intel graphics laptop with Fedora 14. The fonts on each new xterm are corrupted until I resize it. I first blamed the Intel X driver (there were other xterm fixes in the past), but according to Chris it’s a compiz problem.

Anyways now I wanted to rebuild compiz-core to include the patch and see if it helps. First it took some time to figure out how to install source rpms with yum (summary: yum doesn’t support it; I guess I’m spoiled from zypper). Then
I ask yum to install all the -devel rpms which are dependencies for compiz-core’s source rpm. I got a full screen full of dependencies, culminating in yum asking me to install mysql-server. Yes, mysql-server to build a simple source rpm.
I guess it was taken in through some other dependencies, but installing a full database server for building some rpm is just crazy. I balked at that.

I guess I need to find now a script similar to SUSE’s build that builds in chroot or build it in build service

Fedora is not the only distribution with dependency hell. Some years ago I found out that one openSUSE released had a dependency from fvwm2 to gnome-print (which pulled in a lot of the rest of gnome)

Update: Fedora fixed this now with an update (see bug 614542) I’m really glad that the feels like 10 GB updates every day actually contain at least one valuable fix.

Written by therapsid

December 14th, 2010 at 11:58 am

Posted in bloat,rants


without comments

This is the recently announced “first usable” distribution of perl6 aka rakudo star. Just out of curiosity I compiled it for x86_64-linux using gcc 4.5.

Interesting results:

% size /usr/bin/perl5.12.1
text    data     bss  dec     hex filename
2778082   13076     584 2791742  2a993e /usr/bin/perl5.12.1
% size ~/src2/rakudo-star-2010.07/install/bin/perl6
text    data     bss     dec     hex filename
27972225  792      16 27973033 1aad5a9 rakudo-star-2010.07/install/bin/perl6

The text segment of perl6 is more than 10 times as big as that of perl 5.

Is it also 10 times as good?

That is progress I guess.

Written by therapsid

August 1st, 2010 at 10:51 am

Posted in bloat,languages,rants