Archive for December, 2010
Big kernel lock semantics
My previous post on BKL removal received some comments about confusing and problematic semantics of the big kernel log. Does the BKL really have weird semantics?
Again some historical background: the original Unix and Linux kernels were written for single processor systems. They employed a very simple model to avoid races accessing data structures in the kernel. Each process running in kernel code owns the CPU until it yields explicitly (I ignore interrupts here, which complicate the model again, but are not substantial to the discussion). That’s a classical cooperative multi-tasking or coroutine model. It worked very well, after all both Unix and Linux were very successful.
Coroutines are a established programming model: in many scripting languages they are used to implement iterators or generators. They are also used for lots of other purposes, with many libraries being available to implement them for C.
Later when Linux (and Unix) were ported to multi-processor systems this model was changed and replaced with explicit locks. In order to convert the whole kernel step by step the big kernel lock was introduced which emulates the old cooperative multi-tasking model for code running under it. Only a single process can take the big kernel lock at a time, but it is automatically dropped when the process sleeps and reaquired when it wakes up again.
So essentially BKL semantics are not “weird”: they are just classic Unix/Linux semantics.
Good literature on the topic is the older Unix Systems for Modern Architectures book by Curt Schimmel, which provides a survey of the various locking schemes to convert a single processor kernel into a SMP system.
When a subsystem is converted from the BKL it is often replaced by a mutex code lock. A single mutex protects the complete subsystem. This locking model is not fully equivalent with the BKL: the BKL is dropped when code sleeps, while the mutex will continue serializing even over the sleep region. Often this doesn’t matter: a lot of common kernel utility functions may sleep, but in practice do only rarely in exceptional cases. But sometimes sleeping is actually common — for example when doing disk IO — and in this case the straight forward mutex conversion can seriously limit parallelism. The old BKL code was able to run the sleeping regions in parallel. The mutex version is not. For example I believe this was a performance regression in the early versions of Frederic Weisbecker’s heroic reiserfs BKL conversion.
With all this I don’t want to say that removing the BKL is bad: I actually did some own work on this and it’s generally a good thing. But it does not necessarily give you immediate advantages and may even result in slower code at first.
Removing the big kernel lock. A big deal?
There’s a lot of hubbub recently about removing the big kernel lock completely in Linux 2.6.37. That is it’s now possible to compile a configuration without any BKL use. There is still some code depending on the BKL,
but it can be compiled out. But is that a big deal?
First some background: Linux 2.0 was originally ported to SMP the complete kernel ran under a single lock, to preserve the same semantics for kernel code as on uni-processor kernels. This was known as the big kernel lock.
Then over time more and more code was moved out of the lock (see chapter 6 in LK09 Scalability paper for more details)
In Linux 2.6 kernels very few subsystems actually still rely on the BKL. And most code that uses it is not time critical.
The biggest (moderately critical) users were a few file systems like reiserfs and NFS. The kernel lockf()/F_SETFL file locking subsystem was also still using the BKL. And the biggest user (in terms of amount of code) were the ioctl callbacks in drivers. While there are a lot of them (most drivers have ioctls) the number of cycles spent in them tends
to be rather minimal. Usually ioctls are just used for initialization and other comparatively rare jobs.
In most cases these do not really suffer from global locking.
This is not to say that the BKL does not matter at all. If you’re a heavy parallel user of a subsystem or a function that still needs it then it may have been a bottleneck. In some situations — like hard rt kernels — where locking is more expensive it may also hurt more. But I suspect for most workloads this wasn’t already the case for a long time,
because BKL is already rare and has been eliminated from most hot paths a long time ago.
There’s an old rule of thumb that 90% of all program run time is spent in 10% of code (I suspect the numbers are actually even more extreme for typical kernel loads). And that should only tune the 10% that matter and leave the rest of the code alone. The BKL is likely already not in your 10% (unless you’re unlucky). The goal of kernel scalability is to make those 10% run in parallel on multiple cores.
Now what happens when the BKL is removed from a subsystem? Normally the subsystem gains a code lock as the first stage: that is a private lock that serializes the code paths of the subsystem. Now using code locks is usually considered
bad practice (“lock data, not code”). Next come object locks (“lock data”) and then sometimes read-copy-update and other fancy lock-less tricks. A lot of code gets stuck in the first or second stages of code or coarse-grained data locking.
Consider a workload that tasks a particular subsystem in a highly parallel manner. The subsystem is most
of the 90% of kernel runtime. The subsystem was using the BKL and gets converted to a subsystem code lock.
Previously the parallelism was limited by the BKL. Now it’s limited by the subsystem code lock. If you don’t have
other hot subsystems that also use the BKL you gain absolutely nothing: the serialization of your 90% runtime is still there, just shifted to another lock. You would only gain if there’s another heavy BKL user too in your workload, which is unlikely.
Essentially with code locks you don’t have a single big kernel lock anymore, but instead lots of little BKLs.
As a next step the subsystem may converted to more finegrained data locking. For examples if the subsystem has a hash table to look up objects and do something with them (that’s a common kernel code pattern) you may have locks per hash bucket and another lock for each object. Objects can be lots of things: devices, mount points, inodes,
directories, etc.
But here’s a common case: your workload only uses one object (for example all writes to the same SCSI disk) and that code is the 90% runtime code. Now again your workload will hit that object lock all the time and it will limit parallelism.
The lock holding time may be slightly smaller, but the fundamental scaling costs of locking are still there. Essentially you will have another small BKL for that workload that limits parallelism.
The good news is that in this case you can at least do somethingabout it as kernel user: you could change the workloads to spread the accesses out to different objects: for example use different directories or different devices. In some cases that’s practical, in others not. If it’s not you still have the same problem. But it’s already a large step in the right
direction.
Now a lot of the BKL conversions recently were simply code locks. About those I must admit I’m not really excited.
Essentially a code lock is not that much better as the BKL. What’s good is data locking and other techniques. There
are a few projects that do this like Nick Piggin’s work on VFS layer scalability, Jens Axboe’s work on block layer scalability, the work on speculative page fault and lots of others. I think those will have much more impact on real kernel scalability than the BKL removal.
So what’s left: The final BKL removal isn’t really a big step forward for Linux. It’s more a symbolic gesture, but I prefer to leave those to politicians and priests.
The 2.6.35.10 longterm Linux kernel has been released
2.6.35.10 is out. Since it contains security fixes all 2.6.35 users are encouraged to update. Get it at
ftp://ftp.kernel.org/pub/linux/kernel/v2.6/longterm/v2.6.35/ or from git://git.kernel.org/pub/scm/linux/kernel/git/longterm/linux-2.6.35.y.git
Full release notes.
Thanks to all contributors and patch reviewers.
Linux kernel 2.6.35 longterm maintenance
After Greg Kroah-Hartmann stopped maintaining 2.6.35-stable. I plan to maintain the 2.6.35 linux kernel tree longterm for now. Thanks to Greg for all his hard work on this.
The primary consumers of 2.6.35 right now are Meego and the CE Linux Forum, but some others are interested and everyone who wants a long term stable tree is of course free to use it.
The longterm tree continues after Greg’s 2.6.35.9 stable release. It will not be called “stable”, but called “longterm” to make the distinction clear.
longterm has no defined end date (but also no guarantee that it will be around forever) and will be maintained longer than normal stable kernels. The existing 2.6.32 stable kernel mainteind by Greg will be also named “longterm” now. There is also a 2.6.34 longterm kernel maintained by Paul Gortmaker.
The longterm kernels will move into new directories on kernel.org. The 2.6.35 releases will be available at ftp://ftp.kernel.org/pub/linux/kernel/v2.6/longterm/v2.6.35/. The other longterm kernels (.32 and .34) will get similar directories. The git tree for the release is available in git://git.kernel.org/pub/scm/linux/kernel/git/longterm/linux-2.6.35.y.git
The 2.6.35 longterm tree follows the stable rules in Documentation/stable_kernel_rules.txt
My plan is to look at all patches that go into later stables (that is which are sent to stable@kernel.org or marked Cc: stable@kernel.org in the git description) and merge them into 2.6.35 when applicable. I don’t expect there will be many
patches only for 2.6.35 and not for later stable trees. If there are and they are submitted to stable@kernel.org I will consider them. I plan to be fairly conservative.
I will follow a similar flow as Greg: There will be candidate trees which are posted to Linux kernel and have a 48h review period, with patches being dropped then as needed. Then the candidates will turn into releases, if noone objects to the patches.
The current candidate is 2.6.35.10-rc1. This will soon turn into 2.6.35.10.
The patches before a release are maintained in a quilt queue, which is available in a git repository at kernel.org. After that they migrate to the 2.6.35.y git tree.
Dependencies
Chris Wilson pointed me to a fix for the xterm font rendering problems I’ve been seeing on my Intel graphics laptop with Fedora 14. The fonts on each new xterm are corrupted until I resize it. I first blamed the Intel X driver (there were other xterm fixes in the past), but according to Chris it’s a compiz problem.
Anyways now I wanted to rebuild compiz-core to include the patch and see if it helps. First it took some time to figure out how to install source rpms with yum (summary: yum doesn’t support it; I guess I’m spoiled from zypper). Then
I ask yum to install all the -devel rpms which are dependencies for compiz-core’s source rpm. I got a full screen full of dependencies, culminating in yum asking me to install mysql-server. Yes, mysql-server to build a simple source rpm.
I guess it was taken in through some other dependencies, but installing a full database server for building some rpm is just crazy. I balked at that.
I guess I need to find now a script similar to SUSE’s build that builds in chroot or build it in build service
Fedora is not the only distribution with dependency hell. Some years ago I found out that one openSUSE released had a dependency from fvwm2 to gnome-print (which pulled in a lot of the rest of gnome)
Update: Fedora fixed this now with an update (see bug 614542) I’m really glad that the feels like 10 GB updates every day actually contain at least one valuable fix.