Removing the Big Kernel Lock 222
Corrado writes "There is a big discussion going on over removing a bit of non-preemptable code from the Linux kernel. 'As some of the latency junkies on lkml already know, commit 8e3e076 in v2.6.26-rc2 removed the preemptable BKL feature and made the Big Kernel Lock a spinlock and thus turned it into non-preemptable code again. "This commit returned the BKL code to the 2.6.7 state of affairs in essence," began Ingo Molnar. He noted that this had a very negative effect on the real time kernel efforts, adding that Linux creator Linus Torvalds indicated the only acceptable way forward was to completely remove the BKL.'"
Re:I don't understand (Score:5, Interesting)
Why did they remove the preemptable BKL?
I'm not a kernel developer, but I'd say it's because there's widespread belief that the preemtable BKL is "the wrong way forward". Statements like these lead me to believe this:
In any large software project there's always a path to get from where you are, to where you want to be. It sounds like any version of BKL is considered ugly and causes problems, and patching it just won't work. In other words, fixing this part of the kernel isn't really possible, so they need to start over and change any code that relies on it to rely on something different entirely.
Re:Translation? (Score:3, Interesting)
The answer is to allow sections of code to "lock" access for a brief duration -- "I'm working with this right now, don't anyone else touch it." Simple in theory, very difficult in concept.
Note that I'm speaking generically; I'm not an expert on the Linux kernel. Ideally, though, you want locks to be "granular" -- in other words, you only lock that specific hardware and/or portion of memory that you need exclusive access to. Apparently, the "big kernel lock" takes a brick wall and hammer approach, locking access (and claiming exclusive access during the lock, preventing anything from running). It's not granular.
If I'm wrong, someone else here can correct me. Like I said, I'm not an expert on the Linux kernel.
Comment removed (Score:5, Interesting)
Re:I don't understand (Score:3, Interesting)
Re:Sounds like the Linux kernel needs some tests.. (Score:5, Interesting)
In other words:
TESTS DON'T VERIFY THAT YOUR CODE IS NOT BUGGY. YOU VERIFY THAT YOUR CODE ISN'T BUGGY.
Comment removed (Score:5, Interesting)
Re:Fundamental kernel structures such as this... (Score:3, Interesting)
Time passes, technology changes, and now it's become urgent to deal with this, so now it's being dealt with.
One should, perhaps, wonder what currently unreasonable problem should actually start being addressed RIGHT NOW!! The things I can think of divide neatly into two camps. 1) We don't know enough to even get started, and 2) It really seems utterly implausible, even given this example to work from. Unfortunately, somewhere in there is something that's being overlooked, and I don't know what. Kernel support for Actors? Kernel security to control Actors? Kernel support for Language parsing? They all seem implausible.
What is clearly needed soon is software that facilitates the use of multi-processor environments. Dataflow languages have promise, but there may be other reasonable choices. Possibly some interface that would easily allow different computer languages to work together, but that may be a real impossibility. Or even a language basically like C or C++. but extended with a "foreach" operator that allowed parallel execution of the loop body...but the language would need to be smart enough to tell what needed to be read locked and what needed to be write locked, and what could just be ignored. This implies that use of pointers is *severely* circumscribed! And if you're going to do that, you probably ought to have garbage collection. It might sound like I'm talking about Java, but that would be wrong. This language would need to be close to the metal, so it could adapt itself (at run time!!) to the local machine. And since we want as much efficiency as possible, virtual machines, interpreters, etc. are probably out.
I don't know of any language that meets the specs I've outlined, but I know of many languages that meet large parts of them. Of the languages I know, D (Digital Mars D) comes the closest, but its totally missing on even the parallelization that C/C++ have (as an add-on).
But that doesn't really say where the kernel should be going...except that possibly C isn't the best language to use for a multiprocessor environment. (But C is still the most efficient in most places, and it DOES have add-ons for parallelization...though whether you can use those add-ons in kernel programming isn't something I've investigated.)
Re:Translation? (Score:2, Interesting)
Re:I don't understand (Score:3, Interesting)
The new semaphore implementation gives way in a very fair manor proving that sometimes you can be too fair and Kon was right from the get go (a desktop user's opinion, i don't mean to bait). if something didn't explicitly ask for a time slice, ie: a kernel section was preempted by an interrupt and then queued back in normally, then it doesn't get one until the pope said yes. the BKL as we knew it would allow a section enclosed in [un]lock_kernel to queue jump - at the expense of some extra code trickery.
Sadly the BKL was one of the more demanding locks - if it didn't get what it wanted fast enough, subtle differences in character device operations, which all seem to be strategically linked to the tty and other vfs layers from what I can make of KernelTrap, would work against the overall throughput of the kernel and into usermode (archaic ioctls suffer)
the best "fix" would be to implement the old semaphore types for depreciated BKL-only use (don't export them?) - but the old semaphore code was quite substantial, and required some arch specific implementations which have now... I don't see that much work being reintroduced in it's old form any time soon.
Matt
Re:Fundamental kernel structures such as this... (Score:2, Interesting)
Re:(Performance != Speed) // in an RT system (Score:2, Interesting)
Re:So when can we run QNX-Ubntu? (Score:3, Interesting)
You can actually run X-windows on QNX, although nobody does. QNX has its own GUI system called Photon, which is oriented towards things one might want in an industrial control panel or an entertainment system, like meters, dials, and graphs.
QNX the OS is fine; it's QNX the company that's difficult to deal with. On the other hand, who else has gone up against Microsoft on x86 for 20 years and is still alive?
Re:Translation? (Score:4, Interesting)
This is one approach to deadlocks; it would fall generally under "avoidance".
The problem is that if you have any serious contention over the resources, it is entirely possible that the process will _never_ get the resources (because one of them always gets snatched up before another gets released, so all n resources are never available at once to the requesting process). This leads to starvation and general sadness.
If the system has minimal contention (so the normal case is that all three resources are unclaimed) and resources are held very briefly (so if a resource is taken it is likely to be released before another is taken, anyway) then it may work. In real systems these are hard properties to guarantee.
Also, the scheme requires a process to know in advance which locks it will need. A lot of algorithms may discover this on the fly (e.g. if you are traversing a data structure), which becomes a problem. The best you could hope to do is to lock aggressively--taking everything you might need--but this is ugly, and would tend to violate the conditions above (locking everything will lead to contention, locking everything in advance will lead to holding locks for a long time). Alternatively, if you discover that you need a new lock that you don't yet have, you could give up all the locks you do have and then try to lock again (with the new lock added to the set). This is also ugly and increases the chance of starvation (since now you need to lock a bunch of resources several times). Additionally, since you have to unlock in the middle, the algorithm becomes much more complicated. For example, when you discover you need a new lock, you must put the world in a consistent state before unlocking. And when you re-lock, you must check to make sure that the world hasn't been modified under your feet (which is entirely possible, and may very well cause you to need a still-different lock).
Basically it doesn't work that well.Re:I don't understand (Score:5, Interesting)
The monolithic design is slowly forming a focal point in performance: something has to do a lot of locked switching - if SMP machines could do what they do best and handle IRQs and threads concurrently without waiting for a lock (they're better spent sending/receiving lockless messages), life would be easier on the scalability gurus.
Re:Translation? (Score:2, Interesting)
What would happen if there were two distinct ways of handling the BKL problem, which could be implemented simultaneously: Method number 1, the existing method. This would remain in place to support all the program that use it. Method number 2, a newfangled method that new programs will need to be made aware of. The idea is that "well behaved" programs would, over time, begin to use the second method. The first method would actually use the second method in its implementation, so really there's only one method, although it appears that there are two.
Each resource in the kernel that might potentially be locked receives its own small lock. The idea is that the more finely grained the locking, the less likely it is that more than one program will need the same resource at the same time. The deadlock and resource contention problems will still exist, of course, but they will occur less frequently. Programs that know exactly which locks they'll need over the course of an operation would put in a request for all of those resources. At this moment, the kernel checks if all requested resources are available, and if so, the resources are locked immediately (and simultaneously) and program execution continues into the portion that uses those resources. However, if one or more of the resources are not available, the kernel will queue the request and one of two things will happen at the program's choice: Either the call will block until the resources are available, at which point program execution continues normally (we'll call this "synchronous"); or the program can be notified that "I'll get back to you on that" (we'll call this "asynchronous") in which case it can continue executing some other, unrelated part of its code, such as the part that updates the screen, outputs sound, or whatever. Obviously, this will be a bit trickier for application programmers to implement, but the reward will be programs that appear responsive to a user while they "wait" for locks. There should be an additional option in the locking request: a timeout value (with an option to never timeout). If the request times out, the program can do something appropriate. The kernel function that releases a lock will be modified to check the queue of lock requests, and immediately transfer control of the lock to a waiting program (according to some prioritization method). This means unblocking something that is blocked synchronously, or sending a signal to the process that queued the asynchronous lock request. The BKL will be implemented by requesting all locks, synchronous, never time out. Simple. The lock request function will return a value telling you whether you received the lock or why you didn't receive it. The scheduler "knows" if there are two or more processes in its list that are blocked waiting for a lock. If at least one of them has provided a timeout value other than "never," the scheduler does nothing since the problem will fix itself eventually, a la Office Space. Otherwise, it can examine the lock relationships. If a deadlock situation exists, it will choose one of the processes and return it an answer that it didn't receive the lock for deadlock prevention. That process can then do something appropriate. Code that doesn't know how to handle this situation will crash, but the system won't block indefinitely.
Re:Punchline (Score:1, Interesting)
So the approach they seem to be taking, is:
- add a WARN_ON in the scheduler when it releases the BKL, so that users get notified via syslog and can report such cases (in mainline)
- identify all cases where BKL is locked and not released, and fix the code (might be done in mainline)
- rewrite the BKL as a recursive spinlock (basically, a lock that can double-lock itself and still un-lock correctly) - might also be done in mainline once they are confident that all call sites have been fixed
- merge it, if needed
- move the locks down into subsystems (as you said)