Removing the Big Kernel Lock

Corrado writes "There is a big discussion going on over removing a bit of non-preemptable code from the Linux kernel. 'As some of the latency junkies on lkml already know, commit 8e3e076 in v2.6.26-rc2 removed the preemptable BKL feature and made the Big Kernel Lock a spinlock and thus turned it into non-preemptable code again. "This commit returned the BKL code to the 2.6.7 state of affairs in essence," began Ingo Molnar. He noted that this had a very negative effect on the real time kernel efforts, adding that Linux creator Linus Torvalds indicated the only acceptable way forward was to completely remove the BKL.'"
  • Why did they remove the preemptable BKL?

    RTFAing says that temporarily forking the kernel with a branch dedicated to experimenting with the BKL is being considered. Maybe they can call it 2.7...
    • by Vellmont ( 569020 ) on Saturday May 17, 2008 @11:44AM (#23446302) Homepage

      Why did they remove the preemptable BKL?

      I'm not a kernel developer, but I'd say it's because there's widespread belief that the preemptable BKL is "the wrong way forward". Statements like this lead me to believe so:

      "all this has built up to a kind of Fear, Uncertainty and Doubt about the BKL: nobody really knows it, nobody really dares to touch it and code can break silently and subtly if BKL locking is wrong."


      In any large software project there's always a path to get from where you are, to where you want to be. It sounds like any version of BKL is considered ugly and causes problems, and patching it just won't work. In other words, fixing this part of the kernel isn't really possible, so they need to start over and change any code that relies on it to rely on something different entirely.
      • by Anonymous Coward on Saturday May 17, 2008 @12:37PM (#23446634)
        The recent semaphore consolidation assumed that semaphores are not timing critical. Also it made semaphores fair. This interacted badly with the BKL (see [1]) which is a semaphore.

        The consensus was to not revert the generic semaphore patch, but to fix it another way. Linus decided on a path that will make people focus on removing the BKL rather than a workaround in the generic semaphore code. Also, Linus doesn't think that the latency of the non-preemptable BKL is too bad [2].

        [1] http://linux.derkeiler.com/Mailing-Lists/Kernel/2008-05/msg03526.html
        [2] http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=8e3e076c5a78519a9f64cd384e8f18bc21882ce0
      • by johannesg ( 664142 ) on Saturday May 17, 2008 @01:48PM (#23447034)
        Are you arguing for a microkernel style solution, sir? If so, I salute your bravery! ;-)
        • by Anonymous Coward
          Microkernels are cute to play with but can't cope with the heavy demands that today's computing requires.

          The best course of action would be to redesign the Linux kernel from scratch and this time integrate all possible drivers. Hardware support would be a lot easier!

          I would even go so far as to suggest integrating the most important server tools into the kernel to decrease latency. Why not integrate Apache? You could even integrate the shell for added responsiveness!

          Linus has demonstrated that micro k

        • by QX-Mat ( 460729 ) on Saturday May 17, 2008 @06:06PM (#23448634)
          I'm brave enough to want per-CPU microkernels (with a messaging master?). I envisage all multi-CPU systems addressing memory in a non-unified manner soon enough - it'll be like the jump from segmented addressing to protected mode, but for CPUs.

          The monolithic design is slowly becoming a performance choke point: something has to do a lot of locked switching. If SMP machines could do what they do best and handle IRQs and threads concurrently without waiting for a lock (their time is better spent sending and receiving lockless messages), life would be easier on the scalability gurus.
    • by QX-Mat ( 460729 ) on Saturday May 17, 2008 @11:52AM (#23446356)
      New semaphore code was introduced that simplified locking. Unfortunately, in many kernel situations it's proven to hurt performance by something like 40% - which isn't just considerable, it's disastrous.

      Rather than merge the old locking code back in, and reintroduce the many different locking primitives they had, someone decided to simply reenable the BKL - the downside of which is they have to either fix the regression caused by the simpler semaphore code (not likely, it's very simple and clean - everyone's favourite pet/child) or remove the places where the semaphore code is likely to be called (the BKL).

      Matt
      • by toby ( 759 ) * on Saturday May 17, 2008 @12:10PM (#23446470) Homepage Journal
        here [lwn.net] (for subscribers. I dare not post a free link here :)
      • Re: (Score:3, Interesting)

        by kestasjk ( 933987 )

        New semaphore code was introduced that simplified locking. Unfortunately, in many kernel situations it's proven to hurt performance by something like 40% - which isn't just considerable, it's disastrous. Rather than merge the old locking code back in, and reintroduce the many different locking primitives they had, someone decided to simply reenable the BKL - the downside of which is they have to either fix the regression caused by the simpler semaphore code (not likely, it's very simple and clean - everyone's favourite pet/child) or remove the places where the semaphore code is likely to be called (the BKL). Matt

        Couldn't they just ask the real-time developers to kindly find a real-time kernel to work on? Why try to make a non-preemptible kernel preemptible for the sake of real-time, if it affects non-real-time performance?

        • by Arakageeta ( 671142 ) on Saturday May 17, 2008 @01:14PM (#23446854)
          That's a terrible excuse. There are many applications where a real-time Linux kernel is highly desired. Besides, it is important to note that real-time systems do not focus on speed. This is a subtle difference from "performance", which usually carries speed as a connotation; it doesn't for a real-time system. The real-time system's focus is on completing tasks by the time the system promised to get them done (meeting scheduling contracts). It's all about deadlines, not speed. So from this point of view, the preemptible BKL, even with the degraded speed, could still be viewed as successful for a real-time kernel.
          • Re: (Score:2, Interesting)

            by Anonymous Coward
            Yes, and since the kernel can be and is branched, they can decline to apply this patch and keep the 2.6.7-2.6.22 or whatever style BKL... or they can help everyone and rewrite various BKL-using code to not use it. I'd rather have a kernel that has low latency AND behaves correctly, but if I have to choose I prefer correct behavior every time.
        • by QX-Mat ( 460729 ) on Saturday May 17, 2008 @03:12PM (#23447518)
          There are a few added benefits from a fully real-time OS that most people gloss over.

          For example, you will *know* your PC will never become so utterly unusable that your data is at risk - i.e. while it's handling many IO operations (say you're being DDoSed whilst transcoding a DVD and flossing with a SATA cable) - unless you run completely out of system memory. Nothing should run away that wasn't designed to. This stems from the predictability of code execution times that preemption offers.

          The predictability allows devices to make guarantees. For example, if your mouse knows it's going to get a time slice at worst every 100ms, at least it'll be doing something every 100ms and you gain a visually responsive mouse (aye, it's not as great as it could be). The non-preempt side of life has your CPU tied up doing work that sits inside the BKL - i.e. dealing with a character device or ioctl - so your mouse could be waiting 500ms or 1000ms before updating its position, giving you the impression your PC is dying.

          Code that is stuck inside the BKL isn't preemptable (you *must* wait for it to finish) - and there's a lot of it that does a lot of regular stuff. Often this will hold up other cores if you've got a cooperative multi-threaded program. The net effect is a slow PC.

          RT systems have a different use: they want a guarantee that nothing is able to ever delay the system, in particular interrupts. The BKL allows code to take a time slice and run away with it, because you thought it was very important and wrapped it in [un]lock_kernel. This then delays IRQs (IRQs can't run until the lock has been released - at least, I'd like to believe second-level IRQs can't run, I'm unsure of the specifics), which will delay data coming in and out of your PC - hard disk, file and display buffers suffer: they fill and you start to lose data because there's nothing dealing with it.

          Preempt kernels are good :)

          (viva the Desktop)

          Matt
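
          To make that [un]lock_kernel pattern concrete, here is a schematic sketch - illustrative only, with made-up mydev_* names, and not buildable outside a kernel tree of roughly this vintage - of an ioctl handler serialized by the BKL versus the same handler guarded by its own finer-grained mutex:

          #include <linux/fs.h>
          #include <linux/mutex.h>
          #include <linux/smp_lock.h>  /* lock_kernel()/unlock_kernel() */

          /* Hypothetical per-driver lock and worker - the names are invented. */
          static DEFINE_MUTEX(mydev_lock);
          static int mydev_do_ioctl(struct file *f, unsigned int cmd,
                                    unsigned long arg);

          /* Old style: serialized against every other BKL user in the kernel. */
          static int mydev_ioctl_bkl(struct inode *inode, struct file *f,
                                     unsigned int cmd, unsigned long arg)
          {
              int ret;

              lock_kernel();
              ret = mydev_do_ioctl(f, cmd, arg);
              unlock_kernel();
              return ret;
          }

          /* Converted: contends only with other users of this one driver. */
          static long mydev_ioctl_unlocked(struct file *f, unsigned int cmd,
                                           unsigned long arg)
          {
              long ret;

              mutex_lock(&mydev_lock);
              ret = mydev_do_ioctl(f, cmd, arg);
              mutex_unlock(&mydev_lock);
              return ret;
          }

          The removal work being discussed is essentially this conversion repeated at every remaining BKL site: each one has to be audited and given a lock that protects only what it actually touches.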
        • Couldn't they just ask the real-time developers to kindly find a real-time kernel to work on? Why try to make a non-preemptible kernel preemptible for the sake of real-time, if it affects non-real-time performance?

          Running a non-real-time kernel is like asking Eddie the shipboard computer for tea.

          If you ever want to put the Linux kernel into anything other than your standard IT server - an airplane, a missile, or any weapons system, for example - it is going to HAVE to be a real-time kernel.

      • by HeroreV ( 869368 )

        someone decided to simply reenable the BKL
        My understanding is they didn't reenable the BKL, they just made it non-preemptable again.
        • by QX-Mat ( 460729 )
          That's right. I guess I should have said that - the BKL in the preempt sense wasn't really a lock, although it was a mutual exclusion mechanism... tbh, I don't know what you call a lockless lock nowadays.

          IM's patch, amongst a host of other things, adds down(&kernel_sem) to lock_kernel (and up() to unlock), making it a true lock.

          It's interesting to note that compare-and-swap (CAS/CAS2) instructions have been around in many architectures for nearly a decade (longer still for some). It's surprising to se
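
          For reference, here is a condensed sketch of the semaphore-backed lock_kernel()/unlock_kernel() pair described above - simplified from the description in this thread rather than quoted from the patch, and obviously not buildable outside a kernel tree:

          /* The BKL as a recursion-counted wrapper around a semaphore. */
          static DECLARE_MUTEX(kernel_sem);    /* old-style semaphore, count = 1 */

          void lock_kernel(void)
          {
              int depth = current->lock_depth + 1;

              if (depth == 0)                  /* only the outermost call sleeps */
                  down(&kernel_sem);
              current->lock_depth = depth;
          }

          void unlock_kernel(void)
          {
              if (--current->lock_depth < 0)   /* dropped the outermost reference */
                  up(&kernel_sem);
          }

          The spinlock variant that commit 8e3e076 restored keeps the same lock_depth recursion counting but, roughly speaking, spins instead of sleeping, which is what makes it non-preemptable while held.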
    • by diegocgteleline.es ( 653730 ) on Saturday May 17, 2008 @11:53AM (#23446368)
      Because these days the BKL is barely used in the kernel core, or so Linus says [lkml.org]: the core kernel, VM and networking already don't really do BKL. And it's seldom the case that subsystems interact with other unrelated subsystems outside of the core areas. IOW, it's rare to hit BKL contention - and in those cases, you want the contention period to be as short as possible. And spinlocks are the faster locking primitive, so making the BKL a spinlock (which is not preemptable) makes the BKL contention periods shorter. A mutex/spinlock brings you "preemptability" and somewhat hides the fact that there's a global lock being used, sometimes at the expense of performance, which may be a good thing for RT/low-latency users, but apparently Linus prefers the solution that is faster and doesn't hide the real problem.
      • by diegocgteleline.es ( 653730 ) on Saturday May 17, 2008 @12:06PM (#23446442)
        A mutex/spinlock brings you "preemptability"

        Duh, I meant mutex/semaphore. And Linux semaphores have become slower, while mutexes are still as fast as the old semaphores were, as #23446368 says. The options were to move from a semaphore to mutexes or spinlocks, but Linus chose spinlocks because the RT/low-latency crowd will notice it and will try to remove the remaining BKL users.
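
        The spinlock-versus-sleeping-lock trade-off is easy to feel from userspace. A rough, hypothetical micro-benchmark (plain pthreads, not kernel code; note that a userspace spinlock holder can still be preempted, unlike a kernel one, so this only illustrates the cost difference for short critical sections):

        /* Contrast a busy-waiting spinlock with a sleeping mutex for a short
         * critical section. Build with: gcc -O2 bench.c -o bench -lpthread */
        #include <pthread.h>
        #include <stdio.h>
        #include <time.h>

        #define THREADS 4
        #define ITERS   1000000

        static pthread_spinlock_t spin;
        static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
        static volatile long counter;

        static void *spin_worker(void *arg)
        {
            for (int i = 0; i < ITERS; i++) {
                pthread_spin_lock(&spin);    /* waiters burn CPU; holder finishes fast */
                counter++;
                pthread_spin_unlock(&spin);
            }
            return arg;
        }

        static void *mutex_worker(void *arg)
        {
            for (int i = 0; i < ITERS; i++) {
                pthread_mutex_lock(&mtx);    /* contended waiters sleep and reschedule */
                counter++;
                pthread_mutex_unlock(&mtx);
            }
            return arg;
        }

        static double run(void *(*fn)(void *))
        {
            pthread_t t[THREADS];
            struct timespec a, b;

            counter = 0;
            clock_gettime(CLOCK_MONOTONIC, &a);
            for (int i = 0; i < THREADS; i++)
                pthread_create(&t[i], NULL, fn, NULL);
            for (int i = 0; i < THREADS; i++)
                pthread_join(t[i], NULL);
            clock_gettime(CLOCK_MONOTONIC, &b);
            return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
        }

        int main(void)
        {
            pthread_spin_init(&spin, PTHREAD_PROCESS_PRIVATE);
            double ts = run(spin_worker);
            printf("spinlock: %.3fs (count %ld)\n", ts, counter);
            double tm = run(mutex_worker);
            printf("mutex:    %.3fs (count %ld)\n", tm, counter);
            pthread_spin_destroy(&spin);
            return 0;
        }

        Which primitive wins depends on how short and how contended the critical section is - which is exactly the judgment call being argued about here.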
      • by SpinyNorman ( 33776 ) on Saturday May 17, 2008 @12:20PM (#23446528)
        If that is true then it sounds like a bad decision.

        If the BKL code is rarely used then the general usage performance impact is minimal and the efficiency of a spinlock vs mutex is irrelevant. If this is not true then saying it is rarely used is misleading.

        However for real-time use you either do or don't meet a given worst case latency spec - the fact that a glitch only rarely happens is of little comfort.

        It seems like it should have been a no-brainer to leave the pre-emptable code in for the time being. If there's a clean way to redesign the lock out altogether then great, but that should be a separate issue.
    • by Sits ( 117492 ) on Saturday May 17, 2008 @12:25PM (#23446546) Homepage Journal
      Matthew Wilcox replaced the per-platform semaphore code with a generic implementation [lwn.net] because it was likely to be less buggy, it reduced code size, and most performance-critical places should be using mutexes now.

      Unfortunately this caused a 40% regression in the AIM7 benchmark [google.com]. The BKL was now a (slower) semaphore, and the high lock contention on it was made worse by its ability to be preempted. As the ability to build a kernel without BKL preemption had been removed [google.com], Linus decided that BKL preemption would go. Ingo suggested semaphore gymnastics to try and recover performance, but Linus didn't like this idea.

      As the BKL is no longer preemptible [google.com], it is now a big source of latency (since it can no longer be interrupted). People still want low latencies (that's why they made the BKL preemptible in the first place), so they took the only option left and started work to get rid of the BKL.

      (Bah, half a dozen other people have replied in the time it's taken me to edit and re-edit this. Oh well...)
  • by paratiritis ( 1282164 ) on Saturday May 17, 2008 @11:33AM (#23446236)
    Worse is Better [dreamsongs.com] (also here [wikipedia.org]) basically says that fast (and crappy) approaches dominate in fast-moving software, because they may produce crappy results, but they allow you to ship products first.

    That's fine, but once you reach maturity you should be trying to do the "right thing" (the exact opposite.) And the Linux kernel has reached maturity for quite a while now.

    I think Linus is right on this.

  • Punchline (Score:5, Informative)

    by Anonymous Coward on Saturday May 17, 2008 @12:03PM (#23446422)
    Since the summary doesn't cut to the chase, and the article was starting to get a little boring and watered-down, I read Ingo's post and here's what I got from it: the BKL is released in the scheduler, so a lot of code is written that grabs the lock and assumes it will be released later, which is bad. Giving it the usual lock behavior of having explicit release will break lots of code. Ingo created a new branch that does this necessary breakage so that the broken code can be detected and fixed. He wants people to test this "highly experimental" branch and report error messages and/or fixes.

    Assuming everything is stable and correct, the next step is to break the BKL into locks with finer granularity so that the BKL can go the way of the dodo.
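
    To make the "released in the scheduler" semantics concrete, here is a rough userspace analogy (plain pthreads; lock_kernel()/unlock_kernel() here just mirror the kernel's names, this is not kernel code): the lock is recursive, and it is silently dropped and re-taken around a blocking call, so any state it protects can change across a sleep.

    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;
    static __thread int lock_depth;          /* per-thread recursion count */

    static void lock_kernel(void)            /* recursive: only depth 0 -> 1 locks */
    {
        if (lock_depth++ == 0)
            pthread_mutex_lock(&big_lock);
    }

    static void unlock_kernel(void)
    {
        if (--lock_depth == 0)
            pthread_mutex_unlock(&big_lock);
    }

    /* The scheduler-release behaviour: drop the lock across a blocking call,
     * then quietly take it back before returning to the caller. */
    static void blocking_call(unsigned int seconds)
    {
        int depth = lock_depth;

        if (depth) {
            lock_depth = 0;
            pthread_mutex_unlock(&big_lock);
        }
        sleep(seconds);                      /* someone else may take the lock here */
        if (depth) {
            pthread_mutex_lock(&big_lock);
            lock_depth = depth;
        }
    }

    int main(void)
    {
        lock_kernel();
        printf("holding the big lock; now blocking...\n");
        blocking_call(1);                    /* protected state may have changed */
        printf("back; the lock was re-acquired behind our back\n");
        unlock_kernel();
        return 0;
    }

    Code written against those semantics quietly assumes the drop-and-retake; giving the lock ordinary "hold until explicit release" behaviour is exactly the breakage the experimental branch is meant to flush out.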
  • Will this affect anything I do if I am eventually given an option to install this kernel version? (Or am presented with a distro that has this kernel as the default?)

    I know (or think I know) that low latency is important for audio work, and I know people who do a lot of audio work under Linux. Should I be giving them a heads-up to avoid upgrading their kernel until this gets fixed, or should I start looking for unofficial, special low-latency versions of the kernel to recommend to them?
    • I imagine a kernel will come out that just uses the BKL far less (I don't think it will be a compilation option). There is a risk of instability (especially if you are using SMP/preemption) while overlooked code that needs locking is sorted out (this could lead to deadlocks or, in an extreme case, memory corruption). Over time this risk should decrease.

      This work won't go into 2.6.26 (it's too late). It may not even go into 2.6.27 (it's been done outside of the mainline tree). This may mean that until it is
    • by ttldkns ( 737309 )
      Unless you are in a position where you compile your own kernel every few weeks, this won't affect you in any noticeable way.

      I'm sure all the major distros will be watching what's happening here and making sure it doesn't affect their end users. From what I've read, the changes mean better performance at the expense of latency, but they're working to eliminate the BKL altogether.

      By the time your distro's next kernel upgrade comes around, I'm sure none of this will matter. :)
    • by eh2o ( 471262 )
      AFAIK you already need a patched or at least properly configured kernel to get really good latency response. Planet CCRMA and the Debian Multimedia Project are two distros that come with LL kernels. Obviously these distros will stay on older versions until the BKL situation is resolved.
    • Unless they run real-time audio processing applications, it won't matter.

      If they DO run real-time audio processing applications, they most likely have their kernel specifically configured for it, and won't get updates until developers make sure that there is no additional latency introduced.
  • by TheNetAvenger ( 624455 ) on Saturday May 17, 2008 @12:28PM (#23446566)
    Keep these on Front Page...

    This is the type of stuff that needs to be kept in the news, as many of the people who post here have no understanding of it, and the ones that do get the opportunity to explain it, bringing everyone up to a better level of understanding.

    Maybe if we did this, real discussions about the designs and benefits of all technologies could be debated and referenced accurately. Or, dare I say, NT won't have people go ape when someone refers to a good aspect of its kernel design.
  • by Crazy Taco ( 1083423 ) on Saturday May 17, 2008 @12:44PM (#23446692)

    He noted that the various dependencies of the lock are lost in the haze of 15 years of code changes, "all this has built up to a kind of Fear, Uncertainty and Doubt about the BKL: nobody really knows it, nobody really dares to touch it and code can break silently and subtly if BKL locking is wrong."

    Wow. It sounds like it's about time someone on the kernel team reads Working Effectively With Legacy Code [amazon.com] by Michael Feathers.

    I'm a software developer on a very large project myself, and this book has absolutely revolutionized what I do. Having things break silently in the kernel is a sure sign that dependency problems exist in the code, and most of this book is about ways to break dependencies effectively and get code under test. And that's the other thing... if they aren't writing tests for everything they do, then even the code they write today is legacy code. Code without tests can't be easily checked for correctness when a change is made, can fail silently easily, and can't be understood as easily.

    That's what this book is about, and if things in the kernel have deteriorated to such a state then they need to swallow their pride and take a look at resources designed to cope with this. I know they are all uber-coders in many respects, but everyone has something they can improve on, and from the description they give of their own code, this is an area where they can improve.

    • by Alex Belits ( 437 ) * on Saturday May 17, 2008 @01:14PM (#23446858) Homepage

      And that's the other thing... if they aren't writing tests for everything they do, then even the code they write today is legacy code. Code without tests can't be easily checked for correctness when a change is made, can fail silently easily, and can't be understood as easily.
      On the other hand, code WITH tests also can't be easily checked for correctness when a change is made. There is only a very small scope of possible mistakes that a test can detect, and if you try to make the test verify everything, the test will grow larger (and buggier, and more incomprehensible) than your code. It's also possible that the intended behavior of the code and the expected behavior the test checks for diverge because of some changed interface. Tests help with detection of obvious breakage, but you can never rely on anything just because it passed them.

      In other words:

      TESTS DON'T VERIFY THAT YOUR CODE IS NOT BUGGY. YOU VERIFY THAT YOUR CODE ISN'T BUGGY.
    • You do wonder if they need some proper test strategy to test regressions, etc.

      Also, I wonder if the Linux kernel can carry on expanding or if it's time for the form of the kernel to change.

      I know people like the monolithic kernel, but lack of change does not promote new techniques. It doesn't have to be a microkernel or fit into any existing box.
    • by pherthyl ( 445706 ) on Saturday May 17, 2008 @01:28PM (#23446912)
      Whatever your large project is, I'm willing to bet it's nowhere near as complex as the kernel. Whenever you get the feeling that they must have missed something that seems obvious, you're probably the one that's wrong. No offense, but they have a lot more experience dealing with unique kernel issues than you do.

      You talk about unit testing, but how exactly are you going to unit test multi-threading issues? This is not some simple problem that you can run a pass/fail test against. These kinds of things can really only be tested by analysis to prove they can't fail, or extensive fuzz testing to get them "good enough".
      • It's called system testing, and I agree that writing unit tests is never enough.

        As for your comment about them "knowing better", I've worked on a multi-million line project. When your line count reaches that sort of size, the issues faced by someone on a ten-million-LOC project are pretty much the same as for people on twenty-million-LOC projects. If you RTFA you'll see it was a series of system tests which demonstrated the problem in the first place. Although the fact the kernel doesn't seem to have a standardised set of sys
        • Still, the OS kernel is, by definition, one of the most complex pieces of software in a system. There are only three others I can think of that would even come close: the compiler, the system libraries (libc), and device firmware.
        • Re: (Score:3, Funny)

          It's called system testing
          That's what users are for.
        • Re: (Score:3, Insightful)

          by swillden ( 191260 )

          As for your comment about them "knowing better", I've worked on a multi-million line project. When your line count reaches that sort of size, the issues faced by someone on a ten-million-LOC project are pretty much the same as for people on twenty-million-LOC projects.

          All lines of code are not equal.

          There's a huge difference between typical application code and system code. Very little application code is as performance-sensitive as system code, because the goal of the system code is to use as little time as possible (consistent with some other goals), to make as much time as possible available to run application code.

          OS code is performance-tuned to a degree that application code almost never is, and that focus on performance results in complexity that isn't well-

    • by Sits ( 117492 ) on Saturday May 17, 2008 @01:30PM (#23446930) Homepage Journal
      It's hard to test whether you've broken a driver when you don't have the hardware to test with. Perhaps the future will be Qemu emulation of all the different hardware in your system : )

      This is not to say that there can't be tests for things that can be caught at compile time or run time regardless of hardware, but there is only so far you can take it.

      It's not like the kernel doesn't have any testing done on it, though. There's the Linux Test Project [sourceforge.net], which seems to test new kernels nightly. If you ever look in the kernel hacking menu of the kernel configuration you will see tests ranging from Ingo Molnar's lock dependency tester [mjmwired.net] (which checks at run time that locks are taken in the right order), memory poisoning, spurious IRQ checks at un/registration time, RCU torture testing, softlockup testing, stack overflow checking and marking parts of the kernel read-only, to changing page attributes every 30 seconds... Couple that with people like Coverity [coverity.com] reporting static analysis checks on the code. Tools like sparse [kernel.org] have been developed to do some of the static checks on kernel developers' machines while they are building the code.

      But this is not enough. Bugs STILL get through and there are still no-go areas of code. If you've got the skills to write tests for the Linux kernel, PLEASE do! Even just having more people testing and reporting issues with the latest releases of the kernel would help. It's only going to get more buggy without help...
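
      As a small illustration of the kind of bug the lock dependency tester mentioned above hunts for (a userspace analogy, not kernel code): two locks taken in opposite orders on different paths. Nothing deadlocks in this particular run, but checkers - lockdep in the kernel, or Valgrind's helgrind for this pthread version - flag the inconsistent ordering as a potential deadlock.

      #include <pthread.h>

      static pthread_mutex_t a = PTHREAD_MUTEX_INITIALIZER;
      static pthread_mutex_t b = PTHREAD_MUTEX_INITIALIZER;

      static void path_one(void)         /* takes a, then b */
      {
          pthread_mutex_lock(&a);
          pthread_mutex_lock(&b);
          pthread_mutex_unlock(&b);
          pthread_mutex_unlock(&a);
      }

      static void path_two(void)         /* takes b, then a: inverted order */
      {
          pthread_mutex_lock(&b);
          pthread_mutex_lock(&a);
          pthread_mutex_unlock(&a);
          pthread_mutex_unlock(&b);
      }

      int main(void)
      {
          path_one();
          path_two();       /* two threads running these paths concurrently could deadlock */
          return 0;
      }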
  • by Animats ( 122034 ) on Saturday May 17, 2008 @01:30PM (#23446928) Homepage

    This task is not easy at all. 12 years after Linux has been converted to an SMP OS we still have 1300+ legacy BKL using sites. There are 400+ lock_kernel() critical sections and 800+ ioctls. They are spread out across rather difficult areas of often legacy code that few people understand and few people dare to touch.

    This is where microkernels win. When almost everything is in a user process, you don't have this problem.

    Within QNX, which really is a microkernel, almost everything is preemptable. All the kernel does is pass messages, manage memory, and dispatch the CPUs. All these operations either have a hard upper bound on how long they can take (a few microseconds), or are preemptable. Real-time engineers run tests where interrupts are triggered at some huge rate from an external oscillator, and when the high-priority process handling the interrupt gets control, it sends a signal to an output port. The time delay between the events is recorded with a logic analyzer. You can do this with QNX while running a background load, and you won't see unexpected delays. Preemption really works. I've seen complaints because one in a billion interrupts was delayed 12 microseconds, and that problem was quickly fixed.
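
    A crude userspace cousin of that kind of measurement (a sketch in the spirit of tools like cyclictest, not the logic-analyzer rig described above) is to ask for periodic wake-ups and record how late they actually arrive; the worst-case number is the one real-time people care about:

    #include <stdio.h>
    #include <stdint.h>
    #include <time.h>

    #define PERIOD_NS 1000000L   /* ask to wake every 1 ms */
    #define SAMPLES   5000

    int main(void)
    {
        struct timespec next, now;
        int64_t worst = 0, sum = 0;

        clock_gettime(CLOCK_MONOTONIC, &next);
        for (int i = 0; i < SAMPLES; i++) {
            next.tv_nsec += PERIOD_NS;
            if (next.tv_nsec >= 1000000000L) {
                next.tv_nsec -= 1000000000L;
                next.tv_sec++;
            }
            clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
            clock_gettime(CLOCK_MONOTONIC, &now);

            /* how far past the requested wake-up time we actually woke */
            int64_t late = (int64_t)(now.tv_sec - next.tv_sec) * 1000000000LL
                         + (now.tv_nsec - next.tv_nsec);
            if (late > worst)
                worst = late;
            sum += late;
        }
        printf("avg lateness %lld ns, worst %lld ns\n",
               (long long)(sum / SAMPLES), (long long)worst);
        return 0;
    }

    Run it on an idle box and again under heavy disk or ioctl load; the gap between the two worst-case numbers is the latency the preemption work is trying to squeeze out.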

    As the number of CPUs increases, microkernels may win out. Locking contention becomes more of a problem for spinlock-based systems as the number of CPUs increases. You have to work really hard to fix this in monolithic kernels, and any badly coded driver can make overall system latency worse.

    • by Anonymous Coward
      So when can we run QNX-Ubuntu?

      Just asking . . .
      • Re: (Score:3, Interesting)

        by Animats ( 122034 )

        You can actually run X-windows on QNX, although nobody does. QNX has its own GUI system called Photon, which is oriented towards things one might want in an industrial control panel or an entertainment system, like meters, dials, and graphs.

        QNX the OS is fine; it's QNX the company that's difficult to deal with. On the other hand, who else has gone up against Microsoft on x86 for 20 years and is still alive?

  • ...is removing?

    If so, then perhaps what DragonFlyBSD (the BSD could be dropped at this point, as it's only relevant for historical reasons) is doing to remove it can be helpful for removing it in Linux.
    • (the BSD could be dropped at this point, as it's only relevant for historical reasons)
      Oh really? Dragonfly isn't BSD licensed anymore? Dragonfly doesn't regularly sync parts of the source tree from Free/Net/OpenBSDs? Matt Dillon himself didn't go to school at the B in BSD?

      I think DragonFly's links to the BSD world are slightly stronger than you seem to think!
      • by 3seas ( 184403 )
        The break from BSD is in regards to code resemblance. The code of DragonFly has changed enough that it doesn't resemble BSD so much anymore.

        In regards to the Big Giant Lock...

        DragonFly Projects

        http://wiki.dragonflybsd.org/index.cgi/GoogleSoC2008 [dragonflybsd.org]

        Extend Multi-Processing (MP) support

        * Robert Luciani, mentored by Simon Schubert
        * Back in 2003 when DragonFly was born, the first subsystem to be implemented was the LWKT. The reduction in complexity achieved
        • Re: (Score:3, Informative)

          by Moridineas ( 213502 )

          The break from BSD is in regards to code resemblance. The code of DragonFly has changed enough that it doesn't resemble BSD so much anymore.

          You might have been able to say something was "BSD" even 5 years ago, but I think you would have a lot more trouble saying that now. The family is more philosophical than architectural now. How close are the kernels between OpenBSD and FreeBSD for instance? My guess is if you looked back to FreeBSD4, they would be far more similar than now. Likewise, if you just look at the Dragonfly change logs, they frequently import code directly from the other BSDs--I believe the ATA code is one recent example.

          Dragonfl

  • Would one of you kind folks please put this into non-kernel-programmer terms that explain what this does for software/hardware in terms of the user experience and how the proposed outcomes will affect said experience?
    • The proposed outcome is more opportunities to switch between programs and the kernel, or to run multiple things at the same time.

      For those who enable the option, this should reduce the chance of a hardware buffer not being filled in time (so audio is less likely to skip in demanding environments). If you are an audio-recording person or need VERY fast responses (less than hundredths of a second) all the time, your experience should improve. If you run VERY big workloads that have lots of pieces t
    • Would one of you kind folks please put this into non-kernel-programmer terms that explain what this does for software/hardware in terms of the user experience and how the proposed outcomes will affect said experience?

      They're trying to make your mouse pointer move smoothly, your audio never skip, your FPS game never miss a beat and your Linux-driven coffee machine never burn a bean, even when there is some process opening a random device in the background.

  • by mikeb ( 6025 ) on Saturday May 17, 2008 @04:33PM (#23447958) Homepage
    It's worth pointing out here that the kind of races (bugs) introduced by faulty locking in general suffer from a very important problem: YOU CANNOT TEST FOR THEM.

    Race conditions are mostly eliminated by design, not by testing. Testing will find the most egregious ones, but the rest cause bizarre and hard-to-trace symptoms that usually end up with someone fixing them by reasoning about them. "Hmm," you think to yourself, "that sounds like a race problem. Wonder where it might be?" And thinking about it, looking at the code, inventing scenarios that might trigger a race - that's how you find them.
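
    A tiny runnable example of why that is (plain pthreads, nothing kernel-specific): two threads increment a shared counter without a lock. Some runs happen to produce the right total, others silently lose updates, so a test wrapped around this code can pass today and fail next week.

    #include <pthread.h>
    #include <stdio.h>

    #define ITERS 1000000
    static volatile long counter;        /* shared and deliberately unprotected */

    static void *worker(void *arg)
    {
        for (int i = 0; i < ITERS; i++)
            counter++;                   /* load, add, store - can interleave badly */
        return arg;
    }

    int main(void)
    {
        pthread_t t1, t2;

        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("expected %d, got %ld\n", 2 * ITERS, counter);
        return 0;
    }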
