Protothreads and Other Wicked C Tricks 229
lwb writes "For those of you interested in interesting hard-core C programming
tricks: Adam Dunkels' protothreads library
implements an unusually lightweight type of threads. Protothreads are
not real threads, but rather something in between an event-driven
state machine and regular threads. But they are implemented in 100%
portable ANSI C and with an interesting but quite unintuitive use of the switch/case
construct. The same trick has previously been used by Simon Tatham to implement
coroutines
in C. The trick was originally invented by Tom Duff and dubbed Duff's
device. You either love it or you hate it!"
Looks pretty cool (Score:5, Interesting)
It isn't Duff's device. (Score:3, Interesting)
Duff's device is a way of forcing C to do a form of loop unrolling. It has nothing to do with coroutines.
Re:Implementation in languages? (Score:2, Interesting)
Real threads are much better for things like interpreters. JIT compilation with native threads is even better.
Loop Abuse (Score:5, Interesting)
The PPC architecture has a special-purpose count register with specialized branch instructions relating to it; e.g., the assembly mnemonic 'bdnz' means "decrement the count register by one, and branch if it has not reached zero." I've used this in some pretty weird loops, including this one that broke the Codewarrior 9.3 compiler (fixed in 9.4.) This computes the location of the n'th trailing one in a 32-bit integer. Pardon my weak attempt at formatting this in HTML:
static uint32 nth_trailing_one(register uint32 p, register uint32 n) { end: }
return __cntlzw(p ^ (p - 1));
}
The idea was that the instruction stream should stay as linear as possible; most of the time the branches are not taken, and execution falls through to the next line of code. Ironically (siliconically?), the entire function could probably be implemented in a single cycle in silicon; shoehorning bitwise functions like this into standard instructions tends to be extremely wasteful. Perhaps FPGA's will make an end run around this at some point. I've also tried this function with a dynamically-calculated jump at the beginning, similar to the case statement logic in the article.
Hmm, I had a point I was trying to make with this post, but now it's escaped my mind...
It was looking interesting until (Score:5, Interesting)
I got to this little gem:
My English parser thread shut down at that point . . .
Seriously, this looks like a handy little thing for low-memory systems, though I'd be a bit hesitant about pushing at the C standard like that--the last thing you need is a little compiler bug eating your program because the compiler writers never thought you'd do crazy things to switch blocks like that.
Stackless Python (Score:2, Interesting)
Python (Score:5, Interesting)
http://www-128.ibm.com/developerworks/library/l-p
They are cooperative but far more efficient than Python's own threading model. You can easily create hundreds of thousands of concurrent threads.
...sane detexi hanc marginis exiguitas non caperet (Score:3, Interesting)
From the above Duff on Duff's Device: I have another revolting way to use switches to implement interrupt driven state machines but it's too horrid to go into.
Perhaps this is the Duff's Device equivalent of a proof of Fermat's Last Theorem? Or is my ignorance of the history of Evil Computing showing?
You want cool C stuff... (Score:5, Interesting)
Dijkstra says... (Score:2, Interesting)
Re:Loop Abuse (Score:3, Interesting)
Adam
Yes, can be useful (depending on platform) (Score:3, Interesting)
Another advantage is its portability. Small embedded systems, whether they have operating systems or not, usually can't support some fully-blown threading standard. Those operating systems seem to implement some kind of 'specially tuned' thread APIs.
Using these kind of threads on a full-blown PC (or servers) would have almost no benefit. However, in the embedded software engineer's perspective, it's great to see a ultra-lightweight thread library without any platform-dependent code.
Re:It isn't Duff's device. (Score:3, Interesting)
Re:Loop Abuse (Score:4, Interesting)
It would be if I were looking for the n'th leading one, but this code is looking for the n'th trailing one. (e.g. for 0b0010011001011100, the 3rd trailing one is in the fifth-lowest bit.) The equivalent code sequence for leading ones is in fact more complicated, requiring three arithmetic instructions and a branch per iteration. (cntlzw, shift, xor, branch).
I actually use this code as part of an algorithm where I have a very large (e.g. 65k-element) packed single-bit histogram array, and need to find the position of (say) the 1000th set bit. Vector instructions can do a coarse population-count from one end fairly efficiently, but once it's narrowed down to a 32-bit region, it comes down to slicing and dicing. My code operates by clearing the rightmost set bit in each iteration (x & (x - 1)), then at the end, isolating the needed bit (x ^ (x - 1)) and using cntlzw to find its position. To clear the leftmost set bit, you need three instructions: first get its position with cntlzw, then shift 0x80000000 right by that number of bits, and finally XOR to clear the bit. (If there's a shorter sequence, I haven't found it.)
(oh, and for the troll responder-- you are quite spectacularly wrong. But thanks for the giggle.)
Re:a fun trick only useful in very specialized cas (Score:3, Interesting)
I downloaded it. But the version that is there does not, in fact, compile: there's an obvious typo at example-buffer.c, line 137: That aside, I believe your claim is wrong now that I've read the code. right? The reason is that you are simply case-statementing into the code and bypassing the whole subroutine calling mechanism entirely -- so the variable initialization just isn't called.
Now consider the following code: Suppose the compiler decided to extract out the common y*(2*pi)/360 term as an optimization. The assignment to that common term will not occur when you jump into the middle of that routine -- so your code will not work as it appears.
wtf? (Score:3, Interesting)
switch (count % 8)
{
case 0: do { *to = *from++;
case 7: *to = *from++;
case 6: *to = *from++;
case 5: *to = *from++;
case 4: *to = *from++;
case 3: *to = *from++;
case 2: *to = *from++;
case 1: *to = *from++;
} while (--n > 0);
}
What the hell is up with that do { applying only in case zero? It's in several places on the net just like that and Visual Studio compiles this just fine, so it's not an error. I checked K&R, and they don't even hint at what could be going on there... I'm lost. Help?
Use of this technique in Felix (Score:4, Interesting)
http://felix.sf.net/ [sf.net]
to provide user space threading. The main difference is that all the 'C tricks' are generated automatically by the language translator. If you're using gcc then the switch is replaced by a computed jump (a gcc language extension). On my AMD64/2800 time for creating 500,000 threads and sending each a message is 2 seconds, most of the time probably being consumed by calls to malloc, so the real thread creation and context switch rate is probably greater than Meg/sec order
Re:I guess the idea is it's extremely portable. (Score:4, Interesting)
My experiences contradict your statement. State machines are both easy to implement, and easy to debug, if you do it the right way. I have seen many entirely wrong implementations, including one where you can go from any of about two dozen "states" to any other. I have seen some that just switch states when they feel like it, or switch states based on complex decisions, which makes debugging difficult. Put another way, you can make a "state machine" degenerate into something else, and nullify its benefits, if you refuse to follow the rules.
A well-implemented state machine has an important characteristic: it is clear to see why you are where you are. This means that state transitions are checked (against unexpected events) and traced, so debugging the machine is literally a matter of reading a log that looks like this:
in state 0, received event A so went to state 2
in state 2, received event B so went to state 1
in state 1, received event C so went to state 5
in state 5, ignoring event A
in state 5, received unexpected event C
and so on. In this particular example, the question to answer is why we're not handling event C properly in state 5, or why we went to state 5 in the first place. Either should be pretty obvious when you consult the original design. The fix is likewise obvious. Figuring out state machines, in my experience, has always been easier than figuring out multi-threaded code.
This isn't to say that all programs should be implemented as a state machine. Simple Unix-style pipe programs, for instance, are generally unsuitable. If you don't know how to design a state machine properly, it's also going to be unsuitable.
Re:wtf? (Score:2, Interesting)
Re:extremely limited applicability (Score:1, Interesting)
Local variables are not preserved.
A protothread runs within a single C function and cannot span over other functions. A protothread may call normal C functions, but cannot block inside a called function.
Well,
I think it it is a simple idea but very well implemented and especially useful for microcontroller programs. Mr. Dunkels other project is uIP, which I do find very useful in this area, too.
I.e. this is neat if you don't have the resources at hand to run a even a small multitasking kernel! Maybe 2kB flash for program storage or similar.
As one of my hobby projects, I'm implementing a software DCC encoder (a protocol used for digital model railroads) that has to talk concurrently to the PC and to another bus with various connected gadgets. It has to do buffering and status displays and whatnot. Going the traditional way of
a chain of 'if's is hard and an explicit state-machine non-obvious. Protothreads work, are easy and small (few bytes).
Cheers,
Onno