Forgot your password?
typodupeerror
Technology

Classic Computer Science Papers 54

Posted by sengan
from the things-to-reread dept.
Dean Chouinard writes "For inspiration I occasionally peruse the web pages of Dennis Ritchie, Brian Kernighan and Ken Thompson. I recently found a pointer to the following classic paper by Ken Thompson. Other interesting papers are listed under the classics directory as well."
This discussion has been archived. No new comments can be posted.

Classic Computer Science Papers

Comments Filter:
  • by Anonymous Coward
    This sounds like the kind of 'warning' that could be leveled at the open source process,

    having read the article (and enjoyed it immensely), i don't see how free software would be more vulnerable than any other variety. it's transmitted from binary to binary, after having been initially incubated in the source.

    actually, the great thing about this would have been if it had scared people into compiling their compilers back and forth among other vendor's compilers and their own. i think (unless i'm as confused as i think i am!) that would tend to kill it. or else would it just tend to add any *other* trojan things that the others might have? you know what, i think it would be the latter. never mind. i've never heard of anybody doing that, but then again i wouldn't have heard of it if they had, so who knows.

    if there's anybody out there who doesn't get an attack of vertigo contemplating the code in question, please chime in!

    what would a "cure" be? maybe bootstrapping writing an assembler by hand, byte by byte, and then writing a minimal c compiler in assembler . . . ? i know almost nothing about assembly language or whatnot, though i did see somebody once write a trivial (REAL trivial!) program by typing opcodes or whatever you call them into the dos debug.exe thing. it seemed kinda like the long way around, for most purposes, but you could avoid introducing anything weird if you really wanted to. you shouldn't trust debug.exe, but it would be trivial to write a quick minimal write-only binary editor in c, the output of which could be verified. god, this is like a philip k. dick novel.

    :)

  • by Anonymous Coward
    Trojan horse code *can't* be detected, at least not by a computer program. The only way to do it would be for a human to go through each line of code at every level, and even then you can't always be sure.
  • by Anonymous Coward
    They were writing programs long before compilers were invented. They wrote them in machine code or assembly if they were lucky enough to have an assembler, which isn't too hard to make.


    Then, once you have a couple compilers you write other compilers with those existing compilers. The 'test' seems to be writing a compiler that can then be used to compile itself. GCC does this, it has some stub code that can get you a basic compiler and it uses that to compile GCC and then you can redo the whole thing with GCC.


    Take a compilers course, I think it was the hardest course I took in college but you learn a ton. It's not too tough to build a very primitive 'yacc compiler' that takes a yacc grammer and outputs assembly for a very simple language. If you assume the first compilers were something like that (and they were) then you suddenyl have a better tool to make an even more sophisticated compiler, and this sprials and spirals and you eventually end up with modern compilers.

  • by Anonymous Coward
    AFAIK all the early compilers (COBOL, FORTRAN, etc) were written in assembly. I think C was originally written in B (which was derived/inspired from BCPL, sortof a high-level assembler, designed for writing operating systems). In a way, you could say that C = B++.
  • These are some very important pieces. I remember reading them in college. I felt like I was getting into their brains - no watered down texts that give only the greatests ideas in clean form.

    Who knows. Someone reading them for the first time because of an introduction on here might be inspired to advance the field of knowledge.

    Sure beats people bragging about how many MP3 files they've pirated.

    Peace
    Ron Rangel
  • by Anonymous Coward
    Hey, something I have ALWAYS wondered is how they made the first compiler? Since obviously they had nothing to compile it with. And compilers for new languages, are they made in different languages? Or what? And... if the c-compiler was written in c, how did Ken Thompson compile it the first time?

    -Confused
  • by Anonymous Coward on Wednesday January 20, 1999 @03:23PM (#2036921)
    My recitation instructor when I took 6.004 at MIT was Steve Ward. (This was 3 or so years ago.) Ward told us this same story. It seems that Ward and Ken Thompson went out to a bar one time, and Thompson was describing how to plant this trojan in a compiler. After a few more drinks, Thompson admitted to actually creating such a compiler, and compiling the login program with it. It worked. Ward kept feeding Thompson drinks, and pressing for more information. But Thompson would never quite say if he actually did anything with it, like if he merged it in on an install tape.

    I expect Thompson didn't do anything with it; I expect he was just holding back to keep an air of mystery around it all. But still... interesting...

    (That's one of my best memories of that class. All us youngins hanging on every word of Ward's story about the "real" hacking done by our heros. Or at least mine. ;)
  • This line of reasoning leads naturally to the conclusion that some forking is good. It would be a Good Thing (TM) to have at least two compilers, with a significantly different code base and provonance, to allow cross-compiling of sources, particularly compilers.

    To a limited extent, at least, forking is good.

  • That would be troublesome, since the two compilers most likely won't produce identical binary output anyway, even if there are no trojans. Try compiling a program on gcc and then on cc, and then on egcs, and you'll get slightly different binaries.
  • Umm, aren't the classics on the website already free? I just read the Ken Thompson article that the article linked to, and I have no ACM subscription.
  • In Forth, I believe.
  • Too bad they haven't updated that page in 2 1/2 years... I'm sure there are many other classics that should be up there.
  • Sample quotes:

    "Subsequently, I aided Ken Thompson in creating the Unix operating system."

    "Early in the development of Unix, I added data types and new syntax to Thompson's B language, thus producing the new language C."

    Imagine having that on your resume! :)
  • gcc was first compiled with a non-gcc compiler, which could not have known how to infect gcc, because gcc hadn't existed when the previous compiler was written.

    Since then, gcc has always been compiled using gcc, and its source has been open for public inspection. egcs has also been compiled from gcc and egcs versions which are also publicly inspected.

    The proof that publicly-inspected versions of gcc and egcs are not infected follows via recursive proof. G(0) is clean, because it was compiled with a non-gcc compiler; G(n+1) is clean if G(n) is clean, because its source is publicly inspected. Hence G(n) is clean for all n.

    The problem enters when a version of gcc or egcs is **secretly** forked from the main (publicly inspectable) source tree, and *its* code is contaminated, and it is then distributed in binary form, as if it were legitimate. Once binaries enter the arena from an unverifiable source, you have a problem, since if anyone compiles a (clean-source) version of gcc with an infected compiler, they only get infected compilers as a result.

    Solutions:

    * Maintain a "reference compiler binary" for each platform, located at FSF or CYGNUS. This would be a copy of gcc or egcs which had been derived from an old (pre-gcc) compiler, and which was thus known (by the above recursive proof) to be clean.

    * Trace through all compilers with a debugger. While it's possible that a debugger could be hacked to skip over the trojan routines, it would be MUCH harder to hide than the largely black-box process of compilation.

    An analogy:

    Every newly-compiled compiler has two "parents": its own source code, and the compiler which was used to compile it. Infection is a dominant genetic trait: if *either* parent is infected, then the "baby" compiler will be infected too.

    If you detect an infection in a compiler (by stepping through it with a debugger, for instance) then you know that one or both of its parents were themselves infected. This could theoretically be used to trace an infection back to the source patch that introduced it.
  • by dalke (1472)
    It was a coincidence.

    Thanks for the verification :)

    if this is the type of stuff on segfault I will certainly be checking it out from now on.

    For the most part, no, it isn't.

  • The method described depends on regular expressions in general, and on recognizing particualr functions to be compiled in particular. You could do a bit to foil this sort of scheme with the judicious use of sed-- eliminate comments, rename all functions, variables, and macros, salt what's left with whitespace, and rename all files and change your makefiles accordingly. Then compile and rename your executables. It should be very easy to automate this in several different ways-- perl, shell scripts, etc.

    This would make it very difficult to produce a trojan that could replace a specific function in a specific program *without* affecting other programs to be compiled.

    It is still possible to insert a trojan that will be added to *every* binary compiled, but checking for this is trivial-- the expected machine code for, e.g., hello world can be calculated by hand and compared to the output of your compiler.

    Code munging is only a partial protection, probably less effective than bootstrapping, but it would at least give the crackers some headaches.
  • The method is only a hindrance, not a panacea, even if the method is only applied to source code. I think I tried to stress that.

    Machine-code trojans would tend to be platform dependant, and would be much harder to implement, especially for a portable, cross-compiling compiler such as gcc.

    I assume that by a "machine code implementation" yo mean a trojan that would wait until code had been compiled, and then examine the resultant bits for a particular pattern to replace.

    Small changes to the source code could result in such large changes to the object code that this method would only be practical for infecting code that is not expected to change. The exuberant code munger could randomly salt the code with valid but useless statements-- printing to /dev/null, sending a packet to the loopback, etc., that could not be optimised out. This unseemly bloat would monkeywrench trojans expecting a nice predictable chunk of machine code. Going through many generations of this before generating a binary with no "extras" would reduce a trojan's chances of survival.

    The problem of recognizing a particular function in a particular program in machine code for many different platforms is on a par with recognizing a particular function in a particular program in source code with arbitrary function/variable/file names. It is certainly soluble, but much more difficult than the example put forward in the paper.

    I stand by the assertion that munging would make life harder for the cracker.
  • Yep, like gcc, lcc, and TenDRA....

    But what do you compile *them* with ;-)
  • Once you have any computer with the ability to run code, you can simply write code on that system to build the first compiler on another system, then copy that compiler over to the new system and go from there. Once you have a working code environment on any system, you'd be crazy not to use it to develop the tools for the next.
  • Well, of course it does. I didn't literally mean
    'compiler' in the sense of gcc, I meant all the bits that is required to get a new computer system able to reasonably support its own code development.
  • The sad thing is there don't seem to have been any updates since mid-1996. Don't you hate continuing series that don't?

    I guess the ACM don't care about their website that much. (And they're using the CERN webserver still!)
  • It's (vaguely) interesting to note that Djikstra is the only author with two papers in their list.

    So there.


    "Umm... Goto's are bad, m'kay?"

    Actually, I've been hunting around for the "Goto's" paper for a while now.
  • I'm a little beyond the 'hello world' stage of programming, but not much. Can anyone tell me how far along developers have/have not come in terms of detecting the type of 'trojan' code discussed in this paper?

    This sounds like the kind of 'warning' that could be leveled at the open source process, and I'd like to be able to refute it as accurately as possible.

  • They wrote it in assembly language, or raw opcodes, just like most everything else they did back then.

    Supposedly that's one of the big tests of the viability of a language: whether you can write a compiler for the language in the language. But you've always gotta write the first compiler in some other language.

    After taking my compiler design course, I've never looked at programming the same way. Same with assembly programming. IMNSHO, every programmer should be exposed to both of these topics.

  • Well, it would be technically very difficult to insert such a trojan into GCC. Particularly as there are many people who 'bootstrap' GCC by compiling it with another compiler, for example, the Sun C++ compiler. If you wanted to insert nefarious code like this, apart from the problem of making it work and knowing when to trigger it, you'd have to add it to every compiler under the sun. (no pun intended.)
  • Hiihaa, I'm beginning to feel myself old.. too bad. I used to write raw machine code on Commodore-64 (though one had to use basic to access the memory). It took me about 4 years to get to the level of structured programming :) Yes past days are golden, but I'd never want them back!

    There must be quite a few of us around here?
  • Except for the bit about smoking crack.
  • Wait a sec...The ACM is a professional organization. They offer really great services, including reduced fees to conferences, SIG memberships, etc. The prices they charge are very reasonable for the services they offer. I'm a member of SIGOPS, and SIGLINK, and SIGART -- I receive valuable publications, as well as being allowed to access their full text archives. Just because they charge a reasonable fee to allow access to their archives, they don't "suck dick."

    Should a professional organization like the ACM provide free access to their archives? Why?
  • The key, Confused, is remembering that compilers are only needed for High-Level Languages. C needs a compiler, but assembler code does not. An assembler development program is merely a convenience. Most assembler code used to be hand-written (just as, before that, most punch-card code was). From this evolved the "macro assembler" which let you write out subroutines and such as abbreviations, and from that evolved high-level languages such as B, which was the precursor to C.
  • When I wrote my language ReWrite, I had the problem that there were no existing implementations (Mathematica(tm) was close), but _really_ wanted to write it in itself, because compiling the langauge is quite complex, but it is good for writing compilers.

    My solution was to write a quick and dirty hack ReWrite interpreter in Pascal, give it a sufficient subset of the language to get me started, then wrote a pretty minimal compiler in ReWrite, and interpreted the compiler to compile itself. (It took four days on my 8 MHz Mac Classic doing it in little chunks, that interpreter was really slow). Then I stopped using the interpreter and started extending the compiler.

    A couple of memorable things from the development process:

    I had the interpreter and compiler designed so that they could interoperate, and it was fascinating to watch the compilation rate speed up as more bits of the compiler were compiled (it would link the each compiled function in immediately, and use that in preference to the interpreted one). A gradual speed gain of about a factor of 60.

    An interesting part of extending a language written in itself, particularly when it is not a standard language, is that once an extension has been added and works, I start using it in further development on the compiler. The langauge has changed enough since I started it that there is a bit of a mix of coding styles in there - some of the older code is quite hard to read when it has to work around restrictions that no longer exist. (That's why it turns out to have such an unintentionally appropriate name - every so often I rewrite a code module that has grown too old).

    By the way, is ReWrite only goes on Macintoshes, is 68K code (not PowerPC), has free binaries but is not distributed with source, and hasn't been released for about three years. I would like to change all of that (except the free part) as I get more time to work on it.

    Of course because it only goes on one platform and is wrtten in itself, to only sensible way to port it is to write a cross-compiler, ReWrite to C++ or ReWrite to Java Bytecode compiler (these last two are possible, but hard).

    I've strayed off topic a bit, but my experience is that going through that process of getting a compiler going is fun, even if _everyone_ you mention it to before you get it working starts out by saying that it is a waste of time.

    I've also get ReWrite on my CV, and I remember going for a job interview where the fact that I had designed and implemented language was seen quite favourably, but:

    Interviewer: "What did you write it in?"
    Me: "ReWrite"
    Interviewer: "No, what did you _write_ it in?"
    Me: "ReWrite, it's written in itself"
    Interviewer: "No, what did you _write_ it in?"
    Me: "The source code for ReWrite is written in ReWrite itself, just like C compilers are written in C"
    (snipped a couple more ways of me trying to say the same thing in different ways)
    Me: (not wanting the interview to be stuck any longer) "The user interface is written in Pascal"
    Interviewer: "Ah". (he was happy then)

    Roy Ward.
  • Maybe I should revise and post the article I wrote nearly two years ago called "Solutions Considered Harmful" which was written partly in homage to Dijkstra's famed ACM paper and partly in shock horror to the reality of what Windows 95 has done to computing.

    The essence of my paper is that in a computer industry mad to sell boxes -- whether hardware or software -- solutions are merely a problem's way of creating another problem.
    --------
  • I actually work with Denis, Ken, and Brian and they're still cranking out some really cool stuff. None of it is mainstream due to Lucent's (Bell Labs) marketing of it and the huge installed base of current poop (MS Win) and software (*NIX)..

    True that GOTOs are still running amuck in their code sometimes. Some of their newer stuff, papers, and such should be public like the theories behind OSs like Plan9 and Inferno, two OSs based completely on a file stucture for everything.

    If you check out http://plan9.bell-labs.com/cm/cs/ [bell-labs.com] you can find them and some off their recent work. Check it out.

  • No, remember what you are doing when you program, you are giving a series of instructions. For convenience those instructions are written in a human readable format then later translated to a machine readable format. The actual intent of the instruction has not changed, just the representation. The trojan lies in the instruction you gave the original program (in this case gcc). When you use Sun's compiler to compile gcc, the instruction does not change, only its form (to the purists, this is an abstract view of an instruction, I don't want to hear opcodes mentioned :). Thus the trojan is still present.
    In fact I would argue that this problem cannot be feasibly solved but merely represents a risk in trusting any person/entity.

If you think the system is working, ask someone who's waiting for a prompt.

Working...