
Transmeta Code Morphing != Just In Time

Andy Armstrong has written us a pretty interesting, if somewhat technical, piece on Just in Time compilation, computer-generated assembly language, profiling, and more. It's a nice little bit that ought to give you a lot to talk about.

The following was written by Slashdot Reader Andy Armstrong

Transmeta Code Morphing != Just In Time

The recent Transmeta announcement crystallised something I've been thinking about for a while. It's my belief that it should be possible to make a compiler generate much better code in the general case than someone writing hand-coded assembler. Furthermore, it should be possible for a JIT (Just In Time) compiler to produce better code than a conventional one-time compiler.

Why should a compiler be better than an experienced assembler programmer? Well,

  1. the compiler can know the target processor intimately (cycle times, impact of instruction ordering, etc.)
  2. the compiler gets to re-write the entire program each time it sees it.

The second point is critical: any programmer writing an assembly language program of any significant size will write the code to be maintainable. Of course, it makes sense to do things like defining standard entry and exit sequences for routines, keeping a few registers spare (in those architectures that have more than a few) and following other practices that lead to maintainable code, but the compiler doesn't have to maintain the code it writes. It gets to write the whole thing from scratch every time. This means that functions can be inlined (and repeated code sequences turned into functions). Loops can be unrolled, then rolled back up at the next compile when the programmer has decided that space is more important than speed.
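As a concrete (if contrived) illustration of the unroll/re-roll point, here is a rolled loop and a hand-unrolled equivalent in C; the function names and the unroll factor are arbitrary, and a compiler is free to regenerate or abandon the unrolled form on every build:

    /* Rolled form: easy to read and maintain. */
    long sum_rolled(const int *a, int n)
    {
        long s = 0;
        int i;
        for (i = 0; i < n; i++)
            s += a[i];
        return s;
    }

    /* Unrolled by four: often faster, uglier to maintain. A compiler can
       emit (or drop) this form on every compile; a human has to live with it. */
    long sum_unrolled(const int *a, int n)
    {
        long s = 0;
        int i;
        for (i = 0; i + 3 < n; i += 4)
            s += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
        for (; i < n; i++)          /* leftover elements */
            s += a[i];
        return s;
    }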

If you know you're not going to have to maintain a bit of code you can do some pretty scary things to get it to perform better. A compiler can potentially do that all the time.

Why should JIT be better than one-time compilation? This should be clear to anyone who's followed the Transmeta story. A key element of their code morphing technology is that they insert instrumentation into the code they generate, effectively profiling as it runs so that the compiler can decide which bits to optimize the next time it sees the code. It's well known that the coverage graph for a typical programme looks extremely spiky - the most frequently executed code may get hit thousands or millions of times more often than its neighbours. It follows that it really isn't worth optimizing the stuff that only accounts for a millionth of the code's execution time.
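A very rough sketch of the sort of instrumentation being described - per-block execution counters emitted by the code generator; the block numbering, array size and function name here are invented for illustration:

    #define NUM_BLOCKS 1024                      /* assumed number of translated blocks */

    static unsigned long hit_count[NUM_BLOCKS];  /* one counter per generated block */

    #define PROFILE(block) (hit_count[(block)]++)

    void translated_block_42(void)
    {
        PROFILE(42);     /* a couple of instructions of bookkeeping per execution */
        /* ... the generated code for this block ... */
    }

    /* Later, the recompiler scans hit_count[] and spends its optimization
       budget only on the few blocks with very large counts. */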

This brings me to my point: is there really any reason why a Java / JIT combination shouldn't result in code that executes as quickly as the equivalents in other languages?

You might suggest that garbage collection must slow things down, but I'm convinced that, done right, garbage collection can actually improve performance. malloc()/free() require the memory manager to think about the heap for every call, but new() can be implemented as a stack (m = memlimit; memlimit += size; return m) and the garbage collector gets to do all its memory management in one chunk - it can take an overview of the memory landscape rather than trying to keep things fairly optimal each time memory is released as free() must.
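A minimal sketch of the bump-pointer new() being described, with reclamation left entirely to a later collection pass; the arena size and the names gc_new/memlimit are made up for illustration:

    #include <stddef.h>

    #define ARENA_SIZE (1 << 20)           /* 1 MB nursery; size picked arbitrarily */

    static char  arena[ARENA_SIZE];
    static char *memlimit = arena;         /* next free byte */

    /* new(): allocation is just a pointer bump - a handful of instructions. */
    void *gc_new(size_t size)
    {
        char *m = memlimit;
        if (size > (size_t)(arena + ARENA_SIZE - memlimit))
            return NULL;                   /* a real VM would trigger a collection here */
        memlimit += size;
        return m;
    }

    /* There is no free(): the collector later scans the arena in one pass,
       keeps the live objects, and resets memlimit. */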

You could argue that the OO nature of Java means that it must dynamically allocate objects that would be static in a program written in C or assembler. That's true, but (assuming calls to new() are cheap, which I believe they can be) this really isn't a problem. Current processors don't take a huge performance hit when working with objects whose address is not known at compile time; in fact in many architectures it makes no difference at all.

So, while it might seem profoundly counter-intuitive, can anyone actually give me a good reason why Java + JIT should be slower than Good Programmer™ + Assembler?

This discussion has been archived. No new comments can be posted.

  • I will only comment on one of your points, while you are, in my opinion, right and insightful (you are insightful even on this one, but not totally right):
    When given an environment like Java which doesn't allow pointers and other antiquated memory techniques, a good dynamic compiler with a modern garbage collector is both faster and more efficient than any hand coded attempt on all but the most simple of applications.
    The point is that the average application has inflated enormously over the years, thanks to stupid programmers, abstraction and cool languages. You gave the key to this yourself: "on all but the most simple of applications"...
    Not that a good compiler cannot outperform even a good assembly hacker on, for instance, memory allocation. And not that abstraction in itself must lead to inefficiency, but it makes allocating/freeing memory inherently easy for the programmer, who will then of course allocate/free more memory.
    --The knowledge that you are an idiot, is what distinguishes you from one.
  • Other engineers are required to use the minimum resources to achieve the maximum requirements of the project design; shouldn't computer 'engineers' be able to achieve the same?

    Apples and oranges. The cost of materials for, say, a building demands that the least amount of them be used. Not so with bits and bytes. These days, they're almost free. I applaud the true craftsmen of the programming world for their principles and talents, but with all due respect, to me it's penny wise, pound foolish to waste time reinventing the wheel each time you have to write code. Did you dig your home's foundation with a trowel?

  • by DQuinn ( 110990 ) on Thursday January 27, 2000 @08:50AM (#1330831)
    Sorry... this post is probably meant to be deeper inside this thread.

    Are we not all just dancing around the idea of a human compiler with a vast database of algorithms, intuition, look-ahead properties, and an abstracted idea of what it is exactly that the program is doing? IE. There is no compiler, to my knowledge, which writes highly effective parallel code. Why is that? Simply because no compiler can really understand the overall problem.

    How long do you think it will be before the all-powerful AI compiler is written, with all of these abilities, and more, on a box that runs at a few GHz?

    Or maybe I'm just being a little too science-fictiony here :)

    Cheers,
    D
  • Take a look at a few of the most widely ported software packages around. How many of these redefine the types in order to "know the size of their types"? Not many, I can tell you.

    Quite a few. Amazingly, these are usually the ones that have portability problems. PGP 5 is a good example.

  • What if you compiled the original code with performance analysis included, and then fed that performance analysis back into a traditional compiler for a second compile? In this way, you get all of the benefits of the above JIT system without the running overhead of the JIT. Sure, it takes a few thousand runs to get the program optimized, and you can only optimize for the average case, but most programs that need to be heavily optimized will have a roughly similar execution pattern every time.

    I may be wrong, but you can do this with the SUNWspro workshop compilers. This technology would be nice to have in gcc as well.

    Mike
    --
    Mike Mangino Consultant, Analysts International
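    Where full profile feedback like the above isn't available, gcc also lets the programmer feed branch expectations to the optimizer by hand with __builtin_expect (in later gcc versions); a small sketch, with an invented error-handling branch:

    #include <stdio.h>

    int process(int fd)
    {
        /* Tell gcc the failure path is rare, so the common path is laid
           out as the straight-line, fall-through code. */
        if (__builtin_expect(fd < 0, 0)) {
            fprintf(stderr, "bad descriptor\n");
            return -1;
        }
        /* ... hot path ... */
        return 0;
    }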
  • I think you may be missing (or misunderstanding) the point a bit. Sure, you can furnish an example, or a language, in which a human can optimize the program /statically/ better than a compiler/computer can optimize a program /statically/. But the author makes the point that the most frequently executed parts of the program are very sparse, and that the cost of profiling the program dynamically can be recouped by doing intelligent dynamic optimizations. As hardware gets faster and more complex I think this will become more true. There will simply be no way for humans to statically optimize for such a non-deterministic runtime environment. For example, how many programmers explicitly specify parallelism in their programs? How can they be sure that the CPU at runtime doesn't know better than them how to optimize the program? I say let the CPU decide. There is going to come a point at which it is just not feasible for humans to even attempt to optimize for such a complex runtime environment, and that time is coming fast. The faster chips get, the more amortized the overhead of finding better optimizations at runtime becomes.

    Jazilla.org - the Java Mozilla [sourceforge.net]
  • by Junks Jerzey ( 54586 ) on Thursday January 27, 2000 @07:15AM (#1330840)
    You're right, of course. No matter what some people have been trying to say for 40 years or more, you can almost always write better code than a compiler. This is especially true when you're dealing with something bigger than a single function. The usual way of showing that a compiler is better than a human is by using one smallish C function as an example. That's a pointless example, because the benefit comes from analyzing that function in context and not on its own.

    The point of diminishing returns comes into play quickly, however. For example, take the renderer for most any fast 3D game. If you went into the routine that passes triangles to the graphics card, a frequent hot spot, and added a pointless call to a supposedly expensive function, like sqrt, you're not going to notice an effect on frame rate. Doing a higher-level optimization, like removing a single polygon from a model, is going to be more of a benefit than optimizing the polygon code, but it's still not going to be noticeable. Sometimes you can get big benefits by using algorithms that look more complicated, ones you wouldn't want to approach in assembly, even though they use more code.

    With complex programs, it's even conceivable for an interpreted language to outrun a compiled one, because everything comes down to architecture and an understanding of the problem. This hasn't been the case with Java, because Java is a fairly low-level language (the more abstract a language is, the less the win from compilation) and because Java has become entrenched in the "learn programming in 14 days" market of web designers turned programmers.

    So, yes, an assembly programmer can outrun most any compiler. But does it matter? Almost never.
  • Would you also send your pointers across a network and expect them to work at the other end?

    The idiots who wrote the code that caused me the pain would have...

    This involves one of the internal debates I've had about C for years. C allows you to shoot yourself in the foot. That is all well and good as I (I hope) know what I'm doing. The trouble is that, with turnover in this industry what it is, a new foot might have arrived by the time the bullet hits. (Unfortunately, I've often been the new foot.)

    1. Please give an example of the type of code you'd want to optimize; any reasonably simple (small fragment of) code that doesn't depend on advanced integer arithmetic should compile to something as fast as or faster than you would possibly write yourself... Please include example.c or .cpp, GCC input, GCC assembler output and your "better" ASM output.


    I was talking about a simple brute-force "compare, and if it doesn't work then move on" type thing. I think that this type of thing will always work (it just takes a little while). Considering that most modern computers are extremely powerful, I have to say that such brute force is still possible. What I don't get, and still don't get, is how people find the time to actually research all this stuff. Does anyone have a functional life anymore? I took courses for an associate of science and they didn't get anywhere near as complex as all the little theory arguments I have seen here.

    1. Invariant detection: using compilers to find invariants in the code, and then deduce lemmas which may remove bounds checks, type lookups, casting or flushes of registers to memory.

    Could someone please explain what a lemma is? I'm sorry, I guess I didn't take PhD-level math courses.
  • by Anonymous Coward on Thursday January 27, 2000 @05:28AM (#1330850)
    "In the general case", perhaps a compiler will produce better code (and certainlty not, as yet, for specialised cases - an inner loop coded direct to PPC macro assembler (PASM on Amiga PPC) is the fastest code I've ever written, but humans still write compilers. Now, if they were to force-evolve code using a genetic algorithm, you might get code better than any a human code write, but it'll probably depend on some weird side effect to some obscure instruction, or the rtesosnant frequency of your ram bus, or something...

    Cool article, all the same though. I'd love to see more of this sort of thing on Slashdot, like back in the old days.
  • The Crusoe technology will probably make the Java chip a reality and improve upon the idea all in one shot. Program the morphing code to run Java bytecode and you have a Java chip which optimises your code. A very clean language that you can easily maintain without sacrificing execution speed. Every software development manager's dream!
  • by Mr T ( 21709 )
    It's not sloppiness so much as what C++ lets you get away with. You start building objects and then you have to communicate between them. Do you provide references to objects or do you provide copies of the objects and then update them all when one changes? Once you start leaking references, if they go far enough, problems erupt.

    On large systems, master C++ programmers - master programmers - have had these problems in real-world applications. It's not so much not knowing when to delete objects as making sure you delete enough pieces when you delete them. If you delete an object that is referenced by another then you're going to have a crash. If you delete an object but not the objects it holds references to, then you have to have another reference to each of those objects, and you have to know when you're down to the last reference so you can delete the object.

    I agree about the sloppiness of thought associated with Java and GC, but that laziness and willingness to be sloppy is there regardless of whether or not the programmers are using GC. Take GC away and they may be a little bit more careful, but they've already shown that they are willing to be that lazy and sloppy.

  • 3. As WORA is important in Java, it is hard for Java to use any platform-specific resources such as graphics card acceleration. That is one of the reasons Swing's design is getting ugly as it tries to boost its performance.


    Surely it's the JVM's job to abstract hardware acceleration in the same way as, say, DirectX or OpenGL? For instance, my Windows JVM might know about DirectX and so implement the primitive drawing functions within the Swing classfiles using the accelerated calls. I don't know if that's what happens, but it strikes me as the logical plan.
  • by ZoeSch ( 70624 ) on Thursday January 27, 2000 @07:23AM (#1330861)
    Nevertheless I'd like to point out that JIT is based on a somewhat dangerous technique: a program that alters (its own) code. I believe this technique was used in the eighties to scare off hackers by making code incomprehensible and hard to disassemble until the program was actually running. Also (even on a good old 6502 processor) it's possible to make some speed improvements by peeking and poking into the code you're actually executing.

    Not at all true... self-modifying code is still around, not only in JIT compilers but in interpreted languages and even device drivers (or how do you think some device drivers adjust in real time to hardware changes?). And it's not a dangerous technique in itself, but (as usual) when executed improperly it can be a wild beast (as most people discovered when they installed their first Cyrix and AMD 486 processors and discovered most x86 code wouldn't run properly because of cache integrity issues).

    When compilers for microcomputers got faster, and most processor architectures (known as Harvard architecture, I believe) explicitly required a division of RW and RO memory segments, self-modifying code was abandoned...

    First of all, Harvard machines require separate instruction and data pipelines and memory spaces, and 99% of the CPUs on the market (general-purpose, embedded, etc.) are von Neumann machines which use a single space to store instructions and data. The thing is, newer CPUs (even if they're von Neumann designs) have separate caches for data and instructions, and most self-modifying code violates cache integrity (see above) because the code is modified in the cache but never stored back into memory (unless you use a write-through cache design, which is impractical from the performance point of view).

    BUT (and that's a big BUT :) if your JIT VM forces periodic cache writes to RAM (not every time, but enough times to ensure some sort of coherence if the cache and RAM are out of sync), you lose a bit of performance but gain a stabler setup.

    Disassemblers for JIT won't be as complex as a JIT assembler, just because when you disassemble a piece of code you treat it under the black-box principle (what goes in and what comes out) in order to derive the fundamental principles and algorithms, which can be implemented in 1E999 different ways. So if you disassembled the less optimal morph, bad luck... disassemble another test run and see if you get a better implementation of the algorithm. Or you could use the aforementioned principle and implement your own algorithm.

    JIT code/compilers, BTW, are also being tested to produce self-modifying chips (ASICs and FPGAs) under VHDL/Verilog, also using neural nets or genetic algorithms to obtain a first implementation and then using the JIT to optimize it...

    Really interesting stuff (the Transmeta Crusoe), and my bet is that soon other companies will follow, although not for the same reasons as Transmeta :)

    ZoeSch

  • by Chuan-kai Lin ( 26774 ) on Thursday January 27, 2000 @07:26AM (#1330867) Homepage

    I do agree that machines should spit out better code than humans do, but we are just not there yet. When will we get there? Maybe tomorrow, maybe next month, maybe next year, maybe next century. Nobody knows. Suppose we do get there tomorrow; I believe it will not be Java + JIT that makes it, for the following reasons:

    1. Java is an imperative language, which is very hard to reason about because of the lack of referential transparency. Yes, I know that it is a known fact in computation theory that it is impossible to reason about arbitrary programs, but the fact is that we do not write arbitrary programs. Imperative programs with destructive updates are hard to reason about, and as a result we are almost always restricted to local optimizations. Local optimizations are great, but I am afraid that that alone will not be enough to break the barrier.
    2. Vast amounts of information are lost once the source program is compiled into bytecode, making effective optimization much harder. You can tell the intention of the programmer from the source code, but it is hard to do so for machine instructions (or even preprocessed source code, as anyone who has read the nVidia XFree86 source code will agree). The JIT compiler must create code that acts exactly the same as the bytecode, and that restriction severely reduces its options.

    If that breakthrough were to come tomorrow, I believe it is most likely to come from the functional programming people. I believe that is the way to go for the future.
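    One small, standard illustration of how destructive updates get in the way of reasoning: in the C fragment below the compiler has to assume *sum might alias a[i], so without an aliasing proof it cannot simply keep the running total in a register (the names are illustrative only):

    void total(int *a, int n, int *sum)
    {
        int i;
        *sum = 0;
        for (i = 0; i < n; i++)
            *sum += a[i];   /* each store through sum could, as far as the
                               compiler knows, change a[i] on the next pass */
    }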

  • I don't see where your argument about placing memory for new() on the stack holds up. The point of dynamic object creation is to allow the program to create an arbitrary number of objects and then free() them when done with them.

    Trivial example that really makes the stack argument fall apart:
    Let's say I have 500 objects new()ed. I'm assuming that the first object created would be on the bottom of the stack, and the last object created would be on the top. I now delete the first 499 items. I now have 499 * object size bytes of memory on my stack that I can't use because I still have the 500th item on the stack.
    I can't recover the memory because it's on the stack.

    Am I missing your argument? Help me out here.
    --
  • "garbage collection is slower than hand coded memory management" - You really ought to read some of the latest garbage collection papers. This is no longer true even for collectors managing C or C++ allocation. When given an environment like Java which doesn't allow pointers and other antiquated memory techniques a good dynamic compiler with a modern garbage collector is both faster and more efficient than any hand coded attempt on all but the most simple of applications.

    Now, I have yet to see anyone provide concrete examples of a garbage collector performing better for the general case than standard memory management. However, let's assume that the garbage collector provides a small performance gain overall.

    Unfortunately, as others have pointed out, compilers don't understand the objective of the program they are dealing with the way assembly coders do. The same is true for a garbage collector.

    I can speak from experience as a game programmer here. There are certain parts of a game which are time and performance critical. How can I trust a garbage collector not to slow things down in that section? Well, I can't. There has to be some form of programmer interaction telling it not to cleanup in section X of the code.

    This isn't very practical for games though. Typically we want full control over the memory. We know the upper limit of what we need, and we don't care much about other applications. So we can allocate a large chunk at start time and then work within that, to avoid the new/delete/malloc/free overhead.

    If removing "antiquated" things like pointers is the best solution, then we'd all be writing games in Java. OK well that's a stretch but you see my point...

    Best regards,

    SEAL

  • by TheDullBlade ( 28998 ) on Thursday January 27, 2000 @09:16AM (#1330874)
    "1...not a JIT system but a dynamically optimizing compiler" a slow start is a big problem (which is why they are emphasizing long-term server programs), but that's not the main issue. There's no reason that profile-based optimization can't be applied to C programs. That there's an experimental JIT (yes, it's still a JIT) that's ahead of the C compilers in common use is no reason to claim that the JIT strategy is better.

    "2...as long as the results of the operation don't change" Which means that if the representation is different in a meaningful way they have to check whether the results of the operation change, reducing the efficiency, which was my point.

    "3...on all but the most simple applications" But I rarely do anything but the most simple applications. I allocate an array, then I free an array. I do this for hashes, trees, stacks, queues, and pretty much any data structure I use. Only when I am being extremely lazy will I allocate and free objects recursively. Modern CPUs with slow main memory and fast cache like it when you keep all the data together, so I generally do so. I also try to access it linearly, which they also like. The point holds.

    Don't try to compare the most primitive and unoptimized methods of custom memory management to the most sophisticated garbage collection algorithms. Remember, too, that a person doing custom memory management can write the same sophisticated garbage collector that Java would use.

    And don't confuse "primitive" with "fundamental". Pointers are fundamental. Having to analyse reverse-polish code to prevent dangerous operations, rather than making it impossible to construct operations you don't want, is primitive.

    "4...then tell me that dynamic dispatch is a problem." Okay, dynamic dispatch is a problem. Just because it can inline methods here and there doesn't mean you won't take an overall performance hit. Besides, that was only one example of the ways the lousy Java handholding slows things down.

    The Sun marketing papers you linked me to are just more of the same old Java hype. They'll go on to "prove" that their new super JIT is better than anything you could code by hand by comparing optimized idiomatic Java with the same code translated poorly into unidiomatic C++.

    Idiomatic C written by a competent optimizer will still blow the Java version out of the water.

    JIT by any other name smells just as bad.
  • This is very interesting stuff, and I do care about these questions, but virtually none of this enters into my selection of a language for a task.

    When you are writing a routine that reads single keystrokes and puts them in a buffer, you really don't care much how optimized the loop is unless it takes more time than elapses between the user's keystrokes.

    In other words, I'm very glad people are working on smarter compilers and Java optimizations, but I select Java for one reason alone:

    It is easier to write more code more quickly with fewer bugs in Java than in C or C++.

    Most of us in MIS find ourselves in situations where we can throw more hardware at things much more cheaply than we can pay people to write and debug code.

    Certainly there are situations where we are processor-bound, or where data comes from a high-volume network source, and then we must consider performance; but since in many cases even a slow language is "fast enough" for the maximum throughput, the ease and reliability of development come to the fore.

    So, if I want a fairly portable device driver, I write in C. If I want a stable application, I choose Java. For me it is this feature of the language, not its speed (or lack thereof), not its "write once, run anywhere," but rather the ease with which stable code may be written that draws me to Java. After that, I am concerned more with the footprint of the VM and its impact on non-Java system components than I am with the optimization of Java byte codes.

    I say this not to disparage this discussion, which I really do find enormously interesting, but I must say that for many of us, this is the computing equivalent of counting angels on pinheads.
  • The only advantage C++ has is its speed. There is no other advantage. There are lots of disadvantages, though.

    Java is easier to code, easier to maintain, platform independent. And that's only the language. C++ is going to be COBOL's replacement. The next generation of programmers will make huge amounts of money keeping C++ code running. Don't get me wrong, C++ is a powerful language in the right hands. Unfortunately it often ends up in the wrong hands: people who never really understood the OO paradigm, people who love to use multiple inheritance, people who think that macros actually improve the maintainability of their code... (all the horrible things you can do to make source code unreadable)...

    You are probably right that Java is suitable for web applications, but then, what isn't web-related these days? Standalone programs are a dead end for most domains nowadays. The client-server and n-tier architectures in use nowadays make networking a requirement, not an optional feature.
  • I'm surprised nobody else has commented on the obvious here. Sure, maybe dynamically recompiling is awesome. But doing it twice at once isn't. Just imagine your Crusoe chip screaming in agony as the code-morphing tries to profile and recompile code that is being dynamically recompiled by the JIT compiler. And how do you fix that?

    By abstracting your two smarty-pants recompilers into a "hardware" and an "interpreter" layer, you've removed any way for them to talk to each other or even really be aware of the other's existence. The problem's not theoretically impossible, but the practical advantages of keeping your abstractions clean make me worry about the hopes for Crusoe web-pads (which would presumably run a lot of Java).
  • Actually, computers tend to be even better at speed chess than at regular chess. The reason is that computers analyze games by expanding the game tree as far as possible. Because the chess game tree grows exponentially, this leads to diminishing returns as the amount of time increases; if you watch the "ply searched" indicator on a chess program it'll generally jump to somewhere between 3-5 moves almost immediately and then appear to stall. However, even this shallow game tree is enough to catch an unsuspecting human in a tactical trap.

    The reason humans can still beat computers is that they can apply long-range strategic planning -- "what are the best ideas to pursue in this position?" -- and rules of thumb or experience about what sorts of things are likely to happen in a certain type of position. Working from this, they then decide on a small set of moves to evaluate, usually in a narrow tree (because humans can generally discard irrelevant moves out of hand; I believe some better computer algorithms try to do this as well, but I'm not sure). This approach doesn't really start to pay off until the human has had some time to think, but once it does it can allow the human to catch up to the computer, which is still examining every possible outcome of shuffling the rooks back and forth.

    Disclaimer: all of the above is empirical guessing and may not bear any resemblance to reality.
    Daniel
  • This is a personal anecdote where I would say that the compiler generated better code than I would have.

    I worked on an embedded network analyzer product using the Intel gcc compiler for the i960. It contained about 16000 lines of C code (not trivial, but not large) that included a bare-bones operating environment, network drivers, and the actual analysis code. Of these ~16000 lines, about 500 of them were in the "common critical path" and would execute on every incoming packet.

    The Intel gcc compiler for the i960 has this really cool feature called "profile-based optimization". You build an instrumented version of the executable, run a set of test data through it, get the profile data, and re-compile with the profile data. This optimization feature had the added advantage of closely integrating both the compiler and the linker.

    The main benefits of this optimization included:

    Setting the branch prediction bits for the most common cases.

    Inlining functions with high hit counts : both within a module (traditional inlining), and across many modules (an advantage of linker integration)

    Placing variables with high hit counts in faster memory (the i960 had 1k of on board, single-clock DRAM, and our card had both 2-clock SRAM and 4-clock DRAM)

    Inlining all functions that get called only once in the executable.

    I saw a 150% performance improvement over the non-profile based, but otherwise fully optimized builds.

    The key 500 lines were interspersed through about six functions in four modules, often in functions where half the function was error/unusual path handling. Although I could have rewritten those 500 lines in asm, it would have been very hard to keep it integrated with the rest of the code. Also, when I examined the instruction trace of those 500 lines, I found only two or three places where I would want to change things.

    Intel's x86 compiler is supposed to have a similar set of profiling optimization features, and another feature that's even more important on x86, which lays out functions and data in the executable to reduce cache misses and page faults based on the profiling data.

    In the "me too" catagory: Optimizing for Alpha processors is a nightmare best left to the compiler. Just try to single-instruction-step through some Alpha code and see how much you move around within the C source.

    Peter (first /. post!)

  • I had a Commodore 64 once. Never bothered to learn assembler code for it, but even today I'm amazed by what people did on those machines. But it just doesn't scale up to a modern PC.

    Perhaps those people from Intel can write code for trivial software, but anything that goes beyond that is beyond even the minds of those people. Of course those guys at Intel can program their own CPUs manually (at least I hope so). Hand coding millions of lines of assembly code is something else though, let alone optimizing it. The problem doesn't scale linearly; you need a computer to optimize that.

    Either way, neither a static compiler nor a human being can compete when you start to involve real-time profiling information. It gives the dynamic compiler an edge: it can optimize for situations only known at run time.

    So, I hope you see my point. I admire people who can hand-code a more optimal program than a compiler, but they can't compete with dynamic compilers such as Transmeta's code morphing or Sun's HotSpot because they simply don't have the kind of information those compilers have.
  • There is certainly a range of people that write code and I've seen both extremes. Poorly written higher level language programs aren't going to be improved by a compiler. There might be suggestions by a good level syntax checker but that isn't the topic here. One of the long term comments I've always made is that you can write Fortran in any language. If you don't understand the language constructs sufficiently, you aren't going to be saved by the compiler. There is no substitute for programming competence.
  • This post [slashdot.org] clarifies what I was saying. To summarize:
    1. By "modern processor" I mean a processor that allows a compiler to perform good optimizations. The x86 instruction set is not very good for this. I don't believe the 486 even performed any parallel execution. Specifically, I have experience with the Alpha.
    2. You are talking about a specific algorithm. As I said, 99% of the time the compiler will beat you. I didn't say assembly was dead or not useful sometimes. Just that most of the time the compiler will generate faster code overall than a human.
    3. Fascinatingly enough, there are cases where your assembly optimization will actually hinder the compiler from doing a better optimization. Specifically, some parallelizing compilers, I think.

    Check out this [nec.com] link for a plethora of information on compiler optimizations.


    --GnrcMan--
  • There are also lots of valid reasons for code bloat. Well, maybe not real "bloat", which would imply lots of frivolous code. What I mean is that there are lots of valid reasons why programs are growing ever larger.

    The use of libraries, even for a small number of functions, increases code size (thus the 20k "Hello World" program). However, if we proceed to enhance the "Hello World" program to do a lot more, including input and output and simple processing - the code does not grow proportionately. In other words, a program 100 times bigger and much more useful than "Hello World" may only take up twice as much space.

    Code size increases today are mainly due to interface issues. Very little interface code in Word Perfect 5.0 for DOS compared to Word Perfect 2000. It's a necessary part of the modern program.

    I just don't think it's crappy compilers, or necessarily crappy code writing, that's to blame. It's feature bloat. And like anything else, as it becomes larger it becomes harder to steer.

    The fact is, though, whether we need the features or not, programs today do a lot more than they used to. Word processors give you many more format options, fonts, merge features, etc., than they ever did. Add the big GUI code (90% of most of my programs ends up being GUI stuff these days), and you get a modern "bloated" program.

    I hate to say it, and I write command line utilities as much as possible, but when your target audience expects a visual interface, it adds a LOT of overhead to what might otherwise be a simple program. How big is "Hello World" compiled using Xtoolkit or Motif?!? Not to mention the libraries that may need to be loaded at execution time, making the perceived size even larger.

    I'm not excusing poor programming, and I certainly have a dislike of "Visual" programming tools for my own reasons, but code growth is a necessary evil.


    ----------

  • Your programmer would have to be really good to beat a modern compiler at optimizing for a very complex instruction set. Also, he would probably need a lot of time even to get very simple programs up and running that way.

    And then there's still something a Java VM (I'm assuming the HotSpot type here) can do, and that is dynamically (i.e. while the program is running) optimize code using profiling information. This also works in situations where a program dynamically loads a class (possibly a class that did not exist at the point the other classes were developed). Your programmer cannot take this situation into account when he is coding.

    In other words, under normal circumstances the assembler superman would lose (badly).

    Now, your next question will probably be why Java is so slow then -> see my other replies on this story.
  • As a side note: I remember hearing that FX!32 also kept some information from the JIT compilation and stored it on disk for the next time the application is run. Does anyone know whether this is still used, and to what success?


    Yes, FX!32 would actually profile the application and convert parts to native Alpha code, storing the native code in a database. Incidentally, the profiling happened during execution, while the optimization happened during free time, after the program was done being executed. The more you ran the program, and the more code paths you hit, the faster it would get.

    It's a moot point now, though, since NT doesn't exist for the Alpha anymore.

    --GnrcMan--
  • Massively parallel creations are largely very close to what the programmer wrote, good compiler or no. Compilers for this case are especially weak, and only make vague and half-correct assumptions.
  • I'll simplify:

    1)Anything that can be expressed in Java can be expressed in hand-coded assembly; the reverse is not true.

    2)Humans are smarter than computers.

    3)Humans can always use computers as aids to their work, and so produce output equivalent to what the computer can produce as a baseline worst case.
  • On a machine with 64-bit ints, as on a machine with 32-bit ints, ints and longs are the same (assuming the compiler was competently designed; it could fit the standard without being). Using long should only ever confer a performance hit on 32-bit machines; for 32 bits and over, long is pretty much guaranteed to be the same as int (AFAIK, the standard doesn't force it, but it doesn't require any more range in a long, so it would be stupid to make it bigger).

    If you want to define your own special int ranges, you can always use Ada. That's the kind of language you get when you throw in every nice little feature.
  • by um... Lucas ( 13147 ) on Thursday January 27, 2000 @07:45AM (#1330922) Journal
    I read that the 700 MHz Crusoe chip actually only attains the performance of a 500MHz Pentium (II or III, I forgot - oh, can Crusoe emulate SIMD? Not like it matters, but I'm wondering).

    So there you have it - code morphing, or just-in-time compiling or whatever, gives you about 70% of the performance of precompiled code.

    That's not to downplay Transmeta's theoretical accomplishment. Those extra 200MHz go to things like monitoring its VM, throttling the clock, etc. So in the end you lose 30% of the performance but make it up with 35X less power consumed.

    It's a great trade off for laptops, handhelds, etc. But not for workstations, servers, etc...

    I still don't like the idea that they're keeping the instruction sets closed. It would seem like if someone out there wanted to port GCC to Crusoe's native instructions, that would be good... But they just don't want to be perceived as being at all incompatible with Intel, I guess.
  • Java is easier to code, easier to maintain, platform independent. And that's only the language. C++ is going to be COBOL's replacement. The next generation of programmers will make huge amounts of money keeping C++ code running. Don't get me wrong, C++ is a powerful language in the right hands. Unfortunately it often ends up in the wrong hands: people who never really understood the OO paradigm, people who love to use multiple inheritance, people who think that macros actually improve the maintainability of their code... (all the horrible things you can do to make source code unreadable)...


    I have actually learned/am learning C++ at a collegiate level. I can say that they do not offer Java in the CS department. This is a reflection of the fact that Java is not important for what programmers actually do in the world. There is one class of Java taught in the whole college, and that is in a business-related, sub-optimal computer program. This surely says something about what people really do. I would wager a whole keg of beer that you would find a large percentage of schools that do not use Java and a larger portion of software development that doesn't use it (around 30-70%). As long as Java remains so uninteresting and tied to the web and the internet, I will never learn it. Haven't you ever tried to access the internet and couldn't for whatever reason? Now look at the number of times your computer has prevented you from accessing the data you need. What should happen for most people is that the internet should be statistically more prone to failure than traditional media, and far less controllable and customizable for your needs.

    You are probably right that Java is suitable for web applications, but then, what isn't web-related these days? Standalone programs are a dead end for most domains nowadays. The client-server and n-tier architectures in use nowadays make networking a requirement, not an optional feature.


    This is quite bad for at least a few reasons:

    1. Access: if I need to have more silly requirements, then it begins to decrease the ability of that application to actually get something accomplished in a timely manner. When I get a computer, do I have to get, say, a PhD in CS to use said computer (contrary to what people would like of the world)? No. This is because it's not necessary. Suppose I have a video game. Do, say, even Dreamcast or PlayStation games, or even almost any PC game, really *need* to be connected to a network? If I am just a single lone person, I can't really see any advantage that makes the experience any better for me. Almost anything you want control over does not *need* to have access to the web. Does working with, say, simple spreadsheets need web access? How about word processing?

    Also, let's look at how the business model works in the internet world for services. Usually the trend has been that the individual has little or no control. This is reinforced by the huge cost of actually getting a dedicated server or bandwidth so that you can control it. Again, for the average person (a typical consumer who wishes to utilize the computer and has a moderate income of say $25,000-$70,000 USD/yr, even if they are schooled in CS), this doesn't help at all in any way. Tell me why I shouldn't have access to all the things that I want to do at my fingertips, without the need for someone to control it, charge me for it, change it, or make use of a series of networks which quite frankly are controlled by robber barons who don't care about the little guy?
  • I can also write code in ML that can be proven correct at compile time, since it's statically typed and lexically scoped (Java can still come up with an exception and die on you).

    So my question to you is: How are your languages so magic if none of the software I use on a daily basis uses them, or seems to miss their properties?

    Some guys here at CMU wrote a TCP/IP stack in ML that was provably correct. It took them a whole year. Was it worth it? You could write a whole operating system in that time!

    ...

    The stack-based ISA for Java is slow. It doesn't map well onto actual processors. It's a big reason that even JITC'd Java isn't very fast.

    There's no reason not to use a language like Java with slightly stronger typing than C++ (although there are good reasons to use C++; there are powerful things C++ can do that Java can't). But it should be compiled, even if "compilation" is something that happens the first time you run a program, and it should be compiled from a more intelligent format than a nonexistent ISA. Why not distribute source code after the frontend has run on it, with all the symbols resolved and the syntax broken up into lower-level constructs? That would be as hard to reverse-engineer as assembly, and significantly moreso than Java classes (which leave way too much info in there).

    ...

    Being able to state with supreme confidence that this type of bug does not exist in my code because it has been rigorously proven is helpful. You really should be more concerned with running correct code than running "fast" code.

    Except for the fact that we've gotten everything done so far by writing fast code, and testing it to see that it's reasonably bug free. Proving code with Java is still quite hard, and writing code at all in ML is very hard.

    Empirically, companies will trade off getting the software written for having it proven; they've been doing things that way for quite a long time. Java certainly won't change that.
  • No. Hand coded assembler, coded by a clueful programmer, will always be better than compiler generated code.

    You are wrong.

    With modern processors, a decent compiler can do better general optimizations than you or any other assembly programmer I've ever heard of. While there is that (less than) 1% of the time when you can hand-optimize a small piece of code, that's the exception rather than the rule. The rest of the time you are better off letting the compiler do its job.

    --GnrcMan--
  • I think I agree... :)

    There is indeed an art - if not a Zen - and a science to knowing when to optimize stuff, and I think it boils down to 'use the hottest algorithm around' (especially if you start in C), *followed* by 'hack the ASM by hand'.

    Something I finally got round to doing just this past weekend, that I'd been meaning to do for a while, is to compare the speed of execution of loops that simply count between 0 and a big number, but using different comparison methods, and different compiler optimization levels in gcc, with Perl for the hell of it as well.

    The full source I used follows:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define INNER 5000000
    #define OUTER 200

    void up(void), down(void), xor(void);

    int
    main (void)
    {
    int c;
    time_t t0, t1, t2, t3;

    t0 = time (NULL);

    for (c = 0; c < OUTER; c++)
    ...

    ... Mosix [mosix.org] cluster, which was "interesting", to say the least... ;)

    As for perl, well, that behaved the way I expected. The whole thing was way slower than C, but it did at least do an xor when I asked for it - that was almost invariably faster than counting up with "<".

    Well, those be the results... compile the source yourselves with all the various gcc options and compare the assembler, folks :)
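    A self-contained sketch of the kind of comparison being described (an illustrative reconstruction, not the original listing; the iteration count and timing method are arbitrary):

    #include <stdio.h>
    #include <time.h>

    #define N 100000000L

    int main(void)
    {
        volatile long sink = 0;   /* stop the compiler optimizing the loops away */
        long i;
        clock_t t;

        t = clock();
        for (i = 0; i < N; i++) sink++;      /* count up, "<" comparison */
        printf("up:   %ld ticks\n", (long)(clock() - t));

        t = clock();
        for (i = N; i != 0; i--) sink++;     /* count down to zero        */
        printf("down: %ld ticks\n", (long)(clock() - t));

        t = clock();
        for (i = 0; i ^ N; i++) sink++;      /* xor as the end test       */
        printf("xor:  %ld ticks\n", (long)(clock() - t));

        return 0;
    }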
  • I read that the 700 MHz Crusoe chip actually only attains the performance of a 500MHz Pentium

    So there you have it - code morphing, or just-in-time compiling or whatever, gives you about 70% of the performance of precompiled code.

    Comparing interpreted to compiled code on a MHz-for-MHz basis makes no sense, because MHz is not an intrinsically interesting number. Most people don't care about what their computer's clock speed is. They care about how fast it runs applications (i.e. MIPS), and how much it costs. So don't compare Crusoe to an Intel chip on the basis of a MIPS/MHz ratio. Compare it on the basis of a MIPS/$$$ ratio. How well does a Crusoe perform compared to an Intel chip *of the same price*?

  • Aaaargh!
    CmdrTaco, your preview mode is broken. That worked for me before.

    In any case, the source is available here [custard.org].
  • Comparing interpreted to compiled code

    Complete brain fart. That should read "Comparing Crusoe to Intel..."

  • It's wrong to have the most basic data type wobble over the bit length.

    No. This is simply wrong, and almost humorously so. What are you going to do? In the C standard, are you going to say "Implementations shall make int 32 bits wide"? What do you do on a machine (say, an embedded system) with 9-bit bytes? What about a Cray YMP? The goal of the C standard is to have the implementor pick representations for the basic types that make sense for the target machine. Performance is nice, but in the end, they have to make sense. Now, there are some limits in the form of minimal ranges that are placed on the "basic" integral types, but that's about it.

    The solution would have been to have the basic data types with fixed size, and additional data types with "some size between 16 and 64 bit giving the optimal performance on the target machine".

    You should be more than pleased with the next revision of the C language standard, C99. Among other things, it standardizes such types as int16_t and int32_t, which provide exactly the functionality that you seem to think is important to portability.

    This would have saved many people a lot of grief...

    What would have saved many people a lot of grief is if they had learned C correctly in the first place. They would then know what it guarantees and what it does not, and how they can use the language facilities (i.e., typedef) to satisfy their non-portable desires. The amount of junk I see in code written even today is staggering! (The assumption that short is 16 bits, long is 32 bits, etc... casting the return value of malloc(), etc...)
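    For what it's worth, a minimal C99 sketch of those fixed-width types (the struct here is just an invented example of a format whose field widths must not wobble across platforms):

    #include <stdint.h>
    #include <stdio.h>

    struct pkt_header {
        uint16_t length;     /* exactly 16 bits wherever uint16_t is provided */
        uint32_t sequence;   /* exactly 32 bits */
        int32_t  offset;
    };

    int main(void)
    {
        printf("header struct is %lu bytes\n",
               (unsigned long)sizeof(struct pkt_header));
        return 0;
    }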
  • Hmm... That bites.

    If you ask me, it is a design flaw. Longs are only required to have the range of a 32-bit number, so it is wasteful to include more bits. A lot of portable code will have unnecessarily bloated memory use and any reliance on the extra 32 bits throws portability out the window. Given that, a non-portable extension for 64-bit ints (since full use of 64-bits is inherently non-portable; with a lot of sizeof checking some use is possible, but not convenient) to be used only for system specific optimizations would probably be a better choice.

    Given that most code run on Alphas probably isn't written with a 64-bit long in mind, it almost certainly hurts more than it helps (especially with memory access being the main bottleneck of modern systems).

    The next C standard should fix this problem by providing a 64-bit int (if you ask me, they should just allow you to stack longs, doubling the bits of the minimal range each time; writing arbitrary bit size emulation into the compiler is trivial, though the trivial implementation is inefficient, and it would be a permanent fix that could have surprising benefits for certain supercomputer applications), as well as the addition of fixed bit size types (which is, I suppose, good for networking code, though I fear people will abuse it; IMHO it's better the way it is where you should chop everything up manually into whatever format the protocol takes).
  • Somebody moderate this guy UP. He's hit the nail on the head with a huge sledgehammer; this is _exactly_ why code morphing is different from, and far more spectacular than, the JIT concept.

    Although I still want access to the morph layer to play :)

  • by PapaZit ( 33585 ) on Thursday January 27, 2000 @05:37AM (#1330967)
    Doesn't Sun have a patent on hardware that can run java bytecode natively?

    I wonder where transmeta-ish devices fall. I'd call it software, but would sun's lawyers?
  • by tjansen ( 2845 ) on Thursday January 27, 2000 @05:38AM (#1330971) Homepage
    I have already heard the assumption that a compiler can generate better code than a programmer a thousand times, but it does not get any more true by repeating it - it is false, at least until compilers are able to understand the program. In every program there are things the compiler simply doesn't know. For example, in x86 assembler it is possible to save some cycles (and memory accesses) by using 16-bit or even 8-bit registers instead of 32-bit registers. You can double the number of registers by doing it. You need to be sure that the values in the registers are below 65536 or 256 to use these tricks, and the programmer can know this, but the compiler can't. The compiler might profile the possible value range if it is a really advanced compiler (I have never heard of any compiler actually doing this), but it cannot be sure, so it must at least check the values beforehand for the case that they aren't.

    As long as the programmer has more knowledge than the compiler, he will always find tricks to save an instruction here or there and outperform the compilers this way. You can find a great example of such programming tricks in the PCGPE's article about texture-mapping inner loops here [gamers.org].

  • If I'm understanding what you're saying correctly, a JIT compiler will compile the exact same code each and every time you go to run it, while a conventional compiler will compile it one time. Yes, the JIT compiler might be able to tune the performance better for each run, but I think it takes a performance hit at the start by having to do the compilation. While the JIT compiler is compiling code, the exact same code compiled with the conventional compiler is already executing.



    The faster the hardware becomes, though, the less noticeable the differences between the two might be. Although, the more powerful the hardware becomes, the more complex the code becomes.

    Nothing exists except atoms and empty space; everything else is opinion.

  • It's been over a decade since I did any appreciable assembly coding (I quit after the 80286), but I have a serious question, and I'd be interested in serious answers:


    How many of you actually routinely (e.g. 6 times a year) examine the output of your compiler, and compare it to what you would have written?


    I've done it often (a promise I made to myself as a foolish youth) and learned quite a bit. It's actually changed my higher-level coding technique on a few occasions (and I'm not just talking about optimizing for a better fit with a specific compiler).


    Not only have I never met another person who does it, but I've almost never met a programmer who didn't look at me oddly for even mentioning the practice. And yet, here are so many people talking as if they did it routinely. I guess that's just the Silicon Valley edge.


    I'm not being sarcastic or snide. I really want to know. Maybe I should give up my current profession (ditching a decade of grad and post-doc training) and move to the West Coast, where Programmers are Real Programmers.


    Or maybe I should finish reading my NEJM, and we should all just keep in mind what we really know, and what we only think we know. Again, no disrespect, Lord knows I've had to re-think the extent to which I "knew" things before.


    Please post your honest answer, and encourage your friends to do so as well. I'm afraid I'll have to take a lack of responses as a confirmation of my worst fears -- that everyone is only guessing.

  • Well, a good optimization typically makes one type of instruction or sequence of instructions a fair bit faster (say 30-50%).

    While this might give a 2% improvement across average code, it will make a lot of difference if this particular optimization helps out your inner loop. There are times when you really need this.

    The way I see it, if compilers can optimize effectively, then there becomes less need to hand-code an inner loop in assembly. Decent optimising compilers may not give you a great deal of extra speed but they sure as hell give the hardcore coder an easier life.

    BTW, The rate of hardware improvement is a good argument *unless* you are trying to stay at the cutting edge. In these cases, optimizations are certainly not trivial.

  • Talking about dynamic recompilation and program profiling, and specifically about HotSpot, I've always wondered why I have to put up with the fact that, between program runs, the compiler "forgets" everything it learned during previous runs. Why can't the compiler generate a log for each program, and review the log when the program is run again, so the compiler can pick up learning where it left off?
  • [slashdot bug reformatted the last post] As the microkernel unix people used to say before they finally died out, "We are within X% of the performance of the original". - Alan Cox The JIT guys can write all the papers in the world about how they should (yes, really should) be able to get native code speed. No, that extra level of indirection is really a performance benefit, not a hindrance! I'll believe it when I run Java bytecode and it's not dog slow.
  • First off, there are profound performance differences that can be achieved simply by optimizing for the Pentium Pro and not the Pentium; see The Twofish Implementations [counterpane.com]. This is mainly just instruction ordering issues - possible to do in a JIT. There's more to optimization these days than simple cycle counting.

    Second, memory management is a different beast under object oriented programming than under procedural programming. The holy grail of object oriented programming (as it were) is black box reuse - allowing a class maintainer to modify how a class works without those using the class needing to change their code. Not having garbage collection in an OO language either disallows true black-box reuse (which is a maintainability issue), forces the object to implement its _own_ garbage collection (such as the C++ STL template basic_string does - and this is generally the worst sort of garbage collection, such as basic_string's reference counting), or forces the program to leak memory like a sieve. If those are the options, then garbage collection is not too high a price to pay.

    Especially when you consider that GC isn't that high of a cost. _Every_ memory management scheme that isn't static in nature that I've seen is O(n). Either on the allocation or on the free. Sooner or later you have to search for a hole. GC doesn't remove this penalty, but instead it allows you to _defer_ it until a better time. Since most programs spend most of their time waiting for input, there are lots of places where garbage collection can be free because the program isn't doing anything else.

    Last, but not least, I'd like to point you at The GC FAQ [iecc.com], a good resource for all things GC related. Highly recommended reading - after all, it was once "intuitively obvious" that the world was flat, that the sun went around the earth, and that time didn't slow down as you went faster.

  • When you run a Java program, the JIT has to do its thing before the program starts. That means the application takes longer to launch. For a lot of GUI apps, this is exactly where the greatest bottleneck is.

    So the kind of optimization suggested probably wouldn't improve matters for the user, at least with the small, visual applications Java was supposed to be good at. However, it suggests the people using it for numerical work may not have been so mad after all.

    If it works.

    Of course, you could cache the compiled code, so that it launches faster the second time. Pretty soon, you'll end up re-inventing the compile/run cycle.

  • by Hard_Code ( 49548 ) on Thursday January 27, 2000 @10:15AM (#1330990)
    Right. I've always thought an ounce of design is worth a pound of optimization ;)

    Leave the design to the humans...leave the optimization to the compilers.

    Jazilla.org - the Java Mozilla [sourceforge.net]
  • by TheDullBlade ( 28998 ) on Thursday January 27, 2000 @05:42AM (#1330994)
    Seems profoundly intuitive to me.

    First of all, a JIT is rushed. You can design your optimizing one-time compiler to look at the code for better ways to do it all day, if you like, and while the developers will moan, they will still use it if it gets them better results.

    Secondly, Java has a very specific standard (that is typically fudged... but that's beside the point). It doesn't give any leeway for a program to act a little different in boundary conditions, like C does. In C, an int is whatever size int is most easily handled by the target system; if you need a certain exact size of int, you can code that in with some extra effort. In Java, an int is a Java int and Java doesn't care in the least what the most efficient native int format is.

    Thirdly... automatic garbage collection is less efficient than hand coded allocation and deallocation, and dynamic allocation is less efficient than static allocation. There are odd cases where this is false, but they generally hold true, and where they do not, the C version can always use the more efficient method anyway. (and extremely fast calls to "new" can be easily achieved... at roughly 50% memory wastage)

    Fourthly: Java locks you into an OOP model which is inherently inefficient (at least as done in Java). All function calls must be dynamically redirected, etc.

    I could go on, but it feels like trying to describe that water is wet and stones sink in it to someone who thinks it's intuitive that water should sink into stones.

    However, I will not dispute that you can definitely get better results by recompiling for each chip that something will run on. Some JITs may use this to push out ahead of one-time compilers.

    BTW, an experienced assembly coder will always beat or at least equal the optimizing compiler, because if nothing else works he can always look at what the compiler produces and see if he can improve on that. Besides, optimizing compilers are good, but not that good; someone has to write them, and when was the last time you wrote a program that could solve complex creative problems better than you can?
  • The problem that one tries to solve with JIT based language (say Java) is not the same problem that one tries to solve with a systems programming language (say, C) is not the same problem that one tries to solve with an AI language (say LISP).

    I would not try to write device drivers in anything but C -- assembly is too unportable. I would not try to write an AI program in C unless I had an essentially unlimited budget -- LISP is much better suited to that task. If I want to interact with a WWW user -- Java is the best choice thus far. As to efficiency of the resulting program, the speed of assembly code is irrelevant in 99.99% of code, since the program spends most of its time in that 0.01% (OK -- I'm exaggerating, but not that much ;-). Ease of programming and maintenance is much more valuable -- that observation led to the decision to use C to implement UNIX in the first place.

    "So what" you ask? While we talk about JIT and code morphing vs. compilation vs. hand assembly, we need to compare similar things. Language characteristics are nearly unrelated to performance (*implementation* is a whole other story...) In any HLL there must be some translation from strings of bytes to executable things, whether byte codes (emacs lisp), VM codes (Java), or machine code (any of the above under an alternate translator).

    JIT tries to solve the problem of bridging the gap between one (abstract) instruction set to another. Code morphing does the same thing. Compilation does the translation from HLL to machine code in one block. The time available to a compiler to explore different register allocation strategies and branch orderings is much larger, but less information about the typical case is available than for the JIT. However, branch prediction is the only place where repeated JIT/morphing has an advantage.

    BUT: I repeat -- they solve different problems. Compilers create programs to execute on a chip. JIT/morphing makes a program written for one chip executable on another.

    The use/non-use of garbage collection does not tell us a-priori that a program will be slower than the same program without GC -- that depends on the allocation behaviour. There is considerable literature (both theory and practice) on this topic, for languages such as C (gasp!), C++, and LISP. Probably Java too, by now.

    Compilers may miss optimization opportunities occasionally, but the assembly code they generate is usually pretty good, and it is free of human errors.

    Finally -- Alan Cox mentioned the microkernel performance mistake. I think that it applies here, but not in the way you might expect. It is really easy to miss the real cause of performance problems. In my experience, choice of algorithm and data structure (even in OO code) have a much greater impact on performance than any other single factor (or even several other factors together). You should use language features to help with your task -- but don't be sloppy with them. That always leads to bad, buggy code, and poor performance.


    Hmmmm -- still on topic? You decide ;-)
  • "This surely says something about what people really do.'

    I think it says more about the college you are attending. If the business administration department (the place where they create those clueless marketing types) is the first to adopt Java at the college you are attending, I have just one piece of advice: get out of that conservative hole you are sitting in.

    "As long as java remains so un interesting and tied to the web and the internet I will never learn it."

    If you don't see the point now you'll never see it. I give up. Happy hacking.

    BTW, have you tried COBOL? I heard 90% of the code in use today is COBOL (just like you, I have no source for my information though) - must be a great language, fast too!
  • In theory, a human can write better assembly than a compiler - in a vacuum. But code doesn't exist in a vacuum, and it becomes less of a vacuum every day. Consider these few issues:

    * Cache misses and page faults: an L1/L2 miss goes all the way out to main memory, and a page fault can go to disk - costing anywhere from dozens to millions of instruction cycles. Do you write all your assembly language so that your data structures lay out neatly along cache lines and page boundaries? No? Didn't think so. BTW, this level of optimization can be performed in C in many cases, by good C programmers (see the sketch after this list).

    * Pipelining: Do you hand-code your assembly language to give the best hints to the branch decision-making hardware so it can optimize pipelining? Really? You're better than me. Or maybe your code is so much better than compiler-generated crap that the branch lookahead hardware support is useless.

    * APIs: I suppose you don't mind reverse-engineering every library call you might ever need, so you can avoid those wasteful, inefficient C headers.

    * Et Cetera: The days where humans can outthink compilers in the general case died with the PowerPC and Sparc I. Those of you who have written 1,000,000+ line assembly language programs are free to tell me i'm full of shit. Those of you who have worked on million-plus line code in high-level languages are free to give me an AMEN!!
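    To make the cache point concrete, here's a minimal C sketch (the 1024x1024 array of doubles is a made-up example): both functions compute the same sum, but the first walks memory in its layout order while the second strides across it and takes far more cache misses.

    #include <stddef.h>

    #define N 1024
    static double m[N][N];

    double sum_row_major(void)        /* cache-friendly: follows the memory layout */
    {
        double s = 0.0;
        for (size_t i = 0; i < N; i++)
            for (size_t j = 0; j < N; j++)
                s += m[i][j];
        return s;
    }

    double sum_column_major(void)     /* same result, far more cache misses */
    {
        double s = 0.0;
        for (size_t j = 0; j < N; j++)
            for (size_t i = 0; i < N; i++)
                s += m[i][j];
        return s;
    }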


    ---
  • Incidentally, the optimizer hint you describe is called "__assume" in Visual C, and it does exactly as you say. So your code would be:

    switch (x) {
    case 0: ...
    case 1: ...
    ...
    case 23: ...
    default: __assume(0); // Same as 'notreached'!
    }

    normally you'd have a define like:
    #ifdef _DEBUG
    # define ASSERT(e) ( (e) || assert(__FILE__, __LINE__) )
    #else
    # define ASSERT(e) ( __assume(e) )
    #endif

    in order to hide this, but you get the point.

    Which brings to mind an optimizer bug that I came across (and fixed) which had to do with a nested ternary operator with two assumes:

    __assume(foo() && x==something),bar(&x) ? dosomething() : x == foo ? somethingwierd() : __assume(0);

    That's complete nonsense of course but it had the same form. This expanded out to a huge affair in the intermediate code the optimizer uses (one of those embedded functions was inlined(!!)). It was fun. Oh well, I'm sure no one is interested.





    --GnrcMan--
  • by Anonymous Coward on Thursday January 27, 2000 @05:44AM (#1331016)
    For deeply pipelined, aggressively superscalar architectures, I fail to believe that a human being could possibly solve the difficult scheduling and graph-theoretical problems that optimal instruction ordering requires better than a mechanistic compiler. Current architectures are simply too complex and subtle for a project of non-trivial size to be more efficiently coded by humans at such a low level.

    Never mind, of course, that for real projects, the trade-off in maintainability, portability, and extensibility is almost never worth any demonstrable performance gain anyway.

  • Hmm, I think the author missed the REAL advantage of Code Morphing (or Binary Retranslation).

    JIT compilers (or even FX!32) are pieces of software running ABOVE the OS layer. They are executed for a specific application which runs on top of them (that is, the application sits on top of the JIT environment).

    Yes there are similarities between Code Morphing and JIT compilation, as each does some profiling to get better performance in frequently executed code areas. It is further true, that on more predictable and orthogonal platforms like the Alpha, all kinds of optimisation depend on knowing the underlying platform.

    BUT they only see the code from the application running on top of them.

    The processor itself however gets code from MANY SOURCES. The number of processes running on any machine today is quite large. In addition things as shared libraries, OS calls, etc add a pinch of code.
    Hence the processor still has to perform all kinds of clever things, trying to resolve dependencies, avoiding stalls (and stalling at times). For modern multiple-issue machines this gets VERY complicated. It slows down cycle times and takes up die area which could be used for better-performing features (larger caches, etc.).

    This is one of the major headaches for CPU designers these days, as these parts grow disproportionately in complexity when you try to add more functional units, etc.


    This is the reason why VLIW (very long instruction word) architectures are used in the next generation of processors (i.e. Itanium). In those, the compiler does all the hard work, resolving all dependencies and stalls at compile time, and then simply sends one long instruction to the processor, which includes one operation for every functional unit. This means the processor can run at MUCH higher speed, as it loses a big chunk of overhead complexity.

    This is the reason why TRANSMETA uses such a VLIW; it enables them to build fast, small and CHEAP processors which can run at quite high speeds. (I am sure they could do more than 500MHz, but then cost and power consumption get in the way. And they are much smaller and haven't (yet) got the manpower of Intel or AMD.)

    Now where does code morphing come into play ?
    I recently saw a few comments asking for direct access to the VLIW instructions, bypassing the code morphing. Well, that would DEFEAT the concept!

    The Big Thing about code morphing is that it sees ALL instructions which get executed, in the right order. This includes the library calls, the OS routines, etc. Hence it can, by definition, do MUCH BETTER optimization than ANY compiler. A compiler can only know about the program it compiles (yes, you can do clever profiling, but you still only have the application source to optimize); it cannot optimize ACROSS APPLICATIONS.

    The code morphing software has absolute knowledge about what code has executed and what code is currently trying to execute. This enables it to do much better optimizations.
    Furthermore, it is SPECIFIC to the underlying processor. It knows everything about the processor and, even more important, the processor is designed specifically for it!
    (Hence the two versions are internally completely different, each optimised for the kind of code - Linux or Windows - it is executing.)


    This is what makes the concept so interesting. It adds an additional abstraction layer on top of the actual hardware, allowing the underlying implementation to be always up-to-date and optimized for the application (removing all the complexity of having backwards compatible hardware, ask Intel ;-)


    Overall, the approach is a MAJOR step forward, IMHO it can be compared to the CISC->RISC transition. It is after all the first implementation, competing with products well understood and optimized for years. It has potential to solve many of the major headaches of modern CPU designs (including COST). Comparing it to JIT is missing the BIG points.

    I think it will go a LONG WAY !


    Frank
  • by BinxBolling ( 121740 ) on Thursday January 27, 2000 @10:36AM (#1331022)
    You need to understand the code fragment you are coding to do the most useful optimizations, which are the algorithmic ones. A badly coded variant of a better algorithm will beat a well coded but inherently slower version.

    Absolutely. But one need not be programming in assembly to do the sort of algorithmic optimizations you speak of. In fact, in higher-level languages, where source code content is determined largely by the algorithm being implemented (rather than by the low-level details of the implementation, as in assembly language), you have a better chance of 'seeing' and successfully implementing a faster algorithm. This sort of thing is one of the big reasons why assembler loses in the long run.

    It seems to me that there are two basic types of optimization task:

    • Choosing the fastest algorithm for a particular task.
    • Making a given algorithm run as fast as possible on a given piece of hardware.

    Of course, this is a bit of an oversimplification, since it assumes that the fastest algorithm for a given task is going to be the same on all platforms. Usually this is true, but there are probably a few minor cases where it isn't. Still, it works most of the time.

    At any rate, humans are much better than machines at the first type of optimization. But machines are much better than humans at the second (even if only because normal humans don't have the patience required to do the enormous amount of repetitive analysis that it requires).

  • by jilles ( 20976 ) on Thursday January 27, 2000 @05:45AM (#1331027) Homepage
    Hi,

    What Transmeta does is very similar to Sun's HotSpot compiler for Java. Like Transmeta's code morphing, HotSpot compiles and optimizes bytecode on the fly, using profiling data to optimize the parts that are executed most.

    Your question as to why Java is still slower than C++ is answered in a series of javaworld articles (www.javaworld.com): http://www.javaworld.com/javaworld/jw-11-1999/jw-11-performance.html and http://www.javaworld.com/javaworld/jw-12-1999/jw-12-performance.html and http://www.javaworld.com/javaworld/jw-02-2000/jw-02-performance.html

    Basically the problem is that things such as memory allocation, garbage collection, synchronization and runtime type checking have a performance price (you get a safer programming environment in return). The articles discuss these performance issues and give some useful advice on avoiding the bottlenecks.
  • There's local overhead in the instrumentation code, but not a net overhead in the combined instrumentation code plus controlled loop code: the loss due to extra code prior to a loop is far less than the gain resulting from the more efficient loop code produced by introducing that extra instrumentation.
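    To make that concrete, here's a hypothetical C sketch (not anything Transmeta has published): the call counter is the "instrumentation"; it costs a couple of instructions per call, but once the routine is known to be hot, a tuned loop body - here just a 4-way unrolled one - is used instead, and that more than pays for the counter.

    #include <stddef.h>

    static long sum_generic(const int *a, size_t n)
    {
        long s = 0;
        for (size_t i = 0; i < n; i++)
            s += a[i];
        return s;
    }

    static long sum_unrolled(const int *a, size_t n)
    {
        long s = 0;
        size_t i = 0;
        for (; i + 4 <= n; i += 4)              /* four adds per loop test */
            s += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
        for (; i < n; i++)                       /* leftover elements */
            s += a[i];
        return s;
    }

    long sum(const int *a, size_t n)
    {
        static unsigned long calls;              /* the instrumentation */
        return (++calls > 1000) ? sum_unrolled(a, n) : sum_generic(a, n);
    }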
  • by dingbat_hp ( 98241 ) on Thursday January 27, 2000 @05:47AM (#1331037) Homepage

    when the programmer has decided that space is more important than speed.

    That would seem to be one of the issues where manual coding has the edge. I agree with your general point, but I think there's still work to do before auto-generation is perfect.

    Imagine a case where an algorithm could easily be optimised either way on the speed/space tradeoff - maybe there's a hash table that's going to hold some programmer-controlled depth of the initial search, and allowing it to expand would make usage quicker but eat memory. A human coder would probably know the pattern of usage this routine would get. By knowing the practical number of data items to be encountered, they could optimise the hash size. A compiler can't do this, because the necessary information comes from the overall system domain, not just the source code. To allow compilers to compete effectively, we'll need much more subtle optimisation hinting than we currently have; especially hints of the form "ignore this block, it's only used once" and "speed like crazy here, and you'll never have more than 5 items loaded simultaneously".

    I'm not a compiler / assembler geek, so maybe someone is already doing this ?
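    One hint of roughly this kind does already exist: GCC's __builtin_expect lets the programmer say which way a branch usually goes. A small C sketch - the LIKELY/UNLIKELY macros and the lookup() function are made up for illustration:

    #ifdef __GNUC__
    #  define LIKELY(x)   __builtin_expect(!!(x), 1)
    #  define UNLIKELY(x) __builtin_expect(!!(x), 0)
    #else
    #  define LIKELY(x)   (x)
    #  define UNLIKELY(x) (x)
    #endif

    int lookup(const int *table, int size, int key)
    {
        if (UNLIKELY(key < 0 || key >= size))   /* error path: keep it off the hot path */
            return -1;
        return table[key];                      /* common case: falls straight through */
    }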

  • by jilles ( 20976 ) on Thursday January 27, 2000 @05:49AM (#1331040) Homepage
    Java bytecode interpretation is not the main cause of Java's weak performance. The cause lies in mechanisms like garbage collection, memory allocation, thread synchronization etc., which are expensive. If you implemented similar mechanisms (i.e. with all the built-in safety) in C++ you'd have exactly the same problems (I don't think that's actually feasible, though).
  • I'll just say that when the compiler and the machine are already married together, the compiler DOES know about the size of the caches. I'm specifically talking about VLIW hardware as opposed to a superscalar machine that has multiple execution units. The main difference between these two classes of machines is that the superscalar machine has a hardware layer doing the instruction scheduling, where a VLIW box has the compiler doing that job. Thus I feel your observation is more of an apples-and-oranges comparison.
  • Where the hell did you get these figures? What compiler, what OS, what architecture? They look *completely* out of line with every other test I've seen. Check out the JGL against STL comparison above for some real numbers.

    For a start, what the hell is the difference between integer division in C and C++? No compiler that I've heard of written in the last 5 years produces different code in these two cases.

    Secondly - who remembers to turn on optimization? Anybody who cares about performance and isn't completely stupid.
  • >I can say that they do not offer Java in the
    >CS department. This is a reflection in the fact
    >that Java is not important for what programmers
    >actually do in the world.

    Man, have you got to get out in the real world! You and your profs, that is. Right now Java is my job. A year ago I was coding C and asm on an 8051 derivative for an embedded project. Today I'm building a distributed network management system for a company that builds fiber backbones. Guess what the most universal platform for executing code is? I'll give you a few clues: all their WinXX boxes support it. All their *nix boxes support it. Any ideas? It's called a web browser, and everyone that might be interested in designing, configuring, or maintaining a network - every one of their laptops, desktops, and workstations - has one.

    Sooo...which platform do we support? What language do we choose? You only get one choice. I suggest that you transfer to a, uh, more modern school ASAP.
  • by arivanov ( 12034 ) on Thursday January 27, 2000 @05:56AM (#1331064) Homepage
    No it will not.

    Think of Java licencing and control issues, discussed widely on Slashdot. Actually, I shall retract this statement if one of the MAJOR league players accepts Crusoe as a primary CPU for at least one machine class.

    Otherwise it is obvious that it could be the Java machine, but it is least likely to be.

    I think Crusoe will actually make a reality of something else which is much cooler than Java. It will make real the developer's dream - the coat of many colors: the affordable machine of many architectures.

    On the basis of Crusoe even now you can build a machine that can happily emulate:

    Mac, Sun (lower end), IBM PPC (lower end), SGI (lower end), Alpha and, curse it, x86.

    1. All these are PCI based.

    2. The differences in chipsets can be ignored under the "one OS to rule them all, one OS to find them, one OS to bring them all and (oh well, cutting out the darkness) bind them" approach. It can have drivers for the chipset and peripherals in question.

    3. You may actually do the reverse thing and develop drivers for the peripherals and the chipset for all platforms in question (not a hell of an effort, actually quite achievable). If the peripherals are something like Adaptec, Tulip and a PCI VGA, the drivers are basically there already. So you have only the chipset left. Actually, you can intercept these and emulate chipset behaviour if you so desire.

    Overall - the developer's dream can become a reality for just a few bucks - about the price of a PC (excluding licencing for OSes and software, of course ;-)
  • 16-bit or even 8-bit registers instead of 32 bit registers. [...] need to be sure that the values in the registers are below 65536 or 256 to use these tricks,

    Isn't that trivial old-tech for compiler optimisation? If your type system is strong enough to know that it's an 8-bit operand, then you're only ever going to need the 8-bit register.

    You can argue that multiplying two int8s only really needs an int16 result if the initial values are big enough, which is a fair comment in favour of human optimisation. OTOH, it's not a rocket-science extension to the type system to allow an int4 type to be declared, even if it's always held in a >=8-bit register; the compiler can then safely optimise that to place the result back into an int8. That way you even improve the code quality, as there's no discrepancy between the valid and the apparently valid numeric ranges, as you get with the human optimisation and its non-obvious limiting of the initial int8 to be <0xF.

    Anyone remember BCPL ? 8-)

  • by spacewalrus ( 145025 ) on Thursday January 27, 2000 @03:15PM (#1331070)
    This whole question boils down to whether or not a human mind is deterministic. We can prove that a compiler is, or in fact that any program that runs on current computer hardware is.

    If a human mind is deterministic then it follows that a human could do just as well as a compiler simply by using all of the same algorithms to generate the code, although the computer would certainly finish a lot sooner.

    If a human mind is NOT deterministic then it is clear that a person who had no programming skill whatsoever could beat the best compiler, given enough time, a specification sheet of the language he was to generate the given algorithms in, and a circuit diagram of the CPU he/she was coding for.

    This is VERY simple to see. A compiler can NEVER, under any conditions, for any reason, EVER EVER EVER produce better code than a human, because the human can ALWAYS produce the same code that the compiler did. The question should be: is it reasonable to wait the 150 years it would take me to compile Mozilla by hand, or should I build a compiler to do it in a few hours? The next question would be how much time (if any) I should spend hand-optimizing the compiler output.

    A question I have is: why can't we do the same thing that a JIT compiler does before runtime? It seems clear that you could easily get very close to as much optimization from a multi-pass profiling compiler as you ever could with a JIT, without the runtime overhead.

    Later
    Spacewalrus
  • Hmmm.... back in the days when I used to dabble in a bit of demo coding I would frequently benchmark my 486 code against the (Borland) compilers.

    I almost inevitably won. Hands down.

    Compilers may have come on a bit since then, but I'd still fancy my chances. Of course, I would only dream of doing this kind of optimization for the most critical inner loops. Most of the time it isn't worth it.

    But I would like to stress the point that a human with good knowledge of the system and the problem at hand should be able to beat a compiler in many cases. Compilers can't think laterally.

    For example, consider a function that must fill an array of bytes with consecutive numbers between X and Y.

    Compiler produces beautifully optimized loop that increments a counter, moves an array index and sets the bytes appropriately. If it's clever, it might use the same register as both an index and a loop variable. No problems. Pretty fast, but....

    Joe Hacker takes one look at the problem, sets a 32-bit register to contain the bytes X, X+1, X+2, X+3 and does a quarter of the number of loops just adding 04040404 hex to the register each time.

    Joe wins, because he *knows* that most of the time X and Y will be sufficiently far apart for his bit of hackery to be quicker. His knowledge of the nature of the problem will beat the compiler.
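    A sketch of Joe's trick in C, purely for illustration - it assumes a little-endian target and that the values never wrap past 255:

    #include <stdint.h>
    #include <string.h>

    /* Fill buf[0..n-1] with x, x+1, x+2, ... using one 32-bit store per four bytes. */
    void fill_bytes(uint8_t *buf, uint8_t x, size_t n)
    {
        uint32_t word = (uint32_t)x
                      | ((uint32_t)(x + 1) << 8)
                      | ((uint32_t)(x + 2) << 16)
                      | ((uint32_t)(x + 3) << 24);
        size_t i = 0;

        for (; i + 4 <= n; i += 4) {
            memcpy(buf + i, &word, 4);   /* one 32-bit store covers four bytes */
            word += 0x04040404u;         /* bump all four byte lanes at once */
        }
        for (; i < n; i++)
            buf[i] = (uint8_t)(x + i);   /* leftover 0-3 bytes: plain byte loop */
    }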

    This is the kind of thing that humans will be better at for the time being. Once AI compilers start to make an appearance, things may change. But right now, I'll bet on human intuition any day of the week.



  • Never mind, of course, that for real projects, the trade-off in maintainability, portability, and extensibility is almost never worth any demonstrable performance gain anyway.

    Some people just have no idea how true this is, and I've had to work with many like that.

  • by Dominic_Mazzoni ( 125164 ) on Thursday January 27, 2000 @05:57AM (#1331077) Homepage
    I completely agree that humans can always write better assembly than compilers can generate, but it's getting harder.

    Not all that long ago, compilers weren't all that smart, and processors were a lot simpler. Today processors use tricks like out-of-order execution and branch prediction, making it extremely difficult for an assembly-language programmer to determine exactly what is going on. It still remains the case, though, that the programmer may know something about the code that the processor doesn't (that a register will never contain a certain value, or that some execution paths are much more likely or more critical than others) which will give the programmer the edge.

    None of the just-in-time compilers that I am aware of recompile code as they go. If the Crusoe actually attempts to optimize sections of the code based on runtime profiling information, this could be the main reason it performs so well. There have been academic papers written on this idea before but this would be the first implementation I've heard of.
  • Jilles,

    Beating a modern compiler is easy... if you have the right background. For people that came into programming when 16 KB was A LOT of memory, assembly coding in an efficient manner is like breathing air. For folks that grew up in the era of "16 MB isn't enough to boot my OS", well... hopefully you see my point.

    Modern compilers are marvelous pieces of technology... and I've worked with some of the best folks in the field at one time or another (most of those folks are off working at HP or Intel right now on Merced, methinks). Those same folks could write better code than their compilers produced.

    Note - Merced and its brethren are considered descendants of the VLIW machines built in the '80s. A lot of the same people who were the engineers at those VLIW startups are now working on Merced.
  • by hoss10 ( 108367 ) on Thursday January 27, 2000 @05:58AM (#1331082)
    I was about to try and answer your question but it's bloody difficult to explain!
    With garbage collection implemented in the best way, the JIT (or whatever) doesn't try too hard to keep a record of what memory has been allocated.
    But whenever the system isn't particularly busy, or if it panics due to low memory, the JIT can traverse all the objects that have been allocated (just starting with each of the Thread objects, I think) by following the references each object holds, until it has a list of all the live objects in the system. There are a few reasons this is more reliable (not just faster) than maintaining a reference count.
    Then I suppose it (the JIT) could ask the OS to rearrange its memory mappings (page translation whatchamacallems) to put any of its user-space pages after the current position of the stack, or something like that (hard to describe, but I know what I'm talking about, even though I can't remember the name exactly!).

    Basically it can be made to work, but maybe not particularly efficiently, because it could result in a full page (4K) being used for just one small object - but this can also be fixed.
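    For what it's worth, here's a rough C sketch of the tracing idea described above - nothing like the real JVM collector, and the Object layout is invented for illustration. Start from the roots, follow references, mark everything reachable; whatever is left unmarked afterwards is garbage.

    #include <stddef.h>

    typedef struct Object {
        int marked;                    /* set during the trace */
        size_t num_refs;
        struct Object **refs;          /* references this object holds */
    } Object;

    static void mark(Object *obj)
    {
        if (obj == NULL || obj->marked)
            return;                    /* null reference, or already visited */
        obj->marked = 1;
        for (size_t i = 0; i < obj->num_refs; i++)
            mark(obj->refs[i]);        /* follow everything it references */
    }

    /* The roots would be, e.g., every live thread's stack plus the static fields. */
    void mark_phase(Object **roots, size_t num_roots)
    {
        for (size_t i = 0; i < num_roots; i++)
            mark(roots[i]);
    }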
  • by stevew ( 4845 ) on Thursday January 27, 2000 @05:59AM (#1331083) Journal
    Years ago I worked at a company that was going to bring VLIW to the commercial world. One of the things they had to invent was MUCH smarter compiler technology. That was in 1985, so the basic ideas involved here are at least that old.

    Anyway, when the compiler DOES know the hardware, especially on VLIW machines, it can indeed do a job as GOOD as a human. (Look - it won't usually do a better job; when the guys who designed the machine write code, they're going to do what the compiler is trying to emulate.) But from what I remember, static profiling was considered "good enough" back then, if the VLIW machine is built right.

    Most code spends most of its time in its innermost loops. If those loops can be scheduled correctly for VLIW (in effect, software-pipelined), you can be executing different iterations of the loop within the same long instruction word. Being able to do this is usually the largest performance payoff. If you have a GOOD compiler in the first place, one that KNOWS the hardware, then JIT and morphing aren't needed, and don't help.

    The place where I see the morphing being a step forward is that VLIW machines were considered architectural dead ends until just now. If I use one of these smart compilers for a machine with, say, 7 functional units, the code it emits won't work on a machine with 8 functional units, or at least won't be optimized any longer. These machines didn't scale well until now! Code-morphing technology completely removes this limitation!
  • I haven't heard that, but I have heard of extremely similar experiments, where weak players and strong players were given a similar challenge, but with both positions from real games and randomly generated positions. The novices did equally well (or poorly) on both types of positions, while the strong players performed significantly better on the 'real' positions than on the random ones.
    I don't remember where this information comes from, it was probably some chess book or other :)

    Daniel
  • Actually, there is no absolute requirement that free() do all of the allocator's cleanup work at the moment it is called, any more than a garbage collector must call a routine to do all of its cleanup work at the end of every block and at every assignment.

    free() could just add the freed chunk to a list of chunks to be cleaned up. A second thread could run continuously tidying up all these chunks with the same system-wide view as a garbage collector.
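    A rough C sketch of that idea, purely hypothetical - it assumes every chunk is at least pointer-sized, and a real multi-threaded version would need a lock or atomics around the list head:

    #include <stdlib.h>

    static void *pending_head = NULL;   /* chunks waiting to be properly freed */

    void deferred_free(void *chunk)
    {
        *(void **)chunk = pending_head; /* reuse the chunk's first word as a link */
        pending_head = chunk;           /* O(1): no heap bookkeeping here */
    }

    void sweep_pending(void)            /* run from an idle/background thread */
    {
        while (pending_head != NULL) {
            void *chunk = pending_head;
            pending_head = *(void **)chunk;
            free(chunk);                /* the real allocator work happens here */
        }
    }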
  • by Anonymous Coward
    I can see two major problems offhand:

    The first is that Java is, conventionally, a language which runs in a virtual machine - the infamous JVM. This involves a layer of inefficiency from the word go. Granted, you could avoid this problem by compiling directly for the CPU in question, but this is a nontrivial conversion from a stack-based language to a register-based architecture. If you are on a Java-optimised CPU, compiling direct, this could still be pretty good.

    The second objection involves the compiler. Compilers take CPU cycles. The smarter they are, the more they take. If you compile beforehand, this of course doesn't impact execution speed. If, of course, you'd like to compile on the fly, well, you're going to have to execute the compiler.

    Java also isn't really all that suited to this, to be frank. If you want to see a REALLY powerful model for this sort of thing, try using a FORTH compiler. On a CPU which is at all stack-friendly, with a good execution environment, you can pull all kinds of nifty tricks (on-the-fly modification of running code, assembler-like low-level activity and so on) which can really speed things up immensely.
  • How about this one? Lots of people want to be "Operators" because they get to spend time with their family, and don't get paged at three in the morning, or during their daughter's 8th birthday party, just to come in to work to fix a bug they introduced because they were just stupid. We also have hobbies that don't include computers or computer games, and usually involve more than 4 people.

    And why should anyone other than you care? Your "hobbies" most likely contribute nothing to the lives of others, yet the price society pays for supporting overpaid drones becomes rather noticeable. Why should I care how wide someone's smile was at his daughter's birthday party if that person was responsible for a network (or phone, or power) outage?

  • So while it might seem profoundly counter-intuitive, can anyone actually give me a good reason why Java + JIT should be slower than Good Programmer(tm) + Assembler?

    There are many things that slow java down, that I'm sure you're aware of, but I'll outline some of them here as they come to me :).

    Limited stack: all Java objects are created on the heap, which is considerably slower. Microsoft Research actually has a division working on speeding up Java by putting some objects onto the thread stack.

    Runtime checking: array bounds checking etc. cannot be turned off. Good for stability, bad for speed.

    Late binding can be rather slow :), and it's best avoided when using Java, but sometimes you can't. This affects other languages too, of course.

    Coding practice. I don't know about many people here, but when I program in C, I usually know exactly what's going on in memory, so I spend a bit more time optimizing this and that, saving a few extra cycles, especially if the code is in a loop. I'm sure this is even more true of most ASM programmers. In Java, I just don't care - I don't really know what's going on in memory, and I don't really care; I just do whatever makes the tidiest code. I also tend to use inheritance a bit more :) hmmm.

    Also, Java's thread synchronization is easy, but extraordinarily slow compared to what you could do if you were to use Win32 directly, for example, since you have a lot more control. A JIT is not going to help that. Come to think of it, that's where a JIT can fail with high-level languages: there are things these languages restrict you from doing, which rule out cool optimizations that a JIT couldn't possibly do for you - I'm not talking about instruction-level optimization, I'm talking about things like being able to manipulate memory directly.

    The idea that Java + JIT can be faster than traditional ASM is interesting, but it will never happen, I'm afraid :P. It's nice throwing around O notations about how much more efficient garbage collection can be, etc., but you have to realise that you're assuming memory freeing is the most costly process, and it may not be. With JIT, GC etc., you have the additional overhead of the JIT and GC. I haven't seen any proof so far that these overheads aren't overheads at all.

    There are additional "optimizations" that can be done at runtime, but does using a JIT really improve performance (taking into account the overhead of the JIT)? How many cases are there where a compiler cannot optimize the code as much as a JIT can, and where the JIT actually gives any kind of improvement? I doubt there are many, and unless the improvements are on the order of 10%, I doubt it would matter much. What's the difference between a 433MHz and a 476MHz Celeron? Nothing any average user would notice. The only place you would notice the speed difference is when you're running something like VMWare or 3DSMax, or compiling Linux... and in those cases, would a JIT improve or just hinder performance?

    Of course, as we get faster and faster processors, we can move to higher and higher-level languages like Java. They drastically reduce development times, and in many cases will be the better choice. I doubt you'll see Quake or any OS written in Java anytime soon... hell, I doubt you'll see Office or CorelDraw written in Java anytime soon ;).

    Crusoe is fast and super cool, but it's not as fast as a real 666MHz x86. And if Transmeta adds official Java support, I'd love to get my hands on one :)~.

    I should also stress here that I have nothing against Java/JIT etc. They're all cool stuff, but I think the concept of Java beating good programmer + asm is a bit, uh, unrealistic :).
  • "64 bit operations on an alpha are no slower than 32 bit operations, so it's not really being wasteful."
    Think of the memory consumption (if it's not being wasteful, why then did you claim that in this case using long was sub-optimal?).

    "IIRC, C requires long to be large enough to hold a pointer. On non-NT alpha, that's 64 bits."
    IIRC, that's an old K&R thing and in ANSI C, pointers and ints are completely different animals and can never be inter-converted. Or maybe that's just the upcoming new ANSI standard.

    "Why use a 64-bit chip if you're only ever going to use half of it?"
    Indeed, why use a 64-bit chip at all? (I don't like the current trend of building bigger and bigger chips; we should be learning to make more and more of them work together) But more on topic, very little portable ANSI C code will make profitable use of a 64-bit long (yes, you can write code that checks just how big it is, and do, say, bit masking operations on more bits in parallel, or write a more efficient BigInt library). If you are going to write non-portable code, why even bother sticking to the standard? I like the "long long" solution that some compilers use; for one thing, it immediately tells any other compiler that this code isn't portable ANSI C.
  • If one Transmeta CPU, running the morphing software and actual productive code, is on par with a 500MHz PIII, what would the speed improvement be if you used 2 CPUs - one dedicated to morphing, and one dedicated to productive code? The design would be almost as easy as the current design, and the speed could double.
    The best part of this approach is that your system would still appear to have a single CPU. Win9x would still work.
  • Almost everything that a compiler needs to do to optimize code structure is undecidable. Human brains kick ass at undecidable problems. Almost anything that pertains to instruction scheduling is NP-complete. Human brains kick ass at NP-complete problems.

    Once a programmer has inferred some property about his program, he needs a high-level language to express his solution to the compiler. ML's typing system is the best thing I know of that achieves that.

    Once a programmer has figured out a fast instruction ordering, he needs a low-level language to express his solution in. We all know what kind of language does that.

    But I don't know any language that combines both.

    As for today's question: if JITs look cool and one-time compiles look boring, think of compiling n times with a profiling session in between and you will get a much better result (all the sugar, no runtime JIT overhead).

    -

  • by joe_fish ( 6037 ) on Thursday January 27, 2000 @06:05AM (#1331135) Homepage Journal
    The theory looks sound, but one of the biggest things to do when optimizing any Java program is to minimize the calls to new().
    I recently sped a program up by 150% in a snap simply by killing a few new()s. Swing and LotusXSL have had similar experiences.

    I think that part of the problem is that all of this is new, so there is more to do. HotSpot, the trendy JIT from Sun, in places IS ALREADY FASTER THAN C, but whenever it comes to object creation, things slow down a lot.

    So why in theory should new() be quick when in fact it is slow? IMHO the problem is not with the memory claiming, its with all the other stuff that the JVM has to do.

    When I call new Foo() the JVM:

    • Checks to see if the bytecode for Foo already exists, and if not it loads it, verifies it, and calls the class init method.
      This is very very slow, but should only be done once.
    • Allocs new memory. Probably very quick (see the sketch below)
    • Calls the hierarchy of constructors of all Foo's superclasses. Quite slow.
    • (For advanced garbage collectors) Place the object on the 'recent items' list. Probably quick
    So I guess the complexity of the system as a whole is the problem here.
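    For reference, the "allocs new memory" step really can be just a couple of instructions. Here's a hypothetical C sketch of a bump-pointer allocator, the kind of thing a copying or compacting collector makes possible (the heap size and alignment are made-up numbers):

    #include <stddef.h>

    static char  heap[1 << 20];              /* toy 1 MB heap */
    static char *heap_top = heap;

    void *bump_alloc(size_t size)
    {
        void *obj;
        size = (size + 7) & ~(size_t)7;      /* keep 8-byte alignment */
        if (size > (size_t)(heap + sizeof heap - heap_top))
            return NULL;                     /* out of space: a real VM would GC here */
        obj = heap_top;
        heap_top += size;                    /* the entire "allocation" */
        return obj;
    }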
  • I once took a course with Dr. Alfred Aho, one of the authors of The Dragon Book [aw.com] and a contributor to lex, yacc and awk (he's the 'a' in awk.) Dr. Aho would never call a compiler component an "optimizer;" instead, he called it a "transmogrifier," because the only way a compiler could produce truly optimal code would be if it could write an optimal program--and it can't.

    While the compiler has the opportunity to rewrite the code every time it processes it, it's not very good at rewriting code, so its abilities are limited.

    A well-designed compiler on a well-designed CPU can produce very good assembler code--I doubt I could outsmart a compiler in a RISC assembler optimization problem--but I believe that code tweaking is usually of value only in inner loops and for certain kinds of data storage.

    All of which adds up to the reality that, pending the arrival of artificial intelligence, human programmers are still needed.

  • "I don't remember saying that"
    Sorry, you didn't, that was AndrewHowe. I wasn't keeping track of who is on this thread.

    "I'll take your word for it"
    Don't take my word too seriously, I don't have the C spec either. I'm going on what other people have said about the standard. And like I said, it may have been about the upcoming standard, not the present one (you've gotta love it, it's finally been around long enough for people to comply and then they go and change the thing).

    "However, without an integer type big enough to hold a pointer, how are you going to compare a pointer to another using operators like do wish that methods of handling larger ints were specified; I mean, even the x86s have special features to handle 64-bit ints, which you just can't access using ANSI C). Actually, I'm rather fond of the idea of 24-bit chips. Chuck Moore thinks 20-bits is enough, and I really respect him, but he also thinks it's an acceptable tradeoff to use a character set where ones and lower case Ls are the same (I really recommend reading around his web site, while he might be a little over the edge, he still has a lot to teach the rest of us). [ultratechnology.com]

    The fewer excess bits you use, the smaller and faster (not to mention cheaper) the chip can be.

    "So that the code can run on 99.9999% of the machines someone may want to run it on, instead of, say, 85%, or 50%, or 1%, depending on how far you deviate from the standard and what assumptions you make?"

    Nothing that depends on a 64-bit long (and very little that can benefit from it) is going to run on 99.9999% of machines (it will, in fact, be broken on many 64-bit machine compilers), which was my point.
  • The basic data types are defined. You can use sizeof() to generate 100% portable code when you really have to know exactly how big your ints are. More importantly, you can use your noggin and avoid it being a problem in most cases by writing code that accepts the range of possible value limits.

    Embedded systems programmers will back me up on this one: it's a Good Thing that when you write C you can write one piece of code that's perfectly happy to use a 16-bit, 20-bit, or 32-bit int and it will be equally efficient on any of those systems.
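    A trivial sketch of that point (a made-up example, nothing more): the same source stays correct whether int is 16, 20 or 32 bits, because it asks the compiler instead of assuming.

    #include <stdio.h>
    #include <limits.h>

    int main(void)
    {
        /* Ask the compiler how big int is on this target instead of guessing. */
        printf("int is %u bytes (%u bits) here\n",
               (unsigned)sizeof(int), (unsigned)(sizeof(int) * CHAR_BIT));

        if (sizeof(int) >= 4) {
            /* a plain int can safely hold values up to ~2 billion here */
        } else {
            /* fall back to long, which C guarantees is at least 32 bits */
        }
        return 0;
    }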
  • by Tune ( 17738 ) on Thursday January 27, 2000 @06:10AM (#1331148)
    I believe Andy has a good point in arguing that there's no fundamental reason for manually written assembly to be better than automatically self-optimizing stuff. I also believe that manually written assembly will ultimately become obsolete.

    Nevertheless, I'd like to point out that JIT is based on a somewhat dangerous technique: a program that alters (its own) code. I believe this technique was used in the eighties to scare off hackers, by making code incomprehensible and hard to disassemble until the program was actually running. Also (even on a good old 6502 processor) it's possible to make some speed improvements by peeking and poking into the code you're actually executing.

    When compilers for microcomputers got faster and most processor architectures (the Harvard architecture, I believe) explicitly required a division of read/write and read-only memory segments, self-modifying code was abandoned. Hacking into your own code is generally considered bad programming practice.

    I was wondering whether JIT techniques also require very intelligent (and thus "heavy") disassemblers. One might also expect that developing a JIT compiler is a lot harder than writing a conventional one (without peephole optimization). Does anyone have experience in these areas?
  • You need to be sure that the values in the registers are below 65536 or 256 to use these tricks, and the programmer can know this, but the compiler cant.
    Isn't this why we have short int, long int, and char data types - so we can tell the compiler these things?
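    Exactly. A hypothetical C sketch: declaring the operands as unsigned char tells the compiler the values are below 256, so it is free to use byte registers and a narrower multiply where the target supports them.

    /* Both inputs are declared as unsigned char, so the compiler knows each value
       is below 256 and that the full product fits comfortably in an unsigned short. */
    unsigned short scale8(unsigned char a, unsigned char b)
    {
        return (unsigned short)(a * b);   /* at most 255 * 255 = 65025 */
    }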
  • by Anonymous Coward on Thursday January 27, 2000 @06:15AM (#1331175)
    Now, if they were to force-evolve code using a genetic algorithm, you might get code better than any a human could write.

    Of course, you'd have to put up with the fact that this would not be executable in Kansas.
  • by aheitner ( 3273 ) on Thursday January 27, 2000 @06:20AM (#1331195)
    @#$% theory.

    I can write fast C (or C++ if you stay away from the evil slow features) code.

    Everything I do in Java is bloody slow.

    I do believe fundamentally that some sort of per-machine compilation is in fact ideal -- the C compiler could do clever stuff if it knew the exact cache architecture of your machine and compiled your programs from some intermediate (say, parsed and desymbolized) representation when you ran them (or maybe just the first time you ran them).

    (all the slackware people in the crowd say "bwaahahahaha! I did compile my programs for my own machine!" :-)

    But I think the dumbness of Java -- the funky GC, the dumb dumb dumb RTTI -- will always make it slow. For that matter, Java uses a dumb fake-o instruction set that's not a particularly good representation of the information needed by a compiler. Why not give your JITC something closer to the source rather than a binary for a machine that doesn't even exist?!?

    ...

    Ditzel said something very significant when he introduced Crusoe (I'm paraphrasing): "The reason we haven't talked openly about this before is we didn't want to boast before we could prove we've done it." Ditzel stood there and said, "We've got a chip that does on-chip JIT architecture translation, and does it to run as fast as any other chip out there, and here's the proof."

    The Java people have been claiming for a long time that they could write faster code than the C compiler with JITC.

    Java has gotten better than when it started, it's true.

    But it still can't hold a candle to gcc -O2.

    My message to the Java world: I spend too much on my computers to waste time running slow code. Call me back when you can actually demonstrate your claims.
  • by AndrewHowe ( 60826 ) on Thursday January 27, 2000 @06:40AM (#1331212)
    In some ways I agree with you, OK, I love writing assembler stuff.
    But I'm trying to separate fantasy from the reality here.
    The fantasy is that you're some super studly asm c0der who can not only produce properly scheduled code, but also has the time to *completely re-write* it:
    a) For each new processor
    b) For each sub-type of that processor with different internal resources
    c) Every time the specification changes - Maybe you have limited the range of a variable or something.
    Any of these changes will require a huge amount of work. Even a small change, coupled with the pipelined nature of modern processors, means a large modification to re-establish optimal resource usage.
    The reality is that no one has the time to do this, and even if they do, they should be doing something else instead. High-level optimizations almost always beat low-level tweaking.
    In addition, I simply don't believe it's impossible to
    a) Tell the compiler in more detail how to optimise your code
    b) Have the compiler suggest ways in which it could produce better code (for example, "Is it OK to assume there is no aliasing of this item?" or "Hey, if I make this variable 8 bits in size I can do this huge optimization")
    c) Have the compiler work this stuff out for itself
    d) As (c) but at run time, guided by profiling information.
    Some of those are hard to do right now, but I don't see anything that would make them impossible.
    Does anyone?
  • by monac ( 83505 ) on Thursday January 27, 2000 @06:26AM (#1331225)
    As shown at http://www.idiom.com/~zilla/Computer/javaCbenchmark.html, even the Linux JDK, which is relatively slow compared to other platforms', runs as fast as a natively compiled C program on simple operations.

    In my understanding, Java's poor performance is due to the following factors:

    1. Overly generalized API design results in very deep call stacks. For example, printing the current date and time using java.text.DateFormat takes many more calls than traditional C (a system call and a printf is enough for C). It gets worse in Swing. The use of interfaces causes the same problem, I think.

    2. JNI is too slow. (I can't understand why, but it is known to be slow, and Java can't help using many JNI calls.)

    3. As WORA is important to Java, it is hard for Java to use platform-specific resources such as a graphics card's acceleration. That is one of the reasons Swing's design is getting ugly as it tries to boost its performance.

    4. Too many small class files result in high I/O usage when the JVM loads classes.

    5. No MACROs! For example, most of the Java API is synchronized, which is expensive, regardless of whether threads are used or not. The same goes for the security-check code and the platform-specific implementation code, etc. These problems could be solved by macros, but... someone at Sun doesn't like macros, and I agree with him when I think of Java's purity.

    Some of these problems (slow synchronization, class loading speed, etc.) are known to be solved in the next JDK version (the HotSpot client version). But mostly they are Java design issues that will never be solved, as Java mostly runs in a virtual machine and was originally designed for Java chips; only when the JVM is implemented at the processor level will the speed problem be solved.

    I'm looking forward to seeing Java running on MAJC or Transmeta's chips.

    Please correct me if I'm wrong on some of these, and forgive me for my poor English.
    Cheers.
  • by BurdMan ( 29934 ) on Thursday January 27, 2000 @06:31AM (#1331240) Homepage
    Firstly, water does sink into rocks. It then freezes and the pressure it exerts reshapes the world as we know it. But that is beside the point.

    I will address your comments with respect to the Java Programming Language and its HotSpot compiler technology. If you would like to enlighten yourself on the techniques behind HotSpot take a look at http://self.sunlabs.com [sunlabs.com]. Transmeta, IMHO, was influenced by the same concepts when designing its code morphing techniques.

    1. "A JIT is rushed" - The first time bytecode is compiled to native code it must be done quickly to avoid delay. That code may be inefficient, true, but it can also be instrumented with profiler like information that can be used by later passes of the compiler. You see, what the author is describing is not a JIT system but a dynamic optimizing compiler which has the opportunity to study the program during execution and recompile parts or all of that program based on that information.
    2. "no leeway for boundary conditions" - Well here again you seem a bit confused and your example is naive. The native representation of a type is not fixed in Java, only the conceptual representation. To be a compliant Java implementation all results generated using 'int' (to use your example) must be consistent with the conceptual representation of 'int' when using the various operations. If the optimizing compiler finds a way to represent an 'int' in a more optimal way it is free to do so as long as the results of the operation don't change.
    3. "garbage collection is slower than hand coded memory management" - You really ought to read some of the latest garbage collection papers. This is no longer true even for collectors managing C or C++ allocation. When given an environment like Java which doesn't allow pointers and other antiquated memory techniques a good dynamic compiler with a modern garbage collector is both faster and more efficient than any hand coded attempt on all but the most simple of applications.
    4. "OOP and dynamic dispatch are inefficient" - Again, I urge you to read the papers at the Self site listed above. The way the Self, and now HotSpot, compilers work eliminates this bottle neck. I'll admit in early implementations of Smalltalk and Objective-C dynamic dispatch did add a small amount of overhead. Take a look at the documents on the HotSpot here [sun.com] and then tell me that dynamic dispatch is a problem.


    In summary, you are more than welcome to use assembly all you want. Code on, my brother! But please, before you slam some other method, try doing the smallest amount of research first. Maybe your snap intuition is wrong. You never know.

    As far as Transmeta goes, it has a lot of the HotSpot/Self-style technology, and I personally think that technology is the future. I can't wait to get my hands on a Crusoe-powered product.

    -BurdMan
  • by Namarrgon ( 105036 ) on Thursday January 27, 2000 @06:31AM (#1331242) Homepage
    It's just like Deep Blue vs. Kasparov again: brute force compilers vs. intuitive & creative humans. In some cases, humans can write better assembly than compilers (thanks to knowing more about what the program is intended to do), and in others the compilers have the advantage (thanks to being better at scheduling instructions on today's pipelined, superscalar CPUs). JIT compilers could have even more knowledge about instruction timing, but have less time to think about it. Kinda like Deep Blue playing blitz chess.

    I'd say in most cases these days, a compiler could write faster code from scratch than a human could, unless the human really spends some time on it, but the optimal combination will always be a human assisted by a compiler (and profiler, etc). Best of both worlds.

    Namarrgon
  • by GnrcMan ( 53534 ) on Thursday January 27, 2000 @06:57AM (#1331253) Homepage
    That may be true of a crappy compiler on a lame platform (x86)... but I'd like to see you try reordering and slot-scheduling Alpha instructions by hand to achieve optimal pipelining and branch prediction.
    As someone who worked firsthand on a compiler for the Alpha, I can tell you that for 99% of the cases, the compiler can perform better general optimizations than a human. That 1% left over is what you should be concentrating your efforts on.
    Historically, the point of a CISC processor was to reduce compiler complexity. So of course hand-coding optimal x86 assembly isn't going to be difficult.

    --GnrcMan--
  • by locust ( 6639 ) on Thursday January 27, 2000 @07:01AM (#1331265)
    "I have already heard the claim that a compiler can generate better code than a programmer a thousand times, but it does not get more true by repeating it - it is false."

    The metric I was given in my university computer architecture class was that a compiler can generate better code than 90% of assembly programmers. That makes sense when I look at higher-level code written by a variety of programmers, some good, some bad. Some people know all the details of a language and some don't; some pay attention and some don't. I'm pretty sure the figure was backed up by a reference to an IEEE paper somewhere.

    If large numbers of humans could reliably write large volumes of high-quality, optimized code at high speed, we'd still be using assembly everywhere. But they can't. I certainly don't have the time or the inclination to learn all the ins and outs of every instruction set architecture I work on just to squeeze every last bit of juice out of the machine. And I am certainly unwilling to give up the expressiveness of a high-level compiled language.

    But ultimately, the fraction of people who need to write things like texture-mapping loops is very small, and the fact of the matter is that such a loop is probably embedded in some higher-level piece of code. So relax: compilers reliably produce better code than the majority of assembly programmers (maybe not on /.). Or, more precisely, they produce better machine code than the majority of programmers would if asked to write assembly.

    --locust
