HP's Dynamo Optimizes Code
sysboy writes "Ars Technica has a very interesting piece about HP's Dynamo." The article digs into their run-time optimization work and compares it to Transmeta's code-morphing technology.
"What if" is a trademark of Hewlett Packard, so stop using it in your sentences without permission, or risk being sued.
Closed Source Binaries (Score:1)
This can be done better (Score:1)
1) Generates streams of instructions in execution order, with the branch predictions already determined, based on past execution.
2) Additional code optimizations are enabled by knowing how the program actually executes. Things like pointer aliasing (which is the single biggest cause of poorly compiled code) can be resolved.
#2 is the same as profile guided optimization, which has been trashed on Slashdot several times because Intel put it into their compiler and got decent speedups. PGO allows the compiler to gather statistics on the application by actually running it, then recompiling based on that data.
#1 sounds exactly like the trace cache which will be in Willamette, Intel's new IA32 chip. Streams of uops are stored in program execution order, with the branches already predicted.
The difference is that Dynamo does all of this in software while the application is running, which adds considerable overhead. The Intel implementation is done in hardware, without burning CPU cycles, so ALL of the gains are seen by the user.
Ars Technica has a history of anti-Intel bias, so I am sure that you will never see this comparison on their site.
Re:Bad compilers? (Score:1)
The limitation is that the dataflow analysis is not completely precise. This is because the compiler only has static information. It must assume both paths after a branch are taken, for example. When a join point is reached, the compiler must conservatively resolve the dataflow information from the separate paths to the join point. So for example if variable a is defined in both paths, the compiler must assume both definitions reach the join point.
Dynamo improves on this by forming dynamic traces. Now the compiler can optimize exactly the path through the program taken at run-time. Essentially, there are no join points, so there is no need to combine dataflow information from multiple paths. The trade-off is between constructing long traces and being able to reuse them frequently enough to recover the penalty of constructing and optimizing them.
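Here's a tiny made-up C example of the join-point problem and how a trace sidesteps it (invented for illustration, not from the article):

/* Two definitions of 'a' reach the join point after the if/else,
   so a static compiler must allow for both. */
int f(int x)
{
    int a;
    if (x > 0)
        a = x * 2;      /* definition 1 */
    else
        a = -x;         /* definition 2 */
    return a + 1;       /* join point: both definitions reach here */
}

/* A trace built at run time for the hot case (say x > 0 almost always)
   has no join: the else arm becomes a side exit off the trace, and the
   optimizer can treat 'a' as exactly x * 2 along the whole trace. */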
--
Re:Letting you know... (Score:1)
Yeah, Ars! (Score:1)
They wrote an excellent article on the Crusoe (I just wrote a short paper on it for class, myself, although I used the white paper for reference...) and were looking forward to this architecture. It's nice to hear more details.
Wow, these results are really impressive! I wish someone would implement an x86 interpreter on x86 that actually ran faster. I wonder if the limited number of registers would really get in the way of a control program, though. Transmeta, where are you?
---
pb Reply or e-mail; don't vaguely moderate [152.7.41.11].
Re:A Priori vs. A Posteriori Optimizations (Score:1)
They are concerned with SPEC...which contains lots of time-consuming calculations buried in nested loops. Consider a less esoteric example that has a similar program structure. 3D software (Povray, Quake, whatever) goes through long lists of polygons. The instructions in these loops will be hit thousands or millions of times. These oft-touched bits of code will be optimized and have several fragments that correspond to the various common branching configurations. Any trip through these tight loops is going to be "representative." If two or more branchings occur frequently, each will be encapsulated in a fragment. If the loop is exhaustive enough, every possible branching may be optimized. Amdahl's law is at work here. The parts of the code that run the most are the potential bottlenecks, and hence they attract the most attention from the optimizer. This is in direct contrast to a priori compiler optimizations, which have a very narrow focus and no idea how often a particular bit of code will be executed.
Again, the proof of the efficacy of this approach is in the HP results. Those results will only get better when you dump the runtime overhead. Furthermore, it is quite understandable that these optimizations will be orthogonal to those made at compile time. The squeaky wheels will get the oil. And there is NO possibility of a performance penalty: if the "optimized" fragments are slower than the original code, they can be discarded. These optimizations are also resilient to parametric changes, otherwise the HP software wouldn't be seeing any speedup. There has to be a cache of pre-ordered instruction blocks that is hit many times for the initial optimization/translation time to be amortized.
There is a lot of potential here I think, especially for computational physics codes that I work on.
A Priori vs. A Posteriori Optimizations (Score:1)
Good for multiprocessors (Score:1)
Overhead? (Score:1)
There is a lot of "gee whiz" out there about this supposedly new morphing thingy, but nobody is talking about the OVERHEAD!
If something has to be translated (even optimized) from one form to another, there must be overhead involved.
So, who can tell me about the OVERHEAD incurred?
Re:Bad compilers? (Score:1)
> *native binaries*...
For those who didn't read the article, here is a direct quote showing that this applies to native binaries.
What Dynamo does
Dynamo is an odd beast. It is, in essence, an interpreter for HP's PA-8000 instruction set that itself runs on a PA-8000 processor. That's right -- it interprets programs that could just as easily be executed natively on the same hardware.
Stephen L. Palmer
---
Here is the result of your Slashdot Purity Test.
You answered "yes" to 86 of 200 questions, making you 57.0%
Re:Windows (Score:1)
Re:Thank Goodness! (Score:1)
thad
patent stuff (Score:1)
Re:Runtime morphing is NOT NEW (Score:1)
JavaVM + Dynamo = Fast Java? (Score:1)
http://www.hpl.hp.com/cambridge/projects/Dynamo/docs.htm
Dynamo as a Virtual Machine Accelerator [hp.com] (PDF)
The Dynamo team has been at it since 1997, and the writeup is dated June 1999. Does anyone know what HP is doing with it?
Runtime Optimization in Linux Kernel? (Score:1)
Can you imagine what Dynamo and Code Morphing technology could do for the Linux kernel? Faster emulators, faster code, and _better_use_of_existing_hardware_!!!! MMX acceleration for everything (theoretically), because it's in the kernel!
The holy grail of Linux is how well it works with legacy hardware. Why not make it even better?
So who's going to take the reins of this puppy? Put it on the bill for the 2.5 series! (or 3.0?)
code morphing vs. profiling (Score:1)
Re:20% faster than native (Score:1)
As the book says, even its creator did not know it still existed until his computer started speaking to him...
20% faster than native (Score:1)
(It's on the Cyberpunk reading list, but hard to find. P-1 is a fictional program created to penetrate a remote computer, get resources for an attack, but avoid detection. The creator mistypes, the program thinks he's not the creator, and it hides...it continues getting more resources and hiding its existence for years, including optimizing applications so it can use the extra CPU although the application seems to be using exactly as much CPU as it should.)
Re:An artifact of HP architecture? (Score:1)
> work that way. In the Pentium Pro and later, there are branch prediction bits in the cache, set as the code executes.
x86 (and Alpha and SPARC and MIPS) have branch predictors, but the prediction history bits are stored in on-chip prediction tables, NOT in the instruction cache. Keeping them in the cache is not unreasonable, though, and may be an interesting research topic, since you would not have to keep a big on-chip table that suffers from aliasing and capacity problems.
Y.
Re:Yabut (Score:1)
Re:patent stuff (Score:1)
VMWare / Plex (Score:1)
Re:Bad compilers? (Score:1)
I can only partly agree with this statement.
I suspect that a whole lot of the code that Dynamo optimised around is there because of the lack of good exception handling in C. For instance:
/* typical C idiom: error checks interleaved with the main flow */
fd = open(argv[1], O_RDONLY);
if (fd < 0) err(2, "cannot open %s", argv[1]);
clen = read(fd, &buf, sizeof(buf));
if (clen < 0) err(2, "error reading %s", argv[1]);
if (clen == 0) err(2, "no input");
process(buf);
and so on. A lot of C code gets fragmented because it's natural to put error handlers in the middle of the main flow of execution.
However, many modern compilers can find the most commonly taken branches from actual execution profiles. (gcc and HP-UX's compiler, for instance, can both do this.) It's nice to compare against -O2 code, but what I want to know is: does Dynamo still provide a noticeable speedup over this arc-analyzed code?
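For those who haven't tried it, the gcc version is a two-pass build (going from memory here, so check the docs), after which the compiler can use the counts to favor the common path. The effect resembles the hand-rewritten layout below, where the hot path falls straight through and the cold error handling sits out of line:

/* Two-pass arc profiling with gcc, roughly:
 *   gcc -O2 -fprofile-arcs prog.c -o prog          -- instrumented build
 *   ./prog typical-input                           -- run records branch counts
 *   gcc -O2 -fbranch-probabilities prog.c -o prog  -- recompile using counts
 */
#include <err.h>
#include <fcntl.h>
#include <unistd.h>

static char buf[8192];

int main(int argc, char *argv[])
{
    int fd;
    ssize_t clen;

    if (argc < 2)
        errx(2, "usage: prog file");
    fd = open(argv[1], O_RDONLY);
    if (fd < 0) goto fail_open;        /* profiled as rarely taken */
    clen = read(fd, buf, sizeof(buf));
    if (clen <= 0) goto fail_read;     /* profiled as rarely taken */
    /* process(buf) would go here */
    return 0;

fail_open:
    err(2, "cannot open %s", argv[1]);
fail_read:
    if (clen < 0)
        err(2, "error reading %s", argv[1]);
    errx(2, "no input");
}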
--
Fourth law of programming: Anything that can go wrong wi
FYI (Score:1)
OT: But did you see the ad? (Score:1)
Re: Yabut (Score:1)
Re:Yeah, Ars! (Score:1)
Re:Bad compilers? (Score:1)
It would be interesting to see if this could function as an adjunct to existing compilers, if the compiler were to embed directional *waypoints* that this could use in the real-time rework... And of course, this could eventually make it into a mainstream CPU in the same manner as Transmeta is using the concept; add some large L2 cache and get this loaded in, and we might see real improvements.
Re:Bad compilers? (Score:1)
A compiler couldn't do this ahead of time. Here's why:
Imagine you have a program that reads some files (this happens on occasion). Now you compile that code, and run a profiler on it with "Data Set A". You then feed that profiled data into your hypothetical compiler to optimize your executable.
Now you run that executable with a completely different set of data, "Data Set B". "Data Set B" doesn't take the program through the same paths of execution as "Data Set A", so now your program may actually run slower than if you did no profiled optimizations at all.
Doing this at run time, the interpreter adapts on each execution, so the program is automatically optimized no matter what the input is.
Re:Runtime morphing is NOT NEW (Score:1)
An employer of mine told me of the time the IBM CSE, upon first seeing the mutated machine, said, "You're running 3x as many users as this system is supposed to support. Corporate headquarters is going to put a price on your head."
Most "new computer technologies" have their roots back in the '60s and early '70s. Changes have made some of them more effective or important now, resulting in their resurfacing as "bold new concepts".
Re:Runtime Optimization in Linux Kernel? (Score:1)
http://citeseer.nj.nec.com/massalin92synthesis.html [nec.com]
Moderate Down! Ignore! (Score:1)
Do not click here
Uh oh, not good.
Thank Goodness! (Score:1)
Re:20% faster than native (Score:1)
Yup, that would explain everything. My computer is constantly getting slower.
Re:Is that really runtime morphing? (Score:1)
Your point about the distinction between runtime morphing and emulation is well taken, however.
Is that really runtime morphing? (Score:1)
(Which, of course, has almost always been notoriously slow)
Re:Runtime morphing is NOT NEW (Score:1)
When you drill down to the details, there's a fairly large difference between runtime morphing of pcode-based and/or interpreted languages like Smalltalk and Java and runtime morphing of native code, which is what's being discussed here. This stuff is obviously built on lots of past research, but it IS a rather nice new use of ideas and technology. I'd love to see widespread use of this type of tech on x86 and other platforms.
Also, the Mac emulators for the Amiga, even the hardware based ones, did no code morphing. The Amiga and Mac were both based on the exact same processor line (680x0), and thus no translation was needed. My guess is you are thinking of that one Mac emulator that came out that promised it would be useful for emulating all sorts of other computers. That was mostly smoke and mirrors. The board was pretty much just a hardware add-on for the Amiga that would let you plug a Mac ROM in so you didn't have to use a binary image on disk.
Most Mac emulators on the Amiga at the time, even the software-only ones, were ever so slightly faster than a Mac running the same CPU. This was no fancy code trickery... just superior (at the time) video hardware in the Amiga, coupled with the fact that the Mac ROM would be cached within fast RAM if it was available.
Re:Bad compilers? (Score:1)
But you don't always have the code for applications you might want to speed up. You'd be depending upon the vendor to go back and recompile with the new optimizer...
Also...
The other problem is that the compiler would generate abnormally large binaries (due to all the unrolled/prebranched fragments). Doing it at runtime gets you the same sort of speed advantage without users doing a double-take when the binary that used to be 3 megabytes compiled is now closer to 37 or so megabytes... even if it is a few percent faster to run.
Re:Runtime morphing is NOT NEW (Score:1)
> Geesh people, Runtime morphing is NOT NEW. The company that Sun bought to do the Hotspot VM did it with Smalltalk YEARS before Transmeta.
Close enough. Sun didn't buy a company as far as I know. The technology comes mostly from the Self group [sun.com] at Sun. This is credited (but maybe not enough!) in the Hotspot whitepaper [sun.com] (try searching for 'Self').
[Begin shameful namedrop]When I was visiting the Self group in '96, they told me how frustrated they were that they knew how to speed up Java and couldn't get the Java guys to pay any attention... it was good to see that they got the message across eventually.[end shameful namedrop]
Dynamic optimisation is totally not new but there's always room for more research.
By the way Self is really worth checking out. How many environments let you change your inheritance at runtime? Or blitz the machine by visually manipulating the interpreter's settings?
RyanS
Runtime morphing is NOT NEW (Score:2)
Geesh people, Runtime morphing is NOT NEW. The company that Sun bought to do the Hotspot VM did it with Smalltalk YEARS before Transmeta.
Heck, there were even products for the Amiga that would take code for the Mac and change it around to work on the Amiga. (That was a hardware-based product.) Not sure how well they optimized the code, but it still fits into this category.
Re:Letting you know... (Score:2)
This is one problem Java has for performance: since they wrote most of the libraries in Java, you spend most of your time in the virtual machine.
Re:Is that really runtime morphing? (Score:2)
Similar things have been demonstrated for Smalltalk. Lisp has certainly outperformed a lot of other languages in benchmarks, but it's mostly compiled to native code anyway.
Re:Runtime morphing is NOT NEW (Score:2)
Re:Is that really runtime morphing? (Score:2)
The statement that emulation has almost always been notoriously slow is very misleading. AS/400s only do "emulation". JVMs only do "emulation". Smalltalk, LISP, and any other language that defines a VM for its runtime are all doing "emulation".
Yabut (Score:2)
But I don't know of any widely known languages before Perl which were predominantly thought of as interpreted languages and were compiled. The overhead of compiling was just thought to be too large for an interactive program...
Cheers,
Ben
* Piece of trivia I saw in the Dragon Book. The story is that LISP was used as an informal way to write down a program that you would then convert by hand into assembler. Well someone decided to demonstrate that LISP made a nice general purpose machine and wrote an "eval" function in LISP. Someone else saw that, realized that eval was *also* a LISP interpreter and converted it by hand into the first working LISP interpreter. I remember that this was in 1959, but I forget the names...
Re:Bad compilers? (Score:2)
Compilers only have the static representation of the program available to them. They have no idea what will be executed at runtime or how frequently. The only way they could ever approximate this is to perform a dataflow analysis - no faster, less effective, and much more complex than actually running the program.
Not really; run-time vs compile-time (Score:2)
If you think of control flow as a tree, Dynamo is doing run-time tree balancing, twisting the code loops to shorten (or inline, if you will) the paths most often encountered.
Then again, this is just my preliminary, first-read, interpretation. (no pun int)
Sidebar: Has anyone noticed that even when Slashdot is doggy-slow, the ads pop up in no time flat, and then we get to stare at them while the rest of the page takes its time loading?
Re:A Priori vs. A Posteriori Optimizations (Score:2)
There's your problem right there. As a previous poster mentioned, for extremely dynamic systems, picking representative data is difficult, maybe even impossible.
That's not to say this isn't cool tech though. I'd love to see this implemented on x86, just to see what's possible.
Re:An artifact of HP architecture? (Score:2)
R U shittin' me? 20% is a kick-ass SPEC improvement, esp. since it's done totally in software. You can bet that there will be some sort of hardware support soon. Like the article states at the outset, the compiler teams get hugeass bonuses based on piddly decimal-point improvements (I might be overstating the case somewhat, tho). 20% is like, well, huge. The compiler guys didn't believe the numbers at first.
R U shittin' me, mk II? Compared to the black voodoo magic of some compiler transformations, this transformation is almost crystal clear. And it runs in REAL TIME. These days, O(n^3) worst case is almost acceptable for static optimisation algs.
But you are right about your other points, sorta. While hint bits are a cheapass version of this idea, it appears that the real gains come when you start completely removing the branches altogether and propagating assumptions forward (redoing register allocation, that sort of thing).
Even more fun would be to cross kernel boundaries and inline kernel calls. Transmeta can do this, as their optimiser runs at the meta level, whilst Dynamo cannot, as it is merely a user-level program.
Well an idea usually has to be changed a little (Score:2)
What I have yet to see is any of this affect a real desktop computer that I can buy. A Transmeta processor would be a nice touch, but tell me: why waste something like that on an embedded or handheld device, which almost by definition has severe limitations?
Re:Yabut (Score:2)
I do: Beginner's All-purpose Symbolic Instruction Code, otherwise known as BASIC.
Since around 1980 at least, it seems most people thought of it as an interpreted language, or, more properly, of its implementations as interpreters.
For early PCs that may well have been true (with keywords and such converted to byte codes internally, to save space versus storing the whole program in ASCII).
But the original and many subsequent BASICs were environments that looked interpreted but actually compiled when the "RUN" command was issued.
(Which, for a large program, would mean a noticeable wait before anything useful happened within the program.)
That goes back to the mid-60s, and through my own personal close encounter with TOPS-10 BASIC around the mid-70s, when I first learned how compilers worked (and what PDP-10 assembly code really looked like) by typing in the whole thing from a listing! (Took about three months or so on my beat-up KSR-33 Teletype.)
An artifact of HP architecture? (Score:2)
Intel IA-32 (i386 through Pentium III) doesn't work that way. In the Pentium Pro and later, there are branch prediction bits in the cache, set as the code executes. After a few passes through a loop, the most probable path is known and speculative execution gets smarter. Intel does its scheduling in hardware.
I'm not thrilled about adding all that hard-to-debug software to get only 20% better performance on SPECmark benchmarks.
Re:Bad compilers? (Score:2)
Even if you wrote optimisation code into your linker (which does see all the object files at once), it still couldn't optimise the way Dynamo does, because any particular function can be called from many different places in the code.
Jeff
The algorithm's the thing (Score:2)
When you run a Perl program, for instance, it actually compiles and then runs, unlike most interpreted languages, which translate and run instructions one at a time. Under this algorithm, a Perl script could instead begin by translating one instruction at a time, run until it hit a trace condition, increment the counter, and so on.
Following the Perl example, the downside is that Perl does some compile-time error checking, which wouldn't be done if you used this algorithm.
I'm speculating off the top of my head, mind you... but then, if I'm not talking sense, I'm sure someone will let me know! ;-)
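To make the hand-waving a bit more concrete, here is roughly the counting scheme I have in mind, sketched in C (every name and number here is invented, not taken from the article):

#include <stdint.h>

#define HOT_THRESHOLD 50    /* arbitrary; a real system would tune this */
#define TABLE_SIZE 4096

/* hypothetical helper: records and optimizes the hot region,
   returning a callable compiled fragment */
void (*compile_trace(uintptr_t target))(void);

typedef struct {
    uint32_t count;             /* how often this branch target was hit */
    void (*fragment)(void);     /* compiled trace; NULL until deemed hot */
} hot_entry;

static hot_entry table[TABLE_SIZE];

/* called by the interpreter at every backward-branch target */
void on_backward_branch(uintptr_t target)
{
    hot_entry *e = &table[target % TABLE_SIZE];
    if (e->fragment)
        e->fragment();                        /* hot: run the optimized trace */
    else if (++e->count >= HOT_THRESHOLD)
        e->fragment = compile_trace(target);  /* just got hot: optimize it */
    /* otherwise: keep interpreting instruction by instruction */
}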
Bad compilers? (Score:3)
Stephen L. Palmer
---
Here is the result of your Slashdot Purity Test.
You answered "yes" to 86 of 200 questions, making you 57.0%
Letting you know... (Score:3)
Also note that while Perl was the first well-known interpreted language to do this, it is an idea that is now considered the norm.
But to apply HP's ideas to Perl would be quite possible and probably beneficial. Today Perl has a lot of run-time lookups. For instance, if I write a function foo and call it from a function bar, then (I believe) every time bar runs there is a run-time lookup to find foo, because you might have changed it.
But foo almost never changes from call to call! Imagine instead the following. Each time you run bar, first check a flag for whether you have profiled it. If you have not, then increment a counter for how many times it has been called. If that counter is below some level, just execute. If it is above some target, then profile it, recording information from which you can later check whether the profiled version is still valid. After profiling, execute the profiled version.
If your original check found that you had profiled it, then quickly check that the profiled version has not been invalidated, and if all is fine, run the profiled version. If it has been invalidated, then discard the profiled version, reset the counter to 0, and run the safe version.
So... what does this complicated logic do? Why, for a minor run-time penalty you manage to remove most of the repeated dynamic run-time lookups of which function to call, probably with significant speedups! (It could be an even bigger win if you did a similar trick on method-call lookups.)
The basic technique of taking time out to store a fast static lookup for data that would otherwise be recalculated is called memoizing, and it is good to know. (In Perl, memoizing is almost always done with a lexical hash (i.e., one declared with my) kept around the function: check the memoized hash and return the stored answer if you can; otherwise calculate the answer, stick it in the hash, and return it.)
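To illustrate memoizing outside of Perl, here is a toy C version (invented example; a small array stands in for the hash, which works because the domain is tiny):

#include <stdint.h>
#include <stdio.h>

/* memoized Fibonacci: each value is computed once, then looked up */
static uint64_t cache[94];       /* fib(93) is the largest that fits in 64 bits */
static unsigned char known[94];

static uint64_t fib(unsigned n)  /* caller must keep n <= 93 */
{
    if (n < 2)
        return n;
    if (!known[n]) {             /* cache miss: do the work once */
        cache[n] = fib(n - 1) + fib(n - 2);
        known[n] = 1;
    }
    return cache[n];             /* cache hit: constant-time lookup */
}

int main(void)
{
    printf("%llu\n", (unsigned long long)fib(90));  /* instant, not exponential */
    return 0;
}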
Cheers,
Ben
HP's Dymo Server (Score:3)
...Oh, you said *Dynamo*! Sorry!
...I'll go away now....
Re:Bad compilers? (Score:4)
The thing about static compilers is that they have no idea what happens at run-time. Profiling has been used to mitigate this somewhat, but it's still a huge problem.
Accesses through memory are slow, so you want to get rid of them. One way to do this is through register allocation. Unfortunately, even if an infinite number of registers were available, you couldn't allocate everything to registers.
Why? Because we use pointers. There are multiple names for the same data running around in our little electronic brain. When you allocate something to a register, you bind it to one name. This is by definition incorrect for aliased data (data with multiple names).
Optimizations like register promotion try to get around this by allocating things in regions where the compiler can prove it only has one name. But this is exceedingly difficult when you have things like function calls which must be assumed to access and modify lots of data.
I won't even get into the problem of static instruction scheduling or other optimizations like partial redundancy elimination.
In short, aliasing through memory is nearly impossible to track accurately at static compile time. At run-time, the machine knows exactly which memory accesses reference which data, so things like run-time register allocation can do a better job. Crusoe does this to a limited extent.
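A made-up two-line example of the aliasing problem described above:

/* Invented example: a and b might point to the same int, so the
   compiler cannot keep *a in a register across the store to *b. */
int sum_twice(int *a, int *b)
{
    int s = *a;   /* first load of *a */
    *b = 7;       /* may or may not clobber *a -- unknowable statically */
    s += *a;      /* must reload *a from memory to be correct */
    return s;
}

/* At run time, an optimizer that observes a != b on the hot path can
   guard that assumption and reuse the first load. */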
Dynamo is essentially a software trace cache [wisc.edu], except that when forming the trace it applies transformations like common subexpression elimination and other traditional compiler manipulations.
IBM has the Daisy project, which does code morphing from PPC to a VLIW ISA. I believe it also does some run-time optimizations. Projects like DyC [washington.edu] and Tempo [irisa.fr] have been compiling at run-time for a while now.
I like to think of dynamic compilation in terms of the stock market. Which would you rather do: trade stocks with only limited information about their past behavior (and sometimes none at all), or trade stocks after having observed the absolutely most recent trends? I'll bet that if you pick the first strategy and I pick the second, I'll beat you every time.
That said, there are tricks you can pull with static compilation. IA64 has the ALAT, which lets the machine track when store addresses match load addresses. This lets the static compiler speculatively move a load ahead of a store. If the store conflicts, the machine will execute some compiler-provided code to fix up the error. Essentially, the compiler assumes that the load and store do not reference the same data and communicates that assumption to the machine. The machine checks the assumption and invokes the fixup code if it proves incorrect.
--
Re:Bad compilers? (Score:4)
As an example, consider a fragment of code that performs an operation on a sequence of numbers. The sequence is unknown at design time - bounds checking, conditional statements, and other code all need to be compiled in to ensure different sequences can be handled. At runtime it may be that the same 4-number sequence is processed a large number of times. The runtime optimiser (Dynamo) detects this and can optimise out the bounds checking and the conditional statements and streamline the rest of the code.
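Something along these lines (an invented illustration, not from the article):

/* Generic code: must handle any length, so guards and loop tests stay in. */
int sum(const int *seq, int len)
{
    int i, total = 0;
    if (seq == 0 || len <= 0)          /* defensive checks compiled in */
        return 0;
    for (i = 0; i < len; i++)          /* loop test on every iteration */
        total += seq[i];
    return total;
}

/* What a runtime specializer could emit after observing len == 4
   over and over: one guard, then straight-line code. */
int sum_specialized(const int *seq, int len)
{
    if (len != 4)                      /* fall back if the assumption breaks */
        return sum(seq, len);
    return seq[0] + seq[1] + seq[2] + seq[3];
}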
It's very clever technology, but it's surprising it hasn't been done more extensively before now. What would be very interesting is to see if something like VMWare incorporated this technology - could it effectively negate the CPU cost of running VMWare itself?
~Cederic