Next-Gen Processor Unveiled
A bunch of readers sent us word on the prototype for a new general-purpose processor with the potential to reach trillions of calculations per second. TRIPS (obligatory back-formation given in the article) was designed and built by a team at the University of Texas at Austin. The TRIPS chip is a demonstration of a new class of processing architectures called Explicit Data Graph Execution (EDGE). Each TRIPS chip contains two processing cores, each of which can issue 16 operations per cycle, with up to 1,024 instructions in flight simultaneously. The article claims that current high-performance processors are typically designed to sustain a maximum execution rate of four operations per cycle.
I want one (Score:4, Insightful)
Re:I want one (Score:5, Funny)
But when are they likely to be ready?
Re:I want one (Score:4, Funny)
One word: (Score:2, Funny)
Re: (Score:2)
Re: (Score:2, Funny)
Hm... (Score:5, Insightful)
But it seems to me that we called this great new invention "vector processors" 15 years ago, and there is a reason they aren't around anymore.
"Many instructions in flight" == "huge pipeline flushes on context switches" + "huge branching penalties", anybody?
Re:Hm... (Score:5, Interesting)
Re:Hm... (Score:5, Informative)
The vector processors never went away. They just became your graphics card: 128 floating point units at your command [nvidia.com]
BTW, here is a real article on TRIPS [utexas.edu].
Re:Hm... (Score:5, Informative)
Re: (Score:2)
Re: (Score:3, Informative)
The idea is simple: instead of discovering instruction-level parallelism by checking dependencies and anti-dependencies through global names (registers), you define the dependencies directly, by relating instructions to one another.
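To make that concrete, here is a minimal sketch of the idea in C, using an invented mini-ISA (the encoding, operand slots, and firing loop are illustrative assumptions, not the actual TRIPS ISA): each producer names the consuming instruction and the operand slot for its result, and an instruction fires as soon as all its operands have arrived. No register names appear anywhere.

    /* Hypothetical EDGE-style mini-ISA: each instruction's dependencies are
     * expressed as direct targets ("send my result to instruction T, slot S"),
     * with no register names at all. Invented for illustration; this is not
     * the real TRIPS ISA. */
    #include <stdio.h>

    #define NOPND 2
    #define NONE  (-1)

    typedef struct {
        char op;            /* '+', '-', '*', or '#' for a literal       */
        int  literal;       /* value when op == '#'                      */
        int  opnd[NOPND];   /* operand values, filled in by producers    */
        int  arrived;       /* how many operands have arrived so far     */
        int  target, slot;  /* consumer instruction and its operand slot */
    } Insn;

    /* (a + b) * (c - d) with a=3, b=4, c=9, d=2 */
    static Insn prog[] = {
        { '#', 3, {0,0}, 0, 2, 0 },    /* 0: a   -> insn 2, slot 0 */
        { '#', 4, {0,0}, 0, 2, 1 },    /* 1: b   -> insn 2, slot 1 */
        { '+', 0, {0,0}, 0, 6, 0 },    /* 2: a+b -> insn 6, slot 0 */
        { '#', 9, {0,0}, 0, 5, 0 },    /* 3: c   -> insn 5, slot 0 */
        { '#', 2, {0,0}, 0, 5, 1 },    /* 4: d   -> insn 5, slot 1 */
        { '-', 0, {0,0}, 0, 6, 1 },    /* 5: c-d -> insn 6, slot 1 */
        { '*', 0, {0,0}, 0, NONE, 0 }, /* 6: final result          */
    };

    static int needed(const Insn *i) { return i->op == '#' ? 0 : NOPND; }

    static int eval(const Insn *i) {
        switch (i->op) {
        case '#': return i->literal;
        case '+': return i->opnd[0] + i->opnd[1];
        case '-': return i->opnd[0] - i->opnd[1];
        default:  return i->opnd[0] * i->opnd[1];
        }
    }

    int main(void) {
        int n = (int)(sizeof prog / sizeof prog[0]);
        int done[8] = {0};
        for (int fired = 0; fired < n; )
            for (int i = 0; i < n; i++) {
                if (done[i] || prog[i].arrived < needed(&prog[i]))
                    continue;                   /* operands not all here yet */
                int v = eval(&prog[i]);         /* fire                      */
                done[i] = 1; fired++;
                if (prog[i].target == NONE)
                    printf("result = %d\n", v); /* prints: result = 49       */
                else {                          /* forward the value straight */
                    Insn *t = &prog[prog[i].target]; /* to the consumer       */
                    t->opnd[prog[i].slot] = v;
                    t->arrived++;
                }
            }
        return 0;
    }

Note that instruction 2 never says "write r1"; it says "send my result to instruction 6, slot 0", which is the whole trick.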
> "Many instructions in flight"=="huge pipeline flushes on context switches"+"huge branching penalities" anybody?
That equality does not exist. It is a wid
Re: (Score:2)
So what happens when you have more possible branches than you have execution units?
Silly, you run over to MicroCenter and buy more hardware.
Re: (Score:2)
Re: (Score:2)
actually... (Score:2)
Actually, from what I can tell, it's more like a VLIW with its program chopped up into horizontal and vertical microcode "chunks" for more efficient register forwarding than a vector processor...
I figure that it chops up the code into 128-instruction chunks (or smaller if there are branch dependencies that can't be handled with predicates) and schedules it horizontally (the classic wide VLIW microcode which feeds independent instruction pipelines), and vertically (the sequenc
Re: (Score:2)
Doug Burger [utexas.edu] is one of the main PIs on this project (which is around seven years old at this point). I'm sure you can find more information there if you are interested.
Re: (Score:2)
Re: (Score:2)
http://www.cs.utexas.edu/~trips/publications.html [utexas.edu]
Instructions are grouped together with their data dependencies, which are explicitly encoded in the instruction stream, which means that generating code that runs well on that thing doesn't sound that difficult. IANA compiler expert, but I think a compiler able to generate good code for this out of regular, scalar code sounds quite plausible.
Re: (Score:2)
You could also avoid some switching by keeping micro-contexts - separating the context of the various units and letting the software deal with them independently. This way, if you only have use for 5 pro
Re: (Score:2)
Re: (Score:2)
Re:Hm... (Score:4, Informative)
Re: (Score:2)
I'm willing to bet that you typed that on a machine with a vector processor. What happened is that they became integrated into general-purpose CPUs. The AltiVec unit in my Mac's PowerPC chip is a vector processor, as is the SSE unit in Intel CPUs.
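For the skeptical: a minimal C example of that SSE unit doing four single-precision additions with one vector instruction, using the standard <xmmintrin.h> intrinsics (the values are arbitrary).

    /* Four float adds issued as a single SSE vector instruction. */
    #include <stdio.h>
    #include <xmmintrin.h>   /* SSE intrinsics, available in gcc/clang/MSVC */

    int main(void) {
        __m128 a = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);  /* one 128-bit vector */
        __m128 b = _mm_set_ps(40.0f, 30.0f, 20.0f, 10.0f);
        __m128 c = _mm_add_ps(a, b);  /* 4 additions in one vector operation */

        float out[4];
        _mm_storeu_ps(out, c);
        printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]); /* 11 22 33 44 */
        return 0;
    }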
Next-Gen Business Model Unveiled (Score:3, Insightful)
2. Make sure google ads show up at the top of the page
3. Submit blog to slashdot
4. Profit
Re: (Score:2)
Re: (Score:2, Funny)
Re: (Score:2)
Ew... Where did that come from? I need to get out more.
Marketing hype? (Score:5, Informative)
Is it me, or are they just rephrasing, euphemistically, out-of-order execution [wikipedia.org]?
Re:Marketing hype? (Score:5, Informative)
And as an aside, the reason modern CPUs are designed to "only" issue 4 instructions per cycle instead of 16 is that, after years of careful research and testing on real-world applications, 4 instructions is almost always the maximum number any program can issue concurrently, due to issues like branches, cache misses, data dependencies, etc. Makes me question just how much these "professors" really know.
Re:Marketing hype? (Score:4, Interesting)
Re: (Score:2)
Re: (Score:2)
From the paper "Scaling to the End of Silicon with EDGE Architectures": TRIPS ISAs are hardware-dependent, though, which means that you'd have to recompile your applications each time you use a new CPU. If I understood correctly, this is a significant problem (that, and the memory wall).
Re:Marketing hype? (Score:5, Interesting)
Having read the articles that were easy to get to, and the abstract of the PhD student: this is buzzword bollocks. There is no innovation in what they have done. As other people have pointed out, this is a vector/datastream architecture, and not a very good one at that. Although it has the "potential" to scale to teraflops, so does my toothbrush. On a 130nm process they can fit 2 cores with 32-wide dispatch clocked at 500MHz. My 7800 is fabbed on a 130nm process with 24*4*4 = 384-operation-wide vector dispatch. This prototype would hit about 16 billion ops/sec, versus 180 Gflops on the 7800. This is a long way from teraflops, and doesn't convince me that it can scale.
As the 7800 is close to a systolic model, there is a limited class of programs that can be executed; but those that are in that class exhibit (near-)perfect parallelism and so take zero hit from memory access costs. Actually, the internal bandwidth on the 7800 is a bottleneck for some computations, but I'm just going for coarse detail here.
EDGE appears to mix and match ideas from several parallel designs, every one of which suffers from hard code-generation problems. I suspect that the only sample applications that hit 32 ops/cycle are media apps (or dataflow problems, as they used to be called), which normal architectures run at high speed anyway.
Interesting research, as it's always good to see people explore different designs, but it sounds overhyped and I believe it has zero commercial appeal. Finally, as a sidenote, you are right about cache latencies being a memory defect rather than a processor defect, but there are ways around it. If you are willing to limit yourself to a certain class of applications (roughly the same one that executes well on most parallel architectures such as this, or GPUs) then you can completely avoid the latency. This provides a much bigger performance hike than any other technique, as memory latency is a dominating factor in most runtimes. The only snag is that it is very hard to do, requires different fabrication technology (largely solved now), and lots of compiler advances... If you're interested, then google for intelligent RAM. It's about a decade of research now...
Re:Marketing hype? (Score:4, Interesting)
No, it isn't. The TRIPS group has done some really interesting things with compilers, for example. They've managed to have the compiler break up code into packets and schedule them on the processor array so that dependencies flow nicely across the grid. That is not an easy problem to tackle. This is very good research.
That's not the point of research. The point of research is to explore problems no one has tackled before, of course always with an eye toward future technology trends.
Re: (Score:2)
From the way that you misquoted me and then attacked a strawman, either you don't understand what research is about, or you do know but scoring points is
Re: (Score:2)
I had no idea Atari's third 8-bit console was so powerful. It's too bad they had to shelve the system when the market crash hit, and never gained back the developer or user
Re: (Score:2)
Re:Marketing hype? (Score:4, Informative)
Re: (Score:2)
I'd call it OOE perfected. EDGE allows OOE on scales orders of magnitude larger than current architectures can manage (and brings a few other benefits), using less (and less power-consuming) hardware. This is accomplished through a rather interesting paradigm inversion (I haven't seen anything like it before, though I don't exactly sc
Re: (Score:2)
Re: (Score:2)
vs
Current high-performance processors are typically designed to sustain a maximum execution rate of four operations per cycle. [scienceblog.com]
They are comparing apples against oranges (!), as you cannot compare 16 OoO-executed instructions per cycle against 4 *retired, in-order* instructions per cycle (where, to achieve those 4 instructions/cycle, you may have had to execute 10, 20 or more instructions out of order (!)). Please, where is the rigor
Re: (Score:2)
TRIPS (Score:2)
Re:TRIPS (Score:4, Insightful)
One trillion calculations per second by 2012 (Score:3, Informative)
Key Innovations:
Scalable and distributed processor core composed of replicated heterogeneous tiles
Non-uniform cache architecture and implementation
On-chip networks for operands and data traffic
Configurable on-chip memory system with capability to shift storage between cache and physical memory
Composable processors constructed by aggregating homogeneous processor tiles
Compiler algorithms and an implementation that create atomically executable blocks of code
Spatial instruction scheduling algorithms and implementation
TRIPS Hardware and Software
Re: (Score:3, Informative)
Check out this writeup at HPC wire [hpcwire.com].
A major design goal of the TRIPS architecture is to support "polymorphism," that is, the capability to provide high-performance execution for many different application domains. Polymorphism is one of the main capabilities sought by DARPA, TRIPS' principal sponsor. The objective is to enable a single processor to perform as if it were a heterogeneous set of special-purpose processors. The advantages of this approach, in terms of scalability and simplicity of design, are obvious.
To implement polymorphism, the TRIPS architecture employs three levels of concurrency: instruction-level, thread-level and data-level parallelism (ILP, TLP, and DLP, respectively). At run-time, the grid of execution nodes can be dynamically reconfigured so that the hardware can obtain the best performance based on the type of concurrency inherent to the application. In this way, the TRIPS architecture can adapt to a broad range of application types, including desktop, signal processing, graphics, server, scientific and embedded.
obvious drm implications.. (Score:2)
Re: (Score:2)
Re: (Score:2)
Not true; otherwise they would not be general-purpose, because they would not run every piece of x86 software thrown at them.
With the current architecture, the key has to be in plaintext in one of the registers, which can then be dumped.
In this proposed architecture, it can be passed from instruction to instruction throughout huge contiguous blocks of code without touching a register.
this also brings up the related issue of debugg
Re: (Score:2)
Re: (Score:2)
Scalable and distributed processor core composed of replicated heterogeneous tiles [wikipedia.org]?
Non-uniform cache architecture and implementation [wikipedia.org]?
Well, very disappointing when compared to other [wikipedia.org] modern [wikipedia.org] microprocessor [wikipedia.org] architectures [wikipedia.org]. Don't get me wrong, I love computer architecture, and the design seems interesting, but the over-hype is discouraging.
Re: (Score:2)
Exactly. As is explicitly stated in the PDF [utexas.edu] linked from this [slashdot.org] comment by volsung, TRIPS is an implementation of EDGE.
Must...resist...obvious...joke (Score:3, Funny)
Nope, the most obvious joke is (Score:2)
Re: (Score:2)
Care for a game of chess? Nothing drives innovation in processing power more than a good game of chess.
Re: (Score:2)
Re: (Score:2)
Let me be the first to say.... (Score:2)
Ix86 (Score:2)
It did say 'general purpose', and if you try to create something better but different, you get slapped down eventually (like PowerPC Apples).
Re:Ix86 (Score:5, Insightful)
something which doesn't ultimately look like an x86.
Re: (Score:2)
Having a compatibility layer will help prevent it from being a doomed project/product.
Re: (Score:2)
Re: (Score:2)
In fact, this is exactly how modern Intel and AMD chips work. Internally, they are RISC, with an instruction reordering unit and a vector unit. All the horrible instructions in the x86 ISA are microcoded, allowing the
Re: (Score:2)
and 90% of it is too slow? or in java? (Score:2)
Video transcoding.
Rendering farms - need a $500 solution that can outdo a $35,000 solution. I.e., 10 x $35 chips on a $20 card + profit margin and a yearly software licence.
Folding type apps.
Nuclear/Sci sims.
Gets rid of the register-file (Score:5, Insightful)
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
This is cool! (Score:4, Informative)
The PDF here has more information about EDGE [utexas.edu].
The basic idea is that CISC/RISC architectures rely on storing intermediate data in registers (or in main memory on old skool CISC). EDGE bypasses registers: the output of one instruction is fed directly to the input of the next. No need to do register allocation while compiling. I'm still reading the PDF, but this sounds like a really neat idea.
The only question is: will this be so much better than existing ISAs that it eventually replaces them, even if only for specific applications like high-performance computing?
Re: (Score:3, Insightful)
Re: (Score:2)
Doug Burger's work has been known to computer scientists for years...
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
Hmm. Interesting.
I wonder how this differs from the dataflow architectures of the early 90s?
Re: (Score:2)
Re: (Score:2)
Or running Java. Or CLR.
Moore's law immortal? (Score:3, Interesting)
Re: (Score:2)
Re: (Score:2)
Since there's a finite atom density on a chip, transistor density will inevitably stop growing eventually.
but... (Score:4, Insightful)
They point out that current multi-core architectures put a huge burden on the software developer. This is true, but their claim that this technology will relieve that burden is dubious. They mention, for example, that current processing cores can typically only perform 4 simultaneous operations per core, and imply that this is some kind of weakness. They completely fail to mention that the vast majority of applications running on those processors don't even use the 4 available scheduling resources in each core. In other words, the number of applications that would benefit from being able to execute more than 4 simultaneous instructions in the same core is vanishingly small. This is why most current processors have stopped at 3 or 4: not because designers haven't thought of pushing beyond that, but because it is expensive and yields very little return on the investment. Very few real-world users would see any performance benefit if the current cores on the market were any wider than 3 or 4. Most of those users aren't even using the 4 that are currently available.
Certainly the ability to do 1,024 operations simultaneously in a single core is impressive. But it is not an ability that magically solves any of the current bottlenecks in multi-threaded software design. Most software application developers have difficulty figuring out what to do with multiple cores. Those same developers would have just as much (if not more) difficulty figuring out what to do with the extra resources in a core that can execute 1,024 simultaneous operations.
Re:but... (Score:4, Informative)
Re: (Score:3, Interesting)
Re: (Score:2)
Consider a loop with a medium-sized body whose iterations are mostly independent. If there are enough simultaneous operations allowed to schedule multiple iterations of the loop at once, the loop could potentially run that many times faster. Now, with current designs, there aren't that many slots, and even if there were, the ISA makes it difficult to express this in a way that's useful to the processor. All we can do is O
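A minimal example of the kind of loop being described (the saxpy-style body and names are invented for illustration): no iteration depends on any other, so a machine with enough issue slots could, in principle, have many iterations in flight at once with no programmer effort.

    /* Independent iterations: ideal fodder for a very wide machine. */
    void saxpy(int n, float a, const float *x, float *y) {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];   /* no iteration reads another's result */
    }

On a 4-wide core, only a couple of these iterations can overlap; a design with 1,024 instructions in flight could cover hundreds of them, memory permitting.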
Re:Better support for concurrency in Languages (Score:2, Insightful)
A lot of this is due to the fact that most popular languages right now do not support concurrency very well. Most common languages are stateful, and state and concurrency are rather antithetical to one another. The solution is to gradually evolve toward languages that solve this either by forsaking state (Haskell, Erlang) or by using something like transaction memory for encapsulating state in a way that is easy to deal with (Haskell's STM, Fortress (I think), maybe some others).
Concurrency is not that
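To see the state-versus-concurrency problem in a conventional stateful language, here is a small C sketch (pthreads; the counter and loop counts are invented): the same shared update is wrong as bare state and correct once the state is encapsulated behind an atomic.

    /* Two threads bump a shared counter a million times each. */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    static long        racy = 0;   /* bare shared state: races               */
    static atomic_long safe = 0;   /* same state, behind an atomic interface */

    static void *worker(void *arg) {
        (void)arg;
        for (int i = 0; i < 1000000; i++) {
            racy++;                      /* load/add/store: updates get lost */
            atomic_fetch_add(&safe, 1);  /* indivisible update               */
        }
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        /* "racy" usually comes out below 2000000; "safe" is always exact. */
        printf("racy=%ld safe=%ld\n", racy, atomic_load(&safe));
        return 0;
    }

STM, as in Haskell, generalizes the atomic case from a single counter to arbitrary composed updates; that is the kind of encapsulation the parent is pointing at.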
Re: (Score:2)
TRIPS may solve some problems (Score:2, Informative)
Re: (Score:3, Interesting)
Re: (Score:2)
Yes, but that's primarily because most of those resources are specialized. One or two of those are integer paths, one's a branch system, another is floating point, and so on. If the current code block doesn't include any of those specialized instructions, then those particular execution paths sit there unused.
nothing spectacular (Score:5, Informative)
It might sound very novel if you are only accustomed to normal processors. Look at MOVE http://www.everything2.com/index.pl?node_id=10322
Secondly, they talk about how execution graphs are mapped onto their processing grid. I don't think any scheduler has a problem with scheduling an execution graph (or whatever name you give it) to an architecture. Generally, it can be scheduled in-time (there is a critical path somewhere) or it is scheduled with a certain degree (generally >
Now here comes the shameless self-plug. If you want to gain efficiency in scheduling the nodes of an execution graph, you have to know which nodes are more critical than others. The critical nodes (the ones on the critical path) need to be scheduled onto the fast/optimized processing units, and the others can be scheduled onto the slow/efficient processing units (where they can absorb some communication delay without penalty). Look at http://ce.et.tudelft.nl/publicationfiles/786_11_d
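A minimal sketch of that criticality computation in C, assuming a made-up six-node dataflow graph with per-node latencies: one backward pass over a topological order gives each node its critical-path length, which then serves as the priority for handing out the fast units.

    /* Backward critical-path pass over a small dataflow DAG. Illustration only. */
    #include <stdio.h>

    #define N 6
    /* dag[i][j] != 0 means node i feeds node j; nodes are already in
     * topological order, so all edges go from lower to higher indices. */
    static const int dag[N][N] = {
        {0,0,1,0,0,0},   /* 0 -> 2 */
        {0,0,1,0,0,0},   /* 1 -> 2 */
        {0,0,0,0,1,0},   /* 2 -> 4 */
        {0,0,0,0,1,0},   /* 3 -> 4 */
        {0,0,0,0,0,1},   /* 4 -> 5 */
        {0,0,0,0,0,0},   /* 5 is the sink */
    };
    static const int latency[N] = {1, 1, 3, 1, 3, 1};  /* per-node latencies */

    int main(void) {
        int cp[N];  /* longest latency-weighted path from node i to the sink */
        for (int i = N - 1; i >= 0; i--) {       /* reverse topological order */
            cp[i] = latency[i];
            for (int j = i + 1; j < N; j++)
                if (dag[i][j] && latency[i] + cp[j] > cp[i])
                    cp[i] = latency[i] + cp[j];
        }
        /* Hand the fast units to nodes in decreasing cp order (list
         * scheduling by criticality); here we just print each priority. */
        for (int i = 0; i < N; i++)
            printf("node %d: critical-path length %d\n", i, cp[i]);
        return 0;
    }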
And when it is determined to have bugs??? (Score:2)
This is just an update from a year ago... (Score:5, Informative)
Here is the slashdot article from 2003 about this processor: link [slashdot.org]
The specs have been updated to 1024 from 512, but that's about it.
Another 3-5 years out?
Re: (Score:2)
Don't dismiss it (Score:5, Informative)
http://www.cs.utexas.edu/~trips/ [utexas.edu]
They have several papers available that lay out the rationale for the architecture.
The designers of this architecture believed that conventional architectures were going to run into physical limitations that would prevent them from scaling further. One of the issues they foresaw was that, as feature size continued to shrink and die size continued to increase, chips would become susceptible to, and ultimately constrained by, wire delay: the amount of time it takes to send a signal from one part of a chip to another would limit the ultimate performance. To some extent, the shift in focus to multi-core CPUs validates some of their beliefs.
To address the wire-delay problem, the architecture attempts to limit the length of signal paths through the CPU by having instructions send their results directly to their dependent instructions instead of going through intermediate architectural registers. TRIPS is similar to VLIW in that many small instructions are grouped into larger instructions (blocks) by the compiler. However, it differs in how the operations within a block are scheduled.
TRIPS does not depend on the compiler to schedule the operations making up a block the way a VLIW architecture does. Instead, the TRIPS compiler maps the individual operations making up a large TRIPS instruction block onto a grid of execution units. Each execution unit in the grid has several reservation stations, effectively forming a three-dimensional execution substrate.
By having the compiler assign data-dependent instructions to execution units that are physically close to one another, the communication overhead on the chip can be reduced. The individual operations wait for their operands to arrive at their assigned execution unit; once all of an operation's dependencies are available, the operation fires and its result is forwarded to any waiting instruction. In this way, the operations making up the TRIPS block are dynamically scheduled according to the data flow of the block, and the amount of communication that has to occur across large distances is limited. Once an entire block has executed, it can be retired and its results can be written to a register or to memory.
At the block level, a TRIPS processor can still function much like a conventional processor. Blocks can be executed out of order, speculatively, or in parallel. The designers have also defined TRIPS as a polymorphous architecture, meaning the configuration and execution dynamics can be changed to best leverage the available parallelism. If code is highly parallelizable, it might make sense to map bigger blocks. However, by performing these types of operations at the level of a block instead of for each individual instruction, the overhead is theoretically drastically reduced.
There is some flexibility in how the hardware can be utilized. For some types of software with a high degree of parallelism you may want very large blocks; when there is less data-level parallelism available, it may be better to schedule multiple blocks onto the substrate simultaneously. I'm not sure how the prototype is implemented, but the designers have several papers where they discuss how a TRIPS-style architecture can be adapted to perform well on a wide gamut of software.
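As a toy illustration of the spatial scheduling described above (the grid size, dependence chain, and greedy policy are all invented here, not taken from the TRIPS compiler): place each operation of a block at the free ALU closest to its producer, so operands travel the fewest hops.

    /* Greedy placement of a dependence chain onto a 4x4 grid of ALUs. */
    #include <stdio.h>
    #include <stdlib.h>

    #define DIM 4
    static int grid[DIM][DIM];          /* -1 = free, else the op id placed */

    /* producer[i] is the op whose result op i consumes (-1 for leaves) */
    static const int producer[] = {-1, -1, 0, 2, 3, 4};
    #define NOPS ((int)(sizeof producer / sizeof producer[0]))

    int main(void) {
        int row[NOPS], col[NOPS];
        for (int r = 0; r < DIM; r++)
            for (int c = 0; c < DIM; c++) grid[r][c] = -1;

        for (int op = 0; op < NOPS; op++) {
            int br = 0, bc = 0, best = 1 << 30;
            for (int r = 0; r < DIM; r++)
                for (int c = 0; c < DIM; c++) {
                    if (grid[r][c] != -1) continue;   /* ALU already taken */
                    int p = producer[op];
                    /* Manhattan distance the operand would have to travel */
                    int d = (p < 0) ? 0 : abs(r - row[p]) + abs(c - col[p]);
                    if (d < best) { best = d; br = r; bc = c; }
                }
            grid[br][bc] = op; row[op] = br; col[op] = bc;
            printf("op %d -> ALU (%d,%d), operand hops: %d\n", op, br, bc, best);
        }
        return 0;
    }

A real placer also has to balance load across the grid and handle ops with two producers, but the objective, minimizing operand travel, is the same.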
Compilers Are Not Magic, or Why IA64 Didn't Work (Score:3, Insightful)
And it's true. You build a sea of ALUs, you sic some folks on hand-coding all sorts of things to the machine, and you end up with some spectacular results.
The problem is that we still can't get a compiler to do a good job at it, for the most part. We thought we could, and we threw every bell and whistle into IA64 for a compiler-controlled architecture, and you've seen what we've ended up with. Many years later, the situation is still pretty much the same: the compiler can't do all that great a job with these sorts of machines.
Don't get me wrong, there are lots of good ideas in TRIPS and the various other academic projects like it, but I've yet to be convinced that it's useful in any kind of real codebase that's not coded by hand by an army of graduate students. For some tasks, that's an acceptable model - it's been the model in the world of signal processing for quite a while (though becoming less so daily) - but for most mainstream applications it just won't fly.
That, and it's hard for compilers to have knowledge about history. It's terribly important for optimization, and it's just hard to get into the compiler (though relatively easy to get into a branch predictor).
Re:Compilers Are Not Magic, or Why IA64 Didn't Wor (Score:2)
Re:Welcome to 1994 (Score:4, Informative)