Supercomputing Hardware

Next-Gen Processor Unveiled

A bunch of readers sent us word on the prototype of a new general-purpose processor with the potential to reach trillions of calculations per second. TRIPS (obligatory back-formation given in the article) was designed and built by a team at the University of Texas at Austin. The TRIPS chip demonstrates a new class of processing architectures called Explicit Data Graph Execution. Each TRIPS chip contains two processing cores, each of which can issue 16 operations per cycle with up to 1,024 instructions in flight simultaneously. By contrast, the article claims, current high-performance processors are typically designed to sustain a maximum execution rate of four operations per cycle.
  • Marketing hype? (Score:5, Informative)

    by faragon ( 789704 ) on Tuesday April 24, 2007 @04:40PM (#18861085) Homepage
    Each TRIPS chip contains two processing cores, each of which can issue 16 operations per cycle with up to 1,024 instructions in flight simultaneously. Current high-performance processors are typically designed to sustain a maximum execution rate of four operations per cycle.

    Is it me, or are they euphemistically paraphrasing out-of-order execution [wikipedia.org]?
  • by xocp ( 575023 ) on Tuesday April 24, 2007 @04:41PM (#18861101)
    A link to the U of Texas project website can be found here [utexas.edu].

    Key Innovations:

    Explicit Data Graph Execution (EDGE) instruction set architecture
    Scalable and distributed processor core composed of replicated heterogeneous tiles
    Non-uniform cache architecture and implementation
    On-chip networks for operands and data traffic
    Configurable on-chip memory system with capability to shift storage between cache and physical memory
    Composable processors constructed by aggregating homogeneous processor tiles
    Compiler algorithms and an implementation that create atomically executable blocks of code
    Spatial instruction scheduling algorithms and implementation
  • Re:Hm... (Score:5, Informative)

    by volsung ( 378 ) <stan@mtrr.org> on Tuesday April 24, 2007 @04:42PM (#18861133)

    The vector processors never went away. They just became your graphics card: 128 floating point units at your command [nvidia.com]

    BTW, here is a real article on TRIPS [utexas.edu].

  • Re:Hm... (Score:3, Informative)

    by Anonymous Coward on Tuesday April 24, 2007 @04:48PM (#18861245)
    Actually, it is more like the dataflow architectures from the 70s. Vector processors are a totally different kind of thing (SIMD).

    The idea is simple: instead of discovering instruction-level parallelism by checking dependencies and anti-dependencies through global names (registers), you define the dependencies directly, by having instructions refer to one another.

    > "Many instructions in flight"=="huge pipeline flushes on context switches"+"huge branching penalties" anybody?

    That equality does not hold. This is wide parallel execution, not super-pipelining, ergo no huge branching penalties.
    Also, the architecture is more likely exploiting the wide execution units by predicating both sides of a branch and calculating them both.
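
    For anyone who hasn't seen predication before, here is a rough sketch of the idea in C (my own illustration, not TRIPS code): compute both sides of the branch and select the result with a predicate, so there is nothing to flush on a misprediction.

        #include <stdio.h>

        /* Illustrative only: predicated execution, sketched in C. A wide
           machine can compute both candidate results in parallel and commit
           one of them based on the predicate, instead of branching and
           risking a pipeline flush. The ?: below may still compile to a
           branch; on predicated hardware it would be a select. */
        int predicated_abs(int x) {
            int if_true  = -x;        /* result if (x < 0)  */
            int if_false =  x;        /* result if (x >= 0) */
            int p = (x < 0);          /* the predicate */
            return p ? if_true : if_false;   /* select, not a control transfer */
        }

        int main(void) {
            printf("%d %d\n", predicated_abs(-5), predicated_abs(7));   /* 5 7 */
            return 0;
        }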

  • This is cool! (Score:4, Informative)

    by adubey ( 82183 ) on Tuesday April 24, 2007 @04:50PM (#18861261)
    The link has NO information.

    This PDF has more information about EDGE [utexas.edu].

    The basic idea is that CISC/RISC architectures rely on storing intermediate data in registers (or in main memory, on old-school CISC). EDGE bypasses registers: the output of one instruction is fed directly to the input of the next, so there is no need to do register allocation while compiling. I'm still reading the PDF, but this sounds like a really neat idea.

    The only question is whether this will be so much better than existing ISAs that it eventually replaces them, even if only for specific applications like high-performance computing.
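
    To make the register-bypassing idea concrete, here is a toy dataflow interpreter in C (my own sketch, not the real TRIPS encoding): each instruction names the operand slot of its consumer instead of a destination register, and fires as soon as all of its operands have arrived.

        #include <stdio.h>

        /* Toy dataflow execution: an instruction forwards its result
           directly to a consumer's operand slot; no registers involved. */
        enum op { ADD, MUL, OUT };

        struct insn {
            enum op op;
            int operand[2];
            int have, needs;    /* operands arrived / operands required */
            int target, slot;   /* consumer instruction and its slot (-1 = none) */
        };

        static struct insn prog[] = {
            { ADD, {0,0}, 0, 2,  2, 0 },   /* i0: a+b -> i2, slot 0 */
            { MUL, {0,0}, 0, 2,  2, 1 },   /* i1: c*d -> i2, slot 1 */
            { OUT, {0,0}, 0, 2, -1, 0 },   /* i2: print both operands */
        };

        static void deliver(int i, int slot, int value);

        static void fire(int i) {
            struct insn *p = &prog[i];
            switch (p->op) {
            case ADD: deliver(p->target, p->slot, p->operand[0] + p->operand[1]); break;
            case MUL: deliver(p->target, p->slot, p->operand[0] * p->operand[1]); break;
            case OUT: printf("%d %d\n", p->operand[0], p->operand[1]); break;
            }
        }

        static void deliver(int i, int slot, int value) {
            prog[i].operand[slot] = value;
            if (++prog[i].have == prog[i].needs)
                fire(i);        /* fires the moment its inputs are ready */
        }

        int main(void) {
            /* inject the block inputs: a=1, b=2, c=3, d=4 */
            deliver(0, 0, 1); deliver(0, 1, 2);
            deliver(1, 0, 3); deliver(1, 1, 4);
            return 0;           /* prints "3 12" */
        }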
  • by xocp ( 575023 ) on Tuesday April 24, 2007 @04:51PM (#18861289)
    DARPA is the primary sponsor...
    Check out this writeup at HPC wire [hpcwire.com].

    A major design goal of the TRIPS architecture is to support "polymorphism," that is, the capability to provide high-performance execution for many different application domains. Polymorphism is one of the main capabilities sought by DARPA, TRIPS' principal sponsor. The objective is to enable a single processor to perform as if it were a heterogeneous set of special-purpose processors. The advantages of this approach, in terms of scalability and simplicity of design, are obvious.

    To implement polymorphism, the TRIPS architecture employs three levels of concurrency: instruction-level, thread-level and data-level parallelism (ILP, TLP, and DLP, respectively). At run-time, the grid of execution nodes can be dynamically reconfigured so that the hardware can obtain the best performance based on the type of concurrency inherent to the application. In this way, the TRIPS architecture can adapt to a broad range of application types, including desktop, signal processing, graphics, server, scientific and embedded.
  • Re:Marketing hype? (Score:5, Informative)

    by Aadain2001 ( 684036 ) on Tuesday April 24, 2007 @05:02PM (#18861459) Journal
    Based on the article, TRIPS is nothing more than an out-of-order (OOO) superscalar processor. So unless the article is grossly simplifying (possible), this is nothing but a PR stunt. And based on the quote from one of the professors about building it on "nanoscale" technology (um, we've been doing that for years now), my vote is pure PR BS.

    And as an aside, the reason modern CPUs are designed to "only" issue 4 instructions per cycle instead of 16 is that, after years of careful research and testing on real-world applications, 4 instructions is almost always the maximum any program can concurrently issue, due to issues like branches, cache misses, data dependencies, etc. Makes me question just how much these "professors" really know.

  • nothing spectacular (Score:5, Informative)

    by CBravo ( 35450 ) on Tuesday April 24, 2007 @05:08PM (#18861557)
    Right, let me begin by saying that after reading ftp://ftp.cs.utexas.edu/pub/dburger/papers/IEEECOMPUTER04_trips.pdf [utexas.edu] it became a bit clearer what they are talking about.

    It might sound very novel if you are only accustomed to normal processors. Look at MOVE http://www.everything2.com/index.pl?node_id=1032288&lastnode_id=0 [everything2.com] to see what transport-triggered architectures are about. They are more power efficient, etc etc.

    Secondly, they talk about how execution graphs are mapped onto their processing grid. I don't think any scheduler has a problem with scheduling an execution graph (or whatever name you give it) onto an architecture. Generally it can either be scheduled just in time (there is a critical path somewhere) or scheduled with a certain degree of optimality (generally > 0.9 efficient). I don't see the efficiency gain there.

    Now here comes the shameless self-plug. If you want to gain efficiency when scheduling the nodes of an execution graph, you have to know which nodes are more critical than the others. The critical nodes (the ones on the critical path) need to be scheduled onto the fast/optimized processing units, while the others can go to slow/efficient processing units (they can absorb some communication delay without penalty). Look at http://ce.et.tudelft.nl/publicationfiles/786_11_dhofstee_v1.0_18july2003_eindverslag.pdf [tudelft.nl] for my thesis.
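
    As a crude illustration of that criticality idea (my own sketch, not code from the thesis): compute each node's top level and bottom level in the dependence DAG, and steer the zero-slack nodes to the fast units.

        #include <stdio.h>

        #define N 5
        /* succ[i][j] != 0 means an edge i -> j in the execution graph */
        static const int succ[N][N] = {
            {0,1,1,0,0},   /* 0 -> 1, 0 -> 2 */
            {0,0,0,1,0},   /* 1 -> 3 */
            {0,0,0,1,0},   /* 2 -> 3 */
            {0,0,0,0,1},   /* 3 -> 4 */
            {0,0,0,0,0},
        };
        static const int latency[N] = { 1, 3, 1, 1, 1 };

        /* longest path from node i to any sink, including i's latency */
        static int down(int i) {
            int best = 0;
            for (int j = 0; j < N; j++)
                if (succ[i][j] && down(j) > best) best = down(j);
            return latency[i] + best;
        }

        /* longest path from any source into node i, excluding i's latency */
        static int up(int i) {
            int best = 0;
            for (int j = 0; j < N; j++)
                if (succ[j][i] && up(j) + latency[j] > best) best = up(j) + latency[j];
            return best;
        }

        int main(void) {
            int longest = 0;
            for (int i = 0; i < N; i++)
                if (down(i) > longest) longest = down(i);
            for (int i = 0; i < N; i++) {
                int slack = longest - (up(i) + down(i));
                printf("node %d: slack %d -> %s unit\n",
                       i, slack, slack == 0 ? "fast" : "slow/efficient");
            }
            return 0;
        }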
  • Re:but... (Score:4, Informative)

    by $RANDOMLUSER ( 804576 ) on Tuesday April 24, 2007 @05:15PM (#18861667)
    Two words: loop unwinding [wikipedia.org]. This critter is perfect for running all iterations of (certain) loops in parallel, where the independence is determinable at compile time.
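
    A tiny C example of what the compiler would emit (hand-unrolled here for clarity, and assuming n is a multiple of 4): the four statements in the unrolled body are independent, so a machine with enough ALUs can issue them together.

        #include <stdio.h>

        /* Unrolled by 4: no iteration depends on another, so the four adds
           can execute in parallel on a sufficiently wide machine. */
        void vec_add_unrolled(const int *a, const int *b, int *c, int n) {
            for (int i = 0; i < n; i += 4) {
                c[i]   = a[i]   + b[i];
                c[i+1] = a[i+1] + b[i+1];
                c[i+2] = a[i+2] + b[i+2];
                c[i+3] = a[i+3] + b[i+3];
            }
        }

        int main(void) {
            int a[] = {1,2,3,4}, b[] = {10,20,30,40}, c[4];
            vec_add_unrolled(a, b, c, 4);
            printf("%d %d %d %d\n", c[0], c[1], c[2], c[3]);   /* 11 22 33 44 */
            return 0;
        }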
  • Re:Welcome to 1994 (Score:4, Informative)

    by Wesley Felter ( 138342 ) <wesley@felter.org> on Tuesday April 24, 2007 @05:56PM (#18862321) Homepage
    EPIC (i.e. Itanium) is still based on centralized structures like register files. To create a 16-issue EPIC processor, you'd need a ~32R/16W port register file which would be virtually impossible to build because it would be so huge and power-hungry. Also, EPIC needs heroic compiler optimizations to overcome its in-order execution, while EDGE is naturally out-of-order.
  • by coldmist ( 154493 ) on Tuesday April 24, 2007 @06:15PM (#18862577) Homepage

    Here is the slashdot article from 2003 about this processor: link [slashdot.org]

    The specs have been updated to 1024 from 512, but that's about it.

    Another 3-5 years out?

  • by knowsalot ( 810875 ) on Tuesday April 24, 2007 @06:24PM (#18862693)
    The big thing that the commenters I've read so far have missed is that OOO execution is difficult not because it's hard to put many ALUs on a chip (vector design, anyone?) but because in a general-purpose processor the register file and routing complexity grow as N^2 in the number of units. That's bad. Every unit has to communicate with every other unit (via the register file or, more commonly, via bypasses to an OOO buffer at every stage prior to writeback).

    The issue being addressed here is wiring complexity, which, as modern designers would tell you, is a much harder problem than designing fast logic. Routing is hard. Plunking down more ALUs is easy. If you eliminate the register file, and design your processor and ISA to feed instructions in a dataflow manner to thousands of ALUs, then you may be able to vastly simplify the routing requirements, thereby shortening your critical-path electrical circuits and allowing the processor to clock faster. (Dataflow execution means executing instructions when their data inputs are ready, rather than following the compiler-optimized order, which lacks the run-time information the hardware has.)

    If you are clever about your compiler, and make your hardware wide enough, you can for example speculatively execute both sides of a branch until it is resolved, eliminating a certain percentage of pipeline stalls from branch mispredicts. Similarly, with data prediction you can speculate through cache misses. The list goes on. This is a very new and different paradigm (ugly word) for CPUs which may lead to higher IPC. It isn't a single golden goose, but it is a very different way of looking at the problem of pushing more instructions through a processor at higher speeds.
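
    To put rough numbers on the N^2 claim (my own back-of-the-envelope, not from the article): with full bypassing, every unit must be able to forward to every other unit, so the path count grows quadratically.

        #include <stdio.h>

        /* A full bypass network among N execution units needs roughly
           N*(N-1) point-to-point forwarding paths. */
        int main(void) {
            for (int n = 4; n <= 32; n *= 2)
                printf("%2d units -> %4d bypass paths\n", n, n * (n - 1));
            return 0;   /* 4 -> 12, 8 -> 56, 16 -> 240, 32 -> 992 */
        }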

  • Don't dismiss it (Score:5, Informative)

    by er824 ( 307205 ) on Tuesday April 24, 2007 @06:31PM (#18862757)
    I apologize if I butcher some of the details, but I highly recommend that anyone interested peruse the TRIPS website.

    http://www.cs.utexas.edu/~trips/ [utexas.edu]

    They have several papers available that lay out the rationale for the architecture.

    The designers of this architecture believed that conventional architectures were going to run into physical limitations that would prevent them from scaling further. One of the issues they foresaw was that as feature sizes continued to shrink and die sizes continued to grow, chips would become susceptible to, and ultimately constrained by, wire delay: the time it takes to send a signal from one part of a chip to another would bound performance. To some extent, the shift in focus to multi-core CPUs validates their beliefs.

    To address the wire-delay problem, the architecture tries to limit the length of signal paths through the CPU by having instructions send their results directly to their dependent instructions instead of going through intermediate architectural registers. TRIPS is similar to VLIW in that many small operations are grouped into larger instructions (blocks) by the compiler. However, it differs in how the operations within a block are scheduled.

    TRIPS does not depend on the compiler to schedule the operations making up a block the way a VLIW architecture does. Instead, the TRIPS compiler maps the individual operations of a large TRIPS instruction block onto a grid of execution units. Each execution unit in the grid has several reservation stations, effectively forming a three-dimensional execution substrate.

    By having the compiler assign data-dependent instructions to execution units that are physically close to one another, communication overhead on the chip can be reduced. Each operation waits for its operands to arrive at its assigned execution unit; once all of an operation's dependencies are available, the operation fires and its result is forwarded to any waiting instructions. In this way the operations making up a TRIPS block are dynamically scheduled according to the data flow of the block, and the amount of communication that has to cross large distances is limited. Once an entire block has executed, it can be retired and its results written to registers or memory. (See the sketch below for why placement matters.)
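
    A crude sketch of the placement idea (my own illustration, not the TRIPS scheduler): if forwarding cost scales with Manhattan distance between tiles, placing a dependence chain on adjacent tiles cuts the hop count dramatically.

        #include <stdio.h>
        #include <stdlib.h>

        /* A small dependence chain: producer -> consumer edges */
        static const int edge[3][2] = { {0,1}, {1,2}, {2,3} };

        /* Cost of a placement = sum of Manhattan distances between each
           producer's tile and its consumer's tile on the grid. */
        static int cost(const int x[], const int y[]) {
            int total = 0;
            for (int e = 0; e < 3; e++) {
                int p = edge[e][0], c = edge[e][1];
                total += abs(x[p] - x[c]) + abs(y[p] - y[c]);
            }
            return total;
        }

        int main(void) {
            /* the same four operations scattered across a 4x4 grid,
               versus chained onto adjacent tiles */
            int sx[] = {0, 3, 0, 3}, sy[] = {0, 3, 3, 0};
            int cx[] = {0, 1, 1, 2}, cy[] = {0, 0, 1, 1};
            printf("scattered placement: %d hops\n", cost(sx, sy));   /* 15 */
            printf("adjacent placement:  %d hops\n", cost(cx, cy));   /*  3 */
            return 0;
        }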

    At the block level, a TRIPS processor can still function much like a conventional processor: blocks can be executed out of order, speculatively, or in parallel. TRIPS is also defined as a polymorphous architecture, meaning the configuration and execution dynamics can be changed to best exploit the available parallelism. If code is highly parallelizable, it might make sense to map bigger blocks. By performing these operations at the level of a block instead of for each individual instruction, the overhead is, in theory, drastically reduced.

    There is some flexibility in how the hardware can be utilized. For software with a high degree of parallelism you may want very large blocks; when there is less data-level parallelism available, it may be better to schedule multiple blocks onto the substrate simultaneously. I'm not sure how the prototype is implemented, but the designers have several papers discussing how a TRIPS-style architecture can be adapted to perform well on a wide gamut of software.

  • Re:Hm... (Score:5, Informative)

    by frank_adrian314159 ( 469671 ) on Tuesday April 24, 2007 @07:11PM (#18863145) Homepage
    No, here [utexas.edu] are [utexas.edu] the [utexas.edu] real [utexas.edu] articles [utexas.edu] on [utexas.edu] TRIPS [utexas.edu]. These and many others can be found here [utexas.edu].
  • Re:Hm... (Score:4, Informative)

    by SiJockey ( 702903 ) on Tuesday April 24, 2007 @11:47PM (#18865553) Homepage
    The big difference in TRIPS is that stuff flying around out in memory can be squashed easily. The machine has aggressive branch prediction, efficient predication support in the ISA, and data dependence prediction. So, the 1024 instructions don't need to be long vectors streaming from memory. Squashing a mispredicted branch and restarting down the right path takes on the order of 10-20 machine cycles. Thanks for your comments and interest. -DB
  • Re:Marketing hype? (Score:4, Informative)

    by SiJockey ( 702903 ) on Tuesday April 24, 2007 @11:56PM (#18865625) Homepage
    Actually, there is much more parallelism (more than 4 ops/cycle) available in many of these applications, but you correctly observe that many ancillary effects (branch mispredictions, cache misses, etc.) chip away at the achieved parallelism. The TRIPS ISA and microarchitecture (which is, as you correctly point out, a variant of an OOO "superscalar" processor) has numerous features to mitigate many of these effects: up to 64 outstanding cache misses from the 1,024-entry window, aggressive predication to eliminate many branches, a memory dependence predictor, and direct ALU-to-ALU communication to make data dependences more efficient. The most important difference is in the ISA, which allows the compiler to express dataflow graphs directly to the hardware; this will work best (compared to convention) in ultra-small technologies where the wires are quite slow. To get a similar dependence graph in a RISC or CISC ISA, a superscalar processor must reconstruct it on the fly, instruction by instruction, using register renaming and issue-window tag broadcasting. Thanks for reading.
