
Startup Combines CPU and DRAM

MojoKid writes "CPU design firm Venray Technology announced a new product design this week that it claims can deliver enormous performance benefits by combining CPU and DRAM onto a single piece of silicon. Venray's TOMI (Thread Optimized Multiprocessor) attempts to redefine the problem by building a very different type of microprocessor. The TOMI Borealis is built using the same transistor structures as conventional DRAM; the chip trades clock speed and performance for ultra-low leakage. Its design is, by necessity, extremely simple. Not counting the cache, TOMI is a 22,000 transistor design. Instead of surrounding a CPU core with L2 and L3 cache, Venray inserted a CPU core directly into a DRAM design. A TOMI Borealis part connects eight TOMI cores to a 1Gbit DRAM, with a total of 16 ICs per 2GB DIMM. This works out to a total of 128 processor cores per DIMM. That said, when your CPU has fewer transistors than an architecture that debuted in 1986, there is a good chance that you left a few things out--like an FPU, branch prediction, pipelining, or any form of speculative execution. Venray may have created a chip with power consumption an order of magnitude lower than anything ARM builds and more memory bandwidth than Intel's highest-end Xeons, but it's an ultra-specialized, ultra-lightweight core that trades 25 years of flexibility and performance for scads of memory bandwidth."
This discussion has been archived. No new comments can be posted.


  • Either/or? (Score:5, Insightful)

    by Gwala ( 309968 ) <adam@gwala.ELIOTnet minus poet> on Monday January 23, 2012 @06:35AM (#38789535) Homepage

    Does it have to be an either-or suggestion?

    I could see this being useful as an accelerator - in the same way that GPUs can accelerate vector operations. E.g. memory that can calculate a hash table index by itself. Stuffed in as a component of a larger system, it could be a really clever breakthrough for incremental performance improvements.

  • by Anonymous Coward on Monday January 23, 2012 @06:40AM (#38789557)

    > "that trades 25 years of flexibility and performance for scads of memory bandwidth"

    Right... because memory bandwidth isn't one of the greatest bottlenecks in current designs...

  • by ByOhTek ( 1181381 ) on Monday January 23, 2012 @07:57AM (#38789927) Journal

    pfft. floating point sucks anyway.

    typedef struct FRACTION_STRUCT
    {   // value = numerator / denominator * 10^exponent
        int numerator;
        unsigned int denominator;
        int exponent;
    } Fraction;

  • Re:Map Reduce? (Score:5, Insightful)

    by lkcl ( 517947 ) <lkcl@lkcl.net> on Monday January 23, 2012 @08:09AM (#38789979) Homepage

    Aspex Semiconductors took this a lot further. they did content-addressable-memory. ok, they did a hell of a lot more than that. they created a massively-parallel deep SIMD architecture with a 2-bit CPU (early versions were 1 bit), with each CPU having something like 256 bits of memory to play with. ok, early versions had 128-bits of "straight" RAM and 256 bits of content-addressable RAM. when i was working for them they were planning the VASP-G architecture which would have 65536 such 2-bit CPUs on a single die. it was the 10th largest CPU being designed, in the world, at the time.

    programming such CPUs was - is - a complete f*****g nightmare. you not only have the parallelism of the CPU to deal with but you have the I/O handling to deal with. do you try to fit the data 1-bit-wide per CPU and process it serially? or... do you try to fit the data across 32 CPUs and process it in parallel? (each CPU was connected to its 2 neighbours so you could do this sort of thing). or... do you do anything in between, because if you have only 1-bit-wide that means that the I/O is held up, but if you do 32-bits across 32 CPUs you process it so quick that you're now I/O bound.

    much of the work in fitting algorithms onto ASPs involved having to write bloody spreadsheets in Excel to analyse whether it was best to use 1, 2, 4 .... 32 CPUs just to process the bloody data! 6 weeks of analysis to write 30 lines of code for god's sake!

    it gets worse: you can't even go read a book on algorithms for hardware because that doesn't apply; you can't go read a book on algorithms for software because _that_ doesn't apply. working out how to fit AES onto the Aspex Semi CPU took me about... i think it was 6 weeks, to even _remotely_ make it optimal. i had to read up on the design of the 2-bit Galois Field theory behind the S-Boxes, because although you could do 8-bit S-Box substitution by running 256 "compare" instructions, one per substitution, in parallel across all 4096 CPUs, it turned out that if you actually implemented the *original* 2-bit Galois Field mathematical operations in each of the 2-bit CPUs you could get it down to 40 instructions, not 256.

    and that was just for _one_ part of the Rijndael algorithm: i had to do a comprehensive detailed analysis of _every_ aspect of the algorithm.

    in other words, everything that you _think_ you know about optimising software and algorithm design for either hardware or for software is completely and utterly wrong, for these types of massively-parallel map-reduce and content-addressable-memory CPUs.

    that leaves them somewhere in the very very specialist dept, and even there, they have problems, because it takes so long to verify and design a new CPU. when the Aspex VASP-F architecture was being planned, it was AMAZING! wow! 100x faster than the best Pentium-III processor! of course, within 18 months it was only 20x better than the top-of-the-line Pentium that was available, and by the time it _actually_ came out, it was only 5x better than a bunch of x86 CPUs, which are a hell of a lot easier to program.

    it was the same story for the next version of the CPU, even though that promised to have 64k processing elements...

  • by Anonymous Coward on Monday January 23, 2012 @08:45AM (#38790165)

    The idea is not new; plenty of products with a CPU on the RAM die already exist. Sun did this on graphics cards, for example.
    The missing FPU is not a great loss, since floating point can be emulated with integer ops if needed - but follow-up products should get an FPU just to stop the rants.
    The dirty secret of the computer industry is that the CPU has to wait "lots" of cycles for a round trip to memory, since CPUs are clocked much higher than RAM and there are other chips in between...
    RAM can still be added at the system-design level here - you simply get more CPUs along with it ;-)
    I guess market acceptance is always a matter of integration effort, so it should run Linux and have broad I/O chipset support.
    Well done - hope it succeeds.

  • by tibit ( 1762298 ) on Monday January 23, 2012 @08:57AM (#38790249)

    Perfect for networking -- switching, routing, ... Think of content addressable memory, etc.
