
Startup Combines CPU and DRAM

MojoKid writes "CPU design firm Venray Technology announced a new product design this week that it claims can deliver enormous performance benefits by combining CPU and DRAM onto a single piece of silicon. Venray's TOMI (Thread Optimized Multiprocessor) attempts to redefine the problem by building a very different type of microprocessor. The TOMI Borealis is built using the same transistor structures as conventional DRAM; the chip trades clock speed and performance for ultra-low leakage. Its design is, by necessity, extremely simple. Not counting the cache, TOMI is a 22,000 transistor design. Instead of surrounding a CPU core with L2 and L3 cache, Venray inserted a CPU core directly into a DRAM design. Each TOMI Borealis IC connects eight TOMI cores to a 1Gbit DRAM, with 16 ICs per 2GB DIMM, which works out to 128 processor cores per DIMM. That said, when your CPU has fewer transistors than an architecture that debuted in 1986, there is a good chance that you left a few things out, like an FPU, branch prediction, pipelining, or any form of speculative execution. Venray may have created a chip with power consumption an order of magnitude lower than anything ARM builds and more memory bandwidth than Intel's highest-end Xeons, but it's an ultra-specialized, ultra-lightweight core that trades 25 years of flexibility and performance for scads of memory bandwidth."
  • synthesis (Score:5, Informative)

    by lkcl ( 517947 ) <lkcl@lkcl.net> on Monday January 23, 2012 @07:43AM (#38789851) Homepage

    there's a problem with doing designs like this. the tooling for CPUs is very, very specific: 28nm, 32nm, 45nm - the companies that do the simulations, the ones like mentor that charge something like USD $250,000 per week to license their tools, have written those tools SPECIFICALLY for those geometries.

    if you wander randomly outside of those geometries you are either on your own or you are into some unbelievably high development costs.

    why is this relevant?

    it's because the DRAM manufacturers do *not* stick to the well-known geometries: they vary the geometry to get the absolute best performance, which they can do because the cell layout is identical across a DRAM IC. and, because those cells _are_ identical, the verification process is much simpler than what a complex CPU requires.

    in other words, this company is trying to mix and match two wildly different approaches. what they're doing is either incredibly expensive or sub-optimal. which raises the question: what's it _for_?

  • by lkcl ( 517947 ) <lkcl@lkcl.net> on Monday January 23, 2012 @07:52AM (#38789901) Homepage

    the cache is there because DRAM, regardless of how fast you can communicate with it, still has latency problems on addressing.

    to do the "routing" to address a 4-bit bus, you need 1/2 the number of transistors than if you addressed a 2-bit bus. each time you add another bit to the address range, you have increased the latency of access.

    if you were to provide entirely random access to an entire 32-bit range you would absolutely kill performance. so what RAM IC designers do is say "ok, you're not going to get 32-bit addressing, you're going to get 14-bit addressing, you're going to have to read an entire page of 1k or 2kbits, and you're going to have to have parallel ICs - the first IC supplies bits 0 and 1 of the data, the second IC bits 2 and 3, and so on."

    this relies on the design of the processor having a VM architecture - paging.
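
    to make that concrete, here is a small C sketch of how a flat physical address gets split into bank/row/column fields, with the data bits striped across parallel ICs. the field widths are invented for illustration (only the 14-bit row address and the 1-2kbit page come from the description above); real parts differ.

    /* illustrative only: field widths below are made up for the example */
    #include <stdint.h>
    #include <stdio.h>

    #define COL_BITS  10   /* 1K columns x 2-bit-wide part = a 2kbit "page" read at once */
    #define ROW_BITS  14   /* the "14-bit addressing" mentioned above                     */
    #define BANK_BITS  3

    typedef struct { uint32_t col, row, bank; } dram_addr_t;

    static dram_addr_t split_address(uint32_t phys)
    {
        dram_addr_t a;
        a.col  =  phys                           & ((1u << COL_BITS)  - 1);
        a.row  = (phys >> COL_BITS)              & ((1u << ROW_BITS)  - 1);
        a.bank = (phys >> (COL_BITS + ROW_BITS)) & ((1u << BANK_BITS) - 1);
        return a;
    }

    int main(void)
    {
        dram_addr_t a = split_address(0x01ABCDEFu);
        printf("bank %u, row %u, col %u\n",
               (unsigned)a.bank, (unsigned)a.row, (unsigned)a.col);
        /* with x2-wide parts, IC 0 supplies data bits 0-1, IC 1 bits 2-3, and so on,
           so a full word is assembled from several ICs read in parallel */
        return 0;
    }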

    but the same principle applies *inside* the processor: even just decoding the addressing in the MMU, there's *still* too much latency involved.

    so this is why you end up with hierarchical caching - the 1st level is tiny, the 2nd level is huge.

    even with RISC designs you _still_ have to have 1st and 2nd level caches in order to remain competitive. if you've ever seen a picture of a RISC CPU, it's astounding: the actual CPU is like 1% of the total area; caches are huuuge by comparison, crossbar routing takes up 50% of the chip and the I/O pads, required to be massive in order to handle the current, can take up something like 5% of the chip (guessing here, it's been a while since i looked at an annotated example CPU).
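
    if you want to *see* that hierarchy from software, the classic trick is a pointer chase: walk a random cycle through a buffer and watch the time per hop jump as the working set falls out of L1, then L2, then the last-level cache into DRAM. a rough, uncalibrated sketch in C (the buffer sizes and hop count here are arbitrary):

    #define _POSIX_C_SOURCE 200809L
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* chase pointers through a single random cycle of n_elems slots and return
       the average time per hop in nanoseconds. a random cycle (Sattolo's
       algorithm) defeats the hardware prefetchers, so each hop pays the full
       latency of whichever level of the hierarchy the working set fits in. */
    static double chase_ns(size_t n_elems, size_t hops)
    {
        size_t *next = malloc(n_elems * sizeof *next);
        if (!next) return -1.0;
        for (size_t i = 0; i < n_elems; i++)
            next[i] = i;
        for (size_t i = n_elems - 1; i > 0; i--) {   /* Sattolo shuffle */
            size_t j = (size_t)rand() % i;
            size_t t = next[i]; next[i] = next[j]; next[j] = t;
        }

        size_t p = 0;
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t h = 0; h < hops; h++)
            p = next[p];
        clock_gettime(CLOCK_MONOTONIC, &t1);

        volatile size_t sink = p;                    /* keep the loop alive */
        (void)sink;
        free(next);
        return ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / hops;
    }

    int main(void)
    {
        /* working sets from "fits in L1" to "spills into DRAM" */
        size_t bytes[] = { 16u << 10, 128u << 10, 2u << 20, 32u << 20, 256u << 20 };
        for (int i = 0; i < 5; i++) {
            size_t n = bytes[i] / sizeof(size_t);
            printf("%9zu KiB working set: %6.1f ns/hop\n",
                   bytes[i] >> 10, chase_ns(n, 20u * 1000 * 1000));
        }
        return 0;
    }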

  • by Theovon ( 109752 ) on Monday January 23, 2012 @10:12AM (#38790875)

    My research area is computer architecture.

    This idea of moving compute into the RAM has been around for a long time. Papers have proposed everything from adding simple ALUs to the DRAMs to fully functional microprocessors. Most assume that these are "accelerators" for common vector operations and such, while the heavy lifting is done by beefier cores, but the idea of doing all the compute embedded in a DRAM has been proposed and evaluated before.

    One thing we've learned in the past few decades is that modern processors are limited by memory latency and bandwidth. A Sun engineer (talking about Rock) pointed out that a modern out-of-order processor is essentially running a race between last-level cache misses. When you have to go out to DRAM, the CPU instruction window fills up with as much work as it can before it completely stalls, because eventually everything left is dependent on that one miss. When the data finally arrives, the CPU blasts through that work really fast, and then soon stalls again on another miss. OOO processors mitigate this (somewhat) with the instruction window, while Rock attacked it with speculative execution. One of the reasons for Sandy Bridge's excellent performance is the very large instruction window that can absorb more of the LLC miss stall time.
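
    As a back-of-envelope illustration of that race (the numbers are mine, not from Sun or Intel): assume a core that issues a fixed number of instructions per cycle, an LLC miss with a fixed cycle cost, one miss every so many instructions, and a window that can hold a bounded amount of work to overlap with an outstanding miss. The only point is that a bigger window hides more of the miss, and with zero miss latency the window stops mattering.

    #include <stdio.h>

    /* toy model: all parameters are illustrative, not measured */
    static double effective_ipc(double width, double window,
                                double miss_latency, double insts_per_miss)
    {
        double busy_cycles  = insts_per_miss / width;  /* cycles of useful issue per miss interval   */
        double hidden       = window / width;          /* miss cycles the window can cover with work */
        double stall_cycles = miss_latency - hidden;
        if (stall_cycles < 0)
            stall_cycles = 0;
        return insts_per_miss / (busy_cycles + stall_cycles);
    }

    int main(void)
    {
        /* 4-wide core, 200-cycle DRAM miss, one miss per 500 instructions - all invented */
        double windows[] = { 32, 64, 128, 192, 256 };
        for (int i = 0; i < 5; i++)
            printf("window %3.0f -> ~%.2f IPC\n",
                   windows[i], effective_ipc(4, windows[i], 200, 500));
        /* with the miss latency gone (compute living next to the DRAM), the window
           stops mattering and the core runs at its issue width */
        printf("no-miss ceiling: ~%.2f IPC\n", effective_ipc(4, 32, 0, 500));
        return 0;
    }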

    And so, although they have other advantages, OOO processors dedicate a huge amount of logic just to dealing with cache miss latency. If there were no such latency, they could get the same performance with a hell of a lot less hardware. Although I haven't seen the figures, my suspicion is that for general computation, TOMI will blow the doors off of whatever else we've got in both performance AND energy efficiency. Only when you have a specialized compute kernel whose working data fits in the cache does something like Sandy Bridge have the comparative advantage. (I realize that's an overly strong statement, because lots of general purpose workloads have good locality, but nevertheless main memory is a major bottleneck for most workloads.)
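
    To make the cache-resident vs. bandwidth-bound distinction concrete, here is a toy C example (array sizes invented): both functions do the same number of multiply-adds, but the first streams once through a DRAM-sized array and is limited by memory bandwidth, while the second re-runs over a buffer small enough to stay in cache and is limited by the core itself.

    #include <stdlib.h>

    #define BIG   (16u * 1024 * 1024)   /* 16M floats (~64 MB each): lives in DRAM      */
    #define SMALL (8u * 1024)           /* 8K floats (~32 KB): fits comfortably in cache */

    /* one pass over a DRAM-sized array: performance set by memory bandwidth */
    static void triad_streaming(float *a, const float *b, const float *c)
    {
        for (size_t i = 0; i < BIG; i++)
            a[i] = b[i] + 2.0f * c[i];
    }

    /* same flop count, re-run over a cache-resident slice: performance set by the core */
    static void triad_blocked(float *a, const float *b, const float *c)
    {
        for (size_t rep = 0; rep < BIG / SMALL; rep++)
            for (size_t i = 0; i < SMALL; i++)
                a[i] = b[i] + 2.0f * c[i];
    }

    int main(void)
    {
        float *a = malloc(BIG * sizeof *a);
        float *b = malloc(BIG * sizeof *b);
        float *c = malloc(BIG * sizeof *c);
        if (!a || !b || !c) return 1;
        for (size_t i = 0; i < BIG; i++) { b[i] = 1.0f; c[i] = 2.0f; }

        triad_streaming(a, b, c);   /* time this one against... */
        triad_blocked(a, b, c);     /* ...this one to see the gap */

        free(a); free(b); free(c);
        return 0;
    }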
