Startup Combines CPU and DRAM 211
MojoKid writes "CPU design firm Venray Technology announced a new product design this week that it claims can deliver enormous performance benefits by combining CPU and DRAM onto a single piece of silicon. Venray's TOMI (Thread Optimized Multiprocessor) attempts to redefine the problem by building a very different type of microprocessor. The TOMI Borealis is built using the same transistor structures as conventional DRAM; the chip trades clock speed and performance for ultra-low leakage. Its design is, by necessity, extremely simple. Not counting the cache, TOMI is a 22,000-transistor design. Instead of surrounding a CPU core with L2 and L3 cache, Venray inserted a CPU core directly into a DRAM design. Each TOMI Borealis IC connects eight TOMI cores to a 1Gbit DRAM, with a total of 16 ICs per 2GB DIMM. This works out to a total of 128 processor cores per DIMM. That said, when your CPU has fewer transistors than an architecture that debuted in 1986, there is a good chance that you left a few things out--like an FPU, branch prediction, pipelining, or any form of speculative execution. Venray may have created a chip with power consumption an order of magnitude lower than anything ARM builds and more memory bandwidth than Intel's highest-end Xeons, but it's an ultra-specialized, ultra-lightweight core that trades 25 years of flexibility and performance for scads of memory bandwidth."
Map Reduce? (Score:4, Interesting)
Processing In Memory (Score:5, Interesting)
This isn't new. The MIT Terasys platform did the same in 1995, and many have since. Nobody has yet come up with a viable programming model for such processors.
I'm expecting AMD's Fusion platform to move in the same direction (interleaved memory and shader banks), and they already have a usable MIMD model (basically OpenCL).
Just a first step... (Score:4, Interesting)
Really, this was inevitable, and this first implementation is just a first step. Future versions will undoubtedly include more functionality.
Current processors are ridiculously complicated. If you can knock out the entire cache with all of its logic, give the processor direct access to memory, and stick to a RISC design, you can get a very nice processor in under a million transistors.
Why not a hexagonal design? (Score:5, Interesting)
Speaking of unconventional design, why don't we see hexagonal or triangular CPU-designs? All I have seen are the Manhattan-like designs. Are these really the best? Embedding the CPU inside a hexagonal/triangular DRAM design should be possible too. What would be the trade-offs?
Re:performance vs. memory bandwidth (Score:3, Interesting)
And how much performance per clock are you going to get out of a 22,000 transistor chip, with what looks like 3 registers (and 3 shadow registers)?
One of the issues they had to deal with was that DRAM is usually made on a 3 metal layer process, whereas CPUs usually take a lot more layers due to their complexity.
This will have to compete with TSV connected DRAM, which will be a major bandwidth and power aid to conventional SoCs.
Their money isn't old enough. (Score:2, Interesting)
They're innovating?
T-minus two days until they've been hit with 13 different patent lawsuits by companies that don't even produce anything similar.
Sorry about your luck!
Re:Either/or? (Score:3, Interesting)
Like Mitsubishi 3D RAM [thefreelibrary.com]
They put the logic ops and blend on the RAM
The 3D-RAM is based on the Mitsubishi Cache DRAM (CDRAM) architecture, which integrates DRAM memory and an SRAM cache on a single chip. The CDRAM was then optimized for 3-D graphics rendering and further enhanced by adding an on-chip arithmetic logic unit (ALU) and video buffer.
All on one chip (Score:4, Interesting)
I'm just wondering, and maybe it exists already, but why not make everything on one chip? The CPU, memory, GPU, etc.? Most people don't mess with the insides of their computer, and I'm guessing it would speed up the computer as a whole. You wouldn't even need to make it high-performance. Just do an i3 core with the associated chipset (or equivalent), maybe 4GB of RAM, some connectivity (USB 2, DVI, SATA, Wi-Fi and 1000Base-T) and you have it all. The power savings should be huge, as everything internal should be low voltage. The die will be huge, but we are heading that way anyway.
Am I talking bollocks?
Re:All on one chip (Score:5, Interesting)
There are basically two problems:
1. The external connectivity -- SATA, USB, ethernet, etc. needs too much power to easily move or handle on a chip (and the radio stuff needs radio power). You can do the protocol work on the main chip if you like, but you'll need amplifiers, and possibly sensors off chip.
2. DRAM and CPUs are made in quite different processes, optimised for different purposes. Cache is memory made using CPU processes (so it's expensive and not very dense). These guys are trying to make CPUs using DRAM processes, which are slow.
Don't count this out yet (Score:5, Interesting)
Useless? My key question would be: does it have decent-speed integer multiply, and perhaps even divide, instructions? A whole heck of a lot can be achieved if you have, say, the basic instruction set of a 6809, but fast and wide (and it didn't even have a divide... so we built multiply-by-reciprocal macros to substitute; that works too.)
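The multiply-by-reciprocal trick can be sketched in a few lines of Python (my own illustration, not the original 6809 macros): to divide by a constant d on a machine with fast multiply but no divide, precompute a scaled reciprocal and replace the divide with a multiply and a shift.

```python
# Division by a constant via multiply-by-reciprocal, as used when a CPU
# has a fast multiply but no divide instruction (constants chosen here
# for dividing by 10; they are illustrative, not from any real ISA).
K = 20                      # shift amount
D = 10                      # divisor we want to replace
M = (1 << K) // D + 1       # precomputed "reciprocal": ceil(2^K / D) = 104858

def div10(x):
    """x // 10 without a divide, exact for all 0 <= x < 2**18."""
    return (x * M) >> K

print(div10(65535))         # 6553, same as 65535 // 10
```

The error analysis is short: (x*M)>>K equals x//10 as long as the rounding-up error 4x/2^20 stays under 1/10, which holds for every x below 2^18, comfortably covering 16-bit operands.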
I know everyone's used to having FP right at hand, but I'm telling you, fast integer code and table tricks can cover a lot more bases than one might initially think. A lot of my high performance stuff -- which is primarily image processing and software defined radio -- is currently limited considerably more by how fast I can move data in and out of main memory than it is by actually needing FP operations. On a dual 4-core machine, I can saturate the memory bus without half trying with code that would otherwise be considerably more efficient, if it could actually get to the memory when it needs to.
Another thing... when you're coding in C, for instance, the various FP ops can just as easily be buried in a library, and then who cares why or how they get done, as long as they are? With lots-o'-RAM, you can write whatever you need to, and it'd be the same code you'd write for another platform. Just mostly faster, because for many things FP just isn't required, or critical. Fixed point isn't very hard to build either and can cover a wide range of needs (and then there's BCD code... better than FP for accounting, for instance.)
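To show how little machinery fixed point needs, here is a minimal Q16.16 sketch in Python (the 16/16 split and the helper names are my own choices, not anything from the article):

```python
# Minimal Q16.16 fixed-point arithmetic: 16 integer bits, 16 fraction bits,
# all done with the integer ops an FPU-less core would have.
FRAC = 16
ONE = 1 << FRAC              # the value 1.0 in Q16.16

def to_fix(x):
    """Convert a float/int to Q16.16."""
    return int(round(x * ONE))

def fix_mul(a, b):
    """Multiply two Q16.16 values; the shift drops the extra fraction bits."""
    return (a * b) >> FRAC

def to_float(a):
    """Convert back to float, for display only."""
    return a / ONE

a = to_fix(3.25)
b = to_fix(2.0)
print(to_float(fix_mul(a, b)))   # 6.5
```

Addition and subtraction in this format are just the plain integer ops, which is exactly why the scheme maps so well onto a tiny integer-only core.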
Signed, old assembly language programmer guy who actually admits he likes asm...
Humm, IBM did it first. (Score:2, Interesting)
IBM has sold CPUs with DRAM onboard for quite a while; IBM developed it, patented it, and sells it as "eDRAM," aka "embedded DRAM."
IBM's POWER7 processor family uses eDRAM for its L3 cache, and eDRAM also shows up in the chips powering Sony's PlayStation 2 and PlayStation Portable, Nintendo's GameCube and Wii, and Microsoft's Xbox 360.
Maybe news articles should be checked to see if they are really news before posting?
Re:Why not a hexagonal design? (Score:2, Interesting)
Re:Don't count this out yet (Score:5, Interesting)
Exactly. The ARM2 didn't have FP, and people still wrote some extremely good stuff for it.
The Nintendo DS doesn't have an FPU on either CPU.
Re:Processing In Memory (Score:4, Interesting)
It was not American (NIH)
The best training manual for it was called "Algol68 with fewer tears"
Other than that, it was able to handle parallelism, and most everything else, in a relatively painless manner.
For those who actually LIKE pain, there is always Occam.
Re:Don't count this out yet (Score:5, Interesting)
Agreed. I'm working on a digital oscilloscope display system, and this thing might be very useful in that application -- where you need lots of bandwidth, but also plenty of storage. Say, zooming, filtering, and scaling of a one-second-long acquisition done at 2 GS/s, using a 12-bit digitizer. You tweak the knobs, it updates, all in real time. In the worst case, you need about 120 Gbytes/s of memory bandwidth to make it real time on a 30 FPS display. And that's assuming the filter coefficients don't take up any bandwidth, because if they do you've just upped the bandwidth to terabytes/s.
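The 120 Gbytes/s figure works out as follows, assuming each 12-bit sample is padded to 2 bytes of storage (the comment doesn't state the storage width explicitly):

```python
# Back-of-envelope for the oscilloscope example: reading the full one-second
# record once per displayed frame. Assumes 2 bytes stored per 12-bit sample.
sample_rate = 2e9            # 2 GS/s digitizer
record_seconds = 1.0         # one-second acquisition
bytes_per_sample = 2         # 12-bit sample padded to 16 bits
fps = 30                     # display refresh rate

record_bytes = sample_rate * record_seconds * bytes_per_sample   # 4e9 bytes
bandwidth = record_bytes * fps                                   # bytes/second
print(bandwidth / 1e9)       # 120.0 (GB/s)
```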
Re:performance vs. memory bandwidth (Score:4, Interesting)
And how much performance per clock are you going to get out of a 22,000 transistor chip, with what looks like 3 registers (and 3 shadow registers)?
Quite a lot, I would guess. A stack-based design would give you one instruction per cycle with a compact opcode format capable of packing multiple instructions into a single machine word, which means a single instruction fetch covers multiple executed instructions. Oh, and make it word-addressed; that simplifies things a bit as well. In the end, you'll have a core that runs at perhaps 50%-100% as many clock cycles per second on a given manufacturing technology level (say, 60 nm), with just a single thread of execution, but with a negligible transistor budget and power consumption. The resulting effective computational performance per energy consumed will be at least an order of magnitude better than the current offerings from Intel and AMD, although you first have to learn how to program it.
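A toy version of that packing scheme (entirely my own construction, not Venray's ISA): four 8-bit stack opcodes packed into one 32-bit word, so one fetch feeds four executed instructions against a small operand stack.

```python
# Toy stack machine: one 32-bit "machine word" holds four 8-bit opcode
# slots, so a single instruction fetch drives four execution steps.
# Opcode numbering is invented for this sketch; slot value 0 acts as a NOP.
PUSH1, PUSH2, ADD, MUL = 1, 2, 3, 4

def pack(*ops):
    """Pack up to four opcodes into one 32-bit word, first op in the low byte."""
    word = 0
    for i, op in enumerate(ops):
        word |= op << (8 * i)
    return word

def run(word):
    """Execute all four opcode slots of one packed word; return the stack."""
    stack = []
    for _ in range(4):
        op = word & 0xFF
        word >>= 8
        if op == PUSH1:
            stack.append(1)
        elif op == PUSH2:
            stack.append(2)
        elif op == ADD:
            stack.append(stack.pop() + stack.pop())
        elif op == MUL:
            stack.append(stack.pop() * stack.pop())
    return stack

# 1 + 2, computed from a single fetched word (fourth slot is a NOP)
print(run(pack(PUSH1, PUSH2, ADD)))   # [3]
```

Because operands live implicitly on the stack, no bits are spent on register fields, which is what keeps the opcodes small enough to pack several per word.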
Re:Don't count this out yet (Score:2, Interesting)
The DEC Alpha didn't have a floating-point unit: it had primitives that could be used to emulate floating point almost as quickly as having a dedicated FPU. I spoke to a friend several years back who knew about these things: he said that a one's-complement add goes a long way towards speeding up floating point using integer operations. I have _no_ idea what he meant, but it was kinda interesting to hear from someone who'd really thought about this stuff.