Startup Combines CPU and DRAM
MojoKid writes "CPU design firm Venray Technology announced a new product design this week that it claims can deliver enormous performance benefits by combining CPU and DRAM onto a single piece of silicon. Venray's TOMI (Thread Optimized Multiprocessor) attempts to redefine the problem by building a very different type of microprocessor. The TOMI Borealis is built using the same transistor structures as conventional DRAM; the chip trades clock speed and performance for ultra-low leakage. Its design is, by necessity, extremely simple. Not counting the cache, TOMI is a 22,000-transistor design. Instead of surrounding a CPU core with L2 and L3 cache, Venray inserted a CPU core directly into a DRAM design. A TOMI Borealis chip connects eight TOMI cores to a 1Gbit DRAM, with a total of 16 ICs per 2GB DIMM. This works out to a total of 128 processor cores per DIMM. That said, when your CPU has fewer transistors than an architecture that debuted in 1986, there is a good chance that you left a few things out--like an FPU, branch prediction, pipelining, or any form of speculative execution. Venray may have created a chip with power consumption an order of magnitude lower than anything ARM builds and more memory bandwidth than Intel's highest-end Xeons, but it's an ultra-specialized, ultra-lightweight core that trades 25 years of flexibility and performance for scads of memory bandwidth."
but... (Score:5, Funny)
does it run GNU/Linux?
Re:but... (Score:5, Funny)
Don't count this out yet (Score:5, Interesting)
Useless? My key question would be: does it have decent-speed integer multiply and perhaps even divide instructions? A whole heck of a lot can be achieved if you have, say, the basic instruction set of a 6809, but fast and wide (and it didn't even have a divide... so we built multiply-by-reciprocal macros to substitute; that works too.)
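For illustration, a minimal sketch of the reciprocal trick (not the exact macros we used back then -- dividing by a known constant becomes a multiply plus a shift):

/* Hypothetical example: divide by 10 on a CPU with multiply but no divide.
 * Precompute ceil(2^19 / 10) = 52429; then x/10 == (x * 52429) >> 19
 * exactly, for all 16-bit unsigned x. */
#include <stdint.h>

static inline uint16_t div10(uint16_t x)
{
    return (uint16_t)(((uint32_t)x * 52429u) >> 19);
}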
I know everyone's used to having FP right at hand, but I'm telling you, fast integer code and table tricks can cover a lot more bases than one might initially think. A lot of my high performance stuff -- which is primarily image processing and software defined radio -- is currently limited considerably more by how fast I can move data in and out of main memory than it is by actually needing FP operations. On a dual 4-core machine, I can saturate the memory bus without half trying with code that would otherwise be considerably more efficient, if it could actually get to the memory when it needs to.
Another thing... when you're coding with C, for instance, the various FP ops can just as easily be buried in a library, and then who cares why or how they get done anyway, as long as they are? With lots-o-RAM, you can write whatever you need to and it'd be the same code you'd write for another platform. Just mostly faster, because for many things, FP just isn't required, or critical. Fixed point isn't very hard to build either and can cover a wide range of needs (and then there's BCD code... better than FP for accounting, for instance.)
Signed, old assembly language programmer guy who actually admits he likes asm...
Re:Don't count this out yet (Score:4, Insightful)
pfft. floating point sucks anyway.
typedef struct FRACTION_STRUCT   // value = (numerator / denominator) * 10^exponent
{
    int numerator;
    unsigned int denominator;    // must stay non-zero
    int exponent;                // decimal scaling factor
} Fraction;
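For instance, a minimal sketch of multiplying two of these (illustrative only -- ignores overflow and normalisation):

Fraction frac_mul(Fraction a, Fraction b)
{
    Fraction r;
    r.numerator   = a.numerator * b.numerator;       /* overflow unchecked in this sketch */
    r.denominator = a.denominator * b.denominator;
    r.exponent    = a.exponent + b.exponent;
    return r;
}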
Re: (Score:3)
You're assuming a rational number there.
Wait. Hang on. Forget that I pointed that out... :P
Re: (Score:3)
it still manages a few cases classic floating point would miss. However, if you give me an infinitely wide register, I would happily work on something that would handle irrational numbers :-)
Re: (Score:2)
Since one way to define a real number is as an (equivalence class of) sequences of rationals, I suppose any object that (1) stores a rational number, and (2) can advance to a "next" rational number, could be called a real number. (That's so long as the sequences it generates are Cauchy.)
I guess I'm saying you just need to make all your evaluation really, really lazy, and you can work with arbitrary precision. :-)
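As a toy sketch of what I mean (purely illustrative, nothing rigorous): represent a "real" as a generator of successive rational approximations, e.g. Newton's iteration converging to sqrt(2).

/* A "real number" as a lazily advanced sequence of rationals.
 * Newton's step for sqrt(2): p/q  ->  (p*p + 2*q*q) / (2*p*q). */
#include <stdint.h>
#include <stdio.h>

typedef struct { int64_t p, q; } Rational;

static Rational sqrt2_next(Rational x)
{
    Rational r = { x.p * x.p + 2 * x.q * x.q, 2 * x.p * x.q };
    return r;   /* overflows quickly; a real version would use bignums */
}

int main(void)
{
    Rational x = { 1, 1 };                 /* first approximation: 1/1 */
    for (int i = 0; i < 4; i++) {
        x = sqrt2_next(x);                 /* 3/2, 17/12, 577/408, ... */
        printf("%lld/%lld\n", (long long)x.p, (long long)x.q);
    }
    return 0;
}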
Re: (Score:2)
Floating point is based on an integer base (rather than a fractional one).
Or maybe, I should say, floating point, as I have seen it implemented up to this point.
Re:Don't count this out yet (Score:5, Interesting)
Exactly. ARM2 didn't have FP, people still wrote some extremely good stuff for it.
Nintendo DS doesn't have an FPU on either CPU.
Re:Don't count this out yet (Score:5, Interesting)
Agreed. I'm working on a digital oscilloscope display system and that thing might be very useful in this application -- where you need lots of bandwidth, but also plenty of storage. Say, zooming, filtering, and scaling of a one-second-long acquisition done at 2 GS/s, using a 12-bit digitizer. You tweak the knobs, it updates, all in real time. In the worst case, you need about 120 GB/s of memory bandwidth to make it real time on a 30 FPS display. And that's assuming the filter coefficients don't take up any bandwidth, because if they do you've just upped the bandwidth to terabytes/s.
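Rough sanity check of that figure (just a sketch, assuming the 12-bit samples sit in 16-bit words and the whole record is re-read for every displayed frame):

#include <stdio.h>

int main(void)
{
    const double samples_per_acq   = 2e9;   /* 2 GS/s * 1 s acquisition */
    const double bytes_per_sample  = 2.0;   /* 12-bit sample padded to 16 bits */
    const double redraws_per_second = 30.0; /* full reprocess per displayed frame */
    double bytes_per_second = samples_per_acq * bytes_per_sample * redraws_per_second;
    printf("%.0f GB/s\n", bytes_per_second / 1e9);   /* prints 120 GB/s */
    return 0;
}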
Re:Don't count this out yet (Score:5, Funny)
Signed, old assembly language programmer guy
I see what you did there.
Re:Don't count this out yet (Score:5, Funny)
That's a bit shifty, don't you think? I don't mean to negate your point, but too, it's beyond my power to complement you -- I'm somewhat over a barrel. Perhaps if you add one to your argument, we'd have something else. Logically speaking. HCF.
Re: (Score:3)
I raise my carry to you, sir.
Re: (Score:2)
I agree with you, this is substantial. The ability to have not megabytes of "cache" but instead gigabytes, depending on how it's used, could be very, very substantial.
I could see this going many ways, including basically the equivalent of having a "socket type" for the RAM - drop in and upgrade as necessary. Even with the significant latency differences versus the various levels of on-die cache, this could be a big win.
Re: (Score:2, Interesting)
the DEC alpha didn't have a floating-point unit: it had primitives that could be used to emulate floating-point almost as quickly as having a dedicated FPU. i spoke to a friend several years back who knew about these things: he said that a 1s complement add goes a long way towards speeding up floating-point using integer operations. i have _no_ idea what he meant but it was kinda interesting to hear from someone who'd really thought about this stuff.
Re: (Score:3)
You must be thinking of some other processor. The first released Alpha silicon, Alpha 21064, had a pipelined FPU for adds/subtracts/multiplies and a non-pipelined floating-point divide unit.
Re: (Score:2)
I could see it as part of a database machine too. Somewhat limited by the amount of memory available on each chip, but an architecture similar to a Netezza appliance [theregister.co.uk] where the processing is pushed out closer to the data might be interesting.
Re: (Score:2)
for most applications you don't need an FPU, or floating point numbers at all.
After all, you just redefine where the decimal point is and you have perfect-accuracy "floats"; it's how you create a decimal type for currency arithmetic, and there's no reason why you can't use it for 3D graphics too.
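A minimal sketch of what I mean (illustrative only; positive amounts, no overflow checking): store currency as an integer number of cents and only place the point on output.

/* Hypothetical fixed-point money type: amounts held as signed cents. */
#include <stdio.h>
#include <stdint.h>

typedef int64_t money_cents;

static money_cents money_add(money_cents a, money_cents b) { return a + b; }

static void money_print(money_cents m)      /* positive amounts only in this sketch */
{
    printf("%lld.%02lld\n", (long long)(m / 100), (long long)(m % 100));
}

int main(void)
{
    money_cents price = 1999;               /* $19.99 */
    money_cents tax   = 160;                /* $1.60  */
    money_print(money_add(price, tax));     /* prints 21.59 */
    return 0;
}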
On a dual 4-core machine, I can saturate the memory bus without half trying with code
you code in .NET too huh? :)
Re: (Score:2)
Quite right. I do research on AI for embedded systems, specifically Integer Neural Networks (is it a shameless plug if I don't profit from it? creative commons book chapter. [intechopen.com]). By cutting out FP you can make all those low cost (sub $1) microcontrollers pretty powerful. Neural Networks cut out a lot of the processing by just making good guesses, then cutting out FP makes an implementation very light on resources.
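For a flavour of what that looks like, a rough sketch in the spirit of the chapter (not code from it): 8-bit weights and inputs, a 32-bit accumulator, and a shift in place of a floating-point scale.

/* Hypothetical integer-only neuron: 8-bit weights/inputs, 32-bit accumulate,
 * right shift instead of a floating-point scale, saturating ReLU output. */
#include <stdint.h>

static uint8_t int_neuron(const int8_t *w, const uint8_t *x, int n, int shift)
{
    int32_t acc = 0;
    for (int i = 0; i < n; i++)
        acc += (int32_t)w[i] * (int32_t)x[i];   /* multiply-accumulate */
    acc >>= shift;                              /* rescale */
    if (acc < 0)   acc = 0;                     /* ReLU */
    if (acc > 255) acc = 255;                   /* saturate to 8 bits */
    return (uint8_t)acc;
}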
Re: (Score:2)
This could be interesting for data storage. Recent articles about the NoSQL implementations at FB and other new media companies have indicated that 80+% of the data is actually stored in memory (with a write-behind to disk). This could end up being a specialized server product.
Re: (Score:2)
I can suggest an application off the top of my head: spatial configuration tables for experimental/reconfigurable robotic arms.
Re: (Score:2)
Not yet, but I'm sure it'll run a herd of HURD!
Either/or? (Score:5, Insightful)
Does it have to be an either-or suggestion?
I could see this being useful as an accelerator - in the same way that GPUs can accelerate vector operations. E.g. memory that can calculate a hash table index by itself. Stuffed in as a component of a larger system it could be a really clever breakthrough for incremental performance improvements.
Re: (Score:3, Interesting)
Like Mitsubishi 3D RAM [thefreelibrary.com]
They put the logic ops and blend on the RAM
The 3D-RAM is based on the Mitsubishi Cache DRAM (CDRAM) architecture, which integrates DRAM memory and SRAM cache on a single chip. The CDRAM was then optimized for 3-D graphics rendering and further enhanced by adding an on-chip arithmetic logic unit (ALU).
performance vs. memory bandwidth (Score:2, Insightful)
> "that trades 25 years of flexibility and performance for scads of memory bandwidth"
Right... because memory bandwidth isn't one of the greatest bottlenecks in current designs...
Re: (Score:3, Interesting)
And how much performance per clock are you going to get out of a 22,000 transistor chip, with what looks like 3 registers (and 3 shadow registers)?
One of the issues they had to deal with was that DRAM is usually made on a 3 metal layer process, whereas CPUs usually take a lot more layers due to their complexity.
This will have to compete with TSV connected DRAM, which will be a major bandwidth and power aid to conventional SoCs.
Re: (Score:2)
for a limited subset of tasks, very high performance.
If the chip can achieve either (a) higher clock speed or (b) fewer cycles for the same op, or even both - then there can easily be some operations that are faster. For tasks focused on those operations, the chip will be faster. The memory improvements won't hurt things either.
CPUs and GPUs are rarely the same speed and transistor count, but we use both. GPUs excel for floating point and rapid fire streams of the same op against an array of data. CPUs are
Re:performance vs. memory bandwidth (Score:4, Insightful)
Perfect for networking -- switching, routing, ... Think of content addressable memory, etc.
Re:performance vs. memory bandwidth (Score:4, Interesting)
And how much performance per clock are you going to get out of a 22,000 transistor chip, with what looks like 3 registers (and 3 shadow registers)?
Quite a lot, I would guess. A stack-based design would give you 1 instruction per cycle with a compact opcode format capable of packing multiple instructions into a single machine word, which means a single instruction fetch for multiple actual instructions executed. Oh, and make it word addressed; that simplifies things a bit as well. In the end, you'll have a core that does perhaps 50%-100% as many clock cycles per second on a given manufacturing technology level (say, 60 nm), with just a single thread of execution, but with a negligible transistor budget and power consumption. The resulting effective computational performance per energy consumed will be at least one OOM better than the current offerings by Intel and AMD, although you first have to learn how to program it.
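Something like this toy inner loop (purely a sketch, not Venray's actual ISA): with 8-bit opcodes, one 32-bit instruction fetch feeds four decode/execute steps.

/* Toy stack machine: four 8-bit opcodes packed into each 32-bit word,
 * so one memory fetch covers four executed instructions. */
#include <stdint.h>

enum { OP_NOP, OP_PUSH1, OP_ADD, OP_DUP, OP_DROP, OP_HALT };

void run(const uint32_t *code)
{
    int32_t stack[64];
    int sp = 0;
    for (;;) {
        uint32_t word = *code++;                  /* single instruction fetch */
        for (int slot = 0; slot < 4; slot++) {
            uint8_t op = (word >> (slot * 8)) & 0xff;
            switch (op) {
            case OP_NOP:   break;
            case OP_PUSH1: stack[sp++] = 1; break;
            case OP_ADD:   sp--; stack[sp - 1] += stack[sp]; break;
            case OP_DUP:   stack[sp] = stack[sp - 1]; sp++; break;
            case OP_DROP:  sp--; break;
            case OP_HALT:  return;
            }
        }
    }
}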
Map Reduce? (Score:4, Interesting)
Re:Map Reduce? (Score:5, Insightful)
Aspex Semiconductors took this a lot further. they did content-addressable-memory. ok, they did a hell of a lot more than that. they created a massively-parallel deep SIMD architecture with a 2-bit CPU (early versions were 1 bit), with each CPU having something like 256 bits of memory to play with. ok, early versions had 128-bits of "straight" RAM and 256 bits of content-addressable RAM. when i was working for them they were planning the VASP-G architecture which would have 65536 such 2-bit CPUs on a single die. it was the 10th largest CPU being designed, in the world, at the time.
programming such CPUs was - is - a complete f*****g nightmare. you not only have the parallelism of the CPU to deal with but you have the I/O handling to deal with. do you try to fit the data 1-bit-wide per CPU and process it serially? or... do you try to fit the data across 32 CPUs and process it in parallel? (each CPU was connected to its 2 neighbours so you could do this sort of thing). or... do you do anything in between, because if you have only 1-bit-wide that means that the I/O is held up, but if you do 32-bits across 32 CPUs you process it so quick that you're now I/O bound.
much of the work in fitting algorithms onto ASPs involved having to write bloody spreadsheets in Excel to analyse whether it was best to use 1, 2, 4 .... 32 CPUs just to process the bloody data! 6 weeks of analysis to write 30 lines of code for god's sake!
it gets worse: you can't even go read a book on algorithms for hardware because that doesn't apply; you can't go read a book on algorithms for software because _that_ doesn't apply. working out how to fit AES onto the Aspex Semi CPU took me about... i think it was 6 weeks, to even _remotely_ make it optimal. i had to read up on the design of the 2-bit Galois Field theory behind the S-Boxes, because although you could do 8-bit S-Box substitution by running 256 "compare" instructions, one per substitution, in parallel across all 4096 CPUs, it turned out that if you actually implemented the *original* 2-bit Galois Field mathematical operations in each of the 2-bit CPUs you could get it down to 40 instructions, not 256.
and that was just for _one_ part of the Rijndael algorithm: i had to do a comprehensive detailed analysis of _every_ aspect of the algorithm.
in other words, everything that you _think_ you know about optimising software and algorithm design for either hardware or for software is completely and utterly wrong, for these types of massively-parallel map-reduce and content-addressable-memory CPUs.
that leaves them somewhere in the very very specialist dept, and even there, they have problems, because it takes so long to verify and design a new CPU. when the Aspex VASP-F architecture was being planned, it was AMAZING! wow! 100x faster than the best Pentium-III processor! of course, within 18 months it was only 20x better than the top-of-the-line Pentium that was available, and by the time it _actually_ came out, it was only 5x better than a bunch of x86 CPUs, which are a hell of a lot easier to program.
it was the same story for the next version of the CPU, even though that promised to have 64k processing elements...
Re: (Score:3)
I know this is probably going to sound flippant, but I'm sure there is a genuine reason and I'd be interested to hear it... Why not just write it both ways and test?
Better yet, why not get the compiler to try different parallelisations and use a genetic algorithm to optimise automatically?
Re: (Score:2)
Meh.
From a theorists standpoint, it's classical. You get classical Von Neumann state machine. There's the problem of heat and die size, and buses are absolutely custom if you use them, although someone will put together a nice chipset to deal with the timing.
Multiple cores still have the same problem in terms of cache state, fetch state, and synch, so no real benefit there. Add in memory protection and this has become more wicked still. Fast, but wicked difficult from an OS maker's standpoint. Not that it's
Re: (Score:2)
You want "Content Addressable Parallel Processors".
http://en.wikipedia.org/wiki/Content_Addressable_Parallel_Processor [wikipedia.org]
STARAN was one such beast. It had a PDP-11 as the control unit and ran queries in parallel, for instance for air traffic control. Load the memory up with your planes (that's what the PDP is for). And then perform operations in parallel on all memory units at once. And query for anything you need to know.
Very interesting devices.
See Also :
Foster, Caxton C. (1976), Content Addressable Parallel Processors
Re:Map Reduce? (Score:5, Funny)
Why in the world are people always saying the word Map Reduce nowadays? I hear it every week at least. Like it would be the solution to World War 3.
Since WWIII hasn't happened yet, you cannot rule out the fact that it *might* be the solution.
Re: (Score:2)
Why in the world are people always saying the word Map Reduce nowadays?
Distributed computing...
Processing In Memory (Score:5, Interesting)
This isn't new. The MIT Terasys platform did the same in 1995, and many have since. Nobody has yet come up with a viable programming model for such processors.
I'm expecting AMD's Fusion platform to move in the same direction (interleaved memory and shader banks), and they already have a usable MIMD model (basically OpenCL).
Re: (Score:3)
*nod* I'm not surprised it isn't new.
I've noticed that locality has become more and more important as speeds have gone up. I kind of wonder if something like this isn't the future.
I'm noticing, for example, that programming models involving channels and lots of threads have shown up and seem like a viable model for something like this. Erlang and Go are the two languages that do this that I can think of right offhand.
Re: (Score:3)
Nobody has yet come up with a viable programming model for such processors.
The Actor [wikipedia.org] model fits this type of CPU well. Each CPU could be considered an actor.
A Jump instruction to a memory bank of another CPU could be translated as an Actor call to be executed in parallel with the caller.
A call instruction to a memory bank of another CPU could be translated as a parallel call and wait instruction.
A load/store instruction could be translated as a queued request for retrieving/updating data.
This provides a very natural multitasking solution, that provides good expandabili
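Roughly, a sketch of what that mapping could look like (my illustration of the general actor idea, nothing specific to TOMI): each "core" owns a mailbox, and a call into another core's memory bank becomes an enqueued message.

/* Hypothetical actor-per-core sketch: one mailbox per core. */
#include <stdint.h>

#define MAILBOX_SLOTS 16

typedef struct {
    uint32_t entry_point;   /* address in the target core's memory bank */
    uint32_t arg;
} Message;

typedef struct {
    Message  box[MAILBOX_SLOTS];
    unsigned head, tail;    /* single-producer/single-consumer ring */
} Mailbox;

/* Caller side: "jump to another core" == post a message and keep going. */
static int actor_post(Mailbox *m, uint32_t entry, uint32_t arg)
{
    unsigned next = (m->tail + 1) % MAILBOX_SLOTS;
    if (next == m->head)
        return 0;                       /* mailbox full */
    m->box[m->tail] = (Message){ entry, arg };
    m->tail = next;
    return 1;
}

/* Callee side: the core drains its mailbox and runs each request locally. */
static void actor_drain(Mailbox *m, void (*dispatch)(uint32_t, uint32_t))
{
    while (m->head != m->tail) {
        Message msg = m->box[m->head];
        m->head = (m->head + 1) % MAILBOX_SLOTS;
        dispatch(msg.entry_point, msg.arg);
    }
}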
Re: (Score:2)
The programming challenge with these architectures is not how to write applications for them. It's how to write efficient, correct applications reasonably quickly. In practice, the processors quickly become special-purpose rather than general-purpose as a result of their programming frameworks focusing on particular problems that the architecture is good at. (Not to mention Amdahl's law kicks in pretty quickly.)
Re: (Score:2)
What about the programming model that was used for every processor that had a 1:1 clock relationship with its memory, i.e. everything before the 80386?
Re: (Score:2)
That's not the issue. It's the massive parallelism that's the issue. And most models for getting a grip on that tacitly assume symmetric access to all memory by all CPUs. It's only now that C++ is getting atomic operations whose implicit assumption is that some memory may be seen differently by one thread than by another.
Re:Processing In Memory (Score:4, Interesting)
It was not American (NIH)
The best training manual for it was called "Algol68 with fewer tears"
Other than that, it was able to handle parallelism, and most everything else, in a relatively painless manner.
For those who actually LIKE pain, there is always Occam.
Re: (Score:2)
I get a little giddy whenever someone brings back memories of Algol. But why stop reinventing the wheel every five years with a new greatest bestest programming language... just think of all the lost revenue to publishers and software companies, among others.
Re: (Score:2)
Considering the topic, it might be more suitable to say you meant "throw an exception here"
Just a first step... (Score:4, Interesting)
Really, this was inevitable, and this first implementation is just a first step. Future versions will undoubtedly include more functionality.
Current processors are ridiculously complicated. If you can knock out the entire cache with all of its logic, give the processor direct access to memory, and stick to a RISC design, you can get a very nice processor in under a million transistors.
Re:Just a first step... (Score:5, Informative)
the cache is there because the speed of DRAM, regardless of how fast you can communicate with it, still has latency issues on addressing.
to do the "routing" to address a 4-bit bus, you need half the number of transistors you'd need to address a 2-bit bus (for the same capacity, a wider data bus means fewer addresses to decode). each time you add another bit to the address range, you have increased the latency of access.
if you were to provide entirely random-access to an entire 32-bit range you would absolutely kill performance. so, what RAM IC designers do is they go "ok, you're not going to get 32-bit addressing, you're going to get 14-bit addressing, you're going to have to read an entire page of 1k or 2kbits, and you're going to have to have parallel ICs, the first IC does bits 0 to 1 of the data, the second IC does bits 2 and 3 etc."
this relies on the design of the processor having a VM architecture - paging.
but the same principle applies *inside* the processor: even just decoding the addressing in the MMU, there's *still* too much latency involved.
so this is why you end up with hierarchical caching - 1st level is tiny, 2nd level is huge.
even with RISC designs you _still_ have to have 1st and 2nd level caches in order to remain competitive. if you've ever seen a picture of a RISC CPU, it's astounding: the actual CPU is like 1% of the total area; caches are huuuge by comparison, crossbar routing takes up 50% of the chip and the I/O pads, required to be massive in order to handle the current, can take up something like 5% of the chip (guessing here, it's been a while since i looked at an annotated example CPU).
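to make the page point concrete, a rough sketch (illustrative numbers only, not from any real DRAM datasheet): a flat address gets carved into a row that has to be activated and a column within the open page.

/* Illustrative only: split a 32-bit physical address into DRAM-style
 * row/column fields, where each row activation opens a whole 2 KB page. */
#include <stdint.h>

#define COL_BITS  11u                 /* 2^11 = 2048-byte page per row */
#define ROW_BITS  14u                 /* the "14-bit addressing" case */

typedef struct { uint32_t row, col; } DramAddr;

static DramAddr split_addr(uint32_t phys)
{
    DramAddr a;
    a.col = phys & ((1u << COL_BITS) - 1u);                /* offset inside the open page */
    a.row = (phys >> COL_BITS) & ((1u << ROW_BITS) - 1u);  /* which page to activate */
    return a;
}

/* consecutive accesses within one row hit the open page cheaply;
 * crossing a row boundary costs a precharge + activate, i.e. latency. */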
Re: (Score:2)
even with RISC designs you _still_ have to have 1st and 2nd level caches in order to remain competitive. if you've ever seen a picture of a RISC CPU, it's astounding: the actual CPU is like 1% of the total area; caches are huuuge by comparison,
Don't do caches, do scratchpad memories and minimal instruction formats that require minimum bandwidth per opcode performed. And write a reasonable compiler. I've already seen books on it (static/automatic allocation of storage for scratchpad-equipped CPUs), I think CRC published a chapter on it in one of their recent compiler construction handbooks.
Re: (Score:2)
A lot of digital signal processing, that those chips would seem useful for, requires sequential access at very high bandwidths. When used that way, modern DRAM has no latency to speak of.
Why not a hexagonal design? (Score:5, Interesting)
Speaking of unconventional design, why don't we see hexagonal or triangular CPU-designs? All I have seen are the Manhattan-like designs. Are these really the best? Embedding the CPU inside a hexagonal/triangular DRAM design should be possible too. What would be the trade-offs?
Re: (Score:3)
It probably boils down to ease and efficiency of manufacture. Certainly for the core of the CPU, I would imagine it's because squares tessellate nicely on the silicon wafer.
Re: (Score:2)
hexagons would probably tessellate even better, with less waste.
Ease of manufacture is still the case though. Cutting them out would be a bitch though.
Re: (Score:3)
hexagons would probably tessellate even better, with less waste.
Ease of manufacture is still the case though. Cutting them out would be a bitch though.
That's why triangles would be good; they can act as parts of hexagons, and yet you can cut them out with straight cuts. OTOH, you'll have to deal with acute angles in the result, which might have its own set of problems. Squares are likely a reasonable compromise, all things considered.
Re: (Score:2)
acute angles are definitely bad in electronics like that, also, if you tried to use them as hexagons, you'd have to merge six, which would have a whole extra set of complexities, and room for error. Yes, square is going to be the best option.
Re: (Score:2)
Ease, yes. Efficiency, especially the amount of the wafer that's wasted due to it being circular initially, not so much.
A triangular layout probably would be ideal - you could (or at least, should be able to) adapt existing equipment (instead of two cuts at right angles, you make three cuts at 60-degree angles - hexagonal layouts would require either piecing together triangles produced this way, casting much smaller ingots such that it's one chip per wafer, or stamping out the hexagons), and you reduce was
Their money isn't old enough. (Score:2, Interesting)
They're innovating?
T-minus two days until they've been hit with 13 different patent lawsuits by companies that don't even produce anything similar.
Sorry about your luck!
Sorry , I don't believe it (Score:2)
Memory bottlenecks might be an issue, but cache generally solves a lot of them. Binning just about every advance in processor design since the Z80 simply to speed up memory access is farcical. I'm afraid this is going to sink without trace, since if you need low power you can just use ARM anyway, which incidentally will have a shedload more performance.
Re: (Score:2)
There's plenty of perhaps specialized but still fairly common digital signal processing that doesn't care at all about those "advances" in processor design. All it needs to do is plenty of multiplies-and-adds, saturated operations, test-and-modifies, etc. It doesn't require branch predictors, virtual memory, memory protection, layered caches, cache coherency, speculative execution, and plenty of other stuff that's needed to make x86 perform decently. The x86 instruction set is just very bad at extracting us
All on one chip (Score:4, Interesting)
I'm just wondering, and maybe it exists already, but why not make everything on one chip? The CPU, memory, GPU, etc.? Most people don't mess with the insides of their computer, and I'm guessing that it will speed up the computer as a whole. You won't even need to make it high-performance. Just do an i3 core with the associated chipset (or equivalent), maybe 4GB of RAM, some connectivity (USB 2, DVI, SATA, Wi-Fi and 1000Base-T) and you have it all. The power savings should be huge as everything internally should be low voltage. The die will be huge but we are heading that way anyway.
Am I talking bollocks?
Support Chip? (Score:2)
I'm no techie, but I'm just wondering if this isn't more of a support chip that works the other way, if it's like a "smart cache" where the main CPU can offload something memory intensive and repetitive to keep it out of the way of the fancy thread calculations.
Re: (Score:2)
I didn't think about that! Now why can't they do that for a desktop or laptop? Is the ARM system just that much smaller than the Intel desktop chips?
Re: (Score:2)
Intel instruction set architecture requires a lot of hardware to execute efficiently. That's the price we all pay for using an instruction set that is 3 decades behind the hardware it runs on.
Re: (Score:2)
You'd think three decades would be enough to dispel an inaccurate meme. The grotesque legacy instructions are mostly handled by microcode. Have you looked at how much die area that occupies relative to the rest of the CPU? The 286 was capable of executing the majority of the crappiest legacy instructions. That ha
Re: (Score:3)
It's not about how much die area the microcode takes, it's about how much die area everything else needed to run this microcode efficiently is taking! Properly designed opcodes would obviate trace generator, branch predictor, register reallocator, parts of northbridge, etc. Now that takes a lot of space. In case of x86 ISA it's not about legacy opcodes really, it's about all the missing opcodes (and registers) that a well performing architecture should have. That's what I mean by it being 30 years behind. E
Re: (Score:2)
Well, Raspberry Pi [raspberrypi.org] could be described as a proof of concept for the whole SoC-as-a-PC-substitute idea. At least for the Windows world, the popular software is only offered as precompiled binaries for x86-based platforms. It may be a while before there's a critical mass of ARM-based offerings to attract serious commercial attention. Windows 8 may change this but I think it's still too early to tell.
I think upgradability is possibly not the main advantage of desktops though it's certainly a key factor for many
Re: (Score:2)
You have one in your phone.
They may call it a "system on chip" but, while there is greater integration than with PC processors, many components such as power management, main memory, Ethernet PHY (and sometimes MAC too), cellphone and/or wifi, serial level shift and so on are nearly always on separate chips.
The problem is different chips need different process compromises, DRAM needs capacitors with low leakage, processors need fast switching and lots of interconnect. Ethernet needs relatively high currents to drive a couple of volts in
Re:All on one chip (Score:5, Interesting)
There are basically two problems:
1. The external connectivity -- SATA, USB, ethernet, etc. needs too much power to easily move or handle on a chip (and the radio stuff needs radio power). You can do the protocol work on the main chip if you like, but you'll need amplifiers, and possibly sensors off chip.
2. DRAM and CPUs are made in quite different processes, optimised for different purposes. Cache is memory made using CPU processes (so it's expensive and not very dense). These guys are trying to make CPUs using DRAM processes, which are slow.
Re: (Score:2)
That doesn't make it impossible; but it would very likely make it extraordinarily expensive. If you totted up the total die area of a contemporary PC, CPU,RAM, GPU, assorted peripherals and int
synthesis (Score:5, Informative)
there's a problem with doing designs like this. the tooling for CPUs is very very specific: 28nm, 32nm, 45nm - all those companies that do the simulations where they charge something like $USD 250,000 per week to license their tools like mentor do - have written the tools SPECIFICALLY for those geometries.
if you wander randomly outside of those geometries you are either on your own or you are into some unbelievably-high development costs.
why is this relevant?
it's because the DRAM manufacturers do *not* stick to the well-known geometries: they vary the geometry in order to get the absolute best performance because the cell layout is absolutely identical for DRAM ICs. and, because those cells _are_ identical, the verification process is much simpler than is required for a complex CPU.
in other words, this company is trying to mix-and-match two wildly different approaches. in other words, what he's doing is either incredibly expensive or is sub-optimal. which begs the question: what's it _for_?
Re: (Score:2)
there's a problem with doing designs like this. the tooling for CPUs is very very specific: 28nm, 32nm, 45nm - all those companies that do the simulations where they charge something like $USD 250,000 per week to license their tools like mentor do - have written the tools SPECIFICALLY for those geometries.
Or you can do it the way Chuck Moore does and write your own OKAD, simpler, faster and better. :)
Re: (Score:2)
which begs the question: what's it _for_?
"begging the question" doesn't mean what you think it means [begthequestion.info].
Aside from that, the device is a building block for massively parallel computers with extremely high memory bandwidth for the processors. The tasks it would be used for are the same tasks that other massively parallel supercomputers are used for today; simulating complex systems, graphics rendering, etc.
Yo dawg! (Score:2, Funny)
I heard you like to reduce maps, so I put a CPU in your RAM so you can hash while you map.
Humm, IBM did it first. (Score:2, Interesting)
IBM has sold CPUs with DRAM onboard for quite a while; IBM developed it, patented it, and sells it as "eDRAM" aka "embedded DRAM".
I guess IBM's POWER7 processor family, and the processors powering things like Sony's PlayStation 2, Sony's PlayStation Portable, Nintendo's GameCube, Nintendo's Wii, and Microsoft's Xbox 360, all have eDRAM.
Maybe news articles should be checked to see if they are really news or not before posting?
Re: (Score:2)
In the field, a microcontroller is just a processor with onboard usually static memory, so this thing is pretty close to a microcontroller.
Anyone know of any other microcontroller type chipsets that use dynamic ram?
Caches (Score:3)
Normally, in any CPU, you have 1, 2 or even 3 levels of cache - level one being the fastest accessed from the CPU, and higher numbers involving more latency. The whole idea being that data that is frequently accessed should be either within the CPU's register files, or within the level 1 cache. Failing that, the level 2 cache, failing that, level 3 cache or main memory. So for this CPU, the DRAM can be considered an L4 cache?
Incidentally, is it an SoC? Does all the support circuitry - to the South Bridge, PCIx, USB, 802.11 and other peripheral interfaces - get included here? And can someone attach a few extra GB externally to give what's effectively an L5 cache?
I can't say I like this approach - I'd prefer it if the CPU and interface logic was on 1 chip, and the memory on another.
Re: (Score:3)
Umm... no. You've apparently completely failed to notice the part where this CPU *has* no cache, at least certainly no L2 or L3. Instead, it talks directly to main memory (which it's embedded in, at least in a portion of, and has extremely fast access to). More accurately, any given gigabit (128MB of RAM) is the cache for one of these CPUs.
I don't know how quickly they can communicate across the DIMM (each 2GB has 16 CPUs, so some intercommunication is critical) - maybe that's more akin to traditional memor
cray3/super scalar system (Score:2)
Cray was heading that way too in the '90s with their SSS system; they were just adding many CPUs -- 2048 per block.
http://en.wikipedia.org/wiki/Cray-3/SSS
http://www.thefreelibrary.com/CRAY+COMPUTER+CORP.+COMPLETES+INITIAL+DEMONSTRATION+OF+THE+CRAY-3...-a016628331
cpu and memory already atomic (Score:2)
Reinvention from 1984 (Score:2)
Looks like they have reinvented the inmos Transputer, from about 1984. http://en.wikipedia.org/wiki/Transputer [wikipedia.org] . They always intended to take that multicore, but never got that far. But it looks remarkably similar in intention.
Re: (Score:2)
The ideas from inmos are alive and well at XMOS [xmos.com]. I use their two core chip and I'm fairly happy -- it's plenty fast for what I use it for (industrial data collection). If only they documented the darn thing better.
How curious (Score:2)
More than just embedded DRAM (Score:2)
This is not just about putting DRAM and a CPU on the same chip while keeping the architecture of both unchanged.
This is about how computer architecture is affected by the possibility of implementing both on the same chip.
Dave Patterson noted in the nineties that the number of DRAM chips per computer went down with time. He predicted that DRAM would soon become large enough that at least the memory for a single process would fit into one chip. At that point it is unnecessarily slow and power-consuming to move
Maybe we're thinking about this ALL wrong... (Score:2)
Let's say, instead of looking at it as a substitute for a main processor, we look at a much more distributed system.
8GB of RAM with (what I'll call) an Inline RAM processor.
So it doesn't do a lot of FP. That's fine, most portables and handhelds already have a GPU. GPUs love FP. Then let's add (if necessary) a simple CPU that essentially controls drive & I/O access.
Now, I'm not saying this will replace current processors or platforms. But there might be uses. Heck, I don't know. But what if this type
Pipelines (Score:2)
Of course you can get a pipeline in a CPU with ~22,000 transistors; the original ARM had, IIRC, about 28,000 transistors, and it had a pipeline. I'm guessing that this chip isn't x86. The x86 is far less economical with transistors; just the part that works out how long the next instruction is for x86 is larger than an entire ARM core. With simple fixed-length instructions, and with a simple ALU, you can get a chip that'll have pretty decent instruction throughput.
I somehow doubt this chip is designed to take over
Old idea, but better than you expect (Score:4, Informative)
My research area is computer architecture.
This idea of moving compute into the RAM has been around a long time. Papers have proposed everything from adding simple ALUs to the DRAMs to fully functional microprocessors. Most assume that these are "accelerators" for common vector operations and such, while the heavy lifting is done by beefier cores, but the idea of doing all the compute embedded in a DRAM has been proposed and evaluated before.
One thing we've learned in the past few decades is that modern processors are limited by memory latency and bandwidth. A Sun engineer (talking about Rock) pointed out that a modern out-of-order processor essentially runs a race between last-level cache misses. When you have to go out to DRAM, the CPU instruction window fills up with as much dependent work as possible before it completely stalls, because everything is dependent on that one miss. When that data finally arrives, the CPU blasts through that work really fast, and then soon stalls out again on another miss. OOO processors resolve this (somewhat) by the instruction window, while Rock solved it by speculative execution. One of the reasons for Sandy Bridge's excellent performance is the very large instruction window that can absorb more of the LLC miss stall time.
And so, although these processors have other advantages, OOO processors dedicate a huge amount of logic just to dealing with the cache miss latency. If there were no such latency, then they could get the same performance with a hell of a lot less hardware. Although I haven't seen the figures, my suspicion is that for general computation, TOMI will blow the doors off of whatever else we've got in both performance AND energy efficiency. Only when you have a specialized compute kernel whose working data fits in the cache can you comparatively benefit from something like Sandy Bridge. (I realize that's an overly strong statement, because lots of general purpose workloads have good locality, but nevertheless main memory is a major bottleneck for most workloads.)
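To illustrate the stall-between-misses point, here's a rough pointer-chasing sketch (my own toy example, not from any paper): each load depends on the previous one, so no instruction window can find independent work, and the core eats full DRAM latency on every step.

/* Sketch of a latency-bound pointer chase. rand() is a crude shuffle source,
 * but it's enough to defeat simple hardware prefetchers in a demo. */
#include <stdlib.h>
#include <stdio.h>

#define N (1 << 24)   /* ~16M entries, far larger than any cache */

int main(void)
{
    size_t *next = malloc(N * sizeof *next);
    if (!next) return 1;

    /* Build a random single-cycle permutation (Sattolo's algorithm) so the
       next address is never predictable from the current one. */
    for (size_t i = 0; i < N; i++) next[i] = i;
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;
        size_t t = next[i]; next[i] = next[j]; next[j] = t;
    }

    size_t p = 0, sum = 0;
    for (size_t k = 0; k < N; k++) {   /* dependent chain: one miss per step */
        p = next[p];
        sum += p;
    }
    printf("%zu\n", sum);              /* keep the chain from being optimized out */
    free(next);
    return 0;
}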
Hmmmm, I may have been looking for this... (Score:2)
Just as I was thinking that this might be the start of a good FORTH machine, I find out that Fish used to work with Chuck Moore. What a coinkydink.
+FPGA FTW (Score:3)
Embed a fat FPGA in this chip well-interconnected to DRAM and CPU, and you get all those things. You might even replace the current chip's buses with FPGA for both data distribution and inline logic. Or make a discrete (but well-interconnected) onchip FPGA able to power down when not in use, and keep the low power consumption except when it's necessary. Turn on the FPGA for speed, or when the FPGA logic is so efficient that it's lower power than doing it in the CPU.
For somewhat lower power consumption, and better performance in many tasks, but less flexibility, embed a DSP in the chip instead of the FPGA.
Or both: DSP as ALU, FPGA as CLU (and flexible ALU, and beyond), on the chip with a simple processor to run the OS and main app threads. Bringing all the ports and buses to RAM all on the chip makes it all wicked fast. De/selecting these modules for power on demand (or in thread init) saves energy.
Re: (Score:3)
Yes, I'm very excited about the Zynq, but it doesn't embed the RAM onchip, which is what's interesting about this new processor. The dual Cortex A9 is far more complex than the CPU on this new chip, in part to speed execution that's slowed by offchip RAM latency.
But indeed an FPGA and the AMBA bus into a simple chip with onboard RAM is interesting. Though for Xilinx's market for FPGA apps they'd be better with much more than 2GB onchip RAM; more like 128GB or a TB. And onboard optical Gbps ethernet, as long
Same old story... (Score:3)
Can't really solve the mutex problem so pretend it doesn't exist and screw the programmers by pretending to solve the main memory latency problem with CPU-local memory.
The "innovation" over Illiac IV is to call it "multicore".
PS: There is a solution, but since I can't afford the patent fees, it's not going anywhere.
Re: (Score:2)
And how do I add more RAM to my system?
It's Computer Assisted Memory. Check the NUMA box. /Ducks and covers
Re: (Score:2)
I think UM-CRAP, or RAD-MC-PU would be more catchy.
Re:So, is it a CAM or a DRPU? (Score:4, Funny)
Missed a D - better make that DUM-CRAP*.
I wonder, how much DUM-CRAP could we fit into a single PC?
* this name is by no means a reflection on what I think of the tech - it sounds like a pretty cool idea.
Re: (Score:2, Insightful)
The idea is not new, and lots of products with a CPU on the RAM die exist. Sun had this on graphics cards, for example.
The missing FP is not a big deal since FP can be calculated with ints if needed - but it will get an FPU in the follow-up products to stop the rants.
The dirty secret of the computer industry is that the CPU has to wait "lots" of cycles to go to memory and back, since the CPUs are clocked much higher than the RAM, plus there are other chips in between...
RAM can be added in system designs here -
Re: (Score:2)
Yeah, my understanding is that in a modern CPU a cache miss is more costly than poorly optimized code.
Re: (Score:2)
It seems to be fairly useless for things other than very limited specialized tasks. 128 words of memory per core? What the heck? Even with 512 words per core on Parallax Propeller, people are fighting to get things going without having to access the shared and slow hub memory. Fitting everything in 512 words (2 kbytes) is hard, 128 words makes it more than 4 times harder. Their architecture basically implements a very simple dialect of FORTH in hardware. It's very hard to design systems of any considerable