Intel Technology

Intel Talks 1000-Core Processors

Posted by samzenpus
from the we're-gonna-need-a-bigger-heat-sink dept.
angry tapir writes "An experimental Intel chip shows the feasibility of building processors with 1,000 cores, an Intel researcher has asserted. The architecture for the Intel 48-core Single Chip Cloud Computer processor is 'arbitrarily scalable,' according to Timothy Mattson. 'This is an architecture that could, in principle, scale to 1,000 cores,' he said. 'I can just keep adding, adding, adding cores.'"
This discussion has been archived. No new comments can be posted.

  • by mentil (1748130) on Monday November 22, 2010 @03:01AM (#34303406)

    This is for server/enterprise usage, not consumer usage. That said, it could scale to the number of cores necessary to make realtime raytracing work at 60fps for computer games. Raytracing could be the killer app for cloud gaming services like OnLive, where the power to do it is unavailable for consumer computers, or prohibitively expensive. The only way Microsoft etc. would be able to have comparable graphics in a console in the next few years is if it were rental-only like the Neo-Geo originally was.

  • Re:One question? (Score:3, Insightful)

    by JWSmythe (446288) on Monday November 22, 2010 @03:37AM (#34303582)

    The only thing I'd be compensating for is the fact I can't do calculations at Exaflop rates in my head.

    Just like my car only compensates for the fact I can't run at 165mph. :)

  • by Anonymous Coward on Monday November 22, 2010 @03:38AM (#34303590)

    Basically, we are going to need compilers that automatically take advantage of all that parallelism without making you think about it too much, and programming languages that are designed to make your programs parallel-friendly. Even Microsoft is finally starting to edge in this direction with F# and some new features of .NET 4.0. Look at Haskell and Erlang for examples of languages that take such things more seriously, even if the world takes them less seriously.

    I don't know about AI, but almost certainly we will end up with both compilers and virtual machines that are aware of parallelism and try to take advantage of it whenever possible.

    But still, certain algorithms just aren't very friendly to parallelism no matter what technology you apply to them.
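
    One common way to put a number on this limit is Amdahl's law. The sketch below is only an illustration: the function name and the 95% parallel fraction are assumptions chosen for the example, not figures from the article.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Upper bound on speedup when only part of a program can run in parallel."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part takes the same time regardless of core count;
    # only the parallel part is divided among the cores.
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Hypothetical workload: 95% of the runtime parallelises perfectly.
    for cores in (2, 48, 1000):
        print(cores, "cores ->", round(amdahl_speedup(0.95, cores), 2), "x")
```

    Under that assumption the speedup can never reach 20x, however many cores are added, because the remaining 5% of serial work dominates.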

  • Instruction set... (Score:4, Insightful)

    by KonoWatakushi (910213) on Monday November 22, 2010 @03:47AM (#34303634)

    "Performance on this chip is not interesting," Mattson said. It uses a standard x86 instruction set.

    How about developing a small efficient core, where the performance is interesting? Actually, don't even bother; just reuse the DEC Alpha instruction set that is collecting dust at Intel.

    There is no point in tying these massively parallel architectures to some ancient ISA.

  • Re:Imagine (Score:2, Insightful)

    by AuMatar (183847) on Monday November 22, 2010 @04:35AM (#34303788)

    Why would you care to see one on your desktop? Do you have any use for one? There's a point where except for supercomputers enough is enough. We've probably already passed it.

  • by Anonymous Coward on Monday November 22, 2010 @04:43AM (#34303814)

    Learn a functional language. Learn it not for some practical reason. Learn it because having another view will give you interesting choices even when writing in imperative languages. Every serious programmer should try to look at the important paradigms so that he can freely choose to use them where appropriate.

  • by francium de neobie (590783) on Monday November 22, 2010 @04:44AM (#34303820)

    Ok, you can cram 1000 cores into one CPU chip - but feeding all 1000 CPU cores with enough data for them to process and transferring all the data they spit out is gonna be a big problem. Things like OpenCL work now because the high end GPUs these days have 100GB/s+ bandwidth to the local video memory chips, and you're only pulling out the result back into system memory after the GPU did all the hard work. But doing the same thing on a system level - you're gonna have problems with your usual DDR3 modules, your SSD hard disk (even PCI-E based) and your 10GE network interface.

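
    A back-of-the-envelope sketch of the feeding problem, assuming the 100GB/s figure above is shared evenly among 1000 cores. The even split and the 1GHz clock are assumptions made only for this illustration; real memory systems do not divide bandwidth this neatly.

```python
def per_core_bandwidth_gb_s(total_gb_s: float, cores: int) -> float:
    """Evenly divide a shared memory bandwidth (in GB/s) among cores."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return total_gb_s / cores


def bytes_per_cycle(per_core_gb_s: float, clock_ghz: float) -> float:
    """GB/s divided by Gcycles/s gives bytes available per clock cycle."""
    if clock_ghz <= 0:
        raise ValueError("clock_ghz must be positive")
    return per_core_gb_s / clock_ghz


if __name__ == "__main__":
    share = per_core_bandwidth_gb_s(100.0, 1000)  # 100GB/s from the comment above
    print("per core:", share, "GB/s")
    # Hypothetical 1GHz core clock, assumed for illustration only.
    print("per cycle:", bytes_per_cycle(share, 1.0), "bytes")
```

    With those assumptions each core gets roughly a tenth of a byte per clock cycle, so it would have to do most of its work out of local memory or cache.
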
  • by Electricity Likes Me (1098643) on Monday November 22, 2010 @04:48AM (#34303834)

    1000 cores at 1GHz on a single chip, networked to 1000 other chips, would probably just about make a non-real-time simulation of a full human brain possible (going off something I read about this somewhere). Although if it is possible to arbitrarily scale the number of cores, then we might be able to seriously consider building a system of very simple processors acting as electronic neurons.

  • by Terje Mathisen (128806) on Monday November 22, 2010 @04:58AM (#34303870)

    The key difference between this research chip and the other Multicore chips Intel have worked on, like Larrabee, is that it is explicitly NOT cache coherent, i.e. it is a cluster on chip instead of a single-image multi-processor.

    This means, among many other things, that you cannot load a single Linux OS across all the cores, you need a separate executive on every core.

    Compare this with the 7-8 Cell cores in a PS3.


  • by kohaku (797652) on Monday November 22, 2010 @05:47AM (#34304088)

    There's also no reason to throw away an ISA that has proven to be extremely scalable and very successful, just because it's ancient or it looks ugly.

    Uh, scalable? Not really... The only reason x86 is still around (i.e. successful) is because it's pretty much backwards compatible since the 8086, which is over THIRTY YEARS OLD.

    The advantage of the x86 instruction set is that it's very compact. It comes at a price of increased decoding complexity, but that problem has already been solved.

    Whoa nelly. Compact? I'm not sure where you got that idea, but it's called CISC and not RISC for a reason! If you think x86 is compact, you might be interested to find out that you can have a fifteen byte instruction. In fact, on the i7 line, the instructions are so complex it's not even worth writing a "real" decoder; they're translated in real time into a RISC instruction set! If Intel would just abandon x86, they could reduce their cores by something like 50%!

    The low number of registers _IS_ a problem. The only reason there are only four is because of backwards compatibility. It definitely is a problem for scalability: one cannot simply rely on a shared memory architecture to scale vertically indefinitely. You just use too much power as die size increases, and memory just doesn't scale up as fast as the number of transistors on a CPU.

    A far better approach is to have a decent model of parallelism (CSP, Pi-calculus, Ambient calculus) underlying the architecture and to provide a simple architecture with primitives supporting features of these calculi, such as channel communication. There are plenty of startups doing things like this, not just Intel, and they already have products on the market, though not desktop processors. Picochip and Icera, to name just a couple, not to mention things like GPGPU (Fermi, etc.)

    Really, the way to go is small, simple, low power cores with on-chip networks which can scale up MUCH better than just the old Intel method of "More transistors, increase clock speed, bigger cache".
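
    To make "channel communication" concrete, here is a rough software sketch of a two-stage pipeline in the CSP style. It is an approximation only: Python threads and bounded `queue.Queue` objects stand in for cores and hardware channels, and all function names are hypothetical. The point is that the stages share no data and interact only by passing messages.

```python
import queue
import threading

DONE = object()  # sentinel marking the end of a stream


def producer(out_ch, values):
    """First stage: send each value down the channel, then signal completion."""
    for value in values:
        out_ch.put(value)
    out_ch.put(DONE)


def squarer(in_ch, out_ch):
    """Second stage: receive a value, square it, pass it on."""
    while True:
        item = in_ch.get()
        if item is DONE:
            out_ch.put(DONE)
            return
        out_ch.put(item * item)


def run_pipeline(values):
    """Wire the stages together with bounded channels and collect the output."""
    a = queue.Queue(maxsize=1)  # small buffer, so senders wait for receivers
    b = queue.Queue(maxsize=1)
    stages = [
        threading.Thread(target=producer, args=(a, values)),
        threading.Thread(target=squarer, args=(a, b)),
    ]
    for stage in stages:
        stage.start()
    results = []
    while True:
        item = b.get()
        if item is DONE:
            break
        results.append(item)
    for stage in stages:
        stage.join()
    return results


if __name__ == "__main__":
    print(run_pipeline([1, 2, 3, 4]))
```

    Because each stage only reads from one channel and writes to another, adding a stage does not require any shared, coherent memory between them.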

  • by Anonymous Coward on Monday November 22, 2010 @06:33AM (#34304234)

    The advantage of the x86 instruction set is that it's very compact. It comes at a price of increased decoding complexity, but that problem has already been solved.

    Wrong and wrong. The x86 instruction set is hardly compact with all its redundancy and roundabout inconsistent operations. ARM Thumb coding is superior in just about every way.

    Decoding is also not "solved"; its cost is minimised through intense optimisation and by pre-decoding instructions before they are placed in the cache, but that still harms the efficiency of the pipeline (branch misprediction is lethal).

    The low number of registers is not a problem. In fact, it may even be an advantage to scalability. A register is nothing more than a programmer-controlled mini cache in front of the memory.

    You keep using that word, I don't think it means what you think it means.

    The fact that you think x86 is scalable is laughable. You may want to do some research into cache coherency, specifically how, for every CPU added, the cache protocol becomes less and less efficient until every physical core you add makes the system slower, as the existing CPUs are too busy arguing with each other over who owns what cache lines to actually do any computation. The only solution is either to throw out cache coherency, which is exactly what most competing ISAs have done, or to partition the memory so that the system resembles a cluster in a single box more than a multi-CPU computer (NUMA). The second option doesn't work unless you have separate CPU chips; a heap of cores on a single chip will have electrical problems connecting enough pins for several fully independent memory buses.

    I can't imagine what exactly "not having many registers" and "scalable" have in common. RISC cores like PPC (which has 64 registers) can be packed more densely on a single chip than x86 cores so it isn't size related. Smaller numbers of registers are correlated with lower performance so it isn't that either.

    I'd rather have few registers, and go directly to memory. The hardware can then scale to include bigger and faster caches, so that memory access is just as fast a register access, without the software having to deal with register allocation and save/restore.

    A memory-only machine without programmable registers may work with internal control logic that maps addresses on to registers without direct programming, but that sort of thing is generally a bad idea. The control logic is rarely optimal as there isn't time to look ahead and perform a full code analysis. Current CPUs already try things like that (it's called register renaming; modern x86 has a lot of registers, it just maps the 8 programmable ones on to the larger set), and it just doesn't work as well as having the compiler allocate registers appropriately in advance.
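
    For anyone unfamiliar with register renaming, a toy model of the idea follows. It is a deliberately simplified sketch, not how any particular x86 core works: the sizes (8 architectural, 16 physical), the instruction format and the function name are all assumptions, and unlike real hardware it never recycles a physical register.

```python
def rename(instructions, num_arch_regs=8, num_phys_regs=16):
    """Map (dest, src1, src2) architectural registers to physical registers.

    Every write gets a fresh physical register, so a later write to the same
    architectural register no longer conflicts with earlier reads or writes.
    """
    # Initially architectural register i lives in physical register i.
    mapping = {reg: reg for reg in range(num_arch_regs)}
    free = list(range(num_arch_regs, num_phys_regs))
    renamed = []
    for dest, src1, src2 in instructions:
        # Sources are read through the current mapping first.
        phys_src1, phys_src2 = mapping[src1], mapping[src2]
        if not free:
            # This toy never frees registers; real hardware would stall
            # until an older instruction retires and releases one.
            raise RuntimeError("out of physical registers")
        phys_dest = free.pop(0)
        mapping[dest] = phys_dest
        renamed.append((phys_dest, phys_src1, phys_src2))
    return renamed


if __name__ == "__main__":
    program = [
        (0, 1, 2),  # r0 = r1 op r2
        (3, 0, 1),  # r3 = r0 op r1   (reads the first r0)
        (0, 4, 5),  # r0 = r4 op r5   (reuses the name r0)
        (6, 0, 3),  # r6 = r0 op r3   (reads the second r0)
    ]
    for before, after in zip(program, rename(program)):
        print(before, "->", after)
```

    In this example the two writes to r0 land in different physical registers, so the third instruction does not have to wait for the second one to finish reading the old value.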

    Oh, and cache can never be as fast as registers, that is, after all, the difference between memory and registers (memory is big and slow, registers are tiny and crazy fast). You'll have to wait until we invent Unobtanium based computers for that to be possible (hardware that is not limited by the speed of light).

    Going back to the earlier statement about cache coherency. Registers are local state, the contents of registers are not visible to other CPUs so the CPU doesn't need to worry about coherency problems with those values. Direct memory access is always incoherent and forces the CPU to behave defensively (i.e. slowly). [Direct memory access on x86 is convenient, but that's ultimately because it's a solution to a problem of its own design, the lack of registers]

  • by Arlet (29997) on Monday November 22, 2010 @06:35AM (#34304248)

    The only reason x86 is still around (i.e. successful) is because it's pretty much backwards compatible since the 8086- which is over THIRTY YEARS OLD.

    That's a clear testament to scalability when you consider the speed improvement in the last 30 years using basically the same ISA.

    you might be interested to find out that you can have a fifteen byte instruction

    So? It's not the maximum instruction length that counts, but the average. In typical programs that's closer to three. Frequently used opcodes like push/pop only take a single byte. Compare that to the DEC Alpha architecture, where nearly every single instruction uses 15 bits just to tell which registers are used, no matter whether a function needs that many registers.
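
    The 15-bit figure can be checked with a little arithmetic: Alpha has 32 integer registers, so each register field needs 5 bits, and a three-operand instruction carries three such fields inside a fixed 32-bit word. The sketch below works that out; the helper name is hypothetical, and the 3-byte x86 average is the estimate given above, not a measured value.

```python
def register_field_bits(num_registers: int, operands: int) -> int:
    """Bits needed to name `operands` registers out of `num_registers`."""
    if num_registers < 2 or operands < 0:
        raise ValueError("need at least 2 registers and a non-negative operand count")
    bits_per_operand = (num_registers - 1).bit_length()
    return bits_per_operand * operands


if __name__ == "__main__":
    alpha_bits = register_field_bits(32, 3)
    print("Alpha register fields:", alpha_bits, "bits of a 32-bit instruction")
    print("share of the instruction word:", alpha_bits / 32)

    # Rough code-size comparison for the same number of instructions.
    # Assumes ~3 bytes per x86 instruction (the estimate above) and ignores
    # the fact that the two ISAs need different instruction counts.
    instructions = 1000
    print("x86 bytes (assumed average):", instructions * 3)
    print("Alpha bytes (fixed 4-byte words):", instructions * 4)
```

    This only compares encoding size per instruction; how many instructions each ISA needs for the same program is a separate question.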

    If Intel would just abandon x86, they could reduce their cores by something like 50%!

    Even if that's true (I doubt it), who cares? The problem is not that Intel has too many transistors for a given area. The problem is just the opposite. They have the capability to put more transistors in a core than they know what to do with. Also, typically half the chip is for the cache memories, and the compact instruction set helps to use that cache memory more effectively.

    one cannot simply rely on a shared memory architecture to scale vertically indefinitely

    Sure you can. Shared memory architectures can do everything explicit channel communication architectures can do, plus you have the benefit that the communication details are hidden from the programmer, allowing improvements to the implementation without having to rewrite your software. Sure, the hardware is more complex, but transistors are dirt cheap, so I'd rather put the complexity in the hardware.

  • by Arlet (29997) on Monday November 22, 2010 @10:26AM (#34305504)

    Examples? It's just a different model, it doesn't prevent you from solving any problem.

    A typical consumer desktop machine, running typical programs for instance. In order to use these cores effectively, all these programs need to be rewritten. Imagine your word processor reformatting a 500 page document on 1000 cores. It's just not going to work very well.

    How about the operating system? 1000 different cores all trying to access a file system on a single physical drive. How are you going to run that efficiently?

  • testament to what? (Score:2, Insightful)

    by reiisi (1211052) on Monday November 22, 2010 @10:52AM (#34305804)

    I'd call it more of a testament to how much Intel's fanaticism can induce them to waste all the benefits of Moore's law supporting baggage that was unnecessary when the x86 was "invented".

    Just for the marketing department's black magic.

    Instruction efficiency? Compact code? There are numerous processors that wax the floor with x86 in those departments, but marketing department's black magic killed the market.

    Magic? It's all parlor tricks, you know, pay a researcher here to slip a little excess code in a tight loop on that 68k "benchmark", that sort of thing. The problem with the old saw about magic being indistinguishable from advanced tech is that magic is not about real results. Magic is about illusion. The confusing point is that illusion can be turned into reality with some effort.

    In the x86 case, it was a huge lot of effort justified by a huge load of hubris and the needs of the black magic department, a vicious cycle.

    x86 is a significant contributor to global warming (which is part of the reason some people want to deny the reality of human impact on the climate changes).

  • by menkhaura (103150) on Monday November 22, 2010 @11:47AM (#34306490)

    Talk is cheap, show me the cores.

  • by cdpage (1172729) on Monday November 22, 2010 @12:42PM (#34307198)

    Photoshop has been stuck at 2 processors for way too long. Software companies have been lagging behind hardware far too long. Until I see more software taking advantage of more than 1 or 2 cores, I'm not wasting money on them.

  • Benchmarks (Score:3, Insightful)

    by Chemisor (97276) on Monday November 22, 2010 @12:47PM (#34307266)

    According to benchmarks, a functional language like Erlang is slower than C++ by an order of magnitude. Sure, it can distribute processing over more cores, which is the only thing that enabled it to win one of the benchmarks. I suspect that was only because it used a core library function that was written in C. So no, if you want to write code with acceptable performance, DON'T use a functional language. All CPU intensive programs, like games, are written in C or C++; think about that.
