Intel Talks 1000-Core Processors

Posted by samzenpus
from the we're-gonna-need-a-bigger-heat-sink dept.
angry tapir writes "An experimental Intel chip shows the feasibility of building processors with 1,000 cores, an Intel researcher has asserted. The architecture for the Intel 48-core Single Chip Cloud Computer processor is 'arbitrarily scalable,' according to Timothy Mattson. 'This is an architecture that could, in principle, scale to 1,000 cores,' he said. 'I can just keep adding, adding, adding cores.'"

  • Jeez... (Score:5, Funny)

    by Joe Snipe (224958) on Monday November 22, 2010 @02:54AM (#34303374)

    I hope he never works for Gillette.

  • From the article: "By installing the TCP/IP protocol on the data link layer, the team was able to run a separate Linux-based operating system on each core. Mattson noted that while it would be possible to run a 48-node Linux cluster on the chip, it "would be boring."

    Huh?! Boring?! It would have been a nice first post on Slashdot on the eternal topic - does it run Linux? - to begin with.

    Then we have all the programming goodies to follow up with.

    • by c0lo (1497653)

      ;) To make things interesting, each of the cores would have to use a public Internet IPv4 address.

      by RAMMS+EIN (578166) on Monday November 22, 2010 @03:45AM (#34303616)

      Running Linux on a 48-core system is boring, because it has already been run on a 64-core system in 2007 [gigaom.com] (at the time, Tilera [tilera.com] said they would be up to 1000 cores in 2014; they're up to 100 cores per CPU now).

      As far as I know, Linux currently supports up to 256 CPUs. I assume that means logical CPUs, so that, for example, this would support one CPU with 256 cores, or one CPU with 128 cores with two CPU threads per core, etc.

      • The most interesting part to me is how they're actually building a router into the chip. The cores communicate through TCP/IP. That's incredible.

      • by vojtech (565680) <vojtech@suse.cz> on Monday November 22, 2010 @05:34AM (#34304036)
        The current limit on Linux (with 2.6 series) is 8192 CPUs on POWER and 4096 on x86. And there are even a number of non-x86 machines today that reach these sizes in a cache-coherent (ccNUMA) manner that Linux works well on. You still have to be careful with application design, though, because it's fairly easy to hit bottlenecks either in the application or in the kernel that will limit scalability. Most common workloads are already seeing
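        One standard way to put a number on that (an assumption here; the comment doesn't name a model) is Amdahl's law: if a fraction s of the work stays serial, for example behind a contended kernel lock, the speedup on n cores is 1 / (s + (1 - s) / n), which can never exceed 1/s. A minimal sketch with a purely illustrative 5% serial fraction:

```python
def amdahl_speedup(serial_fraction: float, cores: int) -> float:
    """Ideal speedup under Amdahl's law when `serial_fraction` of the
    work is inherently serial (an illustrative model, not a measurement)."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be in [0, 1]")
    if cores < 1:
        raise ValueError("cores must be >= 1")
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# Hypothetical 5% serial bottleneck (e.g. a contended kernel lock).
for n in (48, 256, 1000, 4096):
    print(f"{n:>5} cores -> {amdahl_speedup(0.05, n):5.1f}x")
```

        With s = 0.05 the curve flattens quickly: going from 1000 to 4096 cores barely moves the speedup, and it stays below 20x no matter how many cores are added.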
        • Re: (Score:3, Informative)

          by TheRaven64 (641858)

          The current limit on Linux (with 2.6 series) is 8192 CPUs on POWER and 4096 on x86

          That's kind of true, but quite misleading. 8192 is the hard limit, but scheduler and related overhead mean that performance gets pretty poor long before then. Please don't cite the big SGI and IBM machines as counterexamples. The SGI machines effectively run a cluster OS, but with hardware distributed shared memory. They are 'single system image' in that they appear to be one OS to the user, but each board has its own kernel, I/O peripherals and memory and works largely independently except when a

    by PaulBu (473180) on Monday November 22, 2010 @03:00AM (#34303400)

    Are they trying to reinvent the Transputer? :)

    But yes, I am happy to see Intel pushing it forward!

    Paul B.

    • Re: (Score:2, Interesting)

      by TinkersDamn (647700)
      Yes, I've been wondering the same thing. Transputers contained key ideas that seem to be coming around again...
      But a more crucial question might be how much heat you can handle on one chip. These guys are already at 25 to 125 watts, likely depending on how many cores are actually turned on. After all, they're already playing pretty hefty heat-management tricks on current i7s and Phenoms.
      http://techreport.com/articles.x/15818/2 [techreport.com]
      What use are 48 cores, let alone 1000, if they're all being slowed down to 50% or whatever?
  • by mentil (1748130) on Monday November 22, 2010 @03:01AM (#34303406)

    This is for server/enterprise usage, not consumer usage. That said, it could scale to the number of cores necessary to make real-time raytracing work at 60 fps for computer games. Raytracing could be the killer app for cloud gaming services like OnLive, since the power to do it is unavailable or prohibitively expensive on consumer computers. The only way Microsoft etc. would be able to have comparable graphics in a console in the next few years is if it were rental-only, like the Neo-Geo originally was.

  • by Anonymous Coward
  • by SixDimensionalArray (604334) on Monday November 22, 2010 @03:07AM (#34303434)

    Imagine a Beowulf cluster of th^H^H^H

    Ah, forget it, the darn thing practically is one already! :/

    "Imagine exactly ONE of those" just doesn't sound the same.

  • by pyronordicman (1639489) on Monday November 22, 2010 @03:12AM (#34303456)
    Having attended this presentation at Supercomputing 2010, for once I can say without a doubt that the article captured the essence of reality. The only thing it left out is that the interconnect between all the processing elements uses significantly less energy than that of the previous 80-core chip; I think the figure was around 10% of chip power for the 48-core chip, versus 30% for the 80-core one. Oh, and MPI over TCP/IP was faster than the native message-passing scheme for large messages.
  • "It's a lot harder than you'd think to look at your program and think 'how many volts do I really need?'" he [Mattson] said.

    First it was RAM (640KB should be... doh), then MHz/GHz, then watts, now it's volts... so, what's next?
    (my bet... returning to RAM and the advent of x128)
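    Why volts rather than watts: a minimal sketch of the textbook first-order model for CMOS switching power, P ≈ α·C·V²·f (a standard approximation, not something from the article). Voltage enters squared, and a lower voltage usually forces a lower clock too, so trimming both cuts power much faster than it cuts speed. All parameter values below are made up for illustration, and leakage power is ignored.

```python
def dynamic_power(activity: float, capacitance_f: float, volts: float, freq_hz: float) -> float:
    """First-order CMOS dynamic (switching) power: P = alpha * C * V^2 * f, in watts.
    Ignores leakage/static power, which matters a lot on real chips."""
    return activity * capacitance_f * volts ** 2 * freq_hz


# Hypothetical core: 20% switching activity, 1 nF effective switched capacitance.
base = dynamic_power(0.2, 1e-9, 1.1, 2.0e9)
# Same core at 70% of the voltage and half the clock.
slow = dynamic_power(0.2, 1e-9, 0.77, 1.0e9)
print(f"base {base:.3f} W, slow {slow:.3f} W, ratio {slow / base:.3f}")
```

    Here halving the clock alone would halve power, while also dropping the voltage to 70% brings it down to about a quarter of the original.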

  • by Anonymous Coward

    Make a processor with four asses.

  • by igreaterthanu (1942456) on Monday November 22, 2010 @03:31AM (#34303544)
    This just goes to show that if you care about having a future career (or even just continuing with your existing one) in programming, learn a functional language NOW!
    • Re: (Score:3, Interesting)

      by jamesswift (1184223)

      It's quite something, isn't it, how few people even on Slashdot seem to get this. Old habits die hard, I guess.
      Years ago a clever friend of mine clued me into how functional was going to be important.

      He was so right, and the real solutions to concurrency (note: not parallelism, which is easy enough in imperative languages) are in the world of FP, or at least mostly FP.

      My personal favourite so far is Clojure which has the most comprehensive and realistic approach to concurrency I've seen yet in a language ready for real

    • by Anonymous Coward on Monday November 22, 2010 @04:43AM (#34303814)

      Learn a functional language. Learn it not for some practical reason. Learn it because having another view will give you interesting choices even when writing in imperative languages. Every serious programmer should try to look at the important paradigms so that he can freely choose to use them where appropriate.

    • by rrohbeck (944847)

      All you need is a library that gives you worker threads, queues and synchronization primitives. We've all learned that stuff at some point (and forgot most of it.)
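      A minimal sketch of that approach using only Python's standard library (`threading` and `queue`); squaring a number is a stand-in for real work. In CPython the GIL limits CPU-bound speedup from threads, so this shows the structure (workers, a task queue, sentinels, join) rather than the performance.

```python
import queue
import threading

def worker(tasks: queue.Queue, results: queue.Queue) -> None:
    """Pull items until a None sentinel arrives; push (item, result) pairs."""
    while True:
        item = tasks.get()
        if item is None:              # sentinel: no more work for this thread
            break
        results.put((item, item * item))  # placeholder for real work

def run_pool(items, num_workers: int = 4) -> dict:
    tasks: queue.Queue = queue.Queue()
    results: queue.Queue = queue.Queue()
    threads = [threading.Thread(target=worker, args=(tasks, results))
               for _ in range(num_workers)]
    for t in threads:
        t.start()
    for item in items:
        tasks.put(item)
    for _ in threads:                 # one sentinel per worker
        tasks.put(None)
    for t in threads:
        t.join()                      # wait until every worker has exited
    out = {}
    while not results.empty():        # safe: all producers have finished
        item, value = results.get()
        out[item] = value
    return out

print(sorted(run_pool(range(10)).items()))
```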

    • by loufoque (1400831)

      Sorry, but while functional programming style is indeed the future of HPC (with C++), functional languages themselves aren't. Read the research papers of the field and see for yourself.
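      A minimal sketch of "functional style" inside an imperative language (Python here, though the same shape applies in C++): a pure function with no shared mutable state can be mapped over chunks of input in parallel without any locks, and the partial results combined with a reduce. `ThreadPoolExecutor` keeps the example self-contained; because of CPython's GIL it shows the structure, not a real speedup, and a process pool or native threads would be the realistic choice for CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
import operator

def sum_of_squares(chunk: tuple) -> int:
    """Pure function: the result depends only on the input, no side effects."""
    return sum(x * x for x in chunk)

def chunked(seq, size: int):
    """Split seq into immutable tuples of at most `size` elements."""
    return [tuple(seq[i:i + size]) for i in range(0, len(seq), size)]

def parallel_sum_of_squares(data, chunk_size: int = 1000, workers: int = 8) -> int:
    chunks = chunked(list(data), chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(sum_of_squares, chunks)  # map: independent work items
        return reduce(operator.add, partials, 0)     # reduce: combine partial sums

print(parallel_sum_of_squares(range(10_000)))
```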

    • Benchmarks (Score:3, Insightful)

      by Chemisor (97276)

      According to benchmarks [debian.org], a functional language like Erlang is slower than C++ by an order of magnitude. Sure, it can distribute processing over more cores, which is the only thing that enabled it to win one of the benchmarks. I suspect that was only because it used a core library function that was written in C. So no, if you want to write code with acceptable performance, DON'T use a functional language. All CPU intensive programs, like games, are written in C or C++; think about that.

  • by Jason Kimball (571886) on Monday November 22, 2010 @03:46AM (#34303620)

    1000 cores on a chip isn't too bad. I already have one with 110 cores.

    That's only 10 more cores!

  • I wonder how the inter-core communication will scale without packing 1000+ layers in the die.
  • Instruction set... (Score:4, Insightful)

    by KonoWatakushi (910213) on Monday November 22, 2010 @03:47AM (#34303634)

    "Performance on this chip is not interesting," Mattson said. It uses a standard x86 instruction set.

    How about developing a small efficient core, where the performance is interesting? Actually, don't even bother; just reuse the DEC Alpha instruction set that is collecting dust at Intel.

    There is no point in tying these massively parallel architectures to some ancient ISA.

    • by Arlet (29997)

      There's also no reason to throw away an ISA that has proven to be extremely scalable and very successful, just because it's ancient or it looks ugly.

      The advantage of the x86 instruction set is that it's very compact. It comes at a price of increased decoding complexity, but that problem has already been solved.

      The low number of registers is not a problem. In fact, it may even be an advantage to scalability. A register is nothing more than a programmer-controlled mini cache in front of the memory. I'd rather

      • by Splab (574204)

        Err, did you just claim cache is as fast as a register access?

      • by kohaku (797652) on Monday November 22, 2010 @05:47AM (#34304088)

        There's also no reason to throw away an ISA that has proven to be extremely scalable and very successful, just because it's ancient or it looks ugly.

        Uh, scalable? Not really... The only reason x86 is still around (i.e. successful) is because it's pretty much backwards compatible since the 8086- which is over THIRTY YEARS OLD.

        The advantage of the x86 instruction set is that it's very compact. It comes at a price of increased decoding complexity, but that problem has already been solved.

        Whoa nelly. Compact? I'm not sure where you got that idea, but it's called CISC and not RISC for a reason! If you think x86 is compact, you might be interested to find out that you can have a fifteen byte instruction [derkeiler.com]. In fact, on the i7 line, the instructions are so complex it's not even worth writing a "real" decoder: they're translated in real time into a RISC instruction set! If Intel would just abandon x86, they could reduce their cores by something like 50%!
        The low number of registers _IS_ a problem. The only reason there are so few is backwards compatibility. It definitely is a problem for scalability: one cannot simply rely on a shared memory architecture to scale vertically indefinitely. You just use too much power as die size increases, and memory doesn't scale up as fast as the number of transistors on a CPU.
        A far better approach is to have a decent model of parallelism (CSP, Pi-calculus, Ambient calculus) underlying the architecture and to provide a simple architecture with primitives supporting features of these calculi, such as channel communication. There are plenty of startups doing things like this, not just Intel, and they already have products on the market, though not desktop processors. Picochip [picochip.com] and Icera [icerasemi.com] to name just a couple, not to mention things like GPGPU (Fermi, etc.)
        Really, the way to go is small, simple, low power cores with on-chip networks which can scale up MUCH better than just the old intel method of "More transistors, increase clock speed, bigger cache".
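        A minimal CSP-flavoured sketch of channel communication: each stage is an independent thread that keeps its state private and talks to its neighbours only through channels. Python has no built-in CSP channels, so a `queue.Queue` bounded to one slot stands in for one here (an approximation: occam channels and Go's unbuffered channels are true rendezvous, whereas this queue buffers a single message).

```python
import queue
import threading

DONE = object()  # end-of-stream marker sent down each channel

def producer(out_ch: queue.Queue, n: int) -> None:
    for i in range(n):
        out_ch.put(i)
    out_ch.put(DONE)

def square_stage(in_ch: queue.Queue, out_ch: queue.Queue) -> None:
    while (msg := in_ch.get()) is not DONE:
        out_ch.put(msg * msg)
    out_ch.put(DONE)                 # propagate end-of-stream downstream

def consumer(in_ch: queue.Queue, sink: list) -> None:
    total = 0                        # state private to this stage
    while (msg := in_ch.get()) is not DONE:
        total += msg
    sink.append(total)

def run_pipeline(n: int) -> int:
    a, b = queue.Queue(maxsize=1), queue.Queue(maxsize=1)  # one-slot channels
    sink: list = []
    stages = [
        threading.Thread(target=producer, args=(a, n)),
        threading.Thread(target=square_stage, args=(a, b)),
        threading.Thread(target=consumer, args=(b, sink)),
    ]
    for t in stages:
        t.start()
    for t in stages:
        t.join()
    return sink[0]

print(run_pipeline(100))
```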

        • by Arlet (29997) on Monday November 22, 2010 @06:35AM (#34304248)

          The only reason x86 is still around (i.e. successful) is because it's pretty much backwards compatible since the 8086- which is over THIRTY YEARS OLD.

          That's a clear testament to scalability when you consider the speed improvement in the last 30 years using basically the same ISA.

          you might be interested to find out that you can have a fifteen byte instruction

          So? It's not the maximum instruction length that counts, but the average. In typical programs that's closer to three. Frequently used opcodes like push/pop only take a single byte. Compare that to the DEC Alpha architecture, where nearly every single instruction uses 15 bits just to tell which registers are used, no matter whether a function needs that many registers.
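          The "15 bits" figure follows from the Alpha's 32-entry integer register file: naming one register takes log2(32) = 5 bits, and a three-operand instruction names three of them inside its fixed 32-bit word. A quick check of that arithmetic (only the register fields are modelled, not the full instruction format):

```python
import math

def register_field_bits(num_registers: int) -> int:
    """Bits needed to name one of `num_registers` registers."""
    return math.ceil(math.log2(num_registers))

ALPHA_REGISTERS = 32        # Alpha integer register file
ALPHA_INSN_BITS = 32        # every Alpha instruction is 32 bits wide
operands = 3                # e.g. Rc <- Ra op Rb in an operate instruction

reg_bits = operands * register_field_bits(ALPHA_REGISTERS)
print(reg_bits, f"of {ALPHA_INSN_BITS} bits ({reg_bits / ALPHA_INSN_BITS:.0%})")
```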

          If Intel would just abandon x86, they could reduce their cores by something like 50%!

          Even if that's true (I doubt it), who cares? The problem is not that Intel has too many transistors for a given area. The problem is just the opposite. They have the capability to put more transistors in a core than they know what to do with. Also, typically half the chip is for the cache memories, and the compact instruction set helps to use that cache memory more effectively.

          one cannot simply rely on a shared memory architecture to scale vertically indefinitely

          Sure you can. Shared memory architectures can do everything explicit channel communication architectures can do, plus you have the benefit that the communication details are hidden from the programmer, allowing improvements to the implementation without having to rewrite your software. Sure, the hardware is more complex, but transistors are dirt cheap, so I'd rather put the complexity in the hardware.

          • by kohaku (797652) on Monday November 22, 2010 @07:24AM (#34304436)

            That's a clear testament to scalability when you consider the speed improvement in the last 30 years using basically the same ISA.

            It's scaled that way until now. We've hit a power wall in the last few years: as you increase the number of transistors on chip it gets more difficult to distribute a faster clock synchronously, so you increase the power, which is why Nehalem is so power hungry, and why you haven't seen clock speeds really increase since the P4. In any case, we're talking about parallelism, not just "increasing the clock speed" which isn't even a viable approach anymore.
            When you said "Compact" I assumed you meant the instruction set itself was compact rather than the average length- I was talking about the hardware needed to decode, not necessarily code density. Even so, x86 is nothing special when it comes to density, especially considered against things like ARM's Thumb-2.
            If you take a look at Nehalem's pipeline, there's a significant chunk of it simply dedicated to translating x86 instructions into RISC uops, which is only there for backwards compatibility. The inner workings of the chip don't even see x86 instructions.
            Sure, you can do everything the same with shared memory and channel comms, but if you have a multi-node system, you're going to be doing channel communication anyway. You also have to consider that memory speed is a bottleneck that just won't go away, and for massive parallelism on-chip networks are just faster. In fact, Intel's QPI and AMD's HyperTransport are examples of on-chip networks; they provide NUMA on Nehalem and whatever AMD has these days. Indeed, in the article, it says

            Mattson has argued that a better approach would be to eliminate cache coherency and instead allow cores to pass messages among one another.

            The thing is, if you want to put more cores on a die, you need either a bigger die or smaller cores. x86 is stuck with larger cores because of all the translation and prediction it's required to do to be both backwards compatible and reasonably well-performing. If you're scaling horizontally like that, you want the simplest core possible, which is why this chip only has 48 cores, and Clearspeed's [clearspeed.com] 2-year-old CSX700 [clearspeed.com] had 192.

            • by Arlet (29997) on Monday November 22, 2010 @07:49AM (#34304550)

              The thing is, if you want to put more cores on a die, you need either a bigger die or smaller cores

              Nobody wants to put more cores on a die, but they're forced to do so because they reach the limits of a single core. I'd rather have as few cores as possible, but have each one be really powerful. Once multiple cores are required, I'd want them to stretch the coherent shared memory concept as far as it will go. When that concept doesn't scale anymore, use something like NUMA.

              Small, message passing cores have been tried multiple times, and they've always failed. The problem is that the requirement of distributed state coherency doesn't go away. The burden only gets shifted from the hardware to the software, where it is just as hard to accomplish, but much slower. In addition, if you try to tackle the coherency problem in software, you don't get to benefit from hardware improvements.

              • by kohaku (797652) on Monday November 22, 2010 @08:33AM (#34304752)

                they're forced to do so because they reach the limits of a single core

                Well yes, but you might as well have argued that nobody wanted to make faster cores but they're limited by current clock speeds... The fact is that you can no longer make cores faster and bigger, you have to go parallel. Even the intel researcher in the article is saying the shared memory concept needs to be abandoned to scale up.
                Essentially there are two approaches to the problem of performance now. Both use parallelism. The first (Nehalem's) is to have a 'powerful' superscalar core with lots of branch prediction and out-of-order logic to run instructions from the same process in parallel. It results in a few high-performance cores that won't scale horizontally (memory bottleneck).
                The second is to have explicit hardware-supported parallelism with many, many simple RISC or MISC cores on an on-chip network. It's simply false to say that small message passing cores have failed. I've already given examples of ones currently on the market (Clearspeed, Picochip, XMOS, and Icera to an extent). It's a model that has been shown time and time again to be extremely scalable; in fact it was done with the Transputer in the late 80s/early 90s [acm.org]. The only reason it's taking off now is because it's the only way forward as we hit the power wall, and shared memory/superscalar can't scale as fast to compete. The reason things like the Transputer didn't take off in mainstream (i.e. desktop) applications is because they were completely steamrolled by what x86 had to offer: an economy of scale, the option to "keep programming like you've always done", and most importantly backwards compatibility. In fact they did rather well in I/O control for things such as robotics, and XMOS continues to do well in that space.
                The "coherency problem" isn't even part of a message passing architecture because the state is distributed amongst the parallel processes. You just don't program a massively parallel architecture in the same way as a shared memory one.
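                A minimal sketch of what "state is distributed amongst the parallel processes" can look like: each worker thread, standing in for a core, exclusively owns one shard of a key-value store, and everyone else reaches that shard only by sending its owner a message and waiting for the reply. Since no data is shared, there are no locks and nothing to keep coherent. The hash-based routing and all class names are illustrative assumptions, not anything from the SCC.

```python
import queue
import threading

class ShardOwner(threading.Thread):
    """Owns a private dict and serves 'put'/'get' requests that arrive by message."""
    def __init__(self) -> None:
        super().__init__(daemon=True)
        self.inbox: queue.Queue = queue.Queue()
        self._store: dict = {}       # touched only by this thread

    def run(self) -> None:
        while True:
            op, key, value, reply = self.inbox.get()
            if op == "stop":
                break
            if op == "put":
                self._store[key] = value
                reply.put(True)
            elif op == "get":
                reply.put(self._store.get(key))

class ShardedStore:
    def __init__(self, num_shards: int) -> None:
        self.owners = [ShardOwner() for _ in range(num_shards)]
        for o in self.owners:
            o.start()

    def _request(self, op: str, key, value=None):
        owner = self.owners[hash(key) % len(self.owners)]  # route to the owner
        reply: queue.Queue = queue.Queue(maxsize=1)
        owner.inbox.put((op, key, value, reply))
        return reply.get()           # block until the owner answers

    def put(self, key, value) -> None:
        self._request("put", key, value)

    def get(self, key):
        return self._request("get", key)

    def close(self) -> None:
        for o in self.owners:
            o.inbox.put(("stop", None, None, None))
            o.join()

store = ShardedStore(num_shards=4)
store.put("alpha", 1)
store.put("beta", 2)
results = (store.get("alpha"), store.get("beta"), store.get("gamma"))
store.close()
print(results)
```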

    • by loufoque (1400831)

      Just take a look at Tilera. It's not open, though.

    • How about developing a small efficient core, where the performance is interesting? Actually, don't even bother; just reuse the DEC Alpha instruction set that is collecting dust at Intel. There is no point in tying these massively parallel architectures to some ancient ISA.

      Technically the cores are not executing x86 instructions. For several architectural generations of Intel chips, the x86 instructions have been translated into a small, efficient instruction set executed by the cores. Intel refers to these core instructions as micro-operations. An x86 instruction is translated on the fly into some number of micro-ops, and these micro-ops are reordered and scheduled for execution. So they have kind of done what you ask; the problem is that they don't give us direct access to the
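      A toy model of that translation step, using a made-up micro-op format: a memory-operand instruction like `add [rbx], rax` splits into separate load, add and store micro-ops, while the register-only form maps to a single one. Intel's actual micro-ops differ from this; the sketch only shows the one-to-many shape of the translation.

```python
from typing import List, Tuple

MicroOp = Tuple[str, ...]

def decode(insn: str) -> List[MicroOp]:
    """Translate a tiny subset of x86-like 'add dst, src' into toy micro-ops.
    Operands in [brackets] are memory references (hypothetical format)."""
    mnemonic, operands = insn.split(maxsplit=1)
    dst, src = (s.strip() for s in operands.split(","))
    if mnemonic != "add":
        raise NotImplementedError(mnemonic)
    if dst.startswith("["):                    # read-modify-write on memory
        addr = dst.strip("[]")
        return [
            ("load", "tmp0", addr),            # tmp0 <- mem[addr]
            ("add", "tmp0", "tmp0", src),      # tmp0 <- tmp0 + src
            ("store", addr, "tmp0"),           # mem[addr] <- tmp0
        ]
    return [("add", dst, dst, src)]            # register form: one micro-op

for insn in ("add rax, rbx", "add [rbx], rax"):
    print(f"{insn:16} -> {decode(insn)}")
```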

  • by Anonymous Coward on Monday November 22, 2010 @03:55AM (#34303670)

    In the future, 1 million cores will probably be the minimum requirement for applications. We will then laugh at these stupid comments...

    Image and audio recognition, true artificial intelligence, handling data from huge numbers of different kinds of sensors, movement of motors (robots), data connections to everything around the computer, virtual worlds with thousands of AI characters with true 3D presentation... etc... etc... will consume all the processing power available.

    1000 cores is nothing... We need much more.

    • Re: (Score:2, Insightful)

      1000 cores at 1 GHz on a single chip, networked to 1000 other chips, would probably just about make a non-real-time simulation of a full human brain possible (going off something I read about this somewhere). Although if it is possible to arbitrarily scale the number of cores, then we might be able to seriously consider building a system of very simple processors acting as electronic neurons.
  • by Animats (122034) on Monday November 22, 2010 @04:34AM (#34303784)

    It's an interesting machine. It's a shared-memory multiprocessor without cache coherency. So one way to use it is to allocate disjoint memory to each CPU and run it as a cluster. As the article points out, that is "uninteresting", but at least it's something that's known to work.
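    A minimal sketch of that "uninteresting" mode: carve one shared buffer into disjoint per-core slices so that no two workers ever write the same bytes, and only read across slices after every worker has finished. Threads and a `bytearray` stand in for cores and the chip's shared memory; this illustrates the partitioning discipline only, not any SCC-specific API.

```python
import threading

def partition(total: int, parts: int):
    """Split range(total) into `parts` contiguous, non-overlapping (start, stop) slices."""
    base, extra = divmod(total, parts)
    bounds, start = [], 0
    for i in range(parts):
        stop = start + base + (1 if i < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds

def fill_private_region(buf: bytearray, start: int, stop: int, core_id: int) -> None:
    # Each "core" writes only inside its own [start, stop) slice.
    for i in range(start, stop):
        buf[i] = core_id

NUM_CORES = 4
shared = bytearray(10)
regions = partition(len(shared), NUM_CORES)
workers = [threading.Thread(target=fill_private_region, args=(shared, s, e, cid))
           for cid, (s, e) in enumerate(regions)]
for w in workers:
    w.start()
for w in workers:
    w.join()  # only look at other cores' regions after everyone is done
print(regions, list(shared))
```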

    Doing something fancier requires a new OS, one that manages clusters, not individual machines. One of the major hypervisors, like Xen, might be a good base for that. Xen already knows how to manage a large number of virtual machines. Managing a large number of real machines with semi-shared memory isn't that big a leap. But that just manages the thing as a cluster. It doesn't exploit the intercommunication.

    Intel calls this "A Platform for Software Innovation". What that means is "we have no clue how to program this thing effectively. Maybe academia can figure it out". The last time they tried that, the result was the Itanium.

    Historically, there have been far too many supercomputer architectures roughly like this, and they've all been duds. The NCube Hypercube, the Transputer, and the BBN Butterfly come to mind. The Cell machines almost fall into this category. There's no problem building the hardware. It's just not very useful, really tough to program, and the software is too closely tied to a very specific hardware architecture.

    Shared-memory multiprocessors with cache coherency have already reached 256 CPUs. You can even run Windows Server or Linux on them. The headaches of dealing with non-cache-coherent memory may not be worth it.

  • by francium de neobie (590783) on Monday November 22, 2010 @04:44AM (#34303820)
    Ok, you can cram 1000 cores into one CPU chip - but feeding all 1000 CPU cores with enough data for them to process and transferring all the data they spit out is gonna be a big problem. Things like OpenCL work now because the high end GPUs these days have 100GB/s+ bandwidth to the local video memory chips, and you're only pulling out the result back into system memory after the GPU did all the hard work. But doing the same thing on a system level - you're gonna have problems with your usual DDR3 modules, your SSD hard disk (even PCI-E based) and your 10GE network interface.
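    Rough arithmetic behind that worry, using the 100 GB/s figure above and assuming the bandwidth is split evenly across cores (a simplification; caches and data reuse change the picture considerably):

```python
def per_core_bandwidth_gb_s(total_gb_s: float, cores: int) -> float:
    """Average bandwidth available to each core if the total is split evenly."""
    return total_gb_s / cores

for cores in (48, 1000):
    mb_s = per_core_bandwidth_gb_s(100.0, cores) * 1000  # GB/s -> MB/s (decimal units)
    print(f"{cores:>5} cores: ~{mb_s:,.0f} MB/s each")
```

    At 1000 cores that is on the order of 100 MB/s per core, so each core has to do a lot of work per byte fetched to stay busy.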
  • by Baldrson (78598) * on Monday November 22, 2010 @04:55AM (#34303854)

    It seems like I've been here before. [slashdot.org]

    A little while ago you asked [slashdot.org] Forth (and now colorForth) originator Chuck Moore about his languages, the multi-core chips he's been designing, and the future of computer languages -- now he's gotten back with answers well worth reading, from how to allocate computing resources on chips and in programs, to what sort of (color) vision it takes to program effectively. Thanks, Chuck!

  • by Required Snark (1702878) on Monday November 22, 2010 @05:02AM (#34303890)
    This is at least the third time that Intel has said that it is going to change the way computing is done.

    The first time was the i432 http://en.wikipedia.org/wiki/Intel_iAPX_432 [wikipedia.org] Anyone remember that hype? Got to love the first line of the Wikipedia article "The Intel iAPX 432 was a commercially unsuccessful 32-bit microprocessor architecture, introduced in 1981."

    The second time was the Itanium (aka Itanic) that was going to bring VLIW to the masses. Check out some of the juicy parts of the timeline also over on Wikipedia http://en.wikipedia.org/wiki/Itanium#Timeline [wikipedia.org]

    1997 June: IDC predicts IA-64 systems sales will reach $38bn/yr by 2001

    1998 June: IDC predicts IA-64 systems sales will reach $30bn/yr by 2001

    1999 October: the term Itanic is first used in The Register

    2000 June: IDC predicts Itanium systems sales will reach $25bn/yr by 2003

    2001 June: IDC predicts Itanium systems sales will reach $15bn/yr by 2004

    2001 October: IDC predicts Itanium systems sales will reach $12bn/yr by the end of 2004

    2002 IDC predicts Itanium systems sales will reach $5bn/yr by end 2004

    2003 IDC predicts Itanium systems sales will reach $9bn/yr by end 2007

    2003 April: AMD releases Opteron, the first processor with x86-64 extensions

    2004 June: Intel releases its first processor with x86-64 extensions, a Xeon processor codenamed "Nocona"

    2004 December: Itanium system sales for 2004 reach $1.4bn

    2005 February: IBM server design drops Itanium support

    2005 September: Dell exits the Itanium business

    2005 October: Itanium server sales reach $619M/quarter in the third quarter.

    2006 February: IDC predicts Itanium systems sales will reach $6.6bn/yr by 2009

    2007 November: Intel renames the family from Itanium 2 back to Itanium.

    2009 December: Red Hat announces that it is dropping support for Itanium in the next release of its enterprise OS

    2010 April: Microsoft announces phase-out of support for Itanium.

    So how do you think it will go this time?

  • cue kilocore debates (Score:2, Interesting)

    by bingoUV (1066850)

    Do 1024 cores constitute a kilocore? Or 1000? I'd love to see that debate move from hard disks to processors.

  • by rebelwarlock (1319465) on Monday November 22, 2010 @07:59AM (#34304598)
    I will need to buy a pair of sunglasses, and crush them when I find that the new Intel processor has over 9000 cores.
  • by Panaflex (13191) <convivialdingo.yahoo@com> on Monday November 22, 2010 @11:44AM (#34306450)

    IMHO the biggest problem with these multi-core chips is the lock latency. Locking in heap all works great, but a shared hw register of locks would save a lot of cache coherency and MMU copies.

    A 1024-slot register with instruction support for mutex and read-write locks would be fantastic.

    I'm developing 20+Gbps applications - we need fast locks and low latency. Snap snap!!!
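    The proposal can be approximated in software as a fixed table of 1024 locks indexed by a hash of the protected key ("lock striping"). This hypothetical sketch models only the mutex half of the idea (Python's standard library has no read-write lock) and none of the latency benefit a hardware register would aim for.

```python
import threading

class LockTable:
    """Fixed-size table of mutexes; each key maps to one slot (lock striping).
    Hypothetical software stand-in for a hardware lock register."""
    def __init__(self, slots: int = 1024) -> None:
        self._locks = [threading.Lock() for _ in range(slots)]

    def lock_for(self, key) -> threading.Lock:
        return self._locks[hash(key) % len(self._locks)]

locks = LockTable()
counters = {"rx": 0, "tx": 0}

def bump(name: str, times: int) -> None:
    for _ in range(times):
        with locks.lock_for(name):   # the same key always maps to the same slot
            counters[name] += 1

threads = [threading.Thread(target=bump, args=(name, 10_000))
           for name in ("rx", "tx") for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counters)
```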

  • by menkhaura (103150) <espinafre@gmail.com> on Monday November 22, 2010 @11:47AM (#34306490)

    Talk is cheap, show me the cores.
