Intel Talks 1000-Core Processors
angry tapir writes "An experimental Intel chip shows the feasibility of building processors with 1,000 cores, an Intel researcher has asserted. The architecture for the Intel 48-core Single Chip Cloud Computer processor is 'arbitrarily scalable,' according to Timothy Mattson. 'This is an architecture that could, in principle, scale to 1,000 cores,' he said. 'I can just keep adding, adding, adding cores.'"
Jeez... (Score:5, Funny)
I hope he never works for Gillette.
Re:Jeez... (Score:4, Funny)
I hope he never works for Gillette.
Obligatory Onion [theonion.com]
Re:Jeez... (Score:4, Funny)
Re:Jeez... (Score:5, Funny)
Other way around; he used to work for Gillette. He left after they cancelled his 1000-blade razor project.
Yes, I also heard about the 1000-blade project getting cut...
Re:Jeez... (Score:4, Funny)
...in the nick of time.
Re: (Score:2)
No worries, I think he works for Spishak [youtube.com] now.
does it run Linux - yea but it is "boring" (Score:2)
From the article: "By installing the TCP/IP protocol on the data link layer, the team was able to run a separate Linux-based operating system on each core." Mattson noted that while it would be possible to run a 48-node Linux cluster on the chip, it "would be boring."
Huh?! Boring?! It would have been a nice first post on Slashdot on the eternal topic - does it run Linux? - to begin with.
Then we have all the programming goodies to follow up with.
Re: (Score:2)
From the article: "By installing the TCP/IP protocol on the data link layer, the team was able to run a separate Linux-based operating system on each core." Mattson noted that while it would be possible to run a 48-node Linux cluster on the chip, it "would be boring."
Huh?! Boring?! It would have been a nice first post on Slashdot on the eternal topic - does it run Linux? - to begin with.
Then we have all the programming goodies to follow up with.
;) To make things interesting, each of the cores would have to use a public Internet IPv4 address.
Re:does it run Linux - yea but it is "boring" (Score:5, Interesting)
Running Linux on a 48-core system is boring, because it has already been run on a 64-core system in 2007 [gigaom.com] (at the time, Tilera [tilera.com] said they would be up to 1000 cores in 2014; they're up to 100 cores per CPU now).
As far as I know, Linux currently supports up to 256 CPUs. I assume that means logical CPUs, so that, for example, this would support one CPU with 256 cores, or one CPU with 128 cores and two hardware threads per core, etc.
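For anyone who wants to check what their own kernel exposes, here is a minimal sketch (assuming glibc, whose sysconf() supports the _SC_NPROCESSORS_* extensions) that prints the logical CPU count:

    #include <stdio.h>
    #include <unistd.h>

    /* sysconf() reports logical processors, i.e. cores times hardware
     * threads, which is the number the CPU-count limit applies to. */
    int main(void)
    {
        long online = sysconf(_SC_NPROCESSORS_ONLN); /* currently online */
        long conf   = sysconf(_SC_NPROCESSORS_CONF); /* configured total */
        printf("online: %ld, configured: %ld\n", online, conf);
        return 0;
    }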
Re: (Score:2)
The most interesting part to me is how they're actually making a built-in router for the chips. The cores communicate through TCP/IP. That's incredible.
Re:does it run Linux - yea but it is "boring" (Score:5, Informative)
Re: (Score:3, Informative)
The current limit on Linux (with 2.6 series) is 8192 CPUs on POWER and 4096 on x86
That's kind-of true, but quite misleading. 8192 is the hard limit, but scheduler and related overhead means that the performance gets pretty poor long before then. Please don't cite the big SGI and IBM machines as counter examples. The SGI machines effectively run a cluster OS, but with hardware distributed shared memory. They are 'single system image' in that they appear to be one OS to the user, but each board has its own kernel, I/O peripherals and memory and works largely independently except when a
Re:Why is 8192 a hard limit? (Score:4, Informative)
The kernel needs some data structures per processor. 8192 means it needs a 13-bit index for them. I'm not certain about the Linux kernel, but in other kernels it's quite common for this index to be squeezed into other values for various reasons, so adding more processors requires you to increase the size of other data structures (often ones designed to be exactly one word long). Not impossible, but more effort than just changing a constant.
The reason for the limit in the Windows NT kernel is that various things use bit masks with processor IDs as the indexes. For example, when defining a processor affinity set, you have an n-bit bitfield (one bit per supported processor), with the bit set if the thread is allowed to run on that processor. At 256 bits (the current limit for Windows), these are already pretty large to scan (especially since the kernel isn't allowed to use SSE instructions, so it potentially takes four 64-bit lsb tests to find the next core to use).
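To make that concrete, here is a minimal sketch of such a scan (hypothetical code, not from the NT kernel; assumes GCC/Clang's __builtin_ctzll), walking a 256-bit affinity mask stored as four 64-bit words:

    #include <stdint.h>

    /* Find the lowest set bit in a 256-bit affinity mask, i.e. the
     * next processor the thread is allowed to run on; up to four
     * 64-bit tests, as described above. Returns -1 if no bit is set. */
    static int next_allowed_cpu(const uint64_t mask[4])
    {
        for (int word = 0; word < 4; word++) {
            if (mask[word] != 0)
                return word * 64 + __builtin_ctzll(mask[word]);
        }
        return -1; /* empty affinity set */
    }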
Re: (Score:3, Interesting)
Re: (Score:3, Funny)
What? Tell me. WHAT ARE THEY SEEING?
... problems with data truncation.
Re: (Score:2)
Message passing between cores? Hmm... (Score:4, Interesting)
Are they trying to reinvent Transputer? :)
But yes, I am happy to see Intel pushing it forward!
Paul B.
Re: (Score:2, Interesting)
But a more crucial thing might be how much heat you can handle on one chip. These guys are already at 25-125 watts, likely depending on how many cores are actually turned on. After all, they're already playing pretty hefty heat-management tricks on current i7s and Phenoms.
http://techreport.com/articles.x/15818/2 [techreport.com]
What use are 48 cores, let alone 1000, if they're all being slowed down to 50% or whatever?
Yep! except for... (Score:2)
Intel would (presumably!) not have to re-invent the *Intel* Paragon! :)
We can throw a Connection Machine in there, and really date ourselves -- but it's still nice to know that CMOS tech has finally caught up with late-80s comp. arch. advances!
And then, don't get me started on the original Tera; with its multithreading it seemed to deliver much better bang per buck of chip real estate than currently accepted multicore solutions. But what would I know...
Paul B.
Re:Message passing between cores? Hmm... (Score:4, Informative)
IF YOU GOT A COMMS ERROR, THE ONLY RECOVERY MECHANISM WAS A TOTAL SYSTEM REBOOT.
That is as crap as you can get! TCP/IP might be an improvement, but HDLC would have cracked the Transputer's problems, and it was already over 15 years old when the Transputer was invented.
Yes I did build a Transputer based system, and yes it did work. (but...)
Re: (Score:3, Informative)
I had a board with 4 T800s in my 286 PC, I wrote a raytracer for it.
The chips were OK but the compilers and development kit were terrible.
Could be good for games using raytracing (Score:4, Insightful)
This is for server/enterprise usage, not consumer usage. That said, it could scale to the number of cores necessary to make realtime raytracing work at 60fps for computer games. Raytracing could be the killer app for cloud gaming services like OnLive, where the power to do it is unavailable for consumer computers, or prohibitively expensive. The only way Microsoft etc. would be able to have comparable graphics in a console in the next few years is if it were rental-only like the Neo-Geo originally was.
Obligatory XKCD (Score:2, Funny)
Bring out your Memes! (Score:5, Funny)
Imagine a Beowulf cluster of th^H^H^H
Ah, forget it, the darn thing practically is one already! :/
"Imagine exactly ONE of those" just doesn't sound the same.
Re:Bring out your Memes! (Score:4, Funny)
I've said it for years: 640K cores ought to be enough for anybody.
accurate representation (Score:5, Interesting)
For us not at SC10 (Score:3, Informative)
The paper referenced in the article can be found here [computer.org].
Fascinating that MPI works that well unmodified.
How many... what's next? (Score:2)
"It's a lot harder than you'd think to look at your program and think 'how many volts do I really need?'" he [Mattson] said.
First it was RAM (640kb should be... doh), then MHz/GHz, then watts, now volts... so, what's next?
(my bet... returning to RAM and the advent of x128)
You wanna impress me? (Score:2, Funny)
Make a processor with four asses.
Future of Programming (Score:4, Interesting)
Re: (Score:3, Interesting)
It's quite something, isn't it, how few people even on Slashdot seem to get this. Old habits die hard, I guess.
Years ago a clever friend of mine clued me in to how important functional programming was going to be.
He was so right, and the real solutions to concurrency (note: not parallelism, which is easy enough in imperative languages) are in the world of FP, or at least mostly FP.
My personal favourite so far is Clojure which has the most comprehensive and realistic approach to concurrency I've seen yet in a language ready for real
Re:Future of Programming (Score:5, Insightful)
Learn a functional language. Learn it not for some practical reason. Learn it because having another view will give you interesting choices even when writing in imperative languages. Every serious programmer should try to look at the important paradigms so that he can freely choose to use them where appropriate.
Re: (Score:2)
All you need is a library that gives you worker threads, queues and synchronization primitives. We've all learned that stuff at some point (and forgot most of it.)
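A minimal sketch of that claim in plain C with POSIX threads (the ring-buffer size and the missing full-queue check are simplifications for brevity):

    #include <stdio.h>
    #include <pthread.h>

    /* A tiny task queue plus worker: mutex, condition variable, ring
     * buffer. This is the whole toolbox the parent is talking about. */
    #define QSIZE 64
    static int queue[QSIZE];
    static int head, tail, count;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t not_empty = PTHREAD_COND_INITIALIZER;

    static void enqueue(int task)
    {
        pthread_mutex_lock(&lock);
        queue[tail] = task;          /* no full-queue check, for brevity */
        tail = (tail + 1) % QSIZE;
        count++;
        pthread_cond_signal(&not_empty);
        pthread_mutex_unlock(&lock);
    }

    static void *worker(void *arg)
    {
        for (;;) {
            pthread_mutex_lock(&lock);
            while (count == 0)
                pthread_cond_wait(&not_empty, &lock);
            int task = queue[head];
            head = (head + 1) % QSIZE;
            count--;
            pthread_mutex_unlock(&lock);
            printf("worker %ld: task %d\n", (long)(size_t)arg, task);
        }
        return NULL;
    }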
Re: (Score:2)
Sorry, but while functional programming style is indeed the future of HPC (with C++), functional languages themselves aren't. Read the research papers of the field and see for yourself.
Benchmarks (Score:3, Insightful)
According to benchmarks [debian.org], a functional language like Erlang is slower than C++ by an order of magnitude. Sure, it can distribute processing over more cores, which is the only thing that enabled it to win one of the benchmarks. I suspect that was only because it used a core library function that was written in C. So no, if you want to write code with acceptable performance, DON'T use a functional language. All CPU intensive programs, like games, are written in C or C++; think about that.
1000 cores is easy! (Score:5, Funny)
1000 cores on a chip isn't too bad. I already have one with 110 cores.
That's only 10 more cores!
Re:1000 cores is easy! (Score:4, Funny)
That has got to be the funniest thing I've read here in a month.
- jesus, that must have been one sad month.
Inter-core communication? (Score:2)
Instruction set... (Score:4, Insightful)
"Performance on this chip is not interesting," Mattson said. It uses a standard x86 instruction set.
How about developing a small efficient core, where the performance is interesting? Actually, don't even bother; just reuse the DEC Alpha instruction set that is collecting dust at Intel.
There is no point in tying these massively parallel architectures to some ancient ISA.
Re: (Score:2)
There's also no reason to throw away an ISA that has proven to be extremely scalable and very successful, just because it's ancient or it looks ugly.
The advantage of the x86 instruction set is that it's very compact. It comes at a price of increased decoding complexity, but that problem has already been solved.
The low number of registers is not a problem. In fact, it may even be an advantage to scalability. A register is nothing more than a programmer-controlled mini cache in front of the memory. I'd rather
Re: (Score:2)
Err, did you just claim cache is as fast as a register access?
Re:Instruction set... (Score:5, Insightful)
Uh, scalable? Not really... The only reason x86 is still around (i.e. successful) is because it's pretty much backwards compatible since the 8086, which is over THIRTY YEARS OLD.
Whoa, Nelly. Compact? I'm not sure where you got that idea, but it's called CISC and not RISC for a reason! If you think x86 is compact, you might be interested to find out that you can have a fifteen-byte instruction [derkeiler.com]. In fact, on the i7 line, the instructions are so complex it's not even worth writing a "real" decoder: they're translated in real time into a RISC instruction set! If Intel would just abandon x86, they could shrink their cores by something like 50%!
The low number of registers _IS_ a problem. The only reason there are only four is backwards compatibility. It definitely is a problem for scalability: one cannot simply rely on a shared-memory architecture to scale vertically indefinitely. You just use too much power as die size increases, and memory just doesn't scale up as fast as the number of transistors on a CPU.
A far better approach is to have a decent model of parallelism (CSP, Pi-calculus, Ambient calculus) underlying the architecture and to provide a simple architecture with primitives supporting features of these calculi, such as channel communication. There are plenty of startups doing things like this, not just Intel, and they already have products on the market, though not desktop processors. Picochip [picochip.com] and Icera [icerasemi.com] to name just a couple, not to mention things like GPGPU (Fermi, etc.)
Really, the way to go is small, simple, low-power cores with on-chip networks, which can scale up MUCH better than the old Intel method of "more transistors, higher clock speed, bigger cache".
Re:Instruction set... (Score:4, Insightful)
That's a clear testament to scalability when you consider the speed improvement in the last 30 years using basically the same ISA.
So? It's not the maximum instruction length that counts, but the average. In typical programs that's closer to three bytes. Frequently used opcodes like push/pop take only a single byte. Compare to the DEC Alpha architecture, where nearly every instruction uses 15 bits just to say which registers are used, no matter whether a function needs that many registers.
Even if that's true (I doubt it), who cares? The problem is not that Intel has too many transistors for a given area. The problem is just the opposite: they have the capability to put more transistors in a core than they know what to do with. Also, typically half the chip is cache memory, and the compact instruction set helps use that cache more effectively.
Sure you can. Shared-memory architectures can do everything explicit channel-communication architectures can do, plus you have the benefit that the communication details are hidden from the programmer, allowing improvements to the implementation without having to rewrite your software. Sure, the hardware is more complex, but transistors are dirt cheap, so I'd rather put the complexity in the hardware.
Re:Instruction set... (Score:4, Interesting)
It's scaled that way until now. We've hit a power wall in the last few years: as you increase the number of transistors on a chip it gets more difficult to distribute a faster clock synchronously, so you increase the power, which is why Nehalem is so power hungry, and why you haven't seen clock speeds really increase since the P4. In any case, we're talking about parallelism, not just "increasing the clock speed", which isn't even a viable approach anymore.
When you said "compact" I assumed you meant the instruction set itself was compact rather than the average length; I was talking about the hardware needed to decode, not necessarily code density. Even so, x86 is nothing special when it comes to density, especially considered against things like ARM's Thumb-2.
If you take a look at Nehalem's pipeline, there's a significant chunk of it simply dedicated to translating x86 instructions into RISC uops, which is only there for backwards compatibility. The inner workings of the chip don't even see x86 instructions.
Sure, you can do everything the same with shared memory and channel comms, but if you have a multi-node system, you're going to be doing channel communication anyway. You also have to consider that memory speed is a bottleneck that just won't go away, and for massive parallelism on-chip networks are just faster. In fact, Intel's QPI and AMD's HyperTransport are examples of on-chip networks; they provide NUMA on Nehalem and whatever AMD have these days. Indeed, in the article, it says
The thing is, if you want to put more cores on a die, you need either a bigger die or smaller cores. x86 is stuck with larger cores because of all the translation and prediction it's required to do to be both backwards compatible and reasonably well-performing. If you're scaling horizontally like that, you want the simplest core possible, which is why this chip only has 48 cores, and Clearspeed's [clearspeed.com] 2-year-old CSX700 [clearspeed.com] had 192.
Re:Instruction set... (Score:5, Interesting)
Nobody wants to put more cores on a die, but they're forced to do so because they reach the limits of a single core. I'd rather have as few cores as possible, but have each one be really powerful. Once multiple cores are required, I'd want them to stretch the coherent shared memory concept as far as it will go. When that concept doesn't scale anymore, use something like NUMA.
Small, message passing cores have been tried multiple times, and they've always failed. The problem is that the requirement of distributed state coherency doesn't go away. The burden only gets shifted from the hardware to the software, where it is just as hard to accomplish, but much slower. In addition, if you try to tackle the coherency problem in software, you don't get to benefit from hardware improvements.
Re:Instruction set... (Score:4, Interesting)
Well yes, but you might as well have argued that nobody wanted to make faster cores but they're limited by current clock speeds... The fact is that you can no longer make cores faster and bigger; you have to go parallel. Even the Intel researcher in the article is saying the shared memory concept needs to be abandoned to scale up.
Essentially there are two approaches to the problem of performance now. Both use parallelism. The first (Nehalem's) is to have a 'powerful' superscalar core with lots of branch prediction and out-of-order logic to run instructions from the same process in parallel. It results in a few high-performance cores that won't scale horizontally (memory bottleneck).
The second is to have explicit hardware-supported parallelism with many, many simple RISC or MISC cores on an on-chip network. It's simply false to say that small message-passing cores have failed. I've already given examples of ones currently on the market (Clearspeed, Picochip, XMOS, and Icera to an extent). It's a model that has been shown time and time again to be extremely scalable; in fact it was done with the Transputer in the late 80s/early 90s [acm.org]. The only reason it's taking off now is that it's the only way forward as we hit the power wall, and shared memory/superscalar can't scale fast enough to compete. The reason things like the Transputer didn't take off in mainstream (i.e. desktop) applications is that they were completely steamrolled by what x86 had to offer: an economy of scale, the option to "keep programming like you've always done", and most importantly backwards compatibility. In fact they did rather well in I/O control for things such as robotics, and XMOS continues to do well in that space.
The "coherency problem" isn't even part of a message-passing architecture, because the state is distributed amongst the parallel processes. You just don't program a massively parallel architecture the same way as a shared-memory one.
Re: (Score:3, Interesting)
Which is of course what is already being done, but whether that's the best approach remains to be seen. Communication is always the bottleneck in HPC systems, and many processors on chip with a fast interconnect seems to do very well, at least for Picochip (though it is a DSP chip, I think it's a valid comparison
Re: (Score:3, Insightful)
A typical consumer desktop machine, running typical programs, for instance. In order to use these cores effectively, all these programs need to be rewritten. Imagine your word processor reformatting a 500-page document on 1000 cores. It's just not going to work very well.
How about the operating system? 1000 different cores all trying to access a file system on a single physical drive. How are you going to run that efficiently?
Re: (Score:3, Interesting)
The way we do it now is a single filesystem layer which is, at all times, in a single coherent state. With today's shared-memory systems, and cache coherency guaranteed by the hardware, that's reasonably easy to accomplish.
The current filesystem concept just doesn't map onto 1000 non-coherent cores.
Re: (Score:3, Interesting)
The '93 era Pentium they're talking about only has 3 million transistors, and only a fraction are needed to handle the x86 instruction set. Current transistor count goes into the billions, so as far as real estate goes, you can put 1000 Pentium class cores on a single die, despite the x86 translations.
Of course, the whole concept of a 1000 cores running on a single die is only going to serve a small niche of applications.
Re: (Score:3, Informative)
Well, that would be true, but the really complex x86 instructions are rarely [strchr.com] used, so you're not really adding much in the way of code density, and you have to add a lot of hardware complexity to decode them. Not only that, more complex instructions mean bigger pipelines, which mean bigger branch penalties.
Re: (Score:2)
Just take a look at tilera. It's not open though.
Cores are not executing x86 instructions (Score:2)
How about developing a small efficient core, where the performance is interesting? Actually, don't even bother; just reuse the DEC Alpha instruction set that is collecting dust at Intel. There is no point in tying these massively parallel architectures to some ancient ISA.
Technically the cores are not executing x86 instructions. For several architectural generations of Intel chips, the x86 instructions have been translated into a small, efficient instruction set executed by the cores. Intel refers to these core instructions as micro-operations. An x86 instruction is translated on the fly into some number of micro-ops, and these micro-ops are reordered and scheduled for execution. So they have kind of done what you ask; the problem is that they don't give us direct access to the
1000 cores is nothing (Score:5, Interesting)
In the future, 1 million cores will probably be the minimum requirement for applications. We will laugh at these stupid comments then...
Image and audio recognition, true artificial intelligence, handling data from huge numbers of different kinds of sensors, motor control (robots), data connections to everything around the computer, virtual worlds with thousands of AI characters in true 3D... etc., etc... will consume all the processing power available.
1000 cores is nothing... We need much more.
Re: (Score:2, Insightful)
Re: (Score:3, Funny)
Yes, an while we are at it gat a working EMH/ECH(star-trek voyager) and a mobuile emitor :)
Or a working spell check program ;-)
"Build it and they will come" - NOT (Score:5, Informative)
It's an interesting machine. It's a shared-memory multiprocessor without cache coherency. So one way to use it is to allocate disjoint memory to each CPU and run it as a cluster. As the article points out, that is "uninteresting", but at least it's something that's known to work.
Doing something fancier requires a new OS, one that manages clusters, not individual machines. One of the major hypervisors, like Xen, might be a good base for that. Xen already knows how to manage a large number of virtual machines. Managing a large number of real machines with semi-shared memory isn't that big a leap. But that just manages the thing as a cluster. It doesn't exploit the intercommunication.
Intel calls this "A Platform for Software Innovation". What that means is "we have no clue how to program this thing effectively. Maybe academia can figure it out". The last time they tried that, the result was the Itanium.
Historically, there have been far too many supercomputer architectures roughly like this, and they've all been duds. The NCube Hypercube, the Transputer, and the BBN Butterfly come to mind. The Cell machines almost fall into this category. There's no problem building the hardware. It's just not very useful, really tough to program, and the software is too closely tied to a very specific hardware architecture.
Shared-memory multiprocessors with cache coherency have already reached 256 CPUs. You can even run Windows Server or Linux on them. The headaches of dealing with non-cache-coherent memory may not be worth it.
I/O and memory bandwidth (Score:4, Insightful)
Deja Vu from a decade ago (Score:3, Informative)
It seem like I've been here before. [slashdot.org]
Remember the last couple of times this happened? (Score:5, Informative)
The first time was the i432 http://en.wikipedia.org/wiki/Intel_iAPX_432 [wikipedia.org] Anyone remember that hype? Got to love the first line of the Wikipedia article "The Intel iAPX 432 was a commercially unsuccessful 32-bit microprocessor architecture, introduced in 1981."
The second time was the Itanium (aka Itanic) that was going to bring VLIW to the masses. Check out some of the juicy parts of the timeline also over on Wikipedia http://en.wikipedia.org/wiki/Itanium#Timeline [wikipedia.org]
1997 June: IDC predicts IA-64 systems sales will reach $38bn/yr by 2001
1998 June: IDC predicts IA-64 systems sales will reach $30bn/yr by 2001
1999 October: the term Itanic is first used in The Register
2000 June: IDC predicts Itanium systems sales will reach $25bn/yr by 2003
2001 June: IDC predicts Itanium systems sales will reach $15bn/yr by 2004
2001 October: IDC predicts Itanium systems sales will reach $12bn/yr by the end of 2004
2002 IDC predicts Itanium systems sales will reach $5bn/yr by end 2004
2003 IDC predicts Itanium systems sales will reach $9bn/yr by end 2007
2003 April: AMD releases Opteron, the first processor with x86-64 extensions
2004 June: Intel releases its first processor with x86-64 extensions, a Xeon processor codenamed "Nocona"
2004 December: Itanium system sales for 2004 reach $1.4bn
2005 February: IBM server design drops Itanium support
2005 September: Dell exits the Itanium business
2005 October: Itanium server sales reach $619M/quarter in the third quarter.
2006 February: IDC predicts Itanium systems sales will reach $6.6bn/yr by 2009
2007 November: Intel renames the family from Itanium 2 back to Itanium.
2009 December: Red Hat announces that it is dropping support for Itanium in the next release of its enterprise OS
2010 April: Microsoft announces phase-out of support for Itanium.
So how do you think it will go this time?
cue kilocore debates (Score:2, Interesting)
Do 1024 cores constitute a kilocore? Or 1000? I'd love to see that debate move from hard disks to processors.
In the near future... (Score:4, Funny)
Biggest problem and a fix... (Score:3, Interesting)
IMHO the biggest problem with these multi-core chips is lock latency. Locks in heap memory all work great, but a shared hardware register file of locks would save a lot of cache-coherency traffic and MMU copies.
A 1024-slot lock register with instruction support for mutexes and read-write locks would be fantastic.
I'm developing 20+Gbps applications - we need fast locks and low latency. Snap snap!!!
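For contrast, a minimal sketch of what that hardware would replace: a 1024-slot lock array done in software with C11 atomics (illustrative only; the slot numbering and API are made up):

    #include <stdatomic.h>

    /* 1024 spinlock slots in ordinary memory; static zero-init leaves
     * the flags clear on mainstream implementations. Every lock/unlock
     * can bounce a cache line between cores -- the coherency traffic
     * the parent wants a shared hardware lock register to eliminate. */
    static atomic_flag locks[1024];

    static void slot_lock(int slot)
    {
        while (atomic_flag_test_and_set_explicit(&locks[slot],
                                                 memory_order_acquire))
            ; /* spin until the previous holder releases the slot */
    }

    static void slot_unlock(int slot)
    {
        atomic_flag_clear_explicit(&locks[slot], memory_order_release);
    }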
Paraphrasing Torvalds... (Score:3, Insightful)
Talk is cheap, show me the cores.
Re: (Score:2)
The chip, first fabricated with a 45-nanometer process at Intel facilities about a year ago, is actually a six-by-four array of tiles, each tile containing two cores. It has more than 1.3 billion transistors and consumes from 25 to 125 watts.
Re: (Score:3, Funny)
Re:Temperature? (Score:5, Interesting)
Dude, what the fuck, that's only 48 cores. How does that get you anywhere close to 1000?
Well, Watson, that's elementary...
Therefore, on top of the computation benefits derived from fully utilizing 1000 cores, one would have a pretty good heat source: 2150 watts or so. One's choice what to do with it, but it's far too high for a domestic-sized slow cooker (the dishes would come with a weird burned taste).
Satisfied now?
If not, to put things in perspective: assuming our ancestors (who could use only horses as a source of power) had wanted to use this computer, they'd need approx. 2.68 horses... but hey, wow... what a delight to play the MMORPG so smoothly... especially in "farming/grinding" phases.
PS. The above computations are meant to be funny and/or an exercise in approximating from insufficient data and/or venting some frustration caused by "all work and no play"; definitely wasted time... Ah, yes, some karma would be nice, but not mandatory.
Re: (Score:2)
Re: (Score:2)
Right. The really interesting chips will arrive when you run between four and sixteen cores with the entirety of main RAM for those cores (in a NUMA configuration with other sockets, starting with maybe a gigabyte or so per die). You could then use SDRAM both as a paging file and as cache between the storage system and the processor/memory die.
You could map registers straight to portions of the on-chip memory if necessary for backwards compatibility. You'd probably be better off, though, compiling nearly
Workaround, yeah (Score:2)
Er, yeah, pretty much everyone knows they have no practical way to make the clock speed much faster. The only thing they can do is proliferate cores beyond all reason. Nobody has the slightest idea how to take advantage of that many cores in normal household use and even most workstation use.
Re:Workaround, yeah (Score:5, Informative)
You've obviously never worked in Aerospace.
I can bring a quad core Xeon system to its knees running Catia. (I mean, 100% saturation, all 4 cores, with IO contention.) I do it fairly regularly too.
Might have something to do with the NP-hard problem of resolving tangencies on extremely complex NURBS surfaces (aircraft skins).
Granted, that is not a "normal" workstation; But I would be VERY happy indeed to have a 1000 core workstation at my disposal. Maybe then I could actually work with Gulfstream's horrible part models where they include literally the whole god-damn aircraft's surface geometry in the digital part model for a fucking bolt. (Guess what happens when you load several such models, and digitally assemble them. I have seen a 64 bit workstation allocate over 8gb of swap because of them and their dumbassery.)
Now, if I could get one with over 1TB of RAM installed too, then I'd be in business.
Re:Workaround, yeah (Score:5, Interesting)
In my field it would be real time conflict detection between aircraft. The better your conflict detection, the more aircraft you can pack in to small volumes of space. There is a lot of money in that.
Performance gains from multithreading not clock (Score:2)
Am I the only one feeling this is just a foray into multicore chips because they hit a brick wall when it comes to faster single-core CPUs?
For many years (at least 5, possibly more) Intel has been telling developers that future performance gains will come from multithreading not faster clock speeds. So no, you are not the only one feeling this way. :-)
Re: (Score:2)
Why have 1000 cores when you can have 1 MILLION CORES, (all running applications that can barely take advantage of 1 or 2)
Your computer only runs 1 application at a time?
Re: (Score:2)
While scalable from a computing POV (data exchange, addressing, whatnot), I can imagine that it's not scalable from a physical POV: power supply, size, heat dissipation, and getting your signals to and from the chip over longer and longer distances.
The last part is already becoming an issue due to the long-cable problem: at 3 GHz, a signal travels only about 10 cm before the next signal is produced. One core communicating with another over a distance of just 5 cm would have the problem that the data from one
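That 10 cm figure is just the speed of light divided by the clock; a back-of-the-envelope upper bound, since on-chip signals actually propagate noticeably slower than c:

    distance per cycle = c / f = (3 x 10^8 m/s) / (3 x 10^9 Hz) = 0.1 m = 10 cm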
Re: (Score:2)
Just how small does your penis need to be to need a 1,000 cores?
That's what it takes to run Flash these days.
Re: (Score:3, Insightful)
The only thing I'd be compensating for is the fact I can't do calculations at Exaflop rates in my head.
Just like my car only compensates for the fact I can't run at 165mph. :)
Re: (Score:3, Interesting)
Why? :) I know. meme. It's just, I've built a couple Beowulf clusters for fun, and didn't have an application written to use MPI (or any of the alphabet soup of protocols), so it was just an exercise, not for any practical use. It's not like most of us are crunching numbers hard enough to need one, and it won't help out playing games or even building kernels.
I'd like to see a 1k core machine on my desktop, but that's beyond the practical limits of any software currently available.
Re: (Score:2, Insightful)
Why would you care to see one on your desktop? Do you have any use for one? There's a point where, except for supercomputers, enough is enough. We've probably already passed it.
Re: (Score:2)
Re:Imagine (Score:5, Informative)
Re: (Score:2)
Re: (Score:2, Interesting)
Re: (Score:3, Funny)
Why would you care to see one on your desktop? Do you have any use for one?
You got that right. I've never used more than 639 K of RAM either.
Re:Imagine (Score:5, Informative)
Linux can only go to 256 cores.
Uhmm no.
arch/ia64/Kconfig: int "Maximum number of CPUs (2-4096)"
arch/powerpc/platforms/Kconfig.cputype: int "Maximum number of CPUs (2-8192)"
In x86 we have:
config MAXSMP
bool "Enable Maximum number of SMP Processors and NUMA Nodes"
depends on X86_64 && SMP && DEBUG_KERNEL && EXPERIMENTAL
And I believe you can crank that dial all the way up
Also consider this: the number of cores in my desktop has been doubling every year or two (starting from a single-core chip), and 6 and 8 cores are cheap now, so we'll be at 1024 in roughly 7-14 years. That makes sense because the GHz war is done and simply adding more cores is relatively cheap (once you have the interconnect, making a bigger CPU isn't all that hard).
This is NOT a cache-coherent/SMP machine! (Score:3, Insightful)
The key difference between this research chip and the other multicore chips Intel has worked on, like Larrabee, is that it is explicitly NOT cache coherent, i.e. it is a cluster on a chip instead of a single-image multiprocessor.
This means, among many other things, that you cannot load a single Linux OS across all the cores, you need a separate executive on every core.
Compare this with the 7-8 Cell cores in a PS3.
Terje
Re:Imagine (Score:5, Interesting)
depends on X86_64 && SMP && DEBUG_KERNEL && EXPERIMENTAL
And I believe you can crank that dial all the way up
Also consider this: the number of cores in my desktop has been doubling every year or two (starting from a single-core chip), and 6 and 8 cores are cheap now, so we'll be at 1024 in roughly 7-14 years. That makes sense because the GHz war is done and simply adding more cores is relatively cheap (once you have the interconnect, making a bigger CPU isn't all that hard).
Don't you worry, the GHz war is not done!
There's talk of exotic materials (SiC, diamond, etc...) going to 10 GHz. If someone figures out how to make the Rapid Single Flux Quantum [wikipedia.org] digital chips with high temperature superconductors, then we may seriously start to see 1 THz clock speeds in practical computers, using extreme Peltier cooling to get the CPU core down to cryogenic temps.
Re:Imagine (Score:5, Interesting)
The GHz war is over. The speed of light won. A long time ago, it stopped being "all about the transistor" and started being "all about the wires". IBM won the race to copper in 180nm (back when it was 0.18um), and that helped make those technologies even better, but about the time we hit 90nm, semiconductors were "fast enough", or even by some measurements stopped being able to speed up. Since then, almost all speed increases have been largely (but not exclusively) due to the transistors getting smaller, reducing the distance wires need to go.
The RC delay of wires is the major problem. R isn't going to get much better than copper. Silver has slightly lower resistance, but it's too reactive to be used anywhere real. In these geometries, any alloy would be insufficiently mixable to be reliable, to say nothing of more exotic materials (like ceramics). There's some room for improvement in the dielectric (the "C"), but by the time you tick all the boxes (water permeability, a thermal coefficient of expansion close to that of the wires, mechanical properties friendly to sub-micron manufacturing), you have to concede you're not going to get more than 20% faster there (and even that we could dispute separately).
Take a cache. The slowest path is a memory cell read. That tiny little device needs to produce a measurable change in voltage on the bitlines and be picked up by a sensing structure. That sensing structure has nothing to do with storage, so it's pure overhead, and thusly you want as few of them as possible. Can you have it 16 bits away? 32? The days are gone when it could be 64 bits away with any meaningful performance. There's nothing you can do to the characteristics of that little device (which needs to be minimum feature size to maximize the density of the cache) to let it dominate the characteristics of the bitline it's trying to drive.
Take a data path. Even if 95% of your data is highly predictable, easily pipelined stuff with local signals, your critical path is going to involve signals from other areas of the chip, and they're going to have to be rebuffered and trucked from hundreds of microns away. No giant buffer in the history of man can dominate over a long distance wire. The signal will show up "eventually".
3GHz is a good place to stop. We make it to 4GHz with compromises in power, but beyond that you're dedicating so much of your chip to rebuffering that you're blowing a lot of power on it. At that point, your pipeline is so many stages that branch mispredicts are very painful. You're devoting so much of your cycle time to setup and hold for your latches that you'd be embarrassed at how little work you can do in each cycle.
1 THz clock speeds are on their way, and maybe even higher. But they're not useful to CPUs or GPUs. They're useful for more exotic applications, primarily technology demonstrations.
Re: (Score:3, Interesting)
http://www.sgi.com/products/servers/altix/uv/ [sgi.com]
2,048 cores (256 sockets) and 16TB of memory, one OS image.
Re: (Score:3, Informative)
Why? :) I know. meme. It's just, I've built a couple Beowulf clusters for fun, and didn't have an application written to use MPI (or any of the alphabet soup of protocols), so it was just an exercise, not for any practical use. It's not like most of us are crunching numbers hard enough to need one, and it won't help out playing games or even building kernels.
I'd like to see a 1k core machine on my desktop, but that's beyond the practical limits of any software currently available. Linux can only go to 256 cores. Windows 2008 tops out at 64. But hey, if they did come to market, I know who would be first to support all those cores, and it doesn't come from Redmond (or their offshore outsourced developers).
Ummm, no. Windows 2008 can handle 64 SOCKETS; it currently scales to 256 cores.
Real time raytracing of course! (Score:3, Informative)
Isn't that Intel's pet project for the last decade?
Re:Imagine (Score:5, Interesting)
Pretty much anything that I've written in Erlang uses (at least) a few thousand concurrent processes. I've never tried running it on more than a 64-core machine, but when I moved stuff from my single-core laptop to a 64-core SGI machine the load was pretty evenly distributed.
It's pretty easy to write concurrent code that scales as long as you respect one rule: No data may be both mutable and aliased. You can do this in object-oriented languages with the actor model, but languages like Erlang enforce it for you (at the cost of a few redundant copies).
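A minimal sketch of that rule in plain C with POSIX threads (illustrative, not Erlang's actual mechanism; assumes at most one outstanding message, for brevity): the sender hands the only pointer to a buffer through a one-slot mailbox and never touches it again, so the data is never simultaneously mutable and aliased.

    #include <pthread.h>
    #include <stddef.h>

    static pthread_mutex_t mbox_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t mbox_full = PTHREAD_COND_INITIALIZER;
    static int *mailbox; /* NULL means empty */

    /* Transfer ownership of msg; the caller must not use msg afterwards. */
    static void send(int *msg)
    {
        pthread_mutex_lock(&mbox_lock);
        mailbox = msg;
        pthread_cond_signal(&mbox_full);
        pthread_mutex_unlock(&mbox_lock);
    }

    /* Blocks until a message arrives; the receiver becomes sole owner. */
    static int *receive(void)
    {
        pthread_mutex_lock(&mbox_lock);
        while (mailbox == NULL)
            pthread_cond_wait(&mbox_full, &mbox_lock);
        int *msg = mailbox;
        mailbox = NULL;
        pthread_mutex_unlock(&mbox_lock);
        return msg;
    }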
Re: (Score:3, Interesting)
I don't know what extra detail you need - the rule should be pretty self-explanatory. If something is shared between two or more threads, it should be immutable. If something is mutable, only one thread / process should hold references to it.
The only exception to this rule is explicitly synchronised communication objects (message queues, process handles, and suchlike). If you follow this rule, then the only concurrency problems that you will have are caused by high-level design problems, rather than b
Re: (Score:2)
Imagine a Beowulf with all of the overhead, and none of the speed.
For simplicity's sake, the team used an off-the-shelf 1994-era Pentium processor design for the cores themselves. "Performance on this chip is not interesting," Mattson said. It uses a standard x86 instruction set.
Re: (Score:3, Insightful)
Basically, we are going to need compilers that automatically take advantage of all that parallelism without making you think about it too much, and programming languages that are designed to make your programs parallel-friendly. Even Microsoft is finally starting to edge in this direction with F# and some new features of .NET 4.0. Look at Haskell and Erlang for examples of languages that take such things more seriously, even if the world takes them less seriously.
I don't know about AI, but almost certainly
Re: (Score:2)
True. CPUs like that would be a godsend for tasks like 3D rendering (entertainment industry, architectural visualization, ...).
Re: (Score:2)
Aerospace mockup and simulation... (Imagine, a "down to the last bolt" NURBS model, with dynamic stress simulation... Can theoretically be done now, but I have never seen a workstation in any production engineering department get closer than perhaps a wing segment before crashing the workstation.)
I greedily await such a future.
Re: (Score:2)
Here's an example of how it's used in the entertainment sector:
http://www.youtube.com/watch?v=JhJauu_vB2A [youtube.com]
Basically linking a 3D suite up to a camera to get motion data, reference point positioning and other information to allow more seamless integration and even low-quality previews during the shoot. It's the technique from "Avatar", which was combined with two-stage motion capturing to make those shots possible ( http://www.wired.com/magazine/2009/11/ff_avatar_5steps/ [wired.com] ).
Even with twin-hexacore workstations
Re: (Score:3, Informative)
"Lazy CPU designers" hah!
GPUs are severely limited in the types of tasks they can run. Instead of a 1600-core GPU, pretend your CPU has a single large SIMD register that can hold 1600 floats. Now, it would be great at crunching large matrices of floats and utterly suck at everything else. That's a GPU in a nutshell. GPUs are absolutely horrible at branches. If one core takes a branch, every core in that core's group must stall and wait for the branch to finish. All cores must be working on the same instruction at the
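A toy model of that stall in plain C (not real GPU code; lane count and operations invented for illustration): every lane executes both sides of a branch, with results masked, so divergence costs the time of both paths.

    /* Toy model of one GPU "core group": all lanes run the SAME
     * instruction each step; a branch becomes two masked passes. */
    #define LANES 32

    void warp_step(float x[LANES], const int take[LANES])
    {
        /* then-path: all lanes spend this time, results kept where take[i] */
        for (int i = 0; i < LANES; i++)
            if (take[i]) x[i] = x[i] * 2.0f;
        /* else-path: executed afterwards for the remaining lanes */
        for (int i = 0; i < LANES; i++)
            if (!take[i]) x[i] = x[i] + 1.0f;
    }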
Consumer oriented products can use many cores (Score:2)
Okay, I'm sure some high-end consumers would benefit from this, but I think the majority of consumers will not.
As a game developer I have to say consumers could benefit. And no I am not necessarily thinking about more graphical eye candy. For example I would like to have hundreds of cores working on AI for computer controlled characters/units.