
IEEE Says Multicore is Bad News For Supercomputers

Richard Kelleher writes "It seems the current design of multi-core processors is not well suited to supercomputers. According to IEEE: 'Engineers at Sandia National Laboratories, in New Mexico, have simulated future high-performance computers containing the 8-core, 16-core, and 32-core microprocessors that chip makers say are the future of the industry. The results are distressing. Because of limited memory bandwidth and memory-management schemes that are poorly suited to supercomputers, the performance of these machines would level off or even decline with more cores.'"
  • by jellomizer ( 103300 ) on Friday December 05, 2008 @08:40AM (#26001487)

    I've always felt there was something odd about the recent trend of supercomputers using common hardware components. They have really lost their way in supercomputing by just making a beefed-up PC and running a version of a common OS that could handle it, or clustering a bunch of PCs together. Multi-core technology is good for desktop systems, as it is meant to run a lot of relatively small apps that rarely take advantage of more than 1 or 2 cores per app. In other words, it allows multi-tasking without a penalty. We don't use supercomputers that way. We use them to perform 1 app that takes huge resources, something that would take hours or years on your PC, and have it spit out results in seconds or days. Back in the early-to-mid '90s we had different processors for desktops and supercomputers. Yes, it was more expensive for the supercomputers, but if you were going to pay millions of dollars for a supercomputer, what's the difference if you need to pay an additional $80,000 for more custom processors?

  • It's so obvious... (Score:4, Interesting)

    by Alwin Henseler ( 640539 ) on Friday December 05, 2008 @08:55AM (#26001585)

    That to remove the 'memory wall', main memory and CPU will have to be integrated.

    I mean, look at general-purpose computing systems past & present: there is a somewhat constant relation between CPU speed and memory size. Ever seen a 1 MHz system with a GB of RAM? Ever seen a GHz CPU coupled with a single KB of RAM? Why not? Because, with very few exceptions, heavier compute loads also require more memory space.

    Just like the line between GPU and CPU is slowly blurring, it's just obvious that the parts with the most intensive communication should be the parts closest together. Instead of doubling the number of cores from 8 to 16, why not use those extra transistors to stack main memory directly on top of the CPU core(s)? Main memory would then be split up into little sections, with each section on top of a particular CPU core. I read somewhere that semiconductor processes that are suitable for CPUs aren't that good for memory chips (and vice versa) - don't know if that's true, but if so, let the engineers figure that out.

    Of course things are different with supercomputers. If you have 1000 'processing units', where each PU would consist of, say, 32 cores and some GBs of RAM on a single die, that would create a memory wall between 'local' and 'remote' memory. The on-die section of main memory would be accessible at near CPU speed, while main memory that is part of other PUs would be 'remote', and slow. Hey wait, that sounds like a compute cluster of some kind... (so scientists already know how to deal with it).

    Perhaps the trick would be to make access to memory found on one of the other PUs transparent, so that programming-wise there's no visible distinction between 'local' and 'remote' memory, with some intelligent routing to migrate blocks of data closer towards the core(s) that access them. Maybe that could be done in hardware, maybe it's better done at the software level. Either way: the technology isn't the problem, it's an architectural / software problem.

  • Re:but.. (Score:3, Interesting)

    by peragrin ( 659227 ) on Friday December 05, 2008 @09:01AM (#26001633)

    So you're saying that next-generation processors need a gig of cache, plus 4 gigs of RAM.

    I think what is really needed is new OS designs. Something that is no longer tied quite as closely to the hardware, so that new hardware ideas can be tried.

  • Re:Well doh (Score:1, Interesting)

    by Anonymous Coward on Friday December 05, 2008 @09:02AM (#26001645)

    Why is it important that each CPU/core has its own RAM? How is that more efficient than a *huge* chunk of RAM that could be accessed by any CPU/core?

    I understand there are risks -- concurrent access and the like -- but completely separate RAM seems like an extreme solution if this is the only problem it is trying to address.

  • Re:Well doh (Score:5, Interesting)

    by Targon ( 17348 ) on Friday December 05, 2008 @09:43AM (#26001949)

    You have failed to notice that AMD is already on top of this and can add more memory channels to their processors as needed for the application. This may increase the number of pins the processor has, but that is to be expected.

    You may not have noticed, but there is a difference between AMD Opteron and Phenom processors beyond just the price. The base CPU design may be the same, but AMD and Intel can make special versions of their chips for the supercomputer market and have them work well.

    In a worst case, with support from AMD or Intel, a new CPU with extra pins (and an increased die size) could add as many channels of memory support as required for the application. This is another area where spinning off the fab business might come in handy.

    And yes, this might be a bit expensive, but have you seen the price of a supercomputer?

  • Re:Well doh (Score:2, Interesting)

    by gabebear ( 251933 ) on Friday December 05, 2008 @09:44AM (#26001957) Homepage Journal
    If you read the article, what they are talking about is the growing disparity between CPU speed and access to memory. The stacked memory they are talking about is shoving the physical CPU dies and RAM chips closer together (in the same package) [betanews.com] so that you can have a LOT more interconnects (less wire to screw things up). Build up a huge stack of these and you have a supercomputer cluster WITH fast access to all the memory in the stack. For informatics applications you need random access to any bit of memory in the entire array, and this might do it for them. The biggest problem is heat dissipation...

    The other options I see are creating some kind of super giant shared buffered RAM pool [wikipedia.org] that has high latency but great throughput and then sticking as many cores as they can on a single motherboard (1000+), or finding a wizard with some caching algorithm that will let them stay on commodity hardware (a.k.a. use those extra cores to figure out what you are going to need and optimize for it).

    I'd put my money on them finding a wizard.
  • Re:Well doh (Score:3, Interesting)

    by Targon ( 17348 ) on Friday December 05, 2008 @09:53AM (#26002037)

    A multi-channel memory controller is my response to this. Remember how going to a dual-channel memory controller increased the available bandwidth to memory? Support for even 32 banks of memory could be implemented if the CPU design and connections are there.

    You are thinking along the lines of current computers, not of the applications. People keep quoting the old statement that 640KB should be enough memory for anyone, but then they repeat the very mistake they quote. Not only does the quantity of memory go up, but the way we talk to that memory also evolves over time.

    We used to see CPU to chipset to memory as the way personal computers would work. Since then, AMD has moved to an integrated memory controller on their CPUs, and Intel is finally following the example that AMD set. A dual-channel memory controller used to be the exception, not the rule, but now the idea is very common. In time, a 32-channel memory controller will be standard even in an average home computer. How those channels are used to talk to memory of course remains to be seen, but you get the idea.

  • by David Gerard ( 12369 ) <slashdot@@@davidgerard...co...uk> on Friday December 05, 2008 @09:53AM (#26002041) Homepage
    I eagerly await the Slashdot story about an Apple laptop with liquid nitrogen cooling. Probably Alienware will do it first.
  • by necro81 ( 917438 ) on Friday December 05, 2008 @10:41AM (#26002523) Journal

    A related problem to the speed of memory access is its energy efficiency. According to an IEEE Spectrum Radio [ieee.org] piece interviewing Peter Kogge, current supercomputers can spend many times more energy shuffling bits around than operating on them. Today's computers can do a double-precision (64-bit) floating-point operation using about 100 picojoules. However, it takes upwards of 30 pJ per bit to load the 128 bits of operand data into the CPU's floating-point unit and then move the 64-bit result elsewhere.

    Actual math operations consume 5-10% of a supercomputer's total power, while moving data from A to B is approaching 50%. Most of the optimization and innovation of the past few decades has gone into the compute algorithms in the CPU core, and very little has gone into memory.
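
    To see what those quoted figures imply, here's a rough back-of-the-envelope sketch in C. The constants are just the numbers cited above (100 pJ per double-precision flop, 30 pJ per bit moved, 128 bits of operands in plus a 64-bit result out), used purely as illustrative values:

        /* back-of-the-envelope sketch using the figures quoted above:
           ~100 pJ per 64-bit FP operation, ~30 pJ per bit of data moved */
        #include <stdio.h>

        int main(void) {
            double flop_pj         = 100.0;      /* energy for one double-precision FLOP */
            double move_pj_per_bit = 30.0;       /* energy to move one bit to/from the FPU */
            double bits_moved      = 128 + 64;   /* two 64-bit operands in, one 64-bit result out */
            double move_pj         = move_pj_per_bit * bits_moved;

            printf("compute: %.0f pJ, data movement: %.0f pJ (%.0fx more)\n",
                   flop_pj, move_pj, move_pj / flop_pj);
            return 0;
        }

    Even at this crude level, moving the operands costs over fifty times the arithmetic itself, which is exactly the imbalance Kogge is describing.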

  • by hey! ( 33014 ) on Friday December 05, 2008 @11:02AM (#26002739) Homepage Journal

    It may be true that "That market simply isn't large enough to support an R&D which will consistently outperform commodity hardware at a price people are willing to pay," but that's not quite tantamount to saying "there is no possible rational justification for a larger supercomputer budget." There are considerable inflection points and external factors to consider.

    The market doesn't allocate funds the way a central planner does. A central planner says, "there isn't room in this budget to add to supercomputer R&D." The way the market works is that commodity hardware vendors beat each other down until everybody is earning roughly similar normal profits. Then somebody comes along with a set of ideas that could double the rate at which supercomputer power is increasing. If that person is credible, he is a standout investment, not just despite the fact that there is so much money being poured into commodity hardware, but because of it.

    There may also be reasons for public investment in R&D. Naturally the public has no reason to invest in commodity hardware research, but it may have reason to look at exotic computing research. Suppose that you expected to have a certain maximum practical supercomputer capability in twenty years' time. Suppose you figure that once you have that capability, you could predict a hurricane's track with several times the precision you can today. It'd be quite reasonable to put a fair amount of public research funds into supercomputing in order to have that ability in five to ten years' time instead.

  • by amori ( 1424659 ) on Friday December 05, 2008 @11:18AM (#26002917)
    Earlier this year, I had access to a large supercomputer cluster. Often I would run code on the supercomputer (with 1000+, 100, 10, or 2 CPUs), and then I would try running it on my own dual-core machine, benchmarking the 2-CPU runs for comparison purposes. More than anything, just the manner in which memory was being shared or distributed would influence the end results tremendously. You really have to rethink how you choose to parallelize your vectors when dealing with supercomputers vs. multicore machines. As a researcher, I've found that I don't necessarily have the time to rewrite my code for both scenarios. I think this too might factor in heavily ...
  • by frieko ( 855745 ) on Friday December 05, 2008 @12:40PM (#26003941)
    I think the solution here is to go a bit more fine-grained when defining the "commodity". This seems to be what IBM is doing. Their current strategy is, "we have a sweet-ass core design, you're welcome to slap it on whatever chip you can dream up."

    Thus the "commodity" is the IP design, not the finished chip. If everybody else is doing a chip with 128 cores and one interconnect, they'll be happy to fab you a chip with one core and 128 interconnects.
  • by lysergic.acid ( 845423 ) on Friday December 05, 2008 @03:24PM (#26006065) Homepage

    well, supercomputing has always been about maximizing system performance through parallelism, which can only be done in three main ways: instruction level parallelism, thread level parallelism, and data parallelism.

    ILP can be achieved through instruction pipelining, which means breaking down instructions into multiple stages so that CPU modules can work in parallel and reduce idle time. for instance, in a RISC pipeline you break an instruction down into 5 operations:

    1. instruction fetch
    2. instruction decode / register fetch
    3. instruction execute
    4. memory access
    5. register write-back

    so while the first instruction is still in the decode stage, the CPU is already fetching a second instruction. thus, if fully pipelined, there are no stalls or wasted idle time, a new instruction is loaded every clock cycle, and a maximum of 5 instructions are being processed simultaneously.
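
    to make the timing concrete, here's a toy C sketch (my own illustration, not anything from TFA) that prints which of 5 independent instructions occupies which of the 5 stages above on each clock cycle, assuming no stalls or hazards:

        /* toy model of the 5-stage pipeline above, assuming no stalls or hazards */
        #include <stdio.h>

        int main(void) {
            const char *stage[] = { "IF", "ID", "EX", "MEM", "WB" };
            int n_stages = 5, n_instr = 5;

            for (int cycle = 0; cycle < n_instr + n_stages - 1; cycle++) {
                printf("cycle %d:", cycle + 1);
                for (int i = 0; i < n_instr; i++) {
                    int s = cycle - i;            /* instruction i entered the pipe on cycle i+1 */
                    if (s >= 0 && s < n_stages)
                        printf("  I%d:%-3s", i + 1, stage[s]);
                }
                printf("\n");
            }
            return 0;
        }

    by cycle 5 all five instructions are in flight at once, one per stage, which is where that maximum of 5 simultaneous instructions comes from.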

    then there are superscalar processors, which have redundant functional units--for instance, multiple ALUs, FPUs, or SIMD (vector processing) units. and if each of these functional units is also pipelined, then the result is a processor with an execution rate far in excess of one instruction per cycle.

    thread level parallelism OTOH is achieved through multiprocessing (SMP, ASMP, NUMA, etc.) or multithreading. this is where multicore and multiprocessor systems come in handy. multithreading is generally cheaper to achieve than multiprocessing since fewer processor components need to be replicated.
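
    as a minimal sketch of thread-level parallelism (a toy example of mine, assuming a POSIX system with pthreads; the names partial_sum and slice are just made up for the example), each worker thread sums its own slice of an array and the main thread combines the partial results:

        /* thread-level parallelism sketch: 4 POSIX threads each sum a slice of an array.
           build with: cc -O2 -pthread sum.c */
        #include <pthread.h>
        #include <stdio.h>

        #define N       1000000
        #define THREADS 4

        static double data[N];

        struct slice { int begin, end; double sum; };

        static void *partial_sum(void *arg) {
            struct slice *s = arg;
            s->sum = 0.0;
            for (int i = s->begin; i < s->end; i++)
                s->sum += data[i];
            return NULL;
        }

        int main(void) {
            pthread_t tid[THREADS];
            struct slice part[THREADS];
            double total = 0.0;

            for (int i = 0; i < N; i++)
                data[i] = 1.0;

            for (int t = 0; t < THREADS; t++) {
                part[t].begin = t * (N / THREADS);
                part[t].end   = (t + 1) * (N / THREADS);
                pthread_create(&tid[t], NULL, partial_sum, &part[t]);
            }
            for (int t = 0; t < THREADS; t++) {
                pthread_join(tid[t], NULL);
                total += part[t].sum;
            }
            printf("total = %.0f\n", total);   /* expect 1000000 */
            return 0;
        }

    on a real multicore all four threads end up competing for the same memory bus, which is exactly the bandwidth problem TFA is about.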

    lastly, there's data level parallelism, which is achieved in the form of SIMD (Single Instruction, Multiple Data) vector processors. this type of parallelism, which originated from supercomputing, is especially useful for multimedia applications, scientific research, engineering tasks, cryptography, and data processing/compression, where the same operation needs to be applied to large sets of data. most modern CPUs have some kind of SWAR (SIMD Within A Register) instruction set extension like MMX, 3DNow!, SSE, AltiVec, but these are of limited utility compared to highly specialized dedicated vector processors like GPUs, array processors, DSPs, and stream processors (GPGPU).
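
    and for data-level parallelism, here's a minimal sketch using the standard SSE intrinsics from <xmmintrin.h> (assuming an x86 compiler; again just an illustration, not anything from TFA): a single _mm_add_ps performs four single-precision additions at once.

        /* data-level (SIMD) parallelism sketch: add two float arrays 4 lanes at a time with SSE */
        #include <xmmintrin.h>
        #include <stdio.h>

        int main(void) {
            float a[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };
            float b[8] = { 10, 20, 30, 40, 50, 60, 70, 80 };
            float c[8];

            for (int i = 0; i < 8; i += 4) {
                __m128 va = _mm_loadu_ps(&a[i]);   /* load 4 floats (unaligned load) */
                __m128 vb = _mm_loadu_ps(&b[i]);
                __m128 vc = _mm_add_ps(va, vb);    /* one instruction, four additions */
                _mm_storeu_ps(&c[i], vc);          /* store the 4 results */
            }

            for (int i = 0; i < 8; i++)
                printf("%g ", c[i]);
            printf("\n");                          /* prints: 11 22 33 44 55 66 77 88 */
            return 0;
        }

    a dedicated vector processor or GPU takes the same idea to much wider vectors, which is why it wins so decisively on the workloads listed above.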
