
IEEE Says Multicore is Bad News For Supercomputers

Richard Kelleher writes "It seems the current design of multi-core processors is not good for the design of supercomputers. According to IEEE: 'Engineers at Sandia National Laboratories, in New Mexico, have simulated future high-performance computers containing the 8-core, 16-core, and 32-core microprocessors that chip makers say are the future of the industry. The results are distressing. Because of limited memory bandwidth and memory-management schemes that are poorly suited to supercomputers, the performance of these machines would level off or even decline with more cores.'"
  • by suso ( 153703 ) * on Friday December 05, 2008 @07:06AM (#26001307) Journal

    Sounds like it's time for supercomputers to go their own way again. I'd love to see some new technologies.

    • by jellomizer ( 103300 ) on Friday December 05, 2008 @07:40AM (#26001487)

      I've always felt there was something odd about the recent trend of supercomputers using common hardware components. They have really lost their way in supercomputing by just making a beefed-up PC and running a version of a common OS that could handle it, or clustering a bunch of PCs together. Multi-core technology is good for desktop systems because it is meant to run a lot of relatively small apps, rarely taking advantage of more than 1 or 2 cores per app. In other words, it allows multitasking without a penalty. We don't use supercomputers that way. We use them to perform one app that takes huge resources, something that would take hours or years on your PC, and spit out results in seconds or days. Back in the early-to-mid 90's we had different processors for desktops and supercomputers. Yes, it was more expensive for the supercomputers, but if you were going to pay millions of dollars for a supercomputer, what's the difference if you need to pay an additional $80,000 for more custom processors?

      • by suso ( 153703 ) *

        Yes, I agree, and I have the same odd feeling. The first time I read an article about (I think) Los Alamos ordering a supercomputer with 8192 Pentium Pro processors in it, I was like WTF?

        I miss the days when supercomputers looked like alien technology or Raiders of the Lost Ark.

        • by postbigbang ( 761081 ) on Friday December 05, 2008 @09:37AM (#26002471)

          Looks are deceptive.

          The problem with multicores relates to the fact that the cores are processors, but the relationship to other cores and to memory isn't fully 'cross-bar'. Sun did a multi-CPU architecture that's truly crossbar (meaning that there are no dirty-cache problems or semaphore latencies) among the processors, but the machine was more of a technical achievement than a decent workhorse to use in day-to-day stuff.

          Still, cores are cores. More cores aren't necessarily better until you fix what they describe. And it doesn't matter what they look like at all. Like any other system, it's what's under the hood that counts. Esoteric-looking shells are there for marketing purposes and cost-justification.

          • by lysergic.acid ( 845423 ) on Friday December 05, 2008 @02:24PM (#26006065) Homepage

            well, supercomputing has always been about maximizing system performance through parallelism, which can only be done in three main ways: instruction level parallelism, thread level parallelism, and data parallelism.

            ILP can be achieved through instruction pipelining, which means breaking down instructions into multiple stages so that CPU modules can work in parallel and reduce idle time. for instance, in a RISC pipeline you break an instruction down into 5 operations:

            1. instruction fetch
            2. instruction decode / register fetch
            3. instruction execute
            4. memory access
            5. register write-back

            so while the first instruction is still in the decode stage the CPU is already fetching a second instruction. thus if fully-pipelined there are no stalls or wasted idle time, and a new instruction is loaded every clock cycle, resulting in a maximum of 5 parallel instructions being processed simultaneously.
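
            to put rough numbers on that, here's a toy cycle-count model (my own sketch, not tied to any real pipeline): with 5 stages and one new instruction issued per cycle, n instructions finish in roughly n+4 cycles instead of 5n.

            #include <stdio.h>

            /* toy cycle-count model of a 5-stage in-order pipeline (illustrative only) */
            int main(void)
            {
                const long stages = 5;
                for (long n = 1; n <= 1000000; n *= 100) {
                    long unpipelined = stages * n;       /* one instruction finishes before the next starts */
                    long pipelined   = stages + (n - 1); /* a new instruction enters every cycle            */
                    printf("%8ld instructions: %9ld cycles unpipelined, %9ld pipelined (%.2fx)\n",
                           n, unpipelined, pipelined, (double)unpipelined / pipelined);
                }
                return 0;
            }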

            then there are superscalar processors, which have redundant functional units--for instance, multiple ALUs, FPUs, or SIMD (vector processing) units. and if each of these functional units are also pipelined, then the result is a processor with an execution rate far in excess of one instruction per cycle.

            thread level parallelism OTOH is achieved through multiprocessing (SMP, ASMP, NUMA, etc.) or multithreading. this is where multicore and multiprocessor systems come in handy. multithreading is generally cheaper to achieve than multiprocessing since fewer processor components need to be replicated.

            lastly, there's data level parallelism, which is achieved in the form of SIMD (Single Instruction, Multiple Data) vector processors. this type of parallelism, which originated from supercomputing, is especially useful for multimedia applications, scientific research, engineering tasks, cryptography, and data processing/compression, where the same operation needs to be applied to large sets of data. most modern CPUs have some kind of SWAR (SIMD Within A Register) instruction set extension like MMX, 3DNow!, SSE, AltiVec, but these are of limited utility compared to highly specialized dedicated vector processors like GPUs, array processors, DSPs, and stream processors (GPGPU).
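
            as a concrete example of SWAR-style data parallelism, here's a minimal sketch using SSE intrinsics (assumes an x86 CPU with SSE and a compiler that ships xmmintrin.h); a single _mm_add_ps does four single-precision adds at once:

            #include <xmmintrin.h>   /* SSE intrinsics */
            #include <stdio.h>

            /* add two float arrays four elements at a time (SIMD within a register) */
            static void add_simd(const float *a, const float *b, float *out, int n)
            {
                int i = 0;
                for (; i + 4 <= n; i += 4) {
                    __m128 va = _mm_loadu_ps(a + i);             /* load 4 floats           */
                    __m128 vb = _mm_loadu_ps(b + i);
                    _mm_storeu_ps(out + i, _mm_add_ps(va, vb));  /* 4 adds, one instruction */
                }
                for (; i < n; i++)                               /* scalar tail             */
                    out[i] = a[i] + b[i];
            }

            int main(void)
            {
                float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
                float b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
                float c[8];
                add_simd(a, b, c, 8);
                for (int i = 0; i < 8; i++)
                    printf("%g ", c[i]);
                printf("\n");
                return 0;
            }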

      • by virtual_mps ( 62997 ) on Friday December 05, 2008 @08:07AM (#26001679)

        It's very simple. Intel & AMD spend about $6bn/year on R&D. The total supercomputing market is on the order of $35bn (out of a global IT market on the order of $1000bn) and a big chunk of that is spent on storage, people, software, etc., rather than processors. That market simply isn't large enough to support an R&D effort which will consistently outperform commodity hardware at a price people are willing to pay. Even if a company spent a huge amount of money developing a breakthrough architecture which dramatically outperformed existing hardware, the odds are that the commodity processors would catch up before that innovator recouped its development costs. Certainly they'd catch up before everyone rewrote their software to take advantage of the new architecture. The days when Seymour Cray could design a product which was cutting edge & saleable for a decade are long gone.

        • If they can make superconducting FETs [newscientist.com] that can be manufactured on ICs, I could see there being a very big difference that will last until they can reach liquid nitrogen temperatures (at which point it goes mainstream and cryogenics turns into a boom industry for a while).
          • Re: (Score:3, Interesting)

            by David Gerard ( 12369 )
            I eagerly await the Slashdot story about an Apple laptop with liquid nitrogen cooling. Probably Alienware will do it first.
        • by hey! ( 33014 ) on Friday December 05, 2008 @10:02AM (#26002739) Homepage Journal

          It may be true that "that market simply isn't large enough to support an R&D effort which will consistently outperform commodity hardware at a price people are willing to pay," but that's not quite tantamount to saying "there is no possible rational justification for a larger supercomputer budget." There are considerable inflection points and external factors to consider.

          The market doesn't allocate funds the way a central planner does. A central planner says, "there isn't room in this budget to add to supercomputer R&D." The way the market works is that commodity hardware vendors beat each other down until everybody is earning roughly similar normal profits. Then somebody comes along with a set of ideas that could double the rate at which supercomputer power is increasing. If that person is credible, he is a standout investment, not just despite the fact that there is so much money being poured into commodity hardware, but because of it.

          There may also be reasons for public investment in R&D. Naturally the public has no reason to invest in commodity hardware research, but it may have reason to look at exotic computing research. Suppose that you expected to have a certain maximum practical supercomputer capability in twenty years' time. Suppose you figure that once you have that capability you could predict a hurricane's track with several times the precision you could today. It'd be quite reasonable to put a fair amount of public research funds into supercomputing in order to have that ability in five to ten years' time.

          • Re: (Score:3, Interesting)

            by frieko ( 855745 )
            I think the solution here is to go a bit more fine-grained when defining the "commodity". This seems to be what IBM is doing. Their current strategy is, "we have a sweet-ass core design, you're welcome to slap it on whatever chip you can dream up."

            Thus the "commodity" is the IP design, not the finished chip. If everybody else is doing a chip with 128 cores and one interconnect, they'll be happy to fab you a chip with one core and 128 interconnects.
          • The problem is that no idea doubles the rate at which supercomputers advance. Most of the ideas out there jump forward, but they do it once. Vectors, streams, reconfigurable computing. All of these buzzwords once were the next big thing in supercomputing. Today everyone is talking about GPGPUs. None of them go very far. How much engineering goes into the systems? How long does it take to get to market? How difficult is it to rewrite all the algorithms to take advantage of the new machine? What proportion o

            • by hey! ( 33014 )

              I'm talking about the scenario TFA proposes: that directions in technology cause supercomputing advancement to stall. In that case, expressed as a ratio, any advancement at all would be infinite. However, I don't expect improvements in supercomputing will go all the way down to zero.

              Now, why couldn't the rate of improvement double over some timescale from what it is now? I think it is because investors don't care a rat's ass about the rate of technological advance; they care about having something to sel

      • by Retric ( 704075 ) on Friday December 05, 2008 @08:17AM (#26001757)
        Modern CPUs have 8+ megabytes of L2/L3 cache on chip, so RAM is only a problem when your working set is larger than that. The problem supercomputing folks are having is that they want to solve problems that don't really fit in L3 cache, which creates significant problems, but they still need a large cache. However, because of speed-of-light issues, off-chip RAM is always going to be high latency, so you need to use some type of on-chip cache or stream lots of data to the chip.

        There are really only 2 options for modern systems when it comes to memory: you can have lots of cores and a tiny cache like GPUs, or lots of cache and fewer cores like CPUs (ignoring type-of-core issues and on-chip interconnects etc.). So there is little advantage to paying 10x per chip to go custom vs using more, cheaper chips when they can build supercomputers out of CPUs, GPUs, or something between them like the Cell processor.
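
        A rough way to see that working-set cliff for yourself (buffer sizes and the cache threshold are machine-dependent; this is just a sketch): chase pseudo-random indices through buffers of growing size and watch the time per access jump once the buffer no longer fits in the last-level cache.

        #include <stdio.h>
        #include <stdlib.h>
        #include <time.h>

        /* Time pseudo-random accesses into buffers of growing size. Once the
         * working set no longer fits in the last-level cache, most loads go to
         * DRAM and the time per access jumps. (Illustrative sketch only.)      */
        int main(void)
        {
            const long accesses = 1L << 24;              /* same number of loads at every size */
            for (size_t mb = 1; mb <= 64; mb *= 2) {
                size_t n = mb * 1024 * 1024 / sizeof(long);
                long *buf = malloc(n * sizeof(long));
                if (!buf) return 1;
                for (size_t i = 0; i < n; i++) buf[i] = i;

                volatile long sum = 0;
                size_t idx = 0;
                clock_t t0 = clock();
                for (long k = 0; k < accesses; k++) {
                    idx = (idx * 1103515245u + 12345u) % n;   /* defeat the prefetcher */
                    sum += buf[idx];
                }
                double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
                printf("%3zu MB working set: %.1f ns per access\n",
                       mb, 1e9 * secs / (double)accesses);
                free(buf);
            }
            return 0;
        }
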
        • by necro81 ( 917438 ) on Friday December 05, 2008 @09:41AM (#26002523) Journal

          A related problem to the speed of memory access is its energy efficiency. According to an IEEE Spectrum Radio [ieee.org] piece interviewing Peter Kogge, current supercomputers can spend many times more energy shuffling bits around than operating on them. Today's computers can do a double-precision (64-bit) floating-point operation using about 100 picojoules. However, it takes upwards of 30 pJ per bit to get the 128 bits of operand data loaded into the floating-point unit of the CPU, and then to move the 64-bit result elsewhere.

          Actual math operations consume 5-10% of a supercomputer's total power, while moving data from A to B is approaching 50%. Most optimization and innovation in the past few decades has gone into compute algorithms in the CPU core, and very little has gone into memory.
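
          Back-of-the-envelope, taking those figures at face value: 128 operand bits x 30 pJ/bit is roughly 3,800 pJ, plus about another 1,900 pJ to move the 64-bit result, versus about 100 pJ for the flop itself. That puts data movement on the order of 50x the cost of the arithmetic.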

        • Modern CPU's have 8+ Mega Bytes of L2/L3 cache on chip so RAM is only a problem when your working set it larger than that.

          Unfortunately, most modern apps require far more working set than that! The crappy .NET app I use at work had a working set of 700MB today.

          The other issue is that HPC applications generally require small amounts of processing on lots and lots of snippets of data - ie highly parallel processing. This means that memory bandwidth is a very significant bottleneck.

          you can have lot's of cores and a tiny cache like GPU's

          Incidentally, GPUs have a lot of cache - my graphics card has 512MB RAM.

      • by AlpineR ( 32307 ) <wagnerr@umich.edu> on Friday December 05, 2008 @08:18AM (#26001759) Homepage

        My supercomputing tasks are computation-limited. Multicores are great because each core shares memory and they save me the overhead of porting my simulations to distributed memory multiprocessor setups. I think a better summary of the study is:

        Faster computation doesn't help communication-limited tasks. Faster communication doesn't help computation-limited tasks.
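
        The shared-memory convenience here looks something like this OpenMP sketch (a minimal illustration, assuming an OpenMP-capable compiler such as gcc with -fopenmp): every core sees the same arrays, so there is no explicit message passing or data partitioning to write.

        #include <omp.h>
        #include <stdio.h>

        #define N (1 << 20)

        /* A dot product where all threads share the same arrays in one address
         * space - no explicit communication, just a parallel loop.              */
        int main(void)
        {
            static double a[N], b[N];
            double dot = 0.0;

            for (long i = 0; i < N; i++) {       /* set up some data */
                a[i] = 0.5 * i;
                b[i] = 2.0 * i;
            }

            #pragma omp parallel for reduction(+:dot)
            for (long i = 0; i < N; i++)
                dot += a[i] * b[i];

            printf("dot = %g using up to %d threads\n", dot, omp_get_max_threads());
            return 0;
        }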


        • Faster computation doesn't help communication-limited tasks. Faster communication doesn't help computation-limited tasks.

          I thought the same thing. Years ago with the massively-parallel architectures you could have said that massively-parallel architectures don't help inherently serial tasks.

          The other thing I wonder is how server and desktop tasks will drive the multi-core architecture. It may be the case that many of the common server and desktop tasks have massive IO need (gaming?). The current memory a

        • Re: (Score:3, Insightful)

          by ipoverscsi ( 523760 )

          Faster computation doesn't help communication-limited tasks. Faster communication doesn't help computation-limited tasks.

          Computation is communication. It's communication between the CPU and memory.

          The problem with multicore is that, as you add more cores, the increased bus contention causes the cores to stall so they cannot compute. This is why many real supercomputers have memory local to each CPU. Cache memory can help, but just adding more cache per core yields diminishing returns. SMP will only get you so far in the supercomputer world. You have to go NUMA for performance, which means custom code and algorith

          • Re: (Score:3, Insightful)

            by Zebra_X ( 13249 )

            Yeah, if you buy Intel chips. Despite the fact that they are slower clock-for-clock than the new Intel chips, AMD's architecture was and is the way to go, which is of course why Intel has copied it (i7). If you properly architect the chips to contain all of the "proper" plumbing, then this becomes less of a problem. Unfortunately, Intel has for the past few years simply cobbled together "cores" that are nothing more than processors linked via a partially adequate bus. So when contention goes up they

      • by timeOday ( 582209 ) on Friday December 05, 2008 @08:44AM (#26001955)
        IMHO this study is not an indictment against the use of today's multi-core processors for supercomputers or anything else. They're simply pointing out that in the future (as core counts continue to grow exponentially) some memory bandwidth advances will be needed. The implication that today's multi-core processors are best suited for games is silly - where they're really well utilized is in servers, and they work very well. The move towards commodity processors in supercomputing wasn't some kind of accident; it occurred because that's what currently gets the best results. I'd expect a renaissance in true supercomputing just as soon as it's justified, but I wouldn't hold my breath.
      • Re: (Score:3, Informative)

        by yttrstein ( 891553 )
        We still have different processors for desktops and supercomputers.

        http://www.cray.com/products/XMT.aspx

        Rest assured, there are still people who know how to build them. They're just not quite as popular as they used to be, now that a middle manager who has no idea what the hell they're talking about can go to an upper manager with a spec sheet that's got 8 thousand processors on it and say "look! This one's got a whole ton more processors than that dumb Cray thing!"
        • The same is true of some of IBM's offerings. They still run Linux on PowerPC, but they are just using Linux as an I/O scheduler and the PowerPC chips as I/O controllers.
      • by TapeCutter ( 624760 ) on Friday December 05, 2008 @09:04AM (#26002145) Journal
        "Multi-Core technology is good for desktop systems as it is meant to run a lot of relatively small apps Rarely taking advantage of more then 1 or 2 cores. per app.In other-words it allows Multi-Tasking without a penalty. We don't use super computers that way. We use them to to perform 1 app that takes huge resources that would take hours or years on your PC and spit out results in seconds or days."

        Sorry, but that's not entirely correct: most supercomputers work on highly parallel problems [earthsimulator.org.uk] using numerical analysis [wikipedia.org] techniques. By definition the problem is broken up into millions of smaller problems [bluebrain.epfl.ch] that make ideal "small apps"; a common consequence is that the bandwidth of the communications between the 'small apps' becomes the limiting factor.
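
        The standard shape of those 'small apps' is a halo exchange: each MPI rank owns a slab of the grid and only swaps boundary values with its neighbours each step, which is exactly where the inter-node bandwidth bites. A minimal sketch (assumes an MPI implementation such as Open MPI or MPICH; sizes are illustrative):

        #include <mpi.h>
        #include <stdio.h>

        #define LOCAL 1024   /* interior points owned by each rank (illustrative size) */

        int main(int argc, char **argv)
        {
            MPI_Init(&argc, &argv);
            int rank, size;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &size);

            double u[LOCAL + 2];                 /* +2 ghost cells for the neighbours */
            for (int i = 0; i < LOCAL + 2; i++) u[i] = rank;

            int left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
            int right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

            /* Exchange boundary values, then relax the interior: the classic
             * "lots of small problems plus halo exchange" structure.           */
            for (int step = 0; step < 100; step++) {
                MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left,  0,
                             &u[LOCAL + 1], 1, MPI_DOUBLE, right, 0,
                             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Sendrecv(&u[LOCAL], 1, MPI_DOUBLE, right, 0,
                             &u[0], 1, MPI_DOUBLE, left,  0,
                             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                for (int i = 1; i <= LOCAL; i++)
                    u[i] = 0.5 * (u[i - 1] + u[i + 1]);   /* toy stencil update */
            }

            if (rank == 0) printf("done on %d ranks\n", size);
            MPI_Finalize();
            return 0;
        }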

        "Back in the early-mid 90's we had different processors for Desktop and Super Computers."

        The Earth Simulator was referred to in some parts as 'computenick'; its speed jump over its nearest rival and its longevity at the top marked the renaissance [lbl.gov] of "vector processing" after it had been largely ignored during the 90's.

        In the end a supercomputer is a purpose built machine, if cores fit the purpose then they will be used.
      • Back in the 90s, there were custom super-computer processors (both vector and scalar), that were faster than desktop processors for all supercomputing tasks. This hit a wall, as the desktop processors became faster than the custom processors, at least for some tasks. If you can get a processor that's faster for some tasks and slower for others, but costs 1/10th the price of the other, you're probably going to go with the cheap one. The world has petaflop computers because of the move to commodity parts. Noo

      • by mikael ( 484 )

        Early supercomputers were built from custom chips designed for specific applications along with a custom network topology. This may have reduced the energy demands of the system, but meant that the system was good for one application only.

        Also, different supercomputers would have different network topologies depending upon the application. It became immediately obvious that a single bus shared between a group of CPUs wasn't going to achieve peak performance, so different architectures were developed for ea

    • Maybe they should all just be simulated at Sandia!
    • Re: (Score:2, Funny)

      That does it...I'm not buying a supercomputer this Christmas!

    • >> I'd love to see some new technologies.

      Yeah, it would be nice to see Quantum Computing ( http://en.wikipedia.org/wiki/Quantum_Computer [wikipedia.org] ) finally add a couple of arbitrary integers. Despite the many publications on the subject, it smells like the superstring theory of computing. Hope that's not the case.

    • Sounds like its time for supercomputers to go their own way again. I'd love to see some new technologies.

      While supercomputers might have to come up with unique architectures again, vector processing isn't it. The issue here is the total bandwidth to a single shared view of a large amount of memory. Off-the-shelf PC cores with their SIMD units are already too fast for the available memory bandwidth; swapping those cores out for vector units won't do anything to solve the problem. (Especially if you were to go back to the original Cray approach of streaming vectors from main memory with no caches, which would just

      • Re: (Score:3, Informative)

        by David Greene ( 463 )

        Cray did not stream vectors from memory. One of the advances of the Cray-1 was the use of vector registers as opposed to, for example, the Burroughs machines which streamed vectors directly to/from memory.

        We know how to build memory systems that can handle large vectors. Both the Cray X1 and Earth Simulator demonstrate that. The problem is that those memory systems are currently too expensive. We are going to see more and more vector processing in commodity processors.

        • So it had a tiny 4 kilobyte, manually allocated cache (aka vector registers). But yes, they still had to be juggled by streaming vectors to and from memory. That approach still requires more memory bandwidth than current cache architectures, and it wouldn't solve the memory issues any more effectively than today's common designs, which just happen to stream cache blocks instead of vectors.

    NEC still makes the SX-9 vector system, and Cray still sells X2 blades that can be installed into their XT5 super. So vector processors are available; they just aren't very popular, mostly due to cost per flop.

      A vector processor implements an instruction set that is slightly better than a scalar processor at doing math, considerably worse than a scalar processor at branch-heavy code, but orders of magnitude better in terms of memory bandwidth. The X2, for example, has four 25-gigaflop cores per node, which share 64 cha

  • Well doh (Score:5, Insightful)

    by Kjella ( 173770 ) on Friday December 05, 2008 @07:17AM (#26001367) Homepage

    If you make a simulation like that while keeping the memory interface constant, then of course you'll see diminishing returns. That's why we're not still running plain old FSBs: AMD has HyperTransport, Intel has QPI, the AMD Horus system expands it up to 32 sockets / 128 cores, and I'm sure something similar can and will be built as a supercomputer backplane. The headline is more than a little sensationalist...

    • Re:Well doh (Score:5, Insightful)

      by cheater512 ( 783349 ) <nick@nickstallman.net> on Friday December 05, 2008 @07:34AM (#26001449) Homepage

      There are limits, however, to what you can do.
      It's not like multi-processor systems where each CPU gets its own RAM.

      • Re:Well doh (Score:5, Interesting)

        by Targon ( 17348 ) on Friday December 05, 2008 @08:43AM (#26001949)

        You have failed to notice that AMD is already on top of this and can add more memory channels to their processors as needed for the application. This may increase the number of pins the processor has, but that is to be expected.

        You may not have noticed, but there is a difference between AMD Opteron and Phenom processors beyond just the price. The base CPU design may be the same, but AMD and Intel can make special versions of their chips for the supercomputer market and have them work well.

        In a worst case, with support from AMD or Intel, a new CPU with extra pins (and an increased die size) could add as many channels of memory support as required for the application. This is another area where spinning off the fab business might come in handy.

        And yes, this might be a bit expensive, but have you seen the price of a supercomputer?

      • by Lumpy ( 12016 )

        No, but by using dual or quad cores with a crapload of RAM each you do get a benefit.

        A 128-processor quad-core supercomputer will be faster than a 128-processor single-core supercomputer.

        You get a benefit.

      • Re: (Score:3, Informative)

        by bwcbwc ( 601780 )

        Actually that is part of the problem. Most of the architectures have core-specific L1 cache, and unless a particular thread has its affinity mapped to a particular core, a thread can jump from a core where its data is in the L1 cache to a core where its data is not present, and is forced to undergo a cache refresh from memory.
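
        On Linux you can pin the affinity by hand; a minimal sketch using glibc's non-portable pthread_setaffinity_np extension (the core number is just an example):

        #define _GNU_SOURCE
        #include <pthread.h>
        #include <sched.h>
        #include <stdio.h>

        /* Pin the calling thread to one core so its hot data stays in that core's cache. */
        static int pin_to_core(int core)
        {
            cpu_set_t set;
            CPU_ZERO(&set);
            CPU_SET(core, &set);
            return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
        }

        static void *worker(void *arg)
        {
            int core = *(int *)arg;
            if (pin_to_core(core) != 0)
                fprintf(stderr, "could not pin to core %d\n", core);
            /* ... do the cache-sensitive work here ... */
            return NULL;
        }

        int main(void)
        {
            pthread_t t;
            int core = 0;
            pthread_create(&t, NULL, worker, &core);
            pthread_join(t, NULL);
            return 0;
        }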

        Also, regardless of whether a system is multi-processing within a chip (multi-core) or on a board (multi-CPU), the number of communication channels required to avoid communication bott

        • So yes, we are probably seeing the beginning of the end of performance gains using general-purpose CPU interconnects and have to go back to vector processing. Unless we are somehow able to jump the heat dissipation barrier and start raising GHz again.

          That's what the superconducting FETs [newscientist.com] are for; just wait a few years (or a couple of decades) for them to get something that can be made on an IC and works at liquid nitrogen temperatures.

    • This really is a problem that doesn't exist. The issue at hand is that if you have all cores cranking away, you run out of bandwidth. Simple solution - don't run all cores, and continue to scale horizontally as they currently do. So if an 8-core CPU has the bandwidth you need, only buy 8-core CPUs. If your CPUs run out of bandwidth at 16 cores (or whatever), then only buy up to 16-core CPUs, passing on the 32-core CPUs.

      Wow, that's a hard problem to solve. Next.

  • Yeah! (Score:5, Funny)

    by crhylove ( 205956 ) <rhy@leperkhanz.com> on Friday December 05, 2008 @07:51AM (#26001549) Homepage Journal

    Once we get to 32 or 64 core cpus that cost less than $100 (say, five years), I'd HATE to have a beowulf cluster of those!

  • by theaveng ( 1243528 ) on Friday December 05, 2008 @07:53AM (#26001559)

    >>>"After about 8 cores, there's no improvement," says James Peery, director of computation, computers, information, and mathematics at Sandia. "At 16 cores, it looks like 2 cores."
    >>>

    That's interesting, but how does it affect us, the users of "personal computers"? Can we extrapolate that buying a CPU with more than 8 cores is a waste of dollars, because it will actually run slower?

    • What they are saying doesn't apply. The problem is specifically that a supercomputer usually runs one big demanding program. Right now, my task manager says that I am running 76 processes to look at the internet. So I could easily benefit from extra cores as each process being run could go on a separate core.
      • I've only ever used a QuadCore PC once in my life:

        - Core 1 was 100% utilized.
        - Core 2 was only 25%.
        - Cores 3 and 4 were sitting idle doing nothing.

        It was clocked at 2000 megahertz and based upon what I observed, it doesn't look like I'm "hurting" myself by sticking with my "singlecore" 3100 megahertz Pentium. The multicores don't seem to be used very well by Windows, and my singlecore Pentium might actually be faster for my main purpose (web browsing/watching tv shows).

    • Re: (Score:3, Funny)

      by David Gerard ( 12369 )
      It will only affect you if you're running ForecastFoxNG, where you can set the weather and the CPU will calculate where the butterfly should flap to get the effect you want (M-x butterfly).
    • I still don't understand the problem... they are saying 1 computer with a limited throughput bus technology will be limited by adding more cores... well... duh? If it was a supercomputer, then chances are they will open the bandwidth to the processors/memory etc.

      Make it a car analogy: you can transport cargo, each car = core. You can only transport 2 cargo loads down a 2-lane road; adding a 3rd car makes them go slower... and thinking the engineers won't think about expanding the road is idiotic.

      Adding c
  • It's so obvious... (Score:4, Interesting)

    by Alwin Henseler ( 640539 ) on Friday December 05, 2008 @07:55AM (#26001585)

    That to remove the 'memory wall', main memory and CPU will have to be integrated.

    I mean, look at general-purpose computing systems past & present: there is a somewhat constant relation between CPU speed and memory size. Ever seen a 1 MHz system with a GB of RAM? Ever seen a GHz CPU coupled with a single KB of RAM? Why not? Because with very few exceptions, heavier compute loads also require more memory space.

    Just like the line between GPU and CPU is slowly blurring, it's just obvious that the parts with the most intensive communication should be the parts closest together. Instead of doubling the number of cores from 8 to 16, why not use those extra transistors to stack main memory directly on top of the CPU core(s)? Main memory would then be split up in little sections, with each section on top of a particular CPU core. I read sometime that semiconductor processes that are suitable for CPUs aren't that good for memory chips (and vice versa) - don't know if that's true, but if so, let the engineers figure that out.

    Of course things are different with supercomputers. If you have 1000 'processing units', where each PU would consist of, say, 32 cores and some GBs of RAM on a single die, that would create a memory wall between 'local' and 'remote' memory. The on-die section of main memory would be accessible at near CPU speed; main memory that is part of other PUs would be 'remote', and slow. Hey wait, sounds like a compute cluster of some kind... (so scientists already know how to deal with it).

    Perhaps the trick would be to make access to memory found on one of the other PU's transparent, so that programming-wise there's no visible distinction between 'local' and 'remote' memory. With some intelligent routing to migrate blocks of data closer towards the core(s) that access it? Maybe that could be done in hardware, maybe that's better done on a software level. Either way: the technology isn't the problem, it's an architectural / software problem.
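
    On today's NUMA boxes you can already steer some of this by hand rather than transparently; a minimal sketch using libnuma on Linux (link with -lnuma; which nodes exist is machine-dependent):

    #include <numa.h>     /* libnuma (libnuma-dev / numactl-devel) */
    #include <stdio.h>

    /* Place one buffer on the calling thread's own NUMA node and another on
     * the highest-numbered node, instead of leaving placement to chance.     */
    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support on this machine\n");
            return 1;
        }

        size_t bytes = 64UL * 1024 * 1024;
        int far_node = numa_max_node();                     /* highest-numbered node  */

        double *near_buf = numa_alloc_local(bytes);         /* memory on our own node */
        double *far_buf  = numa_alloc_onnode(bytes, far_node);

        if (near_buf && far_buf)
            printf("allocated %zu MB locally and %zu MB on node %d\n",
                   bytes >> 20, bytes >> 20, far_node);

        if (near_buf) numa_free(near_buf, bytes);
        if (far_buf)  numa_free(far_buf, bytes);
        return 0;
    }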

    • P.S. Your idea of putting memory on the CPU is certainly workable. The very first CPU to integrate memory was the 80486 (8 kilobyte cache), so the idea has been proven sound since at least 1990.

      • The very first CPU to integrate memory was the 80486 (8 kilobyte cache), so the idea has been proven sound since at least 1990.

        I seem to recall the 68020 (1984?) having an instruction cache. (Though a lot smaller than 8k, if I recall.)

        • You are correct. The Motorola 68020 had 1/4 kilobyte of memory onboard, and was also a true 32-bit processor in 1984.

          I should have known. Motorola CPUs were always more-advanced than Intel. Of course I'm biased since I always preferred Amigas and Macs. ;-)

          • Motorola CPUs were always more-advanced than Intel. Of course I'm biased since I always preferred Amigas and Macs. ;-)

            Or maybe you're not biased, and preferred those machines because of their more advanced tech. :-)

    • by AlXtreme ( 223728 ) on Friday December 05, 2008 @08:44AM (#26001959) Homepage Journal

      You mean something like a CPU cache? I assume you know that every core already has a cache (L1) on multi-core [wikipedia.org] systems, and shares a larger cache (L2) between all cores.

      The problem is that on/near-core memory is damn expensive, and your average supercomputing task requires significant amounts of memory. When the bottleneck for high performance computing becomes memory bandwidth instead of interconnect/network bandwidth you have something a lot harder to optimize, so I can understand where the complaint in IEEE comes from.

      Perhaps this will lead to CPUs with large L1 caches specifically for supercomputing tasks, who knows...

      • by DRobson ( 835318 )

        Perhaps this will lead to CPUs with large L1 caches specifically for supercomputing tasks, who knows...

        Even discounting price concerns, L1 caches can only increase a certain amount. As the capacity increases, so does the search time for the data, until you find yourself with access times equivalent to the next level down the cache hierarchy, thus negating the use of L1. L1 needs to be /quite/ fast for it to be worthwhile.

        • Re: (Score:3, Informative)

          by TheRaven64 ( 641858 )
          More likely is a move to something like the Cell's design. Cache is by definition hidden from the programmer, but on-die SRAM doesn't have to be cache; it can be explicitly managed memory with instructions to bulk-fetch from the slower external DRAM. For supercomputer applications, this would probably be more efficient, and it lets you get rid of all of the cache-coherency logic and use the space for more ALUs or SRAM.
    • Re: (Score:3, Insightful)

      by Funk_dat69 ( 215898 )

      Of course things are different with supercomputers. If you have 1000 'processing units', where each PU would consist of, say, 32 cores and some GBs of RAM on a single die, that would create a memory wall between 'local' and 'remote' memory. The on-die section of main memory would be accessible at near CPU speed; main memory that is part of other PUs would be 'remote', and slow. Hey wait, sounds like a compute cluster of some kind... (so scientists already know how to deal with it).

      It also sounds like you are describing the Cell processor setup. Each SPU has local memory on-die but cannot do operations on main (remote) memory. Each SPU also has a DMA engine that will grab data from main memory and bring it into its local store. The good thing is you can overlap the DMA transfer and the computation, so the SPUs are constantly burning through computation.

      This does help against the memory wall, and it is a big reason why Roadrunner is so damn fast.
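
      The overlap is plain old double buffering. A sketch of the shape of it, with stub dma_get()/dma_wait() helpers standing in for the real MFC calls (on a real SPU these would be asynchronous; here they just copy so the sketch runs anywhere):

      #include <stdio.h>
      #include <string.h>

      #define BLOCK 4096   /* elements per local-store block (illustrative) */

      static float lsbuf[2][BLOCK];                 /* two "local store" buffers */

      /* Stub "DMA" helpers standing in for the platform's real calls;
       * they copy synchronously here so the sketch is self-contained.  */
      static void dma_get(float *local, const float *remote, int n, int tag)
      {
          (void)tag;
          memcpy(local, remote, n * sizeof(float));
      }
      static void dma_wait(int tag) { (void)tag; }

      static float process(const float *block, int n)   /* the actual computation */
      {
          float s = 0.0f;
          for (int i = 0; i < n; i++) s += block[i];
          return s;
      }

      /* Double buffering: start fetching block i+1, then compute on block i. */
      static float stream_sum(const float *main_mem, long total)
      {
          float result = 0.0f;
          int cur = 0;
          dma_get(lsbuf[cur], main_mem, BLOCK, cur);            /* prime the pipeline  */
          for (long off = 0; off < total; off += BLOCK) {
              int nxt = cur ^ 1;
              if (off + BLOCK < total)                          /* prefetch next block */
                  dma_get(lsbuf[nxt], main_mem + off + BLOCK, BLOCK, nxt);
              dma_wait(cur);                                    /* wait for our block  */
              result += process(lsbuf[cur], BLOCK);             /* compute while the   */
              cur = nxt;                                        /* other DMA is in flight */
          }
          return result;
      }

      int main(void)
      {
          static float data[8 * BLOCK];
          for (int i = 0; i < 8 * BLOCK; i++) data[i] = 1.0f;
          printf("sum = %g\n", stream_sum(data, 8 * BLOCK));
          return 0;
      }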

  • Memory (Score:5, Insightful)

    by Detritus ( 11846 ) on Friday December 05, 2008 @08:02AM (#26001639) Homepage
    I once heard someone define a supercomputer as a $10 million memory system with a CPU thrown in for free. One of the interesting CPU benchmarks is to see how much data it can move when the cache is blown out.
  • Multiple CPUs? (Score:5, Insightful)

    by Dan East ( 318230 ) on Friday December 05, 2008 @08:03AM (#26001649) Journal

    This doesn't quite make sense to me. You wouldn't replace a 64 CPU supercomputer with a single 64 core CPU, but would instead use 64 multicore CPUs. As production switches to multicore, the cost of producing multiple cores will be about the same as the single core CPUs of old. So eventually you'll get 4 cores from the price of 2, then get 8 cores from the price of 4, then 16 for the price of 8, etc. So the extra cores in the CPUs of a supercomputer are like a bonus, and if software can be written to utilize those extra cores in some way that benefits performance, then that's a good thing.

    • by Junta ( 36770 ) on Friday December 05, 2008 @08:21AM (#26001777)

      For a given node count, we've seen increases in performance. The claimed problem is that, for the workloads that concern these researchers, nobody is projecting enhancements to the fundamental memory architecture that keep pace with the scaling of multi-core systems. So you buy a 16-core system to upgrade your quad-core based system and hypothetically gain little despite the expense. Power efficiencies drop, and getting more performance requires more nodes. Additionally, who is to say that clock speeds won't drop if programming models in the mass market change such that distributed workloads are common and single-core performance isn't all that impressive?

      All that said, talk beyond 6-core/8-core is mostly grandstanding at this time. As memory architecture for the mass market is not considered as intrinsically exciting, I would wager there will be advancements that no one speaks to. For example, Nehalem leapfrogs AMD memory bandwidth by a large margin (like by a factor of 2). It means if Shanghai parts are considered satisfactory today to get respectable yield memory wise to support four cores, Nehalem, by a particular metric, supports 8 equally satisfactorily. The whole picture is a tad more complicated (i.e. latency, numbers I don't know off hand), but the one metric is a highly important one in the supercomputer field.

      For all the worry over memory bandwidth though, it hasn't stopped supercomputer purchasers from buying into Core2 all this time. Despite improvements in their chipset, Intel Core2 still doesn't reach AMD performance. Despite that, people spending money to get into the Top500 still chose to put their money on Core2 in general. Sure, Cray and IBM supercomputers in the Top2 used AMD, but from the time of its release, Core2 has decimated AMD supercomputer market share despite an inferior memory architecture.

      • Re: (Score:2, Interesting)

        by amori ( 1424659 )
        Earlier this year, I had access to a large supercomputer cluster. Often I would run code on the supercomputer (with 1000+, 100, 10, 2 CPUs), and then I would try running it on my own dual-core machine, benchmarking the 2 CPUs for comparison purposes. More than anything, just the manner in which memory was being shared or distributed would influence the end results tremendously. You really have to rethink how you choose to parallelize your vectors when dealing with supercomputers vs. multicore machines.
  • as expected (Score:3, Funny)

    by tyler.willard ( 944724 ) on Friday December 05, 2008 @09:32AM (#26002439)

    "A supercomputer is a device for turning compute-bound problems into I/O-bound problems."

    -Ken Batcher

  • What's distressing here? That they have to keep building supercomputers the same way they always have? I worked with an ex-IBMer from their supercomputing algorithms department; he and I BSed about future chip performance a lot in the late 2006 - early 2007 timeframe. We were both convinced that the current approaches to CPU design were going to top out in usefulness at 8 to maybe 16 cores due to memory bandwidth.

    I guess the guys at Sandia had to do a little more than BS about it before they published

  • Because of limited memory bandwidth and memory-management schemes that are poorly suited to supercomputers, the performance of these machines would level off or even decline with more cores.

    So increase the bandwidth on the memory to something more suited to supercomputers, then. Design and make a supercomputer for supercomputer purposes. You are scientists using supercomputers, not kids begging mom for a new laptop at Christmas. Make it happen.

  • Well, duh.... (Score:3, Insightful)

    by SpinyNorman ( 33776 ) on Friday December 05, 2008 @10:28AM (#26003011)

    It's hardly any secret that CPU speed, even for single-core processors, has been running ahead of memory bandwidth gains for years - that's why we have cache, and ever-increasing amounts of it. It's also hardly any revelation to realize that if you're sharing your memory bandwidth between multiple cores then the bandwidth available per core is less than if you weren't sharing. Obviously you need to keep the amount of cache per core and the number of cores per machine (or, more precisely, per unit of memory subsystem bandwidth) within reasonable bounds to keep it usable for general-purpose applications, else you'll end up in GPU-CPU (e.g. CUDA) territory where you're totally memory constrained and applicability is much less universal.

    For cluster-based ("supercomputer") applications, partitioning between nodes is always going to be an issue in optimizing performance for a given architecture, and available memory bandwidth per node and per core is obviously a part of that equation. Moreover, even if CPU designers do add more cores per processor than is useful for some applications, no-one is forcing you to use them. The cost per CPU is going to remain approximately fixed, so extra cores per CPU essentially come for free. A library like pthreads, and different implementations of it (coroutine vs LWP based), gives you the flexibility over the mapping of threads to cores, and your overall across-node application partitioning gives you control over how much memory bandwidth per node you need.

  • It's worth noting that multicore CPUs are just a plan B technology. What the market really wants is faster CPUs, but the current old technology can't deliver them, so CPU makers are trying to convince people that multicore is a good idea.

  • This just in:

    * Intel sucks at making zillion-dollar computers
    * AMD sucks at everything
    * Supercomputer engineers are worried for their jobs

    I realize these people have a legitimate complaint, but quite frankly, if you're worried about a certain processor affecting your code, maybe you suck at programming?! So what if the internal bandwidth is ho-hum? These old dogs need to stop complaining and learn to adapt, else their overpaid jobs will be given to others who can.
