IBM Technology

Cell Architecture Explained 570

IdiotOnMyLeft writes "OSNews features an article written by Nicholas Blachford about the new processor developed by IBM and Sony for their PlayStation 3 console. The article goes deep inside the Cell architecture and describes why it is a revolutionary step forward in technology and, to date, the most serious threat to x86. '5 dual core Opterons directly connected via HyperTransport should be able to achieve a similar level of performance in stream processing as a single Cell. The PlayStation 3 is expected to have 4 Cells.'"
This discussion has been archived. No new comments can be posted.


Comments Filter:
  • So (Score:1, Interesting)

    by ikkonoishi ( 674762 ) on Friday January 21, 2005 @04:43AM (#11429642) Journal
    How long until the first Beowulf cluster [defcon.org] of PS3s?

    Also, can it run Linux?
  • Is it just me? (Score:3, Interesting)

    by morriscat69 ( 807260 ) on Friday January 21, 2005 @05:08AM (#11429739)
    Or does the logical extension of this chart:

    http://www.blachford.info/computer/Cells/Cell_Dist ributed.gif [blachford.info]

    make it look a little more like a HAL than a Cell?
  • by mr_jrt ( 676485 ) on Friday January 21, 2005 @05:44AM (#11429846) Homepage

    Program in a language that is referentially transparent [wikipedia.org].

    ...once you can assume that any function can be executed concurrently, all you have to solve is the communication between processors/storage. The latency of current networking technologies makes this impractical for general tasks, but this is less of a problem with a low-latency internal bus.

    Time to drag those Haskell textbooks out of the closet and dust them off. ;)
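
    To make that concrete, here's a minimal sketch in plain C with pthreads (nothing Cell-specific; the thread count and fixed split are arbitrary choices of this example). The point is just that if f() is pure, a parallel map over the data needs no locking at all, and moving the data to wherever f() runs is the only problem left.

    #include <pthread.h>
    #include <stddef.h>

    #define N        (1 << 20)
    #define NTHREADS 4

    static double in[N], out[N];

    /* A pure function: its result depends only on its argument. */
    static double f(double x) { return x * x + 1.0; }

    struct span { size_t lo, hi; };

    static void *worker(void *arg)
    {
        struct span *s = arg;
        for (size_t i = s->lo; i < s->hi; i++)
            out[i] = f(in[i]);           /* no shared state, so no locks needed */
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];
        struct span spans[NTHREADS];

        for (int k = 0; k < NTHREADS; k++) {
            spans[k].lo = (size_t)k * N / NTHREADS;
            spans[k].hi = (size_t)(k + 1) * N / NTHREADS;
            pthread_create(&t[k], NULL, worker, &spans[k]);
        }
        for (int k = 0; k < NTHREADS; k++)
            pthread_join(t[k], NULL);
        return 0;
    }

    In a referentially transparent language the compiler can prove that split is safe for you; in C you just have to promise.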

  • by kai.chan ( 795863 ) on Friday January 21, 2005 @05:50AM (#11429863)
    was the ps2 the supercomputer it was said to be...?

    I don't remember Sony making any big statements about the Emotion Engine being a supercomputer. What I do remember is that when they released the clock speed of their processor, people knew the relative power of the PS2. From what I see of the Cell architecture, I can guarantee that the Cell is much more powerful than any AMD or Intel processor.

    It seems like you didn't read much into the technical aspects of the Cell architecture presented in the long write-up. From just looking at a simple top-level diagram of the Cell architecture, it is clear that the Cell is much more powerful than any other processor currently available. A Cell contains a Processor Unit with 8 additional Processor Units, each with its own registers. The architecture is also a distributed computing network capable of splitting tasks and computations over a wide variety of home electronics. With each Cell product you buy, you increase the processing power of your household. In conclusion, yes, it would smoke an x86 counterpart.
  • by kyonos ( 88114 ) on Friday January 21, 2005 @06:07AM (#11429920)
    Wonder if IBM looks into the future and doesn't see PCs anywhere? Intriguing possibility.
  • by Troed ( 102527 ) on Friday January 21, 2005 @06:23AM (#11429970) Homepage Journal
    Either chip your box or use a software exploit, switch it into NTSC video mode (you can do that without having to change the game region mode - your PAL originals will still work), connect a component video cable, enter the Microsoft Dashboard, and select the HDTV resolutions you want from the list under Video - Settings. Play in HDTV.

    ... yes, all this - and you can still play PAL originals in higher resolution on XboxLive - as long as the chip is off (or the software exploit not used) afterwards.

  • Unfair comparison (Score:3, Interesting)

    by Stripsurge ( 162174 ) on Friday January 21, 2005 @06:40AM (#11430017) Homepage
    Since the main goal of the chip is to pump through graphics, regardless of what device it's in, a GPU is better grounds for comparison.

    From TFA: "Existing GPUs can provide massive processing power when programmed properly, the difference is the Cell will be cheaper and several times faster."

    It's supposed to do 250 GFLOPS when? Two years from now? Apparently the GeForce 6800 Ultra will do 40 GFLOPS, and that's today... extrapolate with some doubling here and there and it seems a lot more reasonable.

    So the big thing is that it comes down to programming. It came up a few times in the article: "Doing this will make it faster but will make for one hell of a time for the programmers." It may have huge potential, but it may take a while to get everything running as efficiently as Sony would like. Reminds me of when the GF3 first came out and was beaten by the GF2U in some tests. IIRC it took a while for games to come out that took advantage of its programmability. It'll be interesting to see how well the programmers can fare between now and Cell's release.
  • by ponos ( 122721 ) on Friday January 21, 2005 @06:52AM (#11430052)
    There are several assumptions that lead to tremendous theoretical performance figures. The simple fact is that like the Itanium, the Cell processor depends on some rather complicated software that will solve issues like parallelism, coherency etc. The article clearly states that the Cell architecture is a combination of software and hardware (1st page). This is good because performance can always increase (via a better OS or microcode) but it is also bad because it means that initial versions may not stand up to their performance claims.

    Also, let's not forget that developers will be unable to keep up unless some highly sophisticated libraries and languages are made available. I really don't expect the majority of developers to be able to cope with massive parallelism from the beginning (not just 2x SMP or hyperthreading; this needs a totally different mindset).

    To sum this up: the hardware will deliver, but the software is a critical unknown in the equation. I have faith in IBM ;-)

    P.
  • Locked Up (Score:2, Interesting)

    by DingerX ( 847589 ) on Friday January 21, 2005 @06:57AM (#11430065) Journal
    I read all five sections at once, intending to stream each chapter through separate phases from character recognition to criticism. Unfortunately, every time the article used "it's" in a predicative sense, everything ground to a halt.

    Fortunately, cell reading meant I hardly noticed the claim that hardware would compete with the x86 because, unlike the x86, cell computers need all their software written for the specific hardware.

    I like how "hardware-specific" becomes "OS-independent". Great, I can plug my HDTV into my G/F's "electrically powered adult novelty device" and harness the extra computing power to find out we are really alone in the world. Of course, no firmware will stand in the way.

    I'm also surprised that, in pandering to all the OS underdogs in the slashdot crowd (Great day for Apple, since they like G5s; Great day for Linux, since many obsessive-compulsive coders work on Linux projects anyway), he left out a true lightweight OS designed from the ground up for just this sort of multitasking: Amiga OS 4.0. To get something like this to actually work, you'll need more than iPod huggers, OSX preachers or Linux fans. You need genuine madwomen and madmen. You need AmigaOS.
  • Best of both worlds? (Score:1, Interesting)

    by Anonymous Coward on Friday January 21, 2005 @07:18AM (#11430154)
    This is a very interesting architecture. It arises quite logically from the 'rediscovery' of vector processing for the high throughput, low interdependence instruction streams that GPUs represent. For a long time, home computers didn't do the kinds of things vector processors were good at - they did complex, heterogeneous instruction streams, and the processors evolved to match. You can't run Word on a Cray, after all. We got micro-ops, RISC, superscalar architectures, multi-level burst-filling caches, branch prediction, hyperthreading: all the things that, roughly speaking, make a stream of instructions that depend a lot on each other's results, don't repeat themselves much, and get information from all over memory go quickly.

    The thing is, the jobs vector computers are good at started cropping up in home computer loads, mainly for graphics and media type uses. The industry started responding to this with dinky little SIMD cores in essentially conventional processors, and then we got the GPU, which makes many more compromises to get you greater throughput. Now we have this thing, which is less graphics-specific than a GPU but much more vector-heavy than a microprocessor. Most of the 'traditional PC' work, executing code, will be done in the 'supervising' processor I think, which is why it's so fat - you don't need a G5 just to push jobs around, after all. Where these vector units come in is for doing 'work units' of the kind of stuff vector processors are good at - 3D graphics, physics, compression and decompression. Basically, all the heavy lifting needed for games and media use.

    Those who have said it will be very hard to program are correct. If you ran a conventional program on it, you'd get the power of the supervising processor and not much more. You'd have to start looking for these 'work units' in your game, or whatever, and shooting them out to the vector units, along with 128k of all the data they'll ever need, and then pulling the results back when they're done. You'd have to cut the work units small enough so that they were done by the time they were due to hit the screen, or the speakers, which is why there is so little RAM in each vector unit - the jobs aren't expected individually to run for all that long. It is also why the article mentions realtime stuff built in there somewhere, so you know if you're falling behind before the output buffers run dry, and can do something about it. The more of these vector jobs you found, the faster you'd go. You are really looking, however, at designing a whole program with the philosophy of an OS writer - there will be no abstraction here, and to get the performance you'll have to work for it.

    The thing is, where has abstraction really got us, we developers? Code is easier to write, and we can write it more quickly and ambitiously than we used to be able to. The problem is that we seem to have 'spent' a whole lot of the new capacity the hardware industry gives us on this one aim, the result being that this 2GHz P4 running XP in front of me seems about as responsive as a Mac Classic. Maybe it's time we gave up a bit of our precious abstraction, worked a little harder and passed a bit more of that new capacity we get every year on to the users. After all, the first guy to write a game that really does this for this kind of architecture is going to look pretty good, and the guy who comes out the next week with a simple port of the PC version is not. It'll be interesting to see how the game programmers go at getting the best out of this thing. Rumours seem to indicate that they didn't really manage to 'get' the PS2 in this way...
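
    As a rough illustration of that work-unit style, here's a plain C sketch. All names are made up; dispatch() just runs the unit serially in place, where a real runtime would hand it to a free vector unit and overlap the copies with the compute.

    #include <stdint.h>
    #include <string.h>

    #define UNIT_PAYLOAD (96 * 1024)    /* leave room in the 128k for code + results */
    #define MAX_UNITS    64

    /* A self-contained work unit: everything the vector unit will ever
     * see has to travel with it. */
    struct work_unit {
        uint8_t  payload[UNIT_PAYLOAD];
        size_t   len;
        uint32_t result;                /* whatever comes back (a checksum here) */
    };

    /* Stand-in for shipping the unit to a vector unit and running it. */
    static void dispatch(struct work_unit *u)
    {
        uint32_t sum = 0;
        for (size_t i = 0; i < u->len; i++)
            sum += u->payload[i];
        u->result = sum;
    }

    void run_job(const uint8_t *data, size_t total, uint32_t results[MAX_UNITS])
    {
        static struct work_unit unit;
        size_t n = 0;

        for (size_t off = 0; off < total && n < MAX_UNITS; off += UNIT_PAYLOAD, n++) {
            size_t chunk = total - off < UNIT_PAYLOAD ? total - off : UNIT_PAYLOAD;
            memcpy(unit.payload, data + off, chunk);   /* the unit carries its own copy */
            unit.len = chunk;
            dispatch(&unit);                           /* ship it out... */
            results[n] = unit.result;                  /* ...and pull the result back */
        }
    }

    The units have to be cut small enough to finish before the frame or audio buffer deadline, which is the real design constraint described above.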
  • Re:next please (Score:3, Interesting)

    by Rinikusu ( 28164 ) on Friday January 21, 2005 @08:42AM (#11430482)
    Sony can kiss my ass, too. But I'll probably be in the fucking line to buy it when it comes out. See you there?

  • by Siker ( 851331 ) on Friday January 21, 2005 @10:16AM (#11431229) Homepage
    So what will this tremendous power be used for? Since the GPU will handle the rendering task, what will the vector units do (the vector units are where the power of the system is)?

    Actually, the CPU speed has a lot to do with graphics speed. If you look at recent performance charts for nVidia's high-end GPUs in SLI setups, you will find that their performance levels off unless you run the absolutely highest resolution with top filtering and anti-aliasing settings. In fact, the high-end cards are still CPU limited at the highest settings for all but the most recent games. [Tom's Hardware Guide [tomshardware.com]]

    In addition, programmers will always find things to do with additional CPU power. Ray traced occlusion culling to reduce the number of polygons sent to the GPU is one idea if you have extreme amounts of processing power just sitting around. That in turn would allow you to use extremely advanced pixel shaders as overdraw is almost eliminated. It would also allow you to add a few more polygons to every scene, knowing that most polygons are correctly culled.
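
    As a sketch of that idea in plain C: probably_occluded() is a rough, point-sampled test (a real engine would want something conservative), and nearest_hit is whatever ray tracer you happen to have running on the spare processing power; both are assumptions of this example, not anything from the article.

    #include <math.h>
    #include <stdbool.h>

    typedef struct { float x, y, z; } vec3;

    static float dist(vec3 a, vec3 b)
    {
        float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
        return sqrtf(dx * dx + dy * dy + dz * dz);
    }

    /* nearest_hit(origin, target) returns the distance along the ray from
     * origin toward target at which the first occluder is hit (a huge value
     * if nothing is hit).  If every sample ray to the object's bounding-box
     * corners is blocked short of the corner, the object's polygons are
     * never sent to the GPU at all. */
    bool probably_occluded(vec3 eye, const vec3 corners[8],
                           float (*nearest_hit)(vec3 origin, vec3 target))
    {
        for (int i = 0; i < 8; i++)
            if (nearest_hit(eye, corners[i]) >= dist(eye, corners[i]))
                return false;           /* this corner is visible: keep the object */
        return true;                    /* all sample rays blocked: cull it */
    }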

  • by rob_osx ( 851996 ) on Friday January 21, 2005 @10:36AM (#11431433)
    Look at this article and then believe it.

    http://www.siliconvalley.com/mld/siliconvalley/103 23259.htm [siliconvalley.com]

    IBM has made the Cell for servers and embedded applications. I don't know much about the author of the article, but the Cell will change computing.

    Here's my analysis on why Apple will use the Cell http://www.siliconvalley.com/mld/siliconvalley/103 23259.htm [siliconvalley.com]

  • by mwvdlee ( 775178 ) on Friday January 21, 2005 @11:05AM (#11431740) Homepage
    Since the Cell processors are basically arrays of vector processors, quite similar to the shader units in GPUs, I suspect NVidia will just implement the specialized low-level 3D stuff and leave all the shader work to be done by the Cell processors.

    So basically you'll have a fixed graphics core which isn't likely to change (since it hasn't for the last couple of years) and an extremely flexible and powerful array of shader units.
  • by buddhaseviltwin ( 786340 ) on Friday January 21, 2005 @11:07AM (#11431761)
    Considering how much IBM has invested and banked on Java, wouldn't you think they would try to design a virtual machine that would take advantage of this architecture? I wouldn't expect the JIT to be able to parallelize (???) everything, but I would think it would know how to detect and translate certain segments of code which are easy to translate to a parallel architecture.

    I don't know about you, but when I first heard about cell processors (and the fact that IBM was behind it), I immediately began speculating how IBM would exploit this architecture in their server market. This sounds like the sort of thing that will enable them to sell 256-processor monsters running AIX, DB2 and J2EE.

    Even if designing to take advantage of this architecture is terribly difficult, just porting your web server, database server, and transaction server will solve the scalability issues for most Web/Client/Server applications.
  • by Anonymous Coward on Friday January 21, 2005 @11:14AM (#11431839)
    IBM is buddying up to two (or three, depending on your outlook) of the most powerful forces in "entertainment computing": Sony and Microsoft. Apple and IBM (and Freescale) are plugging away at the Power/PowerPC and passing the fruits on to Microsoft in the next-generation Xbox.

    Sony and IBM (and Toshiba) are putting together some of the most esoteric hardware ever to be contained in a consumer computing device and in all cases IBM sits right in the middle of it. Is this cross-fire waiting to kill the giant, or is IBM counting on its size and contracts to protect it?

    What happens when Apple says, "Hey! We want to play with CELL in our next generation hardware." Or Microsoft says, "What's with giving Sony all the good hardware?"

    What's the potential of all these competitors sourcing silicon from IBM and ending up converging on a common, or nearly common, architecture? Is this IBM carefully constructing a future monopoly? Is it the beginning of the ability to run software from Microsoft, Sony, and Apple on a single (or nearly single) architecture?

    Remember the IBM LongTrail CHRP/PReP development boards? A friend of mine who did some Web work for IBM was paid partially with a dual-processor PowerPC 604 development board that contained both an Apple ROM slot (it used the beige G3 ROM) and full OpenFirmware (not the crippled version Apple implemented). Only the lack of well-developed drivers for the peripheral hardware prevented it from running as a 100% compatible MacOS/WinNT/AIX machine (it was given to him with all three installed on the hard drive). This sort of attempted hardware convergence is not a new route for IBM.

    What do people out there think is going to happen? Are Sony and Microsoft far enough apart in their long-term goals to keep working with IBM without stomping all over each other in the process? Where does Apple (or even Nintendo, for that matter) fit into this? What rumors or conspiracies have people heard?
  • by aphor ( 99965 ) on Friday January 21, 2005 @11:35AM (#11432093) Journal

    No, this sort of architecture is part of a general trend towards parallelization. It is smart, and it is known to work, and I would expect some bright SPARC-wise people to chime in and say "uh-huh" and some SGI-wise people to chime in and say "I've seen some of this before." The OS people [dragonflybsd.org] are starting to move things in this direction, and I've heard that Darwin has had the asynchronous-messaging style of threading model for a while (RTFA: the article explicitly mentions Tiger's GPU-leveraging techniques). If you have the head for it, try reading up on NUMA and compare that with SMP.

    The math is simple. CPUs are CPUs, and anyone can make one that is the same speed as the competition, and if they do it second they can do it cheaper. The guy that can make 20 CPUs work like one CPU that does 20 times the work in a given time will win, because he can always just throw more hardware at the problem. The SMP guys have to go back to the drawing board. In this case, the only way to beat 'em is to join 'em. Maybe the specific "Cell" computing design isn't it, but the ol' PC is dead if these things start hitting the commodity price-points.

    That's a big, fat IF. So, don't bet on it (yet), but it's even worse to ignore it.

  • by theolein ( 316044 ) on Friday January 21, 2005 @11:49AM (#11432250) Journal
    I'm not actually surprised that so-called journalists, especially the technical kind, get good salaries. If you look at the painful clowns running the show at ZDNet, and at most technical publications for that matter, including such wonder rags as the Register, you know that the Agenda is almost the most important thing. The actual realities of the tech world be damned, as long as you have someone passing you your monthly wad of cash.

    And this story is no different.

    As many have noted, Sony did exactly this kind of hyping the last time around, when the PS2, with its Emotion Engine, was supposed to be the future of all things computing. As everyone knows, the PS2 was a real pain to code for, and the actual performance was not better than the PCs of the day. The Cell will undoubtedly suffer from the same problems when it comes to coding real applications. Concurrency and parallelism do not an easier coding experience make.

    I have no doubt that this thing will be good, but I absolutely doubt that it will have much if any effect on the x86 world of computing. The G4 processor, when it came out with the AltiVec SIMD unit, which was apparently better than SSE at the time, didn't turn Apple into the next Microsoft overnight either, did it?

    So, I expect that the x86 world will continue to thrive, and that Apple will stick some of these Cell processors (having as they do a PPC 970, aka G5, at their core) in some of their machines and make the usual wild RDF claims about how hot it is, while in reality it will be used by only a small fraction of actual Mac developers, the Mac having to maintain backward compatibility only slightly less than the x86 world does.

    In other words, it'll be business as usual.
  • by zogger ( 617870 ) on Friday January 21, 2005 @12:08PM (#11432459) Homepage Journal
    Because IBM is an R&D and service company mostly, or it looks like they are headed that direction eventually. They can make better profit margins by just designing and then licensing out the tech. By concentrating on their core missions they can maximise ROI, and leave the headaches and drudge work of mass production and marketing of consumer-level stuff to some other company, and still get paid well for it. Granted, you get a higher gross income by being the manufacturer, but you get a better net income by just licensing and developing.

    At least it looks that way to me, and it's following their past business model of selling off consumer-level production, like they did with hard drive manufacturing to Hitachi. Whether that will be a smooth move in the very long term I have no idea, but in the short term it's actually making them money. Profit margins at low-end retail are small; they want no part of that, too clunky for them. Fabbing the chips is a different story: they need a place to build what they R&D, so in that sense it's logical for them to do that, and to get that aspect subsidised by licensing and direct sales (it saves them research costs in the long run). But after that point it's just manufacturing vacuum cleaners or blenders, which they don't want to do, and that's all PCs are now, just another consumer appliance.
  • by Raunch ( 191457 ) <http://sicklayouts.com> on Friday January 21, 2005 @01:18PM (#11433244) Homepage
    Then perhaps some of those bright people can shine some light in my direction.

    FTFA
    Caches work by storing part of the memory the processor is working on. If you are working on a 1MB piece of data, it is likely that only a small fraction of this (perhaps a few hundred bytes) will be present in cache. There are kinds of cache design which can store more or even all of the data, but these are not used as they are too expensive or too slow.

    APU local memory - no cache
    To solve the complexity associated with cache design and to increase performance, the Cell designers took the radical approach of not including any. Instead they used a series of local memories; there are 8 of these, 1 in each APU.

    The APUs operate on registers which are read from or written to the local memory. This local memory can access main memory in blocks of 1024 bits but the APUs cannot act directly on main memory.

    By not using a caching mechanism the designers have removed the need for a lot of the complexity which goes along with a cache.

    This may sound like an inflexible system which will be complex to program, and it most likely is, but this system will deliver data to the APU registers at a phenomenal rate. If 2 registers can be moved per cycle to or from the local memory, it will in its first incarnation deliver 147 Gigabytes per second. That's for a single APU; the aggregate bandwidth for all local memories will be over a Terabyte per second - no CPU in the consumer market has a cache which will even get close to that figure. The APUs need to be fed with data, and by using a local memory based design the Cell designers have provided plenty of it.


    Ok, so regular memory is too slow, or too expensive (apparently caching is the main problem although what I remember from comp. arch. was that caching was a Good Thing). So then they abolish the cache.
    Caching: CPU > cache (1 or 2) > main memory > HD
    and then implement another system that is *completely different*
    Non-Caching: APU > local memory > main memory
    Now first off, how is it different? Secondly, how does this improve the physical memory speed? Is the author claiming that a page fault is what we are avoiding here? If that's the case, then the problem is hard drives, not solid state memory. But (again, if what I learned in comp arch is correct), because there is a tendency for programs to run the same code over and over again, then (assuming you have a good algorithm) the time saved by caching is significant.

    Anyone?
  • by Henk Poley ( 308046 ) on Friday January 21, 2005 @02:05PM (#11433732) Homepage
    So then they abolish the cache.
    Caching: CPU > cache (1 or 2) > main memory > HD
    and then implement another system that is *completely different*
    Non-Caching: APU > local memory > main memory
    Now first off, how is it different?


    Let me take a vaguely educated guess.

    Currently the cache managers in x86 CPUs "predict" what part of the memory space is needed. This prediction isn't always that good, and efforts to make programs hint to the processor what to cache haven't worked well enough (at least according to the Cell CPU designers). So they force the program to operate in a 'small' memory space into which data can be read from the large RAM storage.

    I don't know if it will actually help. To me it seems a bit like going back to the 80286/80386 era with HIMEM.SYS under DOS, where programs could access higher memory regions by commanding the driver to swap memory in and out of the lower 640k.

    But then, maybe Bill Gates (or whoever put up that quote in his name) was right: 640k RAM is enough for everyone (multiplied by the number of cores on the die...).
  • Think another way (Score:2, Interesting)

    by marcus ( 1916 ) on Friday January 21, 2005 @04:35PM (#11435469) Journal
    The current processing bottleneck, and the reason for caches in the first place, is the bandwidth between the processor and the memory. A "normal" memory bus cannot keep up. This is why you see so many attempts to speed this particular part of the system up: RAMBUS, DDR, even HyperTransport.

    What these guys are trying to do is move the processor to the memory rather than the inverse. Having fast expensive caches near the processor is an attempt to get the memory closer to the proc. What has been happening of late is that lots and lots of on-chip transistors have been spent on the cache. The Cell architecture is a step in the other direction. They want to spend those transistors on processors instead of memory.

    At the limit of this idea you would see something like a super-granulated architecture with a processor on each memory chip. Imagine a PC with 32/64/whatever cell processors *and* no classic "processor socket" on the motherboard, just some DIMM-like "cell" slots. Each proc would have exclusive access to the memory on its own chip, and all would communicate via some sort of bus or fabric of links. So, instead of one mega-proc with tens of millions of transistors (perhaps half of them cache) at 4GHz with a 400MHz x 32-, 64-, or 128-bit-wide memory bus, you'd have maybe 64, 128, 256? simple ARM-like procs at 400MHz, each with something like 400MB/s or more of memory bandwidth available per proc.

    Of course the extreme limit would be to have millions of 1 bit processors, but I don't think that anyone is proposing that just yet. Things do get more and more neuron-like as you approach this limit, interesting eh?
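
    A toy picture of that end point, in plain C with made-up sizes and names: every node owns its memory outright, and the only cross-node traffic is small messages posted into a peer's mailbox over the link fabric. No shared address space, no cache coherency to keep up.

    #include <stdint.h>

    #define NODES      64
    #define NODE_MEM   (64 * 1024)        /* private memory per node */
    #define MBOX_SLOTS 32

    struct message { uint32_t from, tag; uint64_t payload; };

    struct node {
        uint8_t        mem[NODE_MEM];     /* only this node's processor touches it */
        struct message mbox[MBOX_SLOTS];  /* peers write here, this node drains it */
        unsigned       head, tail;
    };

    static struct node nodes[NODES];

    /* Post a message to another node; returns -1 if its mailbox is full. */
    int send_to(unsigned dst, struct message m)
    {
        struct node *n = &nodes[dst];
        unsigned next = (n->head + 1) % MBOX_SLOTS;
        if (next == n->tail)
            return -1;
        n->mbox[n->head] = m;
        n->head = next;
        return 0;
    }

    /* Drain one message from this node's own mailbox; returns 0 if empty. */
    int receive(unsigned self, struct message *out)
    {
        struct node *n = &nodes[self];
        if (n->tail == n->head)
            return 0;
        *out = n->mbox[n->tail];
        n->tail = (n->tail + 1) % MBOX_SLOTS;
        return 1;
    }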
  • by be-fan ( 61476 ) on Friday January 21, 2005 @11:44PM (#11438727)
    The local storage is different for the following reasons:

    1) It must be programmed differently. Instead of just accessing memory how you want, you must explicitly copy the part of memory you need at the moment. So, if your APU is acting as a vertex shader, you need to copy the shader code into the LS before you start processing. Essentially, the LS can give you the time savings of a cache, but you have to manage it yourself to get the benefits.

    2) Since the LS isn't managed by the hardware, it doesn't need a lot of management hardware. You don't need cache tags, lookup hardware, hardware to manage misses, etc. This saves a lot of transistors.

    3) A regular cache has to do some management on each access. It has to search the tags to find what cache line holds a given memory word, it has to perform write-back, etc. Since the LS doesn't need to do any of this, latency can be cut down.

    4) Since the LS is addressed directly, and isn't mapped onto memory, there is no need for cache coherency protocols. A cache-coherent multi-processor system needs to communicate with its peers to coordinate access to the cache. For example, when it writes to a memory location, it must notify all other processors caching that location that their copies are now invalid. The LS doesn't need to do any of this, and that cuts down on both management hardware and latency.

    The APUs are stream processors. It is common for stream processors not to have a general memory cache. The GeForce 3's vertex processor, for example, has enough cache to hold 18 vertices. It is not an LRU cache like a Pentium's, but a FIFO (much cheaper to manage), and it is only used in certain circumstances. In comparison, the Cell's 128KB LS per APU is enormous!
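
    A plain-C sketch of the difference (not actual Cell/APU code; the "DMA" here is just memcpy): with a cache you walk main memory and the hardware decides what to keep close, while with a local store the program copies a block in, works on it, and copies the result back out itself.

    #include <stddef.h>
    #include <string.h>

    #define LS_BYTES (128 * 1024)
    static unsigned char local_store[LS_BYTES];        /* stands in for one APU's LS */

    static void kernel(unsigned char *block, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            block[i] = (unsigned char)(block[i] * 2);  /* any streaming computation */
    }

    void process(unsigned char *main_mem, size_t total)
    {
        for (size_t off = 0; off < total; off += LS_BYTES) {
            size_t n = total - off < LS_BYTES ? total - off : LS_BYTES;
            memcpy(local_store, main_mem + off, n);    /* explicit "DMA in"  */
            kernel(local_store, n);                    /* work only on the LS */
            memcpy(main_mem + off, local_store, n);    /* explicit "DMA out" */
        }
    }

    Every decision a cache makes in hardware (what to keep, when to evict) becomes an explicit copy the programmer schedules, which is where both the performance and the programming pain come from.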
  • by Anonymous Coward on Saturday January 22, 2005 @01:18PM (#11441765)
    > They take favorable numbers (eg: flat-shaded polygons instead of textured polygons), but they are for the most part verifiable

    It's worth noting that a whole lot of Japanese games use only shaded polygons. Even now, most console games use fewer textures than a PC game, and use algorithmic tricks instead. It took developers years to get their heads around the PS2's bizarre VPU design, but it turned out to be damn powerful in the end (it manages to stay reasonably competitive with the Xbox even though it's a full generation older and has half the RAM).

    Hopefully the API will have a better translation job done this time. CLITIndex anyone?
