Technology

Heterogeneous Multiprocessor Chip Runs Tao/Elate

Madmac wrote with this cool item: "A trio of Japanese companies have teamed up on a multiprocessor chip design that can embed multiple processors and DSPs on a single chip, all running Tao's VP code." An interesting snippet: "Up to eight processor engines can be added to one MISC device, for performance of 200 to 900 million operations per second. They can include any general-purpose RISC processor, DSPs, SIMD engines, vector processors, graphics processors or customized logic."
This discussion has been archived. No new comments can be posted.
  • Yes it is [tao.co.uk]. The bastards are making a bundle out of it too. Ah, business in the Internet age: take a completely unoriginal idea that's been thoroughly researched and investigated for the past few decades, make it just a little bit less clever, slap a "Java(tm)" (or "XML", or "Web", or "Linux", or "e-", et cetera) on it, and get ready to make millions! :)

    Ah well. I guess I'm just jealous.

  • I agree this is interesting technology; however, compared with just running a bunch of processors in parallel there is no real advantage, at least not in performance that I can see. Cost-wise there would probably be an advantage, but then you have all the problems of trying to make this heterogeneous "solution" talk to itself. Personally I think we are better off just going the multi-processor route on a single motherboard. Also, the Japanese were not the first to do this; similar research has been going on at the Almaden Research Center (IBM) in San Jose for some time now. They were just the first to publicly announce it.

    My question always is: why do we need faster processors when the software we are using still isn't utilizing this speed? Of course the argument can always be made that six months to a year from now the software will have become so bloated that it will need this speed. I disagree; look at Linux. You can easily run a Linux box off of a 200 MHz processor; you sure as heck don't need 1.2 GHz or whatever is the cutting edge right now.

    The funny thing is that I read somewhere that Microsoft had come out with some sort of spec saying that to have Windows 2000 run properly and see a performance boost in all its apps, you would need at least a 2 GHz processor. Boy, there must be something going on between Intel and Microsoft (Wintel) for this kind of announcement. You would think it would hurt sales, but it doesn't.

    Anyhow, I think the speed thing is just a little overrated. I say we need to pay more attention to the motherboard speed and the peripherals...


    Nathaniel P. Wilkerson
    NPS Internet Solutions, LLC
    www.npsis.com [npsis.com]
  • http://www.theregister.co.uk/000505-000008.html [theregister.co.uk]


    Nathaniel P. Wilkerson
    NPS Internet Solutions, LLC
    www.npsis.com [npsis.com]
  • If a programmer wants to fully exploit the available CPU power of multiple processors, then indeed they have to put in extra work. This is not necessarily a big deal. If you are already forking processes, then you have already done that work. Likewise, if you are running a server that runs multiple processes in parallel (e.g. Apache), then the work has been done for you.

    And, of course, for many things the end user doesn't care that the program is an efficient hog of multiple CPUs. If I background one process and turn to doing other things, I have already won if the backgrounded process has a minimal impact on my observed performance.

    For all of these reasons, people find wins running dual-processor systems even though they are mostly running programs that were not explicitly written to take advantage of multiple processors.
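
    (A concrete sketch of the forking case, since it's cheap to show; illustrative C, error handling trimmed, not anyone's production code. The kernel is free to put the two processes on separate CPUs with no further effort from the programmer.)

    /* fork_demo.c -- two independent processes; on an SMP box the
     * scheduler can place them on different CPUs automatically. */
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static long busy_work(long n) {
        long sum = 0;
        long i;
        for (i = 0; i < n; i++)
            sum += i % 7;                  /* stand-in for real work */
        return sum;
    }

    int main(void) {
        pid_t pid = fork();
        if (pid == 0) {                    /* child */
            printf("child: %ld\n", busy_work(100000000L));
            return 0;
        }
        printf("parent: %ld\n", busy_work(100000000L));  /* runs concurrently */
        waitpid(pid, NULL, 0);             /* reap the child */
        return 0;
    }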

    Cheers,
    Ben
  • Although this may sound similar, the end result will be completely different.

    With Transmeta you will be solely dependent on their chip, whereas with Amiga/Tao's RTOS you can choose any CPU you like: for example, an SMP system that uses both x86 and PPC chips. Elate is the only OS that supports heterogeneous multiprocessing!
  • I always trust Intel to give me the absolute truth about whether I need to upgrade my processor or not.
  • For example: do all the chips have to operate at the same frequency? How does it handle memory caching? What's the system design for simultaneous I/O streaming? Can they control the activation/deactivation of energy sources/chips?

    As for why it is useful: OK, imagine that today's CPUs are engines; so far we've been ramping up the revs per second, getting to the stage of gas turbines. But the problem is that you can only have homogeneous or balanced turbines, much like a plane requires four engines of exactly the same type, otherwise it is unflyable. This imposes an incredible overhead if you are only using it infrequently. Unlike PCs/clusters, where if you're sitting down you are working continuously and therefore want reduced latency or improved throughput, consumer items are flogged with the intention that you own several (think of stereo components) that don't always operate together. After all, you watch TV in your living room, nuke your microwave food in the kitchen, have a clock radio in the bedroom, etc. Hence it makes sense to attempt to link together smaller engines to achieve the same task, giving the equivalent of standby processing or processing on demand.

    Another consideration is that fab plants take billions to establish. By using IP cores from multiple sources (itself a thorny issue), they can keep the fabs ticking for a lot longer by using different combinations and only improving the cores which need to change (e.g. encryption for IP protection) rather than the basic DSPs. It shifts the balance away from software and back into specialised hardware, which is cheaper for *SOME* applications that don't change often (e.g. MPEG4 decoders).

    Also think of future applications, like the Sony PlayStation EE and GS chips being mixed with Transmeta and dedicated CDMA chips... a low-power web pad to kill for.

    LL
  • by Anonymous Coward
    My advice to the world: if a company quotes you a theoretical MIPS number, run away. If that is the best performance measure he can give you, that means the chip is going to run like molasses on real code.

    Depends on who you're talking to. In the DSP world, when TI says their chip does 100 MIPS, that means it does a hard 100 MIPS. Realtime systems guarantee performance; if your code executes in 43 ticks, it will always execute in 43 ticks, and you can count on that fact.

  • >Duplicate register sets.
    Processes already have duplicate register sets. This overhead is already there.

    >Different instruction pointers.
    Ditto.

    >Greater memory load.

    Only if you are loading more pages. Ok, maybe you'd want to completely separate the pages between processes, but the cache could be made to operate independently of any (weak?) MMU scheme for protecting data.

    >More hardware complexity.

    I wonder what they're gonna cool this chip with. ;-)

    My greatest concern here is the bus and memory. They've gotta be made specifically for multiprocessing, or else you're gonna have an expensive bottleneck on your hands.

    - Steeltoe
  • The point of the architecture is not that you can add new custom chips, custom instructions, etc. and utilise them.

    The point is that Tao makes it easy to do so (recall how long it took for MMX to catch on and get used... with this, you update the VM/compiler thing and you're virtually done).
    John
  • MISC has been used to refer to Minimal Instruction Set Chips for a while now. A little research would have shown them this; now the acronym recognition is severely diluted.
    MIS Chips [ultratechnology.com]

  • I used to use bogomips for benchmarking. Not because I thought it was a great benchmark, but because it was so easy to set up a job to collect bogomips data from all of our Unix/Linux boxes. I didn't have to worry that any of our users would get upset that we were running an obtrusive benchmark program; bogomips completed very quickly.

    Then I noticed that my K6/2 450 had a significantly higher bogomips rating than my Athlon 550 (both AMD, Athlon being a higher end CPU line than K6/2 - so if anything, I'd expect the Athlon to outperform a K6/2 at the same MHz).

    Hence, I gave up on bogomips and switched to nbench, the Byte benchmark thingy. I doubt this is a perfect benchmark either, but it does seem to be better correlated with reality than bogomips. Alas, this one pegs the CPU for a while, so I can no longer rest easy about users complaining.
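
    (For the curious, collecting the number was trivial because the kernel computes it once at boot; here's a minimal C sketch, mine rather than anything from the thread, that just echoes it back out of /proc/cpuinfo:)

    /* bogomips.c -- print the kernel's BogoMIPS line from /proc/cpuinfo */
    #include <stdio.h>
    #include <strings.h>

    int main(void) {
        char line[256];
        FILE *f = fopen("/proc/cpuinfo", "r");
        if (f == NULL) {
            perror("/proc/cpuinfo");
            return 1;
        }
        while (fgets(line, sizeof line, f) != NULL)
            if (strncasecmp(line, "bogomips", 8) == 0)
                fputs(line, stdout);       /* e.g. "bogomips : 897.84" */
        fclose(f);
        return 0;
    }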
  • Well, the "parallelism" of the P6 architechture is sorta bound... It's not really explicitly parallel, it's good ole superscalar processing. You are actually executing a handful of instructions in parallel, but you aren't gonna get less-than-one IPC (Instruction per cycle) out of this... here is the reason: you still have to go through the motions. here's a brief explaination of how superscalar crap works so maybe you people can see what I mean.

    The processing is broken up into discrete stages, and when an instruction moves from one stage to the next, another instruction enters the stage that was vacated. The instructions are moving through a pipe, and (ideally) all the parts of the processor are being utilized at once. This is a rough generalization for a generic processor with stages divided into IF (instruction fetch), ID (instruction decode), EX (execute), MM (memory), and WB (writeback). An instruction is only complete when it exits all of these stages, and at its fastest an instruction moves through one stage per CPU cycle. So if we have 5 instructions entering a superscalar processor with 5 stages, all no-ops, we end up with something like this (where each division takes a cycle to complete):

    IF | ID | EX | MM | WB |
         IF | ID | EX | MM | WB |
              IF | ID | EX | MM | WB |
                   IF | ID | EX | MM | WB |
                        IF | ID | EX | MM | WB |

    Each instruction enters at IF and leaves after WB. Since we are pipelining, it's looked at like this: each no-op takes 5 cycles to complete from fetch to finish, but each instruction effectively takes only one cycle once the initial 4 cycles have filled the pipe, so the 5 no-ops above finish in 5 + (5 - 1) = 9 cycles rather than 25. No instruction takes one cycle to go from fetch to finish. It has become standard practice, though, to quote execution cycles on a Pentium-class processor (and for the most part superscalar processors in general) as 1 + (stalls needed). Stalls happen when you need to go to memory, when you have a string of data dependencies, or when you are doing an operation that just takes a freaking long time in the EX stage. And sometimes they just go by how long it takes in the EX stage, ignoring any memory access problems.
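
    (An illustrative fragment, mine rather than anything from the article: the first loop is one long chain of data dependencies, so every add has to wait for the previous result, while the second loop gives the hardware independent chains it can overlap. Compile without optimization, or treat it as pseudocode for what the pipeline sees.)

    /* stalls.c -- loop-carried dependency chain vs. independent chains */
    #include <stdio.h>

    #define N 100000000L

    int main(void) {
        long a = 0, b = 0, c = 0, d = 0;
        long i;

        for (i = 0; i < N; i++)
            a += i;              /* every add depends on the previous one */

        for (i = 0; i < N; i++) {
            b += i;              /* three independent dependency chains;  */
            c += i;              /* the hardware is free to overlap these */
            d += i;
        }

        printf("%ld %ld %ld %ld\n", a, b, c, d);
        return 0;
    }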

    And so, yes, we have parallelism in a P6 at the superscalar level (and as far as branch prediction goes, but that is irrelevant for this case), but it will not have the effect of achieving less than 1 CPI... It doesn't matter how fast you can issue instructions; only completion matters for this case. And that's part of why bogomips are useless pieces of benchmark.

    And that's also why a 450MHz P6 can only achieve a theoretical max of 450 million instructions per second.
  • How many times do we have to have this conversation on /.? Parallelism in hardware won't do any good until parallel code that works well can be written.

    Alrighty, then you and your one process can have at it. But I like to listen to some MP3s while I'm playing with the GIMP, and I don't like waiting for filters. Are you such a dumbass as to not realize that different processes (or threads) can be executed in parallel on parallel processors?

    Your assertion that everything should be done at once is just plain stupid. If everything could just be that easily made to run all at once, it would have, a long time ago.

    Yep, it's dumb to be efficient. I'm stupid for trying to say that things should be written to take advantage of parallelism... HELLO!!!! Wake up: it's not easy to write parallel code for a single process, but it's not hard to write a single application using multiple threads. Obviously you know I didn't mean run all the instructions in one cycle and put together the answers, 'cause no one is that idiotic. The only thing I meant was to do as much at once as is possible, and not just as much as is easy, either. You are right that if it was easy to write parallel algorithms it would have been done a long time ago. But it is very difficult (one of the hardest CS topics out there), and it has been done before.

    Instead of wasting your time learning all there is to know about the history of running assloads of no-ops through your processor, perhaps you should educate yourself about how processors work. That'd be much more useful, and you'd sound less clueless and much less arrogant. And apparently we will have to have these conversations on Slashdot until idiots like you get some kind of clue about good programming practice and what that piece of porn-downloading equipment you keep in your bathroom called a computer actually does.

    So get back to me when you can tell me all about the dining philosophers problem and how to use semaphores for critical-section protection in a parallel algorithm. 'Cause then maybe you will have a clue as to what parallelism actually is and what it actually can do. Or maybe you'll give up and realize that stupid hacks like you just can't understand how to write good code.
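
    (Since you brought it up: a minimal sketch of semaphore-protected critical sections with POSIX threads. It's mine, purely illustrative, with error checks trimmed.)

    /* critsec.c -- a binary semaphore serializing access to a counter */
    #include <pthread.h>
    #include <semaphore.h>
    #include <stdio.h>

    static sem_t lock;
    static long shared = 0;

    static void *worker(void *arg) {
        int i;
        (void)arg;
        for (i = 0; i < 1000000; i++) {
            sem_wait(&lock);               /* enter critical section */
            shared++;
            sem_post(&lock);               /* leave critical section */
        }
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        sem_init(&lock, 0, 1);             /* initial value 1 == mutex */
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("shared = %ld\n", shared);  /* always 2000000 */
        sem_destroy(&lock);
        return 0;
    }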

    Or maybe you can come back and tell me what good Trolling you've had today to cover up your idiocy... Either way I'll be happy, cause either way you are the crack whore.
  • Is that not how BeOS works?
    I thought I read something along those lines in the user book that came with it.

    -----
    If my facts are wrong then tell me. I don't mind.
  • by dougman ( 908 )
    Is this the same Tao mentioned in yesterday's (gasp) Amiga story?

    (mind starting to race)
  • An interesting snippet: "Up to eight processor engines can be added to one MISC device, for performance of 200 to 900 million operations per second. They can include any general-purpose RISC processor, DSPs, SIMD engines, vector processors, graphics processors or customized logic."

    Clearly I'm behind the times with my new 16 MHz 80386 with 2 MB of RAM.
    Yes?
  • As was already stated, don't use bogomips to gauge your processor's performance. Bogomips are just that: bogus. As for why your reported bogomips doubled, somewhere along the 2.3.99preN line this happened. Dunno if it was intentional or not; most likely not. My box went from a paltry 998.something to a whopping 1998.84 bogomips. It seems to have happened at 2.3.99pre3.
  • This sounds fairly similar to the Transmeta Crusoe to me. This new design can be reconfigured to produce various CPU cores, whereas the Crusoe can interpret various instruction sets: a similar end result.

    This does add the interesting possibility of custom DSP stuff directly on the CPU though.
  • Something truly worthy of Beowulf clustering! We can put Linux on these and send them to attack Redmond in hordes! Soon we will conquer the world! Our enemies shall talk themselves to death.... We shall prevail!!!!

  • you have pushed a lot of that work down to having the OS identify 2 processes

    Actually, you have pushed the work all the way back to the programmer, forcing the code to be broken up into n parallelizable parts. Now, some code is "embarrassingly" parallel, but most is not. Being a lazy programmer, I prefer tradeoffs that allow me to work less. For example, look at all the work that's gone into making the Linux kernel support SMP effectively for large numbers of CPUs. It's coming along, but it's taken a while, and hasn't been easy.

  • All developers who bought Amiga's SDK (software developer's kit) or developer's machine will be able to get free support for the Elate-based Amie RTOS here [amigadev.net].
  • by cheese_wallet ( 88279 ) on Sunday June 04, 2000 @10:14AM (#1026494) Journal
    Why is this so exciting? Is it really that difficult to put multiple processors on a die? And this sounds like they are just putting multiple die in a package. (Oddly enough, I believe it is referred to as "die" and not "dice" when speaking of pluralities in ICs.)

    When a die gets large, you have thermal expansion problems, so one can't just stick four PIIIs (or whatever) on a single die; it gets too large. But you could stick four PIII die in a single package. Flip-chip might not be the best way to go about doing this, though.

    FYI, a company I've worked for in the past makes a multi-chip module that incorporates 5 die. Not all are processors, though; a couple are cache.

    --Scott
  • And Sony and Motorola are big investors in the Amiga and Tao Group companies. :)
  • I've been following Taos for a while... but what does all this mean? I bet there's some explanation somewhere on the page, but it's late and someone's been spamming. As far as I can see it's an announcement that stuff's happening with Taos - which can only be a good thing.
  • How many systems have you seen recently that have only one process/thread running at a time? I haven't used one in some time (Though I did just buy an Atari 1200XL at the flea market.) Even if nine CPUs only give you three times the speed, you've just tripled your performance without having to develop a new process or go into, for example, quantum computing.
  • Virata, which makes some very nice communications chips, does the same thing with multiple ARM and DSP cores on a chip. Their Boron ASSP has multiple CPUs, one for apps and high-level code and one for network protocols, plus a DSP for ADSL DMT. They are fabless, and the chips do not try to set speed records. It is mostly a matter of simplifying: why multi-task a DSP between disparate types of tasks (in terms of real-time requirements) when you can add a second core and never have to worry about it?

    In the article, Tao says they will give a choice of processors. That is kind of a cop-out. The tough job is figuring out which processors and RTOSes to use, how to divide tasks among processors, and getting the tasks, protocols, and algorithms running on the RTOSes and CPUs. It isn't child's play.

  • by Anonymous Coward
    It all depends what applications the CPU is being designed for. Sure, a PIII might be more powerful, but it's not going to do you much good if you're putting it in a cell phone. Power requirements and the instructions it'll have to run need to be taken into consideration. If your application requires lots of DSP instructions, you can add another DSP. This way you get maximum use of your silicon. You won't have parts of the CPU (like MMX) sitting around burning power.

    Look at the PowerPC. It might not be as powerful as the newest PIIIs, but it uses 1/3 the silicon and consumes under 10 watts. There are many applications where a PowerPC is much better suited than an Intel chip. Why do you think Nintendo is going to use an IBM PPC in their next game console?

    CPUs are used in a lot more things than just computers. In fact, PCs probably have the worst CPU architecture out there (80x86). And you can blame Microsoft for that!! They won't release a consumer OS for anything that isn't an 80x86. I bet a PIII stripped of the 80x86 junk would run faster, cost less to produce, and use less power. If only....

    Oh ya, this looks like a cool technology, but I don't think it'll catch on. The CPU giants out there will be integrating CPUs onto a single chip and will have the $$$$ to do a better job. Too bad; it would be good to see them succeed.
  • Because the cool bit isn't really the multiple CPU cores on a die; the cool bit is the way they tie the processors together. Basically, they can throw together chips specialised for different tasks, with different instruction sets and different strengths and weaknesses, and the Tao/Amiga system will take advantage of all of them according to their strengths, dynamically recompiling your application for them, so each thread runs on an appropriate processor: a sound-FX-intensive thread onto a DSP core, 3D graphics onto a vector core, etc. And you don't need to know the nitty-gritty details; just write code in the language of your choice (Java, C, C++ and VP asm are the current possibilities) for a virtual machine.

    If you remember the original "classic" Amiga from 1985, its design philosophy was not to run everything on the CPU, but to slap in task-specific (but programmable) co-processors to do various tasks extremely quickly, with their own DMA to the unified memory pool. This was coupled to an OS that used message-passing by reference, which meant that there was no memory-copy overhead in interprocess communication (and unfortunately also meant that proper interprocess memory protection was impossible; i.e. there's no distinction between threads and processes on a "classic" Amiga). That is why the Amiga, at the time, could wipe the floor with any other system in (and often well above) its price range in terms of simultaneous graphics and sound data throughput (later to be termed "multimedia"). It also made programming the system... interesting...

    Really high-performance stuff tended to mean stepping outside the OS - while the OS was cool in other ways, it didn't expose the full power of the hardware architecture.

    The Tao VP technology makes programming a heterogeneous multiprocessing environment (relatively) easy; the software design concepts have caught up with the hardware design concepts. It's probably no coincidence that the developers behind Tao were once Amiga programmers, who had to bend their minds around using the 68000, copper, blitter, sound and disk I/O processors all at once.

    It now becomes increasingly clear that the new Amiga will, indeed, be "in the spirit of the 'classic' Amiga", just like Amiga Inc. have been saying all along, and will also be, like the original Amiga, technologically advanced compared to its peers. Of course, both the hardware and software will be astronomically more advanced than the Amiga's set of custom chips. (The Amiga custom chips, the "PAD" for Paula, Agnus, Denise, were traditionally given women's names, and the motherboards B-52s album names.)

    It remains to be seen whether management can screw it up as thoroughly as the original Amiga. :-(

    My major worry is that some parts of the system are a tad more proprietary than people are used to these days in the Open Source world, including the multitudes of ex-Amiga users who have switched over to Linux.

  • Imagine a Beowulf cluster of these?

    :)

  • No, the task is still spread out over eight CPUs, or however many there are.


    Don't criticise someone who is attempting to use free software for not using enough free software.
  • Isn't this similar to the Amoeba project I remember reading about in DDJ back in '94? I remember you could combine any kind of processor to create a larger "virtual" processor. It seemed like some pretty trick shtuff!
  • My Celeron @ 450 does 897.84 bogomips (according to 2.3.99-pre6). 900 million ops per second doesn't sound all that fast any more.

    As a side note, with 2.2.14, Linux reported my CPU at about 450 bogomips. Anyone know why there was the change?

    Regardless, with today's gigahertz processors, 900 million ops per second is certainly no better than what we already have.

    Don't get me wrong, it sounds like interesting technology. Could produce very flexible chips, but I don't see a real speed gain here.

  • by jdwilso2 ( 90224 ) on Sunday June 04, 2000 @08:50AM (#1026505)
    Well, actually, bogomips and MIPS and all that don't really mean jack and a half. If you look at ratings like that, they come from frequency. Your CPU is a 450MHz; it can theoretically execute 450 million instructions per second. Yeah, 450 million no-ops. But what good does that do you? To tell you the truth, most of the performance of any x86 processor is lost in branch prediction and memory latency. As far as I can tell, bogomips = bogus MIPS anyway... I don't know why 2.3.99 would be reporting 2x MHz for your bogomips unless they have something screwed up in the kernel they need to fix. Ain't no way a standard superscalar 450MHz processor is gonna execute one instruction every 0.5 cycles. Not an x86, anyway...

    And even if you benchmark an actual program to try to see how many instructions per second it's actually getting, it only means anything for that program. It's totally dependent on how many branches are in the code and how much of the time memory is being accessed.

    These people's idea is excellent because it focuses on the direction the computer industry needs to go: parallelism. Forget doing everything sequentially; do it all at the same time! I'm not talking about within one program, though; I'm talking about throughout the system. You've got your webserver and your fileserver and all your device drivers and your OS and all that good crap to run while you play Quake, and the only way to really help out is to tack on another processor to run other programs at the same time. This is kind of like what Sun is doing for the MAJC architecture with thread-level parallelism.

    And finally, I don't see how you people can honestly think that one processor running at something like 450 is going to be as good as eight on the same die, all running different processes in parallel.

    Speed isn't about megahertz or gigahertz or instructions per second. It's about the time it takes to run something. Who needs benchmarks when I have my analog wall clock to tell me what's best?

    JDW
  • by Signal 12 ( 196465 ) on Sunday June 04, 2000 @08:35AM (#1026506)
    And not only are you using bogomips numbers to compare to some other processor's speed, someone even moderated that up! How's this world going to end?
  • by 575 ( 195442 ) on Sunday June 04, 2000 @08:51AM (#1026507) Journal
    Tao sounds familiar...
    Echoes of an article [slashdot.org]
    Amiga [amiga.com] perhaps?
  • Just to let interested people know that this is the company partnered with Amiga for the NG OS.

    Also, given the actual nature of the article, I am going to quote from the Amiga site from last Friday, before the Elate/Amiga SDK was announced:

    It was with this heavy attitude that I attended an impromptu meeting with a group of visiting Japanese consumer electronics companies.

    I had never met them before, but they had heard what we were doing, and they remembered the Amiga fondly.

    After our presentation, one of the gentlemen sat back and informed me that what he had just heard and seen was the most exciting opportunity he had seen in years, and that we were absolutely the correct company for them to work with.

    What they had not told us until after that was that these three gentlemen were actually representing a group of over 50 consumer electronics companies, and they were looking for a long-term partner!

    Let's just say that they liked what they saw, and heard. There are many things that appear to be going on behind closed doors.....

  • by tilly ( 7530 ) on Sunday June 04, 2000 @08:38AM (#1026509)
    Today CPUs spend a lot of energy trying to extract parallelism out of code designed to be run linearly. The ability to take advantage of parallelism is strongly limited by your ability to find it, rather than the ability of the chip to carry out instructions in parallel.

    Well if the chip is emulating a dual processor machine, then you have pushed a lot of that work down to having the OS identify 2 processes that can run in parallel. I would think that this would be a huge win.

    Is there something obvious that I am missing?

    Cheers,
    Ben
  • Your CPU is a 450MHz; it can theoretically execute 450 million instructions per second.

    Surely a superscalar architecture (i.e. anything later than a Pentium) can issue more than one instruction per clock cycle? Granted, this isn't always the case by any means, unless the code is particularly parallel; but surely nops don't have many data dependencies?

  • by smack_attack ( 171144 ) on Sunday June 04, 2000 @09:35AM (#1026511) Homepage
    Yeah, that's all fine and dandy, but can you program missiles with it like my Playstation II ?
  • Today CPUs spend a lot of energy trying to extract parallelism out of code designed to be run linearly. The ability to take advantage of parallelism is strongly limited by your ability to find it, rather than the ability of the chip to carry out instructions in parallel.

    Well if the chip is emulating a dual processor machine, then you have pushed a lot of that work down to having the OS identify 2 processes that can run in parallel. I would think that this would be a huge win.


    This is nice, if you can overcome a few concerns:

    • Duplicate register sets.
      Each thread will need to work with its own copy of the virtual chip's registers. For processes, instead of threads, each will also need its own page table (though you could just remap them to different sections of the same table). You might be able to emulate dual x86 processors, but something with a larger register set would pose problems (see the sketch after this list).

    • Different instruction pointers.
      Each thread is an independent instruction stream, that can wander and jump any which way. You'd need explicit hardware support to be able to emulate this without prohibitive overhead.

    • Greater memory load.
      As the processes would be (hopefully) independent, you'd be hitting two completely different sets of pages when accessing memory. This means you'd need a cache twice as large to avoid thrashing. You could get around this by supporting multiple threads, not processes, but this limits the advantage of your proposal.

    • More hardware complexity.
      Transmeta's fundamental approach was to reduce hardware complexity and cost. As some additional hardware support is needed (in fact, a fair bit of additional support), emulation of SMP systems is unlikely to be embraced by Transmeta.
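
    (To make the first two points concrete, here is a sketch of the state that would have to be duplicated per virtual processor; the struct and field names are mine and illustrative, not Transmeta's.)

    /* One emulated x86 CPU's worth of architectural state; a
     * dual-processor emulation keeps two of these and either
     * interleaves them or runs them truly in parallel. */
    #include <stdint.h>

    struct guest_cpu {
        uint32_t eax, ebx, ecx, edx;   /* general-purpose registers    */
        uint32_t esi, edi, ebp, esp;
        uint32_t eip;                  /* independent instruction ptr  */
        uint32_t eflags;
        uint32_t cr3;                  /* per-process page-table base  */
    };

    struct guest_cpu cpus[2];          /* the "duplicate register sets" */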


    This is a fascinating idea, but there are substantial hurdles that would have to be overcome implementing it in practice.
  • I remember about 18 months ago when Cyrix bet the farm on one of these processors. If I recall correctly, there were major compatibility issues with the built-in DSPs. If this company is going to try to do something like this, they'd better have some support from software companies.
