Heterogeneous Multiprocessor Chip Runs Tao/Elate
Madmac wrote in with this cool item: "A trio of Japanese companies have teamed up on a multiprocessor
chip design that can embed multiple processors and DSPs on a single
chip, all running Tao's VP code." An interesting snippet: "Up to eight processor engines can be added to one MISC device, for performance of 200 to 900 million operations per second. They can include any general-purpose RISC processor, DSPs, SIMD engines, vector processors, graphics processors or customized logic."
Re:Hmm (Score:2)
Ah well. I guess I'm just jealous.
Too complex... (Score:1)
My question always is, why do we need faster processors when the software we are using still isn't utilizing the speed we have? Of course the argument can always be made that six months to a year from now the software will have become so bloated that it will need this speed. I disagree: look at Linux. You can easily run a Linux box off a 200MHz processor; you sure as heck don't need 1.2GHz or whatever is the cutting edge right now.
The funny thing is that I read somewhere that Microsoft had come out with some sort of spec saying that to have Windows 2000 run properly and see a performance boost in all its apps, you would need at least a 2GHz processor. Boy, there must be something going on between Intel and Microsoft (Wintel) for this kind of announcement. You would think it would hurt sales, but it doesn't.
Anyhow, I think the speed thing is just a little overrated. I say we need to pay more attention to the motherboard speed and the peripherals...
Nathaniel P. Wilkerson
NPS Internet Solutions, LLC
www.npsis.com [npsis.com]
Here is your link... (Score:2)
Nathaniel P. Wilkerson
NPS Internet Solutions, LLC
www.npsis.com [npsis.com]
That depends on what you are doing (Score:2)
And, of course, for many things the end user doesn't care whether the program makes efficient use of multiple CPUs. If I background one process and turn to doing other things, I have already won if the backgrounded process has a minimal impact on my observed performance.
For all of these reasons, people find wins running dual-processor systems even though they are mostly running programs that were never explicitly written to take advantage of multiple processors.
Cheers,
Ben
Re:Types of guys you can meet in a public restroom (Score:1)
Re:Similar to Crusoe? (Score:1)
With Transmeta you will be solely dependent on their chip, whereas with Amiga/Tao's RTOS you can choose any CPU you like - for example, an SMP system that uses both x86 and PPC chips. Elate is the only OS that supports heterogeneous multiprocessing!
Oh, of course (Score:1)
Any more details? (Score:1)
As for why it is useful: OK, imagine that today's CPUs are engines; so far we've been ramping up the revs, getting to the stage of gas turbines. But the problem is that you can only have homogeneous or balanced turbines, much like a plane requires 4 engines of exactly the same type or it is unflyable. This imposes an incredible overhead if you are only using it infrequently. Unlike PCs/clusters, where if you're sitting down you are working continuously and therefore want reduced latency or improved throughput, consumer items are flogged with the intention that you own several (think of stereo components) that don't always operate together. After all, you watch TV in your living room, nuke your microwave food in the kitchen, have a clock radio in the bedroom, etc. Hence it makes sense to attempt to link together smaller engines to achieve the same task, giving the equivalent of standby processing or processing on demand.
Another consideration is that fab plants take billions to establish. By using IP cores from multiple sources (itself a thorny issue), they can keep the fabs ticking for a lot longer by using different combinations and only improving the blocks which need to change (e.g. encryption for IP protection) rather than the basic DSPs. It shifts the balance away from software and back into specialised hardware, which is cheaper for *SOME* applications that don't change often (e.g. MPEG4 decoders).
Also think of future applications like mixing the Sony PlayStation EE and GS chips with Transmeta and dedicated CDMA chips.
LL
Re:Not that fast. *not* (Score:1)
Depends on who you're talking to. In the DSP world, when TI says their chip does 100 MIPS, that means it does a hard 100 MIPS. Realtime systems guarantee performance; if your code executes in 43 ticks, it will always execute in 43 ticks, and you can count on that fact.
Re:I wonder why Transmeta hasn't tried this... (Score:1)
Processes already have duplicate register sets. This overhead is already there.
>Different instruction pointers.
Ditto.
>Greater memory load.
Only if you are loading more pages. Ok, maybe you'd want to completely separate the pages between processes, but the cache could be made to operate independently of any (weak?) MMU scheme for protecting data.
>More hardware complexity.
I wonder what they're gonna cool this chip with.
My greatest concern here is the bus and memory. They've gotta be made specifically for multiprocessing, or else you're gonna have an expensive bottleneck on your hands.
- Steeltoe
Re:Not that fast. *not* (Score:1)
The point is that Tao makes it easy to do so (recall how long it took for MMX to catch on and get used... with this, you update the VM/compiler thing and you're virtually done).
John
MISC (Score:2)
MIS Chips [ultratechnology.com]
Re:Damn, it's called BOGOmips for a reason (Score:1)
I used to use bogomips for benchmarking. Not because I thought it was a great benchmark, but because it was so easy to set up a job to collect bogomips data from all of our Unix/Linux boxes. I didn't have to worry that any of our users would get upset that we were running an obtrusive benchmark program - bogomips completed very quickly.
Then I noticed that my K6-2 450 had a significantly higher bogomips rating than my Athlon 550 (both AMD, the Athlon being a higher-end CPU line than the K6-2 - so if anything, I'd expect the Athlon to outperform a K6-2 at the same MHz).
Hence, I gave up on bogomips and switched to nbench, the Byte benchmark thingy. I doubt this is a perfect benchmark either, but it does seem to be better correlated with reality than bogomips. Alas, this one pegs the CPU for a while, so I can no longer rest easy that users won't complain about it.
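For the record, bogomips is nothing more than a calibrated empty delay loop. Here's a crude C sketch of the idea - my own toy, not the kernel's actual calibration code, which counts loop iterations against the timer interrupt rather than gettimeofday():

    /* Bogomips-style calibration: count how many iterations of an
     * empty loop fit into a fixed wall-clock interval. The volatile
     * counter keeps the compiler from optimizing the loop away. */
    #include <stdio.h>
    #include <sys/time.h>

    static double elapsed(struct timeval a, struct timeval b) {
        return (double)(b.tv_sec - a.tv_sec) + (b.tv_usec - a.tv_usec) / 1e6;
    }

    int main(void) {
        volatile unsigned long i;
        unsigned long loops = 1;
        struct timeval t0, t1;

        /* Double the loop count until one run takes at least 1/4 second. */
        for (;;) {
            gettimeofday(&t0, NULL);
            for (i = 0; i < loops; i++)
                ;                      /* the "bogus" work */
            gettimeofday(&t1, NULL);
            if (elapsed(t0, t1) >= 0.25)
                break;
            loops *= 2;
        }
        printf("~%.2f million empty loops/sec\n",
               loops / elapsed(t0, t1) / 1e6);
        return 0;
    }

Which, as my K6-2 vs. Athlon numbers show, tells you plenty about loop timing and almost nothing about real workloads.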
Re:Not that fast. *not* (Score:1)
The processing is broken up into discrete stages, and when an instruction moves from one stage to the next, another instruction enters the stage that was vacated. The instructions are moving through a pipe, and all the parts of the processor are (ideally) being utilized at once. This is a rough generalization for a generic processor with steps divided into IF (instruction fetch), ID (instruction decode), EX (execute), MM (memory), and WB (writeback). An instruction is only complete when it exits all of these stages, and at its fastest an instruction moves through one stage every CPU cycle. So if we have 5 instructions entering a pipelined processor with 5 stages, all no-ops, we end up with something like this (where each division takes a cycle to complete):
IF | ID | EX | MM | WB |
     IF | ID | EX | MM | WB |
          IF | ID | EX | MM | WB |
               IF | ID | EX | MM | WB |
                    IF | ID | EX | MM | WB |
Each instruction enters at IF and leaves after WB. Looked at that way, each no-op takes 5 cycles to complete, but since we are pipelining, each instruction effectively takes only one cycle once the initial 4 cycles have filled the pipe. Note that no single instruction goes from fetch to finish in one cycle. It has become standard practice, though, to quote execution cycles on a Pentium-class processor (and for pipelined processors in general) as 1 + (stalls needed). Stalls happen when you need to go to memory, when you have a string of data dependencies, or when you are doing an operation that just takes a freaking long time in the EX stage. And sometimes they just go by how long it takes in the EX stage, ignoring any memory access problems.
And so, yes, we have parallelism in a P6 at the pipeline level (and in branch prediction, but that is irrelevant for this case), but it will not achieve more than 1 IPC.
And that's also why a 450MHz P6 can only achieve a theoretical max of 450 million instructions per second.
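To make the throughput point concrete, here is a toy C model of the 5-stage pipe above - my own sketch, nothing more: N independent, hazard-free instructions retire in N + 4 cycles, approaching but never exceeding one instruction per cycle.

    /* Toy model of the 5-stage pipeline described above: N independent
     * instructions (no stalls, no hazards) flow through IF/ID/EX/MM/WB.
     * Prints which instruction occupies each stage on every cycle and
     * shows N instructions retiring in N + 4 cycles - i.e. throughput
     * approaches one per cycle but never exceeds it on this model. */
    #include <stdio.h>

    #define STAGES 5
    #define N      5    /* number of instructions */

    int main(void) {
        const char *stage[STAGES] = { "IF", "ID", "EX", "MM", "WB" };
        int total_cycles = N + STAGES - 1;

        for (int cycle = 0; cycle < total_cycles; cycle++) {
            printf("cycle %2d:", cycle + 1);
            for (int s = 0; s < STAGES; s++) {
                int insn = cycle - s;   /* which instruction sits in stage s */
                if (insn >= 0 && insn < N)
                    printf("  i%d:%s", insn, stage[s]);
            }
            printf("\n");
        }
        printf("%d instructions retired in %d cycles\n", N, total_cycles);
        return 0;
    }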
Re:Not that fast. *not* (Score:1)
Alrighty then, you and your one process can have at it. But I like to listen to some MP3s while I'm playing with the GIMP, and I don't like waiting for filters. Are you such a dumbass as to not realize that different processes (or threads) can be executed in parallel on parallel processors?
Your assertion that everything should be done at once is just plain stupid. If everything could just be that easily made to run all at once, it would have, a long time ago.
Yep, it's dumb to be efficient. I'm stupid for trying to say that things should be written to take advantage of parallelism... HELLO!!!! Wake up. It's not easy to write parallel code within a single process, but it's not hard to write a single application using multiple threads. Obviously you know I didn't mean run all the instructions in one cycle and put together the answers, cause no one is that idiotic. The only thing I meant was to do as much at once as is possible - not as much as is easy. You are right that if it were easy to write parallel algorithms, it would have been done a long time ago. It is very difficult (one of the hardest CS topics out there), and it has been done before.
Instead of wasting your time learning all there is to know about the history of running ass loads of no-ops through your processor, perhaps you should educate yourself about how processors work. That'd be much more useful, and you'd sound less clueless and much less arrogant. And apparently we will have to have these conversations on Slashdot until idiots like you get some kind of clue about good programming practice and what that piece of porn-downloading equipment you keep in your bathroom called a computer actually does.
So get back to me when you can tell me all about the dining philosophers problem and how to use semaphores for critical section protection in a parallel algorithm. Cause then maybe you will have a clue as to what parallelism actually is and actually can do. Or maybe you'll give up and realize that stupid hacks like you just can't understand how to write good code.
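Since you brought up semaphores and critical sections, here is about the smallest POSIX-threads illustration of the idea - a textbook sketch, not a claim about any particular system: two workers bump a shared counter, and a binary semaphore serializes the updates.

    /* Semaphore-based critical-section protection with POSIX threads.
     * Compile with: cc -pthread sem.c */
    #include <pthread.h>
    #include <semaphore.h>
    #include <stdio.h>

    static sem_t lock;           /* binary semaphore guarding the counter */
    static long counter = 0;     /* shared state (the critical section)   */

    static void *worker(void *arg) {
        (void)arg;
        for (int i = 0; i < 1000000; i++) {
            sem_wait(&lock);     /* enter critical section */
            counter++;
            sem_post(&lock);     /* leave critical section */
        }
        return NULL;
    }

    int main(void) {
        pthread_t a, b;
        sem_init(&lock, 0, 1);   /* initial value 1 => mutex semantics */
        pthread_create(&a, NULL, worker, NULL);
        pthread_create(&b, NULL, worker, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        printf("counter = %ld (expected 2000000)\n", counter);
        sem_destroy(&lock);
        return 0;
    }

Drop the sem_wait/sem_post pair and the final count comes up short on an SMP box - that lost-update race is exactly what the semaphore is protecting against.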
Or maybe you can come back and tell me what good Trolling you've had today to cover up your idiocy... Either way I'll be happy, cause either way you are the crack whore.
Re:I wonder why Transmeta hasn't tried this... (Score:1)
I thought I read something along those lines in the user book that came with it.
-----
If my facts are wrong then tell me. I don't mind.
Hmm (Score:2)
(mind starting to race)
Wow! (Score:1)
Clearly I'm behind the times with my new 80386 16MHz 2MB RAM machine.
Yes?
Re:Not that fast. (Score:1)
Similar to Crusoe? (Score:1)
This does add the interesting possibility of custom DSP stuff directly on the CPU though.
Finally! (Score:1)
Re:I wonder why Transmeta hasn't tried this... (Score:1)
Actually, you have pushed the work all the way back to the programmer, forcing the code to be broken up into n parallelizable parts. Now, some code is "embarrassingly" parallel, but most is not. Being a lazy programmer, I prefer tradeoffs that allow me to work less. For example, look at all the work that's gone into making the Linux kernel support SMP effectively for large numbers of CPUs. It's coming along, but it's taken a while, and it hasn't been easy.
The Amiga Developer Support site is now online. (Score:2)
interesting, but (Score:3)
When a die gets large, you run into thermal expansion problems, so one can't just stick four PIIIs (or whatever) on a single die - it gets too large. But you could stick four PIII dies in a single package. Flip-chip might not be the best way to go about doing this, though.
FYI, a company I've worked for in the past makes a multi-chip module that incorporates 5 dies. Not all are processors, though; a couple are cache.
--Scott
Re:Yes it is. (Score:1)
The implications (Score:1)
Re:Not that fast. *not* (Score:1)
Sounds confused. What's it for? (Score:1)
In the article, Tops says they will give a choice of which processors to use. That is kind of a cop-out. The tough job is figuring out which processors and RTOSes to use, how to divide tasks among processors, and getting the tasks, protocols, and algorithms running on the RTOSes and CPUs. It isn't child's play.
Re:Not that fast. (Score:1)
Look at the PowerPC. It might not be as powerful as the newest PIIIs, but it uses 1/3 the silicon and consumes under 10 watts. There are many applications where a PowerPC is much better suited than an Intel chip. Why do you think Nintendo is going to use an IBM PPC in their next game console?
CPUs are used in a lot more things than just computers. In fact, PCs probably have the worst CPU architecture out there (80x86). And you can blame Microsoft for that!! They won't release a consumer OS for anything that isn't an 80x86. I bet a PIII stripped of the 80x86 junk would run faster, cost less to produce, and use less power. If only....
Oh yeah, this looks like a cool technology, but I don't think it'll catch on. The CPU giants out there will be integrating multiple CPUs onto a single chip and will have the $$$$ to do a better job. Too bad; it would be good to see them succeed.
Re:interesting, but (Score:2)
If you remember the original "classic" Amiga from 1985, its design philosophy was not to run everything on the CPU, but to slap in task-specific (but programmable) co-processors to do various tasks extremely quickly, with their own DMA to the unified memory pool. This was coupled to an OS that used message-passing by reference, which meant that there was no memory-copy overhead in interprocess communication (and unfortunately also meant that proper interprocess memory protection was impossible - i.e. there's no distinction between threads and processes on a "classic" Amiga). That is why the Amiga, at the time, could wipe the floor with any other system in (and often well above) its price range in terms of simultaneous graphics and sound data throughput (later to be termed "multimedia"). It also made programming the system... interesting...
Really high-performance stuff tended to mean stepping outside the OS - while the OS was cool in other ways, it didn't expose the full power of the hardware architecture.
The Tao VP technology makes programming a heterogeneous multiprocessing environment (relatively) easy - the software design concepts have caught up with the hardware design concepts. It's probably no coincidence that the developers behind Tao were once Amiga programmers, who had to bend their minds around using the 68000, copper, blitter, sound, and disk I/O processors all at once.
It now becomes increasingly clear that the new Amiga will, indeed, be "in the spirit of the 'classic' Amiga", just as Amiga Inc. have been saying all along, and will also be, like the original Amiga, technologically advanced compared to its peers. Of course, both the hardware and software will be astronomically more advanced than the Amiga's set of custom chips (the "PAD" - for Paula, Agnus, Denise; the Amiga custom chips were traditionally given women's names, and the motherboards B-52s album names).
It remains to be seen whether management can screw it up as thoroughly as they did the original Amiga.
My major worry is that some parts of the system are a tad more proprietary than people are used to these days in the Open Source world, including the multitudes of ex-Amiga users who have moved over to Linux.
Obligatory: Beowulf reference (Score:1)
Re:Forgive my idiocy, but... (Score:1)
Don't criticise someone who is attempting to use free software for not using enough free software.
Amoeba (Score:1)
Not that fast. (Score:2)
As a side note, with 2.2.14, Linux reported my CPU at about 450 bogomips. Anyone know why there was the change?
Regardless, with today's gigahertz processors, 900 million ops per second is certainly no better than what we already have.
Don't get me wrong, it sounds like interesting technology. It could produce very flexible chips, but I don't see a real speed gain here.
Re:Not that fast. *not* (Score:3)
And even if you benchmark an actual program to try and see how many instructions per second it's actually getting, that only means anything for that program. It's totally dependent on how many branches are in the code and how much of the time memory is being accessed.
These people's idea is excellent because it focuses on the direction the computer industry needs to go: parallelism. Forget doing everything sequentially, do it all at the same time! I'm not talking about within one program, though; I'm talking about throughout the system. You've got your webserver and your fileserver and all your device drivers and your OS and all that good crap to run while you play Quake, and the only way to really help out is to tack on another processor to run the other programs at the same time (see the sketch below). This is kind of like what Sun is doing in the MAJC architecture with thread-level parallelism.
And finally, I don't see how you people can honestly think that one processor running at something like 450MHz is going to be as good as 8 on the same die, all running different processes in parallel.
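And exploiting that kind of system-level parallelism needs no clever threading at all - here's a trivial sketch (the workloads are placeholder busy loops): two ordinary processes the OS is free to schedule on two different CPUs.

    /* Process-level parallelism with plain fork(): two independent
     * workloads run as separate processes, so on a multiprocessor the
     * OS can put them on different CPUs without either program being
     * written for SMP. The "work" is just a stand-in busy loop. */
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static void work(const char *name) {
        volatile unsigned long i;
        for (i = 0; i < 100000000UL; i++)   /* placeholder workload */
            ;
        printf("%s done (pid %d)\n", name, (int)getpid());
    }

    int main(void) {
        pid_t pid = fork();
        if (pid < 0) {
            perror("fork");
            return 1;
        }
        if (pid == 0) {               /* child: the "background" job */
            work("mp3 player stand-in");
            _exit(0);
        }
        work("gimp filter stand-in"); /* parent: the interactive task */
        waitpid(pid, NULL, 0);        /* reap the child */
        return 0;
    }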
Speed isn't about megahertz or gigahertz or instructions per second. It's about the time it takes to run something. Who needs benchmarks when I have my analog wall clock to tell me what's best.
JDW
Damn, it's called BOGOmips for a reason (Score:4)
Haiku (Score:3)
Echoes of an article [slashdot.org]
Amiga [amiga.com] perhaps?
Just for the record.....on Amiga & Tao relations (Score:2)
Also, given the actual nature of the article, I am going to quote from the Amiga site from last Friday before the Elate/Amiga SDK was announced -
It was with this heavy attitude that I attended an impromptu meeting with a group of visiting Japanese consumer electronics companies.
I had never met them before, but they had heard what we were doing, and they remembered the Amiga fondly.
After our presentation, one of the gentlemen sat back and informed me that what he had just heard and seen was the most exciting opportunity he had seen in years, and that we were absolutely the correct company for them to work with.
What they had not told us until after that was that these three gentlemen were actually representing a group of over 50 consumer electronics companies, and they were looking for a long-term partner!
Let's just say that they liked what they saw, and heard. There are many things that appear to be going on behind closed doors.....
I wonder why Transmeta hasn't tried this... (Score:3)
Well if the chip is emulating a dual processor machine, then you have pushed a lot of that work down to having the OS identify 2 processes that can run in parallel. I would think that this would be a huge win.
Is there something obvious that I am missing?
Cheers,
Ben
Re:Not that fast. *not* (Score:1)
> Your CPU is 450MHz; it can theoretically execute 450 million instructions per second.
Surely a superscalar architecture (i.e. anything later than a Pentium) can issue more than one instruction per clock cycle? Granted, this isn't always the case by any means, unless the code is particularly parallel - but surely nops don't have many data dependencies?
The Real Question... (Score:3)
Re:I wonder why Transmeta hasn't tried this... (Score:5)
> Well if the chip is emulating a dual processor machine, then you have pushed a lot of that work down to having the OS identify 2 processes that can run in parallel. I would think that this would be a huge win.
This is nice, if you can overcome two concerns:
Each thread will need to work with its own copy of the virtual chip's registers. For processes, instead of threads, each will also need its own page table (though you could just remap them to different sections of the same table). You might be able to emulate dual x86 processors, but something with a larger register set would pose problems.
Each thread is an independent instruction stream, that can wander and jump any which way. You'd need explicit hardware support to be able to emulate this without prohibitive overhead.
As the processes would be (hopefully) independent, you'd be hitting two completely different sets of pages when accessing memory. This means you'd need a cache twice as large to avoid thrashing. You could get around this by supporting multiple threads, not processes, but this limits the advantage of your proposal.
Transmeta's fundamental approach was to reduce hardware complexity and cost. As some additional hardware support is needed (in fact, a fair bit of additional support), emulation of SMP systems is unlikely to be embraced by Transmeta.
This is a fascinating idea, but there are substantial hurdles that would have to be overcome implementing it in practice.
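To put some flesh on the first hurdle, here is a rough C sketch of the architectural state an emulated dual-processor machine would force the code-morphing layer to duplicate and swap. The layout is purely illustrative - certainly not Transmeta's internal format.

    /* Each emulated CPU needs its own complete architectural context;
     * the emulator round-robins between them. Illustrative only. */
    #include <stdio.h>
    #include <stdint.h>

    #define NUM_VCPUS 2

    struct vcpu_context {
        uint32_t gpr[8];    /* EAX..EDI: a duplicated register set */
        uint32_t eip;       /* independent instruction pointer     */
        uint32_t eflags;
        uint32_t cr3;       /* page-table base - per-process state */
        /* ...segment registers, FPU state, and so on... */
    };

    static struct vcpu_context vcpu[NUM_VCPUS];
    static int current = 0;

    /* The save/restore on every switch - plus the cache and TLB
     * pressure of two independent instruction streams - is exactly
     * the overhead discussed above. */
    static int next_vcpu(void) {
        current = (current + 1) % NUM_VCPUS;
        return current;
    }

    int main(void) {
        vcpu[0].eip = 0x1000;   /* two streams at different points */
        vcpu[1].eip = 0x2000;
        for (int i = 0; i < 4; i++) {
            int v = next_vcpu();
            printf("switch to vcpu %d, eip=0x%x\n",
                   v, (unsigned)vcpu[v].eip);
        }
        return 0;
    }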
Didn't Cyrix make one of these? (Score:1)