Emergence of SMT
yellow writes "SMT, or Simultaneous Multithreading, is a concept that is rapidly gaining adherents in the microprocessor arena. It essentially gives a single processor multi-processor capabilities in both ILP (Instruction Level Parallelism) and TLP (Thread Level Parallelism). When comparing SMT vs. dual- or multi-processor performance data, it is important to compare apples to apples, and understand why, under an OS such as Win 98 or ME, even a single Pentium III 1 GHz will handily outperform a dual-Pentium III 500 MHz setup. This is the discussion topic of a new feature on HWC."
Hemos - The Story is a Troll (Score:1)
SMT is really the reaction to the death of PC's (Score:2)
The fact that Intel's last processor release was a "mobile processor", Intel's future x86 vapormap includes pure SMT chips, and Compaq's future vapormap includes an SMT alpha shows how important that size reduction is.
Re:Multi Processors under Win9x (Score:2)
--
Well, you're wrong... (Score:1)
Time for some education on computer architecture.. (Score:5)
Rather unnecessary, don't you think? I don't think you would be quite so hostile to SMT if you had a better understanding of it.
Time to do some informing...
> Surely this 'virtual-multi-CPU' system can only decrease the sheer number of operations per second a CPU of a given size/speed can do?
This statement doesn't make much sense. I think you mean that a CPU which spent the die area on additional functional units, as opposed to SMT support, could achieve a greater MIPS value. This is true, but only for theoretical MIPS. Simply adding more functional units to modern chips would *not* improve actual performance. Explanation follows...
> The overhead - whether it be in sacrificed MIPS or die area, of distributing instructions among execution units is going to be significant, compared to a maxed-out single core design.
So you are stating that implementing SMT comes at a cost in die area? Of course it does, but the important point is that using that die area instead to add more conventional execution units would *not* increase the performance of the processor. Why, you ask? There is a limited amount of instruction-level parallelism available in a single thread of program execution. Current wide-issue superscalars get something on the order of 2.3 dispatches per clock, despite the fact that they have the *capability* to issue far more. The processor simply cannot find enough independent instructions to keep its functional units full. If memory serves, the Athlon is a 9-issue core. You could add functional units up to 12-issue or more, but your actual dispatch rate would still be around 2-3 per clock. While your theoretical performance would increase, your actual performance would remain stagnant.
So, current software does not exhibit enough parallelism to keep the functional units in even current processors busy. SMT proposes to increase available parallelism by issuing instructions from *multiple* threads at once. Instructions from different threads are guaranteed to be independent, so if you have n threads running at once, your number of available instructions for dispatch each clock is improved about n times. Of course, this method has a cost in complexity and area -- now the CPU has to have knowledge of threads, and keep a process table on die. However, provided many threads are run at once, this *greatly* increases the utilization of the processor's resources, and thus the performance of the part.
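The dispatch-rate argument can be sketched numerically. This is a toy Python model, not a cycle-accurate simulation: the 9-wide issue figure comes from the post, and the per-thread ILP pool (mean 2.5 independent instructions per clock) is an illustrative stand-in for the ~2.3 dispatch rate mentioned above.

```python
import random

random.seed(42)

ISSUE_WIDTH = 9   # an Athlon-class 9-issue core, per the post
CYCLES = 10_000

def issued_per_clock(n_threads):
    """Average instructions dispatched per clock with n independent threads."""
    total = 0
    for _ in range(CYCLES):
        # Each thread contributes a small, varying pool of independent
        # instructions (mean 2.5); pools from different threads are
        # independent by construction, so they simply add up.
        available = sum(random.randint(1, 4) for _ in range(n_threads))
        total += min(available, ISSUE_WIDTH)
    return total / CYCLES

print(issued_per_clock(1))   # stuck around 2-3, far below the 9-issue ceiling
print(issued_per_clock(4))   # utilization climbs toward the ceiling
```

The single-thread number never approaches the issue width no matter how wide the core is; adding threads is what raises it.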
> Since reading and writing to various RAM caches are the biggest bottlenecks in the current PC architecture, adding more units is just going to lead to increased contention for these resources.
...this is a valid point. SMT particularly increases the burden on instruction fetch and cache, since it is pulling from several different streams at once. However, there are methods that can somewhat compensate for the contention of resources introduced. Now, you have multiple threads available at all times. So, when one thread stalls on a cache miss, the processor can dispatch a different thread to run while the cache miss on the first is being serviced. This effectively hides the latency of the cache miss since the processor is able to do useful work during the service. You see, it's all about keeping those functional units busy.
> So many CPU cycles are wasted with the current generation of software that it seems a bit pointless increasing the number of potential instructions you could perform..
If you believe this, then you should be pro-SMT. SMT doesn't address increasing potential instructions performed per second. Instead, it is an attempt to close the gap between *actual* performance and *theoretical* performance by keeping more of your processor busy.
>you have to question the thinking behind such a modification.
Re:Too Many TLAs (Score:1)
Re:Back to the Future (Score:1)
Btw, a similar solution was implemented in Atari's Jaguar 64-bit console. The RISC processors (two of them) both had 64 registers, divided into two banks of 32 registers each, plus a special instruction for switching banks. This was very useful in the Jaguar architecture, which depended heavily on fast interrupt handling (hardware like the blitter sent back an interrupt request as soon as it was finished with its task): you simply let your main thread have one bank and your interrupts share the other one. Never had to push/pop registers on the stack.
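The bank-switching trick sketched as a toy model (the class and method names here are illustrative, not the Jaguar's actual mnemonics):

```python
# Toy model of dual register banks: the main thread uses bank 0, interrupt
# handlers use bank 1, so no registers need saving or restoring on entry/exit.
class BankedRegisters:
    def __init__(self, regs_per_bank=32):
        self.banks = [[0] * regs_per_bank, [0] * regs_per_bank]
        self.active = 0          # main thread runs on bank 0

    def switch_bank(self):
        """Models the special bank-switch instruction."""
        self.active ^= 1

    def read(self, r):
        return self.banks[self.active][r]

    def write(self, r, value):
        self.banks[self.active][r] = value

regs = BankedRegisters()
regs.write(5, 123)      # main thread's state lives in bank 0
regs.switch_bank()      # interrupt entry: nothing pushed to the stack
regs.write(5, 999)      # the handler freely clobbers r5 in bank 1
regs.switch_bank()      # interrupt exit: nothing to pop
assert regs.read(5) == 123   # the main thread's r5 was never disturbed
```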
Re:"Bollocks" ? (Score:1)
Optimization is all about turning compute-bound problems into I/O bound problems. Most processors are fast enough these days that they're I/O bound. Particularly if you have an OS that tends to keep L1 completely thrashed. (I recall seeing numbers which showed that Win9x tends to keep L1 completely hosed, whereas WinNT and Linux do not.)
Note that I/O bound doesn't necessarily mean bandwidth starved -- latency is also an issue with many typical tasks. Notice how RAMBUS machines tend to perform fairly slowly on many non-latency-tolerant tasks. It's not for lack of bandwidth.
What's worse is that we've passed the sweet spot in cache sizes, such that L1 cache sizes are going to stop increasing and start decreasing again, sadly. (Transport delay in getting bits from the far side of L1 is the limiter, I understand.)
Now the statement that most machines spend their time waiting for disk, I don't think that's quite true. That's maybe true under Windows (it certainly seemed to be last time I used it regularly, even with gobs of RAM handy), but it's certainly not true under Linux or Solaris. I almost never hear my HD run under normal circumstances. Even when it does run, it's not chugging incessantly. It's called "having enough RAM."
Even with enough RAM, though, PC133 SDRAM is quite a bit slower than L2, which is noticeably slower than L1. Anything that doesn't stay in L1 most of the time is going to quickly bottleneck on the other levels of the memory hierarchy. Not much you can do about it, either.
--Joe--
Re:SMT != ILP != multiple pipes. (Score:2)
That's true only if you have sufficient hardware registers (not to be confused with architectural registers), and your tasks aren't bottlenecked on memory. If you have truly independent tasks, then each will effectively see half as much cache at all levels of the hierarchy. This can get especially painful in L1I -- if you can't keep the CPU fed from both streams, oops! And, large register files can be a speed limiter in the architecture. (On the plus side, though, the hardware register files can be distinct between the threads, so it's not too bad. As I understand it, it's the unified register files that are the real problem.)
One of the main attractive things I see in SMT is that you can effectively make the pipeline deeper on the architecture while completely hiding it. This is important on VLIW-style architectures that have an exposed pipeline. To make it deeper, you need to somehow hide the fact that it's deeper than the code thinks it is. One way is to add interlocks (gradually making stages of the pipeline protected, rather than exposed). Another is to interleave multiple threads, so that each thread sees the pipeline length it expects, but the actual pipeline is some factor larger.
Of course, in this VLIW world, things can get tricky outside the CPU. Because you lack superscalar issue (that's what VLIW's about), stalls become a problem. "One-stall-all-stall" is an oft-mentioned SMT VLIW technique, and with it, you really need to make sure you aren't bottlenecked on memory before you go down the SMT path. "One-stall-all-stall" means if one SMT thread stalls, all threads stall... As I understand it, it's the "cheap" way to maintain the VLIW state in an SMT VLIW machine, but it also amplifies any memory system bottlenecks you might have.
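A rough sketch of how "one-stall-all-stall" amplifies memory stalls; the miss rate, latency, and thread count are illustrative values, not measurements from any real machine:

```python
import random

random.seed(1)

THREADS = 4
MISS_RATE = 0.05     # chance a thread stalls on memory in a given cycle
STALL_CYCLES = 20    # cycles to service the miss
SIM_CYCLES = 50_000

def throughput(one_stall_all_stall):
    """Instructions completed per cycle, summed over all threads."""
    done = 0
    ready_at = [0] * THREADS         # cycle at which each thread can run again
    for cycle in range(SIM_CYCLES):
        for t in range(THREADS):
            if ready_at[t] > cycle:
                continue             # thread t is waiting on memory
            if random.random() < MISS_RATE:
                ready_at[t] = cycle + STALL_CYCLES
                if one_stall_all_stall:
                    # the "cheap" VLIW-state option: one miss freezes everyone
                    ready_at = [max(r, cycle + STALL_CYCLES) for r in ready_at]
                    break
            else:
                done += 1
    return done / SIM_CYCLES

print(throughput(False))  # per-thread stalls: the other threads keep running
print(throughput(True))   # one-stall-all-stall: every miss freezes the machine
```

Under the same miss rate, the one-stall-all-stall variant loses a large chunk of its throughput, which is the amplification effect described above.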
--Joe "Mr. VLIW"--
Re:SMT? Blow it out your ass. (Score:1)
Some think it means multiple cores on one chip, a la the new POWER4 from IBM. Apparently IBM doesn't think so, since they don't call that SMT themselves. If you go by that definition, then yes, you have just increased CPU power without increasing the overall bandwidth of the system, and that gives you sucky performance.
However, the other definition of SMT is a single core capable of keeping multiple contexts and switching between them without software help. To software they look (almost) like two regular CPUs, and so e.g. Linux would assign two processes or threads to each core. The idea is that the core will execute the instructions from virtual CPU #1 until it hits a cache miss. Then it switches to executing instructions from virtual CPU #2, until it hits a cache miss... and so on.
In the second scenario you have a CPU with the same MIPS as before, but suddenly you are not wasting as much CPU power waiting. In the context of the PC industry that means you can get away with smaller caches and memory with higher latency (hello RAMBUS).
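The switch-on-miss scenario can be modeled crudely. This sketch optimistically assumes a second context can always soak up the whole miss latency; the miss rate and penalty are illustrative numbers:

```python
import random

random.seed(7)

MISS_RATE = 0.02     # chance an instruction misses the cache
MISS_PENALTY = 50    # cycles to fetch the line from memory
INSTRUCTIONS = 20_000

def cycles_needed(n_contexts):
    """Cycles to retire INSTRUCTIONS total instructions.

    With one context the core sits idle through every miss; with more,
    it switches to another virtual CPU and (optimistically, for this
    sketch) overlaps the service time with the other thread's work."""
    busy = INSTRUCTIONS      # one cycle per instruction when not stalled
    misses = sum(random.random() < MISS_RATE for _ in range(INSTRUCTIONS))
    stall = misses * MISS_PENALTY
    if n_contexts == 1:
        return busy + stall  # every miss is dead time
    hidden = min(stall, busy * (n_contexts - 1))
    return busy + stall - hidden

print(cycles_needed(1))  # same MIPS, but lots of cycles wasted waiting
print(cycles_needed(2))  # miss latency hidden behind the other context's work
```

The peak MIPS is identical in both cases; only the wasted waiting time changes, which is the point of the comment above.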
Re:Back to the Future (Score:2)
CISC, RISC, and VLIW. It's very simple. (Score:2)
The BOTTLENECK is the memory bus. Refilling the cache from a hose that's 1/10th the speed of the processor. That's why CISC lasted so long in the first place, and is still with us today. CISC has variable-length instructions. If you can express an instruction in 8 bits, you do so. 16 for the more complex ones, 32 bits for the really complex ones. So when you're sucking data into the cache 32 bits at a time, you can get 2 or 3 instructions in a 32-bit mouthful. (Or, in the case of Pentiums, 64 bits to feed 2 pipelines, but the principle's the same.) You're optimizing for the real bottleneck with compressed instructions.
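The density arithmetic behind this argument, made concrete. The CISC average is an illustrative guess, and the VLIW figure uses this post's 192-bits-for-three-instructions claim rather than any vendor's spec:

```python
# Instructions delivered per 32-bit (4-byte) fetch, for illustrative
# average instruction sizes (not measured from real code):
FETCH_BYTES = 4

sizes = {
    "CISC": 2.0,         # variable length: common ops fit in 1-2 bytes
    "RISC": 4.0,         # fixed 32-bit instructions
    "VLIW": 192 / 8 / 3  # the post's figure: 192 bits buys three slots,
                         # and one of them is often a NOP
}

for name, size in sizes.items():
    print(f"{name}: {FETCH_BYTES / size:.2f} instructions per fetch")
```

Same memory bus, wildly different instruction delivery rates, which is why the post calls variable-length encoding "compressed instructions."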
The fixed-length instructions of RISC can be executed 2 at a time because you don't have to decode the first one to see where the second one starts. But Sparc, PowerPC, and even Alpha haven't displaced Intel, because the real bottleneck is the memory bus, and bigger instructions aren't necessarily a win. (That, and Intel translates CISC to RISC inside the cache, and pipelines stuff.)
VLIW as Itanium picked it up sucks so badly because the real bottleneck is sucking data from main memory, and now they want 192 bits of it per clock! For only three instructions, and on average at least one will probably be a NOP. Crusoe has a MUCH better idea, sucking compressed CISC instructions in and converting them to VLIW in the cache (like Pentium and friends do for CISC to RISC).
This multi-threading stuff is just a way to keep the extra VLIW execution units from being fed NOPs. It doesn't deal with the real problem: the memory-bus de-optimization of reverting to full-sized instructions all the time.
Rob
Re:Is this new? (Score:1)
The idea here is that if your program has many threads, you can run them all at the same time. Since each thread does not depend on the results of another thread's instructions (at least in the common case; if they're fighting over a lock, that's different), there is no reason why they shouldn't be able to run at the same time. There are almost no data dependencies between threads, hence a whole lot of blindingly obvious ILP that the processor can take advantage of. This doesn't make any single thread run faster than it would on a processor that runs only one thread, but it allows more threads to run "slower" at the same time.
Re:Too Many TLAs [ot] (Score:3)
A nice article about SMT on the Alpha (Score:1)
It is a 3-part article; click on "Alpha EV8 (Part x): Simultaneous Multi-Threat".
One of the things I like about SMT is that since it is quite "cheap", it has a chance to spread quite effectively.
And when there is a huge base of SMT-ready CPUs, we will AT LAST see more software which takes advantage of parallelism (be it SMT or SMP).
"Bollocks" ? (Score:2)
You overlook a couple of very important factors.
First of all, it would cost you almost no extra silicon or latency to have duplicate L1 caches, and to add a selection bit to the addresses sent out on memory operations.
Secondly, technologies like SMT help _save_ you when you have a cache miss, because you still have an instruction stream that can execute while one thread's waiting for data.
Re:"Bollocks" ? (Score:2)
And this depends entirely on your workload. Many tasks are memory-bound - and many are not. Generally, anything that can fit in the on-die caches will be CPU bound (for most cases). This still covers a wide range of useful problems.
The gap between memory speed and CPU speed is caused by DRAM latency and system bus speed, neither of which are issues for on-die caches. If clock speed increases and die sizes stay the same, propagation latency will become an issue, but SMT is great for alleviating _that_, too; as long as throughput scales with clock speed, you can tolerate higher latency by interleaving requests from different threads.
In summary, I think that memory bottleneck problems aren't as severe as you make them out to be. Yes, they're very relevant for programs that work with large data sets, but that by no means covers all tasks we want computers to perform.
SMT != ILP != multiple pipes. (Score:5)
You might want to doublecheck the terms you're using:
Multiple pipes are a relatively old idea. Ditto instruction-level parallelism, which is one of the analytical quantities used to judge how well multiple pipes will work in a given situation. SMT is a relatively new idea that lets you easily boost the instruction-level parallelism, which in turn makes scheduling and issuing instructions *much* easier.
Re:"Bollocks" ? (Score:5)
Um, no.
Most of your die is taken up by the _L2_ cache. You have plenty of space to add more L1 cache. The reason you usually don't is that a larger L1 cache served by the same set of address lines has longer latency. Two independent duplicates of an L1 cache will behave identically to the original L1 cache.
Performing the selection adds latency, but this can be masked because you know the value of the selection bit long before you know the value of the address to fetch.
In fact, you'd almost certainly _reduce_ the cache load compared to a single-threaded processor capable of issuing the same number of loads per clock, because they'd be hitting different caches, and you wouldn't have to multiport.
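A toy sketch of the duplicated-L1 point: give each thread its own copy of the cache, select by thread id, and the threads stop evicting each other. Everything here (sizes, access patterns, the dict-based cache) is contrived purely for illustration, not a model of any real part:

```python
# Two direct-mapped L1 models: per-thread duplicated caches, where the
# thread id acts as the "selection bit", vs. one shared cache.
LINES = 64   # lines per cache

class SplitL1:
    def __init__(self):
        self.caches = [{}, {}]        # one private cache per thread
        self.misses = 0

    def access(self, thread, addr):
        cache = self.caches[thread]   # selection happens before indexing
        index = addr % LINES
        if cache.get(index) != addr:  # direct-mapped tag check
            self.misses += 1
            cache[index] = addr

class SharedL1:
    def __init__(self):
        self.cache = {}
        self.misses = 0

    def access(self, thread, addr):
        index = addr % LINES
        if self.cache.get(index) != addr:
            self.misses += 1
            self.cache[index] = addr

split, shared = SplitL1(), SharedL1()
for rep in range(100):
    for addr in range(32):
        split.access(0, addr)
        split.access(1, addr + LINES)   # same index, different tag
        shared.access(0, addr)
        shared.access(1, addr + LINES)  # evicts thread 0's line every time

print(split.misses, shared.misses)  # the split design only takes cold misses
```

With colliding working sets, the shared cache ping-pongs on every access while the duplicated caches miss only on the first pass, which is the "reduce the cache load" claim above in miniature.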
SMT also doesn't save you from cache miss latency. Out-of-order instruction issue saves you from that.
SMT, in any sane design, is used on an OOO core. An OOO core won't save you if your next set of instructions has a true dependence on the value being fetched from memory. SMT gives you a second thread with no data dependence on the stalled load, and hence plenty of instructions in the window that you can execute while waiting.
I'm having trouble seeing where your arguments are coming from. As far as most of the core's concerned, there's still only one (interleaved) instruction stream, just with less data dependence in it. This is scheduled and dispatched as usual.
Back to the Future (Score:3)
Re: and what of the multicore PPC? (Score:1)
Long live Big Blue!
(jfb)
Re: and what of the multicore PPC? (Score:1)
And while I don't doubt that an eight-way POWER4 unit will be as terrifyingly expensive as it is fast, would an equivalent EV8 system be any cheaper? Either way, it'll be interesting at the high end again, now that Alpha finally has competition.
Peace,
(jfb)
Re: and what of the multicore PPC? (Score:1)
(jfb)
Re:Multi Processors under Win9x (Score:1)
IIRC further, there have long been available daughter boards for various Mac platforms to do filter acceleration in Photoshop. I think these have ranged from general purpose CPUs to DSPs. Earliest one I can remember was for JPEG compression, although I think more commonly they were used for stuff like gaussian blurs. I can remember working at prepress shop on an unaccelerated '030 and waiting like 5 minutes to do filters on a 30 MB file, so they had some value.
Re:Multi Processors under Win9x (Score:1)
Yup, kinda reminds me when people were getting all excited about those dual-CPU boxes that Apple was selling to take attention away from the megahertz gap vs. x86. Yeah, never mind that hardly anything at all can utilize more than one processor under MacOS, it's got two CPUs! w00t!
Cheers,
Re:Multi Processors under Win9x (Score:1)
You're not wrong, which is why I said "hardly anything at all" can use more than one CPU under MacOS, instead of saying that nothing can. The amount of software that can, though, is extremely limited.
Cheers,
Re:high-class SMP on x86? but *why?* (Score:2)
While in many senses you are right, I think you are pointing to the wrong issue. It is not something inherent in the x86 arch that causes problems in scaling; it is mostly Intel's SMP bus design. Having all the CPUs share a single bus between each other and system memory is the bottleneck. I mean, look at the Athlon: it isn't riding on an Intel-designed bus, it rides on a DEC-designed EV6, originally made for the Alpha.
While Beowulf is a nifty technology, it does not solve all the scaling problems you might think it does. Beowulf clusters are only useful for a specific subset of available problems: stuff that can be easily split up and sent to many semi-independent processing nodes. Beowulf clusters are generally connected together with 1 Gb or 100 Mb Ethernet, which does not have high bandwidth or low latency compared to the CPU-memory bus in even the cheapest computers. I would take a single 128-CPU box over 64 dual-proc boxes connected via 1 Gb Ethernet (or even Myrinet) any day.
Re:Back to the Future (Score:2)
*Sigh* . . . It seems that no one really does much research into CPU and computer system design anymore. All the major architectures are pretty homogenized: they are either full RISC machines (probably a tad bloated with too many instructions) or RISC machines emulating a CISC instruction set (x86). RISC seems the last fundamental change in hardware design; the processors get smaller and faster, but not much really changes.
One thing that I like, at least as a concept, is the IBM S/390. The IBM Mainframe and its customers have been living in a pretty insular world over the last 20 years, and the hardware/software that runs on this beast is just, different. Some things are goofy, like booting off an emulated punch card reader (!!) IIRC; others nifty, like the magic migrating storage system.
Good to see it finally taking hold (Score:1)
I am soooooo cooool.
Bollocks (Score:1)
I'm much more interested in enhanced cache ideas like IRAM [berkeley.edu] that seek to enhance performance by putting a very large L2$ on chip by combining the discrete logic circuits of the CPU and static L1$ with the capacitor cell circuits of DRAM.
Crispin
----
Crispin Cowan, Ph.D.
Chief Research Scientist, WireX Communications, Inc. [wirex.com]
Immunix: [immunix.org] Security Hardened Linux Distribution
Re:"Bollocks" ? (Score:1)
SMT also doesn't save you from cache miss latency. Out-of-order instruction issue saves you from that.
The main advantage of SMT is that it gives computer architecture scholars something interesting to study :-)
Crispin
----
Crispin Cowan, Ph.D.
Chief Research Scientist, WireX Communications, Inc. [wirex.com]
Immunix: [immunix.org] Security-hardened Linux
Re:Too Many TLAs (Score:1)
Maybe ICANN would take the job on ?
Re:Too Many TLAs (Score:1)
Re:strange (Score:1)
With a single chip multiprocessor (SCMP) you can design a smaller, simpler processor and spend more time performance tuning it. Once you have a single core working and tested, you can stamp it out multiple times. The complexity then gets put in the glue logic which is used to communicate between cores and share caches between them. This problem is well understood from the design of conventional multiprocessors.
Basically, since SMT is new, the design takes longer. SCMP relies on understood technologies, and potentially could be put into production faster.
Mmm, content-free article... (Score:1)
Well, where do we start...
Multi-core CPUs like IBM's are SMP-on-a-chip, which is not the same as SMT by any stretch.
SMT, because more of the functional units on the chip are staying active at one time, increases heat and power consumption just about as much as SMP-on-a-chip, though it may be marginally better because the core-level overhead won't be present.
"SMP-aware" applications? Yeah, you need something like that with the Mac and its cooperative multitasking and wacky thread model. However, with any normal preemptive multitasking, thread supporting OS, introducing threading into a program makes it "SMP-aware" by default (though you may find new/different bugs on an SMT or SMP system).
The only thing I can think of that would be an "SMT optimization" at the application programming level would be threading any floating point calculations separately from integer calculations, thus allowing the FP units to be running independently from the rest of the application.
Methinks Vince needs to bone up a little...
Some more links:
University of Washington SMT info [washington.edu], this is also linked to from the UMass link previously posted
Look at some more Alpha [alphapowered.com] specifics from the source
I believe these Real World guys [realworldtech.com] were quoted in the last Slashdot SMT reference [slashdot.org] (and look, Hemos posted that one too... you'd think they'd read the links...)
Disclaimer: I worked for Compaq (though not the DEC side) on porting an OS to Alpha a couple years ago and having to be aware of SMT in EV8 coming down the pike.
Re:SMT is really the reaction to the death of PC's (Score:1)
The fact that Intel's last processor release was a "mobile processor", Intel's future x86 vapormap includes pure SMT chips, and Compaq's future vapormap includes an SMT alpha shows how important that size reduction is.
SMT on EV8 has been on the Alpha "vapormap" for at least three years, probably considerably longer than that. July 1998 was when I first saw it, and I'm not a DEC person. Don't think it has anything to do with embedded stuff. Don't think you'll find Alphas in very many embedded situations either, not compared to PowerPC or the real embedded architectures...
Re:CISC, RISC, and VLIW. It's very simple. (Score:2)
The BOTTLENECK is the memory bus.
OK.
That's why CISC lasted so long in the first place, and is still with us today.
No way. CISC (i.e. x86) lasted so long because of duopoly action and backward compatibility. In fact, like you said, CISC is dead, because ever since the Pentium, x86 chips have been RISC on the inside and CISC to the outside world (to varying degrees).
The fixed length instructions of RISC can be executed 2 at a time because you don't have to decode the first one to see where the second one starts.
Or n at a time. Any OOO RISC processor these days worth its snot decodes 4 ops/clock, some are at 6 or 8. (If it can't retire that fast, it doesn't really matter...)
Alpha haven't displaced Intel because the real bottleneck is the memory bus
Really? For scientific computing, which is where you have really big datasets and memory bandwidth is key? I don't think you see x86 there very much. You see DEC, IBM, Sun and HP. Who are all, surprise, surprise, RISC-based hardware vendors. Many RISC chips (Alpha, POWER, PowerPC) have long since passed x86 in sheer performance, especially on FP. Intel has definitely won in price/performance, but I would argue that's more due to volume than anything else.
This multi-threading stuff is just a way to keep the extra VLIW execution cores from being fed NOPs.
Umm, Alpha EV8 uses SMT. Not VLIW. Itanium is VLIW-like. Doesn't use SMT. Example no worky.
Not saying that the concept is wrong; SMT might alleviate some of the performance issues with superfluous instructions in a VLIW instruction stream. But that's sort of the point of VLIW: to let the compiler, rather than OOO hardware, figure out how to best use the available functional units as much as possible. It puts NOPs in to keep the instruction stream balanced so the decoder can work in a predictable way, just like in RISC.
Re:SMT ... (Score:2)
Say what? Most applications can't fill a deep pipe, even out-of-order and with aggressive prefetching. The ways this stuff wins include having two (or maybe more!) instruction streams to crunch, and switching away from the one that's now blocking on a memory access. Prefetch on the other surely completed already ...
The P4 is a good example of a pipeline that's too long.
And by the way, why has this taken so long to arrive? It's still not something I can purchase yet, and I first heard of it back in 1992. There's something fishy.
Re:Multi Processors under Win9x (Score:1)
Now if the article had said that a 2 x 800 was just barely as fast as a 1 x 1000, under an OS with the capability to use both CPUs, that would have been noteworthy. But this must have been one of the least informative comparisons I've heard in a long time.
(*) disregarding memory bandwidth, which depends on busses and what not.
Re:Multi Processors under Win9x (Score:1)
SMT? Blow it out your ass. (Score:2)
The overhead - whether it be in sacrificed MIPS or die area, of distributing instructions among execution units is going to be significant, compared to a maxed-out single core design.
Since reading and writing to various RAM caches are the biggest bottlenecks in the current PC architecture, adding more units is just going to lead to increased contention for these resources.
So many CPU cycles are wasted with the current generation of software that it seems a bit pointless increasing the number of potential instructions you could perform..
It's like putting a 700-cubic-inch supercharged W16 engine constructed from 3 straight-8 blocks into a VW Kombi van.
Sure, it'll theoretically go pretty fast, but when it's parked by the side of the road 340 days out of the year and only ever driven by a bunch of hippies who are too stoned to see the road properly at 20 km/h, you have to question the thinking behind such a modification.
Re:SMT? Blow it out your ass. (Score:2)
Re:Back to the Future (Score:1)
You could move single values to/from the other bank with moveta/movefa instructions.
Re:Back to the Future (Score:1)
Re:A nice article about SMT on the Alpha (Score:2)
Re:Sun's already done it (Score:2)
Re:"Bollocks" ? (Score:2)
I'm sitting here working on the software side of an SMT project. This is exactly where SMT offers benefits. The processor switches threads on a cache miss. A 4-way SMT scheme can offer > 2x performance for 2x die size. An SMT processor cannot reduce the cache latency experienced by a particular thread of execution, but it can reduce the amount of time that execution units sit idle.
Re:"Bollocks" ? (Score:2)
Disk I/O should never be a problem. If it is, then you need to alter the system. If your workstation is disk-bound, then you simply add more memory; if your high-end server is disk-bound, then you put in a very expensive RAID.
Typical applications are office suites, video games, web servers and databases. Well, a game should never have to hit the disk except for level loads / movies. Office should only take a hit the first time you start it up (and nowadays that's reduced by boot-time prefetching). Web servers should _always_ contain enough memory to serve the majority of the web site. And professional databases will typically demand multiple expensive drives.
Most other operations only require a moderate use of I/O, which is further reduced by OS-level caching. It's not hard at all to get a CPU load of 2 or more doing things like compiles (which are supposedly very I/O bound). The CPU is more than taxed in such circumstances.
All levels could be both larger and faster
There is most certainly a boost from memory latency and bandwidth enhancement, but I don't agree that these are the universal bottleneck. Many applications are nicely optimized to fit within the half meg of L2 still commonly found. This applies to video games, heavy-duty compression tools, etc. For example, I got a 99% performance boost on a dual-processor when doing MP3 encoding. Obviously main memory / disk I/O wasn't the bottleneck.
I think Lx$ becomes a bottleneck when you context-switch often, thereby flushing the $.
Unfortunately, economically sound SMT isn't going to be as fast as SMP, because you can't have fully redundant and optimized L2 (which will affect large-data-set applications). What I see happening is the use of thread-independent register sets and L0 cache (for the x86 processors at least), then a large number of ports on the shared L1, and finally a minimally ported, though larger than current, on-die L2. There would possibly be a very large off-die L3 (typically targeted at 2 Meg).
It won't help all applications (Apache 1.x or Postgres 7 certainly won't improve), but more and more Solaris and Windows applications could definitely benefit (heck, even the traditional win-benchmark Quake 3 is MT-aided).
What I see as ideal is SMT/P interleaved memory. You use 2-way SMT to take up the slack where ILP bottles up. Then you have a second separate chip (possibly on the same cartridge, sharing an external L3 $). In this way, you have a minimal increase in complexity of the core, and you optionally sell more cores (kind of like 3DFX's VSA-100 mentality: make the core simple and scalable). So long as you have a large enough $ and a minimal number of loading processes, putting all the cores on the same system-memory bus shouldn't be a problem (thus alleviating the complexity of the EV8 point-to-point bus). I don't believe AMD's P2P architecture is going to be worth the added cost, delays, and complexity. Not to mention, mainstream memory can't handle the additional BW.
Re:SMT ... (Score:1)
Now with a superscalar/OOO system things are different: you can keep partially done stuff sitting in reservation stations (but you need twice as many of them, which may cost you in cycle time, which is what marketing is interested in). There are still probably a lot of serializing things around (like an L2 miss). On the other hand, one place you can win is with the main memory subsystem. These days you can build systems with multiple outstanding memory transactions (Rambus, no matter how much people dislike it, is particularly good at potentially having many concurrent senses running in parallel), but to be useful you need to either move the main memory interface onto the CPU or get rid of tacky serializing buses like Slot 1. (It's much better to put the memory interface on the CPU in this case, because you can run the internal memory interface at a higher clock rate and be exposed to more parallelism.)
SMT ... (Score:2)
Seriously though, threading like that is kind of at odds with today's very long pipelines (basically, the cost of a thread switch can be very high if you have to refill a deep pipe). With heavily out-of-order systems this can be less of a problem... but you're still stuck with the problem that if you're using a larger percentage of the CPU's real clocks, then you're going to put more pressure on shared resources like caches and TLBs. Larger L1s/TLBs are going to potentially hit CPU cycle time, and of course these days L2 can take a large percentage of your die size (after all, the goal here is to get more useful clocks per area).
Re:Too Many TLAs [ot] (Score:1)
Bad Example... (Score:1)
This is the expected result under ANY operating system. Multiprocessing only helps for problems that lend themselves to highly parallel processing (exploring independent regions of a cryptographic search space, for example), which many don't, and it incurs overhead in any case.
--
Re:Mmm, content-free article... (Score:1)
Dean Tullsen's papers (Score:4)
For your bedtime reading, y'all.
--Seen
Re:SMP never was all that (Score:1)
Cray has had this for a while (Score:1)
SMT Explained (Score:3)
Introduction to Simultaneous Multi-threading from UMass [umb.edu].
Quick Quiz on SMT [umb.edu].
Caches for Simultaneous Multithreaded Processors: An Introduction [tohoku.ac.jp]
high-class SMP on x86? but *why?* (Score:2)
My feelings shall be vindicated when SMP Athlon machines become readily available. Their comparatively minor bandwidth advantage will let them blow similarly clocked Intel boxes out of the water.
Personally, I feel that the best way to scale x86 to supercomputing levels is through clustering, such as is offered by the venerable Beowulf for GNU/Linux. GNU/Linux, for better or worse, is continuing to grow in popularity, and I would like to see commercial software vendors try releasing Beowulf-enabled software for Linux. Imagine being able to buy Oracle for Beowulf! Okay, poor example; Oracle is memory-intensive rather than CPU-intensive, and an RDBMS is one application so dependent on a fast disk and good caching that the advantages pale in comparison to the potential problems. What would really be cool are Beowulf ports of statistical analysis and 3D-rendering software. Oooh, yeah... after all, The Matrix and Titanic have both proven the effectiveness of free x86 Unix-workalikes in render farms... I believe those two movies used FreeBSD and GNU/Linux respectively.
--
Re:Time for some education on computer architectur (Score:1)
I do. I've been working on it for several years now. The project is alive and well, and we fully intend to deliver an SMT processor. Of course delivery is still a couple years off (designing processors takes a really long time), but we'll get there eventually.
BTW, for those of you interested in SMT, we are hiring [alpha-careers.com].
P.S. Great response to the nay-sayer. Wish I had written it. :-)
Re:SMP never was all that (Score:1)
Yes it is. Since cost is not directly proportional to speed, you can buy two processors at greater than half the speed of one. If we assume adding a processor speeds up your application by 50% (a bad case), we can buy two 666MHz PIIIs at $115 each (from pricewatch), instead of one 1GHz PIII at $240, and get equivalent performance (666+333=1000). Then you also have some money left for a more expensive motherboard. 50% is low though, depending on application you should be able to get it to 75-80%. If you also add overclocking into the picture, you can save a lot of money (I have two Celeron 300A running at 450MHz each, this would be equivalent to something like a PIII at 800MHz, which didn't even exist when I bought them).
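The poster's arithmetic is easy to check in Python. The 50% scaling figure and the Pricewatch prices are the comment's own numbers, not mine:

```python
def effective_mhz(clock_mhz, n_cpus, scaling=0.5):
    """Effective speed when each CPU beyond the first contributes only
    `scaling` of its clock - the poster's deliberately pessimistic case."""
    return clock_mhz * (1 + scaling * (n_cpus - 1))

configs = [
    # (description, clock MHz, CPU count, total price from the comment)
    ("1x 1 GHz PIII",   1000, 1, 240),
    ("2x 666 MHz PIII",  666, 2, 2 * 115),
]

for name, mhz, cpus, price in configs:
    perf = effective_mhz(mhz, cpus)
    print(f"{name}: ~{perf:.0f} effective MHz for ${price}")
```

Even at 50% scaling the dual 666 lands at ~999 effective MHz for $230 versus $240, and at the more realistic 75-80% scaling the dual setup pulls well ahead on price/performance.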
Re:SMP never was all that (Score:1)
1GHz Thunderbirds seem to be around $165 [pricewatch.com], while 650MHz Thunderbirds are ~$65 [pricewatch.com], so that would make SMP even more cost-effective. It's a bit difficult to compare combo deals, since there aren't any combos for SMP Thunderbirds (or any SMP Athlon motherboards at all yet). I look forward to buying an SMP Athlon system, but for now they don't exist.
Re:Time for some education on computer architectur (Score:1)
Re:Who cares about performances with Winblows98 ? (Score:1)
That post doesn't deserve a 4 (Score:1)
So, you're just wrong. Try reading something about what you're commenting on first.
Re:CISC, RISC, and VLIW. It's very simple. (Score:2)
Re:Bollocks (Score:3)
Pay particular attention to the fact that you can take an existing superscalar chip and add SMT for only about a 10% chip real-estate premium, while it should be able to double throughput. That's a lot better than trying to double throughput by adding another CPU to a machine or by adding two cores to a CPU.
Also note that it isn't recommended to run processes with different address spaces simultaneously on the processor, because that would thrash the TLBs. It's only suggested that you let multithreaded apps (Oracle, perhaps future versions of Apache) load more than one thread into the processor at the same time.
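A crude Python sketch of the TLB point (the entry count and working-set sizes are made up purely for illustration): threads in one address space share translations, while two separate processes need disjoint sets that together overflow the TLB.

```python
TLB_ENTRIES = 64        # illustrative TLB size, not any real chip's
PAGES_PER_WORKLOAD = 48  # illustrative working set of page translations

def translations_needed(shared_address_space):
    """Distinct page translations competing for TLB entries when two
    contexts run simultaneously on the core."""
    if shared_address_space:
        # threads of one process largely share code and data pages
        return PAGES_PER_WORKLOAD
    # separate processes map disjoint pages, doubling the demand
    return 2 * PAGES_PER_WORKLOAD

for shared in (True, False):
    needed = translations_needed(shared)
    kind = "two threads, shared AS " if shared else "two processes, separate AS"
    print(f"{kind}: {needed} translations, fits in TLB: {needed <= TLB_ENTRIES}")
```

With these toy numbers the shared-address-space case fits (48 of 64 entries) while the two-process case doesn't (96 of 64), so the processor would spend its time faulting translations in and out - the "thrashing" the paper warns about.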
register windows? (Score:1)
I'm a little confused. Don't some processors use register windows to speed up context switching? How is this any different?
PS I'm not trolling, I really don't understand how this is new.
SMP never was all that (Score:1)
Let's leave 98/Me out of the equation for now.
Just how is a dual 500 MHz P3 ever going to beat a 1 GHz P3? You still only have 1,000,000,000 clock cycles per second. Your processor bus isn't any faster - in fact it's got an extra processor on it, so there's increased contention. You have a little bit more L2 cache to play with, but your memory bandwidth is the same, the overhead from SMP guarantees that you'll never efficiently use more than about 900,000,000 of those clock cycles in a second, and all your drivers have to play it SMP-safe, which is another 5-10% speed hit.
x+x is not greater than x*2.
SMP helps partition applications from each other so even if one app is hogging a cpu, other stuff will still give decent response times. But that's about it - unless you need to push the bleeding edge (and spend >5k on a box), SMP is not cost effective.
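The back-of-the-envelope above can be written out in Python. The 10% SMP overhead and 7.5% driver penalty below are the comment's rough estimates (it said 5-10%), not measurements:

```python
def usable_cycles(clock_hz, n_cpus, smp_overhead=0.10, driver_penalty=0.075):
    """Cycles per second actually available, per the poster's estimates:
    ~10% lost to SMP bookkeeping and ~7.5% to SMP-safe drivers.
    Both fractions are the comment's guesses, used here for illustration."""
    total = clock_hz * n_cpus
    if n_cpus > 1:
        total *= (1 - smp_overhead) * (1 - driver_penalty)
    return total

print(f"single 1 GHz : {usable_cycles(1_000_000_000, 1):.3e} usable cycles/s")
print(f"dual 500 MHz : {usable_cycles(500_000_000, 2):.3e} usable cycles/s")
```

By those assumptions the dual 500 nets about 832 million usable cycles per second against the single chip's full billion - which is the poster's point, though note it ignores the cache and responsiveness benefits conceded in the same breath.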
Re:SMP never was all that (Score:1)
Re:SMP never was all that (Score:1)
Then again, if you wanna suffer through first-gen SMP hardware bugs, your wallet will probably have similar masochistic tendencies.
Sun's already done it (Score:1)
IBM multi-threaded CPU, circa 1998. (Score:1)
To my knowledge, this chip is either now in use, or very close to being put into an AS/400 or iSeries box.
In-depth computer hardware info? Huh? (Score:1)
Re:Too Many TLAs (Score:1)
COPS, out-of-phase 6800's (Score:2)
But someone else was designing another CPU, called the COPS. They looked at this well-known out-of-phase 6800 technique, and realized that their design basically used clock-up for fetch/decode, and clock-down for execute. During each half-cycle, half the CPU was sitting idle.
So they doubled the registers, using the fetch/decode unit with one register-set during clock-down and the other register-set during clock-up. The execute unit worked in the converse fashion, alternating register sets. A dual CPU on a single chip for the cost of a second register set and a little control/arbitration logic. They didn't attempt any sophisticated contention-prevention, leaving that up to the software. This was mid-late 70's.
With more modern software, COPS might have been the first SMT. I don't know the timeframe of the CDC6000, whether it beats mid-late 70's or not.
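Here's a toy Python model of that trick: fetch/decode works on one register set during one clock phase while execute drains the other, so two contexts share one pipeline with no explicit context switch. The structure and names are illustrative, not the actual COPS design:

```python
def run(programs, cycles):
    """Alternate two contexts between fetch/decode and execute phases.

    `programs` is a pair of instruction lists, one per register set.
    Returns the order in which (context, instruction) pairs complete.
    """
    pcs = [0, 0]            # one program counter per register set
    decoded = [None, None]  # instruction latched between the two phases
    trace = []
    for clock in range(cycles):
        fetch_ctx = clock % 2     # this context fetches/decodes...
        exec_ctx = 1 - fetch_ctx  # ...while the other one executes
        if decoded[exec_ctx] is not None:
            trace.append((exec_ctx, decoded[exec_ctx]))
            decoded[exec_ctx] = None
        if pcs[fetch_ctx] < len(programs[fetch_ctx]):
            decoded[fetch_ctx] = programs[fetch_ctx][pcs[fetch_ctx]]
            pcs[fetch_ctx] += 1
    return trace

trace = run([["A0", "A1"], ["B0", "B1"]], cycles=6)
print(trace)  # the two contexts' instructions complete interleaved
```

Neither half of the chip ever idles: one context's instructions retire on every clock once the pipe is primed, which is what made the second register set such a cheap win.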
SMP _is_ all that!.. (Score:1)
Cheers...
--
$HOME is where the .*rc is
IBM uses it properly too (Score:1)
Re:SMP in Windows 98/ME (Score:1)
I'd be willing to bet it's vaporware.
strange (Score:2)
As far as SMT goes, I think it's a good idea (well, obviously, why wouldn't it be). You really can only get so much out of instruction-level parallelism, and I've always thought that splitting CPU time up by thread rather than by instruction parallelism would be a lot more effective.
Rate me on Picture-rate.com [picture-rate.com]
there is only default :P (Score:2)
Re:Multi Processors under Win9x (Score:1)
Re: (Score:2)
Comment removed (Score:3)
As they say ... (Score:2)
Actually, while the first bit is true... let's face it, for most business applications we do not need faster machines until we have to deal with the bloat of the next round of software from the major vendors.
Businesses could probably do very well on a single standardized set of software for a decade or more for most common functions. Many have done so, as a matter of fact. There are some businesses out there still running Win 3.1 apps.
Who cares about performances with Winblows98 ? (Score:1)
Re:Who cares about performances with Winblows98 ? (Score:1)
That's not what the original poster said. The original poster was talking about Windows 98/ME. 90% of Windows 98/ME users don't know how to benchmark their computers, and wouldn't want to either. I'm sure lots of NT/2000 users care, especially if they're using Windows as a server. But the average home user of Windows 98/ME doesn't care. Serious gamers and some kinds of professionals care, but that's a very small part of the Windows 98/ME market.
Re:SMT ... (Score:1)
Blame Microsoft - their continuing use of DOS (95, 98, ME) prevents these technologies from being introduced. Once your basic Microsoft OS is based on NT, these technologies will start to be developed.
It currently makes more sense for Intel/AMD to invest in technologies that will make ME/9x faster as well as NT/Unix. Once SMT-capable OSes are the norm, they will look at other ways to speed things up.
die ME die!!!
Re:SMT? Blow it out your ass. (Score:1)
Re:SMT vs SMP (Score:1)
A processor can be 4-way SMT for only 10% extra silicon cost. SMT makes processors more EFFICIENT.
C//
Re: and what of the multicore PPC? (Score:1)
C//
Re: and what of the multicore PPC? (Score:1)
Technology Imitates Life? (Score:2)
Too Many TLAs (Score:2)
(http://smt.ucsb.edu/smt-list/smt-main.html)
the Finland Travel Bureau (http://www.smt.fi/)
When will the madness end?
Multi Processors under Win9x (Score:4)
This is because Win9x does not support SMP, so even a dual 933 MHz will be outperformed by a single 1 GHz. A better comparison would be with Win2k or Linux, which will both actually use both CPUs.
Oh please, DEC Alphas had them long before this (Score:1)
Here is an article from DEC (now Compaq) that describes how the Alpha chip does it:
http://www.compaq.com/hpc/ref/ref_alpha_ia64.doc
for Word, and
http://www.compaq.com/hpc/ref/ref_alpha_ia64.pdf
for PDF.
Digital has always tried to do things "the Right Way" and it shows in their products.
Re:Multi Processors under Win9x (Score:1)
You are right on about the graphic designers. Apple's high-end machines are selling almost entirely into niche markets, and for the most part those markets use software which takes advantage of multiprocessing.
Apple's 'regular user' base is for the most part not buying multi-CPU Macs.
Re:As they say ...what's in the real world is... (Score:1)
Or how about the supply ordering system that runs on DOS. We actually bought a Celeron-based system just to run this one app, and then found that the program wouldn't run because it had a stupid win-modem installed! Forget about using the ink-jet printer too.
The Windows-based version wouldn't load because the installer told us we had to "upgrade" from Windows ME to Windows 95 or better. We thought seriously about Linux/wine - that's better, isn't it?
SMT is not going to help anything we use run faster, so why should we spend the money? Why would anyone develop software for this technology? We don't even get Linux distros optimised for our Athlons!
This is just blue smoke and mirrors to try to get the techno-junkies to shell out a bunch of money for a new system that might actually run their apps ten percent faster for four times the cost.
"Need for faster processor" misunderstood (Score:1)
Those guys got it totally wrong. The question of "why do I need a 1.5 GHz processor?" stems from the fact that current processors are more than fast enough for many applications.
This question isn't answered by introducing a system with higher performance. IMHO it might be answered by new, attractive applications that need the better performance (speaker-independent speech recognition?).
Re:Who cares about performances with Winblows98 ? (Score:1)
Unfortunately, a similarly dismissive attitude is misplaced when considering server architectures. As real life presents opportunities to build ever-larger-scale interactive systems, the speed of execution of critical sections becomes ever more important.
There are more useful questions to ask than "What proportion of users browse with IE on Windows" - for example
1. What proportion of users use unconnected (i.e. no ISP/LAN) machines?
2. What proportion of users require to use client server software to access centralised resources?
3. Are users satisfied with the performance of servers when they access shared resources at times of high contention?
Or... in a much more down to earth fashion - do these users of IE also use Exceed (or other win32 X-server), or telnet?
Re:Multi Processors under Win9x (Score:1)
-----------------