Technology

Emergence of SMT

yellow writes "SMT, or Simultaneous Multithreading, is a concept that is rapidly gaining adherents in the microprocessor world. It essentially gives a single processor multi-processor capabilities by exploiting both ILP (Instruction Level Parallelism) and TLP (Thread Level Parallelism). When comparing SMT vs. dual or multiprocessor performance data, it is important to compare apples to apples, and understand why under an OS such as Win 98 or ME, even a single Pentium III 1 GHz will handily outperform a dual-Pentium III 500 MHz setup. This is the discussion topic of a new feature on HWC."
This discussion has been archived. No new comments can be posted.

  • by Anonymous Coward
    Slashdot has been trolled again.
  • The processor world right now is 99% driven by embedded systems. You need to put multiple cores on the same die to boost embedded performance since there's not enough room for multiple processors. Now that PCs are dead, we're seeing a spike in SMT press releases even though the technology has been floating around forever.

    The fact that Intel's last processor release was a "mobile processor", Intel's future x86 vapormap includes pure SMT chips, and Compaq's future vapormap includes an SMT alpha shows how important that size reduction is.
  • Or the Linux 2.4 kernel.

    --

  • I have read several articles about both Sun's and Intel's SMT tech and they all say it only adds about 10% to the die size... Email me if you would like links.
  • by slothbait ( 2922 ) on Wednesday March 14, 2001 @10:18PM (#362637)
    > SMT? Blow it out your ass.

    Rather unnecessary, don't you think? I don't think you would be quite so hostile to SMT if you had a better understanding of it.

    Time to do some informing...

    > Surely this 'virtual-multi-CPU' system can only decrease the sheer number of operations per second a CPU of a given size/speed can do?

    This statement doesn't make much sense. I think you mean that a CPU which spent the die area on additional functional units as opposed to SMT support could achieve a greater MIPS value. This is true, but only for theoretical MIPS. Simply adding more functional units to modern chips would *not* improve actual performance. Explanation follows...

    > The overhead - whether it be in sacrificed MIPS or die area, of distributing instructions among execution units is going to be significant, compared to a maxed-out single core design.

    So you are stating that implementing SMT comes at a cost in die area? Of course it does, but the important point is that using that die area instead to add more conventional execution units would *not* increase the performance of the processor. Why, you ask? There is a limited amount of instruction level parallelism available in a single thread of program execution. Current wide-issue superscalars get something on the order of 2.3 dispatches per clock, despite the fact that they have the *capability* to issue far more. The processor simply cannot find enough independent instructions to keep its functional units full. If memory serves, the Athlon is a 9-issue core. You could add functional units up to 12-issue or more, but your actual dispatch rate would still be around 2-3 per clock. While your theoretical performance would increase, your actual performance would remain stagnant.

    So, current software does not exhibit enough parallelism to keep the functional units in even current processors busy. SMT proposes to increase available parallelism by issuing instructions from *multiple* threads at once. Instructions from different threads are guaranteed to be independent, so if you have n threads running at once, your number of available instructions for dispatch each clock is improved about n times. Of course, this method has a cost in complexity and area -- now the CPU has to have knowledge of threads, and keep a process table on die. However, provided many threads are run at once, this *greatly* increases the utilization of the processor's resources, and thus the performance of the part. (A toy sketch of this effect appears at the end of this comment.)

    > Since reading and writing to various RAM caches are the biggest bottlenecks in the current PC architecture, adding more units is just going to lead to increased contention for these resources.

    ...this is a valid point. SMT particularly increases the burden on instruction fetch and cache, since it is pulling from several different streams at once. However, there are methods that can somewhat compensate for the contention of resources introduced. Now, you have multiple threads available at all times. So, when one thread stalls on a cache miss, the processor can dispatch a different thread to run while the cache miss on the first is being serviced. This effectively hides the latency of the cache miss since the processor is able to do useful work during the service. You see, it's all about keeping those functional units busy.

    > So many CPU cycles are wasted with the current generation of software that it seems a bit pointless increasing the number of potential instructions you could perform..

    If you believe this, then you should be pro-SMT. SMT doesn't address increasing potential instructions performed per second. Instead, it is an attempt to close the gap between *actual* performance and *theoretical* performance by keeping more of your processor busy.

    >you have to question the thinking behind such a modification.
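    Picking up the dispatch numbers quoted above (roughly 2-3 useful dispatches per clock from one thread on a ~9-issue core), here is a toy Python sketch of the effect. It is not a simulator of any real chip; the per-thread ILP range and the issue widths are just the rough figures from this comment.

    ```python
    import random

    def average_ipc(issue_width, n_threads, cycles=100_000):
        """Toy model: each thread offers between 1 and 4 independent
        instructions per cycle (about 2.5 on average); the core can issue
        at most issue_width instructions in total per cycle."""
        issued = 0
        for _ in range(cycles):
            ready = sum(random.randint(1, 4) for _ in range(n_threads))
            issued += min(ready, issue_width)
        return issued / cycles

    # A wide core running one thread is starved; making it wider changes
    # little, but feeding it more threads fills the issue slots.
    for width, threads in [(9, 1), (12, 1), (9, 2), (9, 4)]:
        print(f"{width}-issue, {threads} thread(s): ~{average_ipc(width, threads):.1f} IPC")
    ```

    With one thread the result hovers around 2.5 IPC whether the core is 9-wide or 12-wide; with two or four threads the same 9-wide core approaches its issue limit.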
  • How about the TLA Lookup Archive?
  • Yes, it could be nice in embedded and special systems, but wouldn't be an option in a normal computer where we still want things like memory protection between processes.


    Btw, a similar solution was implemented in Atari's Jaguar 64-bit console: the RISC processors (two of them) both had 64 registers, divided into two banks of 32 registers each. Then they had a special instruction for switching banks. This was very useful in the Jaguar architecture, which depended heavily on fast interrupt handling (hardware like the blitter sent back an interrupt request as soon as it was finished with its task): you simply let your main thread get one bank and your interrupts share the other one. Never had to push/pop registers on the stack.

  • Optimization is all about turning compute-bound problems into I/O bound problems. Most processors are fast enough these days that they're I/O bound. Particularly if you have an OS that tends to keep L1 completely thrashed. (I recall seeing numbers which showed that Win9x tends to keep L1 completely hosed, whereas WinNT and Linux do not.)

    Note that I/O bound doesn't necessarily mean bandwidth starved -- latency is also an issue with many typical tasks. Notice how RAMBUS machines tend to perform fairly slowly on many non-latency-tolerant tasks. It's not for lack of bandwidth.

    What's worse is that we've passed the sweet spot in cache sizes such that L1 cache sizes are going to stop increasing and start decreasing again, sadly. (Transport delay in getting bits from the far side of L1 is the limiter, I understand.)

    Now the statement that most machines spend their time waiting for disk, I don't think that's quite true. That's maybe true under Windows (it certainly seemed to be last time I used it regularly, even with gobs of RAM handy), but it's certainly not true under Linux or Solaris. I almost never hear my HD run under normal circumstances. Even when it does run, it's not chugging incessantly. It's called "having enough RAM."

    Even with enough RAM, though, PC133 SDRAM is quite a bit slower than L2, which is noticeably slower than L1. Anything that doesn't stay in L1 most of the time is going to quickly bottleneck on the other levels of the memory hierarchy. Not much you can do about it, either.

    --Joe
    --
  • SMT is a relatively new idea that lets you easily boost the instruction-level parallelism, which in turn makes scheduling and issuing instructions *much* easier.

    That's true only if you have sufficient hardware registers (not to be confused with architectural registers), and your tasks aren't bottlenecked on memory. If you have truly independent tasks, then each will effectively see half as much cache at all levels of the hierarchy. This can get especially painful in L1I -- if you can't keep the CPU fed from both streams, oops! And, large register files can be a speed limiter in the architecture. (On the plus side, though, the hardware register files can be distinct between the threads, so it's not too bad. As I understand it, it's the unified register files that are the real problem.)

    One of the main attractive things I see in SMT is that you can effectively make the pipeline deeper on the architecture while completely hiding it. This is important on VLIW-style architectures that have an exposed pipeline. To make it deeper, you need to somehow hide the fact that it's deeper than the code thinks it is. One way is to add interlocks (gradually making stages of the pipeline protected, rather than exposed). Another is to interleave multiple threads, so that each thread sees the pipeline length it expects, but the actual pipeline is some factor larger.

    Of course, in this VLIW world, things can get tricky outside the CPU. Because you lack superscalar issue (that's what VLIW's about), the issue of stalls becomes a problem. "One-stall-all-stall" is an oft-mentioned SMT VLIW technique, and with it, you really need to make sure you aren't bottlenecked on memory before you go down the SMT path. "One-stall-all-stall" means if one SMT thread stalls, all threads stall... As I understand it, it's the "cheap" way to maintain the VLIW state in an SMT VLIW machine, but it also amplifies any memory system bottlenecks you might have.

    --Joe "Mr. VLIW"
    --
  • Actually, the whole SMT thing is so new that people aren't agreeing on what exactly it means.

    Some think it means multiple cores on one chip, a la the new POWER4 from IBM. Apparently IBM doesn't think so, since they don't call that SMT themselves. If you go by that definition, then yes, you have just increased CPU power without increasing the overall bandwidth of the system, and that gives you sucky performance.

    However, the other definition of SMT is a single core capable of keeping multiple contexts and switching between them without software help. To software they look (almost) like two regular CPUs, and so e.g. Linux would assign two processes or threads to each core. The idea is that the core will execute the instructions from virtual CPU #1 until it hits a cache miss. Then it switches to executing instructions from virtual CPU #2, until it hits a cache miss... and so on. (A rough sketch of this follows below.)

    In the second scenario you have a CPU with the same MIPS as before, but suddenly you are not wasting as much CPU power waiting. In the context of the PC industry that means you can get away with smaller caches and memory with higher latency (hello RAMBUS).
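    A rough sketch of that switch-on-miss behaviour, as a toy Python model rather than anything resembling a real core (the 5% miss rate and 50-cycle penalty are invented round numbers):

    ```python
    import random

    def busy_fraction(n_contexts, miss_rate=0.05, miss_penalty=50, cycles=100_000):
        """Toy model of a core that runs one hardware context at a time and
        switches to another ready context whenever the current one misses.
        A missing context becomes ready again miss_penalty cycles later."""
        ready_at = [0] * n_contexts
        busy = 0
        for now in range(cycles):
            runnable = [c for c in range(n_contexts) if ready_at[c] <= now]
            if not runnable:
                continue                  # every context is waiting on memory
            busy += 1                     # useful work issued this cycle
            ctx = runnable[0]
            if random.random() < miss_rate:
                ready_at[ctx] = now + miss_penalty
        return busy / cycles

    for n in (1, 2, 4):
        print(f"{n} context(s): core busy ~{busy_fraction(n):.0%} of cycles")
    ```

    With one context the core sits idle most of the time waiting on misses; a second or fourth context soaks up much of that idle time, which is exactly the "smaller caches, higher latency" trade mentioned above.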
  • Some would argue that EPIC (the ISA for IA64 chips like Itanium) is a fundamental change in hardware design - it combined VLIW (an old idea) with explicit encoding of inter-opcode dependency, among other things. That kind of explicit "helper" data for internal ILP engines could end up proving valuable, if the compiler technology can keep up.
  • The bottleneck isn't inside the processor; we can clock multiply those suckers into the stratosphere. A modern Pentium or Athlon is going ten times the bus speed, and that can be increased easily. As long as you're executing out of L1 cache inside the processor, clock multiplying is a big win and better design can go hang.

    The BOTTLENECK is the memory bus. Refilling the cache from a hose that's 1/10th the speed of the processor. That's why CISC lasted so long in the first place, and is still with us today. CISC has variable length instructions. If you can express an instruction in 8 bits, you do so. 16 for the more complex ones, 32 bits for the really complex ones. So when you're sucking data into the cache 32 bits at a time, you can get 2 or 3 instructions in a 32 bit mouthful. (Or, in the case of pentiums, 64 bits to feed 2 cores, but the principle's the same.) You're optimizing for the real bottleneck with compressed instructions.

    The fixed length instructions of RISC can be executed 2 at a time because you don't have to decode the first one to see where the second one starts. But Sparc, PowerPC, and even Alpha haven't displaced Intel because the real bottleneck is the memory bus, and bigger instructions aren't necessarily a win. (That and Intel translates CISC to RISC inside the cache, and pipelines stuff.)

    VLIW as Itanium picked it up sucks so badly because the real bottleneck is sucking data from main memory, and now they want 192 bits of it per clock! For only three instructions, and on average at least one will probably be a NOP. Crusoe has a MUCH better idea, sucking compressed CISC instructions in and converting them to VLIW in the cache (like Pentium and friends do for CISC to RISC).

    This multi-threading stuff is just a way to keep the extra VLIW execution cores from being fed NOPs. They don't deal with the real problem, the memory bus de-optimization by reverting to full-sized instructions all the time.

    Rob

  • The multiple execution pipelines (for instance, a superscalar architecture) are definitely one way to exploit ILP. With superscalar, you're always running a single thread at a time and there is a certain amount of ILP in that thread.

    The idea here is that if your program has many threads, you can run them all at the same time. Since each thread does not depend on the results of another thread's instructions (at least in the common case, if they're fighting over a lock, that's different), then there is no reason why they shouldn't be able to run at the same time. There are almost no data dependencies between threads, hence a whole lot of blindingly obvious ILP that the processor can take advantage of. This doesn't cause your programs to run any faster than a processor that only runs a single thread, but it allows more threads to run "slower" at the same time.

  • by wik ( 10258 ) on Wednesday March 14, 2001 @09:19PM (#362646) Homepage Journal
    This is perhaps one of the most useful sites in today's world of technobabble: www.acronymfinder.com [acronymfinder.com]. It lists 19 different meanings for "SMT", none of which are Simultaneous MultiThreading! :-)
  • A nice (and quite technical) article on the SMT for the Alpha can be found here [realworldtech.com]

    It is a 3-part article, click on "Alpha EV8 (Part x): Simultaneous Multi-Threat".

    One of the things I like about SMT is that as it is quite "cheap", it has a chance to spread quite effectively.

    And when there will be a huge base of SMT ready CPUs, we will see AT LAST more software which takes advantage of parallelism (be it SMT or SMP).

  • IMHO, SMT is a load. Modern microprocessors are mostly cache-starved. SMT puts two processors on the wrong side of the L1$, aggravating the cache bandwidth problem. Worse, the two processors in SMT degrade referential locality, further degrading the performance of the cache.

    You overlook a couple of very important factors.

    First of all, it would cost you almost no extra silicon or latency to have duplicate L1 caches, and to add a selection bit to the addresses sent out on memory operations.

    Secondly, technologies like SMT help _save_ you when you have a cache miss, because you still have an instruction stream that can execute while one thread's waiting for data.
  • My arguments are coming from the "memory wall" perspective of system performance. CPU cycles are no longer the problem: the problem is getting enough data to the CPU core.

    And this depends entirely on your workload. Many tasks are memory-bound - and many are not. Generally, anything that can fit in the on-die caches will be CPU bound (for most cases). This still covers a wide range of useful problems.

    The gap between memory speed and CPU speed is caused by DRAM latency and system bus speed, neither of which are issues for on-die caches. If clock speed increases and die sizes stay the same, propagation latency will become an issue, but SMT is great for alleviating _that_, too; as long as throughput scales with clock speed, you can tolerate higher latency by interleaving requests from different threads.

    In summary, I think that memory bottleneck problems aren't as severe as you make them out to be. Yes, they're very relevant for programs that work with large data sets, but that by no means covers all tasks we want computers to perform.
  • by Christopher Thomas ( 11717 ) on Wednesday March 14, 2001 @09:16PM (#362650)
    I thought that the major processor companies had been working with multiple execution pipelines for years now. Doesn't that fall under the category of ILP?

    You might want to doublecheck the terms you're using:

    • "ILP" is "instruction-level parallelism". It's not a physical part of the chip - it's a quality of the instruction stream. ILP is the number of instructions (usually average) that could theoretically be executed at one time, without violating data relationships within the program. Modern processors _can_ execute multiple instructions per clock because the ILP of most programs is greater than one (i.e. there are usually multiple instructions that can be executed without violating data or control dependencies).

    • "Multiple pipes" is part of the hardware that allows processors to issue multiple instructions per clock. As the name implies, this represents multiple hardware units that are capable of performing operations independently of each other.

    • "SMT" is "Symmetrical Multithreading". Remember how back under ILP, I said that the number of instructions that can be issued per clock depends on the parallelism of the program being run? SMT boosts the parallelism by running two threads at the same time and interleaving their instructions (more or less). As the instructions from different threads usually don't care what the other threads are doing, this gives you many more instructions that can be executed at the same time (assuming you have enough hardware to execute them).



    Multiple pipes are a relatively old idea. Ditto instruction-level parallelism, which is one of the analytical quantities used to judge how well multiple pipes will work in a given situation. SMT is a relatively new idea that lets you easily boost the instruction-level parallelism, which in turn makes scheduling and issuing instructions *much* easier.
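    A concrete (if contrived) illustration of the ILP point above, in Python rather than assembly: the first loop has almost no ILP because every step depends on the previous result, while the second has plenty, since each element is independent.

    ```python
    # Almost no ILP: each multiply needs the previous result, so even a very
    # wide-issue core ends up executing this chain one operation at a time.
    def running_product(xs):
        acc = 1
        for x in xs:
            acc = acc * x          # serial dependency chain
        return acc

    # Plenty of ILP: every element's square is independent of the others,
    # so a wide-issue (or SMT) core can overlap as many as it has units for.
    def squares(xs):
        return [x * x for x in xs]
    ```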
  • by Christopher Thomas ( 11717 ) on Wednesday March 14, 2001 @09:26PM (#362651)
    It's not true that doubling L1$ and adding a selection bit costs you nothing. In fact, the size of L1$ is rather limited, and cutting size in half substantially increases the miss rate. It is also fairly expensive to add selection bits.

    Um, no.

    Most of your die is taken up by the _L2_ cache. You have plenty of space to add more L1 cache. The reason you usually don't is that a larger L1 cache served by the same set of address lines has longer latency. Two independent duplicates of an L1 cache will behave identically to the original L1 cache.

    Performing the selection adds latency, but this can be masked because you know the value of the selection bit long before you know the value of the address to fetch.

    In fact, you'd almost certainly _reduce_ the cache load compared to a single-threaded processor capable of issuing the same number of loads per clock, because they'd be hitting different caches, and you wouldn't have to multiport.

    SMT also doesn't save you from cache miss latency. Out-of-order instruction issue saves you from that.

    SMT, in any sane design, is used on an OOO core. An OOO core won't save you if your next set of instructions has a true dependence on the value being fetched from memory. SMT gives you a second thread with no data dependence on the stalled load, and hence plenty of instructions in the window that you can execute while waiting.

    I'm having trouble seeing where your arguments are coming from. As far as most of the core's concerned, there's still only one (interleaved) instruction stream, just with less data dependence in it. This is scheduled and dispatched as usual.
  • by Detritus ( 11846 ) on Wednesday March 14, 2001 @10:16PM (#362652) Homepage
    One of my favorite computer architectures is the CDC 6000 series. It had a Peripheral Processor (PP) that did all of the system I/O. The main CPU crunched numbers while the PP dealt with the outside world. The cool thing about the design of the PP was that it appeared to be 10 independent processors, even though it only had one ALU, instruction decoder etc. This was accomplished by a "barrel" of 10 sets of CPU registers and memory banks. The PP would rotate the barrel every time an instruction was fetched and executed, turning one physical CPU into 10 virtual CPUs. This meant that the PP could simultaneously execute 10 different programs without having 10 hardware CPUs. I've often wished there was a microprocessor that could do this. It would be great for embedded real-time systems and I/O controllers. Each I/O device and/or subsystem could have its own virtual CPU, that would never get swiped by other tasks or I/O interrupts.
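    The barrel idea is simple enough to sketch in a few lines of Python. This is only an illustration of the round-robin rotation, not the actual CDC PP instruction set or timing:

    ```python
    # Toy barrel processor: one shared execution unit, N register/memory contexts.
    # Each cycle the barrel rotates and the next context executes one operation.
    class BarrelProcessor:
        def __init__(self, programs):
            self.contexts = [iter(p) for p in programs]   # one per virtual CPU
            self.slot = 0

        def step(self):
            idx = self.slot
            self.slot = (self.slot + 1) % len(self.contexts)   # rotate the barrel
            try:
                op = next(self.contexts[idx])   # the single shared ALU runs one op
                print(f"virtual CPU {idx}: {op}")
            except StopIteration:
                pass                            # this virtual CPU's program is done

    def io_task(name, steps):
        for i in range(steps):
            yield f"{name} step {i}"

    pp = BarrelProcessor([io_task(f"device{i}", 2) for i in range(10)])
    for _ in range(20):
        pp.step()
    ```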
  • You seem to be talking about the IBM POWER4 [realworldtech.com] chip. Which looks to stomp the living shit out of anything and everything short of Alpha EV7. IA-64? Don't make me laugh.


    Long live Big Blue!



    (jfb)

  • True enough. But when is EV8 scheduled to launch? I was under the impression that POWER4 and EV7 were going to appear more or less at the same time, with EV8 much later.

    And while I don't doubt that an eight-way POWER4 unit will be as terrifyingly expensive as it is fast, would an equivalent EV8 system be any cheaper? Either way, it'll be interesting at the high end again, now that Alpha finally has competition.

    Peace,
    (jfb)
  • That's more or less what I figured. It will be an interesting couple of years.

    (jfb)
  • IIRC, it was even more limited than that. It was certain filters that were SMP enabled. There would be little value and great expense in SMP-enabling the basic UI.

    IIRC further, there have long been available daughter boards for various Mac platforms to do filter acceleration in Photoshop. I think these have ranged from general purpose CPUs to DSPs. Earliest one I can remember was for JPEG compression, although I think more commonly they were used for stuff like gaussian blurs. I can remember working at prepress shop on an unaccelerated '030 and waiting like 5 minutes to do filters on a 30 MB file, so they had some value.
  • Yup, kinda reminds me when people were getting all excited about those dual CPU boxes that Apple was selling to take attention away from the megahertz gap vs. x86. Yeah, never mind that hardly anything at all can utilize more than one processor under MacOS, it's got two CPUs! w00t!


    Cheers,

  • You're not wrong, which is why I said "hardly anything at all" can use more than one CPU under MacOS, instead of saying that nothing can. The amount of software that can, though, is extremely limited.


    Cheers,

  • While in many senses you are right, I think you are pointing to the wrong issue. It is not something inherent in the x86 arch that causes problems in scaling; it is mostly Intel's SMP bus design. Having all the CPUs share a single bus between each other and system memory is the bottleneck. I mean, look at the Athlon, it isn't riding on an Intel designed bus, it rides on a DEC-designed EV6, originally made for the Alpha.

    While Beowulf is a nifty technology, it does not solve all the scaling problems as you might think. Beowulf clusters are only useful for a specific subset of available problems, stuff that can be easily split up and sent to many, semi-independent, processing nodes. Beowulf clusters are generally connected together with 1Gb or 100Mb Ethernet which does not have high bandwidth or low latency compared to the CPU-Memory bus in even the cheapest computers. I would take a single 128 CPU box over 64 dual proc boxes connected via 1Gb Ethernet (or even Myrinet) any day.

  • *Sigh* . . . It seems that no one really does much research into CPU and computer system design anymore. All the major architectures are pretty homogenized, they are either full RISC machines (probably a tad bloated with too many instructions) or RISC machines emulating a CISC instruction set (x86). RISC seems the last fundamental change in hardware design; the processors get smaller and faster but not much really changes.

    One thing that I like, at least as a concept, is the IBM S/390. The IBM Mainframe and its customers have been living in a pretty insular world over the last 20 years and the hardware/software that runs on this beast is just, different. Some things are goofy, IIRC it boots off an emulated punch card reader (!!), or nifty, like the magic migrating storage system.

  • I independently invented what people are now calling SMT in 1994.

    I am soooooo cooool.
  • IMHO, SMT is a load. Modern microprocessors are mostly cache-starved. SMT puts two processors on the wrong side of the L1$, aggravating the cache bandwidth problem. Worse, the two processors in SMT degrade referential locality, further degrading the performance of the cache.

    I'm much more interested in enhanced cache ideas like IRAM [berkeley.edu] that seek to enhance performance by putting a very large L2$ on chip by combining the discrete logic circuits of the CPU and static L1$ with the capacitor cell circuits of DRAM.

    Crispin
    ----
    Crispin Cowan, Ph.D.
    Chief Research Scientist, WireX Communications, Inc. [wirex.com]
    Immunix: [immunix.org] Security Hardened Linux Distribution

  • It's not true that doubling L1$ and adding a selection bit costs you nothing. In fact, the size of L1$ is rather limited, and cutting size in half substantially increases the miss rate. It is also fairly expensive to add selection bits.

    SMT also doesn't save you from cache miss latency. Out-of-order instruction issue saves you from that.

    The main advantage of SMT is that it gives computer architecture scholars something interesting to study :-)

    Crispin
    ----
    Crispin Cowan, Ph.D.
    Chief Research Scientist, WireX Communications, Inc. [wirex.com]
    Immunix: [immunix.org] Security-hardened Linux

  • The world needs a central registry for TLAs, to ensure confusing multiple definitions can't be used in the same domain.

    Maybe ICANN would take the job on ?
  • Stack overflow.
  • Actually, the tradeoff between SMT and single chip multiprocessors is not quite so simple. By adding SMT to a chip you increase the complexity of the design. This means you have to spend more time designing it, and more time testing it.

    With a single chip multiprocessor (SCMP) you can design a smaller, simpler processor and spend more time performance tuning it. Once you have a single core working and tested, you can stamp it out multiple times. The complexity then gets put in the glue logic which is used to communicate between cores and share caches between them. This problem is well understood from the design of conventional multiprocessors.

    Basically, since SMT is new, the design takes longer. SCMP relies on understood technologies, and potentially could be put into production faster.
  • Well, where do we start...

    Multi-core CPUs like IBM's are SMP-on-a-chip, which is not the same as SMT by any stretch.

    SMT, because more of the functional units on the chip are staying active at one time, increases heat and power consumption just about as much as SMP-on-a-chip, though it may be marginally better because the core-level overhead won't be present.

    "SMP-aware" applications? Yeah, you need something like that with the Mac and its cooperative multitasking and wacky thread model. However, with any normal preemptive multitasking, thread supporting OS, introducing threading into a program makes it "SMP-aware" by default (though you may find new/different bugs on an SMT or SMP system).

    The only thing I can think of that would be an "SMT optimization" at the application programming level would be threading any floating point calculations separately from integer calculations, thus allowing the FP units to be running independently from the rest of the application.
    Methinks Vince needs to bone up a little...

    Some more links:

    University of Washington SMT info [washington.edu], this is also linked to from the UMass link previously posted
    Look at some more Alpha [alphapowered.com] specifics from the source
    I believe these Real World guys [realworldtech.com] were quoted in the last Slashdot SMT reference [slashdot.org] (and look, Hemos posted that one too... you'd think they'd read the links...)

    Disclaimer: I worked for Compaq (though not the DEC side) on porting an OS to Alpha a couple years ago and having to be aware of SMT in EV8 coming down the pike.

  • The fact that Intel's last processor release was a "mobile processor", Intel's future x86 vapormap includes pure SMT chips, and Compaq's future vapormap includes an SMT alpha shows how important that size reduction is.

    SMT on EV8 has been on the Alpha "vapormap" for at least three years, probably considerably longer than that. July 1998 was when I first saw it and I'm not a DEC person. Don't think it has anything to do with embedded stuff. Don't think you'll find Alphas in very many embedded situations either, not compared to PowerPC or the real embedded architectures...

  • The BOTTLENECK is the memory bus.

    OK.

    That's why CISC lasted so long in the first place, and is still with us today.

    No way. CISC (i.e. x86) lasted so long because of duopoly action and backward compatibility. In fact, like you said, CISC is dead because even since Pentiums, x86 chips have been RISC on the inside and CISC to the outside world (to varying degrees).

    The fixed length instructions of RISC can be executed 2 at a time because you don't have to decode the first one to see where the second one starts.

    Or n at a time. Any OOO RISC processor these days worth its snot decodes 4 ops/clock, some are at 6 or 8. (If it can't retire that fast, it doesn't really matter...)

    Alpha haven't displaced Intel because the real bottleneck is the memory bus

    Really? For scientific computing, which is where you have really big datasets and memory bandwidth is key? I don't think you see x86 there very much. You see DEC, IBM, Sun and HP. Who are all, surprise, surprise, RISC-based hardware vendors. Many RISC chips (Alpha, POWER, PowerPC) have long since passed x86 in sheer performance, especially on FP. Intel has definitely won in price/performance, but I would argue that's more due to volume than anything else.

    This multi-threading stuff is just a way to keep the extra VLIW execution cores from being fed NOPs.

    Umm, Alpha EV8 uses SMT. Not VLIW. Itanium is VLIW-like. Doesn't use SMT. Example no worky.

    Not saying that the concept is wrong, that SMT as a concept might alleviate some of the performance issues with superfluous instructions in a VLIW instruction stream. But that's sort of the point of VLIW, to let the compiler, rather than OOO hardware, figure out how to best use the available functional units as much as possible. It puts NOPs in to keep the instruction stream balanced so the decoder can work in a predictable way just like in RISC.

  • Say what? Most applications can't fill a deep pipe, even out-of-order and with aggressive prefetching. The ways this stuff wins include having two (or maybe more!) instruction streams to crunch, and switching away from the one that's now blocking on a memory access. Prefetch on the other surely completed already ... The P4 is a good example of a pipeline that's too long.

    And by the way, why has this taken so long to arrive? It's still not something I can purchase yet, and I first heard of it back in 1992. There's something fishy.

  • well, the best(*) a 100% parallel system could do in that configuration would be to equal the faster chip. 2 x 500 = 1 x 1000 !

    Now if the article had said that a 2 x 800 was just barely as fast as a 1 x 1000, under an OS with the capability to use both CPUs, that would have been noteworthy. But this must have been one of the least informative comparisons I've heard in a long time.

    (*) disregarding memory bandwidth, which depends on busses and what not.
  • Don't forget that under an SMP-aware OS running SMP-optimised processes, 2 x 933MHz can be faster and is much cheaper than a single 1.5GHz chip...
  • Surely this 'virtual-multi-CPU' system can only decrease the sheer number of operations per second a CPU of a given size/speed can do?

    The overhead - whether it be in sacrificed MIPS or die area, of distributing instructions among execution units is going to be significant, compared to a maxed-out single core design.

    Since reading and writing to various RAM caches are the biggest bottlenecks in the current PC architecture, adding more units is just going to lead to increased contention for these resources.

    So many CPU cycles are wasted with the current generation of software that it seems a bit pointless increasing the number of potential instructions you could perform..

    It's like putting a 700 cubic inch supercharged W16 engine constructed from 3 straight-8 blocks into a VW Kombi van.

    Sure, it'll theoretically go pretty fast, but when it's parked by the side of the road 340 days out of the year and only ever driven by a bunch of hippies who are too stoned to see the road properly at 20 km/h, you have to question the thinking behind such a modification.

  • Sure, but you need to use some of the parts from the third engine block to construct the more complex W-16 configuration, which would theoretically provide more usable power than, say, a straight-24.

  • Sadly there wasn't a special instruction for switching the banks. You had to set or clear bit 14 of $f02100.
    You could move single values to/from the other bank with moveta/movefa instructions.
  • Oh BTW, that was for the GPU, for the DSP it was $f1a100, before some smartass pipes up...
  • Yikes. The EV8 will dissipate 250 watts! That's more than my monitor! Of course, watts are good ;) I want one (or four, this does SMP right?)
  • This is not SMT. To quote the website.
    • It includes 2 tightly coupled processor units
    In SMT, you have multiple threads of execution (eg. multiple PCs) feeding one CPU.
    • SMT also doesn't save you from cache miss latency.
    Please enlighten us.
    I'm sat here working on the software side of an SMT project. This is exactly where SMT offers benefits. The processor switches threads on a cache miss. A 4-way SMT scheme can offer > 2x performance for 2x die size. An SMT processor cannot reduce the cache latency experienced by a particular thread of execution, but it can reduce the amount of time that execution units sit idle.
  • Put up a CPU monitor and see how much time your machine spends CPU-bound: in most cases, the CPU is very rarely fully loaded, and actually spends much of its time waiting for disk.

    Disk I/O should never be a problem. If it is, then you need to alter the system. If your workstation is disk-bound, then you simply add more memory; if your high-end server is disk-bound, then you put in a very expensive RAID.

    Typical applications are office suites, video games, web servers and databases. Well, a game should never have to hit the disk except for level loads / movies. Office should only have to hit the disk the first time you start it up (and nowadays that's reduced by boot-time prefetching). Web servers should _always_ contain enough memory to serve the majority of the web site. And professional databases will typically demand multiple expensive drives.

    Most other operations only require a moderate use of I/O, which is further reduced by OS-level caching. It's not hard at all to get a CPU load of 2 or more doing things like compiles (which are obviously very I/O bound). The CPU is more than taxed in such circumstances.

    All levels could be both larger and faster
    There is most certainly a boost from memory latency and bandwidth enhancement, but I don't agree that these are the universal bottleneck. Many applications are nicely optimized to fit within the half meg of L2 still commonly found. This applies to video games, heavy-duty compression tools, etc. For example, I got a 99% performance boost on a dual processor when doing MP3 encoding. Obviously main memory / disk I/O wasn't the bottleneck.

    I think Lx$ becomes a bottle-neck when you context-switch often; thereby flushing the $.

    Unfortunately economically sound SMT isn't going to be as fast as SMP because you can't have fully redundant and optimized L2 (which will affect large-data-set applications). What I see happening is the use of thread-independent register sets and L0 cache (for the x86 processors at least), then having a large number of ports on the shared L1, and finally a minimally ported, though larger than current, on-die L2. There would possibly be a very large off-die L3 (typically targeted at 2Meg).

    It won't help all applications (Apache 1.x or Postgres 7 certainly won't improve), but more and more Solaris and Windows applications could definitely benefit (heck, even the traditional win-benchmark Quake 3 is MT aided).

    What I see as ideal is SMT/P interleaved memory. You use 2-way SMT to take up the slack where ILP bottles up. Then you have a second separate chip (possibly on the same cartridge, sharing an external L3 $ ). In this way, you have a minimal increase in complexity of the core, and you optionally sell more cores (kind of like the 3DFX's VSA-100 mentality.. Make the core simple and scalable). So long as you have a large enough $ and a minimal number of loading processes, putting all the cores on the same system-memory-bus shouldn't be a problem (thus alleviating the complexity of the EV8 point-to-point bus). I don't believe AMD's P2P architecture is going to be worth the added cost, delays, and complexity. Not to mention, main-stream memory can't handle the additional BW.
  • I've done simulations in the past - the problem is with data stalls - they occur deep in the pipe and for a non-OO CPU you have to toss all the stuff that's partially started in the pipe, and start the new thread which takes a while - so thread switch is expensive. To make things worse it makes no sense to switch on an L1 miss - simulations say that waiting for L2 miss is a must.

    Now with a superscalar/OO system things are different - you can keep partially done stuff sitting in reservation stations (but you need twice as many of them which may cost you in cycle time - which is what marketing are interested in). There's still probably a lot of serializing things around (like L2 miss). On the other hand one place you can win is with the main memory subsystem - these days you can build systems with multiple outstanding memory transactions (Rambus - no matter how people dislike it - is particularly good at potentially having many concurrent senses running in parallel) - but to be useful you need to either move the main memory interface onto the CPU or get rid of tacky serializing buses like slot-1 (it's much better to put the memory interface on the CPU in this case because you can run the internal memory interface at a higher clock rate and be exposed to more parallelism).
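    A back-of-the-envelope Python sketch of that trade-off. The miss rates, latencies, and pipe-refill cost below are invented round numbers, not simulation data, but they show why switching only on L2 misses comes out ahead of switching on every L1 miss when refilling a deep pipe is expensive:

    ```python
    def cpi(l1_miss, l2_miss, l1_penalty, l2_penalty, switch_cost, policy):
        """Crude cycles-per-instruction model for an in-order core.
        l1_miss / l2_miss are misses per instruction; a thread switch costs
        switch_cost cycles to drain and refill the pipe, and we assume another
        thread is always ready so the memory latency itself is hidden."""
        if policy == "never_switch":
            return 1.0 + l1_miss * l1_penalty + l2_miss * l2_penalty
        if policy == "switch_on_l1_miss":
            return 1.0 + (l1_miss + l2_miss) * switch_cost
        if policy == "switch_on_l2_miss":
            return 1.0 + l1_miss * l1_penalty + l2_miss * switch_cost
        raise ValueError(policy)

    # 5% L1 misses that hit in L2 (10 cycles), 1% L2 misses (100 cycles),
    # 20-cycle pipe refill on every thread switch.
    for policy in ("never_switch", "switch_on_l1_miss", "switch_on_l2_miss"):
        print(policy, round(cpi(0.05, 0.01, 10, 100, 20, policy), 2))
    ```

    With these numbers, paying a 20-cycle refill to cover a 10-cycle L1 miss is a bad trade per miss, so the switch-only-on-L2-miss policy comes out well ahead - the same conclusion as the simulations mentioned above.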

  • Surface Mount Technology .... of course all us chip weenies think packaging when you use that TLA.

    Seriously though, threading like that is kind of at odds with today's very long pipelines (basically the cost of a thread switch can be very high if you have to fill a deep pipe). With heavily out-of-order systems this can be less of a problem .... but you're still stuck with the problem that if you're using a larger percentage of the CPU's real clocks then you're going to put more pressure on shared resources like caches and TLBs - larger L1s/TLBs are going to potentially hit CPU cycle time and of course these days L2 can take a large percentage of your die size (after all the goal here is to get more useful clocks/area)

  • it does however list seaside music theatre. ummm hmmm. besides everybody already knows it means surface mount anyway, not this spurious muffin threbbing you guys are talking about.
  • understand why under an OS such as Win 98 or ME, even a single Pentium III 1 GHz will handily outperform a dual-Pentium III 500 MHz setup.

    This is the expected result under ANY operating system. Multiprocessing only helps for problems that lend themselves to highly parallel processing (exploring independent address spaces of cryptography problems, for example), which many don't, and in any case incurs overhead.

    --
  • Here's more information on another processor that does implement SMT... to handle network traffic. (From Microprocessor Forum) http://www.eet.com/story/OEG20001009S0061

    The XStream core can execute eight instructions per cycle, the same as the two cores in SiByte Inc.'s Mercurian SB-1250 network processor implementation. In SiByte's case, however, each core is limited by the parallelism in the code. SiByte claims its cores each average about 1.5 instructions/cycle, after accounting for data dependencies and pipeline stalls. The simpler cores found in most network processors average less than 1 instruction/cycle.

    The magic of SMT is that the XStream core issues instructions from up to eight threads at the same time. That is, the eight instructions per cycle could each come from different threads. Since each thread has its own register space, this eliminates data dependencies, keeping the issue rate high. If one or more threads stall while waiting for memory, the chip can issue as many as four instructions from a single thread, keeping the core operating near peak efficiency. As a result, the XStream core averages about 6 instructions/cycle, Nemirovsky said, or twice the rate of the two SiByte cores combined.

    XStream, however, did not disclose the clock speed of its core. While SiByte is pushing the leading edge at 1 GHz, most other network processor vendors are toiling at a few hundred megahertz. In the end, XStream's core may not execute significantly more instructions than SiByte's SB-1250.
  • by Seenhere ( 90736 ) on Wednesday March 14, 2001 @08:56PM (#362686) Homepage
    To dig into this a bit deeper, here's a link to some research papers by the guy who invented SMT [ucsd.edu] (it was the topic of his PhD thesis back in 1996).

    For your bedtime reading, y'all.

    --Seen

  • x+x is not greater than x*2. but it does equal 2x :)
  • Read up on MTA [cray.com], it's cool. Supports 128 active threads per processor.
  • by Carnage4Life ( 106069 ) on Wednesday March 14, 2001 @09:19PM (#362689) Homepage Journal
    For those with a technical bent who were disappointed by the lack of information on SMT in the linked article, here are some better resources:

    Introduction to Simultaneous Multi-threading from UMass [umb.edu].

    Quick Quiz on SMT [umb.edu].

    Caches for Simultaneous Multithreaded Processors: An Introduction [tohoku.ac.jp]

  • When comparing SMT vs. dual or multiprocessor performance data, it is important to compare apples to apples, and understand why under an OS such as Win 98 or ME, even a single Pentium III 1 GHz will handily outperform a dual-Pentium III 500 MHz setup.
    With all due respect for my Intel-loving friends here on Slashdot, I feel that you'll never see much of the real advantages of SMP on x86 boxes anyway. Even if it were possible for Joe Average Computer Scientist to obtain an Intel box with more than eight CPUs, the severe bandwidth limitations of the x86 architecture become apparent with even four Pentium IIIs. Only machines with high-bandwidth architectures (most notably those from IBM's mainframe division, SUN, and SGI) are able to compensate for the exponential growth in processing overhead with more than eight CPUs, and it's no coincidence that these companies are where you turn to when you need a machine with between eight and sixty-four CPUs.

    My feelings shall be vindicated when SMP Athlon machines become readily available. Their comparatively minor bandwidth advantage will let them blow similarly-clocked Intel boxes out of the water.

    Personally, I feel that the best way to scale x86 to supercomputing levels is through clustering, such as is offered by the venerable Beowulf for GNU/Linux. GNU/Linux, for better or worse, is continuing to grow in popularity, and I would like to see commercial software vendors try releasing Beowulf-enabled software for Linux. Imagine being able to buy Oracle for Beowulf! Okay, poor example; Oracle is memory-intensive rather than CPU-intensive, and an RDBMS is one application which is so dependent on a fast disk and good caching that the advantages pale in comparison to the potential problems. What would really be cool are Beowulf ports of statistical analysis and 3D-rendering software. Oooh, yeah... after all, The Matrix and Titanic have both proven the effectiveness of free x86 Unix-workalikes in render farms... I believe that those two movies respectively used FreeBSD and GNU/Linux.

    --

  • It might interest you that Digital was very enthusiastic about it, and intended the Alpha 21464 to be built around this notion. Who knows if we'll ever see a 21464 now, though...

    I do. I've been working on it for several years now. The project is alive and well, and we fully intend to deliver an SMT processor. Of course delivery is still a couple years off (designing processors takes a really long time), but we'll get there eventually.

    BTW, for those of you interested in SMT, we are hiring [alpha-careers.com].

    P.S. Great response to the nay-sayer. Wish I had written it. :-)

  • SMP is not cost effective.

    Yes it is. Since cost is not directly proportional to speed, you can buy two processors at greater than half the speed of one. If we assume adding a processor speeds up your application by 50% (a bad case), we can buy two 666MHz PIIIs at $115 each (from Pricewatch), instead of one 1GHz PIII at $240, and get equivalent performance (666 + 333 = 1000). Then you also have some money left for a more expensive motherboard. 50% is low though; depending on the application you should be able to get it to 75-80%. If you also add overclocking into the picture, you can save a lot of money (I have two Celeron 300As running at 450MHz each, which would be equivalent to something like a PIII at 800MHz, which didn't even exist when I bought them).
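    The same back-of-the-envelope numbers as a tiny Python calculation (the prices are the early-2001 Pricewatch figures quoted above, and the "SMP efficiency" is the guessed fraction of the second CPU you actually get to use):

    ```python
    def effective_mhz(clocks_mhz, smp_efficiency):
        """MHz-equivalent of a box: the first CPU counts fully, each extra CPU
        only contributes smp_efficiency of its clock."""
        return clocks_mhz[0] + sum(c * smp_efficiency for c in clocks_mhz[1:])

    configs = [
        ("1 x 1GHz PIII",             [1000],     240, 1.0),
        ("2 x 666MHz PIII (50% eff)", [666, 666], 230, 0.5),
        ("2 x 666MHz PIII (80% eff)", [666, 666], 230, 0.8),
    ]
    for label, clocks, price, eff in configs:
        mhz = effective_mhz(clocks, eff)
        print(f"{label}: ~{mhz:.0f} MHz-equivalent for ${price} "
              f"({mhz / price:.2f} MHz per dollar)")
    ```

    Even at the pessimistic 50% efficiency the dual box matches the single 1GHz part for less money; at 75-80% it pulls clearly ahead.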

  • 1GHz Thunderbirds seem to be around $165 [pricewatch.com], while 650MHz Thunderbirds are ~$65 [pricewatch.com], so that would make SMP even more cost-effective. It's a bit difficult to compare to combos, since there aren't any combos for SMP Thunderbirds (or any motherboards at all yet). I look forward to buying an SMP Athlon system, but for now they don't exist.

  • IBM's Power 4 Architecture [ibm.com] was designed to exploit SMT. They're looking to leapfrog Sun and DEC in server performance.
  • If they're anything like me they keep a Win box around for IE & games and use VNC or some such to browse, but keep a proper computer around for doing proper work
    .oO0Oo.
  • You pay a 10% die cost for twice the throughput according to what Compaq is claiming for the Alpha EV8 (ref [realworldtech.com]).

    So, you're just wrong. Try reading something about what you're commenting on first.

  • The point of SMT is that if one thread gets a cache stall because it has to hit main memory, then another execution thread has its instructions loaded into the CPU. SMT is actually one way to help reduce the CPU-memory bottleneck.
  • by lamontg ( 121211 ) on Wednesday March 14, 2001 @10:56PM (#362698)
    I'd suggest you read this paper [realworldtech.com] or this paper [realworldtech.com]

    Pay particular note to the fact that you can take an existing superscalar chip and add SMT for only about a 10% chip real estate premium, while it should be able to double throughput. That's a lot better than trying to double throughput by adding another CPU to a machine or by adding two cores to a CPU.

    Also note that it isn't recommended to run processes with different address spaces simultaneously on the processor because that would thrash the TLBs. It's only suggested that you let multithreaded apps (Oracle, perhaps future versions of Apache) load more than one thread into the processor at the same time.

  • I'm a little confused. Don't some processors use register windows to speed up context switching? How is this any different?

    PS I'm not trolling, I really don't understand how this is new.

  • a single Pentium III 1 GHz will handily outperform a dual-Pentium III 500 MHz

    Let's leave 98/Me out of the equation for now.

    Just how is a dual 500 p3 ever going to beat a 1Ghz p3? You still only have 1000000000 clock cycles per second. Your processor bus isn't any faster, in fact it's got an extra processor on it so there's increased contention, you have a little bit more L2 cache to play with, your memory bandwidth is the same, the overhead from SMP guarantees that you'll never efficiently use more than about 900000000 of those clock cycles in a second, and all your drivers have to play it SMP-safe, which is another 5-10% speed hit.

    x+x is not greater than x*2.

    SMP helps partition applications from each other so even if one app is hogging a cpu, other stuff will still give decent response times. But that's about it - unless you need to push the bleeding edge (and spend >5k on a box), SMP is not cost effective.

  • A cheap-ass socket a+thunderbird 1ghz combo is only 72+146=$218 (again quoting pricewatch), cheaper than your two chips alone.
  • The cost difference of the motherboards will be more than the second processor, and by the time they come out the chip prices will be completely different anyway.

    Then again, if you wanna suffer through first-gen SMP hardware bugs, your wallet will probably have similar masochistic tendencies.

  • Their MAJC5200 processor already does SMT, although they call it something like spatial computing. Check it out here [sun.com]
  • IBM [ibm.com] demonstrated a multi-threaded POWER CPU destined for their AS/400 [ibm.com] series of workstations and servers at the ISSCC [isscc.org] conference back in 1998. The synopsis of their presentation is available here [128.100.10.145], as paper 15.3.

    To my knowledge, this chip is either now in use, or very close to being put in an AS/400 or i Series box.
  • Hardware Central: Nice site layout. Pleasant writing. Very lightweight.
  • Simultaneous Marijuana Trafficking?
  • Back in the early-mid 70's people were taking two 6800 CPUs, wiring them out-of-phase, and essentially building tightly-coupled SMP systems. We didn't really have threading in those days, or else the correct OS could have made such a system SMT, instead.

    But someone else was designing another CPU, called the COPS. They looked at this well-known out-of-phase 6800 technique, and realized that their design basically used clock-up for fetch/decode, and clock-down for execute. During each half-cycle, half the CPU was sitting idle.

    So they doubled the registers, using the fetch/decode unit with one register-set during clock-down and the other register-set during clock-up. The execute unit worked in the converse fashion, alternating register sets. A dual CPU on a single chip for the cost of a second register set and a little control/arbitration logic. They didn't attempt any sophisticated contention-prevention, leaving that up to the software. This was mid-late 70's.

    With more modern software, COPS might have been the first SMT. I don't know the timeframe of the CDC6000, whether it beats mid-late 70's or not.
  • If you compare 1GHz with 2x0.5GHz you are right. But suppose that the fastest processor right now is 1GHz and you need more - don't you think a 4x1GHz would be a good thing (if you had the cash)? Why else would Sun [sun.com] have computers like this [sun.com] (minimum 4, max 64 CPUs)?

    Cheers...
    --
    $HOME is where the .*rc is

  • The Blue Gene supercomputer uses SMT. Guess they don't just put multiple cores on a single chip :)

    m
  • I wonder how many copies are actually used in a production environment.

    I'd be willing to bet it's vaporware.

  • Well, maybe not by the time I get this up. Actually, Win98 and ME don't support dual processors at all, so your second one will just be sitting on the motherboard turned off.

    As far as SMT goes, I think it's a good idea (well, obviously, why wouldn't it be). You really can only get so much out of instruction level parallelism, and I've always thought that splitting CPU time up by thread rather than by instruction parallelism would be a lot more effective.

    Rate me on Picture-rate.com [picture-rate.com]
  • Actually, every pic on there is 'default'. Right now, my pic is at http://picture-rate.com:8080/hello/viewpic.jsp?system=happy&owner=default&picture=i170 [picture-rate.com], although that could change. I would link directly to it, but Chad hasn't put in the ability to select which picture to rate, otherwise I would put that link in my sig.

    Rate me on Picture-rate.com [picture-rate.com]
  • Neither W2k nor Linux are particularly impressive with 2+ CPUs. Use BeOS if you want to see what a dual processor machine is really capable of, speed-wise.
  • Just peruse back issues of popular computer magazines and you will see that most new Intel processors were pigeonholed as "server solutions" until they inevitably migrated to the business and consumer PC markets. If the actual SMT performance data matches or exceeds the simulated results, then an SMT processor may become an increasingly attractive option for many buyers. SMT could also be the "shot in the arm" microprocessor companies are looking for. There is a malaise in the current market, along with growing consumer resistance to ever-increasing processor clock speeds. If the hardware and software portions of SMT come together, then the question of "why do I need a 1.5 GHz processor?" may be answered in very short order.

    Actually, while the first bit is true... let's face it, for most business applications we do not need faster machines until we have to deal with the bloat of the next round of software from the major vendors.

    Businesses could probably do very well on a single standardized set of software for a decade or more for most common functions. Many have done so as a matter of fact. There are some businesses out there still running Win 3.1 apps.

  • I didn't know that Slashdot had been "acquired" by Microshit's PR dept...
  • Hmm.. Who Cares? Probably the 90% of computer users who use a Windows based-OS.


    That's not what the original poster said. The original poster was talking about Windows 98/ME. 90% of Windows 98/ME users don't know how to benchmark their computers, and wouldn't want to either. I'm sure lots of NT/2000 users care, especially if they're using Windows as a server. But the average home user of Windows 98/ME doesn't care. Serious gamers and some kinds of professionals care, but that's a very small part of the Windows 98/ME market.
  • And by the way, why has this taken so long to arrive? It's still not something I can purchase yet, and I first heard of it back in 1992. There's something fishy

    Blame Microsoft - their continuing use of DOS (95, 98, ME) prevents these technologies from being introduced. Once your basic Microsoft OS is based on NT, these technologies will start to be developed.

    It currently makes more sense for INTEL/AMD to invest in technologies that will make ME/9x faster as well as NT/Unix. Once SMT capable OSes are the norm they will look at other ways to speed things up.

    die ME die!!!

  • This is spectacularly wrong. SMT is an attempt to reduce the effects of long memory latency by running several threads concurrently; with a bit of luck, if one thread is blocked on a memory access another will still be runnable and the processor does not need to stall (just the one thread). The question now is whether one should share the various caches between all the threads or have separate (snoopy) caches, one per thread.
  • "I got the impression that SMP was still better".

    A processor can be 4-way SMT for only 10% extra silicon cost. SMT makes processors more EFFICIENT.

    C//
  • The POWER4 chip is a very impressive design and a very nice chip. EV8 will be far more efficient, however. Look at the transistor/silicon cost of the chips. POWER4 is a 2-way MPU designed for 4x2 transputer-like grid assemblies. Such a configuration will be insanely fast, but surely quite bleedingly expensive.

    C//
  • An equivalent EV8 system should be less expensive to manufacture; however, IBM could possibly manage better economies of scale and will quite possibly out-market Alpha. Alpha and POWER4 will be competing in the same markets (HPC).
  • Ah, at long last circuitry catches up with functionality that women have been laying claim to for aeons.
    • SMT = Society for Music Theory
      (http://smt.ucsb.edu/smt-list/smt-main.html)
    • SMT = Surface Mount Technology
    • SMT = Simultaneous MultiThreading
    And (drumroll) of course ....

    the Finland Travel Bureau (http://www.smt.fi/)

    When will the madness end?
  • by stretch_jc ( 243794 ) on Wednesday March 14, 2001 @08:51PM (#362726)
    why under an OS such as Win 98 or ME, even a single Pentium III 1 GHz will handily outperform a dual-Pentium III 500 MHz setup

    This is because Win9x does not support SMP, so even dual 933MHz will be outperformed by a single 1GHz. A better comparison would be with Win2k or Linux, which will both actually use both CPUs.

  • Please, this is not a new idea.

    Here is an article from DEC (now Compaq) that describes how the Alpha chip does it:
    http://www.compaq.com/hpc/ref/ref_alpha_ia64.doc
    for word.
    and
    http://www.compaq.com/hpc/ref/ref_alpha_ia64.pdf
    for pdf.
    Digital has always tried to do things "the Right Way" and it shows in their products.
  • http://www.xlr8yourmac.com/G4ZONE/G4_733mhz_review/apple_G4_733_tests.html#storytop has some single vs. dual benchmarks on MacOS.

    You are right on about the graphic designers. Apple's high-end machines are selling almost entirely into niche markets, and for the most part those markets use software which takes advantage of multi-processing.

    Apple's 'regular user' base is for the most part not buying multi-CPU Macs.
  • Will this help our dental friends using a cobol program for office management, developed on Xenix and only recently ported to SCO OpenServer? We're using curses to display on terminals over serial lines.

    Or how about the supply ordering system that runs on DOS? We actually bought a Celeron-based system just to run this one app and then found that the program wouldn't run because it had a stupid win-modem installed! Forget about using the ink-jet printer too.

    The Windows-based version wouldn't load because the installer told us we had to up-grade from Windows ME to Windows 95 or better. Thought seriously about Linux/Wine - that's better, isn't it?

    SMT is not going to help anything we use run faster; so why should we spend the money? Why would anyone develop software for this technology? We don't even get Linux distros optimised for our Athlons!

    This is just blue smoke and mirrors to try to get the techno-junkies to shell out a bunch of money for a new system that might actually run their apps ten percent faster for four times the cost.
  • From the article: If the hardware and software portions of SMT come together, then the question of "why do I need a 1.5 GHz processor?" may be answered in very short order
    Those guys got it totally wrong. The question of "why do I need a 1.5 GHz processor?" stems from the fact that current processors are more than fast enough for many applications.
    This question isn't answered by introducing a system with higher performance. IMHO it might be answered by new, attractive applications that need the better performance (speaker-independent speech recognition?).
  • These days, typically only ardent graphicophiles require a powerful machine on their desk, given a high bandwidth network connection and the availability of suitable shared resources. Sure it can be handy for a developer to sit next to his obscenely over-specified machine, but similar results can be achieved in different ways.

    Unfortunately, a similarly dismissive attitude is misplaced when considering server architectures. As real life presents opportunities to build ever larger scale interactive systems, the speed of execution of critical sections becomes ever more important.

    There are more useful questions to ask than "What proportion of users browse with IE on Windows" - for example

    1. What proportion of users use unconnected (i.e. no ISP/LAN) machines?

    2. What proportion of users require to use client server software to access centralised resources?

    3. Are users satisfied with the performance of servers when they access shared resources at times of high contention?

    Or... in a much more down to earth fashion - do these users of IE also use Exceed (or other win32 X-server), or telnet?

  • Correct me if I'm wrong, but weren't some Mac Apps, like Photoshop, written to take advantage of a multi-processor set-up? Sure the OS is single CPU, but that doesn't stop the program from grabbing the extra cycles when it can.
    -----------------

"Money is the root of all money." -- the moving finger

Working...