
HP Shows Off PA-8800 SMP-On-A-Chip CPU Plans 176

Eric^2 writes: "At last week's Microprocessor Forum, HP's David J. C. Johnson unveiled the details of HP's latest RISC processor, destined to redefine performance in server-class processors. Following a relatively simple strategy, the PA-8800 combines two PA-8700 cores on a single chip to enable symmetric multiprocessing (SMP) on a single processor. Aside from bumping the core speed up to an initial 1 GHz, enhancements include a combined 35 MB of L1 and L2 cache. The article contains the full text. AMD, please steal an idea..."
This discussion has been archived. No new comments can be posted.

  • And Get Sued? (Score:1, Offtopic)

    by tomblackwell ( 6196 )
    These companies tend to patent anything that will give them a competitive edge in the marketplace. "Stealing an idea" would probably get them into some legal hot water, just like stealing a TV, or your car.
    • Re:And Get Sued? (Score:2, Informative)

      It wouldn't be stealing an idea. This idea has been around for a long time in academia. Maybe the poster forgot this, but the POWER4 from IBM does this, and comes with 32MB of L3 cache, plus an on-die shared L2 cache. The idea isn't new; it's known as CMP (Chip-level Multi-Processing). Really, "SMP on a chip" is just CMP.

      Also, though Sun has decided not to use the MAJC architecture for anything (they were hoping to try to get it to become a video accelerator, but even that's most likely not going to happen), that too was fully spec'ed out to have multiple cores on a's really nothing new :)

      The longstanding rumour is that AMD will be coming out with a dual-hammer processor (ie, CMP). In academia, the idea has been used frequently as well.

      The idea of using CMP isn't even that big a deal to most consumers. While it would be nice for AMD to come out with a chip that does multithreading (merely because it increases real-world throughput quite a bit, depending upon the type of multithreading), the average PC running Windows 9x/ME/XP Consumer won't be able to use a second processor anyway. The only reason for AMD to multithread is the server space, which is what they're aiming for with the Hammer series...but I digress.
  • The IBM p690 server uses POWER4 processors. Each
    chip has 2 POWER cores with high-speed interconnects. Even better is that each chip is connected to 3 other chips to make up 8 CPU packs.
    • HP can't bring this chip to market, I don't care how cool it is. HP and Compaq are merging to bring about volume manufacturing and economies of scale, largely to fend off Dell, which is destroying them both individually using advanced manufacturing and distribution methods.

      With an agenda based on scale, you don't get there by introducing a new CPU in a dead line. HP's SuperDome line is getting creamed by Sun and IBM - HP cannot afford to go back to the front lines with another enterprise offering unless SuperDome pans out a hell of a lot more than it is currently.

      HP has always had impressive technology but still loses market share. HP-UX has dwindling market share and software support. The merger with Compaq will derail any plans for further proprietary architectures.

      If you want to look at the gee-whiz value here, fine, but don't expect to see this in a product.

  • Wow... And I thought the 8MB L2 cache on UltraSPARC IIIs was a lot, not to mention the 16MB on some IBMs. Now we're talking about 3MB just in L1 with 32MB L2 cache. This beasty should have some impressive benchmark scores (yeah, I know, benchmarks aren't everything...)
    • 1. The extremely large amount of cache will not really help machines running a small number of processes, as a small amount of cache is usually enough for a very high hit rate. After that, latency issues arise, which can be taken care of by aggressive prefetching.

      2. It will make a major difference when there are a large number of processes, since it prevents thrashing between the L1/L2 cache and main memory. It must be obvious that HP is targeting servers with this processor, where a large difference can be expected. For a single user, 8MB of L2 cache is more than you will ever, ever need.

      Opinions welcome
  • by turbine216 ( 458014 ) on Monday October 22, 2001 @01:58PM (#2460817)
    ...a 1 GHz processor may not sound like much, even in this dual-core configuration, but keep in mind that this is a RISC processor. None of that super-mega-ultra-long-50-bazillion-stage pipeline crap that Intel uses to pump up its MHz rating. The article sells this point a little short. The RISC architecture allows this processor to do roughly twice as much work in the same amount of time - or, to put it in a more concrete scenario: imagine a pair of 2GHz Pentium 4s running in SMP configuration.

    Now that's FAST .
    • It's not just the CPU. Think about the I/O in the HP 9000. Now that's FAST

    • It might be fast, but imagine the heat sink on that puppy. It could heat my pool... in the middle of winter.

      • Hehe... well, let's see: the heat sink on the K-class CPUs (180/240MHz) is essentially a big metal plate with some sort of cooling channel in it; the 550MHz PA-8600s in our N-class have a square-socket-style stacked heat sink amid a bunch of high-volume fans... that sink is about 3-4 times as tall as an Intel/AMD socket chip. So this new one would have to be a big beastie... active cooling this time? Maybe?
        • Having handled a few of them, I think the K-class CPU (PA-8000 - 8200) coolers are active ones. Not sure about the ones in L and N class servers (which have BIG radiators cooled by HUGE fans) or about the SD ones (which are cooled by 4 BIG M-F turbines and some fans)
    • by svirre ( 39068 ) on Monday October 22, 2001 @02:40PM (#2461102)
      RISC or CISC primarily affects the complexity of the fetch and decode stages of the CPU.

      The famous Intel pipeline is in the execute stage (ALU).

      Pipelining is a strategy which is equally valid for both RISC and CISC architectures, and a RISC architecture does not offer any complexity advantage in the execute stage. After all, a multiplier is a multiplier regardless of the overlying architecture.

      Nowadays we don't really see much difference in performance between RISC and CISC architectures for upscale processors. This is because the savings in fetch and decode logic are dwarfed by other costs like prefetch, reordering and branch prediction (which are used in both architectures).
    • RISC vs. CISC doesn't matter anymore. Both AMD and Intel convert the CISC architectural instructions into RISC-like internal instructions called "micro-ops".

      The Iron Law of microprocessor performance states:

      time = Instructions/Program * Cycles/Instruction * Seconds/Cycle

      Notice that the above cancels to Seconds/Program.

      So given the same program, you can do one of two things to reduce the execution time:

      1. Reduce Cycles/Instruction. This is done by executing more instructions at a time (Superscalar execution), and reducing the penalties of speculation (predicting what the program will do).

      2. Reducing the Seconds/Cycle (increasing clock rate). This is the approach Intel has taken with Netburst (Pentium 4).

      By making the P4 pipe 22 stages long, Intel can increase the clock rate. It also helps throughput: assuming an ideal world where no data dependencies exist, more stages == better throughput. However, control and data hazards always mess up a pipeline, and the longer the pipeline, the bigger the penalty on a branch misprediction. This increases cycles/instruction (CPI).

      The average CPI of the Athlon and PIII is much lower than that of the P4; the P4, however, makes up for this in clock rate.
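      The arithmetic behind this tradeoff is easy to play with. Here's a quick sketch of the Iron Law; all the instruction counts, CPIs and clock rates below are made-up illustrative numbers, not measurements of any real chip:

```python
# Iron Law: time = (instructions/program) * (cycles/instruction) * (seconds/cycle)
def exec_time(instructions, cpi, clock_hz):
    """Execution time in seconds; seconds/cycle is 1/clock_hz."""
    return instructions * cpi / clock_hz

# Hypothetical short-pipeline core: low CPI, modest clock.
t_short = exec_time(1e9, cpi=1.1, clock_hz=1.0e9)   # ~1.1 s
# Hypothetical long-pipeline core: worse CPI (misprediction penalties), double the clock.
t_long = exec_time(1e9, cpi=1.6, clock_hz=2.0e9)    # ~0.8 s
print(t_short, t_long)
```

      Despite the higher CPI, the long-pipeline core wins here on clock rate alone - which is exactly the Netburst bet.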
    • well yes, HP PA-RISC is nice, but really it's catch-up

      MIPS has had a 1GHz dual core on the same die for a while

      and it's 64-bit

      check 0002 []
      or 2/index.asp []

      oh yeah, did I mention that PA-RISC is a MIPS descendant?
      but shhh, they made so many changes they fscked the pipeline (they might have got it working again but I don't know any more)

      may the SPECINT and SPECFP fight it out


      john jones

      p.s. I wonder what the HP layout guys think of Intel chips (-;

  • by ruiner13 ( 527499 ) on Monday October 22, 2001 @01:59PM (#2460824) Homepage
    Did that say 35MB of L1 + L2 cache? I may be rusty, but I think I remember reading in my Processor Design for Dummies book that increasing cache size actually can slow down processor performance after a certain amount. Could someone please clarify this?
    • It can hurt performance because the larger the cache, the higher its latency is going to be. Of course, the larger it is, the higher the hit rate in that cache, so it's less likely to have to go down another level in the memory hierarchy.

      Using several caches of varying sizes is actually better than one monolithic cache, for the reasons you described. With more cache levels, the primary one can focus on low latency, the second on high bandwidth, and the third on a high hit rate.

      But yes, it is indeed 35MB of caches, though it's worth noting that the L2 cache is off-die.
      • So then, theoretically speaking, a combination of a large cache and a long pipeline a la Pentium 4 would be very bad, as it would stall the pipeline while it waits for the cache to be searched - or do I have this backwards? Would a larger cache on the P4 help it with its ungodly long pipeline, since cache misses would be reduced?
    • As the CPU frequencies outstrip memory frequencies by larger and larger margins, the cost of a cache-miss increases - and so does the number of cycles a chip can afford to use looking through the cache. Because of that, the amount of cache where it stops making sense to add more is much, much higher than it was five years ago.

      Intel chips, though, keep using about the same overall amount of cache, to keep costs down.
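      The effect the parent describes is the standard average-memory-access-time calculation. A back-of-envelope sketch (every number below is invented for illustration):

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time in cycles: hit cost plus expected miss cost."""
    return hit_time + miss_rate * miss_penalty

# With a slow main memory (big miss penalty), paying a few extra cycles of
# cache lookup for a better hit rate is a clear win.
MISS_PENALTY = 200  # cycles to main memory (made up)
small = amat(hit_time=2, miss_rate=0.10, miss_penalty=MISS_PENALTY)  # 22 cycles
large = amat(hit_time=4, miss_rate=0.02, miss_penalty=MISS_PENALTY)  # 8 cycles
print(small, large)
```

      As the miss penalty grows with the CPU/memory speed gap, the break-even point keeps moving toward larger (and slower) caches.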

    • The ratios between memory hierarchy levels should be taken into consideration when designing any layer. For instance, vastly increasing the size of the L2 cache will make the L2 hit ratio go up, but the L1 hit ratio go down (assuming the inclusion policy is in effect - this is not always so anymore -- see the 1st generation Duron chips)

      Similarly, adding more RAM to a machine _could_ slow it down in some situations, because the "overall" cache hit ratio could go down.

      Also, when caches get to be too large, the cache policy may need to be changed. A fully associative cache is the most flexible placement policy and can give a great hit ratio for a large working set; however, a fully associative cache search takes longer than a direct-mapped "search" or a set-associative search.

      So, if to get a large cache size they had to go to set-associative or direct-mapped, then that will generally lower the hit ratio vs. a cache of the same size which is fully associative.

      It's all tradeoffs, basically. You could write a cache simulator to play around with this :)
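      Such a simulator only takes a few lines. This sketch compares a direct-mapped cache against a fully associative LRU cache of the same size; the trace and all sizes are invented for illustration:

```python
from collections import OrderedDict
import random

def direct_mapped_hits(trace, num_lines):
    """Direct-mapped: each address maps to exactly one line (addr mod num_lines)."""
    lines = {}  # line index -> tag currently resident
    hits = 0
    for addr in trace:
        idx, tag = addr % num_lines, addr // num_lines
        if lines.get(idx) == tag:
            hits += 1
        else:
            lines[idx] = tag  # conflict or cold miss: replace
    return hits

def fully_assoc_hits(trace, num_lines):
    """Fully associative with LRU replacement: any address can occupy any line."""
    cache = OrderedDict()
    hits = 0
    for addr in trace:
        if addr in cache:
            hits += 1
            cache.move_to_end(addr)  # mark most recently used
        else:
            if len(cache) >= num_lines:
                cache.popitem(last=False)  # evict least recently used
            cache[addr] = None
    return hits

# A made-up trace with some locality: mostly a small hot set, occasional strays.
random.seed(0)
trace = [random.randrange(64) if random.random() < 0.9 else random.randrange(4096)
         for _ in range(10000)]
for n in (64, 256):
    print(n, direct_mapped_hits(trace, n), fully_assoc_hits(trace, n))
```

      Playing with the hot-set size and line counts shows the conflict misses a direct-mapped cache suffers that a fully associative one avoids.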

      • > For instance, vastly increasing the size of the L2 cache will make the L2 hit ratio go up, but
        > the L1 hit ratio go down (assuming the inclusion policy is in effect - this is not always so
        > anymore -- see the 1st generation Duron chips)

        This is wrong - the size of the L2 cache has no effect on the hit rate of the L1 cache(s). Just think about what a hit or miss really is, and then ask yourself why 0MB, 1MB, 2MB ... of L2 cache should make a difference to the hit rate of the L1 cache.

        And - since L1 caches usually work with virtual and not physical addresses - the L1 hit rate is not even influenced (much) by the amount of RAM.
        • If L1 is implemented as a direct-mapped cache of L2, then if L2 increases, L1's hit ratio goes down.

          It may be the case that no L1 is implemented this way. It is certainly the case that adding more RAM will decrease the hit ratio of an L2 cache.

          Whether (L2 miss penalty * frequency of L2 misses) vs. (page fault penalty * frequency of page faults) turns out to be greater is probably

          1) workload specific
          2) generally in favor of more RAM at the expense of a lower L2 hit ratio (because page fault servicing is abysmally slow)
          3) you can probably generate a pathological case that shows either result :)

          Apologies, I'm rusty on this stuff :)

          • > If L1 is implemented as a direct-mapped cache of L2, then if L2 increases, L1's hit ratio goes
            > down.

            Yes, *if*, but it can't work this way, since the L2 cache isn't used as direct-access memory.

            The L2 cache is an associative memory, i.e. you use the normal (physical, in seldom cases maybe even virtual) addresses to address memory. (Otherwise it wouldn't be a cache...)

            Which means that the address space the L1 cache has to handle is not affected by the size of the L2 cache, and that means the L2 cache has absolutely no influence on whether an access to the L1 cache results in a hit or a miss - it doesn't even matter if there's an L2 cache or not.

    • Yea, I gagged a little myself the first time I read it. Until I remembered that HP was the one who put 1.5 MB caches on chips in the PII era.
    • Sun has been putting 8MB of L1 + L2 cache on the UltraSPARC III CPUs. Of course, this CPU is really out there now, and not just an idea on the drawing board..
    • Remember that those CPUs are intended for machines which come with a recommended minimum of 1GB of RAM or more...
  • Smokin (Score:1, Informative)

    Why hasn't someone else done something like this? I would pay whatever it cost to get even an 8MB L1 & L2 Cache. Anyone want to make me one?
    • Re:Smokin (Score:2, Informative)

      by T-Punkt ( 90023 )
      "whatever it cost"?

      Then go and buy something from Sun, IBM, or Compaq -> AFAIK all three sell servers with such large L2 caches. (Maybe HP and SGI as well.)

      E.g. something from IBM's z900 series (mainframe - up to 32 MB L2 (per CPU?)) or pSeries 620 (workgroup/midrange server - up to 8MB L2 per CPU) or Sun Enterprise 450 (workgroup server - up to 8MB L2 cache per CPU), Sun Fire 15K (high-end server, 8MB L2 per CPU), Compaq AlphaServer GS/ES series (up to 8MB per CPU).

      And if you just want a total of 8MB, an SGI Origin 300 with 4 or more CPUs should do it as well (2MB L2 per CPU).
  • Siroyan's OneDSP (Score:2, Informative)

    by Anonymous Coward
    The most interesting parallel architecture I heard about at the MPF was Siroyan's [] OneDSP architecture. This is a clustered VLIW machine that can execute up to 64 instructions each cycle! See the EE times article [] and their MPF paper []
    • I bet the compiler guys are gonna have fun statically scheduling 64 instructions each cycle! (If you can't tell, I'm dripping with sarcasm, as Intel is having a tough time scheduling even 6 per cycle... though this is a DSP, so that makes its application much better.)
  • The official HP presentation on the PA-8800 is
    available as a PDF from 01.pdf [].

  • Two CPUs on a chip. (Score:5, Informative)

    by Animats ( 122034 ) on Monday October 22, 2001 @02:10PM (#2460904) Homepage
    That makes sense. Two CPUs on a chip isn't a new idea, though. The IBM Power4 [] PowerPC chip is very similar, with two PowerPC processors on the same die. There's even a module with 4 such chips [] (8 processors) inside a machined aluminum block. That's intended as a building block for supercomputers.

    Earlier steps in the multi-CPU direction included the 8-way DEC Alpha (killed in the merger with HP?) and a little National Semiconductor product for embedded systems with two very modest CPUs on a chip.

    • Earlier steps in the multi-CPU direction included the 8-way DEC Alpha (killed in the merger with HP?) and a little National Semiconductor product for embedded systems with two very modest CPUs on a chip.
      The IP for Alpha is now in the hands of Intel rather than Compaq (or Hewlett-Packard). I'm not sure if Intel will assimilate the technology into their IA-64 processors, release it as a high-end EV7 processor, or just kill it altogether.
    • if you can remember wayyyy back, there was another single processor 8700 that was re-issued as an 8800

      they were vaxen.
  • Doesn't Chuck Moore's 25x already do SMP-like things, at a few billion instructions per second? Last time I checked he was using a 20-word instruction set on a stack-based computer, which IMO counts as RISC.

    This is hardly new, but HP's version probably uses some fancy new lithography, and wins when it comes to clock speed.
  • by Spootnik ( 518145 ) on Monday October 22, 2001 @02:15PM (#2460930)
    PA-8800 lets you create two opposite predicates in one instruction, for example the predicate a<b and a>=b.

    This seems to indicate that there are no separate "do this if predicate is true" and "do this if predicate is false" instructions, so for opposite predication you would have to specify two different predicates.

    The processor cannot know that these two predicates are related, so this would give you quite a problem.

    As has been publicly disclosed, in general in PA-8800, an instruction reading any resource (such as a predicate) must be in a later instruction group (cycle) than the instruction writing that resource. As a special case, branches are allowed to use a predicate written by another instruction in the same instruction group (as shown in the IDF slides).

    So, the straightforward (but slow) PA-8800 schedule for the following example:

    if (a < 0)
        b += a;
    else
        b -= a;
    c += b;
    d += b;

    would be:

    pLT, pNLT = a, 0      // pLT & pNLT are 2 complementary preds
    (pLT) add b = b, a    // add to b [then]
    (pNLT) sub b = b, a   // or sub from b [else]
    add c = c, b          // uses of b
    add d = d, b

    which takes 5 instructions in 3 cycles. (Note: In PA-8800 assembly, ";;" indicates the end of an instruction group, "=" separates the target operand(s) from the source(s), "//" begins a comment, and (pred) specifies the controlling predicate.)

    An alternate (faster) schedule in PA-8800 is as follows:

    sub bTmp = b, a // speculatively sub from b (into temp)
    add b = b, a // and add to b
    pLT, pNLT = a, 0
    (pLT) add c = c, b // uses of b [then]
    (pLT) add d = d, b
    (pNLT) add c = c, bTmp // uses of b (temp) [else]
    (pNLT) add d = d, bTmp
    (pNLT) mov b = bTmp // move bTmp to b [else]

    This takes 8 instructions in 2 cycles and one extra register. The final move of bTmp to b can be eliminated if b isn't live out at that point.
  • Er... (Score:2, Interesting)

    by Glock27 ( 446276 )
    A couple of points:

    Following a relatively simple strategy, the PA-8800 processor combines two PA-8700 cores on a single chip to enable symmetric multiprocessing (SMP) on a single processor.

    It doesn't enable SMP "on a single processor". It provides two processors on a single die. There is a distinction.

    AMD, please steal an idea...

    The big rumor regarding the third version of Hammer is that it'll be a dual-CPU module. Any guesses as to Hammer's clock speed on release?

    299,792,458 m/s...not just a good idea, its the law!

    • 299,792,458 m/s...not just a good idea, its the law!

      Er... I'm afraid it's 299,792,458 km/s...

      Sorry, it's the law! ;)

      • Er... I'm afraid it's 299,792,458 km/s...

        Check your're mistaken. In English units it's 186,282 mi/s.

        Sorry, it's the law! ;)

        Perhaps in some alternate universe... ;-)

        299,792,458 m/s...not just a good idea, its the law!

  • by zensonic ( 82242 ) on Monday October 22, 2001 @02:18PM (#2460952) Homepage that you actually can go out and buy a new mainframe using POWER4. Nothing wrong with looking ahead, but if you remember, AMD said that the Athlon should have been made in an "Athlon Ultra" version sporting 8MB L2 cache. .... I still stick to the motto: "I'll believe it when I can buy it"
  • I thought HP had committed itself to ditching the PA-RISC and moving to Itanic, err, Itanium.
    • I thought HP had committed itself to ditching the PA-RISC and moving to Itanic, err, Itanium.

      Yes, that is still the story as far as I know. It was the intended direction years ago, but IA-64 (Itanium) has taken longer to develop than originally expected (way longer), and the customer base hasn't shifted to IA-64 yet. Until the customers start paying $$ for IA-64, HP will continue to make revenue from their existing PA-RISC customers.

      The bottom line is that customers don't really care what the underlying processor is. Customers care that their legacy applications continue to run on new machines, that they run with good performance, and that they run reliably. The processor type is just a piece of the total compute environment. HP's motivation is to move to a higher-volume microprocessor (IA-64). The current PA-RISC processor volumes are relatively low, and they have to factor in all the R&D expense that goes into a low-volume processor. The more complex the processors get, the more R&D expense goes into the design of the chips and into the fabs to build them.

      I'm still curious what the effect of AMD's Sledgehammer will be for customers. For those customers using IA-32, Sledgehammer (64-bit enhanced IA-32) takes less porting effort than IA-64. I have yet to see any reports of a deployed IA-64 application in the real world. Everything so far has been lab results and marketing blurbs.

      • The bottom line is that customers don't really care what the underlying processor is.

        You couldn't be more wrong. Enterprise customers often have technical support staff as versed in the tech as the vendor. They have to be - they are making budget and platform decisions that have huge ripple effects in their organizations.

        The viability of the platform will figure keenly in the minds of anyone looking at further extensions of the PA line. Seeing as SuperDome has been a dud, you can presume that the SuperDome successor will thud even louder.

    • Even last year, HP reps at the "HP World" conference were letting it be known that they were seriously hedging on the IA-64/Itanium/whatever chip due to Intel's notoriously crummy product reliability history. HP's got PA-8900 and PA-9000 chips in the pipeline. PA-RISC is not going away soon, if ever.
  • Reading through the article, this design seems to share a lot in common with Sun's MAJC architecture. Both allow for multiple cores on a single chip. Anyone else notice the similarities?

    I guess the biggest difference would be that the HP chip is actually going to be built, while the MAJC chip seems to still just be a design.

    It is interesting that a number of designs lately seem to be looking to the integration of multiple CPU cores on a single chip to increase performance in server applications.

  • As has been mentioned, IBM is doing this with POWER. SiByte/Broadcom has done this with an embedded processor:

    EEtimes Story []

    Everyone in the high-performance CPU market (except itanic) is doing either this or multiple concurrent thread contexts to speed overall system computational throughput.

  • When you consider that the PA-RISC team has been transferred to that "evil" company Intel.

  • I *thought* the cache density looked a bit high for ordinary SRAM - the article mentions something they're calling "single-transistor SRAM".

    Does anyone know how on earth they're managing this? Or is this just some low-leakage variant of DRAM with added marketing spin?
    • Actually, 1T-SRAM has been around for a while. The GameCube uses some for main memory. See the /. article []
      • Actually 1T-SRAM has been around for awhile. The GameCube uses some for main memory. See the /. article

        I checked the article and Mosys's site, but the only detailed descriptions of the technology are behind registration scripts.

        Could you give a brief description of the operating principles of single-transistor SRAM, or should I just bite the bullet and register on Mosys's site?
  • imagine... (Score:2, Funny)

    by vsync64 ( 155958 )
    ...a Furbeowulf cluster [] of these things!
  • In news today, a small chunk of Austin TX vaporized when an engineer tripped over a Thermaltake vortex containment field, causing an experimental single-chip SMP AMD processor to go critical in its 1024 pin socket...
  • by CigarBuff ( 61105 ) on Monday October 22, 2001 @03:27PM (#2461441)
    AIUI, there are two competing methods of scaling CPUs now - Simultaneous Multi-Threading (SMT) and Chip-level Multi-Processing (CMP). HP is going CMP because SMT is too difficult in terms of writing the compilers. Both Compaq (with the Alpha CPU) and IBM (POWERx) are going SMT. In fact, the biggest thing Intel got out of its purchase of Alpha technology, other than the engineers themselves, is the Alpha SMT work.
    • AIUI, there are two competing methods of scaling CPUs now - Simultaneous Multi-Threading (SMT) and Chip-level Multi-Processing (CMP).

      Are they really competing technologies? I can't think of why SMT cpu's couldn't be used in SMP systems. SMP is a way of adding more CPU's, SMT is a way to keep each CPU busy more of the time. They sound kind of complementary to me.

      • IIRC the Xeon processor from Intel uses SMT but is complemented by being SMP-capable. You're right that SMT has no real bearing on whether or not you can use multiple processors in a system. I don't know why the dude was seeing them as competing technologies, because they most definitely aren't. Just look at the POWER4: it has several processors on a single chip, you can use more than a single chip in a system, and IIRC they are also going to add SMT once they're released.
    • HP is going CMP because SMT is too difficult in terms of writing the compilers.

      Actually, I think they're doing it because it means they don't have to design a new processor core.

      As far as each thread being executed in an SMT chip is concerned, they're running on a single-thread processor. The same scheduling optimizations that benefit code in a single-thread system will benefit the code running SMT with other threads. SMT actually makes this job a bit easier, by reducing the effective latency of instructions (if neither thread's stalled, each thread will execute every other clock, making a 10-cycle-latency instruction look like a 5-cycle-latency instruction, which in turn makes each thread less _likely_ to stall; nice feedback loop here).

      The only extra complexity would be in the operating system's scheduling and context switching routines, and that wouldn't be much more complicated than on a multiprocessor system.
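      The latency-hiding arithmetic above is easy to check with a toy model, assuming strict round-robin issue between threads (real SMT cores schedule more opportunistically):

```python
def effective_latency(latency_cycles, num_threads):
    """With round-robin issue, a thread only gets a slot every num_threads
    cycles, so an N-cycle result is ready after ceil(N/num_threads) of that
    thread's own issue opportunities."""
    return -(-latency_cycles // num_threads)  # ceiling division

print(effective_latency(10, 1))  # 10 issue slots: the full latency is exposed
print(effective_latency(10, 2))  # 5 issue slots: half the latency is hidden
```

      So with two runnable threads, a 10-cycle-latency instruction costs each thread only 5 of its own issue slots, as described above.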
  • Sounds like more kernel work. I won't be happy until I can mount file systems in my cache. Think about it. My 286 only had a 40 MB hard drive. Hello, solid state!
    • If you only required a relatively small (by today's bloated standards, at least) application to run at the highest possible speeds, you could run it ENTIRELY from the cache, without any main RAM at all.. I upgraded to 6MB on my Amiga a few years ago and thought I had more RAM than anyone would ever need..
  • As has been pointed out above, this is just HP playing catchup to IBM. IBM has taken a leap ahead of its competitors, and now they all have to play catchup.

    HP's announcement is nothing compared to what IBM has in development.
  • I have an HP 9000 715/80 with a PA-7100LC CPU. It can boot hppa Debian, has Ethernet connectivity, and has its console on a dumb terminal. It is pretty cool, and from what I have read, I believe it has something like 24 general-purpose registers, which is quite a lot for a typical CPU. This one is an older 32-bit CPU, but that still seems like a lot of registers to work with for high-performance code and such.

    HP workstations certainly seem to be very solid and nifty and they have a lot of potential for linux boxes. Assembly programmers will appreciate all of the registers that are available.

"Hey Ivan, check your six." -- Sidewinder missile jacket patch, showing a Sidewinder driving up the tail of a Russian Su-27