HP Shows Off PA-8800 SMP-On-A-Chip CPU Plans
Eric^2 writes: "At last week's Microprocessor Forum, HP's David J. C. Johnson unveiled the details of HP's latest RISC processor, destined to redefine performance among server-class processors. Following a relatively simple strategy, the PA-8800 combines two PA-8700 cores on a single chip to enable symmetric multiprocessing (SMP) on a single processor. Aside from bumping the core speed up to an initial 1 GHz, enhancements include a combined 35 MB of L1+L2 cache. The article has the full details. AMD, please steal an idea..."
And Get Sued? (Score:1, Offtopic)
Re:And Get Sued? (Score:2)
If AMD has a desire to cram two AMD-64 cores into one package, they had better come up with their own solution or license IBM's...
Re:And Get Sued? (Score:2, Informative)
Also, though Sun has decided not to use the MAJC architecture for anything (they were hoping to turn it into a video accelerator, but even that most likely isn't going to happen), it too was fully spec'ed out to support multiple cores on a chip... it's really nothing new
The longstanding rumour is that AMD will be coming out with a dual-hammer processor (ie, CMP). In academia, the idea has been used frequently as well.
The idea of using CMP isn't even that big a deal to most consumers. While it would be nice for AMD to come out with a chip that does multithreading (merely because it increases real-world throughput quite a bit, depending on the type of multithreading), the average PC running Windows 9x/ME/XP Consumer won't be able to multithread anyway. The only reason for AMD to do multithreading is the server space, which is what they're aiming for with the Hammer series... but I digress.
Re:And Get Sued? (Score:2)
Not to defend MS, or anything, but XP Consumer is still based on the NT core. It'll multithread.
Hmmm Sounds Like IBM (Score:2)
Each POWER4 chip has two POWER cores with high-speed interconnects. Even better, each chip is connected to three other chips to make up 8-CPU modules.
And IBM actually has a business model behind it (Score:2)
With an agenda based on scale, you don't get there by introducing a new CPU in a dead line. HP's SuperDome line is getting creamed by Sun and IBM; HP cannot afford to go back to the front lines with another enterprise offering unless SuperDome pans out a hell of a lot better than it is currently.
HP has always had impressive technology but still loses market share. HP-UX has dwindling market share and software support. The merger with Compaq will derail any plans for further proprietary architectures.
If you want to look at the gee-whiz value here, fine, but don't expect to see this in a product.
How much cache??? (Score:2)
Re:How much cache??? (Score:1)
2. It will make a major difference when there are a large number of processes: it prevents thrashing, at least between the L1/L2 cache and main memory. As must be obvious, HP is targeting servers with that processor, where a large difference can be expected. For single-user workstations, an 8 MB L2 cache is more than you will ever, ever need.
Opinions welcome
Re:How much cache??? (Score:1)
In the P4, the 20-stage pipeline and its prediction logic are much more susceptible to cache latency problems. Waiting two or three cycles for data on a P4 could cause major pipeline hiccups.
-- Len
Re:How much cache??? (Score:1)
The L2 cache of the P4 probably has a latency of several cycles, but I don't think it is a problem. In fact it could be a reason to implement extreme pipelining (so you can make 'small' steps towards computing the result instead of only being able to take big steps). Waiting for main memory _is_ a problem, a _BIG_ problem.
I really wonder where you get your ideas.
Re:How much cache??? (Score:1)
RISC takes many more instructions to do the same thing as a CISC chip, so caching the instructions will improve performance for repeated actions. The P4 is RISC-like at the chip level, so it would probably benefit from a much larger cache. I believe the size of its cache has less to do with keeping latency low than with chip yields for memory that runs reliably at 2 GHz.
I will concede that some programs would definitely show a large cache to be superior to a small low-latency cache. Witness the older SPEC benchmarks, which became obsolete when they would fit entirely into the L1 and L2 caches of some processors. In such cases, the large cache is superior to the small low-latency one.
Think about what these processors will be used for. Most likely large database servers, as that is what HP-UX is good for. While an entire database will not fit into 32 MB, a lot of repetitive queries and essential system and program code can fit into a cache this large. In this way, this processor would be superior to a raft of P4s, which would be constantly fetching from main memory for this particular kind of service. So the first search for "Pam Anderson" web sites would take the same amount of time on a P4 as on this PA-8800, but every successive search would have a much higher probability of being a cache hit on the PA-RISC chip.
I agree completely that waiting for main memory is a huge problem.
-- Len
Re:How much cache??? (Score:1)
> The latency issue just isn't an issue with the
Name me a high-end processor and it will have a latency problem. If you say that isn't true, you haven't designed or thought about architectures. Besides, all designs are aware of it and try to hide latency problems; designs would be _drastically_ different without them.
>Cache size though, is much more important, as a larger cache can hold many queued instructions and data, which will be fed in a constant stream to the processor without having to revert to slower memory.
You just defined a cache and its importance, but not why its size is important. More is not better; more is a choice with its own drawbacks. I do take your point on how this processor will often be used (databases).
>Waiting two or three cycles for data on a P4 could cause major pipeline hiccups.
But since two stages in a P4 are roughly equivalent to one stage in an Athlon, what is the difference? (The P4 has a 20-stage pipeline, the Athlon 10 stages, I think.)
Re:How much cache??? (Score:1)
Re:How much cache??? (Score:1)
I just assumed that the Ultra SPARC would use it as well. Mea culpa.
-- Len
just to make sure nobody is misled... (Score:4, Interesting)
Now that's FAST.
Re:just to make sure nobody is misled... (Score:1)
Re:just to make sure nobody is misled... (Score:1)
It might be fast, but imagine the heat sink on that puppy. It could heat my pool... in the middle of winter.
~Sean
Re:just to make sure nobody is misled... (Score:1)
Re:just to make sure nobody is misled... (Score:2)
Re:just to make sure nobody is misled... (Score:5, Informative)
The famous Intel pipeline is in the execute stage (ALU).
Pipelining is a strategy that is equally valid for both RISC and CISC architectures, and a RISC architecture does not offer any complexity advantage in the execute stage. After all, a multiplier is a multiplier regardless of the overlying architecture.
Nowadays we don't really see much difference in performance between RISC and CISC architectures for high-end processors. This is because the savings in fetch and decode logic are dwarfed by other costs like prefetch, reordering, and branch prediction (which are used in both architectures).
Re:just to make sure nobody is misled... (Score:1)
The Iron Law of microprocessor performance states:
time = Instructions/Program * Cycles/Instruction * Seconds/Cycle
Notice that the above cancels to Seconds/Program.
So given the same program, you can do one of two things to reduce the execution time:
1. Reduce Cycles/Instruction. This is done by executing more instructions at a time (superscalar execution) and by reducing the penalties of speculation (predicting what the program will do).
2. Reduce Seconds/Cycle (increase the clock rate). This is the approach Intel has taken with NetBurst (Pentium 4).
Making the P4 pipe 20 stages long helps Intel increase the clock rate. It also helps throughput: assuming an ideal world where no data dependencies exist, more stages == better throughput. However, control and data hazards always mess up a pipeline, and the longer the pipeline, the bigger the penalty on a branch misprediction. This increases cycles per instruction (CPI).
The average CPI of AMD's chips and the PIII is much lower than the P4's; the P4, however, makes up for it in clock rate.
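Plugging toy numbers into the Iron Law shows the tradeoff. The CPI and clock figures below are invented purely for illustration, not measurements of any real chip:

```python
# Iron Law: time = (instructions/program) * (cycles/instruction) * (seconds/cycle).
# All CPI and clock numbers here are made up to illustrate the tradeoff.

def exec_time(instructions, cpi, clock_hz):
    """Seconds to run a program under the Iron Law."""
    return instructions * cpi / clock_hz

INSNS = 1_000_000_000  # same program, same instruction count for both designs

# Deep pipeline: higher clock, but mispredictions flush more stages,
# so the average CPI suffers.
deep = exec_time(INSNS, cpi=1.8, clock_hz=2.0e9)

# Short pipeline: lower clock, better CPI.
short = exec_time(INSNS, cpi=1.1, clock_hz=1.4e9)

print(f"deep pipeline:  {deep:.3f} s")   # 0.900 s
print(f"short pipeline: {short:.3f} s")  # 0.786 s
```

With these particular numbers the clock advantage does not quite cover the CPI penalty; nudge either figure and the result flips, which is exactly why both design schools exist.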
MIPS the original RISC (Score:2)
MIPS has had a 1 GHz dual core on the same die for a while,
and it's 64-bit.
check
http://www.electronicstimes.com/story/OEG20010612
or
http://www.pmc-sierra.com/products/details/rm9000
oh yeah, did I mention that PA-RISC is a MIPS descendant?
but shhh, they made so many changes they fscked the pipeline (they might have got it working again, but I don't know any more)
may the SPECINT and SPECFP fight it out
regards
john jones
p.s. I wonder what the HP layout guys think of Intel chips (-;
Re:An old idea.... (Score:1)
At any rate, read for yourself:
http://developer.intel.com/design/pentium/manua
It's compatible with Itanium motherboards (Score:2)
You know, when AMD first brought out the Athlon it was supposed to be compatible with Alpha 21264 boards too.
AMD even made a couple of engineering samples in Slot B packages for testing, but that's as far as it got.
If someone could hack a Slot A/Slot B adapter, they could hypothetically do the same thing. They might have to hack a BIOS update too, though.
Re:Practical Ideas (Score:4, Informative)
They don't? What kind of server do you run? Just about every piece of production-class server software I know of makes use of multiple processes. Look at Apache, forking off five, ten, or even more processes to handle requests. MySQL, I believe, uses threads. PostgreSQL forks off a new backend for each connection. Shoot, even your telnet, ftp, ssh, and mail daemons will fork for each connection, allowing you to take advantage of more than one CPU.
If you're sitting at home working on a spreadsheet, you're right, SMP isn't for you, and this machine isn't targeted at you. When you're running a server that may have tens, hundreds, or thousands of SIMULTANEOUS processes fighting for CPU time, every processor counts.
And, to make things even better, even if you're only running a single non-threaded process, having two processors still makes the machine much more "responsive", as the second CPU can handle kernel code for file I/O, networking, interrupt handling, writing to logs, and a lot of other tasks. Ever seen how much CPU time even syslog can chew up?
steve
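The fork-per-connection pattern described above can be sketched in a few lines. This is a POSIX-only toy, with a made-up handle_request standing in for a real daemon's accept() loop:

```python
# Sketch of the fork-per-request pattern: each "request" is handled in
# its own process, so a multiprocessor kernel can schedule them on
# different CPUs. POSIX-only (os.fork); handle_request is a stand-in.
import os

def handle_request(n):
    """Toy request handler: just compute something."""
    return n * n

def serve(requests):
    """Fork one child per request, collect results over pipes."""
    results = {}
    children = []
    for n in requests:
        r, w = os.pipe()
        pid = os.fork()
        if pid == 0:                       # child: handle one request, exit
            os.close(r)
            os.write(w, str(handle_request(n)).encode())
            os.close(w)
            os._exit(0)
        os.close(w)                        # parent keeps only the read end
        children.append((pid, n, r))
    for pid, n, r in children:
        results[n] = int(os.read(r, 64))
        os.close(r)
        os.waitpid(pid, 0)                 # reap the child
    return results

print(serve([2, 3, 4]))                    # {2: 4, 3: 9, 4: 16}
```

A real server would fork inside an accept() loop rather than up front, but the scheduling point is the same: independent processes are exactly what a second CPU can chew on.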
Re:Practical Ideas (Score:1)
But some programs will benefit. Have you ever run a heavily used web server? They fork off lots of processes. It will benefit greatly from SMP.
Most processor intensive programs have become multithreaded, and the rest can be if SMP becomes popular.
I see too many people asking what the use of something is if none of their existing stuff would benefit from it. This is often because their stuff hasn't had any reason to adapt to this cool new thing, which people then reject because their stuff doesn't benefit from it now. Take the Plan 9 OS, for example. It does the Right Thing with a great networked internal structure, but the GUI stinks. It is not popular, partly because people don't like the UI. But if people used it, the GUI would be improved and we would have all its cool benefits.
Did I read that right? (Score:4, Insightful)
Re:Did I read that right? (Score:1)
Using several caches of varying sizes is actually better than one monolithic cache, for the reasons you described. With multiple caches, the primary one can focus on low latency, the second on high bandwidth, and the third on a high hit rate.
But yes, it is indeed 35 MB of caches, though it's worth noting that the L2 cache is off-die.
Re:Did I read that right? (Score:1)
Re:Did I read that right? (Score:1)
Intel chips, though, keep using about the same overall amount of cache, to keep costs down.
steve
Re:Did I read that right? (Score:2)
Similarly, adding more RAM to a machine _could_ slow it down in some situations, because the "overall" cache hit ratio could go down.
Also, when caches get too large, the cache policy may need to change. A fully associative cache is the most flexible placement policy and can give a great hit ratio for a large working set; however, a fully associative search takes longer than a direct-mapped "search" or a set-associative search.
So, if getting to a large cache size forced them to go set-associative or direct-mapped, that will generally lower the hit ratio versus a fully associative cache of the same size.
It's all tradeoffs, basically. You could write a cache simulator to play around with this.
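That simulator can be tiny. A minimal sketch comparing direct-mapped against fully associative LRU at equal capacity; the 8-block size and the trace are made-up parameters chosen to show a conflict-miss pathology:

```python
# Toy cache simulator: same capacity (8 blocks, one word per block),
# direct-mapped vs. fully associative with LRU replacement.
from collections import OrderedDict

def direct_mapped_hits(trace, nblocks=8):
    """Each address can live in exactly one line: addr mod nblocks."""
    lines = [None] * nblocks
    hits = 0
    for addr in trace:
        idx = addr % nblocks
        if lines[idx] == addr:
            hits += 1
        else:
            lines[idx] = addr              # conflict or compulsory miss
    return hits

def fully_assoc_lru_hits(trace, nblocks=8):
    """Any address can live in any line; evict the least recently used."""
    cache = OrderedDict()
    hits = 0
    for addr in trace:
        if addr in cache:
            hits += 1
            cache.move_to_end(addr)        # mark most recently used
        else:
            if len(cache) == nblocks:
                cache.popitem(last=False)  # evict LRU entry
            cache[addr] = True
    return hits

# Pathological trace for direct-mapped: 0 and 8 map to the same line,
# so they keep evicting each other; the LRU cache holds both easily.
trace = [0, 8] * 20
print(direct_mapped_hits(trace))     # 0
print(fully_assoc_lru_hits(trace))   # 38 (two compulsory misses)
```

Swap in a cyclic trace wider than the capacity and LRU becomes the pathological one, which is the parent's point: the policy/size tradeoff is workload-dependent.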
Re:Did I read that right? (Score:1)
> the L1 hit ratio go down (assuming the inclusion policy is in effect - this is not always so
> anymore -- see the 1st generation Duron chips)
This is wrong: the size of the L2 cache has no effect on the hit rate of the L1 cache(s). Just think about what a hit or miss really is, and then ask yourself why an L2 of 0 MB, 1 MB, or 2 MB would change it.
And, since L1 caches usually work with virtual rather than physical addresses, the L1 hit rate is not even influenced (much) by the amount of RAM.
Re:Did I read that right? (Score:2)
It may be the case that no L1 is implemented this way. It is certainly the case that adding more RAM will decrease the hit ratio of an L2 cache.
Whether (L2 miss penalty x L2 miss frequency) or (page-fault penalty x page-fault frequency) turns out to be greater is probably:
1) workload specific
2) generally in favor of more RAM at the expense of a lower L2 hit ratio (because page-fault servicing is abysmally slow)
3) ...and you can probably construct a pathological case showing either result
Apologies, I'm rusty on this stuff
Re:Did I read that right? (Score:1)
> down.
Yes, *if*, but it can't work that way, since the L2 cache isn't used as directly addressed memory.
The L2 cache is an associative memory, i.e. you use the normal (physical, in rare cases maybe even virtual) addresses to address it. (Otherwise it wouldn't be a cache...)
That means the address space the L1 cache has to handle is not affected by the size of the L2 cache, and so the L2 cache has absolutely no influence on whether an access to the L1 results in a hit or a miss; it doesn't even matter whether there's an L2 cache at all.
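A toy two-level lookup makes the point concrete (the trace, sizes, and LRU policy here are invented for illustration): the L1 hit/miss decision below never consults the L2, so the L1 hit count cannot depend on the L2's size.

```python
# Sketch of the argument above: the L1 lookup never touches the L2,
# so the L1 hit count depends only on the trace and the L1 itself.
from collections import OrderedDict

def lru_lookup(cache, nblocks, addr):
    """Return True on hit; insert with LRU eviction on miss."""
    if addr in cache:
        cache.move_to_end(addr)
        return True
    if len(cache) == nblocks:
        cache.popitem(last=False)
    cache[addr] = True
    return False

def l1_hits(trace, l1_blocks, l2_blocks):
    l1, l2 = OrderedDict(), OrderedDict()
    hits = 0
    for addr in trace:
        if lru_lookup(l1, l1_blocks, addr):
            hits += 1
        else:
            lru_lookup(l2, l2_blocks, addr)  # L2 probed only on an L1 miss
    return hits

trace = [0, 1, 2, 3, 0, 1, 4, 5] * 10
# Same trace, same L1, wildly different L2 sizes: identical L1 hit count.
print(l1_hits(trace, 4, l2_blocks=2))
print(l1_hits(trace, 4, l2_blocks=1024))
```

The two printed numbers are always equal; only the L2 (and miss-service time) changes with L2 size.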
Re:Did I read that right? (Score:2)
Re:Did I read that right? (Score:1)
Re:Did I read that right? (Score:1)
Re:Did I read that right? (Score:2)
They even have one system [ibm.com] with a 128MB L3 cache!
Re:Did I read that right? (Score:2)
Smokin (Score:1, Informative)
Re:Smokin (Score:2, Informative)
Then go and buy something from Sun, IBM, or Compaq; AFAIK all three sell servers with L2 caches that large. (Maybe HP and SGI as well.)
E.g., something from IBM's z900 series (mainframe, up to 32 MB L2, per CPU?) or pSeries 620 (workgroup/midrange server, up to 8 MB L2 per CPU), a Sun Enterprise 450 (workgroup server, up to 8 MB L2 per CPU), a Sun Fire 15K (high-end server, 8 MB L2 per CPU), or Compaq's AlphaServer GS/ES series (up to 8 MB per CPU).
And if you just want 8 MB in total, an SGI Origin 300 with more than 4 CPUs should do it as well (2 MB L2 per CPU).
Re:Smokin (Score:1)
Siroyan's OneDSP (Score:2, Informative)
Re:Siroyan's OneDSP (Score:1)
Official HP presentation at MPF 2001 (Score:2)
available as a PDF from http://www.cpus.hp.com/technical_references/mpf_2
Y.
Two CPUs on a chip. (Score:5, Informative)
Earlier steps in the multi-CPU direction included the 8-way DEC Alpha (killed in the merger with HP?) and a little National Semiconductor product for embedded systems with two very modest CPUs on a chip.
Re:Two CPUs on a chip. (Score:1)
Re:Two CPUs on a chip. (Score:1)
Re:Two CPUs on a chip. (Score:1)
Re:Two CPUs on a chip. (Score:1)
They were VAXen.
Chuck Moore (Score:1)
This is hardly new, but HP's version probably uses some fancy new lithography, and wins when it comes to clock speed.
HP PA-8800 integer numbers (Score:3, Offtopic)
This seems to indicate that there are no separate "do this if predicate is true" and "do this if predicate is false" instructions, so for opposite predication you would have to specify two different predicates.
The processor cannot know that these two predicates are related, so this would give you quite a problem.
As has been publicly disclosed, in IA-64 an instruction reading any resource (such as a predicate) must in general be in a later instruction group (cycle) than the instruction writing that resource. As a special case, branches are allowed to use a predicate written by another instruction in the same instruction group (as shown in the IDF slides).
So the straightforward (but slow) IA-64 schedule for the earlier example:
if (a < 0)
b += a;
else
b -= a;
c += b;
d += b;
would be:
cmp.lt pLT, pNLT = a, 0 ;;
(pLT) add b = b, a
(pNLT) sub b = b, a ;;
add c = c, b
add d = d, b
which takes 5 instructions in 3 cycles. (Note: in IA-64 assembly, ";;" indicates the end of an instruction group, "=" separates the target operand(s) from the source(s), "//" begins a comment, and (pred) specifies the controlling predicate.)
An alternate (faster) IA-64 schedule is as follows:
sub bTmp = b, a
add b = b, a
cmp.lt pLT, pNLT = a, 0 ;;
(pLT) add c = c, b
(pLT) add d = d, b
(pNLT) add c = c, bTmp
(pNLT) add d = d, bTmp
(pNLT) mov b = bTmp
This takes 8 instructions in 2 cycles and one extra register. The final move of bTmp to b can be eliminated if b isn't live out at that point.
Re:HP PA-8800 integer numbers (Score:4, Informative)
Re:HP PA-8800 integer numbers (Score:2)
Re:HP PA-8800 integer numbers (Score:1)
Er... (Score:2, Interesting)
Following a relatively simple strategy, the PA-8800 processor combines two PA-8700 cores on a single chip to enable symmetric multiprocessing (SMP) on a single processor.
It doesn't enable SMP "on a single processor". It provides two processors on a single die. There is a distinction.
AMD, please steal an idea...
The big rumor regarding the third version of Hammer is that it'll be a dual-CPU module. Any guesses as to Hammer's clock speed on release?
299,792,458 m/s... not just a good idea, it's the law!
Re:Er... (Score:1)
Er... I'm afraid it's 299,792,458 km/s...
Sorry, it's the law!
/max
Re:Er... (Score:1)
Check your references again...you're mistaken. In English units it's 186,282 mi/s.
Sorry, it's the law! ;)
Perhaps in some alternate universe... ;-)
299,792,458 m/s... not just a good idea, it's the law!
Re:Er... (Score:1)
My mistake, apologies.
/max
The BIG difference between PA8800 and Power4 (Score:3)
Re:The BIG difference between PA8800 and Power4 (Score:1)
Re:The BIG difference between PA8800 and Power4 (Score:2)
What about Itanium? (Score:2)
Re:What about Itanium? (Score:1)
I thought HP had committed itself to ditching the PA-RISC and moving to Itanic, err, Itanium.
Yes, that is still the story as far as I know. It was the intended direction years ago, but IA-64 (Itanium) has taken longer to develop than originally expected (way longer), and the customer base hasn't shifted to IA-64 yet. Until customers start paying $$ for IA-64, HP will continue to make revenue from its existing PA-RISC customers.
The bottom line is that customers don't really care what the underlying processor is. Customers care that their legacy applications continue to run on new machines, that they run with good performance, and that they run reliably. The processor type is just one piece of the total compute environment. HP's motivation is to move to a higher-volume microprocessor (IA-64). Current PA-RISC volumes are relatively low, and HP has to factor in all the R&D expense that goes into a low-volume processor. The more complex processors get, the more R&D expense goes into designing the chips and into the fabs that build them.
I'm still curious what the effect of AMD's Sledgehammer will be for customers. For those using IA-32, Sledgehammer (64-bit-enhanced IA-32) takes less porting effort than IA-64. I have yet to see any reports of a deployed IA-64 application in the real world; everything so far has been lab results and marketing blurbs.
Re:What about Itanium? (Score:2)
You couldn't be more wrong. Enterprise customers often have technical support staff as versed in the tech as the vendor. They have to be - they are making budget and platform decisions that have huge ripple effects in their organizations.
The viability of the platform will figure keenly in the minds of anyone looking at further extensions of the PA line. Seeing as SuperDome has been a dud, you can presume that the SuperDome successor will thud even louder.
Re:What about Itanium? (Score:2, Insightful)
Re:What about Itanium? (Score:1)
"That's what they said about domain OS" doesn't convey much in this context. HP's "official line" is that they're on board with IA-64, but if you talk to the people in the "back room" you get a much less confident outlook.
Re:What about Itanium? (Score:2)
In this case, they're saying semi-officially that "...PA-RISC is here for the foreseeable future, so relax already...".
In the past, when they bought Apollo, they officially said "Domain/OS is here to stay, we won't be cramming HP-UX down your throat," and then Domain/OS went away.
It's very clear to me that the marketing people are trying to have it both ways. One camp wants to ditch PA-RISC and do Itanium everywhere; another camp, which has happy PA-RISC customers, wants to act like Itanium ain't happening. Corporate America, fscking the customer. Go figure.
Re:What about Itanium? (Score:1)
Re:What about Itanium? (Score:2)
As it stands, the market for this chip just doesn't justify production. SuperDome has been a failure (although HP is loath to admit it), and the market for HP-UX is dwindling rapidly.
The Alpha offers the concise history: a great core, but no marketing vision. It's just a product, guys; if no one wants it, it isn't worth the billions to produce.
Re:What about Itanium? (Score:2)
Care to explain why? I (well, the VBC I work for) have one, and it does its work fine (_way_ faster than we expected).
How does it compare to the MAJC spec? (Score:1)
I guess the biggest difference would be that the HP chip is actually going to be built, while the MAJC chip seems to still just be a design.
It is interesting that a number of designs lately seem to be looking to the integration of multiple CPU cores on a single chip to increase performance in server applications.
zor_prime
Old News (Score:1)
EEtimes Story [eetimes.com]
Everyone in the high-performance CPU market (except Itanic) is pursuing either this or multiple concurrent thread contexts to boost overall system throughput.
AMD won't be using this anytime soon... (Score:1)
Re:AMD won't be using this anytime soon... (Score:1, Informative)
"Single-transistor SRAM"? (Score:2)
Does anyone know how on earth they're managing this? Or is this just some low-leakage variant of DRAM with added marketing spin?
Re:"Single-transistor SRAM"? (Score:2)
Re:"Single-transistor SRAM"? (Score:2)
I checked the article and Mosys's site, but the only detailed descriptions of the technology are behind registration scripts.
Could you give a brief description of the operating principles of single-transistor SRAM, or should I just bite the bullet and register on Mosys's site?
imagine... (Score:2, Funny)
AMD onchip SMP? Imagine pulling the heat sink off! (Score:2, Funny)
Not the best way to go (Score:3, Interesting)
Re:Not the best way to go (Score:1)
Are they really competing technologies? I can't think of why SMT CPUs couldn't be used in SMP systems. SMP is a way of adding more CPUs; SMT is a way to keep each CPU busy more of the time. They sound complementary to me.
steve
Re:Not the best way to go (Score:2)
Compiler shouldn't be more difficult. (Score:2)
Actually, I think they're doing it because it means they don't have to design a new processor core.
As far as each thread executing on an SMT chip is concerned, it's running on a single-threaded processor. The same scheduling optimizations that benefit code on a single-threaded system will benefit code running under SMT alongside other threads. SMT actually makes this job a bit easier by reducing the effective latency of instructions: if neither thread is stalled, each thread executes every other clock, making a 10-cycle-latency instruction look like a 5-cycle-latency instruction, which in turn makes each thread less _likely_ to stall. Nice feedback loop here.
The only extra complexity would be in the operating system's scheduling and context switching routines, and that wouldn't be much more complicated than on a multiprocessor system.
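The effective-latency claim above is just arithmetic. A back-of-the-envelope sketch, where strict round-robin issue and no stalls are simplifying assumptions (real SMT designs issue more opportunistically):

```python
# Toy model of the effective-latency claim: with N threads issuing on
# alternate cycles (strict round-robin, no stalls), an instruction whose
# result is ready after `latency_cycles` only costs its own thread
# latency/N of its issue slots.

def effective_latency(latency_cycles, nthreads):
    """Issue slots the owning thread waits out before the result is usable."""
    return latency_cycles / nthreads

print(effective_latency(10, 1))  # 10.0: alone, a thread eats the full latency
print(effective_latency(10, 2))  # 5.0: interleaved, it loses only half the slots
```

The model breaks down as soon as one thread stalls (the other then gets consecutive slots), but that case only helps throughput further, which is the point of SMT.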
35 MB cache??? (Score:1)
Re:35 MB cache??? (Score:1)
Yaaaaaaawnnn.. (Score:1)
HP's announcement is nothing compared to what IBM has in development.
hppa cpu's are cool! (Score:1)
HP workstations certainly seem to be very solid and nifty and they have a lot of potential for linux boxes. Assembly programmers will appreciate all of the registers that are available.
Re:Wait a minute! (Score:1)
It's the same idea as Via combining the north and south bridges in some of their motherboard chipsets. They take the two cores and put them on a single die, with a bus (still on the same die) between them.
The idea really isn't revolutionary. Ever since microprocessors were invented, the trend has been to pack more and more onto a single chip, as it reduces cost and design complexity while increasing compatibility and (most importantly) bandwidth. While your fastest P4 front-side bus chugs along at 400 MHz, buses kept on the die can run at full core frequency, even in the gigahertz range. Plus, you can run a lot more of them, and since the distances covered are shorter, it's easier to avoid external RF interference. And in multiprocessing computers, the connectivity between cores is vitally important.
Look at a lot of motherboard chipsets these days. In one or two chips, they'll have circuitry for video, audio, modem, network, IDE, floppy, serial, USB, PCI, and memory controllers, to name just a few. One of the long-term goals some companies have been talking about is the "SOC", or "System On a Chip", where a single chip has everything you need for a computer. At the point where the CPU contains all the other controllers, not only could performance increase dramatically, you could potentially use one motherboard for any CPU you wanted, since all the motherboard would do is provide power to the CPU and traces from the CPU to the connectors for external components.
steve
Re:Wait a minute! (Score:1)
Re:When do you think they will have Transparent SM (Score:1)
Re:When do you think they will have Transparent SM (Score:1)
1) It is actually possible to get better-than-linear improvement under certain conditions (for example, if something is already in a shared cache because it was fetched by the other CPU).
2) It is possible to have each CPU schedule itself based on the contents of RAM.
Yes, there is overhead in having two CPUs, but it varies greatly depending on the OS and workload.