Intel and HP Commit $10 Billion to Boost Itanium
YesSir writes "Support for the high-end processor that has had difficulties catching on is coming in from its co-developers Intel and HP. 'The $10 billion investment is a statement that we want to accelerate as a unified body,' said Tom Kilroy, general manager of Intel's digital enterprise group."
Intel just removed 32-bit hardware support (Score:4, Informative)
Anyone want to tie this into their $10 billion push?
Re:AMD64 (Score:5, Informative)
There is a market for Itanic in some traditional supercomputing applications, but it is a relatively small market and has never been a big growth market. I really doubt Intel and HP will ever recover the billions they've already sunk into Itanic, let alone another $10 billion.
I imagine the people at AMD are dancing in the streets at this news, because Intel and HP are going to keep throwing even more good money after bad trying to salvage this dog. It's money that they won't be investing in R&D in markets that really matter.
AMD can continue their push to dominate servers, workstations, and desktops. If they could crack laptops, phones, and embedded apps, Intel would be in serious trouble.
Re:AMD64 (Score:4, Informative)
If you have ever put together a computer that has a bad part, you know it's sometimes really hard to figure out what caused the problem. The systems Itaniums usually go into have the error detection and error logging to pinpoint exactly where problems lie. This is the reason Oracle DBs run on these types of processors. It doesn't make sense for the common user to use Itanium, but companies like Amazon and Visa want these systems more for the reliability features than for the speed.
I hope this works. (Score:4, Informative)
Its other caveats, for example poor compiler support, are issues that need to be considered carefully. I'd like to specifically address the poor compiler support. I am not concerned about this issue, for the following reasons:
1. Compilers can improve easily, and those improvements are delivered with a recompile. If the architecture achieves critical mass, then more people and organizations will justify the time and effort to improve its compilers. Not only can they improve, but taking advantage of such improvements does not require replacing hardware, which makes it an issue of time.
2. The architecture is much more realistic about the guarantees that it's willing to make as a processor. One of the early complaints was that the initial generation of compilers for IA64 would generate, on average, 40% NOPs. It's important to consider a few details regarding that statement.
A. First, each clock cycle could allow the execution of up to 3 concurrent operations, so even at 40% NOPs a 3-slot bundle still averages about 3 × 0.6 = 1.8 useful operations.
B. Second, the architecture does not insert extra NOPs transparently into the pipeline, as almost all modern processors do in the event of a pipeline data hazard. This fact can be viewed in different ways.
i. Most modern processors have to evaluate whether to insert a pipeline stall every single time an instruction is executed. This is, essentially, wasted work, because such a computation could be done by the assembler; however, it does spare the processor the burden of loading useless NOPs into the pipeline and the cache. On the other hand, minimizing the logic a processor has to complete per cycle generally decreases the minimum time necessary per clock (meaning it could scale to higher clock speeds). See the sketch after this list for a concrete case.
ii. The immediate question is: does reading all these NOPs out of memory cause a bigger performance hit than making the processor calculate the data hazards? Personally, I don't know. But let's consider the idea for a moment. On both processors, assume the instruction cache is fast enough to deliver data without wait states, provided the cache has the data. When your processor is prefetching well, the NOPs shouldn't be a big problem. (Except that the NOPs will now be in the binary, making binaries larger. I consider this a moot point given the low cost of modern storage.) When your prefetcher can't anticipate correctly, though, I think the IA64 loses. Both IA64 and other modern architectures have branch predictors, so I suspect unanticipated branches that cause a pipeline flush (unavoidable) and unanticipated cache fills (unavoidable) will be mitigated roughly equally, but because the IA64 has longer, less dense instructions, the IA64 will stall longer. Btw, I'm ignoring data stalls, to simplify my argument and because I don't think the IA64's architectural differences will significantly affect them. I'd enjoy being corrected on this point.
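To make the hazard concrete, here's a minimal sketch in C; the scheduling commentary is my own hedged reading of how an in-order interlocked core versus an EPIC-style machine would treat it, not vendor documentation:

    /* A load-use dependence: 'a' is needed immediately after it is
     * loaded. A hardware-interlocked in-order core stalls the pipeline
     * by itself until the load completes; an explicitly scheduled
     * (EPIC/VLIW-style) machine relies on the compiler to fill the dead
     * issue slots with independent work, or with NOPs that sit in the
     * binary and the instruction cache. */
    int sum_pair(const int *p, int unrelated)
    {
        int a = p[0];          /* load                                  */
        int b = a + 1;         /* uses 'a' at once: the hazard slot     */
        int c = unrelated * 2; /* independent work a good compiler can  */
                               /* hoist into that slot instead of a NOP */
        return b + c;
    }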
The IA64 includes a predicate register, which stores the results of comparison instructions. Instructions in an IA64 'bundle' can be qualified to execute conditionally, based on the state of a certain bit in the predicate register. This allows the IA64 to avoid some branches: the compiler/assembler can pack a bundle that includes the appropriate two instructions, each qualified to execute for a different state of the predicate register. Essentially, the processor is simultaneously issued the commands for both paths, and only the instructions whose predicate holds actually take effect.
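As a hedged illustration in plain C (the IA-64 mapping lives only in the comments, and is my paraphrase of the mechanism described above), here is the kind of small branch that predication removes:

    /* Branchy version: the branch predictor must guess a direction,
     * and a wrong guess costs a pipeline flush. */
    int abs_branchy(int x)
    {
        if (x < 0)
            return -x;
        return x;
    }

    /* Predicated idea: compute both candidate results, then select.
     * On IA-64 the compiler can emit one compare that sets two
     * complementary predicate bits and qualify each instruction with
     * one of them; both arms sit in the instruction stream, and only
     * the instructions whose predicate is true take effect. */
    int abs_predicated(int x)
    {
        int neg = -x;               /* both values are computed...    */
        return (x < 0) ? neg : x;   /* ...and one is selected, which  */
                                    /* compilers often lower to a     */
                                    /* conditional move/predicated op */
    }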
Re:Itanium vs. Ultrasparc T1 (Score:2, Informative)
"Sales of IBM's Unix systems, called the pSeries, grew 15% in the first quarter and 36% in the second quarter--far outpacing Sun and HP. The trend should continue in the fourth quarter--historically, industrywide Unix sales have spiked 25% during this period--and into 2006, when IBM introduces a new high-end chip called Power5+."
Re:Itanium vs. Ultrasparc T1 (Score:3, Informative)
Yes, you're right about this. The T1 can only run a single thread of floating-point ops at a time. This is why it's being marketed to the web/app server market, which doesn't do many FLOPs. Sun is working on a new chip, code-named Rock, which will address these issues. If I remember correctly, Rock will support 8 floating-point threads at a time. It will also have some really awesome I/O lookahead features that allow a special 'thread' to read thousands of instructions ahead and look for I/O that can be started early. What the T1 is going to do to the web server market, Rock will do to the high-end number-crunching market.
Re:I hope this works. (Score:3, Informative)
Agreed. The point I was trying to make was that realizing the benefits of compiler improvements requires updating your software, not replacing the processor. Obviously, recompiling the same software with an unchanged compiler isn't going to be an advantage.
Point B: huh? Are you mixing up RISC and VLIW (EPIC) designs?
No, I'm not mixing them up. I was trying to compare their merits.
Essentially, I tried to reason a guess to the following question.
What would be the effect of removing the data-hazard protection from the chip and relying on the compiler to insert explicit NOPs? I surmise that an unpredicted branch will hurt more on IA64.
Then I babble on for too long about different features. Sorry.
Look at Itanium's performance on data-dependent branches; it is underwhelming...
This is unfortunate; do you know what is limiting the chip here?
Itanium greatly (like: insanely) benefits from repeated compile-execute-profile iterations of the benchmark.
Where, generally, does the compile-execute-profile work improve things? Does it use the profiling output to hint the processor's branch predictor?
Patterson and Hennessy, Computer Organization and Design - on the shelf, well worn and well read.
I'll check out the others at the library.
Re:I hope this works. (Score:2, Informative)
Ah... but you see, this is the problem: improving compiler technology is extremely hard. Of course, the big hope with VLIW and EPIC architectures was that compiler technology would improve by some huge factor. This hasn't really panned out. Most code that we run is highly data-dependent and branches way too frequently to parallelize anything. This is the same reason chips are moving to multiple cores now. It's hard to eke out that extra 3% of single-thread performance now - in the chip or in the compiler.
From your original post...
Most modern processors have to evaluate whether to insert a pipeline stall every single time an instruction is executed. This is, essentially, wasted work, because such a computation could be done by the assembler; however, it does spare the processor the burden of loading useless NOPs into the pipeline and the cache
Uh, this doesn't make any sense. Inserting NOPs for data dependencies/cache misses/etc. doesn't "burden" processors. The only burden is if you happen to load your instruction stream with a ton of useless NOPs. Now, I don't know IA-64 well, but somehow I doubt they removed all data-dependency stalls - the instruction code explosion would be amazing. Your binaries would be huge.
Look at Itanium's performance on data-dependent branches; it is underwhelming... This is unfortunate; do you know what is limiting the chip here?
Data-dependent branches - what holds the chip back is that it's a serial stream of instructions. You can't parallelize code at all if each instruction depends on the one before it.
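A hedged sketch of what "serial" means here - the classic pointer chase, where issue width buys nothing because each load's address comes out of the previous load:

    /* Pointer chasing: every iteration's load depends on the previous
     * one, so a 3-wide (or 6-wide) issue machine still walks the list
     * one hop at a time. Wide issue and clever scheduling can't
     * manufacture parallelism the algorithm doesn't have. */
    struct node { struct node *next; int value; };

    int list_sum(const struct node *n)
    {
        int total = 0;
        while (n != NULL) {
            total += n->value;   /* depends on the load of 'n'        */
            n = n->next;         /* next address depends on this load */
        }
        return total;
    }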
Where, generally, does the compile-execute-profile work improve things? Does it use the profiling output to hint the processor's branch predictor?
No, you feed the profiling information back to the compiler, which will use loop counts and branch results to unroll certain loops, spend more time software-pipelining heavily used loops, and move basic blocks around to reduce branching and increase block sizes. Then you'll get faster code. Of course, it's not unheard of for Intel or AMD to make specific compiler optimizations to speed up SPEC. And when I say specific, I mean very specific: if the compiler sees a block of code unique to SPEC, it emits the nice hand-optimized assembly. :P
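A minimal sketch of that feedback loop, assuming GCC's profile-guided-optimization flags (-fprofile-generate and -fprofile-use are real options; the toy function is just an illustrative stand-in):

    /* hot_loop.c - a biased branch that profile feedback exposes,
     * letting the compiler lay out the common path straight-line and
     * decide whether unrolling or software pipelining pays off.
     *
     * Assumed workflow:
     *   gcc -O2 -fprofile-generate hot_loop.c -o hot   # instrumented build
     *   ./hot                                          # run: writes profile data
     *   gcc -O2 -fprofile-use hot_loop.c -o hot        # recompile using the profile
     */
    int count_positive(const int *a, int n)
    {
        int count = 0;
        for (int i = 0; i < n; i++) {
            if (a[i] > 0)   /* the profile records how often this is taken */
                count++;
        }
        return count;
    }

    int main(void)
    {
        int a[4] = { 1, -2, 3, 4 };
        return count_positive(a, 4);
    }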