Introducing the PowerPC SIMD unit
An anonymous reader writes "AltiVec? Velocity Engine? VMX? If you've only been casually following PowerPC development, you might be confused by the various guises of this vector processing SIMD technology. This article covers the basics on what AltiVec is, what it does -- and how it stacks up against its competition."
The article makes a very good point... (Score:4, Interesting)
Re:The article makes a very good point... (Score:1, Interesting)
Oh? What's your source for that? Is that a problem with the processor or with the optimiser?
Re:The article makes a very good point... (Score:1, Funny)
Re:The article makes a very good point... (Score:5, Informative)
What does this even mean? I've written a great deal of optimized SSE code, and I can promise you that it works just as well on AMD. In fact, if you look at Athlon's pipeline, it does some really amazing things rescheduling and executing operations out-of-order. Fiddling around with ordering individual instructions is basically pointless because the scheduler has gotten so good at doing it on-the-fly.
Can you cite a specific example, because I've never run into this.
What it means... (Score:1, Flamebait)
We get on average one of these per month posted here to slashdot as news.
Nothing to see here. Move along please.
Re:What it means... (Score:2)
Re:What it means... (Score:2)
Tell me, are they recruiting at IBM? Only I need a new job and I'm willing to sell my soul for the right price.
Re:What it means... (Score:1)
Re:The article makes a very good point... (Score:3, Insightful)
Re:The article makes a very good point... (Score:2)
Re:The article makes a very good point... (Score:3, Interesting)
I'm sure well-written SSE code would run faster on both platforms, at least in cases where vectorization makes sense. Intel implemented the weak FPU unit in the P4 to try to steer users onto SSE code.
Re:The article makes a very good point... (Score:2)
SSE has scalar instructions too. Since the SSE registers are in a flat register file (cf. the stack in the legacy FPU descended from the 287/387), it's actually easier for a compiler to generate efficient code for SSE. SSE only does single precision though, so for double precision you need SSE2/SSE3. Incidentally, if you look at the Intel manuals you will see that SSE3 is only a minor extension.
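For the curious, here's roughly what "scalar SSE" looks like written by hand with intrinsics -- just a sketch to show that the _ss forms work on a flat xmm register file rather than the x87 stack (compilers can also do this for you automatically, e.g. gcc with -msse -mfpmath=sse):

/* Illustrative sketch only: scalar single-precision math via SSE intrinsics.
 * The _ss forms operate on the low lane of an xmm register, so the compiler
 * deals with a flat register file instead of the x87 stack. */
#include <xmmintrin.h>

float madd_scalar_sse(float a, float b, float c)
{
    __m128 va = _mm_set_ss(a);                       /* a in the low lane */
    __m128 vb = _mm_set_ss(b);
    __m128 vc = _mm_set_ss(c);
    __m128 r  = _mm_add_ss(_mm_mul_ss(va, vb), vc);  /* a*b + c, scalar */
    float out;
    _mm_store_ss(&out, r);                           /* write low lane back */
    return out;
}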
Re:The article makes a very good point... (Score:2)
Re:The article makes a very good point... (Score:3, Interesting)
Hold on there, partner. This isn't AltiVec stuff! (Score:5, Informative)
The problems you're talking about are not AltiVec's fault, and the AltiVec instruction set is still stable. Code will still run very quickly even if you don't optimize for the G5. But let me bring up a quote from one of those linked papers:
See, the problem you're complaining about is a problem with any port to the G5, or really any port from a slow-thin-memory-access system to a fast-wide-memory-access system. It has nothing to do with your AltiVec code. It just has to do with tuning for a larger L2 cache and faster FSB rather than a slow FSB and a huge L3 cache. So let's not blame AltiVec for this. Except for a brief change in the 745X G4, it seems like the AltiVec implementation has been stable for quite a while.
Re:The article makes a very good point... (Score:2)
In fact, the 1.67GHz beats the 2GHz G5 and isn't far behind the 2.5GHz G5.
Altivec and OS X (Score:5, Insightful)
Re:Altivec and OS X (Score:1, Insightful)
Best guess no - you have to weigh up the immediate gains against the overhead of saving/reloading the SIMD registers when you enter/leave the TCP stack. Usually OS kernels avoid SIMD/FPU for this reason.
(Uh, why was that modded troll?)
Re:Altivec and OS X (Score:4, Interesting)
Re:Altivec and OS X (Score:2, Interesting)
I can't remember if this was implemented in the end...
Re:Altivec and OS X (Score:3, Interesting)
Re:Altivec and OS X (Score:5, Informative)
I don't know how much of OS X has AltiVec code, but there are many other apple apps that use it. iTunes uses it for encoding music. I'm sure the video codecs in Quicktime use it as well.
The Mac has a really nice optimization tool called shark [apple.com] which will help you find things that can be put into the AltiVec processor (it also helps with general optimization).
Re:Altivec and OS X (Score:2)
Re:Altivec and OS X (Score:5, Informative)
The reason the article mentions the checksum case is not because Apple is missing the boat, but because there was a nice research article written about writing optimized TCP checksum code for Altivec, providing a good set of example code for aspiring Altivec coders.
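Not the paper's code, but here's a rough sketch of the basic trick for anyone curious: vec_msum against a vector of ones accumulates the 16-bit words into four 32-bit lanes, and you fold the carries at the end. Assumes 16-byte-aligned data, a multiple-of-16 length, and big-endian PPC; real code needs head/tail handling.

/* Rough sketch only (not the code from the paper): Internet checksum
 * inner loop with AltiVec.  Assumes buf is 16-byte aligned, len is a
 * multiple of 16, and the buffer is short enough that the 32-bit lane
 * accumulators can't overflow (fine for packet-sized data). */
#include <altivec.h>
#include <stdint.h>
#include <stddef.h>

uint16_t cksum_altivec_sketch(const uint8_t *buf, size_t len)
{
    vector unsigned int   acc  = vec_splat_u32(0);
    vector unsigned short ones = vec_splat_u16(1);

    for (size_t i = 0; i < len; i += 16) {
        vector unsigned short words =
            (vector unsigned short) vec_ld(i, buf);
        /* multiply each 16-bit word by 1 and add into 4 x 32-bit sums */
        acc = vec_msum(words, ones, acc);
    }

    /* spill the four partial sums and fold the ones'-complement carries */
    uint32_t partial[4] __attribute__((aligned(16)));
    vec_st(acc, 0, partial);

    uint64_t sum = (uint64_t)partial[0] + partial[1] + partial[2] + partial[3];
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}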
Re:Altivec and OS X (Score:2)
Also, AFAIK most kernel code tries hard not to use the FPU or any vector coprocessor. In Linux, RAID is supposedly MMX/SSE-accelerated and is careful not to clobber anything, but most other subsystems aren't.
Re:Altivec and OS X (Score:2)
Re:Altivec and OS X (Score:3, Interesting)
Re:Altivec and OS X (Score:5, Informative)
No, at this point too much needs hand tuning for everything to fully utilize the potential of Altivec. Most serious DSP-class apps spend the effort to do this in critical code, but there's plenty of compiled code running in OSX that doesn't benefit from the parallel vectorization that the Altivec unit can offer.
This is all about to change with GCC 4 which offers an SSA [gnu.org] tree optimizer. The SSA form is particularly useful for doing automatic vectorization of code. I'm not sure what the efficiency will be like in the first release but it looks like good things are coming.
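As a rough example of the kind of loop the new tree vectorizer is aimed at (flag spellings from memory -- check the GCC 4 docs):

/* A loop simple enough for GCC 4's tree-SSA vectorizer to handle.
 * Something like:  gcc-4.0 -std=c99 -O2 -maltivec -ftree-vectorize saxpy.c */
void saxpy(float *restrict y, const float *restrict x, float a, int n)
{
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];   /* unit stride, no loop-carried dependence */
}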
Re:Altivec and OS X (Score:3, Informative)
Of course it is rarely true that AltiVec instructions are used to their "full potential" in the sense you can usually find another CPU cycle to eliminate, but neither is it necessary to use hand tuning to get big boosts from AltiVec. We do the hand tuning for you (in C with AltiVec extensions or in assembly language) and p
Re:Altivec and OS X (Score:2)
Do I get these with a generic compile of an XCode project (curious)?
Re:Altivec and OS X (Score:2)
Yes, in things like memcpy, you will get AltiVec instructions with just default switches. You could single-step through memcpy (actually a subroutine named __bigcopy) in the debugger and see the instructions.
The compiler isn't going to automatically recognize you're doing an FFT routine and call an optimized routine instead of using your code. So, to use the optimized signal processing routines, you would add a reference to the Accelerate framework.
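As a sketch of what that looks like in practice (based on Apple's vDSP documentation -- double-check the setup and packing details before relying on it), an FFT call goes something like this:

/* Sketch of calling the vDSP FFT from the Accelerate framework.
 * Build with:  cc fft.c -framework Accelerate */
#include <Accelerate/Accelerate.h>

void forward_fft_1024(float *signal /* 1024 real samples */,
                      float *re, float *im /* 512 outputs each */)
{
    const vDSP_Length log2n = 10;                    /* 2^10 = 1024 points */
    FFTSetup setup = vDSP_create_fftsetup(log2n, kFFTRadix2);

    DSPSplitComplex split = { re, im };
    /* pack the real input into the split even/odd form fft_zrip expects */
    vDSP_ctoz((const DSPComplex *)signal, 2, &split, 1, 1 << (log2n - 1));

    /* in-place real-to-complex forward FFT */
    vDSP_fft_zrip(setup, &split, 1, log2n, FFT_FORWARD);

    vDSP_destroy_fftsetup(setup);
}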
Re:Altivec and OS X (Score:2)
Excellent. Kudos to your team.
Re:Altivec and OS X (Score:2)
I switched from a 450MHz G3 to a 450MHz G4 a couple of years ago, and there's a HUGE performance difference in OS X's boot and response time.
Of course, the GUI runs a bit faster on the G4, too, but that could be because of the AGP video card.
AltiVec is nice... (Score:5, Informative)
Re:AltiVec is nice... (Score:5, Informative)
The other nice thing about AltiVec on OS X is that Apple has done a fairly good job of making it accessible without forcing the programmer to learn and use assembly language. These libraries will automatically fall back to a scalar code path if they're running on a G3, so they save you from a fair bit of work there too. Apple has included a number of optimized AltiVec libraries that are ready to go "out of the box" with Xcode, including vDSP (signal processing), vImage (image processing), and vecLib (BLAS/LAPACK and vector math).
Apple has documentation and source code for the libraries on their Developer Connection website [apple.com]. What good are vector units if nobody can make use of them? I can't wait for Apple to put the GPU's image-processing abilities into my hands with CoreImage/Video.
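If you do roll your own AltiVec path instead of (or in addition to) using the libraries, the usual way to handle the G3 fallback yourself is a runtime sysctl check -- a sketch, with the key name from memory, so verify it:

/* Sketch of the usual runtime check for AltiVec on Mac OS X, so a scalar
 * path can be used on a G3.  (sysctl key name from memory -- verify.) */
#include <sys/types.h>
#include <sys/sysctl.h>

int has_altivec(void)
{
    int result = 0;
    size_t size = sizeof(result);
    if (sysctlbyname("hw.optional.altivec", &result, &size, NULL, 0) != 0)
        return 0;   /* key missing => assume no AltiVec */
    return result != 0;
}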
Re:AltiVec is nice... (Score:1)
Re:AltiVec is nice... (Score:3, Informative)
One of my favorite ArsTechnica articles (Score:3, Informative)
Re:One of my favorite ArsTechnica articles (Score:4, Interesting)
The VUs have the sweetest SIMD instruction set I've seen: 32 registers (like AltiVec), but you can do component swizzling within an instruction; it has MADD and also a sweet accumulator register that can be written on successive cycles (throughput is worse if you accumulate results in a normal vector register, like you have to on all other SIMDs). So you can do a 4x4 matrix/vector multiply in just 4 instructions!
The big problem was that you didn't get any of the nice instruction scheduling/re-ordering that you get on PPC or x86 platforms, so the onus was on the programmer to NOP through latency issues (huge pain!)... They finally came out with the VCL that would process chunks of VU assembly and reschedule everything at compile time.
The really sad thing is that Sony/IBM/Toshiba opted for AltiVec in the Cell. I guess it probably has better tools and IBM is highly leveraged into VMX, but VU was very, very clever considering that it pre-dates all these other SIMD instruction sets.
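For comparison, here's roughly what that 4x4 matrix/vector multiply costs on AltiVec/VMX: four vec_madds plus four splats, since there's no per-operand swizzle. A sketch only, with a column-major matrix assumed:

/* Rough AltiVec equivalent of the VU's 4-instruction matrix/vector
 * multiply: four madds, but each needs an explicit splat because VMX
 * has no per-operand swizzle.  Column-major 4x4 matrix assumed. */
#include <altivec.h>

vector float mat4_mul_vec4(const vector float col[4], vector float v)
{
    vector float r = (vector float) vec_splat_u32(0);   /* 0.0f in all lanes */
    r = vec_madd(col[0], vec_splat(v, 0), r);   /* + col0 * v.x */
    r = vec_madd(col[1], vec_splat(v, 1), r);   /* + col1 * v.y */
    r = vec_madd(col[2], vec_splat(v, 2), r);   /* + col2 * v.z */
    r = vec_madd(col[3], vec_splat(v, 3), r);   /* + col3 * v.w */
    return r;
}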
Re:One of my favorite ArsTechnica articles (Score:4, Informative)
simdtech.org (Score:5, Informative)
API matters (Score:4, Insightful)
Anyway, what we need is not an autovectorizing compiler, but a library with the most CPU-hungry algorithms well implemented with SIMD extensions.
What about an open library, cross-platform, multimedia oriented, along the lines of Sun's mediaLib [sun.com]? Would Sun allow re-use of their API?
I'm looking for such a library, with a GPL/LGPL-compatible license. The API has to be in C, to maximise the audience. For many projects, C++ is not an option.
The primary use would be DSP work in the GNU Radio project [gnu.org], but multimedia extensions could prove useful anywhere from GUIs to audio/video apps, etc.
I would take any pointers to such an already existing API/project, or I'm ready to start a new one if other people are interested.
See also this previous story [slashdot.org] for cheap recycled comments.
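Just to make the idea concrete, here's a purely hypothetical sketch of what the C API of such a library might look like -- none of these names exist anywhere, they're only meant to show the shape: plain C, plain pointers, and the AltiVec/SSE/scalar dispatch hidden inside the implementation.

/* Hypothetical header sketch -- illustrative names only. */
#include <stddef.h>

/* y[i] = a*x[i] + y[i]; implementation picks AltiVec/SSE/scalar at runtime */
void vlib_saxpy_f32(float *y, const float *x, float a, size_t n);

/* in-place complex FFT, power-of-two length, inverse flag */
int  vlib_fft_f32(float *re, float *im, size_t log2n, int inverse);

/* 8-bit RGBA -> grayscale conversion, the kind of thing GUIs and video want */
void vlib_rgba_to_gray_u8(unsigned char *dst, const unsigned char *src,
                          size_t npixels);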
Re:API matters (Score:3, Informative)
tradeoffs (Score:1, Flamebait)
-- How much work do I need to do in order to take advantage of it? Some BLAS implementations may support it and some Fortran 95 compilers may generate code for it for some primitives, but other than that, it's a lot of manual work to tune code for it. (My own experience with using the AltiVec instructions can only be described as "painful", among other things because the C interface to them is poorly defined and causes name conflicts.)
-- Wha
Re:tradeoffs (Score:3, Informative)
Don't forget the small Apple desktop in a fancy case. [apple.com]
Does it matter? (Score:2, Insightful)
Unless and until I can go down to Fry's and buy a motherboard based on this chip and put it into a standard case, it really doesn't matter whether the CPU is better or not. It is the system as a whole that matters, not the relative performance of one of its components. I'm not going to paint myself into a corner with a proprietary system from anyone, let alone Apple.
Re:Does it matter? (Score:1, Interesting)
Re:Does it matter? (Score:2)
The strength of the PC has never been that it is always the best and the fastest. At various times other platforms have held that crown. The strength of the PC as a platform is that it is based upon interchangeable parts from multiple vendors. This ensures that at any given time you will always get the best bang for your buck. It also allows you to completely customize your system and pick and choose the precise parts you want.
Re:Does it matter? (Score:2)
> You can just buy a $499 mini and get done with it.
Uhm, did you notice the word **motherboard**?
And how about a **standard** case?
The grandparent is right - it's nominally open, but not really in practice.
Buy a BladeCenter loaded with IBM's PowerPC blades now and pray to dear God that you'll have anyone but IBM and their partners giving you quotations for a maintenance agreement in 2006.
"makes you think twice who you invite to your house"
Re:Does it matter? (Score:2)
Re:Does it matter? (Score:5, Insightful)
CHRP, the PowerPC Common Hardware Reference Platform, is what you're looking for, and it's been around since the mid-'90s. AFAIK most, if not all, of the PowerPC-based workstations shipped by IBM, the BeBox, various third-party PowerPC machines such as those from Power Computing, and many of Apple's machines (even today) are either compliant or as-close-to-compliant-as-makes-sense with this standard or its evolutions (such that some fanatics were able to get Rhapsody/OS X running on AIX PowerPC workstations).
CHRP Links [firmworks.com]
I'm not going to paint myself into a corner with a proprietary system from anyone, let alone Apple.
Until I can make the computer from sand, copper ore, and crude oil using recipes downloaded from the internet (i.e. "The Diamond Age"), I don't see the useful distinction between being able to build a computer out of proprietary chips from one of, count them, two CPU manufacturers, a video card from one of, count them, two graphics card manufacturers, etc. and simply buying a computer that works.
Re:Does it matter? (Score:1)
that's a slightly oxymoronic way of expressing it, isn't it? There are descendants of CHRP that exist, such as the Power Mac. Last I checked
Re:Does it matter? (Score:1, Informative)
http://www.walibe.com/modules.php?name=News&file=article&sid=16 [walibe.com]
http://slashdot.org/article.pl?sid=03/09/22/239215&tid=137&tid=138 [slashdot.org]
the AmigaOne boards are either G3 or G4. no G5.
Re:Does it matter? (Score:1)
And what difference does it make if it's an already complete system rather than something in a "generic" case? If you program for AltiVec-enabled processors, it's probably going to be running on a Mac anyway. It's highly unlikely that any code you write will be running on IBM hardware, as that's even less common and much more expensive than any AltiVec-equipped Mac (G4/G5).
Ignoring AltiVec isn't a very good idea either.
Re:Does it matter? (Score:2)
AltiVec instructions (Score:1, Informative)
portable vectorization (Score:5, Interesting)
There has been discussion about implementing a vectorization syntax, so that we can have portable vector code which approaches the speed of hand-coded vectorization. Here is something from the list.
What is a vectorized expression? Basically, a loop that does not specify any order of execution. If no order is specified, the compiler can of course choose any order that is efficient, or maybe even distribute the code and execute it in parallel.
Here are some examples.
Adding a scalar to a vector.
[i in 0..l](a[i]+=0.5)
Finding size of a vector.
size=sqrt(sum([i in 0..l](a[i]*a[i])));
Finding the dot product.
dot=sum([i in 0..l](a[i]*b[i]));
Matrix vector multiplication.
[i in 0..l](r[i]=sum([j in 0..m](a[i,j]*v[j])));
Calculating the trace of a matrix
res=sum([i in 0..l](a[i,i]));
Taylor expansion on every element in a vector
[i in 0..l](r[i]=sum([j in 0..m](a[j]*pow(v[i],j))));
Calculating Fourier series.
f=sum([j in 0..m](a[j]*cos(j*pi*x/2)+b[j]*sin(j*pi*x/2)))+c;
Calculating (A+I)*v using the Kronecker delta-tensor : delta(i,j)={i=j ? 1 : 0}
[i in 0..l](r[i]=sum([j in 0..m]((a[i,j]+delta(i,j))*v[j])));
Calculating cross product of two 3d vectors using the
antisymmetric tensor/Permutation Tensor/Levi-Civita tensor
[i in 0..3](r[i]=sum([j in 0..3,k in 0..3](anti(i,j,k)*a[j]*b[k])));
Calculating determinant of a 4x4 matrix using the antisymmetric tensor
det=sum([i in 0..4,j in 0..4,k in 0..4,l in 0..4]
(anti(i,j,k,l)*a[0,i]*a[1,j]*a[2,k]*a[3,l]));
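For what it's worth, here's roughly what the dot-product example above would lower to on AltiVec, whether done by a compiler or by hand -- a multiply-accumulate loop plus a horizontal reduction at the end. A sketch assuming 16-byte-aligned arrays and a multiple-of-4 length:

/* Sketch of what the dot-product expression lowers to on AltiVec:
 * a vec_madd loop plus a rotate-and-add horizontal sum.  Assumes a and b
 * are 16-byte aligned and n is a multiple of 4. */
#include <altivec.h>

float dot_altivec_sketch(const float *a, const float *b, int n)
{
    vector float acc = (vector float) vec_splat_u32(0);

    for (int i = 0; i < n; i += 4) {
        vector float va = vec_ld(0, a + i);
        vector float vb = vec_ld(0, b + i);
        acc = vec_madd(va, vb, acc);          /* acc += a[i..i+3] * b[i..i+3] */
    }

    /* horizontal sum: after two rotate-and-adds every lane holds the total */
    acc = vec_add(acc, vec_sld(acc, acc, 8));
    acc = vec_add(acc, vec_sld(acc, acc, 4));

    float dot;
    vec_ste(acc, 0, &dot);                    /* store one lane (all equal) */
    return dot;
}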
Re:portable vectorization (Score:1)
Re:portable vectorization (Score:1)
Introducing.... 1999! (Score:1)
"Re-Introducing" would be a better title.