Slashdot Log In
Impressive Benchmarks: Sorting with a GPU
Posted by
timothy
on Wed Jun 29, 2005 07:20 AM
from the but-that's-for-graphics dept.
from the but-that's-for-graphics dept.
An anonymous reader writes "The Graphics research group at the University of North Carolina at Chapel Hill has posted some interesting benchmarks for a sorting implementation which is done entirely on a GPU. There have been efforts on doing general purpose computation on GPUs before (previous Slashdot article). However, most of them had generally utilized the fragment processing pipeline of the GPUs which is slower then the default high speed rendering pipeline. Apparently, the above implementation is done using "simple texture mapping operations" and "cache efficient memory accesses" only. There also seems to an option to download the distribution for non-commercial use, though the requirements seem pretty hefty (a very decent nVidia graphics card and the latest nVidia drivers)."
This discussion has been archived.
No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
Full
Abbreviated
Hidden
Loading... please wait.
Is GPU memory swapped? (Score:3, Interesting)
~~~
Re:Is GPU memory swapped? (Score:4, Informative)
Parent
What about other sorts? (Score:2)
Re:What about other sorts? (Score:3, Informative)
Not including it in the benchmarks would have been a sign that some smoke and mirrors were involved. If the MSVC++ library had anything faster, people would be using it.
Re:What about other sorts? (Score:2)
Re:What about other sorts? (Score:5, Interesting)
Not necessarily...
Their use of the GPU to sort might very well run something along the lines of assigning Z-coordinates based on the key values, and colors based on a simple index , then asking the GPU to "show" the "pixels" in Z-order, then just read the "real" data of any arbitrary size and type in the order specified by the returned colors/indices. That would perform a sort using the GPU, very very rapidly, but you can't really translate it to run on a CPU - Sure, you could write code to fake it, but at the lowest level, you'd end up using something like a quicksort, rather than dedicated hardware, to emulate the desired behavior.
Now, admittedly, I don't know that the method under consideration used such an approach. But it appears they at least took the approach of using the GPU for its strong points, rather than trying to force it to act as a general-purpose CPU.
As for the choice of Quicksort - Most likely, they chose it because just about every C library out there has an implementation of quicksort. And while personally I prefer heapsort (in the worst case, quicksort has Q*O(n^2) behavior, while heapsort always takes only P*O(n log n), But P >> Q), I'll admit that for almost all unstructured input sets, quicksort finishes quite a lot faster than anything else.
Parent
Quicksort versus HeapSort (Score:3, Informative)
Quicksort exhibits n-squared behaviour when the data set is sorted but reversed. Most decent quicksort routines do a random shuffle of the data at the start to avoid this issue.
Choosing a sort for a particular situation is very much a matter of choosing. A shell sort is very fast for small data sets - it's quick to implement and follows C*O(n^1.27), where C is small compared to P or Q in your example. With mostly sorted data sets, the choice to sort al
Re:What about other sorts? (Score:3, Interesting)
I humbly suggest you do not understand quicksort vs. heap sort. Layman's terms: Quicksort is be
Re:What about other sorts? (Score:5, Informative)
Presumably though the algorithm they used in GPUsort can be made to work on a Pentium IV.
Yes, but the algorithm won't be anything special. It won't be a better algorithm than qsort and definitely not more efficient than O(n log n) comparisons. What is special is that it runs on a GPU.
they should have compared GPUsort on the CPU as well
And how exactly were they supposed to do that?!? GPUsort has been programmed to run on a GPU, and even if you don't know the first thing about computers, the G should suggest that GPU is very different from a CPU.
One can prove that no sorting algorithm using binary comparisons can do better than use O(n log n) comparisons. Hence GPUsort couldn't have been asymptotically more efficient that qsort.
If you think about it, the standart qsort implementation is definitely more optimised than most algorithms out there; it has been around for very long. But none of this matters; GPUsort can't run on a CPU.
Parent
Re:What about other sorts? (Score:4, Interesting)
It:
a) Makes a call to a function, via a function pointer, for each comparison.
b) Uses a variable element size
Both of these things will slow down the sort a lot, compared to a specialized implementation that only sorts 32-bit integers.
Parent
Re:What about other sorts? (Score:3, Informative)
Re:What about other sorts? (Score:2)
Unless your compiler knows about qsort and can (1) inline it, and (2) inline the comparison function into it, then you'll always have that indirect call in the innermost loop. (I just checked; gcc can't do it. I haven't tried the Intel or MS compilers.)
Re:What about other sorts? (Score:2)
Re:What about other sorts? (Score:3, Interesting)
Bzzt, wrong. qsort() is NOT, I repeat, NOT, the same as quicksort.
According to unix manpages: "The qsort subroutine sorts a table of data in place. It uses the quicker-sort algorithm.
Note that "quicker-sort" is not the same as "quicksort". In fact, I doubt anyone really knows what the hell "quicker sort" is, anyway. My guess is that it might be a name just invented to explain away the q, once some standardization committee de
Re:What about other sorts? (Score:4, Informative)
http://theory.stanford.edu/~amitp/rants/c++-vs-c/ [stanford.edu]
A comparison with the much faster STL sort should be interesting.
Parent
Like Judy for GPU? (Score:3, Interesting)
especially since i use Judy arrays for tons of things on two different architectures, and its just a darn efficient hash library for pretty much all of my needs
Not accurate (Score:5, Informative)
For the past two generations or so (starting with the Radeon 9800), there has been no such thing as "the default high speed rendering pipeline". The only circuitry present in the chip has been for evaluating shaders, and the fixed-function pipeline has been implemented as "shaders" that the driver runs on the chip automatically.
At least, I know for a fact this is true of ATI chips, and would not be at all surprised if nVidia is doing something similar.
Re:Not accurate (Score:4, Informative)
First, as the GP said, there is no fixed function pipeline on modern GPUs. When you submit a primitive (or a batch of primitives) with fixed-function functionality, the driver converts the current fixed function operations into appropriate shader programs and sends it on down the pipe.
Secondly, the "fixed function pipeline" for vertex processing is ludicrously simple. There's actually really nothing that's done, save multiplying the vertices by the appropriate matrix (world * view * proj, in the usual case). The interesting work, and the work that's really done by the GPU is in the fragment processor. That's where the overwhelming majority of fixed function operations are actually performed.
However, it concerns me that you talk about writing programs that try to be as general as the fixed function pipeline. Due to the nature of fragment processors, it's phenomenally expensive to branch. And even if you were to write a general-case implementation of the fixed function pipeline that didn't branch, it would contain so many instructions as to totally hammer your performance.
A rule of thumb in the fragment processor is that you should have no more than ~40 instructions per fragment for a full screen fill. (For pre-7800 hardware).
Parent
Obl. Futurama quote (Score:3, Funny)
Fry: Magic. Got it.
True GPU Genius: J. Ruby (Score:3, Informative)
Ruby's first published work was the SETI@HOME modified client which uses Nvidia (or) ATI GPU for the waveform FFT calculations. I have watched him steadily upgrade his Nvidia GPU up to this wicked 7800 arrangement he is using today.
Go Jim! You owe me a beer.
Re:True GPU Genius: J. Ruby (Score:2, Funny)
very nice (Score:4, Interesting)
What is the performance if the GPU is busy rendering when you play a game?
When the GPU is busy doing what it is supposed to do... a program should resort to qsort right?
Probably not for game applications (Score:3, Interesting)
Re:Probably not for game applications (Score:2)
Re:Probably not for game applications (Score:5, Insightful)
Kind of like how a bulldozer is much more powerful than a hammer, but totally unsuitable to banging a nail in a bit of wood. If you want something torn down or moved about, though...
Parent
I like that analogy... (Score:2)
Next time I need to hammer a bunch of nails in parallel, perhaps I'll consider using a bulldozer! :D
Your point is well made, of course. Nevertheless, with work like this (the GPU sorter) even off-loading a little work to the GPU can allow your CPU to do other work thereby shortening the wall-clock time required to do your computations (in theory - in practice, it can hurt you if you're not careful!). My research area (neural networks) is inherently parallelizable, but I am not yet aware of work to efficie
Re:very nice (Score:2)
Re:very nice (Score:2)
But often things that are constructed for scientific purposes are one day going to be used for general purpose. I was obviously thinking about that...
I knew somebody who did math on a "blitter" (Score:2, Funny)
He was always getting it wrong too. I have logged thousands of hours and thousands of miles, from Montreal, Canada to Lisbon, Portugal cleaning up after this yobbo. What a fuckup he was.
The opinion was shared too. It got to the point to where we could write code that would detect his code and, as soon as we came across it and confirmed it we would remove it and read the original spec to know what to code.
"G
It's been said... (Score:5, Interesting)
Take a look at what they are doing with their GPUs [nvidia.com] right now and you can understand why someone would suggest this.
It is really time... (Score:2, Interesting)
Re:It is really time... (Score:2)
GPUs, and Floating Point Numbers General Question (Score:2, Interesting)
Re:GPUs, and Floating Point Numbers General Questi (Score:4, Insightful)
I've heard about doing GPU calculations (as per the implementation being non-IEEE)? Doubly, would someone in the know help explain what the aforementioned "weirdness" means?
There are just some functions missing, like different roundings and the missing double precision. GPUs are simply not optimized for scientific calculations, but that doesn't mean that they can't be programmed to be useful for those workloads.
Parent
Re:GPUs, and Floating Point Numbers General Questi (Score:3, Informative)
precision? (Score:2, Interesting)
very decent? (Score:2, Funny)
So is that like "average", except "very average"?
Re:very decent? (Score:2)
Wow (Score:2)
The algorithm is called "Bitonic sort" (Score:2, Informative)
From the article, when comparing their sort to previous GPU sort: "These algorithms also implement Bitonic Sort on the GPU for 16/32-bit floats using the programmable pipeline of GPUs."
So as I understand it they made a very fast implementation (using the GPU) of an old algorithm suited to parallel processing: bitonic sort was published in 1968 (hey, where were the fast parallel processors in the late sixties
The point? (Score:3, Interesting)
Not every machine has a GPU. I don't know, but I suspect GPUs aren't terribly compatible with each other, so for any sort of market, you'd have to code for multiple GPU types.
The fact is, co-processors have been around since the early x86 CPUs. Not just math co-processors, but Intel used to cell i860 and i960 co-processor boards (some with multiple CPUs) for just this sort of thing.
I'd also suspect that if your GPU is being used for sorting or some other calculation intensive operation, it's less useful as a GPU. If you don't need the GPU as a GPU, but you need the computing power, I suspect spending the additional money on more CPU power instead, is going to have a bigger overall payoff since it's going to speed up every operation, not just the ones a GPU can be adapted to.
So, again, I don't really see the point. Really, if you need specialized co-processing, getting a FPGA board will probably be a much better use of your money since it can be customized for many more tasks than a GPU.
Re:Just what I need! (Score:5, Insightful)
Perhaps you've heard of, I dunno.. Google, or Oracle?
Parent
Re:Just what I need! (Score:5, Interesting)
Parent
Re:Just what I need! (Score:2)
(Given that the sort procedure is something that CAN be run in the background whilst the user is doing something else).
I highly doubt that companies like Oracle or Google would benefit from using GPUS in large scale instead of CPUS...
Re:Just what I need! (Score:2)
Probably the reasons why GPU's are fast for the task they are designed for - a small amount of very fast (assuming you access in the right order) non paged memory and a very simple (no Out of order execution) but highly parallel processor make them bad at general purpose stuff.
On the other hand, I can imagine that you could build a sort coprocesseor in an FPGA - the fact that it was optimised for the algorithm might be able to outweigh the fact that FPGA implementations of a given pi
Re:Just what I need! (Score:2)
You can do the math, but that's a lot of data at the corporate level.
However are they reliable enough? (Score:3, Interesting)
In fact, some people are actually talking about ways of testing and allowing slightly broken graphics hardware to be used in situations where the _visual_ results won't be that off.
GPU has a different architecture (Score:2)
Re:GPU has a different architecture (Score:3, Interesting)
Just like the Cell processor...
Re:Come on! (Score:2)
"Then" and "than" was grammar school stuff back in my day. Maybe that's why they called it grammar school.
Here's one that might have taken until junior high school to master:
"I need a discrete source of discreet components."
"I need a discreet source of discrete components."