Please create an account to participate in the Slashdot moderation system

Impressive Benchmarks: Sorting with a GPU 222

Posted by timothy on Wednesday June 29, 2005 @08:20AM from the but-that's-for-graphics dept.

An anonymous reader writes "The Graphics research group at the University of North Carolina at Chapel Hill has posted some interesting benchmarks for a sorting implementation which is done entirely on a GPU. There have been efforts on doing general purpose computation on GPUs before (previous Slashdot article). However, most of them had generally utilized the fragment processing pipeline of the GPUs which is slower then the default high speed rendering pipeline. Apparently, the above implementation is done using "simple texture mapping operations" and "cache efficient memory accesses" only. There also seems to an option to download the distribution for non-commercial use, though the requirements seem pretty hefty (a very decent nVidia graphics card and the latest nVidia drivers)."

This discussion has been archived. No new comments can be posted.

Impressive Benchmarks: Sorting with a GPU

Load All Comments

Search 222 Comments Log In/Create an Account

Comments Filter:

Is GPU memory swapped? (Score:3, Interesting)

by Anonymous Coward writes: on Wednesday June 29, 2005 @08:25AM (#12940386)

If not, it might be a good place to stash crypto keys, passwords, etc. (At least until someone writes a utility to dump it and adds it to something like Cain).
~~~

Share
twitter facebook
- Re:Is GPU memory swapped? (Score:4, Informative)
  
  by SlightOverdose ( 689181 ) writes: on Wednesday June 29, 2005 @09:05AM (#12940567)
  
  As far as I know, the Linux kernel allows applications to have their pages marked as non-swappable. I presume most mainstream OS's also do this.
  
  Parent Share
  twitter facebook
  - Re:Is GPU memory swapped? (Score:2)
    
    by cbreaker ( 561297 ) writes:
    
    I'd agree with your presumtion; Windows can do the same thing. Take VMWare GSX server, for instance; you can instruct it to never swap VM memory, or swap some of it, or swap all of it.
What about other sorts? (Score:2)

by Archibald Buttle ( 536586 ) writes:

This isn't exactly a fair test as I see it. As far as I can make out they've put a custom sorting algorithm up against the standard C library qsort. How about some comparisons of this GPUsort against other sorting algorithms run on the CPU?
- Re:What about other sorts? (Score:3, Informative)
  
  by Defiler ( 1693 ) * writes:
  
  qsort() is a very well-understood algorithm that has been highly optimized.
  Not including it in the benchmarks would have been a sign that some smoke and mirrors were involved. If the MSVC++ library had anything faster, people would be using it.
  - Re:What about other sorts? (Score:2)
    
    by Archibald Buttle ( 536586 ) writes:
    
    Presumably though the algorithm they used in GPUsort can be made to work on a Pentium IV. Comparing only against qsort looks suspicious to me - they should have compared GPUsort on the CPU as well as with it on the GPU and qsort.
    - Re:What about other sorts? (Score:5, Interesting)
      
      by pla ( 258480 ) writes: on Wednesday June 29, 2005 @08:56AM (#12940530) Journal
      
      Presumably though the algorithm they used in GPUsort can be made to work on a Pentium IV
      
      Not necessarily...
      
      Their use of the GPU to sort might very well run something along the lines of assigning Z-coordinates based on the key values, and colors based on a simple index , then asking the GPU to "show" the "pixels" in Z-order, then just read the "real" data of any arbitrary size and type in the order specified by the returned colors/indices. That would perform a sort using the GPU, very very rapidly, but you can't really translate it to run on a CPU - Sure, you could write code to fake it, but at the lowest level, you'd end up using something like a quicksort, rather than dedicated hardware, to emulate the desired behavior.
      
      Now, admittedly, I don't know that the method under consideration used such an approach. But it appears they at least took the approach of using the GPU for its strong points, rather than trying to force it to act as a general-purpose CPU.
      
      As for the choice of Quicksort - Most likely, they chose it because just about every C library out there has an implementation of quicksort. And while personally I prefer heapsort (in the worst case, quicksort has Q*O(n^2) behavior, while heapsort always takes only P*O(n log n), But P >> Q), I'll admit that for almost all unstructured input sets, quicksort finishes quite a lot faster than anything else.
      
      Parent Share
      twitter facebook
      - Re:What about other sorts? (Score:2, Offtopic)
        
        by agurkan ( 523320 ) writes:
        
        Actually P~2*Q. Also, there are tweaks to Quicksort which guarantees O(nlogn) behaviour. If you need an efficient but somewhat general purpose quicksort I strongly recommend Jörg Schön's collection of C programs [uni-heidelberg.de] which include a simple include file that creates a very efficient sortinf routine, and coding takes seconds!!
      - Quicksort versus HeapSort (Score:3, Informative)
        
        by tjwhaynes ( 114792 ) writes:
        
        in the worst case, quicksort has Q*O(n^2) behaviour
        Quicksort exhibits n-squared behaviour when the data set is sorted but reversed. Most decent quicksort routines do a random shuffle of the data at the start to avoid this issue.
        
        Choosing a sort for a particular situation is very much a matter of choosing. A shell sort is very fast for small data sets - it's quick to implement and follows C*O(n^1.27), where C is small compared to P or Q in your example. With mostly sorted data sets, the choice to sort al
      - Re:What about other sorts? (Score:3, Interesting)
        
        by philovivero ( 321158 ) writes:
        
        As for the choice of Quicksort - Most likely, they chose it because just about every C library out there has an implementation of quicksort. And while personally I prefer heapsort (in the worst case, quicksort has Q*O(n^2) behavior, while heapsort always takes only P*O(n log n), But P >> Q), I'll admit that for almost all unstructured input sets, quicksort finishes quite a lot faster than anything else.
        
        I humbly suggest you do not understand quicksort vs. heap sort. Layman's terms: Quicksort is be
        
        Re:What about other sorts? (Score:2)
        
        by philovivero ( 321158 ) writes:
        
        Lordy. I notice that the page claims quicksort has n^2 complexity when input is ordered. That is only true of naive quicksort implementations. Textbook quicksort implementations from when I was in college (about 10 years ago) already took care of cases where input was ordered.
        
        Re:What about other sorts? (Score:2)
        
        by doktor-hladnjak ( 650513 ) writes:
        
        That website seems to imply that you need an additional order n amount of space to do heapsort. However, it can be done in place without too much effort:
        
        1. Take your input array containing n elements and perform a build maxheap operation on it. This takes O(n) time and can be done in place.
        
        2. Swap the root (the maximum element in the heap) with the last value in the part of the array still occupied by the heap, effectively removing the root from the heap and placing it in the sorted portion at the end o
    - Re:What about other sorts? (Score:5, Informative)
      
      by Shisha ( 145964 ) writes: on Wednesday June 29, 2005 @09:05AM (#12940570) Homepage
      
      Presumably though the algorithm they used in GPUsort can be made to work on a Pentium IV.
      Yes, but the algorithm won't be anything special. It won't be a better algorithm than qsort and definitely not more efficient than O(n log n) comparisons. What is special is that it runs on a GPU.
      
      they should have compared GPUsort on the CPU as well
      
      And how exactly were they supposed to do that?!? GPUsort has been programmed to run on a GPU, and even if you don't know the first thing about computers, the G should suggest that GPU is very different from a CPU.
      
      One can prove that no sorting algorithm using binary comparisons can do better than use O(n log n) comparisons. Hence GPUsort couldn't have been asymptotically more efficient that qsort.
      
      If you think about it, the standart qsort implementation is definitely more optimised than most algorithms out there; it has been around for very long. But none of this matters; GPUsort can't run on a CPU.
      
      Parent Share
      twitter facebook
      - Re:What about other sorts? (Score:2)
        
        by Archibald Buttle ( 536586 ) writes:
        
        Chill dude!
        
        GPUs are of course specialist processors and not the same as CPUs as you state - I am well aware of the one letter difference. However my understanding is that the processing units in GPUs are rather similar to the likes of the SSE units within Pentium's and Altivec units in PowerPC chips. They are essentially linear vector processors, from what I understand.
        
        We're talking about algorithms here - if you can express an algorithm in code for one type of processor chip then you can express it in
        
        Re:What about other sorts? (Score:2)
        
        by InfiniteWisdom ( 530090 ) writes:
        
        The whole point of GPUsort is to exploit the parallelism of the GPU... you certainly could implement it... hell you could implement it on a postscript printer... anything that is Turing-complete. But it won't be a valid comparison. What they have compared is their best-known algorithms for sorting on a GPU to what is probably the best general-purpose sorting algorithm on a CPU.
      - Re:What about other sorts? (Score:2)
        
        by Politburo ( 640618 ) writes:
        
        even if you don't know the first thing about computers, the G should suggest that GPU is very different from a CPU.
        
        Based on that logic, you would say that a VCR and a VTR are very different machines, correct?
      - Radix is O(n)? Nice try (Score:2)
        
        by hypnagogue ( 700024 ) writes:
        
        For general purpose sorting, the number of passes in radix sort is log n (e.g. the number of digits of a decimal number is ceil(log10(number) ), so it is also O(n log n). It's only an optimization in sparse sets.
        
        Re:Radix is O(n)? Nice try (Score:2, Informative)
        
        by cain ( 14472 ) writes:
        
        The number of passes in radix sort is not dependent on the number of items in the list, n. Therefore the number of passes is not log n. The number of passes in radix sort in based on the length of the key being sorted. And for most data is a finite number, k. Therefore the complexity of radix sort is O(kn), where k is finite and not related to n at all, and thus the complexity is O(n). I think your rationale assumes that the length of the key you are sorting can be infinite.
        
        Re:What about other sorts? (Score:2)
        
        by AuMatar ( 183847 ) writes:
        
        n*log(n) is only a limit for sorting in-place. You can go as low as O(n) if you have sufficient memory.
  - Re:What about other sorts? (Score:4, Interesting)
    
    by Anonymous Coward writes: on Wednesday June 29, 2005 @08:42AM (#12940457)
    
    I really hope they are not using the C-library implementation of qsort in those timing comparisons...
    
    It:
    a) Makes a call to a function, via a function pointer, for each comparison.
    b) Uses a variable element size
    
    Both of these things will slow down the sort a lot, compared to a specialized implementation that only sorts 32-bit integers.
    
    Parent Share
    twitter facebook
    - Re:What about other sorts? (Score:2)
      
      by BlowChunx ( 168122 ) writes:
      
      The obvious answer is to make a qsort kernel module and get it out of user space!
      - Re:What about other sorts? (Score:2)
        
        by joto ( 134244 ) writes:
        
        The obvious answer is to make a qsort kernel module and get it out of user space!
        Yeah, that would be really neat. Not only do you still have to call a function for each comparison. But *each* and *every one* of those function calls will need to go across the kernel-userspace boundary. My oh my, you are clever (not!)
    - Re:What about other sorts? (Score:2, Informative)
      
      by pooly7 ( 892966 ) writes:
      
      I don't know why the parent comment is mod to "Funny" because this is actually _totally_ true. http://www.rt.com/man/qsort.3.html [rt.com]
    - Re:What about other sorts? (Score:2)
      
      by Have Blue ( 616 ) writes:
      
      If all you want to sort is integers, you can go beyond quicksort. Google "bucket sort" or "radix sort"- these are O(n) sorts while even quicksort is O(n log n).
      - Re:What about other sorts? (Score:3, Informative)
        
        by AuMatar ( 183847 ) writes:
        
        Of course, they require a lot of memory- buckt requires an array of integers of size maxvalue-minvalue, where max and min value are the biggest numbers being sorted. To sort all uint32 will take 2^32 indexes into the array, or at least 16 GB with 32 bit array indixes. Not practicle for random data, although if your data is limited to a range (say 1-1000) its a nice sort method, and fairly simple to write.
  - Re:What about other sorts? (Score:2)
    
    by p3d0 ( 42270 ) writes:
    
    No, qsort() does an indirect function call in hottest innermost loop. A hand-coded data-specific sort routine will always be faster unless the comparison operation itself is very slow.
    Unless your compiler knows about qsort and can (1) inline it, and (2) inline the comparison function into it, then you'll always have that indirect call in the innermost loop. (I just checked; gcc can't do it. I haven't tried the Intel or MS compilers.)
  - Re:What about other sorts? (Score:2)
    
    by exp(pi*sqrt(163)) ( 613870 ) writes:
    
    Quicksort is a well understoord algorithm. qsort() in the standard C library is a notoriously poor implementation. std::sort() in the C++ library might be a better choice.
    - - Re:What about other sorts? (Score:2)
        
        by exp(pi*sqrt(163)) ( 613870 ) writes:
        
        but becuase the comparison functor for std::sort will nearly always be inlined
        
        Knowing how to ensure the right things get inlined is one of the aspects of what I call "good implementation". But opinions may vary I suppose.
  - Re:What about other sorts? (Score:3, Interesting)
    
    by joto ( 134244 ) writes:
    
    qsort() is a very well-understood algorithm that has been highly optimized.
    Bzzt, wrong. qsort() is NOT, I repeat, NOT, the same as quicksort.
    
    According to unix manpages: "The qsort subroutine sorts a table of data in place. It uses the quicker-sort algorithm.
    Note that "quicker-sort" is not the same as "quicksort". In fact, I doubt anyone really knows what the hell "quicker sort" is, anyway. My guess is that it might be a name just invented to explain away the q, once some standardization committee de
- Re:What about other sorts? (Score:4, Informative)
  
  by top_down ( 137496 ) writes: on Wednesday June 29, 2005 @08:51AM (#12940501)
  
  Indeed, qsort is known to be slow. See:
  
  http://theory.stanford.edu/~amitp/rants/c++-vs-c/ [stanford.edu]
  
  A comparison with the much faster STL sort should be interesting.
  
  Parent Share
  twitter facebook
- More specifically (Score:2)
  
  by gr8_phk ( 621180 ) writes:
  
  "The implementation can handle both 16 and 32-bit floats."
  So it's "hard coded" for a couple types. The standard qsort has you pass a function pointer to a comparison routine so it can sort anything. Standard qsort also sorts a list of pointers to the items - I bet this GPU sort works directly on the data. Implementing a less generic version on the CPU is likely to result in it being faster than the GPU sort.
  
  The blurb at the end about increasing GPU speed with each generation is crap too. Both CPU and G
  - Re:More specifically (Score:2)
    
    by cbreaker ( 561297 ) writes:
    
    "Both CPU and GPU performance are now limited by power dissipation issues."
    
    Somewhat.
    
    There's a lot more room for growth in the GPU arena then there currently is in the CPU. They're only running at about 500Mhz right now, and they can add more processing pipelines and other enhancements currently found in the CPU, as well as shrinking the manufacturing process that will allow higher Mhz and more processing power.
    
    The present-day GPU is very advanced, but they're still behind modern CPU's in terms of manufa
  - Re:More specifically (Score:2)
    
    by Mr. Mikey ( 17567 ) writes:
    
    "Implementing a less generic version on the CPU is likely to result in it being faster than the GPU sort."
    
    How do you make a meaningful statement concerning the speed of a "less generic" algorithm running on a general-purpose CPU vs. a "generic" algorithm running on a piece of special-purpose hardware (GPU)?
    
    Methinks your statement carries with it a load of unstated assumptions... care to spell them out for us?
    - Re:More specifically (Score:2)
      
      by gr8_phk ( 621180 ) writes:
      
      "Methinks your statement carries with it a load of unstated assumptions... care to spell them out for us?"
      What I meant is that the GPU version is NOT generic like the qsort() CPU version. By a "less generic" version on the CPU I meant one that is optimized for a specific data type like "float". The standard quicksort does not sort a bunch of numbers, it sorts a list of pointers to things (like structs for example) when you provide it a pointer to a compare function that compares the things.
      
      One can easi
    - On further investigation (Score:2)
      
      by gr8_phk ( 621180 ) writes:
      
      I just read more of their documentation. They sort an array of (key, pointer) pairs, where the pointer is to the rest of a record. Kind of like takeing structures and pulling the key value out into the tuple. This makes their sort more useful than I originally thought, but it is still not the same as qsort even when just using it for floats. The graphs don't specify which GPUsort they used (they do have one that isn't even a tuple version).
      Implementing a similar "tuple sort" on a CPU with the same restric
- Numerical sorting (Score:2)
  
  by RealProgrammer ( 723725 ) writes:
  
  I was working on a sorting algorithm based on curve fitting. It's not a comparison sort, but more like a radix sort, but not one.
  
  You start with a bunch of numbers, or things that can be given a number. Loop through the numbers, keeping track of High, Low, total, mean, std dev, etc. Use that information to develop an interpolating curve, which tells you for a given value where it ought to end up.
  
  Put the Low at the front and the High at the back. On the next pass throught the numbers, put the number you
Like Judy for GPU? (Score:3, Interesting)

by torpor ( 458 ) writes: <ibisum@gmai[ ]om ['l.c' in gap]> on Wednesday June 29, 2005 @08:30AM (#12940406) Homepage Journal

I'd love to see Judy-style thinking applied to GPU problems..." [sourceforge.net]

especially since i use Judy arrays for tons of things on two different architectures, and its just a darn efficient hash library for pretty much all of my needs ..

Share
twitter facebook
Not accurate (Score:5, Informative)

by Have Blue ( 616 ) writes: on Wednesday June 29, 2005 @08:31AM (#12940414) Homepage

the fragment processing pipeline of the GPUs which is slower then the default high speed rendering pipeline

For the past two generations or so (starting with the Radeon 9800), there has been no such thing as "the default high speed rendering pipeline". The only circuitry present in the chip has been for evaluating shaders, and the fixed-function pipeline has been implemented as "shaders" that the driver runs on the chip automatically.

At least, I know for a fact this is true of ATI chips, and would not be at all surprised if nVidia is doing something similar.

Share
twitter facebook
- - - Re:Not accurate (Score:4, Informative)
      
      by daVinci1980 ( 73174 ) writes: on Wednesday June 29, 2005 @12:01PM (#12941956) Homepage
      
      Your comment doesn't really make sense, and seems to convey a lack of understanding of how graphics accelerators work. (This isn't a dig, they're incredibly complicated.)
      
      First, as the GP said, there is no fixed function pipeline on modern GPUs. When you submit a primitive (or a batch of primitives) with fixed-function functionality, the driver converts the current fixed function operations into appropriate shader programs and sends it on down the pipe.
      
      Secondly, the "fixed function pipeline" for vertex processing is ludicrously simple. There's actually really nothing that's done, save multiplying the vertices by the appropriate matrix (world * view * proj, in the usual case). The interesting work, and the work that's really done by the GPU is in the fragment processor. That's where the overwhelming majority of fixed function operations are actually performed.
      
      However, it concerns me that you talk about writing programs that try to be as general as the fixed function pipeline. Due to the nature of fragment processors, it's phenomenally expensive to branch. And even if you were to write a general-case implementation of the fixed function pipeline that didn't branch, it would contain so many instructions as to totally hammer your performance.
      
      A rule of thumb in the fragment processor is that you should have no more than ~40 instructions per fragment for a full screen fill. (For pre-7800 hardware).
      
      Parent Share
      twitter facebook
      - Re:Not accurate (Score:2, Interesting)
        
        by graphicsguy ( 710710 ) writes:
        
        Yes, my post was a bit haphazard...
        
        First, as the GP said, there is no fixed function pipeline on modern GPUs. When you submit a primitive (or a batch of primitives) with fixed-function functionality, the driver converts the current fixed function operations into appropriate shader programs and sends it on down the pipe.
        
        Agreed. However, because the guys at NVIDIA don't have to write their code in CG or VP or HLSL, they may have access to some additional means of optimizations that take advantage of so
Obl. Futurama quote (Score:3, Funny)

by Anonymous Coward writes: on Wednesday June 29, 2005 @08:34AM (#12940422)

Apparantly, the above implementaion is done using "simple texture mapping operations" and "cache efficient memory accesses" only.

Fry: Magic. Got it.

Share
twitter facebook
True GPU Genius: J. Ruby (Score:3, Informative)

by Anonymous Coward writes: on Wednesday June 29, 2005 @08:34AM (#12940423)

First, congratulations to J. Ruby who pioneered this work and is mentioned throughout the article. I work with him, and on his desk are _already_ four machines with dual GeForce 7800 in SLI mode. Talk about lust factor.

Ruby's first published work was the SETI@HOME modified client which uses Nvidia (or) ATI GPU for the waveform FFT calculations. I have watched him steadily upgrade his Nvidia GPU up to this wicked 7800 arrangement he is using today.

Go Jim! You owe me a beer.

Share
twitter facebook
- Re:True GPU Genius: J. Ruby (Score:2, Funny)
  
  by NatasRevol ( 731260 ) writes:
  
  I don't think an AC will get a beer...
very nice (Score:4, Interesting)

by __aahlyu4518 ( 74832 ) writes: on Wednesday June 29, 2005 @08:34AM (#12940424)

I probably don't know what I'm talking about, but I'm wondering....

What is the performance if the GPU is busy rendering when you play a game?
When the GPU is busy doing what it is supposed to do... a program should resort to qsort right?

Share
twitter facebook
- Probably not for game applications (Score:3, Interesting)
  
  by benhocking ( 724439 ) writes:
  
  Most GPGPU (general purpose GPU) researchers are envisioning scientific purposes. It really galls many of us in the scientific community that GPUs are so much more powerful than CPUs (if one can efficiently use the parallel processing capabilities of GPUs), and yet mostly we have to let these powerful processors go unused because it is typically very difficult to use GPUs for non-graphical computations (hence the G in GPU, of course).
  - Re:Probably not for game applications (Score:2)
    
    by RootsLINUX ( 854452 ) writes:
    
    By the way, for those interested in GPGPU research/ideas, there's a pretty nice site here: http://www.gpgpu.org [gpgpu.org]. It has some sample code, slides from conferences presentations, a forum, etc. It's a pretty nice site for information. I was interested in GPGPUs a few months ago and read through the material on that site heavily, but in the end I didn't have the time to try anything cool out because you'll need to learn how to use a language like Cg or steam to program your GPGPU.
  - Re:Probably not for game applications (Score:5, Insightful)
    
    by Tim C ( 15259 ) writes: on Wednesday June 29, 2005 @09:29AM (#12940732)
    
    Well, to be fair GPUs are only "so much more powerful than CPUs" if your task is suitable for running on a GPU. If not, then you're better off using the CPU.
    
    Kind of like how a bulldozer is much more powerful than a hammer, but totally unsuitable to banging a nail in a bit of wood. If you want something torn down or moved about, though...
    
    Parent Share
    twitter facebook
    - I like that analogy... (Score:2)
      
      by benhocking ( 724439 ) writes:
      
      Next time I need to hammer a bunch of nails in parallel, perhaps I'll consider using a bulldozer! :D
      
      Your point is well made, of course. Nevertheless, with work like this (the GPU sorter) even off-loading a little work to the GPU can allow your CPU to do other work thereby shortening the wall-clock time required to do your computations (in theory - in practice, it can hurt you if you're not careful!). My research area (neural networks) is inherently parallelizable, but I am not yet aware of work to efficie
      - Beowolf cluster of Smalltalk processes (Score:2)
        
        by crovira ( 10242 ) writes:
        
        You could run a large simulation as a neural net in distributed Smalltalk, taking advantage of the inherent parallelizability (if such a word exists?) of the actions of a neuron firing when inputs hit certain critical tresholds.
        
        Basically, it the same structure I would use when doing a terrain simulation using finite state automata. The wider you can spread the computational net, the more computational fish you can catch.
        
        Its not the object that is so complicated, (a neuron is, uh, stupidly simple [despite
- Re:very nice (Score:2)
  
  by FriedTurkey ( 761642 ) * writes:
  
  Yes, I am sure the researchers are finding faster ways to map the human genome while playing DOOM3 at the same time. It is very important because it is annoying to have your frames per second drop because of some "cure for cancer" crap.
  - Re:very nice (Score:2)
    
    by __aahlyu4518 ( 74832 ) writes:
    
    LOL... you made your point :-)
    
    But often things that are constructed for scientific purposes are one day going to be used for general purpose. I was obviously thinking about that...
- Re:very nice (Score:2)
  
  by ImaLamer ( 260199 ) writes:
  
  I'm thinking that this type of thing should be good for screensavers and other hacks. I'd like to see Electric Sheep (http://electricsheep.org/ [electricsheep.org])run entirely on the GPU.
- - - - Re:very nice (Score:2, Informative)
        
        by graphicsguy ( 710710 ) writes:
        
        Sorting is not the problem to be solved. It is a piece of the solution to MANY problems. So for lots of problems you would like to solve on the GPU, you need to do sorting, and doing it on the GPU avoids constantly moving the data back and forth between the CPU and GPU.
I knew somebody who did math on a "blitter" (Score:2, Funny)

by crovira ( 10242 ) writes:

(Smalltalk bit block transfer) He was always making assmptions about the underlying virtual engine.

He was always getting it wrong too. I have logged thousands of hours and thousands of miles, from Montreal, Canada to Lisbon, Portugal cleaning up after this yobbo. What a fuckup he was.

The opinion was shared too. It got to the point to where we could write code that would detect his code and, as soon as we came across it and confirmed it we would remove it and read the original spec to know what to code.

"G
It's been said... (Score:5, Interesting)

by TobyWong ( 168498 ) writes: on Wednesday June 29, 2005 @08:41AM (#12940449)

...that the greatest threat to Intel's market domination in the future is not going to come from AMD but from a company such as Nvidia.

Take a look at what they are doing with their GPUs [nvidia.com] right now and you can understand why someone would suggest this.

Share
twitter facebook
- Re:It's been said... (Score:2)
  
  by fizze ( 610734 ) writes:
  
  Well, for highly specialised tasks, take a look at Analog Devices http://www.analog.com/ [analog.com] or Texas Instruments http://www.ti.com/ [ti.com].
  
  They have been producing highly sophisticated cores that left a P4 bite the dust in a lot of cases.
  
  I have worked on test-bed equipment that used a DSP PCI card that produced more test-data than a dual Xeon system could handle. JFYI.
  
  GPUs like those from nVidia or ATI are still a lot less sophisticated than those DSPs, or hybrid DSP/uCs.
  
  Still, in a few years FPGAs or CPLDs w
  - Re:It's been said... (Score:2)
    
    by mmp ( 121767 ) writes:
    
    They have been producing highly sophisticated cores that left a P4 bite the dust in a lot of cases.
    
    I may be wrong, but I'm guessing they don't cost a few hundred dollars each, though, like GPUs do. (The economies of scale that GPU manufacturers see thanks to the fact that they make millions of these things are really quite nice for keeping prices down.)
    
    GPUs like those from nVidia or ATI are still a lot less sophisticated than those DSPs, or hybrid DSP/uCs.
    
    I think I disagree. How familiar are you wit
    - Re:It's been said... (Score:2)
      
      by savuporo ( 658486 ) writes:
      
      I may be wrong, but I'm guessing they don't cost a few hundred dollars each, though, like GPUs do Absolutely, they routinely cost from $1 to $100 at most. For instance, take a look at AC motor control hybrid DSP/uC chips by Freescale. Those allow vector control of AC motors by implementing very sophisticated and high-performance algorithms on specialized chip, this kind of stuff was simply impossible just ten years ago.
- Re:It's been said... (Score:2)
  
  by megalomang ( 217790 ) writes:
  
  Wasn't that said by... Jen-Hsun Huang, the Nvidia CEO? I think he was interviewed in Wired, and on the cover it said Nvidia would be the "Intel killer". It's no surprise that the Nvidia CEO would say things that inspire such headlines.
  
  But wasn't that several years ago, just before Intel increased their market share to well over half the market by selling Intel integrated graphics?
It is really time... (Score:2, Interesting)

by ratta ( 760424 ) writes:

that, since graphic hardware is now completely programmable, ATI and Nvidia realese their specs, so that everyone can use the GPU as a specialized vectorial coprocessor. A CPU that cannot be programmed freely just makes no sense, the same should be for GPU.
- Re:It is really time... (Score:2)
  
  by Xrikcus ( 207545 ) writes:
  
  It's not really a vectorial coprocessor, more suited for large scale stream processing. I'm not sure what extra opening of specs would help in this area though, the programming specifications seem to be quite easily available (people know how to write shaders, afterall) and work has been done on using the GPUs as generalised streaming processors http://graphics.stanford.edu/projects/brookgpu/in d ex.html [stanford.edu] being a prime example. Within the hardware limitations (of which there are many) they are freely program
- Re:It is really time... (Score:2)
  
  by mikael ( 484 ) writes:
  
  The current generations of CPU's are designed to optimise the processing of conditional statements within a pipeline. Since instructions usually take four stages to perform (fetch, read, execute, write) and the location of the current instruction can depend on the condition result (branch on whatever) of the previous instruction. To prevent this from degrading performace, the CPU has to anticipate the effect that both results of a conditional instruction will have on future instructions. Also, there is no
GPUs, and Floating Point Numbers General Question (Score:2, Interesting)

by BioCS.Nerd ( 847372 ) writes:

This may be slightly offtopic insofar that it doesn't directly deal with the subject (sorting with a GPU) at hand, but I was wondering how these sorts of research projects overcome "floating point number weirdness" I've heard about doing GPU calculations (as per the implementation being non-IEEE)? Doubly, would someone in the know help explain what the aforementioned "weirdness" means?
- Re:GPUs, and Floating Point Numbers General Questi (Score:4, Insightful)
  
  by slashflood ( 697891 ) writes: <flow@howflow . c om> on Wednesday June 29, 2005 @08:58AM (#12940546) Homepage Journal
  
  I've heard about doing GPU calculations (as per the implementation being non-IEEE)? Doubly, would someone in the know help explain what the aforementioned "weirdness" means?
  
  There are just some functions missing, like different roundings and the missing double precision. GPUs are simply not optimized for scientific calculations, but that doesn't mean that they can't be programmed to be useful for those workloads.
  
  Parent Share
  twitter facebook
- Re:GPUs, and Floating Point Numbers General Questi (Score:3, Informative)
  
  by graphicsguy ( 710710 ) writes:
  
  It is true. The GPU floating point implementations are not IEEE compliant. There is an interesting study on the floating point behavoir of the GPU here [unc.edu] (and the associated pdf [unc.edu])
precision? (Score:2, Interesting)

by Barbarian ( 9467 ) writes:

Are not most GPUs limited to 24 or 32 bit precision? The x86 processors can go to 80 bits.
very decent? (Score:2, Funny)

by mnemonic_ ( 164550 ) writes:

a very decent nVidia graphics card

So is that like "average", except "very average"?
- Re:very decent? (Score:2)
  
  by cide1 ( 126814 ) writes:
  
  it is so very, very OK
Wow (Score:2)

by CmdrGravy ( 645153 ) writes:

Seriously, like who here hasn't already done this themselves anyway. At least it's not a dupe I suppose !
- Re:Wow (Score:2)
  
  by SeventyBang ( 858415 ) writes:
  
  {perk} Dupe?
  
  Look at the editor for this story - Timothy - the dupe from last night. Both stories on the front page
  
  On top of that, the story has then instead of than.
  
  'spose he's on day 6 of a 15-day meth binge?
The algorithm is called "Bitonic sort" (Score:2, Informative)

by lonedroid ( 888148 ) writes:

This is an implementation of the Bitonic sort.

From the article, when comparing their sort to previous GPU sort: "These algorithms also implement Bitonic Sort on the GPU for 16/32-bit floats using the programmable pipeline of GPUs."

So as I understand it they made a very fast implementation (using the GPU) of an old algorithm suited to parallel processing: bitonic sort was published in 1968 (hey, where were the fast parallel processors in the late sixties ;)
Separate the GPU from the video output? (Score:2)

by ebcdic ( 39948 ) writes:

Perhaps the time has come for separate GPU and video output cards. The video output card could be fairly dumb, perhaps on the motherboard, and the GPU card optional and usable even in servers without a video card.
- Re:Separate the GPU from the video output? (Score:2)
  
  by Wesley Felter ( 138342 ) writes:
  
  And then you need a multi-GB/s connection between the GPU and the DAC, which isn't free.
  - - Re:Separate the GPU from the video output? (Score:2)
      
      by Wesley Felter ( 138342 ) writes:
      
      OK, so you need a multi-GB/s connection between the GPU and the DVI transmitter, which still isn't free.
Why qsort? (Score:2, Informative)

by ringm000 ( 878375 ) writes:

Comparing with C qsort is strange, qsort will never work fast due to being unable to inline the comparison function. Hand-written qsort of floats should work much faster.
LAME GPU (Score:2)

by Doc Ruby ( 173196 ) writes:

I can't wait for someone to port an MP3 encoder to GPGPU. A $300 P3, stuffed with 5 $200 videocards, could host a pool of MP3 encoders, faster than 10 Xeons which would cost $30k. And a lot easier to administer in a streaming farm.
- - Re:LAME GPU (Score:2)
    
    by Doc Ruby ( 173196 ) writes:
    
    AGP (1x) does [pcguide.com] 254.3MBps, PCI (v1.0) does 127.2MBps. CD data is 172.4KBps. So PCI does 738 CD streams, while AGP does 1475. At 128Kbps, each stream IO is 192.4KBps duplex, so AGP: 1322 encoder streams, PCI 661 encoder streams. A P4/3GHz does 6GFLOPS, including host apps (OS, etc), while a $115 GeForce FX5900 [shentech.com] does 20GFLOPS [72.14.207.104]; a $470 GeForce 6800 Ultra [compuplus.com] does 40GFLOPS. Even at the slowest AGP and PCI speeds, and the fastest theoretical GPGPU speeds, the GPU is still slower than the bus, and 5 of those cards in PCI
My professor was doing this (Score:2, Interesting)

by yellowcord ( 607995 ) writes:

a couple of years ago. Nvidia gave him a FX5900 (I think. It was the one that sounded like a hair drier and got pulled from the market) to do his research with. Anyways, check out his papers [usask.ca] on the subject.
The point? (Score:3, Interesting)

by Pedrito ( 94783 ) writes: on Wednesday June 29, 2005 @10:55AM (#12941350)

I don't really see the point. I mean, yes, it's kind of neat, but that's about it. It serves no practical purpose, really.

Not every machine has a GPU. I don't know, but I suspect GPUs aren't terribly compatible with each other, so for any sort of market, you'd have to code for multiple GPU types.

The fact is, co-processors have been around since the early x86 CPUs. Not just math co-processors, but Intel used to cell i860 and i960 co-processor boards (some with multiple CPUs) for just this sort of thing.

I'd also suspect that if your GPU is being used for sorting or some other calculation intensive operation, it's less useful as a GPU. If you don't need the GPU as a GPU, but you need the computing power, I suspect spending the additional money on more CPU power instead, is going to have a bigger overall payoff since it's going to speed up every operation, not just the ones a GPU can be adapted to.

So, again, I don't really see the point. Really, if you need specialized co-processing, getting a FPGA board will probably be a much better use of your money since it can be customized for many more tasks than a GPU.

Share
twitter facebook
- Re:The point? (Score:2)
  
  by Chirs ( 87576 ) writes:
  
  The point is that many machines come with a GPU as default hardware, even if they're never going to do 3D graphics work. If you can make use of something that's *already* on your machine, it's a net gain.
  
  Also, FPGA boards are not necessarily cheap, while GPUs can be obtained for reasonable amounts of money and have very high performance if your problem can be coded to run on them.
- Re:The point? (Score:2)
  
  by cant_get_a_good_nick ( 172131 ) writes:
  
  as far as generally available co-processors, GPUs are by far the most common. therefore you get general availability on a number of machines, and economies of scale to lower the price somewhat.
  
  Though this particular implementation (sorting) probably won't revolutionize the world, it can be seen as a step, a way of learning how to use the GPU in a more general fashion. Maybe we should see this more as "Hello World" in a new programming language on a new system rather than an end in itself.
hmmmm... $250 aint so bad... (Score:2)

by iamhassi ( 659463 ) writes:

"Our results on the ATI X800 XT and NVIDIA GeForce 6800 GPUs indicate better performance in comparison to the qsort routine on a 3.4 GHz Pentium IV PC....Note that in both cases, GPUSort performs considerably faster."
suddenly paying $250 for a video card doesnt seem like such a bad deal...

Looks like we have a new videocard benchmark too.
- Re:Just what I need! (Score:5, Insightful)
  
  by Defiler ( 1693 ) * writes: on Wednesday June 29, 2005 @08:29AM (#12940401)
  
  Who the fuck has enough data to sort on a regular basis that they'd need hardware-assisted sorting?
  
  Perhaps you've heard of, I dunno.. Google, or Oracle?
  
  Parent Share
  twitter facebook
  - Re:Just what I need! (Score:5, Interesting)
    
    by grazzy ( 56382 ) writes: <grazzy@NospAM.quake.swe.net> on Wednesday June 29, 2005 @08:31AM (#12940412) Homepage Journal
    
    Sorts are very common in applications and _can_ be slow. In a future where everyone has a GPU this can effectivly serve as a dual processor setup. Just as the FPU helped us 10 years ago with floating point operations.
    
    Parent Share
    twitter facebook
    - - Re:Just what I need! (Score:2)
        
        by grazzy ( 56382 ) writes:
        
        Probably not, however if a GPU is available and you cna do your sort operation on it, and the sort operation would take long enough to hog the main CPU + that it wont occupy busses etc to much, you might end up gaining from it i suppose.
        
        (Given that the sort procedure is something that CAN be run in the background whilst the user is doing something else).
        
        I highly doubt that companies like Oracle or Google would benefit from using GPUS in large scale instead of CPUS...
      - Re:Just what I need! (Score:2)
        
        by Hal_Porter ( 817932 ) writes:
        
        I think it's just fun.
        
        Probably the reasons why GPU's are fast for the task they are designed for - a small amount of very fast (assuming you access in the right order) non paged memory and a very simple (no Out of order execution) but highly parallel processor make them bad at general purpose stuff.
        
        On the other hand, I can imagine that you could build a sort coprocesseor in an FPGA - the fact that it was optimised for the algorithm might be able to outweigh the fact that FPGA implementations of a given pi
  - Re:Just what I need! (Score:2)
    
    by udderly ( 890305 ) writes:
    
    I'd imagine that there are probably more entities than the OP realizes that would need this sort of sorting. I worked in a Fortune 100 retailer, in one of their 600 stores, which had 5000 individual transactions per day, with an average of 18-20 items per transaction. They routinely do data analysis on item movement, during what hours, bought in conjunction with what other items...blah, blah, blah.
    
    You can do the math, but that's a lot of data at the corporate level.
  - Re:Just what I need! (Score:2)
    
    by cpeterso ( 19082 ) writes:
    
    and I thought DOOM 3 had serious graphics card requirements! Now Oracle database servers will require NVIDIA graphics cards?! :)
- - However are they reliable enough? (Score:3, Interesting)
    
    by TheLink ( 130905 ) writes:
    
    Are GPUs as reliable as CPUs for these tasks? I mean if there's an error in a GPU you might get a brief rendering glitch and lots of people may not notice.
    
    In fact, some people are actually talking about ways of testing and allowing slightly broken graphics hardware to be used in situations where the _visual_ results won't be that off.
- GPU has a different architecture (Score:2)
  
  by gvc ( 167165 ) writes:
  
  The GPU has lots of separate processing elements that operate independently. It also supports floating-point vector operations. Quite a different animal from yer-average-x86.
  - Re:GPU has a different architecture (Score:3, Interesting)
    
    by slashflood ( 697891 ) writes:
    
    The GPU has lots of separate processing elements that operate independently. It also supports floating-point vector operations. Quite a different animal from yer-average-x86.
    
    Just like the Cell processor...
- Re:Come on! (Score:2)
  
  by fnj ( 64210 ) writes:
  
  What is so difficult about knowing when to use then and when to use than?
  
  "Then" and "than" was grammar school stuff back in my day. Maybe that's why they called it grammar school.
  
  Here's one that might have taken until junior high school to master:
  
  "I need a discrete source of discreet components." ... or is it ...
  "I need a discreet source of discrete components."
  - Re:Come on! (Score:2)
    
    by SeventyBang ( 858415 ) writes:
    
    Of course it's the latter.
    
    If we could only educate the morons who can't get its and it's correct.
    
    Rule of thumb: "explode" your contractions. It's is a contraction, not a designation of ownership. The same goes for you're, they're, etc. If it doesn't make sense exploded, it won't make sense as a contraction.
    
    I have yet to figure out why computer people think they are so smart but cannot master simple grammar and punctuation.
    - Re:Come on! (Score:2)
      
      by m50d ( 797211 ) writes:
      
      But as a general rule, it's as a designation of ownership would make sense. John's computer, the computer's mouse, all make sense. Why not it's buttons? Using its in sentences like that is a special case that has to be memorised, because it does not make sense.
    - its vs. it's (Score:2)
      
      by bodrell ( 665409 ) writes:
      
      Rule of thumb: "explode" your contractions. It's is a contraction, not a designation of ownership. The same goes for you're, they're, etc. If it doesn't make sense exploded, it won't make sense as a contraction.
      I do know and understand the difference between its and it's, but your "rule of thumb" is retarded in this case. For all other nouns (not pronouns), an apostrophe-s signifies possession. It's the only remaining case-ending (well, two if you count plural), which is the genitive case [wikipedia.org]. Latin and
- - Re:Let's get an Nvidia fanboy response. (Score:2)
    
    by Hal_Porter ( 817932 ) writes:
    
    Actually I was joking about fanboys and meaningless benchmarks.
    
    E.g. when game x comes out which favours Nvidia, all the Nvidia crow about it. When game y comes out that favours ATI, all the ATI fan boys crow about it.
    
    But when you look at it, most of the time they are talking about +20% on framerates that are already twice the refresh rate. It's completely artificial - in the FPS benchmark they turn off the Vsync to be a more informative test, but not in the game. In the game, the drawing code will wait fo
    - Re:Let's get an Nvidia fanboy response. (Score:2, Informative)
      
      by default luser ( 529332 ) writes:
      
      But when you look at it, most of the time they are talking about +20% on framerates that are already twice the refresh rate. It's completely artificial - in the FPS benchmark they turn off the Vsync to be a more informative test, but not in the game. In the game, the drawing code will wait for a Vsync before drawing a new frame, so both cards will end up rendering the same frame rate when you actually play the game.
      
      Not true. Both Nvidia and ATIs' driver sets default profiles have force v-sync disabled.
- Re:Forgive the stupid question, but....er.....why? (Score:2, Insightful)
  
  by graphicsguy ( 710710 ) writes:
  
  It is important because sorting is an important algorithmic building block for solving many problems. If you want to solve important problems entirely on the GPU (which has many more GFLOPS available than the CPU, if you can manage to use them efficiently), then you need to be able to do the sorting component of those larger algorithms.
- Re:CS at liberal arts universities? (Score:2)
  
  by doktor-hladnjak ( 650513 ) writes:
  
  UNC is most definitely NOT a liberal arts university. Although there is no real engineering program, there is a lot of natural science and medical research going on. Plus, there's the business, law, journalism and a myriad of other more professional schools.
  The CS department here is a top 20 research-based program with a very strong focus in the computer graphics and imaging areas.
- Re:CS at liberal arts universities? (Score:2)
  
  by cant_get_a_good_nick ( 172131 ) writes:
  
  UNC Chapel Hill (besides being home of the Dean Dome) has one of the nations, possibly the world's best visualization labs. The fact they were doing GPU research is not a surprise to me.
  
  I have some pride that my otherwise crappy school (UIC) is also at the relative forefront of visualization as well.

There may be more comments in this discussion. Without JavaScript enabled, you might want to turn on Classic Discussion System in your preferences instead.

Is GPU memory swapped? (Score:3, Interesting)

Re:Is GPU memory swapped? (Score:4, Informative)

Re:Is GPU memory swapped? (Score:2)

What about other sorts? (Score:2)

Re:What about other sorts? (Score:3, Informative)

Re:What about other sorts? (Score:2)

Re:What about other sorts? (Score:5, Interesting)

Re:What about other sorts? (Score:2, Offtopic)

Quicksort versus HeapSort (Score:3, Informative)

Re:What about other sorts? (Score:3, Interesting)

Re:What about other sorts? (Score:2)

Re:What about other sorts? (Score:2)

Re:What about other sorts? (Score:5, Informative)

Re:What about other sorts? (Score:2)

Re:What about other sorts? (Score:2)

Re:What about other sorts? (Score:2)

Radix is O(n)? Nice try (Score:2)

Re:Radix is O(n)? Nice try (Score:2, Informative)

Re:What about other sorts? (Score:2)

Re:What about other sorts? (Score:4, Interesting)

Re:What about other sorts? (Score:2)

Re:What about other sorts? (Score:2)

Re:What about other sorts? (Score:2, Informative)

Re:What about other sorts? (Score:2)

Re:What about other sorts? (Score:3, Informative)

Re:What about other sorts? (Score:2)

Re:What about other sorts? (Score:2)

Re:What about other sorts? (Score:2)

Re:What about other sorts? (Score:3, Interesting)

Re:What about other sorts? (Score:4, Informative)

More specifically (Score:2)

Re:More specifically (Score:2)

Re:More specifically (Score:2)

Re:More specifically (Score:2)

On further investigation (Score:2)

Numerical sorting (Score:2)

Like Judy for GPU? (Score:3, Interesting)

Not accurate (Score:5, Informative)

Re:Not accurate (Score:4, Informative)

Re:Not accurate (Score:2, Interesting)

Obl. Futurama quote (Score:3, Funny)

True GPU Genius: J. Ruby (Score:3, Informative)

Re:True GPU Genius: J. Ruby (Score:2, Funny)

very nice (Score:4, Interesting)

Probably not for game applications (Score:3, Interesting)

Re:Probably not for game applications (Score:2)

Re:Probably not for game applications (Score:5, Insightful)

I like that analogy... (Score:2)

Beowolf cluster of Smalltalk processes (Score:2)

Re:very nice (Score:2)

Re:very nice (Score:2)

Re:very nice (Score:2)

Re:very nice (Score:2, Informative)

I knew somebody who did math on a "blitter" (Score:2, Funny)

It's been said... (Score:5, Interesting)

Re:It's been said... (Score:2)

Re:It's been said... (Score:2)

Re:It's been said... (Score:2)

Re:It's been said... (Score:2)

It is really time... (Score:2, Interesting)

Re:It is really time... (Score:2)

Re:It is really time... (Score:2)

GPUs, and Floating Point Numbers General Question (Score:2, Interesting)

Re:GPUs, and Floating Point Numbers General Questi (Score:4, Insightful)

Re:GPUs, and Floating Point Numbers General Questi (Score:3, Informative)

precision? (Score:2, Interesting)

very decent? (Score:2, Funny)

Re:very decent? (Score:2)

Wow (Score:2)

Re:Wow (Score:2)

The algorithm is called "Bitonic sort" (Score:2, Informative)

Separate the GPU from the video output? (Score:2)

Re:Separate the GPU from the video output? (Score:2)

Re:Separate the GPU from the video output? (Score:2)

Why qsort? (Score:2, Informative)

LAME GPU (Score:2)

Re:LAME GPU (Score:2)

My professor was doing this (Score:2, Interesting)

The point? (Score:3, Interesting)

Re:The point? (Score:2)