NVIDIA's $10K Tesla GPU-Based Personal Supercomputer

Posted by timothy on Sun Nov 23, 2008 04:25 AM
from the plugs-into-standard-power-strip dept.
gupg writes "NVIDIA announced a new category of supercomputers — the Tesla Personal Supercomputer — a 4 TeraFLOPS desktop for under $10,000. This desktop machine has 4 of the Tesla C1060 computing processors. These GPUs have no graphics out and are used only for computing. Each Tesla GPU has 240 cores and delivers about 1 TeraFLOPS single precision and about 80 GigaFLOPS double-precision floating point performance. The CPU + GPU is programmed using C with added keywords using a parallel programming model called CUDA. The CUDA C compiler/development toolchain is free to download. There are tons of applications ported to CUDA including Mathematica, LabView, ANSYS Mechanical, and tons of scientific codes from molecular dynamics, quantum chemistry, and electromagnetics; they're listed on CUDA Zone."

Related Stories

[+] Hardware: NVIDIA Shaking Up the Parallel Programming World 154 comments
An anonymous reader writes "NVIDIA's CUDA system, originally developed for their graphics cores, is finding migratory uses into other massively parallel computing applications. As a result, it might not be a CPU designer that ultimately winds up solving the massively parallel programming challenges, but rather a video card vendor. From the article: 'The concept of writing individual programs which run on multiple cores is called multi-threading. That basically means that more than one part of the program is running at the same time, but on different cores. While this might seem like a trivial thing, there are all kinds of issues which arise. Suppose you are writing a gaming engine and there must be coordination between the location of the characters in the 3D world, coupled to their movements, coupled to the audio. All of that has to be synchronized. What if the developer gives the character movement tasks its own thread, but it can only be rendered at 400 fps. And the developer gives the 3D world drawer its own thread, but it can only be rendered at 60 fps. There's a lot of waiting by the audio and character threads until everything catches up. That's called synchronization.'"
[+] BOINC Now Available For GPU/CUDA 20 comments
GDI Lord writes "BOINC, open-source software for volunteer computing and grid computing, has posted news that GPU computing has arrived! The GPUGRID.net project from the Barcelona Biomedical Research Park uses CUDA-capable NVIDIA chips to create an infrastructure for biomolecular simulations. (Currently available for Linux64; other platforms to follow soon. To participate, follow the instructions on the web site.) I think this is great news, as GPUs have shown amazing potential for parallel computing."
[+] Inside Tsubame, Japan's GPU-Based Supercomputer 75 comments
Startled Hippo writes "Japan's Tsubame supercomputer was ranked 29th-fastest in the world in the latest Top 500 ranking with a speed of 77.48T Flops (floating point operations per second) on the industry-standard Linpack benchmark. Why is it so special? It uses NVIDIA GPUs. Tsubame includes hundreds of graphics processors of the same type used in consumer PCs, working alongside CPUs in a mixed environment that some say is a model for future supercomputers serving disciplines like material chemistry." Unlike the GPU-based Tesla, Tsubame definitely won't be mistaken for a personal computer.
[+] AMD RV790 Architecture To Change GPGPU Landscape? 102 comments
Vigile writes "To many observers, the success of the GPGPU landscape has really been pushed by NVIDIA and its line of Tesla and Quadro GPUs. While ATI was the first to offer support for consumer applications like Folding@Home, NVIDIA has since taken command of the market with its CUDA architecture and programs like Badaboom and others for the HPC world. PC Perspective has speculation that points to ATI addressing the shortcomings of its lineup with a revised GPU known as RV790 that would both dramatically increase gaming performance as well as more than triple the compute power on double precision floating point operations — one of the keys to HPC acceptance."
This discussion has been archived. No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • Graphics (Score:5, Funny)

    by Anonymous Coward on Sunday November 23 2008, @04:34AM (#25863373)

    Wow, that's some serious computing power! I wonder if anyone has thought of using these for graphics or rendering? I imagine they could make some killer games, especially with advanced technology like Direct 3D.

      • Re:Graphics (Score:5, Funny)

        by 404 Clue Not Found (763556) * on Sunday November 23 2008, @05:29AM (#25863591) Homepage

        Whoosh. Sorry.

      • Re:Graphics (Score:5, Funny)

        by Gnavpot (708731) on Sunday November 23 2008, @06:05AM (#25863697)

        "I wonder if anyone has thought of using these for graphics or rendering?"

        These are effectively just NVIDIA GT280 chips with the ports removed. Their heritage is gaming.

        We need a "+1 Whoosh" moderation option.

        No, I do not mean "-1 Whoosh". I want to see those embarrassingly stupid postings. But perhaps this moderation option should subtract karma.

      • Re: (Score:3, Informative)

        In much the same way that the current Quadro FX cards are based on the same chip as the gaming GeForce cards. But still, the most expensive gaming card is ~£400, while you'll pay ~£1500 for the top-of-the-line FX5700.

        It's because workstation graphics cards are configured for accuracy above all else, whereas gaming cards are configured for speed. A few wrong pixels do not affect gaming at all, but getting the numbers wrong in simulations is going to cause problems.

        Mostly the people who us

  • Heartening... (Score:3, Interesting)

    by blind biker (1066130) on Sunday November 23 2008, @04:37AM (#25863381) Journal

    ...to see a company established in a certain market, to branch out so aggressively and boldly into something... well, completely new, really.

    Does anyone know if Comsol Multiphysics can be ported to CUDA?

      • Re:Heartening... (Score:5, Interesting)

        by mangu (126918) on Sunday November 23 2008, @06:10AM (#25863719)

        Can you imagine a Beowulf cluster of these?

        Yes, I can. My first thought when I saw the article was to calculate how many of them one would need to simulate a human brain in real time. The answer is: with 2500 of these machines one could simulate a hundred billion neurons with a thousand synapses each, firing a hundred times per second, which is the approximate capacity of a human brain.

        People have paid $20 million to visit the space station, now who will be the first millionaire hobbyist to pay $25 million to have his own simulated human brain?
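The estimate above can be reproduced as back-of-envelope arithmetic. This is only a sketch: it assumes, as the estimate implicitly does, that one synaptic event costs a single floating point operation and that every machine sustains its full 4 TFLOPS peak. The variable names are illustrative.

```python
# Back-of-envelope check of the brain-simulation estimate.
# Assumptions (not measurements): one synaptic event = one flop,
# and each machine sustains its 4 TFLOPS single-precision peak.
NEURONS = 100e9             # a hundred billion neurons
SYNAPSES_PER_NEURON = 1e3   # a thousand synapses each
FIRING_RATE_HZ = 100        # firing a hundred times per second
MACHINE_FLOPS = 4e12        # 4 TFLOPS per machine
MACHINE_PRICE_USD = 10_000  # roughly $10,000 per machine

events_per_second = NEURONS * SYNAPSES_PER_NEURON * FIRING_RATE_HZ
machines_needed = events_per_second / MACHINE_FLOPS
total_cost_usd = machines_needed * MACHINE_PRICE_USD

print(f"synaptic events per second: {events_per_second:.0e}")
print(f"machines needed: {machines_needed:.0f}")
print(f"total cost: ${total_cost_usd:,.0f}")
```

Under those assumptions the numbers come out at 2500 machines and $25 million, matching the figures quoted above.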

        • Re: (Score:3, Interesting)

          Would the interconnects be fast enough? There's a lot of non-locality in the synaptic connections, so you're going to need some pretty heavy comms between the cores.

          Also a selection of neurons are far more heavily connected than 1000s of synapses, and they're fairly essential ones. Might these be a critical path?

          Sure would be cool to build such a beast, do some random connections, and see what happens...

          • Re: (Score:3, Interesting)

            I think your post was intended humorously, but I'm going to pretend otherwise. (Note, I'm not a specialist in computational mentalistics, or whatever the field would be called, but:)

            I'm fairly certain the interconnects are fast enough. The brain is no speed demon on individual connections. It's basically chemical, with only a little electrical stuff on top that's still based on ions floating in liquid.

            The problem is the software. And the sensoria. And the effectors.

            Each of those problems is being addre

        • Re:Heartening... (Score:5, Interesting)

          by smallfries (601545) on Sunday November 23 2008, @07:38AM (#25863971) Homepage

          Your figures are off by several orders of magnitude. 2500 of these is roughly 10,000 TFLOPS. As a TFLOPS is 10^12 operations per second, and we have 10^11 neurons, that leaves 10^5 floating point operations per neuron per second. If each has 1000 synapses to process then we are down to 100 operations per connection, per second.

          At this point it seems obvious that you've assumed a really simplistic model of a neuron that can compute a synaptic value in a single floating point operation. These simple neuron models don't behave like a real brain, and scaling up simulations of them doesn't produce anything interesting. Real neurons are capable of computing much more complex functions than these models. The throughput on the interconnect is going to be a major factor, and simulating each neuron will require from 10s to 1000000s of operations depending on the level of biological realism that is required. The Blue Brain project has a lot of interesting material on different models of the neuron and the tradeoff between performance and realism.

          Their end goal is to dedicate a large IBM Blue Gene to simulating an entire column within the brain (roughly 1,000,000 neurons) using a biologically-realistic model.
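The per-neuron budget worked out above can be written down as a short calculation. A minimal sketch using peak figures only (sustained throughput would presumably be lower); the 100 Hz firing rate is taken from the parent comment, and the names are illustrative.

```python
# Compute budget per neuron for 2500 machines at 4 TFLOPS each (peak).
MACHINES = 2500
MACHINE_FLOPS = 4e12
NEURONS = 1e11
SYNAPSES_PER_NEURON = 1000
FIRING_RATE_HZ = 100  # from the parent comment

total_flops = MACHINES * MACHINE_FLOPS                       # 10,000 TFLOPS
flops_per_neuron = total_flops / NEURONS                     # per neuron, per second
flops_per_synapse = flops_per_neuron / SYNAPSES_PER_NEURON   # per connection, per second
flops_per_event = flops_per_synapse / FIRING_RATE_HZ         # per synaptic event

print(f"total: {total_flops:.0e} flops")
print(f"per neuron per second: {flops_per_neuron:.0f}")
print(f"per synapse per second: {flops_per_synapse:.0f}")
print(f"per synaptic event at 100 Hz: {flops_per_event:.0f}")
```

At roughly one operation per synaptic event, only the simplest neuron models fit the budget, which is the objection being made here.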

          • Re:Heartening... (Score:5, Informative)

            by LeDopore (898286) on Sunday November 23 2008, @10:13AM (#25864527) Homepage Journal

            You're right unless there's a computational way to take advantage of the fact that most neurons in cortex pretty much never fire (1), and that a small minority of synapses are responsible for nearly all of the excitation in a slab of cortical tissue (2). If not active == not important == not necessary to simulate with a 100% duty cycle (these are big "ifs"), then we could be literally about 3-5 orders of magnitude closer to being able to simulate whole brains than anyone realizes.

            (1) How silent is the brain: is there a "dark matter" problem in neuroscience? Shy Shoham, Daniel H. O'Connor, Ronen Segev. J Comp Physiol A (2006)

            (2) Highly Nonrandom Features of Synaptic Connectivity in Local Cortical Circuits. Sen Song, Per Jesper Sjöström, Markus Reigl, Sacha Nelson, Dmitri B. Chklovskii. PLoS Biology, March 2005

  • 4 TFLOPS? (Score:5, Insightful)

    by Anonymous Coward on Sunday November 23 2008, @04:38AM (#25863385)

    A single Radeon 4870x2 is 2.4 TFLOPS. Some supercomputer, that.

    Seriously, why is this even news? nVidia makes a product, which is OK, but nothing revolutionary. The devaluation of the "supercomputer" term is appalling.

    Also, how much of that 4 TFLOPS can you get on actual applications? How's FFT? Or LINPACK?

    • Re:4 TFLOPS? (Score:5, Informative)

      by GigaplexNZ (1233886) on Sunday November 23 2008, @05:30AM (#25863599)

      A single Radeon 4870x2 is 2.4 TFLOPS.

      A single Radeon 4870x2 uses two chips. This Tesla thing uses 4 chips that are comparable to the Radeon ones. It should be obvious that they would be in a similar ballpark.

      Seriously, why is this even news?

      It isn't. Tesla was released a while ago, this is just a slashvertisement.
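The "similar ballpark" point is easy to check per chip. A tiny sketch using only the peak figures quoted in this thread (2.4 TFLOPS for the two-chip Radeon, 4 TFLOPS for the four-chip Tesla system); the names are illustrative.

```python
# Peak single-precision throughput per chip, from the figures quoted above.
RADEON_4870X2_TFLOPS = 2.4
RADEON_CHIPS = 2
TESLA_SYSTEM_TFLOPS = 4.0
TESLA_CHIPS = 4

radeon_per_chip = RADEON_4870X2_TFLOPS / RADEON_CHIPS
tesla_per_chip = TESLA_SYSTEM_TFLOPS / TESLA_CHIPS

print(f"Radeon: {radeon_per_chip:.1f} TFLOPS per chip")
print(f"Tesla:  {tesla_per_chip:.1f} TFLOPS per chip")
```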

      • NVIDIA has done a good job of making the processing power accessible to programmers that are not GPU coding experts. In addition, they have made hardware changes to better support the type of scientific computation being done on these devices.

        So, while in theory you could put together some Radeons, work with their API and achieve the same thing, NVIDIA has significantly reduced the level of effort to make it happen.
    • Re:4 TFLOPS? (Score:4, Interesting)

      by hairyfeet (841228) <bassbeast1968&gmail,com> on Sunday November 23 2008, @10:41AM (#25864667)
      The problem is how you actually define a supercomputer. I mean, do only machines released in the past month count? Or do you still count the original bad boys like the Cray? After all, when first built, most Crays were multi-million-dollar number-crunching beasts. Does the fact that you can get the same performance in a desktop now mean the Cray no longer counts? The power of computers is still growing at such a pace that a machine that cost millions a decade ago can probably be beaten by a cluster that would cost you less than 25K today, so how exactly would you suggest they define supercomputer?
  • by dgun (1056422) on Sunday November 23 2008, @04:48AM (#25863409) Homepage
    What a rip.
  • At first glance I thought these used actual Tesla coils [wikipedia.org] in the processor, or the devices were at least powered or cooled by some apparatus that used Tesla coils.

    Turns out "Tesla" is just the name of the product.

    Drat. I demand a refund.
      • . . . that's probably exactly the person who would buy one of these.

        Folks who are professionally working on mainstream problems that require supercomputers, well, they probably have access to one already. (Maybe one of the supercomputing folks might want to chime in here; do you have enough access/time? Would a baby-supercomputer be useful to you?)

        But there is certainly someone out there who was denied access, because his idea was rejected by peer review. He is considered a loopy nut bag, because he

  • by Anonymous Coward on Sunday November 23 2008, @04:50AM (#25863421)

    The toolchain is binary only and has an EULA that prohibits reverse engineering.

  • by rdnetto (955205) on Sunday November 23 2008, @05:08AM (#25863503)

    4 Teraflops should be more than enough for anybody...

  • weak DP performance (Score:5, Informative)

    by Henriok (6762) on Sunday November 23 2008, @05:53AM (#25863655)
    In supercomputing circles (i.e. Top500.org) double precision floating point operations seem to be what is desired. 4 TFLOPS single precision, while impressive, is overshadowed by the comparatively weak 80 GFLOPS double precision, beaten by a single PowerXCell 8i (successor to the Cell in PS3) or the latest crop of Xeons. I'm sure Tesla will find its users but we won't see them on the Top500 list anytime soon.
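To put the single versus double precision gap in numbers, here is a small sketch using the per-GPU figures from the story summary (about 1 TFLOPS single precision and about 80 GFLOPS double precision per GPU, four GPUs per machine). Note that the 80 GFLOPS figure is per GPU, so the whole machine would peak at about four times that; these are peak numbers, not benchmark results.

```python
# Peak figures per GPU, as given in the story summary.
GPUS = 4
SP_GFLOPS_PER_GPU = 1000  # about 1 TFLOPS single precision
DP_GFLOPS_PER_GPU = 80    # about 80 GFLOPS double precision

system_sp_gflops = GPUS * SP_GFLOPS_PER_GPU
system_dp_gflops = GPUS * DP_GFLOPS_PER_GPU
sp_to_dp_ratio = SP_GFLOPS_PER_GPU / DP_GFLOPS_PER_GPU

print(f"system single precision: {system_sp_gflops} GFLOPS")
print(f"system double precision: {system_dp_gflops} GFLOPS")
print(f"single/double ratio: {sp_to_dp_ratio}")
```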
  • There were a lot of early efforts trying to implement realtime raytracing engines for games (e.g. at Intel recently [intel.com]), let's port that stuff and have some fun.
    • On that note, it would be a good development platform for realtime raytraced game engines. That way the code would be mature when affordable GPUs come out that can match that level of performance.

  • Erlang (Score:3, Interesting)

    by Safiire Arrowny (596720) on Sunday November 23 2008, @06:25AM (#25863763) Homepage
    So how do you get an Erlang system to run on this?
    • Re: (Score:3, Insightful)

      By writing an Erlang-to-CUDA compiler?

      More seriously though, it is probably not worth even trying, since the GPUs used in the Tesla support a very limited model of parallelism. Shoehorning the flexibility of Erlang into that would at the very least result in a dramatic performance loss, if it is possible at all.

  • by bsDaemon (87307) on Sunday November 23 2008, @06:58AM (#25863841) Homepage

    ... AMD has announced today its new Edison Personal Supercomputer technology.

    The game is on.

  • by Gearoid_Murphy (976819) on Sunday November 23 2008, @07:16AM (#25863913)
    It's not about how many cores you have but how efficiently they can be used. If your CUDA application is in any way memory intensive, you're going to experience a serious drop in performance. A read from the local cache is 100 times faster than a read from main RAM. This cache is only 16 KB. I spend most of my time figuring out how to minimise data transfers. That said, CUDA is probably the only platform that offers a realistic means for a single machine to tackle problems requiring gargantuan computing resources.
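A rough way to see why memory-intensive code suffers: the sketch below uses a deliberately simple two-level cost model in which a cached read costs 1 unit and a main-memory read costs 100 (the ratio quoted above). The model and the function names are assumptions made for illustration, not measurements of any particular GPU.

```python
CACHE_BYTES = 16 * 1024   # the 16 KB local cache mentioned above
FLOAT_BYTES = 4           # one single-precision float
CACHE_COST = 1            # relative cost of a cached read
MAIN_MEMORY_COST = 100    # main-memory read, 100x slower per the comment


def floats_in_cache(cache_bytes=CACHE_BYTES, float_bytes=FLOAT_BYTES):
    """Number of single-precision values that fit in the cache at once."""
    return cache_bytes // float_bytes


def average_read_cost(hit_fraction):
    """Average cost per read when hit_fraction of reads come from cache."""
    if not 0.0 <= hit_fraction <= 1.0:
        raise ValueError("hit_fraction must be between 0 and 1")
    return hit_fraction * CACHE_COST + (1.0 - hit_fraction) * MAIN_MEMORY_COST


print(floats_in_cache())
for hit in (1.0, 0.9, 0.5, 0.0):
    print(f"hit fraction {hit}: average cost {average_read_cost(hit):.1f}")
```

Even with 90% of reads served from cache, the average read in this model is about ten times slower than a pure cache hit, which is why minimising data transfers dominates the work.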
    • by anon mouse-cow-aard (443646) on Sunday November 23 2008, @07:52AM (#25864025) Journal
      People are always coming out of the woodwork to claim supercomputer performance with such and such a solution. Go back and look at GRAPE (which is really cool): http://arstechnica.com/news.ars/post/20061212-8408.html [arstechnica.com], or at a lot of other supercomputer clusters. When you want something flexible, you look for "balance": a good relationship between memory capacity, latency and bandwidth, as well as compute power. In terms of memory capacity, the number people talk about is 1 byte/flop; that is, 1 TB of memory is about right to keep 1 TFLOPS flexibly useful. This thing has 4 GB of memory for 4 TFLOPS, in other words 1 byte per 1000 flops. It's going to be hard to use in a general-purpose way.
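The balance rule of thumb can be checked directly. A minimal sketch using the figures from the comment above (4 GB of memory against 4 TFLOPS peak); the 1 byte/flop target is the commenter's rule of thumb, not a hard specification.

```python
# Memory-to-compute balance, using the figures from the comment above.
MEMORY_BYTES = 4e9             # 4 GB
PEAK_FLOPS = 4e12              # 4 TFLOPS
TARGET_BYTES_PER_FLOP = 1.0    # rule of thumb: 1 byte per flop

bytes_per_flop = MEMORY_BYTES / PEAK_FLOPS
flops_per_byte = PEAK_FLOPS / MEMORY_BYTES
memory_for_balance_bytes = PEAK_FLOPS * TARGET_BYTES_PER_FLOP

print(f"bytes per flop: {bytes_per_flop}")
print(f"flops per byte: {flops_per_byte:.0f}")
print(f"memory needed for 1 byte/flop: {memory_for_balance_bytes / 1e12:.0f} TB")
```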
  • Patmos International (Score:3, Interesting)

    by Danzigism (881294) on Sunday November 23 2008, @08:57AM (#25864197)
    Ahh yes, the idea of personal supercomputing. Back in '99 I worked for Patmos International. We were at the Linux Expo for that year as well, if some of you might remember. Our dream was to have a parallel supercomputer in everyone's home. We used mostly Lisp and Daisy for the programming aspect. The idea was wonderful, but eventually came to a screeching halt when nothing was being sold. It was ahead of its time for sure. You can find out a little more about it here. [archive.org] I find the whole idea of symbolic multiprocessing very fascinating though.
    • Re: (Score:3, Insightful)

      It IS marketed for academia. Normal users don't really need to fold proteins or simulate nuclear weapons at home.
          • but I don't know enough about it to be able to give useful information on the subject.

            I do write some CUDA code, so I'll try to help.

            I believe that each of the chips has a 512 bit wide bus to 4GiB of memory.

            Indeed, each physical package has full access to its own whole chunk of memory, regardless of how many "cores" the package contains (between 2 for the lowest-end laptop GPUs and 16 for the highest-end 8/9800 cards. Don't know about GT280. But the summary is wrong: 240 is probably the number of ALUs or the width of the SIMD) and regardless of how many "stream processors" there are (each core has 8 ALUs, which are exposed as 32-wide SIMD processing units, which i

    • And then there is the whole "ECONOMY" thing.

      The whole reason the ECONOMY is in the tank is because there are not enough people like you taking loans out against their house to buy random stuff like this.

      Basically... IT'S ALL YOUR FAULT!

       

    • Re: (Score:3, Interesting)

      by Anonymous Coward

      It's cultural.

      You're not even allowed to say that you're "coding", but only that you produce "codes".

      Maybe it's because analytic science is based on equations which become algorithms in computing, and you can't say that you're "equationing" nor "algorithming".

      In practice it's actually dishonest, because the algorithms don't have the conceptual power of the equations that they represent (they would if programmed in LISP, but "codes" are mostly written in Fortran and C), so the computations are often question

    • Weird options (Score:4, Insightful)

      by mangu (126918) on Sunday November 23 2008, @06:03AM (#25863691)

      I went to the site and tried to configure one. The disk partition options are: "General Purpose, Internet Server, Developer's Workstation, File Server". I wonder, who needs three Tesla cards in a file server or an internet server?

    • It also runs Python (Score:4, Informative)

      by mangu (126918) on Sunday November 23 2008, @06:15AM (#25863729)

      Look, there's Python here [nvidia.com]. You can do the low-level high-performance core routines in C, and use Python to do all the OO programming. This is how God intended us to program.

      • by xororand (860319) on Sunday November 23 2008, @06:26AM (#25863765)

        OO is very good for graphical interfaces, but it isn't particularly well suited for algorithms and other maths oriented stuff.

        The term OO is too general to make a statement about its usefulness for mathematics oriented problems. The powerful templating features of modern C++ are indeed very useful for numerical simulations:

        It's called C++ Expression Templates, an excellent tool for numerical simulations. ETs can get you very close to the performance of hand optimized C code while they're much more comfortable to use than plain C. Parallelization is also relatively easy to achieve with expression templates.

        A research team at my university actually uses expression templates to build some sort of meta compiler which translates C++ ETs into CUDA code. They use it to numerically simulate laser diodes.

        Search for papers by David Vandevoorde & Todd Veldhuizen if you want to know more about this. They both developed the technique independently.

        Vandevoorde also explains ETs to some degree in his excellent book "C++ Templates - The Complete Guide".
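For readers who have not met expression templates, the core idea is that operators build a description of the computation, which is then evaluated in one fused loop with no temporary vectors. The sketch below is a loose Python analogy: it works with runtime objects rather than compile-time templates, so it shows the structure of the idea and says nothing about performance. All class names are hypothetical.

```python
class Expr:
    """Base class: operators build an expression tree instead of a result."""

    def __add__(self, other):
        return BinaryOp(self, other, lambda x, y: x + y)

    def __mul__(self, other):
        return BinaryOp(self, other, lambda x, y: x * y)

    def evaluate(self):
        # One pass over the indices; no intermediate vectors are allocated.
        return [self.at(i) for i in range(len(self))]


class Vector(Expr):
    """Leaf node holding actual data."""

    def __init__(self, values):
        self.values = list(values)

    def __len__(self):
        return len(self.values)

    def at(self, i):
        return self.values[i]


class BinaryOp(Expr):
    """Interior node: combines two sub-expressions element by element."""

    def __init__(self, left, right, op):
        if len(left) != len(right):
            raise ValueError("operands must have the same length")
        self.left, self.right, self.op = left, right, op

    def __len__(self):
        return len(self.left)

    def at(self, i):
        return self.op(self.left.at(i), self.right.at(i))


a = Vector([1.0, 2.0, 3.0])
b = Vector([4.0, 5.0, 6.0])
c = Vector([2.0, 2.0, 2.0])

lazy = a + b * c           # builds a tree, computes nothing yet
result = lazy.evaluate()   # single fused loop over the elements
print(result)              # [9.0, 12.0, 15.0]
```

In C++ the tree is encoded in template types, so the compiler can inline the whole expression into one loop; this Python version pays for the indirection at run time.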

        • Re: (Score:3, Informative)

          Actually yes it is. For instance nobody has yet figured out an efficient matrix class in C++ that uses operator overloading. This is basically an impossible task to write B=A*X*A^t efficiently, which occurs all the time in linear analysis, because in C++ the transpose would require a copy operator, whereas one ought to get the job done simply with a different iterator. C++ is not equipped for this yet.
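The "different iterator" idea mentioned above, reading the transpose through swapped indices instead of copying it, can be sketched as follows. This is a plain Python illustration with hypothetical class names; it takes no position on whether the same can be done efficiently with C++ operator overloading.

```python
class Matrix:
    """Dense row-major matrix with element access through get(i, j)."""

    def __init__(self, rows):
        self.rows = [list(r) for r in rows]
        self.shape = (len(self.rows), len(self.rows[0]))

    def get(self, i, j):
        return self.rows[i][j]


class TransposedView:
    """Reads through to the underlying matrix with indices swapped; no copy."""

    def __init__(self, base):
        self.base = base
        self.shape = (base.shape[1], base.shape[0])

    def get(self, i, j):
        return self.base.get(j, i)


def matmul(left, right):
    """Naive product of anything exposing shape and get(i, j)."""
    n, k = left.shape
    k2, m = right.shape
    if k != k2:
        raise ValueError("inner dimensions do not match")
    return Matrix(
        [[sum(left.get(i, p) * right.get(p, j) for p in range(k))
          for j in range(m)]
         for i in range(n)]
    )


A = Matrix([[1, 2], [3, 4]])
X = Matrix([[1, 0], [0, 1]])   # identity, to keep the check easy to follow

B = matmul(matmul(A, X), TransposedView(A))   # B = A * X * A^t
print(B.rows)                                  # [[5, 11], [11, 25]]
```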

      • Re: (Score:3, Informative)

        OOP with virtual and all, yes. OOP with template magic to allow the compiler to do specializations can beat the heck out of even quite tediously hand-written C or FORTRAN, with much superior readability.
        • by SpinyNorman (33776) on Sunday November 23 2008, @09:44AM (#25864379)

          From NVIDIA's CUDA site, most of their regular display cards support CUDA, just with fewer cores (hence less performance) than the Tesla card. The cores that CUDA uses are what used to be called the vertex shaders on your (NVIDIA) card. The CUDA API is designed so that your code doesn't know/specify how many cores are going to be used - you just code to the CUDA architecture and at runtime it distributes the workload to the available cores... so you can develop for a low end card (or they even have an emulator) then later pay for the hardware/performance you need.
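The point about code not knowing the core count can be illustrated with the usual block/thread decomposition: work is split into blocks, each thread derives its global index from its block and thread ids, and the runtime schedules blocks onto whatever cores exist. The sketch below is a pure-Python emulation that runs everything sequentially; `launch` and `saxpy_kernel` are hypothetical names, not part of the CUDA API.

```python
def saxpy_kernel(block_idx, thread_idx, block_dim, a, x, y, out):
    """Body executed by one emulated thread: out[i] = a * x[i] + y[i]."""
    i = block_idx * block_dim + thread_idx   # global index for this thread
    if i < len(x):                           # guard: the last block may be partial
        out[i] = a * x[i] + y[i]


def launch(kernel, n, block_dim, *args):
    """Run enough blocks to cover n elements (sequentially, for illustration)."""
    grid_dim = (n + block_dim - 1) // block_dim   # ceiling division
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, thread_idx, block_dim, *args)
    return grid_dim


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [10.0, 20.0, 30.0, 40.0, 50.0]
out = [0.0] * len(x)

blocks = launch(saxpy_kernel, len(x), 2, 2.0, x, y, out)
print(blocks, out)   # 3 [12.0, 24.0, 36.0, 48.0, 60.0]
```

The kernel body never mentions how many cores are present; on real hardware the blocks would be spread across the available multiprocessors, which is why the same code can run on a low-end card or a Tesla.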

        • Re: (Score:3, Informative)

          The 10K refers to a rack mount solution containing 4xGPUs. You can still buy a single GPU and try and put it in a standard machine (provided it doesn't melt - I'd read the specs) for about a quarter of the price.