NVIDIA's $10K Tesla GPU-Based Personal Supercomputer
gupg writes "NVIDIA announced a new category of supercomputer, the Tesla Personal Supercomputer: a 4-TeraFLOPS desktop for under $10,000. The machine holds four Tesla C1060 computing processors. These GPUs have no graphics output and are used only for computing. Each Tesla GPU has 240 cores and delivers about 1 TeraFLOPS of single-precision and about 80 GigaFLOPS of double-precision floating-point performance. The CPU and GPU are programmed in C with a few added keywords, using NVIDIA's CUDA parallel programming model; the CUDA C compiler and development toolchain are free to download. Many applications have been ported to CUDA, including Mathematica, LabVIEW, ANSYS Mechanical, and a wide range of scientific codes in molecular dynamics, quantum chemistry, and electromagnetics; they're listed on CUDA Zone."
Binary-only toolchain (Score:5, Informative)
The toolchain is binary only and has an EULA that prohibits reverse engineering.
Re:Binary-only toolchain (Score:5, Informative)
has an EULA that prohibits reverse engineering.
Not really a big deal to those of us in the EU since we have a legally guaranteed right to reverse engineer stuff for interoperability purposes.
Re:4 TFLOPS? (Score:5, Informative)
A single Radeon 4870x2 is 2.4 TFLOPS.
A single Radeon 4870x2 uses two chips. This Tesla thing uses 4 chips that are comparable to the Radeon ones. It should be obvious that they would be in a similar ballpark.
Seriously, why is this even news?
It isn't. Tesla was released a while ago, this is just a slashvertisement.
FTFL (Score:3, Informative)
All you need to do is follow the fscking link [nvidia.com]. Plenty of examples there.
It also runs Python (Score:4, Informative)
Look, there's Python here [nvidia.com]. You can do the low-level high-performance core routines in C, and use Python to do all the OO programming. This is how God intended us to program.
Re:Only in C? Oh dear. (Score:5, Informative)
OO is very good for graphical interfaces, but it isn't particularly well suited for algorithms and other maths oriented stuff.
The term "OO" is too general to support a blanket statement about its usefulness for mathematics-oriented problems. The powerful templating features of modern C++ are in fact very useful for numerical simulations:
The technique is called C++ expression templates, an excellent tool for numerical work. ETs can get you very close to the performance of hand-optimized C code while being much more comfortable to use than plain C. Parallelization is also relatively easy to achieve with expression templates.
A research team at my university actually uses expression templates to build some sort of meta compiler which translates C++ ETs into CUDA code. They use it to numerically simulate laser diodes.
Search for papers by David Vandevoorde & Todd Veldhuizen if you want to know more about this. They both developed the technique independently.
Vandevoorde also explains ETs to some degree in his excellent book "C++ Templates - The Complete Guide".
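If you've never seen the trick, here's a minimal sketch of the idea (toy names of my own, not from any real library): operator+ returns a lightweight proxy instead of allocating a temporary vector, and the one real loop runs only at assignment time.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Proxy representing "l + r" without evaluating it. L and R can be
// vectors or other proxies, so whole expressions nest at compile time.
template <typename L, typename R>
struct Sum {
    const L& l;
    const R& r;
    double operator[](std::size_t i) const { return l[i] + r[i]; }
    std::size_t size() const { return l.size(); }
};

struct Vec {
    std::vector<double> data;
    explicit Vec(std::size_t n, double v = 0.0) : data(n, v) {}
    double  operator[](std::size_t i) const { return data[i]; }
    double& operator[](std::size_t i)       { return data[i]; }
    std::size_t size() const { return data.size(); }

    // Assignment from any expression: a single loop, no temporaries.
    template <typename E>
    Vec& operator=(const E& e) {
        for (std::size_t i = 0; i < size(); ++i) data[i] = e[i];
        return *this;
    }
};

// Builds the proxy; nothing is computed here.
template <typename L, typename R>
Sum<L, R> operator+(const L& l, const R& r) { return Sum<L, R>{l, r}; }
```

So `out = a + b + c` builds a `Sum<Sum<Vec,Vec>,Vec>` at compile time and evaluates it element-by-element in one pass, which is exactly what a hand-written C loop would do.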
Re:Graphics (Score:3, Informative)
In much the same way, the current Quadro FX cards are based on the same chip as the gaming GeForce cards. Yet the most expensive gaming card is ~£400, while you'll pay ~£1500 for the top-of-the-line FX5700.
That's because workstation graphics cards are configured for accuracy above all else, whereas gaming cards are configured for speed. A few wrong pixels don't affect gaming at all; getting the numbers wrong in a simulation is going to cause problems.
Mostly the people who use these cards care about OpenGL support, but some do use them under Windows and DirectX.
This type of computing came in with the GeForce 8 range, when CUDA (Compute Unified Device Architecture) brought C programming to the massively parallel graphics chips. That in turn allowed NVIDIA to port the Ageia PhysX technology to GeForce cards, so a separate add-in card is no longer necessary.
I believe ATI is doing something similar with its FireGL cards, which again are based on the same chip as the Radeon cards. This is why both companies have moved from separate vertex/shader units to unified stream processors. It's a really interesting development if you happen to work in a research establishment; otherwise, move along, nothing to see here.
Re:Can I have a smaller version? (Score:4, Informative)
From NVIDIA's CUDA site: most of their regular display cards support CUDA, just with fewer cores (and hence less performance) than the Tesla card. The cores that CUDA uses are the unified shader units on your (NVIDIA) card, which replaced the old separate vertex and pixel shaders. The CUDA API is designed so that your code doesn't know or specify how many cores will be used; you just code to the CUDA architecture, and at runtime it distributes the workload across the available cores. So you can develop on a low-end card (or even their emulator) and later pay for the hardware/performance you need.
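To illustrate (a hypothetical sketch, my own names, not NVIDIA's sample code): the launch is expressed as "N elements' worth of work," and the driver maps the resulting blocks onto however many multiprocessors the installed card happens to have.

```cuda
// Scale a float array in place. The kernel never mentions core counts.
__global__ void scale(float *x, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)              // guard: the grid may overshoot n
        x[i] = s * x[i];
}

// Host side: round the grid up so every element is covered; the same
// call runs unchanged on a 2-core laptop GPU or a full Tesla board.
void scale_on_gpu(float *d_x, float s, int n)
{
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    scale<<<blocks, threads>>>(d_x, s, n);
}
```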
Re:Heartening... (Score:5, Informative)
You're right unless there's a computational way to take advantage of the fact that most neurons in cortex pretty much never fire (1), and that a small minority of synapses are responsible for nearly all of the excitation in a slab of cortical tissue (2). If not active == not important == not necessary to simulate with a 100% duty cycle (these are big "ifs"), then we could be literally about 3-5 orders of magnitude closer to being able to simulate whole brains than anyone realizes.
(1) Shoham, S., O'Connor, D. H., and Segev, R. "How silent is the brain: is there a 'dark matter' problem in neuroscience?" J Comp Physiol A (2006)
(2) Song, S., Sjöström, P. J., Reigl, M., Nelson, S., and Chklovskii, D. B. "Highly Nonrandom Features of Synaptic Connectivity in Local Cortical Circuits." PLoS Biology, March 2005
Re:Only in C? Oh dear. (Score:3, Informative)
Actually, yes it is. For instance, nobody has yet produced an efficient matrix class in C++ built on operator overloading alone. Writing B = A*X*A^t efficiently, which comes up all the time in linear analysis, is basically impossible that way, because in C++ the transpose would require a copy, whereas one ought to get the job done simply with a different iterator. C++ isn't equipped for this yet.
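A sketch of what "a different iterator" could look like: a transpose view that merely swaps indices on access, so A^t never gets copied (toy classes of my own, deliberately naive triple loop, not any real library):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy row-major matrix.
struct Mat {
    std::size_t rows, cols;
    std::vector<double> a;
    Mat(std::size_t r, std::size_t c) : rows(r), cols(c), a(r * c, 0.0) {}
    double& operator()(std::size_t i, std::size_t j)       { return a[i * cols + j]; }
    double  operator()(std::size_t i, std::size_t j) const { return a[i * cols + j]; }
};

// Lazy transpose: no data moves; indices are swapped on each access.
struct Transposed {
    const Mat& m;
    double operator()(std::size_t i, std::size_t j) const { return m(j, i); }
};

// Multiply any two index-callable operands, so Mat and Transposed mix
// freely. Dimensions are passed explicitly since the view carries none.
template <typename A, typename B>
Mat mul(const A& x, std::size_t xr, std::size_t inner,
        const B& y, std::size_t yc) {
    Mat out(xr, yc);
    for (std::size_t i = 0; i < xr; ++i)
        for (std::size_t k = 0; k < inner; ++k)
            for (std::size_t j = 0; j < yc; ++j)
                out(i, j) += x(i, k) * y(k, j);
    return out;
}
```

With this, `mul(mul(A, n, n, X, n), n, n, Transposed{A}, n)` computes B = A*X*A^t with no extra copy of A; whether a given compiler fuses it as well as hand-written C is of course another question.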
CUDA memory structure (Score:3, Informative)
but I don't know enough about it to be able to give useful information on the subject.
I do write some CUDA code, so I'll try to help.
I believe that each of the chips has a 512 bit wide bus to 4GiB of memory.
Indeed, each physical package has full access to its own chunk of memory, regardless of how many "cores" the package contains (between 2 for the lowest-end laptop GPUs and 16 for the highest-end 8800/9800 cards; I don't know about the GT200, but the summary's 240 is probably the number of ALUs, i.e. the total SIMD width) and regardless of how many "stream processors" there are (each core has 8 ALUs, exposed as 32-wide SIMD processing units, which in turn can keep up to 768 threads in flight thanks to some clever hyperthreading-like scheduling).
So in one single GPU card all the memory is accessible.
In a dual-GPU SLI card, each GPU has a full access to its own memory.
So, in our situation, it's 4GiB for each Tesla Card.
Then each core has a special internal memory which is shared by all the 32-to-768 threads running in parallel on the SIMD. (A couple of KiB; I don't have the exact number handy.)
I'm not sure what the memory allocation per stream processor is but I think the other parts of the chip control what goes where.
There's no actual per-stream-processor control of memory. There is something that looks like "per-thread memory," but it's actually memory auto-allocated out of global memory.
(It's all the same global memory; the compiler just makes sure that each thread uses a different chunk of it to avoid conflicts.)
And you actually do not control the stream-processors themselves.
You write a kernel (a piece of code that will process a mass of data) and throw a number of threads at one GPU (one physical package, i.e. either one normal graphics card or half of a dual-GPU SLI card).
The scheduler will dynamically spread all the concurrent threads among the SIMD processors on the GPU.
There probably are some bottlenecks
Yes, indeed:
- Those 4 GiB aren't cached at all (which is why it's preferable to touch them only at the beginning and end of a calculation and use other types of memory in between), they have high latency (which is why it's better to have lots of threads running together, so the scheduler can switch threads to hide the latency), and you have to access them in a special pattern so that reads and writes get grouped together for faster access.
- Then there's texture access. Using a special set of functions, you can read memory not directly but as if it were textures. It still has high latency and is read-only; on the other hand, it is cached, so it has much better bandwidth, and the texture units don't require any special ordering of accesses.
- The last type is an ultra-fast on-chip read-write memory shared by all the threads executing at the same time on the same core. But its access pattern is peculiar, because everything goes through banks (one bank per thread, or all threads reading the same bank; never many-to-many).
So, in the end writing good CUDA code requires some voodoo magic to correctly organise your stuff into memory in the most efficient way.
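As a concrete (simplified, hypothetical) example of the usual staging pattern: each block does one grouped read of its slice of the uncached global memory into the fast on-chip shared memory, and all the repeated accesses then happen there.

```cuda
// Illustrative kernel: per-block partial sums of n floats.
__global__ void block_sum(const float *in, float *out, int n)
{
    __shared__ float buf[256];          // on-chip memory, shared by the block
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;

    // One coalesced read per thread from high-latency global memory.
    buf[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Tree reduction entirely in shared memory; strides of this shape
    // keep each active thread on a distinct bank (no bank conflicts).
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            buf[tid] += buf[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        out[blockIdx.x] = buf[0];       // one partial sum per block
}
```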
Re:Can I have a smaller version? (Score:3, Informative)
The $10K refers to a rack-mount solution containing 4 GPUs. You can still buy a single GPU and try to put it in a standard machine (provided it doesn't melt; I'd read the specs) for about a quarter of the price.