TACC "Stampede" Supercomputer To Go Live In January 67
Nerval's Lobster writes "The Texas Advanced Computing Center plans to go live on January 7 with "Stampede," a ten-petaflop supercomputer predicted to be the most powerful Intel supercomputer in the world once it launches. Stampede should also be among the top five supercomputers in the TOP500 list when it goes live, Jay Boisseau, TACC's director, said at the Intel Developer Forum Sept. 11. Stampede was announced a bit more than two years ago. Specs include 272 terabytes of total memory and 14 petabytes of disk storage. TACC said the compute nodes would include "several thousand" Dell Stallion servers, with each server boasting dual 8-core Intel E5-2680 processors and 32 gigabytes of memory. In addition, TACC will include a special pre-release version of the Intel MIC, or "Knights Corner" architecture, which has been formally branded as Xeon Phi. Interestingly, the thousands of Xeon compute nodes should generate just 2 petaflops worth of performance, with the remaining 8 petaflops generated by the Xeon Phi chips, which provide highly parallelized computational power for specialized workloads."
Why so little memory? (Score:5, Interesting)
I wonder why it's got so little memory. You can easily run 64GB per socket at full speed with the E5-2600 (16GB x 4 channels) without spending that much money. Heck, for maybe 10% more you can run 128GB per socket (you need RDIMMs to run two 16GB modules per bank). They're apparently only running one 16GB DIMM per socket (any other configuration would be slower on the E5), which IMHO is crazy, as you're going to have a hard time keeping 8 cores busy with such a small amount.
GDDR5 (Score:2)
The Knights Corner chips use GDDR5 memory; bandwidth is a big problem when you have 50+ cores to feed.
Ooops. (Score:2)
Oops, scratch that, I misread the summary. There probably isn't a need for that much memory, because the kinds of problems they are most likely to be dealing with will have massive datasets that don't fit in memory anyway. The limiting factor will be CPU and node interconnect bandwidth, so adding extra memory won't make much if any difference to performance.
Re: (Score:2)
Re: (Score:1)
Re: (Score:2)
Re: (Score:3)
You can easily run 64GB per socket at full speed with the E5-2600 (16GB x 4 channels) without spending that much money. Heck, for maybe 10% more you can run 128GB per socket (you need RDIMMs to run two 16GB modules per bank).
As TFA puts it:
" ... the compute nodes would include "several thousand" Dell Stallion servers, with each server boasting dual 8-core Intel E5-2680 processors and 32 gigabytes of memory"
I am guessing it might have something to do with budget.
The way I look at it, they are populating each memory slot with 4GB of el-cheapo DDR3 DRAM. With dual E5-2680s that is eight memory channels per node, so 8 x 4GB = 32GB puts one DIMM on every channel, full bandwidth at minimum cost, and that way they may be saving quite a bit of $$$ to buy more Dell servers.
Re: (Score:2)
Re: (Score:2)
To program the MIC you need to design your program so that each thread only requires 128 MB of RAM anyway...
Re: (Score:1)
Ah, another esoteric, nearly-impossible-to-program-for architecture that flies for some problem sets but is nearly unapproachable for the non-CS science folks. I mean, it's great that such things exist, I guess, and in theory they can have great FLOPS/watt figures, but I wonder how much science will really get accomplished per dollar spent compared to something where standard code just runs.
Re: (Score:2)
How is that esoteric? A thread shouldn't require more than this even on a PC. That's also much more than the Cell allowed, which is a similar architecture.
Re: (Score:3, Informative)
Esoteric? Nearly impossible to program for? Methinks you haven't read through the actual docs for it. You can use all the standard Intel tools to program for it, which are also MIC-aware, just like you program for a standard multi-core CPU. That includes the threading and math kernel libraries, as well as OpenCL if you want to go that route.
Re: (Score:2)
I was taking loufoque's comment literally: that you were architecturally limited to 128MB per thread, which would be fairly difficult to code for.
Re: (Score:1)
I'm assuming you got the 128MB number from dividing 8GB by 64 cores. I haven't seen anything to indicate that a core is limited in that way; in fact, Knights Corner caches are coherent.
Re: (Score:2)
Yes, a core can access the memory of other cores.
Your point being?
If you run all cores at the same time, each with its own dataset (which is what you want to do in order to actually use the architecture properly), you'll have that limit for each thread.
Re: (Score:1)
According to the architecture, each core doesn't have its own memory other than the L1 and L2 caches. How the memory is mapped per core is arbitrary; there is nothing stopping you from having, for example, a 4GB shared dataset plus 64MB per core for a raytracer, where the scene data is stored in the shared memory and each core works on part of the scene. So no, you don't have to limit memory per thread to use the architecture properly.
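Rough sketch of the idea in C with OpenMP; the sizes, stub names, and tile count here are made up for illustration, not the card's real numbers:

    /* Shared read-only scene, small private scratch per thread. */
    #include <stdlib.h>
    #include <omp.h>

    #define SCENE_BYTES   (512UL << 20)  /* big shared dataset (illustrative) */
    #define SCRATCH_BYTES (64UL << 20)   /* small per-thread working set */

    static void trace_tile(const unsigned char *scene,
                           unsigned char *scratch, int tile)
    {
        /* ray-trace one tile against the shared scene (stub) */
        (void)scene; (void)scratch; (void)tile;
    }

    int main(void)
    {
        unsigned char *scene = malloc(SCENE_BYTES);  /* one copy for all cores */
        int ntiles = 256;

        #pragma omp parallel
        {
            unsigned char *scratch = malloc(SCRATCH_BYTES);  /* private */
            #pragma omp for schedule(dynamic)
            for (int t = 0; t < ntiles; t++)
                trace_tile(scene, scratch, t);
            free(scratch);
        }
        free(scene);
        return 0;
    }

The scene is allocated once and only read by the workers, so no per-thread partitioning of it is needed.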
Re: (Score:2)
You can do that, but that will only reduce the amount of memory available to each core, not increase it...
You could also have something even less homogeneous, but that would be a nightmare to schedule.
Re: (Score:1)
On a card with 8GB, your effective memory accessible per core is 8GB; lots of problems have large datasets that can be shared over cores, such as the example I gave. In fact, this is a major advantage of MIC versus GPUs.
There is nothing nightmarish about the above; it would appear as just a shared memory area to the process.
Re:Why so little memory? (Score:4, Interesting)
You will be parallelizing, and each thread will only ever be able to use max_mem/N for its own processing.
When you parallelize, you avoid sharing memory between threads. Your data set is split over the threads and synchronization is minimized. In an SMP/NUMA model, this is done transparently, by simply not accessing memory that other threads are working on. In other models, you have to explicitly send the chunk of memory that each thread will be working on (through DMA, the network, an in-memory FIFO, or whatever), but it doesn't change anything from a conceptual point of view.
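A minimal sketch in C with OpenMP of what I mean, assuming a simple reduction; each thread touches only its own slice, and the array size (128 MB here, matching the figure above) is arbitrary:

    /* Each thread reduces only its own slice; the lone synchronization
       point is the final reduction. */
    #include <stdio.h>
    #include <omp.h>

    #define N (16L * 1024 * 1024)   /* 16M doubles = 128 MB */

    static double a[N];

    int main(void)
    {
        for (long i = 0; i < N; i++) a[i] = 1.0;

        double total = 0.0;
        #pragma omp parallel reduction(+:total)
        {
            long nth = omp_get_num_threads();
            long tid = omp_get_thread_num();
            long lo = tid * N / nth;
            long hi = (tid + 1) * N / nth;
            for (long i = lo; i < hi; i++)  /* own slice only */
                total += a[i];
        }
        printf("sum = %.0f\n", total);
        return 0;
    }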
If your parallel decomposition is much more efficient when the data per thread is larger than 1GB, then you cannot possibly run 64 threads set up like this on the MIC platform. There is often a minimum size required for a parallel primitive to be efficient, and if that minimum size is greater than max_mem/N then you have a problem. This is the limiting factor I'm talking about.
128 MB, however, is IMO quite large enough.
The advantage of MIC lies in ease of programming thanks to compatibility with existing tools and the more flexible programming model.
Memory on GPUs is global as well, so I have no idea what you're talking about. There is also so-called "shared" memory (CUDA terminology, OpenCL is different) which is per block, but that's just some local scratch memory shared by a group of threads.
Please stop distorting what I'm saying. What is nightmarish is finding the optimal work distribution and scheduling for a heterogeneous or irregular system.
Platforms like GPUs are only fit for regular problems. Most HPC applications written using OpenMP or MPI are regular as well. Whether the MIC will be able to enable good scalability of irregular problems remains to be seen, but the first applications will definitely be regular ones.
Re: (Score:1)
You will be parallelizing, and each thread will only ever be able to use max_mem/N for its own processing.
When you parallelize, you avoid sharing memory between threads. Your data set is split over the threads and synchronization is minimized. In an SMP/NUMA model, this is done transparently, by simply not accessing memory that other threads are working on. In other models, you have to explicitly send the chunk of memory that each thread will be working on (through DMA, the network, an in-memory FIFO, or whatever), but it doesn't change anything from a conceptual point of view.
If your parallel decomposition is much more efficient when the data per thread is larger than 1GB, then you cannot possibly run 64 threads set up like this on the MIC platform. There is often a minimum size required for a parallel primitive to be efficient, and if that minimum size is greater than max_mem/N then you have a problem. This is the limiting factor I'm talking about. 128 MB, however, is IMO quite large enough.
For algorithms where you have basically regular streaming data, then yes, your working data set will be mem/N. But as I mentioned, there are a number of problems where you have a large, mainly static dataset, such as raytracing or financial modeling. In these scenarios, being able to access a large shared pool of memory has big advantages.
The advantage of MIC lies in ease of programming thanks to compatibility with existing tools and the more flexible programming model.
Memory on GPUs is global as well, so I have no idea what you're talking about. There is also so-called "shared" memory (CUDA terminology, OpenCL is different) which is per block, but that's just some local scratch memory shared by a group of threads.
Accessing global memory on GPUs is extremely slow, and there is a strict memory hierarchy that you have to adhere to in order to get any kind of performance.
Re: (Score:2)
It could be seen as being the same as the CPU, except it will automatically cache it to fast memory for you.
What makes you think it would be faster with the MIC?
Re: (Score:1)
It could be seen as being the same as the CPU, except it will automatically cache it to fast memory for you.
What makes you think it would be faster with the MIC?
Granted, it is the hardware doing the work, but you have four threads per core on Knights Corner, with a cache-line miss causing a context switch that masks the latencies. You also don't have the extra overhead of your code setting up the copies between global and shared memory (which is limited to 48K on CUDA) every time you want to access a data structure. Obviously you have many more cores on a GPU, but how much performance do you think you will get once you have to jump through all those hoops?
Re: (Score:2)
Re: (Score:2)
Seems like you simply want to make sure you parallelize on the L3 cache line boundary to avoid false sharing (same as with regular CPUs).
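Something like this sketch, i.e. pad per-thread data out to a full line; the 64-byte line size and the counter workload are my assumptions:

    /* Per-thread counters padded to a full cache line. */
    #include <stdio.h>
    #include <omp.h>

    #define NTHREADS  8
    #define LINE      64   /* assumed cache line size */

    struct counter {
        _Alignas(LINE) long n;   /* one line per thread, no false sharing */
    };

    int main(void)
    {
        struct counter c[NTHREADS] = {{0}};

        #pragma omp parallel num_threads(NTHREADS)
        {
            int tid = omp_get_thread_num();
            for (long i = 0; i < 10000000; i++)
                c[tid].n++;      /* never shares a line with a neighbor */
        }

        long total = 0;
        for (int i = 0; i < NTHREADS; i++) total += c[i].n;
        printf("%ld\n", total);
        return 0;
    }

Drop the _Alignas and every counter lands on the same line or two, and the cores spend their time bouncing that line around instead of counting.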
Re: (Score:2, Insightful)
That's 2GB per core, a fine amount for supercomputer problems requiring compute density and bandwidth. No virtualization there, and the compilers, middleware, and programmers are probably sufficiently educated to know how to split the problem.
Re:Why so little memory? (Score:4, Informative)
Re: (Score:3)
The cynic in me says that you don't get into the Top 5 by spending all of your budget on memory. :)
Practically speaking, there are a lot of research codes out there using 1GB or less of memory per core. Our systems at MSI typically had somewhere between 2 and 3GB of memory per core and often were only using half of that or less. There's a good chance that TACC has looked at the kinds of computations that would happen on the machine and determined that they don't need more.
Summary: s/tera/peta/ (Score:4, Informative)
Time For A New Supercomputer Metric (Score:4, Insightful)
We need a standard that actually makes sense.
Re: (Score:2)
Flops have always been a useless metric. If you want good metrics, look at the instruction reference with the speed in cycles of each instruction, its latency, its pipelining capabilities, and the processor frequency, and cross it all with the number of cores and the memory and cache interconnect specifications.
Flops are just a number that gives a value for a single dumb computation in the ideal case; real computations can be up to 100 times slower than that.
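For scale, here is the paper arithmetic for Stampede's Xeon side; the 2.7 GHz clock and 8 double-precision flops per cycle (AVX) are my assumptions, while the socket and core counts come from the summary:

    /* Paper peak for a dual E5-2680 node, and how many such nodes
       it takes to hit the ~2 Pflops the Xeons contribute. */
    #include <stdio.h>

    int main(void)
    {
        double hz = 2.7e9;              /* assumed clock */
        double flops_per_cycle = 8.0;   /* assumed: AVX, double precision */
        int sockets = 2, cores = 8;     /* from the summary */

        double node_peak = sockets * cores * hz * flops_per_cycle;
        printf("node peak: %.1f Gflops\n", node_peak / 1e9);
        printf("nodes for 2 Pflops: %.0f\n", 2e15 / node_peak);
        return 0;
    }

That comes out to about 345.6 Gflops per node and roughly 5,800 nodes for the Xeons' 2 petaflops, which squares with the summary's "several thousand" servers, and no real code sustains that paper peak.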
Re: (Score:3)
Well, as far as achievable computation goes, that's why Linpack reports Rmax and Rpeak. However, the one big area where Linpack is lacking as a measuring stick for many real workloads is its small communications overhead; it's much easier to achieve high utilization on Linpack than it is for many other workloads.
Umm, No. (Score:3)
I'm pretty sure you are mistaken on this point.
Most modern supercomputers get their "flop" count from SSE3/4 and/or GPUs, which are not integer but floating-point processing machines (at least 32-bit single-precision fp, but also double precision, albeit at a slower rate). These machines most certainly do NOT simulate floating point with their integer units (nor cheat by calling an integer op an approximate fp op), and they have massive amounts of dedicated hardware SIMD FP processing units to do their heavy lifting.
Re: (Score:2)
"These machines most certainly do NOT simulate floating point with their integer units (nor cheat by calling an integer op as an approximate fp op), and they have massive amounts of dedicated hardware SIMD FP processing units to do their heavy lifting."
I did not say that they did. I said that you could consider integer units that emulated fp hardware to be doing flops. I did not state that this is the usual case.
"... they are rated by the number of IEEE FP operations..."
I KNOW that... my point is that it probably is not an appropriate rating these days. Not representative of many real-world problems.
"The integer OPs currently don't count in the current ratings and I don't see that changing any time soon."
Well, thanks for repeating pretty much what I already said.
Re: (Score:2)
I KNOW that... my point is that it probably is not an appropriate rating these days. Not representative of many real-world problems.
What real-world problems are you thinking of? Most of the big supercomputer problems are focused on scientific simulation of some sort, which is very floating-point heavy.
Going through the top 5 on Wikipedia, of the applications mentioned, all are floating point.
Besides, vector units like SSE can churn through either integer or FP instructions at about the same rate.
Re: (Score:2)
Okay, can you tell me what the following statement means to you?
I said that you could consider integer units that emulated fp hardware to be doing flops.
1. I don't see any supercomputers emulating fp hardware on integer units...
2. Even if they did (which they don't), it would be so slow that it would be a rounding error in their flop rating.
As I (and others) have posted, although there are some interesting integer problems, existing supercomputers issue those instructions to processing units that are essentially the same speed as FP units, so there's not much difference between a FLOP and the integer equivalent.
Re: (Score:2)
...sort of. And whoever rated the parent "insightful" apparently has little insight into HPC and supercomputing. Interesting might have been appropriate.
First off, any metric which yields a single number is bound to be misleading, as it is easy to find two applications a and b where a runs faster than b on machine 1 and slower than b on machine 2. But since we want such a simple metric, we might just as well settle for the one we already have. Why flops? Because applications use them.
Re: (Score:2)
"First off, any metric which yields a single number is bound to be misleading as it is easy to find two applications a and b where a runs faster than b on machine 1 and slower than b on machine 2."
Part of the point I was making.
"it's much easier to prove numerical stability with floating point numbers"
False.
Re: (Score:2)
"First off, any metric which yields a single number is bound to be misleading as it is easy to find two applications a and b where a runs faster than b on machine 1 and slower than b on machine 2."
Part of the point I was making.
Sure, but apparently people still want such a metric, however imperfect and misleading it may be.
"it's much easier to prove numerical stability with floating point numbers"
False.
How eloquent. Simple example: I have two numbers a and b, approximately of the same size, with their LSB tainted by rounding errors. If I add them on a floating-point machine, the last two bits of the mantissa will be tainted, but because the mantissa gets cut during normalization, we end up with, again, just the LSB tainted.
On a fixed-point machine, however, adding both numbers will either result in an overflow or in two bits being tainted. And so on. Care to disprove me?
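Quick C demo of the normalization point; a double carries a 53-bit mantissa, so anything below about 1e-16 relative to the larger operand simply disappears:

    /* Adding a value below the mantissa's LSB vanishes on normalization. */
    #include <stdio.h>

    int main(void)
    {
        double a = 1.0;
        double b = 1e-17;                /* below 2^-53 relative to a */
        printf("%s\n", (a + b == a) ? "b rounded away" : "b survived");
        return 0;
    }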
Re: (Score:2)
"On a fixed point machine however adding both numbers will either result in an overflow or in two bits being tainted. And so on. Care to disprove me? "
I would not attempt to try to "disprove" you on Slashdot. But I will argue with some of your assumptions.
First, you can do "floating point" math using scaled integers of a size to represent the number of decimal places you desire. But the integer math is not subject to either the speed limitations or the bugs that have been not just known but fairly common in fp hardware. Sure, you still get rounding errors, but you get those anyway, and they aren't the only kind of errors that occur with floating-point.
Re: (Score:2)
Apologies for any confusion. It was my fault but not intentional.
The solution to overflow in fixed point is to use scaled integers.
I know of no way to absolutely avoid rounding errors, except to simply use more digits than the significant digits you require, and even that is simply a probability game; you can't absolutely avoid them.
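A sketch of the scaled-integer idea in C; the 32.32 format and helper names are mine, and the 128-bit intermediate relies on a GCC/Clang extension:

    /* 32.32 fixed point in a 64-bit integer. */
    #include <stdio.h>
    #include <stdint.h>

    #define FRAC 32   /* fractional bits */

    typedef int64_t fix_t;

    static fix_t  fix_from(double d) { return (fix_t)(d * (1ULL << FRAC)); }
    static double fix_to(fix_t f)    { return (double)f / (1ULL << FRAC); }

    /* widen to 128 bits before rescaling so the product can't overflow */
    static fix_t fix_mul(fix_t a, fix_t b)
    {
        return (fix_t)(((__int128)a * b) >> FRAC);
    }

    int main(void)
    {
        fix_t x = fix_from(3.25), y = fix_from(1.5);
        printf("%.4f\n", fix_to(fix_mul(x, y)));   /* prints 4.8750 */
        return 0;
    }

The scaling keeps a fixed number of fractional bits, so the rounding behavior is explicit rather than hidden in the hardware's normalization step.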
Re: (Score:2)
Floating-point arithmetic in processors is fraught with errors (such as rounding errors) and has quite often turned out to contain very significant bugs. Integer math simply does not have those problems. If you want "numerical stability", you need to stay away from floating-point in hardware.
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
No, I was not referring to exactness.
Re: (Score:2)
It would be interesting to see how well flops and the Linpack score correlate across the members of the Top500. My guess would be that they correlate pretty well, just because flops is never used as a serious benchmark so nobody bothers gaming it. But I could certainly be wrong.
Ah, memories! (Score:5, Funny)
This reminds me of an old science fiction story. The designers, builders and programmers assemble. The Switch is flipped. The computer boots. The first question they ask is, "Is there a God?" The machine hums away for a few seconds, then arc welds the power switch open and responds, "There is now!"
Re: (Score:2)
Re: (Score:2)
The machine hums away for a few seconds, then arc welds the power switch open and responds, "There is now!"
And then it abruptly died as its capacitors drained, because it didn't know the difference between an open switch and a closed switch.
What! (Score:1)
Re: (Score:2)
It would clock in at rank 3 or 4, because the current rank 3 has 10 petaflops Rpeak (and isn't an Intel system, but POWER based), while the current rank 4 is an Intel system at 3 petaflops Rpeak.
But can I play doom on it? (Score:1)
Yeah, will it run Doom!
O well... (Score:2)
I'm seriously bothered by the fact they couldn't figure out how to put an O at the end of the acronym.
First Projects (Score:1)
The chief science official of Texas has divined that the computational projects will be:
1. Derive a proof that the universe is only 5000-6000 years old.
2. Derive a proof that God is a silver haired, white man from Texas.
Just how many, Earl? (Score:2)
Just how many supercomputers are required for a stampede, Earl? I mean, is it like three or more? Is there a minimum speed?
Disappointed (Score:2)
I got excited for a couple of seconds; I thought it was talking about a "Taco Stampede."