NVIDIA Ampere A100 GPU For AI Unveiled, Largest 7nm Chip Ever Produced (hothardware.com) 35
MojoKid writes: NVIDIA CEO Jensen Huang unveiled the company's new Ampere A100 GPU architecture for machine learning and HPC markets today. Jensen claims the 54B transistor A100 is the biggest, most powerful GPU NVIDIA has ever made, and it's also the largest chip ever produced on a 7nm semiconductor process. There are a total of 6,912 FP32 CUDA cores, 432 Tensor cores, and 108 SMs (Streaming Multiprocessors) in the A100, paired with 40GB of HBM2e memory offering maximum memory bandwidth of 1.6TB/sec. FP32 compute comes in at a staggering 19.5 TFLOPS, compared to 16.4 TFLOPS for NVIDIA's previous-gen Tesla V100. In addition, its Tensor Cores employ FP32 precision that allows for a 20x uplift in AI performance gen-over-gen. When it comes to FP64 performance, these Tensor Cores also provide a 2.5x performance boost versus their predecessor, Volta. Additional features include Multi-Instance GPU, aka MIG, which allows an A100 GPU to be sliced into as many as seven discrete instances, so it can be provisioned for multiple specialized workloads. Multiple A100 GPUs will also make their way into NVIDIA's third-generation DGX AI supercomputer, which packs a whopping 5 PFLOPS of AI performance. According to NVIDIA, its Ampere-based A100 GPU and DGX AI systems are already in full production and shipping to customers now. Gamers are of course looking forward to what the company has in store with Ampere for the enthusiast PC market, as expectations for its rumored GeForce RTX 30 family are incredibly high.
FP32? (Score:2)
It is odd that they used FP32. GPU-based AI engines are mostly running at FP16. Google's TPU is FP8.
FP16 would have allowed many more FLOPS.
Also, going from 16.4 to 19.5 is only a 20% improvement. I wouldn't call that "staggering".
Re: (Score:2)
So you tell us that the die is the biggest ever for this node, but you don't tell us how big it is? Stupid.
From the article...
826mm^2
So if it's square, about 2.87cm x 2.87cm .. on 7nm, yields will be terrible, leading to the cost.. which is not given in the article, presumably because nVidia ain't sayin'.
It's 54 billion transistors, so about 65 million transistors per square mm.
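If you want to sanity-check those numbers, here's a rough Python sketch; the 300 mm wafer and the gross-dies-per-wafer formula are the usual back-of-the-envelope assumptions, not anything NVIDIA has published.

```python
import math

die_area_mm2 = 826            # reported GA100 die size
transistors = 54e9
wafer_diameter_mm = 300       # assumed standard wafer size

side_mm = math.sqrt(die_area_mm2)         # ~28.7 mm, i.e. roughly 2.87 cm on a side
density = transistors / die_area_mm2      # ~65 million transistors per mm^2

# Common approximation for gross (candidate) dies per wafer, ignoring scribe lines and yield:
r = wafer_diameter_mm / 2
gross_dies = math.pi * r**2 / die_area_mm2 - math.pi * wafer_diameter_mm / math.sqrt(2 * die_area_mm2)

print(f"side ~{side_mm:.2f} mm, ~{density/1e6:.0f}M transistors/mm^2, ~{gross_dies:.0f} candidate dies per wafer")
```

Only sixty-odd candidate dies per 300 mm wafer before yield even enters the picture, which is part of why these parts won't be cheap.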
Re: (Score:2)
So if it's square, about 2.87cm x 2.87cm .. on 7nm, yields will be terrible, leading to the cost.. which is not given in the article, presumably because nVidia ain't sayin'.
TSMC has gotten yields on 7nm up so high that AMD announced they were moving from 4 core to 8 core chiplets for Ryzen. Yields won't be that terrible, and well, that's what 3060s and 3070s are for: selling chips with the failed CUDA pipelines disabled. Nvidia, AMD, and Intel all design in easy cutouts specifically to salvage not-quite-perfect chips. For GPUs it's easiest since a GPU is the physical embodiment of the phrase "massively parallel."
Re: (Score:2)
TSMC has gotten yields on 7nm up so high that AMD announced they were moving from 4 core to 8 core chiplets for Ryzen.
AMD already moved to 8 core chiplets with Zen 2.
...and those Zen 2 core chiplets are about 0.9cm x 0.9cm ... with a bit lower transistor density (about 48 million transistors per square mm) than this nvidia gpu.
So yields will easily be about 10 times worse for this GPU than for Zen 2 chiplets; combine that with about 10 times more wafer area per die, and cost to manufacture will thus be about 10^2, or 100x.
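If you want to put rough numbers on the yield side, here's a sketch using the classic Poisson yield model, Y = exp(-D0 * A); the defect densities are assumed values (TSMC doesn't publish them), the ~80 mm^2 chiplet area is the estimate from above, and the conclusion is very sensitive to D0.

```python
import math

def poisson_yield(defects_per_cm2: float, die_area_mm2: float) -> float:
    """Fraction of dies with zero defects under a simple Poisson model."""
    return math.exp(-defects_per_cm2 * die_area_mm2 / 100.0)   # mm^2 -> cm^2

gpu_area_mm2, chiplet_area_mm2 = 826.0, 80.0

for d0 in (0.1, 0.2, 0.5):                  # assumed defects per cm^2
    y_gpu = poisson_yield(d0, gpu_area_mm2)
    y_ccd = poisson_yield(d0, chiplet_area_mm2)
    print(f"D0={d0}/cm^2: GPU ~{y_gpu:.0%}, chiplet ~{y_ccd:.0%}, ratio ~{y_ccd / y_gpu:.1f}x")
```

Note this only counts perfect dies; as others point out, NVIDIA salvages defective dies by disabling SMs and memory controllers, which changes the economics considerably.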
Re: (Score:2)
Re: Hey retard, Ms. Mash! (Score:1)
According to EETimes:
DGX-A100 systems are already installed at the US Department of Energy's Argonne National Laboratory where they are being used to understand and fight Covid-19.
Nvidia DGX A100 systems start at $199,000 and are shipping now.
Re:Hey retard, Ms. Mash! (Score:5, Informative)
The full implementation of the GA100 GPU includes the following units:
8 GPCs, 8 TPCs/GPC, 2 SMs/TPC, 16 SMs/GPC, 128 SMs per full GPU
64 FP32 CUDA Cores/SM, 8192 FP32 CUDA Cores per full GPU
4 third-generation Tensor Cores/SM, 512 third-generation Tensor Cores per full GPU
6 HBM2 stacks, 12 512-bit memory controllers
The A100 Tensor Core GPU implementation of the GA100 GPU includes the following units:
7 GPCs, 7 or 8 TPCs/GPC, 2 SMs/TPC, up to 16 SMs/GPC, 108 SMs
64 FP32 CUDA Cores/SM, 6912 FP32 CUDA Cores per GPU
4 third-generation Tensor Cores/SM, 432 third-generation Tensor Cores per GPU
5 HBM2 stacks, 10 512-bit memory controllers
Unless, for some reason, they decided to only announce the mid-tier product during the big next-gen unveil, it would appear that they are tolerating a fair number of faults to end up with enough parts to ship. Likely the reasonable approach, but some of those numbers reflect a fair bit of trimming.
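A quick tally of how much of the full GA100 the shipping A100 actually enables, using the numbers from the list above:

```python
full = {"SMs": 128, "FP32 cores": 8192, "Tensor cores": 512, "HBM2 stacks": 6, "Memory controllers": 12}
a100 = {"SMs": 108, "FP32 cores": 6912, "Tensor cores": 432, "HBM2 stacks": 5, "Memory controllers": 10}

for unit in full:
    print(f"{unit}: {a100[unit]}/{full[unit]} = {a100[unit] / full[unit]:.0%} enabled")
```

Roughly a sixth of the compute and memory blocks can be defective and the die still ships as an A100, which is exactly the salvage margin you want on an 826 mm^2 chip.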
Re: (Score:2)
Yeah, it makes sense since they are selling these to datacenters for use in clusters. If you need another 12% of memory, you can just add another virtual GPU from another card in the cluster.
Re: (Score:2)
TSMC has gotten yields on 7nm up so high that AMD announced they were moving from 4 core to 8 core chiplets for Ryzen.
On that note, the new mobile 4000 series Ryzen have up to 8 cores plus a GPU all on one die. Placing the cores on one die does help reduce latency when communicating between cores. It also helps reduce the signalling cost (power required), so it is a good thing.
Focusing first on desktop parts utilizing the smaller chiplet now makes a lot of sense. It takes time for the yield to improve and make the larger parts affordable. I imagine a similar approach will be taken when transitioning to the 5 nm node.
Re: (Score:2)
Re: (Score:2)
Google's TPU is FP8
Are you sure?
My understanding is that TPU 1.0 was Int8, and 2.0 added FP16.
Re: (Score:2)
Are you sure?
I was sure when I wrote it ... but when I saw your reply I did some Googling, and the latest TPU actually uses BFloat16 [wikipedia.org], with an 8-bit exponent and a 7-bit mantissa, specifically designed for AI.
The original 8-bit TPU was good for running inference on trained models, but poor at training them. The newest TPUs are very useful for both.
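For anyone curious what BFloat16 actually buys you, here's a minimal Python sketch that emulates it by truncating an FP32 value to its top 16 bits; real hardware rounds to nearest rather than truncating, so treat the exact digits as illustrative.

```python
import struct

def to_bfloat16(x: float) -> float:
    # BFloat16 is essentially the top 16 bits of an FP32 encoding:
    # 1 sign bit, 8 exponent bits, 7 mantissa bits.
    bits = struct.unpack('>I', struct.pack('>f', x))[0]
    return struct.unpack('>f', struct.pack('>I', bits & 0xFFFF0000))[0]

print(to_bfloat16(1 / 3))    # ~0.332 -- only a couple of significant digits of precision
print(to_bfloat16(1e30))     # ~9.95e29 -- but the full FP32 dynamic range is kept

try:
    fp16 = struct.unpack('>e', struct.pack('>e', 1e30))[0]   # '>e' is IEEE FP16 (5-bit exponent)
    print(fp16)                                              # would print inf if packing saturated
except OverflowError:
    print("1e30 does not fit in FP16 at all")                # FP16 tops out around 65504
```

Keeping the exponent and giving up mantissa bits is exactly the trade you want for training, where gradients span a huge range but don't need many significant digits.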
Re: (Score:2)
They want to sell into more than AI. There are many applications that require better than FP16.
Re: (Score:2)
They want to sell into more than AI. There are many applications that require better than FP16.
AI and HPC are diverging markets.
They should pick one and do it right, rather than a mediocre compromise.
Then make a different device for the other market.
Re: (Score:2)
FP32 is still the figure you quote for people to make sense of the numbers. While some deep learning methods can run almost entirely in FP16, the best performance is still obtained with mixed-precision arithmetic spanning FP16 to FP64; depending on the part of the calculation, you actually want different precision during the training phase.
Also, deep learning is not the only thing one wants to do with these GPUs. Most graphics is done in FP32, and many high performance computing codes still need FP32 and FP64.
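As a concrete example of what mixed precision looks like in practice, here's a minimal PyTorch sketch using its automatic mixed precision API; the tiny model and random batches are made up for illustration, and it assumes a CUDA-capable GPU.

```python
import torch
import torch.nn.functional as F

model = torch.nn.Linear(512, 10).cuda()                 # parameters stay in FP32
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()                    # rescales gradients so FP16 values don't underflow

batches = [(torch.randn(32, 512), torch.randint(0, 10, (32,))) for _ in range(3)]  # dummy data

for x, y in batches:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():                     # matmuls run in reduced precision, reductions in FP32
        loss = F.cross_entropy(model(x.cuda()), y.cuda())
    scaler.scale(loss).backward()                       # scaled backward pass
    scaler.step(optimizer)                              # unscale, then apply the FP32 weight update
    scaler.update()
```

The forward pass picks reduced precision per-op while loss scaling and the weight updates stay in FP32, which is exactly the "different precision for different parts of the calculation" point.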
Re:FP32? (Score:4, Informative)
FP32 is only in their CUDA cores (i.e. the usual GPU shader processor cores). Their tensor cores don't support FP32 or FP32-like precision, despite what the summary says: the "TF32" format offers FP32 range by using an 8-bit exponent, but the precision is still a 10-bit mantissa, like FP16. The article also seems confused, as it claims TF32 is a 20-bit number, but 1 sign bit + 8 exponent bits + 10 mantissa bits is only 19 bits.
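To make the format concrete, here's a small Python sketch that emulates TF32's precision by truncating an FP32 mantissa down to 10 bits; the real hardware rounds rather than truncates, so the exact digits are only illustrative.

```python
import struct

def emulate_tf32(x: float) -> float:
    # TF32 keeps FP32's 8-bit exponent but only 10 mantissa bits,
    # so clear the 13 low-order mantissa bits of the FP32 encoding.
    bits = struct.unpack('>I', struct.pack('>f', x))[0]
    return struct.unpack('>f', struct.pack('>I', bits & 0xFFFFE000))[0]

print(emulate_tf32(1 / 3))   # ~0.333252 -- FP16-class precision (10-bit mantissa)
print(emulate_tf32(1e30))    # ~1e30 still representable -- FP32-class range (8-bit exponent)
```

That split (FP32 range, FP16-class precision) is what lets the tensor cores accept FP32 data while running far faster than true FP32 math.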
Re: (Score:2)
Thanks for the details. Your description makes much more sense than the summary.
Re: (Score:1)
FP16 and FP8 are nice for inference. FP32 is nice for training.
The A100 is capable of 312 TFLOPS of TF32 Tensor Core performance (with sparsity), 1,248 TOPS of Int8, and 19.5 TFLOPS of FP64 HPC throughput.
An NVIDIA DGX A100 offers up to 10 PetaOPS of Int8 performance, 5 PFLOPS of FP16, 2.5 PFLOPS of TF32, and 156 TFLOPS of FP64 compute performance.
Yes, you get more performance on the same chip if you use FP16. It does indeed support FP16 (and FP64).
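Those DGX figures are, to a good approximation, just eight of the per-GPU numbers; quick check below, assuming the sparsity-enabled per-GPU figures (the 624 TFLOPS FP16 value is back-derived from the DGX number).

```python
per_gpu = {"Int8 TOPS": 1248, "FP16 TFLOPS": 624, "TF32 TFLOPS": 312, "FP64 TFLOPS": 19.5}

for fmt, val in per_gpu.items():
    print(f"{fmt}: 8 x {val} = {8 * val:g}")   # ~10,000 / ~5,000 / ~2,500 / 156, matching the DGX specs
```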
Not too shabby (Score:2)
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
This isn't a gaming GPU by any stretch.
It isn't a bitcoin miner either. FLOPS are meaningless for a miner.
Miners need IOPS and bit twiddling. ASICs will be far more cost-effective.
Re: (Score:2)
This isn't a gaming GPU by any stretch. It will probably cost $3000 or more.
The prebuilt server they're selling is $200k for 8 of them; granted, that has two $4k CPUs and a bunch of other expensive shit, but it's probably closer to $20k/GPU.
Re: (Score:2)
but (Score:3)
Re: (Score:2)
Mate, this can program Crysis for you in realtime!
History (Score:5, Interesting)
So 19.5 TFLOPS is a lot. How much is it?
In 2000, the number one machine in the world, first place on the Top500 supercomputing list, was IBM's ASCI White, running at 7.2 TFLOPS in an installation at Lawrence Livermore National Laboratory.
In 2002, the number one machine in the world was NEC's Earth Simulator, running at 35.86 TFLOPS in an installation in Yokohama, Japan. Two Amperes connected via NVLink beat that.
It's not until 2004 and the number one on the Top500 list at 70.72 TFLOPS for IBM's first Blue Gene/L installation that we find a machine that can't be beaten by... the gaming rig of a teenage kid with a well-to-do daddy in 2020.
Re: History (Score:2)
Thank you! That was a neat and informative historical comparison.
Re: (Score:2)
Agree - very useful
Made me look up what I am currently running: a 2080 Ti, which apparently does 14.2 TFLOPS
https://www.techradar.com/uk/n... [techradar.com]
So while a ~37% increase over current consumer-level hardware is good, it hasn't exactly shattered Moore's law, although the full-fat versions that go to govt/military will be more capable (and a lot more expensive).
Re: (Score:2)
To be fair, the biggest improvements they were touting weren't in terms of raw computing power gains, but rather were in terms of being able to operate much more efficiently thanks to improved machine learning systems, namely DLSS 2.0 (which is basically being touted as a real life incarnation of the fictional "Enhance" button that crime shows have featured for the last few decades).
Going off what the demo showed—which, granted, was almost certainly a cherry-picked example, so take it with a grain of
The old "Zoom in and enhance" trope (Score:1)
"(which is basically being touted as a real life incarnation of the fictional "Enhance" button that crime shows have featured for the last few decades)"
Not exactly. It seems to me that this uses some sort of fuzzy logic to add detail to an image being upconverted (example: creating and adding higher-rez trees to a far shot of a forest originally taken at 1080p that is upconverted to 4K). In other words, it's reading details from an image and 'doing its best' to create *fictional* higher-rez detail.
Interesting (Score:1)