AI Hardware

NVIDIA Ampere A100 GPU For AI Unveiled, Largest 7nm Chip Ever Produced (hothardware.com)

MojoKid writes: NVIDIA CEO Jensen Huang unveiled the company's new Ampere A100 GPU architecture for machine learning and HPC markets today. Jensen claims the 54B transistor A100 is the biggest, most powerful GPU NVIDIA has ever made, and it's also the largest chip ever produced on a 7nm semiconductor process. There are a total of 6,912 FP32 CUDA cores, 432 Tensor cores, and 108 SMs (Streaming Multiprocessors) in the A100, paired with 40GB of HBM2e memory with a maximum memory bandwidth of 1.6TB/sec. FP32 compute comes in at a staggering 19.5 TFLOPS, compared to 16.4 TFLOPS for NVIDIA's previous-gen Tesla V100. In addition, its Tensor Cores employ FP32 precision that allows for a 20x uplift in AI performance gen-over-gen. When it comes to FP64 performance, these Tensor Cores also provide a 2.5x performance boost versus their predecessor, Volta. Additional features include Multi-Instance GPU, aka MIG, which allows an A100 GPU to be sliced into as many as seven discrete instances, so it can be provisioned for multiple discrete specialized workloads. Multiple A100 GPUs will also make their way into NVIDIA's third-generation DGX AI supercomputer, which packs a whopping 5 PFLOPS of AI performance. According to NVIDIA, its Ampere-based A100 GPU and DGX AI systems are already in full production and shipping to customers now. Gamers are of course looking forward to what the company has in store with Ampere for the enthusiast PC market, as expectations for its rumored GeForce RTX 30 family are incredibly high.
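For a rough sense of where the 19.5 TFLOPS figure comes from, peak FP32 throughput is just the CUDA core count times two FLOPs per fused multiply-add times the clock speed. A quick back-of-the-envelope sketch in Python (the ~1.41 GHz boost clock is an assumption taken from NVIDIA's published A100 specs, not from the summary above):

cuda_cores = 6912              # FP32 CUDA cores, per the summary
flops_per_core_per_clock = 2   # one fused multiply-add counts as 2 FLOPs
boost_clock_ghz = 1.41         # assumed boost clock (GHz)

peak_tflops = cuda_cores * flops_per_core_per_clock * boost_clock_ghz / 1000
print(f"Peak FP32: {peak_tflops:.1f} TFLOPS")   # ~19.5 TFLOPS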

Comments:
  • It is odd that they used FP32. GPU-based AI engines are mostly running at FP16. Google's TPU is FP8.

    FP16 would have allowed many more FLOPS.

    Also, going from 16.4 to 19.5 is only a 20% improvement. I wouldn't call that "staggering".

    • Google's TPU is FP8

      Are you sure?

      My understanding is that TPU 1.0 was Int8, and 2.0 added FP16.

      • Are you sure?

        I was sure when I wrote it ... but when I saw your reply I did some Googling and the latest TPU actually uses BFloat16 [wikipedia.org], with an 8-bit exponent and a 7-bit mantissa, specifically designed for AI.

        The original 8-bit TPU was good for running inference on trained models, but poor at training them. The newest TPUs are very useful for both.
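        To make the range-versus-precision tradeoff concrete, here's a quick Python sketch. BFloat16 is emulated by simply truncating an FP32 value to its top 16 bits (1 sign + 8 exponent + 7 mantissa); real hardware rounds rather than truncates, so this is only an approximation:

        import struct
        import numpy as np

        def to_bfloat16(x):
            # Keep only the top 16 bits of the FP32 encoding (1 sign + 8 exponent + 7 mantissa).
            bits = struct.unpack('<I', struct.pack('<f', x))[0]
            return struct.unpack('<f', struct.pack('<I', bits & 0xFFFF0000))[0]

        big = 1.0e5
        print(np.float16(big))        # inf  -> FP16 tops out around 65504
        print(to_bfloat16(big))       # ~99840.0 -> BFloat16 keeps FP32's range, loses precision

        close_to_one = 1.0 + 1e-3
        print(np.float16(close_to_one))   # 1.001 -> FP16's 10-bit mantissa resolves this
        print(to_bfloat16(close_to_one))  # 1.0   -> BFloat16's 7-bit mantissa cannot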

    • by Junta ( 36770 )

      They want to sell into more than AI. There are many applications that require better than FP16.

      • They want to sell into more than AI. There are many applications that require better than FP16.

        AI and HPC are diverging markets.

        They should pick one and do it right, rather than settle for a mediocre compromise.

        Then make a different device for the other market.

    • by godrik ( 1287354 )

      FP32 is still the number you quote so people can make sense of the figures. While some deep learning methods can get by with FP16 alone, the best performance is still obtained with mixed-precision arithmetic spanning FP16 to FP64; different parts of the training calculation want different precision.
      Also, deep learning is not the only thing people want to do with these GPUs. Most graphics is done in FP32. In many high performance computing codes, you still need FP32 and FP64. Th
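      As a rough illustration of the mixed-precision idea, here is a minimal sketch using PyTorch's torch.cuda.amp utilities (the toy model and shapes are made up; the point is the scaffolding, which runs eligible ops in FP16 on the Tensor Cores while loss scaling and weight updates stay in FP32):

      import torch
      from torch import nn
      from torch.cuda.amp import GradScaler, autocast

      model = nn.Linear(1024, 1024).cuda()
      optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
      scaler = GradScaler()   # scales the loss so FP16 gradients don't underflow

      for _ in range(10):
          inputs = torch.randn(64, 1024, device="cuda")
          targets = torch.randn(64, 1024, device="cuda")

          optimizer.zero_grad()
          with autocast():                  # matmuls run in FP16 on Tensor Cores
              loss = nn.functional.mse_loss(model(inputs), targets)
          scaler.scale(loss).backward()     # backward pass on the scaled loss
          scaler.step(optimizer)            # unscales gradients, FP32 weight update
          scaler.update()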

    • Re:FP32? (Score:4, Informative)

      by Baloroth ( 2370816 ) on Thursday May 14, 2020 @03:04PM (#60060856)

      FP32 is only in their CUDA cores (i.e. the usual GPU shader cores). Their Tensor Cores don't support FP32 or FP32-equivalent precision, despite what the summary says: the "TF32" format offers FP32 range by using an 8-bit exponent, but the precision is still a 10-bit mantissa, like FP16. The article also seems confused, as it claims TF32 is a 20-bit number, when 1 sign bit + 8 exponent bits + 10 mantissa bits is only 19 bits.
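      You can get a feel for the format by emulating it in Python: truncate an FP32 value to a 10-bit mantissa and see what survives (the real hardware rounds rather than truncates, so this is only an approximation of TF32's behavior):

      import struct

      def to_tf32(x):
          # Zero the low 13 mantissa bits, leaving 1 sign + 8 exponent + 10 mantissa bits.
          bits = struct.unpack('<I', struct.pack('<f', x))[0]
          return struct.unpack('<f', struct.pack('<I', bits & 0xFFFFE000))[0]

      print(to_tf32(3.0e38))   # ~3e38: keeps FP32's exponent range (FP16 would overflow to inf)
      print(to_tf32(1.0001))   # 1.0: a 10-bit mantissa can't resolve a 1e-4 difference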

    • FP16 and FP8 are nice for inference. FP32 is nice for training.

      The A100 is capable of 312 TFLOPS of TF32 Tensor Core performance, 1,248 TOPS of Int8, and 19.5 TFLOPS of FP64 HPC throughput.

      An NVIDIA DGX A100 offers up to 10 PetaOPS of Int8 performance, 5 PFLOPS of FP16, 2.5 PFLOPS of TF32, and 156 TFLOPS of FP64 compute performance.

      Yes, you get more performance on the same chip if you use FP16. It does indeed support FP16 (and FP64).
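      The DGX numbers are basically just eight A100s added up, which makes for a quick sanity check on the per-GPU figures quoted above:

      gpus_per_dgx = 8
      per_gpu = {"Int8 TOPS": 1248, "TF32 TFLOPS": 312, "FP64 TFLOPS": 19.5}

      for metric, value in per_gpu.items():
          print(f"{metric}: {value * gpus_per_dgx:,.0f}")
      # Int8 TOPS: 9,984     (~10 PetaOPS)
      # TF32 TFLOPS: 2,496   (~2.5 PFLOPS)
      # FP64 TFLOPS: 156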

  • I think I see why Bitcoin mining payout halved [slashdot.org].
    • Comment removed based on user account deletion
      • This isn't a gaming GPU by any stretch. It will probably cost $3000 or more.
        • This isn't a gaming GPU by any stretch.

          It isn't a bitcoin miner either. FLOPS are meaningless for a miner.

          Miners need IOPS and bit twiddling. ASICs will be far more cost-effective.

        • by Kjella ( 173770 )

          This isn't a gaming GPU by any stretch. It will probably cost $3000 or more.

          The prebuilt server they're selling is $200k for 8 of them; granted, that includes two $4k CPUs and a bunch of other expensive shit, but it's probably closer to $20k per GPU.

      • I don't think anyone is going to buy enterprise GPUs to mine cryptocurrency. During the last craze the miners took the opposite approach and bought the cheap mid-range GPUs that were produced in higher quantities and sold at much lower prices due to the more competitive nature of that market. These cards are probably going to cost well over $10,000 initially, and you probably need an existing relationship with a supplier just to get them, considering how limited the supply will be initially. For that k
  • by blackomegax ( 807080 ) on Thursday May 14, 2020 @02:19PM (#60060722) Journal
    Can it run Crysis?
  • History (Score:5, Interesting)

    by Areyoukiddingme ( 1289470 ) on Thursday May 14, 2020 @02:56PM (#60060834)

    So 19.5 TFLOPS is a lot. How much is it?

    In 2000, the number one machine in the world, first place on the Top500 supercomputing list, was IBM's ASCI White, running at 7.2 TFLOPS in an installation at Lawrence Livermore National Laboratory.

    In 2002, the number one machine in the world was NEC's Earth Simulator, running at 35.86 TFLOPS in an installation in Yokohama, Japan. Two Amperes connected via NVLink beat that.

    It's not until 2004, when IBM's first Blue Gene/L installation took the number one spot on the Top500 list at 70.72 TFLOPS, that we find a machine that can't be beaten by... the gaming rig of a teenage kid with a well-to-do daddy in 2020.

    • Thank you! That was a neat and informative historical comparison.

      • Agree - very useful

        Made me look up what I am currently running: a 2080 Ti, which apparently does 14.2 TFLOPS

        https://www.techradar.com/uk/n... [techradar.com]

        so while a roughly 37% increase over current consumer-level hardware is good, it hasn't exactly shattered Moore's law, although the full-fat versions that go to govt / military will be more capable (and a lot more expensive).

          To be fair, the biggest improvements they were touting weren't raw computing power gains, but rather the ability to operate much more efficiently thanks to improved machine learning systems, namely DLSS 2.0 (which is basically being touted as a real-life incarnation of the fictional "Enhance" button that crime shows have featured for the last few decades).

          Going off what the demo showed—which, granted, was almost certainly a cherry-picked example, so take it with a grain of

          • "(which is basically being touted as a real life incarnation of the fictional "Enhance" button that crime shows have featured for the last few decades)"

            Not exactly. It seems to me that this uses some sort of fuzzy logic to add detail to an image being upconverted (example: creating and adding higher-rez trees to a far shot of a forest originally taken at 1080p that is upconverted to 4K).

            In other words, it's reading details from an image and 'doing its best' to create *fictional* higher rez

  • by Anonymous Coward
    Imagine a Beowulf cluster of these

"Pull the trigger and you're garbage." -- Lady Blue

Working...