
How Thermal Management is Changing in the Age of the Kilowatt Chip (theregister.com)

An anonymous reader shares a report: As Moore's Law slowed to a crawl, chips, particularly those used in AI and high-performance computing (HPC), have steadily gotten hotter. In 2023 we saw accelerators enter the kilowatt range with the arrival of Nvidia's GH200 Superchips. We've known these chips would be hot for a while now -- Nvidia has been teasing the CPU-GPU franken-chip for the better part of two years. What we didn't know until recently is how OEMs and systems builders would respond to such a power-dense part. Would most of the systems be liquid cooled? Or, would most stick to air cooling? How many of these accelerators would they try to cram into a single box, and how big would the box be?

Now that the first systems based on the GH200 are making their way to market, it's become clear that form factor is being dictated more by power density than by anything else. It essentially boils down to how much surface area you have to dissipate the heat. Dig through the systems available today from Supermicro, Gigabyte, QCT, Pegatron, HPE, and others and you'll quickly notice a trend. Up to about 500 W per rack unit (RU) -- 1 kW in the case of Supermicro's MGX ARS-111GL-NHR -- these systems are largely air cooled. While hot, it's still a manageable thermal load to dissipate, working out to about 21-24 kW per rack. That's well within the power delivery and thermal management capacity of modern datacenters, especially those making use of rear door heat exchangers.
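The 21-24 kW per rack figure is consistent with roughly 500 W per rack unit across a fully populated rack. A minimal sketch of that arithmetic follows; the 42U and 48U rack heights are assumptions chosen because they reproduce the quoted range, since the report does not state which rack sizes it used, and `rack_load_kw` is a hypothetical helper name.

```python
def rack_load_kw(watts_per_ru: float, rack_units: int) -> float:
    """Total thermal load of a fully populated rack, in kilowatts."""
    return watts_per_ru * rack_units / 1000.0


# Assumed rack heights (not given in the report): 42U and 48U.
for rack_units in (42, 48):
    load = rack_load_kw(500, rack_units)
    print(f"{rack_units}U rack at 500 W/RU: {load:.1f} kW")
```

Under these assumptions the two cases come out at 21 kW and 24 kW, matching the range in the summary.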

However, this changes when system builders start cramming more than a kilowatt of accelerators into each chassis. At this point most of the OEM systems we looked at switched to direct liquid cooling. Gigabyte's H263-V11, for example, offers up to four GH200 nodes in a single 2U chassis. That's two kilowatts per rack unit. So while a system like Nvidia's air-cooled DGX H100 with its eight 700 W H100s and twin Sapphire Rapids CPUs has a higher TDP at 10.2 kW, it's actually less power dense at 1.2 kW/RU.
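The comparison above comes down to dividing system power by chassis height. The sketch below works through it under stated assumptions: each GH200 node is taken as roughly 1 kW (the "kilowatt range" in the summary), and the DGX H100 is taken as an 8U chassis, which the summary does not state. `power_density_kw_per_ru` is a hypothetical helper name.

```python
def power_density_kw_per_ru(total_kw: float, rack_units: int) -> float:
    """Power density of a chassis in kilowatts per rack unit."""
    return total_kw / rack_units


# (total power in kW, chassis height in RU)
systems = {
    # Assumes ~1 kW per GH200 node; four nodes in a 2U chassis.
    "Gigabyte H263-V11": (4 * 1.0, 2),
    # 10.2 kW is from the report; the 8U height is an assumption.
    "Nvidia DGX H100": (10.2, 8),
}

for name, (total_kw, rack_units) in systems.items():
    density = power_density_kw_per_ru(total_kw, rack_units)
    print(f"{name}: {density:.2f} kW/RU")
```

With an 8U chassis the DGX H100 works out to a little under 1.3 kW/RU, in line with the roughly 1.2 kW/RU quoted and well below the 2 kW/RU of the liquid-cooled Gigabyte system.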



  • It essentially boils down to how much surface area you have to dissipate the heat.

    Droll - very droll.

    • As Oscar Wilde once remarked to me:
      Never ascribe to clever wordplay that which can be adequately explained by MsMash's unthinking use of tired cliches.
  • How about creating direct liquid cooled chips? With increased density it might be a necessity instead of a coolness factor.

    • Re: (Score:3, Interesting)

      by AmiMoJo ( 196126 )

      Or just make better chips. AMD parts use significantly less power than Nvidia ones.

      Nvidia and Intel are doing this because their last generation was, to use the vernacular, kak. To stay competitive they just boosted the power draw, while AMD made more efficient chips.

      • AMD parts use significantly less power than Nvidia ones.

        Not on planet earth they don't.

      • Or just make better chips. AMD parts use significantly less power than Nvidia ones.

        Indeed, they perform significantly worse as well. AMD is about 2 generations behind in terms of performance per dollar and peak performance, and only just behind in terms of thermal performance per unit of work.

        My TI-82 uses significantly less power than my laptop as well, but Doom runs poorly on it, so I don't recommend people go out and buy it for gaming.

  • My electric stove has a maximum power draw of 11.2 kW. The oven element is 5 kW.

    Two of those racks would fully load the electrical service for the entire house. (200 amps at 240 V)

    • 2x 100A x 240V per row of 6-10 standard racks/enclosures was always the futureproof number before crypto. One average Arizona house of waste heat per DC square, so we were always looking to scale cooling just before actual need in the small data rooms. Getting that into a room is a contract and thick copper cables; getting the waste heat out is a fiasco for a decade, because that chiller gear on the roof is nowhere near transformer reliability. That this is being centralized to someone else's prob
    • by tlhIngan ( 30335 )

      Two of those racks would fully load the electrical service for the entire house. (200 amps at 240 V)

      There are houses that have 300A service, and many large houses now have 3-phase 600A service (200A/phase). Of course, you aren't getting 240V anymore, since phase to phase is 208V.

      • Three-phase power isn't "starting to go in" residential homes. That's only necessary for running a machine shop or assembly line. Jony Ive or Norm Abram might have it in home workshops but, you'll agree, those are special cases.
  • Cerebras' wafer-sized AI chip [ieee.org] is 1) pretty efficient for the amount of computing it does, but 2) still has crazy high dissipation because of its size. This white paper [cerebras.net] indicates that their complete system is 15U and 23kW peak load, or about 1.5kW/U. And that was in April 2021.
  • ... my Tektronix 545 'scope the other day. Just to keep the electrolytics formed. That thing makes a great space heater.

  • by Big Bipper ( 1120937 ) on Wednesday December 27, 2023 @02:54PM (#64109955)
    Remember the P4, Intel's last HOT chip? I had a laptop powered by a P4. I don't recall ever running it on the battery for more than a few minutes. Whenever I used it (plugged in) on my actual lap, I had to sit it on top of a binder or something to keep my b*lls from roasting. It was literally unusable for more than a few minutes on an actual lap. To paraphrase Jeff Goldblum (Malcolm) in Jurassic Park: just because you can make something doesn't mean you should. AMD came up with a better idea and ended the megahertz wars.
