AI Technology

'AI Ambition is Pushing Copper To Its Breaking Point' (theregister.com) 61

An anonymous reader shares a report: Datacenters have been trending toward denser, more power-hungry systems for years. In case you missed it, 19-inch racks are now pushing power demands beyond 120 kilowatts in high-density configurations, with many making the switch to direct liquid cooling to tame the heat. Much of this trend has been driven by a need to support ever larger AI models.

According to researchers at Fujitsu, the number of parameters in AI systems is growing 32-fold approximately every three years. To support these models, chip designers like Nvidia use extremely high-speed interconnects -- on the order of 1.8 terabytes a second -- to make eight or more GPUs look and behave like a single device.
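
To put the quoted growth rate in more familiar terms, here is a minimal sketch that converts "32-fold every three years" into a per-year factor and a doubling time. The function names are hypothetical, and the sketch assumes smooth exponential growth, which the summary does not actually claim.

```python
import math


def growth_per_year(factor: float, period_years: float) -> float:
    """Equivalent yearly multiplier for `factor`-fold growth over `period_years`."""
    return factor ** (1.0 / period_years)


def doubling_time_years(factor: float, period_years: float) -> float:
    """Time to double, assuming steady exponential growth (an assumption)."""
    return period_years * math.log(2) / math.log(factor)


if __name__ == "__main__":
    # Figures from the summary: 32-fold growth roughly every three years.
    print(f"per-year factor: {growth_per_year(32, 3):.2f}x")
    print(f"doubling time:   {doubling_time_years(32, 3) * 12:.1f} months")
```

Since 32 is 2 to the fifth power, this works out to five doublings in three years, or one doubling roughly every seven months.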

The problem, though, is that the faster you shuffle data across a wire, the shorter the distance at which the signal can be maintained. At those speeds, you're limited to about a meter or two over copper cables. The alternative is to use optics, which can maintain a signal over a much larger distance. In fact, optics are already employed in many rack-to-rack scale-out fabrics like those used in AI model training. Unfortunately, in their current form, pluggable optics aren't particularly efficient or particularly fast.

Earlier in 2024 at GTC, Nvidia CEO Jensen Huang said that if the company had used optics as opposed to copper to stitch together the 72 GPUs that make up its NVL72 rack systems, it would have required an additional 20 kilowatts of power.

  • Speed of light (Score:5, Informative)

    by bradley13 ( 1118935 ) on Friday November 29, 2024 @06:44AM (#64979039) Homepage
    It all gets crazier, when you remember the speed of light, and that an electrical signal only travels at about 70% of that. If you have, say, a 5GHz signal, that means that in one complete period (which takes 1 / 5*10 second), the signal will only travel about 4cm. Assuming you are trying to get a bunch of separate components to work in sync, well, that's not going to work. Work in parallel, maybe, but never synchronously.
    • Re:Speed of light (Score:5, Informative)

      by bradley13 ( 1118935 ) on Friday November 29, 2024 @06:46AM (#64979041) Homepage
      Ah, you gotta love Slashdot. I really did have an exponent in the equation, and it even showed in the preview, but in the final comment? Gone...
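
For anyone who wants to check the arithmetic with the exponent intact, here is a minimal sketch. The function name is hypothetical, and the 0.7 velocity factor is just the rough figure from the comment above; real cables and traces vary.

```python
# Speed of light in vacuum, in metres per second (exact by definition).
C = 299_792_458.0


def distance_per_period(freq_hz: float, velocity_factor: float = 0.7) -> float:
    """Distance in metres a signal travels during one clock period.

    One period lasts 1 / freq_hz seconds, and the signal propagates at
    velocity_factor * C, so the distance is velocity_factor * C / freq_hz.
    """
    return velocity_factor * C / freq_hz


if __name__ == "__main__":
    # 5 GHz means a period of 1 / (5 * 10**9) s, i.e. 0.2 ns.
    d = distance_per_period(5e9)
    print(f"{d * 100:.1f} cm per period")
```

At 5 GHz and 70% of c this gives a little over 4 cm, which matches the figure in the parent comment.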
      • This issue showed up in the first supercomputers. There are physical limits to how fast you can transfer data and how much energy you can push into an area without everything melting. Still, the human brain functions on relatively slow connections and without excessive heat generation, so there must be a way to do it. It probably just is not a LLM.
        • Re:Speed of light (Score:5, Interesting)

          by drinkypoo ( 153816 ) <drink@hyperlogos.org> on Friday November 29, 2024 @07:26AM (#64979057) Homepage Journal

          I suspect the human brain uses something similar to a LLM because of the similar things it does. But it also does stuff the LLMs don't, and the wetware is very different from hardware. Instead of picturing a 4096 node transputer network or even a million 4004s which is probably closer, imagine 80+ billion nodes each capable of storing and processing a bare handful of floats but each with connections not just to immediate neighbors [nih.gov] in every direction but also to other nearby neighbors on the other side of them, and maybe with storage in the connections themselves as well. And we're trying to simulate it on equipment that is not only fundamentally but also massively different, and with a fractional understanding. It's pretty funny really when you think about it, and then think about what thinking means some more :)

          The human brain achieves its relatively low power dissipation with voltages commonly measured in tens of millivolts, and yet it is thermally sensitive and also produces substantial heat [nih.gov], some of which is removed by air cooling of blood passing through the sinuses [physiology.org].

          • by Tablizer ( 95088 )

            wetware is very different from hardware...each with connections not just to immediate neighbors in every direction but also to other nearby neighbors on the other side of them

            God is a spaghetti-coder; when you're omnipotent they let you do it.

            It's not every neighbor as you kind of imply, but it indeed does have connections far and wide around the brain, something AI hardware will have a harder time emulating. But chipWare does have a speed advantage which can perhaps be used to compensate for and/or emulat

            • I suspect the human brain uses something similar to a LLM because of the similar things it does. But it also does stuff the LLMs don't, and the wetware is very different from hardware.

              But I suspect the best AI for chipWare may work differently than the brain. It's kind of like trying to build an airplane via flapping wings to mirror birds, but that's not the best approach for mechanical machines. Chip-AI may similarly have to find a different way to AI than the brain has.

              A Core i5 uses something similar to an 8086, and I think we’re on the right path with a ways to go. We’re past gluing feathers to flappy things though. More like we’ve basically got lift and fixed wing gliding figured out and a ways to go for controls and powered flight.

              • Chip-AI may similarly have to find a different way to AI than the brain has.

                Our brains can apparently do processing and memory access in the same step (they may even have to) and thought processes seem to work their way back and forth through the squishy matrix of processing and memory getting refined into something hopefully eventually worth realizing outside of our heads. There's a whole cross-checking process involved in having those impulses travel through various regions of our heads, the more convoluted the better, getting transmuted along the way.

                The AI is only able to do th

            • " it indeed does have connections far and wide around the brain,"

              Is that what the attention mechanism does, for tokens?

        • > relatively slow connections

          The neuronal connections are slow but we're just now beginning to measure the terahertz waves coming off the microtubules in the neurons themselves which appear to have tubulin crystals that oscillate and send information through entanglement across the brain network.

          Like in chips we can imagine a power-side bus and a data-side bus and both are valuable.

          Both anaesthetics and psychedelics are shown to dampen or modulate the resonance of these crystals in the Layer-5 pyramidal n

    • I've made the following comment numerous times to my teams when waiting for software or patches or updates to install: Considering this is moving at the speed of light, it's taking an awfully long time to finish.

      However, since the signal is moving at only 70% of the speed of light, it makes sense why it takes so long. After all, it's only moving at 130K/kps rather than the standard 186K/kps.

      For those who are wondering, yes, I am being sarcastically facetious.

      • I've made the following comment numerous times to my teams when waiting for software or patches or updates to install: Considering this is moving at the speed of light, it's taking an awfully long time to finish.

        However, since the signal is moving at only 70% of the speed of light, it makes sense why it takes so long. After all, it's only moving at 130K/kps rather than the standard 186K/kps.

        For those who are wondering, yes, I am being sarcastically facetious.

        RF velocity in wires is just plain weird. A good example is antennas, where you must take into account the coax or ladder line, and then of course the antenna itself.

        The velocity factor even differs depending on whether the antenna is bare metal or insulated. On a wire antenna, if you use insulated wire, the antenna will be slightly shorter than one made of bare wire.

        It is a real trap for chip design. Using a rule of thumb 1 foot per nanosecond, and trying to get sub nanosecond performance, we end up with c

        • At telecom PCB and backplane scale even 8 years ago, striplines with low-k / low loss tangent dielectric materials, we had difficulties that were identified as insufficiently controlled edge roughness on copper traces. Just plain weird indeed.
          • At telecom PCB and backplane scale even 8 years ago, striplines with low-k / low loss tangent dielectric materials, we had difficulties that were identified as insufficiently controlled edge roughness on copper traces. Just plain weird indeed.

            As a rough guess, it probably created extra capacitance and, at the frequencies involved, knocked down the signal. RF always has something new to throw at us, so it could be something else as well.

    • by Entrope ( 68843 )

      Clock distribution networks are all about making sure the clock edges stay aligned enough across the circuit. ("Enough" normally means within the tolerance of setup and hold times.)

      Are people still trying to build self-clocking data flow processors, or is the overhead too high compared to traditional clocked designs?

    • by rossdee ( 243626 )

      David Gerrold, in "When HARLIE Was One", mentioned this with respect to the Graphic Omniscient Device.

    • Yeah, the original article is full of shit. Datacenter size and needs have been growing for a lot longer than this new fake AI schtick has been going up. The BS AI chips that just chew raw power to find exponential interactions between n^1 of data are just today's hot air. When real AI gets here it will be real AI, much more useful, and much more efficient. And yes, at that point those companies still doing the AI Schtick will be done. So will NVIDIA's stock price.

      Previous poster says:

      > ... the spee

      • by PPH ( 736903 )

        When real AI gets here it will be real AI

        Do fusion first. When you've got that working, then AI.

        Fusion will solve the biggest AI problems anyway.

    • If the delay is symmetric, you can still get the systems in sync. You need to know or measure the delays.
      • Part of the initialization sequence in some equipment even several years ago, was a tuning process to discover and compensate for electrical length variation and driver/receiver variation in wide SoC to RAM interconnects.
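
The "measure the delays" idea can be sketched with the two-way timestamp exchange that NTP-style protocols use. This is a minimal illustration, not what any particular piece of hardware does: the function name and the nanosecond figures are made up, and the offset formula is only exact when the delay is the same in both directions.

```python
def offset_and_delay(t0: float, t1: float, t2: float, t3: float) -> tuple[float, float]:
    """Estimate remote clock offset and round-trip delay from four timestamps.

    t0: request sent      (local clock)
    t1: request received  (remote clock)
    t2: reply sent        (remote clock)
    t3: reply received    (local clock)

    Assumes the one-way delay is symmetric; any asymmetry shows up as
    an error in the estimated offset.
    """
    offset = ((t1 - t0) + (t2 - t3)) / 2
    delay = (t3 - t0) - (t2 - t1)
    return offset, delay


if __name__ == "__main__":
    # Hypothetical numbers in nanoseconds: the remote clock runs 5 ns ahead,
    # each direction takes 3 ns, and the remote side takes 1 ns to reply.
    offset, delay = offset_and_delay(t0=0, t1=8, t2=9, t3=7)
    print(f"offset = {offset} ns, round-trip delay = {delay} ns")
```

Once the offset is known, each side can correct for it, which is the sense in which symmetric delay still allows the systems to be brought into sync.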
    • Re: (Score:2, Interesting)

      by Ol Olsoc ( 1175323 )

      It all gets crazier, when you remember the speed of light, and that an electrical signal only travels at about 70% of that. If you have, say, a 5GHz signal, that means that in one complete period (which takes 1 / 5*10 second), the signal will only travel about 4cm. Assuming you are trying to get a bunch of separate components to work in sync, well, that's not going to work. Work in parallel, maybe, but never synchronously.

      Yup, and even at the unencumbered speed of light in free space, there is a hard speed limit.

      AI is going to be a bubble like no other. My guess is that the incredible amount of power they will suck out of the grids will affect the price of electricity to the point where normies will just install solar and go off the grid. If you don't live in a development, it is already often cheaper to go to a discrete solar array.

      Then when the AI bubble bursts and the sunk money and the gigawatt power isn't neede

    • Work in parallel, maybe, but never synchronously.

      Of course it will work, it's just an engineering problem to make it happen. Digital signalling has worked with this problem for an eternity, it is especially important when signals are sent differentially. Many things by necessity on your motherboard need to happen in sync and despite components being close together you may find that the actual signals arrive at wildly different times.

      This is why when you look at a high speed parallel interface you often see squiggly lines, that is an engineering attempt to

  • by thesjaakspoiler ( 4782965 ) on Friday November 29, 2024 @07:55AM (#64979083)

    With 20 kW for 72 devices, it would come down to an extra 278 watts or so per device.
    Only a hefty CO2 laser can deliver that amount of power.
    Not to forget that copper interconnects don't go from chip to chip directly in between racks.
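
The per-device figure is simple division, sketched below. The function name is hypothetical, and the sketch assumes the extra 20 kW quoted in the summary is spread evenly across all 72 GPUs, which is a simplification.

```python
def extra_watts_per_device(total_extra_watts: float, devices: int) -> float:
    """Average extra power per device, assuming an even split (an assumption)."""
    return total_extra_watts / devices


if __name__ == "__main__":
    # Figures from the summary: 20 kW extra for the 72 GPUs in an NVL72 rack.
    print(f"{extra_watts_per_device(20_000, 72):.0f} W per GPU")
```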

  • by evanh ( 627108 ) on Friday November 29, 2024 @07:59AM (#64979087)

    Fundamentally, the total power required is already stupid high with nothing to show. Spreading it out isn't any sort of fix. All you're doing is upping the total power needs even further.

    • by thegarbz ( 1787294 ) on Friday November 29, 2024 @10:14AM (#64979285)

      with nothing to show

      AI has plenty to show. You have just fallen into the trap of thinking AI = ChatGPT / CoPilot. In reality we've used large models of trained neural networks for many years doing many different things and the output of it is absolutely amazing.

      For example one tool I have uses an AI model trained only on astronomy images to correct for motion and blur in stars. It achieves in seconds orders of magnitude better results than some people whose profession is to work with deconvolution can do in many hours. Even in industry such tools are heavily used such as supercomputers you find at some oil companies - what once was sitting solving wave equations is now sitting training AI models because it turns out it is actually better at classifying oil fields than people were at throwing equations at the problem. We've blown a lot of computing power and time at training models to identify how control schemes change as well pressure depletes, we have trained models to identify cancer on MRIs with accuracy better than professional diagnosis for some cancers.

      To say it has nothing to show, is to say "I've not looked - and I haven't found anything". AI is far bigger than a shitty chatbot that most of Slashdot associate with it.

      • For example one tool I have uses an AI model trained only on astronomy images to correct for motion and blur in stars.

        Can you share the name of the tool?

        • https://www.rc-astro.com/softw... [rc-astro.com]

          BlurXTerminator - it's unfortunately a plugin for PixInsight only, as the model was trained on linear data, so if you don't have PixInsight then you're in for an expensive investment. I also am super impressed with NoiseXTerminator - this one is available for Photoshop too.

          That said, don't waste your money on StarXTerminator or GradientXTerminator. The former I find no different from the free StarNet+ plugin, and the latter I find I can get the same results in about 10 min usi

  • >"Unfortunately, in their current form, pluggable optics aren't particularly efficient or particularly fast."

    It is every bit as fast as anything you can do with copper, and potentially very much faster (and not because of the "speed of light" but due to the properties of light, like lack of EMI). As for power efficiency, I would say that is mostly because of distance. Fiber modules, in their current form, are designed to push a readable signal for long (hundreds of meters) or very long distances (kilom

    • If you RTFA that's what this article is actually about. There's a few paragraphs on how copper is currently a limiter and then a whole bunch of them on a startup called "Ayar Labs" which has developed and is continuing to develop optical interconnects which are going into products from Intel and Fujitsu.

    • Agreed that a different fibre standard could help - but... it's been a long time since I really knew this stuff... in an optical connection, you're converting an electrical signal to light, which is inherently inefficient. What's more, the power of the light isn't used at the other end, in effect it's thrown away and new power is used to convert it to an electrical signal again. Contrast to a copper cable, where the electrical signal is (probably) boosted a bit to send it out, but the energy contained withi

  • by dfghjk ( 711126 ) on Friday November 29, 2024 @08:02AM (#64979091)

    "Much of this trend has been driven by a need to support ever larger AI models."

    What need? It's a want, not a need, and it's the direction because tech bros can't think of anything else.

    Just because "the number of parameters in AI systems is growing 32-fold approximately every three years" doesn't mean it should or that it will continue to. One good reason is to avoid "pushing copper to its breaking point". Hell, even Intel recognized this 25 years ago. There comes a time when turning the knob up stops working.

    "Unfortunately, in their current form, pluggable optics aren't particularly efficient or particularly fast."
    OK, definitely don't solve that. Just add 20 kW of power to a rack, and pay for it by cancelling social security and medicare and eliminating the minimum wage. Put the elderly and indigent to work cooling the racks of our AI overlords.

    • by gweihir ( 88907 )

      What need? It's a want, not a need, and it's the direction because tech bros can't think of anything else.

      Exactly. But these assholes always like to pretend what they are doing is critical and will save the world. In this case, very obviously not so.

      That said, there seems to be some value in small, specialized LLMs.

    • What need? It's a want, not a need, and it's the direction because tech bros can't think of anything else.

      The accuracy and usability of a model is dependent on its size. You can see that in the development of models even over the past few years. The bigger the data set the better the result.

      You may be happy that your AI assisted cancer diagnosis is basically no better than a coin flip, but other people may not want to stake their lives on it.

    • Just because "the number of parameters in AI systems is growing 32-fold approximately every three years" doesn't mean it should or that it will continue to.

      Well apparently they have hit a wall, throwing more parameters into the models is not significantly producing better models. The only measurable increase in ChatGPT-5 is in language skills. Logic, coding etc. are the same, despite having significantly more parameters. So if more parameters is not making better models we might have reached the peak ability of large language models. Unless they pull a bunny out of a hat ChatGPT-4 might be as good as it gets.

      "Bill Gates feels Generative AI has plateaued,

  • by gweihir ( 88907 ) on Friday November 29, 2024 @08:14AM (#64979109)

    Does the human race not have any real problems to solve?

    • by Luckyo ( 1726890 )

      Well, current gen AI is rapidly solving a problem that we have struggled with and spent an incredible amount of resources on since the first meeting between two different tribes: the language barrier.

      This alone elevates LLMs into the niche of "uniquely valuable tool that is actually solving a problem we failed to solve for the entire existence of our species".

      • by gweihir ( 88907 )

        Not really. Automated translation _has_ become pretty good, but it does not use LLMs. The neural networks used for automated translations use proper supervised learning, which is vastly superior if you can afford it. LLMs, on the other hand, only use weak forms of supervised learning or something that is called "self-supervised". Hence LLMs do not actually play a role here.

        • by Luckyo ( 1726890 )

          >Automated translation _has_ become pretty good, but it does not use LLMs.

          The overwhelming majority of current translations are LLM translations. In fact, not "LLM" in general, but one specific LLM: ChatGPT.

          You can check this very easily. Go on steam, sort by recent Japanese indie games, look up what they claim they used to translate game from Japanese to English.

    • by MpVpRb ( 1423381 )

      Scientists and engineers are working on developing AI that can do useful stuff. AlphaFold is a good example.
      Unfortunately, most of the consumer-facing AI is just crap generators.

    • YES! We really need to solve the small penis problem! Just imagine if all the insecure tiny men didn't turn into overcompensating assholes, or worse: narcissists.

  • They better figure out efficient opamps and transducers soon because the Green New Deal needs all of the world's copper production through 2169 by 2030.

    Scratch that - that estimate was before environmentalists shut down the world's largest copper mine in Guatemala.

    Copper is going to get crazy expensive if those regulations are rammed through. Many economists will have black eyes along the way. And Jaguar shareholders.

    I thought the photonics chaps had something close to 1:1 electron/photon transduction a fe

  • All the AI bullshit aside, the technical problem sounds exactly like what Seymour Cray was running up against when trying to build the CDC 8600, and which prompted him to completely rethink supercomputer design. The result was the Cray-1.

    It seems to my ill-informed brain that we may be at a similar inflection point: either we are on the brink of a new computer design breakthrough, or the LLM approach will be scrapped (but the vast array of resources now available to AI research will let them try totally unique ideas

  • When you hit bandwidth issues, then increasing the speed is your only solution. Now, too many companies just think about doing things the same way, just faster. When it comes to many of these things, putting real work into things like increasing the bandwidth for interconnects would be a better solution, but NVIDIA doesn't want to put in the hard work of coming up with things that allow more to be done in parallel. If you have memory bandwidth issues, and HBM with 512 bit isn't enough, then look for w

    • NVIDIA doesn't want to put in the hard work of coming up with things that allow more to be done in parallel

      What? That's all they do. The whole nature of the modern GPU and/or GPGPU is that you have tons of little cores which are aggressively parallel and have special memory to improve the parallelism.

      Can things be broken up so you don't need as high "speeds" to get the job done?

      My guess is that we will eventually get better at switching functional units on and off (or even just parts of them) so that we can turn P-cores into E-cores whenever we want, but before then we will get schedulers which can intelligently feed both P-cores and E-cores at the same time. But GPUs seem to have so many

  • by Baron_Yam ( 643147 ) on Friday November 29, 2024 @10:32AM (#64979315)

    Put a bunch of processing units on a giant inverted concave surface, and let them communicate directly with each other by laser.

    The bowl can be a heat sink, and you can cool it from above. Make the bowl a mesh so any hot air accumulating below can rise through. Put it all in a clean room so the lasers don't have to deal with dust. Keep the lights off so you have maximum contrast and can use less powerful lasers.

    Now your system size / node quantity is limited by the number of laser transceivers you can mount on a processor. You're never going to make it big enough to worry about speed of light delay like you would with a copper data bus.

    • Actually, scratch the ceiling-bowl idea; a circular room might be a lot more practical. Maybe made of angled horizontal panels so the heat from lower processors can be routed away from those above them.

      If angles are too sharp for immediately adjacent nodes, put mirrors between the nodes so they can reach them with an extra hop.

      • Obviously you want it to be a flexible sphere outfitted with shape memory wires which can contort it into different spheroids so that communications can be redirected to different nodes. They will of course have to be filled with vacuum for maximum performance, so the only cheap place to put them is going to be in orbit.

  • by perry64 ( 1324755 ) on Friday November 29, 2024 @11:51AM (#64979463)

    Anyone else get one of Grace Hopper's nanoseconds?

  • Maybe they can use the nuclear power plants that power AI to transmute gold into copper.

  • Optics are an order of magnitude worse, or more, than copper interconnect.

    Cluster complexity is currently limited by L1 reliability on optics. Much development happening very quickly.

  • I looked for a conversion guide in Google and it offered up this AI answer:

    https://imgur.com/a/ejcuokq

    We are totally getting our money's worth burning down the planet for this.
