AI Graphics Microsoft

Microsoft Details 'Planet-Scale' AI Infrastructure Packing 100,000+ GPUs (theregister.com) 45

Microsoft has revealed it operates a planet-scale distributed scheduling service for AI workloads that it has modestly dubbed "Singularity." The Register reports: Described in a pre-press paper [PDF] co-authored by 26 Microsoft employees, Singularity's aim is described as helping the software giant control costs by driving high utilization for deep learning workloads. Singularity achieves that goal with what the paper describes as a "novel workload-aware scheduler that can transparently preempt and elastically scale deep learning workloads to drive high utilization without impacting their correctness or performance, across a global fleet of AI accelerators (e.g., GPUs, FPGAs)."

The paper spends more time on the scheduler than on Singularity itself, but does offer some figures to depict the system's architecture. An analysis of Singularity's performance mentions a test run on Nvidia DGX-2 servers using a Xeon Platinum 8168 with two sockets of 20 cores each, eight V100 Model GPUs per server, 692GB of RAM, and networked over InfiniBand. With hundreds of thousands of GPUs in the Singularity fleet, plus FPGAs and possibly other accelerators, Microsoft has at least tens of thousands of such servers! The paper focuses on Singularity's scaling tech and schedulers, which it asserts are its secret sauce because they reduce cost and increase reliability.

The software automatically decouples jobs from accelerator resources, which means when jobs scale up or down "we simply change the number of devices the workers are mapped to: this is completely transparent to the user, as the world-size (i.e. total number of workers) of the job remains the same regardless of the number of physical devices running the job." That's possible thanks to "a novel technique called replica splicing that makes it possible to time-slice multiple workers on the same device with negligible overhead, while enabling each worker to use the entire device memory." [...] "Singularity achieves a significant breakthrough in scheduling deep learning workloads, converting niche features such as elasticity into mainstream, always-on features that the scheduler can rely on for implementing stringent SLAs," the paper concludes.
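As a rough illustration of the "world-size stays the same" idea quoted above, here is a minimal Python sketch. It is not Microsoft's implementation: the function name `map_workers_to_devices` and the contiguous-block placement policy are assumptions made for the example. It only shows how a fixed set of logical worker ranks can be remapped onto fewer or more physical devices, so that several workers end up time-sliced on one device when the job scales down.

```python
from collections import defaultdict


def map_workers_to_devices(world_size: int, num_devices: int) -> dict[int, list[int]]:
    """Place `world_size` logical workers onto `num_devices` physical devices.

    Hypothetical helper for illustration only. The number of workers never
    changes; only the device each worker is mapped to does.
    """
    if not 1 <= num_devices <= world_size:
        raise ValueError("need 1 <= num_devices <= world_size")
    placement = defaultdict(list)
    for rank in range(world_size):
        # Contiguous blocks of ranks share a device (an assumed policy).
        device = rank * num_devices // world_size
        placement[device].append(rank)
    return dict(placement)


# The same 8-worker job at three different physical scales.
full = map_workers_to_devices(8, 8)    # one worker per device
half = map_workers_to_devices(8, 4)    # two workers time-sliced per device
small = map_workers_to_devices(8, 2)   # four workers time-sliced per device
```

In each case the job still sees eight workers; scaling down only increases how many of them share a device.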


  • by nospam007 ( 722110 ) * on Tuesday February 22, 2022 @09:27PM (#62294021)

    Plus the billions of chips he injected in us, Skynet here we come.

  • Why not AMD EPYC? You get more PCIe lanes!

    • by gweihir ( 88907 )

      Money talks. Intel has been convicted several times for illegal business practices...
      Nobody sane would use Intel for any new DC application these days.

  • ...we WILL run Crysis on max settings!

  • ... what could possibly go wrong.

    Pro-tip: When talking about things that might scare people, don't use scary names.

    • I think the name refers more to the current cost of GPUs and even Microsoft can only afford one setup like it.
    • by narcc ( 412956 )

      The singularity is silly science fiction nonsense. No adult of normal intelligence should be frightened of it.

      Pro tip: There are no monsters in your closet, nothing is hiding under your bed, the bogeyman doesn't exist.

      • *shrugs* But what about refactoring, recasting the idea of OS scheduling? Systems people, is this something to take note of or not, as opposed to the popular machine learning techniques?

      • The singularity is silly science fiction nonsense. No adult of normal intelligence should be frightened of it.

        Pro tip: There are no monsters in your closet, nothing is hiding under your bed, the bogeyman doesn't exist.

        You're right about the "silly science fiction" part. However, the technological singularity (to differentiate it from the star-eating monster of physics) has actually been described as the geek version of nirvana or heaven (depending on your quasi-religious orientation), something very desirable, a state of orgasmic bliss for the post-human human. So nothing to be afraid of?

        • by narcc ( 412956 )

          Ah, yeah. I've called that Ray Kurzweil's promise of salvation with a glorious video game afterlife. It's an odd little cult. I've always wondered who they thought was going to bother keeping the lights turned on...

    • Though Microsoft did change SkyDrive to OneDrive, so ... Onenet?

  • Speed and size affect how well you do what you can already do. They do not grant new capabilities.

    We need a paradigm shift to get new capabilities. Perhaps quantum chips, perhaps new software.

    • Good points. However, one can simulate more sophisticated neural networks with bigger size. The origin or the nature of consciousness is still being debated: some say it's an emergent property, some say it's quantum in nature and requires a quantum device. If the former, then considering that a human brain has about 100e9 neurons with 100e12 connections, it's just a matter of the size of the simulation.

      • by narcc ( 412956 )

        People believe a lot of dumb things against all reason. Computationalism is among them.

        They picked the name "singularity" precisely to set the imaginations of that particular group of idiots on fire. I think I hurt my eyes rolling them when I read the name.

      • by AmiMoJo ( 196126 )

        It's not that simple because in an organic brain signal propagation speed is determined by the laws of physics, but in computers it's determined by things like the clock speed of the interconnects and the need to synchronize units.

        It's also impossible to completely simulate an analogue process digitally. The fidelity can be very good, but never perfect. As an example, look at SPICE packages that simulate electronic circuits. That may or may not matter much with AI, it remains to be seen.

        When we finally do g

  • and they ask me to pick up a piece of paper. Call that job satisfaction? I don't.
  • Might as well name it holocaust or Armageddon. They joke about the hypothetical end of civilization.
  • Microsoft re-imagining the brain fucked scheduler

  • That seems very arrogant. I guess they did deliver planet-scale blue screens of death, followed by planet-scale forced updates and reboots. So even though the number seems small, I guess 100,000 GPUs should be enough for anybody?

  • For those who remember the HHGTTG.

    I hope that Slartibartfast does the fjords again.

  • Sounds like the XBox Cloud Gaming service is not very popular and they bought way too much hardware.

  • ...unable to buy a single GPU without having to wait a year or paying multiple times the MSRP.
  • to Microsoft until they're forced to keep sending money. So stupid.
  • MS can't even tell me why my Azure backup failed to restore when needed. They have not a hope in hell of making AI work.

    MS's core problem is that no one with any pride in their work would go near them for employment, so it's staffed top to bottom with failures and egotists without any real skill. That's why their products are almost uniformly shit with one exception, but I doubt that they're going to put the XBox team onto this.

    • MS can't even tell me why my Azure backup failed to restore when needed.

      More likely they won't tell you that it's their negligence.

  • Only 100,000 GPUs? Not enough to take control of a blockchain from Chinese mining farms.

  • Do you want Skynet? Because that's how you get Skynet.

  • Companies need more powerful AI to slip wanker-pill spam past spam filters.

  • It has a good ring to it.
