Google Linux

Google Proposes Shutdown Changes To Speed Linux Reboots (phoronix.com) 50

UnknowingFool writes: Google has proposed a change to how the Linux kernel handles shutdowns, specifically when NVMe drives are used. The issue Google has found is that the current NVMe driver uses a synchronous API during shutdown, and each NVMe drive can take up to 4.5 seconds to shut down. For a system with 16 NVMe drives, that can add more than a minute to a reboot. While only large enterprise systems face this problem today, more enterprises are replacing their mechanical-disk RAID servers with SSD ones.

[...] The proposed patches from Google allow for an optional asynchronous shutdown interface at the bus level. The new interface maintains backwards compatibility with the synchronous implementation. The patches move all PCI Express-based devices to the async interface, implement the changes at the PCIe level, and then update the NVMe driver to exploit the async shutdown interface.
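
To get a rough feel for the difference, here is a toy userspace sketch (not the actual kernel patch; the thread-per-drive model and the 16-drive, 4.5-second figures are simply lifted from the summary above) comparing serialized shutdown with starting every shutdown first and waiting afterwards:

    /* Toy model only: 16 simulated "drives" that each take 4.5 s to shut
     * down, serial vs. overlapped.  Build with: gcc -pthread */
    #include <pthread.h>
    #include <stdio.h>
    #include <time.h>

    #define NDRIVES 16

    static void *drive_shutdown(void *arg)
    {
        /* stand-in for one drive flushing its caches on shutdown */
        struct timespec ts = { 4, 500000000 };  /* 4.5 seconds */
        nanosleep(&ts, NULL);
        printf("drive %ld down\n", (long)arg);
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NDRIVES];
        long i;

        /* Synchronous model (what the summary describes today): each drive
         * finishes before the next starts, so 16 * 4.5 s = 72 s:
         *
         *     for (i = 0; i < NDRIVES; i++)
         *             drive_shutdown((void *)i);
         */

        /* Asynchronous model: start every shutdown, then wait for all of
         * them; the total is roughly a single drive's 4.5 s. */
        for (i = 0; i < NDRIVES; i++)
            pthread_create(&t[i], NULL, drive_shutdown, (void *)i);
        for (i = 0; i < NDRIVES; i++)
            pthread_join(t[i], NULL);

        return 0;
    }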

This discussion has been archived. No new comments can be posted.


Comments Filter:
  • Not sure why this is news. Lots of large corporations, including Google, submit changes to Linux all the time.

    • by suss ( 158993 )

      They're probably trying to get popup ads into the kernel. Google is an advertisement company.

    • I think what's real news is that the API takes 4.5 seconds to shut down an NVMe drive. Making them happen simultaneously is great, but how can it take so long?

      • Not sure. I have several NVMe drives in my Linux system and it does not take 4.5s to shut down; shutdown is basically instantaneous. So this is probably some very specific brand/version of NVMe that Google uses in their servers. Sounds like they flush a cache, considering the time it takes.
        • by cj* ( 149112 )

          +1

          If it takes 4.5 seconds to cleanly shut down, that seems to mean there is up to 4.5s worth of data in transit per NVMe drive.

          Granted, random writes are pretty pathetic for NAND, but are power failures and kernel crashes totally eliminated?

      • Also, this is Linux, not Windows; it's not like you're rebooting the things eight times a day just to keep them running.
    • Maybe because this is the exact same problem systemd was trying to solve, except at startup, and their solution is much simpler.

  • I too have waited forever for a server to reboot, only to connect a monitor and see "a stop job is running [5m50s of ???]"

    • I too have waited forever for a server to reboot, only to connect a monitor and see "a stop job is running [5m50s of ???]"

      I hate that message. My NAS server does that once in a while. I thought it might be a bad drive, but I tested them all one at a time, and collectively, and they are all good.

      • What's even worse is that ctrl-c does nothing, so you start Googling wondering what could possibly be so serious as to prevent ctrl-c from working on a simple stopped job on your own computer, and find the numerous closed bug reports: "Will Not Fix - Working as Designed." You can also find Poettering's explanation in one of the bugs, saying that systemd can't be sure that you have ownership of the system so therefore it won't trust the terminal and ctrl-c does nothing by design.

        >"systemctl cancel-job" is

        • This isn't a documentation issue but a breaking convention issue.

          May I humbly suggest the issue is Poettering's ego, and whatever dirt he has on others to con them into backing his ill-conceived rat's nest.

        • Alt-SysRq-REISUB should still work, no? I recall my brother boasting that Linux doesn't just turn off at Ctrl-Alt-Del, and I hit Alt-SysRq-B to have it reboot. Good fun!
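
          For what it's worth, the same magic-SysRq actions can also be triggered in software by writing the command letter to /proc/sysrq-trigger, provided the kernel.sysrq sysctl allows it. A minimal C sketch of the 'b' (immediate reboot) action, purely for illustration:

            /* Sketch only: emulate Alt-SysRq-B (immediate reboot, no sync or
             * unmount) by writing 'b' to /proc/sysrq-trigger.  Needs root and
             * a kernel.sysrq setting that allows it. */
            #include <fcntl.h>
            #include <stdio.h>
            #include <unistd.h>

            int main(void)
            {
                int fd = open("/proc/sysrq-trigger", O_WRONLY);

                if (fd < 0) {
                    perror("open /proc/sysrq-trigger");
                    return 1;
                }
                if (write(fd, "b", 1) != 1)
                    perror("write");
                close(fd);
                return 0;
            }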
    • That's systemd bullshit. My laptop does it occasionally and I just hold the power switch. I had a box with two quad NICs with only a few ports connected. There was a systemd service that waited 90 seconds for each port to obtain a link. Yeah, that got disabled in a hurry.

      • Comment removed based on user account deletion
        • It's systemd because systemd is responsible for startup, shutdown, stopping jobs, timeouts, and mounting/unmounting. It's very common for users with network-connected drives to wonder why their system never seems to reboot or start up in a timely manner (or at all) when there's a network issue.

  • ...make a thing called say "systemd" that tracks installed components so that t ~% 4``;& NO CARRIER

    • ...make a thing called say "systemd" that tracks installed components so that t ~% 4``;& NO CARRIER

      server01 ~ # nvmectl --please-actually-shutdown-the-nvmes-asynchronously

  • what about the RAID system shutdown tasks?

  • You never need to reboot a Linux system. It either gets powered off if the datacenter shuts down and the UPS doesn't work or it gets turned off when it's dead.

    • Kernel upgrade?
    • by Pieroxy ( 222434 )

      So, when you do a config change on your servers, how do you know it will still work after a reboot? Well, by rebooting them.
      And when you install a new kernel, how do you run it? Well, by rebooting them.
      What if you want to reinstall a different Linux flavor on your servers, how do you do it? Well, by rebooting them.
      What if you want to add a RAM module or an SSD to your servers, how do you do it? Well, by rebooting them.

      There are plenty of reasons to reboot a server, Linux or otherwise. It is a normal part o

      • by guruevi ( 827432 )

        The post was a bit tongue-in-cheek, but for kernel upgrades there's live patching. I've never switched a Linux flavor in production; you're either RHEL or Ubuntu these days, or perhaps SLES in the European markets. Any disk is hot-plug these days. If you end up going for a physical hardware upgrade, the 5s doesn't matter in the downtime window you schedule.

    • by AvitarX ( 172628 )

      Seems like a waste of power to me.

      Why would you keep it powered up all the time if it can quickly come online?

      • I mean, servers are typically the sort of thing that are in 24/7 operation outside of planned maintenance windows. You can't exactly have a janitor sitting there waiting for a request to the website to come in so that he can go power on the server to send the information.

        • by AvitarX ( 172628 )

          I suppose it depends on the company size.

          But I don't think it's as common as you think past a certain size.

          I definitely wouldn't be shocked to learn that Google scales the number of servers they have on with predicted demand.

          I suppose you think every company uses the same amount of resources with AWS 24/7 too?

    • by caseih ( 160668 )

      I was shocked to hear on a podcast recently that one of the hosts, who is certainly not a newbie, reboots his personal servers once a week. And I think he does this with a script: he pulls in updates and then has it reboot.

      I don't maintain any servers but my own, and I certainly only reboot for kernel updates after checking the errata. Sometimes kernel updates are not security updates, so if they don't affect my hardware, I just skip them.

      All that said, I'm sure in a data center which might have th

    • Most companies do need to reboot a small percentage of their Linux servers now and then for things like major updates. For someone like Google, that could be thousands of servers, as they have millions of them.
  • Is this like having to shut down an MS Windows machine to change the screen resolution?
    • Was that even a thing in the 3.x days? Having to reboot NT to change an IP address was always a pain in the ass.

    • That sounds like X Windows in many early systems.

      • Ctrl-Alt-Backspace. Never a reboot for an X-Windows change. Just update the xorg.conf (or its predecessor, XF86Config), log off (if you want) and kill the X server (later login managers got a restart-X option), voila.
    • With OS/2, you had to reboot the OS to change the 256-colour palette. That is why none of the big game companies (and few of any size) made games for OS/2, which is one big reason it never went anywhere.

  • Fewer than a hundred lines of change, and honestly it seems obvious that you'd always want to be able to shut down multiple devices in parallel.
  • I mean, this is good and all — any long-running operation should always be asynchronous with a completion callback so that you can wait at an appropriate time (e.g. after ten of them are queued up or whatever), and this API should never have been designed to be synchronous in the first place — but it seems to me that the bigger problem (which this masks) is that the drives are designed wrong.

    If a drive takes five seconds to shut down, that means it has enough state in RAM to require five second

    • I agree. Drives shouldn't keep multiple seconds' worth of unwritten data in cache. Let's hope it is just a worst-case scenario, or at least that someone fixes the storage driver so that it only takes that long in the middle of a heavy workload.
    • Back when SSDs were new, there were a lot of (expensive) capacitors on the drive. These provided power to the SSD when there was a hard shutdown or power failure. They provided power long enough for the drive to safely store the onboard RAM cache.
      Not sure if this is still the case since prices dropped so much.
    • by jmccue ( 834797 )

      Am I missing something?

      Maybe, maybe not. My guess is Google has an insane amount of memory and wants to use that for data storage. Then at shutdown time, that memory needs to write to the disk. You really need a real good UPS for that.

      • by dgatwood ( 11270 )

        Am I missing something?

        Maybe, maybe not. My guess is Google has an insane amount of memory and wants to use that for data storage. Then at shutdown time, that memory needs to write to the disk. You really need a real good UPS for that.

        I don't think this can be in-RAM, kernel-provided caching. This is down at the NVMe driver level, which means that whatever is causing that delay must involve RAM on the NVMe card itself, i.e. the drive's write buffer. I mean, unless Linux's driver architecture is very strange.... I haven't looked at it since the 2.1 kernel or thereabouts.

      • by DarkOx ( 621550 )

        If this is the case - and not having RTFA'd in proper Slashdot fashion, I have no idea.

        It would seem to me the better approach would be to do something in the shutdown process that explicitly fsyncs the various disks (in parallel) after you shut down the workloads. The caches may get slightly dirtied again as basic system services shut down and such, but there should not be a massive backlog of writes waiting. Then you let the kernel do the very conservative thing it's doing now and clean everything up in serial, one at a ti
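
        A rough userspace sketch of that idea (illustrative only; the device paths are placeholders, and a real version would live in the shutdown sequence itself): open each disk and fsync it from its own thread, so the flushes overlap instead of queueing up one at a time.

          /* Illustrative only: flush a few block devices in parallel ahead of
           * the kernel's (currently serial) per-device shutdown.  The device
           * paths are placeholders; needs root.  Build with: gcc -pthread */
          #include <fcntl.h>
          #include <pthread.h>
          #include <stdio.h>
          #include <unistd.h>

          static const char *devs[] = {
              "/dev/nvme0n1", "/dev/nvme1n1", "/dev/nvme2n1"  /* placeholders */
          };
          #define NDEVS (sizeof(devs) / sizeof(devs[0]))

          static void *flush_dev(void *arg)
          {
              const char *path = arg;
              int fd = open(path, O_RDONLY);  /* fsync works on read-only block device fds */

              if (fd < 0) {
                  perror(path);
                  return NULL;
              }
              if (fsync(fd) != 0)  /* flush dirty pages and the device write cache */
                  perror(path);
              close(fd);
              return NULL;
          }

          int main(void)
          {
              pthread_t t[NDEVS];
              size_t i;

              for (i = 0; i < NDEVS; i++)
                  pthread_create(&t[i], NULL, flush_dev, (void *)devs[i]);
              for (i = 0; i < NDEVS; i++)
                  pthread_join(t[i], NULL);

              return 0;
          }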

    • by tlhIngan ( 30335 )

      If a drive takes five seconds to shut down, that means it has enough state in RAM to require five seconds worth of flushing to reach a state in which it is safe to power off the module. Assuming modern NVMe speeds, that's double-digit gigabytes of unwritten data. Having so much data in a transient state violates reasonable user expectations about data integrity.

      Write caching is fine up to a point, but any cached data should be written to disk within a reasonable amount of time after a flash page stops being

  • Remember, dear programmers, this applies in user land too. It's std::future<void> HardwareClass::shutdownAsync() instead of bool HardwareClass::shutdown(), and if you want to use C instead of C++, it's GAsyncResult when using GLib's GIO.
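
    A hedged sketch of that GLib/GIO pattern in C (the hardware_shutdown_* names are invented for illustration; only the GTask/GAsyncResult machinery is real GLib API): the async call returns immediately, and completion arrives through a GAsyncReadyCallback that hands back a GAsyncResult for the matching _finish() function.

      /* Sketch of the conventional GIO async pattern; hardware_shutdown_async
       * and hardware_shutdown_finish are invented names for illustration.
       * Build with: gcc example.c `pkg-config --cflags --libs gio-2.0` */
      #include <gio/gio.h>

      static void shutdown_thread(GTask *task, gpointer source_object,
                                  gpointer task_data, GCancellable *cancellable)
      {
          /* ... the slow hardware work would happen here ... */
          g_task_return_boolean(task, TRUE);
      }

      static void hardware_shutdown_async(GCancellable *cancellable,
                                          GAsyncReadyCallback callback,
                                          gpointer user_data)
      {
          GTask *task = g_task_new(NULL, cancellable, callback, user_data);

          g_task_run_in_thread(task, shutdown_thread);  /* returns immediately */
          g_object_unref(task);
      }

      static gboolean hardware_shutdown_finish(GAsyncResult *res, GError **error)
      {
          return g_task_propagate_boolean(G_TASK(res), error);
      }

      static void on_shutdown_done(GObject *source, GAsyncResult *res,
                                   gpointer user_data)
      {
          GError *error = NULL;

          if (hardware_shutdown_finish(res, &error))
              g_print("shutdown finished\n");
          else {
              g_printerr("shutdown failed: %s\n", error->message);
              g_clear_error(&error);
          }
          g_main_loop_quit(user_data);
      }

      int main(void)
      {
          GMainLoop *loop = g_main_loop_new(NULL, FALSE);

          hardware_shutdown_async(NULL, on_shutdown_done, loop);  /* kick it off */
          g_main_loop_run(loop);  /* the completion callback fires from the main loop */

          g_main_loop_unref(loop);
          return 0;
      }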
