Google Proposes Shutdown Changes To Speed Linux Reboots (phoronix.com) 50
UnknowingFool writes: Google has proposed a change to how the Linux kernel handles shutdowns, specifically when NVMe drives are used. The issue Google has found is that the current NVMe driver uses a synchronous API when shutting down, and the shutdown can take 4.5 seconds for each NVMe drive. For a system with 16 NVMe drives, that could add more than a minute to a reboot. While this is a problem that currently only large enterprise systems face, more enterprises are replacing their mechanical-disk RAID servers with SSD ones.
[...] The proposed patches from Google allow for an optional asynchronous shutdown interface at the bus level. The new interface maintains backwards compatibility with the synchronous implementation. The patches move all PCI Express-based devices to the async interface, implement the changes at the PCIe level, and then update the NVMe driver to take advantage of the async shutdown interface.
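In rough terms, the change splits the single blocking shutdown callback into a "start it" step and a "wait for it" step, so the kernel can kick off every drive's shutdown before waiting on any of them. The following minimal C sketch shows only that shape; the struct and hook names are made up for illustration and are not the identifiers used in the actual patch series.

/*
 * Illustrative sketch only: a two-phase (start + wait) shutdown interface
 * versus the old single synchronous callback. All names are hypothetical.
 */
struct device;

struct bus_ops {
    /* Old model: blocks until the device has finished shutting down. */
    void (*shutdown)(struct device *dev);

    /* New model: kick off the shutdown and return immediately... */
    void (*shutdown_start)(struct device *dev);
    /* ...then block until that previously started shutdown completes. */
    void (*shutdown_wait)(struct device *dev);
};

/* With N devices, the old loop costs the sum of the per-device times: */
void shutdown_all_sync(struct device **devs, const struct bus_ops *ops, int n)
{
    for (int i = 0; i < n; i++)
        ops->shutdown(devs[i]);            /* ~4.5 s each, serialized */
}

/* The two-phase loop costs roughly the slowest single device instead: */
void shutdown_all_async(struct device **devs, const struct bus_ops *ops, int n)
{
    for (int i = 0; i < n; i++)
        ops->shutdown_start(devs[i]);      /* all drives flush in parallel */
    for (int i = 0; i < n; i++)
        ops->shutdown_wait(devs[i]);       /* then wait for each to finish */
}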
Seems reasonable (Score:2)
Not sure why this is news. Lots of large corporations, including Google, submit changes to Linux all the time.
Re: (Score:1)
They're probably trying to get popup ads into the kernel. Google is an advertisement company.
Re: (Score:2)
They're probably trying to get popup ads into the kernel. Google is an advertisement company.
You meant systemd, right?
Re: (Score:2)
systemd isn't the kernel... yet.
Re: (Score:2)
systemd isn't the kernel... yet.
I think you mean the kernel isn't in systemd yet.
Re: (Score:2)
I think what's real news is that the API takes 4.5 seconds to shut down an NVMe drive. Making them happen simultaneously is great, but how can it take so long?
Re: (Score:2)
Re: (Score:1)
+1
If it takes 4.5 seconds to cleanly shut down, that seems to mean there is up to 4.5s worth of data in transit per NVMe drive.
Granted random writes are pretty pathetic for NAND, but are power failures and kernel crashes totally eliminated?
Re: (Score:1)
Parent just pulls the plug out of the wall. It's basically picoseconds.
Re: (Score:2)
Re: (Score:2)
Maybe because this is the exact same problem systemd was trying to solve, except at startup, and their solution is much simpler.
I feel their pain (Score:2)
I too have waited forever for a server to reboot, only to connect a monitor and see "a stop job is running [5m50s of ???]"
Re: (Score:2)
I too have waited forever for a server to reboot, only to connect a monitor and see "a stop job is running [5m50s of ???]"
I hate that message. My NAS server does that once in a while. I thought it might be a bad drive, but I tested them all one at a time, and collectively, and they are all good.
Re: (Score:2)
What's even worse is that ctrl-c does nothing, so you start Googling wondering what could possibly be so serious as to prevent ctrl-c from working on a simple stopped job on your own computer, and find the numerous closed bug reports: "Will Not Fix - Working as Designed." You can also find Poettering's explanation in one of the bugs, saying that systemd can't be sure that you have ownership of the system so therefore it won't trust the terminal and ctrl-c does nothing by design.
>"systemctl cancel-job" is
Re: (Score:2)
This isn't a documentation issue but a breaking convention issue.
May I humbly suggest the issue is Poettering's ego, and whatever dirt he has on others to con them into backing his ill-conceived rat's nest.
Re: (Score:2)
Poettering is absolutely the problem, and his conspirators.
Re: (Score:2)
Re: (Score:2)
That's systemd bullshit. My laptop does it occasionally and I just hold the power switch. I had a box with two quad NICs with only a few ports connected. There was a systemd service that waited 90 seconds for each port to obtain a link. Yeah, that got disabled in a hurry.
Re: (Score:2)
Re: (Score:2)
It's systemd because systemd is responsible for startup, shutdown, stopping jobs, timeouts, and mounting/unmounting. It's very common for users with network connected drives to wonder why their system never seems to reboot or startup in a timely manner (or at all) when there's a network issue.
I got a brilliant idea: (Score:2)
...make a thing called say "systemd" that tracks installed components so that t ~% 4``;& NO CARRIER
Re: (Score:2)
...make a thing called say "systemd" that tracks installed components so that t ~% 4``;& NO CARRIER
server01 ~ # nvmectl --please-actually-shutdown-the-nvmes-asynchronously
what about the RAID system shutdown tasks? (Score:2)
what about the RAID system shutdown tasks?
Why is this necessary (Score:1)
You never need to reboot a Linux system. It either gets powered off if the datacenter shuts down and the UPS doesn't work or it gets turned off when it's dead.
Re: (Score:2)
Re: (Score:1)
Livepatch.
Re: (Score:2)
So, when you do a config change on your server, how do you know if after a reboot it will still work? Well, by rebooting them.
And when you install a new kernel, how do you run it? Well, by rebooting them.
What if you want to reinstall a different Linux flavor on your servers, how do you do it? Well, by rebooting them.
What if you want to add a RAM module or a SSD on your servers, how do you do it? Well, by rebooting them.
There are plenty of reasons to reboot a server, Linux or otherwise. It is a normal part o
Re: (Score:1)
The post was a bit tongue-in-cheek, but for kernel upgrades there's live patching. I've never switched a Linux flavor in production; you're either RHEL or Ubuntu these days, or perhaps SLES in the European markets. Any disk is hot-plug these days. If you end up going for a physical hardware upgrade, the 5s doesn't matter in the downtime window you schedule.
Re: (Score:2)
Seems like a waste of power to me.
Why would you keep it powered up all the time if it can quickly come online?
Re: (Score:2)
I mean, servers are typically the sort of thing that are in 24/7 operation outside of planned maintenance windows. You can't exactly have a janitor sitting there waiting for a request to the website to come in so that he can go power on the server to send the information.
Re: (Score:2)
I suppose it depends on the company size.
But I don't think it's as common as you think past a certain size.
I definitely wouldn't be shocked to learn that Google scales the amount of servers they have on with predicted demand.
I suppose you think every company uses the same amount of resources with AWS 24/7 too?
Re: (Score:2)
I was shocked to hear on a podcast recently that one of the hosts, who is certainly not a newbie, reboots his personal servers once a week. And I think he does this with a script. He pulls in updates and then has it reboot.
I don't maintain any servers but my own, and I certainly only reboot for kernel updates after checking the errata. Sometimes kernel updates are not security updates, so if they aren't impacted by my hardware, I just skip them.
All that said, I'm sure in a data center which might have th
Re: (Score:2)
How retro (Score:2)
Re: (Score:2)
Was that even a thing in the 3.x days? Having to reboot NT to change an IP address was always a pain in the ass.
Re: How retro (Score:2)
That sounds like X Windows in many early systems.
Re: (Score:2)
Re: (Score:2)
On OS/2 you had to reboot the OS to change the 256-colour palette. That is why none of the big game companies (and few of any size) made games for OS/2, which is one big reason it never went anywhere.
Why did nobody do this sooner? (Score:2)
Sounds like the drives stink (Score:2)
I mean, this is good and all. Any long-running operation should always be asynchronous with a completion callback so that you can wait at an appropriate time (e.g. after ten of them are queued up, or whatever), and this API should never have been designed to be synchronous in the first place. But it seems to me that the bigger problem (which this masks) is that the drives are designed wrong.
If a drive takes five seconds to shut down, that means it has enough state in RAM to require five second
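A minimal sketch of the completion-callback pattern the comment above describes, using POSIX threads: queue several slow operations, have each one report back via a callback, and only block once everything has been issued. All names are illustrative and nothing here reflects the actual kernel patches.

#include <pthread.h>
#include <stdio.h>

struct completion {
    pthread_mutex_t lock;
    pthread_cond_t  done;
    int             pending;
};

static struct completion comp = {
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0
};

static void on_complete(int id)            /* completion callback */
{
    pthread_mutex_lock(&comp.lock);
    if (--comp.pending == 0)
        pthread_cond_signal(&comp.done);
    pthread_mutex_unlock(&comp.lock);
    printf("operation %d finished\n", id);
}

static void *slow_operation(void *arg)
{
    int id = (int)(long)arg;
    /* ...the long-running work (e.g. a device flush) would happen here... */
    on_complete(id);
    return NULL;
}

int main(void)
{
    enum { N = 10 };
    pthread_t tid[N];

    comp.pending = N;
    for (int i = 0; i < N; i++)            /* queue all ten operations */
        pthread_create(&tid[i], NULL, slow_operation, (void *)(long)i);

    pthread_mutex_lock(&comp.lock);        /* wait at a time of our choosing */
    while (comp.pending > 0)
        pthread_cond_wait(&comp.done, &comp.lock);
    pthread_mutex_unlock(&comp.lock);

    for (int i = 0; i < N; i++)
        pthread_join(tid[i], NULL);
    return 0;
}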
Re: (Score:2)
Re: Sounds like the drives stink (Score:2)
Not sure if this is still the case since prices dropped so much.
Re: (Score:2)
Am I missing something?
Maybe, maybe not. My guess is Google has an insane amount of memory and wants to use that for data storage. Then at shutdown time, that memory needs to write to the disk. You really need a real good UPS for that.
Re: (Score:2)
Am I missing something?
Maybe, maybe not. My guess is Google has an insane amount of memory and wants to use that for data storage. Then at shutdown time, that memory needs to write to the disk. You really need a real good UPS for that.
I don't think this can be in-RAM, kernel-provided caching. This is down at the NVMe driver level, which means that whatever is causing that delay must involve RAM on the NVMe card itself, i.e. the drive's write buffer. I mean, unless Linux's driver architecture is very strange.... I haven't looked at it since the 2.1 kernel or thereabouts.
Re: (Score:2)
If this is the case - and not having RTFA'd in proper Slashdot fashion, I have no idea.
It would seem to me the better approach would be to do something in the shutdown process that explicitly fsyncs the various disks (in parallel) after you shut down the workloads. The caches may get slightly dirtied again as basic system services shut down and such, but there should not be a massive backlog of writes waiting. Then you let the kernel do the very conservative thing it's doing now and clean everything up serially, one at a time.
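A minimal userspace sketch of that suggestion, assuming example /dev/nvme* paths: each device is fsync'd from its own thread so the flushes overlap rather than run back to back.

#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static void *flush_one(void *arg)
{
    const char *path = arg;
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror(path); return NULL; }
    if (fsync(fd) != 0)               /* flush this device's pending writes */
        perror(path);
    close(fd);
    return NULL;
}

int main(void)
{
    const char *devs[] = { "/dev/nvme0n1", "/dev/nvme1n1", "/dev/nvme2n1" };
    enum { N = sizeof(devs) / sizeof(devs[0]) };
    pthread_t tid[N];

    for (int i = 0; i < N; i++)       /* start all the flushes... */
        pthread_create(&tid[i], NULL, flush_one, (void *)devs[i]);
    for (int i = 0; i < N; i++)       /* ...then wait for all of them */
        pthread_join(tid[i], NULL);
    return 0;
}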
Re: (Score:2)
std::future APIs (Score:2)
Remember, dear programmers, this applies in user land too: it's std::future HardwareClass::shutdownAsync() instead of bool HardwareClass::shutdown(), and if you want to use C instead of C++, it's GAsyncResult via GLib's GIO.
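For the C/GLib route mentioned here, the usual GIO shape is an _async() starter paired with a _finish() call that consumes the GAsyncResult. A rough sketch follows; the hardware object and the shutdown work itself are placeholders, and only the GIO calls are real API.

#include <gio/gio.h>

static void shutdown_worker(GTask *task, gpointer source_object,
                            gpointer task_data, GCancellable *cancellable)
{
    /* ...tell the device to shut down here (placeholder)... */
    g_task_return_boolean(task, TRUE);
}

/* Starts the shutdown and returns immediately; the callback fires later. */
void hardware_shutdown_async(GObject *device, GCancellable *cancellable,
                             GAsyncReadyCallback callback, gpointer user_data)
{
    GTask *task = g_task_new(device, cancellable, callback, user_data);
    g_task_run_in_thread(task, shutdown_worker);
    g_object_unref(task);
}

/* Called from the GAsyncReadyCallback to collect the result. */
gboolean hardware_shutdown_finish(GAsyncResult *result, GError **error)
{
    return g_task_propagate_boolean(G_TASK(result), error);
}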