Patch the Linux Kernel Without Reboots
evanbro writes "ZDNet is reporting on ksplice, a system for applying patches to the Linux kernel without rebooting. ksplice requires no kernel modifications, just the source, the config files, and a patch. Author Jeff Arnold discusses the system in a technical overview paper (PDF). Ted Ts'o comments, 'Users in the carrier grade linux space have been clamoring for this for a while. If you are a carrier in telephony and don't want downtime, this stuff is pure gold.'"
Update: 04/24 10:04 GMT by KD : Tomasz Chmielewski writes on LKML that the idea seems to be patented by Microsoft.
Already been used (Score:5, Informative)
Impressive hack (Score:5, Informative)
He basically compiles a patched and unpatched kernel with the same compiler, compares the ELF output, and uses that to generate a binary file that corresponds to the change. That gets wrapped in a generic module for use, another module installs it along with JMPs to bypass the old code and use the new, and he performs the checks needed to make sure he can safely install the redirects.
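For anyone wondering what "JMPs to bypass the old code" might look like at the byte level, here is a minimal sketch. It assumes an x86 near jump with a 32-bit relative displacement (opcode 0xE9, five bytes long). The function name `encode_jmp` and the addresses are made up for illustration and are not taken from the ksplice source.

```python
import struct

JMP_REL32_OPCODE = 0xE9  # x86 near jump with a 32-bit relative displacement
JMP_LEN = 5              # one opcode byte plus four displacement bytes


def encode_jmp(src_addr, dst_addr):
    """Return the 5 bytes of a jump placed at src_addr that lands on dst_addr.

    The displacement is measured from the end of the jump instruction,
    which is why JMP_LEN is added to the source address.
    """
    rel = dst_addr - (src_addr + JMP_LEN)
    if not -2**31 <= rel < 2**31:
        raise ValueError("target is out of range for a rel32 jump")
    # "<i" packs a signed 32-bit integer in little-endian byte order.
    return bytes([JMP_REL32_OPCODE]) + struct.pack("<i", rel)


# Hypothetical addresses: old function at 0x1000, replacement at 0x2000.
trampoline = encode_jmp(0x1000, 0x2000)
print(trampoline.hex())
```

Overwriting the first five bytes of the old function with such a jump sends every caller to the replacement, which is why the safety checks mentioned above matter: nothing may be executing inside the old function while it is being rewritten.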
He also has to differentiate real changes from incidental ones (the example given is changing the address of a function - all references to it will change, but they don't really need to be included in the binary diff).
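A rough sketch of that "real versus incidental" distinction, under some loud assumptions: each function is represented as its raw bytes plus a list of offsets where the linker fills in an address (relocations), and every relocation field is 4 bytes wide. The names `mask_relocations` and `changed_functions` are hypothetical, and this is not the algorithm from the paper, only the general idea.

```python
def mask_relocations(code, reloc_offsets, width=4):
    """Zero out the address fields so that moved symbols do not count as changes."""
    buf = bytearray(code)
    for off in reloc_offsets:
        buf[off:off + width] = b"\x00" * width
    return bytes(buf)


def changed_functions(old, new):
    """Return the sorted names of functions whose code really differs.

    old and new map a function name to (code_bytes, reloc_offsets).
    """
    changed = []
    for name, (new_code, new_relocs) in new.items():
        if name not in old:
            changed.append(name)  # a brand-new function is always a real change
            continue
        old_code, old_relocs = old[name]
        if (old_relocs != new_relocs or
                mask_relocations(old_code, old_relocs)
                != mask_relocations(new_code, new_relocs)):
            changed.append(name)
    return sorted(changed)


# Assumed toy data: f only calls something that moved, g has a real code change.
old = {"f": (b"\xe8\x10\x00\x00\x00\xc3", [1]), "g": (b"\x31\xc0\xc3", [])}
new = {"f": (b"\xe8\x20\x00\x00\x00\xc3", [1]), "g": (b"\x31\xc9\xc3", [])}
print(changed_functions(old, new))
```

Here `f` differs only in the address it calls, so it is left out, while `g` has a different instruction and would need to go into the update.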
The only human work required is to check whether a patch makes semantic changes to a data structure... whether eg. an unsigned integer variable that was being used as a number is now a packed set of flags - the data declaration is the same, but it's being used differently.
Interesting paper. Also a useful new set of capabilities for any Linux user who can't handle downtime for quarterly patching... worth its weight in gold in some businesses.
Erik
And Microsoft claims to have invented it (Score:4, Informative)
Tomasz Chmielewski wrote on LKML: the idea seems to be patented by Microsoft, i.e. this patent from December 2002: http://www.google.com/patents?id=cVyWAAAAEBAJ&dq=hotpatching In essence, they patented kexec ;)
Andi Kleen promptly provided prior art: The basic patching idea is old and has been used many times, long predating kexec. e.g. it's a common way to implement incremental linkers too.
Re:Wrong way to solve the uptime problem (Score:4, Informative)
Of course, much of the reason is that a lot of commercial telecom apps are badly implemented and need better management controls.
Re:Unless it fails. (Score:3, Informative)
Re:Unless it fails. (Score:4, Informative)
Re:The real test... (Score:3, Informative)
So yes, ksplice can be installed/used without rebooting.
Re:Needed that bad? (Score:5, Informative)
If you have a load balanced environment then you have the ability to redirect new connections away from a given server. Then it's just a matter of waiting for the active connections to terminate before the machine ends up in an idle state where you can safely apply patches offline. I've worked in a number of telephony environments and this was always the way we would patch systems. Stop accepting new connections, wait for existing ones to end, then perform the patch, reboot, verify, and start accepting connections again.
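The drain-then-patch routine described above can be sketched as a toy model. Everything here is hypothetical (the `Backend` class, its method names, the connection ids); a real load balancer has its own interface for this.

```python
class Backend:
    """Toy model of one server behind a load balancer."""

    def __init__(self, name):
        self.name = name
        self.accepting = True   # whether new connections may be routed here
        self.active = set()     # ids of connections currently in progress

    def open(self, conn_id):
        """Try to route a new connection here; refuse it while draining."""
        if not self.accepting:
            return False
        self.active.add(conn_id)
        return True

    def close(self, conn_id):
        """Record that an existing connection has ended."""
        self.active.discard(conn_id)

    def drain(self):
        """Stop taking new connections but leave existing ones alone."""
        self.accepting = False

    def is_idle(self):
        """True once draining has started and the last connection is gone."""
        return not self.accepting and not self.active


backend = Backend("web1")
backend.open("call-1")
backend.drain()                     # step 1: stop accepting new connections
refused = not backend.open("call-2")
backend.close("call-1")             # step 2: wait for existing ones to end
safe_to_patch = backend.is_idle()   # step 3: patch, reboot, verify, re-enable
```

The weak point is the second step: nothing in this model forces a long-lived connection to end, so the wait can be arbitrarily long.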
Second, this is telephony, meaning it is the infrastructure on which the internet is based. There are no DNS or TCP/IP tricks you can use to send people to a different "server" if that server is the switch connected to your fiber backbone. Basically, there are points in the infrastructure where there is by necessity a single chokepoint.
Any mission critical hardware, switches, routers, servers, etc. should be set up in redundant pairs (or triplets,
Redundancy is key, and any commercial datacenter will offer it all the way from their connections to the outside world to the connections they provide their customers. Every datacenter used by every company I ever worked for (about 10) offered redundant power and redundant network drops (using HSRP, VRRP, etc) for our equipment. If the datacenter needed to upgrade a router they'd move all traffic off one router so they could upgrade and test it, then move traffic off the other and repeat the process. Similarly if we needed to upgrade our firewalls, switches, etc. we'd fail over to the second redundant device first. In some cases we had bonded interfaces right on the end servers so as long as one path remained active we could power down an entire switch, router, firewall, etc. In other cases we relied on load balancing across servers that were alternately connected to one or another switch.
Re:Already been used (Score:3, Informative)
Re:Wrong way to solve the uptime problem (Score:1, Informative)
And as a sysadmin in a bank, the solution described in the story isn't that appealing. It strikes me as something inherently less reliable than doing a cold boot with a new kernel. Scheduled downtime is OK, unscheduled problems because someone wanted to do an upgrade on the fly are *bad*.
Re:Needed that bad? (Score:4, Informative)
This assumes that active connections will terminate in a timely fashion. I used to have internet service via an ISDN connection to my office. My ISDN calls would stay connected for a couple of months at a time. Yes, one connection lasting multiple months. There are other cases where a connection, context, or state between two systems would need to be maintained for extended periods of time. Many of these situations cannot be solved by load balancing and would benefit greatly from the ability to make kernel changes without interrupting current work, or waiting for it to complete.
Re:Needed that bad? (Score:5, Informative)
A patch to the kernel almost never requires changes to startup scripts. They're not talking about adding new functionality with user-space-addressable interfaces with this tool. They're talking about being able to install about 84% of security hotfixes in a hurry outside your scheduled reboots, then rebooting on your regular maintenance schedule.
Re:Wrong way to solve the uptime problem (Score:4, Informative)
Why not 24/7/52 or 24/7/4.3/12 or just 24/365 (or 24/365.242 for the pedants)?
Re:Needed that bad? (Score:2, Informative)
We have systems that run for years at a time because we have change management tools that guarantee that those systems are in the exact state of configuration they should be in, and these tools run every night. If you're running around making undocumented configuration changes that even have a ghost of a chance of affecting server operation, anyone that gave you root access needs to have their fingers shortened.
Re:Needed that bad? (Score:5, Informative)
For all you know, your apparent always-on connection was actually a virtual connection being frequently switched & reswitched over many different real physical connections. That would be a fairly standard architecture for having a network infrastructure which can have components being worked on while data is still flowing through the network.
When the telecom provider is "waiting for active connections to go away" on a particular device, that only means waiting until all of the virtual connections that are momentarily being switched through that device have been successfully switched to another device. It doesn't mean that any of those virtual connections have to be terminated.
Re:Needed that bad? (Score:3, Informative)
Re:And Microsoft claims to have invented it (Score:3, Informative)
The patching function was not an accident either; there was an OS function for this purpose. Originally it was intended to allow bug fixes to be installed without having to change the ROM, but it was quickly co-opted into a mechanism for enhancing the OS in various other ways as well.
This was the smallest part of the interview... (Score:4, Informative)
Microsoft has NOT patented this! (Score:2, Informative)
Their only resort is to appeal to court.
There are no applications in other countries.
A