
Patch the Linux Kernel Without Reboots

evanbro writes "ZDNet is reporting on ksplice, a system for applying patches to the Linux kernel without rebooting. ksplice requires no kernel modifications, just the source, the config files, and a patch. Author Jeff Arnold discusses the system in a technical overview paper (PDF). Ted Ts'o comments, 'Users in the carrier grade linux space have been clamoring for this for a while. If you are a carrier in telephony and don't want downtime, this stuff is pure gold.'" Update: 04/24 10:04 GMT by KD : Tomasz Chmielewski writes on LKML that the idea seems to be patented by Microsoft.
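For readers curious about the mechanics: the paper describes a two-stage workflow in which a pre-processing tool compares the original and patched kernel builds, packages the changed functions into a loadable update, and a second tool inserts that update into the running kernel as a module. Roughly along these lines (a sketch only; the tool names and arguments follow the paper's description of the prototype and may differ from any shipped release):

    # build an update from the unmodified source tree, its .config, and the fix
    ksplice-create --patch=fix-cve.patch /usr/src/linux-2.6.24
    # insert the resulting update into the running kernel -- no reboot needed
    ksplice-apply ksplice-fix-cve.tar.gz
    # reverse the update if it misbehaves
    ksplice-undo fix-cve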
  • Needed that bad? (Score:5, Insightful)

    by MetalliQaZ ( 539913 ) on Thursday April 24, 2008 @11:04AM (#23183178)
    If you are a carrier in telephony, you should have many load-balanced servers that can be taken offline one at a time and restored after patching. They probably would be taken out of the loop for the in-place patching anyway. So who is "clamoring"?
  • Unless it fails. (Score:3, Insightful)

    by Joe Snipe ( 224958 ) on Thursday April 24, 2008 @11:05AM (#23183188) Homepage Journal
    Honestly, how much downtime are we talking here? 30 seconds?
  • by Anon E. Muss ( 808473 ) on Thursday April 24, 2008 @11:07AM (#23183244)
    Trying to keep one server up 24/7/365 is usually a mistake. You'll never achieve 100% uptime. A much better idea is to use clustering and distributed computing so your overall system can survive the loss of individual servers.
  • Re:replace modules (Score:1, Insightful)

    by Anonymous Coward on Thursday April 24, 2008 @11:20AM (#23183512)
    Theory of operation:
    1. Build new_module
    2. rmmod old_module
    3. modprobe new_module

    Gee, that was hard :-)
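    In concrete terms, that procedure is something like the shell sketch below (module and path names are hypothetical; this only covers code built as a module, and rmmod will refuse if the old module is still in use):

        # rebuild the updated driver against the running kernel's headers
        make -C /lib/modules/$(uname -r)/build M=$PWD modules
        # swap the module in place; no reboot, but only if nothing holds it busy
        rmmod old_module
        insmod ./new_module.ko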
  • by Paul Carver ( 4555 ) on Thursday April 24, 2008 @11:20AM (#23183514)
    I'd rather have at least two of anything important and have stateful failover between them.

    If you've got this system that's so critical you can't reboot it for a kernel upgrade, what do you do when the building catches fire or a tanker truck full of toxic waste hops the curb and plows through the wall of your datacenter?

    I'd rather have a full second set of anything that critical. It should be in a different state (or country) and have a well designed and frequently used method of seamlessly transferring the load between the two (or more) sites without dropping anything.

    If you can't transfer the workload to a location at least a couple hundred miles away without users noticing then you're not in the big league.

    And as long as the workload is in another datacenter, what's the big deal about rebooting for a kernel upgrade?
  • by hacker ( 14635 ) <hacker@gnu-designs.com> on Thursday April 24, 2008 @11:23AM (#23183556)

    Once again, we have an over-engineered solution to a non-existent problem.

    Any enterprise-level customer is going to have a VERY lengthy QA process before deploying anything into production. This includes testing kernels, hardware, networks, interactions, applications, data and so on. One pharmaceutical company I know of is federally mandated to do this twice a year, every year, for every single machine that reads, writes or generates data. Period.

    So you hot-patch a running Linux kernel. How do you QA that? How do you roll back if the patch fails? Where is your 'control'?

    The answer? A duplicate machine. But wait, if you have two identical machines... isn't that... a cluster?

    Exactly. And THIS is how you perform upgrades. You split the cluster, upgrade one half, verify that the upgrade worked, then roll the cluster over to that node, and upgrade the second portion of the cluster. If you have more machines in the cluster, you do 'round-robin' upgrades. You NEVER EVER touch a running, production system like that.

    Well, not if you want any sort of data integrity or control and want to pass any level of quality validation on that physical environment.

  • by trybywrench ( 584843 ) on Thursday April 24, 2008 @11:31AM (#23183728)

    Trying to keep one server up 24/7/365 is usually a mistake. You'll never achieve 100% uptime. A much better idea is to use clustering and distributed computing so your overall system can survive the loss of individual servers.
    People using Linux on BigIron(tm) bank on 24/7/365/25years uptime. When a single server costs hundreds of thousands or millions of dollars, you can't afford a spare sitting idle. From day 1 the server needs to be making money and never ever stop. For smaller general-purpose servers, like the ones you can buy at Dell.com, yeah, having a failover makes sense.
  • by Anonymous Coward on Thursday April 24, 2008 @11:31AM (#23183730)
    yes, but if the CEO knew anything, he'd know that clustered computing is part of the job (or not) and he (maybe she?) wouldn't ask stupid questions.
  • Re:Amazing (Score:5, Insightful)

    by KeithJM ( 1024071 ) on Thursday April 24, 2008 @11:40AM (#23183896) Homepage

    someone with root access could slip a rootkit right under your nose
    Yeah, someone with root access can take control of your server. Oh, wait, they've got root access. They already have control of your server. At some point, you have to just accept that giving someone root access is a security risk.
  • by Bob9113 ( 14996 ) on Thursday April 24, 2008 @11:44AM (#23184020) Homepage
    Lots of people are saying, "100% uptime of a particular machine is neither necessary nor desirable, full failover is better. Full failover is the only way to handle catastrophic hardware failures." Or something to that extent.

    But this isn't about 100% uptime. It's about not having to reboot for a kernel upgrade. You should still have hot failover if you want HA, this just removes one more thing that requires a reboot.

    It's like people saying, "I don't mind rebooting after installing Office, I don't expect 100% uptime from my workstation." Of course you don't need to be able to do software installs without rebooting. But isn't it nice to have that option available?

    Same with this. When (and if) it gets stabilized and standardized, you'll use it. Not for 100% uptime, just because it's nice to not be required to reboot to enable a particular software install.
  • by Paul Carver ( 4555 ) on Thursday April 24, 2008 @11:45AM (#23184048)
    If your load balancer can't take a server out of the pool while allowing current sessions to finish cleanly then you need to shop for a new load balancer.

    A decent load balancer will obviously give you the choice of taking a server out of service immediately, disrupting existing sessions, or simply ceasing to send it new sessions while allowing existing ones to continue.

    As for your comment about physical connections, that's what portchannels and multilink trunks are for. Or VRRP and HSRP depending on which level of "connected to" you mean.
  • by m50d ( 797211 ) on Thursday April 24, 2008 @12:01PM (#23184390) Homepage Journal
    Uh, if you actually need that, then you needed it anyway. And if you don't need it but don't know how to disable it, you shouldn't be running a system.
  • Re:Amazing (Score:4, Insightful)

    by katz ( 36161 ) <Email? What e-mail?> on Thursday April 24, 2008 @12:03PM (#23184424)
    My bad, I meant to say,

        "A remote attacker who successfully executes a privilege escalation exploit and gains root access will have an easier time taking control of your server and hiding their tracks".

    Thanks for pointing that out

    - Roey
  • You are Wrong (Score:3, Insightful)

    by mpapet ( 761907 ) on Thursday April 24, 2008 @12:10PM (#23184576) Homepage
    And THIS is how you perform upgrades. You split the cluster, upgrade one half, verify that the upgrade worked, then roll the cluster over to that node, and upgrade the second portion of the cluster. If you have more machines in the cluster, you do 'round-robin' upgrades

    Hmmm. I happen to live by your words in an environment where this is theoretically possible, but practically impossible. Why? Because when the cluster rolls to a passive node, the application times out on the existing connections. The timeouts have business ($$$$) implications. I wish it were okay to have infinite retries, but it's viewed as a violation of the service agreement. Telephony is like this too.

    An academic ideal for sure, but please speak more humbly because it is no silver bullet.
  • by jelle ( 14827 ) on Thursday April 24, 2008 @12:12PM (#23184610) Homepage
    So you take it out of rotation on the load balancer and give it a few minutes to complete all its active connections. Patch/reboot whatever. Bring it back into rotation, and repeat with the other box.

    Methods like that usually suck in real life, because right before you want to 'take it out of rotation', a circuit gets opened through it that requires five nines (so you can't drop it), and it will remain open for months...

    You will end up with 99 boxes waiting to 'get out of rotation' for every single box that you don't need to update...

    Murphy will make sure of that.
  • by Anon E. Muss ( 808473 ) on Thursday April 24, 2008 @12:14PM (#23184640)

    People using Linux on BigIron(tm) bank on 24/7/365/25years uptime.

    I doubt there are many people running Linux on true Big Iron. I'm not saying it doesn't happen, I'm saying that most Big Iron runs something else. I know many financial institutions and telecom operators use HP NonStop systems. These can stay up 24/7/365/25years, and you pay millions of dollars for that. They have full redundant hardware inside the box, run a proprietary OS, and proprietary applications.

  • by Anonymous Coward on Thursday April 24, 2008 @12:21PM (#23184750)
    Now is not the time to claim banks know what they are doing.
  • by Ed Avis ( 5917 ) <ed@membled.com> on Thursday April 24, 2008 @12:22PM (#23184784) Homepage
    Who cares about servers? I want my Linux desktop to stay up-to-date with security fixes without having to reboot it every few days.
  • by Anonymous Coward on Thursday April 24, 2008 @12:43PM (#23185246)

    I have internal processing servers that have up times of over 3 years

    I've never understood this boasting about uptime. Long uptimes are a bad thing! How do you know a configuration change hasn't rendered one of your startup scripts ineffective? If you have to reboot for some unexpected reason, you could be stuck debugging unrelated problems at very inopportune moments.

    You need to schedule regular reboots so that you can test that your servers can start up fine at a moment's notice. Long uptimes are a sign a sysadmin hasn't been doing his job.

  • by hab136 ( 30884 ) on Thursday April 24, 2008 @01:17PM (#23185838) Journal

    As an admin for some -very- high availability systems, load balancers are not a silver bullet. This solution would most apply for running one-node clusters who are using a single machine as a perimeter network device. (ex. firewall) I see lots of these in the racks at our NOC provider.

    1. We connect to several load balanced systems and the complexity introduced by load balancers translates to inexplicable down time. No load balancers means a pretty steady diet of the latest and greatest server hardware, but no down time. A few minutes of down time costs more than the server hardware.

    I spent a decade in perimeter networking at a Fortune 50 US bank. My group didn't do the internal network, just the perimeter, and we still had dozens of network sites and thousands of pieces of equipment. The bank itself has hundreds of thousands of employees, millions of users. Online banking and brokerage are about as high availability as you can get, save utilities (power, water, telephony, etc.) or military. Seconds of online brokerage downtime equated to millions of dollars lost.

    The idea that load balancing introduces inexplicable down time is completely unsupported by my experience.

    "One-node clusters" seems like marketing speak for "single point of failure". A cluster by definition is two or more nodes.

    Redundant routers, switches, firewalls, the works or you're not high-availability in my opinion. The fact that you're talking about Postgresql instead of Oracle or DB2 on mainframes makes me think that your idea of high availability is different than mine.
  • by Kookus ( 653170 ) on Thursday April 24, 2008 @01:31PM (#23186102) Journal
    Production systems are not for testing purposes. You want to test rebooting? Do it on a test box.
  • by Kymermosst ( 33885 ) on Thursday April 24, 2008 @01:39PM (#23186270) Journal
    How do you know a configuration change hasn't rendered one of your startup scripts ineffective?

    Isn't that what QA systems and effective approaches to change management are supposed to handle?

    If I am planning a change, I should discover problems with the startup scripts in QA, not in production, especially if a production reboot is not required to implement the change.

  • by adrianbaugh ( 696007 ) on Thursday April 24, 2008 @01:53PM (#23186532) Homepage Journal
    How do you know that your test boxes are configured precisely identically to the production boxes?

    dd your production box's system filesystems to another hard drive, put in an identically specced machine, boot that?
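    A rough sketch of that approach (device names are examples; clone from a rescue environment or with the source filesystems quiesced, or the copy will be inconsistent):

        # block-for-block copy of the production system disk onto a spare drive
        dd if=/dev/sda of=/dev/sdb bs=4M conv=noerror,sync
        # then move the spare drive into an identically specced machine and boot it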
  • by bannerman ( 60282 ) <curdie@gmail.com> on Thursday April 24, 2008 @03:43PM (#23188200)
    I would think that, on top of the benefits of patching running high-uptime servers, this would in the long run also be yet another benefit of running Linux on your desktop instead of Windows. I don't see any reason Red Hat, Ubuntu, and everyone else wouldn't implement this type of kernel upgrade for convenience's sake.
  • by Anonymous Coward on Thursday April 24, 2008 @04:56PM (#23189316)

    If you're running around making undocumented configuration changes that even have a ghost of a chance of affecting server operation, anyone that gave you root access needs to have their fingers shortened.

    Mistakes happen. Your attitude seems to be "well don't make mistakes then", while mine is "verify that you haven't made any from time to time when you have time to fix things if you discover that you have". Your system falls down as soon as you realise that yes, you are capable of making mistakes.

  • by afabbro ( 33948 ) on Thursday April 24, 2008 @05:21PM (#23189668) Homepage

    I doubt there are many people running Linux on true Big Iron.

    And you would be wrong. Sure, most mainframes are running z/OS, but a goodly number of them are also running Linux images. I don't know the percentages but the IBM "run Linux on your mainframe" training classes are usually full.

  • by scott_karana ( 841914 ) on Friday April 25, 2008 @02:04AM (#23194536)
    You could also virtualize over a network file system. Removes the need for 1:1 identical machines. :)

"God is a comedian playing to an audience too afraid to laugh." - Voltaire

Working...