Software Technology

Remus Project Brings Transparent High Availability To Xen 137

Posted by timothy
from the when-servers-go-south-a-song dept.
An anonymous reader writes "The Remus project has just been incorporated into the Xen hypervisor. Developed at the University of British Columbia, Remus provides a thin layer that continuously replicates a running virtual machine onto a second physical host. Remus requires no modifications to the OS or applications within the protected VM: on failure, Remus activates the replica on the second host, and the VM simply picks up where the original system died. Open TCP connections remain intact, and applications continue to run unaware of the failure. It's pretty fun to yank the plug out on your web server and see everything continue to tick along. This sort of HA has traditionally required either really expensive hardware, or very complex and invasive modifications to applications and OSes."
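The summary's description boils down to a speculate-checkpoint-release cycle: the VM runs ahead on the primary while its outbound packets are buffered, and the buffer is only released once the backup acknowledges the latest checkpoint. A minimal Python sketch of that idea (all names are hypothetical; the real implementation operates on dirty memory pages inside the Xen hypervisor, not Python objects):

```python
# Conceptual sketch of Remus-style asynchronous checkpointing.
# Hypothetical names throughout -- this only illustrates the
# buffer-until-acknowledged idea described in the summary.

import copy

class ProtectedVM:
    def __init__(self, state):
        self.state = state                    # live state on the primary
        self.replica = copy.deepcopy(state)   # last acknowledged checkpoint
        self.outbound = []                    # packets held until checkpoint

    def run_epoch(self, updates, packets):
        """Execute speculatively for one epoch (tens of milliseconds)."""
        self.state.update(updates)
        self.outbound.extend(packets)         # buffer output, don't release

    def checkpoint(self):
        """Ship state to the backup; release buffered output on its ack."""
        self.replica = copy.deepcopy(self.state)
        released, self.outbound = self.outbound, []
        return released                       # only now visible externally

    def failover(self):
        """Primary died: backup resumes from the acknowledged checkpoint."""
        return self.replica
```

Because nothing leaves the machine before its checkpoint is acknowledged, the outside world never observes state that the backup doesn't also have, which is why open TCP connections survive a failover.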

Comments Filter:
  • Re:It's pretty fun (Score:2, Insightful)

    by Fulcrum of Evil (560260) on Wednesday November 11, 2009 @05:56PM (#30067032)
    if it's a webserver, what's the big deal? Run 4 and if 1 drops off, stop sending it requests. For an app server, I can see the advantages.
  • state transfer (Score:4, Insightful)

    by girlintraining (1395911) on Wednesday November 11, 2009 @06:03PM (#30067110)

    ... Of course, this ignores the fact that if it's a software glitch, it'll happily replicate the bug into the copy. Also, there are certain hardware failures that will also replicate: Mountain Dew spilled on top of the unit, for example. There's this huge push for virtualization, but it only solves a few classes of failure conditions. No amount of virtualization will save you if the server room catches fire and the primary system and backup are colocated. Keep this in mind when talking about "High Availability" systems.

    On a different note, nothing that's claimed to be transparent in IT ever is. Whenever I hear that word, I usually cancel my afternoon appointments... Nothing is ever transparent in this industry. Only managers use that word. The rest of us use the term "hopefully".

  • by palegray.net (1195047) <philip.paradis@pa3.14legray.net minus pi> on Wednesday November 11, 2009 @06:18PM (#30067242) Homepage Journal
    I'll bet a paycheck that prior art in various incarnations would handily dispatch any such patent. As for it already being done by VMware, a lot of organizations prefer a purely open source solution, and Xen works extremely well for many companies.
  • by illegibledotorg (1123239) on Wednesday November 11, 2009 @06:24PM (#30067294)
    Yeah, and at a great price point. *rolleyes*

    IIRC, to get this kind of functionality from ESX or vSphere you have to pay license fees running into the thousands of dollars for each VM host, as well as a separate license fee for their centralized Virtual Center management system. I'm glad to see that this is finally making it into the Xen mainline.
  • Nope (Score:4, Insightful)

    by Anonymous Coward on Wednesday November 11, 2009 @06:35PM (#30067410)

    The Remus team presented their software well before VMware came out with their product.

    What's different now is that the Remus patches have finally been incorporated into the Xen source tree.

    If VMware has any patents, they'll have to jump over the hurdle of predating the Remus work, which was originally published a while ago.

    Besides, Remus can be used in more ways than what VMware offers, since you have the source code.

  • by Anonymous Coward on Wednesday November 11, 2009 @06:36PM (#30067416)
    To anyone who actually needs this kind of uninterrupted HA the cost of a VMware license is an insignificant irrelevance. Of course, it's nice that people can play around with HA at home now for free.
  • Re:It's pretty fun (Score:4, Insightful)

    by Fulcrum of Evil (560260) on Wednesday November 11, 2009 @06:50PM (#30067536)

    Webservers are far less stateless than you might think. Nowadays they are practically app servers. (Disclosure: I've been doing web applications since 2000, so I know a bit about the subject.)

    Webservers have no business being the sole repository for these things - the whole point of separating out web from app is that web boxes are easily replaceable with no state.

    Session mgmt: persist the session to a distributed store after each request. Transactions: they fail if you die halfway through. Shopping cart: this doesn't live on a web server.

    If you require all that state, how do you ever do load balancing? Add a web server and it's another SPOF.

    When five minutes of downtime means over a hundred complaints in your inbox and tens of thousands of dropped connections, which your boss does not find funny at all, you don't make that mistake again.

    That's right, you move the state off the webserver so nobody ever sees the downtime and tell your boss that you promised 99.9 and damnit, you're delivering it!
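The pattern described here, keeping session state in a shared store so any web box can serve any request, can be sketched as follows (a toy illustration; the dict stands in for an external store such as Redis or memcached, and all names are hypothetical):

```python
# Sketch of keeping session state off the web server, so any box
# behind the load balancer can serve any request. A dict stands in
# for a shared networked store (e.g. Redis or memcached).

import json

class SharedSessionStore:
    def __init__(self):
        self._backend = {}                  # stand-in for the external store

    def save(self, session_id, data):
        self._backend[session_id] = json.dumps(data)

    def load(self, session_id):
        raw = self._backend.get(session_id)
        return json.loads(raw) if raw else {}

def handle_request(store, session_id, item):
    """Any web server can run this: state lives in the store, not the box."""
    session = store.load(session_id)        # fetch regardless of which box
    session.setdefault("cart", []).append(item)
    store.save(session_id, session)         # persist before responding
    return session["cart"]
```

With state externalized this way, a web server that dies mid-session loses nothing the client can observe; the next request simply lands on another box.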

  • Re:It's pretty fun (Score:4, Insightful)

    by stefanlasiewski (63134) <slashdot@stef a n c o . com> on Wednesday November 11, 2009 @07:11PM (#30067730) Homepage Journal

    In many cases, the webserver IS the app server.

    This sort of feature could be very useful for smaller or budget-conscious shops that haven't yet created a dedicated Web tier, or for all those internal webservers hosting the Wiki, etc.

    Webservers also help with capacity. Run 4 and if 1 drops off, not a big problem. But what if half the webservers drop off because the circuit which powers that side of the cage went down? And the 'redundant' power supplies on your machines weren't really 'redundant' (Thanks Dell)?

  • by mattbee (17533) <matthew@bytemark.co.uk> on Wednesday November 11, 2009 @08:18PM (#30068232) Homepage

    Surely there is a strong possibility of a failure mode where both VMs run at once: the original image thinking it has lost touch with a dead backup, and the backup thinking the master is dead, and so starting to execute independently? If they're connected to the same storage / network segment, it could cause data loss, bring down the network service and so on. I've not investigated these types of lockstep VMs, but it seems you have to make some pretty strong assumptions about failure modes, which commodity hardware always breaks eventually (I've seen bad backplanes, network chips, CPU caches, RAM of course, switches...). How can you possibly handle these cases to avoid having to mop up after your VM is accidentally cloned?
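The split-brain worry raised here applies to any primary/backup scheme: a heartbeat timeout alone cannot distinguish a dead primary from a partitioned network. A toy sketch of the usual mitigation, consulting an external witness before activating the backup (all names hypothetical; this is not a description of how Remus itself arbitrates):

```python
# Toy illustration of split-brain avoidance with an external witness.
# A heartbeat timeout alone cannot tell "primary is dead" apart from
# "the link between us is down", so the backup only promotes itself
# when an independent tiebreaker is also reachable.

def should_activate(peer_alive, witness_reachable):
    """Backup's decision: take over the VM or stay passive."""
    if peer_alive:
        return False            # primary is fine: stay passive
    # Peer looks dead. Without the witness we might just be partitioned,
    # and activating would clone the VM. Only proceed with a tiebreaker.
    return witness_reachable
```

The remaining risk (witness reachable from both sides of a partition) is why production setups usually add fencing: the promoted side forcibly cuts the old primary off from shared storage and the network before serving traffic.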

  • by bcully (1676724) on Wednesday November 11, 2009 @08:19PM (#30068236)
    I think you're missing the point of output buffering. Remus _does_ introduce network delay, and some applications will certainly be sensitive to it. But it never loses transactions that have been seen outside the machine. Keeping an exact copy of the machine _without_ having to synchronize on every single instruction is exactly the point of Remus.
  • by Antique Geekmeister (740220) on Wednesday November 11, 2009 @10:08PM (#30068950)

    If your application cannot tolerate a 50 msec pause in outbound traffic (which is what Remus seems to introduce, similar to VMware switchovers) then you have no business running it over a network, much less over the Internet as a whole. Similar pauses are introduced in core switches and core routers on a fairly frequent basis, and are entirely unavoidable.

    There are certainly classes of application sensitive to that kind of issue: various "real-time programming" and motor control sensor systems require consistently low latency. But for public-facing, high-availability services, it seems useful, and much lighter to implement than VMware's expensive solutions.

  • Re:Himalaya (Score:4, Insightful)

    by Vancorps (746090) on Wednesday November 11, 2009 @11:32PM (#30069348)

    Were you replying to my comment? Because it doesn't sound like you read my comment. I specifically said there are cut-off points where virtual infrastructure doesn't make sense.

    Also, the fact that you think the I/O of a SAN is any different from that of an HP NonStop setup is where things get really comical, because you're talking about InfiniBand, which is used in x86 hardware as well. As I said, the threshold is moving into higher and higher workloads.

    I'm also not sure where you get your information about Exchange not being IO intensive. Exchange setups easily handle billions of transactions, just like the big RDBMSes out there. That's why when you evaluate virtual platforms they always ask you about your Exchange environment as well as your database environment. They are both considered high-IO applications, as practically all they do is read from and write to disk.

    I find the whole concept of your argument funny, considering the NonStop setups were early attempts at abstraction from the hardware to handle failure and spread the load. In essence it was the start of virtual infrastructure. There is a reason NonStop isn't a major part of HP's business anymore: people are achieving what they need with commodity hardware. Sorry, but you do indeed save a lot of money that way too. Enterprise gear used to cost boatloads; now it is accessible to much smaller players with smaller workloads but the same demands for uptime.

  • by Bert64 (520050) <bert@s[ ]hdot.fi ... m ['las' in gap]> on Thursday November 12, 2009 @04:40AM (#30070570) Homepage

    They bought a particular version of VMware, and paid VMware to support the setup they had bought and paid for...
    VMware's method of providing support was to tell them to buy new, expensive products... They failed to provide adequate support for the version the customer was actually paying support for...
    If their product fails, then an upgrade to a working version should be free at the very least.
