Self-Repairing Computers 224
Roland Piquepaille writes "Our computers are probably 10,000 times faster than they were twenty years ago, but operating them is much more complex. We have all experienced a PC crash or the disappearance of a large Internet site. What can be done to improve the situation? This Scientific American article describes a new method called recovery-oriented computing (ROC). ROC is based on four principles: speedy recovery through what these researchers call micro-rebooting; better tools to pinpoint problems in multicomponent systems; an "undo" function (similar to those in word-processing programs) for large computing systems; and injected test errors to better evaluate systems and train operators. Check this column for more details, or read the long and dense original article if you want to know more."
Managerspeak (Score:3, Insightful)
I don't think anybody (any company) is willing to undertake such an enterprise, having to re-architect and redesign whole systems from the ground up: systems that work these days, even if they aren't 100% reliable.
Will it be worth it, just so those systems have a shorter boot-up time after a failure? I don't think so, but YMMV.
Cheers,
Costyn.
Interesting choice (Score:5, Insightful)
Translation: "when we started this project, we thought we'd be able to spin it off into a hot IPO and get rich!!"
it will not work now (Score:4, Insightful)
uunnschulding sme.. (Score:3, Insightful)
I find it quite funny that the introductory computer courses we have (here in Sweden) only teach people how to use Word/Excel/PowerPoint/etc... nothing _fundamental_ about how to operate a computer. It's like learning how to use the cigarette lighter in your car and declaring yourself someone who can drive. And now you want a quick fix for your incompetence in driving "the car".
Hmm. (Score:5, Insightful)
I think that's a big fat lie.
Write scripts for it... (Score:5, Insightful)
This concept isn't particularly new. It's easy to write a script that checks a particular piece of the system by running some diagnostic command (e.g. netstat), parsing the output, and making sure everything looks normal. If something doesn't look right, just stop and restart the process, or do whatever is needed to get the service back up and running, or secured, or otherwise back to normal.
Make sure that script is part of a crontab that runs fairly frequently, and things should recover on their own soon after they fail (well, within the interval at which the crontab runs the script).
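The check-parse-restart loop the parent describes can be sketched in a few lines. This is a minimal illustration, not a production watchdog; the port number and restart command are placeholders you would fill in for your own service:

```python
import subprocess

def service_healthy(port: int, netstat_output: str) -> bool:
    """Return True if something is listening on the given port.

    netstat_output is the text produced by e.g. `netstat -ln`.
    """
    needle = f":{port} "
    return any(
        needle in line and "LISTEN" in line
        for line in netstat_output.splitlines()
    )

def watchdog(port: int, restart_cmd: list) -> bool:
    """One watchdog pass: check the service, restart it if it is down.

    Returns True if a restart was triggered. Meant to be invoked from
    cron, e.g. `*/5 * * * * /usr/local/bin/watchdog.py`.
    """
    out = subprocess.run(["netstat", "-ln"],
                         capture_output=True, text=True).stdout
    if service_healthy(port, out):
        return False
    subprocess.run(restart_cmd)  # e.g. ["/etc/init.d/smb", "restart"]
    return True
```

The recovery interval is bounded by how often cron fires the script, exactly as the parent notes.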
"Undo" feature? That's what backups are for.
Of course, the article envisions this being built into the software, but I don't think that's a much better solution. In fact, I'd say it would make things more complicated than anything.
Re:/etc/rc.d ? (Score:5, Insightful)
If my Samba runs into trouble and gets its poor little head confused, I can restart the Samba daemon. There's no equivalent on Windows: if SMB-based file sharing goes down on an NT box, you're restarting the computer; there is no other choice.
Self Repairing gone bad (Score:2, Insightful)
The Office 2000 self-repairing installation is another notorious example [google.com]: if you remove something, the installer thinks it was removed in error and tries to reinstall it...
Oh well, let's wish the recovery-oriented computing guys luck...
Second paragraph (Score:5, Insightful)
The second paragraph of the "long and dense article" strikes me as hyperbole. I haven't noticed that my computer's "operation has become brittle and unreliable" or that it "crash[es] or freeze[s] up regularly." I have not experienced the "annual outlays for maintenance, repairs and operations" that "far exceed total hardware and software costs, for both individuals and corporations."
Since this is /. I feel compelled to say this: "Gee, sounds like these guys are Windows users." Haha. But, to be fair, I have to say that - in my experience, at least - Windows2000 has been pretty stable both at home and at work. My computers seem to me to have become more stable and reliable over the years.
But maybe my computers have become more stable because I learned to not tweak on them all the time. As long as my system works, I leave it the hell alone. I don't install the "latest and greatest M$ service pack" (or Linux kernel, for that matter) unless it fixes a bug or security vulnerability that actually affects me. I don't download and install every cutesy program I see. My computer is a tool I need to do my job - and since I've started treating it as such, it seems to work pretty damn well.
"Managerspeak"?! (Score:4, Insightful)
Rather than trying to eliminate computer crashes--probably an impossible task--our team concentrates on designing systems that recover rapidly when mishaps do occur.
The goal here is clearly to make the stability of the operating system and software less critical, so we don't have to hope and pray that a new installation doesn't overwrite a system file with a weird buggy version, or that our OS won't decide to go tits-up in the middle of an important process. And since all us good Slashdotters KNOW there will still be crufty, evil OSes around in 10 years, even if WE aren't using them, that seems like a goal worth pursuing.
ACID ROC? (Score:4, Insightful)
Professionals in the field, while usually in agreement about the desirability of systems that pass the ACID test, will mostly admit that although the concepts are well understood, the real-world cost of the additional software complexity often precludes strict ACID compliance in typical systems. I would certainly be interested if there were more to ROC than evaluating the performance of existing, well-understood ACID-related techniques, but I can't find anything beyond the hype. For example, has ROC suggested designs to resolve distributed incoherence due to hardware failure? Classified non-trivial architectures immune to various classes of failure? Discovered a cost-effective approach to ACID?
Re:/etc/rc.d ? (Score:2, Insightful)
That aside, wouldn't the proper solution be to fix the bug, rather than covering it up by treating the symptom?
I think this ROC could only encourage buggier programs.
Ah, youth... (Score:3, Insightful)
You're saying the computers of today are more complex to operate than those of 20 years ago?
What was the popular platform 20 years ago (1983)? The MacOS had not yet debuted, but the PC XT had. The Apple ][ was the main competitor.
So you had a DOS command line and an AppleDOS command line. Was that really simpler than pointing and clicking in XP and OS X today? I mean, you can actually have your *mother* operate a computer today.
I'm not sure I agree with the premise.
A computer is no washing machine, but why? (Score:3, Insightful)
This is because the technology in computers is new every year, and so:
1: It's too expensive to make them failsafe; development would take too long.
2: You can't refine/redesign and resell, because of new technology.
3: If it just works, no one will buy new systems, so they have to fail every now and then.
Other consumer products have much longer development cycles. Cars, for example, shouldn't fail, and when they do, they should be fairly easy to repair; cars have also been around for something like a hundred years, and have they really changed that much? With computers, heck, just buy a new one, or hire a PC Repair Man [www.pcrm.nl] (Dutch only) to do your fixing.
excuse me for my bad english
Does SCI AM review articles properly nowadays? (Score:4, Insightful)
Or the factor of 1000 to 1 in hard disk sizes.
Or the 20:1 price difference.
I think a suitable punishment would be to lock the authors in a museum somewhere that has a 70s mainframe, and let them out when they've learned how to swap disk packs, load the tapes, splice paper tape, connect the Teletype, sweep the chad off the floor, stack a card deck or two and actually run an application...those were the days, when computing kept you fit.
Re:Managerspeak (Score:4, Insightful)
Yet you feel qualified to comment....
requiring a whole plethora of yet unwritten code
You do realize they have running code for (for example) an email server [berkeley.edu] (actually a proxy) which uses these principles? NB: this was based on proxying sendmail, so they didn't "re-architect/redesign whole systems from the ground up". This isn't the only work they've done, either.
As for 'will it be worth it': if you'd read the article, you'd find their economic justifications. This [berkeley.edu] has a good explanation of the figures. Note in particular that a large proportion of the failures they are concerned about are operator errors, which is why they emphasise system rollback as a recovery technique rather than software robustness.
Re:SPOFs (Score:2, Insightful)
the primary immediately hands over the responsibility to the redundant/backup
Is there an effective way to judge which processor is correct? You need an odd number of processors to do that, or an odd split on an even number of processors.
I'm not saying that this system is flawed; the way you described it here, it is certainly far more reliable than the usual servers. What I'm trying to point out is that the concept itself is the bottleneck.
Multiple CPU/processes (Score:3, Insightful)
Multiple (3 or more) CPUs or processes perform the same action, and at least 2 out of the 3 must agree on any particular action. The offending one is taken offline and 'fixed' (rebooted/repaired/whatever).
Of course, with multiples you increase the probability of a failure, but reduce the probability of a critical failure.
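The 2-out-of-3 voting step described above can be sketched as a majority vote over replica outputs. A minimal illustration (function and variable names are mine, not from any real fault-tolerance library):

```python
from collections import Counter

def vote(results):
    """Majority-vote over replica outputs: return (winner, losers).

    losers is the list of replica indices whose answer disagreed with
    the majority -- those replicas would be taken offline and 'fixed'.
    Raises ValueError when no strict majority exists (e.g. a 1-1-1
    split), which is the parent thread's point about needing an odd
    split to judge which processor is correct.
    """
    counts = Counter(results)
    winner, n = counts.most_common(1)[0]
    if n <= len(results) // 2:
        raise ValueError("no majority: cannot tell which replica is correct")
    losers = [i for i, r in enumerate(results) if r != winner]
    return winner, losers
```

With three replicas, any single faulty one is outvoted and identified; if two fail differently at once, the vote correctly refuses to pick a winner.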
Re:Does SCI AM review articles properly nowadays? (Score:5, Insightful)
I have to say that I am just shocked at the inane reactions on slashdot to this interesting article. Here we have a joint project of two of the most advanced CS departments in the world. David Patterson's name, at least, should be familiar to anyone who has studied computer science in the last two decades since he is co-author of the pre-eminent textbook on computer architecture.
Yet most of the comments (+5 Insightful) are (1) this is pie in the sky, (2) they must just know Windows, har-de-har-har, (3) Undo is for wimps, that is what backups are for, (4) this is just "managerspeak".
Grow up people. They are not just talking about operating systems, they do know what they are talking about. Some of their research involved hugely complex J2EE systems that run on, yes, Unix systems. Some of their work involves designing custom hardware--"ROC-1 hardware prototype, a 64-node cluster with special hardware features for isolation, redundancy, monitoring, and diagnosis."
Perhaps you should just pause for a few minutes to think about their research instead of trying to score Karma points.
But operating them is much more complex? (Score:2, Insightful)
I disagree. Feature for feature, modern computers are much more reliable and easier to use than their vacuum-tube, punch-card, or even command-line predecessors. How many mom-and-pop technophobes do you think could hope to operate such a machine? Nowadays anybody can operate a computer, even my 85-year-old grandmother, who had never touched one until a few months ago. Don't mistake feature overload for feature complexity.
Re:"Managerspeak"?! (Score:2, Insightful)
Self-paying computers (Score:1, Insightful)
Making components reset themselves, or letting them memorize states for the purpose of undoing work, is the approach of those not involved.
The need to reset a component arises because it has reached a state where it stops responding to any input. In other words, the component depended on receiving correct input without checking that input against its state, and thus locked itself up.
An undo operation, on the other hand, would lead components to accept any input and reach any state (even an undefined one), but with the need to memorize their previous states. Other components making use of them would then have to ignore the operability of those components and memorize the actions they had previously issued, in order to be able to undo them.
The only component able to start an undo would be the button on the GUI that the user can click.
It is a very interesting concept: giving all power to the user in front of the machine and letting him/her decide whether the computer is in an invalid state or not. It would be a radical change in the history of computer science. A user would no longer be a slave to the blue screen (or a kernel panic) demanding confirmation of the unavoidable reset!
Everything would have to be redesigned and reimplemented. Reuse of old, existing components would of course be impossible, and errors in the final product would be purely the fault of imperfect programmers, to be solved through updates and newer releases.
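The "memorize previous states so they can be undone" idea sketched above is essentially an undo log. A toy illustration of the mechanism (class and method names are mine, not from the ROC work):

```python
class UndoableStore:
    """A key-value store that logs each write so it can be rolled back.

    Every set() records the key's previous value on an undo stack;
    undo() pops the stack and restores that value -- the system-wide
    "undo button" described above, in miniature.
    """

    _MISSING = object()  # sentinel: key did not exist before the write

    def __init__(self):
        self._data = {}
        self._log = []   # stack of (key, previous value or _MISSING)

    def set(self, key, value):
        self._log.append((key, self._data.get(key, self._MISSING)))
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

    def undo(self):
        if not self._log:
            return
        key, prev = self._log.pop()
        if prev is self._MISSING:
            del self._data[key]
        else:
            self._data[key] = prev
```

The cost the parent post identifies is visible even here: every component touching the store must go through the logged interface, and the log itself grows with every operation.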
Sven
Re:/etc/rc.d ? (Score:5, Insightful)
Depending on how file sharing "goes down", you may need to restart a different service. Don't be ignorant: it is usually possible to fix an NT box while it's running. However, it's usually easier to reboot, and if it's not too big a deal, Windows admins usually choose to reboot rather than go in and figure out which processes they have to kick.
Nothing new. (Score:4, Insightful)
I think they just invented Lisp
Re:Self-diagnostics (Score:3, Insightful)
We can learn some lessons from how human society works. If your messages don't make sense to most other people, or if you start damaging a lot of other people, you get separated from the rest and possibly "rebooted" (some call this "electroshock therapy") or even deactivated (some call this "Welcome to Texas").
The difference here is that if the computers in the cluster are all running the same programs, they will all contain the exact same coding flaw, which they will all concur is the only sane answer (in human terms, this is called "religion"). So we're protected from hardware malfunctions, but not from bugs in the software or the hardware design.
That's why this stuff is so hard to do. It may be possible to use selective program restarts to temporarily keep a service up in spite of a nasty memory leak, but nothing is really "repaired"; it's just a few more fingers plugging holes in the dam while the river keeps rising. So... do you get into providing alternative services for the ones malfunctioning?
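The "selective restart buys time but repairs nothing" point can be made concrete with a toy supervisor that micro-reboots a failing worker a bounded number of times (my own sketch, not the Berkeley design):

```python
def supervise(worker, max_restarts=3):
    """Call worker(); on an exception, 'micro-reboot' by calling again.

    Returns (result, restarts_used). Re-raises after max_restarts
    failed attempts -- restarting only buys time, it does not fix the
    underlying bug, so a persistent fault eventually wins anyway.
    """
    for attempt in range(max_restarts + 1):
        try:
            return worker(), attempt
        except Exception:
            if attempt == max_restarts:
                raise
```

If the fault is transient (a leak that a fresh start clears), the service stays up; if the fault is deterministic, every restart fails identically and the supervisor can only give up.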
Interesting stuff (maybe I'll even read the article now).
Re:Hmm. (Score:3, Insightful)
You don't need to go that far back in history to see a really big difference. Just compare the FPU speed of the i287 and the Athlon. The i287 took a minimum of 90 cycles for FMUL, a minimum of 70 cycles for FADD, and at least 30 cycles for a floating-point load [8m.com]. Compare that to the Athlon, which can do two loads, an FMUL and an FADD every cycle [cr.yp.to]. So something that took the i287 at least 90+70+2*30 = 220 cycles, the Athlon can do every clock cycle. On top of that, the Athlon runs at 2GHz instead of 10MHz. So one could argue that a current Athlon is 2000/10*220 = 44000 times faster than a roughly 20-year-old FPU (when was the 287 released, anyway?). Add to that MMX, SSE and SSE2, which can further boost best-case scenarios, and I think it's safe to say that current x86 CPUs are at least 10000 times faster than 20-year-old ones. Not to mention more advanced caches: not too many years ago, L2 cache was external and optional. Of course, if you compare a 20-year-old Cray with the CPU inside a modern portable device, the difference is much smaller.
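The arithmetic above checks out; spelled out in code, using only the cycle counts and clock speeds quoted in the parent post:

```python
# Per-operation cycle counts quoted for the i287 FPU.
I287_CYCLES = 90 + 70 + 2 * 30    # FMUL + FADD + two FP loads = 220 cycles
ATHLON_CYCLES = 1                 # Athlon issues the same work every cycle

I287_CLOCK_MHZ = 10
ATHLON_CLOCK_MHZ = 2000

# Speedup = (cycles per op ratio) x (clock frequency ratio)
speedup = (I287_CYCLES / ATHLON_CYCLES) * (ATHLON_CLOCK_MHZ / I287_CLOCK_MHZ)
print(speedup)  # 44000.0
```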