Technology Hardware

Self-Repairing Computers 224

Roland Piquepaille writes "Our computers are probably 10,000 times faster than they were twenty years ago. But operating them is much more complex. We have all experienced a PC crash or the disappearance of a large Internet site. What can be done to improve the situation? This Scientific American article describes a new method called recovery-oriented computing (ROC). ROC is based on four principles: speedy recovery by using what these researchers call micro-rebooting; using better tools to pinpoint problems in multicomponent systems; building an "undo" function (similar to those in word-processing programs) for large computing systems; and injecting test errors to better evaluate systems and train operators. Check this column for more details or read the long and dense original article if you want to know more."
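For a concrete feel for the "micro-rebooting" idea in the summary, here is a minimal, hypothetical sketch of a supervisor that restarts only the component that failed rather than the whole machine. The component names and the restart policy are illustrative assumptions, not the ROC authors' design.

    # Hypothetical micro-rebooting supervisor: watch a set of components and
    # restart only the one that dies, instead of rebooting everything.
    import multiprocessing
    import time

    def web_frontend():
        while True:
            time.sleep(1)                 # pretend to serve requests

    def session_store():
        time.sleep(3)
        raise RuntimeError("simulated crash")   # this component will die

    COMPONENTS = {"web_frontend": web_frontend, "session_store": session_store}

    def start(name):
        proc = multiprocessing.Process(target=COMPONENTS[name], name=name)
        proc.start()
        return proc

    if __name__ == "__main__":
        procs = {name: start(name) for name in COMPONENTS}
        while True:
            for name, proc in procs.items():
                if not proc.is_alive():           # component failed...
                    print(f"micro-rebooting {name}")
                    procs[name] = start(name)     # ...restart just that one
            time.sleep(1)

The other three principles (failure pinpointing, system-wide undo, fault injection) need far more machinery than this, but the recovery loop above is the basic shape of the first one.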
This discussion has been archived. No new comments can be posted.


  • Managerspeak (Score:3, Insightful)

    by CvD ( 94050 ) on Monday May 12, 2003 @06:59AM (#5935342) Homepage Journal
    I haven't read the long and dense article, but this sounds like managerspeak, PHB-talk. The concepts described are all very high level, requiring a whole plethora of yet unwritten code to roll back changes in a large system. This will require a lot of work, including rebuilding a lot of those large systems from the ground up.

    I don't think anybody (any company) is willing to undertake such an enterprise, having to re-architect and redesign whole systems from the ground up. These are systems that work today, but aren't 100% reliable.

    Will it be worth it, just so those systems recover faster after a failure? I don't think so, but YMMV.

    Cheers,

    Costyn.
  • Interesting choice (Score:5, Insightful)

    by sql*kitten ( 1359 ) on Monday May 12, 2003 @07:00AM (#5935344)
    From the article:

    We decided to focus our efforts on improving Internet site software. ...
    Because of the constant need to upgrade the hardware and software of Internet sites, many of the engineering techniques used previously to help maintain system dependability are too expensive to be deployed.

    (etc)

    Translation: "when we started this project, we thought we'd be able to spin it off into a hot IPO and get rich!!"
  • by KingRamsis ( 595828 ) <kingramsis&gmail,com> on Monday May 12, 2003 @07:06AM (#5935357)
    Computers still rely on the original John von Neumann architecture; they are not redundant in any way, and there will always be a single point of failure, no matter what you hear about RAID, redundant power supplies, etc. The self-healing systems described here are built on the same concept. Compare that to something natural like the human nervous system: now that is redundant and self-healing. A fly has more wires in its brain than all of the Internet's nodes, and if you cut your finger, a fully automated, autonomous, transparent healing system will fix it within a couple of days. If we ever want to create self-healing computers, we need to radically change what a computer is. We need to break from the von Neumann model, not because there is anything wrong with it, but because it is reaching its limits quickly. We need truly parallel, autonomous computers whose capacity increases linearly as you add more hardware, plus software paradigms that take advantage of that. Try to make a self-healing, self-fixing computer today and you will end up with a very complicated piece of software that will fail in real life.
  • by danalien ( 545655 ) on Monday May 12, 2003 @07:08AM (#5935366) Homepage
    but if end-users got a better computer education, I think most of the problems would be fixed.

    I find it quite funny that the "introduction to computers" courses we have (here in Sweden) only teach people how to use Word/Excel/PowerPoint/etc... nothing _fundamental_ about how to operate a computer. It's like learning how to use the cigarette lighter in your car and declaring yourself someone who can drive. And now you want a quick fix for your incompetence in driving "the car".
  • Hmm. (Score:5, Insightful)

    by mfh ( 56 ) on Monday May 12, 2003 @07:12AM (#5935376) Homepage Journal
    Our computers are probably 10,000 times faster than they were twenty years ago. But operating them is much more complex

    I think that's a big fat lie.

  • by ndogg ( 158021 ) <the@rhorn.gmail@com> on Monday May 12, 2003 @07:15AM (#5935382) Homepage Journal
    and cron them in.

    This concept isn't particularly new. It's easy to write a script that will check a particular piece of the system by running some sort of diagnostic command (e.g. netstat), parse the output, and make sure everything looks normal. If something doesn't look normal, just stop the process and restart it, or do whatever is needed to get the service back up and running, secured, or otherwise back to normal.

    Make sure that script is part of a crontab that runs fairly frequently, and things should recover on their own soon after they fail (well, within the interval at which you have the script scheduled; a rough sketch of such a watchdog follows at the end of this comment).

    "Undo" feature? That's what backups are for.

    Of course, the article was thinking that this would be built into the software, but I don't think that is a much better solution. In fact, I would say it would make things more complicated than anything.
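    A minimal sketch of such a watchdog, written in Python here for illustration; the port number (139, where Samba's smbd listens) and the init-script path are assumptions, not recommendations:

        # Hypothetical cron-driven watchdog: check that a service still answers,
        # and kick it if it doesn't.
        import socket
        import subprocess
        import sys

        HOST, PORT = "127.0.0.1", 139                  # assumed: watch local SMB
        RESTART_CMD = ["/etc/init.d/smb", "restart"]   # assumed init script

        def port_is_open(host, port, timeout=5.0):
            try:
                with socket.create_connection((host, port), timeout=timeout):
                    return True
            except OSError:
                return False

        if __name__ == "__main__":
            if not port_is_open(HOST, PORT):
                print(f"port {PORT} not answering; restarting", file=sys.stderr)
                subprocess.call(RESTART_CMD)

    Scheduled from a crontab entry that runs it every few minutes, the service comes back shortly after failing, which is exactly the "within the time-frame of your crontab" caveat above.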
  • Re:/etc/rc.d ? (Score:5, Insightful)

    by Surak ( 18578 ) * <surakNO@SPAMmailblocks.com> on Monday May 12, 2003 @07:15AM (#5935384) Homepage Journal
    Exactly. It isn't. I think the people who wrote this are looking at Windows machines, where restarting individual subcomponents is often impossible.

    If my Samba runs into trouble and gets its poor little head confused, I can restart the Samba daemon. There's no equivalent on Windows -- if SMB-based file sharing goes down on an NT box, you're restarting the computer; there is no other choice.

  • Windows Installer [microsoft.com] was an effort at self-"repairing" or "healing", whatever you would like to call it. However, am I the only one who has seen errors like "Please insert the Microsoft Office XP CD..." when nothing is wrong, and you have to cancel out of it just to use something totally unrelated, like, say, Excel or Word?

    The Office 2000 self-repairing installation is another notorious one [google.com]: if you remove something, the installer thinks it has been removed in error and tries to reinstall it...

    Oh well, let's wish the recovery-oriented computing guys luck...

  • Second paragraph (Score:5, Insightful)

    by NewbieProgrammerMan ( 558327 ) on Monday May 12, 2003 @07:18AM (#5935389)

    The second paragraph of the "long and dense article" strikes me as hyperbole. I haven't noticed that my computer's "operation has become brittle and unreliable" or that it "crash[es] or freeze[s] up regularly." I have not experienced the "annual outlays for maintenance, repairs and operations" that "far exceed total hardware and software costs, for both individuals and corporations."

    Since this is /., I feel compelled to say this: "Gee, sounds like these guys are Windows users." Haha. But, to be fair, I have to say that - in my experience, at least - Windows 2000 has been pretty stable both at home and at work. My computers seem to have become more stable and reliable over the years.

    But maybe my computers have become more stable because I learned to not tweak on them all the time. As long as my system works, I leave it the hell alone. I don't install the "latest and greatest M$ service pack" (or Linux kernel, for that matter) unless it fixes a bug or security vulnerability that actually affects me. I don't download and install every cutesy program I see. My computer is a tool I need to do my job - and since I've started treating it as such, it seems to work pretty damn well.

  • "Managerspeak"?! (Score:4, Insightful)

    by No Such Agency ( 136681 ) <abmackay@@@gmail...com> on Monday May 12, 2003 @07:32AM (#5935421)
    Somebody has to suggest the weird ideas, even if they sound stupid and impractical now. Of course we won't be retrofitting our existing systems in six months, I think this is a bigger vision than that.

    Rather than trying to eliminate computer crashes--probably an impossible task--our team concentrates on designing systems that recover rapidly when mishaps do occur.

    The goal here is clearly to make the stability of the operating system and software less critical, so we don't have to hope and pray that a new installation doesn't overwrite a system file with a weird buggy version, or that our OS won't decide to go tits-up in the middle of an important process. And all us good Slashdotters KNOW there will still be crufty, evil OSes around in 10 years, even if WE aren't using them :-)
  • ACID ROC? (Score:4, Insightful)

    by shic ( 309152 ) on Monday May 12, 2003 @07:39AM (#5935434)
    I wonder... is there a meaningful distinction between ROC and the classical holy grail of ACID systems (i.e. systems which meet the Atomic, Consistent, Isolated and Durable assumptions commonly cited in the realm of commercial RDBMSs), apart from the 'swish' buzzword rename?

    Professionals in the field, while usually in agreement about the desirability of systems which pass the ACID test, mostly admit that although the concepts are well understood, the real-world cost of the additional software complexity often precludes strict ACID compliance in typical systems. I would certainly be interested if there were more to ROC than evaluating the performance of existing, well-understood ACID-related techniques, but I can't find anything more than the "hype." For example, has ROC suggested designs to resolve distributed incoherence due to hardware failure? Classified non-trivial architectures immune to various classes of failure? Discovered a cost-effective approach to ACID?
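    For anyone who wants a concrete reminder of what the "A" in the ACID test buys you, here is a toy example using Python's bundled sqlite3 module (nothing ROC-specific; the accounts table is invented for illustration):

        # Toy illustration of atomicity: either every statement in the
        # transaction takes effect, or none of them do.
        import sqlite3

        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
        conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
        conn.commit()

        try:
            with conn:  # commits on success, rolls back on any exception
                conn.execute("UPDATE accounts SET balance = balance - 150 "
                             "WHERE name = 'alice'")
                (balance,) = conn.execute(
                    "SELECT balance FROM accounts WHERE name = 'alice'").fetchone()
                if balance < 0:
                    raise ValueError("overdraft")   # violates our invariant
                conn.execute("UPDATE accounts SET balance = balance + 150 "
                             "WHERE name = 'bob'")
        except ValueError:
            pass

        # The partial update was rolled back atomically; both rows are unchanged.
        print(conn.execute("SELECT * FROM accounts ORDER BY name").fetchall())
        # -> [('alice', 100), ('bob', 0)]

    The "undo" the article summary describes operates at a much coarser grain than this (whole-system state and operator actions rather than individual transactions), which seems to be part of what the parent is asking about.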
  • Re:/etc/rc.d ? (Score:2, Insightful)

    by GigsVT ( 208848 ) * on Monday May 12, 2003 @07:40AM (#5935443) Journal
    Apache sorta does this with its thread pool.

    That aside, wouldn't the proper solution be to fix the bug, rather than covering it up by treating the symptom?

    I think this ROC could only encourage buggier programs.
  • Ah, youth... (Score:3, Insightful)

    by tkrotchko ( 124118 ) * on Monday May 12, 2003 @07:51AM (#5935471) Homepage
    "But operating them is much more complex."

    You're saying the computers of today are more complex to operate than those of 20 years ago?

    What was the popular platform 20 years ago (1983)? The MacOS had not yet debuted, but the PC XT had, and the Apple ][ was its main competitor.

    So you had a DOS command line and an AppleDOS command line. Was that really simpler than pointing and clicking in XP and OS X today? I mean, you can actually have your *mother* operate a computer today.

    I'm not sure I agree with the premise.
  • by Quazion ( 237706 ) on Monday May 12, 2003 @08:02AM (#5935504) Homepage
    Washing machines have a lifetime of around 15-20 years, I guess; computers about 1-3 years.
    This is because the technical computer stuff is new every year, and so...

    1: It's too expensive to make it failsafe; development would take too long.
    2: You can't refine/redesign and resell, because of new technology.
    3: If it just works, no one will buy new systems, so they have to fail every now and then.

    Other consumer products have a much longer development cycle. Cars, for example, shouldn't fail, and when one does it should be fairly easy to repair; cars have also been around for something like a hundred years, and have they changed much? Computers: heck, just buy a new one or hire a PC Repair Man [www.pcrm.nl] (Dutch only) to do your fixing.

    Excuse me for my bad English ;-) but I hope you got the point; no time to ask my living dictionary.
  • by panurge ( 573432 ) on Monday May 12, 2003 @08:10AM (#5935525)
    The authors either don't know much about the current state of the art or are just ignoring it. And as for unreliability -- well, it's true that the first Unix box I ever had (8 users on VT100 terminals) could go almost as long without rebooting as my most recent small Linux box, but there's a bit of a difference in traffic between eight 19200-baud serial links and two 100baseT ports, not to mention the range of applications being supported.
    Or the 1000:1 difference in hard disk sizes.
    Or the 20:1 price difference.

    I think a suitable punishment would be to lock the authors in a museum somewhere that has a 70s mainframe, and let them out when they've learned how to swap disk packs, load the tapes, splice paper tape, connect the Teletype, sweep the chad off the floor, stack a card deck or two and actually run an application...those were the days, when computing kept you fit.

  • Re:Managerspeak (Score:4, Insightful)

    by Bazzargh ( 39195 ) on Monday May 12, 2003 @08:43AM (#5935652)
    I haven't read the long and dense article

    Yet you feel qualified to comment....

    requiring a whole plethora of yet unwritten code

    You do realize they have running code for (for example) an email server [berkeley.edu] (actually a proxy) which uses these principles? NB this was based on proxying sendmail, so they didn't "re-architect/redesign whole systems from ground up". This isn't the only work they've done, either.

    As for 'will it be worth it': if you'd read the article you'd have found their economic justification. This [berkeley.edu] has a good explanation of the figures. Note in particular that a large proportion of the failures they are concerned about are operator errors, which is why they emphasise system rollback as a recovery technique rather than software robustness.
  • Re:SPOFs (Score:2, Insightful)

    by KingRamsis ( 595828 ) <kingramsis&gmail,com> on Monday May 12, 2003 @08:52AM (#5935694)
    So it is basically two synchronized computers; it probably costs 3x as much as a normal one, and if you wiped out the self-correcting logic the system would likely die. You mentioned that they managed to duplicate everything -- did they duplicate the self-correcting logic itself?


    the primary immediately hands over the responsibility to the redundant/backup
    Is there an effective way to judge which processor is correct? You need an odd number of processors to do that, or an odd split on an even number of processors.
    I'm not saying the system is flawed; the way you describe it, it is certainly far more reliable than the usual servers. What I'm trying to point out is that the concept itself is the bottleneck.
  • by YrWrstNtmr ( 564987 ) on Monday May 12, 2003 @09:16AM (#5935807)
    Do it the way it's done in critical aircraft systems.

    Multiple (3 or more) CPUs or processes performing the same action. At least 2 out of the 3 need to agree on any particular action. The offending one is taken offline and 'fixed' (rebooted/repaired/whatever); a toy voting sketch follows at the end of this comment.

    Of course, with multiples you increase the probability of a failure, but you reduce the probability of a critical failure.
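    A toy sketch of that 2-out-of-3 voting (purely illustrative; the "units" here are ordinary functions, whereas real avionics voters are specialized redundant hardware):

        # Toy 2-out-of-3 voter: run the same computation on three units, accept
        # the majority answer, and flag the dissenting unit for repair.
        from collections import Counter

        def unit_a(x):
            return x * x

        def unit_b(x):
            return x * x

        def unit_c(x):
            return x * x + 1        # simulated fault: this unit disagrees

        UNITS = {"A": unit_a, "B": unit_b, "C": unit_c}

        def vote(x):
            results = {name: fn(x) for name, fn in UNITS.items()}
            majority, count = Counter(results.values()).most_common(1)[0]
            if count < 2:
                raise RuntimeError("no majority -- failure cannot be masked")
            suspects = [name for name, value in results.items() if value != majority]
            return majority, suspects

        if __name__ == "__main__":
            value, suspects = vote(7)
            print(value)     # 49, the agreed answer
            print(suspects)  # ['C'], the unit to take offline and "fix"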
  • by NearlyHeadless ( 110901 ) on Monday May 12, 2003 @09:24AM (#5935831)
    The authors either don't seem to know much about the current state of the art or are just ignoring it.

    I have to say that I am just shocked at the inane reactions on slashdot to this interesting article. Here we have a joint project of two of the most advanced CS departments in the world. David Patterson's name, at least, should be familiar to anyone who has studied computer science in the last two decades since he is co-author of the pre-eminent textbook on computer architecture.

    Yet most of the comments (+5 Insightful) are (1) this is pie in the sky, (2) they must just know Windows, har-de-har-har, (3) Undo is for wimps, that is what backups are for, (4) this is just "managerspeak".


    Grow up people. They are not just talking about operating systems, they do know what they are talking about. Some of their research involved hugely complex J2EE systems that run on, yes, Unix systems. Some of their work involves designing custom hardware--"ROC-1 hardware prototype, a 64-node cluster with special hardware features for isolation, redundancy, monitoring, and diagnosis."


    Perhaps you should just pause for a few minutes to think about their research instead of trying to score Karma points.

  • by fbg111 ( 529550 ) on Monday May 12, 2003 @10:02AM (#5936070)
    But operating them is much more complex.

    I disagree. Feature for feature, modern computers are much more reliable and easier to use than their vacuum-tube, punch-card, or even command-line predecessors. How many mom-and-pop technophobes do you think could hope to operate such a machine? Nowadays anybody can operate a computer, even my 85-year-old grandmother, who had never touched one until a few months ago. Don't mistake feature overload for feature complexity.
  • by cloudmaster ( 10662 ) on Monday May 12, 2003 @10:31AM (#5936290) Homepage Journal
    It might be a better use of time to write code that works correctly and is properly tested before release, rather than doing all of that on some other piece of meta-code that's likely to have a bunch o' problems too.
  • by sdack ( 601542 ) on Monday May 12, 2003 @10:39AM (#5936327)
    The moment you buy them, they add to the profit ...

    Making components reset themselves, or having them memorize states for the purpose of undoing work, is the approach of those not involved.

    A component needs a reset because it has reached a state in which it stops responding to any input. In other words, the component depended on receiving correct input without checking that input against its state, and thus locked itself up.
    An undo operation, on the other hand, would let components accept any input and reach any state (even an undefined one), but require them to memorize their previous states. Other components making use of them would then have to ignore the operability of those components and memorize the actions they had issued, so as to be able to undo them. (A toy checkpoint/undo sketch follows at the end of this comment.)
    The only component able to start an undo would be the button on the GUI that the user can click on.

    It is a very interesting concept: giving all the power to the user in front of the machine and letting him/her decide whether the computer is in an invalid state or not. It would be a radical change in the history of computer science. A user would no longer be a slave to the blue screen (or a kernel panic) demanding confirmation of the unavoidable reset!
    Everything would have to be redesigned and reimplemented. Reuse of old, existing components would of course be impossible, and errors in the final product would only be due to imperfect programmers and would be solved through updates and newer releases.

    Sven
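    One way to read the "memorize states for the purpose of undoing work" idea above is the classic checkpoint/undo pattern; a toy sketch of that reading (my interpretation, not the article's design):

        # Toy checkpoint/undo component: every mutating operation snapshots the
        # previous state, so a user or operator can roll back to an earlier point.
        import copy

        class UndoableConfig:
            def __init__(self):
                self._state = {}
                self._history = []        # stack of earlier snapshots

            def set(self, key, value):
                self._history.append(copy.deepcopy(self._state))  # checkpoint
                self._state[key] = value

            def undo(self):
                if self._history:
                    self._state = self._history.pop()

            def get(self, key, default=None):
                return self._state.get(key, default)

        cfg = UndoableConfig()
        cfg.set("dns_server", "10.0.0.1")
        cfg.set("dns_server", "10.0.0.99")   # operator mistake
        cfg.undo()                           # roll the mistake back
        print(cfg.get("dns_server"))         # -> 10.0.0.1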
  • Re:/etc/rc.d ? (Score:5, Insightful)

    by delta407 ( 518868 ) <slashdot@nosPAm.lerfjhax.com> on Monday May 12, 2003 @10:57AM (#5936451) Homepage
    There's no equivalent on Windows -- if SMB-based filesharing goes down on an NT box, you're restarting the computer, there is no other choice.
    How about restarting the "Server" service?

    Depending on how file sharing "goes down", you may need to restart a different service. Don't be ignorant: it is usually possible to fix an NT box while it's running. However, it's usually easier to reboot, and if it's not too big of a deal, Windows admins usually choose to reboot rather than go in and figure out which processes they have to kick.
  • Nothing new. (Score:4, Insightful)

    by pmz ( 462998 ) on Monday May 12, 2003 @11:58AM (#5936937) Homepage
    micro-rebooting; using better tools to pinpoint problems in multicomponent systems; build an "undo" function...

    I think they just invented Lisp :). I don't program in Lisp, but have seen people who are very good at it. Quite impressive.
  • by jtheory ( 626492 ) on Monday May 12, 2003 @11:58AM (#5936941) Homepage Journal
    There are other ways to solve this in distributed/clustered computing, such as voting: servers in the cluster vote on each other's sanity (i.e. determine whether the messages sent by one computer make sense to at least two others). However, even this system is not rock solid (what if two computers happen to malfunction in the same way simultaneously? what if the malfunction is contagious, or widespread in the cluster?).

    We can learn some lessons from how human society works. If your messages don't make sense to most other people, or if you start damaging a lot of other people, you get separated from the rest and possibly "rebooted" (some call this "electroshock therapy") or even deactivated (some call this "Welcome to Texas").

    The difference here is that if the computers in the cluster are all running the same programs, they will contain the exact same coding flaw, which they will all concur is the only sane answer (in human terms, this is called "religion"). So we're protected from hardware malfunctions, but not from bugs in the software or hardware design.

    That's why this stuff is so hard to do. It may be possible to use selective program restarts to temporarily keep service up in spite of a nasty memory leak, but nothing is really "repaired"; it's just providing a few more fingers to plug holes in the dam while the river keeps rising. So... do you get into providing alternative services for the ones malfunctioning?

    Interesting stuff (maybe I'll even read the article now).
  • Re:Hmm. (Score:3, Insightful)

    by mr3038 ( 121693 ) on Monday May 12, 2003 @06:58PM (#5940497)
    IBM PC XT at 4.7 Megahertz to Pentium 4 at 3 Gigahertz (3,000 Megahertz). It seems a little shy of 10,000 times unless you factor in going from an 8-bit processor to a 32-bit processor.

    You don't need to go that far back in history to see a really big difference. Just compare the FPU speed of the i287 and the Athlon. The i287 took a minimum of 90 cycles for FMUL, a minimum of 70 cycles for FADD, and at least 30 cycles for a floating-point load [8m.com]. Compare that to the Athlon, which can do two loads, an FMUL and an FADD every cycle [cr.yp.to]. So something that took the i287 at least 90+70+2*30 = 220 cycles, the Athlon can do every clock cycle. In addition, the Athlon is running at 2GHz instead of 10MHz. So one could argue that a current Athlon is 2000/10*220 = 44000 times faster than a roughly 20-year-old FPU (when was the 287 released, anyway?). On top of that we have MMX, SSE and SSE2, which can further boost best-case scenarios, but I think it's safe to say that current x86 CPUs are at least 10000 times faster than 20-year-old ones. Not to mention more advanced caches -- not too many years ago the L2 cache was external and optional. Of course, if you compare a 20-year-old Cray and the CPU inside a modern portable device, the difference is much smaller.
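    The parent's arithmetic, spelled out as a quick check:

        # Back-of-the-envelope check of the figures in the parent comment.
        cycles_287 = 90 + 70 + 2 * 30    # FMUL + FADD + two FP loads = 220 cycles
        clock_ratio = 2000 / 10          # 2 GHz Athlon vs. a 10 MHz-era 287 system
        print(cycles_287 * clock_ratio)  # -> 44000.0, the ~44,000x figure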
