Self-Repairing Computers



Posted by Hemos on Monday May 12, 2003 @06:52AM from the repairing-the-box dept.

Roland Piquepaille writes "Our computers are probably 10,000 times faster than they were twenty years ago. But operating them is much more complex. You all have experienced a PC crash or the disappearance of a large Internet site. What can be done to improve the situation? This Scientific American article describes a new method called recovery-oriented computing (ROC). ROC is based on four principles: speedy recovery by using what these researchers call micro-rebooting; using better tools to pinpoint problems in multicomponent systems; building an "undo" function (similar to those in word-processing programs) for large computing systems; and injecting test errors to better evaluate systems and train operators. Check this column for more details or read the long and dense original article if you want to know more."
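
As a rough illustration of the fourth principle (injecting test errors), here is a minimal Python sketch. It is not code from the ROC project; `flaky`, `call_with_retry`, `fetch_page`, and the failure rate are hypothetical names and values chosen only for this example.

```python
import random

def flaky(failure_rate, rng):
    """Wrap a function so it raises an injected error some of the time."""
    def decorator(func):
        def wrapper(*args, **kwargs):
            if rng.random() < failure_rate:
                raise RuntimeError("injected test error")
            return func(*args, **kwargs)
        return wrapper
    return decorator

def call_with_retry(func, attempts):
    """The recovery path being exercised: retry before giving up."""
    last_error = None
    for _ in range(attempts):
        try:
            return func()
        except RuntimeError as err:
            last_error = err
    raise last_error

rng = random.Random(0)  # fixed seed so a test run is repeatable

@flaky(failure_rate=0.5, rng=rng)
def fetch_page():
    # Hypothetical operation that normally succeeds.
    return "ok"
```

Calling `call_with_retry(fetch_page, attempts=50)` exercises the retry logic against failures that were put there on purpose, which is the point of the principle: the recovery code gets tested before a real fault does it for you.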

This discussion has been archived. No new comments can be posted.


This would be great (Score:5, Funny)

by CausticWindow ( 632215 ) writes: on Monday May 12, 2003 @06:55AM (#5935335)

coupled with self debugging code.

- DWIM (Score:3, Funny)
  
  by PhilHibbs ( 4537 ) writes:
  
  We've had RISC, MMX, VLIW, SSI, maybe it's time for DWIM [google.co.uk] processors.
This post (Score:3, Funny)

by nother_nix_hacker ( 596961 ) writes: on Monday May 12, 2003 @06:58AM (#5935339)

Is Ctrl-Alt-Del ROC too? :)

Managerspeak (Score:3, Insightful)

by CvD ( 94050 ) writes: on Monday May 12, 2003 @06:59AM (#5935342) Homepage Journal

I haven't read the long and dense article, but this sounds like managerspeak, PHB-talk. The concepts described are all very high level, requiring a whole plethora of yet unwritten code to roll back changes in a large system. This will require a lot of work, including rebuilding a lot of those large systems from the ground up.

I don't think anybody (any company) is willing to undertake such an enterprise, having to re-architect/redesign whole systems from ground up. Systems that work these days, but aren't 100% reliable.

Will it be worth it? For those systems to have a smaller boot up time after failure? I don't think so, but ymmv.

Cheers,

Costyn.

- Re:Managerspeak (Score:5, Interesting)
  
  by gilesjuk ( 604902 ) writes: <giles@jones.zen@co@uk> on Monday May 12, 2003 @07:02AM (#5935347)
  
  Not to mention that the ROC system itself will need to be rock solid. It's no good to have a recovery system that needs to recover itself, which would then recover itself and so on :)
  
  - Self-diagnostics (Score:5, Interesting)
    
    by 6hill ( 535468 ) writes: on Monday May 12, 2003 @07:45AM (#5935459)
    
    I've done some work on high availability computing (incl. my Master's thesis) and one of the more interesting problems is the one you described here -- true metaphysics. The question as it is usually posed goes: how does one self-diagnose? Can a computer program distinguish between malfunctioning software and malfunctioning software-monitoring software -- is the problem in the running program or in the actual diagnostic software? How do you run diagnostics on diagnostics running diagnostics on diagnostics... ugh :).
    My particular system of research finally wound up relying on the Windows method: if uncertain, erase and reboot. It didn't have to be 99.999% available, after all. There are other ways with which to solve this in distributed/clustered computing, such as voting: servers in the cluster vote for each other's sanity (i.e. determine if the messages sent by one computer make sense to at least two others). However, not even this system is rock solid (what if two computers happen to malfunction in the same manner simultaneously? what if the malfunction is contagious? or widespread in the cluster?).
    So, self-correcting is an intriguing question, to say the least. I'll be keenly following what the ROC fellas come up with.
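
The voting idea in the comment above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `sane_nodes`, the `accepts` mapping, and the node names are hypothetical, and the rule "sane if at least two others accept your messages" is taken directly from the comment rather than from any particular cluster product.

```python
from collections import Counter

def sane_nodes(accepts, needed=2):
    """Return the nodes whose messages made sense to at least `needed` peers.

    `accepts` maps each voter to the set of peers whose messages it
    considered sensible. A node's opinion of itself is ignored.
    """
    tally = Counter({node: 0 for node in accepts})
    for voter, accepted in accepts.items():
        for node in accepted:
            if node != voter and node in tally:
                tally[node] += 1
    return sorted(node for node, count in tally.items() if count >= needed)

# Hypothetical four-node cluster in which nobody can make sense of "c".
accepts = {
    "a": {"b", "d"},
    "b": {"a", "d"},
    "c": {"a", "b", "d"},
    "d": {"a", "b"},
}
```

Here `sane_nodes(accepts)` leaves out `"c"`. The weakness raised in the comment still applies: if two nodes fail in the same way and vouch for each other, a simple count like this cannot tell them apart from healthy peers.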
    
    - Re:Self-diagnostics (Score:2)
      
      by Zirnike ( 640152 ) writes:
      
      Ummm... why not use the shuttle method? 3 monitors, take a poll to determine if something needs to be rebooted. If all 3 agree, that's easy. If 2 agree, reboot the problem app, then reboot the 3rd monitor, and reboot one of the other 2 as it comes back online. (so you have 3 fresh monitor programs)
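
One possible reading of the three-monitor poll, as a small Python sketch. The function name `restart_plan` and the monitor names are hypothetical. The restart order in the two-to-one case is an interpretation of the comment, and the comment does not say what to do when only one monitor complains, so the sketch does nothing in that case.

```python
def restart_plan(verdicts):
    """Return an ordered list of things to restart.

    `verdicts` maps each of three monitor names to True if that
    monitor thinks the application needs a reboot.
    """
    if len(verdicts) != 3:
        raise ValueError("this sketch assumes exactly three monitors")
    yes = sorted(name for name, vote in verdicts.items() if vote)
    no = sorted(name for name, vote in verdicts.items() if not vote)
    if len(yes) == 3:
        return ["app"]
    if len(yes) == 2:
        # Majority wins. The dissenting monitor is itself suspect, so it
        # is restarted first, then the other two in turn (assumption).
        return ["app", no[0]] + yes
    return []
```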
    - Re:Self-diagnostics (Score:3, Insightful)
      
      by jtheory ( 626492 ) writes:
      
      There are other ways with which to solve this in distributed/clustered computing, such as voting: servers in the cluster vote for each other's sanity (i.e. determine if the messages sent by one computer make sense to at least two others). However, not even this system is rock solid (what if two computers happen to malfunction in the same manner simultaneously? what if the malfunction is contagious? or widespread in the cluster?).
      
      We can learn some lessons from how human society works. If your messages don
  - Re:Managerspeak (Score:2)
    
    by Salamander ( 33735 ) writes:
    
    The key is not to build the system hierarchically, with one "big brain" that watches everyone else but nobody watching it back. A more robust approach is to have several peers all watching each other and using a more "democratic" method to determine who's faulty. It's more difficult to design and implement the necessary protocols, but it's not impossible. The folks at Berkeley have quite a bit of experience with this stretching from OceanStore back (at least) to NOW and, having met them, I have full conf
- "Managerspeak"?! (Score:4, Insightful)
  
  by No Such Agency ( 136681 ) writes: <abmackay@@@gmail...com> on Monday May 12, 2003 @07:32AM (#5935421)
  
  Somebody has to suggest the weird ideas, even if they sound stupid and impractical now. Of course we won't be retrofitting our existing systems in six months, I think this is a bigger vision than that.
  
  Rather than trying to eliminate computer crashes--probably an impossible task--our team concentrates on designing systems that recover rapidly when mishaps do occur.
  
  The goal here is clearly to make the stability of the operating system and software less critical, so we don't have to hope and pray that a new installation doesn't overwrite a system file with a weird buggy version, or that our OS won't decide to go tits-up in the middle of an important process. Since all us good Slashdotters KNOW there will still be crufty, evil OS's around in 10 years, even if WE aren't using them :-)
  
  - Re:"Managerspeak"?! (Score:2)
    
    by _typo ( 122952 ) writes:
    
    so we don't have to hope and pray that a new installation doesn't overwrite a system file with a weird buggy version, or that our OS won't decide to go tits-up in the middle of an important process. Since all us good Slashdotters KNOW there will still be crufty, evil OS's around in 10 years, even if WE aren't using them :-)
    
    Then maybe the solution isn't using additional bug-prone software to try to recover fast from failures but to actually replace the crufty, evil OS's
  - Re:"Managerspeak"?! (Score:2, Insightful)
    
    by cloudmaster ( 10662 ) writes:
    
    It might be a better use of time to write code that works correctly and is properly tested before release, rather than doing all of that on some other piece of meta-code that's likely to have a bunch o' problems too.
    - Re:"Managerspeak"?! (Score:3, Interesting)
      
      by fgodfrey ( 116175 ) writes:
      
      No, it's not (well, debugging software is definitely good, but writing "self healing" code is important too). An operating system is an incredibly complex piece of software. At Cray and SGI a *very* large amount of testing goes on before release, but software still gets released with bugs. Even if you were, by some miracle, to get a perfect OS, hardware still breaks. In a large system, hardware breaks quite often. Having an OS that can recover from a software or hardware failure on a large system is ess
- Re:Managerspeak (Score:3, Funny)
  
  by TopShelf ( 92521 ) writes:
  
  Speaking for the PHB's, this sounds very exciting. I can't wait until they have self-upgrading computers as well. No more replacing hardware every 3 years!
- Re:Managerspeak (Score:4, Insightful)
  
  by Bazzargh ( 39195 ) writes: on Monday May 12, 2003 @08:43AM (#5935652)
  
  I haven't read the long and dense article
  
  Yet you feel qualified to comment....
  
  requiring a whole plethora of yet unwritten code
  
  You do realize they have running code for (for example) an email server [berkeley.edu] (actually a proxy) which uses these principles? NB this was based on proxying sendmail, so they didn't "re-architect/redesign whole systems from ground up". This isn't the only work they've done either.
  
  As for 'will it be worth it', if you'd read the article you'd find their economic justifications. This [berkeley.edu] has a good explanation of the figures. Note in particular that a large proportion of the failure they are concerned about is operator error, hence why they emphasise system rollback as a recovery technique, as opposed to software robustness.
  
- Re:Managerspeak (Score:5, Interesting)
  
  by sjames ( 1099 ) writes: on Monday May 12, 2003 @08:44AM (#5935659) Homepage Journal
  
  There are already steps in place towards recoverability in currently running systems. That's what filesystem journaling is all about. Journaling doesn't do anything that fsck can't do EXCEPT that replaying the journal is much faster. Vi recovery files are another example. As the article pointed out, 'undo' in any app is an example.
  
  Life critical systems are often actually two separate programs, 'old reliable' which is primarily designed not to allow a dangerous condition, and the 'latest and greatest' which has optimal performance as its primary goal. Should 'old reliable' detect that 'latest and greatest' is about to do something dangerous, it will take over and possibly reboot 'latest and greatest'.
  
  Transaction based systems feature rollback, volume managers support snapshot, and libraries exist to support application checkpointing. EROS [eros-os.org] is an operating system based on transactions and persistent state. It's designed to support this sort of reliability.
  
  HA clustering and server farms are another similar approach. In that case, they allow individual transactions to fail and individual machines to crash, but overall remain available.
  
  Apache has used a simple form of this for years. Each server process has a maximum service count associated with it. It will serve that many requests, then be killed and a new process spawned. The purpose is to minimize the consequences of unfixed memory leaks.
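
The "maximum service count" idea can be sketched as follows. This is not Apache's implementation (Apache recycles real child processes, configured in that era with the `MaxRequestsPerChild` directive); the classes `Worker` and `RecyclingPool` are hypothetical stand-ins that only show the recycling logic.

```python
class Worker:
    """Stand-in for a server process that may leak memory over time."""
    _next_id = 0

    def __init__(self):
        Worker._next_id += 1
        self.worker_id = Worker._next_id
        self.served = 0

    def handle(self, request):
        self.served += 1
        return f"worker {self.worker_id} handled {request}"


class RecyclingPool:
    """Replace the worker once it has served `max_requests` requests."""

    def __init__(self, max_requests):
        self.max_requests = max_requests
        self.worker = Worker()

    def handle(self, request):
        if self.worker.served >= self.max_requests:
            # The old worker, and anything it leaked, is simply dropped.
            self.worker = Worker()
        return self.worker.handle(request)
```

With `RecyclingPool(max_requests=2)`, every third request lands on a fresh worker, so a slow leak never gets more than two requests' worth of time to accumulate.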
  
  Many server daemons support a reload method where they re-read their config files without doing a complete restart. Smart admins make a backup copy of the config files to roll back to should their changes cause a system failure.
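
A minimal sketch of the "back up the config, roll back if the change breaks things" habit, assuming a JSON config file with an integer `port` field. The file format, the validation rule, and the names `load_config` and `apply_change` are all assumptions made for this example, not any particular daemon's behaviour.

```python
import json
import shutil
from pathlib import Path

def load_config(path):
    """Parse and validate the config; raise ValueError if it is unusable."""
    config = json.loads(Path(path).read_text())
    if not isinstance(config, dict) or not isinstance(config.get("port"), int):
        raise ValueError("config needs an integer 'port'")
    return config

def apply_change(path, new_text):
    """Write a new config, restoring the backup if the reload fails."""
    path = Path(path)
    backup = path.with_name(path.name + ".bak")
    shutil.copy2(path, backup)           # keep the last known-good copy
    path.write_text(new_text)
    try:
        return load_config(path)
    except ValueError:
        shutil.copy2(backup, path)       # roll back
        return load_config(path)
```

A bad edit then costs one failed reload instead of an outage: `apply_change` returns the previous configuration and leaves the previous file on disk.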
  
  Also as the article points out, design for testing (DFT) has been around in hardware for a while as well. That's what JTAG is for. JTAG itself will be more useful once reasonably priced tools become available. Newer motherboards have JTAG ports built in. They are intended for monitor boards, but can be used for debugging as well (IMHO, they would be MORE useful for debugging than for monitoring, but that's another post!). Built in watchdog timers are becoming more common as well. ECC RAM is now mandatory on many server boards.
  
  It WILL take a lot of work. It IS being done NOW in a stepwise manner. IF/when healthy competition in software is restored, we will see even more of this. When it comes down to it, nobody likes to lose work or time and software that prevents that will be preferred to that which doesn't.
  
- Re:Managerspeak (Score:2)
  
  by cyberlync ( 450786 ) writes:
  
  As with many other things, it's really just a matter of choosing the right tool for the job. In this case, it sounds a lot like Erlang may be the right tool with its live-reload, built-in fault tolerance, and distributed nature. It may still be a lot of work but picking the right tool would move it from impossible (C, C++) to just difficult.
Interesting choice (Score:5, Insightful)

by sql*kitten ( 1359 ) writes: on Monday May 12, 2003 @07:00AM (#5935344)

From the article:

We decided to focus our efforts on improving Internet site software. ...
Because of the constant need to upgrade the hardware and software of Internet sites, many of the engineering techniques used previously to help maintain system dependability are too expensive to be deployed.

(etc)

Translation: "when we started this project, we thought we'd be able to spin it off into a hot IPO and get rich!!"

/etc/rc.d ? (Score:4, Interesting)

by graveyhead ( 210996 ) writes: <fletchNO@SPAMfletchtronics.net> on Monday May 12, 2003 @07:01AM (#5935345)

Frequently, only one of these modules may be encountering trouble, but when a user reboots a computer, all the software it is running stops immediately. If each of its separate subcomponents could be restarted independently, however, one might never need to reboot the entire collection. Then, if a glitch has affected only a few parts of the system, restarting just those isolated elements might solve the problem.

OK, how is this different from the scripts in /etc/rc.d that can start, stop, or restart all my system services? Any daemon process needs this feature, right? It doesn't help if the machine has locked up entirely.

Maybe I just don't understand this part. The other points all seem very sensible.
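
For readers who want the quoted passage in concrete terms, here is a minimal sketch of restarting only the troubled parts of a system. `Component` and `micro_reboot` are hypothetical names, and the health flag stands in for whatever failure detection a real system would use. At whole-daemon granularity this is indeed close to what the rc.d scripts already offer by hand.

```python
class Component:
    """One independently restartable piece of a larger system."""

    def __init__(self, name):
        self.name = name
        self.healthy = True
        self.restarts = 0

    def restart(self):
        self.healthy = True
        self.restarts += 1


def micro_reboot(components):
    """Restart only the components that report trouble; return their names."""
    restarted = []
    for component in components:
        if not component.healthy:
            component.restart()
            restarted.append(component.name)
    return restarted
```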

- Re:/etc/rc.d ? (Score:5, Insightful)
  
  by Surak ( 18578 ) * writes: <surakNO@SPAMmailblocks.com> on Monday May 12, 2003 @07:15AM (#5935384) Homepage Journal
  
  Exactly. It isn't. I think the people who wrote this are looking at Windows machines, where restarting individual subcomponents is often impossible.
  
  If my Samba runs in trouble and gets its poor little head confused, I can restart the Samba daemon. There's no equivalent on Windows -- if SMB-based filesharing goes down on an NT box, you're restarting the computer, there is no other choice.
  
  - Re:/etc/rc.d ? (Score:2)
    
    by Imperator ( 17614 ) writes:
    
    Or you restart the appropriate services, like the Server service (and possibly some others in your vaguely described situation). Come on, have you ever actually used NT?
    - Re:/etc/rc.d ? (Score:4, Interesting)
      
      by Surak ( 18578 ) * writes: <surakNO@SPAMmailblocks.com> on Monday May 12, 2003 @10:09AM (#5936124) Homepage Journal
      
      Yes. I'm typing this on last night's build of Mozilla Firebird running under Windows NT 4.0. Sure you can stop and start the workstation and/or server services. Ever done it? How stable is NT after that?
      
      I can tell you that on *nix restarting the Samba daemon happens seamlessly.
      
  - Re:/etc/rc.d ? (Score:5, Insightful)
    
    by delta407 ( 518868 ) writes: <slashdot@nosPAm.lerfjhax.com> on Monday May 12, 2003 @10:57AM (#5936451) Homepage
    
    There's no equivalent on Windows -- if SMB-based filesharing goes down on an NT box, you're restarting the computer, there is no other choice.
    
    How about restarting the "Server" service?
    
    Depending on how file sharing "goes down", you may need to restart a different service. Don't be ignorant: it is usually possible to fix an NT box while it's running. However, it's usually easier to reboot, and if it's not too big of a deal, Windows admins usually choose to reboot rather than go in and figure out what processes they have to kick.
    
- Re:/etc/rc.d ? (Score:4, Interesting)
  
  by Mark Hood ( 1630 ) writes: on Monday May 12, 2003 @07:30AM (#5935414) Homepage
  
  It's different (in my view) in that you can go even lower than that... Imagine you're running a webserver, and you get 1000 hits a minute (say).
  
  Now say that someone manages to hang a session, because of a software problem. Eventually the same bug will hang another one, and another until you run out of resources.
  
  Just being able to stop the web server & restart to clear it is fine, but it is still total downtime, even if you don't need to reboot the PC.
  
  Imagine you could restart the troublesome session and not affect the other 999 hits that minute... That's what this is about.
  
  Alternatively, making a config change that requires a reboot is daft - why not apply it for all new sessions from now on? If you get to a point where people are still logged in after (say) 5 minutes you could terminate or restart their sessions, perhaps keeping the data that's not changed...
  
  rc.d files are a good start, but this is about going further.
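
The "restart the one troublesome session, leave the other 999 alone" idea might look roughly like this. It is a sketch under assumptions: `Session`, `reap_hung_sessions`, and the use of a last-progress timestamp to detect a hang are all hypothetical, and a real server would also have to release whatever resources the hung session holds.

```python
class Session:
    """Tracks when a session last made progress (seconds, any clock)."""

    def __init__(self, session_id, now):
        self.session_id = session_id
        self.last_progress = now


def reap_hung_sessions(sessions, now, limit_seconds):
    """Remove sessions with no progress for more than `limit_seconds`.

    `sessions` is a dict of session id -> Session and is modified in
    place. Returns the ids that were removed; all others are untouched.
    """
    hung = [
        sid for sid, session in sessions.items()
        if now - session.last_progress > limit_seconds
    ]
    for sid in hung:
        del sessions[sid]
    return hung
```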
  
  - Re:/etc/rc.d ? (Score:2)
    
    by platypus ( 18156 ) writes:
    
    How about killing just the worker process which hangs?
  - Re:/etc/rc.d ? (Score:2, Insightful)
    
    by GigsVT ( 208848 ) * writes:
    
    Apache sorta does this with its thread pool.
    
    That aside, wouldn't the proper solution be to fix the bug, rather than covering it up by treating the symptom?
    
    I think this ROC could only encourage buggier programs.
    - Re:/etc/rc.d ? (Score:2)
      
      by the-dude-man ( 629634 ) writes:
      
      That's what the goal is, but Apache is also trying to keep these threads locked down as well, i.e. if someone tries a buffer overrun, we can't simply 'return', because they may have overrun the return address; so kill the thread immediately, flush the stack, and don't give them a chance to get to that pointer.
      
      Yes, fixing the bug is a proper solution; however, the idea behind this is that you can never catch 100% of the bugs. That is the one thing you can guarantee with any piece of software, because
      - Re:/etc/rc.d ? (Score:2)
        
        by Mark Hood ( 1630 ) writes:
        
        Exactly.
        
        This is what happened in the telco system I mentioned [slashdot.org]. Sure, we need to fix the bug, but when the system spots it and cleans up it also produces a report. This allows a patch to be created and loaded (on the fly, usually) which solves the bug without affecting anyone else. In the meantime, the bug only affects the people who trigger it, not everyone logged in at once!
        
        Re:/etc/rc.d ? (Score:2)
        
        by the-dude-man ( 629634 ) writes:
        
        Yes... there are a lot of warnings, and some non-fatal errors... however, some of these are in X, some of these are in KDE... but the reasons for them are in the code. The warnings you get are because some of the code is portable, i.e. it's designed to compile on ppc, mips32/64 and x86; the warnings you are getting are largely a result of the code having to be hacked up a little so it will compile/run correctly on other architectures (so you can't always do things the way the compiler wants) and because coders in
  - Re:/etc/rc.d ? (Score:2)
    
    by 42forty-two42 ( 532340 ) writes:
    
    Imagine you could restart the troublesome session and not affect the other 999 hits that minute...
    So delete the offending session from the database.
    - Re:/etc/rc.d ? (Score:2)
      
      by the-dude-man ( 629634 ) writes:
      
      In order to isolate that session in memory (without affecting other users), you need some of the very concepts we are talking about. Also, the goal is to make it more stable for end users, so we only want to kill the session if we can't fix the bug
      - Re:/etc/rc.d ? (Score:2)
        
        by 42forty-two42 ( 532340 ) writes:
        
        Put all the data for a session in a few tables, link it all to the SID. Then just do a SQL query to kill it if a sanity check somewhere detects corruption.
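
The suggestion above, sketched with Python's standard `sqlite3` module and an in-memory database. The table and column names (`sessions`, `cart_items`, `sid`) are hypothetical; the only point is that everything belonging to a session hangs off its SID, so one `DELETE` removes it.

```python
import sqlite3

def open_store():
    """Create an in-memory store where all session data links to the SID."""
    db = sqlite3.connect(":memory:")
    db.execute("PRAGMA foreign_keys = ON")  # SQLite needs this for cascades
    db.execute("CREATE TABLE sessions (sid TEXT PRIMARY KEY, user TEXT)")
    db.execute(
        "CREATE TABLE cart_items ("
        " sid TEXT REFERENCES sessions(sid) ON DELETE CASCADE,"
        " item TEXT)"
    )
    return db

def kill_session(db, sid):
    """Delete one session; linked rows go with it via ON DELETE CASCADE."""
    with db:  # commits on success, rolls back if the DELETE raises
        db.execute("DELETE FROM sessions WHERE sid = ?", (sid,))
```

As the parent comment notes, this covers the stored state only. Detecting the corruption in the first place, and isolating the session's in-memory state, is the part the sanity check still has to do.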
- Re:/etc/rc.d ? (Score:2)
  
  by 42forty-two42 ( 532340 ) writes:
  
  HURD is an even better example - TCP breaking? Reboot it! Of course, you have a single-threaded filesystem, but that's okay, right?
- Re:/etc/rc.d ? (Score:2)
  
  by Bluelive ( 608914 ) writes:
  
  rc.d doesn't detect failures in the daemons, it doesn't resolve dependencies between daemons, and so on. rc.d is a step in the right direction but it isn't a solution to the whole problem set.
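
The dependency point can be made concrete with a short sketch using `graphlib` from the Python standard library (Python 3.9 or later). The daemon names and the `depends_on` table are hypothetical, and `restart_set` is just one simple way to decide what else must be restarted when a daemon fails.

```python
from graphlib import TopologicalSorter

def start_order(depends_on):
    """Return daemons in an order where each starts after its dependencies."""
    return list(TopologicalSorter(depends_on).static_order())

def restart_set(depends_on, failed):
    """Return the failed daemon and everything depending on it, in start order."""
    affected = {failed}
    changed = True
    while changed:
        changed = False
        for daemon, deps in depends_on.items():
            if daemon not in affected and affected & set(deps):
                affected.add(daemon)
                changed = True
    return [d for d in start_order(depends_on) if d in affected]

# Hypothetical services: each key lists what it depends on.
depends_on = {
    "network": [],
    "database": ["network"],
    "webapp": ["database"],
    "mail": ["network"],
}
```

With this table, a failed `database` means restarting `database` and then `webapp`, while `mail` and `network` are left running.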
- Re:/etc/rc.d ? (Score:2)
  
  by NearlyHeadless ( 110901 ) writes:
  
  OK, how is this different from the scripts in /etc/rc.d that can start, stop, or restart all my system services? Any daemon process needs this feature, right? It doesn't help if the machine has locked up entirely.
  
  If you're really interested, take a look at http://www.stanford.edu/~candea/research.html [stanford.edu], especially JAGR: An Autonomous Self-Recovering Application Server [stanford.edu], built on top of JBOSS.
hmmmmm (Score:5, Funny)

by Shishio ( 540577 ) writes: on Monday May 12, 2003 @07:03AM (#5935349)

the disappearance of a large Internet site.

Yeah, I wonder what could ever bring down a large Internet site?
Ahem. [slashdot.org]

test errors (Score:3, Funny)

by paulmew ( 609565 ) writes: on Monday May 12, 2003 @07:03AM (#5935350)

"Last, computer scientists should develop the ability to inject test errors" Ah, so that explains those BSODs. It's not a fault, it's a feature....

ROC detail (Score:5, Informative)

by rleyton ( 14248 ) writes: on Monday May 12, 2003 @07:04AM (#5935352) Homepage

For a much better, and more detailed, discussion of Recovery Oriented Computing, you're better off visiting the ROC group at Berkeley [berkeley.edu], specifically David Patterson's writings [berkeley.edu].

Computer.... (Score:3, Funny)

by Viceice ( 462967 ) writes: on Monday May 12, 2003 @07:05AM (#5935353)

Heal thy-self!

it will not work now (Score:4, Insightful)

by KingRamsis ( 595828 ) writes: <kingramsis&gmail,com> on Monday May 12, 2003 @07:06AM (#5935357)

Computers still rely on the original John von Neumann architecture they are not redundant in any way, there will be always a single point of failure for ever, no matter what you hear about RAID, redundant power supplies etc.. etc.. basically the self-healing system is based on the same concept, compare that to a natural thing like the nervous system of humans now that is redundant and self healing, a fly has more wires in its brain than all of the internet nodes, cut your finger and after a couple of days a fully automated autonomous transparent healing system will fix it, if we ever need to create self healing computers we need to radically change what is a computer, we need to break from the John von Neumann architecture not because anything wrong with it but because it is reaching its limits quickly, we need truly parallel autonomous computers with replicated capacity that increase linearly by adding more hardware, and software paradigms that take advantage of that, try to make a self-healing self-fixing computer today and you will end up with a very complicated piece of software that will fail in real life.

- Re:it will not work now (Score:2, Interesting)
  
  by torpor ( 458 ) writes:
  
  So what are some of the other paradigms which might be proffered instead of von Neumann?
  
  My take is that for as long as CPU design is instruction-oriented instead of time-oriented, we won't be able to have truly trusty 'self-repairable' computing.
  
  Give every single datatype in the system its own tightly-coupled timestamp as part of its inherent existence, and then we might be getting somewhere ... the biggest problems with existing architectures for self-repair are in the area of keeping track of one thing:
  - Re:it will not work now (Score:2, Interesting)
    
    by KingRamsis ( 595828 ) writes:
    
    well the man who answers this question will certainly become the von Neumann of the century, you need to do some serious out of the box thinking, first you throw away the concept of the digital computer as you know it, personally I think there will be a split in computer science, there will be generally two computer types the "classical" von Neumann and a new and different type of computer, the classical computer will be useful as a controller of some sort for the newer one, it is difficult to come up with
    - A Real Nostradamus (Score:2)
      
      by JCMay ( 158033 ) writes:
      
      1. It must be data oriented with no concept of instructions (just routing information), data flows in the system and transformed in a non-linear way, and the output will be all possible computations doable by the transformations.
      
      So, what would these transformations be other than... instructions? You could show me a list of "transformations" that the input data is to undergo to generate an output, and I'd show you a list of "instructions" that tell the computer what to do to the input data to generate a
- Re:it will not work now (Score:2)
  
  by the-dude-man ( 629634 ) writes:
  
  Well yes and no.
  
  ROC, I don't think, will ever yield servers that can heal themselves... rather, it will yield servers that are able to take corrective measures for a wide array of problems... there really is no way to make a completely redundant system; well, there may be, but as you said, we are nowhere near there yet.
  
  ROC may someday evolve into that; however, for the moment, it's really a constantly expanding range of exceptional situations that a system can handle by design. Using structures such as excepti
- SPOFs (Score:2)
  
  by 6hill ( 535468 ) writes:
  
  there will be always a single point of failure for ever
  Well, yes and no. Single points of failure are extremely difficult to find in the first place, not to mention remove, but it can be done on the hardware side. I could mention the servers formerly known as Compaq Himalaya, nowadays part of HP's NonStop Enterprise Division [hp.com] in some manner. Duplicated everything, from processors and power sources to I/O and all manner of computing doo-dads. Scalable from 2 to 4000 processors.
  They are (or were, when I d
  - Re:SPOFs (Score:2, Insightful)
    
    by KingRamsis ( 595828 ) writes:
    
    so it is basically two synchronized computers, it probably cost 3x the normal, and if you wiped out the self-correcting logic the system was likely to die, you mentioned that they managed to duplicate everything did they duplicate the self-correcting logic itself ?
    
    the primary immediately hands over the responsibility to the redundant/backup
    is there an effective way to judge which processor is correct? you need an odd number of processors to do that or an odd split on an even number of processors.
    I'm
    - Re:SPOFs (Score:2)
      
      by 6hill ( 535468 ) writes:
      
      so it is basically two synchronized computers, it probably cost 3x the normal, and if you wiped out the self-correcting logic the system was likely to die, you mentioned that they managed to duplicate everything did they duplicate the self-correcting logic itself ?
      Uh...? No self-correcting logic itself, merely hardware duplication. The processor checks were (IIRC) implemented with checksums or some such integrity checks, so this is not in essence a self-correcting system in anything but the assembly le
- - - Re:it will not work now (Score:2)
      
      by swillden ( 191260 ) writes:
      
      english is not my native language.
      Your English is fine. You just need to learn to break it into sentence-sized chunks.
      just extract the knowledge in the post
      Sorry, not interested. I have better things to do. If you want people to read what you write, you should do your best to make it easy for them. Otherwise they'll spend their time more efficiently, reading the ideas of someone who cares enough to make themselves understandable.
Various levels of rebooting... (Score:5, Funny)

by jkrise ( 535370 ) writes: on Monday May 12, 2003 @07:06AM (#5935358) Journal

Micro-rebooting: Restart service.
Mini-rebooting: Restart Windows 98
Rebooting : Switch off/on power
Macro-rebooting: BSOD.
Mega-rebooting: BSOD--> System crash--> reload OS from Recovery CD--> Reinstall apps --> reinstall screen savers --> reinstall Service Packs --> Say your prayers --> Reboot ---> Curse --> Repeat.

!RTFA, but (Score:3, Interesting)

by the_real_tigga ( 568488 ) writes: <[nephros] [at] [users.sourceforge.net]> on Monday May 12, 2003 @07:08AM (#5935363) Journal

I wonder if this [osdl.org] [PDF!] cool new feature will help there.

Sounds a lot like "micro-rebooting" to me...

uunnschulding sme.. (Score:3, Insightful)

by danalien ( 545655 ) writes: on Monday May 12, 2003 @07:08AM (#5935366) Homepage

but if end-users got a better computer education, I think most of the problems would be fixed.

I find it quite funny that the "ground course in computers" courses we have (here in Sweden) only educate people in how to use Word/Excel/PowerPoint/etc... nothing _fundamental_ about how to operate a computer. It's like learning how to use the cigarette lighter in your car, and declaring yourself someone who can drive a car. And now you want a quick fix for your incompetence in driving "the car".

Compulsory M$ joke (Score:3, Funny)

by Rosco P. Coltrane ( 209368 ) writes: on Monday May 12, 2003 @07:11AM (#5935373)

Third, programmers ought to build systems that support an "undo" function (similar to those in word-processing programs), so operators can correct their mistakes. Last, computer scientists should develop the ability to inject test errors; these would permit the evaluation of system behavior and assist in operator training.
[WARNING]
You have installed Microsoft[tm] Windows[tm]. Would you like to undo your mistake, or are you simply injecting test errors on your system ?
[Undo] [Continue testing]

Hmm. (Score:5, Insightful)

by mfh ( 56 ) writes: on Monday May 12, 2003 @07:12AM (#5935376) Homepage Journal

Our computers are probably 10,000 times faster than they were twenty years ago. But operating them is much more complex
I think that's a big fat lie.

- Re:Hmm. (Score:2)
  
  by Technician ( 215283 ) writes:
  
  Our computers are probably 10,000 times faster than they were twenty years ago. But operating them is much more complex
  
  Let's see. IBM PC XT 4.7 Megahertz to Pentium 4 at 3 Gigahertz. (3,000 Megahertz) It seems a little shy of 10,000 times unless you factor going from an 8 bit processor to a 32 bit processor. That's 4X the bandwidth. I don't think they missed the mark by much. 10,000 times or 12,000 times, what's the diff?
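
Working the comment's own numbers through, as a quick check. The figures are the ones quoted above (4.7 MHz, 3,000 MHz, 8-bit to 32-bit); treating word width as a straight 4x multiplier is the comment's assumption, not a measured result.

```python
xt_mhz = 4.7       # clock figure quoted in the comment
p4_mhz = 3000.0    # 3 GHz expressed in MHz

clock_ratio = p4_mhz / xt_mhz      # speedup from clock rate alone
width_ratio = 32 / 8               # the comment's "4X the bandwidth"
combined = clock_ratio * width_ratio
shortfall = 10_000 / combined      # how far short of 10,000x this lands
```

Clock rate alone gives a ratio of roughly 640, and multiplying by 4 gives roughly 2,550, so these two factors leave about another factor of four to be explained. The reply below about work done per cycle is where that remainder would have to come from.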
  - Re:Hmm. (Score:2)
    
    by gl4ss ( 559668 ) writes:
    
    hmm... is it just me but does windows/beos/whatever look more complex to operate than ms dos 1.0 and dos based programs in general?
  - Re:Hmm. (Score:2)
    
    by justin_speers ( 631757 ) writes:
    
    I agree with the original post actually, I think you misinterpreted it.
    
    Computers may be (approximately) 10,000 times faster, but is operating them really more complex?
  - Re:Hmm. (Score:3, Insightful)
    
    by mr3038 ( 121693 ) writes:
    
    IBM PC XT 4.7 Megahertz to Pentium 4 at 3 Gigahertz. (3,000 Megahertz) It seems a little shy of 10,000 times unless you factor going from an 8 bit processor to a 32 bit processor.
    You don't need to go that far back to history to see a really big difference. Just compare the FPU speed of i287 and Athlon. i287 took minimum of 90 cycles for FMUL, minimum of 70 cycles for FADD and at least 30 cycles for a floating point load [8m.com]. Compare that to Athlon that can do two loads, FMUL and FADD every cycle [cr.yp.to]. So, somethin
Write scripts for it... (Score:5, Insightful)

by ndogg ( 158021 ) writes: <the@rhorn.gmail@com> on Monday May 12, 2003 @07:15AM (#5935382) Homepage Journal

and cron them in.

This concept isn't particularly new. It's easy to write a script that will check a particular piece of the system by running some sort of diagnostic command (e.g. netstat), parse the output, and make sure everything looks normal. If something doesn't look normal, just stop the process and restart it, or do whatever you need to do to get the service back up and running, or secured, or whatever is needed to make the system normal again.

Make sure that script is part of a crontab that's run somewhat frequently, and things should recover on their own as soon as they fail (well, within the time-frame that you have the script running within your crontab.)
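
A minimal sketch of the kind of check script described above, written in Python rather than shell. It assumes Linux-style "netstat -ltn" output; the port number, the function names, and the /etc/init.d/httpd restart command are hypothetical stand-ins, not anything from the original post.

```python
import subprocess


def port_is_listening(netstat_output, port):
    """Return True if a LISTEN row in netstat-style output uses the given local port."""
    for line in netstat_output.splitlines():
        fields = line.split()
        # Rows from `netstat -ltn` on Linux look like:
        #   tcp  0  0  0.0.0.0:80  0.0.0.0:*  LISTEN
        if len(fields) >= 6 and fields[-1] == "LISTEN":
            local_address = fields[3]
            if local_address.rsplit(":", 1)[-1] == str(port):
                return True
    return False


def check_and_restart(run_diagnostic, restart, port):
    """Run the diagnostic; call restart() only if nothing listens on the port.

    Returns True when a restart was triggered.
    """
    if port_is_listening(run_diagnostic(), port):
        return False
    restart()
    return True


def netstat_listening():
    """Diagnostic command: list listening TCP sockets, numeric addresses."""
    result = subprocess.run(["netstat", "-ltn"], capture_output=True, text=True, check=False)
    return result.stdout


def restart_httpd():
    """Hypothetical restart command; substitute whatever your init system uses."""
    subprocess.run(["/etc/init.d/httpd", "restart"], check=False)


# The cron'd script would end with a call such as:
#   check_and_restart(netstat_listening, restart_httpd, port=80)
```

A crontab entry along the lines of "*/5 * * * * python3 /usr/local/bin/check_httpd.py" (path hypothetical) would then bound recovery time at roughly five minutes, which is the time-frame caveat mentioned above.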

"Undo" feature? That's what backups are for.

Of course, the article was thinking that this would be built into the software, but I don't think that is that much better of a solution. In fact, I would say that that would make things more complicated than anything.

- Re:Write scripts for it... (Score:2)
  
  by the-dude-man ( 629634 ) writes:
  
  You're quite right... most large systems are maintained by shell scripts and the crontab.
  
  However, this is inherently limited to finding the errors. Some errors (e.g. /var/run has incorrect permissions) can't be solved by restarting the service; this concept is about identifying the problem and then taking corrective measures.
  
  What you described is a primitive version of this. It will handle most of the *dumb* errors, not persistent errors that could be outside of the program's control. ROC is more/less an ev
Self Repairing gone bad (Score:2, Insightful)

by UndercoverBrotha ( 623615 ) writes:

Windows Installer [microsoft.com] was an effort in self "repairing" or "healing", whatever you would like to call it. However, am I the only one who has seen errors like "Please insert Microsoft Office XP CD.." blah blah, when nothing is wrong, and you have to cancel out of it just to use something totally unrelated, like say Excel or Word?

The Office 2000 self-repairing installation is another notorious one [google.com]: if you remove something, the installer thinks it has been removed in error and tries to reinstall it...

Oh w
- Re:Self Repairing gone bad (Score:2)
  
  by swankypimp ( 542486 ) writes:
  
  This week I took a look at my sister's chronically gimpy machine. It had Gateway's "GoBack" software on it, which lets the OS return to a bootable state if it gets completely hosed (the "system restore" option on newer versions of Windows is similar, but GoBack loads right after the BIOS POST, before the machine tries to boot the OS).
  The problem is that GoBack interprets easily recoverable errors as catastrophic. The machine didn't shut down properly? GoBack to previously saved state. BSOD lockup? GoBa
Second paragraph (Score:5, Insightful)

by NewbieProgrammerMan ( 558327 ) writes: on Monday May 12, 2003 @07:18AM (#5935389)

The second paragraph of the "long and dense article" strikes me as hyperbole. I haven't noticed that my computer's "operation has become brittle and unreliable" or that it "crash[es] or freeze[s] up regularly." I have not experienced the "annual outlays for maintenance, repairs and operations" that "far exceed total hardware and software costs, for both individuals and corporations."

Since this is /. I feel compelled to say this: "Gee, sounds like these guys are Windows users." Haha. But, to be fair, I have to say that - in my experience, at least - Windows2000 has been pretty stable both at home and at work. My computers seem to me to have become more stable and reliable over the years.

But maybe my computers have become more stable because I learned to not tweak on them all the time. As long as my system works, I leave it the hell alone. I don't install the "latest and greatest M$ service pack" (or Linux kernel, for that matter) unless it fixes a bug or security vulnerability that actually affects me. I don't download and install every cutesy program I see. My computer is a tool I need to do my job - and since I've started treating it as such, it seems to work pretty damn well.

I already do this with Linux... (Score:3, Interesting)

by jkrise ( 535370 ) writes: on Monday May 12, 2003 @07:18AM (#5935391) Journal

Here's the strategy:
1. Every system will have a spare 2GB filesystem partition, where I copy all the files of the 'root' filesystem, after successful instln., drivers, personalised settings, blah blah.
2. Every day, during shutdown, users are prompted to 'copy' changed files to this 'backup OS partition'. A script handles this - only changed files are updated.
3. After the 1st instln. a copy of the installed version is put onto a CD.
4. On a server with 4*120GB IDE disks, I've got "data" (home dirs) of about 200 systems in the network - updated once a quarter.

Now, for self-repairing:
1. If user messes up with settings, kernel etc., boot tomsrtbt, run a script to recopy changed files back to root filesystem -> restart. (20 mins)
2. If disk drive crashes, install from CD of step 3, and restore data from server.(40 mins)
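
The "only changed files are updated" step and the restore step can be sketched roughly as follows. This is a Python illustration, not the poster's actual script: comparing file size and modification time is an assumed change-detection rule, and sync_changed is a made-up name.

```python
import os
import shutil


def _differs(src, dst):
    """Treat a file as changed if the copy is missing or differs in size or mtime."""
    if not os.path.exists(dst):
        return True
    s, d = os.stat(src), os.stat(dst)
    return s.st_size != d.st_size or int(s.st_mtime) != int(d.st_mtime)


def sync_changed(src_root, dst_root):
    """Copy only new or changed files from src_root into dst_root.

    Returns the sorted list of relative paths that were copied.
    """
    copied = []
    for dirpath, _dirnames, filenames in os.walk(src_root):
        rel_dir = os.path.relpath(dirpath, src_root)
        target_dir = os.path.join(dst_root, rel_dir)
        os.makedirs(target_dir, exist_ok=True)
        for name in filenames:
            src = os.path.join(dirpath, name)
            dst = os.path.join(target_dir, name)
            if _differs(src, dst):
                shutil.copy2(src, dst)  # copy2 also carries over the mtime
                copied.append(os.path.normpath(os.path.join(rel_dir, name)))
    return sorted(copied)
```

Under these assumptions the daily backup would be sync_changed(root, backup_partition), and the repair from a rescue boot is the same call with the arguments swapped. The sketch never deletes files and ignores ownership and special files, so a real script would need more care.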

Foolproof system, so far - and yes, lots of foolish users around.

I used systems like this (Score:5, Interesting)

by Mark Hood ( 1630 ) writes: on Monday May 12, 2003 @07:24AM (#5935401) Homepage

they were large telecomms phone switches.

When I left the company in question, they had recently introduced a 'micro-reboot' feature that allowed you to only clear the registers for one call - previously you had to drop all the calls to solve a hung channel or if you hit a software error.

The system could do this for phone calls, commands entered on the command line, even backups could be halted and started without affecting anything else.

Yes, it requires extensive development, but you can do it incrementally. We had thousands of software 'blocks' which had this functionality added to them whenever they were opened for other reasons; we never added this feature unless we were already making major changes.

Patches could be introduced to the running system, and falling back was simplicity itself - the same went for configuration changes.

This stuff is not new in the telecomms field, where 'five nines' uptime is the bare minimum. Now that the telcos are trying to save money, they're looking at commodity PCs & open standard solutions, and shuddering - you need to reboot everything to fix a minor issue? Ugh!

As for introducing errors to test stability, I did this, and I can vouch for its effects. I made a few patches that randomly caused 'real world' type errors (call dropped, congestion on routes, no free devices) and let it run for a weekend as an automated caller tried to make calls. When I came in on Monday I'd caused 2,000 failures which boiled down to 38 unique faults. The system had not rebooted once, so only those 2,000 calls had even noticed a problem. Once the software went live, the customer spotted 2 faults in the first month, where previously they'd found 30... So I swear by 'negative testing'.
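
A toy Python sketch of that kind of negative-testing campaign. Everything here is hypothetical: the call handler, the deliberately unhandled error path standing in for a real defect, and the fault rate. It only illustrates the shape of the idea (inject errors at random, contain each failure to one call, then group failures into unique faults).

```python
import random
from collections import Counter

INJECTED_ERRORS = ("call dropped", "congestion on route", "no free devices")


class InjectedFault(Exception):
    """Raised when an injected error reaches code that does not handle it."""


def place_call(rng, fault_rate):
    """Toy call handler; with probability fault_rate an injected error hits it."""
    if rng.random() < fault_rate:
        error = rng.choice(INJECTED_ERRORS)
        if error == "no free devices":
            # Deliberate toy bug: this path has no recovery code, standing in
            # for a real defect that negative testing is meant to expose.
            raise InjectedFault(error)
        return "recovered from " + error
    return "connected"


def run_campaign(n_calls, fault_rate=0.2, seed=0):
    """Make n_calls automated calls; return (completed, failures grouped by fault)."""
    rng = random.Random(seed)  # seeded so a campaign can be replayed
    failures = Counter()
    completed = 0
    for _ in range(n_calls):
        try:
            place_call(rng, fault_rate)
            completed += 1
        except InjectedFault as exc:
            # Only this one call fails; the loop keeps serving the others.
            failures[str(exc)] += 1
    return completed, failures
```

The length of the failures counter plays the role of the "unique faults" figure: many failed calls collapse into a short list of distinct defects to fix.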

Nice to see the 'PC' world finally catching up :)

If people want more info, then write to me.

Mark

- Re:I used systems like this (Score:2)
  
  by the-dude-man ( 629634 ) writes:
  
  I've been striving to work this kind of stability into my client's software for years! To a certain extent, a lot of it's there; the problem with the PC world is you have to do an update every 3 days just to prevent someone from rooting your box with all the remote exploits floating around out there.
  
  I usually use large sets of negative data to isolate the problem... but there are just some things that users can cause that, in an integrated world like the PC world, will just take things down.
  
  Thats not
already done? (Score:2)

by the-dude-man ( 629634 ) writes:

hmmmm... Recovery Oriented Computing... This just screams Linux.

Recovery Oriented Computing is nothing new. Most developers (well, *nix developers) have been heading down this route for years, particularly as more hardcore OO languages (i.e. Java... and in many respects C++) come to the surface with exception structures; it becomes easier to isolate and identify the exception that occurred and take appropriate action to keep the server going.

However, this method of coding is still growing...there are
Excellent (Score:2, Funny)

by hdparm ( 575302 ) writes:

we could in that case:
rm -rf /*

^Z

just for fun!
ACID ROC? (Score:4, Insightful)

by shic ( 309152 ) writes: on Monday May 12, 2003 @07:39AM (#5935434)

I wonder... is there a meaningful distinction between ROC and the classical holy grail of ACID systems (i.e. systems which meet the Atomic, Consistent, Isolated and Durable assumptions commonly cited in the realm of commercial RDBMS)? Apart from the 'swish' buzzword re-name that isn't even an acronym?

Professionals in the field, while usually in agreement about the desirability of systems which pass the ACID test, mostly admit that while the concepts are well understood, the real-world cost of the additional software complexity often precludes strict ACID compliance in typical systems. I would certainly be interested if there were more to ROC than evaluating the performance of existing well understood ACID-related techniques but can't find anything more than the "hype." For example, has ROC suggested designs to resolve distributed incoherence due to hardware failure? Classified non-trivial architectures immune to various classes of failure? Discovered a cost effective approach to ACID?

Not going to work (Score:2, Offtopic)

by locarecords.com ( 601843 ) writes:

This is pie in the sky.
My experience is the best system is paired computers running in parallel that are balanced by another computer that watches for problems and switches the crashed system from Live to the other computer seamlessly. It then reboots the system with problems and allows it to recreate its dataset from its partner.
In effect this points the way to the importance of massive parallelism required for totally stable systems so that clusters form the virtual computer and we get away from the i
The Hurd (Score:4, Interesting)

by rf0 ( 159958 ) writes: <rghf@fsck.me.uk> on Monday May 12, 2003 @07:43AM (#5935448) Homepage

Wouldn't some sort of software solution be the Hurd (if/when it becomes ready) in that as each system is a micro-kernel you just restart that bit of the operating system. As said in another post this is like /etc/rc.d but at a lower level.

Or you could just have some sort of failover setup.

Rus

- Re:The Hurd (Score:2)
  
  by sql*kitten ( 1359 ) writes:
  
  Wouldn't some sort of software solution be the Hurd (if/when it becomes ready) in that as each system is a micro-kernel you just restart that bit of the operating system. As said in another post this is like /etc/rc.d but at a lower level.
  
  QNX, I believe, already does this, and has been in production use throughout the world for years.
Magic Server Pixie Dust (Score:3, Funny)

by thynk ( 653762 ) writes: <slashdotNO@SPAMthynk.us> on Monday May 12, 2003 @07:45AM (#5935455) Homepage Journal

Didn't IBM come out with some Magic Server Pixie Dust that did this sort of thing already, or am I mistaken?

- Re:Magic Server Pixie Dust (Score:2)
  
  by the-dude-man ( 629634 ) writes:
  
  That was just a gimmick for that commercial... what they were actually selling is BSD boxes running ports that can update themselves, rebuild the kernel according to pre-defined specs, and reboot when necessary to implement the changes, but designed not to reboot whenever possible (so they build the kernel to be very modular and only update the modules as needed until something in the base needs to be updated, then rebuild the kernel and reboot. And using some tweaked out bash scripting to respawn servic
"operating them is much more complex" (Score:2, Funny)

by NReitzel ( 77941 ) * writes:

Are you crazy?
My first "PC" was a PDP-11/20, with paper tape reader and linc tape storage. Anyone who tries to tell me that operating today's computers is much more complex needs to take some serious drugs.
What is more complex is what today's computers do, and increasing their reliability or making them goal oriented are both laudable goals. What will not be accomplished is making the things that these computers actually do less complex.
Ah, youth... (Score:3, Insightful)

by tkrotchko ( 124118 ) * writes: on Monday May 12, 2003 @07:51AM (#5935471) Homepage

"But operating them is much more complex."

You're saying the computers of today are more complex to operate than those of 20 years ago?

What was the popular platform 20 years ago.... (1983). The MacOS had not yet debuted, but the PC XT had. The Apple ][ was the main competitor.

So you had a DOS command line and an AppleDOS command line. Was that really simpler than pointing and clicking in XP and OSX today? I mean, you can actually have your *mother* operate a computer today.

I'm not sure I agree with the premise.

- Re:Ah, youth... (Score:2)
  
  by the-dude-man ( 629634 ) writes:
  
  So you had a DOS command line and an AppleDOS command line. Was that really simpler than pointing and clicking in XP and OSX today? I mean, you can actually have your *mother* operate a computer today.
  
  This is true; however, keep in mind that none of the DOS operating systems had a kernel, nor were any of them truly multitasking until Windows 95 for the Windows world (shudders). And the debut of Unix 20 years ago.
  
  Also keep in mind all the new technologies such as networking, (that's a whole post o
- Re:Ah, youth... (Score:3, Interesting)
  
  by Idarubicin ( 579475 ) writes:
  
  I mean, you can actually have your *mother* operate a computer today.
  Do we have to keep using this tired old notion of little old (middle-aged, for the /. crowd) ladies cringing in terror when faced with a computer?
  My mother has a B.Math in CS, acquired more than a quarter century ago. Her father is pushing eighty, and he upgrades his computer more often than I do. When he's not busy golfing, he's scanning photographs for digital retouching. (In his age bracket, a man who can remove double chins and
A computer is no washmachine, but why ? (Score:3, Insightful)

by Quazion ( 237706 ) writes: on Monday May 12, 2003 @08:02AM (#5935504) Homepage

Washing machines have a lifetime of around 15-20 years I guess, computers about 1-3 years.
This is because the technical computer stuff is so new every year and so...

1: It's too expensive to make it failsafe, development would take too long.
2: You can't refine/redesign and resell, because of new technology.
3: If it just works no one will buy new systems, so they have to fail every now and then.

While with other consumer products they have a much longer development cycle; cars for example shouldn't fail, and if they do they should be fairly easy to repair. Cars have also been around for, I don't know, something like a hundred years, and have they changed much? Computers, heck, just buy a new one or hire a PC Repair Man [www.pcrm.nl] (Dutch only) to do your fixing.

excuse me for my bad english ;-) but i hope you got the point, no time to ask my living dictionary.

- English (Score:2)
  
  by rf0 ( 159958 ) writes:
  
  I wouldn't worry about your English. It's better than some native speakers I've seen
  
  Rus
But I do that already... (Score:3, Informative)

by edunbar93 ( 141167 ) writes: on Monday May 12, 2003 @08:03AM (#5935508)

build an "undo" function (similar to those in word-processing programs) for large computing systems

This is called "the sysadmin thinks ahead."

Essentially, when any sysadmin worth a pile of beans makes any changes whatsoever, he makes sure there's a backup plan before making his changes live. Whether it means running the service on a non-standard port to test, running it on the development server to test, making backups of the configuration and/or the binaries in question, or making backups of the entire system every night. She is thinking "what happens if this doesn't work?" before making any changes. It doesn't matter if it's a web server running on a lowly Pentium 2 or Google - the sysadmin is paid to think about actions before making them. Having things like this won't replace the sysadmin, although I can imagine a good many PHBs trying before realizing that just because you can back out of stupid mistakes, doesn't mean you can keep them from happening in the first place.

Does SCI AM review articles properly nowadays? (Score:4, Insightful)

by panurge ( 573432 ) writes: on Monday May 12, 2003 @08:10AM (#5935525)

The authors either don't seem to know much about the current state of the art or are just ignoring it. And as for unreliability - well, it's true that the first Unix box I ever had (8 user with VT100 terminals) could go almost as long without rebooting as my most recent small Linux box, but there's a bit of a difference in traffic between 8 19200 baud serial links and two 100baseT ports, not to mention the range of applications being supported.
Or the factor of 1000 to 1 in hard disk sizes.
Or the 20:1 price difference.
I think a suitable punishment would be to lock the authors in a museum somewhere that has a 70s mainframe, and let them out when they've learned how to swap disk packs, load the tapes, splice paper tape, connect the Teletype, sweep the chad off the floor, stack a card deck or two and actually run an application...those were the days, when computing kept you fit.

- Re:Does SCI AM review articles properly nowadays? (Score:5, Insightful)
  
  by NearlyHeadless ( 110901 ) writes: on Monday May 12, 2003 @09:24AM (#5935831)
  
  The authors either don't seem to know much about the current state of the art or are just ignoring it.
  
  I have to say that I am just shocked at the inane reactions on slashdot to this interesting article. Here we have a joint project of two of the most advanced CS departments in the world. David Patterson's name, at least, should be familiar to anyone who has studied computer science in the last two decades since he is co-author of the pre-eminent textbook on computer architecture.
  Yet most of the comments (+5 Insightful) are (1) this is pie in the sky, (2) they must just know Windows, har-de-har-har, (3) Undo is for wimps, that is what backups are for, (4) this is just "managerspeak".
  
  Grow up people. They are not just talking about operating systems, they do know what they are talking about. Some of their research involved hugely complex J2EE systems that run on, yes, Unix systems. Some of their work involves designing custom hardware--"ROC-1 hardware prototype, a 64-node cluster with special hardware features for isolation, redundancy, monitoring, and diagnosis."
  
  Perhaps you should just pause for a few minutes to think about their research instead of trying to score Karma points.
  
Some of this isn't entirely new... (Score:5, Interesting)

by Mendenhall ( 32321 ) writes: on Monday May 12, 2003 @08:19AM (#5935564)

As one component of my regular job (I am a physicist), I develop control systems for large scientific equipment, and have been doing so for about 25 years. One of the cornerstones of this work has been high-reliability operation and fault tolerance.

One of the primary tricks I have used has always been mockup testing of software and hardware with an emulated machine. In a data acquisition/control system, I can generate _lots_ of errors and fault conditions, most of which would never be seen in real life. This way, I can not only test the code for tolerance, and do so repeatedly, I can thoroughly check the error-recovery code to make sure it doesn't introduce any errors itself.

This is really the software equivalent to teaching an airline pilot to fly on a simulator. A pilot who only trains in real planes only gets one fatal crash, (obviously), and so never really learns how to recover from worst-case scenarios. In a simulator, one can repeat 'fatal' crashes until they aren't fatal any more. My software has been through quite the same experience, and it is surprising the types of errors one can avoid this way.

Really, the main problem with building an already highly reliable system, using very good hardware, etc., is that you must do this kind of testing, since failures will start out very rare, and so unless one intentionally creates faults, the ability to recover from them is not verified. Especially in asynchronous systems, one must test each fault many times, and in combinations of multiple faults to find out how hard it is to really break a system, and this won't happen without emulating the error conditions.

- Re:Some of this isn't entirely new... (Score:3, Interesting)
  
  by redragon ( 161901 ) writes:
  
  A lot of simulators have the ability to roll-back. Seriously, if you're running a simulation that can take days/weeks/etc, do you really want a crash to bring the system completely down?
  
  Heck no.
  
  Many of these systems will save results in such a fashion that if the system does go down, when the faulty component is found and fixed, the system can be brought back up to its state just prior to the crash.
Nope. Memory (Score:5, Interesting)

by awol ( 98751 ) writes: on Monday May 12, 2003 @08:27AM (#5935600) Journal

The problem here is that whilst it is true that _certain_ aspects of computational power have increased "probably 10,000 times", others have not. To really make stuff work like this you need an undo, because that is the critical bit; redundant hardware already exists, Non-Stop from HP (nee Himalaya) for example.

Where I work we implemented at least one stack based undo functionality and it worked really nicely: we trapped SIGSEGVs etc. and just popped the appropriate state back into the places that were touched in the event of an error. We wrote a magical "for loop" construct that broke out after N iterations regardless of the other constraints. The software that resulted from this was uncrashable. I mean that relatively seriously: you could not crash the thing. You could very seriously screw data up through bugs, but the beast would just keep on ticking.
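
A rough Python analogue of the two ideas in that paragraph. The poster's system trapped signals in native code; this sketch uses exceptions instead, and UndoLog, run_guarded and bounded are invented names used purely for illustration.

```python
class UndoLog:
    """Records the previous value of every key written so writes can be popped back."""

    _MISSING = object()  # marker for "key did not exist before the write"

    def __init__(self, state):
        self.state = state
        self._stack = []

    def write(self, key, value):
        # Read the old value and push it before overwriting.
        self._stack.append((key, self.state.get(key, self._MISSING)))
        self.state[key] = value

    def rollback(self):
        # Pop in reverse order so the earliest saved value wins.
        while self._stack:
            key, old = self._stack.pop()
            if old is self._MISSING:
                self.state.pop(key, None)
            else:
                self.state[key] = old

    def commit(self):
        self._stack.clear()


def run_guarded(state, operation):
    """Run operation(log); on any exception restore every place it touched."""
    log = UndoLog(state)
    try:
        operation(log)
    except Exception:
        log.rollback()
        return False
    log.commit()
    return True


def bounded(iterable, max_iterations):
    """Yield items but stop after max_iterations, whatever the loop condition says."""
    for count, item in enumerate(iterable):
        if count >= max_iterations:
            break
        yield item
```

Note that every write costs a read of the old value plus a second write to the log, which is the memory-traffic overhead the post goes on to complain about.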

I had a discussion with a friend of mine more than a decade ago that all these extra MHz that were coming would eventually be overkill. His argument was that, no, more of them will be consumed, in the background making good stuff happen. He was thinking about things like voice recognition, handwriting recognition, predictive work etc etc. I agree with his point. If you have a surfeit of CPU then use it to do cool things (not wasting it on eye candy necessarily) to make the things easier to use. Indeed we see some of that stuff now, not enough, but some.

Self-repairing is an excellent candidate and with so much CPU juice lying around in your average machine, it must be workable. I mean think about the computers used for industrial plant. Most of them could be emulated faster in a P4 than they currently run. So emulate N and check the results against each other, if one breaks just emulate a new one and turf the old one. Nice.

But here's the rub. Memory, we have nowhere near decreased the memory latency by the same amount we have boosted the processing power (and as for IO, sheesh!), as a result, undo is very expensive to do generically, I mean it at least halves the amount of bandwidth since it is [read/write write] for each write not to mention the administrative overhead and we just haven't got that much spare capacity in memory latency left. Indeed, just after the ten year old discussion, I had to go and enhance some software to get past the HPUX9 800MB single shared memory segment limit and the demand is only just being outstripped by the affordable supply of memory, we do not yet have the orders of magnitude of performance to make the self correcting model work in a generic sense.

I think this idea will come, but it will not come until we have an order of magnitude more capacity in all the areas of the system. Until then we will see very successful but limited solutions like the one we implemented.

- Re:Nope. Memory (Score:2)
  
  by Glock27 ( 446276 ) writes:
  
  But here's the rub. Memory, we have nowhere near decreased the memory latency by the same amount we have boosted the processing power (and as for IO, sheesh!), as a result, undo is very expensive to do generically, I mean it at least halves the amount of bandwidth since it is [read/write write] for each write not to mention the administrative overhead and we just haven't got that much spare capacity in memory latency left.
  Ah, you obviously need the new carbon nanotube RAM [economist.com] coupled with IBM's carbon nanotub [ibm.com]
'IMPORTANT' 'NEW' 'DISCOVERY'! (Score:3, Funny)

by kahei ( 466208 ) writes: on Monday May 12, 2003 @08:42AM (#5935648) Homepage

Scientists discovered this week that well-known and rather obvious software engineering concepts like componentization and redundancy could seem new and impressive if written up like Science!

Although this week's breakthrough yielded little direct benefit, it is theorized that applying the verbal style of Science to other subjects, such as aromatherapy and running shoes, could have highly profitable results.

Micro-booting (Score:2)

by Root Down ( 208740 ) writes:

It's not really about self-repairing computers, per se, but about rebooting modules. Seems that it is a good idea so long as the modules (EJBs, for instance) are fairly self-contained. The bigger issue is that if one EAR is misbehaving, it is likely that the others are experiencing the same troubles. (Unless they are running on different JVMs.) And what about the process that watches the processes for failure? If that fails... well, you're back to manual intervention, which is probably unavoidable in t
Multiple CPU/processes (Score:3, Insightful)

by YrWrstNtmr ( 564987 ) writes: on Monday May 12, 2003 @09:16AM (#5935807)

Do it the way it's done in critical aircraft systems.

Multiple (3 or more) cpu's or processes, performing the same action. At least 2 out of the 3 need to agree on any particular action. The offending one is taken offline and 'fixed' (rebooted/repaired/whatever).

Of course, with multiples, you increase the probability of a failure, but reduce the probability of a critical failure.
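
A small Python sketch of the 2-out-of-3 idea and of the probability trade-off just mentioned. It assumes the units fail independently with the same probability p and that a failed unit disagrees with the healthy ones; the function names are made up for illustration.

```python
from collections import Counter


def vote(results):
    """Return (majority_value, indices_of_dissenting_units).

    Raises RuntimeError when no value is reported by a strict majority.
    """
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise RuntimeError("no majority")
    dissenters = [i for i, r in enumerate(results) if r != value]
    return value, dissenters


def failure_probabilities(p):
    """For three independent units each failing with probability p.

    Returns (P[at least one unit fails], P[majority is lost]).
    """
    any_failure = 1 - (1 - p) ** 3
    # Majority is lost when exactly two units fail or all three do.
    majority_lost = 3 * p ** 2 * (1 - p) + p ** 3
    return any_failure, majority_lost
```

With p = 0.01, for example, the chance that something fails rises to about 0.0297, but the chance of losing the majority drops to about 0.000298. The dissenting indices returned by vote are the units to take offline and repair.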

10,000 times faster in 20 years? (Score:3, Interesting)

by dutky ( 20510 ) writes: on Monday May 12, 2003 @09:34AM (#5935879) Homepage Journal

If this figure comes from the poster, fine, but if the authors of the article said this, then I don't see the need to read anything else by them.

20 years ago we had machines running at around 3-4 MHz (The Apple II was slower, the IBM PC faster). Today we can get machines running between 3-4 GHz, that's only a factor of 1000. If you count memory speeds, the increase is a lot lower (~300ns in 1983, down to ~60ns today: about a factor of 5).
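
The arithmetic in that paragraph, spelled out as a quick Python check. It uses only the round figures quoted in the post (clock rate and memory latency) and deliberately ignores everything else that affects performance.

```python
# Clock rate: about 3 MHz in 1983 versus about 3 GHz in 2003.
clock_ratio = 3_000_000_000 / 3_000_000

# Memory latency: about 300 ns in 1983 versus about 60 ns in 2003.
memory_ratio = 300 / 60

# How far the clock ratio alone falls short of the claimed 10,000x.
shortfall = 10_000 / clock_ratio

print(clock_ratio, memory_ratio, shortfall)  # prints 1000.0 5.0 10.0
```

On clock rate alone the claim is off by a factor of ten, which is the order of magnitude the post refers to.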

Other folk have posted about the questionable assertion that modern computers are harder to operate, but the fact that the simplest arithmetic calculation is off by an order of magnitude is at least as troubling as a questionable opinion or two.

- Re:10,000 times faster in 20 years? (Score:2)
  
  by demaria ( 122790 ) writes:
  
  If we were all running 4GHz MOS 6502s then you might have a point. But you're looking at just megahertz (a mistake in itself), forgetting that we've gone from 8 bit computing to 32 & 64 bit, and have had significant gains in processing capabilities (pipelines).
- Re:10,000 times faster in 20 years? (Score:3, Informative)
  
  by NearlyHeadless ( 110901 ) writes:
  
  If this figure comes from the poster, fine, but if the authors of the article said this, then I don't see the need to read anything else by them.
  
  20 years ago we had machines running at around 3-4 MHz (The Apple II was slower, the IBM PC faster). Today we can get machines running between 3-4 GHz, that's only a factor of 1000. If you count memory speeds, the increase is a lot lower (~300ns in 1983, down to ~60ns today: about a factor of 5).
  
  Because, as we all know, clock rate is all there is to performan
OSQ (Score:2)

by Hell O'World ( 88678 ) writes:

[aol.com]
Ahhhh! Undo! Undo!
Better than recovering from a crash... (Score:2)

by Glock27 ( 446276 ) writes:

Don't crash in the first place.
Many of these issues are best addressed at the hardware level, IMO. First of all, the software people don't have to worry about it then! ;-) For instance, look at RAID as a good example of reliable hardware (especially redundant RAIDS;). It is possible, using ECC memory and cache, and multiple CPUs, to be quite sure you're getting the correct results for a given calculation. You can also provide failover for continuous uptime.
Some of the rest of the article addressed issue
But operating them is much more complex? (Score:2, Insightful)

by fbg111 ( 529550 ) writes:

But operating them is much more complex.

I disagree. Feature for feature, modern computers are much more reliable and easy to use than their vacuum-tube, punch card, or even command-line predecessors. How many mom and pop technophobes do you think could hope to operate such a machine? Nowadays anybody can operate a computer, even my 85 year old grandmother who has never touched one until a few months ago. Don't mistake feature-overload for feature-complexity.
Oh yeah. (Score:3, Funny)

by schnitzi ( 243781 ) writes: on Monday May 12, 2003 @10:43AM (#5936352) Homepage

Our computers are probably 10,000 times faster than they were twenty years ago. But operating them is much more complex. You all have experienced a PC crash or the disappearance of a large Internet site.

Oh yeah. My TRS-80 used to NEVER crash twenty years ago when I accessed LARGE INTERNET SITES.

MORE bubblegum and spit vs. engineering (Score:2)

by alispguru ( 72689 ) writes:

Why are desktop computing systems fragile? Because in the marketplace, they are judged on exactly two criteria:
How big is the check I'm writing right now?
How fast is it?

With these as your evaluation function, you are guaranteed to get systems with little redundancy and little or no internal safety checks.

One regrettable example of this is the market for personal finance programs. The feature that sells Quicken is quick-fill - the heuristic automatic data entry that makes entering transactions fast. N
Nothing new. (Score:4, Insightful)

by pmz ( 462998 ) writes: on Monday May 12, 2003 @11:58AM (#5936937) Homepage

micro-rebooting; using better tools to pinpoint problems in multicomponent systems; build an "undo" function...

I think they just invented Lisp :). I don't program in Lisp, but have seen people who are very good at it. Quite impressive.

- - Re:No clue (Score:5, Informative)
    
    by Gordonjcp ( 186804 ) writes: on Monday May 12, 2003 @07:22AM (#5935397) Homepage
    
    Well, yeah. That's basically a watchdog timer. It's very common in embedded stuff, because it's cheap to implement - in fact, many microcontrollers have it built into the hardware. In microcontrollers they're very simple - a counter counts up (say) 1024 clock pulses, and if it rolls over then reset the CPU. In normal operation then every time round the main loop you'd write to a specified IO port to kick the watchdog once every millisecond or so - this resets the counter. It's crude but effective, and is very commonly used in things like ECUs for automotive electrickery - although the software is simple enough to be thoroughly tested (BMW 735i's aside) there's still dirty power and mechanically harsh environment to deal with. And your ABS ECU doesn't have <CTRL><ALT><DELETE>, does it?
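
A software simulation of the watchdog described above, as a hedged Python sketch. Real watchdogs are hardware counters; the class and function names here, and the hang model, are invented for illustration only.

```python
class WatchdogTimer:
    """Counts clock pulses and signals a reset if it is not kicked in time."""

    def __init__(self, timeout_ticks=1024):
        self.timeout_ticks = timeout_ticks
        self.counter = 0
        self.resets = 0

    def kick(self):
        """Called by healthy software from its main loop; clears the counter."""
        self.counter = 0

    def tick(self):
        """Advance one clock pulse; return True if the counter rolled over."""
        self.counter += 1
        if self.counter >= self.timeout_ticks:
            self.counter = 0
            self.resets += 1
            return True
        return False


def simulate(total_ticks, kick_every, hang_at=None):
    """Run a main loop that kicks the watchdog, optionally hanging at tick hang_at.

    Returns how many times the watchdog reset the (simulated) CPU.
    """
    watchdog = WatchdogTimer()
    hung = False
    for t in range(total_ticks):
        if hang_at is not None and t == hang_at:
            hung = True  # software stops kicking from here on
        if not hung and t % kick_every == 0:
            watchdog.kick()
        if watchdog.tick():
            hung = False  # the reset restarts the main loop
    return watchdog.resets
```

In this model a healthy loop never trips the timer, while a single hang costs one reset roughly 1024 ticks after the last kick, after which normal kicking resumes.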
    
- Not Just In DataBases (Score:2)
  
  by the-dude-man ( 629634 ) writes:
  
  In databases, you have your actions and when a sequence of events starts, they are committed at the end of the event cycle. When you change things, there is a sequence of events that lead to a "stable" state. When the stable state has arrived, you commit.
  
  This is actually exactly what iptables does...there is even a commit command at the end of every ruleset after all exceptional circumstances have been handled
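
A minimal sketch of the commit-at-the-end idea, using Python's built-in sqlite3 module rather than iptables itself. The `rules` table, the `apply_rules` and `count_rules` helpers and the ACCEPT/DROP check are assumptions made up for the example; the only thing borrowed from the comment above is the pattern of committing once a stable state is reached and undoing everything otherwise.

```python
import sqlite3


def apply_rules(conn, rules):
    """Insert all rules as one transaction; nothing is kept unless every step works."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            for chain, action in rules:
                if action not in ("ACCEPT", "DROP"):
                    raise ValueError(f"bad action: {action}")
                conn.execute(
                    "INSERT INTO rules (chain, action) VALUES (?, ?)",
                    (chain, action),
                )
        return True
    except ValueError:
        return False  # the partial batch has already been rolled back


def count_rules(conn):
    """Return how many rules are currently committed or pending on this connection."""
    return conn.execute("SELECT COUNT(*) FROM rules").fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rules (chain TEXT, action TEXT)")
conn.commit()
```

Calling `apply_rules(conn, [("INPUT", "ACCEPT"), ("FORWARD", "DROP")])` commits both rows. A later batch containing a bad action leaves the table exactly as it was, including the valid rows that came before the bad one in that batch.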

