Self-Repairing Computers 224

Posted by Hemos on Monday May 12, 2003 @06:52AM from the repairing-the-box dept.

Roland Piquepaille writes "Our computers are probably 10,000 times faster than they were twenty years ago. But operating them is much more complex. You all have experienced a PC crash or the disappearance of a large Internet site. What to do to improve the situation? This Scientific American article describes a new method called recovery-oriented computing (ROC). ROC is based on four principles: speedy recovery by using what these researchers call micro-rebooting; using better tools to pinpoint problems in multicomponent systems; build an "undo" function (similar to those in word-processing programs) for large computing systems; and injecting test errors to better evaluate systems and train operators. Check this column for more details or read the long and dense original article if you want to know more."

Self-Repairing Computers

This discussion has been archived. No new comments can be posted.

Search 224 Comments Log In/Create an Account

Comments Filter:

/etc/rc.d ? (Score:4, Interesting)

by graveyhead ( 210996 ) writes: <`fletch' `at' `fletchtronics.net'> on Monday May 12, 2003 @07:01AM (#5935345)

Frequently, only one of these modules may be encountering trouble, but when a user reboots a computer, all the software it is running stops immediately. If each of its separate subcomponents could be restarted independently, however, one might never need to reboot the entire collection. Then, if a glitch has affected only a few parts of the system, restarting just those isolated elements might solve the problem.

OK, how is this different from the scripts in /etc/rc.d that can start, stop, or restart all my system services? Any daemon process needs this feature, right? It doesn't help if the machine has locked up entirely.

Maybe I just don't understand this part. The other points all seem very sensible.

Re:Managerspeak (Score:5, Interesting)

by gilesjuk ( 604902 ) writes: <giles@jones.zen@co@uk> on Monday May 12, 2003 @07:02AM (#5935347)

Not to mention that the ROC system itself will need to be rock solid. It's no good to have a recovery system that needs to recover itself, which would then recover itself and so on :)

!RTFA, but (Score:3, Interesting)

by the_real_tigga ( 568488 ) writes: <nephros@@@users...sourceforge...net> on Monday May 12, 2003 @07:08AM (#5935363) Journal

I wonder if this [osdl.org] [PDF!] cool new feature will help there.

Sounds a lot like "micro-rebooting" to me...

I already do this with Linux... (Score:3, Interesting)

by jkrise ( 535370 ) writes: on Monday May 12, 2003 @07:18AM (#5935391) Journal

Here's the strategy:
1. Every system will have a spare 2GB filesystem partition, where I copy all the files of the 'root' filesystem, after successful instln., drivers, personalised settings, blah blah.
2. Every day, during shutdown, users are prompted to 'copy' changed files to this 'backup OS partition'. A script handles this - only changed files are updated.
3. After the 1st instln. a copy of the installed version is put onto a CD.
4. On a server with 4*120GB IDE disks, I've got "data" (home dirs) of about 200 systems in the network - updated once a quarter.

Now, for self-repairing:
1. If user messes up with settings, kernel etc., boot tomsrtbt, run a script to recopy changed files back to root filesystem -> restart. (20 mins)
2. If disk drive crashes, install from CD of step 3, and restore data from server.(40 mins)

Foolproof system, so far - and yes, lots of foolish users around.

I used systems like this (Score:5, Interesting)

by Mark Hood ( 1630 ) writes: on Monday May 12, 2003 @07:24AM (#5935401) Homepage

they were large telecomms phone switches.

When I left the company in question, they had recently introduced a 'micro-reboot' feature that allowed you to only clear the registers for one call - previously you had to drop all the calls to solve a hung channel or if you hit a software error.

The system could do this for phone calls, commands entered on the command line, even backups could be halted and started without affecting anything else.

Yes, it requires extensive development, but you can do it incrementally - we had thousadnds of software 'blocks' which had this functionality added to them whenever they were opened for other reasons, we never added this feature unless we were already making major changes.

Patches could be introduced to the running system, and falling back was simplicity itself - the same went for configuration changes.

This stuff is not new in the telecomms field, where 'five nines' uptime is the bare minimum. Now the telco's are trying to save money, they're looking at commodity PCs & open standard solutions, and shuddering - you need to reboot everything to fix a minor issue? Ugh!

As for introducing errors to test stability, I did this, and I can vouch for it's effects. I made a few patches that randomly caused 'real world' type errors (call dropped, congestion on routes, no free devices) and let it run for a weekend as an automated caller tried to make calls. When I came in on Monday I'd caused 2,000 failures which boiled down to 38 unique faults. The system had not rebooted once, so only those 2,000 calls had even noticed a problem. Once the software went live, the customer spotted 2 faults in the first month, where previously they'd found 30... So I swear by 'negative testing'.

Nice to see the 'PC' world finally catching up :)

If people want more info, then write to me.

Mark

Re:it will not work now (Score:2, Interesting)

by torpor ( 458 ) writes: <ibisum&gmail,com> on Monday May 12, 2003 @07:27AM (#5935407) Homepage Journal

So what are some of the other paradigms which might be proferred instead of von Neumann?

My take is that for as long as CPU design is instruction-oriented instead of time-oriented, we won't be able to have truly trusty 'self-repairable' computing.

Give every single datatype in the system its own tightly-coupled timestamp as part of its inherent existence, and then we might be getting somewhere ... the biggest problems with existing architectures for self-repair are in the area of keeping track of one thing: time.

Make time a fundamental to the system, not just an abstract datatype among all other datatypes, and we might see some interesting changes...

Re:/etc/rc.d ? (Score:4, Interesting)

by Mark Hood ( 1630 ) writes: on Monday May 12, 2003 @07:30AM (#5935414) Homepage

It's different (in my view) in that you can go even lower than that... Imagine you're running a webserver, and you get 1000 hits a minute (say).

Now say that someone manages to hang a session, because of a software problem. Eventually the same bug will hang another one, and another until you run out of resources.

Just being able to stop the web server & restart to clear it is fine, but it is still total downtime, even if you don't need to reboot the PC.

Imagine you could restart the troublesome session and not affect the other 999 hits that minute... That's what this is about.

Alternatively, making a config change that requires a reboot is daft - why not apply it for all new sessions from now on? If you get to a point where people are still logged in after (say) 5 minutes you could terminate or restart their sessions, perhaps keeping the data that's not changed...

rc.d files are a good start, but this is about going further.

The Hurd (Score:4, Interesting)

by rf0 ( 159958 ) writes: <rghf@fsck.me.uk> on Monday May 12, 2003 @07:43AM (#5935448) Homepage

Wouldn't some sort of software solution be the Hurd (if/when it becomes ready) in that as each system is a micro-kernel you just restart that bit of the operating system. As said in another post this is like /etc/rc.d but at a lower level.

Or you could just have some sort of failover setup.

Rus

Self-diagnostics (Score:5, Interesting)

by 6hill ( 535468 ) writes: on Monday May 12, 2003 @07:45AM (#5935459)

I've done some work on high availability computing (incl. my Master's thesis) and one of the more interesting problems is the one you described here -- true metaphysics. The question as it is usually posed goes, How does one self-diagnose? Can a computer program distinguish between a malfunctioning software or malfunctioning software monitoring software -- is the problem in the running program or in the actual diagnostic software? How do you run diagnostics on diagnostics running diagnostics on diagnostics... ugh :).
My particular system of research finally wound up relying on the Windows method: if uncertain, erase and reboot. It didn't have to be 99.999% available, after all. There are other ways with which to solve this in distributed/clustered computing, such as voting: servers in the cluster vote for each other's sanity (i.e. determine if the messages sent by one computer make sense to at least two others). However, even not this system is rock solid (what if two computers happen to malfunction in the same manner simultaneously? what if the malfunction is contagious? or widespread in the cluster?).
So, self-correcting is an intriguing question, to say the least. I'll be keenly following what the ROC fellas come up with.

Some of this isn't entirely new... (Score:5, Interesting)

by Mendenhall ( 32321 ) writes: on Monday May 12, 2003 @08:19AM (#5935564)

As one component of my regular job (I am a physicist), I develop control systems for large scientific equipment, and have been doing so for about 25 years. One of the cornerstones of this work has been high-reliability operation and fault tolerance.

One of the primary tricks I have used has always been mockup testing of software and hardware with an emulated machine. In a data acquisition/control system, I can generate _lots_ of errors and fault conditions, most of which would never be seen in real life. This way, I can not only test the code for tolerance, and do so repeatedly, I can thoroughly check the error-recovery code to make sure it doesn't introduce any errors itself.

This is really the software equivalent to teaching an airline pilot to fly on a simulator. A pilot who only trains in real planes only gets one fatal crash, (obviously), and so never really learns how to recover from worst-case scenarios. In a simulator, one can repeat 'fatal' crashes until they aren't fatal any more. My software has been through quite the same experience, and it is surprising the types of errors one can avoid this way.

Really, the main problem with building an already highly reliable system, using very good hardware, etc., is that you must do this kind of testing, since failures will start out very rare, and so unless one intentionally creates faults, the ability to recover from them is not verified. Especially in asynchronous systems, one must test each fault many times, and in combinations of multiple faults to find out how hard it is to really break a system, and this won't happen without emulating the error conditions.

Re:it will not work now (Score:2, Interesting)

by KingRamsis ( 595828 ) writes: <kingramsis.gmail@com> on Monday May 12, 2003 @08:22AM (#5935572)

well the man who answers this question will certainly become the von Neumann of the century, you need to do some serious out of the box thinking, first you throw away the concept of the digital computer as you know it, personally I think there will be a split in computer science, there will be generally two computer types the "classical" von Neumann and a new and different type of computer, the classical computer will be useful as a controller of some sort for the newer one, it is difficult to come up with the working principle of that computer, let me elaborate it is like a missing piece of the puzzle you know how it looks like but you are not certain what exactly will be printed on it, but I can summarize it is features:
1. It must be data oriented with no concept of instructions (just routing information), data flows in the system and transformed in a non-linear way, and the output will be all possible computations doable by the transformations.
2. It must be based on a fully interconnected grid of very simple processing elements.
3. The performance of said computer will be measured in terms of bandwidth not the usual MIPS. As you can see you will need a classical type computer to operate the described computer above so it will not totally replace it.
I believe that we should look into nature more closely, we stole the design of the plane straight from birds wings, and the helicopter from the dragonfly, and there are a lot that was inspired to us by mother nature, one of the relevant examples that always fascinated me was the fly brain, each eye is a processor on its own, the works independently conveying information to a more concise layer and so on, even human vision is based on similar concept of retina cells, there is no "pixel" concept, each layer that process vision emphasize on one concept of vision like texture, color, outline, shadowing, movement...etc ..Etc Finally well such a computer be useful? can we just write a plain spread sheet on it and send it by email to someone and then resume our saved DOOM game?
well it is possible but we need also to redefine what we can do with a computer because the classical von Neumann computer that we are stuck with for the last half a century certainly limited our imagination on what can be done with a computer.

Nope. Memory (Score:5, Interesting)

by awol ( 98751 ) writes: on Monday May 12, 2003 @08:27AM (#5935600) Journal

The problem here is that whilst it is true that _certain_ aspects of computational power has increased "probably 10,000 times" others have not. In order to really make stuff work like this, with an undo, because that is the critical bit since redundant hardware already exists, Non-Stop from HP (nee Himalaya) for example.

Where I work we implemented at least one stack based undo functionality and it worked really nicely, we trapped sigsevs etc and just popped the appropriate state back into the places that were touched in the event of an error. We wrote a magical "for loop" construct that broke out after N iterations reagrdless of the other constraints. The software that resulted from this was uncrashable. I mean that relatively seriously, you could not crash the thing. you could very seriously screw data up through bugs, but the beast would just keep on ticking.

I had a discussion with a friend of mine more than a decade ago that eventually all these extra MHz that were coming would eventually be overkill. His argument was that, no, more of them will be consumed, in the background making good stuff happen. He was thinking about things like voice recognition, handwriting recognition, predictive work etc etc. I agree with his point. If you have a surfeit of CPU then use it to do cool things (not wasting it on eycandy necessarily) to make the things easier to use. Indeed we see some of that stuff now, not enough, but some.

Self-Reparing is an excellent candidate and with so much CPU juice lying around in your average machine, it must be workable. I mean think about the computers used for industrial plant. Most of them could be emulated faster in a P4 than they currently run. So emulate N and check the results against each other, if one breaks just emulate a new one and turf the old one. Nice.

But here's the rub. Memory, we have nowhere near decreased the memory latency by the same amount we have boosted the processing power (and as for IO, sheesh!), as a result, undo is very expensive to do generically, i mean it at least halves the amount of bandwidth since it is [read/write write] for each write not to mention the administrative overhead and we just haven't got that much spare capacity in memory latency left. Indeed, just after the ten year old discussion, I had to go and enhance some software to get past the HPUX9 800MB single shared memory segment limit and the demand is only just being outstripped by the affordable supply of memory, we do not yet have the orders of magnitude of performance to make the self correcting model work ina generic sense.

I think this idea will come, but it will not come until we have an order of magnitude more capacity in all the areas of the system. Until then we will see very successful but limited solutions like the one we implemented.

Re:Managerspeak (Score:5, Interesting)

by sjames ( 1099 ) writes: on Monday May 12, 2003 @08:44AM (#5935659) Homepage Journal

There are allready steps in place towards recoverability in currently running system. That's what filesystem journaling is all about. Journaling doesn't do anything that fsck can't do EXCEPT that replaying the journal is much faster. Vi recovery files are another example. As the article pointed out, 'undo' in any app is an example.

Life critical systems are often actually two seperate programs, 'old reliable' which is primarily designed not to allow a dangerous ondition, and the 'latest and greatest' which has optimal performance as it's primary goal. Should 'old reliable' detect that 'latest and greatest' is about to do something dangerous, it will take over and possibly reboot 'latest and greatest'.

Transaction based systems feature rollback, volume managers support snapshot, and libraries exist to support application checkpointing. EROS [eros-os.org] is an operating system based on transactions and persistant state. It's designed to support this sort of reliability.

HA clustering and server farms are another similar approach. In that case, they allow individual transactions to fail and individual machines to crash, but overall remain available.

Apache has used a simple form of this for years. Each server process has a maximum service count associated with it. It will serve that many requests, then be killed and a new process spawned. The purpose is to minimize the consequences of unfixed memory leaks.

Many server daemons support a reload method where they re-read their config files without doing a complete restart. Smart admins make a backup copy of the config files to roll back to should their changes cause a system failure.

Also as the article points out, design for testing (DFT) has been around in hardware for a while as well. That's what JTAG is for. JTAG itself will be more useful once reasonably priced tools become available. Newer motherboards have JTAG ports built in. They are intended for monitor boards, but can be used for debugging as well (IMHO, they would be MORE useful for debugging than for monitoring, but that's another post!). Built in watchdog timers are becoming more common as well. ECC RAM is now manditory on many server boards.

It WILL take a lot of work. It IS being done NOW in a stepwise manner. IF/when healthy competition in software is restored, we will see even more of this. When it comes down to it, nobody likes to lose work or time and software that prevents that will be preferred to that which doesn't.

Re:Some of this isn't entirely new... (Score:3, Interesting)

by redragon ( 161901 ) writes: <codonnell@@@mac...com> on Monday May 12, 2003 @08:53AM (#5935699) Homepage

A lot of simulators have the ability to roll-back. Seriously, if you're running a simulation that can take days/weeks/etc, do you really want a crash to bring the system completely down?

Heck no.

Many of these systems will save results in such a fashion that if the system does go down, when the faulty component is found and fixed, the system can be brought back up to it's state just prior to the crash.

10,000 times faster in 20 years? (Score:3, Interesting)

by dutky ( 20510 ) writes: on Monday May 12, 2003 @09:34AM (#5935879) Homepage Journal

If this figure comes from the poster, fine, but if the authors of the article said this, then I don't see the need to read anything else by them.

20 years ago we had machines running at around 3-4 MHz (The Apple II was slower, the IBM PC faster). Today we can get machines running between 3-4 GHz, that's only a factor of 1000. If you count memory speeds, the increase is a lot lower (~300ns in 1983, down to ~60ns today: about a factor of 5).

Other folk have posted about the questionable assertion that modern computers are harder to operate, but the fact that the simplest arithmetic calculation is off by an order of magnitude is at least as troubling as a questionable opinion or two.

Re:/etc/rc.d ? (Score:4, Interesting)

by Surak ( 18578 ) * writes: <surak@mailblocks. c o m> on Monday May 12, 2003 @10:09AM (#5936124) Homepage Journal

Yes. I'm typing this on last night's build of Mozilla Firebird running under Windows NT 4.0. Sure you can stop and start the workstation and/or server services. Ever done it? How stable is NT after that?

I can tell you that on *nix restarting the Samba daemon happens seamlessly.

Re:Ah, youth... (Score:3, Interesting)

by Idarubicin ( 579475 ) writes: on Monday May 12, 2003 @10:40AM (#5936330) Journal

I mean, you can actually have your *mother* operate a computer today.
Do we have to keep using this tired old notion of little old (middle-aged, for the /. crowd) ladies cringing in terror when faced with a computer?
My mother has a B.Math in CS, acquired more than a quarter century ago. Her father is pushing eighty, and he upgrades his computer more often than I do. When he's not busy golfing, he's scanning photographs for digital retouching. (In his age bracket, a man who can remove double chins and smooth wrinkles is very popular.)
The notion that women and/or the elderly are unable to use computers is a generalization that just doesn't hold much water anymore. Maybe some of these people are frightened of (or frustrated with) computers because their exposure to technology is through the 'typical'* arrogant, smug, condescending /.er--concealing his embarrassment over being unable to get a girlfriend behind clouds of technobabble.
*How does it feel to be the target of an unfair stereotype?

Re:"Managerspeak"?! (Score:3, Interesting)

by fgodfrey ( 116175 ) writes: <fgodfrey@bigw.org> on Monday May 12, 2003 @12:41PM (#5937250) Homepage

No, it's not (well, debugging software is definetly good, but writing "self healing" code is important too). An operating system is an incredibly complex piece of software. At Cray and SGI a *very* large amount of testing goes on before release, but software still gets released with bugs. Even if you were, by some miracle, to get a perfect OS, hardware still breaks. In a large system, hardware breaks quite often. Having an OS that can recover from a software or hardware failure on a large system is essential to keeping the system running.

The software that I'm responsible for, in fact, is specifically designed to detect, report, and try to work around errors. We have code to detect a processor hang (through software or hardware failure) and remove it from the running OS image, etc. The Cray T3E (which I didn't work on) can warm-reboot an individual processor on either a software or hardware panic/hang and reintegrate it into the running OS.

There may be more comments in this discussion. Without JavaScript enabled, you might want to turn on Classic Discussion System in your preferences instead.

Self-Repairing Computers 224

Self-Repairing Computers More Login

Self-Repairing Computers

/etc/rc.d ? (Score:4, Interesting)

Re:Managerspeak (Score:5, Interesting)

!RTFA, but (Score:3, Interesting)

I already do this with Linux... (Score:3, Interesting)

I used systems like this (Score:5, Interesting)

Re:it will not work now (Score:2, Interesting)

Re:/etc/rc.d ? (Score:4, Interesting)

The Hurd (Score:4, Interesting)

Self-diagnostics (Score:5, Interesting)

Some of this isn't entirely new... (Score:5, Interesting)

Re:it will not work now (Score:2, Interesting)

Nope. Memory (Score:5, Interesting)

Re:Managerspeak (Score:5, Interesting)

Re:Some of this isn't entirely new... (Score:3, Interesting)

10,000 times faster in 20 years? (Score:3, Interesting)

Re:/etc/rc.d ? (Score:4, Interesting)

Re:Ah, youth... (Score:3, Interesting)

Re:"Managerspeak"?! (Score:3, Interesting)

Related Links Top of the: day, week, month.

Slashdot Top Deals

Slashdot