Software Update Shuts Down Nuclear Power Plant

Garabito writes "Hatch Nuclear Power Plant near Baxley, Georgia was forced into a 48-hour emergency shutdown when a computer on the plant's business network was rebooted after an engineer installed a software update. The Washington Post reports, 'The computer in question was used to monitor chemical and diagnostic data from one of the facility's primary control systems, and the software update was designed to synchronize data on both systems. According to a report filed with the Nuclear Regulatory Commission, when the updated computer rebooted, it reset the data on the control system, causing safety systems to errantly interpret the lack of data as a drop in water reservoirs that cool the plant's radioactive nuclear fuel rods. As a result, automated safety systems at the plant triggered a shutdown.' Personally, I don't think letting devices on a critical control system accept data values from the business network is a good idea."
  • by Anonymous Coward on Friday June 06, 2008 @08:03PM (#23689371)
    I'd rather it shut itself down than suffer a major failure.
  • Fail-Safe (Score:5, Insightful)

    by lobiusmoop ( 305328 ) on Friday June 06, 2008 @08:07PM (#23689405) Homepage
    Personally, I am reassured that these reactors are designed to shut down at the drop of a hat. This is not a situation where fuck-ups should be masked; any discontinuity, however minor, really needs to be highlighted and dealt with immediately.
  • Re:Lesson learned: (Score:3, Insightful)

    by Anonymous Coward on Friday June 06, 2008 @08:17PM (#23689497)
    However useful a tip that may be, it has nothing to do with this incident. You clearly never even made it to the article summary, let alone the actual article.

    "... when the updated computer rebooted, it reset the data on the control system, causing safety systems to errantly interpret the lack of data as a drop in water reservoirs that cool the plant's radioactive nuclear fuel rods. As a result, automated safety systems at the plant triggered a shutdown."

    From that snippet alone, it stands to reason that _any_ reboot of the computer would have caused this reset of the control system. Nor is this at all surprising; go reset any data collection system connected to controller software for any sort of industrial process and see if the controller doesn't receive spurious data.

    To me this is an example of the automated system doing its job. "Hark! I am a coolant reservoir monitor and I have reason to believe there may be a loss of coolant inventory. Time to trip the system."
  • by RiotingPacifist ( 1228016 ) on Friday June 06, 2008 @08:20PM (#23689515)
    The only safe way to update a system is with a reboot. Sure, you CAN do some stuff on Linux, BSD, etc. to avoid having to reboot (hell, this was probably running some Unix derivative, so it was probably possible to do the update without rebooting), but you wouldn't want to run the risk of introducing an unchecked bug by doing a live update. When your choices are:
    a) a high chance of accidentally shutting down a reactor harmlessly
    b) a small chance of fucking up a nuclear reactor
    you'll always go for (a), if you're sane.
  • by sharkey ( 16670 ) on Friday June 06, 2008 @08:21PM (#23689527)
    What, did they change the phone number in Dial-Up Networking?
  • This was Good (Score:4, Insightful)

    by snkline ( 542610 ) on Friday June 06, 2008 @08:21PM (#23689531)
    While perhaps the system should be designed to behave differently, what happened here was a good thing. When things went wrong, rather than the reactor systems freaking out and doing random crap, they were properly designed to shift to a known safe state (i.e. Shut the hell down).
  • Re:Fail-Safe (Score:5, Insightful)

    by snkline ( 542610 ) on Friday June 06, 2008 @08:23PM (#23689545)
    Umm, yes you do. If something in the system is shit, you don't want the reactor ON!
  • by Quadraginta ( 902985 ) on Friday June 06, 2008 @08:24PM (#23689549)
    Think about the cost associated with having and maintaining a completely hot-pluggable second control system. How much do you want your power bills to go up to pay for that? And what would be the point?

    They have a perfectly adequate safety system that did exactly what it's supposed to do. It read confusing data and decided to shut the reactor down until a human came along and explained things satisfactorily. What's wrong with that? Aside from having the reactor offline for 48 hours, there was no other cost.
  • Re:Fail-Safe (Score:4, Insightful)

    by NMerriam ( 15122 ) <NMerriam@artboy.org> on Friday June 06, 2008 @08:25PM (#23689569) Homepage

    Yeah, but you don't want the reactor shutting down because the computer system is shit. That is most definitely not reassuring to me.


    On the contrary, shutting down because the system is shit sounds like a much better option than continuing to run despite the shittiness of the computer monitoring everything.

    Of course, the ideal situation would be to have good computers that only get updated in scheduled, planned ways so that you don't have the issue at all. But shutting everything down when something is amiss is the only sensible response.
  • by BlueParrot ( 965239 ) on Friday June 06, 2008 @08:25PM (#23689581)
    The chemical diagnostic data is damn important because it may determine things like corrosion rates, the amount of impurities circulating in the water, the potential for clogs, etc. As with all other software, errors occasionally occur, and the appropriate way to respond when they do is to shut down and blow some whistles so as to ensure that the reactor is brought into a safe state before something else goes wrong. This is one of those cases where "better safe than sorry" is a really rather good motto.
  • Re::O (Score:5, Insightful)

    by Lurker2288 ( 995635 ) on Friday June 06, 2008 @08:40PM (#23689679)
    What exactly do you find frightening about an automatic safety system doing exactly what it's supposed to in response to unusual input?
  • by Anonymous Coward on Friday June 06, 2008 @08:41PM (#23689683)
    Correct. It is not the better choice. In the foreseeable future, it is the only choice.
  • Re:No! (Score:1, Insightful)

    by Anonymous Coward on Friday June 06, 2008 @08:43PM (#23689697)
    Wow, way to parrot the summary.
  • by Ungrounded Lightning ( 62228 ) on Friday June 06, 2008 @08:46PM (#23689725) Journal
    Secondly, the software update did not respect the data in the nuclear control system and synchronized it to the new initial data from the update on the other system! Not a good idea. In critical safety systems, you always practice an update before actually doing one.

    I have no problem with a computer on the process control subnet reporting information to a computer on the business subnet.

    I have a BIG problem with a computer on the business subnet being able to modify and corrupt data in a computer on the process control subnet.

    "I can't dump data to the business side" is a reason to make a log entry and maybe sound a minor alarm. It's not a reason to shut down the reactor (unless the data is needed for regulatory compliance and the process control side isn't able to buffer it until the business side is working correctly.)

    But if a business subnet computer can tamper with something as critical as a process control machine's idea of the level of coolant in a reservoir, it rings my "design flaw" alarms.

    Is it ONLY able to reset it to "empty" as a poorly-designed part of a communication restart sequence? Or could it also make the process control machine think the level was nominal when it WAS empty?

    IMHO this should be examined more closely. It may have exposed a dangerous flaw in the software design.

    Security flaws don't care whether they're exercised by mischance or malice. If nothing else, this is a way to DoS a nuclear plant through a break-in on the business side of the net.
  • by Drenaran ( 1073150 ) on Friday June 06, 2008 @08:48PM (#23689745)
    The problem here is that the system didn't shut down because it detected an error in the data collection system; instead, it incorrectly detected a problem that did not in fact exist and then proceeded to take action. While the engineer in me is fairly certain that the system is designed to always fail to a safe state (as in, any automatic emergency response couldn't accidentally make things worse, at least not without raising all sorts of alarms), it is still concerning that internal control systems can be so affected by external servers.

    In the article they mention that the system wasn't designed for security (since it was meant to be internal) - but this isn't a security issue at all! Any sort of system that relies upon other systems should be designed to assume failure can and will occur in other systems - that is not to say that it needs to verify/evaluate incoming data to make sure it is "good", but rather that it can tell the difference between receiving data (such as current water levels) and receiving no data at all (system failure). Once it has that it can ideally automatically switch to a backup system, or do what it did here and enter a fail-safe state (the difference being that it does so while pointing out the actual problem and not an incorrectly perceived problem in a different part of the system).
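
    A minimal sketch of that distinction in Python (names and thresholds are hypothetical, not the plant's actual logic):

        LOW_WATER_THRESHOLD_M = 4.0  # hypothetical trip point, metres

        def evaluate(reading):
            """reading is a water level in metres, or None if the feed died."""
            if reading is None:
                # Feed failure: still fail safe, but report the real cause.
                return ("SCRAM", "loss of instrumentation data")
            if reading < LOW_WATER_THRESHOLD_M:
                return ("SCRAM", "low water level")
            return ("RUN", "level nominal")

        # A reset feed triggers the same safe action as a real low-level event,
        # but operators see "loss of data" rather than a phantom coolant drop.
        print(evaluate(None))  # -> ('SCRAM', 'loss of instrumentation data')
        print(evaluate(6.2))   # -> ('RUN', 'level nominal')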
  • by Anonymous Coward on Friday June 06, 2008 @08:50PM (#23689759)
    And a shutdown, while inconvenient, is not a catastrophe. In fact, it speaks well for the plant's safety that it did automatically shut down when faced with bad data.
  • by Anonymous Coward on Friday June 06, 2008 @09:00PM (#23689823)
    Before there are too many retarded "OMG why was it on the business network!!!?LOL!??!" comments, I'll cover that right here:

    It says the software is supposed to sync data between the control system and the business network. Obviously it has to be connected to both sides somehow. I'm not a power plant designer, but there's probably a good reason why people might need access to that data from the control system, and thus some kind of system acting as a safe bridge between the two rather than allowing unrestricted access from the business network.

    The update f'd up and the control network went "Holy crap where did the cooling water go? Abort!" Everything worked like it was supposed to. The failure was caused by not testing the update in a lab environment before applying it to a live system.
  • by dindi ( 78034 ) on Friday June 06, 2008 @09:00PM (#23689825)
    At least it did not turn into a meltdown, so the safety features in the software worked.

    That is definitely a glass half full, as opposed to half empty.

  • by Anonymous Coward on Friday June 06, 2008 @09:10PM (#23689901)
    Agreed. That was good software design to assume a worst-case scenario when the sensors stopped sending in data. The alternative (sending pager alerts or something) would be far worse.
  • by dbIII ( 701233 ) on Friday June 06, 2008 @09:32PM (#23690001)
    While that may be true, the first full-scale prototypes of pebble-bed reactors are yet to go online; however, construction of several in China is at an advanced stage. As Superphoenix showed with fast breeders, you really need a full-scale prototype to identify all of the problems (it was economic issues that killed fast breeders, not safety issues).

    India's accelerated thorium idea is also very promising.

    The major problem I see with US nuclear power is the assumption that it is a solved problem; almost nothing has been spent on R&D for decades. The "new generation" of reactors from Westinghouse and others is little more than 1960s white elephants painted green.

  • Re::O (Score:5, Insightful)

    by afidel ( 530433 ) on Friday June 06, 2008 @09:34PM (#23690015)
    I have quite a few Windows 2003 servers that haven't been rebooted since August 2006, when we upgraded our computer room to a small datacenter (we went from a single bus line and a constantly breaking AC unit to dual UPSes powered by separate generators and dual chillers with separate condensers). It's not like it's impossible to get good uptimes on Windows; the only servers we reboot on a regular basis are our Citrix servers, due to some bad code on Citrix's part that leaks memory over time, and our Oracle server, due to a bug where 10gR2 pulls time from the deprecated ticks counter (the same one that used to crash Windows 9x), which rolls over after ~49.7 days. Both of those are the result of poor third-party coding, not bugs in Windows.
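
    For what it's worth, the rollover arithmetic is easy to check (assuming the usual unsigned 32-bit millisecond tick counter):

        # An unsigned 32-bit counter of milliseconds wraps after 2**32 ms.
        wrap_days = 2 ** 32 / (1000.0 * 60 * 60 * 24)
        print(round(wrap_days, 1))  # -> 49.7 days between rollovers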
  • by zappepcs ( 820751 ) on Friday June 06, 2008 @09:36PM (#23690033) Journal
    I'd go just a bit further and say that it speaks well for the software coders. There are at least three ways to treat any 'out of bounds' condition. They chose to make sure that the safe action was chosen.

    That is an area where loosely controlled teamwork gets into trouble unless all coders treat data passed to and from their code in the same uniform, functional way.

    It also makes me wonder how the code will react to certain malicious software, should it get loose in the facility. If I were writing code to destroy a nuclear facility, the way data is passed from one process to another is exactly what I would attack, along with other vulnerable places.

    It is sort of reassuring to have seen a failure result in a controlled shutdown rather than some other, more undesirable action.
  • by Phantom of the Opera ( 1867 ) on Friday June 06, 2008 @09:41PM (#23690067) Homepage
    Scenario : System comes up. Things don't work quite right. Some configurations are tweaked and system is now working fine.

    Reboot. The tweaked configurations happen to go away. No one remembers which ones they were. The system is b0rked for a while.

    I would hope that isn't the case for that system, but I have seen it happen before.
  • by Wo1ke ( 1218100 ) on Friday June 06, 2008 @09:50PM (#23690117)
    Yeah, so when a sensor breaks and stops sending in data, it'll keep running like usual, with maybe a small error code in the background. Cause, you know, that's how we want nuclear fucking powerplants to work.
  • by Anonymous Coward on Friday June 06, 2008 @10:10PM (#23690235)
    Subnet? It should not even have been on the same physical network!
  • by ppanon ( 16583 ) on Friday June 06, 2008 @10:14PM (#23690251) Homepage Journal
    Yeah, sure. When somebody screws up an expression in a way that makes no sense, we should just accept it. In addition, since people on Slashdot constantly misuse pairs of homonyms like then/than, effect/affect, their/they're, we should just ignore historical usage differences and use them interchangeably. We should just accept sloppiness and mediocrity because that's how Western civilization was built.

    Then again, maybe intelligent and well-educated people will just ignore people who aren't intelligent enough or who can't be bothered to learn how to properly communicate. The medium is the message, and a badly-formed message says to the recipient either "I don't care enough about talking to you to take the time to say it properly" or "the content of this message can't be that great if I'm not educated enough to express it well".

    I don't get out of the way for subgeniuses.

  • by ZombieEngineer ( 738752 ) on Friday June 06, 2008 @10:50PM (#23690451)
    Something is not right here...

    Yes, the safety system kicking in is "a good thing".

    Pulling data from another computer system for a safety related control system is not a bright idea (the weakest link problem).

    Historically, for a safety control system in an Oil & Gas environment, all the inputs to the safety system are either hardwired or pulled from another safety-system controller with the appropriate level of redundancy (CPU boards and communication paths with communication watchdog timers).

    Even transmitters in some circumstances cannot be trusted, hence the 2-out-of-3 voting systems: take three transmitters measuring the same value and pick the middle of the three, so that if one transmitter fails high or low your choice will still be the safe option.
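
    A minimal sketch of that 2-out-of-3 vote in Python (the readings are made up):

        def vote_2oo3(a, b, c):
            """Median of three redundant transmitter readings: one transmitter
            failing high or low cannot move the result."""
            return sorted([a, b, c])[1]

        print(vote_2oo3(5.1, 5.0, 99.9))  # failed-high transmitter -> 5.1
        print(vote_2oo3(0.0, 5.0, 5.2))   # failed-low transmitter  -> 5.0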

    Someone needs a serious think about where this plant is getting data for its safety shutdown system.

    ZombieEngineer
  • by spikedvodka ( 188722 ) on Friday June 06, 2008 @11:06PM (#23690537)
    It's not a nuclear power plant, but still, my network...

    I've set Nagios up to monitor my network, and any loss of signal is considered CRITICAL; not just a warning, but critical... and I need to know right then.
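
    Roughly like this, as a hedged sample Nagios service definition (the host name and ping thresholds are placeholders for illustration):

        ; A lost signal escalates straight to a hard CRITICAL state (no retries).
        define service {
            use                   generic-service
            host_name             plant-gateway    ; placeholder host
            service_description   PING
            check_command         check_ping!100.0,20%!500.0,60%
            max_check_attempts    1
            }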
  • by Dachannien ( 617929 ) on Friday June 06, 2008 @11:31PM (#23690665)

    instead it incorrectly detected a problem that did not in fact exist
    This might be splitting hairs, but I'd say it correctly detected a data inconsistency and responded appropriately. There could be a dangerous condition that is indistinguishable to the failsafe system from what actually happened - and it could be a condition that nobody's ever thought of before. It's far better to trigger the failsafe when a data inconsistency has occurred than to make a potentially incorrect automated judgment concerning the cause of the inconsistency leading to a more severe problem down the road.
  • Re::O (Score:1, Insightful)

    by Anonymous Coward on Friday June 06, 2008 @11:46PM (#23690731)

    What exactly do you find frightening about an automatic safety system doing exactly what it's supposed to in response to unusual input?
    The words "nuclear reactor" scare many people, like "monster in closet" scares many kids.

  • by courseofhumanevents ( 1168415 ) on Friday June 06, 2008 @11:50PM (#23690743)
    That's the thing, though; it's no more a misuse of a phrase than saying "kick the bucket" after literally kicking a bucket would be. Its definition can easily be arrived at by examining the words it uses and their contexts, making it much less likely to confuse a non-native speaker than many other expressions in wide use. The main source of confusion would be if someone tried to make it out to be an invalid phrase.

    One of the great things about English is that one can phrase something a million different ways and still get the same meaning; banning the use of one phrase because it happens to also be the name of a logical fallacy is silly and pointless.
  • Re::O (Score:4, Insightful)

    by sbjornda ( 199447 ) <sbjornda@noSpaM.hotmail.com> on Saturday June 07, 2008 @12:17AM (#23690895)
    Patching for patching's sake is an IT fetish

    Well, the auditors seem to expect it... as do the vendors when we call for support - "Oh, you say foobar isn't working... well it looks like you're 15 revisions behind; why don't you just fix that and call me when you're done. Oh, your policies state you need to test and certify them? Well I guess I won't be hearing from you for a while, then."

    --
    .nosig

  • Re:Fail-Safe (Score:3, Insightful)

    by Skrapion ( 955066 ) <skorpionNO@SPAMfirefang.com> on Saturday June 07, 2008 @01:07AM (#23691155) Homepage

    That's my point. I don't want a reactor with ANY flaws, no matter how safe its default shutdown thresholds are.
    And I'd like to be king of all Londinium and wear a shiny hat.

    Systems without flaws will never exist, so we need to design systems that do reasonable things when they encounter flaws.

    In this case, the flaw wasn't even caused by the machines, but instead was directly caused by the "fleshy" parts of the system, and the machines still managed to handle the problem safely.
  • by kilroy0097 ( 924790 ) on Saturday June 07, 2008 @01:16AM (#23691217)
    It doesn't really matter in this case whether the operations system is looking at plant data from a minor monitoring system. What is troubling here is that it's completely reliant upon this minor monitoring system. If this box someplace is so important as to cause an emergency shutdown in a nuclear power plant, then one would think there would be a backup system that kicks in when the primary monitoring system goes down. Did they think this box would never have a hardware failure? That it would last forever as some kind of cosmic perpetual motion machine? I am very worried that operations management systems like this even get implemented in high-security and important locations such as a nuclear power plant. Looks like it's time to hire a better and more intelligent Information Systems and Network Manager.
  • by GigaplexNZ ( 1233886 ) on Saturday June 07, 2008 @01:20AM (#23691235)

    It did it the way it was programmed to do it.
    Based on the information provided in the article, it was programmed to shut down due to lack of water. What actually happened was an accidental data reset. A separate fail-safe mechanism should have detected the missing critical data. Instead, it

    errantly interpret the lack of data as a drop in water reservoirs
    - I would rather it correctly, as opposed to errantly, detect unsafe conditions. The plant should have shut down as it did, but it sounds a bit like chance that it actually did.
  • by GigaplexNZ ( 1233886 ) on Saturday June 07, 2008 @01:23AM (#23691249)
    I understand exactly what fail safe means. I agree that no data = very very very bad data. I agree that it should have gone into the safest possible mode. I don't agree that the "low water level" detection is the correct mechanism to determine the "no data = very very very bad data" condition. I'm suggesting that based on the information quoted in my original post,

    safety systems to errantly interpret the lack of data as a drop in water reservoirs
    does not necessarily sound like good planning but sounded more like chance that some erroneous interpretation picked up on the invalid state. It may have detected the "no data = very very very bad data" case and shut down for that reason, but that's not what the article is suggesting. Other users hinting that I am a moron for thinking that the plant shouldn't have shut down have misinterpreted what I was trying to get across.
  • by barius ( 1224526 ) on Saturday June 07, 2008 @01:42AM (#23691333)

    I think you're missing the real point, which is that the central safety systems are being fed data from a 'business network'. What would happen if that computer had an issue that caused it to send the same data continuously even when the coolant level had really dropped? WHY are any safety systems receiving data from an insecure network?

    It's bad enough that most reactors use regular PCs to do the data collection and reporting, given the security risks posed by such systems (especially if networked), but I never realized they would be so stupid as to feed data in the other direction like this!

  • Re:Fail-Safe (Score:3, Insightful)

    by distantbody ( 852269 ) on Saturday June 07, 2008 @02:31AM (#23691515) Journal
    The problem isn't that it shut down-- that's fine; the problem is that a software update for a nuclear power plant was actually allowed to produce an unexpected/unplanned event!
  • by VENONA ( 902751 ) on Saturday June 07, 2008 @03:20AM (#23691697)
    "It would be even better if the control network had a web server"

    Probably not. Web servers are complex, and likely targets for attack. And the business people will end up doing endless cut and paste.

    A better solution would be to accumulate the data that the businesspeople need on a single system on the control LAN. That system rsyncs CSV files onto a system on the business LAN. No connections are initiated from the business LAN into the control LAN, and the data are more useful to MIS people on the business LAN.
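
    Run from cron on the control-LAN data host, that push could be as little as this Python sketch (the hostnames, paths, and key file are made up for illustration):

        #!/usr/bin/env python
        # Push accumulated CSV exports to a mirror host on the business LAN.
        # The control LAN initiates the connection; nothing connects inward.
        import subprocess

        SRC = "/var/ctrl-export/csv/"                    # hypothetical export dir
        DEST = "busmirror.example.com:/srv/plantdata/"   # business-LAN mirror

        subprocess.check_call([
            "rsync", "-az", "--delete",
            "-e", "ssh -i /etc/ctrl-export/push_key",    # dedicated restricted key
            SRC, DEST,
        ])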
  • by VENONA ( 902751 ) on Saturday June 07, 2008 @12:21PM (#23693911)
    Simplicity is better than complexity if you're really after security. You could write a small Web server that did nothing more than respond to HTTP requests and that was provably secure; it's been done. But it's also one more piece of software that has to be maintained. Or use a large Web server, such as Apache. It's been a long while since there was a remote exploit against Apache when it was simply serving static pages. A DoS attack might still be possible, but that shouldn't accomplish anything beyond revealing the attack, as long as the software on the control-LAN systems that update the data host doesn't become wedged if the data host becomes unresponsive. (Which you would still want to test for, BTW.)

    *However*, one of the more powerful ideas in configuring highly secure LANs is that the more-secure LAN is simply never allowed to accept connections from the less-secure LAN. It's also something that's really easy to firewall, your network becomes easier to audit, etc. If you're a security practitioner, it makes your life easier. You still have to worry about the sneaker-net, physical security, etc., but now you're more able to focus your resources on those areas. Once again, simplicity is better than complexity if you're really after security.
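
    On a Linux filter between the two LANs, that one-way policy is only a few lines (the interface names here are assumptions):

        # Control LAN may open connections outward; business LAN gets replies only.
        iptables -A FORWARD -i ctrl0 -o biz0 -j ACCEPT
        iptables -A FORWARD -i biz0 -o ctrl0 -m state --state ESTABLISHED,RELATED -j ACCEPT
        iptables -A FORWARD -i biz0 -o ctrl0 -j DROP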

    I don't know where you got the idea that I thought it was, "sooo impossile to have a perl script or whatever fetch the webpage and cut out the data you care about." It's easy. But pretty much nothing is as easy to extract data from as a CSV file, which you could process with nothing more than awk. That doesn't get you far with automating report generation, populating a database, or whatever else you intend to *do* with the data, but there are endless tools for those jobs--Perl included.

    Also, in my experience, people want to mess with Web pages. They're more visual, and people tend to want to 'improve' them, meaning your Perl screen-scraper likely has to change as well. I see a lot less clamor for changing the data format in CSV files.

    In the end, use what you need--XML, for all I care. Just *don't allow your less-secure LAN to initiate connections into your more-secure LAN*. That was the root cause of the failure described in TFA. It's one of many reasons the rule is so basic, though obviously not yet widely-enough followed. Ideally, hosts on a secure LAN communicate with *nothing* outside that LAN. You justify and document every[1] step away from that ideal, if for no other reason than that it plays hell with formal trust models, which can be important inputs into designing a thorough audit. I don't see how you justify accepting incoming traffic when there's an easy way to avoid it. In an audit, I'd be busting you for that Web server. Simple as that.

    An approach like the one above is likely to make life easier for several internal groups, including office staff. And quite possibly the ultimate users--power consumers.

    [1] I mean every, not most. For example, how do you handle time? I favor an NTP server on the secure LAN taking time inputs from the GPS cloud. I've never worked for an organization that had a spare atomic clock lying around, or I'd have used that, and eliminated one more external data flow.
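
    On a classic ntpd host, that GPS feed is about a two-line refclock stanza; treat this as a sketch, since the unit number and calibration are per-installation:

        # /etc/ntp.conf on the secure-LAN time server: NMEA GPS refclock (driver 20)
        server 127.127.20.0 prefer
        fudge  127.127.20.0 time1 0.0   # serial-delay calibration, site-specific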
