Software Update Shuts Down Nuclear Power Plant

Posted by Soulskill on Friday June 06, 2008 @07:58PM from the we-have-safety-systems-because-we-are-very-stupid dept.

Garabito writes "Hatch Nuclear Power Plant near Baxley, Georgia was forced into a 48-hour emergency shutdown when a computer on the plant's business network was rebooted after an engineer installed a software update. The Washington Post reports, 'The computer in question was used to monitor chemical and diagnostic data from one of the facility's primary control systems, and the software update was designed to synchronize data on both systems. According to a report filed with the Nuclear Regulatory Commission, when the updated computer rebooted, it reset the data on the control system, causing safety systems to errantly interpret the lack of data as a drop in water reservoirs that cool the plant's radioactive nuclear fuel rods. As a result, automated safety systems at the plant triggered a shutdown.' Personally, I don't think letting devices on a critical control system accept data values from the business network is a good idea."

This discussion has been archived. No new comments can be posted.

just to shortcircuit the nuclear hysteria (Score:5, Informative)

by circletimessquare ( 444983 ) writes: <circletimessquare AT gmail DOT com> on Friday June 06, 2008 @09:06PM (#23689865) Homepage Journal

most freakouts surrounding nuclear power are based on 1960s technology. modern reactor designs, such as pebble bed reactors [wikipedia.org], are designed to be passively safe. that is, you can just walk away from them, doing nothing, and they will not release gas, go china syndrome, or anything else unsafe. older nuke tech requires active safety management: someone must always be on the job, making sure nothing f***s up. designing safety into nuclear reactor design from the philosophical ground up is the way of the future

Re:One begs the question (Score:3, Informative)

by Viper Daimao ( 911947 ) writes: on Friday June 06, 2008 @09:09PM (#23689891) Journal

one begs the question...
No, one doesn't [wikipedia.org]

Re:Lesson learned: (Score:5, Informative)

by bluefoxlucid ( 723572 ) writes: on Friday June 06, 2008 @09:13PM (#23689913) Homepage Journal

No, it has no reason to believe the coolant system has water. It's called FAIL SAFE; if I'm not quite sure, then fuck it, back off and shut the grid down and go MAKE SURE everything looks right.

The proper response of a nuclear cooling system to not knowing whether or not it's working correctly is not "let's keep running hot and see if more sample data comes across."
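As a rough illustration of that fail-safe rule (a sketch only: `Reading`, `decide`, `MAX_AGE_S` and `MIN_SAFE_LEVEL` are hypothetical names and values, not anything taken from the Hatch plant), the decision can treat missing or stale data exactly like a bad level:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical limits, chosen only for illustration.
MAX_AGE_S = 2.0        # readings older than this are treated as missing
MIN_SAFE_LEVEL = 10.0  # coolant level below this is unsafe


@dataclass
class Reading:
    level: float      # reported coolant level, arbitrary units
    timestamp: float  # time the sample was taken, in seconds


def decide(reading: Optional[Reading], now: float) -> str:
    """Return "RUN" only when a fresh, safe reading is available."""
    if reading is None:
        return "SHUTDOWN"  # no data at all: assume the worst
    if now - reading.timestamp > MAX_AGE_S:
        return "SHUTDOWN"  # stale data is no better than no data
    if reading.level < MIN_SAFE_LEVEL:
        return "SHUTDOWN"  # genuinely low coolant
    return "RUN"
```

Every path that is not a fresh, in-range reading ends in a shutdown, so "not sure" and "unsafe" are handled identically.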

Re::O (Score:3, Informative)

by afidel ( 530433 ) writes: on Friday June 06, 2008 @10:48PM (#23690445)

Probably not, they should be airgapped with tight control over access to the network they sit on. I don't like the idea of SCADA systems being on a shared network to begin with. In fact there's speculation that several recent incidents nationwide were due to systems on the shared network being compromised by targeted attacks from China. That may be conspiracy theory speculation but I've seen it discussed enough on serious network security boards that I'm starting to wonder if there isn't some ring of truth to it.

Patching for patching's sake is an IT fetish that just as often as not leads to more problems than it solves. In fact the only problem I've had in the last two years that caused any significant client disruption was caused by a bad DAT update (patch) to our AV software.

Re:Wow that is so funny (Score:5, Informative)

by profplump ( 309017 ) writes: <zach-slashjunk@kotlarek.com> on Friday June 06, 2008 @11:22PM (#23690625)

The system as a whole *did* know the reading was bogus. The control/safety system shut down because it stopped getting "safe" indications from the monitoring/input system. It seems pretty clear that the input system itself correctly logged the reason for the error.

The interface to the control system for the tank level doesn't (or at least shouldn't) have an entire separate "error" parameter -- it probably takes a simple numeric value from the input system.

The input software knows when the readings are bogus or missing. In that case it either stops sending input, which would presumably trigger a watchdog in the control system, or it sends data that indicates a worst-case scenario, with which the control system can do whatever it does in a worst-case scenario.

The control system itself doesn't care why there are or may not be safe input parameters, it only cares that it cannot rely on the input it needs for safe operation. Giving it any more information just adds code and interface complexity to safety-critical software.

Here's the system as implemented:
level = tank.getLevel()
if (level < SANE_MIN || level > SANE_MAX)
    level = 0   // out-of-range reading: report the worst case
control.input.set(TANK_LEVEL, level)

Here's the system you describe:
error = 1   // assume an error until the reading is shown to be sane
level = tank.getLevel()
if (level >= SANE_MIN && level <= SANE_MAX)
    error = 0
control.input.set(TANK_LEVEL, level, error)

The latter makes the safety-critical control software more complex, with more test cases and more input parameters, none of which add any value to the safe operation of the control system. The error parameter potentially allows for operation during transient errors, but that's a decision you can make in other ways, without adding interface complexity.

The only inconvenience of the simpler interface is that you have to check the logs from the input device in addition to the control device to determine why the error occurred. And please don't argue that consolidated error logging is worth extra code complexity -- that's probably not even true in a web app, let alone a human-safety control system.
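For anyone who wants to poke at the simpler interface, here is a minimal runnable sketch in Python. The limits and the names `sanitize` and `control_should_trip` are made up for illustration and say nothing about how the plant's software is really written:

```python
from typing import Optional

# Hypothetical limits for illustration only.
SANE_MIN = 0.0
SANE_MAX = 100.0
TRIP_LEVEL = 20.0  # control system shuts down below this level


def sanitize(raw: Optional[float]) -> float:
    """Input side: map missing or implausible readings to the worst case."""
    # "not (a <= x <= b)" is also true for NaN, so NaN becomes 0.0 as well.
    if raw is None or not (SANE_MIN <= raw <= SANE_MAX):
        return 0.0
    return raw


def control_should_trip(level: float) -> bool:
    """Control side: one numeric input, one comparison, no error flag."""
    return level < TRIP_LEVEL
```

The control side only needs tests around a single threshold; everything about why a reading was bad stays in the input side and its logs.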

Water (Score:2, Informative)

by Grocks ( 706157 ) writes: on Friday June 06, 2008 @11:38PM (#23690703)

You can't put too much water into a nuclear reactor

Re:Only the biz machine was updated. Why trouble? (Score:3, Informative)

by Anonymous Coward writes: on Saturday June 07, 2008 @12:02AM (#23690805)

There are such requirements in the US, be they for SIL ratings, performing haz-op reviews, etc. Particularly in nuclear apps.

In a plant, not all control systems are SIL rated, but the safety backups usually are, though more and more operators are buying or upgrading to SIL qualified systems and extending SIL to other than just the safety and protection backups.

In this case, the engineers were probably asleep at the wheel and didn't realize the changes they made to the control software impacted the trip & protection systems, so didn't bother to even have a haz-op review prior to making the change to get updates to a control parameter (or set of parameters) from a networked device. They probably figured they were just adding a trim or tuning variable of some kind to the control loop and didn't do ANY real failure analysis.

Oops.

Oh well, time for all the governing bodies like the NRC to get out the microscopes and take a peek at the plant's operating procedures and engineers' adherence to them.

Cheers

Re:Wow that is so funny (Score:5, Informative)

by icebike ( 68054 ) writes: on Saturday June 07, 2008 @12:10AM (#23690861)

What part of FAIL SAFE don't you understand?

The System FAILED. It is programmed to SAFE the reactor when shit happens.

Without its sensors it had no choice but to assume worst case and scram the reactor.

It did it the right way. It did it the way it was programmed to do it.

What would you have it do to determine why it is no longer getting critical data? Send out a droid to check the cat5 cables? It's a frikin computer in a rack, not R2D2.

It worked the way it was supposed to.

Take a step back and let the big boys handle the reactor, Please.

Re:Wow that is so funny (Score:5, Informative)

by Anonymous Coward writes: on Saturday June 07, 2008 @02:11AM (#23691451)

I think you're missing the real point, which is that the central safety systems are being fed data from a 'business network'. What would happen if that computer had an issue that caused it to send the same data continuously even when the coolant level had really dropped? WHY are any safety systems receiving data from an insecure network?

It's bad enough that most reactors use regular PC's to do the data collection and reporting, given the security risks posed by such systems (especially if networked), but I never realized they would be so stupid as to feed data in the other direction like this!

Obviously you have -zero- experience with power plant networks. Allow me to enlighten you, albeit anonymously.

The reason machines like this receive data from networks that could be considered 'less secure' is because telemetry is required from a multitude of sources to actually ascertain any useful realtime information. Aggregation machines have to speak many different protocols and translate between them while communicating with other machines that belong at other plants, cities, states, and even companies to effectively get an accurate picture of the entire grid's current conditions.

The world of plant control machines themselves is very vendor-driven. Most facilities have turnkey solutions brought in by the few major players in this field: ABB, Hathaway, GE, etc. Those players don't even use the same SCADA protocols. Some use ICCP, some use DNP, and others prefer Etherpoll. I've seen RS232 data encapsulated into everything from fully-meshed TCP connections via OSI-Soft's PI to barely encoded into Modbus and slapped onto Ethernet with only an understanding of ARP.

The solutions are required because electricity is not just one powerplant pumping watts blindly. Instead, you have a multitude of plants all pushing power onto ISO-controlled grids that all have to work in concert with each other. This requires -- yes, you guessed it -- networking! The world of plant networks is pretty complex despite the hype you see in the media. The business of making actual watts appear magically at your house at a nice, consistent 60Hz is vastly more involved than most people realize.

Telemetry comes from secured networks, business networks, and other companies and controlling agencies. That is how it works. Period.

If you are actually interested in seeing the way these are regulated to be secured, the information is cleverly hidden in plain sight at the NERC website.

Correct Response to a Faulty Design (Score:1, Informative)

by spaten ( 513670 ) writes: on Saturday June 07, 2008 @02:46AM (#23691575)

yes the system ultimately made the right choice, shutting down with a perceived loss of critical information.

however, this was a best choice response to a poorly engineered shutdown system.

a properly designed critical shutdown system would have completely independent sensors, for exactly this reason. by design, no external system (i.e. business network data collection) should be able to compromise the integrity of a safety system in any way. Safety systems are designed to be redundant within themselves on many levels so that even if some link in the chain were to fail, there's another link waiting to take its place until repaired. Business systems, and often standard control systems, do not have that sort of availability/reliability, and so should have no part in the safety system.
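one common way to get that kind of internal redundancy is two-out-of-three voting across independent sensor channels. the sketch below only illustrates the idea; the threshold and the function names are hypothetical, and nothing here is taken from the plant's actual design:

```python
from typing import Optional, Sequence

MIN_SAFE_LEVEL = 10.0  # hypothetical trip threshold


def channel_votes_trip(level: Optional[float]) -> bool:
    """A channel votes to trip if its reading is missing or below the limit."""
    # "not (level >= limit)" is also true for NaN, so bad floats vote to trip.
    return level is None or not (level >= MIN_SAFE_LEVEL)


def two_out_of_three(levels: Sequence[Optional[float]]) -> bool:
    """Trip when at least two of three independent channels vote to trip."""
    if len(levels) != 3:
        raise ValueError("expected exactly three channels")
    votes = sum(channel_votes_trip(level) for level in levels)
    return votes >= 2
```

with this arrangement a single dead channel does not shut the plant down, but two missing or low channels do, and none of the channels depends on a business network machine.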

Re:Business Network? (Score:3, Informative)

by VENONA ( 902751 ) writes: on Saturday June 07, 2008 @02:57AM (#23691603)

"The real issue is why did the primary control system accept a reset..."

That was my first thought, too: a huge separation of privilege flaw. But, from TFA, "...when the updated computer rebooted, it reset the data on the control system, causing safety systems to errantly interpret the lack of data as a drop in water reservoirs that cool the plant's radioactive nuclear fuel rods."

So it's not that the system on the control side accepted a restart command that it shouldn't have. I'm not saying there's no problem--just that this wasn't the failure mode.

But it does make you wonder what else is wrong at this place, doesn't it?

TFA has a link to a Government Accountability Office paper on problems they found with TVA, which operates multiple reactors. Have a look at that thing http://www.gao.gov/new.items/d08526.pdf [gao.gov] to see a real mess. It's 62 pages of badness, but just reading page 2, "Results in Brief," will give most people the twitching horrors.

Password issues, bypassed firewalls, unpatched systems, limited logging, limited IDS, configuration management policy problems, physical security and training problems, etc. Apparently TVA has left no stone unturned in their efforts to fail an audit.

Re:Install Complete... (Score:3, Informative)

by SlashWombat ( 1227578 ) writes: on Saturday June 07, 2008 @03:42AM (#23691771)

Obviously, you have never seen pictures of Chernobyl. While it wasn't like an atomic bomb, it certainly went KABOOM. It blew a several hundred ton metal lid clean off the reactor, and demolished a fair percentage of the building containing the reactor core.

Re:Hmmm, threw an exception (Score:3, Informative)

by jimicus ( 737525 ) writes: on Saturday June 07, 2008 @07:35AM (#23692495)

I like their Reactor's logic...

IF [anything seems fucked up] THEN SHUT DOWN REACTOR

Friend of mine used to work in a nuclear power plant and that was basically how everything was set up. The staff were essentially there to prevent the reactor shutting itself down.

Re:More like bad system design (Score:3, Informative)

by aliquis ( 678370 ) writes: on Saturday June 07, 2008 @10:08AM (#23693077)

Bullshit. Apache with PHP, CGI, some database backend and so on may be vulnerable together with whatever page it runs. But a simple network application which answers a page request and returns a fixed page with no scripts or anything is neither complex nor a likely target.

Just run a script not running in the webserver process which updates the fixed webpage and be done with it. Feel free to tell me how anyone would play around with that solution ...

"Endless cut and paste", yeah, because it's sooo impossible to have a perl script or whatever fetch the webpage and cut out the data you care about? Though sure, some random form of read-only network solution may be even better than something web based. I'm sure the GP was ok with such a design as well.
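To show how little code the "cut out the data you care about" step needs, here is a sketch in Python rather than perl. The page layout, `SAMPLE_PAGE`, `LEVEL_RE` and `extract_level` are all invented for the example, since nobody here knows what the real status page would look like:

```python
import re
from typing import Optional

# Hypothetical fixed status page; the real page layout is unknown.
SAMPLE_PAGE = """<html><body>
<p>Tank level: 73.5 %</p>
<p>Updated: 2008-06-06 19:58</p>
</body></html>"""

# Matches an unsigned decimal number after the "Tank level:" label.
LEVEL_RE = re.compile(r"Tank level:\s*([0-9]+(?:\.[0-9]+)?)\s*%")


def extract_level(page: str) -> Optional[float]:
    """Return the tank level from the page text, or None if it is absent."""
    match = LEVEL_RE.search(page)
    return float(match.group(1)) if match else None
```

In a real setup the page text would be fetched over the network first (for example with `urllib.request.urlopen` from the standard library), and a `None` result should be treated as missing data, not silently ignored.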
