Software Update Shuts Down Nuclear Power Plant
Garabito writes "Hatch Nuclear Power Plant near Baxley, Georgia was forced into a 48-hour emergency shutdown when a computer on the plant's business network was rebooted after an engineer installed a software update. The Washington Post reports, 'The computer in question was used to monitor chemical and diagnostic data from one of the facility's primary control systems, and the software update was designed to synchronize data on both systems. According to a report filed with the Nuclear Regulatory Commission, when the updated computer rebooted, it reset the data on the control system, causing safety systems to errantly interpret the lack of data as a drop in water reservoirs that cool the plant's radioactive nuclear fuel rods. As a result, automated safety systems at the plant triggered a shutdown.' Personally, I don't think letting devices on a critical control system accept data values from the business network is a good idea."
Install Complete... (Score:5, Funny)
[Yes] [No] [OMFG!]
Oblig Simpsons reference (Score:5, Funny)
Re: (Score:2, Funny)
Re: (Score:2, Redundant)
Re: (Score:2)
Re: (Score:3, Funny)
Re: (Score:2, Insightful)
Re: (Score:3, Funny)
Re: (Score:2, Funny)
or perhaps just another variation on the BSOD (Blue Screen of Death)
Re: (Score:3, Funny)
Re: (Score:3, Informative)
Hmmm, threw an exception (Score:5, Insightful)
Re:Hmmm, threw an exception (Score:5, Funny)
Re: (Score:3, Funny)
Suppose, for example, that his first language was French, then he'd likely have a name like "Caword Anonoumouse".
Business Network? (Score:5, Interesting)
From the summary: If it's monitoring the primary control system then it seems to me like the machine would have to be on the control network. The real issue is why did the primary control system accept a reset from a monitoring system. It sounds like there's more than one bug to track down.
Re: (Score:3, Informative)
That was my first thought, too: a huge separation of privilege flaw. But, from TFA, "...when the updated computer rebooted, it reset the data on the control system, causing safety systems to errantly interpret the lack of data as a drop in water reservoirs that cool the plant's radioactive nuclear fuel rods."
So it's not that the system on the control side accepted a restart command that it shouldn't have. I'm not saying there's no probl
Re: (Score:3, Informative)
IF [anything seems fucked up] THEN SHUT DOWN REACTOR
Critical Update (Score:5, Funny)
Re: (Score:3, Funny)
Fail-Safe (Score:5, Insightful)
Re: (Score:2, Funny)
Re:Fail-Safe (Score:5, Insightful)
Re:Fail-Safe (Score:4, Insightful)
On the contrary, shutting down because the system is shit sounds like a much better option than continuing to run despite the shittiness of the computer monitoring everything.
Of course, the ideal situation would be to have good computers that only get updated in scheduled, planned ways so that you don't have the issue at all. But shutting everything down when something is amiss is the only sensible response.
Re: (Score:3, Insightful)
That's my point. I don't want a reactor with ANY flaws. No matter how safe its default shutdown thresholds are.
And I'd like to be king of all Londinium and wear a shiny hat.
Systems without flaws will never exist, so we need to design systems that do reasonable things when they encounter flaws.
In this case, the flaw wasn't even caused by the machines, but instead was directly caused by the "fleshy" parts of the system, and the machines still managed to handle the problem safely.
This was not a "fail-safe" incident (Score:5, Insightful)
In the article they mention that the system wasn't designed for security (since it was meant to be internal) - but this isn't a security issue at all! Any sort of system that relies upon other systems should be designed to assume failure can and will occur in other systems - that is not to say that it needs to verify/evaluate incoming data to make sure it is "good", but rather that it can tell the difference between receiving data (such as current water levels) and receiving no data at all (system failure). Once it has that it can ideally automatically switch to a backup system, or do what it did here and enter a fail-safe state (the difference being that it does so while pointing out the actual problem and not an incorrectly perceived problem in a different part of the system).
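To make the distinction above concrete, here is a minimal sketch of a check that trips in both cases but reports the real cause. Everything in it is an assumption for illustration (the Reading type, the 20 percent threshold, the 2 second staleness limit); none of it comes from the plant's actual software.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Action(Enum):
    RUN = "run"
    TRIP_LOW_LEVEL = "trip: low coolant level"
    TRIP_NO_DATA = "trip: no data from level sensor"


LOW_LEVEL_LIMIT = 20.0  # hypothetical trip threshold, percent of full
MAX_AGE_SECONDS = 2.0   # hypothetical limit on how old a reading may be


@dataclass
class Reading:
    level_percent: float
    timestamp: float  # seconds, on the same clock as `now`


def decide(reading: Optional[Reading], now: float) -> Action:
    """Fail safe in every abnormal case, but name the actual problem."""
    # Missing or stale data is a failure of the data path, not a low level.
    if reading is None or now - reading.timestamp > MAX_AGE_SECONDS:
        return Action.TRIP_NO_DATA
    # A fresh reading below the limit is a genuine low-level condition.
    if reading.level_percent < LOW_LEVEL_LIMIT:
        return Action.TRIP_LOW_LEVEL
    return Action.RUN
```

Either trip still ends in the same safe state; the only thing gained is that the operator is pointed at the data link instead of the reservoir.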
Re: (Score:3, Insightful)
instead it incorrectly detected a problem that did not in fact exist
This might be splitting hairs, but I'd say it correctly detected a data inconsistency and responded appropriately. There could be a dangerous condition that is indistinguishable to the failsafe system from what actually happened - and it could be a condition that nobody's ever thought of before. It's far better to trigger the failsafe when a data inconsistency has occurred than to make a potentially incorrect automated judgment concerning the cause of the inconsistency leading to a more severe problem do
Re: (Score:3, Insightful)
More like bad system design (Score:2)
I guess "software update" may have been used to bash Microsoft a little or something, not that it says Windows Update. Or maybe the poster hates all kinds of software updates?
Re:More like bad system design (Score:5, Insightful)
a) high chance of accidentally shutting down a reactor harmlessly
b) small chance of fucking up a nuclear reactor
you'll always go for (a), if you're sane.
damned if you do... (Score:4, Insightful)
Reboot. The tweaked configurations happen to go away. No one remembers which ones they were. The system is b0rked for a while.
I would hope that isn't the case for that system, but I have seen it happen before.
Re: (Score:3, Insightful)
Probably not. Web servers are complex, and likely targets for attack. And the business people will end up doing endless cut and paste.
A better solution would be to accumulate the data that the businesspeople need on a single system on the control LAN. That system rsync's CSV files onto a system on the business LAN. No connections are initiated from the business LAN into the control LAN, and the data are more useful to MIS people on the busines
Re: (Score:3, Informative)
Just run a script, outside the web server process, which updates the fixed web page, and be done with it. Feel free to tell me how anyone would play around with that solution.
"Endless cut and paste", yeah, because it's sooo impos
Re:More like bad system design (Score:4, Insightful)
*However*, one of the more powerful ideas in configuring highly secure LANs is that the more-secure LAN is simply never allowed to accept connections from the less-secure LAN. It's also something that's really easy to firewall, your network becomes easier to audit, etc. If you're a security practitioner, it makes your life easier. You still have to worry about the sneaker-net, physical security, etc., but now you're more able to focus your resources on those areas. Once again, simplicity is better than complexity if you're really after security.
I don't know where you got the idea that I thought it was, "sooo impossible to have a perl script or whatever fetch the webpage and cut out the data you care about." It's easy. But pretty much nothing is as easy to extract data from as a CSV file, which you could process with nothing more than awk. That doesn't get you far with automating report generation, populating a database, or whatever else you intend to *do* with the data, but there are endless tools for those jobs--Perl included.
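As a rough illustration of how little is needed on either side of a CSV hand-off, here is a sketch using Python's standard csv module. The column names (timestamp, tag, value) and both function names are made up for the example; the transfer itself (rsync or otherwise, always initiated from the control LAN) is outside the sketch.

```python
import csv
import io


def write_snapshot(rows, out):
    """Control-LAN side: dump (timestamp, tag, value) rows as CSV."""
    writer = csv.writer(out)
    writer.writerow(["timestamp", "tag", "value"])  # hypothetical columns
    writer.writerows(rows)


def read_snapshot(src):
    """Business-LAN side: parse the pushed file; read-only, no way back."""
    return [
        (row["timestamp"], row["tag"], float(row["value"]))
        for row in csv.DictReader(src)
    ]


# Round trip through an in-memory buffer standing in for the copied file.
buf = io.StringIO()
write_snapshot([("t0", "coolant_level", 81.5)], buf)
buf.seek(0)
snapshot = read_snapshot(buf)
```

The business side only ever parses a file it was handed, so nothing in this path gives it a way to write to the control side.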
Also, in my experience, people want to mess with Web pages. They're more visual, and people tend to want to 'improve' them, meaning your Perl screen-scraper likely has to change as well. I see a lot less clamor for changing the data format in CSV files.
In the end, use what you need--XML, for all I care. Just *don't allow your less-secure LAN to initiate connections into your more-secure LAN*. That was the root cause of the failure described in TFA. It's one of many reasons the rule is so basic, though obviously not yet widely-enough followed. Ideally, hosts on a secure LAN communicate with *nothing* outside that LAN. You justify and document every[1] step away from that ideal, if for no other reason than that it plays hell with formal trust models, which can be important inputs into designing a thorough audit. I don't see how you justify accepting incoming traffic when there's an easy way to avoid it. In an audit, I'd be busting you for that Web server. Simple as that.
An approach like the one above is likely to make life easier for several internal groups, including office staff. And quite possibly the ultimate users--power consumers.
[1] I mean every, not most. For example, how do you handle time? I favor an NTP server on the secure LAN taking time inputs from the GPS cloud. I've never worked for an organization that had a spare atomic clock lying around, or I'd have used that, and eliminated one more external data flow.
Misreading of the Article (Score:5, Interesting)
I wonder if they were using something like EPICS. I worked on a large experiment which used EPICS to control the system. Rebooting a machine would sometimes expose a problem with resources not being freed, eventually leading to a situation where data channels would read the 'INVALID/MISSING' value. The solution, as anyone who has worked on this sort of experiment will know, was to reboot more machines until the thing worked.
(I don't mean to complain about EPICS. It is very powerful and flexible... it's just that the version we used had these occasional hiccups.)
Re: (Score:2)
No, actually, the summary says "when the updated computer rebooted, it reset the data on the control system, causing safety systems to errantly interpret the lack of data as a drop in water reservoirs"... That doesn't really have much to do with the reboot itself (causing the computer to be unreachable or whatever) but th
Re: (Score:3, Funny)
Terminal Error (Score:2, Interesting)
the slashdot crowd is dying to know... (Score:4, Funny)
Re:the slashdot crowd is dying to know... (Score:5, Funny)
If it was running something else then the application was at fault.
EULA! (Score:5, Funny)
Re: (Score:2)
Re: (Score:2)
In XP, it's just a plain text file on the CD.
This was Good (Score:4, Insightful)
The problem is the update - not business network (Score:5, Interesting)
Re:The problem is the update - not business networ (Score:5, Funny)
Re: (Score:2)
Oh sure, NOW you think of a debian slogan
Good thing it wasn't written in Smalltalk. The slogan there is building the rest of the boat while underway.
Only the biz machine was updated. Why trouble? (Score:5, Insightful)
I have no problem with a computer on the process control subnet reporting information to a computer on the business subnet.
I have a BIG problem with a computer on the business subnet being able to modify and corrupt data in a computer on the process control subnet.
"I can't dump data to the business side" is a reason to make a log entry and maybe sound a minor alarm. It's not a reason to shut down the reactor (unless the data is needed for regulatory compliance and the process control side isn't able to buffer it until the business side is working correctly.)
But if a business subnet computer can tamper with something as critical as a process control machine's idea of the level of coolant in a reservoir, it rings my "design flaw" alarms.
Is it ONLY able to reset it to "empty" as a poorly designed part of a communication restart sequence? Or could it also make the process control machine think the level was nominal when it WAS empty?
IMHO this should be examined more closely. It may have exposed a dangerous flaw in the software design.
Security flaws don't care if they're exercised by mischance or malice. If nothing else, this is a way to DoS a nuclear plant through a break-in on the business side of the net.
Re:Only the biz machine was updated. Why trouble? (Score:4, Interesting)
Re: (Score:3, Informative)
In a plant, not all control systems are SIL rated, but the safety backups usually are....though more and more operators are buying or upgrading to SIL qualified systems and extending SIL to other than just the safety and protection backups.
In this case, the engineers were probably asleep at the wheel and didn't realize the changes they made to the control software impacted the trip
Where is the redundancy? (Score:2, Redundant)
If I were the power co owning this plant, I'd be
Re:The problem is the update - not business networ (Score:3, Funny)
"King-size Homer" season 7 episode 7, Nov 5, 1995 (Score:5, Funny)
Another proof that Homer Simpson was truly ahead of his time. [wikipedia.org]
Are you mad, woman? You never know when an old calendar might come in handy. Sure, it's not 1985 now, but who knows what tomorrow will bring? -Homer
Working as intended (Score:3, Insightful)
well duh (Score:2)
Here's the real story... (Score:2)
We all know what really happened. Dude rebooted the computer so that Windows automatic update reminder to reboot wouldn't interrupt his Solitaire game every 10 minutes.
This is why... (Score:4, Interesting)
I've had a whole plant lose view of its system because some well-meaning retard in IT decided to push updates onto a SCADA system without qualifying the updates... never had it KILL the control side of things though... well done whoever you were, you've done well.
Re: (Score:2)
If it's the latter, I feel for him
Huh? (Score:2)
Huh? I've read the NTSB report on that accident - and nowhere in it (IIRC) are computers implicated. The accident occurred due to damage to the pipes from con
This was NOT a failure! (Score:2, Insightful)
It says the software is supposed to sync data between the control system and the business network. Obviously it has to be connected to both sides somehow. I'm not a power plant designer, but there's probably a good reason why people might need access to that data from the control system, and thus some kind of system acting as a safe bridge between the two rather than allowing unrestric
Re: (Score:2)
I'll admit that I'm too drunk to read TFA at the mo, so may have missed some detail
It kinda worked then... (Score:4, Insightful)
That is definitely a glass half full, as opposed to empty.
just to shortcircuit the nuclear hysteria (Score:5, Informative)
Re:just to shortcircuit the nuclear hysteria (Score:5, Insightful)
India's accelerated thorium idea is also very promising.
The major problem I see with US nuclear power is the assumption that it is a solved problem and almost zero has been spent on R&D for decades. The "new generation" of reactors from Westinghouse and others is little more than 1960's white elephants painted green.
Reboot on the business network? (Score:2)
Every 108 minutes.... (Score:3, Funny)
... enter 4, 8, 15, 16, 23, 42.
Or else all hell breaks loose.
Time for someone to review the shutdown system (Score:3, Insightful)
Yes, the safety system kicking in is "a good thing".
Pulling data from another computer system for a safety related control system is not a bright idea (the weakest link problem).
Historically, in a safety control system in an oil and gas environment, all the inputs to the safety system are either hardwired or pulled from another safety system controller which has the appropriate level of redundancy (CPU boards and communication paths with communication watchdog timers).
Even transmitters in some circumstances cannot be trusted, hence the 2 out of 3 voting systems (take three transmitters measuring the same value and pick the middle of the three; if one of the transmitters fails high or low, your choice will still be the safe option).
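For anyone who hasn't seen it, the middle-of-three selection described above fits in a few lines. This is only a sketch: the function names and the trip threshold are invented for the example, and real safety controllers do this in certified logic rather than in Python.

```python
import statistics


def mid_of_three(a: float, b: float, c: float) -> float:
    """Return the middle of three transmitter readings.

    A single transmitter failing high or low can never be the middle
    value, so the result always comes from a transmitter that agrees
    with at least one other.
    """
    return statistics.median([a, b, c])


def two_out_of_three_trip(a: float, b: float, c: float, trip_below: float) -> bool:
    """Trip only if at least two of the three readings call for it."""
    votes = sum(1 for reading in (a, b, c) if reading < trip_below)
    return votes >= 2
```

With readings of 50.0, 51.0 and a failed-low 0.0, the selected value is 50.0 and a trip threshold of 20.0 collects only one vote, so a single dead transmitter does not shut the process down.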
Someone needs a serious think about where this plant is getting data for its safety shutdown system.
ZombieEngineer
Re: (Score:2, Funny)
Re: (Score:3, Insightful)
"... when the updated computer rebooted, it reset the data on the control system, causing safety systems to errantly interpret the lack of data as a drop in water reservoirs that cool the plant's radioactive nuclear fuel rods. As a result, automated safety systems at the plant triggered a shutdown."
From that snippet alone, it stands to reason that _an
Re:Lesson learned: (Score:5, Informative)
The proper response of a nuclear cooling system to not knowing whether or not it's working correctly is not "let's keep running hot and see if more sample data comes across."
big increases in your power bill! (Score:4, Insightful)
They have a perfectly adequate safety system that did exactly what it's supposed to do. It read confusing data and decided to shut the reactor down until a human came along and explained things satisfactorily. What's wrong with that? Aside from having the reactor offline for 48 hours, there was no other cost.
Such systems already exist. (Score:4, Interesting)
Re:Obligatory (Score:5, Funny)
Re: (Score:2)
Wow! Imagine a beowulf cluster of these!
Re: (Score:2)
Re::O (Score:5, Insightful)
Re: (Score:3, Funny)
The chemical company I work for has VAX/Unix systems that haven't been rebooted in over four years... and only then because of power outages.
Re::O (Score:5, Insightful)
Re: (Score:3, Informative)
Re::O (Score:4, Insightful)
Well, the auditors seem to expect it... as do the vendors when we call for support - "Oh, you say foobar isn't working... well it looks like you're 15 revisions behind; why don't you just fix that and call me when you're done. Oh, your policies state you need to test and certify them? Well I guess I won't be hearing from you for a while, then."
--
.nosig
Re:Wow that is so funny (Score:4, Insightful)
Re:Wow that is so funny (Score:5, Insightful)
Re: (Score:3, Insightful)
Re:Wow that is so funny (Score:5, Insightful)
Re:Wow that is so funny (Score:5, Insightful)
I've set nagios up to monitor my network, and any loss of signal is considered CRITICAL, not just a warning, but critical... and I need to know then.
Re:Wow that is so funny (Score:5, Informative)
The interface to the control system for the tank level doesn't (or at least shouldn't) have an entire separate "error" parameter -- it probably takes a simple numeric value from the input system.
The input software knows when the readings are bogus or missing. In that case it either stops sending input, which would presumably trigger a watchdog in the control system, or it sends data that indicates a worst-case scenario, with which the control system can do whatever it does in a worst-case scenario.
The control system itself doesn't care why its input parameters are missing or unsafe, it only cares that it cannot rely on the input it needs for safe operation. Giving it any more information just adds code and interface complexity to safety-critical software.
Here's the system as implemented:
level = tank.getLevel()
if (level < SANE_MIN || level > SANE_MAX)
    level = 0
control.input.set(TANK_LEVEL, level)
Here's the system you describe:
error = 1
level = tank.getLevel()
if (level >= SANE_MIN && level <= SANE_MAX)
    error = 0
control.input.set(TANK_LEVEL, level, error)
The latter makes the safety-critical control software more complex, with more test cases and more input parameters, none of which add any value to the safe operation of the control system. The error parameter potentially allows for operation during transient errors, but that's a decision you can make in other ways, without adding interface complexity.
The only inconvenience of the simpler interface is that you have to check the logs from the input device in addition to the control device to determine why the error occurred. And please don't argue that consolidated error logging is worth extra code complexity -- that's probably not even true in a web app, let alone a human-safety control system.
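The other half of the simple interface is the watchdog on the control side mentioned above: if the input software just stops sending, the stored level has to fall back to the worst case by itself. A minimal sketch, assuming a single numeric level and a made-up timeout (the class and method names are hypothetical, not from any real control system):

```python
from typing import Optional


class InputWatchdog:
    """Holds the last tank level and treats silence as the worst case."""

    def __init__(self, timeout_seconds: float) -> None:
        self.timeout_seconds = timeout_seconds
        self.last_update: Optional[float] = None
        self.level = 0.0  # worst case until the first reading arrives

    def set_level(self, level: float, now: float) -> None:
        # Called by the input side; each call also resets the watchdog.
        self.level = level
        self.last_update = now

    def effective_level(self, now: float) -> float:
        # No update yet, or none within the timeout: report the worst case.
        if self.last_update is None or now - self.last_update > self.timeout_seconds:
            return 0.0
        return self.level
```

The interface stays a single number, and the control logic never has to know whether a zero came from a bad reading, a dead link, or a rebooted box on the other end.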
Re:Wow that is so funny (Score:5, Insightful)
Re:Wow that is so funny (Score:5, Informative)
The System FAILED. It is programmed to SAFE the reactor when shit happens.
Without its sensors it had no choice but to assume the worst case and scram the reactor.
It did it the right way. It did it the way it was programmed to do it.
What would you have it do to determine why it is no longer getting critical data? Send out a droid to check the cat5 cables? It's a frikin computer in a rack, not R2D2.
It worked the way it was supposed to.
Take a step back and let the big boys handle the reactor, Please.
Re:Wow that is so funny (Score:5, Insightful)
Re:Wow that is so funny (Score:5, Insightful)
I think you're missing the real point, which is that the central safety systems are being fed data from a 'business network'. What would happen if that computer had an issue that caused it to send the same data continuously even when the coolant level had really dropped? WHY are any safety systems receiving data from an insecure network?
It's bad enough that most reactors use regular PC's to do the data collection and reporting, given the security risks posed by such systems (especially if networked), but I never realized they would be so stupid as to feed data in the other direction like this!
Re:Wow that is so funny (Score:5, Informative)
I think you're missing the real point, which is that the central safety systems are being fed data from a 'business network'. What would happen if that computer had an issue that caused it to send the same data continuously even when the coolant level had really dropped? WHY are any safety systems receiving data from an insecure network?
It's bad enough that most reactors use regular PC's to do the data collection and reporting, given the security risks posed by such systems (especially if networked), but I never realized they would be so stupid as to feed data in the other direction like this!
The reason machines like this receive data from networks that could be considered 'less secure' is because telemetry is required from a multitude of sources to actually ascertain any useful realtime information. Aggregation machines have to speak many different protocols and translate between them while communicating with other machines that belong at other plants, cities, states, and even companies to effectively get an accurate picture of the entire grid's current conditions.
The world of plant control machines themselves is very vendor-driven. Most facilities have turnkey solutions brought in by the few major players in this field: ABB, Hathaway, GE, etc. Those players don't even use the same SCADA protocols. Some use ICCP, some use DNP, and others prefer Etherpoll. I've seen RS232 data encapsulated into everything from fully-meshed TCP connections via OSI-Soft's PI to barely encoded into Modbus and slapped onto Ethernet with only an understanding of ARP.
The solutions are required because electricity is not just one powerplant pumping watts blindly. Instead, you have a multitude of plants all pushing power onto ISO-controlled grids that all have to work in concert with each other. This requires -- yes, you guessed it -- networking! The world of plant networks is pretty complex despite the hype you see in the media. The business of making actual watts appear magically at your house at a nice, consistent 60Hz is vastly more involved than most people realize.
Telemetry comes from secured networks, business networks, and other companies and controlling agencies. That is how it works. Period.
If you are actually interested in seeing the way these are regulated to be secured, the information is cleverly hidden in plain sight at the NERC website.
Re:Wow that is so funny (Score:5, Insightful)
This is an area where that loosely controlled type of teamwork gets into trouble unless all coders treat data passed to and from their code in the same uniform, functional ways.
It also makes me wonder how the code will react to certain malicious software, should it get loose in the facility. If I were writing code to destroy a nuclear facility, it is how data is passed from one process to another that I would definitely attack as well as other vulnerable places.
It is sort of reassuring to have seen a failure result in a controlled shutdown rather than some other, more undesirable action.
Re: (Score:3, Funny)
Re:How could NRC even allow this in the first plac (Score:4, Funny)
Yeah, those bastards, the way they used THE SLIGHTEST AMOUNT OF CARE in designing a system that shuts down in response to unexpected data so as to avoid RECKLESSNESS with the SAFETY OF OTHERS.
Re: (Score:2)
Re:One begs the question (Score:5, Funny)
Re: (Score:2)
No wonder we have such a N.I.M.B.Y. problem with them.
MOD PARENT UP! (Score:5, Funny)
Re: (Score:3, Informative)
Re: (Score:3, Insightful)
Then again, maybe intelligent and well-educated people will just ignore people who aren't intelligent enoug
Re: (Score:3, Insightful)
One of the great things about English is that one can phrase some