Cosmic Rays Causing 30,000 Network Malfunctions in Japan Each Year (mainichi.jp)
Cosmic rays are causing an estimated 30,000 to 40,000 malfunctions in domestic network communication devices in Japan every year, a Japanese telecom giant found recently. From a report: Most so-called "soft errors," or temporary malfunctions, in the network hardware of Nippon Telegraph and Telephone Corp. are automatically corrected via safety devices, but experts said in some cases they may have led to disruptions. It is the first time the actual scale of soft errors in domestic information infrastructures has become evident. Soft errors occur when the data in an electronic device is corrupted after neutrons, produced when cosmic rays hit oxygen and nitrogen in the earth's atmosphere, collide with the semiconductors within the equipment. Cases of soft errors have increased as electronic devices with small and high-performance semiconductors have become more common. Temporary malfunctions have sometimes led to computers and phones freezing, and have been regarded as the cause of some plane accidents abroad. Masanori Hashimoto, professor at Osaka University's Graduate School of Information Science and Technology and an expert in soft errors, said the malfunctions have actually affected other network communication devices and electrical machineries at factories in and outside Japan.
Oblig... (Score:5, Funny)
Re: (Score:2)
Re: (Score:2)
Probably even better, all you have to do is poke electrons into a capacitor or create a temporary conductive path to bleed them off.
Keeping an eye on this (Score:2)
Re: (Score:2)
He's got employment as an ambassador for Japan now. If he was up to it, it'd be a problem for everyone else, not them.
Re: (Score:2)
I thought he was working for HBOmax now.
Re: (Score:2)
Re: (Score:2)
An old story (Score:1)
I recall back when a megabyte machine was a big deal, this issue was much discussed in some circles. Why, I gather, pretty much all server class machines have ECC memory. Random bit decay we called it. Nice to know it hasn't been forgotten. The change from synchronous error reporting by the cpu likely obscured some of the problems this caused. I stopped paying attention a long time ago. But then, I only did internals and was never much of an applications hacker.
even older (Score:3)
I recall back when a megabyte machine was a big deal, this issue was much discussed in some circles. Why, I gather, pretty much all server class machines have ECC memory. Random bit decay we called it. Nice to know it hasn't been forgotten. The change from synchronous error reporting by the cpu likely obscured some of the problems this caused. I stopped paying attention a long time ago. But then, I only did internals and was never much of an applications hacker.
When the conversion of RAM from core to semiconductor technology was starting at DEC, in the late 1970s, the hardware engineers worried that cosmic rays would cause memory errors, since semiconductor memory is more sensitive to cosmic rays than are ferrite cores. As a result, the first PDP-10 memories using semiconductors also had ECC.
Re:even older (Score:4, Interesting)
DDR6 is going to be all ECC. Memory sizes are so large now it's pretty much a requirement.
It will still come in two types though. One that does ECC internally, one that exposes the ECC to the CPU.
internal ECC checking (Score:3)
DDR6 is going to be all ECC. Memory sizes are so large now it's pretty much a requirement.
It will still come in two types though. One that does ECC internally, one that exposes the ECC to the CPU.
On the one that does ECC internally, how does it report soft and hard ECC errors?
Re: (Score:2)
It doesn't report errors, it just silently corrects them internally. To the computer it looks like non-ECC RAM.
Re: (Score:3)
It doesn't report errors, it just silently corrects them internally. To the computer it looks like non-ECC RAM.
Correcting but not reporting errors is a bad design--you don't know your memory is getting flakey until it fails hard, so you don't know you need to stock a spare. Does it also not report hard ECC errors, but just return bad data? It would be better to stop, freezing the CPU. At least that way you would know you have a problem.
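To make the correct-versus-report distinction concrete, here is a minimal sketch using a toy Hamming(7,4) code. This is an illustration only: real DRAM uses much wider codes, and the function names `encode` and `decode` are made up for this example. The point is that the decoder already computes the position of the bad bit, so whether that information is passed on or thrown away is a design choice.

```python
def encode(data):
    """Encode 4 data bits into a 7-bit Hamming codeword (positions 1..7)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers positions 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers positions 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def decode(word):
    """Return (data_bits, error_position); error_position is 0 if clean."""
    c = list(word)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s4  # syndrome = 1-based index of the bad bit
    if position:
        c[position - 1] ^= 1  # correct the single-bit error
    return [c[2], c[4], c[5], c[6]], position


stored = encode([1, 0, 1, 1])
stored[5] ^= 1  # simulate a soft error in position 6
data, error_position = decode(stored)
if error_position:
    # A "silent" design would skip this step; a reporting design logs it.
    print(f"corrected single-bit error at position {error_position}")
```

A Hamming(7,4) code can only fix one flipped bit per word. Two flips in the same word would be miscorrected, which is why practical ECC usually adds an extra parity bit to detect double errors.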
Re: (Score:2)
Consider that at the moment your computer just carries on completely unaware of the error in most cases. So it's clearly better than the status quo, although not as good as proper ECC memory that reports faults.
I do wonder why they even bothered with self correcting mode... Probably at Intel's request so they can segment the market by continuing to offer ECC and non-ECC CPUs. Ryzen supports ECC and while it's not qualified on many motherboards it does work just fine in most cases.
Re: (Score:2)
Consider that at the moment your computer just carries on completely unaware of the error in most cases. So it's clearly better than the status quo, although not as good as proper ECC memory that reports faults.
I do wonder why they even bothered with self correcting mode... Probably at Intel's request so they can segment the market by continuing to offer ECC and non-ECC CPUs. Ryzen supports ECC and while it's not qualified on many motherboards it does work just fine in most cases.
Depending on what they do in the case of an uncorrectable ECC error, it could be better than the status quo.
I suspect the motivation for adding ECC without error reporting is to improve reliability. With the reporting omitted that will be hard to measure. I wonder if somebody will come out with a hack to convert non-reporting ECC to reporting ECC, thus avoiding the attempt by Intel to segment the market. My desktop has a motherboard that supports ECC RAM, but I bought non-ECC RAM because it costs half as
Re: (Score:2)
The issue with Intel is that the CPUs don't support ECC. Or rather they do, it's just disabled on the consumer ones. You have to pay extra for a model that is identical other than having the ECC feature unlocked.
AMD support ECC on most, if not all Ryzen parts. Many motherboard manufacturers don't test ECC RAM so it's technically not qualified and you are on your own with it, but it usually works just fine.
Re: (Score:2)
The issue with Intel is that the CPUs don't support ECC. Or rather they do, it's just disabled on the consumer ones. You have to pay extra for a model that is identical other than having the ECC feature unlocked.
AMD support ECC on most, if not all Ryzen parts. Many motherboard manufacturers don't test ECC RAM so it's technically not qualified and you are on your own with it, but it usually works just fine.
The issue isn't just with Intel. When I priced RAM, I found that ECC memory costs twice as much as non-ECC memory. Maybe the memory manufacturers are taking a lesson from Intel.
Re: (Score:2)
Re: (Score:2)
Virtual +1 funny.
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
There's a talk online somewhere by a guy who writes radiation-hardened code, it was at a conference in Australia a few years ago. The slides are pages and pages of the most paranoid code you've ever seen, he did a demo where he zapped a cellphone with uranium to demonstrate what faults can do. So you can mitigate it in software, it's just software that looks like nothing else on earth.
It is possible, though very difficult, to write code that can deal with certain kinds of hardware faults. IBM's OS/360, for example, would abend a user job if certain kinds of faults, known as "program damage", happened while the user job was running. Other kinds of faults, known as "system damage", would bring down the whole computer.
Even with the most paranoid of code, there are limits to the kinds of hardware faults that software can recover from. If one percent of your memory fetches get ECC Uncorrec
Re: (Score:2)
Re: (Score:2)
Not necessarily, you can create extremely fault-tolerant code that works even in the presence of hardware faults, like the talk I mentioned. Another one is the control systems used in the French SACEM train control, which has multiple data flows that cross-check each other. Or TMR programming, standard in high-radiation environments. You can certainly get essentially zero faults (or at least reduced to a negligible level) via software-only techniques, it just takes a lot of careful programming. Seeing both TMR and self-checking code in action is impressive, you can randomly flip bits while it's running and it just keeps on going.
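For readers who haven't seen TMR (triple modular redundancy), a rough software-only sketch looks like this. It is not taken from the talk or from SACEM; the names `vote` and `tmr_read` are hypothetical. Each value is stored three times, reads go through a bitwise majority vote, and the copies are rewritten afterwards so that a second upset doesn't pile up on the first.

```python
def vote(a: int, b: int, c: int) -> int:
    """Bitwise majority of three redundant copies of the same value."""
    return (a & b) | (a & c) | (b & c)


def tmr_read(copies: list[int]) -> tuple[int, bool]:
    """Return the voted value and whether any copy disagreed with it."""
    voted = vote(*copies)
    disagreed = any(copy != voted for copy in copies)
    # Scrub: rewrite all three copies so errors do not accumulate.
    copies[:] = [voted] * 3
    return voted, disagreed


copies = [0b1011_0010] * 3
copies[1] ^= 0b0000_1000  # simulate a single-event upset in one copy
value, disagreed = tmr_read(copies)
```

The vote tolerates any number of flipped bits in one copy, and even flips in different copies as long as they hit different bit positions. It fails when two copies are hit in the same bit position before a scrub.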
NASA does a good job of dealing with hardware faults in software, but there are some problems that are utterly out of reach of software. A good example of software recovery is the NEAR Shoemaker mission to the asteroid Eros, which you can read about at this URL: http://near.jhuapl.edu/anom/Ho... [jhuapl.edu] . An example of an unrecoverable problem is the loss of Mars Climate Orbiter, at this URL: https://spectrum.ieee.org/aero... [ieee.org] . I liked this paragraph:
flip a bit (Score:5, Interesting)
I remember having a Perl script that had done a specific function for 20 years without fail, and then one day it quit working and spit out an error on invocation. A quick inspection revealed that in the text code an "h" had been flipped to a "g". So a function call to "hello_gps" had become "gello_gps".
Good thing it didn't flip a bit in some gnarly RegEx :)
Re: (Score:2)
That's really fascinating. I've had times where I swore something like that happened, but because it was most likely some transient in volatile memory I couldn't confirm it.
Of course I've had corrupted files, but I've always chalked that up to a bug in the file system, OS, or the software that wrote the file. Uninitialized variables do stuff that looks like that. They're some of the worst bugs, right up there with using free'd memory. You feel miserable when tracking them down, and you feel like Sherlo
Re: (Score:2)
I once had a case where I received a dump file (Windows). After analyzing the place where the program crashed, I realized that it was "impossible" - there is no way that the program could have crashed where it was because the assembly statement before it did a test against the value and should have branched. I figured it was a hardware error of some sort - maybe it could have been a flipped bit? It was the only time in my career where I was confident it was a hardware problem and not a software bug.
Of cours
Re: (Score:2)
WAAY back in the dark ages, I had a PC sorting huge (for the time) data files using a home grown sort routine. After flawless operation, suddenly it produced output horribly out of order. Since it was batch processing, I could exactly repeat the run as often as I wanted. It ran flawlessly every time after.
That's more than one bit flip. (Score:2)
Re:That's more than one bit flip. (Score:5, Interesting)
I was just going from memory, as this happened back in 2005. Should've known this was slashdot and someone was going to check me on it... So I searched the official maintenance log and found the actual entry:
"Nightingale investigated the problem and after finding no issue with the format of the state file, proceeded to examine the place in the cams.pl code where the error was occurring. Nightingale discovered that the cams.pl file had been mysteriously modified. A single bit was changed causing a variable name to be changed, e.g. State_ref => Stade_ref. ... The source code change was an ASCII character change from "t" to "d".
This is a change of a single bit in the character byte. The change in source code was not made by human editing. Troxel believes that the Piquin hard drive may have been impacted by the "Oh My God Particle"."
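The single-bit claim is easy to check by XORing the ASCII codes and counting the set bits. A quick sketch (the helper name `bits_flipped` is made up here) comparing the logged "t" to "d" change with the "h" to "g" change remembered earlier in the thread:

```python
def bits_flipped(a: str, b: str) -> int:
    """Number of bit positions in which two single characters differ."""
    return bin(ord(a) ^ ord(b)).count("1")


print(bits_flipped("t", "d"))  # 0x74 vs 0x64 -> prints 1
print(bits_flipped("h", "g"))  # 0x68 vs 0x67 -> prints 4
```

So the change in the maintenance log really is a one-bit flip, while "h" to "g" would have needed four bits to change at once.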
Re: (Score:2)
I'm curious, don't HDDs employ basic error correction, or does this only get invoked if it physically can't read a sector?
If not it's an even stronger use case for a check summing filesystem.
Re: (Score:1)
I think it's to do with disk technology. Used to be we had MFM drives that seemed to be solid. Then RLL drives that were flakey. Now we're well beyond that. Once in a while I compare a USB spindle 4TB disk to my desktop machine. I do a rsync to it every week and tell it to delete the deleted crap. Much to my surprise there are differences. The USB that is kept in a commercial grade steel safe (700+ lbs) turns out to be right. Things like a picture. When it's messed up on the desktop it's obvious. I get lik
Critical device infrastructure... (Score:2)
... should be shielded but no one really wants to pay for it. Let's be honest, our species will simply ride along until a calamity forces a change in how we do things, because it's easy to take the low cost + error prone route as long as those errors aren't severely disruptive.
Re: (Score:2)
... should be shielded but no one really wants to pay for it.
Shielding makes things bigger and heavier, would you really like to go back to 'phones like this ? [alamy.com]
Re: (Score:2)
It means most people would stop using smartphones and by association social media too.
So the answer to your question is: fuck yes.
Re: (Score:2)
Re: (Score:2)
It's always a game of statistics. No amount of shielding will be enough to completely eliminate the problem, so error detection and/or correction will always be needed. At some point spontaneous decay in the shielding becomes a potential source of bit flips. By no amount of shielding, I mean no amount that isn't larger than the Earth itself.
Re: (Score:2)
I've looked into the shielding that would be required, as we had a real problem with bit-flips in FPGAs causing storage devices to spit errors and crash. The necessary shielding wasn't just annoyingly expensive, it was completely infeasible from cost, logistics, and mechanical perspectives. It's much better to simply check for it with various hardware and software based methods, and correct it when it happens (and when you run things at large enough scale, it happens several times a month... and that's ju
Re: (Score:2)
So the answer is clear: we need to develop underwater avionics.
Re: (Score:1)
Some guy tried this in the Hudson River a few years back...
Nonsense (Score:5, Interesting)
This is not the first time this has become evident. There have been papers written on it. You can calculate the expected incident rate based on packet throughput. As others have said, companies like DEC and IBM spent a great deal of effort on the general question 50 years ago, and Cisco had several papers specific to network impacts in the late '90s.
This is why all such networking equipment, memory and filesystems are fairly resilient to the effect, because there's fuck all you can do about it (at least cost effectively).
Re: (Score:2, Informative)
Re: (Score:2)
Yeah, that's pretty much what I meant about resiliency. There's nothing practical you can do to stop the effect, so you have to build resiliency into the system: for memory that's ECC, for networks that's things like TCP checksums, for storage RAID parity bits, etc.
Resilient systems are much more practical than error proof systems.
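As a rough illustration of the checksum side of that, here is a sketch of the 16-bit ones' complement sum that TCP's checksum is built on. It is simplified: real TCP also covers a pseudo-header, and the function name and sample payload are just for this example. It detects a flipped bit but cannot correct it; the recovery comes from retransmission.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum over a byte string."""
    if len(data) % 2:
        data += b"\x00"  # pad to a whole number of 16-bit words
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # end-around carry
    return ~total & 0xFFFF


payload = bytearray(b"hello_gps")
sent = internet_checksum(bytes(payload))
payload[0] ^= 0x01  # simulate a single bit flipped in transit
received = internet_checksum(bytes(payload))
print(sent != received)  # the mismatch tells the receiver to discard the segment
```

A single flipped bit always changes this checksum, but some multi-bit patterns cancel out, which is one reason links and storage layers add stronger checks such as CRCs on top.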
Re: (Score:2)
Fuck everything, we're doing five parity bits. - Samsung
Re: (Score:2)
Fuck everything, we're doing five parity bits. - Samsung
I think you meant Gillette, not Samsung.
Re: (Score:2)
While in the original joke from The Onion [theonion.com] it is Gillette, it wouldn't make sense here.
Are you telling us you've never heard old jokes rewritten to fit another topic?
Re: Nonsense (Score:2)
So I rewrote the rewrite of a rewrite.... Heck, SNL did this back when the first triple-blade model came out.
Re: (Score:2)
In Japan the only internet service you can order is fibre, and the lowest speed is 2Gbps symmetrical. In many places the baseline is 10Gbps.
That's cheap consumer broadband. The hardware they give you is consumer grade. They just supply a modem, you have to get your own router. Of course very few consumer routers have a 10Gbps port and even fewer can actually route packets that fast.
Re: (Score:2)
In Japan the only internet service you can order is fibre, and the lowest speed is 2Gbps symmetrical. In many places the baseline is 10Gbps.
Wow, all that and tentacle hentai p0rn? Time to move to Japan!
Damn muon showers (Score:2)
Not a new phenomenon (Score:4, Interesting)
We had an embedded system with maybe 16KB of (if I recall correctly) Intel 2107 DRAM (4096 x 1 bit) deployed mid-to-late 1977. We had parity on the memory, and were observing higher than expected rates of parity error crashes. Trade newspapers brought word of May and Woods' paper (Apr. 1978) on alpha particle upsets from radioactive materials in ceramic chip packages. The reported failure rates, applied across our device population, explained most of the parity crashes that we had been observing.
Around 2013 in a telecom application, we (another company) had functional failures that could be traced to single event upsets causing persistent bit flips in hardware routing tables. The failure rate was low, but the limited number of cases observed seemed to occur at higher rates at higher altitude sites. Third semester physics problem: how can muons make it to ground level (under justifiable assumptions stated in the problem)? Answer: muons at 0.98c are relativistic: their decay "clock" runs slower than the earth observer's clock by a factor of 5. A sufficient number reach the ground to cause trouble.
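The factor of 5 can be checked in a few lines. In this sketch the 15 km production altitude is an assumption picked for illustration, not a number from the post; the muon mean lifetime of about 2.197 microseconds is the standard value.

```python
import math

C = 299_792_458.0          # speed of light, m/s
MUON_LIFETIME_S = 2.197e-6  # mean lifetime in the muon's rest frame


def lorentz_gamma(beta: float) -> float:
    """Time dilation factor for speed beta = v/c."""
    return 1.0 / math.sqrt(1.0 - beta ** 2)


def surviving_fraction(beta: float, distance_m: float, dilated: bool = True) -> float:
    """Fraction of muons that have not decayed after distance_m (ground frame)."""
    travel_time = distance_m / (beta * C)
    lifetime = MUON_LIFETIME_S * (lorentz_gamma(beta) if dilated else 1.0)
    return math.exp(-travel_time / lifetime)


beta = 0.98
altitude_m = 15_000.0  # assumed production altitude, for illustration only

print(f"gamma = {lorentz_gamma(beta):.2f}")
print(f"with dilation:    {surviving_fraction(beta, altitude_m):.2e}")
print(f"without dilation: {surviving_fraction(beta, altitude_m, dilated=False):.2e}")
```

Under these assumptions roughly 1 percent of the muons survive to the ground, versus something on the order of 1e-10 if their clocks were not slowed, which is the whole point of the exercise.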
Bonus points: help your kid build a cloud chamber. My dad did. It was fun and spooky, and I had great, if sometimes painful, adventures with the Model T ignition coil. Cosmic ray observation, quite literally on the kitchen table.
Re: (Score:1)
As processors and memory shrink more, down to that ultimate single molecule level, a cosmic ray can disrupt a bit much more easily than with larger more primitive chips and bits. I wonder how much of this is a side effect of Moore's law in action. Unintended consequences, maybe?
Re: (Score:2)
This is why defensive programming is a really good idea.
Some coding practices actually discourage stuff that would protect against this kind of thing. For example -Wall with clang (and probably GCC, I didn't check) warns if you use default in a switch statement on an enum. The theory is that you should explicitly cover every possible enumerated value, but what happens if a bit randomly flips? Default is the best way to handle that, as well as a load of other errors that are not uncommon with communication st
Re: (Score:2)
Some coding practices actually discourage stuff that would protect against this kind of thing. For example -Wall with clang (and probably GCC, I didn't check) warns if you use default in a switch statement on an enum. The theory is that you should explicitly cover every possible enumerated value, but what happens if a bit randomly flips? Default is the best way to handle that, as well as a load of other errors that are not uncommon with communication stuff.
Defensive programming is fine for detecting problems, but there are limits in responding to them that, when crossed, become counterproductive.
Where people get into trouble with "defensive" programming is acting as if the goal is not to crash when in fact the goal is correct operation. This means maintaining a level of brittleness in execution where the default response to the unexpected outcome is to stop digging.
Generally the resolution to your questions is to apply runtime assertions for gremlins. Warnings
I knew it (Score:2)
I knew those cosmic rays were a problem. They also sap and impurify all of my precious bodily fluids.
Cosmic rays, thermal errors (Score:1)
The smaller the feature size, the more it is susceptible to both thermal and cosmic ray upsets, but the less cross sectional area there is for either on any one transistor or memory location. It's a dance.
Lower operating voltages are probably a bigger driver. P=fV**2 is a tempting little equation isn't it?
Wife (Score:3)
I tried to explain it to her. Went to an online Flower shop to get her a bouquet, and due to the Chinese Neutron bug, an exotic porn site popped up. For some reason, she wouldn't believe me.
Re: (Score:2)
Online Plower Shop > Your one stop shop for sex machines of any size.
Re: (Score:2)
After all the fuss you did about her order for a pen [penisland.net], who can blame her? You kept insisting that she was ordering dildos!
This sounds right. Live with it. (Score:2)
Woops (Score:1)
your morganstanley BTC bag is now
empty
thank you, please come again - well , i guess that should be covered by the 10k + nodes, so what CAN it do ?