
That Time The Windows Kernel Fought Gamma Rays Corrupting Its Processor Cache (microsoft.com)

Long-time Microsoft programmer Raymond Chen recently shared a memory about an unusual single-line instruction that was once added into the Windows kernel code -- accompanied by an "incredulous" comment from the Microsoft programmer who added it:

;
; Invalidate the processor cache so that any stray gamma
; rays (I'm serious) that may have flipped cache bits
; while in S1 will be ignored.
;
; Honestly. The processor manufacturer asked for this.
; I'm serious.
invd


"Less than three weeks later, the INVD instruction was commented out," writes Chen. "But the comment block remains.

"In case we decide to resume trying to deal with gamma rays corrupting the the processor cache, I guess."
  • by johnjones ( 14274 ) on Sunday November 25, 2018 @11:44AM (#57696724) Homepage Journal

    preparing your software for failures in hardware due to common problems such as radiation might be a good idea...

    This is why some firms/states would not trust Microsoft with critical functions....

    • Sure they did (Score:5, Insightful)

      by rsilvergun ( 571051 ) on Sunday November 25, 2018 @12:02PM (#57696818)
      that's what their embedded OSes were for. AFAIK this was in their consumer code base.

      If I had to guess this was because of a real processor bug Intel didn't want to admit to. I remember when Win XP hit, the shop I was at was flooded with dead computers from upgrades. Manufacturers had been selling bad ram in computers for years. By default Win98 would only make use of the first 64 MB of ram in most cases (there was a registry hack I've long since forgotten to force it to use your entire ram before going to the cache).

      Anyway, XP's installer would copy the CD into ram to make the (very slow) install run faster. So you got to find out your OEM stuck bad ram in your box the hard way when the installer blew up. The best part was the upgrade couldn't roll itself back gracefully. I don't remember all the steps to fix it but it was a pain. We just did software where I was at too so it was fun having to send them somewhere else to get new ram and have them yell at me that the ram was fine. Good times.
      • Re:Sure they did (Score:5, Interesting)

        by msauve ( 701917 ) on Sunday November 25, 2018 @03:31PM (#57697714)
        "If I had to guess this was because of a real processor bug Intel didn't want to admit to."

        Alpha particles affecting memory is a known, but uncommon, issue. This code invalidated the cache when coming out of S1 (sleep) state. The deeper (S2+) sleep states already invalidate the cache. The longer the processor is in a static state (sleep), the more chance that an alpha particle hit will flip a bit. Invalidating the cache when coming out of a sleep state has no meaningful impact on performance. The time to re-fetch is nothing compared to the amount of time spent sleeping. Of course, there are many more bits in RAM which could be affected, so a problem is more likely to occur there, which this doesn't address.

        But it hurts nothing, avoids an (admittedly rare) issue, and is but a single instruction. I wonder why they removed it?
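        A rough back-of-envelope sketch of the parent's point that a post-S1 cache refill is negligible. The cache size, memory bandwidth, and sleep time below are illustrative assumptions (not figures from the thread), and the Python is just doing the arithmetic:

        # Rough sketch: time to refill a cold cache after resuming from S1,
        # compared with the time spent asleep. All figures are assumptions.
        CACHE_BYTES = 1 * 1024 * 1024          # assume a 1 MB cache
        MEM_BANDWIDTH_BPS = 10 * 1024 ** 3     # assume ~10 GB/s sustained DRAM bandwidth
        SLEEP_SECONDS = 5.0                    # assume the machine slept a few seconds

        refill_seconds = CACHE_BYTES / MEM_BANDWIDTH_BPS
        print(f"refill ~{refill_seconds * 1e6:.0f} us vs. {SLEEP_SECONDS:.0f} s asleep "
              f"(~{SLEEP_SECONDS / refill_seconds:,.0f}x longer)")

        Even with these modest numbers the refill is on the order of a hundred microseconds against seconds of sleep, which is the parent's argument in miniature.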
        • S1 is supposed to keep the cache fully powered up. How's it going to make any difference if an alpha particle hits the cache memory cells while the core clock has stopped?

          • by msauve ( 701917 )
            "How's it going to make any difference if an alpha particle hits the cache memory cells while the core clock has stopped?"

            It's not clear what you're asking. If a bit in the cache gets changed, it corrupts the instruction or data. That the cache is powered up makes no difference.
            • I'm saying the risk of cache corruption from gamma rays should be no different between S0 and S1.

              • by msauve ( 701917 )
                "the risk of cache corruption from gamma rays should be no different between S0 and S1."

                But invalidating the cache when returning from S1 removes any (even remote) risk. And there's no downside. Better is better.
                • There's a performance and power consumption impact. Otherwise they wouldn't have any cache at all.

                  • by msauve ( 701917 )
                    I understand how, not knowing how sleep states work, you would think that.
                    • So you don't understand that coming out of a sleep state and not having any data in the cache at all results in more stalls and main memory access and how that translates to a performance hit and more power consumption?

                    • by msauve ( 701917 )
                      No, you simply don't understand that any delay needed to fetch from RAM when coming out of a sleep measured in seconds is absolutely meaningless in the real world.
            • by epine ( 68316 )

              It's not clear what you're asking. If a bit in the cache gets changed, it corrupts the instruction or data. That the cache is powered up makes no difference.

              Wrong.

              Next contestant, please.

              You're assuming the cache has no parity or ECC mechanism where active use would eliminate single-bit errors before they accumulate into undetectable errors, whereas pickling the cache in quiescent warm brine would not.

              • by msauve ( 701917 )
                But see, that's just it. I'm not assuming anything, it's you who are making the ASSumption. The odds of a double hit (which might pass the parity check) are multiplied billions (GHz) of times when it's sitting there in static sleep for a second or two.
          • by mlyle ( 148697 )

            It's just that the stuff will have sat for an indeterminate, long time while the clock is stopped-- providing an unusually long window for a bit to flip-- and resuming even from S1 is a relatively costly operation.

            I think overall it is silly, but if you have ECC RAM and non-ECC cache, and spend most of the time in S1, it's not completely crazy.

          • by Agripa ( 139780 )

            S1 is supposed to keep the cache fully powered up. How's it going to make any difference if an alpha particle hits the cache memory cells while the core clock has stopped?

            The cache is protected against data corruption by ECC or parity; however, if multiple bit errors accumulate within one word, this protection fails. During normal operation, the cache is continuously scrubbed of errors so this is not a problem.

        • >Alpha particles affecting memory is a known, but uncommon, issue.

          A known issue for plastic packaging. The alpha emitters are in the plastic.

        • by Agripa ( 139780 )

          Of course, there are many more bits in RAM which could be affected, so a problem is more likely to occur there, which this doesn't address.

          High performance integrated SRAM is orders of magnitude more susceptible to radiation induced soft errors than DRAM which is why SRAM caches have included ECC or parity protection almost since they were first used.

          Oddly enough, DRAM has actually become more resistant to radiation induced soft errors over the last couple of generations but this is more than cancelled by the increasing amount of DRAM used.

      • If I had to guess this was because of a real processor bug Intel didn't want to admit to.

        I was wondering that too. The article suggests it is true.

    • Re: (Score:3, Informative)

      by Anonymous Coward

      Reading the full story, it's rather strongly implied that it was actually a workaround for a bug in the processor which the manufacturer hadn't found yet, and was blaming on cosmic rays.

      • by mikael ( 484 )

        I know stray radar microwaves can take out a PC. There was a weather radar station close to where I lived. Whenever my smartphone app received a heavy rain warning, my gaming PC would crash seconds before.

        • Was your gaming PC's case solid metal, or did it have large windows / oversized vents?
    • by mikael ( 484 ) on Sunday November 25, 2018 @12:49PM (#57697034)

      One component that many defence contracts required was a Nuclear Event Detector. This little component would set a pin when it detected the precursor of a nuclear detonation. What the system did next was up to the vendor, but usually it would involve a shutdown and disconnect of ports and power lines.

      • by Anonymous Coward
        Maxwell HSN-1000 [ddc-web.com]. You can't buy them new from Maxwell but you can get them used from recycled military gear for around $150.
  • by NotSoHeavyD3 ( 1400425 ) on Sunday November 25, 2018 @11:45AM (#57696730) Journal
    Since it explains the reasoning why that code is there. (Another developer could come by and wonder why that code is there.) I've seen way too many people put in a comment like ;invalidate cache and call it a day.
    • by Anonymous Coward

      Would have been somewhat better if they left in which processor, and which manufacturer, they were talking about.

      • You're very right there. Also I'm guessing there's probably some issue tracking so putting that in there would be nice as well. I'm just so surprised the original developer didn't put in some pointless comment.
    • It needs a reference to the errata from the vendor. Future revisions may need to tweak code flow and understand exactly what this is trying to achieve.

    • by Anonymous Coward

      +1

      Note from professional programmer: I can read the code to see WHAT is happening, and HOW it is happening. I need the comments to explain WHY it is happening, and WHY I should care. During code review, this comment would get an "awesome comment" comment.

    • by shabble ( 90296 )

      Since it explains the reasoning why that code is there. (Another developer could come by and wonder why that code is there.)

      But... the code isn't there. The code itself was commented out shortly after.

      What's more concerning is why the commented stuff was actually left in there, since I'm presuming they had source control even back then.

      And "in case someone put it back in later" isn't really covered since the same sort of code could conceivably be put elsewhere in the code without the programmer seeing this bit of code.

    • The comment could have included "Use this instruction with care. Data cached internally and not written back to main memory will be lost", INVD man [felixcloutier.com].
  • by vadim_t ( 324782 ) on Sunday November 25, 2018 @11:48AM (#57696746) Homepage

    The need for error checking has been around for a very long time. Yes, cosmic particles are indeed a thing, and result in increased memory errors at high altitude, in airplanes, or especially in space.

    I remember parity RAM being around in the 90s, and I'm pretty sure it's older than that. Pretty much any server these days uses ECC for this reason.

    I run ECC and record the occasional bit flip in my logs once in a while. These can be found at /sys/devices/system/edac/mc/mc0/ (a quick sketch of reading them is below).

    What's odd is that ECC is not routinely used in all hardware. Depending on the conditions it can be of great help, as the rare bit flip can cause strange problems that can take ages to track down. And it works well for figuring out when you have a bad memory module -- the computer will figure it out on its own.
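    A minimal sketch of reading the Linux EDAC counters mentioned above, assuming a single memory controller exposed as mc0; the ce_count/ue_count files may not be present on every kernel or board:

    # Minimal sketch: print the corrected/uncorrected error totals for mc0.
    # Assumes Linux with an EDAC driver loaded; exact paths vary by system.
    from pathlib import Path

    MC = Path("/sys/devices/system/edac/mc/mc0")

    for name in ("ce_count", "ue_count"):   # corrected / uncorrected error totals
        counter = MC / name
        if counter.exists():
            print(f"{name}: {counter.read_text().strip()}")
        else:
            print(f"{name}: not exposed on this kernel/controller")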

    • RAM is cheap enough that ECC or similar tech should be routine. I'll pay 10-15% more per GB for this.

      • by arth1 ( 260657 )

        The problem is that you need a CPU and north bridge that can handle it, which adds to the initial costs. For Intel, for example, a Xeon CPU costs (artificially) a good deal more than a comparable speed i3/5/7/9, which is an upfront cost that consumers aren't willing to eat, and they tend to choose either a cheaper CPU or a faster CPU for the same kind of money.

        • by vadim_t ( 324782 )

          Or you could buy AMD, which seems to have excellent support for it.

        • by mlyle ( 148697 )

          The most real problem is that this is a way for motherboard and CPU vendors to segment the market, and prevent commodity PC hardware from being used for critical things. Home users "don't need" ECC, so it can be left off the cheap stuff.

        • by Agripa ( 139780 )

          The problem is that you need a CPU and north bridge that can handle it, which adds to the initial costs. For Intel, for example, a Xeon CPU costs (artificially) a good deal more than a comparable speed i3/5/7/9, which is an upfront cost that consumers aren't willing to eat, and they tend to choose either a cheaper CPU or a faster CPU for the same kind of money.

          In most cases Intel's Xeon and consumer CPUs are the same hardware so the only difference in production might be testing time. Intel's artificial market segmentation of ECC is more about price discrimination [wikipedia.org] than costs, which can be seen by their tying ECC to use of the proper south bridge, which has nothing to do with it.

    • by dargaud ( 518470 )
      I have a friend who had written his own accounting software in the 80s on a 6502 PC. Once there was a discrepancy of a few $ at the end of the month. He spent an entire month backtracking the error through software logic, then software debug, then finally assembly until he found the exact place where a single bit had flipped in memory. Took him a month.
      • Comment removed based on user account deletion
        • Comment removed based on user account deletion
        • by Megol ( 3135005 )

          Apple II, BBC Micro, Commodore 64, Commodore PET, Atari 8bit series etc. There were many alternatives.

        • by Anonymous Coward

          Technically the Commodore 64 used a 6510 rather than a 6502, although in practice the only difference was that the 6510 had an extra 8-bit I/O port used for bank switching memory and talking to the tape drive.

          Commodore owned MOS Technology who made the 6502 so they made quite a few custom variants like this for various computers and devices.

      • Comment removed based on user account deletion
    • What's odd is that ECC is not routinely used in all hardware.

      Nothing odd about it. It costs more, it performs worse, and the vast majority of the incredibly rare errors that are caused end up being entirely non-critical due to the way people generally use computers.

      If you have a database server handling critical information all day then it makes sense. But hell for the vast majority of workloads your computer is more likely to get "Aw. Snap! Something went wrong" Along with a frowny face displayed in your browser. Any time a consumer is doing anything remotely import

      • by arth1 ( 260657 )

        Nothing odd about it. It costs more, It performs worse

        Not always. Modern ECC does the fetch and verification in parallel, negating most of the slowdown. And some registered ECC (which used to be slower) is now faster, as it does pre-fetch before the actual request.

        • Always. Without fail.

          The check and verification process itself was only a small part of memory performance. ECC memory is almost impossible to find at common desktop speeds, with almost all of it sub-3000 MHz except for truly ultra-expensive modules.

          Where someone wants to pay for equal speeds and chooses something like a 2166 module, the ECC memory invariably has far worse latency figures.

          ECC memory has a lower upper speed limit, lower than the actual standard speed capability of a m

    • What's odd is that ECC is not routinely used in all hardware.

      For a lot of systems and uses, the rate of error occurrence doesn't justify the area cost of ECC. For all fabrication processes in the last decade, error rates per SRAM bit have been decreasing faster than the increase in number of SRAM bits, meaning that the total error rates for most chip families have been decreasing. Furthermore, the vast majority of errors in SRAM never propagate to user-discernible outcomes. For these systems, the user is more interested in a lower initial price or better performan

      • by Agripa ( 139780 )

        I think you are confusing SRAM and DRAM.

        DRAM soft error rates leveled off a couple generations ago. SRAM soft error rates are a couple orders of magnitude higher and have remained so for integrated SRAM caches. A discussion of the difference and why it exists would be interesting.

        Other than some odd exceptions, integrated SRAM caches have been protected by ECC or parity almost since they were first used.

        • I think you are confusing SRAM and DRAM.

          DRAM soft error rates leveled off a couple generations ago. SRAM soft error rates are a couple orders of magnitude higher and have remained so for integrated SRAM caches. A discussion of the difference and why it exists would be interesting.

          DRAM per Mbit error rates have not dropped as precipitously as SRAM error rates. Over the last decade, SRAM error rates have dropped by a few orders of magnitude, faster than the increase in the total number of SRAM bits on a chip due to scaling and chip area increase. Ten years ago, the SRAM error rate was quite a bit higher than the DRAM error rate, by about an order of magnitude. Especially with the introduction of FinFET/tri-gate, SRAM error rates have plummeted and are now somewhat lower than that f

    • What's odd is that ECC is not routinely used in all hardware. Depending on the conditions it can be of great help, as the rare bit flip can cause strange problems that can take ages to track down. And it works well for figuring out when you have a bad memory module -- the computer will figure it out on its own.

      Others have already covered the higher cost and performance hit of ECC RAM.

      The most visible symptom of a random bit flip is that your program crashes. The RAM a program occupies far exceeds the R

    • There are so many machines out there with domain names in memory, that squatting domains that are a single bit-flip away can be quite interesting [dinaburg.org].
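      A minimal sketch of what "a single bit-flip away" means for a hostname; the label below is just a placeholder, and only flips that still yield valid hostname characters are kept:

      # Minimal sketch: enumerate hostnames that differ from a target label by
      # a single flipped bit, keeping only valid hostname characters.
      import string

      VALID = set(string.ascii_lowercase + string.digits + "-")

      def bitflip_variants(label):
          seen = set()
          for i, ch in enumerate(label):
              for bit in range(8):
                  flipped = chr(ord(ch) ^ (1 << bit)).lower()
                  if flipped != ch and flipped in VALID:
                      candidate = label[:i] + flipped + label[i + 1:]
                      if candidate not in seen:
                          seen.add(candidate)
                          yield candidate

      for name in bitflip_variants("example"):   # "example" is just a placeholder
          print(name + ".com")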
    • > What's odd is that ECC is not routinely used in all hardware

      I know why. It's a pain to implement on arbitrary logic - as opposed to memory.

      TMR is more appropriate, however the tool support for TMR is still abysmal. Synopsys should have a tmr command you can apply to a module and have it just happen. Instead you waste weeks fighting the optimizer to prevent it removing the TMR you put in manually.

      • For software, you still need to put in the countermeasures by hand, because you've also got things like control flow integrity and other aspects to deal with. Also you don't need the overhead of TMR for all values, just critical system variables and the like, so having a tool try and do it automatically doesn't work.
    • by Agripa ( 139780 )

      What's odd is that ECC is not routinely used in all hardware. Depending on the conditions it can be of great help, as the rare bit flip can cause strange problems that can take ages to track down.

      Whether ECC is used or not depends on the likelihood of an error and how serious the consequences will be. The number of errors depends on how much memory is used (not installed), how long it is used, and oddly enough some factor related to the access rate. Since servers tend to have much more memory and operate for longer times than desktops, ECC makes more sense for them.

      Who cares about errors while playing a game or media or doing consumer type tasks which do not tax the computer? But if my workstatio

  • by Brett Buck ( 811747 ) on Sunday November 25, 2018 @11:59AM (#57696800)

    It seems to make good sense to put in some protections against register or other bit flips; they do happen from time to time. He probably meant cosmic rays instead of gamma rays, but that definitely can happen, and I have spent many, many hours of my life putting things in software that detect these and recover properly. I have one processor type that has something like this about once a month, very consistently, over several decades.

    • If it's being done rarely, and before exceptionally critical operations, then maybe it makes sense. Although, if someone bothered to take it out, then it was probably happening too often and thus affecting performance...

  • by starless ( 60879 ) on Sunday November 25, 2018 @12:03PM (#57696830)

    A real gamma ray wouldn't do much, and would just pass through, unless it pair-converted to an electron and a positron.
    But cosmic rays (charged particles) would be more likely to interact.

    • by Anonymous Coward

      Gamma rays lose energy while passing through materials by knocking electrons around. This can involve many collisions and many displaced electrons depending on the energy of the gamma ray. Higher energy photons will go a ways without interacting much, but as they lose energy collisions can become more frequent and at some point they can quickly dump the rest of their energy in a smaller volume. Charged particles stop by practically the same process, just interact more strongly and so are more likely to d

    • by mnmn ( 145599 )
      I agree. I came here to comment on 'why is this strange' but it looks like many slashdotters (at least the ones with physics backgrounds) feel similarly.

      It seems ridiculous when you take a CPU RMA to Intel for an RCA on some OS crash, but their response is that the CPU is fine, it was a cosmic particle. But it's true, and statistically this can happen to any bit in any register. Especially with the lithography processes producing ever smaller gates with few atoms manning the gate/bit.
  • by Crashmarik ( 635988 ) on Sunday November 25, 2018 @12:12PM (#57696884)

    Your CPU has been asleep for an a priori unknown amount of time; as you power back up, you'd absolutely want to clear the cache to purge any potential bit flips. It's a relatively cheap way of ensuring data integrity.

  • by Laxator2 ( 973549 ) on Sunday November 25, 2018 @12:17PM (#57696904)

    I think they use laptops on the International Space Station and there you are not protected from cosmic rays by the blanket of the Earth's atmosphere. Just read up on the phosphenes experienced by the astronauts as they try to go to sleep.

    Not sure if "gamma rays" is the correct term here, as high-energy protons are most likely to create a local change in electric charge density. With modern processors being built ont the 14 nanometres process this becones a serious problem. All the processors that are used in spacecraft and control vital functions are radiation-hardened. That usually means older fabrication processes (wider paths reduce the probability of cross-talk) and amorphous silicon (a monocrystal can sustain permanent damage from a particle of high enough energy)

    Overall, it does make sense if it is meant to be used in space.

    • by Agripa ( 139780 )

      With modern processors being built on the 14 nanometre process this becomes a serious problem.

      Susceptibility is more complicated than just the minimum feature size. It was a serious problem generations ago.

      Denser processes use gate insulators with a higher dielectric constant to store more charge and also provide more drive for a given area. These things make a process more resistant to radiation induced soft errors. The same things caused the susceptibility of DRAM processes to level off or even decrease slightly starting a couple generations ago.

  • Yeah, I'll bet this was before they discovered that their processor packaging material was radioactive and that was randomly flipping bits. Seriously, radioactive RAM was one culprit which ran Sun Microsystems, Inc. out of business. It took them years to find it. They even started adding ECC to their motherboard data paths, and looked to see if their data centers were near nuclear research facilities. By the time they found it it was too late. ...that and they should have ditched Solaris for Linux, but...
  • Sounds like a smoke screen for something else.

    If the cache is susceptible to random gamma rays, or, more likely, cosmic rays, and has no ECC, it is NEVER trustworthy, and should be permanently disabled.

    It's like the Intel floating point bugs (yes, plural). Since the end user has no idea WHICH of the operations will produce an erroneous result, NONE of the operations' results are usable, ever.

    Could be worse. Intel once had a "genius" purchasing agent that got a "good deal" on clay for the ceramic package o

    • by Agripa ( 139780 )

      If the cache is susceptible to random gamma rays, or, more likely, cosmic rays, and has no ECC, it is NEVER trustworthy, and should be permanently disabled.

      While it is shut down, the cache is not being continuously scrubbed by ECC or parity, allowing bit errors to accumulate and defeat the ECC or parity after it is powered up. Invalidating and reloading the contents of the cache makes perfect sense in this situation.

  • by QuietLagoon ( 813062 ) on Sunday November 25, 2018 @12:51PM (#57697044)
    Nowadays, it probably is far, far more likely that Microsoft's horrendous Windows QA will result in bad data than stray gamma rays flipping bits in a sleeping cache.
  • Comment removed (Score:5, Insightful)

    by account_deleted ( 4530225 ) on Sunday November 25, 2018 @01:08PM (#57697122)
    Comment removed based on user account deletion
    • On occasion, I've had to keep the commented-out code with a comment explaining why that code must not be re-enabled. Otherwise, people keep coming in trying to fix code that's not broken.
      • On occasion, I've had to keep the commented-out code with a comment explaining why that code must not be re-enabled. Otherwise, people keep coming in trying to fix code that's not broken.

        This.

        I've left the wrong code in, commented with a detailed explanation as to why it's wrong, so someone doesn't come and 'fix' it again.

    • Or maybe someone made a mistake. The specification seems to imply you need to flush the cache *BEFORE* entering the S1 state and the hardware is responsible for the rest:

      "15.1.1 S1 Sleeping State
      The S1 state is defined as a low wake-latency sleeping state. In this state, all system context is preserved with the exception of CPU caches. Before setting the SLP_EN bit, OSPM will flush the system caches. If the platform supports the WBINVD instruction (as indicated by the WBINVD and WBINVD_FLUSH flags in the

  • Microsoft code does not contain comments.

    To thwart lawyers finding out the true intentions of the strategies, Bill Gates decreed that the code should not have comments. Famously he said, "I am paying you to write code, not comment."

  • by nbvb ( 32836 ) on Sunday November 25, 2018 @03:04PM (#57697598) Journal

    Anyone surprised by this must not have been around during the UltraSPARC days ....

    I must’ve replaced 1000+ of those damn chips when the “Sombra” modules came out. Mirrored SRAM to protect against the ecache bit-flips. Kernel panics due to “ecache parity errors” were so common ....

    Cache scrubbers in the Solaris kernel. Replacement CPUs. All of it helped.

    This stuff is real and painful if you had a data center full of gear susceptible to it.

  • by toxygen01 ( 901511 ) on Sunday November 25, 2018 @04:15PM (#57697954) Journal
    A friend of mine, a developer of spreadsheet SW back in the days of DOS and Norton Commander, had one customer who would keep complaining about the SW crashing from time to time. These kinds of crashes would only happen to this customer and no other.

    He installed a debug build on the customer's site and waited... and sure enough, the SW would crash, and crash again and again... at completely random places in the code. In some cases there was literally no way those lines of code could make the program crash under any circumstances.

    Well, he spent days trying to debug it and came up empty handed. Until it struck him to look at the time when the SW was crashing. And sure enough, it was crashing on one particular day of the week, usually within a span of a few hours during that day. Now comes the interesting part -- the customer's site was actually a railway station on the Slovakia-Ukraine border (in a town called Uzghorod). So he called the customer to ask if there was a train in the station regularly on that day and hour every week and voila, there was one train coming from Ukraine to Slovakia with some goods. So he asked the customer to take a Geiger counter and see if there was anything going on in the air.

    They found out one of the train cars was radiating like hell. It was used for transferring spent nuclear fuel before. And Ukrainians thought they would save some money by using it for regular cargo after EOL. I wouldn't like to be a person living near those railway tracks...

    tl;dr
    Spreadsheet SW was crashing on the computers in the train station, and thanks to customer complaints they found out the crashes were caused by a radioactive train coming regularly to the station.
  • This is actually pretty common and has gone on for a long time, especially on systems that were striving to be low-to-zero downtime.

    Some of the idle processing on AS/400s would periodically re-write the microcode from disk. When I asked a core developer why, they cited gamma rays flipping a bit. I then asked if a lead umbrella wouldn't do the job better, and they said yes, but the umbrella would have to be about six feet thick.

    • by Agripa ( 139780 )

      Some of the idle processing on AS/400s would periodically re-write the microcode from disk. When I asked a core developer why, they cited gamma rays flipping a bit. I then asked if a lead umbrella wouldn't do the job better, and they said yes, but the umbrella would have to be about six feet thick.

      Cache and memory scrubbing is a standard feature even on x86 consumer desktop processors whether the user has access to it or not. Motherboards which support ECC memory may make the settings which control scrubbing available in the BIOS. Scrubbing applies to every level of cache which is ECC or parity protected and to main memory if ECC protected.

  • Cosmic rays causing RAM errors is a thing. Scientists estimate it will happen to PCs, at ground level, about once a year. Surprisingly, which year does not matter much because as the tech gets smaller, the capacity gets larger, so the die size stays about the same.

    Once a year might not sound like much, but that is not "at the end of the year", it can happen right away. Chance is strange that way. 8-)

    MS should probably -not- have commented it out...
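    A quick illustration of the "it can happen right away" point above, assuming (as the comment does) an average of one flip per machine per year and, as an extra modeling assumption, that flips arrive independently (a Poisson process):

    # Sketch: probability of seeing at least one bit flip within N days,
    # assuming a Poisson process averaging 1 flip per year.
    import math

    RATE_PER_YEAR = 1.0

    for days in (1, 7, 30, 365):
        p = 1 - math.exp(-RATE_PER_YEAR * days / 365)
        print(f"P(>=1 flip within {days:>3} days) = {p:.1%}")

    Under that assumption there is already roughly an 8% chance of a flip in the first month, which is why "once a year" does not mean "safe until December".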

    • by Agripa ( 139780 )

      Cosmic rays causing ram errors, is a thing. Scientists estimate it will happen to PCs, at ground level, about once a year. Surprisingly, which year does not matter much because as the tech gets smaller, the capacity gets larger, so the die size stays about the same.

      Moore's law is about economics and includes cost reduction per transistor from increasing die size so I wonder if the total die area of memory has actually increased at the high end of consumer hardware.

      Once a year might not sound like much, but that is not "at the end of the year", it can happen right away. Chance is strange that way. 8-)

      MS should probably -not- have commented it out...

      A couple of DRAM generations ago it was something like 1 bit per year per gigabyte but later DRAM generations actually improved slightly. My workstations went from 2GB to 8GB in my current one and my next will likely be 64GB but they all use 4 x dual sided DIMMs so the same number of chips but the silicon c

      • Magnetic media is not so prone to this. But this makes me wonder if the SSD drives we are all using now are having this problem?

        Maybe SSDs have better data check and correction functions, but maybe we should keep a hard drive in our computers to reload the SSD, if necessary.

        • by Agripa ( 139780 )

          Magnetic media is not so prone to this. But this makes me wonder if the SSD drives, we are all using now, are having this problem??

          Maybe SSDs have better data check and correction functions, but maybe we should keep a hard drive in our computers to reload the SSD, if necessary.

          Both hard disk drives and solid state drives use block based error correction. Several bad bits can be corrected in each sector and sectors may even be considered good with several bad bits below a specified threshold.

          Where SSDs compare poorly to HDDs is endurance and retention time but as long as they are not used for unpowered offline storage like a hard drive might be, retention time is not a problem and few users are going to reach endurance limits. There is a new standard for SSD retention time but I

  • A quick read through the ACPI specification implies that the caches should be flushed *before* entering the S1 state, letting the hardware deal with the rest.

    I'm not sure what to make of the comment. Part of the comment makes it appear as though this instruction comes after waking (making it pointless since the cache is already invalid). If this comment is about before going into the sleep state then it wasn't a manufacturer who asked for this, it was the ACPI specification itself, and not flushing the ca
