
As the Largest Computer Networks Continue To Grow, Some Engineers Fear that Their Smallest Components Could Prove To Be an Achilles' Heel (nytimes.com)

An anonymous reader shares a report: Imagine for a moment that the millions of computer chips inside the servers that power the largest data centers in the world had rare, almost undetectable flaws. And the only way to find the flaws was to throw those chips at giant computing problems that would have been unthinkable just a decade ago. As the tiny switches in computer chips have shrunk to the width of a few atoms, the reliability of chips has become another worry for the people who run the biggest networks in the world. Companies like Amazon, Facebook, Twitter and many other sites have experienced surprising outages over the last year. The outages have had several causes, like programming mistakes and congestion on the networks. But there is growing anxiety that as cloud-computing networks have become larger and more complex, they are still dependent, at the most basic level, on computer chips that are now less reliable and, in some cases, less predictable. In the past year, researchers at both Facebook and Google have published studies describing computer hardware failures whose causes have not been easy to identify. The problem, they argued, was not in the software -- it was somewhere in the computer hardware made by various companies. Google declined to comment on its study, while Facebook did not return requests for comment on its study.

"They're seeing these silent errors, essentially coming from the underlying hardware," said Subhasish Mitra, a Stanford University electrical engineer who specializes in testing computer hardware. Increasingly, Dr. Mitra said, people believe that manufacturing defects are tied to these so-called silent errors that cannot be easily caught. Researchers worry that they are finding rare defects because they are trying to solve bigger and bigger computing problems, which stresses their systems in unexpected ways. Companies that run large data centers began reporting systematic problems more than a decade ago. In 2015, in the engineering publication IEEE Spectrum, a group of computer scientists who study hardware reliability at the University of Toronto reported that each year as many as 4 percent of Google's millions of computers had encountered errors that couldn't be detected and that caused them to shut down unexpectedly. In a microprocessor that has billions of transistors -- or a computer memory board composed of trillions of the tiny switches that can each store a 1 or 0 -- even the smallest error can disrupt systems that now routinely perform billions of calculations each second.

  • Ghost in the Machine (Score:5, Informative)

    by King_TJ ( 85913 ) on Tuesday February 08, 2022 @11:55AM (#62249785) Journal

    I remember people writing about this issue a LONG time ago. I thought there was a belief that things like cosmic gamma rays would cause a certain number of errors in complex systems, especially as they kept shrinking the die sizes for CPUs?

    • by i.r.id10t ( 595143 ) on Tuesday February 08, 2022 @12:02PM (#62249817)

      There was also a /. story a while back about bit-flip errors in RAM changing the Windows Update domain name as a machine makes its request; a researcher worked out all the valid one-bit-off variants, registered a bunch of them, and was getting traffic to them...
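
      For anyone curious, that trick (usually called "bitsquatting") is easy to sketch. Here's a minimal Python illustration, assuming a made-up domain name, that just enumerates every single-bit flip of each character and keeps the ones that are still valid hostname characters:

      import string

      # Characters allowed in a hostname (ignoring the leading/trailing-hyphen rules)
      VALID = set(string.ascii_lowercase + string.digits + "-.")

      def bitsquat_variants(domain):
          """Return all domains that differ from `domain` by exactly one flipped bit."""
          variants = set()
          for i, ch in enumerate(domain):
              for bit in range(8):
                  flipped = chr(ord(ch) ^ (1 << bit)).lower()
                  if flipped in VALID and flipped != ch:
                      variants.add(domain[:i] + flipped + domain[i + 1:])
          return variants

      # Hypothetical example domain, not the real Windows Update hostname.
      for v in sorted(bitsquat_variants("update.example.com"))[:10]:
          print(v)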

    • by ThosLives ( 686517 ) on Tuesday February 08, 2022 @12:08PM (#62249847) Journal

      Yup, it comes up every time someone says "get those safety critical applications on modern process nodes!" like with automotive/aerospace/military chips.

      Two (of many) ways to get robustness are to use larger process nodes with bigger band gaps and to add redundancy like ECC, lockstep cores, and the like.

      • Meaning often the solution could be to spend more money and/or get a worse part (runs hotter, has half the externally visible cores). Make bigger chips using technology we understand better. Add validation and built in checks inside the hardware, which likely slow down the maximum operating speed. Use older simpler, well tested designs that don't have some performance optimizations like speculative execution.

        Sadly ECC is considered a "server" grade feature by Intel. Meaning they charge WAY more for it t

        • "runs hotter, has half the externally visible cores"
          Actually it has a third of the visible cores, and arbitration hardware in addition to that.
          It's a solved problem for those that actually choose to pay for it.

      • Two (of many) ways to get robustness are to use larger process nodes with bigger band gaps and to add redundancy like ECC, lockstep cores, and the like.

        That's also a great way to get reduced functionality. You can see that trend in actual reliability figures for safety systems: as processor nodes have gotten smaller, reliability has *improved*, because far more processing power is available to dedicate to internal diagnostics and redundancy.

        Old and large != better, especially when you start questioning whether your entire program can actually execute in the 300ms you have dedicated as your system scan time. Or my old-school favourite, the nuclear 1E certified Trico

      • by tlhIngan ( 30335 ) <slashdot@worf.ERDOSnet minus math_god> on Tuesday February 08, 2022 @04:04PM (#62250773)

        Yup, it comes up every time someone says "get those safety critical applications on modern process nodes!" like with automotive/aerospace/military chips.

        Two (of many) ways to get robustness are to use larger process nodes with bigger band gaps and to add redundancy like ECC, lockstep cores, and the like.

        Well, a lot of applications don't need ultra-fast cores. Most vehicle processors for stuff like ABS do feature lock-step cores plus voltage and clock monitoring (in case the power supply goes out of spec - and given the crappy nature of automotive electrical supply, with possible voltage swings from -110V to +45V, it's a wonder it doesn't happen more often; the battery is so essential in helping regulate the voltage that if it fails, you can see those swings). Clock monitoring means the incoming clock and the various oscillators inside the chip, including PLL outputs, are monitored to stay in spec for both frequency and duty cycle.

        And still, you don't need a fast core - how much power is needed to run an airbag? You're sampling maybe 5-10 accelerometers 1000 times a second, seatbelt and seat occupancy sensors at the same rate and basically trying to determine if the car is undergoing a crash so you can decide to fire the airbags or not. Not exactly something that requires a dozen 3GHz cores to do.

        Even ADAS functions aren't all high powered - a lane keeping assist requires a camera and image processing to find the lane markings but even then it's not going to require a ton of power (the camera is likely to be 1080p or less anyways).

        The only application right now requiring lots of computational power is self-driving. The basic functions don't, and they're usually separated into several different boxes to make them individually cheaper to replace and easier to test.

        • by dohzer ( 867770 )

          Clock monitoring means the incoming clock and the various oscillators inside the chip, including PLL outputs, are monitored to be in spec both frequency and duty cycle.

          I once saw someone accidentally add a 'k' to the value of a series resistor between an oscillator and an automotive processor. It didn't like that too much, and would essentially go into limp-home mode immediately upon powerup until it was replaced with the "correct" low-resistance resistor.

    • Re: (Score:3, Interesting)

      by iggymanz ( 596061 )

      Cosmic rays usually aren't gamma rays but atomic nuclei, though; they make showers of secondary particles in the atmosphere. Those can be x-rays, muons, protons, antiprotons, alpha particles, pi mesons, electrons and positrons, and neutrons.

    • by lexios ( 1684614 )
      While theoretically possible, cosmic gamma rays might not be the most likely. Paco Hope mentions a few good points on his Cosmic Rays Did Not Change Election Results page: https://blog.paco.to/2017/cosmic-rays-did-not-change-election-results [blog.paco.to] like:
      • Absence of Evidence is not Evidence of Absence
      • Extraordinary Claims Require Extraordinary Evidence

      Back to the linked article, the suspected/suggested cause is more in the direction of manufacturing defects.

      DRAM chip reliability is ~5.5x worse in DDR4 compared to DDR3.

      from: https://www.amd.com/system/files/documents/advan [amd.com]

    • by kyoko21 ( 198413 )

      I knew someone was going to say "Ghost in the Machine". Glad to know I'm not the only one. :-)

    • by kbahey ( 102895 )

      cosmic gamma rays

      You mean 'cosmic rays'.

      They are not gamma rays, but rather certain particles traveling at relativistic speeds.
      The main components are just naked protons, and alpha particles (basically a nucleus of helium).
      Alpha particles are stopped by a sheet of paper.

      But those don't usually reach us on the surface of the earth. They hit the upper part of our atmosphere, and cause secondary particles (e.g. mesons such as kaons and pions, and muons) to rain down on the surface.

      Not sure how deep those secondary ones

      • I love the diversity of minds on Slashdot. It's amazing what you can learn reading the comments here, be it from hobbyists with an intense interest in a field or a proven expert with degrees and years of experience in the subject matter.

        Even if that person is wrong, it can trigger a thought that leads you to learning more about the most random things. I consider myself a huge nerd and love science fiction (and science fact) and my brain just absorbs information, most of it useless. Yet I was so clueless abo

        • by kbahey ( 102895 )

          When it comes to physics and cosmology, I am firmly in the hobbyist camp.
          But over the past several years, I have been trying to catch up on what transpired between when I was in high school (mid to late 70s) and now.

          Also, I am currently testing my house for radon, so did some reading on that too. The detector works by detecting alpha decay, so I was curious if cosmic ray alpha would interfere. But since cosmic alpha (if they manage to reach the surface of earth and not hit oxygen and nitrogen in our atmosph

    • Cosmic rays DO pose a hazard.

      In the early days of electronics the parts were physically too big to be affected by a stray cosmic ray, but as parts and chip dies shrank, the problem became real.

      I remember telling some people about this in the early 1980s, and they laughed at me. And these, sad to say, were actual scientists. They thought the very idea was hilarious and naive and kooky. They hooted at the idea of "rays from outer space" doing anything. They hooted and had a good laugh over it.

      40 years ago the

    • Years ago I worked for a system manufacturer. We had issues with a certain gate array that were mitigated by adding a metal shield over it, so incident radiation, at least with GaAs back then, was a real factor.
  • by davidwr ( 791652 ) on Tuesday February 08, 2022 @12:04PM (#62249827) Homepage Journal

    As long as you "fail gracefully" and don't, say, crash an airplane, you should be okay.

    So what if a computer hosting one of Google's many redundant servers crashes or declares itself "unreliable", as long as it and the systems that depend on it KNOW it crashed? You lose a few seconds or minutes while the backups kick in.

    Now, if you are doing something that's either going to cost lives or lots of money if it's delayed by even a few minutes, then you'll need to pay more up front for good, real-time redundancy.

    • Someone should develop something like...error-correcting algorithms and fault-tolerant design because failure is as certain as death and taxes.

    • Think it's the "declares itself 'unreliable'" part that's at issue. "Silent errors" was what they'd mentioned. Not "hit a bugcheck and rebooted". It's when "2 + 2 = 5" and you think everything is fine. Maybe after a while, enough things go wrong and it'll be obvious something really bad is going on.

      Also just because we've been forced to live with unreliable hardware + software, and told to just reboot...doesn't mean that's OK. It's just we can't choose anything else. How many people would likely be ha

    • by necro81 ( 917438 )

      Now, if you are doing something that's either going to cost lives or lots of money if it's delayed by even a few minutes, then you'll need to pay more up front for good, real-time redundancy.

      To wit: the Space Shuttle had multiple computers running simultaneously (I've seen references to 3, 4, and 5...go figure). If one of them went kerflooey, the others would out-vote it and continue flying. Especially in the space environment, this was not infrequent.

      Another example: the original Segway [google.com] (and the st

      • Multiple redundancy and voting is standard practice for anything of consequence. Google, Facebook, etc., do nothing of consequence.

  • An alpha particle hits a bit of cache, and flips it. This is known. I have personally seen Excel add up a column of numbers incorrectly; after a refresh, it was correct.

    How often this happens, would be interesting to discover. If you want to avoid it, about the only option is running parallel systems, and having them vote on critical results.
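
    A minimal sketch of that run-it-in-parallel-and-vote idea, in Python, assuming the computation is a pure function you can simply call several times (a real system would run the copies on separate hardware):

    from collections import Counter

    def voted(compute, *args, copies=3):
        """Run `compute` several times and return the strict-majority result."""
        results = [compute(*args) for _ in range(copies)]
        value, count = Counter(results).most_common(1)[0]
        if count * 2 <= copies:
            raise RuntimeError(f"no majority among results: {results}")
        return value

    # Toy usage: a deterministic function always agrees with itself.
    print(voted(sum, [1, 2, 3]))  # -> 6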

    • by jabuzz ( 182671 )

      What utter bullshit. Unless you are storing radioisotopes inside your computer case - and even then, how the hell does it get inside a chip? Pray explain how you think an alpha particle can get inside your computer when it can't get through a piece of paper?

      Next time do some basic high school physics before typing stupid comments like that.

      • Carbon-14 in IC package plastic, beta decay, seems a possibility. I'm not competent to make that judgement.
  • yes. solar flares. https://www.johndcook.com/blog... [johndcook.com]
  • Facebook did not return requests for comment on its study

    Their phones were shut off because they hadn't paid the bill.

  • by leptons ( 891340 ) on Tuesday February 08, 2022 @12:12PM (#62249863)
    There should be a law that parallels Moore's law: the more transistors packed into a square millimeter, the more unexpected problems will happen.

    But I still have to wonder how many of the problems they found were caused by heat or by some component failing that wasn't silicon-based.
    • by ceoyoyo ( 59147 ) on Tuesday February 08, 2022 @12:17PM (#62249879)

      There is. It's named after Moore's drunk Irish cousin Murphy.

      • by Tablizer ( 95088 )

        But he allows Mulligans.

        • by ceoyoyo ( 59147 )

          After a big enough sacrifice to Murphy I generally prefer a Jameson, but Mulligan is also okay.

          I think Windows users call it "the three fingered salute."

    • by ghoul ( 157158 )
      Doesn't matter. If by getting twice the transistors you lost 10% reliability, you still have 1.8 times the transistors, and even if you dedicate 30% of those extra transistors to running redundancy and error-correction software, you are still coming out ahead in computing capacity.
      Only when the error rate and the burden of error correction wipe out all the extra computing power will it no longer make sense to shrink further.
  • Has F.T.T.R. ( Fault Tolerance Through Redundancy ) been discarded in the face of competitive pressure to minimize feature dimensions in the sub-micron world?
    As scale shrinks to single-digit nanometers, and wafers expand from 300 mm to 400 mm, it's not hard to imagine that single-bit failure modes will proliferate. The best ECC available is still only as good as the limits of the physical geometry permit. Cutting feature size by 50% gains little if redundant circuits eat up most ( or all ) the space tha

    • Has F.T.T.R. ( Fault Tolerance Through Redundancy ) been discarded in the face of competitive pressure to minimize feature dimensions in the sub-micron world?

      I'm no expert so I'm just guessing; but at some point will the size, power, and possibly even speed gains made via miniaturization be offset by the redundant hardware and the need to check its results too? We may have reached the point of diminishing returns. Once the hardware's behaviour is no longer deterministic, how many levels of redundancy do you need before you consider computations to be reliable?

  • by rsilvergun ( 571051 ) on Tuesday February 08, 2022 @12:20PM (#62249899)
    your programmers need to build retrying into their design. Cheap programmers don't like doing this. They want to assume everything is always going to work the first time. That's because if it doesn't, they can shift blame to whatever caused the failure. With cheap programmers the buck never stops at them.

    Thing is, if you're an Amazon, closing these gaps makes sense: it's worth tens of millions of dollars. Smaller companies just have manual and back-office processes to deal with it. For *really* small companies, the owner's wife fixes it :).

    Basically, the big guys will babysit their cheap programmers and the smaller ones will make offline exception processes. Problem solved, like it's been for ages.
    • by kmoser ( 1469707 )
      Fault-tolerance and redundancy aren't just a software thing; they're a hardware thing, too. No amount of retrying and error checking in your code will help when your DB server is down. See Tandem Computers.
    • I heard from a satellite designer about this. She said they had to write software to continue the mission in spite of radiation flipping bits at random. So the programming techniques exist, though I don't know them well enough to say whether they could scale to running AWS or the Googleplex.

      • by ras ( 84108 )

        I heard from a satellite designer about this. She said they had to write software to continue the mission in spite of radiation flipping bits at random. So the programming techniques exist, though I don't know them well enough to say whether they could scale to running AWS or the Googleplex.

        If you do it in software, you do things like writing everything to two RAM locations and verifying they are the same when you read them back. Or you have three computers that cross-check each other's outputs and reboot the one that's wrong.

        It does work, and it's even practical if you have gobs of CPU power lying around unused. But it's slow, and it's a very expensive use of hardware. Compare the "write twice, read twice" solution to just having ECC built into the chip's DRAM controller. It's true the ECC che
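
        To make the "write it twice and compare on read" idea concrete, here is a toy Python sketch (hypothetical class name; a real implementation lives much closer to the metal and also has to decide what to do when the copies disagree):

        class DuplicatedStore:
            """Keep two copies of every value and compare them on every read."""

            def __init__(self):
                self._a = {}
                self._b = {}

            def write(self, key, value):
                self._a[key] = value
                self._b[key] = value

            def read(self, key):
                a, b = self._a[key], self._b[key]
                if a != b:
                    # A real system might fall back to a third copy or re-read;
                    # here we just refuse to return possibly corrupt data.
                    raise RuntimeError(f"silent corruption detected for {key!r}")
                return a

        store = DuplicatedStore()
        store.write("balance", 100)
        print(store.read("balance"))  # -> 100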

        • So it shows the paradox: those "server" features that Intel touted as premium are actually the least needed in big datacenters, because they own a huge number of computers and can do redundancy at the cluster level. It is personal computing that actually needs ECC the most, because we can't afford to recompute stuff thrice and vote the wrong result out. Nor do we have a huge RAID1 array in our storage.
          • by ras ( 84108 )

            I think you are missing the point. Redundancy doesn't help if you don't know it failed. And that's what this story was about - it failed, and they didn't find out until later, and now they have no idea what caused it, so no idea what they should do to fix it, so the failures will continue to happen. I think this is fairly obvious, as Google in particular has redundancy up the wazoo, and it didn't help. There is no way it can help if the failure turned $1 into $1M, and you only discover it when the cu

    • your programmers need to build retrying into their design

      Retrying isn't enough, and it's actually the easier part of the problem to solve. The hard part is to identify that something went wrong so you know that you need to retry. Sometimes low-level errors provoke visible errors, but that's by no means guaranteed.
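
      That split between detection and retry is worth making explicit. A rough Python sketch, assuming you have some independent check (a checksum, an invariant, a second computation) that can tell you a result is plausible; without such a check, retrying is just hoping:

      def compute_with_retry(compute, is_plausible, attempts=3):
          """Retry `compute` until `is_plausible` accepts its result."""
          last = None
          for _ in range(attempts):
              last = compute()
              if is_plausible(last):
                  return last
          # A silent error that still passes the check will sail through;
          # the quality of `is_plausible` is the whole game.
          raise RuntimeError(f"no plausible result after {attempts} attempts")

      # Toy usage: the "check" is just a range sanity test on the result.
      print(compute_with_retry(lambda: sum(range(100)), lambda r: 0 <= r < 10_000))  # -> 4950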

    • Cheap programmers don't like doing this.

      WTF is a cheap programmer?? It sounds highly appealing - fewer lines of code - but I'm guessing you meant "lazy" and were merely having a stroke.

    • your programmers need to build retrying into their design.

      And negative testing into their QA/QC procedures. There's nothing quite like rolling out a product with lots of parts spread across a wide area only to find out that while they did set up a "retry" in their design, it only worked on first power-on, not if the comms gateway subsequently went offline and was brought back online an hour later.

      I drew a long straw, luckily. I didn't need to drive to every gas well and turn our sensors off and on again.

  • by QuietLagoon ( 813062 ) on Tuesday February 08, 2022 @12:27PM (#62249917)
    ... as the cloud seems to use components in ways they cannot be tested.
    • Switching back from "personal" computing to "time share" as we have done and called it "the cloud" is THE tragedy of our time. Sure, some things need centralization but surely not everything. Yes, some big hardware is needed for big problems. The real motivation for this giant step backward away from personal computing is money. Also, I think, fear; take e-mail, I can't send email from my computer to my neighbor without involving some mail server in the cloud, and that is because server technology is "t
      • Switching back from "personal" computing to "time share" as we have done and called it "the cloud" is THE tragedy of our time.

        In a word: yup. Next, there will be a great migration away from the cloud. Maybe it will be called "client-server computing" or something like that....
      • There's nothing wrong with timeshare. There's nothing inherently proprietary about plugging in a computer, giving it electricity, making sure it stays on and connecting it to the internet.

        What should have happened was that it was so simple that margins were nearly eradicated. "Either I buy this server, and pay someone to keep it going, or you do. But you have the benefits of economies of scale so you can do it for even cheaper than me, get larger bulk discounts, save on connectivity since it's all in one

    • ... as the cloud seems to use components in ways they cannot be tested.

      You're assuming that not using the cloud makes you immune. In reality all it does is make it far more likely that if you do suffer from a problem you'll have an even harder time recovering from it.

  • by dsgrntlxmply ( 610492 ) on Tuesday February 08, 2022 @12:39PM (#62249959)

    We had a commercial system deployed in 1977-78 that used ceramic packaged DRAM. We had parity protection on the memory. The system would reboot and log a distinctive error code. We tested the parity logic and thoroughly examined memory timing in the design process. As the number of deployed systems went up into the hundreds, these reboots became an increasing concern. Around the same time I believe HP published findings that discovered alpha particle events caused by radioactive trace elements in the chip package ceramics. These reached the electronics trade press, and the reported error rate explained around 1/3 of our crash rate on this error code. Various software errors could fall into the parity error path, giving the other 2/3 (8080 assembly code, no memory mapping or protected regions).

    More recently I worked in telecom equipment. We observed certain network failures to be caused by corruption of an on-chip routing table memory in a SoC that lacked parity protection for these tables. The failure rate was low, but was higher in units that were deployed at higher altitudes in the American West. The problem was eventually mitigated by software table sweeps, and in newer SoC generations, by parity in critical tables. These Single Event Upsets are said to be caused by muons generated by high energy cosmic ray interactions in the upper atmosphere.

    Third semester undergrad physics offers a homework problem where you are fed a few assumed figures, then asked to calculate and explain how a muon generated at altitude can possibly make it to sea level, given observed muon decay times. Answer: relativistic time dilation.
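
    The arithmetic behind that answer is short enough to show. A rough Python version using commonly quoted figures (muon lifetime about 2.2 microseconds at rest, speed about 0.998c, production altitude around 15 km; the exact numbers vary by textbook):

    import math

    C = 3.0e8          # speed of light, m/s
    TAU = 2.2e-6       # muon mean lifetime at rest, s
    V = 0.998 * C      # typical muon speed
    ALTITUDE = 15_000  # typical production altitude, m

    # Naive (non-relativistic) expectation: mean distance travelled before decay.
    naive_range = V * TAU                        # ~660 m, nowhere near sea level

    # With time dilation, the lifetime in our frame is stretched by gamma.
    gamma = 1.0 / math.sqrt(1.0 - (V / C) ** 2)  # ~15.8
    dilated_range = V * gamma * TAU              # ~10 km, comparable to the altitude

    print(f"naive range:   {naive_range:.0f} m")
    print(f"gamma:         {gamma:.1f}")
    print(f"dilated range: {dilated_range:.0f} m")
    # Fraction expected to survive the trip down (exponential decay over the dilated range):
    print(f"surviving fraction: {math.exp(-ALTITUDE / dilated_range):.2f}")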

  • There are Single Event Upsets (SEU) caused by radiation. The source doesn't matter; there is radiation from everything: rocks, bananas, cosmic rays, the sun. In space there is so much more radiation that satellites have to be designed specifically to counter it. You can't just send a Ryzen up into space; the radiation would cause its transistors to switch on permanently (these latchups need to be reset), and you can also get bit flips which cause errors in computation. Satellites are designed to detect an

    • ECC cache is standard now because of the bit flip problem. But other parts can still be affected.

      • And ECC won't solve latch ups, where a cosmic ray blasts through transistors and switches them on until they are unpowered.

    • by jezwel ( 2451108 )

      On earth, not much is done for radiation, and there isn't as much

      Ahhh so this is why you see in some anime that a system is controlled by 3 separate supercomputers where majority rules vote on what changes should be enacted - the chance of the same bit flip occurring on two separate devices for the same decision becomes infinitesimally small.

      • I think in Anime the presumption is usually that the computers are intelligent or nearly so and one might go rogue.

        Having redundant computer systems to guard against failures, though, goes back at least to the space program.

  • by Archtech ( 159117 ) on Tuesday February 08, 2022 @12:45PM (#62249997)

    I wonder whether mainframes such as IBM's are less prone to such random errors? They are certainly designed with a far greater emphasis on reliability.

    • The mainframe itself might be less prone, but the issue would be with other equipment like the network. For example, I have personally experienced an unmanaged network switch randomly dropping 25% of the packets. This kind of failure was hard to figure out as it worked most of the time. Granted, it was a consumer model with very little redundancy.
    • by psergiu ( 67614 )

      The HP PA-RISC machines were designed with this in mind - everything in the later PA8xxx server-grade machines had ECC: memory, buses, CPU cache... The system logs would show when there was a corrected single-bit error, and the rare uncorrectable multi-bit errors would be detected and worked around (if the RAM page was unused, mark it as unavailable; if used by a user process, kill it and alert; if used by the kernel, crash). But they were expensive, so ...

    • I wonder whether mainframes such as IBM's are less prone to such random errors?

      Yes. Aside from basics like ECC-everywhere and not trying to squeeze the last few MHz out of the silicon, some fault-tolerant models do things like having CPU pairs operating in lockstep. Each instruction is performed twice in parallel, and the results are compared. If they differ, they automatically retry to isolate the failure.
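
      A software analogue of that compare-and-retry behaviour, as a rough Python sketch (real lockstep happens in silicon, cycle by cycle; this only captures the logic of running the work twice, comparing, and retrying on a mismatch):

      def lockstep(compute, *args, max_retries=2):
          """Run `compute` twice, compare the results, and retry the pair on a mismatch."""
          for _ in range(max_retries + 1):
              a = compute(*args)
              b = compute(*args)
              if a == b:
                  return a
          # Repeated disagreement suggests a hard fault rather than a transient one.
          raise RuntimeError("persistent mismatch between lockstep copies")

      print(lockstep(pow, 2, 10))  # -> 1024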

  • in memory at the very least.

    and at some point, they'll have to do something that verifies proper execution in instruction sequences in processors too.

    don't space programs use more than one processor and make sure they agree ? a brute force approach, but if you have a billion transistors to play with you can certainly come up with consistency checking circuits of some sort.

    i once read a really interesting article (no, i can't find it) where sections of the instruction execution unit were duplicated and cons

  • The problem is as old as computing itself. It's not a bug, it's a feature
  • Why don't you? Maybe because these are absolute standard engineering approaches to this problem and most coders are so far from being engineers that they do not even understand them?

  • Typical clickbait nonsense. Random bit flips in memories are easily addressed via error correction codes, and hardware integrity via built-in scan tests.
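
    "Easily" in principle, anyway; the textbook building block is a Hamming code. Here is a small Python sketch of Hamming(7,4), which protects 4 data bits with 3 parity bits and can correct any single flipped bit (real DRAM ECC uses wider SECDED codes, but the idea is the same):

    def hamming74_encode(d):
        """Encode 4 data bits (a list of 0/1) into a 7-bit Hamming codeword."""
        d1, d2, d3, d4 = d
        p1 = d1 ^ d2 ^ d4
        p2 = d1 ^ d3 ^ d4
        p3 = d2 ^ d3 ^ d4
        # Codeword positions 1..7 are: p1 p2 d1 p3 d2 d3 d4
        return [p1, p2, d1, p3, d2, d3, d4]

    def hamming74_decode(c):
        """Correct a single flipped bit (if any) and return the 4 data bits."""
        c = list(c)
        s1 = c[0] ^ c[2] ^ c[4] ^ c[6]    # parity over positions 1,3,5,7
        s2 = c[1] ^ c[2] ^ c[5] ^ c[6]    # parity over positions 2,3,6,7
        s3 = c[3] ^ c[4] ^ c[5] ^ c[6]    # parity over positions 4,5,6,7
        error_pos = s1 + 2 * s2 + 4 * s3  # 0 means no error detected
        if error_pos:
            c[error_pos - 1] ^= 1         # flip the bad bit back
        return [c[2], c[4], c[5], c[6]]

    word = hamming74_encode([1, 0, 1, 1])
    word[4] ^= 1                          # simulate a single-event upset
    print(hamming74_decode(word))         # -> [1, 0, 1, 1]
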
  • by Retired Chemist ( 5039029 ) on Tuesday February 08, 2022 @04:00PM (#62250753)
    For any finite probability of an error occurring, if the system becomes large enough, the probability of an error will approach certainty. A system that fails more than a small percentage of the time becomes useless. I can still remember the frustration of trying to use an Apple SE/30 and the constant reboot-or-retry screens that it produced. It always rebooted whatever you clicked, and you lost all your work since the last save. Before the Apple defenders arise in fury, I am well aware that the problems were probably more with the software than with the computer or the operating system, but it did not help at the time. Also, with silent errors there is always the possibility of the computer producing the wrong answer. Adding more data checking is, of course, a corrective, but that increases the complexity and so increases the chance of an error. As a system becomes sufficiently complex, it becomes increasingly unreliable. Which may explain humanity.
  • by jd ( 1658 )

    The problem's not in the software? Hmmm. Compilers are also very complex and, other than VST, I can't think of any toolchain that is proven to generate binaries that match the code. The problem could therefore be entirely in the software and you'd have no way to prove it.

    Then there's the operating system. Other than SEL4, there are very few operating systems out there that are provably reliable. Even the best Linux distros are only rated at something like five nines reliability, and I'll bet you a dozen dou
