
Airbus A350 Software Bug Forces Airlines To Turn Planes Off and On Every 149 Hours (theregister.co.uk)

An anonymous reader quotes a report from The Register: Some models of Airbus A350 airliners still need to be hard rebooted after exactly 149 hours, despite warnings from the EU Aviation Safety Agency (EASA) first issued two years ago. In a mandatory airworthiness directive (AD) reissued earlier this week, EASA urged operators to turn their A350s off and on again to prevent "partial or total loss of some avionics systems or functions." The revised AD, effective from tomorrow (26 July), exempts only those new A350-941s which have had modified software pre-loaded on the production line. For all other A350-941s, operators need to completely power the airliner down before it reaches 149 hours of continuous power-on time.

Concerningly, the original 2017 AD was brought about by "in-service events where a loss of communication occurred between some avionics systems and avionics network" (sic). The impact of the failures ranged from "redundancy loss" to "complete loss on a specific function hosted on common remote data concentrator and core processing input/output modules." In layman's English, this means that prior to 2017, at least some A350s flying passengers were suffering unexplained failures of potentially flight-critical digital systems.

This discussion has been archived. No new comments can be posted.

  • Huh.... (Score:5, Funny)

    by Rick Zeman ( 15628 ) on Thursday July 25, 2019 @05:12PM (#58987216)

    Who knew Airbus was running Windows 95?

    • Re:Huh.... (Score:5, Funny)

      by rsilvergun ( 571051 ) on Thursday July 25, 2019 @05:27PM (#58987338)
      You have a Windows 95 install with 149 hours of uptime? How?
      • Re:Huh.... (Score:5, Funny)

        by Anonymous Coward on Thursday July 25, 2019 @05:47PM (#58987464)

        I left it sitting at the blue screen for 148 hours.

      • Re:Huh.... (Score:5, Insightful)

        by arglebargle_xiv ( 2212710 ) on Thursday July 25, 2019 @11:15PM (#58988940)
        Rebooting, under the name "rejuvenation", is actually a standard technique for maintaining the integrity of high-assurance systems, periodically resetting them to a known-good state. It's commonly used in RTOSes where the code executes out of ROM and IPL is virtually instantaneous, but I could see them doing it in aircraft as well. So from a generic-computer point of view, "needs to be rebooted after X hours" sounds bad, but from a high-assurance-system point of view, an "X-hour rejuvenation interval" is a standard, and expected, procedure.
        • by i-neo ( 176120 )

          True, it's also something commonly used in all highly resilient platforms.

          You design and develop the best you can, and despite the best quality tests, you will always have some issue, so you integrate failure in your design, ending up with a way to return to a known state after a number of transactions (reboot in that case). And after years in the software industry, that's surely the only way to get the results you expect. And if you design for this reboot, all is fine, it's not even visible to the end user.

        • Re:Huh.... (Score:5, Insightful)

          by strikethree ( 811449 ) on Friday July 26, 2019 @09:21AM (#58990826) Journal

          Rebooting, under the name "rejuvenation", is actually a standard technique for maintaining the integrity of high-assurance systems

          Using a "reboot" to ensure things are as you think they are is one thing. Being forced to reboot because your shit will break if you don't is another. They are NOT equivalent. One is nice but not necessary, while the other is absolutely necessary or it will be guaranteed to fail.

          Totally different and nowhere NEAR the same thing.

          • Actually, they are. Let's say you have a system where an integrated component degrades by 10% every 24 hours, or at least a fixed 8-12% degradation in a 24-hour time period (lots of evaluation and modelling elided). This means you can run it for about a week before you experience a fault. Your fault isolation and recovery process then is to specify a rejuvenation interval of, say, 48 hours to deal with it and you're fine. The risk is mitigated and you move on.

            Rebooting because $warm_fuzzies is voodoo. K
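The arithmetic behind that rejuvenation interval can be sketched roughly as follows. This is only an illustration: the 8-12% per day figures and the 48-hour interval come from the comment above, while the linear degradation model, the 100% fault threshold, and the function names are assumptions made for the sketch.

```python
# Illustrative only: the linear model and the 100% fault threshold are
# assumptions for this sketch, not something stated in the thread.
HOURS_PER_DAY = 24


def hours_to_fault(degradation_pct_per_day: float,
                   fault_threshold_pct: float = 100.0) -> float:
    """Hours until accumulated degradation reaches the fault threshold."""
    return fault_threshold_pct / degradation_pct_per_day * HOURS_PER_DAY


def safety_margin(rejuvenation_interval_h: float,
                  worst_case_pct_per_day: float) -> float:
    """How many rejuvenation intervals fit into the worst-case time to fault."""
    return hours_to_fault(worst_case_pct_per_day) / rejuvenation_interval_h


if __name__ == "__main__":
    for rate in (8.0, 10.0, 12.0):
        print(f"{rate:4.1f}%/day -> fault after about {hours_to_fault(rate):.0f} h")
    # Margin of the 48-hour interval against the worst-case (12%/day) rate.
    print(f"margin with a 48 h interval: {safety_margin(48.0, 12.0):.1f}x")
```

Under these assumptions the worst case leaves a bit over eight days before a fault, so a 48-hour interval resets the component several times over before it could degrade that far.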

    • Re: (Score:2, Offtopic)

      by rudy_wayne ( 414635 )

      149 hours is 536,400 seconds. That exceeds the 512k of RAM.

    • 149 is 95 in hex!
    • Re:Huh.... (Score:4, Informative)

      by That YouTube Guy ( 5905468 ) on Thursday July 25, 2019 @06:00PM (#58987522)
      The original Windows 95/98 bug was to crash every 49.7 days [cnet.com] (2^32 milliseconds) of continuous operation. The moral of the crash is never use a desktop OS for a server.
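For reference, the 49.7-day figure falls straight out of a 32-bit millisecond counter. The sketch below checks the arithmetic and shows one common way to compute elapsed time across a wrap (modular subtraction). It is a generic illustration with made-up function names, not a description of how Windows 95/98 actually handled its tick counter.

```python
WRAP = 2 ** 32            # a 32-bit unsigned counter holds 0 .. 2**32 - 1
SECONDS_PER_DAY = 86_400


def days_until_wrap(ticks_per_second: int = 1000) -> float:
    """Days of uptime before a 32-bit tick counter wraps back to zero."""
    return WRAP / ticks_per_second / SECONDS_PER_DAY


def elapsed_ms(start_tick: int, now_tick: int) -> int:
    """Wrap-safe elapsed time between two 32-bit millisecond readings."""
    return (now_tick - start_tick) % WRAP


if __name__ == "__main__":
    print(f"wraps after {days_until_wrap():.1f} days")
    start = WRAP - 500    # reading taken 500 ms before the wrap
    now = 1500            # reading taken 1500 ms after the wrap
    print("naive difference:    ", now - start)
    print("wrap-safe difference:", elapsed_ms(start, now))
```

Code that compares raw tick values instead of differences is the kind that misbehaves once the counter wraps.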
      • > The original Windows 95/98 bug was to crash every 49.7 days [cnet.com] (2^32 milliseconds) of continuous operation. The moral of the crash is never use a desktop OS for a server.

        Did anybody ever trigger that bug ? My Windows 95/98 was crashing 4 times a day, never mind having it on for 49 days.
        • Did anybody ever trigger that bug ? My Windows 95/98 was crashing 4 times a day, never mind having it on for 49 days.

          I ran into it all the time at work since we had ad hoc servers running Windows 95/98.

        • by hawk ( 1151 )

          >Did anybody ever trigger that bug ?

          Not during testing before it was released . . .

          Even once out, it took a while before anyone put together that hard upper limit of 50 days for systems that *did* stay up . . .

          hawk

          I managed to see it in action, on a PC that did nothing but run a scanner that was used only sporadically. When I figured out that it was close to the 49.7 mark, I made sure I was there to watch it go down in flames. I was expecting to see it bluescreen at the right moment, but alas nothing happened! I then tried the mouse, and as soon as I clicked on something it went BSOD.

          I've also managed to get a Windows Vista system all the way to the 497 day bug that doesn't actually crash the computer, but end

      • Virtually nobody ran into that bug because Windows 95/98 were built on top of DOS. They used cooperative multitasking. The OS would hand control of the CPU over entirely to a running program, and it was up to the program to return control to the OS after using the CPU for a few milliseconds. If the program didn't hand control back or crashed, the OS froze.
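Cooperative multitasking as described here can be illustrated with a toy round-robin scheduler built on Python generators. This is a sketch of the general technique only, with invented names, and says nothing about the internals of any particular Windows version.

```python
from collections import deque


def task(name, steps, log):
    """A well-behaved task: does a little work, then yields control."""
    for i in range(steps):
        log.append(f"{name}:{i}")
        yield  # voluntarily hand control back to the scheduler


def run(tasks):
    """Round-robin scheduler that relies entirely on tasks yielding."""
    queue = deque(tasks)
    while queue:
        current = queue.popleft()
        try:
            next(current)          # run the task until its next yield
        except StopIteration:
            continue               # task finished, drop it
        queue.append(current)      # otherwise put it back in line


if __name__ == "__main__":
    log = []
    run([task("A", 2, log), task("B", 3, log)])
    print(log)
```

A task that loops without ever reaching `yield` would keep `next(current)` from returning, and every other task would stall with it, which is the freeze the comment describes.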

        I think the longest uptime I ever got from 95/98 was a little over 3 days. I'd turn off the computer at night because if you left it on, it was a 5
        • by jrumney ( 197329 )
          Windows 95 did not use co-operative multitasking. That was Mac OS 9 and earlier, or Windows 3.1.
        • Windows 3.1 ran on top of DOS. For shakes and giggles, I even had Windows 3.1 running on top of DR DOS. Windows 9X had DOS built into the OS. These days I use PowerShell for my CLI stuff.
  • Still running Win95 I see. What a joke.

  • Huh (Score:2, Insightful)

    Maybe they should ground them all until they actually fix the problem.

    What is with these airplane manufacturers and their seemingly blasé attitude towards flight safety?

    • Did you expect regulations and government agencies to protect you?

      • by skam240 ( 789197 )

        Infinitely more so than the companies themselves.

  • by Anonymous Coward

    2^29 milliseconds is 149 hours and 470.912 seconds. Perhaps an overflow?
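A quick check of that figure, as a small sketch (the helper name is made up for illustration):

```python
MS_PER_HOUR = 3_600_000


def split_ms(total_ms: int) -> tuple[int, float]:
    """Split a millisecond count into whole hours and leftover seconds."""
    hours, rem_ms = divmod(total_ms, MS_PER_HOUR)
    return hours, rem_ms / 1000


if __name__ == "__main__":
    hours, seconds = split_ms(2 ** 29)
    print(f"2^29 ms = {hours} h + {seconds} s")
```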

    • by Anonymous Coward

      Almost certainly. Chances are device 1 is sending data to device 2; and each packet has a timestamp that must be strictly increasing. Device 1 generates that timestamp by using a counter that overflows, so it starts sending out packets with timestamps around 0, 1, 2, ... etc again. Device 2 then says, "hey, I've already received packet 4 billion, so I should ignore packet 1" and suddenly device 1 is being ignored by the network.
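The failure mode hypothesised above can be sketched in a few lines. Everything here is illustrative: the class and function names are invented, and an 8-bit counter stands in for the much wider one the comment imagines so that the wrap happens quickly. Nothing in the directive confirms that this is what the A350 avionics actually do.

```python
class Receiver:
    """Accepts a packet only if its timestamp is strictly greater than the last."""

    def __init__(self):
        self.last_seen = -1

    def accept(self, timestamp: int) -> bool:
        if timestamp <= self.last_seen:
            return False           # looks stale, so it is ignored
        self.last_seen = timestamp
        return True


def sender_timestamps(start: int, count: int, bits: int):
    """Timestamps from a counter that silently wraps at 2**bits."""
    modulus = 1 << bits
    for i in range(count):
        yield (start + i) % modulus


if __name__ == "__main__":
    rx = Receiver()
    # 8-bit counter chosen only to keep the demonstration short.
    for ts in sender_timestamps(start=253, count=6, bits=8):
        print(ts, "accepted" if rx.accept(ts) else "ignored")
```

Once the sender's counter wraps, every later packet compares as older than the last one accepted, so the receiver drops all of them.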

      Ways around include: Increasing the size of the timestamp (2^64 milliseconds i

      • No it isn't.

        The correct approach is to have a protocol with multiple message types, one of which would be a message indicating a serial number reset.  That's absolutely standard practice in high volume message systems.  So standard that it's unlikely that message out of sequence is even related to the problem.

  • IT Crowd (Score:5, Funny)

    by Ukab the Great ( 87152 ) on Thursday July 25, 2019 @05:34PM (#58987396)

    Have you tried turning your plane off and on again?

    • Re:IT Crowd (Score:5, Interesting)

      by Strider- ( 39683 ) on Thursday July 25, 2019 @07:08PM (#58987882)

      I was once a pax on a CRJ (aka barbie-jet) and as we were preparing to head onto the runway, the pilot comes onto the intercom and goes "so folks, we're getting something unexpected from our flight computers. We're going to reboot the jet and see what happens." so they power cycled the jet, and off we went. Didn't give a whole lot of confidence.

      • We're going to reboot the jet and see what happens." so they power cycled the jet, and off we went.

        Hah! Tesla owners reboot their cars while driving! Press and hold both steering wheel "nipples" and wait for the screen to turn off.

        • by Anonymous Coward

          I have one (Model 3), and have done this (reboot while driving), it's freaky when the screen goes black and all the sounds stop.
          BUT, it's only the entertainment and instruments that go dark. All the driving functions still operate normally; brakes, steering, headlights, brake lights, turn signals, etc.
          If you want to do a full power off reset (yes, it's a thing) you have to be parked.

    • ...oh, hi Mom... uh.. uh hu.... yeah.. have you tried turning it on and off again?

  • Ada code.
  • Nice. Now I have to also try not to worry about if they rebooted the plane in the last 149 hours the next time I fly.

    --
    I believe that tomorrow is another day and I believe in miracles. -- Audrey Hepburn

  • by istartedi ( 132515 ) on Thursday July 25, 2019 @06:11PM (#58987594) Journal

    You don't really want 100% uptime anyway.

  • It's TIME FOR MORE QA with aircraft software.

    Maybe even some independent ones mandated by the FAA.

  • and how bad will the 2038 bug be?

  • While the rest of us are already in the 64 bit era?
    • by lordlod ( 458156 ) on Thursday July 25, 2019 @08:15PM (#58988220)

      1. The plane in question is manufactured by Airbus not Boeing.

      2. Of course they are using 32 bit. It is an embedded system, most processors are 32 bit ARMs or 8 bit chips.

      3. The plane in question started manufacture in 2010 (Wikipedia), the subsystem design would have preceded this by years. Arm didn't release their 64bit architecture until 2011.

      4. A 32bit count of milliseconds corresponds to 49 days, a long way from 149 hours. It does correspond with a 32bit counter and an 8kHz clock though.
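The numbers in point 4 are easy to verify. The sketch below computes the rollover time for a counter of a given width and tick rate; the function name is made up, and the 8 kHz clock is the commenter's guess rather than a documented fact about the aircraft.

```python
def rollover_hours(bits: int, ticks_per_second: int) -> float:
    """Hours until an unsigned counter of the given width wraps to zero."""
    return (2 ** bits) / ticks_per_second / 3600


if __name__ == "__main__":
    cases = [
        ("32-bit counter, 1 kHz (milliseconds)", 32, 1000),
        ("32-bit counter, 8 kHz", 32, 8000),
        ("29-bit counter, 1 kHz (milliseconds)", 29, 1000),
    ]
    for label, bits, hz in cases:
        hours = rollover_hours(bits, hz)
        print(f"{label}: {hours:.2f} h ({hours / 24:.2f} days)")
```

A 32-bit counter at 8 kHz and a 29-bit millisecond counter wrap at the same moment, since 2^32 / 8000 equals 2^29 / 1000, so both guesses in this thread point at the same 149-hour figure.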

  • by divide overflow ( 599608 ) on Thursday July 25, 2019 @09:15PM (#58988454)
    My solution:

    #!/bin/bash
    line="echo * */149 * * * /sbin/reboot"
    (crontab -u root -l; echo "$line" ) | crontab -u root -
    • Re: (Score:1, Insightful)

      by Anonymous Coward

      My solution:


      #!/bin/bash

      line="echo * */149 * * * /sbin/reboot"

      (crontab -u root -l; echo "$line" ) | crontab -u root -

      That will never be approved by FAA. The software in a plane has to be reliable, which is why it has to meet some rather strict requirements. A plane is only allowed to start programs and allocate or free memory while on the ground. We can't have a plane crash because malloc failed. You use "dangerous" commands like echo, which is actually not allowed.

      If you are puzzled as to why you can't use echo on a plane due to safety, but you can fly with a plane, which crashes after 149 hours, then that's the airline ind

      • That will never be approved by FAA. The software in a plane has to be reliable, which is why it has to meet some rather strict requirements. A plane is only allowed to start programs and allocate or free memory while on the ground. We can't have a plane crash because malloc failed. You use "dangerous" commands like echo, which is actually not allowed.

        Could it be more obvious that I was joking??

        The solution is obvious: Fix the goddamned code.

      • Whooooooooooooosh!
  • > Rebooted after exactly 149 hours

    > ...rebooted before it reaches 149 hours

    I really, really wish people would stop doing that.

  • this just goes to show that no matter where software is being used, it's all rubbish.
    it's funny because i have written a lot of programs/tools and of course they also contain bugs, mostly corner cases, but that's no excuse because when you hit such a case you've got problems.
    for the few times this happens, my boss pulls out this speech about how much better software is developed for cars & airplanes and that we should try to mimic the same quality, etc.
    as far as i can see, the quality isn't much better th

  • by EmTeedee ( 948267 ) on Friday July 26, 2019 @07:15AM (#58990068) Journal

    The original Airworthiness Directive (linked in the article) references a Service Bulletin which defines the necessary updates.

    Basically a patch has been available since 14 August 2018. The Directive no longer applies as soon as an airline installs those patches. Seems like Boeing is trying to spread FUD...

    Quotes from the Directive:

    Modification of an aeroplane in accordance with the instructions of the SB constitutes terminating action for the repetitive ground power cycles (resets) as required by paragraph (1) of this AD for that aeroplane.

    Airbus SB A350-42-P010 original issue dated 14 August 2018.
