Transportation United Kingdom

French Error Blamed for UK's Air Control Meltdown Which Left 300,000 Passengers With Cancellations (independent.co.uk) 73

What caused Monday's glitch in the UK's air traffic control system that left thousands of passengers stranded?

Wednesday the Independent reported that it may have been triggered by "an incorrectly filed flight plan by a French airline." Several sources say the issue may have been caused when a French airline filed a dodgy flight plan that made no digital sense. Instead of the error being rejected, it prompted a shutdown of the entire National Air Traffic Services (Nats) system — raising questions over how one clerical error could cause such mayhem... Downing Street has launched an independent review into the incident, which caused more than a quarter of flights at UK airports to be cancelled on Monday...

In his statement, Nats chief executive Martin Rolfe said Nats' systems, both primary and the back-ups, responded to the incorrect flight data by suspending automatic processing "to ensure that no incorrect safety-related information could be presented to an air traffic controller or impact the rest of the air traffic system".

The article also points out that "Passengers hit by the air traffic control meltdown face being stranded abroad for up to a week." Around 300,000 airline passengers have now been hit by flight cancellations since the hours-long failure of the Nats system on bank holiday Monday. The knock-on effect is set to last for several more days, as under-pressure airlines battle the backlog in a week where millions are already returning to the UK from their summer holidays.
Thanks to Slashdot reader Bruce66423 for sharing the article.
This discussion has been archived. No new comments can be posted.

  • really? (Score:5, Insightful)

    by neubsi ( 1039512 ) on Saturday September 02, 2023 @02:41PM (#63817314)
    You can blame the French airline all you want, but YOUR system is shit when it collapses because of a dodgy flight plan.
    • Re:really? (Score:5, Insightful)

      by 93 Escort Wagon ( 326346 ) on Saturday September 02, 2023 @02:53PM (#63817350)

      I think deflection is the point - they're hoping the UK public will ignore the fact that their own homegrown system is crap if they can somehow blame it on the French. That way the people responsible can try and keep their jobs.

      • The destination was \; DELETE FROM flight_plans;
      • Re: (Score:3, Insightful)

        I think deflection is the point - they're hoping the UK public will ignore the fact that their own homegrown system is crap if they can somehow blame it on the French. That way the people responsible can try and keep their jobs.

        Absolutely. They blame the French for the debacle that resulted from Brexit, their current economic disaster, their immigration 'crisis', pretty much everything can be 'blamed' on the French!

        • Re:really? (Score:4, Informative)

          by tlhIngan ( 30335 ) <[ten.frow] [ta] [todhsals]> on Saturday September 02, 2023 @06:27PM (#63817798)

          You have to remember the problem is Brexit. All of a sudden the population is feeling the effects of it - from Europe imposing "outsider" restrictions on the UK to increased paperwork.

          Of course, with Brexit, the UK is outside the EU, so now they're being subjected to "big brother" interrogations when they cross the border. Scare quotes from prominent newspapers because no longer was the UK an EU citizen so now they're treated like any other outsider that would come from say, the US. And exporting has more paperwork because again, no special agreements - they're no longer EU members with free trade.

          So the Brexiteers, who promised "life would be the same", are seeing the public start to backlash because the EU is treating them like any other non-EU country. So they're trying to spin it less as a problem with Brexit and more as an EU-picking-on-the-UK problem.

          Thus, trying to deflect the blame onto the French airline - the EU submits bad flight plans that take down our network - it's the EU's fault!

          Because crap, Brexit actually means the UK is no longer an EU member. Brexit was purely something the UK imposed on itself, and the pro-Brexit forces kept lying and saying that things with the EU will be exactly the same as they are.

          Yes, the French airline was at fault for making the error. No, the French airline was not responsible for the fact that error caused your system to go down. Just like when Windows NT was at fault for taking down the warship because it blue screened - the application was at fault for causing NT to bluescreen, but NT should've been more protected against the application error that caused it to bluescreen. I.e., the application shouldn't cause NT to bluescreen. The only things should be things the kernel has little control over - e.g., hardware failure or driver error.

          • Is it just in my head, or has the average user ID dropped significantly on this site lately?

            Kudos for mentioning the old Microsoft/warship BSOD story.

        • Absolutely. They blame the French for the debacle that resulted from Brexit, their current economic disaster, their immigration 'crisis', pretty much everything can be 'blamed' on the French!

          The sooner they dynamite that tunnel, the better.

        • I think deflection is the point - they're hoping the UK public will ignore the fact that their own homegrown system is crap if they can somehow blame it on the French. That way the people responsible can try and keep their jobs.

          Absolutely. They blame the French for the debacle that resulted from Brexit, their current economic disaster, their immigration 'crisis', pretty much everything can be 'blamed' on the French!

          As a Brit, I don't see blaming the French as in any way not legitimate. It can and should be done whenever the opportunity arises. That doesn't lessen their incompetence in running a computer system with external input, but the blaming the French thing is correct in any context.

    • by Luckyo ( 1726890 )

      It's deflection onto the favorite enemy to cover either hilarious or criminal negligence. French make for an excellent scapegoat for everything English.

    • by Tablizer ( 95088 )

      Blame the user when data validation is F'd up. Amateur move. (Both need spanks, actually.)

    • Re:really? (Score:5, Informative)

      by test321 ( 8891681 ) on Saturday September 02, 2023 @05:12PM (#63817670)

      It is only the Slashdot headline that makes it sound like the UK blames the French airline. TFA mentions that the origin would be a French airline, but clearly blames the UK traffic control software.

      Full title:

      Simon Calder: What is causing the air traffic control chaos? The authorities have some explaining to do

      The paper:

      Michael O’Leary, chief executive of Europe’s biggest budget airline, Ryanair, had no more luck than me. In a video message, he said: “It’s not acceptable that UK Nats simply allow their computer systems to be taken down, and everybody’s flights get cancelled or delayed.”

      I ventured to pin the blame on some weakness in the monumentally complex Nats computer system.

      Surely the Nats system should automatically have identified an anomaly and spat out the plan, saying “try again”? Yet instead, the flight plan was ingested and set in train a shutdown of the entire system.

      [...] the air traffic control provider has some explaining to do. Very soon.

    • by Kazymyr ( 190114 )

      Basically a buffer overflow applied to air traffic. No, the airline is not to blame, at least not for the greater part. The system should be resistant to this type of error, or attack. "They filed an incorrect plan" is not a defense.

  • by Retired Chemist ( 5039029 ) on Saturday September 02, 2023 @02:46PM (#63817326)
    It would be interesting to know why a bad data input crashed the system instead of simply being rejected. It suggests that the error recovery routines for the software are not well written. The real problem, however, is the airlines themselves. For economic reasons, they have eliminated any excess capacity, so when a flight is cancelled, there is no way to recover gracefully. I am not sure what can be done about this, since adding additional unused capacity would increase their costs, which they would have to pass on. The airline industry has traditionally been financially borderline at best.
    • Given that 25% of all UK flights got cancelled because of this crappily written software, it's hard to see how there's any possibility of "recovering gracefully" from the situation.

      • The system was only down for about an hour. If that caused 25% of the flights to be cancelled, they have a lot bigger problem with it than they are admitting.
    • Re:Economics (Score:5, Informative)

      by Vlad_the_Inhaler ( 32958 ) on Saturday September 02, 2023 @03:37PM (#63817466)

      I used to work on an ATC system, although more on the analysis-for-billing side than at the sharp end. During my years there we had two outages that I can remember.
      The first one happened when building works cut the fibre connections between the ATC HQ and the airport, the contract stated that both connections were not to be in the same bundle of cables (so far so good) but nobody realised that both bundles were in the same trench. The ATC tower was on the same side of the breach as the ATC mainframes (good), but the network was set up so that the highest priority activity was the communication between that outpost and the HQ (very bad). The network node was so busy trying to speak to the offices in the HQ that there was no bandwidth left over for the actual function. Our computers also handled several other towers and I think they were suffering just as much for the same reason. I don't know how long it took to fix things, probably a few hours. We had the same thing happen a couple of years later but someone had rejigged the network priorities and I don't think the people in the tower even noticed.
      The second outage was a programming error. All ad-hoc updates (and normal release levels) were tested for one hour with recorded production data, the problem was that this bug took around 50 (or was it 100?) hours to manifest itself. However many hours it was, it hit on a Sunday morning and then it was like a Keystone Cops episode:
      - The operations people escalated things (good), unfortunately - even though the ad-hoc update was known to them - they got in touch with the OS people - pretty much the only ones who did not know about the update - rather than the head of the programming group. With them not having a clue what was going on, they said "we" should switch to the backup system.
      - ATC data is pretty dynamic so "switching to the hot standby backup system" meant firing up with an empty database and then populating it with the flight movements. This would have solved the problem for another 50 (or 100) hours, except that various ATC station managers had to all sign off on the switch. One dragged his feet, they did not have a problem. After an hour or so the others got on the phone to him and described the symptoms to him - "oh, yes, we're seeing that as well". Finally he gave his ok and they switched.
      Luckily it was snowing heavily that day and a lot of flights had to be cancelled anyway - our ATC running with reduced capacity was a lot less noticeable and I'm not sure our problem even made the news. Nothing like this outage on a Bank Holiday weekend.

      For economic reasons, they have eliminated any excess capacity, so when a flight is cancelled, there is no way to recover gracefully.

      I have worked for an airline and also for that ATC, my experiences absolutely do not match that claim.
      The airline dealt with large amounts of manually entered data all the time, there was a fair amount of garbage in the data and rejecting that garbage was essential.
      The ATC had meetings to discuss every single outage. Airplanes not being able to land - or colliding with each other - is considered a very big deal, they took this stuff seriously and at a high level within the organisation. If that cost money, tough - it was still cheaper than an aircraft crashing. The first outage I detailed mandated changes well outside "our area" of the organisation, the changes were planned and carried out as a matter of priority.

      • For economic reasons, they have eliminated any excess capacity, so when a flight is cancelled, there is no way to recover gracefully.

        I have worked for an airline and also for that ATC, my experiences absolutely do not match that claim.

        Maybe not, but then the things you said next don't back up the idea that the airlines haven't eliminated the capacity that would allow them to recover from these incidents...

        • Re:Economics (Score:4, Informative)

          by Vlad_the_Inhaler ( 32958 ) on Saturday September 02, 2023 @04:36PM (#63817592)

          I was working for the airline when Eyjafjallajökull polluted the upper atmosphere with volcanic dust and politicians ordered the airspaces closed, in that sense the effect was comparable with US airspace in the aftermath of 9.11.
          It took a day or so for the airlines to handle getting their aircraft to where they needed to be, but then they had it sorted.
          There have been several strikes over the years which have also left aircraft in the wrong places, this is pretty much normal and they have to be able to deal with it.
          What was special about this outage was that it happened at exactly the wrong time, rather like something over Thanksgiving in the US.

        • Airlines literally constantly recover from this. One of my colleagues was affected, result: he flies home 17 hours later than anticipated. Flights get cancelled constantly. Airlines recover constantly and quickly. In many cases people get home the same day. In a few cases it's one or two days later. The capacity is there. And of course it gets done: it costs a fuckload to compensate passengers.

          I've had 7 flights cancelled on me just this year. In one case I had to stay overnight in an airline funded hotel.

    • That this bad data simply wasn't handled. Halting an entire system deliberately due to a bad record amongst thousands stretches credulity to breaking point, so probably what really happened was it passed error checking then crashed the processing section with some OOB value, e.g. negative flight time or similar.

    • Re:Economics (Score:4, Insightful)

      by hey! ( 33014 ) on Saturday September 02, 2023 @03:48PM (#63817488) Homepage Journal

      Stability in response to invalid input is what we call a "non-functional requirement". You can't verify this requirement in a simple functional test because you'd have to present the system with every possible invalid input. Verifying such a requirement analytically is (usually) not rocket science, but that does presuppose a coherent and stable leadership which doesn't muck around with changing system priorities, scope or requirements very much; and which understands that money spent on *thinking* -- or even hiring people who can think -- actually accomplishes something. That's way too high a bar for many organizations, so failure to handle invalid input correctly is shockingly common, particularly when it comes to security. Many a terrible system passed all its functional tests with flying colors.

      As for the resiliency of the airlines (or lack thereof): this is a serious problem throughout the global economy, as we discovered in the pandemic. When you start out to build a system or a supply chain, resiliency and efficiency don't necessarily conflict very much. But at some point you have to choose one over the other, and *efficiency* puts money in your pocket, almost right away.

      So there you have it. Many of the systems -- electronic or economic -- that civilization depends upon are unstable, fragile, or both. If the whole house of cards comes down some day it'll be because of a chain reaction spreading across systems whose stability given bad input was never addressed, not because it was hard, but because it was hard to explain to managers focused on short term profits. That's much more likely in my opinion than most stock pop-culture apocalyptic scenarios.
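
The point above, that a functional test cannot present every possible invalid input, is often partly addressed with randomized fuzzing. The sketch below is only an illustration of that idea: `parse_flight_level` and `fuzz` are hypothetical names invented here, the "FL350" format is an assumed toy rule, and a clean fuzz run is evidence rather than proof of stability.

```python
import random
import string


def parse_flight_level(text: str) -> int:
    """Hypothetical parser for this sketch: accepts 'FL350' and returns 350."""
    if not isinstance(text, str):
        raise ValueError("not a string")
    if not (text.startswith("FL") and len(text) == 5):
        raise ValueError(f"bad format: {text!r}")
    digits = text[2:]
    if not (digits.isascii() and digits.isdigit()):
        raise ValueError(f"bad digits: {text!r}")
    return int(digits)


def fuzz(parser, rounds=10_000, seed=0):
    """Feed random junk to parser; collect inputs that raise anything but ValueError."""
    rng = random.Random(seed)  # fixed seed so a failure can be reproduced
    alphabet = string.printable + "\x00\u00e9\u4e2d"
    surprises = []
    for _ in range(rounds):
        junk = "".join(rng.choice(alphabet) for _ in range(rng.randint(0, 12)))
        try:
            parser(junk)
        except ValueError:
            pass  # a clean rejection is the behaviour we want
        except Exception as exc:  # anything else is an unplanned failure mode
            surprises.append((junk, exc))
    return surprises


print("unexpected failures:", len(fuzz(parse_flight_level)))
```

A parser that only ever raises `ValueError` on bad input leaves the list empty; one that indexes into an empty string, for example, would show up here long before it reached production data.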

      • Stability in response to invalid input is what we call a "non-functional requirement". You can't verify this requirement in a simple functional test because you'd have to present the system with every possible invalid input.

        But that's not relevant here, because the system did successfully detect that the input was invalid. The question is why the system is designed in such a way that someone else can simply commit a message that can have that effect, instead of having it go through a validation-rejection system that would allow it to not take effect at the destination end, while also informing the source end (or more importantly, with the protocol involving not assuming success without a validation response.)

      • So is capitalist efficiency really not efficient from the standpoint of the collateral damage, and the intellectually honest?

        • Efficiency used to mean Pareto Efficiency in economics. It's been redefined to mean cost savings as a way to compromise the societal benefits of Pareto Efficiency and build monopolies: https://www.ineteconomics.org/... [ineteconomics.org] So next time you see a claim of the "efficiency" of a merger, or the "efficiency" of business, think "monopoly". The so-called efficiencies are an illusion: they don't exist, have no benefit for consumers, and are in fact detrimental to consumers.
    • by AmiMoJo ( 196126 )

      Passengers are screwed when this happens. Under EU rules (which still apply in the UK) the airline is responsible for getting you there, and for costs, if it's not an unusual problem out of their control. Stuff like overbooking, maintenance issues, crew not available etc.

      But since it's ATC's fault, the passengers have to rely on their travel insurance.

      • That's not entirely true. Under UK rules (and EU rules, since they've not diverged yet) airlines do still have a duty of care even in scenarios like this. Although you can't expect any compensation, you can still expect alternative transportation and accommodation costs, food etc. until that transportation occurs.

    • by Bert64 ( 520050 )

      Or the bad input didn't crash the system, and just forced it into a failsafe mode. Software like this is typically written to ground flights when something unexpected happens, rather than risk aircraft flying into each other.

  • Someone needs to invent a way to stop this kind of mistake of entering invalid data into computer programs.

    LoB
  • by joe_frisch ( 1366229 ) on Saturday September 02, 2023 @03:01PM (#63817376)
    If a presumably accidental transmission of bad data can crash the entire system, that suggests that the input data is not carefully validated. That seems like a potential hacking risk. At the minimum it seems easy to create a lot of disruption, and it's possible the worst case is much worse.

    I'm not familiar with the UK flight plan system, but in the US it's very easy to get access to file a flight plan using commercial software designed for general aviation pilots.
  • by kaur ( 1948056 ) on Saturday September 02, 2023 @03:10PM (#63817400)

    I am working for a fairly small company that provides a relatively unimportant service. Nevertheless, we are required to have a comprehensive business continuity program. Policies and procedures, disaster recovery plans and tests, crisis management, risk analysis, supply chain security, internal management oversight, reporting to external authorities, audits on all of the above.

    Any company or system can have an input validation error, however good your testing / fuzzing / whatnot is. But you should be prepared for an incident - for ANY incident - in your critical systems. This is what business continuity is about: to limit and contain the effect of any single mistake, and to guarantee recovery within predefined boundaries, no matter what.

    How CAN an air control service of a major country pull such a stunt and then blame it on someone else?

    I have been working with people from the UK cybersecurity agency (NCSC) and I have read / used their materials for my own work. They seem to be competent, effective and pragmatic. I expect them to give an assessment of their own, in due time of course. It would make an interesting read.

    • They were prepared, though not well enough. No planes crashed. No-one was injured. The system's worked well for many years - a quick Google search turned up the previous ATC outage in December 2013, almost 10 years ago.

      There was a backup system, but the bad input crashed that too. Presumably the backup has the same software and is a backup against hardware failures. So they fell back to a part-manual system, which worked, but slowly and with much inconvenience.

      I guess the lesson will be to run more fuz

    • Nevertheless, we are required to have a comprehensive business continuity program

      I've never seen a company ever have a 100% capacity business continuity program. To be clear, not all flights were cancelled. But there's only so much you can do manually in our automated world. The continuity plan here is to manage planes as far as possible until the computer is up again. Most people get home within a day or two of their scheduled time. There are continuity plans in place.

  • Not a crash (Score:5, Informative)

    by Hawks ( 102993 ) on Saturday September 02, 2023 @03:17PM (#63817418)

    The system didn't crash. If you even read the summary all the way through you would see the system failed safe. The system got an input it couldn't deal with and "... suspending automatic processing 'to ensure that no incorrect safety-related information could be presented to an air traffic controller or impact the rest of the air traffic system'."

    Everyone should want something as critical as flight routing to fail into a safe mode so planes don't crash into each other.

    Was the failure mode a huge inconvenience? Sure. Still better than being dead.

    Could the incorrect input have been handled better? Probably. Now that an edge case has been found, people can investigate the reason that type of input caused a full halt of automatic processing, and work on dealing with inputs of that nature and dealing with them in a way that doesn't cancel tons of flights.

    I'd still rather be stuck in some airport for 2 days than being a smudge in the ground after a catastrophic air crash.

    • WTH can't someone be bothered to fix the back end of slashdot to handle unicode properly? I mean come on. It isn't like this has been a known bug for 30 years, oh wait..

    • by butlerm ( 3112 )

      An entire system going into a fail safe mode in response to a data entry error or data format validation error is indistinguishable from a crash. In fact it is actually worse than a crash. A *system* should never do that under any conditions. If the Internet operated that way it would shut down within milliseconds, and somehow I doubt the resulting inability to communicate with anyone or anything anywhere would make people happy just because the system wide outage was credited to a "fail safe" mode.

      • to the 4digit ID

        I disagree. A fail safe is much better than a hard crash, particularly if it causes a crash loop. A fail safe can allow for manual recovery and processing while the system is still up and moving flights around, albeit slowly.

        If the system had hard crashed the primary and backup, recovery would likely have required dump analysis before the system could have been brought back up to any sort of functional state in production.

        I would also argue failing safe is highly preferable to a hard crash t

      • by jd ( 1658 )

        As one 4-digit uid to another, I have to disagree, simply on the basis that systems are too diverse. There is absolutely no common rule that will apply to all of them.

        • by hawk ( 1151 )

          I guess I have to jump in, too.

          Shutting down over bad data is worse than propagating it when lives depend upon it.

          noisily *rejecting* the data would be better yet, and even better would be figuring it out and handling it properly!

          hawk

          • by jd ( 1658 )

            The best thing is to trap errors as close to user input as possible through data validation. You shouldn't need to keep revalidating, though - once data has been validated, only valid data should be present. If you can't trust data source X, validate that source. But the only way you'll know what the correct data should be is if you ultimately push back to the user and force them to re-enter. Obviously, this isn't being done in some cases.

            In areas where only valid data is present, you can avoid the deadweig

            • by butlerm ( 3112 )

              This is a large scale international network. Every node should be programmed to reject data from other nodes that does not meet basic integrity constraints rather than initiate some sort of nationwide shutdown. Of course if the data is quite important appropriate alarms and alerts should be raised so that the originator can supply properly formatted data or other corrective action can be taken.

    • by GuB-42 ( 2483988 )

      If this is a fail safe, a segmentation fault is also a fail safe. It prevents the program from corrupting memory by terminating it.

      Maybe next time, if someone accuses me of writing code that crashes, I will be able to answer "no, it failed safe".

    • by mspohr ( 589790 )

      The system should have just rejected the invalid input (and notified the sender) and kept on working. Instead it just stopped working (failed). It shouldn't have stopped working. No reason bad input should cause the system to stop working and shutdown.
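
A minimal sketch of the reject-and-keep-working behaviour described above. Everything in it is a hypothetical illustration: the `FlightPlan` fields, the validation rules, and the function names are assumptions made for this example and say nothing about how the Nats software is actually built.

```python
from dataclasses import dataclass


@dataclass
class FlightPlan:
    # Hypothetical, heavily simplified record; real flight plans carry far more.
    callsign: str
    origin: str
    destination: str
    waypoints: list


def validate(plan: FlightPlan) -> list:
    """Return a list of problems; an empty list means the plan is acceptable."""
    problems = []
    if not plan.callsign.strip():
        problems.append("missing callsign")
    for field in ("origin", "destination"):
        code = getattr(plan, field)
        # Assumed rule for this sketch: airport codes are four ASCII letters.
        if not (len(code) == 4 and code.isascii() and code.isalpha()):
            problems.append(f"bad {field} code: {code!r}")
    if len(plan.waypoints) == 0:
        problems.append("empty route")
    return problems


def process_all(plans):
    """Accept valid plans, reject bad ones with reasons, and never stop early."""
    accepted, rejected = [], []
    for plan in plans:
        try:
            problems = validate(plan)
        except Exception as exc:  # a validator bug must not halt the queue either
            problems = [f"validator error: {exc!r}"]
        if problems:
            rejected.append((plan, problems))  # reasons would go back to the filer
        else:
            accepted.append(plan)
    return accepted, rejected


plans = [
    FlightPlan("AFR123", "LFPG", "EGLL", ["WPT01", "WPT02"]),
    FlightPlan("BAD001", "LFPG", "??", []),
    FlightPlan("BAW456", "EGLL", "KJFK", ["WPT03"]),
]
accepted, rejected = process_all(plans)
print(len(accepted), "accepted;", len(rejected), "rejected")
for plan, problems in rejected:
    print(plan.callsign, "->", "; ".join(problems))
```

The design choice is that one bad record only ever affects itself: it is set aside with its reasons while the plans before and after it are processed normally.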

    • Fail-operate?

      I did software for the military, and fail-safe is the standard for less critical systems. Important systems are expected to continue operating as far as possible. Isolate the damage, work with the known-good data.

      There's not much detail available on this case, but it sure seems like *something* should have been possible.

    • by jd ( 1658 )

      Whilst I believe there should be better error handling (if a flight plan is faulty, it should never be accepted at the start and those filing it should be forced to reenter at that point), once the data is in the system, I agree that the system should fail gracefully, leaving the database in a known valid state.

  • And your father smelled of elderberries
  • ...the UK error correction sucks.

    • You don't error correct flight plan data, that's how planes crash
      The system correctly went into a fail safe mode, for ~30 minutes

      The airlines then took far too long to sort out the resulting mess ...

  • how convenient that is...
    the UK system is crap, falls over itself, and they're blaming the french...
    liars.

  • by Alworx ( 885008 ) on Saturday September 02, 2023 @07:22PM (#63817944) Homepage

    Voilà le flight nombre: ';drop table flightplans; --
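
For anyone who missed the joke, it is a SQL injection string. A small sketch of the usual defence, parameterized queries, using Python's built-in `sqlite3` module; the `flightplans` table name comes from the comment above, while the schema and column names are made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema, invented for this sketch.
conn.execute("CREATE TABLE flightplans (id INTEGER PRIMARY KEY, flight_number TEXT)")

hostile = "';drop table flightplans; --"

# The ? placeholder makes the driver treat the value as data, never as SQL.
conn.execute("INSERT INTO flightplans (flight_number) VALUES (?)", (hostile,))

rows = conn.execute("SELECT flight_number FROM flightplans").fetchall()
print(rows)  # the table still exists and holds the string verbatim
```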

  • by sonoronos ( 610381 ) on Saturday September 02, 2023 @11:16PM (#63818268)

    I am here in the UK this week and have been flying. So have many others. Yes, there were cancellations in a very specific time window, but not an unusual amount compared to past cancellations in the USA, which have been much worse.

    This article makes it sound like the city is burning and there is chaos in the streets. This is totally untrue. The airports here in London have been running just as normal.

    This is probably another piece of journalism tied to a growing trend of news stories criticizing the flaws of the existing european air traffic network.

    Who is driving that media and for what reason is unknown to me.

    • You can't fight hyperbole with equally silly lies. Airports have not been operating normally. They have been in chaos. Yes many people got where they were going. No the city isn't burning. But saying everything is normal is just plain wrong.... Even for Heathrow which is known for typical chaos as normal operation.

  • It reminds me of a time a new employee had to manually enter a barcode that the scanner wouldn't pick up. Accidentally that employee entered numbers but also a letter. Took the application and main DB down. Employee was nearly fired on the spot...but unlike other managers I like to do long, thorough RCAs and it was then that I learned the application is shit. It will take ANY manual input as valid and try and write that into DB cells whose properties are expecting numbers only. A cascade of odd behaviours but
  • ... french for their extremely sensitive and probably very buggy software.

  • A very useful example of the importance of good testing - I'd be fascinated to hear the full details
  • by Anonymous Coward
    Drop Flights
  • Bad data validation is _always_ the fault of the software maker, not the one producing the bad data. Seriously. Obviously bad error behavior is even more the fault of the software maker.

    Here is a hint: Do not use cheap crapware for processing submitted flight-plans.
