Transportation Bug Software

Failed Software Upgrade Halts Transit Service (125 comments)

Posted by Soulskill
from the ay-carumba dept.
linuxwrangler writes "San Francisco Bay Area commuters awoke this morning to the news that BART, the major regional transit system which carries hundreds of thousands of daily riders, was entirely shut down due to a computer failure. Commuters stood stranded at stations and traffic backed up as residents took to the roads. The system has returned to service and BART says the outage resulted from a botched software upgrade."
This discussion has been archived. No new comments can be posted.

  • They should have brought their skateboards to work.
    • by noh8rz10 (2716597)

      wow first it's the unions that are shutting them down and now a software update? I wonder what will happen next.

      • by Anonymous Coward

        I wonder what will happen next.

        People will buy cars. Only so much of this nonsense can be tolerated when it fucks with your livelihood. When the boss shows up and all the people with cars are getting it done and all the people with train tickets are at home making excuses... well, you shouldn't need any help figuring this part out, even if you don't like it.

      • Re:I Guess (Score:5, Funny)

        by RabidReindeer (2625839) on Friday November 22, 2013 @08:25PM (#45497087)

        wow first it's the unions that are shutting them down and now a software update? I wonder what will happen next.

        Unionized software.

        Ironic, isn't it? Silicon Valley commutes wrecked due to bad IT practices!

      • You do realize you've just summoned an earthquake, right?

    • by _Shad0w_ (127912)

      There's a guy who catches one of the trains I catch in the morning who always gets on with his skateboard. Although I work in North London.

  • Strange times (Score:5, Insightful)

    by nightsky30 (3348843) on Friday November 22, 2013 @07:59PM (#45496869)
    Why was a weekday selected for this software update?
    • Re:Strange times (Score:5, Informative)

      by TWX (665546) on Friday November 22, 2013 @08:05PM (#45496935)
      Well, based on my own experience with bureaucracies, there is some existing rule that ensures that certain types of staff have certain days off unless there's an emergency, and a software update probably didn't previously count as an emergency.

      From one standpoint, it makes sense, especially if those doing the work need technical support from a vendor. On the other hand, it probably makes more sense to have a QA lab set up if one is going to operate this way, so that one can test a rollout in advance, hopefully forestalling such problems going live.
      • Re:Strange times (Score:4, Insightful)

        by B33rNinj4 (666756) on Friday November 22, 2013 @08:43PM (#45497211) Homepage Journal
        Man, my company hasn't had a QA environment that mirrored production in over a decade. I'd like to think that they had something set up, but the few state-run departments I've seen have been sorely lacking.
      • It's more the contractors refusing to train and keep their hires. Nobody wants to keep someone around. They cost more every year. But for programmers that means nobody knows how anything works. It keeps profits high for the guy running the sub-contractor, but it means crummy software...
      • Re:Strange times (Score:5, Insightful)

        by girlintraining (1395911) on Friday November 22, 2013 @11:25PM (#45498009)

        On the other hand, it probably makes more sense to have a QA lab set up if one is going to operate this way, so that one can test a rollout in advance, hopefully forestalling such problems going live.

        And that's pretty hopeful. The thing is, in the real world, you just don't test all your patches. You can't; in any non-trivially sized network you're going to have hundreds of them to go through every week, and the workload is the same for a small or large business. That's why large businesses tend to do better (strangely enough) than small ones when it comes to patch management. And this is an attitude that is backed up by the numbers -- I would say over 9 times out of 10, a break/fix patch has no consequences being pushed into the production environment. It goes out. The version increments. The end. It's that 1 time that screws everyone up -- but it happens infrequently enough that management doesn't update its policies.

        Most managers operate under a triage approach to maintenance -- that is, throw resources at a problem when something breaks and complaints start coming in, rather than throwing resources at prevention. In the short run, this is the right approach -- in a crisis you want all hands on deck. The problem is that over time, neglecting preventative maintenance procedures, which show up only as a cost without a defined benefit, results in departments moving to a triage model all the time. Basically, the problem is short-term prioritization over long-term cost reduction.

        And I've seen it in almost every IT department I've worked for. I've even sat down with managers and explained to them that when 35% of their workflow is emergency break/fix and that number is trending upwards, we have a process control issue. They invariably agree with me, but say they can't get out from under the workload. Of course, when I come back three months later and it's now at 47% and the workload is now a third higher, they say the same thing.

        I would lay money that this is how project management is happening at BART, and it has now deteriorated to the point where it's starting to impact its core business. The problem is, while it is still likely at a point where effective project management can right this sinking ship... it almost never happens. Unfortunately, the solution most of the time here is to throw someone under the bus, blaming them for the failure, and insisting that as the system has worked up until this point, it does not need an overhaul.

        They couldn't be more wrong. But unfortunately it will take several people being thrown under the bus and a few more high-profile failures before senior management fires the mid-level manager responsible for the project, brings on someone with a strong background in project management, and restructures the department from the ground up following the best practices of change management. Of course, they'll overdo it in the attempt and the pendulum will have to start swinging back the other way, but... that's what happens.

        • by Anonymous Coward

          Here's the thing.

          Every company wants cheap IT right now. They want an endless stream of no-benefit, no-complaint, low-wage IT workers to come in and set things up so they can fire newly redundant staff, enable them to compete with companies handing them their asses on a silver platter, implement new systems to replace ones that are often decades old, or reduce their current IT operating costs. Very few companies want something entirely new built from scratch thanks to ZIRP; it makes no sense right now

        • by tlhIngan (30335)

          And that's pretty hopeful. The thing is, in the real world, you just don't test all your patches. You can't; in any non-trivially sized network you're going to have hundreds of them to go through every week, and the workload is the same for a small or large business. That's why large businesses tend to do better (strangely enough) than small ones when it comes to patch management. And this is an attitude that is backed up by the numbers -- I would say over 9 times out of 10, a break/fix patch has no consequ

    • Why was a weekday selected for this software update?

      Should have been a Tuesday. Then our Windows updates and our transit updates would match! (... 14% ... for ... ever ...)

    • by x181 (2677887)
      So they can purposely botch it and justify the need to have human operators. In case you don't know, BART is currently going through a tense union battle resulting in a few worker strikes and contract disputes.
    • Why was a *production* system chosen to test the upgrade would be a better question. Why were there no fallbacks an even better one...

      • Yes, and I bet there was at least one developer saying the exact same thing who was overruled by mgmt who proceeded with the push regardless!
      • I'm sure that if you asked them the answer would be along the lines of "Huh? What's a production system? We just call it the system."

        I once argued for retention of a QA system, which was basically a 4 week old copy of Prod. Things like being able to replicate actual problems with actual data, test new functionality & patches without impacting the business counted for less than some little tart's fluttering eyelashes. Of course that's what management wanted to hear, because an extra server is just a

    • by Salo2112 (628590) on Friday November 22, 2013 @10:01PM (#45497641)
      Patch *Tuesday*. Duh.
    • by SeaFox (739806)

      Why was a weekday selected for this software update?

      The same reason your cable company does maintenance in the middle of the day when at night they would disrupt far fewer customers -- the managers are tightwads and don't want to pay the rank-and-file employees for the extra hours outside their normal schedules, and the ones on salary are among that group that refuses to work outside 9-5 M-F.

  • Hello, IT. (Score:3, Funny)

    by tech.kyle (2800087) on Friday November 22, 2013 @08:04PM (#45496923)
    Have you tried turning it off and on again?
    • by gagol (583737)
      Reynholm Industries, successful makers of [insert_your_guess_here]. Great quote!
  • BART (Score:5, Interesting)

    by Anonymous Coward on Friday November 22, 2013 @08:19PM (#45497027)

    BART is run by the dumbest people on Earth. First off, it takes a special kind of stupid to create a rail system that goes almost, but not quite all the way to the airport. 30 years later they extended to one of them but you still have to transfer to a bus for the last mile on another. Then you have to wonder what kind of idiot puts light carpet and cloth seating on public transport. 35 years later they start testing non-porous flooring/seating and maybe in another five years all of the trains will be switched over. Then, some bean counter got a bonus when they closed all the station bathrooms when 9/11 happened, ostensibly for security. Now a fifth of the escalators are out of service at any one time because they are clogged with human shit.

    I also heard there was some sort of labor dispute.

    • Re:BART (Score:4, Insightful)

      by Jane Q. Public (1010737) on Friday November 22, 2013 @08:36PM (#45497161)

      "BART is run by the dumbest people on Earth."

      Well, you really do have to wonder when they say they worked through the whole night only to discover that this new, mysterious problem was caused by the update they'd made the night before.

      I mean, wow. Wouldn't that be the first thing that popped into your mind?

      • by gagol (583737)
        To suspect something is one thing; to be sure of it you need to gather and analyse data at least. A night to confirm it is reasonable. And a bathroom in a metro is a luxury. How many undergrounds have those facilities? (I don't know; there are none in Montreal, Canada.)
        • Japan has them all over the place in Tokyo.
        • by xaxa (988988)

          London Underground toilet map [tfl.gov.uk] (not so great in the centre, but pretty good elsewhere).

          They're in probably half of European underground stations, on average. Expect to pay 0-50c, depending on the country.

          My local station (in London) has one, it's always very clean. I don't think many people use it.

    • Re:BART (Score:5, Informative)

      by MrEricSir (398214) on Friday November 22, 2013 @08:49PM (#45497241) Homepage

      The BART-SFO extension was a matter of politics; you can't blame the people who run BART for that. You also can't blame the initial designers for not building the OAK extension, since OAK was a much smaller airport in those days (and had very few passenger flights).

      The train design was done by an aerospace company with absolutely no rail experience, which explains BART's quirky design elements. But you can't blame BART's current management for construction contracts awarded in the 1960s.

      • Re:BART (Score:5, Insightful)

        by Anonymous Coward on Friday November 22, 2013 @09:42PM (#45497523)

        Plus, BART is not exactly a metro system like in Boston, Chicago, or New York. It's somewhere between a metro and commuter rail, but closer to the latter. It's a product of 1960s thinking, where people were trying to deal with the population shift out of the urban core. So part of the idea was to create high-speed transit from bedroom communities to downtown Oakland and San Francisco.

        Connecting the airports probably never figured much into the equation. It wasn't built to supplement the transportation needs of carless San Francisco residents. It was built to shuttle people around the Bay Area. If you needed to get to the airport, you got there like everybody else--you drove your car.

        • by drinkypoo (153816)

          It wasn't built to supplement the transportation needs of carless San Francisco residents. It was built to shuttle people around the Bay Area. If you needed to get to the airport, you got there like everybody else--you drove your car.

          But this just comes right back to how BART is stupid. Because when you build public transportation, it's going to be used by people who don't have cars, and to not take them into account is fucking stupid. Also, it's just stupid not to have the rail be able to take commuters from an airport to downtown no matter how you slice it. That should have been an initial design goal.

          • by SeaFox (739806)

            If you needed to get to the airport, you got there like everybody else--you drove your car.

            But this just comes right back to how BART is stupid. Because when you build public transportation, it's going to be used by people who don't have cars, and to not take them into account is fucking stupid.

            Maybe the assumption was if you couldn't afford a car, you probably couldn't afford to be going on many flights either. Keep in mind air fare was a bit pricier in the 60's and gas was quite a bit cheaper. The financial bar for car ownership was lower.

            • by drinkypoo (153816)

              Well, what I meant was that they should have taken both classes of passenger into account.

              Ideally this means having lines segregated by socioeconomic status. You don't want to go to the airport and the ghetto.

    • Re: (Score:3, Funny)

      by Anonymous Coward

      So people take a dump while riding the escalator? That's actually a cool idea.

      • by gagol (583737)
        Let us know how it went for you!
        • by Anonymous Coward

          It was certainly a moving experience; quite uplifting. The person behind me didn't seem to fully appreciate the view; or having to climb backwards when I stopped at the top to wipe --- especially once certain stairs came 'round again full loop. I suppose if I wasn't a Republican, I might have cared about their distress --- but, screw it, shitting on people just feels so good. Made riding on the peons' transit system feel totally worth it.

    • by bluemonq (812827)

      > 30 years later they extended to one of them but you still have to transfer to a bus for the last mile on another.

      Pity you didn't have a spare $100 million a couple decades ago. I'm SURE you'd have been willing to pay for it, right? The extension to SFO wasn't built until recent times because back in the '60s San Mateo County quit the BART project, and the money wasn't around until the tech bubble started growing; ground was broken in 1997. The Oakland extension wasn't started until recently (opens in 2

  • This is really surprising to me.

    For all the "can not fail" systems I've worked on, there has been an identical set of hardware, along with other hardware to simulate load, on which you could try upgrades before you put them on a live system and cost the local economy tens of millions of dollars by screwing up.

    • Most of the "cannot fail" and "mission critical" and "we're betting the company on this" systems I have seen have one (1) production environment, and one (1) development environment that sort of looks like production, with light servers on each developer's system.

      I recently attempted to test the implementation of a client unlike any of those we had previously hosted, and the CIO and his Development VP told me, "we don't have the resources for that, we'll test it in production". It failed in production. I
    • and cost the local economy tens of millions of dollars by screwing up.

      So what? What's BART's incentive to avoid this? The customers will go to a competitor? They'll lose their jobs?

      Unionized monopolies are a wonderful thing.

      • by tlambert (566799)

        and cost the local economy tens of millions of dollars by screwing up.

        So what? What's BART's incentive to avoid this? The customers will go to a competitor? They'll lose their jobs?

        They'll do what they did Thursday and Friday, and flood the roads with drivers who have cars for emergencies, usually take public transit, and are pretty inexperienced as drivers in regular traffic, not just "BART's out traffic". BART isn't really necessary; it's convenient for a lot of people, but once it drops below the convenience threshold, people simply won't use it.

        • I understand your argument, but do you think the BART employees really think that BART will get closed down if they don't do a great job?

    • For all the "can not fail" systems I've worked on, there has been an identical set of hardware, along with other hardware to simulate load

      Yeah gramps, we did all that in history class, along with slideframes and mainrules and all that.

      That's obsolete now because cloud and agile and webscale.

      • by tlambert (566799)

        For all the "can not fail" systems I've worked on, there has been an identical set of hardware, along with other hardware to simulate load

        Yeah gramps, we did all that in history class, along with slideframes and mainrules and all that.

        That's obsolete now because cloud and agile and webscale.

        Let me know when you get the next G.E. Medical systems MRI system running "in the cloud" rather than on a local control system and a console in the next room, and then trust your life to the thing. Meanwhile, I think I will probably stick with the medical equipment I've worked on instead.

        P.S.: Let me know when your cloud is HIPAA certified.

  • I guess you can't always save by eliminating humans and their expensive unions. Although, I'm sure the software was intended to pick up the financial slack for all of those expensive peeps. Don't worry, Wall Street is highly motivated to eliminate the humans with the software, eventually...

  • by Neo-Rio-101 (700494) on Friday November 22, 2013 @08:31PM (#45497131)
    First I'm not going to plug any VM vendor.... but with certain VM backends, snapshots are possible, and it's a godsend when crap like this happens.
    • You have to realize how few people even know what a VM is. Or a snapshot. Where I work, there is one backup made each week, on the server. No other machine has a snapshot, a disk image, a backup, there are no VM's - nothing. If/when a disk fails, that machine comes to a halt until a vendor is called in to replace the disk, the OS, and all the software.

      We have some fool who is referred to as "the IT guy". I can't even say that with a straight face. This is one of those who got a Microsoft-centric educa

    • by rubycodez (864176)

      you can do snapshots by other means than having VM software. Many volume managers and filesystems can do it, and some disk array controllers have that built in

    • Re: (Score:2, Interesting)

      by Anonymous Coward

      No. Just no.

      Have you ever actually tried this on a production system? I haven't (I'm not stupid enough to do that), but I've seen many others try. In almost every case, the resulting mess from "rolling back" a VM was greater than the mess of a botched software update to begin with. In one particular case, I witnessed a certain VM running some very expensive enterprise software totally hose itself and then proceed to blow away the majority of a database hosted on another VM after it was restored following a

    • by Todd Knarr (15451)

      Gods, no. Just... no. Think for a minute. If your VM's running a database server and you roll back to a snapshot, what happens? Well, the snapshot doesn't know anything about the database since that's an application-level thing, so it'll roll back to being mid-operation (times however many database operations were in progress). The problem is that since the clients haven't been rolled back to the same moment down to the nanosecond, the database is now mid-operation while the clients that're supposedly perfo

      • That's why you power down the VM to take the snapshot. The snapshot is also instantaneous rather than waiting for some vague, sketchy attempt at quiescing the FS.

        If the downtime for a reboot is unacceptable, do not use snapshots.

  • "assistant general manager for operations, said the system's backup computer had gone down at the same time its central supervisory computer crashed."
    Redundancy is not just running two boxes... How many times do we need to point out that there's a reason true redundancy is hard and expensive?

    TFA (sorry for reading it) states that the problem showed up 12 hours after the upgrade. That's why it's time-consuming to test hi-rel stuff, whatever bean counters say...

  • by Somebody Is Using My (985418) on Friday November 22, 2013 @09:10PM (#45497349) Homepage

    See what happens when you give these guys root access? ;-)

    • by bluemonq (812827)

      BART is a metropolitan transit system. The city government of San Francisco has practically nothing to do with day-to-day operations.

  • by manu0601 (2221348) on Friday November 22, 2013 @10:04PM (#45497657)
    I have seen quite efficient manual train network operation, but the workers behind the success could explain it was only possible because they had a few old timers who were still able to organize train flows using paper and pencil. Younger workers had always worked with computers, and when the old timers have all retired, the know-how will be lost.
  • Terry Childs was locked up on the off chance that something far less disruptive than this would happen. At least that was the excuse.
  • If the recent strike wasn't bad enough, now a computer glitch. Man, if I was riding the transit to work and back I would be extremely pissed. Wonder how many people lost their jobs because they couldn't make it to work??
  • They pilot their solar powered dirigibles.
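
For anyone who wants to check the break/fix numbers quoted in the "Strange times" subthread (35% of the workflow being emergency work, then 47% three months later with total workload a third higher), here is a rough worked calculation. It is only a sketch: the thread gives percentages, not absolute numbers, so the starting workload of 100 units and the helper name `split_workload` are assumptions made purely for illustration.

```python
# Worked arithmetic for the break/fix figures quoted in the thread.
# Assumption: the starting workload is normalized to 100 units; the thread
# gives only percentages, so the absolute numbers here are illustrative.

def split_workload(total, emergency_share):
    """Return (emergency, planned) units for a given total workload."""
    emergency = total * emergency_share
    return emergency, total - emergency

before_total = 100.0
after_total = before_total * (1 + 1 / 3)  # "a third higher"

before_emergency, before_planned = split_workload(before_total, 0.35)
after_emergency, after_planned = split_workload(after_total, 0.47)

# Relative growth of each kind of work over the three months.
emergency_growth = after_emergency / before_emergency - 1
planned_growth = after_planned / before_planned - 1

print(f"emergency work: {before_emergency:.1f} -> {after_emergency:.1f} units "
      f"({emergency_growth:.0%} growth)")
print(f"planned work:   {before_planned:.1f} -> {after_planned:.1f} units "
      f"({planned_growth:.0%} growth)")
```

Under those assumptions, emergency work grows by roughly 79% while planned work grows by only about 9%, which is consistent with the point made in that comment: nearly all of the added workload is break/fix.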
