Failed Software Upgrade Halts Transit Service
linuxwrangler writes "San Francisco Bay Area commuters awoke this morning to the news that BART, the major regional transit system which carries hundreds of thousands of daily riders, was entirely shut down due to a computer failure. Commuters stood stranded at stations and traffic backed up as residents took to the roads. The system has returned to service and BART says the outage resulted from a botched software upgrade."
I Guess (Score:2, Funny)
Re: (Score:3)
wow first it's the unions that are shutting them down and now a software update? I wonder what will happen next.
Re: (Score:3, Interesting)
San Fran will turn into Detroit?
While this was posted on Reddit a day ago, it's so on topic to your post that I had to post it in reply.
http://www.reddit.com/r/explainlikeimfive/comments/1r6f8w/eli5_americans_what_exactly_happened_to_detroit_i/ [reddit.com]
Very good read if you want to know about Detroit
Re: (Score:2)
Yah, white people have NEVER fucked up a government.
Re: (Score:2, Interesting)
Interesting that you bring up Haiti. They occupy the same island as the Dominican Republic; while Haiti has been a disaster for a very long time, the DR has always been totally different (just look at a satellite photo showing the deforestation on the Haitian side, while the Dominican side is lush and green). Now, if you go look at the people there (which you obviously haven't, because you're a dumb troll who lives in a trailer), you'll see that they're all black! The main difference between them is that
Re: (Score:2)
Congo and Rwanda are total shitpots, as is Liberia. By some reckonings the latter is the shittest shitpot ever.
None of those were French colonies.
Re: (Score:1)
I wonder what will happen next.
People will buy cars. Only so much of this nonsense can be tolerated when it fucks with your livelihood. When the boss shows up and all the people with cars are getting it done and all the people with train tickets are at home making excuses... well, you shouldn't need any help figuring this part out, even if you don't like it.
Re: (Score:2)
Except the boss probably couldn't get to work either, unless maybe he has a bike.
Re: (Score:2)
Except the boss probably couldn't get to work either, unless maybe he has a bike.
What? The executive class condescend to ride in Public Transportation? Scoff!
Re: (Score:2)
They do here. They just have First Class tickets instead.
And the ones who drive just get stuck on the M25 instead.
Re: (Score:2)
Let's say all of the BART riders start driving in. They will find themselves adding more traffic to an already congested highway system that will never, ever, get any larger. There simply isn't the space. And once they get to work, good luck finding some place to park...
Re: (Score:3)
Most of them already have cars. BART serves the Bay Area, 50 miles south and east of SF.
The week-long strike earlier this year caused havoc on the roads: people were on the road at 0400 and still late for work. Extra buses, extra boats, not enough.
https://www.google.com/search?q=bart+strike+traffic&espv=210&es_sm=119&tbm=isch&tbo=u&source=univ&sa=X&ei=EhyQUtq2FYb9iQKq2oG4CQ&ved=0CDYQsAQ&biw=1354&bih=647 [google.com]
Re:I Guess (Score:5, Funny)
wow first it's the unions that are shutting them down and now a software update? I wonder what will happen next.
Unionized software.
Ironic, isn't it? Silicon Valley commutes wrecked due to bad IT practices!
Re: (Score:2)
You do realize you've just summoned an earthquake, right?
Re: (Score:2)
There's a guy who catches one of the trains I catch in the morning who always gets on with his skateboard. Although I work in North London.
Strange times (Score:5, Insightful)
Re:Strange times (Score:5, Informative)
From one standpoint, it makes sense, especially if those doing the work need technical support from a vendor. On the other hand, it probably makes more sense to have a QA lab set up if one is going to operate this way, so that one can test a rollout in advance, hopefully forestalling such problems going live.
Re:Strange times (Score:4, Insightful)
Not so much the bureaucracies (Score:2)
Re:Strange times (Score:5, Insightful)
On the other hand, it probably makes more sense to have a QA lab set up if one is going to operate this way, so that one can test a rollout in advance, hopefully forestalling such problems going live.
And that's pretty hopeful. The thing is, in the real world, you just don't test all your patches. You can't; in any non-trivially sized network you're going to have hundreds of them to go through every week, and the workload is the same for a small or large business. That's why large businesses tend to do better (strangely enough) than small ones when it comes to patch management. And this is an attitude that is backed up by the numbers -- I would say over 9 times out of 10, a break/fix patch has no consequences being pushed into the production environment. It goes out. The version increments. The end. It's that 1 time that screws everyone up -- but it happens infrequently enough that management doesn't update its policies.
Most managers operate under a triage approach to maintenance -- that is, throw resources at a problem when something breaks and complaints start coming in, rather than throwing resources at prevention. In the short run, this is the right approach -- in a crisis you want all hands on deck. The problem is that over time, neglecting preventative maintenance procedures, which show up only as a cost without a defined benefit, results in departments moving to a triage model all the time. Basically, the problem is short-term prioritization over long-term cost reduction.
And I've seen it in almost every IT department I've worked for. I've even sat down with managers and explained to them that when 35% of their workflow is emergency break/fix and that number is trending upwards, we have a process control issue. They invariably agree with me, but say they can't get out from under the workload. Of course, when I come back three months later and it's now at 47% and the workload is now a third higher, they say the same thing.
I would lay money that this is how project management is happening at BART, and it has now deteriorated to the point where it's starting to impact its core business. The problem is, while it is still likely at a point where effective project management can right this sinking ship... it almost never happens. Unfortunately, the solution most of the time here is to throw someone under the bus, blaming them for the failure, and insisting that as the system has worked up until this point, it does not need an overhaul.
They couldn't be more wrong. But unfortunately it will take several people being thrown under the bus and a few more high-profile failures before senior management fires the mid-level manager responsible for the project, brings on someone with a strong background in project management, and restructures the department from the ground up following the best practices of change management. Of course, they'll overdo it in the attempt and the pendulum will have to start swinging back the other way, but... that's what happens.
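To put rough numbers on the "9 times out of 10" figure above, here is a back-of-the-envelope sketch. It assumes every untested patch is independent and has the same chance of going out cleanly, which is a simplification. The function name and the sample batch sizes are made up for illustration and have nothing to do with BART's actual process.

```python
def chance_of_at_least_one_bad_patch(clean_rate: float, patches: int) -> float:
    """Probability that at least one of `patches` untested patches causes trouble.

    Assumes each patch is independent and goes out cleanly with
    probability `clean_rate` (an illustrative simplification).
    """
    if not 0.0 <= clean_rate <= 1.0:
        raise ValueError("clean_rate must be between 0 and 1")
    if patches < 0:
        raise ValueError("patches must not be negative")
    # All patches are clean with probability clean_rate ** patches,
    # so the complement is the chance that at least one is not.
    return 1.0 - clean_rate ** patches


if __name__ == "__main__":
    for rate in (0.90, 0.99):
        for count in (1, 10, 100):
            risk = chance_of_at_least_one_bad_patch(rate, count)
            print(f"clean rate {rate:.2f}, {count:3d} patches: {risk:.1%} chance of a bad one")
```

Under those assumptions, even a 99% clean rate leaves roughly a 63% chance that a batch of 100 patches contains at least one bad one, which is why "it almost always works" and "it eventually bites you" can both be true.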
Re: (Score:1)
Here's the thing.
Every company wants cheap IT right now. They want an endless stream of no-benefit, no-complaint, low-wage IT workers to come in and set things up so they can fire now-redundant staff, enable them to compete with companies handing them their asses on a silver platter, implement new systems to replace ones that are often decades old, or reduce their current IT operating costs. Very few companies want something entirely new built from scratch thanks to ZIRP; it makes no sense right now
Re: (Score:2)
Well, often they have someone already picked out, but I don't think many are total fabrications. Most people who go to the effort of posting a job application actually do want to hire someone. They may not want to pay enough to actually get someone who matches their requirements, but that's a separate matter. (And often their requirements are literally insane. The people who write the applications must not have ANY idea of what they're asking for.)
Re: (Score:2)
Re: (Score:3)
Why was a weekday selected for this software update?
Should have been a Tuesday. Then our Windows updates and our transit updates would match! (... 14% ... for ... ever ...)
Re: (Score:1)
You know, I honestly don't give a fuck about global warming. I figure by the time it happens I'll already be dead. Fuck the future generations. And I don't give a fuck if Obama can see me post this. I'm going to shit debt, CO2 and eyesores on them. I'm living for me. Not some fucking little brat who keeps crying while I'm at a restaurant, trying to enjoy a simple fucking meal. Fuck them. Fuck this planet.
This is a typical Baby Boomer. Imagine it. In all of American history, the Baby Boomers are the first generation to leave their children with a worse, more fucked-up world than what they had. This is more than a mere "fail at life". This is a fail at present AND future life. That's unprecedented in this country.
And the average Baby Boomer is so arrogant and entitled too. If I were them I'd be a lot more humble and try to stay out of the way and stop running up debt and stop ranting about the youth
Re: (Score:2)
Re: (Score:2)
Why a *production* system was chosen to test the upgrade would be a better question. Why there were no fallbacks would be an even better one...
Re: (Score:3)
Re: (Score:3)
Yes, of course, it's always clueless management ignoring the brave developer who warns of catastrophe.
If management wants the power in the form of the final decisions (which they have), and the ability to take most of the credit (which is often the case), then they also get to keep the responsibility.
Sounds fair to me. Power and responsibility should never be separated. Ever.
Don't need a qualifier if there's only one... (Score:2)
I'm sure that if you asked them the answer would be along the lines of "Huh? What's a production system? We just call it the system."
I once argued for retention of a QA system, which was basically a 4 week old copy of Prod. Things like being able to replicate actual problems with actual data, test new functionality & patches without impacting the business counted for less than some little tart's fluttering eyelashes. Of course that's what management wanted to hear, because an extra server is just a
Re:Strange times (Score:4, Funny)
Re: (Score:2)
Why was a weekday selected for this software update?
The same reason your cable company does maintenance in the middle of the day when at night they would disrupt far fewer customers -- the managers are tightwads and don't want to pay the rank-and-file employees for the extra hours outside their normal schedules, and the ones on salary are among that group that refuses to work outside 9-5 M-F.
Re: (Score:1, Interesting)
Because there is no means in the "cockpit" to actually make the train go. There are three buttons in a BART rail car:
Open Doors
Go to next stop
Emergency Stop
Not even a "close doors" button - that is handled by door sensors and the computer when "Go to the next stop" is pressed.
Everything is automated. A chimpanzee could operate a BART train.
Re:BART has drivers. (Score:5, Interesting)
You've almost certainly never ridden BART, much less seen the driver's cab. Why do I say this? Because there's a section of the BART system (the Oakland Wye, bane of commuters who want to get anywhere during rush hour) where drivers are instructed to go to manual control, limited to 25 MPH. It's the result of your vaunted "automated" system designed in the '60s never having worked properly in the past 50 years, and one of the contributing factors to a crash in 2009 (thankfully no one was seriously injured). There are many well-documented incidents of entire train sets disappearing from the computer system, as well as "ghost" trains randomly appearing.
Here is what an actual BART cab looks like:
http://i.imgur.com/IbYtYTa.jpg [imgur.com]
computers run the track switches (Score:3)
computers run the track switches
Re: (Score:2)
A bit more than that. There was an incident a decade or so ago when the driver got out to fix a jammed door, and when it was unjammed, the train decided it was time to take off for the next station. It got there, stopped, and opened the doors. And waited for the driver to show up.
Hello, IT. (Score:3, Funny)
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
It was broken decades ago (and remains so). The automated system never really worked properly.
BART (Score:5, Interesting)
BART is run by the dumbest people on Earth. First off, it takes a special kind of stupid to create a rail system that goes almost, but not quite all the way to the airport. 30 years later they extended to one of them but you still have to transfer to a bus for the last mile on another. Then you have to wonder what kind of idiot puts light carpet and cloth seating on public transport. 35 years later they start testing non-porous flooring/seating and maybe in another five years all of the trains will be switched over. Then, some bean counter got a bonus when they closed all the station bathrooms when 9/11 happened, ostensibly for security. Now a fifth of the escalators are out of service at any one time because they are clogged with human shit.
I also heard there was some sort of labor dispute.
Re:BART (Score:4, Insightful)
"BART is run by the dumbest people on Earth."
Well, you really do have to wonder when they say they worked through the whole night only to discover that this new, mysterious problem was caused by the update they'd made the night before.
I mean, wow. Wouldn't that be the first thing that popped into your mind?
Re: (Score:3)
Re: (Score:2)
Re: (Score:2)
And even at the busiest stations they're cleaner than my office bathroom.
Re: (Score:3)
London Underground toilet map [tfl.gov.uk] (not so great in the centre, but pretty good elsewhere).
They're in probably half of European underground stations, on average. Expect to pay 0-50c, depending on the country.
My local station (in London) has one, it's always very clean. I don't think many people use it.
Re:BART (Score:5, Informative)
The BART-SFO extension was a matter of politics; you can't blame the people who run BART for that. You also can't blame the initial designers for not building the OAK extension, since OAK was a much smaller airport in those days (and had very few passenger flights).
The train design was done by an aerospace company with absolutely no rail experience, which explains BART's quirky design elements. But you can't blame BART's current management for construction contracts awarded in the 1960s.
Re:BART (Score:5, Insightful)
Plus, BART is not exactly a metro system like in Boston, Chicago, or New York. It's somewhere between a metro and commuter rail, but closer to the latter. It's a product of 1960s thinking, where people were trying to deal with the population shift out of the urban core. So part of the idea was to create high-speed transit from bedroom communities to downtown Oakland and San Francisco.
Connecting the airports probably never figured much into the equation. It wasn't built to supplement the transportation needs of carless San Francisco residents. It was built to shuttle people around the Bay Area. If you needed to get to the airport, you got there like everybody else--you drove your car.
Re: (Score:2)
It wasn't built to supplement the transportation needs of carless San Francisco residents. It was built to shuttle people around the Bay Area. If you needed to get to the airport, you got there like everybody else--you drove your car.
But this just comes right back to how BART is stupid. Because when you build public transportation, it's going to be used by people who don't have cars, and to not take them into account is fucking stupid. Also, it's just stupid not to have the rail be able to take commuters from an airport to downtown no matter how you slice it. That should have been an initial design goal.
Re: (Score:3)
If you needed to get to the airport, you got there like everybody else--you drove your car.
But this just comes right back to how BART is stupid. Because when you build public transportation, it's going to be used by people who don't have cars, and to not take them into account is fucking stupid.
Maybe the assumption was that if you couldn't afford a car, you probably couldn't afford to be going on many flights either. Keep in mind air fare was a bit pricier in the '60s and gas was quite a bit cheaper. The financial bar for car ownership was lower.
Re: (Score:3)
Well, what I meant was that they should have taken both classes of passenger into account.
Ideally this means having lines segregated by socioeconomic status. You don't want to go to the airport and the ghetto.
Re: (Score:2)
Plus all the people who work at the airport live there.
That's what the extra-large size luggage lockers are for.
Re: (Score:3, Funny)
So people take a dump while riding the escalator? That's actually a cool idea.
Re: (Score:3)
Re: (Score:1)
It was certainly a moving experience; quite uplifting. The person behind me didn't seem to fully appreciate the view; or having to climb backwards when I stopped at the top to wipe --- especially once certain stairs came 'round again full loop. I suppose if I wasn't a Republican, I might have cared about their distress --- but, screw it, shitting on people just feels so good. Made riding on the peons' transit system feel totally worth it.
Re: (Score:3)
> 30 years later they extended to one of them but you still have to transfer to a bus for the last mile on another.
Pity you didn't have a spare $100 million a couple decades ago. I'm SURE you'd have been willing to pay for it, right? The extension to SFO wasn't built until recent times because back in the '60s San Mateo County quit the BART project, and the money wasn't around until the tech bubble started growing; ground was broken in 1997. The Oakland extension wasn't started until recently (opens in 2
This is really surprising to me. (Score:2)
This is really surprising to me.
For all the "can not fail" systems I've worked on, there has been an identical set of hardware, along with other hardware to simulate load, on which you could try upgrades before you put them on a live system and cost the local economy tens of millions of dollars by screwing up.
Re: (Score:2)
I recently attempted to test the implementation of a client unlike any of those we had previously hosted, and the CIO and his Development VP told me, "we don't have the resources for that, we'll test it in production". It failed in production. I
Re: (Score:3)
and cost the local economy tens of millions of dollars by screwing up.
So what? What's BART's incentive to avoid this? The customers will go to a competitor? They'll lose their jobs?
Unionized monopolies are a wonderful thing.
Re: (Score:2)
and cost the local economy tens of millions of dollars by screwing up.
So what? What's BART's incentive to avoid this? The customers will go to a competitor? They'll lose their jobs?
They'll do what they did Thursday and Friday, and flood the roads with drivers who have cars for emergencies, usually take public transit, and are pretty inexperienced as drivers in regular traffic, not just "BART's out traffic". BART isn't really necessary; it's convenient for a lot of people, but once it drops below the convenience threshold, people simply won't use it.
Re: (Score:2)
I understand your argument, but do you think the BART employees really think that BART will get closed down if they don't do a great job?
Re: (Score:2)
Yeah gramps, we did all that in history class, along with slideframes and mainrules and all that.
That's obsolete now because cloud and agile and webscale.
Re: (Score:2)
Yeah gramps, we did all that in history class, along with slideframes and mainrules and all that.
That's obsolete now because cloud and agile and webscale.
Let me know when you get the next G.E. Medical systems MRI system running "in the cloud" rather than on a local control system and a console in the next room, and then trust your life to the thing. Meanwhile, I think I will probably stick with the medical equipment I've worked on instead.
P.S.: Let me know when your cloud is HIPAA certified.
Software Has No Union Rep (Score:2)
I guess you can't always save by eliminating humans and their expensive unions. Although, I'm sure the software was intended to pick up the financial slack for all of those expensive peeps. Don't worry, Wall Street is highly motivated to eliminate the humans with the software, eventually...
Comment removed (Score:3)
Re: (Score:3)
You have to realize how few people even know what a VM is. Or a snapshot. Where I work, there is one backup made each week, on the server. No other machine has a snapshot, a disk image, or a backup, and there are no VMs - nothing. If/when a disk fails, that machine comes to a halt until a vendor is called in to replace the disk, the OS, and all the software.
We have some fool who is referred to as "the IT guy". I can't even say that with a straight face. This is one of those who got a Microsoft-centric educa
Re: (Score:2)
You can do snapshots by means other than VM software. Many volume managers and filesystems can do it, and some disk array controllers have that built in.
Re: (Score:2, Interesting)
No. Just no.
Have you ever actually tried this on a production system? I haven't (I'm not stupid enough to do that), but I've seen many others try. In almost every case, the resulting mess from "rolling back" a VM was greater than the mess of a botched software update to begin with. In one particular case, I witnessed a certain VM running some very expensive enterprise software totally hose itself and then proceed to blow away the majority of a database hosted on another VM after it was restored following a
Re: (Score:1)
> The second example you gave could have easily happened outside of a virtual environment. Imagine somebody did a restore from backup, or accidentally fucked up the system clock - the same thing would have occurred. That is just shitty software and not a problem related to virtual machines.
Because people just love to take down a system for hours restoring from tape at random? My point was that they restored the VM from snapshot because it was a quick and easy process. The system itself went down for abou
Re: (Score:3)
Gods, no. Just... no. Think for a minute. If your VM's running a database server and you roll back to a snapshot, what happens? Well, the snapshot doesn't know anything about the database since that's an application-level thing, so it'll roll back to being mid-operation (times however many database operations were in progress). The problem is that since the clients haven't been rolled back to the same moment down to the nanosecond, the database is now mid-operation while the clients that're supposedly perfo
Re: (Score:2)
That's why you power down the VM to take the snapshot. The snapshot is also instantaneous rather than waiting for some vague, sketchy attempt at quiescing the FS.
If the downtime for a reboot is unacceptable, do not use snapshots.
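For what it's worth, the "power down first" approach can be sketched in a few lines. This is a toy model only: `ToyVM`, `cold_snapshot` and `roll_back` are hypothetical names invented for illustration, and nothing here is a real hypervisor API. The "disk" is just a dictionary.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class ToyVM:
    """Hypothetical in-memory stand-in for a VM (not a real hypervisor API)."""
    disk: dict = field(default_factory=dict)
    running: bool = True
    snapshots: list = field(default_factory=list)

    def power_off(self) -> None:
        self.running = False

    def power_on(self) -> None:
        self.running = True


def cold_snapshot(vm: ToyVM) -> int:
    """Power the VM off, copy its disk state, then restore the prior power state.

    Returns the index of the new snapshot.
    """
    was_running = vm.running
    vm.power_off()  # nothing can write to the disk while the copy is taken
    try:
        vm.snapshots.append(copy.deepcopy(vm.disk))
    finally:
        if was_running:
            vm.power_on()
    return len(vm.snapshots) - 1


def roll_back(vm: ToyVM, index: int) -> None:
    """Replace the disk with a copy of snapshot `index`, with the VM powered off."""
    was_running = vm.running
    vm.power_off()
    try:
        vm.disk = copy.deepcopy(vm.snapshots[index])
    finally:
        if was_running:
            vm.power_on()
```

Note that this only covers the state of the one machine being snapshotted. It does nothing about the objection raised earlier in the thread, that clients and other servers have moved on since the snapshot and are not rolled back with it.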
Good redundancy (Score:2)
"assistant general manager for operations, said the system's backup computer had gone down at the same time its central supervisory computer crashed."
Redundancy is not just running two boxes... How many times do we need to point out that there's a reason true redundancy is hard and expensive?
TFA (sorry for reading it) states that the problem showed up 12 hours after the upgrade. That's why it's time-consuming to test hi-rel stuff, whatever bean counters say...
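A toy model of why upgrading both boxes together defeats the redundancy. The 12-hour delay is the figure reported from TFA above; the repair time, the time horizon and the function name are assumptions picked purely for illustration, not anything known about BART's setup.

```python
def hours_with_no_working_node(upgrade_hours, fault_delay, repair_time, horizon):
    """Count the hours in [0, horizon) during which every node is down.

    Toy model: each node is upgraded at its entry in `upgrade_hours`,
    fails `fault_delay` hours after its upgrade, and stays down for
    `repair_time` hours. All values are illustrative assumptions.
    """
    if not upgrade_hours:
        raise ValueError("need at least one node")
    dark_hours = 0
    for hour in range(horizon):
        all_down = all(
            start + fault_delay <= hour < start + fault_delay + repair_time
            for start in upgrade_hours
        )
        if all_down:
            dark_hours += 1
    return dark_hours


if __name__ == "__main__":
    # Primary and backup upgraded in the same window: both fail together.
    print(hours_with_no_working_node([0, 0], fault_delay=12, repair_time=6, horizon=72))
    # Backup upgraded a day after the primary: the outages never overlap.
    print(hours_with_no_working_node([0, 24], fault_delay=12, repair_time=6, horizon=72))
```

In this model the simultaneous upgrade leaves 6 hours with nothing running, while staggering by a day leaves none. The stagger only helps if it is longer than the time the fault takes to show up, which is the point about testing hi-rel systems being slow.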
Looks like Terry Childs had a point (Score:5, Funny)
See what happens when you give these guys root access? ;-)
Re: (Score:2)
BART is a metropolitan transit system. The city government of San Francisco has practically nothing to do with day-to-day operations.
Manual operation (Score:3)
Re: (Score:2)
You got an idea of the complexity.
Now try to imagine the job of a traffic regulator for an average European city train station: it is the same exercise with dozens of tracks and switches, and hundreds of trains a day. Your tools are sheets of paper, a pencil, and a telephone to call the workers who run the trains and the ones who switch lines (manually, of course).
Some trains are late, they get stopped for various reasons. And in order to ease you, freight trains can be added on the fly as soon as there is
So does somebody go to jail? (Score:2)
Re: (Score:2)
BART is not under the governance of San Francisco.
Terry Childs pissed off the city and he worked for (Score:2)
Terry Childs pissed off the city and he worked for them.
Likely in this case some outside vendor / contractor messed up.
Good grief! (Score:1)
Rich hippies don't ride the train (Score:1)
They pilot their solar powered dirigibles.