1sockchuck writes "A major power outage at Seattle telecom hub Fisher Plaza has knocked payment processing provider Authorize.net offline for hours, leaving thousands of web sites unable to take credit cards for online sales. The Authorize site is still down, but its Twitter account attributes the outage to a fire, while AdHost calls it a 'significant power event.' Authorize.net is said to be trying to resume processing from a backup data center, but there's no clear ETA on when Fisher Plaza will have power again."
It's interesting how many companies assume they have redundancy in place but never take the time to do proper testing. They figure that once a disaster happens, everything will automatically work because their vendor or staff said so. To achieve true redundancy a company needs to do semi-frequent testing to ensure that everything is working properly. Authorize.net might have had what was assumed to be a redundant system in place, but once the disaster happened they soon realized their system wasn't designed or con
When this happens in this day and age the CIO should be fired! And if the CIO recommended a redundant D.C. but the CEO, CFO or Board rejected it as "too expensive"????
If that's the case, then the aforementioned officers should give up their pay to the thousands of merchants who lost their day's pay due to this problem. Yeah, like that'll happen.
Phone lines occasionally go out and that might affect local merchants, but when it's a data center that handles the livelihoods of thousands of merchants, there needs to be much greater redundancy. The businesses that are affected by this are not all huge e-tailers either. Many are just small operators trying to make a living on t
I know redundancy matters more for business services, but this reminds me that customer lines have lots of single points of failure as well. There was a day when DHCP at TeliaSonera, a large Nordic ISP, stopped working, leaving 1/3 of the whole country's residents without internet access. It turned out there was a hardware failure on the DHCP server, which leads me to believe that they actually depend on just one server to handle all the DHCP requests coming from customers. They did fix it in a few hours, but it was still unavailable for the rest of the day because hundreds of thousands of computers were trying to get an IP address from it. That being said, I remember it happening only once, but it still seems stupid.
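To illustrate the "everyone retries at once" problem described above, here is a minimal sketch of capped exponential backoff with random jitter, the usual way to keep a recovering server from being hammered by every client at the same moment. The function name and the `base`/`cap` values are made up for illustration; this is not a claim about how TeliaSonera's customers' DHCP clients actually behaved.

```python
import random

def retry_delays(attempts, base=1.0, cap=300.0, rng=None):
    """Return one client's wait times in seconds before each retry.

    Each wait is drawn uniformly from 0 up to a ceiling that doubles
    per attempt and is capped, so clients spread out instead of all
    retrying in lockstep.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * 2 ** attempt)  # 1, 2, 4, ... up to cap
        delays.append(rng.uniform(0, ceiling))
    return delays

# Two hypothetical clients end up with different retry schedules.
print(retry_delays(5, rng=random.Random(1)))
print(retry_delays(5, rng=random.Random(2)))
```

The point of the randomness is that hundreds of thousands of clients no longer line up on the same second; the point of the cap is that nobody waits forever once the server is healthy again.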
You know we had something similar happen (North Central AR) a few years back. We had over 70k people with zero Internet anything for two days. They couldn't get medical records, use CC, hell the three towns affected pretty much ground to a halt. The cause? The lines heading out to the main branch all converged on a single big fiber trunk that some dumbass farmer nailed with his backhoe while digging a ditch.
So while you can hope there is enough redundancy in the system to keep catastrophic failures like t
there was a failure in the main/generator transfer switch which resulted in a fire. The sprinkler system activated.
Where I work, the D.C. is in a sub-level basement. One day a few years ago, a dim-wit plumber was brazing a pipe with a propane torch, and swung it too close to a sprinkler head.
Sprinkler went off and water did what it does: flow downhill, eventually pouring into the D.C., right onto the SAN storing "my" database...
We were down for a few days. People couldn't access the web site or IVR, but f
Let's imagine that you're actually paying this data centre large amounts of money with the assurance that the money means 99.9% uptime. Then, maybe, it might mean something more.
If you don't give a crap about uptime, then hell, get a Google webpage or something.
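For a sense of what a figure like 99.9% actually promises, here is the back-of-the-envelope arithmetic as a small sketch. It assumes a 365-day year, and the function name is just for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years

def allowed_downtime_hours(availability_percent):
    """Hours per year a service can be down and still meet the target."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)

for target in (99.0, 99.9, 99.99):
    hours = allowed_downtime_hours(target)
    print(f"{target}% uptime allows about {hours:.2f} hours of downtime per year")
```

By this arithmetic 99.9% leaves roughly 8.76 hours a year, so a single outage that drags on for most of a day can use up the whole annual allowance.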
Actually -- in a totally unconnected incident -- my grocery shopping was disrupted today because (according to the note pinned to the closed store's shutters) the store's till server was down, and they'd shut up the shop while they waited for an engineer.
I'm guessing that the server was probably local, possibly above the store, and might have gone fritzy in the heat.
So, real-world implications of computer failure. A server goes down, and suddenly Eric Cannot Buy Cheese ("Aaaaiiiieeee!"). Eric has hard cash, store (presumably) has cheese, but store can no longer sell cheese to Eric. Or anything else.
The shop "crashed".
Okay, so I trudged off and did my grocery shopping elsewhere, but it was a little disturbing to think that we've already gotten to the point where a server problem can stop you buying food, in a "real" shop, with "real" money.
That's pathetic. I've seen stores stay open during 24 hour POWER FAILURES! Any manager who does not teach their employees how to manually do credit card transactions (yes you can do them by paper!) should never have been hired in the first place.
When we lose power around here (once every 6 months or so), the stores stay open. They simply don't accept debit cards (which require a connection to the bank) until the power comes back on.
Or (gasp!) make change without a computron! I wonder if they even train that in grocery stores anymore...scary, indeed.
I think the bigger issue in this case would be manually looking up the price for every single item. We tend to simplify selling things manually in this way (manually processing credit card transactions, making change manually, etc.), when really the biggest problem is being without the UPC system.
The media are also following the story, KOMO a local station was knocked offline but are broadcasting from a backup site.
Way to go guys! At least two national, and maybe even international, ICT companies on whom numerous affiliates depend fail to provide an adequate backup facility and continuity plan, yet the local AM radio station manages to pull it off. I'm guessing that some heads are gonna roll after the holiday weekend...
I'm pretty sure they're talking about KOMO, the TV station, actually. It's one of the largest stations here in Seattle. I think they take up a fair chunk of Fisher Plaza, where the fire was. Still, your point about international and national business entities failing when a local business succeeds is pretty stupid.
Apparently Verizon has a single point of failure for much of its FiOS for the metro areas of Western Washington state in this building as well so the FiOS customers are offline as well right now.
Clownshoes: Have no failover plan and be singly homed.
Meh: Have a failover plan.
Good: Have a failover plan that requires humans and exercise it regularly.
Better: Have a failover plan that is automated and exercise it regularly.
Best: Eliminate single points of failure so failover is turning off the flake or fail and going back to drinking a beer.
Hot/Hot is always a more ideal solution than Hot/Warm or Hot/Cold for disaster recovery (and increasing equipment utilization/ROI), and this event demonstrates why.
by Anonymous Coward
on Friday July 03, @12:41PM (#28573405)
Fisher Plaza is supposed to be a regional telecomm / communications / medical care hub for the Seattle area. It was designed and built to *not* crash, even in a magnitude 9.5 quake. Sounds like they've got work to do...
There are four main factors that can take a part of a society's key infrastructure offline.
1: ACTS OF GOD Meteor strike, lightning strike, extreme weather...
2: ACTS OF MALICE War, terrorism, extortion, employee sabotage, criminal attacks...
3: WEAK INFRASTRUCTRUCTURE Underpowered networks, inadequate UPS backups, skeleton staffing, the shaving of safety margins as an efficiency exercise, inadequate rate of replacing old hardware...
4: MANAGEMENT ARSINESS This is when a problem starts, and the people in charge either don't know how to react, don't care, or prioritise face-saving over actual problem-solving. This happens when you get an outage, and instead of system management promptly calling all their critical clients to inform them, and warn them that there's maybe twenty minutes of UPS capacity in the routers if the system's not fixed by then, they instead cross their fingers and hope that things'll work out, and worry about what to tell the clients afterwards.
Fisher Plaza seems to have suffered from a case of #4 recently, so it's not surprising that they've gone down again. The first time should have been the wakeup call to show them that their human systems were in need of an overhaul. Without that overhaul, you're setting up a dynamic in which the second time it happens, things are even worse (because now people are locked into defensive mode).
No matter how advanced your technological systems, if the people running it have the wrong mindset, you're gonna go down. And when you go down, you're gonna go down far far harder than necessary.
What, was the backup data center on the floor directly below the primary data center?
If I had to guess, either they did something that stupid or they didn't properly test their failover procedures or their backup data center, and either one or both of those things turned out to be inadequate.
Sometimes folks set up a redundant system and forget to make one key piece redundant.
Example: A server rack with two UPS systems. Each server has two power cords, one going to each UPS, but the switch everything is plugged into only has one power input, so it's connected to UPS A.
Power blinks and UPS A decides to shit itself. Rack goes down, even though all the machines are up, because the network switch loses power.
Solution? An auto switching power Y-cable with two inputs, and one output. But 80% of people will be lazy and not bother. Oops.
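The rack in that example is small enough to check by hand, but the same check is easy to sketch in code: list what each device is plugged into, then ask which single UPS failure leaves something completely dark. The device names and the `power_feeds` layout below are assumptions taken from the example, not a real inventory format.

```python
# Hypothetical rack from the example: every server is dual-fed,
# but the network switch has only one power input.
power_feeds = {
    "server1": {"UPS A", "UPS B"},
    "server2": {"UPS A", "UPS B"},
    "switch": {"UPS A"},
}

def single_points_of_failure(feeds):
    """Map each power source to the devices that lose ALL power if it fails.

    Sources whose failure leaves every device with at least one live
    feed are omitted from the result.
    """
    sources = set().union(*feeds.values())
    result = {}
    for failed in sorted(sources):
        dark = sorted(dev for dev, inputs in feeds.items()
                      if not (inputs - {failed}))
        if dark:
            result[failed] = dark
    return result

print(single_points_of_failure(power_feeds))  # {'UPS A': ['switch']}
```

Giving the switch a second feed (for instance through a transfer switch) makes the result come back empty, which is the whole point of the exercise.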
An auto switching power Y-cable with two inputs, and one output? I've never seen or heard of these. Do you have a manufacturer or part number? I'd definitely like some.
Not only is it a holiday, but there is a HUGE geocaching event (for 3 days) happening in B.C. and anyone attending (I know some people) are SOL for getting information about it.
If anyone knows of a secondary site for finding info on the events, please post!
... whose broadcast facilities reside in this building (they were broadcasting from a park on Queen Anne hill this morning), it was due to a transformer vault fire. The resulting sprinkler operation rendered their backup generator inoperable.
Being in the power biz, this sort of thing is to be expected in typical office buildings. Sometimes the power goes out. Live with it. What really puzzles me is how someone can take such a structure, install a raised floor and some big A/C units on the roof and sell it as a data center. This kind of crap goes on all the time, as I've seen purpose built data centers go down for single point failures.
Adhost oversees two sites for my family's business: http://www.seliger.com [seliger.com] and http://blog.seliger.com [seliger.com]. At least part of the Fisher Plaza data center seems to be up at the moment because seliger.com will load for me, while blog.seliger.com won't. When I figured this out a few hours ago, I sent an e-mail to Adhost and got this as part of the response:
We have been advised by the building engineering team that they anticipate restoring power to the Plaza East building in plus or minus 4 hours. We sincerely ho
Sorry to reply to my own comment, but the Adhost e-mail servers are also working. I don't know if this is because their main site is coming back online or if it's because their backup worked.
I used to manage a 22 rack cage that we leased from Internap at Fisher Plaza back in 2005. They really did build the place well. Massive diesel generators, independent well water, redundant cooling, etc. But it was designed to survive and continue broadcasting for a local news station for 18 days without resupply in the event of a major external disaster like an earthquake.
I imagine they are reviewing their DR procedures and designs now to minimize collateral damage from internal factors.
Focusing on something that 99% of us screw up at one point or another, particularly when our primary focus at the time is probably getting the service back online rather than checking the calendar to see if it's Daylight Saving Time or not, for me is always a red flag that you're an insufferable pedant.
Come on, the guy's sig is a link to some comic rant about "its versus it's" which, whilst it annoys me no end, is most definitely a good indicator that he is, no doubt, an insufferable pedant.
"Our current estimate for re-establishing Bing Travel functionality is 5pm PST," says a notice at Bing
When someone in a technical role screws up a timezone designation, for me that is always a red flag that they are sloppy with facts, and I need to closely watch their other decisions, actions and statements, because they may be in over their head.
It's quite likely that this message was not posted by somebody in a technical role, but a managerial role. The technical people may very well have just said "by 5:00" or possibly "by 5:00 Pacific Time", and whoever posted the notice on the web site (while the technical people were busy working on trying to fix things) added "PST" instead of "PDT".
Changes to the time are not random at all, they're clearly defined. Of course those definitions are periodically changed randomly with minimal notification, but that's not the same problem.
Wow, you are just as bad as AuthorizeNet... Namely you are putting all of your eggs into one basket called AMERICA... What you are ignoring are the ramifications if a government decides to take you down. And frankly I am more worried about a government taking me down than some accident.
I am part of a hedge fund and we have data centers in... Caymans, Monaco, and Switzerland... I think you get the drift here... And our exchanges that we talk to are scattered throughout the world... Is it simple? Cheap? No
Heh (Score:5, Insightful)
Redundancy ain't just a river in Egypt.
Re: (Score:3, Informative)
Re: (Score:2)
Re: (Score:2)
... putting all your eggs in one basket is a stupid idea...
....but... maybe they blew their budget on a really, really good basket?
Re: (Score:2)
maybe they blew their budget on a really, really good basket?
Mark Twain: Put all your eggs in one basket, and then guard that basket!!
Re:No Backup?? (Score:5, Insightful)
When this happens in this day and age the CIO should be fired!
And if the CIO recommended a redundant D.C. but the CEO, CFO or Board rejected it as "too expensive"????
Re: (Score:3)
Re:No Backup?? (Score:4, Interesting)
Re: (Score:2)
Re: (Score:3, Interesting)
Re: (Score:2)
Re:Oh, the humanity! (Score:5, Insightful)
Re: (Score:3, Insightful)
Re: (Score:2)
Re: (Score:3, Insightful)
No Carr.... (Score:2)
Hmm. Power outage stops /. posts. News at 11
Backup data center was impacted too (Score:2)
Also affecting Bing.com (Score:2, Interesting)
Bing Travel servers are located in the same server hall. More info: http://isc.sans.org/diary.html?storyid=6721
The best line from the SANS ISC (Score:4, Interesting)
Re: (Score:2)
Failover Planning (and this broke FiOS too) (Score:5, Informative)
Re: (Score:3, Informative)
Looks like from twitter comments that Verizon finished their failover since people's FiOS is coming back now.
Fisher Plaza is a disaster response center (Score:4, Informative)
System failure (Score:5, Informative)
Re: (Score:3, Insightful)
5: Government...
A government that decides to come to your headquarters and decides they want all of your hardware pronto...
Re: (Score:3, Funny)
3: WEAK INFRASTRUCTRUCTURE
It's good to see that you've provided redundancy for the "TRUC" part of your infrastructure, but I'm concerned about the rest of it.
Re: (Score:2)
Authorize.Net did have a backup (Score:3, Informative)
"@gotwww The backup data center was impacted too. Don't have info as to why. The team is solely focused on getting us back up for now."
Re: (Score:2)
Re:Authorize.Net did have a backup (Score:4, Interesting)
Happens all the time; I see it everywhere.
Re: (Score:3, Insightful)
Geocaching.com too (Score:5, Informative)
And on a holiday. Bummer. :(
Re: (Score:2)
According to KOMO news (Score:3, Informative)
Local Seattle coverage: (Score:2)
http://www.seattlepi.com/local/6420ap_wa_fisher_plaza_fire.html?source=mypi [seattlepi.com]
http://seattletimes.nwsource.com/html/localnews/2009415646_webfisherplaza04.html [nwsource.com]
Fisher Plaza? (Score:2)
Wow... (Score:2)
Re: (Score:2)
Fisher Plaza Designed to survive External factors (Score:2)
But let's not be to
Huge portable generator arrives at Fisher Plaza (Score:2, Interesting)
Re: (Score:2)
They do. It sounds like the involves-humans failover process failed somehow.
Re: (Score:2)
I guess your point is that it is PDT time now.
Re:sloppy engineering (Score:4, Funny)
Re:sloppy engineering (Score:4, Informative)
Re: (Score:3, Insightful)
Re: (Score:2)
Not really sure what you are complaining about... PST == Pacific Standard Time. I don't see anything wrong with this.
And that's exactly why these kinds of mistakes are made.
Seattle is currently on PDT (GMT -0700), not PST (GMT -0800). The switch back to PST happens in November.
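To make the one-hour difference concrete, here is a small sketch using fixed UTC offsets (so it needs no timezone database). It takes the notice's "5pm PST" literally and shows what a Seattle wall clock on PDT would read at that instant. The year in the date is an assumption; only the July date matters for the point.

```python
from datetime import datetime, timedelta, timezone

# Fixed-offset zones matching the offsets quoted above.
PST = timezone(timedelta(hours=-8), "PST")
PDT = timezone(timedelta(hours=-7), "PDT")

# "5pm PST" read literally (year assumed for illustration).
eta_as_written = datetime(2009, 7, 3, 17, 0, tzinfo=PST)

# The same instant on a clock that is actually running on PDT.
eta_on_wall_clock = eta_as_written.astimezone(PDT)

print(eta_as_written.isoformat())     # 2009-07-03T17:00:00-08:00
print(eta_on_wall_clock.isoformat())  # 2009-07-03T18:00:00-07:00
```

So anyone who takes the notice at its word is told 6pm local time, not 5pm, which is exactly the kind of ambiguity the wrong abbreviation creates.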
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
it would require a terrorist attack on New York PLUS an earthquake in San Francisco to knock us offline.
Which is all moot since you're using authorize.net as a payment gateway. ;)
Re: (Score:2)
*avoids eye contact*