Microsoft Azure's Southern US Data Center Goes Down For Hours, Impacting Office 365 and Active Directory Customers (geekwire.com)
New submitter courcoul alerted us to the Azure outage, which is affecting customers in many parts of the world: Some Microsoft Azure customers with workloads running in its South Central US data center are having big problems coming back from the holiday weekend Tuesday, after shutdown procedures were initiated following a spike in temperature inside one of its facilities. Around 2:30 a.m. Pacific Time, Microsoft identified problems with the cooling systems in one part of its Texas data center complex, which caused a spike in temperature and forced it to shut down equipment in order to prevent a more catastrophic failure, according to the Azure status page. These issues have also caused cascading effects for some Microsoft Office 365 users as well as those who rely on Microsoft Active Directory to log into their accounts. The cooling system is the most critical part of a modern data center, given the intense heat produced by thousands of servers cranking away in an enclosed area. More resources: The official status page of Azure; and third-party web tracking tool DownDetector's assessment. Further reading: Microsoft Azure suffers outage after cooling issue.
Dad told you not to touch the thermostat! (Score:1)
D'oh!
This is why (Score:5, Insightful)
I do not like software that requires you to phone home to the mothership. The second something goes wrong outside of your control, it borks all your work. Office 365 is a bad joke if I have ever seen one.
Aside: Yes I know video games do this a lot but games are games and work is work.
Comment (Score:5, Interesting)
My employer was affected. Many employees could not authenticate to our third-party webapps because we use whatever the cloud-provided Active Directory SSO solution is. Ah, well. I wonder if this violated SLAs and we'll get some money back... My company is always concerned about not violating our SLAs to our customers (SaaS), so hopefully we extract the same pound of flesh from our vendors.
Re: (Score:3)
Being dependent upon "the cloud" is not a good thing, and yet so many companies are throwing out their brains and signing up in the hope of reducing costs. The company that recently purchased my previous employer has gone whole hog for Microsoft, Microsoft 360, Microsoft cloud, and anything with the word Microsoft attached, most of it online only. To read some corporate announcements I have to log into a third-party site, which just seems absurd to me. When the cloud servers eventually get their inevitable downtime, I predict a lot of hand wringing.
Re:Comment (Score:4, Interesting)
Being dependent upon "the cloud" is not a good thing, and yet so many companies are throwing out their brains and signing up in the hope of reducing costs. The company that recently purchased my previous employer has gone whole hog for Microsoft, Microsoft 360, Microsoft cloud, and anything with the word Microsoft attached, most of it online only. To read some corporate announcements I have to log into a third-party site, which just seems absurd to me. When the cloud servers eventually get their inevitable downtime, I predict a lot of hand wringing.
I haven't seen this level of slavish devotion to a single vendor since the IBM administration.
For most small to mid-sized businesses, "the cloud" is more reliable than any solution they'd be willing to pay for. I don't know Microsoft's redundancy model, but AWS's multi-AZ model gives much more redundancy than most businesses would build themselves -- even more so for multi-region redundancy since most companies aren't going to spend the money to duplicate their production environment in another region on the other side of the country (or world).
Though the side effect of using a cloud provider is that when a major cloud provider goes down, so do a *lot* of businesses -- but that doesn't mean they would have been better off building their own datacenter.
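For what it's worth, here is a minimal sketch (Python with boto3; the AMI ID, instance type, and zone names are placeholder assumptions, not anything from the story) of what spreading a workload across availability zones looks like in practice, so losing one facility doesn't take everything down:

    # Spread identical instances across two availability zones in one region.
    # AMI ID, instance type, and zone names are illustrative placeholders.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    for zone in ("us-east-1a", "us-east-1b"):
        ec2.run_instances(
            ImageId="ami-0123456789abcdef0",  # placeholder AMI
            InstanceType="t3.micro",
            MinCount=1,
            MaxCount=1,
            Placement={"AvailabilityZone": zone},
        )

Each zone is a physically separate facility with its own power and cooling, which is exactly the failure mode in this story.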
Re: (Score:2)
"Datacenter:" means a lot of things. When I hear it I think of something giant to support thousands of customers, as opposed to the servers supporting just internal email and documents and backups.
Re: (Score:2)
The equivalent to multi-AZ is picking the region - "Southern USA" is one of 54 different regions [microsoft.com] - when deploying a resource, one simply chooses which region it goes to (different regions have different costs, different packet latency profiles, different data sovereignty, and sometimes other restricted use cases).
1) If multi-AZ is "pick multiple regions", why is Azure starting (https://azure.microsoft.com/en-us/updates/azure-availability-zones-ga/ ) with multi-AZ regions? Maybe they realise AWS is onto something?
It looks like MS would bill you for inter-region traffic, where AWS doesn't bill you for intra-region (inter-AZ) traffic. Going multi-region on Azure looks much more expensive than multi-AZ on AWS ...
2) 54 different regions? You believe Microsoft's big number on the page, when it seems to be counting:
- Regions
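For reference, the region selection the parent comment describes looks roughly like this in practice (a minimal Python sketch using the azure-mgmt-resource SDK; the subscription ID, group name, and region strings are placeholder assumptions):

    # Create a resource group pinned to a chosen Azure region.
    # Subscription ID, group name, and region strings are placeholders.
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.resource import ResourceManagementClient

    subscription_id = "00000000-0000-0000-0000-000000000000"
    client = ResourceManagementClient(DefaultAzureCredential(), subscription_id)

    # Everything placed in this group lands in South Central US; a second
    # group in another region (e.g. "eastus2") is how you'd get cross-region
    # redundancy, at the cost of inter-region traffic charges.
    client.resource_groups.create_or_update(
        "rg-example-scus", {"location": "southcentralus"}
    )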
Re: (Score:2)
Being dependent upon "the cloud" is not a good thing, and yet so many companies are throwing out their brains and signing up in the hope of reducing costs.
Hardly. Being dependent upon the cloud is infinitely better than what most companies have proven themselves as being capable of.
Remember the cloud is someone else's computer, and that someone else is quite often better at managing it.
Re: (Score:2)
4.5 hours downtime! (Score:1)
4.5 hours downtime for us. Thanks Microsoft.
Re: (Score:3)
On the positive note, at least you can blame the outage on Microsoft and not take the heat yourself for Exchange crashing and being down for 4.5 hours.
Re: (Score:2)
Ah, but your boss will tell you it was your decision to depend on a vendor who turned out to be undependable, making it your fault. And if it was in fact your decision, and not something you argued against and were overridden on, he has a point.
Re: (Score:2)
Touché
Re: (Score:2)
For that matter, even if you did argue against it and HE is the one that did it anyway, it's your fault for not being persuasive enough.
Re: (Score:2)
Ah, but your boss will tell you it was your decision to depend on a vendor who turned out to be undependable, making it your fault. And if it was in fact your decision, and not something you argued against and were overridden on, he has a point.
And he has a valid point, even if HE was the one who decided to make the move. If you're not willing to argue against bad decisions of your boss, you're nothing more than a 'yes man', and useless to the organization.
Infrastructure management (Score:3)
Isn't this what backup generators and N+1 infrastructure are for? I can understand Joe's hosting and bait shop emporium going down, but power and HVAC are pretty well solved sciences. The weather in Texas is hot -- this is not a surprise. There are lightning storms in Texas -- this is also not a surprise.
It seems like if you are positioning a data center in Texas (which there are some reasons for), you prepare for both heat and lightning. I could understand if there was an incredibly unusual weather event (aster
Re: (Score:1)
Data centers don't normally have a redundant cooling system, do they? Backup power, a second (or third) internet connection to an independent provider, sure. But a second air conditioning and ventilation system waiting to take over in case the primary fails?
Re: (Score:2)
They should have redundant chillers, air handlers and environmental controls. Especially as cooling is critical to data center functionality.
A well designed data center has (at a minimum) redundant power feeds from two separate power networks, redundant network connections (from three or more providers) and cooling capacity. Losing a chiller or air handler should not take out the data center.
Re: (Score:2, Interesting)
I maintain HVAC for cell sites. EVERY ONE I've worked on has had TWO independent HVAC systems.
They toggle back and forth to equalize wear and tear, but when one fails the other system takes control and sends me an email asking for attention.
There is no reason in the world for a data center NOT to have multiple HVAC systems in place. The equipment is pocket change compared to the electronics it protects.
It could be M$ should put more thought into the design of their data centers than was put into Win95 or Vista.
Re: (Score:3)
N+1 is common. So you take the total cooling need and divide by (for example) 4, then install 5 systems of that size so you can lose one and be at full capacity. That's necessary anyway since you may need to shut one down for routine maintenance from time to time.
Ideally you don't let everything be interdependent, so if you lose 2, you can still get by with shutting down 1/4 of the hardware.
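To put numbers on that sizing rule (the figures here are purely made-up assumptions), a quick Python sketch of the N+1 arithmetic:

    # N+1 sizing: divide the total cooling load across N units,
    # then install N+1 units of that size. Figures below are made up.
    total_load_kw = 2000      # assumed total cooling requirement
    n = 4                     # units needed to carry the full load

    unit_size_kw = total_load_kw / n        # 500 kW per unit
    installed_units = n + 1                 # 5 units installed

    # Losing any one unit still leaves n * unit_size_kw = 2000 kW,
    # i.e. full capacity, with one unit spare for maintenance.
    print(unit_size_kw, installed_units * unit_size_kw)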
Re: (Score:2)
Isn't this what backup generators and N+1 infrastructure is for?
Yeah, that's the theory... However, in practice, maintaining no-single-point-of-failure fault tolerance is harder than it sounds. I've seen (and implemented) many N+1 system designs. Building them isn't too much of a problem if you have a careful plan and follow it. BUT... keeping it N+1 as maintenance and improvements get done is *really* hard.
Remember that you can break N+1 redundancy by simply plugging in a device to the wrong outlet, or moving a network cable from one switch to another during some late
Re: (Score:2)
Not really. You just periodically do disaster tests. Shut down one of the 'N' periodically, and if infra that isn't supposed to go down does, turn it back on quickly, then do a post-mortem to figure out what screw-up needs to be fixed, and fix it.
Re: (Score:2)
Not really. You just periodically do disaster tests. Shut down one of the 'N' periodically, and if infra that isn't supposed to go down does, turn it back on quickly, then do a post-mortem to figure out what screw-up needs to be fixed, and fix it.
Sounds great, again in theory... Taking down one temporarily does away with your redundancy (if it exists) and risks causing an actual outage if it doesn't. If you have SLAs to keep, you are not going to willingly perform such tests except in rather unique situations. You never willingly give up your +1.
The way you maintain N+1 is by first intensive design reviews of installation plans that include comprehensive wire labeling and inspection of same. Strict engineering controls for existing modificatio
Re: (Score:2)
Unless you're a fairly small operation, you need N+1 redundancy per data center, and you need multiple extra data centers to handle fluctuations in your capacity needs. If, during your low-usage periods (typically the wee hours of the morning in the country where your service is most popular), you can't lose one entire data center and still keep running, you're already screwed, because it's just a matter of time before your entire operation c
Re: (Score:2)
You still do not physically test N+1 solutions which are in operation. Even if it's fictional N+1 redundancy because of some configuration error, it's better to stay running than risk breaking your SLAs.
Now, that doesn't mean you don't fully test and vet your N+1 solution prior to putting it into service and then regularly by physical inspection and on paper. You can even test portions of your system which are not in service or part of the +1 redundancy, but you don't risk disrupting operational systems
Cloud - I don't think that word... (Score:5, Insightful)
There is no cloud.... (Score:3)
There is no cloud, just other people's computers.
Back in my day, we called this time sharing. Now you kids get off my lawn.
Same old same old (Score:3)
London Stock Exchange outage blamed on Microsoft [cnet.com]
Irony (Score:4, Funny)
The most ironic part is that the Azure Support Twitter account keeps pointing customers to the Azure status page. Which also happens to be down with 503 errors. Guess they could e-mail for support, unless they are using Office 365. Or request help via the Management Portal, but guess that's down too. lol.
Re: (Score:2)
Maybe Microsoft can move their Azure services onto Twitter's servers to improve uptime and reduce maintenance costs.
Re: (Score:2)
And the POTUS will make himself available to perform thorough regression testing at all hours for any new releases.
Re: (Score:2)
Having an estimate of when a service will be back up is valuable information. Just because contacting support won't immediately resolve an issue doesn't make doing so useless.
Re: (Score:2)
ETAs are so helpful. When this first started yesterday morning, I assumed maybe a couple of hours of downtime - AT MOST. Now we're nearly 30 hours in, with no ETA, and my builds for alternative systems started hours behind. Learning experience for me I guess.
Stay calm and keep paying the subscription fees (Score:1)
Who wants the reliability of locally installed software when you can have Office 365 running on 'the cloud'? With Office 365 your staff can enjoy a nice break at your expense until the service comes back online. Think of the morale benefits this offers, and it's only available with 'the cloud'. As for Azure, what's a few hours' outage if it's just some critical business systems you're running?
You don't want to have control over your IT infrastructure. It's far better to pay for subscription-based services
Minor correction (Score:5, Funny)
It is now "Office 364"
Re: (Score:3)
Re: (Score:2)
Re: (Score:1)
Re: (Score:2)
Re: (Score:2)
Right. Now we can add 4th September to the list that already includes 29th February.
Texas? (Score:4, Funny)
Re: (Score:2)
Re: (Score:2)
You have something that needs to be cool, and you put it in Texas?
Faithful slashdot reader here for 60 years. Sorry, couldn’t be bothered to read the summary. What’s this got to do with Texas? I’ve seen clouds there before! Does Microsoft own one of them now?
Re: (Score:2)
It's also a major enoug
Re: (Score:2)
If you have something that needs a ton of electricity, you put it where the rates are low and likely to remain low.
How easily can electricity be transported, compared to heat?
How much does the extra cooling contribute to electricity consumption, and does the lower electricity price compensate?
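A rough back-of-envelope on that question, with PUE standing in for the cooling overhead (every number here is an assumption, not a real rate):

    # Does a cheaper electricity rate offset a worse PUE (more cooling
    # overhead in a hot climate)? All figures are assumptions.
    it_load_kw = 10_000
    hours_per_year = 8760

    def annual_cost(pue, rate_per_kwh):
        return it_load_kw * pue * hours_per_year * rate_per_kwh

    hot_cheap = annual_cost(pue=1.6, rate_per_kwh=0.06)    # hot climate, cheap power
    cool_pricey = annual_cost(pue=1.2, rate_per_kwh=0.09)  # cooler climate, pricier power

    print(f"hot/cheap:   ${hot_cheap:,.0f}")
    print(f"cool/pricey: ${cool_pricey:,.0f}")

Under those made-up numbers the cheap-power site still wins (1.6 x 0.06 < 1.2 x 0.09 per kWh of IT load), but shift the rates a bit and it doesn't.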
No wonder they're a distant second (Score:2)
Microsoft wants to be a cloud provider but they can't keep their infrastructure up worth a damn.
I honestly don't understand how they can be in second place considering how often they have major news-worthy failures.
I think this affected some XBoxLive stuff too (Score:2)
And this... (Score:1)
And this is why you don't keep critical files and documents only in the cloud.
Re: (Score:3)
Or the software you need to view them or the services you need to keep machines on your LAN running.
Well crap... (Score:1)
I'm almost done setting up a hybrid configuration now :-(
Russian Hackers!!! (Score:2)
Turned up the thermostat!!!
This has to be affecting a lot of shops (Score:2)
So it wasn't Patch Tuesday then? (Score:2)
It was the cooling equipment eh?
I think they are making excuses for patch Tuesday.... :)
How shortsighted (Score:2)
I would expect that a cloud vendor such as Microsoft or AWS would have multiple redundancies in place so that any one data center going down would not affect usage.
I'm just still not a cloud fan. It works great, but when it doesn't, there's nothing we can do.
Re: (Score:2)
Re: (Score:2)
I would expect that a cloud vendor such as Microsoft or AWS would have multiple redundancies in place so that any one data center going down would not affect usage.
Why are you lumping AWS in with Microsoft here?
AWS has this redundancy: multiple availability zones in every region, so that the data storage services in particular can provide the kind of redundancy that would have been useful in this scenario, but which, according to the Azure status updates, Azure doesn't have:
"Engineers have restored access to storage resources for the majority of services, and most customers should be seeing signs of recovery."
This is the update almost 2 days into this incident.
Re: (Score:2)
The company has gone all-in with cloud.
Executives (Score:1)
My favorite part of today was the executive screaming on the outage line about getting it back up. Sorry guys, you all put it in the cloud. Now you've had a longer outage than the last 5 years combined, for the same price you paid for the people, process, and tech. At the same time they lost file shares (which moved to OneDrive), workflow apps (which are on SharePoint), and every app you moved from the internal SSO solution to Azure AD. Super fun!