Microsoft Azure's Southern US Data Center Goes Down For Hours, Impacting Office365 and Active Directory Customers (geekwire.com) 86

Posted by msmash on Tuesday September 04, 2018 @01:40PM from the outage-report dept.

New submitter courcoul alerted us to Azure outage, which is affecting several customers in many parts of the world: Some Microsoft Azure customers with workloads running in its South Central US data center are having big problems coming back from the holiday weekend Tuesday, after shutdown procedures were initiated following a spike in temperature inside one of its facilities. Around 230am Pacific Time, Microsoft identified problems with the cooling systems in one part of its Texas data center complex, which caused a spike in temperature and forced it to shut down equipment in order to prevent a more catastrophic failure, according to the Azure status page. These issues have also caused cascading effects for some Microsoft Office 365 users as well as those who rely on Microsoft Active Directory to log into their accounts. The cooling system is the most critical part of a modern data center, given the intense heat produced by thousands of servers cranking away in an enclosed area. More resources: The official status page of Azure; and third-party web tracking tool DownDetector's assessment. Further reading: Microsoft Azure suffers outage after cooling issue.

Microsoft Azure's Southern US Data Center Goes Down For Hours, Impacting Office365 and Active Directory Customers

This discussion has been archived. No new comments can be posted.

Load All Comments

Search 86 Comments Log In/Create an Account

Comments Filter:

Dad told you not to touch the thermostat! (Score:1)

by Anonymous Coward writes:

D'oh!
This is why (Score:5, Insightful)

by Anonymous Coward writes: on Tuesday September 04, 2018 @01:52PM (#57251648)

I do not like software that requires you to phone home to the mothership. The second something go wrong outside of your control it borks all your work. Office 365 is a bad joke if I have ever seen one.
Aside: Yes I know video games do this a lot but games are games and work is work.

Comment (Score:5, Interesting)

by WallyL ( 4154209 ) writes: on Tuesday September 04, 2018 @01:54PM (#57251658)

My employer was affected. Many employees could not authenticate to our third-party webapps because we use whatever the cloud-provided Active Directory SSO solution is. Ah, well. I wonder if this violated SLAs and we get some money back... My company is always concerned about not violating our SLAs to our customers (Saas), so hopefully we extract the same pound of flesh from our vendors.

- Re: (Score:3)
  
  by account_deleted ( 4530225 ) writes:
  
  Comment removed based on user account deletion
  - Re:Comment (Score:4, Interesting)
    
    by hawguy ( 1600213 ) writes: on Tuesday September 04, 2018 @02:46PM (#57252040)
    
    Being dependent upon "the cloud" is not a good thing, and yet so many companies are throwing out their brains and signing up in the hope to reduce costs. The company that recently purchased my previous employer is in whole hog for Microsoft, Microsoft 360, Microsoft cloud, and anything with the word Microsoft attached, most of it all online only. To read some corporate announcements I have to log into a third party site which just seems absurd to me. When the cloud servers eventually get their inevitable downtime, I predict a lot of hand wringing.
    I haven't seen this level of slavish devotion to a single vendor since the IBM administration.
    For most small to mid-sized businesses, "the cloud" is more reliable than any solution they'd be willing to pay for. I don't know Microsoft's redundancy model, but AWS's multi-AZ model gives much more redundancy than most businesses would build themselves -- even more so for multi-region redundancy since most companies aren't going to spend the money to duplicate their production environment in another region on the other side of the country (or world).
    Though the side effect of using a cloud provider is that when a major cloud provider goes down, so do a *lot* of businesses -- but that doesn't mean they would have been better off building their own datacenter.
    
    - Re: (Score:2)
      
      by account_deleted ( 4530225 ) writes:
      
      Comment removed based on user account deletion
    - - Re: (Score:2)
        
        by buchanmilne ( 258619 ) writes:
        
        The equivalent to multi-AZ is pick the region - "Southern USA" is one of 54 different regions [microsoft.com] - when deploying a resource, one simply choose which region it goes to (different regions have different costs, different packet latency profiles, different data sovereignty and sometimes other restricted use cases.
        1)If multi-AZ is "pick multiple regions", why is Azure starting (https://azure.microsoft.com/en-us/updates/azure-availability-zones-ga/ ) with multi-AZ regions? Maybe they realise AWS is onto something?
        It looks like MS would bill you for inter-region traffic, where AWS doesn't bill you for intra-region (inter-AZ) traffic. Going multi-region on Azure looks much more expensive than multi-AZ on AWS ...
        2)54 different regions? You believe Microsoft's big number on the page, when it seems to be counting:
        - Regions
  - Re: (Score:2)
    
    by thegarbz ( 1787294 ) writes:
    
    Being dependent upon "the cloud" is not a good thing, and yet so many companies are throwing out their brains and signing up in the hope to reduce costs.
    Hardly. Being dependent upon the cloud is infinitely better than what most companies have proven themselves as being capable of.
    Remember the cloud is someone else's computer, and that someone else is quite often better at managing it.
- Re: (Score:2)
  
  by account_deleted ( 4530225 ) writes:
  
  Comment removed based on user account deletion
4.5 hours downtime! (Score:1)

by Anonymous Coward writes:

4.5 hours downtime for us. Thanks Microsoft.
- Re: (Score:3)
  
  by mu51c10rd ( 187182 ) writes:
  
  On the positive note, as least you can blame the outage on Microsoft and not take the heat yourself for Exchange crashing and being down for 4.5 hours.
  - Re: (Score:2)
    
    by Chris Mattern ( 191822 ) writes:
    
    On the positive note, as least you can blame the outage on Microsoft and not take the heat yourself for Exchange crashing and being down for 4.5 hours.
    Ah, but your boss will tell you it was your decision to depend on a vendor who turned out to be undependable, making it your fault. And if it was in fact your decision and not something you argued against and was overridden, he has a point.
    - Re: (Score:2)
      
      by mu51c10rd ( 187182 ) writes:
      
      Touché
    - Re: (Score:2)
      
      by sjames ( 1099 ) writes:
      
      For that matter, even if you did argue against it and HE is the one that did it anyway, it's your fault for not being persuasive enough.
    - Re: (Score:2)
      
      by bjwest ( 14070 ) writes:
      
      On the positive note, as least you can blame the outage on Microsoft and not take the heat yourself for Exchange crashing and being down for 4.5 hours.
      Ah, but your boss will tell you it was your decision to depend on a vendor who turned out to be undependable, making it your fault. And if it was in fact your decision and not something you argued against and was overridden, he has a point.
      And he has a valid point, even if HE was the one who decided to make the move. If you're not willing to argue against bad decisions of your boss, you're nothing more than a 'yes man', and useless to the organization.
- Infrastructure management (Score:3)
  
  by klubar ( 591384 ) writes:
  
  Isn't this what backup generators and N+1 infrastructure is for? I can understand Joe's hosting and bait shop emporium going down, but power and HVAC are pretty well solved sciences. The weather in Texas is hot -- this is not a surprise. There are lightning storms in Texas, this is also not a surprise.
  It seems like if you a positioning a data center in Texas (which there as some reasons for), you prepare for both heat and lightning. I could understand if there was an incredibly unusual weather event (aster
  - Re: (Score:1)
    
    by nasch ( 598556 ) writes:
    
    Data centers don't normally have a redundant cooling system do they? Backup power, a second (or third) internet connection to an independent provider, sure. But a second air conditioning and ventilation system waiting to take over in case the primary fails?
    - Re: (Score:2)
      
      by klubar ( 591384 ) writes:
      
      They should have redundant chillers, air handlers and environmental controls. Especially as cooling is critical to data center functionality.
      A well designed data center has (at a minimum) redundant power feeds from two separate power networks, redundant network connections (from three or more providers) and cooling capacity. Losing a chiller or air handler should not take out the data center.
    - Re: (Score:2, Interesting)
      
      by Anonymous Coward writes:
      
      I maintain HVAC for cell sites. EVERYONE I've worked on had TWO independent HVAC systems.
      They toggle back and forth to equalize wear and tear, but when one fails the other system takes control and sends me an email asking for attention.
      There is no reason in the world for a data center NOT to have multiple HVAC systems in place. The equipment is pocket change compared to the electronics it protects.
      It could be M$ should put more thought in the design of their data centers than was put into Win95 or Vista.
      Jus
    - Re: (Score:3)
      
      by sjames ( 1099 ) writes:
      
      N+1 is common. So you take the total cooling need and divide by (for example) 4, then install 5 systems of that size so you can lose one and be at full capacity. That's necessary anyway since you may need to shut one down for routine maintenance from time to time.
      Ideally you don't let everything be inter-dependant so if you lose 2, you can still get by with shutting down 1/4 of the hardware.
  - Re: (Score:2)
    
    by bobbied ( 2522392 ) writes:
    
    Isn't this what backup generators and N+1 infrastructure is for?
    Yea, that's the theory.. However, in practice, maintaining no single point of failure fault tolerance is harder than it sounds. I've seen (and implemented) many N+1 system designs. Building them isn't too much of a problem if you have a careful plan and follow it. BUT.. Keeping it N+1 as maintenance and improvements get done is *really* hard.
    Remember that you can break N+1 redundancy by simply plugging in a device to the wrong outlet, or moving a network cable from one switch to another during some late
    - Re: (Score:2)
      
      by dgatwood ( 11270 ) writes:
      
      BUT.. Keeping it N+1 as maintenance and improvements get done is *really* hard.
      Not really. You just periodically do disaster tests. Shut down one of the 'N' periodically, and if infra that isn't supposed to go down does, turn it back on quickly, then do a post-mortem to figure out what screw-up needs to be fixed, and fix it.
      - Re: (Score:2)
        
        by bobbied ( 2522392 ) writes:
        
        BUT.. Keeping it N+1 as maintenance and improvements get done is *really* hard.
        Not really. You just periodically do disaster tests. Shut down one of the 'N' periodically, and if infra that isn't supposed to go down does, turn it back on quickly, then do a post-mortem to figure out what screw-up needs to be fixed, and fix it.
        Sounds great, again in theory... Taking down 1 is to temporally do away with your redundancy (if it exists) and risks causing an actual outage if it doesn't. If you have SLA's to keep, you are not going to willingly perform such tests except in rather unique situations. You never willingly give up your +1.
        The way you maintain N+1 is by first intensive design reviews of installation plans that include comprehensive wire labeling and inspection of same. Strict engineering controls for existing modificatio
        
        Re: (Score:2)
        
        by dgatwood ( 11270 ) writes:
        
        In short, you NEVER test N+1 systems which are in production use.
        Unless you're a fairly small operation, you need N+1 redundancy per data center, and you need multiple extra data centers to handle fluctuations in your capacity needs. If, during your low-usage periods (typically the wee hours of the morning in the country where your service is most popular), you can't lose one entire data center and still keep running, you're already screwed, because it's just a matter of time before your entire operation c
        
        Re: (Score:2)
        
        by bobbied ( 2522392 ) writes:
        
        You still do not physically test N+1 solutions which are in operation. Even if it's fictional N+1 redundancy because of some configuration error, it's better to stay running than risk breaking your SLA's.
        Now, that doesn't mean you don't fully test and vet your N+1 solution prior to putting it into service and then regularly by physical inspection and on paper. You can even test portions of your system which are not in service or part of the +1 redundancy, but you don't risk disrupting operational systems
Cloud - I don't think that word... (Score:5, Insightful)

by RelaxedTension ( 914174 ) writes: on Tuesday September 04, 2018 @01:58PM (#57251690)

I don't think that word means what they think it means. They need to rethink their distributed model if one data center takes down customers. Isn't the pitch for those services that they basically bulletproof for businesses?

- There is no cloud.... (Score:3)
  
  by klubar ( 591384 ) writes:
  
  There is no cloud, just other people's computers.
  Back in my day, we called this time sharing. Now you kids get off my lawn.
Same old same old (Score:3)

by Tough Love ( 215404 ) writes: on Tuesday September 04, 2018 @01:59PM (#57251698)

London Stock Exchange outage blamed on Microsoft [cnet.com]

Irony (Score:4, Funny)

by gregarican ( 694358 ) writes: on Tuesday September 04, 2018 @02:00PM (#57251710) Homepage

The most ironic part is that the Azure Support Twitter account keeps pointing customers to the Azure status page. Which also happens to be down with 503 errors. Guess they could e-mail for support, unless they are using Office 365. Or request help via the Management Portal, but guess that's down too. lol.

- Re: (Score:2)
  
  by PingSpike ( 947548 ) writes:
  
  Maybe Microsoft can move their Azure services onto Twitter's servers to improve uptime and reduce maintenance costs.
  - Re: (Score:2)
    
    by gregarican ( 694358 ) writes:
    
    And the POTUS will avail himself to perform thorough regression testing at all hours for any new releases.
- - Re: (Score:2)
    
    by jaa101 ( 627731 ) writes:
    
    Having an estimate of when a service will be back up is valuable information. Just because contacting support won't immediately resolve an issue doesn't make doing so useless.
    - Re: (Score:2)
      
      by fjordboy ( 169716 ) writes:
      
      ETAs are so helpful. When this first started yesterday morning, I assumed maybe a couple of hours of downtime - AT MOST. Now we're nearly 30 hours in, with no ETA, and my builds for alternative systems started hours behind. Learning experience for me I guess.
Stay calm and keep paying the subscription fees (Score:1)

by Anonymous Coward writes:

Who wants the reliability of locally installed software when you can have Office 365 running on 'the cloud'. With Office 365 your staff can enjoy a nice break at your expense until the service comes back online. Think of the moral benefits this offers, and it's only available with 'the cloud'. As for Azure, what's a few hour outage if it's just some critical business systems you're running.
You don't want to have control over your IT infrastructure. It's far better to pay for subscription based services
Minor correction (Score:5, Funny)

by sjames ( 1099 ) writes: on Tuesday September 04, 2018 @02:26PM (#57251874) Homepage Journal

It is now "Office 364"

- Re: (Score:3)
  
  by Anne Thwacks ( 531696 ) writes:
  
  Not so much "cloud" as "smog".
- Re: (Score:2)
  
  by The Original CDR ( 5453236 ) writes:
  
  Revised version: "Office 365 +/- a day as the new intern plays around with the power switch."
- Re: (Score:1)
  
  by wildfish ( 779284 ) writes:
  
  thanks for the laugh
- Re: (Score:2)
  
  by account_deleted ( 4530225 ) writes:
  
  Comment removed based on user account deletion
- Re: (Score:2)
  
  by thegarbz ( 1787294 ) writes:
  
  Right. Now we can add 4th September to the list that already includes 29th February.
Texas? (Score:4, Funny)

by Anne Thwacks ( 531696 ) writes: on Tuesday September 04, 2018 @02:32PM (#57251942)

You have something that needs to be cool, and you put it in Texas?

- Re: (Score:2)
  
  by jbengt ( 874751 ) writes:
  
  My first thought as well.
- Re: (Score:2)
  
  by jittles ( 1613415 ) writes:
  
  You have something that needs to be cool, and you put it in Texas?
  Faithful slashdot reader here for 60 years. Sorry, couldn’t be bothered to read the summary. What’s this got to do with Texas? I’ve seen clouds there before! Does Microsoft own one of them now?
- Re: (Score:2)
  
  by Fencepost ( 107992 ) writes:
  
  If you have something that needs a ton of electricity, you put it where the rates are low and likely to remain low. Based on May 2018 commercial rates I found, Texas has the third lowest commercial electric rates in the country, behind only Arkansas and Nevada. If Microsoft wants to get on the renewable energy bandwagon, Texas already has massive wind farms (probably 30,000 megawatts and close to 20% of the state's electrical production) and could also add solar in a lot of the state.
  
  It's also a major enoug
  - Re: (Score:2)
    
    by buchanmilne ( 258619 ) writes:
    
    If you have something that needs a ton of electricity, you put it where the rates are low and likely to remain low.
    
    How easily can electricity be transported, compared to heat?
    How much does the extra cooling contribute to electricity consumption, and does the lower electricity price compensate?
No wonder they're a distant second (Score:2)

by ilsaloving ( 1534307 ) writes:

Microsoft wants to be a cloud provider but they can't keep their infrastructure up worth a damned.
I honestly don't understand how they can be in second place considering how often they have major news-worthy failures.
I think this affected some XBoxLive stuff too (Score:2)

by the_skywise ( 189793 ) writes:

I was unable to access some of my game saves for Fallout Shelter this morning saying the cloud sync was failing. (and it stupidly deleted the local save)
And this... (Score:1)

by Anonymous Coward writes:

And this is why you don't keep critical files and documents only in the cloud.
- Re: (Score:3)
  
  by sjames ( 1099 ) writes:
  
  Or the software you need to view them or the services you need to keep machines on your LAN running.
Well crap... (Score:1)

by laosland ( 55769 ) writes:

I'm almost done setting up a hybrid configuration now :-(
Russian Hackers!!! (Score:2)

by Vinegar Joe ( 998110 ) writes:

Turned up the thermostat!!!
This has to be affecting a lot of shops (Score:2)

by your_mother_sews_soc ( 528221 ) writes:

Where I work all our source control is hosted by TFS on Azure. So all of our checkouts/checkins and code reviews are in limbo. Surely we aren't the only ones who have bought into the whole cloud idea. I may be too old fashioned, but I have a hard time putting all my eggs in someone else's basket. Besides, isn't the cloud supposed to prevent this from happening? I'm curious to know how many shops are affected by this.
So it wasn't Patch Tuesday then? (Score:2)

by bobbied ( 2522392 ) writes:

It was the cooling equipment eh?
I think they are making excuses for patch Tuesday.... :)
How shortsighted (Score:2)

by filesiteguy ( 695431 ) writes:

My company has 77,000 O365 licenses and use it for our email hosting as well as storing documents on OneDrive for Business. I kept seeing warnings about the web page for o365 not working.

I would expect that a cloud vendor such as Microsoft or AWS would have multiple redundancies in place so that any one data center going down would not affect usage.

I just am still not a cloud fan. It works great but when it doesn't - there's nothing we can do.
- Re: (Score:2)
  
  by account_deleted ( 4530225 ) writes:
  
  Comment removed based on user account deletion
- Re: (Score:2)
  
  by buchanmilne ( 258619 ) writes:
  
  I would expect that a cloud vendor such as Microsoft or AWS would have multiple redundancies in place so that any one data center going down would not affect usage.
  Why are you lumping AWS in with Microsoft here?
  AWS has this redundancy multiple availability zones in every region, so that especially the data storage services offered can provide the kind of redundancy that would have been useful in this scenario, but which, according to the Azure status updates, Azure doesn't have:
  "Engineers have restored access to storage resources for the majority of services, and most customers should be seeing signs of recovery."
  This is the update almost 2 days into this incident.
  Whe
  - Re: (Score:2)
    
    by filesiteguy ( 695431 ) writes:
    
    I'm mentioning them because of the overall cloud vendor solution. I'm expecting both Microsoft and Amazon to have multiple redundant centers. We have AWS projects, Azure projects, and use hosted Office 365 for Exchange.
    
    The company has gone all-in with cloud.
Executives (Score:1)

by Anonymous Coward writes:

My favorite part of today was the executive screaming on the outage line about getting it back up. Sorry guys, you all put it in the cloud. Now you’ve had a longer outage than the last 5 years combined for the same price you paid for the people, process, and tech. At the same time they lost fileshares (which moved to OneDrive), workflow apps (which are on sharepoint), and every app you moved from the internal sso solution to azure ad. Super fun!

There may be more comments in this discussion. Without JavaScript enabled, you might want to turn on Classic Discussion System in your preferences instead.

Dad told you not to touch the thermostat! (Score:1)

This is why (Score:5, Insightful)

Comment (Score:5, Interesting)

Re: (Score:3)

Re:Comment (Score:4, Interesting)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

4.5 hours downtime! (Score:1)

Re: (Score:3)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Infrastructure management (Score:3)

Re: (Score:1)

Re: (Score:2)

Re: (Score:2, Interesting)

Re: (Score:3)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Cloud - I don't think that word... (Score:5, Insightful)

There is no cloud.... (Score:3)

Same old same old (Score:3)

Irony (Score:4, Funny)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Stay calm and keep paying the subscription fees (Score:1)

Minor correction (Score:5, Funny)

Re: (Score:3)

Re: (Score:2)

Re: (Score:1)

Re: (Score:2)

Re: (Score:2)

Texas? (Score:4, Funny)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

No wonder they're a distant second (Score:2)

I think this affected some XBoxLive stuff too (Score:2)

And this... (Score:1)

Re: (Score:3)

Well crap... (Score:1)

Russian Hackers!!! (Score:2)

This has to be affecting a lot of shops (Score:2)

So it wasn't Patch Tuesday then? (Score:2)

How shortsighted (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Executives (Score:1)

Related Links Top of the: day, week, month.

Slashdot Top Deals