Microsoft Office 365 Experienced Two Major Outages Within 3 Days (crn.com)
On Monday long-time Slashdot reader TorinEdge wrote that Microsoft "appears to have botched an internal Office365 cloud services rollout today, with outages confirmed up and down the West Coast of North America. Confirmed roll backs were good early omens, but in the end did not appear to be successful... Symptoms may include: All 365-related services flaking out, borking, alternately approving logins and confirming they definitely do not exist."
CRN reported service was impacted for five hours. But on Thursday some users were now intermittently unable to access Microsoft Exchange from 12:52 a.m. until 10:50 p.m., "according to a Microsoft email update to Office 365 administrators..."
"Some partners believe the tech giant is grappling with a DevOps crisis." "It looks like they are pushing out software updates that are causing the outages," said a channel source impacted by one of the outages. "They have so much going on right now, rolling Teams out at a breakneck pace. I think they are running into an issue where code tested out fine but there is a configuration problem when they deploy it."
DevOps is a set of practices that, according to the Wikipedia definition, shortens the systems development life cycle and provides continuous delivery of code with high software quality... A senior executive for one of Microsoft's top partners, who did not want to be identified, said he sees both recent outages as clearly DevOps-related... "Microsoft is a development first company, well known in general for DevOps, so the question is: why is this happening?" said the executive. "I love Microsoft but why is a company that paid $7.5 billion for Github, the leading source code repository company in the world, getting taken down by code that is not being well tested or has a single point of failure. That is ridiculous. If we caused this kind of production outage for a customer we would be fired and possibly blacklisted from the ecosystem. We have to bat 1,000 as a partner."
The lesson from the outages may well be that a company's DevOps is only as "good as the humans who configure it and execute upon it," said the executive. The executive said the outages will definitely have a ripple effect in the channel. "I bet the Google G Suite sales reps threw a party when they saw this," he said.
"No cloud vendor is immune to downtime," Microsoft says in a statement quoted by CRN. "Our number one priority is to get to resolution as quickly as possible and ensure our customers stay updated along the way, as was the case here.
"We continuously invest in the resilience of our platform and focus on learning from these incidents to ultimately reduce the impact of inevitable outages..."
Good (Score:4, Funny)
People may finally have managed to get some work done.
timing is everything (Score:2)
Just as Redmond is about to drop update support for the old, home-based version of Office the new, cloud-based version starts crashing. Is it the virus?
Re:timing is everything (Score:5, Insightful)
Is it the virus?
Worse, it's a subscription.
Re: (Score:3)
Just as Redmond is about to drop update support for the old, home-based version of Office the new, cloud-based version starts crashing. Is it the virus?
What are you even talking about? Office 2019 is still under mainstream support, 2016 goes EOMS this month but is still under extended support until 2025, and Microsoft announced Office 2021 will also be available next year with a perpetual, non-subscription license as well.
Re: (Score:2)
https://support.microsoft.com/... [microsoft.com]
Re: (Score:2)
Support for Office 2010 will end on October 13, 2020 and there will be no extension and no extended security updates. All of your Office 2010 apps will continue to function. However, you could expose yourself to serious and potentially harmful security risks.
Mind you 365 is no better:
All of your Office 365 apps might continue to function, except for when we fuck things up yet again. In addition you will be exposing yourself to serious and potentially harmful availability risks.
I'm so glad I'm on the offline version of 2016.
Re: timing is everything (Score:1)
You know that the outage only affected cloud services, and the local apps installed as part of the 365 service were unaffected, right?
Innovation (Score:5, Insightful)
Re: (Score:2)
That may be a bit difficult. The violently disturbed ward tends to have significant restrictions on visitation.
Re: Innovation (Score:2)
You can find them when you work towards the Tax and Finance departments.
You save $$$$ if you're a business and also a college student with tax write-offs. It's a rotten deal for individuals, as the only reason to use Office is for setting up meetings with bigwigs in Outlook and sharing documents which don't make you look retarded, as LibreOffice renders tables and fonts differently than Word.
Office 365 (Score:3)
Does MS know this is a Leap Year ?
Re: (Score:1)
lol, i'd mod you up if I had the points
Re: (Score:2)
It would have been funny to see Office 365 down for 24 hours on that leap day.
Re: (Score:2)
After the screw up in 2016... I think, (certificate rotation failed due to trying to set an expiration date of 2/29/2017), you'd better believe that there was a lot of focus on that issue. Had a full team dedicated to ensuring everything went smoothly (now the team does code scanning to find similar ticking time bombs).
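For anyone wondering how a date like 2/29/2017 gets generated in the first place: the usual culprit is "same month and day, year plus one". The following is only a hypothetical Python sketch of that class of bug and one way to guard against it. It is not Microsoft's code, and `add_years` is a made-up helper name. Falling back to Feb 28 (rather than Mar 1) is an assumption that suits expiry dates, where erring early is safer than erring late.

```python
from datetime import date


def add_years(d: date, years: int) -> date:
    """Return d shifted by a number of years, tolerating Feb 29.

    Hypothetical helper: a naive d.replace(year=...) raises ValueError
    when d is Feb 29 and the target year is not a leap year.
    """
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        # Only Feb 29 can fail here; fall back to Feb 28 of the target year.
        return d.replace(year=d.year + years, day=28)


issued = date(2016, 2, 29)

# The naive version blows up on leap day.
try:
    issued.replace(year=issued.year + 1)
except ValueError as exc:
    print("naive approach failed:", exc)

# The guarded version yields a valid expiry date.
print("guarded expiry:", add_years(issued, 1))
```

A scanner looking for this kind of time bomb would mostly be hunting for year arithmetic done directly on the calendar fields instead of through a helper like the one above.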
I love the Cloud! (Score:4, Funny)
"Sorry boss, MICROSOFT Office was down. I couldn't do anything or communicate with you."
"Uh, there was personal email."
"Boss, that is forbidden by company policy."
Wasn't This Supposed To Work? (Score:3)
Now that's an excuse and a rationalization...We're-Only-As-Bad-As-The-Other-Guys doesn't address the problem.
A Pretty Lame Excuse at that! IMHO
Re: (Score:3)
Redmond needs to suspend those ads claiming resilience.
Re: (Score:2)
Doubly so since office 365 is now being pushed into mission critical environments.
That multi state 911 outage that happened at the same time wasn't a coincidence.
What's the SLA? (Score:2)
Re: (Score:2)
If you really don't want downtime, pay the big bucks to host and manage it yourself.
That works but it depends on the quality of your IT team.
Re: (Score:2)
If you really don't want downtime, pay the big bucks to host and manage it yourself.
That works but it depends on the quality of your IT team.
But the quality of your IT team depends on your willingness to pay the big bucks (to them).
Re: (Score:2)
That's half true, but it also depends on the willingness of other companies to pay them bigger bucks.
I love the idea of a bidding war between rival companies for high-quality IT talent.
Re: (Score:2)
Their current SLA for O365 is 99.9%, so just over 8 hours downtime.
Re: (Score:1)
99.9% uptime is a funny thing. What is the denominator - 24h? 7 Days? 30 days? ... 1 Yr?
This is key information that tells one what duration of downtime is tolerable for a given period. ---- 8 hrs corresponds to a full year at 99.9%; over 30 days it is about 43 minutes. This is terrible for email software.
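Since the denominator question comes up every time an SLA is quoted, here is the arithmetic as a small Python sketch. It assumes the 99.9% figure mentioned above and simply multiplies the allowed unavailability by the length of the window. The function name and the set of windows are illustrative only, and real SLAs usually define their own measurement period and exclusions.

```python
def allowed_downtime_hours(availability_pct: float, window_hours: float) -> float:
    """Hours of downtime an availability target permits over a window."""
    return (1.0 - availability_pct / 100.0) * window_hours


# Measurement windows expressed in hours (illustrative choices).
WINDOWS = {
    "day": 24,
    "week": 7 * 24,
    "30 days": 30 * 24,
    "year": 365 * 24,
}

for name, hours in WINDOWS.items():
    budget = allowed_downtime_hours(99.9, hours)
    print(f"{name:>8}: {budget * 60:8.1f} minutes ({budget:.2f} hours)")
```

At 99.9% the budget works out to roughly 43 minutes over 30 days and a little under 9 hours over a year, so the same percentage is far stricter when it is measured monthly.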
Second choice motivational speech (Score:4, Insightful)
"We continuously invest in the resilience of our platform and focus on learning from these incidents to ultimately reduce the impact of inevitable outages..."
Gene Kranz had that in his notes but eventually went with, "Failure is not an option." [wikipedia.org] when speaking to his Apollo 13 Mission Control team. (Which makes me think of this: Houston [xkcd.com].)
Re: (Score:2)
Sadly, only 20 years later, we'd just appreciate it if it wasn't the default for a change.
DevOps? (Score:3)
Re: (Score:2)
Possibly because iOS is the same way?
Re: DevOps? (Score:2)
Try working with a corrupted SCCM and an unhealthy A.D. schema. It's a nightmare to get anything done and our CIO won't let us have working tools as he promised the CEO 0 outages.
And yet... (Score:2)
....people will trust "The Cloud" and continue moving everything to "The Cloud", almost as if nothing had happened.
Yes, the public really is that conditioned.
Remember When Office (Score:1)
came on a disk. Ahh .... The good ol days.
A disk? only one? Re:Remember When Office (Score:1)
When was that? Oh wait, you must mean the first version to come out on CD-ROM, gotcha.
----
For you young'uns reading this: Even in the old days before the beginning of Eternal September, Microsoft Office was a "big" program. It came on multiple floppy disks. What's a "floppy disk" you may ask? Well, ask Siri or Alexa, see if they know. What's "Eternal September?" If you don't know, you probably weren't born yet on August 31st.
Re: (Score:2)
Depends if you count Write [angband.pl] as Office.
Re: (Score:1)
Yes. Compact Disc for the uninitiated. I call them disks although technically they are Compact Disc Read Only Memory.
Clouds are for suckers.
Eeeee...xactly (Score:3)
"No cloud vendor is immune to downtime,"
And that's why you should stay away from the cloud if you value your uptime. That way you don't have unknown code monkeys disrupting your company's operations willy-nilly, and you have the ability to do something about it when shit happens.
Even Microsoft knows it. They ain't stupid: they don't put their stuff in the cloud :)
Re: (Score:2)
And that's why you should stay away from the cloud if you value your uptime
The cloud has better uptime than my local hardware.
Re: (Score:2)
Yeah.... I'm really not a fan of this move to cloudify everything. But I have to also admit that the things people are demanding today from their application software are a bit different than what we had to support traditionally in I.T. Primarily, there's this desire to collaborate on documents as a team.
I don't see how anyone is realistically going to say they'd rather keep everything on their local hardware and still support ability for a large group to open/view a spreadsheet or other document from var
Re: (Score:2)
And that's why you should stay away from the cloud if you value your uptime
The cloud has better uptime than my local hardware.
My local hardware has redundancy the cloud seems to lack. Even on a power-outage, I can still work for a few hours.
Re: (Score:2)
You're silly. Your local servers will have downtime too. How about this: you're probably not saving babies, your shit isn't that important, and there will be outages. Continue your work when the stuff is back up.
Re: (Score:2)
At least if it's your servers, you can do something about it. With some of these large cloud vendors, you're just one customer of many, and if you aren't a big or important one then their response can end up being along the lines of "yeah, yeah, when we get around to it....".
Re: (Score:2)
No, they have service guarantees. If it's your servers you can wait for repair or for a major backbone provider to unfutz itself (like the recently renamed "Lumen", formerly known as CenturyLink, that is trying to get away from their shame of dropping 3.5% of the globe's internet last month).
Re: (Score:2)
We have redundant fail-over machines on mission critical hardware.
UPSes and after 2 min a big diesel generator kicks in. Spare jerrys of fuel.
2 separate internet connections running on different pipes.
No we do not save babies, but people do rely on us, and many do go to the hospital, doc
Re: (Score:2)
Yes, I've worked at such places too that had all that, and they still sometimes had outages for various reasons. Your failover machines are at the same place, or merely two locations? You're going to go down at times. No better than cloud for uptime.
Microsoft memes itself (Score:2)
continuous delivery of code with high software quality
Throwing more developers at a project does not make it go faster after a certain point.
Re: (Score:2)
continuous delivery of code with high software quality
Throwing more developers at a project does not make it go faster after a certain point.
Fred Brooks expressed that idea well in The Mythical Man-Month https://en.wikipedia.org/wiki/... [wikipedia.org] .
Re: (Score:2)
continuous delivery of code with high software quality
Throwing more developers at a project does not make it go faster after a certain point.
Fred Brooks expressed that idea well in The Mythical Man-Month https://en.wikipedia.org/wiki/... [wikipedia.org] .
It has also been proven in practice time and again by now. Yet most "managers" are unaware of this basic fact.
Ah yes! (Score:2)
I remember the good old days. When I could turn to my boss and say, "Sorry. I can't finish that assignment. The mainframe is down."
The "Cloud" is supposed to be distributed (Score:1)
That's what makes it cloudy. All these companies are doing the exact opposite and putting everything on the same service from the same location, under the same tornado that just blew the roof off. Say goodnight...
Good (Score:2)
Now they know how end users feel every time a Windows 10 update is pushed out which borks their system and they have to spend time trying to figure out how to get their system stable and usable again.
I would suggest doing testing before they roll code out into production systems, but like KISS, we don't do that here.
Libreoffice is Free, and Reliable (Score:2)
Re: (Score:3)
It's great for certain uses. However, I did find the charting of spreadsheet data to be either very broken or hard to use.
"No cloud vendor is immune to downtime." (Score:2)
No cloud vendor is immune to downtime.
This is completely true, no matter what the cloud vendor does by way of reliability and resiliency. Which is why your applications should absolutely never be dependent on a single cloud vendor (unless you like uncontrollable outages).
Leslie Lamport on distributed systems.... (Score:2)
"A distributed system is one in which I cannot get something done because a machine I've never heard of is down."
Whoever thought the cloud would be a good idea for mission-critical software is a fool and unaware of history.
Re: (Score:1)
Whoever thought the cloud would be a good idea for mission-critical software is a fool and unaware of history.
That would be Accounting, they share with Middle Management. First elevator on the left, floors 13-18. Pick one and have a nice rest of your day...
Re: (Score:2)
Whoever thought the cloud would be a good idea for mission-critical software is a fool and unaware of history.
That would be Accounting, they share with Middle Management. First elevator on the left, floors 13-18. Pick one and have a nice rest of your day...
Ah, yes. That part of the building that, if it falls off the face of the earth, first nobody notices, and then people begin to find things are working suspiciously well.
Like button in Outlook (Score:1)
Seriously, I found a Like button in the web interface of Outlook for each received message; it had never been there before. Do they count the messages users liked? Do they build lists of pleasing mailings for the receiver? Do they send a notification to the sender that their message was liked?
Too big a break from the traditional mail concept to comprehend easily.
Then, first messages in the subject threads were not appearing this week, another illustration of basic functionality getting broken, while unnecess
Limited use of cloud systems (Score:1)