Microsoft Azure Failure: SSL Certificates Were Updated... Sort Of
judgecorp writes "Microsoft has published an explanation of the failure of Windows Azure earlier this month. Users of the Azure storage saw that an SSL certificate had expired. Microsoft's explanation says that the certificate had in fact been renewed, but an update with the new certificate details was not prioritized, and hadn't actually been implemented till after the old certificate expired. There are more interesting details, but Microsoft says better alerts and more automation will stop this particular fault happening again."
Re:When will they accept Windows 8 as a failure? (Score:5, Insightful)
Re: (Score:2)
Re: (Score:3)
Almost nobody cares about a lot of things that matter a great deal.
Re: (Score:1)
We've seen over and over through decades that the backwards-compatible ugly system beats the pretty, usable
Re: (Score:2)
That's because most programmers suck.
OS X is actually quite compatible. Provided you stick to the public APIs
Re: (Score:3)
Re: (Score:2)
That's because most programmers suck.
So what we need to do is make it illegal for the majority of people to become programmers. It should remain a tiny elite class, a bit like being a Catholic Cardinal, but with less sex.
Re: (Score:2)
When you charge an arm and a leg for an OS and your company basically has unlimited money, then there is no excuse for not delivering perfect software with no bugs. So yes I was expecting a perfect version of Metro.
Re: (Score:3)
When you charge an arm and a leg for an OS and your company basically has unlimited money, then there is no excuse for not delivering perfect software with no bugs. So yes I was expecting a perfect version of Metro.
The cost of certifying a modern OS totally bug-free would exceed the GDP of the entire world, hundreds of times over.
Re: (Score:2)
I don't think so. NASA makes almost bug-free code with very stringent testing at a cost of, I believe, $1000 per line of code. Windows 7, for example, has about 50 million lines of code, so it would only cost $50 billion, and given Microsoft's profits that would only take two or three years of their profit.
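As a rough check of the arithmetic in this comment, here is a minimal sketch. Both inputs are the commenter's own estimates, not verified figures, and the names `COST_PER_LINE_USD`, `WINDOWS_7_LINES` and `certification_cost` are made up for illustration.

```python
# Figures quoted in the comment above (the commenter's estimates, not verified data).
COST_PER_LINE_USD = 1_000
WINDOWS_7_LINES = 50_000_000


def certification_cost(lines: int, cost_per_line: int) -> int:
    """Total cost in USD of producing `lines` lines at `cost_per_line` each."""
    return lines * cost_per_line


total = certification_cost(WINDOWS_7_LINES, COST_PER_LINE_USD)

# Annual profit implied by the claim "two or three years of their profit".
implied_profit_range = (total / 3, total / 2)

print(f"Total: ${total:,}")
print(
    f"Implied annual profit: ${implied_profit_range[0]:,.0f} "
    f"to ${implied_profit_range[1]:,.0f}"
)
```

The product does come out at 50 billion dollars, so the claim stands or falls on whether the two inputs are realistic, not on the multiplication.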
Re: (Score:2)
Re: (Score:3, Informative)
the adoption rate among students who get Windows 8 for free is nonexistent, at least by the anecdotal evidence in my faculty (computer science).
even during exam season (when you suddenly get the urge to clean the room, re-check the fridge or format your laptop).
you can piss on my face but don't tell me it's raining.
Re: (Score:2)
Troll = Fed.
In the future, mod him down and move on.
Re: (Score:2)
As usual, we see the failure of using the closed source model for an operating system. They have to get the users to fund development somehow, so they sell them a shitty version every other time to pay for the real versions, and get the new ideas into the hands of the customers where they can tell them which ones are good and which ones are bad. It can work fine for applications where they can bring out a new version when they're ready, with incremental updates for features or fixes which must and can be ha
Re: (Score:1)
Re: (Score:2)
Re:When will they accept (Linux|Mac) as a failure? (Score:1)
Re: (Score:2)
A true coward: Nothing of worth to say and that without any grace...
It won't happen again (Score:4, Insightful)
Unless I'm horribly mistaken, they've let certificates expire before. Why would I think they won't let it happen again?
Re:It won't happen again (Score:5, Interesting)
"I asked Microsoft for comment Saturday when I was writing this, in particular as to how the rest of its cloud might differ from the Danger set up. Microsoft said Sunday that its the fabric controller that manages the Azure service is built with redundancy in mind. "
It may be built with redundancy in mind, but apparently it still has at least one single point of failure.
Re:It won't happen again (Score:5, Insightful)
I always back up my cloud data to a local harddrive, just to be safe.
Hm. (Score:1)
I always back up my cloud data to a local harddrive, just to be safe.
That sounds like vaporware.
Re: (Score:2)
I always back up my cloud data to a local harddrive, just to be safe.
Isn't that cheating or something?
Re: (Score:1)
All of which suggests that little rational or critical thinking goes into a decision to use Azure. Microsoft has allowed its infrastructure to fail on multiple occasions through a lack of competence, yet they still have customers. Why?
[I expect the MS fanboys will down-mod this to troll -1 within a few minutes of it being posted.]
Re: (Score:2)
1) They don't actually have a lot of customers, and for those customers they do have:
2) There aren't a lot of options. Building your own site with 99.999% uptime is really hard, and Amazon's cloud has outage problems as well. Where exactly are you going to host your site in a way that doesn't go down? If there aren't options, you just go with the best you can, and Azure at least seems to be easy if your team is only semi-competent (like a lot of programmers these days).
Re: (Score:2)
There are Cloud providers other than Amazon & Microsoft.
None of which can claim to be better than 99.999% uptime, since it's practically impossible to achieve.
Re:It won't happen again (Score:4, Interesting)
None of which can claim to be better than 99.999% uptime, since it's practically impossible to achieve.
Having worked for half a decade on mobile communications infrastructure that regularly exceeds 99.999% uptime, I feel qualified to say that it is neither impossible nor super difficult. If it is a goal and you are willing to spend a lot of money, then you can accomplish it.
But nobody is going to pay $X for 99.99999% uptime when 98% uptime is available for $X / 100 unless they are forced to. Look at all of the various highly-funded internet services that go down completely when a single Amazon data center has an outage. They aren't even willing to pay a little bit extra and do the extra work to make their services run on multiple data centers at a time. Clearly, it is not a requirement of the venture capital that they are getting.
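To put the uptime percentages in this exchange into concrete terms, the sketch below converts an availability target into a yearly downtime budget. It assumes a 365-day year, and the helper name `downtime_budget_seconds` is made up for illustration; `Fraction` is used so the small differences between the targets are not lost to floating-point rounding.

```python
from fractions import Fraction

SECONDS_PER_YEAR = 365 * 24 * 3600  # assumption: a 365-day year


def downtime_budget_seconds(availability_percent: str) -> Fraction:
    """Seconds of downtime per year permitted by an availability target.

    The target is passed as a string such as "99.999" so it parses exactly.
    """
    unavailable = 1 - Fraction(availability_percent) / 100
    return unavailable * SECONDS_PER_YEAR


for level in ("98", "99.999", "99.99999"):
    seconds = float(downtime_budget_seconds(level))
    print(f"{level}% uptime allows {seconds:,.2f} s of downtime per year")
```

Under that assumption, 98% leaves about 7.3 days a year, 99.999% about 5.3 minutes, and 99.99999% roughly 3 seconds, which is the gap the price comparison above is pointing at.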
Re: (Score:2)
Re: (Score:2)
Building your own site with better uptime than Microsoft's Cloud, on the other hand, doesn't look that hard at all.
Re: (Score:3)
It may be built with redundancy in mind, but apparently it still has at least one single point of failure.
Yeah. It's the same single point of failure present in every IT project: It's called The Manager, and it goes something like this:
Engineer: "I sent you the e-mail!"
Manager: "Oh? I never got it."
Users: "Oh f---."
Re:It won't happen again (Score:4, Interesting)
Re:It won't happen again (Score:4, Insightful)
Maybe. It seems to me that if the engineers have let the manager become powerful enough to be a single point of failure, they've designed the system wrong.
You're fired. Anyone else have a problem with the manager?
Re:It won't happen again (Score:5, Insightful)
Re: (Score:1)
Oh, I see you know my manager!
Though you forgot the part about "You! To the bottom of the stack rank, NOW!"
At Microsoft, that typically comes at least a few months before actually being fired.
(Posting this as AC for hopefully obvious reasons).
Re: (Score:2)
God. I'm torn between modding +1 "Funny" or "Insightful"...
Re: (Score:1)
Didn't they let one of their domains, passport.com, expire as well?
http://news.cnet.com/Good-Samaritan-squashes-Hotmail-lapse/2100-1023_3-234907.html [cnet.com]
Re: (Score:2)
It may be built with redundancy in mind, but apparently it still has at least one single point of failure.
It's extremely rare that an entire service goes down. Generally what happens is one region goes down. The fact that people don't *pay* for the full redundancy and fail-over protection doesn't mean it's not technically built into the system.
Re: (Score:2)
Pretty sure the last one was a bug that was something to do with the cert expiring on a leap-day though.
This is a much deeper problem that shows that there is not a whole lot of good process going on behind the scenes.
Re:It won't happen again (Score:5, Informative)
Some of us remember when they forgot to renew hotmail.com [cnet.com]. I'd say that might be worse...
Re:It won't happen again (Score:4, Informative)
Re: (Score:3)
And I would have renewed their cert had I been able.
Look, the bottom line is that they haven't learned anything in the past 13 years (wow, I feel old). The sloppiness that allowed a domain registration to lapse is the sloppiness that allows a cert to expire. This is a cultural issue that will likely never be overcome.
To step into another industry, let's look at phone service. The "Phone Company" (AT&T back in the day, then the baby bells) had a culture of "this service has to work, period". I'm 45 t
Re: (Score:2)
Google is not a good example. Just yesterday, their 8.8.8.8 DNS server wasn't returning anything for www.youtube.com (no, not NXDOMAIN); the query just got nothing and would time out. Yet every other query I could think to send it worked fine. It was really odd, actually. Gmail was down just a couple of months ago. Slate even had an article about how debilitating it was for everyone.
Re: (Score:2)
Right, there will always be outages. The point is that most of them aren't caused by general numbskullery.
Re: (Score:2)
Yep, still here.
Re: (Score:2)
"The lapse, which was first reported on the Internet news service Slashdot.org, ..."
... and was again reported on Slashdot.org two days later...
inexcusable re: ...cert expiring on a leap-day... (Score:2)
.
\begin{sarcasm} Well, if it was a leap-day event, well that's totally excusable because there's no predictable way to know that a particular year might be a leap-year with a leap-day in it, and even if there were, my goodness, you'd need some sort of computational device to carry out the algorithm (that Al Gore, he invents everything!) that would let you figure it out, and who cou
Even more inexcusable (Score:2)
questlove and windows-phone-suckage (Score:2)
.
And what the fVCk is it with the stomping and jumping and slapping around of hardware in the ms tablet ads? Is that all that the MS tablets are good for? Throwing them around and clunking them onto tables and benches? Wha
Re: (Score:2)
Then if stuff fails because of leap years, expiration or other time related stuff, it's more likely to fail in a test system first and they'll have a week to fix the problem before their users not
Re: (Score:2)
Actually if you bother to read the article it looks like they had a reasonably good process going on behind the scenes i.e. cert owners got alerted & pushed the new cert in an update. The only problem was that they forgot to mark it as containing critical information (well, and their monitoring tools didn't alert them say a week out to say that the certs hadn't been renewed). So there is definitely room to improve the process, but saying that there is not a whole lot of good process is drawing a long b
It WON'T happen again because of automation (Score:1)
It definitely won't happen again, instead the team responsible for keeping the automation software running will fail. Or an automatic upgrade to Windows will break it, or the libraries needed to run it will have been deprecated.
So yeh, it won't happen again, the next time it will be something else to blame.
Never, of course, a management that chops up roles into such small increments, disempowering its workforce so much that the simple job of updating a certificate becomes a major obstacle each and every tim
Re: (Score:2)
I still wonder why those certificates need an expiry date. And why they just don't put it like 100 years in the future.
But first of all, why have expiry dates at all?
I have seen often enough certificates being revoked for being compromised or whatever; and I have seen quite some trouble due to expired certificates leaving web sites inaccessible, for example.
It doesn't seem to add much if any security (if it's compromised, you'll want to revoke it now, and not wait until it expires months or years later). Wh
Re: (Score:3)
Certificates need an expiry for the same reason that passwords ought to have them. The probability that a certificate has fallen into unauthorized hands increases with the passage of time, so having certificates expire means you can limit the usefulness of a stolen certificate.
Re: (Score:3)
Interesting you mention expiry dates on passwords as plenty of security people will argue that having expiry dates on passwords tends to decrease the security of passwords, as people select easier ones.
Having a multi-year expiry date pretty much defeats the purpose: after falling into the wrong hands, the certificate is useful only until it's detected that it's in the wrong hands. And that's usually not very long after it starts being used.
And a short expiry date (weeks, months) where it may actually have an effect o
Re: (Score:2)
The security guys that argue passwords should not expire are crappy security guys. Passwords should be long enough and not expire at too great a frequency; I would say probably not less than 90 days. Many password attacks are inside jobs. Did the guy who does the backups take home a copy of the SAM database? If you don't rotate your passwords, he can probably brute-force them pretty quickly if they are weak. They might hold up for several months if they are strong. Once he gets a password to a privileged ac
Re: (Score:2)
The security guys that argue passwords should not expire are crappy security guys.
It depends on what the password is protecting. If someone gets your password, the chances are they are going to use it immediately. A timed expiry of passwords can prevent repeat uses of it, but if the attacker already had a chance to install malware when the account was originally compromised, they probably don't even need the password the second time around. Additionally, if a repeat attack isn't going to get the attacker anything extra over the original attempt, it's probably not worth worrying about.
Conver
Re: (Score:2)
Again, it's about not enabling someone to get your password in the first place. Rotation absolutely helps with that in the event a master password database is stolen. There are any number of reasons you might not be aware of that as well, not the least of which is that an admin who has rights to copy the file decides to do so. He might find it very useful to be able to brute-force the CEO's passwords and take a look around at the company's financial statements or his mailbox without appearing in any logs for exa
Re: (Score:2)
Again, it's about not enabling someone to get your password in the first place. Rotation absolutely helps with that in the event a master password database is stolen.
Of course, but as with all security, this is about balancing the odds - what are the chances of the master password DB being stolen, cracked and the password(s) used, vs. the chances of one of your users being pushed into having a weak password through having to change it regularly?
You will usually know right away if it's gone missing and change your password immediately or phone the helpdesk to have your account locked if this happens.
More likely they will be more concerned about their money and credit cards having been stolen and will completely forget that they had a password in there...
Re: (Score:2)
Typical scenario of ... (Score:5, Insightful)
... managers saying "we need to get this up and running sooner ... automating it reliably is hard to do ... just get it working and update things manually for now and we will automate it later". When later comes, everyone is working on something else.
Re: (Score:1)
I don't think the numbers are that skewed, actually, probably about a 70/30 split, maybe even 60/40. Unfortunately the morons typically hold substantially more power due to their title or caste. Speak the right dialect of Hindi? You can get away with anything, even if you're an idiot. In fact, you'll probably get promoted into management. Speak English as your native language? Your best work will be attributed to the guy who speaks the correct dialect, and everything else will be declared "averag
automation and alerts (Score:2)
uhuh. I think people, especially technology companies, forget that the easiest task to automate is one that a human can simply do.
"Executive assistant in charge of renewing certificates". Make it someone's job. It'll get done. You don't need a robot. You just need it to be in someone's contract. That's it.
Re: (Score:2)
Yeah, we can see how well that worked this time.
The biggest problem with all this is something else of course, within big companies people get assigned to different tasks all the time.
Sometimes that means a simple but very important task gets handed over to someone else who doesn't fully understand the implications when it doesn't get done.
Car analogy... (Score:3)
It's incredible how they keep shuffling blame around, or hot-potato-ing it:
Laughable, if it were not so stupid.
Re: (Score:2)
Re: (Score:2)
It's laughable mostly because they told us just how incompetent they are. They came right out and told us that their service, which they want you to make mission-critical, is managed with a process that would make the three stooges weep with uncontrollable laughter.
Bad infrastructure management (Score:1)
Ouch (Score:2)
Good lord, last year it was a 12 hour outage on leap day, this year it was a 12 hour (as far as I can tell) outage due to expired certificates. They won't be able to claim five 9's uptime for ~274 years!
At the rate of a half day of failure every year, so far, I'm not even sure I'd trust Azure for storage no matter what discount they offer.
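The ~274-year figure is consistent with 24 hours of total downtime measured against a 99.999% availability target. A quick check, assuming a 365-day year and no further outages (the helper name `years_to_amortize` is made up for illustration):

```python
from fractions import Fraction

HOURS_PER_YEAR = 365 * 24  # assumption: a 365-day year
OUTAGE_HOURS = 12 + 12     # the two roughly 12-hour outages mentioned above


def years_to_amortize(downtime_hours: int, availability_percent: str) -> float:
    """Length in years of the window over which the given downtime
    averages out to the availability target, assuming no other outages."""
    allowed_fraction = 1 - Fraction(availability_percent) / 100
    return float(Fraction(downtime_hours) / allowed_fraction / HOURS_PER_YEAR)


five_nines = years_to_amortize(OUTAGE_HOURS, "99.999")
six_nines = years_to_amortize(OUTAGE_HOURS, "99.9999")
print(f"99.999%:  about {five_nines:.0f} years")
print(f"99.9999%: about {six_nines:.0f} years")
```

Each extra nine multiplies the required window by ten, so the same 24 hours of downtime would take on the order of 2,740 years to amortize at 99.9999%.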
How many days does it take to push an update? (Score:1)
They pushed the update out on Jan 7. By Feb 22, it hadn't been completed. Something is not right with this explanation. It doesn't matter how low a priority it was; it should have been pushed out within what? Two weeks? A month?
MS Azure Failure ... (Score:2)
[ "Azure" is a shade of blue, for those that don't know,
and why MS would go with this kind of name, given their history with things "blue" is beyond me. ]
CA emails me when my cert is about to expire (Score:2)
My certificate authority sends me nagging emails starting about 6 weeks before my certificate expires. Microsoft's certificate authority group needs to create a database and send automated emails when certificates get near expiration. Start emailing a bunch of folks. It's very simple. Probably most CAs have such a setup.
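A minimal sketch of the kind of check described here, under loud assumptions: the inventory, host names and expiry dates below are hypothetical, the 6-week window simply mirrors the nagging interval mentioned above, and nothing here reflects how Microsoft or any CA actually does it. The only library call relied on is the standard `ssl.cert_time_to_seconds`, which parses the `notAfter` text format found in certificates.

```python
import ssl
from datetime import datetime, timedelta, timezone

WARN_WINDOW = timedelta(weeks=6)  # assumption: same lead time as the nagging emails


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` and a certificate's notAfter timestamp.

    `not_after` uses the certificate text format, e.g. "Feb 22 00:00:00 2013 GMT".
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).total_seconds() / 86400


def needs_renewal_warning(
    not_after: str, now: datetime, window: timedelta = WARN_WINDOW
) -> bool:
    """True when the certificate expires within the warning window (or already has)."""
    return days_until_expiry(not_after, now) <= window.total_seconds() / 86400


# Hypothetical inventory: host name -> notAfter string (made-up values).
inventory = {
    "storage.example.test": "Feb 22 00:00:00 2013 GMT",
    "portal.example.test": "Dec 31 00:00:00 2013 GMT",
}

today = datetime(2013, 2, 1, tzinfo=timezone.utc)  # fixed date so the run is repeatable
for host, not_after in inventory.items():
    if needs_renewal_warning(not_after, today):
        print(f"WARN {host}: expires in {days_until_expiry(not_after, today):.0f} days")
```

In a real setup the loop would run on a schedule and send mail instead of printing. Note that a check like this only covers the renewal step; it would not catch a certificate that was renewed but never deployed unless it reads the expiry date from the live endpoint.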