


Microsoft Azure Failure: SSL Certificates Were Updated... Sort Of

judgecorp writes "Microsoft has published an explanation of the failure of Windows Azure earlier this month. Users of the Azure storage saw that an SSL certificate had expired. Microsoft's explanation says that the certificate had in fact been renewed, but an update with the new certificate details was not prioritized, and hadn't actually been implemented till after the old certificate expired. There are more interesting details, but Microsoft says better alerts and more automation will stop this particular fault happening again."

  • by Nerdfest ( 867930 ) on Tuesday March 05, 2013 @01:25AM (#43076019)

    Unless I'm horribly mistaken, they've let certificates expire before. Why would I think they won't let it happen again?

    • by phantomfive ( 622387 ) on Tuesday March 05, 2013 @01:34AM (#43076075) Journal
      Yeah, and they also had the Sidekick outage with actual data loss. A lovely quote from that article:

      "I asked Microsoft for comment Saturday when I was writing this, in particular as to how the rest of its cloud might differ from the Danger set up. Microsoft said Sunday that its the fabric controller that manages the Azure service is built with redundancy in mind. "

      It may be built with redundancy in mind, but apparently it still has at least one single point of failure.

      • by Anonymous Coward on Tuesday March 05, 2013 @01:54AM (#43076159)

        I always back up my cloud data to a local harddrive, just to be safe.

        • I always back up my cloud data to a local harddrive, just to be safe.

          That sounds like vaporware.

        • I always back up my cloud data to a local harddrive, just to be safe.

          Isn't that cheating or something?

      • All of which suggests that little rational or critical thinking goes into a decision to use Azure. Microsoft has allowed its infrastructure to fail on multiple occasions through a lack of competence, yet they still have customers. Why?

        [I expect the MS fanboys will down-mod this to troll -1 within a few minutes of it being posted.]

        • I would suggest two reasons:

          1) They don't actually have a lot of customers, and for those customers they do have:
          2) There aren't a lot of options. Building your own site with 99.999% uptime is really hard, and Amazon's cloud has outage problems as well. Where exactly are you going to host your site in a way that doesn't go down? If there aren't options, you just go with the best you can, and Azure at least seems to be easy if your team is only semi-competent (like a lot of programmers these days).
          • Building your own site with 99.999% uptime is really hard

            Building your own site with better uptime than Microsoft's Cloud, on the other hand, doesn't look that hard at all.

      • It may be built with redundancy in mind, but apparently it still has at least one single point of failure.

        Yeah. It's the same single point of failure present in every IT project: It's called The Manager, and it goes something like this:

        Engineer: "I sent you the e-mail!"
        Manager: "Oh? I never got it."
        Users: "Oh f---."

      • Didn't they let one of their domains,, expire as well []

      • It may be built with redundancy in mind, but apparently it still has at least one single point of failure.

        It's extremely rare that an entire service goes down. Generally what happens is one region goes down. The fact that people don't *pay* for the full redundancy and fail-over protection doesn't mean it's not technically built into the system.

    • by norpy ( 1277318 )

      Pretty sure the last one was a bug that was something to do with the cert expiring on a leap-day though.

      This is a much deeper problem that shows that there is not a whole lot of good process going on behind the scenes.

      • by 93 Escort Wagon ( 326346 ) on Tuesday March 05, 2013 @01:49AM (#43076133)

        Some of us remember when they forgot to renew []. I'd say that might be worse...

        • by girlinatrainingbra ( 2738457 ) on Tuesday March 05, 2013 @05:47AM (#43076981)
          Nice! I like the fact that it was a linux user who paid the renewal fee and got back up again, allowing further logins into hotmail. Linky to credit card receipt [] of individual user :
          The lapse, which was first reported on the Internet news service, was apparently caused when Microsoft's registration for the domain name expired sometime Dec. 24, Chaney said. The site verifies user identification and passwords for access to Hotmail and about 25 other services, according to Chaney.
          Chaney said he paid the bill Dec. 25 at about 2 p.m. EST and was given invoice #11395965 documenting the transaction. An electronic copy of the receipt can be viewed at his Web site at ""
          • And I would have renewed their cert had I been able.

            Look, the bottom line is that they haven't learned anything in the past 13 years (wow, I feel old). The sloppiness that allowed a domain registration to lapse is the sloppiness that allows a cert to expire. This is a cultural issue that will likely never be overcome.

            To step into another industry, let's look at phone service. The "Phone Company" (AT&T back in the day, then the baby bells) had a culture of "this service has to work, period". I'm 45 t

            • by DarkOx ( 621550 )

              Google is not a good example. Just yesterday, their DNS server wasn't returning anything for ( no, not NXDOMAIN ), just nothing; it would time out. Yet every other query I could think to send it worked fine. It was really odd, actually. Gmail was down just a couple of months ago. Slate even had an article about how debilitating it was for everyone.

          • "The lapse, which was first reported on the Internet news service, ..."

            ... and was again reported on two days later...

      • re Pretty sure the last one was a bug that was something to do with the cert expiring on a leap-day though. [emphasis mine]
        > begin{sarcasm} Well, if it was a leap-day event, well that's totally excusable because there's no predictable way to know that a particular year might be a leap-year with a leap-day in it, and even if there were, my goodness, you'd need some sort of computational device to carry out the algorithm (that Al Gore, he invents everything!) that would let you figure it out, and who cou
        • You'd think after people made fun of the MS Zune for being out of action on a leap day that MS would take a bit more care before the next one.
          • Yeah, all of the window phones silliness is so worth laughing at. I remember the crazy ad that came out for the windows phone last year that had QuestLove in the commercial. I believe that /. had a story about MS cancelling that phone the SAME DAY that the commercial had just aired.
            And what the fVCk is it with the stomping and jumping and slapping around of hardware in the ms tablet ads? Is that all that the MS tablets are good for? Throwing them around and clunking them onto tables and benches? Wha
      • by TheLink ( 130905 )
        Given Microsoft's resources and Ballmer proclaiming they were "All In" on Azure, what they could have done after the leap year bug was to set up test systems that are replicas of production shards/clusters but with the time set to one week ahead or so. Then have the test systems run the usual regression tests 24/7.

        Then if stuff fails because of leap years, expiration or other time related stuff, it's more likely to fail in a test system first and they'll have a week to fix the problem before their users not
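A minimal sketch of the "clock set one week ahead" idea, assuming a simple in-memory inventory of certificate expiry times. The names `expired_certs` and `check_with_lookahead`, the certificate names, and the dates are all hypothetical and purely illustrative; nothing here reflects how Azure's test systems actually work.

```python
from datetime import datetime, timedelta, timezone

def expired_certs(certs, now):
    """Return the names of certificates whose expiry is at or before `now`."""
    return sorted(name for name, not_after in certs.items() if not_after <= now)

def check_with_lookahead(certs, real_now, lookahead=timedelta(days=7)):
    """Evaluate expiry as if the clock were `lookahead` ahead of real time."""
    return expired_certs(certs, real_now + lookahead)

# Hypothetical inventory; names and dates are made up for illustration.
certs = {
    "blob-storage": datetime(2013, 2, 22, 20, 0, tzinfo=timezone.utc),
    "queue-storage": datetime(2014, 1, 1, tzinfo=timezone.utc),
}
real_now = datetime(2013, 2, 16, tzinfo=timezone.utc)

print(expired_certs(certs, real_now))         # check against the real clock
print(check_with_lookahead(certs, real_now))  # check against the shifted clock
```

With the real clock nothing has expired yet, while the shifted clock already reports the certificate that is about to lapse, which is the week of warning the comment is asking for.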
      • Actually if you bother to read the article it looks like they had a reasonably good process going on behind the scenes i.e. cert owners got alerted & pushed the new cert in an update. The only problem was that they forgot to mark it as containing critical information (well, and their monitoring tools didn't alert them say a week out to say that the certs hadn't been renewed). So there is definitely room to improve the process, but saying that there is not a whole lot of good process is drawing a long b

    • It definitely won't happen again, instead the team responsible for keeping the automation software running will fail. Or an automatic upgrade to Windows will break it, or the libraries needed to run it will have been deprecated.

      So yeh, it won't happen again, the next time it will be something else to blame.

      Never of course a management that chops up roles into such small increments, dis-empowering its workforce so much that the simple job of updating a certificate becomes a major obstacle each and every tim

    • I still wonder why those certificates need an expiry date. And why they just don't put it like 100 years in the future.

      But first of all, why have expiry dates at all?

      I have seen often enough certificates being revoked for being compromised or whatever; and I have seen quite some trouble due to expired certificates leaving web sites inaccessible, for example.

      It doesn't seem to add much if any security (if it's compromised, you'll want to revoke it now, and not wait until it expires months or years later). Wh

      • by Alioth ( 221270 )

        Certificates need an expiry for the same reason that passwords ought to have them. The probability that a certificate has fallen into unauthorized hands increases with the passage of time, so having certificates expire means you can limit the usefulness of a stolen certificate.

        • Interesting you mention expiry dates on passwords as plenty of security people will argue that having expiry dates on passwords tends to decrease the security of passwords, as people select easier ones.

          Having a multi-year expiry date pretty much defeats the purpose: after falling in the wrong hands the certificate is useful only until it's detected that it's in the wrong hands. And that's usually not very long after it's being used.

          And a short expiry date (weeks, months) where it may actually have an effect o

          • by DarkOx ( 621550 )

            The security guys that argue passwords should not expire are crappy security guys. Passwords should be long enough and not expire at too great a frequency, I would say probably not less than 90 days. Many password attacks are inside jobs. Did the guy who does the backups take home a copy of the SAM database? If you don't rotate your passwords he can probably brute force them if they are weak pretty quickly. They might hold up several months if they are strong. Once he gets a password to a privileged ac

            • The security guys that argue passwords should not expire are crappy security guys.

              It depends what the password is protecting. If someone gets your password, the chances are they are going to use it immediately. A timed expiry of passwords can prevent repeat-uses of it, but if the attacker already had a chance to install malware when the account was originally compromised, they probably don't even need the password the second time around. Additionally, if a repeat attack isn't going to get the attacker anything extra over the original attempt, it's probably not worth worrying about.


              • by DarkOx ( 621550 )

                Again it's about not enabling someone to get your password in the first place. Rotation absolutely helps with that in the event a master password database is stolen. There are any number of reasons you might not be aware of that as well, not the least of which is an admin who has rights to copy the file decides to do so. He might find it very useful to be able to brute force the CEO's passwords and take a look around at the company's financial statements or his mailbox without appearing in any logs for exa

                • Again it's about not enabling someone to get your password in the first place. Rotation absolutely helps with that in the event a master password database is stolen.

                  Of course, but as with all security, this is about balancing the odds - what are the chances of the master password DB being stolen, cracked and the password(s) used, vs. the chances of one of your users being pushed into having a weak password through having to change it regularly?

                  You will usually know right away if its gone missing and change your password immediately or phone the helpdesk to have your account locked if this happens.

                  More likely they will be more concerned about their money and credit cards having been stolen and will completely forget that they had a password in there...

    • To be fair, it is really hard to remember so many dates and appointments when you're so busy. If only there was some sort of software, which can remind you of appointments and dates....
  • by Skapare ( 16644 ) on Tuesday March 05, 2013 @02:04AM (#43076203) Homepage

    ... managers saying "we need to get this up and running sooner ... automating it reliably is hard to do ... just get it working and update things manually for now and we will automate it later". When later comes, everyone is working on something else.

  • uhuh. I think people, especially technology companies, forget that the easiest task to automate is one that a human can simply do.

    "Executive assistant in charge of renewing certificates". Make it someone's job. It'll get done. You don't need a robot. You just need it to be in someone's contract. That's it.

    • by Lennie ( 16154 )

      Yeah, we can see how well that worked this time.

      The biggest problem with all this is something else of course, within big companies people get assigned to different tasks all the time.

      Sometimes that means that a simple but very important task gets handed over to someone else who doesn't fully understand the implications when it doesn't get done.

  • by girlinatrainingbra ( 2738457 ) on Tuesday March 05, 2013 @07:32AM (#43077349)
    Read the MSDN blog for how screwed up this really was. Here's the car analogy:
    We have a "Secret Store" that tells "the team that owns the tires" that the tires are just about worn out and that they will be useless on a certain specific date. The "team that owns the tires" buys new tires and tells the "Secret Store" that new tires have been bought. But the team does not install the new tires; instead it places the task of installing the tires in an "unprioritized queue"!!!! Somehow, more important tasks like replacing the windshield washer fluid and replacing that pine-tree air freshener hanging off the mirror get prioritized on the queue and performed. Lo and behold, the tires get too old, expire, and are taken off of the car. No one bothers putting new tires on the car. The car is nonfunctional. MS FTW, yet again!

    It's incredible how they keep shuffling blame around, or hot-potato-ing it:

    In this case, the Secret Store service notified the Windows Azure Storage service team that the SSL certificates mentioned above would expire on the given dates. On January 7th, 2013 the storage team updated the three certificates in the Secret Store and included them in a future release of the service. However, the team failed to flag the storage service release as a release that included certificate updates. Subsequently, the release of the storage service containing the time critical certificate updates was delayed behind updates flagged as higher priority, and was not deployed in time to meet the certificate expiration deadline. Additionally, because the certificate had already been updated in the Secret Store, no additional alerts were presented to the team, which was a gap in our alerting system. [source link []] [bold emphasis mine]

    Laughable, if it were not so stupid.
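The alerting gap in the quoted passage can be sketched in a few lines, under the assumption that the store and the running service each expose an expiry date for the same certificate. The helper `expiring_soon`, the 30-day window, and every date below are hypothetical; Microsoft's write-up does not describe the alerting logic at this level.

```python
from datetime import date, timedelta

WARN = timedelta(days=30)  # assumed warning window, not from the source

def expiring_soon(expiry, today, warn=WARN):
    """True when `expiry` falls within `warn` of `today` (or has passed)."""
    return expiry - today <= warn

# Illustrative dates: the renewed certificate sits in the store,
# while the old certificate is still the one deployed.
stored_expiry = date(2014, 2, 22)
deployed_expiry = date(2013, 2, 22)
today = date(2013, 2, 15)

store_alert = expiring_soon(stored_expiry, today)       # store looks healthy
deployed_alert = expiring_soon(deployed_expiry, today)  # live cert is a week out
drift = stored_expiry != deployed_expiry                # renewal not rolled out yet

print(store_alert, deployed_alert, drift)
```

An alert keyed only to the store goes quiet as soon as the renewal is uploaded; checking what is actually deployed, or simply flagging the mismatch, would keep nagging until the release ships.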

    • It's laughable mostly because they told us just how incompetent they are. They came right out and told us that their service, which they want you to make mission-critical, is managed with a process that would make the three stooges weep with uncontrollable laughter.

  • This is what happens when you have bean counters and MBAs running the IT department.
  • by thoth ( 7907 )

    Good lord, last year it was a 12 hour outage on leap day, this year it was a 12 hour (as far as I can tell) outage due to expired certificates. They won't be able to claim five 9's uptime for ~274 years!

    At the rate of a half day of failure every year, so far, I'm not even sure I'd trust Azure for storage no matter what the discount they offer.
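A quick check of the arithmetic, assuming the two 12-hour outages add up to 24 hours of downtime and using a 365.25-day year. Under those assumptions, 24 hours fits inside a five-nines target only over a window of roughly 274 years, and inside six nines over roughly 2,738 years. The function name `years_to_reach` is made up for this sketch.

```python
HOURS_PER_YEAR = 365.25 * 24

def years_to_reach(nines, downtime_hours):
    """Length, in years, of the measurement window over which
    `downtime_hours` of outage still meets an availability target
    with `nines` nines (5 means 99.999%)."""
    allowed_fraction = 10 ** -nines  # permitted downtime as a fraction of the window
    return downtime_hours / allowed_fraction / HOURS_PER_YEAR

for n in (5, 6):
    print(f"{n} nines: about {round(years_to_reach(n, 24))} years")
```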

  • They pushed the update out on Jan 7. By Feb 22, it hadn't been completed. Something is not right with this explanation. Doesn't matter how low a priority it was, it should have been pushed out within what? Two weeks? A month?

  • ...otherwise known as the Microsoft Blue Cloud of Death?

    [ "Azure" is a shade of blue, for those that don't know,
    and why MS would go with this kind of name, given their history with things "blue" is beyond me. ]

  • My certificate authority sends me nagging emails like 6 weeks before my certificate's about to expire. Microsoft's certificate authority group needs to create a database and automated emails when certificates get near expiration. Start emailing a bunch of folks. It's very simple. Probably most CA's have such a setup.
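The "database and automated emails" suggestion could look something like the sketch below, assuming the database is just a mapping from certificate name to owner and expiry date. The function `reminders_due`, the addresses, and the certificate names are hypothetical, and the actual sending of mail is left out.

```python
from datetime import date, timedelta

def reminders_due(inventory, today, lead=timedelta(weeks=6)):
    """Return (owner, cert_name, days_left) for certs expiring within `lead`,
    most urgent first."""
    due = []
    for cert_name, (owner, expiry) in inventory.items():
        days_left = (expiry - today).days
        if days_left <= lead.days:
            due.append((owner, cert_name, days_left))
    return sorted(due, key=lambda row: row[2])

# Hypothetical inventory; owners and dates are made up for illustration.
inventory = {
    "storage-ssl": ("storage-team@example.com", date(2013, 2, 22)),
    "portal-ssl": ("portal-team@example.com", date(2013, 9, 1)),
}

for owner, cert_name, days_left in reminders_due(inventory, date(2013, 1, 15)):
    print(f"To {owner}: {cert_name} expires in {days_left} days")
```

Run daily, this keeps emailing the owner until the expiry date in the inventory moves, so the inventory would need to record the certificate that is actually in service.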
