Communications Networking The Internet

Lessons Learned From Skype’s Outage

aabelro writes "On December 22nd at 1600 GMT, Skype's services started to become unavailable, at first for a small share of users, then for more and more, until the network was down for about 24 hours. A week later, Lars Rabbe, CIO at Skype, explained what happened in a post-mortem analysis of the outage."

  • Deployed Soldiers. (Score:5, Insightful)

    by puterg33k ( 1920022 ) on Thursday December 30, 2010 @10:18AM (#34710822) Homepage
    For us it's nearly our only way to speak to our loved ones at home. I'm just glad it's back up...
  • Blogspam (Score:5, Informative)

    by ralf1 ( 718128 ) on Thursday December 30, 2010 @10:22AM (#34710876)
    Not sure why you didn't link to the actual article on Skype, http://blogs.skype.com/en/2010/12/cio_update.html [skype.com], instead of the blogspam site.
  • by colinRTM ( 1333069 ) on Thursday December 30, 2010 @10:23AM (#34710886)

    Seriously?

  • you are kidding me (Score:5, Interesting)

    by alphatel ( 1450715 ) * on Thursday December 30, 2010 @10:25AM (#34710908)
    If you are a node-based company worth several billion, charge for services, and don't even run enough of your own supernodes and monitor them in such a way that they cannot handle an outage effectively, you need serious help.
    • by TubeSteak ( 669689 ) on Thursday December 30, 2010 @10:56AM (#34711288) Journal

      If you are a node-based company worth several billion, charge for services, and don't even run enough of your own supernodes and monitor them in such a way that they cannot handle an outage effectively, you need serious help.

      No one expects 40% of a globally distributed network to crash at once. No one.
      FTFA:

      The initial crashes happened just before our usual daily peak-hour (1000 PST/1800 GMT), and very shortly after the initial crash, which resulted in traffic to the supernodes that was about 100 times what would normally be expected at that time of day.

      Not even a multi-billion dollar company would have a disaster plan that provisions 100x capacity as a hot/cold spare.
      Though I bet their new plan includes automatic spawning of nodes on EC2 or some other distributed CDN.

      • I agree. But it wasn't an initial 100x surge, right? It was a cascading failure where eventually supernodes were up 100% because there were fewer and fewer of them. It's a matter of prevention, not cure.
      • a client (or even many) crashing shouldn't cause the server to, too. That's just bad design/software.

        Skype seems clueless. They're thinking of using "processes for providing ‘automatic’ updates to our users so that we can help keep everyone on the latest Skype software. We believe these measures will reduce the possibility of this type of failure occurring again." Contrariwise - this would only make the matter worse. What if the _current_ version were the one with the problem, and an automated
        • Ah, but it's a brave new world where the client/server relationship is becoming fuzzier all the time. The part I think you are missing is that if you read the actual post it is obvious that everything that was crashing was applications on clients' computers. It appears that some clients are promoted to server status to handle routing requests.

          As for bad design/software I would instead say they had features without consideration of consequences. Here is where their problems are from what I can see.

          1. No
      • No one expects 40% of a globally distributed network to crash at once. No one.

        Oops. I made a mistake.
        It's 40% of 50%. So actually ~20% of global users crashed.
        The problem was that those ~20% of global users represented 25%~30% of active supernodes.

        Either way, losing 20%~30% or 40% of a globally distributed network is still the kind of stuff that only the RAND corporation and the Pentagon make plans for.

        If Skype hadn't included circuit breakers (so that the client would go easy on your bandwidth and CPU), their network might have stayed up.

      • If you are a node-based company worth several billion, charge for services, and don't even run enough of your own supernodes and monitor them in such a way that they cannot handle an outage effectively, you need serious help.

        No one expects 40% of a globally distributed network to crash at once. No one.
        FTFA:

        The initial crashes happened just before our usual daily peak-hour (1000 PST/1800 GMT), and very shortly after the initial crash, which resulted in traffic to the supernodes that was about 100 times what would normally be expected at that time of day.

        Not even a multi-billion dollar company would have a disaster plan that provisions 100x capacity as a hot/cold spare.
        Though I bet their new plan includes automatic spawning of nodes on EC2 or some other distributed CDN.

        It was their own widely deployed buggy software that caused the big chunk to go offline. Any other organization with a big deploy everywhere button would understand the importance of an equally big roll back button, and heavy testing before doing either. I guess because Skype's clients are also their servers so they have no control is an excuse? Is it a good one?

    • The last time I checked, the only service they charge for is IP-based to a standard phone connection, not any PC-to-PC stuff.
  • by smash ( 1351 ) on Thursday December 30, 2010 @10:27AM (#34710948) Homepage Journal

    ... relying on dodgy peer to peer VOIP telephony for business purposes is retarded.

    we've got people bitching at work about how it doesn't work from time to time, and why I've blocked its ability to do voice/video at the firewall. If you want VOIP, use something that uses standard SIP or some other documented, configurable traffic.

    • by commodore64_love ( 1445365 ) on Thursday December 30, 2010 @10:44AM (#34711144) Journal

      Ahh so YOU'RE the one blocking my skype. ;-)
      I don't understand why Net Admins (such as yourself) block useful tools like Skype. Or streaming radio. I don't see any harm in letting those things into the office space, and it provides a more pleasant working environment (to distract from the boredom of sitting at a desk all day).

      • by smash ( 1351 ) on Thursday December 30, 2010 @10:55AM (#34711276) Homepage Journal

        Why do I block skype? Because the only way to have it work properly through most firewalls is to allow ALL outgoing ports. Which means you allow any random program to do any random shit through your firewall to the outside network. It's a massive, massive security issue you could drive an oil tanker through.

        Also, many companies pay for bandwidth. I don't want all of my bandwidth chewed up on video calls instead of mission critical apps.

        It's not just because we're nazis, it's because the skype protocol is completely fucked when it comes to the ability of your admin to control resources. Want voip/video? Use something else.

        • by smash ( 1351 ) on Thursday December 30, 2010 @10:59AM (#34711318) Homepage Journal

          Just let me clarify: corporate networks are different to your home network. your home network? fine, use skype. in the office, where you've got several hundred PCs that may/may not have malicious software and/or users at the helm - allowing all outgoing connections is just begging for trouble.

          Egress filtering is a good thing.

          Making your day at work "less boring" by enabling you to do non-work related shit with company resources is not what my job is about. It is about ensuring the continued operation of the company's network - and skype is a liability.

          • by BobMcD ( 601576 )

            Making your day at work "less boring" by enabling you to do non-work related shit with company resources is not what my job is about. It is about ensuring the continued operation of the company's network - and skype is a liability.

            Careful there, BOFH. Here I'll help:

            Making your day at work "less boring" by enabling you to do non-work related shit with company resources is none of my business. Get it requested through the proper channels and you can have it. I don't make the business decisions here, I just do what the company needs done to be successful.

            • Look, I'm all for business driven IT, but sometimes you have to save your managers from themselves. It's not being a BOFH to look out for the corporate network. You were hired to have the expertise to make recommendations and keep things as secure as possible. If it gets shoved through anyway then it may be time to start looking for someplace that actually values your skills.
              • by BobMcD ( 601576 )

                Good luck with that. Welcome to 2010's economy.

                Meanwhile, CYA and collect your paycheck. Let those with the MBA's make the calls and take the heat, and NEVER bicker with the end user. You're not paid enough to deal with their crap.

            • by smash ( 1351 )

              It's still not going to be allowed through. They want skype, they can have a 3g service for their laptop and run skype through that.

              I've explained to management the security problems with skype when it was originally requested and have support to block it.

              • by BobMcD ( 601576 )

                Then you're either enjoying bickering with the end users or this is an imaginary scenario...

                • by smash ( 1351 )
                  No, they just figure out skype doesn't work, come see me, i tell them it is not supported and to pick up the telephone.
        • Deep packet is the only way to block Skype (or so I've heard.) The real danger lies not in the voice/videoconferencing but in the potential for tunneling and/or circumvention of data loss prevention controls.
        • Why do I block skype? Because the only way to have it work properly through most firewalls is to allow ALL outgoing ports.

          Skype lists three other firewall configurations [skype.com] that work, including two that only require egress on a single port that's almost always open anyway.

          Its a massive, massive security issue you could drive an oil tanker through.

          Oh, come on. Sure, egress filtering is a polite thing to do, but it's inbound connections that put you at risk. And chances are, if you do fall victim to some nefarious piece of malware that's making unwanted outbound connections, simple packet filtering will be useless anyway because it will fall back to TCP 80, or TCP 443, or even UDP 53, to tunnel out. Just l

      • Back in the day I worked at a place that banned streaming audio because one day there wasn't enough bandwidth for the actual business applications to go about their business when everyone was listening to their streamed music.

        Skype can eat a lot of bandwidth.

        • In places where DSL or cable internet is cheap, it seems basic common sense to have a "toy" internet connection with a wireless router. That's like $25 a month per 100 users (that's what we have where I work).

          Note that I'm not suggesting 100 people could actually use it at the same time, but out of 100 people actually working, maybe 100 use any real bandwidth at once.

      • by noc007 ( 633443 )

        Within the network I manage, it boils down to bandwidth, security, and slacking off.

        We have two large offices and a few small offices. All of the internet traffic is routed through the WAN to the main office that has a 10Mb link which is shared with our internet facing servers. The other large office acts only as a backup and has a 5Mb internet connection. The WAN links are 3Mb with the exception of the main office having a 6Mb one. Regular business WAN traffic is a steady 1Mb across the board with the usua

        • by smash ( 1351 )
          Exactly as above. People get DSL at home and think they have the equivalent at work (for each and every employee). It simply doesn't work that way.
      • I don't understand why Net Admins (such as yourself) block useful tools like Skype. Or streaming radio.

        Well, we don't block Skype here... Though we do block streaming radio. I can give you a couple good reasons for both.

        1) Bandwidth. A service like Skype or streaming some radio station may not actually take all that much bandwidth itself... But if you've got 10 or 100 or 1,000 folks using it simultaneously the bandwidth requirements get quite steep. And it's un-necessary bandwidth. You could pick up your phone and not hit the Internet, you could turn on a regular radio and not hit the Internet. Busin

  • Sorry if this is off topic or an ignorant question, but how does Skype define supernodes? Does the company just randomly choose users who are online a lot and declare them supernodes without the owner's knowledge, or is there some other process?
    cheers

    • "Does the company just randomly choose users who are online a lot and declare them supernodes without the owner's knowledge"

      yes, that's exactly what they do. and yes, that's retarded for a company like skype

      • by smash ( 1351 )
        Well no, it's not really retarded for skype. It's retarded for skype users to actually agree to those terms of service.
        • that's right, because everyone who wants to use VOIP should review the source code and familiarize themselves with the relevant RFC specs

          classic "if you aren't a computer scientist you shouldn't use the internet" ignorant geek snobbery. how's that standard of behavior working for you?

          • by smash ( 1351 )
            I was merely suggesting that it's just fine and dandy as far as SKYPE the company goes to rip people's bandwidth off. If you cbf reading the license and just click OK for the free shit then you deserve whatever raping you get. Nothing is free.
  • Obvious problem.... (Score:5, Interesting)

    by dstar ( 34869 ) on Thursday December 30, 2010 @10:28AM (#34710958)

    Hmm. Seems to me their biggest problem is that they allowed clients with a known bug to become supernodes; if 50% of the network had upgraded, they should only have been creating supernodes from the upgraded clients.

    And in hindsight (I don't know that they should be blamed for not considering this before), the number of supernodes should probably be ~100-150% more than needed to service expected load. That way, if a third of them die, they _still_ have more than needed to handle the expected load. (And thus, hopefully, more than needed to handle the excessive load without causing them to shut down).
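
The headroom argument above can be checked with a little arithmetic. This is only a rough sketch, assuming every supernode has equal capacity and that load spreads evenly across them (neither is stated in Skype's post-mortem); the function name is made up for illustration.

```python
def surviving_capacity_ratio(overprovision: float, fraction_lost: float) -> float:
    """Capacity left after a failure, as a multiple of the expected load.

    overprovision: extra capacity beyond expected load (1.0 means 100% extra).
    fraction_lost: share of supernodes that die (0.30 means 30%).
    Assumes equal-capacity supernodes and evenly spread load.
    """
    total = 1.0 + overprovision  # provisioned capacity, in units of expected load
    return total * (1.0 - fraction_lost)


if __name__ == "__main__":
    for extra in (0.0, 1.0, 1.5):
        ratio = surviving_capacity_ratio(extra, 1 / 3)
        print(f"{extra:.0%} extra, a third lost -> {ratio:.2f}x expected load")
```

With 100% extra and a third of the supernodes gone, about 1.33x the expected load is still covered. That handles a normal day, but not the roughly 100x surge quoted from the article elsewhere in this thread.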

    • by BobMcD ( 601576 )

      Hmm. Seems to me their biggest problem is that they allowed clients with a known bug to become supernodes; if 50% of the network had upgraded, they should only have been creating supernodes from the upgraded clients.

      If they had the power to stop bugged clients from becoming supernodes, why not just use that same power to make them patch? You're sort of assuming that they ever imagined that this could have happened. It's pretty clear that they didn't...

      It's subtle, but it's there at the bottom where they admit 'we need to test our crap first and we need some way of making people patch' - which is kind of a known thing in the modern software world.

    • Seems to me their biggest problem is that they allowed clients with a known bug to become supernodes

      OK, FTFA

      Approximately 40% of all Skype users that were online crashed, taking down around 30% of all supernodes.

      - so supposedly this means that 30% of the supernodes went offline due to the bug, is this correct?

      But look at the number: 40% of ALL Skype users went offline! That's insane, that's almost half. At the same time ONLY 30% of the supernodes went offline due to this bug, right?

      Something does not add up.

      FTFA:

      Clients that continued to be up and running, and clients that restarted the application had their network searches directed to the supernodes still running, leading to an overload of those. Since Skype has in place a protection when a supernode is overloaded, so it would not consume too much of a client’s system’s resources, the supernodes started to shutdown automatically one after another, leading to a generalized failure of the network.

      - so the sequence of events is supposedly this:

      1. Bug causes 40% of all Skype clients to stop functioning, this includes 30% of all supernodes.

      2. The remaining 60% of all Skype clients relied o

    • by sco08y ( 615665 )

      Hmm. Seems to me their biggest problem is that they allowed clients with a known bug to become supernodes

      Isn't the biggest problem the monolithic app design?

      Look at this bug: it's due to counting the number of voicemail messages. *Why* did that take out the node completely?

      This makes a pretty good argument for modularizing a GUI into discrete tools. Not only does it protect me from bugs in one tool, but I also don't have to run stuff I'm not interested in.

  • by commodore64_love ( 1445365 ) on Thursday December 30, 2010 @10:29AM (#34710964) Journal

    "At its core, Skype relies on a third generation P2P network that has lots of peer nodes and a number of supernodes, one for several hundreds of nodes. Since Skype does not have a centralized directory to support finding routes between two or more nodes that want to communicate, the virtual network uses supernodes as directories. When a client enters Skype, it registers itself with a supernode, giving its IP address so it can be found by other clients who might want to establish a communication."

    Skype is a peer-to-peer network? Like torrent? So the supernode is like a tracker website, to connect peers to one another? No supernode==no tracker==no calls going through. Hmmmm. Maybe they should try DHT.
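
The directory role described in the quote can be pictured with a toy model. This is a sketch under loud assumptions: the class, its methods, and the address are hypothetical, and Skype's real protocol is proprietary and certainly more involved.

```python
from typing import Dict, Optional


class Supernode:
    """Toy directory: maps a user name to the IP address it registered with."""

    def __init__(self) -> None:
        self._directory: Dict[str, str] = {}
        self.online = True

    def register(self, user: str, ip: str) -> None:
        self._directory[user] = ip

    def lookup(self, user: str) -> Optional[str]:
        # An offline supernode cannot answer, so the caller gets nothing back.
        if not self.online:
            return None
        return self._directory.get(user)


if __name__ == "__main__":
    node = Supernode()
    node.register("alice", "192.0.2.10")  # documentation-range address
    print(node.lookup("alice"))           # 192.0.2.10
    node.online = False
    print(node.lookup("alice"))           # None: no directory, no call setup
```

In this picture the comparison to a tracker holds: peers carry the call, but without a reachable directory they cannot find each other.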

  • TL;DR version: (Score:5, Interesting)

    by The MAZZTer ( 911996 ) <megazzt&gmail,com> on Thursday December 30, 2010 @10:30AM (#34710974) Homepage

    Lots of users were using an old outdated buggy version of Skype, lots of client crashes at once bringing down big chunks of the P2P network, remaining network couldn't handle the load and went down too, took a while for Skype to put its own supernodes up to help get the network self-sustaining again.

    They're considering an auto-update feature now since such a feature could have kept this from happening. Personally I think old versions should be blocked from making or receiving calls too, so users would be encouraged to update (works for Team Fortress 2). Of course auto updates would make updating super easy anyway so impact from that would be minimal.

    • by spxero ( 782496 )

      The problem with the auto-update feature in Skype vs. gaming is that most gaming computers will be close to top-of-the-line. Most computers used for Skyping will not be top of the line.

      From experience, the 5.0 version of Skype doesn't work as well as the 3.8 branch. Switching between windowed and full-screen video on the 5.0 branch takes ~4 sec to accomplish, with the audio becoming choppy at the same time. In addition, the video is choppy and audio quality is scratchy at best. The 3.8 branch doesn't have t

  • by syousef ( 465911 ) on Thursday December 30, 2010 @10:36AM (#34711038) Journal

    ...unless you need something in the newer version (feature, security update etc.). Of course us geeks like to have the latest to fiddle with, but for the average Joe end-user, if it ain't broke, don't fix it. There is always the risk that the newer software will contain new bugs. At one point the buggy version of the Skype software was the latest version and was what users were being pushed to upgrade to. If the crash had happened then, I wonder if they'd find a new way to scapegoat users.

    By the way, new versions breaking existing functionality isn't theoretical, or rare. I'm currently installing software on my new laptop. I've had to downgrade both Zonealarm and Virtualbox. The former broke remote desktop. The latter broke file sharing. No idea why, but in each case uninstalling and installing an older version I knew worked fixed the issue for me.

    • The problem is that it is broke, you just often don't realize it. Older doesn't mean more secure or more stable inherently. New versions fix bugs discovered in old versions. If everyone did update immediately, then everyone would have had the bug fix and this outage wouldn't have happened.

      • by BobMcD ( 601576 )

        You're suffering from sample bias. Newer software is also 'broke' and you also don't know that. I think the point would be, if it is 'broke' but not impacting you in a way that you'd know it, do you care? In some cases yes, in other cases no.

        • It is equally possible that newer software introduces bugs as much as fixes them. But the assumption that older is always more secure and stable is flawed.

          In reality, the best solution is to review changelogs and make informed decisions when upgrading. But avoiding all upgrades isn't the solution.

    • ..unless you need something in the newer version (feature, security update etc.).

      And also especially when the update is a 20-megabyte file. In fact, we need to reinstall the whole software every time.
      Why such a lame updating system?

  • Supernode Software (Score:5, Interesting)

    by varmittang ( 849469 ) on Thursday December 30, 2010 @10:37AM (#34711060)
    How about they release some supernode-only software that people can set up on a server, and possibly the ability to set up Skype to use a preferred supernode. So a business can set up a supernode of its own and point its users to it. But that supernode is also part of the collective of supernodes and routes Skype connections for everyone else too. This would hopefully give Skype more supernodes out there that are 24/7 and not desktop computers routing the traffic.
  • If problems with the client can lead to problems with the server then the server system lacks robustness. For applications like this the servers should be practically immune to any client state muck-ups.

    Seems to me skype needs to work on their server side state machines.

    • by smash ( 1351 )
      You missed the point. With skype, the clients ARE the servers ("randomly" (i.e., non-nat well connected) selected supernodes).
    • Do you know what peer to peer means?

      here's a hint: there are no servers, they just use the bandwidth and cpu of random clients to do that work.

    • There's an exception to the client-server divide, and this is a classic example: if your mistake causes a big chunk of your client base to DoS your infrastructure, it's going to go down, no matter how good your infrastructure is.

  • by Ukab the Great ( 87152 ) on Thursday December 30, 2010 @10:44AM (#34711148)

    "We expected a Limewire topology to be as reliable as a phone company topology and oddly enough that bit us in the ass."

  • by scorp1us ( 235526 ) on Thursday December 30, 2010 @10:59AM (#34711328) Journal

    The QA of this release is way down. On top of that, skype auto-updated people from 4.0 to 5.0. Within a few days, the buggy 5.0 had enough penetration (50%) to bring them down.

    The windows client has widely been reported to:
    consume 2x as much CPU (33% to 60% on mine after upgrade)
    leak RAM (starts out ok but after some use over 1.5gig needed)
    the GUI is slow, so the fade effects on some computers (mine) causes video tearing. It is no longer possible to run full-screen. (320x240 is all I get before tearing sets in)
    The fonts in the video area don't render correctly.
    It should be noted that I have an AMD X2 1.6 and Radeon 1200 card in this computer. It's not shabby. But the 5.0 client brought it to its knees.

    It plays SCII just fine (albeit on the lowest setting).

    It comes at a bad time when they are trying for more corporate agreements, but can't run on my 3-year-old hardware.

    I uninstalled 5.0 and installed 4.0 and it's back to normal.

  • Public Post-Mortem (Score:5, Insightful)

    by Enderandrew ( 866215 ) <enderandrewNO@SPAMgmail.com> on Thursday December 30, 2010 @11:07AM (#34711424) Homepage Journal

    You can bitch they didn't QA the release. You can bitch that you don't like a P2P topology. But it is nice to see a public post-mortem.

  • Back when I was doing one of the first VOIP solutions (this one mostly for LAN use) we dreamed up something like Skype, that would work in similar fashion. The big advantage is that it could be done by any reasonably large group of users and no phone company at all need be involved -- no charge to anyone, no control over anyone by some big monolithic corp. It could still be done, and I wonder why no one in the open source area has managed? Critical mass issue; selling the first phone is a bear -- who you
  • by mario_grgic ( 515333 ) on Thursday December 30, 2010 @11:15AM (#34711542)
    I hate when apps run auto update daemons. This is precisely the reason why I don't use any Google desktop software on my computers.

    Proper thing to do in this case is simply disallow users to log in with a message they need to upgrade their client if they want to continue to use the app. Simple thing to do, rather than each app running a daemon. Soon enough there will be hundred update daemons on each user's computer, eating resources, connecting online all the time and bogging down the user experience. Thanks but no thanks. I refuse to use any of those.
  • About 20 years ago now... sent out code with a bug in the fault recovery code, then a problem in one node cascaded throughout the network. http://www.phworld.org/history/attcrash.htm [phworld.org]
  • "We believe that increased load in supernode traffic led to some of these parameters exceeding normal limits, and as a result, more supernodes started to shut down"

    Maybe I'm missing something, but why are supernodes coded to shut down during increased load instead of simply throttling requests? It seems like the idea of 'too many requests, shut down' is what caused the cascade. Can someone enlighten me as to why this is the preferred overload handling mechanism?
    • It's called cheap, crappy developers.

      Assume your socket connections will always work, and don't bother handling errors, throttling or connection requests, it's the cheapest, easiest way after all. It's probably not even "too many requests, shut down" but "too many requests, crash". Once there - ship and let your users be damned.

      Only in this case, the company found out why you should hire the best devs you can and not the cheapest. If your business is software, you need to treat it like an asset, not a cost.

    • They are using Windows clients. "c:\> nice skypesupernode" ain't gonna do it.
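
The shut-down-versus-throttle question in this sub-thread can be made concrete with a small model. It is a sketch only, assuming identical supernodes and evenly spread load; the function and policy names are invented here and do not describe Skype's actual code. It also ignores the reason the article gives for the shutdown behaviour, namely protecting the host machine's resources.

```python
from typing import Tuple


def simulate(nodes: int, capacity: float, load: float, policy: str) -> Tuple[int, float]:
    """Return (supernodes still up, load served) once the system settles.

    policy "shutdown": an overloaded node switches itself off.
    policy "throttle": an overloaded node serves up to capacity and rejects the rest.
    Assumes identical nodes and evenly spread load.
    """
    if policy == "throttle":
        return nodes, min(load, nodes * capacity)
    live = nodes
    while live > 0 and load / live > capacity:
        live -= 1  # overloaded node shuts down; its share moves to the others
    served = load if live > 0 else 0.0
    return live, served


if __name__ == "__main__":
    # 100 nodes sized for 10 units each, 800 units of load.
    print(simulate(100, 10.0, 800.0, "shutdown"))  # healthy network
    print(simulate(70, 10.0, 800.0, "shutdown"))   # after losing 30% of nodes
    print(simulate(70, 10.0, 800.0, "throttle"))   # same loss, but shedding load
```

Under these assumptions the shutdown policy collapses to zero nodes as soon as the per-node share exceeds capacity, because every shutdown pushes more load onto the survivors. Throttling keeps all 70 nodes up and serves what they can carry.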
  • by ThePhilips ( 752041 ) on Thursday December 30, 2010 @11:46AM (#34711928) Homepage Journal

    One important lesson to be learned is this: many users do not update their software if they don’t have to. Skype had a newer version in place, without the triggering bug, but most users had the buggy one.

    Yeah. Right. Because all recent Skype updates (starting with version 3(?)) were known to contain mostly only one of these: more ads or more UI bloat. And occasional breakages.

    So why they expect that users would be updating it regularly?

  • There is an option between "auto-update" and "update when you want": deprecated versions. If a version has a known major bug in it that could compromise the system, require updates for only those versions. That way only the bad version will be replaced and we won't be updating everyone at every release. The main advantage is that the system is kept safe without unnecessary updates.
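
The "deprecate only the bad versions" idea could look something like the following at login time. This is a sketch with hypothetical names and made-up version numbers; nothing here reflects how Skype actually checks clients.

```python
from typing import Set, Tuple

Version = Tuple[int, ...]


def parse_version(text: str) -> Version:
    """Turn a dotted version string such as "2.1.0" into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def must_update(client_version: str, blocked: Set[Version]) -> bool:
    """True only if this exact version is on the known-bad list."""
    return parse_version(client_version) in blocked


if __name__ == "__main__":
    blocked = {parse_version("2.1.0")}   # hypothetical buggy release
    for version in ("2.0.9", "2.1.0", "2.2.0"):
        action = "update required" if must_update(version, blocked) else "ok"
        print(version, action)
```

Older and newer releases pass untouched; only the listed version is forced to update, which is the middle ground the comment describes.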

  • NAT is evil. Skype needs to build an overly complex networking protocol because too many people are behind NAT gateways. Skype *could* probably get away with their basic available hardware if only they got to design for a NAT free world.

    One could also say they were trying to cheap out and not invest as much hosting required to assure reliability of their chosen networking architecture.

    Of course, on the flip side, Skype as a service would be nearly useless in a NAT-free world. No need for a coordinating e

  • Quote from TFA:

    Approximately 40% of all Skype users that were online crashed, taking down around 30% of all supernodes. Clients that continued to be up and running, and clients that restarted the application had their network searches directed to the supernodes still running, leading to an overload of those. Since Skype has in place a protection when a supernode is overloaded, so it would not consume too much of a client’s system’s resources, the supernodes started to shutdown automatically one

  • Here is what really happened.

    A non-telephone company had a cascading problem with its ad-hoc peer-to-peer networking that provides telephony and video services at costs way below any telephone (or cable) company. The company is profitable enough to make its own way in this world.

    This story was broadcast pretty-much worldwide by all media.

    The non-telephone company was embarrassed and released a statement to the media about how this happened as a means by which it might encourage everyone to download new, fr
