
Researcher: Interdependencies Could Lead To Cloud 'Meltdowns'

Posted by Soulskill
from the we-can-only-hope dept.
alphadogg writes "As the use of cloud computing becomes more and more mainstream, serious operational 'meltdowns' could arise as end-users and vendors mix, match and bundle services for various means, a researcher argues in a new paper set for discussion next week at the USENIX HotCloud '12 conference in Boston. 'As diverse, independently developed cloud services share ever more fluidly and aggressively multiplexed hardware resource pools, unpredictable interactions between load-balancing and other reactive mechanisms could lead to dynamic instabilities or "meltdowns,"' Yale University researcher and assistant computer science professor Bryan Ford wrote in the paper. Ford compared this scenario to the intertwining, complex relationships and structures that helped contribute to the global financial crisis."
  • by houstonbofh (602064) on Saturday June 09, 2012 @11:27PM (#40272013)
    If you have a critical service, have it at more than one host... That way when AWS has a bad hair day, you are still up.

    Or, have your entire business totally dependent on someone else. (Sounds kinda scary that way, don't it?)
    • by girlintraining (1395911) on Saturday June 09, 2012 @11:43PM (#40272085)

      If you have a critical service, have it at more than one host... That way when AWS has a bad hair day, you are still up.

      While we're at it, we should probably backup the internet too. You'd think someone would have done it by now, in case it crashes, but I can't find any record of anyone doing it.

    • by martin-boundary (547041) on Sunday June 10, 2012 @12:42AM (#40272243)
      There's a limited number of cloud hardware providers on the internet, and the rest are middle men. It's useless to diversify yourself on the middle men, they will all be affected when the common underlying hardware provider has an issue. Thus there's a limit to the reliability that can be achieved, irrespective of how much mixing and matching is performed at the "business end".

      Diversification only "works" when the alternatives are provably independent. That's not true in a highly interconnected and interdependent world, which is TFA's point, I believe.

      • by Gerzel (240421)

        That's why you use a back up other than the cloud. If you can backup across clouds you almost certainly can backup across some real-hardware and the cloud using both at the same time.

        The could is great to provide extra power and computing resources cheaply but real hardware and servers owned by your company also still serve a vital role. One can be a backup to the other and both be utilized sharing the load at normal times.

        Thus you don't have to pay for the full hardware costs of what you use by offloadin

    • by im_thatoneguy (819432) on Sunday June 10, 2012 @12:52AM (#40272269)

      That's one of the problems though that the researcher is flagging.

      1) If a company has one instance on AWS and one on Azure and AWS fails... Azure suddenly doubles in load ( and also fails due to everybody piling on unexpectedly).

      the other being:

      2) Everybody uses Azure for SQL and AWS for hosting and Azure goes down... suddenly SQL dies and the AWS hosts all fail with the database down. Or the converse happens and AWS goes down and the SQL is useless without a head.

      The more services you rely on the more likely that on any given day one of them will be down. If you have 99% reliability and 20 services that you depend on (without any redundancy) then your failure rate could be up to 20% since any one of the 1% failures could kill your service.

      It's interesting but it seems like most of the cloud failures have been due to #1 internally so far. One sector fails and in an effort to load balance it starts taking out its peers who then also overload and take out their peers.
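The availability arithmetic in this comment can be checked in a few lines (a sketch; the 99%-per-service and 20-service figures come from the comment and are purely illustrative):

```python
def chance_of_outage(per_service_availability: float, n_services: int) -> float:
    """Chance that at least one of n independent services is down."""
    return 1 - per_service_availability ** n_services

# Exact figure for 20 services at 99% each: about 18.2%.
exact = chance_of_outage(0.99, 20)

# The comment's "up to 20%" is the union bound: 20 * 1%.
union_bound = 20 * 0.01
```

The "up to" phrasing is right: summing the individual 1% failure chances overstates the risk slightly, because it double-counts days when two services fail at once.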

      • by k8to (9046)

        It's worse than this.

        Most cloud services are built out using a significant number of other cloud services. That's the "upside" of being in the cloud -- you can use software/platform as a service to reduce the management/overhead costs of building out all that infrastructure yourself. So you can use service X for credit card running, and service Y for user support, and service Z for indexing and search, and so on.

        A modern cloud offering might use 10 or more other services. And those offerings may be using

    • That kind of sounds like distributed computing, which is what the cloud was before corporations picked up on the word and applied it to what they wanted to sell.
    • by Shagg (99693)

      If you really have a critical service then you're not going to be putting it on "the cloud" anyway.

  • XKCD (Score:4, Funny)

    by Shadyman (939863) on Saturday June 09, 2012 @11:28PM (#40272019) Homepage
    XKCD (jokingly) saw this coming a while ago: http://xkcd.com/908/ [xkcd.com]
    • by Anonymous Coward

      XKCD (jokingly) saw this coming a while ago: http://xkcd.com/908/ [xkcd.com]

      It's rather chilling. Imagine for a moment a configuration where Cloud A hosts an online auction website. It automatically converts bids into the user's chosen currency. 'A' also hosts a service which calculates the current value of a few obscure currencies. Add in a Cloud B which hosts that auction site's currency exchange information service and is written to check the exchange rate of, for example, bitcoins, every hour. Also on 'B' is a service that aggregates various auction sites for best deals. Assume

  • We live in an age where information is distributed, even if only statistically. (Hell, I made a fake Facebook account and somehow they found my mom, and she is nowhere close to me.) A meltdown of information can't happen unless there is a worldwide meltdown of power. We have backups, but also ways of statistically restoring those backups.
    • by c0lo (1497653) on Saturday June 09, 2012 @11:57PM (#40272119)

      We live in an age where information is distributed, even if only statistically. (Hell, I made a fake Facebook account and somehow they found my mom, and she is nowhere close to me.) A meltdown of information can't happen unless there is a worldwide meltdown of power. We have backups, but also ways of statistically restoring those backups.

      Redundancy helps, but it is not bullet-proof. A good chunk of it is the "topology" in which this redundancy is engaged in events of failure (e.g., we had cascading blackouts in the past even when the energy network had enough total power to serve all consumers).

      Have a look at cascading failures [wikipedia.org].

    • by Gerzel (240421)

      No, no, it CAN happen. It might not be likely at any given moment, but it is well within the range of possibility, and that possibility grows larger every day that steps are not taken to minimize it.

      • OK, I agree, it can happen, but the chance is on a logarithmic scale. So a huge failure is unlikely, but it would be... well, huge. Data loss is inevitable, so why not put things in many areas at once? The 'earthquakes' are easier to deal with for the 'country', but the individual should invest in long-term 'earthquake control'. That's probably a HORRIBLE analogy, but it's all I could come up with.
        • by Gerzel (240421)

          I don't think it is even that small of a chance, certainly not on the logarithmic scale. Remember while the chance of failure for any one instance may be very low you have to take the chances over the whole range of instances.

          • But that's the point, it's layers of redundancy. You'd have to have failures across not just one center but ALL centers simultaneously. The chance you get of that is the chance of each one going down multiplied together (.1*.1*.1*n).
            • by Gerzel (240421)

              I thought the article mentioned that layers of redundancy were NOT being used.
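The multiplied-probabilities argument a few comments up rests on an independence assumption, which is exactly what the article questions. Both cases can be made concrete (the 0.1 failure chance is the comment's figure, purely illustrative):

```python
def all_fail_independent(p_fail: float, replicas: int) -> float:
    """All replicas down at once, if failures are uncorrelated."""
    return p_fail ** replicas

def all_fail_shared_dependency(p_fail: float) -> float:
    """With one shared underlying provider, its failure takes out
    every replica at once -- no multiplication happens."""
    return p_fail

# Independent: 0.1 * 0.1 * 0.1 = 0.001. Shared dependency: still 0.1.
independent = all_fail_independent(0.1, 3)
correlated = all_fail_shared_dependency(0.1)
```

The hundredfold gap between the two numbers is why "provably independent" alternatives matter so much in the diversification argument upthread.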

  • by stephanruby (542433) on Saturday June 09, 2012 @11:34PM (#40272045)

    The analogy the author uses doesn't work.

    A better analogy would be the airline industry. The airline industry likes to over-book airplane seats it may not have because it's always trying to optimize its profit-margin.

    The same will happen with cloud-services. Cloud-services will always try to optimize their own profit-margins, at the risk of triggering significant outages.

    And I don't see what this has to do with the financial crisis at all.

    • by pitchpipe (708843) on Sunday June 10, 2012 @12:02AM (#40272137)

      A better analogy would be the airline industry.

      I think a better analogy is the power grid. System hits a peak, one line goes down, others try to compensate becoming overloaded, another can't handle the load and goes down, and behold: cascading failures.

      • by Anonymous Coward

        Yep. The power grid is a good example -- especially the wet dreams of a 'smart grid'. Particularly since one of the conclusions from the 2003 blackout was that the management complexity of the power grid was a principal cause of the cascade -- i.e. the potential interactions of all the interconnects transcended our understanding. KISS is still the watchword of the day -- as Scotty once said, 'the more you complicate the plumbing, the easier it is to plug up the works'. Do the math sometime on the likelihood

    • by c0lo (1497653)

      And I don't see what this has to do with the financial crisis at all.

      The insurance/reinsurance and CDO [wikipedia.org] schemes in finance resemble "fail-over redundancy activation" in the cloud. Enough complexity and nobody can predict what can happen until it actually happens - see cascading failure [wikipedia.org]

    • by TubeSteak (669689) on Sunday June 10, 2012 @12:08AM (#40272159) Journal

      And I don't see what this has to do with the financial crisis at all.

      FTFA

      New cloud services may arise that essentially "resell, trade, or speculate on complex cocktails or 'derivatives' of more basic cloud resources and services, much like the modern financial and energy trading industries operate," he wrote.

      Each of these various cloud components are often maintained and deployed "by a single company that, for reasons of competition, shares as few details as possible about the internal operation of its services," Ford added.

      As a result, the cloud industry could find itself "yielding speculative bubbles and occasional large-scale failures, due to 'overly leveraged' composite cloud services" with weaknesses that don't become known "until the bubble bursts," Ford wrote.

      The metaphor more or less fits, except for the part that ignores how a lot of what happened during the financial crisis was outright fraud perpetrated by lenders.

      The potential mess with the cloud is not about fraud, just about excessive dependencies.

      • by Gerzel (240421)

        The biggest difference is that it is still somewhat easy for companies to balance themselves against the cloud by having their own hardware running.

        They don't need full capacity capabilities, but even a small amount of capability can keep their services up, if slowed, rather than a full crash mitigating costs when things do go wrong.

        Physical hardware also provides a way out of a service provider.

        Though physical hardware also requires physical staff but that is the downside of in-sourcing. The downside of o

        • "The biggest difference is that it is still somewhat easy for companies to balance themselves against the cloud by having their own hardware running."

          Regarding services, what's the real difference when using my own hardware? I think Amazon owns its own hardware too.

          "They don't need full capacity capabilities, but even a small amount of capability can keep their services up, if slowed"

          Slashdot effect? For so many services, if you can't go full capacity, you don't serve at lower speed, you just don't work, f

          • "The biggest difference is that it is still somewhat easy for companies to balance themselves against the cloud by having their own hardware running."

            Regarding services, what's the real difference when using my own hardware? I think Amazon owns its own hardware too.

            Megaupload also owned its own servers. And they are not the only cloud provider to have had hardware inappropriately seized. You do not have control of someone else's hardware.

              "Megaupload also owned its own servers. And they are not the only cloud provider to have had hardware inappropriately seized. You do not have control of someone else's hardware."

              Maybe, but that's not the point you are making with your example.

              As you already said, Megaupload *owned* their servers. If this case shows anything, it is that you can't control your own hardware either.

              • No - you can't control when the government seizes your hardware. But the point here is that if you use the 'cloud' as your storage or sole backup, you're trusting your data to another service.

                Unlike Megaupload, which stored other people's data and allowed 'sharing', if you 'own' your own servers, you control the data and the backups, and can write backups to out-of-state servers you also own....

    • by plover (150551) * on Sunday June 10, 2012 @12:39AM (#40272227) Homepage Journal

      I think by "financial crisis" he meant "a minor market crash due to autotrading algorithms", and not the real crisis being caused by thieves running trillion dollar banking, mortgage, and insurance scams.

      The point is "if you use similar automated response strategies as a large set of other similar entities, you could all suffer the same fate from a common cause."

      Supposedly a market crash was triggered by autotrading algorithms that all tended to do exactly the same thing in the same situations. So when the price of oil shot up (or whatever the trigger was) then all those algorithms said "sell". As all the sell orders came in, the market average dropped, and the next set of algorithms said "sell moar". So there was a cascade because so many systems had identical responses to the same negative stimulus. Think of those automated trades as being akin to a "failover" IT system: if host X is failing, automatically shift my service load this way.

      So that's the analogy the author is trying to make with respect to systems that depend on automated recovery machinery like load balancers: if response time is too high at hosting vendor X, my automated strategy is to failover to hosting vendor Y. And perhaps 500 large sites all have a similar strategy. Now let's say that vendor X suffers a DDoS attack because they host some site that pissed off Anonymous. So now all these customer load balancers see the traffic slowing down at X, and they simultaneously reroute all app traffic to vendor Y in response. Vendor Y then gets hammered due to the new load, and the load balancers shift the traffic elsewhere. Now two main hosting providers are down while they try to clean up the messes, and the several smaller providers are seeing much bigger customers than usual using them as tertiary providers, and they start straining under the load as well, causing their other clients to automatically shift.

      And if that isn't exactly what plays out next year, might not something similar happen with payment gateways, or edge content delivery systems, or advertising providers?

      It's a cascade of failures due to automated responses that's remarkably similar to the electrical grid overloads that caused the northeast coast blackout in 2003. The author's point is "we don't know precisely what bad thing might happen within this particular ecosystem, but there is significant risk because we've seen complex interdependent systems have similar failures before."
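The failover cascade described in this comment can be captured as a toy simulation (all numbers invented for illustration; real load balancers and provider capacities are far more nuanced):

```python
def simulate_cascade(capacities, loads, first_failure):
    """Toy model of correlated failover: a failed provider's load is split
    evenly across surviving providers; any survivor pushed past capacity
    fails too, and the process repeats until it settles.
    Returns the set of failed provider indexes."""
    failed = {first_failure}
    while True:
        survivors = [i for i in range(len(loads)) if i not in failed]
        if not survivors:
            return failed
        orphaned_share = sum(loads[i] for i in failed) / len(survivors)
        newly_failed = {i for i in survivors
                        if loads[i] + orphaned_share > capacities[i]}
        if not newly_failed:
            return failed
        failed |= newly_failed

# Three providers each running at 70% of capacity: one DDoS'd provider
# takes down all three, because no survivor can absorb half its load.
assert simulate_cascade([100, 100, 100], [70, 70, 70], 0) == {0, 1, 2}

# With enough headroom, the same failover strategy absorbs the hit.
assert simulate_cascade([100, 100, 100], [40, 40, 40], 0) == {0}
```

The two cases show the knife edge the comment describes: identical automated responses are safe or catastrophic depending only on how much slack the whole ecosystem is carrying.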

      • by khallow (566160)

        The point is "if you use similar automated response strategies as a large set of other similar entities, you could all suffer the same fate from a common cause."

        Systemic risks like this were also present in the real crisis. I think the primary problem here is simply that the risks aren't well understood and that users and suppliers of cloud services are likely to make unwarranted assumptions (or even warranted assumptions that get invalidated when the infrastructure gets stressed in certain ways). There's also the possibility for tragedy-of-the-commons problems on a global scale. For example, if requests for DNS (it's a bit low level for a cloud service, just using a c

      • I think by "financial crisis" he meant "a minor market crash due to autotrading algorithms", and not the real crisis being caused by thieves running trillion dollar banking, mortgage, and insurance scams.

        You're right about the cascading failure but wrong about the financial crisis. The larger financial crisis was a crisis because banks had circular loans and insurance on one another. So if one bank failed it would suddenly stress the next bank to the point of failure and bankruptcy which would trigger another bank to fail and so on and so forth. What we had in 2008 was a cascading financial failure because everybody was insuring everybody else assuming that everybody wouldn't fail simultaneously.

        • by TubeSteak (669689)

          You and the GP are both incorrect.
          TFA did not mean "a minor market crash due to autotrading algorithms," which the GP would know if they had RTFA.

          And the larger financial crisis was not about circular loans. It was about 5 banks that were wildly overleveraged*
          when the housing bubble popped and their losses were magnified between 30:1 and 40:1 instead of the industry standard 12:1.
          This disaster devalued their housing holdings, which devalued everyone else's housing holdings, which fucked everything else, eve

          • by plover (150551) *

            The autotrading event was the trigger, but not the cause of the disaster. By itself, the autotrading crash would have been a minor event. The root cause of the disaster was the thieving bank scams, all of them together, including the housing market, the overextended banks, the deregulated investments made by the insurance companies, the insanely complex derivatives that spat out profits but had ultra high risks built in, all of that together was the real cause.

            If you have a barrel full of gunpowder, and y

  • by mcrbids (148650) on Saturday June 09, 2012 @11:39PM (#40272069) Journal

    Efficiency normally comes with economies of scale. As a partner in an outsourced vertical software company, we have hundreds of clients running in our highly tuned hosting cluster, and are able to bring economies of scale to an otherwise ridiculously expensive software niche. Yes, that means that if we have an outage, all of our clients experience an outage as well.

    However, we have carefully laid plans for multiple recovery points in a disaster scenario, (Plan B, Plan C, Plan D, etc) and have maintained an uptime significantly better than our clients would typically attain if left to their own devices. We easily manage close to 4 nines of uptime in an industry where the average is realistically around 2 nines. (having "the computer is down" a day or two every year or so is typical)

    Although the Internet is a "network of ends" the truth is that not all ends are created equal. Having a high quality, high speed (100 Mb), reliable (99.99%+) Internet feed in my small-ish hometown of around 80,000 people is ridiculously expensive. But in a nearby city (500,000 people, two hours' drive away) we host our servers in a tier 1 colo at 1/10th the cost of running it all ourselves, with dramatically improved reliability and network performance.

    Yes, putting all your eggs in one basket means that if that basket fails, you lose all your eggs. But it also makes it easy to buy just one, really nice basket that won't break and lose your eggs.
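The "nines" in this comment translate into yearly downtime budgets like so (a sketch; exact SLA accounting varies by contract):

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years

def downtime_hours_per_year(nines: int) -> float:
    """Allowed downtime for 'n nines' of availability (99%, 99.9%, ...)."""
    return 10 ** -nines * HOURS_PER_YEAR

# Two nines allows ~87.6 hours/year -- "a day or two" of outages, matching
# the comment's estimate. Four nines allows under an hour (~53 minutes).
two_nines = downtime_hours_per_year(2)
four_nines = downtime_hours_per_year(4)
```

Each extra nine cuts the downtime budget by a factor of ten, which is why the gap between "industry average" and "close to 4 nines" is so large in practice.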

    • Yes, putting all your eggs in one basket means that if that basket fails, you lose all your eggs. But it also makes it easy to buy just one, really nice basket that won't break and lose your eggs.

      That's a great analogy until someone throws a weasel in your well-crafted basket.

    • Efficiency normally comes with economies of scale. As a partner in an outsourced vertical software company, we have hundreds of clients running in our highly tuned hosting cluster, and are able to bring economies of scale to an otherwise ridiculously expensive software niche. Yes, that means that if we have an outage, all of our clients experience an outage as well.

      Your post implies that you have a single location. How do your customers feel about that - if they're even aware of it?

  • by Anonymous Coward

    Never turns on its makers. Never. This story is bullshit. Technology is a tool. I treat it like a tool. I control it.

    Now, who's up for another drink?

  • by Karmashock (2415832) on Sunday June 10, 2012 @12:02AM (#40272135)

    Systems need to be compartmentalized or have redundancies built into them.

    For example, I have several systems that send automated emails. I've had problems in the past with given email servers not accepting or sending messages. It's uncommon, but it happens, and it's not acceptable. These are mission-critical systems. They can't fail.

    Solution? Redundancy up the wazoo. The way it's set up now so many things would all have to happen at the exact same moment that the only way the system is likely to fail is if we fight world war 3... and lose.

    That is how you solve this problem. Don't rely on any one system. Rely on all of them. Once you figure out how to integrate one of them it's typically easier to integrate the rest. The virtues of this approach are manifest. Not just stability but if the services do processing or data retrieval you can cross reference them to find errors in databases or get a more complete data set than exists in any one source.

    I mean is google or bing the best search engine? What about both at the same time?

    • by Anonymous Coward

      Umm, no. Doing what you say will result in the catastrophic failure. You and X percentage of companies are on CloudA. One of those companies gets a massive DDoS attack and CloudA can't handle it along with their normal loads (they oversell just like net providers, both landline and wireless telephone services, and airlines). CloudA goes down (or it forwards to CloudB). You and some of X move their loads to another CloudB. CloudB now has way more load than it expected. They've never had so much of CloudA'

      • Why would a DDoS attack cause a chain reaction?

        First off, the cloud is especially resistant to DDoS attacks. Ask Amazon. They've designed their systems specifically to reduce the effectiveness of that sort of attack. And as systems become larger they become harder to hit with DDoS attacks. You might as well try to DDoS a root DNS server. Have fun with that.

        Furthermore, why would ALL systems route to the same alternate cloud provider? Rather than everyone going from A to B what you'd actually see is some goi

    • Redundancy in the common sense would only solve hardware issues. You can have multiple servers, multiple network connections, multiple power solutions. What about software? If you run the same mailing software on all servers, a bug or vulnerability in the mailer would happily bring down all your servers. Same with bugs and vulnerabilities in operating systems running on the servers, and other software. Unless you run several different mailer systems, operating systems, etc, and they all synchronize between them

      • Why can't I use several competing cloud systems that do the same thing? What they're talking about here is powerful cloud systems that depend on each other. So if one cloud goes down it causes a chain reaction of failure.

        But if every system can use two or three different sources for everything then it doesn't need any specific cloud to be running so long as most of them are running.

        • Because at some point the ROI isn't there. It's a common problem actually. Everybody knows how to make things redundant - triply, quadruply, etc. The problem is that no one is willing to pay for that kind of redundancy. The business doesn't, the clients don't, and you sure as hell aren't paying for it out of your own pocket. So you rely on failover mechanisms that are generally doubly redundant, or at least that rely on a large number of inexpensive machines. On top of that, you craft as clever a process as

          • It's all about how you design it. It isn't just redundancy; it's compartmentalization. Given that systems are going to fail, you need to set it up so they can fail totally without affecting anything else. The redundancy, especially in an enterprise organization, is a requirement.

            A lot of people looked at the cloud as a way to save a lot of money on computers etc. For mission critical applications that's the wrong attitude. Instead, you should look at it as an opportunity to make the system more robust. Take th

    • by ultranova (717540)

      For example, I have several systems that send automated emails.

      Didn't those switch to globally distributed clouds years ago? Ones composed mostly of unpatched Windows machines, if I understood correctly.

      • I can't speak to what remote systems outside my control are doing. But we could tolerate 95 percent failure and still operate at 100 percent efficiency.

        As I said, we have a great deal of redundancy built into it.

        We've seen failure rates as high as 40 percent. Only for an hour or so... It had no effect on us.

  • by Dan667 (564390) on Sunday June 10, 2012 @12:05AM (#40272147)
    I think it is funny that lessons learned years ago with mainframes are being presented as new by just changing the word mainframe to cloud.
    • I think it is funny that lessons learned years ago with mainframes are being presented as new by just changing the word mainframe to cloud.

      I wish I could moderate you higher than +5. I got into computing in the mid-'80s, when home computers were popular, and never had to deal with mainframes. However, I know enough about computing history to see exactly how absurd this entire "cloud" computing fiasco is becoming. And it is going to exactly follow the same curve that mainframes followed, until "suddenly" the concept of having your own computing resources is going to be "new and exciting" again.

      P.T. Barnum would be proud.

      • So you worked at an Application Service Provider (ASP) at the turn of the millennium too?

        I think this is why the 40+ crowd has trouble getting work in some of the "New Internet" businesses. We were at the old Internet businesses last time and remember how the Kool-Aid tasted then.
  • In other words (Score:2, Insightful)

    by Anonymous Coward

    Unmanaged systems are hard to manage.

  • by mikes.song (830361)

    Ford compared this scenario to the intertwining, complex relationships and structures that helped contribute to the global financial crisis.

    Cloud computing is like fractional reserve accounting, with artificially low interest rates?

  • by dbIII (701233) on Sunday June 10, 2012 @12:47AM (#40272259)
    It's a leap year, February 28, and all over the world, completely out of the blue (or azure if you prefer) cloud clusters crash as the local clocks swing around to midnight, then stay down all day.
    Still, it's three nines of uptime when it's spread out over a few years :)

    A highly interdependent system is only as reliable as the QC on the weakest link. Who would have thought that somebody from a company that had a lot of embarrassing press about a leap year stuffup would make such a stupid and obvious mistake four years later? That's the cloud, where even the biggest names still don't care anywhere near as much as you would about your own systems and so don't pay enough attention to detail.
  • Jargon, jargon, jargon, jargon. Jargon.
  • Ford compared this scenario to the intertwining, complex relationships and structures that helped contribute to the global financial crisis."

    The difference being, of course, that the global financial crisis was the product of the abysmal greed of speculators and the stupidity of venal governments borrowing from private banks instead of doing the right thing and being directly responsible for the creation of money.

    But other than that, sure it's just like it.

    (/snark)

  • Using a public cloud seems sensible for low risk projects, or one off, large scale computations. The security and availability risk would suggest that anyone using the cloud for their entire infrastructure has either read too many brochures, or is about to do something else crazy, like divest their entire original business, and then hike service charges.

  • Researcher Observes Cloud Interactions, Predicts Lightning
  • Seriously. I don't get why this same description doesn't apply to the internet itself, a thing known to work reliably?

    Don't MAKE me RTFA.

  • the 'global financial crisis' was caused directly by massive fraud and profiteering. is there any incentive for cloud companies to create massive quantities of products that are completely worthless and sell them to sucker investors?

    • by lennier (44736)

      the 'global financial crisis' was caused directly by massive fraud and profiteering. is there any incentive for cloud companies to create massive quantities of products that are completely worthless and sell them to sucker investors?

      Um, is that a trick question?

      If you're selling something - whether it's investment, insurance, public key certificates or data backup - where the buyer can't directly measure the quality of the product, of course there's incentive for fraud.

      Here, just upload your data to Dev Null Industries quad-cached completely tamperproof server. No, your data isn't encrypted. No, you can't have it back all at once. No, we won't peek, honest, and of course we'd never sell your spreadsheets to your competitors. Seri

  • Just look at the periodic reddit meltdowns.

  • Complexity is rising in all things at a frightening rate, not just technology. Over my lifetime the amount of information required to make any decision has become massive. For instance, can your select the "best" cellphone for you today? Which credit card? Car? Checking account? There is a coming "complexity collapse." What it will look like, or what the consequences will be is hard to project, but there cannot be an infinite rise in complexity in our lives without something painful happening eventually. Wi
  • ...it would be a "storm" in the cloud
