Why Is Less Than 99.9% Uptime Acceptable?
Ian Lamont writes "Telcos, ISPs, mobile phone companies and other communication service providers are known for their complex pricing plans and creative attempts to give less for more. But Larry Borsato asks why we as customers are willing to put up with anything less than 99.999% uptime. That's the gold standard, and one that we are used to thanks to regulated telephone service. When it comes to mobile phone service, cable TV, and Internet access, service interruptions are the norm — and everyone seems willing to grin and bear it: 'We're so used to cable and satellite television reception problems that we don't even notice them anymore. We know that many of our emails never reach their destination. Mobile phone companies compare who has the fewest dropped calls (after decades of mobile phones, why do we even still have dropped calls?). And the ubiquitous BlackBerry, which is a mission-critical device for millions, has experienced mass outages several times this month. All of these services are unregulated, which means there are no demands on reliability, other than what the marketplace demands.' So here's the question for you: why does the marketplace demand so little when it comes to these services?"
It's a market-wide problem. (Score:2, Informative)
As consumers, we're made to feel helpless. The worst we can do (without litigation) to a company is complain or refuse to use their services, but what harm can that do to a giant conglomerate? And in situations in which one company has a monopoly in a certain area of the country, for example, consumers may not have the ability to switch or do without.
As a personal example, Comcast owes me a refund check for Internet services I canceled six months ago. If I, as a consumer, had allowed my debt to go unpaid for that long, my account would have been sent to collections long ago. But the problem is that most of the power--with the economics of the situation, with politicians, and so on--lies on one side of the table, and that power ain't with the consumer.
Re:because its ridiculous (Score:3, Informative)
Clearly, a ridiculous number.
Re:Costs increase geometrically (Score:5, Informative)
This is what each uptime figure works out to:

Uptime (%)   Downtime per year
90%          876 hours (36.5 days)
95%          438 hours (18.25 days)
99%          87.6 hours (3.65 days)
99.9%        8.76 hours
99.99%       52.56 minutes
99.999%      5.256 minutes
99.9999%     31.536 seconds
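The table is a single multiplication; here's a quick sketch (plain Python, assuming nothing beyond a 365-day year) that reproduces it:

```python
HOURS_PER_YEAR = 365 * 24  # 8760, as in the table above

def downtime_per_year(uptime_percent):
    """Annual downtime in hours implied by an uptime percentage."""
    return HOURS_PER_YEAR * (1 - uptime_percent / 100)

for pct in (90, 95, 99, 99.9, 99.99, 99.999, 99.9999):
    h = downtime_per_year(pct)
    if h >= 1:
        print(f"{pct}% -> {h:g} hours")
    elif h * 60 >= 1:
        print(f"{pct}% -> {h * 60:g} minutes")
    else:
        print(f"{pct}% -> {h * 3600:g} seconds")
```

Note how each extra nine cuts the budget by a factor of ten: the jump from 99.9% to 99.999% is the jump from "an evening of downtime" to "barely enough time to notice."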
I work for a software shop that can do high availability, but more often than not, folks choose to lower the uptime expectation rather than pony up the stupid money it takes for the hardware, software, and infrastructure to get there. Most companies know the customer will not pay the extra cash for the uptime, thus... you get what you pay for.
Because it's not necessary? (Score:4, Informative)
The telephone as we know it was the first genuinely instantaneous, worldwide communications medium that anyone could use; it was seen as a necessary component of national security during the Cold War, and it was built out as such. We've had over a century to perfect it, and vast amounts of money were spent doing so. Despite its origins at DARPA, the Internet as we know it today, although more useful, is by and large less of a basic need, is far more complex, and large portions of it are still built on top of the telephone infrastructure besides.
I can't help but think that most people understand this sort of thing, and understand that bringing such modern conveniences up to five nines of reliability is difficult and expensive, and people have evidently decided that a certain tradeoff to make such things affordable isn't out of line.
The shorter, more pessimistic version of this is probably, "It's cheaper to suck."
The profits (Score:2, Informative)
The government didn't always regulate phone companies the way it eventually did; the landmark intervention came in 1984, when AT&T was broken up for having become too powerful. But AT&T became so powerful because it did a hell of an awesome thing with its network: it realized that better service equals more customers and more revenue. I recall hearing a story from a Bell Labs alum that they had a goal of handling annual peak call volume on the busiest day of the year (Mother's Day). The day was worth $24 million in phone charges to them. They spent $5 million on each of two different hardware architecture projects to get the system up and running to support the day. The monolithic centralized architecture failed, but the distributed architecture (spreading the communications through 10-15 national "hubs") worked. The system was a success, and AT&T got to enjoy its lunch by servicing its customers the way a business ought to.
For data networks, there is simply too much clutter and competition to be able to rein in 99.999% rates of performance. We should be happy to get 99.9% from the mishmash of hardware running the routers and OSes which power the Internet.
Re:Five 9's is impossible! (Score:3, Informative)
It doesn't mean that each and every individual phone will be up 99.999% of the time; it means that the system as a whole will be up 99.999% of the time.
It's quite possible for an entire town to be down for an entire year and still meet this criterion.
Yet modern cell operators STILL cannot come close.
Just one point ... (Score:4, Informative)
Of course, when you don't have transmitters with overlapping coverage, this doesn't work.
Re:because they've been conditioned (Score:3, Informative)
If something used to work and suddenly stopped working, chances are, a reboot will solve it. Obvious exceptions are that the problem could be caused by faulty hardware or a full disk. Those are usually easy to spot.
Reality check (Score:3, Informative)
Let's look at the numbers: 99.9% uptime translates to about 9 hours of unscheduled downtime (USD) a year. That can be one 9-hour block once a year, about 1.5 minutes per day, 3.6 seconds per hour, 60 milliseconds per minute, or one dropped packet per thousand. Sure, it's easy to spot a 9-hour blackout, but as the slices of downtime get thinner, they get harder to notice at all, or to identify as USD specifically.
99.999% uptime translates to about 5 minutes of USD per year, and is of questionable value. You can't identify a network outage, call in a complaint, and get the issue resolved in the given timeframe. 99.9999%? It is to laugh. You can't even look up the tech support phone number without blowing your downtime budget for the year. Get hit by a rolling blackout for an hour? Kiss your downtime budget goodbye for the next 120 years.
Getting back to 99.9% uptime, let's move on to standard utilization patterns. USD really only becomes an issue if people notice it.
If we have 2 seconds of usage and 2 seconds of downtime per minute, the odds of a collision are around 15:1 against, with an average overlap of 1 second when a collision does happen. Simply interleaving usage and downtime that way increases the perceived uptime by an order of magnitude, since roughly 90% of the outages happen when no one is actually using the network. And larger blocks of downtime get lost in larger blocks of non-utilization in exactly the same way.
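Those odds are easy to sanity-check with a quick Monte Carlo sketch. The setup here is illustrative, not a model of real traffic: one 2-second usage burst and one 2-second outage, each starting at a uniformly random point within a minute:

```python
import random

def overlap(a_start, b_start, dur=2.0):
    """Seconds of overlap between two dur-second windows."""
    return max(0.0, min(a_start + dur, b_start + dur) - max(a_start, b_start))

def simulate(trials=100_000, minute=60.0, dur=2.0, seed=1):
    """Estimate collision probability and mean overlap for random windows."""
    rng = random.Random(seed)
    collisions = 0
    total_overlap = 0.0
    for _ in range(trials):
        use = rng.uniform(0, minute - dur)   # 2 s of usage, random start
        down = rng.uniform(0, minute - dur)  # 2 s of downtime, random start
        ov = overlap(use, down, dur)
        if ov > 0:
            collisions += 1
            total_overlap += ov
    return collisions / trials, total_overlap / max(collisions, 1)

p, mean_ov = simulate()
print(f"collision probability ~ {p:.3f} (about 1 in {1 / p:.0f})")
print(f"mean overlap when colliding ~ {mean_ov:.2f} s")
```

The simulation lands near a 1-in-15 collision rate with about a second of overlap, which matches the back-of-envelope figures above.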
Granted, if you have higher utilization you'll have a better chance of hitting a chunk of downtime, but you'll also have higher chances of queuing latency within your own use patterns. If you're already using 99% of your bandwidth, you can't just plunk in one more job and expect it to run immediately. It has to wait for that 1% of space no one else is currently using. And when you get to that point, it's really time to consider buying a bigger pipe anyway.
And that brings us to the main point: People don't buy network connectivity in absolute terms. They buy capacity, and the capacity they buy is scaled to what they think of as acceptable peak usage. "Acceptable peak usage" is a subjective thing, and nobody makes subjective judgements with 99.999% precision.
Re:The way it has always been (Score:3, Informative)
Having been on the other end of these types of calls, I can say this sort of thing is *very* annoying. People call all the time with the expectation that because they do five or six thousand dollars' worth of business in a day, the ISP is somehow responsible for those thousands of dollars when some idiot Verizon contractor accidentally cuts our cables. Other reasons for outages include power loss, fire, flood, exploding transformers, telephone pole collapse, and many other issues outside of our control.
If you want guaranteed uptime, get it in your contract and be prepared to pay for it! Otherwise, we'll do the best we can to provide service at the funding level we receive, and will gladly refund the 59c worth of service that you would have paid for a 6-hour outage.
Would you expect Ford to pay you for lost wages when someone hits your car? Would you expect your grocery store to pay for your chihuahua when it starves to death because the store is out of dog food?
Introducing the EULA (Score:5, Informative)
After all, liability plays a large part in defining QA policies. If software companies were held to the same liability standards most product manufacturers face, I'd bet software development would be more of the engineering practice it should be.
To quote part of Microsoft's EULA for Windows XP.
http://www.microsoft.com/windowsxp/home/eula.mspx [microsoft.com]
ALSO, THERE IS NO WARRANTY OR CONDITION OF TITLE, QUIET ENJOYMENT, QUIET POSSESSION, CORRESPONDENCE TO DESCRIPTION OR NON-INFRINGEMENT WITH REGARD TO THE SOFTWARE.
Re:because they've been conditioned (Score:3, Informative)
This has _zero_ to do with the architecture and everything to do with the user. Linux would be (and is) treated in the same way in similar situations.
This is simply not true. Anyone who's ever installed software or run Windows Update knows that rebooting is a very likely part of the process. The dependencies and non-modular approach of Windows are quite apparent. Software vendors say "just reboot" because of all the complexities and dependencies within Windows.
The same simply isn't true for Linux. Replace a critical shared library? No problem, running programs still have a hook to the old version. Any new process that starts will get the new version of the library. Why reload the whole damn OS when restarting a process will do the same thing?
You can be assured the people running those BBSes were far less likely to have the "just reboot" mentality.
You're trying to tell me with a straight face that the BBS market influenced Microsoft? (Which flies in the face of what we've all experienced with Windows).
Further, the other reason most people have that attitude is because to them a computer is just another appliance.
No, the reason people have this attitude is because it freaking works.
Incidentally, there are numerous classes of problems on Linux (and UNIX in general) which are more quickly and easily "fixed" with a reboot.
I've been administering Linux machines for 13+ years. I can count on one hand the number of times a reboot solved any problem. The only class of problem it solves is a kernel bug, or the kernel crashing (usually from a hardware problem).
I can't even remember the last time I had to reboot any of my Windows machines without a good reason (eg: patching).
Why would anyone reboot without a "good reason"? The point is that Linux simply has fewer "good reasons" and requires fewer reboots. Linux requires FAR fewer reboots for patching.
Finally, there's nothing wrong with rebooting _anyway_. If your service uptime requirements are affected by a single machine rebooting, your architecture is broken.
Wow. Now I know you've really drunk the Microsoft kool-aid. Not everyone can afford multiple-machine redundancy just to work around the endemic problems of Microsoft, who advocate "Just reboot!" as the fix for so many problems. There's really no reason I should need to reboot just to update what are essentially new versions of some DLLs. The Microsoft architecture is essentially broken if you have to buy another damn machine for the SOLE purpose of maintaining high availability.
Re:At what price? (Score:2, Informative)
Good, Fast, Cheap: Pick any two (you can't have all three)
People want high reliability, but they're not prepared to pay for it. If they _are_ prepared to pay more money, they miss the point that unless they spend a LOT more money, they'll only increase one of Good (aka reliable) or Fast, not both.
Re:because they've been conditioned (Score:5, Informative)
I work for a cable company, by the way. I design a lot of the build-out of our system, so I know the actual costs associated with creating that kind of reliability. Whenever someone needs it, I actually recommend getting a second ISP as a low-speed backup. It is the only smart way to get real reliability, as pretty much any company advertising 99.999% reliability in this area is outright lying to the customer. (I know this from experience: I have switched customers over to our ISP after week-long (or longer) outages of every other ISP here, and there are quite a few.) Besides, a good router will split bandwidth between the ISPs so you're not paying for something you're not using (often called "bonding").
I'm still amazed when people yell at me about being offline for a few hours after maybe 3, 4, 5 years of uptime. They say they are losing thousands of dollars for every day they're offline, yet they don't want to pay for a $40 roll-over backup. THESE are the vast majority of customers who complain so much about 99.999% uptime.
On another note, I think claims of 99.999% on POTS are anecdotal. Growing up, I had my power cut out at least twice a year, and the phone system was hardly 99.999%. Trees fall on lines, and people cut buried lines for all sorts of accidental reasons. Just as you insure anything of enough value, just as you back up data in multiple locations, you need a fallback plan if your ISP goes out, if it means that much to you.
Re:because they've been conditioned (Score:3, Informative)
http://www.joelonsoftware.com/items/2008/01/22.html [joelonsoftware.com]
No, it does exist (Score:3, Informative)
The problem is, of course, going for that can be really expensive. Not only does the system itself have to have a bunch of redundancy, but so does everything supporting it. For example in the case of a web server you'd not only have to have multiple boxes running that, but multiple power connections, generators, network connections, ISPs, etc.
Doing something like that, you can offer essentially 100% uptime, barring a catastrophic event (and face it, any amount of uptime can be ruined by a sufficiently large event). However, it is extremely costly, and of course everything has to be well designed because, as you noted, if you fuck up anywhere, you've got 30 seconds to fix it.
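As a rough sketch of why all that redundancy buys so much, here's the standard back-of-envelope formula. The big assumption is that the redundant paths fail independently, which shared power, fiber, or a sufficiently large event will violate:

```python
def combined_availability(a, n):
    """Availability of n redundant, independent units, each with availability a.
    The system is down only when all n units are down at the same time."""
    return 1 - (1 - a) ** n

# Two fully independent 99% paths already look like four nines -- on paper.
for n in (1, 2, 3):
    print(f"{n} x 99% units -> {combined_availability(0.99, n):.6f}")
```

Each independent duplicate multiplies the unavailability by (1 - a), which is why the redundancy has to extend to power, network, and ISP; one shared dependency collapses the whole product back to a single term.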
Or you can just do what the voice guys like to do: change the rules. For them, the system is "up" so long as there is at least one phone line that can place a call to at least one other phone line. By that standard, the voice switch on campus has never been down. Of course, that isn't a particularly useful standard, if you ask me.
On redundancy (Score:4, Informative)
In the entire history of electromechanical switching in the Bell System, no central office was ever out of action for more than thirty minutes for any reason other than a natural disaster. On the other hand, step-by-step (Strowger) switches failed to connect about 1% of calls correctly, and crossbar reduced that to about 0.1%. With electronic switching, the failure rate is higher but the error rate is much lower.
This reflects the fact that, in the electromechanical era, the hardware reliability was low enough that the system had to be designed to have a higher reliability than any of its individual units. In the computer era, the component reliability is so high that good error rates can be achieved without redundancy. This is why computer-based networks tend to have common mode failures.
If you're involved in designing highly reliable systems, it's worth understanding how Number 5 Crossbar worked. Here's an oversimplified version.
The biggest components of Number 5 Crossbar were the crossbar switches themselves. Think of them as 10x10 matrices of contacts which could be X/Y addressed and set or cleared. Failure of an entire crossbar switch could take down only a few lines, and they usually failed one row or column at a time, taking down at most one line.
The crossbars had no smarts of their own; they were told what to do by "markers", the smart part of the central office. Each marker could set up or tear down a call in about 100ms. Markers were duplicated, with half of the marker checking the other half. If the halves disagreed, the transaction aborted. Each central office had multiple markers (not that many, maybe ten in an office with 10,000 lines), and markers were assigned randomly to process calls.
When a phone went off hook, a marker was notified, and set up a "call" to some free "originating register", the unit that understood dial pulses and provided dial tone. The marker was then released, while the user dialed. The originating register received the input dial info, and when its logic detected a complete number, it requested a random marker, and sent the number. The marker set up the call, set and locked in the correct contacts in the crossbars, and was released to do other work.
If the marker failed to set up the call successfully (there was a timeout around 500ms), the originating register got back a fail, and retried, once. One retry is a huge win: if there's a 1% fail rate on the first try, there's a 0.01% fail rate with two tries. This little trick alone made crossbar systems appear very reliable. There's much to be said for doing one retry on anything which might fail transiently. If the retry also fails, unit-level retry as a strategy probably isn't working, and the problem needs to be kicked up a level.
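The retry arithmetic checks out in a quick simulation. Everything here is illustrative: the function names and the 1% transient fault rate are stand-ins, not the actual marker logic:

```python
import random

def attempt_call_setup(fail_rate, rng):
    """One marker attempt; True on success. fail_rate models transient faults."""
    return rng.random() >= fail_rate

def setup_with_one_retry(fail_rate, rng):
    """The originating register's strategy: try once, retry once, then escalate."""
    if attempt_call_setup(fail_rate, rng):
        return True
    return attempt_call_setup(fail_rate, rng)  # the single retry

rng = random.Random(42)
trials = 1_000_000
failures = sum(not setup_with_one_retry(0.01, rng) for _ in range(trials))
print(f"failure rate with one retry: {failures / trials:.6f}")  # ~ 0.0001
```

The catch, as the comment notes, is that this only works for *transient* faults: if the failures are correlated (a dead unit, a persistent overload), retrying just doubles the load.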
The pattern of requesting resources from a pool at random was continued throughout the system. Trunks (to other central offices), senders (for sending call data to the next switch), translators (for converting phone numbers into routes), billing punches (for logging call data), and trouble punches (for logging faults) were all assigned on a random, or in some cases a cyclic rotation basis. Units that were busy, faulted, or physically removed for maintenance were just skipped.
That's how the Bell System achieved such good reliability with devices that had moving parts.
Note that this isn't a "switch to backup" strategy. The distribution of work amongst units is part of normal operation, constantly being exercised. So handling a failure doesn't involve special cases. Failures cost you some system capacity, but don't take the whole system down.
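As a sketch of that pattern (illustrative names and states; not the actual Bell System logic), random assignment from a pool that simply skips busy or faulted units looks something like this:

```python
import random

class UnitPool:
    """Markers, trunks, senders, etc.: callers request a unit at random from
    the pool, and busy or faulted units are simply skipped. Failures cost
    capacity, not availability -- until the pool is empty."""

    def __init__(self, names):
        self.units = {name: "ok" for name in names}

    def mark(self, name, state):
        """Set a unit's state: "ok", "busy", or "faulted"."""
        self.units[name] = state

    def acquire(self, rng=random):
        """Pick a random healthy unit; raise only when none remain."""
        available = [n for n, s in self.units.items() if s == "ok"]
        if not available:
            raise RuntimeError("no units available: capacity exhausted")
        return rng.choice(available)

pool = UnitPool([f"marker-{i}" for i in range(10)])
pool.mark("marker-3", "faulted")  # pulled for maintenance; simply never chosen
print(pool.acquire())             # any of the nine healthy markers
```

Because every request exercises the selection path, there is no untested "failover" branch to discover broken at the worst possible moment, which is exactly the point made above.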
We need more of that in the Internet. Some (not all) load balancers for web sites work like this. Some (but not all) packet switches work like this. Think about how you can use that pattern in your own work. It worked for more than half a century for the Bell System.
Zero dropped calls? (Score:2, Informative)