Why Is Less Than 99.9% Uptime Acceptable?
Ian Lamont writes "Telcos, ISPs, mobile phone companies and other communication service providers are known for their complex pricing plans and creative attempts to give less for more. But Larry Borsato asks why we as customers are willing to put up with anything less than 99.999% uptime. That's the gold standard, and one that we are used to thanks to regulated telephone service. When it comes to mobile phone service, cable TV, and Internet access, service interruptions are the norm — and everyone seems willing to grin and bear it: 'We're so used to cable and satellite television reception problems that we don't even notice them anymore. We know that many of our emails never reach their destination. Mobile phone companies compare who has the fewest dropped calls (after decades of mobile phones, why do we even still have dropped calls?). And the ubiquitous BlackBerry, which is a mission-critical device for millions, has experienced mass outages several times this month. All of these services are unregulated, which means there are no demands on reliability, other than what the marketplace demands.' So here's the question for you: why does the marketplace demand so little when it comes to these services?"
It's a market-wide problem. (Score:2, Informative)
As consumers, we're made to feel helpless. The worst we can do (without litigation) to a company is complain or refuse to use their services, but what harm can that do to a giant conglomerate? And in situations in which one company has a monopoly in a certain area of the country, for example, consumers may not have the ability to switch or do without.
As a personal example, Comcast owes me a refund check for Internet services I canceled six months ago. If I, as a consumer, had allowed my debt to go unpaid for that long, my account would have been sent to collections long ago. But the problem is that most of the power--with the economics of the situation, with politicians, and so on--lies on one side of the table, and that power ain't with the consumer.
Re:because its ridiculous (Score:3, Informative)
Clearly, a ridiculous number.
Re:Costs increase geometrically (Score:5, Informative)
This is what each uptime figure works out to:

Uptime (%)   Downtime per year
90%          876 hours (36.5 days)
95%          438 hours (18.25 days)
99%          87.6 hours (3.65 days)
99.9%        8.76 hours
99.99%       52.56 minutes
99.999%      5.256 minutes
99.9999%     31.536 seconds
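The table is a single multiplication; here's a quick sketch (plain Python, assuming nothing beyond a 365-day year) that reproduces it:

```python
HOURS_PER_YEAR = 365 * 24  # 8760, as in the table above

def downtime_per_year(uptime_percent):
    """Annual downtime in hours implied by an uptime percentage."""
    return HOURS_PER_YEAR * (1 - uptime_percent / 100)

for pct in (90, 95, 99, 99.9, 99.99, 99.999, 99.9999):
    h = downtime_per_year(pct)
    if h >= 1:
        print(f"{pct}% -> {h:g} hours")
    elif h * 60 >= 1:
        print(f"{pct}% -> {h * 60:g} minutes")
    else:
        print(f"{pct}% -> {h * 3600:g} seconds")
```

Note how each extra nine cuts the budget by a factor of ten: the jump from 99.9% to 99.999% is the jump from "an evening of downtime" to "barely enough time to notice."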
I work for a software shop that can do high availability, but more often than not, folks choose to lower the uptime expectation rather than pony up the stupid money it takes for the hardware, software, and infrastructure to get there. Most companies know the customer will not pay the extra cash for the uptime, thus... you get what you pay for.
Because it's not necessary? (Score:4, Informative)
The telephone as we know it was the first genuinely instantaneous, worldwide communications medium that anyone could use; it was seen as a necessary component of national security during the Cold War, and it was built out as such. We've had over a century to perfect it, and vast amounts of money were spent doing so. Despite its origins at DARPA, the Internet as we know it today, although more useful, is by and large less of a basic need, is far more complex, and large portions of it are still built on top of the telephone infrastructure besides.
I can't help but think that most people understand this sort of thing, and understand that bringing such modern conveniences up to five nines of reliability is difficult and expensive, and people have evidently decided that a certain tradeoff to make such things affordable isn't out of line.
The shorter, more pessimistic version of this is probably, "It's cheaper to suck."
The profits (Score:2, Informative)
The government didn't always regulate phone companies the way it eventually did; the landmark intervention came in 1984, when AT&T was broken up for having become too powerful. But AT&T became so powerful because it did a hell of an awesome thing with its network: it realized that better service equals more customers and more revenue. I recall hearing a story from a Bell Labs alum that they had a goal of handling annual peak call volume on the busiest day of the year (Mother's Day). The day was worth $24 million in phone charges to them. They spent $5 million on each of two different hardware architecture projects to get the system up and running to support the day. The monolithic centralized architecture failed, but the distributed architecture (spreading the communications through 10-15 national "hubs") worked. The system was a success, and AT&T got to enjoy its lunch by servicing its customers the way a business ought to.
For data networks, there is simply too much clutter and competition to be able to rein in 99.999% rates of performance. We should be happy to get 99.9% from the mishmash of hardware running the routers and OSes which power the Internet.
Re:Five 9's is impossible! (Score:3, Informative)
It doesn't mean that each and every individual phone will be up 99.999% of the time; it means that the system as a whole will be up 99.999% of the time.
It's quite possible for an entire town to be down for an entire year and still meet this criterion.
Yet modern cell operators STILL cannot come close.
Just one point ... (Score:4, Informative)
Of course, when you don't have transmitters with overlapping coverage, this doesn't work.
Re:because they've been conditioned (Score:3, Informative)
If something used to work and suddenly stopped working, chances are, a reboot will solve it. Obvious exceptions are that the problem could be caused by faulty hardware or a full disk. Those are usually easy to spot.
Reality check (Score:3, Informative)
Let's look at the numbers: 99.9% uptime translates to about 9 hours of unscheduled downtime (USD) a year. That can be one 9-hour block once a year, about 1.5 minutes per day, 3.6 seconds per hour, 60 milliseconds per minute, or one dropped packet per thousand. Sure, it's easy to spot a 9-hour blackout, but as the slices of downtime get thinner, they get harder to notice at all, or to identify as USD specifically.
99.999% uptime translates to about 5 minutes of USD per year, and is of questionable value. You can't identify a network outage, call in a complaint, and get the issue resolved in the given timeframe. 99.9999%? It is to laugh. You can't even look up the tech support phone number without blowing your downtime budget for the year. Get hit by a rolling blackout for an hour? Kiss your downtime budget goodbye for the next 120 years.
Getting back to 99.9% uptime, let's move on to standard utilization patterns. USD really only becomes an issue if people notice it.
If we have 2 seconds of usage and 2 seconds of downtime per minute, the odds of a collision are around 15:1 against, with an average overlap of 1 second when a collision does happen. Simply interleaving usage and downtime that way increases the perceived uptime by an order of magnitude, since roughly 90% of the outages happen when no one is actually using the network. And larger blocks of downtime get lost in larger blocks of non-utilization in exactly the same way.
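Those odds are easy to sanity-check with a quick Monte Carlo sketch. The setup here is illustrative, not a model of real traffic: one 2-second usage burst and one 2-second outage, each starting at a uniformly random point within a minute:

```python
import random

def overlap(a_start, b_start, dur=2.0):
    """Seconds of overlap between two dur-second windows."""
    return max(0.0, min(a_start + dur, b_start + dur) - max(a_start, b_start))

def simulate(trials=100_000, minute=60.0, dur=2.0, seed=1):
    """Estimate collision probability and mean overlap for random windows."""
    rng = random.Random(seed)
    collisions = 0
    total_overlap = 0.0
    for _ in range(trials):
        use = rng.uniform(0, minute - dur)   # 2 s of usage, random start
        down = rng.uniform(0, minute - dur)  # 2 s of downtime, random start
        ov = overlap(use, down, dur)
        if ov > 0:
            collisions += 1
            total_overlap += ov
    return collisions / trials, total_overlap / max(collisions, 1)

p, mean_ov = simulate()
print(f"collision probability ~ {p:.3f} (about 1 in {1 / p:.0f})")
print(f"mean overlap when colliding ~ {mean_ov:.2f} s")
```

The simulation lands near a 1-in-15 collision rate with about a second of overlap, which matches the back-of-envelope figures above.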
Granted, if you have higher utilization you'll have a better chance of hitting a chunk of downtime, but you'll also have higher chances of queuing latency within your own use patterns. If you're already using 99% of your bandwidth, you can't just plunk in one more job and expect it to run immediately. It has to wait for that 1% of space no one else is currently using. And when you get to that point, it's really time to consider buying a bigger pipe anyway.
And that brings us to the main point: People don't buy network connectivity in absolute terms. They buy capacity, and the capacity they buy is scaled to what they think of as acceptable peak usage. "Acceptable peak usage" is a subjective thing, and nobody makes subjective judgements with 99.999% precision.
Re:The way it has always been (Score:3, Informative)
Having been on the other end of these types of calls, I can say this sort of thing is *very* annoying. People call all the time with the expectation that because they do five or six thousand dollars' worth of business in a day, the ISP is somehow responsible for those thousands of dollars when some idiot Verizon contractor accidentally cuts our cables. Other reasons for outages include power loss, fire, flood, exploding transformers, telephone pole collapse, and many other issues outside of our control.
If you want guaranteed uptime, get it in your contract and be prepared to pay for it! Otherwise, we'll do the best we can to provide service at the funding level we receive, and will gladly refund the 59c worth of service that you would have paid for a 6-hour outage.
Would you expect Ford to pay you for lost wages when someone hits your car? Would you expect your grocery store to pay for your chihuahua when it starves to death because the store is out of dog food?
Introducing the EULA (Score:5, Informative)
After all, liability plays a large part in defining QA policies. If software companies were held to the same liability standards most product manufacturers face, I'd bet software development would be more of the engineering practice it should be.
To quote part of Microsoft's EULA for Windows XP.
http://www.microsoft.com/windowsxp/home/eula.mspx [microsoft.com]
ALSO, THERE IS NO WARRANTY OR CONDITION OF TITLE, QUIET ENJOYMENT, QUIET POSSESSION, CORRESPONDENCE TO DESCRIPTION OR NON-INFRINGEMENT WITH REGARD TO THE SOFTWARE.
Re:because they've been conditioned (Score:3, Informative)
This has _zero_ to do with the architecture and everything to do with the user. Linux would be (and is) treated in the same way in similar situations.
This is simply not true. Anyone who's ever installed software or run Windows Update knows that rebooting is a very likely part of the process. The dependencies and non-modular approach of Windows are quite apparent. Software vendors say "just reboot" because of all the complexities and dependencies within Windows.
The same simply isn't true for Linux. Replace a critical shared library? No problem, running programs still have a hook to the old version. Any new process that starts will get the new version of the library. Why reload the whole damn OS when restarting a process will do the same thing?
You can be assured the people running those BBSes were far less likely to have the "just reboot" mentality.
You're trying to tell me with a straight face that the BBS market influenced Microsoft? (Which flies in the face of what we've all experienced with Windows).
Further, the other reason most people have that attitude is because to them a computer is just another appliance.
No, the reason people have this attitude is because it freaking works.
Incidentally, there are numerous classes of problems on Linux (and UNIX in general) which are more quickly and easily "fixed" with a reboot.
I've been administering Linux machines for 13+ years. I can count on one hand the number of times a reboot solved any problem. The only class of problem it solves is a kernel bug, or the kernel crashing (usually from a hardware problem).
I can't even remember the last time I had to reboot any of my Windows machines without a good reason (eg: patching).
Why would anyone reboot without a "good reason"? The point is that Linux simply has fewer "good reasons" and requires fewer reboots. Linux requires FAR fewer reboots for patching.
Finally, there's nothing wrong with rebooting _anyway_. If your service uptime requirements are affected by a single machine rebooting, your architecture is broken.
Wow. Now I know you've really drunk the Microsoft kool-aid. Not everyone can afford multiple-machine redundancy just to work around the endemic problems of Microsoft, who advocate "Just reboot!" as the fix for so many problems. There's really no reason I should need to reboot just to update what are essentially new versions of some DLLs. The Microsoft architecture is essentially broken if you have to buy another damn machine for the SOLE purpose of maintaining high availability.
Re:At what price? (Score:2, Informative)
Good, Fast, Cheap: Pick any two (you can't have all three)
People want high reliability, but they're not prepared to pay for it. If they _are_ prepared to pay more money, they miss the point that unless they spend a LOT more money, they'll only increase one of Good (aka reliable) or Fast, not both.
Re:because they've been conditioned (Score:5, Informative)
I work for a cable company, by the way. I design a lot of the build-out of our system, so I know the actual costs associated with creating that kind of reliability. Whenever someone needs it, I actually recommend getting a second ISP as a low-speed backup. It is the only smart way to get real reliability, as pretty much any company advertising 99.999% reliability in this area is outright lying to the customer. (I know this from experience: I have switched customers over to our ISP after week-long (or longer) outages of every other ISP here, and there are quite a few.) Besides, a good router will split bandwidth between the ISPs so you're not paying for something you're not using (often called "bonding").
I'm still amazed when people yell at me about being offline for a few hours after maybe 3, 4, 5 years of uptime. They say they are losing thousands of dollars for every day they're offline, yet they don't want to pay for a $40 roll-over backup. THESE are the vast majority of customers who complain so much about 99.999% uptime.
On another note, I think claims of 99.999% on POTS are anecdotal. Growing up, I had my power cut out at least twice a year, and the phone system was hardly 99.999%. Trees fall on lines, and people cut buried lines for all sorts of accidental reasons. Just as you insure anything of enough value, just as you back up data in multiple locations, you need a fallback plan if your ISP goes out, if it means that much to you.
Re:because they've been conditioned (Score:3, Informative)
http://www.joelonsoftware.com/items/2008/01/22.html [joelonsoftware.com]
No, it does exist (Score:3, Informative)
The problem is, of course, going for that can be really expensive. Not only does the system itself have to have a bunch of redundancy, but so does everything supporting it. For example in the case of a web server you'd not only have to have multiple boxes running that, but multiple power connections, generators, network connections, ISPs, etc.
Doing something like that, you can offer essentially 100% uptime, barring a catastrophic event (and face it, any amount of uptime can be ruined by a sufficiently large event). However, it is extremely costly, and of course everything has to be well designed because, as you noted, if you fuck up anywhere, you've got 30 seconds to fix it.
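As a rough sketch of why all that redundancy buys so much, here's the standard back-of-envelope formula. The big assumption is that the redundant paths fail independently, which shared power, fiber, or a sufficiently large event will violate:

```python
def combined_availability(a, n):
    """Availability of n redundant, independent units, each with availability a.
    The system is down only when all n units are down at the same time."""
    return 1 - (1 - a) ** n

# Two fully independent 99% paths already look like four nines -- on paper.
for n in (1, 2, 3):
    print(f"{n} x 99% units -> {combined_availability(0.99, n):.6f}")
```

Each independent duplicate multiplies the unavailability by (1 - a), which is why the redundancy has to extend to power, network, and ISP; one shared dependency collapses the whole product back to a single term.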
Or you can just do what the voice guys like to do: change the rules. For them, the system is "up" so long as there is at least one phone line that can place a call to at least one other phone line. By that standard, the voice switch on campus has never been down. Of course, that isn't a particularly useful standard, if you ask me.
On redundancy (Score:4, Informative)
In the entire history of electromechanical switching in the Bell System, no central office was ever out of action for more than thirty minutes for any reason other than a natural disaster. On the other hand, step-by-step (Strowger) switches failed to connect about 1% of calls correctly, and crossbar reduced that to about 0.1%. With electronic switching, the failure rate is higher but the error rate is much lower.
This reflects the fact that, in the electromechanical era, the hardware reliability was low enough that the system had to be designed to have a higher reliability than any of its individual units. In the computer era, the component reliability is so high that good error rates can be achieved without redundancy. This is why computer-based networks tend to have common mode failures.
If you're involved in designing highly reliable systems, it's worth understanding how Number 5 Crossbar worked. Here's an oversimplified version.
The biggest components of Number 5 Crossbar were the crossbar switches themselves. Think of them as 10x10 matrices of contacts which could be X/Y addressed and set or cleared. Failure of an entire crossbar switch could take down only a few lines, and they usually failed one row or column at a time, taking down at most one line.
The crossbars had no smarts of their own; they were told what to do by "markers", the smart part of the central office. Each marker could set up or tear down a call in about 100ms. Markers were duplicated, with half of the marker checking the other half. If the halves disagreed, the transaction aborted. Each central office had multiple markers (not that many, maybe ten in an office with 10,000 lines), and markers were assigned randomly to process calls.
When a phone went off hook, a marker was notified, and set up a "call" to some free "originating register", the unit that understood dial pulses and provided dial tone. The marker was then released, while the user dialed. The originating register received the input dial info, and when its logic detected a complete number, it requested a random marker, and sent the number. The marker set up the call, set and locked in the correct contacts in the crossbars, and was released to do other work.
If the marker failed to set up the call successfully (there was a timeout around 500ms), the originating register got back a fail, and retried, once. One retry is a huge win: if there's a 1% fail rate on the first try, there's a 0.01% fail rate with two tries. This little trick alone made crossbar systems appear very reliable. There's much to be said for doing one retry on anything which might fail transiently. If the retry also fails, unit-level retry as a strategy probably isn't working, and the problem needs to be kicked up a level.
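The retry arithmetic checks out in a quick simulation. Everything here is illustrative: the function names and the 1% transient fault rate are stand-ins, not the actual marker logic:

```python
import random

def attempt_call_setup(fail_rate, rng):
    """One marker attempt; True on success. fail_rate models transient faults."""
    return rng.random() >= fail_rate

def setup_with_one_retry(fail_rate, rng):
    """The originating register's strategy: try once, retry once, then escalate."""
    if attempt_call_setup(fail_rate, rng):
        return True
    return attempt_call_setup(fail_rate, rng)  # the single retry

rng = random.Random(42)
trials = 1_000_000
failures = sum(not setup_with_one_retry(0.01, rng) for _ in range(trials))
print(f"failure rate with one retry: {failures / trials:.6f}")  # ~ 0.0001
```

The catch, as the comment notes, is that this only works for *transient* faults: if the failures are correlated (a dead unit, a persistent overload), retrying just doubles the load.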
The pattern of requesting resources from a pool at random was continued throughout the system. Trunks (to other central offices), senders (for sending call data to the next switch), translators (for converting phone numbers into routes), billing punches (for logging call data), and trouble punches (for logging faults) were all assigned on a random, or in some cases a cyclic rotation basis. Units that were busy, faulted, or physically removed for maintenance were just skipped.
That's how the Bell System achieved such good reliability with devices that had moving parts.
Note that this isn't a "switch to backup" strategy. The distribution of work amongst units is part of normal operation, constantly being exercised. So handling a failure doesn't involve special cases. Failures cost you some system capacity, but don't take the whole system down.
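As a sketch of that pattern (illustrative names and states; not the actual Bell System logic), random assignment from a pool that simply skips busy or faulted units looks something like this:

```python
import random

class UnitPool:
    """Markers, trunks, senders, etc.: callers request a unit at random from
    the pool, and busy or faulted units are simply skipped. Failures cost
    capacity, not availability -- until the pool is empty."""

    def __init__(self, names):
        self.units = {name: "ok" for name in names}

    def mark(self, name, state):
        """Set a unit's state: "ok", "busy", or "faulted"."""
        self.units[name] = state

    def acquire(self, rng=random):
        """Pick a random healthy unit; raise only when none remain."""
        available = [n for n, s in self.units.items() if s == "ok"]
        if not available:
            raise RuntimeError("no units available: capacity exhausted")
        return rng.choice(available)

pool = UnitPool([f"marker-{i}" for i in range(10)])
pool.mark("marker-3", "faulted")  # pulled for maintenance; simply never chosen
print(pool.acquire())             # any of the nine healthy markers
```

Because every request exercises the selection path, there is no untested "failover" branch to discover broken at the worst possible moment, which is exactly the point made above.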
We need more of that in the Internet. Some (not all) load balancers for web sites work like this. Some (but not all) packet switches work like this. Think about how you can use that pattern in your own work. It worked for more than half a century for the Bell System.
Zero dropped calls? (Score:2, Informative)