Keeping an Eye Out When Sites Go Down
miller60 writes "Are major web sites going down more often? Or are outages simply more noticeable? The New York Times looks at the recent focus on downtime at services like Twitter, and the services that have sprung up to monitor outages. When a site goes down, word spreads rapidly, fueled by blogs and forums. But there have also been a series of outages with real-world impact, affecting commodities exchanges, thousands of web sites and online stores."
Short version... (Score:4, Insightful)
Is downtime really more frequent? Or is it just more visible?
The answer is both.
The twitter factor (Score:5, Insightful)
Twitter's infrastructure is notoriously poorly thought out, and I sort of doubt they employed any systems administrators (or service engineers, or operations engineers, or whatever) up until recently.
I think the barrier to entry from an engineering standpoint has been lowered such that you can more easily make a site that appears to be pretty decent and attracts an audience. What is often missing is the behind-the-scenes work which ensures that the service is:
- Deployed properly, with testing and staging environments that actually mirror production.
- Fault-tolerant at every practical level. This gets expensive, so you see datacenter failures take down large swaths of sites who don't have multiple locations.
- Constantly monitored, including performance metrics, to find issues quickly or even before they happen.
This is the kind of work that always seems to take a back seat to development due to resource constraints, but it really needs to occur in tandem with the development process.
If you don't design a site from the ground up to be redundant and high-performing, it's pretty difficult to flip a switch and make it that way later. Which is basically what Twitter has found out. Whether or not this mentality is taking over the Interworld is another story though.
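To make the monitoring point concrete, here's a minimal sketch of an uptime/latency poller in Python. The URLs and threshold are made up for illustration; a real shop would run Nagios, Cacti, or similar rather than a cron script like this:

```python
import time
import urllib.request

SLOW_THRESHOLD = 2.0  # seconds; flag degradation before it becomes an outage

def check(url, timeout=5):
    """Probe one URL; return (ok, elapsed_seconds)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            ok = resp.status == 200
    except OSError:  # covers URLError, HTTPError, timeouts, refused connections
        ok = False
    return ok, time.monotonic() - start

if __name__ == "__main__":
    # Hypothetical targets -- substitute your own production endpoints.
    for url in ("http://example.com/", "http://example.org/"):
        ok, elapsed = check(url)
        if not ok:
            print("DOWN", url)
        elif elapsed > SLOW_THRESHOLD:
            print("SLOW", url, round(elapsed, 1))
```

The interesting part is the latency branch: graphing elapsed times over days is what catches problems "before they happen," not the up/down bit.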
Re:Short version... (Score:5, Insightful)
I think monopolization plays a role too.
Back when people jumped between Altavista, Hotbot, Jeeves and other engines, one of them going down wasn't so bad -- you just used another, and a day later, you wouldn't even remember that one of them had been down. But these days, everyone and his dog uses Google, and if Google goes down, people won't know what to do. Similar for other sites and hubs -- they've become too big, and users have become too reliant on them.
So even if uptime has increased, the impact of downtime has become larger, in part due to the larger reliance on single systems.
Re:The twitter factor (Score:2, Insightful)
Twitter made a big mistake by basing their technology around Ruby on Rails.
Ruby on Rails is, of course, great for CRUD-style websites. It makes development lightning fast, and easy as sin. Twitter doesn't exactly fall into that category. Although Ruby on Rails is flexible enough to develop a small-scale version of the Twitter application, it just isn't capable of scaling.
They really need to be looking into Erlang [erlang.org]. Erlang is perfect for the type of software [algorithm.com.au] they need to provide the service they offer (see ejabberd [ejabberd.im] for example). Plus it's open source [erlang.org], and it has a vibrant online community [erlang.org], and frequent releases, numerous conferences, interfacing with other languages [plnews.org], and other goodies.
Erlang originated from, and has been successfully used within, the telecom industry, which is very similar to the market Twitter is involved with. Thus they should learn from the masters, and use Erlang wherever possible for their core services.
Re:The twitter factor (Score:5, Insightful)
"If you don't design a site from the ground up to be redundant and high-performing, it's pretty difficult to flip a switch and make it that way later. Which is basically what Twitter has found out."
And really, that's OK.
Sites like Twitter are popping up precisely because the bar is very low to get your idea out on the 'net and compete. Sure, the cost in dollars and person hours is much higher to refactor for stability later, but would Twitter have even come into existence if that was a requirement from the start? Would its founders have considered it a worthwhile risk?
Jason
Re:More sites using multiple external sources (Score:3, Insightful)
These days web pages comprise multiple sources, often displaying content from multiple servers. Consider that 'back in the day' a web site was a static HTML file with multiple links. These days we have a 'site' linking to an image server, media server, advertising server, with SQL backbones and other content providers. When one of these sources fails, often the whole works goes down.
Which is also why many major sites are so slow to load on less than optimal connections (which many are still stuck with). Personally, I find all the bells and whistles distracting, complicating, and useless. It seems like sites compete to see how crowded and busy they can make their pages. Right up at the top of the list for me are sites that insist on displaying some stupid Flash screen (that adds nothing to the meat and potatoes content/function of the site) and give you no option for bypassing it. The Internet could be a marvelous animal for information if website designers could just resist the impulse to throw every possible widget and geegaw into the mix. It not only adds little to the basic functionality of the site, but as pointed out above, just increases the number of individual elements that can fail and slow or stop a site in its tracks.
Me, if I want the MLB scores, or the news headlines, or to compare prices between a few retailers, all I need is the information, please -- I don't need a floor show accompanying it.
Re:The twitter factor (Score:3, Insightful)
OK, let's explore this. If I were to log to syslog, only the ErrorLog supports it natively. To do this with an AccessLog, I would have to use the piped-log output feature to route to a script that I write, which in turn writes to syslog for Apache.
This is exactly the sort of bespoke stuff I'm referring to. Why should this need to be implemented 1,000 times at company after company to accomplish the exact same thing?
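For the record, the shim itself is tiny, which is rather the point: it's trivial and yet everybody writes their own. A sketch in Python (the install path and syslog facility below are illustrative, not anything Apache ships):

```python
#!/usr/bin/env python3
# Piped-log shim: Apache feeds access-log lines to this script's stdin,
# and each line is forwarded to syslog. Hypothetical wiring in httpd.conf:
#   CustomLog "|/usr/local/bin/access2syslog.py" combined
import sys
import syslog

def forward(stream, emit):
    """Send every non-empty line from `stream` to `emit(priority, message)`."""
    count = 0
    for line in stream:
        line = line.rstrip("\n")
        if line:
            emit(syslog.LOG_INFO, line)
            count += 1
    return count

# Real invocation (Unix only) would be:
#   syslog.openlog("apache-access", facility=syslog.LOG_LOCAL0)
#   forward(sys.stdin, syslog.syslog)
```

Taking `emit` as a parameter is only there so the sketch can be exercised without a syslog daemon; the real script would hardwire `syslog.syslog`.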
Re:The twitter factor (Score:2, Insightful)
Did you just get paid to write that?
Re:The twitter factor (Score:3, Insightful)
I agree to an extent, but I also think that not all of these sites will survive their re-implementation periods in the face of better-designed competitors. Flickr, for instance, is internally a mess. I presume part of this is due to poor initial implementation, but it's further compounded by a need to Yahooize it at every level.
I presume Twitter will encounter a mass exodus at some point, as its users are likely to be very keen to move on to the next big (and possibly more reliable) thing.
Every time a site is down, you run the risk of irretrievably losing a portion of your users. Once you get enough bad will going, you don't even have to have failures; just having a reputation as not being reliable can be enough.
Re:or... (Score:4, Insightful)
So don't go there, don't click on links to it, and stop bitching about it. It only annoys you if you let it.
Or do you just like to whine?
Yes, they got a mention, because they can't fucking make the damn thing stop dying. If you want to be that prominent you need to get your shit together, or take the flak.
Re:The twitter factor (Score:4, Insightful)
Sites like Twitter are popping up precisely because the bar is very low to get your idea out on the 'net and compete. Sure, the cost in dollars and person hours is much higher to refactor for stability later, but would Twitter have even come into existence if that was a requirement from the start? Would its founders have considered it a worthwhile risk?
That's a common after-the-fact excuse for not thinking at all about performance, but I've concluded that it's mostly bullshit.
Sure, if you consider these questions up front and know what you're doing, it's completely possible to defer most of the work until things start to pick up. That's a very legitimate business decision, and if you get a big surprise in your growth curve, it's possible to get crushed. But with a little load testing, responsible development practices, and a little forethought, you've got a very good chance of avoiding a disaster. And none of that needs to be a big barrier to just getting something out.
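"A little load testing" really can be a few lines. A hypothetical sketch -- a thread pool plus a latency summary, not a substitute for a real tool like JMeter or ab:

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

def load_test(request, n_requests=100, concurrency=10):
    """Fire `n_requests` calls to `request()` across `concurrency` threads.
    Returns (error_count, median_latency, p95_latency) in seconds."""
    def timed(_):
        start = time.monotonic()
        try:
            request()
            err = 0
        except Exception:
            err = 1
        return err, time.monotonic() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(timed, range(n_requests)))

    errors = sum(e for e, _ in results)
    latencies = sorted(t for _, t in results)
    p95 = latencies[int(0.95 * (len(latencies) - 1))]
    return errors, statistics.median(latencies), p95
```

Point `request` at a staging endpoint, e.g. `lambda: urllib.request.urlopen("http://staging.example/", timeout=5)` (hypothetical host), crank `concurrency`, and watch where the p95 falls over. That's an afternoon of work, not a barrier to shipping.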
On the other hand, if you just don't think about those questions at all, building things willy nilly with no preparation for refactoring and growth down the road, then that's just idiotic. You are in effect betting that you will fail, in that your site will work only if it doesn't get popular. And with something like Twitter, where the network effect is king and you could only make money with a shitload of traffic, massive growth is the only way to succeed.
From what I can tell, Twitter is firmly in that second camp. They've been going for nearly two years, and they've been shaky for most of it. One black eye from a sudden surge is acceptable, and for some is even a badge of honor. But more than a year of load-based suckage, to the point where you are an international joke, is a sign of plain incompetence. Although it hasn't killed Twitter, it has killed other businesses, and Twitter is not out of the woods yet.
Re:The twitter factor (Score:4, Insightful)
By the time you get big enough to really have to worry about scalability more than just turning on caching, you ought to be able to produce enough revenue to reimplement the site. If not, obviously you aren't relevant (or you aren't clever enough.) :)
I've heard this theory a lot. With regrettable frequency, it's part of noob entrepreneur business plans. I see three big problems with it.
Twitter is a good example of all of these problems. They surely started out saying they would worry about scaling later. Then later came, and they had other things to do: new features, dealing with abusers, setting up a customer support infrastructure. Their quick scaling fixes kept their heads barely above water, but they didn't do much for the long term. And they are still in the "grow big, grow fast" stage, so they don't have any revenue and would rather wait a while longer to deal with that.
Re:Blackstart capability (Score:4, Insightful)
An important, related issue is the loss of local knowledge.
If you did a web startup ten years ago, you pretty much had to hire a sysadmin. If you had a good one, they would yell at your developers about their retarded, unscalable designs. Having a scary bearded man threaten you with defenestration has its downsides, but it does give you an incentive to consider the impact on operations.
The ever-lower cost of hosting is also a problem. If you tried to just throw $250k of hardware at a scaling issue back then, hopefully some executive would come by and ask some WTF-ish questions. (Unless you were at Boo.com or Webvan, natch.) But now, monthly rental on equivalent computing power is circa $400. Who'd bitch about that? Which allows you to really settle into a totally unscalable architecture.
Re:The twitter factor (Score:3, Insightful)
It might even be better savings that way, but the way people talk about how Twitter is set up, it sounds like the people that set it up didn't even know what they were doing, like maybe they dropped out of school halfway through the database class. Given that they are still having problems, I think it's reasonable to suggest that they still don't know what they are doing, even though their VC funding should have allowed them to hire enough qualified people to fix the problem. The way it is now, I wonder if there really is any resale value in the company. At this point, they have no revenue stream, not even ads as far as I can tell, so it looks like they're looking to build a service that gets bought out by a big company. I think whoever buys them would almost certainly not be buying them for the employees, the organization, the code or the infrastructure, but rather just the users and only the users. I see little value in anything there beyond what amounts to buying the users.