Keeping an Eye Out When Sites Go Down
miller60 writes "Are major web sites going down more often? Or are outages simply more noticeable? The New York Times looks at the recent focus on downtime at services like Twitter, and the services that have sprung up to monitor outages. When a site goes down, word spreads rapidly, fueled by blogs and forums. But there have also been a series of outages with real-world impact, affecting commodities exchanges, thousands of web sites and online stores."
New sites are more complicated... (Score:4, Interesting)
So they're more likely to suffer downtime, since any one of the many pieces can break and take the whole thing down. Look at a site like Drudge Report that gets massive traffic but is really VERY simple to run. Then look at a site like Twitter or YouTube, which has many more services to operate and keep running together.
More sites using multiple external sources (Score:3, Interesting)
These days web pages are composed of multiple sources, often displaying content from several servers. Consider that 'back in the day' a web site was a static HTML file with some links. These days we have a 'site' pulling from an image server, a media server, an advertising server, SQL backends and other content providers. When one of these sources fails, the whole works often goes down.
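The fix for that failure mode is to treat every external source as optional. Here's a minimal sketch (the backend names and URLs are made up for illustration) of assembling a page where a dead backend degrades to a placeholder instead of taking the whole render down:

```python
import urllib.request
from urllib.error import URLError

# Hypothetical backends -- stand-ins for the image server,
# ad server, and database-backed content service mentioned above.
BACKENDS = {
    "images": "http://img.example.com/banner",
    "ads": "http://ads.example.com/slot1",
    "content": "http://db.example.com/article/42",
}

def fetch_fragment(name, url, timeout=2.0):
    """Fetch one page fragment; fall back to a placeholder comment
    on failure so one dead backend can't sink the whole page."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except (URLError, OSError):
        return f"<!-- {name} unavailable -->"

def render_page():
    return "\n".join(fetch_fragment(n, u) for n, u in BACKENDS.items())
```

The timeout matters as much as the fallback: without it, a hung ad server stalls the page just as effectively as a crashed one.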
Personally, I haven't noticed any increase in downtime for the services I use, and I don't feel this is newsworthy. Of course, I don't use Twitter, so maybe that's why.
Re:no... (Score:4, Interesting)
Blackstart capability (Score:5, Interesting)
What with the "software as a service" and "outsourcing system administration" fads, more sites rely on other sites being up when they power up. That could become a problem when bringing a site back up after an outage. It's important to know which sites have "black start" capability: they can start up without any resources from outside.
You can save money by outsourcing Linux system administration to Tomsk, Russia, [intrice.com] or Lotus system administration to India. "Remote System Administration for your Lotus Notes/Domino Servers, Infrastructure" [maargasystems.com]. But can you then restart your data center from a cold start, when the offshore admin people can't yet get in?
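One way to make the "black start" question concrete is to write the dependency graph down and check it. This is a sketch with hypothetical service names: a service can cold-start only if every dependency can be brought up first, and anything depending on an external service that isn't in the map blocks the whole sequence.

```python
def blackstart_order(deps):
    """Given {service: [dependencies]}, return a startup order in
    which every dependency is already up. Raises if an external or
    circular dependency makes a cold start impossible."""
    started, order = set(), []
    remaining = dict(deps)
    while remaining:
        # A service is ready when all its deps are already started.
        ready = [s for s, d in remaining.items() if set(d) <= started]
        if not ready:
            raise RuntimeError(f"cannot cold-start: {sorted(remaining)}")
        for s in sorted(ready):  # sorted for a deterministic order
            started.add(s)
            order.append(s)
            del remaining[s]
    return order

# Hypothetical site: everything here is local, so it can black-start.
site = {
    "dns": [],
    "database": ["dns"],
    "auth": ["database"],
    "webapp": ["database", "auth"],
}
```

Add one entry like `"webapp": ["offsite_auth"]` with no local definition of `offsite_auth`, and the function raises: that's exactly the data center that can't come up while the offshore admins are unreachable.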
Re:The twitter factor (Score:2, Interesting)
This is the kind of work that always seems to take a back seat to development due to resource constraints, but it really needs to occur in tandem with the development process.
That's not true. As the Twitter, Digg, Flickr, etc. examples clearly show, cornering the market while appearing "pretty decent" matters more than anything else. The cost of doing it properly from the get-go cannot be shouldered by a company with an unproven concept, neither time- nor money-wise. Most of these services are 99.9% user base and 0.1% implementation. If you can get the users with a rough sketch, it is then much easier to get the resources for even a complete rewrite of the server software. Besides, this isn't even a business-biased view: most programmers agree that the first implementation is for understanding the problem and the second implementation is for solving it.
Re:The twitter factor (Score:2, Interesting)
Fault-tolerant at every practical level. This gets expensive, so you see datacenter failures take down large swaths of sites that don't have multiple locations.
I work on a site that has pretty much every conceivable fault-tolerance short of multiple sites: separate ISPs, each with its own redundant router and firewall hardware; multiple load-balanced front-end web servers; and load-balanced database and file servers behind them (every server running Solaris). Everything has multiple power supplies connected to different mains feeds and different generators. All of it is frightfully expensive and heavily monitored.
Yet the #1 cause of downtime is the clustering software on the file servers failing to actually fail over when something goes wrong. Whenever the file system mounts fail, the whole system is down until those servers are rebooted, which takes 1-2 hours because of the clustering software.
If those file servers had instead been relatively cheap boxes with no redundancy, they could have been rebooted quickly and the file system mounts would have recovered automatically within 15 minutes.
So the moral here is that more fault-tolerance isn't always the best way to maintain uptime. Carefully deciding where to spend money, and on which type of fault-tolerance, will buy you more real-world uptime in the long run than spending unwisely to inflate statistical uptime.
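The trade-off above can be put in back-of-envelope numbers. This sketch uses made-up figures (failure rates and recovery times are assumptions, not measurements from the poster's site): a clustered file server whose failover only works some of the time versus a plain box that always needs a short reboot.

```python
def expected_downtime(failures_per_year, p_failover_works,
                      failover_minutes, manual_recovery_minutes):
    """Expected minutes of downtime per year: each failure either
    fails over quickly or falls back to slow manual recovery."""
    per_failure = (p_failover_works * failover_minutes
                   + (1 - p_failover_works) * manual_recovery_minutes)
    return failures_per_year * per_failure

# Hypothetical numbers: 4 failures/year; the cluster fails over in
# 1 minute 80% of the time but needs a 90-minute reboot otherwise.
clustered = expected_downtime(4, 0.8, 1, 90)
# The simple box always needs its 15-minute reboot.
simple = expected_downtime(4, 0.0, 0, 15)
```

With these assumed numbers the "dumb" server comes out ahead (60 minutes/year vs. about 75), which is the poster's point: an unreliable failover path can cost more than no failover path at all.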
no, but... (Score:2, Interesting)
AVG is probably why we have this post this week: there were a lot of timeouts last week, although Grisoft was not the only problem. For a while Virgin Media customers in the UK lost a couple of continents, with the U.S.A. and Australia dropping off the map. I had to read Pravda instead of Slashdot for an hour or two...
My backup route actually worked fine, and I was in the middle of getting a squid proxy server of my own up and running when the network problems magically fixed themselves. There are lessons to be learned: if you need your internet more than is healthy, then you also need a backup plan. That could be a wifi sharing agreement with the neighbours or a proxy server at work that you can dial into from home. The internet does not dynamically re-route traffic around a problem with a major link, which is a problem in itself. I thought we would have TCP/IP over ATM or something like that to solve this by now.
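The "backup plan" described above amounts to: probe the primary route, and fall back to the proxy if it's dead. A minimal sketch, where the check host and proxy address are hypothetical stand-ins for your own ISP and your squid box at work:

```python
import socket

# Hypothetical endpoints: any host reachable via the primary route,
# and a backup squid proxy (3128 is squid's default port).
PRIMARY_CHECK = ("slashdot.org", 80)
BACKUP_PROXY = ("proxy.example.org", 3128)

def route_is_up(host, port, timeout=3.0):
    """Cheap reachability test: can we open a TCP connection?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def pick_route():
    if route_is_up(*PRIMARY_CHECK):
        return "direct"
    if route_is_up(*BACKUP_PROXY):
        return "proxy"
    return "offline"
```

A cron job running `pick_route` and rewriting your proxy settings gets you most of the way to the automatic re-routing the poster wishes the internet itself provided.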
Re:But what happens (Score:3, Interesting)