Keeping an Eye Out When Sites Go Down
miller60 writes "Are major web sites going down more often? Or are outages simply more noticeable? The New York Times looks at the recent focus on downtime at services like Twitter, and the services that have sprung up to monitor outages. When a site goes down, word spreads rapidly, fueled by blogs and forums. But there have also been a series of outages with real-world impact, affecting commodities exchanges, thousands of web sites and online stores."
New sites are more complicated... (Score:4, Interesting)
So they're more likely to suffer downtime, since any one of the many pieces can break and take the whole thing down. Look at a site like Drudge Report that gets massive traffic but is really VERY simple to run. Then look at a site like Twitter or YouTube, which has many more services to operate and keep running together.
More sites using multiple external sources (Score:3, Interesting)
These days web pages are composed of multiple sources, often displaying content from several servers. Consider that 'back in the day' a web site was a static HTML file with some links. These days we have a 'site' pulling from an image server, a media server, an advertising server, SQL backends and other content providers. When one of these sources fails, the whole works often goes down.
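The fix for that failure mode is to treat every external source as optional. Here's a minimal sketch (the backend names and URLs are made up for illustration) of assembling a page where a dead backend degrades to a placeholder instead of taking the whole render down:

```python
import urllib.request
from urllib.error import URLError

# Hypothetical backends -- stand-ins for the image server,
# ad server, and database-backed content service mentioned above.
BACKENDS = {
    "images": "http://img.example.com/banner",
    "ads": "http://ads.example.com/slot1",
    "content": "http://db.example.com/article/42",
}

def fetch_fragment(name, url, timeout=2.0):
    """Fetch one page fragment; fall back to a placeholder comment
    on failure so one dead backend can't sink the whole page."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except (URLError, OSError):
        return f"<!-- {name} unavailable -->"

def render_page():
    return "\n".join(fetch_fragment(n, u) for n, u in BACKENDS.items())
```

The timeout matters as much as the fallback: without it, a hung ad server stalls the page just as effectively as a crashed one.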
Personally, I haven't noticed any increase in downtime for the services I use, and I don't feel this is newsworthy. Of course, I don't use Twitter, so maybe that's why.
Re:no... (Score:4, Interesting)
Blackstart capability (Score:5, Interesting)
What with the "software as a service" and "outsourcing system administration" fads, more sites rely on other sites being up when they power up. That could become a problem when bringing a site back up after an outage. It's important to know which sites have "black start" capability: they can start up without any resources from outside.
You can save money by outsourcing Linux system administration to Tomsk, Russia, [intrice.com] or Lotus system administration to India. "Remote System Administration for your Lotus Notes/Domino Servers, Infrastructure" [maargasystems.com]. But can you then restart your data center from a cold start, when the offshore admin people can't yet get in?
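One way to make the "black start" question concrete is to write the dependency graph down and check it. This is a sketch with hypothetical service names: a service can cold-start only if every dependency can be brought up first, and anything depending on an external service that isn't in the map blocks the whole sequence.

```python
def blackstart_order(deps):
    """Given {service: [dependencies]}, return a startup order in
    which every dependency is already up. Raises if an external or
    circular dependency makes a cold start impossible."""
    started, order = set(), []
    remaining = dict(deps)
    while remaining:
        # A service is ready when all its deps are already started.
        ready = [s for s, d in remaining.items() if set(d) <= started]
        if not ready:
            raise RuntimeError(f"cannot cold-start: {sorted(remaining)}")
        for s in sorted(ready):  # sorted for a deterministic order
            started.add(s)
            order.append(s)
            del remaining[s]
    return order

# Hypothetical site: everything here is local, so it can black-start.
site = {
    "dns": [],
    "database": ["dns"],
    "auth": ["database"],
    "webapp": ["database", "auth"],
}
```

Add one entry like `"webapp": ["offsite_auth"]` with no local definition of `offsite_auth`, and the function raises: that's exactly the data center that can't come up while the offshore admins are unreachable.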
Re:The twitter factor (Score:2, Interesting)
This is the kind of work that always seems to take a back seat to development due to resource constraints, but it really needs to occur in tandem with the development process.
That's not true. As the Twitter, Digg, Flickr, etc. examples clearly show, cornering the market while appearing "pretty decent" matters more than anything else. The cost of doing it properly from the get-go cannot be shouldered by a company with an unproven concept, neither time- nor money-wise. Most of these services are 99.9% user base and 0.1% implementation. If you can get the users with a rough sketch, it is then much easier to get the resources for even a complete rewrite of the server software. Besides, this isn't even a business-biased view: most programmers agree that the first implementation is for understanding the problem and the second implementation is for solving it.
Re:The twitter factor (Score:2, Interesting)
Fault-tolerant at every practical level. This gets expensive, so you see datacenter failures take down large swaths of sites that don't have multiple locations.
I work on a site that has pretty much every conceivable fault-tolerance short of multiple sites: separate ISPs, each with its own redundant router and firewall hardware; multiple load-balanced front-end web servers; and load-balanced database and file servers behind them (every server running Solaris). Everything has multiple power supplies connected to different mains feeds and different generators. All of it is frightfully expensive and heavily monitored.
Yet the #1 cause of downtime is the clustering software on the file servers failing to actually fail over when something goes wrong. Whenever the file system mounts fail, the whole system is down until those servers are rebooted, which takes 1-2 hours because of the clustering software.
If those file servers had instead been relatively cheap boxes with no redundancy, they could have been rebooted quickly and the file system mounts would have recovered automatically within 15 minutes.
So the moral here is that more fault-tolerance isn't always the best way to maintain uptime. Carefully deciding where to spend money, and on which type of fault-tolerance, will buy you more real-world uptime in the long run than spending unwisely to inflate statistical uptime.
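The trade-off above can be put in back-of-envelope numbers. This sketch uses made-up figures (failure rates and recovery times are assumptions, not measurements from the poster's site): a clustered file server whose failover only works some of the time versus a plain box that always needs a short reboot.

```python
def expected_downtime(failures_per_year, p_failover_works,
                      failover_minutes, manual_recovery_minutes):
    """Expected minutes of downtime per year: each failure either
    fails over quickly or falls back to slow manual recovery."""
    per_failure = (p_failover_works * failover_minutes
                   + (1 - p_failover_works) * manual_recovery_minutes)
    return failures_per_year * per_failure

# Hypothetical numbers: 4 failures/year; the cluster fails over in
# 1 minute 80% of the time but needs a 90-minute reboot otherwise.
clustered = expected_downtime(4, 0.8, 1, 90)
# The simple box always needs its 15-minute reboot.
simple = expected_downtime(4, 0.0, 0, 15)
```

With these assumed numbers the "dumb" server comes out ahead (60 minutes/year vs. about 75), which is the poster's point: an unreliable failover path can cost more than no failover path at all.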
no, but... (Score:2, Interesting)
AVG is probably why we have this post this week: there were a lot of timeouts last week, although Grisoft was not the only problem. For a while Virgin Media customers in the UK lost a couple of continents, with the U.S.A. and Australia dropping off the map. I had to read Pravda instead of Slashdot for an hour or two...
My backup route actually worked fine, and I was in the middle of getting a squid proxy server of my own up and running when the network problems magically fixed themselves. There are lessons to be learned: if you need your internet more than is healthy, then you also need a backup plan. That could be a wifi sharing agreement with the neighbours or a proxy server at work that you can dial into from home. The internet does not dynamically re-route traffic around a problem with a major link, which is a problem in itself. I thought we would have TCP/IP over ATM or something like that to solve this by now.
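The "backup plan" described above amounts to: probe the primary route, and fall back to the proxy if it's dead. A minimal sketch, where the check host and proxy address are hypothetical stand-ins for your own ISP and your squid box at work:

```python
import socket

# Hypothetical endpoints: any host reachable via the primary route,
# and a backup squid proxy (3128 is squid's default port).
PRIMARY_CHECK = ("slashdot.org", 80)
BACKUP_PROXY = ("proxy.example.org", 3128)

def route_is_up(host, port, timeout=3.0):
    """Cheap reachability test: can we open a TCP connection?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def pick_route():
    if route_is_up(*PRIMARY_CHECK):
        return "direct"
    if route_is_up(*BACKUP_PROXY):
        return "proxy"
    return "offline"
```

A cron job running `pick_route` and rewriting your proxy settings gets you most of the way to the automatic re-routing the poster wishes the internet itself provided.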
Re:But what happens (Score:3, Interesting)