
Keeping an Eye Out When Sites Go Down

miller60 writes "Are major web sites going down more often? Or are outages simply more noticeable? The New York Times looks at the recent focus on downtime at services like Twitter, and the services that have sprung up to monitor outages. When a site goes down, word spreads rapidly, fueled by blogs and forums. But there have also been a series of outages with real-world impact, affecting commodities exchanges, thousands of web sites and online stores."
  • by Anonymous Coward on Sunday July 06, 2008 @02:19PM (#24076081)

    Complex sites are more likely to suffer downtime, since any one of their many pieces can break and take the whole thing down. Look at a site like the Drudge Report, which gets massive traffic but is really VERY simple to run. Then look at a site like Twitter or YouTube, which has many more services to operate and keep running together.

  • by urbanriot ( 924981 ) on Sunday July 06, 2008 @02:55PM (#24076335)

    These days web pages are composed of content from multiple sources, often displayed from multiple servers. Consider that 'back in the day' a web site was a static HTML file with multiple links. These days we have a 'site' linking to an image server, a media server, an advertising server, SQL back ends, and other content providers. When one of these components fails, the whole works often goes down.

    Personally, I don't notice an increased frequency of site downtime with any of the services I use, and I don't feel this is newsworthy. Of course, I don't use Twitter, so maybe that's why.
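
    A minimal sketch of the opposite approach, with placeholder fragment names and URLs (none of them from the comment above): fetch each backend fragment with a short timeout so that one dead component degrades that fragment instead of taking the whole page down.

      # Illustrative only: hypothetical fragment services assembled with timeouts
      # so a failed image/ad/media backend degrades gracefully.
      from concurrent.futures import ThreadPoolExecutor
      import urllib.request

      FRAGMENTS = {
          "images": "http://images.example.com/fragment",   # placeholder URLs
          "media":  "http://media.example.com/fragment",
          "ads":    "http://ads.example.com/fragment",
      }

      def fetch_fragment(url, timeout=2):
          """Fetch one fragment; return empty markup if its backend is down or slow."""
          try:
              with urllib.request.urlopen(url, timeout=timeout) as resp:
                  return resp.read().decode("utf-8", errors="replace")
          except OSError:
              return ""  # degrade this fragment rather than failing the page

      def assemble_page():
          """Fetch all fragments in parallel and return whatever came back."""
          with ThreadPoolExecutor(max_workers=len(FRAGMENTS)) as pool:
              futures = {name: pool.submit(fetch_fragment, url)
                         for name, url in FRAGMENTS.items()}
          return {name: future.result() for name, future in futures.items()}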

  • Re:no... (Score:4, Interesting)

    by Koiu Lpoi ( 632570 ) <koiulpoi AT gmail DOT com> on Sunday July 06, 2008 @03:01PM (#24076379)
    Agreed. Google and Slashdot are the two sites (depending on my mood) I test to see if I have an internet connection. If I can't reach one, I don't even bother testing the other; I assume it's on my end, and I haven't been wrong yet.
  • by Animats ( 122034 ) on Sunday July 06, 2008 @03:05PM (#24076411) Homepage

    What with the "software as a service" and "outsourcing system administration" fads, more sites rely on other sites being up when they power up. This could become a problem when bringing a site back up after an outage. It's important to know which sites have "black start" capability: the ability to start up without any resources from the outside.

    You can save money by outsourcing Linux system administration to Tomsk, Russia, [intrice.com] or Lotus system administration to India ("Remote System Administration for your Lotus Notes/Domino Servers, Infrastructure" [maargasystems.com]). But can you then restart your data center from a cold start, when the offshore admin people can't yet get in?
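
    A rough sketch of that kind of black-start audit, with entirely made-up dependency names and endpoints: probe each external service the site needs at boot and report anything that would block a cold start.

      # Hypothetical black-start readiness check; hostnames and ports are placeholders.
      import socket

      EXTERNAL_DEPENDENCIES = {
          "outsourced-admin-vpn":   ("vpn.example-outsourcer.com", 443),
          "hosted-auth-service":    ("auth.example-saas.com", 443),
          "offsite-license-server": ("license.example-vendor.com", 27000),
      }

      def can_reach(host, port, timeout=5):
          """Return True if a TCP connection to host:port succeeds within the timeout."""
          try:
              with socket.create_connection((host, port), timeout=timeout):
                  return True
          except OSError:
              return False

      def cold_start_blockers():
          """Names of external dependencies that are currently unreachable."""
          return [name for name, (host, port) in EXTERNAL_DEPENDENCIES.items()
                  if not can_reach(host, port)]

      if __name__ == "__main__":
          blockers = cold_start_blockers()
          if blockers:
              print("Cold start would block on:", ", ".join(blockers))
          else:
              print("All external dependencies reachable.")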

  • by Anonymous Coward on Sunday July 06, 2008 @03:15PM (#24076463)

    This is the kind of work that always seems to take a back seat to development due to resource constraints, but it really needs to occur in tandem with the development process.

    That's not true. As the Twitter, Digg, Flickr, etc. examples clearly show, it's much more important to appear "pretty decent" when you corner the market than anything else. The cost of doing it properly from the get-go cannot be shouldered by a company with an unproven concept, neither time- nor money-wise. Most of these services are 99.9% user base and 0.1% implementation. If you can get the users with a rough sketch, it is then much easier to get the resources for even a complete rewrite of the server software. Besides, this isn't even a business-biased view: most programmers agree that the first implementation is for understanding the problem and the second implementation is for solving it.

  • by nabsltd ( 1313397 ) on Sunday July 06, 2008 @05:28PM (#24077513)

    Fault-tolerant at every practical level. This gets expensive, so you see datacenter failures take down large swaths of sites that don't have multiple locations.

    I work on a site that has pretty much every conceivable fault tolerance you can get short of multiple sites: multiple separate ISPs, redundant router and firewall hardware for each ISP, and multiple load-balanced front-end web servers connected to load-balanced database and file servers (with every server running Solaris). Everything has multiple power supplies connected to different mains feeds and different generators. All of this is frightfully expensive and heavily monitored.

    Yet the #1 cause of downtime is the clustering software on the file servers failing to actually fail over when something goes wrong. So whenever the file system mounts fail, the whole system is down until those servers are rebooted, which takes 1-2 hours because of the clustering software.

    If those file servers had been relatively cheap with no redundancy, they could have been rebooted quickly and the file system mounts automatically recovered within 15 minutes.

    So the moral here is that more fault tolerance isn't always the best way to maintain uptime. Carefully deciding where to spend money, and on what type of fault tolerance, will get you more real-world uptime in the long run than spending unwisely to increase statistical uptime.
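
    For illustration only, a Linux-flavored sketch of the "cheap and quickly recoverable" alternative (the mount point and commands are assumptions, not the poster's actual Solaris setup): a watchdog that checks whether a shared mount still answers and remounts it instead of waiting on cluster software.

      # Hypothetical mount watchdog; run from cron or a systemd timer.
      # Assumes the mount point is listed in /etc/fstab.
      import subprocess

      MOUNT_POINT = "/export/shared"   # placeholder shared file system mount

      def mount_is_healthy(path, timeout=10):
          """Stat the mount in a child process so a hung mount can't hang the watchdog."""
          try:
              result = subprocess.run(["stat", path], timeout=timeout,
                                      capture_output=True)
              return result.returncode == 0
          except subprocess.TimeoutExpired:
              return False

      def remount(path):
          """Lazy-unmount and remount; crude, but far faster than a 1-2 hour cluster reboot."""
          subprocess.run(["umount", "-l", path], capture_output=True)
          subprocess.run(["mount", path], capture_output=True)

      if __name__ == "__main__":
          if not mount_is_healthy(MOUNT_POINT):
              remount(MOUNT_POINT)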

  • no, but... (Score:2, Interesting)

    by ClarisseMcClellan ( 1286192 ) on Sunday July 06, 2008 @07:51PM (#24078525)

    AVG is probably why we have this post this week. There were a lot of timeouts last week, although Grisoft was not the only problemo. For a while last week, Virgin Media customers in the UK lost a couple of continents, with the U.S.A. and Australia dropping off the map. I had to read Pravda instead of Slashdot for an hour or two...

    My backup route actually worked fine, and I was just in the middle of getting a squid proxy server of my own up and running when the network problems magically fixed themselves. There are lessons to be learned: if you need your internet more than is healthy, then you also need a backup plan. This could be a wifi sharing agreement with the neighbours or a proxy server at work that you can dial into from home. The internet does not dynamically re-route traffic when there is a problem with a major link. This is a problem. I thought we would have TCP/IP over ATM or something like that to solve this by now.
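
    A minimal sketch of that kind of backup plan, assuming a hypothetical squid proxy reachable at work (the address is a placeholder): try the direct route first, and fall back to the remote proxy when it fails.

      # Illustrative fallback fetch; BACKUP_PROXY is an assumed squid instance.
      import urllib.request

      BACKUP_PROXY = "http://proxy.work.example:3128"

      def fetch(url, timeout=10):
          """Fetch a URL directly, falling back to the backup proxy if the direct path fails."""
          try:
              with urllib.request.urlopen(url, timeout=timeout) as resp:
                  return resp.read()
          except OSError:
              opener = urllib.request.build_opener(urllib.request.ProxyHandler(
                  {"http": BACKUP_PROXY, "https": BACKUP_PROXY}))
              with opener.open(url, timeout=timeout) as resp:
                  return resp.read()

      if __name__ == "__main__":
          print(len(fetch("http://slashdot.org/")), "bytes fetched")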

  • Re:But what happens (Score:3, Interesting)

    by Geak ( 790376 ) on Monday July 07, 2008 @12:46AM (#24080349)
    I can't really trust those network monitoring sites. They aren't accurate. All they can tell you is that the site is down "from their location". I work for a webhosting company, and I've run into numerous cases where a customer is screaming that his website is down because a network monitoring site sent him a report saying so. The truth of the matter was that the site was up the entire time (the customer could even get to the site when I had them actually try). If a node goes down anywhere between the monitoring site and the user's website, they get a false positive.

    On top of that, you have to wonder if any of these monitoring sites are deliberately sending false reports. Back when I was working for an ISP, I remember some kind of network monitoring software came out, and a number of people were installing it on their computers. It would start warning customers that their "network connection was saturated - blah blah blah" and customers would call in blaming us. Within a few days I started seeing reviews on the net about the product, and some research showed that it was deliberately generating false reports for anybody who wasn't with a certain large coaster-shipping ISP. Apparently the software company was a shareholder. I can't remember the name of the product, however; this was back in the old dial-up days.
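
    One way around the single-vantage-point problem, sketched below with placeholder probe URLs (nothing here comes from the comment above): only declare an outage when a majority of independent probe locations fail to see the site.

      # Illustrative majority-vote outage check; probe endpoints are hypothetical
      # checks run from different networks/locations.
      import urllib.request

      PROBE_URLS = [
          "http://probe-us.example.net/check?target=customer-site.example.com",
          "http://probe-eu.example.net/check?target=customer-site.example.com",
          "http://probe-ap.example.net/check?target=customer-site.example.com",
      ]

      def probe_says_up(url, timeout=5):
          """Treat any 2xx answer from the probe as 'the target looked up from here'."""
          try:
              with urllib.request.urlopen(url, timeout=timeout) as resp:
                  return 200 <= resp.status < 300
          except OSError:
              return False  # probe unreachable, or target down from that vantage point

      def site_is_down():
          """Declare an outage only when fewer than half the vantage points saw the site up."""
          up_votes = sum(probe_says_up(url) for url in PROBE_URLS)
          return up_votes < len(PROBE_URLS) / 2

      if __name__ == "__main__":
          print("outage" if site_is_down() else "reachable from most probes")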
