The Internet

Should Archive.org Ignore Robots.txt Directives And Cache Everything? (archive.org) 174

Archive.org argues robots.txt files are geared toward search engines, and now plans instead to represent the web "as it really was, and is, from a user's perspective." "We have also seen an upsurge of the use of robots.txt files to remove entire domains from search engines when they transition from a live web site into a parked domain, which has historically also removed the entire domain from view in the Wayback Machine... We receive inquiries and complaints on these "disappeared" sites almost daily."
In response, Slashdot reader Lauren Weinstein writes: We can stipulate at the outset that the venerable Internet Archive and its associated systems like Wayback Machine have done a lot of good for many years -- for example by providing chronological archives of websites who have chosen to participate in their efforts. But now, it appears that the Internet Archive has joined the dark side of the Internet, by announcing that they will no longer honor the access control requests of any websites.
He's wondering what will happen when "a flood of other players decide that they must emulate the Internet Archive's dismal reasoning to remain competitive," adding that if sys-admins start blocking spiders with web server configuration directives, other unrelated sites could become "collateral damage."

But BoingBoing is calling it "an excellent decision... a splendid reminder that nothing published on the web is ever meaningfully private, and will always go on your permanent record." So what do Slashdot's readers think? Should Archive.org ignore robots.txt directives and cache everything?
This discussion has been archived. No new comments can be posted.

  • yeah (Score:5, Informative)

    by Anonymous Coward on Sunday April 23, 2017 @04:11AM (#54285921)

    yeah!

    • Re:yeah (Score:5, Informative)

      by ArmoredDragon ( 3450605 ) on Sunday April 23, 2017 @04:25AM (#54285953)

      Law of headlines indeed, and there's already an established way for web developers to indicate that they don't want content cached or archived while still being searchable:

      <meta name="robots" content="noarchive">

      So archive.org could just honor that, and the problem would be solved. Google honors exactly this.
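
A minimal sketch of how a crawler might honor that tag, assuming it already has the page's HTML in hand. The names `RobotsMetaParser` and `may_archive` are hypothetical, and this is not a description of how archive.org or Google actually implement it.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None for bare attributes, so default to "".
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").strip().lower() != "robots":
            return
        for token in attr.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def may_archive(html):
    """Return False if the page asks robots not to keep an archived copy."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noarchive" not in parser.directives
```

A page carrying `content="noindex, nofollow"` would still be archivable under this check, since only `noarchive` is treated as an archiving opt-out here.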

      • Re:yeah (Score:5, Informative)

        by Zocalo ( 252965 ) on Sunday April 23, 2017 @05:45AM (#54286115) Homepage
        An even more specific robots.txt directive for this case:

        User-agent: ia_archiver
        Disallow: /


        As is often the case, Lauren is going off half-cocked with only part of the story. The IA already has a policy for removal requests (email info@) and is only considering extending its current practice of ignoring robots.txt to sites outside its current "test zone" of the .gov and .mil gTLD domains, where it has not had any problems. They probably will do that (and for their archival purposes it's a good idea in principle), but before starting to scream and shout I think it's only fair to see whether they listen to the feedback and provide a specific opt-out policy and technical mechanisms, such as at least honoring either of the above, prior to going live on the rest of the Internet. It's going to be a two-way street anyway because they're going to find a lot more sites that feed multiple-MB of pseudo-random crap to spiders that ignore robots.txt to try and do things like poison spammers' address lists, so it's actually in their best interests to provide an opt-out they honor.

        Besides, it's going to be interesting to see what kind of idiotic crap web admins who should know better think is safely hidden and/or secured because of robots.txt - it's useful to know who is particularly clueless so you can avoid them at all costs. :)
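
As a rough illustration of what a robots.txt record aimed at one crawler does, the sketch below uses Python's standard `urllib.robotparser`. The helper name `allowed` and the example URL are assumptions made for the illustration; the `ia_archiver` agent name comes from the comment above.

```python
from urllib.robotparser import RobotFileParser

# A record that targets a single crawler and leaves everyone else alone.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /
"""


def allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

With this file, `allowed(ROBOTS_TXT, "ia_archiver", "https://example.com/page.html")` is refused while any other agent name is let through. The field name has to be the hyphenated `User-agent`: this parser does not recognize a record that starts with `User Agent:`, so such a record blocks nothing.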
        • It's going to be a two-way street anyway because they're going to find a lot more sites that feed multiple-MB of pseudo-random crap to spiders that ignore robots.txt

          I don't think archive.org actually spiders things any more. They've been on-demand archival for, what, over a decade?

          I mean, they had the Alexa toolbar that automatically submitted everything that the user browsed to their index, and that is (was?) likely their main source of entries...

          Try looking at an unpopular site, and you'll find few and incomplete entries spanning over several years, especially as you go deeper than the front page. But a popular web site has archive entries available for pretty much e

          • Re:yeah (Score:5, Informative)

            by Zocalo ( 252965 ) on Sunday April 23, 2017 @07:11AM (#54286309) Homepage
            IA does still spider, but they seem to use a more nuanced system than the rudimentary "start at /, then recursively follow every link" approach used by more trivial site spider algorithms. Firstly, they don't download an entire site in one go - they spread things out over time to avoid putting large spikes into the traffic pattern, which is more friendly for sites that are bandwidth limited and on things like "xGB/month" plans. Secondly, they have a "popularity weighting" system that governs the order they spider and refresh sections of a given site, which is the main reason for the difference between the level of content for popular and less popular sites - although I have no idea whether that's based entirely off something like the site's Alexa ranking or is also weighted against how dynamic the content is (e.g. a highly dynamic site like Slashdot would get a bump up the priority, whereas a mostly static reference site might get downgraded). Combine the two approaches and you get the results you are seeing: major web homepages get spidered more or less every day with several levels of links retrieved, while some random personal blog only gets spidered every few weeks or more, and only with the homepage and first level or two of links ever getting looked at.
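
The scheduling behaviour described above can be sketched with a priority queue. This is purely illustrative: the popularity score in [0, 1], the linear interval formula, and the names `revisit_interval_days` and `schedule` are all assumptions, not anything the Internet Archive has published.

```python
import heapq


def revisit_interval_days(popularity, base_days=30.0, min_days=1.0):
    """Map a popularity score in [0, 1] to a revisit interval in days.

    More popular sites get shorter intervals (assumed linear relationship).
    """
    return max(min_days, base_days * (1.0 - popularity))


def schedule(sites, horizon_days):
    """Return (day, site) visits in chronological order up to horizon_days.

    sites maps a site name to its popularity score.
    """
    heap = [(0.0, name) for name in sites]
    heapq.heapify(heap)
    visits = []
    while heap:
        day, name = heapq.heappop(heap)
        if day > horizon_days:
            continue  # past the horizon: drop the site without rescheduling
        visits.append((day, name))
        next_day = day + revisit_interval_days(sites[name])
        heapq.heappush(heap, (next_day, name))
    return visits
```

Under these assumed numbers, a site with popularity 1.0 is revisited daily, while one with popularity 0.0 is revisited once every 30 days, which matches the general shape of the difference described in the comment.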
            • while some random personal blog only gets spidered every few weeks or more,

              Well, my experience (as a user of archive.org, not as a webmaster) is more like 'every few years'...

              FWIW, I mostly look up old static sites from around fifteen years ago. Back when people still had hitcounters.

      • But should that matter? If the website is public-facing, why should you not be able to archive it (regardless of the owner's wishes)? I can take pictures of houses I see from the street. The law seems fairly straightforward here, and it is easy to build any sort of wall around your website you wish to keep the public and archivers out.

  • by haruchai ( 17472 ) on Sunday April 23, 2017 @04:14AM (#54285931)

    Cautiously saying yes to this but it may have consequences I haven't considered

    • Bandwidth seems like a likely problem if everyone does it.
      • by Zocalo ( 252965 )
        I think the law of averages would take care of that. Bandwidth is pretty cheap and the chances are that even if you are constrained by bandwidth, as might be the case with a smaller site on an "xGB/day" hosting plan, then it's more likely to be the case there won't be too many GB of content to spider in the first place. There are always exceptions though, and where there is a real problem there are still going to be workarounds, e.g. explicit opt out clauses for spiders like IA's or, if all else fails, de
        • My process is that if you go around the robots.txt, you're hostile, and you route to null on the next access. If you attempt to directly access cached URLs, you're hostile, same answer. The file of IPv4 and IPv6 addresses that have attempted this is easily a half-mile long.

          Happy to add archive.org to it. Baidu, Bing, and yes, Google, are already there. Most of them have been from AWS instances snooping around. They get the same answer.

          • If you attempt to directly access cached URLs, you're hostile, same answer.

            How you define "cached URLs" could determine how much money you have to spend fielding support calls from legitimate users who have bookmarked a document on your site.

            • The site is static. It goes through revision. No one in their right mind bookmarked sites -- it gets 100 legit visits a year. It's a honeypot.

              But spiders cache URLs and try to find them again. Nope.

    • by Anonymous Coward

      Cautiously saying yes to this but it may have consequences I haven't considered

      And that's how we ended up with Donald Trump, you bastard!

      • by Anonymous Coward

        And somehow the world is still relieved it wasn't Hillary Clinton...

        If given the choice of picking someone who could ruin your life, would you pick the evil conniving devil you know, or the bumbling orange buffoon you don't know?

  • No brainer (Score:5, Insightful)

    by fnj ( 64210 ) on Sunday April 23, 2017 @04:14AM (#54285933)

    Duh. Naturally it should. The notion that robots.txt should operate RETROACTIVELY is asinine.

    • Re:No brainer (Score:5, Insightful)

      by thsths ( 31372 ) on Sunday April 23, 2017 @04:20AM (#54285941)

      But that is not the question asked, is it?

      robots.txt should apply to the page at the time. I do not see any decent argument against that.

      But arguably robots.txt should not be a way to retroactively mark previously archived content as inaccessible.

      • Re:No brainer (Score:5, Insightful)

        by blackest_k ( 761565 ) on Sunday April 23, 2017 @07:11AM (#54286307) Homepage Journal

        One problem I run into is with owner's manuals for old film cameras: a lot of the time they disappear from the company website when the company gets taken over by another company. Sometimes archive.org can come to the rescue if I can find where they used to be. Fair enough, the new company may only be interested in the digital models and have no interest in the historical products made by the company it acquired, but then it makes boneheaded choices like erasing the historical information the original company put out for its customers...

        Worse still is when a domain name lapses and is bought by another company that had zero access to the content of the former site. They bought a name, not a right to control the history of the former site.

        The other thing which bugs me is the whitewashing of old news articles, and how often that trick gets pulled. I might personally remember an event but find the contemporary records are missing. That happens a lot, especially in politics, when a past stance becomes embarrassing and then you get told black was white...

        At the very least when a website changes hands the new owner should not be able to erase the history of the site under the previous owner.

        • Re:No brainer (Score:5, Insightful)

          by mrchaotica ( 681592 ) * on Sunday April 23, 2017 @04:56PM (#54288713)

          The other thing which bugs me is the whitewashing of old news articles, and how often that trick gets pulled. I might personally remember an event but find the contemporary records are missing. That happens a lot, especially in politics, when a past stance becomes embarrassing and then you get told black was white...

          This is the single most important reason there could ever be!

      • by c ( 8461 )

        But arguably robots.txt should not be a way to retroactively mark previously archived content as inaccessible.

        Exactly. The policy where someone with no interest in a site (i.e. takeovers, lapsed domains, etc) can retroactively wipe all archives with just a couple lines in a config is flat-out wrong.

        Ignoring robots.txt entirely, though, is a bad idea. Some sites use it to block archiving, sure, but some others use it to tell robots to avoid places where they'll never return from. There's a case for ignoring "D

      • It's even worse, the domain name for a retro-gaming related website I consulted via wayback expired and was re-registered; the new robots.txt file now makes the old website inaccessible!
      • Absolutely! robots.txt applies at the time, not retroactively. I see nothing in the robots.txt standard that implies it should apply retroactively. It's basically a do/don't crawl *now*, it says nothing of what was the case before. So, if it was allowed per robots.txt when the pages were crawled, those are fair game for archiving ... period. archive.org or anyone else deeming themselves the exception regarding what robots.txt specified, and crawling where it says not to - that's wrong direction to go -
    • by Anonymous Coward

      While the retroactive nature is indeed dumb, the simple fact is that if I don't want content I created/own to be copied by archive.org, it shouldn't be. And that should include content that maybe I didn't have a problem with being mirrored previously, but now do, albeit not through a stupid retroactive robots.txt file. This is throwing the baby out with the bathwater.

    • Re:No brainer (Score:5, Insightful)

      by dissy ( 172727 ) on Sunday April 23, 2017 @08:54AM (#54286663)

      It should be even easier than that.

      Archive.org should archive everything, including the robots.txt contents, at each scan.

      The content being displayed from the archive.org website itself however could then still honor robots.txt at the time of the scan, purely for "display" purposes.

      This way changing robots.txt to block search engines would not delete or hide any previous information.
      Also the new information would still be in the archive, even if not displayed due to the current robots.txt directives.

      Although it would require more work to do so properly, this would potentially allow for website owners to retroactively "unhide" content in the archive in the past as well.
      Proper in this case would require some way to verify the domain owner, but this could likely be as simple as creating another specifically named text file in the website's root path, with content provided by the archive.
      That can be as simple as the old school "cookie" data like so many other services use such as Google, or as complex as a standard that allows date ranges specified along with directives.

      But in any case, this would preserve copies of the website for future use, such as for when copyright protection expires.
      Despite everyone having a differing opinion on just how long "limited time" should be in "securing for limited times to authors and inventors the exclusive right to their respective writings and discoveries", no one who wants to be taken seriously can argue that this expiration must never happen.
      Since the vast majority of authors make no considerations to protect our property, that task clearly needs to fall on us to secure.
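
The idea in this comment, storing robots.txt with every capture and applying it only at display time, might look roughly like the sketch below. The `Snapshot` record, the function names, and the default agent name are assumptions for illustration, not the Wayback Machine's actual data model.

```python
from dataclasses import dataclass
from urllib.robotparser import RobotFileParser


@dataclass
class Snapshot:
    url: str
    captured_on: str  # ISO date of the capture
    body: str         # page content as captured
    robots_txt: str   # robots.txt exactly as it was at capture time


def displayable(snapshot, user_agent="ia_archiver"):
    """Judge a snapshot only by the robots.txt stored alongside it."""
    parser = RobotFileParser()
    parser.parse(snapshot.robots_txt.splitlines())
    return parser.can_fetch(user_agent, snapshot.url)


def visible_history(snapshots, user_agent="ia_archiver"):
    """Dates of the captures that may be shown; hidden ones stay stored."""
    return [s.captured_on for s in snapshots if displayable(s, user_agent)]
```

Because each capture carries its own robots.txt, a later "Disallow: /" hides only the captures made under it. Earlier captures remain visible and nothing is deleted from storage.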

  • by Lorens ( 597774 ) on Sunday April 23, 2017 @04:17AM (#54285937) Journal

    It is also for variable random content. Imagine a service that returns a webpage containing the product (of the multiplication) of two numbers, followed by a list of links to ten other random number pairs you could try. It would take a 1kB page to write, but infinite space to archive *all* the results. For effect, imagine the service generates a video to show a kid how to multiply the two numbers, or drive from one place to another, or whatever use people have now found for the Internet.

    • by DrYak ( 748999 ) on Sunday April 23, 2017 @04:37AM (#54285981) Homepage

      It is also for variable random content. Imagine a service that returns a webpage containing the product (of the multiplication) of two numbers, followed by a list of links to ten other random number pairs you could try. It would take a 1kB page to write, but infinite space to archive *all* the results

      And archive.org already has a correct behaviour for that:
      - it won't try to download the whole infinity of results in one go (e.g. generating gigabytes' worth of data out of the 1kB Perl/PHP/NodeJS/whatever source)
      - instead it will occasionally rescan the page, every few days (more or less frequently, depending on the popularity of the links)
      It provides a small glimpse of what a user could have seen back then on the website.

      By the way, back in the 2000s, this was exactly a popular way to poison the spam robot spiders that were scanning the web for e-mail addresses.
      - Either they honour robots.txt and do not scan that or any other source of e-mail addresses on the site.
      - Or they ignore robots.txt and follow links they aren't authorised to, and end up siphoning gigabytes' worth of bogus e-mail addresses auto-generated by a small Perl script, which will pollute their base of harvested addresses.

      Archive.org's spider might be a tiny bit more susceptible to this kind of thing.
      Not as much as a spam e-mail-harvesting spider (which will try to download as much as possible, much more aggressively than archive.org), but such a labyrinth of links might still get the archiver lost.

      • You are assuming the web page will have the same URL. But what if the script auto-generates new URLs for each request ?

        Then they will get an unlimited amount of web pages.

        It's not hard to make a web page that works that way.
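
One common defence against this kind of link labyrinth is to cap both the number of pages and the link depth per site. The sketch below simulates such a trap and a bounded breadth-first crawl. The function names, the default limits, and the `trap.example` host are hypothetical, and no real crawler's limits are implied.

```python
from collections import deque


def bounded_crawl(start_url, get_links, max_pages=50, max_depth=3):
    """Breadth-first crawl that stops at a page budget and a depth limit.

    get_links(url) returns the links found on url; here it stands in for
    an actual HTTP fetch plus HTML parsing.
    """
    seen = {start_url}
    queue = deque([(start_url, 0)])
    fetched = []
    while queue and len(fetched) < max_pages:
        url, depth = queue.popleft()
        fetched.append(url)
        if depth >= max_depth:
            continue  # do not follow links any deeper than max_depth
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return fetched


def labyrinth_links(url):
    """Simulated trap: every page links to ten brand-new URLs."""
    return [f"{url}/{i}" for i in range(10)]
```

Even though the simulated site never runs out of fresh URLs, `bounded_crawl("http://trap.example", labyrinth_links)` returns after fetching at most `max_pages` pages.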

  • Block wildcard (Score:2, Interesting)

    by Anonymous Coward

    archive.org should block wildcard robots.txt, eg ones that say block everything. With a few exceptions:

    Image boards (e.g. 4chan, reddit, and similar forums): due to how frequently they change, there will never be any possibility of archiving a complete state of any specific thread before it's purposely purged, and the rampant piracy would only lead to further DMCA requests aimed at archive.org

    Piracy sites - For obvious reasons.

    Domain parking - A domain parking site should be treated as spam.

    • by mikael ( 484 )

      It could archive a specific thread on a board once there has been no activity for over six months.

    • Piracy sites -- they deserve special protection, as they're very likely to be disappeared against their owner's wishes.

      Image boards -- a glimpse into ephemeral content is worth keeping, even if you miss most of it.

      Domain parking -- I agree with you, they're 100% spam. But they're the primary reason such deletion must not be retroactive.

  • by Anonymous Coward

    If they do that on my sites (and many others I'm sure) they'll get locked out.

    • Re: No (Score:2, Informative)

      by Anonymous Coward

      Robots.txt is a suggestion, not a requirement.

  • No. (Score:5, Insightful)

    by Gravis Zero ( 934156 ) on Sunday April 23, 2017 @04:44AM (#54285995)

    A public act by an organization ignoring robots.txt will only lead to the justification of other organizations ignoring robots.txt. Effectively ignoring it erodes the value of robots.txt. Sure, some underhanded people will ignore it but I don't see organizations openly ignoring it.

    If you have an example of an organization completely ignoring robots.txt, do tell.

    • The value of it is already eroded, since it's messing up their site's intended purpose. They're just trying to correct for a technical problem; this is a technical problem, not some ethical dilemma.
      • by Anonymous Coward

        archive.org ignoring robots.txt is a slippery slope, indeed. but there is no 'technical problem' here.

        a web site operator specifically CHOOSES ON THEIR OWN to include directives in robots.txt to tell a bot to 'fuck off'. if they choose to add wayback machine to their robots.txt file, it is their choice, and archive.org should always honor such request

        • by Anonymous Coward

          No. The purpose of the file is to let crawlers know that a page is not suitable for indexing, and/or to give site operators an "opt out" capability for crawlers which CHOOSE to offer such features. If you have a problem with a particular company then block their IP space.

    • Re:No. (Score:5, Insightful)

      by dwywit ( 1109409 ) on Sunday April 23, 2017 @05:13AM (#54286051)

      robots.txt is a polite way of saying "please don't"

      But your website is there for the world to see. If someone, anyone chooses to ignore your polite request, well, so what? Why did you put your content up there for the world to see?

      • by Anonymous Coward

        People like you are why we can't have nice things. You think it's fine to do whatever you like to other people as long as it's not punishable by law, just because you can, no matter what their opinion.

        IOW: You're an asshole.

      • robots.txt is a polite way of saying "please don't"

        But your website is there for the world to see. If someone, anyone chooses to ignore your polite request, well, so what? Why did you put your content up there for the world to see?

        This right here needs elaboration. Sure, I can put my stuff on a webserver for the world to see. But you see, what I didn't sign up for is every search engine to download all my webpages and make them available in search results. Feel free to poke my website as a human, but not as an indexer, hence robots.txt asking robots to bother someone else.

    • The use of robots.txt only makes the internet somewhat harder to search. I fucking hate it when some scientific publisher haplessly uses robots.txt, only to make search of their published content nearly impossible to find. Fuck that, fuck robots.txt and the train it came with.

      • The use of robots.txt only makes the internet somewhat harder to search. I fucking hate it when some scientific publisher haplessly uses robots.txt, only to make search of their published content nearly impossible to find. Fuck that, fuck robots.txt and the train it came with.

        Keep in mind, if the world collectively decides to ignore robots.txt, a polite and easy way to tell indexers to go away, people will take stronger measures to prevent indexers from doing unwanted things with content they don't own and have no rights to, right up to blocking indexer sourced requests outright, no robots.txt, no http, just the middle finger of 'connection closed by foreign host.'

    • A public act by an organization ignoring robots.txt will only lead to the justification of other organizations ignoring robots.txt.

      So what? When DoubleClick argues that they ought to have the same advantages as Archive.org, they'll only manage to look like douchebags reaching their filthy hands into a cookie jar.

      It's not always a bad thing to set up douchebag-honeypot moral exemption, even if it does depend on the mass audience (mostly) managing to find two sticks to rub together.

      The real solution here is

    • A public act by an organization ignoring robots.txt will only lead to the justification of other organizations ignoring robots.txt. Effectively ignoring it erodes the value of robots.txt. Sure, some underhanded people will ignore it but I don't see organizations openly ignoring it.

      If you have an example of an organization completely ignoring robots.txt, do tell.

      I gotta agree with this. The mechanism of robots.txt needs to be respected in all cases, lest it become obsolete and ignored if big enough players decide it is meaningless and ignorable.

      I personally don't give a hoot about my page(s) appearing in an archive, what I don't want, is Google, Bing, Yahoo, or anyone else, indexing my pages so they might appear in search results with terms that may be present on my pages. Not hiding anything, frankly there's almost nothing on my webserver (visible at least), eve

    • will only lead to the justification of other organizations

      Well, if organizations don't even need to "justify" what they scan or don't scan, then this is a non-argument.

  • by Snard ( 61584 ) <[moc.liamg] [ta] [kulawahs.ekim]> on Sunday April 23, 2017 @04:51AM (#54286001) Homepage
    Maybe there can be a separate directive/section added inside robots.txt that gives direction to sites like archive.org on these matters. So both search engines and archival systems can behave honorably. If someone really does not want their site archived for the ages, archive.org should clearly respect that.
    • by Anonymous Coward

      Already possible, practically since the inception of robots.txt.


      User-agent: archive.org_bot
      Disallow: /

    • by blind biker ( 1066130 ) on Sunday April 23, 2017 @05:27AM (#54286071) Journal

      Then why even have a website visible on the internet, if you don't want it searchable and archivable? Those two effectively mean "invisible" - because as long as it is visible, it is also archivable - if nothing else, manually.

      • by Zocalo ( 252965 )
        Try explaining that to the legacy mainstream media dinosaurs that are still busy taking Google to court for spidering, indexing, and linking to their content, despite the debacle of Spain [arstechnica.com] a few years back, and see how far it gets you. Common sense is in short supply in some corners of the Internet, and fairly large corners at that.
  • by Anonymous Coward

    Section 1 a & b (http://www.legislation.gov.uk/ukpga/1990/18/section/1)

    Access to the information is unauthorised (robots.txt says no) but they do it anyway and wilfully.

    • The British Library also maintain an archive. The FAQ relating to their crawler is quite an eye opener:

      (http://www.bl.uk/aboutus/legaldeposit/websites/websites/faqswebmaster/)

      : Do you respect robots.txt?
      : As a rule, yes: we do follow the robots exclusion protocol. However, in certain circumstances we may choose to overrule robots.txt. For instance: if content is necessary to render a page (e.g. Javascript, CSS) or content is deemed of curatorial value and falls within the bounds of the Legal Deposit Librar

  • robot.txt doubly so.
  • The problem is new owners of domains adding a robots.txt causing the archive to remove old site scrapes. It seems entirely reasonable to assume that adding robots.txt file should only apply to current content as chances are that prior content is not content that is owned by a new owner of a domain. I think that existing content should remain but new scrapes stop when a new robots.txt file appears on the domain. A complaints procedure then provided for content owners who didn't realise that their content wa
    • by arth1 ( 260657 )

      The problem with robots.txt is that it doesn't contain a validity period.

      Say I add mustnotbecrawled.html, a link to it in existingpage.html, and a modification to /robots.txt that bans crawling of mustnotbecrawled.html. The problem is that a robot might have downloaded robots.txt right before my publishing, and does not see that it shouldn't crawl it. So it does.

      It could be argued that a crawler should always re-load robots.txt if encountering a document newer than the last server transmit time for robot

  • If, on day 1, the robots.txt file of a given site allows information to be collected and archive.org collects it, they would be fully complying with robots.txt. If, on day 2, that site modifies the robots.txt file and restricts access for all bots, archive.org shouldn't collect any more information, but why delete what was rightfully stored on day 1? Such a deletion would be exclusively motivated by their own policy, not by what should be expected from robots.txt compliance.

    A different story would be determin
    • Some clarifications just in case:
      - I don't think that archive.org or any other site should fully ignore robots.txt, or any other express indication of what the website owner wants.
      - The robots.txt files of my two sites don't include any kind of restriction and never did.
      - All the crawling bots which I develop (currently running ones ranking web domains) always respect robots.txt or, depending upon the exact conditions, anything else which clearly indicates the site owner expectations.
      - I am not precisely a
  • that "Turn on, tune in, drop out" was pert and remains so, even more so now with such a netdorked world. WTF are you bickering about it still?
  • YES!! (Score:5, Insightful)

    by Vadim Makarov ( 529622 ) <makarov@vad1.com> on Sunday April 23, 2017 @06:48AM (#54286255) Homepage

    I applaud the direction internet archive takes. They should fully implement it.

    A year ago one of my domain names was stolen, through negligence of the registrar. The site was a non-profit resource that I maintained for the past 15 years. The squatter who now owns the name put deny all in robots.txt. As a result, the website with some quantity of useful information has totally disappeared from existence and from the archive record.

    I do not see sufficiently important reasons to remove information that was once in public access. There are some reasons, however the public benefits of having access to all past public information outweigh them all.

  • by jonwil ( 467024 ) on Sunday April 23, 2017 @07:29AM (#54286379)

    Think about a big site like github.com.
    Imagine how many terabytes of pretty-printed source code and other things archive.org would be pulling were it to crawl all of GitHub.
    And that's just one site, there are many others that generate pretty-printed source code and other large things.

    Or what about if it crawls Google and starts archiving all sorts of Google search URLs or Google maps URLs or whatever.

    • by allo ( 1728082 )

      They aren't that dumb ... whoever writes a crawler puts in some protections against too big websites or sites autogenerating content with dynamic urls. So for example they put non-popular github links on the end of a queue to check them after everything else was processed. So they may slowly add unimportant github content, but won't crawl terabytes of data just now, only some megabytes every now and then. Their bandwidth and storage capacity is limited as well.

    • Obviously (to some of us, anyway) the crawler should honor robots.txt, but the archive should not. Once something is in the archive it should be in there forever.

  • No. (Score:5, Interesting)

    by Megane ( 129182 ) on Sunday April 23, 2017 @08:13AM (#54286539) Homepage

    robots.txt is intended to indicate what parts of a site should not be scanned recursively, often for technical reasons such as generated content. It is especially meant for sub-paths like /cgi-bin/, but there is no technical reason why the content of any arbitrary URL can't be programmatically generated. It might be and you wouldn't even know it, because the generated content may be the same most of the time, such as a navigation menu.

    However, it was also not intended to be used to remove previously-archived content, as archive.org is currently using it. When an archived page changes status in robots.txt, they should note the first date that the status changed, then simply stop updating it until and if robots.txt re-allows it.

    Scanning and archiving are two different operations, and robots.txt is only intended to apply to the former.

  • Simple solution? (Score:5, Insightful)

    by RDW ( 41497 ) on Sunday April 23, 2017 @08:39AM (#54286619)

    How about this: respect the version of robots.txt that was on the site AT THE TIME OF ARCHIVING. Do not apply subsequent versions of robots.txt to old snapshots retroactively (as when a domain changes ownership), but allow the owner to request deletion when an appropriate robots.txt was omitted by mistake.

  • by Anonymous Coward

    I wanted to know if it was possible to delete content from the Internet Archive. Their FAQ and support staff were very vague and only referred me to the robots.txt file. I found that they archive everything even if you tell them not to. The robots.txt file only controls whether or not the public can view it.

    Experiment 1) Buy an expired domain and host it with a robots.txt file telling Internet Archive not to archive it. Before the experiment I confirmed that Internet Archive had a history for this expir

  • There are already flags like "noarchive" to get google to index the site, but not provide public "google cache" links (you can assume they still cache it, but that doesn't matter for you).
    So archive.org should ignore noindex directives, but not noarchive ones.
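
    For what it's worth, checking for those flags is not hard. A rough sketch using only the standard-library HTML parser; the RobotsMetaParser class and the may_archive function are names I made up, and this only looks at the robots meta tag, not at HTTP headers.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            # Directives are comma-separated, e.g. "noindex, noarchive".
            self.directives.update(
                d.strip().lower() for d in content.split(",") if d.strip()
            )


def may_archive(html):
    """Ignore noindex, but refuse to archive a page marked noarchive."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noarchive" not in parser.directives


noindex_page = '<html><head><meta name="robots" content="noindex"></head></html>'
noarchive_page = '<html><head><meta name="robots" content="noindex, noarchive"></head></html>'

noindex_ok = may_archive(noindex_page)
noarchive_ok = may_archive(noarchive_page)
```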

  • I conversed with the good people at Bing, and was told pretty much that it's a bug they don't intend to fix. They also told me how to code my website to get around the bug. Needless to say, I did the work to get around the bug. However, instead of restricting Bing from the parts of the site I restrict to all search engines, now Bing is totally restricted from browsing any part of the web site.

    .
    However, the way Microsoft has been acting recently (e.g., Windows 10 forced upgrades), I doubt if they even car

  • "Should Archive.org Ignore Robots.txt Directives And Cache Everything?"

    No.

  • don't publish it openly in the first place.

  • archive.org should ignore robots.txt as a means to prevent archiving material. archive.org should however be smart enough to know what can be ignored, based on content.
  • More data points to show you more McDonald's ads probably sounds awesome to them. You can't be cool, popular, and decent all at once, Xeni and crew.
  • Of course, I'm sure multiple TLAs already have a copy of everything, particularly anything political, technical, blogs, etc, including the "dark web" and other encrypted sites. A leopard can't change his spots and I would predict the FBI has dossiers on most American citizens. It's in their DNA, dating from J. Edgar at their birth. Now the dossiers are electronic, searchable, and probably do include real DNA. They likely include info from the older paper files - like that record of the subscription I had to
  • Whoever thought that was a good idea is a moron, full stop.
    Different archive copies from when the site was under different ownership should retain their own policies - whether fully restricted, not restricted at all, or in between. Yes, that will take up space, holding on to different copies of robots.txt files, linking them to websites, etc, but it is better than some archives not being available because of the site's current policy.
  • I also agree. On today's internet nobody asks for permission to show us advertising, to follow us around the internet showing us ads, to try to sell us things on social networks, and now ISPs will sell our browsing history. So why shouldn't a "public library" be able to just back up a full website? Also, it is completely lawful for them to copy contents and information: https://www.law.cornell.edu/us... [cornell.edu]

    Good luck, Internet Archive. Back up everything in the world!!! Preserve all knowledge!!!
  • First of all, everything should be archived for future generations and researchers. Otherwise, it defeats the whole point of the project.

    But for the general public, the robots.txt should be honored and content hidden, with a few conditions. First of all, it should not be retroactive. I've seen valuable information lost when domains have changed name and the new owner has blocked the contents with a robots.txt. Second of all, there should be a review system to override the robots.txt. For example, if a site

  • A robots.txt file is a nice way of telling another, "please don't copy my site." However, the more mature and sophisticated answer is "if you copy this portion of my site, you may be liable for copyright infringement." This whole problem is really a problem with the limitations of robots.txt. Telling someone "please do this" or "please don't do that" is not nearly as significant as "you have a right to do this" and "I will sue you if you do that".
  • By ignoring robots.txt, archive.org would be gaining unauthorized access to a computer system as access was expressly denied as per the Robots Exclusion Standard.

    To further disseminate the archived pages would be added infringements.

    I think that they need to campaign for site owners to modify their robots.txt and, if need be, lobby for exclusions to the Computer Misuse Act.

  • there is no law that says you have to obey the robots.txt, it's nice search engines etc obey the robots.txt, but they certainly don't have to.
  • Every site we develop has a dev instance with robots.txt set to disallow, and a prod instance. If the dev instances get exposed to the outside world (search, archive, wayback) then you have to teach reviewers how to use local vhost files, which will be a huge pain in the ass, or put .htaccess passwords on everything, which is just plain stupid.
  • If you think you want to keep a copy of my homebrew Ars Magica roleplaying game logging site feel free.
