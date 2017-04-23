Should Archive.org Ignore Robots.txt Directives And Cache Everything? (archive.org) 22
Archive.org argues robots.txt files are geared toward search engines, and now plans instead to represent the web "as it really was, and is, from a user's perspective." "We have also seen an upsurge of the use of robots.txt files to remove entire domains from search engines when they transition from a live web site into a parked domain, which has historically also removed the entire domain from view in the Wayback Machine... We receive inquiries and complaints on these "disappeared" sites almost daily."
In response, Slashdot reader Lauren Weinstein writes: We can stipulate at the outset that the venerable Internet Archive and its associated systems like Wayback Machine have done a lot of good for many years -- for example by providing chronological archives of websites who have chosen to participate in their efforts. But now, it appears that the Internet Archive has joined the dark side of the Internet, by announcing that they will no longer honor the access control requests of any websites.
He's wondering what will happen when "a flood of other players decide that they must emulate the Internet Archive's dismal reasoning to remain competitive," adding that if sys-admins start blocking spiders with web server configuration directives, other unrelated sites could become "collateral damage."
But BoingBoing is calling it "an excellent decision... a splendid reminder that nothing published on the web is ever meaningfully private, and will always go on your permanent record." So what do Slashdot's readers think? Should Archive.org ignore robots.txt directives and cache everything?
yeah (Score:1)
yeah!
Re: (Score:3)
Law of headlines indeed, and there's already an established way for web developers to indicate that they don't want content cached or archived while still being searchable:
<meta name="robots" content="noarchive">
So archive.org could just honor that, and the problem would be solved. Google honors exactly this.
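As a rough illustration of what honoring that tag could look like on the crawler side, here is a minimal Python sketch using only the standard library. The names `RobotsMetaParser` and `may_archive` are made up for this example, and nothing here reflects how archive.org or Google actually implement the check.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute names arrive lowercased; values may be None.
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").strip().lower() != "robots":
            return
        for part in attr_map.get("content", "").split(","):
            part = part.strip().lower()
            if part:
                self.directives.add(part)


def may_archive(html_text):
    """Hypothetical helper: False when the page asks not to be archived."""
    parser = RobotsMetaParser()
    parser.feed(html_text)
    return "noarchive" not in parser.directives


page = '<html><head><meta name="robots" content="noarchive"></head></html>'
print(may_archive(page))
```

A page tagged this way could still be indexed for search, which is the distinction the comment is drawing.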
Cautiously saying yes to this (Score:2)
but it may have consequences I haven't considered
Re: (Score:2)
No brainer (Score:2)
Duh. Naturally it should. The notion that robots.txt should operate RETROACTIVELY is asinine.
Re:No brainer (Score:4, Informative)
But that is not the question asked, is it?
robots.txt should apply to the page at the time. I do not see any decent argument against that.
But arguably robots.txt should not be a way to retroactively mark previously archived content as inaccessible.
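To make the distinction concrete, here is a small Python sketch of "apply robots.txt at capture time, never retroactively". It uses the standard `urllib.robotparser` module; the `Snapshot` record, the `example-archiver` user agent, and the function names are all assumptions for illustration, not a description of the Wayback Machine.

```python
from dataclasses import dataclass
from urllib.robotparser import RobotFileParser

USER_AGENT = "example-archiver"  # hypothetical crawler name


@dataclass
class Snapshot:
    url: str
    captured_on: str
    body: str


def allowed_now(robots_txt, url, agent=USER_AGENT):
    """Check a URL against the robots.txt text that is live right now."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules.can_fetch(agent, url)


def crawl(archive, robots_txt, url, date, body):
    """Store a snapshot only if robots.txt permits the fetch at crawl time."""
    if allowed_now(robots_txt, url):
        archive.append(Snapshot(url, date, body))


def visible_snapshots(archive, url):
    """Earlier captures stay visible whatever robots.txt says later."""
    return [s for s in archive if s.url == url]


archive = []
crawl(archive, "", "http://example.com/a", "2015-01-01", "live site")
crawl(archive, "User-agent: *\nDisallow: /", "http://example.com/a",
      "2017-01-01", "parked domain")
print([s.captured_on for s in visible_snapshots(archive, "http://example.com/a")])
```

In this sketch the later blanket `Disallow` stops new captures but leaves the 2015 capture readable, which is the behaviour the comment argues for.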
Re: No brainer (Score:1)
While the retroactive nature is indeed dumb, the simple fact is that if I don't want content I created/own to be copied by archive.org, it shouldn't be. And that should include content that maybe I didn't have a problem with being mirrored previously, but now do, albeit not through a stupid retroactive robots.txt file. This is throwing the baby out with the bathwater.
Robots.txt is not only for privacy (Score:3)
It is also for variable random content. Imagine a service that returns a webpage containing the product (of the multiplication) of two numbers, followed by a list of links to ten other random number pairs you could try. It would take a 1kB page to write, but infinite space to archive *all* the results. For effect, imagine the service generates a video to show a kid how to multiply the two numbers, or drive from one place to another, or whatever use people have now found for the Internet.
Random generated content (Score:4, Insightful)
It is also for variable random content. Imagine a service that returns a webpage containing the product (of the multiplication) of two numbers, followed by a list of links to ten other random number pairs you could try. It would take a 1kB page to write, but infinite space to archive *all* the results
And archive.org already has the correct behaviour for that:
- it won't try to download the whole infinity of solutions in one go (e.g. generating gigabytes' worth of data out of the 1kB Perl/PHP/NodeJS/whatever source)
- instead it will occasionally rescan the page, every few days (more or less frequently, depending on popularity of the links)
It provides a small glimpse of what a user could have seen back then on the website.
By the way, back in the 2000s, this was a popular way to poison the spiders of spam robots that were scanning the web for e-mail addresses.
- Either they honour robots.txt and don't scan that or any other source of e-mail addresses on the site.
- Or they ignore robots.txt and follow links they aren't authorised to, and end up siphoning gigabytes' worth of bogus e-mail addresses auto-generated by a small Perl script, which will pollute their base of harvested addresses.
Archive.org's spider might be a tiny bit more susceptible to this kind of thing.
Not as much as a spam e-mail-harvesting spider (which will try to download as much as possible, much more aggressively than archive.org), but still, such a labyrinth of links might get the archive's spider lost.
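The "infinite pages from a tiny script" situation, and one simple defence against it, can be sketched in a few lines of Python. Everything here is hypothetical: `multiplication_page` stands in for the multiplication service from the parent comment, and the per-site page budget is just one possible way to avoid getting lost in such a labyrinth, not a claim about how archive.org's spider works.

```python
import random
from collections import deque


def multiplication_page(a, b, rng):
    """Hypothetical service: the product a*b plus ten random follow-up links."""
    links = [(rng.randint(1, 10**6), rng.randint(1, 10**6)) for _ in range(10)]
    return a * b, links


def bounded_crawl(start, page_budget, seed=0):
    """Breadth-first crawl that stops after `page_budget` pages on one site."""
    rng = random.Random(seed)
    queue = deque([start])
    seen = set()
    archived = {}
    while queue and len(archived) < page_budget:
        pair = queue.popleft()
        if pair in seen:
            continue
        seen.add(pair)
        product, links = multiplication_page(*pair, rng)
        archived[pair] = product
        queue.extend(links)  # the frontier grows without end
    return archived


result = bounded_crawl((2, 3), page_budget=25)
print(len(result), result[(2, 3)])
```

Without the budget the queue grows by ten links per page and never empties; with it, the crawler keeps a small glimpse of the site and moves on.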
Block wildcard (Score:1)
archive.org should block wildcard robots.txt, e.g. ones that say block everything. With a few exceptions:
Image boards (e.g. 4chan, reddit, and similar forums) - Because they change so frequently, there will never be any possibility of archiving a complete state of any specific thread before it's purposely purged, and the rampant piracy would only lead to further DMCA requests aimed at archive.org
Piracy sites - For obvious reasons.
Domain parking - A domain parking site should be treated as spam.
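For what it's worth, spotting a "block everything" robots.txt is straightforward. The Python sketch below is a simplified reading of the format, under the assumption that a blanket rule means a `User-agent: *` group containing a bare `Disallow: /`. It ignores `Allow` overrides and path wildcards, and `is_blanket_disallow` is a made-up name.

```python
def is_blanket_disallow(robots_txt):
    """True if a 'User-agent: *' group contains a bare 'Disallow: /'."""
    agents = []        # user-agents named by the group being read
    in_rules = False   # True once the current group has seen a rule line
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a rule line closed the previous group
                agents, in_rules = [], False
            agents.append(value)
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow" and value == "/" and "*" in agents:
                return True
    return False


print(is_blanket_disallow("User-agent: *\nDisallow: /"))
```

What the archive then does with such a file (ignore it, or honour it for the exceptions listed above) is a policy decision this check does not make.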
Re: (Score:2)
It could archive a specific thread on a board once there has been no activity for over six months.
No. (Score:2)
A public act by an organization ignoring robots.txt will only lead to the justification of other organizations ignoring robots.txt. Effectively ignoring it erodes the value of robots.txt. Sure, some underhanded people will ignore it but I don't see organizations openly ignoring it.
If you have an example of an organization completely ignoring robots.txt, do tell.
Re: (Score:1)
Here is my clever idea... (Score:2)