
Why the Internet Archive is More Relevant Than Ever (npr.org)
It's "live-recording the World Wide Web," according to NPR, with a digital library that includes "hundreds of billions of copies of government websites, news articles and data."
They described the 29-year-old nonprofit Internet Archive as "more relevant than ever." Every day, about 100 terabytes of material are uploaded to the Internet Archive, or about a billion URLs, with the assistance of automated crawlers. Most of that ends up in the Wayback Machine, while the rest is digitized analog media — books, television, radio, academic papers — scanned and stored on servers. As one of the few large-scale archivists to back up the web, the Internet Archive finds itself in a particularly unique position right now... Thousands of [U.S. government] datasets were wiped — mostly at agencies focused on science and the environment — in the days following Trump's return to the White House...
The Internet Archive is among the few efforts that exist to catch the stuff that falls through the digital cracks, while also making that information accessible to the public. Six weeks into the new administration, Wayback Machine director [Mark] Graham said, the Internet Archive had cataloged some 73,000 web pages that had existed on U.S. government websites that were expunged after Trump's inauguration...
According to Graham, based on the big jump in page views he's observed over the past two months, the Internet Archive is drawing many more visitors than usual to its services — journalists, researchers and other inquiring minds. Some want to consult the archive for information lost or changed in the purge, while others aim to contribute to the archival process.... "People are coming and rallying behind us," said Brewster Kahle, [the founder and current director of the Internet Archive], "by using it, by pointing at things, helping organize things, by submitting content to be archived — data sets that are under threat or have been taken down...."
A behemoth of link rot repair, the Internet Archive rescues a daily average of 10,000 dead links that appear on Wikipedia pages. In total, it's fixed more than 23 million rotten links on Wikipedia alone, according to the organization.
Though it receives some money for its preservation work for libraries, museums, and other organizations, it's also funded by donations. "From the beginning, it was important for the Internet Archive to be a nonprofit, because it was working for the people," explains founder Brewster Kahle on its donations page: Its motives had to be transparent; it had to last a long time. That's why we don't charge for access, sell user data, or run ads, even while we offer free resources to citizens everywhere. We rely on the generosity of individuals like you to pay for servers, staff, and preservation projects. If you can't imagine a future without the Internet Archive, please consider supporting our work. We promise to put your donation to good use as we continue to store over 99 petabytes of data, including 625 billion webpages, 38 million texts, and 14 million audio recordings.
Two interesting statistics from NPR's article:
- "A Pew Research Center study published last year found that roughly 38% of web pages on the internet that existed in 2013 were no longer accessible as of 2023."
- "According to a Harvard Law Review study published in 2014, about half of all links cited in U.S. Supreme Court opinions no longer led to the original source material."
Thanks to long-time Slashdot reader jtotheh for sharing the news.
Google Cached Web Pages (Score:5, Insightful)
Yes, especially since Google removed one of the most useful search features it ever had, which was the ability to view the cached page from the last time Google crawled it.
Re:Google Cached Web Pages (Score:5, Interesting)
All part of the enshittification of Google. Also remember that archive.org keeps things forever. I have started turning most links in my lectures into archive.org links, it is just too much effort checking every time whether the pages are still there.
And, yes, I donate to them.
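For anyone who wants to automate the link-swapping habit described above, here is a minimal sketch, not an official client. It assumes the commonly used Wayback Machine URL pattern (`https://web.archive.org/web/<timestamp>/<url>`) and the public availability endpoint (`https://archive.org/wayback/available`). The function names `wayback_url` and `availability_query` are made up for illustration, and the code only builds URLs; it makes no network requests.

```python
from urllib.parse import urlencode

# Assumed URL patterns; check the Internet Archive's own docs before relying on them.
WAYBACK_PREFIX = "https://web.archive.org/web"
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def wayback_url(original_url: str, timestamp: str = "") -> str:
    """Build a Wayback Machine link for original_url (hypothetical helper).

    timestamp is a digits-only prefix of YYYYMMDDhhmmss, e.g. "2023" or
    "20230115". With no timestamp, the link asks for whatever capture the
    Wayback Machine resolves by default.
    """
    if timestamp and not timestamp.isdigit():
        raise ValueError("timestamp must contain only digits")
    if timestamp:
        return f"{WAYBACK_PREFIX}/{timestamp}/{original_url}"
    return f"{WAYBACK_PREFIX}/{original_url}"


def availability_query(original_url: str, timestamp: str = "") -> str:
    """Build the query URL that asks whether a capture exists (hypothetical helper)."""
    params = {"url": original_url}
    if timestamp:
        params["timestamp"] = timestamp
    return f"{AVAILABILITY_ENDPOINT}?{urlencode(params)}"


if __name__ == "__main__":
    print(wayback_url("https://example.com/lecture-notes", "2023"))
    print(availability_query("https://example.com/lecture-notes"))
```

Pinning a timestamp is the safer choice for lecture slides, since the link then keeps pointing at the version of the page you actually cited.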
Moderation needs new word: enshittification (Score:2)
So many posts even before the decline are best classified as being about enshittification. It should get added to the list.
Re:Google Cached Web Pages (Score:4, Informative)
No they don't. Dubious legal complaints frequently result in URLs being excluded from the Wayback Machine. It's extremely irritating.
Re: (Score:2)
I have started turning most links in my lectures into archive.org links, it is just too much effort checking every time whether the pages are still there.
And, yes, I donate to them.
Thank you for sharing this tip. It's brilliant! Not just for lectures, but for many other use cases as well. Bookmarks, links sent by messaging apps, you name it.
Re: (Score:2)
The reason is that you can detect when the historical record has been altered. It's going to get worse when all our information is being filtered through ClippyAI:
“Every record has been destroyed or falsified, every book has been rewritten, every picture has been repainted, every statue and street and building has been
Generated content (Score:2)
Search on "how to install custom rom on android" and you'll get pages of general information in a recognizable template... same information reformatted slightly emanating from different URLs. Volumes of unfocused garbage basically.
Re:Generated content (Score:5, Insightful)
Not much point or value in blindly archiving everything with so much boilerplate content generated to get advertising clicks.
I know what you mean, but sometimes there is value in connecting the dots later.
What is important, of course, is that this is a bit like the Svalbard Global Seed Vault, but for US science data. https://www.seedvault.no/ [www.seedvault.no]
When a government of dodgy politicians demands to eliminate science data, it must be very powerful science data. If it was about eliminating bullshit, they'd go after flat earthers, HAARP freaks, alien conspiracies, anti-moon landing and other BS.
So if a few weeds get in the mix, it's still a good thing.
Re: (Score:1)
Well the people generating that data should have done a lot more to ensure they had some amount of trust.
The trouble with a table of values is we often don't know much about the methodology behind their collection. Even if that information is available, it is often 'conveniently' separated before the data is used for whatever analysis.
If the government published SAT scores by race and year, it would paint some grim pictures most likely, and various groups would insist the data is 'racist'. The might in fact be
Re:Generated content (Score:4, Insightful)
By your logic, you might as well close the armed forces and incentivize self funded mercenaries, then reward them when they achieve published goals.
So I start a private army with the mercs out of business when Trump solves the war in the Ukraine, on his first day in office. We invade Greenland and take it over in about 24 hours because
I'm gonna say there is a role for government in a just society. In my opinion, we need regulations for environment, stock market, banks, utilities like water, electricity and sadly the internet...and based on the world today, we are going to still need social services, and UBI.... or we end up like China with everyone spying on each other and so much pollution you can't see the sun. Government stimulus in the form of grants in the sciences still seems reasonable to me. Could it be run better? Surely.
Re: (Score:2)
i.e. it's all weeds now, no genuine seeds to store.
I'm not against the Internet Archive, I just think they are wasting money on an obsolete model.
Every CEO talks up his company, so no different with Brewster, imho.
They could sit pat with what they've got and it's still a valuable public service, methinks.
Re:Generated content (Score:5, Insightful)
The point is that what's important and what's not isn't known until much later. You might think it's useless information today because it doesn't help you, but it may have uses tomorrow.
It's basically like how we learn what life was like in the past not because of the records left behind, but because of the garbage that was thrown away.
Re: (Score:2, Offtopic)
Indeed. The justification and value of the Internet archive comes from maybe 1% of what it stores, maybe less. In the YouTube videos they archive, it may well be less, but nobody knows what it may turn out to be before it becomes important.
Re: (Score:2)
This is like the argument that all data is equal. That Microsoft and OpenAI want to ingest the entirety of the internet. It's literally noise now... I don't see how that makes sense, anymore.
Re: (Score:3, Insightful)
As others have pointed out, though, without knowing what questions future people want to answer, you don't know what is interesting or why.
For example, an economist of the future might be very interested in the rate of content duplication, clones, and likely copyright infringement of article content.
Think about it like masons' marks. They basically were just there to do supply chain management and invoicing. Nobody thought they'd be interesting after the wall was up, so to speak, but future archeologists have used
Re: (Score:2)
but if the cost increases forever, there will be a moment when the expense won't seem justified.
Re: (Score:2)
You are arguing that much of that information is redundant.
That is premature optimisation.
Re: (Score:2)
Decent people need archive.org to survive... (Score:5, Insightful)
Re: (Score:3)
I do not think so. There are enough countries on this planet where archiving in this form is legal. It just has to be non-profit and free access. It can be limited to non-commercial use only and that can be done via the TOU.
Re:Decent people need archive.org to survive... (Score:4, Interesting)
The problem is that the IA team won't move it outside of the US, and won't accept outside help to mirror it. They won't accept any help to develop the backend software either. Of course that's up to them, it's their archive, but users must consider their position and the likely impact on the archive's future.
Realistically the best option here will be to create an open source version of the IA code, and try to organize libraries set up around the world. Each one wouldn't have to be a full mirror. You could even decouple the storage part so that data can be stored where it is not going to run into legal issues, and be accessed seamlessly from a central web interface. There are lots of options and it really needs a team of international copyright lawyers to look at it. Then mirror IA and accept uploads, and do it all completely separate from the current IA, both for legal reasons and because they don't want to be part of it.
It's no small thing, but otherwise it's just a matter of time before we lose it all.
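To make the "decoupled storage behind a central interface" idea concrete, here is a toy sketch. It is purely illustrative and is not how the Internet Archive works. `StorageNode`, `Catalog`, and the jurisdiction labels are all hypothetical, and real storage would be remote servers rather than in-memory dicts.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class StorageNode:
    """A hypothetical partial mirror hosted in one jurisdiction."""
    name: str
    jurisdiction: str
    items: Dict[str, bytes] = field(default_factory=dict)


class Catalog:
    """A hypothetical central index: knows where each item lives, stores no data."""

    def __init__(self) -> None:
        self._locations: Dict[str, List[StorageNode]] = {}

    def store(self, item_id: str, data: bytes, nodes: List[StorageNode]) -> None:
        # Write the item to every chosen node and remember where it went.
        for node in nodes:
            node.items[item_id] = data
            self._locations.setdefault(item_id, []).append(node)

    def fetch(self, item_id: str) -> bytes:
        # Try each recorded location in turn; skip nodes that lost the item.
        for node in self._locations.get(item_id, []):
            if item_id in node.items:
                return node.items[item_id]
        raise KeyError(item_id)


if __name__ == "__main__":
    node_a = StorageNode("mirror-a", "jurisdiction-1")
    node_b = StorageNode("mirror-b", "jurisdiction-2")
    catalog = Catalog()
    catalog.store("dataset-001", b"example bytes", [node_a, node_b])
    del node_a.items["dataset-001"]  # simulate a takedown at one mirror
    print(catalog.fetch("dataset-001"))
```

The point of the split is that no single node has to be a full mirror, and losing one node (or one legal venue) does not take the item offline as long as another copy is indexed.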
Re: (Score:2)
re. Backing up the contents of archive.org somewhere outside the US: I do not think the code is the problem, but the sheer amount of data that has to be exported to the backup(s), before archive.org goes titsup.
This means the Wayback Machine, the books, the texts, the audio, the video.
Is there an effort underway?
Doesn't some billionaire think this is worth doing?
Re: (Score:2)
It would need some very significant up-front investment, and probably cooperation with the IA team to at least some extent.
Nobody seems to be doing it at the moment.
Re: (Score:2)
Re:Decent people need archive.org to survive... (Score:5, Informative)
The idiots at the Archive invited this outcome by saying "copyright is no longer a thing because Covid" which pretty much forced the publishers to sue, and to do so with a set of facts that was almost tailor made for their purposes to kill format shifting.
Re: (Score:1)
YES! It is almost as if somebody powerful hired a mole to go work at the archive and promote BAD decisions! (This is in fact one of the things assets and spies do.) Any Russians work there?
They could have simply archived materials for a....century.... before publishing them.
Re: (Score:1)
They could have simply archived materials for a....century.... before publishing them.
With what funds? The current model makes it useful to people currently, which induces them to donate and generally support the effort.
Attempting to just archive might have let them run under the radar, or not. We really can't say. Some copyright troll organization (excuse me, I meant to say collaborative industry group) could have still spotted their crawler and gone after them with BS SLAPP type suits.
At least by being open about what they are doing and making the archive searchable and useful they
Re: (Score:1)
Yes, until then the copyright parasites were working to promote the creativity copyright exists for, and were encouraging archive.org and everyone using stuff within a reasonable copyright term of five years.
Their evil parasitical actions are totally archive.org's fault.
Re: (Score:2)
I don't think that US copyright terms were ever that short, nor should be. Up until recently, copyright was for 14 years, extendable once for a total of 28 years. Then, the US joined the Berne Convention which requires excessively long copyright terms so as to provide for the author's family and de
Re: (Score:3)
"*forced* the publishers to sue"
I said "pretty much" forced the publishers to sue. Shall I restate that to "they invited the publishers to sue?"
Yes, until then the copyright parasites were working to promote the creativity copyright exists for, and were encouraging archive.org and everyone using stuff within a reasonable copyright term of five years.
No, obviously not, but there was plenty of stuff on the Archive that was less than your utopic five year threshold that they were lending out via format shifting on the premise that they had a physical copy that corresponded to the electronic copy they were lending out for a limited time, protected by DRM. It was a reasonable position for them to hold--the Supreme Court held in Betamax that time
Deliberate disappearing information (Score:4, Interesting)
As we've seen in the last month or so, the current administration is hellbent on destroying any information it doesn't like for whatever reason. Climate change data used by farmers, research data used by scientists, epidemiological research used by health professionals, you name it, it's gone.
Internet Archive is the last refuge of this information before history is rewritten.
Re: (Score:3, Informative)
Re: (Score:3)
Nobody tell Musk that all the photos of his balding head are still up on archive.org, or about that article he bullied the Wall Street Journal into removing from its archive about how his wealth began with his parents' blood diamond mine (different jewel but whatever.) He'll break in and fire everybody, burn their documents, and begin trying to sell their building.
Re: (Score:2)
Ministry of Truth (Score:2)
The party told you to reject the evidence of your eyes and ears. It was their final, most essential command. His heart sank as he thought of the enormous power arrayed against him, the ease with which any Party intellectual would overthrow him in debate, the subtle arguments which he would not be able to understand, much less answer. And yet he was in the right! They were wrong and he was right.
Spread the risk (Score:4, Interesting)
I hope that there are mirrors spread out in various places around the globe because of the way things are moving in the US.
Same reason that I now would like to have more information offline: books as well as hard drives with scientific reports, Wikipedia and all the 'data' that I fear might go missing. It will take some time and money and I might well team up with others in the neighbourhood.
Re: (Score:2)
Well, the risk of what's actually happening wasn't thought to be possible until, well, current events. So plans to move the archive around probably weren't in the works, because the US offers certain advantages with respect to things like copyright law and the courts and everything around it. And it was believed (until recently) that the US was a politically stable country operating with certain conditions that make it more ideal than other countries. After all, Canada and the EU might seem like good plac
GPTs are the new search anyways (Score:1)
there is value in preserving the past intact, but that's going to be less and less practical for many use cases as we're moving closer and closer to the internet of the infinite automated monkeys (https://en.wikipedia.org/wiki/Infinite_monkey_theorem)
so I would disagree about relevant. important, sure
Re: (Score:1)
"It was the best of times, it was the blurst of times?!"
Re: (Score:3)
Facts are political? Trump ordered mass deletions of anything "DEI" related and it's so ham fisted that mention of the Enola Gay was caught up. https://www.pbs.org/newshour/p... [pbs.org]
Not even Jackie Robinson's military service was safe. https://www.espn.com/mlb/story... [espn.com]
And this is supposedly the party that loves the military?
Re: (Score:2)
But this story should not be about that.
ALL of archive.org is under threat, not just the bits that the Trumpists don't like.
Re: (Score:2)
I'm glad I stopped giving them money.
Re: (Score:1)
Re: (Score:2)
What was political? Did I voice an opinion? No I stated nothing but facts with sources.
Re: (Score:1)
Re: (Score:2)
This affects everyone from the hateful woke to the demented MAGAs and all the decent people in between.
Are these journalists not capable of seeing that their story is better without the sideswipes, no matter how true they are?
I guess they learnt their craft from CNN.
Re: (Score:2)
I've used archive.org a lot. I like it and love that it exists. I'm not thrilled though that the BBC noticed people would listen to BBC radio shows and had them taken down.
Re: Why did they insert politics into the story? (Score:2)
Re: (Score:2)
Sir could you explain what is hateful about a general awareness of social inequality?
Re: (Score:3)
Hatred is the usual response by the woke when one points out that it is racist and sexist to deny college places and jobs to people because of their race or sex, when one points out that their virtue signalling is annoying and divisive, when one points out the world does not revolve around the needs of a few disturbed people who think they are in the wrong body, when one points out that women's sport is no place for men, when one points out that the democrats' woke policies are the reason we have Presid
Re: (Score:2)
And you're the decent person in the middle, yes?
Re: (Score:2)
Re: (Score:1)
But then I'm engaging in exactly the sort of divisive argumentation that NPR is trying to promote under the guise of an article about an archive.
Controversy (Score:2)
Didn't the Wayback Machine stop archiving people they are friendly with? I believe there was some controversy with Taylor Loren a while back. If that wasn't a mistake, and it just takes a call to get your archives delisted, then the service is kind of pointless for all but the most general information.
China snipping wires (Score:2)
Press Freedom (Score:2)
The US ranking in Press Freedom has probably always been lower than domestically perceived (generally ranking 40-something), but it has dropped recently (to 55), and with the political climate mentioned in the article may drop further this year...
Source: https://en.wikipedia.org/wiki/... [wikipedia.org]
Can't rely on IA tho. (Score:2)
... Need more services like IA.