The Internet

Why the Internet Archive is More Relevant Than Ever (npr.org)

It's "live-recording the World Wide Web," according to NPR, with a digital library that includes "hundreds of billions of copies of government websites, news articles and data."

They described the 29-year-old nonprofit Internet Archive as "more relevant than ever." Every day, about 100 terabytes of material are uploaded to the Internet Archive, or about a billion URLs, with the assistance of automated crawlers. Most of that ends up in the Wayback Machine, while the rest is digitized analog media — books, television, radio, academic papers — scanned and stored on servers. As one of the few large-scale archivists to back up the web, the Internet Archive finds itself in a particularly unique position right now... Thousands of [U.S. government] datasets were wiped — mostly at agencies focused on science and the environment — in the days following Trump's return to the White House...

The Internet Archive is among the few efforts that exist to catch the stuff that falls through the digital cracks, while also making that information accessible to the public. Six weeks into the new administration, Wayback Machine director [Mark] Graham said, the Internet Archive had cataloged some 73,000 web pages that had existed on U.S. government websites that were expunged after Trump's inauguration...

According to Graham, based on the big jump in page views he's observed over the past two months, the Internet Archive is drawing many more visitors than usual to its services — journalists, researchers and other inquiring minds. Some want to consult the archive for information lost or changed in the purge, while others aim to contribute to the archival process.... "People are coming and rallying behind us," said Brewster Kahle, [the founder and current director of the Internet Archive], "by using it, by pointing at things, helping organize things, by submitting content to be archived — data sets that are under threat or have been taken down...."

A behemoth of link rot repair, the Internet Archive rescues a daily average of 10,000 dead links that appear on Wikipedia pages. In total, it's fixed more than 23 million rotten links on Wikipedia alone, according to the organization.

Though it receives some money for its preservation work for libraries, museums, and other organizations, it's also funded by donations. "From the beginning, it was important for the Internet Archive to be a nonprofit, because it was working for the people," explains founder Brewster Kahle on its donations page: "Its motives had to be transparent; it had to last a long time. That's why we don't charge for access, sell user data, or run ads, even while we offer free resources to citizens everywhere. We rely on the generosity of individuals like you to pay for servers, staff, and preservation projects. If you can't imagine a future without the Internet Archive, please consider supporting our work. We promise to put your donation to good use as we continue to store over 99 petabytes of data, including 625 billion webpages, 38 million texts, and 14 million audio recordings."
Two interesting statistics from NPR's article:

Thanks to long-time Slashdot reader jtotheh for sharing the news.


  • by Dan East ( 318230 ) on Monday March 24, 2025 @07:40AM (#65255243) Journal

    Yes, especially since Google removed one of the most useful search features it ever had, which was the ability to view the cached page from the last time Google crawled it.

    • by gweihir ( 88907 ) on Monday March 24, 2025 @08:55AM (#65255333)

      All part of the enshittification of Google. Also remember that archive.org keeps things forever. I have started turning most links in my lectures into archive.org links; it is just too much effort checking every time whether the pages are still there.

      And, yes, I donate to them.

    • > Yes, especially since Google removed one of the most useful search features it ever had, which was the ability to view the cached page from the last time Google crawled it.

      The reason is that you can detect when the historical record has been altered. It's going to get worse when all our information is being filtered through ClippyAI:

      Every record has been destroyed or falsified, every book has been rewritten, every picture has been repainted, every statue and street and building has been
  • Not much point or value in blindly archiving everything with so much boilerplate content generated to get advertising clicks.

    Search on "how to install custom rom on android" and you'll get pages of general information in a recognizable template... same information reformatted slightly emanating from different URLs. Volumes of unfocused garbage basically.
    • by Ol Olsoc ( 1175323 ) on Monday March 24, 2025 @08:17AM (#65255285)

      Not much point or value in blindly archiving everything with so much boilerplate content generated to get advertising clicks.

      I know what you mean, but sometimes there is value in connecting the dots later.

      What is important of course, is that this is a bit like the Svalbard global seed vault, but for US science data. https://www.seedvault.no./ [www.seedvault.no]

      When a government of dodgy politicians demands to eliminate science data, it must be very powerful science data. If it were about eliminating bullshit, they'd go after flat earthers, HAARP freaks, alien conspiracies, anti-moon landing and other BS.

      So if a few weeds get in the mix, it's still a good thing.

      • by Anonymous Coward

        Well the people generating that data should have done a lot more to ensure they had some amount of trust.

        The trouble with a table of values is we often don't know much about the methodology behind their collection. Even if that information is available, it is often 'conveniently' separated before the data is used for whatever analysis.

        If the government published SAT scores by race and year, it would paint some grim pictures most likely, and various groups would insist the data is 'racist'. The might in fact be

        • by Big Hairy Gorilla ( 9839972 ) on Monday March 24, 2025 @09:55AM (#65255461)
          Libertarian ideas. I get it. I'm not in complete disagreement, nor am I a card carrying Leftist, but I don't think unrestricted Libertarianism leads to a just world.

          By your logic, you might as well close the armed forces and incentivize self funded mercenaries, then reward them when they achieve published goals.
          So I start a private army with the mercs put out of business when Trump solves the war in the Ukraine, on his first day in office. We invade Greenland and take it over in about 24 hours because ... there are only around 50,000 inhabitants of Greenland, and Denmark is some thousands of miles away. Where's my 50 Billion? Great profit margin too; it only cost me and my army a billion or so. I like the numbers. Did you really save any tax dollars?

          I'm gonna say there is a role for government in a just society. In my opinion, we need regulations for environment, stock market, banks, utilities like water, electricity and sadly the internet...and based on the world today, we are going to still need social services, and UBI.... or we end up like China with everyone spying on each other and so much pollution you can't see the sun. Government stimulus in the form of grants in the sciences still seem reasonable to me. Could it be run better? Surely.
      • I would suggest the dead internet theory is where we are now, have been for several years. Most content is generated now.
        i.e. it's all weeds now, no genuine seeds to store.

        I'm not against the Internet Archive, I just think they are wasting money on an obsolete model.
        Every CEO talks up his company, so no different with Brewster, imho.

        They could sit pat with what they've got and it's still a valuable public service, methinks.
    • by tlhIngan ( 30335 ) <slashdot@worSLACKWAREf.net minus distro> on Monday March 24, 2025 @08:48AM (#65255325)

      The point is that what's important and what's not isn't known until much later. You might think it's useless information today because it doesn't help you, but it may have uses tomorrow.

      It's basically like how we learn what life was like in the past not because of the records left behind, but because of the garbage that was thrown away.

      • Re: (Score:2, Offtopic)

        by gweihir ( 88907 )

        Indeed. The justification and value of the Internet archive comes from maybe 1% of what it stores, maybe less. In the YouTube videos they archive, it may well be less, but nobody knows what it may turn out to be before it becomes important.

      • I'm pretty sure we can do without archiving those generated pages of boilerplate on any subject. Best case scenario would be to analyze every page, compare to every other page and only save 1 copy, ignoring the countless copies.

        This is like the argument that all data is equal. That Microsoft and OpenAI want to ingest the entirety of the internet. It's literally noise now... I don't see how that makes sense, anymore.
        • Re: (Score:3, Insightful)

          by DarkOx ( 621550 )

          As others have pointed out, though, without knowing what questions future people want to answer, you don't know what is interesting or why.

          For example an economist of the future might be very interested in the rate of content duplication, clones, and likely copyright infringement of article content.

          Think about it like masons' marks. They basically were just there to do supply chain management and invoicing. Nobody thought they'd be interesting after the wall was up, so to speak, but future archeologists have used

          • well, I see what you're saying, it's an interesting point. We don't know how it will be looked at in the future.
            but if the cost increases forever, there will be a moment when the expense won't seem justified.
        • We are discussing trying to save vital information that may soon be totally wiped out by the copyright parasites.

          You are arguing that much of that information is redundant.

          That is premature optimisation.
  • by greytree ( 7124971 ) on Monday March 24, 2025 @08:10AM (#65255279)
    ...but AFAICT, the court judgement in favour of the copyright parasites has now doomed it?
    • by gweihir ( 88907 )

      I do not think so. There are enough countries on this planet where archiving in this form is legal. It just has to be non-profit and free access. It can be limited to non-commercial use only and that can be done via the TOU.

      • by AmiMoJo ( 196126 ) on Monday March 24, 2025 @09:38AM (#65255449) Homepage Journal

        The problem is that the IA team won't move it outside of the US, and won't accept outside help to mirror it. They won't accept any help to develop the backend software either. Of course that's up to them, it's their archive, but users must consider their position and the likely impact on the archive's future.

        Realistically the best option here will be to create an open source version of the IA code, and try to organize libraries set up around the world. Each one wouldn't have to be a full mirror. You could even decouple the storage part so that data can be stored where it is not going to run into legal issues, and be accessed seamlessly from a central web interface. There are lots of options and it really needs a team of international copyright lawyers to look at it. Then mirror IA and accept uploads, and do it all completely separate from the current IA, both for legal reasons and because they don't want to be part of it.

        It's no small thing, but otherwise it's just a matter of time before we lose it all.

        • I meant that the current, US-based, instance of archive.org is doomed.

          re. Backing up the contents of archive.org somewhere outside the US: I do not think the code is the problem, but the sheer amount of data that has to be exported to the backup(s), before archive.org goes titsup.
          This means the Wayback Machine, the books, the texts, the audio, the video.

          Is there an effort underway?
          Doesn't some billionaire think this is worth doing?
          • by AmiMoJo ( 196126 )

            It would need some very significant up-front investment, and probably cooperation with the IA team to at least some extent.

            Nobody seems to be doing it at the moment.

      • by vlad30 ( 44644 )
        I would guess that some other countries have their own Internet Archive, or at least a partial one focused on what they consider more relevant information (for IP theft purposes alone). The real problem is the size of the archive, and where to buy and pay for 100 petabytes of drive space, backup drives, and the growth additions required each day. Next, the power to run all that if you want it available 24/7. Large companies may also do it; the former Google cache feature suggests Google was doing it and ei
    • by Zak3056 ( 69287 ) on Monday March 24, 2025 @09:45AM (#65255451) Journal

      The idiots at the Archive invited this outcome by saying "copyright is no longer a thing because Covid" which pretty much forced the publishers to sue, and to do so with a set of facts that was almost tailor made for their purposes to kill format shifting.

      • YES! It is almost as if somebody powerful hired a mole to go work at the archive and promote BAD decisions! (This is in fact one of the things assets and spies do.) Any Russians work there?

        They could have simply archived materials for a....century.... before publishing them.

        • by DarkOx ( 621550 )

          They could have simply archived materials for a....century.... before publishing them.

          With what funds? The current model makes it useful to people currently, which induces them to donate and generally support the effort.

          Attempting to just archive might have let them run under the radar, or not. We really can't say. Some copyright troll organization (excuse me, I meant to say collaborative industry group) could have still spotted their crawler and gone after them with BS SLAPP type suits.

          At least by being open about what they are doing and making the archive searchable and useful they

      • "*forced* the publishers to sue"

        Yes, until then the copyright parasites were working to promote the creativity copyright exists for, and were encouraging archive.org and everyone using stuff within a reasonable copyright term of five years.
        Their evil parasitical actions are totally archive.org's fault.
        • Yes, until then the copyright parasites were working to promote the creativity copyright exists for, and were encouraging archive.org and everyone using stuff within a reasonable copyright term of five years.

          I don't think that US copyright terms were ever that short, nor should be. Up until recently, copyright was for 14 years, extendable once for a total of 28 years. Then, the US joined the Berne Convention which requires excessively long copyright terms so as to provide for the author's family and de
        • by Zak3056 ( 69287 )

          "*forced* the publishers to sue"

          I said "pretty much" forced the publishers to sue. Shall I restate that to "they invited the publishers to sue?"

          Yes, until then the copyright parasites were working to promote the creativity copyright exists for, and were encouraging archive.org and everyone using stuff within a reasonable copyright term of five years.

          No, obviously not, but there was plenty of stuff on the Archive that was less than your utopic five year threshold that they were lending out via format shifting on the premise that they had a physical copy that corresponded to the electronic copy they were lending out for a limited time, protected by DRM. It was a reasonable position for them to hold--the Supreme Court held in Betamax that time

  • by smooth wombat ( 796938 ) on Monday March 24, 2025 @08:37AM (#65255313) Journal

    As we've seen in the last month or so, the current administration is hellbent on destroying any information it doesn't like for whatever reason. Climate change data used by farmers, research data used by scientists, epidemiological research used by health professionals, you name it, it's gone.

    Internet Archive is the last refuge of this information before history is rewritten.

    • Re: (Score:3, Informative)

      by Anonymous Coward
      Once they realize that it exists, they will go after archive.org as well. The administration will misinterpret some law to give themselves power to shut it down, and SCOTUS will uphold it because it's been packed with Trump lackeys. This is the current playbook, and it appears to be working.
    • Nobody tell Musk that all the photos of his balding head are still up on archive.org, or that article he bullied the Wall Street Journal into removing from its archive about how his wealth began with his parents' blood diamond mine (different jewel but whatever.) He'll break in and fire everybody, burn their documents, and begin trying to sell their building.

    • "Oceania had always been at war with Eastasia". It's like Trump has taken 1984 as an example to leave his historical mark.
    • The party told you to reject the evidence of your eyes and ears. It was their final, most essential command. His heart sank as he thought of the enormous power arrayed against him, the ease with which any Party intellectual would overthrow him in debate, the subtle arguments which he would not be able to understand, much less answer. And yet he was in the right! They were wrong and he was right.

  • Spread the risk (Score:4, Interesting)

    by FountainHEW ( 3516329 ) on Monday March 24, 2025 @08:51AM (#65255331)
    So where are the physical servers located?

    I hope that there are mirrors spread out in various places around the globe because of the way things are moving in the US.

    Same reason that I now would like to have more information offline: books as well as hard drives with scientific reports, Wikipedia and all the 'data' that I fear might go missing. It will take some time and money and I might well team up with others in the neighbourhood.

    • by tlhIngan ( 30335 )

      Well, the risk of what's actually happening wasn't thought to be possible until, well, current events. So plans to move the archive around probably weren't in the works, because the US offers certain advantages with respect to things like copyright law and the courts and everything around it. And it was believed (until recently) that the US was a politically stable country operating with certain conditions that make it more ideal than other countries. After all, Canada and the EU might seem like good plac

  • there is value in preserving the past intact, but that's going to be less and less practical for many use cases as we're moving closer and closer to the internet of the infinite automated monkeys (https://en.wikipedia.org/wiki/Infinite_monkey_theorem)

    so I would disagree about relevant. important, sure

    • by Anonymous Coward

      "It was the best of times, it was the blurst of times?!"

  • Didn't the Wayback Machine stop archiving people they are friendly with? I believe there was some controversy with Taylor Loren a while back. If that wasn't a mistake, and it just takes a call to get your archives delisted, then the service is kind of pointless for all but the most general information.

  • If China are planning to cut cables we might need a backup
  • The US ranking in Press Freedom has probably always been lower than domestically perceived (generally ranking 40-something), but it has dropped recently (to 55), and with the political climate mentioned in the article may drop further this year...
    Source: https://en.wikipedia.org/wiki/... [wikipedia.org]

  • ... Need more services like IA.
