The Ever-Expanding Job of Preserving the Internet's Backpages

A quarter of a century after it began collecting web pages, the Internet Archive is adapting to new challenges. From a report: Within the walls of a beautiful former church in San Francisco's Richmond district, racks of computer servers hum and blink with activity. They contain the internet. Well, a very large amount of it. The Internet Archive, a non-profit, has been collecting web pages since 1996 for its famed and beloved Wayback Machine. In 1997, the collection amounted to 2 terabytes of data. Colossal back then, you could fit it on a $50 thumb drive now.

Today, the archive's founder Brewster Kahle tells me, the project is on the brink of surpassing 100 petabytes -- approximately 50,000 times larger than in 1997. It contains more than 700bn web pages. The work isn't getting any easier. Websites today are highly dynamic, changing with every refresh. Walled gardens like Facebook are a source of great frustration to Kahle, who worries that much of the political activity that has taken place on the platform could be lost to history if not properly captured. In the name of privacy and security, Facebook (and others) make scraping difficult.

  • by mattaw2001 ( 9712110 ) on Tuesday October 04, 2022 @02:53PM (#62938327)
    Preservation is something I care deeply about, and digital data has been an apocalyptic nightmare since forever.

    Even before the walled gardens of Facebook we should face the reality that we are in dead trouble already, losing tens of thousands of important, historical documents.

    Starting from files written in software for a computer that simply doesn't exist in a working state anywhere, anymore. Especially government mainframes.
    Then, when personal computers got cheaper, we ran into the hard fact that many companies cannot open or reproduce files using their own software, even within the same version numbers. Microsoft Office, I am looking at you especially!
    Then with timelocked licensing and license expiration it got worse - you cannot legally work with the information, and there is no one to purchase licenses from anymore.
    Also with poorly documented (if at all!) proprietary data formats, where actual implementations don't even match the companies' internal documents.
    Then we have online license activation, and many of those activation services have been lost to the world.
    DRM is an archivist's nightmare. Just a nightmare. E.g. Microsoft's PlaysForSure.
    Add in giant cloud datasets: I doubt even Facebook, Apple, Google, or Microsoft have any way to replicate any of their experiments or setups, so what hope has a historian?
    And even now bitrot slowly destroys files saved over time unless unusual tech is used, like gold DVDs or a redundant, checksumming filesystem like ZFS, with a good backup strategy.
    Yesterday, I literally had to play some old presentations from Microsoft's Silverlight on a very old machine and record them with OBS to make them survive.

    Welcome, one and all, to the new digital dark ages.
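
The bit rot point in the comment above can be made concrete. What follows is only a minimal sketch, assuming an archive that is an ordinary directory of files: it records a SHA-256 digest per file and later reports any file whose contents no longer match. The function names (`sha256_of`, `build_manifest`, `find_changed`) are hypothetical, chosen for illustration, and this is not how ZFS or any particular archive implements checksumming.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file under root (by relative path) to its current digest."""
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def find_changed(root: Path, manifest: dict[str, str]) -> list[str]:
    """List manifest entries that are missing or whose digest has changed."""
    changed = []
    for rel, expected in manifest.items():
        target = root / rel
        if not target.is_file() or sha256_of(target) != expected:
            changed.append(rel)
    return changed
```

A check like this only detects damage. Repairing it still needs a second good copy, which is why the comment pairs checksumming with redundancy and a backup strategy.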

    • You should probably add GitHub to that list. There will be a tipping point eventually on that service. Free forever? Uh huh. Right. Sure.

    • by Catvid-22 ( 9314307 ) on Tuesday October 04, 2022 @05:30PM (#62938845)

      Preservation is something I care deeply about, and digital data has been an apocalyptic nightmare since forever.

      There are workarounds, if not fixes, for some of the examples you mentioned.

      Starting from files written in software for a computer that simply doesn't exist in a working state anywhere, anymore. Especially government mainframes. Then, when personal computers got cheaper, we ran into the hard fact that many companies cannot open or reproduce files using their own software, even within the same version numbers. Microsoft Office, I am looking at you especially!

      So long as they're not encrypted, it should be possible to extract the important stuff out of the file using nothing more low-level than a hex editor. LibreOffice, I think, can open most of the older MS Office documents, even if the formatting isn't 100% accurate.

      And even now bitrot slowly destroys files saved over time unless unusual tech is used, like gold DVDs or a redundant, checksumming filesystem like ZFS, with a good backup strategy.

      How bit rot affects files depends on the type of file. In media files, the most important parts are the end and start of the file. Any bit rot in between would have to be fairly significant to affect the playability of the file. DRM may or may not mess with this.
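
The idea of pulling the important text out of an unreadable file, as suggested in the comment above, can be sketched in a few lines. This is an illustrative approximation in the spirit of the Unix `strings` tool, not a reimplementation of it. The function name `extract_strings` and the default minimum run length of 4 are assumptions made for this example.

```python
import re


def extract_strings(data: bytes, min_len: int = 4) -> list[str]:
    """Return runs of at least min_len printable ASCII bytes found in data."""
    # Printable ASCII is space (0x20) through tilde (0x7e); tabs are kept too.
    pattern = re.compile(rb"[\x20-\x7e\t]{%d,}" % min_len)
    return [match.group().decode("ascii") for match in pattern.finditer(data)]


# A fake binary blob with two readable fragments embedded in it.
blob = b"\x00\x01Hello, world\x00\xffab\x00Quarterly report\x02"
print(extract_strings(blob))
```

This only finds single-byte ASCII text. Text stored in another encoding, such as UTF-16, would be missed and would need a separate pass.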

    • A lot is going to be lost, but compared to previous ages, it's going to be a golden age of preservation; just look at how many backups of Wikipedia exist. The amount of data and records being preserved, and how easy it is to access and search them, is unprecedented, even if so much (mostly on the more irrelevant side) is going to be lost.
    • Spot on. A few years ago I had to deal with a large dump of documents in all sorts of formats. WordStar, Lotus Notes, some formats that I've forgotten, but mainly a wide variety of Microsoft Office documents. In the end I used LibreOffice (OpenOffice at the time) to bulk convert as much as possible to HTML so it could at least be mostly read. When LibreOffice couldn't open the file we had to try and at least extract the text using "strings". What was especially annoying about the exercise was that Mi

      • I feel you, I really do. I tried hitting up the UK government for a grant to network archive whole researchers' PCs as VMs to make a library of not only the computers but the software that was on them to try to counter this. With a bit of date fudging we might be able to see what they saw, and simulate what they simulated in a lot of cases.
  • by Anonymous Coward

    Every single time I try to access a "gone" webcomic, the internet archive manages to have missed most or all of the pictures every single time it crawled the site.

  • by erice ( 13380 ) on Tuesday October 04, 2022 @03:06PM (#62938393) Homepage

    So, an article about the difficulties of archiving when sites are walled off is itself walled off.

  • Discord (Score:5, Informative)

    by arosenfield ( 998621 ) on Tuesday October 04, 2022 @03:39PM (#62938489)

    It used to be the case that much knowledge and history was created on forums, bulletin boards, and the like, all of which were easily indexed by search engines. You can find a lot of wisdom of the ancients [xkcd.com] there, or in some cases unanswered questions.

    Nowadays it seems like everyone's on closed ecosystems like Discord, Facebook groups, and the like. In some cases like Twitter and Instagram, the content is still indexed by search engines and is still searchable to a degree. But in most cases, the walled garden blocks access either via robots.txt or by requiring authentication by a logged-in account on the site. You'll never see any Discord messages show up in a Google search result.

    Some good samaritans are exporting and preserving some Discord servers' logs for archival purposes, but those are the exception and not the rule. It's a real problem for historians. Summoning Salt [youtube.com] for instance has had to dive into Discord archives and do a lot of 1:1 DMing to reconstruct the histories of many games' speedrunning records.

    Is this the sad future we're destined for?

    • It used to be the case that much knowledge and history was created on forums, bulletin boards, and the like, all of which were easily indexed by search engines...

      Nowadays it seems like everyone's on closed ecosystems like Discord, Facebook groups, and the like.

      Yes. It usually takes the money-grubbers a few decades to catch on to a new "opportunity". Then, a few decades after that, people in government start to see the light.

  • The Wayback Machine is pretty important. Timcast refers to it and so do many other journalists. We can now see the impact that spies, algorithms, and bots have, not just on politics but on our everyday lives. This leads me to ask the question, "how much has been scrubbed from the web?"
  • Maybe the Amazon specials. You can write 2TB but you're never gonna get it back :D

  • by GodWasAnAlien ( 206300 ) on Tuesday October 04, 2022 @05:09PM (#62938791)

    2 terabytes of data. Colossal back then, you could fit it on a $50 thumb drive now.

    That is currently not possible, AFAIK, unless you get those "1TB"/"2TB" drives that report a fake number, then get full and/or fail at a much smaller number. It's a problem on Amazon for SSDs, microSD cards and USB flash drives that Amazon should fix.

    • Moreover, this is approaching the scenario of the old, old SF short story in which all the knowledge of the galaxy is crammed into a tiny, almost microscopic computer - and then someone loses it.

