The Ever-Expanding Job of Preserving the Internet's Backpages
A quarter of a century after it began collecting web pages, the Internet Archive is adapting to new challenges. From a report: Within the walls of a beautiful former church in San Francisco's Richmond district, racks of computer servers hum and blink with activity. They contain the internet. Well, a very large amount of it. The Internet Archive, a non-profit, has been collecting web pages since 1996 for its famed and beloved Wayback Machine. In 1997, the collection amounted to 2 terabytes of data. Colossal back then, you could fit it on a $50 thumb drive now.
Today, the archive's founder Brewster Kahle tells me, the project is on the brink of surpassing 100 petabytes -- approximately 50,000 times larger than in 1997. It contains more than 700bn web pages. The work isn't getting any easier. Websites today are highly dynamic, changing with every refresh. Walled gardens like Facebook are a source of great frustration to Kahle, who worries that much of the political activity that has taken place on the platform could be lost to history if not properly captured. In the name of privacy and security, Facebook (and others) make scraping difficult.
Re: (Score:1)
You're* x3.
Also, "security" is not the issue. Facebook just claims "security" as the reason it can't make people's Facebook posts, already marked public, truly visible to the public.
Welcome to the digital dark ages (Score:5, Insightful)
Even before the walled gardens of Facebook we should face the reality that we are in dead trouble already, losing tens of thousands of important, historical documents.
Starting from files written in software for a computer that simply doesn't exist in a working state anywhere, anymore. Especially government mainframes.
Then when personal computers got cheaper we run into the hard fact that many companies cannot open or reproduce files using their own software, even within the same version numbers. Microsoft Office, I am looking at you especially!
Then with timelocked licensing and license expiration it got worse - you cannot legally work with the information, and there is no one to purchase licenses from anymore.
Also with poorly documented (if at all!) proprietary data formats, where actual implementations don't even match the company's internal documents.
Then we have online license activation, and many of those activation services have been lost to the world.
DRM is an archivist's nightmare. Just a nightmare. E.g. Microsoft's PlaysForSure.
Add in giant cloud datasets: I doubt even Facebook, Apple, Google, or Microsoft have any way to replicate any of their experiments or setups, so what hope has a historian?
And even now bitrot slowly destroys files saved over time unless unusual tech is used, like gold DVDs or a redundant, checksumming filesystem like ZFS, with a good backup strategy.
Yesterday, I literally had to play some old presentations from Microsoft's Silverlight on a very old machine and record them with OBS to make them survive.
Welcome, one and all, to the new digital dark ages.
Re: (Score:2)
You should probably add GitHub to that list. There will be a tipping point eventually on that service. Free forever? Uh huh. Right. Sure.
Re: (Score:2)
Yup. I've visited a few repositories and found them deleted, with no forks.
Re:Welcome to the digital dark ages (Score:4, Interesting)
Preservation is something I care deeply about, and digital data has been an apocalyptic nightmare since forever.
There are workarounds, if not fixes, for some of the examples you mentioned.
Starting from files written in software for a computer that simply doesn't exist in a working state anywhere, anymore. Especially government mainframes. Then when personal computers got cheaper we run into the hard fact that many companies cannot open or reproduce files using their own software, even within the same version numbers. Microsoft Office, I am looking at you especially!
So long as they're not encrypted, it should be possible to extract the important stuff out of the file using nothing more low-level than a hex editor. LibreOffice, I think, can open most of the older MS Office documents, even if the formatting isn't 100% accurate.
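For anyone who wants to try the hex-editor approach without the hex editor, here is a minimal sketch of pulling readable text out of an arbitrary binary file. It assumes the text is stored as plain ASCII; the function name `extract_strings` and the minimum run length of 4 are illustrative choices, not part of any particular tool, and text stored in a 16-bit encoding would be missed.

```python
import re


def extract_strings(data: bytes, min_len: int = 4) -> list[str]:
    """Return every run of printable ASCII that is at least min_len bytes long."""
    # \x20-\x7e covers space through tilde, i.e. the printable ASCII range.
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [match.group().decode("ascii") for match in pattern.finditer(data)]


if __name__ == "__main__":
    # Hypothetical file contents: readable text surrounded by binary noise.
    blob = b"\x00\x01Hello, world\x00ab\x02Quarterly report 1994\xff"
    for text in extract_strings(blob):
        print(text)
```

Formatting, tables, and embedded objects are lost this way; it only rescues the words.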
And even now bitrot slowly destroys files saved over time unless unusual tech is used, like gold DVDs or a redundant, checksumming filesystem like ZFS, with a good backup strategy.
How bit rot affects files depends on the type of file. In media files, the most important parts are the end and start of the file. Any bit rot in between would have to be fairly significant to affect the playability of the file. DRM may or may not mess with this.
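Whatever the file type, bit rot can at least be detected on an ordinary filesystem by recording a checksum for each file and re-checking later. The following is a rough sketch of that idea, not a substitute for ZFS or backups; the helper names (`sha256_of`, `build_manifest`, `find_changed`) are made up for illustration, and it only detects damage, it cannot repair it.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def find_changed(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return the files that are missing or no longer match the manifest."""
    changed = []
    for rel, expected in manifest.items():
        target = root / rel
        if not target.is_file() or sha256_of(target) != expected:
            changed.append(rel)
    return changed
```

Store the manifest somewhere other than the disk being checked, and restore any file reported by `find_changed` from a known-good copy.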
Re: (Score:2)
Spot on. A few years ago I had to deal with a large dump of documents in all sorts of formats. WordStar, Lotus Notes, some formats that I've forgotten but mainly a wide variety of Microsoft Office documents. In the end I used LibreOffice (OpenOffice at the time) to bulk convert as much as possible to HTML so it could at least be mostly read. When LibreOffice couldn't open the file we had to try and at least extract the text using "strings". What was especially annoying about the exercise was that Mi
Haven't mastered pictures yet (Score:1)
Every time I try to access a "gone" webcomic, it turns out the Internet Archive missed most or all of the pictures on every crawl of the site.
Ironically, the article is paywalled (Score:5, Funny)
So, an article about the difficulties of archiving when sites are walled off is itself walled off.
Re:Ironically, the article is paywalled (Score:5, Informative)
I'll double your irony: You can read it here [archive.ph].
Discord (Score:5, Informative)
It used to be the case that much knowledge and history was created on forums, bulletin boards, and the like, all of which were easily indexed by search engines. You can find a lot of wisdom of the ancients [xkcd.com] there, or in some cases unanswered questions.
Nowadays it seems like everyone's on closed ecosystems like Discord, Facebook groups, and the like. In some cases like Twitter and Instagram, the content is still indexed by search engines and is still searchable to a degree. But in most cases, the walled garden blocks access either via robots.txt or by requiring authentication by a logged-in account on the site. You'll never see any Discord messages show up in a Google search result.
Some Good Samaritans are exporting and preserving some Discord servers' logs for archival purposes, but those are the exception and not the rule. It's a real problem for historians. Summoning Salt [youtube.com] for instance has had to dive into Discord archives and do a lot of 1:1 DMing to reconstruct the histories of many games' speedrunning records.
Is this the sad future we're destined for?
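To make the robots.txt part concrete, here is a small sketch of how a well-behaved crawler decides whether it may fetch a page, using Python's standard `urllib.robotparser`. The robots.txt text, the `ExampleArchiveBot` user agent, and the `chat.example.com` URLs are all invented for illustration and do not reflect any real site's rules.

```python
from urllib.robotparser import RobotFileParser


def is_crawl_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against robots.txt rules supplied as text (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical rules: every crawler is kept out of /channels/.
    rules = "User-agent: *\nDisallow: /channels/\n"
    for url in (
        "https://chat.example.com/channels/123",
        "https://chat.example.com/about",
    ):
        print(url, is_crawl_allowed(rules, "ExampleArchiveBot", url))
```

Note that robots.txt is only a request that polite crawlers honor; a login wall blocks access regardless of what the crawler is willing to do.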
Re: (Score:2)
It used to be the case that much knowledge and history was created on forums, bulletin boards, and the like, all of which were easily indexed by search engines...
Nowadays it seems like everyone's on closed ecosystems like Discord, Facebook groups, and the like.
Yes. It usually takes the money-grubbers a few decades to catch on to a new "opportunity". Then, a few decades after that, people in government start to see the light.
2TB thumb drive for $50? (Score:2)
Maybe the Amazon specials. You can write 2TB but you're never gonna get it back :D
Re: 2TB thumb drive for $50? (Score:2)
I think the Internet Archive is more deserving of a $250 2TB 2242 NVMe drive anyway. . . Which is about the same size and better in almost every way.
$50 2TB ? Where? (Score:3)
2 terabytes of data. Colossal back then, you could fit it on a $50 thumb drive now.
That is currently not possible, AFAIK, unless you get those "1TB"/"2TB" drives that report a fake capacity, then get full and/or fail at a much smaller number. It's a problem on Amazon for SSDs, microSD cards and USB flash drives that Amazon should fix.
Re: (Score:2)
Moreover, this is approaching the scenario of the old, old SF short story in which all the knowledge of the galaxy is crammed into a tiny, almost microscopic computer - and then someone loses it.