

Reddit Will Block the Internet Archive (theverge.com) 85
Reddit says that it has caught AI companies scraping its data from the Internet Archive's Wayback Machine, so it's going to start blocking the Internet Archive from indexing the vast majority of Reddit. From a report: The Wayback Machine will no longer be able to crawl post detail pages, comments, or profiles; instead, it will only be able to index the Reddit.com homepage, which effectively means Internet Archive will only be able to archive insights into which news headlines and posts were most popular on a given day.
"Internet Archive provides a service to the open web, but we've been made aware of instances where AI companies violate platform policies, including ours, and scrape data from the Wayback Machine," spokesperson Tim Rathschmidt tells The Verge.
It's because (Score:5, Informative)
They already sold it to google [gizmodo.com]
Re:It's because (Score:4, Insightful)
Which is how it should work.
If AI companies want to train on data, they should have to pay for it.
Right now this entire industry is built on IP theft. It's sickening, frankly.
Re: (Score:1)
Yes, but if the AI companies are scraping IA and not reddit, it has zero impact on Reddit besides Reddit being pissy.
Re: (Score:1)
If AI companies are scraping IA, it has no traffic impact on Reddit, but does have a missed-revenue-opportunity impact.
Re:It's because (Score:5, Insightful)
It has a huge impact because it devalues these kinds of deals and just supports the idea that these companies can run roughshod over IP rights, steal, and pillage to their hearts' content without consequence.
Re: (Score:3)
So is it "Information Wants to Be Free" or "Information Wants to Be Free to Only Those I Agree With"?
Re: (Score:3)
It would be good news, if they can't sell user data anymore. If I post on Reddit, then I do it for people to read it, not for Reddit to sell it. And I decide myself if I am offended by AI reading my posts or not.
Re: It's because (Score:2)
These kinds of deals will only last until the AI hype dies down and the market matures.
Of course, spammers and Russians already know about said deals and throw their own AI slop on Reddit, so I question what value it has right now.
Either way, if this is how Reddit intends to become profitable long-term, they're in for a rude awakening.
Copying is not theft. IP is access delay. (Score:3)
Demonstrated public demand for shorter access delays than current IP law allows suggests the public would be better off with different laws.
There are many ways to make money from free software etc. None are harmed by downloading it and many benefit.
Re: (Score:1)
Re: (Score:2)
So am I devaluing by browsing the site?
Re: It's because (Score:3)
Why are you being a sycophant for these VC backed AI companies?
Re: (Score:2)
Why are you being intellectually dishonest?
Re: It's because (Score:2)
Er...
Sorry for believing that AI companies shouldn't be able to steal everyone else's work without being compensated.
Anyone who thinks that's OK is the one being intellectually dishonest.
Re: (Score:2)
Reading a public website isn't stealing, why is it suddenly stealing when it's done for AI?
Re: (Score:2)
Why are you being a sycophant for these copyright folks?
Re: (Score:2)
It is pretty entertaining watching copyright folks fight with AI folks though. It's almost hard to say who I want to win. In a more fair and sustainable world, I would push for AI but since we don't spread out the gains of society, I'm inclined to back the copyright folks since not accessing their content isn't that big a deal, but AI could very well eventually put us out of a job.
Re: (Score:2)
It has a huge impact because it devalues these kinds of deals and just supports the idea that these companies can run roughshod over IP rights, steal, and pillage to their hearts' content without consequence.
Whoo Hoo! You go, Piratebay!
Recursive loop? (Score:3)
AI companies are scraping IA
Doesn't that cause some sort of infinite or recursive loop?
Re: (Score:2)
In many jurisdictions you cannot waive it (but you can often release your content under a license that waives all restrictions). On the other hand, the content needs to be complex enough to qualify for copyright at all. If you write a longer post about copyright and its exceptions, it is probably protected. This post, on the other hand, is something everyone could have come up with. A few more thoughts and it might become protected.
Re: (Score:3, Interesting)
Right now this entire industry is built on IP theft. It's sickening, frankly.
What's more sickening, an industry built on "IP theft" or the term of copyright after it's been extended due to lobbying from media megaconglomerates?
Re:It's because (Score:4, Insightful)
Both are a manifestation of the same problem - the power of money to subvert the law. Sometimes big money may be in conflict, but like the tagline from that movie, whoever wins, we lose.
Re: (Score:2)
In theory, AI could liberate us all. Copyright can never do that for us.
In practice, your post is 100% spot on.
Re: It's because (Score:2)
You guys are always wanting us to be more like Europe, because you're a rebel. Disney wanted to import European copyright laws into the US, and that's exactly what happened. How else do you intend to rebel?
Re: (Score:2)
This guy gets it: IP is an imaginary concept! It should exist only as long as it benefits society ... but our IP laws have been corrupted to only serve the needs of corporations.
Pretending that you should keep following made-up rules, that don't benefit anyone except the ultra-rich, as if it was some kind of moral concern, is completely idiotic.
Re: (Score:2)
Re: It's because (Score:2)
I remember only five years ago, Slashdot had a very counterculture/adversarial view towards intellectual property. Now it seems to be very Jack Valenti.
You wouldn't download a reddit.
(But neither would I, the last thing I need is several terabytes worth of the internet's anus.)
Re: (Score:1)
the last thing I need is several terabytes worth of the internet's anus.
I think you are confusing Reddit with the many n-chans out there. It's an understandable mistake.
Re: It's because (Score:2)
Reddit is the anus, those are just the bigger chunks of splatter.
Re: (Score:2)
Re: (Score:2)
It's the difference between personal use, and corporate profit.
Re: (Score:2)
We are not any less anti-copyright so much as anti-AI. Copyright isn't going to leave us all in the poor house. You don't need to consume copyright materials. AI on the other hand could very well leave us all worse off, especially given how we operate society. All the gains from AI will surely be used against the people and not for the people.
Re: (Score:2)
Reminder that the "creativity" being defended by IP laws originates from Reddit users, not Reddit the company. Most Reddit users are not aware that their posts can be sold for money.
Re: (Score:2)
The one thing I have learned about AI is that whoever controls scrapable data controls the AIs. Because they are useless without massive training sets.
This means you can open source the models all you want; they are basically worthless without the training data sets, and those are going to be locked up behind paywalls owned and operated by major platform holders very soon.
This means that the capital that ai represents, and it is capital just
Re: (Score:2)
I really wonder how much added value there is in recent data. For a search engine, obviously it needs to be recent - but for any other use... old data is possibly as good as or better than recent data (especially as new data is going to be polluted with AI-generated content). Maybe search is where all the money lies (Google isn't exactly poor...) and hence the need to scrape endlessly.
Re: (Score:2)
That does not rely on "Scraping" and visually snapshots all the posts.
We already have it in Windows 11 - it's called "Recall"; it's a fucking privacy nightmare.
Re: We need an archive... (Score:2)
That's a shame actually (Score:2)
I sometimes use it to archive especially insightful conversations on Reddit. ...yeah it is a rare event, but it does sometimes happen. *casts furtive glance around Slashdot*
Re: That's a shame actually (Score:2)
Every now and then, shit has kernels of corn.
Re: (Score:1)
but you wouldn't want to eat them.
Re: That's a shame actually (Score:2)
Re: That's a shame actually (Score:2)
the internet forgets after all (Score:2)
It took AI to get the Internet to forget things, interesting.
Price, value and rivalry (Score:2)
LLMs provide a mechanism to access the same information in a radically more energy-intensive way, which was the missing mechanism to put a price on that value.
A price tag means the data has to be made into a rivalrous good or you can't sell it. Then the old data has to be made unavailable.
Reddit Sells Their Data To AI (Score:2)
Reddit literally sells their data to Google and others for scraping. What Reddit is saying here is that they're blocking The Internet Archive because they aren't paying to scrape that data. Google pays $60 million a year to scrape Reddit for AI data.
https://www.thedailybeast.com/... [thedailybeast.com]
Re: (Score:2)
As it should be.
AI house-of-card companies should not be allowed to engage in rampant IP theft.
Re: (Score:3)
Bwahaha, that's literally what Reddit does. It steals content from every other source on the internet and profits from it.
Re: (Score:2)
But but but ... they can't know what their users post, can they?! ...
I mean if they knew that most users do not own the content they post, they would surely delete it
Welcome to the Walled Garden (Score:4)
AI is why we can't have nice things.
Re: (Score:1)
AI is why we can't have nice things.
Right. What nice things do you think you'd have without AI? (for the record, I'm not a fan of AI, but even less of a fan of moronic statements)
Re: (Score:2)
The OP is objecting to the loss of the Internet Archive and the ability to review history because of the AI scanning.
Re: (Score:2)
Reddit has become a cesspool of whining babies almost as bad as Slashdot, and that's not data we need to keep.
Slashdot commentary is nowhere on the sub-level of Reddit, c'mon, that's disingenuous.
Re: Its quite ok (Score:2)
Browse at 6 Re: Its quite ok (Score:1)
I usually browse at 6. It filters out the petty and childish arguments that show up when I browse at 5.
Re: Its quite ok (Score:2)
Who cares? (Score:1)
Who cares? Reddit is 90% bots and marketing agencies. It's useless. It's AI slop that's been digested and shit back out multiple times...AI trained on the output of AI trained on the output of AI trained on the last vestiges of actual human communication from a forgotten era of Reddit, and all of it designed to push a certain narrative, get you to think a certain way, or make it hard for you to see content they don't want you to see.
Re: Who cares? (Score:2)
And why? (Score:2)
If a company complains that AI crawlers are causing too much traffic, one might believe it or not (it's not as if Reddit isn't using a CDN for example). But why are they complaining when a mirror they are not hosting themselves is crawled? It's not as if the Internet Archive would crawl them more often when AI bots access the archive.
Re: (Score:1)
I think the ostensible reason has to do with things like deletions. If the Internet Archive gets it, and it's deleted later, it's still at archive.org.
I suspect the actual reason has to do with money, they want to be the only place that companies willing to pay for content can go to train their AIs, and they want to shut out companies unwilling to pay. Whether this is a good thing, a bad thing, both, or neither is probably a discussion for another time.
IP is a government gift (Score:2)
IP is a government gift, not some natural right, though Disney will disagree. IP was intended to facilitate greater good, not rent seeking.
IP is not a requirement for successful capitalism as nations which disregard it demonstrate. It is not necessary to be competitive but a deliberately conferred, supposedly temporary, market advantage intended to aid progress in the useful arts.
Slashdot luddites are silly (Score:2)
I see AI hate regularly on other sites, but it's especially funny seeing the AI Haters come out on Slashdot. You hate technology now? Ridiculous! Don't you still want a household robot to take out the trash, or is that unfashionable now, because it's AI? What a joke AI haters are, especially here. They should pay to scrape data! Don't be ridiculous.
Re: (Score:2)
Hating AI in its current form does not make one a Luddite. This is not about "hating technology", it's about recognizing that what is available now is deeply flawed.
Re: (Score:2)
hating AI in its current form does not make one a Luddite.
In a hypothetical universe that doesn't exist, sure, you're not wrong.
But the AI hatred often overlaps with flat-out Luddite tendencies: sacrificing every drop of intellectual honesty one can find to change the narrative around a technology with no regard for the facts on the ground.
When I start running into people here who don't like AI, but aren't also engaging in flat-out lying in order to prop up their reality bubble, I'll be more inclined to agree with you.
Re: (Score:2)
The fatigue seems to be mostly about other humans who believe LLMs are omniscient and infallible.
Or even good conversationalists.
As an information retrieval tool or a radiology diagnosis assistant, sure, hardly anybody is complaining.
OK, maybe some lesser radiologists.
So stupid (Score:2)
I have an idea (Score:2)
Well (Score:1)
Reddit is garbage and has been for a long time. The IA can probably just find a new way to archive the site and its posts.
Hiding History (Score:1)
... this is a way of making sure we can't use the Internet Archive to look back at what was said on Reddit.
Using the Internet Archive, I was able to trace the 2014 Ukrainian coup forces to Azov militants being trained in Ukraine. It's only because of the Internet Archive that the public still had access to a journalist's report from while they were being trained, before the coup itself occurred.
It seems that we're being forced more and more into not being able to check what we were told on day x...
Evidence of Crimes (Score:2)
Just yesterday there was a news story about
Predditors organizing to mass copy a YouTuber's content to try to wreck his revenue.
A court forced Reddit to hand over their identification and he is suing them. Ethan somebody.
There's quite an ethos over there about organizing crime and apparently if it's leftwing they just leave it alone. e.g. https://www.reddit.com/r/lgbt/... [reddit.com] nobody pushing back against crime.
Most neutrally the company may not want to deal with subpoenas and they don't think the crimes are jus