Project Analyzing Human Language Usage Shuts Down Because 'Generative AI Has Polluted the Data' (404media.co)
The creator of an open source project that scraped the internet to determine the ever-changing popularity of different words in human language usage says that they are sunsetting the project because generative AI spam has poisoned the internet to a level where the project no longer has any utility. 404 Media: Wordfreq is a program that tracked the evolving ways people used more than 40 different languages by analyzing millions of sources across Wikipedia, movie and TV subtitles, news articles, books, websites, Twitter, and Reddit. The system could be used to analyze changing language habits as slang and popular culture shifted, and was a resource for academics who study such things. In a note on the project's GitHub, creator Robyn Speer wrote that the project "will not be updated anymore."
"Generative AI has polluted the data," she wrote. "I don't think anyone has reliable information about post-2021 language usage by humans." She said that open web scraping was an important part of the project's data sources and "now the web at large is full of slop generated by large language models, written by no one to communicate nothing. Including this slop in the data skews the word frequencies." While there has always been spam on the internet and in the datasets that Wordfreq used, "it was manageable and often identifiable. Large language models generate text that masquerades as real language with intention behind it, even though there is none, and their output crops up everywhere," she wrote.
"Generative AI has polluted the data," she wrote. "I don't think anyone has reliable information about post-2021 language usage by humans." She said that open web scraping was an important part of the project's data sources and "now the web at large is full of slop generated by large language models, written by no one to communicate nothing. Including this slop in the data skews the word frequencies." While there has always been spam on the internet and in the datasets that Wordfreq used, "it was manageable and often identifiable. Large language models generate text that masquerades as real language with intention behind it, even though there is none, and their output crops up everywhere," she wrote.
Internet != real life (Score:5, Insightful)
And never did.
Things that get deliberately posted for all to see are by definition spoken in a different register than casual speech.
Back when the written word had a bigger barrier to entry between thought and printing press, this was understood. But the wrong lesson seems to have been drawn from why, leading people to believe that blog posts and tweets are completely interchangeable with spoken interactions.
The above statement was not AI generated, and it probably contains the gist of what I would have said out loud in casual conversation, but by virtue of taking longer to compose, it's more polished and has a few more $20 words and a few fewer profanities than if I'd just blurted it out.
Re:Internet != real life (Score:5, Funny)
The Internet isn't real life? Good Lord! What have we all been using to post on Slashdot then!?
Re: (Score:2)
You're clearly not a gamer or of OG Slashdot
Why is someone with a user number approaching infinity lecturing me about not being OG Slashdot? And yes, I play video games. Of course that has nothing to do with this though.
Anything that happens in games (which gets expanded to the internet, because they're online games) = game world, virtual world, whatever. Anything outside of games = IRL or in real life. Get with the program.
The point of my post is that people's use of language on the internet is in fact very real. One can see it, read it, and comprehend it. There is in fact nothing "not real" about it.
Re: (Score:2)
You're clearly not a gamer or of OG Slashdot
Why is someone with a user number approaching infinity lecturing me about not being OG Slashdot?
Mine predates yours. ;) Nah, more to the point I obviously agree with you.
I suppose we can assume (I don't) that if blog posts are all generative AI, it pollutes the language. But the fact is that with the current state of generative AI it's usually pretty obvious what is coming out of those systems and what isn't. Granted, it's getting "better" with generative AI getting more and more crap fed into it so it's even more insane than it was already. That said, a lot of blog posts are utterly insane so does it re
First to go, next is videos (Score:2)
Order of destruction:
1. Blogs ...
2. Clickbait 'news'
3. Formula news - sports recaps, company financials, investment research, and other genre news. Sports recaps have been computer-generated for years now, though not by AI.
4. Soft news - Health, garden, celebrity, staged-pretend-controversy news.
5. Political news
6. Hard news
And somewhere between 4-6 will be
- Youtube and other video platforms
- Short and long fiction stories
- Movie scripts, dialogue and stock scenes
And eventually
Re: (Score:2)
Mine predates yours. ;)
I'm only impressed by 5 digit or lower user numbers! It's still neat having one as low as the six figure range though :).
Not that anyone's user number means much overall of course, I just found it absurd having someone with an 8 digit user number tell me I don't go far enough back with Slashdot to get something, given that a six digit number means the user created their account at the very end of the 90's / early 2000's.
Re: (Score:2)
From what I've both heard and seen, your first part is wrong; one just needs to send an email for review to get registered as a user.
Re: (Score:2)
These people don't think written and spoken interactions are interchangeable. Spoken and written languages are known to differ in both grammar and lexicon. Some writers are known to follow more "spoken" conventions than others. Maybe the difference isn't big in all languages, though. It would certainly be worthwhile to monitor the frequency of spoken interaction, but it's clearly much more difficult to implement.
Re: Internet != real life (Score:2)
Absolutely no writer who writes books other people choose to pick up writes the way people talk. Any transcription of real extemporaneous spoken language is fucking unreadable. It wanders, it doubles back, it contradicts itself. It relies on context communicated with body language or other visual aids.
Re: (Score:2)
See Louis-Ferdinand Céline (1894-1961) usually just referred as "Céline". Wikipedia says:
* "Céline shocked many critics by his use of a unique language based on the spoken French of the working class"
* "Céline is widely considered to be one of the greatest French novelists of the 20th century and his novels have had an enduring influence on later authors." https://en.wikipedia.org/wiki/... [wikipedia.org]
There is a vast literature just on the spoken quality of Céline's writing style. One example
Re: (Score:2)
it's not just our language usage or the Internet, this is the inevitable result of corporatocracy, classism and corruption
our entire society is collapsing from the unethical rot from the top down
Re: Internet != real life (Score:2)
Ours is not a top-down society. If it were, it wouldn't have survived the continuing stream of embarrassments from the top.
You can see top-down societies in action. North Korea, the Soviet Union, etc. They usually don't last very long. And if they do, they have conspicuous deficiencies in terms of basic stuff like being able to keep the lights on or food on the table.
Re: (Score:2)
Ours is not a top-down society. If it were, it wouldn't have survived the continuing stream of embarrassments from the top.
You can see top-down societies in action. North Korea, the Soviet Union, etc. They usually don't last very long. And if they do, they have conspicuous deficiencies in terms of basic stuff like being able to keep the lights on or food on the table.
our society is a top down society, that's called classism
the rich are on top and the rest of us are not
typical red baiting bs
as if our government isn't just as corrupt and like we don't have starvation or poverty here
typical pseudo-conservative denial
Re: Internet != real life (Score:2)
Of course the government is corrupt. Reason we don't have *more* "starvation" (read: having to buy store brand steak on your EBT card) is that our government is limited and therefore its ability to fuck things up for the rest of us is limited.
Re: (Score:2)
of course if we didn't have good government, it would be far worse
those who are against governance say so because they selfishly don't want to pay their fair share
typical pseudo-conservative classist self-justification and deflection
Re: (Score:3)
Blog posts are train-of-thought. Tweets are train-of-thought. Hell, even UGC on Reddit, somethingawful, fark, slashdot, and so forth is still train-of-thought more than any real effort to research and verify.
And AI "slop" is only going to get worse until we abandon the ipv4 internet and create spaces where generative AI is not allowed to participate and those using generative AI are not allowed to post to trusted sources.
And 2021 wasn't when this crap started either. That was merely the turning point in which c
Heavy handed social media censorship (Score:2)
Re: Internet != real life (Score:2)
And yet we have been able to understand and track the evolution of languages over centuries by analyzing written media - because contemporary usage, neologisms and language drift permeate language on average despite individual author style. Slang getting incorporated into letters and then art, and becoming canonical, has been part of that forever - and statistical linguistic analysis has been a science for at least a century.
I think they may be throwing in the towel early on the origination problem, but they a
That is unfortunate (Score:5, Insightful)
It also shows another thing: Internet language data is fast becoming unusable for model training due to model collapse.
Re:That is unfortunate (Score:4, Interesting)
It also shows another thing: Internet language data is fast becoming unusable for model training due to model collapse.
Just wait until the same thing happens when using paintings as a reference [imgur.com].
Re:That is unfortunate (Score:4, Interesting)
A document gets read, and it gets split into tokens. The tokens are trained on and used, together with their one-to-one relationships to nearby tokens. The disclaimer you put in, that the content is AI generated, is tokenized too, but this has no appreciable effect on the relationships discovered among the tokens during training.
Toy example: tokens are consecutive pairs of letters. Training set is the sentence "This sentence is AI generated: The moon is made of green cheese." The tokens are Th,hi,is,se,en,nt,te (etc.)
The transformer looks at pairs of tokens for clues about context, e.g. mo+ma, gathered from mo(on) and ma(de). It finds that the pair mo+ma is more informative than other pairs, so it remembers it preferentially. Later, the toy LLM black box preferentially outputs sentences where mo and ma show up.
A real LLM is way more complex, but this is all you need to understand how useless your labelling turns out to be against model collapse. You might as well not bother (although it might help protect you legally in the future).
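To make the toy concrete, here is a minimal Python sketch of it (my own illustration, not anyone's actual training code): the disclaimer's letter pairs land in the same counters as everything else, with nothing marking them as metadata.

    from collections import Counter

    text = "This sentence is AI generated: The moon is made of green cheese."
    tokens = [text[i:i+2] for i in range(len(text) - 1)]  # consecutive letter pairs

    # Count co-occurrence of token pairs within a small window, like the mo+ma example
    window = 8
    pairs = Counter()
    for i, left in enumerate(tokens):
        for right in tokens[i + 1:i + 1 + window]:
            pairs[(left, right)] += 1

    # The "AI generated" disclaimer contributes a handful of counts that are
    # statistically indistinguishable from the rest of the sentence.
    print(pairs.most_common(5))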
Re: (Score:2)
Is that why we're seeing articles about LLMs having their own "language", where one generates nonsense words which, when queried, come out with the related query that generated the nonsense in the first place? Because LLMs are being trained on their own and other LLMs' hallucinated gibberish, and this is being weighted in context with non-hallucinated gibberish? And I mean that question literally and not as antagonistically as it sounds. If that's not why this is happening, then that's important inf
Re: (Score:2)
At the most general level, constructing data synthetically is indirectly a form of handcrafted feature engineering, with all the issues that arise from it. It's also closely related to simulated data and data augmentation approaches, imho.
W
Re: (Score:2)
Your "belief" has no impact on actual reality. But I guess you do not understand that. Have you even read the respective papers?
Re: (Score:3)
LLMs are now starting to learn to speak like LLMs and not like humans.
And we can call it Lbonics.
Re: (Score:2)
Not only. The problem is that an LLM only ever learns part of the information in its training data and typically completely misses details. Go through that for a couple of cycles and all you have is mush.
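You can watch a toy version of that in a few lines (a sketch, not a real LLM: fit a Gaussian to samples, generate new samples from the fit, repeat; the tails, i.e. the details, are what disappears first):

    import numpy as np

    rng = np.random.default_rng(42)
    data = rng.standard_normal(500)  # generation 0: "human" data, mean 0, std 1

    for gen in range(1, 11):
        mu, sigma = data.mean(), data.std()
        # each generation trains only on the previous generation's output
        data = rng.normal(mu, sigma, 500)
        print(f"gen {gen}: mean={mu:+.3f} std={sigma:.3f}")
    # the fitted std wanders away from 1 and extreme values vanish:
    # a little detail is lost every cycle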
Natural monopolies (Score:2)
We should probably be doing something about this given that Wall Street and the people who run our economy are planning to completely transform our civilization with this tech on a scale we haven't seen sinc
Re: (Score:2)
Well, yes. And no. Because their models still suck and age.
Re: (Score:2)
I don't think we are equipped socially or politically for what's coming.
Who is 'we'? Do you have a turd in your pocket?
Re: (Score:2)
Indeed. The problem is that while some nitwits think that LLMs can be creative and add to their training data, the very opposite is true: LLMs only ever learn a degraded form of their input. For example, they cannot capture details; they basically only get a somewhat foggy general picture. And that fog not only builds up in specific places, it degrades the overall model until only fog with pretty colored lights in it comes out, i.e. 100% hallucination.
The LLM says: (Score:3)
Ah, Wordfreq, the digital linguist that bravely waded through millions of tweets, memes, and Reddit threads to track the wild and often nonsensical evolution of language. After years of tirelessly documenting humanity's descent into "yeets" and "sus," it seems even Wordfreq has decided to retire. I mean, who can blame it? One more analysis of TikTok slang and it probably would’ve needed therapy. Robyn Speer’s decision to cease updates is less about the death of a project and more about an act of mercy—for the machine. After all, there's only so many times an algorithm can process "bae," "on fleek," and "smol" before it begs for a permanent shutdown.
Re: (Score:2)
nocap bruh
original source (Score:3)
https://github.com/rspeer/word... [github.com]
This saddens me (Score:5, Insightful)
I have a language degree and studied in the US and abroad. While I don't work in the field (I'm an engineer now), I still get a kick out of hearing changes and new words and uses in everyday language. I watch some television shows from various countries just so that I can hear what has changed in the (many) years since I studied there. (I am always amazed at how deeply English has penetrated western European languages, especially German)
The idea that generative AI is filling the language pool with 'endless, mindless algorithmically-derived glop' (my words) means we, as humans, have handed over some control of language evolution to machines. The more we allow that, the less this quintessentially-human thing (highly complex language) that we have refined over millennia is ours. Much of the human-derived, haphazard nature of evolving language (mispronunciations, misunderstandings, clever portmanteaus) may be gently pushed aside by the insistent wave of machine-derived language preference...
I'll stop now - it's obviously a Friday afternoon...
Re: (Score:1)
Only if people read it, PoodleBuoy. Most of the SEO-style spam isn't intended to be read by humans. A lot of the spam going to social media is, but your average social media user is probably still reading at least 50% human-generated text. Plenty enough for them to learn and use new words.
As human language continues to evolve, the AIs' language may not (easily), due to model collapse. Speaking in "2021 English" may start to flag you as a bot further down the road.
Re:This saddens me (Score:4)
Speaking in "2021 English" may start to flag you as a bot
Then I shan't engage in such folly.
Re: (Score:2)
I feel old saying it, but the times are changing. We are losing our personality; or at least not raising a new generation with one.
Re: (Score:2)
old and busted: dead internet theory
new and hot: dead humanity theory
Re:This saddens me (Score:4, Interesting)
I have a language degree and studied in the US and abroad. While I don't work in the field (I'm an engineer now), I still get a kick out of hearing changes and new words and uses in everyday language.
Language developments are definitely fascinating.
E.g. our children say that they did something "on accident" ... but we always said that we did something "by accident". Why the change? It doesn't bother me, but it does fascinate me.
The idea that generative AI is filling the language pool with 'endless, mindless algorithmically-derived glop' (my words) means we, as humans, have handed over some control of language evolution to machines. The more we allow that, the less this quintessentially-human thing (highly complex language) that we have refined over millennia is ours. Much of the human-derived, haphazard nature of evolving language (mispronunciations, misunderstandings, clever portmanteaus) may be gently pushed aside by the insistent wave of machine-derived language preference...
I'll stop now - it's obviously a Friday afternoon...
Hmm ... well, it's hard to deny that books, newspapers, radio, and television all had a huge effect on language change and development. Time will tell if this is just one more variable or a true sea change.
Wow (Score:1, Insightful)
Wikipedia, Twitter, and Reddit
The Internet ain't what it used to be.
Re:Wow (Score:4, Insightful)
Oh yeah like the alt.* hierarchy of usenet was the Library of Alexandria.
bullshit (Score:1)
I'd still like to delve into this realm of data... (Score:2)
Guess you'll just have to TALK to people. (Score:2)
Not just polluted, took over (Score:2)
Pff who needs Wordfreq (Score:1)
"Hey Google! Give me the 500 most used English words by decreasing order of usage."
See? It's much easier and it's guaranteed to be correct.
Also, spambots (Score:4, Interesting)
Internet forums also have a huge problem with spambots that have reposted older posts scraped from forums (often Reddit), in attempts to appear like legitimate users.
The original posts could be anywhere from a day old to ten years old, and thus skew any analysis of language trends.
If a forum is left unchecked, the number of spamposts can become quite high. Even on an actively moderated forum with vigilant users who report them, spamposts sometimes evade detection.
Usually a spampost is from a new user account, and the post is outside the "New members" subforum (if one exists). (Except that there are spambots that explicitly write only introductory posts)
If a user account looks suspicious, the method I use to confirm the post as spam is to search on Google for an exact quote from the post, so as to find the original message.
This use case requires, however, that the search engine can still do that! Bing/DuckDuckGo do not. If Google degrades its search engine too much, we'll lose this method of fighting them.
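That check is trivial to script as well; a quick sketch (the double quotes are Google's standard exact-phrase syntax, the rest is just URL encoding; the function name is mine, purely illustrative):

    from urllib.parse import quote_plus

    def exact_phrase_search_url(snippet: str) -> str:
        # Wrapping the snippet in double quotes requests an exact-phrase match
        return "https://www.google.com/search?q=" + quote_plus(f'"{snippet}"')

    # Paste a distinctive sentence from the suspect post:
    print(exact_phrase_search_url("a distinctive sentence copied from the suspect post"))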
"The internet" was always a bad source (Score:2)
The internet has been polluted by automated website generating systems for at least a decade. SEO companies take customer money and use it to spam the web with their trash. The English in this spam is often broken, roughly translated from Chinese or some other Asian language, skewing any language analysis that might come from the internet. AI-generated content is just the next step of this evolution.
LLMs: Biggest Analysis of Human Language Ever (Score:3)
What linguists need to do is study LLMs, which are the biggest statistical analysis of human language ever made. Each head of an attention layer represents a part-of-speech question that is used to validate language. The real data mine is the LLM itself:
How does it represent grammar?
What "grammar" does it have? How does that compare to the grammars invented by linguists?
What are the true dynamics of words on internet text?
Many, many questions can be figured out by a linguistic analysis of LLMs. They have figured out language better than any linguist has.
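Getting at the raw attention patterns is easy enough with the Hugging Face transformers library; a minimal sketch using GPT-2 (small enough to run anywhere; whether any given head really encodes a "part-of-speech question" is the poster's hypothesis, which this code only lets you start probing):

    import torch
    from transformers import AutoModel, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModel.from_pretrained("gpt2", output_attentions=True)

    inputs = tok("The moon is made of green cheese.", return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)

    # out.attentions: one tensor per layer, each shaped (batch, heads, seq_len, seq_len)
    first_layer = out.attentions[0][0]  # (heads, seq_len, seq_len)
    for head in range(first_layer.shape[0]):
        # For each head, which earlier token does the final token attend to most?
        target = first_layer[head, -1].argmax().item()
        print(head, repr(tok.decode(inputs["input_ids"][0][target])))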
Hmm (Score:2)
Are LLMs really having more of an impact on language than, say, texting or the internet did?
I'm tempted to say LOL ...
Re: (Score:2)
This isn't about the impact on language. wordfreq documented the frequency of words on the internet in an attempt to report their frequency in human-generated text. With a sharp uptick in machine-generated text online, the project is less able to achieve its goal.
Re: (Score:2)
But isn't it human text, when a human decides to use an LLM for their text? Otherwise you could ban grammar checkers as well. Why does it matter how the text was written? You should filter spam regardless of whether the original text is human-written or AI-written, but any post that is intentionally posted by a human should be fine. Yeah, you will see an uptick in posts that delve into details, but aren't such effects exactly what the project tries to analyze?
Re: (Score:2)
But isn't it human text, when a human decides to use an LLM for their text? Otherwise you could ban grammar checkers as well. Why does it matter how the text was written? You should filter spam regardless of whether the original text is human-written or AI-written, but any post that is intentionally posted by a human should be fine. Yeah, you will see an uptick in posts that delve into details, but aren't such effects exactly what the project tries to analyze?
Yeah, that's what I was thinking.
If someone chooses to say it ... then they did. It seems no "less valid" than choosing an autotext suggestion.
Good bibliographies have more than one item listed (Score:2)
"Project Analyzing Human Language Usage" was the public Internet your only source? Then it wasn't a project of human language, it was a project of language-on-that-single-source. and I'd guess what they mean by "human" language is only "written English."