
Project Analyzing Human Language Usage Shuts Down Because 'Generative AI Has Polluted the Data' (404media.co)

The creator of an open source project that scraped the internet to determine the ever-changing popularity of different words in human language says she is sunsetting the project because generative AI spam has poisoned the internet to the point that the project no longer has any utility. 404 Media: Wordfreq is a program that tracked the changing ways people used more than 40 different languages by analyzing millions of sources across Wikipedia, movie and TV subtitles, news articles, books, websites, Twitter, and Reddit. The system could be used to analyze changing language habits as slang and popular culture evolved, and was a resource for academics who study such things. In a note on the project's GitHub, creator Robyn Speer wrote that the project "will not be updated anymore."

"Generative AI has polluted the data," she wrote. "I don't think anyone has reliable information about post-2021 language usage by humans." She said that open web scraping was an important part of the project's data sources and "now the web at large is full of slop generated by large language models, written by no one to communicate nothing. Including this slop in the data skews the word frequencies." While there has always been spam on the internet and in the datasets that Wordfreq used, "it was manageable and often identifiable. Large language models generate text that masquerades as real language with intention behind it, even though there is none, and their output crops up everywhere," she wrote.

  • by RightwingNutjob ( 1302813 ) on Friday September 20, 2024 @06:16PM (#64804361)

    And never did.

    Things that get deliberately posted for all to see are by definition spoken in a different register than casual speech.

    Back when the written word had a bigger barrier to entry between thought and printing press, this was understood. But the wrong lesson seems to have been drawn from its disappearance, leading people to believe that blog posts and tweets are completely interchangeable with spoken interactions.

    The above statement was not AI generated, and it probably contains the gist of what I would have said out loud in casual conversation, but by virtue of taking longer to compose, it's more polished and has a few more $20 words and a few fewer profanities than if I'd just blurted it out.

    • by skam240 ( 789197 ) on Friday September 20, 2024 @06:35PM (#64804413)

      The Internet isn't real life? Good Lord! What have we all been using to post on Slashdot then!?

      • You're clearly not a gamer or of OG Slashdot. Anything that happens in games (which gets expanded to the internet, because they're online games) = game world, virtual world, whatever. Anything outside of games = IRL or in real life. Get with the program.
        • by skam240 ( 789197 )

          You're clearly not a gamer or of OG Slashdot

          Why is someone with a user number approaching infinity lecturing me about not being OG Slashdot? And yes, I play video games. Of course that has nothing to do with this though.

          Anything that happens in games (which gets expanded to the internet, because they're online games) = game world, virtual world, whatever. Anything outside of games = IRL or in real life. Get with the program.

          The point of my post is that people's use of language on the internet is in fact very real. One can see it, read it, and comprehend it. There is in fact nothing "not real" about it.

          • You're clearly not a gamer or of OG Slashdot

            Why is someone with a user number approaching infinity lecturing me about not being OG Slashdot?

            Mine predates yours. ;) Nah, more to the point I obviously agree with you.

            I suppose we can assume (I don't) that blog posts are all generative AI and that it pollutes the language. But the fact is that with the current state of generative AI it's usually pretty obvious what is coming out of those systems and what isn't. Granted, it's getting "better" as generative AI gets more and more crap fed into it, so it's even more insane than it was already. That said, a lot of blog posts are utterly insane so does it re

            • by sinij ( 911942 )
              No, I don't think it is obvious what comes out of AIs when it is mixed into human content and when it is carefully prompted.
              • Order of destruction:

                1. Blogs
                2. Clickbait 'news'
                3. Formula news - sports recaps, company financials, investment research, and other genre news. Sports recaps have been computer-generated for years now, though not with AI.
                4. Soft news - Health, garden, celebrity, staged-pretend-controversy news, ...
                5. Political news
                6. Hard news

                And somewhere between 4-6 will be
                - Youtube and other video platforms
                - Short and long fiction stories
                - Movie scripts, dialogue and stock scenes

                And eventually

            • by skam240 ( 789197 )

              Mine predates yours. ;)

              I'm only impressed by 5 digit or lower user numbers! It's still neat having one that low in the six figure range though :).

              Not that anyone's user number means much overall, of course. I just found it absurd to have someone with an 8-digit user number tell me I don't go far enough back with Slashdot to get something, given that a six-digit number means the user created their account at the very end of the '90s / early 2000s.

              • Now it doesn't even really matter because the Slashdot administrators have apparently decided to kill the site by shutting down new user registrations and doing all sorts of shadow bans on registered users. (They block entire IP blocks and VPNs for no apparent reason. "You are not allowed to use this resource.")
                • by skam240 ( 789197 )

                  From what I've both heard and seen, the first part of that is wrong; one just needs to send an email for review to get registered as a user.

        • by sinij ( 911942 )
          Yes, but that's becoming less and less true as online spills into IRL all the time.
    • These people don't think written and spoken interactions are interchangeable. Spoken and written language are known to differ in both grammar and lexicon. Some writers are known to follow more "spoken" conventions than others. Maybe the difference isn't big in all languages, though. It would certainly be worthwhile to monitor word frequency in spoken interaction, but that's clearly much more difficult to implement.

      • Absolutely no writer who writes books other people choose to pick up writes the way people talk. Any transcription of real extemporaneous spoken language is fucking unreadable. It wanders, it doubles back, it contradicts itself. It relies on context communicated with body language or other visual aids.

        • See Louis-Ferdinand Céline (1894-1961), usually referred to simply as "Céline". Wikipedia says:
          * "Céline shocked many critics by his use of a unique language based on the spoken French of the working class"
          * "Céline is widely considered to be one of the greatest French novelists of the 20th century and his novels have had an enduring influence on later authors." https://en.wikipedia.org/wiki/... [wikipedia.org]

          There is a vast literature on the spoken quality of Céline's writing style. One example

    • by 2TecTom ( 311314 )

      it's not just our language usage or the Internet; this is the inevitable result of corporatocracy, classism and corruption

      our entire society is collapsing from unethical rot, from the top down

      • Ours is not a top-down society. If it were, it wouldn't have survived the continuing stream of embarrassments from the top.

        You can see top-down societies in action. North Korea, the Soviet Union, etc. They usually don't last very long. And if they do, they have conspicuous deficiencies in terms of basic stuff like being able to keep the lights on or food on the table.

        • by 2TecTom ( 311314 )

          Ours is not a top-down society. If it were, it wouldn't have survived the continuing stream of embarrassments from the top.

          You can see top-down societies in action. North Korea, the Soviet Union, etc. They usually don't last very long. And if they do, they have conspicuous deficiencies in terms of basic stuff like being able to keep the lights on or food on the table.

          our society is a top down society, that's called classism

          the rich are on top and the rest of us are not

          typical red baiting bs

          as if our government isn't just as corrupt and like we don't have starvation or poverty here

          typical pseudo-conservative denial

          • Of course the government is corrupt. Reason we don't have *more* "starvation" (read: having to buy store brand steak on your EBT card) is that our government is limited and therefore its ability to fuck things up for the rest of us is limited.

            • by 2TecTom ( 311314 )

              of course if we didn't have good government, it would be far worse

              those who are against governance say so because they selfishly don't want to pay their fair share

              typical pseudo-conservative classist self-justification and deflection

    • But the kinds of language & their recurrent lexicogrammatical realisations within their respective genres & register configurations also correlate strongly (overlap) with both written & spoken forms of communication. There are some features that are indeed unique to some genres of internet discourse, but those make up a relatively small percentage overall. Much of informal internet discourse seems to be somewhere between / a mixture of written & spoken genres & conventions
    • by Kisai ( 213879 )

      Blog posts are train-of-thought. Tweets are train-of-thought. Hell, even on Reddit, SomethingAwful, Fark, Slashdot, and so forth, UGC is still train-of-thought more than any real effort to research and verify.

      And AI "slop" is only going to get worse until we abandon the IPv4 internet and create spaces where generative AI is not allowed to participate and where those using generative AI are not allowed to post to trusted sources.

      And 2021 wasn't when this crap started either. That was merely the turning point in which c

      • I think you got that backwards: the "AI slop" is getting better, to the point that we can't tell it from quality human text. If you don't want the regular LLM register, just ask it to write like a 12-year-old or like The Onion. As for the content, you can take a bunch of human slop like this thread or Reddit and task the LLM with crafting a coherent article from it. The internet has degraded so much over the years that LLM text is almost always better than random web pages. When I try to read news from the Google Dis
    • "have made people believe that blog posts and tweets are completely interchangeable with spoken interactions." I wouldn't say this is entirely accurate. People are starting to use phrases like "unalived himself" in spoken interactions because they have been conditioned into thinking that "death" and "suicide" are things that are never to be said, because social media auto moderation paints them as a bad guy if they do. Sometimes with penalties far disproportionate to the "infraction". Also, the spell check
    • And yet we have been able to understand and track the evolution of languages over centuries by analyzing written media - because contemporary usage, neologisms and language drift permeate writing on average, despite individual author style. Slang getting incorporated into letters and then art, and becoming canonical, has been part of that forever - and statistical linguistic analysis has been a science for at least a century.

      I think they may be throwing in the towel early on the origination problem, but they a

  • by gweihir ( 88907 ) on Friday September 20, 2024 @06:24PM (#64804385)

    It also shows another thing: Internet language data is fast becoming unusable for model training due to model collapse.

    • by quonset ( 4839537 ) on Friday September 20, 2024 @06:42PM (#64804433)

      It also shows another thing: Internet language data is fast becoming unusable for model training due to model collapse.

      Just wait until the same thing happens when using paintings as a reference [imgur.com].

    • In other words, LLMs are now starting to learn to speak like LLMs and not like humans. I should take an LLM and modify it so that every sentence it speaks starts with "LLM". Soon the other LLMs will learn it and their sentences will start with "LLM" as well. And then we can filter out any sentence that starts with "LLM".
      • by PPH ( 736903 )

        LLMs are now starting to learn to speak like LLMs and not like humans.

        And we can call it Lbonics.

      • by gweihir ( 88907 )

        Not only that. The problem is that an LLM only ever learns part of the information in its training data and typically misses details completely. Go through that for a couple of cycles and all you have is mush.

    • Anyone who got in on the ground floor has a massive competitive advantage that is basically impossible to overcome. You had mountains of free data to train your models on. Any potential competitors won't have that data, and if they do manage to get hold of it, it'll be so polluted as to be worthless.

      We should probably be doing something about this given that Wall Street and the people who run our economy are planning to completely transform our civilization with this tech on a scale we haven't seen sinc
      • by gweihir ( 88907 )

        Well, yes. And no. Because their models still suck and age.

      • I don't think we are equipped socially or politically for what's coming.

        Who is 'we'? Do you have a turd in your pocket?

      • While the barrier to entry is certainly real, I feel like the only actually "good" examples of LLMs have been the result of normal humans training very small models on highly curated data. Stable Diffusion images, for instance, are atrociously bad by default, but heavily guided versions, with checkpoints and LoRAs generated from a comparatively small subset of data to make more focused images, produce things people are more likely to call "good", even if it's perhaps a bit (highly) derivative of the source trainin
    • Pretty much. Soon enough it'll be the 'AI' equivalent of copying a VHS tape too many times. They'll be outputting nonsensical gibberish that's geometrically more nonsensical than it already is.
      • by gweihir ( 88907 )

        Indeed. The problem is that while some nitwits think that LLMs can be creative and add to their training data, the very opposite is true: LLMs only ever learn a degraded form of their input. For example, they cannot capture details; they basically only get a somewhat foggy general picture. And that fog not only builds up in specific places, it degrades the overall model until only fog with pretty colored lights in it comes out, i.e. 100% hallucination.
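
        A toy illustration of that degradation cycle (an illustration only, not a claim about any particular LLM): fit a distribution to data, sample from the fit, refit on the samples, and repeat. A minimal Python sketch assuming only numpy; the sample size and generation count are arbitrary:

            import numpy as np

            rng = np.random.default_rng(42)
            data = rng.normal(0.0, 1.0, size=20)  # generation 0: "human" data

            for gen in range(1, 201):
                mu, sigma = data.mean(), data.std()
                # The next "model" is trained only on the previous model's output.
                data = rng.normal(mu, sigma, size=20)
                if gen % 25 == 0:
                    print(f"generation {gen:3d}: std ~ {sigma:.4f}")

        The fitted spread tends to shrink generation after generation: the tails, i.e. the "details", disappear first, leaving output that clusters ever more tightly around the mean.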

    • "Model collapse" sounds like "Slashdont" ... it's mode collapse, as in statistical modes, which are centers of probability in a distribution. The mode collapse means a single mode remains where in reality there are many.
  • by chuckugly ( 2030942 ) on Friday September 20, 2024 @06:31PM (#64804397)

    Ah, Wordfreq, the digital linguist that bravely waded through millions of tweets, memes, and Reddit threads to track the wild and often nonsensical evolution of language. After years of tirelessly documenting humanity's descent into "yeets" and "sus," it seems even Wordfreq has decided to retire. I mean, who can blame it? One more analysis of TikTok slang and it probably would’ve needed therapy. Robyn Speer’s decision to cease updates is less about the death of a project and more about an act of mercy—for the machine. After all, there's only so many times an algorithm can process "bae," "on fleek," and "smol" before it begs for a permanent shutdown.

  • This saddens me (Score:5, Insightful)

    by PuddleBoy ( 544111 ) on Friday September 20, 2024 @06:38PM (#64804423)

    I have a language degree and studied in the US and abroad. While I don't work in the field (I'm an engineer now), I still get a kick out of hearing changes and new words and uses in everyday language. I watch some television shows from various countries just so that I can hear what has changed in the (many) years since I studied there. (I am always amazed at how deeply English has penetrated western European languages, especially German)

    The idea that generative AI is filling the language pool with 'endless, mindless algorithmically-derived glop' (my words) means we, as humans, have handed over some control of language evolution to machines. The more we allow that, the less this quintessentially-human thing (highly complex language) that we have refined over millennia is ours. Much of the human-derived, haphazard nature of evolving language (mispronunciations, misunderstandings, clever portmanteaus) may be gently pushed aside by the insistent wave of machine-derived language preference...

    I'll stop now - it's obviously a Friday afternoon...

    • Only if people read it, PoodleBuoy. Most of the SEO-style spam isn't intended to be read by humans. A lot of the spam going to social media is, but your average social media user is probably still reading at least 50% human-generated text. Plenty enough for them to learn and use new words.

      As human language continues to evolve, the AIs' language may not (easily), due to model collapse. Speaking in "2021 English" may start to flag you as a bot further down the road.

    • Please sir, may I have some more madglop?
    • I was in a store a few weeks ago, and a child maybe 3 or 4 years old was talking to her mother - she had no emotion in her voice and strung the words together like an internet chatbot.

      I feel old saying it, but the times are changing. We are losing our personality; or at least not raising a new generation with one.
    • Re:This saddens me (Score:4, Interesting)

      by cascadingstylesheet ( 140919 ) on Saturday September 21, 2024 @06:45AM (#64805243) Journal

      I have a language degree and studied in the US and abroad. While I don't work in the field (I'm an engineer now), I still get a kick out of hearing changes and new words and uses in everyday language.

      Language developments are definitely fascinating.

      E.g. our children say that they did something "on accident" ... but we always said that we did something "by accident". Why the change? It doesn't bother me, but it does fascinate me.

      The idea that generative AI is filling the language pool with 'endless, mindless algorithmically-derived glop' (my words) means we, as humans, have handed over some control of language evolution to machines. The more we allow that, the less this quintessentially-human thing (highly complex language) that we have refined over millennia is ours. Much of the human-derived, haphazard nature of evolving language (mispronunciations, misunderstandings, clever portmanteaus) may be gently pushed aside by the insistent wave of machine-derived language preference...

      I'll stop now - it's obviously a Friday afternoon...

      Hmm ... well, it's hard to deny that books, newspapers, radio, and television all had a huge effect on language change and development. Time will tell if this is just one more variable or a true sea change.

    • In the midst of the vibrant tapestry of existence, one finds oneself wandering through the myriad pathways of thought, where the sun gently bathes the landscape in a warm embrace, illuminating the delicate petals of daisies that sway softly in the breeze. As clouds drift lazily across the cerulean sky, the symphony of nature plays a silent melody, echoing the timeless rhythm of life itself. In this serene moment, one may ponder the complexities of human experience, reflecting on the intertwining threads of
  • Wow (Score:1, Insightful)

    by The Cat ( 19816 )

    Wikipedia, Twitter, and Reddit

    The Internet ain't what it used to be.

  • by Anonymous Coward
    what a load of crap, the internet has been full of bots long before AI came along. Is she claiming all that shit didn't pollute the project? Sounds like just an easy excuse.
    • Yeah, article spinners have been around for some time now, just "rewriting" articles from other websites. This is not a new concept, although social media is now more often AI-generated than it was previously. I know I use AI sometimes to write my posts, mainly because it auto-adds the emojis.
  • It's important to consider how LLMs have polluted the tapestry of internet language so I hope they embark on a comprehensive journey to document this vital landscape.
    • Simple: the places where people are writing now are the AI chatbot rooms. So you can still get that pure human text, if you are OpenAI. How ironic.
  • "What a fuckin' nightmayuh!" /Marisa Tomei
  • We created linguistic AI in the form of LLMs; it can do nothing but generate speech. It took over language, and now we can no longer know what language is. Pray that we don't create AI that can reason, as we won't know what reason is after that.
    • We're not in any danger of creating machines that can truly reason. We'll have practical, large-scale commercial fusion reactors a long time before that ever happens.
  • "Hey Google! Give me the 500 most used English words by decreasing order of usage."

    See? It's much easier and it's guaranteed to be correct.

  • Also, spambots (Score:4, Interesting)

    by Misagon ( 1135 ) on Friday September 20, 2024 @10:51PM (#64804783)

    Internet forums also have a huge problem with spambots that repost older posts scraped from other forums (often Reddit) in an attempt to look like legitimate users.
    The original posts can be anywhere from a day to ten years old, and thus skew any analysis of language trends.

    If a forum is left unchecked, the amount of spamposts could become quite high. Even on an actively moderated forum with vigilant users that report them, spamposts sometimes evade detection.

    Usually a spampost is from a new user account, and the post is outside the "New members" subforum (if one exists). (Except that there are spambots that explicitly write only introductory posts.)
    If a user account looks suspicious, the method I use to confirm a post as spam is to search Google for an exact quote from the post, so as to find the original message.
    This use case requires, however, that the search engine can still do that! Bing/DuckDuckGo cannot. If Google degrades its search engine too much, we'll lose this method against them.
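
    A minimal sketch of that exact-quote check (Python standard library only; the double quotes ask the engine for a verbatim match, and the function name is just illustrative):

        from urllib.parse import quote_plus

        def exact_phrase_search_url(snippet: str) -> str:
            # Wrapping the snippet in double quotes requests an exact-phrase match.
            return "https://www.google.com/search?q=" + quote_plus(f'"{snippet}"')

        print(exact_phrase_search_url("an exact sentence copied from the suspect post"))

    If the search turns up the same text on another forum under a different name and an older date, the post is almost certainly a scraped repost.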

  • The internet has been polluted by automated website generating systems for at least a decade. SEO companies take customer money and use it to spam the web with their trash. The English in this spam is often broken, roughly translated from Chinese or some other Asian language, skewing any language analysis that might come from the internet. AI-generated content is just the next step of this evolution.

  • by cowdung ( 702933 ) on Saturday September 21, 2024 @05:38AM (#64805139)

    What linguists need to do is study LLMs, which are the biggest statistical analysis of human language ever made. Each head of an attention layer represents a part-of-speech question that is used to validate language. The real data mine is the LLM itself:

    How does it represent grammar?
    What "grammar" does it have? How does that compare to the grammars invented by linguists?
    What are the true dynamics of words on internet text?

    Many, many questions could be answered by a linguistic analysis of LLMs. They have figured out language better than any linguist has.

    • Funnily, the wordfreq project counts frequencies of words, which are unigrams. N-gram models are also language models. So it is building a sort of language model, but the most primitive there is.
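
      A unigram "model" in that sense is just a normalized frequency table. A minimal sketch over a toy corpus (the corpus string is made up for illustration):

          from collections import Counter
          import re

          corpus = "the cat sat on the mat and the dog sat too"
          tokens = re.findall(r"[a-z']+", corpus.lower())

          counts = Counter(tokens)
          total = sum(counts.values())
          unigram = {word: n / total for word, n in counts.items()}

          print(unigram["the"])  # P("the") under the unigram model

      Scoring a sentence under such a model just multiplies its word probabilities together, ignoring word order entirely - hence "the most primitive there is".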
  • Are LLMs really having more of an impact on language than, say, texting or the internet did?

    I'm tempted to say LOL ...

    • This isn't about the impact on language. wordfreq documented the frequency of words on the internet in an attempt to report their frequency in human-generated text. With a sharp uptick in machine-generated text online, the project is less able to achieve its goal.

      • by allo ( 1728082 )

        But isn't it human text when a human decides to use an LLM for their text? Otherwise you could ban grammar checkers as well. Why does it matter how the text was written? You should filter spam regardless of whether the original text is human-written or AI-written, but any post that is intentionally posted by a human should be fine. Yeah, you will see an uptick in posts that "delve" into details, but aren't such effects exactly what the project tries to analyze?

        • But isn't it human text when a human decides to use an LLM for their text? Otherwise you could ban grammar checkers as well. Why does it matter how the text was written? You should filter spam regardless of whether the original text is human-written or AI-written, but any post that is intentionally posted by a human should be fine. Yeah, you will see an uptick in posts that "delve" into details, but aren't such effects exactly what the project tries to analyze?

          Yeah, that's what I was thinking.

          If someone chooses to say it ... then they did. It seems no "less valid" than choosing an autotext suggestion.

  • "Project Analyzing Human Language Usage" was the public Internet your only source? Then it wasn't a project of human language, it was a project of language-on-that-single-source. and I'd guess what they mean by "human" language is only "written English."
