AI IT Technology

Project Analyzing Human Language Usage Shuts Down Because 'Generative AI Has Polluted the Data' (404media.co) 26

The creator of an open source project that scraped the internet to determine the ever-changing popularity of different words in human language usage says that they are sunsetting the project because generative AI spam has poisoned the internet to a level where the project no longer has any utility. 404 Media: Wordfreq is a program that tracked the ever-changing ways people used more than 40 different languages by analyzing millions of sources across Wikipedia, movie and TV subtitles, news articles, books, websites, Twitter, and Reddit. The system could be used to analyze changing language habits as slang and popular culture changed and language evolved, and was a resource for academics who study such things. In a note on the project's GitHub, creator Robyn Speer wrote that the project "will not be updated anymore."
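As a rough illustration of what tracking word popularity involves, and of how an influx of machine-generated text could shift the numbers, here is a minimal Python sketch. It is not Wordfreq's actual method or API. The function `relative_frequencies` and both toy corpora are hypothetical, invented only for this example.

```python
from collections import Counter
import re


def relative_frequencies(texts):
    """Return each word's share of all word tokens in the given texts."""
    counts = Counter()
    for text in texts:
        # Very crude tokenizer: lowercase runs of letters and apostrophes.
        counts.update(re.findall(r"[a-z']+", text.lower()))
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


# Hypothetical toy corpora, invented for illustration only.
human_texts = [
    "we looked into the history of the town",
    "the town has a long history",
]
generated_texts = [
    "let us delve into the rich tapestry of the town",
    "delve into this tapestry of history",
]

before = relative_frequencies(human_texts)
after = relative_frequencies(human_texts + generated_texts)

# Compare the share of a few words with and without the added texts.
for word in ("the", "history", "delve"):
    print(f"{word}: {before.get(word, 0):.3f} -> {after.get(word, 0):.3f}")
```

In this toy data, a word that never appears in the human-written texts gains a nonzero share once the other texts are mixed in, while the shares of the original words shrink. A real pipeline would need to tell the two kinds of text apart before counting, and the article's point is that this is no longer practical on the open web.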

"Generative AI has polluted the data," she wrote. "I don't think anyone has reliable information about post-2021 language usage by humans." She said that open web scraping was an important part of the project's data sources and "now the web at large is full of slop generated by large language models, written by no one to communicate nothing. Including this slop in the data skews the word frequencies." While there has always been spam on the internet and in the datasets that Wordfreq used, "it was manageable and often identifiable. Large language models generate text that masquerades as real language with intention behind it, even though there is none, and their output crops up everywhere," she wrote.

  • And never did.

    Things that get deliberately posted for all to see are by definition spoken in a different register than casual speech.

    Back when the written word had a bigger barrier to entry between thought and printing press, this was understood. But the wrong lesson for why seems to have made people believe that blog posts and tweets are completely interchangeable with spoken interactions.

    The above statement was not AI generated, and it probably contains the gist of what I would have said out loud in casua

    • by skam240 ( 789197 )

      The Internet isn't real life? Good Lord! What have we all been using to post on Slashdot then!?

      • You're clearly not a gamer or of OG Slashdot. Anything that happens in games (which gets expanded to the internet, because they're online games) = game world, virtual world, whatever. Anything outside of games = IRL or in real life. Get with the program.
        • by skam240 ( 789197 )

          You're clearly not a gamer or of OG Slashdot

          Why is someone with a user number approaching infinity lecturing me about not being OG Slashdot? And yes, I play video games. Of course that has nothing to do with this though.

          Anything that happens in games (which gets expanded to the internet, because they're online games) = game world, virtual world, whatever. Anything outside of games = IRL or in real life. Get with the program.

          The point of my post is that people's use of language on the internet is in fact very real. One can see it, read it, and comprehend it. There is in fact nothing "not real" about it.

    • These people don't think written and spoken interactions are interchangeable. Spoken and written languages are known to differ in both grammar and lexicon. Some writers are known to follow more "spoken" conventions than others. Maybe the difference isn't big in all languages, though. It would certainly be worthwhile to monitor the frequency of words in spoken interaction, but it's clearly much more difficult to implement.

      • Absolutely no writer who writes books other people choose to pick up writes the way people talk. Any transcription of real extemporaneous spoken language is fucking unreadable. It wanders, it doubles back, it contradicts itself. It relies on context communicated with body language or other visual aids.

  • Generative AI has polluted Poochie's reason for visiting Earth, so he is going back to his home planet where he is needed. Nothing of value was lost.

  • by gweihir ( 88907 ) on Friday September 20, 2024 @07:24PM (#64804385)

    It also shows another thing: Internet language data is fast becoming unusable for model training due to model collapse.

    • It also shows another thing: Internet language data is fast becoming unusable for model training due to model collapse.

      Just wait until the same thing happens when using paintings as a reference [imgur.com].

    • I believe model collapse is over-hyped. There will still be high-quality content that is identified and used (although now paid for, due to copyright). Models have been trained on their older siblings' content for years when they needed more training data, so this is nothing new. As long as the AI model gets across the point that the human author intends, which it should or else the author wouldn't publish it, then there isn't much value lost.

      Completely AI generated blogs, like one that
    • In other words, LLMs are now starting to learn to speak like LLMs and not like humans. I should take an LLM and modify it so that every sentence it speaks starts with "LLM". Soon the other LLMs will learn, and their sentences will start with "LLM" as well. And then we can filter out any sentence that starts with "LLM".
  • by chuckugly ( 2030942 ) on Friday September 20, 2024 @07:31PM (#64804397)

    Ah, Wordfreq, the digital linguist that bravely waded through millions of tweets, memes, and Reddit threads to track the wild and often nonsensical evolution of language. After years of tirelessly documenting humanity's descent into "yeets" and "sus," it seems even Wordfreq has decided to retire. I mean, who can blame it? One more analysis of TikTok slang and it probably would’ve needed therapy. Robyn Speer’s decision to cease updates is less about the death of a project and more about an act of mercy—for the machine. After all, there's only so many times an algorithm can process "bae," "on fleek," and "smol" before it begs for a permanent shutdown.

  • This saddens me (Score:5, Insightful)

    by PuddleBoy ( 544111 ) on Friday September 20, 2024 @07:38PM (#64804423)

    I have a language degree and studied in the US and abroad. While I don't work in the field (I'm an engineer now), I still get a kick out of hearing changes and new words and uses in everyday language. I watch some television shows from various countries just so that I can hear what has changed in the (many) years since I studied there. (I am always amazed at how deeply English has penetrated western European languages, especially German)

    The idea that generative AI is filling the language pool with 'endless, mindless algorithmically-derived glop' (my words) means we, as humans, have handed over some control of language evolution to machines. The more we allow that, the less this quintessentially-human thing (highly complex language) that we have refined over millennia is ours. Much of the human-derived, haphazard nature of evolving language (mispronunciations, misunderstandings, clever portmanteaus) may be gently pushed aside by the insistent wave of machine-derived language preference...

    I'll stop now - it's obviously a Friday afternoon...

    • Only if people read it, PoodleBuoy. Most of the SEO-style spam isn't intended to be read by humans. A lot of the spam going to social media is, but your average social media user is probably still reading at least 50% human-generated text. Plenty enough for them to learn and use new words.

      As human language continues to evolve, the AIs' language may not (easily), due to model collapse. Speaking in "2021 English" may start to flag you as a bot further down the road.

  • by The Cat ( 19816 )

    Wikipedia, Twitter, and Reddit

    The Internet ain't what it used to be.

  • It's important to consider how LLMs have polluted the tapestry of internet language so I hope they embark on a comprehensive journey to document this vital landscape.
  • "What a fuckin' nightmayuh!" /Marisa Tomei
  • We created linguistic AI in the form of LLMs, and it can do nothing but generate speech. It took over language, and we can no longer know what language is. Pray that we don't create AI that can reason, as we won't know what reason is after that.
  • "Hey Google! Give me the 500 most used English words by decreasing order of usage."

    See? It's much easier and it's guaranteed to be correct.