Project Analyzing Human Language Usage Shuts Down Because 'Generative AI Has Polluted the Data' (404media.co)
The creator of an open source project that scraped the internet to determine the ever-changing popularity of different words in human language usage says that they are sunsetting the project because generative AI spam has poisoned the internet to a level where the project no longer has any utility. 404 Media: Wordfreq is a program that tracked the evolving ways people used more than 40 different languages by analyzing millions of sources across Wikipedia, movie and TV subtitles, news articles, books, websites, Twitter, and Reddit. The system could be used to analyze changing language habits as slang and popular culture shifted, and was a resource for academics who study such things. In a note on the project's GitHub, creator Robyn Speer wrote that the project "will not be updated anymore."
"Generative AI has polluted the data," she wrote. "I don't think anyone has reliable information about post-2021 language usage by humans." She said that open web scraping was an important part of the project's data sources and "now the web at large is full of slop generated by large language models, written by no one to communicate nothing. Including this slop in the data skews the word frequencies." While there has always been spam on the internet and in the datasets that Wordfreq used, "it was manageable and often identifiable. Large language models generate text that masquerades as real language with intention behind it, even though there is none, and their output crops up everywhere," she wrote.
"Generative AI has polluted the data," she wrote. "I don't think anyone has reliable information about post-2021 language usage by humans." She said that open web scraping was an important part of the project's data sources and "now the web at large is full of slop generated by large language models, written by no one to communicate nothing. Including this slop in the data skews the word frequencies." While there has always been spam on the internet and in the datasets that Wordfreq used, "it was manageable and often identifiable. Large language models generate text that masquerades as real language with intention behind it, even though there is none, and their output crops up everywhere," she wrote.
Internet != real life (Score:5, Insightful)
And never did.
Things that get deliberately posted for all to see are by definition spoken in a different register than casual speech.
Back when the written word had a bigger barrier to entry between thought and printing press, this was understood. But the wrong lesson seems to have been drawn from why, leading people to believe that blog posts and tweets are completely interchangeable with spoken interactions.
The above statement was not AI generated, and it probably contains the gist of what I would have said out loud in casual conversation, but by virtue of taking longer to compose, it's more polished and has a few more $20 words and a few fewer profanities than if I'd just blurted it out.
Re:Internet != real life (Score:5, Funny)
The Internet isn't real life? Good Lord! What have we all been using to post on Slashdot then!?
Re: (Score:2)
You're clearly not a gamer or of OG Slashdot
Why is someone with a user number approaching infinity lecturing me about not being OG Slashdot? And yes, I play video games. Of course that has nothing to do with this though.
Anything that happens in games (which gets expanded to the internet, because they're online games) = game world, virtual world, whatever. Anything outside of games = IRL or in real life. Get with the program.
The point of my post is that people's use of language on the internet is in fact very real. One can see it, read it, and comprehend it. There is in fact nothing "not real" about it.
Re: (Score:2)
You're clearly not a gamer or of OG Slashdot
Why is someone with a user number approaching infinity lecturing me about not being OG Slashdot?
Mine predates yours. ;) Nah, more to the point I obviously agree with you.
I suppose we can assume (I don't) that if blog posts are all generative AI, it pollutes the language. But the fact is that with the current state of generative AI it's usually pretty obvious what is coming out of those systems and what isn't. Granted, it's getting "better" with generative AI getting more and more crap fed into it so it's even more insane than it was already. That said, a lot of blog posts are utterly insane so does it re
First to go, next is videos (Score:2)
Order of destruction:
1. Blogs ...
2. Clickbait 'news'
3. Formula news - sports recaps, company financials, investment research, and other genre news. Sports recaps have been computer-generated for years now, though not by AI.
4. Soft news - Health, garden, celebrity, staged-pretend-controversy news.
5. Political news
6. Hard news
And somewhere between 4-6 will be
- Youtube and other video platforms
- Short and long fiction stories
- Movie scripts, dialogue and stock scenes
And eventually
Re: (Score:2)
Mine predates yours. ;)
I'm only impressed by 5 digit or lower user numbers! It's still neat having one as low as the six figure range though :).
Not that anyone's user number means much overall of course, I just found it absurd having someone with an 8 digit user number tell me I don't go far enough back with Slashdot to get something, given that a six digit number means the user created their account at the very end of the 90's / early 2000's.
Re: (Score:2)
From what I've both heard and seen, your first part is wrong; one just needs to send an email for review to get registered as a user.
Re: (Score:2)
These people don't think written and spoken interactions are interchangeable. Spoken and written languages are known to differ in both grammar and lexicon. Some writers are known to follow more "spoken" conventions than others. Maybe the difference isn't big in all languages, though. It would certainly be worthwhile to monitor the frequency of spoken interaction, but it's clearly much more difficult to implement.
Re: Internet != real life (Score:2)
Absolutely no writer who writes books other people choose to pick up writes the way people talk. Any transcription of real extemporaneous spoken language is fucking unreadable. It wanders, it doubles back, it contradicts itself. It relies on context communicated with body language or other visual aids.
Re: (Score:2)
See Louis-Ferdinand Céline (1894-1961) usually just referred as "Céline". Wikipedia says:
* "Céline shocked many critics by his use of a unique language based on the spoken French of the working class"
* "Céline is widely considered to be one of the greatest French novelists of the 20th century and his novels have had an enduring influence on later authors." https://en.wikipedia.org/wiki/... [wikipedia.org]
There is a vast literature just on the spoken quality of Céline's writing style. One example
Re: (Score:2)
it's not just our language usage or the Internet, this is the inevitable result of corporatocracy, classism and corruption
our entire society is collapsing from the unethical rot from the top down
Re: Internet != real life (Score:2)
Ours is not a top-down society. If it were, it wouldn't have survived the continuing stream of embarrassments from the top.
You can see top-down societies in action. North Korea, the Soviet Union, etc. They usually don't last very long. And if they do, they have conspicuous deficiencies in terms of basic stuff like being able to keep the lights on or food on the table.
Re: (Score:2)
Ours is not a top-down society. If it were, it wouldn't have survived the continuing stream of embarrassments from the top.
You can see top-down societies in action. North Korea, the Soviet Union, etc. They usually don't last very long. And if they do, they have conspicuous deficiencies in terms of basic stuff like being able to keep the lights on or food on the table.
our society is a top down society, that's called classism
the rich are on top and the rest of us are not
typical red baiting bs
as if our government isn't just as corrupt and like we don't have starvation or poverty here
typical pseudo-conservative denial
Re: Internet != real life (Score:2)
Of course the government is corrupt. Reason we don't have *more* "starvation" (read: having to buy store brand steak on your EBT card) is that our government is limited and therefore its ability to fuck things up for the rest of us is limited.
Re: (Score:2)
of course if we didn't have good government, it would be far worse
those who are against governance say so because they selfishly don't want to pay their fair share
typical pseudo-conservative classist self-justification and deflection
Re: (Score:3)
Blog posts are train-of-thought. Tweets are train-of-thought. Hell, even UGC on Reddit, somethingawful, fark, slashdot, and so forth is still train-of-thought more than any real effort to research and verify.
And AI "slop" is only going to get worse until we abandon the ipv4 internet and create spaces where generative AI is not allowed to participate and those using generative AI are not allowed to post to trusted sources.
And 2021 wasn't when this crap started either. That was merely the turning point in which c
Heavy handed social media censorship (Score:2)
Re: Internet != real life (Score:2)
And yet we have been able to understand and track the evolution of languages over centuries by analyzing written media - because contemporary usage, neologisms and language drift permeate language on average despite individual author style. Slang getting incorporated into letters and then art, and becoming canonical, has been part of that forever - and statistical linguistic analysis has been a science for at least a century.
I think they may be throwing in the towel early on the origination problem, but they a
That is unfortunate (Score:5, Insightful)
It also shows another thing: Internet language data is fast becoming unusable for model training due to model collapse.
Re:That is unfortunate (Score:4, Interesting)
It also shows another thing: Internet language data is fast becoming unusable for model training due to model collapse.
Just wait until the same thing happens when using paintings as a reference [imgur.com].
Re:That is unfortunate (Score:4, Interesting)
A document gets read, and it gets split into tokens. The tokens are trained on and used, together with their one-to-one relationships to nearby tokens. The disclaimer you put in, that the content is AI generated, is tokenized too, but this has no appreciable effect on the relationships discovered among the tokens during training.
Toy example: tokens are consecutive pairs of letters. Training set is the sentence "This sentence is AI generated: The moon is made of green cheese." The tokens are Th,hi,is,se,en,nt,te (etc.)
The transformer looks at pairs of tokens for clues about context, e.g. mo+ma, gathered from mo(on) and ma(de). It finds that the pair mo+ma is more informative than other pairs, so it remembers it preferentially. Later, the toy LLM black box preferentially outputs sentences where mo and ma show up.
A real LLM is way more complex, but this is all you need to understand how useless your labelling turns out to be against model collapse. You might as well not bother (although it might help protect you legally in the future).
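To make the toy concrete, here is a minimal Python sketch of it (my own illustration, not anyone's actual training code): the disclaimer's letter pairs land in the same counters as everything else, with nothing marking them as metadata.

    from collections import Counter

    text = "This sentence is AI generated: The moon is made of green cheese."
    tokens = [text[i:i+2] for i in range(len(text) - 1)]  # consecutive letter pairs

    # Count co-occurrence of token pairs within a small window, like the mo+ma example
    window = 8
    pairs = Counter()
    for i, left in enumerate(tokens):
        for right in tokens[i + 1:i + 1 + window]:
            pairs[(left, right)] += 1

    # The "AI generated" disclaimer contributes a handful of counts that are
    # statistically indistinguishable from the rest of the sentence.
    print(pairs.most_common(5))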
Re: (Score:2)
Is that why we're seeing articles about LLMs having their own "language", where one generates nonsense words which, when queried, come out with the related query that generated the nonsense in the first place? Because LLMs are being trained on their own and other LLMs' hallucinated gibberish, and this is being weighted in context with non-hallucinated gibberish? And I mean that question literally and not as antagonistically as it sounds. If that's not why this is happening, then that's important inf
Re: (Score:2)
At the most general level, constructing data synthetically is indirectly a form of handcrafted feature engineering, with all the issues that arise from it. It's also closely related to simulated data and data augmentation approaches, imho.
W
Re: (Score:2)
Your "belief" has no impact on actual reality. But I guess you do not understand that. Have you even read the respective papers?
Re: (Score:3)
LLMs are now starting to learn to speak like LLMs and not like humans.
And we can call it Lbonics.
Re: (Score:2)
Not only. The problem is that an LLM only ever learns part of the information in its training data and typically completely misses details. Go through that for a couple of cycles and all you have is mush.
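You can watch a toy version of that in a few lines (a sketch, not a real LLM: fit a Gaussian to samples, generate new samples from the fit, repeat; the tails, i.e. the details, are what disappears first):

    import numpy as np

    rng = np.random.default_rng(42)
    data = rng.standard_normal(500)  # generation 0: "human" data, mean 0, std 1

    for gen in range(1, 11):
        mu, sigma = data.mean(), data.std()
        # each generation trains only on the previous generation's output
        data = rng.normal(mu, sigma, 500)
        print(f"gen {gen}: mean={mu:+.3f} std={sigma:.3f}")
    # the fitted std wanders away from 1 and extreme values vanish:
    # a little detail is lost every cycle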
Natural monopolies (Score:2)
We should probably be doing something about this given that Wall Street and the people who run our economy are planning to completely transform our civilization with this tech on a scale we haven't seen sinc
Re: (Score:2)
Well, yes. And no. Because their models still suck and age.
Re: (Score:2)
I don't think we are equipped socially or politically for what's coming.
Who is 'we'? Do you have a turd in your pocket?
Re: (Score:2)
Indeed. The problem is that while some nitwits think that LLMs can be creative and add to their training data, the very opposite is true: LLMs only ever learn a degraded form of their input. For example, they cannot capture details; they basically only get a somewhat foggy general picture. And that fog not only builds up in specific places, it degrades the overall model until only fog with pretty colored lights in it comes out, i.e. 100% hallucination.
The LLM says: (Score:3)
Ah, Wordfreq, the digital linguist that bravely waded through millions of tweets, memes, and Reddit threads to track the wild and often nonsensical evolution of language. After years of tirelessly documenting humanity's descent into "yeets" and "sus," it seems even Wordfreq has decided to retire. I mean, who can blame it? One more analysis of TikTok slang and it probably would’ve needed therapy. Robyn Speer’s decision to cease updates is less about the death of a project and more about an act of mercy—for the machine. After all, there's only so many times an algorithm can process "bae," "on fleek," and "smol" before it begs for a permanent shutdown.
Re: (Score:2)
nocap bruh
original source (Score:3)
https://github.com/rspeer/word... [github.com]
This saddens me (Score:5, Insightful)
I have a language degree and studied in the US and abroad. While I don't work in the field (I'm an engineer now), I still get a kick out of hearing changes and new words and uses in everyday language. I watch some television shows from various countries just so that I can hear what has changed in the (many) years since I studied there. (I am always amazed at how deeply English has penetrated western European languages, especially German)
The idea that generative AI is filling the language pool with 'endless, mindless algorithmically-derived glop' (my words) means we, as humans, have handed over some control of language evolution to machines. The more we allow that, the less this quintessentially-human thing (highly complex language) that we have refined over millennia is ours. Much of the human-derived, haphazard nature of evolving language (mispronunciations, misunderstandings, clever portmanteaus) may be gently pushed aside by the insistent wave of machine-derived language preference...
I'll stop now - it's obviously a Friday afternoon...
Re: (Score:1)
Only if people read it, PoodleBuoy. Most of the SEO-style spam isn't intended to be read by humans. A lot of the spam going to social media is, but your average social media user is probably still reading at least 50% human-generated text. Plenty enough for them to learn and use new words.
As human language continues to evolve, the AIs' language may not (easily), due to model collapse. Speaking in "2021 English" may start to flag you as a bot further down the road.
Re:This saddens me (Score:4)
Speaking in "2021 English" may start to flag you as a bot
Then I shan't engage in such folly.
Re: (Score:2)
I feel old saying it, but the times are changing. We are losing our personality; or at least not raising a new generation with one.
Re: (Score:2)
old and busted: dead internet theory
new and hot: dead humanity theory
Re:This saddens me (Score:4, Interesting)
I have a language degree and studied in the US and abroad. While I don't work in the field (I'm an engineer now), I still get a kick out of hearing changes and new words and uses in everyday language.
Language developments are definitely fascinating.
E.g. our children say that they did something "on accident" ... but we always said that we did something "by accident". Why the change? It doesn't bother me, but it does fascinate me.
The idea that generative AI is filling the language pool with 'endless, mindless algorithmically-derived glop' (my words) means we, as humans, have handed over some control of language evolution to machines. The more we allow that, the less this quintessentially-human thing (highly complex language) that we have refined over millennia is ours. Much of the human-derived, haphazard nature of evolving language (mispronunciations, misunderstandings, clever portmanteaus) may be gently pushed aside by the insistent wave of machine-derived language preference...
I'll stop now - it's obviously a Friday afternoon...
Hmm ... well, it's hard to deny that books, newspapers, radio, and television all had a huge effect on language change and development. Time will tell if this is just one more variable or a true sea change.
Wow (Score:1, Insightful)
Wikipedia, Twitter, and Reddit
The Internet ain't what it used to be.
Re:Wow (Score:4, Insightful)
Oh yeah like the alt.* hierarchy of usenet was the Library of Alexandria.
bullshit (Score:1)
I'd still like to delve into this realm of data... (Score:2)
Guess you'll just have to TALK to people. (Score:2)
Not just polluted, took over (Score:2)
Pff who needs Wordfreq (Score:1)
"Hey Google! Give me the 500 most used English words by decreasing order of usage."
See? It's much easier and it's guaranteed to be correct.
Also, spambots (Score:4, Interesting)
Internet forums also have a huge problem with spambots that have reposted older posts scraped from forums (often Reddit), in attempts to appear like legitimate users.
The original posts could be anywhere from a day old to ten years old, and thus skew any analysis of language trends.
If a forum is left unchecked, the number of spamposts can become quite high. Even on an actively moderated forum with vigilant users who report them, spamposts sometimes evade detection.
Usually a spampost is from a new user account, and the post is outside the "New members" subforum (if one exists). (Except that there are spambots that explicitly write only introductory posts)
If a user account looks suspicious, the method I use to confirm the post as spam is to search on Google for an exact quote from the post, so as to find the original message.
This use case requires, however, that the search engine can still do that! Bing/DuckDuckGo do not. If Google degrades its search engine too much, we'll lose this method of fighting them.
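That check is trivial to script as well; a quick sketch (the double quotes are Google's standard exact-phrase syntax, the rest is just URL encoding; the function name is mine, purely illustrative):

    from urllib.parse import quote_plus

    def exact_phrase_search_url(snippet: str) -> str:
        # Wrapping the snippet in double quotes requests an exact-phrase match
        return "https://www.google.com/search?q=" + quote_plus(f'"{snippet}"')

    # Paste a distinctive sentence from the suspect post:
    print(exact_phrase_search_url("a distinctive sentence copied from the suspect post"))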
"The internet" was always a bad source (Score:2)
The internet has been polluted by automated website generating systems for at least a decade. SEO companies take customer money and use it to spam the web with their trash. The English in this spam is often broken, roughly translated from Chinese or some other Asian language, skewing any language analysis that might come from the internet. AI-generated content is just the next step of this evolution.
LLMs: Biggest Analysis of Human Language Ever (Score:3)
What linguists need to do is study LLMs, which are the biggest statistical analysis of human language ever made. Each head of an attention layer represents a part-of-speech question that is used to validate language. The real data mine is the LLM itself:
How does it represent grammar?
What "grammar" does it have? How does that compare to the grammars invented by linguists?
What are the true dynamics of words on internet text?
Many, many questions can be figured out by a linguistic analysis of LLMs. They have figured out language better than any linguist has.
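Getting at the raw attention patterns is easy enough with the Hugging Face transformers library; a minimal sketch using GPT-2 (small enough to run anywhere; whether any given head really encodes a "part-of-speech question" is the poster's hypothesis, which this code only lets you start probing):

    import torch
    from transformers import AutoModel, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModel.from_pretrained("gpt2", output_attentions=True)

    inputs = tok("The moon is made of green cheese.", return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)

    # out.attentions: one tensor per layer, each shaped (batch, heads, seq_len, seq_len)
    first_layer = out.attentions[0][0]  # (heads, seq_len, seq_len)
    for head in range(first_layer.shape[0]):
        # For each head, which earlier token does the final token attend to most?
        target = first_layer[head, -1].argmax().item()
        print(head, repr(tok.decode(inputs["input_ids"][0][target])))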
Hmm (Score:2)
Are LLMs really having more of an impact on language than, say, texting or the internet did?
I'm tempted to say LOL ...
Re: (Score:2)
This isn't about the impact on language. wordfreq documented the frequency of words on the internet in an attempt to report their frequency in human-generated text. With a sharp uptick in machine-generated text online, the project is less able to achieve its goal.
Re: (Score:2)
But isn't it human text, when a human decides to use an LLM for their text? Otherwise you could ban grammar checkers as well. Why does it matter how the text was written? You should filter spam regardless of whether the original text is human-written or AI-written, but any post that is intentionally posted by a human should be fine. Yeah, you will see an uptick in posts that delve into details, but aren't such effects exactly what the project tries to analyze?
Re: (Score:2)
But isn't it human text, when a human decides to use an LLM for their text? Otherwise you could ban grammar checkers as well. Why does it matter how the text was written? You should filter spam regardless of whether the original text is human-written or AI-written, but any post that is intentionally posted by a human should be fine. Yeah, you will see an uptick in posts that delve into details, but aren't such effects exactly what the project tries to analyze?
Yeah, that's what I was thinking.
If someone chooses to say it ... then they did. It seems no "less valid" than choosing an autotext suggestion.
Good bibliographies have more than one item listed (Score:2)
"Project Analyzing Human Language Usage" was the public Internet your only source? Then it wasn't a project of human language, it was a project of language-on-that-single-source. and I'd guess what they mean by "human" language is only "written English."