Generative AI Systems Miss Vast Bodies of Human Knowledge, Study Finds (aeon.co)

Generative AI models trained on internet data lack exposure to vast domains of human knowledge that remain undigitized or underrepresented online. English dominates Common Crawl with 44% of content. Hindi accounts for 0.2% of the data despite being spoken by 7.5% of the global population. Tamil represents 0.04% despite 86 million speakers worldwide. Approximately 97% of the world's languages are classified as "low-resource" in computing.
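The shares quoted above are, at bottom, tabulations of per-page language labels over the crawl. A toy sketch of that tabulation in Python, using a contrived set of language-tagged records (the record counts are made up to reproduce the article's figures; this is not Common Crawl's actual schema or pipeline):

```python
from collections import Counter

# Contrived language tags for 10,000 hypothetical crawl records,
# chosen so the resulting shares match the figures quoted above.
records = ["en"] * 4400 + ["hi"] * 20 + ["ta"] * 4 + ["other"] * 5576

counts = Counter(records)
total = sum(counts.values())
shares = {lang: n / total for lang, n in counts.items()}

print(f"English: {shares['en']:.2%}")  # 44.00%
print(f"Hindi:   {shares['hi']:.2%}")  # 0.20%
print(f"Tamil:   {shares['ta']:.2%}")  # 0.04%
```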

A 2020 study found 88% of languages face such severe neglect in AI technologies that bringing them up to speed would require herculean efforts. Research on medicinal plants in North America, northwest Amazonia and New Guinea found more than 75% of 12,495 distinct uses of plant species were unique to just one local language. Large language models amplify dominant patterns through what researchers call "mode amplification." The phenomenon narrows the scope of accessible knowledge as AI-generated content increasingly fills the internet and becomes training data for subsequent models.
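The feedback loop described above can be sketched as a toy resampling experiment: each model "generation" is trained on a finite corpus sampled from the previous generation's output distribution, so tiny shares tend to drift toward zero while the dominant mode persists. The shares and corpus size below are illustrative, loosely modeled on the figures quoted above, not taken from the study:

```python
import random

# Hypothetical language shares, loosely echoing the Common Crawl figures:
# one dominant language, two tiny low-resource slivers, and the rest.
langs = ["english", "other_high_resource", "hindi", "tamil"]
shares = [0.44, 0.5576, 0.002, 0.0004]

def retrain_generation(shares, corpus_size=10_000):
    """Sample a finite 'training corpus' from the current shares and
    re-estimate the distribution from that sample alone."""
    sample = random.choices(range(len(shares)), weights=shares, k=corpus_size)
    counts = [sample.count(i) for i in range(len(shares))]
    return [c / corpus_size for c in counts]

random.seed(0)
gen = shares
for _ in range(20):  # 20 generations of models trained on model output
    gen = retrain_generation(gen)

# Rare categories tend to drift to zero share long before the dominant
# one moves appreciably: sampling alone erodes the long tail.
```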
This discussion has been archived. No new comments can be posted.

  • by drinkypoo ( 153816 ) <drink@hyperlogos.org> on Tuesday October 14, 2025 @02:27PM (#65724578) Homepage Journal

    It's not a surprise if human knowledge which is kept secret doesn't show up in LLMs. And today, not putting any knowledge on the internet is effectively that. The reason all our nerd shit shows up in LLM data is that we made it freely available to all on the open internet.

    • by Z00L00K ( 682162 )

      Secrecy, copyright protection, and obscure languages are probably the most common reasons for knowledge not making it into AI training.

    • Many low-press-run or one-off publications never made it to Google Books' library-vacuum effort of the early 2000s.

      Ditto the countless archives in courthouses/governments, schools, religious institutions, companies, and elsewhere that haven't been fully digitized yet.

      It's not like these are being deliberately kept secret as much as they are obscure or the maintainers don't have the funds to digitize them.

      If you have time or money to donate to your local historical society or other not-yet-digitized-archive-

    • The challenge is how to unwind decades of agenda-based social science research, faulty studies (small sample sizes, self-reported data, surveys of only one gender, research data sets from advocacy-based nonprofits), and the pyramid of research and news articles citing them. And vanity journals which serve to publish research from an approved set of topics.

      Grievance studies affair - https://en.wikipedia.org/wiki/... [wikipedia.org]

  • True, but BS (Score:5, Insightful)

    by rta ( 559125 ) on Tuesday October 14, 2025 @02:31PM (#65724592)

    from TFA :

    Over time, epistemological approaches rooted in Western traditions have come to be seen as objective and universal, rather than culturally situated or historically contingent. This has normalised Western knowledge as the standard, obscuring the specific historical and political forces that enabled its rise.

    In a basic sense, this is true, but in general it is used to bamboozle people into the (incorrect) "math is racist" mindset, and then into much handwringing and government spending on dumb reports.

    If you line up Christianity vs Islam vs Hinduism etc., yes, you have different epistemological and metaphysical approaches.

    But physical science, and even modern psychology and economics, are both universal and generally hostile to (or at least orthogonal to) all the classic views.

    Articles like this really underplay the degree to which the past is a foreign land for all of us.

    • Re:True, but BS (Score:4, Insightful)

      by hdyoung ( 5182939 ) on Tuesday October 14, 2025 @03:09PM (#65724716)
      That quote is a good example of something that sounds incredibly intellectual and worth thinking about, but it is still quite wrong on several levels if you dissect it.

      To translate this liberal-professor-garble into plain speaking:

      Apparently since 1) western thought is dominant across the world right now, 2) that very dominance prevents us from thinking about the non-western history that gave rise to it? Um, no. Statement 1 does NOT lead to statement 2. Just because I'm at the top of the dog pile, that doesn't mean I'm necessarily blind to how I got there.

      Getting a bit further into the weeds, the writer questions the universality of western thought. That's the sort of self-loathing that'll trigger just about anyone outside a liberal arts department. I'm not denying that western ideology is full of inconsistencies and hypocrisy, such as the US founding fathers making sure that everyone is free (excluding women and brown people). But western thought has almost always aspired to be better and, yes, universal. Do western countries subvert and twist it to fit their own agendas? Sure. But the ideas of the Enlightenment were pretty close to universal, which is quite different from a lot of non-western ways of thinking. Most of those amount to some form of "my race/religion/ethnicity/city/country/village is the chosen one because *insert nonrational reason here*, thus we should rule and everyone else is lower on the hierarchy."

      I'll take the "western epistemological approach" any day of the week, thank you very much.
    • In a basic sense, this is true
      Not really, it's just wrong. The one approach that came from Western cultures is the scientific method, which is both objective (to the maximum extent any human method has yet achieved) and universal, which is why there is no such thing as Chinese, Canadian, or Indian science; there is just science, because it is universal. As you alluded to, the scientific method has often (including now, to some degree) found itself at odds with western culture, so I would argue that the scientific method is a product of western culture but not part of it.

      Arguing that it is "culturally situated" is nonsense. While science has definitely impacted western culture it has also impacted every culture around the planet and today there are scientists in every continent from a myriad of different cultures. Your culture may impact which questions you want to answer with science but, if you are doing it correctly, it will not affect the knowledge you find and that's why it is both universal and acultural. Indeed, the universal nature of science means it is one of the few things that can bring people of different cultures to work together towards a common goal: to understand the objective reality that we all share.

  • by oldgraybeard ( 2939809 ) on Tuesday October 14, 2025 @02:36PM (#65724606)
    Which also seems to apply to the Internet's biased, crazy, obsolete, and just plain incorrect information. Which is regurgitated and used for training. Which makes one wonder if LLMs can ever function dependably.
    • I'm wondering what sort of researchers call it that. I did a bit of searching, found a lot of references regarding solid state electronics, optics, and this article.
    • by HiThere ( 15173 )

      A point, but (and this is admittedly a quibble) I wouldn't call languages a "vast body of human knowledge". The data encoded within that language might qualify, but not the language itself. Unfortunately, without understanding the language there's no way of reasonably estimating the size of the contained "human knowledge" that isn't contained in sources already covered.

      FWIW, I think treating "the internet" as a body of human knowledge is foolish. Parts of it are, but much of it is negative-knowledge (i.e

  • by jd ( 1658 ) <imipak&yahoo,com> on Tuesday October 14, 2025 @03:06PM (#65724708) Homepage Journal

    There's a lot of stuff that is on the Internet that doesn't end up in AIs, either because the guys designing the training sets don't consider it a particular priority or because it's paywalled to death.

    So the imbalance isn't just in languages and broader cultures, it's also in knowledge domains.

    However, AI developers are very unlikely to see any of this as a problem, for one very important reason: it means they can sell the extremely expensive licenses to those who actually need that information, who can then train their own custom AIs on it. Why fix a problem where the fix means your major customers pay you $20 a month rather than $200 or $2000? They're really not going to sell ten times, certainly not a hundred times, as many $20 subscriptions by doing so, so there's no way they can skim off the corps if they program their AIs properly.

    • You seem to be under the impression that the AI companies are in it for the betterment of humanity or something like that. I know Altman has said as much, but that's nonsense. They are businesses like any other, and are in it to make money.

  • by Hadlock ( 143607 ) on Tuesday October 14, 2025 @03:18PM (#65724754) Homepage Journal

    > English dominates Common Crawl with 44% of content. Hindi accounts for 0.2% of the data despite being spoken by 7.5% of the global population. Tamil represents 0.04% despite 86 million speakers worldwide.
     
    English dominates because not only are there a lot of speakers, but it is the modern business lingua franca, and most anyone who owns a desktop computer today can probably grumble out a handful of statements or questions in English. Hindi and Tamil, on the other hand, use completely different writing systems and, beyond a couple of clever loanwords, have zero vocabulary overlap with "western" languages. Simply due to the inertia of 2 billion speakers, Hindi/Tamil etc. will continue on forever, but I can't see them being targeted by western technology. Americans and Europeans already struggle with Cyrillic, and it's at least recognizably sorta phonetically similar about half the time. Tamil just makes my eyes glaze over when I see it on street signs in Malaysia or whatever.

    • This article was written by an Indian student studying in the US. So, he's just citing an example based on his personal perspective. Aside from English, the one language that would be a sort of natural fit for AI training is Chinese. Up to 17% of the world's population can read/speak Chinese, which is close to the roughly 20% that can read/speak English. Plus, a large percentage of AI researchers, companies, and models are located in China.

      The article's author looks at Common Crawl, but that may not be re

      • Chinese is a bad choice. It isn't one language, it's Mandarin ("Standard Chinese"), Cantonese, and about a thousand little dialects. It's also damn near unusable, being a tonal (5% of the world will never be able to understand or speak it) analytic language with an absurd logography.
        • by HiThere ( 15173 )

          IIUC, the Chinese ideograph system is common between all those languages, and therefore would count as one common language...until the computers started audio processing. (FWIW, it's my understanding that many of the Chinese ideographs even have approximately the same meaning in one of the Japanese writing systems.)

    • Technology is the answer, though. I don't plan to learn another script, though I might learn another language. But why should anyone have to? Computers are actually good at recognizing text and doing translations now. That's two legit uses for "AI" that have actually come true. For example I've successfully OCR'd Chinese documentation and translated it and had it not come out in broken English. This really makes one wonder why anyone is still doing bad documentation and ads, but I do still see them regularl

    • by _merlin ( 160982 )

      Tamil just makes my eyes glaze over when I see it on street signs in Malaysia or whatever.

      Haha, the language of Malaysia is Bahasa Malaysia, and it's written with Latin script - the same 26 letters as English. You really are clueless.

  • by Big Hairy Gorilla ( 9839972 ) on Tuesday October 14, 2025 @03:42PM (#65724880)
    That idea is more relevant than ever, we're seeing it being rewritten in realtime.
    See the "War in Portland and Chicago." I saw it on TV, it must be true, right?

    I read the article. I hear snowflakes melting. I'd like to be sympathetic but...
    The man admits he got "medical advice" off the internet regarding his Dad's medical problem. That's for sure going to be correct, right? Does getting medical advice off the internet make him more or less authoritative? Also, the man is in "Ethical AI" studies. Better become a professor, because "Ethical" and "AI" don't belong in the same sentence. They fire people like that around Google.

    AI doesn't represent X percentage of knowledge?
    The internet doesn't represent X percent of knowledge per ethnic group?
    How do you say DEI without actually saying "DEI?"

    If I wrote 10 percent of all stuff on the internet, does that mean that what I wrote is valuable and should comprise 10 percent of what's in AI?
    We're back to the "all ideas are equal" thing.
    Are they or aren't they?

    Did that guy do the right thing by NOT recommending surgery for his dad?
    Did an unspecified herbal blend from India FOR CERTAIN cure his dad? or was it just luck? or maybe it had nothing to do with the tumor on his tongue and just by doing nothing his body healed? Can a single anecdote be generalized to all cases? Steve Jobs used herbal remedies for pancreatic cancer, and it didn't work. So, which case is the one you should think is correct?

    Try to be logical about this, before the silvergang downvotes me. I'll just repost it anyways, so bombs away.
    I'll just say it: all ideas aren't equal. Some are better than others.
    Much is implied here, but the assumptions must be questioned.
    • I noted the bit about him searching the internet for medical advice, but the additional "like a good millennial" made me think he was looking back and making fun of himself.

      But things do get stupid from there. He notes real issues but appears to be so far up his own rear that he gets lost in hand-wringing polemics. This should have been a much, much shorter article.

      I am prone to making similar mistakes. I suspect it means I took too many philosophy courses. Or just spent too much time in college.

      • Some ideas don't require lengthy exposition, but SEO demands longer articles, and that leads to this. In a reasonable world an editor would have ripped half this article's guts out due to redundancy.

    • Did that guy do the right thing by NOT recommending surgery for his dad?

      No.

      Did an unspecified herbal blend from India FOR CERTAIN cure his dad?

      It for sure didn't.

      You already know this, I think you're just being too diplomatic.
      It rained after the virgin was sacrificed, and now this person is second guessing whether or not cutting the heart out of a virgin makes it rain.

      If anything, this article is a great example of how intelligence is not our brain's primary modus operandi. Really shitty pattern recognition is.

      • > You already know this, I think you're just being too diplomatic.
        much appreciated, sir.

        I'm just trying to help the reader establish sufficient doubt from certain fairly obviously questionable assertions, then hopefully, you can see that certain other assertions really can't be taken at face value, either.
  • "Hindi accounts for 0.2% of the data despite being spoken by 7.5% of the global population...and 82.7% of scam call centre employees."

    Fixed that for ya! :)

    • by HiThere ( 15173 )

      No. The scam callers speak English. Perhaps not well, but it's English that they are speaking.

      To repeat a point I made earlier, information is not knowledge. Knowledge may be either true or false (i.e. it's a signed quantity). Information is most densely contained in (at least apparently) random noise.

      • You may be surprised to learn that people who use Hindi as their native language also speak English. So yes, scam callers who speak English also speak Hindi...at least a hell of a lot of them do.

  • At the end of the day, dataset authors must make a call on what is important and what is not. Just because it exists should not be a reason that it should be in training data. Training data must not be blindly representative, but prioritize epistemic value.
    Let's take science as an example. There would be nothing in Hindi (or other regional languages in low scientific output areas) that isn't also in English, as far as scientific value is concerned.
    What would the dataset miss? Local chatter?
    Microsoft's small

  • A 5-year-old study - how relevant is that? Tech moves a bit quicker than every 5 years.
  • A 2020 study found 88% of languages face such severe neglect in AI technologies that bringing them up to speed would require herculean efforts

    This is not surprising; it has taken Herculean efforts to implement and train AI technologies in English. Why would other languages be easier?

  • You want an example? 15 or so years ago, Google News would show me headlines from around the world, including, for example, the Asia Straits Times, the Hindustani, and the Scotsman. Now, I can't remember the last time I saw a piece from the Asia Times, and I've just today seen the *first* thing from the Hindustani in years... and that was because of the launch of an underwear line by Kardashian.

  • they just haven't got round to stealing it yet.
  • It doesn't matter how many people speak a language when you are talking about knowledge. What matters is in what language the knowledge has been recorded. And for scientific knowledge, that is predominantly English in the modern era, and mainly a few European languages before that. A lot of it not being accessible on the internet is a problem, though.
