Generative AI Systems Miss Vast Bodies of Human Knowledge, Study Finds (aeon.co)
Generative AI models trained on internet data lack exposure to vast domains of human knowledge that remain undigitized or underrepresented online. English dominates Common Crawl with 44% of content. Hindi accounts for 0.2% of the data despite being spoken by 7.5% of the global population. Tamil represents 0.04% despite 86 million speakers worldwide. Approximately 97% of the world's languages are classified as "low-resource" in computing.
A 2020 study found 88% of languages face such severe neglect in AI technologies that bringing them up to speed would require herculean efforts. Research on medicinal plants in North America, northwest Amazonia and New Guinea found more than 75% of 12,495 distinct uses of plant species were unique to just one local language. Large language models amplify dominant patterns through what researchers call "mode amplification." The phenomenon narrows the scope of accessible knowledge as AI-generated content increasingly fills the internet and becomes training data for subsequent models.
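To put the Hindi and Tamil figures above on one scale, here is a rough sketch that divides each language's share of Common Crawl by its share of speakers. The Hindi numbers are taken straight from the summary. The Tamil speaker share is not given there, so it is derived from the quoted 86 million speakers and an assumed world population of 8.0 billion; that population figure and the helper name `representation_ratio` are illustrative assumptions, not part of the article.

```python
# Figures quoted in the summary (percent of Common Crawl, percent of people).
HINDI_CONTENT_PCT = 0.2
HINDI_SPEAKER_PCT = 7.5
TAMIL_CONTENT_PCT = 0.04
TAMIL_SPEAKERS = 86_000_000

# ASSUMPTION: world population of roughly 8.0 billion (not from the article).
ASSUMED_WORLD_POPULATION = 8_000_000_000


def representation_ratio(content_pct: float, speaker_pct: float) -> float:
    """Content share divided by speaker share; 1.0 would be proportional."""
    return content_pct / speaker_pct


hindi_ratio = representation_ratio(HINDI_CONTENT_PCT, HINDI_SPEAKER_PCT)

# Convert the Tamil speaker count into a percentage of the assumed population.
tamil_speaker_pct = TAMIL_SPEAKERS / ASSUMED_WORLD_POPULATION * 100
tamil_ratio = representation_ratio(TAMIL_CONTENT_PCT, tamil_speaker_pct)

for name, ratio in (("Hindi", hindi_ratio), ("Tamil", tamil_ratio)):
    print(f"{name}: ratio {ratio:.3f}, under-represented about {1 / ratio:.1f}x")
```

A ratio well below 1.0 means the language has far less crawled text than its speaker count alone would suggest. It says nothing about why, and nothing about how much knowledge that text contains.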
OK, so put it on the internet (Score:4, Insightful)
It's not a surprise if human knowledge which is kept secret doesn't show up in LLMs. And today, not putting any knowledge on the internet is effectively that. The reason all our nerd shit shows up in LLM data is that we made it freely available to all on the open internet.
Re: (Score:2)
Secrecy, copyright protection, and being written in obscure languages are probably the most common reasons for knowledge not contributing to AI generations.
Re: (Score:3)
DRM is killing our digital legacy. There will be no abandonware sites for today's games, as the publishers make sure you won't be able to run them.
What? Didn't just 12 days ago /. proclaim... (Score:1)
...that the models had run out of data to train on?
https://m.slashdot.org/story/4... [slashdot.org]
Re: What? Didn't just 12 days ago /. proclaim... (Score:3)
For the internet scrapers. Don't expect anyone to get off their ass and digitize something manually.. you can get paper cuts in the real world.
Re: (Score:2)
> DRM is killing our digital legacy. There will be no abandonware sites for today's games, as the publishers make sure you won't be able to run them.
For some of them that's true. For most titles, either there is no DRM, or the DRM has been or will be defeated.
Not on the internet != "kept secret" (Score:2)
Many low-press-run or one-off publications never made it into Google Books' library-vacuum effort of the early 2000s.
Ditto the countless archives in courthouses/governments, schools, religious institutions, companies, and elsewhere that haven't been fully digitized yet.
It's not like these are being deliberately kept secret as much as they are obscure or the maintainers don't have the funds to digitize them.
If you have time or money to donate to your local historical society or other not-yet-digitized-archive-
Social Sciences first please (Score:2)
The challenge is how to unwind decades of agenda-based social science research, faulty studies (small sample sizes, self-reported data, surveys of only one gender, research data sets from advocacy-based nonprofits) and the pyramid of research and news articles citing them, along with the vanity journals that exist to publish research from an approved set of topics.
Grievance studies affair - https://en.wikipedia.org/wiki/... [wikipedia.org]
True, but BS (Score:5, Insightful)
From TFA:
> Over time, epistemological approaches rooted in Western traditions have come to be seen as objective and universal, rather than culturally situated or historically contingent. This has normalised Western knowledge as the standard, obscuring the specific historical and political forces that enabled its rise.
In a basic sense, this is true, but in general it is used to bamboozle people into the (incorrect) "math is racist" mindset, which then leads to much hand-wringing and government spending on dumb reports.
If you line up Christianity vs Islam vs Hinduism etc., yes, you have different epistemological and metaphysical approaches.
But the physical sciences and even modern psychology and economics are both universal and generally hostile to (or at least orthogonal to) all the classic views.
Articles like this really underplay the degree to which the past is a foreign land for all of us.
Re:True, but BS (Score:4, Insightful)
To translate this liberal-professor-garble into plain speaking:
Apparently since 1) western thought is dominant across the world right now, 2) that very dominance prevents us from thinking about the non-western history that gave rise to it? Um, no. Statement 1 does NOT lead to statement 2. Just because I'm at the top of the dog pile, that doesn't mean I'm necessarily blind to how I got there.
Getting a bit further in the weeds, the writer questions the universality of western thought. That's the sort of self-loathing that'll trigger just about anyone outside a liberal arts department. I'm not denying that western ideology is full of inconsistencies and hypocrisy (such as the US founding fathers making sure that everyone is free, excluding women and brown people). But western thought has almost always aspired to be better and, yes, universal. Do western countries subvert and twist it to fit their own agendas? Sure. But the ideas of the enlightenment were pretty close to universal, which is quite different than a lot of non-western ways of thinking. Most of those amount to some form of "my race/religion/ethnicity/city/country/village is the chosen one because *insert nonrational reason here*, thus we should rule and everyone else is lower on the hierarchy."
I'll take the "western epistemological approach" any day of the week, thank you very much.
Re: (Score:1)
> I hope you appreciate the irony of calling out my-way-is-best while declaring western thought best.
That is essentially straw-manning. Hdyoung made no claim that western thought as a whole is best. His only claim is that the aspirational parts of western thought are worth pursuing, particularly as they deliberately make room for objectively assessing other philosophies without resorting to violence or polemics.
Presumably he’s including western ideals like blind justice, equality of opportunity, judging by merit instead of identity, valuing the sanctity of individual life, recognizing every man is a
Re: True, but BS (Score:2)
Re: True, but BS (Score:3)
Re: (Score:2)
> Obviously not “a joke”. It was a transparent deflection to single minded uniparty collectivism. Nonsense on the order of “democrats good” “republicans bad”.
Dude, your exposition on this stuff was excellent, but give the guy a break. What he wrote can indeed easily be read as just a joke.
And I really don't think it's one of those "oh, you made a hateful statement but then you got called out on it and you're trying to backpedal" cases...
Re: (Score:1)
Dude, the poster has an extensive history of one-sided narratives. I am rigidly classically liberal, which leaves openness to such narratives - it’s a free country - but one doesn’t need to passively accept a constant sniping at western civilization’s foundation of personal agency and individual rights.
Re: (Score:2)
Re: True, but BS (Score:3)
Just BS (Score:2)
> In a basic sense, this is true
Not really, it's just wrong. The one approach that came from Western cultures is the scientific method, which is both objective (to the maximum extent any human method has yet achieved) and universal. That is why there is no such thing as Chinese, Canadian or Indian science; there is just science, because it is universal. As you alluded to, the scientific method has often (including now, to some degree) found itself at odds with western culture, so I would argue that the scientific method is a product of western culture but not part of it.
Arguing that it is "culturally situated" is nonsense. While science has definitely impacted western culture, it has also impacted every culture around the planet, and today there are scientists on every continent from a myriad of different cultures. Your culture may impact which questions you want to answer with science but, if you are doing it correctly, it will not affect the knowledge you find, and that's why it is both universal and acultural. Indeed, the universal nature of science means it is one of the few things that can bring people of different cultures to work together towards a common goal: to understand the objective reality that we all share.
researchers call "mode amplification" (Score:3)
Re: (Score:2)
Re: (Score:3)
A point, but (and this is admittedly a quibble) I wouldn't call languages a "vast body of human knowledge". The data encoded within that language might qualify, but not the language itself. Unfortunately, without understanding the language there's no way of reasonably estimating the size of the contained "human knowledge" that isn't contained in sources already covered.
FWIW, I think treating "the internet" as a body of human knowledge is foolish. Parts of it are, but much of it is negative-knowledge (i.e
It's not just foreign languages (Score:3)
There's a lot of stuff that is on the Internet that doesn't end up in AIs, either because the guys designing the training sets don't consider it a particular priority or because it's paywalled to death.
So the imbalance isn't just in languages and broader cultures, it's also in knowledge domains.
However, AI developers are very unlikely to see any of this as a problem, for one very very important reason --- it means they can sell the extremely expensive licenses to those who actually need that information, who can then train their own custom AIs on it. Why fix a problem where the fix means your major customers pay you $20 a month rather than $200 or $2000? They're really not going to sell ten times, certainly not a hundred times, as many $20 subscriptions by doing so, so there's no way they can skim off the corps if they program their AIs properly.
Re: (Score:2)
You seem to be under the impression that the AI companies are in it for the betterment of humanity or something like that. I know Altman has said as much, but that's nonsense. They are businesses like any other, and are in it to make money.
Garbage In / Garbage Out (Score:1)
English dominates vs Tamil && Hindi (Score:3, Insightful)
> English dominates Common Crawl with 44% of content. Hindi accounts for 0.2% of the data despite being spoken by 7.5% of the global population. Tamil represents 0.04% despite 86 million speakers worldwide.
English dominates because not only are there a lot of speakers, but it is the modern business lingua franca, and most anyone who owns a desktop computer today can probably grumble out a handful of statements or questions in English. Hindi and Tamil, on the other hand, use completely different writing systems and, beyond a couple of clever words, have zero vocabulary overlap with "western" languages. Simply due to the inertia of 2 billion speakers, Hindi/Tamil etc. will continue on forever, but I can't see them being targeted by western technology. Americans and Europeans already struggle with Cyrillic, and it's at least recognizably sorta phonetically similar about half the time. Tamil just makes my eyes glaze over when I see it on street signs in Malaysia or whatever.
Re: (Score:2)
This article was written by an Indian student studying in the US. So, he's just citing an example based on his personal perspective. Aside from English, the one language that would be sort of a natural fit for AI training is Chinese. Up to 17% of the world's population can read/speak Chinese, which is close to the up to 20% that can read/speak English. Plus, a large percentage of AI researchers, companies, and models are located in China.
The article's author looks at Common Crawl, but that may not be re
Re: (Score:1)
Re: (Score:2)
IIUC, the Chinese ideograph system is common between all those languages, and therefore would count as one common language...until the computers started audio processing. (FWIW, it's my understanding that many of the Chinese ideographs even have approximately the same meaning in one of the Japanese writing systems.)
Re: (Score:2)
Technology is the answer, though. I don't plan to learn another script, though I might learn another language. But why should anyone have to? Computers are actually good at recognizing text and doing translations now. That's two legit uses for "AI" that have actually come true. For example I've successfully OCR'd Chinese documentation and translated it and had it not come out in broken English. This really makes one wonder why anyone is still doing bad documentation and ads, but I do still see them regularl
Re: (Score:2)
Haha, the language of Malaysia is Bahasa Malaysia, and it's written with Latin script - the same 26 letters as English. You really are clueless.
History is written by the victors (Score:3)
See the "War in Portland and Chicago." I saw it on TV, it must be true, right?
I read the article. I hear snowflakes melting. I'd like to be sympathetic but...
The man admits he got "medical advice" off the internet regarding his Dad's medical problem. That's for sure going to be correct, Right? Does getting medical advice off the internet make him more or less authoritative? Also the man is in "Ethical AI" studies. Better become a professor, because "Ethical" and "AI" don't belong in the same sentence. They fire people like that around Google.
AI doesn't represent X percentage of knowledge?
The internet doesn't represent X percent of knowledge per ethnic group?
How do you say DEI without actually saying "DEI?"
If I wrote 10 percent of all stuff on the internet, does that mean that what I wrote is valuable and should comprise 10 percent of what's in AI?
We're back to the "all ideas are equal" thing.
Are they or aren't they?
Did that guy do the right thing by NOT recommending surgery for his dad?
Did an unspecified herbal blend from India FOR CERTAIN cure his dad? or was it just luck? or maybe it had nothing to do with the tumor on his tongue and just by doing nothing his body healed? Can a single anecdote be generalized to all cases? Steve Jobs used herbal remedies for pancreatic cancer, and it didn't work. So, which case is the one you should think is correct?
Try to be logical about this, before the silvergang downvotes me. I'll just repost it anyways, so bombs away.
I'll just say it: all ideas aren't equal. Some are better than others.
Much is implied here, but the assumptions must be questioned.
Re: (Score:2)
But things do get stupid from there. He notes real issues but appears to be so far up his own rear that he gets lost in hand-wringing polemics. This should have been a much, much shorter article.
I am prone to making similar mistakes. I suspect it means I took too many philosophy courses. Or just spent too much time in college.
Re: (Score:1)
Some ideas don't require lengthy exposition, but SEO demands longer articles, and that leads to this. In a reasonable world an editor would have ripped half this article's guts out due to redundancy.
Re: (Score:2)
> Did that guy do the right thing by NOT recommending surgery for his dad?
No.
> Did an unspecified herbal blend from India FOR CERTAIN cure his dad?
It for sure didn't.
You already know this, I think you're just being too diplomatic.
It rained after the virgin was sacrificed, and now this person is second guessing whether or not cutting the heart out of a virgin makes it rain.
If anything, this article is a great example of how intelligence is not our brain's primary modus operandi. Really shitty pattern recognition is.
Re: (Score:2)
much appreciated, sir.
I'm just trying to help the reader establish sufficient doubt from certain fairly obviously questionable assertions, then hopefully, you can see that certain other assertions really can't be taken at face value, either.
Bonus Data (Score:2)
"Hindi accounts for 0.2% of the data despite being spoken by 7.5% of the global population...and 82.7% of scam call centre employees.".
Fixed that for ya! :)
Re: (Score:2)
No. The scam callers speak English. Perhaps not well, but it's English that they are speaking.
To repeat a point I made earlier, information is not knowledge. Knowledge may be either true or false (i.e. it's a signed quantity). Information is most densely contained in (at least apparently) random noise.
Re: (Score:2)
You may be surprised to learn that people who use Hindi as their native language also speak English. So yes, scam callers who speak English also speak Hindi...at least a hell of a lot of them do.
Curation (Score:2)
At the end of the day, dataset authors must make a call on what is important and what is not. Just because it exists should not be a reason that it should be in training data. Training data must not be blindly representative, but prioritize epistemic value.
Let's take science as an example. There would be nothing in Hindi (or other regional languages in low scientific output areas) that isn't also in English, as far as scientific value is concerned.
What would the dataset miss? Local chatter?
Microsoft's small
A 2020 study (Score:2)
Herculean efforts (Score:2)
> A 2020 study found 88% of languages face such severe neglect in AI technologies that bringing them up to speed would require herculean efforts
This is not surprising; it has taken Herculean efforts to implement and train AI technologies in English. Why would other languages be easier?
Misses by algorithm (Score:2)
You want an example? 15 or so years ago, Google News would show me headlines from around the world, including, for example, the Asia Straits Times, the Hindustani, and the Scotsman. Now, I can't remember the last time I saw a piece from the Asia Times, and I've just today seen the *first* thing from the Hindustani in years... and that was because of a launch of an underwear line by Kardashian.
Not really (Score:2)
wrong statistic (Score:2)
It doesn't matter how many people speak a language when you are talking about knowledge. What matters is in what language the knowledge has been recorded. And for scientific knowledge, that is predominantly English in the modern era, and mainly a few European languages before that. A lot of it not being accessible on the internet is a problem, though.