AI Technology

A Shocking Amount of the Web is Machine Translated (arxiv.org) 57

Abstract of a paper published on pre-print server arXiv: We show that content on the web is often translated into many languages, and the low quality of these multi-way translations indicates they were likely created using Machine Translation (MT). Multi-way parallel, machine generated content not only dominates the translations in lower resource languages; it also constitutes a large fraction of the total web content in those languages. We also find evidence of a selection bias in the type of content which is translated into many languages, consistent with low quality English content being translated en masse into many lower resource languages, via MT. Our work raises serious concerns about training models such as multilingual large language models on both monolingual and bilingual data scraped from the web.

  • Chrome (Score:3, Interesting)

    by Calydor ( 739835 ) on Wednesday January 24, 2024 @02:09PM (#64185157)

    As far as I can tell this is even true for Chrome.

    The Danish version of Chrome has an interesting spelling mistake when you download files. Once the file has been downloaded it is marked as 'Udfør'. This is the proper translation of 'Complete (this task)', i.e. a demand. The proper translation for the status of being complete is 'Udført', i.e. 'Completed'. This mistake has been there for years now, so I reckon Google doesn't care.

    And if Google doesn't care to translate their browser properly, why should any random site on the net care to translate their site properly? The entire internet is full of "Good enough, and you're evil if you point out typos because maybe the guy is dyslexic and typing is really hard anyway!"

    • Re:Chrome (Score:5, Interesting)

      by Koen Lefever ( 2543028 ) on Wednesday January 24, 2024 @06:26PM (#64185867)
      Nah, translated software has had that kind of error for over 40 years; that is the main reason why I never use localization on my computers.

      I remember an Amstrad ad for their PC1512 that had 113 language errors. The weirdest was "simulatiediefstal" (Dutch for "simulation theft"). I could not understand it until it dawned on me that it had first been translated into French ("simulation de vol", but "vol" can mean both "flight" and "theft") before it was translated into Dutch. They were talking about a flight simulator; suddenly half of their translation errors made sense.
      • Simulation theft is now reality when plagiarizing AI-generated content.

      • I feel like at least back then there was an excuse, sort of; these were smaller companies that likely had fewer resources for the translation process along with few to none of the tools to assist that are available today. Google is one of the largest companies on the planet with no shortage of resources. While those older games should have done a better job, I can kind of understand why it was often rushed and of poor quality, but with Google? There really should be no excuse, certainly not after an error ha
    • by pjt33 ( 739471 )

      When you right-click any file in Windows 10's explorer there's a menu item "Scan with Microsoft Defender" (the built-in anti-virus). Microsoft has fixed it now, but for a while the Spanish translation used "Scan" in the sense of creating an image file from a physical document: "Digitalizar con Microsoft Defender". The first time I saw that I couldn't work out why Windows thought the file wasn't already digital until I translated back to English.

  • Such as the Cebuano language which has over 6 million bot generated articles. Then there are people who just make stuff up in different languages as hoaxes.
    • @xack, I'm not doubting you, but how did you find out? Are you a speaker of Cebuano? (although even then I doubt you read six million articles...)

      • He's making a joke. Cebuano is spoken in the Philippines and has 30 million native speakers.
  • by Somervillain ( 4719341 ) on Wednesday January 24, 2024 @02:13PM (#64185181)
    A shocking amount of the web is total garbage generated by machines...or some really mentally deficient humans. Google News is constantly putting stories about topics of interest of mine that don't appear to be generated by a sentient human...looks like just ChatGPT diarrhea mashing together older useful articles. And most predict it's only going to get worse.

    We need a law forcing authors to declare if an article was generated by AI or algorithm.

    Some terrible examples are recipes. For example, I couldn't remember how much pine nuts to put in homemade pesto and forgot to write it down last time. The first 12 articles from Google for "pesto recipes" were about 20 pages of nonsense barfed out by Chat GPT with the ingredients and amounts at the very bottom. Took me like 5 minutes across 10 articles to find the answer. I probably had to view/block 1000 ads to get to the answer...and the articles just barfed anything related to the history of pasta and cooking and pesto for several pages even though clearly labeled as "Fresh Pesto recipe."

    So yeah....it sucks to not speak the original language of the article (which is frequently English anyway), but even being a native speaker...it's getting worse and worse every day.
    • by nightflameauto ( 6607976 ) on Wednesday January 24, 2024 @02:22PM (#64185215)

      Some terrible examples are recipes. For example, I couldn't remember how much pine nuts to put in homemade pesto and forgot to write it down last time. The first 12 articles from Google for "pesto recipes" were about 20 pages of nonsense barfed out by Chat GPT with the ingredients and amounts at the very bottom. Took me like 5 minutes across 10 articles to find the answer. I probably had to view/block 1000 ads to get to the answer...and the articles just barfed anything related to the history of pasta and cooking and pesto for several pages even though clearly labeled as "Fresh Pesto recipe."

      You can thank SEO for this bullshit. There are SEO applications that go through your site and toss recommendations at you for better SEO. I tried it once on my personal blog just to see what it did. By the time I got all green lights on the SEO checker my article looked like one of those recipes. Literally dozens of stupid, to use your example, "What is a pine nut", "Where pine nuts come from", "Who first tried a pine nut" subsections meant to make something "look" more appealing when you fast-scroll through it, while adding absolutely NO actual content. Lots of words, and they look like English, but they transfer no data whatsoever from the screen to the brain.

      With AI becoming so popular, I expect the "look like English" bit will be going away soon. Now we can generate the necessary SEO gibberish billions of times faster! A whole web of generated bullshit, with minor sprinklings of data at the bottom of each page.

      • It really is a plague. I think a lot of it is written en masse in India, and then those articles get stolen and pieced together by other "content farms" in India, and you end up with a big old incoherent piece of crap as the #1 search result because it has all the keywords, with five other identical copies of the same content as well.

    • by Angry Coward ( 6165972 ) on Wednesday January 24, 2024 @03:17PM (#64185383)
      Often you can bypass the garbage stuffing on recipe sites by telling the site you want to print it. The printer version of the page usually only contains the information you actually want instead of the spam text and ads. This almost always works on my desktop, mobile is about 50/50 for me.
      • Firefox has a good add-on that takes you right to the recipe. It's on my other computer and I'm too lazy to look it up.

    • Perhaps it's best not to rely on general search to find things that a more specialized site would provide good results for, e.g., allrecipes.com? And if one is not familiar with such sites already, searching for "recipes site" yields this among the top results, along with Bon Appetit.

    • For recipes in particular, the issue is that you cannot copyright facts so people add "original" content.

    • by jvkjvk ( 102057 )

      You should just ask ChatGPT how many pine nuts to put in. It will give you a recipe plainly with measurements, no problem.
      I guess when ChatGPT starts coming with ads then the enshittification will have begun.

  • why Gen Z is unintelligible. Talking with the fam, websites are not dank, cheugy, or camp.

    I'll be here all week. Tip your waitstaff.
    • by Anonymous Coward

      Websites are sus, for old people. I hate old people.

    • by ebunga ( 95613 )

      It is our obligation as those that have crossed over the hill to ruin every new word or phrase the kids these days try to use.

      • Damn right.

        Like my bae was sayin' "yeet dat shit FRFR lowkey you def sus FOMO", I said "stop the cap fam, don't be a stan", and neither of us knew what the fuck we just said.

    • by mosch ( 204 )

      Like, Gag me with a spoon, Gen Z slang is grody to the max. It would be bodacious if they would, like, totally talk the way today's adults did when they were, like, teens. It'd be rad to see Gen Z use the totally tubular vernacular of teens of the past!

      • Like, Gag me with a spoon, Gen Z slang is grody to the max. It would be bodacious if they would, like, totally talk the way today's adults did when they were, like, teens. It'd be rad to see Gen Z use the totally tubular vernacular of teens of the past!

        I know maybe two girls that talked that way back then. Nobody else did unless they were mocking them. Sorry, Fast Times at Ridgemont High's Spicoli was an outlier, not an indicator of an entire generation.

      • As if!

    • by ceoyoyo ( 59147 )

      That's a bad take on a freaky deaky situation man. The real bee's knees! You give them youngblood jive turkeys the hairy eyeball! Now I gotta beat feet and skitter. Hang loose man!

  • by Another Random Kiwi ( 6224294 ) on Wednesday January 24, 2024 @02:39PM (#64185265)
    The machine translations should get tagged with a locale -- unfortunately ISO-3166 country codes AI and MT are already taken, but FA is still available and I'm sure we can all come up with a few expansions for that.
    • If they don't care to give you a high-quality translation, they should be honest and use the tag FU. (Unlikely to ever be intentionally assigned to a real country.)
  • A human translator costs at least $30/hour plus benefits and they can't just go in and type a word-for-word translation. They need to learn the nuance of what is being said and translate it according to the nuance of the language. Nobody wants to pay that much money for a group that may make up a non-substantial proportion of website visitors.

    A better conclusion would be, "Providing an equal level of service across all populations' needs is extremely expensive and businesses/organizations will take shortcu

    • A better conclusion would be, "Providing an equal level of service across all populations' needs is extremely expensive and businesses/organizations will take shortcuts where necessary."

      And that boiled down is, "do it the cheapest way possible".

    • by Njovich ( 553857 )

      A better conclusion would be, "Providing an equal level of service across all populations' needs is extremely expensive and businesses/organizations will take shortcuts where necessary."

      Or just don't? I think these businesses don't realize just how awful these translations are. In the Netherlands it's common for some large company to force you, by IP address (or, if you are lucky, browser locale), onto some completely incomprehensible mess of words. Never mind that everyone speaks English and the browser has a built-in translate button that you just broke with this nonsense. Serving a page like that just gives your company a quick jolt of reputation damage by anyone unluc

  • That's too much content. Machine translation is necessary.

    Are you over this yet? I was a while ago.

  • Now we got generative AI, the cost of average content is soon negligible. The cost of curating all that average shit to find valuable stuff is gonna go up.
    • Now we got generative AI, the cost of average content is soon negligible. The cost of curating all that average shit to find valuable stuff is gonna go up.

      The main reason we'll need AI in the future is to translate the AI gibberish into human readable content with actual information in it. They can create their own market just by polluting the web with generated nonsense.

      I wonder if there will come a tipping point where the majority realize that info on the web is not sacrosanct just because it exists, or if the human species will just ride that propaganda generation right into the sunset? Gonna be an interesting future, that's for sure. Not good, but interest

      • We already have recursion whereby AI is tasked with writing articles and consumers task AI with simplifying it and creating action plans for it, and possibly generating an entirely new article, which is then recycled again... like copying a tape, each successive usage will degrade the data further and further. Perhaps original books will become the only real source for future scholars?
        • We already have recursion whereby AI is tasked with writing articles and consumers task AI with simplifying it and creating action plans for it, and possibly generating an entirely new article, which is then recycled again... like copying a tape, each successive usage will degrade the data further and further. Perhaps original books will become the only real source for future scholars?

          There's a ton of far-fetched fantasy being peddled as non-fiction these days. Alien conspiracies. The entire era of the Satanic Panic still produces some wack-a-loon tales as non-fiction. How many non-fiction filed books have completely insane out-there theories about the Kennedy assassinations? It gets hard to sort the wheat from the chaff sometimes even with books. Once some company decides they've got an AI that's "good enough" to write books wholesale? I expect that trend to only get worse.

  • Mostramos que el contenido de la web a menudo se traduce a muchos idiomas, y la baja calidad de estas traducciones multidireccionales indica que probablemente se crearon mediante traducción automática (MT). El contenido multidireccional, paralelo y generado por máquinas no sólo domina las traducciones en idiomas de menores recursos; también constituye una gran fracción del contenido web total en esos idiomas. También encontramos evidencia de un sesgo de selección en el tipo de contenido que se traduce a muchos idiomas, consistente con contenido en inglés de baja calidad que se traduce en masa a muchos idiomas de menores recursos, a través de MT. Nuestro trabajo plantea serias preocupaciones sobre los modelos de capacitación, como los modelos de lenguajes grandes multilingües, en datos monolingües y bilingües extraídos de la web.

    Courtesy translate.google.com.

    Some characters may not display properly due to limitations of Slashdot.

  • This is not a surprise to anyone for whom English is not their first language.

    Just try to search information on a topic in another language and you'll find a lot of crap sites in your search results that contain badly auto-translated "content".

    This has been a nuisance ever since Google Translate was introduced.

  • I personally love machine translation - how else would I get to be entertained by people getting random Chinese words tattooed on themselves or foreign businesses obliviously selling t shirts with sweary English?
  • I found an interesting web automation tutorial about ten years ago that walked you through web scraping, translating the text into another language using Google Translate, and then translating it back into English in order to develop 'original' content. Ever since then, I've suspected quite a bit of content is generated this way, a sort of 'poor man's AI'

    One event that really seemed to follow this trend was news coverage of the Varsity Blues scandal that exposed cheating in university entrance applicatio
  • I get between 20 and 50 scam calls a day, most of which have a bot that reads some poorly written script. I love it when the bot cannot pronounce a word properly or uses improper English.

    I have to answer the phone because T-mobile identifies about half of my customers as "Scam Likely". :(

  • Microsoft Support has been doing this for years. Almost anything not in English is obviously machine translated.
  • Or at least the current fad version is. LLMs are like locusts that swarm the landscape and consume anything they find, and in this case the landscape is the internet. Except LLMs spew their degraded output back onto the web and that recycling destroys all meaning.

    The prevalence of machine translated crap, and its bad effect on LLM training, is just a previous iteration of how pretend "smart" software fails. Once AI in its LLM form really takes hold the utility of the internet will rapidly approach zero. Th

  • I work in public administration in a European country and right now we are in the process of adding machine translation (Google Translate with a Wordpress plugin) to all our websites. Why? Simply because there is a real need to have those translated and we lack the resources to go with human translation. And since it is about taxpayer money, likely we won't ever have those resources... The best we could do with humans would be only a handful of pages and only for the English language. Still, our current t

  • ... back in the day when we had to craft our assembly code in raw hexadecimal opcodes, "garbage in, garbage out".

    If AI system managers are just learning this today, they really need to sue the donkeys off whoever they paid for their degrees, because some rather essential bits of their education got missed.

    Alternatively, if the educators can prove that the students were shown these warnings, then a generation of AI development managers have to hand back all their pay cheques since they stopped flipping bur

