Google Businesses The Internet

Coming Soon, The Google Translator 418

compuglot writes "Google gave journalists a glimpse of its next generation machine translation system at a May 19th Google Factory Tour. "Google Blogoscoped" offers an excellent overview of the presentation. The system has been trained using the United Nations Documents as a corpus. This corpus is some 20 billion words worth of content. It uses existing source and target language translations (done by human translators at the U.N.) to find patterns it then uses to build rules for translating between those languages. Apparently it was successful where the current version had failed in translating certain phrases. If anyone were capable of making a serious go of MT, that would have to be Google."
This discussion has been archived. No new comments can be posted.

  • by TripMaster Monkey ( 862126 ) * on Tuesday May 31, 2005 @10:12AM (#12683650)

    Just to illustrate, here's the summary of this story, translated to German and back to English using Google's current version [google.com]:

    Google gave a Glimpse of its machine Uebersetzungsystems the following production at the factory route of the A May 19 to journalists. Google. "Google Blogoscoped" offers an excellent overview of the representation. The system was trained with the nation documents as korpus. This korpus is something 20 billion word value of contents. It uses the existing target language translations (takes place via human translators at the U.N.) Samples find, which use it then to establish guidelines for translating between those languages. Apparent it was successful, where the present version had failed, if it translated certain cliches. If everyone of forming a serious were capable, of the M.Ue., those would go to have having to Google.
  • Google's translator (Score:3, Interesting)

    by bcmm ( 768152 ) on Tuesday May 31, 2005 @10:13AM (#12683653)
    So what powers Google's current translator? I have seen it give word-for-word the same as Babel on some occasions (but with better handling of non-ASCII characters).
  • by RubberDogBone ( 851604 ) on Tuesday May 31, 2005 @10:16AM (#12683688)
    Make this work with Gmail and I'd even pay money for it!

    Tired of getting email from Amazon.DE on my Gmail account and having to copy and paste it over to Babelfish.

    That would be very useful for me.

  • if anyone... (Score:5, Interesting)

    by rdc_uk ( 792215 ) on Tuesday May 31, 2005 @10:20AM (#12683724)
    Actually, my bet for most likely to make a real go of machine translation would be...

    IBM

    Look how far they ran with chess programs, because they felt like it...

    If they decided to go the same distance with translation...
  • oh no! (Score:5, Interesting)

    by danharan ( 714822 ) on Tuesday May 31, 2005 @10:30AM (#12683814) Journal
    I don't ever expect such translation to work perfectly, but taking existing phrases should lead to useful first drafts.

    This will mean one less possible career for me, and fewer Babelfish-induced laughter moments.

    As a fluently bilingual person, I often recognize expressions that were translated in Canadian government documents. "Anglicisme" is the word the French have for it.

    There's subtlety to languages we may forever lose. Take for example:

    "Je donne ma langue au chat" - "I give up" (answering a riddle) instead of the more picturesque "I give my language to the cat". Well, that should be tongue, but hey, it's just Babelfish!

    "Bullshit" won't produce "merde de taureau". That is a strange expression you anglos have, don't you realize?

    "Il pleut comme vache qui pisse" will give us "it's pouring cats and dogs" rather than "it's pouring like cows' a'pissin". The French also have never heard of cats and dogs falling from the sky.

    While an improved Babelfish may improve our mutual comprehension, please pause for a moment to consider all the linguistic hilarity we'll forever lose.
  • by KagatoLNX ( 141673 ) <kagato@@@souja...net> on Tuesday May 31, 2005 @10:31AM (#12683834) Homepage
    Ummm, geeks like Google because Google employs scientists. Which mere scientists were you talking about?

    Were you talking about the PhDs at universities busy teaching classes, churning out research papers to avoid being fired (an ugly numbers game some departments play), or perhaps burning time generating volumes of grant paperwork?

    Oh, maybe you were talking about the scientists employed by the private sector. I'm sure the management teams wherever they work are willing to take the time and care that Google won't.

    You do know how many PhDs Google employs, right? Not to mention that they won't be fighting for resources there either. No backstabbing, liquidating MBAs trashing their corporate budget. No football-crazed alumni assassinating their funding proposals either.

    Also, I would remind you that "mere scientists" often come up with the needed research (there are volumes in MT alone), but rarely can afford to put in the years that it takes into a good implementation.

    Geeks love Google because it is, in many respects, where the best of business meets the best of academia.
  • Re:fascinating (Score:5, Interesting)

    by NoMoreNicksLeft ( 516230 ) <john.oylerNO@SPAMcomcast.net> on Tuesday May 31, 2005 @10:34AM (#12683867) Journal
    Some questions:

    Why can't a dictionary be made of nouns, of verbs? Why can't we have it statistically analyze the grammar for ambiguous words?

    Does it only recognize exact matches? Especially with verb conjugation, I'd think any words 80% similar or so should be considered matches. Not all languages are as conjugation-happy as Latin or Spanish or even English, and you often lose some nuanced conjugations when translating from one to the other.

    What will be done about idioms? Translating these word for word often makes no sense at all, and for me at least (no idea what the official stance is), I'd rather they substitute in idioms with the same general meaning, but for the culture being translated to.

    Does it work on alternate character systems, is it word boundary dependent?

    Does it understand punctuation rules? Will this post, translated to Spanish, have the upside-down question marks where they're supposed to be?

    How many of the world's existing languages have enough text for this to even be feasible?
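The "80% similar" matching suggested in the comment above could be sketched roughly as follows. This is only an illustration under stated assumptions: the `fuzzy_lookup` helper, the toy dictionary, and the use of Python's `difflib` character-overlap ratio as the similarity measure are all hypothetical choices, not anything the article says Google does.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0..1 character-overlap ratio between two words, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fuzzy_lookup(word, dictionary, threshold=0.8):
    """Return the translation of the most similar dictionary key.

    Returns None when no key reaches the similarity threshold.
    """
    best_key, best_score = None, 0.0
    for key in dictionary:
        score = similarity(word, key)
        if score > best_score:
            best_key, best_score = key, score
    if best_key is not None and best_score >= threshold:
        return dictionary[best_key]
    return None


# Hypothetical toy dictionary, for illustration only.
toy_dictionary = {"translate": "traducir", "language": "idioma"}

print(fuzzy_lookup("translated", toy_dictionary))  # conjugated form still matches
print(fuzzy_lookup("cat", toy_dictionary))         # nothing similar enough
```

A plain string-similarity cutoff like this would also conflate unrelated words that merely look alike, which is one reason real systems tend to rely on stemming or learned statistics instead.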
  • by TopSpin ( 753 ) * on Tuesday May 31, 2005 @10:36AM (#12683878) Journal
    First, this is outstanding; Google, unsatisfied with traditional machine translation techniques, pioneers their own design. I'm certain their advertisers will be pleased to have their ads auto-translated to whatever language is necessary.

    Second, I think we'll witness a case of having the AI ante upped once again when another traditional AI challenge is met. Wikipedia puts this best: when viewed with a moderate dose of cynicism, AI can be viewed as 'the set of computer science problems without good solutions at this point.' Once a sub-discipline results in useful work, it is carved out of artificial intelligence and given its own name.
  • Comment removed (Score:5, Interesting)

    by account_deleted ( 4530225 ) on Tuesday May 31, 2005 @10:39AM (#12683903)
    Comment removed based on user account deletion
  • by stevejsmith ( 614145 ) on Tuesday May 31, 2005 @10:52AM (#12684018) Homepage
    Dell and Cisco are not in this business. IBM is not hemorrhaging with cash in the way Google is. Microsoft is not in the business of providing free Internet accessories. In any case, Google has a track record of innovative ideas ("innovative ideas" meaning that not only did they come up with it and implement it partially, but they invested full-on into it, bet money on it, and made it better than the competition) and is most likely of any company who would announce this to actually pull through with it. If some little start-up announced this (as I'm sure a few have), people would take it with a grain of salt. But that Google announces it, I'm sure most people believe fully that Google will deliver on its promise.

    And you're right, people have thought of this exact idea (I'm sure every other computer major and linguist has, in fact, since the birth of ENIAC--I know the idea's crossed my mind tons of times, not that I'd have the slightest clue how to do it), however actually attempting to do it with a reasonable chance of success? I'm going to say Google is the first.

    Plus, I got the impression from the article that the service is operational, just not available to the public. If you'll read the article, you'll find that the translator properly translated a fairly complicated phrase from Arabic to English. I'd guess that this service is, from a technical standpoint, at least 95% done; it's just the packaging and touching-up that needs to be done.
  • Wait, why? (Score:5, Interesting)

    by Ieshan ( 409693 ) <ieshan@g[ ]l.com ['mai' in gap]> on Tuesday May 31, 2005 @10:56AM (#12684071) Homepage Journal
    "Computers don't generalize or extrapolate the known into the unknown worth a damn."

    Fortunately, that's not all that Google has to go on. Google has 8 billion webpages, in many different languages, most of which are written by non-speechwriters. Not only can they analyze words based on translated context, but they can analyze words based on intra-language context, to form associations between words and meanings.

    The real trick is getting down two important linguistic concepts: "Sandhi Rules" (for instance, the use of "an" before a vowel and "a" before a consonant, which are totally regular but more complicated than a word-to-word matchup), and the "degree" or "quality" of words, which indicate the type of adjective most appropriate in any given context.

    For instance, "erudite", "learned", "educated", "knowledgeable", "skilled", and "cunning" could all be related words, but many of them have positive or negative associations which may only really be conveyed by understanding the meaning, irony, or sarcasm of a particular phrase.

    For instance, "John has been skilled in writing beautiful code for most of his adult life" is quite different from "John has been educated in writing beautiful code for most of his adult life", or "John has been erudite...". The first one is probably right if John has had a natural inclination to doing it properly, the second if he has undergone some training (though we don't know the actual state of his ability), the third (though the word doesn't even really make sense here) if he has been arrogant about his ability, shouting RTFM! every time someone asked him a question.
  • by mincognito ( 839071 ) on Tuesday May 31, 2005 @11:12AM (#12684211)
    Some people here seem to have a false picture of how language works. Individual words do not have meanings. Not to a human interpreter anyway. Sentences used in actual contexts have meanings (unless a single word is uttered as an elliptical sentence). The "meanings" of words, as found in dictionaries, are simply abstractions from occasions of use. The idea that individual words have meanings hasn't been current in philosophy or linguistics for about 50 years. Also, the idea of St. Augustine that children learn the meaning of words by associating sounds that they hear with particular objects that they observe is now also considered rather dubious.
  • Re:fascinating (Score:4, Interesting)

    by Simonetta ( 207550 ) on Tuesday May 31, 2005 @11:13AM (#12684217)
    What will be done about idioms? Translating these word for word often makes no sense at all...

    The often-quoted examples are: "Out of sight, out of mind" becomes "invisible idiot" and "the spirit is willing, but the flesh is weak" comes out as "The meat is rotten, but the wine's great".

    How many of the world's existing languages have enough text for this to even be feasible?

    Ah yes, that's the tricky part. Translating for preservation near-extinct languages that are in spoken or recorded form only. A true programming challenge.

    I find the Babel-Fish translator to be nearly useless and the Systran box at www.systransoft.com very helpful when selling things on eBay to people in non-English-speaking countries. When I get a question about an auction item that has little grammatical cohesion and has an offshore domain, like
    "How many cost you Italia he transport?", I'll run my response through Systran's translator and add the original English afterwards. More often than not the sales and PayPal transactions are successful.

    I believe that machine translation will be the 'killer application' for 64-bit home PCs. ..along with DRM busting..

    There are five levels of machine translation:

    1) word substitution.
    2) phrase substitution.
    3) cohesive paragraphs and idioms.
    4) light literature, magazine articles, and business.
    5) classical literature, law, and diplomacy.

    Each level requires at least an order of magnitude more computing power than the previous one. Babel Fish is on level two and Systran is on three. Google is positioning themselves to be between levels four and five.

    I wish them the best of luck. Without sarcasm or irony. This is important work.

    "Give me a one sentence definition of 'irony'."
    "Yeah, it's where the Iranians come from."
  • Re:fascinating (Score:5, Interesting)

    by kebes ( 861706 ) on Tuesday May 31, 2005 @11:24AM (#12684299) Journal
    What will be done about idioms? Translating these word for word often makes no sense at all, and for me at least (no idea what the official stance is), I'd rather they substitute in idioms with the same general meaning, but for the culture being translated to.

    I think this is precisely where statistical approaches can really shine. A purely dictionary-based conversion will translate an idiom word-for-word, which will make no sense at all. However, a statistical approach could be constructed to look for the "longest reliable match." So if the idiom "cat got your tongue" re-appears over and over, and is correlated to a different idiom in other languages (that may not use the word "cat"!), then the algorithm could tokenize "cat got your tongue" as a single entry that would map to something different in each language.

    How many of the world's existing languages have enough text for this to even be feasible?

    You're right... that's the killer. Translating using statistics (especially idioms) properly will require a huge database of samples. Even what's been suggested so far is not enough. If we want to translate technical documents, we need a new database. If we want to translate "free form writing" we need yet more data.

    However, there's lots of data out there (already in digital format) that could be used... we just need people to see the potential and start using these datasets (or making these datasets available). For instance, for technical stuff there are thousands of abstracts for papers and for theses that are translated into various languages (for instance, many articles published in German are then also released in English... I live in Quebec, and every thesis abstract has to be translated into French also... etc.). Many legal documents (many of which are already available to the public) are also translated for various reasons. It would also be interesting if translators all around the world uploaded documents they had translated into some database (assuming it's nothing sensitive of course!). As this database grew, it would become more and more reliable. Let's face it, there's tons of human-based translation going on, forming a massive dataset... but by and large it's just scattered and not usable.
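The "longest reliable match" idea described in the comment above can be sketched as a greedy lookup in a phrase table. This is a minimal illustration, not the system from the article: the `translate_longest_match` function, the `max_len` cutoff, and the hand-written toy table are all hypothetical, and a real table would be learned from a parallel corpus rather than typed in.

```python
def translate_longest_match(tokens, phrase_table, max_len=5):
    """Greedily replace the longest known phrase starting at each position."""
    out = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then progressively shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = tuple(tokens[i:i + n])
            if phrase in phrase_table:
                out.append(phrase_table[phrase])
                i += n
                break
        else:
            # No entry at all for this token: pass it through untranslated.
            out.append(tokens[i])
            i += 1
    return out


# Hypothetical hand-written entries, for illustration only.
toy_table = {
    ("cat", "got", "your", "tongue"): "tu as perdu ta langue",
    ("cat",): "chat",
    ("tongue",): "langue",
}

print(translate_longest_match("cat got your tongue".split(), toy_table))
print(translate_longest_match("the cat".split(), toy_table))
```

Because the four-word idiom is tried before the single word "cat", the idiom maps to a target phrase that never mentions a cat, while "cat" on its own still falls back to the word-level entry.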
  • Re:fascinating (Score:2, Interesting)

    by Temposs ( 787432 ) <(moc.liamg) (ta) (ssopmet)> on Tuesday May 31, 2005 @11:27AM (#12684326) Homepage
    Computational Linguistics is my field, so I can tell you that the problem with the current state of corpora is a lack of massive cross-language corpora over many languages.

    The two sources used by Google are basically the only sources available for the kind of task we're talking about. Obviously the thing to do is work on creating more cross-language corpora, and I'm sure this is being done, but it takes much time to create a cross-language corpus on the scale that the UN documents or translations of the Bible have.
  • Re:fascinating (Score:2, Interesting)

    by Harinezumi ( 603874 ) on Tuesday May 31, 2005 @11:48AM (#12684533)
    My guess is that the statistical analysis happens not just on the word level but on the sentence level. This means that the system would handle idioms almost perfectly when there are corresponding idioms in the target language, and adequately even when there aren't any (since the hard work of coming up with standard translations for those has already been done by several generations of UN translators). There should be very high correlations between the occurrence of "God helps those who help themselves" in English and "berezhonogo Bog berezhot" in Russian, for example.

    I'd be more worried about homonyms, especially ones that are used in similar contexts. I wonder if it will be able to handle sentences like "I turn left here, right?", which manage to confuse even humans at times.
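The kind of correlation guessed at in the comment above can be illustrated with a simple co-occurrence score over aligned sentence pairs. This sketch works on single words for brevity and uses the Dice coefficient as the correlation measure; both choices, along with the function names and the three-sentence toy corpus, are assumptions made for illustration and are not taken from the article.

```python
from collections import Counter
from itertools import product


def dice_scores(sentence_pairs):
    """Score every (source word, target word) pair by how often they co-occur.

    Dice coefficient: 2 * joint count / (source count + target count),
    where counts are the number of sentence pairs containing the word(s).
    """
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for src, tgt in sentence_pairs:
        src_words, tgt_words = set(src.split()), set(tgt.split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        pair_count.update(product(src_words, tgt_words))
    return {
        (s, t): 2 * c / (src_count[s] + tgt_count[t])
        for (s, t), c in pair_count.items()
    }


def best_translation(word, scores):
    """Return the target word most strongly correlated with `word`, or None."""
    candidates = {t: v for (s, t), v in scores.items() if s == word}
    return max(candidates, key=candidates.get) if candidates else None


# Hypothetical toy parallel corpus, for illustration only.
toy_corpus = [
    ("the cat sleeps", "le chat dort"),
    ("the dog sleeps", "le chien dort"),
    ("the cat eats", "le chat mange"),
]

scores = dice_scores(toy_corpus)
print(best_translation("cat", scores))
print(best_translation("dog", scores))
```

The same counting could in principle be applied to multi-word phrases or whole sentences, which is where a corpus on the scale of the UN documents would matter; it does nothing by itself for the homonym problem raised above.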

  • by kesuki ( 321456 ) on Tuesday May 31, 2005 @11:51AM (#12684568) Journal
    You assume DVD subtitles are in any way related to the original audio content at all. That is a pretty big assumption. As a matter of fact, subtitles are very rarely a solid translation of the words' meaning... usually they're an approximation of 'what fit in the subtitled language.' Sometimes, they're completely ad-libbed. Fansubs aren't much better, since many of them are being translated by people just learning how to translate.

    I watch a lot of anime and a lot of fansubs; subtitles are the worst way to learn a language.
  • Re:fascinating (Score:3, Interesting)

    by bogado ( 25959 ) <bogado&bogado,net> on Tuesday May 31, 2005 @12:03PM (#12684684) Homepage Journal
    Laws apart (you could use public material like Project Gutenberg), I think that a translated book is, or at least seems like, a bad input for this, since the text says it expects whole sentences translated one to one.

    A novel or book is not translated like this; the best translations aren't word for word or sentence for sentence. Good translators almost rewrite the whole thing, sometimes in a different style.

    Language has a lot of cultural meaning in it, and even the same language sometimes needs to be adapted to mean the same thing (and I am not saying anything about accent). Computers will hardly get to this point; I would expect from this a good 'well, at least I got the point' translation.
  • by fuck nwbvt ( 836920 ) on Tuesday May 31, 2005 @12:16PM (#12684812) Journal
    If the aim (ultimately) is to help you understand things from other languages better, then what's the problem with changing pop culture references? Someone talking in British English about Kylie Minogue's lovely bum, for example, could probably be replaced in American English with a phrase about Shakira's boobs. Which is good, because no one in the States (in my experience) knows who Kylie is, and the translation gets the concepts right. That can only be a good thing, right?
  • Re:except, no. (Score:3, Interesting)

    by rreyelts ( 470154 ) on Tuesday May 31, 2005 @12:31PM (#12684953) Homepage
    Babies are able to grasp very quickly that words apply to categories of things

    This is so true. I remember being utterly amazed when my toddler was able to immediately spot a bird in real life based off a cartoonish caricature in one of his children's books. It just flabbergasts me how a mind so young can perform recognition that we can't achieve with a beowulf cluster of supercomputers.

  • A better solution (Score:2, Interesting)

    by mattlandau ( 162821 ) on Tuesday May 31, 2005 @04:46PM (#12687632)
    There is an arguably better solution, which is to agree on a common writing system (note that adopting a common writing system is more feasible than adopting a common language, as one need not learn any phonology). Fifty years ago, a man by the name of Charles K. Bliss developed a system that he hoped would, in the future, become universally adopted. His invention was dubbed Blissymbolics. It is currently used in the field of augmentative and assistive communication, where it gives language to those who would, due to handicap, be unable to communicate with any fluency.

    The basic idea behind Blissymbolics is to use mostly indexical ideographs - that is to say, e.g., the symbol for man looks somewhat like a stick figure man. There are some pure symbols, however, though they are somewhat conventional - for instance, a heart-shaped symbol represents emotion. However, it is not limited to concrete meanings, and, though I doubt it could be proved, I believe it has the same capability for expression as any other writing system, including English writing, due to its compositionality. Couple that with the fact that it can be learned quite easily, and one might begin to see that yes, this is a better solution. I am dedicated to this ideal, so if you get a chance, check out http://www.activebliss.com/ [activebliss.com] for more information about the ideal of universal communication.
    Cheers,
    Matt Landau
  • Re:fascinating (Score:3, Interesting)

    by MoralHazard ( 447833 ) on Tuesday May 31, 2005 @04:48PM (#12687652)
    The only problem with that is persuading the copyright holders to permit their use in training computer translation systems.

    As long as the translations have been created in advance, and you can obtain copies of the works in question, it should be fine, legally. I cannot see a way that a court could find the machine-state of a translation machine to be a "derived work" in the copyright sense, and it's certainly not making any literal copies.

    Now, someone could distribute a text under a license agreement that forbade this type of usage, but a court decision may well find that it's a protected "fair" use. And I can't think of many texts that have license agreements that would restrict something like that.
