IBM Strives For 'Superhuman' Speech Tech
robyn217 writes "IBM unveiled new speech recognition technology today that can comprehend the nuances of spoken English, translate it on the fly, and even create on-the-fly subtitles for foreign-language television programs. One of the projects perpetually monitors Arabic television stations, dynamically transcribing and translating any words spoken into English subtitles. Videos can then be viewed via a web browser, with all transcriptions indexed and searchable."
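As a rough sketch of the pipeline the summary describes (the transcribe() and translate() functions below are placeholders for illustration, not IBM's actual API):

    # Rough sketch only: transcribe a foreign-language audio stream, translate it
    # to English, and keep the result indexed and searchable.
    import time

    def transcribe(audio_chunk):
        """Placeholder for a speech-to-text (LVCSR) engine."""
        raise NotImplementedError

    def translate(text, source="ar", target="en"):
        """Placeholder for a machine-translation step."""
        raise NotImplementedError

    subtitle_index = []  # (timestamp, original text, English text)

    def process_stream(audio_chunks):
        """Turn incoming audio into English subtitles, indexing as we go."""
        for chunk in audio_chunks:
            original = transcribe(chunk)
            english = translate(original)
            subtitle_index.append((time.time(), original, english))
            yield english  # shown as an on-the-fly subtitle

    def search(query):
        """Return every indexed subtitle whose English text mentions the query."""
        return [entry for entry in subtitle_index if query.lower() in entry[2].lower()]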
Which ... (Score:4, Interesting)
Opensource? (Score:1, Interesting)
Re:Coherency? (Score:4, Interesting)
Since even "live" broadcasts are usually delayed several minutes for technical and legal reasons anyway, if this technology can get to the point where you're just one or two sentences behind real life, it will be effectively real-time for almost all practical purposes.
IBM and Google cooperation to come? (Score:3, Interesting)
Re:Foreign languages are complex... (Score:3, Interesting)
One of the characters is shouting up to someone in their bedroom window. They don't respond to the shouting and the character says "He obviously can't hear me because of his triple glazing".
This is a sarcastic comment about the house owner's supposed wealth, but in Japanese it was translated as:
"He has thick windows"
Perhaps in this case there was no easy way to translate it, but I suspect films are translated in one pass, with no time to understand the context of each sentence spoken, so it's left to literal translation only.
This won't make speech recognition mainstream (Score:4, Interesting)
Ben Shneiderman is the person who, in my opinion, best articulates the limits of speech recognition [umd.edu].
One of my favorite phrases to explain this issue is: "You don't want to speak to a computer, because you can't speak and think at the same time." More precisely, producing speech uses some of the same modules in our brain that are required for planning. Hence, you can't plan what to do next as well while you speak, which is a big hurdle in the kind of intellectual activity one carries out with a computer.
American or English? (Score:3, Interesting)
I realize that Americans and British (English at least ;o)) speak essentially the same language, but I have yet to find any speech recognition software that can get more than roughly 85% of what I say correct. I have a fairly soft, neutral English accent with pretty good enunciation, so I would have expected to be getting a recognition rate in the high 90%s. I'm wondering if, as most of this software is developed in the US, it is tuned specifically to pick up on English with a US accent? I realize that you train the software for your voice, but AIUI all you are doing is tuning a basic speech model. Has anyone else had this problem or is it just me?
Re:Coherency? (Score:3, Interesting)
You are going to have that problem whether it's a machine doing the translating or a human. As I understand it, interpreters of German get around this by some quick-thinking restructuring of the translated sentence, or they simply lag a half-sentence or so behind.
The real problem for machine translation is, and always has been, determining the sense of a word from context (indeed I recall a recent Slashdot article about some guy who suggests this is the separating factor between computers and animal intelligence). Most languages have a great many homonyms whose meaning a listener can determine only from the surrounding context and, often, general background knowledge of the language or topic at hand.
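To make that concrete, here is a toy (entirely made-up) example of picking a homonym's sense by counting cue-word overlaps with the surrounding context; real systems use far richer statistics, but the principle is the same:

    # Toy word-sense disambiguation: score each candidate sense of a homonym by
    # how many of its cue words appear in the surrounding context.
    SENSES = {
        "bank (finance)": {"money", "loan", "deposit", "account"},
        "bank (river)":   {"water", "shore", "fishing", "boat"},
    }

    def disambiguate(context_words):
        context = {w.lower().strip(".,") for w in context_words}
        # pick the sense sharing the most cue words with the context
        return max(SENSES, key=lambda sense: len(SENSES[sense] & context))

    print(disambiguate("he sat on the bank fishing in the shallow water".split()))
    # -> bank (river)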
Re:Which ... (Score:5, Interesting)
Sometimes you need rather a large context to disambiguate: is this sentence part of a discussion on shore-front management, or spoken language understanding?
Not _that_ amazing (Score:2, Interesting)
The translation, on the other hand, sounds damned impressive. For unrestricted content, especially with an untrained voice (I imagine that IBM isn't individually training to each Al Jazeera talking head), 70% recognition sounds quite good. 70% accuracy post-translation ought to be quite a bit better than what's currently out there. The description of MASTOR, however, is useless -- it could easily describe anything that isn't word-for-word translation.
Re:Foreign languages are complex... (Score:3, Interesting)
I almost added "I just hope GWB doesn't decide to fire all his intel linguists based on this post," but it seemed kind of like bashing the Prez and I would never do that...
Cheers
funny this subject should come up... (Score:2, Interesting)
The training process definitely has its ups and downs. The more you work with it, however, the more it becomes attuned to your own speech patterns and, moreover, to the quirky words we use every day. If you can get past the first two or three hours, you'll see that it is totally worth the effort, especially if this IBM tech isn't available to end users for some time. There is also an aspect of the software training you while you train the software. At present, I can dictate only slightly slower than I can type.
In the end, I can see how this would make writing e-mails and other such time-consuming tasks, which involve spellchecking, grammar, and other proofreading, significantly quicker. When you really hit your stride, it's easy to write at the speed of thought, which is really appealing. There are caveats, however. it's very easy to dictate several sentences worth of tax and taken for granted that it to everything down the way you attendedselect tax select select tax undo
Real-time eavesdropping (Score:2, Interesting)
Monitor all conversation.
Apply real-time text filters.
Assign live agents to priority eavesdropping.
Profit!
If you could apply a filter to listen in to any call what would it be?
Let's see it translate poems (Score:3, Interesting)
S-to-T in hospitals (Score:2, Interesting)
This of course worries secretaries, since they might eventually lose their job/"career". On the other hand, it would improve efficiency *a lot*.
Re:Let's see it translate poems (Score:4, Interesting)
Re:Which ... (Score:2, Interesting)
breakdown of the article (Score:1, Interesting)
1. IBM has updated their ViaVoice large vocabulary continuous speech recognition (LVCSR) engine.
2. IBM has paired ViaVoice with some clever apps to use the ViaVoice output in interesting ways (e.g. "on the fly" recognition, translation).
Things that are not obvious from the article:
1. ViaVoice has been around for ages and has always been pretty darn good at LVCSR. Without seeing numbers and knowing exactly how they were measured, it's impossible to know how much of an improvement 4.4 is over previous versions.
2. Speaker-dependent speech recognition can always achieve much higher accuracy rates than speaker-independent systems like ViaVoice. Dragon NaturallySpeaking is an example of speaker-dependent speech recognition.
3. Limited grammatical contexts (i.e. language models with low perplexity) always give better recognition than when you don't know what to expect next. For example, when your phone only has to tell "home" and "wife" apart, it's a lot less likely to make a mistake than if it has to figure out which word out of a list of 50,000 you just said. The more context, the better. The most interesting tech in the article seems to be the algorithms "that can determine this context on the fly." (There's a toy illustration of the perplexity point after this list.)
4. No improvements in translation technology were noted in the article; it sounds like they might as well have fed ViaVoice through BabelFish, made it happen in real time, and slapped a UI on it. The app might be new, but the tech is not.
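On point 3, here's the promised toy calculation; perplexity is roughly the effective number of choices the recognizer faces per word (the numbers are purely illustrative and have nothing to do with ViaVoice):

    import math

    def perplexity(probabilities):
        """2 ** (average negative log2 probability) over the words predicted."""
        avg_neg_log2 = -sum(math.log2(p) for p in probabilities) / len(probabilities)
        return 2 ** avg_neg_log2

    # A voice-dialing grammar that only ever expects "home" or "wife":
    print(perplexity([0.5, 0.5]))        # 2.0 -- two effective choices per word

    # Unconstrained dictation over a 50,000-word vocabulary, uniform model:
    print(perplexity([1 / 50000] * 5))   # ~50000 -- vastly more room for error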
Re:Which ... (Score:1, Interesting)
Computer: In my probability-based language model, "*empty* Which witch" occurs more often than "*empty* witch witch", "*empty* witch which", or "*empty* which which". Therefore I will assume "which witch".
My point is that computers, like human beings (shockingly enough), use contextual information. Assuming they don't is assuming dumb programmers, low computer resources, or not much interest in the problem. All 3 assumptions are wrong (given a relative definition of 'low').
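For the curious, a toy version of that contextual scoring, with made-up bigram counts (a real language model would use smoothed probabilities estimated from a huge corpus):

    # Made-up bigram counts; <s> marks the start of the sentence.
    from itertools import product

    BIGRAM_COUNTS = {
        ("<s>", "which"): 90, ("<s>", "witch"): 5,
        ("which", "witch"): 30, ("which", "which"): 1,
        ("witch", "witch"): 2,  ("witch", "which"): 3,
    }

    def score(words):
        """Product-of-counts stand-in for a bigram language-model probability."""
        total = 1
        for prev, cur in zip(["<s>"] + words, words):
            total *= BIGRAM_COUNTS.get((prev, cur), 1)
        return total

    candidates = [list(c) for c in product(["which", "witch"], repeat=2)]
    print(max(candidates, key=score))  # -> ['which', 'witch']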