Rest In Peas — the Death of Speech Recognition
An anonymous reader writes "Speech recognition accuracy flatlined years ago. It works great for small vocabularies on your cell phone, but basically, computers still can't understand language. Prospects for AI are dimmed, and we seem to need AI for computers to make progress in this area. Time to rewrite the story of the future. From the article: 'The language universe is large, Google's trillion words a mere scrawl on its surface. One estimate puts the number of possible sentences at 10^570. Through constant talking and writing, more of the possibilities of language enter into our possession. But plenty of unanticipated combinations remain, which force speech recognizers into risky guesses. Even where data are lush, picking what's most likely can be a mistake because meaning often pools in a key word or two. Recognition systems, by going with the "best" bet, are prone to interpret the meaning-rich terms as more common but similar-sounding words, draining sense from the sentence.'"
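The "best bet" failure mode the summary describes is easy to illustrate. Here is a toy sketch (all frequencies and acoustic scores are invented for illustration) of why a decoder that weighs candidates by corpus frequency will swap a rare, meaning-rich word for a common sound-alike:

```python
# Toy illustration (hypothetical numbers): a recognizer scoring two
# acoustically similar candidates with a unigram language model will
# pick the commoner word, even when the rarer one carries the meaning.

# Hypothetical corpus frequencies (per million words).
unigram_freq = {"peace": 40.0, "peas": 3.0}

def best_bet(candidates, acoustic_scores):
    """Combine acoustic likelihood with language-model frequency
    and return the highest-scoring candidate."""
    return max(candidates,
               key=lambda w: acoustic_scores[w] * unigram_freq[w])

# "peas" actually fits the audio slightly better here, but the
# language model drags the decision toward the common word.
acoustic = {"peace": 0.45, "peas": 0.55}
print(best_bet(["peace", "peas"], acoustic))  # -> peace
```

Hence "rest in peas": the meaning-rich term loses to the statistically safer guess.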
Key words (Score:3, Interesting)
> meaning often pools in a key word or two
It's true.
My own hearing is not great. I often miss just a word or two in a sentence. But they are often key words, and missing them leaves the sentence meaningless. If I counted the words I understand correctly I'd probably have a 95% success rate. But if I counted the sentences I understand correctly, I'd be around 80%. So I get by, but I tend to annoy people when I ask for repeats over one missed word.
Windows 7 (Score:3, Interesting)
I've been using VR in Win7 for a few weeks now. I can honestly say that after a few trainings, I'm near 100% accuracy. Which is 15% better than my typing!
No it doesn't (Score:3, Interesting)
It works great for small vocabularies on your cell phone
No. It doesn't.
It works great for small vocabularies on your cell phone if you happen to live in the same neighbourhood as the developer, where "everyone talks this way". For the rest of the world, attempting to talk with a nasal American twang in order to get the phone to understand you is shit.
Re:Mod parent up (Score:5, Interesting)
There's nothing special about computers, though; people have to do that with other people. Let's not kid ourselves into thinking that humans are immune to misunderstandings. No, the more you get to know someone, the way they think and express themselves, the better you can become at communicating with them. Different words have different connotations to different people. It can take a lot of work to get all these down, and it'd be no different with a computer. For effective communication, you'd train and build up a common language with it, one that might seem nonsense to outsiders... and I, for one, welcome this.
medical dictation - no go (Score:1, Interesting)
I never understood it, but since I was not the radiologist, I didn't care either. I mostly was entertained by listening to them repeat the same stupid, simple word over and over trying to get the dictation system to behave, when it would have taken a fraction of the time to manually edit the document with a keyboard.
Re:Sorry what? (Score:2, Interesting)
That doesn't always work. When accents and command of a language are poor enough, you only get a few chances to ask, "Sorry, what did you say?" After asking three times, you either look like an asshole and/or give up and spend the next few minutes nodding and smiling while trying to parse what they said, hoping you get it right.
Which is why we need good speech-recognition and translation software. It's easy to infer the meaning of "come to me give the diagram" because there are at least intelligible words to work with. And no, I'm not being racist -- the situation applies to all cultures and languages.
Re:Buffalo buffalo (Score:1, Interesting)
Has anyone really been far even as decided to use even go want to do look more like?
Wrong problem (Score:2, Interesting)
Right now the problem being solved is audio->text. This is the wrong problem, and why the results are so lame. The real problem is audio+context->text+new context. This takes some pretty intelligent computing and not the same old probabilistic approaches.
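A minimal sketch of the audio+context->text+new context loop the parent proposes (all names and data here are hypothetical; a real system would rescore n-best hypothesis lists coming out of an acoustic model):

```python
# Sketch: each utterance is decoded against the current topic context,
# and the chosen transcript then updates that context for the next one.

def rescore(hypotheses, context):
    """Prefer hypotheses that share vocabulary with the running context."""
    return max(hypotheses, key=lambda hyp: len(set(hyp.split()) & context))

def transcribe(hypothesis_stream, context=None):
    context = set(context or [])
    transcript = []
    for hypotheses in hypothesis_stream:  # e.g. n-best lists per utterance
        best = rescore(hypotheses, context)
        transcript.append(best)
        context |= set(best.split())      # the "new context" for next time
    return transcript

# With "funeral" seeded into context, the context-aware pass can prefer
# the otherwise improbable reading of the second utterance.
stream = [["we held a funeral", "we held a few null"],
          ["staying a wake", "staying awake"]]
print(transcribe(stream, context={"funeral"}))
```

Set intersection is a crude stand-in for a context model, but it shows the shape of the problem: the output of one utterance has to feed the decoding of the next.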
Totally Not Dead Yet (Score:5, Interesting)
A few years back I worked for an awesome company that did IVR (interactive voice response) systems.
We had voice-driven interactive systems that would give the caller a variety of different mental health tests (we worked a lot on identifying depression, early-onset dementia, Alzheimer's, and other cognitive issues).
The voice recognition wasn't perfect, but we had a review system built around a "gold standard". I wrote a tool that let a human being identify and label individual words. Then we would run a number of different voice recognition systems against the same audio chunk and compare their output to the human version. It effectively allowed us to unit-test our changes to the voice recognition software.
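The comparison such a tool enables is essentially word error rate. The in-house tool itself isn't described in detail, so this is just the textbook approach: word-level Levenshtein distance between the human "gold standard" transcript and a recognizer's output.

```python
# Word error rate: minimum edits (substitutions, insertions, deletions)
# to turn the hypothesis into the reference, divided by reference length.

def word_error_rate(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)

# ~0.33: two substitutions against a six-word reference.
print(word_error_rate("one a two b three c", "one a too b three sea"))
```

Scoring every engine against the same labeled audio this way is what makes regression-testing recognition changes possible.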
Dialing in a voice recognition system is an amazing process. The number of properties, dictionaries, scripts, and sentence-forming engines involved is mind-blowing.
Two of the hardest tests for our system were things like: Count from 1 to 20 alternating between numbers and letters as fast as you can, for example 1-A-2-B-3-C. And list every animal you can think of.
The 1-A-2-B was killer because when people speak quickly, their words merge. You literally start creating the sound of the A while the end of the 1 is still coming. It makes it extremely difficult to identify word breaks and actual words. And if you dial in a system specifically to parse that, you'll wind up with issues parsing slower sentences.
The all-animals question had a similar issue: people would slur their words together, and the dictionary was huge. It was even more challenging when one of the studies went nationwide. We had to deal with phonetic spellings across accents from the Northeast coast to the Southern states. What was even worse was that there were no sentences, so we couldn't count on predictive dictionary work to identify the most likely word among those matching the phonetics.
That said, getting voice recognition to work on pre-scripted commands and sentences was pretty easy.
And I can only imagine the process has improved in the years since. Although we were looking into SMS-based options -- not out of a dislike of IVR, but because our usage studies with children showed most of them were skipping the voice system and using the keypad anyway. So why bother with IVR if the study's target demographic was the youth?
-Rick
Re:Android Speech Recognition Rules (Score:3, Interesting)
I hardly type anything into my HTC Incredible. Google's voice recognition, which is enabled on every textbox, works just about perfectly.
Seriously, get an Android phone, try out the speech recognition text entry, and then tell me speech recognition is dead.
I've tried Google voice recognition, but I found that it just detected gibberish unless I spoke with a fake American accent.
Re:IBM? (Score:5, Interesting)
IBM closed many of their speech research offices 1-2 years ago and transferred most of the research/data to Nuance's Dragon NaturallySpeaking research.
Full Disclosure: I work for Nuance
Rest in Peas (Score:3, Interesting)
I know it's just an imaginary example of how bad speech-to-text is... but it is realistic and disappointing.
Even an idiot like me knows what a Markov chain [wikipedia.org] is. Perhaps the standard voice apps are so entrenched that they're not recoding to take advantage of the huge leaps in memory capacity since they first started selling.
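For readers who haven't met one: a Markov chain over words just predicts the next word from counts of what followed it before, and modern memory makes even huge count tables cheap. A minimal bigram sketch (toy corpus, purely illustrative):

```python
from collections import defaultdict

def train_bigrams(corpus):
    """Count, for each word, how often each following word occurs."""
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in corpus:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def most_likely_next(counts, word):
    """Return the most frequent successor of `word`, or None."""
    followers = counts[word]
    return max(followers, key=followers.get) if followers else None

corpus = ["rest in peace", "rest in peace", "rest in peas"]
model = train_bigrams(corpus)
print(most_likely_next(model, "in"))  # -> peace
```

Which also shows the summary's point from the other side: the chain will happily output "peace" even when the speaker said "peas".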
Re:Mod parent up (Score:5, Interesting)
Neural networks are part of the story, but many of the ideas from ANNs have been improved upon where more structured settings are available. There is actually a resurgence in deep neural networks right now, though.
Re:Well duh. (Score:5, Interesting)
Or have an ounce of poetry in you... ;)
Hmm... I guess I don't have that since I don't know what it is. That's okay, I can find out with the help of my AI using the latest in voice recognition software! Computer, what is "poetry"?
Computer: "Poetry" is a form of literary art, frequently using an organized metric and rhyme scheme, that attempts to evoke an emotional response in the reader through the use of metaphor.
Huh, okay, that's interesting. But computer, what is a metaphor?
Computer: A "meta" is for people who lack the capabilities to contribute directly to a field or endeavor, but who still wish to sound educated and useful by discussing the nature of the field or endeavor itself. Example: "Physics has way too much math for me, but meta-physics is right up my alley!"
Yeah, now I'm just confused.
Re:Tea, Earl Grey, Hot (Score:3, Interesting)
By definition, the second phrase eliminates any remaining use cases after the countdown finishes.
Re:Mod parent up (Score:2, Interesting)
The main difference right now between human speech recognition and computer speech recognition is how the results are handled.
If I said "I had a hard time staying a wake", both a person and a computer would misunderstand and think I said "I had a hard time staying awake." However, if it was in the middle of a discussion of funeral ceremonies I had conducted, both a person and a computer could figure out what I really meant.
A person would likely hedge their response though, as either of the two meanings would be possible -- they'd probably respond with laughter and judge my physical reaction to that to identify which sentence I meant -- or, they might make some leading comment that forced me to add context.
Computers, however, are expected to "know the answer" with no further cues, and as such are designed to "best guess" between the two options. They're probably better at this than a person would be in the same situation -- especially if the person didn't know what the verb "to stay" actually meant, or what a "wake" was. People give many context cues based on tone and non-verbal interaction that a computer is just never designed to pick up on. Add the fact that tonal cues are extremely tribal, and the complexity balloons.
For artificial verbal content recognition to really take off, the computer needs to be trained not only on words, grammars, and other parts of speech and lexical context, but also on tribal uses of tone -- most people have a pretty good grasp of the tones used in their own "tribe", and can identify many of the neighbouring "tribes" to the extent that they know what other cues to look for to complete the context. If a computer was trained in all the major inflections for all systems of language in the world, it would likely be better than most humans in a random sampling of sentences.
A Chinese ESL individual who learned English in Alabama would have an extremely difficult time understanding what someone from Newfoundland or the Hebrides was trying to say to them -- but a computer properly trained should be able to translate between the two with no difficulty.
Re:What are you talk'in about ? (Score:3, Interesting)
Only you talk like you. There is no archive of speech large enough to encompass every speaker of a language except one that has a record of each and every speaker, and it still doesn't solve the teaching problem. The shotgun approach is problematic in many ways, most of all the size of the database; you'd still wind up teaching the speech platform to figure out which accent you're using, because if you ask most people, they don't have any accents at all.
Actually, I think the solution would be to make personal datasets portable, to standards, so when you go from one device to another, all you need to do is plug in your own dataset (or access it from the network) et voila, instant voice recognition wherever you go by systems designed to use that dataset standard. Sort of like an ODF for speech datasets. This way it's distributed, you don't have a humongously unwieldy database to manage, and it's personalized.
But that requires standards which don't yet exist, because every speech recognition platform reinvents the wheel every single time.
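A sketch of what such a portable dataset could look like. The format, field names, and version string here are entirely hypothetical, since no such standard exists -- the point is only that a speaker's adaptation data could be serialized to something engine-neutral:

```python
import json

# Hypothetical portable speaker profile: personal pronunciations plus
# whatever adaptation parameters an engine supporting the format needs.
profile = {
    "format": "speech-profile/0.1",              # made-up standard identifier
    "locale": "en-CA",
    "lexicon": {"tomato": ["t ah m ey t ow"]},   # personal pronunciations
    "adaptation": {"warp_factor": 1.03},         # e.g. a vocal-tract warp
}

blob = json.dumps(profile)    # carry this blob between devices or the network
restored = json.loads(blob)
print(restored["locale"])     # -> en-CA
```

The engine-specific part -- actually consuming those parameters -- is exactly what a standard would have to pin down, ODF-style.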
--
BMO
Re:Android Speech Recognition Rules (Score:2, Interesting)
It's obviously tuned for that, but it wouldn't be fair to ask it to understand Scottish, now, would it? ;) Seriously, though, in my expat group we had at least ten English-speaking countries represented, and I had little trouble in most cases. There was still one New Zealander whom I never understood, even after a year; I generally gave up asking him to repeat himself after the third time in a row and just tried to fake it. I'd get maybe 10-20% of any sentence from him.
Speech recognition of random accents, even in one language, is virtually impossible. I think the computer needs to be given a clue about the accent of the speaker.
Re:Buffalo buffalo (Score:3, Interesting)
If only speech recognition's problems were limited to these low-probability sentences. I've had a number of SR systems fail to recognize my "yes" and "no" responses.
Re:Buffalo buffalo (Score:3, Interesting)
I'd never heard this one before, guess it's the American version! The one I was taught was a complaint by a pub landlord to their sign writer:
You've left too much space between pig and and and and and whistle